Optimizing Grid Site Manager Performance with Virtual Machines
Ludmila Cherkasova, Diwaker Gupta
Hewlett-Packard Labs
Abstract: Virtualization can enhance the functionality and ease the management of current and future Grids by enabling on-demand creation of services and virtual clusters with customized environments, QoS provisioning and policy-based resource allocation. In this work, we consider the use of virtual machines (VMs) in a data-center environment, where a significant portion of resources from a shared pool are dedicated to Grid job processing. The goal is to improve efficiency while supporting a variety of different workloads. We analyze workload data for the past year from a Tier-2 Resource Center at the RRC Kurchatov Institute (Moscow, Russia). Our analysis reveals that a large fraction of Grid jobs have low CPU utilization, which suggests that using virtual machines to isolate execution of different Grid jobs on the shared hardware might be beneficial for optimizing the data-center resource usage. Our simulation results show that with only half the original infrastructure employing VMs (50 nodes and four VMs per node) we can support 99% of the load processed by the original system (100 nodes). Finally, we describe a prototype implementation of a virtual machine management system for Grid computing.
One of the major challenges in this space is to build robust, flexible, and efficient infrastructures for the Grid. Virtualization is a promising technology that has attracted much interest, particularly in the Grid community [2, 3, 4]. Virtualization can add many desirable features to the functionality of current and future Grids, such as on-demand creation of services and virtual clusters with customized environments, QoS provisioning, and policy-based resource allocation.
This paper focuses on the performance/fault isolation and flexible resource allocation enabled by the use of VMs. VMs enable diverse applications to run in isolated environments on a shared hardware platform and allow resource allocation for different Grid jobs to be controlled dynamically. To the best of our knowledge, this is the first paper to analyze real workload data from a Grid and give empirical evidence for the feasibility of this idea. We show that there is a significant economic and performance incentive to move to a VM-based architecture.
Figure 2: Basic Trace Characterization.
To get a better understanding of CPU utilization, we introduce a new metric called CPU Usage Efficiency (CUE), defined as the ratio of the actual CPU time (ACT) consumed by a job to its wall-clock time (WCT): CUE = ACT/WCT. Figure 3 shows the CDF of CPU usage efficiency per job.
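As a concrete illustration of how this metric can be derived from accounting data, the following Python sketch computes per-job CUE and its empirical CDF. The record layout (pairs of actual CPU seconds and wall-clock seconds) is a hypothetical assumption for illustration, not the actual log format used in the trace analysis.

# Illustrative sketch: per-job CPU Usage Efficiency (CUE) from trace records.
# The (actual CPU seconds, wall-clock seconds) layout is hypothetical and
# stands in for whatever the real accounting logs provide.

def cue(actual_cpu_time: float, wall_clock_time: float) -> float:
    """CUE = ACT / WCT, expressed as a percentage of wall-clock time."""
    if wall_clock_time <= 0:
        return 0.0
    return 100.0 * actual_cpu_time / wall_clock_time

# Example trace records: (actual CPU seconds, wall-clock seconds) per job.
jobs = [(120.0, 3600.0), (3400.0, 3600.0), (10.0, 7200.0)]

efficiencies = sorted(cue(act, wct) for act, wct in jobs)

# Empirical CDF of CUE across jobs, of the kind plotted in Figure 3.
for i, e in enumerate(efficiencies, start=1):
    print(f"CUE <= {e:6.2f}%  for {100.0 * i / len(efficiencies):5.1f}% of jobs")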
Figure 4: General per Virtual Organization Statistics.
Figure 5: Grid Job I/O usage Profiles.
To analyze potential performance benefits of introducing VMs, we use a simulation model with the following features and parameters:
First of all, we evaluated scenarios in which the original system has a decreased number of nodes. As Table 1 shows, when the original system has only 40 nodes, the percentage of rejected jobs reaches a high 20%, while in the scenario where these 40 nodes support 4 VMs per node, the rejection rate is only 2.75% over the entire trace.
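The sketch below illustrates, under simplifying assumptions, the slot-based admission logic behind such rejection-rate numbers: each node contributes a fixed number of VM slots, and a job is rejected when no slot is free at its arrival time. The workload and interface here are hypothetical and do not reproduce the full simulation model, which also accounts for CPU sharing among co-located VMs.

# Simplified admission sketch: a job is accepted if any of the
# (nodes x vms_per_node) slots is free at its arrival time, rejected otherwise.
import heapq

def rejection_rate(jobs, nodes: int, vms_per_node: int) -> float:
    """jobs: list of (arrival_time, duration) tuples, sorted by arrival."""
    slots = nodes * vms_per_node
    busy_until = []           # min-heap of finish times of running jobs
    rejected = 0
    for arrival, duration in jobs:
        # Release all slots whose jobs finished before this arrival.
        while busy_until and busy_until[0] <= arrival:
            heapq.heappop(busy_until)
        if len(busy_until) < slots:
            heapq.heappush(busy_until, arrival + duration)
        else:
            rejected += 1     # no free VM slot: the job is rejected
    return 100.0 * rejected / len(jobs)

# Hypothetical workload: compare 40 plain nodes vs. 40 nodes with 4 VMs each.
workload = [(t * 60.0, 5400.0) for t in range(1000)]
print(rejection_rate(workload, nodes=40, vms_per_node=1))
print(rejection_rate(workload, nodes=40, vms_per_node=4))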
Figure 7: Original work-flow for a Grid job.
The modular architecture and API exposed by Usher made it very easy to implement a “policy daemon” geared towards our use case. We are, in fact, experimenting with several policy daemons varying in sophistication to evaluate the trade-off between performance and overhead.
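To illustrate what such a policy daemon might look like, the sketch below implements a simple least-loaded placement policy. The ClusterState and Node structures are hypothetical stand-ins introduced for this example; they do not reflect Usher's actual API.

# Hypothetical sketch of a placement policy of the kind a "policy daemon"
# could implement on top of a VM manager. The data structures below are
# illustrative assumptions, not Usher's actual API.
from dataclasses import dataclass, field

@dataclass
class Node:
    name: str
    vm_capacity: int          # maximum VMs allowed per physical node
    vms: list = field(default_factory=list)

@dataclass
class ClusterState:
    nodes: list

def place_job(cluster: ClusterState, job_id: str):
    """Greedy policy: start the new job's VM on the least-loaded node
    that still has a free VM slot; return None if the job must wait."""
    candidates = [n for n in cluster.nodes if len(n.vms) < n.vm_capacity]
    if not candidates:
        return None
    target = min(candidates, key=lambda n: len(n.vms))
    target.vms.append(job_id)  # a real daemon would ask the VM manager to boot a VM here
    return target.name

cluster = ClusterState(nodes=[Node("wn01", 4), Node("wn02", 4)])
print(place_job(cluster, "job-42"))   # -> "wn01"

More sophisticated policy daemons can replace the greedy placement step with decisions based on observed per-job CPU usage, at the cost of additional monitoring overhead.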
Figure 8: VM Management System Architecture.