[top |
prev |
next]
Isolation with Flexibility:
A Resource Management Framework for Central Servers
David G. Sullivan,
Margo I. Seltzer
Division of Engineering and Applied Sciences
Harvard University, Cambridge, MA 02138
{sullivan,margo}@eecs.harvard.edu
4. Prototype Implementation
We have implemented the extended framework in
VINO 0.50,
and we use it to manage CPU time, memory, and disk bandwidth. In the
following sections, we describe the key details of our implementation,
including the scheduling and allocation mechanisms that we employ.
4.1 Threads and Currencies
A thread can be thought of as a special type of currency. Like all
currencies, threads hold backing tickets, and they also issue
temporary transfer tickets to threads on which they are waiting (see
Section 4.3). By having VINO's thread class inherit
from its currency class 4, threads can issue and
receive tickets using the same methods as non-thread currencies, and
other methods (e.g., the one that recursively computes a ticket's base
value) can also treat threads and currencies interchangeably. Threads
also reuse currency data members to keep track of the tickets that
they hold and have issued.
Currencies that are also threads are identified by the process id of
the thread. All other currencies are identified by a unique currency
identifier (cid).
4.2 Currency Configuration and Permissions
By default, the system creates one currency for each active user of
the system, and it funds these user currencies equally from the base
currency. Each user's currency in turn funds the tasks of that
user. The user currencies are created by means of a function,
cid_for_client(), that is invoked when a process changes its real user
id (uid); this function uses a uid to cid mapping to determine which
currency should be used to fund the process. If no mapping exists for
a given user, a new currency is created and funded. In either case,
the process' existing funding is revoked and replaced with funding
from the appropriate user currency. Once a user's login shell is
correctly funded, processes forked by that shell are funded by the
same currency. More generally, child processes are funded by the
issuer of the first ticket in the parent's list of backing tickets for
that resource. 5
Each user currency is owned by the corresponding user and has a
currency mode (see Section 2.3) that allows
the user to manipulate it using a set of system calls added for this
purpose. A user's tasks receive equal ticket allocations by default;
the user can modify these allocations and thereby alter the relative
resource rights of the tasks. Users can also create currencies and
fund them with tickets from their user currency. However, they cannot
increase the funding of their user currency itself, because they do
not have permission to issue other currencies' tickets. Thus, each
user's tasks are securely isolated from the tasks of other users, and
each user has the same total resource rights.
Other currency-configuration policies could also be specified. On
extensible operating systems like VINO [Sma98],
superusers could safely download specialized versions of the
cid_for_client function (which is also called when a process' real
group id (gid) changes) to specify arbitrary configuration schemes
based on uid and gid, as well as alternative access-control policies
for currencies. Approaches that do not involve extensibility could
also be employed to accommodate a more limited range of possible
configurations and access controls.
4.3 Managing CPU Time
Our prototype uses the original lottery-scheduling algorithm
[Wal94] to schedule the CPU, randomly choosing
an active ticket and traversing the runnable queue to find the thread
that holds the ticket. In searching for the winning thread, the system
computes the base value of each thread's CPU backing tickets. We cache
these base values to avoid unnecessary computation, although the
cached values must be invalidated whenever a change occurs in a
currency's count of active tickets (e.g., when a thread starts up,
blocks, or exits).
Our prototype also uses two other features of the original
lottery-scheduling framework, compensation tickets and ticket
transfers. Compensation tickets are issued to threads who do not use
their full quantum, temporarily inflating their resource rights to
give them a higher probability of being chosen when they next become
runnable. Ticket transfers occur when a thread blocks attempting to
acquire a kernel mutex or to allocate memory. Because the thread is
itself a currency (see Section 4.1), it issues a
ticket and uses it to fund the thread on which it is waiting (the
holder of the mutex or the pageout daemon), transferring its resource
rights to that thread. This can reduce the time that the waiting
thread spends blocked, and it prevents priority inversion from
occurring. When the thread is made runnable again, its transfer
tickets are revoked.
A number of deterministic algorithms, including stride scheduling
[Wal95] and EEVDF
[Sto96],
can also be used within the lottery-scheduling framework to provide
increased accuracy and lower response-time variability for interactive
processes. Because our work primarily addresses ways of overcoming the
limits that proportional-share frameworks impose on flexible resource
allocation, the algorithms used are not crucial.
4.4 Managing Memory
Effective proportional-share memory management is complicated by the
difficulty of determining which processes are actively competing for
memory and by the undesirability of a strict partitioning of memory
among processes. When scheduling the CPU, threads that are blocked are
simply ignored and the values of their tickets are not counted. Our
data [Sul99b] indicates that a similar approach
to memory management is not effective. When the system experiences
heavy memory pressure, any process that blocks, even momentarily, can
lose a large number of pages to the activity of the pageout daemon,
resulting in erratic paging behavior and poor throughput. The obvious
alternative, namely leaving the memory tickets of all processes active
at all times, is also not viable, because pages belonging to idle
processes tend to remain in memory indefinitely, reducing the number
of pages available to active processes and effectively partitioning
the total memory of the system. We have therefore chosen to give
memory guarantees only to privileged processes that explicitly request
them. Other processes compete for the unreserved portion of memory,
which we ensure comprises at least five percent of the memory not
wired by the kernel. While this approach is limited (e.g., it can
easily lead to thrashing), it allows us to experiment with the
resource trade-offs that applications can make.
Processes running as root can obtain hard memory shares from the base
currency. Once a currency is funded with memory tickets, resource
principals with appropriate permissions can obtain soft or hard
subshares of its allocation. As with any resource, hard shares are
obtained using the reserve() system call and soft shares using the
fund() system call. In the common case of obtaining a hard share from
the base currency, users simply specify the size in kilobytes of the
memory share they are requesting. In other cases, users either specify
a number of memory tickets (for soft shares) or a percentage of the
issuing currency's share (for hard shares).
To maintain guaranteed shares, we altered the behavior of VINO's
pageout daemon so that pages owned by processes that have not exceeded
their memory guarantee are not reclaimed. This approach does not limit
processes to their memory shares, but merely ensures that they can
receive at least that amount. In the absence of memory pressure,
processes receive as much memory as they need. If a process holds soft
memory tickets, the number of pages to which it is entitled can change
dynamically as the value of its tickets changes. The pageout daemon
thus needs to compute the current base value of the processes' memory
tickets; cached values are used whenever possible.
4.5 Managing Disk Bandwidth
Our prototype supports proportional sharing of disk bandwidth using
the hierarchical YFQ algorithm [Bru99b]. YFQ
approximates ideal proportional sharing by maintaining a virtual time
function and per-disk queues of outstanding disk requests for each
resource principal. Each of the queues has an associated finish tag
that reflects the principal's past disk activity, its current share of
the disk, and the length of the request at the head of the
queue. Requests from queues with the smallest finish tags are
forwarded to the device driver in small batches that can be reordered
by the driver or device to achieve better aggregate throughput.
A principal's disk tickets are active whenever it has an outstanding
request. To adjust to dynamic changes in the number of active tickets,
the base value of a principal's disk tickets is recomputed (using
cached values if possible) whenever a request reaches the head of its
queue, and this value is used to compute the queue's new finish
tag.
4.6 Emulating Nice
To support the semantics of nice, we created a utility that runs with
root privileges and executes programs with funding from the base
currency. This utility reduces the funding of the caller's user
currency by the amount requested for the new job, thus preserving
isolation. The utility actually creates a new currency for the task,
funds that currency with the requested number of tickets, and uses the
new currency's tickets to fund the task (see
Fig. 4, bottom).
This level of indirection is needed in case the task spawns any
children; if so, they will share the funding of the new
currency. While this utility would typically be used to give a small
percentage of the CPU to long-running, CPU-intensive jobs, it can be
used with other resources as well. It successfully overcomes the lower
limits imposed by currencies without employing VINO's extensibility
mechanism, as we had originally planned
[Sul99a]. The other broker methods could
also be overridden using similar setuid-root utilities.
4.7 Carrying Out Exchanges
As discussed in
Section 3.2.1,
a number of challenging questions must
be answered before a system that fully supports dynamic ticket
exchanges can be built. At this point, we have implemented a framework
that allows us to easily test the effects of ticket exchanges and
thereby gain insight into the issues involved.
We provide two mechanisms for experimenting with exchanges. First,
users with appropriate permissions can employ our reserve(),
fund(), and unfund() system calls to implement static exchanges,
preset modifications to the default ticket allocations. Second, we
have implemented a simple central dealer in the kernel, and we allow
applications to propose exchanges dynamically using a system call
(exch_offer) added for this purpose. When an exchange is carried out,
the four tickets involved
(see Section 3.2.2) are linked to each other
in a circular queue so that the exchange can be invalidated when one
of the principals exits, loses too much funding, or retracts the
exchange. If one of the tickets is deleted, all four of them are,
thereby cancelling the exchange.
____________________
4.
VINO is written in C++.
5.
Actually, if a process holds tickets from more than one currency, its
children should be funded by all of these currencies. We plan to
extend our implementation to deal with this case.
[top |
prev |
next]