Our first attempt to reduce memory power dissipation is a direct
implementation of the PAVM design described in Section 3
within the Linux operating system. We extend the task structure to
include the counters needed to keep track of the active node set of
each process. As soon as the next-to-run process is
determined in the scheduling function, but before switching contexts,
the nodes in the active set of that process are transitioned to Standby mode,
and the rest are transitioned to Nap mode. This way, power is reduced,
the resynchronization time is masked by the context switch, and the
process does not experience any performance loss.
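A minimal sketch of this scheduler hook, invoked just before the context switch, is shown below; the helper names node_in_active_set() and rdram_set_power_state(), the power-state constants, and nr_memory_nodes are illustrative assumptions made for exposition, not the actual interfaces of our implementation.

```c
/* Illustrative sketch: transition the next process's nodes to Standby
 * and all other nodes to Nap before switching contexts.  All helper
 * names and constants here are hypothetical. */
static void pavm_prepare_nodes(struct task_struct *next)
{
    int node;

    for (node = 0; node < nr_memory_nodes; node++) {
        if (node_in_active_set(next, node))
            rdram_set_power_state(node, RDRAM_STANDBY); /* fast resynchronization */
        else
            rdram_set_power_state(node, RDRAM_NAP);     /* deeper power saving */
    }
}
```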
We also modify page allocation to use the preferred node set of each
process, to reduce the size of the active sets.
Linux relies on the buddy system [20] to handle the underlying
physical page allocations. Like most other page allocators, it treats
all memory equally, and is only
responsible for returning a free page if one is available, so the physical
location of the returned page is generally nondeterministic.
For our purposes, however, the physical location of the returned page is
critical not only to the performance but also to the energy footprint of the
requesting process. Instead of adding more complexity to an
already-complicated buddy system, a NUMA management layer is placed
between the buddy system and the VM, to handle the preferential
treatment of nodes.
The NUMA management layer logically partitions all physical memory into multiple nodes and manages memory at a node granularity. The Linux kernel already has some node-specific data structures defined to accommodate architectures with NUMA support. To make the NUMA layer aware of the nodes in the system, we populate these structures with the node geometry, which includes the number of nodes in the system as well as the size of each node. As this information is needed before the physical page allocator (i.e., the buddy system) is instantiated, determining the node geometry is one of the first things we do at system initialization. On almost all architectures, node geometry can be obtained by probing a set of internal registers on the memory controller. On our testbed with 512 MB of RDRAM, we are able to correctly detect the 16 individual nodes, each consisting of a single 256 Mbit device. Node detection for other memory architectures can be done similarly.
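As a rough illustration, the detected geometry can be summarized in a small structure like the one below; the names node_geometry, MAX_MEM_NODES, and the individual fields are hypothetical and stand in for the kernel's own node-specific data structures (e.g., pg_data_t) that we actually populate.

```c
/* Illustrative sketch: a hypothetical summary of the node geometry
 * detected at boot by probing the memory controller's internal
 * registers, before the buddy allocator is instantiated.  The real
 * implementation fills in the kernel's node-specific structures. */
#define MAX_MEM_NODES 16

struct node_geometry {
    int           nr_nodes;                      /* nodes detected (16 on our testbed)  */
    unsigned long node_start_pfn[MAX_MEM_NODES]; /* first page frame of each node       */
    unsigned long node_pages[MAX_MEM_NODES];     /* size of each node in pages          */
};

static struct node_geometry geometry;
```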
Unfortunately, NUMA support for x86 in Linux is not complete. In particular, since the x86 architecture is strictly non-NUMA, some architecture-dependent kernel code was written with the underlying assumption of having only a single node. We remove these hard-coded assumptions and add multi-node support for x86. With this, page allocation is now a two-step process: (i) determine from which node to allocate, and (ii) do the actual allocation within that node. Node selection is implemented trivially by using a hint, passed from the VM layer, indicating the preferred node(s). If no hint is given, the behavior defaults to sequential allocation. The second step is handled simply by instantiating a separate buddy system on each node.
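The two-step allocation path can be sketched as follows, assuming a hypothetical per-node buddy instance node_buddy[] and a hint value of -1 when the VM expresses no preference; the function and helper names are illustrative rather than the kernel's actual interfaces.

```c
/* Illustrative sketch of the two-step allocation: (i) node selection
 * from the VM-supplied hint, (ii) allocation within that node's own
 * buddy system instance.  All identifiers here are hypothetical. */
struct page *numa_alloc_page(int node_hint, unsigned int gfp_mask)
{
    int node;

    /* Step (i): honor the preferred-node hint if it can be satisfied,
     * otherwise fall back to the default sequential allocation. */
    if (node_hint >= 0 && node_has_free_pages(node_hint))
        node = node_hint;
    else
        node = next_node_sequential();

    /* Step (ii): each node runs its own separate buddy system. */
    return buddy_alloc(&node_buddy[node], gfp_mask, 0 /* order-0 page */);
}
```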
With the NUMA layer in place, the VM is modified such that,
with every page allocation request, it passes the preferred node set of the
requesting process down to the NUMA layer as a hint. This ensures that
allocations tend to localize in a minimal number of nodes for each process.
In addition, on all possible execution paths, we ensure that the VM
updates the appropriate counters to accurately bookkeep the active and
preferred node sets of each process with minimal overhead,
as discussed in Section 3.
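A sketch of this bookkeeping on the allocation and free paths is given below; the per-task fields node_page_count[] and active_nodes are assumptions made for illustration, standing in for the counters added to the task structure.

```c
/* Illustrative sketch, using hypothetical per-task fields: a per-node
 * page count from which the active node set is derived.  Invoked on
 * every path that allocates or releases pages for the process. */
static inline void pavm_account_pages(struct task_struct *tsk,
                                      int node, long nr_pages)
{
    tsk->node_page_count[node] += nr_pages;

    if (tsk->node_page_count[node] > 0)
        tsk->active_nodes |=  (1UL << node);   /* node now holds pages */
    else
        tsk->active_nodes &= ~(1UL << node);   /* last page released   */
}
```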