Memory hierarchy performance can be very sensitive to competition for shared resources. For example, the standard configuration of an IBM Regatta node has modules containing two Power4 processors that share a common cache and an interface to main memory. Since many large scientific programs are known to be memory-bandwidth bound, there is also an HPC variant of the hardware that contains only a single processor per module. For bandwidth-limited applications, the second processor adds little or no performance, so eliminating it saves cost while also removing a source of cache interference.
While it would not save the cost of the extra processors, monitoring the miss rate of the shared cache on a standard node would let the system either schedule only one thread per module or identify ``compatible'' threads to co-schedule.
Similar scheduling strategies [9,11] have been proposed for use with
Simultaneous Multi-threading (SMT) [13].
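One way such a miss-rate-driven co-scheduler might work can be sketched as follows. This is an illustrative assumption, not the paper's mechanism: the threshold, the counter values, and the greedy pairing policy (match a cache-hungry thread with a cache-light one) are all hypothetical.

```python
# Hypothetical co-scheduler for a two-processor module with a shared
# cache. All names, counter values, and the threshold are illustrative.

MISS_RATE_THRESHOLD = 0.05  # above this, a thread saturates the shared cache

def miss_rate(cache_misses, cache_refs):
    """Shared-cache miss rate from hardware-counter samples."""
    return cache_misses / cache_refs if cache_refs else 0.0

def pair_threads(samples):
    """Greedily pair "compatible" threads: a cache-hungry thread (high
    miss rate) with a cache-light one.  A thread whose miss rate is high
    even in isolation gets a module to itself.
    `samples` maps thread id -> (cache_misses, cache_refs)."""
    ranked = sorted(samples, key=lambda t: miss_rate(*samples[t]))
    schedule = []
    while ranked:
        light = ranked.pop(0)  # lowest miss rate remaining
        if ranked and miss_rate(*samples[light]) < MISS_RATE_THRESHOLD:
            hungry = ranked.pop()  # highest miss rate remaining
            schedule.append((light, hungry))  # co-schedule on one module
        else:
            schedule.append((light,))  # run alone on the module
    return schedule

sched = pair_threads({
    "A": (900, 1000),  # 90% miss rate: bandwidth-bound
    "B": (10, 1000),   # 1% miss rate: cache-friendly
    "C": (800, 1000),
    "D": (20, 1000),
})
print(sched)  # [('B', 'A'), ('D', 'C')]
```

In this sketch the two cache-friendly threads each absorb one bandwidth-bound partner, approximating the effect of the single-processor HPC module without discarding the second processor.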
The performance of non-uniform memory access (NUMA) machines depends on the assignment of threads to processors. Depending on the level of architectural support, the kernel can monitor memory behavior by measuring remote references, cache miss behavior, or cycles per instruction (CPI), and base thread rescheduling decisions on this feedback.
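A minimal sketch of such feedback-driven placement, assuming the architecture can attribute a thread's sampled memory references to NUMA nodes (the counter layout and migration policy here are hypothetical, not taken from the paper):

```python
# Hypothetical NUMA rebalancer: migrate each thread toward the node
# holding most of the memory it references.  Counter data is illustrative.

def preferred_node(refs_by_node):
    """Node receiving the largest share of this thread's sampled
    memory references.  `refs_by_node` maps node id -> sample count."""
    return max(refs_by_node, key=refs_by_node.get)

def rebalance(threads):
    """Return suggested migrations (thread, from_node, to_node) based on
    the counter feedback; well-placed threads are left alone.
    `threads` maps thread id -> (current_node, refs_by_node)."""
    moves = []
    for tid, (node, refs) in threads.items():
        target = preferred_node(refs)
        if target != node:
            moves.append((tid, node, target))
    return moves

moves = rebalance({
    "T1": (0, {0: 100, 1: 900}),  # mostly remote references: move to node 1
    "T2": (1, {0: 50, 1: 950}),   # already mostly local: keep in place
})
print(moves)  # [('T1', 0, 1)]
```

On hardware without per-node reference counters, the same loop could fall back to coarser signals such as cache miss counts or CPI, at the cost of less precise placement.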
Sameh Mohamed Elnikety
2003-06-15