Extending NT Virtual Memory by SCI-based Hardware DSM

Presented at the 2nd USENIX Windows NT Symposium, August 1998, Seattle, WA, USA
https://wwwbode.informatik.tu-muenchen.de/Par/arch/smile/sisci/pthreads/
Introduction

Due to their ease of use, shared memory programming models are of increasing interest for clusters of workstations. The Scalable Coherent Interface (SCI) [1], a new interconnect technology in the SAN area with hardware DSM capabilities, offers a good platform for this programming model, allowing a global and transparent SCI Virtual Memory to be built. This forms the basis for transparent shared memory programming libraries, such as thread packages [2], which opens the architecture of clusters of workstations to a second programming model (next to message passing) and thereby to a new group of applications and users.

SCI Virtual Memory

The DSM offered by SCI is based solely on sharing contiguous physical memory segments. To construct a global virtual memory, techniques well known from software DSM packages must be applied. Data is distributed at page granularity, and the SCI-VM software layer provides applications with a consistent view of the distributed memory resources. Unlike in software DSM systems, however, remote pages need not be replicated or migrated; they can simply be mapped by the SCI adapter and accessed using SCI's remote memory capabilities. The SCI-VM concept requires the ability to map individual physical pages into the virtual address space. Unfortunately, Windows NT neither provides nor documents this functionality. The only way to implement it is therefore to bypass the operating system and map single pages by directly manipulating the CPU's page tables.
Implementing SCI-VM and first results

Using the techniques described above, we have implemented a restricted test version of the SCI-VM on a reduced cluster of two PCs (233 MHz Pentium II with 440FX chipset). One node was used for computation, while the second acted purely as a memory server supplying remote memory. This memory is interleaved with the local memory of the computation node at page granularity in round-robin fashion. On top of this global memory we tested two synthetic kernels, a sum and a matrix multiplication code, as well as one larger application, the "volrend" code from the SPLASH-2 suite [3]. All codes were run in three versions: exclusively on local memory to provide a baseline, in an unoptimized version, and in an optimized version using the hardware buffering and prefetching facilities of the SCI adapter card as well as caching. The latter optimization is not directly supported by the SCI hardware, as it does not implement cache consistency; to compensate, a software cache coherence scheme was applied. As expected, in the unoptimized case all codes ran several orders of magnitude slower than the baseline due to the long read latencies over the SCI network (around 6 microseconds). With optimizations activated, however, all codes performed very well, with an overhead of less than 75%; in some cases, an overhead as low as 5.1% was obtained. These numbers indicate that these applications, when run in parallel on several compute nodes, would exhibit an acceptable speedup. Future experiments with several compute nodes will provide actual speedup numbers.

References
[1] IEEE Computer Society. IEEE Std 1596-1992, IEEE Standard for the Scalable Coherent Interface, Aug. 1993.
Martin Schulz, schulzm@in.tum.de, 6/16/1998.
This paper was originally published in the
Proceedings of the 2nd USENIX Windows NT Symposium,
August 3-5, 1998,
Seattle, Washington, USA