Table 2 shows the measured performance of the test cases in Table 1 under generic Linux and when using AASFP. All numbers are averages over 5 runs. To obtain correct results, the file system buffer cache must be flushed before each run. Instead of rebooting the test machine every time, we generate a 128-MByte file with random contents. Note that this size is no smaller than the sum of the system's physical memory and swap space. Therefore, a sequential read through this file should evict all cached content left over from the previous run. To verify this, we compare the results of Backward 1 under generic Linux when rebooting the machine between runs and when reading this large file instead. The numbers are shown in Table 3.
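For concreteness, the following C sketch shows one way to implement this cache-flushing step; the file name and chunk size are illustrative assumptions, not values prescribed by AASFP:

    #include <stdio.h>

    #define FLUSH_FILE "/tmp/flush.dat"  /* assumed path of the 128-MByte random file */
    #define CHUNK (64 * 1024)

    /* Sequentially read the entire flush file so that its pages displace
       whatever the previous run left in the buffer cache. */
    int main(void)
    {
        static char buf[CHUNK];
        FILE *fp = fopen(FLUSH_FILE, "r");
        if (fp == NULL) {
            perror("fopen");
            return 1;
        }
        while (fread(buf, 1, CHUNK, fp) > 0)
            ;   /* discard the data; only the read traffic matters */
        fclose(fp);
        return 0;
    }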
Under Linux, only sequential disk prefetching is supported; under AASFP, only application-specific disk prefetching is supported. The numbers in parentheses show the AASFP overhead, which is due to the extra time spent running the prefetch thread. This overhead is in general insignificant because the cost of the prefetch thread's computation, together with the associated process scheduling and context switching, is small relative to the application's own computation. The fourth column lists the percentage of disk I/O time that AASFP can mask, calculated as the ratio between the disk I/O time that is masked and the total disk I/O time without prefetching.
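Stated as a formula (the symbol names below are ours, introduced only for clarity):

    masked I/O (%) = (T_noprefetch - T_AASFP) / T_noprefetch x 100

where T_noprefetch is the total disk I/O time without prefetching and T_AASFP is the disk I/O time that remains visible to the application under AASFP.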
The fifth column in Table 2 shows the performance improvement of AASFP over Linux. For reference, the sixth column lists the CPU time, i.e., the pure computation time of the application excluding I/O. The seventh column shows the synchronization overhead explained in Section 3.2. The last column shows AASFP's compilation overhead for extracting the prefetch thread from each program.
For the volume visualization application with a 4-KByte block size, AASFP achieves 54% and 22% overall performance improvement for orthonormal and non-orthonormal viewing directions, respectively. There is not much performance gain for the cases that use a 32-KByte block size. Retrieving a 32-KByte block corresponds to fetching eight consecutive 4-KByte pages. There is substantial spatial locality in this access pattern, which reduces the relative performance gain of AASFP. This also explains why generic Linux prefetching is comparable to AASFP when 32-KByte blocks are used.
For the out-of-core FFT, there is no significant performance improvement, because its disk access patterns are almost entirely sequential. Although FFT is well known for its butterfly algorithmic structure, which suggests random disk access patterns, a more careful examination reveals that out-of-core FFT, like many other out-of-core applications, exhibits sequential disk access patterns. Nevertheless, our results show that even under such regular access patterns AASFP provides performance as good as sequential prefetching. This means that AASFP does not mistakenly prefetch unnecessary data and thereby hurt overall performance, and that the prefetch thread does not add any noticeable overhead in this case.
For the micro-benchmark, AASFP provides an 86% performance improvement in the Backward 1 case, which represents the worst case for the sequential prefetching scheme used in Linux. Note that AASFP loses almost no performance in the Forward 1 and Forward 2 cases compared to Linux, even though these are the best cases for the Linux kernel. These measurements again demonstrate that AASFP performs as well as generic Linux for sequential access patterns.
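To make the access pattern concrete, the following C sketch shows a backward scan in the style of Backward 1; the file name, block size, and use of pread are illustrative assumptions, not the micro-benchmark's actual source:

    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/stat.h>
    #include <unistd.h>

    #define BLOCK 4096   /* assumed block size */

    int main(void)
    {
        char buf[BLOCK];
        struct stat st;
        int fd = open("data.bin", O_RDONLY);   /* hypothetical input file */
        if (fd < 0 || fstat(fd, &st) < 0) {
            perror("open/fstat");
            return 1;
        }
        /* Read blocks from the end of the file toward the beginning.
           Each access lands one block before the previous one, so the
           kernel's sequential read-ahead always fetches the wrong blocks,
           while an application-specific prefetcher can still predict them. */
        for (off_t off = st.st_size - BLOCK; off >= 0; off -= BLOCK)
            if (pread(fd, buf, BLOCK, off) != BLOCK)
                break;
        close(fd);
        return 0;
    }

Forward 1 would be the same loop running from offset 0 upward in BLOCK-sized steps, which is exactly the pattern the kernel's sequential read-ahead is designed for.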
Another potential factor that can hurt prefetching efficiency is synchronization. Too many forced synchronizations can limit how far ahead the prefetch thread runs, thus limiting the performance improvement of AASFP. In all of the above test cases, the synchronization-related overhead is either non-existent or negligible, as shown in Table 2, because the programs contain neither conditional branches nor user inputs (e.g., for deciding which data file to use) that would force the prefetch thread to wait.
Finally, the associated compilation overhead is also minimal; furthermore, the prefetch thread extraction needs to be performed only once, as a preprocessing step. This suggests not only the simplicity of AASFP but also its applicability to other similar programs.