Performance results for daxpy and the corresponding acceleration are presented
in the accompanying figures. The observed results are typical of level 1
operations: performance peaks when the data fits in the L1 cache and decreases
smoothly when the data no longer fits in L1 but remains in L2. Performance
from main memory is driven by memory bandwidth. Acceleration is quite good for
large vectors, especially when the data is in cache.
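
The three regimes described above follow from the size of the daxpy working
set, which holds two vectors of n doubles. The sketch below is illustrative
only: the cache sizes are hypothetical values chosen for the example, not
those of the test system, and the function names are ours.

```python
# Hypothetical cache sizes, chosen only for illustration
# (they are not taken from the measurements in the text).
L1_BYTES = 16 * 1024
L2_BYTES = 512 * 1024
DOUBLE_BYTES = 8


def daxpy(alpha, x, y):
    """Reference daxpy: y <- alpha * x + y, updating y in place."""
    for i in range(len(x)):
        y[i] += alpha * x[i]
    return y


def daxpy_regime(n, l1=L1_BYTES, l2=L2_BYTES):
    """Classify where the daxpy working set (x and y, n doubles each) fits."""
    working_set = 2 * n * DOUBLE_BYTES
    if working_set <= l1:
        return "L1"
    if working_set <= l2:
        return "L2"
    return "memory"
```

With these assumed sizes, vectors of up to 1024 elements stay in L1 and vectors
of up to 32768 elements stay in L2; anything larger is streamed from main
memory.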
Performance results for ddot and the corresponding acceleration are likewise
shown in the accompanying figures. The comments on the performance results are
the same as for daxpy, but scaling is not as good as for daxpy: in-cache
scaling is roughly 1.8, which remains good. On the other hand, scaling for
operations from main memory is poor (). To understand this, we made memory
bandwidth measurements with very large vectors, so that the time spent in
synchronization is negligible. The results are presented in the table:
single-processor ddot uses more than half of the theoretical peak memory
bandwidth (the system uses PC100 SDRAM, which allows 800e6 B/s of memory
bandwidth), and dual-processor ddot uses up to 75% of the available bandwidth,
which seems very good for the test system.
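
The bandwidth figures above can be turned into a rough bound on memory-bound
scaling. The sketch below assumes that ddot reads each of its two n-element
vectors exactly once (16n bytes in total) and that both runs are limited by
memory bandwidth only; the function names and the timing used in the usage
note are hypothetical, not measured values.

```python
PEAK_BYTES_PER_S = 800e6  # PC100 SDRAM peak bandwidth quoted in the text
DOUBLE_BYTES = 8


def ddot_bandwidth(n, seconds):
    """Bytes per second read by a ddot over two n-element double vectors."""
    return 2 * n * DOUBLE_BYTES / seconds


def peak_fraction(n, seconds, peak=PEAK_BYTES_PER_S):
    """Fraction of the peak memory bandwidth achieved by one ddot call."""
    return ddot_bandwidth(n, seconds) / peak


def bandwidth_bound_speedup(single_fraction, dual_fraction):
    """Upper bound on speedup when both runs are purely bandwidth-limited."""
    return dual_fraction / single_fraction
```

For example, a ddot on 10^6 elements taking a hypothetical 0.04 s would reach
half of the peak. Since the single-processor run already uses more than half
of the peak and the dual-processor run at most 75%, this simple model caps the
out-of-cache speedup below 0.75 / 0.5 = 1.5.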
The daxpy and ddot BLAS show good acceleration with the blasth library when
the data is in cache memory. This behavior is in fact common to all level 1
operations, since each vector component is used only once in the computation
(no temporal locality). Operations on small vectors () will not scale, because
the synchronization cost is high relative to the small number of floating
point operations issued (n or 2n for level 1 BLAS). Memory bandwidth is the
key to scalability when the data is out of cache, which is always the case for
very large data sets.
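
The effect of synchronization on small vectors can be illustrated with a
simple cost model. This is a sketch under stated assumptions, not a
description of how blasth behaves: the work is assumed to split evenly across
processors, a fixed synchronization cost is paid once per call, memory
bandwidth limits are ignored, and the rate and cost values in the usage note
are hypothetical.

```python
def modeled_speedup(n, flops_per_element, flop_rate, sync_seconds, processors=2):
    """Speedup predicted by a fixed-overhead model of a threaded level 1 call.

    n                 -- vector length
    flops_per_element -- 1 or 2 for level 1 BLAS (n or 2n flops in total)
    flop_rate         -- assumed per-processor rate, in flops per second
    sync_seconds      -- assumed fixed synchronization cost per call
    """
    serial = n * flops_per_element / flop_rate
    parallel = serial / processors + sync_seconds
    return serial / parallel
```

With an assumed rate of 100e6 flops per second and a synchronization cost of
10 microseconds, a 100-element operation with 2 flops per element is slower on
two processors than on one (modeled speedup below 1), while a 10^6-element
operation comes close to the ideal factor of 2.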