
   
Performance Evaluation


  
Figure: Performance comparison of hand-optimized and Broadway-optimized PLAPACK applications. [Plots cholesky_6x6.eps and liaponov_6x6.eps not shown.]




  
Figure: Performance comparison of the hand-customized and Broadway-customized PLA_Trsm() function for the Cholesky program. For the Lyapunov program, the hand-customized PLA_Trsm() function matched the performance of the Broadway-customized version. [Plot unit_trsm_6x6.eps not shown.]


  
Figure: Scalability of the Cholesky programs as the number of processors grows. [Plot cholesky_scale_3072.eps not shown.]

Figure [*] shows the performance improvement of the Cholesky and Lyapunov programs. For fairly large ($6144 \times 6144$) matrices, the Broadway-optimized Cholesky program is 26% faster than the baseline, while the hand-optimized program is 22% faster. For the Lyapunov program, the Broadway system does not perform as well as the manual approach: it improves performance by 9.5% versus the hand-optimized 21.5% for $250 \times 250$ matrices, and by 5.8% versus 6.1% for $2000 \times 2000$ matrices. The two approaches obtain identical performance on the PLA_Trsm() kernel, but the hand-optimized program applies a few additional optimizations to other parts of the code.

Note that there is considerable room for further improving the Lyapunov program, since PLA_Trsm() accounts for only 11.6% of the execution time for $250 \times 250$ matrices, and only 5.8% for $2000 \times 2000$ matrices. When our compiler is complete, we will apply our optimizations to all parts of the PLAPACK library, including the PLA_Gemm() routine, where Lyapunov spends the majority of its time.
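This headroom can be made concrete with Amdahl's law: if PLA_Trsm() is only a fraction f of the run time, the whole-program speedup available from optimizing it alone is bounded by 1/(1 - f). A minimal sketch in C (the helper name `amdahl` is ours, for illustration; it is not part of PLAPACK or Broadway):

```c
#include <assert.h>

/* Amdahl's law: upper bound on whole-program speedup when a fraction f
 * of the execution time is sped up by a factor s.  The fractions below
 * come from the profile data quoted in the text. */
static double amdahl(double f, double s) {
    return 1.0 / ((1.0 - f) + f / s);
}

/* Even an infinitely fast PLA_Trsm() (s -> infinity) bounds the
 * Lyapunov speedup at 1/(1 - f):
 *   f = 0.116 (250x250 matrices)   -> about 1.131, i.e. at most ~13%
 *   f = 0.058 (2000x2000 matrices) -> about 1.062, i.e. at most ~6%  */
```

This is consistent with the measured 9.5% and 5.8% improvements: the kernel-only optimization is already close to its theoretical limit for the larger matrices.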

Since our experiment focuses on the benefits of specializing the PLA_Trsm() routine, Figure [*] shows the performance difference between the generic PLA_Trsm() routine and the version that was customized for Cholesky by our compiler. Notice that we observe similar results for different numbers of processors. Figure [*] shows how the performance of the various Cholesky programs scales with the number of processors.

The results reveal an interesting point: closer examination of the Cholesky numbers shows that specialization and dead code elimination account for almost all of the performance benefit, while high-level copy propagation (where the copy operations are library routines) contributes little.
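To illustrate the kind of transformation involved, here is a schematic sketch (not PLAPACK's actual source) of how specializing run-time flags enables dead code elimination: once the flags at a call site are known constants, the untaken branches become dead and can be removed. A tiny 2x2 triangular solve stands in for PLA_Trsm():

```c
#include <assert.h>

enum uplo { UPPER, LOWER };
enum diag { UNIT, NONUNIT };

/* Generic routine: run-time flags select among four variants,
 * so every call pays for the branches. */
static void trsm_generic(enum uplo u, enum diag d,
                         double a[2][2], double b[2]) {
    if (u == LOWER) {
        b[0] = (d == UNIT) ? b[0] : b[0] / a[0][0];
        b[1] -= a[1][0] * b[0];
        b[1] = (d == UNIT) ? b[1] : b[1] / a[1][1];
    } else {
        b[1] = (d == UNIT) ? b[1] : b[1] / a[1][1];
        b[0] -= a[0][1] * b[1];
        b[0] = (d == UNIT) ? b[0] : b[0] / a[0][0];
    }
}

/* Specialized for the LOWER/NONUNIT case: with the flags folded to
 * constants, the UPPER and UNIT paths are dead and disappear. */
static void trsm_lower_nonunit(double a[2][2], double b[2]) {
    b[0] /= a[0][0];
    b[1] = (b[1] - a[1][0] * b[0]) / a[1][1];
}
```

In PLAPACK the variant information lives in object descriptors rather than enum arguments, but the effect of Broadway's specialization is analogous: the analysis proves which variant each call site needs, and dead code elimination discards the rest.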


Samuel Z. Guyer
1999-08-25