
2.2 Passing parameters and data sharing

Passing BLAS parameters to the slave is realized by writing, in a shared variable, the address of the first parameter of the BLAS call before synchronization, as shown in the following example: env_base holds a pointer to the data needed by the slave process, and env_blasth_signal_value holds the function to be run by the slave:
void **env_base;
void (*env_blasth_signal_value)();

void blasth_daxpy(const int *n,
                  double *alpha,
                  double *X,
                  const int *incx,
                  double *Y,
                  const int *incy){
 // realize Y = *alpha * X + Y
 // where X and Y are vectors of 
 // size *n with respective increments
 // of *incx and *incy 
 // executed by the 
 // master from the
 // application program

 // publish the address of the first
 // parameter (all BLAS parameters
 // are passed by reference)
 env_base = (void **)&n;
 env_blasth_signal_value = TH_DAXPY;

 // tell the slave there is
 // some job to do
 blasth_master_sync(); 

 // some job

 // wait for the slave
 blasth_master_sync_end();
}


void blasth(){
 // executed by the slave from the
 // environment setup

 while(1){
  // wait for the master
  blasth_sync(); 

  //call the function set by the master
  env_blasth_signal_value(); 
 }
}

void TH_DAXPY(){
 // at this point env_base 
 // contains a pointer to the
 // first needed parameter
 // (int *)env_base[0] is a 
 // pointer to the size of vectors (*n)
 // (double *)env_base[1] is a 
 // pointer to the scaling factor (*alpha)
 // (double *)env_base[2] is a 
 // pointer to the first element
 // of vector X
 // ....
 
 // some job
 
 // tell the master 
 // that job is finished
 blasth_sync_end();
}
The blasth_daxpy calling sequence is identical to the daxpy calling sequence from a C program (the BLAS library is originally written in Fortran 77, so the API is f77 compliant). The parameters are written before the synchronization variable, so the strong memory ordering (for write operations) of the Pentium processor family ensures that the slave process will see exactly the same parameters in TH_DAXPY as the master does in blasth_daxpy.

Data sharing is done by splitting the result between the master and the slave: if the result is a vector, the master constructs the first half and the slave constructs the second half; if the result is an m x n matrix, the master constructs either the first n/2 columns or the first m/2 rows and the slave constructs the remaining columns or rows. We show splitting examples in figure [*] for dgemv and dgemm (matrix-vector and matrix-matrix products, respectively).

Figure: data splitting for dgemv and dgemm [figure: split.eps]

We do not use a cyclic split of the data, to avoid cache-line sharing between processors (especially when writing data). The splittings are also chosen to avoid temporary data, which would require dynamic allocation.


Thomas Guignon
2000-08-24