Programming the Tera Multiprocessor
Many of the performance-tuning tricks used on other multiprocessors do not make sense on the Tera Multiprocessor.
Data Blocking
- Perform the loop computation on ‘blocks’ of data, with the block size chosen to fit the data cache.
- The goal is to keep each data block entirely within the cache until you are finished with it, then move on to the next block.
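As an illustration of blocking on a conventional cached machine, here is a minimal sketch of a tiled matrix transpose. The function name `transpose_blocked`, the helper `min_size`, and the tile edge `BLOCK = 32` are assumptions made for this example, not values from the original material; a real block size would be tuned to the target cache.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical tile edge: pick it so that a BLOCK x BLOCK tile of the
   source and of the destination both fit in the data cache. */
enum { BLOCK = 32 };

static size_t min_size(size_t a, size_t b) { return a < b ? a : b; }

/* Transpose the n x n row-major matrix src into dst, one tile at a time.
   Each tile is finished completely before the next one is started. */
void transpose_blocked(size_t n, const double *src, double *dst)
{
    assert(src != dst);  /* this sketch does not support in-place use */

    for (size_t ii = 0; ii < n; ii += BLOCK) {
        for (size_t jj = 0; jj < n; jj += BLOCK) {
            /* Clamp the tile at the matrix edge when n is not a
               multiple of BLOCK. */
            size_t i_end = min_size(ii + BLOCK, n);
            size_t j_end = min_size(jj + BLOCK, n);

            for (size_t i = ii; i < i_end; i++)
                for (size_t j = jj; j < j_end; j++)
                    dst[j * n + i] = src[i * n + j];
        }
    }
}
```

The two outer loops walk over tiles and the two inner loops stay inside one tile, so the working set at any moment is roughly two tiles rather than two whole matrices.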
Data Stride
- If matrix data is stored in row-major order (entries B[i,j] and B[i,j+1] are adjacent in memory), then access the data in row order so that memory accesses are sequential.
- C stores matrices in row-major order; Fortran stores them in column-major order!
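The stride point can be sketched in C with two loops that compute the same sum over a row-major matrix but touch memory in different orders. The function names `sum_row_order` and `sum_column_order` and the flat-array layout `b[i * cols + j]` are assumptions made for this example.

```c
#include <assert.h>
#include <stddef.h>

/* Sum a rows x cols matrix stored in row-major order, walking it row by
   row. The inner loop advances j, so consecutive accesses are adjacent
   in memory (stride 1). */
double sum_row_order(size_t rows, size_t cols, const double *b)
{
    assert(b != NULL);
    double total = 0.0;
    for (size_t i = 0; i < rows; i++)
        for (size_t j = 0; j < cols; j++)
            total += b[i * cols + j];
    return total;
}

/* Same sum, walking column by column. The inner loop advances i, so
   consecutive accesses are cols elements apart (stride cols). */
double sum_column_order(size_t rows, size_t cols, const double *b)
{
    assert(b != NULL);
    double total = 0.0;
    for (size_t j = 0; j < cols; j++)
        for (size_t i = 0; i < rows; i++)
            total += b[i * cols + j];
    return total;
}
```

Both functions return the same value for the same input; only the access pattern differs. In Fortran, where the layout is column-major, the roles of the two loop orders are swapped.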