Co-Array Fortran at Rice

Last Updated 4 November 2011

Performance

Since the CAF 2.0 prototype only recently became operational, achieving high performance is a work in progress. To date, we have been experimenting with the HPC Challenge (HPCC) benchmarks and using the performance issues they expose to focus our implementation efforts. Below are preliminary results from our implementations of these benchmarks.

In experiments with HPCC codes on a Cray XT, we achieved 18.3 TFlop/s with HPL, 2.01 GUP/s with RandomAccess, and a bandwidth of 10.70 TByte/s with STREAM Triad on 4096 CPU cores, and 357.8 GFlop/s with FFT and 12286 MNode/s with UTS on 8192 CPU cores.

Table 1: Number of source lines of code (SLOC) for each benchmark

Benchmark                      SLOC
RandomAccess                   409
STREAM Triad                   63
Global FFT                     450
Global HPL                     786
Unbalanced Tree Search (UTS)   544

Table 2: Performance of each benchmark on 64, 256, 1024, 4096, and 8192 cores (- = no result reported)

# of cores   RandomAccess   STREAM Triad   Global FFT   Global HPL   UTS
             (GUP/s)        (TByte/s)      (GFlop/s)    (TFlop/s)    (MNode/s)
             Oct 2010       Oct 2011       Jul 2011     Oct 2010     Oct 2011
64           0.08           0.17           6.69         0.36         163.1
256          0.24           0.54           22.82        1.36         645
1024         0.69           2.66           67.80        4.99         2371
4096         2.01           10.70          187.04       18.3         7818
8192         -              -              357.80       -            12286
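
As a rough aid to reading Table 2, the Python sketch below computes scaling efficiency for each benchmark: the measured speedup over the 64-core run divided by the ideal speedup (the ratio of core counts). It is not part of the CAF 2.0 distribution. The function name `scaling_efficiency` and the choice of the 64-core run as the baseline are assumptions made for illustration. The data values are copied from Table 2, and configurations with no reported result are skipped.

```python
# Table 2 values copied from this page; None marks a configuration
# with no reported result.
CORES = [64, 256, 1024, 4096, 8192]
RESULTS = {
    "RandomAccess (GUP/s)":   [0.08, 0.24, 0.69, 2.01, None],
    "STREAM Triad (TByte/s)": [0.17, 0.54, 2.66, 10.70, None],
    "Global FFT (GFlop/s)":   [6.69, 22.82, 67.80, 187.04, 357.80],
    "Global HPL (TFlop/s)":   [0.36, 1.36, 4.99, 18.3, None],
    "UTS (MNode/s)":          [163.1, 645, 2371, 7818, 12286],
}


def scaling_efficiency(cores, values):
    """Return {core_count: efficiency} relative to the first (smallest) run.

    Efficiency is (measured speedup) / (ideal speedup); 1.0 means the
    rate grew in proportion to the core count. This baseline choice is
    an illustrative assumption, not a metric defined on this page.
    """
    base_cores, base_value = cores[0], values[0]
    efficiency = {}
    for core_count, value in zip(cores, values):
        if value is None:
            continue  # no measurement reported for this configuration
        speedup = value / base_value
        ideal = core_count / base_cores
        efficiency[core_count] = speedup / ideal
    return efficiency


if __name__ == "__main__":
    for name, values in RESULTS.items():
        eff = scaling_efficiency(CORES, values)
        row = "  ".join(f"{c}: {e:.2f}" for c, e in eff.items())
        print(f"{name:24s} {row}")
```

Because the baseline is itself a parallel run, these ratios describe only how throughput changes between 64 cores and larger configurations; they say nothing about absolute efficiency relative to hardware peak.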

© 2009-2011 Rice University