Performance
Since the CAF 2.0 prototype only recently became operational,
achieving high performance is a work in progress. To date, we have
been experimenting with the HPC
Challenge (HPCC) benchmarks and using their performance issues to help
focus our implementation efforts. The tables below summarize our
preliminary results from implementing these benchmarks.
In experiments with HPCC codes on a Cray XT, we achieved 18.3 TFLOP/s with HPL, 2.01 GUP/s with RandomAccess, and a bandwidth of 10.70 TByte/s with STREAM Triad on 4096 CPU cores, and 357.8 GFLOP/s with FFT and 12286 MNode/s with UTS on 8192 CPU cores.
Table 1: Number of source lines of code (SLOC) for each benchmark
Benchmark | SLOC |
RandomAccess | 409 |
STREAM Triad | 63 |
Global FFT | 450 |
Global HPL | 786 |
Unbalanced Tree Search (UTS) | 544 |
Table 2: Performance of each benchmark on 64, 256, 1024, 4096, and 8192 cores
# of cores | RandomAccess (GUP/s), Oct 2010 | STREAM Triad (TByte/s), Oct 2011 | Global FFT (GFLOP/s), Jul 2011 | Global HPL (TFLOP/s), Oct 2010 | UTS (MNode/s), Oct 2011 |
64 | 0.08 | 0.17 | 6.69 | 0.36 | 163.1 |
256 | 0.24 | 0.54 | 22.82 | 1.36 | 645 |
1024 | 0.69 | 2.66 | 67.80 | 4.99 | 2371 |
4096 | 2.01 | 10.70 | 187.04 | 18.3 | 7818 |
8192 | | | 357.80 | | 12286 |
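One way to read the scaling figures is to compare the per-core rate at each core count with the per-core rate at 64 cores. The short Python sketch below does that arithmetic on the values copied from the second table. The function name `relative_per_core_rate` and the choice of the 64-core run as the baseline are illustrative assumptions, not part of the CAF 2.0 work, and the ratio is only a rough indicator, since it makes no assumption about how the problem size changed between runs.

```python
# Values copied from the performance table above.
# None marks a core count for which no result is reported.
CORES = [64, 256, 1024, 4096, 8192]
RATES = {
    "RandomAccess (GUP/s)": [0.08, 0.24, 0.69, 2.01, None],
    "STREAM Triad (TByte/s)": [0.17, 0.54, 2.66, 10.70, None],
    "Global FFT (GFLOP/s)": [6.69, 22.82, 67.80, 187.04, 357.80],
    "Global HPL (TFLOP/s)": [0.36, 1.36, 4.99, 18.3, None],
    "UTS (MNode/s)": [163.1, 645, 2371, 7818, 12286],
}


def relative_per_core_rate(cores, rates):
    """Return per-core rate at each core count divided by the per-core
    rate of the first (smallest) run. Hypothetical helper for illustration."""
    base = rates[0] / cores[0]
    return [None if r is None else (r / c) / base
            for c, r in zip(cores, rates)]


if __name__ == "__main__":
    for name, rates in RATES.items():
        ratios = relative_per_core_rate(CORES, rates)
        cells = ["   n/a" if x is None else f"{x:6.2f}" for x in ratios]
        print(f"{name:24s}", " ".join(cells))
```

A ratio near 1.0 means the per-core rate held steady as cores were added; lower values mean the per-core rate dropped relative to the 64-core run.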