Coarray Fortran 2.0 - Performance Results

Performance

Since the CAF 2.0 prototype only recently became operational, achieving high performance is a work in progress. To date, we have been experimenting with the HPC Challenge (HPCC) benchmarks and using the performance issues they expose to focus our implementation efforts. The preliminary results below summarize our experience implementing these benchmarks so far.

In experiments with HPCC codes on up to 8192 CPU cores of a Cray XT, we achieved 18.3 TFLOP/s with HPL, 2.01 GUP/s with RandomAccess, and a bandwidth of 10.70 TByte/s with STREAM Triad on 4096 cores, and 357.8 GFLOP/s with FFT and 12286 MNode/s with UTS on 8192 cores.

Table 1: Number of source lines of code (SLOC) for each benchmark

Benchmarks	SLOC
RandomAccess	409
STREAM Triad	63
Global FFT	450
Global HPL	786
Unbalanced Tree Search (UTS)	544

Table 2: Performance of each benchmark on 64, 256, 1024, 4096, and 8192 cores

Cores	RandomAccess (GUP/s)	STREAM Triad (TByte/s)	Global FFT (GFLOP/s)	Global HPL (TFLOP/s)	UTS (MNode/s)
	Oct 2010	Oct 2011	Jul 2011	Oct 2010	Oct 2011
64	0.08	0.17	6.69	0.36	163.1
256	0.24	0.54	22.82	1.36	645
1024	0.69	2.66	67.80	4.99	2371
4096	2.01	10.70	187.04	18.3	7818
8192			357.80		12286
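
One way to read the scaling in Table 2 is to compare each benchmark's throughput against its own 64-core run. The Python sketch below does that arithmetic with the table's numbers. The dictionary layout and the helper name `scaling_efficiency` are illustrative choices, not part of the CAF 2.0 code, and the choice of 64 cores as the baseline is an assumption made here for convenience.

```python
# Throughput from Table 2, keyed by core count. Units are those in the table
# header. Core counts with no reported result are simply omitted.
RESULTS = {
    "RandomAccess (GUP/s)": {64: 0.08, 256: 0.24, 1024: 0.69, 4096: 2.01},
    "STREAM Triad (TByte/s)": {64: 0.17, 256: 0.54, 1024: 2.66, 4096: 10.70},
    "Global FFT (GFLOP/s)": {64: 6.69, 256: 22.82, 1024: 67.80,
                             4096: 187.04, 8192: 357.80},
    "Global HPL (TFLOP/s)": {64: 0.36, 256: 1.36, 1024: 4.99, 4096: 18.3},
    "UTS (MNode/s)": {64: 163.1, 256: 645, 1024: 2371,
                      4096: 7818, 8192: 12286},
}


def scaling_efficiency(rates, base_cores=64):
    """Return {cores: efficiency} relative to the run on base_cores.

    Efficiency is (rate / base_rate) / (cores / base_cores), so a value of
    1.0 means throughput grew in proportion to the core count.
    (Hypothetical helper for illustration only.)
    """
    base_rate = rates[base_cores]
    return {
        cores: (rate / base_rate) / (cores / base_cores)
        for cores, rate in sorted(rates.items())
    }


if __name__ == "__main__":
    for name, rates in RESULTS.items():
        eff = scaling_efficiency(rates)
        row = "  ".join(f"{cores}: {e:.2f}" for cores, e in eff.items())
        print(f"{name:24s} {row}")
```

Because the 64-core entries are rounded to two or three significant digits, the resulting ratios are only approximate, most noticeably for RandomAccess. With these inputs, HPL on 4096 cores comes out at roughly 0.79 of linear scaling from the 64-core run.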