
SW4 cross-HPC scaling-test results

Strong and weak scaling

SW4's parallel scaling has not yet plateaued at the tested core and node counts, for either strong or weak scaling. The super-linear scaling observed at some points is not fully understood, but may simply reflect random run-to-run variation. In practice, four nodes is the largest configuration that can be reliably scheduled from the Mahuika queue; the Cascade investigation will be extended to higher node counts to identify where scaling eventually plateaus.

SW4 strong-scaling speedup vs core count. SW4 weak-scaling efficiency vs node count.

Measured throughputs and scaling efficiency

| HPC (binary build) | Throughput, $T$ (Giga cell-updates / core-hour) | Scaling efficiency (%) |
| --- | --- | --- |
| Cascade (znver4) | 3.5 | 99 |
| Mahuika Genoa (znver4) | 3.0 | 96 |
| Mahuika Genoa (znver3) | 2.8 | 90* |
| Mahuika Milan (znver3) | 1.6 | 90* |
| RCH (znver3) | 1.4 | 90 |

* Estimated as the largest efficiency drop that could be hidden by the inter-run variability.

Throughputs are the median across the four weak-scaling runs (see the figure below). Enabling SW4's optional NaN check adds approximately 5 % overhead.

Cascade's throughput drops by roughly 30 % for simulation domains shaped as thin slabs, like those used in the strong-scaling investigations (see the lower panel of the figure below). The cause isn't fully understood, but likely relates to the interplay of its processor architecture and memory system.

SW4 per-core throughput across the tested HPCs. Upper panel: weak-scaling throughput vs node count. Lower panel: weak/strong throughput ratio.

Memory

The memory required for a simulation domain of $N$ grid cells can be estimated with the empirical model

$$M \approx 270 + 510 \times 10^{-6} N$$

where $M$ is the memory in MB.
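
As a quick illustration, the empirical model above can be evaluated with a few lines of Python. This is only a sketch: the function name `estimate_memory_mb` and the example domain size are hypothetical, chosen for illustration rather than taken from the scaling tests.

```python
def estimate_memory_mb(n_cells: int) -> float:
    """Evaluate the empirical model M ~ 270 + 510e-6 * N, with M in MB."""
    if n_cells < 0:
        raise ValueError("n_cells must be non-negative")
    return 270.0 + 510e-6 * n_cells


# Hypothetical 1000 x 1000 x 200 cell domain (illustrative only).
n_cells = 1000 * 1000 * 200
memory_mb = estimate_memory_mb(n_cells)
print(f"Estimated memory: {memory_mb:.0f} MB")
```

For this hypothetical domain of $2 \times 10^8$ cells, the model gives roughly 102 000 MB.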

Estimating job size of SW4 simulations

The total number of cell-updates in an SW4 simulation is given by $N_{updates} = n_x \times n_y \times n_z \times n_t$ where $n_x$, $n_y$, $n_z$ are the number of grid cells in the $x$, $y$, and $z$ dimensions, respectively, and $n_t$ is the total number of time steps in the simulation.

The compute, $C$, required for a simulation of $N_{updates}$ cell updates is given by

$$C = \frac{N_{updates}}{(T \times 10^9)}$$

where $C$ is in core-hours, and $T$ is the throughput value from the table above in Giga cell-updates / core-hour.

Assuming ideal scaling, the required wall-clock time, $t_w$, in hours, is given by

$$t_w = \frac{C}{n_{core}}$$

where $n_{core}$ is the number of processor cores.
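
The three formulas above can be chained into a small job-size estimator. The sketch below is illustrative only: the function names, the example grid dimensions, the step count, and the core count are all assumptions, while the throughput values are copied from the table above.

```python
# Throughput T in Giga cell-updates per core-hour, copied from the table above.
THROUGHPUT = {
    "Cascade (znver4)": 3.5,
    "Mahuika Genoa (znver4)": 3.0,
    "Mahuika Genoa (znver3)": 2.8,
    "Mahuika Milan (znver3)": 1.6,
    "RCH (znver3)": 1.4,
}


def cell_updates(nx: int, ny: int, nz: int, nt: int) -> int:
    """N_updates = n_x * n_y * n_z * n_t."""
    return nx * ny * nz * nt


def core_hours(n_updates: float, throughput: float) -> float:
    """C = N_updates / (T * 1e9), with T in Giga cell-updates / core-hour."""
    return n_updates / (throughput * 1e9)


def wall_clock_hours(compute: float, n_cores: int) -> float:
    """t_w = C / n_core, assuming ideal scaling."""
    return compute / n_cores


# Hypothetical job: 1000 x 1000 x 200 cells, 10 000 time steps, 256 cores.
n_updates = cell_updates(1000, 1000, 200, 10_000)
compute = core_hours(n_updates, THROUGHPUT["Cascade (znver4)"])
wall = wall_clock_hours(compute, 256)
print(f"{n_updates:.3g} cell-updates, {compute:.0f} core-hours, {wall:.2f} h")
```

For this hypothetical job on Cascade, the estimate is about 571 core-hours, or roughly 2.2 hours of wall-clock time on 256 cores. Because the wall-clock formula assumes ideal scaling, the result is best read as a lower bound.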
