The Exascale Challenge to the UK - Part Tree - Capacity

In this post we want to consider how much science we can do with the computers we currently have available (ARCHER and NEXCS), and in subsequent posts we’ll consider the route to exascale.

To do that, we have to define science. There are a number of ways we can do that, but for the purposes of this exercise we’re going to define it in two ways:

What we can do with current generation models, and
What we might want to be able to do with future generation models.

How much science can we currently do?

Obviously we run a mix of models, but for the purposes of this exercise, we’re going to define science, as “number of years we can simulate” using the N512 model we used for benchmarking in the last post.

However, as discussed in Balaji et al (2017)¹, and is clear in the results in that previous post, we have a choice between speed and throughput when working out how much work we can do with a finite amount of compute. If we run as fast as possible (“speed”), we use more nodes per simulated years than if we run slower, so often we choose to run slower to get more throughput, but we still need to run “fast enough”. For the purpose of this exercise we are going to define “fast enough” as 1 simulated year per (real) day (SYPD).

So the question of “how much science can we currently do?” comes down to how many years of the N512 model can we run in one day at 1 SYPD on the platforms we have available.

To address that, we can use the data from the last post. I fitted a quadratic to each curve, and found the minimum number of nodes (of the two solutions) for running at 1 SYPD. These data are summarised in the following table:

	ARCHER (24)	NEXUS (24)	NEXUS (36)
No I/O	383	390	316
With I/O	568	550	406

We’ll come back to the difference between the I/O and no I/O runs in a subsequent post, and why the difference shouldn’t be that large, but for the moment we’ll use the no I/O runs to compare the two machines we currently have available, and our portion of them².

The two platforms these benchmark were run on were NEXCS and ARCHER:

NEXCS is a Cray XC-40, with 6720 nodes each with 36 Broadwell cores (currently “number 15 in the world”: top500 entry).
ARCHER is a Cray XC-30, with 4920 nodes, each with 24 Ivy Bridge cores (currently “number 79 in the world”: top500 entry).

(We’ll come back to those “number XX” designations in a future post … not least because they’re pretty meaningless in practice!)

: One of the Met Office XC-40s (probably not the one we use for NEXCS though!)

: ARCHER

It’s pretty easy to use those node numbers to assert the relative capability of the machines, in terms of the the number of ensemble members they can run per day as being:

ARCHER (24)	NEXUS (36)
4920/383	6720/316
= 12.8	= 21.2

In both cases, the NERC community cannot use all the machines all the time, since NERC has contracted portions of the machine - so we can calculate the number we could sustain long term if we could use the entire NERC computing allocations of 22% and 4.2% respectively:

ARCHER (24)	NEXUS (36)
12.8*.22	6720/316
= 2.8	= 0.9

It doesn’t take a climate expert to realise that we cannot easily conceive of doing centennial scale ensembles of this model with these machines. Any worthwhile experiment will consist of at least a pair of numerical experiments (e.g. a control run and some sort of perturbation), and a minimum of three ensemble members each (preferably a lot more) so we would need (at least) 2 x 3 x 100 years of simulation. So one (we obviously want to do many experiments ³) long term NERC climate experiment would require both machines for 600/3.7=162 days, or half a year!

We can do useful time-slice experiments in a reasonable period of time though, since we can typically consider doing 30 years rather than 100 years, and that factor of 3 results in experiments that could in theory be completed in two months or so. Of course, no one user⁴ gets both machines all the time, so the real turn around is much longer.

Still to come in this exciting - ! :-) - set of blog posts:

What conclusions might we draw from these results for what climate science we might be able to do with the likely available computing in the next few years and beyond? (And how good might our predictions be?),
The data issues associated with all of this, and
What this means for UK scientific competitiveness.

Balaji, V., Maisonnave, E., Zadeh, N., Lawrence, B. N., Biercamp, J., Fladrich, U., Aloisio, G., Benson, R., Caubel, A., Durachta, J., Foujols, M.-A., Lister, G., Mocavero, S., Underwood, S., & Wright, G. (2017). CPMIP: Measurements of Real Computational Performance of Earth System Models in CMIP6. Geosci. Model Dev., 10(1), 19–34. https://doi.org/10.5194/gmd-10-19-2017 ↩
Many will be surprised at me ignoring I/O for this comparison, given my long-standing statement that I/O will be a rate-limiter for our compute in the near future. Suffice to say, we will get to that argument, but we have to construct the rest of it first. ↩
Meehl, G. A., Moss, R., Taylor, K. E., Eyring, V., Stouffer, R. J., Bony, S., & Stevens, B. (2014). Climate Model Intercomparisons: Preparing for the Next Phase. Eos, Transactions American Geophysical Union, 95(9), 77–78. ↩
It’s important to realise that these large experiments may be run by one or two people operating as a team, but are generally analaysed by much larger teams of people who generate the resulting scientific outputs. However, there are lots of such teams sharing these computers. ↩