The Exascale Challenge to the UK - Part One

It is no secret that much of the progress in climate system modelling in recent decades has been on the back of the (then) inexorable increase in computational power associated with Moore’s Law - much of which came for free as chips got faster. Now, however, chips are not so much getting faster as getting more numerous and more varied: we will get more processors and more processor types, and we have to be smarter about how we program for them (Lawrence et al., 2018). This is a problem for all of science!

The US, Japan and China have been working on the problem for a while, and Europe is rapidly trying to catch up, but the “exascale problem” is all but invisible in the UK. While it’s obviously unfeasible for the UK to build an exascale computer on its own in the current economic and political climate, it’s absolutely the case that UK science and industry will have to work out how to use exascale computing, not least because the tools and techniques will percolate down eventually. However, even before the percolation, the UK will have to exploit exascale computing to remain competitive, so what will that entail for us?

The Challenges

ASCAC (2014) list ten challenges, which I’ve reordered and categorised according to my personal bias into “industry”, “computer science”, and “interdisciplinary”, where the latter is meant to indicate that I think these are challenges for computer science, mathematics, and discipline-specific practitioners!

The industry challenges are:

Energy efficiency: Creating more energy-efficient circuit, power, and cooling technologies.
Interconnect technology: Increasing the performance and energy efficiency of data movement.
Memory Technology: Integrating advanced memory technologies to improve both capacity and bandwidth.

Arguably those are things that industry will address, with or without government funding, and it probably won’t be UK industry, although once we were players … The computer science challenges are:
Scalable System Software: Developing scalable system software that is power- and resilience- aware.
Resilience and correctness: Ensuring correct scientific computation in face of faults, reproducibility, and algorithm verification challenges.
Programming systems: Inventing new programming environments that express massive parallelism, data locality, and resilience.
Scientific productivity: Increasing the productivity of computational scientists with new software engineering tools and environments.

Those are areas where the UK could still make a difference, not only scientifically, but in terms of creating opportunities for our own industry. The interdisciplinary challenges are:
Data management: Creating data management software that can handle the volume, velocity and diversity of data that is anticipated.
Exascale Algorithms: Reformulating science problems and redesigning, or reinventing, their solution algorithms for exascale systems.
Algorithms for discovery, design, and decision: Facilitating mathematical optimization and uncertainty quantification for exascale discovery, design, and decision making.

These are all areas where mathematics, computer science, and discipline skills come into play. However, somewhat surprisingly, that list doesn’t include one more important challenge:
Data Volumes! It’s clear (e.g. our talk) that using exabytes of data is likely to be a problem for our community before we get to exaflop computing.

Why is data a problem?

Haven’t the big cloud providers and social media giants got this sussed? (Google was reputed to have exabytes of data years ago, Facebook had an analysis cluster with 300 PB three years ago …) Well yes, they have got volumes stored sussed, but it’s not obvious they have throughput sussed. LOTUS, our JASMIN batch cluster, is currently reading up to 2 PB of data a day into analysis workflows, and we’re just one climate site. More importantly, that’s petabytes of data a day … we don’t yet have suitable software for our analysis environments for anything else … at scale!
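
To put the 2 PB a day figure in perspective, here is a rough back-of-envelope sketch. It assumes decimal units (1 PB = 10^15 bytes) and reads spread evenly across the day, neither of which is stated above, and the function names are purely illustrative.

```python
# Back-of-envelope conversion of a daily read volume into sustained bandwidth.
# Assumptions: decimal units and reads spread evenly over 24 hours.
PB = 10**15
EB = 10**18
SECONDS_PER_DAY = 24 * 60 * 60


def sustained_rate_gb_per_s(bytes_per_day):
    """Average bandwidth in GB/s needed to read bytes_per_day in one day."""
    return bytes_per_day / SECONDS_PER_DAY / 10**9


def days_to_read(total_bytes, bytes_per_day):
    """Days needed to read total_bytes once at a fixed daily volume."""
    return total_bytes / bytes_per_day


daily = 2 * PB
rate = sustained_rate_gb_per_s(daily)
days = days_to_read(1 * EB, daily)

print(f"{rate:.1f} GB/s sustained")     # 23.1 GB/s sustained
print(f"{days:.0f} days to read 1 EB")  # 500 days to read 1 EB
```

So at today's peak daily rate, a single pass over one exabyte would take well over a year, which is one way of seeing why throughput, rather than stored volume, is the issue.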

Some of these issues with data at exascale are covered in Acín et al. (2015). The key challenges they identify include:

Time to build systems: “Although computing and data processing technology is known for its fast growth, assembling these technologies into large-scale, reliable and performant systems often takes about a decade.”
Metadata scale issues: “The overall increase in scale. The increase in number of objects will affect data and metadata equally, which will need Exascale support. The increase in data volume indicates that when raw data approaches the Zettabyte scale, the metadata handling system will approach the Petabyte scale”
Metadata attribution issues: “File systems offer virtually no practical support for associating user-specified metadata to data objects, essentially limited to specifying file names. Maintaining (file based) schemes is labour intensive, and would become even more so if support was included for data provenance and preservation.”
Variability in the workflow: “The challenge is often compounded by the fact that the exact manner in which the data will need to be analysed is not known a priori, hence techniques often used in commercial computing, such as pre-calculation, precise data placement or massive data caching are difficult or impossible to implement”.
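
The metadata scale quote above implies a simple ratio, which can be sketched as follows. The zettabyte and petabyte figures come from the quote; the average object sizes are hypothetical values chosen only to show how the per-object metadata budget shifts with object size.

```python
# Ratio implied by "raw data at the Zettabyte scale, metadata at the Petabyte scale".
# Decimal units are assumed; the object sizes below are hypothetical.
ZB = 10**21
PB = 10**15

raw_bytes = 1 * ZB
metadata_bytes = 1 * PB
ratio = metadata_bytes / raw_bytes  # about one byte of metadata per megabyte of data


def metadata_per_object(avg_object_bytes):
    """Return (number of objects, metadata bytes available per object)."""
    n_objects = raw_bytes // avg_object_bytes
    return n_objects, metadata_bytes / n_objects


print(f"metadata/data ratio: {ratio:.0e}")
for avg in (10**6, 10**9, 10**12):  # hypothetical 1 MB, 1 GB and 1 TB objects
    n, per_obj = metadata_per_object(avg)
    print(f"avg object {avg:.0e} B: {n:.0e} objects, {per_obj:.0f} B metadata each")
```

Under these assumptions, gigabyte-sized objects would leave roughly a kilobyte of metadata per object, while megabyte-sized objects would leave only about a byte each, so the number of objects matters as much as the raw volume.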

They also note the following, in the context of networks, but I think it also applies to data:
Getting things from research to production: “a widening gap between the possibilities … and the capabilities that are actually deployed in production applications.” Much effort globally is being deployed on distributed data solutions but:
User Management is Problematic: “Current solutions for managing users and their data access roles are, however, very far from being satisfactory. Present systems involve a lot of manual intervention by many parties, and therefore are rather error prone and cumbersome. Furthermore, they are poorly (or not at all) integrated with low level data processing services and between different sites”.

They don’t go into it in any detail, but in terms of the data fabric I would also add:
POSIX doesn’t scale, at least with thousands of processes trying to open the same file. However, object stores are not the panacea either, at least not for latency constrained writing. We’re going to see tiered storage for sure, potentially with all of burst buffers, object stores, and tape. The good news is that at least we as a community are rather used to some of the ideas needed to manage our data on tape efficiently (e.g. Smart et al, 2017).
Confusion over need: Repositories are not active storage systems - many of the putative solutions in this space, from academia at least, are addressing where to put cold data, not where to put high-volume data that needs to be part of ongoing active workflows for up to years at a time. Both matter.
Costs Matter: There is not a good understanding in the user community as to the true cost of storage, when they should re-compute, and where they should be putting their data (in terms of hot/warm/cold storage tiers). This is in part because there is confusion among providers over these issues too (see the previous point).
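
The store-or-recompute question in the last point can be framed as a simple break-even calculation. The sketch below is illustrative only: the tier prices, dataset size and recompute cost are made-up placeholders in arbitrary currency units, not figures from this post or from any provider, and it ignores retrieval time, data movement and repeated reuse.

```python
# Illustrative only: every price and size below is a made-up placeholder.
COST_PER_TB_YEAR = {"hot": 100.0, "warm": 30.0, "cold": 5.0}  # hypothetical units


def storage_cost(size_tb, years, tier):
    """Cost of keeping size_tb terabytes on one tier for a number of years."""
    return size_tb * years * COST_PER_TB_YEAR[tier]


def break_even_years(size_tb, tier, recompute_cost):
    """Years of storage whose cost equals the cost of recomputing the data once."""
    return recompute_cost / (size_tb * COST_PER_TB_YEAR[tier])


size_tb = 500.0            # hypothetical dataset size
recompute_cost = 50_000.0  # hypothetical cost of regenerating the dataset

for tier in COST_PER_TB_YEAR:
    years = break_even_years(size_tb, tier, recompute_cost)
    print(f"{tier}: storing is cheaper than one recompute for up to {years:.1f} years")
```

With these placeholder numbers the answer swings from one year on the hot tier to twenty on the cold tier, which is the point: without knowing the real per-tier costs, users cannot make this decision sensibly.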

Bibliography for this post

Smart, S. D., Quintino, T., & Raoult, B. (2017). A Scalable Object Store for Meteorological and Climate Data. 1–8. https://doi.org/10.1145/3093172.3093238.
Acín, V., Bird, I., Boccali, T., Cancio, G., Collier, I. P., Corney, D., Delaunay, B., Delfino, M., dell’Agnello, L., Flix, J., Fuhrmann, P., Gasthuber, M., Gülzow, V., Heiss, A., Lamanna, G., Macchi, P.-E., Maggi, M., Matthews, B., Neissner, C., … Shiers, J. (2015). Architectures and Methodologies for Future Deployment of Multi-Site Zettabyte-Exascale Data Handling Platforms. Journal of Physics: Conference Series, 664(4), 042009. https://doi.org/10.1088/1742-6596/664/4/042009.
ASCAC Subcommittee. (2014). Top Ten Exascale Research Challenges.
Lawrence, B. N., Rezny, M., Budich, R., Bauer, P., Behrens, J., Carter, M., Deconinck, W., Ford, R., Maynard, C., Mullerworth, S., Osuna, C., Porter, A., Serradell, K., Valcke, S., Wedi, N., & Wilson, S. (2018). Crossing the Chasm: How to Develop Weather and Climate Models for next Generation Computers. Geoscientific Model Development. https://doi.org/10.5194/gmd-11-1799-2018.