From ESiWACE to ExCALIData

I gave a talk to the ExCALIBUR benchmarking community about the work we plan to do in the ExCALIBUR programme. This new activity is called ExCALIData, and really comprises two projects: ExCALIStore and ExCALIWork. It is “cross-cutting” work, funded as part of the Met Office strand of ExCALIBUR to deliver research in support of all programme elements (see the presentation in my next post for details). The entire programme is a national one, with two funding strands: one for public sector research establishments and the NERC environmental science research community, and one delivered by EPSRC for the rest of the UKRI community.

ExCALIBUR itself is a multi-million pound, multi-year programme which aims to deliver software suitable for the next generation of supercomputing at scale - the “exascale” generation. The two ExCALIData projects were funded in response to a call which essentially asked for two projects addressing:

  • support for users making optimal (performant) use of any given system’s storage configuration (portable) for their particular application, without their having to know the details of the system configuration in order to configure that application (productive); and
  • delivery of a completely new paradigm for where certain computations are performed, so as to reduce the amount of data movement needed, particularly in the context of ensembles (but implicitly, making sure that this can be done productively and results in portable code which is optimally performant in any given environment).

The call asked for particular application to any one of the existing ExCALIBUR use cases, but also for a plan for how the work might be applicable to others. We of course addressed climate modelling as our first use case, but in doing so aimed to address a couple of other use cases and build some generic tools.

We were in a good place to bid for this work, building on the back of our ESiWACE activities on I/O and workflow, so the talk I gave to the benchmarking community was in two halves: the first half addressed the problem statement and some of the things we have been doing in ESiWACE, and the second described the programme of work we plan in ExCALIBUR.

From ESiWACE to ExCALIBUR

Presentation: pdf (1.6 MB)

This was a short motivation for our particular problem in terms of massive grids of data from models and how increasing resolution leads (if nothing is done) to vastly more data to store and analyse. The trends in data volumes match the trends in computing and will lead to exabytes of simulation data before we hit exaflops. Our exabytes are a bit more difficult to deal with (in some ways) than Facebook’s, but like all the big data specialists, as a community we are building our own customised computing environments - in this case we are specialising the analysis environment sooner than we have specialised the simulation platform (but that too is coming).

In WP4 of ESiWACE(2) we have been working to “mitigate the effects of the data deluge from high-resolution simulations”, by working on tools for carrying out ensemble statistics “in-flight” and on tools to hide storage complexity and deliver portable workflows. There were several components to that work, the largest of which has been the development of the Earth System Data Middleware (ESDM), an activity led by Prof Julian Kunkel. The others were mainly led by me, and are the main thrust of the ExCALIData work, although we will be doing some comparisons of other technologies with the ESDM.
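To make “in-flight” concrete: the idea is that ensemble statistics are accumulated as the member data streams past, so the full ensemble never has to be gathered in memory or written out first. The sketch below is purely illustrative (it is not the ESiWACE tooling itself, and the field shapes and member generator are invented); it folds each member field into a running mean and variance as it arrives.

```python
# Minimal sketch of "in-flight" ensemble statistics: fold each member field
# into running statistics as it streams past, rather than holding (or first
# writing out) the whole ensemble. Illustrative only - not the ESiWACE tools.
import numpy as np

def streaming_ensemble_stats(members):
    """One-pass (Welford-style) mean and variance over equally shaped member fields."""
    count, mean, m2 = 0, None, None
    for field in members:
        count += 1
        if mean is None:
            mean = np.zeros_like(field, dtype=float)
            m2 = np.zeros_like(field, dtype=float)
        delta = field - mean
        mean += delta / count
        m2 += delta * (field - mean)   # uses the already-updated mean
    variance = m2 / (count - 1) if count > 1 else np.zeros_like(mean)
    return mean, variance

# Ten synthetic "members" arriving one at a time (shapes invented for the example).
rng = np.random.default_rng(0)
members = (rng.normal(size=(180, 360)) for _ in range(10))
ens_mean, ens_var = streaming_ensemble_stats(members)
```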

An Introduction to ExCALIData

Presentation: pdf (1.6 MB)

ExCALIData aims to address the end-to-end workflow: (large) data in, simulation, (larger) data out, analysis, store. We designed two complementary projects to address the two goals in the call, with six main activities (three in each project) and one joint work package on knowledge exchange. (They were designed in such a way that either could have been funded on its own, but we get synergy from having both.)

The six work packages address:

  1. storage interfaces, and our idea of semantically interesting “atomic_datasets” as an interface to multiple constituent data elements (“quarks”) distributed across storage media (see the sketch after this list);
  2. how best we can use fabric and solid state storage - there is a plethora of technology options, but which of them can we deploy for real, and how?
  3. a comparison of I/O middleware (including the domain-specific ESDM and the generic ADIOS2);
  4. active storage in software;
  5. active storage in hardware; and
  6. extending the I/O server functionality that we developed in ESiWACE for “atmosphere only” workflows to coupled model workflows.
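To give a flavour of the first of those, the sketch below shows one way an “atomic dataset” abstraction might look from the user’s side: a single logical dataset whose constituent quarks live on different storage backends but are assembled on demand. All of the class and attribute names here are invented for illustration; they are not the interface we are building.

```python
# A minimal sketch of the "atomic dataset" idea: one semantically meaningful
# dataset presented as a single object, assembled from constituent fragments
# ("quarks") that may live on different storage media. Names are invented.
from dataclasses import dataclass
import numpy as np

@dataclass
class Quark:
    """One constituent data element and where it currently lives."""
    location: str   # e.g. "posix:/scratch/...", "s3://bucket/key", "tape:..."
    index: tuple    # slice of the logical array this quark holds

    def read(self) -> np.ndarray:
        # A real system would dispatch on the storage backend named in
        # `location`; here we just fabricate data of the right shape.
        shape = tuple(s.stop - s.start for s in self.index)
        return np.zeros(shape)

@dataclass
class AtomicDataset:
    """A single logical array backed by quarks spread across media."""
    shape: tuple
    quarks: list

    def collect(self) -> np.ndarray:
        out = np.empty(self.shape)
        for q in self.quarks:
            out[q.index] = q.read()
        return out

# Two quarks, one nominally on disk and one nominally on object store,
# presented to the user as one 4 x 4 dataset.
ds = AtomicDataset(
    shape=(4, 4),
    quarks=[Quark("posix:/scratch/part0", (slice(0, 2), slice(0, 4))),
            Quark("s3://bucket/part1", (slice(2, 4), slice(0, 4)))],
)
field = ds.collect()
```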

For the active storage work, the key idea is to deploy something actually usable by scientists, and to build two demonstrator storage systems which deliver the active storage functionality.
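As a toy illustration of what that functionality buys (the class here is invented, and is not the interface of either demonstrator): instead of pulling a whole field back to the analysis machine and reducing it there, the client asks the storage layer to apply the reduction next to the data, so only the small result crosses the network.

```python
# Illustrative sketch of the active storage idea: ask the storage layer to
# apply a reduction where the data lives, and ship back only the result.
# ToyStorage is invented for this example.
import numpy as np

class ToyStorage:
    def __init__(self):
        self._objects = {}

    def put(self, name, array):
        self._objects[name] = array

    def get(self, name):
        """Traditional path: move the whole array to the client."""
        return self._objects[name]

    def reduce(self, name, op):
        """'Active' path: compute the reduction next to the data."""
        reducers = {"mean": np.mean, "max": np.max, "sum": np.sum}
        return reducers[op](self._objects[name])

store = ToyStorage()
store.put("tas_field", np.random.default_rng(1).normal(size=(1000, 1000)))

# Traditional: ~8 MB would cross the network, then we reduce locally.
local_mean = store.get("tas_field").mean()

# Active: only a single float crosses the network.
remote_mean = store.reduce("tas_field", "mean")
assert np.isclose(local_mean, remote_mean)
```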

Not surprisingly, this is a complex project, and we needed a lot of partners, in this case: DDN and StackHPC to help us with the storage servers, and the University of Cambridge to help us with some of the more technical hardware activities (especially the work on fabric and solid state storage, and the benchmarking of comparative solutions).