Met. Dept Seminar
Includes the following talk:
- Handling Data and Provenance in an AI world
NCAS CMS spend a lot of effort on tools and information systems to handle both observational and simulation data and their complexity. In this seminar we cover a few of those tools and explain how we’re positioning them to support the FAIR production and use of observations and of simple (e.g. HEALPix) and complex (e.g. LFRIC mesh) model data, in the context of rapidly exploding volumes and needs.
Presentation: pdf (5.7 MB)
The title was a bit of click-bait, as the talk was really about data handling rather than AI per se, though the production of data from AI models was definitely a key piece of motivation.
The talk was laid out in six main sections:
- An introduction and motivation: how the volume and number of files are driving us to build better and more useful tools. In this section I also covered a way of thinking about interacting with data via metadata, and introduced the stack of tools we work on.
- From metadata to tools. An introduction to the CF conventions, and how software can exploit them. The notion of a spectrum of tools, and of choosing the right tool for the problem. An example (HEALPix) of how hard it is to develop standards.
- Documentation. Some discussion of things we need to think about when documenting model workflows, and of the Essential Model Documentation that will be required for CMIP7.
- Why you should care about chunking. Chunking has been a big mystery that hasn’t really mattered until now, but datasets are now large and chunking has a real impact on performance, especially for remote access. (In passing, we cover why HDF5 can do almost everything Zarr can do, just as efficiently, but most people are simply not aware of how to make that happen.) An introduction to pyfive, and to our new prototype cfs3 tools (for preparing data for, and uploading data to, object stores, and for investigating the data that is in an object store).
- Applications. A quick tour of some of the higher-level activities we are involved in: pyactivestorage, esmvaltool, and TWINEVISION.
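To make the chunking point a little more concrete, here is a minimal sketch (not taken from the talk) that counts how many chunks a read has to touch for two access patterns and two chunk layouts. The dataset shape, the chunk shapes, and the helper name `chunks_touched` are all hypothetical, chosen only for illustration. The working assumption is that, for remote access, each chunk touched typically costs at least one request, so the count is a rough proxy for cost rather than a measurement.

```python
from math import prod


def chunks_touched(selection, chunk_shape):
    """Count the chunks intersected by a rectangular (hyperslab) selection.

    selection   -- one (start, stop) pair per dimension, stop exclusive
    chunk_shape -- chunk length along each dimension
    """
    per_dim = []
    for (start, stop), size in zip(selection, chunk_shape):
        first = start // size        # index of the first chunk touched
        last = (stop - 1) // size    # index of the last chunk touched
        per_dim.append(last - first + 1)
    return prod(per_dim)


# Hypothetical hourly dataset for one year: (time, lat, lon)
shape = (8760, 720, 1440)

# Two common access patterns
point_series = [(0, 8760), (100, 101), (200, 201)]  # full time series at one grid point
one_map = [(0, 1), (0, 720), (0, 1440)]             # full map at one time step

# Two hypothetical chunk layouts
map_chunks = (1, 720, 1440)       # one chunk per time step
balanced_chunks = (120, 90, 180)  # a compromise between the two patterns

series_with_map_chunks = chunks_touched(point_series, map_chunks)
series_with_balanced = chunks_touched(point_series, balanced_chunks)
map_with_map_chunks = chunks_touched(one_map, map_chunks)
map_with_balanced = chunks_touched(one_map, balanced_chunks)

print("time series:", series_with_map_chunks, "vs", series_with_balanced)
print("single map: ", map_with_map_chunks, "vs", map_with_balanced)
```

With these made-up numbers, the one-chunk-per-time-step layout is ideal for reading a map but forces a time series to touch every chunk in the file, while the compromise layout spreads the cost across both patterns. The same arithmetic applies whichever format holds the chunks.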
As always, I was representing work mostly done by the coauthors on this presentation, so any errors are mine!