Since 2006 I’ve been recording my significant talks here (see also the summary list).

ESIWACE2 Mid-Term Review

GotoMeeting, October, 2020

Data Systems at Scale

Presentation: pdf (0.5MB)

This was a presentation given as part of the esiwace mid-term review. It describes some of the progress made in one of the work packages, WP4, which is about data systems at scale. Here the definition of “at scale” means “for weather and climate simulation on exascale (and pre-exascale) supercomputing” and for downstream analysis systems.

I have an older post linking to a talk describing the work package objectives.

In this presentation I highlighted

  • our work on ensemble handling. Amongst other activities we have demonstrated “in-flight ensemble data processing” with a large high resolution ensemble (10 member global 25 km resolution using 50K cores).
  • progress with the Earth System Data Middleware (ESDM), which includes NetCDF interface and some new backends, and
  • work by our industry colleagues, Seagate and DDN both on ESDM backends and on new active storage systems which will be able to do simple manipulations “in storage”, and
  • our work on S3NetCDF, a drop in extension for netcdf-python suitable for use with object stores.

Earthcube - What about Data?

Boulder via Hangout, May 2020

When (and how) should a simulation be Fair?

Presentation: pdf (4 MB).

After a reminder about FAIR (Findable, Accessible, Interoperable, Reusable - not Reproducible), I discuss some of the issues about applying these concepts to Simulation Data. Apart from the volume issues, we have to decide just what “Simulation Data” actually means, does it mean the entire workflow, or just the outputs. There are a huge variety of types of simulation, are they all important? Using the ES-DOC vocabularies, I point out that in practice simulations are not meaningfully reproducible - except in trivial cases, and actually it is more important to reproduce experiments - which is of course the heart of model intercomparison. At CEDA we have two decades of experience curating data, and we now use the JASMIN platform to provide a data commons to make data accessible. Lots of simulation data is analysed on JASMIN, alongside the CEDA archive, but for fifteen years we have made decisions about whether to archive simulation data based on (what was then) the BADC Numerical Model Data Policy. The key insight in developing that policy was that it’s relatively easy to decide what’s important, and what’s not important, but there is a lot of middle ground for which value judgements are important. We developed polices to decide on the important and not important data, and to guide the value judgements, and these are summarised here. Underlying all decisions are questions about affordabilty, and whether or not adequate metadata can be (or will be) produced. Without such metadata curation of simulation data becomes pointless. The bottom line is that not all simulation data should be FAIR, but that which needs to be FAIR needs to be well documented. Money matters as simulation data at scale is incredibly expensive.

The UM User's Workshop 2019 - Next Generation Modellin Systems

Exeter, June, 2019

Challenges facing the modelling community

Presentation: pdf (8 MB)

This was a talk given to set the scene for the Next Generation Modelling System sessions of the 2019 Unified Model (Um) Users workshop - a meeting of the international UM partnership bringing together indivudals from the US, Korea, Australia, NZ, and the UK.

It was a shortened version of my seminar on the end of climate modelling, where I was aiming to explain to some of the attendees why the radical changes in model structure coming with our next generation LFRic model are necessary given some of the technologies in play. I was able to omit some of the key parts of the seminar because there were other speakers addressing those subjects (e.g. Rupert Ford was up later talking about Domain Specific Languages).


Hamburg, March, 2019

Data Systems at Scale

Presentation: pdf (0.5MB)

This was a presentation given as part of the esiwace2 kickoff. It describes one of the work packages, WP4, which is about data systems at scale. Here the definition of “at scale” means “for weather and climate simulation on exascale (and pre-exascale) supercomputing” and for downstream analysis systems. We expect I/O performance and data volumes will limit their exploitation without new storage and analysis libraries.

This work package describes an ambitious programme of work, building on lessons learnt from ESiWACE. It will address a range of issues in workflow, from modifications to a workflow engine (cylc) and a scheduler (slurm), to new I/O libraries and data handling tools. Some integration with pangeo is likely. There is a key role for ensemble analysis tools to mitigate aqainst writing data in the first place.

A key goal will be to deploy these new tool in real workflows, within the project, and elsewhere, and to develop them in a way that has a credible path to sustainability. In practice deployability is likely to involve three key facets:

  • portability, to ensure a wider range of stakeholders,
  • containerisation, to deliver as much as possible within userspace, and
  • vendor engagement alongside open-source products, allowing differentiation in the market and realistic business models for those developing customised high-performance storage solutions.

The work package is being led from the University of Reading (Computer Science, and NCAS), with partners from CNRS-IPSL, STFC, Seagate, DDN, DKRZ, ICHEC, and the Met Office.

(See mid-term update for an update on progress.)

NCAS Seminar

Reading, March, 2019

The end of Climate Modelling as we know it

Presentation: pdf (15 MB)

This was a seminar given in Meteorology at the University of Reading as part of the NCAS seminar series.

In this presentation I was aiming to introduce a meteorology and climate science community to the upcoming difficulties with computing: from the massive explosion in hardware variety that we associate with the end of Moore’s Law, to the fact that in about a decade we may only be able to go faster by spending more money on power (provided we have the parallelisation). Similar problems with storage arising from both costs and bandwidth are coming …

I set this in the scene of our existing experience of decades of unbridled increases in computing at more or less the same power consumption until the last decade when we’ve still had more compute, but at the same power cost per flop (for CPUs at least).

Of course it wont mean the end of climate science, and it will probably take longer to arrive than currently being predicted, but it does mean a change in the way we organise ourselves, in terms of sharing data and experiment design, and it certainly means “going faster” will require being smarter, with maths, AI etc.

This seminar differs from my last one in that I didn’t talk about JASMIN at all, and I spent more time on the techniques we will need to cross the chasm between science code and the changing hardware.

I finished with some guidance for our community: we will need to become better at using the right tool(s) for the job, rather than treating everything as a nail because we have a really cool hammer:

  • More use of hierarchy of models;
  • Precision use of resolution;
  • Selective use of complexity;
  • Less use of “ensembles of opportunity”, much more use of “designed ensembles”;
  • Duration and Ensemble Size : thinking harder a priori about what is needed;
  • Much more use of emulation.

Computer Science Seminar

Reading, February, 2019

Supercomputers are no longer all the same and it will get worse; a climate modelling perspective

Presentation: pdf (30 MB)

This was a seminar given in Computer Science at the University of Reading. I was aiming to kick the ball through a few balls at once, being introductions to

  • Climate modelling for computer scientists, to get across the size of the computational and data handling problems, leading to
  • The changing way we deal with simulation data - from “download” to “remote access”, and the need for
  • New data analysis supercomputers, and in particular, JASMIN.

From there I switched gear to

  • A discussion of Moore’s Law, and it’s demise, and
  • How we can still make progress in (climate) science with new maths (including AI and ML), new algorithms, new programming techniques, and new ways of working with data.

The bottom line I wanted to get across to this audience was that there is lots of scope for computer science, and lots of interesting things to do.

NERC HPC Strategy

London, November, 2018

Presentation: pdf (7 MB)

This was the kickoff talk to a NERC “town hall meeting” workshop envisaged as part of the establishment of a formal NERC HPC strategy. After my talk, we had a three component workshop: some international talks, some breakout discussions, and some local talks. The main aim of the meeting was

To begin a process of establishing and maintaining a NERC strategy for high performance computing which reflects the scientific goals of the UK environmental science community, government objectives, and the need for international competitiveness.

The breakout groups covered four themes:

  1. How does our science face a future where computers do not run faster? Are there known algorithmic routes to improve time to solution for fixed FLOPs (but maybe enhanced parallelism) that we are not yet exploring? What else can we do?
  2. Which HPC related international collaborations do we depend upon, and how can we sustain these, intellectually and financially?
  3. What are the implications of the new UKRI funding routes (e.g. Industrial Strategy Challenge Fund) for environmental HPC resource requirements (software, hardware, people)?
  4. How do we gather information about the impact of our HPC related research? What are the routes to demonstrating economic impact?

… and my talk was attempting to provide an introduction and context for these objectives and breakouts. The key point of course is that free lunch of easy (and relatively cheap) computing is over. Advancing along our traditional axes of resolution, complexity, ensembles etc, will be complicated by harder to use and (relatively) more expensive computing and storage.

I finished with six key points:

  • The NERC HPC recurrent budget is currently fixed, but there is usage growth via both growth within the existing communities, and the addition of new communities.
  • Future computing will not be any faster than current computing, we will only get to do-more/go-faster by expending energy (electricy) and by being smarter about using parallelism, by algorithms, or software, or both.
  • HPC platforms are changing, with more heterogeneity and customisation, and that’s happening just as we have to take on machine learning as a new tool.
  • NERC needs plans and evidence to justify HPC infrastructure, now and into the future, especially if we want to grow it!
  • We have ambitious international scientific competitors — both a threat (if we can’t compete computationally) and an opportunity (for collaboration)
  • All of which is why NERC needs a strategy …

(Incidentally, the preference for lots of post-it notes in this presentation is a homage to town-hall meetings past and their flip-charts and “stick your comment on these posters” sesssions! EPSRC, I’m looking at you :-) )

Extreme Data Workshop

Jülich, September, 2018

Beating Data Bottlenecks in Weather and Climate Science

Presentation: pdf (6 MB)

The data volumes produced by simulation and observation are large, and becoming larger. In the case of simulation, plans for future modelling programmes require complicated orchestration of data, and anticipate large user communities. “Download and work at home” is no longer practical for many applications. In the case of simulation these issues are exacerbated by users who want simulation data at grid point resolution instead of at the resolution resolved by the mathematics, and who design numerical experiments without knowledge of the storage costs.

There is no simple solution to these problems: user education, smarter compression, and better use of tiered storage and smarter workflows are all necessary - but far from sufficient. In this presentation we introduce two approaches to addressing (some) of these data bottlenecks: dedicated data analysis platforms, and smarter storage software. We provide a brief introduction to the JASMIN data storage and analysis facility, and some of the storage tools and approaches being developed by the ESIWACE project. In doing so, we describe some of our observations of real world data handling problems at scale, from the generic performance of file systems to the difficulty of optimising both volume stored and performance of workflows. We use these examples to motivate the two-pronged approach of smarter hardware and smarter software - but recognise that data bottlenecks may yet limit the aspirations of our science.

Meeting Attendees
Meeting Attendees

(At at a workshop organised by Martin Schultz at Jülich)

Open Meeting for Hydro-JULES - Next generation land-surface and hydrological predictions

Wallingford, September, 2018

Building a Community Hydrological Model

Presentation: pdf (9 MB)

In this talk we introduce how NCAS will be supporting the Hydro-Jules Project by designing and implmenting the modelling framework and building and maintaing and archive of driving data, model configurations, and supporting datasets. We will also be providing training and support for the community on the JASMIN supercomputer.

We also discuss some of the issue that HydroJules will need to prepare for, including the impending change to the UK Unified Modelling modelling framework and exascale computing.

JASMIN User Conference

Milton, June, 2018

The Changing Nature of JASMIN

Presentation: pdf (9 MB)

This talk was part of a set of four to give attendees at the JASMIN user conference some understanding of the recent and planned changes to the physical JASMIN environment.

The introduction covers a logical and schematic view of the JASMIN system and why it exists, before three sections covering the compute usage, data movement, and storage growth over recent years. JASMIN shows nearly linear growth in total users, active users with data acess and active users of both the LOTUS batch cluster and the interactive generic science machines. Despite the changing size of the batch cluster (it has grown in size) we have managed to keep utilisation in the target 60-80% range (we have deliberately targeted a lower utilisation rate so as to allow the use of batch to be more immediate, given that keeping the data online is the more expensive part of this system). Usage of the managed cloud systems has been substantial, and the cloud itself has grown, targetting more customised solutions for a wide array of tenants. External cloud usage has been relatively low, which reflects the lack of elasticity and its usage for primarily pets rather than cattle. Where JASMIN is really unique however is in the amount of data movement invovled in the day to day business, with PB/day being sustained in the batch cluster for significant periods. Archive growth has been capped, but shows some interesting trends, as does the volume of Sentinel data held - and overall growth has been linear despite a range of external constraints and self-limiting behaviours. Elastic tape usage started small, but has become more signficant as disk space constraints have become more of an issue - this despite a relatively poor set of user facing features.

These factors (and others) led to the 2017/18 phase 4 capital upgrade which is being deployed now, with a range of new storage types. Looking forward, it is clear that the “everything on disk” is probably not the right strategy and we have to look to smarter use of tape.

Data-Intensive weather and climate science

Exeter, June, 2018

Climate Data: Issues, Systems, and Opportunities

Presentation: pdf (25 MB)

The aim of this talk was to introduce students at the Mathematics for Planet Earth summer school in Exeter to some of the issues in data science. I knew I would be following Wilco Hazeleger who was talking on Climate Science and Big Data, so I didn’t have to hit all the big data messages.

My aim was to cover some of the usual isues around heterogeneity and tools, but really cover some of the upcoming volume issues, using as many real numbers as I could. One of the key points I was making was that as we go forward in time, we are moving from a science dominated by the difficulty of simulation to one that will be dominated by the difficulty of data handling - and that problem is really here now, although clearly the transition to exascale will involve problems with both data and simulation. I also wanted to get across some o fthe difficulties associated with next generation model intercomparison - interdependency and data handling - and how those apply to both the modellers themselves as well as the putative users of the simulation data.

The 1km model future is interesting in terms of data handling. I made a slightly preposterous extrapolation (an exercise for the reader is to work out what is preposterous) … but only to show the potential scale of the problem, and the many opportunities for doing something about it.

The latter part of the talk covered some of the same ground as my data science talk from March, to give the students a taste of some of the things that can done with modern techniques and (often) old data.

The last slide was aimed at reminding folks that climate science has always been a data science, and actually, always a big data science! Climate data is always large compared to what others are actually handling … and that we have always managed to address the upcoming data challenges. I hope now will be no different.

EuroHPC Requirements Workshop

Brussels, June, 2018

EuroHPC: Requirements from Weather and Climate

Presentation: pdf (9 MB)

Abstract: pdf

This talk covered requirements for the upcoming pre-exascale and exascale computers to be procured by the EuroHPC project. The bottom line is that Weather and Climate have strong constraints on HPC environnments, and we believe that procurement benchmarks should measure science throughput in science units (in our case Simulated Years Per (real) Day, SYPD). We also recommend that the EuroHPC project takes cognisance of the fact that HPC simulations do not directly generate knowledge, the knowledge comes from analysis of the data products!

Data Sciences for Climate and Environment (Alan Turing Institute)

London, March, 2018

Opportunities and Challenges for Data Science in (Big) Environmental Science

Presentation: pdf (18 MB) (See also video).

I was asked to give a talk on data science in climate science. After working out what “data science” might mean for this audience, I took a rather larger view of what was needed and talked about data issues in environmental science, before quickly talking about hardware and software platform issues. Most of the talk covered a few applications of modern data science: data assimilation, classification, homogenising data, and using machine learning to infer new products. I finished by reminding everyone that in collaborations between climate science and statisticians and computer scientists, we need to be careful about our use of the word “model” (with a bit of help from xkcd). I finished with reminding everyone that climate science has always been a data science.

The full set of videos from all the speakers is available.

Teaser Image

ESiWACE General Assembly

Berlin, December, 2017

Exploiting Weather & Climate Data at Scale (WP4)

Presentation: pdf (4 MB)

This was a talk I would have given in partnership with Julian Kunkel, but as I was still at home thanks to a wee bit of cold air causing chaos at LHR, Julian had to give all of it. The version linked here is the version I would have given, the actual version Julian gave will (eventually) be available on the ESIWACE website.

Teaser Image

The talk covers the “exploitability” component of the ESIWACE project. The work we describe is development of cost models for exascale HPC storage, plus new software to write and manage data at scale.

European Big Data Value Forum

Versailles, November, 2017

The Data Deluge in High Resolution Climate and Weather Simulation

Presentation: pdf (5 MB).

This talk was given by Sylvie Joussaume, but we had worked on it together, so I think it’s fair enough to include here. We wanted to show the scale of the data problems we have in climate science, and some of the direction in which we are moving with respect to “big data” technologies and algorithms.

Science and the Digital Revolution - Data, Standards, and Integration

Royal Society, London, November, 2017

Data Interoperability and Integration: A climate modelling perspective

Presentation: pdf (11.5 MB).

I was asked to give a talk at a CODATA meeting which was aimed at developing a roadmap for:

  1. Mobilising community support and advice for discipline-based initiatives to develop online data capacities and services;
  2. Priorities for work on interdisciplinary data integration and flagship projects;
  3. Approaches to funding and coordination; and
  4. Issues of international data governance.

For this talk I was asked to address an example from the WMO research community on what we have accomplished in standardising a range of things, and reflecting on what has worked/failed and why. I wasn’t given much time to prepare, so this is what they got!

Gung Ho Network Meeting

Exeter University, July 2017

Performance, Portability, Productivity: Which two do you want?

Presentation: pdf (4 MB).

I talked about two papers that I’ve recently been involved with: “CPMIP: measurement of real computational performance of Earth System Models in CMIP6” (which appeared in early 2017) and “Crossing the Chasm: How to develop weather and climate models for next generation computers?” which at the time was just about to be submitted to GMD.

International Supercomputing (ISC) and JASMIN User Conferences

Frankfurt and Didcot, June 2017

I gave two versions of this talk, one at at the International Supercomputing Conference’s Workshop on HPC I/O in the data centre, and one at the 2017 JASMIN User’s Conference.

The talk covered the structure and usage of JASMIN, showing there is a lot of data movement both in the batch system and the interactive environment. One key observation was that we cannot afford to carry on with parallel disk, and we don’t think tape alone is a solution, so we are investigating object stores, and object store interfaces.

The UK JASMIN Environmental Commons

Presentation: pdf (12 MB - the ISC version).

The UK JASMIN Environmental Commons: Now and into the Future

Presentation: pdf (12 MB - the JASMIN user conference version).

NERC Town Hall Meeting on Data Centre Futures

London, October, 2016

Data Centre Technology to Support Environmental Science

Presentation: pdf (18 MB).

This was a talk given at a NERC Town Hall meeting on the future of data centres in London, on the 13th of October 2016. My brief was to talk about underlying infrastructure, which I did here by discussing the relationship between scientific data workflows and the sort of things we do with JASMIN.

Meteorology meets Computer Science Symposium

University of Reading, September 2016

Computer Science Issues in Environmental Infrastructure

Presentation: pdf (14 MB).

This was a talk at a University of Reading symposium held with Tony Hey, Geoffrey Fox and Jeff Dozier as guest speakers as part of the introduction o the new Computer Science Department in the Reading University School of Mathematical, Physical and Computer Sciences SMPCS.

The main aim of my talk was to get across the wide range of interesting generic science and engineering challenges we face in delivering the infrastructure needed for environmental science.

JASMIN User Conference

RAL, June, 2016

Science Drivers: Why JASMIN?

Presentation: pdf (19 MB).

Keynote scene setter for the inaugural JASMIN user conference: how the rise of simulation leads to a data deluge and the necessity for JASMIN, and a programme to improve our data analysis techniques and infrastructure.

CEDA Vocbulary Meeting

RAL, March, 2016

A ten minute introduction to ES-DOC technology

Presentation: pdf (2 MB).

A brief introduction to some of the basic tools being use to define ES-DOC CIM2 and the CMIP6 extensions.

IS-ENES2 2nd General Assembly

Hamburg, February, 2016


Presentation: pdf (1 MB).

This was an introductino to how ES-DOC is planning on supporting CMIP6.

International Computing in Atmospheric Science (ICAS)

Annecy, September, 2015

UK academic infrastructure to support (big) environmental science

Presentation: pdf (18 MB).

Abstract: Modern environmental science requires the fusion of ever growing volumes of data from multiple simulation and observational platforms. In the UK we are investing in the infrastructure necessary to provide the generation, management, and analysis of the relevant datasets. This talk discusses the existing and planned hardware and software infrastructure required to support the (primarily) UK academic community in this endeavour, and relates it to key international endeavours at the European and global scale – including earth observation programmes such as the Copernicus Sentinel missions, the European Network for Earth Simulation, and the Earth System Grid Federation.

RCUK Cloud Workshop

Warwick, June, 2015

Why Cloud? Earth Systems Science

Presentation: pdf (6 MB).

Alternative title: Data Driven Science: Bringing Computation to the Data. This talk covered the background trends and described the JASMIN approach.


Vienna, April, 2015

Beating the tyranny of scale with a private cloud configured for Big Data

Presentation: pdf (5 MB).

At the last minute I found that I wasn’t able to attend, but Phil Kershaw gave my talk. The abstract is available here (pdf).

Big Data and Extreme-Scale Computing (BDEC)

Barcelona, January, 2015

There were two back to back meetings organised as part of the 2015 Big Data and Extreme Computing meeting website. In the first, organised as part of the European Exascale Software Initiative (EESI), I gave a full talk, in the second, I provided a four page position paper with a four page exposition.

It starts and Ends with Data: Towards exascale from an earth system science perspective

Presentation: pdf (7 MB).

Six sections: the big picture, background trends, hardware issues, software issues, workflow, and a summary.

Bringing Compute to the Data

Presentation: pdf (3 MB).

This was my main BDEC contribution. There was also a four page summary paper: pdf.

AGU Fall Meeting

San Francisco, December, 2014

I was honoured to be the third recipient of the AGU Leptoukh Lecture awarded for significant contributions to informatics, computational, or data sciences.

Presentation: pdf (30 MB).


The grand challenges of climate science will stress our informatics infrastructure severely in the next decade. Our drive for ever greater simulation resolution/complexity/length/repetition, coupled with new remote and in-situ sensing platforms present us with problems in computation, data handling, and information management, to name but three. These problems are compounded by the background trends: Moore’s Law is no longer doing us any favours: computing is getting harder to exploit as we have to bite the parallelism bullet, and Kryder’s Law (if it ever existed) isn’t going to help us store the data volumes we can see ahead. The variety of data, the rate it arrives, and the complexity of the tools we need and use, all strain our ability to cope. The solutions, as ever, will revolve around more and better software, but “more” and “better” will require some attention.

In this talk we discuss how these issues have played out in the context of CMIP5, and might be expected to play out in CMIP6 and successors. Although the CMIPs will provide the thread, we will digress into modelling per se, regional climate modelling (CORDEX), observations from space (Obs4MIPs and friends), climate services (as they might play out in Europe), and the dependency of progress on how we manage people in our institutions. It will be seen that most of the issues we discuss apply to the wider environmental sciences, if not science in general. They all have implications for the need for both sustained infrastructure and ongoing research into environmental informatics.

Symposium on HPC and Data-Intensive Apps

Trieste, November, 2014

Or to give it it’s full name: Symposium on HPC and Data-Intensive Applications in Earth Sciences: Challenges and Opportunities@ICTP, Trieste, Italy.

I gave two talks at this meeting, the first in the HPC regular session, on behalf of my colleague Pier Luigi Vidale, on UPSCALE, the second a data keynote on day two.

Weather and Climate modelling at the Petascale: achievements and perspectives. The roadmap to PRIMAVERA

Presentation: pdf (37 MB).


Recent results and plans from the Joint Met Office/NERC High Resolution Climate Modelling programme are presented, along with a summary of recent and planned model developments. We show the influence of high resolution on a number of important atmospheric phenomena, highlighting both the roles of multiple groups in the work and the need for further resolution and complexity improvements in multiple models. We introduce plans for a project to do just that. A final point is that this work is highly demanding of both the supercomputing and subsequent analysis environments.

Infrastructure for Environmental Supercomputing: beyond the HPC!


We begin by motivating the problems facing us in environmental simulations across scales: complex community interactions, and complex infrastructure. Looking forward we see the drive to increased resolution and complexity leading not only to compute issues, but even more severe data storage and handling issues. We worry about the software consequences before moving to the only possible solution, more and better collaboration, with shared infrastructure. To make progress requires moving past consideration of software interfaces alone to consider also the “collaboration” interfaces. We spend considerable time describing the JASMIN HPC data collaboration environment in the UK, before reaching the final conclusion: Getting our models to run on (new) supercomputers is hard. Getting them to run perfomantly is hard. Analysing, exploiting and archiving the data is (probably) now even harder!

Presentation: pdf (22 MB )

NERC ICT Current Awareness

Warwick, October, 2014

JASMIN - A NERC Data Analysis Environment

Presentation: pdf (18MB).

The talk basically covered an explanation of what JASMIN actually consists of, and provides, and it’s relationship to the Cloud. It included some discussion of why JASMIN exists in the context of the growing data problem in the community.

NCAS Science Meeting

Bristol, July, 2014

The influence of Moore’s Law and friends on our computing environment!

Presentation: pdf (19 MB).

I gave a talk on how Moore’s Law and friends are influencing atmospheric science, the infrastructure we need, and how we trying to deliver services to the community.

e-Research NZ

Hamilton, June/July, 2014

I gave three talks at this meeting:

Environmental Modelling at both large and small scales: How simulating complexity leads to a range of computing challenges

Presentation: pdf (2 MB).

On Monday, in the HPC workshop, despite using the same title I had for the Auckland seminar, I primarily talked about the importance of software supporting collaboration, using coupling as the exemplar (reprising some of the material I presented in Boulder in early 2013).

The road to exascale for climate science: crossing borders or crossing disciplines, can one do both at the same time?

Presentation: pdf (9 MB)

On Tuesday I gave the keynote address:

Abstract: The grand challenges of climate science have significant infrastructural implications, which lead to requirements for integrated e-infrastructure - integrated at national and international scales, but serving users from a variety of disciplines. We begin by introducing the challenges, then discuss the implications for computing, data, networks, software, and people, beginning from existing activities, and looking out as far as we can see (spoiler alert: not far!)

JASMIN: the Joint Analysis System for big data

pdf (5 MB)

On Wednesday I gave a short talk on JASMIN:

Abstract: JASMIN is designed to deliver a shared data infrastructure for the UK environmental science community. We describe the hybrid batch/cloud environment and some of the compromises we have made to provide a curated archive inside and alongside various levels of managed and unmanaged cloud … touching on the difference between backup and archive at scale. Some examples of JASMIN usage are provided, and the speed up on workflows we have achieved. JASMIN has just recently been upgraded, having originally been designed for atmospheric and earth observation science, but now being required to support a wider community. We discuss what was upgraded, and why.

IS-ENES2 Kickoff Meeting

Paris, France, May, 2013

The Future of ESGF in the context of ENES and IS-ENES2

Presentation: pdf (2MB).

I probably tried to do too much in this talk. There were three subtexts:

  1. We as a community have too much data to handle, and I mentioned the apocryphal estimate that only 2/3 of data written is read … but I confused folks … that figure applies to institutional data, not data in ESGF …
  2. That the migration of data and information between domains (see the talk) requires a lot of effort, and that (nearly) no one recognises or funds that effort (kudos to KNMI :-),
  3. That portals are easy to build, but hard to build right, and maybe we need fewer, or maybe we need more, but either way, they need to both meet requirements in technical functionality, and information (as opposed to data) content.

Coupling Workshop (CW2013)

Boulder, Colorado, February, 2013

Bridging Communities: Technical Concerns for Building Integrated Environmental Models

Presentation: pdf (1MB).

2nd IS-ENES Workshop on High-performance computing for Climate Models

Toulouse, France, January, 2013

Data, the elephant in the room. JASMIN one step along the road to dealing with the elephant.

Presentation: pdf (2MB).

AGU Fall meeting

San Francisco, December, 2012

Issues to address before we can have an open climate modelling ecosystem

Presentation: pdf (2 MB).

Authors: Lawrence, Balaji, DeLuca, Guilyardi, Taylor

Abstract Earth system and climate models are complex assemblages of code which are an optimisation of what is known about the real world, and what we can afford to simulate of that knowledge. Modellers are generally experts in one part of the earth system, or in modelling itself, but very few are experts across the piste. As a consequence, developing and using models (and their output) requires expert teams which in most cases are the holders of the “institutional wisdom” about their model, what it does well,and what it doesn’t. Many of us have an aspiration for an open modelling ecosystem, not only to provide transparency and provenance for results, but also to expedite the modelling itself. However an open modelling ecosystem will depend on opening access to code, to inputs, to outputs, and most of all, on opening the access to that institutional wisdom (in such a way that the holders of such wisdom are protected from providing undue support for third parties). Here we present some of the lessons learned from how the metafor and curator projects (continuing forward as the es-doc consortium) have attempted to encode such wisdom as documentation. We will concentrate on both technical and social issues that we have uncovered, including a discussion of the place of peer review and citation in this ecosystem.

(This is a modified version of the abstract submitted to AGU, to more fairly reflect the content given the necessity to cut material to fit into the 15 minute slot available.)

ESA CCI 3rd Colocation Meeting

Frascati, September, 2012

Exploiting high volume simulations and observations of the climate

Presentation: pdf (7 MB).

An introduction to ENES and ESGF with some scientific motivation.

ICT Competitiveness

Trieste, September, 2012

Weather and Climate Computing Futures in the context of European Competitiveness

Presentation: pdf (4 MB).

In this talk I addressed some elements of how climate science interacts with policy and societal competitiveness in the contentxt of extreme climate events etc, but the main body was on the consequences for modelling and underlying infrastructure.

This table drives much of the conversation:

Key numbers for Climate Earth System Modelling 2012 2016 2020
Horizontal resolution of each coupled model component (km) 125 50 10
Increase in horizontal parallelisation wrt 2012
(hyp: weak scaling in 2 directions)
1 6.25 156.25
Horizontal parallelization of each coupled model component
(number of cores)
1,00E+03 6,25E+03 1,56E+05
Vertical resolution of each coupled model component
(number of levels)
30 50 100
Vertical parallelization of each coupled model component 1 1 10
Number of components in the coupled model 2 2 5
Number of members in the ensemble simulation 10 20 50
Number of models/groups in the ensemble experiments 4 4 4
Total number of cores (4x6x7x8x9)
Data produced (for one component in
2,5 26 1302
Data produced in total (in
200 4167 1302083
Increase 1 21 6510

The bottom line in the talk was that to deliver on all of this requires European scale infrastructure, that is, computing AND networks targeted at data analysis as well as data production!

An Aussie Triumvirate

Canberra and Gold Coast, November, 2010

These were three talks given in Australia during a short trip in November, 2010. I gave one in Canberra, and two on the Gold Coast.

British experience with building standards based networks for climate and environmental research

Presentation: pdf (5 MB)

This was the keynote talk for the Information Network Workshop, Canberra, November, 2010.

The talk covered organisational and technological drivers to network interworking, with some experience from the UK and European context, and some comments for the future.

All the other talks from the meeting are available on the osdm website.

Rethinking metadata to realise the full potential of linked scientific data

Presentation: pdf (3 MB)

Metadata Workshop, Gold Coast, November 2010

This talk begins with an introduction to our metafor taxonomy, and why metadata, and metadata tooling, are important. There is an extensive discussion of the importance of model driven architectures, and plans for extending our existing formalism to support both RDF and XML serialisations. We consider our the observations and measurements paradigm needs extension to support climate science, and discuss quality control annotation.

All the talks from this meeting are available on an [ ANDS website].

Provenance, metadata and e-infrastructure to support climate science

Presentation: pdf (9 MB)

This was a keynote for the Australian e-Research Conference, 2010.

Abstract: The importance of data in shaping our day to day decisions is understood by the person on the street. Less obviously, metadata is important to our decision making: how up to date is my account balance? How does the cost of my broadband supply compare with the offer I just read in the newspaper? We just don’t think of those things as metadata (one persons data is another persons metadata). Similarly, the importance of data in shaping our future decisions about reacting to climate change is obvious. Less obvious, but just as important, is the provenance of the data:who produced it/them, how, using what technique, is the difficulty of the interpretation in anyway consistent with the skills of the interpreter? In this talk I’ll introduce some key parts of the metadata spectrum underlying our efforts to document climate data, for use now and into the future. In particular, we’ll discuss the information modelling and metadata pipeline being constructed to support the currently active global climate model inter-comparison project known as CMIP5. In doing so we’ll touch on the metadata activities of the European Metafor project, the software developments being sponsored by the US Earth System Grid and European IS-ENES projects, and how all these activities are being integrated into a global federated e-infrastructure.

All conference talks are available here.

NSF Cyberinfrastructure for Data

Redmond, USA, September, 2010

Cyberinfrastructure Challenges (from a climate science repository perspective)

Presentation: pdf (2MB).

I gave a very short presentation at this NSF sponsored workshop. See my blog entry for a commentary.

Update (Bryan, 31st January, 2017): The output of this workshop eventually appeared in an NSF Report.

ENES Earth System Modelling Scoping Meeting

Montvillargennes, March, 2010

Software & Data Infrastructure for Earth System Modelling

Presentation: pdf (1 MB).

This meeting was targeted as being the first step in a foresight process for establishing a European earth system modelling strategy. This is the talk on software and data infrastructure prepared for the meeting (authored with Eric Guilyardi and Sophie Valcke).

RDF and Ontology Workshop

Edinburgh, June, 2006

Distributed Data, Distributed Governance, Distributed Vocabularies: The NERC DataGrid

Presentation: pdf ( 6 MB).

The workshop page is available here!

NERC e-Science All-Hands-Meeting

AHM, April, 2006

NERC DataGrid Status

Presentation: pdf (5 MB).

In this presentation, I present some of the motivation for the NERC DataGrid development (the key points being that we want semantic access to distributed data with no centralised user management), link it to the ISO TC211 standards work, and take the listener through a tour of some of the NDG products as they are now. There is a slightly more detailed look at the Climate Sciences Modelling Language, and I conclude with an overview of the NDG roadmap.