cyberinfrastructure for data

Today I’m sitting in a “data task force” meeting organised by NSF to address future cyberinfrastructure requirements. We’re in a Microsoft building in Redmond: comedy moment as it seems that of the three dozen or so folks here, half of them are using Macs … (and I suspect a few more are like me, using Linux). That said, I’m obviously drinking the Kool-Aid, because there is a feeling afoot that Microsoft are doing the right thing in a number of fields nowadays (unlike Oracle, who are trending darkwards).

Anyway, I was asked to present a couple of slides on challenges for open access data repositories … but clearly I don’t believe that such challenges are independent of the science, so I produced a few more slides to give context. I also wanted to make a lot of points, not just a couple … so I did that, but I do get to some big points eventually.

My slides are here (pdf). The main points I wanted to make are:

(slide two): Climate science is a global problem, as indicated by CMIP5, which involves dozens of organisations producing petabytes of data organised into millions of datasets. The solution to handling this sort of thing has to be global. (I could have led with something from the surface temperature meeting a couple of weeks ago; the same imperative for a global solution would emerge.)
(slide three): For all that, the global solutions have to be supported by national solutions. In the UK we might consider, for large scale earth simulation, developing a national cache - for simulations to be analysed “centrally” (to avoid a many-to-many copying problem of large volumes of data). To do that, we need high bandwidth, reliable network links to data producers internal to the UK, and further afield. One also wants to allow folks to analyse the data on their own terms: hence the provision of something with the attributes of a private cloud.
(slide four): Just to make the point that it’s not just simulation: earth observation data is growing massively too, and it’s also globally distributed. This is just ESA.
(slide five): And to make the point that sensors are proliferating (from lamp posts, ships and planes, to long term observatories).
(slide six): and in passing, to note that the amount of available storage can’t keep up with the data being produced.
(slide seven): Getting to the meat of it all: All these data sources are globally distributed, heterogeneous, and voluminous - and increasing on all those axes (more distributed, more heterogeneous, more voluminous). All of these data products need to be integrated, manipulated, and understood - and preserved (most of these data cannot be recaptured). These lead to major challenges in designing, building and delivering reliable:
- Global data systems, with global data movement, and global caches!
- National data systems, connected to major data sources internally and externally - with virtualised computing.
- Metadata systems to drive everything: they need to generate as much provenance as possible automatically, but intuitive tools for human input are needed too. They need to be really reliable and quite prevalent. They need to be smart, and exploit as much machine understanding as possible via ontologies etc.
(slide eight): An attempt to depict all the things one needs to consider if one wants to avoid a WORN¹ (“Write Once Read Never”) archive. I tried to produce a stack diagram based on things ranging from the technology driven (the computer architectures) through to the science driven (the services, portals and visualisation needed). Key points that I noted include:
1. Choosing the storage system is becoming more and more a balancing act between the energy cost (in Joules used and produced) and the opportunity cost (immediate access via high bandwidth or retrieval from offline or nearline media).
2. Nothing will scale without metadata (all sorts), and interoperability (both here and now, between software and humans, and over time) will require major investments in semantic tooling … significantly beyond just marking things up with unconstrained RDF … and dealing with the heterogeneity will require exploiting model driven architectures to build information systems which understand the metadata and data structures. This will depend on usage conventions for both data and metadata formats and schemas. It will also depend on data modelling paradigms and tooling that are significantly better than those available now - including data specific metamodels which are richer and more comprehensive. We will need automatic tools which capture provenance, but we will also need to continue to supplement such information with human input.
3. Applications will be decomposed into server side activities and client side activities, decoupled by networks that are not really able to cope with high volume data transfers - particularly over long distances (often the problems may be in the first or last mile, but only exposed with high volume long distance transfers). (This point is not about the bandwidth, which may be there, but not utilisable because of component and/or software incompatibilities.)
4. Server side calculations will become more and more important - and these will require both “free form” APIs (essentially allowing the equivalent of machine and/or script virtualisations - because point and click doesn’t scale) and fixed APIs (allowing the construction of complex portals for those who can and do want to explore pre-canned functionality).
5. It’s important that the systems have appropriate security policies along with authentication and authorisation tooling - because even if the data is intellectually open, it’s useless if the systems are overloaded and/or compromised.
6. Two overarching points from this slide:
  - The existing cyberinfrastructure at pretty much every level is far too flakey. Research councils around the world have invested in bleeding edge infrastructure research, but have not necessarily invested in reliable infrastructures (and here I’m talking about all the levels of infrastructure - although strangely they often invest heavily in networks that have high bandwidth potential).
  - Much of the heterogeneity can be addressed by improved standardisation, along with better tooling for specialisation and extension, so that the gamut of heterogeneity can be reduced to a manageable number of activities.
(slide nine): My final slide was making the point that all of this is predicated on a number of social and cultural challenges:
1. Rewards: Everything is predicated on metadata? That means there need to be rewards for metadata, since metadata creation is a commons issue: you yourself don’t get the direct benefit, but you benefit from the efforts of others.
2. Curation: Over time, the information needs to be migrated through the requirements of new/changed user communities. That needs people with appropriate careers to manage the evolution of the relevant ontologies (they can’t be expected to migrate the data itself without automatic tooling - there will be too much data for it to be a manual process, and even too much data to imagine manual appraisal/disposal could suffice instead).
3. Citation: Back to the reward! With citation, effort is rewarded, and the edifice of the scientific method is maintained.
4. Licenses and IPR: all of this will depend on clear and unambiguous licensing and IPR rules.
5. Trust & Reliance: A global, national and institutional infrastructure will need to have interdependent components that can be relied upon, now and into the future. Folks will need to trust that URIs issued by one institution will be dereferenceable in the future.
6. Plans: None of this can be done without plans that have appropriate timescales, that are understood by individuals, communities, and funders alike, and that are updated as appropriate.

In the discussion I also talked about the importance of policies that mandate that publications should be associated with either the data themselves or information on where the data can be obtained.

Under pressure, I pulled out a couple of specific recommendations:

- NSF should invest in reliable infrastructure for data management over the long term, and in better and more reliable tooling to support that.
- It should continue to invest in the computing techniques to deliver model driven architectures for data and metadata services - and follow it up with funding to develop reliable implementations.
- It should ensure it has appropriate data management policies in place, backed with mechanisms to implement/police them.

Of course, that’s just me, and this has time to run. We’ll see what the task force eventually recommends: hopefully something a good deal more specific and encompassing than those final recommendations (and I guess they’ll come after a second, open, workshop).

comments (2)

gary (on Friday 01 October, 2010)

I concur 100%. We are drowning in data, and there are not enough big enough tools in place to turn them into knowledge.

Josh (on Saturday 02 October, 2010)

I’m an MLIS student at Syracuse U. We just had someone from NSF speak to this exact issue. As part of their strategy to harness data they’re actively recruiting eScience Librarians.

¹ Apparently I invented this term, according to Tony Hey, who recalls me introducing him to the concept in a talk to the MRC in the early zeroes.