Reviewing Parsons and Fox on the data publication metaphor

Mark Parsons and Peter Fox have a draft paper out for public review and are inviting comments there …

… but I found myself hamstrung by the “comment” paradigm, so I’m responding here.

Firstly, the paper itself: Like Chris Rusbridge, commenting on the blog, I found myself in equal parts agreeing and disagreeing … but the bottom line is that it is a good question to ask, so I hope a version of this paper appears! The more this gets discussed the better: either I’m right (and data Publication - as I define it - should be a dominant paradigm), or I’m wrong. If I’m right, it’ll never happen until folks discuss and buy into it, and if I’m wrong, well, best I find out :-)

Ok, some specific comments:

There is an underlying assumption in this paper, which bubbles up to plain view in spots, and on the blog, that Publication and sharing are in conflict. I flat out disagree with this … and don’t know of any evidence that supports such an assertion.
- The Publication metaphor already explicitly supports both via preprints (sharing or publication) and the formal paper of record (Publication).
Metaphors are obviously useful, but as the authors agree, flat out dangerous as well. I’ve blogged on the importance of “verifiable statements” before (and folks should look at the Tauber link from that!). So do I agree with these particular metaphorical paradigms?
- Well, no, not really. We at CEDA would claim we were playing in all four paradigms as they are defined, and I don’t think we’re that unusual. It is true that the examples they have chosen can be roughly characterised by these paradigms, but I think there’s a bit of selection bias, sadly. Partly, I think that’s engendered by using these activities to try and stratify data management as opposed to data activity. For example, we are a big archive (Big Iron¹) who have some datasets which we wish to publish, and we are involved in map making and linked data - but I would argue that the second two are activities that depend on data management, they are not data management per se! Other groups will have very different data management standards and methods, but still be involved in map making and linked data. I think this is recognised to some extent in the paper (via the table where the focus line makes these key points, the qualification that they’re not mutually exclusive, and the first paragraph of section 4).
- So, because I think these paradigms do mix management and activity they don’t contribute easily to the question about publication, so I don’t find them helpful in this context.
- All that said, the discussion of how misunderstood definitions limit understanding is totally on the money (up to the point where there is an implicit assertion that “traditional peer review” is homogeneous enough to be differentiated from what one could do with data Publication). I think PLoS is an interesting analogy here …
Why do the authors think concepts like “registration, persistence” etc are not relevant (see the reference to Penev et al)? The answer appears to come from the next sentence where there is a conflation between Publication and some (closed) implementations thereof.
I think the dynamic, annotation infested ( :-) ), world of Publications that I foresee is not inconsistent with open access and clear provenance, yet the authors assert that worrying about provenance and definition is undue? (I defy you to find a URL to a data object that is useful without you having some implicit or explicit knowledge about the medium that will be returned if you dereference it - obviously definition matters in the data world, and just as obviously some level of containerisation is necessary - even if it’s only to say “stream of type X starts here, terminated by this binary string”).
- Of course annotation, federation, transformation are important, but if you want to use objects from that complex world, in a way that is scientifically useful, you need to know whether you can repeat or replicate your workflow!
- In particular, one often doesn’t need to see all that “ecosystem” in the workflow. The annotations are logically distinct from the data (and may themselves be different publications). If they do need to be together, then the anthology is a useful metaphor - publish poems on their own, and in collections!
- What is linked data, but simply linked data? … the objects that are linked, may, or may not be Published … fair enough … no conflict there?
- So I simply don’t buy that Publication is in conflict with unlocking the deep web! It’s orthogonal in some sense!
All those picky disagreements aside, I find myself agreeing: the various paradigms available all fall short in some way - but everything is a compromise. The analysis of their paradigms is fair enough … in general … but the conclusion lacks some rigour that they might be able to take from our paper.
I really liked the discussion of infrastructure and ecosystems … particularly when it got to recognising the different roles that exist in Publication and how they can be decoupled. (Again, our paper helps in this regard, although I think it could be usefully extended using the arguments that Parsons and Fox have espoused).
- When we get to releases, we start to see the concept of versions (editions anyone?) of data … right on!
- I think there is real scope to consider how the infrastructure and ecosystem around data Publication will be (should be) very different from those around literary publication. I’d really like that section to be expanded …
… it is important not to be hidebound by … any one metaphor. Yes, Yes, Yes.

All of which is me saying my definition of Data Publication is clearly different from that of Parsons and Fox. Which is utterly fine … neither of us is right or wrong, we just need to define what we are talking about!

Ok, now to consider the blog traffic :-) (which at the time of writing was up to comment 10).

The one point that I think is explicitly different from any of the points above is the importance of social science and philosophy in establishing both our expectations and our practice … absolutely true!!!

The other point (to reiterate) is that data Publication is not in conflict with sharing, but there is a time for sharing and a time for not sharing - and whatever system(s) we come up with have to leave the decision as to how much sharing is wanted or desired to individuals (or their funders) - the “infrastructure” cannot make that decision (positively or negatively). It needs to facilitate it!

(In particular, it is just fine for scientists to hold some data close to their virtual chests … consider, if nothing else, health records, or the geographical location of the last members of particular species etc).

Finally, I too am a big advocate of citation: it is not the answer, but it’s certainly an answer! Any metaphor is both useful and limiting, and even the best solutions are only useful 80% of the time. That means this data “publication” paradigm will involve multiple solutions. Bring on the separation of concerns :-)

comments (2)

bryan (on Friday 16 December, 2011) No prizes for guessing why the URL spells metaphor with a for!

Bryan (on Tuesday 17 January, 2012) Please comment on their original blog site, not here!

¹ Or not? Like Chris in the comments on their blog, we’re petascale, but all commodity hardware … today!