The relationship between collecting metadata and the optimum size of a child's plate of food

I’ve said it before, and no doubt I’ll have to keep saying it, but the word metadata is understood differently by nearly every individual. This has a number of consequences, starting with defining (in any given case) what comprises metadata. The problem is nicely encapsulated in a recent email on the Galeon list. I hope Gerry Greager won’t mind me further publicising his statement:

A lot of otherwise really sharp folks tend to define everyone's data and metadata by their own prejudices, including me. After all, MY data's easy to identify and define, and I can see how YOUR data should be identified and defined, too. What? you don't agree with me? How dare you?

The corollary of this position I’d state as:

It's really easy for me to create my metadata, and time consuming and unrewarding to create the metadata you need. How dare you ask me to waste time doing it? Oh, by the way, can you please create the metadata I need to consume your data? What, you don't want to unless you're a co-author? Bizarre!

(In this context, consumption goes well beyond merely being able to load and manipulate the data; the meaning and context matter too …)

This has interesting consequences for those of us trying to collect metadata within projects like, for example, metafor. There, one of our goals is to document the models used and simulations produced in CMIP5. That means we’re going to be asking the modelling groups to enter metadata about those models and simulations, and it’s going to be time consuming to do so. I expect many will consider it not of direct benefit to them (that said, I hope just as many, if not more, will recognise direct benefits). Indirect benefits should be obvious: the better documented we make these models and simulations, the better the interpretations and derivative science should be, particularly when those interpretations and derivations are done by those outside the normal community of model data users, by folks who need that extra metadata to be sure of what they are doing (or even to do it at all).

OK, so I think I make a cogent argument about benefits, so where does children’s eating behaviour come in? Well, I think that when we’re trying to gather metadata, we’re in the same boat as parents are with young children: if you put too much food on the plate, kids just dabble round the sides and don’t eat much. Put the right amount on the plate, and kids gobble it up. Too little, and you’re back to “don’t eat much”.

So, when asking for metadata, it’s crucial to ask for just the right amount: enough for a large proportion (but not all) of the potential data consumers, but not so much that the task of producing it puts off the metadata producers, and you end up getting little or none of what you need. (And don’t ask for so little that you end up getting little or none of what you need.)

Returning to metafor, the question I keep asking myself is: “Should we ask that, or will it put folk off answering at all?” The problem, of course, is knowing what the answer is. Again, as with children, we need to try to second-guess how much capacity and desire there is …

With a metadata entry tool, we’re in the even more complicated situation of a children’s party: we have different children (with different capacities and desires), so we have to work out the average plate size, but allow for second and third helpings for those with large capacity. That is, we need to guess how much metadata we can reasonably ask for on average without making it look too large, but make it possible for those with the interest and desire to give us much, much more information within the same structures.

The answer is further complicated by the situation. To push this analogy even further, in the case of CMIP5 we have the advantage of peer-induced pressure (for increased metadata production), just like at that party, where there is peer-induced pressure (for increased food consumption).