data scientists are anally retentive too!

Well, that title ought to make this blog article inaccessible to those of us with anally retentive firewalls, but because it’s only a minor misquote of an article from the Times Higher Education, it is legitimate!

The original quote, attributed to Professor Nunberg, is:

"like most high-tech companies, Google puts a much higher premium on innovation than maintenance. They aren't good at the punctilious, anal-retentive sort of work librarians are used to."

Well yes: but I think Professor Nunberg said the right thing for the wrong reasons. Firstly, the context: Google Books has a lot of metadata, and a lot of it’s wrong, and Nunberg was effectively saying that Librarians do metadata better.

Well no: they clearly have procedures in place for checking metadata, but they don’t necessarily have the expertise, and even if they do, that might not be enough. IMHO, in the long run, crowd-sourced metadata (where one potentially has multiple experts on tap) will vastly exceed (individual) expert-sourced metadata. Because when metadata is more than just (title, author, year), metadata really is just another sort of content, and when it comes to descriptive content, Wikipedia v Britannica (e.g. here and here) shows us the direction of travel. Crowd sourcing wins, in the long run.

(Caveat: generating information is still an individual or at best a team game; you can’t crowd source information ab initio, even if you can crowd source the evaluation and description of it. Caveat-squared: you can of course crowd source data, since crowds can generate data. Anyway, enough of this digression.)

However, where librarians do matter, and data scientists too (for whom the same argument applies), is in setting the initial conditions! After all, we’re never “in the long run”; we’re always getting there, and we get there much faster if we know where we started from. Attention to detail is a great quality for librarians and data scientists alike (actually, all scientists); such anal retention is an integral requirement for many of us …

Taking us back to the THE article, clearly if the index doesn’t point you to the vicinity of the thing you want, it will take a long time to get there. Similarly, if you don’t have good scientific metadata, it takes a long time to choose the tree from the wood, and having done so, you might not have an initial evaluation for how good that tree was …

What Nunberg might have said (who knows, maybe he did say it; I only had the THE article to go on) was: why on earth didn’t Google start with an algorithm to utilise the professionally curated metadata rather than their own? And why not put a proper crowd-sourcing front end on it, so that the metadata can be improved?

All of which is rather ho hum, but what really got me thinking was the parallel between Google and the attitude of a great many in the scientific community to managing their data, about which I’d say:

Like Google, most scientists put a much higher premium on innovation than maintenance. They aren't good at the punctilious, anal-retentive sort of work that is necessary for society to get the best out of the data it has paid to collect!

Which is to say, it’s not just being anally retentive that matters, it’s what you’re anally retentive about.

OK, well that’s the last time this subject (a-r) comes up on this blog … even if it was enough to provoke me out of my workload-induced blog silence (although in truth, what was really needed was the sudden cancellation of a scheduled meeting).

comments (1)

Chris Rusbridge (on Friday 09 December, 2011) Great to see you back posting again Bryan, I’ve been missing Bryan’s Blog. I wonder how I can get more of your meetings cancelled…

If the THE reference was about Google Books metadata, the really odd thing is that AFAIK they had access to the Library metadata “from the get-go”. Both RLG and OCLC had shared all or part of their union catalogues with Google, and of course there was metadata in the catalogues of the libraries the books were in.

The other thing worth saying is that, um, quite a lot of library metadata is pretty rubbish. The best way to spot this is to look up a few titles on COPAC; you’ll usually see several versions of the title even though they should have been unioned [?].

The final thing is to wonder what the limits of crowd-sourcing are. The Wikipedia model doesn’t translate well beyond its own circumstances, and most crowd-sourced services struggle. ChemSpider managed some sort of synergy with Wikipedia, but at least partly through the super-human efforts of one man.

I really want to know because I’d like to persuade TNA to go for a partly-crowd-sourced model for the PRONOM file formats registry. Without it or something effective and pretty complete of the same ilk, we shall be royally stuffed!