I'm actively putting together what I intend to present at #rds2013 – Managing Research Data – on Feb 27th next Wednesday. It's difficult to present in 15 minutes something whose details need years to work out. And something where there is currently no clear consensus. In previous posts I have brain-dumped some of the things I want to say:
- http://blogs.ch.cam.ac.uk/pmr/2013/02/12/rds2013-0-managing-research-data-at-columbia-ny-i-am-not-constrained-in-what-i-say/ General Introduction
- http://blogs.ch.cam.ac.uk/pmr/2013/02/13/rds2013-principles-for-managing-research-data/ about 15 principles for us to think about. At one stage I thought this would be the backbone of my talk and I would have one slide per principle. Now I am not sure – maybe we want a different view, while still keeping the principles and letting people add on to them.
I am going to try a different slice through the problem. Even if I don't use it, it's helpful.
- Where are we at?
- What do we want to do?
- What are the obstacles? (UPDATE)
- How can we do it?
I'll deal with the first in this post…
Where are we at? And who are "we"?
Domains. I'm going to make some restrictions – not to talk about BIG SCIENCE (e.g. the http://en.wikipedia.org/wiki/Square_Kilometre_Array ). Big Science plans its data capture, management, release and access. I'd like to know what percentage of a Big Science project's spend is data management – it would be a very useful indicator. In the same vein I shan't cover data on human subjects and other sensitive areas – that rules out much social science and medical research, and since I don't practice the former I'll primarily stick to the long-tail of biosciences and physical science. I hope, however, that some of the principles are still valid and useful. Note that the long-tail of science is very long – it's not a second class citizen. It's just that its data are very badly managed at present. And I shall also stick to "potentially public data" (PPD) – data which has an impact on publicly visible research, not necessarily publicly funded. So, if a pharma company does cheminformatics research and publishes the results, the data on which the research is based is potentially public data. There's also the requirement IMO to make available any data on which public decisions are to be made. If a drug company wants to license a drug, then the supporting data is PPD. There might be reasons for not publishing it, but they must be clear and independent.
Players. We can list:
- Researchers. This is where the research originates. (Researchers want to be paid to do it| get the facilities to do it| tell the world about it| get credit for it| and to make money out of it) select one or more. Data management is a chore – they don't get paid for it; it's tedious; there are no decent tools. Data management is left to the lab head, who is probably just too old to have any idea how to do it properly. Many researchers are highly protective of "their" data.
- Academia. They have a responsibility for managing the data in their institutions (the UK FOI has made that clear). They haven't addressed the problem, and probably don't know who or what could help. They don't cost data. Almost all scientific data their researchers collect is outside their control. When something goes wrong (e.g. Climategate) they pick up a lot of flak.
- Government. Increasingly realise data is important and make noises about it. They make their own data public. They cost data. Probably manage it somewhere in the fairly-poor to fairly-good scale.
- Funders. They know data is important. Some manage it well and have data centres. Others leave it to researchers.
- Publishers. They generally haven't much clue about managing data (there are honourable exceptions – IUCr, EGU). Many refuse to publish any data with their papers. Some have data collection and resale products. This is almost all after-the-fact. Some would love to own and control this space just as they have with publishing – after all academia doesn't know the value of its output – getting them to give it away and buy it back would work very nicely here.
- Informatics industry. Big ones (Google, Microsoft, Yahoo, probably IBM, SAS) not interested – small fragmented market. Real scope here for newcomers to possess and resell data.
- Science Industry. Does data quite well. Recognizes it as a cost and an asset. Often secretive (e.g. pharma) so incurs major inefficiency costs as there is no pre-competitive support and many software companies are poor.
- Citizens. Excluded by academia. Many are committed to new generation approaches (Wikipedia, OpenStreetMap). They contribute a lot and get little in return.
Scale. Enormous. I calculate publicly funded research is of the order of 300,000,000,000 USD/year. Maybe 500 B if you include work published by industry and government. Assume 10% of the 300 B is data management and assume half of that is Big Science, and we still end up with 15 Billion USD per year for data in the long-tail. That 15 Billion is an unrecognised and unsupported cost. It should be recognised – it's not responsible just to hope it will manage itself on a PC and Excel.
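The back-of-envelope estimate above can be written down so that anyone can change the assumptions. This is only a sketch: the function name and parameter names are mine, and the 10% and 50% figures are the guesses stated above, not measured values.

```python
def long_tail_data_cost(total_research_usd, data_percent=10, big_science_percent=50):
    """Rough yearly spend on long-tail data management, in USD.

    All inputs are assumptions taken from the estimate in the text:
    data_percent is the share of research spend that goes on data,
    big_science_percent is the share of that data spend in Big Science.
    Integer arithmetic is used to avoid floating-point rounding noise.
    """
    data_spend = total_research_usd * data_percent // 100
    return data_spend * (100 - big_science_percent) // 100


# Publicly funded research only (about 300 billion USD/year)
print(f"{long_tail_data_cost(300_000_000_000):,} USD/year")  # 15,000,000,000 USD/year

# Including work published by industry and government (about 500 billion USD/year)
print(f"{long_tail_data_cost(500_000_000_000):,} USD/year")  # 25,000,000,000 USD/year
```

If the 500 B figure is the right base, the same assumptions give 25 Billion rather than 15, so the 15 Billion is the conservative end.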
What's the Value? The human genome had a multiplier of 180. For each dollar put in the human race got 180 dollars back. That won't apply to all data but I'd settle for a multiplier of at least 5. So if a project costs 200,000 USD it generates 1 million in value. Of that at least 10% is data – i.e. 100,000. So if a project fails to publish or manage data it is throwing away huge amounts of value.
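The same kind of sketch works for the value argument. Again the names are hypothetical and the numbers are the assumptions from the paragraph above (a multiplier of 5, and 10% of the value being data); nothing here is a measured result.

```python
def data_value_at_stake(project_cost_usd, multiplier=5, data_percent=10):
    """Value attributable to a project's data, in USD.

    multiplier: assumed value returned per dollar spent (5 in the text).
    data_percent: assumed share of that value that lies in the data.
    """
    total_value = project_cost_usd * multiplier
    return total_value * data_percent // 100


# A 200,000 USD project with the assumptions above
print(f"{data_value_at_stake(200_000):,} USD")  # 100,000 USD
```

That is the amount a project of this size risks throwing away if its data are never published or managed.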
Example. Computational chemistry can predict the behaviour of matter (crystals, proteins, etc.). I estimate that >> 1 billion USD is spent on PPD. And its value is several times this. None of this is published. None. This is certainly the single most valuable area where we would benefit from good modern data management. The lost value to the world, especially industry (semiconductors, carbon storage, energy, medical proteins, etc.) is certainly billions.
Organizations. Who can help? (NOT the publishers as they have shown themselves to act against the interests of anyone other than their shareholders.)
International bodies. ICSU/CODATA, UNESCO, Learned societies (if decoupled from commercial publishing). Funders (NSF, NIH, RCUK). National Laboratories. Governments. JISC. NCBI, EBI, Ueir/PubMedCentral
OKFN, SPARC, OSI, OSF, Foundations. All very sympathetic.
Training. Very little. Sophie Kershaw (our Panton fellow) is changing that through a novel graduate course and she'll be feeding me material. It's probably the most important thing – I'll say more under "what we can do".
Current Practice. There is very little consensus on how / what to do and widespread apathy. There are no positive pointers from #openaccess. University repositories have spent perhaps 2 billion USD worldwide and they have provided very little of any kind for Science. They are not designed for science data and should never be used.
The best examples of practice come from software – with Github/Bitbucket and a large number of innovative and highly useful tools. If we could emulate these repositories for science, both in their value and their relative openness, that could be massive. But data are not software and it's hard