Dictated into Arcturus
Today I, and the other authors of the Panton Principles, have been invited by BioMed Central to hand out the prizes which they have sponsored (with Microsoft) for the best papers that make data Open. As part of this they have asked us to give short presentations (no more than 5 minutes), and I thought I would try to get some of my points down in a blog post before the event.
I have talked to a lot of people about how to make data Open and have a first-pass analysis of the things that have to come together. Following the traditional analysis of a murder (motive, means, opportunity) I have called them motivation, means and resources.
- Means. Unless the tools and protocols are in place for sharing data, the barriers are formidable. This is often referred to as a question of “data formats” but is usually more complex, involving ontologies, semantics, and data selection. Unless scientists are given clear guidelines and tools that are agreed by the community, it is hard for them to share data.
- Resources. Although data storage is very cheap, it is not zero-cost, and unless there are clear places and methods for making data available and for making it at least partially persistent, it is again unreasonable to expect scientists to deposit data. There probably is a large amount of freely available storage provided by academia and other institutions, but very often this is heterogeneous and a hotchpotch. The simple assumption that all data can be put in institutional repositories is very unlikely to succeed in most cases.
How are we going to provide all three of these components? We already have many examples where data sharing is accepted, so we can draw some generalisations from these. There normally needs to be a well-motivated, independent, trustable authority (often domain-specific) that oversees the process. Examples of this are the bioinformatics centres (NCBI, EBI, PDB, etc.) and similar resources in high-energy physics. But in many sciences where data are heterogeneous and the disciplines are uncoordinated by national or international bodies, there is a major problem before data can be captured and shared.
What role can publishers play? As long as the data are small and extremely common, it is possible for publishers to mount data on their websites at low cost. A good example of this is in crystallography, where CIF (data) files are routinely required as a prerequisite of publication and where they appear alongside the full text. But it is uncommon to see Excel spreadsheets or molecular structures routinely attached to publications. This is not surprising; publishing data costs effort and therefore money, and there is no immediate return for the publisher in doing so.
What about institutional repositories? My concern here is that they have been set up primarily to address the needs of managing full-text manuscripts (or even only abstracts) as a result of a variety of political pressures. The most common reason is to manage the research assessment exercises that are now routine in certain countries. Others include advertising the institution (rather than individuals). Again, this is seen as a chore by many academics. It also means that repository managers take a manuscript-like approach to information rather than a holistic approach to the capture of scholarship. Moreover, a typical university will deal with thousands of different types of scholarship, and it is unreasonable to expect a repository manager to have any knowledge of all but an extremely small fraction. For that reason a better solution will probably involve domain-specific repositories, and I shall address this in future posts.
Let me finish by congratulating the winners and other honourable mentions of the Open Data prizes. We will be discussing today how these awards can be developed so that next year, and hopefully in future years, they can contribute to the culture of open data publication and make people aware of the three components that need to be addressed. At the very least they will help with the motivation!