I have been invited to write an article for Elsevier's Serials Review and mentioned it in an earlier post (Open Data: Datument submitted to Elsevier’s Serials Review). I had hoped to post the manuscript immediately afterward but (a) our DSpace crashed and (b) Nature Precedings doesn't accept HTML. DSpace is up again, so you can now see the article. This post is about the content, not the technology.
[NOTE: The document was created as a full hyperlinked datument, but DSpace cannot handle hyperlinks and it numbers each of the components as a completely separate object with an unpredictable address. So none of the images show up - it's probably not a complete disaster - and you lose any force of the datument concept (available here as zip) which contains an interactive molecule (Jmol) ]
Open Data (OD) is an emerging term in the process of defining how scientific data may be published and re-used without price or permission barriers. Scientists generally see published data as belonging to the scientific community, but many publishers claim copyright over data and will not allow its re-use without permission. This is a major impediment to the progress of scholarship in the digital age. This article reviews the need for Open Data, shows examples of why Open Data are valuable and summarises some early initiatives in formalising the right of access to and re-use of scientific data.
PMR: The article tries not to be too polemic and to review objectively the area of Open Data (in scientific scholarship), in the style that I have done for Wikipedia. The next section shows Open Data in action, both on individual articles and when aggregating large numbers (> 100,000) of articles. Although the illustrations are from chemistry and crystallography, the message should transcend the details. Finally, I try to review the various initiatives that have happened very recently, and I would welcome comments and corrections. I think I understand the issues raised in the last month, but they will take time to sink in.
So, for example, in the last section I describe and pay tribute to the Open Knowledge Foundation, Talis and colleagues, and Science/Creative Commons. I will blog this later but there is now a formal apparatus for managing Open Data (unlike Open Access, where the lack of this causes serious problems for science data). In summary, we now have:
- Community Norms ("this is how the community expects A and B and C to behave - the norms have no legal force but if you don't work with them you might be ostracized, get no grants, etc.")
- Protocols. These are high-level declarations which allow licences to be constructed. Both Science Commons and The Open Knowledge Foundation have such instruments. They describe the principles which conformant licences must honour. I use the term meta-licence (analogous to XML, a meta-markup language for creating markup languages).
- Licences. These include PDDL and CC0 which conform to the protocol.
Throughout the article I stress the need for licences, and draw much analogy from the Open/Free Source communities which have meta-licences and then lists of conformant licences. I think the licence approach will be successful and will be rapidly adopted.
The relationship between Open Access and Open Data will require detailed work - they are distinct and can exist together or independently. In conclusion I write:
Open Data in science is now recognised as a critically important area which needs much careful and coordinated work if it is to develop successfully. Much of this requires advocacy and it is likely that when scientists are made aware of the value of labeling their work the movement will grow rapidly. Besides the licences and buttons there are other tools which can make it easier to create Open Data (for example modifying software so that it can mark the work and also to add hash codes to protect the digital integrity).
Creative Commons is well known outside Open Access and has a large following. Outside of software, it is seen by many as the default way of protecting their work while making it available in the way they wish. CC has the resources, the community respect and the commitment to continue to develop appropriate tools and strategies.
But there is much more that needs to be done. Full Open Access is the simplest solution but if we have to coexist with closed full-text the problem of embedded data must be addressed, by recognising the right to extract and index data. And in any case conventional publication discourages the full publication of the scientific record. The adoption of Open Notebook Science in parallel with the formal publications of the work can do much to liberate the data. Although data quality and formats are not strictly part of Open Data, their adoption will have marked improvements. The general realisation of the value of reuse will create strong pressure for more and better data. If publishers do not gladly accept this challenge, then scientists will rapidly find other ways of publishing data, probably through institutional, departmental, national or international subject repositories. In any case the community will rapidly move to Open Data and publishers resisting this will be seen as a problem to be circumvented.
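As a small illustration of the remark above about software that can "mark the work" and "add hash codes to protect the digital integrity", here is a minimal sketch in Python. It is not taken from the article: the function names, the record layout and the sample data are all my own assumptions, and the licence is stored as a plain string rather than in any standardised form. It only shows the general idea of attaching a licence label and a SHA-256 digest to a piece of data so that a re-user can check that it has not been altered.

```python
import hashlib
import json


def label_open_data(data: bytes, licence: str) -> dict:
    """Build a small metadata record for a data payload.

    The record layout is hypothetical, chosen only for this illustration.
    """
    return {
        "licence": licence,  # e.g. "PDDL" or "CC0", kept as a plain label
        "sha256": hashlib.sha256(data).hexdigest(),
        "size_bytes": len(data),
    }


def verify(data: bytes, record: dict) -> bool:
    """Return True if the payload still matches the digest in the record."""
    return hashlib.sha256(data).hexdigest() == record["sha256"]


if __name__ == "__main__":
    # Made-up coordinates standing in for a real data file.
    payload = b"C 0.000 0.000 0.000\nO 1.128 0.000 0.000\n"
    record = label_open_data(payload, "PDDL")
    print(json.dumps(record, indent=2))
    print("unchanged data verifies:", verify(payload, record))
    print("altered data verifies:", verify(payload + b"x", record))
```

A digest of this kind only detects changes; it says nothing about who made the record, so a real tool would probably need signatures or a trusted repository as well.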