Typed and scraped into Arcturus
[Pedantic note: I use "data" as either a singular or plural noun according to the feel of the sentence.]
This and following posts have several purposes. They are to help me get my ideas in order for http://opensciencesummit.com/ where I am giving a 10-minute talk on “The Open Knowledge Foundation” (to which I would now add “and Panton”); to try to address the enormous scope of Open Data; and to prepare the ground for a funded project on Managing Research Data.
The current theme is “Panton Papers”. Part of the value of the Panton Principles (http://www.pantonprinciples.org/ ) is that the whole document is short and the key points are simply made. But the “Principles” can therefore only address the motivation and the procedures for Open Data in a general manner, and many of the problems are in the details. I believe that many of the problems in Open Access (which is simpler than Open Data) arose because not enough communal effort was given to the practice of Open Access, and I want to avoid as many Open Data problems as possible before they occur.
Over the last 2 years (during which Open Data has started to become important and widely discussed) I have seen several potentially difficult areas. I’ll simply list the ones I have thought of here and then outline the idea of the Panton Papers. This discussion is mirrored in part by the OKF open-science discussion list (http://lists.okfn.org/pipermail/open-science/2010-July/thread.html ) and you may wish to subscribe. There’s also a regular working group on open-science. (Almost everything in OKF is Open, but it may take a little while to find out where you want to be!) The issues that I currently have are:
- What is data? Images? Graphs? Tables? Equations? Accounts of experiments? This is a major problem and almost completely unexplored. Without solving this we are held back 10 years or more in our ability to re-use the primary scientific literature (e.g. by closed-access publishers who claim that factual graphs belong to them).
- Why should data be open? (and when should it not be?). I’ve put forward ideas in http://en.wikipedia.org/wiki/Open_science_data and http://precedings.nature.com/documents/1526/version/1 . They range from moral, to legal/quasi-legal to utilitarian.
- Who owns data? This is one of the trickiest areas – there is legal and contractual ownership and there is moral ownership. Generally there is far far too much “ownership” of data.
- When should data be released? This is a key question (see http://wwmm.ch.cam.ac.uk/blogs/murrayrust/?p=2473 for an example). Some communities have solved it – most haven’t addressed it and will have to go through the rigour of working out release protocols.
- How and where should data be exposed? I am strongly of the opinion that we need domain-specific repositories (which could be national or international) and that Institutional Repositories are almost never the best place to expose data (I expect and welcome alternative opinions). The “how” depends on understanding what the data and metadata are, and is increasingly dependent on specialist software and information standards. “Archival” is often the wrong word to use.
- Datamining and textmining. Most authors, publishers, repository owners are unaware of the enormous power of automated analysis of the literature. Some closed access publishers expressly forbid these activities. We have to liberate the right of the scientific community to do this enthusiastically and efficiently.
- Reproducibility. Science is based on reproducibility – we expect to be able to replicate the “materials and methods” of an experiment and to try to falsify its claims. Physical materials are beyond the immediate discussion (though this may change) but much science is now based on computing. It should be possible to replicate simulations, data cleaning, data analysis, model fitting etc. This is a tricky area. It is difficult (though with virtualization and the cloud it is becoming easier) to reproduce the computing environment. Large or complex data sets are a major problem but must be addressed. This is not without monetary cost.
I may add more.
The idea is that each of these is a “Panton Paper”. It may or may not be crafted in Pantonia (the hectare containing the Chemistry Department, the OKF headquarters, and the Panton Arms in Cambridge UK). Everything I now write is mutable.
Each paper will have a top-level document of similar form to the Panton Principles (i.e. 3-8 ideas, each with one or more short explanatory paragraphs). This document will be crafted by the OKF in public view on a wiki or Ether/Piratepad. Anyone can take part. We shall welcome contributions from a wide range of disciplines (in fact this is essential). At some stage version 1.0 of the paper will be frozen and will be formally published. We have an offer from a major publisher to do this and I am hoping we can announce this at Open Science Summit.
The Paper should carry a wider range of links to other essays on Open Data and should carry examples from different disciplines. For example, there is a well-tried and accepted process in many areas of bioscience and astronomy as to what, when and how data get published.
The wiki will be mutable so that changes in policies and updates to links can continue, even after the V1.0 publication. This will also serve as an example of a new type of publication where the static, immutable “paper” is replaced by a reviewed series of time-dependent hyperdocuments.
Over the next few days I will refine this for presentation at OSS.