Open Data

There are several reasons why I’m currently thinking about Open Data (see Open Data at WP for some collected wisdom and links). We’re currently collecting more chemistry data that we intend to make Openly available (see the CrystalEye knowledge base as an example). I’ve been asked to write an article for Serials Review (Elsevier) on the subject and am putting my ideas in order. Chemspider announced Something New and Exciting Coming Soon… which contained an image with “Open Data” (no details). And Peter Suber announced New OA database on material properties, originally from the Chemistry Central blog, which announced “The database is yet another of the free, on-line chemical services to have emerged in recent years.” The use of “OA” was, I think, Peter’s.
I didn’t agree with Peter in his description of Material Properties as an “Open Access” database, and I’m worried that we shall see the same imprecision in the use of “Open Data”. So I wrote to Peter and am amplifying the arguments here. As a baseline, Peter and I are both on the advisory board of The Open Knowledge Foundation (initiated by Rufus Pollock), which has developed the Open Knowledge Definition. I think it’s important to take this as the starting point for this analysis, though there are aspects of databases which make the system much more complex.
It’s good that the principle is simple to summarise:

In the simplest form the definition can be summed up in the statement that “A piece of knowledge is open if you are free to use, reuse, and redistribute it”. For details read the latest version of the full definition (with explanatory annotations).

I’m going to look at the most important clauses for science/chemistry (emphases are mine) – I have omitted other clauses but I adhere to them as well:

1. Access

The work shall be available as a whole and at no more than a reasonable reproduction cost, preferably downloading via the Internet without charge. The work must also be available in a convenient and modifiable form.


3. Reuse

The license must allow for modifications and derivative works and must allow them to be distributed under the terms of the original work. The license may impose some form of attribution and integrity requirements: see principle 5 (Attribution) and principle 6 (Integrity) below.


4. Absence of Technological Restriction

The work must be provided in such a form that there are no technological obstacles to the performance of the above activities. This can be achieved by the provision of the work in an open data format, i.e. one whose specification is publicly and freely available


5. Attribution

The license may require as a condition for redistribution and re-use the attribution of the contributors and creators to the work.


6. Integrity

The license may require as a condition for the work being distributed in modified form that the resulting work carry a different name or version number from the original work.


8. No Discrimination Against Fields of Endeavor

The license must not restrict anyone from making use of the work in a specific field of endeavor. For example, it may not restrict the work from being used in a business, or from being used for military research.


PMR: There are significantly different types of Open Data in science. There is raw data produced by the scientific experiment and increasingly published alongside the “fulltext” of publications or theses. There is a curated, critical snapshot of a given experiment, perhaps images from a telescope or satellite. In this post I discuss the problems of “databases” or “knowledgebases” which are both fragmented and dynamic (e.g. CrystalEye and the Materials database).
It is critical to distinguish between “Free” and Open. “Free”, in this context, simply means that the provider has mounted the data (not necessarily the whole data) on a web page. There is often no licence, no copyright, no guarantee of availability, no commitment to archival, no explicit freedom of re-use. The materials database is in this category – and to be fair it didn’t call itself Open.
A major problem, which we have discussed in some detail on this blog over CrystalEye, is that many databases are both hypermedia and dynamic. They are spread over many components and they change with time. Both CrystalEye and Materials fall into that category. It is technically difficult to make them easily available and there is no agreed mechanism for doing this.
“The work shall be available as a whole.” I agree this is critical, but it’s often difficult. Leaving aside the dynamic aspect, there are a few possibilities.

  • Bundle the data into a single “file” or a set of files. This has worked historically for the Protein Databank. The difficulties are that there is not usually a single simple object to bundle, and that it requires considerable maintenance.
  • Provide an iterator over the data. This could either be a generic tool such as wget (which recurses over a hyperdocument) or a bespoke tool which is guaranteed to iterate over the data. This is the approach we have adopted (Jim Downing wrote it specifically to help the community download the data and has made it available under Open Source).
  • Collaborate with a data provider (e.g. a Bioinformatics institute). This is a good approach if your community supports the idea of Open Data, but chemistry has yet to see the light.
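
The second option, an iterator over the data, can be sketched in a few lines. This is only an illustration of the idea and not the tool Jim Downing wrote: the function names (`iter_entries`, `fetch`, `is_entry`), the URL layout and the `.cml` file suffix are all assumptions made for the example. It walks a hyperlinked index breadth-first, stays underneath the root URL, and yields each data entry exactly once.

```python
from collections import deque
from html.parser import HTMLParser
from typing import Callable, Iterator, List, Tuple
from urllib.parse import urldefrag, urljoin


class _LinkCollector(HTMLParser):
    """Collects the href value of every <a> tag in an HTML page."""

    def __init__(self) -> None:
        super().__init__()
        self.links: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def iter_entries(
    root_url: str,
    fetch: Callable[[str], str],
    is_entry: Callable[[str], bool],
) -> Iterator[Tuple[str, str]]:
    """Yield (url, content) for every data entry reachable from root_url.

    fetch    -- hypothetical callable returning the text behind a URL
    is_entry -- hypothetical predicate: True for data files, False for index pages
    Only URLs that start with root_url are followed, so root_url should end in "/".
    """
    seen = {root_url}
    queue = deque([root_url])
    while queue:
        url = queue.popleft()
        body = fetch(url)
        if is_entry(url):
            # A data file: hand it to the caller and do not parse it for links.
            yield url, body
            continue
        # An index page: queue every link that stays inside the collection.
        collector = _LinkCollector()
        collector.feed(body)
        for href in collector.links:
            target, _fragment = urldefrag(urljoin(url, href))
            if target.startswith(root_url) and target not in seen:
                seen.add(target)
                queue.append(target)
```

A downloader would pass a `fetch` that performs an HTTP GET and write each yielded entry to disk. Keeping `fetch` as a parameter means the traversal can be checked against a small in-memory set of pages before it is pointed at a live site.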



A few other comments. “Convenient and modifiable form” and “no technological obstacles” cannot be defined precisely, but I would argue that it is sufficient if the Open Data provider has published their formats and there is Open Source code that will read the data. Note that for many files ASCII is sufficient if the metadata is well provided. There is no requirement for the Open Data provider to provide installation help for downloaders if the instructions are minimally clear.
Open Access for scholarly publications implicitly guarantees certain aspects which are not guaranteed by default for Open Data:

  • The whole of the work is available. This is almost always trivial for articles (but as we have seen is a problem for some sorts of data).
  • There will be continued access to the work. This is based on (Gold) the permanence of Open Access publishers and the copying to inter/national repositories and (Green) the permanence of institutional repositories and in some cases inter/national repositories (self-archival on personal webpages does not guarantee permanent access). Repositories in general do not archive data.
  • The work can be re-used. This is clear if a licence is embedded in the work or provided by the repository. Note that many repositories do not make the licence position clear.
  • The work is in a convenient and modifiable form. Trivially readable for sighted humans. The rest is not always true.

Almost all these are major problems for Open Data.
So I very much hope that we can use Open Data in a strict form which adheres to the Open Knowledge Foundation guidelines. This is a good time to cement or challenge them. But it would be a serious problem if we allow “Freely accessible” to become synonymous with “Open Data”.


2 Responses to Open Data

  1. Klaus Graf says:

    “There will be continued access to the work. This is based on (Gold) the permanence of Open Access publishers and the copying to inter/national repositories and (Green) the permanence of institutional repositories and in some cases inter/national repositories (self-archival on personal webpages does not guarantee permanent access).”
    One can add agreements between publishers and legal deposit libraries which allow the libraries to make their journal archives publicly available if the publisher doesn’t offer online access to the journal any more (KB Den Haag with Elsevier). One can also add initiatives like LOCKSS.
    As archivist of RWTH Aachen, permanent access to data is a core issue for me.

  2. Pingback: Unilever Centre for Molecular Informatics, Cambridge - petermr’s blog » Blog Archive » Open Data - 2
