I posted recently about the problems of describing Open Data: how strict should we be about boundaries? Peter Suber has replied in “What counts as open data?”, and Klaus Graf has given an important emphasis on archiving in a comment to my post. Peter has also blogged another example of an “Open Access” database: Jan Christian Bryne and eight co-authors, JASPAR, the open access database of transcription factor-binding profiles: new content and tools in the 2008 update, Nucleic Acids Research, November 2007.
Abstract: JASPAR is a popular open-access database for matrix models describing DNA-binding preferences for transcription factors and other DNA patterns. […] JASPAR is available [here].
In the last 2 months I have been thinking fairly furiously about Open Data and realising it can be considerably more complex than Open Access scholarly publishing. I’m certainly clear that the borderlines may have to be fuzzy, but not infinitely. Peter writes:
- Just a quick note on my offline talk with PMR about Material Properties, which I called “OA” in a blog post. Neither of us could find its licensing terms, so we couldn’t tell just how open it was. I needed (I still need, we all need) a generic term for such resources when we do know they are free of charge but don’t know any details about their licensing terms. For better or worse “OA” has become that generic term, even while it has a narrower, earlier, more technical and more proper sense through the BBB definition. I readily and often acknowledge that I use the term “OA” both ways –widely and narrowly, as a generic term and as the technical term for the BBB level of openness. I also readily and often acknowledge that this ambiguity causes problems –see for example the Poynder interview at pp. 30-31. I can add that I resisted this dual sense as long as I could and only acquiesced when it became an undeniable fact of actual usage. For perspective, I’ve also argued that this kind of semantic spread is not a special calamity for our technical term, but affects most technical terms in wide use and needn’t prevent precise communication.
- One tempting solution is to come up with a new generic term so that “OA” can be limited to its strict BBB sense. That’s desirable but difficult, since coining terms is not the same thing as assuring their use, let alone their intended use. BTW, “free” would not make a better generic term, at least not yet, since it suggests to many people that a work is merely free of charge and does not also remove permission barriers. A good generic term would cover all kinds of free online content, including those that are BBB OA.
- I share PMR’s hope that the term “open data” can stay fairly well tethered to its technical definition. But the data world needs a generic term for the same reason that the publication world does. If we had a good generic term for free online content, perhaps it could allow “open data” to remain univocal.
PMR: I agree with the sentiments here – but suspect we are both unclear of the way forward. I’m reluctant to use OA for databases since “OA” is already having to work pretty hard to manage the differences in practice and philosophy in scholarly publishing of articles (data is rarely included). “Open Source” has more or less got its act together, although there is both tension within the community of the Free/Open + viral schools; and also abuse of the term “Open Source”. I am keen to avoid the abuse of “Open Data” while it is still struggling to play a role.
In some disciplines “free” implies “Open”. In biosciences there is an unwritten agreement that freely available data is Open. Sequences, structures and genes are made available, usually without formal copyright or formal licensing. There are thousands of databases with the same attitude as JASPAR (above) – our own MACiE database is similar. In all these it’s commonplace to download the whole data – for example we state “Each MACiE entry in the database can be downloaded separately as a CML file. This option is available from the left side panel, underneath the reaction step lists.” The bioscience community has a tradition of sharing and re-use which doesn’t need to be spelt out. Admittedly there is potential for confusion: some databases do restrict their usage, and some will effectively have a no-commercial-use clause. But there is a strong tradition of meetings where the principles are reinforced, where collaboration is made and traded. Generally it is expected that data will be re-used.
In chemistry, in contrast, the tradition is of gathering information and reselling it. There are no public Chemoinformatics Centres in the same way as Bioinformatics – in fact the Bioinformatics Centres are steadily taking over the biological parts of chemistry. So by default it has to be assumed that any database on the web, however freely accessible at a point in time, has no guaranteed permanency of access. There are often explicit barriers to re-use. So it’s important to have clear guidelines and clear labels – otherwise “Open Data” or “Open Access” is meaningless and acts only as a way of marketing warm feelings.
This is more difficult because data are undervalued in the peer-esteem economy. A “paper” – however poorly read, however bad – even to retraction, is part of the sacrosanct “scholarly record”. Libraries and curation centres have a duty to capture this. In contrast there is nothing like the same obligation on any organisation to capture public datasets. Admittedly it’s harder, but that’s not the real reason.
So I reiterate some guidelines. I’m still working these out and would welcome comment. (I don’t feel we should stray too far from the Open Knowledge Foundation guidelines.) As a start I would suggest the following:
- There must be some mechanism whereby the community could, if it wished, capture the resource for public archival without permission. This could be as simple as spidering the site, or a relational dump, or a massive file, or an iterator.
- There must be no permission barriers to re-use including commercial re-use.
- The data must either be the whole work (at a given point in time) or be clearly bounded (i.e. there should be no hidden data that the world cannot get access to in the same way).
- There should be no time limits on access and re-use.
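To make the four guidelines concrete, here is a minimal sketch of how they might be recorded as a checklist for a given resource. Everything in it is hypothetical and invented for illustration: the class name, the field names and the example resource are mine, not part of the Open Knowledge Foundation guidelines or of any existing tool. It only restates the list above in code and does not try to detect licensing terms automatically.

```python
from dataclasses import dataclass


@dataclass
class ResourceClaims:
    """What we know about a data resource (all field names are hypothetical)."""
    name: str
    bulk_capture_possible: bool     # spidering, relational dump, single file, or iterator
    reuse_unrestricted: bool        # no permission barriers, commercial re-use included
    whole_or_clearly_bounded: bool  # no hidden data the world cannot reach
    no_time_limits: bool            # access and re-use do not expire


def failed_guidelines(claims: ResourceClaims) -> list[str]:
    """Return the guidelines from the list above that the resource does not meet."""
    checks = [
        (claims.bulk_capture_possible, "community can archive it without permission"),
        (claims.reuse_unrestricted, "no permission barriers to re-use"),
        (claims.whole_or_clearly_bounded, "whole work or clearly bounded"),
        (claims.no_time_limits, "no time limits on access and re-use"),
    ]
    return [label for ok, label in checks if not ok]


def is_open_data(claims: ResourceClaims) -> bool:
    """A resource counts as Open Data here only if every guideline is met."""
    return not failed_guidelines(claims)


# An invented example: free to read and download, but with a no-commercial-use clause.
example = ResourceClaims(
    name="hypothetical-db",
    bulk_capture_possible=True,
    reuse_unrestricted=False,
    whole_or_clearly_bounded=True,
    no_time_limits=True,
)
print(is_open_data(example), failed_guidelines(example))
```

The point of returning the failed guidelines rather than a bare yes/no is that a label such as “free, but no commercial re-use” is more useful to a reader than a refusal to call something Open.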
Data is now acquiring the same power as software did two decades ago. It won’t be surprising if there are tensions – commercial, political, social. We need to identify and plan for them.