Following discussion on the SPARC Open Data list, I received this mail:
I’d like to hear more discussion on open data, too. In particular, what are the practical approaches that will help adoption of open data by researchers themselves? We all know technology is not the big issue here. The biggest challenge is how to get researchers to share their raw experiment data on the web. Because this is quite different from traditional publication, a big uphill battle is expected.
Since I’m a software developer (with research background), my thinking is always centered around creating free new tools and services for the end users (i.e. researchers). And the key is to come up with a set of tools and services that can benefit users immediately as they open up their data bit by bit on the web. Of course, they have to be relatively easy to implement (ideally, no funding is required). Fortunately, open source software and communities have made this possible. In addition, the emerging semantic web technologies seem to be right for this task. So, I’m working with W3C semantic web group to openly develop ontologies for representing research data (at high level). I’m also developing necessary web publishing tool and R&D community search engine through open source project. My hope is that researchers will open more data when they actually see these open data bring more visibility and recognition to their work through a community search engine.
PMR: The simplest thing that researchers can do is to add a Creative Commons license to their data. It costs nothing, is a simple cut-and-paste, and could trivially be made a template in any data production tool. (For example, if you publish spreadsheets in Excel, add a Creative Commons license as your first or last line. Every time you opened a blank spreadsheet from that template it would already have this.)
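As a rough illustration of how little this costs, here is a minimal sketch of a data writer that puts a license line at the top of every CSV file it produces. The function name `write_csv_with_license`, the notice wording and the sample rows are all hypothetical, chosen only for this example. Writing the notice as a `#` comment line is an assumption too: plain CSV has no comment syntax, so readers have to be told to skip it.

```python
import csv
import io

# Hypothetical notice text; the exact form of words is still to be agreed.
LICENSE_NOTICE = (
    "The data in this file are offered under a "
    "Creative Commons Attribution Share-alike license."
)


def write_csv_with_license(rows, out, notice=LICENSE_NOTICE):
    """Write rows as CSV, preceded by one comment line carrying the license."""
    # Assumption: consumers treat lines starting with '#' as comments.
    out.write(f"# {notice}\n")
    writer = csv.writer(out, lineterminator="\n")
    writer.writerows(rows)


# Usage: an in-memory buffer stands in for a real output file.
buffer = io.StringIO()
write_csv_with_license([["sample", "value"], ["a", 1.0]], buffer)
print(buffer.getvalue())
```

Some CSV readers can be asked to ignore such lines (pandas `read_csv`, for instance, takes a `comment` argument), but a tool that cannot would need the notice in a separate metadata file instead.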
Similarly, developers of scientific software could have their programs output an additional line:
“The data output by this program are offered under a Creative Commons Attribution Share-alike license.”
(or, better, in a machine-readable XML/RDF format, as Creative Commons itself already does).
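A machine-readable version of that statement could look something like the sketch below, which builds a small RDF/XML fragment linking a data file to a license URL through the Creative Commons `cc:license` property. The function name `license_rdf`, the example data URL and the specific license version in `LICENSE_URL` are assumptions made for illustration; the text above names only "Attribution Share-alike", not a version.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
CC_NS = "http://creativecommons.org/ns#"

# Assumed license version; substitute whichever license the authors choose.
LICENSE_URL = "https://creativecommons.org/licenses/by-sa/4.0/"


def license_rdf(data_url, license_url=LICENSE_URL):
    """Return an RDF/XML string stating that data_url is under license_url."""
    # Register prefixes so the output uses rdf: and cc: rather than ns0:, ns1:.
    ET.register_namespace("rdf", RDF_NS)
    ET.register_namespace("cc", CC_NS)

    root = ET.Element(f"{{{RDF_NS}}}RDF")
    description = ET.SubElement(
        root, f"{{{RDF_NS}}}Description", {f"{{{RDF_NS}}}about": data_url}
    )
    ET.SubElement(
        description, f"{{{CC_NS}}}license", {f"{{{RDF_NS}}}resource": license_url}
    )
    return ET.tostring(root, encoding="unicode")


# Usage: the data URL is a placeholder, not a real resource.
print(license_rdf("http://example.org/data/results.csv"))
```

A program could emit this fragment alongside its normal output, or embed it in whatever metadata block the data format already provides.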
We might add the following:
“The authors regard these as non-copyrightable Open Data. Some publishers wish to claim that, if the data are published alongside a journal article, they own the copyright on them. This license effectively forbids them to claim copyright without the authors’ permission and acquiescence.”
I think the effect of this would be dramatic. Scientists would start to see these messages and think: “Why should I give these data to the publisher?” And if the publisher simply adds a copyright notice saying “all these data are copyright the publisher – you cannot use them for X, Y, Z without permission” this would be in violation of the authors’ license. The author would have to deliberately remove this statement to hand over the IPR to the publisher.
It is never easy to design a viral campaign, but this has all the prerequisites of a meme:
- it infects a significant number of the potential population
- they wish to reproduce and spread the meme
- the costs of replicating the meme are effectively zero
The first two are unknowns. The critical things are to get the form of words right and not to foul up technically. Not all data sets can carry text, although an increasing number will be accompanied by metadata, which is ideal for this.
If the scientific programmers buy into this it is unstoppable.