Open Data – what can I do? Simple, legal, viral suggestion

Following discussion on the SPARC Open Data list I got a mail:

I’d like to hear more discussion on open data, too. In particular, what are the practical approaches that will help adoption of open data by researchers themselves? We all know technology is not the big issue here. The biggest challenge is how to get researchers to share their raw experiment data on the web. Because this is quite different from traditional publication, a big uphill battle is expected.
Since I’m a software developer (with research background), my thinking is always centered around creating free new tools and services for the end users (i.e. researchers). And the key is to come up with a set of tools and services that can benefit users immediately as they open up their data bit by bit on the web. Of course, they have to be relatively easy to implement (ideally, no funding is required). Fortunately, open source software and communities have made this possible. In addition, the emerging semantic web technologies seem to be right for this task. So, I’m working with the W3C semantic web group to openly develop ontologies for representing research data (at a high level). I’m also developing the necessary web publishing tool and an R&D community search engine through an open source project. My hope is that researchers will open more data when they actually see these open data bring more visibility and recognition to their work through a community search engine.
Cheers,
AJ
AJ Chen, Ph.D.
Palo Alto, CA. USA
1-650-283-4091
web2express.org
W3C Scientific Publishing task force
“Open data on semantic web”

PMR: The simplest thing that researchers can do is to add a Creative Commons license to their data. It costs nothing, is a simple cut-and-paste, and could trivially be made a template in any data production tool. (For example, if you publish spreadsheets in Excel, add a Creative Commons license as your first or last line. Every time you open a blank spreadsheet it would have this.)
Similarly scientific software developers could output the additional line:
“The data output by this program are offered under a Creative Commons Attribution Share-alike license.”
(or better, in a machine-readable XML/RDF format, as Creative Commons already does).
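As a minimal sketch of how a program might do this for tabular output, the Python below writes a CSV file with the license statement as its first or last line. The function name `write_csv_with_license`, the `# ` comment prefix and the sample rows are assumptions made for illustration; only the wording of the statement comes from the text above.

```python
import csv
import io

# Wording taken from the post. The "# " prefix used below is an assumption:
# many tools treat such lines as comments, but not all CSV readers do.
LICENSE_LINE = (
    "The data output by this program are offered under a "
    "Creative Commons Attribution Share-alike license."
)


def write_csv_with_license(rows, out, license_first=False):
    """Write rows as CSV to the text stream `out`, adding the license
    statement as the first line (license_first=True) or the last line."""
    if license_first:
        out.write("# " + LICENSE_LINE + "\n")
    csv.writer(out, lineterminator="\n").writerows(rows)
    if not license_first:
        out.write("# " + LICENSE_LINE + "\n")


# Usage with made-up placeholder rows.
buffer = io.StringIO()
write_csv_with_license([["sample", "value"], ["sample-1", 1.0]], buffer)
print(buffer.getvalue())
```

Putting the statement in the writer itself means every file the program produces carries it without any extra effort from the researcher, which is the "template" idea described above.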
We might add the following:
“The authors regard these as not copyrightable Open Data. Some publishers wish to claim that if published alongside a journal article they own the copyright on them. This license effectively forbids them to claim copyright without the authors’ permission and acquiescence.”
I think the effect of this would be dramatic. Scientists would start to see these messages and think: “Why should I give these data to the publisher?” And if the publisher simply adds a copyright notice saying “all these data are copyright the publisher – you cannot use them for X, Y, Z without permission” this would be in violation of the authors’ license. The author would have to deliberately remove this statement to hand over the IPR to the publisher.
It is never easy to design a viral campaign, but this has all the prerequisites of a meme:

  • it infects a significant number of the potential population
  • they wish to reproduce and spread the meme
  • the costs of replicating the meme are effectively zero

The first two are unknowns. The critical things are to get the form of words right and not to foul up technically. Not all data sets can carry text (although an increasing number will be accompanied by metadata, which is ideal for this).
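For data sets that cannot carry text themselves, one possible approach is a small metadata file stored next to the data. The sketch below is only an illustration: the `.license.json` suffix, the field names and the helper functions are hypothetical, and the license URL (including its version number) is an assumption that should be replaced with whichever Creative Commons license the authors actually choose.

```python
import json
from pathlib import Path

# Assumed license URL and version; substitute the license actually chosen.
LICENSE_URL = "https://creativecommons.org/licenses/by-sa/4.0/"


def sidecar_path(data_path):
    """Return the path of the metadata file that accompanies data_path."""
    p = Path(data_path)
    return p.with_name(p.name + ".license.json")


def build_license_record(data_path, license_url=LICENSE_URL):
    """Build a small machine-readable record naming the file and its license."""
    return {
        "file": Path(data_path).name,
        "license": license_url,
        "statement": "The authors regard these as not copyrightable Open Data.",
    }


def write_sidecar(data_path, license_url=LICENSE_URL):
    """Write the license record as JSON next to the data file."""
    target = sidecar_path(data_path)
    record = build_license_record(data_path, license_url)
    target.write_text(json.dumps(record, indent=2), encoding="utf-8")
    return target
```

Calling `write_sidecar("run1.bin")` would create `run1.bin.license.json` in the same directory, so the license travels with the data even when the data format has no room for free text.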
If the scientific programmers buy into this it is unstoppable.

This entry was posted in "virtual communities", open issues.

14 Responses to Open Data – what can I do? Simple, legal, viral suggestion

  1. JamesM says:

    I like this idea a lot, but I think it would be a hard sell getting my supervisor on board for something that decreases the probability of a paper being published.
    At any rate, should we be adding CC licences to our supporting information, and then mentioning it in the cover letter when we submit papers? At some later date?

  2. Robin Rice says:

    There was an article in the October issue of Ariadne, Creative Commons Licences in Higher and Further Education: Do We Care? which points out some of the questions around widespread use of Creative Commons licenses. “Naomi Korn and Charles Oppenheim discuss the history and merits of using Creative Commons licences whilst questioning whether these licences are indeed a panacea.”

  3. pm286 says:

    (1) I think this would depend on the domain and the journal. In many of them I suspect the editors never look at the supporting information – the reviewers do. So it is quite likely to be accepted. And it shouldn’t make any difference to the paper being published – if the publisher puts up sufficient resistance you will have to decide. But the handover of intellectual property should only take place after the paper has been accepted. The worst that might happen in practice is a delay until one side gives in.
    I would add it to the supplementary info and say nothing – as the SI isn’t the journal’s property. I don’t know your subject, although if it’s chemistry you may not succeed with the ACS (Henry and I have managed it once). But they have no moral right to the data.

  4. pm286 says:

    (2) Thanks. This is wonderful. I am aware of several possible criticisms of my suggestion – but I’ll see if any come up.

  5. JamesM says:

    Yes Peter, I was thinking of ACS journals, JCIM in particular…

  6. Jonathan Eisen says:

    I personally think this is an end run around the more important issue – if you publish in Open Access journals the problem would be solved since they should not object to the data rights being retained by the author. So basically, do not publish in ACS journals or Elsevier, etc. And only publish in Open journals. And if they are not available in your field, create them. I would go even further than making the data Open. I advocate releasing data but not allowing it to be used for any publication that is not itself Open Access. I have no idea how to do this, but if it was done, it would in essence force anybody who wanted to use certain data sets for papers to publish in an Open manner.

  7. Philip Small says:

    O! Just the data. The raw data. The article(s) themselves can still be copyrighted, so this does not impair the publication potential related to copyright. That means that others are free to do their own unique analysis of the data. Science advances. No publishers are harmed; in fact, where data supply is a limiting factor, as in a breaking field, it results in more publishable material in advancing fields. Publishers’ interests are well served by this. Scientists who produce data in high-demand fields get faster turnaround in attribution. Research scientists and their patrons are well served by this. With an open data standard, scientific misconduct will decrease. Science is well served by this. What’s not to love?

  8. pm286 says:

    (5) Excellent. (For non-chemists, this is the “chemoinformatics journal”, and its publications are frequently based on data.) Does this data appear with the publication? Not very often. Does the software used to process it? Almost never. And the statistics packages? Occasionally available. So papers often read like:
    “We took this set of molecules (proprietary) and computed molecular descriptors using (commercial software with binary code and protected by trade secrets) to generate a statistical model (software not available) and got a wonderful straight line”. I know the journal editors are trying to change this but it would help a lot if they promoted Open Data, Open Source, Open Standards. (Where have I heard that before?)

  9. pm286 says:

    (6)
    # Jonathan Eisen Says:
    December 12th, 2006 at 8:35 pm
    >>>I personally think this is an end run around the more important issue – if you publish in Open Access journals the problem would be solved since they should not object to the data rights being retained by the author.
    Unfortunately some do. I have tried to highlight this problem – essentially some publishers offer “Open Access” which is not compliant with BOAI. I am working up to blogging more about this.
    >>>So basically, do not publish in ACS journals or Elsevier, etc. And only publish in Open journals. And if they are not available in your field, create them. I would go even further than making the data Open. I advocate releasing data but not allowing it to be used for any publication that is not itself Open Access. I have no idea how to do this, but if it was done, it would in essence force anybody who wanted to use certain data sets for papers to publish in an Open manner.
    This is very similar to GPL viral copyleft. I can’t see how it can be done as – ideally – facts should be free of license. I haven’t come across this idea before – it’s intriguing, but since I am not a great fan of GPL (I use other OS licenses) I’d need convincing.

  10. pm286 says:

    (7) In general agreement – the question is how…

  11. JamesM says:

    Now I think of it, for cheminformatics, as long as the data’s available somewhere, no-one seems to mind where you have to go to get it (and in fact, no-one seems to mind too much if the data isn’t available at all, unfortunately). So in the future, I shall suggest just sticking everything on our website and mention the URL in the paper, and dispense with supporting information entirely. We already do that with code, so all the extra flannel can go there as well.

  12. pm286 says:

    (11). This is great – exactly the right approach. If you are in a university or similar institution then consider using a repository. Not everyone has one, but they are appearing very fast. And many of them want data. We’ve put 250,000 MOPAC calculations in ours (http://www.dspace.cam.ac.uk).
    However, you may have to be careful with some publishers. If it’s not on the web site the reviewers can’t see it. But if you put it there the journal might consider that to be prior publication. We’ve had stuff of this sort…
    P.

  13. Richard says:

    And of course, publishers (well, ALPSP and STM members) have this statement as guidance.
    Richard

  14. Pingback: Unilever Centre for Molecular Informatics, Cambridge - petermr’s blog » Blog Archive » Open Data : help from ALPSP and STM
