Open Data in biomedical science

Heather (Research Remix) has a most important post on data sharing – she has analysed the data deposition policies of some of the major journals/publishers. Note that this is orthogonal to Open Access – not all these publishers are OA, but many are aggressive about requiring data deposition – and that’s good. My comments during and after her post:

Diverse journal requirements for data sharing

Many academic journals make sharing research data a requirement for publication, but their policies vary widely. I’ve been wanting to understand this better: below is a summary of my Tuesday Morning Delve into the world of “Information for Authors”.
I selected 10 journals, two from each of the following ad hoc categories: general science (Nature and Science), medicine (JAMA and NEJM), oncology (JCO and Cancer), genetics (Human Molecular Genetics and PLoS Computational Biology), and bioinformatics (Bioinformatics and BMC Bioinformatics). The results are obviously just the tip of the iceberg, but I found them enlightening.
PMR note: although Science and Nature are general journals almost all the emphasis is on biomedical in this discussion. I would not be surprised to find that the requirements were very different in – say – chemistry or materials science.
Nature has the most stringent requirements, followed closely by Science. These journals required data sharing for the most diverse types of data, specified acceptable databases, escrow requirements, and actually had “teeth” clauses… they specify a statement of consequences for times when you ask for data and the authors don’t provide it.
The medical journals do have requirements for clinical trials registries, and sometimes suggestions for data inclusion based on clinical trial design, though they have no mention of requirements or encouragement for sharing (obviously deidentified) research data except that NEJM requires sharing microarray data.
I’m out of time this morning to highlight the other findings, but you can have a look for yourself below.
These rough conclusions of mine are consistent with Table 2-1, “Policies on Sharing Materials and Data of 56 Most Frequently Cited Journals”, in [Sharing Publication-Related Data and Materials: Responsibilities of Authorship in the Life Sciences (2003). National Research Council of the National Academies]:
Their more exhaustive (though dated) analysis also suggests that few clinical-medicine journals have a policy, or if they do it rarely mentions depositing data. About half the life-science journals have some kind of a policy about depositing data. Almost no journals have a statement of consequences.
Conclusions: kudos to Nature and Science. I’m surprised that the policies of other journals are so lax.
PMR: I am afraid I am not surprised. I don’t know about medical science but the major commercial journals have no incentive for data deposition, although a senior representative from Wiley told me they copyrighted data so they can sell it back to us.
Not sure this analysis is worth digging into more deeply. It isn’t quite where my research is headed, though I do believe the trends would be informative. If anyone else wants to use this as a starting point, have at it!
PMR: this is too important to leave at this stage. It’s something that the blogosphere – with a good wiki – could manage easily. Apart from the spammability of wikis, I’d suggest it very strongly.
{Tried to post table here, but can’t get it to display nicely}
PMR: know how it feels – blogging software is strictly text and image only.
PMR: see below:

Well done Heather. It’s hard work and often depressing. I once tried to read some publishers’ contracts on what authors and readers could do and found them incomprehensible. I thought at the time it was incompetence and out-of-date pages – now I think much of the license area contains deliberate FUD.
I’m interested in the strength of Nature:

Indexed, publicly accessible database, or, where one does not exist, to readers promptly on request. Any supporting data sets for which there is no public repository must be made available [..] any interested reader [..] from the authors directly, [..]. Such material must be hosted on an accredited independent site [..] or sent to the Nature journal at submission [..]. Such material cannot solely be hosted on an author’s personal or institutional web site.

Note that institutional web sites (does that mean repositories?) are not good enough for Nature! If you read this as a non-biologist there are precious few sites where you could deposit data. Maybe somewhere like BMC, which offers repository services. I think there is a great opportunity here for the new semantic web.
… and Science:

Large data sets must be deposited in an approved database and an accession number provided for inclusion in the published paper. Large data sets with no appropriate approved repository must be housed as supporting online material at Science, or when this is not possible, on the author’s web site, provided a copy of the data is held in escrow at Science

What Heather does not mention is what public access there is to the data. Most of the databases are biological and therefore Open.

Approved databases: Worldwide Protein Data Bank [Research Collaboratory for Structural Bioinformatics, Macromolecular Structure Database (MSD EMBL-EBI), or Protein Data Bank Japan], BioMag Res Bank, and Electron Microscopy Data Bank (MSD-EBI), Cambridge Crystallographic Data Centre {CLOSED}. GenBank or other members of the International Nucleotide Sequence Database Collaboration (EMBL or DDBJ) and SWISS-PROT. Gene Expression Omnibus; ArrayExpress.

The Cambridge Crystallographic Data Centre (no direct connection with PMR) has 350,000 entries and, the last time I enquired, allowed only 25 to be downloaded free (about 0.01%). I shall return to this later.
Science is as good as its word – there are many articles with exposed supporting info – here’s a chemistry one and it looks of high technical quality (haven’t read the science):
It doesn’t say anything about copyright and I hope that Science can confirm that they do not assert copyright. It would be extremely useful if they suggested (or required) that authors add a Science Commons license to the data. This would act as a high-profile encouragement to the others.
Nature is similar – here is
the supplemental data here has been formatted by Nature – but no copyright has been added – and again I hope that they can take the same approach I have suggested.
Whatever your views on Open Access, these two journals have made a good start on Open Data. There is a long way to go, as the data are in the dreaded hamburger PDF (molecules are destroyed by PDFisation), although plaudits to Nature Chemical Biology, which sends molecules to PubChem. We need more semantic data here, please.
Also up in PMR’s good books are Royal Soc. Chemistry and Int. Union of Crystallography, which expose all their supplemental data openly and, although muddled, effectively free of copyright.
The ACS is halfway. It does expose supplemental info, but it copyrights it (and I know from first-hand intercourse that this is deliberate).
The less satisfactory publishers are harder to be precise about as they hide their information.
Wiley – Hides data for subscribers only and copyrights them aggressively. I suspect that some data are not even required.
Springer – does not seem to manage data itself and hides those that it does get. I have written to my Springer contact asking for clarification but have not yet heard back.
Elsevier – I suspect they require little data, but have no hard evidence.
It would be EXTREMELY useful for the blogosphere to collect information on these practices. If we all do a little we could cover the whole field. And shame those who need to be shamed.
I shall write more later on supplemental data.

  1. I’m glad you found my exploration useful! I appreciate your comments, and the perspective from outside biomedicine.
    Doing my quick look wasn’t actually very time-consuming or difficult, since I only considered the “required data elements” which journals require from authors. I agree with you: trying to understand (and sometimes even find) the copyright language is painful. Thank you for discussing it in this post; you’ve helped me find a bit of traction.
    I agree. A more complete picture of both the required data elements and the data copyright restrictions across journals would be very valuable, both for authors choosing a journal for publication and scientists understanding what data they can reuse.
    If anyone does set up a wiki, please let me know, I’d love to contribute a bit. In the meantime, I can share the Google documents for editing with anyone who wants to add a few lines.
    >It would be extremely useful if they suggested (or required) that authors add Science Commons
    >license to the data. This would act as a high-profile encouragement to the others.
    Great idea.

