Publishing Data: The long-tail of science

I am going to explore aspects of “publishing data” in STM disciplines, and this will probably run to several posts. This will specifically cover the “long-tail” rather than “big science” (such as high-energy physics, satellite surveys, climate models, sky surveys, etc.). In big science the data are often collected as part of a large project which has a data management process and specific resources for it. These data sets are often huge (petabytes) and require special resources to manage them. I also exclude data collected in major facilities such as neutron or X-ray sources, as these usually have good support for data.

By contrast, long-tail science represents the tens of thousands of ordinary laboratories – often “wet” – where data are collected as part of the experiment but where there is no central planning of their projects. They range over biomedical science, chemistry, materials, crystallography and other disciplines. There is no clear borderline but we’ll use these broad categories. The amount of data and its heterogeneity vary hugely.

We are now seeing a major cultural change where many funders and some scientific societies are pushing for the publication of data. There are several motivations, and I have discussed some of them in this blog, but they include:

  • A resource to validate the experiment and to prevent careless or fraudulent science
  • A resource for the community to reuse, through aggregation, mashups and to act as reference data

The problem is that there is no simple universal way to do this. The bioscience community has pioneered this – NCBI/PubMed, EBI and other data centres have many databases and some disciplines require deposition. Although these are (inter)national centres, there is a continuing concern about funding, as it normally has to come from specific grants made for fixed periods. There have been times when key resources such as Swiss-Prot and PDB have had very uncertain futures.

One common model has been to associate data publication with “fulltext” publication as supplemental/supporting data/information. This has been financed – in many cases for over 15 years – by marginal costs within the conventional model – either through author-side fees or subscription. This is possible because the actual costs of long-tail data are also marginal:

  • They are not normally peer-reviewed (they *should* be, but that’s a different matter)
  • They can be trivially transformed into flat files which require little management and where the data are opaque
  • The technology for publishing them is simpler than the main fulltext – indeed almost costless.
  • The actual physical storage costs almost nothing
  • There is no expectation of sustainability beyond that expected for the fulltext

Kudos here (in chemistry at least) goes mainly to society and/or open-access publishers. Nature also. But not Elsevier, Wiley and Springer, which seem to have less commitment to maintaining the data record of science. There is a lot of illogicality – Journal of Neuroscience killed its supplemental information while the proteomics community followed Mol. Cell. Proteomics and insisted on data publication (in a repository). The Am. Chem. Soc. requires crystal structure data but refuses spectra on the basis that it is not a data repository.

Over the last few days I have had conversations and mail which suggest there is a groundswell of people wishing to solve this problem – I reiterate, for the “long-tail”. We’ve discussed, and will continue to discuss, Figshare. Here’s an extremely encouraging meeting run by Iain Hrynaszkiewicz two months ago, whose report has just been published. The attendees were: Alex Ball (UKOLN), Theo Bloom (Public Library of Science), Diane Cabell (Oxford Internet Institute), David Carr (Wellcome Trust), Matt Cockerill (BioMed Central), Clare Garvey (Genome Biology), Trish Groves (BMJ), Michael Jubb (Research Information Network), Rebecca Lawrence (F1000), Daniel Mietchen (EvoMRI Communications), Elizabeth Moylan (BioMed Central), Cameron Neylon (Science and Technology Facilities Council), Elizabeth Newbold (British Library), Susanna Sansone (University of Oxford), Tim Stevenson (BioMed Central), Victoria Stodden (Columbia University), Angus Whyte (Digital Curation Centre) and Ruth Wilson (Nature Publishing Group). Notice the wide spread of interests, so this groundswell should be taken seriously. Read the report first. I’ll just highlight a little and add my own comments:

Goal 1: Establish a process and policy for implementing a variable publishers’/authors’ license agreement, allowing public domain dedication of data and data elements of scientific articles

PMR: Critical. Solving the licence problems before you start saves years later (this has been a main problem with “Open Access”). At least today no-one is arguing that data are intellectual property that belongs to one or more of the parties and around which walled gardens and paywalls could be constructed.

Goal 2: Consensus on the role of peer reviewers in articles including supplementary (additional) data files

PMR: This is objectively difficult and needs a consensus approach. Some reviewers, especially those who suspect the validity of the publication, will wish to use the data as a check. But there are often problems of software and expertise – how many non-chemists would understand CIF, Mol, CML or JCAMP files? How many can read MATLAB files without the software?

Goal 3: Sharing of information and best practices on implementation of journal data sharing/deposition policies

PMR: There will probably be a variety of ad hoc solutions – no single approach suits everyone.

Recent conversations suggest that many funders and publishers are exploring data publication as a first-class citizen. I’ve had talks about

  • data-only journals
  • publications where the data and full-text occur side by side
  • publication as a continuous record of the science
  • domain-specific repositories

and more.

The point is that the world is now starting to change and the traditional publication model is now seen as increasingly anachronistic. Journals have declining importance except for publishers to brand and sell metrics whereas data repositories are seen as exciting and on the increase. When academia finally adjusts to data-as-a-metric then there will be a rush away from journals and towards repositories.

And these repositories will, I hope and expect, be run in a better, more useful fashion than the outdated journal model. They won’t be cost-free, but they won’t be expensive (recent conversations suggest a small fraction of current publication costs).

It won’t be easy or immediate. It will probably take ten years and end up as a complicated heterogeneity. But that is what the current century has to tackle and solve. It could return democracy to the practising scientists.

In later posts I’ll address the fundamental things I would like to see in data repositories.
