Supplementary Data must be published somewhere to validate the science

Dictated into Arcturus

There has been quite a lot of discussion over the last few days about the decision by the Journal of Neuroscience to stop posting supplemental data. Heather Piwowar has written a useful review of the issue (http://researchremix.wordpress.com/2010/08/13/supplementary-materials-is-a-stopgap-for-data-archiving/ ).

The issue seems to be that the journal can no longer make the effort to host the supplemental data on its website. That’s neither surprising nor unreasonable.

However, the critical responsibility of a journal, if it has any, is to make sure that the science has been properly conducted and is in principle (and hopefully in practice) reproducible. I believe it is irresponsible for a journal to publish science based on data which are not available to the reader. This is more important than any aspect of archiving or data re-use. The mainstream model of publication is that if the science is not valid it should not be published. (Models such as PLoS ONE legitimately take a different view, where debatable science can be published for open community view and review.) If the material does not warrant publication then there is little point in archiving it. If the material represents invalid science then there is little point in disseminating it.

The reactions to the Journal of Neuroscience’s decision have been mixed. DrugMonkey exults in an engaging fashion, but it is difficult to see DM’s position as responsible. It appears that DM wishes to be excused from the burden of providing the data that support DM’s work. If this is true then I can give DM no support whatever.

There is some confusion, and a suggestion that the supplemental material should be cut down and included in the material behind the publisher’s paywall. This again would be wholly regrettable. If the data are needed to support the experiment then they should be published in full. Moreover, I believe it to be retrogressive to insist that only a journal’s subscribers can have access to the data on which the experiment was based.

So if any data which are published now would not be published in the future, this is a seriously retrograde step.

Some other correspondents believe that the data should be in a repository, and I would agree with this so long as it is an Open repository and not a closed one (in access, re-use, etc.). Some members of the institutional community believe that the data should be in an institutional repository. Here, for example, is Dorothea Salo:

And this, Peter Murray-Rust, is partly why I believe institutions are not out of the data picture yet. The quickest, lowest-friction data-management service may well reside at one’s institution. It’s not to be found at a picky, red-tape-encumbered, heavily quality-controlled disciplinary data service on the ICPSR model, which is the model most disciplinary services I know of use. It’s certainly possible, even likely, that data will migrate through institutions to disciplinary services over time, and I have no problem with that whatever—but when the pressure is on to publish, I suspect researchers will come to a local laissez-faire service before they’ll put in the work to burnish up what they’ve got for the big dogs. (Can institutional data services disrupt the big-dog disciplinary data services? Interesting question. I don’t know. I think a lot depends on how loosely-coupled datasets can be. Loose coupling works better for some than others.)

Since it’s addressed to me, I’ll respond. I personally do not have an absolute “religious” divide between domain-specific repositories (DSRs) and institutional repositories, but what I am passionate about is the validation of data before and at the time of publication. A “heavily quality-controlled disciplinary data service” exists because the community of scientists wishes to know that the data on which the science is based are valid. You cannot publish a crystal structure in a reputable journal without depositing the data in a formal manner which is susceptible to validation. You cannot publish a sequence without submitting it to a database which will review the scientific basis of the data. That’s not “picky”, it’s an essential part of data-rich science. I continue to see papers published where the data are either nonexistent, incomplete, or presented in such a way that it is impossible to validate the science.

The reality is that this is a time-consuming, tedious process. If you don’t like it, don’t do science. As an example, when I did my doctorate I measured about 20,000 data points and wrote them into a book. I then typed up each data point, which appeared both in my thesis and in the full text of a paper journal. That was not unusual; everybody had to do it. The International Union of Crystallography developed this practice, and it has now become almost universal. In similar fashion the Proteomics community is insisting that data are deposited in a DSR.

I have said before that this could, in principle, be done by the library community. But it isn’t being done, and I see no signs of it being done in the future. It is much harder to do in the library community because the data are dispersed over many individual repositories, and there is no technology at the moment that creates an effective federation. It is possible to conceive of a world where data went straight into institutional repositories, and we are working on a JISC-funded project to investigate this. But it is essentially unknown at present. If IRs wish us to change the model then they are going to have to present a believable infrastructure very shortly.

As an example, I have appealed for a place where I can put my results from the Green Chain Reaction (#solo). This is a valid scientific endeavour (in the area of molecular informatics) and its data requirements are modest compared with those of many other scientific disciplines. There are perhaps 1000 institutional repositories worldwide and I’d be delighted if one of them stepped forward and offered to host the data. Host, not archive. I’m not asking for any domain expertise, simply for ingesting and holding an HTML web page tree.

If IRs can help with the problem of supplementary data they will have to act very quickly.

And if all this leads to less, not more, data being published Openly that’s a disaster.


One Response to Supplementary Data must be published somewhere to validate the science

  1. rpg says:

    Well said, Peter. I was hoping somebody respectable (i.e. not me) would defend supplementary info. Of course, when the data are essential to the conclusion of the paper, they should be in the paper. I’d like to see someone try to publish data factors in the primary literature…
