Open Data is critical for Reproducible Research

I have been contacted by a group in Lausanne working in the audiovisual area who want to publish their work so it can be reproduced. Here's their outline and links to protocols:

Reproducible Research

In our lab, we try to make our research reproducible. This means that all the results from a paper can be reproduced with the code and data available online. A more detailed motivation why we believe this is important can be found here. We also give a detailed description of the procedure to follow when making your publication reproducible.

Reproducible papers from our lab

News

Join us in a discussion about the use of reproducible research and how to make it work on our reproducible research forum!

For the latest news on reproducible research, please have a look at our RR Blog!

We have organized a special session at ICASSP 2007 about reproducible research (co-organized with Mauro Barni and Fernando Perez-Gonzalez). We had great talks, an interested audience, and some good discussions!

Motivation

After a colleague asks about a paper you wrote, you spend a considerable amount of time tracking down the program files you used for that paper, not to mention the time needed to recover the set of parameters that produced that nice result.

Because this type of situation sounded all too familiar to many people in the lab, we are now trying to make our research reproducible. Most of the ideas about reproducible research come from Jon Claerbout and his research group at Stanford University. We believe reproducible research can be helpful in many ways:

  • First of all, it helps us: to reproduce figures when revising a paper, to recreate earlier results at a later stage of our research, etc.
  • Other people who want to do research in the field can really start from the current state of the art, instead of spending months trying to figure out exactly what was done in a certain paper. It is much easier to take up someone else's work if documented code is also available.
  • It greatly simplifies the task of comparing a new method to existing methods. Results can be compared more easily, and one can also be sure that the implementation used for the comparison is the correct one.

This may all sound very trivial, and in discussions with colleagues, there was general agreement that this is how research should be performed. In practice, however, only a few examples are available today. Making articles reproducible does require a certain investment of time, but we think it is worth it. The benefit is hard to quantify, but from download statistics and Google rankings, we can see that it really pays off!

How to make a paper reproducible?

Of course, it all starts with a good description of the theory, algorithm, or experiments in the paper. A block diagram or a pseudo-code description can work wonders! Once this is done, make a web page containing the following information:

  1. Title
  2. Authors (with links to the authors' websites)
  3. Abstract
  4. Full reference of your paper, with current publication status, and a PDF of your paper
  5. All the code to reproduce all the results, images and tables. Make sure all the code is well documented, and that there is a readme file explaining how to execute it
  6. All the data (images, measurements, etc.) to reproduce all the results, images and tables. Add a readme file explaining what the data represent
  7. A list of configurations on which you tested your code (software version, platform)
  8. An e-mail address that people can use for comments and remarks (and to report bugs)
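As an illustration only, the eight items above could be checked mechanically before a page goes online. The sketch below is not part of the lab's procedure; the key names and the `missing_items` helper are hypothetical, chosen to mirror the list.

```python
# Hypothetical keys mirroring the eight required items listed above.
REQUIRED_ITEMS = [
    "title",
    "authors",
    "abstract",
    "reference",       # full reference, publication status and PDF
    "code",            # documented code plus a readme
    "data",            # data plus a readme
    "configurations",  # tested software versions and platforms
    "contact_email",
]


def missing_items(page):
    """Return the required items that are absent or empty in `page` (a dict)."""
    return [item for item in REQUIRED_ITEMS if not page.get(item)]


# Example: a draft page that still lacks data, configurations and a contact.
draft = {
    "title": "An example paper",
    "authors": ["A. Author"],
    "abstract": "Short summary.",
    "reference": "A. Author, An example paper, submitted.",
    "code": "code.zip",
    "data": "",
}
print(missing_items(draft))
```

A page would be ready to publish, in this sketch, once `missing_items` returns an empty list.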

Depending on the field in which you work, it can also be interesting to add the following (optional) information to the web page:

  1. Images (add their captions, so that people know what Figure xx is about)
  2. References (with abstracts)

For every link to a file, add its size between brackets. This allows people to skip large downloads if they are on a slow connection. For examples, see the list of reproducible papers above. Note that we are currently working on an automated setup using EPrints to simplify this process. Keep an eye on this webpage!
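The file-size annotation suggested above is easy to generate automatically. The following is a minimal sketch, assuming sizes are shown in parentheses and in units of 1024 bytes; the `human_size` and `link_label` helpers are hypothetical and are not part of the EPrints setup mentioned here.

```python
def human_size(num_bytes):
    """Format a byte count as a short human-readable string."""
    units = ["B", "kB", "MB", "GB", "TB"]
    size = float(num_bytes)
    for unit in units:
        # Stop at the first unit where the value is below 1024,
        # or at the largest unit we know about.
        if size < 1024 or unit == units[-1]:
            if unit == "B":
                return f"{int(size)} B"
            return f"{size:.1f} {unit}"
        size /= 1024


def link_label(name, num_bytes):
    """Return the text of a download link with its size appended."""
    return f"{name} ({human_size(num_bytes)})"


print(link_label("code.zip", 2048))
print(link_label("data.zip", 5 * 1024 * 1024))
```

For files on disk, the byte count could be taken from `os.path.getsize(path)`.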

Other reproducible research

Reproducible electronic documents: Jon Claerbout and his colleagues at the Stanford Exploration Project initiated (to our knowledge) the discussions about reproducible research.
Wavelab: David Donoho and his colleagues at the Stanford Statistics Department developed Matlab code to reproduce their results on wavelets.
Sweave Demo: a demo of Sweave, a package to do literate programming and good documentation using the statistical software R, by Charlie Geyer.
Reproducible Neurophysiological Data Analysis: a page by Christophe Pouzat on reproducible research in neurophysiology using R and Sweave.

This is wonderful to hear. I've been blogging about Jean-Claude's Useful Chemistry, where he is trying to expose chemistry as it is done, and about our own new BlueObelisk community project on analysing the Openness of data in chemistry. Realistically, most groups will want to publish data and software retrospectively, and we need to get into the habit. However, it's a challenge. Even in an ideal world where the publishers were actively trying to help publish reproducible science, it would be very difficult. With a large number of influential publishers working against the Open publication of data, it's harder than that, but not impossible.

So how to go forward? I think a useful way will be to create metrics for the reproducibility of the science in a given paper. A "reproducibility" score - rather like a "readability" or "accessibility" score. Obviously this is harder in laboratory subjects and even harder in observational ones, so let's start with software, data and informatics.

In general, bioinformatics experiments would have a high reproducibility score. The data are openly available in EBI, NCBI and elsewhere. Most of the programs are open source. It may not yet be possible for a reviewer or reader to "push this button to repeat study", but it's not inconceivable.

Computational linguistics is also a good discipline for reproducibility. You are required to make your corpus available, and your annotation scheme, and your software should be open.

Chemoinformatics is awful. You can use a set of molecules without specifying what they are in detail, use a commercial program (probably with irreproducible versioning) that most people can't afford, use a non-portable machine-learning algorithm, and fail to deposit your protocol for selecting data points. As a result there is no check on anything other than the trustworthiness of the humans involved.

So I suggest we use the term "potentially reproducible" to describe work which, in principle, a third party "reader" could reproduce with only the data and software described in the paper. We call anything else "irreproducible", even if we believe it. I'd welcome guidance on this. Then we could create a "PRS" - "potentially reproducible score".
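To make the idea concrete, here is one possible sketch of such a score. The criteria and their equal weighting are my assumptions for illustration, loosely taken from the chemoinformatics failings listed above; the names `Paper`, `prs` and `potentially_reproducible` are hypothetical and do not define the PRS.

```python
from dataclasses import dataclass, fields


@dataclass
class Paper:
    # Hypothetical yes/no criteria, one per failing mentioned above.
    data_available: bool                # the exact data set is specified and obtainable
    software_available: bool            # a third party can obtain the software
    software_version_stated: bool       # the version used is identified
    algorithm_portable: bool            # the method can be run elsewhere
    selection_protocol_deposited: bool  # the data-point selection protocol is deposited


def prs(paper):
    """Return the fraction of criteria met, between 0.0 and 1.0."""
    values = [getattr(paper, f.name) for f in fields(paper)]
    return sum(values) / len(values)


def potentially_reproducible(paper):
    """Under this sketch, a paper qualifies only if every criterion is met."""
    return prs(paper) == 1.0


open_study = Paper(True, True, True, True, True)
closed_study = Paper(True, True, False, False, False)
print(prs(open_study), potentially_reproducible(open_study))
print(prs(closed_study), potentially_reproducible(closed_study))
```

A real score would need agreed criteria and probably weights per discipline; the equal weighting here is only a starting point for discussion.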

I think that we would find many domains where people would value this. I know many people who have tried to reproduce garbage science. It can ruin their careers.

Before criticising publishers I invite them to take an active view on this. But it is difficult to reproduce an experiment which only a few hundred people can read, where the data are copyrighted and where you have to subscribe to an expensive databank.
