What is Data and what should be Open?

I got a long and thoughtful comment from Steve Bachrach on Open Data which contains important points.

Great post on an important topic. I'm just going to throw some random thoughts together here.

One of the problems in this area (copyrights, databases, patents) is that the governing legal rules differ from country to country. I applaud the notion of generating a chemistry community – maybe even a science community – set of best practices. This can, I hope, cross boundaries.

PMR: Yes. That’s why the OpenKnowledgeFoundation and ScienceCommons efforts are critical. They understand the legal issues. Talis/Jordan_Hatcher donated legal services to address this.

I worry about the need to claim that data is open, as in a sense diminishing what is rightfully there from the start. My understanding is that data cannot be copyrighted here in the US. While the statement “the melting point of benzene is 5 °C” can be copyrighted, the underlying fact is in the public domain and one can freely use that information for any purpose. The same is true about spectra – the IR image of the absorption of benzene is copyrightable (how one displays data can be a creative effort – see below) but the fact that benzene does not absorb at 1900 cm⁻¹ is data and can be freely re-used.

PMR: This is a logical position and one I used to take. However, many scientists do not know or appreciate this. Moreover, many publishers will stick copyright notices on this type of material. Hence it is important for the author to assert that all data is Open.

As part of best practices I would like to see journals insist that all data used in preparing the article be submitted as part of the article. The data should then be made available for re-use with no limitations and inhibitions (i.e. no cost). Currently, that would be as supporting materials but I hope that in the future this data might be more intimately connected with the article itself. In an ideal world, data should be packaged in a way to facilitate re-use, but that is something I am willing to let the world grow into slowly.

PMR: this is the position I have consistently taken for several years. I am also working with my colleagues and Microsoft and JISC to develop the tools to make it effortless in the chemistry community.

I think you need to be very careful about the notion to Assert that Data covers images and tables and other ways of representing data. Representation of data can very much be a creative process and should be copyrightable. For example, a landscape photographer is simply taking an image of data – the way the countryside appears on a particular date and time. Let that photographer be Ansel Adams, and he obtains a representation of data that is coloured by the artist’s perceptions and biases and influences and creates a product that I call art – and that image should be protected. A chemist may do the same thing with an array of data to create a plot that clearly and distinctly manipulates the data to make a point. In such a case, I would protect the image (the plot) with copyright and insist that the underlying data (such as the Excel file) be published simultaneously within the supporting materials and that data is open in the public domain. Things get a bit tricky here too – does the photo of a gel run by a biochemist count as data or as a creative image? I think the former, but I'm open to discussion.

PMR: This is a critical point and I’ll stress it in my talk tomorrow.

The primary issue is whether the author wishes to copyright the image and whether Community Norms support this copyrighting. The primary problem is that the author often hands over copyright of all images to the publisher who has put no creative effort into them. I would argue the following:

  • There are many areas of science where images and tables are the natural way of communicating the science. The alternative of words or unorganised numbers is often counterproductive. In these cases the Community Norms should urge the author to use appropriate PDDL or CC0 methods to protect the images, while asserting that they belong to the community.

  • There are cases where the author puts creative effort into the image (and, to a lesser extent, the table). In most cases that is so that the reader understands the underlying data better. I suspect that few authors would wish to prevent others reproducing their work as long as attribution was prominent; indeed they would encourage it. The problem is that copyright in publishers’ hands leads to reduced communication.

To do this the author should automatically stamp all images and tables with their authorship and their OpenData intention. This will become easier once tools support it. I have argued that all computational chemistry programs should emit an open data message by default, with a runtime/command-line switch to remove this if required. Similarly, I intend that our Chem4Word tool automatically stamps OpenData on every molecular structure unless disabled. This does NOT mean the molecule cannot be patented; it only means that when released the image or connection table can be freely used.
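As a rough illustration of the "stamp by default, switch off on request" idea, here is a minimal Python sketch. Everything in it is hypothetical: the notice wording, the `--no-open-data` and `--author` flags, and the function names are assumptions made for the example, not the behaviour of Chem4Word or of any existing chemistry program. The single result value is the benzene melting point quoted in the comment above.

```python
import argparse

# Hypothetical notice text; the actual wording would be set by community norms.
OPEN_DATA_NOTICE = "# OpenData: the data below may be freely re-used"


def build_parser() -> argparse.ArgumentParser:
    """Command-line interface for a toy program that stamps its output."""
    parser = argparse.ArgumentParser(description="Toy program with an Open Data stamp")
    # The stamp is ON by default; the user has to opt out explicitly.
    parser.add_argument(
        "--no-open-data",
        dest="open_data",
        action="store_false",
        help="suppress the Open Data notice (assumed flag name)",
    )
    parser.add_argument("--author", default="unknown", help="name recorded in the stamp")
    return parser


def render_output(results: dict, author: str, open_data: bool = True) -> str:
    """Format results as text, prefixed with authorship and the notice if enabled."""
    lines = []
    if open_data:
        lines.append(OPEN_DATA_NOTICE)
        lines.append(f"# Author: {author}")
    for name, value in results.items():
        lines.append(f"{name} = {value}")
    return "\n".join(lines)


def main(argv=None) -> str:
    """Parse arguments and return the (possibly stamped) output text."""
    args = build_parser().parse_args(argv)
    # Illustrative value only: the melting point quoted in the comment above.
    results = {"benzene_melting_point_C": 5}
    return render_output(results, args.author, args.open_data)
```

Calling `main([])` returns the stamped output, while `main(["--no-open-data"])` returns only the data line. The design choice is that openness is the default and withholding it is the deliberate act, rather than the other way round.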

The same should be done for gels, pictures of cells, movies of mice, etc. There is no moral reason for publishers to own this. It’s the fundamental information infrastructure of science.

Bottom line is I fully support an effort to create best practices and think that the Panton Principles are a great starting point!

PMR: thanks
