petermr's blog

A Scientist and the Web

 

More on Open Data for molecules

Antony Williams (Chemspider) has engaged in a useful discussion about the various aspects of access to scientific data: Acting as a Community Member to Help Open Access Authors and Publishers
There are some valuable points emerging where we start to be clear what we agree on and where we differ. (I am also aware that there are some implications for the definition of Open Access which I cannot comment on for a few days):

(AW) Recently I posted about our intention to post the full Molbank articles on ChemSpider. PMR commented on my potential over-extension of their Open Access nature: “PMR: I also support publishers who make their material available. I don’t want to appear churlish, but Molbank use what is effectively a NC (non-commercial) license and this is what concerned me (and others) when I posted about 1 year ago. I don’t think it has changed. So sorry, Antony, it’s not “as Open Access as they can be” especially if one has to ask permission to mount the material.” He may be right.

PMR: If we take the public definitions of Open Access (often called “BBB”) then I think all the OA community will agree that NC is not as Open as possible. I’ll be writing more in a few days…

What I do know is that I prefer to get into relationship with the groups/people I work with in the community. Simply grabbing their content/data without some connection doesn’t feel comfortable. AND, I realize in these days of search engines and scraping that’s quite acceptable.

PMR: This is a very valuable point. Yes, simply taking material without asking (and having it taken) “doesn’t feel comfortable”. I’ve felt that on several occasions – when I have exposed my/Open Source (sic) BO code (e.g. in Bioclipse) and seen it taken and developed for commercial use I’ve felt agitated. But I know this is part of the process and if you can’t live with Open approaches then you take a different approach. I have also been careful on occasion not to take Open material and carry out bulk download and transformation without alerting the owner (and there normally is an owner or a proxy owner). For example we have long and continuing discussions with the International Union of Crystallography, Royal Society of Chemistry and the American Chemical Society. IUCr’s Acta Crystallographica E is now Open Source. In principle we could download all the papers and mount them on our site. We could even sell them. But we wouldn’t do that (although others might). And there could be a trademark issue even though there wasn’t a licence issue – we’d need to make sure we weren’t purporting to be the definitive site. Note that much of this issue springs from current practice in chemistry. In bioscience it is common and often de rigueur to make your data available to the databanks. The issue of “ownership” of sequences and genes has been publicly fought over for 3 decades and I doubt that authors would be valued if they added non-commercial (NC) tags to their data.

When I approached MDPI, the publishers of Molbank, they were gracious in their willingness to have ChemSpider support, integrate and utilize their content. This is contrary to some of my experiences with some other advocates of Open Data and Open Access where trying to get their “Open Data” is like pulling teeth. MDPI appear to be the opposite, in my experience.

PMR: If something is Open Data or Open Access then you have a legal right to download it. You don’t have to ask permission. I can’t understand what you are talking about unless it’s a veiled reference to the technical issues in downloading CrystalEye data (see below). If it’s anything else then please let me know and I’ll take it up with them.

I commented on Peter’s blog tonight: “Regarding your comment “especially if one has to ask permission to mount the material.” I think that’s a comment on the fact that I asked permission? I asked permission for the reason that I am focused on building a community for chemists and this includes me staying in relationship with publishers. I think you know this about me from my previous comments about CrystalEye

PMR: I am also interested in building a community for chemists (Blue Obelisk). We also mount data under various data and code licences. Anyone can download it without our permission.

“http://www.chemspider.com/blog/intention-to-scrape-crystaleye-content-and-staying-in-relationship-with-publishers.html”

AW: I judge it’s a better way to Build the Structure Centric Community for Chemists on ChemSpider.

PMR: That’s great. I think ChemSpider has moved a long way and is fulfilling a useful role. Alone of the chemical aggregators you have taken on board the issues of Open Data – as a result of engagement with the OA community. You have a commercial site which uses Openness as part of its business model – that’s fine – it’s part of Web 2.0. I have a different agenda – it’s not incompatible – just different.

CS: So, while I didn’t have to ask for permission, I did. The result was an excellent exchange, newfound relationships and an opportunity to build an enhanced relationship WITH support and permission.

PMR: Again no problem. But the scale of the problem means that it is impossible for individuals to engage with all publishers. First, many of them simply fail to answer (this is one of the main problems – many publishers are simply not trying). Even in chemistry there are ca 60 “Open Access” journals. We started going through them and found that the scale was simply too large. However we did make some progress – for example Libertas Academica had an NC licence – we pointed this out – and they immediately changed it with enthusiasm. I did the same with Molbank and they made it clear that they understood the issue and didn’t want to change it. We know where we are.

PMR: You want Molbank’s data – fine. You and they are happy to share it. I imagine you can negotiate how it can be used in your database. What I assume you can’t do is then to release the data as Open. So, for example, I cannot take data from ChemSpider or Molbank and put it in a repository where it can be used for commercial purposes. By contrast CrystalEye can be used for commercial purposes.

PMR: A major aspect of this is scale. You’ve spent 6 months in discussion with one publisher (ACS) about their supplemental data (supporting info). They are not prepared to assert that you can use it and redistribute it. In contrast I have taken it on the assumption that I am legally allowed to use it and post it as Open Data. You don’t feel this enables you to re-use my data, so you see immediately the effect of the anticommons (Open Source and the Tragedy of the Lurkers). If data are Open Data (CC-BY or CC0) then these problems disappear immediately. Individual negotiations between parties don’t scale; Open Data does.

CS: Many bloggers it appears assume that “concerned parties” read their blogs. For example, when you posted this: http://wwmm.ch.cam.ac.uk/blogs/murrayrust/?p=1048 did you make the editors at Molbank aware of the error or did you just scrape their content and blog?

PMR: No I didn’t and I don’t intend to. There is at least one error in almost every chemical paper I read. The same is true for biology. The quality of the actual publication of data in science is almost universally poor. There are communities (crystallography, thermochemistry, astrophysics, protein sequence, atmospherics, etc.) where there are some excellent initiatives which chemistry would do well to emulate. But they are almost all initiated by the community and often by no means welcomed by the publishers.

CS: I have adopted a new approach of late – when I see issues with people’s data, websites etc. I inform them directly to help them clean up errors. I’ve done this for Drugbank, PubChem, a number of blogsites, and so on. In case you didn’t inform them I will send them your blog link tonight…also to the original author since I’m sure they will appreciate it too. This, I believe, is being a member of the community and since the authors and the publishers are taking actions to contribute to the Open Access community it’s part of my personal charge to help.” I have sent an email to the original author and to the MDPI editors with the hope they might clean the article or post an Erratum. This is what I feel is appropriate as an active member of the community. If you see errors on ChemSpider please do let us know directly. We have an “Add: Feedback” on every record page and do pay attention to your input.

PMR: This is a potentially useful approach – I don’t know whether it scales and how you measure the quality. We are also introducing a feature on CrystalEye where we use Connotea (and, yes, we’ve discussed it with friends at Nature) as an annotation mechanism. Note that we could have done this without consulting Nature but it is possible that it would swamp Connotea and we didn’t want to do that.

PMR: BTW you often use the word “curating” for your activities and I suggest you really mean “annotation”. Digital Curation is about preservation of material and the historical record, not making assertions (however valid) about “correctness”. (Chris Rusbridge reads this blog and I’m sure he’ll be happy to engage with you).

PMR: I’ve written enough so I will tackle your concerns about CrystalEye later. Please let me know if that is your only concern or there are other sites. Note, to start with, that it isn’t trivial to make data sets easily available whatever the motivation. Open Data states:

As set out in the Open Knowledge Definition, knowledge is open if “one is free to use, reuse, and redistribute it without legal, social or technological restriction.”

There are no such restrictions on CrystalEye. It may not be in the form you would like it, and I’ll address that later. But you suggest – wrongly – that I and others deliberately make it difficult to access the data. That is not true.

2 Responses to “More on Open Data for molecules”

  1. Chris Rusbridge says:

    Peter, you wrote: ‘PMR: BTW you often use the word “curating” for your activities and I suggest you really mean “annotation”. Digital Curation is about preservation of material and the historical record, not making assertions (however valid) about “correctness”. (Chris Rusbridge reads this blog and I’m sure he’ll be happy to engage with you).’ In fact this meaning of ‘curating’ is very common in the bio-sciences, and probably pre-dates our own use by several years. A “curated database” as a database of annotations is very common. I’ve been meaning to write a blog entry about it.

    I think Chemspiderman also uses the term to mean something like “perform quality control”. I don’t have the reference, but I remember seeing comments like “we have been curating the chemical information in Wikipedia”.

    I think both uses fit within our definition: “Digital curation is maintaining and adding value to a trusted body of digital information for current and future use; specifically, we mean the active management and appraisal of data over the life-cycle of scholarly and scientific materials.” In these cases, the stress is on “adding value”!

  2. pm286 says:

    (1) Many thanks Chris – I’m happy to use both uses.
