I normally don't like blogging more than twice a day, but sometimes it's inevitable. (People sometimes suggest I blog too much, but there is so much we have to change and such a short time that I take the risk). This is in response to Mat Todd, a champion of Open Source drug discovery (likely to be an increasingly common theme here). He asked for my thoughts on his blog post http://intermolecular.wordpress.com/2011/08/07/raw-data-in-organic-chemistry-papersopen-science/ which starts:
Open science is a way of conducting science where anyone can participate and all ideas and data are freely available. It's a sensational idea for speeding up research. We're starting to see big projects in several fields around the world, showing the value of opening up the scientific process. We're doing it, and are on the verge of starting up something in open source drug discovery. The process brings up an important question.
I'm an organic chemist. If I want people to get involved and share data in my field I have to think about how to best share those data. I'm on the board of more than one chemistry journal that is thinking about this right now, in terms of whether to allow/encourage authors to deposit data with their papers. Rather than my formulating recommendations for how we should share chemical data, I wanted to throw the issue open, since there are some excellent chemistry bloggers out there in my field who may already have well-founded opinions in this area. Yes, I'm
I won't quote the whole post, but here are my interspersed replies. If you aren't very chemical, skim through quickly…
Mat, great post – answering various points:
>Mat>Open science is a way of conducting science where anyone can participate and all ideas and data are freely available. It's a sensational idea for speeding up research. We're starting to see big projects in several fields around the world, showing the value of opening up the scientific process. We're doing it, and are on the verge of starting up something in open source drug discovery. The process brings up an important question.
I am excited about the OSDD effort(s) and think there is a lot of Open technology they can use.
>Mat>I'm an organic chemist. If I want people to get involved and share data in my field I have to think about how to best share those data. I'm on the board of more than one chemistry journal that is thinking about this right now, in terms of whether to allow/encourage authors to deposit data with their papers.
Many already do "require" PDFs. There is no agreed way of doing it, but if what you mean is depositing JCAMPs then YES. The OS community can hack any variants.
>Mat>1) You have to save the data and then upload them. Well, this was a problem in 1995, but not now.
agreed – trivial in time and size of files
>Mat>2) The data files are large. Not really. A 1H NMR spectrum is ca. 200KB.
>Mat> 3) It's a pain. Yes, a little. But we must suffer for things we love.
>Mat>4) People might find mistakes in my spectra/assignments. Yes. You're a scientist. This is a Good Thing.
Yes – and some bad chemistry has been detected and corrected
>Mat>An important fact: For many papers, supporting information is actually public domain, not behind a paywall along with the rest of the paper. The ACS, for example, would, by posting raw data as SI, allow the free exchange of raw spectroscopic data. That would be neat.
The ACS requires CIFs and I congratulate them. If they could just extend that to JCAMPs and computational logfiles, that would almost solve everything.
>Mat>1) X-ray crystallography. This is the exception. Data are routinely deposited raw, and may be downloaded. Not always the case, but XRD blazes a trail here.
True for all OA journals (though there is not much crystallography there except IUCr ActaE). RSC, IUCr and ACS require CIFs (Applause). Wiley, Springer and Elsevier do not publish this supplemental data. It is only available from CCDC, and then not in bulk without a subscription.
>Mat>2) NMR spectroscopy. The big one. IUPAC recommends the JCAMP-DX file format. Jean-Claude Bradley has been a proponent of this format, and has demonstrated how it can be used in all kinds of applications. We've played with it, and in one of our recent papers we deposited all the NMR data in this format in the SI. We've been posting JCAMP-DX files in our online electronic lab notebooks, e.g. here. My opinion of this file format (both generating it, and reading it) has not been great. There are two formats, I understand, and we found that if we saved the data in the wrong format, we couldn't read the data with certain programs, but could with others. i.e. we had to get the generation of the file just right.
Don't fully understand this. There are actually several formats but the OpenSource software reads all of them. CML-Spect supports these and is readable by JSpecview. This need not be a problem if people have the will to solve it.
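To make the "several formats" point concrete, here is a minimal Python sketch of reading the simplest JCAMP-DX variant: labelled header records plus an uncompressed `XYDATA=(X++(Y..Y))` table. It is not taken from JSpecView or any other existing library. The function name `parse_jcamp_affn` and the sample file are invented for illustration, and the sketch deliberately refuses the compressed forms (SQZ, DIF, DUP) and exponent notation rather than risk misreading them.

```python
import re

# One plain ASCII number such as 400, -1.5 or .25. Exponents are left out on
# purpose: the letter E is also a digit code in the compressed forms.
NUMBER = re.compile(r"[+-]?(?:\d+\.?\d*|\.\d+)")

# Hypothetical file, written by hand for this example.
SAMPLE = """
##TITLE=hypothetical sample
##JCAMP-DX=4.24
##DATA TYPE=INFRARED SPECTRUM
##XUNITS=1/CM
##YUNITS=TRANSMITTANCE
##XFACTOR=1.0
##YFACTOR=0.01
##FIRSTX=400
##LASTX=405
##NPOINTS=6
##XYDATA=(X++(Y..Y))
400 98 97 95
403 90 92 96
##END=
"""


def normalise_label(label):
    """Labels are compared ignoring case, spaces, '-', '/' and '_'."""
    return re.sub(r"[\s\-/_]", "", label).upper()


def parse_jcamp_affn(text):
    """Return {"header", "x", "y"} for one uncompressed XYDATA table."""
    header, raw_y, in_table = {}, [], False
    for raw_line in text.splitlines():
        line = raw_line.split("$$", 1)[0].strip()  # "$$" starts a comment
        if not line:
            continue
        if line.startswith("##"):
            label, _, value = line[2:].partition("=")
            label = normalise_label(label)
            in_table = label == "XYDATA"
            if not in_table:
                header[label] = value.strip()
            continue
        if in_table:
            leftover = NUMBER.sub("", line).replace(",", "").strip()
            if leftover:
                raise ValueError("compressed or unknown data form: " + line)
            values = [float(m) for m in NUMBER.findall(line)]
            raw_y.extend(values[1:])  # first value on a line is the X check
    n = int(float(header["NPOINTS"]))
    if len(raw_y) != n:
        raise ValueError(f"expected {n} points, found {len(raw_y)}")
    first, last = float(header["FIRSTX"]), float(header["LASTX"])
    step = (last - first) / (n - 1) if n > 1 else 0.0
    y_factor = float(header.get("YFACTOR", "1"))
    return {
        "header": header,
        "x": [first + i * step for i in range(n)],  # X rebuilt from the header
        "y": [v * y_factor for v in raw_y],         # Y scaled by YFACTOR
    }


spectrum = parse_jcamp_affn(SAMPLE)
print(spectrum["header"]["TITLE"], len(spectrum["x"]), "points")
```

A file saved in one of the compressed forms fails here with a `ValueError`, which is the kind of "could read it in one program but not another" behaviour Mat describes. A complete reader would decode those forms as well.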
>Mat>I don't know if people have experience of this. I was in touch with one of the ACS journals recently, who indicated that their view was that the journal is not a data repository, and that posting of raw data (which was in their view to some extent desirable) should be posted elsewhere, e.g. to an institutional repository. This is an option. I think it's less convenient. PLoS seem happy to host the data.
I have an idea, which I think will fly. [see below]
>Mat>3) IR data. Don't know if there is a standard. If the file is small, saving raw data could be encouraged. Would allow easy comparisons of fingerprint regions.
JCAMP will hack this
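As a rough illustration of what "easy comparisons of fingerprint regions" could look like once raw IR data are deposited, here is a small Python sketch. It makes several assumptions: the fingerprint region is taken as roughly 500 to 1500 cm⁻¹, each spectrum is a pair of lists with ascending wavenumbers, and similarity is measured with a plain Pearson correlation. The function names and the synthetic spectra are made up for the example, and a real comparison would need baseline correction and normalisation first.

```python
import math
from bisect import bisect_right


def interpolate(xs, ys, x):
    """Linear interpolation; xs must be ascending and must bracket x."""
    if not xs[0] <= x <= xs[-1]:
        raise ValueError(f"{x} lies outside the measured range")
    i = min(bisect_right(xs, x), len(xs) - 1)
    x0, x1 = xs[i - 1], xs[i]
    t = (x - x0) / (x1 - x0)
    return ys[i - 1] + t * (ys[i] - ys[i - 1])


def fingerprint_similarity(spec_a, spec_b, lo=500.0, hi=1500.0, step=1.0):
    """Pearson correlation of two (xs, ys) spectra on a shared grid."""
    grid = [lo + k * step for k in range(int(round((hi - lo) / step)) + 1)]
    a = [interpolate(spec_a[0], spec_a[1], x) for x in grid]
    b = [interpolate(spec_b[0], spec_b[1], x) for x in grid]
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    cov = sum((p - mean_a) * (q - mean_b) for p, q in zip(a, b))
    var_a = sum((p - mean_a) ** 2 for p in a)
    var_b = sum((q - mean_b) ** 2 for q in b)
    if var_a == 0 or var_b == 0:
        raise ValueError("a flat spectrum has no defined correlation")
    return cov / math.sqrt(var_a * var_b)


def synthetic_band(centre, spacing, width=20.0):
    """Made-up spectrum: one Gaussian band sampled from 400 to 1600 cm^-1."""
    xs = [400.0 + spacing * k for k in range(int(1200 / spacing) + 1)]
    ys = [math.exp(-((x - centre) / width) ** 2 / 2) for x in xs]
    return xs, ys


# Same band recorded at two different point spacings, then a shifted band.
same = fingerprint_similarity(synthetic_band(1000, 2.0), synthetic_band(1000, 3.0))
shifted = fingerprint_similarity(synthetic_band(1000, 2.0), synthetic_band(1200, 2.0))
print(f"same band: {same:.3f}, shifted band: {shifted:.3f}")
```

Resampling both spectra onto one grid is what makes the comparison possible when two labs record at different resolutions. That only works with the raw numbers, not with a picture of the spectrum in a PDF.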
>Mat>4) Mass spectrometry. It's not clear to me there is a huge advantage here to sharing raw data, for a typical low res experiment?
JCAMP will do this for "1-D" spectra (e.g. not involving GC or multiple steps).
>Mat>5) HPLC data. Again, the outputs are fairly simple, and I'm not clear about the advantage of raw data (which I'm assuming would be absorbance vs. time table). Would (perhaps) permit verification that traces have not been cropped to remove pesky impurities.
Again it wouldn't take much to solve this
>Mat>6) Anything else?
I think we should use FigShare (see http://blogs.ch.cam.ac.uk/pmr/2011/08/03/figshare-how-to-publish-your-data-to-write-your-thesis-quicker-and-better/ ) and I'll explain why in my blog in a day or so
… ok .. take a breath. The main points have been:
- The technology for recording digital spectra has been around for at least 30 years
- The files are not large, and are trivial to upload
- There's lots of open source software.
The ACS is keen to see these data available but (according to Mat) doesn't want to act as a database. So,
I copied in Mark Hahnel (see link above). Figshare sounds like exactly what is needed.
Figshare has been developed with zero cash (but a lot of love from Mark). That will scale as far as establishing that the concept works and scientists like it.
We don't have to convince the senior chemists – all we have to convince is the graduate students, because they are the ones who will benefit from it and help develop the next phase. Whatever that is.
The University community (including the repositories) should take careful note of what Mark has done, because he has filled a real need, not built a theoretical design. And this is where innovation comes from (in our own group Nick Day built Crystaleye in his final PhD year, http://wwmm.ch.cam.ac.uk/crystaleye , and we are finding a permanent home for it). Let's have mechanisms for supporting the products of such innovation. Meanwhile, if any graduate student wishes to archive spectra, let's see how Mark recommends we develop it. And if you wish to deposit crystal structures, let's do them in the new crystaleye2 http://crystaleye.ch.cam.ac.uk/ .
There is now no technical reason not to archive high-quality data for chemistry in a completely Open manner.