A good deal of the reasoning behind transcription of spectral data in publication is to impart meaning to the spectra. The 1H NMR spectrum of rasfonin, for instance, would be indecipherable to me, but the data written in the publication, transcribed by the author and annotated for every peak, would make (more) sense. It’s great to get an idea what the spectra look like, but more often than not, the actual spectra can be found in the supplementary data as a scan of the original. The combination of these two data sources gives the synthetic chemist everything they need.
Before I get onto the horror, let me make it very clear that Tot. Syn's blog is excellent and I'm hoping that he can meet us at the pub on Monday lunch. His blog is a model of the future of chemoinformatics and we'd like to bounce some ideas off him.
(I'm also not specifically criticising the authors of the paper - at least not more than all other organic chemists because this supporting information (SI) is typical. I am of course suggesting gently that the process of publishing organic chemical experiments is seriously and universally broken).
The supporting information is a hamburger PDF and this example excellently makes my point. (Please readers, read it - or as much as you can manage - as I need help. Especially from anyone who is involved in graphical communication). It's a separate document from the original paper and, even though it is on the ACS site, it remarkably seems to be openly viewable. Maybe the ACS will close it sometime, or maybe this exercise shows that Openness enhances downloads.
The SI draws the spectra on their sides! This is a clear indication that they aren't meant to be read on the screen, but printed out. But the SI is 106 pages long. That's not unusual - we have seen over 200 pages. I am sure that many organic chemists who want to read it will print it out rather than trying to read it on the screen. The spectra run from pp 36-107 with no navigational aids - if you want to link a compound to its spectrum you have to scroll through the spectra till you find its formula. Some compounds are depicted as chemical formulae on the spectra and some, but not all, contain index numbers (bold in the text).
Let's assume that you are at a terminal and your lab has used up its paper bill. You scroll down to the infrared spectrum of a compound:
It doesn't look very promising, so I turn my head 90 degrees to look at it. Not very comfortable. Fortunately there is a tool in Adobe Reader that rotates the page to give:
This is awful. It looks like the spectra I used to collect 30 years ago when the pen plotter was running out (before that we plotted the spectra by hand; it's good for the soul). The resolution is probably 0.1 or better in the x-direction. I have no idea why it is so awful.
I want to look back to the text where the author has made the annotations (there are no annotations on the spectra so we have to skip back 70 pages) to find:
Our helpful Adobe Reader has turned all the pages round, so we have to turn this one back again. And, I suspect, the only real way to navigate this is to print it out.
The authors obviously spent a lot of time preparing this SI. The publisher probably calls it a "creative work" - you can claim copyright on creative works. I'd call it a destructive work. It doesn't actually have a copyright notice, although the ACS has a meta-copyright where they assert copyright over all SI (except one from Henry Rzepa and me).
Now - please help me with the PDF. I have blogged earlier about OSCAR - the data extraction tool that can extract massive information from chemical papers in HTML or even Word. But it doesn't work with PDF. Is there any way of extracting all the characters from this document? If I try to cut and paste I can only get one page at a time. Yes, I could probably hack something with PDFBox. But otherwise PDF is an appallingly efficient way of locking up and therefore destroying information.
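To give a flavour of what "hacking something" might involve, here is a rough sketch in Python using only the standard library. It is not PDFBox and not a real PDF parser. It assumes the text lives in content streams that are either uncompressed or Flate-compressed, and that the characters are written as literal strings with the Tj and TJ text operators. The function name extract_text is my own invention. Scanned spectra (which are images) yield nothing, fonts with custom encodings yield gibberish, and word spacing and line breaks are lost because they are done by positioning, not by characters.

```python
import re
import zlib

# Raw bytes between the "stream" and "endstream" keywords of each PDF object.
STREAM_RE = re.compile(rb"stream\r?\n(.*?)endstream", re.DOTALL)
# A PDF literal string "( ... )", allowing backslash escapes inside it.
LITERAL = rb"\((?:\\.|[^\\()])*\)"
LITERAL_RE = re.compile(LITERAL)
# A text-showing operation: "(string) Tj" or "[ (a) -250 (b) ] TJ".
SHOW_RE = re.compile(
    rb"(" + LITERAL + rb")\s*Tj|\[((?:" + LITERAL + rb"|[^\]()])*)\]\s*TJ",
    re.DOTALL,
)


def extract_text(pdf_bytes: bytes) -> list[str]:
    """Return one string per stream that contains text-showing operators."""
    chunks = []
    for match in STREAM_RE.finditer(pdf_bytes):
        raw = match.group(1)
        try:
            # decompressobj tolerates the newline left before "endstream".
            data = zlib.decompressobj().decompress(raw)
        except zlib.error:
            data = raw  # assumption: not Flate-compressed, so scan it as-is
        pieces = []
        for show in SHOW_RE.finditer(data):
            source = show.group(1) or show.group(2) or b""
            for literal in LITERAL_RE.finditer(source):
                body = literal.group(0)[1:-1]  # drop the outer parentheses
                # Undo only the simplest escapes: \( \) and \\
                body = re.sub(rb"\\([()\\])", rb"\1", body)
                pieces.append(body.decode("latin-1"))
        if pieces:
            chunks.append("".join(pieces))
    return chunks
```

Reading the whole SI as bytes and passing it to extract_text would, if those assumptions hold, give all the characters in one go instead of one page at a time. That so much guesswork is needed just to get the characters back rather supports the point.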
The message is simple:
STOP USING PDF FOR SCIENTIFIC INFORMATION
DO NOT USE PDF FOR DIGITAL CURATION