Turning Spectral PDF Hamburgers into Semantic Cows

I’m at Penn State University in the middle of Pennsylvania for two meetings:

The second, on Thursday, is run by Karl Mueller of PSU on chemistry in the digital age. I am kicking off with a talk on The Chemical Semantic Web. I’ve used the title before but the content keeps changing.

because of, among other things …

our OREChem project, funded by Lee Dirks of Microsoft Research. We are having a review on Wednesday. For an overview see Carl Lagoze’s paper: http://journal.webscience.org/112/2/websci09_submission_10.pdf

There are five groups covering chemistry and informatics (sometimes integrated, sometimes in different departments): Cornell, Indiana, Southampton, Penn State and Cambridge. OREChem is about using ORE to add RDF to chemical information. It’s all about linking and interoperability.
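To give a flavour of what “adding RDF” means in practice, here is a toy sketch in Python. The `dcterms:hasPart` predicate is real Dublin Core; the `ex:` namespace, the `characterises` predicate and all the URIs are invented for this illustration and are not OREChem’s actual vocabulary.

```python
# Toy sketch of RDF linking: emit a Turtle snippet tying a paper to a
# spectrum and the compound it characterises. dcterms:hasPart is a real
# Dublin Core term; everything in the ex: namespace is invented here.

def turtle_link(paper_uri, spectrum_uri, compound_uri):
    """Return a minimal Turtle document linking paper -> spectrum -> compound."""
    lines = [
        "@prefix dcterms: <http://purl.org/dc/terms/> .",
        "@prefix ex: <http://example.org/chem#> .",
        "",
        f"<{paper_uri}> dcterms:hasPart <{spectrum_uri}> .",
        f"<{spectrum_uri}> ex:characterises <{compound_uri}> .",
    ]
    return "\n".join(lines)

print(turtle_link("http://example.org/paper/1",
                  "http://example.org/spectrum/1",
                  "http://example.org/compound/caffeine"))
```

The point is not the five lines of Turtle but what they enable: once the links are explicit, a machine can follow them from paper to spectrum to compound, which a PDF never allows.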

Many multi-site projects don’t really do more than fund individual sites. OREChem is not like them. Because we are digital, and because we face a common problem (there isn’t enough information), there’s a strong sense of purpose. And that has crystallised in a subproject to liberate chemical information.

There are lots of problems in this liberation, most of them political, but this subproject tackles the hamburger-cow problem (see, for example, a blog discussion last year). Turning a PDF into XML is like turning a hamburger back into a cow. Turning PDF into CML is even worse, and into OREChem worse again.

It would be so simple for chemists to archive their spectra as JCAMP files, but they and their publishers (who are the primary problem) are so far totally uninterested. They spend time and effort turning digital spectral cows into appalling epaper hamburger PDFs. A chemist could send their data to the publisher (it’s less than a megabyte) and it could be mounted as supplemental data. It would take a few minutes at each end. But no, the chemist has to create a paginated document in PDF, which must take ages. Days, at least in some cases.
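To show just how little effort depositing semantic data would be, here is a minimal sketch of serialising a peak table as JCAMP-DX. The `##`-prefixed labelled data records follow the JCAMP-DX convention; the title and peak values are made up for the example, and a real deposition would of course carry far more metadata.

```python
# Minimal sketch: write a peak table as a JCAMP-DX file.
# The ##LABEL= records follow JCAMP-DX conventions; the actual
# title and peak values below are invented for illustration.

def to_jcamp(title, peaks):
    """peaks: list of (x, y) pairs, e.g. (chemical shift, intensity)."""
    lines = [
        f"##TITLE={title}",
        "##JCAMP-DX=4.24",
        "##DATA TYPE=NMR SPECTRUM",
        f"##NPOINTS={len(peaks)}",
        "##PEAK TABLE=(XY..XY)",
    ]
    lines += [f"{x}, {y}" for x, y in peaks]
    lines.append("##END=")
    return "\n".join(lines)

print(to_jcamp("demo spectrum", [(7.26, 100.0), (2.17, 43.5)]))
```

A dozen lines of code, a few kilobytes of output, and the spectrum stays machine-readable forever. That is the cow the PDF process grinds up.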

Here’s a typical example. As the publisher has COPYRIGHTED THE DATA I can’t reproduce it in this blog. Yes, we are even restricted in disseminating our own science, but that’s a different topic. So here’s a link which you can follow without paying or getting lawyers’ letters. The point is that if you scroll through this you will see zillions of lovely cows transformed into ugly epaper hamburgers.

And trust me, PDF is really, really horrible.

But Southampton, PSU and we are fighting back in this project. If (with a very few exceptions) the chemical community is not forward-looking enough to embrace semantic chemistry enthusiastically, we are going to have to do it anyway. And we have developed a hamburger2cow processor. Like one of those sci-fi movies (the curry monster in Red Dwarf). Bill Brouwer and I started this last December and we’ve been doing our own thing in parallel. We met yesterday, also with Mark, who is doing a PhD at Southampton, and we found that our bits fitted perfectly. We can turn a PDF hamburger into a cow. Not all hamburgers immediately. And some hamburgers are so awful they are past redemption. But certain types of hamburger. Enough to show the chemists what they are missing. And to highlight the lack of value that the publishing process currently adds, and to encourage/shame them into embracing the semantic chemistry age.
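To give a flavour of one small step in hamburger-to-cow conversion (this is not our actual code; the regex and the sample string are invented, and text extracted from real PDFs is far messier than this): once plain text has been pulled out of a PDF, peak assignments like “7.26 (s, 1H)” can sometimes be recovered with a pattern match.

```python
import re

# Toy sketch: recover NMR peak assignments from text already extracted
# from a PDF. The regex and sample text are invented for illustration;
# real extraction output is much noisier and needs much more care.

PEAK = re.compile(r"(\d+\.\d+)\s*\((\w+),\s*(\d+)H\)")

def extract_peaks(text):
    """Return (chemical shift, multiplicity, integration) tuples."""
    return [(float(s), m, int(n)) for s, m, n in PEAK.findall(text)]

sample = "1H NMR (CDCl3): 7.26 (s, 1H), 2.17 (d, 3H)."
print(extract_peaks(sample))  # [(7.26, 's', 1), (2.17, 'd', 3)]
```

The hard part, of course, is everything before this step: getting reliable text and layout out of the PDF at all, which is exactly why the hamburger should never have been made in the first place.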

Details will be revealed at the meeting on Thursday.


2 Responses to Turning Spectral PDF Hamburgers into Semantic Cows

  1. carlos says:

    I agree that PDF is not the right format for storing NMR data, no question about that. However, JCAMP can also be terrible if the original FID is not saved together with the f-domain spectrum. In fact, the FID is the most valuable piece of information about an NMR spectrum, as I blogged some time ago (http://nmr-analysis.blogspot.com/2008/08/spectra.html). I have seen too many JCAMP files lacking the original FID, which IMHO is not good.
    Carlos

    • pm286 says:

      @carlos. I was using JCAMP to emphasize that authors should deposit semantic data. IIRC JCAMP can hold an FID? So I agree with your sentiment, but this is a matter of discipline, not technology. How many JCAMP files of any sort have you seen on publishers’ sites? Or in theses? That is the problem. Once we get the process started it should be fairly easy to include FIDs. Is there any Open software that can process FIDs? And is there an effective standard which is independent of manufacturers?
