How can we publish semantic chemical documents?

Tobias Kind has submitted a very thoughtful comment (in reply to Approaches to compound documents – ORE, PDF, DOCX) which deserves printing and commenting.

TK: Hello Peter,
thanks for your thoughts. The more I read, the more complex and frustrating it gets. I was just reading your comments about Adobe Acrobat; I had assumed that everybody in the chemistry world has a full Acrobat license, but I recognize that’s not the case. Furthermore, there are people who have problems opening a ZIP file, so one cannot assume that everybody is operating at the same level of tools.

PMR: I personally do not have an Acrobat full licenCe. That’s not religious – I just don’t have one. Maybe the University has a site licence – I don’t know. FWIW I send manuscripts as *.doc. I realise this is also a proprietary format and yes, it has to be paid for. It’s just that I happen to have it painlessly. In contrast it is possible to get Open Source/free solutions for ZIP (though only after the infamous patent ran out).

TK: That’s the problem with the long tail. According to the power law it is probably safe to assume that the majority of chemists don’t even care whether there are chemical semantics lurking in a document. Not to offend the majority of chemists, but at the end of the day it’s only the number of publications on the CV that counts (well, and quality of course).

PMR: agreed

TK: Tim,
RSC with Project Prospect, and Nature with journals that annotate structures and submit them to PubChem, are probably top notch regarding semantics. And as I said before, yes, PDF can include metadata with XMP (http://www.adobe.com/products/xmp/), but as long as there are no easy (free) tools out there it’s hard to push semantics from the PDF side. Acrobat Reader 8 did not know XMP, and yes, one can attach XML using the full Acrobat. But the mentioned ExifTool is not a commodity tool for most chemists.

PMR: agreed. I am a pragmatist in that I can reasonably persuade chemists to include various bits of information into Word2007, but not into LaTeX or Acrobat. If everyone used semantic Acrobat instead of Word I’d probably be suggesting that. I’m pleased to see that Writer is better than Open Office, but it doesn’t (I think) solve the semantic packaging problem seamlessly.

TK: But then again the whole semantics train currently depends on the journal itself or the editorial board, or on single people or innovative groups on the publisher side. And there is certainly the tools side: for Web 2.0 in chemistry, only a broad range of software tools can act as an enabler for chemical semantics.

PMR: Completely agreed. This is why the International Union of Crystallography deserves praise – it designs and requires semantic data publication for many of its journals.

TK: Peter,
I would not go so far as “PDF corrupts and restricts thought”. Chemists cannot make third parties responsible for the current mess in missing annotations and data exchange. Most of the better chemistry and life sciences journals allow supporting info, so what speaks against attaching the source HTML or DOC as a supplement? Yes, it’s redundant, but as long as publishers do not convert supporting data into bitmap PDF it is not a problem.

PMR: Like Tufte I allow myself some hyperbole. I would certainly say that “in a digital age where many new forms of information and publication are possible, a universally used format whose primary purpose is to allow printing of documents onto paper is an active restriction on the imagination”.

PMR: As an example, if you go to the ACS Journal of Proteome Research, you can find some of the evil PDFs, and even the evil flat 2D PDF attachments including molecular spectra or information. But a few publications also include supplemental RAW data (as XLS, MDB or ZIP) and even PDB codes. So I assume that if the authors and reviewers insist on publishing metadata in the supplement in a specific format, the journal would agree. Well, then there is that unholy ACS supplement data copyright. But there are also ways to submit data on personal websites. For instance you could find the ACS journal supplement data for “T.IMPAFIFEHIIK.R” also on Google: “Powered by Yates Bioinformatics Team; This is ongoing project with preliminary results”, ok, copyrighted by the Yates group itself ;-)

PMR: If you look at ACS J.Org.Chem you will see that almost every paper has a large supplement. This is almost always in PDF. It has clearly taken a lot of work to create. The information was, originally, semantic and the publication process has encouraged the community to turn it into PDF. The spectra were JCAMPs (or could be JCAMPs), the molecules were CDX or MOL, the reactions were RXN, etc. All have been steamrollered into flat PDFs.

PMR: The exceptions are the CIFs, designed, advocated, and managed by the IUCr. They have shone as an example to the rest of the chemical world.

TK: For example some of our public US taxpayer funded metabolomics data sets are fully available via our SetupX LIMS and study design database:
http://fiehnlab.ucdavis.edu:8080/m1/
For those public studies people can download all the raw data, all the annotated and result data, and even the underlying software. Not all research data is open access and publicly available, and yes, we are also guilty of publishing flat PDFs without any semantics. But we allow people to reproduce some of our experiments and download RAW and processed data and all needed software, and that can only be topped by Open Notebook Science, the purest form of scientific reporting.

PMR: This again is the influence of the bioscience community. It makes me envious.

TK: Tobias Kind
fiehnlab.ucdavis.edu

It’s technically trivial – yes trivial – to publish molecules and spectra. Suppose a journal said “no need to write 200 pages of supplemental info in PDF, just publish the *.cdx, *.mol, *.jdx.” That’s all. But where is the editorial push for this? Will any chemical editors (technical, management, academic) step up and say “this journal will require authors to deposit semantic chemistry in … months/years”? That’s all it takes. There wouldn’t even be much resistance – probably rejoicing.

The good news is that we have an Open Source infrastructure that can convert all of these legacy formats into semantic chemistry (Chemical Markup Language, CML) essentially automatically. We’ve done it for crystallography in the chemistry department here, and the issues are not technical but things like embargoes.

You don’t even need to know about CML.
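To give a flavour of how mechanical such a conversion can be, here is a minimal Python sketch that reads the atom and bond blocks of a V2000 molfile and writes them out as a CML molecule. It is not the Open Source infrastructure mentioned above: the function name `molfile_to_cml` is hypothetical, the sketch ignores charges, isotopes, stereo flags and property lines, and the element and attribute names (`atomArray`, `elementType`, `atomRefs2` and so on) reflect my understanding of CML and should be checked against the schema before being relied on.

```python
import xml.etree.ElementTree as ET

CML_NS = "http://www.xml-cml.org/schema"  # CML namespace URI

# Molfile bond-type codes mapped to CML bond orders (4 = aromatic).
BOND_ORDERS = {1: "1", 2: "2", 3: "3", 4: "A"}


def molfile_to_cml(molfile_text: str) -> str:
    """Convert the atom and bond blocks of a V2000 molfile to a CML string.

    Hypothetical helper for illustration only; it ignores charges,
    isotopes, stereo flags and property lines.
    """
    lines = molfile_text.splitlines()
    counts = lines[3]              # the fourth line is the counts line
    n_atoms = int(counts[0:3])     # columns 1-3: number of atoms
    n_bonds = int(counts[3:6])     # columns 4-6: number of bonds

    molecule = ET.Element("molecule", {"xmlns": CML_NS, "id": "m1"})
    title = lines[0].strip()       # the first header line holds the name
    if title:
        molecule.set("title", title)

    atom_array = ET.SubElement(molecule, "atomArray")
    for i, line in enumerate(lines[4:4 + n_atoms], start=1):
        ET.SubElement(atom_array, "atom", {
            "id": f"a{i}",
            "elementType": line[31:34].strip(),
            "x3": line[0:10].strip(),
            "y3": line[10:20].strip(),
            "z3": line[20:30].strip(),
        })

    bond_array = ET.SubElement(molecule, "bondArray")
    first_bond = 4 + n_atoms
    for line in lines[first_bond:first_bond + n_bonds]:
        a1, a2, code = int(line[0:3]), int(line[3:6]), int(line[6:9])
        ET.SubElement(bond_array, "bond", {
            "atomRefs2": f"a{a1} a{a2}",
            "order": BOND_ORDERS.get(code, str(code)),
        })

    return ET.tostring(molecule, encoding="unicode")
```

Passing the text of a molfile to `molfile_to_cml` returns a single XML string. Real converters have to cope with far more (V3000 files, query atoms, reactions, spectra), which is exactly why authors should not have to write this themselves.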

I’ll be explaining in future posts how it is now conceptually simple to publish chemical data in semantic form. I’d like to work with, not against, publishers. And with some, like IUCr and RSC, we do.
