Approaches to compound documents – ORE, PDF, DOCX

A very thoughtful and timely comment from Tobias Kind:


as you mentioned Chem4Word, what happens to such a (semantically enriched) word document as seen at http://research.microsoft.com/en-us/projects/chem4word/ when journals will convert it to a “flat” 2D bitmap “dumb” PDF. Basically information is lost. Most of the chemistry journals only accept PDF and Word but finally convert it to 2D PDF for DTP (desktop publishing) purposes. From the supplement sections of many chemistry journals which still provide structures or spectra or data tables as bitmap PDF (shudder/shiver) I don’t see any good coming (I agree that’s not a PDF problem and a simple ZIP file would solve the problem). Actually the process is more complex, as figures (containing structure and spectra) are processed independently from the text and tables are mostly included in the text.


One of the major shortcomings of the W3C effort is its failure to address compound documents and packaging. The web, with HTML and its embedding and linking, means that a resource is rarely a single entity but is composed of images, links to other documents, etc. We met this restriction when I submitted my first deposition to DSpace – it was an HTML document and I had to upload every image separately. Even then I couldn’t get the links to work.
So there are several de facto approaches:

  • zip up the components
  • download “web page” in proprietary form (such as mht)
  • save document as *.doc (Word 2003 compound document)
  • embed objects in PDF

These are relatively easy to do and most will accept a wide range of ASCII and binary components. The problems are:

  • the contents are not semantic
  • there are no semantics of the structure
  • to reverse the process you need the same technology as used to create the package

In practice this means that reading is normally limited to recreating the same visual experience as the author had, often only on a limited range of platforms. It is uncommon to extract semantic components. Unpacking a foo.grot.bin will not enhance it beyond the original – often untransportable binary.
MIME types could have solved this problem but they aren’t up to supporting the complexities of the problem.
If you are happy to say “{corporation} solves all my problems and I have a full range of authoring and reading tools and so do all my readers” where {corporation} == Adobe or Microsoft, you can stop reading.
There is an urgent need to address the compound document problem, ranging from tight “single” documents with embedded components to “web sites” and even more distributed systems. We are involved in two such efforts:

  • ORE. I’ll write another post on this… But it’s an Open semantic packaging system based on RDF. We are funded with others by Microsoft (OREChem)
  • Chem4Word (disclaimer, we are funded by MS but I still speak my mind). This uses the Microsoft Word2007 DOCX format (OOXML) which is a container for XML components (and proprietary binary). I am not going to defend DOCX as a wonder of clarity and Openness but it is an Open standard (whatever you feel about the process of arriving there) and – in principle – can interoperate with the Open Office “equivalent” ODT. In practice they don’t interoperate well but we are collaborating with Peter Sefton (blog) on authoring theses in ODT (funded by JISC). I think interoperability between ODT and OOXML will come slowly and undramatically. The fundamental problem is that it’s hard and neither ODT nor OOXML has been well designed, certainly not for interoperability. Still we and Peter will persevere to develop semantic XML-based authoring systems. More later.

===========================================================================
Tobias: PDF itself can hold XML or CML data, so PDF is not dumb by design, but nobody at the (put your preferred chemistry journal in here) seems to be aware of, or at least ignores that DOC or PDF can hold embedded XML data. Are you aware of any efforts from ADOBE or publishers to push semantics and chemistry into the PDF world? So how will Chem4Word stand against flat PDF or OpenOffice Write? What would happen if every reaction drawing, every spectrum would be required as embedded or attached raw/xml format? I guess there are too many different formats 🙂


Personally I am not a fan of using PDF as a packaging standard. The quality of PDF varies enormously and – yes – I have had correspondents on this blog who say all we have to do is buy Adobe tools and the problem is solved. I’m against single-vendor solutions in science, whether they be instruments, simulation programs, reagents or authoring tools. You get lock-in which limits vision and innovation. You struggle with lazy vendors who don’t care. I haven’t tried to understand Adobe’s format.
But my biggest criticism of PDF is that it locks our thinking in the paper age. Why should documents have page numbers? Why should we have two columns per page? Because the publishers force it on us. So PDF is a tool for constricting thought, not liberating it. Edward Tufte has similar views on The cognitive style of PowerPoint. He even argues (cogently) that the corruption of the message by PP was so bad that it was in considerable part responsible for the space shuttle disaster.
PDF corrupts and restricts thought. It encourages us to destroy semantics. Yes, it can – with great effort – hold semantic objects but nobody uses it to do so. That’s because PDF and PP make us lazy.
We are making progress with Chem4Word so that it can author DOCX files containing chemistry. You may think I’ve just swapped one proprietary format for another. But the components are all XML standards (CML can hold molecules, reactions, spectra, spectral annotations, crystallography, compchem, synthetic recipes). All these are easily extracted from DOCX. So we can author with Chem4Word but use as XML. As with everything we do it’ll be Open (details later). But anyone can process DOCX files without Word2007 – it’s a zip containing XML components. It’s not fun, but Joe Townsend has already shown that we can extract a lot of chemistry from theses in normal *.doc files. Don’t take that as an excuse for laziness, because we have to move forward.
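To make the “it’s a zip containing XML components” point concrete, here is a minimal Python sketch of pulling chemistry out of a DOCX package without Word. It is an illustration under assumptions, not a description of how Chem4Word actually stores its data: it assumes the chemistry sits in XML parts whose root element is in the CML namespace, and the part name customXml/item1.xml and the helper names find_cml_parts and build_demo_package are hypothetical, chosen only for the demonstration.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# CML namespace; treating it as the marker of a "chemistry part" is an assumption.
CML_NS = "http://www.xml-cml.org/schema"


def find_cml_parts(docx_file):
    """Return {part name: root element} for every XML part rooted in the CML namespace."""
    found = {}
    with zipfile.ZipFile(docx_file) as package:
        for name in package.namelist():
            if not name.endswith(".xml"):
                continue  # skip binary parts such as images
            try:
                root = ET.fromstring(package.read(name))
            except ET.ParseError:
                continue  # not well-formed XML; ignore it
            if root.tag.startswith("{" + CML_NS + "}"):
                found[name] = root
    return found


def build_demo_package():
    """Build a tiny in-memory zip that mimics a DOCX layout (hypothetical content)."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as package:
        package.writestr(
            "word/document.xml",
            '<w:document xmlns:w="http://schemas.openxmlformats.org/'
            'wordprocessingml/2006/main"/>',
        )
        package.writestr(
            "customXml/item1.xml",
            '<cml xmlns="http://www.xml-cml.org/schema"><molecule id="m1"/></cml>',
        )
    buffer.seek(0)
    return buffer


if __name__ == "__main__":
    for part_name, root in find_cml_parts(build_demo_package()).items():
        molecule_ids = [m.get("id") for m in root.iter("{" + CML_NS + "}molecule")]
        print(part_name, molecule_ids)
```

The same function should work on a real file path in place of the in-memory demo package. Only the Python standard library is needed, which is the point: reversing the packaging does not require the technology that created it.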
As for publishers there are small signs of change. We work very closely with the Royal Society of Chemistry and they understand the value of semantics. We’ll keep you in touch.


3 Responses to Approaches to compound documents – ORE, PDF, DOCX

  1. Tony Hammond says:

    Hi Peter:
    I must just take issue with one point you make:
    “Yes, it [PDF] can – with great effort – hold semantic objects but nobody uses it to do so.”
    Not true. Publishers are even now beginning to provide structured metadata in their PDFs via XMP. At the moment this may be limited to simple bibliographic type information (DOI, title, authors, etc.), but with the XMP framework in place this could be readily extended to capture and make available other types of semantics. Both Elsevier and Nature Publishing Group are now providing meaningful XMP payloads for their current PDF publications. I have blogged about our initiative here [1].
    I do accept that PDF is not the easiest format to work with but extracting the XMP payloads is very straightforward using open-source software such as Phil Harvey’s excellent ExifTool [2].
    Cheers,
    Tony
    [1] http://blogs.nature.com/wp/nascent/2008/12/xmp_labelling_for_nature.html
    [2] http://www.sno.phy.queensu.ca/~phil/exiftool/

  2. Pingback: ptsefton » Opening up Microsoft

  3. Tobias Kind says:

    Hello Peter,
    thanks for your thoughts. The more I read the more complex and frustrating it gets. I was just reading your comments about Adobe Acrobat; I would assume that everybody in the chemistry world has an Acrobat Full license. But I recognize that’s not the case. Furthermore there are people who have problems opening a ZIP file, so one can not assume that everybody is operating at the same level of tools. That’s the problem with the long tail. According to the power law it is probably safe to assume that the majority of chemists doesn’t even care if there is chemical semantics lurking out of a document. Not to offend the majority of chemists, but at the end of the day it’s only the number of publications on the CV that counts (well quality of course).
    Tim,
    RSC with Project Prospect and Nature with journals that annotate structures and submit them to PubChem are probably top notch regarding semantics. And as I said before, yes PDF can include metadata with XMP, http://www.adobe.com/products/xmp/ but as long as there are no easy (free) tools out there it’s hard to push semantics from the PDF side. Acrobat Reader 8 did not know XMP and yes one can attach XML using the Full Acrobat. But the mentioned ExifTool is not a commodity tool for most chemists.
    But then again the whole semantics train currently depends on the journal itself or the editorial board, or single people or innovative groups at the publisher side. And there is certainly the tools side, so for WEB 2.0 in chemistry only a broad range of software tools can act as an enabler for chemical semantics.
    Peter,
    I would not go so far as “PDF corrupts and restricts thought”. Chemists can not make third parties responsible for the current mess in missing annotations and data exchange. Most of the better chemistry and life sciences journals allow supporting info, so what speaks against attaching the source HTML or DOC as a supplement? Yes it’s redundant, but as long as publishers do not convert supporting data into bitmap PDF it is not a problem.
    As an example if you go to ACS Journal of Proteome Research, you can find some of the evil PDFs, and even the evil flat 2D PDF attachments including molecular spectra or information. But a few publications also include supplement RAW data (as XLS, MDB or ZIP) and even PDB codes. So I assume if the authors and reviewers insist on publishing meta data in the supplement in a specific format the journal would agree. Well, then there is that unholy ACS supplement data copyright. But there are also ways to submit data on personal websites. For instance you could find the ACS journal supplement data for “T.IMPAFIFEHIIK.R” also on google: “Powered by Yates Bioinformatics Team; This is ongoing project with preliminary results”, ok copyrighted by the Yates group itself 😉
    For example some of our public US taxpayer funded metabolomics data sets are fully available via our SetupX LIMS and study design database:
    http://fiehnlab.ucdavis.edu:8080/m1/
    For those public studies people can download all the raw data and all the annotated and result data and even the underlying software. Not all research data is open access and publicly available and yes we are also guilty of publishing flat PDFs without any semantics, but we allow people to reproduce some of our experiments and download RAW and processed data and all needed software and that can only be topped by Open NoteBook Science, the purest form of scientific reporting.
    Cheers
    Tobias
    Tobias Kind
    fiehnlab.ucdavis.edu
