A very thoughtful and timely comment from Tobias Kind:
as you mentioned Chem4Word, what happens to such a (semantically enriched) word document as seen at http://research.microsoft.com/en-us/projects/chem4word/ when journals will convert it to a “flat” 2D bitmap “dumb” PDF. Basically information is lost. Most of the chemistry journals only accept PDF and Word but finally convert it to 2D PDF for DTP (desktop publishing) purposes. From the supplement sections of many chemistry journals which still provide structures or spectra or data tables as bitmap PDF (shudder/shiver) I don’t see any good coming (I agree that’s not a PDF problem and a simple ZIP file would solve the problem). Actually the process is more complex, as figures (containing structure and spectra) are processed independently from the text and tables are mostly included in the text.
…
One of the major shortcomings of the W3C effort is its failure to address compound documents and packaging. The web, with HTML and its embedding and linking, means that a resource is rarely a single entity but is composed of images, links to other documents, etc. We met this restriction when I submitted my first deposition to DSpace – it was an HTML document and I had to upload every image separately. Even then I couldn’t get the links to work.
So there are several de facto approaches:
- zip up the components
- download “web page” in proprietary form (such as mht)
- save document as *.doc (Word 2003 compound document)
- embed objects in PDF
These are relatively easy to do and most will accept a wide range of ASCII and binary components. The problems are:
- the contents are not semantic
- there are no semantics of the structure
- to reverse the process you need the same technology as used to create the package
In practice this means that reading is normally limited to recreating the same visual experience as the author had, often only on a limited range of platforms. It is uncommon to extract semantic components. Unpacking a foo.grot.bin will not enhance it beyond the original – often untransportable binary.
MIME types could have solved this problem but they aren’t up to supporting the complexities of the problem.
If you are happy to say “{corporation} solves all my problems and I have a full range of authoring and reading tools and so do all my readers” where {corporation} == Adobe or Microsoft, you can stop reading.
There is an urgent need to address the compound document problem, ranging from tight “single” documents with embedded components to “web sites” and even more distributed systems. We are involved in two such efforts:
- ORE. I’ll write another post on this… But it’s an Open semantic packaging system based on RDF. We are funded with others by Microsoft (OREChem).
- Chem4Word (disclaimer: we are funded by MS but I still speak my mind). This uses the Microsoft Word2007 DOCX format (OOXML), which is a container for XML components (and proprietary binary). I am not going to defend DOCX as a wonder of clarity and Openness but it is an Open standard (whatever you feel about the process of arriving there) and – in principle – can interoperate with the Open Office “equivalent” ODT. In practice they don’t interoperate well but we are collaborating with Peter Sefton (blog) on authoring theses in ODT (funded by JISC). I think interoperability between ODT and OOXML will come slowly and undramatically. The fundamental problem is that it’s hard and neither ODT nor OOXML has been well designed, certainly not for interoperability. Still, we and Peter will persevere to develop semantic XML-based authoring systems. More later.
===========================================================================
Tobias: PDF itself can hold XML or CML data, so PDF is not dumb by design, but nobody at the (put your preferred chemistry journal in here) seems to be aware of, or at least ignores that DOC or PDF can hold embedded XML data. Are you aware of any efforts from ADOBE or publishers to push semantics and chemistry into the PDF world? So how will Chem4Word stand against flat PDF or OpenOffice Write? What would happen if every reaction drawing, every spectrum would be required as embedded or attached raw/xml format? I guess there are too many different formats 🙂
Personally I am not a fan of using PDF as a packaging standard. The quality of PDF varies enormously and – yes – I have had correspondents on this blog who say all we have to do is buy Adobe tools and the problem is solved. I’m against single-vendor solutions in science, whether they be instruments, simulation programs, reagents or authoring tools. You get lock-in, which limits vision and innovation. You struggle with lazy vendors who don’t care. I haven’t tried to understand Adobe’s format.
But my biggest criticism of PDF is that it locks our thinking in the paper age. Why should documents have page numbers? Why should we have two columns per page? Because the publishers force it on us. So PDF is a tool for constricting thought, not liberating it. Edward Tufte has similar views on The Cognitive Style of PowerPoint. He even argues (cogently) that the corruption of the message by PP was so bad that it was in considerable part responsible for the space shuttle disaster.
PDF corrupts and restricts thought. It encourages us to destroy semantics. Yes, it can – with great effort – hold semantic objects but nobody uses it to do so. That’s because PDF and PP make us lazy.
We are making progress with Chem4Word so that it can author DOCX files containing chemistry. You may think I’ve just swapped one proprietary format for another. But the components are all XML standards (CML can hold molecules, reactions, spectra, spectral annotations, crystallography, compchem, synthetic recipes). All these are easily extracted from DOCX. So we can author with Chem4Word but use as XML. As with everything we do it’ll be Open (details later). But anyone can process DOCX files without Word2007 – it’s a zip containing XML components. It’s not fun, but Joe Townsend has already shown that we can extract a lot of chemistry from theses in normal *.doc files. Don’t take that as an excuse for laziness, because we have to move forward.
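To make the "it's a zip containing XML components" point concrete, here is a minimal Python sketch using only the standard library. It is not Chem4Word's own code. It assumes that embedded chemistry uses the CML namespace http://www.xml-cml.org/schema, and because I don't want to assume where in the package the CML lives, it simply scans every XML part. The function names and the toy package (including its part names) are hypothetical and exist only for illustration.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# CML namespace URI; assumed here to be the one used by the embedded chemistry.
CML_NS = "http://www.xml-cml.org/schema"


def list_xml_parts(package):
    """Return the names of all XML parts inside a zip-based package."""
    with zipfile.ZipFile(package) as zf:
        return [n for n in zf.namelist() if n.lower().endswith(".xml")]


def extract_cml_molecules(package):
    """Return (part_name, molecule_id) pairs for every CML <molecule> found."""
    found = []
    with zipfile.ZipFile(package) as zf:
        for name in zf.namelist():
            if not name.lower().endswith(".xml"):
                continue  # ignore binary parts such as images
            try:
                root = ET.fromstring(zf.read(name))
            except ET.ParseError:
                continue  # skip parts that are not well-formed XML
            for mol in root.iter(f"{{{CML_NS}}}molecule"):
                found.append((name, mol.get("id")))
    return found


def make_toy_package():
    """Build a tiny in-memory zip imitating a DOCX layout (hypothetical content)."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr("word/document.xml", "<doc><p>Some text</p></doc>")
        zf.writestr(
            "customXml/item1.xml",
            f'<cml xmlns="{CML_NS}"><molecule id="m1"/></cml>',
        )
        zf.writestr("word/media/image1.bin", b"\x00\x01")
    buf.seek(0)
    return buf


if __name__ == "__main__":
    print(list_xml_parts(make_toy_package()))
    print(extract_cml_molecules(make_toy_package()))
```

Both functions accept a path or a file-like object, so the same calls should work on a real *.docx on disk. No Word2007 is involved at any point; whether anything useful comes out depends entirely on the document actually containing CML.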
As for publishers there are small signs of change. We work very closely with the Royal Society of Chemistry and they understand the value of semantics. We’ll keep you in touch.