Hamburgers – theses in PDF

Having blogged about the excitement of automatic reading and semantic enhancement of chemical theses, I come to the stark reality of PDF.
“Turning PDF into XML is like turning a hamburger back into a cow” (anon).
So I searched for Openly exposed electronic chemical theses on the web.  Yes, they are out there, but – in PDF. So here’s a typical tale of the sort of waste of time that I and my colleagues have to go through.
I find a major academic institution (Foo) with a repository of theses. There are two versions – one that the world can read, and one private to Foo faculty. They seem to be the same, except that the Open one is “not printable”.  I load it into my browser – it displays. I try to save as text. I try to select text and paste into a text editor (this usually works). It doesn’t. So presumably there is deliberately some sort of gremlin in the document which prevents Adobe tools saving the text. (I expect Adobe developed this specifically anyway).
PDF has already destroyed the structure of the document. But perhaps I can at least save the words. OSCAR3 is very good at reconstructing chemistry from words. I save the PDF locally (that seems to be technically allowed) and then I open it in a text editor. Gibberish – but I expected that.
So I download PDFBox from Sourceforge. A typical example of noble-spirited Open Source development – trying to make life better than the hamburger culture. It has an executable called ExtractText. I run it. “Null Pointer Exception”. (This means the program has failed to trap an error – but I forgive them since by definition a hamburger is an error.) I then notice another executable (SplitText). Expecting it to fail, I run it. Surprisingly it works. It produces 200 little PDF files (one for each page in the thesis). Not the ideal thing to work with, but serious progress.
Then I notice an option (-split). This says “only start splitting after n pages”. So I use -split 200. This creates one large PDF file (the same as the original document). This doesn’t seem like progress, but it is – the new file behaves perfectly with ExtractText. I can now convert the PDF to text without problems. And run it into OSCAR3. And more of this later.
Of course the resultant text is awful, but at least it contains all the right words and in the right order. It cannot manage subscripts (for example H2SO4 – the chemical formula for sulfuric acid – comes out as:
H
2
SO
4
).
That’s because PDF has no semantics. The ‘2’ and ‘4’ are just characters with X,Y coordinates – not associated with anything.
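To make the point concrete, here is a minimal sketch of what any extractor is up against. Everything in it is my own illustration: the Glyph record, the coordinates and the 0.5 tolerance are invented, and this is not how PDFBox or OSCAR3 actually work. A naive reader that starts a new line whenever the baseline moves produces exactly the mess above, while one that tolerates a small baseline shift can guess that the digits belong to the formula.

```python
from dataclasses import dataclass


@dataclass
class Glyph:
    """Hypothetical record: one character plus the position a PDF gives it."""
    char: str
    x: float     # horizontal position on the page
    y: float     # baseline height
    size: float  # font size


def naive_extract(glyphs):
    """Start a new line every time the baseline changes at all."""
    ordered = sorted(glyphs, key=lambda g: g.x)
    lines = []
    prev_y = None
    for g in ordered:
        if prev_y is None or g.y != prev_y:
            lines.append("")
        lines[-1] += g.char
        prev_y = g.y
    return "\n".join(lines)


def tolerant_extract(glyphs, tolerance=0.5):
    """Keep glyphs on one line if the baseline moves by less than
    tolerance * (largest font size). The threshold is an assumption."""
    ordered = sorted(glyphs, key=lambda g: g.x)
    if not ordered:
        return ""
    main_size = max(g.size for g in ordered)
    baseline = ordered[0].y
    lines = [""]
    for g in ordered:
        if abs(g.y - baseline) > tolerance * main_size:
            lines.append("")   # a genuinely new line of text
            baseline = g.y
        lines[-1] += g.char
    return "\n".join(lines)


# Invented coordinates for H2SO4: the digits are smaller and sit 3 units lower.
h2so4 = [
    Glyph("H", 0, 100, 12),
    Glyph("2", 9, 97, 8),
    Glyph("S", 14, 100, 12),
    Glyph("O", 22, 100, 12),
    Glyph("4", 31, 97, 8),
]

print(naive_extract(h2so4))
print(tolerant_extract(h2so4))
```

Note that even the tolerant version only recovers the characters in order. It still has no idea that the 2 is a subscript of H, which is exactly the semantics that the PDF threw away.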
So the message is clear.
Do not author documents in PDF alone.
If you use another format (Word, HTML, TeX and perhaps even, some day, XML), preserve that version. If you are required to destroy the semantics by turning the document into a hamburger, insist that the rich version is preserved as well. In your institutional repository.
Does this sorry story suggest that really we should be using XML for science, not PDF?
