OREChem; PDFHamburger to Chemistry revealed (a bit)

More on the PDFHamburger2Chemistry story. It’s making progress, but progress is like walking through glue. Yesterday we had an illuminating and at times depressing look into the ghastly innards of the PDF hamburger. As you know, hamburgers contain all sorts of hidden horrors (sawdust, fat, water, etc.), and we got to see the PDF equivalents.

The story so far: Bill Brouwer is a physicist in the PSU chemistry department and has enthusiastically contributed to the ORE project, working closely with Karl Mueller and Lee Giles’ team (known for CiteSeer and ChemXSeer). One of the goals for OREChem is to extract chemistry from conventional sources (e.g. theses, articles) and convert it to CML and then RDF. Bill and I have worked in parallel for the last 5 months on different parts of this technology.

Bill had to get deeply involved in PDF technology. Has anyone heard of ASCII85? It’s a ghastly way of transmitting binary information to a printer and it pervades PDF technology. So, for example, rather than transmitting a simple character (O) to a printer, the PDF abattoir will convert the character to a bitmap, encode the bitmap in ASCII85 and transmit this multicharacter string to the printer. (It helps to remember that the main purpose of PDF is to talk to printers, not humans. That is why it carries so little useful semantic information and so many horrors; ASCII85 is not the only one.) Anyway, Bill was able to hack much of this and actually use the components of the PDF (features) to try to detect which bits were molecules, which were spectra, etc.
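For anyone who hasn’t met ASCII85: it packs every 4 bytes of binary data into 5 printable ASCII characters, and in PDF streams the data ends with a `~>` marker. Here is a minimal Python sketch of unwrapping such a payload. It is not Bill’s code; the helper name `decode_pdf_ascii85` is made up for illustration, and it assumes the payload has already been cut out of the PDF stream. Only `base64.a85encode` and `base64.a85decode` are real standard-library calls.

```python
import base64


def decode_pdf_ascii85(stream: bytes) -> bytes:
    """Decode an ASCII85 payload that ends with the '~>' end-of-data marker.

    Hypothetical helper for this sketch; base64.a85decode does the work.
    """
    # adobe=True makes a85decode require the trailing '~>' and accept an
    # optional leading '<~'. strip() removes stray surrounding whitespace
    # so that the end marker really is the last thing in the payload.
    return base64.a85decode(stream.strip(), adobe=True)


# Round trip: 4 input bytes become 5 printable characters.
payload = b"ABCD"
encoded = base64.a85encode(payload)   # bare ASCII85, no framing
framed = encoded + b"~>"              # append the end-of-data marker
print(encoded, len(encoded))
print(decode_pdf_ascii85(framed))
```

Decoding is the easy part, of course. What comes out is still just bytes (a bitmap, a font, a content stream), and working out what those bytes mean is where the real pain starts.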

Then Mark Borkum came out to PSU 3 weeks ago, took over from Bill and has made spectacular progress. Mark is a first-year computer science PhD student now working in Jeremy Frey’s group. Mark’s continued interpretation of PDF has allowed us to design a 3-part system:

  • PDF2SVG (Mark). Mark wants to do this really properly (the current tools do a good, but not lossless, conversion). And it needs semantics adding, such as superscripts. (PDF has no idea of superscripts: it simply draws a large character, then changes to a smaller font, increases y and x, and draws again. That’s a subscript (or superscript).) Remember that printers are dumb and PDF talks to printers. So it’s not trivial to reconstruct subscripts (you did remember the thousands of kerning character pairs, didn’t you?). The idea is that at the end of this Mark will have split up the document into semantic components; we won’t know what all the semantics are, but we can guess some.

  • SVG2Chemistry (PMR). SVG is a good technology for reconstructing semantic objects from graphics. This has to be heuristic; after all, what do two crossed lines (+) mean? It could be a plus, or it could be tetramethylmethane. SVG and PDF don’t know. However, with good SVG, including subscripts and text runs, this is very promising.

  • Spectral deconvolution and analysis (Bill). Bill has been interpreting the spectra in terms of components. He started doing this in the PDF analysis, but it makes much more sense to do it at the end.

  • Note the value of modularisation. It allows each person to concentrate on the bits they are expert in. And note how SVG and CML represent formal contracts for handing over information. That makes unit testing and integration much easier.

  • At today’s meeting we’ll be working out how to take this forward.

  • But even when we are successful…

  • NO MORE PDF CHEMISTRY HAMBURGERS PLEASE…

  • BTW: What Mark is doing is generic and others must have struggled with this. There is a range of PDF2Foo tools (pdf2txt, pstoedit, PDFBox, etc.). I congratulate all those who have waded through the same bog. I am sure Mark would welcome any help and experience here.
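To make the superscript problem in the PDF2SVG step a bit more concrete, here is a rough Python sketch of the kind of guess involved. This is not Mark’s code: the `Glyph` record, the function names and the two thresholds are all assumptions picked for illustration, and it assumes each character arrives with a font size and a baseline in PDF-style coordinates (larger y is higher on the page).

```python
from dataclasses import dataclass


@dataclass
class Glyph:
    text: str
    x: float         # left edge, in page units
    baseline: float  # baseline height; larger means higher on the page
    size: float      # font size


def classify(base: Glyph, cur: Glyph,
             shrink: float = 0.85, shift: float = 0.15) -> str:
    """Guess whether cur is 'super', 'sub' or 'normal' relative to base.

    shrink and shift are made-up thresholds, not values from any spec.
    """
    # A script character is drawn in a noticeably smaller font.
    if cur.size > base.size * shrink:
        return "normal"
    dy = cur.baseline - base.baseline
    if dy > base.size * shift:
        return "super"   # smaller and raised
    if dy < -base.size * shift:
        return "sub"     # smaller and lowered
    return "normal"


def annotate(glyphs):
    """Label each glyph against the most recent full-size glyph."""
    labelled, base = [], None
    for g in glyphs:
        label = "normal" if base is None else classify(base, g)
        if label == "normal":
            base = g
        labelled.append((g.text, label))
    return labelled


water = [Glyph("H", 0, 100, 10), Glyph("2", 7, 98, 7), Glyph("O", 11, 100, 10)]
print(annotate(water))
```

A real document will break this quickly (kerning, mixed fonts, small caps, footnote markers), which is exactly why the SVG that gets handed on needs to carry the guess explicitly rather than leave every later stage to rediscover it.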
