Organic Theses: Hamburger or Cow?

This is my first attempt to see if a chemistry thesis in PDF can yield any useful machine-processable information. I thank Natasha Schumann from Frankfurt for the thesis (see below for credits).
A typical chemical synthesis looks like this (screenshot of PDF thesis).
For non-chemists, this consists of the name and number of the compound, the recipe for the synthesis, a structural diagram (picture), number and analytical data (Mass Spec, Infrared and Ultraviolet). This is a very standard format and style. The ugly passive (“To a solution of X was added Y”) is unfortunately universal (cf. “To my dog was donated a bone by me”). The image is not easily deconstructed (note the badly placed labels “1” and “+”, which make machine interpretation almost impossible – that is where we need InChIs).
I then ran PDFBox on the manuscript. This does as good a job as can be expected and produces the ASCII representation.
This is not at all bad: the diagram is obviously useless and the Greek characters are trashed, but the rest is fairly good. I fed this to OSCAR1; it took about 10 seconds to process the whole thesis. You can try this as well!
OSCAR has parsed most of the text (obviously it can’t manage the diagram labels, but the rest is sensible). It has extracted much of the name (fooled a bit by the Greek characters) and pulled out everything in the text it was trained to find (nature, yield, melting point). It cannot manage the analytical data because the early OSCAR only read RSC journals, but OSCAR3 will do better and can be relatively easily trained to manage this format.
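To give a flavour of what pulling yield and melting point out of the extracted text involves, here is a minimal Python sketch. It is not how OSCAR works internally (nothing about its grammar is shown in this post). The regular expressions, the function name `extract_properties` and the sample sentence are all my own assumptions for illustration. The degree sign is treated as optional because PDF extraction may trash it along with the Greek characters.

```python
import re

# Hypothetical patterns for illustration only; these are NOT OSCAR's rules.
# Yield written either as "Yield: 85 %" or as "85% yield".
YIELD_RE = re.compile(
    r"yield[:\s]+(\d+(?:\.\d+)?)\s*%|(\d+(?:\.\d+)?)\s*%\s+yield",
    re.IGNORECASE,
)

# Melting point such as "M.p. 112-114 °C" or "mp 98 C".
# The degree sign is optional because text extraction may drop it.
MP_RE = re.compile(
    r"\bm\.?\s?p\.?:?\s*(\d+(?:\.\d+)?)(?:\s*[-–]\s*(\d+(?:\.\d+)?))?\s*°?\s*C",
    re.IGNORECASE,
)


def extract_properties(text):
    """Return the first yield (%) and melting point range (deg C) found, or None."""
    result = {"yield_percent": None, "melting_point_c": None}

    y = YIELD_RE.search(text)
    if y:
        # Exactly one of the two alternatives captured the number.
        result["yield_percent"] = float(y.group(1) or y.group(2))

    m = MP_RE.search(text)
    if m:
        low = float(m.group(1))
        # A single value is reported as a zero-width range.
        high = float(m.group(2)) if m.group(2) else low
        result["melting_point_c"] = (low, high)

    return result


if __name__ == "__main__":
    # Made-up sample sentence, not taken from the thesis.
    sample = "Yellow solid. Yield: 85 %. M.p. 112-114 °C."
    print(extract_properties(sample))
```

A real extractor has to cope with far more variation than this (units, ranges written in words, decomposition points), which is why a trained tool does better than a couple of patterns.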
So first shots are better than I have got in the past. OSCAR found data for 40 compounds – ca. 4 per second. Assuming that there are many similar theses there is quite a lot it can do. But not all have PDF that behaves this well…
Acknowledgement (from PDF)
Chiral Retinoid Derivatives:
Synthesis and Structural Elucidation of a New Vitamin A Metabolite
Dissertation approved by the Faculty of Life Sciences of the Technische Universität Carolo-Wilhelmina zu Braunschweig for the award of the degree of
Doktorin der Naturwissenschaften (Dr. rer. nat.),
by Madalina Andreea Stefan from Ploiesti (Romania)

This entry was posted in chemistry, data, XML.
