As I have shown, it is hard and lossy to recover information from theses (or anything else!) written in PDF. In unfavourable cases recovery fails completely. I have a vision which I’ll reveal in future posts, but here I’d like to know how you wrote, write (or intend to write) your theses. This is addressed to synthetic chemists, but other comments would be useful. I have a real application and potential sponsorship in mind.
Firstly, I guess that most of you write using Word. Some chemists use LaTeX (and Joe, who is just writing up, told us that the most important thing he would do differently if he were starting his PhD again would be to use LaTeX). I would generally agree with this, although I am keen to see – in the future – what can be done with Open Office and Open Document tools, which will use XML as the basis. The unpredictable thing is how quickly OO arrives and what authoring support it has.
A main reason for using Word is that it supports third-party tools whose results can be embedded in a Word document. The most important of these are molecular editors (such as ChemDraw (TM) and ISIS/Draw (TM)). These are commercial, closed-source products. They also generally use binary formats which are difficult to untangle. (When these formats are embedded in Word they are all but impossible to decode – the Word binary format is not documented and efforts to decipher it are incomplete.) In some cases I could extract many (but not all) of the ChemDraw files in a document. There are also MS tools such as Excel.
I’d be interested to know if OO and/or the release of MS’s XML format has changed things and what timescales we can reasonably expect for machine-processable compound documents. But for the rest of the discussion I’ll assume that the current practice is Word + commercial tools. (In later posts I shall try to evangelise a brighter future…)
The typical synthetic chemistry thesis contains, inter alia:
- discursive free text describing what was done and why
- an enumerated list of compounds (often 200+) with full synthetic details and analytical data.
The free text looks like:
============ OR ===========
===== the compound information looks like ========
Note that compounds are identified by a bold identifier (e.g. 38) which normally increases in serial order throughout the text. This is fragile, in that the insertion of a new compound requires manual renumbering throughout the text (this is confirmed by various chemical gurus). Compounds are drawn in the middle of free text sections, and again in the compound information. There are no tools to enforce consistency between the numbering and the diagrams. Moreover, information such as reagents, yields, and physical and analytical data is repeated in several places. It has to be manually transcribed and (unless you tell me differently) this is a tedious, frustrating and error-prone process.
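To make the numbering problem concrete, here is a minimal sketch in Python of one possible alternative: the author writes a symbolic key for each compound and a script assigns the serial numbers in order of first appearance. The `{cpd:key}` placeholder syntax and the `number_compounds` function are invented purely for illustration; they are not a feature of Word, ChemDraw or any other existing tool.

```python
import re

# Hypothetical placeholder syntax, assumed for this sketch only: {cpd:some-key}
PLACEHOLDER = re.compile(r"\{cpd:([A-Za-z0-9_-]+)\}")


def number_compounds(text):
    """Replace symbolic compound keys with serial numbers.

    Numbers are assigned in order of first appearance, so every later
    mention of the same key receives the same number.
    """
    numbers = {}

    def substitute(match):
        key = match.group(1)
        if key not in numbers:
            numbers[key] = len(numbers) + 1
        return str(numbers[key])

    return PLACEHOLDER.sub(substitute, text), numbers


draft = (
    "Aldehyde {cpd:aldehyde} was reduced to alcohol {cpd:alcohol}. "
    "Oxidation of {cpd:alcohol} regenerated {cpd:aldehyde}."
)
numbered_text, numbers = number_compounds(draft)
print(numbered_text)
# Aldehyde 1 was reduced to alcohol 2. Oxidation of 2 regenerated 1.
```

With this approach, inserting a new compound earlier in the text only means adding a new key; rerunning the script renumbers every mention consistently. It does nothing for the diagrams or the repeated analytical data, which would need the same kind of single-source treatment.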
Moreover, at this stage of writing the thesis the student has to assemble all the data for the 200 compounds. Are they all there? Could any of the spectra be muddled? Is that figure in the lab book a 2 or a 7? Heaven help you if a spectrum is missing and the compound has now decomposed into a brown oil or got lost in the great laboratory flood. Of course none of this ever happens…
So are you all happy with how you authored, or will author, your thesis? (I haven’t even touched on how peaks are transcribed from spectra and how the rigmarole of spectral peaks has to be authored and formatted.) If you are happy, I’ll shut up. Otherwise I will make some serious and positive suggestions in a later blog post.