Egon Willighagen blogged this. There is now a real opportunity for the Open Source chemistry community to create high-quality tools for the extraction of molecular information from legacy documents. Besides full-text articles, other good places to look are probably theses and supplemental data.
Before I copy the post, I'll review the methods available (to the Open Source community):
- Explicit connection table. This is the best, but rare. It might occur in theses, but is uncommon. (Some Word documents include binary CDX and/or MDL files, but this is an awful hack. I've done it and don't recommend it.)
- Implicit connection table. PLEASE USE InChI! In the absence of this there might be a SMILES.
- Crystal structure. This is very good and uses CIF2CML. See CrystalEye (http://wwmm.ch.cam.ac.uk). Crystal structure coordinates are often reported in theses and supplemental data.
- output of computational chemistry programs. Again very good and uses CIF2CML code.
- Chemical name. Parsable by OPSIN (part of the OSCAR3 package). Probably runs at between 25% and 70% depending on the domain. Will be improved by lots of little incremental bits (see below).
- Spectral data. Very variable and usually incomplete. Works for small molecules. Use SENECA or look up shifts in NMRShiftDB. Very useful for checking structures created by other methods.
- Chemical structure diagram. This is what is discussed below. Remember that although it's easy for a human to understand a picture, it can be very difficult for a machine. We can divide it into three parts: (a) turn a bitmap into a series of graphics primitives (lines, text); (b) turn the graphics primitives into chemical primitives (bonds, atoms, labels); (c) interpret the chemical primitives as chemistry. The first can be very hard, especially for fuzzy diagrams. The second is much easier, especially when the first has worked well. It is well suited to PDF input, which, although disgusting and horrendous, can reveal the graphics primitives directly. I have done this for several instances of supplemental data and the results are variable. With an increasing number of diagrams munged into PDF, the vectors are often captured well. The third depends on the chemical semantics. Much of it involves recognising conventions (e.g. what does "OBz" mean?). I'm hopeful.
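To make part (b) concrete, here is a minimal sketch of turning graphics primitives into chemical primitives. It is my own illustration, not how OSRA or any existing tool does it. The function name `primitives_to_graph`, the tuple layouts and the tolerance value are all assumptions, and it presumes the lines and text labels have already been extracted (for example from PDF vectors).

```python
from math import hypot


def primitives_to_graph(lines, labels, tol=2.0):
    """Turn line segments and text labels into atoms and bonds.

    lines  : list of (x1, y1, x2, y2) segments (assumed input format)
    labels : list of (x, y, text) element symbols placed at a vertex
    tol    : points closer than this are treated as the same atom
    """
    atoms = []  # each atom: {"x": float, "y": float, "symbol": str}
    bonds = []  # each bond: (atom_index, atom_index)

    def atom_at(x, y):
        # Reuse an existing atom if the point falls within the tolerance.
        for i, atom in enumerate(atoms):
            if hypot(atom["x"] - x, atom["y"] - y) <= tol:
                return i
        # Convention: an unlabelled vertex is a carbon.
        atoms.append({"x": x, "y": y, "symbol": "C"})
        return len(atoms) - 1

    # Every line segment becomes one bond between its two end points.
    for x1, y1, x2, y2 in lines:
        bonds.append((atom_at(x1, y1), atom_at(x2, y2)))

    # A text label overrides the default carbon at the nearest vertex.
    for x, y, text in labels:
        atoms[atom_at(x, y)]["symbol"] = text

    return atoms, bonds


# Two joined lines with an "O" at the far end: the skeleton of ethanol.
atoms, bonds = primitives_to_graph(
    lines=[(0, 0, 10, 0), (10, 0, 20, 0)],
    labels=[(20.5, 0, "O")],
)
print([a["symbol"] for a in atoms], bonds)
```

This ignores everything that makes real diagrams hard (double bonds drawn as parallel lines, wedges, labels that sit a little way off the end of a bond), which is exactly where the heuristics come in.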
In names, spectra and diagrams alike there are a lot of heuristics, and this is where everyone can help. There are probably a few hundred abbreviations, groups, etc. in common use, enough to give us a high degree of success. If we all add a few of these we can make rapid progress. You don't have to be a programmer to do it.
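As a sketch of what such a contribution might look like, the abbreviations could live in a plain data file that anyone can edit, with the code doing nothing more than a lookup. The JSON layout, the names `ABBREVIATIONS_JSON` and `expand_label`, and the convention that the first atom of each SMILES fragment is the attachment point are all my assumptions, not an existing format.

```python
import json

# Hypothetical data file content: abbreviation -> SMILES fragment.
# Assumed convention: the first atom in the fragment is the attachment point.
ABBREVIATIONS_JSON = """
{
  "Me":  "C",
  "Et":  "CC",
  "Ph":  "c1ccccc1",
  "Bn":  "Cc1ccccc1",
  "Ac":  "C(C)=O",
  "Bz":  "C(=O)c1ccccc1",
  "OMe": "OC",
  "OAc": "OC(C)=O",
  "OBz": "OC(=O)c1ccccc1"
}
"""

TABLE = json.loads(ABBREVIATIONS_JSON)


def expand_label(label, table=TABLE):
    """Return the SMILES fragment for a diagram label, or None if unknown."""
    return table.get(label.strip())


print(expand_label("OBz"))   # benzoate ester fragment
print(expand_label("Xyz"))   # unknown labels are reported, not guessed
```

Adding a new group is then a one-line change to the data, which is the kind of small incremental contribution that needs no programming.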
Also, as Egon says, the combination of the methods will help a lot. What's "THF"? It could be tetrahydrofuran or tetrahydrofolate. If you know the formula is C4H8O you know it's the first. If you know it's got two fused six-rings in, even if you can't work out the atoms, it's clearly not the first. And so on.
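The THF case can be sketched as a filter over candidates, where each extra piece of evidence removes the readings it contradicts. This is only an illustration: the `CANDIDATES` table, the `resolve` function and the hand-entered `fused_six_rings` flag are hypothetical, and the tetrahydrofolate entry uses the formula of the neutral acid.

```python
import re

# Hypothetical, hand-entered candidate table for ambiguous abbreviations.
CANDIDATES = {
    "THF": [
        {"name": "tetrahydrofuran", "formula": "C4H8O",
         "fused_six_rings": False},
        {"name": "tetrahydrofolate", "formula": "C19H23N7O6",
         "fused_six_rings": True},
    ]
}


def parse_formula(formula):
    """Parse a simple formula such as 'C4H8O' into {'C': 4, 'H': 8, 'O': 1}."""
    counts = {}
    for element, number in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        counts[element] = counts.get(element, 0) + int(number or 1)
    return counts


def resolve(abbreviation, formula=None, fused_six_rings=None):
    """Return the candidate names consistent with the evidence supplied."""
    matches = CANDIDATES.get(abbreviation, [])
    if formula is not None:
        target = parse_formula(formula)
        matches = [m for m in matches if parse_formula(m["formula"]) == target]
    if fused_six_rings is not None:
        matches = [m for m in matches
                   if m["fused_six_rings"] == fused_six_rings]
    return [m["name"] for m in matches]


print(resolve("THF"))                        # both readings remain
print(resolve("THF", formula="C4H8O"))       # formula from the text
print(resolve("THF", fused_six_rings=True))  # ring evidence from a diagram
```

Neither the formula nor the ring count has to come from the same method as the name, which is the point: weak evidence from several sources can settle what no single source can.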
Enough from me:
Igor wrote a message to the CCL mailing list about OSRA:
We would like to announce a new addition to the set of chemoinformatics tools available from the Computer-Aided Drug Design Group at the NCI-Frederick. OSRA is a utility designed to convert graphical representations of chemical structures, such as they appear in journal articles, patent documents, textbooks, trade magazines etc., into SMILES. OSRA can read a document in any of the over 90 graphical formats parseable by ImageMagick (GIF, JPEG, PNG, TIFF, PDF, PS etc.) and generate the SMILES representation of the molecular structure images encountered within that document.
The email does not give any information on the failure rate, but the demo they provide via the web interface does show some minor glitches (the bromine is not recognized).
- Rich Apodaca said...
- Great find - thanks, Egon!
- Joerg Kurt Wegner said...
- I posted about it yesterday not knowing that you had already posted it. That's funny! I found it in my del.icio.us network and you via CCL ... so the social network seems to work.
- Egon Willighagen said...
- Joerg, I am officially on holiday, but reading my email... so, missed the del.icio.us trigger... Interesting that you mention the CCL mailing list as a social network... to me, social networks were more about being able to socialize with accounts outside my main areas of interest, which CCL would be...
- Antony said...
- I did some testing on this the day it was released, found a number of issues during the tests and blogged about it here: http://www.chemspider.com/blog/?p=83 However, as a first release it definitely has potential and I am looking forward to helping them.
... and, whether or not it's usable directly in other code, we should be able to abstract much of the functionality into code-independent data files.