Since Open Data are very rare in chemistry, how do we get them? We’ve been working on making this easier, but it’s still very hard work. So here’s a brief overview:
-
Born-semantic Open. This is the ideal situation, where the tools and the culture are used to create data which is intrinsically semantic and aggressively labelled as Open. There is no production-scale example of this yet. We hope that Chem4Word will be used as a primary tool for creating semantic documents, and that we can add an Open Data stamp to the result. In that way every use of the tool will create Open Semantic data.
-
Conversion of structured and semi-structured Open legacy to Semantic Open. An example of this is CrystalEye, where we aggregate Open legacy data (CIFs) and convert them to CML. This is then published with the Open Data tag on every page. Spectra, if in JCAMP, are also tractable. It is also possible, though harder, to convert most computational chemistry outputs into CML. Gaussian archive and GAMESS are relatively simple; Gaussian logfiles are a nightmare.
-
Born-digital computation. Here we insert FoX or other CML-generating tools into the source code of comp chem programs. We’ve done this for at least 10 programs, which means that we get lossless conversion of comp chem output into CML, with complete ontological integrity.
-
Recovery from text and PDF (text mining). Conversion to PDF destroys all semantics, most structure and all ontology. So we have to rely on messy heuristics to recover anything, and we never get 100% recall or precision. We don’t touch bitmaps. Our current tools in Cambridge can:
-
extract chemical structures from images. This depends on how the image is actually represented, but with vectors rather than bitmaps we have achieved 95% precision on several documents.
-
extract spectra from images. This is also tractable. We haven’t done enough to get metrics and we haven’t covered all the types of instrument, but again ca. 95% is manageable.
-
text-mine documents. OSCAR2 is able to recover peak lists from chemical analytical data at a rate of over 90%, the failures being mainly due to typos and punctuation. OSCAR3 can extract reaction information (Lezan Hawizy) with probably > 80% precision. We can also convert chemical names to structures (OPSIN); Daniel Lowe has made impressive progress here and, for certain corpora, can achieve ca. 70%.
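To give a flavour of why typos and punctuation matter so much for peak lists, here is a minimal sketch in Python. It is not how OSCAR2 works; it assumes a simple 1H NMR style such as "7.26 (m, 5H)", and the pattern `PEAK` and the function `parse_peaks` are hypothetical names made up for this illustration.

```python
import re

# Hypothetical pattern (an assumption, not OSCAR2's grammar):
# a chemical shift, then "(multiplicity, [J = x Hz,] nH)".
PEAK = re.compile(
    r"(?P<shift>\d+\.\d+)\s*"
    r"\(\s*(?P<mult>[a-z]+)\s*,"
    r"(?:\s*J\s*=\s*(?P<j>\d+(?:\.\d+)?)\s*Hz\s*,)?"
    r"\s*(?P<n>\d+)\s*H\s*\)"
)

def parse_peaks(text):
    """Return one dict per peak that matches the assumed format."""
    peaks = []
    for m in PEAK.finditer(text):
        peaks.append({
            "shift_ppm": float(m.group("shift")),
            "multiplicity": m.group("mult"),
            "j_hz": float(m.group("j")) if m.group("j") else None,
            "protons": int(m.group("n")),
        })
    return peaks

# A well-formed peak list (invented sample text).
sample = "1H NMR: δ 7.26 (m, 5H), 3.71 (s, 3H), 1.25 (t, J = 7.1 Hz, 3H)"

# The same style with one punctuation typo: "m." instead of "m,".
broken = "1H NMR: δ 7.26 (m. 5H), 3.71 (s, 3H)"

good_peaks = parse_peaks(sample)
broken_peaks = parse_peaks(broken)
```

With the clean string every peak is recovered; with a single full stop in place of a comma the first peak is silently lost. A real extractor needs far more tolerant heuristics than this.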
Some important caveats. Anything other than born-semantic is lossy. The recall/precision can range from 95% to 5%. That sounds silly, but the results are critically dependent on how the documents were created and published. The more human steps (copying, editing), the worse the recall and precision. But with high-quality PDFs an impressive amount can be extracted.
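For readers who don’t use these terms every day: precision is the fraction of what we extracted that is right, and recall is the fraction of what was there that we found. A small sketch in Python, with invented compound names and a hypothetical helper `precision_recall`, assuming we have a hand-checked "gold" set to compare against:

```python
def precision_recall(extracted, gold):
    """Compare extracted items against a hand-checked gold set."""
    extracted, gold = set(extracted), set(gold)
    true_positives = len(extracted & gold)
    # Guard against empty sets so we never divide by zero.
    precision = true_positives / len(extracted) if extracted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall

# Invented example: four compounds are really in the paper,
# the extractor reports three, one of which is wrong.
gold = {"benzene", "toluene", "phenol", "aniline"}
extracted = {"benzene", "toluene", "pyridine"}

p, r = precision_recall(extracted, gold)
print(f"precision={p:.2f} recall={r:.2f}")
```

Here two of the three reported compounds are correct and two of the four real ones were found, so the two numbers can differ a lot for the same document.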
But the major challenge is restrictive approaches to extraction. If publishers threaten extractors with legal action for getting science from papers, we will have destroyed the dream of Linked Open Data in Science.
But for those publishers who support the creation and publication of Born-semantic Open data, the future is really exciting.
This Blog Post prepared with ICE 4.5.6 from USQ in Open Office