Community involvement in information capture and extraction

There has been a large increase in the number of people and organisations interested in extracting or capturing chemical information from the public domain. This is typified by the ongoing discussions between individuals and organisations. Here’s a comment on this blog from Antony Williams – Chemspiderman – who has been working very hard to develop approaches towards Open Data (comment to Open Data in Science):

I’m in the middle of curating all chemical structures on Wikipedia. I spent a couple of hours discussing it with Martin Walker last night. The process involves a lot of manual work…I’m at over 150 hours right now. There are issues with chemical names not matching the structure diagrams (people can use nomenclature very poorly!) so this will be an ongoing issue for ANYBODY using name to structure conversion. However, there are many names agreeing with the chemical structure. Have you thought about applying OSCAR to Wikipedia to generate a real structure file? You can then add that into the WWMM and hook up to Wikipedia. If you wait a while I’ll have one done and will hopefully be able to get Wikipedia to accept InChIKeys on the structures directly and therefore make Wikipedia searchable by InChIKey. I’ll blog about this soon but have other deadlines in the way at present. I have just co-authored a book chapter on name to structure conversion and talked about OSCAR-3 but couldn’t comment too much on capabilities. I can add it in at proofing. Here are 10 names of structures on Wikipedia…they are correct for the structures. You commented “If the names can be interpreted or looked up then OSCAR does a good job.” How well does OSCAR do on this set of 10? If you want to post the InChI strings I’ll check the structures and let you know…

We are very grateful for this work. We are also doing similar things and we’d be delighted to coordinate – I have also been mailing Martin and have booked an IRC session with him and WP-CHEM colleagues as soon as possible.
As Antony says, there is a lot of hard work. The good news about social computing – of the sort he and we have been fostering – is that in principle it can scale. The difficulty is that technically hard projects are hard to run collaboratively – and this is a technically hard project. The reason is that it is not about certainty – what is the formula of “snow”? – but requires evaluation of assertions (X says the formula of A is C30H22O5; Y says the formula of A is C32H24O5).
There is an awful lot of grunt work. First we have to get the data. For Wikipedia this has been done manually, but I am looking at whether data can be extracted from other sources and fed in automatically. There are at least 1000 common compounds that “should” be in WP. There’s the problem of rights – I think we are getting to the stage where the resistance to mining data from chemistry text will weaken. Then we have to deal with the syntax. PDF is still a major hurdle. Can we use images? (I’ll post about that later.) My work over the holiday has shown that extraction from web pages is still fragile, but we can get a lot. (E.g. does anyone have a parser for ALL inline chemical formulae – e.g. C(CH3)2=(CH2)2COC(CH3)2Cl.2H2O? JUMBO does a so-so job. If anyone can do better that would be very useful.)
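To give a flavour of what such a formula parser has to cope with, here is a minimal sketch of my own (illustrative only – not JUMBO’s code): it counts atoms in an inline formula, handling element symbols, counts, nested parentheses and dot-separated hydrate parts, and simply skips structural symbols such as ‘=’, so only the overall composition is recovered.

```python
import re
from collections import Counter

def parse_formula(formula):
    """Count atoms in an inline formula such as 'CuSO4.5H2O'.
    Handles nested parentheses and dot-separated hydrate parts;
    structural symbols such as '=' are skipped."""
    def read_count(s, i):
        # Read an optional integer multiplier starting at position i.
        j = i
        while j < len(s) and s[j].isdigit():
            j += 1
        return (int(s[i:j]) if j > i else 1), j

    def parse_group(s, i):
        # Recursively parse a (possibly parenthesised) group of atoms.
        counts = Counter()
        while i < len(s):
            if s[i] == '(':
                inner, i = parse_group(s, i + 1)
                mult, i = read_count(s, i)
                for el, n in inner.items():
                    counts[el] += n * mult
            elif s[i] == ')':
                return counts, i + 1
            elif s[i].isupper():
                # Element symbol: one uppercase letter, optional lowercase.
                el = re.match(r'[A-Z][a-z]?', s[i:]).group(0)
                mult, i = read_count(s, i + len(el))
                counts[el] += mult
            else:          # skip structural symbols such as '=' or '-'
                i += 1
        return counts, i

    total = Counter()
    for part in formula.split('.'):
        # A dot-separated part may carry a leading multiplier, e.g. '5H2O'.
        m = re.match(r'\d+', part)
        mult = int(m.group(0)) if m else 1
        sub, _ = parse_group(part[m.end():] if m else part, 0)
        for el, n in sub.items():
            total[el] += n * mult
    return total
```

On the example above, `parse_formula("C(CH3)2=(CH2)2COC(CH3)2Cl.2H2O")` gives C9 H20 O3 Cl – a real parser would also have to cope with charges, isotopes and malformed input.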
Then, when the data have all been extracted, those from different sources can be compared. This often shows real errors. In this case we absolutely need reasoning tools such as RDF. Reasoning will highlight inconsistencies of the type above, but it can’t resolve them. Can we develop heuristics including probability? Recommender systems – A has fewer inconsistencies per entry than B, so we weight A higher.
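A toy sketch of what such comparison and recommender-style weighting might look like (the function names and the weighting formula are mine, purely illustrative):

```python
from collections import defaultdict

def find_conflicts(assertions):
    """assertions: list of (source, compound, formula) triples.
    Return the compounds for which sources assert different formulae."""
    seen = defaultdict(set)
    for source, compound, formula in assertions:
        seen[compound].add(formula)
    return {c: f for c, f in seen.items() if len(f) > 1}

def source_weights(assertions):
    """Recommender-style weighting: the larger the fraction of a
    source's entries involved in conflicts, the lower its weight."""
    conflicted = find_conflicts(assertions)
    total = defaultdict(int)
    bad = defaultdict(int)
    for source, compound, formula in assertions:
        total[source] += 1
        if compound in conflicted:
            bad[source] += 1
    return {s: 1.0 - bad[s] / total[s] for s in total}
```

With the X/Y example above – `[("X", "A", "C30H22O5"), ("Y", "A", "C32H24O5")]` plus agreeing entries for a compound B – `find_conflicts` flags compound A, and both sources are marked down equally, since the data alone cannot say which one is wrong.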
I shall respond to the technical questions on images and names in separate posts. I make it very clear that this is research – not a production system. There may be cases where precision and recall run at 20%. This is not a failure, it’s a starting point. Some of this is skunkworks – and I am reluctant to involve the community in skunkworks. It takes time before we can reasonably release development code on SourceForge.
Part of the point is to encourage authors and publishers to deposit semantic data as well as text. If all papers had InChIs (with compound numbers) then we probably wouldn’t have to extract stuff from images. (There are still compounds which can only be represented graphically). Similarly if all chemists published NMR spectra with molecular structures and assignments then we wouldn’t have to do any of these. All this is technically possible already.
There are many areas where the community can help. Chemical nomenclature is one. Part of the low recall for OPSIN is that many compounds aren’t in its vocabulary. Extending it is not part of our research, but it’s relatively easy for anyone to add these names – and once added they are done. I’m guessing that we could double the precision/recall by this method. But I’ll comment on this in detail later.
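The shape of such a community contribution is simple – a lookup table consulted when grammar-based parsing fails. This is only a sketch of the idea, not OPSIN’s actual vocabulary format or API:

```python
# Hypothetical illustration only: OPSIN's real vocabulary format differs.
COMMUNITY_VOCABULARY = {
    # trivial name -> InChI; each entry is contributed once, usable forever
    "water": "InChI=1S/H2O/h1H2",
    "benzene": "InChI=1S/C6H6/c1-2-4-6-5-3-1/h1-6H",
}

def name_to_structure(name, grammar_parser=None):
    """Try systematic (grammar-based) parsing first; fall back to the
    community-maintained vocabulary for trivial names."""
    if grammar_parser is not None:
        result = grammar_parser(name)
        if result is not None:
            return result
    return COMMUNITY_VOCABULARY.get(name.lower())
```

Every trivial name added to the table converts a guaranteed recall failure into a guaranteed hit, which is why even non-programmers can make a real difference here.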
So I think we shall see a valuable increase in distributed Open chemical information projects this year. It’s difficult to get funding – but not impossible – and we are hopeful. One important activity is workshops.
More later (including comments on OPSIN, OSRA, etc.).


9 Responses to Community involvement in information capture and extraction

  1. –snip–
    Similarly if all chemists published NMR spectra with molecular structures and assignments then we wouldn’t have to do any of these. All this is technically possible already.
    –snip–
    I fully agree; the real-world situation is different – usually you get PDFs from your library server. Therefore I appreciate your OSCAR-3 approach. The only problem I have is separating claims from reality. In one of your blog posts you have shown that OSCAR is able to extract shift values, comments and frequency. This is a reasonable starting point for research – but the central point, “understanding the structure, its numbering and the assignment”, is a very complex task which seems totally unresolved to me. On the other hand there is Egon Willighagen’s claim that he has already generated a fully assigned CNMR dataset fully automatically using the same program, which really surprises me when I compare your specifications with what is necessary to get a fully assigned dataset.
    My personal opinion is that a claim and the reality should have a certain overlap. OSCAR-3 is a good idea for an interesting research topic, but the claim shouldn’t be ‘assigned datasets’ – it should be ‘extraction of information from a text in order to make subsequent MANUAL processing easier’ – that’s it at the moment. When HKO’s list of specifications is done, then the claim ‘fully assigned NMR dataset’ is justified.
    In the case where the spectral information is published as file(s) in an agreed format, OSCAR is no longer necessary. The only specification I have for the format of such a (set of) file(s) is simple: it must be convertible WITHOUT ANY LOSS of information – format details like CML, SDF, MOL, JCAMP etc. are a second-order problem.

  2. Peter…you commented “There’s at least 1000 common compounds that “should” be in WP.” Can you send me the list? It is possible that the basic data already exist in ChemSpider, or at least the majority of them. I can extract these framework data as structures, validated identifiers, InChIKeys etc and work with Martin Walker and WP-Chem to initiate the articles. Please send me the list offline.
    In December Martin and I were discussing the need for an IRC chat with WP-CHEM to review the work that I committed to over the Xmas holidays. Since I have progressed fairly well you might want to sit in on that call when Martin schedules it. There is definitely work to be done. And it’s mostly eyeball work…lots of manual checking.

  3. Wolfgang,
    “On the other hand there is Egon Willighagen’s claim that he has already generated fully automatically an assigned CNMR-dataset using the same program, which really surprises me, when I compare your specifications with what is necessary to get a fully assigned dataset.”
    pardon me??? Where did I make that claim?
    You asked “How many fully assigned C-NMR spectra have been integrated into NMRShiftDB using OSCAR-3 during the past 6 months?”, to which I answered that I have used OSCAR3/Bioclipse to submit a fully assigned structure–NMR combination to NMRShiftDB. That *is* reality. There was no ‘automatic’ in that question; I used a semi-automatic approach, as explained in Peter’s first blog item on this. My apologies for not having understood that you intended ‘automatic’ there, but please do not put words in my mouth.

  4. (3) Egon:
    I was talking about OSCAR-3 / AUTOMATIC data extraction **from the literature**, which is the purpose of OSCAR-3. I agree that I used the word ‘AUTOMATICALLY’ only in the first question and not in the second one, which is definitely my fault. When asking about a program whose sole purpose is to *automatically* collect data from publications, the further use of the word ‘automatically’ contains some redundancy.
    Please see my other post on
    http://wwmm.ch.cam.ac.uk/blogs/murrayrust/?p=916#comments ( see #11 )

  5. pm286 says:

    (4)
    The purpose of OSCAR3 is NOT primarily to support data extraction from the literature – this is a very minor part. OSCAR3 has many roles including chemical entity recognition, chemical phrase recognition, entity typing and standoff annotation, and it acts as a primary tool for research into some of the next generation of chemical natural language systems. OSCAR3 is integrated into a number of ontologies, and can also create measures of confidence when parsing. Please understand that this is NLP research, not production data extraction. OPSIN is an add-on which carries out name2structure conversion. It turns out that OSCAR3 does a very good job of extracting peakLists, with high precision/recall (usually > 90%, which is very good for NLP). Any interpretation of those is done outside OSCAR.
    OSCAR-DATA represents OSCAR aligned towards data checking; it does a useful job there, is used by some referees, and we know of some journals that also use it.
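    For anyone unfamiliar with the metrics quoted above: precision and recall for a peakList extractor reduce to simple set overlap between the extracted peaks and a hand-checked gold standard. A minimal sketch (illustrative – not OSCAR3’s evaluation code):

```python
def precision_recall(extracted, gold):
    """Precision = fraction of extracted items that are correct;
    recall = fraction of gold-standard items that were found."""
    extracted, gold = set(extracted), set(gold)
    tp = len(extracted & gold)  # true positives
    precision = tp / len(extracted) if extracted else 0.0
    recall = tp / len(gold) if gold else 0.0
    return precision, recall
```

    So “> 90%” means that over nine in ten extracted peaks are genuine, and over nine in ten genuine peaks are found.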

  6. pm286 says:

    (2).
    This is a VERY rough estimate formed by looking at our RDF of Wikipedia and some we have got from elsewhere on the web. I haven’t got a list – when I have tamed the RDF I’ll post one.
    I’d like to start posting the results but I want to make sure I don’t offend anyone’s rights first.
    We’ll probably meet in IRC – haven’t heard from Martin yet.

  7. (6) Peter…sorry, I misinterpreted your original meaning. I assumed when you said “There’s at least 1000 common compounds that “should” be in WP.” that it would simply be a list of chemical names, and I can’t see how you would offend anyone’s rights with that list. If that’s not Open Data, what is?

  8. Pingback: Dedicating Christmas Time to the Cause of Curating Wikipedia at The ChemConnector Blog - Observations and Musings for the Chemistry Community

  9. pm286 says:

    (7) There’s no problem if humans type the data in – but if we extract them robotically there might be.
    P.
