I received the following wonderful mail yesterday, to which I have replied. (I don’t reveal identities of emailers without permission but I sometimes quote.) Daniel Lowe has seen my reply and I have included his amendments.
Prof Foo reminded me of your call-out for OPSIN vocabulary contributors.
My colleague Xyzzy Bar and I have been toying with the possibility of using OPSIN and OSRA as part of our KNIME workflows.
Not being programmers, the only way we can contribute to the development of these tools would be assisting in ‘less skilled’ efforts.
Could you let us know what you had in mind for vocabulary contributors? If you were thinking of something along the lines of keeping a list of phrases that OPSIN misinterpreted as we came across them, then submitting that list along with corrections, we would be happy to participate.
This is fantastic. Generally the volunteers in Open Source projects hack code but we are increasingly finding systems where many other contributions are equally or more valuable. So I replied
Our vision – and we were exploring this today – is that Open Source is catching up with the commercial components and will soon overtake them. We see OSCAR as the leading recognizer of chemical entities at present – metrics are hard to come by as it’s difficult to do controlled tests on closed source, but informal reports confirm that.
We also have reports of at least two software companies which are interested in using OSCAR in their workflows and integration. And if I didn’t tell you already we now have a project with OMII (ex eScience and JISC – Steve Brewer (copied)) where it is being refactored. Your interest will give considerable encouragement. We meet on Thursday.
We also see OPSIN and OSRA overtaking current commercial tools. (We have some open source code which we could make available to open-source projects for structure reconstruction from images, such as OSRA.) So we’ll concentrate on OPSIN for now.
Daniel Lowe (copied) has been doing great work on OPSIN, which parses IUPAC names into structures. Note that it is very difficult to get certified IUPAC names in the public domain for metrics, so we rely on IUPAC names generated by programs. In practice this is probably OK as an increasing number of IUPAC names are probably **generated** by programs. We know of 4 programs doing this – FooChem, BarChem, Y2Chem and ZorkChem. Daniel has names generated by all 4. To get a good sample he takes 10000 molecules from PubChem and their generated names and then analyses them for recall (how many structures are interpreted) and precision (how many of those are correct). I think the following is a reasonable summary (note that the commercial systems have both name generators, e.g. producing FooNames, and interpreters, producing FooStructures):
* names from the 4 programs are significantly different.
* OPSIN has been developed against FooNames.
* OPSIN has better precision than any other program
* Against FooNames OPSIN scores over 80% recall and 98+% precision. FooStructure is slightly better and BarStructure slightly worse.
* OPSIN is somewhat worse than FooStructure against other program-generated names but significantly better than BarChem
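As a rough illustration of how recall and precision are used above, here is a minimal Python sketch. It is not Daniel's actual test harness: the function name `recall_and_precision`, the dictionary layout and the toy identifiers are all assumptions made for illustration, and it assumes that reference and interpreted structures have already been reduced to comparable canonical strings.

```python
from typing import Dict, Optional, Tuple


def recall_and_precision(
    expected: Dict[str, str],
    predicted: Dict[str, Optional[str]],
) -> Tuple[float, float]:
    """Return (recall, precision) as the terms are used in the text above.

    expected  maps each name to its reference structure (a canonical string).
    predicted maps each name to the interpreter's structure, or None when the
    name could not be interpreted. Both layouts are hypothetical.
    """
    total = len(expected)
    # Recall: the fraction of names that were interpreted at all.
    interpreted = [name for name in expected if predicted.get(name) is not None]
    # Precision: the fraction of interpreted names whose structure is correct.
    correct = [name for name in interpreted if predicted[name] == expected[name]]
    recall = len(interpreted) / total if total else 0.0
    precision = len(correct) / len(interpreted) if interpreted else 0.0
    return recall, precision


# Toy data: 5 names, 4 interpreted, 3 of those correct.
expected = {"n1": "A", "n2": "B", "n3": "C", "n4": "D", "n5": "E"}
predicted = {"n1": "A", "n2": "B", "n3": "C", "n4": "X", "n5": None}
recall, precision = recall_and_precision(expected, predicted)
print(recall, precision)  # 0.8 0.75
```

With these definitions a cautious interpreter that declines names it is unsure of scores lower recall but higher precision, so the two numbers need to be read together.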
…OPSIN can be enhanced by:
* adding new syntaxes – e.g. dispiro compounds (needs a programmer)
* adding new rules to existing syntaxes (e.g. different functional classes). This currently requires programming
* adding new vocabulary (e.g. semi-trivial names with numbering). This can be done completely outside the codebase. Thus “7-methyl-guanosine” requires the entry of guanosine and its numbering scheme into a vocabulary file – any dedicated chemist can do this with minimal training.
* testing. This is very valuable: analysing why OPSIN fails on recall and where extra effort is best spent.
* generation of corpora. This is also extremely valuable but quite tricky as we have to avoid copyright.
* documentation and tutorials. Always golden
We have a student joining us for 3 weeks in the summer and have to have OPSIN in a state where she can make contributions. So we are very highly motivated to get OPSIN into a shape where volunteers can crowdsource it.
We don’t need you to be able to hack code but we do need you to be able to install and run OPSIN and to gather metrics. I’ll talk to Daniel tomorrow about how we can put together suggestions. BTW we are keen on KNIME and we have close links to that group.
So some questions (some order of preference if possible). I am assuming that you have experience in chemistry (hope that’s not too rude)
* could you install and run OPSIN (a Java program) – you can ask colleagues for help.
* do you have a corpus of names that you are currently interested in? If so, please describe its provenance. To be useful it will need corresponding structures.
* Would you be prepared to analyse OPSIN failures and systematize the reasons why they failed?
* could you enter new rules – these are archetypal structures as SMILES strings and name features.
* could you enter new structures (e.g. guanosine) and their numbering?
I warn you that this is boring work and the payback is in many small increments. I tend to do this sort of thing while watching the cricket – enter a structure while the bowler walks back – watch the ball – enter several structures while drinks are taken – watch the next ball – enter tens of structures when bad light starts…
And we’d be very interested in what you are doing anyway and where this fits in.