Aggregated chemistry and quality

David Bradley wrote:

Peter, great to see that WWMM is starting to gain some momentum. I certainly think there are various behavioural properties of chemists that have held many back from taking part in blogs and wikis and general web 2.0 type stuff. But, I interviewed Steve Bachrach for Reactive Reports and he admits to having been reluctant to get involved but has now recognised (through reading your blog) that there is value in these technologies and is soon to start his own. The interview is here
Meanwhile, I take it you saw the launch of the chemistry database search tool, which the developers reckon could provide a one-stop shop for chemistry database searching (PubChem, ChEBI, academic, NFP, and commercial). It’s got 10 million+ entries so far, but I think they’re adding to it all the time. (I have to confess an interest here, as they gave me some webspace to host a new chemistry blog – Spinneret)

I didn’t see the launch, so I had a look, and typed “Sodium chloride”. I got three entries, one of which (formatting removed) was:

Structure: Cl2Na2
ChemSpider ID: 5290429
Molecular Formula: Cl2Na2
Molecular Weight: 22.99
Nominal Mass: 23 Da
Average Mass: 22.9892 Da
Monoisotopic Mass: 22.989221 Da
ACD/Name (Index):
SMILES: [Na+].[Na+].[Cl-].[Cl-]
InChI: InChI=1/2ClH.2Na/h2*1H;;/q;;2*+1/p-2/f2Cl.2Na/h2*1h;;/q2*-1;2m
Data Source(s):
Names, Database IDs and Synonyms: sodium chloride; (NaCl)2

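This is rubbish. The formula has been doubled to Cl2Na2, yet the molecular weight, nominal mass and average mass are all those of a single sodium atom (about 22.99). The formula mass of NaCl is about 58.44, and of Cl2Na2 about 116.88.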
There is little point in collecting 10 million structures if you cannot rely on any of them. It actually detracts from the hard work of people like Stefan, Christoph and others on NMRShiftDB as the general user of the database will judge all entries by the lowest common denominator.
There are two modern ways to ensure quality in large robotically created collections.

  • use robot heuristics to validate quality. This is what Nick Day does in CrystalEye. The quality of the entries is potentially superior to the published structures.
  • use social computing. This is what Wikipedia does. It has 23,000 chemical entries. Brilliant.
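
To make the first option concrete, here is a minimal sketch of the kind of check a robot could run on an aggregated record. It is not how CrystalEye or any aggregator actually works; the function names, the tiny atomic-weight table and the 0.5 tolerance are all my own assumptions for illustration. It recomputes the mass from the formula and compares it with the reported molecular weight.

```python
import re

# Rounded standard atomic weights; only a handful of elements are listed
# here (an illustrative table, not a complete one).
ATOMIC_WEIGHTS = {
    "H": 1.008, "C": 12.011, "N": 14.007,
    "O": 15.999, "Na": 22.990, "Cl": 35.45,
}

def formula_mass(formula):
    """Return the average mass implied by a simple formula such as 'Cl2Na2'.

    Only flat formulas are handled (element symbols with optional counts,
    no brackets or charges).
    """
    if not re.fullmatch(r"(?:[A-Z][a-z]?\d*)+", formula):
        raise ValueError(f"cannot parse formula: {formula!r}")
    total = 0.0
    for symbol, count in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        if symbol not in ATOMIC_WEIGHTS:
            raise ValueError(f"unknown element: {symbol!r}")
        total += ATOMIC_WEIGHTS[symbol] * (int(count) if count else 1)
    return total

def mass_is_consistent(formula, reported_weight, tolerance=0.5):
    """True if the reported weight agrees with the formula within tolerance."""
    return abs(formula_mass(formula) - reported_weight) <= tolerance

# The record quoted above: formula Cl2Na2, reported molecular weight 22.99.
print(mass_is_consistent("Cl2Na2", 22.99))
```

A check this crude would already flag the entry above, since the formula implies a mass of roughly 116.88 while the record reports 22.99.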

So my solution would not be to proliferate substandard aggregators, but to concentrate on robotic and social annotation of PubChem. PubChem is essentially a linkbase onto which the robotosphere and the blogosphere can layer annotations. We can therefore build standoff annotations, held separately from the records they describe, on resources such as CrystalEye and PubChem, with both the community and its robots contributing. Both are Open Data (they can be completely downloaded) and are ideal starting points around which to create the Open chemosphere.
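
As a rough illustration of what standoff annotation means in practice, the sketch below keeps annotations in their own store and points at records only by identifier, so the underlying database is never modified. The class names, the fields and the identifier string are hypothetical, invented for this example rather than taken from PubChem or CrystalEye.

```python
from collections import defaultdict
from dataclasses import dataclass

@dataclass(frozen=True)
class Annotation:
    target: str   # identifier of the annotated record (hypothetical format)
    author: str   # name of the person or robot making the annotation
    kind: str     # "robot" or "human"
    note: str     # the annotation itself

class AnnotationStore:
    """Holds annotations separately from the records they refer to."""

    def __init__(self):
        self._by_target = defaultdict(list)

    def add(self, annotation):
        self._by_target[annotation.target].append(annotation)

    def for_target(self, target):
        # Return a copy so callers cannot alter the store by accident.
        return list(self._by_target.get(target, []))

store = AnnotationStore()
record_id = "example-db:record-1"  # placeholder, not a real database ID
store.add(Annotation(record_id, "mass-checker", "robot",
                     "reported weight disagrees with formula"))
store.add(Annotation(record_id, "a-chemist", "human",
                     "formula should be NaCl, not Cl2Na2"))

for annotation in store.for_target(record_id):
    print(annotation.kind, annotation.author, annotation.note)
```

Because the annotations only carry a pointer to the record, robots and humans can layer comments on the same entry independently, and anyone can download the annotations without needing write access to the source.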

