World Wide Molecular Matrix, Pubchem, quality metrics, etc.

David Bradley has assured me he will take my concerns back to the ChemSpider people. This doesn’t remove my concerns, and I’ll explain why.
There is no Open gold standard for chemical information. That’s a historical fact, and we have to do what we can to put it right. Chemical information is difficult, and the closed and restrictive nature of the commercial chemical software companies and information suppliers means there have been no community efforts to solve problems of quality. As a result, quality is appallingly low, as evidenced by my example.
The simple message is that many chemical software and information companies have had little concern for quality. The ChemZoo experience shows that they don’t care what they output.
This is a very depressing experience for us in academia. Chemoinformatics is done on suspect data sets with closed software or bespoke, non-portable systems. Metrics are impossible. Computer scientists and others would be horrified (Peter Corbett and co. spend a lot of their time creating metrics for linguistic tools).
Pubchem offers the primary ray of hope. It has two main sources (correct me if I am wrong):

  • data from the NIH environment (NCI, Roadmap, etc.), which usually contains biological measurements
  • connection tables, links and metadata donated from many organisations in the chemical community. Examples are the ZINC database from John Irwin, compounds in Nature Chemical Biology, NMRShiftDB, etc. When CrystalEye is released we’ll give them ours.

The data in Pubchem is very variable. It has been accumulated over 30+ years in the search for anticancer compounds and others (JC is giving some of his compounds for screening). There were no good ways of managing the chemistry 35 years ago. Marc Nicklaus from NCI has shown that 30% of the compounds are not what it says on the bottle. That’s fine – we are all clear about that. It’s not Pubchem’s role to sort out the historical data.
So there is junk in the historical record. And there is junk in some of the links donated. That may be where the Na2Cl2 for sodium chloride came from. If there is – say – an error rate of 1% (and I suspect it’s much higher) that’s 100,000 errors in Pubchem. No single human will clean that up.
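The arithmetic behind that estimate can be sketched; the database size of ten million records is an assumption chosen to match the figures in the text, not an official Pubchem count:

```python
# Back-of-envelope estimate of absolute error counts in a large
# chemical database, given an assumed per-record error rate.
# The record count and rates below are illustrative assumptions,
# not official Pubchem statistics.

def estimated_errors(n_records: int, error_rate: float) -> int:
    """Expected number of erroneous records for a given error rate."""
    return round(n_records * error_rate)

n_records = 10_000_000  # assumed database size

for rate in (0.01, 0.05, 0.10):
    print(f"{rate:.0%} error rate -> {estimated_errors(n_records, rate):,} bad records")
```

Even the optimistic 1% rate yields a cleanup task far beyond manual curation, which is the point: quality control has to be automated and community-driven.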
But one of the many virtues of Pubchem is that it is now THE open repository for molecular metadata. It seems obvious and natural that anyone wanting to open their data should give Pubchem the metadata. And that’s what they do – Steve Bryant has some graphs showing the rise in depositions (I can’t find them immediately).
But Pubchem is not, and should not be, a data repository except for NIH data. But nor should any other organisation try to aggregate all the data. What we should do is pool the metadata (InChIs, names, etc.) in Pubchem and develop links and searches to distributed repositories and datasets elsewhere.
Surely it’s inefficient to search over multiple databases? No, the bioscientists do it all the time using WebServices and workflows. Chemistry is 10 years behind in its thinking and would do well to look at modern ways of doing things rather than developing eMolecules, ChemSpider, etc. (As you can see, they are all completely reliant on Pubchem to provide most of the resource).
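The federated model – pool lightweight metadata such as InChIs centrally, keep the full data in distributed repositories – can be sketched with toy in-memory “repositories”. All repository names, InChIs and records here are illustrative placeholders, not real deposited data:

```python
# Toy sketch of federated search: a central index maps molecular
# metadata (here, InChI strings) to the distributed repositories
# that hold the full data records. All names and values below are
# illustrative placeholders.

# Each "repository" holds full data, keyed by InChI.
crystaleye_repo = {"InChI=1S/CH4/h1H4": {"crystal_structure": "..."}}
nmrshiftdb_repo = {"InChI=1S/CH4/h1H4": {"nmr_shifts": "..."}}

repositories = {"crystaleye": crystaleye_repo, "nmrshiftdb": nmrshiftdb_repo}

# Central index (the Pubchem-like role): metadata -> repository names.
central_index = {
    "InChI=1S/CH4/h1H4": ["crystaleye", "nmrshiftdb"],
}

def federated_lookup(inchi: str) -> dict:
    """Resolve an InChI centrally, then fetch full data from each repository."""
    results = {}
    for repo_name in central_index.get(inchi, []):
        record = repositories[repo_name].get(inchi)
        if record is not None:
            results[repo_name] = record
    return results

print(federated_lookup("InChI=1S/CH4/h1H4"))
```

In a real deployment the repository lookups would be WebService calls in a workflow, exactly as the bioscience community already does; only the central metadata index needs to be aggregated in one place.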
What we should then develop is specialised data and metadata search engines. Of course Pubchem has a perfectly good substructure search, so there is no need to create other ones; and since Pubchem is open and the others aren’t, the Open Source community can innovate and upgrade around Pubchem.
Where does this leave WWMM? I see this as a federated set of repositories for specialised purposes that link to Pubchem but provide extra data. So Nick Day’s CrystalEye has complete data on all legally accessible electronic crystal structures. In conjunction with Jimmy Stewart (MOPAC), whom I visited last weekend, we are going to run MOPAC over all of those which are of suitable quality. That will be tens of thousands. Each calculation will take 1-5 hours, but our Condor system will be happy to do that. We’ll very carefully compare experiment with theory – and create protocols and metrics to show when this works and when it doesn’t. And, during the process, we’ll certainly discover some experimental errors in the structures. We’ll be making the results Open, of course.
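One simple metric for such an experiment-versus-theory protocol is the coordinate RMSD between the experimental and the computed geometry. This sketch assumes the two structures are already superimposed and atom-ordered, which a real protocol must handle first; the coordinates are made-up test values:

```python
import math

# Root-mean-square deviation between two atom coordinate lists.
# Assumes the structures share atom ordering and alignment; a real
# experiment-vs-theory protocol would superimpose them first.

def rmsd(coords_a, coords_b):
    if len(coords_a) != len(coords_b):
        raise ValueError("structures must have the same atom count")
    sq = sum((ax - bx) ** 2 + (ay - by) ** 2 + (az - bz) ** 2
             for (ax, ay, az), (bx, by, bz) in zip(coords_a, coords_b))
    return math.sqrt(sq / len(coords_a))

# Illustrative two-atom fragment: theory stretched one bond by 0.1 Å.
experimental = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
computed     = [(0.0, 0.0, 0.0), (1.1, 0.0, 0.0)]
print(f"RMSD = {rmsd(experimental, computed):.3f}")
```

Thresholds on metrics like this are what let an automated pipeline flag the cases where theory and experiment disagree badly enough to suggest an experimental error.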
If we are going to scrape the web (rather than simply re-index existing collections) we’ll need OSCAR. This will be new information – compounds extracted from free text. Of course we’ll have to be careful not to violate copyright, but if we start with a few million Pubmed abstracts that should do for a start.
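Full chemical-entity recognition needs a trained tool like OSCAR; the shape of the extraction step, though, can be shown with a deliberately naive regex that only spots simple empirical formulae. This is a toy stand-in, not how OSCAR works:

```python
import re

# Deliberately naive stand-in for chemical text mining: matches simple
# empirical formulae such as "H2O" or "C6H12O6" (two or more
# element-count units). A real system like OSCAR uses trained models
# to recognise chemical names, not just formula patterns.

FORMULA = re.compile(r"\b(?:[A-Z][a-z]?\d*){2,}\b")

def extract_formulae(text: str) -> list:
    """Return candidate formula strings found in free text."""
    return [m.group(0) for m in FORMULA.finditer(text)]

abstract = "The hydrolysis of C6H12O6 in H2O was monitored at 298 K."
print(extract_formulae(abstract))
```

Even this toy illustrates why metrics matter: a pattern this crude has obvious false positives (any capitalised letter-run) and misses every trivial name, which is exactly the precision/recall trade-off OSCAR is evaluated on.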


4 Responses to World Wide Molecular Matrix, Pubchem, quality metrics, etc.

  1. The comparison of theory and experiment for crystal structures is definitely a good test bed for doing Open Science right now with as much automation as possible. It will be interesting to see how much the chemical community will participate and give you feedback.

  2. Na2Cl2 is not junk. It’s chemistry. The existence of Na2Cl2 microclusters has been reported (http://prola.aps.org/abstract/PRB/v36/i8/p4577_1). The sodium chloride dimer is also on the NIST webbook and indexed into PubChem as record cid=6914545.
    Here’s the definition of what ChemSpider is trying to do from the What is ChemSpider page “There are tens if not hundreds of chemical structure databases and no single way to search across them. There are databases of curated literature data, chemical vendor catalogs, molecular properties, environmental data, toxicity data, analytical data and on and on. The only way to know whether a specific piece of information is available for a chemical structure is to have simultaneous access to all of these databases. Since many of these databases are for profit there is no way to easily determine the availability of information within these commercial or even in the open access databases. With ChemSpider the intention is to aggregate into a single database all chemical structures available within open access and commercial databases and to provide the necessary pointers from the ChemSpider search engine to the information of interest. This service will allow users to either access the data immediately via open access links or have the information necessary to continue their searches into commercially available systems. The question “is there specific information about my chemical” will be answered. Accessing the information may require a commercial transaction with the appropriate provider.”
    Our intention is to do exactly as suggested: “pool the metadata (InChIs, names, etc.) and develop links and searches to distributed repositories and datasets elsewhere.” It’s already started.

  3. Pingback: ChemSpider Blog » Blog Archive » Aggregated Chemistry and Quality - is ChemSpider a Good Representative?

  4. Pingback: ChemSpider Blog » Blog Archive » Open Source Data, Testing Quality and Returning Value – Interactions with NMRSHIFTDB and the Blue Obelisk Community
