David Bradley alerted me to Chemspider, an engine which scrapes the web for information on chemical compounds and calculates properties. I blogged yesterday about what it did for sodium chloride.
I am slightly sorry to do this as I have had some acquaintance with the people involved, but I cannot let garbage science go uncommented. That is what peer-review is about. And it can be painful.
From the Chemspider website:
ACD/Labs Software Integrated into ChemSpider Service
Partnership brings predicted logP property values and systematic nomenclature identifiers to over 10 million chemical structures online, to support ChemZoo’s vision of creating a chemical community.
RALEIGH, North Carolina, and TORONTO, Canada, April 10, 2007—ChemZoo and Advanced Chemistry Development, Inc., (ACD/Labs) announced a collaboration that will allow integration of a number of ACD/Labs software tools to the ChemSpider service, a new online chemistry database and property prediction service provider. ACD/Labs properties will be generated and published for over 10 million chemical structures using some components of the ACD/Labs PhysChem and Nomenclature software suites.
From what I can see the spider scrapes information and passes it to the Zoo. The Zoo is filled with monkeys. (The same monkeys who are trying to write Shakespeare by hitting typewriter keys at random.) The monkeys seem to be using ACD/Labs software to calculate properties. ACD/Labs software has been around for several years and I suppose it gets some answers right, but it doesn’t do very well on calcium carbonate. Here’s the entry:
0953 Chemical Structure CCaO3 Molecular Weight 62.02 logP -0.809 hydrogen bond donors 2
Well, they got the chemical formula right. Calcium carbonate is marble, limestone, chalk. It’s hard and doesn’t dissolve in water. You calculate its molecular weight as follows:
Ca=40 + C=12 + 3 (O=16)
and your child will tell you that it comes to 100. (I’ve left out the decimal places.) The monkeys only get to 62 before they give up. I have no idea how they get this. The monkeys also tell us how many hydrogen atoms can be used to bind to other molecules. I can’t see any hydrogen atoms, nor can you, but the monkeys found 2.
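The sum is easy to check in a few lines. This is a minimal sketch: the atomic weights are standard values rounded to three decimals, and the helper name `molecular_weight` is made up for illustration.

```python
# Standard atomic weights in g/mol, rounded; only the elements needed here.
ATOMIC_WEIGHT = {"C": 12.011, "O": 15.999, "Ca": 40.078}


def molecular_weight(composition):
    """Sum atomic weights for a dict mapping element symbol -> atom count."""
    return sum(ATOMIC_WEIGHT[element] * count for element, count in composition.items())


# CaCO3: one calcium, one carbon, three oxygens.
calcium_carbonate = {"Ca": 1, "C": 1, "O": 3}
mw = molecular_weight(calcium_carbonate)
print(f"CaCO3: {mw:.2f} g/mol")
```

With the decimals kept, the total is just over 100, nowhere near the 62.02 in the entry.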
Why am I so angry about this? Because NIST and NMRShiftDB and Joe Townsend and Nick Day work very hard at calculating molecular properties. Volker Thome in our lab has spent years calculating the properties of calcite. They are trying to show that data quality matters. The ChemZoo monkeys are destroying the value of chemical data.
There are ways of calculating molecular properties properly – it’s hard work, takes care and only applies to certain compounds. It’s hard measuring the properties – that’s what NIST does. If the spider extracted the data from NIST without their permission then it has broken copyright. And I hope NIST gets them to remove the data. It’s harder with NMRShiftDB – as a Blue Obelisk member we make our data Open for any reasonable re-use.
But giving it to the ChemMonkeys is not reasonable. The zoo should close.
Could H-bond donor be interpreted as the atom which has the lone pair/negative charge? (probably not, now I see that it also has 3 H-bond acceptors…)
Actually, there is a clear pattern between the sodium chloride and the calcium carbonate examples: they are both salts, and the monkeys picked only one component to calculate the molecular weight. Chemoinformatics is like knives: if you do not know how to handle them, then you are in serious trouble.
Oh, forgot to type that up… about those hydrogens: take the carbonate SMILES, consider the implicit hydrogens if you forget the charge, and there are your hydrogen bond donors. By looking at the MW you can see that this has happened.
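That guess can be checked with a short sketch. It assumes the calcium is simply dropped and that forgetting the 2- charge on carbonate leaves two oxygens to be filled with one implicit hydrogen each; the names are illustrative, and this is not the actual ChemSpider or ACD/Labs code.

```python
# Standard atomic weights in g/mol, rounded to three decimals.
ATOMIC_WEIGHT = {"H": 1.008, "C": 12.011, "O": 15.999, "Ca": 40.078}


def mass(composition):
    """Sum atomic weights for a dict mapping element symbol -> atom count."""
    return sum(ATOMIC_WEIGHT[element] * count for element, count in composition.items())


full_salt = {"Ca": 1, "C": 1, "O": 3}

# Assumption: only the carbonate is kept as the "main component".
carbonate_ion = {"C": 1, "O": 3}

# Assumption: with the charge forgotten, each of the two formerly charged
# oxygens picks up one implicit hydrogen, giving H2CO3.
implicit_hydrogens = 2
uncharged_carbonate = dict(carbonate_ion, H=implicit_hydrogens)

print(f"CaCO3 {mass(full_salt):.2f}")
print(f"CO3   {mass(carbonate_ion):.2f}")
print(f"H2CO3 {mass(uncharged_carbonate):.2f}")
print(f"O-H groups counted as donors: {implicit_hydrogens}")
```

Under those assumptions the last mass matches the reported 62.02, and the two added hydrogens match the two reported donors.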
I’ll pass on this link to my fellow simians and report back with comments once they have replaced their typewriter ribbon.
Thanks for the critique
db
Peter, good catch – thanks for reporting. I’ve put up a notice that properties are calculated for the main component only. Hopefully this is it. Please keep reporting all the problems you find: as you might have seen, this is the beta version, and only with public comments and feedback can we improve it and make it useful.
So is the World Wide Molecular Matrix planning on collating data? All I was able to find was chemical structure generation. If you are planning on collating the data, ACD/Labs would be happy to also provide the WWMM with logP prediction capabilities, as we have done with Chemspider and eMolecules.
As for the logP/solubility predictions, to the best of my chemical and plumbing knowledge calcium carbonate is somewhat soluble at low temperature and considerably less soluble as the temperature rises. If you do not believe this, please call up ANY plumber and they can happily give you a quick real-world chemistry lesson on why hot water heaters, boilers and coffee makers have to be descaled regularly to remove the calcium carbonate deposits. Now, the ACD/Labs software provides predictions for the parent ion/compound. For the CaCO3 example that would be for the CO3 group, which will protonate in water, so the predicted logP is for the neutral H2CO3, and the logD would represent the distribution coefficient based upon the selected pH.
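The relationship between logP and logD can be sketched with the textbook approximation for a monoprotic acid, in which only the neutral form partitions into the organic phase. This is not the ACD/Labs algorithm; the function name `log_d_monoprotic_acid` and the pKa of 6.35 (a commonly quoted apparent first pKa for carbonic acid) are assumptions for illustration, and the second ionisation is ignored.

```python
import math


def log_d_monoprotic_acid(log_p, pka, ph):
    """Approximate logD for a monoprotic acid.

    Assumes only the neutral species partitions into the organic phase:
    logD = logP - log10(1 + 10**(pH - pKa)).
    """
    return log_p - math.log10(1.0 + 10.0 ** (ph - pka))


LOG_P = -0.809      # predicted logP quoted in the ChemSpider entry
ASSUMED_PKA = 6.35  # assumption: apparent first pKa of carbonic acid

for ph in (2.0, 6.35, 7.4):
    log_d = log_d_monoprotic_acid(LOG_P, ASSUMED_PKA, ph)
    print(f"pH {ph:5.2f}: logD = {log_d:.3f}")
```

At low pH the acid is almost entirely neutral and logD approaches logP; above the pKa the ionised form dominates and logD falls.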
So here is a quick suggestion: maybe we should focus on assisting the efforts to curate the world’s chemical information, as opposed to simply pointing out flaws in beta software that is making the first pass at collating the data, which from my experience is the first step required in eventually curating the data. So I would like to personally thank PubChem, eMolecules and Chemspider for making the first attempts at collecting all the chemical data available. FYI, when I was examining the Chemspider website it appeared that anyone can provide feedback, so we could constructively suggest that Chemspider add a new field that would allow users to curate the data, in the same vein as Wikipedia. FYI, the Chemspider substructure search of calcium carbonate does provide a link to Wikipedia.
Hi…banana-biter from ChemZoo here… one more at the chimps tea party.
ChemSpider uses third party components for the generation of certain properties. Passing over 10 million compounds through has shown a number of issues in the nature of the dataset and the applicability of the third party components to the diversity of the dataset. Feedback has been provided and the issues already addressed but passing 10 million structures through a series of prediction algorithms is not undertaken lightly and therefore will be performed in the near future.
The system was released in beta form. Known bugs are posted at http://www.chemspider.com/KnownBugs.aspx. As commented on the website at this page “We know that more bugs will be identified based on the testing of our users and the fastest way to receive real stress testing is to make the system public and ask for feedback….and where necessary, deal with the fallout and potential mocking. We encourage you to report bugs to us, as you find them, at bugs@chemspider.com. ”
So, thanks for the feedback. The monkeys will get back to our keyboards and address the feedback. We’d welcome the feedback through our feedback page rather than via blogs. That said…you have EVERY right to be concerned about quality..we are.
By the way, we believe that Wikipedia is a very valuable resource. That’s why we have linked up the synonyms from ChemSpider out to Wikipedia. Repeat your search on Calcium Carbonate and click on Caltrate for example to get http://en.wikipedia.org/wiki/Caltrate.
Clearly this type of linkage also throws up errors in the linkage to Wikipedia and we’ll be optimizing shortly. In order to help build the public curation process we have put up a “Help Curate Data” link. This was always our intention…to create a chemical community around the chemical structure database…and your posting simply accelerated it. Now we hope that we’ll be joined by more monkeys on typewriters in the ChemZoo…actually, the preference is flies on the Spider web!
By the way, in trying to get further connection to Wikipedia there is work being done right now to enable the ChemSketch drawing package to export PNG files with embedded InChI (and why not other structure data…formula, formula weight..even logP) and then searchable directly from a drawing package or through ChemSpider. http://en.wikipedia.org/wiki/Wikipedia_talk:WikiProject_Chemistry/Structure_drawing_workgroup
As well you know with your work on the Worldwide Molecular Matrix, not everything that goes live first day is perfect…and it’s why we have declared beta.