In response to a lively blog interchange on the quality and validation of data, Antony Williams has produced a useful comment (into which I insert annotations):
Thanks for the feedback on the definitions. I have connected with our collaborators at ACD/Labs, specifically the PhysChem product manager, and have pointed him to your comments on the blog. I will leave it to him to choose whether or not to edit the definitions.
Thanks. Definitions (ontologies) are key to the emerging semantic web and this is becoming mandatory so I obviously encourage this.
The display of the units for PSA on the initial search results page was an oversight since it is on EVERY other view of the results display so thanks for pointing it out. It was fixed within minutes of reading your blog.
I have an obsession with scientific units of measurement – it is a solved problem in principle but rarely implemented. We are developing semantic approaches to units.
Regarding your observations about Prussian Blue and solubility. There’s a lot of misinformation out there for sure… http://ptcl.chem.ox.ac.uk/MSDS/IR/iron_III_ferrocyanide.html named as Prussian Blue and defined as soluble in water. However, I am going with the Wikipedia definition which talks about : “Soluble” Prussian Blue – Prussian Blue is insoluble, but it tends to form such small crystallites that colloids are common. These colloids act like solutions, for example they pass through fine filters. According to Dunbar and Heintz, these “soluble” forms tend toward compositions with the approximate formula KFe2Fe(CN)6
There was nothing special about my selection of Prussian Blue – and in fact your suggestions below take care of many other concerns.
Based on your multiple comments I am considering recalculating the properties having prefiltered and excluded compounds based on the following constraints:
This is indeed what I consider the correct way to manage the database – create a series of protocols and measure the value of each of them in improving the accuracy/quantity ratio.
1) Exclude substances containing elements other than As, B, Br, C, Cl, F, Ge, H, I, N, O, P, Pb, S, Se, Si, Sn, the elements supported by ACD/PhysChem predictors.
2) Only include single component substances – would resolve your issue with CaCO3 and Prussian Blue
3) Exclude substances represented as a single atom
4) Exclude structures containing isotopes
5) Exclude radicals
6) Exclude structures with a delocalized charge
These are very close to the filters based on molecular formula which I would recommend. Since I don’t have knowledge of your metadata (e.g. date, format, contributor, etc.) I can’t comment, but it may be that these are also useful filters. In general you may need to be prepared to sacrifice a considerable quantity of data in return for greater confidence in quality.
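As a rough illustration, the six constraints listed above could be sketched as a single prefilter. This is a minimal sketch under stated assumptions, not ACD/Labs' or ChemSpider's actual implementation: the `Substance` record, its boolean flags and the "." convention for separating components are all hypothetical, and the formula parser handles only flat formulas (no brackets or charges). Only the element list comes from constraint 1.

```python
import re
from dataclasses import dataclass

# Elements named in constraint 1 as supported by the ACD/PhysChem predictors.
SUPPORTED = {"As", "B", "Br", "C", "Cl", "F", "Ge", "H", "I",
             "N", "O", "P", "Pb", "S", "Se", "Si", "Sn"}

@dataclass
class Substance:
    # Hypothetical record: a flat molecular formula with components
    # separated by ".", plus flags assumed to come from the structure record.
    formula: str
    has_isotopes: bool = False
    is_radical: bool = False
    has_delocalized_charge: bool = False

ELEMENT_COUNT = re.compile(r"([A-Z][a-z]?)(\d*)")

def atom_counts(formula):
    """Count atoms per element in a flat formula such as 'C2H6O'."""
    counts = {}
    for symbol, number in ELEMENT_COUNT.findall(formula):
        counts[symbol] = counts.get(symbol, 0) + int(number or 1)
    return counts

def rejection_reasons(substance):
    """Return the list of constraints a substance violates (empty = keep)."""
    reasons = []
    counts = atom_counts(substance.formula)
    unsupported = sorted(set(counts) - SUPPORTED)
    if unsupported:                                    # constraint 1
        reasons.append("unsupported elements: " + ", ".join(unsupported))
    components = [c for c in substance.formula.split(".") if c]
    if len(components) != 1:                           # constraint 2
        reasons.append("multi-component")
    if sum(counts.values()) == 1:                      # constraint 3
        reasons.append("single atom")
    if substance.has_isotopes:                         # constraint 4
        reasons.append("isotopes")
    if substance.is_radical:                           # constraint 5
        reasons.append("radical")
    if substance.has_delocalized_charge:               # constraint 6
        reasons.append("delocalized charge")
    return reasons

if __name__ == "__main__":
    for s in [Substance("C2H6O"), Substance("CCaO3"), Substance("C2H6O.ClH")]:
        print(s.formula, rejection_reasons(s) or "keep")
```

Returning the reasons rather than a bare yes/no makes it easy to count how many entries each constraint removes, which is the information needed to judge what each filter costs in data.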
I welcome your comments….
These are the types of filters that we now routinely institute in deciding which components of a chemical dataset are worth including. We normally develop these (e.g. for CrystalEye) by computing the difference between theory and experiment and then devising filters – and certainly all of the above have been included. In crystallography we also use temperature of experiment, etc. – you may wish to remember that many physical properties are directly dependent on both temperature and the physical state of the substance. Developing protocols can take time but it is worth it.
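One way to measure the value of a candidate filter is to compare the theory/experiment discrepancy before and after applying it, alongside the fraction of data that survives. The following is only a sketch of that idea, not the CrystalEye protocol: the record keys (`predicted`, `experimental`, `temperature_k`), the choice of mean absolute error as the discrepancy measure, and the four toy records are all assumptions invented for illustration.

```python
from statistics import mean

def evaluate_filter(records, keep):
    """Return (fraction retained, MAE of all records, MAE of kept records).

    records: list of dicts with hypothetical keys 'predicted' and 'experimental'.
    keep:    predicate deciding whether a record passes the filter.
    """
    def abs_errors(rows):
        return [abs(r["predicted"] - r["experimental"]) for r in rows]

    kept = [r for r in records if keep(r)]
    retained = len(kept) / len(records)
    mae_all = mean(abs_errors(records))
    mae_kept = mean(abs_errors(kept)) if kept else None
    return retained, mae_all, mae_kept

# Toy numbers, made up purely to exercise the function.
records = [
    {"predicted": 2.0, "experimental": 2.5, "temperature_k": 298},
    {"predicted": 1.0, "experimental": 1.0, "temperature_k": 298},
    {"predicted": 3.0, "experimental": 7.0, "temperature_k": 100},
    {"predicted": 0.5, "experimental": 1.0, "temperature_k": 298},
]

# Candidate filter: keep only measurements reported at 298 K.
retained, mae_all, mae_kept = evaluate_filter(
    records, lambda r: r["temperature_k"] == 298
)

if __name__ == "__main__":
    print(f"retained {retained:.0%}, MAE {mae_all:.2f} -> {mae_kept:.2f}")
```

A filter earns its place when the error among the kept records falls by more than the loss of data can justify; running each candidate filter through the same function makes that trade-off explicit.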