Data validation and Protocol validation

In response to a lively blog interchange on the quality and validation of data, Antony Williams has produced a useful comment (into which I insert annotations):

Thanks for the feedback on the definitions. I have connected with our collaborators at ACD/Labs, specifically the PhysChem product manager, and have pointed him to your comments on the blog. I will leave it to him to choose whether to edit the definitions.

Thanks. Definitions (ontologies) are key to the emerging semantic web and this is becoming mandatory so I obviously encourage this.

The display of the units for PSA on the initial search results page was an oversight since it is on EVERY other view of the results display so thanks for pointing it out. It was fixed within minutes of reading your blog.

I have an obsession with scientific units of measurement – it is a solved problem in principle but rarely implemented. We are developing semantic approaches to units.

Regarding your observations about Prussian Blue and solubility. There’s a lot of misinformation out there for sure… http://ptcl.chem.ox.ac.uk/MSDS/IR/iron_III_ferrocyanide.html named as Prussian Blue and defined as soluble in water. However, I am going with the Wikipedia definition which talks about : “Soluble” Prussian Blue – Prussian Blue is insoluble, but it tends to form such small crystallites that colloids are common. These colloids act like solutions, for example they pass through fine filters. According to Dunbar and Heintz, these “soluble” forms tend toward compositions with the approximate formula KFe2Fe(CN)6

There was nothing special about my selection of Prussian Blue – and in fact your suggestions below take care of many other concerns

Based on your multiple comments I am considering recalculating the properties having prefiltered and excluded compounds based on the following constraints:

This is indeed what I consider the correct way to manage the database – create a series of protocols and measure the value of each of them in improving the accuracy/quantity ratio.

1) Exclude substances containing elements other than As, B, Br, C, Cl, F, Ge, H, I, N, O, P, Pb, S, Se, Si, Sn, the elements supported by ACD/PhysChem predictors.
2) Only include single-component substances – would resolve your issue with CaCO3 and Prussian Blue
3) Exclude substances represented as a single atom
4) Exclude structures containing isotopes
5) Exclude radicals
6) Exclude structures with a delocalized charge
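
As a rough illustration of how the six constraints above could be expressed as checks, here is a minimal Python sketch. It is not the ChemSpider or ACD/Labs implementation. The `Record` class, its flag fields, and the use of "." to separate components in a formula are all assumptions made for the example, and the formula parser only collects element symbols and counts (it ignores bracket multipliers).

```python
import re
from dataclasses import dataclass

# Elements listed in the post as supported by the ACD/PhysChem predictors.
SUPPORTED = {"As", "B", "Br", "C", "Cl", "F", "Ge", "H", "I", "N", "O",
             "P", "Pb", "S", "Se", "Si", "Sn"}

# An element symbol followed by an optional count, e.g. "Cl2" or "H".
ELEMENT_TOKEN = re.compile(r"([A-Z][a-z]?)(\d*)")


@dataclass
class Record:
    """Hypothetical record; field names are assumptions for this sketch."""
    formula: str                      # components separated by ".", e.g. "Ca.CO3"
    has_isotope: bool = False
    is_radical: bool = False
    has_delocalized_charge: bool = False


def element_counts(formula):
    """Return {element symbol: count}, ignoring brackets and multipliers."""
    counts = {}
    for symbol, digits in ELEMENT_TOKEN.findall(formula):
        counts[symbol] = counts.get(symbol, 0) + int(digits or 1)
    return counts


def rejection_reasons(rec):
    """List which of the six constraints a record violates (empty = keep)."""
    counts = element_counts(rec.formula)
    reasons = []
    if set(counts) - SUPPORTED:
        reasons.append("unsupported element")
    if "." in rec.formula:
        reasons.append("multi-component")
    if sum(counts.values()) == 1:
        reasons.append("single atom")
    if rec.has_isotope:
        reasons.append("isotope")
    if rec.is_radical:
        reasons.append("radical")
    if rec.has_delocalized_charge:
        reasons.append("delocalized charge")
    return reasons


print(rejection_reasons(Record("C6H6")))    # []
print(rejection_reasons(Record("Ca.CO3")))  # ['unsupported element', 'multi-component']
```

Returning the reasons rather than a bare yes/no makes it possible to count how many records each constraint removes on its own.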

These are very close to the filters based on molecular formula which I would recommend. Since I don’t have knowledge of your metadata (e.g. date, format, contributor, etc.) I can’t comment, but it may be that these are also useful filters. In general you may need to be prepared to sacrifice a considerable quantity of data in return for greater confidence in quality.

I welcome your comments….

=====
These are the types of filters that we now routinely institute in deciding which components of a chemical dataset are worth including. We normally develop these (e.g. for CrystalEye) by computing the difference between theory and experiment and then devising filters – and certainly all of the above have been included. In crystallography we also use temperature of experiment, etc. – you may wish to remember that many physical properties are directly dependent on both temperature and the physical state of the substance. Developing protocols can take time but it is worth it.
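
To make the idea of measuring a filter's value concrete, here is a minimal Python sketch of the comparison described above: apply a candidate filter, then report how much data survives and how far prediction is from experiment on what remains. The record keys (`predicted`, `experimental`, `n_components`) and the numbers are invented for illustration, and mean absolute error is just one possible measure of the theory/experiment difference.

```python
from statistics import mean


def evaluate_filter(records, keep):
    """Return the retained fraction and mean absolute error for one filter.

    `keep` is a function that returns True for records the filter retains.
    """
    kept = [r for r in records if keep(r)]
    if not records or not kept:
        return {"retained": 0.0, "mae": None}
    mae = mean(abs(r["predicted"] - r["experimental"]) for r in kept)
    return {"retained": len(kept) / len(records), "mae": mae}


# Made-up values, purely to show the shape of the comparison.
records = [
    {"predicted": 1.0, "experimental": 1.5, "n_components": 1},
    {"predicted": 2.0, "experimental": 5.0, "n_components": 2},
]

no_filter = evaluate_filter(records, lambda r: True)
single_only = evaluate_filter(records, lambda r: r["n_components"] == 1)
print(no_filter)    # {'retained': 1.0, 'mae': 1.75}
print(single_only)  # {'retained': 0.5, 'mae': 0.5}
```

Running each candidate filter through the same function gives a direct view of the trade-off: how much data is sacrificed for how much improvement in agreement.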
Best


4 Responses to Data validation and Protocol validation

  1. Peter, it appears you support my suggestions. So the question I have for the readers is this…and I will make the same request on the ChemSpider blog.
    1) DO people believe that isotopes will make a difference (within prediction error) to the calculation of the physicochemical properties predicted. I have my own judgments but put this question out there for public feedback.
    2) Should all multi-component systems be excluded? I demonstrated clearly in an earlier post that prediction of LogP for CaCO3 was appropriate so should it be excluded or not?
    You commented “These are very close to the filters based on molecular formula which I would recommend. Since I don’t have knowledge of your metadata (e.g. date, format, contributor, etc.) I can’t comment, but it may be that these are also useful filters.”. So, the ones I have suggested are close…what additional ones would you suggest?
    Also, we DO have the date, format and contributor data available. How would you use these data yourself to make a decision to predict physchem properties? Assuming all data are available as MOL/SDF files, how would date of submission and contributor info be used?
    Looking forward to your feedback. Thanks

  2. Pingback: ChemSpider Blog » Blog Archive » Physical Property Predictions - Filtering Out Potential Problematic Data on ChemSpider…or is it NOT a problem?

  3. pm286 says:

    (1)
    1) DO people believe that isotopes will make a difference (within prediction error) to the calculation of the physicochemical properties predicted. I have my own judgments but put this question out there for public feedback.
    PMR> Deuterium has a significant influence on many physical properties – e.g. boiling point of D2O – and obviously on vibrational frequencies. But in general it depends on the accuracy and precision of the property. We, for example, compute phonons of crystalline materials and these are certainly isotope dependent.
    2) Should all multi-component systems be excluded? I demonstrated clearly in an earlier post that prediction of LogP for CaCO3 was appropriate so should it be excluded or not?
    PMR> It depends what your properties are. Until you address the aspect of physical state I would strongly suggest you omit multi-component systems. For example, we are working with calcite, vaterite and other forms of CaCO3 and these have many properties that depend on the polymorph. In principle (log)P should be independent of polymorph but I would be suspicious of this for many systems.
    You commented “These are very close to the filters based on molecular formula which I would recommend. Since I don’t have knowledge of your metadata (e.g. date, format, contributor, etc.) I can’t comment, but it may be that these are also useful filters.”. So, the ones I have suggested are close…what additional ones would you suggest?
    PMR> I don’t know what your properties actually are – the only ones displayed are MW, (log)P, polar surface area and volume. Since I don’t know the algorithm for the last two I can’t comment, but I would expect both to depend on molecular flexibility.
    Also, we DO have the date, format and contributor data available. How would you use these data yourself to make a decision to predict physchem properties? Assuming all data are available as MOL/SDF files, how would date of submission and contributor info be used?
    PMR> Since I assume you have compared experiment with prediction, I would look to see whether outliers showed any predictability by source, date, etc. For example, some submitters may routinely get molecular formulae garbled (e.g. hydrogen atoms). We have found that some have garbled Celsius and Kelvin – several crystallographic experiments were reported at 298 degC, which is almost certainly an error.
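
    As a minimal Python sketch of the two checks just described (not code from CrystalEye or ChemSpider), one could compute the outlier rate per contributor and flag Celsius temperatures that look like Kelvin values. The record keys, the threshold, and the list of "typical" Kelvin temperatures are assumptions for illustration.

```python
from collections import Counter


def outlier_rate_by_contributor(records, threshold):
    """Fraction of each contributor's records whose error exceeds threshold."""
    totals, outliers = Counter(), Counter()
    for r in records:
        totals[r["contributor"]] += 1
        if abs(r["predicted"] - r["experimental"]) > threshold:
            outliers[r["contributor"]] += 1
    return {c: outliers[c] / totals[c] for c in totals}


# Room temperatures in kelvin that are commonly quoted (an assumption).
TYPICAL_KELVIN = (293.0, 298.15)


def looks_like_kelvin(temp_celsius, tolerance=1.0):
    """True if a value reported in degC sits close to a typical kelvin value."""
    return any(abs(temp_celsius - k) <= tolerance for k in TYPICAL_KELVIN)


print(looks_like_kelvin(298))  # True: 298 degC is suspiciously like 298 K
print(looks_like_kelvin(25))   # False
```

    A contributor with a much higher outlier rate than the rest is a candidate for closer inspection, not automatic exclusion.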

  4. Pingback: ChemSpider Blog » Blog Archive » Prediction Errors and Filtering the ChemSpider Database - How Accurate Does a Prediction Need to Be?
