Data-driven science and repositories: consideration of errors

The main theme of the current posts is to show how Open publication of data aids scientific research. Our particular domain is chemical crystallography, but these posts contain ideas which I hope have wider applicability, so I will skim over the more technical details. There may, however, be some posts where I need to explain some concepts.
As we have blogged earlier (CrystalEye – an example of a data repository), CrystalEye was developed by Nick Day as part of his PhD work. The primary aim is to see whether large amounts of data – larger than a human can inspect – can be reliably used for scientific work. Before describing this I shall briefly review “errors” and indicate the implications for data repositories.
I’m restricting my discussion to physical science, where I believe that in general an experiment is repeatable by other scientists and should give consistent data. However, we all know that the scientific literature contains “errors”. This is a very general term (and often too judgmental), but there are centuries of work systematising their detection and categorisation – such as in the science of metrology. My discussion will be very superficial and is not intended to be systematic or authoritative coverage; it’s more an indication to data-driven scientists and data repositarians of issues they should address.
“Errors” can include:

  • variance in the original experiments. All scientific measurements should be quoted with error estimates, which can often be obtained by repeated measurement.
  • systematic errors (bias) in the measurements. Sometimes the causes are known but often they are not. Bias is often discovered when measurements are made in different laboratories or with different methods and equipment. Miscalibration of instruments is a common cause.
  • misunderstanding or misreporting of the physical quantity or measurement. For example, in chemistry there are several concepts of “bond length” – the distance between two atoms – and they are fundamentally different. One effect is due to the uncertainty principle – atoms do not occupy a fixed position even at absolute zero.
  • omission of relevant independent variables. Thus a crystal structure varies with temperature and pressure. Often these are not explicitly recorded – there is often a default assumption that measurements are done under “normal conditions” – about 25 deg C and 1 atmosphere. But many theoretical calculations relate to absolute zero and zero pressure.
  • omission of units of measurement. This should never happen, but many computer programs still emit raw numbers and assume the user knows what the units are.
  • Transcription and typographical errors. These are still common. Many chemists still measure spectra with rulers. Many scientists write numbers in a lab book and type them in wrongly. Many computer operations fail to report invalid input or produce corrupted output. For example, we used a well-known theoretical program which takes free-format input limited to 80 characters per line. However, lines longer than this were not flagged as errors but silently ignored, which led to gross errors that were hard to detect (see the sketch after this list). Even copying files – perhaps by cut-and-paste – can corrupt information.
  • Our inability to describe effects comprehensively. In crystallography, for example, it is frequently found that atoms are “disordered” – a simple picture is that they are sometimes in place A and sometimes in place B. Whether they hop between these places or whether the disorder is a statistical average over a macroscopic crystal may not be known. A full treatment of disorder may be difficult and expensive and include weeks of work on a neutron source (which needs a nuclear reactor).

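The silent 80-character truncation mentioned above is the sort of failure that a trivial pre-flight check would catch. Below is a minimal sketch in Python (the script, the function name and the 80-character limit are illustrative assumptions, not taken from the actual program we used) of validating input before handing it to a fixed-format code:

    import sys

    MAX_LINE = 80  # illustrative limit; a real program's format may differ

    def check_fixed_format(path, max_len=MAX_LINE):
        """Warn about lines a fixed-format reader might silently truncate or drop."""
        too_long = []
        with open(path) as f:
            for lineno, line in enumerate(f, start=1):
                length = len(line.rstrip("\n"))
                if length > max_len:
                    too_long.append((lineno, length))
        for lineno, length in too_long:
            print(f"WARNING: line {lineno} is {length} characters "
                  f"(limit {max_len}) and may be ignored or truncated",
                  file=sys.stderr)
        return not too_long  # True only if every line fits

    if __name__ == "__main__":
        sys.exit(0 if check_fixed_format(sys.argv[1]) else 1)

Such a check costs almost nothing to run and turns a silent corruption into a loud, fixable error.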
We therefore need to know which of these are important. If typographical errors are very rare (e.g. less than 1% probability in a data set) we can concentrate on effects which occur more frequently (say 20% of the time). If there is a typo in every data set we may have to use statistical methods to detect them, or even abandon the effort. If we estimate a quantity by two different methods and the variance between them is low, then this gives confidence in the precision of each (though it says nothing about the accuracy).
Nick Day showed this approach in his online analysis of measured and computed 13C chemical shifts (Open NMR: Nick Day’s “final” results). This showed a range of “errors” in both the measured data and the computed data. However, it was possible to find many data for which each approach reinforced the validity of the other. It was also possible to find outliers and detect the effects responsible for them (not just “explain them away”).
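To make the cross-checking idea concrete, here is a minimal sketch of comparing paired measured and computed values, estimating the agreement between the two methods, and flagging outliers for investigation. The numbers are invented for illustration – they are not Nick’s data – and the median-based outlier rule is simply one reasonable choice, not necessarily the method he used.

    import statistics

    # Illustrative values only - paired "measured" and "computed"
    # 13C chemical shifts (ppm) for the same atoms.
    measured = [128.4, 77.2, 31.9, 205.1, 14.3]
    computed = [127.8, 76.5, 33.0, 198.7, 14.9]

    differences = [m - c for m, c in zip(measured, computed)]

    # A small spread between the two methods gives confidence in the
    # precision of each, though it says nothing about their accuracy.
    print(f"mean difference: {statistics.mean(differences):.2f} ppm")
    print(f"spread (stdev):  {statistics.stdev(differences):.2f} ppm")

    # Robust outlier flagging: compare each difference with the median,
    # scaled by the median absolute deviation, so that one bad point
    # does not inflate the threshold.
    med = statistics.median(differences)
    mad = statistics.median(abs(d - med) for d in differences)
    for i, d in enumerate(differences):
        if abs(d - med) > 3 * mad:
            print(f"point {i} differs by {d:+.1f} ppm - investigate, don't just explain it away")

On the toy data above the fourth pair is flagged; in a real study each flagged point would be traced back to a cause (a typo, a missing variable, a genuine chemical effect) rather than discarded.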
Nick’s NMR work built on the work that Joe Townsend did in comparing molecular structures in crystals with computed structures in the gas phase. These are not identical concepts, but they are similar enough that Joe was able to develop rules showing when they could be regarded as “agreeing”. Nick has now been doing this with crystal structures and their computed counterparts, using theoretical methods.
I’ll be blogging about this. It won’t be formally Open Notebook Science but it will be pre-publication in the same way as the NMR work. The next posts will review where we get our data from and why we need Open Data publication.
