Data Validation and publication – some early ideas

Dictated into Arcturus

On Friday David Shotton from Oxford and I visited Iain Hrynaszkiewicz and colleagues at BioMed Central to discuss collaboration on several grants funded by JISC. In our case they are Open-bibliography (#jiscopenbib) and #jiscxyz (on data), and in David’s they are Open Citations (#jiscopencite) and Dryad-UK. These involve several other partners (whom I shall mention later; here I highlight the International Union of Crystallography). Our meeting with Iain was about access to bibliographic material and about their requirements for data publication. I’ll be blogging a lot about this, but one thing was very clear:

Data should be validated as early as possible, for example before the manuscript is sent to a publisher for consideration.

This has a major influence on how and where data are stored, published and archived, and on whether we should have Domain Specific Repositories and where they should be. Our group on Friday was unanimous that repositories should be domain-specific. (I know this is controversial and I’ll be happy to comment on alternative models; I shall certainly blog this topic.)

I am also keen that, wherever possible, data validation should be carried out by agreed algorithmic procedures. Humans do not review data well, and there are recurring examples of science based on unjustifiable data, where pre-publication data review would have helped the human reviewers to take a more critical view. I shall blog about one such example.

Here are some things that a machine is capable of doing before a manuscript is submitted. I’ll analyse them in detail later. They will vary from discipline to discipline.

  • Labelling. Have the data components been clearly and unambiguously labelled? (For example, we have had *.gif (image) files wrongly labelled as *.cif (crystallography).)
  • Formats and content. Is the content of the files described by an Open specification? Does the content conform to those rules?
  • Units. Do all quantities have scientific units of measurement?
  • Uncertainty. Is an estimate given of the possible measurement and other variation?
  • Completeness Checklist. Are all the required components present? If, say, there is a spectrum of a molecule, is the molecular formula also available?
  • “Expected values”. In most disciplines data fall within a prescribed range. For example human age is normally between zero and 120. A negative value would cause suspicion as would an age of 999.
  • Interrelations between components. Very often components are linked in such a way that the relationship can be tested by a program.
  • Algorithmic validation of content. In many disciplines it’s possible to compute either from first principles or heuristically what the expected value is. For example the geometry and energy of a molecule can be predicted from Quantum Mechanics.
  • Presentation to humans. The robot reviewer should compile a report that a human can easily understand. Again, I shall show a paper that would have been seriously criticized had this been done.
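
To make a few of these checks concrete, here is a minimal sketch of what a pre-submission "robot reviewer" might look like. It is not an agreed procedure: the layout of the submission (the keys `files`, `quantities`, `expected`, `components` and `required`) and all the function names are my own assumptions for illustration, and it covers only labelling, units, expected values, completeness and the report for humans. The only outside fact it relies on is that GIF files begin with the bytes `GIF87a` or `GIF89a`.

```python
# Illustrative sketch only: the submission layout and all names are assumptions.

GIF_SIGNATURES = (b"GIF87a", b"GIF89a")  # the two standard GIF file signatures


def check_label(filename, content):
    """Labelling: flag a file named *.cif whose bytes start with a GIF signature."""
    if filename.lower().endswith(".cif") and content.startswith(GIF_SIGNATURES):
        return [f"{filename}: labelled as CIF but the content looks like a GIF image"]
    return []


def check_units(quantities):
    """Units: every quantity must carry a non-empty unit string."""
    return [f"{name}: no unit of measurement given"
            for name, q in quantities.items() if not q.get("unit")]


def check_expected_range(quantities, expected):
    """Expected values: flag quantities outside their prescribed (low, high) range."""
    problems = []
    for name, (low, high) in expected.items():
        if name in quantities:
            value = quantities[name]["value"]
            if not low <= value <= high:
                problems.append(
                    f"{name}: value {value} is outside the expected range {low} to {high}")
    return problems


def check_completeness(components, required):
    """Completeness: every required component must be present."""
    return [f"missing required component: {name}"
            for name in sorted(required - set(components))]


def review(submission):
    """Run all checks and compile a short report a human can read."""
    problems = []
    for filename, content in submission["files"].items():
        problems += check_label(filename, content)
    problems += check_units(submission["quantities"])
    problems += check_expected_range(submission["quantities"], submission["expected"])
    problems += check_completeness(submission["components"], submission["required"])
    if not problems:
        return "No problems found."
    return "\n".join(f"- {p}" for p in problems)


if __name__ == "__main__":
    # A made-up submission that trips each of the four checks once.
    example = {
        "files": {"structure.cif": b"GIF89a" + b"\x00" * 10},
        "quantities": {"age": {"value": 999, "unit": "years"},
                       "mass": {"value": 1.2, "unit": ""}},
        "expected": {"age": (0, 120)},
        "components": ["spectrum"],
        "required": {"spectrum", "molecular_formula"},
    }
    print(review(example))
```

Each check returns a list of plain-language problems, so a discipline can add or drop checks without changing how the report is assembled. Real checks for formats, interrelations and algorithmic validation would depend heavily on the discipline and are not attempted here.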


