Linus Torvalds of Linux fame is credited with the law:
“given enough eyeballs, all bugs are shallow”
In a communal Open Source project every developer and every tester (or user, once the code is released) can contribute bugs to a buglist. There is both the incentive to post bugs and the technology to manage them. (How many of you send off bug reports after a Blue Screen Of Death on Windows?) The bugs are found, listed, prioritised and – as developers become available – fixed. Large projects such as Apache have huge lists – many thousands of entries – while the Blue Obelisk projects have fewer, but it is still the way we try to work. The key thing is that bugs are welcomed – of course we hate hearing about a new bug at 1 a.m. – but we’d rather know now than six months down the line.
Can this be extended to peer-review? We can hardly extend Linus’ Law to chemistry (we have an even more famous Linus) but something like:
“With many readers, all data can be cleaned” – not very punchy, but it gives the idea.
Can we have communal peer-review? Is peer-review not something that has to be done by the great and the good? No – just as all bugs are not equal, so peer-review can be extended over the community. This is being explored by Nature – typical examples are:
Scientific publishers should let their online readers become reviewers.
and
Peer review would be improved by discussions across the lab.
Here I want to explore a special case of peer review – data review. In many sciences the data are of prime importance – they almost are the publication. Where this is so, some sciences have implemented impressive systems for data review – a good example is crystallography, where all papers are reviewed by machines as well as humans. Here’s a paper that had no adverse comments from the CheckCIF robot and here is one with quite a lot of potential problems:
Alert level B
PLAT222_ALERT_3_B Large Non-Solvent H Ueq(max)/Ueq(min) ... 4.38 Ratio
PLAT413_ALERT_2_B Short Inter XH3 .. XHn H16A .. H18A .. 2.06 Ang.

Alert level C
PLAT062_ALERT_4_C Rescale T(min) & T(max) by ..................... 0.95
PLAT220_ALERT_2_C Large Non-Solvent C Ueq(max)/Ueq(min) ... 3.44 Ratio
PLAT230_ALERT_2_C Hirshfeld Test Diff for O1A - C15A .. 5.01 su
PLAT318_ALERT_2_C Check Hybridisation of N1B in Main Residue . ?
PLAT720_ALERT_4_C Number of Unusual/Non-Standard Label(s) ........ 24
The robot knows about several hundred problems. The process is so well established that authors submit their manuscripts to CheckCIF before they send them off to the journal. For really serious problems (Alert level A) the authors either have to fix them or send a justification as to why the work is fit for publication.
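To make the idea concrete, here is a minimal sketch (in Python) of the kind of rule such a robot applies – a crude Ueq(max)/Ueq(min) check in the spirit of the PLAT222/PLAT220 alerts above. It is not the real CheckCIF/PLATON code; the simple dictionary input and the threshold are assumptions chosen purely for illustration.

```python
# A minimal sketch of a CheckCIF-style rule. NOT the real PLATON/CheckCIF
# implementation; input format and threshold are illustrative assumptions.

def check_ueq_ratio(ueq_by_atom, max_ratio=4.0):
    """Flag a structure whose largest Ueq is suspiciously larger than its smallest.

    ueq_by_atom: dict mapping atom label -> isotropic displacement parameter (Ang^2)
    max_ratio:   alert threshold (hypothetical value, not the official one)
    """
    values = {k: v for k, v in ueq_by_atom.items() if v > 0}
    if len(values) < 2:
        return None  # not enough data to judge
    hi_atom = max(values, key=values.get)
    lo_atom = min(values, key=values.get)
    ratio = values[hi_atom] / values[lo_atom]
    if ratio > max_ratio:
        return (f"ALERT: Ueq(max)/Ueq(min) = {ratio:.2f} "
                f"({hi_atom} vs {lo_atom}) exceeds {max_ratio}")
    return None

# Toy data, not taken from any real structure
alert = check_ueq_ratio({"C1": 0.021, "C2": 0.025, "H16A": 0.095})
if alert:
    print(alert)
```

The real robot applies several hundred such rules, each with its own alert level; the point here is simply that data review of this kind is mechanical and therefore cheap to run on every submission.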
How common is this sort of data checking in science? It happens in bioscience – authors have to prepare carefully checked protein structures and sequences. I think it happens in parts of astronomy (though I can’t remember where). Until recently there was nothing like this in chemistry, but now we have two approaches, OSCAR (described in the next post) and ThermoML. ThermoML is an impressive collaboration between NIST, IUPAC and at least four publishers, whereby all data in relevant journals are checked and archived in a public database.
Crystallography and thermochemistry are technically set up for semantic data checking, and authors in those subjects are well aligned towards validated authoring of data. But can it work retrospectively? Can the community look at what has already been published and “clean it up”? In the next post I’ll show an experiment for synthetic organic chemistry and how, with the aid of OSCAR, we can clean up published data. And, since readers are now both human and robotic:
“With many readers, all data can be cleaned”