Robots can detect error; but images MUST be Open

Here is an extremely compelling reason why data, including images and graphs in papers, MUST be regarded as Open Data and not as publisher-owned copyright. Simply, many researchers “doctor” their data. Not “most”, but “many”. The intent may be simply to make the data look “better”, or they may be deliberately falsifying their data to obtain a prestigious publication. Or somewhere in between: the research is sloppy, with unclear results, and will “benefit” from enhanced data. Here’s a brief excerpt, but read the article. I’ll comment later and make a proposal to publishers…

OA enhances error correction
Jeffrey Young, Journals Find Fakery in Many Images Submitted to Support Research, Chronicle of Higher Education, May 29, 2008.
…As computer programs make images easier than ever to manipulate, editors at a growing number of scientific publications are turning into image detectives, examining figures to test their authenticity.
And the level of tampering they find is alarming….
One new check on science images, though, is the blogosphere. As more papers are published in open-access journals, an informal group of watchdogs has emerged online.
“There’s a lot of folks who in their idle moments just take a good look at some figures randomly,” says John E. Dahlberg, director of the division of investigative oversight at the Office of Research Integrity [at the US Department of Health and Human Services, which includes the NIH]. “We get allegations almost weekly involving people picking up problems with figures in grant applications or papers.”
Such online watchdogs were among those who first identified problems with images and other data in a cloning paper published in Science by Woo Suk Hwang, a South Korean researcher. The research was eventually found to be fraudulent, and the journal retracted the paper….

PMR: A typical example is a gel, used to show how many proteins or nucleic acids you have got and how pure they are. I’ve taken this from Wikipedia:

Image:AgaroseGel.jpg
This is a good gel: the bands are parallel and there are no thumb prints. Many gels don’t “run straight”, so it’s tempting for the author to “straighten” them in Photoshop or similar.
Robots can detect this. Better than humans.
We have software in our group that can detect errors in chemistry: in graphs, molecular structures, text, etc. It would be fairly straightforward to download all the world’s published chemistry and check it for errors. Note that in chemistry errors are mainly due to human error rather than fraud. However, we find an awful lot. In closed access journals. (That’s most of them, as most chemistry is closed.) Our robots did a check on a journal issue from a prestigious chemical publisher and found an error in almost every article. Some were trivial (missing punctuation), some were spelling errors. Some were serious. In a recent article we’ve been looking at, 30% of the chemical names are seriously wrong. Our robots find that sort of thing. Some of the chemical formulae are wrong. Some of the molecular masses are wrong.
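To give a flavour of the simplest kind of check, here is a minimal sketch of how a robot might test whether a reported molecular mass agrees with the formula printed next to it. This is an illustration only, not how OSCAR or our other tools work: the function names, the small table of atomic weights and the tolerance are all assumptions made for the example, and it only handles flat formulae such as C6H12O6 (no brackets, charges or isotopes).

```python
import re

# Abridged standard atomic weights (g/mol) for a few common elements.
# Assumption for this sketch: a real checker would cover the full periodic table.
ATOMIC_WEIGHTS = {
    "H": 1.008, "C": 12.011, "N": 14.007,
    "O": 15.999, "S": 32.06, "Cl": 35.45,
}

# One element symbol followed by an optional count, e.g. "C6", "Cl", "H12".
TOKEN = re.compile(r"([A-Z][a-z]?)(\d*)")


def molecular_mass(formula):
    """Sum the atomic weights of a flat formula such as 'C6H12O6'."""
    total = 0.0
    pos = 0
    for match in TOKEN.finditer(formula):
        # Any gap between tokens means the formula has characters we cannot parse.
        if match.start() != pos:
            raise ValueError(f"cannot parse formula: {formula!r}")
        symbol, count = match.groups()
        if symbol not in ATOMIC_WEIGHTS:
            raise ValueError(f"unknown element {symbol!r} in {formula!r}")
        total += ATOMIC_WEIGHTS[symbol] * (int(count) if count else 1)
        pos = match.end()
    # Reject empty formulae and trailing unparsed characters.
    if pos == 0 or pos != len(formula):
        raise ValueError(f"cannot parse formula: {formula!r}")
    return total


def mass_is_consistent(formula, reported_mass, tolerance=0.05):
    """Return True if the reported mass is within `tolerance` g/mol of the computed one."""
    return abs(molecular_mass(formula) - reported_mass) <= tolerance
```

For example, `mass_is_consistent("C6H12O6", 180.16)` passes, while a reported mass of 182.17 for the same formula would be flagged for a human to look at. The robot does not decide which of the two values is wrong, only that they disagree.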
I think this matters. I think the editors of the journal would agree. I think the publishers would not like to have incorrect science in their journal.
We can do this at near-zero cost. Some of our methods (OSCAR) are well tested, others are still alpha. They don’t work on all papers. But we’d like to see if we can detect errors. (And it’s not just error detection, but adding a very significant degree of semantics – I showed some of our automatic thesis-eating robot last week at the RSC).
The only thing holding us back is that we may be accused of stealing content. We shan’t do this. We would download the articles, scan them for data and delete the articles. We wouldn’t sell them to non-subscribers. We shan’t post them on the web-site (though we would post the extracted data).
So this is a genuine offer to publishers. We are interested in seeing whether we can extract data and detect errors in publications. Last week all the publishers at the RSC agreed that facts weren’t copyright and they agreed that scientific images and artwork should not be regarded as being protected by publisher copyright.
So I’d like feedback from publishers. The time has come when robotic extraction and analysis of data makes sense.
And if a publisher forbids the automatic analysis of scientific data for error detection, and defends this through copyright or server-side police, is that advancing the cause of science? Wouldn’t it make sense to make all the data Open?
Now?

2 Responses to Robots can detect error; but images MUST be Open

  1. Martin Griffies says:

    The example of a gel you’ve shown also illustrates another problem: lossy data formats. The gel readers that I know of use other file formats for storing the original data (TIFFs, I think, but I wouldn’t swear to it).
    The original data in the original formats can be re-examined and interpreted, but once they are converted into JPG and the like, this ability is lost. There is a great temptation to undertake the conversion because the lossless and original files may be large thus increasing storage costs; but the trade-off is this ability to re-use and reinterpret the data.

  2. pm286 says:

    (1) Martin – thank you for this. I agree 100%. There is little excuse for not storing the whole lot – it’s probably less than 1 second of popular video. (I only used the WP picture because it’s open.) And you can compress the TIFF using non-lossy formats.
    The crystallographers are among the leaders here, addressing the question of how to store diffraction data rather than extracted peak intensities. And similarly we should do this with spectra – spectra-not-peakLists.
