Why we need data repositories: prevention of Scientific Fraud (ACS and others please respond)

[Warning – this blog contains praise and criticism of the chemistry publishing industry].

I’ve just been catching up on the chemical blogosphere by reading Chembark (http://blog.chembark.com/about/ )

This site is maintained by Paul Bracher. Paul is currently a National Science Foundation ACC Postdoctoral Fellow at the California Institute of Technology. He completed his doctoral work in organic chemistry at Harvard University, his undergraduate studies at New York University, and his secondary education at Thomas Jefferson High School for Science & Technology. Paul enjoys writing about himself in the third person.

Chemistry has one of the finest blogospheres and those who criticize grey literature should take time to read and change their views. Chembark has recently been spending a great deal of time on a very high-profile case of scientific fraud (Sezen/Sames). He has detailed it meticulously. Where he needed official information he sent a FOIA (Freedom of Information Act) request to the government, which took its time in replying. [I am fortunate in the UK where we have “What do they know?” http://www.whatdotheyknow.com/user/peter_murray_rust – is there a similar system in the US?] Here’s Chembark’s commitment to the case:

3 December 2010 – Acknowledgment of receipt of ChemBark’s FOIA request
8 December 2010 – Denial of ChemBark’s request for expedited processing
20 June 2011 – Follow-up letter to DHHS FOIA Office
22 June 2011 – Cover letter from DHHS with Bengu Sezen Investigation FOIA Materials
22 June 2011 – FOIA Materials for Bengu Sezen Investigation

This is contained in (http://blog.chembark.com/2011/07/07/the-sezen-files-%E2%80%93-part-i-new-documents/ ) – the first of several impressive posts which are worth reading – and we haven’t got to the end.

The case has also been covered by the American Chemical Society (ACS), which has reported very responsibly and has taken the view that not only is the case a disgrace but that Columbia University has not taken appropriate action. Note that the fraud has seriously harmed, perhaps destroyed, the careers of innocent chemists trying to repeat the experiments. Here’s Rudy Baum’s editorial (http://cenblog.org/the-editors-blog/2011/08/sezen-sames-and-columbia/ ) (quoted without permission but in appreciation of the strength of argument):

Columbia’s investigation focused exclusively on Sezen’s misconduct. From the ORI report obtained by C&EN, it appears that Columbia has not made any attempt to probe whether Sames was guilty of scientific misconduct himself during Sezen’s time in his lab. And in fact, a close reading of Columbia’s policy on research misconduct—which was adopted in February 2006—suggests that, in the university’s eyes, Sames’ behavior throughout was acceptable. Allegations of misconduct, the policy states, must be made to the “appropriate Responsible Academic Officer,” who is a “Chair, Dean, or Director,” not a principal investigator. The policy, which runs to 14 pages of legalistic, process-oriented gobbledygook, doesn’t appear to mention the responsibility of PIs at all.

And Columbia? The claim that the university can’t talk about the case to protect individuals’ privacy is laughable. Most of the redactions in the report of the university’s investigation are easily filled in with the appropriate names. The Sezen/Sames case is an embarrassment, the malefactor has been banished from the ivory tower, an up-and-coming young professor is moving along with his career, and Columbia is putting the unpleasantness behind it. It may be good spin doctoring, but it’s a lousy way to run a great research institution.

This could happen ANYWHERE. In YOUR laboratory. It is impossible to know how much fraud goes undetected. In a major case the Int. Union of Crystallography detected massive systematic fraud (http://journals.iucr.org/e/issues/2010/01/00/me0406/me0406bdy.html where 70 papers were systematically fabricated). Great kudos goes to the IUCr for detecting this.

Which they were able to do because they DEMAND MACHINE-UNDERSTANDABLE DIGITAL DATA AS A REQUIREMENT FOR PUBLICATION. By contrast the chemistry community requires supplemental information, but only as PDF. Here’s the main C&EN article (http://pubs.acs.org/cen/science/89/8932sci1.html ). Don’t worry if you don’t understand chemistry – the plot is easy to follow. Again I quote without permission.

Investigators concluded Sezen fabricated ¹H NMR data for at least seven compounds, including the product of the arylation reaction shown, based on the pattern of the products’ reported coupling constants. After breaking out their rulers, the committee found that the starting compounds’ ¹J_CH coupling constants vary, as would be expected in genuine ¹H NMR spectra, but the products’ do not. The poor resolution of Sezen’s other printed spectra prevented the committee from conducting a more comprehensive analysis. [PMR’s emphasis]
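With digital peak lists, the ruler comparison described in that quote becomes a few lines of code. Here is a minimal sketch in Python, assuming the coupling constants have already been extracted from the spectra as numbers in Hz. The function names, the threshold and the example values are all made up for illustration; they are not the committee's method and not Sezen's data.

```python
from statistics import pstdev


def coupling_spread(j_values_hz):
    """Population standard deviation of a set of coupling constants (Hz)."""
    return pstdev(j_values_hz)


def looks_too_uniform(j_values_hz, min_spread_hz=0.05):
    """Flag a set of couplings whose spread is below a chosen threshold.

    min_spread_hz is an arbitrary illustrative cut-off, not a validated one.
    """
    return len(j_values_hz) >= 2 and coupling_spread(j_values_hz) < min_spread_hz


# Invented numbers, purely to show the idea.
starting_material = [158.2, 161.7, 164.9, 159.4]  # couplings vary, as expected
product = [160.0, 160.0, 160.0, 160.0]            # suspiciously identical

print(looks_too_uniform(starting_material))
print(looks_too_uniform(product))
```

A flag from a check like this is only a prompt for a human to look more closely. It proves nothing on its own.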

Basically the investigating committee had to measure distances with rulers on paper spectra, rather than analysing the numbers that the spectrometer had emitted. By printing to paper, rather than preserving the original data, the laboratory system leaves a massive hole for fraud. Despite my and others’ campaigns for digital data, all chemistry laboratories and all journals continue to use e-paper, so all are open to easy fraud. Schulz continues (and don’t be put off by the chemistry): (Note – I have reproduced a diagram, thereby violating copyright. This diagram is actually a creative work of Sezen, see below):

[Schulz] WHITEOUT: This ³¹P NMR spectrum from Sezen’s doctoral thesis and research papers was presented as proof that she made the supposedly promising C–H functionalization catalyst RhCl(CO)[P(Fur)₃]₂. The investigation showed that Sezen created it out of whole cloth by merging ³¹P spectra of a simple phosphorus compound (triphenylphosphine) and then applying correction fluid to remove peaks from a triphenylphosphine oxide contaminant in that compound. [PMR’s emphasis].

PMR: For non-chemists: A chemist makes a compound and runs an NMR spectrum (these machines cost hundreds of thousands of dollars and have integral computers which analyse all the data in DIGITAL form.). The chemist takes the spectrum, analyses (annotates) it – see the black text – and argues that it confirms the identity of the compound. The supervisor or journal editor/reviewer is required to accept this argument to justify publication.

NOTE: The spectrum probably originally contained 16384 measurements along the x-axis (frequency). The image here represents the loss of about 99% of that data. The numbers are unreadable (I actually have NO idea what the range is). If this is the diagram published in the journal article it should never have been allowed. (It’s actually the worst I have seen). If it was part of the publication, then the editor of the journal must take responsibility for allowing it.

What then apparently happened is that two independent spectra were merged. I don’t know how this was done. It might have been in Photoshop™ or Sezen might have written a program (I am sure Chembark will tell me). This, of course, is fraud. Then, because the result didn’t look quite right, Sezen applied correction fluid (“Tippex”) to the PAPER spectrum. [This is pretty crude. It’s ultimately detectable.]

IF THE DATA HAD BEEN DIGITAL THROUGHOUT SOFTWARE COULD HAVE DETECTED THE FRAUD (there were satellite peaks in the wrong place, etc.).
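To make the satellite point concrete, here is a hedged sketch of the kind of check software could run on a digital peak list. It assumes a simplified model in which a pair of satellites sits symmetrically about its parent peak and is separated by the expected coupling constant. The function name, the tolerance and the numbers are hypothetical, and real spectra show small isotope shifts that a serious tool would have to allow for.

```python
def check_satellites(parent_hz, low_hz, high_hz, expected_j_hz, tol_hz=2.0):
    """Return a list of problems found with one satellite pair.

    Simplified model: the satellites should be centred on the parent peak
    and separated by the expected coupling, both within tol_hz.
    """
    problems = []
    midpoint = (low_hz + high_hz) / 2.0
    if abs(midpoint - parent_hz) > tol_hz:
        problems.append("satellites not centred on parent peak")
    if abs((high_hz - low_hz) - expected_j_hz) > tol_hz:
        problems.append("satellite splitting differs from expected coupling")
    return problems


# Invented positions in Hz, for illustration only.
print(check_satellites(1000.0, 920.0, 1080.0, 160.0))  # consistent pair
print(check_satellites(1000.0, 950.0, 1110.0, 160.0))  # pair shifted off-centre
```

None of this is possible on a low-resolution printed image, which is the whole point.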

Mat Todd has asked the ACS if they will accept digital spectra (http://intermolecular.wordpress.com/2011/08/07/raw-data-in-organic-chemistry-papersopen-science/ ) and …

I was in touch with one of the ACS journals recently, who indicated that their view was that the journal is not a data repository, and that posting of raw data (which was in their view to some extent desirable) should be posted elsewhere, e.g. to an institutional repository. This is an option. I think it’s less convenient. PLoS seem happy to host the data. [PMR emphasis]

This lack of commitment from the chemistry publishers is unacceptable. It’s effectively saying they can’t be bothered to accept digital data routinely. There is no technical problem – the data are much smaller than the enormous PDF-bitmaps that they routinely allow as supplemental information. They compress well. There are Open Source viewers. If the IUCr can do this routinely (and detect fraud) why can’t the ACS, RSC, etc.? Saying “e.g. an institutional repository” is simply ducking the responsibility. The responsibility of the journal is to take reasonable steps to ensure that the science is “correct”. And depositing digital data files with the publisher is trivial. And if they don’t have the expertise to analyse spectra, then the blogosphere has lots of Open Software.
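Anyone can try the size and compression claim on their own files. The sketch below is only a rough illustration, using a synthetic 16384-point spectrum (one invented peak plus Gaussian noise) written out as plain text and compressed with Python's standard zlib module. The peak shape, noise level and text layout are all assumptions, and real instrument formats, especially binary raw data, may behave quite differently.

```python
import math
import random
import zlib


def synthetic_spectrum(n_points=16384, seed=0):
    """Build an invented spectrum: one Gaussian peak on a noisy baseline."""
    rng = random.Random(seed)
    centre = n_points // 2
    width = 20.0
    points = []
    for i in range(n_points):
        peak = 100000.0 * math.exp(-((i - centre) / width) ** 2)
        noise = rng.gauss(0.0, 50.0)
        points.append(int(round(peak + noise)))
    return points


def text_sizes(points):
    """Return (raw_bytes, compressed_bytes) for a one-value-per-line text dump."""
    raw = "\n".join(str(p) for p in points).encode("ascii")
    return len(raw), len(zlib.compress(raw, 9))


raw_size, compressed_size = text_sizes(synthetic_spectrum())
print(raw_size, compressed_size)
```

Running the same `text_sizes` idea over real deposited files would be the way to settle the question for a given format.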

[ If you want an interim solution Mark Hahnel and Mat and I are investigating Open Source technology and the role of Figshare. We’d love to have others involved. It will happen].

So maybe we’ll end up with a system where “papers” are published in ACS and RSC and “data” is published in Figshare (This is much more attractive than publishing it in repositories and I’ll explain why later).

What happens when the world wakes up and starts to value data and creates metrics for it? The metrics will reference Figshare, not ACS…

5 Responses to Why we need data repositories: prevention of Scientific Fraud (ACS and others please respond)

Carlos says:

August 9, 2011 at 3:49 pm

I am a strong advocate of the need to deposit the experimental data (digital) for many reasons, such as those you have discussed in your blog. This would have certainly helped to clarify, once and for all, the famous debate with Hexacyclinol (there is much written about it, including my own humble contribution, http://nmr-analysis.blogspot.com/2011/03/hexacyclinol-nmr-spectra-vs-plain.html), but from my point of view, it does not mean that this data cannot be easily manipulated at convenience. For example, in the case of 1H-NMR, I could synthesize very easily a spectrum digitally adding the 13C satellites in their proper place, and change the line widths of the signals (to take into account the different relaxation times), add Gaussian noise, etc. If your aim is cheating, there is now perfect digital “Tippex” for that.
The other point I would like to mention is about the format of the digital data. I think that the proper way to deposit data is by using the original acquired data. Depositing other formats (e.g. JCAMP) is fine if this is done as something supplementary (e.g. for displaying purposes), but this should not replace the original data, otherwise there will be a loss of information. For example, it is possible that the JCAMP file does not include all the information contained in the original data and that this information can be very important in the future (I have found this problem too often).
Files in JCAMP format will compress relatively well, but a good compression rate is not so easy to achieve with original data (e.g. raw FID).
Kindest regards,
Carlos

Casey W. Stark says:

August 9, 2011 at 4:00 pm

I wish journals would take code, raw data, and analyzed data, but I don’t think they could care less beyond the paper.
If this were to happen, how do you think it should be distributed? A folder of all the data in a tarball with the PDF or …? And what about large data sets? At least to check my work, others would need my code, several to hundreds of terabytes of data, and access to a beefy computer. Maybe it’s out of reach for this type of work.

Richard Kidd says:

August 9, 2011 at 4:02 pm

Hi Peter
RSC have always been more than happy to host the original data as ESI (massive files excepted), and in addition of course anyone can deposit spectra with ChemSpider. For the journals ESI we do at least need a pdf in addition for the peer review process, but that doesn’t exclude the original data in any way.
No reflection on Figshare though, all credit to Mark
ATB
Richard

Steven Bachrach says:

August 9, 2011 at 5:27 pm

I want to point out that NISO and NFAIS have a working group on standards and best practices for handling supplementary materials. The group is still knee-deep in work and a ways from reporting out any recommendations.
http://www.niso.org/publications/isq/free/Beebe_SuppMatls_WG_ISQ_v22no3.pdf

Richard Kidd says:

August 10, 2011 at 11:48 am

Also spotted this: http://acscinf.org/meetings/242/242nm_CINF_abstracts.php#S42
