Data publication: some replies

I have had several comments on my blog posts about publishing data. Since comments are not very common, I'm replying in a post rather than in the thread. They raise important general points, and I'm grateful for the opportunity to respond:

  1. The value of raw data in preventing fraud. Carlos says: August 9, 2011 at 3:49 pm

    I am a strong advocate of the need to deposit the experimental data (digital) for many reasons, such as those you have discussed in your blog. This would certainly have helped to clarify, once and for all, the famous debate over Hexacyclinol (there is much written about it, including my own humble contribution). However, from my point of view, depositing the data does not mean that it cannot be easily manipulated at convenience. For example, in the case of 1H-NMR, I could very easily synthesize a spectrum digitally, adding the 13C satellites in their proper place, changing the line widths of the signals (to take into account the different relaxation times), adding Gaussian noise, etc. If your aim is cheating, there is now a perfect digital "Tippex" for that.

    [PMR>> I have never said it is impossible to fake data. However it requires a lot of different skills. Not everyone can do what you have suggested. Moreover even this can leave subtle signals in the various moments of the data. The data have to be consistent with the known chemistry of the system. Fake data can be "too good to be true". And if chemists were to publish their spectra digitally then there would be a great deal of prior information – as the crystallographers have.]
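To make the point about moments concrete, here is a minimal sketch of the kind of check I have in mind. It is an illustration under stated assumptions, not a forensic tool: both baselines are simulated, the names `gaussian_baseline` and `uniform_baseline` are hypothetical, and a careful faker who adds properly Gaussian noise (as Carlos suggests) would pass this particular test. The idea is only that two noise traces can agree in mean and variance and still differ clearly in their higher moments.

```python
import math
import random


def moments(values):
    """Return mean, standard deviation, skewness and excess kurtosis."""
    n = len(values)
    mean = sum(values) / n
    centred = [v - mean for v in values]
    var = sum(c * c for c in centred) / n
    std = math.sqrt(var)
    skew = sum(c ** 3 for c in centred) / (n * std ** 3)
    kurt = sum(c ** 4 for c in centred) / (n * var ** 2) - 3.0
    return mean, std, skew, kurt


rng = random.Random(2011)  # fixed seed so the sketch is repeatable

# Assumption: a signal-free baseline region carrying Gaussian noise.
gaussian_baseline = [rng.gauss(0.0, 1.0) for _ in range(20000)]

# Assumption: a "retouched" baseline filled with uniform noise,
# scaled to +/- sqrt(3) so that it has the same unit variance.
half_width = math.sqrt(3.0)
uniform_baseline = [rng.uniform(-half_width, half_width) for _ in range(20000)]

for label, data in (("gaussian", gaussian_baseline), ("uniform", uniform_baseline)):
    mean, std, skew, kurt = moments(data)
    print(f"{label:9s} mean={mean:+.3f} std={std:.3f} skew={skew:+.3f} kurtosis={kurt:+.3f}")
```

The two traces have nearly the same mean and standard deviation, but the excess kurtosis of the uniform trace sits near -1.2 while the Gaussian one sits near 0. Real checks would of course need to be much subtler than this.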

    The other point I would like to mention is about the format of the digital data. I think that the proper way to deposit data is by using the original acquired data. Depositing other formats (e.g. JCAMP) is fine if this is done as something supplementary (e.g. for displaying purposes), but this should not replace the original data, otherwise there will be a loss of information. For example, it is possible that the JCAMP file does not include all the information contained in the original data and that this information can be very important in the future (I have found this problem too often).
    Files in JCAMP format will compress relatively well, but a good compression rate is not so easy to achieve with original data (e.g. raw FID).
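One reason for the difference in compressibility can be sketched with nothing more than the standard library. This is a rough illustration, not a benchmark of real instrument files: the `points` list is a hypothetical stand-in for noisy acquired data, the text form is a plain one-number-per-line listing rather than actual JCAMP-DX, and the `ratio` helper is my own.

```python
import random
import struct
import zlib

rng = random.Random(42)  # fixed seed so the sketch is repeatable

# Assumption: 4096 noisy 32-bit integers standing in for raw acquired points.
points = [rng.randint(-2**30, 2**30) for _ in range(4096)]

# The same numbers in two encodings: packed binary and a plain text listing.
binary_form = struct.pack("<%di" % len(points), *points)
text_form = "\n".join(str(p) for p in points).encode("ascii")


def ratio(data):
    """Compressed size divided by original size (smaller means it compresses better)."""
    return len(zlib.compress(data, 9)) / len(data)


print("binary: %6d bytes, compression ratio %.2f" % (len(binary_form), ratio(binary_form)))
print("text:   %6d bytes, compression ratio %.2f" % (len(text_form), ratio(text_form)))
```

The text listing shrinks far more in relative terms because ASCII digits use only a small part of each byte, whereas the packed binary noise has almost no redundancy left to remove. Note that the text form starts out much larger, so a better ratio does not by itself mean a smaller file.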

    [PMR>> This is an ongoing issue for many data-rich fields. In crystallography the first deposition was the coordinates, then at a later date the anisotropic temperature factors, then the structure factors and now the actual diffraction images. The technology advances continuously. So while I agree the FID is ideal, the real-space spectrum is still extremely valuable and will serve 99% of what is needed. Note also that the FID acts as a complete check that the author has NOT fudged their data – it is probably impossible for anyone to change the raw data without leaving traces.]
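For readers outside NMR, the relationship between the FID and the spectrum can be sketched as follows. This is a toy under stated assumptions: a single noiseless resonance, a naive discrete Fourier transform written out by hand, and invented acquisition values (`dwell_s`, `t2_s`, 125 Hz). Real processing also involves apodization, zero filling and phasing, and those choices are part of what is lost when only the processed spectrum is deposited.

```python
import cmath
import math


def synthetic_fid(freq_hz, t2_s, dwell_s, n_points):
    """Complex decaying sinusoid: one resonance at freq_hz with decay constant t2_s."""
    return [
        cmath.exp(2j * math.pi * freq_hz * k * dwell_s) * math.exp(-k * dwell_s / t2_s)
        for k in range(n_points)
    ]


def dft(samples):
    """Naive O(N^2) discrete Fourier transform; adequate for a small illustration."""
    n = len(samples)
    return [
        sum(s * cmath.exp(-2j * math.pi * j * k / n) for k, s in enumerate(samples))
        for j in range(n)
    ]


# Assumed acquisition values: 256 points at 1 ms dwell time (1 kHz spectral width).
n_points, dwell_s = 256, 0.001
fid = synthetic_fid(freq_hz=125.0, t2_s=0.05, dwell_s=dwell_s, n_points=n_points)

# Time domain -> frequency domain.
spectrum = dft(fid)

# Locate the strongest point and convert its bin index to a frequency in Hz.
peak_bin = max(range(n_points), key=lambda j: abs(spectrum[j]))
peak_hz = peak_bin / (n_points * dwell_s)
print("peak at bin", peak_bin, "which corresponds to", round(peak_hz, 3), "Hz")
```

The spectrum is derived from the FID by a well-defined calculation, so anyone holding the FID can regenerate the spectrum and compare it with what was published.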

  2. Journals supporting supplemental data files Richard Kidd says: August 9, 2011 at 4:02 pm 

    Hi Peter

    RSC have always been more than happy to host the original data as ESI (massive files excepted), and in addition of course anyone can deposit spectra with ChemSpider. For the journal ESI we do at least need a PDF in addition for the peer review process, but that doesn't exclude the original data in any way.

    No reflection on Figshare though, all credit to Mark

    [PMR: Thanks. This is good to know – that the RSC will act to publish data files in the traditional way]

    Also spotted this:

    [PMR: Yes, NIST have been working for several years with publishers to deposit thermochemical data (ThermoML) and have managed to build a model where all data is fully OKD-Open. Kudos to them and to the various journals, Journal of Chemical and Engineering Data, Fluid Phase Equilibria, The Journal of Chemical Thermodynamics, International Journal of Thermophysics, and Thermochimica Acta.

    When an author has submitted a new manuscript to these journals, it will be reviewed by NIST in two stages. The first stage provides to Editors a NIST Literature Report and the second stage provides a NIST Data Report. These two reports are generated on demand by new tools that NIST recently incorporated into ThermoData Engine (TDE) software. The literature report assists Editors and reviewers with their assessment of the manuscript's scientific contribution, the degree of overlap with published data, and the need for comparison with those published data. The second stage of NIST review occurs just after peer review is completed and prior to an Editors' final decision. The data report provides a complete assessment of data quality, their underlying uncertainties, their sample descriptions, and their descriptions of experimental methods.

    It's an excellent example, like crystallography, of real data curation and publication. The data are thoroughly checked by numerous mechanisms for internal and external consistency. It's a good example of a community which values data at least as much as, if not more than, full-text.]

  3. Standards. Steven Bachrach says: August 9, 2011 at 5:27 pm

I want to point out that NISO and NFAIS have a working group on standards and best practices for handling supplementary materials. The group is still knee-deep in work and a ways from reporting out any recommendations.

[PMR: Power to their effort. I hope they swap ideas with Iain H's group. Here are some points:

some general issues for potential Recommended Practices. Among them are the following:

»» Clear, consistent indicators of content
»» Metadata needs
»» Universal agreement on citation practices
»» Consideration of use of the DOI
»» Potential cost recovery
»» Common vocabulary
»» Peer review
»» Preservation and interaction with repositories
»» Archiving
»» Clearly defined specific responsibilities for the parties]






3 Responses to Data publication: some replies

  1. Thank you for the important blog article and comments. Publishing original data can be beneficial in many ways, all of which are probably not possible to predict at the time of publication. Having a record of original data gives insight into contemporary methodologies and knowledge that the full text paper may not reveal.

    To add an observation from someone in biomedical research: scientific integrity can also be improved at an earlier stage in the scientific process, as data is being collected. I think research leaders in all disciplines should recognize the fundamental importance of weekly (at least) meetings of the research team with the Principal Investigator. During these meetings all staff, new scientists in training, & those with more experience, can learn from the regular review and discussion of the benchmarks of data collection & management.

    The absence of these weekly meetings was, I believe, the single biggest factor allowing for the data fabrication and falsification that I observed 20 years ago as a PhD student. I pushed to get these meetings organized, and when they did occur, it made it easier to get the offender to stop and easier to “salvage” original data. I bring this up because it is one of the important "environmental" components that have been overlooked in the review and investigations of scientific misconduct as the RCR field has developed over the last 25 years.

    Thank you to those in the informatics and publishing fields for your work in making original data available along with the authors' publications.

    That’s this reader's two cents/two pence.

    • pm286 says:

      Good points, and it is certainly possible, in principle, to build data management into much PhD supervision. We do this in software with code review meetings.

  2. Pingback: The Science of Managing the Research Protocol: Principal Investigator Meetings with the Research Team | Research Integrity
