I have had several comments on my blog posts about publishing data. Since comments are not very common, I’m replying in a post rather than in the thread. They raise important general points, and I’m grateful for the opportunity:
The value of raw data in preventing fraud. Carlos says: August 9, 2011 at 3:49 pm
I am a strong advocate of the need to deposit the experimental data (digital) for many reasons, such as those you have discussed in your blog. This would have certainly helped to clarify, once and for all, the famous debate with Hexacyclinol (there is much written about it, including my own humble contribution, http://nmr-analysis.blogspot.com/2011/03/hexacyclinol-nmr-spectra-vs-plain.html), but from my point of view, it does not mean that this data cannot be easily manipulated at convenience. For example, in the case of 1H-NMR, I could synthesize very easily a spectrum digitally adding the 13C satellites in their proper place, and change the line widths of the signals (to take into account the different relaxation times), add Gaussian noise, etc. If your aim is cheating, there is now perfect digital “Tippex” for that.
[PMR>> I have never said it is impossible to fake data. However it requires a lot of different skills. Not everyone can do what you have suggested. Moreover even this can leave subtle signals in the various moments of the data. The data have to be consistent with the known chemistry of the system. Fake data can be “too good to be true”. And if chemists were to publish their spectra digitally then there would be a great deal of prior information – as the crystallographers have.]
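To make the remark about "moments of the data" a little more concrete, here is a minimal sketch in Python. It is only an illustration under stated assumptions, not a method anyone in this discussion has proposed: the two "baselines" are made-up lists of numbers standing in for a signal-free region of a spectrum, and the function name `standardized_moments` is hypothetical. It computes the skewness and excess kurtosis of each baseline; genuinely Gaussian noise should give values near zero for both, while noise of the wrong shape (here, uniform) should not.

```python
import random
import statistics


def standardized_moments(samples):
    """Return (skewness, excess kurtosis) of a list of numbers."""
    mean = statistics.fmean(samples)
    sd = statistics.pstdev(samples)
    n = len(samples)
    skew = sum(((x - mean) / sd) ** 3 for x in samples) / n
    kurt = sum(((x - mean) / sd) ** 4 for x in samples) / n - 3.0
    return skew, kurt


rng = random.Random(2011)  # fixed seed so the sketch is repeatable
n_points = 5000

# Assumption: synthetic stand-ins for the signal-free baseline of two spectra.
gaussian_baseline = [rng.gauss(0.0, 1.0) for _ in range(n_points)]
uniform_baseline = [rng.uniform(-1.0, 1.0) for _ in range(n_points)]

g_skew, g_kurt = standardized_moments(gaussian_baseline)
u_skew, u_kurt = standardized_moments(uniform_baseline)

print(f"Gaussian noise: skew={g_skew:+.2f}, excess kurtosis={g_kurt:+.2f}")
print(f"Uniform noise:  skew={u_skew:+.2f}, excess kurtosis={u_kurt:+.2f}")
```

A check this simple would not catch a careful faker who adds properly Gaussian noise, as Carlos describes. It only shows the kind of statistical fingerprint a reader could examine once spectra are published digitally.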
The other point I would like to mention is about the format of the digital data. I think that the proper way to deposit data is by using the original acquired data. Depositing other formats (e.g. JCAMP) is fine if this is done as something supplementary (e.g. for displaying purposes), but this should not replace the original data, otherwise there will be a loss of information. For example, it is possible that the JCAMP file does not include all the information contained in the original data and that this information can be very important in the future (I have found this problem too often :-( ).
Files in JCAMP format will compress relatively well, but a good compression rate is not so easy to achieve with original data (e.g. raw FID).
[PMR>> This is an ongoing issue for many data-rich fields. In crystallography the first deposition was the coordinates, then at a later date the anisotropic temperature factors, then the structure factors and now the actual diffraction images. The technology advances continuously. So while I agree the FID is ideal, the real-space spectrum is still extremely valuable and will serve 99% of what is needed. Note also that the FID acts as a complete check that the author has NOT fudged their data – it is probably impossible for anyone to change the raw data without leaving traces.]
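Carlos's point about compression can be tried out with a toy experiment. The sketch below is an assumption-laden illustration, not a measurement on real instrument files: the "FID" is a synthetic decaying cosine plus Gaussian noise, the binary form is plain 32-bit integers rather than any vendor format, and the text form is one decimal number per line rather than real JCAMP-DX (which has its own header records and optional compressed encodings). It compares how far `zlib` shrinks each representation of the same numbers.

```python
import math
import random
import struct
import zlib

rng = random.Random(42)  # fixed seed so the sketch is repeatable
n_points = 16384

# Assumption: a synthetic FID, i.e. a decaying cosine plus Gaussian noise,
# rounded to integers as a digitiser would store it.
fid = [
    int(1e6 * math.exp(-i / 4000.0) * math.cos(0.3 * i) + rng.gauss(0.0, 1e5))
    for i in range(n_points)
]

# Same numbers, two encodings: packed 32-bit integers versus ASCII text.
binary_bytes = struct.pack(f"<{n_points}i", *fid)
text_bytes = "\n".join(str(v) for v in fid).encode("ascii")


def compression_ratio(data):
    """Compressed size divided by original size (smaller means better)."""
    return len(zlib.compress(data, 9)) / len(data)


binary_ratio = compression_ratio(binary_bytes)
text_ratio = compression_ratio(text_bytes)

print(f"binary: {len(binary_bytes)} bytes, ratio {binary_ratio:.2f}")
print(f"text:   {len(text_bytes)} bytes, ratio {text_ratio:.2f}")
```

Note that a better ratio does not by itself mean a smaller compressed file, since the text form starts out larger than the binary one. The sizes printed alongside the ratios let you compare both.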
Journals supporting supplemental data files Richard Kidd says: August 9, 2011 at 4:02 pm
RSC have always been more than happy to host the original data as ESI (massive files excepted), and in addition of course anyone can deposit spectra with ChemSpider. For the journals ESI we do at least need a pdf in addition for the peer review process, but that doesn’t exclude the original data in any way.
No reflection on Figshare though, all credit to Mark
[PMR: Thanks. This is good to know – that the RSC will act to publish data files in the traditional way]
Also spotted this: http://acscinf.org/meetings/242/242nm_CINF_abstracts.php#S42
[PMR: Yes, NIST have been working for several years with publishers to deposit thermochemical data (ThermoML) and have managed to build a model where all data is fully OKD-Open. Kudos to them and to the various journals, Journal of Chemical and Engineering Data, Fluid Phase Equilibria, The Journal of Chemical Thermodynamics, International Journal of Thermophysics, and Thermochimica Acta.
When an author has submitted a new manuscript to these journals, it will be reviewed by NIST in two stages. The first stage provides to Editors a NIST Literature Report and the second stage provides a NIST Data Report. These two reports are generated on demand by new tools that NIST recently incorporated into ThermoData Engine (TDE) software. The literature report assists Editors and reviewers with their assessment of the manuscript’s scientific contribution, the degree of overlap with published data, and the need for comparison with those published data. The second stage of NIST review occurs just after peer review is completed and prior to an Editors’ final decision. The data report provides a complete assessment of data quality, their underlying uncertainties, their sample descriptions, and their descriptions of experimental methods.
It’s an excellent example, like crystallography, of real data curation and publication. The data are thoroughly checked by numerous mechanisms for internal and external consistency. It’s a good example of a community which values data at least as much as, if not more than, full-text.]
I want to point out that NISO and NFAIS have a working group on standards and best practices for handling supplementary materials. The group is still knee-deep in work and a ways from reporting out any recommendations. http://www.niso.org/publications/isq/free/Beebe_SuppMatls_WG_ISQ_v22no3.pdf.
[PMR: Power to their effort. I hope they swap ideas with Iain H’s group. Here are some of their points:
… some general issues for potential Recommended Practices. Among them are
»» Clear, consistent indicators of content
»» Universal agreement on citation practices
»» Consideration of use of the DOI
»» Potential cost recovery
»» Peer review
»» Preservation and interaction with repositories
»» Clearly defined specific responsibilities for the parties