#jiscxyz #okfn #quixotechem
I’m off to #JISCMRD (Managing Research Data) to hear about the new round of projects including our own JISCXYZ. Ours concentrates on the publication of data and we are working with publishers to save and validate data at early stages in the publication process.
Meanwhile here’s an indication of how to destroy data (supplemental data):
That’s the commonest method. And here’s another (http://www.rsc.org/suppdata/OB/b2/b209981k/geometry.pdf ). This file could have released useful data to the world. In fact it destroyed it by putting it into PDF. The file should have looked like:
1\1\GINC-PIRX\FOpt\UB3LYP\Aug-CC-pVDZ\C4H8Cl1(2)\BERND\19-Feb-2002\\
#P B3LYP/AUG-CC-PVDZ OPT=TIGHT GEOM=CHECK GUESS=READINT=ULTRAFINE\\Be
D001 with INT=ULTRAFINE\,2\C,0.1063168353,0.3005635652,-0.5502851935
\C,0.1053918322,-1.157928312,-0.5404967856\C,1.3891660007,0.9682412707
,-1.0097749893\H,-0.7786669558,0.7139466375,-1.0458655312\H,1.55655799
58,0.73320935,-2.0718423113\H,1.327841017,2.0566771132,-0.8980161267\H
,2.2479630589,0.603209853,-0.4322274841\Cl,-0.2178927161,0.8953362005,
1.2758334388\C,-1.1453929968,-1.9640369835,-0.4846598299\H,1.059275967
3,-1.6602096716,-0.361576248\H,-1.0103688266,-2.9449592768,-0.96235794
85\H,-1.4450387789,-2.1570821331,0.5624777398\H,-1.9862773313,-1.44654
45235,-0.9684597597\\Version=x86-Linux-G98RevA.7\HF=-617.4354366\S2=0.
755661\S2-1=0.\S2A=0.750023\RMSD=7.998e-09\RMSF=8.394e-07\Dipole=0.129
3808,-0.5162435,-0.9249031\PG=C01 [X(C4H8Cl1)]\\@
Notice the precise formatting. This is REQUIRED to read the file in. Instead the author or the publisher (neither of whom apparently cares) tipped it into PDF, which introduced spurious line ends. It’s UNREADABLE by a machine. Follow the link, read the file and see what I mean.
It’s beautiful and garbage. A sickly hamburger.
That’s because almost all publishers don’t care about data. Which means that many of their publications are second-rate. Many are suspect scientifically because the data aren’t published.
A parser which understood the required format could certainly remove spurious line-endings, unless they introduce ambiguity (which these ones seem not to do, but I don’t know the format).
But yes, PDF is absolutely the wrong answer. All the publishers have is a PDF hammer, so everything looks to them like a textual nail.
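A minimal sketch of the kind of parser suggested above. It assumes, from the example in the post rather than from any format specification, that the record is one backslash-delimited string wrapped across lines and that atom entries look like `symbol,x,y,z`. The names `rejoin` and `parse_atoms` are hypothetical.

```python
def rejoin(wrapped_lines):
    """Concatenate wrapped lines, dropping only the line terminators."""
    return "".join(line.rstrip("\r\n") for line in wrapped_lines)


def parse_atoms(record):
    """Return (symbol, x, y, z) tuples for fields shaped like 'C,0.1,0.2,0.3'."""
    atoms = []
    for field in record.split("\\"):
        parts = field.split(",")
        if len(parts) == 4 and parts[0].isalpha():
            try:
                coords = tuple(float(p) for p in parts[1:])
            except ValueError:
                continue  # four comma-separated items, but not coordinates
            atoms.append((parts[0], *coords))
    return atoms


# Two consecutive lines copied from the record in the post.
# One atom entry is split across the line break.
lines = [
    "\\C,0.1053918322,-1.157928312,-0.5404967856\\C,1.3891660007,0.9682412707\n",
    ",-1.0097749893\\H,-0.7786669558,0.7139466375,-1.0458655312\\H,1.55655799\n",
]
atoms = parse_atoms(rejoin(lines))
print(atoms)
```

The trailing `H,1.55655799` fragment is skipped because it is incomplete in this excerpt. The sketch only works if nothing but the line terminators was changed on the way into the PDF.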
The problem is that there is significant TRAILING whitespace! It’s Fortran after all. If the lines are split we don’t know where the whitespace is.
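A small illustration of why trailing whitespace matters, under stated assumptions: the record is wrapped at a fixed width, and the conversion drops blanks at the end of each line. The string, the width of 10 and the function names are all hypothetical, chosen only to make the effect visible.

```python
def wrap(text, width):
    """Split text into fixed-width lines, as a fixed-column writer would."""
    return [text[i:i + width] for i in range(0, len(text), width)]


def drop_trailing_blanks(lines):
    """Simulate a conversion step that loses blanks at the end of each line."""
    return [line.rstrip(" ") for line in lines]


# Hypothetical record with TWO blanks between the keywords.
original = "OPT=TIGHT  GEOM=CHECK"

wrapped = wrap(original, 10)              # first line ends in a blank
damaged = drop_trailing_blanks(wrapped)   # that blank is now gone
rejoined = "".join(damaged)

print(repr(original))
print(repr(rejoined))
print(rejoined == original)
```

The rejoined string has one blank fewer than the original, and nothing in the damaged lines says where it was.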
Talking of space, lack of it can cause even greater problems. Thus the purveyor of a well known quantum mechanics program recently released a new version. We needed it to explore a spectroscopic method known as ROA (Raman optical activity, a very powerful method for assigning absolute configurations of chiral molecules). The output is complex, and demands graphical interpretation. So it was loaded up into (the only program I know which recognises ROA data) and displayed … nothing. Turns out the ROA scattering intensities for the (otherwise relatively unremarkable molecule) were rather larger than normal. The relevant field is identified in the output with the string ROA-- and the numerical value is identified as -9999.9. Well, for our molecule, this ended up as ROA---10000.0 (the numbers are fictitious, to illustrate the problem). You can see how one missing space totally messed up the interpretations. I am also reminded how, when we first did a calculation on a molecule containing TWO iodine atoms, the energy display similarly vanished. Yes, you guessed, the total energy reached -10000.00 (remember, these programs started life in an era where even the thought of an all electron calculation on a single iodine was beyond the pale).
Of course, all my suggestions, urgings, etc. to persuade the purveyors of the two programs involved in the story above to adopt a structured format in which white space is less potentially destructive have thus far failed.
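The missing-space failure can be reproduced in a few lines. This is only a sketch: the 8-column field width, the `ROA--` label and the function names `write_field` and `read_field` are assumptions for illustration, and the values are the fictitious ones from the story. Python formatting stands in for the fixed-width output of the real program.

```python
def write_field(label, value):
    """Fixed-width output: the label, then the value right-justified in 8 columns."""
    return f"{label}{value:8.1f}"


def read_field(line):
    """Naive reader that expects whitespace between the label and the value."""
    parts = line.split()
    if len(parts) != 2:
        return None  # no separate value found, so nothing gets displayed
    return parts[0], float(parts[1])


ok = write_field("ROA--", -9999.9)       # value needs 7 columns, one blank remains
broken = write_field("ROA--", -10000.0)  # value needs all 8 columns, no blank left

print(repr(ok), read_field(ok))
print(repr(broken), read_field(broken))
```

A structured format with explicit delimiters would not depend on that single blank surviving.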
Here is an interesting discussion on data and its interpretation. It’s all about how the fine structure constant varies according to where in the universe it is measured. The controversy is apparently because the raw data used for analysis has not always been made openly available. And this from the physics community, which is actually rather good about this sort of thing.
Lovely – one of ours. This is from 2002 – I don’t know for sure whether we received a txt or doc file from the author, but it could have been a mistake by us. Our policy isn’t to convert txt to pdf but mistakes can occasionally be made. We should remember what the issue was in 2002 though…we were more worried then about how often the doc format changed and PDFing the suppdata was the best bet to ensure the files would still be readable in future. I fully support retaining and publishing more raw data alongside and within publications, but the block to this isn’t with the publishers.
The error was made by the author, period.
Blaming “PDF” for this is like blaming your car for a crash. Yes… sometimes it’s the car’s fault, but c’mon… not that often.