One of the major questions that arose at the ZCAM meeting on Computational Chemistry and databases (http://wwmm.ch.cam.ac.uk/blogs/murrayrust/?p=2619) was the publication of data.
In some subjects such as crystallography (and increasingly synthetic chemistry), the publication of a manuscript requires the publication of data as supplemental/supporting information/data (the terms vary). This is a time consuming process for authors but many communities feel it is essential. In other disciplines, such as computational chemistry, it hasn't ever been mandatory. In some cases (e.g. J. Neuroscience) it was mandatory and is now being abolished without a replacement mechanism. In other cases such as Proteomics it wasn't mandatory and is now being made so. So there are no universals.
At the ZCAM meeting there was general agreement that publishing data was a "good thing" but that there were some barriers. Note that compchem, along with crystallography, is among the least labour-intensive areas as it's a matter of making the final machine-generated files available. By contrast publishing synthetic chemistry can require weeks of work to craft a PDF document with text, molecular formulae and spectra. Some delegates said that suppInfo could take twice as long as the paper. (There was, sadly, rejoicing in some of the neuroscience community that they no longer needed to publish their data.)
So this post explores the positive and negative aspects of publishing data.
Here are some negatives (they were raised and should be addressed):
- If I publish my data my web site might get hacked (i.e. it is too much trouble to set up a secure server). I have some sympathy – scientists should not have to worry about computer infrastructure if possible. We do, but we are rather unusual.
- It may be illegal or it may break contractual obligations. Some compchem programs may not be sold to the various enemies of US Democracy (true) and maybe it's illegal to post their outputs (I don't buy this, but I live in Europe). Also some vendors put severe restrictions on what can be done with their programs and outputs (true) but I doubt that publishing output breaks the contract (but I haven't signed such a contract).
- If I publish my data certain paper-hungry scientists in certain countries will copy my results and publish them as theirs (doesn't really apply after publication, see below)
- Too much effort. (I have sympathy)
- Publishers not supportive (probably true)
Now the positives. They fall into the selfish and the altruistic. The altruistic is the prisoner's dilemma (i.e. there is general benefit but *I* benefit only from other people being altruistic). The selfish should be compelling in any circumstances.
- The quality of the science improves if results are published and critiqued. Converge on better commonality of practice.
- New discoveries are made ("The Fourth Paradigm") from mining this data, mashing it up, linking it, etc.
- Avoid duplication (the same work being recycled)
- Avoid fraud (unfortunately always probable)
- Provide teaching and learning objects (very valuable in this field)
- Contribute to a better information infrastructure for the discipline
- Advertise one's work. Heather Piwowar has shown that publishing data increases citations. This alone should be a compelling reason.
- Use a better set of tools (e.g. for communal authoring)
- Speed up the publication process (e.g. less work required to publish data with complying publishers).
- Be mandated to comply (by funder, publisher, etc.)
- Store one's data in a safe (public) place
- Be able to search one's own data and share it with the group
- Find collaborators
- Create more portable data (saves work everywhere)
That's the "why". I hope it's reasonably compelling. Now the "when", "where" and "how".
The "when" is difficult because the publication process is drawn out (months or even years). The production and publication of the data are decoupled from the review of the manuscript. (This is what our JISCXYZ project is addressing). The "where" is also problematic. I would have hoped to find some institutional repositories that were prepared to take a role in supporting data management, publication, etc. but I can't find much that is useful. At best some repositories will store some of the data created by some of their staff in some circumstances. BTW it makes it a lot easier if the data are Open (Libre, CC0, PDDL, etc.). Then several technical problems vanish.
So the scientist has very limited resources:
- Rely on the publisher (works for some crystallography)
- Rely on (inter)national centres (works for the rest of crystallography).
- Put it on their own web site. A real hassle. Please let's try to find another way.
- Find a friendly discipline repository (Tranche, Dryad). Excellent if it exists. Of course there isn't a sustainable business model but let's plough ahead anyway.
- Twist some other arms (please let me know).
Anyway there is no obvious place for compchem data. I'd LOVE a constructive suggestion. The data need not be huge – we could do a lot with a few TB per year – we wouldn't get all the data but we'd get most that mattered to make a start.
So, to seed the process, we'll see what we can do in the Quixote project. If nothing else we (i.e. our group) may have to do it. But I would love a white knight to appear.
That's the "where". Now the "when" and "how". I'd appreciate feedback.
If we are to have really useful data it should point back to the publication. Since the data and the manuscript are decoupled that only works when the publisher takes on the responsibility. Some will, others won't.
An involved publisher will take care of co-publishing the paper and the data files. Many publishers already do this for crystallography. The author will have to supply the files, but our Lensfield system (used in the Green Chain reaction) will help.
Let's assume we have a non-involved publisher...
Let's also assume we have a single file in our plan9 project: pm286/plan9/data.dat (although we can manage thousands) and that there is a space for a title. When we know we are going to publish the file we'll mint a DataCite DOI. (I believe this only involves a small fixed yearly cost regardless of the number of minted dataDOIs – please correct me if not.) Let's say we have a root of doi:101.202, so we mint: doi:101.202/pm286/plan9/data.dat . We add that to the title (remember that our files are not yet semantic). This file is then semantified into /plan9/data.cml with the field (say):
<metadata DC:identifier="doi:101.202/pm286/plan9/data.cml"/>
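A minimal sketch of how the two identifiers and the metadata field might be generated, assuming the example root 101.202 from this post (not a real registered prefix). The helper names `make_data_doi` and `metadata_element` are hypothetical, and the code only builds the strings; actually registering a DOI with DataCite is a separate step that is not shown.

```python
from xml.sax.saxutils import quoteattr

DOI_ROOT = "101.202"  # example root used in this post, not a real prefix


def make_data_doi(root: str, path: str) -> str:
    """Build the identifier string for a file path under a DOI root."""
    return f"doi:{root}/{path.strip('/')}"


def metadata_element(doi: str) -> str:
    """Render the metadata field carrying the identifier, with the value escaped."""
    return f"<metadata DC:identifier={quoteattr(doi)}/>"


# One identifier for the raw output, one for the semantified CML file.
raw_doi = make_data_doi(DOI_ROOT, "pm286/plan9/data.dat")
cml_doi = make_data_doi(DOI_ROOT, "pm286/plan9/data.cml")

print(raw_doi)
print(metadata_element(cml_doi))
```

Deriving both identifiers from the file path means thousands of files can be handled with the same rule, with no per-file bookkeeping.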
The author adds the two identifiers to the manuscript (again the system could do this automatically, e.g. for Word or LaTeX documents).
After acceptance of the manuscript the two files (data.dat and data.cml) are published into the public repository. Again our Lensfield system and the Clarion/Emma (JISC-CLARION) tools can manage the embargo timing and make this automatic. The author can choose when they do this so they don't pre-release the data.
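The embargo rule described here is simple enough to state as a check. This is only an illustrative sketch: `ready_to_release` and its parameters are hypothetical names, and it does not represent how Lensfield or the Clarion/Emma tools actually implement embargo handling.

```python
from datetime import date


def ready_to_release(accepted: bool, release_on: date, today: date) -> bool:
    """Release only after acceptance, and not before the author's chosen date."""
    return accepted and today >= release_on


# Hypothetical example: the author has chosen 1 November as the release date.
chosen = date(2010, 11, 1)
print(ready_to_release(False, chosen, date(2010, 12, 1)))  # not accepted yet
print(ready_to_release(True, chosen, date(2010, 10, 15)))  # accepted, still embargoed
print(ready_to_release(True, chosen, date(2010, 11, 1)))   # accepted, date reached
```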
So the reader of the manuscript has a DataCite DOI pointing to the repository. What about the reverse?
This can be automated by the repository. Every night (say) it trawls recently published papers, looking for DataCite DOIs. Whenever one is found, the repository record is updated to include the paper's DOI. In that way the repository data will point to the published paper.
This doesn't need any collaboration from the publisher except to allow their paper to be read by robots and indexed. They already allow Google to do this. So why not a data repository?
And what publisher would forbid indexing that gave extra pointers to the published work?
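The nightly trawl could look roughly like the sketch below. It assumes the repository already has the text of recently published papers keyed by each paper's DOI (how that text is fetched is not shown), and it reuses the example root 101.202. The function names and the paper DOIs in the usage example are made up for illustration.

```python
import re
from collections import defaultdict

DATA_DOI_ROOT = "101.202"  # example root used in this post, not a real prefix

# Match "doi:<root>/<suffix>", stopping at whitespace or common delimiters.
DATA_DOI_PATTERN = re.compile(r"doi:" + re.escape(DATA_DOI_ROOT) + r"/[^\s<>,;)]+")


def find_data_dois(paper_text: str) -> set:
    """Return the data DOIs under our root that the paper text mentions."""
    # Drop a trailing full stop picked up when a DOI ends a sentence.
    return {match.rstrip(".") for match in DATA_DOI_PATTERN.findall(paper_text)}


def update_backlinks(backlinks, papers):
    """Record, for each data DOI, the DOIs of the papers that cite it.

    `papers` maps a paper DOI to that paper's text.
    """
    for paper_doi, text in papers.items():
        for data_doi in find_data_dois(text):
            backlinks[data_doi].add(paper_doi)
    return backlinks


# Hypothetical papers found by one night's trawl.
papers = {
    "doi:10.9999/example.paper.1": (
        "Output files are deposited at doi:101.202/pm286/plan9/data.dat "
        "and doi:101.202/pm286/plan9/data.cml."
    ),
    "doi:10.9999/example.paper.2": "This paper cites no deposited data.",
}
backlinks = update_backlinks(defaultdict(set), papers)
for data_doi in sorted(backlinks):
    print(data_doi, "<-", sorted(backlinks[data_doi]))
```

Because the trawl only adds entries to a set, it can be rerun every night over the same papers without creating duplicate links.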
So – some of the details will need to be hammered out but the general process is simple and feasible.
In any case we'll go ahead with a data repository for compchem...