Chemspider raises an important and valuable issue. How is data reposited?
PMR: First, we, or rather Jim, have released the SPECTRa tools (SPECTRa tools released). There are systems in place in at least Cambridge and Imperial. However, that’s only part of the story, and the easy bit. The main problem is to create a system and business model where there is a natural incentive to deposit data. We have found very considerable resistance and apathy. It’s easy to get excited by the ODOSOS community, but in practice most chemists don’t care.
One difficult problem is “when”? When the data are originally collected the chemist would never make them public. Although we can dream, I doubt that chemists will rush to Open Notebooks. When the paper is published it would be appropriate to reposit them. But publication occurs many months after the submission of the manuscript, so the data have to be held somewhere closed in the meantime. So you need an escrow repository, which is why we had to spend so much time on that in SPECTRa. This requires a mechanism of trust and I suspect it can come from the following sources:
- the departmental analytical infrastructure.
- the institutional infrastructure (especially for theses)
- a respected publisher (Perhaps an obvious role for BMC).
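To make the escrow idea concrete, here is a minimal sketch in Python. It is not how SPECTRa implements escrow; the class, field names and dates are all hypothetical, and the only rule it encodes is the one described above: data are deposited early, stay closed, and open up once the paper is published.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class Deposit:
    """One dataset held in escrow (all names here are hypothetical)."""
    dataset_id: str
    depositor: str
    deposited_on: date
    published_on: Optional[date] = None  # recorded when the paper appears

    def is_public(self, today: date) -> bool:
        # Closed until a publication date has been recorded and reached.
        return self.published_on is not None and today >= self.published_on

    def can_read(self, user: str, today: date) -> bool:
        # The depositor always sees their own data; everyone else waits.
        return user == self.depositor or self.is_public(today)


# Usage: deposited at collection time, released on publication.
d = Deposit("nmr-001", "alice", date(2007, 1, 10))
before = d.can_read("bob", date(2007, 3, 1))
d.published_on = date(2007, 9, 1)
after = d.can_read("bob", date(2007, 9, 2))
```

The trust question is who runs the service that holds `Deposit` records and sets `published_on`: a department, an institution or a publisher.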
So although I applaud the Chemspider offer to archive data, I think it will need a large number of different business models to make it work. Each university department is different and each publisher is different. I wish it wasn’t so. We have to change the culture of data, which is one reason why I shall be attending the Digital Curation Centre meeting next week in Washington.
Finally, what is data? Data without metadata can be almost valueless. Many of the “SD” files on the web have no metadata and you have to guess the tags. Spectra are easier, which is why we have started with them, along with crystallography and compchem. Moreover, the metadata are often available in the file: not always, but often enough to be valuable.
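As a rough illustration of the tag-guessing problem, the sketch below lists the data tags that appear in an SD file and counts how many records carry each one. It assumes the usual SD layout (each record's data items follow an `M  END` line, a data header looks like `> <TAG>`, and records end with `$$$$`). The sample records and their tag names are invented for the example.

```python
import re
from collections import Counter

# Matches the field name in an SD data header such as "> <MP>" or ">  <MP> (1)".
TAG_LINE = re.compile(r"^>.*?<([^>]+)>")


def sd_tags(sd_text):
    """Count how many records in an SD file carry each data tag."""
    counts = Counter()
    for record in sd_text.split("$$$$"):
        # Only look after the molfile block, so header lines are not misread.
        _, _, data_block = record.partition("M  END")
        tags = set()
        for line in data_block.splitlines():
            match = TAG_LINE.match(line)
            if match:
                tags.add(match.group(1))
        counts.update(tags)
    return counts


# Two invented records that name the same quantity differently.
sample = """\
mol1
  hypothetical

  0  0  0  0  0  0  0  0  0  0999 V2000
M  END
> <MP>
122

> <Name>
benzoic acid

$$$$
mol2
  hypothetical

  0  0  0  0  0  0  0  0  0  0999 V2000
M  END
> <melting_point>
80

$$$$
"""

tag_counts = sd_tags(sample)
```

Here `tag_counts` reports `MP` and `melting_point` as two unrelated tags. Nothing in the files says they mean the same thing, or what the units are, so a reader has to guess.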
But also, what and where is the extent of the data? Pubchem, for example, is a linkbase but not a database. It does not, for example, carry melting points. (This is a simplification, but it’s generally true.) So “putting data in Pubchem” is essentially adding the links to Pubchem through the connection table (or possibly the name in a few cases; I don’t know). And unless there are synchronised ontologies, it’s often unclear which quantities are equivalent and which aren’t. There is a lot of difference between a human reading a few entries and a machine reading many thousands and interpreting the results.