Noted in Peter Suber’s Blog:
Additional 2.2 Million Structures Now Searchable in Freely Available Database
Thomson Scientific, … provider of information solutions to the worldwide research and business communities, today announced the deposit of 2.2 million chemical structures from Thomson Pharma into PubChem, the freely accessible database that provides information on the biological activities of small molecules. PubChem was developed by the National Center for Biotechnology Information at the US National Institutes of Health (NIH) to help biomedical researchers identify chemical structures with the potential to treat diseases. The addition of Thomson Pharma’s extensive collection of biologically active and pharmacologically active structures, derived from worldwide patent and literature sources, significantly enhances the value of this research tool to the scientific and medical community.
Thomson Pharma (thomsonpharma.com) is the comprehensive pharmaceutical information solution that covers the entire drug discovery and development pipeline. PubChem searchers with a Thomson Pharma subscription can link directly from PubChem to more detailed information on their structure of interest including:
- activities reported for that compound
- drug reports citing that compound
- synthetic methods and critical reaction data
- patents, journals and news stories that feature the compound
- synonyms and trade names
- related salts and isomers
“My colleagues and I are very pleased to welcome the addition of Thomson Pharma chemical structures to PubChem,” said Steve Bryant, Director, PubChem, National Institutes of Health. “This will allow our users to cross-reference information in PubChem, such as biological testing results from the NIH Molecular Libraries program, with the further rich sources of information provided by Thomson Pharma.”
This is partially good news. It means that PubChem is becoming the de facto standard for chemical indexing. We can rapidly, and increasingly reliably, find the compounds we want without hassle, delay, subscription, licenses and non-re-use (all the devils of the anticommons). More below.
But the devil of the details states:
PubChem searchers with a Thomson Pharma subscription can link directly from PubChem to more detailed information on their structure of interest including … (my italics)
This appears to mean that only the structures, not the data, are freely accessible. Don’t get me wrong, the structures are very valuable in themselves (although I don’t know whether these are 2.2 million NEW structures – if so I’ll be happily surprised). Some certainly will be.
In a twentieth-century morality and legality the data are Thomson’s – won by the sweat of the brow. But in the twenty-first century – which some of us inhabit – note where the data come from…
- activities reported for that compound (mainly in scientific journal articles – PMR)
- drug reports citing that compound (probably appearing in the public domain – PMR)
- synthetic methods and critical reaction data (originally published in scientific journals – PMR)
- patents, journals and news stories that feature the compound (patent information is generally made Open by the patent offices – PMR)
So to a large extent this is OUR data. Its re-use is technically (though not socially) straightforward. I have been approached by patent offices who would like to make their outputs re-usable with CML. Drug reports (e.g. from regulatory authorities such as WHO, whom I have worked with) should be Open. The primary literature could be made Open to robot indexes if the political will were there. No cheer for the closed data.
But one cheer for the forces of Openness driven by commercialism. I guess that Thomson have done this because they sense a market opportunity. When PubChem becomes the leading chemical index – just as Google is the leading free-text index – then everyone will want to be linked therefrom. It’s not altruism – just good business. And others will follow. But I hope their data is open.
So the other cheer is for PubChem as index. There is, of course, quite a lot of data in PubChem and Medline, but increasingly PubChem is becoming a linkbase. And that is just what we want. If we can persuade all journals to make their published compound data available we have an Open chemical data system. This semantic publishing is not difficult – it just needs a different business model. If it’s compelling enough for Thomson to link their data to PubChem, what about chemistry articles? Well, no surprise, Nature already started this with Nature Chemical Biology. I am sure it’s a good move – I bet they get more clicks or whatever excites their business people.
So a simple prediction. In 5 years’ time (and that is ludicrously conservative) the majority of scientific journals will be linked to by PubChem. There are good semantic tools (like Peter Corbett’s OSCAR3) that will help to take the drudgery out of conversion. So where’s the problem?
Did I mention before that chemists are conservative?
Steve Bryant mentioned in a meeting the other day that
there is only a fairly small overlap between the Thomson
compounds and what was already in PubChem, so this really is
a significant addition.
(1) Many thanks Dan.
This is good to know as it increases the number of connection tables. Presumably some of these came from patents.
P.
I didn’t have time last week to comment on it, but I think your broader point is
important to discuss. The discussion is also important in a practical sense, as over the next
few months, the Molecular Library Roadmap initiative will be going through a mid-course evaluation.
In the present budget situation, any evaluation is quite justifiably going to take a very hard
look not only at the quality of work being done, but also at the question of whether this is
the best way for NIH to spend its money. You point out the success PubChem has already
achieved as a chemical index and that can easily be backed up by any number of metrics, but
it isn’t the whole story, and as you point out, the rest of the story is about data and the
freedom to use it any way you wish. If it were only about the indexing, I think it would be
difficult to make a strong case that PubChem is doing anything that CAS doesn’t already do
just as well or better. But try getting biological testing data from CAS or even bulk downloads
of chemical structures. It is pretty easy to list things that can be done via PubChem that
can’t be done via CAS, but the bigger challenge is to make a compelling case that these things are
important things for NIH to be spending money on. Because it is only fairly recently that
large chunks of open data have become available, there are relatively few highly successful
finished “stories”. Because there are relatively few big successes, there hasn’t yet been a
compelling reason for the average scientist to learn what can be done and how to do it.
The challenge for those of us who think it is important to support NIH efforts in this
area is to 1) create clearly written documents that describe specific examples of the type
of things that open data makes possible and 2) use existing open data to create the success
stories that realize the promise of open data. I think it is wonderful that PubChem has been
set up in a way that allows a mutually beneficial interaction with commercial concerns, but
without open data (and I think that the vast, vast majority of the biological data in
PubChem is from DTP and the Roadmap screening network) I think the effort will fall significantly
short of what it could be.
(3) Many thanks Dan. What is clear is that the structure, curation, maintenance and deployment of open databases (or knowledgebases) is flexible. There is no longer a need for everything to be in the same place.