Chemspider and Pubchem – open data

I was very pleased to see:
ChemSpider Blog » Blog Archive » The Entire ChemSpider Database is On Its Way to PubChem!
which describes how the Chemspider database is being offered to Pubchem as “open data”. Chemspiderman has made a valuable attempt to navigate the complexities of Open Data and recursive licences. It is technically difficult and takes us into unknown territory. For a start it is difficult to describe what the final object is. I understand Pubchem as a collection of links coupled to authority – i.e. Pubchem holds links to the Chemspider compounds but does not actually hold the data. (I am not aware that Pubchem holds any data other than a fairly small amount of computed data (e.g. number of rotatable bonds) and names). It does, of course, hold the data that NIH collects through the roadmap program. But I’d be happy to be corrected.
Chemspider repeats my suggestions for criteria for Open Data and adds:

CS: For right now I am giving up on trying to track where Open Data might end up. Based on my previous discussions with Peter Suber regarding navigating the complexities of Open Access definitions, I understand there is a need to define our own policies. I’m not going to do that here but what I will be clear with is that once the ChemSpider structure set is deposited in PubChem then we are at the mercies of THEIR data sharing policies. I believe Peter [PMR, not sure which Peter – but if me, see below] holds up PubChem as the primary example of Open Data (but maybe not). So, I believe it should be true to say that the ChemSpider structure set IS Open Data when accessed/downloaded/shared from PubChem. But I understand that will then be the PubChem data set and all association with us will likely be lost. But that is fully acceptable!

PMR: This shows the complexities. We will need to see how the data actually end up in Pubchem. But at present Pubchem holds only links to authorities. Thus if I search for aspirin I get 61 suppliers of information (search result), and each entry links back to the supplier’s site. So any “data” (e.g. melting point) is not in Pubchem. Unless Chemspider is treated differently, I would expect that only the links would be held in Pubchem. If I am right, then accessing Chemspider through Pubchem is simply another way of accessing Chemspider.
In a comment Rich Apodaca says:

Regardless of how exactly linkage occurs, the end result would be that any third party could, independently of ChemSpider, reconstruct the entire ChemSpider compound database. By using the ChemSpider Web APIs, they could develop a parallel service that re-processes the ChemSpider analytical data and patent/primary literature data, possibly mashing up the data from other sources as well.
This sets the bar very high for Open data in chemistry. I’m not sure what to call it, but it’s a game-changer.

If Chemspider allows the direct download and re-use of their data from their site then I also congratulate them. This is completely independent of whether the entries are linked from Pubchem. However it will be necessary to add a licence statement to the Chemspider pages (not Pubchem) making this clear.
It may be picky but I don’t think that Pubchem – in common with many other bioscience sites – actually gives explicit permission for re-use. Agreed that it is a work of the US government so should be free of copyright. There is an unspoken tradition in bioscience that data and collections are “Open” in some way but it isn’t well spelt out.
It should be.


2 Responses to Chemspider and Pubchem – open data

  1. DrZZ says:

    Pubchem holds links to the Chemspider compounds but does not actually hold the data. (I am not aware that Pubchem holds any data other than a fairly small amount of computed data (e.g. number of rotatable bonds) and names). It does, of course, hold the data that NIH collects through the roadmap program. But I’d be happy to be corrected.

    There are two kinds of deposits in PubChem, substance and bioassay. The substance deposit is what you are thinking of. It gives you the chemical structure, maybe a few other fields, some PubChem calculated properties and a link that the submitter can use. Very often the submitter uses this link to point to additional data the submitter has about this compound. But it doesn’t mean that that is the only way to get that data. If the submitter also does a bioassay deposit, then all that data will be in PubChem as well. Lots of groups other than the Roadmap screening centers have deposited such data. To get an idea, set the search database to “PubChem BioAssay” (upper left of screen) and then click on the tab that says “Limits”. Down near the bottom of the screen you see a field for Source. It is a menu pick and you can browse the list to see who has deposited data. When you do a substance or compound search, the default result list will let you see what Bioassay data is actually stored in PubChem. Just go to the links on the far right and select assays. It is even broken down for you, so if you want you can go only to assays in which the assay submitter considered the compound active. So for your aspirin example I see aspirin listed as being tested in 128 assays.
    As for re-use, you can find this statement where you download the data (including assay data) in bulk:

    Databases of molecular data on the NCBI FTP site include such examples as
    nucleotide sequences (GenBank), protein sequences, macromolecular structures,
    molecular variation, gene expression, and mapping data. They are designed to
    provide and encourage access within the scientific community to sources of
    current and comprehensive information. Therefore, NCBI itself places no
    restrictions on the use or distribution of the data contained therein. However,
    some submitters of the original data may claim patent, copyright, or other
    intellectual property rights in all or a portion of the data they have submitted.
    NCBI is not in a position to assess the validity of such claims and, therefore,
    cannot provide comment or unrestricted permission concerning the use, copying,
    or distribution of the information contained in the molecular databases.

  2. I made a copy of PubChem yesterday and used it to populate a local MySQL database, and calculated accurate (monoisotopic) masses on the fly for our metabolomics people. As far as I can tell, though, the FTP download does not contain links to other databases.
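
The commenter does not say how the masses were computed. As a minimal sketch of what an on-the-fly monoisotopic mass calculation might look like, assuming each record supplies a simple molecular formula string such as C9H8O4 (no charges, brackets or isotope labels), one could sum the mass of the most abundant isotope of each element. The function name `monoisotopic_mass` and the small element table are illustrative assumptions, not part of PubChem or of the commenter's setup.

```python
import re

# Masses (in unified atomic mass units) of the most abundant isotope of a few
# common elements. This table is deliberately small and is an assumption of
# the sketch; a real tool would cover the whole periodic table.
MONOISOTOPIC = {
    "H": 1.00782503207,
    "C": 12.0,
    "N": 14.0030740048,
    "O": 15.99491461956,
    "P": 30.97376163,
    "S": 31.97207100,
}

# A formula is one or more (element symbol, optional count) pairs.
_FORMULA = re.compile(r"(?:[A-Z][a-z]?\d*)+")
_TOKEN = re.compile(r"([A-Z][a-z]?)(\d*)")


def monoisotopic_mass(formula: str) -> float:
    """Return the monoisotopic mass of a simple formula such as 'C9H8O4'."""
    if not _FORMULA.fullmatch(formula):
        raise ValueError(f"unsupported formula: {formula!r}")
    total = 0.0
    for symbol, count in _TOKEN.findall(formula):
        if symbol not in MONOISOTOPIC:
            raise ValueError(f"element not in table: {symbol}")
        # A missing count means a single atom, e.g. the O in H2O.
        total += MONOISOTOPIC[symbol] * (int(count) if count else 1)
    return total


if __name__ == "__main__":
    # Aspirin, the example compound used in the post.
    print(f"{monoisotopic_mass('C9H8O4'):.5f}")
```

Computing the value from the formula at query time, rather than storing it, keeps the local database small at the cost of a little arithmetic per record.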
