Curating data on ChemSpider…should it be supported by the community?

Put simply, Chemspider is Web 1.0; the chemical blogosphere, Pubchem, Blue Obelisk and CrystalEye are Web 2.0.
Chemspider’s business model was fine for the early web, when there was little public content, significant effort was needed to extract it, and there were few alternative sites. Chemspider looks extremely similar to Chemfinder, which started about 2000 (I can’t remember exactly when). Its model was similar: aggregate compound supplier information, add some searchable properties, encourage the community to contribute, offer unlimited free access. Then, for whatever business reason, it stopped giving free access.
I see very little difference between Chemfinder and Chemspider. Both are closed and proprietary; they do not expose data, metadata or algorithms; their code is closed; and they do not allow downloads or re-use. They lose metadata in their aggregation process. I have nothing personal against Chemspider (or, if they are associated, ACDLabs); I just think the Web 1.0 model is out of date for chemistry.
99% of Chemspider’s data appears to come from Pubchem. If so, surely it is better to curate Pubchem directly. There are mechanisms for this, and as Pubchem is effectively the normalised source it creates fewer maintenance problems. And if any site is creating a more acceptable name-structure linkage, this is better done by standoff markup than by aggregate-and-possess. Web 1.0 denormalises and confuses data; Web 2.0 looks to normalise and mash up. Egon has shown how Pubchem can be annotated with standoff community involvement, surely a more scalable model.
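To make the standoff idea concrete, here is a minimal sketch in Python. It assumes a hypothetical vote record that only points at an entry in the source database rather than copying it; the class, the function names and the `record:water` identifier are all invented for illustration and are not part of any Pubchem or Chemspider interface.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class NameVote:
    """One community judgement, stored apart from the source record."""
    record_id: str   # points at the source entry; the entry itself is not copied
    name: str        # the name or synonym being judged
    useful: bool     # True = good name for this structure, False = "unuseful"
    voter: str


def score_names(votes):
    """Net score per (record_id, name): +1 for each useful vote, -1 otherwise."""
    scores = defaultdict(int)
    for v in votes:
        scores[(v.record_id, v.name)] += 1 if v.useful else -1
    return dict(scores)


def preferred_names(votes, record_id):
    """Names with a positive net score for one record, highest score first."""
    scored = [(s, name) for (rid, name), s in score_names(votes).items()
              if rid == record_id and s > 0]
    return [name for s, name in sorted(scored, key=lambda p: (-p[0], p[1]))]


# Illustrative votes; "record:water" is a placeholder, not a real identifier.
votes = [
    NameVote("record:water", "water", True, "alice"),
    NameVote("record:water", "water", True, "bob"),
    NameVote("record:water", "oxidane", True, "alice"),
    NameVote("record:water", "ether", False, "bob"),
]
print(preferred_names(votes, "record:water"))
```

The point of the design is that the votes live in their own store and refer to the source by identifier, so the source stays the single normalised copy and anyone can layer their own annotations over it.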
Selected from the Chemspider blog:

ChemSpider has been online since March 24th 2007, about 6 weeks. We opened the ability to curate the data one month later.
Is there a need to curate the data? ChemSpider is built up of a series of databases. The list of contributors continues to increase, and there will be some very exciting announcements made in the next few days about new contributors.

One of the largest components is the PubChem database. Peter Murray-Rust recently blogged about the quality of the name-structure pairs inside the PubChem database. He used as an example methane… I point you to the original blog for his comments. For my purposes I will use water. Here is the list of names, synonyms and registry numbers posted for Water at PubChem. Certainly a number of these have carried over to ChemSpider. Out of interest it is worth comparing the results of the searches for the word “water” at both PubChem and ChemSpider. Search PubChem for water here and ChemSpider for water here: 228 hits versus 1. Looking at ChemSpider we get the following list of names, synonyms and registry numbers. The hyperlinks below are links to Wikipedia.
water; Water vapor; Dihydrogen oxide; Distilled water; Purified water; Water, purified; hydrogen oxide; Deionized water; Oxygen atom; dihydridooxygen; ether; ethers; hydroxide; oxidane; Monooxygen;
[… other (wrong) synonyms deleted]
NOT what I would call a quality set of names. These will be curated, some will be done with appropriate robots and some manually.
This is an extreme. Let’s look at other examples already identified by curators. Below is an example of curation in process.
Some examples of curated data
Returning to Peter’s blog…an excerpt states “Pubchem faithfully reflects the broken nature of chemical information. It cannot mend it – there are only ca. 20 people – and anyway the commercial chemical information world prefers to work with a broken system. But could social computing change it? Like Wikipedia has? [..] I think chemistry is different. And I think we could do it almost effortlessly – rather like the Internet Movie Database. Here every participant can vote for popularity or tomatoes. A greasemonkey-like system could allow us to flag “unuseful names” or to vote for the preferred names and structures. And this doesn’t have to be done on PubChem – it could be a standoff site [..].”
I happen to agree. I believe social computing can change it. That is the purpose of the curating process on ChemSpider. When we set up the system we were not sure that people would care or help in curating the data. Why? Here’s why people might NOT want to help us curate the data:

  1. ChemSpider is not PubChem. The data cannot be downloaded.
  2. ChemSpider is a business…why should people help a business increase the quality of the data they host?
  3. ChemSpider is new. Who says that the efforts made to curate data will be of value to others? How long will ChemSpider be around to allow people’s work to benefit others?

All valid questions. And they likely ARE deterrents to people helping improve the quality of data on ChemSpider. So, what are the answers to these questions, and are they enough to convince ChemSpider users to assist in curating the database? Our responses to the questions above are as follows:

  1. We do not have permission from all depositors to ChemSpider to allow their data to be downloaded, only viewed. However, we WILL redeposit all curated data originally sourced from PubChem back to PubChem. In an email exchange this past week, Steve Bryant from PubChem commented that they would willingly accept curated data back into their database. We will also make available a downloadable database of all curated data originally sourced from public sources. We will also provide feedback to other depositors when we find errors.
  2. I have done my utmost to explain this in a previous post here.
  3. ChemSpider has traction. It is getting lots of use. Based on interest we believe that our initial efforts have already provided enough response to have us continue this work. We have challenges as discussed previously but we are busily addressing these now. We believe that every effort made to improve the quality of data on the ChemSpider database will benefit all users and the community in general with our giveaway to PubChem and other database providers of the curated data.

I have outlined only a small number of possible concerns above. There may be more. I welcome any other questions you may have about our intentions.

I interpret this to mean that Chemspider is curating a name-structure map. Fine. They are doing it with robots and humans. I can see how to use both, but I have no idea what the percentage recall and precision are. They will own the results, and the results will not be made Openly available but served through their gateway. You are invited to contribute.
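For what it is worth, precision and recall for a curated name-structure map could be measured against a hand-checked reference set, roughly as sketched below. The function name and the example pairs are my own invention for illustration; they are not Chemspider results or anyone's published method.

```python
def precision_recall(curated, reference):
    """Compare curated (name, structure) pairs with a hand-checked reference.

    precision = fraction of curated pairs that are correct
    recall    = fraction of correct pairs that the curation kept
    """
    curated, reference = set(curated), set(reference)
    true_positives = len(curated & reference)
    precision = true_positives / len(curated) if curated else 0.0
    recall = true_positives / len(reference) if reference else 0.0
    return precision, recall


# Invented example: structures are written as plain formula strings.
curated_pairs = {("water", "H2O"), ("oxidane", "H2O"), ("ether", "H2O")}
reference_pairs = {("water", "H2O"), ("oxidane", "H2O"),
                   ("dihydrogen oxide", "H2O")}

precision, recall = precision_recall(curated_pairs, reference_pairs)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Without an Open reference set and Open results, nobody outside the gateway can run a check like this.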
The Web 2.0 community will use a different mechanism.