CAS and InChI – who can assign identifiers?

I’ve had two useful comments on CAS and InChI identifiers which have updated my knowledge (a feature of closed organizations and authorities is that updates often trickle out in small amounts, particularly if they represent unwelcome progress towards openness). Before I comment in detail, some thoughts about identifiers.

WP gives:

With reference to a given (possibly implicit) set of objects, a unique identifier is any identifier which is guaranteed to be unique among all identifiers used for those objects and for a specific purpose. There are three main types of unique identifiers, each corresponding to a different generation strategy:

  • serial numbers, assigned incrementally
  • random numbers, selected from a number space much larger than the maximum (or expected) number of objects to be identified. Although not really unique, some identifiers of this type may be appropriate for identifying objects in many practical applications and are, with abuse of language, still referred to as “unique”
  • names or codes allocated by choice which are forced to be unique by keeping a central registry such as the EPC Information Services.
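As a rough illustration of the three strategies quoted above, here is a minimal Python sketch. The class and function names (`SerialIssuer`, `random_identifier`, `Registry`) and the example name `widget-a` are invented for this illustration and do not come from any real identifier system.

```python
import itertools
import uuid


class SerialIssuer:
    """Strategy 1: serial numbers, assigned incrementally by one issuer."""

    def __init__(self, start=1):
        self._counter = itertools.count(start)

    def issue(self):
        return next(self._counter)


def random_identifier():
    """Strategy 2: a random value from a very large space (a version 4 UUID)."""
    return str(uuid.uuid4())


class Registry:
    """Strategy 3: chosen names, kept unique by a central registry."""

    def __init__(self):
        self._names = {}

    def register(self, name, obj):
        # The registry is the only thing that makes the name unique.
        if name in self._names:
            raise ValueError(f"identifier already registered: {name!r}")
        self._names[name] = obj
        return name


serial = SerialIssuer()
first, second = serial.issue(), serial.issue()

registry = Registry()
registry.register("widget-a", object())
try:
    registry.register("widget-a", object())  # second claim on the same name
except ValueError:
    duplicate_rejected = True
else:
    duplicate_rejected = False
```

In the first and third strategies uniqueness depends on a single issuer or registry holding state, which is why the question of who the authority is matters so much.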

I shall omit (2) – the UUID generated algorithmically and, if long enough, “likely to be unique”.

The assignment of identifiers is a non-trivial task which requires expert knowledge of the domain and training for those assigning the identifiers. To avoid collisions and errors there is normally a single authority which is responsible – I am not aware of common identifier systems which are created collectively, though it’s possible.

Identifier systems are the IP of the authority creating them and can be protected by copyright. This type of protection is common in many domains such as maps and chemistry. It is also important for regulatory and safety purposes that identifiers are well maintained so that any dispersal of the identifier system maintains integrity. For this reason many authorities will not allow their identifiers to be used by others without licence.

I’ll now comment on the comments, and add a summary.

  1. Rich Apodaca says:

    Peter, here are some other views about the limitations of InChI/InChIKey and the idea of an InChI resolver authority:

http://depth-first.com/articles/2008/12/02/five-questions-about-the-inchi-resolver

These are tough problems without easy answers.

I’d like to correct something you mention in this post:

>So Pubchem does not display any CAS numbers.
Not true. PubChem not only displays them, it lets anybody download their entire collection of CAS numbers (>350,000 at my last count) along with the rest of the PubChem database.

To see CAS numbers in Pubchem, you need to look at Substance summary pages, not Compound summary pages. For example, you’ll see the CAS number for caffeine (58-08-2) appears on this page:

http://pubchem.ncbi.nlm.nih.gov/summary/summary.cgi?sid=9684

This gives a nice CAS number lookup facility that is remarkably accurate:

http://depth-first.com/articles/2007/05/21/simple-cas-number-lookup-with-pubchem

These CAS numbers are added by individual depositors, and PubChem aggregates them. This feature was used to create a CAS number lookup facility in Chempedia with the ability to trace who ‘assigned’ which CAS number in PubChem (although the site is now down for major redesign):

http://depth-first.com/articles/2008/05/26/simple-cas-number-lookup-and-more-with-chempedia

You might also want to check out Common Chemistry, the free CAS number service created by CAS for the public to use:

http://zusammen.metamolecular.com/2009/03/31/sixty-four-free-chemistry-databases-part-6-common-chemistry-from-chemical-abstracts-service

PMR: First, many thanks for these corrections and updates (Common Chemistry came out very recently and I missed it)

Re Pubchem: I believe that most of the CAS numbers came from the NCI database, where NCI paid CAS for the right to include the numbers. I do not know whether those can be re-used without infringing copyright – and I’d welcome authoritative information. I’m going to challenge “remarkably accurate”. This can only be asserted by CAS itself, as it is the only authority that can assert what a CAS number is. Alternatively they may, though I doubt it, have allowed individuals to use their Scifinder service as a way of checking CAS numbers. I suspect that this is forbidden by the terms of the contract.

When Wikipedia authors first checked their CAS numbers against Scifinder they were immediately told by CAS that they were in breach of contract. There was an outcry (in which I took part) and CAS changed to allow this – I am not sure whether there is a limit, but I would be very surprised if widespread distribution of CAS numbers (against names or structures) was allowed in an authoritative manner.

“These CAS numbers are added by individual depositors”. This is part of the problem. No depositor has the authority to assert that a given substance is linked to a given CAS number. They can speculate, copy, etc. but mistakes will occur and there is no authority.
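To make concrete what a non-authority can and cannot check, here is a minimal Python sketch of the check-digit rule as it is commonly described for CAS numbers: the digits before the final one are weighted 1, 2, 3, … from the right, and the sum modulo 10 must equal the final digit. The function name `cas_check_digit_ok` is invented for this illustration. Passing the test shows only that the string is well formed; it says nothing about whether the number belongs to the substance it is quoted against.

```python
import re

# Commonly described form: 2 to 7 digits, 2 digits, 1 check digit.
CAS_PATTERN = re.compile(r"^(\d{2,7})-(\d{2})-(\d)$")


def cas_check_digit_ok(cas_number):
    """Return True if the string has CAS-number form and a consistent check digit."""
    match = CAS_PATTERN.match(cas_number)
    if match is None:
        return False
    body = match.group(1) + match.group(2)
    expected = int(match.group(3))
    # Weight the body digits 1, 2, 3, ... starting from the rightmost one.
    total = sum(weight * int(digit)
                for weight, digit in enumerate(reversed(body), start=1))
    return total % 10 == expected


caffeine_ok = cas_check_digit_ok("58-08-2")   # the caffeine number quoted above
typo_ok = cas_check_digit_ok("58-08-3")       # same number with a wrong final digit
```

For the caffeine number the weighted sum is 8×1 + 0×2 + 8×3 + 5×4 = 52, whose last digit is 2, so it passes. A depositor can catch a mistyped digit this way, but cannot establish the link between number and substance.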

  2. Mat Todd says:

    …but is there anything inherently wrong with InChI? The behaviour of glucose in solution is arguably a chemical reaction, and therefore something that needs to be described by a network of InChIs, rather than being a limitation of InChIs themselves.

CAS numbers are used widely in chemical catalogues. They are useful for searching because they are short, and because they are unambiguous in what they are meant to describe. The shortest way of searching without CAS is molecular formula or drawing the structure, which are either longer or ambiguous. CAS can’t describe clays accurately either, beyond what one might buy from a supplier.

In the future, I’m going to search for chemical information using structures and networks of structures on web pages. For this I’m not going to care that InChIs are being used behind the scenes. What’s the upshot? Use InChIs and develop reaction networks. For fuzzy InChIs like clays – well, aren’t these cases minorities that can be worked out later? When was the last time you used a clay?

PMR: When was the last time you used a zeolite? Or a polymer? Probably last time you went into the lab.

Here we have the substance-molecule dichotomy very clearly. CAS states its numbers refer to substances. InChI necessarily refers to molecular structure. Many substances consist of several molecular structures. Many molecular formulae occur as more than one substance.

The mistake is as serious as equating a coding sequence to a protein structure. In many cases a 1:1 correspondence works; in many it is a completely wrong picture of our scientific knowledge. The same is true for chemistry.
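The many-to-many relationship can be sketched in a few lines of Python. All labels here (`substance-A`, `structure-1`, and so on) are hypothetical placeholders, not real CAS numbers or InChI strings; the point is only that a simple one-to-one lookup table cannot hold this data.

```python
from collections import defaultdict

# Hypothetical labels, not real registry numbers or InChI strings.
composition = [
    ("substance-A", "structure-1"),   # one substance, several structures
    ("substance-A", "structure-2"),
    ("substance-B", "structure-2"),   # one structure, several substances
    ("substance-C", "structure-3"),   # the simple 1:1 case
]

structures_of = defaultdict(set)
substances_with = defaultdict(set)
for substance, structure in composition:
    structures_of[substance].add(structure)
    substances_with[structure].add(substance)

# Substances for which a 1:1 substance-to-structure mapping is safe.
one_to_one = sorted(
    substance for substance, structs in structures_of.items()
    if len(structs) == 1
    and len(substances_with[next(iter(structs))]) == 1
)
```

In this toy data only `substance-C` can be represented by a single substance-to-structure pair; the other two need the full relation.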

It’s a very hard problem and requires a lot of work. Crowdsourcing InChIs and CAS may be the first generation and they will hopefully advance the political discussion to the extent where Openness is seen to be essential. At present I think the two most likely authorities are Pubchem and Wikipedia as there has to be a promise of sustainability. I think WP will do a very good job on ca 10,000 common chemical substances and molecules though it badly needs a coherent identifier scheme itself (indexing pages by natural language name does not constitute an identifier for an entity). Pubchem – rightly – captures all depositor metadata, but we have yet to work out how to identify the conflicts.

Pubchem has substances as well as molecular formulae for those compounds physically submitted to the Molecular Libraries. For the rest it has assertions from depositors which may or may not make it clear what substances, if any, were involved.
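One possible first step towards identifying conflicts, offered only as a sketch, is to group depositor assertions by identifier and flag any identifier that has been asserted against more than one structure. The record layout, the function name `find_conflicts` and all labels below are assumptions made for this illustration; they are not Pubchem’s data model.

```python
from collections import defaultdict

# Hypothetical depositor assertions: (depositor, CAS-like label, structure label).
assertions = [
    ("depositor-1", "cas-X", "structure-1"),
    ("depositor-2", "cas-X", "structure-1"),
    ("depositor-3", "cas-X", "structure-2"),   # disagrees with the others
    ("depositor-1", "cas-Y", "structure-3"),
]


def find_conflicts(records):
    """Return {label: {structure: [depositors]}} for labels claimed for >1 structure."""
    claims = defaultdict(lambda: defaultdict(set))
    for depositor, label, structure in records:
        claims[label][structure].add(depositor)
    return {
        label: {structure: sorted(who) for structure, who in by_structure.items()}
        for label, by_structure in claims.items()
        if len(by_structure) > 1
    }


conflicts = find_conflicts(assertions)
```

A flagged entry is only a candidate for review: since a substance may legitimately consist of several molecular structures, disagreement between depositors is not in itself proof of an error, and no amount of counting makes any depositor an authority.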

I completely support the InChI effort but it’s now time to take stock of the complexities as well as continuing to try to make chemical information Open.


One Response to CAS and InChI – who can assign identifiers?

  1. Mat Todd says:

    Peter, I think trying to pin down the exact nature of a substance and label it is important. I suspect it’s important because we need computers to be able to handle the data. But it reminds me of efforts to label vague concepts with names more generally. What is ‘British?’ How many hairs must I lose before I’m bald? At what wavelength does red become orange? To decide that such labels are important is half the battle.
    Beyond the zeolite/clay examples above, there was an interesting episode in a recent synthesis of quinine from Robert Williams (10.1002/anie.200705421). To quote another site (http://tinyurl.com/dg8me8):
    “following the old ways without the benefits of modern storage methods of reactive metals may have been critical in their success. Initially, their yield of quinine was very low. They suspected that the aluminium powder used as a reducing agent in the last step was the problem. It was too fresh! Leaving it in air for a short period leads to the formation of a coating of aluminium oxide. When the experiment was repeated with this powder, the yield matched that reported by Woodward and Doering.”
    Even commercial reagents with the same labels can be a mixture of things in a time-dependent manner.
