Identifiers: why we need them and when CAS and InChI don't mix

I have tried to write this so that you don’t need to know chemistry, because there is an important political point.

I have been involved in InChI since the beginning and I am a great supporter of it. But it’s not a simple concept and it’s now being overused and badly used. I don’t know whether the situation is recoverable. But I can at least explain the problem.

First, chemistry is complicated. God or the LawsOfPhysics requires that. In principle every time you make a chemical compound it has a different composition.

That’s true for a huge number of compounds – a common one is clay. It has varying amounts of metal ions – sodium, potassium, calcium, aluminium, etc. Yet we are prepared to use names like “montmorillonite” to describe a single chemical concept even though it’s subtly different every time. But that’s beyond InChI.

Similarly we have glucose. We all know what that is – you can buy bottles at the drugstore and all are the “same stuff”. But when you put it in water you get a mixture of at least three species (open chain, alpha and beta isomers). But that’s beyond InChI.

Informatics for modern science requires precise description of concepts (ontology) and identifier systems. For example when we (at Glaxo) determined the structure of a drug bound to HIV protease the experiment was given an identifier in the Protein Data Bank (1HTE). That identifier was assigned by the Protein Data Bank, which validates the data, including versioning, thus acting as an authority. It’s more difficult to describe HIV protease. It varies between strains of the virus, which mutates extremely rapidly – that’s one reason why it’s difficult to create drugs. So there are zillions of identifiers – ours was BH10, but they are generally classified under PF00077 (Sanger Centre). The actual ids don’t matter – the point is that there have to be authorities and there have to be identifier systems.

But that’s not so easy in chemistry. The main problem is that there is no authority that catalogues chemistry and assigns open identifiers. It’s a very tricky problem because the identity of many compounds is very difficult to pin down. The most obvious problem – which no-one has formalised, though we are starting to – is that there is an enormous difference between the macroscopic and microscopic – which I will refer to as “substances” and “molecules”. The conventional way to do this is through names, but names are variously applied to substances and molecules without distinguishing. It needs an authority to help define the problem and it needs an ontology.

Now the International Union of Pure and Applied Chemistry (IUPAC) knows there is a problem. It’s a worthy body and has honoured me by making me a fellow. It creates a large and detailed set of rules for naming compounds – from their molecular structure. But it doesn’t have the resources to name everything.

One useful way forward is to identify the molecular structures (or connection tables). It is very important to realise that not all compounds (e.g. sodium chloride) have connection tables and some like glucose have several. It’s only when a compound and a connection table have a 1:1 relationship that the InChI approach is useful. But for millions of compounds that’s roughly true. Unfortunately you have to know a lot of chemistry to know when it works and when it doesn’t. There’s no system in the world that manages it – that’s why we need an ontology.

But for those millions of compounds that pharma are interested in it’s often possible to draw structures that are good enough to identify the compound. The problem is that the structures can be drawn in different sizes, orientations, etc. So we create a connection table – which atoms are joined to which. This is an enormous advance and allows us to classify compounds and search them by computer.
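As a rough illustration of the idea (not any particular file format), a connection table can be sketched as a list of atoms plus a set of bonds, with the drawing coordinates thrown away. The class name, the `from_drawing` helper and the dictionary layout of a “drawing” are all assumptions made up for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConnectionTable:
    atoms: tuple      # element symbol of each atom, by index
    bonds: frozenset  # each bond is (i, j, order) with i < j


def from_drawing(drawing):
    """Discard the x/y coordinates of a drawing; keep atoms and bonds only."""
    atoms = tuple(symbol for symbol, _x, _y in drawing["atoms"])
    bonds = frozenset(
        (min(i, j), max(i, j), order) for i, j, order in drawing["bonds"]
    )
    return ConnectionTable(atoms, bonds)


# Hydrogen cyanide (H-C#N) drawn small, and again larger and rotated,
# with the bonds listed in a different order and direction.
small = {
    "atoms": [("H", 0.0, 0.0), ("C", 1.0, 0.0), ("N", 2.0, 0.0)],
    "bonds": [(0, 1, 1), (1, 2, 3)],
}
large_rotated = {
    "atoms": [("H", 0.0, 0.0), ("C", 0.0, 10.0), ("N", 0.0, 20.0)],
    "bonds": [(2, 1, 3), (1, 0, 1)],
}

# Same atoms joined in the same way, so the connection tables are equal.
assert from_drawing(small) == from_drawing(large_rotated)
```

Both drawings number the atoms the same way here; what happens when they don’t is a separate problem.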

But remember that it helps in the macroscopic world only if there is a 1:1 relationship between connection table (CT) and compound.

Now anyone can create a connection table and everyone will do it differently – they will call the atoms different names so no-one can compare them. But by using graph theory it is possible to “canonicalise” the CTs and one early paper by the Weiningers described exactly how, using a representation called SMILES.
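To make “canonicalise” concrete, here is a toy sketch. It is not the Weiningers’ algorithm and not what InChI does: it simply tries every possible renumbering of the atoms and keeps the smallest result, which only works for a handful of atoms. The function name and the (atoms, bonds) layout are assumptions for this example.

```python
from itertools import permutations


def canonical_form(atoms, bonds):
    """Return a numbering-independent form of a small connection table.

    Brute force: apply every renumbering and keep the lexicographically
    smallest (atoms, bonds) pair. Cost grows factorially with atom count.
    """
    n = len(atoms)
    best = None
    for perm in permutations(range(n)):
        # perm[old] is the new index given to the atom at position old
        new_atoms = [None] * n
        for old, new in enumerate(perm):
            new_atoms[new] = atoms[old]
        new_bonds = sorted(
            (min(perm[i], perm[j]), max(perm[i], perm[j]), order)
            for i, j, order in bonds
        )
        candidate = (tuple(new_atoms), tuple(new_bonds))
        if best is None or candidate < best:
            best = candidate
    return best


# Hydrogen cyanide (H-C#N) numbered by two different people.
a = canonical_form(["H", "C", "N"], [(0, 1, 1), (1, 2, 3)])
b = canonical_form(["N", "H", "C"], [(1, 2, 1), (2, 0, 3)])
assert a == b
```

Real canonicalisers avoid the factorial search by using graph-theoretic invariants to order the atoms, and that is exactly the part where two implementations can disagree.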

The SMILES system was adopted by the pharma industry which used it to produce “canonical SMILES”. Now this looks like a good thing, but it wasn’t. First the program (DAYLIGHT) was closed and commercial, and secondly the program gave different answers from the algorithm published in the literature. Chaos.

Several of us tried to get the Daylight company to release the algorithm, but to no avail. They never have done and the closed nature has helped to hold chemistry back, as it discredited the idea of universal identifier systems and interoperability.

So IUPAC put together a working group to develop an Open alternative. I was on this group and contributed a lot of input. It’s been a political success in that it’s highly used. But the design has serious flaws.

First it tries to tackle the structure-compound problem by creating multiple representations for certain types of compound (tautomers). That complicates InChIs severely, without adding commensurate benefit. Secondly it tries to represent imperfect knowledge. One of the major problems in chemical representation is that people miss out hydrogen atoms. That saves writing time, but it’s lazy and it introduces untold errors. Another problem is that chemists often represent stereochemistry imprecisely, so it’s not uncommon to find many connection tables for the “same” compound or substance. That means that in practice there can be many InChIs for the same compound. There has been an attempt to fix this by declaring certain representations more fundamental, and this will help but the multiple-InChI problem still remains. Only an ontological approach will help.

The problem is compounded because people found the length of many InChIs inconvenient and so created hashes (InChIKey). Here the different InChIs for the same compound have no similarity and they can only be compared by having a resolving authority. So we are now back to authorities, without the infrastructure or communal will to make them work.
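The general property of hashes behind this can be sketched as follows. This is plain SHA-256 over the whole string, truncated to 14 hex characters. It is not the InChIKey algorithm, which is constructed differently and is not reproduced here. The function name `short_key`, the `resolver` dictionary and the two InChI-style strings (alanine written without and with a stereochemistry layer) are illustrative assumptions.

```python
import hashlib


def short_key(identifier: str) -> str:
    """Hypothetical fixed-length key: first 14 hex characters of SHA-256."""
    return hashlib.sha256(identifier.encode("utf-8")).hexdigest()[:14]


# Illustrative strings: the second only adds stereochemistry information.
without_stereo = "InChI=1S/C3H7NO2/c1-2(4)3(5)6/h2H,4H2,1H3,(H,5,6)"
with_stereo = without_stereo + "/t2-/m0/s1"

# The full strings are visibly related: one is a prefix of the other.
assert with_stereo.startswith(without_stereo)

# The hashed keys carry no such relationship, so linking them needs a
# lookup table maintained by someone: in effect, a resolving authority.
resolver = {
    short_key(without_stereo): "alanine",
    short_key(with_stereo): "alanine",
}
assert resolver[short_key(with_stereo)] == resolver[short_key(without_stereo)]
```

Whoever maintains `resolver` is deciding which strings count as “the same compound”, which is the role of an authority.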

Because there are already many identifier-giving authorities in chemistry. Some identify substances (like safety authorities); others identify molecules. The American Chemical Society has the best known and largest identifier system – the CAS registry number. But it has several drawbacks. The way in which identifiers are assigned is not public so we cannot know whether a substance corresponds to a given CAS number. Then there is no public indication whether an identifier is for a connection table or a substance. And then CAS numbers are CAS’s IP.

Now that’s OK in that they have assigned them by the “sweat of their brow” so they copyright them. If you want to find out what structure corresponds to what number you have to pay 5.80 USD (last time I looked) per compound. If you have hundreds of thousands of compounds (such as a government agency responsible for health or environment) that’s a lot of dollars. So Pubchem does not display any CAS numbers.

Pubchem is – as far as I know – the only system that tries to distinguish compounds (CID) from substances (SID). Substance information is donated by suppliers of information and without an ontology it won’t be clear whether it’s molecule or substance – you have to guess from the type of depositor.

In summary, chemistry, like bioscience, is complicated. It needs the community to work out description systems and identifiers, but unlike bioscience, the political aspects are keeping it in the dark ages. Until we have a public Open authority for substance identifiers we can’t solve the chemical problem. And while the ACS lobbies against openness and the NIH (for whatever reason; PRISM, Pubchem, Conyers) it’s going to be tough.

But I know who my money is on.


4 Responses to Identifiers: why we need them and when CAS and InChI don't mix

Rich Apodaca says:

April 5, 2009 at 1:57 am

Peter, here are some other views about the limitations of InChI/InChIKey and the idea of an InChI resolver authority:
http://depth-first.com/articles/2008/12/02/five-questions-about-the-inchi-resolver
These are tough problems without easy answers.
I’d like to correct something you mention in this post:
>So Pubchem does not display any CAS numbers.
Not true. PubChem not only displays them, it lets anybody download their entire collection of CAS numbers (>350,000 at my last count) along with the rest of the PubChem database.
To see CAS numbers in Pubchem, you need to look at Substance summary pages, not Compound summary pages. For example, you’ll see the CAS number for caffeine (58-08-2) appears on this page:
http://pubchem.ncbi.nlm.nih.gov/summary/summary.cgi?sid=9684
This gives a nice CAS number lookup facility that is remarkably accurate:
http://depth-first.com/articles/2007/05/21/simple-cas-number-lookup-with-pubchem
These CAS numbers are added by individual depositors, and PubChem aggregates them. This feature was used to create a CAS number lookup facility in Chempedia with the ability to trace who ‘assigned’ which CAS number in PubChem (although the site is now down for major redesign):
http://depth-first.com/articles/2008/05/26/simple-cas-number-lookup-and-more-with-chempedia
You might also want to check out Common Chemistry, the free CAS number service created by CAS for the public to use:
http://zusammen.metamolecular.com/2009/03/31/sixty-four-free-chemistry-databases-part-6-common-chemistry-from-chemical-abstracts-service

Mat Todd says:

April 5, 2009 at 1:37 pm

…but is there anything inherently wrong with InChI? The behaviour of glucose in solution is arguably a chemical reaction, and therefore something that needs to be described by a network of InChIs, rather than being a limitation of InChIs themselves.
CAS numbers are used widely in chemical catalogues. They are useful for searching because they are short, and because they are unambiguous in what they are meant to describe. The shortest way of searching without CAS is molecular formula or drawing the structure, which are either longer or ambiguous. CAS can’t describe clays accurately either, beyond what one might buy from a supplier.
In the future, I’m going to search for chemical information using structures and networks of structures on web pages. For this I’m not going to care that InChIs are being used behind the scenes. What’s the upshot? Use InChIs and develop reaction networks. For fuzzy InChIs like clays – well, aren’t these cases minorities that can be worked out later? When was the last time you used a clay?

Steven Bachrach says:

April 5, 2009 at 2:05 pm

Peter,
There can never be a system of identification of anything, including chemical entities, without an agreement as to how identifiers will be constructed and interpreted. Asking for some ab initio identification scheme is simply asking for the impossible.
InChI defines a subset of chemistry, with defined limitations and rules. The new standard InChI provides an identifier that has been agreed upon by a large block of users. Is it perfect – NO! The InChI session at the recent Salt Lake City ACS meeting revealed a large number of imperfections – but all of these are well-recognized. For example, while 1,3-tautomers are treated as identical, 1,5-tautomers are not. Is this correct? Depends on your use. Are organometallics treated properly? Well, the problem is how does one treat localized vs delocalized bonding – i.e. are there 10 Fe-C bonds in ferrocene or two η-5 bonds between the iron and the Cp rings?
No current identifier system treats the chemical identities I am most interested in – transition states, excited states, reaction trajectories, etc. Our current identifiers require fidelity to the Lewis type chemical bond – when this model starts to fail, then identifiers start to fail too.
In one of your previous posts you talked about the necessity of producing conventions, and that is the key to moving forward here. The InChI standard is a convention for a subspace of chemical space. If it doesn’t cover your molecules, then you need to work with the InChI team to create the convention extensions. The significant advantage of InChI over other identifiers is that it is open and well-defined and extensible and algorithm-driven. No other ID system offers all of these advantages.

pm286 says:

April 5, 2009 at 2:41 pm

@Steve Thanks. I have addressed most of these points in the last few minutes. As you say, InChI covers a subspace of chemistry – the subspace for which connection tables are relatively unique and where most chemists will agree on representations. It’s possible to make small extensions into some fields (such as organometallics), given a great deal of work, but generally InChI has reached the boundaries of what it can do.
My concern is that people are trying to link InChIs to substances. That cannot work without an authority which decides on what a particular substance is. And it has to be done on a case-by-case basis or at least a class-by-class. That’s what CAS has been doing for years. I don’t know whether they have done it well because the process is not exposed. Since we find that inter-human agreement is not likely to exceed 90% on recognising chemical entities I would expect a lot of errors and inconsistencies.
The way forward needs to use ontologies as that is the only framework that can hold the complexity of chemistry. It’s not going to be easy, but it’s necessary.