Cameron Neylon picks up the theme of an alternative approach to identifying chemicals (especially in the context of CAS’s blanket refusal to allow normal scientific practice – the quoting of authority). I reproduce some key quotes and then introduce what I hope will be a series of posts explaining the basic principles. Be warned – identifying chemicals ranges from straightforward to almost impossible. There are no trivial answers such as “use InChI” or “use PubChem CID”. I am fully supportive of Cameron’s post, but comment where it needs tightening…
Now following on from my post about feeds it is clear that we also want to provide a good range of searchable indexes for people to be able to tell what we are using. So we would ideally want to expose InChi, InChiKey, SMILES, CML perhaps, PubChem Ids etc. These can all be converted one to the other using web services so we don’t need to type all of them in manually. All that is required is a nice logging screen where we can drop in one type of index key, the size of the bottle, supplier, lot numbers, perhaps a link to safety data. The real question is what is the index key that is easiest to input? For those of you in or near a laboratory I suggest an exercise. Go and pick up the nearest bottle of commodity stuff from a commercial supplier (i.e. not oligos or peptides). What is written on it? What is a nice short identifier that can consistently be found on pretty much any bottle of chemicals? For those unlucky people who don’t have a laboratory at their fingertips I have provided a clue below.
PMR: Most of these are interconvertible, but not always. For example, there are substances that PubChem can hold that do not have SMILES or InChI. And CML can describe substances that the others cannot (e.g. non-stoichiometric compounds and certain mixtures). I’ll explain later.
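As a rough sketch of the logging record Cameron describes (one index key plus bottle size, supplier, lot number and an optional safety link), something like the following might do. All field and class names here are hypothetical, and the `derived` dictionary is deliberately allowed to stay partly empty, since not every identifier can be generated for every substance.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Sequence


@dataclass
class BottleRecord:
    """One logged bottle; every field name is an illustrative assumption."""
    index_key: str                  # whatever was typed in, e.g. "702"
    key_type: str                   # which scheme the key belongs to, e.g. "pubchem_cid"
    bottle_size: str                # free text such as "500 mL"
    supplier: str
    lot_number: str
    safety_link: Optional[str] = None
    # Identifiers obtained later by conversion; some may never be available.
    derived: Dict[str, str] = field(default_factory=dict)

    def missing(self, wanted: Sequence[str]) -> List[str]:
        """Return the wanted identifier types that have not been filled in."""
        return [name for name in wanted if name not in self.derived]


record = BottleRecord(
    index_key="702",
    key_type="pubchem_cid",
    bottle_size="500 mL",
    supplier="ExampleChem",         # made-up supplier name
    lot_number="LOT-0001",          # made-up lot number
)
record.derived["smiles"] = "CCO"
print(record.missing(("inchi", "smiles")))
```

Keeping the typed-in key separate from the derived identifiers means a failed conversion leaves a visible gap rather than a wrong value.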
The Chemical Abstracts Service number is the one identifier that can reasonably reliably be found on most commercially supplied substances. Yet, as described by Peter Murray-Rust and Antony Williams recently you can’t look these up without paying for them. And indeed by recording them for your own purposes (say in a database of the compounds we have in the laboratory) we may be violating the terms of the license.
PMR: Exactly.
So what to do? Well we can adopt another standard or standards. Jean-Claude Bradley argued in a comment on my recent post that InChiKey is the way to go, but for this specific purpose (logging materials in) this may be too much to type in many cases (certainly SMILES, InChi and CML would be). You can’t expect people to draw in the structure each time a compound comes in, particularly if we get into arguments about which precise salt of cAMP we are using today. What is required is a simple, relatively short number. This is what makes the CAS number so appealing; it is short, easily typed in, and printed on most bottles.
PMR: InChI and CAS are fundamentally different and one cannot replace the other. I will explain later.
So, along with Peter I think the answer is to use PubChem CID numbers. PubChem doesn’t use CAS numbers and CAS actively lobbied the US government to limit the scope of PubChem. PubChem CIDs are relatively short, and there are a range of web services from which other descriptions can be retrieved (see e.g. PubChem Power User Gateway). The only thing that is missing is the addition of CID’s on bottles. If we can get wide enough agreement on this I think the answer is to start writing to the suppliers. It’s not great effort on their part to add CIDs (or if there is something better, some other index) to the bottles I would have thought and it provides a lot of extra value for them. PubChem can provide links through to up to date safety data (without the potential legal issues that maintaining a database of MSDS forms with CAS numbers creates), it provides free access to a supplier index through which customers can find them, and it could also save them a small fortune in CAS license fees.
PMR: This is exactly the right political approach. There are some technical points that need to be addressed but we don’t need to do it all at once.
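To make the “CID in, other descriptions out” idea concrete, here is a minimal sketch. It assumes PubChem’s present-day PUG REST URL pattern and JSON response layout as I understand them (not necessarily the Power User Gateway interface Cameron links to), and the function names are my own. It only builds the URL and unpacks a response; fetching is left to whatever HTTP client you prefer.

```python
PUG_REST_BASE = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"


def cid_property_url(cid, properties=("InChI", "InChIKey")):
    """Build a PUG REST URL asking for the given properties of one CID.

    The URL pattern is an assumption based on PubChem's PUG REST service.
    """
    if int(cid) <= 0:
        raise ValueError("a PubChem CID is a positive integer")
    return f"{PUG_REST_BASE}/compound/cid/{int(cid)}/property/{','.join(properties)}/JSON"


def parse_properties(payload):
    """Map each CID in a decoded response to its returned properties.

    Assumes the layout {"PropertyTable": {"Properties": [{"CID": ..., ...}]}}.
    """
    rows = payload.get("PropertyTable", {}).get("Properties", [])
    return {row["CID"]: {k: v for k, v in row.items() if k != "CID"} for row in rows}


print(cid_property_url(702))  # 702 is the CID for ethanol

# A hand-written stand-in for a decoded JSON response, for illustration only.
sample = {"PropertyTable": {"Properties": [
    {"CID": 702, "InChIKey": "LFQSCWFLJHTTHZ-UHFFFAOYSA-N"},
]}}
print(parse_properties(sample))
```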
There is another side to this, which is that if there is a wholesale shift (or even the threat of a shift) away from CAS as the only provider of chemical indexing, then perhaps the ACS will wake up and realise that not only is this protectionism bad for chemistry, but it is bad for their business. The database of CAS numbers has no real value in its own right. It is only useful as a pointer to other information. If the ACS were to make the use and indexing of CAS numbers free then it would be driving traffic to its own value added services. The ACS needs to move into the 21st (or perhaps the 20th) century in terms of both its attitudes and business models. We often criticise the former, but without shifts in the latter there is a real risk of critical damage to an organisation that still has the potential to make a big contribution to the chemical sciences. If the major chemical suppliers were to start printing PubChem CID’s on their bottles it might start to persuade the powers that be within the ACS that things need to change.
PMR: Again I agree with all of this. However the CAS number is also interlinked with the CAS content and these are difficult to disentangle. So the CAS number not only identifies a compound but links to the information in CAS. This information presumably has physical properties, reactions, etc. as well as the simple identity. However I cannot comment authoritatively on anything in CAS because by doing so I would violate the conditions that the University has signed up to. CAS might then cut us off and while that wouldn’t worry me, my colleagues would kill me. This is a greater sanction than legal action. It’s reminiscent of computational chemistry companies who forbid comments on their programs (bugs and benchmarks).
So, to finish; do people agree that CID is a good standard index to aggregate around? If so we should start writing to the major chemical manufacturers, perhaps through open letters in the general literature (obviously not JACS), to suggest that they include these on their packaging. I’m up for drafting something if people are prepared to sign up to it.
PMR: It’s critical that we hear from PubChem on this. We need to know more about the substance ID (SID) as well as the compound ID (CID).
So I shall post more tomorrow (I’m currently in Redmond – and have to crash). But here are some axioms:
- Almost everyone can dispense with the use of CAS numbers (the exceptions are where there are legal regulations requiring them or where they are mandated by a vendor, but we should seek to change those).
- Identifying chemicals can be very hard and there are no simple universal solutions, but…
- For most compounds it’s easy, and CAS is not required
- For common compounds there are alternative ways
- There have to be at least three ways of identifying compounds: names, identifiers, connection tables.
- The key problem in the 21st century is providing the relationships between these, which can be done with RDF.
- There do have to be authorities which are stable and which we can trust. That used to be CAS, now we should see if PubChem and its funders wish to take on some of that role.
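To illustrate the point about relating names, identifiers and connection tables, here is a minimal sketch that writes the relationships for one compound as RDF-style triples. The `http://example.org/chem/` namespace and the predicate names are invented for illustration and are not an existing vocabulary; a SMILES string stands in for a full connection table.

```python
import json

EX = "http://example.org/chem/"  # hypothetical namespace, not a real vocabulary


def literal(value):
    """Quote a string as a plain literal (json.dumps handles the escaping)."""
    return json.dumps(value)


def compound_triples(local_id, names, identifiers, smiles):
    """Relate one compound to its names, identifiers and connection table."""
    subject = f"<{EX}compound/{local_id}>"
    triples = [(subject, f"<{EX}hasName>", literal(name)) for name in names]
    for scheme, value in identifiers.items():
        triples.append((subject, f"<{EX}identifier/{scheme}>", literal(value)))
    triples.append((subject, f"<{EX}hasConnectionTable>", literal(smiles)))
    return triples


def to_ntriples(triples):
    """Serialise (subject, predicate, object) tuples as N-Triples-style lines."""
    return [f"{s} {p} {o} ." for s, p, o in triples]


ethanol = compound_triples(
    "ethanol",
    names=["ethanol", "ethyl alcohol"],
    identifiers={"pubchem_cid": "702"},
    smiles="CCO",
)
for line in to_ntriples(ethanol):
    print(line)
```

Because each statement is a separate triple, a new name or a new identifier scheme can be added without touching the ones already recorded.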
When we have created the new system it will be greatly superior to current practice and give us more confidence in what information we can use.
March 8th, 2008 at 5:32 am
We are legally required to supply vendor MSDS forms to our staff. The vendors have included CAS numbers on their MSDS forms, and we keep the forms in a database. So technically, we must be in breach of our SciFinder license?
If we get sued, I wonder whether the judge would side with the legal statutes or the contractual agreement?
What CAS should be doing is making CAS numbers an open standard – like PDF files – that everybody can adopt.