Cameron Neylon picks up the theme of an alternative approach to identifying chemicals (especially in the context of CAS’s blanket refusal to allow normal scientific practice – the quoting of authority). I reproduce some key quotes and then introduce what I hope will be a series of posts explaining the basic principles. Be warned – identifying chemicals ranges from straightforward to almost impossible. There are no trivial answers such as “use InChI” or “use PubChem CID”. I am fully supportive of Cameron’s post, but comment where it needs tightening…
What to use as the primary key for chemicals?
Now following on from my post about feeds it is clear that we also want to provide a good range of searchable indexes for people to be able to tell what we are using. So we would ideally want to expose InChI, InChIKey, SMILES, CML perhaps, PubChem IDs etc. These can all be converted one to the other using web services so we don’t need to type all of them in manually. All that is required is a nice logging screen where we can drop in one type of index key, the size of the bottle, supplier, lot numbers, perhaps a link to safety data. The real question is what is the index key that is easiest to input? For those of you in or near a laboratory I suggest an exercise. Go and pick up the nearest bottle of commodity stuff from a commercial supplier (i.e. not oligos or peptides). What is written on it? What is a nice short identifier that can consistently be found on pretty much any bottle of chemicals? For those unlucky people who don’t have a laboratory at their fingertips I have provided a clue below.
PMR: Most of these are interconvertible, but not always. For example, there are substances that PubChem can hold that do not have SMILES or InChI. And CML can describe substances that the others cannot (e.g. non-stoichiometric compounds and certain mixtures). I’ll explain later.
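The logging screen Cameron describes could be backed by a record along the following lines. This is only a minimal sketch: the class name, field names, and the set of accepted key types are all hypothetical, and the example values are placeholders. Because not every key can be converted into every other, the sketch stores the key exactly as entered, together with its type, rather than assuming a conversion will always succeed.

```python
from dataclasses import dataclass
from typing import List, Optional

# Hypothetical set of index key types a logging screen might accept.
ALLOWED_KEY_TYPES = {"pubchem_cid", "inchikey", "inchi", "smiles"}


@dataclass(frozen=True)
class BottleRecord:
    """One bottle logged into the laboratory inventory (illustrative only)."""
    key_type: str                      # which kind of index key was typed in
    key_value: str                     # the key exactly as entered
    supplier: str
    lot_number: str
    bottle_size: str                   # free text, e.g. "25 g"
    safety_url: Optional[str] = None   # optional link to safety data


def log_bottle(inventory: List[BottleRecord], record: BottleRecord) -> BottleRecord:
    """Append a record after checking that its key type is one we recognise."""
    if record.key_type not in ALLOWED_KEY_TYPES:
        raise ValueError(f"unknown key type: {record.key_type!r}")
    if not record.key_value.strip():
        raise ValueError("key value must not be empty")
    inventory.append(record)
    return record


# Usage with placeholder values.
inventory: List[BottleRecord] = []
log_bottle(
    inventory,
    BottleRecord(
        key_type="pubchem_cid",
        key_value="5908",
        supplier="Example Supplier",
        lot_number="LOT-0001",
        bottle_size="25 g",
    ),
)
```

Keeping the key type alongside the value means the other identifiers can be looked up later where a conversion exists, and simply left blank where it does not.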
The Chemical Abstracts Service number is the one identifier that can reasonably reliably be found on most commercially supplied substances. Yet, as described by Peter Murray-Rust and Antony Williams recently, you can’t look these up without paying for them. And indeed by recording them for our own purposes (say in a database of the compounds we have in the laboratory) we may be violating the terms of the license.
PMR: Exactly.
So what to do? Well we can adopt another standard or standards. Jean-Claude Bradley argued in a comment on my recent post that InChIKey is the way to go, but for this specific purpose (logging materials in) this may be too much to type in many cases (certainly SMILES, InChI and CML would be). You can’t expect people to draw in the structure each time a compound comes in, particularly if we get into arguments about which precise salt of cAMP we are using today. What is required is a simple, relatively short number. This is what makes the CAS number so appealing; it is short, easily typed in, and printed on most bottles.
PMR: InChI and CAS are fundamentally different and one cannot replace the other. I will explain later.
So, along with Peter I think the answer is to use PubChem CID numbers. PubChem doesn’t use CAS numbers and CAS actively lobbied the US government to limit the scope of PubChem. PubChem CIDs are relatively short, and there are a range of web services from which other descriptions can be retrieved (see e.g. PubChem Power User Gateway). The only thing that is missing is the addition of CIDs on bottles. If we can get wide enough agreement on this I think the answer is to start writing to the suppliers. I would have thought it’s no great effort on their part to add CIDs (or if there is something better, some other index) to the bottles, and it provides a lot of extra value for them. PubChem can provide links through to up to date safety data (without the potential legal issues that maintaining a database of MSDS forms with CAS numbers creates), it provides free access to a supplier index through which customers can find them, and it could also save them a small fortune in CAS license fees.
PMR: This is exactly the right political approach. There are some technical points that need to be addressed but we don’t need to do it all at once.
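To make "a range of web services from which other descriptions can be retrieved" concrete, here is a small sketch that builds a lookup URL for a CID. It assumes the URL layout of PubChem's present-day PUG REST service, which is not necessarily the interface Cameron had in mind when he mentioned the Power User Gateway; the function name and defaults are hypothetical. The sketch only constructs the URL and makes no network request.

```python
from typing import Sequence

# Assumed base URL of PubChem's PUG REST service.
PUG_REST_BASE = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"


def cid_property_url(cid: int, properties: Sequence[str] = ("InChI", "InChIKey")) -> str:
    """Build a URL asking for the given properties of one compound ID as JSON."""
    if cid <= 0:
        raise ValueError("a CID must be a positive integer")
    if not properties:
        raise ValueError("at least one property name is required")
    wanted = ",".join(properties)
    return f"{PUG_REST_BASE}/compound/cid/{cid}/property/{wanted}/JSON"


# Usage: the URL can be pasted into a browser or passed to any HTTP client.
url = cid_property_url(5908)
```

A CID typed in from a bottle would then be enough to fetch the longer identifiers, for the cases where PubChem can generate them.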
There is another side to this, which is that if there is a wholesale shift (or even the threat of a shift) away from CAS as the only provider of chemical indexing, then perhaps the ACS will wake up and realise that not only is this protectionism bad for chemistry, but it is bad for their business. The database of CAS numbers has no real value in its own right. It is only useful as a pointer to other information. If the ACS were to make the use and indexing of CAS numbers free then it would be driving traffic to its own value added services. The ACS needs to move into the 21st (or perhaps the 20th) century in terms of both its attitudes and business models. We often criticise the former, but without shifts in the latter there is a real risk of critical damage to an organisation that still has the potential to make a big contribution to the chemical sciences. If the major chemical suppliers were to start printing PubChem CIDs on their bottles it might start to persuade the powers that be within the ACS that things need to change.
PMR: Again I agree with all of this. However, the CAS number is also interlinked with the CAS content and these are difficult to disentangle. So the CAS number not only identifies a compound but links to the information in CAS. This information presumably includes physical properties, reactions, etc. as well as the simple identity. However, I cannot comment authoritatively on anything in CAS because by doing so I would violate the conditions that the University has signed up to. CAS might then cut us off and while that wouldn’t worry me, my colleagues would kill me. This is a greater sanction than legal action. It’s reminiscent of computational chemistry companies who forbid comments on their programs (bugs and benchmarks).
So, to finish; do people agree that CID is a good standard index to aggregate around? If so we should start writing to the major chemical manufacturers, perhaps through open letters in the general literature (obviously not JACS), to suggest that they include these on their packaging. I’m up for drafting something if people are prepared to sign up to it.
PMR: It’s critical that we hear from PubChem on this. We need to know more about the substanceId as well as the compound ID.
So I shall post more tomorrow (I’m currently in Redmond – and have to crash). But here are some axioms:
- Almost everyone can dispense with the use of CAS numbers (the exceptions are where there are legal regulations requiring them or they are mandated by a vendor, but we should seek to change those).
- Identifying chemicals can be very hard and there are no simple universal solutions, but…
- For most compounds it’s easy, and CAS is not required
- For common compounds there are alternative ways
- There have to be at least three ways of identifying compounds: names, identifiers, connection tables.
- The key problem in the 21st century is providing the relationships between these, which can be done with RDF.
- There do have to be authorities which are stable and which we can trust. That used to be CAS, now we should see if PubChem and its funders wish to take on some of that role.
When we have created the new system it will be greatly superior to current practice and give us more confidence in what information we can use.
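As a rough illustration of the axiom about relationships, the sketch below holds RDF-style (subject, predicate, object) triples in plain Python. It is not a real RDF library, the subject and predicate names are hypothetical, and it takes at face value the three candidate CIDs for thimerosal that a commenter raises below. No connection table is included, since that would mean asserting a structure the post does not give.

```python
from typing import List, Tuple

Triple = Tuple[str, str, str]

# Hypothetical vocabulary: one substance linked to a name and several identifiers.
TRIPLES: List[Triple] = [
    ("substance:thimerosal", "hasName", "Thimerosal"),
    ("substance:thimerosal", "hasPubChemCID", "5908"),
    ("substance:thimerosal", "hasPubChemCID", "67361"),
    ("substance:thimerosal", "hasPubChemCID", "16682923"),
]


def objects(triples: List[Triple], subject: str, predicate: str) -> List[str]:
    """All values attached to a subject by a given predicate, in stored order."""
    return [o for s, p, o in triples if s == subject and p == predicate]


def subjects_with(triples: List[Triple], predicate: str, obj: str) -> List[str]:
    """All subjects that carry the given predicate/value pair."""
    return [s for s, p, o in triples if p == predicate and o == obj]


# Usage: go from a substance to its identifiers, and from an identifier back.
cids = objects(TRIPLES, "substance:thimerosal", "hasPubChemCID")
owners = subjects_with(TRIPLES, "hasPubChemCID", "67361")
```

The point of the design is that one substance can carry several identifiers of the same kind without any of them being declared "primary"; deciding which to trust is a separate, curated statement.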
>Do people agree that CID is a good standard index to aggregate around?
It’s easy to come up with an identifier system. PubChem CIDs and ChemSpider IDs are just two examples of many.
Curating the database backing the identifier is where the real work is, especially in a system designed for all of chemistry. Few have tackled the problem head-on, including PubChem:
http://depth-first.com/articles/2006/12/12/the-problem-with-ferrocene
(1) Absolutely right Rich. I shall address this later. The question is whether we can find an area where our efforts can make real impact. I think Wikipedia is one place to start.
Rich has made an appropriate statement and likely knows where I stand on this. I’ve declared my appreciation for curators and their hard work. ‘http://www.chemspider.com/blog/curators-perform-heroic-duties-they-should-be-celebrated.html’
I don’t feel that the majority of people understand the challenge fully. It becomes more obvious as you attempt to “do” curation.
Regarding curation…a system is already in place on ChemSpider and working, to a certain extent, especially for organics (http://www.chemspider.com/blog/chemspider-has-curated-over-500-comments.html) and YES, we still do not have organometallics, inorganics and polymers supported the way I would like yet. Our next shift will be to support “substances” rather than structures and work is underway. This is particularly to support “samples” that are as yet uncharacterized but do have properties: sample ID, color, material characteristics, spectra, other analytical data…but not yet “identified”. The outcome of this will be substance definition and management.
For now I have to agree with Rich…curating the database is where the real work is. There is work to do but an eye for detail is an absolute must. And robots are dangerous without validation…I know, our own robots are watched carefully and we are learning all the time about how to tweak and improve.
I’m not a chemist, so pardon my possible naivete, but I’m interested in nomenclature and categorization schemes, and I’ve done some of my own work attempting to correlate various databases. One problem I’ve noticed with PubChem CIDs is that they don’t tend to be unique, or at any rate, that it can be hard to figure out which is the “primary” CID. The first example I encountered was Thimerosal: is it 5908, 67361, or 16682923?