The issue with CAS identifiers

In response to a recent post by Glyn Moody (The World’s Leading Anti-Scientific Society)
[largely quoting me, so I shan’t repeat it…]

[GM] … Clearly, it’s time to kill off this pernicious closed CAS system, which is damaging science, by boycotting it entirely…

ChemSpiderMan replied:
I’ve made two comments on CAS and the Wikipedia commentary.
http://www.chemspider.com/blog/cas-discourages-using-scifinder-to-help-curate-wikipedia-structures-and-cas-numbers.html
http://www.chemspider.com/blog/enforcing-copyright-of-cas-numbers.html
I am not ready to abandon hope that the ACS/CAS can reach a point whereby they recognize the value of being more open, both in public relations terms and for their business. I believe it’s easy to declare that we should just abandon CAS and their dominant position but it is not so easy. The relationship between CAS numbers and their >100 years of literature/patents is deeply entrenched in their offering. They have very skilled people dealing with their systems and, other than the protectionism we judge is prevailing, their systems are good. Sure, they can be more open but let’s try and achieve that with a common-good discussion rather than abandoning them. While it’s easy to talk about RDF solutions and OWL and, and, and, these are solutions yet to be proven. They are valiant efforts and need to be pursued but they are yet to be proven. Also, think about the politics…if PubChem IDs prevail and damage CAS’s business then CAS’s initial views of PubChem damaging their business will have been validated…people will be out of jobs and all hell might break loose. I say bring the right people to the table to work through the complex business issues and do it soon.
That said, I’ll acknowledge that I prefer to try and navigate the complex issues to a mutually beneficial point rather than go into attack mode preferred by others.
5:56 PM
glyn moody said…
Thanks for your thoughts. Fortunately (for me) I have the luxury of not being directly involved with any of this. I also write from the position of one who blogs extensively about these issues, and is not afraid to rush in where angels fear to tread. I accept that there is plenty of room to negotiate and to attempt to move the ACS; I wish the best of luck to those trying to do that. But I also think it can be useful (good cop, bad cop) to suggest more outrageously radical solutions – like chucking the ACS completely.
Speaking as an outside observer, I have been frankly disgusted (as in Tunbridge Wells) by its behaviour – not just on this occasion, but previously, too. As I said in the post, science seems a quintessentially open endeavour, and for the ACS to put money over knowledge seems unforgivable.

PMR: I will try and clarify the issues, and I hope that the ACS or CAS can reply if I get them wrong…

  • There is no god-given chemical informatics system – Principia Chimica. All systems of names, chemical structures, identifiers have a degree of arbitrariness. To avoid chaos we look to authorities to provide and to some extent regulate these. Authorities include IUPAC, CAS, PubChem, SwissProt, Genbank, International Union of Crystallography and hundreds more.
  • These authorities influence the community by (a) respect (e.g. the International Unions and learned societies) (b) market dominance (as for Thomson ISI for citations) (c) regulation, either legal or enforced by other authorities (e.g. patents, FDA, etc.). Anything in (c) is very difficult to change. It is up to the players to argue their cases.
  • There is usually no absolute right or wrong – what is the CAS number for penicillin? For many years the structure wasn’t known (Dorothy Hodgkin solved it) but there was still a name and an identifier. I do not now know whether “penicillin” has a CAS number. What is the CAS number for “glucose” – not alpha-D-glucopyranoside…? So although it is a worthy attempt to curate the “right” structure for a given name, in many cases it is impossible – it is simply a question of authority. What is the structure of “snow”? This depends on an authority and cannot be answered without also quoting them.

The ONLY definitive statement of the relationship between CAS numbers, structures and names is from Chemical Abstracts. It is no use looking them up in catalogs as these are frequently wrong and in any case do not carry the authority.
To give an example from another authority – what is the PDB number for insulin (also solved by Dorothy Hodgkin and coworkers)? I go to the PDB site and type in “insulin” and find
[Image: insulin.jpg (PDB search results for “insulin”)]
Ooops! There’s more than one “insulin” – (there’s more than one “penicillin” as well). But I can browse all of them if required. If I want the one shown, its accession number is 1T1K. If I send that to any bioscientist they will know what I am talking about. But if there is any disagreement about the identity of 1T1K then this page is the ONLY authority.
The problems with the current CAS identifiers are:

  • There is NO public lookup of them (unlike everything in bioscience including PubChem, and PubMed).
  • It is now expressly FORBIDDEN to transmit publicly the results of any lookup of this information

Traditionally a library would purchase a printed copy of Chemical Abstracts. Libraries were proud of their CAS – it could run to several metres of cellulose and carbon. But then anyone with access to a library (the BL if necessary) could resolve questions like this. I doubt very much if there was a prohibition on telling people what names were associated with which CAS numbers.
In the electronic age control is easier. Note that the control will not be through copyright but by contract to subscribers. ACS can and does cut off subscribers for what they unilaterally determine are breaches of contract. The current prohibition specifically relates to contracts. Since, however, I know of no way of accessing modern CAS information other than through a contract-based system (and I’d be grateful to know if there is) I will break my institution’s contract if I try to help create better science by clarifying information.
ChemSpiderMan and some Wikipedians take the view that WP should negotiate with CAS. Since WP is a democracy it will find its own way of resolving this. I take the following view:

  • Wikipedia requires authoritative sources for its information.
  • The assignment of a CAS number to one or more WP entries requires the authority of CAS
  • CAS forbids WP to use this authority
  • Therefore WP cannot include CAS numbers if it wishes to uphold its principles of authoritative sources – there are NONE available to it.

This is the logical argument for a boycott and I’d be happy to see counter arguments.
On the political front I regard CAS’s action as unacceptable for a scientific society which enjoys charitable status by virtue of its respect in the community. Charities have a responsibility to help the community – this action is diametrically opposed to that responsibility.


7 Responses to The issue with CAS identifiers

  1. DrZZ says:

    I think it would help if people think precisely about what a CAS number is (and is not). Peter might be able to dig up some references, but I distinctly remember at the InChI meetings the CAS folks repeatedly pointing out that the CAS number was never intended to be a unique chemical identifier and there were some serious problems in using it as such. It was designed to be and is an index to information extracted from the chemical literature. That extraction took considerable effort and expense and it is not at all unreasonable that CAS put policies in place that allow them to recoup that expense and continue providing that service. I think that any discussion that doesn’t recognize these things will lose contact with reality pretty quickly. There is no doubt whatsoever that the abstraction done by CAS is very valuable and therefore CAS numbers can be valuable, but any use of CAS numbers has to recognize that they are controlled by CAS and that CAS will be looking for a way to fund their continued existence. I agree with Antony that this still leaves room for discussions about mutually beneficial arrangements, but those arrangements would result in a fairly narrow range of utility.
    While I agree that CAS can determine policies for the use of its work, it doesn’t mean I have to like it and it certainly doesn’t mean that I have to agree it’s best for (or even consistent with) good science. I just think in terms of damage to moving science forward, closed CAS numbers are trivial compared to ACS policies on datamining of journal articles. If we are going to get into a fight, I would rather it be about something more clearly worth fighting for.

  2. pm286 says:

    (1) Point well taken…

  3. Steven Bachrach says:

    I think it is important to keep in mind that the CAS number was designed to index the Chemical Abstracts database. That gives it some remarkable flexibility – like the fact that there are CAS numbers for “kerosene” and “petroleum ether” – though neither is a pure chemical substance.
    As Peter has suggested, “penicillin” does have a CAS number but there is no chemical structure associated with this CAS number. Interesting – but again, the number is being used to index an entry into the database.
    If we are to build a system for exchanging information about an object (chemical or otherwise), it seems to me that a system for identifying that object (i.e. giving it a “name”) should be accomplished in such a way that the sender and receiver of the “name” do not need to resort to a third party to know just what the hell they are talking about! If the CAS number is the means for “naming” chemical substances, then both the sender and receiver must consult (and in this case, pay) CAS to determine what the chemical object is.
    The advantage of a third-party intermediary is that if that third party can be trusted, we have a method for vetting the identity of the object. CAS does a very admirable job at vetting its numbers and the community generally regards CAS with a great deal of trust and authority.
    If we are to create some open system of naming chemical objects, we will need some authority to vet these names. The InChI/InChIKey offers the authority of IUPAC, and a tool for both sender and receiver to interpret the name without contacting (or paying) an authority. The downside is that InChI has some real restrictions on its applicability – though continual work is in the offing to extend its service. PubChem, while being a free intermediary, rests at heart upon users’ depositions for its authority and as such is pretty rife with problems. The PubChem ID was also designed as a pointer into the database – not designed to be a unique identifier of chemical objects. So, without significant work on curation, to me the PubChem ID has a long way to go to become a meaningful naming system.

  4. Since the recent exchanges were all initiated by the CAS Numbers on Wikipedia I just thought I would point out the article on Penicillin on Wikipedia here: http://en.wikipedia.org/wiki/Penicillin. As defined there: “Penicillin (abbreviated PCN) is a group of β-lactam antibiotics “. That said there is a ChemBox in the article for Penicillin G, “one” of the series. It has a CAS number: 61-33-6. This links back to MeSH here: http://www.nlm.nih.gov/cgi/mesh/2006/MB_cgi?mode=&term=Penicillin+G
    MeSH lists many CAS Numbers for various Penicillin G’s…
    Related Number 113-98-4 (mono-K salt)
    Related Number 21193-94-2 (sulfate)
    Related Number 30411-69-6 ((5 beta)-isomer)
    Related Number 47294-44-0 ((6 alpha)-isomer)
    Related Number 69-57-8 (mono-Na salt)
    Related Number 75333-20-9 (mono-NH4 salt)
    This info is just for readers interested in the details.

  5. Steven Bachrach says:

    And just to reiterate – and this points to what can be done when one uses index numbers rather than identifiers tied to structures – CAS has a CA number for “penicillin” that’s not one of the ones listed above in Tony’s comment and 11493 References are linked to this CA number. Also, the listing in SciFinder for penicillin also has four deleted registry numbers for penicillin.

  6. Steven Bachrach says:

    Oh – and one other thing – one of the CA numbers listed in Tony’s comment above is not a valid CA number! So there’s another problem in curation.

  7. Dave says:

    “As I said in the post, science seems a quintessentially open endeavour, and for the ACS to put money over knowledge seems unforgivable.”
    There’s the problem. Too many scientific endeavors are guilty of this. Almost anyone funded is guilty.
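
A note on comment 6 above, which says that one of the listed numbers is not a valid CA number. One check that needs no access to CAS at all is the published check-digit rule for registry numbers: the last digit should equal the weighted sum of the preceding digits (rightmost digit times 1, next times 2, and so on) modulo 10. The sketch below is only that arithmetic test. The function names are made up for illustration, and passing the test says nothing about whether a number is actually registered or what substance it indexes; only CAS can answer that.

```python
import re

# Registry numbers are written as 2-7 digits, 2 digits, 1 check digit.
CAS_PATTERN = re.compile(r"^(\d{2,7})-(\d{2})-(\d)$")


def cas_check_digit(body_digits: str) -> int:
    """Check digit for the digits that precede it (hyphens removed).

    The rightmost digit is weighted 1, the next 2, and so on;
    the check digit is the weighted sum modulo 10.
    """
    total = sum(
        weight * int(digit)
        for weight, digit in enumerate(reversed(body_digits), start=1)
    )
    return total % 10


def is_well_formed_cas(number: str) -> bool:
    """True if the string has the registry-number shape and its check digit matches.

    This is a format check only, not a lookup: it cannot confirm registration.
    """
    match = CAS_PATTERN.match(number)
    if match is None:
        return False
    first, second, check = match.groups()
    return cas_check_digit(first + second) == int(check)


if __name__ == "__main__":
    # The numbers quoted from MeSH in comment 4.
    listed = [
        "61-33-6", "113-98-4", "21193-94-2", "30411-69-6",
        "47294-44-0", "69-57-8", "75333-20-9",
    ]
    for number in listed:
        print(number, is_well_formed_cas(number))
```

Run against the numbers quoted in comment 4, this arithmetic flags only 30411-69-6, whose preceding digits give a check digit of 9 rather than 6. Whether that is the entry the commenter had in mind is not stated, and a number can pass the check and still be unassigned or deleted.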
