petermr's blog

A Scientist and the Web

 

Why and how we should move away from CAS numbers

In a recent post (CAS Discourages Using SciFinder to Help Curate Wikipedia) I commented on the refusal of CAS to allow Wikipedia to use the CAS numbers and/or related information obtained from their SciFinder(TM) product. As far as I know there are no current public printed sources from CAS that provide the same information. Effectively, therefore, CAS is exercising aggressive monopoly control over chemical information. The legal re-use of CAS information is explained here, where it is made clear that all information is copyright and that permission is almost always required, although collections of up to 10,000 CAS numbers do not require permission.

A User or Organization may include, without a license and without paying a fee, up to 10,000 CAS Registry Numbers or CASRNs in a catalog, website, or other product for which there is no charge. The following attribution should be referenced or appear with the use of each CASRN: CAS Registry Number® is a Registered Trademark of the American Chemical Society. CAS recommends the verification of the CASRNs through CAS Client Services(SM).

PMR: Wikipedia has between 1000 and 2000 chemical substances (ca 0.01% of the total number of substances in CAS). Wikipedia cares passionately about correctness and copyright. It is a fundamental policy of Wikipedia to quote sources – this is one of the critical platforms on which respect for WP rests.

The American Chemical Society – hitherto a respected learned society – is now telling a voluntary community of scholars that it forbids them to check their facts. It is preventing them disseminating chemistry.

I wonder if there is anything in the history of learned societies that matches this action. There are so many ways they could have responded positively. WP is nowhere near the 10,000 compound limit. They are not a threat (although CAS’s mail suggests they are scared of WP). CAS could have made a donation to Wikipedia of the 10,000 most common compounds in the CAS database. A “CAS-Wikipedia” page would give the correct CAS number-structure relationship, preventing error. There are so many positive things they could have done.

As it is they have done the following:

  • re-asserted their position that they care for revenue more than supporting the wider chemical community
  • re-advertised themselves as one of the least progressive learned societies
  • alienated a growing number of young scientists who look to the Web as a critical part of the future of chemistry

But worst of all they are implicitly encouraging bad chemistry. Here’s an example from a recent comment:

  1. Name (required) Says:
    March 8th, 2008 at 5:32 am

    We are legally required to supply vendor MSDS forms to our staff. The vendors have included CAS numbers on their MSDS forms, and we keep the forms in a database. So technically, we must be in breach of our SciFinder license?

    If we get sued, I wonder whether the judge would side with the legal statutes or the contractual agreement?

    What CAS should be doing is making CAS numbers an open standard – like PDF files – that everybody can adopt.

CAS numbers are widely used in chemical regulation and commerce to identify substances (see MSDS (WP) for example). This action from CAS will encourage people to guess CAS numbers. If a chemist wants the CAS number for acetone s/he will not now go to CAS (6 USD) – s/he’ll find a supplier’s catalog and take one from there. I know from experience that there are huge numbers of errors in number-structure relations, and so these errors will be propagated.

The commercial chemical community is ultra-conservative, but even so there is a limit to this central control. The use of CAS numbers has been abandoned by organisations such as PubChem for exactly this reason. PubChem now has nearly 20 million substances. It holds records for all compounds that are likely to occur on an MSDS. It’s highly respected (although ACS lobbied the US government to limit PubChem’s activities). It is part of the NIH and now – with the NIH mandate – effectively safe from the ACS. It provides a credible alternative.

We (including Wikipedia) should now switch from using CAS numbers to using PubChem IDs wherever possible. It won’t be a simple transition – certainly we shan’t find 100% overlap. But it will solve all the common substances and therefore 90%+ use of CAS numbers.
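To make the transition concrete, here is a minimal sketch of how a list of CAS numbers might be moved over to PubChem compound IDs (CIDs). It makes several assumptions that are not part of the argument above: the lookup table `CAS_TO_CID` is a tiny hand-entered illustration standing in for a properly curated mapping, and the function names are hypothetical. The one thing it can verify on its own is the check digit built into every CAS number, which catches many typing errors but says nothing about whether the number really belongs to the claimed structure.

```python
import re

# Illustrative table only (an assumption): a real migration would need a
# curated, openly licensed mapping rather than three hand-entered rows.
CAS_TO_CID = {
    "67-64-1": 180,   # acetone
    "64-17-5": 702,   # ethanol
    "71-43-2": 241,   # benzene
}

# A CAS number has 2-7 digits, a hyphen, 2 digits, a hyphen, 1 check digit.
CAS_PATTERN = re.compile(r"^(\d{2,7})-(\d{2})-(\d)$")


def cas_checksum_ok(cas):
    """Return True if the string is well formed and its check digit matches."""
    match = CAS_PATTERN.match(cas)
    if not match:
        return False
    digits = match.group(1) + match.group(2)
    # Weight the digits 1, 2, 3, ... from right to left, then take mod 10.
    total = sum(int(d) * w for w, d in enumerate(reversed(digits), start=1))
    return total % 10 == int(match.group(3))


def migrate(cas_numbers):
    """Split CAS numbers into those mapped to a CID and those left unresolved."""
    migrated, unresolved = {}, []
    for cas in cas_numbers:
        if cas_checksum_ok(cas) and cas in CAS_TO_CID:
            migrated[cas] = CAS_TO_CID[cas]
        else:
            unresolved.append(cas)
    return migrated, unresolved


if __name__ == "__main__":
    done, todo = migrate(["67-64-1", "67-64-2", "7732-18-5"])
    print("migrated:", done)
    print("needs manual curation:", todo)
```

In this run the mistyped "67-64-2" fails the check digit, while "7732-18-5" (water) is a valid number that simply is not in the toy table. Both end up in the unresolved list, which is where the less-than-100% overlap would surface for human curation.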

We shall need software. We and others are now developing the next generation of chemical informatics software using RDF (Resource Description Framework). RDF allows the description of ambiguities and ontologies. This will allow chemical information to be gleaned directly from authoritative sources using robots. (Of course some of the authorities are currently conservative and do not allow access to their material because of restrictive copyright and licences, but that is starting to change, even in chemistry.) As information becomes more open, the CAS system will be increasingly isolated in a world of chemical commerce. Robert Massie (Robert Massie on OA and PMR) worries that sites in China are stealing information from CAS:

many sites in China have sprung up to provide information on how to break into the computer systems of major US universities in order to gain access to SciFinder.

If I were running CAS that wouldn’t be my worry. I’d be terrified that in five years’ time the world – perhaps through China – will have developed an Open system that is rapidly replacing SciFinder because it is better as well as free.

And I shall be posting from time to time how I think this can be done. The first step is to transfer whatever is possible to PubChem.
