petermr's blog

Open Data in Science (pod and vodcasts)

Posted on March 20, 2008 by pm286

Margaret Henty (ANU) has done a very professional job of collecting the material (of several media types) from the APSR 2008 meeting in Brisbane:

APSR is pleased to announce that podcasts and vodcasts of Open Access Collections held in Brisbane on February 14 are now available at the following url.
http://www.apsr.edu.au/open_access_collections/presentations.html
Because vodcast files are very large, they are available in parts (for those with bandwidth issues) and in total (for those in bandwidth plenty).
Thanks are due to the team of technicians who made this all possible.
***********************************************************************
Margaret Henty
National Services Program Coordinator
Australian Partnership for Sustainable Repositories
W. K. Hancock Building (#43)
The Australian National University
Canberra, ACT, 0200, AUSTRALIA
phone 61 2 6125 7685 mob. 0404 878 442
fax 61 2 6125 5526
http://www.apsr.edu.au

I haven’t yet finished downloading, and I’m pleased / scared to see what I actually looked like and said.
Because my presentations have live demos it is difficult to capture them on traditional slide media so I rely on occasional videos of the event. So many thanks to Margaret (and for a great time in Australia).

Posted in Uncategorized | Leave a comment

Leo Waaijers: DARE to Inspire

Posted on March 14, 2008 by pm286

I was privileged to be asked to present a homage to Leo Waaijers yesterday at the SURF foundation in Utrecht (actually in an old castle+microbrewery – Stadskasteel Oudaen). Leo has been an architect of so much in the Netherlands that I cannot list all of it – perhaps the best is simply to reference one of his latest articles in Ariadne where he reviews eScholarship – practice and technology – and the principles of Openness.
The theme of the afternoon was “Inspire”. It was a great occasion where, after the presentations, we had a theatrical homage to Leo – hooded monks representing the four elements, and an excellent set of computer graphics and personal tributes from round the world – including many from UK/JISC.
I took the opportunity to honour Leo’s work in creating the DARE repository:

Promise of Science makes over 15,000 e-theses searchable.
DAREnet is the result of the DARE programme, funded by SURF. All Dutch universities, the National Library of the Netherlands, the Royal Netherlands Academy of Arts and Sciences (KNAW) and the Netherlands Organisation for Scientific Research (NWO) participate. From 1 January 2007 KNAW Research Information has taken over responsibility for the DAREnet website.

I was able to find over 2000 publications when searching for “chemistry”. I haven’t checked but I think most are theses. This is a fantastic resource. I could download one and within a minute[*] OSCAR had read much of the chemical data from it. It changes the way we should capture and publish data. The Netherlands can justly feel proud in leading the world with the coherence and commitment of their program on capturing eScholarship.

PS [*] Yes, I have to admit that I had to convert the PDF to something useful (ASCII) and then the parsing takes a few seconds. So, a simple message.
Besides archiving your PDF, archive the WORD or LaTeX file. That’s the message. It’s simple, so I’ll repeat it:
Besides archiving your PDF, archive the WORD or LaTeX file.


OKCON 2008

Posted on March 14, 2008 by pm286

I have been so busy I haven’t managed to blog the Open Knowledge Foundation’s OKCON 2008. Here it is in brief:
OKCon: The Annual Open Knowledge Conference

When: Saturday 15th March 2008 10:30-18:30 (doors open 10am)
Where: Clement House (D602), London School of Economics, London, UK (getting there) (another map)
Programme: programme page
Registration: register page
Wiki: http://okfn.org/wiki/okcon/
Mailing lists: Join the announce list and/or the discussion list

PMR: Don’t offhand know if there’s space still available – maybe someone can respond.
Rufus Pollock has asked me to chair the session on visualization.
Session 2 (1200-1315): Visualization and Analysis

Liz Turner (Freelance Designer and Visualizer Extraordinaire)
Gael Varoquaux (Mayavi2 – the next Generation Visualization Toolkit)
Martin Albrecht (SAGE the Open Source Mathematics Engine)

I’m really excited about this. I’ve been involved in graphics for nearly 30 years and the world still hasn’t solved the problem of how to get graphics to the client. But graphics is so important in making your message obvious and widely known. So there is still a lot of scope for work here.
And while I’m here, enormous kudos to Rufus for getting OK off the ground and keeping focussed. It’s made critical contributions over the last year – e.g. with Science Commons.
Look forward to meeting old and new friends tomorrow.
I’d like to reemphasize how i


Commercialization of Open Source code (Bioclipse)

Posted on March 14, 2008 by pm286

Bioclipse is an Open Source chemo- and bio-informatics toolkit (rich client) developed by Ola Spjuth and colleagues and a wide virtual community (including me). It’s Open Source (LGPL). Under the heading Bioclipse pirated Christoph Steinbeck writes

A company called InfoCom, located in Arizona, advertises a product called iBioTech, which by all evidence is identical with Bioclipse. They say their iBioTech product has a plugin for chemoinformatics called “bc_cdk” (surprise :-)) and one called “bc_jmol” for 3D visualization.
While this is something that we have explicitly not tried to prevent; still it is kind of appalling to see it happen. My personal expectation on having my/our code being used by others was always that they would offer additional services and still acknowledge the original authors.
The case that we are currently seeing includes renaming and the pretension that the product has been created by them (citation: iBioTech from InfoCom laboratories). The question here is of course: Is this covered by the Bioclipse license or not?
I was surprised that my first two attempts to dig into the Bioclipse license led into nowhere. There was no such thing as a LICENSE file in the top level directory in bioclipse trunk. The next thing to do was to look at the code of net.bioclipse.BioclipsePlugin.java, which in my opinion should contain a header clearly stating the license for this code. Nothing there.
But of course, the Bioclipse website has a full coverage of the license issues with Bioclipse. I would clearly say that the redistribution of a number of parts of Bioclipse, including CDK, requires the distributor to make it clear to the customer where the source code is available, which then automatically implies giving proper credit.

PMR: There is further discussion on the mailing list (https://lists.sourceforge.net/lists/listinfo/bioclipse-devel).
All of us know when we write Open Source that others may re-use it. Most licenses allow commercialization without the permission of the authors. Almost all licences require that the modified version acknowledge the provenance and authors of the earlier version. So far, so good.
But there are also ethical considerations. I may take a piece of Open Source and develop it for my own purposes but in so doing I create a Fork. As the WP article says:

Free (libre) or open source software is, by definition, that which it is possible to fork without permission of the original creator. However, licensed forks of proprietary software (e.g. Unix) can also be important.

On the matter of forking, the Jargon File says:

“Forking is considered a Bad Thing—not merely because it implies a lot of wasted effort in the future, but because forks tend to be accompanied by a great deal of strife and acrimony between the successor groups over issues of legitimacy, succession, and design direction. There is serious social pressure against forking. As a result, major forks (such as the Gnu-Emacs/XEmacs split, the fissioning of the 386BSD group into three daughter projects, and the short-lived GCC/EGCS split) are rare enough that they are remembered individually in hacker folklore.”

PMR: This expresses the Community Norms (the Science Commons phrase). It’s not illegal, but it may cause massive problems.

In the current case my problem (and, I suspect, that of most of the other authors) is:

The company has not acknowledged the identity of the code (they have simply renamed it)
The company has not acknowledged the provenance of the code or the contributions of the authors
The company has not (at least from current reports) included a licence which indicates that the software is based on Open Source.

I wait to see confirmation that these are all incompatible with the licence.

There is also the likelihood that the code will devalue Bioclipse. It confuses the community. There may be mutant versions. If the mutant versions cause bugs this brings unjustified criticism on Bioclipse. Any restrictions imposed by the company may be confused with Bioclipse policy, etc.

In my own case (JUMBO) I use the Artistic licence, which specifically requires third parties to (a) acknowledge the provenance and (b) choose a different name for the software.

This post is a first reaction. I – and I assume the rest of the Bioclipse team – would be grateful for any experiences of this and how it should be treated.


CAS will cooperate with Wikipedia

Posted on March 12, 2008 by pm286

Antony Williams reports that CAS has agreed to cooperate with Wikipedia Chemistry on the use of CAS numbers: A Message of Support and Public Service from the Chemical Abstracts Service

[…]
This week conversations have been ongoing between WP:Chem and CAS. The conversations have been conducted by Martin Walker, a member of the WP:Chem team as well as the ChemSpider Advisory Group. Martin and I have similar opinions in regards to how to participate in the community and I honor his approach in working through this potentially difficult situation. The outcome of the discussions are declared here on Wikipedia.

New announcement from CAS

CAS, a division of the American Chemical Society, is pleased to announce that it will contribute to the Wikipedia project. CAS will work with Wikipedia to help provide accurate CAS Registry Numbers® for current substances listed in Wikiprojects-Chemicals section of the Wikipedia Chemistry Portal that are of widespread general public interest.
The CAS Registry is the world’s most comprehensive collection of chemical substances and the CAS Registry Number is the recognized global standard for chemical substance identification.
CAS views Wikipedia as an important societal tool for the general public, and this collaboration with Wikipedia is in line with CAS’ mission as a Division of the American Chemical Society.
We look forward to working with the Wikipedia volunteers over the next few weeks to make this happen. Eshively (talk) 13:40, 12 March 2008 (UTC)
[Chemspider] I think this is excellent. I implicitly agree with the statement “The CAS Registry is the world’s most comprehensive collection of chemical substances.” For CAS to offer support to the Wikipedia team for the curation project is, for me, an indication of commitment to public service and I am indebted to the participants in this decision. I’m excited to get back underway with the curation project and will start up my efforts again this weekend. This decision by CAS has invigorated me to keep eyeballing structures as fast (and carefully) as possible.
My sincere appreciation is extended to the CAS management team and decision-makers. My gratitude to WP:Chem for staying engaged in the conversation to get to this outcome. My encouragement to us all to get this project done and have a high quality validated dataset of chemicals available as a public resource. Onwards and upwards!

Although there are (I think) over 20 million chemicals with CAS numbers the vast majority are likely only to have been reported once or a very small number of times. It is the CAS numbers for the common compounds (perhaps 10,000) that are valuable. They are widely used and available in catalogs, safety data, etc. Most of these will find their way into Wikipedia where chemists and other scientists will add information and annotations. Note that there are no “right” or “wrong” assignments of structure and properties, but rather annotations with more or less authority – it is the authority that is critical. While Wikipedians can use the public literature to make assertions about the structure, properties and nomenclature of compounds, only CAS can act as the authority to link CAS numbers to names or structures.


The issue with CAS identifiers

Posted on March 10, 2008 by pm286

To a recent post by Glyn Moody (The World’s Leading Anti-Scientific Society)
[ largely quoting me so I shan’t repeat it…]

[GM] … Clearly, it’s time to kill off this pernicious closed CAS system, which is damaging science, by boycotting it entirely…

ChemSpiderMan replied: I’ve made two comments on CAS and the Wikipedia commentary.
http://www.chemspider.com/blog/cas-discourages-using-scifinder-to-help-curate-wikipedia-structures-and-cas-numbers.html
http://www.chemspider.com/blog/enforcing-copyright-of-cas-numbers.html
I am not ready to abandon hope that the ACS/CAS can reach a point whereby they recognize the value, both public relations wise as well as good for their business. I believe it’s easy to declare that we should just abandon CAS and their dominant position but it is not so easy. The relationship between CAS numbers and their >100 years of literature/patents is deeply entrenched in their offering. They have very skilled people dealing with their systems and, other than the protectionism we judge is prevailing, their systems are good. Sure, they can be more open but let’s try and achieve that with a common-good discussion rather than abandoning them. While it’s easy to talk about RDF solutions and OWL and, and, and, these are solutions yet to be proven. They are valiant efforts and need to be pursued but they are yet to be proven. Also, think about the politics…if PubChem IDs prevail and damage CAS’ business then CAS initial views of PubChem damaging their business will have been validated…people will be out of jobs and all hell might break loose. I say bring the right people to the table to work through the complex business issues and do it soon.
That said, I’ll acknowledge that I prefer to try and navigate the complex issues to a mutually beneficial point rather than go into attack mode preferred by others.
glyn moody said…: Thanks for your thoughts. Fortunately (for me) I have the luxury of not being directly involved with any of this. I also write from the position of one who blogs extensively about these issues, and is not afraid to rush in where angels fear to tread. I accept that there is plenty of room to negotiate and to attempt to move the ACS; I wish the best of luck to those trying to do that. But I also think it can be useful (good cop, bad cop) to suggest more outrageously radical solutions – like chucking the ACS completely.
Speaking as an outsider observer, I have been frankly disgusted (as in Tunbridge Wells) by its behaviour – not just on this occasion, but previously, too. As I said in the post, science seems a quintessentially open endeavour, and for the ACS to put money over knowledge seems unforgivable.

PMR: I will try and clarify the issues, and I hope that the ACS or CAS can reply if I get them wrong…

There is no god-given chemical informatics system – Principia Chimica. All systems of names, chemical structures, identifiers have a degree of arbitrariness. To avoid chaos we look to authorities to provide and to some extent regulate these. Authorities include IUPAC, CAS, PubChem, SwissProt, Genbank, International Union of Crystallography and hundreds more.
These authorities influence the community by (a) respect (e.g. the International Unions and learned societies) (b) market dominance (as for Thomson ISI for citations) (c) regulation, either legal or enforced by other authorities (e.g. patents, FDA, etc.). Anything in (c) is very difficult to change. It is up to the players to argue their cases.
There is usually no absolute right or wrong – what is the CAS number for penicillin? For many years the structure wasn’t known (Dorothy Hodgkin solved it) but there was still a name and an identifier. I do not now know whether “penicillin” has a CAS number. What is the CAS number for “glucose” – not alpha-D-glucopyranoside…? So although it is a worthy attempt to curate the “right” structure for a given name, in many cases it is impossible – it is simply a question of authority. What is the structure of “snow”? This depends on an authority and cannot be answered without also quoting them.

The ONLY definitive statement of the relationship between CAS numbers, structures and names is from Chemical Abstracts. It is no use looking them up in catalogs as these are frequently wrong and in any case do not carry the authority.
To give an example from another authority – what is the PDB number for insulin (also solved by Dorothy Hodgkin and coworkers)? I go to the PDB site and type in “insulin” and find

Ooops! There’s more than one “insulin” – (there’s more than one “penicillin” as well). But I can browse all of them if required. If I want the one shown its accession number is 1T1K. If I send that to any bioscientist they will know what I am talking about. But if there is any disagreement about the identity of 1T1K then this page is the ONLY authority.
The problems with the current CAS identifiers are:

There is NO public lookup of them (unlike everything in bioscience including PubChem, and PubMed).
It is now expressly FORBIDDEN to transmit publicly the results of any lookup of this information

Traditionally a library would purchase a printed copy of Chemical Abstracts. Libraries were proud of their CAS – it could run to several metres of cellulose and carbon. But then anyone with access to a library (the BL if necessary) could resolve questions like this. I doubt very much if there was a prohibition on telling people what names were associated with which CAS numbers.
In the electronic age control is easier. Note that the control will not be through copyright but by contract to subscribers. ACS can and does cut off subscribers for what they unilaterally determine are breaches of contract. The current prohibition specifically relates to contracts. Since, however, I know of no way of accessing modern CAS information other than through a contract-based system (and I’d be grateful to know if there is) I will break my institution’s contract if I try to help create better science by clarifying information.
Chemspiderman and some Wikipedians take the view that WP should negotiate with CAS. Since WP is a democracy it will find its own way of resolving this. I take the following view:

Wikipedia requires authoritative sources for its information.
The assignment of a CAS number to one or more WP entries requires the authority of CAS
CAS forbids WP to use this authority
Therefore WP cannot include CAS numbers if it wishes to uphold its principles of authoritative sources – there are NONE available to it.

This is the logical argument for a boycott and I’d be happy to see counter arguments.
On the political front I regard CAS’s action as unacceptable for a scientific society which enjoys charitable status by virtue of its respect in the community. Charities have a responsibility to help the community – this action is diametrically opposed.


Compounds, substances and identifiers

Posted on March 9, 2008 by pm286

There has been discussion recently (e.g. CAS Discourages Using SciFinder to Help Curate Wikipedia Structures and CAS Numbers and the Wikipedia Project: CAS Validation page) about the use of CAS identifiers and possible alternatives. One suggestion is that CAS numbers could be replaced by InChI (International Chemical Identifier – Wikipedia, the free encyclopedia) strings. This may work in some cases but will fail in many others – this post is to introduce the problem of identifying chemicals and to make it clear there is no simple solution. I’d be grateful for comments.
To start we must recognize what we are talking about. I’ll use Wikipedia definitions and explanations where possible.

A chemical substance is a material with a definite chemical composition.
A chemical compound is a substance consisting of two or more different elements chemically bonded together in a fixed proportion by mass.

It is important to realise the distinction between the two and the variety of language used. I suspect that I use slightly different language. Here are some things I would regard as substances that do not have definite chemical compositions:

polystyrene
zeolite
rust
fuming sulfuric acid

This introduces the concept of “sample” – the fact that within the given concept of many substances the composition can vary. From Wikipedia (Definition of Substance):

Chemical substances (also sometimes referred to as pure substances) are often defined as “any material with a definite chemical composition” in most introductory general chemistry textbooks.^[3] According to this definition a chemical substance can either be a pure chemical element or a pure chemical compound. However, there are exceptions to this definition; a pure substance can also be defined as a form of matter that has both definite composition and distinct properties,^[4] and the chemical substance index published by CAS also includes several alloys of uncertain composition.^[5] Non-stoichiometric compounds are a special case (in inorganic chemistry) that violates the law of constant composition, and for them, it is sometimes difficult to draw the line between a mixture and a compound, as in the case of palladium hydride.

The most appropriate authority in Chemistry is the International Union of Pure and Applied Chemistry which spends much effort on publishing terminology and nomenclature. Its Gold Book defines some thousands of terms. There is no direct entry for substance. Phrases include:

polymer: A substance composed of macromolecules.

product: A substance that is formed during a chemical reaction.
amount of substance, n: Base quantity in the system of quantities upon which SI is based. It is the number of elementary entities divided by the Avogadro constant.
analgesic: Substance which relieves pain, without causing loss of consciousness.

which implies that while “substance” often refers to materials of fixed composition, this is not always true.
It is the uncertainty and variability in chemicals that makes it impossible to have a single system for identifying them. Much chemistry is based on the observation that many substances can be created in a pure form or purified and that this process is reproducible between laboratories. Good examples are crystalline compounds (and crystallisation is an excellent method of purification).
PubChem recognises that substances and compounds are not identical and has a different system of identifiers for each – we’ll return to this later.
The methods of identifying chemicals include:

names. IUPAC is the primary authority for this.
chemical structure. This is a representation of the types and interrelations (bonding or spatial relationship) of the atoms within the material.
authority-assigned identifiers (CAS, PubChem)

Within all of these there can be wide ranges of generality or specificity. It’s reasonable to talk of “zeolite” as a chemical substance, and it’s also reasonable to subclassify it into many different subtypes. In many cases it may be necessary to refer to particular samples which, unless one has access to the physical material, are often best described by their observed properties (e.g. spectra, thermodynamic properties, etc.).
In some sub categories there is a useful correspondence between chemical structures and names (e.g. in organic chemistry of pure compounds, and this is where InChI and CAS most overlap). But in solid and bulk materials we shall need alternative approaches to InChI – it is not a universal solution.
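To make the limitation concrete, here is a minimal sketch of how an InChI string is put together. It assumes the simple case of a formula layer followed by one-letter-prefixed layers, and uses the later "standard" InChI (prefix 1S) for ethanol as the example; the function name is hypothetical. Every layer describes a discrete, countable set of atoms, which is why there is nothing sensible to write for "zeolite" or "rust".

```python
def split_inchi_layers(inchi: str) -> dict:
    """Split an InChI string into version, formula and prefixed layers.

    Simplifying assumption: the second field is the formula layer and each
    later layer prefix (c, h, q, ...) occurs at most once.
    """
    prefix = "InChI="
    if not inchi.startswith(prefix):
        raise ValueError("not an InChI string: %r" % inchi)
    parts = inchi[len(prefix):].split("/")
    layers = {"version": parts[0]}
    if len(parts) > 1:
        layers["formula"] = parts[1]
    for part in parts[2:]:
        if part:  # skip empty fields from a stray slash
            layers[part[0]] = part[1:]
    return layers


# Ethanol: formula, connectivity (c) and hydrogen (h) layers
ethanol = split_inchi_layers("InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3")
print(ethanol)
```

The connectivity layer numbers each atom of a single well-defined molecule, so a material of variable or non-stoichiometric composition has no corresponding string.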


Wikipedia has ca 7000 chemical structures

Posted on March 9, 2008 by pm286

I am delighted to be corrected in my statement about the number of compounds in Wikipedia:

chemconnector.com/chemunicating/the-curation-of-almost-5000-structures-on-wikipedia.html
(ChemConnector is written, I think, by Antony Williams, Chemspiderman).
[…] There are likely legal reasons for a number of these databases to have CAS Numbers. As I continued to peruse the list I was more than impressed by the number of databases serving up CAS numbers online, and, I believe, a number of them containing over 10,000 numbers which, as I have commented before, is rather a magic number. Should Wikipedia be concerned about the 10,000 CAS number issue with some of the other issues being discussed now?
PMR recently commented on my blogpost here. He said “PMR: Wikipedia has between 1000 and 2000 chemical substances (ca 0.01% of the total number of substances in CAS).”
The number of chemical substances in Wikipedia is actually MUCH higher than that…I know since I’ve been looking at them, in detail as described here. To clarify, I am building an SDF file from the chemicals on Wikipedia so that it can be deposited on ChemSpider hooked up back to Wikipedia. This was done earlier by linking up chemical names but it was far from perfect so we are doing it in this more “curated” manner. The outcome from the work, and thanks to multiple other sets of eyes from WP:CHEM, will be a curated SDF file. I will return the SDF file with the following fields generated: SMILES string, Systematic Name, InChIString, InChIKey. These can then be used to homogenize the available fields in the Chemical Boxes etc.
In doing the work (I have already worked through the whole alphabet) I have over 4900 compounds already curated at a first level. I have disregarded the majority of inorganics and organometallics for this pass. ca. 5000 organics manually curated is ENOUGH of a challenge. I estimate the number of chemical compounds to be about 6500-7000, and it’s growing. So, it’s about a factor of 3-4 times bigger than PMR’s estimate. The vast majority do have CAS numbers. While it hasn’t hit 10,000 yet… it’s coming.

PMR: I’m delighted to know this. I should perhaps have said “entries with infoboxes for chemical compounds and substances”. My estimate was taken from the formal lists on WP such as:
http://en.wikipedia.org/wiki/List_of_inorganic_compounds
which had explicit pointers to entries.
The point of this was that I could automatically extract the infoboxes and convert them to RDF (which I have done). This RDF is then combined with other sources to show consistencies and inconsistencies between them. The goal of this – with the full involvement of Wikipedians – is to create an RDF resource for the OREChem project (Chemistry Repositories). The resource will, of course, be Open from the start. Although the primary goal is to develop and test the ORE technology and design we hope that it will also produce a top quality chemical resource with – perhaps – ca 10,000 compounds or substances.
We can, I believe, do this without violating either the CAS trademark or the terms of our Scifinder licence. The result will be a collection of compounds in widespread use (e.g. undergraduate work, commerce). It will have a superior informatics design to current chemical information sources and will be Open. It therefore has the technical potential to replace CAS numbers and information and I’ll post more about this later.
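The idea of combining assertions from several sources and looking for disagreement can be sketched very simply. This is an illustration only, not the actual OREChem code: the subjects, predicate name, source labels and values are all invented placeholders, and plain tuples stand in for real RDF triples.

```python
from collections import defaultdict

# (subject, predicate, object, source) - every value here is a made-up placeholder
assertions = [
    ("compound:A", "hasFormula", "C2H6O", "wikipedia-infobox"),
    ("compound:A", "hasFormula", "C2H6O", "other-source"),
    ("compound:B", "hasFormula", "C6H6", "wikipedia-infobox"),
    ("compound:B", "hasFormula", "C6H12", "other-source"),
]


def find_inconsistencies(rows):
    """Return {(subject, predicate): {value: [sources]}} where sources disagree."""
    values = defaultdict(lambda: defaultdict(set))
    for subject, predicate, obj, source in rows:
        values[(subject, predicate)][obj].add(source)
    return {
        key: {obj: sorted(sources) for obj, sources in objs.items()}
        for key, objs in values.items()
        if len(objs) > 1  # more than one distinct value asserted
    }


conflicts = find_inconsistencies(assertions)
print(conflicts)
```

Keeping the source alongside each value matters: as argued elsewhere on this blog, a disagreement is resolved by asking which authority made the assertion, not by majority vote.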


What to use as the primary key for chemicals?

Posted on March 9, 2008 by pm286

Cameron Neylon picks up the theme of an alternative approach to identifying chemicals (especially in the context of CAS’s blanket refusal to allow normal scientific practice – the quoting of authority). I reproduce some key quotes and then introduce what I hope to be a series of posts explaining the basic principles. Be warned – identifying chemicals ranges from straightforward to almost impossible. There are no trivial answers such as “use InChI” or “use PubChem CID”. I am fully supportive of Cameron’s post, but comment where it needs tightening…

What to use as the primary key for chemicals?

Now following on from my post about feeds it is clear that we also want to provide a good range of searchable indexes for people to be able to tell what we are using. So we would ideally want to expose InChi, InChiKey, SMILES, CML perhaps, PubChem Ids etc. These can all be converted one to the other using web services so we don’t need to type all of them in manually. All that is required is a nice logging screen where we can drop in one type of index key, the size of the bottle, supplier, lot numbers, perhaps a link to safety data. The real question is what is the index key that is easiest to input? For those of you in or near a laboratory I suggest an exercise. Go and pick up the nearest bottle of commodity stuff from a commercial supplier (i.e. not oligos or peptides). What is written on it? What is a nice short identifier that can consistently be found on pretty much any bottle of chemicals? For those unlucky people who don’t have a laboratory at their fingertips I have provided a clue below.

PMR: Most of these are interconvertible but not always. For example there are substances that PubChem can hold that do not have SMILES or InChI. And CML can describe substances that the others cannot (e.g. non-stoichiometric compounds and certain mixtures). I’ll explain later.
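The logging screen Cameron describes amounts to a small record with one index key plus the bottle details. A minimal sketch follows; the class name, field names and example values are all hypothetical, chosen only to mirror the items listed in the quote.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ReagentLogEntry:
    """One bottle logged into the lab; field names are illustrative only."""
    key_type: str                 # which identifier was typed in, e.g. "PubChem CID"
    key_value: str                # the identifier itself, stored as text
    bottle_size: str              # e.g. "500 g"
    supplier: str
    lot_number: str
    safety_data_url: Optional[str] = None  # link to safety data, if known


# Example entry; supplier and lot number are invented placeholders
entry = ReagentLogEntry(
    key_type="PubChem CID",
    key_value="2244",
    bottle_size="500 g",
    supplier="Example Chemicals Ltd",
    lot_number="LOT-0001",
)
print(entry)
```

Storing the key type alongside the value keeps the record honest about which identifier was actually entered, since not every substance can be converted between the different systems.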

The Chemical Abstracts Service number is the one identifier that can reasonably reliably be found on most commercially supplied substances. Yet, as described by Peter Murray-Rust and Antony Williams recently you can’t look these up without paying for them. And indeed by recording them for your own purposes (say in a database of the compounds we have in the laboratory) we may be violating the terms of the license.

PMR: Exactly.

So what to do? Well we can adopt another standard or standards. Jean-Claude Bradley argued in a comment on my recent post that InChiKey is the way to go, but for this specific purpose (logging materials in) this may be too much to type in many cases (certainly SMILES, InChi and CML would be). You can’t expect people to draw in the structure each time a compound comes in, particularly if we get into arguments about which precise salt of cAMP we are using today. What is required is a simple, relatively short number. This is what makes the CAS number so appealing; it is short, easily typed in, and printed on most bottles.

PMR: InChI and CAS are fundamentally different and one cannot replace the other. I will explain later.
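One reason a CAS number is easy to type reliably is that its last digit is a checksum: the preceding digits are weighted 1, 2, 3, … from right to left, summed, and the total taken modulo 10. A minimal sketch of that check is below (the function name is my own). Note what it does and does not show: it catches typing slips, but it says nothing about which substance a number belongs to, and only CAS is the authority for that.

```python
import re

# Two to seven digits, two digits, one check digit, separated by hyphens
_CAS_PATTERN = re.compile(r"^(\d{2,7})-(\d{2})-(\d)$")


def is_valid_cas_number(text: str) -> bool:
    """Check the format and check digit of a CAS-number-like string."""
    match = _CAS_PATTERN.match(text.strip())
    if match is None:
        return False
    digits = match.group(1) + match.group(2)
    check_digit = int(match.group(3))
    # Weight the digits 1, 2, 3, ... starting from the rightmost one
    total = sum(weight * int(digit)
                for weight, digit in enumerate(reversed(digits), start=1))
    return total % 10 == check_digit


print(is_valid_cas_number("7732-18-5"))  # the widely printed number for water
print(is_valid_cas_number("7732-18-4"))  # same digits with a wrong check digit
```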

So, along with Peter I think the answer is to use PubChem CID numbers. PubChem doesn’t use CAS numbers and CAS actively lobbied the US government to limit the scope of PubChem. PubChem CIDs are relatively short, and there are a range of web services from which other descriptions can be retrieved (see e.g. PubChem Power User Gateway). The only thing that is missing is the addition of CID’s on bottles. If we can get wide enough agreement on this I think the answer is to start writing to the suppliers. It’s not great effort on their part to add CIDs (or if there is something better, some other index) to the bottles I would have thought and it provides a lot of extra value for them. PubChem can provide links through to up to date safety data (without the potential legal issues that maintaining a database of MSDS forms with CAS numbers creates), it provides free access to a supplier index through which customers can find them, and it could also save them a small fortune in CAS license fees.

PMR: This is exactly the right political approach. There are some technical points that need to be addressed but we don’t need to do it all at once.
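
As a rough illustration of retrieving other descriptions from a CID, here is a minimal sketch that only builds a request URL and fetches nothing. It assumes the URL layout of PubChem's PUG REST service rather than the XML-based Power User Gateway mentioned above; the function name, the chosen properties and the example CID are illustrative assumptions, not part of the original discussion.

```python
from urllib.parse import quote

# Assumed base URL of PubChem's PUG REST service (not named in the post).
PUG_REST = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"


def cid_property_url(cid, properties=("InChIKey", "MolecularFormula", "IUPACName")):
    """Build a URL asking PubChem for a few descriptions of one compound.

    Hypothetical helper: it only formats the URL and never contacts PubChem.
    """
    cid = int(cid)  # raises ValueError for anything that is not a whole number
    if cid <= 0:
        raise ValueError("a PubChem CID is a positive integer")
    props = ",".join(quote(p) for p in properties)
    return f"{PUG_REST}/compound/cid/{cid}/property/{props}/JSON"


# Illustrative usage: 180 stands in for a CID copied from a bottle label.
print(cid_property_url(180))
```

The point of the sketch is only that a short number typed in at the bench is enough to reach the longer descriptions (InChIKey, formula, name) that nobody wants to type.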

There is another side to this, which is that if there is a wholesale shift (or even the threat of a shift) away from CAS as the only provider of chemical indexing, then perhaps the ACS will wake up and realise that not only is this protectionism bad for chemistry, but it is bad for their business. The database of CAS numbers has no real value in its own right. It is only useful as a pointer to other information. If the ACS were to make the use and indexing of CAS numbers free then it would be driving traffic to its own value-added services. The ACS needs to move into the 21st (or perhaps the 20th) century in terms of both its attitudes and business models. We often criticise the former, but without shifts in the latter there is a real risk of critical damage to an organisation that still has the potential to make a big contribution to the chemical sciences. If the major chemical suppliers were to start printing PubChem CIDs on their bottles it might start to persuade the powers that be within the ACS that things need to change.

PMR: Again I agree with all of this. However the CAS number is also interlinked with the CAS content and these are difficult to disentangle. So the CAS number not only identifies a compound but links to the information in CAS. This information presumably has physical properties, reactions, etc. as well as the simple identity. However I cannot comment authoritatively on anything in CAS because by doing so I would violate the conditions that the University has signed up to. CAS might then cut us off and while that wouldn’t worry me, my colleagues would kill me. This is a greater sanction than legal action. It’s reminiscent of computational chemistry companies who forbid comments on their programs (bugs and benchmarks).

So, to finish: do people agree that CID is a good standard index to aggregate around? If so we should start writing to the major chemical manufacturers, perhaps through open letters in the general literature (obviously not JACS), to suggest that they include these on their packaging. I’m up for drafting something if people are prepared to sign up to it.

PMR: It’s critical that we hear from PubChem on this. We need to know more about the substance ID (SID) as well as the compound ID (CID).

So I shall post more tomorrow (I’m currently in Redmond – and have to crash). But here are some axioms:

Almost everyone can dispense with the use of CAS numbers (the exceptions are where there are legal regulations requiring them or where they are mandated by a vendor, but we should seek to change those).
Identifying chemicals can be very hard and there are no simple universal solutions, but…
For most compounds it’s easy, and CAS is not required
For common compounds there are alternative ways
There have to be at least three ways of identifying compounds: names, identifiers, connection tables.
The key problem in C21 is providing the relationships between these, which can be done with RDF.
There do have to be authorities which are stable and which we can trust. That used to be CAS, now we should see if PubChem and its funders wish to take on some of that role.
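
To make the axioms about names, identifiers and connection tables a little more concrete, here is a minimal sketch of the RDF idea using plain Python tuples as subject-predicate-object triples. It is not a real RDF library or vocabulary: the `ex:` prefix, the predicate names and both helper functions are hypothetical, and a SMILES string stands in for a connection table.

```python
from collections import defaultdict

# Hypothetical triples: the "ex:" names are placeholders, not a published ontology.
TRIPLES = [
    ("ex:compound/acetone", "ex:hasName", "acetone"),
    ("ex:compound/acetone", "ex:hasName", "propan-2-one"),
    ("ex:compound/acetone", "ex:hasIdentifier", "pubchem-cid:180"),
    ("ex:compound/acetone", "ex:hasConnectionTable", "SMILES:CC(C)=O"),
]


def describe(subject, triples):
    """Group everything asserted about one subject by predicate."""
    grouped = defaultdict(list)
    for s, p, o in triples:
        if s == subject:
            grouped[p].append(o)
    return dict(grouped)


def subjects_for(value, triples):
    """Find every subject that carries a given name, identifier or structure."""
    return sorted({s for s, _, o in triples if o == value})


# A name and an identifier both resolve to the same compound node.
print(subjects_for("propan-2-one", TRIPLES))
print(subjects_for("pubchem-cid:180", TRIPLES))
print(describe("ex:compound/acetone", TRIPLES))
```

Because each relationship is a separate statement, several names or identifiers can point at one compound, and an ambiguous name can point at several, without forcing everything through a single key.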

When we have created the new system it will be greatly superior to current practice and give us more confidence in what information we can use.

Posted in Uncategorized | 6 Comments

Why and how we should move away from CAS numbers

Posted on March 8, 2008 by pm286

In a recent post (CAS Discourages Using SciFinder to Help Curate Wikipedia) I commented on the refusal of CAS to allow Wikipedia to use the CAS numbers and/or related information obtained from their SciFinder(TM) product. As far as I know there are no current public printed sources from CAS that provide the same information. Effectively, therefore, CAS is exercising aggressive monopoly control over chemical information. The legal re-use of CAS information is explained here, where it is made clear that all information is copyright and that permission is almost always required, although collections of up to 10,000 CAS numbers do not require permission.

A User or Organization may include, without a license and without paying a fee, up to 10,000 CAS Registry Numbers or CASRNs in a catalog, website, or other product for which there is no charge. The following attribution should be referenced or appear with the use of each CASRN: CAS Registry Number® is a Registered Trademark of the American Chemical Society. CAS recommends the verification of the CASRNs through CAS Client Services℠.

PMR: Wikipedia has between 1000 and 2000 chemical substances (ca 0.01% of the total number of substances in CAS). Wikipedia cares passionately about correctness and copyright. It is a fundamental policy of Wikipedia to quote sources – this is one of the critical platforms on which respect for WP rests.
The American Chemical Society – hitherto a respected learned society – is now telling a voluntary community of scholars that it forbids them to check their facts. It is preventing them disseminating chemistry.
I wonder if there is anything in the history of learned societies that matches this action. There are so many ways they could have responded positively. WP is nowhere near the 10,000 compound limit. They are not a threat (although CAS’s mail suggests they are scared of WP). CAS could have made a donation to Wikipedia of the 10,000 most common compounds in the CAS database. A “CAS-Wikipedia” page would give the correct CAS number-structure relationship, preventing error. There are so many positive things they could have done.
As it is they have done the following:

re-asserted their position that they care for revenue more than supporting the wider chemical community
re-advertised themselves as one of the least progressive learned societies
alienated a growing number of young scientists who look to the Web as a critical part of the future of chemistry

But worst of all they are implicitly encouraging bad chemistry. Here’s an example from a recent comment:

Name (required) Says:
March 8th, 2008 at 5:32 am
We are legally required to supply vendor MSDS forms to our staff. The vendors have included CAS numbers on their MSDS forms, and we keep the forms in a database. So technically, we must be in breach of our SciFinder license?
If we get sued, I wonder whether the judge would side with the legal statutes or the contractual agreement?
What CAS should be doing is making CAS numbers an open standard – like PDF files – that everybody can adopt.

CAS numbers are widely used in chemical regulation and commerce to identify substances (see MSDS (WP) for example). This action from CAS will encourage people to guess CAS numbers. If a chemist wants the CAS number for acetone s/he will not now go to CAS (6 USD) – s/he’ll find a supplier’s catalog and take one from there. I know from experience that there are huge numbers of errors in number-structure relations and so these errors will be propagated.
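
One check that needs no access to CAS is the check digit built into the number itself. The minimal sketch below assumes the widely documented checksum rule (the final digit equals the weighted sum of the preceding digits, with weights counted from the right, modulo 10); the function name is illustrative. It only catches mistyped numbers and says nothing about whether a well-formed number is attached to the right structure, which is the kind of error described above.

```python
import re


def cas_checksum_ok(casrn: str) -> bool:
    """Return True if a string is shaped like a CAS number and its check digit agrees.

    Hypothetical helper: it validates the format and arithmetic only and
    cannot confirm that the number refers to any particular substance.
    """
    match = re.fullmatch(r"(\d{2,7})-(\d{2})-(\d)", casrn.strip())
    if not match:
        return False
    digits = match.group(1) + match.group(2)
    # Weight the digits 1, 2, 3, ... starting from the rightmost one.
    total = sum(weight * int(d) for weight, d in enumerate(reversed(digits), start=1))
    return total % 10 == int(match.group(3))


# Acetone's number as printed in supplier catalogs, then a transposed copy.
print(cas_checksum_ok("67-64-1"))
print(cas_checksum_ok("76-64-1"))
```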
The commercial chemical community is ultra-conservative but even so there is a limit to this central control. The use of CAS numbers has been abandoned by organisations such as PubChem for exactly this reason. PubChem now has nearly 20 million substances. It holds records for all compounds that are likely to occur on MSDS. It’s highly respected (although ACS lobbied the US government to limit PubChem’s activities). It is part of the NIH and now – with the NIH mandate – effectively safe from the ACS. It provides a credible alternative.
We (including Wikipedia) should now switch from using CAS numbers to using PubChem IDs wherever possible. It won’t be a simple transition – certainly we shan’t find 100% overlap. But it will cover all the common substances and therefore 90%+ of the use of CAS numbers.
We shall need software. We and others are now developing the next generation of chemical informatics software using RDF (Resource Description Framework). RDF allows the description of ambiguities and ontologies. This will allow chemical information to be gleaned directly from authoritative sources using robots. (Of course some of the authorities are currently conservative and do not allow access to their material because of restrictive copyright and licences, but that is starting to change, even in chemistry.) As information becomes more open, the CAS system will be increasingly isolated in a world of chemical commerce. Robert Massie (Robert Massie on OA and PMR) worries that sites in China are stealing information from CAS:

many sites in China have sprung up to provide information on how to break into the computer systems of major US universities in order to gain access to SciFinder.

If I were running CAS that wouldn’t be my worry. I’d be terrified that in five years’ time the world – perhaps through China – would have developed an Open system that was rapidly replacing SciFinder because it was better as well as free.
And I shall be posting from time to time how I think this can be done. The first step is to transfer whatever is possible to PubChem.

Posted in Uncategorized | 6 Comments
