Two things from today. The first was a presentation by one of my colleagues (revealed later) on “Naming things”, or a similar title, which isn’t yet public and which they’ll be giving at a meeting next week. I won’t give it away, except to say that it was a beautiful and thoughtful talk.
Names are hard.
The second was a useful reply (to my post Chemical names and structures continued) from Rich Apodaca about the problem of Chemical Abstracts Service (CAS) identifiers. Identifiers are names, with the additional facet that they are usually controlled by an authority and may also (possibly) have IP restrictions (i.e. be copyrighted).
Peter, there is no authority on CAS numbers other than CAS itself. No third party, no matter how carefully they attempt to do it, can create an authoritative version of any subset of the CAS database. It’s impossible by definition.
PMR: I have always asserted exactly this. CAS has a right to develop whatever system of identifiers it likes. Only CAS can give this authority. I imagine there have been court cases where the identity of a compound depended inter alia on CAS identifiers and an expert witness could only use CAS to support evidence.
Having said that, there is a widespread need to use CAS numbers outside of the CAS Registry system. The question is how to best do this.
If we can’t have authority, at least we can use provenance. This is why Chempedia doesn’t merely give the CAS number associated with a structure (or vice versa), it gives the name and URL of the organization making the assertion. It’s a ternary, not binary relationship (cas number-structure-organization).
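The ternary relationship can be pictured as a small record type. The following is only a minimal sketch of the idea, not how Chempedia actually stores its data: the `Assertion` type, the `structures_by_source` helper, the organisation names, the URLs and the structure keys are all hypothetical placeholders. Only the two CAS numbers come from the examples in this post.

```python
from collections import defaultdict
from typing import NamedTuple


class Assertion(NamedTuple):
    """One claim: 'this source says this CAS number denotes this structure'."""
    cas_number: str
    structure: str   # placeholder for a structure key (e.g. an InChI string)
    source: str      # the organisation making the assertion
    source_url: str  # where the assertion can be checked


def structures_by_source(assertions, cas_number):
    """Group the sources that stand behind each structure claimed for a CAS number."""
    grouped = defaultdict(set)
    for a in assertions:
        if a.cas_number == cas_number:
            grouped[a.structure].add(a.source)
    return dict(grouped)


# Hypothetical data: organisations, URLs and structure keys are invented.
ASSERTIONS = [
    Assertion("58-08-2", "structure-X", "Org A", "https://a.example.org"),
    Assertion("58-08-2", "structure-X", "Org B", "https://b.example.org"),
    Assertion("59865-13-3", "structure-Y", "Org A", "https://a.example.org"),
    Assertion("59865-13-3", "structure-Z", "Org B", "https://b.example.org"),
]

# One structure backed by two sources: consensus.
print(structures_by_source(ASSERTIONS, "58-08-2"))
# Two structures, one source each: disagreement that stays visible.
print(structures_by_source(ASSERTIONS, "59865-13-3"))
```

Because the source is part of every record, nothing is merged away: a reader can always ask who made a given claim and follow the URL to check it.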
PMR: I fully agree with this analysis. However, the assertion is often originally made by an organization which isn’t mentioned, and copying (including the copying of errors) is common.
And what you see is broad consensus on certain cas number/structure mappings, and disagreement on others.
Consider caffeine:
http://chempedia.com/registry_numbers/58-08-2
and cyclosporin:
http://chempedia.com/registry_numbers/59865-13-3
as polar opposites in terms of consensus. If Chempedia had tried to be intelligent and hide the discrepancies, users would be misled. Recording and displaying the provenance of the data in plain sight lets humans judge for themselves what’s really happening.
PMR: again agreed. However, we must beware of consistency that is simply due to copying. Note that caffeine is the only compound out of 30 million+ that can be obtained freely from the CAS website.
CAS numbers are really nothing more than a name assigned to a structure. Like all names, it’s an opinion with the twist that in this case one organization (CAS) is always right. But the minute I communicate the answer CAS gives me to someone else, the information becomes unreliable. Recording the source of the CAS number/structure mapping is one way to determine reliability. It’s not a yes/no answer, but some shade of gray.
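One way to picture the “shade of gray” is as the share of recorded sources that back each candidate structure for a given CAS number. This is a rough sketch under stated assumptions, not an established reliability measure: the `agreement_scores` function, the source names and the structure keys are hypothetical. A high share only shows agreement, and sources that copied from one another will inflate it.

```python
from collections import Counter


def agreement_scores(claims):
    """Return, for each structure, the fraction of distinct sources asserting it.

    `claims` is a list of (source, structure) pairs for ONE CAS number.
    Repeated identical pairs are counted once.
    """
    pairs = set(claims)                      # drop verbatim duplicates
    sources = {source for source, _ in pairs}
    if not sources:
        return {}
    counts = Counter(structure for _, structure in pairs)
    return {structure: n / len(sources) for structure, n in counts.items()}


# Hypothetical claims for a single CAS number.
claims = [
    ("Org A", "structure-Y"),
    ("Org B", "structure-Y"),
    ("Org C", "structure-Z"),
    ("Org A", "structure-Y"),  # duplicate record, ignored
]

for structure, score in sorted(agreement_scores(claims).items()):
    print(f"{structure}: {score:.2f}")
```

The number is a prompt for human judgement, not a verdict: the sources behind each structure still need to be shown alongside it.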
Not the way I’d prefer to have things work, but it’s what really happens. Designers of information systems need to factor this into the systems they create.
Any ideas on other ways to address this problem?
PMR: There is the additional problem that copying CAS numbers and making assertions about them could lead to copyright problems with CAS. So that is a major disincentive to a solution. There are many identifiers which are “CAS” but where the letters “CAS” do not appear – possibly for fear of copyright.
By far the best solution would be for CAS to take an Open approach to its CAS identifiers. This is the spirit of the current century. The major value of CAS – and other – identifiers is for common compounds – to give names to chemicals that cannot be easily identified by other means, or where there are confusing names or chemical formulae. We don’t really need a CAS number for water (H2O) – but we do benefit from an identifier for glucose.
It was good to see CAS allowing Wikipedians to use SciFinder (to which they subscribe) to check CAS numbers against Wikipedia chemicals. (Note that Wikipedia is name-based, not identifier-based.) I think it would enhance CAS’s business, as well as its standing, if it were to offer free lookup, and free re-use, for, say, 500,000 common chemicals. (Of course I’d like the whole lot, but let’s start somewhere.) This would require courage but would reinforce the authority that CAS already offers.
The alternative is that CAS numbers will continue to be spread around the web in a way that degrades their authenticity. Or that some other, more Open, authority will develop. That looks impossible, but so did Wikipedia.