Chemical names and structures continued

Antony Williams, Rich Apodaca and I have been having a debate on our blogs about how to identify chemicals (I choose this word so as not to be too specific) to a machine. Antony has a long and detailed response (Is there 100% in chemical names and compounds?) and Rich has commented (see comment there). Here are some points, but if you are interested in the machine aggregation of information on chemicals, read them in full.

CS: Recently I posted on whether or not there is “a right structure for a compound”. I talked about trade names and registered chemical entities and posed the question of “whether a Registered Trade Name is absolute? I’m asking the question since I’m actually not sure.”
There were two responses…
1) Rich Apodaca commented: “you’d probably find agreement among chemists that a trade name uniquely identifies one specific chemical entity. Ditto CAS Number.”

I, like Rich, am of the opinion that a CAS Number does uniquely identify a specific chemical entity, not necessarily a unique structure. Of course, CAS numbers can be confusing too as I have commented here. Aspirin, for example, has 6 CAS numbers! So Rich and I agree on this…can anyone from CAS confirm or not whether our belief is right?

[illustration of various names snipped…]
We are in absolute agreement about this issue. The names are not identical. One declares stereo and the other doesn’t. The question then is what synonyms are useful to the user of ChemSpider to locate the structure if they have a systematic name. One might assume that the more the merrier. There is an enormous number of variants of bracket styles and dashes that could give rise to probably dozens of names that are all consistent with the structure and the names shown come from different sources.

PMR: For certain purposes, it is valuable to collect as many names as possible, for example for location or lookup. But these should be accompanied by metadata. A similar example is:

On a record view we list “Names and Synonyms”. The question marks Peter sees are for a French name shown here:

Looks fine in my browser and pasted in here too: N-{2-[({5-[(diméthylamino)méthyl]furan-2-yl}méthyl)sulfanyl]éthyl}-N’-méthyl-2-nitroéthène-1,1-diamine. So, not junk (saying that the French name is junk would offend the Parisians). Notice that the Z- has been removed (for now) and that the name is labeled French on the record. If any of you are seeing issues in your browser let us know and we will investigate at our end.

PMR: Without the metadata giving the language, information is lost. For example, what does “pain” mean? If the language is not given there is a tendency to interpret this as English. We have to acknowledge that the language of science is currently English (it wasn’t when I started and we had to read French and German papers). So RDF, for example, provides a language qualifier (e.g. @en or @fr). The addition of that qualifier transforms the information from junk to meaningful.
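
The effect of the language qualifier can be illustrated with a minimal sketch. It does not use a real RDF library; the `Literal` class, the `GLOSSARY` table and the helper functions are hypothetical names invented for this illustration, and the glossary holds only the “pain” example from above.

```python
from typing import NamedTuple, Optional


class Literal(NamedTuple):
    """A string plus an optional language tag, imitating an RDF literal like "pain"@fr."""
    text: str
    lang: Optional[str] = None  # None means the language was never recorded


# Hypothetical two-entry glossary, for illustration only.
GLOSSARY = {
    ("pain", "en"): "an unpleasant physical sensation",
    ("pain", "fr"): "bread",
}


def interpret(lit: Literal) -> str:
    """Look up a meaning only when the language is known."""
    if lit.lang is None:
        # Without the tag a machine can only guess, so refuse to guess.
        return "ambiguous: no language recorded"
    return GLOSSARY.get((lit.text, lit.lang), "unknown term")


def to_turtle(lit: Literal) -> str:
    """Render the literal in Turtle-like syntax, with the @lang suffix if present."""
    return f'"{lit.text}"' + (f"@{lit.lang}" if lit.lang else "")


if __name__ == "__main__":
    for lit in (Literal("pain"), Literal("pain", "en"), Literal("pain", "fr")):
        print(to_turtle(lit), "->", interpret(lit))
```

The untagged literal is deliberately reported as ambiguous rather than defaulted to English, which is the tendency described above.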

CS: I look forward to seeing how Zantac and Ranitidine are handled in this new world- if it’s a structured ontology then it sounds like an integration of MeSH with structures? Wikipedia is over 5000 organics now and is the culmination of thousands of hours of work by many dedicated individuals. And is not error-free. Any other efforts will be prone to similar issues so it’s going to be a major undertaking and I look forward to the results. The ChEBI team are already doing a good job in this area. You can see an ontology Tree View here. So, I’m definitely excited to see what will be better! Exciting times.

PMR: We spent some time yesterday discussing our ontology for chemicals, which covers many of these points. It is not trivial to build one and, not surprisingly, we argue. I like Tails’ ratio of 75% arguing and 25% building – that’s certainly the position with ontologies. Rich Apodaca commented on the discussion:

Tony, you’d probably find agreement among chemists that a trade name uniquely identifies one specific chemical entity. Ditto CAS Number.
But in practice (in databases, Excel spreadsheets, books, reviews, peer-reviewed articles, etc.), you’d find some disagreement about the structure that a particular identifier should be linked with, and vice-versa.
The disagreements would range from the baffling (completely wrong structure) to the annoying (wrong stereochemistry) to the amusing (ionized carboxylate vs. protonated).
For databases that aggregate content from diverse sources, the best practice may be to model this situation with a many-to-many relationship, rather than a one-to-one or even one-to-many.
In other words, CAS numbers, trade names, and IUPAC names may be better modeled as social networking-style tags than as unique identifiers. I’m not saying this is the way things should be – just that this is how the situation appears to have evolved.
See this article, which discusses the problem as it applies to CAS numbers used in the wild and how Chempedia addresses it:
http://depth-first.com/articles/2008/05/26/simple-cas-number-lookup-and-more-with-chempedia
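
Rich’s many-to-many suggestion can be sketched roughly as follows. This is only an illustration, not how Chempedia or ChemSpider is implemented: the `TagIndex` class is a hypothetical name, and the structure keys (`ranitidine-Z`, `ranitidine-E`) are placeholders standing in for real structure records.

```python
from collections import defaultdict


class TagIndex:
    """Many-to-many index between name-like tags and structure keys."""

    def __init__(self) -> None:
        self._structures_by_tag = defaultdict(set)
        self._tags_by_structure = defaultdict(set)

    def link(self, tag: str, structure: str) -> None:
        """Record that some source has attached this tag to this structure."""
        self._structures_by_tag[tag].add(structure)
        self._tags_by_structure[structure].add(tag)

    def structures_for(self, tag: str) -> set:
        """Every structure the tag has been attached to (possibly several)."""
        return set(self._structures_by_tag.get(tag, ()))

    def tags_for(self, structure: str) -> set:
        """Every tag attached to the structure (possibly several)."""
        return set(self._tags_by_structure.get(structure, ()))


# Placeholder data: the keys below are illustrative, not real registry records.
index = TagIndex()
index.link("ranitidine", "ranitidine-Z")
index.link("ranitidine", "ranitidine-E")
index.link("Zantac", "ranitidine-Z")

if __name__ == "__main__":
    print(sorted(index.structures_for("ranitidine")))
    print(sorted(index.tags_for("ranitidine-Z")))
```

Nothing in this model forces one name to resolve to one structure; a lookup returns a set, and it is left to the reader (or to further metadata) to decide what to make of it.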

PMR: I very much like the idea of regarding chemical names as social identifiers. But, of course, that only works for humans. The machines can aggregate the tags but they cannot make inferences from them. The problem is that when they are put into databases they lose their social context and are managed by hard boolean logic. That fails immediately and often dramatically. A major cause is the loss of metadata and authorities. In this world you cannot use voting (which is why Chempedia cannot be seen as an authority for CAS numbers, only a useful guide).
We have to use authorities (provenance) in our information. Thus the statements:
Ranitidine is the Z-isomer
and
Ranitidine is the E-isomer
may be seen as contradictory. That’s why people have suggested that RDF should have quads, not triples, such as
Antony_Williams asserts ranitidine hasIsomer Z
Wikipedia asserts ranitidine hasIsomer E
Both of these are true. That is the language we should use in the semantic web.
PeterMR still deliberately fails to make an assertion about this isomerism and is waiting to see what others think.
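
A minimal sketch of the quad idea, assuming nothing more than a four-field record whose first field names who is making the assertion. The `Quad` type and both helper functions are hypothetical; the two quads are the ones given above.

```python
from typing import Dict, Iterable, NamedTuple, Set


class Quad(NamedTuple):
    """A triple plus the source that asserts it."""
    source: str
    subject: str
    predicate: str
    obj: str


QUADS = [
    Quad("Antony_Williams", "ranitidine", "hasIsomer", "Z"),
    Quad("Wikipedia", "ranitidine", "hasIsomer", "E"),
]


def values_by_source(quads: Iterable[Quad], subject: str, predicate: str) -> Dict[str, Set[str]]:
    """Group the asserted values by source instead of merging them."""
    claims: Dict[str, Set[str]] = {}
    for q in quads:
        if q.subject == subject and q.predicate == predicate:
            claims.setdefault(q.source, set()).add(q.obj)
    return claims


def is_contested(quads: Iterable[Quad], subject: str, predicate: str) -> bool:
    """True when the sources, taken together, assert more than one value."""
    values: Set[str] = set()
    for objs in values_by_source(quads, subject, predicate).values():
        values |= objs
    return len(values) > 1


if __name__ == "__main__":
    print(values_by_source(QUADS, "ranitidine", "hasIsomer"))
    print(is_contested(QUADS, "ranitidine", "hasIsomer"))
```

Stored as bare triples, the two statements would collide. Stored as quads, each remains a true statement about what a source asserts, and the disagreement itself becomes something a machine can detect.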


3 Responses to Chemical names and structures continued

  1. Rich Apodaca says:

    Peter, there is no authority on CAS numbers other than CAS itself. No third party, no matter how carefully they attempt to do it, can create an authoritative version of any subset of the CAS database. It’s impossible by definition.
    Having said that, there is a widespread need to use CAS numbers outside of the CAS Registry system. The question is how to best do this.
    If we can’t have authority, at least we can use provenance. This is why Chempedia doesn’t merely give the CAS number associated with a structure (or vice versa), it gives the name and URL of the organization making the assertion. It’s a ternary, not binary relationship (cas number-structure-organization).
    And what you see is broad consensus on certain cas number/structure mappings, and disagreement on others.
    Consider caffeine:
    http://chempedia.com/registry_numbers/58-08-2
    and cyclosporin:
    http://chempedia.com/registry_numbers/59865-13-3
    as polar opposites in terms of consensus. If Chempedia had tried to be intelligent and hide the discrepancies, users would be misled. Recording and displaying the provenance of the data in plain sight lets humans judge for themselves what’s really happening.
    CAS numbers are really nothing more than a name assigned to a structure. Like all names, it’s an opinion with the twist that in this case one organization (CAS) is always right. But the minute I communicate the answer CAS gives me to someone else, the information becomes unreliable. Recording the source of the CAS number/structure mapping is one way to determine reliability. It’s not a yes/no answer, but some shade of gray.
    Not the way I’d prefer to have things work, but it’s what really happens. Designers of information systems need to factor this into the systems they create.
    Any ideas on other ways to address this problem?

  2. Pingback: ChemSpider Blog » Blog Archive » More about Names, Structures and Curation on ChemSpider

  3. Peter, names ARE labeled with the language when the details are available from depositors. Else, we allow curators to do it. We also retain metadata for a lot of the information that gets deposited on the site. Not for the identifiers as yet but it’s possible we could. For details see my comments here: http://www.chemspider.com/blog/more-about-names-structures-and-curation-on-chemspider.html
