What is a chemical compound? and what's a label

Steve Bachrach poses an interesting question on the CHMINF-L list. I have omitted the citations and some other material – you can read the archive if necessary.

 
I have run into an interesting chemical problem that has led to both theoretical and applied database questions. I am hoping that some of the experts on the list can shed some light.
I have been looking into the recent controversy concerning the structure of (+)-hexacyclinol. This compound was first isolated in 2002 by Graefe et al., who proposed a structure for it…. La Clair recently synthesized this structure, or at least reportedly so. [SB: By the way – do a Google search on hexacyclinol to see how the blogosphere responded to this problem.] Then Rychnovsky …proposed an alternative structure for (+)-hexacyclinol, which was subsequently synthesized and confirmed to be identical to the original natural product.
(PMR: Yes, the blogosphere is worth reading, e.g. Totally Synthetic (blogroll) and Tenderbutton (but this is now password-only).)
So here is first my theoretical question: How do you index such a situation? The original structure of the molecule (+)-hexacyclinol is wrong, and a subsequent one is right. So, when you query a database, which structure matches up with the name “(+)-hexacyclinol”? My guess is that it should be the correct one – but then what do you do with the old structure? Obviously, this is not the first, nor will it be the last, compound whose structure is contested.
Now here is the more applied aspect. A search in SciFinder for (+)-hexacyclinol gives CA 484674-97-7, which is the original (and, we now know, wrong!) structure. Querying for the papers that have this structure returns the Graefe, La Clair and Rychnovsky papers, but not the Porco paper. But entering the “true” hexacyclinol structure and then doing a search locates 2 structures, CA 903574-41-4 and CA 903574-42-5, which look to me to be identical. Furthermore, the only paper that is linked to these “two” structures is the Rychnovsky paper. In other words, the Porco paper that reports the actual synthesis and x-ray structure of hexacyclinol does not have any hexacyclinol structure(s) (correct or not) attached to it!
PMR: Note – SciFinder is a Closed access tool to the Closed Chemical Abstracts database of chemical information. I cannot therefore comment on Steve’s IDs.
(By the way, a PubChem search for hexacyclinol comes up dry, but all of the above papers are indexed in PubMed.) Any explanations?

(PMR: Yes, hexacyclinol is not very interesting except to chemists, so no-one has deposited a data collection containing it in PubChem. If synthetic chemists contribute collections of targets to PubChem I am sure PubChem will be delighted to accept them. However, many chemists are still unaware that PubChem exists.)

PMR: There is nothing strange in this – it’s common in almost all disciplines. As a science progresses the interpretation of objects changes. Genes, organisms, galaxies are all frequently reclassified. It’s actually a strange feature of structural chemistry that there are so many cases that aren’t fluid, where a structure and a substance can be associated and where this association can persist for a long time.
The language of “right” and “wrong” is what is causing the problem. These statements should be recast in terms of annotations or assertions, labelled with the authority that makes them. (Incidentally this is what is at the basis of the RDF-based Semantic web). The above could be written:
2002: Graefe asserts that C1 is the structure associated with a given substance (S1). Graefe gives S1 the label “hexacyclinol”. Graefe also asserts that certain reported physical data (D1) belong to S1.
2005? La Clair makes S2 and asserts that it is the same substance as S1 and re-asserts that S1 has the structure C1.
2006. Rychnovsky asserts that S1 does not have the structure C1 and should instead be associated with an alternative structure C2. Porco then makes S3, with structure C2, and asserts that it is identical with S1.
By the laws of chemistry (which say that a given substance S should only be associated with one structure C) we have a contradiction. So…
2006. Many chemists, including the blogosphere, assert that La Clair’s statement is false.
But that may not be the end of it.
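The assertions above can be sketched as plain records, each carrying its year and the authority that made it. This is only an illustrative sketch in Python, not RDF itself: the class name, the predicate names (`hasStructure`, `hasLabel`, `hasData`, `sameAs`) and the label `C2` for the alternative structure Rychnovsky proposed are all assumptions made for the example, and 2005 is the uncertain date given above. The function simply reports any substance for which more than one structure has been asserted, without judging who is "right".

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Assertion:
    year: int
    authority: str  # who makes the claim
    subject: str    # e.g. a substance label such as "S1"
    predicate: str  # hypothetical predicate names, chosen for this sketch
    obj: str


assertions = [
    Assertion(2002, "Graefe", "S1", "hasStructure", "C1"),
    Assertion(2002, "Graefe", "S1", "hasLabel", "hexacyclinol"),
    Assertion(2002, "Graefe", "S1", "hasData", "D1"),
    Assertion(2005, "La Clair", "S2", "sameAs", "S1"),
    Assertion(2005, "La Clair", "S1", "hasStructure", "C1"),
    # "C2" is an assumed label for the alternative structure.
    Assertion(2006, "Rychnovsky", "S1", "hasStructure", "C2"),
]


def conflicting_structures(assertions):
    """Return substances that have been assigned more than one structure.

    The result maps substance -> structure -> [(year, authority), ...].
    """
    claims = defaultdict(lambda: defaultdict(list))
    for a in assertions:
        if a.predicate == "hasStructure":
            claims[a.subject][a.obj].append((a.year, a.authority))
    return {
        substance: {s: sorted(who) for s, who in by_structure.items()}
        for substance, by_structure in claims.items()
        if len(by_structure) > 1
    }


conflicts = conflicting_structures(assertions)
for substance, by_structure in conflicts.items():
    for structure, who in by_structure.items():
        print(substance, structure, who)
```

Nothing is deleted or overwritten here: the contradiction is visible precisely because every assertion is kept together with its source.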
That is a simplified picture. Henry Rzepa has written about “What is mauveine?” – it is by no means clear what this industrially spectacular purple pigment was or is. He has devised an RDF scheme for presenting, and possibly resolving, a number of assertions about the structure of a compound.
Not all chemistry has the luxury of being able to associate a precise formula with a given substance in a jar. Here are some simple examples:
  • What is aluminium chloride?
  • What is glutamate?
  • What is glucose?

These are legitimate scientific questions, and answering each one requires several assertions, linked in a mini-semantic web. That is why we need to move from a twentieth-century way of describing chemistry (as exemplified by the CA numbers) to a semantic one. There is lots of room for volunteers.
I’m reminded of being shown round the British Museum of Natural History – and a room full of fish specimens in ethanol in labelled glass containers. The biologist said that some countries had asked for their specimens – their property – back – the BM had resisted. But if it ever came to that the BM would keep the labels – their metadata.


4 Responses to What is a chemical compound? and what's a label

  1. I fully recognize that this situation is not unique, either within the discipline of chemistry or in science in general. I brought it up because it was the first time I really thought about the consequence of “time-dependent science” and databases. What I mean by “time-dependent science” is that what is thought to be a correct interpretation in science changes with time. So in 2002 the generally accepted structure for hexacyclinol was what Graefe proposed, but now in 2006 we have a different structure.
    So the question of “right” or “wrong” is important in that it corresponds to what the community believes at a given point in time. So the database record should somehow track that time-dependence. But I think that a search for hexacyclinol today should at least give back the currently accepted structure, and even better would be to get the “old” structure too. I agree that a well-structured meta-tag system could handle this.
    I know why PubChem has no record of hexacyclinol – my question of “Any explanations?” was meant for the whole post.
    The CAS morass is also interesting – it points out how important careful human examination of articles remains today in building authoritative databases. Searching today on hexacyclinol within CAS, either by name or structure, does not lead you to the Porco paper containing the synthesis and x-ray structure of the compound. That to me is a CAS failure in abstracting/indexing.

  2. pm286 says:

    (1)Thanks Steve,

    I fully recognize that this situation is not unique, either within the discipline of chemistry or in science in general. I brought it up because it was the first time I really thought about the consequence of “time-dependent science” and databases. What I mean by “time-dependent science” is that what is thought to be a correct interpretation in science changes with time. So in 2002 the generally accepted structure for hexacyclinol was what Graefe proposed, but now in 2006 we have a different structure.

    Indeed. The human genome has been re-interpreted dozens of times since it was first released. I think less than half of the original annotations are now “right”.

    So the question of “right” or “wrong” is important in that it corresponds to what the community believes at a given point in time. So the database record should somehow track that time-dependence. But I think that a search for hexacyclinol today should at least give back the currently accepted structure, and even better would be to get the “old” structure too. I agree that a well-structured meta-tag system could handle this.

    Chemistry has, perhaps, a clearer idea of right and wrong. It takes time for people to re-adjust their views. I have only dipped into the blogosphere but my understanding is that La Clair does not agree with the blogosphere or other organs of public opinion. But I don’t believe that anyone has actually formally retracted anything. So what is actually “right” now? Majority discourse on the web? Or did the editors of Rychnovsky’s paper announce that his statement was right and La Clair’s was wrong? Or do we assume that Rychnovsky is right because he has asserted it and the journal has published the paper? Again I haven’t followed the discussion, but I don’t believe anyone has worn a hairshirt in public.

    I know why PubChem has no record of hexacyclinol – my question of “Any explanations?” was meant for the whole post.

    And PubChem, of course, are the modern way of doing things. They publish assertions, with attribution. They make no judgment. There are lots of “wrong” things in PubChem. What is needed – and it will come – is social computing (annotation) applied to PubChem. The chemical world can, and will, vote as to which PubChem entries are agreed to represent “accurate statements in 2007”, and so on. There will be no help from the mainstream chemical community because this smacks of everything new like Wikipedia, blogs, etc.

    The CAS morass is also interesting – it points out how important careful human examination of articles remains today in building authoritative databases. Searching today on hexacyclinol within CAS, either by name or structure, does not lead you to the Porco paper containing the synthesis and x-ray structure of the compound. That to me is a CAS failure in abstracting/indexing.

    Possibly. But if all articles simply listed the assertions about what compounds they contained, and what names and data were associated with them, what is the role of CAS or any other indexer? Do they sit in judgment over whether compound X is associated with structure Y and name Z? If so, do they have to do it for all 10 million? Can they continue to do so?
    The attraction of structure-based indexing (InChI) is that it is uniquifiable in most cases. Names are problematic – there are synonyms and heteronyms. But some substances require names (paraffin, dextrose, zeolites, montmorillonite, etc.). So for 95% of organic chemistry, perhaps, unique structures can be used. But for a lot of things they can’t.

  3. Chris says:

    Can I throw a little petrol on the fire?
    Whilst it is possible to write a unique structure, the actual sample is often not a single entity. Whilst I’m sure chemists would love to claim their compounds are 100% pure, in reality there are often isomers, hydrates, salts, impurities, residual solvents, or even incorrect assignments. These can all have an impact on the properties of the molecule (both physical and biological). So for databases of properties it is often better to link via a sample identity rather than a structure.

  4. pm286 says:

    (3)This is the key, of course. At the macroscopic level everything consists of samples. In principle each of these is different and would require a unique batch number. Chemical/pharma companies do, of course, maintain batch numbers. I can remember a case of an active sample to which we assigned several different structures as the weeks went by, but the sample remained roughly the same. Actually it had been prepared 20 years earlier, and had decomposed over the years. It is unlikely that the original formula described what was in the bottle then (the structure was assigned on the assumption of the reaction that had taken place). So the only way to describe the material is a dynamic combination of sample identifiers (realising that to describe the sample itself requires a date) and various chemical structures in different amounts at different times.
    Most literature-based public chemical information systems have no concept of a sample and so are incapable of representing the situation. In many cases it works OK – make/isolate a compound, find out what it is, report this – and it stays unchanged for many years. That doesn’t happen in biosciences.
    We have anticipated this in CML and have various elements (sample, substance, identifier, name) to manage those aspects of chemical identity which are not based on a connection table. So it is possible to describe the following statements:
    molecule M1 has name N1 on date D1
    molecule M1 has name N2 on date D2
    molecule M2 has name N3 on date D1
    molecule M2 has name N1 on date D2
    and so forth. It is up to the community to decide on date D1, D2 what is the preferred interpretation of these statements. I would not use the words “right” and “wrong” – I would use “generally agreed (see reference) by authority X”.
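    As a rough illustration of the four statements above, here is a minimal Python sketch. It is not CML and does not use any CML library: the class, the function, the placeholder dates standing in for D1 and D2, and the authority "X" are all assumptions for the example. The rule "the most recent statement on or before the query date wins" is also just one possible policy; as noted above, the preferred interpretation is for the community to decide.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class NameStatement:
    molecule: str
    name: str
    as_of: date
    authority: str  # who makes the statement


# Placeholder dates standing in for D1 and D2 (assumed values).
D1 = date(2002, 1, 1)
D2 = date(2006, 1, 1)

statements = [
    NameStatement("M1", "N1", D1, "X"),
    NameStatement("M1", "N2", D2, "X"),
    NameStatement("M2", "N3", D1, "X"),
    NameStatement("M2", "N1", D2, "X"),
]


def molecules_named(name, on, statements):
    """Molecules whose most recent name on or before `on` is `name`."""
    latest = {}
    for s in sorted(statements, key=lambda s: s.as_of):
        if s.as_of <= on:
            latest[s.molecule] = s  # later statements replace earlier ones
    return sorted(m for m, s in latest.items() if s.name == name)


def history(molecule, statements):
    """All names ever asserted for a molecule, oldest first."""
    rows = [s for s in statements if s.molecule == molecule]
    return [(s.as_of, s.name, s.authority)
            for s in sorted(rows, key=lambda s: s.as_of)]


print(molecules_named("N1", D1, statements))
print(molecules_named("N1", D2, statements))
print(history("M1", statements))
```

    Asking for N1 on the two dates gives different molecules, while `history` still returns the "old" name alongside the current one, which is the behaviour Steve asks for in (1).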
    The commercial databases give a false aura of certainty in a subject which has a lot of agreement. I would rather see all evidence being made Open (hence my emphasis on Open Data) and comments on that data. So rather than say “there is a lot of rubbish in PubChem”, I would say “there are some assertions in PubChem (usually linking names and connection tables) that I would like to challenge”. We need a social annotation mechanism. If we had that then this discussion would not have arisen – we would simply have added the connection tables for the various “hexacyclinols” to PubChem and annotated them.
    I expect this sort of thing to happen. It will be ignored or opposed by most chemists. But hey…
