Steve Bachrach poses an interesting question on the CHMINF-L list. I have omitted the citations and some other material – you can read the archive if necessary.
I have run into an interesting chemical problem that has led to both theoretical and applied database questions. I am hoping that some of the experts on the list can shed some light.
I have been looking into the recent controversy concerning the structure of (+)-hexacyclinol. This compound was first isolated in 2002 by Graefe et al. who proposed a structure for it…. La Clair recently synthesized this structure, or at least reportedly so. [SB: By the way – do a Google search on hexacyclinol to see how the blogosphere responded to this problem.] Then Rychnovsky …proposed an alternative structure for (+)-hexacyclinol, which was subsequently synthesized and confirmed to be identical to the original natural product.
So here is first my theoretical question: How do you index such a situation? The original structure of the molecule (+)-hexacyclinol is wrong, and a subsequent one is right. So, when you query a database, which structure matches up with the name “(+)-hexacyclinol”? My guess is that it should be the correct one – but then what do you do with the old structure? Obviously, this is not the first, nor will it be the last, compound whose structure is contested.
Now here is the more applied aspect. A search in SciFinder for (+)-hexacyclinol gives CA 484674-97-7, which is the original (and, we now know, wrong!) structure. Querying for the papers that have this structure returns the Graefe, La Clair and Rychnovsky papers, but not the Porco paper. But entering the “true” hexacyclinol structure and then doing a search locates 2 structures, CA 903574-41-4 and CA 903574-42-5, which look to me to be identical. Furthermore, the only paper that is linked to these “two” structures is the Rychnovsky paper. In other words, the Porco paper that reports the actual synthesis and x-ray structure of hexacyclinol does not have any hexacyclinol structure(s) (correct or not) attached to it!
(By the way, a PubChem search for hexacyclinol comes up dry, but all of the above papers are indexed in PubMed.) Any explanations?
(PMR: Yes, hexacyclinol is not very interesting except to chemists, so no-one has deposited a data collection containing it in PubChem. If synthetic chemists contribute collections of targets to PubChem I am sure PubChem will be delighted to accept them. However, many chemists are still unaware that PubChem exists.)
- What is aluminium chloride?
- What is glutamate?
- What is glucose?
These are legitimate scientific statements which require several assertions, linked in a mini-semantic web. That is why we need to move from a twentieth-century way of describing chemistry (as exemplified by the CA numbers) to a semantic one. There is lots of room for volunteers.
I’m reminded of being shown round the British Museum of Natural History – and a room full of fish specimens in ethanol in labelled glass containers. The biologist said that some countries had asked for their specimens – their property – back – the BM had resisted. But if it ever came to that the BM would keep the labels – their metadata.
I fully recognize that this situation is not unique, either within the discipline of chemistry or in science in general. I brought it up because it was the first time I really thought about the consequence of “time-dependent science” and databases. What I mean by “time-dependent science” is that what is thought to be a correct interpretation in science changes with time. So in 2002 the generally accepted structure for hexacyclinol was what Graefe proposed, but now in 2006 we have a different structure.
So the question of “right” or “wrong” is important in that it corresponds to what the community believes at a given point in time. So the database record should somehow track that time-dependence. But I think that a search for hexacyclinol today should at least give back the currently accepted structure, and even better would be to get the “old” structure too. I agree that a well-structured meta-tag system could handle this.
I know why PubChem has no record of hexacyclinol – my question of “Any explanations?” was meant for the whole post.
The CAS morass is also interesting – it points out how important careful human examination of articles remains today in building authoritative databases. Searching today on hexacyclinol within CAS, either by name or structure, does not lead you to the Porco paper containing the synthesis and x-ray structure of the compound. That to me is a CAS failure in abstracting/indexing.
(1) Thanks Steve,
Indeed. The human genome has been re-interpreted dozens of times since it was first released. I think less than half of the original annotations are now “right”.
Chemistry has, perhaps, a clearer idea of right and wrong. It takes time for people to re-adjust their views. I have only dipped into the blogosphere but my understanding is that La Clair does not agree with the blogosphere or other organs of public opinion. But I don’t believe that anyone has actually formally retracted anything. So what is actually “right” now? Majority discourse on the web? Or did the editors of Rychnovsky’s paper announce that his statement was right and La Clair’s was wrong? Or do we assume that Rychnovsky is right because he has asserted it and the journal has published the paper? Again I haven’t followed the discussion, but I don’t believe anyone has worn a hairshirt in public.
And PubChem, of course, are the modern way of doing things. They publish assertions, with attribution. They make no judgment. There are lots of “wrong” things in PubChem. What is needed – and it will come – is social computing (annotation) applied to PubChem. The chemical world can, and will, vote as to which PubChem entries are agreed to represent “accurate statements in 2007”, and so on. There will be no help from the mainstream chemical community because this smacks of everything new like Wikipedia, blogs, etc.
Possibly. But if all articles simply listed the assertions about what compounds they contained, and what names and data were associated with them, what is the role of CAS or any other indexer? Do they sit in judgment over whether compound X is associated with structure Y and name Z? If so they have to do it for all 10 million? Do they? Can they continue to do so?
The attraction of structure-based indexing (InChI) is that it is uniquifiable in most cases. Names are problematic – there are synonyms and heteronyms. But some substances require names (paraffin, dextrose, zeolites, montmorillonite, etc.). So for 95% of organic chemistry, perhaps, unique structures can be used. But for a lot of things they can’t.
Can I throw a little petrol on the fire?
Whilst it is possible to write a unique structure, the actual sample is often not a single entity. Whilst I’m sure chemists would love to claim their compounds are 100% pure, in reality there are often isomers, hydrates, salts, impurities, residual solvents, or even incorrect assignments. These can all have an impact on the properties of the molecule (both physical and biological). So for databases of properties it is often better to link via a sample identity rather than a structure.
(3) This is the key, of course. At the macroscopic level everything consists of samples. In principle each of these is different and would require a unique batch number. Chemical/pharma companies do, of course, maintain batch numbers. I can remember a case of an active sample to which we assigned several different structures as the weeks went by, but the sample remained roughly the same. Actually it had been prepared 20 years earlier, and had decomposed over the years. It is unlikely that the original formula described what was in the bottle then (the structure was assigned on the assumption of the reaction that had taken place). So the only way to describe the material is a dynamic combination of sample identifiers (realising that to describe the sample itself requires a date) and various chemical structures in different amounts at different times.
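To make the idea of a sample with a changing composition concrete, here is a minimal Python sketch. It is only an illustration under stated assumptions: the class names `Sample` and `Observation`, the batch number, the structure labels, the dates and the fractions are all invented for the example and do not come from any real system or from the case described above.

```python
from dataclasses import dataclass, field
from datetime import date


@dataclass
class Observation:
    """What was believed to be in the bottle on a given date."""
    observed_on: date
    composition: dict  # structure label -> estimated fraction (illustrative)


@dataclass
class Sample:
    """A physical sample, identified by a batch number rather than a structure."""
    batch_id: str  # hypothetical batch number
    observations: list = field(default_factory=list)

    def record(self, observed_on, composition):
        # Store a copy so later changes to the caller's dict do not leak in.
        self.observations.append(Observation(observed_on, dict(composition)))

    def composition_at(self, when):
        """Latest recorded composition on or before `when`, or None."""
        known = [o for o in self.observations if o.observed_on <= when]
        if not known:
            return None
        return max(known, key=lambda o: o.observed_on).composition


# Invented example data: one batch, re-examined twenty years later.
sample = Sample("BATCH-0001")
sample.record(date(1980, 1, 1), {"structure A": 1.0})
sample.record(date(2000, 1, 1), {"structure A": 0.6, "structure B": 0.4})

as_prepared = sample.composition_at(date(1990, 1, 1))
as_reexamined = sample.composition_at(date(2005, 1, 1))
```

The point of the design is that the batch identifier stays fixed while the structures attached to it are dated observations, so a property measured on the sample never has to be re-filed when the structural assignment changes.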
Most literature-based public chemical information systems have no concept of a sample and so are incapable of representing the situation. In many cases it works OK – make/isolate a compound, find out what it is, report this – and it stays unchanged for many years. That doesn’t happen in biosciences.
We have anticipated this in CML and have various elements (sample, substance, identifier, name) to manage those aspects of chemical identity which are not based on a connection table. So it is possible to describe the following statements:
molecule M1 has name N1 on date D1
molecule M1 has name N2 on date D2
molecule M2 has name N3 on date D1
molecule M2 has name N1 on date D2
and so forth. It is up to the community to decide on date D1, D2 what is the preferred interpretation of these statements. I would not use the words “right” and “wrong” – I would use “generally agreed (see reference) by authority X”.
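The statements above can be sketched as simple dated records. This is not CML and not any existing database schema; it is a minimal Python illustration in which `NameAssertion`, `structures_for`, the `authority` field and the two dates are assumptions made for the example, with M1, M2, N1, N2, N3 kept as the placeholders used in the text.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class NameAssertion:
    """One statement of the form 'molecule M has name N on date D'."""
    molecule: str     # placeholder for a structure identifier
    name: str
    asserted_on: date
    authority: str    # who made the assertion (hypothetical field)


def structures_for(assertions, name, as_of):
    """All molecules linked to `name` on or before `as_of`, most recent first."""
    hits = [a for a in assertions if a.name == name and a.asserted_on <= as_of]
    hits.sort(key=lambda a: a.asserted_on, reverse=True)
    return [(a.molecule, a.asserted_on, a.authority) for a in hits]


# The four statements from the text, with arbitrary illustrative dates.
D1, D2 = date(2002, 1, 1), date(2006, 1, 1)
assertions = [
    NameAssertion("M1", "N1", D1, "authority X"),
    NameAssertion("M1", "N2", D2, "authority Y"),
    NameAssertion("M2", "N3", D1, "authority X"),
    NameAssertion("M2", "N1", D2, "authority Y"),
]

# A search "today" returns the most recent assertion first, then the older one.
current = structures_for(assertions, "N1", as_of=D2)
# The same search as of the earlier date only sees the original assignment.
earlier = structures_for(assertions, "N1", as_of=D1)
```

Nothing is deleted or overwritten here: the older assertion stays in the record, and it is the query date (and whichever authority the reader trusts) that determines which structure is presented as the preferred one.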
The commercial databases give a false aura of certainty in a subject which has a lot of agreement. I would rather see all evidence being made Open (hence my emphasis on Open Data) and comments on that data. So rather than say “there is a lot of rubbish in PubChem”, I would say “there are some assertions in PubChem (usually linking names and connection tables) that I would like to challenge”. We need a social annotation mechanism. If we had that then this discussion would not have arisen – we would simply have added the connection tables for the various “hexacyclinols” to PubChem and annotated them.
I expect this sort of thing to happen. It will be ignored or opposed by most chemists. But hey…