Can chemical structures be right or wrong?

Chemspiderman has commented…

  1. ChemSpider Blog » Blog Archive » Dictionary Lookups and Optical Structure Recognition Versus Structure Drawing. Which is Less Error Prone? Says:
    October 2nd, 2007 at 5:48 am […] Liquidcarbon has put up a recent blog posting about the speed at which he/she can draw structures in ChemDraw and asked for challengers. PMR has commented in Chemical SpeedDrawing. The challenge is outlined below… […]
  2. ChemSpiderMan Says:
    October 2nd, 2007 at 6:21 am Peter, I think the structure of discodermolide is wrong…this is where a look-up in a reference dictionary is necessary…and I think we both support that effort. But it MUST be curated. It IS correct on Wikipedia but drawn incorrectly by liquidcarbon and everyone afterwards…
    It is why I favor the scan and convert software for this…there is the version from Marc Nicklaus’ lab but I must admit that my present bias is to use CLiDE (http://www.simbiosys.ca/clide/index.html) because it can be batched and because the results appear to be so far ahead of the Open Source code at present. We do not have time to work on the Open Source support at present as ChemSpider is very distracting and we are focused on potentially using the batch processing for extracting novel structures from Open Access articles.
    I put a detailed blog posting about this at: http://www.chemspider.com/blog/?p=180

PMR: I have already posted on this blog that – in general – chemical structures are not right or wrong. They may be associated with other information and the chemical community as a whole decides that this association is useful or counterproductive. Please read the argument carefully.
If, for example, I write CH5, is this structure wrong? It violates the valency rule, after all. No. It’s not wrong, it just can’t be found in a bottle in most labs. It can be found in mass specs and interstellar space. There is an arrogance in the chemical informatics community that assumes the only discipline that matters is synthetic organic chemistry. In general no chemical structure that obeys the algebra is wrong. (The algebra says things like “no fractional charges on molecules” (although there can be on crystal cells) and “if A is bonded to B then B is bonded to A”.)
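The two example rules of the algebra can be checked mechanically. The following is only a minimal sketch: the function name obeys_algebra, the adjacency-dictionary representation and the +1 charge given to CH5 are all assumptions made for illustration, not part of any existing toolkit.

```python
def obeys_algebra(adjacency, charge):
    """Return True if a structure satisfies the two example rules.

    adjacency: dict mapping each atom label to the set of atoms it is bonded to
               (a hypothetical representation chosen for this sketch).
    charge:    net charge on the molecule.
    """
    # Rule 1: no fractional charges on molecules.
    if charge != int(charge):
        return False
    # Rule 2: if A is bonded to B then B is bonded to A.
    for atom, neighbours in adjacency.items():
        for other in neighbours:
            if atom not in adjacency.get(other, set()):
                return False
    return True


# CH5 (charge of +1 assumed here): one carbon bonded to five hydrogens.
ch5 = {"C": {"H1", "H2", "H3", "H4", "H5"}}
for hydrogen in list(ch5["C"]):
    ch5[hydrogen] = {"C"}

print(obeys_algebra(ch5, 1))
```

Note that the check says nothing about valency: five bonds on carbon pass, because the algebra only constrains charge and bond symmetry.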
There are unacceptable uses in mainstream C19 organic chemistry, such as carbon with 5 “valencies”. Such structures may be deemed “wrong” by organic chemists. When Chemspider was set up it was clear that the support for inorganic compounds was almost non-existent – I pointed this out and I think the position has improved somewhat. But I don’t have time to check – I expect there are many compounds represented by discrete “connection tables” which in my view are far worse chemical sins. But I am turning my attention elsewhere.
So “Peter, I think the structure of discodermolide is wrong”. No. I think this means “liquidcarbon has drawn a structure with which s/he has associated the name ‘discodermolide’ and Chemspiderman thinks this association is incompatible with current usage.” OK. Discodermolide is a substance of relatively minor importance compared to penicillin G and THC. It has 103 hits in Pubmed, compared with 30,000 for taxol. Maybe it will become famous one day. Until then I don’t really care that liquidcarbon may have got it “wrong”.
What I do care about is that we develop a community process – not regulated by a closed commercial company or a closed learned society division – that allows us to converge towards a cluster of agreed names at any point in time. In some cases this is easy – I think we all agree what Pen-G is – in some cases this is a question of removing known errors – and Wikipedia is great for this. (BTW I made a correction to the structure of Acetyl-CoA in Wikipedia, and the wikichemists agree the structure is now “correct” – but this is a natural part of using WP and I do these things every other day).
Pubchem has got it right. It simply records what name a human or organization has attached to a connection table, and gives the reference. That is all it needs to do. We then, as a community, need to evolve a Web 2.0 mechanism for annotation that allows us to find the “right” structure rapidly.
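As a rough illustration of that record-keeping model (and not Pubchem's actual schema), each assertion can be stored as a name, a connection table, who asserted it and a reference. Every name below (NameClaim, structures_for, the placeholder strings) is hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NameClaim:
    """One assertion that a name belongs to a connection table."""
    name: str
    connection_table: str  # placeholder string standing in for a real table
    asserted_by: str       # the human or organization making the claim
    reference: str         # where the claim was made


def structures_for(name, claims):
    """Group the claimants behind each connection table attached to a name."""
    grouped = {}
    for claim in claims:
        if claim.name.lower() == name.lower():
            grouped.setdefault(claim.connection_table, []).append(claim.asserted_by)
    return grouped


claims = [
    NameClaim("discodermolide", "table-A", "source-1", "ref-1"),
    NameClaim("discodermolide", "table-B", "source-2", "ref-2"),
    NameClaim("Discodermolide", "table-A", "source-3", "ref-3"),
]

print(structures_for("discodermolide", claims))
```

Nothing here marks a table as right or wrong; the record only shows who attached which name to which table, which is what an annotation layer would then build on.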
That’s the sort of thing we shall soon start to be doing with the peer-reviewed literature – if our grant gets funded. Social computing to create consensus on data and names. All Open. All in public view. Versioned. With metadata. And until the chemical “databases” adopt C21 metadata they are largely useless in the C21. Pubchem understands this. And ChEBI, and some Blue Obelisk efforts. No-one else seems to have got the point.

This entry was posted in chemistry, open issues.

2 Responses to Can chemical structures be right or wrong?

  1. Peter, Your original posting was on how long it took to copy a structure (well that’s my interpretation anyways).
    If someone copies a structure and misses a fragment of the structure, by default isn’t it wrong? If someone draws Taxol and reverses a stereocenter by accident, is that Taxol anymore? I don’t think so…I think it’s a poor copy, and wrong.
    This is the type of quality issue that is essential to have tracked.
    Maybe I misinterpreted your point about right and wrong structures?

  2. DrZZ says:

    I agree wholeheartedly with Peter. I think the emphasis on curation is misplaced. It isn’t that the issues are unimportant, it is that many of the curation questions are essentially scientific questions and saying you are going to solve that by curation leads to a situation where these scientific decisions are made in ways that don’t become part of the database record. I think at this point in time the emphasis should be on an architecture that allows multiple structure/name claims from multiple sources to be compared and then ways to annotate and track the discussions that arise as consensus is reached on the inconsistencies. This is exactly what scientists do and I don’t see why we should build databases that hide these issues from the people who use them. To be sure most inconsistencies will be the equivalent of typos, but even these problems are ill served by the current set up. I can tell you that at least 3 or 4 other groups have contributed structures to PubChem that were originally copied from the DTP database and hence are not really independent data sources. If there was a problem in a DTP generated structure do we look at PubChem and say that because a number of independent groups have the same structure that there is a consensus developing on an alternative structure? Not if we knew that all the structures originally came from the same source. Setting up databases that assume the only information needed for a structure is the connection table, coordinates, and whether the representation is right or wrong throws away all the information of where the structure came from, how it was manipulated, what evidence was claimed in support of it, etc. It doesn’t even allow us to know what correction to make. We just had an example where someone found a name for a DTP compound that didn’t match the structure at all. It would be pretty easy to substitute the correct structure for that name, but in this case that would be exactly the wrong thing to do.
The primary data from the supplier was the structure and it was a DTP mistake that associated an incorrect name with that structure. In my opinion the most pressing need is for an awareness of the importance of these data items and for an architecture that captures and makes use of them. I think PubChem has made a good start and I would very much like to encourage that, rather than use its limited resources to try to be the final (and hidden) arbiter of what is right and wrong.
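The point about non-independent sources can be sketched as follows: count distinct origins behind a structure rather than distinct depositors. This is only an illustration under assumed names (independent_support, the tuple layout and the group labels are hypothetical), not a description of how PubChem works.

```python
def independent_support(claims):
    """Count distinct original sources behind each structure.

    claims: iterable of (structure, depositor, origin) tuples, where origin is
            the database the depositor copied the structure from (or the
            depositor itself if the structure is its own primary data).
    """
    origins = {}
    for structure, _depositor, origin in claims:
        origins.setdefault(structure, set()).add(origin)
    return {structure: len(found) for structure, found in origins.items()}


claims = [
    ("structure-X", "group-1", "DTP"),
    ("structure-X", "group-2", "DTP"),
    ("structure-X", "group-3", "DTP"),
    ("structure-Y", "group-4", "group-4"),
]

print(independent_support(claims))
```

Three depositors agreeing on structure-X still amount to one origin, so a database that keeps the origin field can avoid mistaking copies for consensus.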
