Noel reviews chemical depiction and SDG

A really useful post from Noel O'Blog about chemical depiction and structure diagram generation (SDG). Drawing the chemical structure of a compound as a "2D diagram" is often the most important way of communicating chemical information. There is a gradually growing realisation that diagrams need to be clear and use consistent conventions (I have been involved with IUPAC in this activity).

There are two aspects as Noel shows clearly:

  • if you know which atoms are connected, and you know where to put them on the page, then how do you draw the best (most useful) diagram? Among the things you are allowed to alter are:
  1. the font size and color
  2. the width of lines
  3. the color of bonds
  4. whether hydrogen atoms are shown or not
  5. how aromatic rings are drawn (double bonds or circles)
  6. where charges should be located
  7. how close bond lines should approach atoms
  8. what happens when lines cross
  9. how to depict stereochemistry
  10. where exactly to position double bonds (inside rings, inside and outside, mitred, etc.)
  • what you must not do is alter the position of atoms.
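As a rough illustration of that split (a sketch, not code from any of the toolkits discussed), the Python below treats the atom coordinates as read-only input and exposes only style choices. `DepictionStyle`, `depict_svg` and all parameter names are invented for this example, and only a few of the ten items above are covered.

```python
from dataclasses import dataclass
from math import hypot


@dataclass
class DepictionStyle:
    """Hypothetical style settings. None of them moves an atom."""
    font_size: float = 12.0
    line_width: float = 1.5
    bond_color: str = "black"
    show_hydrogens: bool = False
    label_gap: float = 8.0  # distance a bond line stops short of an atom label


def depict_svg(atoms, bonds, style):
    """Draw atoms [(symbol, x, y), ...] and bonds [(i, j), ...] as an SVG string."""
    hidden = {i for i, (symbol, _, _) in enumerate(atoms)
              if symbol == "H" and not style.show_hydrogens}
    parts = ['<svg xmlns="http://www.w3.org/2000/svg">']
    for i, j in bonds:
        if i in hidden or j in hidden:
            continue
        (sym_i, x1, y1), (sym_j, x2, y2) = atoms[i], atoms[j]
        length = hypot(x2 - x1, y2 - y1)
        # Carbon is an unlabelled vertex; labelled atoms need a gap.
        gap_i = 0.0 if sym_i == "C" else style.label_gap
        gap_j = 0.0 if sym_j == "C" else style.label_gap
        if length <= gap_i + gap_j:
            continue  # atoms too close together to draw a visible line
        ux, uy = (x2 - x1) / length, (y2 - y1) / length
        parts.append(
            f'<line x1="{x1 + ux * gap_i:.1f}" y1="{y1 + uy * gap_i:.1f}" '
            f'x2="{x2 - ux * gap_j:.1f}" y2="{y2 - uy * gap_j:.1f}" '
            f'stroke="{style.bond_color}" stroke-width="{style.line_width}"/>')
    for i, (symbol, x, y) in enumerate(atoms):
        if i in hidden or symbol == "C":
            continue
        parts.append(f'<text x="{x:.1f}" y="{y:.1f}" text-anchor="middle" '
                     f'font-size="{style.font_size}">{symbol}</text>')
    parts.append("</svg>")
    return "\n".join(parts)


# C-C-O-H laid out on a line; the coordinates are given and never changed.
atoms = [("C", 0, 0), ("C", 30, 0), ("O", 60, 0), ("H", 90, 0)]
bonds = [(0, 1), (1, 2), (2, 3)]
plain = depict_svg(atoms, bonds, DepictionStyle())
with_h = depict_svg(atoms, bonds, DepictionStyle(show_hydrogens=True))
```

The two calls draw the same coordinates with different style settings: hiding the hydrogen drops its bond and label, while every remaining atom stays where it was.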

It must be clear what the compound is - correctness is more important than beauty. A major problem is when atoms are very close - it can be difficult to distinguish the atoms and often there are spurious "rings". There is no correct answer, but it's worth looking at some of Noel's collection of molecules drawn by different programs. The molecules are randomly taken from PubChem (so probably don't exercise the inorganic features). Here's the post:

Now for some pretty pictures as well as some not so pretty. Yes, it's the turn of the structure diagram generators (SDGs) to strut their stuff and throw some shapes. How do they perform for 100 random compounds from PubChem?

Here are my [NO'B] results for depiction and structure diagram generation [...]

Some comments:
(0) Rich Apodaca has written an overview of Open Source SDGs.
(1) 2D coordinate generation is independent of depiction. An SDG typically has both parts but coordinates could be generated with one toolkit and depicted with another.
(2) Looking good is not the same as chemical accuracy. But looking good is important too! :-)

(5) The PubChem images appear to be generated by an OpenEye product (for sure, the coordinates are). I don't know what version.
(7) It is important to consider how to handle hydrogens. With OASA, I just drew all the hydrogens. This is probably not a good idea.
(10) PubChem entries with more than 1 connected component were not included in this test. (As a result, the number of molecules shown is actually less than 100.)
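Point (10) amounts to counting the connected components of the molecular graph. The post does not say how the filter was implemented; one plausible way, sketched here in plain Python with a made-up function name `count_components`, is a breadth-first search over the bond list.

```python
from collections import deque


def count_components(n_atoms, bonds):
    """Count connected components of a molecular graph with breadth-first search."""
    neighbours = [[] for _ in range(n_atoms)]
    for i, j in bonds:
        neighbours[i].append(j)
        neighbours[j].append(i)
    seen = [False] * n_atoms
    components = 0
    for start in range(n_atoms):
        if seen[start]:
            continue
        components += 1  # an unvisited atom starts a new fragment
        seen[start] = True
        queue = deque([start])
        while queue:
            atom = queue.popleft()
            for other in neighbours[atom]:
                if not seen[other]:
                    seen[other] = True
                    queue.append(other)
    return components


# Heavy atoms of sodium acetate: C, C, O, O bonded together, Na on its own.
acetate_bonds = [(0, 1), (1, 2), (1, 3)]
n_fragments = count_components(5, acetate_bonds)
```

An entry would be kept for the test only when the count is 1, so a salt like the one above would be skipped.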

PMR: Make your own choice as to what looks nice, but some are dead wrong in the stereochemistry. Personally I deprecate any depiction of atom-centered stereochemistry that does not have the pointed end on the atom and is not wedge-shaped. (Thick lines are often ambiguous, and perspective diagrams easily lead to errors.) Here's part of a typical line (7250053) - all the compounds are meant to be the same, but they cannot be. At least two are in error.
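To be explicit about the wedge convention I prefer, here is a small geometric sketch in Python. The function name `wedge_polygon` and the default width are assumptions made for illustration; the only point is that the narrow end of the triangle sits exactly on the stereocentre and the wide end on the substituent.

```python
from math import hypot


def wedge_polygon(stereocentre, substituent, wide_end_width=6.0):
    """Return the three corners of a solid wedge bond.

    The first corner is the stereocentre itself (the pointed end);
    the other two straddle the substituent atom (the wide end).
    """
    (x1, y1), (x2, y2) = stereocentre, substituent
    length = hypot(x2 - x1, y2 - y1)
    if length == 0:
        raise ValueError("stereocentre and substituent coincide")
    # Unit vector perpendicular to the bond direction.
    px, py = -(y2 - y1) / length, (x2 - x1) / length
    half = wide_end_width / 2.0
    return [
        (x1, y1),
        (x2 + px * half, y2 + py * half),
        (x2 - px * half, y2 - py * half),
    ]


corners = wedge_polygon((0, 0), (30, 0))
```

Swapping the two arguments reverses which atom carries the stereo information, which is exactly the kind of error that makes two drawings of the "same" compound disagree.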

So it's not impossible but not completely trivial to depict structures. Structure Diagram Generation (where the coordinates are not given) is much harder and there is often an impossible tension between accuracy, arbitrary convention, and aesthetics. Sometimes only a human can do it.
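For a sense of where SDG starts, the easiest possible case is a single isolated ring, which can be placed on a regular polygon so that every bond has the same length. The Python below is only that trivial case, with an invented function name `layout_ring`; fused rings, substituents, overlap avoidance and stereochemistry are where the real difficulty lies, and none of them is attempted here.

```python
from math import cos, sin, pi, hypot


def layout_ring(n_atoms, bond_length=1.0):
    """Place the atoms of one simple ring on a regular polygon."""
    if n_atoms < 3:
        raise ValueError("a ring needs at least three atoms")
    # Circumradius of a regular n-gon with the given edge length.
    radius = bond_length / (2.0 * sin(pi / n_atoms))
    return [
        (radius * cos(2.0 * pi * k / n_atoms),
         radius * sin(2.0 * pi * k / n_atoms))
        for k in range(n_atoms)
    ]


benzene = layout_ring(6)
ring_bond_lengths = [
    hypot(benzene[k][0] - benzene[(k + 1) % 6][0],
          benzene[k][1] - benzene[(k + 1) % 6][1])
    for k in range(6)
]
```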


5 Responses to Noel reviews chemical depiction and SDG

  1. There is an option in the Renderer2DModel for not showing hydrogens. Rather useful, I'd say.

    Not sure what is happening with the incorrect wedge in the CDK image ?!?! Please file a bug report in CDK's BTS on SF, just pointing to this blog item.

  2. pm286 says:

    (1) Egon - and Noel. I don't know where the bug is. It could be that the data file that was sent out didn't correspond to the PubChem picture. I assume it was a MOL/SD file.

    Noel - I'd be interested in the files - JUMBO has a depiction engine now using SVG and it would be interesting to see how it got on.

  3. baoilleach says:

    Just to correct an error in my original blog post, which is reproduced above. Point (5) above has been corrected to read "The PubChem images and coordinates are generated by the Cactvs toolkit".

  4. Wolf-D. Ihlenfeldt says:

    Here are some comments on the more subtle mechanics working behind the scenes of the PubChem layouter. This may help to explain why the PubChem images (mostly) look familiar, recognizable and pleasant, and why other tools fail:

    a) Ring system alignment. Dominant ring systems are drawn in a standard orientation (13199309). We do not want to see steroids in weird orientations.

    b) Ring system substituents and embedded hetero atoms also influence the orientation - substituents preferentially on top, heteroatoms on bottom (14610101).

    c) You want a clear indication whether a double bond is stereo or not. Stereo double bonds have explicit H, non-stereogenic double bonds are crossed (3834779). Plotting a stereogenic double bond in a random cis or trans fashion without crossing bonds, wavy substituent bonds or other indicators is a grave error.

    d) The stereochemistry of ring double bonds and double bonds attached to a ring (11870297) is definitely important when they are part of a macrocycle. No other renderer seems to get this right.

    e) In some cases, wedges are best put on explicit hydrogens (14785619). None of the other renderers manage to display any recognizable stereochemistry for this cpd.

    f) Charges and isotope labels are an important part of the structure data. Omitting them is just wrong (12389967).

    g) Dynamic autoscaling is a must. For many applications, you cannot allow huge compounds to break out of your box (9682082). All PubChem images are the same size. If a structure is too big to fit into the box, it is scaled down.

    h) It is often possible to avoid atom overlap with a little effort in postprocessing (11829069, 19040534). And if you need to make a compromise (14384490), better allow bonds and rings to overlap than atoms.
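Points g) and h) can be sketched in a few lines of Python. This is an illustration only, not the Cactvs code: the function names `fit_to_box` and `overlapping_pairs`, the `max_scale` cap and the distance threshold are all assumptions. The first function shrinks a layout that is too big for the image box (and never enlarges a small one beyond `max_scale`); the second only detects atoms that sit too close together, leaving the actual postprocessing to the layouter.

```python
from itertools import combinations
from math import hypot


def fit_to_box(coords, box_width, box_height, margin=0.0, max_scale=1.0):
    """Centre 2D coordinates in a box, scaling down if they do not fit."""
    xs = [x for x, _ in coords]
    ys = [y for _, y in coords]
    width, height = max(xs) - min(xs), max(ys) - min(ys)
    scale = max_scale
    if width > 0:
        scale = min(scale, (box_width - 2 * margin) / width)
    if height > 0:
        scale = min(scale, (box_height - 2 * margin) / height)
    cx, cy = (min(xs) + max(xs)) / 2.0, (min(ys) + max(ys)) / 2.0
    return [((x - cx) * scale + box_width / 2.0,
             (y - cy) * scale + box_height / 2.0) for x, y in coords]


def overlapping_pairs(coords, min_distance):
    """Return index pairs of atoms closer together than min_distance."""
    return [(i, j)
            for (i, (xi, yi)), (j, (xj, yj)) in combinations(enumerate(coords), 2)
            if hypot(xi - xj, yi - yj) < min_distance]


big = fit_to_box([(0, 0), (400, 0), (400, 200)], 200, 200)
small = fit_to_box([(0, 0), (10, 0)], 200, 200)
clashes = overlapping_pairs([(0, 0), (0.1, 0), (5, 5)], 0.5)
```

With a fixed 200 x 200 box, the large layout is halved so that it just fits, while the small one is only centred.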

    Some other important features not shown in the sample set, but handled by the renderer are:

    - Stereochemically correct odd and even allenes, square planar compounds etc. are important, too, and do appear in PubChem. And SF4 should not be drawn the same way as CH4 - atom hybridization is part of the equation.

    - The renderer can compute and display contracted symbols (NH2, COOH), but this is currently not enabled for the PubChem images. Also, it can either use available z coordinate information or automatically infer it for cage ring systems, in order to render crossing bonds with a gap.

    - If there are multiple fragments, special alignment rules bring opposite charge pairs into vicinity, etc. And if somebody encodes an alloy as 90 atoms of metal A and 10 atoms of metal B (yes, you can find them in PubChem), you do not want to see a chain of pearls in the image, but something more reasonably grouped.

    The Cactvs display routines obtain information about defined stereochemistry and connectivity from code implemented with the aid of another toolkit, so they are not to blame for the choice of strange tautomers such as in 20834758.

    The layout component of the toolkit code is more than 25K lines. This count includes ring analysis, rendering functions etc. but not auxiliary support routines (I/O, low-level imaging). This is probably more than the total line count of some of the other toolkits. As always, your results tend to be related to what you invest.

  5. pm286 says:

    (2) many thanks indeed Wolf-D - this is very helpful.
