CML – "can your system encode these semantics"

Rich Apodaca (Blue Obelisk) frequently asks “Can your chemoinformatics tool do this?” and has asked how CML represents various systems:

Peter, I would gladly drop FlexMol and enthusiastically support any robust system that enabled me to faithfully represent, store, and transmit representations of molecules containing axial chirality (biaryls, allenes), planar chirality (Fu’s chiral DMAPs), organometallics (metallocenes, piano stool complexes, pi-allyls), square planar stereoisomerism (cis/transplatin) aromatic radical cation/anions and other multi-centered bonding species, and other problem motifs without using templates, non-standard extensions, abusing the “wedge bond”, or other hacks.

I’m glad to see that you believe CML can do this. However, nothing in the documentation I’ve found leads me to believe this is the case, and I’ve seen not one example to show it.

I’m not trying to make a pest of myself (may be too late ;-)), but I really would like an example to back up your claim. I’ve provided a link to a specific example for both cyclopentadienyl anion and ferrocene using FlexMol in my previous comment.

What is the best practice representation (actual, valid XML) for ferrocene in either C4W or CML?

This is an excellent question, and it takes some time to answer. At its heart is the difference between the linguistico-graphical representation of chemistry and semantic chemistry. Chemists often communicate using words and diagrams, and this is very powerful, especially if they have a common education. For example, Rich and I both know what a “piano stool” representation is, but if we didn’t we wouldn’t be able to communicate.

The problem comes when we try to communicate this to machines. They don’t understand words and they don’t understand pictures. So we have to try to create a system which doesn’t rely on a shared understanding. What does “aromatic” mean? I can guarantee that if you gave 20 chemists a wide range of moderately common molecules, including heterocycles and organometallics, and asked them to say which were, and which were not, aromatic, there would be considerably less than 100% agreement – it could well be less than 80%. You can only get higher agreement by very carefully defining the rules. And though chemists love rules, there are still many systems where the rules are very fuzzy.

It is very important to stress that

CML is not a file format, it’s a semantic language

CML endeavours to capture those semantics which are universally agreed and to provide means for representing those which are not. So I showed how CML could represent different approaches to ferrocene. All are valid, in the sense that they have a convention which they apply correctly, and all have different conventions.

CML can support most common conventions but does not choose between them

Suppose we take the words and the paper away from chemists – how do they then communicate? Because this is what we have to do for machine representations. I often give students a test: can you communicate this molecule to another student simply by talking (as it were, over the phone)? Or assume you were unsighted and could not touch the other student – how would you do it then?

Chemistry is described in several worlds:

  • Words. Very important, but only useful to a machine if we have ontologies.
  • Connection tables (topology). Very useful, but they break down when molecules are dynamic or geometry matters. This is the strength and weakness of InChI.
  • Local geometry. In many molecules geometrical features of part of the molecule play a major role in the chemistry. Thus the orientation round single and double bonds and round atoms can be critical (topology and stereochemistry). Chemists normally convey this by sketching that part of the molecule using 3D clues. It’s effective for sighted humans, useless for machines. So some of these features, but relatively few, have been encoded in pictureless chemical conventions. Those that have include cis/trans, E/Z and atom chirality, but little more. Most of the molecules that Rich showed (and I’ll return to them later) rely on a picture – not just 2D coordinates but also visual clues.
  • Global geometry. These are aspects such as clusters and infinite solids. They require 3D coordinates and associated geometrical concepts.
    CML can support the geometric and topological aspects of most of the common pictureless conventions for representing chemical structure
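To make "pictureless" concrete: a convention such as atom chirality can replace a wedge drawing with a single number derived from coordinates. The Python sketch below is only an illustration under stated assumptions. The function names (`signed_volume`, `parity`) and the choice of sign are hypothetical and are not taken from the CML specification; any real convention has to state its own neighbour ordering and sign rule explicitly.

```python
from typing import Sequence

Point = Sequence[float]


def signed_volume(a: Point, b: Point, c: Point, d: Point) -> float:
    """Determinant of the edge vectors b-a, c-a, d-a (six times the tetrahedron volume)."""
    ux, uy, uz = (b[i] - a[i] for i in range(3))
    vx, vy, vz = (c[i] - a[i] for i in range(3))
    wx, wy, wz = (d[i] - a[i] for i in range(3))
    return (
        ux * (vy * wz - vz * wy)
        - uy * (vx * wz - vz * wx)
        + uz * (vx * wy - vy * wx)
    )


def parity(neighbours: Sequence[Point], tolerance: float = 1e-9) -> int:
    """Return +1 or -1 for the handedness of four ordered neighbours, 0 if coplanar.

    The sign convention here is an assumption made for this sketch.
    """
    volume = signed_volume(*neighbours)
    if abs(volume) < tolerance:
        return 0
    return 1 if volume > 0 else -1
```

Swapping any two neighbours in the ordered list flips the sign, which is exactly the information a wedge bond conveys to a sighted reader. Once the ordering and the sign rule are written down, the number can be read out over the phone.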

    If we need a picture to describe a molecule, then CML can support 2D coordinates, bonds between atoms (including multicenter bonds), and bonds between atoms and bonds, and it can annotate them. In principle these annotations can be displayed as glyphs representing many of the visual cues in Rich’s examples. But there are some aspects (e.g. hidden line removal) that CML does not do. Nor can a machine understand these. CML deals strictly with semantics that a machine can, in principle, understand. With ChemSS (chemical stylesheets), which Joe Townsend is working on, we shall be able to depict them visually in many different ways. But the semantics are unchanged.
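As a rough illustration of atoms with 2D coordinates plus an annotated multicenter bond, the Python sketch below builds a CML-like XML fragment for one cyclopentadienyl ring bound to iron. It is not a validated or best-practice CML document: the element and attribute names used (`molecule`, `atomArray`, `atom`, `bondArray`, `bond`, `atomRefs2`, `atomRefs`, `x2`, `y2`, `title`, `convention`) are my assumptions about how such a fragment might look, the convention name `example:multicentre` is invented for this sketch, and namespaces and hydrogens are omitted. Bond orders are deliberately left out, since choosing them is a matter of convention.

```python
import math
import xml.etree.ElementTree as ET


def build_cp_fe_fragment() -> ET.Element:
    """Build a CML-like fragment: five ring carbons, one Fe, one six-atom bond."""
    # "example:multicentre" is a hypothetical convention name for this sketch.
    mol = ET.Element("molecule", {"id": "m1", "convention": "example:multicentre"})

    atoms = ET.SubElement(mol, "atomArray")
    ring_ids = []
    for i in range(5):
        angle = 2 * math.pi * i / 5
        atom_id = f"c{i + 1}"
        ring_ids.append(atom_id)
        ET.SubElement(atoms, "atom", {
            "id": atom_id,
            "elementType": "C",
            "x2": f"{math.cos(angle):.3f}",
            "y2": f"{math.sin(angle):.3f}",
        })
    # Fe is placed at the ring centre, as if viewed down the ring axis.
    ET.SubElement(atoms, "atom", {
        "id": "fe1", "elementType": "Fe", "x2": "0.000", "y2": "0.000",
    })

    bonds = ET.SubElement(mol, "bondArray")
    # Ordinary two-centre bonds around the ring; no bond order is asserted.
    for i in range(5):
        ET.SubElement(bonds, "bond", {
            "id": f"b{i + 1}",
            "atomRefs2": f"{ring_ids[i]} {ring_ids[(i + 1) % 5]}",
        })
    # One multicentre bond listing all five ring carbons and the metal.
    ET.SubElement(bonds, "bond", {
        "id": "b_cp_fe",
        "atomRefs": " ".join(ring_ids + ["fe1"]),
        "title": "eta5",
    })
    return mol


if __name__ == "__main__":
    print(ET.tostring(build_cp_fe_fragment(), encoding="unicode"))
```

How the six-atom bond is drawn (a line to the ring centroid, a circle in the ring, and so on) is a styling decision and is not stored in the fragment.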

    CML adopted the separation of semantic and style that is key to modern XML languages

    Rich mentions “tool” frequently and I assume this to mean a program. The problem with almost all chemical representation systems is that they rely on software to provide the implicit semantics. A good example is Daylight’s SMILES system. They say, quite accurately, that the definition of an aromatic bond is what their program decides. The test of whether semantics are explicit is whether you can write out a file and import it into another program without losing information. That’s hard because of the many conventions, but the vision is that a chemical information system should consist of semantically aware programs and exposed data with no implicit semantics. That’s also hard because the functionality of the program may have to be complex to convert between different representations (even the explicit ones). But CML can, at least, encode the semantics, and there are an increasing number of programs that are CML-aware.
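The "write it out and read it back" test can be sketched in a few lines of Python. This is a toy under stated assumptions, not a real CML reader or writer: the functions `to_xml`, `from_xml` and `round_trips` are hypothetical, and the XML vocabulary is the same simplified, CML-like one assumed for illustration (atoms with `id` and `elementType`, bonds with `atomRefs`). The writer and the reader share no state other than the text, standing in for two independent programs.

```python
import xml.etree.ElementTree as ET
from typing import Dict, List, Tuple

Atoms = Dict[str, str]          # atom id -> element symbol
Bonds = List[Tuple[str, ...]]   # each bond lists the ids of the atoms it joins


def to_xml(atoms: Atoms, bonds: Bonds) -> str:
    """'Program A': serialise everything it knows as explicit XML."""
    mol = ET.Element("molecule")
    atom_array = ET.SubElement(mol, "atomArray")
    for atom_id, element in atoms.items():
        ET.SubElement(atom_array, "atom", {"id": atom_id, "elementType": element})
    bond_array = ET.SubElement(mol, "bondArray")
    for refs in bonds:
        ET.SubElement(bond_array, "bond", {"atomRefs": " ".join(refs)})
    return ET.tostring(mol, encoding="unicode")


def from_xml(text: str) -> Tuple[Atoms, Bonds]:
    """'Program B': rebuild the model from the text alone."""
    mol = ET.fromstring(text)
    atoms = {a.get("id"): a.get("elementType") for a in mol.iter("atom")}
    bonds = [tuple(b.get("atomRefs").split()) for b in mol.iter("bond")]
    return atoms, bonds


def round_trips(atoms: Atoms, bonds: Bonds) -> bool:
    """True if nothing is lost between writing and reading."""
    return from_xml(to_xml(atoms, bonds)) == (atoms, bonds)
```

If program A also relied on something it never wrote out (say, its own private rule for which rings count as aromatic), that information would not survive the trip, and only a comparison of the two programs' chemical interpretations would reveal the loss.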

    So the question really should be:

    Does your system provide all semantics explicitly, and are there programs that can process them?

    CML and CML-aware programs can do a lot towards this goal. In a later post I’ll show approaches towards restricted rotation and metallic stereochemistry using CML. It’s all in the language, and there are systems to process some, but not all, of the semantics.


