Chem4Word + CML representational power

Rich Apodaca is an original member of the Blue Obelisk and has developed his own chemical authoring tool (ChemWriter 299USD). He’s just posted the rather enigmatic comment…

Metallocenes? Axial Chirality? Apache/MIT/BSD License? OpenOffice? GitHub?

I’m taking this to mean that he’s asking, slightly tongue-in-cheek, (a) about the power of C4W’s and CML’s representation of chemistry and (b) about the openness and community aspects of C4W. It’s actually an excellent opportunity to follow those up. Here’s a recent post…

Language for Chemical Representation Part 2: Real-World Problems

Posted by Rich Apodaca 8 days ago

The last installment in this series discussed the limitations in today’s molecular languages and how FlexMol is designed to overcome them. Although these limitations are clearly present theoretically, what’s the practical effect likely to be?
For the last two years, a series of articles highlighting specific examples from the current chemical literature have appeared here. Variously titled “How would your cheminformatics tool do this?”, “Can your cheminformatics tool to this?”, and “Cheminformatics Puzzler”, each entry featured an article from a mainstream chemistry journal in which SMILES, Molfile, CML, and/or InChI would be incapable of faithfully representing a centerpiece structure. The examples are taken from well-read journals in synthetic organic, natural products, and medicinal chemistry.
The purpose was not to bash these languages, but rather point to an important common set of limitations among them – a kind of groupthink if you will.

and earlier

The fundamental problem with ‘standard’ molecular languages such as molfile, SMILES, InChI, and CML is their simplification of bonding and stereochemistry. Bonding is defined as an association between two atoms using two electrons. Stereochemistry is defined in terms of one or more chiral templates.

FlexMol takes a different approach. Bonding is defined in terms of systems of one or more pairs of atoms interacting with the cooperation of zero or more electrons. Stereochemistry is defined in terms of planes passing through atom pair axes.
As we shall see, this flexible system enables the faithful, lossless representation of almost any chemical substance consisting of a single, well-defined molecular entity.

There's a fundamental misunderstanding here about the role of CML (which is anything but group-think). CML addresses the semantics of chemistry. I could reply – in the same lighthearted vein:

Zeolites? Clathrates? Block co-Polymers? HEPES buffer? transition states? gaussian logfile output? cell dimensions? multiplets? eigenvectors?
and assert that CML can deal with all of them and ChemWriter cannot.

In fact CML can deal with any of the examples that Rich has mentioned in his article because it is (a) extensible, (b) namespaced and (c) linked to ontologies. CML can add any properties to any of its primitives (atoms, bonds, etc.). It can define multicenter bonds and bonds between bonds. It has primitives for lines, planes, etc., which should be sufficient for representing any of the geometry mentioned. JUMBO can do geometry and algebra on these if required. It has a primitive for electron. It can also hold 2D and 3D coordinates for atoms, so that it can represent the drawings of any of the species in Rich's diagrams.
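To make "extensible and namespaced" a little more concrete, here is a minimal Python sketch that builds a CML-style fragment for an iron atom and one five-membered carbon ring, with a single multicenter bond carrying an extra property from a foreign namespace. The element and attribute names (`molecule`, `atomArray`, `bondArray`, `atomRefs2`, `atomRefs`) follow my reading of the CML schema and should be checked against it; the `foochem` namespace URI and the `electronCount` attribute are hypothetical, invented purely for illustration. It is not a recommended or agreed representation.

```python
import xml.etree.ElementTree as ET

CML_NS = "http://www.xml-cml.org/schema"   # CML namespace URI
EX_NS = "http://example.org/foochem"       # hypothetical extension namespace

ET.register_namespace("cml", CML_NS)
ET.register_namespace("foochem", EX_NS)


def q(ns, tag):
    """Return a namespace-qualified name in ElementTree's {uri}tag form."""
    return f"{{{ns}}}{tag}"


def build_cp_fe_fragment():
    """Build Fe plus one C5 ring, with one multicenter bond (illustrative only)."""
    mol = ET.Element(q(CML_NS, "molecule"), {"id": "cpFe"})

    atoms = ET.SubElement(mol, q(CML_NS, "atomArray"))
    ET.SubElement(atoms, q(CML_NS, "atom"), {"id": "fe1", "elementType": "Fe"})
    for i in range(1, 6):
        ET.SubElement(atoms, q(CML_NS, "atom"),
                      {"id": f"c{i}", "elementType": "C"})

    bonds = ET.SubElement(mol, q(CML_NS, "bondArray"))
    # Ordinary two-atom bonds around the ring: c1-c2, ..., c5-c1.
    for i in range(1, 6):
        j = i % 5 + 1
        ET.SubElement(bonds, q(CML_NS, "bond"),
                      {"id": f"r{i}", "atomRefs2": f"c{i} c{j}"})

    # One multicenter bond listing more than two atoms. The extra
    # foochem:electronCount attribute is a made-up extension property.
    ET.SubElement(bonds, q(CML_NS, "bond"), {
        "id": "mc1",
        "atomRefs": "fe1 c1 c2 c3 c4 c5",
        q(EX_NS, "electronCount"): "6",
    })
    return mol


if __name__ == "__main__":
    print(ET.tostring(build_cp_fe_fragment(), encoding="unicode"))
```

The point of the sketch is only that the extra property lives in its own namespace, so it does not collide with anything CML itself defines.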

That some of these CML primitives are not used in practice is because, to be useful, there needs to be agreement between two or more people. If Rich wishes people to use FlexMol then either they all have to use his software or other vendors have to install FlexMol readers and writers. If he can show me a groundswell of users of FlexMol and if it appears useful for them to convert to CML then I'd be happy to give some pointers.

What would emerge is a set of primitives and ontology terms that was FlexMol-specific - in CML we call this a convention. There are already several conventions in CML - a typical example is a JSpecView convention for spectra. This requires that a spectrum contains data (it's perfectly reasonable to have an empty spectrum) so that JSpecView can display it. Another convention is CML-lite - a subset of primitives which are processable by default by C4W.
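A convention can be thought of as an extra rule layered on top of the basic primitives. The following Python sketch checks a "spectrum must contain data" rule of the kind described above. The convention name `ex:spectrumWithData`, its namespace URI, and the choice to look for non-empty `array` elements are all assumptions made for illustration; they are not the actual JSpecView convention.

```python
import xml.etree.ElementTree as ET

CML_NS = "http://www.xml-cml.org/schema"
# Hypothetical convention name; the real JSpecView QName is not given here.
HYPOTHETICAL_CONVENTION = "ex:spectrumWithData"


def satisfies_convention(spectrum):
    """True if the spectrum claims the convention and holds some array data."""
    if spectrum.get("convention") != HYPOTHETICAL_CONVENTION:
        return False
    arrays = spectrum.iter(f"{{{CML_NS}}}array")
    return any((a.text or "").strip() for a in arrays)


WITH_DATA = (
    f'<spectrum xmlns="{CML_NS}" xmlns:ex="http://example.org/conv" '
    'convention="ex:spectrumWithData">'
    "<xaxis><array>1.0 2.0 3.0</array></xaxis>"
    "</spectrum>"
)
# Still a well-formed spectrum element, but it fails the stricter convention.
EMPTY = (
    f'<spectrum xmlns="{CML_NS}" xmlns:ex="http://example.org/conv" '
    'convention="ex:spectrumWithData"/>'
)

if __name__ == "__main__":
    print(satisfies_convention(ET.fromstring(WITH_DATA)))
    print(satisfies_convention(ET.fromstring(EMPTY)))
```

The empty spectrum is not "wrong" in general; it simply does not meet the additional requirement that this particular convention imposes.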

But because CML is semantic and because it uses ontologies it can hold a very wide range of chemistry. If a processor does not understand some of it, then it simply passes it through without loss. Whether this changes the semantics can be decided by the ontology and although that's at an early stage the basic infrastructure works.
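The pass-through behaviour can be sketched with a generic XML toolkit. In the Python example below, a processor changes only the part it understands (the molecule `id`) and leaves a foreign-namespaced element untouched. The `ex` namespace, the `plane` element and its `through` attribute are hypothetical stand-ins for content a processor does not understand. Note that Python's `ElementTree` preserves elements, attributes and text, but by default drops comments and may rewrite prefixes, so this illustrates the idea rather than a byte-for-byte guarantee.

```python
import xml.etree.ElementTree as ET

CML_NS = "http://www.xml-cml.org/schema"
EX_NS = "http://example.org/flexmol-like"   # hypothetical foreign namespace

ET.register_namespace("cml", CML_NS)
ET.register_namespace("ex", EX_NS)

SOURCE = (
    f'<cml:molecule xmlns:cml="{CML_NS}" xmlns:ex="{EX_NS}" id="m1">'
    '<cml:atomArray><cml:atom id="a1" elementType="Fe"/></cml:atomArray>'
    '<ex:plane ex:through="a1">not understood by this processor</ex:plane>'
    "</cml:molecule>"
)


def rename_molecule(xml_text, new_id):
    """Change the molecule id; everything else is carried through unchanged."""
    root = ET.fromstring(xml_text)
    root.set("id", new_id)
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    print(rename_molecule(SOURCE, "m2"))
```

Whether an edit to the understood part invalidates the meaning of the part that was passed through is exactly the question the ontology would have to answer.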

I appreciate that for generations raised on FORTRAN-like formats (Mol) and implicit information (SMILES) it will take time to migrate to XML-based ontology-driven chemistry. But it's the only way forward that can cover mainstream chemistry, whether it be molecules, reactions, crystallography, nanotechnology, computation, spectra, or physical properties and their measurement.

Because chemistry is a lot more than organic molecules drawn pictorially...


5 Responses to Chem4Word + CML representational power

  1. Rich Apodaca says:

    Peter, thanks for the quick (and detailed) response to the questions.
    I’m actually not trying to push FlexMol as one of your commenters seemed to suggest. These are tough problems and there are many solutions that could work, some of which may work much better than FlexMol. My questions were really just aimed at understanding _how_ Chem4Word will deal with cheminformatics’ problem molecules.
    I could be missing the point of Chem4Word entirely. But from what I can tell it’s a way to create documents in which various kinds of chemical information can be embedded, an important subset of which will be chemical structures.
    Today’s state of the art consists of ChemDraw objects embedded in Microsoft Word documents. For the vast majority of chemists, this system works fine and there is relatively little interest in experimenting with anything else. To generate interest by busy chemists in another system, it will need to work much better, cost _much_ less, offer something fundamentally different, provide a low barrier to switching, or a combination of these things.
    One selling point could be to represent molecules like metallocenes, axially-chiral biaryls, and other problem molecules in a way that they can be faithfully extracted later on (by various data mining tools, for example) and mashed up with other information generated in an organization. This is just one of many possible selling points and may or may not be one that matters to your group – I’m not sure.
    Here’s an example of how ferrocene can be represented in FlexMol:
    http://depth-first.com/articles/2006/12/20/a-molecular-language-for-modern-chemistry-getting-started-with-flexmol
    Simple question – how would the same molecule be represented by Chem4Word/CML?

  2. pm286 says:

    I wrote a reply but WP dropped me so this is brief.
    p1. yes
    p2. ChemDraw can’t deal with crystallography, compchem, chemical properties, chemical reference number integrity, ontological validation, etc. If this is “fine” then chemists will slip even further behind other disciplines.
    p3. sorry – these are fun but lower priority than common things like AlCl3, NO2, etc. We’re concentrating on the chemistry in Wikipedia in semantic form.
    p4. Yes – name your model and CML can manage it. The natural approaches are (a) 10 explicit Fe-C bonds labelled with convention=”foochem:metalbond” or (b) a 5-atom 6 electron bond in each ring bonded to the Fe. Neither requires extension to current CML but you would do well to define an ontology. The main challenge is getting other chemists and software to adopt your formulation.

  3. Rich Apodaca says:

    Peter, I would gladly drop FlexMol and enthusiastically support any robust system that enabled me to faithfully represent, store, and transmit representations of molecules containing axial chirality (biaryls, allenes), planar chirality (Fu’s chiral DMAPs), organometallics (metallocenes, piano stool complexes, pi-allyls), square planar stereoisomerism (cis/transplatin), aromatic radical cation/anions and other multi-centered bonding species, and other problem motifs without using templates, non-standard extensions, abusing the “wedge bond”, or other hacks.
    I’m glad to see that you believe CML can do this. However, nothing in the documentation I’ve found leads me to believe this is the case, and I’ve seen not one example to show it.
    I’m not trying to make a pest of myself (may be too late ;-)), but I really would like an example to back up your claim. I’ve provided a link to a specific example for both cyclopentadienyl anion and ferrocene using FlexMol in my previous comment.
    What is the best practice representation (actual, valid XML) for ferrocene in either C4W or CML?

    • pm286 says:

      CML can be used as a natural language – a set of primitives which can be made to carry defined semantics. Those definitions come from the authorities that define them. Bonds can be defined to have a set menu of fairly common orders or can be extended to have any order. Thus ChemDraw defines a “dotted bond”. I have no idea what the ontological interpretation is as CD doesn’t define it, but I can faithfully represent the semantics as convention=”chemdraw:dotted” where the value is a QName. The QName can then be resolved against an ontology when one is created. If the ontology maps it onto a transition state and transition state is defined in the ontology we have made a lot of progress.
      Similarly I can define the 10 bonds in ferrocene as convention=”cmlx:metallorganic” if it will help and create an ontology entry describing this. The ontology could describe how they were drawn, or we could overload this with a chemical style sheet (part of C4W). If you want a substructure search and if everyone adopts exactly the same approach it’s fairly easy. If other people, as is likely, adopt other conventions then we need a system that asserts they are identical. That will not be easy in any formal system. So the best that can be done is some normalization, which InChI does.
      The point is that until there is a common agreement on representation, searching and normalization we still have chaos. IUPAC is one way of addressing this and where IUPAC converges we shall try to follow. I see little point in trying to emulate CD’s conventions which are highly graphic. So the real problem is persuading the community to converge. I believe that C4W with an Open platform is the best approach we currently have.
      But think about what operational problem you are trying to solve. Is it a pretty picture or is it a search system? If the latter then you are going to have to either write it or persuade someone else to.
      You may think this is a cop-out – creating conventions. It isn’t. It’s taking the semantics to the precise position where we do not need implicit semantics. When the implicit semantics become explicit and when the community starts to need them and adopt them, then it’s straightforward to implement them.

  4. Rich Apodaca says:

    Peter, interesting discussion. Not to sound like a broken record, but I’m still looking for the actual, valid XML.
    Where can I find a real example of ferrocene or any of the other examples I cited being represented by either C4W or CML?
