One of the problems of having deserted blogging for months is that there is a lot to catch up with. So here’s the third post today. (I vowed not to post even once a day but…). Here Egon reviews Chemical Markup Language. I’ll quote a bit and then comment…
Editing and Validation of CML documents in Bioclipse
One advantage of using XML is that one can rely on good support in libraries for functionality. When parsing XML, one does not have to take care of the syntax and can focus on the data and its semantics. This comes at the expense of verbosity, but having the ability to express semantics explicitly is a huge benefit for flexibility.
So, when Peter and Henry put their first documents online about the Chemical Markup Language (CML), I was thrilled, even though it was actually still SGML when I encountered it. The work predates the XML recommendation. As I recently blogged, in ’99 I wrote patches for Jmol and JChemPaint to support CML, which were published as a preprint on the Chemical Preprint Server and as a paper in 2000 in the Internet Journal of Chemistry. Neither of the two has survived.
Anyway, the Chemistry Development Kit makes heavy use of CML, and Bioclipse supports it too. Now, Bioclipse is based on the Eclipse Rich Client Platform architecture, for which there exist quite a few XML tools in the Web Tools Platform (WTP). Among these is a validating, content-assisting XML editor. This means I get red markings when I make my XML document not well-formed or invalid. Just a quick recap: well-formedness means that the XML document has a proper syntax: one root node, properly closed tags, quotes around attribute values, etc. Validity, however, means that the document is not only well-formed but also hierarchically organized according to some specification.
Enter CML. CML is such a specification, first with DTDs, but after the introduction of XML Namespaces with XML Schema (see There can be only one (namespace)). The WTP can use this XML Schema for validation, and this is of great help learning the CML language. Pressing Ctrl-space in Bioclipse will now show you what allowed content can be added at the current character position. …
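To make the well-formed versus valid distinction in the quoted passage concrete, here is a minimal sketch using only Python's standard library. It checks well-formedness only; the standard library has no XML Schema validator, so validity against the CML schema is not tested here. The namespace URI is the one CML documents normally declare; the `banana` element is a made-up example of content that parses fine but that a schema validator would presumably reject.

```python
import xml.etree.ElementTree as ET

# Namespace commonly declared by CML documents.
CML_NS = "http://www.xml-cml.org/schema"


def is_well_formed(text: str) -> bool:
    """Return True if the text parses as XML (syntax only, no schema check)."""
    try:
        ET.fromstring(text)
    except ET.ParseError:
        return False
    return True


# Well-formed: one root element, every tag closed, attribute values quoted.
good = (
    f'<molecule xmlns="{CML_NS}" id="m1">'
    '<atomArray><atom id="a1" elementType="C"/></atomArray>'
    "</molecule>"
)

# Not well-formed: the <atom> element is never closed.
bad = (
    f'<molecule xmlns="{CML_NS}" id="m1">'
    '<atomArray><atom id="a1" elementType="C"></atomArray>'
    "</molecule>"
)

# Well-formed, but "banana" is an invented element, not CML vocabulary.
# Only a schema-aware validator could flag it.
nonsense = f'<molecule xmlns="{CML_NS}" id="m1"><banana/></molecule>'

for name, text in [("good", good), ("bad", bad), ("nonsense", nonsense)]:
    print(name, is_well_formed(text))
```

The third document is the interesting one: a plain parser accepts it, which is why a schema-driven editor such as the one described above adds something beyond syntax checking.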
Yes, CML is ca. 15 years old and was the first XML language to be published. It’s mature in that most of the functionality has been designed, tested, and deployed. It’s complicated because (a) chemistry is complicated and (b) information is complicated. So uptake is gradual.
I owe Egon a great debt. He’s always believed in CML and has done more than almost anyone to make it happen. He’s developed a huge amount of Open Source. That makes it believable as well as giving people ways into using it.
It’s not universally used, but this is because XML is rarely used in chemistry. Chemistry is a very conservative area and probably at least 10 years behind bioscience. The decisions in bioinformatics are made by scientists; the decisions in chemical informatics are made by software and information suppliers. And most of those only innovate when their customers ask for something new. And since the customers are also conservative, progress is slow.
A common belief is that CML is just another format. That’s a very constraining view. The true position is that it is:
- A component of semantic documents (no other format can manage that).
- A modelling language. It can represent the internal data model and this is what we are doing in Chem4Word with Microsoft.
- An information transfer tool. XML is the first system that can transfer information with almost no semantic loss. By contrast the legacy formats of chemistry are vastly lossy except within limited domains.
- An innovation tool. XML is extensible and so you can add your own extensions without breaking the original.
- A computational tool. We have actually developed CML for grammars (e.g. for generating polymer structures or the representation of imperfect information – generic and ambiguous molecules).
- A consistent and enduring component of the emerging web. We are actively developing ontologies and RDF. This is straightforward with CML – it’s impossible with SMILES, SDFiles, etc.
- A transducer for computation (e.g. using Toby White’s FoX).
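The extensibility point in the list above can be illustrated with a short sketch. It assumes the usual CML namespace and element names (`molecule`, `atomArray`, `atom`, `bondArray`, `bond`); the `lab` namespace, its URI, and the `note` element are hypothetical and invented purely for this example. Whether a particular CML schema version would accept a foreign element at this position is not checked here. The point is only that a reader which knows nothing about the extension still recovers the chemistry unchanged.

```python
import xml.etree.ElementTree as ET

CML_NS = "http://www.xml-cml.org/schema"
EXT_NS = "http://example.org/my-lab"  # hypothetical extension namespace

# A C-O skeleton (hydrogens omitted for brevity) plus one foreign element.
doc = f"""<molecule xmlns="{CML_NS}" xmlns:lab="{EXT_NS}" id="m1">
  <atomArray>
    <atom id="a1" elementType="C"/>
    <atom id="a2" elementType="O"/>
  </atomArray>
  <bondArray>
    <bond atomRefs2="a1 a2" order="1"/>
  </bondArray>
  <lab:note>measured at room temperature</lab:note>
</molecule>"""


def read_atoms(text):
    """CML-only reader: collects atoms and never looks at other namespaces."""
    root = ET.fromstring(text)
    return [
        (atom.get("id"), atom.get("elementType"))
        for atom in root.iter(f"{{{CML_NS}}}atom")
    ]


def read_notes(text):
    """Extension-aware reader: picks up only the hypothetical lab:note elements."""
    root = ET.fromstring(text)
    return [el.text for el in root.iter(f"{{{EXT_NS}}}note")]


atoms = read_atoms(doc)
notes = read_notes(doc)
print(atoms)
print(notes)
```

Because every element carries its namespace, the two readers do not interfere with each other, which is harder to arrange in line-oriented legacy formats.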
There’s been a gradual adoption of CML – and I probably don’t know much of what has happened. The Open Source community has led this and all the main tools will emit or read it. The following companies have sponsored me to develop CML applications:
- Microsoft
- IBM
- Accelrys
- Unilever
- MDL
and I know of 3 I can’t mention who actively use it in their products.
And there are an increasing number of publishers. Pride of place goes to the Royal Society of Chemistry, but anyone who wants to publish chemistry in semantic documents will probably use CML as there is no alternative. We know of one textbook publisher who is using it for high-school chemistry.
And we always like to hear from anyone who has potential applications…