With the release of Chem4Word (sorry, the Chemistry Add-in for Word) we’ve reached an important milestone in the development of CML. CML is about 16 years old (Henry will give a better estimate, but I think we can reasonably date it from our trip to WWW1 and Henry’s subsequent trip to WWW2). I think it has come of age and can now be regarded as the de facto approach to representing semantic chemistry. Part of the purpose of the PMR Symposium (#pmrsymp) was to be able to make that assertion. We didn’t actually have much about CML per se, but the working code was all based on CML and we shall be publishing the justification in a special issue of BMC.
It’s not been easy to make that statement until now. It needs at least:
- A reasonably stable formulation. That’s been impossible for many years as CML has been naturally fluid as we have tried out new ideas. Now we eat our own dog food. Our CML must validate and the dictionaries must exist and resolve.
- Running code. It’s relatively easy to write a specification. It’s vastly harder to make sure it’s completely implementable. We adopt the IETF motto of “rough consensus and running code” and very little in CML has been deployed without support in at least one major language. When people ask what JUMBO does, the formal answer is that it’s the reference implementation of CML. That’s not dramatic and it’s desperately boring to write. But it’s almost all in place.
- A user community. There is sufficient variety in the people and places that are using CML that we can be reasonably confident that it has a good user base. A lot of people implement solutions without our being aware of it – that’s perfectly OK, of course, but they may be struggling with problems that have already been addressed. Last week, for example, a group in the GRID community who had implemented it under lxml wrote to us after finding bugs in the Schema validation. That seems to be a known problem in lxml.
- Robustness and portability. It’s got to be possible to implement CML in different environments. There are libraries in Java, C++, C#, FORTRAN, Python and JavaScript. These don’t all implement everything in the language but they show that everything is reasonably possible.
- Flexibility and Generality. This is one of the great strengths of CML. It’s possible to express a very wide range of concepts in CML. Because CML contains general tools for physical sciences we can model properties, parameters, complex objects, constraints, etc. The use of @convention is proving to be very powerful for developing new domains without breaking old ones. There are almost no content models (something that is very constraining in XML).
- Dictionaries. A very powerful means of expressing physical science (and other) concepts. Indeed CML can represent a lot of high-school physics and materials.
- Interoperability. CML does not try to do everything – the more that other domains provide the better CML works. So it uses MathML for the maths and SVG for the graphics. The same goes for specialist representations (e.g. EMSL for basis sets or BioPAX for bioscience). When NIST (after perhaps 15 years) finally releases UnitsML we’ll use that (assuming it’s easy to implement). For large arrays we use NetCDF or similar tools. For complex relationships we use XLink or RDF. And so on.
- Simplicity. CML is simple – or at least no more complex than the chemistry it represents. There are no abstract objects or relationships or attempts to build overly complicated models. The elements in a CML file should be understandable by high-school students.
- Uniqueness and unification. There is no other current approach that supports most of the domains in chemistry in a semantic manner. Much chemical software is centred on connection tables, but these do not support solid state, physical properties, experimental processes, computational chemistry, etc. to the same extent that CML can. There are lots of specialist non-semantic files, but these are often archaic and only work for specific codes. CML provides a nearly lossless semantic centre.
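The requirement that dictionaries “must exist and resolve” can be illustrated with a small check. This is only a minimal sketch, assuming that dictionary references are carried in `dictRef` attributes of the form `prefix:term` and that the prefix has to be bound to a namespace URI somewhere in the document. The sample document, the `ex` prefix and its URI, and the function name `unresolved_dict_refs` are all hypothetical. It checks that the prefix is bound, not that the dictionary entry itself exists, and it ignores namespace scoping.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical sample: "ex" is bound to a (made-up) dictionary URI,
# while "missing" is never declared.
SAMPLE = """<molecule xmlns="http://www.xml-cml.org/schema"
          xmlns:ex="http://example.org/dict" id="m1">
  <property dictRef="ex:meltingPoint">
    <scalar>273.15</scalar>
  </property>
  <property dictRef="missing:density">
    <scalar>1.0</scalar>
  </property>
</molecule>"""


def unresolved_dict_refs(xml_text):
    """Return dictRef values whose prefix is not bound to any namespace URI."""
    prefixes = {}    # prefix -> namespace URI, collected as declarations are seen
    unresolved = []
    source = io.BytesIO(xml_text.encode("utf-8"))
    # "start-ns" events arrive before the "start" event of the declaring element.
    for event, payload in ET.iterparse(source, events=("start-ns", "start")):
        if event == "start-ns":
            prefix, uri = payload
            prefixes[prefix] = uri
        else:
            ref = payload.get("dictRef")
            if ref is None:
                continue
            prefix, sep, _term = ref.partition(":")
            if not sep or prefix not in prefixes:
                unresolved.append(ref)
    return unresolved


print(unresolved_dict_refs(SAMPLE))
```

A fuller check would go on to fetch the dictionary behind each URI and confirm that the term has an entry there.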
CML supports five main subdomains and there is extensive experience and code in all:
- Core. This supports molecules, atoms, bonds, dictionaries and physical quantities, etc. Many implementations.
- Reactions. Tested with a wide range of reactions including enzymes (MACiE), literature extraction, and polymers.
- Spectra. Fully supported in JSpecView.
- Crystallography. Able to convert complete CIF files and now with 200,000+ structures in CrystalEye.
- Computational chemistry. Extensively tested with implementations in several major codes and continuing.
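To give a flavour of the Core subdomain, here is a minimal sketch that reads atoms and bonds from a small CML molecule using only Python’s standard library. The water document is hand-written for illustration rather than taken from any CML distribution, and the helper name `read_molecule` is hypothetical. It is not a substitute for a real CML library such as JUMBO.

```python
import xml.etree.ElementTree as ET

# The CML schema namespace, mapped to a local prefix for path queries.
NS = {"cml": "http://www.xml-cml.org/schema"}

# Hand-written example molecule (an assumption for illustration).
WATER = """<molecule xmlns="http://www.xml-cml.org/schema" id="water">
  <atomArray>
    <atom id="a1" elementType="O"/>
    <atom id="a2" elementType="H"/>
    <atom id="a3" elementType="H"/>
  </atomArray>
  <bondArray>
    <bond atomRefs2="a1 a2" order="1"/>
    <bond atomRefs2="a1 a3" order="1"/>
  </bondArray>
</molecule>"""


def read_molecule(xml_text):
    """Return ({atom id: element symbol}, [(atom id, atom id), ...])."""
    root = ET.fromstring(xml_text)
    atoms = {
        atom.get("id"): atom.get("elementType")
        for atom in root.findall("cml:atomArray/cml:atom", NS)
    }
    # atomRefs2 holds the two whitespace-separated atom ids joined by the bond.
    bonds = [
        tuple(bond.get("atomRefs2").split())
        for bond in root.findall("cml:bondArray/cml:bond", NS)
    ]
    return atoms, bonds


atoms, bonds = read_molecule(WATER)
print(atoms)
print(bonds)
```

The point is that the elements map directly onto the chemistry: a molecule holds atoms and bonds, and a bond simply refers to the ids of the atoms it joins.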
And it’s worth pointing out that CML can be used as a computational language – i.e. it can be self-modifying as in polymer markup language.
I owe a huge debt to lots of people and CML really is a community effort, with strong moderation. We wouldn’t be here without the Blue Obelisk, eScience/GRID, and the bioscience community. We’re open to any new ventures and ideas – incorporation in existing codes, chemical publication, artificial intelligence, etc.
CML is ready for universal use within chemistry.
Joe Townsend is coordinating much of the effort. I will be blogging at regular intervals. We hope to get semantic chemical blogging (e.g. in WordPress) very soon.