Chem4Word – the journey so far

We’ve been very silent about Chem4Word (C4W) for several reasons, but a major one is that I don’t like vapourware. I’ve spent too long in the pharma industry listening to high-pressure sales pitches, including (ca. late 1980s, all true):

  • “We have a revolutionary method for predicting protein structure. It’s so powerful we aren’t telling anyone anything about it or how it works – you have to buy it to find out”
  • “Our graphics can render spheres 10 times faster than the competition, so you can design 10 times more drugs”
  • “We are launching our product to a selected group of pharma companies; we’ve got one slot left, but we need a PO from you by the end of the week”
  • “The Bioengine is so powerful that it can understand Japanese and fold proteins. It’s only 22M USD but you get an ETA supercomputer thrown in” (ETA went belly-up the next week).

Needless to say none of these were heard of again.

So I have been careful not to create vapourware during the gestation period (let’s say 9 months so far). And, gratifyingly, a lot of things have changed in a positive direction. So I can say, accurately, that I am delighted with where we arrived on Tuesday last week. It’s been a rather twisty journey to get there and this has resulted in false trails, confusion, fun, pub sessions, belief, despair, etc. A month earlier I would have said the project velocity was negative, we had a broken system that I would be embarrassed to show anyone, our architecture was a disaster, etc. I was terrified of showing it at BioIT. Today I am proud and very positive and I will tell you about it in a series of posts.

So what is C4W? At the most general it’s an act of faith by Microsoft Research about our work on semantic chemistry at Cambridge and how Word 2007 can become the semantic framework. But it’s the people that matter – I can list 20 at Microsoft and I will do so over the weeks but I’ll start with Lee Dirks (project sponsor) and Alex Wade (program manager). Without both of these, working very long days at difficult hours and with great patience, the project would have crashed. They have never flinched from the belief that the project would succeed, and they have changed direction on several occasions in response to need. We’ve been through different approaches to architecture, different types of project management and different allocation of resources. This may sound like thrashing; I can tell you it isn’t.

More specifically C4W is a semantic and ontological chemistry system which includes creation, editing, publishing and re-use of what Henry Rzepa and I call datuments – integrated data and documents. It’s not YACE – “yet another chemical editor”, or YAELNb (lab notebook) or a “ChemFoo killer”. It’s what I have been wanting for 15 years – a properly resourced Open implementation of a semantic chemistry system – a collaboration with a 600-pound gorilla which can make the dream happen.

In developing Chemical Markup Language I was always aware that I would need help. Chemists are conservative and when people said “who is using CML? Only the Blue Obelisk? Oh, then we shan’t bother” – I had to accept this as the verdict of the market – I and others had to create a complete ecosystem for CML and then people might start using it. That is, of course, hard. But I knew I needed a 600PG (I can’t find the origin of this phrase and the mass varies – 800, 900).

In likening Microsoft to a 600PG we know that gorillas are largely harmless unless you upset them or get in their way. I am getting to know the gorilla well in parts and so far I can co-exist without being squashed. I’m in control of those parts I need to be in control of and happily leave other bits to the gorilla.

So what is C4W? It’s a flexible, modular, validatable, semantic ontological chemistry platform in C#, XML/CML and RDF with graphics/UI in WPF and XAML. It emphasizes validation and semantic correctness (e.g. all hydrogens must be specified and no information can be provided by default). The implementation is declarative/functional in that modules are side-effect-free and information is computed lazily or on demand. Recomputing rather than storing makes sure that information cannot get corrupted. An XML data model means that everything is visible. Today’s machines are fast enough that graphics can be loosely coupled to the data model – we can pass the data repeatedly every few milliseconds. XML gives the flexibility missing in fixed storage models. And a lot more…
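To make the "no defaults, recompute on demand" idea a little more concrete, here is a minimal sketch in Python. It is not C4W code (which is written in C#), and every name in it (`Atom`, `Molecule`, `hydrogen_count`, `validation_errors`, `formula`) is hypothetical, chosen only for illustration. It assumes a molecule is an immutable list of atoms, that a missing hydrogen count is an error rather than something to be guessed, and that derived information such as the formula is computed each time it is asked for instead of being stored.

```python
from collections import Counter
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Atom:
    """Hypothetical atom record; frozen so it cannot be mutated after creation."""
    element: str
    # None means "not stated". It is never silently treated as zero.
    hydrogen_count: Optional[int] = None


@dataclass(frozen=True)
class Molecule:
    """Hypothetical immutable molecule; all derived data is recomputed on demand."""
    atoms: Tuple[Atom, ...]

    def validation_errors(self) -> List[str]:
        """Report every atom whose hydrogen count was left unspecified."""
        return [
            f"atom {i} ({atom.element}): hydrogen count not specified"
            for i, atom in enumerate(self.atoms)
            if atom.hydrogen_count is None
        ]

    @property
    def formula(self) -> str:
        """Compute the formula from the atoms on every access; nothing is cached."""
        errors = self.validation_errors()
        if errors:
            raise ValueError("; ".join(errors))
        counts: Counter = Counter()
        for atom in self.atoms:
            counts[atom.element] += 1
            counts["H"] += atom.hydrogen_count
        # Hill-style ordering: C, then H, then the rest alphabetically;
        # purely alphabetical when there is no carbon.
        if "C" in counts:
            order = ["C", "H"] + sorted(s for s in counts if s not in ("C", "H"))
        else:
            order = sorted(counts)
        return "".join(
            f"{symbol}{counts[symbol] if counts[symbol] > 1 else ''}"
            for symbol in order
            if counts[symbol] > 0
        )


# Ethanol with every hydrogen stated explicitly: CH3, CH2, OH.
ethanol = Molecule((Atom("C", 3), Atom("C", 2), Atom("O", 1)))
print(ethanol.formula)

# A carbon with no stated hydrogens is rejected rather than "helpfully" completed.
incomplete = Molecule((Atom("C"),))
print(incomplete.validation_errors())
```

Because the formula is derived from the atoms each time, there is no stored copy that can drift out of step with the data, which is the point of recomputing rather than storing.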

What does it currently do? We decided early on that there had to be compromises between functionality, aesthetics, and semantic correctness. We’re strong on the semantics, which needs to be correct right from the start. Word gives a lot of semantic functionality for free – XML validators, smart tags, etc. We’re working on the aesthetics over the next few weeks to get the “user experience” right. Word itself has a great deal of UI functionality – we have a navigator, a gallery, etc. and the group continues to improve its experience in MS UI tools. We’ve developed a completely new approach to chemical styling. The functionality concentrates on processing existing molecules, including “tweak” functionality, validation and normalization. We’ve deliberately left molecule creation to later because it’s open-ended and we need to do it semantically (most current programs are graphically oriented and have virtually no validation other than valence checking on carbon). In the modern era with PubChem it’s probable that most scientists will find that their molecules already exist, and there are in any case free drawing tools that emit CML.

As we’ll be formally showing C4W later this month I’m not going to show screenshots, etc. I am very excited indeed about the ontology feature and Nico Adams’ ChemAxiom but I am keeping my mouth shut and leaving him to blog about it.

We will be releasing C4W under an Open Source licence and looking for collaborators. We don’t yet know the mechanism of this – it will appear in CodePlex – and we shall be looking for developers who are able and keen to work in a collaborative .NET environment. If you are interested in this, and can contribute, let us know. Please accept that we can’t make time promises at this stage, and until we know the .NET chemistry community better we don’t know what degree of management will be required.


4 Responses to Chem4Word – the journey so far

  1. Rich Apodaca says:

    Metallocenes? Axial Chirality? Apache/MIT/BSD License? OpenOffice? GitHub?

  2. DaveG says:

    The more direct version of Rich’s question:
    FlexMol? Apache/MIT/BSD License? OpenOffice? GitHub?

  3. Peter…I’m interested in taking a look when you are free to demonstrate. I am at Bio-IT World arriving Tuesday morning and leaving Wednesday afternoon so maybe we can coordinate a get together for a coffee?

  4. Pingback: Chemical Industry Blog : Chem4Word Sighting
