Category Archives: mkm2007

Chemistry in MathML and CML - comments?

[warning - WordPress is not very math/chem friendly so forgive formatting]

Michael Kohlhase and I are trying to come up with a synthesis of MathML and CML for representing the numerical aspects of chemistry. By chance we have started with reaction rates - mainly because I found a thesis which is well suited for markup. It contained the equation:

rgas = k0·[Ester]+kKat·[Ester·Kat]

(actually it contained it in PDF which didn't transcribe, but this is the essence.) So how do we encode it in MathML and CML?

At one level - presentational - it's quite easy. MathML has glyphs for all the symbols above and you simply pick them. They allow pretty typesetting (which is important). The problem is that they don't mean anything. What does "+" mean? It's obvious to a chemist - we add two quantities. But to a mathematician it can mean lots of things. And, now you think of it, it also means several things to a chemist - such as a positive charge. Well it obviously doesn't mean that here, does it? Could it be a positively charged Ester? Not really, because it's not a superscript, because Esters aren't usually charged, and because addition makes more sense. But these are chemical judgments. Chemists make them easily. Mathematicians might not.

Then there is the "·" - not a period/fullstop, but "middot", a midheight dot. What does it mean? Well, it's obvious to a mathematician that it could mean multiply. So we have three multiplications and we could use the MathML "times" construct. But hang on - Ester times Kat doesn't make chemical sense. Here it means the "reaction complex" of Ester and Kat (the thesis is in German, and "Kat" abbreviates Katalysator - catalyst in English). So the symbols by themselves are meaningless to a non-domain expert. And, unfortunately, our chemical journal-eating robots are not yet very expert in equations.

Take a minute to think about how you would explain the complete chemistry in this equation to a mathematician, and how you would explain the complete mathematics to a chemist.

You've probably come up with something reasonable. But now try to explain it to a machine. That's what we have to do in Content-MathML and CML.

Michael has come up with the following semantic maths expression (I hope WordPress preserves it)
(no it didn't)

Try again...

<math class="display">
  <csymbol cd="foundations" name="rgas"/>
  <csymbol cd="constants" name="O"/>
  <apply xml:id="esterconst">
    <csymbol cd="foundations" name="squarebrackets"/>
    <csymbol cd="cml" name="Ester"/>
    <csymbol cd="rateconstants" name="Kat"/>
    <csymbol cd="foundations" name="squarebrackets"/>
    <csymbol cd="cml" name="Ester"/>
    <csymbol cd="cml" name="middot"/>
    <csymbol cd="cml" name="Katstar"/>
  </apply>
</math>

(this is about as pretty as it gets)

So this has captured the semantics of the maths, but none of the chemistry. It states (roughly) that you multiply something by something and add it to something times something.

The "cd" attributes refer to OM content dictionaries - you can look up the meaning (and the semantics) of the object in a dictionary. So we could find out what rgas means in the foundations dictionary. Of course we still have to write the dictionary entry - and that isn't easy - it's the sort of thing that Andrew Walkingshaw has been developing Golem to help with. But we make progress.
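As a sketch of what such a lookup buys a machine, here is a minimal Python fragment that resolves the csymbol references in a content-MathML fragment. The dictionary entries here are invented for illustration - real CD entries are far richer and live in external files.

```python
import xml.etree.ElementTree as ET

# Hypothetical content-dictionary entries (illustration only).
DICTIONARIES = {
    ("foundations", "rgas"): "rate of the gas-phase reaction (mol dm-3 s-1)",
    ("cml", "Ester"): "concentration species: the ester substrate",
    ("cml", "middot"): "reaction complex formation (NOT multiplication)",
    ("rateconstants", "Kat"): "rate constant for the catalysed pathway",
}

def resolve(xml_fragment):
    """Look up every csymbol in a content-MathML fragment."""
    root = ET.fromstring(xml_fragment)
    results = []
    for sym in root.iter("csymbol"):
        key = (sym.get("cd"), sym.get("name"))
        results.append((key, DICTIONARIES.get(key, "UNDEFINED - write the CD entry!")))
    return results

fragment = """<math>
  <csymbol cd="foundations" name="rgas"/>
  <csymbol cd="cml" name="Ester"/>
</math>"""

for key, meaning in resolve(fragment):
    print(key, "->", meaning)
```

Any symbol whose (cd, name) pair has no entry is flagged, which is exactly the pressure that forces us to write the dictionary entries.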

The content MathML is a big advance - a machine could evaluate the expression if it knew what the somethings were. That's where chemistry comes in. And, be warned, if you want a machine to evaluate the chemistry in the above equation it may be harder than it looks. To start you off, here is Wikipedia's version of the rate equation (and if you don't agree, please update WP, that's how it works)...

Formal definition of reaction rate

According to IUPAC's Gold Book definition[1] the reaction rate v (also r or R) for the general chemical reaction aA + bB → pP + qQ, occurring in a closed system under constant-volume conditions, without a build-up of reaction intermediates, is defined as:

v = - \frac{1}{a} \frac{d[A]}{dt} = - \frac{1}{b} \frac{d[B]}{dt} = \frac{1}{p} \frac{d[P]}{dt} = \frac{1}{q} \frac{d[Q]}{dt}

The IUPAC[1] recommends that the unit of time should always be the second. In such a case the rate of reaction differs from the rate of increase of concentration of a product P by a constant factor (the reciprocal of its stoichiometric number) and for a reactant A by minus the reciprocal of the stoichiometric number. Reaction rate usually has the units of mol dm-3 s-1. It is important to bear in mind that the previous definition is only valid for a single reaction, in a closed system of constant volume.
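As a small illustration of the definition above (with made-up concentration data), the rate can be estimated numerically by differencing sampled concentrations and dividing by the stoichiometric number, with the sign flipped for a reactant:

```python
# Sketch: estimate the IUPAC reaction rate v = -(1/a) d[A]/dt from sampled
# concentration data (hypothetical numbers; closed system, constant volume).
def reaction_rate(times, conc, stoich, reactant=True):
    """Central-difference estimate of v at the interior sample points."""
    sign = -1.0 if reactant else 1.0
    rates = []
    for i in range(1, len(times) - 1):
        dcdt = (conc[i + 1] - conc[i - 1]) / (times[i + 1] - times[i - 1])
        rates.append(sign * dcdt / stoich)
    return rates

# [A] falling at 0.02 mol dm-3 s-1 with a = 2 gives v = 0.01 mol dm-3 s-1
times = [0.0, 1.0, 2.0, 3.0]
conc_A = [1.00, 0.98, 0.96, 0.94]
print(reaction_rate(times, conc_A, stoich=2))  # approximately [0.01, 0.01]
```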

First-order reactions

A first-order reaction depends on the concentration of only one reactant (a unimolecular reaction). Other reactants can be present, but each will be zero-order. The rate law for a first-order reaction is

\ r  = k[A]

k is the first order rate constant that has units of 1/time

If, and only if, this first-order reaction 1) occurs in a closed system, 2) there is no net build-up of intermediates and 3) there are no other reactions occurring, it can be shown by solving a mass balance for the system that

-\frac{1}{a}\frac{d[A]}{dt} = k[A]

where a is the stoichiometric coefficient of the species A.

The integrated first-order rate law is

\ \ln{[A]} = -akt + \ln{[A]_0}
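A quick numerical sanity check of the integrated law, with arbitrary illustrative values for a, k and [A]0: integrate -(1/a) d[A]/dt = k[A] by a crude Euler method and compare ln[A] with -akt + ln[A]0:

```python
import math

# Arbitrary illustrative values (not from any real system).
a, k, A0, dt = 2.0, 0.1, 1.0, 1e-4

# Euler integration of d[A]/dt = -a*k*[A] up to t = 5
A = A0
for step in range(int(5.0 / dt)):
    A += -a * k * A * dt

# Integrated first-order law: ln[A] = -a*k*t + ln[A]0
analytic = math.log(A0) - a * k * 5.0
print(math.log(A), analytic)  # agree to about 3 decimal places
```

Note that dropping the stoichiometric multiplier a (as I did) changes the decay constant and the check fails - the machine catches the slip immediately.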

That's enough for me to post at present. Have you thought of everything? (I personally forgot the multiplier "a" in the last equation - it's easy to do).

MathML and CML communities

I was delighted to meet old friends from the MathML/OpenMath community last week at Mathematical Knowledge Management 2007 - Patrick Ion, Robert Miner, James Davenport and Michael Kohlhase (apologies to any I have omitted). OpenMath (1993) was one of the first non-textual markup languages and was based on SGML, while MathML came along later (1999). The languages are distinct but deliberately converging and (from WP):

OpenMath consists of the definition of "OpenMath Objects", an abstract datatype for describing the logical structure of a mathematical formula, and the definition of "OpenMath Content Dictionaries", or collections of names for mathematical concepts. The names available from the latter type of collections are specifically intended for use in extending MathML, and conversely, a basic set of such "Content Dictionaries" has been designed to be compatible with the small set of mathematical concepts defined in Content MathML, the non-presentational subset of MathML.

so I shall tend to use them interchangeably. Note, however, that MathML is an activity of the W3C, while (WP)

OpenMath has been developed in a long series of workshops and (mostly European) research projects that began in 1993 and continues through today.

MathML and CML have had a long history of association. We tend to present on the same platforms (e.g. NSF / NSDL Workshop on Scientific Markup Languages). Each has its particular growth points - they are accepted as formal means of scholarly publication by several major publishers and there are a variety of toolkits.

Here I want to emphasize that each is required not just in its own domain, but by neighbouring ones. Thus chemistry needs MathML, geology needs CML, etc. This requires a different mindset when developing tools - it isn't necessary to address all the cutting edge research in the mother subject - but important to make sure that you can solve a useful number of problems in everyday science and engineering.

As an example I asked the maths community whether I could search for a given differential equation, e.g.:

dx/dt = -k*x

You can, of course, type this directly into Google and get results like this but that only works when the variables are x and t. Thus

da/dt = -ka  ... or ...

da/a + k dt = 0

or many other forms represent the same abstract equation.

So I was delighted to find that several people were actively working on this - it means we can search the world's literature for given functional forms independently of how they are represented. It's hard - in some cases very hard - and varies between countries. It's similar to the chemist's use of InChI (see Unofficial InChI FAQ) to normalize and canonicalize chemical structure (it doesn't matter whether you write HCl or ClH - the InChI is the same). And Google is quite good at finding these forms.
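To make the normalization idea concrete, here is a toy Python canonicalizer - nothing like the real systems being developed, which handle arbitrary functional forms - that renames variables so the different spellings of the equation above normalize to the same string, just as InChI normalizes HCl and ClH:

```python
import re

def canonicalize(eq):
    """Toy canonicalizer: strip spaces and explicit '*', then test for the
    shape d<v>/d<t> = -<k><v>; every match renames to the same string."""
    compact = eq.replace(" ", "").replace("*", "")
    m = re.fullmatch(r"d(\w+)/d(\w+)=-(\w+)\1", compact)
    if m is None:
        return None          # not in the one form this toy recognizes
    return "dv1/dv2=-c1*v1"  # canonical renaming of variables and constant

print(canonicalize("dx/dt = -k*x"))  # dv1/dv2=-c1*v1
print(canonicalize("da/dt = -ka"))   # dv1/dv2=-c1*v1
```

Searching on the canonical string then finds every paper that uses the equation, whatever letters its author chose.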

Even more fundamental is the use of dictionaries - OM has the content dictionaries and CML has CMLDict/Golem. They aren't identical but are close enough that it's easy to convert between them. The dictionary concept is very powerful and allows languages to be extended almost indefinitely. It also allows different groups to develop their own systems - which may even be incompatible - you simply load in the appropriate dictionary, and the software does not have to be rewritten.

So there is now a strong bond between the MathML and CML communities. They are starting to adopt the idea of blogging and social computing (chemistry has led the way here), while we shall adopt some of the formalities of OM in our representations of physical science.

We're going to pursue the following (at least) and keep in touch through the blogosphere:

  • mixed mathematics and chemistry (see next post)
  • social computing, which could involve student projects, etc.
  • combining forces in the advocacy of markup languages in scholarly scientific publications and the communal dissemination of data.

So - to show this isn't just talk, MichaelK and I are starting to see how a "simple" formula in physical chemistry can be represented. We'll show you shortly.

Blogging in science and mathematics

In a splendid post - reproduced in full (indented) - Kyle Finchsigmate highlights the difference between chemistry and other sciences.

WTF is up with the Science blogosphere?

15:11 30/06/2007, Kyle Finchsigmate, sciency politics, The Chem Blog
I [i.e. KF] was recently interviewed by Nature on the state of blogs and anonymity and whatnot and the interviewer had an interesting question: Why is the population of chemistry blogs so high relative to other disciplines?  There have been a number of posts on why chemistry is so absent in popular media - (IMHO, it lacks the “God element” present in biology/medicine and physics, especially astrophysics, which makes it less interesting to the masses.  The complexities and fuzzy logic employed by astrophysicists requires nothing less than a religiosity to believe some of the odd shit they sling out.)

Anyway, it’s disheartening that there are very few blogs about physics and biology compared to chemistry since the future is integrated approaches.  If I could, I would seek out a lively biologist to team up with on the Chem Blog, but I know of none.  It would be awesome to have a readable blog about either field, since I consider both of them too far off topic to be approachable.

Anyway, my response to the Nature interviewer (the interview should be available via pod cast next month) was that the chemistry blog-o-sphere had a number of very strong voices and drew a lot of inspiration to a lot of people.  Particularly catalytic in that was Dylan Stiles, Paul Bracher and Paul Docherty.  When asked if I was a strong voice, I arrogantly replied affirmatively, but that’s just my style and I was the subject of the interview in any case, so I had to have some degree of impact.  I know that I have made no secret that I started blogging because of Dylan’s post on Otera’s Catalyst, which he employed in his recent Org. Let. publication.  (I did not find it via Bengu Sezen [a researcher involved in ongoing controversy about the validity of published results - PMR], though I did exploit her to jump start my blog via trackbacks to Dylan’s, which is a wee shameful, but it worked.  Besides, if Blogging really had any superstar it was her.  It is, after all, the news that makes the reporter, not the other way around [though, with blogging, that argument can be contested].)

The walls are too high really to make a plea to people in other fields to start blogs, since I don’t think physicists or biologists frequent this blog, but if they WERE here, I would ask them to consider it.

[PMR: Question to Kyle - is the diagram the chemical blogosphere?]

PMR: I had exactly this experience when I was invited to talk to the Mathematical Knowledge Management group last week. I asked if anyone blogged - not really. There are maths blogs but I get the impression that they are mainly aimed at school, problem solving, etc. No-one was blogging the meeting (other than me! - look for the mkm2007 tag). That's a pity as the talks were excellent. As a crystallographer I was fascinated to hear Tom Hales on his work proving Kepler's conjecture (cubic/hexagonal close packing really is the densest way of packing uniform spheres in 3D).

In chemistry the blogosphere gives up-to-the-minute reports - in a large meeting it's not impossible that people transfer sessions when they read the blogs (except that the ACS normally refuses to provide wireless even to speakers and you have to buy your own). I won't speculate on why this is so, but I certainly felt more confident of starting a blog because Tenderbutton had shown the way. (It's well known that his supervisor disapproved - "went ballistic" is what I heard from one senior chemist). Note, however, that in synthetic chemistry there are long periods of watching reactions "bubble away" and blogging can be a near-zero-cost multitasking activity.
There are many motivations for blogging - when I talk to young scientists I point out that several chemists are now on the first steps to science journalism having been scooped up by science publishers.  For myself the blog has several novel advantages.

  • I can post ideas in progress. That's anathema in chemistry, though I see signs we are changing it.
  • It summarises my current position - especially where peer-reviewed publication may not be the best way. Difficult to publish technology in science journals.
  • It is a platform for advocacy
  • It reaches out to other disciplines (and I'll say more about maths in later posts)
  • It acts as a record of my talks. In general I blog about what I am going to say at a meeting. This alerts people to the issues, and may also be a fallback if my machine breaks down. At WWW2007 I posted the summary of my ideas and many people in the audience had already read them or were following them as I went through the talk. This is especially useful where (as then) I only had 5-10 minutes - you can give details that you don't have time to say physically. And, since my talks are stochastic, it reminds me if there is anything I have forgotten as I come to the end of the talk.

So I think my talk has catalysed at least a subset of the maths community to think about blogging. Michael Kohlhase's blog is an example and I'll be talking more later about the collaboration we have set up between MathML/OpenMath and CML - this might be exciting news for science publishers and reporters. So perhaps one of the most important aspects of blogging - for me - is:

  • A way of reaching beyond the boundaries of my own domain. It's obviously an effective approach in Open Data, etc. as I have had several people in the LIS (library-information sci) community tell me they were glad I had restarted my blog.  I think that Michael and I will make it work for chemistry and maths. He is intimately connected to the area of mathematical knowledge while I connect to the Blue Obelisk of chemical open source. Thus if we say "who would like to help with the management of geometrical algorithms in the BO repository?" it's quite possible we'll get someone from the maths community being interested. And when - as we hope - MathML and CML start to really interoperate we will have the basis of some of the formal knowledge architecture of the immediate future.

That couldn't happen without the blogosphere. Blogging is an integral part of modern scientific knowledge. And the more enlightened scientific publishers know it. Unfortunately very few senior chemists do.

Stochastic hyperslide at MKM2007

I have just given my presentation at Mathematical Knowledge Management 2007, for which I wrote an abstract about 2-3 months ago: Mathematics and scientific markup. I knew that in the intervening time I would find something new to get excited about - and this has happened - I have added the excitement of the lc-semanticweb. Of course the technology and community have developed since then.

As many of you know I rail against Powerpoint as a prime destroyer of semantic content. Powerpoint also constrains the presenter to a linear mode - yes you can skip a few slides and maybe even hide them, but it's not easy to flip about. And it's a poor launch platform for interactive demos.

I've done my slides in XHTML+SVG, believing this is the right way to remain true to my campaign for XML. (I'll do Powerpoint when it's necessary for business purposes - e.g. to integrate with colleagues, but that's about it). This worked for a bit but soon hit problems of scale. I started addressing that with XSLT to add menus to the presentation. In fact I started with the wrong technology (for some bizarre reason I chose it to be Windows specific) and have now simply changed to XHTML.

I have over 12000 XHTML slides. (Before you get the wrong idea, many of these are scraped - 3000+ from one example of OSCAR3.) But nonetheless there are very many. I want to be able to reassemble them for each talk, and I want the technology to be as simple as possible - ideally none. (The efforts I have used in the past have all been broken by browser "upgrades" - a synonym for disasters.)

Some ideas are:

  • use a database and craft metadata for each slide
  • use something like Spotlight or local Google

but these don't assemble the talk. So at present I have about 100 directories (maybe with trivial subdirectories) and 5-20 slides per directory. I make the talk by selecting directories which have some general bearing on the talk - perhaps 20-30. Admittedly it takes memory to work out what is likely to be in each folder, but I have to work hard at a talk and the time is well spent. I then asterisk those directories which I HAVE to present (i.e. if I get to 5 mins before the end and haven't mentioned them, I break off and visit them). I prepare demos (such as BIOCLIPSE, OSCAR1, GoogleInChI, Blue Obelisk GreaseMonkey) and visits to the WWW (when the organizers have provided it - e.g. the ACS hardly ever does, even when I ask in advance - it makes little sense to have sessions about the Web when you can't get there).
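The directory-selection workflow might be scripted roughly like this - the layout and naming here are hypothetical, a sketch rather than my actual tooling:

```python
from pathlib import Path

def assemble_talk(root, chosen):
    """Build an ordered talk manifest from slide directories.

    chosen: directory names under root, in presentation order;
    a leading '*' marks a must-present directory (visit before the end!).
    """
    manifest = []
    for entry in chosen:
        must = entry.startswith("*")
        name = entry.lstrip("*")
        slides = sorted((Path(root) / name).glob("*.xhtml"))
        manifest.append({"dir": name, "must_present": must, "slides": slides})
    return manifest

# Hypothetical usage:
#   assemble_talk("slides", ["intro", "*cml", "oscar3", "*blueobelisk"])
```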

So I prepared this for today's talk to the MKM. A very nice audience to present to as they understand all about semantic content, namespaces, XML, dictionaries - so none of that has to be explained. I said my hyperslide would be stochastic - I didn't know what slides I would present and in what order. The demos might break.

They did. BIOCLIPSE hung on Jmol rotation (although I got to demo Jmol later). However I am sure the audience appreciated the value - we'd seen Eclipse being used for theorem proving, etc. GreaseMonkey worked yesterday, but failed today. Now I have reinstalled it and it works great. GoogleInChI failed (is the Google API finally broken?). But OSCAR1 and OSCAR3 worked - and the links out to Pubchem and the chemical blogosphere. And the polymer builder, although I didn't have time to explain exactly how it was a symbol manipulator. And I certainly covered less than half of what I might have said. But at least the hyperslide approach means I never overrun - you can stop when you need to.

There are downsides. It's difficult to keep a record (that's why videos are useful). And Powerpoint does have the merit of acting as a document container. I've tried both S5 and Slidy but neither helps you to assemble talks.

The only complete way to make slides available is to put them under SVN on WWMM. I can copy the directories to a pen drive. But none of this is a record of which slides were visited in which order and what was said.

I'd be interested in whether anyone else is mad enough to create new ways of managing their slides? And whether they have any ideas. At present I'm almost motivated to try Javascript, but the last time I did that - 5 years ago - everything broke within a year.

Synergy between "MKM"-math and "CML"

In this post I am using the context of the Mathematical Knowledge Management 2007 conference to try to construct similarities in the MathML and CML communities and their thought processes. I'll show this list in my presentation tomorrow - I may not cover all the points. Similarities:

  • both are evolving languages - MathML is not "finished" and won't be for some time.
  • smallish part of the main community, often struggling to get the message across.
  • absolute belief that computer mechanization is essential and beneficial.
  • belief this will happen, but timescales are unclear
  • need for a formal, somewhat arbitrary, selection of core components. e.g. geometry has 42 formal concepts; CML has 100 elements. Often pragmatic.
  • extension is through dictionaries (OM has ContentDictionaries, CDs; CML has convention-based dictionaries (CML/Golem))
  • systems are still evolving and will continue indefinitely. Need versioning and flags.
  • Correctness checking (e.g. of publications) is important, but undervalued by the mainstream communities.
  • Fragmented support for development. Have to concentrate limited resources to aim in communal direction.
  • Would revolutionise publication process but difficult to get mainstream involvement including learned Socs.
  • Problems with browsers - how do we transmit rich content? And with every browser release, things break.
  • Legacy - can we extract useful info from bitmaps? PDF? LaTeX/ChemDraw?
  • Commercial organizations are often apathetic or even antagonistic
  • Virtual communities can flourish. Openness is critical
  • Both have a strong bottom-up approach. MathML has enough pragmatists to make it work in practice.
  • both are needed by other disciplines - sometimes the support comes from them rather than the mainstream.

Some differences:

  • Maths is relatively poor - although there are commercial companies providing tools and services.
  • Maths sometimes produces proof of concept without later sustainability.
  • Maths believes in standards. Chemistry doesn't (yet).

Possible joint activities:

  • workshop on combined infrastructure of MLs, tools, examples
  • joint projects (inc. students presented with "toy" systems)
  • synergy in developing tools (esp. client side)
  • blogs etc. aimed at joint activities.
  • minimal level of infrastructure for OM CML interoperability. Converters? libraries? stylesheets?
  • rich client

I've gone over these with Michael Kohlhase and we roughly agree on them.

The Handbook of Integer Sequences

I am at Mathematical Knowledge Management 2007 and having a wonderful time. At present Neil Sloane is talking about his marvellous On-Line Encyclopedia of Integer Sequences, a collection of every known (and voluntarily communicated) sequence - e.g. what is the next term in:



and over 100,000 more

  • 1964 started
  • 1973 2500 seqs
  • 1995 1 m**3 of mail
  • 1995 5500 seqs
  • 1996 10000
  • 2007 131000

A large volunteer community - with a virtual party for the 100,000th sequence. One fan has set the sequences to music (try "Listen" - this is sometimes great fun).

2,000,000 lines of flat file, 120 MB edited with emacs - 450 MB in total including all info. 10K seqs/year, 30 comments/day, 600 emails/day.

shortest seq: 76337 (1 term)

longest: A27 with 500,000 terms (the natural numbers) - this raised a laugh. The point of including it is that you can plot one sequence against another.

The database is used to find out everything about a sequence - though there is a difficulty in distinguishing conjectures from proofs.

There is a Sequence Fans mailing list, but the database is (rightly) restricted to a set of trusted editors - and ultimately almost all the work is done by Neil. There are 200 examples of sequence pairs which have identical content but are different sequences.

I find it wonderful that one can search Google with a sequence and it will hit Neil's site (if it's in the database). It's one of the few digital objects where the content acts as an index in Google. InChI is another.
It's fascinating. Every sequence that comes in is transformed by over 100 transforms to see what its structure might be. It's a real living ecology of digital objects. Neil has shown us how datamining the database - comparing sequences with each other - has resulted in new mathematical theorems. Wouldn't it be wonderful if chemical information was available for datamining and not owned by commercial interests?
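For flavour, here are two of the simplest such transforms in Python (the Encyclopedia's own transform set is far larger and more sophisticated):

```python
# Two elementary sequence transforms of the kind applied to an incoming
# sequence to expose its structure.
def first_differences(seq):
    """a(n+1) - a(n) for each consecutive pair."""
    return [b - a for a, b in zip(seq, seq[1:])]

def partial_sums(seq):
    """Running total of the sequence."""
    out, total = [], 0
    for x in seq:
        total += x
        out.append(total)
    return out

squares = [1, 4, 9, 16, 25, 36]
print(first_differences(squares))                     # [3, 5, 7, 9, 11]
print(first_differences(first_differences(squares)))  # [2, 2, 2, 2] - constant,
                                                      # so the input is quadratic
```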

OpenMath/MathML, CML and communities of practice

James Davenport - one of the originators of OpenMath - is presenting the current status. (OpenMath and MathML are converging, and although they are distinct I shall conflate them here.)

What is the formal semantics? OM has 4 flags:

  • official
  • experimental
  • private
  • obsolete

Normally only the official and obsolete are visible on the official site.

What if we have a conflict? An example is systems (like Euclid) which start counting from 1 whereas OpenMath may start from 0. (I'm on weak ground here...). MathML has Content Dictionaries (CDs) for representing each approach - but we need a mechanism for conversion.

This is very similar to the CML idea of convention attribute.

For example the British and the French have different symbolisms for integrals. They aren't just linguistic. In the CDs sqrt(-1) is i; in engineering it is j.

Do we skate over inconsistencies or refuse to emit any results where there are conflicts?

"A theorem prover which emits wrong results is probably worse than useless." [A chemistry system which emits wrong compounds? Of course - but most of the current ones do.]

Is notation independent of semantics? Apparently not. And conversion is a harder problem than it looks. How do we speak:

T(n) = O(n²)

Apparently no mathematician would speak "equals" - they would say "is in". JD suggested a "notation" approach (I see this as similar to CML's convention). So our symbology is much deeper than we usually realise.

So OM defines a smallish set of symbols which are fundamental. These are hardcoded. (In chemistry these might be element symbols, coordinates, etc.) Then there are those that OM uses widely but which are not fundamental, and then those that the community develops - they have to be supported by OM, but are not part of it.

I'll stop here - I hope you get the idea that OM is struggling to formalize those things which are absolute, those which are universal but not fundamental, and those which are widely used. If those words (which are mine, not James') conflict, that's probably accurate.

He finishes:

"OM is useful, but being semantic, let's use it carefully". That goes for CML as well.
My message is that MLs are community processes, depend heavily on interpreting implicit semantics, and in doing so will help to formalize our understanding.

Maths has a community process. CML is much less advanced, but uses the Blue Obelisk as a clearing house. But I take from OM and communities like FOAF that CML should have flags on its components such as

  • accepted
  • experimental
  • convention (e.g. JSpecView)
  • obsolete
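A sketch (with invented entry names) of how such flags might work in practice - by default only some statuses would be visible, mirroring the OM site's treatment of official and obsolete symbols:

```python
# Hypothetical CML dictionary entries carrying status flags; the names
# and statuses here are invented for illustration.
ENTRIES = [
    {"name": "molecule", "status": "accepted"},
    {"name": "jspecview:peak", "status": "convention"},
    {"name": "fuzzyBond", "status": "experimental"},
    {"name": "cml1:angleUnits", "status": "obsolete"},
]

def visible(entries, show=("accepted", "obsolete")):
    """Filter entries by status - by default only accepted and obsolete,
    as on the official OM site."""
    return [e["name"] for e in entries if e["status"] in show]

print(visible(ENTRIES))  # ['molecule', 'cml1:angleUnits']
```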

Towards Mechanized Mathematical Assistants (and the Scientist's Amanuensis)

I'm now at MKM2007 whose book[1] has the splendid title:

Towards Mechanized Mathematical Assistants

the vision that computers can work together with humans to enhance the understanding of both parties. This vision has much in common with our own SciBorg project, where we use the idea of a "scientist's amanuensis" for a computer system that can help humans understand chemistry in machine-accessible form (see also An Architecture for Language Processing for Scientific Texts).

As with many meetings, I'm blogging some of the talks - and, more importantly, the encounters - during this meeting. I haven't decided what I'm going to say tomorrow and these notes may form part of it.

I'm first of all searching for similarities in the thought processes between the MKM community and our own chemical knowledge community (i.e. those people in chemistry who have realised we are in the Century of information and knowledge - a small percentage). That's also true in mathematics - the MKM community - which overlaps with symbolic computation and computer algebra - is not mainstream but, unlike chemistry, has achieved critical mass (i.e. there are regular meetings, serial publications, etc. - that's a pointer for ourselves - we need to do these things as well). So here are some top-level isomorphisms:

  • need to check the correctness of submitted manuscripts. Maths has many places where the author states "it's obvious...", or lemma 12.23.34 appears without an actual derivation. In chemistry and other sciences we have suspect or missing data.
  • it's not "proper maths/chemistry". Markup is trivial/non-essential/notRealResearch, etc. You have to prove theorems or make compounds to be a proper researcher.
  • The Open Source systems are no good because the commercial systems (Mathematica, Maple, ChemDraw, OpenEye) are better. If it doesn't cost money it can't be any good.
  • Machines can never have the insight that humans have.
  • The publishers/senior editors know what is good for the community. We've done it on printed paper for 200+ years. It works, why change it? (It's also a lot of trouble.)

The first lecture (Paule) showed that machines can now prove some quite complex theorems. Certainly miles beyond me. Things like Wallis' formula for pi and Ramanujan's first identity. And, if we are going to use symbolic algebra systems, we need other systems to prove that the CAS are correct.

[1] Springer, LNAI 4573 eds: Kauer, Kerber, Miner, Windsteiger. Not, of course, Open Access - you have to buy a physical book.

Top-down or bottom-up ontologies?

I am working out some of the ideas I want to talk about at Mathematical Knowledge Management 2007 - and in this post I explore how a knowledge framework might be constructed, and also how it can be represented in machine-understandable form. This, I think, will be seen as one of the central challenges of the current era.

I have worked on bodies responsible for the formalisation of information and have also for 15 years been co-developing Chemical Markup Language (with Henry Rzepa). This journey has not yet ended and I'm changing my viewpoint occasionally.

My first view - perhaps stemming from a background in physical science - was that it should not be too difficult to create machine-processable systems. We are used to manipulating algorithms and transforming numeric quantities between different representations. This process seemed to be universal and independent of culture. This view was particularly influenced by being part of the Int. Union of Crystallography's development of the Crystallographic Information Framework dictionary system.

This is a carefully constructed, self-consistent, system of concepts which are implemented in a simple formal language. Physical quantities of interest to crystallographic experiments can be captured precisely and transformed according to the relations described, but not encoded, in the dictionaries. It is now the standard method of communicating the results of studies on small molecules and is the reason that Nick Day and I could create CrystalEye. Using XML and RDF technology we have added a certain amount of machine processability.
Perhaps encouraged by that, I and Lesley West came up with the idea of a Virtual Hyperglossary (original site defunct, but see VIRTUAL HYPERGLOSSARY DEVELOPMENTS ON THE NET) which would be a machine-processable terminology covering many major fields of endeavour. Some of this was very naive; some (e.g. the use of namespaces) was ahead of the technology. One by-product was an invitation to INTERCOCTA (Committee on Conceptual and Terminological Analysis) - a UNESCO project on terminology. There I met a wonderful person, Fred W. Riggs, who very gently and tirelessly showed me the complexity and the boundaries of the terminological approach. Here (Terminology Collection: Terminology of Terminology) is an example of the clarity and carefulness of his writing. One of Fred's interests was conflict research and his analysis of the changing nature of "Turmoil among nations". I am sure he found my ideas naive.
So is there any point in trying to create a formalization of everything - sometimes referred to as an Upper Ontology? From Wikipedia:

In information science, an upper ontology (top-level ontology, or foundation ontology) is an attempt to create an ontology which describes very general concepts that are the same across all domains. The aim is to have a large number of ontologies accessible under this upper ontology. It is usually a hierarchy of entities and associated rules (both theorems and regulations) that attempts to describe those general entities that do not belong to a specific problem domain.

The article lists several attempts to create such ontologies, one of the most useful for those in Natural Language Processing being

WordNet, a freely available database originally designed as a semantic network based on psycholinguistic principles, was expanded by addition of definitions and is now also viewed as a dictionary. It qualifies as an upper ontology by including the most general concepts as well as more specialized concepts, related to each other not only by the subsumption relations, but by other semantic relations as well, such as part-of and cause. However, unlike Cyc, it has not been formally axiomatized so as to make the logical relations between the concepts precise. It has been widely used in Natural Language Processing research.

(and so is extremely valuable for our own NLP work in chemistry).

But my own experience has shown that the creation of ontologies - or any classification - can be an emotive area and lead to serious disagreements. It's easy for any individual to imagine that their view of a problem is complete and internally consistent, and must therefore be identical to that of others in the same domain. And so the concept of a localised "upper ontology" creeps in - one that works for a subset of human knowledge. The closer the domain is to physical science, the easier it is to take this view. But it doesn't work like that in practice. And there is another problem: whether or not upper ontologies are possible, it is often impossible to get enough minds together with a broad enough view to make progress.

So my pragmatic approach in chemistry - and it is a pragmatic science - is that no overarching ontology is worth pursuing. Even if we get one, people won't use it. The International Union of Pure and Applied Chemistry has created hundreds of rules on how to name chemical compounds, and relatively few chemists use them unless they are forced to. We have found considerable variance in the way authors report experiments, and often the "correct" form is hardly used. In many cases the practice is "look at the current usage of other authors and do something similar".

And there is a greater spread of concepts than people sometimes realise. What is a molecule? What is a bond? Both are strongly human concepts and so difficult to formalize for a machine. Yet a program has to understand exactly what a "molecule" is. So a chemical ontology has to accept variability in personal views. One ontology per person is impossible, but is there scope for variability? And if so, how is it to be managed?

So far CML has evolved through a series of levels, and it's not yet finished. It started as a hardcoded XML DTD - indeed that was the only thing possible at that stage. (In passing, it's interesting to see how the developing range of technology has broadened our views on representability.) Then we moved to XML Schema - still with a fairly constrained ontology, but with greater flexibility. At the same stage we introduced a "convention" attribute on elements. A bond was still a "bond", but the author could state which ontology should be attached to it. There is no constraint on the number of conventions, but the implied rule is that if you create one you have to provide the formalism and also the code.
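As a sketch of what this looks like in practice - the elements below follow real CML vocabulary, but the convention value is purely illustrative - a bond carrying a "convention" attribute might be marked up as:

```xml
<!-- Hypothetical CML fragment: a C=O bond whose meaning is governed
     by whatever ontology the (illustrative) convention value names. -->
<molecule xmlns="http://www.xml-cml.org/schema" id="m1">
  <atomArray>
    <atom id="a1" elementType="C"/>
    <atom id="a2" elementType="O"/>
  </atomArray>
  <bondArray>
    <bond id="b1" atomRefs2="a1 a2" order="2"
          convention="myConventions:organic"/>
  </bondArray>
</molecule>
```

The same bond element could appear with a different convention value, attaching a different formal meaning without changing the markup vocabulary itself.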

An example is "must a spectrum contain data?". We spent time discussing this and decided that with the JSpecView convention it must, but that with others it need not. This type of variable constraint is potentially enforceable by Schematron, RDF or perhaps languages from the mathematics community. We have a system with "bottom-up" creation of ontologies, but we agree that they need a central mechanism for formalizing them - a metaontology. The various flavours of OWL will help, but we'll need some additional support for transformation and validation, especially where numbers and other scientific concepts are involved.
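A minimal Schematron sketch of that spectrum constraint might look like the following - the CML element names and the convention token are assumptions for illustration, not the actual schema:

```xml
<!-- Sketch: require spectrum data only when the JSpecView convention applies. -->
<schema xmlns="http://purl.oclc.org/dsdl/schematron">
  <ns prefix="cml" uri="http://www.xml-cml.org/schema"/>
  <pattern>
    <rule context="cml:spectrum[@convention = 'jspecview']">
      <assert test="cml:spectrumData">
        A spectrum using the JSpecView convention must contain data.
      </assert>
    </rule>
    <!-- No rule fires for other conventions, so data stays optional there. -->
  </pattern>
</schema>
```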

Mathematical Knowledge Management 2007

I have been invited to give a lecture at the Mathematical Knowledge Management 2007 meeting next week in Hagenberg, Austria. My talk is entitled Mathematics and scientific markup. I am both excited and apprehensive about this - what is a chemist (whose level of mathematics finished at Part 1A for scientists in Cambridge) doing talking to experts in the field?

However in the spirit of the new Web I'm blogging my thoughts before the meeting. This serves several purposes:

  • helps me get my ideas in order
  • gets feedback from anyone who may have an interest
  • identifies other people who may also be blogging about the meeting
  • acts as a public resource from which I can give my talk if I have problems with my machine.

The conference topic ...

Mathematical Knowledge Management is an innovative field in the intersection of mathematics and computer science. Its development is driven on the one hand by the new technological possibilities which computer science, the internet, and intelligent knowledge processing offer, and on the other hand by the need for new techniques for managing the rapidly growing volume of mathematical knowledge.

The conference is concerned with all aspects of mathematical knowledge management. A (non-exclusive) list of important areas of current interest includes:

  • Representation of mathematical knowledge
  • Repositories of formalized mathematics
  • Diagrammatic representations
  • Mathematical search and retrieval
  • Deduction systems
  • Math assistants, tutoring and assessment systems
  • Mathematical OCR
  • Inference of semantics for semi-formalized mathematics
  • Digital libraries
  • Authoring languages and tools
  • MathML, OpenMath, and other mathematical content standards
  • Web presentation of mathematics
  • Data mining, discovery, theory exploration
  • Computer Algebra Systems
  • Collaboration tools for mathematics

Invited Speakers:

Neil J. A. Sloane, AT&T Shannon Labs, Florham Park, NJ, USA - The On-Line Encyclopedia of Integer Sequences
Peter Murray-Rust, University of Cambridge, Dept. of Chemistry, UK - Mathematics and scientific markup

What has molecular informatics to do with this? More than it appears. Chemistry overlaps considerably with mathematics, and here formal systems are important. It should be possible to explore the formal representation of thermodynamics or material properties in semantic form (though I may find that my use of "semantics" is imprecise or even "wrong"). Repositories are an obviously exciting area - can we find mathematical objects either by form or by metadata? OCR is important for all content-rich disciplines - see below. Inference and semantics are becoming increasingly important in the emerging web. And so I tick about half the topics above - not in mathematical detail, of course, but in the general approach to the problems.

As an example, what objects contain enough structure and canonicalized content that they act as their own discovery metadata? Most objects need a human or a lookup table to add the metadata for their web discovery. For example, you need to know the names of humans - you cannot work them out by looking at the people themselves. But in chemistry we can describe a molecule by its InChI - a canonical representation of the connection table (which is not easily human-interpretable). This is both its content and its discovery metadata. Searching Google for an InChI will find instances of that molecule on the web. I wondered what other objects could be identified just by their textual content. Perhaps a poem (although it won't tell you who wrote it). I started typing lists of numbers into Google and suddenly found I was getting hits on Neil Sloane's Encyclopedia.

In it, a sequence can be identified by its content - search Google for "1,3,6,10,15" and you get A000217 in The On-Line Encyclopedia of Integer Sequences. I had a chat with a well-known computer scientist (an ex-mathematician) at WWW2007, and he bet that I couldn't tell him the next term in a sequence within 5 minutes. I bet him the drinks that I could. As we had wireless in the bar, I searched Google and immediately found the answer - he was astounded - and bought the drinks.
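(A000217 is the triangular numbers, so the continuation of that particular search string is easy to check by hand:)

```latex
T_n = 1 + 2 + \cdots + n = \frac{n(n+1)}{2},
\qquad T_1,\ldots,T_6 = 1,\ 3,\ 6,\ 10,\ 15,\ 21
```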

So many of the problems are generic across domains. Searching for MKM2007 I found this paper on how to extract mathematics from PDF (Retro-enhancement of Recent Mathematical Literature). It's better than recreating cows from hamburgers, as they have some access to source - but there are similarities to what we are trying to recover from PDF.

I shall use the tag mkm2007 for this and subsequent posts (in which I'll explore things like the difference between top-down and bottom-up management systems). No one has yet used it, but maybe someone will find it - let's see.