I am working out some of the ideas I want to talk about at Mathematical Knowledge Management 2007 – and in this post I explore how a knowledge framework might be constructed, and also how it can be represented in machine-understandable form. This, I think, will be seen as one of the central challenges of the current era.
I have worked on bodies responsible for the formalisation of information and have also, for 15 years, been co-developing Chemical Markup Language (with Henry Rzepa). This journey has not yet ended, and my viewpoint has shifted occasionally along the way.
My first view – perhaps stemming from a background in physical science – was that it should not be too difficult to create machine-processable systems. We are used to manipulating algorithms and transforming numeric quantities between different representations. This process seemed to be universal and independent of culture. This view was particularly influenced by my part in the International Union of Crystallography’s development of the Crystallographic Information Framework (CIF) dictionary system.
This is a carefully constructed, self-consistent system of concepts implemented in a simple formal language. Physical quantities of interest in crystallographic experiments can be captured precisely and transformed according to the relations described, but not encoded, in the dictionaries. It is now the standard method of communicating the results of studies on small molecules, and it is the reason that Nick Day and I could create CrystalEye. Using XML and RDF technology we have added a certain amount of machine processability.
Perhaps encouraged by that, Lesley West and I came up with the idea of a Virtual Hyperglossary (original site defunct, but see VIRTUAL HYPERGLOSSARY DEVELOPMENTS ON THE NET), which would be a machine-processable terminology covering many major fields of endeavour. Some of this was very naive; some (e.g. the use of namespaces) was ahead of the technology. One by-product was an invitation to INTERCOCTA (Committee on Conceptual and Terminological Analysis), a UNESCO project on terminology. There I met a wonderful person, Fred W. Riggs, who very gently and tirelessly showed me the complexity and the boundaries of the terminological approach. Here (Terminology Collection: Terminology of Terminology) is an example of the clarity and carefulness of his writing. One of Fred’s interests was conflict research and his analysis of the changing nature of “Turmoil among nations”. I am sure he found my ideas naive.
So is there any point in trying to create a formalization of everything – sometimes referred to as an Upper Ontology? From WP:
In information science, an upper ontology (top-level ontology, or foundation ontology) is an attempt to create an ontology which describes very general concepts that are the same across all domains. The aim is to have a large number of ontologies accessible under this upper ontology. It is usually a hierarchy of entities and associated rules (both theorems and regulations) that attempts to describe those general entities that do not belong to a specific problem domain.
The article lists several attempts to create such ontologies, one of the most useful for those in Natural Language Processing being
WordNet, a freely available database originally designed as a semantic network based on psycholinguistic principles, was expanded by addition of definitions and is now also viewed as a dictionary. It qualifies as an upper ontology by including the most general concepts as well as more specialized concepts, related to each other not only by the subsumption relations, but by other semantic relations as well, such as part-of and cause. However, unlike Cyc, it has not been formally axiomatized so as to make the logical relations between the concepts precise. It has been widely used in Natural Language Processing research.
(and so is extremely valuable for our own NLP work in chemistry).
But my own experience has shown that the creation of ontologies – or any classification – can be an emotive area and lead to serious disagreements. It is easy for any individual to imagine that their view of a problem is complete and internally consistent, and must therefore be identical to that of others in the same domain. And so the concept of a localised “upper ontology” creeps in: one that works for a subset of human knowledge. The closer a field is to physical science, the easier it is to take this view. But it doesn’t work like that in practice. And there is another problem: whether or not upper ontologies are possible, it is often impossible to get enough minds together, with a broad enough view, to make progress.
So my pragmatic approach in chemistry – and it is a pragmatic science – is that no overarching ontology is worth pursuing. Even if we get one, people won’t use it. The International Union of Pure and Applied Chemistry has created hundreds of rules on how to name chemical compounds and relatively few chemists use them unless they are forced to. We have found considerable variance in the way authors report experiments and often the “correct” form is hardly used. In many cases it is “look at the current usage of other authors and do something similar”.
And there is a greater spread of concepts than people sometimes realise. What is a molecule? What is a bond? Both are strongly human concepts and therefore difficult to formalize for a machine. Yet a program has to understand exactly what a “molecule” is. So a chemical ontology has to accept variability in personal views. One ontology per person is impossible, but is there scope for variability? And if so, how is it to be managed?
So far CML has evolved through a series of levels, and it is not yet finished. It started as a hardcoded XML DTD – indeed, that was the only thing possible at the time. (In passing, it is interesting to see how the developing range of technology has broadened our views on representability.) Then we moved to XML Schema – still with a fairly constrained ontology, but with greater flexibility. At the same stage we introduced a “convention” attribute on elements. A bond was still a “bond”, but the author could state which ontology was attached to it. There was no constraint on the number of conventions, but the implied rule is that if you create one you must provide the formalism and also the code.
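A minimal sketch of what this looks like in practice: a bond element carrying a `convention` attribute that names the ontology the author intends. The fragment and the convention value (`myOntology:bonding`) are illustrative, not part of any official CML schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical CML-like fragment: the convention value is an assumption,
# chosen by the author to declare which ontology governs this bond.
fragment = """
<molecule xmlns="http://www.xml-cml.org/schema">
  <bondArray>
    <bond atomRefs2="a1 a2" order="1" convention="myOntology:bonding"/>
  </bondArray>
</molecule>
"""

NS = "{http://www.xml-cml.org/schema}"
root = ET.fromstring(fragment)
bond = root.find(f"{NS}bondArray/{NS}bond")

# The element is still a "bond"; the attribute tells a processor which
# formalism (and code) to apply when interpreting it.
print(bond.get("convention"))
```

A processor that does not recognise the convention can still parse the bond; it simply cannot apply the convention-specific semantics.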
An example is “must a spectrum contain data?”. We spent time discussing this and decided that under the JSpecView convention it must, but under others it need not. This type of variable constraint is potentially enforceable with Schematron, RDF, or perhaps languages from the mathematics community. We have a system with “bottom-up” creation of ontologies, but we agree they need a central mechanism for formalizing them – a metaontology. The various flavours of OWL will help, but we shall need additional support for transformation and validation, especially where numbers and other scientific concepts are involved.
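The spectrum example above can be sketched as a convention-dependent check. This is an assumption-laden illustration, not the actual JSpecView rule set: the convention string, element names, and the single rule enforced are all hypothetical stand-ins for what a Schematron schema would express declaratively.

```python
import xml.etree.ElementTree as ET

def validate_spectrum(xml_text: str) -> bool:
    """Convention-dependent constraint (illustrative only):
    under a hypothetical 'jspecview' convention a <spectrum> must
    contain a <data> child; under other conventions it need not."""
    spectrum = ET.fromstring(xml_text)
    needs_data = spectrum.get("convention") == "jspecview"
    has_data = spectrum.find("data") is not None
    return has_data or not needs_data

# The constraint only bites when the stricter convention is declared:
print(validate_spectrum('<spectrum convention="jspecview"/>'))                    # False
print(validate_spectrum('<spectrum convention="other"/>'))                        # True
print(validate_spectrum('<spectrum convention="jspecview"><data/></spectrum>'))   # True
```

In Schematron the same idea would be a rule whose context tests the convention attribute before asserting the presence of data; the point is that the constraint travels with the convention, not with the element name.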