I’m talking at BioIT about Open Semantic data and am going through the concepts in order. I’ve looked at data (BioIT in Boston: What is Open?). Now for semantic/s.
I’m influenced by the Semantic Web and the Wikipedia article is a useful starting point. It also highlights Linked Open Data and I’ll write about that later. Let’s recall TimBL’s motivation:
I have a dream for the Web [in which computers] become capable of analyzing all the data on the Web – the content, links, and transactions between people and computers. A ‘Semantic Web’, which should make this possible, has yet to emerge, but when it does, the day-to-day mechanisms of trade, bureaucracy and our daily lives will be handled by machines talking to machines. The ‘intelligent agents’ people have touted for ages will finally materialize.
– Tim Berners-Lee, 1999
I have this dream for chemistry (it’s easier than trade and bureaucracy), and an early chemical semantic web is now technically possible. What do we require?
The most important thing is to realise that our normal means of discourse, spoken and written, are perfused by implicit semantics. Take:
Compound 13 melted at 458 K
To an anglophone human chemist the meaning of this is obvious, but to a machine it is simply a string of characters (ASCII-67, ASCII-111, …). Although Natural Language Processing can interpret some of this (and our Sciborg project has addressed this), it’s still hard for a machine to get the complete meaning. So let’s look at the parts.
The concepts are:
- A specific chemical compound
- the concept of melting point
- a numeric value
- scientific units of measure
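To make the decomposition concrete, here is a minimal Python sketch (not CML or RDF). It shows first what the machine sees in the raw sentence, then the same statement with each of the four concepts pulled into its own named field. The class name `Measurement`, its field names and the label `compound13` are all invented for illustration.

```python
from dataclasses import dataclass

SENTENCE = "Compound 13 melted at 458 K"

# Without semantics the sentence is just a sequence of character codes.
char_codes = [ord(ch) for ch in SENTENCE]  # starts 67 ('C'), 111 ('o'), ...


@dataclass(frozen=True)
class Measurement:
    """One explicit field per concept (all names here are hypothetical)."""
    compound_id: str    # a specific chemical compound
    property_name: str  # the concept of melting point
    value: float        # a numeric value
    units: str          # scientific units of measure


melting = Measurement(
    compound_id="compound13",
    property_name="melting point",
    value=458.0,
    units="K",
)


def to_celsius(m: Measurement) -> float:
    """Convert a kelvin measurement to degrees Celsius.

    This is only possible because the units are stated explicitly.
    """
    if m.units != "K":
        raise ValueError(f"expected kelvin, got {m.units!r}")
    return m.value - 273.15
```

Once the units are explicit a program can do something like `to_celsius(melting)` safely; with only the raw string it would have to guess what "458 K" means.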
We use two main ways of expressing this: XML (Chemical Markup Language, CML) and RDF. Nico Adams and I will argue about when XML should be used and when RDF. Nico would like to express everything in RDF, and that time may come, but at present we have a lot of code that can understand CML and so I prefer to mix them (see below). In any case I’m not going to display them here.
What are the essential parts of our semantic framework?
- The concepts must be clearly identified in a formal system. This can be an ontology or a markup language or both. In each case there is a schema or framework to which the concepts must conform. CML has a schema (essentially stable) and Nico’s ChemAxiom Ontology has to conform to the Basic Formal Ontology.
- There must be an agreed syntax for expressing the statements. In CML this is XML with a series of dictionaries also expressed in CML. For RDF there are a number of widely agreed syntaxes (RDF/XML and N-Triples, for example).
- All components in the statement should have identifiers. In CML this is managed through ID attributes, in RDF through URIs. TimBL’s vision is that if everyone uses URIs based on domain names then the world becomes a Giant Global (RDF) Graph. There is lots of debate as to whether a URI should also be an address; I’ll blog that later. Without question the management of identifiers is a key requirement in the 21st century.
- There should be software that does something useful with the result. This is often overlooked: systems like RDF allow navigation and validation of graphs, and often a tabulation of the results. But chemists will want to view a spectrum as a spectrum, not as a set of RDF triples. We’ve made good progress here, and currently my thinking is that CML acts as the primary way of exposing chemical functionality to programs.
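As a rough illustration of identifiers and of software that does something with the result, here is a minimal sketch. It uses plain Python tuples as subject/predicate/object statements; it is not real RDF syntax, and it does not use CML or ChemAxiom vocabulary. The namespace `http://example.org/chem/` and every term in it (`hasMeasurement`, `meltingPoint`, `kelvin`, and so on) are made up for this example.

```python
BASE = "http://example.org/chem/"  # hypothetical namespace, not a real vocabulary


def uri(name: str) -> str:
    """Build an identifier for a concept in the hypothetical namespace."""
    return BASE + name


# "Compound 13 melted at 458 K" as (subject, predicate, object) statements.
triples = [
    (uri("compound13"), uri("hasMeasurement"), uri("m1")),
    (uri("m1"), uri("property"), uri("meltingPoint")),
    (uri("m1"), uri("value"), 458.0),
    (uri("m1"), uri("units"), uri("kelvin")),
]


def objects(subject: str, predicate: str) -> list:
    """Navigate the graph: all objects for a given subject and predicate."""
    return [o for s, p, o in triples if s == subject and p == predicate]


def melting_points(compound: str) -> list:
    """Tabulate (value, units) for every melting point recorded for a compound."""
    rows = []
    for m in objects(compound, uri("hasMeasurement")):
        if uri("meltingPoint") in objects(m, uri("property")):
            value = objects(m, uri("value"))[0]
            units = objects(m, uri("units"))[0]
            rows.append((value, units))
    return rows
```

Calling `melting_points(uri("compound13"))` walks the graph and tabulates the result, which is the kind of generic navigation a triple store offers. Drawing a spectrum or a molecule from such statements would need domain-specific code on top.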
I think I’ll post this (also to check ICE) and then talk about chemical concepts in the next post.
This Blog Post prepared with ICE 4.5.6 from USQ in Open Office