petermr's blog

A Scientist and the Web


Archive for April, 2009

BioIT 2009 – Where do we get Semantic Open Data?

Wednesday, April 29th, 2009

Since Open Data are very rare in chemistry, how do we get them? We’ve been working on making this easier, but it’s still very hard work. So here’s a brief overview:

  • Born-semantic Open. This is the ideal situation, where tools and the culture are used to create data which is intrinsically semantic and aggressively labelled as Open. There is no production-scale example of this. We hope that Chem4Word will be used as a primary tool for creating semantic documents, and that we can add an Open Data stamp to the result. In that way every use of the tool will create Open Semantic data.

  • Conversion of structured and semi-structured Open legacy to Semantic Open. An example of this is CrystalEye, where we aggregate Open legacy data (CIFs) and convert to CML. This is then published with the Open Data tag on every page. Spectra, if in JCAMP, are also tractable. It is also possible, though harder, to convert most computational chemistry outputs into CML. Gaussian archive and GAMESS outputs are relatively simple; Gaussian logfiles are a nightmare.

  • Born-digital computation. FoX or other CML-generating tools are inserted into the source code of comp chem programs. We’ve done this for at least 10 programs, which means that we get lossless conversion of comp chem into CML, with complete ontological integrity.

  • Recovery from text and PDF (text mining). Conversion to PDF destroys all semantics, most structure and all ontology, so recovering anything relies on messy heuristics and we never get 100% recall or precision. We don’t touch bitmaps. Our current tools in Cambridge can:

  1. Extract chemical structures from images. This depends on the actual way the image is represented, but with vectors rather than bitmaps we have achieved 95% precision on several documents.

  2. Extract spectra from images. This is also tractable. We haven’t done enough to get metrics and we haven’t covered all the types of instrument, but again ca 95% is manageable.

  3. Text-mining. OSCAR2 is able to recover peak lists from chemical analytical data with over 90% success, the failures being mainly due to typos and punctuation. OSCAR3 can extract reaction information (Lezan Hawizy) with probably > 80% precision. We can also convert chemical names to structures (OPSIN); Daniel Lowe has made impressive progress and, for certain corpora, can achieve ca 70%.

Some important caveats. Anything other than born-semantic is lossy. The recall/precision can range from 95% to 5%. That sounds silly, but the results are critically dependent on how the documents were created and published. The more human steps (copying, editing), the worse the recall and precision. But with high quality PDFs an impressive amount can be extracted.
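Since recall and precision carry so much of the argument here, a minimal sketch of how the two are computed may help. The counts are invented for illustration and are not measurements from any of our tools.

```python
def precision(true_positives: int, false_positives: int) -> float:
    """Fraction of the extracted items that are correct."""
    extracted = true_positives + false_positives
    return true_positives / extracted if extracted else 0.0


def recall(true_positives: int, false_negatives: int) -> float:
    """Fraction of the items present in the document that were extracted."""
    present = true_positives + false_negatives
    return true_positives / present if present else 0.0


# Hypothetical run: a paper holds 100 structures, the tool extracts 80,
# and 76 of those 80 are correct.
tp, fp, fn = 76, 4, 24
p = precision(tp, fp)  # 76 / 80
r = recall(tp, fn)     # 76 / 100
```

The two numbers move independently: a cautious extractor can have high precision and poor recall, which is why both need to be quoted.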

But the major challenge is restrictive approaches to extraction. If publishers threaten extractors with legal action for getting science from papers, we will have destroyed the dream of Linked Open Data in Science.

But for those publishers who support the creation and publication of Born-semantic Open data the future is really exciting.

This Blog Post prepared with ICE 4.5.6 from USQ in Open Office

BioIT 2009 – What is data? -1

Wednesday, April 29th, 2009

This is a post, probably the last, in a series outlining Open Semantic Data in Science at BioIT Boston (see BioIT in Boston: What is Open?).

I’ve explained Open and Semantic. Now I’ll attempt data. I thought this would be simple; it’s not.

Why use the word data at all? I think the reasons are pragmatic. There’s often an over-presented hierarchy:

Data → Information → Knowledge → Wisdom

(the last influenced by T.S.Eliot)

Different people would have cutoffs at different points on this hierarchy but I think the following are fairly common attributes of data:

  • it is distinct from most prose (although some prose would be better recast as data)

  • it is generally a component of a larger information or knowledge structure

  • facts and data are closely related

  • many data are potentially reproducible, or are unique observations; they are not opinions (though different people may produce different data)

  • data, as facts, are not copyrightable.

  • Collections of data and annotated data (data + metadata) may have considerably enhanced value over the individual items.

  • Data can be processed by machine

Here are some statements which provide data:

and here are some which are not data:

  • her work is well respected

  • we thank Dr. XYZZY for the crystals

  • we find this reaction very difficult to perform

What’s the point of making the distinction? From my point of view:

  • Data can and increasingly should be converted to semantic form.

  • Data are not copyrightable and should be free to the community

  • Linking Open Data is now possible and has stunning potential.

So my self-appointed mission is to carry this out in the domain that I at least partially understand: chemistry.

  • We already have the semantic framework (CML, ChemAxiom)

  • we are managing to liberate data (PubChem, ChemSpider, CrystalEye, NMRShiftDB, CLARION, etc.)

  • when we have liberated enough we can start to provide Linked Open Data.

We aren’t there yet as there are very few fully Open Data in chemistry (CrystalEye may be the only one that asserts Openness through OKF). And unless there is something to link to we can’t do very much.

But we are moving fast. Four of us (Antony Williams, Alex Tropsha, Steve Heller and PMR) met for an hour yesterday to discuss what we’d like to do and how we might do it. We have complementary things to bring to the table, so watch for developments.


BioIT 2009 – chemical semantic concepts

Wednesday, April 29th, 2009

This is a post in a series outlining Open Semantic Data in Science at BioIT Boston (see BioIT in Boston: What is Open?).

To build a complete semantic framework we need formal systems to express our concepts, normally markup languages or ontologies. In Cambridge we use both, particularly Chemical Markup Language and ChemAxiom (Nico Adams). Let’s start with ChemAxiom which Nico has blogged. His diagram gives a good overview:


Since Nico leads our Polymer Informatics effort there is a special concentration on polymers, but the framework is very general and can be used for mainstream chemistry. As shown, ChemAxiom emphasizes substances and their properties. Because of our backgrounds as (in part) physical chemists there is emphasis on methods of measurement (metrology) and scientific units of measurement.

The ontology is descended from the Basic Formal Ontology, which is an Upper Ontology. This has abstracted concepts which are common to many different ontologies (science, commerce, literature, etc.). From Wikipedia:

The BFO or Basic Formal Ontology framework developed by Barry Smith and his associates consists in a series of sub-ontologies at different levels of granularity. The ontologies are divided into two varieties: SNAP (or snapshot) ontologies, comprehending continuant entities such as three-dimensional enduring objects, and SPAN ontologies, comprehending processes conceived as extended through (or as spanning) time. BFO thus incorporates both three-dimensionalist and four-dimensionalist perspectives on reality within a single framework

In this scheme a compound would be a continuant whereas a reaction as performed by a given person would be an occurrent. Note that language often maps different concepts onto the same linguistic term:

The reaction took place over 1 hour (occurrent)

The aldol reaction creates CC-bonds

Is the latter use a continuant or occurrent? This is the sort of thing we discuss in the pub.

ChemAxiom is expressed in OWL2.0 and this brings considerable power of inference. For example if something is a quantity of type temperature the ontology can assert that Kelvin is a compatible unit while Kilogram is not. And this is only a simple example of the power of ChemAxiom.
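The flavour of that unit check can be mimicked in a few lines. This is only a toy lookup table, not OWL reasoning, and the quantity and unit names are illustrative assumptions rather than ChemAxiom terms.

```python
# Toy stand-in for an ontology: each quantity type lists its compatible units.
# Names are illustrative assumptions, not taken from ChemAxiom.
COMPATIBLE_UNITS = {
    "temperature": {"kelvin", "celsius"},
    "mass": {"kilogram", "gram"},
}


def is_compatible(quantity_type: str, unit: str) -> bool:
    """Return True if the unit may be used with the given quantity type."""
    return unit in COMPATIBLE_UNITS.get(quantity_type, set())


kelvin_ok = is_compatible("temperature", "kelvin")
kilogram_ok = is_compatible("temperature", "kilogram")
```

A real ontology earns its keep by deriving such answers from class definitions rather than from a hand-written table, but the question being asked is the same.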

In general we expect ChemAxiom to have a stable overall framework as expressed in the diagram but for the details to change considerably as the community debates (hopefully good-naturedly) over the lower level concepts. Nico is presenting this at the International Conference on Biomedical Ontology.

Chemical Markup Language uses a wider-ranging set of chemical and physical concepts which are derived from chemical communication between humans and machines in various degrees. We (Henry Rzepa and I) have used published articles, chemical software systems, chemical data and computational chemistry outputs to define a range of concepts which now seem to be fairly stable. That’s perhaps not too surprising as many of them are over 100 years old. The main concepts are:

  • molecules, compounds and substances
  • chemical reactions, synthesis and procedures
  • crystallography and the chemical solid state
  • chemical spectroscopy and analytical chemistry
  • computational chemistry.

There are about 50 sub-concepts, and all these are expressed as XML elements. Thus we can write:

<cml:property dictRef="properties:meltingPoint">

<cml:scalar dataType="xsd:double" units="units:kelvin">450</cml:scalar>

</cml:property>

The general semantics are that we have a property, with a numeric value. The precise semantics are added by links to dictionaries (or ontologies such as ChemAxiom). With this framework it is possible to mark up much of the content of current chemical articles.
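To show what a program can do with such a fragment, here is a minimal sketch using Python's standard library. The namespace URI is declared on the fragment so that it parses on its own; treat it, and the dictionary prefixes, as assumptions for the sake of the example.

```python
import xml.etree.ElementTree as ET

# Assumed namespace URI, declared here so the fragment is self-contained.
CML_NS = "http://www.xml-cml.org/schema"

FRAGMENT = f"""
<cml:property xmlns:cml="{CML_NS}" dictRef="properties:meltingPoint">
  <cml:scalar dataType="xsd:double" units="units:kelvin">450</cml:scalar>
</cml:property>
"""


def read_property(xml_text: str) -> dict:
    """Pull the dictionary reference, numeric value and units out of a property."""
    root = ET.fromstring(xml_text.strip())
    scalar = root.find(f"{{{CML_NS}}}scalar")
    return {
        "property": root.get("dictRef"),
        "value": float(scalar.text),
        "units": scalar.get("units"),
    }


melting_point = read_property(FRAGMENT)
```

The code never needs to guess what 450 means: the element says it is a scalar, and the attributes point at the dictionary entries that define the property and the unit.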

It’s also possible to mark up much computational chemistry output and we’ve done this for many major codes such as Gaussian, Dalton, GULP, DL_POLY, MOPAC, GAMESS, etc. This has made it possible to chain together processes into a semantic workflow:


Here one program emits XML and feeds automatically into the input of the next. All output is semantic and can be stored in XML or RDF repositories such as our Lensfield, which is being developed by Jim Downing and others.
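The chaining idea can be sketched with two stand-in steps that exchange XML instead of free text. The element and attribute names below are simplified inventions for illustration, not real CML, and the "programs" are trivial functions rather than comp chem codes.

```python
import xml.etree.ElementTree as ET


def run_step_one() -> str:
    """Stand-in for a program that reports a temperature in kelvin as XML."""
    result = ET.Element("result")
    scalar = ET.SubElement(result, "scalar", name="meltingPoint", units="kelvin")
    scalar.text = "450"
    return ET.tostring(result, encoding="unicode")


def run_step_two(xml_input: str) -> str:
    """Stand-in for the next program: reads step one's XML, emits celsius."""
    scalar = ET.fromstring(xml_input).find("scalar")
    if scalar.get("units") != "kelvin":
        raise ValueError("expected a value in kelvin")
    result = ET.Element("result")
    out = ET.SubElement(result, "scalar", name=scalar.get("name"), units="celsius")
    out.text = str(float(scalar.text) - 273.15)
    return ET.tostring(result, encoding="unicode")


# No human re-keys anything between the two steps.
final_xml = run_step_two(run_step_one())
```

Because the units travel with the number, the second step can refuse input it does not understand instead of silently computing nonsense.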

Which brings me nicely onto Data… in the next post


BioIT 2009 – What is semantic?

Wednesday, April 29th, 2009

I’m talking at BioIT about Open Semantic data and am going through the concepts in order. I’ve looked at Open (BioIT in Boston: What is Open?). Now for semantic/s.

I’m influenced by the Semantic Web and the Wikipedia article is a useful starting point. It also highlights Linked Open Data and I’ll write about that later. Let’s recall TimBL’s motivation:

I have a dream for the Web [in which computers] become capable of analyzing all the data on the Web – the content, links, and transactions between people and computers. A Semantic Web, which should make this possible, has yet to emerge, but when it does, the day-to-day mechanisms of trade, bureaucracy and our daily lives will be handled by machines talking to machines. The intelligent agents people have touted for ages will finally materialize.

Tim Berners-Lee, 1999

I have this dream for chemistry. It’s easier than trade and bureaucracy, and an early chemical semantic web is now technically possible. What do we require?

The most important thing is to realise that our normal means of discourse, spoken and written, are perfused by implicit semantics. Take:

Compound 13 melted at 458 K

To an anglophone human chemist the meaning of this is obvious, but to a machine it is simply a string of characters (ASCII-67, ASCII-111…). Although Natural Language Processing can interpret some of this (and our Sciborg project has addressed this), it’s still hard for a machine to get the complete meaning. So let’s look at the parts.

The concepts are:

  • A specific chemical compound
  • the concept of melting point
  • a numeric value
  • scientific units of measure
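To make the gap between the string and these concepts concrete, here is a deliberately naive sketch. The pattern only understands this one sentence shape, and the property name is my own label; real text mining is far messier.

```python
import re

SENTENCE = "Compound 13 melted at 458 K"

# What the machine sees first: character codes, nothing more.
codes = [ord(ch) for ch in SENTENCE[:2]]

# A narrow pattern for this single sentence shape (illustrative only).
PATTERN = re.compile(
    r"Compound (?P<compound>\d+) melted at (?P<value>\d+(?:\.\d+)?) (?P<unit>K)"
)


def extract(sentence: str):
    """Return the four concepts as a dict, or None if the sentence does not fit."""
    match = PATTERN.fullmatch(sentence)
    if match is None:
        return None
    return {
        "compound": match.group("compound"),
        "property": "meltingPoint",  # implied by the verb "melted"
        "value": float(match.group("value")),
        "unit": match.group("unit"),
    }


parts = extract(SENTENCE)
other = extract("Compound 13 melts at 185 C")
```

The second sentence means much the same to a chemist but falls straight through the pattern, which is exactly why implicit semantics are a poor foundation for machines.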

We use two main ways of expressing this, XML (Chemical Markup Language) and RDF. Nico Adams and I will argue about when XML should be used and when RDF. Nico would like to express everything in RDF and that time may come, but at present we have a lot of code that can understand CML and so I prefer to mix them (see below). In any case I’m not going to display them here.

What are the essential parts of our semantic framework?

  • The concepts must be clearly identified in a formal system. This can be an ontology or a markup language or both. In each case there is a schema or framework to which the concepts must conform. CML has a schema (essentially stable) and Nico’s ChemAxiom Ontology has to conform to the Basic Formal Ontology.
  • There must be an agreed syntax for expressing the statements. In CML this is XML with a series of dictionaries also expressed in CML. For RDF there are a number of universally agreed syntaxes.
  • All components in the statement should have identifiers. In CML this is managed through ID attributes, in RDF through URIs. TimBL’s vision is that if everyone uses URIs based on domain names then the world becomes a Giant Global (RDF) Graph. There is lots of debate as to whether a URI should also be an address; I’ll blog that later. Without question the management of identifiers is a key requirement in the C21.
  • There should be software that does something useful with the result. This is often overlooked; systems like RDF allow navigation and validation of graphs and often a tabulation of the results. But chemists will want to view a spectrum as a spectrum, not as a set of RDF triples. We’ve made good progress here. Currently my thinking is that CML acts as the primary way of exposing chemical functionality to programs.
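Without committing to either syntax, the role of identifiers can be sketched with plain Python tuples standing in for triples. Every URI below is a made-up example.org placeholder, and the predicate names are assumptions, not terms from any real vocabulary.

```python
# All identifiers are invented placeholders under example.org.
BASE = "http://example.org/"

triples = [
    (BASE + "compound/13", BASE + "prop/hasMeltingPoint", BASE + "measurement/1"),
    (BASE + "measurement/1", BASE + "prop/value", 458.0),
    (BASE + "measurement/1", BASE + "prop/units", BASE + "units/kelvin"),
]


def objects(subject: str, predicate: str) -> list:
    """Navigate the graph: every object recorded for a subject/predicate pair."""
    return [o for s, p, o in triples if s == subject and p == predicate]


# Follow the links from the compound to its measured value and units.
measurement = objects(BASE + "compound/13", BASE + "prop/hasMeltingPoint")[0]
value = objects(measurement, BASE + "prop/value")[0]
units = objects(measurement, BASE + "prop/units")[0]
```

Because each part has its own identifier, anyone else can add statements about the same compound or the same unit without coordinating with us, which is what makes a global graph possible.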

I think I’ll post this (also to check ICE) and then talk about chemical concepts in the next post.


Chem4Word demo – afterwards – 1

Tuesday, April 28th, 2009

Rudy and I presented the C4W demo to about 150 people, mainly from pharma/IT. We’d heard about the Pistoia Alliance, which aims to bring together pre-competitive elements in the pharma industry to share standards, data, protocols. So it was a good setting.

I was presenting from Rudy’s machine, whose touchpad I was rather unfamiliar with, and we also magnified the text so that people could read it, at the cost of some clipping. I’d had some challenge remembering the precise keystrokes to load and show the ontology markup, so we actually started with a fairly complete document. As always with interactive demos it’s quite an effort to remember where to go next, what to say and to keep it moving.

Anyway C4W has got its first public showing and we can now show you a screen shot:


The point to stress is that each cell is semantic; they all show different surface features of the molecule in the row. We’ve chosen 5 views: identifier, formula, name, 2D diagram and InChI. But each of these could be clicked to change to a different representation.

What’s important to realise is that C4W is not a molecular editor, it’s a chemical document editor. Here we are editing the DOCUMENT:


We can, of course, edit molecules, and here’s an example:


Here the S atom has been picked and dragged and when the editor is saved this will update the document.

More later…

Chem4Word demo

Tuesday, April 28th, 2009

Rudy Potenzone (MS) and I spent a long breakfast (or pre-breakfast) preparing the demo of Chem4Word, which gets shown in ca. 30 mins (by me, with Rudy on hand as it’s on his machine). We’ve both done demos before; my first was ca 1978 (more later, perhaps), and they were common in the mid-90s when I was selling (a little) into pharma. Rudy’s been in many companies and orgs: CAS, MDL, Polygen (Accelrys), and now MSR where he heads up the industry liaison in chemistry. We’re excited about the potential of C4W and I have to organize my brain to make sure I remember my lines.

The demo has been planned for a month and there are 35+ separate operations to show. I have a crib sheet as it’s easy to forget what to show: not how to show it, but what and when.

Next post might tell you how it went. And perhaps some screen shots…

CLARION Chemical Data repository at Cambridge – (2nd try)

Tuesday, April 28th, 2009

I mentioned yesterday that we had been funded by JISC to develop a departmental repository starting from C3DER (crystallography) and expanding to spectroscopy and chemical syntheses. We shall be working with a commercial supplier of Electronic Lab Notebooks in a tightly coupled project where both will benefit from the synergy: we shall have the use of a robust platform and can add on many of the innovations we’ve been developing here, which should then get a wider currency. We think this is a new and exciting way of exploring the next generation of chemical informatics, which will be semantic, enhanced and guided by ontologies.

The vendor has not been selected so I am keeping my mouth shut…

CLARION project Cambridge Chemistry Department


The data challenge: Chemistry laboratories produce many types of information and data: raw data, processed data, observations, chemical structures, reaction schemes, experimental write-ups, conclusions, graphs, images, crystallographic and spectroscopy data, papers, references, and so on.  It is challenging to store this variety of information such that it is accessible and usable by a variety of users.  The challenges include:


  • Storing data in formats that allow its use by specialist data processing tools

  • Using data formats that are suitable for publication and long-term preservation

  • Allowing certain data to be used by people outside the department

  • Motivating researchers to open their data

  • Enhancing the meaning and context of the data to improve its usability

  • Making the data searchable and easily navigable

  • Ensuring that the system has minimal support overheads, yet continually evolves as required to meet changes in the IT environment.


Using an ELN:  The Cambridge Chemistry Department has a basic repository which stores crystallographic data.  Project CLARION (Cambridge Laboratory Repository In/Organic Notebooks) will create an enhanced repository that captures core types of chemistry data and ensures their access and preservation.  The Chemistry Department is implementing a commercial Electronic Laboratory Notebook (ELN) system; CLARION will work closely with the ELN team to create a system for ingesting chemistry data directly into the repository with minimum effort by the researcher.


Enhancing and expanding data usage:  CLARION will provide functionality to enable scientists to make selected data available as Open Data for use by people external to the department.  The project will use techniques for adding semantic definition to chemical data, including RDF (Resource Description Framework) and CML (Chemical Markup Language).  Many of these techniques will be extensible to other disciplines.  CLARION will address general issues such as ownership of data, and it will publicise its results to the chemistry and repositories communities.  Effort will be put into developing a sustainable business model for operating the repository that can be adopted by the department after project completion.


Timelines: The project runs for two years from April 2009. The initial pilot deployment of the ELN is scheduled for late 2009, and we hope to be publishing open data from it in early 2010.


Project blog:      

Twitter:                       CLARIONproject

Contact:                      Brian Brooks

BioIT 2009 – Trends from the Trenches

Monday, April 27th, 2009

Plenary: Trends from the Trenches, Chris Dagdigian, BioTeam, Inc.

Some brief notes from plenaries

What’s Mainstream:

Virtualization, partly because of power requirements. The simplest and most powerful thing you can do. Protects the web apps and databases that are lashed up. Valid use case, but not enterprise. So virtualization allows an enterprise-like environment and preserves this innovation without danger. Vital for science.

Not coming soon: VMs for grids and clusters. Too much admin hassle.

Storage: first 100 TB single-namespace project. Jobs lost over data loss. Data triage is a given. Examples: single namespace for Mac has 80 TB, 1.1 PB on a Linux system.

Users have no idea of the true cost of storage. $124 for 1 TB for hardware is misleading. Individual labs put in 100 TB+ systems.

Unlimited data storage days are over; need triage. Cheaper to repeat experiment than keep data.

Data loss: example, double disk failure in metadata, 10 TB, in a government lab. You will get double disk failures. Need RAID6.

Backup is becoming a thing of the past, no nightly full.

Amazon, Google, MS can store for 80 cents / GB / year. Can you do that??
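A back-of-envelope check on those two figures, as a rough sketch only. It assumes decimal units (1 TB = 1000 GB), which the talk did not specify, and it ignores power, staff, replication and everything else that makes the raw hardware price misleading.

```python
GB_PER_TB = 1000  # assumption: decimal terabytes


def yearly_cloud_cost_dollars(terabytes: int, cents_per_gb_year: int = 80) -> float:
    """Cost of keeping the data with a cloud provider for one year."""
    return terabytes * GB_PER_TB * cents_per_gb_year / 100


def raw_disk_cost_dollars(terabytes: int, dollars_per_tb: int = 124) -> int:
    """One-off price of bare disks, with nothing else counted."""
    return terabytes * dollars_per_tb


# A lab-sized 100 TB system, using the figures quoted in the talk.
cloud_per_year = yearly_cloud_cost_dollars(100)
bare_disks = raw_disk_cost_dollars(100)
```

The gap between the two numbers is the point of the question: the cloud price includes the operating costs that the sticker price of a disk leaves out.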

IT cannot be sole decision maker for triage or for storage optimization.

Rate limits are chemistry, reagent costs and human factors.

Problem is somewhat scary but most people surviving.

Amazon is the cloud; has multi-year head start.

Security in the cloud: don’t expect things that you don’t provide. Objections are often political.

Compute power is easier than IO. He believes that Amazon are working on data ingestion.

Will be big move of science data into storage cloud. Science data will take 1-way trip. Data will stay in cloud. Only derived data will return.

McKinsey report on Cloud Computing very good, also James Hamilton

Watch Amazon, Google and MS.

Best data practices are starting to trickle out. Google is now showing what it did 5 years ago, so they must be up to very exciting things now.

Finally, federated data storage: something for the future.

BioIT in Boston: What is Open?

Monday, April 27th, 2009

My talk is Open Semantic Data in Science. I’ll probably write 3-4 blog posts on the various aspects of this, and at present I’m thinking of:

  • What is Open? (this post)
  • What is semantic? And what do we require for it?
  • What is data?
  • What are we able to offer (with some modest emphasis on our own endeavours).

I am starting with the assumption that for science now and in the future Open Data will be essential. The culture, especially among young people, is that the answer is out there and is retrievable within seconds or less. There’s also a realisation that increasingly we don’t know in detail what we are looking for when we start a study. We read bits of papers, skim around till we get a feel for the subject, ask our colleagues, post questions on blogs, etc.

We’re also using machines much more to help us with the data, both in the volume and the diversity. This is a central theme at BioIT. So the fundamental postulate of Openness is:

  • ANY barrier to access and re-use, however small and seemingly trivial COMPLETELY destroys public semantic data.

(Note that I accept that there are closed worlds (companies, healthcare, etc.) which require access controls, but their technology can feed off what we are trying to create in public view).

Why am I so insistent on this? I’ll leave the moral and ethical arguments aside here and concentrate on the technical aspects. The Open Knowledge Foundation has addressed this point in its definition and I’ll quote from that, highlighting particular points (and abbreviating occasionally).

A work is open if its manner of distribution satisfies the following conditions:

  • 1. Access

The work shall be available as a whole …, preferably downloading via the Internet without charge. The work must also be available in a convenient and modifiable form.

Comment: This can be summarized as ‘social’ openness – not only are you allowed to get the work but you can get it. ‘As a whole’ prevents the limitation of access by indirect means, for example by only allowing access to a few items of a database at a time.

  • 2. Redistribution

The license shall not restrict any party from selling or giving away the work either on its own or as part of a package made from works from many different sources. The license shall not require a royalty or other fee for such sale or distribution.

  • 3. Reuse

The license must allow for modifications and derivative works and must allow them to be distributed under the terms of the original work. The license may impose some form of attribution and integrity requirements: see principle 5 (Attribution) and principle 6 (Integrity) below.

Comment: Note that this clause does not prevent the use of ‘viral’ or share-alike licenses that require redistribution of modifications under the same terms as the original.

  • 4. Absence of Technological Restriction

The work must be provided in such a form that there are no technological obstacles to the performance of the above activities. This can be achieved by the provision of the work in an open data format,

  • 5. Attribution

The license may require as a condition for redistribution and re-use the attribution of the contributors and creators to the work.

  • 6. Integrity

The license may require as a condition for the work being distributed in modified form that the resulting work carry a different name or version number from the original work.

  • 7. No Discrimination Against Persons or Groups
  • 8. No Discrimination Against Fields of Endeavor

The license must not restrict anyone from making use of the work in a specific field of endeavor. For example, it may not restrict the work from being used in a business, or from being used for military research.

Comment: The major intention of this clause is to prohibit license traps that prevent open source from being used commercially. We want commercial users to join our community, not feel excluded from it.

  • 9. Distribution of License

The rights attached to the work must apply to all to whom the work is redistributed without the need for execution of an additional license by those parties.

  • 10. License Must Not Be Specific to a Package

  • 11. License Must Not Restrict the Distribution of Other Works

and now the absolute requirement for Openness.


This is the crux. There are many data resources which are described as Open but they fail in one or more aspects. The commonest failures are:

  • To expose only part of the data. A database system with a query interface is normally not Open Data even if individual items can be downloaded without barrier. It is generally impossible to extract the whole work as its boundaries are concealed by the search interface.
  • To limit the amount downloaded. This is very frequent (“you may use a maximum of 100 entries”).
  • To forbid re-use (“This data is copyright X and may not be re-used without permission”).
  • To require access through specific technology. A search form limits the access.
  • To require any form of sign-in, even if free. Robots are illiterate in this respect.
  • To restrict purpose of re-use. Thus CC-NC (no commercial reuse) is NOT OKF-compliant.
  • To fail to provide a clear statement that the data are open and comply with the Open Knowledge definition. It’s almost universal that data are NOT labelled as Open. This is easy to fix: just add the OKF’s tags.

So the message is simple, though it will take time to spread:

  • Use the OKF definition for all your data and tag it as such.
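The commonest failures can be turned into a simple checklist. This is a sketch under my own assumptions: the field names are invented for illustration, and passing these checks is no substitute for reading the full definition.

```python
from dataclasses import dataclass


@dataclass
class DataResource:
    """Hypothetical description of a data resource; field names are illustrative."""
    whole_work_downloadable: bool    # not just items behind a query interface
    download_capped: bool            # e.g. "maximum of 100 entries"
    reuse_forbidden: bool
    needs_specific_technology: bool  # e.g. reachable only through a search form
    requires_signin: bool            # even a free sign-in stops robots
    noncommercial_only: bool         # e.g. CC-NC
    labelled_open: bool              # carries a clear statement of Openness


def openness_failures(resource: DataResource) -> list:
    """List which of the common failures apply; an empty list means none found."""
    checks = [
        (not resource.whole_work_downloadable, "exposes only part of the data"),
        (resource.download_capped, "limits the amount downloaded"),
        (resource.reuse_forbidden, "forbids re-use"),
        (resource.needs_specific_technology, "requires specific technology"),
        (resource.requires_signin, "requires sign-in"),
        (resource.noncommercial_only, "restricts purpose of re-use"),
        (not resource.labelled_open, "is not labelled as Open"),
    ]
    return [message for failed, message in checks if failed]


fully_open = DataResource(True, False, False, False, False, False, True)
nc_licensed = DataResource(True, False, False, False, False, True, True)
```

Note that the test is all-or-nothing, in line with the postulate above: a resource with a single failure is not Open, however generous it is in every other respect.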

This blog authored with ICE + Open Office; thanks to PeterSefton and USQ

BioIT in Boston: What I shall say and How I shall say it

Monday, April 27th, 2009

I am talking on Wednesday at the 2009 Bio-IT World Conference, which unites life sciences, pharmaceutical, clinical, healthcare, and IT professionals. It is the perfect place to learn, be recognized and network. Well, I hope I can be recognized as I’m meeting Steve Heller at 1600 and Antony Williams tomorrow. And networking is good: they have wireless everywhere, which is a great relief as so many of these conferences have no wireless or charge zillions per day. And I’ve found a free lunch (again, many of these places are really mean, but you can often sneak in at the back to special functions).

So what shall I say? I was going to talk about the Chemical Semantic Web, but then looked at my program and found I had offered:

Open Semantic Data in Science

which is much better because (a) a Pfizer article in the conference magazine says the SW doesn’t exist, but Semantic Data does and (b) it will fit well into the revised program. Antony Williams is now just before me and Rajarshi Guha is just afterwards. So we shall try to meet beforehand and see if we can make a seamless program.

The conference asked us to upload slides a month beforehand. I argued that my style of presentation didn’t use slides and anyway I didn’t know what I was going to say. I generally work it out the day or night beforehand. This isn’t because I am casual about it: I take my presentations seriously and put a lot of work into them, but a lot of this goes on in my head in the preceding days. I don’t recommend this to others, and in Cambridge we require colleagues to present dry runs a week beforehand.

But the main thing is that what matters to me is what I say rather than what I write. I try to adjust my words to the actual audience; there could be 10 or 100 people in this session, I have no idea. I often use a blog to work my ideas out and I’ll do that here. The blog serves as a record and now that I have ICE as an authoring tool I should be able to add images to the blog. Here goes…


[Question: what is the name of the panda? Who is his famous companion? And if you didn't know, how did you find out?]

So, yes, images seem to work great! So I can start to use blogs with things other than words.

So how can this technology compete with the awful Powerpoint? Its main virtue, and it’s an important one, is that it has a good editing tool and it acts as a container for various types of content. Not a robust one, as anyone who has transferred it from or to Mac or Unix knows.

So what alternative is there?

I suggest Word or Open Office. Ideally I’d like to use this for the complete presentation but I have got fouled up with the page flips. I like my HTML presentations because I like the ability to scroll. The problem is packing them together afterwards (I can’t do this beforehand as I don’t know what slides I will use).

So my current approach is to blog my main message before the presentation. By doing this with ICE I can author the blog as ODT. Then I can convert this to DOCX and upload it to the conference site. I’ve talked about this with Peter Sefton and we’re thinking about a way to manage some of this.

Ideally I’d like to be able to record which HTML slides I showed (I assume this requires Javascript). Then it would be useful to combine this into a single ODT/Word/HTML/PDF document for the later reader. That’s exactly what Peter does for the courseware at USQ.

So I’m blogging with ODT and will upload several blogs of my talk during the meeting.

I don’t know whether they will make sense, but at least they’ll be more semantic than Powerpoint.
