Why we need semantic chemical authoring-2
We’re in the process of aggregating a repository of common chemicals (somewhere in the range 1000-10000 entries) and we are taking data from various publicly available web sites. Typical sources are Wikipedia, any aggregator with Open Data policies, and MSDS sheets (chemical safety information). One such site is INCHEM (Chemical Safety Information from Intergovernmental Organizations), which lists about 1500 materials (most are chemical compounds, though some are mixtures).
The information on the web is in HTML pages and we wrote a scraper to extract the information from each. I’d planned to show a screenshot but WordPress has stopped me uploading any images, so you’ll have to visit the link. In any case you wouldn’t be able to see the point from a screenshot. Scraping is not fun – the HTML is as bad as almost any other HTML. It needed a 2-pass process – first through HTMLTidy and then analysis by XML tools. From this we extract the most important information and turn it into CML – names, formula, connection tables, properties, etc.
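To give a flavour of the two passes, here is a minimal sketch in Python. It is not our scraper: the standard-library html.parser stands in for HTMLTidy and the XML tools, the names TextExtractor and extract_properties are made up for illustration, and it assumes each property sits on its own line as "Label: value". It stops at a dictionary rather than producing CML.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Pass 1: tolerate messy HTML and reduce it to lines of plain text."""

    # Tags assumed to start a new line of text (an illustrative choice).
    BLOCK_TAGS = {"p", "br", "tr", "div", "li"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        # html.parser lower-cases tag names, so <TR> and <tr> both land here.
        if tag in self.BLOCK_TAGS:
            self.chunks.append("\n")

    def handle_data(self, data):
        self.chunks.append(data)

    def text(self):
        return "".join(self.chunks)


def extract_properties(html):
    """Pass 2: pick out 'Label: value' lines from the recovered text."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    props = {}
    for line in parser.text().splitlines():
        label, sep, value = line.partition(":")
        if sep and label.strip() and value.strip():
            # Collapse runs of whitespace inside the value.
            props[label.strip()] = " ".join(value.split())
    return props


if __name__ == "__main__":
    # Unclosed tags and mixed case, as in much real-world HTML.
    messy = "<TABLE><TR><TD>Molecular mass:   80.5<TR><TD>no label here</TABLE>"
    print(extract_properties(messy))
```

Splitting the work this way keeps the tag-soup problem separate from the extraction problem, which is the main reason for having two passes at all.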
We wanted to see if the aggregation and consistency checking could be done by machine, using RDF. This is surprisingly hard as none of the sites contains all the information we need and many have large sparse patches. There is also the subtle problem of identifying the platonic nature of each chemical – what should we actually use as an entry for – say – alumin(i)um chloride? Or should there be more than one?
We’ve got the data in. There are a large number of simple but niggly lexical problems, such as the degree symbol for temperature (totally inconsistent within and between documents). And then there are the semantics – how do you record a boiling point as “between 120 and 130 at 20 mm Hg”? (CML can do this, but it takes work to do the conversion.)
And the sites have errors. Here’s a rather subtle one which the average human would miss (we needed a machine to find it). You’ll have to go to the page for chloromethylmethylether – I daren’t try to transcribe it into WordPress. The error is in the displayed page (no need to scroll down).
If we had semantic authoring tools this wouldn’t happen. I’ll be blogging soon (I hope) about our activity in this area.
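One kind of check a machine can run on every entry is to recompute a derived value and compare it with the stated one. The sketch below does this for molecular mass, using a small table of rounded atomic masses and a parser for simple formulae without brackets. It is an illustration only, not necessarily the check that caught the error above; the function names and the 0.1 tolerance are assumptions.

```python
import re

# Rounded standard atomic masses for a handful of elements (illustrative subset).
ATOMIC_MASS = {
    "H": 1.008, "C": 12.011, "N": 14.007, "O": 15.999,
    "F": 18.998, "S": 32.06, "Cl": 35.45, "Br": 79.904,
}

# One element symbol followed by an optional count, e.g. "C2", "Cl", "H5".
ELEMENT_COUNT = re.compile(r"([A-Z][a-z]?)(\d*)")
WHOLE_FORMULA = re.compile(r"(?:[A-Z][a-z]?\d*)+")


def formula_mass(formula):
    """Sum atomic masses for a flat formula such as 'C2H5ClO'.

    Raises ValueError for anything that is not a flat formula and
    KeyError for an element missing from ATOMIC_MASS.
    """
    if not WHOLE_FORMULA.fullmatch(formula):
        raise ValueError(f"cannot parse formula: {formula!r}")
    total = 0.0
    for symbol, count in ELEMENT_COUNT.findall(formula):
        total += ATOMIC_MASS[symbol] * (int(count) if count else 1)
    return total


def mass_is_consistent(formula, stated_mass, tolerance=0.1):
    """True when the stated mass agrees with the formula within tolerance."""
    return abs(formula_mass(formula) - stated_mass) <= tolerance


if __name__ == "__main__":
    # C2H5ClO is the formula of chloromethyl methyl ether.
    print(round(formula_mass("C2H5ClO"), 2))
    print(mass_is_consistent("C2H5ClO", 80.5))
    # A formula with the oxygen dropped no longer matches the stated mass.
    print(mass_is_consistent("C2H5Cl", 80.5))
```

A semantic authoring tool could run checks of this sort as the document is written, rather than leaving them to whoever scrapes the page later.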
UPDATE: My best go at scraping the bit of the page with the error. It’s now semi-semantic (HTML) so you should be able to track the error down. You only have to know a little bit of chemistry…
Molecular mass: 80.5