We’re in the process of aggregating a repository of common chemicals (somewhere between 1,000 and 10,000 entries), taking data from various publicly available web sites. Typical sources are Wikipedia, any aggregator with an Open Data policy, and MSDS sheets (chemical safety information). One such site is INCHEM (Chemical Safety Information from Intergovernmental Organizations), which lists about 1500 materials (most are chemical compounds, though some are mixtures).
The information on the web is in HTML pages and we wrote a scraper to extract the information from each. I’d planned to show a screenshot but WordPress has stopped me uploading any images, so you’ll have to visit the link. In any case you wouldn’t be able to see the point from a screenshot. Scraping is not fun – the HTML is as bad as almost any other HTML. It needed a 2-pass process – first into HTMLTidy and then analysis by XML tools. From this we extract the most important information and turn it into CML – names, formula, connection tables, properties, etc.
We wanted to see if the aggregation and consistency checking could be done by machine, using RDF. This is surprisingly hard as none of the sites contains all the information we need and many have large sparse patches. There is also the subtle problem of identifying the platonic nature of each chemical – what should we actually use as an entry for – say – alumin(i)um chloride? Or should there be more than one?
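To make the idea of machine consistency checking a bit more concrete, here is a minimal sketch in Python. It is not our RDF pipeline: the statements are plain tuples standing in for RDF triples plus a source, and every identifier (`ex:compound1`, `ex:siteA`, the property names) and every value is made up for illustration. It flags numeric properties on which sources disagree, and lists the properties each entry is missing.

```python
from collections import defaultdict

# (subject, predicate, value, source): plain tuples standing in for RDF statements.
# All identifiers and numbers below are invented for illustration.
statements = [
    ("ex:compound1", "ex:meltingPoint", 192.0, "ex:siteA"),
    ("ex:compound1", "ex:meltingPoint", 192.4, "ex:siteB"),
    ("ex:compound1", "ex:molecularMass", 100.0, "ex:siteA"),
    ("ex:compound1", "ex:molecularMass", 118.0, "ex:siteB"),
    ("ex:compound2", "ex:molecularMass", 80.5, "ex:siteA"),
]


def find_conflicts(statements, tolerance=0.5):
    """Return {(subject, predicate): [(value, source), ...]} where sources disagree."""
    by_key = defaultdict(list)
    for subject, predicate, value, source in statements:
        by_key[(subject, predicate)].append((value, source))
    conflicts = {}
    for key, values in by_key.items():
        numbers = [value for value, _ in values]
        # Values further apart than the tolerance count as a disagreement.
        if max(numbers) - min(numbers) > tolerance:
            conflicts[key] = values
    return conflicts


def find_gaps(statements, required):
    """Return {subject: [missing predicates]} for the sparse patches."""
    seen = defaultdict(set)
    for subject, predicate, _, _ in statements:
        seen[subject].add(predicate)
    gaps = {}
    for subject, predicates in seen.items():
        missing = sorted(set(required) - predicates)
        if missing:
            gaps[subject] = missing
    return gaps


conflicts = find_conflicts(statements)
gaps = find_gaps(statements, ["ex:meltingPoint", "ex:molecularMass"])
```

A disagreement like the one on `ex:molecularMass` does not by itself say which source is wrong. It may equally mean that the two sites are describing different things under one name, which is exactly the identity problem above.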
We’ve got the data in. There are a large number of simple but niggly lexical problems, such as the degrees symbol for temperature (totally inconsistent within and between documents). And the semantics – how do you record a boiling point as “between 120 and 130 at 20 mm Hg”? (CML can do this, but it takes work to do the conversion.)
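For a flavour of the lexical work, here is a rough Python sketch of the two problems just mentioned. It is an illustration only, not our converter: the function names, the list of degree variants and the dictionary layout are all assumptions, and the output is a plain dictionary rather than CML. Note that the quoted phrase gives no temperature unit, so the sketch does not guess one.

```python
import re

# Assumed variants of "degrees": degree sign, masculine ordinal, ring above, "deg", "degrees".
_DEGREE_VARIANTS = re.compile(
    r"\s*(?:°|º|˚|deg(?:rees)?\.?)\s*(?=[CFK]\b)", re.IGNORECASE
)

# Matches phrases like "between 120 and 130 at 20 mm Hg"; the pressure part is optional.
_RANGE = re.compile(
    r"between\s+(?P<lo>-?\d+(?:\.\d+)?)\s+and\s+(?P<hi>-?\d+(?:\.\d+)?)"
    r"(?:\s+at\s+(?P<p>\d+(?:\.\d+)?)\s*mm\s*Hg)?",
    re.IGNORECASE,
)


def normalise_degrees(text):
    """Rewrite the degree variants above to a single ' °' before C, F or K."""
    return _DEGREE_VARIANTS.sub(" °", text)


def parse_boiling_range(text):
    """Return min, max and optional pressure (mm Hg) as floats, or None if no match."""
    match = _RANGE.search(text)
    if match is None:
        return None
    pressure = match.group("p")
    return {
        "min": float(match.group("lo")),
        "max": float(match.group("hi")),
        "pressure_mmHg": float(pressure) if pressure else None,
    }


parsed = parse_boiling_range("between 120 and 130 at 20 mm Hg")
```

Keeping the range as an explicit minimum and maximum with the pressure attached, rather than collapsing it to a single number, preserves what the source actually said.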
And the sites have errors. Here’s a rather subtle one which the average human would miss (we needed a machine to find it). You’ll have to go to the page for chloromethylmethylether – I daren’t try to transcribe it into WordPress. The error is in the displayed page (no need to scroll down).
If we had semantic authoring tools this wouldn’t happen. I’ll be blogging soon (I hope) about our activity in this area.
UPDATE: My best go at scraping the bit of the page with the error. It’s now semi-semantic (HTML) so you should be able to track the error down. You only have to know a little bit of chemistry…
Dimethylchloro ether Chloromethoxymethane |
CAS # |
107-30-2 |
CH3OCH2CI |
RTECS # |
KN6650000 |
Molecular mass: 80.5 |
I know you know this, but I have to say it anyway. The existence of tools is necessary but not sufficient. Changing practices is the other variable in the equation, and it’s a much harder one to solve for.
(1) I know that you know that I know that you know that I know this. The key strategy is to make the tools such that they change practice without people having to know that they are changing. This is hard, but we think we can do it in this area. More later.
OK, are you going to let a non-chemist know what was wrong? This average human certainly missed the problem!
I’ve been thinking about a blog post related to your hamburger rants. But the more I try to think it through, the murkier it gets. Is the problem that PDF cannot store the semantic information? I’m not sure, but I’m beginning to suspect maybe not, ie PDF can. Is the problem that the tools that build the PDFs don’t encode the semantic information? Probably. Is the semantic information available in the publisher’s file from which the PDF is built? Possibly to probably, depending on the publisher and their DTD/schema. Is the semantic information available in the author’s file? Probably not to possibly, depending on author tools (I’m not sure what chemists use to write these days; Word would presumably be dire in this respect unless there is a chemistry plug-in; LaTeX can get great results in math and CS, but I’m not sure how semantic, as opposed to display-oriented, the markup is). And even if this were to all happen, does chemistry have the agreed vocabulary, cf the Gene Ontology in bio-sciences, to make the information truly “semantic”? And…
(3) I will mail you. Since Joe Townsend (at least) is still fighting the problem I don’t want to give too many public hints. But I told Joe that there is a set of humans on this planet who would immediately spot a problem (assuming they have a very very basic knowledge of chemistry). It’s a problem to do with eyes in more than one sense.
At Wikipedia, we are currently involved in a VERY similar process to the one you’re describing here, except we are less optimistic as to the possibilities of automation. Given our somewhat eclectic range of compounds, we are more than used to the fact that many fundamental data are simply not known. To take one (extreme) example, have a look at , where we give virtually all of the publicly available information on this compound.
While I would not wish to discourage your group, I must say that, at Wikipedia, we have found that the most valuable “semantic chemical authoring tool” is a human chemist: personally, I charge less for consultancy than CAS charges for access to its databases (but maybe that’s my mistake!). Much chemical information, on the web and on paper, is false, and most of it lacks the necessary metadata to be able to judge its veracity. THAT, I feel, is the real problem!