As we have mentioned earlier, we are looking at how experimental data can be extracted from web sources. There is a rough scale of feasibility:
RDF/XML > (legacy) CIF/JCAMP/SDF > HTML > PDF
I have been looking at several sites which produce chemical information (more later). One exposes SDF (a legacy ASCII file of molecular structures and data). The others all expose HTML. This is infinitely better than PDF, BUT…
I had not realised how awful it can be. The problems include:
- encodings. If any characters outside the printable ASCII range (32-126) are used they will almost certainly cause problems. Few sites declare their encoding, and even when they do the interconversion is not necessarily trivial.
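The defensive approach we end up taking can be sketched roughly as follows: try the declared encoding first, fall back to common ones, and never crash. (The function name and fallback order here are illustrative, not a description of any particular tool.)

```python
# Sketch: defensively decode raw bytes from a web page whose charset
# declaration may be missing or simply wrong.
def decode_page(raw: bytes, declared=None) -> str:
    # Try the declared charset first, then common fallbacks.
    for enc in (declared, "utf-8", "latin-1"):
        if not enc:
            continue
        try:
            return raw.decode(enc)
        except (UnicodeDecodeError, LookupError):
            continue
    # Last resort: never crash, but mark the damage with replacement chars.
    return raw.decode("utf-8", errors="replace")

text = decode_page("δ 7.2 ppm".encode("utf-8"))  # survives round-trip
```

Note that latin-1 accepts any byte sequence, so the fallback chain always terminates with *some* string; the price is that a wrong guess silently mangles the non-ASCII characters rather than flagging them.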
- symbols. Many sites use “smart quotes” for quotes. These are outside the ASCII range and almost invariably cause problems. The author can be slightly forgiven since many tools (including WordPress) convert to smart quotes (“”) automatically. Even worse is the use of “mdash” (—) for “minus” in numerical values. This can be transformed into a “?” or a block character, or even lost. Dropping a minus sign can cause crashes and death. (We also find papers in Word where the numbers are in symbol font and get converted to whatever, or deleted.)
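One pragmatic defence is to normalise all the dash-like characters to a true hyphen-minus before parsing any numeric value. A minimal sketch (the mapping is illustrative, not exhaustive):

```python
# Sketch: map dash-like Unicode characters to ASCII hyphen-minus so that
# "em dash used as minus" does not silently drop the sign or crash float().
DASHES = {
    "\u2014": "-",  # em dash, often pasted in place of minus
    "\u2013": "-",  # en dash
    "\u2212": "-",  # the real Unicode minus sign
}

def parse_number(token: str) -> float:
    for dash, hyphen in DASHES.items():
        token = token.replace(dash, hyphen)
    return float(token.strip())

value = parse_number("\u20143.5")  # em-dash minus survives as -3.5
```

This only repairs the sign problem, of course; if an upstream conversion has already turned the dash into “?” or deleted it, no amount of normalisation will recover the value.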
- non-HTML tags. Some tools make up their own tags (e.g. I found <startfornow>) and these can cause HTMLTidy to fail.
- non-well-formed HTML. Although some omissions are legal (e.g. “br” can miss out the end tag), many constructs are simply not interpretable. The use of <p> to separate paragraphs rather than to contain them is very bad style.
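This is why a lenient parser fares better than a strict one on such pages: it carries on past invented tags and unclosed <p>, where an XML parser (or a strict HTMLTidy run) would give up. A small sketch using Python’s stdlib parser, just to illustrate the behaviour (the tag list is an arbitrary subset):

```python
from html.parser import HTMLParser

# Illustrative subset of tags we "know"; anything else gets reported.
KNOWN_TAGS = {"html", "head", "body", "p", "br", "b", "h2", "div"}

class LenientAudit(HTMLParser):
    """Tolerantly walk bad HTML, collecting text and unknown tags."""
    def __init__(self):
        super().__init__()
        self.unknown = []
        self.text = []

    def handle_starttag(self, tag, attrs):
        if tag not in KNOWN_TAGS:
            self.unknown.append(tag)   # e.g. an invented <startfornow>

    def handle_data(self, data):
        self.text.append(data)

audit = LenientAudit()
# Unclosed <p> and a made-up tag: the parser keeps going regardless.
audit.feed("<p>First para<p>Second <startfornow>odd</startfornow>")
```

The lenience cuts both ways: the parser does not fail, but it also cannot tell you where the first unclosed paragraph was meant to end, which is exactly the information a machine reader needs.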
- linear structure rather than groupings. Sections can be created with the “div” tag, but many pages assume that a bold heading (h2) is the right way to declare a section. This may be obvious when humans read it, but it causes great problems for machines – it is difficult to know where a section ends.
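The usual workaround is a heuristic: assume each h2 opens a section that runs until the next h2. A sketch over a flat stream of (tag, text) pairs (the data structure here is an assumption for illustration):

```python
# Sketch: recover implicit sections from a flat element stream, assuming
# each <h2> starts a section that runs until the next <h2>.
def group_sections(elements):
    sections, current = [], None
    for tag, text in elements:
        if tag == "h2":
            current = {"heading": text, "body": []}
            sections.append(current)
        elif current is not None:
            current["body"].append(text)
    return sections

flat = [("h2", "Synthesis"), ("p", "Reflux 2 h."),
        ("h2", "Analysis"), ("p", "NMR data.")]
sections = group_sections(flat)
```

The heuristic is fragile by construction: if the page uses h3 for some sections, or bold <p> elements instead of headings, the grouping silently goes wrong – which is the point of the complaint above.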
- variable markup. For a long-established web resource – even where pages are autogenerated – the markup tends to evolve and it may be difficult to find a single approach to understanding it. This is also true of multi-author sites where there is no clear specification for the markup – Wikipedia is a good example of this.
As a result it is not usually possible to extract all the information from HTML pages, and both precision and recall fall well short of 100%. The only real solution is to persuade people to create machine-friendly pages based on RSS, RDF, XML and related technologies. This solves 90% of the above problems. That’s why we are looking very closely at Jim Downing’s approach of using the Atom Publishing Protocol for web sites.