Scraping HTML

As we have mentioned earlier, we are looking at how experimental data can be extracted from web sources. There is a rough scale of feasibility:
RDF/XML > (legacy) CIF/JCAMP/SDF > HTML > PDF
I have been looking at several sites which produce chemical information (more later). One exposes SDF (a legacy ASCII format for molecular structures and data). The others all expose HTML. This is infinitely better than PDF, BUT…
I had not realised how awful it can be. The problems include:

  • encodings. If any characters outside the printable ASCII range (32-127) are used they will almost certainly cause problems. Few sites declare an encoding, and even when they do the conversion is not necessarily trivial (the decoding sketch after this list shows one defensive fallback).
  • symbols. Many sites use “smart quotes” for quotes. These are outside the ASCII range and almost invariably cause problems. The author can be slightly forgiven, since many tools (including WordPress) convert to smart quotes (“”) automatically. Even worse is the use of an “mdash” (—) in place of a minus sign in numerical values: it can be transformed into a “?” or a block character, or lost altogether. Dropping a minus sign can cause crashes and death. (We also find papers in Word where the numbers are in Symbol font and get converted to whatever, or deleted.) The decoding sketch after this list shows one way to normalise these characters.
  • non-HTML tags. Some tools make up their own tags (e.g. I found <startfornow>) and these can cause HTMLTidy to fail (the lenient-parsing sketch after this list shows a more forgiving approach).
  • non-well-formed HTML. Although some omissions are permitted (e.g. “br” can omit its end tag), many are simply not interpretable. Using <p> to separate paragraphs rather than to contain them is particularly bad style.
  • javascript, php, etc. Hopefully these can be ignored, but often they can’t. (The lenient-parsing sketch below simply strips script blocks before extracting text.)
  • linear structure rather than groupings. Sections can be created with the “div” tag, but many pages assume that a bold heading (h2) is the right way to declare a section. This may be obvious when humans read it, but it causes great problems for machines: it is difficult to know where a section ends. (The sectioning sketch after this list shows one way to rebuild the groupings.)
  • variable markup. For a long-established web resource – even where pages are autogenerated – the markup tends to evolve, and it may be difficult to find a single approach to understanding it. This is also true of multi-author sites where there is no clear specification for the markup – Wikipedia is a good example.
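
A minimal sketch of the defensive decoding and character normalisation referred to in the first two points, assuming a Python scraping pipeline; the fallback encodings and the replacement table are illustrative choices, not a fixed recipe:

    import unicodedata
    import urllib.request

    def fetch_text(url):
        """Fetch a page and decode it, falling back when the charset is missing or wrong."""
        with urllib.request.urlopen(url) as resp:
            raw = resp.read()
            declared = resp.headers.get_content_charset()  # very often None
        for enc in filter(None, [declared, "utf-8", "cp1252"]):
            try:
                return raw.decode(enc)
            except (UnicodeDecodeError, LookupError):
                continue
        return raw.decode("utf-8", errors="replace")  # last resort: keep going, mark the damage

    # Typographic characters that routinely corrupt extracted values.
    REPLACEMENTS = {
        "\u201c": '"', "\u201d": '"',   # smart double quotes
        "\u2018": "'", "\u2019": "'",   # smart single quotes
        "\u2013": "-", "\u2014": "-",   # en/em dash standing in for minus
        "\u2212": "-",                  # true Unicode minus sign
        "\u00a0": " ",                  # non-breaking space
    }

    def normalise(text):
        """Map typographic punctuation back to plain ASCII before parsing numbers."""
        for bad, good in REPLACEMENTS.items():
            text = text.replace(bad, good)
        return unicodedata.normalize("NFKC", text)

    print(normalise("\u0394H = \u221223.4 kJ/mol"))   # -> ΔH = -23.4 kJ/mol

The last-resort errors="replace" at least flags the damage with a replacement character rather than silently dropping a minus sign.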
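
A lenient-parsing sketch for invented tags, unclosed elements and embedded scripts, using BeautifulSoup with the html5lib tree builder (an illustrative alternative, not the HTMLTidy pipeline mentioned above); the sample markup is made up:

    from bs4 import BeautifulSoup   # pip install beautifulsoup4 html5lib

    broken = """<html><body>
    <p>First paragraph
    <p>Second paragraph with an invented tag <startfornow>inside</startfornow>
    <script>doSomething();</script>
    </body></html>"""

    # html5lib applies browser-style error recovery rather than giving up.
    soup = BeautifulSoup(broken, "html5lib")

    # Invented tags survive as ordinary elements instead of killing the parse.
    print(soup.find("startfornow").get_text())          # -> inside

    # Script and style blocks rarely carry data; drop them before extracting text.
    for tag in soup(["script", "style"]):
        tag.decompose()

    # The unclosed <p> elements have been closed, so paragraphs can be walked reliably.
    for p in soup.find_all("p"):
        print(p.get_text(strip=True))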
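
A sectioning sketch for the flat-structure problem: rebuilding sections from a page where h2 headings are the only structural markers. The sample markup and names are illustrative assumptions:

    from bs4 import BeautifulSoup

    flat = """<body>
    <h2>Synthesis</h2><p>Step one.</p><p>Step two.</p>
    <h2>Analysis</h2><p>Melting point 123-125 C.</p>
    </body>"""

    soup = BeautifulSoup(flat, "html.parser")

    # Walk the direct children of <body>, starting a new section at each heading;
    # a section implicitly ends where the next <h2> begins.
    sections, current = {}, None
    for node in soup.body.find_all(recursive=False):
        if node.name == "h2":
            current = node.get_text(strip=True)
            sections[current] = []
        elif current is not None:
            sections[current].append(node.get_text(strip=True))

    print(sections)
    # {'Synthesis': ['Step one.', 'Step two.'], 'Analysis': ['Melting point 123-125 C.']}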

As a result it is not usually possible to extract all the information from HTML pages, and both precision and recall fall well short of 100%. The only real solution is to persuade people to create machine-friendly pages based on RSS, RDF, XML and related technology. This solves 90% of the above problems. That’s why we are looking very closely at Jim Downing’s approach of using the Atom Publishing Protocol for web sites.
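
To give a feel for how little the publishing side needs, the following sketch builds a minimal Atom entry with the Python standard library. The element values are placeholders, and this illustrates only the general idea of a machine-readable record, not Jim Downing’s actual implementation:

    import xml.etree.ElementTree as ET

    ATOM = "http://www.w3.org/2005/Atom"
    ET.register_namespace("", ATOM)

    # A minimal Atom entry: title, stable id, timestamp and a typed content block.
    entry = ET.Element(f"{{{ATOM}}}entry")
    ET.SubElement(entry, f"{{{ATOM}}}title").text = "Compound 42: melting point"
    ET.SubElement(entry, f"{{{ATOM}}}id").text = "http://example.org/entries/compound-42"
    ET.SubElement(entry, f"{{{ATOM}}}updated").text = "2007-01-01T00:00:00Z"
    ET.SubElement(entry, f"{{{ATOM}}}content", type="text").text = "mp 123-125 C"

    print(ET.tostring(entry, encoding="unicode"))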


2 Responses to Scraping HTML

  1. Chris Rusbridge says:

    Is it not also appropriate to encode information in microformats (see http://en.wikipedia.org/wiki/Microformats)? This approach allows the visual look to remain while hidden HTML elements contain the key data, which can then be more reliably scraped. Most microformats are closely based on a common standard in the area; in this case perhaps CML etc?

  2. pm286 says:

    (1) Great to hear Chris – hope things are OK…
    The problem is that the authors (of the websites) have no interest in having their data extracted and so they wouldn’t think of microformats. (None of them have heard of “class” attributes or character encodings – and the html tagset is all over the place.) Anyone savvy about microformats would find it fairly easy to move to RSS, etc.
    P.
