Chemical information on the web – typical problem

Here’s a typical problem with chemical (and other) data on the web and elsewhere. I illustrate it with an entry from Wikipedia, knowing that they’ll probably correct it, and similar entries, as soon as it’s pointed out. You don’t have to know much science to solve this one:

Molecular formula XeO4
Molar mass 195.29 g mol−1
Appearance Yellow solid below −36°C
Density ? g cm−3, solid
Melting point −35.9 °C

Here’s part of the infobox for xenon tetroxide in Wikipedia (WP). Why are the data questionable? The problem is universal… [The infobox didn’t copy, so you’ll have to look at the web page, which is probably a better idea anyway. Here’s a screenshot: infobox.PNG]
UPDATE: The problem comes in the character(s) before the numbers. It is not ASCII character 45, which is what most anglophone keyboards emit when the “-” is typed. From Wikipedia:

Character codes

Read | Character | Unicode | ASCII | URL | HTML (others)
Plus | + | U+002B | &#43; | %2B |
Minus | − | U+2212 | | | &minus; or &#x2212; or &#8722;
Hyphen-minus | - | U+002D | &#45; | %2D |

The Unicode minus sign is designed to be the same length and height as the plus and equals signs. In most fonts these are the same width as digits in order to facilitate the alignment of numbers in tables. The hyphen-minus sign (-) is the ASCII version of the minus sign, and doubles as a hyphen. It is usually shorter in length than the plus sign and sometimes at a different height. It can be used as a substitute for the true minus sign when the character set is limited to ASCII.
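The difference between the two characters is easy to check by machine. The following is a minimal Python sketch; the helper names `describe` and `parses_as_float` are made up for this illustration. It prints the code point and Unicode name of each sign and reports whether Python's built-in `float()` accepts a number written with it.

```python
import unicodedata

HYPHEN_MINUS = "\u002d"  # ASCII 45, what most keyboards emit for "-"
MINUS_SIGN = "\u2212"    # the typographic Unicode minus


def describe(ch):
    """Return the code point and official Unicode name of one character."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch)}"


def parses_as_float(text):
    """Report whether Python's float() accepts the text as a number."""
    try:
        float(text)
        return True
    except ValueError:
        return False


for sign in (HYPHEN_MINUS, MINUS_SIGN):
    # Same digits, different leading sign character.
    print(describe(sign), parses_as_float(sign + "35.9"))
```

Only the hyphen-minus version parses; the version with U+2212 raises `ValueError`, even though the two look nearly identical to a human reader.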

There is a tension here between scientific practice and the norms of typesetting and presentation. When the WP XML for this entry is viewed it looks something like:

<td><a href="/wiki/Molar_mass" title="Molar mass">Molar mass</a></td>
<td>195.29 g mol<sup>−1</sup></td>
</tr>
<tr>
<td>Appearance</td>
<td>Yellow solid below −36°C</td>
</tr>
<tr>
<td><a href="/wiki/Density" title="Density">Density</a></td>
<td> ? g cm<sup>−3</sup>, solid</td>
</tr>
<tr>
<td><a href="/wiki/Melting_point" title="Melting point">Melting point</a></td>
<td>
<p>−35.9 °C</p>

where the “minus” is represented by 3 bytes (E2 88 92 in UTF-8), which, when read as a single-byte encoding such as Windows-1252, print as

 âˆ’

Note also that the degree sign is encoded as two bytes, so it can show up as two characters.
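The byte counts can be verified directly. This is a small Python sketch, assuming the page is encoded as UTF-8 and that the viewer misreads it as Windows-1252 (`cp1252`); that choice of wrong encoding is an assumption for the illustration, and the function names are hypothetical.

```python
MINUS_SIGN = "\u2212"
DEGREE_SIGN = "\u00b0"


def utf8_hex(ch):
    """Return the UTF-8 bytes of a character as space-separated hex."""
    return " ".join(f"{byte:02X}" for byte in ch.encode("utf-8"))


def misread(ch, wrong_encoding="cp1252"):
    """Decode a character's UTF-8 bytes with the wrong single-byte encoding."""
    return ch.encode("utf-8").decode(wrong_encoding)


for ch in (MINUS_SIGN, DEGREE_SIGN):
    # One logical character becomes several bytes, and several
    # visible characters if the encoding is guessed wrongly.
    print(utf8_hex(ch), repr(misread(ch)))
```

The minus sign occupies three bytes and the degree sign two, so a tool that assumes one byte per character sees three and two characters respectively.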
If the document is Unicode then this may be strictly correct, but in a scientific context it is universal that ASCII 45 is used for minus.
The consequence is that a large amount of HTML is not machine-readable in the way that a human reads it.
The answer for “minus” is clear – in a scientific context always use ASCII 45. It is difficult to know what to do with the other characters such as degrees. They can be guaranteed to cause problems at some stage when transforming XML, HTML or any other format unless there is very strict discipline on character encodings in documents, programs and stylesheets.
Which is not common.
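For anyone consuming such pages, one defensive option is to map the Unicode minus to ASCII 45 before extracting numbers. The sketch below is only an illustration under stated assumptions: the regular expression handles plain decimal numbers only, and the names `TO_ASCII_MINUS`, `NUMBER` and `first_number` are invented for this example.

```python
import re

# Map the Unicode minus sign (U+2212) to the ASCII hyphen-minus (45).
TO_ASCII_MINUS = str.maketrans({"\u2212": "-"})

# A deliberately simple pattern: optional ASCII sign, digits, optional fraction.
NUMBER = re.compile(r"[-+]?\d+(?:\.\d+)?")


def first_number(text):
    """Return the first decimal number in text, after normalizing the minus."""
    match = NUMBER.search(text.translate(TO_ASCII_MINUS))
    return float(match.group()) if match else None


raw = "\u221235.9 \u00b0C"  # the melting point as it appears in the infobox

# Without normalization the pattern skips the sign and finds only the digits.
print(NUMBER.search(raw).group())
# With normalization the sign is kept.
print(first_number(raw))
```

The unnormalized search is the dangerous case: it does not fail, it silently returns the digits without the sign, so the melting point would be read as a positive number.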
Note, of course, that it’s much worse in Word documents. We have examples in published manuscripts (i.e. on publisher web sites) where numbers are taken not from the normal ASCII range (48-57) but from any of a number of symbol fonts. These are almost impossible for machines to manage correctly.
