Category Archives: fun

Chemical information on the web - typical problem

Here's a typical problem with chemical (and other) data on the web and elsewhere. I illustrate it with an entry from Wikipedia, knowing that they'll probably correct it, and similar entries, as soon as it's pointed out. You don't have to know much science to solve this one:

Molecular formula XeO4
Molar mass 195.29 g mol−1
Appearance Yellow solid below −36°C
Density ? g cm−3, solid
Melting point −35.9 °C

Here's part of the infobox for Xenon tetroxide in WP. Why are the data questionable? The problem is universal... [The info box didn't copy so you'll have to look at the web page - probably a better idea anyway. Here's a screenshot] infobox.PNG

UPDATE: The problem comes in the character(s) before the numbers. It is not ASCII character 45, which is what most anglophone keyboards emit when the "-" is typed. From Wikipedia:

Character codes

Read          Character  Unicode  ASCII  URL   HTML (others)
Plus          +          U+002B   +      %2B
Minus         −          U+2212                &minus; or &#8722; or &#x2212;
Hyphen-minus  -          U+002D   -      %2D

The Unicode minus sign is designed to be the same length and height as the plus and equals signs. In most fonts these are the same width as digits in order to facilitate the alignment of numbers in tables. The hyphen-minus sign (-) is the ASCII version of the minus sign, and doubles as a hyphen. It is usually shorter in length than the plus sign and sometimes at a different height. It can be used as a substitute for the true minus sign when the character set is limited to ASCII.
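To make the difference concrete, here is a minimal Python sketch. The two characters are the ones in the table above; using Python's built-in `float` as the stand-in for "a machine reading the number" is my assumption, and `try_parse` is a hypothetical helper name.

```python
import unicodedata

# The two look-alike characters discussed above.
unicode_minus = "\u2212"   # MINUS SIGN, as found in the infobox
hyphen_minus = "-"         # ASCII 45, what most keyboards emit

for ch in (unicode_minus, hyphen_minus):
    print(f"U+{ord(ch):04X}", unicodedata.name(ch))


def try_parse(text):
    """Return float(text), or None if float() rejects the string."""
    try:
        return float(text)
    except ValueError:
        return None


print(try_parse("-35.9"))        # hyphen-minus: parsed as a negative number
print(try_parse("\u221235.9"))   # Unicode minus: float() refuses it
```

A human reads both strings as the same melting point; the parser only accepts the first.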

There is a tension here between scientific practice and the norms of typesetting and presentation. When the WP XML for this entry is viewed it looks something like:

<td><a href="/wiki/Molar_mass" title="Molar mass">Molar mass</a></td>
<td>195.29 g mol<sup>−1</sup></td>
<td>Yellow solid below −36°C</td>
<td><a href="/wiki/Density" title="Density">Density</a></td>
<td> ? g cm<sup>−3</sup>, solid</td>
<td><a href="/wiki/Melting_point" title="Melting point">Melting point</a></td>
<p>−35.9 °C</p>

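where the "minus" is represented by 3 bytes (0xE2 0x88 0x92, the UTF-8 encoding of U+2212), which print as something like "âˆ’" when the page is read in a single-byte encoding such as Windows-1252.

Note also that the degree sign is encoded as two bytes (0xC2 0xB0) and so shows up as two characters.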
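The byte counts can be checked with a short Python sketch. It assumes the page is UTF-8 and that the viewer misreads it as Windows-1252 (`cp1252`), which is one common way this kind of garbling arises; the function name is hypothetical.

```python
def utf8_bytes_and_mojibake(ch):
    """Return (hex bytes, misdecoded text) for one character.

    The character is encoded as UTF-8 and then deliberately decoded
    with the wrong codec (cp1252) to show what a confused viewer prints.
    """
    raw = ch.encode("utf-8")
    hex_bytes = " ".join(f"{b:02X}" for b in raw)
    return hex_bytes, raw.decode("cp1252")


for label, ch in (("minus", "\u2212"), ("degree", "\u00b0")):
    hex_bytes, garbled = utf8_bytes_and_mojibake(ch)
    print(label, hex_bytes, repr(garbled))
```

The minus sign yields three bytes and the degree sign two, and each byte turns into its own character when the wrong codec is applied.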

If the document is Unicode then this may be strictly correct, but in a scientific context it is universal that ASCII 45 is used for minus.

The consequence is that a large amount of HTML is not machine-readable in the way that a human reads it.

The answer for "minus" is clear - in a scientific context always use ASCII 45. It is difficult to know what to do with the other characters such as degrees. They can be guaranteed to cause problems at some stage when transforming XML, HTML or any other format unless there is very strict discipline on character encodings in documents, programs and stylesheets.
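Until authors do that, a program reading such pages has to defend itself. The following is a sketch only, in Python: `MINUS_TABLE`, `NUMBER` and `extract_numbers` are hypothetical names, and the choice to treat only U+2212 as a minus (leaving en dashes and the like alone, since they often mark ranges) is my assumption.

```python
import re

# Map the Unicode minus sign onto ASCII hyphen-minus (45).
# Assumption: only U+2212 is normalised; other dashes are left untouched.
MINUS_TABLE = str.maketrans({"\u2212": "-"})

# A plain decimal number with an optional leading ASCII minus.
NUMBER = re.compile(r"-?\d+(?:\.\d+)?")


def extract_numbers(text):
    """Return the decimal numbers in text after normalising the minus sign."""
    normalised = text.translate(MINUS_TABLE)
    return [float(match) for match in NUMBER.findall(normalised)]


print(extract_numbers("Yellow solid below \u221236\u00b0C"))
print(extract_numbers("Melting point \u221235.9 \u00b0C"))
```

This is deliberately naive: it would also read the exponent in "g mol−1" as the number -1, which is a reminder that the character fix alone does not make the page machine-readable.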

Which is not common.

Note, of course, that it's much worse in Word documents. We have examples in published manuscripts (i.e. on publisher web sites) where numbers are taken not from the normal ASCII range (48-57) but from any of a number of symbol fonts. These are almost impossible for machines to manage correctly.

Mystery picture

What's this picture?


and why might I be interested in it?

(It's not the whole picture, so I claim fair use - I don't know who the copyright holder is. And the clipped space hides a fairly vital clue).

[UPDATE: 2007-12-23:

It's a penguin, drawn by Ernest Shackleton. There's also one by Robert Scott. They were discovered in a basement in the Scott Polar Research Institute, which is just next to the Chemistry lab in Cambridge. There was a TV van there two days ago...]




Mystery Picture

Here is a photograph (untouched, not CGI). When I saw it I went wow! (I knew what it was). I'd be interested to know if anyone (a) KNOWS what it is of (b) can estimate the scale (c) has seen anything like it. If you do know, please post a comment saying so [but please DON'T give the answer]. I plan to release more information daily...

Besides the photo itself there is a serious question. How can you search the web for images like this?


and a close-up:


[UPDATE - more info: The photograph was taken yesterday by Dr. Judith Murray-Rust.]

[ANSWER: This is, indeed, crystalline water but the scale took us by surprise. The x-axis is ca. 20 cm. This artefact appeared in our bird bath and there appear to be 2 perfect, huge, hexagonal ice crystals (it is possible that they are both sixfold twins, I suppose). The faces are highly planar and specular (we have more pictures).

It is also remarkable that there are two artefacts separated by 10 cm (between centres) which are almost identical. What possible coupling could there be between them - that is the real mystery.]

Happy Holliday - as I might say to Gemma.

Data for common chemicals

As part of a project on chemical synthesis we need to collect some data on common chemicals. So what better place to start than with water? Before looking at the answers, see if you can find the

  • density of Water gas? (Wikipedia)
  • melting point of snow? (Pubchem)
  • freezing point of water ice? (Wikipedia)

There's a serious point to this. Much chemistry uses human language and words as a means of identifying concepts. (Thanks to Peter Corbett for (2))

NMR Challenge: what can a machine deduce from a thesis?

One of the ways of extracting chemical structures from the literature is to use the NMR to constrain the possibilities. So, to give you an amusement for the weekend, here are some problems. I have a thesis (which I'm not identifying, but I know the author and he's happy for the thesis to yield Open Data). I am not sure whether the compounds are in the public literature yet, but they are in the public domain if you know where to find the paper thesis.

Imagine that some future archaeologist had discovered the thesis and only a few scraps had survived. What could be deduced? I'm starting with smallish (hopefully fairly simple) structures and only feeding you some of the information. Depending on what you answer, I'll either release more or select more complex compounds. All compounds are distinct.

Compound 172

dH (400 MHz, CDCl3): 1.15 (3H, t, J 7.1, OCH2CH3), 1.24 (3H, d, J 5.2, 6-H x 3), 2.84 (1H, qd, J 5.2, 2.0, 5-H), 3.05 (1H, dd, J 7.0, 2.0, 4-H), 4.07 (2H, q, J 7.1, OCH2CH3), 5.99 (1H, dd, J 15.7, 0.6, 2-H), 6.54 (1H, dd, J 15.7, 7.0, 3-H);

Compound 167

dC (100 MHz, CDCl3): 164.5 (CO), 160.3 (C), 107.5 (C), 95.6 (CH), 40.9 (CH2Cl), 24.7 (CH3 x 2);

Compound 156

dC (100 MHz, CDCl3): 83.0 (3-C), 79.6 (2-C), 61.8 (5-C), 51.0 (1-C), 25.8 (SiC(CH3)3 x 1), 23.1 (4-C), 18.3 (SiC(CH3)3), -5.1 (Si(CH3) x 2);

Note that the molecular formula, molecular weight, etc. have all been destroyed by the ravages of time.

You can use any method you like, including searching in commercial databases.

What could a machine do with the information above?
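As a very first step, a machine can at least turn the text into numbers. Here is a minimal Python sketch that parses the 1H data for Compound 172 as printed above and counts signals and hydrogens. The regular expression and the names `PEAK` and `parse_h_peaks` are my own assumptions about the format, not anything taken from the thesis.

```python
import re

# 1H data for Compound 172, copied from the post above.
H_NMR_172 = (
    "1.15 (3H, t, J 7.1, OCH2CH3), 1.24 (3H, d, J 5.2, 6-H x 3), "
    "2.84 (1H, qd, J 5.2, 2.0, 5-H), 3.05 (1H, dd, J 7.0, 2.0, 4-H), "
    "4.07 (2H, q, J 7.1, OCH2CH3), 5.99 (1H, dd, J 15.7, 0.6, 2-H), "
    "6.54 (1H, dd, J 15.7, 7.0, 3-H)"
)

# Matches the start of each peak: "shift (nH, multiplicity,"
PEAK = re.compile(r"(-?\d+\.\d+) \((\d+)H, ([a-z]+),")


def parse_h_peaks(text):
    """Return a list of (shift in ppm, hydrogen count, multiplicity)."""
    return [(float(s), int(n), m) for s, n, m in PEAK.findall(text)]


peaks = parse_h_peaks(H_NMR_172)
total_h = sum(n for _, n, _ in peaks)
print(len(peaks), "signals,", total_h, "hydrogens")
for shift, n, mult in peaks:
    print(shift, n, mult)
```

The hydrogen count is a hard constraint on the molecular formula even though the formula itself has been lost, and the same approach applied to the dC lists gives a carbon count. Note that the pattern expects ASCII 45 for negative shifts, as in the -5.1 of Compound 156.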