In our SPECTRa-T project we are exploring how we can extract data and metadata from chemistry theses. Almost all these documents are now born-digital, i.e. written in a wordprocessor such as Word or TeX rather than being typed on carbon paper. So in principle we should be able to include the actual data in the thesis. And occasionally this happens; I’ll give an example later. But all too often the absurd ritual requires the author to retranscribe experimental data into pretty “readable form”. This is a lot of work and often requires special programs to generate the prettiness. Here I show the wasted labour and data corruption involved in reporting crystallography.
I have a typical thesis relating to synthetic organic chemistry. As part of proof-of-synthesis it is common to determine crystal structures and this thesis has 3. Each has a depiction of the structure (that’s fine) but then ca. 10 pages of numeric data (i.e. 30 pages in the complete thesis). Here’s a sample:
Table 4. Anisotropic displacement parameters (Å² × 10³) for [xxxx]. The anisotropic displacement factor exponent takes the form: -2π²[ h² a*² U11 + … + 2 h k a* b* U12 ]
_____________________________________________________________________
U11 U22 U33 U23 U13 U12
_____________________________________________________________________
O11 85(4) 61(3) 54(3) -1(2) 3(3) 1(3)
O21 93(4) 75(3) 44(3) -10(2) -19(3) 24(3)
O31 80(4) 74(3) 54(3) 0(2) -9(3) 18(3)
C11 76(5) 63(4) 52(4) -8(4) -11(4) 16(4)
C21 66(5) 65(4) 52(4) -13(3) -15(4) 11(4)
…
Isn’t that beautifully laid out? The anisotropic displacement factor describes the deviation of the atom from spherical shape (mainly due to vibration, disorder and bonding effects). However there is not a human on the planet who could read these numbers and make any sense of them. The first problem is that you need to know the cell dimensions or metric tensor. Then you have to solve some linear algebra and get the displacements and their direction cosines. Still with me?
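To make the "linear algebra" concrete, here is a minimal sketch of what a machine has to do with one row of that table. It assumes the U values follow the usual CIF convention (referred to the reciprocal axes a*, b*, c*) and that NumPy is available. The function name and the cell dimensions are my own inventions for illustration, since the thesis cell is not quoted here.

```python
import numpy as np


def principal_displacements(u, cell):
    """Return principal mean-square displacements and their directions.

    u    -- (U11, U22, U33, U23, U13, U12) in Å^2, in the column order
            used by the thesis table
    cell -- (a, b, c, alpha, beta, gamma) in Å and degrees
    """
    a, b, c = cell[:3]
    ca, cb, cg = np.cos(np.radians(cell[3:]))
    sa, sb, sg = np.sin(np.radians(cell[3:]))
    volume = a * b * c * np.sqrt(1 - ca**2 - cb**2 - cg**2 + 2 * ca * cb * cg)

    # Orthogonalisation matrix: fractional -> Cartesian coordinates (Å).
    ortho = np.array([
        [a, b * cg, c * cb],
        [0.0, b * sg, c * (ca - cb * cg) / sg],
        [0.0, 0.0, volume / (a * b * sg)],
    ])
    # Reciprocal cell lengths a*, b*, c* on the diagonal.
    recip = np.diag([b * c * sa / volume, a * c * sb / volume, a * b * sg / volume])

    u11, u22, u33, u23, u13, u12 = u
    u_matrix = np.array([[u11, u12, u13], [u12, u22, u23], [u13, u23, u33]])

    # Transform the tensor to Cartesian axes, then diagonalise it.
    m = ortho @ recip
    u_cart = m @ u_matrix @ m.T
    values, vectors = np.linalg.eigh(u_cart)
    # Rows of vectors.T are unit vectors, i.e. direction cosines.
    return values, vectors.T


# Row O11 from the table, divided by 10^3 to get back to Å^2.
# The cell below is HYPOTHETICAL: the real one is not given in this post.
values, directions = principal_displacements(
    (0.085, 0.061, 0.054, -0.001, 0.003, 0.001),
    (10.0, 12.0, 15.0, 90.0, 105.0, 90.0),
)
for value, direction in zip(values, directions):
    print(f"{value:.4f} Å^2 along {np.round(direction, 3)}")
```

Nobody does this by eye, which is rather the point.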
The reality, of course, is that these numbers are ONLY useful to machines. And the ritual of creating a header, adding lines, multiplying by 10^3 (WHY??) and listing as a whitespace-formatted table guarantees that a machine will NOT be able to read them. And almost certainly there are cut-and-paste errors.
Here’s what the original data might look like. (I don’t have the actual raw data, the CIF, for the thesis. It wasn’t deposited and has been thrown away, though we may manage to salvage it internally.)
loop_
_atom_site_aniso_label
_atom_site_aniso_U_11
_atom_site_aniso_U_22
_atom_site_aniso_U_33
_atom_site_aniso_U_12
_atom_site_aniso_U_13
_atom_site_aniso_U_23
O1 0.104(2) 0.0781(16) 0.0495(11) -0.0581(16) -0.0082(12) -0.0065(11)
O2 0.0415(10) 0.0446(9) 0.0275(7) -0.0064(8) 0.0001(7) -0.0110(7)
O3 0.0261(8) 0.0535(11) 0.0527(10) -0.0042(9) -0.0007(8) -0.0091(9)
O4 0.0473(11) 0.0401(10) 0.0443(10) 0.0112(9) 0.0062(9) -0.0119(8)
[…]
That is COMPLETELY machine-readable, even when embedded in a thesis, and even with line wrap (as long as it’s whitespace). It’s not ugly, is it? It takes a minute or so to copy it into the appendix of the thesis – you don’t even have to think about the content as it’s come straight off the instrument.
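To show how little it takes, here is a rough sketch of a reader for a loop like the one above, using nothing but whitespace splitting. It is a simplification, not a real CIF parser: it assumes a single loop_ and no quoted strings or text blocks, and the function names are mine. The sample deliberately wraps the first row across two lines.

```python
import re

# A number with an optional standard uncertainty, e.g. -0.0581(16).
NUMBER_WITH_SU = re.compile(r"^(-?\d*\.?\d+)(?:\((\d+)\))?$")


def parse_value(token):
    """Turn '0.104(2)' into (0.104, 0.002); leave anything else as text."""
    match = NUMBER_WITH_SU.match(token)
    if match is None:
        return token
    number, su_digits = match.groups()
    value = float(number)
    if su_digits is None:
        return (value, None)
    # The bracketed digits apply to the last quoted decimal places.
    decimals = len(number.split(".")[1]) if "." in number else 0
    return (value, int(su_digits) * 10 ** -decimals)


def read_loop(text):
    """Read the first loop_ in text into a list of dicts keyed by data name."""
    tokens = text.split()  # line breaks are just whitespace
    position = tokens.index("loop_") + 1
    names = []
    while position < len(tokens) and tokens[position].startswith("_"):
        names.append(tokens[position])
        position += 1
    if not names:
        raise ValueError("loop_ has no data names")
    values = tokens[position:]
    rows = []
    for start in range(0, len(values) - len(names) + 1, len(names)):
        row = {}
        for name, token in zip(names, values[start:start + len(names)]):
            # Labels stay as text; everything else is parsed as a number.
            row[name] = token if name.endswith("_label") else parse_value(token)
        rows.append(row)
    return rows


sample = """
loop_
_atom_site_aniso_label
_atom_site_aniso_U_11
_atom_site_aniso_U_22
_atom_site_aniso_U_33
_atom_site_aniso_U_12
_atom_site_aniso_U_13
_atom_site_aniso_U_23
O1 0.104(2) 0.0781(16) 0.0495(11)
   -0.0581(16) -0.0082(12) -0.0065(11)
O2 0.0415(10) 0.0446(9) 0.0275(7) -0.0064(8) 0.0001(7) -0.0110(7)
"""

for row in read_loop(sample):
    print(row["_atom_site_aniso_label"], row["_atom_site_aniso_U_12"])
```

Because every value is keyed by its data name, it doesn't matter that the CIF lists U12, U13, U23 while the thesis table lists U23, U13, U12. That is exactly the sort of thing that goes wrong when a human retypes it.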
GRADUATE STUDENTS! DON’T PUT UP WITH THIS NONSENSE. COPY CIFS DIRECTLY INTO YOUR THESIS. IT’S QUICKER AND BETTER SCIENCE. AND DO THE SAME WHEN YOU PUBLISH. THAT WAY WE SHALL SAVE YOUR WORK RATHER THAN DESTROY IT.