Joe Townsend: textual and crystallographic eScience

Joe Townsend has worked with our group for ca. 6 years. As an undergraduate he worked as a summer student and was one of the first co-authors of OSCAR. He’s submitted his thesis and is being examined on Wednesday. His work has greatly informed what Nick Day has been doing. Here’s a snippet of what he has in his thesis (we are currently writing the paper).
He extracted small molecular entities from the data – what is now CrystalEye – WWMM – and optimised the geometry in much the same way as Nick has been doing. He then compared the calculated and observed geometries of over 1000 entries and got:
[Figure: joegamess1.PNG]
This shows a wide scatter (y is calc, x is observed). Is the deviation due to problems with the data or problems with the model or both? Have a think and then read on…
By carefully analysing outliers, Joe came up with about 10 ways that the data might have problems. Because the data were Open, and because the metadata were Open and rich, Joe was able to create a protocol that filtered out entries with potential problems in the data. The protocol was NOT based on whether an entry was an outlier, but on some aspect of its metadata. (Here is PART of the protocol.)
[Figure: protocol1.PNG]
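The idea of filtering on metadata rather than on outlier status can be sketched as follows. This is only a minimal illustration, not Joe's actual protocol: the metadata fields (`r_factor`, `has_disorder`, `temperature_k`) and the thresholds are hypothetical stand-ins chosen for the example.

```python
from dataclasses import dataclass


@dataclass
class Entry:
    entry_id: str
    r_factor: float       # hypothetical metadata field
    has_disorder: bool    # hypothetical metadata field
    temperature_k: float  # hypothetical metadata field
    observed: float       # observed geometric value (e.g. a bond length)
    calculated: float     # calculated value for the same quantity


def passes_protocol(entry, max_r_factor=0.05, max_temperature_k=200.0):
    """Decide from metadata alone; thresholds are assumptions for illustration.

    The observed and calculated values are deliberately never inspected,
    so an entry is not rejected merely for being an outlier.
    """
    return (
        entry.r_factor <= max_r_factor
        and not entry.has_disorder
        and entry.temperature_k <= max_temperature_k
    )


def clean(entries):
    """Return only the entries whose metadata pass every check."""
    return [e for e in entries if passes_protocol(e)]
```

Because `passes_protocol` never reads `observed` or `calculated`, any improvement in agreement after cleaning cannot be an artefact of having thrown away the points that disagreed.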
As a result it was possible to devise a machine procedure which AUTOMATICALLY created a cleaned data set. The resulting comparison then looked like this:
[Figure: joegamess2.PNG]
You can see immediately that, although there are many fewer entries, the agreement is excellent. As a result, and only as a result, it was possible to find outliers where there were potential concerns about the quality of the calculated data. The effects are small, but probably real. We’ll see if the examiners agree. (The obvious outlier above (1.31, 1.25) is due to differences in the models – gas-phase versus crystal – i.e. “crystal packing forces”.)
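Once the data set is clean, looking for remaining disagreements becomes a simple comparison. The sketch below is an assumption-laden illustration: the function names and the tolerance are hypothetical, the pairs are read as (observed, calculated), and all values except the (1.31, 1.25) point mentioned above are made up for the example.

```python
from statistics import mean


def rms_deviation(pairs):
    """Root-mean-square difference between calculated and observed values."""
    return mean((calc - obs) ** 2 for obs, calc in pairs) ** 0.5


def flag_outliers(pairs, tolerance):
    """Return the (observed, calculated) pairs that differ by more than tolerance."""
    return [(obs, calc) for obs, calc in pairs if abs(calc - obs) > tolerance]


# Illustrative values only; the last pair is the outlier quoted in the post.
pairs = [(1.40, 1.41), (1.52, 1.52), (1.31, 1.25)]
suspects = flag_outliers(pairs, tolerance=0.03)  # hypothetical tolerance
```

With a tight overall scatter, a point flagged this way is worth a closer look at the model, since the metadata checks have already removed the entries most likely to have problems in the data.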
This shows that, in principle, it is possible to create robots which use both theory and experiment to improve each other. This relies on having good open metadata and open data.
It is impossible with closed data. The next post will contrast Nick’s results for NMR.

This entry was posted in data, open notebook science.