In our SPECTRa-T project we are exploring how we can extract data and metadata from chemistry theses. Almost all these documents are now born-digital, i.e. written in a word processor such as Word or TeX rather than being typed on carbon paper. So in principle we should be able to include the actual data in the thesis. And occasionally this happens – I’ll give an example later. But all too often the absurd ritual requires the author to retranscribe experimental data into pretty “readable form”. This is a lot of work and often requires special programs to generate the prettiness. Here I show the wasted labour and data corruption required when reporting crystallography.
As I have already blogged (WWMM calculation of spectra) we are hoping to provide Jean-Claude Bradley and others an Open service to calculate NMR spectra from structure. This needs a lot of software components and a lot of glueware. With the release of FROG – not just Free, but Open – yet another problem is solved, but we aren’t quite there yet.
The calculation of spectra from NMRShiftDB is automatic because, AND ONLY BECAUSE, Christoph and Stefan have used CMLSpect to represent the data. CMLSpect allows:
- connection table
- atom labels
- 3D coordinates
- spectral peaks
- assignment of peaks to atoms
All these (except the raw spectra) are required for the calculation. Actually the connection table can be dispensed with if the hydrogen atoms are given explicitly – as they should ALWAYS be. (Implicit hydrogens have probably cost the human race thousands of wasted years through errors. There is now NO excuse for not including hydrogen atoms explicitly in files. Size of files? Rubbish. All the hydrogens in a year’s global chemistry are worth 1 day of astronomical simulation).
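To make the linking concrete, here is a minimal sketch of how peaks can be resolved to atoms once both live in one document. The fragment is a much-simplified, CML-style illustration (no namespace, not validated against the CML schema), the coordinates and shifts are placeholders rather than measured values, and the function name `assignments` is my own invention for this example.

```python
import xml.etree.ElementTree as ET

# Simplified CML-style fragment for methanol with explicit hydrogens.
# Coordinates and shift values are placeholders for illustration only.
CML = """
<cml>
  <molecule id="m1">
    <atomArray>
      <atom id="a1" elementType="C" x3="0.00" y3="0.00" z3="0.00"/>
      <atom id="a2" elementType="O" x3="1.40" y3="0.00" z3="0.00"/>
      <atom id="a3" elementType="H" x3="-0.40" y3="1.00" z3="0.00"/>
      <atom id="a4" elementType="H" x3="-0.40" y3="-0.50" z3="0.90"/>
      <atom id="a5" elementType="H" x3="-0.40" y3="-0.50" z3="-0.90"/>
      <atom id="a6" elementType="H" x3="1.70" y3="0.90" z3="0.00"/>
    </atomArray>
  </molecule>
  <spectrum id="s1" type="NMR">
    <peakList>
      <peak id="p1" xValue="49.9" xUnits="units:ppm" atomRefs="a1"/>
      <peak id="p2" xValue="3.4" xUnits="units:ppm" atomRefs="a3 a4 a5"/>
    </peakList>
  </spectrum>
</cml>
"""


def assignments(cml_text):
    """Return (shift, atom ids, element types) for every peak in the document."""
    root = ET.fromstring(cml_text.strip())
    # Atom labels are the key: map each id to its element type.
    elements = {a.get("id"): a.get("elementType") for a in root.iter("atom")}
    result = []
    for peak in root.iter("peak"):
        refs = peak.get("atomRefs", "").split()
        missing = [r for r in refs if r not in elements]
        if missing:
            raise ValueError(f"peak {peak.get('id')} refers to unknown atoms {missing}")
        result.append((float(peak.get("xValue")), refs, [elements[r] for r in refs]))
    return result


peak_assignments = assignments(CML)
```

Because every peak names its atoms by label, the lookup fails loudly if a label is missing instead of silently attaching a shift to the wrong atom.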
So with NMRShiftDB we have the simple process:
- read NMRShiftDB file
- add hydrogens with coordinates (JUMBO does this)
- transform to Gaussian input (XSLT makes this automatic)
- run job (Condor makes this automatic)
- analyze results (i.e. compare calculated and observed – Nick Day’s software is making this automatic)
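As a rough illustration of the "transform to Gaussian input" step, here is a sketch in Python rather than the XSLT we actually use. The function name, the level of theory in the route line and the water geometry are all assumptions chosen for the example; they are not the settings of our production jobs.

```python
def gaussian_nmr_input(title, atoms, route="# B3LYP/6-31G(d) NMR",
                       charge=0, multiplicity=1):
    """Build a Gaussian input deck requesting an NMR shielding calculation.

    atoms is a list of (element, x, y, z) tuples with ALL hydrogens explicit.
    The default route line is an assumed level of theory, not a recommendation.
    """
    lines = [route, "", title, "", f"{charge} {multiplicity}"]
    for element, x, y, z in atoms:
        lines.append(f"{element:<2} {x:12.6f} {y:12.6f} {z:12.6f}")
    lines.append("")  # the geometry section is terminated by a blank line
    return "\n".join(lines) + "\n"


# Illustrative geometry only (roughly water); hydrogens are explicit.
water = [
    ("O", 0.000, 0.000, 0.117),
    ("H", 0.000, 0.757, -0.469),
    ("H", 0.000, -0.757, -0.469),
]
deck = gaussian_nmr_input("water NMR test", water)
```

The point is that nothing here needs human intervention: once the atoms and coordinates are in the file, the job input follows mechanically.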
With the normal chemical environment this is messier:
- read mol file
- submit to FROG to generate 3D coordinates. Hope it hasn’t changed the order of atoms
- convert mol file to CML
- read list of peaks in some legacy format (?Excel)
- try to match peaks to atoms for assignment (probably have to rely on atom ordering)
- create peakList in CMLSpect. How?
- combine peakList with molecule in CML
- transform to Gaussian input (as above) and then it’s plain sailing
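One possible answer to the "How?" above, as a hedged sketch: export the peak list from the spreadsheet as CSV, with one column naming the atom labels, and build the peakList from that. The column names `shift_ppm` and `atoms`, the function name, and the simplified namespace-free elements are assumptions for this example; the output has not been validated against the CML schema.

```python
import csv
import io
import xml.etree.ElementTree as ET


def peaklist_from_csv(csv_text, known_atom_ids):
    """Turn rows of (shift, atom labels) into a simplified peakList element."""
    peak_list = ET.Element("peakList")
    reader = csv.DictReader(io.StringIO(csv_text))
    for n, row in enumerate(reader, start=1):
        refs = row["atoms"].split()
        # Refuse to emit a peak that points at an atom the molecule lacks.
        unknown = [r for r in refs if r not in known_atom_ids]
        if unknown:
            raise ValueError(f"peak {n} refers to unknown atoms: {unknown}")
        ET.SubElement(peak_list, "peak", {
            "id": f"p{n}",
            "xValue": row["shift_ppm"],
            "xUnits": "units:ppm",
            "atomRefs": " ".join(refs),
        })
    return peak_list


# Hypothetical spreadsheet export; the shift values are placeholders.
legacy_csv = "shift_ppm,atoms\n49.9,a1\n3.4,a3 a4 a5\n"
atom_ids = {"a1", "a2", "a3", "a4", "a5", "a6"}
peak_list = peaklist_from_csv(legacy_csv, atom_ids)
```

This only works if the person recording the peaks writes down atom labels rather than positions in a file, which is exactly what the legacy formats make difficult.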
The problems arise because:
- hydrogens are a problem
- mol files (and all other file formats except CML) do not have atom labels
- there is no Open tool for assigning peaks to atoms
- relying on atom ordering is a recipe for disaster and extremely difficult to debug
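A small sketch of why atom ordering is so treacherous. The classes and function names are invented for this illustration and the shifts are placeholder values. Two carbons swap places (as a 3D-coordinate generator might conceivably do), the element sequence still looks identical, and an index-based assignment quietly gives the wrong answer.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Atom:
    label: str    # explicit atom label, as CML provides
    element: str


def same_element_order(before, after):
    """A necessary but NOT sufficient check that atom order was preserved."""
    return [a.element for a in before] == [a.element for a in after]


def shifts_by_index(atoms, peaks_by_index):
    """Fragile: attach each shift to whichever atom now sits at that index."""
    return {atoms[i].label: shift for i, shift in peaks_by_index.items()}


def shifts_by_label(atoms, peaks_by_label):
    """Robust: shifts follow the atom label wherever the atom ends up."""
    present = {a.label for a in atoms}
    missing = set(peaks_by_label) - present
    if missing:
        raise KeyError(f"no such atoms: {sorted(missing)}")
    return dict(peaks_by_label)


# Heavy atoms of an ethanol-like skeleton; shift values are placeholders.
original = [Atom("a1", "C"), Atom("a2", "C"), Atom("a3", "O")]
reordered = [original[1], original[0], original[2]]  # the two carbons swapped

by_index = {0: 18.0, 1: 58.0}
by_label = {"a1": 18.0, "a2": 58.0}

looks_unchanged = same_element_order(original, reordered)
wrong = shifts_by_index(reordered, by_index)
right = shifts_by_label(reordered, by_label)
```

Here `looks_unchanged` is true even though the atoms have moved, so the cheap sanity check passes while `wrong` has the two carbon shifts exchanged. No error is raised anywhere, which is what makes this so hard to debug.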
So what is clear is that we need a tool to couple JSpecView to a molecule in CML. The output, at least, has to be in CML because there is no other way of linking atoms to peaks.
This should be seen as one of the great (but achievable) challenges of the Blue Obelisk movement. When we get it, it will transform the way that graduate students record their peak assignment and publish their papers and THESES!
I am very grateful to Caltech, especially Eric van der Velde, for organising and recording my presentation on eTheses at Caltech last month. See The power of the Scientific eThesis, a combined audio, video and screenshow. Caltech have done a very good job of stitching it together. Many of the “slides” were in scrolling HTML so the slide-count is artificially high – each scroll generates a new “slide”. Total time about 67 minutes.
The themes include:
- homage to Caltech: Jack Dunitz, Linus Pauling, Verner Schomaker and Ken Trueblood.
- data-driven science in crystallography – examples from 1973 to present day.
- semantic web and chemistry, including DBPedia
- Open Access
and questions at the end.
Since my presentations are drawn from many thousands of slides, the recording gives an accurate impression of a typical talk, where I do not know in advance exactly which components I shall touch on. In a few places my machine ran slowly so there are minor hiatuses.
15:13 13/08/2007, Open Access News
Dean Giustini, UBC’s John Willinsky – Stanford Takes Him (For Now), Open Medicine blog, August 12, 2007. Excerpt:
UBC’s Dr. John Willinsky is no stranger to open access advocates. His book The Access Principle is ‘required reading’ for all those who believe in the connection between access to information and the economic and social well-being of knowledge-based societies. Recently, John accepted an appointment at Stanford University….
As for what’s next for PKP, we will be releasing the next version of OJS, in a few months’ time, in association with our parallel release of Lemon8-XML, developed by MJ Suhonos, which will automate XML conversion from Word and ODT documents.
Lemon8-XML
Lemon8-XML is a web-based service designed to make it easier to convert academic papers from typical word-processor editing formats such as MS-Word .DOC and OpenOffice .ODT, to publishing layout formats such as XML. It provides the ability to edit document metadata such as the list of authors, as well as robust citation editing, checking and correction.
Lemon8-XML is a project developed by the Public Knowledge Project, as a demonstration of technology that can help significantly decrease the cost and effort of scholarly publishing. Although it is a standalone service, Lemon8 works well with journals published using Open Journal Systems.
Much of the work involved in Lemon8 has been developed from years of journal publishing experience, and continues to take advantage of the newest web-based technology as it becomes available.
We will soon be creating a mailing list for interested developers and beta-testers, along with some documentation, an FAQ, and a PKP discussion forum for Lemon8-XML.
If you’d like to be kept up-to-date on Lemon8-XML developments, please let us know.
PMR: This is very exciting for our SPECTRa-T : Submission, Preservation and Exposure of Chemistry … project where we are capturing metadata from academic theses. Although the preferred method of presentation is PDF, these theses are originally born-digital as Word or LaTeX. But these versions are often hidden away and not reposited. The PDF looks so wonderful, doesn’t it? Surely no-one wants that ugly Word doc? But for use it’s a hundred times better. And if Lemon8-XML can capture authors and other metadata that’s a really important advance.
Because the more structured the document is the better we can analyze it. For example it’s not a good idea to look for chemical names in author lists. (Murray-“Rust” could be indexed as Fe3O4 and PMR as proton magnetic resonance). But normal Word documents just contain different paragraphs, usually no sections. A line in bold 12-point type is not obviously a chapter heading, an author, or a citation.
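A toy sketch of the point about structure. The two-entry lexicon, the field names and both functions are hypothetical and stand in for a real chemical-name recogniser. Run over flat text, the matcher picks up the author's surname; run over a structured document, it can simply skip the author field.

```python
import re

# Hypothetical, deliberately tiny lexicon of "chemical" terms.
CHEMICAL_TERMS = {
    "rust": "iron oxide",
    "pmr": "proton magnetic resonance",
}


def find_terms(text):
    """Return the lexicon terms found in text (naive whole-word matching)."""
    words = re.findall(r"[A-Za-z]+", text)
    return sorted({w.lower() for w in words if w.lower() in CHEMICAL_TERMS})


def index_document(doc, skip=("authors",)):
    """Index only the fields where chemical names are meaningful."""
    return {field: find_terms(text)
            for field, text in doc.items() if field not in skip}


thesis = {
    "authors": "Peter Murray-Rust",
    "body": "The PMR spectrum was recorded in CDCl3.",
}

flat_hits = find_terms(" ".join(thesis.values()))   # structure thrown away
structured_hits = index_document(thesis)            # structure respected
```

With the structure thrown away, `flat_hits` contains both "pmr" and "rust"; with the author field identified, only the genuine "pmr" in the body survives. A flat Word document gives the indexer no way to make that distinction.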
I couldn’t find a download button. (I am assuming that it is Open Source, given that it comes from the home of Open Journals. No logical connection, of course, but…)
NOTE ADDED LATER:
There is a forum http://pkp.sfu.ca/support/forum/ for lemon8-xml and some slides from a meeting: Lemon8-PKP-Conference.pdf
The slides have a bit more information suggesting this is an early adopter tool at present. I have written asking for more info and will post when it appears. Since they have other Open Source software on their site it should be a good bet that lemon8-xml is Open.