Monthly Archives: October 2007

Dissemination of CrystalEye

There has been considerable interest in having access to the bulk knowledgebase of CrystalEye - WWMM, which contains primary data for over 100,000 crystal structures and probably over 1 million fragments derived from them. We are obviously excited to see the interest and will be talking this morning, and possibly later today at the SPECTRa-T meeting, about the problems of disseminating repositories and knowledgebases, where we have some experts in the field.

Firstly I reiterate that CrystalEye is OpenData according to the Open Knowledge Foundation definition. It does not actually carry a licence but uses this definition as a meta-licence. So anyone is legally allowed to take copies of the contents and re-use them, including for commercial purposes. We shall not waver in that.

There have been recent suggestions that to save bandwidth people should make copies of the data and redistribute them on DVD. We would ask you to refrain from doing this for the immediate future for several reasons:

  • The architecture of CrystalEye and its dissemination through AtomPP is new. Jim Downing hasn't even had the chance to explain what his vision for the dissemination is. Please give Jim a chance to explain.
  • It is not trivial to take a physical snapshot of dynamic hypermedia. CrystalEye is updated daily, has over 1 million hyperlinks, and contains several distinct meta-views of the knowledge. This cannot be captured in a single process. Therefore any physical copy will involve significant loss of metadata. This loss could be so significant that the copy was effectively broken.
  • It seems clear from the 2-3 days' discussion that different communities want different views of CrystalEye. Some want links to the entries as arranged by the literature, others want it organised by fragments. These are almost completely orthogonal.
  • Copying the data and redisseminating it without reference to the originators is, in effect, forking (see below).
  • We are critically concerned about versioning and annotations. CrystalEye has effectively nightly versions and it is important that when people use it for data-driven science it is clear that they are referring to PRECISELY the same collection.
  • We have thought carefully about sustainability of CrystalEye and have had discussions with appropriate bodies. These would maintain the Openness but would look to sustainable processes. I cannot give more details in public.

Please note that many Open resources ask or require that their database is not distributed in toto without their involvement. I think this is true of PubChem - anyone can download individual entries and re-use them, but it is required (and common courtesy) to ask before downloading the whole lot. We have done this, for example, for the names in PubChem which are now part of OSCAR.

Then there are the more intangible aspects. It is appropriate that this is seen as a creation of the authors, their collaborators, the Unilever Centre for Molecular Science and Informatics, and the University of Cambridge.  It would be appropriate that these are the first entities that the world should look to if there is to be a physical distribution of some of the resource. At present we see a physical resource as potentially creating as many problems as it solves - whether done by us or not. CrystalEye is much more than the data contained in it - a physical snapshot gives as much indication of this as a series of photographs does of television.

So before distributing the data without our involvement, please let's discuss the aspects - and at present this blog is the appropriate place. I reiterate that no comments are moderated out.

From WP: Fork: (this relates to software, rather than data but the principles overlap)

In free software, forks often result from a schism over different goals or personality clashes. In a fork, both parties assume nearly identical code bases but typically only the larger group, or that containing the original architect, will retain the full original name and its associated user community. Thus there is a reputation penalty associated with forking. The relationship between the different teams can be cordial (e.g., Ubuntu and Debian), very bitter (X.Org Server and XFree86, or cdrtools and cdrkit) or none to speak of (most branching Linux distributions).

Forks are considered an expression of the freedom made available by free software, but a weakness since they duplicate development efforts and can confuse users over which forked package to use. Developers have the option to collaborate and pool resources with free software, but it is not ensured by free software licenses, only by a commitment to cooperation. That said, many developers will make the effort to put changes into all relevant forks, e.g., amongst the BSDs.[citation needed]

The Cathedral and the Bazaar stated in 1997 [1] that "The most important characteristic of a fork is that it spawns competing projects that cannot later exchange code, splitting the potential developer community." However, this is not common present usage.

In some cases, a fork can merge back into the original project or replace it. EGCS (the Experimental/Enhanced GNU Compiler System) was a fork from GCC which proved more vital than the original project and was eventually "blessed" as the official GCC project. Some have attempted to invoke this effect deliberately, e.g., Mozilla Firefox was an unofficial project within Mozilla that soon replaced the Mozilla Suite as the focus of development.

On the matter of forking, the Jargon File says:

"Forking is considered a Bad Thing—not merely because it implies a lot of wasted effort in the future, but because forks tend to be accompanied by a great deal of strife and acrimony between the successor groups over issues of legitimacy, succession, and design direction. There is serious social pressure against forking. As a result, major forks (such as the Gnu-Emacs/XEmacs split, the fissioning of the 386BSD group into three daughter projects, and the short-lived GCC/EGCS split) are rare enough that they are remembered individually in hacker folklore."

It is easy to declare a fork, but it can require considerable effort to continue independent development and support. As such, forks without adequate resources can soon become inactive, e.g., GoneME, a fork of GNOME by a former developer, which was soon discontinued despite attracting some publicity. Some well-known forks have enjoyed great success, however, such as the X.Org X11 server, a fork from XFree86 which gained widespread support from developers and users and notably sped up X development.

We agree the structure is wrong!

Nick Day's procedure has generated the agreement - and disagreement - between observed and calculated NMR shifts. In my post Open Notebook NMR - the good and the ugly I highlighted one of the worst disagreements. I hesitated to say "the structure is wrong" because I am not an expert in either NMR or this group of natural products, but I would have bet on it.
Now there is general agreement that the structure is wrong:

Wolfgang Robien Says (in comment on post above)

[... details snipped...]

(5) Application of ANY PREDICTION SOFTWARE PACKAGE should put a lot of RED and/or YELLOW markers on this entry ….. CSEARCH does this AUTOMATICALLY during display - a NN-spectrum is calculated on-the-fly whenever you display an entry ! (ACD does it too according to the examples I have seen)

Conclusion: Guys, its all there, we only need a few drops of glue putting the pieces together

PMR: I think we have good agreement here. The glue that is needed is between the NMR community and the publication process, ultimately to generate semantic publishing. My involvement is to try to "sell" the idea of semantic publishing and data validation to the chemical authors and to the publishers. If an author is spending 3000 USD to publish a paper, then it should not be impossible to find part of that to validate the data.

Hopefully this acts as a signal to reduce the number of wrong structures in future.

ATOMic crystals

How do we disseminate our CrystalEye data? If we use one large file, even zipped, it will run into gigabytes. Also it can't easily be updated. Jim Downing has started to set up AtomPP feeds for disseminating it. Geoff Hutchison asks:

  1. Geoff Hutchison Says:
    October 29th, 2007 at 10:03 pm Peter, as I mentioned to you earlier, I think many of us are looking for the open data in Crystal Eye, particularly fragments. Surely there’s an easier and more efficient way to get the data than AtomPP feeds. Will you have periodic dumps — say I get this quarter’s crystal structures and then can use the AtomPP feeds to just pull new entries?

I think this depends where someone starts. If they are a regular user of CrystalEye then AtomPP would seem to be the best approach - it means you don't have to remember when to download and what size the chunks are. Is it a simple method to get the historical material when starting out? Jim, perhaps you can help here.
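As a rough sketch of what feed-based consumption could look like for a client, here is a minimal example. The inline feed contents and the last-sync logic are my own illustrative assumptions, not Jim's actual AtomPP design:

```python
# Sketch: a client keeps a "last sync" timestamp and pulls only entries
# updated since then from an Atom feed. The feed below is inline for
# illustration; a real client would fetch it over HTTP.
import xml.etree.ElementTree as ET
from datetime import datetime, timezone

ATOM = "{http://www.w3.org/2005/Atom}"

def new_entries(feed_xml, last_sync):
    """Return (id, updated) pairs for entries updated after last_sync."""
    root = ET.fromstring(feed_xml)
    found = []
    for entry in root.findall(ATOM + "entry"):
        entry_id = entry.findtext(ATOM + "id")
        updated = datetime.fromisoformat(
            entry.findtext(ATOM + "updated").replace("Z", "+00:00"))
        if updated > last_sync:
            found.append((entry_id, updated))
    return found

feed = """<feed xmlns="http://www.w3.org/2005/Atom">
  <title>CrystalEye updates</title>
  <entry><id>urn:ce:1</id><updated>2007-10-28T09:00:00Z</updated></entry>
  <entry><id>urn:ce:2</id><updated>2007-10-30T09:00:00Z</updated></entry>
</feed>"""

since = datetime(2007, 10, 29, tzinfo=timezone.utc)
for entry_id, updated in new_entries(feed, since):
    print(entry_id)  # only urn:ce:2 is newer than the last sync
```

A periodic dump plus a feed of deltas, as Geoff suggests, would simply seed the last-sync timestamp from the dump date.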

Open NMR publication: possible ways forward

Wolfgang Robien has posted some valuable comments and I think this gives us a positive way forward. I won't comment line by line but refer you to the links. For background, Wolfgang suggests that I have a religious take on this and am trying to impose it on the NMR community, which already has adequate and self-sufficient processes. [In all this we differentiate macromolecular/bioscience from smallMolecule/chemistry, which have completely different ethics and practice. Here we refer only to small molecules.] I am not religious about NMR.
I'll start by saying that I think Wolfgang and I may have very significant common ground and this is an attempt to address it. I also think that our differences are confined to different fields of endeavour. In summary I believe that:

  • NMR data are published in non-semantic ways (PDF, etc.) and that this destroys much useful machine-interpretable information. By contrast crystallography is semantic and the quality at time of publication is very much higher.
  • A significant number of papers contain NMR data which do not correspond exactly to the structures - often referred to as "wrong". By contrast this hardly ever happens in crystallography.
  • Crystallographic data is subject to intense validation before publication and the algorithms and code are freely available. This has raised the quality of crystallography over the last 15 years and the data in CrystalEye show this clearly. With the advent of computational methods in NMR (whether HOSE or GIAO) it should be possible to carry out similar validation before publication.
  • The crystallographic data as published constitute a global knowledgebase which can be re-used in many ways in a semantically valid framework. This is currently not possible for NMR but it could be if the community wished it.

Wolfgang mentions religiosity - I try not to be but the publishing community is rapidly fracturing over the Open-Closed line and I personally see this as having little middle ground. Others disagree. I am insistent that the words "Open Access" be used in a manner which is consistent with the Open Access definitions, just as for Open Source. There is a tendency for people to describe resources as Open when they do not conform to the definition. I hold the same view for Open Data.

Where I think we have common ground is that we both agree that:

  • there are too many publications where the NMR-structure is simply wrong
  • it would be possible to validate many of these using software
  • that it would be useful to publish the spectra in semantic form rather than text and PDFs. (Wolfgang may disagree here and see value in having the data retyped by humans, and if so I'd like to see the case. In practice we have shown that the data can go straight from the instrument to the repository without semantic loss, but that the business processes are not yet clear).

In principle I would be very happy to collaborate on developing an NMR protocol which would validate data in publications. I think we would need a variety of methods and data resources. We can't do this in Nick Day's project and I can't speak for Henry, but it sounds promising. Methods like this exist for crystallography and thermochemistry (ThermoML). Spectroscopy and computational chemistry are the most tractable and valuable next steps.

One reason we used NMRShiftDB was that we knew that the data were heterogeneous and possibly contained errors. This simulated what we might find in publications. We can use our OSCAR and other software to extract spectra and structures from the literature, though the assignments are harder without explicit numbering schemes in connection tables. Clearly the requirements on analysing questionable data and creating a validation procedure are more difficult in this case but we are prepared to defend it.

Ultimately my vision is that all NMR in journals would be validated and in semantic form (e.g. CMLSpect) before being published. Other disciplines have already achieved it, so it's a matter of communal will rather than absence of technology. I think we have a mutual way forward, though not in the timescale of Nick Day's thesis.

Wolfgang Robien Says:

[links to comments broken in WordPress]

OK, you are not a NMR-spectroscopist, but you want to liberate NMR data from the pages of the journals:
PMR: This is exactly right. It is virtually the sole motivation for this work. Anything else (NMRShiftDB/WR, GIAO/HOSE-NN) is secondary. It is also coupled to the capture of data from eTheses (the SPECTRa and SPECTRa-T projects) where we have shown that most data rapidly gets lost. It is about validation, semantic quality, dissemination, preservation, and closely tied to the capture of academic output in institutional and other repositories.

WR: There are so many people around working in this field, who are doing excellent science

PMR: I am unaware of major scientific laboratories who are making major efforts in changing the way that NMR spectra are published in journals or theses or captured in repositories. I do claim to be aware of semantic scientific publication and repositories and am regularly invited by both the Open and Closed publishers to talk about this. If there is major work ongoing in pre-publication validation and semantic output of NMR I haven't heard of it.

WWMM: The World Wide Molecular Matrix

Since I have been asked to talk about the WWMM here's a bit of background... When the UK e-Science project started (2001) we put in a proposal for a new vision of shared chemistry - the World Wide Molecular Matrix. The term "Matrix" comes from the futuristic computer network and virtual world in William Gibson's novels where humans and machines are coupled in cyberspace. Our proposal was for distributed chemistry based on a Napster-like model where chemical objects could be shared in server-browsers just as for music.

It seemed easy. If it worked for music it should be possible for chemistry. Admittedly the content was more variable and the metadata more complex. But nothing that shouldn't be possible with XML. And when we built the better mousetrap, the chemists would come. Others liked the idea, and there is an article in Wikipedia (Worldwide molecular matrix).

But it's taken at least 5 years. The idea seems simple, but there are lots of details. The eScience program helped - we had two postdocs through the Cambridge eScience Centre and the DTI (Molecular Informatics "Molecular Standards for the Grid"). As well as CML we listed 10 technologies (Java, Jumbo, Apache HTTP Server, Apache Tomcat, Xindice, CDK - Chemistry Development Kit, JMol, JChemPaint, Condor and Condor-G, PHP). We're not using much PHP, no Xindice and prefer Jetty to Tomcat, but the rest remain core components. We've added a lot more - RDF, RSS, Atom, InChI, blogs, wikis, SVN, Eclipse, JUnit, and a good deal more. It's always more and more technology... OpenBabel, JSpecView, Bioclipse, OSCAR and OSCAR3...
But we needed it. The original vision was correct but impossible in 2002. Now the technology has risen to meet the expectations. CrystalEye, along with SPECTRa, is the first example of a fully functioning WWMM. It's free, virtually maintenance-free, and very high quality. We have developed it so it's portable and we'll be making the contents and software available wherever they are wanted.

But it also requires content. That's why we are developing ways of authoring chemical documents and why we are creating mechanisms for sharing. Sharing only comes about when there is mutual benefit, and until the blogosphere arrived there was little public appreciation. We now see the value of trading goods and services and the power of the gift economy. In our case we are adding things like quality and discoverability as added value. We've seen the first request for a mashup today.

WWMM requires Open Data, and probably we had to create the definition and management of Openness before we knew how to do it. We'll start to see more truly Open Data as publishers realise the value and encourage their authors to create Open content as part of the submission process. And funders will encourage the creation and deposition of data as part of the required Open publication process. Then scientists will see the value of authoring semantic data rather than paying post-publication aggregators to type it up again. At that stage WWMM will truly have arrived.

COST D37 Meeting in Rome

Tomorrow Andrew Walkingshaw and I will be off to Rome for the COST D37 Working Group. From the site:

What is COST?

COST is one of the longest-running instruments supporting co-operation among scientists and researchers across Europe. COST now has 35 member countries and enables scientists to collaborate in a wide spectrum of activities in research and technology. [...]

PMR: I'm always proud to be involved in European collaborations. When I was born Europe was tearing itself apart. Whatever we may think of the bureaucracy involved it's worth it. Science and scientists have always been a major force in international collaboration, and the prevention of conflict.

The meeting itself (COST D37) is aimed at interoperability in chemical computation:


Realistic modelling in chemistry often requires the orchestration of a variety of application programs into complex workflows (multi-scale modelling, hybrid methods). The main objective of this working group (WG) is the implementation, evaluation and scientific validation of workflow environments for selected illustrator scenarios.


In the CCWF group, the focus is on the implementation and evaluation of quantum chemical (QC) workflows in distributed (Grid) environments. This is accomplished by:

  • The implementation of workflow environments for QC by adapting standard Grid technologies.
  • Fostering standard techniques (interfaces) for handling quantum chemical data in a flexible and extensible format to ensure application program interoperability and support of an efficient access to chemical information based on a Computational Chemistry ontology.
  • The implementation of computational chemistry illustrator scenarios from areas of heterogeneous catalysis, QSAR/QSPR, and rational materials design to demonstrate the applicability of our approach.

PMR: So I'll be talking about the World Wide Molecular Matrix (WWMM) and Andrew will talk on Golem - which will transduce the output of computational programs into ontologically supported components that can be fed into other programs without loss of information. I shall try to present as much as possible from the WWW, linking into CrystalEye and OpenNMR.

Anyone for crystal mashups?

From the blogosphere through Bora:

Liz Allen posted this on the Wall of the PLoS Facebook group yesterday:

Here's a fun Friday activity for all of you who like to track the stats of the inevitable rise and world domination of OA!


Another cool mash up site (great logo, takes a minute or so to load) is one where you can see the number of OA repositories mapped across the globe; there were 808 as of earlier today.

PMR: This is fun. I've never done a mashup, but I'd love to try. Here's my idea. Since CrystalEye is Open, anyone can do it or we can do it together:

CrystalEye contains > 100,000 entries, all with author names and addresses. Here's an example:

"College of Sciences Tianjin University of Science and Technology Tianjin 300457 P. R. China"

How easy is it to turn that into a Google coordinate? Of course not all addresses will have a consistent format, so we need a service that can guess formats.

Then how do we actually mash it with Google? I imagine it's fairly easy.

Then we have a map of where every crystal structure in CrystalEye has been done...

And of course this is only the start. We can add info on date, number of authors, field of study, etc. The only requirement is that you have to be prepared to work with Open Data. It's not harmful. You don't even have to let us know you are doing it. But you do have to acknowledge the source.


Open NMR: contributions from the community about outliers and assignments

We are delighted at the practical and helpful contributions from members of the community in helping to understand or correct outliers in the data set we are using. This is exactly what we hoped would happen at the start of the project and it has now started to gain momentum. I list some of them below to acknowledge the help. It is also highlighting the need for better tools for such collaborative projects - a blog is a poor mechanism but wikis also have their failings.

To reiterate:

  • Nick has been through the dataset by hand and identified all data sets with potential misassignments or other anomalies. This has been done by comparing agreements within each set. A data set is likely to have been flagged if (a) it has a single widely outlying shift, (b) two peaks a and b have their assignments interchanged - the observed shift of a is paired with the calculated shift of b and vice versa (as we have shown) - giving an "X"-like pattern, or (c) it has a large general scatter, considerably greater than the average.
  • Nick will post the major outliers based on RMSD. I don't know how many there will be but I expect about 50 (hence the "20%"). These will be clickable - i.e. anyone with an SVG browser can immediately find out which peak is linked to which atom.
  • After, and only after, these have been cleaned or accepted we will try to see if there are systematic effects in the data - either in the variance or the precision. We could expect that data from various sources could provide much of the variation, or the date, or the field strength, or the temperature, or the solvent. Unfortunately we do not have all the metadata as it isn't present in the CMLSpect files.
  • Finally we may be able to comment on Henry's method. It is possible that certain functional groups have problems (Nick has some suspicions) but at present these are overwhelmed by variance from other sources in the experiment or its capture
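The flagging heuristics above can be sketched in code. This is a minimal illustration, not Nick's actual procedure - the thresholds (3 ppm for an outlier, 2 ppm RMSD for scatter) and the exact swap test are my own assumptions:

```python
# Sketch of automatic flagging of a set of (observed, calculated) 13C shifts.
# Thresholds are illustrative guesses, not the values used in the project.
import math

def flag_entry(pairs, outlier_ppm=3.0, rmsd_ppm=2.0):
    """pairs: list of (observed, calculated) shifts for one entry.
    Returns a list of flags from: 'outlier', 'swap', 'scatter'."""
    flags = []
    diffs = [obs - calc for obs, calc in pairs]
    # (a) a single widely outlying shift
    if any(abs(d) > outlier_ppm for d in diffs):
        flags.append("outlier")
    # (b) two peaks whose assignments look interchanged: swapping them
    # shrinks the combined error dramatically (the "X" pattern)
    for i in range(len(pairs)):
        for j in range(i + 1, len(pairs)):
            (o1, c1), (o2, c2) = pairs[i], pairs[j]
            direct = abs(o1 - c1) + abs(o2 - c2)
            swapped = abs(o1 - c2) + abs(o2 - c1)
            if swapped < direct - 2 * outlier_ppm:
                flags.append("swap")
    # (c) overall scatter well above typical
    rmsd = math.sqrt(sum(d * d for d in diffs) / len(diffs))
    if rmsd > rmsd_ppm:
        flags.append("scatter")
    return flags

# A fabricated example: the shifts near 130 and 20 look interchanged.
print(flag_entry([(130.0, 21.0), (20.0, 129.0), (77.1, 77.5)]))
# → ['outlier', 'swap', 'scatter']
```

The swap test simply asks whether interchanging two assignments removes the "X" pattern.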

So here are examples of useful comments. (I am not sure why Pachyclavulide-A is relevant - I can't find it by name search in NMRShiftDB - but the effort is appreciated. However we are primarily looking for comments on the outliers we have identified.)

  1. Egon Willighagen Says:
    October 26th, 2007 at 1:15 am The first one is another misassignment. Look up the structure in the NMRShiftDB and you will see one correctly assigned and one misassigned spectrum. This kind of issue should be filed as a ‘data’ bug report at:

    I’m will do this one.

  2. Egon Willighagen Says:
    October 26th, 2007 at 1:17 am Filed as:

  3. Wolfgang Robien Says:
    October 26th, 2007 at 8:48 am Another error: Pachyclavulide-A (should be C26 instead of C27), MW=510

    Found automatically by the following procedure within CSEARCH:

    Search all unassigned methylgroups located at a ring junction. The methylgroup must be connected either with an up or down bond. As an additional condition, it can be specified if only “Q’s” are missing or if the multiplicity of missing lines can be ignored. I think a quite sophisticated check which goes into deep details of possible error sources. [...]

  4. hko Says:
    October 27th, 2007 at 12:02 pm Misassignments NMRShiftDB (10008656-2) removed.
  5. hko Says:
    October 27th, 2007 at 5:28 pm Misassignments NMRShiftDB (10006416-2) removed. 45.0 and 34.4 reversed.

Open NMR calculations: intermediate conclusions (comments)

I posted our intermediate conclusions on Nick Day's computational NMR project, and have received two lengthy comments. I try to answer all comments, though as Peter Suber says in his interview sometimes comments lead to discourses of indefinite length. I am taking the pragmatic view that I will mainly address comments (and subcomments) that:

  • address our project as we defined it (not necessarily the project that others would like us to have done)
  • add useful information (especially annotation of suspected problems)
  • or show that our scientific method is flawed or could be strengthened
  • relate to Open issues. In our present stage of robotic access to and re-use of data we can only realistically use databases that explicitly allow re-use of data and do not require special negotiation with the owners

There has been a great deal of discussion (far more than we had expected) on our project. Some of this has been directly relevant in responding to our direct requests for annotation of specific outliers and we acknowledge posts from Egon Willighagen, Christoph Steinbeck, Jean-Claude Bradley, Wolfgang Robien, and the University of Mainz. A lot of the discussion has been of general interest but not directly relevant to the aims of the project which was to show what fully automatic systems can do, not to create specific resources ("clean NMRShiftDB"). It is possible, though not necessary, that the work might be more generally valuable depending on what we found.

Wolfgang Robien: October 27th, 2007 at 12:03 pm

You wrote: ….only Open collection of spectra is NMRShiftDB - open nmr database on the web.

Also the SDBS system can be downloaded - as far as I remember it's limited to 50 entries per day (should be no problem because QM-calculations are quite slow compared to HOSE/NN/Incr)

PMR: thank you for reminding us. I have not used SDBS and it looks a useful resource for checking individual structures but it is inappropriate for the current work as:

  • robotic download is forbidden
  • there is no obvious way of downloading sets of structures - they need to be the result of a search
  • there is no obvious machine-readable connection table (there is a semantically void image).
  • there are no 3D coordinates (this is not essential but it meant Nick could work almost immediately)

It is possible that if we wrote to the maintainers they would let us have a dataset, but this would double the size of the project at least.

If you need 500 entries with a certain specification (e.g. by elements, molwt, partial structure, etc.) and you want to perform a common project, please let me know …..

PMR: Thank you. This is a generous offer and we may wish to take it up. Contrary to your comment that I am an NMR expert, I'm really not - I'm an eChemist and this exercise in NMR is because I wish to liberate NMR from the pages of journals. If we find that Henry's program needs more data or that yours has fewer problems it could be extremely valuable. We would wish the actual data to be Open so that others can re-use it.
PMR is quoted as "We downloaded the whole of NMRShiftDB. When we started we had NO idea of the quality. …"
That is quite true. There were a number of public comments on NMRShiftDB ranging from (mild) approval to (mild) disapproval, some scalar values for RMS against various prediction programs and some figures on misassignments, etc. These gave relatively little indication of the detailed data quality - e.g. the higher moments of variation.
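To make the "higher moments" point concrete: a single RMS value cannot distinguish a broad, even scatter from a dataset that is mostly excellent but contains a few gross outliers. A small illustration with fabricated errors (values chosen only so the two RMS values match):

```python
# Two fabricated error sets with near-identical RMS but different shapes:
# only a higher moment (kurtosis) separates broad scatter from rare outliers.
import math

def moments(errors):
    """RMS and kurtosis (fourth standardised moment) of shift errors."""
    n = len(errors)
    mean = sum(errors) / n
    rms = math.sqrt(sum(e * e for e in errors) / n)
    var = sum((e - mean) ** 2 for e in errors) / n
    kurt = sum((e - mean) ** 4 for e in errors) / n / var ** 2
    return rms, kurt

broad = [2.0, -2.0, 2.0, -2.0, 2.0, -2.0]   # uniform scatter
spiky = [0.1, -0.1, 0.1, -0.1, 0.2, 4.894]  # one gross outlier

for name, errs in (("broad", broad), ("spiky", spiky)):
    rms, kurt = moments(errs)
    print(name, round(rms, 2), round(kurt, 1))
# both sets have RMS ~2.0 ppm; the kurtosis of the spiky set is far higher
```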
If there is currently a full list of NMRShiftDB entries with your annotations this would be valuable. Currently I can find a number of comments on individual entries with gross problems at

but these seem anecdotal rather than a complete list.
PMR: and the second set of comments

ChemSpiderMan Says: October 27th, 2007 at 12:43 pm

  • 2) Regarding “the only Open collection of spectra is NMRShiftDB - open nmr database on the web.” Just to clarify these are NOT NMR spectra actually. Unless NMRShiftDB has a capability I am not aware of, NMRShiftDB is a database of molecular structures with associated assignments (and maybe in some cases just a list of shifts... maybe all don’t have to be assigned.)

    PMR: Thank you for the correction. I should have said peaklists with assignments.

    4)Regarding “We knew in advance that certain calculations would be inappropriate. Large molecules (> 20 heavy atoms) would take too long. ” The 20 heavy atom limit is a real constraint. I judge that most pharmaceuticals in use today are over 20 atoms (xanax, sildenafil, ketoconazole, singulair for example). I would hope that members of the NMR community are watching your work as it should be of value to them but I believe 20 atoms is a severe constraint. That said I know that with more time you could do larger molecules but a day per molecule is likely enough time investment.

    PMR: We have strategies for dealing with larger molecules but are not deploying them here.

    6) Regarding “So we have a final list of about 300 candidates.” Out of a total of over 20000 individual structures your analysis was performed on 1.5% of the dataset. How many data points was this out of interest.

    PMR: I expect about 6-20 shifts per entry. Some overlap because of symmetry

    7) Regarding “probably 20% of entries have misassignments and transcription errors. Difficult to say, but probably about 1-5%”. This suggests about 25% of shifts associated with my estimated 3000 shifts are in error. This is about 750 data points and this conclusion was made by the study of 300 molecules. For sure the 25% does not carry over to the entire database. It is of MUCH higher quality than that. My earlier posting suggested that there were about 250 BAD points. The subjective criteria are discussed here ( Wolfgang suggested about 300 bad points but we were both being very conservative. You discussed the difference between 250 and 300 here on your blog, as you likely recall.

    PMR: Nick will detail these later. We believe that the QM method is sufficiently powerful to show misassignments of a very few ppm - I will not give figures before we have done the work. With known variance it is possible to give a formal probability that peaks are misassigned. I have shown some examples of what we believe to be clear misassignments, but we have not gone back to the authors or literature (which often does not have enough information to decide). I do not believe you can compare your estimates with ours as you and we have not defined what a misassignment is.
    8) Regarding “We realise that other groups have access to larger and, they claim, better data sets. But they are closed. I shall argue in a later post that closed approaches hold back the quality of scientific data.” I think your comments are regarding Wolfgang Robien and ACD/Labs. That is true that we have access to larger datasets but we can limit the conversations to NMRShiftDB since we ALL have access to that. Robien’s and ACD/Labs algorithms can adequately deal with the NMRShiftDB dataset. For the neural nets and increment based approach over 200,000 data points can be calculated in less than 5 minutes ( You have access to the same dataset and can handle 300 of the structures. Your statement is NOT about database size but about algorithmic capabilities.

    PMR: My statement was about size and quality of datasets and is completely clear. It has nothing to do with algorithms. I am not interested in comparing the speed of algorithms but am concerned about metrics for the quality of data. I shan't discuss speed of algorithms unrelated to the current project.

  • Open NMR calculations: intermediate conclusions

    Over the last 1-2 weeks Nick Day has been calculating NMR spectra and comparing the results with experiment. As there appears to be considerable interest we have agreed to make our conclusions Open on an almost daily basis. These lead to some compelling observations on the value of Open Data which I shall publish later. To summarise:

    We needed an Open Data set of NMR spectra and associated molecular structures. The data had to be Open because we wished to publish all the results in detail without needing to go back to the "owners". The data also required annotation of the spectra with the molecular structure ("assignment of peaks"). Ideally the data should be in XML, as this is the only way of avoiding potential semantic loss and corruption, but we would have managed with legacy formats.

    There are well over 1 million spectra published each year in peer-reviewed journals. Almost NONE are published in semantic form - most are as textual PDFs or graphical PDFs. It is also unclear how many of these could be robotically downloaded without the publishers sending lawyers - at least Elsevier allow us to do this. In any case we would have to use OSCAR to extract the data, probably involving corruption and loss.

    So we looked for Open collections of spectra. There are many, with an aggregated count of probably over a million spectra. However almost all are completely closed - they require licence fees and forbid re-use. I have criticized this practice before and shall do so later, but here I note that the only Open collection of spectra is NMRShiftDB - open nmr database on the web. This has been created by Christoph Steinbeck and colleagues and contains somewhat over 20,000 spectra. Because Christoph is a member of the Blue Obelisk the data can be exported in CMLSpect (XML), without which Nick Day's project would not have been possible.
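    To give a feel for why the XML export matters, here is a sketch of pulling assigned shifts out of a CMLSpect-style document with only the standard library. The element and attribute names (`peak`, `xValue`, `atomRefs`) follow common CMLSpect usage but are assumptions; real exports may differ between tools.

    ```python
    import xml.etree.ElementTree as ET

    CML_NS = "{http://www.xml-cml.org/schema}"  # assumed CML namespace

    def extract_shifts(cml_text):
        """Return (shift, assigned-atom-ids) pairs from a CMLSpect-like
        document. Names are illustrative, not a formal schema."""
        root = ET.fromstring(cml_text)
        shifts = []
        for peak in root.iter(CML_NS + "peak"):
            value = float(peak.get("xValue"))
            atoms = (peak.get("atomRefs") or "").split()
            shifts.append((value, atoms))
        return shifts

    sample = """<spectrum xmlns="http://www.xml-cml.org/schema">
      <peakList>
        <peak xValue="128.4" atomRefs="a3"/>
        <peak xValue="21.2" atomRefs="a7 a8"/>
      </peakList>
    </spectrum>"""
    assert extract_shifts(sample) == [(128.4, ["a3"]), (21.2, ["a7", "a8"])]
    ```

    The point is that the peak values and their atom assignments survive intact; scraping the same information from a PDF would lose exactly the assignment metadata the project depends on.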

    We downloaded the whole of NMRShiftDB. When we started we had NO idea of the quality. This was deliberate, and part of the eScience question: can a robot automatically determine the quality of data? Several correspondents have missed this point - it was more important to answer this question than whether the data were "good". IF AND ONLY IF the data AND METADATA were of sufficient quality would it be possible to say something useful about the value of the theoretical calculations.

    We knew in advance that certain calculations would be inappropriate. Large molecules (> 20 heavy atoms) would take too long. Molecules with floppy groups cannot be easily analysed. So we selected small molecules with rigid frameworks. This gave a starting set of about 500 candidates, each of which takes on average 1 day to calculate.
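    The selection step above can be expressed as a simple filter. The records and field names below are hypothetical (NMRShiftDB entries do not carry these counts directly; they would come from a cheminformatics pass over the structures), and the thresholds just mirror the criteria stated in the text.

    ```python
    # Hypothetical pre-processed records: heavy-atom and rotatable-bond
    # counts would be computed from each structure beforehand.
    entries = [
        {"id": "nmrshiftdb-1", "heavy_atoms": 9,  "rotatable_bonds": 0},
        {"id": "nmrshiftdb-2", "heavy_atoms": 34, "rotatable_bonds": 2},
        {"id": "nmrshiftdb-3", "heavy_atoms": 14, "rotatable_bonds": 5},
    ]

    def is_candidate(entry, max_heavy=20, max_rotatable=1):
        """Keep small, rigid molecules: large ones cost too much CPU time,
        floppy ones populate conformers the single calculation ignores."""
        return (entry["heavy_atoms"] <= max_heavy
                and entry["rotatable_bonds"] <= max_rotatable)

    candidates = [e["id"] for e in entries if is_candidate(e)]
    assert candidates == ["nmrshiftdb-1"]
    ```

    At roughly one CPU-day per molecule, cutting the set to ~500 candidates before submission is what makes the whole exercise tractable.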

    About 200 of these had computational problems - mainly that they were too large, they didn't converge, Condor had problems managing the jobs, etc. So we have a final list of about 300 candidates.

    We have listed the analysis of these results over the last few days. It is clear that some entries have "errors" and that there were also defects in the initial calculation models. Henry knew the latter, of course, but even if we hadn't known this at the start we would have been able to hypothesise that Br and Cl give rise to serious additive deviations. So this is at least confirmation of a known effect, for which we have made empirical corrections based on theory.
    We have shown examples of poor agreement, all of which we anticipated in principle. The data set contains many problems, including:

    • wrong structures. I am sure that at least one structure does not correspond to the spectrum
    • misassignments. These are very common - probably 20% of entries have misassignments
    • transcription errors. Difficult to say, but probably about 1-5%
    • conformational problems. There are certain molecules which have conformational variability (i.e. are floppy) but we have only calculated one conformer. The most common example is medium-sized rings
    • human editing of data leading to corruption. 2 entries at least

    As a result Nick is going to produce a cleaned data set manually. He has already done most of this (has he slept?). He cannot do this automatically as the metadata are not present in the XML files. He will then be in a position to answer the questions:

    • how much of the variance is due to experimental problems?
    • if this is lower than, say, 70%, is it possible to detect systematic "errors" in the computational methodology?
    • if so, can the method be improved?
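    The first two questions above reduce to simple statistics on the cleaned set. The sketch below, with purely illustrative numbers, shows the idea: the mean residual exposes a systematic offset (a candidate for an empirical correction of the kind mentioned earlier), while the spread is the remaining unexplained variance.

    ```python
    import statistics

    def residual_stats(calc, expt):
        """Mean and standard deviation of calc - expt residuals (ppm).
        A non-zero mean suggests a systematic error in the method;
        the spread bounds what experimental noise must explain."""
        residuals = [c - e for c, e in zip(calc, expt)]
        return statistics.mean(residuals), statistics.stdev(residuals)

    calc = [128.9, 21.6, 77.4, 140.2]   # illustrative values only
    expt = [128.4, 21.2, 77.0, 139.7]
    offset, spread = residual_stats(calc, expt)
    assert abs(offset - 0.45) < 1e-9    # consistent systematic offset
    assert spread < 0.1                 # little scatter beyond it
    ```

    On real data one would partition this spread between experimental problems and methodological ones, which is exactly the 70% question posed above.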

    If we can believe in the methodology then we can start to use it as a tool for analysing future data sets. But until then we can't.
    We realise that other groups have access to larger and, they claim, better data sets. But they are closed. I shall argue in a later post that closed approaches hold back the quality of scientific data.