Open Notebook NMR: the act of discovery

Things are happening so fast on the Open Notebook NMR project that we need to take stock. Here are today’s developments:

  • Nick Day is working round the clock to manage the data and create the plots. He has a thesis to write 🙂
  • Henry Rzepa continues to have near-daily insights into the NMR methodology
  • Antony Williams and now ACD labs would like to work with the data
  • Christoph Steinbeck (NMRShiftDB) is visiting today

This highlights the problem of managing the process of discovery. The most important person is Nick. He has spent the last year developing this software (much of it re-used from CrystalEye) and now he’s got a chance to use it. It’s similar to instrument-based physics, where a student spends three years developing a new type of detector and has a mad rush to point it at the sky and show it works. Then it’s released to the world and others start to make their own independent discoveries. So we have to have a delay during which Nick can discover things. That’s one reason why the two of us sat down yesterday and published all the things we might expect to discover. Then, when we or others find them, Nick can claim all or some of the credit.
Similarly, Henry has developed a number of new insights into NMR. They are his, and if I published them here without his agreement it would be inappropriate. So at this stage I will simply say that Henry has an important protocol that explains a large percentage of the variance for certain chemical groups, and I have asked him to publish it here or send the material.
Christoph certainly has the right to a few days’ breathing space while we look at what’s in NMRShiftDB. He and colleagues have spent years building it up, and if we have a new tool then they should get the chance to use it on their data.
Antony and colleagues have offered help, which is welcome.

  1. ChemSpiderMan Says:
    October 23rd, 2007 at 12:52 am
    Peter, all that is needed to perform the calculations for comparison using the ACD/Labs NMR predictors is a download of the exact dataset Christoph provided to you (we have already had issues with comparing algorithm to algorithm but using different versions of the NMRShiftDB database… not good). Also, if Nick can send us the ID of the structure inside the NMRShiftDB this should be enough. Thanks

We hadn’t anticipated this and we have to think about how this fits in. This is a structured project, on which parts of Nick’s thesis are based, and if other people do some of the work he had planned it causes problems. I think we can work it out but we need to set the ground rules publicly.
There is also the role of ACD/Labs, which I consider in a later post.
So over the next week we shall probably share our insights but not our data. Probably the first thing we shall ask the community for is help with outliers – here is a structure which doesn’t fit – does anyone know why? Henry has already done this for the worst outlier and shown that it is probably a transcription or related error – i.e. there is a report of the compound in the literature which has relatively unremarkable values. If, during this process of weeding out outliers, someone discovers a completely new effect then they will deserve a lot of credit.
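
To make the request concrete, here is a minimal sketch of the kind of outlier screen we have in mind. The data layout, identifiers and threshold are all hypothetical – the real results live in CMLSpect files, not Python dictionaries.

# Hypothetical sketch: flag NMRShiftDB entries whose calculated vs. observed
# 13C shifts disagree badly. Threshold and data layout are illustrative.
def rmsd(pairs):
    """Root-mean-square deviation (ppm) between observed and calculated shifts."""
    return (sum((obs - calc) ** 2 for obs, calc in pairs) / len(pairs)) ** 0.5

def worst_outliers(entries, threshold_ppm=5.0):
    """Return (deviation, entry_id) for entries above the threshold, worst first."""
    flagged = []
    for entry_id, pairs in entries.items():
        dev = rmsd(pairs)
        if dev > threshold_ppm:
            flagged.append((dev, entry_id))
    return sorted(flagged, reverse=True)

# entries = {"nmrshiftdb:12345": [(128.5, 129.1), (77.0, 81.3)], ...}
# for dev, entry_id in worst_outliers(entries):
#     print(f"{entry_id}: RMSD {dev:.1f} ppm - does anyone know why?")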

Posted in data, nmr, open notebook science | 1 Comment

Open Notebook NMR – Henry's improved protocol

When we first started this (nearly a month ago), Henry suggested a protocol for calculating the chemical shifts. Nick tooled up for this and had to overcome several technical problems with job submission, etc. (A typical example: the order of arguments in a Condor machine filter seems to matter; shuffling them fixed the problem. This took at least a week out of our elapsed lives. It is not really chemistry, but it’s part of eScience, just as analysing solvents from different suppliers is part of chemistry.) Here’s the first graph Nick got (you’ve seen it already):
nmr1.PNG
During this period Henry improved his protocol, but we continued to use the old one until the jobs finished. Then we re-ran them all with the new one (details will come later):
nmr2.PNG
You can see there is a small but significant improvement. We believe that when the data errors are filtered out the improvement will be much clearer and more obviously valuable.
We’ll let Henry tell you what he’s done when it’s relevant.
WE ACKNOWLEDGE JOE TOWNSEND’S PLOTTING SOFTWARE, WHOSE OUTPUT, WHEN DISPLAYED AS SVG (LATER THIS WEEK), WILL ALLOW EXCITING THINGS.

Posted in chemistry, nmr, open notebook science | 3 Comments

Open Notebook NMR: Anticipated errors

Nick and I sat down this morning and thought about what possible errors might arise in the “data” or “experimental” axis and also on the “predicted” axis. Some of these may overlap with Antony’s suggestions but they are independent.
An “experimental” error is one that is independent of the prediction method. There are some grey areas (e.g. in the compound itself), but we have come up with:

  • mis-reported solvent (the shifts are solvent dependent and the calculation tries to simulate this)
  • variable calibration of NMR instrument (e.g. giving rise to origin shifts)
  • impure compound. The sample may contain substance(s) which give rise to appreciable peaks not belonging to the title compound
  • wrong compound assigned to spectrum (i.e. error in bookkeeping or drawing error)
  • machine parameters (phasing, folding, field strength, etc.) varied incorrectly or reported incorrectly
  • transcription errors in spectrum or peaks.
  • misassignment of peaks to inappropriate atoms
  • broad peaks with considerable variance leading to misreporting of mean (unlikely with 13C)
  • errors in applying theory of NMR or its interpretation
  • noise (including random noise and mains spikes).
  • human editing of spectra including fraud

A “prediction” error is independent of the reported value for the shift. Some are theoretical, some are computer “bugs”. These include:

  • mis-calculation of offset (e.g. from isotropic tensor to observed shift)
  • garbling of the assignment of peaks to atoms (bug)
  • corruption of connection tables (especially in adding hydrogen atoms)
  • mismapping of atoms between input and output of the calculation (we assume atoms come out in the order they go in – bug; see the sanity-check sketch after these lists)
  • incorrect generation of input (bug)
  • program bugs in reading input and in the main calculation. For example, we found a really nasty bug with GAMESS – if the line overflowed 80 characters the atom was reported but not included in the calculation.
  • incorrect transformation of output to CMLSpect
  • theoretical model has limitations (Henry will comment)
  • Oversimplified chemical model. There are several common problems:
  1. only one conformer is calculated
  2. symmetry is not well treated
  3. tautomerism is ignored
  4. isomerism (e.g. ring-chain) is ignored
  5. other chemical effects (Antony mentions micelles, etc.)

There are also potential bugs on the computational side:

  • inconsistent results from different machine architectures
  • errors in processing and displaying the results
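
Several of the “bug”-type errors above are mechanically checkable. As a purely hypothetical illustration (not our actual code), a guard for the atom-mapping assumption might look like this:

# Hypothetical sanity check for the mismapping bullet above: compare the
# element sequence sent to the QM program with the sequence read back.
def atoms_in_same_order(input_elements, output_elements):
    """True if the calculation preserved atom order (our working assumption)."""
    return list(input_elements) == list(output_elements)

# atoms_in_same_order(["C", "H", "H", "O"], ["C", "H", "H", "O"])  -> True
# atoms_in_same_order(["C", "H", "H", "O"], ["C", "H", "O", "H"])  -> False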

So, we look forward to sharing this with Christoph tomorrow. Nick has prepared a range of display tools, including a filter for the errors within structures. Ideally the calculated value (y) should relate to the observed one (x) by:
y = x + eps
where eps is normally distributed. In practice we expect that we shall find
y = x + c + eps
where c varies between entries and reflects the errors in origins and solvents. We don’t know what the magnitude will be. We don’t see any need at present for
y = m*x + c + eps
where there is empirical scaling.
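
As a minimal sketch (the data layout is hypothetical), the per-entry offset c can be estimated as the mean residual over an entry’s peaks, leaving eps to be inspected for normality:

# Sketch: estimate the per-entry offset c in y = x + c + eps as the mean of
# (calculated - observed) over one entry's peaks; what remains is eps.
from statistics import mean, stdev

def entry_offset(pairs):
    """c for one entry; absorbs origin and solvent errors."""
    return mean(calc - obs for obs, calc in pairs)

def residuals(pairs):
    """eps after removing the entry's own offset."""
    c = entry_offset(pairs)
    return [calc - obs - c for obs, calc in pairs]

# pairs = [(obs_ppm, calc_ppm), ...] for one NMRShiftDB entry
# print(entry_offset(pairs), stdev(residuals(pairs)))
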
The intra-compound comparison will highlight entries with the following features:

  • high precision, high accuracy
  • high precision, low accuracy (hopefully allowing identification of systematic error)
  • low precision, high accuracy (maybe due to noise, though this is unlikely here)
  • low precision, low accuracy (these may allow us to identify problems with various sources such as authors, machines or protocols).
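
A hedged sketch of how this classification might be automated: the spread of the residuals stands in for precision and the per-entry offset for accuracy, with ppm thresholds as placeholders until we see the real distributions.

# Hypothetical classifier for one entry's (observed, calculated) shift pairs.
from statistics import mean, stdev

def classify(pairs, offset_tol=1.0, spread_tol=1.0):
    """Label an entry by accuracy (offset) and precision (residual spread)."""
    c = mean(calc - obs for obs, calc in pairs)
    s = stdev(calc - obs - c for obs, calc in pairs)
    precision = "high" if s < spread_tol else "low"
    accuracy = "high" if abs(c) < offset_tol else "low"
    return f"{precision} precision, {accuracy} accuracy"

# classify([(128.5, 129.1), (77.0, 77.4), (21.3, 21.9)])
# -> 'high precision, high accuracy'
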
Posted in chemistry, data, nmr, open notebook science | Leave a comment

Open Notebook NMR – cont

We are now close to releasing the first results of the calculations – at present 300+ molecules. We think that the really major foul-ups have been overcome (e.g. when all the Gaussian files failed to run because of a missing blank line), so we think it’s worth the community listening in.
To set the scene: this is part of Nick Day’s thesis work and Nick will be first author of anything to come out in the immediate future. Henry Rzepa provided much of the motivation and also the algorithms that we are now using. He first gave us an extension of the GIAO and Rychnovsky methods and then elaborated this in a further protocol, which is what we are now using. This protocol is based on his own work at Imperial, where he has computed a number of structures and gradually refined the methods. So this is his current best guess as to what we need, although there are some refinements for halogens that need to be added after the calculation.
Christoph has provided the NMR data from NMRShiftDB. Of course it comes from various sources, but we shall rely on his judgement as to whether a structure is likely to be “wrong”. This is a difficult one – we cannot simply remove a structure because it doesn’t fit, but he may be able to assert that there is a known problem. We may also have generic filters based on, for example, the laboratory the data came from.
These are the expected initial authors, and we’ll see how things go. Christoph, Henry, Nick and I will have a few days to inspect the data before releasing it all. This should remove any really obvious “data errors” and also allow us to plan any further refinements. For example, Henry has looked at the really glaring outlier and suggested a protocol change, though we don’t think it will account for all the deviation.
People’s contributions will necessarily be recorded and so it will be clear what has been done. In the first instance I think we shall use the NMRShiftDB data and the Imperial protocol to give us an idea of the tractability of the method.
We absolutely welcome any input. We’ll be fairly focussed on a thesis-like approach for the next month or so, but may branch out. Here are some highly valuable suggestions:

  1. ChemSpiderMan Says:
    October 22nd, 2007 at 2:49 pm
    I’ve seen the discrepancies Jean-Claude is talking about many times. However, a difference of 0.2ppm in C-13 is pretty much irrelevant. Admittedly [… discussed elsewhere – PMR …]
  2. Peter – FYI ACD/Labs are ready to participate in the work as discussed: http://www.chemspider.com/blog/?p=213#comment-3735
    Peter and Tony,
    I think this is a fantastic project and am very keen to see how accurate the QM techniques prove to be for the subset of structures that you choose from the NMRShiftDB, and then how helpful they can be in improving the accuracy of experimental shifts in this wonderful resource.
    For the purposes of this work, we would be willing to provide the chemical shift predictions from the ACD/Labs software if you would like to use them in your comparison. If, for instance, they prove to be accurate enough to find many of these problems without the need for time-consuming QM calculations, it may be preferable to use the faster calculation algorithms that are available in our software. It may turn out that the ACD/Labs predictions could serve as a pre-filter to define which structures need the QM calculations and which don’t. Many variations on this theme come to mind, but we won’t know which are useful until we do the work.
    Sincerely,
    Brent Lefebvre
    NMR Product Manager
    Advanced Chemistry Development, Inc.

PMR: Many thanks. I think this will be extremely useful in the next phase of the program (which could be quite soon). At present Nick needs to concentrate on the Gaussian work, as it is fairly easy to initiate a new protocol and re-run the jobs in perhaps 2 days. The results will then give us an idea of where the main problems are. If, for example, we find 5% of structures are misassigned, that is ca. 15 – not too difficult to check by hand. But if we then scale this to the 20,000 in NMRShiftDB it becomes 1,000 entries, and we have to automate or fan out the social computing. If, however, the data error rate is 0.5%, then 100 problems in NMRShiftDB is a long wet afternoon for the dedicated few.
Data quality is critical. Joe Townsend went round this loop several times before coming up with a usable protocol for filtering problems. It’s harder for NMR, but there are some tricks we may be able to play to weed out the worst.

Posted in data, nmr, open notebook science | 1 Comment

Experiment and Calculation in WWMM-NMR. Open Notebook Science

Antony guessed the graph – regular readers will recognise the context of previous posts. We are starting an Open Notebook project to determine whether theoretical calculations and experimental observations agree – or rather within what limits. (Earlier this year I talked with Michael Kohlhase about PhysML, where the basis of the language is to assert that observations and predictions do or do not agree and that hypotheses may or may not be disproved). Being a chemist my language may be sloppier – please forgive or correct me.
Anyway here it is. Nick Day has done all the work and deserves all the credit. There are ca 500 structures in this and they are all aggregated and plotted with the same origin.
nmr1.PNG
Before analysing this, it’s important to know the methods in detail. Nick will post more.
Experimental: All (?) structures in NMRShiftDB were extracted (ca. 20,000). Compounds unsuitable for calculation (e.g. those with heavy metals) were excluded. All files were in CMLSpect and contained 3D coordinates (except for H atoms – everyone PLEASE include H atoms) and 13C spectral peaks (shift in ppm), labelled with the atom or atoms to which they were assigned. The solvent, field strength (??) and temperature (??) were recorded. I am not sure of the metadata (I am writing from home and Nick will fill all this in later).
In general there is no chance of re-running the spectrum or re-analysing the sample, but it may be possible to contact the authors on an occasional basis.
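
As an illustration only (the element whitelist below is hypothetical, not Nick’s actual criterion), the pre-filter for calculability could be as simple as:

# Hypothetical pre-filter: exclude compounds containing elements we cannot
# (or do not want to) treat in the QM calculation, e.g. heavy metals.
CALCULABLE_ELEMENTS = {"H", "C", "N", "O", "F", "Si", "P", "S", "Cl", "Br", "I"}

def calculable(element_symbols):
    """True if every atom in the molecule is in the supported element set."""
    return all(sym in CALCULABLE_ELEMENTS for sym in element_symbols)

# calculable(["C", "H", "H", "H", "Cl"])  -> True
# calculable(["C", "H", "H", "H", "Hg"])  -> False
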
Calculation: We will give the basis set and method later (I don’t know whether Henry uses the term GIAO – he has extended the method). The calculation emits the isotropic shielding, which is an absolute value (whereas the chemical shift is relative to tetramethylsilane).
I don’t know whether we should transform the experimental values to be on the same scale as the calculated ones, but we have done the reverse: each calculated value is subtracted from the value calculated for TMS (in the same solvent as reported for the spectrum).
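
A minimal sketch of that referencing step (the numbers are illustrative, not the values we actually use):

# Convert an absolute isotropic shielding (ppm) to a chemical shift on the
# experimental scale: delta = sigma(TMS) - sigma(atom), with both shieldings
# computed using the same method, basis set and solvent model.
def to_chemical_shift(sigma_atom, sigma_tms):
    return sigma_tms - sigma_atom

# Illustrative numbers only:
# to_chemical_shift(60.0, 190.0)  -> 130.0 ppm
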
The comparison is then plotted. We do, of course, have plots for each compound as well.
We’ll come onto the analysis later, but you may wish to think about possible reasons for disagreement (i.e. points lying off the line). I have over 10 reasons.

Posted in nmr, open notebook science, XML | 3 Comments

Peter Suber – a model for us all

I have just read Richard Poynder’s interview with Peter Suber: The Basement Interviews: Peter Suber, Open and Shut? October 19, 2007. It’s 80 pages and Richard records that it took 3 hours on Skype and landline. It’s almost the equivalent of a small book – and it could benefit from being bound as one. Please read it. And it could be used as a text for students in several areas.
PeterS’s written material is honoured for its clarity, precision, consistency and recall of all the relevant background material. What is equally remarkable is that his spoken answers read in almost the same way (I don’t know whether there was much or even any editing). I’ve never spoken with Peter but have corresponded frequently, and fancifully I think of him as having the same verbal impact as Alistair Cooke where every word has its place. Maybe one day I shall be (dis)illusioned.
The interview is remarkable in its honesty and coverage. He is not afraid to talk about disagreements, especially with Stevan Harnad. The interview gives a picture of how the early Open Access movement depended centrally on both of them, and offers insights into the different “B”s of the B-B-B declarations. Much of this was news to me – my physical interaction with the OA movement is sporadic and depends mainly on selected invitations. In fact I now have a greater appreciation of the value of the Green aspects of Open Access, while still arguing that permission barriers, as well as price barriers, are critical for effective digital scholarship.
Where would we be without the central role that Peter plays? We would be far worse off. We could be disagreeing over badly presented facts and arguments (I and others are guilty of this). We could leave large errors that undermined the validity of our case. And, as Richard and others make clear, the clarity and even-handedness of his writing is probably our strongest weapon. In almost all cases he allows that the opponents of OA have an understandable position. Only when it comes to the recent misrepresentation of the effect of OA on peer review does he become severely critical:

RP: … [PRISM members] claim that they are defending peer-review, or trying to avoid the collapse of the scholarly communication process?
PS: Yes. What bothers me is the dishonesty of some of the arguments. Instead of arguing that their revenues are at stake, publisher lobbyists argue that peer review as such is at stake, that OA is a form of censorship, and that authors and publishers will be forced to “surrender” their articles to the government. There’s a gray area in which we can’t distinguish very weak arguments from deliberate disinformation, or innocent misunderstandings from culpable misrepresentations. But these are well outside that zone. They’re cynical attempts to mislead anyone who doesn’t know the facts, especially policy-makers and journalists. They’re arguments that only work with an ignorant audience, and they know that.
What’s ironic and frankly astonishing is that academic publishers should be making these arguments, or allowing their lobbyists to make them. They should be trying to prove that they are especially careful with reasoning and evidence, and deserve to be entrusted with the management of peer review. But the ones behind the PRISM campaign are proving that they are careless with truth and do not deserve that trust.

PMR:
This is the strongest that Peter gets, and those familiar with his writing recognise it.
Personally I should thank Peter for his many private emails (and you will see from this interview how much of his day is taken with answering them); for his support for The Open Knowledge Foundation; for spending an hour discussing chemistry with Antony Williams (A Conversation with Peter Suber – Navigating the Complexities of Open Access Definitions); and for providing his daily blog, from which so many other blog comments and posts arise.

Posted in open issues | Leave a comment

Reconciling points of view

Over the last few weeks there has been strong and active discussion about issues relating to Openness, and some of these have been commented on (or even initiated) here. Some people feel that I may have been simplistic or overly polemical, and there is a danger of unnecessary polarisation, so I have taken a few days off blogging to reflect.
The issues range from open access publishing (OA), through the BBB declarations, to Open Data and the role of commercial companies. There is so much that is new and changing that all of us have to rethink our positions frequently. We see daily changes in the balance of Open Access practice and community, and many think that the change is unstoppable. However, it seems clear to me that there is currently a major struggle for the control of information and data, with ownership and licences as important concepts.
The blogosphere is a critically important area for these issues to be raised, and it’s an excellent way of drawing in new voices. Going back a year we would see little discussion of the issues in chemistry, whereas now there are several active blogs.
It is also clear that the issues are complex, probably more complex than any of us realised at the beginning. There is a spread between the “religious” – “this is my point of view and it’s self-evidently right” – to the all-inclusive “nobody should be criticized”.
These tensions have been clearly visible in the Open Source movement, where Richard Stallman (Why “Open Source” misses the point of Free Software) has often disagreed with approaches different from his. I believe that this disagreement has been constructive, as it has required everyone to think clearly about the issues and to create instruments (licences) to manage practice in the community. Similarly, in Open Access I regard the organized publishers’ lobby (reported by Peter Suber) as something that has to be challenged, if necessary with robust language. If this struggle loses us friends, that may be the price, but it would be a pity.
Other areas also show major differences of opinion. Stevan Harnad has a strong and long-held view of Green Open Access and believes this is the best solution for the Open Access community. I believe in much of what Stevan has done and achieved, but I differ over the value of Green OA for scientific data. Stevan has recently written Time to Update the BBB Definition of Open Access (Open Access Archivangelism, October 18, 2007), where he calls for a review of the BBB declaration(s). I suspect this may be similar to the development of Open Source licences – there will be no simple solution and it will take time. It looks as if Stevan wishes to adjust BBB so it doesn’t talk about permission barriers and only relates to sighted human readership. This may be useful, but it will have to be accompanied by considerable work on licences relating to permission, or otherwise we may have gone backwards. In any case it will have redefined what “Open Access” means (and this reminds us that words are both our friends and our problem).
There has also been a lot of discussion about Open Data, especially catalysed by the Chemspider discussions.

Who Gets to Choose Whether Data is Open or Not?

and an important contribution from Joerg Wegner

which review some of the aspects of the Blue Obelisk’s Open Data, Open Source, Open Standards (ODOSOS).
It is clear that there are a variety of opinions, but it is also clear that there is a more-or-less identified community which wishes to make progress. That community is wider than Blue Obelisk but has limits in that it does not, for example, include any publishers.
So it may be useful to regard the discussions of the last few months as the labour pains of the birth of a movement in chemistry. There will have to be limits – it would be difficult to have people or organizations who wished to control data for their own purposes. It’s clear that we have to accommodate a wider range of views and practices than a year ago – then ODOSOS seemed fairly simple – now it may be less clear.
I would like to be part of a constructive discussion and practice in the future. Here are some of my points of view – like the rules of Peter Suber’s Nomic, some are immutable and some are mutable (although I’m not always clear which):

  • we have the possibility to develop the next phase of net-based chemical collaboration. This will include the enhancement of practice and standards. I am committed to being part of this.
  • I have been involved in several virtual communities (e.g. the XML-DEV mailing list) which develop new approaches. They are rarely free from occasional flames (though XML-DEV was remarkable), but I will from now on be very careful to avoid tensions. I hold no personal animosity towards any individual.
  • commercial organisations may have a role. However there have been examples where commercial organizations have used collaborations to develop their own interests at the expense of the community. These tensions still exist in the Open Source community.
  • We have to use words and algorithms as management devices. We cannot assume that simple (English) words have an obvious meaning. “Open” and “Free” are now so heavily overloaded that we cannot use them without the risk of confusion. So, for example, Joerg writes “Open must not be free: I strongly believe that ‘open’, means not ‘free’ of charge.”. This is very similar to the debate in Open Source. I assume that here it means that you may (not must) charge for Open Data. If it means you must always charge for Open Data then I have to disagree. My phrase would be: “you may charge for Open Data but you must also allow the Open data to be accessible without charge”. This emphasizes that this is not simple.
  • I welcome the chance to collaborate with Antony Williams on Open Notebook NMR spectra and hope this takes off.
  • I welcome the exploration by companies such as Chemspider of Open Data licences and practices. It will not be easy. For example, I am a strong supporter of what Talis has done with their Open Data licence. If I feel that any company has an honest desire to support Open Data I will be happy to work with them.

I will be concentrating on Open Notebook Science next week – Christoph will be visiting us.

Posted in "virtual communities", open issues | 1 Comment

Urgent action needed to support the NIH bill

Peter Suber has written at length (Urgent action needed to support the NIH bill):
The provision to mandate OA at the NIH is in trouble. Late Friday, just before the filing deadline, a Senator acting on behalf of the publishing lobby filed two harmful amendments, one to delete the provision and one to weaken it significantly. We thought we’d done everything and only had to wait for the Senate vote. But now we have to mobilize once more, and fast, to squash these amendments. Here is an announcement from the Alliance for Taxpayer Access:
The immediate message is that all US citizens should spend time today or Monday emailing their representative.
People sometimes wonder why I get overly angry on this blog. This is why. Yes, it’s legal. But it represents the power that money has over the future of the information revolution. Perhaps I exaggerate but it should have all of the moral force of the struggle against unacceptable ownership of labour.
Posted in open issues | Leave a comment

A graph

In the tradition of Rich Apodaca’s “Name that graph” (example), here is a graph without axes. You will be seeing more of these later.
untitled.PNG

Posted in data, puzzles | 4 Comments

Why Green Open Access does not support text- and data-mining

Stevan Harnad, Peter Suber and I have been discussing whether Green Open Access (author self-archiving in an Institutional Repository) is sufficient to allow indexing and mining. Stevan comments:

Individual re-use capabilities: If a document’s full-text is freely accessible online (OA), that means any individual can (1) access it, (2) read it, (3) download it, (4) store it (for personal use), (5) print it off (for personal use), (6) “data-mine” it and (7) re-use the results of the data-mining in further research publications (but they may not re-publish or re-sell the full-text itself: “derivative works” must instead link to its URL).

and later:

Stevan Harnad Says:
October 15th, 2007 at 11:51 pm
Dear Peter,
The example you gave of robot blockage was the publisher (Gold? or something else?) giving “free access” with strings and constraints attached. That is not what I am talking about. I am talking about Green OA: that is when an author self-archives his own final, peer-reviewed, accepted draft (“postprint”) in his own Institutional Repository and sets access as “Open Access.” No strings attached, and the spiders can spider away.
And the essence of both my logical and methodological point is that paid Gold OA is always also Green OA. So don’t rely on your publisher providing proper access: self-archive the postprint! Then all the capabilities you seek will come with the territory. Further rights retention or licensing is superfluous (and a retardant, if insisted upon, gratuitously, as a precondition for providing OA!).
And, for the record, I am always talking about published, peer-reviewed journal articles. I am not for a moment contesting that authors can and should license rights to their data as part of making them OA.

PMR: It will help if we understand what responsible and publishable text-mining involves. If anyone on the SciBorg project (e.g. Peter Corbett) publishes a paper on natural language processing in chemistry, it has to be reproducible. This is fundamental to science – and NLP is a science. If you make a claim but do not allow someone to falsify it, you are not publishing science. (Unfortunately this lack of repeatability is almost universal in “chemoinformatics” publications, where raw data is never required by the journals, but that’s another article.)
So the first thing to do is to gather a corpus of documents. This corpus is part of the experimental toolkit – any other scientist should be able to have access to it. It therefore has to be freely distributable. Since we are interested in machines understanding science, we are concentrating on chemistry articles. This isn’t easy, since almost all articles are copyrighted and non-distributable. Publisher Copyright is a major barrier to progress in Chemical Natural Language Processing – you can’t just go out and compile a wordlist or whatever, as you may infringe copyright or invisible publisher contracts (we found that out the hard way).
When SciBorg started there were no Open Access chemistry journals. Even now the Open Access Beilstein Journal of Organic Chemistry only has ca. 50 articles. Our corpus comes from the Royal Society of Chemistry, Nature, and the International Union of Crystallography, and we are working on what parts of this we can legally redistribute.
The corpus doesn’t stay as PDFs – PDFs are so awful they are not just useless, but actually destroy information. (Diana Stewart, who works on SPECTRa-T, is trying to find out why theses from Caltech emit non-printing ASCII control characters in their PDF.) So we have to repurpose them by converting to HTML, XML and so on. It’s not a convenience, it’s a necessity. This conversion almost certainly loses information and almost certainly loses any copyright statement (which may even be in an image).
Now the corpus is annotated. Expert humans go through line by line, word by word and character by character, identifying the role of each. Often several do this independently to see how well they agree (it’s never 100%). Then everyone can test their software on the same corpus and make meaningful comparisons. It is this annotated corpus which is of most use to the scientific community.
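
As a minimal sketch of the simplest such comparison – token-level percentage agreement between two annotators – the check might look like this. Real studies would also use chance-corrected measures such as kappa, and all names here are hypothetical.

# Hypothetical token-level agreement between two annotators who labelled
# the same tokenised text independently.
def pairwise_agreement(labels_a, labels_b):
    """Fraction of tokens given the same label by both annotators."""
    if len(labels_a) != len(labels_b):
        raise ValueError("annotators must label the same token sequence")
    same = sum(a == b for a, b in zip(labels_a, labels_b))
    return same / len(labels_a)

# pairwise_agreement(["CHEM", "O", "CHEM"], ["CHEM", "O", "O"])  -> 0.666...
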
So suppose I find 50 articles in 50 different repositories, all of which claim to be Green Open Access. I now download them, aggregate them and repurpose them. What is the likelihood that some publisher will complain? I would guess very high. The context of the papers is lost – they simply see “their papers” being packaged and redistributed. They may claim that we have violated database rights, etc. The example I gave showed not that Green Open Access per se was being violated (it wasn’t) but that publishers act in restrictive ways that make no logical sense, and hence logic is of little value.
Only a rights statement actually on each document would allow us to create a corpus for NLP without fear of being asked to take it down.
Data is similar but left as an exercise for the reader.

Posted in Uncategorized | 2 Comments