Anyone for crystal mashups?

From the blogosphere through Bora:

Liz Allen posted this on the Wall of the PLoS Facebook group yesterday:

Here’s a fun Friday activity for all of you who like to track the stats of the inevitable rise and world domination of OA!

[…]

Another cool mashup site (great logo; takes a minute or so to load) is http://maps.repository66.org/ where you can see the number of OA repositories mapped across the globe; there were 808 as of earlier today.

PMR: This is fun. I’ve never done a mashup, but I’d love to try. Here’s my idea. Since CrystalEye is Open, anyone can do it or we can do it together:
CrystalEye contains > 100,000 entries, all with author names and addresses. Here’s an example:
“College of Sciences Tianjin University of Science and Technology Tianjin 300457 P. R. China”
How easy is it to turn that into a Google coordinate? Of course not all addresses will have a consistent format, so we need a service that can guess formats.
Then how do we actually mash it with Google? I imagine it’s fairly easy.
Then we have a map of where every crystal structure in CrystalEye has been done…
And of course this is only the start. We can add info on date, number of authors, field of study, etc. The only requirement is that you have to be prepared to work with Open Data. It’s not harmful. You don’t even have to let us know you are doing it. But you do have to acknowledge the source.
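To make the idea concrete, here is a minimal sketch (hypothetical code, not part of CrystalEye): assume some geocoding service has already turned the author addresses into (label, lat, lon) tuples, and emit a KML overlay that Google Earth or Google Maps can display. The Tianjin coordinates below are illustrative guesses, not real data.

```python
# Minimal sketch, NOT CrystalEye code: the geocoding step (address ->
# lat/lon) is assumed done elsewhere; here we only build the KML overlay
# that Google Earth / Google Maps can load.

from xml.sax.saxutils import escape

def to_kml(placemarks):
    """placemarks: iterable of (label, lat, lon) -> KML document string."""
    body = "\n".join(
        "<Placemark><name>%s</name>"
        "<Point><coordinates>%s,%s</coordinates></Point></Placemark>"
        % (escape(label), lon, lat)        # KML wants lon,lat order
        for label, lat, lon in placemarks
    )
    return ('<?xml version="1.0" encoding="UTF-8"?>\n'
            '<kml xmlns="http://www.opengis.net/kml/2.2"><Document>\n'
            + body + "\n</Document></kml>")

# Illustrative coordinates only
entries = [("Tianjin University of Science and Technology", 39.0, 117.7)]
kml = to_kml(entries)
```

The resulting file can simply be opened in Google Earth, or served and loaded into a Google Map as an overlay.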
P.

Posted in "virtual communities", open issues, semanticWeb | 7 Comments

Open NMR: contributions from the community about outliers and assignments

We are delighted at the practical and helpful contributions from members of the community in helping to understand or correct outliers in the data set we are using. This is exactly what we hoped would happen at the start of the project, and it has now started to gain momentum. I list some of them below to acknowledge the help. It also highlights the need for better tools for such collaborative projects – a blog is a poor mechanism, but wikis also have their failings.
To reiterate:

  • Nick has been through the dataset by hand and identified all data sets with potential misassignments or other anomalies. This was done by comparing agreements within each set. A data set is likely to have been flagged if (a) it has a single widely outlying shift, (b) two peaks (a, b) have swapped coordinates (yb, xa), giving an “X”-like pattern (as we have shown), or (c) it has a general scatter considerably greater than the average.
  • Nick will post the major outliers based on RMSD. I don’t know how many there will be but I expect about 50 (hence the “20%”). These will be clickable – i.e. anyone with an SVG browser can immediately find out which peak is linked to which atom.
  • After, and only after, these have been cleaned or accepted, we will try to see if there are systematic effects in the data – either in the variance or the precision. We could expect that data from various sources could provide much of the variation – or the date, or the field strength, or the temperature, or the solvent. Unfortunately we do not have all the metadata, as it isn’t present in the CMLSpect files.
  • Finally we may be able to comment on Henry’s method. It is possible that certain functional groups have problems (Nick has some suspicions) but at present these are overwhelmed by variance from other sources in the experiment or its capture.
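The three flags in the first bullet can be sketched in a few lines of Python. This is a hedged illustration, not Nick's actual code: the tolerances are invented, and a real implementation would calibrate them against the whole collection.

```python
# Hedged sketch (not Nick's code) of the three flags: (a) a single wide
# outlier, (b) a swapped pair giving an "X" pattern, (c) large scatter.
# Input: one data set of (observed, calculated) shift pairs in ppm.

def flag_dataset(pairs, single_tol=10.0, scatter_tol=5.0):
    """Return a list of reasons why this data set looks suspect."""
    n = len(pairs)
    deltas = [obs - calc for obs, calc in pairs]
    mean = sum(deltas) / n
    rmsd = (sum((d - mean) ** 2 for d in deltas) / n) ** 0.5
    reasons = []
    # (a) exactly one shift deviates far from the rest
    if sum(1 for d in deltas if abs(d - mean) > single_tol) == 1:
        reasons.append("single outlier")
    # (b) exchanging the calculated values of two peaks brings both
    #     much closer to the y = x line -- the "X"-like pattern
    for i in range(n):
        for j in range(i + 1, n):
            (oi, ci), (oj, cj) = pairs[i], pairs[j]
            before = abs(oi - ci) + abs(oj - cj)
            after = abs(oi - cj) + abs(oj - ci)
            if after + single_tol < before:
                reasons.append("possible misassignment (swap %d,%d)" % (i, j))
    # (c) general scatter well above what we expect on average
    if rmsd > scatter_tol:
        reasons.append("large scatter")
    return reasons
```

For example, a data set containing two swapped peaks trips the misassignment flag, while a well-behaved set returns no reasons at all.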

So here are examples of useful comments. (I am not sure why Pachyclavulide-A is relevant – I can’t find it by name search in NMRShiftDB – but the effort is appreciated. However we are primarily looking for comments on the outliers we have identified.)

  1. Egon Willighagen Says:
    October 26th, 2007 at 1:15 am
    The first one is another misassignment. Look up the structure in the NMRShiftDB and you will see one correctly assigned and one misassigned spectrum. This kind of issue should be filed as a ‘data’ bug report at:
    http://sourceforge.net/tracker/?atid=560728&group_id=20485&func=browse
    I will do this one.
  2. Egon Willighagen Says:
    October 26th, 2007 at 1:17 am
    Filed as:
    http://sourceforge.net/tracker/index.php?func=detail&aid=1820353&group_id=20485&atid=560728
  3. Wolfgang Robien Says:
    October 26th, 2007 at 8:48 am
    Another error: Pachyclavulide-A (should be C26 instead of C27), MW=510
    Found automatically by the following procedure within CSEARCH:
    Search for all unassigned methyl groups located at a ring junction. The methyl group must be connected either with an up or a down bond. As an additional condition, it can be specified whether only “Q’s” are missing or whether the multiplicity of missing lines can be ignored. I think this is a quite sophisticated check which goes into the deep details of possible error sources. […]
  4. hko Says:
    October 27th, 2007 at 12:02 pm
    Misassignments in NMRShiftDB (10008656-2) removed.
  5. hko Says:
    October 27th, 2007 at 5:28 pm
    Misassignments in NMRShiftDB (10006416-2) removed. 45.0 and 34.4 reversed.
Posted in nmr, open issues, open notebook science | 1 Comment

Open NMR calculations: intermediate conclusions (comments)

I posted our intermediate conclusions on Nick Day’s computational NMR project, and have received two lengthy comments. I try to answer all comments, though as Peter Suber says in his interview sometimes comments lead to discourses of indefinite length. I am taking the pragmatic view that I will mainly address comments (and subcomments) that:

  • address our project as we defined it (not necessarily the project that others would like us to have done)
  • add useful information (especially annotation of suspected problems)
  • or show that our scientific method is flawed or could be strengthened
  • relate to Open issues. In our present stage of robotic access to and re-use of data we can only realistically use databases that explicitly allow re-use of data and do not require special negotiation with the owners

There has been a great deal of discussion (far more than we had expected) of our project. Some of this has been directly relevant in responding to our direct requests for annotation of specific outliers, and we acknowledge posts from Egon Willighagen, Christoph Steinbeck, Jean-Claude Bradley, Wolfgang Robien, and the University of Mainz. A lot of the discussion has been of general interest but not directly relevant to the aims of the project, which were to show what fully automatic systems can do, not to create specific resources (“clean NMRShiftDB”). It is possible, though not necessary, that the work might be more generally valuable depending on what we found.

Wolfgang Robien: October 27th, 2007 at 12:03 pm

You wrote: ….only Open collection of spectra is NMRShiftDB – open nmr database on the web.
Also the SDBS system can be downloaded – as far as I remember it’s limited to 50 entries per day (which should be no problem because QM calculations are quite slow compared to HOSE/NN/Incr).

PMR: thank you for reminding us. I have not used SDBS and it looks a useful resource for checking individual structures but it is inappropriate for the current work as:

  • robotic download is forbidden
  • there is no obvious way of downloading sets of structures – they need to be the result of a search
  • there is no obvious machine-readable connection table (there is a semantically void image).
  • there are no 3D coordinates (this is not essential but it meant Nick could work almost immediately)

It is possible that if we wrote to the maintainers they would let us have a dataset, but this would double the size of the project at least.

If you need 500 entries with a certain specification (e.g. by elements, molwt, partial structure, etc.) and you want to perform a common project, please let me know …..

PMR: Thank you. This is a generous offer and we may wish to take it up. Contrary to your comment that I am an NMR expert, I’m really not – I’m an eChemist and this exercise in NMR is because I wish to liberate NMR from the pages of journals. If we find that Henry’s program needs more data or that yours has fewer problems it could be extremely valuable. We would wish the actual data to be Open so that others can re-use it.
PMR is quoted as “We downloaded the whole of NMRShiftDB. When we started we had NO idea of the quality. …”
That is quite true. There were a number of public comments on NMRShiftDB ranging from (mild) approval to (mild) disapproval, some scalar values for RMS against various prediction programs and some figures on misassignments, etc. These gave relatively little indication of the detailed data quality – e.g. the higher moments of variation.
If there is currently a full list of NMRShiftDB entries with your annotations this would be valuable. Currently I can find a number of comments on individual entries with gross problems at
http://nmrpredict.orc.univie.ac.at/csearchlite/hallofshame.html
but these seem anecdotal rather than a complete list.
PMR: and the second set of comments

ChemSpiderMan Says: October 27th, 2007 at 12:43 pm

  • 2) Regarding “the only Open collection of spectra is NMRShiftDB – open nmr database on the web.” Just to clarify, these are NOT actually NMR spectra. Unless NMRShiftDB has a capability I am not aware of, NMRShiftDB is a database of molecular structures with associated assignments (and maybe in some cases just a list of shifts… maybe all don’t have to be assigned).
    PMR: Thank you for the correction. I should have said peaklists with assignments.
    4) Regarding “We knew in advance that certain calculations would be inappropriate. Large molecules (> 20 heavy atoms) would take too long.” The 20 heavy atom limit is a real constraint. I judge that most pharmaceuticals in use today are over 20 atoms (xanax, sildenafil, ketoconazole, singulair for example). I would hope that members of the NMR community are watching your work as it should be of value to them, but I believe 20 atoms is a severe constraint. That said, I know that with more time you could do larger molecules, but a day per molecule is likely enough time investment.
    PMR: We have strategies for dealing with larger molecules but are not deploying them here.
    6) Regarding “So we have a final list of about 300 candidates.” Out of a total of over 20,000 individual structures, your analysis was performed on 1.5% of the dataset. How many data points was this, out of interest?
    PMR: I expect about 6-20 shifts per entry. Some overlap because of symmetry.
    7) Regarding “probably 20% of entries have misassignments and transcription errors. Difficult to say, but probably about 1-5%”. This suggests about 25% of the shifts associated with my estimated 3000 shifts are in error. This is about 750 data points, and this conclusion was made by the study of 300 molecules. For sure the 25% does not carry over to the entire database. It is of MUCH higher quality than that. My earlier posting suggested that there were about 250 BAD points. The subjective criteria are discussed here (http://www.chemspider.com/blog/?p=44). Wolfgang suggested about 300 bad points, but we were both being very conservative. You discussed the difference between 250 and 300 here on your blog, as you likely recall: http://wwmm.ch.cam.ac.uk/blogs/murrayrust/?p=346
    PMR: Nick will detail these later. We believe that the QM method is sufficiently powerful to show misassignments of a very few ppm – I will not give figures before we have done the work. With known variance it is possible to give a formal probability that peaks are misassigned. I have shown some examples of what we believe to be clear misassignments, but we have not gone back to the authors or literature (which often does not have enough information to decide). I do not believe you can compare your estimates with ours, as you and we have not defined what a misassignment is.
    8) Regarding “We realise that other groups have access to larger and, they claim, better data sets. But they are closed. I shall argue in a later post that closed approaches hold back the quality of scientific data.” I think your comments are regarding Wolfgang Robien and ACD/Labs. It is true that we have access to larger datasets, but we can limit the conversations to NMRShiftDB since we ALL have access to that. Robien’s and ACD/Labs’ algorithms can adequately deal with the NMRShiftDB dataset. For the neural nets and increment-based approach, over 200,000 data points can be calculated in less than 5 minutes (http://www.chemspider.com/blog/?p=213). You have access to the same dataset and can handle 300 of the structures. Your statement is moot… it is NOT about database size but about algorithmic capabilities.
    PMR: My statement was about the size and quality of datasets and is completely clear. It has nothing to do with algorithms. I am not interested in comparing the speed of algorithms but am concerned about metrics for the quality of data. I shan’t discuss the speed of algorithms unrelated to the current project.
Posted in nmr, open issues, open notebook science | 4 Comments

    Open NMR calculations: intermediate conclusions

    Over the last 1-2 weeks Nick Day has been calculating NMR spectra and comparing the results with experiment. As there appears to be considerable interest we have agreed to make our conclusions Open on an almost daily basis. These lead to some compelling observations on the value of Open Data which I shall publish later. To summarise:
    We needed an Open Data set of NMR spectra and associated molecular structures. The data had to be Open because we wished to publish all the results in detail without needing to go back to the “owners”. The data also required annotation of the spectra with the molecular structure (“assignment of peaks”). Ideally the data should be in XML, as this is the only way of avoiding potential semantic loss and corruption, but we would have managed with legacy formats.
    There are well over 1 million spectra published each year in peer-reviewed journals. Almost NONE are published in semantic form – most are as textual PDFs or graphical PDFs. It is also unclear how many of these could be robotically downloaded without the publishers sending lawyers – at least Elsevier allow us to do this. In any case we would have to use OSCAR to extract the data, probably involving corruption and loss.
    So we looked for Open collections of spectra. There are many, with an aggregated count of probably over a million spectra. However almost all are completely closed – they require licence fees and forbid re-use. I have criticized this practice before and shall do so later, but here I note that the only Open collection of spectra is NMRShiftDB – open nmr database on the web. This has been created by Christoph Steinbeck and colleagues and contains somewhat over 20,000 spectra. Because Christoph is a member of the Blue Obelisk the data can be exported in CMLSpect (XML), without which Nick Day’s project would not have been possible.
    We downloaded the whole of NMRShiftDB. When we started we had NO idea of the quality. This was deliberate, and part of the eScience: can a robot automatically determine the quality of data? Several correspondents have missed this point – it was more important to answer this question than whether the data were “good”. IF AND ONLY IF the data AND METADATA were of sufficient quality would it be possible to say something useful about the value of the theoretical calculations.
    We knew in advance that certain calculations would be inappropriate. Large molecules (> 20 heavy atoms) would take too long. Molecules with floppy groups cannot be easily analysed. So we selected small molecules with rigid frameworks. This gave a starting set of about 500 candidates, each of which takes on average 1 day to calculate.
    About 200 of these had computational problems – mainly that they were too large, they didn’t converge, Condor had problems managing the jobs, etc. So we have a final list of about 300 candidates.
    We have listed the analysis of these results over the last few days. It is clear that some entries have “errors” and that there were also defects in the initial calculation models. Henry knew the latter, of course, but if we hadn’t known this at the start we would have been able to hypothesise that Br and Cl gave rise to serious additive deviation. So this is at least confirmation of a known effect, for which we have made empirical corrections based on theory.
    We have shown examples of poor agreements and anticipated all of them in principle. The data set contains many problems including:

    • wrong structures. I am sure that at least one structure does not correspond to the spectrum
    • misassignments. These are very common – probably 20% of entries have misassignments
    • transcription errors. Difficult to say, but probably about 1-5%
    • conformational problems. There are certain molecules which have conformational variability (i.e. are floppy) but we have only calculated one conformer. The most common example is medium-sized rings
    • human editing of data leading to corruption. 2 entries at least

    As a result Nick is going to produce a cleaned data set manually. He has already done most of this (has he slept?). He cannot do this automatically as the metadata are not present in the XML files. He will then be in a position to answer the questions:

    • how much of the variance is due to experimental problems?
    • if this is lower than, say, 70%, is it possible to detect systematic “errors” in the computational methodology?
    • if so, can the method be improved?

    If we can believe in the methodology then we can start to use it as a tool for analysing future data sets. But until then we can’t.
    We realise that other groups have access to larger and, they claim, better data sets. But they are closed. I shall argue in a later post that closed approaches hold back the quality of scientific data.

    Posted in nmr, open issues, open notebook science | 2 Comments

    Joe Townsend: textual and crystallographic eScience

    Joe Townsend has worked with our group for ca. 6 years. As an undergraduate he worked as a summer student and was one of the first co-authors of OSCAR. He’s submitted his thesis and is being examined on Wednesday. His work has greatly informed what Nick Day has been doing. Here’s a snippet of what he has in his thesis (we are currently writing the paper)
    He extracted small molecular entities from the data – what is now CrystalEye – WWMM – and optimised the geometry in much the same way as Nick has been doing. He then compared the calculated and observed geometries of over 1000 entries and got:
    joegamess1.PNG
    This shows a wide scatter (y is calc, x is observed). Is the deviation due to problems with the data or problems with the model or both? Have a think and then read on…
    By carefully analysing outliers Joe came up with about 10 ways that the data might have problems. Because the data were Open, and because the metadata were Open and rich, Joe was able to create a protocol that filtered out entries with potential problems in the data. The protocol was NOT based on the fact of an entry being an outlier, but with some aspect of metadata. (Here is PART of the protocol)
    protocol1.PNG
    As a result it was possible to devise a machine procedure which AUTOMATICALLY created a cleaned data set. The resulting comparison then looked like this:
    joegamess2.PNG
    You can see immediately that although there are many fewer entries the agreement is excellent. As a result, and only as a result, it was possible to find outliers where there were potential concerns about the quality of the calculated data. The effects are small, but probably real. We’ll see if the examiners agree. (The obvious outlier above (1.31, 1.25) is due to differences in the models – gas-phase versus crystal – i.e. “crystal packing forces”)
    This shows that in principle it is possible to create robots which use both theory and experiment to improve each other. It relies on having good open metadata and open data.
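A filter of the kind Joe built can be sketched as follows. This is illustrative only: the field names (r_factor, disordered) and thresholds are hypothetical stand-ins for the crystallographic metadata his real protocol tests, and the point is only the shape of the approach, i.e. rejecting entries on metadata rather than on their being outliers.

```python
# Illustrative sketch of a metadata-based cleaning protocol (NOT Joe's
# actual rules): each rule inspects an entry's metadata dict, never its
# agreement with the calculation, so the filter cannot bias the result.

RULES = [
    ("high R-factor", lambda m: m.get("r_factor", 0.0) > 0.05),
    ("disorder flagged", lambda m: m.get("disordered", False)),
]

def clean(entries):
    """Keep only entries (metadata dicts) that trip none of the rules."""
    return [m for m in entries
            if not any(test(m) for _, test in RULES)]
```

Because every rule is a pure function of metadata, the cleaned set is produced automatically and reproducibly, which is exactly what makes the robot possible.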
    It is impossible with closed data. The next post will contrast Nick’s results for NMR.

    Posted in data, open notebook science | Leave a comment

    Open Notebook : more ideas

    Cameron Neylon has made a very useful comment on the Open Notebook philosophy which I can go along with:

    Cameron Neylon Says:
    October 26th, 2007 at 8:51 am
    I’ve come in a bit late on this. I am with Jean-Claude and Bill Hooker, I think. I would call this as it stands an ‘Open’ or ‘Public’ experiment rather than Open Notebook Science. This is not to say it is a bad thing. And the motivation for holding back a little on the data is a very good and reasonable one. There is also a grey area that Bill noted, which is that obviously the data are not made immediately available, but our approach is that they should be made available as rapidly as is practicable.

    PMR: I also agree with this – so all protagonists including myself are agreed that what we are doing is not Open Notebook Science. I reiterate that it is our intention to make the protocols and data open as soon as practicable. Some of this is simply technical – we would like to write to wikis automatically but haven’t got the technology in place yet. By daily postings we are effectively making our evolving protocol Open.

    I see the slogan ‘No insider information’ as a goal to work towards rather than something necessarily achievable. It is a challenging one, but it is what we aim for. We are working towards getting our analytical instruments to autopost to our blog, so let me make an analogy here. If a student of mine puts on an analysis overnight, the results ideally would be published directly to the blog as they come off the instrument. It is possible that someone in Australia (or California) would see these, notice that we have discovered a new enzyme activity/new drug target inhibitor, and then claim the observation.
    We explicitly take this risk. In particular, for some of the large facility experiments I am planning, I will put up raw and partially processed data that it will take me some months to get through the analysis of – someone else may beat me to it. But think this through: they could claim the discovery (and to do so would have to do it rapidly – via a blog/wiki). They would have to refer to the dataset (because they won’t have the equivalent dataset), and so they would have to make the observation public in non-peer-reviewed form. For the deliberate spoiler I think you can argue that there would be a rapid and very negative public response.

    PMR: I agree that this is true for large public datasets that cannot be replicated. The particle physicists have a very carefully worked out protocol for when data can be released and who can work on it and who gets credit. So do some of the astronomers and geospatial communities. But chemistry has no tradition of releasing data (and much tradition of not releasing it) so we are encountering birth pains.

    Two cases where there is potential difficulty. Someone being ‘helpful’ by making an observation that I would have made (basically the obvious conventional data analysis). This means you feel obliged to give credit. I would say this is still fine to include as a student’s work in a thesis, but I would feel obliged to give credit (authorship) in a publication. But there is clearly a very large grey area here. We want people to find things we’ve missed – this is part of the reason we are doing this. And there are many cases where someone sees something that is obvious in hindsight, but it is very difficult to pin down whether you would have seen it unless you were looking.

    PMR: Agreed. This is possible for CrystalEye – anyone is able to inspect our histograms of bond lengths and come up with stuff that we and others have missed. We really hope they do so.

    The second difficult area is when do you feel that data is ‘fair game’ for re-use. If I leave a piece of interesting data on the blog for six months and make no comment and publish no paper, does this mean someone else can have a go and feel free to go with it, perhaps publish independently? 12 months? 18 months? I think there is a need to develop or evolve some sort of code of good practice here. We don’t want people having to ask permission every time before playing with our data – but we want them to play nicely, giving due credit where appropriate. Perhaps we should tag datasets as ‘I’m done here – feel free to go at it’ or ‘Anybody got any ideas?’. I will try to post on this if I can find some time over the next few days.

    PMR: These are useful suggestions. I would certainly intend that there was a “fair re-use” moment. ONS says that is at the moment of conception. The Protein Data Bank says it is at the moment of public release of the dataset which may be months after deposition – it takes time to go through the system and there are some embargoes (usually not more than 6 months).
    Part of the problem with the current exercise – and why it isn’t immediately suitable for ONS – is that it would be possible for someone to replicate the whole work in a day and submit it for publication (on the same day) and ostensibly legitimately claim that they had done this independently. They might, of course, use a slightly different data set, and slightly different tweaks. The other factor is that data in NMR seem to be so valuable – there are still daily comments on this blog from one group attacking another group (independently of us) – that it is difficult to be objective.

    Posted in nmr, open issues, open notebook science | 2 Comments

    Computational NMR: treatment of outliers, we need your help

    We have posted a number of cases where the calculated NMR shifts do not agree with the observed ones, and also indicated over 25 possible reasons for this – some due to errors or features in the experiment, some in the calculation. A priori we do not know which causes are more frequent. Indeed a number of correspondents have suggested that our methods and the data both contain significant problems. And it is also clear that if it is possible to get good agreement between calculations and a data set, this would justify both and would be seen as of considerable communal value.
    So we now intend to pursue these investigations of quality – in public, and resulting in Open Data released, although not conformant to Open Notebook, within a very short period of its creation. We are exposing all our results and thinking so that the final data set and protocol are as transparent as possible, which should avoid much of the debate that occurs for closed data and methods.
    The first step is to analyse the causes of variance in the data. We have two measures, precision (the variance) and accuracy (agreement of absolute value). Initially they are not independent in that a few very large errors in single shifts will cause both variance and imprecision.
    For each data set (which contains between 1 and 20 observed and calculated shifts) we fit the data to
    y = x + c + eps
    This contains 1 adjustable parameter, c – the offset. For each point in a data set we can calculate the signed deviation (delta) between observed and calculated. For each data set we can calculate the root mean square of delta (RMSD). Here is our current plot, with NO data points omitted.
    scatter.PNG
    Most of the data are clustered in a roughly normal fashion with a mean value of c close to zero. Do not try to read any more information into any apparent systematic variation as the outliers are not uniformly distributed and have high leverage.
    At this stage we do not know whether any outlier is caused by failings in the data, failings in the method or both. It would be totally irresponsible to omit points simply because they didn’t agree. We must identify the causes of error for major outliers, and – if they make scientific sense – we may then argue they should be omitted.
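The per-data-set fit described above can be sketched in a few lines (a sketch, not Nick's actual code): least squares on y = x + c over the single parameter c gives c = mean(y − x), and the RMSD follows from the signed deviations about the fit.

```python
# Sketch of the fit y = x + c + eps for one data set, where x is the
# observed shift and y the calculated one (not Nick's actual code).

def fit_offset(observed, calculated):
    """Fit calculated = observed + c; return (c, signed deltas, RMSD)."""
    raw = [y - x for x, y in zip(observed, calculated)]
    c = sum(raw) / len(raw)            # the one adjustable parameter
    deltas = [d - c for d in raw]      # signed deviations about the fit
    rmsd = (sum(d * d for d in deltas) / len(deltas)) ** 0.5
    return c, deltas, rmsd
```

Running this over every data set gives the (c, RMSD) pairs that the scatter plot summarises, with no points omitted.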
    If each error is independent of every other error then there is no option but to examine every case in detail. For example, if the sole cause were serious transcription errors there would be no way of automating this checking (other than by running OSCAR over publishers’ PDFs). However there may be some causes of error which occur frequently. Egon has already shown us that misassignment of peaks can happen, and in fact we believe this is a frequent occurrence. If we can detect this in a manner that convinces the community then we can legitimately remove these entries from the set.
    It is also possible that certain types of experiment may show large variance. Egon has suggested that some solvents (e.g. acetone) may affect the shifts. If, for example, outliers contained a high proportion of acetone as solvent relative to – say – chloroform, then we could hypothesize that acetone caused variance. We might try to correct for this in some scientifically acceptable way (e.g. by modelling acetone in the QM calculations) – alternatively we might reduce the scope of our calculations to those experiments not done in acetone.
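A hedged sketch of such a check (illustrative only, not part of our protocol): compare the frequency of a solvent among the outlier entries with its frequency in the whole collection. An enrichment well above 1 would support, though not prove, the solvent-causes-variance hypothesis.

```python
# Illustrative sketch: is a given solvent over-represented among the
# outlier data sets compared with the collection as a whole?

from collections import Counter

def solvent_enrichment(all_solvents, outlier_solvents, solvent):
    """Ratio of the solvent's frequency among outliers to its overall
    frequency; > 1 means over-represented among outliers."""
    overall = Counter(all_solvents)[solvent] / len(all_solvents)
    in_outliers = Counter(outlier_solvents)[solvent] / len(outlier_solvents)
    return in_outliers / overall if overall else float("inf")
```

For instance, acetone appearing twice as often among outliers as in the whole set gives an enrichment of 2, which would be worth following up with a proper significance test before drawing any conclusion.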
    Similarly we may find that certain types of chemical groups are associated with outliers. This is indeed true for the heavier halogens (and heavier elements). Henry had already predicted this from his own work, but if he had not, it would have been legitimate to hypothesize that halogens caused systematic error, and normal regression techniques could lead to values for corrections (and the variance of the correction). Indeed we may find it useful to compute regression-based values for those elements showing spin-orbit coupling.
    However we still expect outliers with their own, isolated, causes. In this case the first action is to return to the literature source. I have done this for one of the outliers, and can find no transcription errors so I have appealed to the community for their collective wisdom. Jean-Claude has suggested it may be due to tautomerism, but I would welcome other ideas.
    What we intend to do, therefore is to publish the interactive data for outliers (i.e. clickable plots, highlighting atoms in Jmol and with links to NMRShiftDB) and ask for community input. All results will be Open and will be immediately available.
    Our intention – as we set out earlier – is:

    • create a small subset of NMRShiftDB which has been freed from the main errors we – and hopefully the community – can identify.
    • use this to estimate the precision and variance of our QM-based protocol for calculating shifts.
    • refine the protocol in the light of variance which can be scientifically explained.
    Posted in nmr, open notebook science | 4 Comments

    Computational NMR : more outliers

    Here is a very common deviation from linearity, which I believe we can deal with. We believe we understand why, but would welcome confirmation (or dissension). And more important is whether we are allowed to do anything about it:
    8656.PNG
    You can see the original data at: nmrshiftdb10008656-2 (solvent: chloroform). As you can see, the points do not lie on
    y = x + c + eps
    Why?
    and another nmrshiftdb10006416-2 (solvent: chloroform)
    6416.PNG

    Posted in nmr, open notebook science | 5 Comments

    Open Notebook – reflections and conclusion

    Jean-Claude and Bill are right to point out that in the last week it has been inappropriate to use the term “Open Notebook Science” and I shall no longer use it in conjunction with the NMR work that Nick, Henry, Christoph and I have been working on. Nick and I have, however, been doing fully conformant Open Notebook Science for months in the CrystalEye project. In this project the aims are transparent – the robots collect crystallographic data every day and pass it through an elaborate and tested set of validation procedures so that we feel extremely confident of the accuracy and precision of the data. Every night the robots transform the data, collect statistics, add chemistry and create indexes – we feel justified in believing that it is among the best heterogeneous scientific knowledge bases. There is no “insider knowledge” – everyone has access to the data as soon as it comes off the machine. The entries and the statistics are available to all and anyone can use the data without our explicit blessing. So we know what Open Notebook is and we practice it. The data are labelled Open Data and I have explained why. And, although not required, the code is Open Source and distributed as soon as it is robust.
    We had hoped to use the same model for the NMR project. We have been preparing the software so it would write to the web directly. This has not proved easy. Meanwhile we have set out our project aims very clearly. We have also defined our protocol for analysing variance in some detail.
    The science is going well – we are on track according to our schedule. But the Open Notebook isn’t working out. Although the community at large is interested, their interests are highly varied and in several places inconsistent. The community does not seem to wish to support our project aims – that’s fine – but it doesn’t make it easy to have a large virtual collaborative project.
    We shall continue on the project, one of whose purposes is to investigate the hypothesis that QM calculations can be used to evaluate the quality of NMR spectra to a useful level. We shall continue to post results – hopefully with the same frequency as up to now. Since we shall have had a few hours to analyse the results before posting, we cannot call them “Open Notebook”, but we hope they will have pedagogic value and we also invite constructive criticism.

    Posted in nmr, open notebook science, Uncategorized | 5 Comments

    Open(?) Notebook NMR – is it really Open Notebook?

    1. Jean-Claude Bradley Says:
      October 25th, 2007 at 2:15 pm
      Concerning your comment:
      We have so far shared every piece of data and metadata that we feel is fit to publish. Open does not mean “immediate”.
      True that “open” does not mean “immediate” but the term Open Notebook Science does imply that, following the principle of “no insider information”:
      http://drexel-coas-elearning.blogspot.com/2006/09/open-notebook-science.html
      and a recent rant here:
      http://usefulchem.blogspot.com/2007/10/science-is-about-mistrust.html
      In other words, if you and your student selectively publish results so that there is a public notebook and a private one, that does not fit with ONS.
      Definitions are a hassle sometimes. But as you have shown with the term “Open Access” we have to keep discussing these issues to make sure all assumptions are explicit.

    PMR: This is a very important point and I put my hand up… We’ll need to think about it. It may be a matter of timescale – we are moving to make our results available within days, not weeks. But it is also true that we do not, currently, expose enough for any reader in the world to be able to do exactly the same as us at any given time.
    However, it is very difficult not to have insider information in any project. In our case we do not share our directories with the world. But equally, J-C does not share his physical samples with the world. For example, he would be able to get a crystal structure or spectrum performed before anyone else. He would know the results of this minutes or hours before he told the world. He would notice colour changes in a reaction as it happened and before the rest of the world knew about it. He would know from his colleagues that the reagents used at Drexel had been found to be suspect.
    In our case everything we do is, in principle, repeatable. We are going through the process of cleaning the data set. That is the primary scientific operation. And we are asking the world to help. And thanks to those who have done so.
    So I will replace the title with “Open Computational NMR”. It’s time for a change anyway.

    Posted in nmr, open notebook science | 4 Comments