Open Notebook NMR: Commercial re-use of data?

Posted on October 23, 2007 by pm286

Antony Williams of Chemspider has offered to participate in our Open Notebook NMR experiment. Now this offer has been joined by ACDLabs – I am not sure of the formal relation between the companies but they have clear common interests. I had originally thought this was one individual making a personal offer – now there is a company that is requesting our data for them to work on. I have some genuine concerns about how we should proceed so am clearing my thoughts online. Readers will recall that I have been strongly critical of companies or nonprofits which use closed source and protect closed data. I have roughly equal numbers of correspondents who think I have been too hard on these organizations and those who think I need to be tougher. So I am taking a measured tone here.

Peter – FYI ACD/Labs are ready to participate in the work as discussed: http://www.chemspider.com/blog/?p=213#comment-3735
Peter and Tony,
I think this is a fantastic project and am very keen to see how accurate the QM techniques prove to be for the subset of structures that you choose from the NMRShiftDB, and then how helpful they can be in improving the accuracy of experimental shifts in this wonderful resource.
For the purposes of this work, we would be willing to provide the chemical shift predictions from the ACD/Labs software if you would like to use them in your comparison. If, for instance, they prove to be accurate enough to find many of these problems without the need for time-consuming QM calculations, it may be preferable to use the faster calculation algorithms that are available in our software. It may turn out that the ACD/Labs predictions could serve as a pre-filter to define which structures need the QM calculations and which don’t. Many variations on this theme come to mind, but we won’t know which are useful until we do the work.
Sincerely,
Brent Lefebvre
NMR Product Manager
Advanced Chemistry Development, Inc.
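
The pre-filter idea in this comment can be made concrete with a small sketch. This is only an illustration under stated assumptions: the function names, the 5 ppm cut-off and the example shifts are hypothetical, and nothing here reflects how ACD/Labs or the NMRShiftDB actually store or compare data. A structure is flagged for the expensive QM calculation when any of its fast-predicted shifts disagrees with the recorded experimental shift by more than the cut-off.

```python
from typing import Dict, List, Tuple

# Hypothetical record: structure ID -> (experimental shifts, fast-predicted shifts), in ppm.
Entries = Dict[str, Tuple[List[float], List[float]]]


def needs_qm(experimental: List[float], predicted: List[float],
             threshold_ppm: float = 5.0) -> bool:
    """Return True if any atom deviates by more than threshold_ppm (an assumed cut-off)."""
    if len(experimental) != len(predicted):
        raise ValueError("shift lists must have the same length")
    return any(abs(e - p) > threshold_ppm for e, p in zip(experimental, predicted))


def split_for_qm(entries: Entries,
                 threshold_ppm: float = 5.0) -> Tuple[List[str], List[str]]:
    """Split structure IDs into (flagged for QM, passed by the fast filter)."""
    flagged, passed = [], []
    for structure_id in sorted(entries):
        experimental, predicted = entries[structure_id]
        if needs_qm(experimental, predicted, threshold_ppm):
            flagged.append(structure_id)
        else:
            passed.append(structure_id)
    return flagged, passed


# Made-up example data, for illustration only.
entries = {
    "mol-A": ([128.5, 77.2, 21.0], [128.9, 76.8, 21.4]),
    "mol-B": ([170.1, 52.3, 14.2], [170.5, 60.9, 14.0]),
}
flagged, passed = split_for_qm(entries)
print("run QM on:", flagged)
print("no QM needed:", passed)
```

Whether such a filter is useful depends entirely on the cut-off and on how often the fast method and the experiment are wrong in the same direction, which is one of the things the comparison would have to establish.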

and Antony also requests the data:

Peter, all that is needed to perform the calculations for comparison using the ACD/Labs NMR predictors is a download of the exact dataset Christoph provided to you (we have already had issues with comparing algorithm to algorithm but using different versions of the NMRShiftDB database…not good). Also, if Nick can send us the ID of the structure inside the NMRShiftDB this should be enough. Thanks

and he points to how he/ACDLabs can predict chemical shifts:

Also, plots of the nature shown at the following page http://www.acdlabs.com/products/spec_lab/predict_nmr/chemnmr/ are generally of value when comparing improvements in algorithm performance. It’s a good way to compare your own improvements in algorithm (or Henry’s) as well as between algorithms.

The link here points to a report of NMR calculations done with ACDLabs software and ChemNMR (Cambridgesoft). Note that both companies have developed their own software for chemical shift predictions, but the software is closed in that it can only be used if purchased, and anyway the algorithms are not visible. Note that Cambridgesoft patent algorithms – I don’t know whether ACDLabs do. The data are also closed in that they are not generally available and I have no idea of their general chemical makeup. From the page about calculating chemical shifts:

During the history of the development of the ACD/Labs NMR prediction tools we have compiled and checked data from thousands of literature articles to build databases of over 165,000 assigned structures for ¹H and ¹³C, 8000 for ¹⁵N, 13,800 for ¹⁹F, and 22,600 for ³¹P. With these assignments and structures as a basis and using our correlation algorithms developed over the past ten years, it should be obvious that our fragment-based predictions will offer superior performance to any rules-based systems. Fragment-based prediction offers the opportunity to cover wide ranges of structural diversity not available to rules-based systems. It is this type of performance that has enabled Advanced Chemistry Development to become the industry leader in NMR prediction today. Rigorous testing of our capabilities relative to other software packages has resulted in our software becoming the package of choice on a worldwide level for companies such as Pfizer, Astra Zeneca, GlaxoSmithKline, 3M, Eastman Kodak Co., Millennium Pharmaceuticals and many others.

and the results:

¹³C NMR Prediction Comparison

Software                      r       r²      Standard Error (ppm)   Predicted ¹³C Chemical Shifts
ACD/CNMR 8.0                  0.999   0.998   2.33                   68,129
CambridgeSoft ChemDraw 8.0    0.995   0.990   5.30                   67,841

Table 2: Statistics comparing the accuracy of ACD/CNMR predictions to CambridgeSoft

Once again the table illustrates the remarkable accuracy advantage of ACD/CNMR as its error is less than half that of ChemNMR. Fully 64% of the predictions from CNMR are within 1 ppm of the experimental shift as opposed to only 32% in ChemNMR (Figure 2).
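
For readers who want to compute figures of this kind on an open dataset, here is a minimal sketch. It assumes that "Standard Error" means the root-mean-square deviation between predicted and experimental shifts; the quoted report does not define the term, so that is a guess, and the function name and example numbers are made up.

```python
import math
from typing import Dict, List


def comparison_stats(experimental: List[float], predicted: List[float],
                     tolerance_ppm: float = 1.0) -> Dict[str, float]:
    """Pearson r, r squared, RMS error and fraction of shifts within tolerance_ppm."""
    n = len(experimental)
    if n != len(predicted) or n < 2:
        raise ValueError("need two equally long lists with at least two shifts")

    mean_e = sum(experimental) / n
    mean_p = sum(predicted) / n
    cov = sum((e - mean_e) * (p - mean_p) for e, p in zip(experimental, predicted))
    var_e = sum((e - mean_e) ** 2 for e in experimental)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    r = cov / math.sqrt(var_e * var_p)

    # Assumption: "standard error" is taken as the RMS of (predicted - experimental).
    rms = math.sqrt(sum((p - e) ** 2 for e, p in zip(experimental, predicted)) / n)
    within = sum(1 for e, p in zip(experimental, predicted)
                 if abs(p - e) <= tolerance_ppm) / n

    return {"r": r, "r2": r * r, "rms_error_ppm": rms, "fraction_within": within}


# Made-up shifts in ppm, for illustration only.
experimental = [10.0, 20.0, 30.0, 40.0]
predicted = [10.5, 19.5, 32.0, 40.0]
stats = comparison_stats(experimental, predicted)
print(stats)
```

Note that r and r² stay very close to 1 whenever shifts span a range of 200 ppm, so the RMS error and the fraction within a tolerance are the more discriminating numbers.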

I do not know if the work is formally published in the scientific literature but it is available from an unusual source:

Please request the complete comparison document from your account manager or email sales@acdlabs.com.

This clearly raises issues that need agreeing, so I’m asking the blogosphere for comment.

If ACDLabs use their software to make calculations on our data what is the value? We cannot repeat the calculation and have no independent assessment of whether it is valid. I’m not accusing them of fudging the results but it is extremely easy to get good results by (even unconscious) selection of data. We have heard that formal analysis of chemoinformatics papers shows that many are suspect because of over- or under-fitting. Data sets and algorithms are never available, whereas we wish to make all our approach public.
If we do make our approach and data public then they can be assimilated into ACDLabs software and operations. In principle I’m not too worried about that as long as the Open Data is honoured. But I am more concerned that ideas will be assimilated without credit – this is a general problem of Open Notebooks and one we all have to address. This includes Henry’s ideas as well as Nick’s. Note that many people deliberately use CC-NC licences (non-commercial) on their data. I have resisted this (the CC-NC on the blog is there simply because I haven’t found out how to take it down and replace it with CC-BY).
If they take our data they are in a position to scoop us if, as they say, they have the best software in the world and much more data.
I am worried by the high level of marketing in their scientific report. While it is perfectly reasonable for a company to promote its products, and also reasonable to show that they outperform others, there is a danger that the primary motivation is to use the current exercise for marketing rather than science, and we would not wish to be associated with this.

These are clear issues but they cannot be solved easily. I am actually not clear what ACDLabs wishes to do – their data is a superset of ours as they have had access to NMRShiftDB like everyone else. I suspect our data will be cleaner as it is not easy to validate 150,000+ entries (although we have managed this with CrystalEye). And I’m not quite clear what we get out of it – we can calculate empirical (fragment-based) shifts with Christoph’s software – at least enough for a reasonable filter.
But I welcome comments.

This entry was posted in nmr, open issues, open notebook science.

8 Responses to Open Notebook NMR: Commercial re-use of data?

ChemSpiderMan says:

October 24, 2007 at 1:42 am

Peter, I’ll comment in more detail on this after your readers have a chance to comment. But, for clarity, I point your readers to http://www.chemspider.com/blog/?p=213 where I took your “George Whitesides approach to writing papers” and added to the conclusion. Here’s the piece I added.
“The results from the GIAO calculations were compared with three other prediction approaches provided by Advanced Chemistry Development. These algorithms were not limited in the number of heavy atoms that could be handled by the algorithm. The algorithms were a HOSE-code based approach, a neural network approach and an “increment approach”. A distinct advantage of these approaches is the time for prediction relative to the quantum-mechanical calculations. The QM calculation took a number of weeks to perform on the dataset of 23475 structures on a cluster of computers. However, a standard PC enabled the HOSE code based predictions to be performed in a few hours, the Neural Net predictions in about 4 minutes and the Increment based predictions in less than 3 minutes.
A comparison of the approaches gave statistics for the non-QM approaches superior to those of the QM approach. Outliers were observed in both cases and were traced to misassignments. QM approaches were generally more capable of predicting exotic structures while for the majority of the NMRShiftDB made up of general organic chemicals non-QM approaches were superior.”
There should be no surprise to you that ACD/Labs stepped forward to participate. I declared it explicitly in my blog posting.
I also posted in that blog the following statement “I believe this project offers the ability to help build a bridge between the Open Data community, the academic community and the commercial software community for the benefit of science. There has never been a study of the magnitude being discussed here comparing quantum-mechanical NMR prediction methods with the methods represented by commercial software products. I look forward to it!”
I fully acknowledge your stance on commercial software companies. Also on publishers. And many other areas. You’re not shy with your judgments. Having worked in academia, a Fortune 500 company and in a commercial software company I can comment that all three have good science going on, some excellent people in their organizations and certainly people committed to their roles and to science. I ask the question: why not help build the bridge rather than maintain the distance?
Further clarification: ChemSpider does not have access to any NMR prediction algorithms. However, they would be willing to work on this project for the science. The hypothesis under question is whether HOSE, Neural Net or Increment based algorithms can outperform GIAO predictions. It is already known that they are faster… these are real numbers generated already “on the dataset of 23475 structures on a cluster of computers a standard PC enabled the HOSE code based predictions to be performed in a few hours, the Neural Net predictions in about 4 minutes and the Increment based predictions in less than 3 minutes.” What is the statement on accuracy? I believe it’s a valid scientific question to be answered.
We have just submitted a publication regarding one aspect of this validated on NMRShiftDB with your collaborator, Christoph Steinbeck, as our collaborator. The title and authors are below; it should be in JCIM shortly. I have already sent you a copy, I believe.
The Performance Validation of Neural Network Based 13C NMR Prediction Using a Publicly Available Data Source.
K.A. Blinov§, Y.D. Smurnyy§, M.E. Elyashberg§, T.S. Churanova§, M. Kvasha§, C. Steinbeck#, B.A. Lefebvre† and A.J. Williams‡
§ Advanced Chemistry Development, Moscow Department, 6 Akademik Bakulev Street, Moscow 117513, Russian Federation
† Advanced Chemistry Development, Inc., 110 Yonge Street, 14th floor, Toronto, Ontario, Canada, M5C 1T4
# Steinbeck Molecular Informatics, Franz-John-Str. 10, 77855 Achern, Germany.
‡ ChemZoo Inc., 904 Tamaras Circle, Wake Forest, NC-27587
Christoph is running a company. ChemZoo is a company. ACD/Labs is a company. We all have good scientists.

Egon Willighagen says:

October 24, 2007 at 11:19 am

I do not feel that the computational predictions will really improve the ACD/Labs database; that database covers a wide range of structural types. Moreover, you already indicated the considerable overlap. In that respect, I am not sure whether ACD/Labs can really use your data as a test set. Personally, I would be more interested in seeing the prediction error, and comparing both methods, like Table 2, and showing that my algorithm did better than the competitors 🙂
Anyway, the true value lies in the ‘new’ molecules, which are not likely to be in the existing databases yet. That is, run these predictions on the suitable species in CrystalEye, which I’m sure you are doing too. That data set would be valuable.
Still, it’s predicted spectra, and modeling data after modeled data does nothing more than introduce another source of error. That is, the prediction from a HOSE++ model based on calculated spectra will not be better than the predicted spectra alone (if the HOSE++ model on experimental data is not better already).
I think the value in this ONN project is that you:
– can indicate possible problems in existing databases
– can indicate possible problems in published articles
– can predict NMR spectra for compounds for which no spectra are available (*)
– help validate experimental sections (**)
*) but I do not think that will significantly improve other products
**) I would love to see CMLExperimentalSection take shape, with the web service to validate the CMLES files, just like those CIF validation services (maybe we can organize a virtual BO hack session for that?)

Egon Willighagen says:

October 24, 2007 at 11:37 am

Oh… after reading [1] I realized another interesting advantage:
– calculate spectra for different solvents !
The ‘bad’ in [1] is interesting… can you make a plot ‘predicted versus measured’ per solvent?
That is, maybe the ‘bad’ is bad because the acetone solvent field is more difficult to model? I remember that solvents like acetone (which is polar on one side, not so polar on the other side, where the methyls shield the dipole) are particularly difficult to model; the solvent molecules take up different orientations around the solute depending on which functional groups interact with the acetone dipole…
[1] http://wwmm.ch.cam.ac.uk/blogs/murrayrust/?p=724
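
A per-solvent breakdown of the kind suggested in this comment could start from something like the following sketch. The record layout (solvent, experimental shift, calculated shift), the function name and the numbers are all assumptions made for illustration; they are not taken from NMRShiftDB or from the analysis in [1].

```python
import math
from collections import defaultdict
from typing import Dict, Iterable, Tuple

# Hypothetical record layout: (solvent name, experimental ppm, calculated ppm).
Record = Tuple[str, float, float]


def rms_error_by_solvent(records: Iterable[Record]) -> Dict[str, float]:
    """Group shift errors by solvent and return the RMS error for each group."""
    squared_errors = defaultdict(list)
    for solvent, experimental, calculated in records:
        squared_errors[solvent].append((calculated - experimental) ** 2)
    return {solvent: math.sqrt(sum(values) / len(values))
            for solvent, values in squared_errors.items()}


# Made-up data, for illustration only.
records = [
    ("CDCl3", 77.0, 78.0),
    ("CDCl3", 20.0, 19.0),
    ("acetone-d6", 30.0, 33.0),
    ("acetone-d6", 206.0, 202.0),
]
per_solvent = rms_error_by_solvent(records)
for solvent, rms in sorted(per_solvent.items()):
    print(f"{solvent}: RMS error {rms:.2f} ppm")
```

The same grouping gives the point sets for a ‘predicted versus measured’ plot per solvent; a noticeably larger error for one solvent would support the solvent-field explanation, though it would not prove it.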

Ryan Sasaki says:

October 24, 2007 at 12:16 pm

Hi Peter,
While I cannot speak for Brent Lefebvre, I have a couple of comments in relation to Open Notebook NMR and the potential involvement of commercial software companies in this study.
First of all, you mention that you will probably share your insights but not your data. If that is true, then I do not understand how this project can be referred to as OPEN. If the dataset being used is not shared publicly, how can this be considered an open project?
Some other points:
What is the value of evaluating your dataset with our predictions? I think it would provide significant scientific value. As Tony mentioned above, we know that the HOSE, NN, and increment approaches all outperform GIAO with regard to speed. What about accuracy? If HOSE, NN, and increments significantly outperform GIAO in both speed and accuracy, I think that is scientifically relevant information. If they do in speed but not in accuracy, I think that is too. If there is some way that the approaches can be combined to leverage the strengths of each, that’s another step in the right direction. I don’t think ACD/Labs’ involvement has been clearly defined. Brent has proposed a couple of alternatives, and I think quality science is the #1 thing in mind.
I do not think there is any intention of assimilating your data into the ACD/Labs database. We have a large database and we work hard every year to add more novel compounds to it. See my post here regarding how we build our databases:
http://acdlabs.typepad.com/my_weblog/2007/06/the_purgatory_d.html
Specifically,
“It is simply because we have a purgatory database that literally consists of hundreds of thousands of compounds that are waiting for quality checking. This is data coming from the most recent literature that we currently do not have in our database that is prioritized in a way that is most beneficial to the structural diversity of our databases.
This is the standard we have put in place and we truly believe it is the best way for us to ultimately improve the prediction in our NMR software. If we didn’t hold quality of science at a high standard within our company, we could double the size of the database tomorrow by importing the entire purgatory database as-is.”
There is no intention to scoop anyone, we are interested in the science. Although we are a commercial company, we are full of scientists interested in performing quality scientific validations.
The Cambridge Soft Comparison is something quite a bit different from this. We have been asked over the years to publish “marketing” comparisons for customer use where they are interested in how ACD/Labs performs against other commercial products. Both products cost money, and people want to get a reasonable idea of performance. The other alternative is for a specific individual to evaluate both software packages on their own so they can see how they perform with their data. We support that as well, but many people don’t have the time for this so they ask for help.
When considering our involvement with your current project, I would point you more towards the “Performance Validation of Neural Network Based 13C NMR Prediction Using a Publicly Available Data Source” paper that Antony is referring to. Further, I have blogged about this repeatedly on my site. Our intention was not to make the NMRShiftDB look bad, but to validate the quality. We have repeatedly endorsed the NMRShiftDB as a good project and I personally have provided direct feedback to some of the developers behind the NMRShiftDB. You will not find any bold marketing statements regarding ACD/Labs and its comparison with NMRShiftDB on the ACD/Labs website. This is a very different approach from what you have highlighted in the Cambridge Soft comparison.
I do hope that the significant amount of scientific publications that members of ACD/Labs have co-authored over the years is proof that we are interested in doing quality science and not simply looking for something else to market. Quality science has been undertaken within this company for many years and the hope is that this will continue in the coming years. While we are certainly a commercial company, I do think that our scientific contributions over the years should be acknowledged.

Wolfgang Robien says:

October 24, 2007 at 5:19 pm

Dear Peter;
a searchable database of 16.4 million calculated C13-NMR spectra has been available for approx. 1 year at http://nmrpredict.orc.univie.ac.at/identify
( It’s free of charge ! ;-)) )
The spectra have been calculated for 16.4 million of the PUBCHEM structures using the CSEARCH NN approach. The search technology used is a modified SAHO approach as implemented in CSEARCH.
If there is more interest in using this, it is no problem to upgrade the data file to the current size of the PUBCHEM collection. The calculation of approx. 40 million spectra can be done in less than one week on a 4-processor box.
Best regards, Wolfgang Robien

Joerg Kurt Wegner says:

October 24, 2007 at 7:53 pm

I have only three points:
1. I see a lot of potential in this exercise for all participants. I think you have to work together to see where the model has its pros and cons. I think being over-cautious is not really practical. If you want special agreements or outcomes, just negotiate realistic goals.
2. Model quality: Just check recent hypothesis testing literature, e.g. DOI 10.1021/ci700157b, or use standard hypothesis testing techniques. For this data, I would be surprised to see any differences due to the testing.
3. Marketing versus Blogging/OpenSource? For me those points are rather related than distant! I would guess that all participants *must* get something out of it. I think it is important to lower the threshold between the ‘open xxx’ community and commercial partners. I see the trend as positive, because it will help the whole community one way or the other. Just make sure that the people are playing a fair game; if not, well … then the falsely playing party will lose a lot of trust and everybody will know that. As long as people keep talking to each other I would be extremely surprised to see this exercise fail. All members seem to be very consistent and trustworthy in their doings.
Happy (enlarged) community regards, Joerg

Ryan Sasaki says:

October 25, 2007 at 8:49 pm

Great comment Joerg, and I agree with you regarding the relationship between marketing, blogging, and open source.
And of course one of the beauties of blogging and an online community is that when someone is not playing fair, they get called out and exposed on the blogosphere almost immediately.

Brent Lefebvre says:

November 6, 2007 at 2:49 pm

Hi Peter,
Have you come to a conclusion? I feel that most of the responders to this blog posting have been supportive of the initiative. Please know that what Ryan Sasaki says is true; we are interested in quality science and furthering this where we can. The reason I offered to help is two-fold. We are interested in how our NMR spectrum prediction algorithms compare to the calculation methods you are proposing to use. By using this dataset as a benchmark, I think we can help this project. And if we can help this project, I hope that helps this scientific discipline, if only modestly.
Of course, our motives are not entirely altruistic. By having this benchmark test performed, we get to see how our prediction algorithms compare to the ones you are using. This can then tell us where we need to go and improve our prediction quality, which is of course, very valuable information.
I also have no problem providing you not only with the results of the predictions, but with the software capable of providing you with the results independently. This should assuage any fears you have of overfitting the data.
Please contact me directly if you wish to pursue this direction. We would be very happy to help.
Sincerely,
Brent Lefebvre
NMR Product Manager
Advanced Chemistry Development
