We are now close to releasing the first results of the calculations – at present 300+ molecules. We think that the really major foul-ups are behind us (e.g. when all the Gaussian files failed to run because of a missing blank line). So we think it’s worth the community listening in.
To set the scene. This is part of Nick Day’s thesis work and Nick will be first author of anything to come out in the immediate future. Henry Rzepa provided much of the motivation and also the algorithms that we are now using. He first gave us an extension of the GIAO and Rychnovsky methods and then elaborated this in a further protocol, which is what we are now using. This protocol is based on his own work at Imperial, where he has computed a number of structures and gradually refined the methods. So this is his current best guess as to what we need, although there are some refinements for halogens that need to be added after the calculation.
Christoph has provided the NMR data from NMRShiftDB. Of course it comes from various sources but we shall rely on his judgement as to whether a structure is likely to be “wrong”. This is a difficult one – we cannot simply remove a structure because it doesn’t fit but he may be able to assert that there is a known problem. We may also have generic filters like the laboratory it came from.
These are the expected initial authors and we’ll see how things go. Christoph and Henry and Nick and I will have a few days to inspect the data before releasing it all. This should remove any really obvious “data errors” and also allow us to plan any further refinements. For example Henry has looked at the really glaring outlier and suggested a protocol change though we don’t think it will account for all the deviation.
People’s contributions will necessarily be recorded and so it will be clear what has been done. In the first instance I think we shall use the NMRShiftDB data and the Imperial protocol to give us an idea of the tractability of the method.
We absolutely welcome any input. We’ll be fairly focussed on a thesis-like approach for the next month or so, but may branch out. Here are some highly valuable suggestions
PMR: Many thanks. I think this will be extremely useful in the next phase of the program (which could be quite soon). At present Nick needs to concentrate on the Gaussian stuff, as it is fairly easy to initiate a new protocol and re-run the jobs in perhaps 2 days. The results of this will then give us an idea of where the main problems lie. If, for example, we find 5% of structures are misassigned, that is ca. 15 of our 300. Not too difficult to do by hand. But if we then scale this to the 20,000 in NMRShiftDB then it’s 1000 entries and we have to automate or fan out the social computing. If, however, the data error rate is 0.5% then 100 problems in NMRShiftDB is a long wet afternoon for the dedicated few.
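The back-of-envelope arithmetic above can be written down as a few lines of Python. This is only a sketch: the function name `expected_problem_entries` is made up for illustration, the set sizes (300 and 20,000) are the round figures quoted above rather than exact counts, and the error rates are guesses, not measured values.

```python
def expected_problem_entries(n_structures: int, error_rate: float) -> int:
    """Expected number of misassigned entries, rounded to the nearest whole entry."""
    return round(n_structures * error_rate)


# Hypothetical scenarios using the round numbers from the text.
scenarios = [
    ("current run, 5% misassigned", 300, 0.05),
    ("all of NMRShiftDB, 5% misassigned", 20_000, 0.05),
    ("all of NMRShiftDB, 0.5% misassigned", 20_000, 0.005),
]

for label, n_structures, error_rate in scenarios:
    count = expected_problem_entries(n_structures, error_rate)
    print(f"{label}: about {count} entries to check")
```

The point is simply that the workload scales linearly with both the size of the data set and the error rate, so a tenfold drop in error rate turns an automation problem back into a manual one.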
Data quality is critical. Joe Townsend went round this loop several times before coming up with a usable protocol for filtering problems. It’s harder for NMR, but there are some tricks we may be able to play to weed out the worst.