We are part of the COST D37 action on computational chemistry, specifically the part concerned with creating workflows and interoperability through standard APIs and formats. Our group at Cambridge is providing the basic infrastructure, which I hope I can say objectively is now looking believable and exciting. Effectively it covers:
- The conversion of computational chemistry output to CML. Ideally this should be done with FoX, which depends on the authors (or developers) of each program incorporating it into their release programmes. CASTEP, Dalton, GULP, DL_POLY and MOPAC are among the programs which have been FoXized, but not all the current versions include it. So we are likely to require an intermediate step:
- Explicit conversion of logfile and other output to CML. This can be done by ad-hoc scripts, but I have developed a more systematic method: JUMBOMarker. This is a set of heuristics which include regular expressions, chunking and lookahead. It’s rather out of date, so it’s getting a redesign and facelift. It should then be relatively easy to convert most outputs to CML. Each program requires a set of templates which describe the output, and these can be created by annotating typical program outputs. This has to be done by someone who understands what the program does, so I spent yesterday with Kurt Mikkelson, who is one of Dalton’s authors, working on understanding Dalton.
- An ontology for each code. The system can run without ontologies, but when these are created they add spectacular power to navigating and querying outputs. Every piece of information in the logfile can, in principle, be queried. These are currently bottom-up ontologies: they describe the phenomenology of the program rather than its platonic nature and purpose. If we get ontologies for several programs, then we’ll be able to abstract what they have in common and create a mid-level ontology for computational chemistry. This could then transform the way CompChem is used; for example, the ontology can control job creation and submission.
- Conversion to RDF. This is automatic and uses one of the JUMBOConverter modules. The result can be a union of the relevant parts of the ontology and the actual data. This represents a leap forward in the management of future scientific data.
- Ingestion into the Lensfield repository (Jim Downing, Nick Day, Lezan Hawizy). This is now automatic. The repository allows for ontology-based searching so that we can ask very general questions of the data sets using reasoning engines. The repository is general so that it can accommodate all aspects of molecules (provenance, literature, etc.).
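To give a flavour of the explicit logfile-to-CML step in the list above, here is a minimal Python sketch of template-driven conversion using regular expressions and chunking. It is not JUMBOMarker and makes no claim about how JUMBOMarker is implemented. The log layout, the `=== ... ===` chunk headers, the `ex:` dictionary and unit prefixes and all function names are assumptions made up for illustration, and the output is only loosely CML-like (nested `module`, `property` and `scalar` elements) rather than validated against the CML schema.

```python
import re
import xml.etree.ElementTree as ET

CML_NS = "http://www.xml-cml.org/schema"

# Hypothetical templates: (dictRef, units, regex with one capture group).
# A real program would need templates written by someone who knows its output.
TEMPLATES = [
    ("ex:totalEnergy", "ex:hartree", re.compile(r"^\s*Total energy:\s*(-?\d+\.\d+)")),
    ("ex:dipole", "ex:debye", re.compile(r"^\s*Dipole moment:\s*(-?\d+\.\d+)")),
]

# Hypothetical chunk header, e.g. "=== SCF ==="
CHUNK_START = re.compile(r"^\s*=== (.+) ===\s*$")

# Made-up log text, not real output from any program.
SAMPLE_LOG = """
=== SCF ===
 Total energy:   -76.026765
=== Properties ===
 Dipole moment:  1.8546
"""


def cml(tag):
    """Return a namespace-qualified tag name for ElementTree."""
    return f"{{{CML_NS}}}{tag}"


def chunk(lines):
    """Split lines into (title, body) chunks at each header line."""
    title, body = "preamble", []
    for line in lines:
        header = CHUNK_START.match(line)
        if header:
            if body:
                yield title, body
            title, body = header.group(1), []
        elif line.strip():  # ignore blank lines
            body.append(line)
    if body:
        yield title, body


def to_cml(log_text):
    """Convert log text to a CML-like tree, one module per chunk."""
    root = ET.Element(cml("module"), {"title": "job"})
    for title, body in chunk(log_text.splitlines()):
        module = ET.SubElement(root, cml("module"), {"title": title})
        for line in body:
            for dict_ref, units, pattern in TEMPLATES:
                match = pattern.match(line)
                if match:
                    prop = ET.SubElement(module, cml("property"), {"dictRef": dict_ref})
                    scalar = ET.SubElement(
                        prop, cml("scalar"), {"dataType": "xsd:double", "units": units}
                    )
                    scalar.text = match.group(1)
    return root


if __name__ == "__main__":
    print(ET.tostring(to_cml(SAMPLE_LOG), encoding="unicode"))
```

Keeping the templates as data rather than code is the point of the sketch: supporting a new program should then mostly mean writing a new template list, not a new parser.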
We are going to run a combined project in which we generate large numbers of related molecules in a parameter sweep (substituted oligo-acetylenes) and decorate them with donors and acceptors. We will then compute properties such as the hyperpolarizability and see which combinations of substituents generate new effects. This type of high-throughput calculation can only be done by building workflows.
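As a rough illustration of what such a parameter sweep looks like in code, the Python sketch below enumerates every combination of chain length, donor and acceptor and emits one job description per molecule. Everything specific here is an assumption for illustration: the donor and acceptor lists, the choice to cap the two ends of the chain, the SMILES fragments, the job identifiers and the `sweep` function are hypothetical and do not describe the project's actual molecule set or workflow.

```python
from itertools import product

# Hypothetical substituents, written as SMILES fragments that attach to the
# chain on the left (donors) or on the right (acceptors).
DONORS = {"NH2": "N", "OMe": "CO"}
ACCEPTORS = {"NO2": "[N+](=O)[O-]", "CN": "C#N"}
CHAIN_LENGTHS = (1, 2, 3)  # number of C#C units, chosen arbitrarily


def sweep():
    """Yield one job description per (chain length, donor, acceptor)."""
    combos = product(CHAIN_LENGTHS, DONORS.items(), ACCEPTORS.items())
    for n, (donor, donor_smiles), (acceptor, acceptor_smiles) in combos:
        yield {
            "job_id": f"{donor}-yne{n}-{acceptor}",
            "smiles": donor_smiles + "C#C" * n + acceptor_smiles,
            "property": "hyperpolarizability",
        }


if __name__ == "__main__":
    for job in sweep():
        print(job["job_id"], job["smiles"])
```

Even this toy sweep produces 3 × 2 × 2 = 12 jobs, and the count grows multiplicatively with every extra substituent or chain length, which is why submission and harvesting need to be automated.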
There are still problems to be overcome. Kurt and Hans-Peter Luethi reckon that 10% of calculations behave pathologically, but the workflow will allow us to apply machine learning to determine what the factors might be.
In the evening we went out punting with an ad hoc picnic. Of that I shall only say one word:
SPLOSH!
No, that should actually be
SPLOSH … SPLOSH.
If you are interested in more details, let us know.