Category Archives: nmr

funding models for software, OSCAR meets OMII

In a previous post I introduced our chemical natural language tools OSCAR and OPSIN. They are widely used, but in academia there is a general problem - there isn't a simple way to finance the continued development and maintenance of software. Some disciplines (bioscience, big science) recognize the value of funding software but chemistry doesn't. I can identify the following other approaches (there may be combinations):

  • Institutional funding. That's the model that ICE: The Integrated Content Environment uses. The main reason is that the University has a major need for the tool, and it's cost-effective to do this as it allows important new features to be added.
  • Consortium funding. Often a natural progression from the former. Thus all the major repository software (DSPACE, ePrints, Fedora) and content/courseware (Moodle, Sakai) have a large formal member base of institutions with subventions. These consortia may also be able to raise grants.
  • Marginal costs. Some individuals or groups are sufficiently committed that they devote a significant amount of their marginal time to creating software. An excellent example of this is George Sheldrick's SHELX where he single-handedly developed the major community tool for crystallographic analysis. I remember the first distributions - in ca 1974 - when it was sent as a compressed deck of FORTRAN cards (think about that). For aficionados there was a single variable A(32768) in which different locations had defined meanings only in George's head. Add EQUIVALENCE and blank COMMON, and any alteration to the code by anyone except George led to immediate disaster. A good strategy to avoid forks. My own JUMBO largely falls into this category (but with some OS contribs).
  • Commercial release. Many groups have developed methods for generating a commercial income stream. Many of the computational chemistry codes (e.g. Gaussian) go down this route - an academic group either licenses the software to a commercial company, sets up a company themselves, or recovers costs from users. The model varies. In some cases charges are only made to non-academics, and in some cases there is an active academic developer community who contribute to the main branch, such as for CASTEP.
  • Open Source and Crowdsourcing. This is very common in ICT areas (e.g. Linux) but does not come naturally to chemistry. We have created the BlueObelisk as a loose umbrella organisation for Open Data, Open Standards and Open Source in chemistry. I believe it's now having an important impact on chemical informatics - it encourages innovation and public control of quality. Most of the components are created on marginal costs. It's why we have taken the view that - at the start - all our software is Open. I'll deal with the pros and cons later but note that not all OS projects are suited for crowdsourcing on day one - a reliable infrastructure needs to be created.
  • 800-pound gorilla. When a large player comes into an industry sector they can change the business models. We are delighted to be working with Microsoft Research - gorillas can be friendly - who see the whole chemical informatics arena as being based on outdated technology and stovepipe practices. We've been working together on Chem4Word which will transform the role of the semantic document in chemistry. After a successful showing at BioIT we are discussing with Lee Dirks, Alex Wade and Tony Hey the future of C4W.
  • Public targeted productisation. In this there is specific public funding to take an academic piece of software to a properly engineered system. A special organisation, OMII, has been set up in the UK to do this...

So what, why, who and where are OMII?

OMII-UK is an open-source organisation that empowers the UK research community by providing software for use in all disciplines of research. Our mission is to cultivate and sustain community software important to research. All of OMII-UK's software is free, open source and fully supported.

OMII was set up to exploit and support the fruits of the UK eScience program. It concentrated on middleware, especially griddy stuff, and this is of little use to chemistry which needs Open chemistryware first. However last year I bumped into Dave DeRoure and Carole Goble and they told me of an initiative - ENGAGE - sponsored by JISC - whose role is to help eResearchers directly:

The widespread adoption of e-Research technologies will revolutionise the way that research is conducted. The ENGAGE project plans to accelerate this revolution by meeting with researchers and developing software to fulfil their needs. If you would like to benefit from the project, please contact ENGAGE (info@omii.ac.uk) or visit their website (www.engage.ac.uk).

ENGAGE combines the expertise of OMII-UK and the NGS - the UK's foremost providers of e-Research software and e-Infrastructure. The first phase, which began in September, is currently identifying and interviewing researchers that could benefit from e-Research but are relatively new to the field. "The response from researchers has been very positive" says Chris Brown, project leader of the interview phase, "we are learning a lot about their perceptions of e-Research and the problems they have faced". Eleven groups, with research interests that include Oceanography, Biology and Chemistry, have already been interviewed.

The results of the interviews will be reviewed during ENGAGE's second phase. This phase will identify and publicise the 'big issues' that are hindering e-Research adoption, and the 'big wins' that could help it. Solutions to some of the big issues will be developed and made freely available so that the entire research community will benefit. The solutions may involve the development of new software, which will make use of OMII-UK's expertise, or may simply require the provision of more information and training. Any software that is developed will be deployed and evaluated by the community on the NGS. "It's very early in the interview phase, but we're already learning that researchers want to be better informed of new developments and are keen for more training and support." says Chris Brown.

ENGAGE is a JISC-funded project that will collaborate with two other JISC projects - e-IUS and e-Uptake - to further e-Research community engagement within the UK. "To improve the uptake of e-Research, we need to make sure that researchers understand what e-Research is and how it can benefit them" says Neil Chue Hong, OMII-UK's director, "We need to hear from as many researchers and as many fields of research as possible, and to do this, we need researchers to contact ENGAGE."

Dave and Carole indicated that OSCAR could be a candidate for an ENGAGE project and so we've been working with OMII. We had our first f2f meeting on Thursday where Neil and two colleagues, Steve and Steve, came up from Southampton (that's where OMII is centred, although they have projects and colleagues elsewhere). We had a very useful session where OMII have taken ownership of the process of refactoring OSCAR and also of evangelising it. They've gone into OSCAR's architecture in depth and commented favourably on it. They are picking PeterC's brains so that they are able to navigate through OSCAR. The sorts of things that they will address are:

  • Singletons and startup resources
  • configuration (different options at startup, vocabularies, etc.)
  • documentation, examples and tutorials
  • regression testing
  • modularisation (e.g. OPSIN and pre- and post-processing)

And then there is the evangelism. Part of OMII-ENGAGE's remit is to evangelise, through brochures and meetings. So we are tentatively planning an Open OSCAR-ENGAGE meeting in Cambridge in June. Anyone interested at this early stage should mail me and I'll pass it on to the OMII folks.

... and now OPSIN...

Software patents again... Oh dear

The price of freedom is eternal vigilance and the hydras have many heads. Just when you think PRISM is decapitated, up pops Conyers, and now the good old European Patents directive is still alive.

Please kill it... The great thing is that e-democracy can be mobilised very quickly now

From FFII... (sorry about the formatting in my WordPress)

The question of software patents without democracy and the FFII response

In October 2008, the President of the European Patent Office (EPO) issued a Referral to its Enlarged Board of Appeal (EBoA) concerning the questions as to the examination and granting of software patents in Europe. In the absence of European legislative initiatives, the EBoA's conclusion on this matter is likely to have the same effect as a software patent directive.

However, since this decision will be based on a purely legal interpretation of the European Patent Convention (EPC) by the EBoA, it will not be accompanied by more extensive political and economic debate.

As stated by the EPO, third parties may wish to use the opportunity to file written statements before the end of April
(http://tinyurl.com/chkljo)

We would like to ask you to consider writing a statement in the name of your company, organisation or as private person, and if possible also to support the action plan of the FFII (see below).

You can see statements already submitted by others at http://www.epo.org/patents/appeals/eba-decisions/referrals/pending.html

We offer a dedicated mailing list for discussions on the referral at

https://lists.ffii.org/mailman/listinfo/boa

and a petition page against software patents at

http://stopsoftwarepatents.eu/

With our action plan, we are funding two experts to work full-time on the issue and also produce detailed documentation about software patents in Europe, to be published in the near future. We need your contribution in order to do this. Please consider making a donation, marking it as 'EBoA Referral'.

International bank data:

IBAN:    DE78701500000031112097
BIC:     SSKMDEMM
Country: Germany
Name:    FFII e.V.
Address: Blutenburgstr 17, DE 80636 Muenchen

Germany bank data:

Name:            FFII e.V.
Account:         31112097
Sort code (BLZ): 70150000

For using Paypal, see
http://ffii.org/Donations

Background information

At present there is no central jurisdiction for European or community patents. National court decisions are still not fully aligned with the European Patent Office's (EPO) granting policy concerning software patents that has been developed by decisions of the EPO Boards of Appeal. The disparity between national patent enforcement courts and the EPO's granting practice was one of the reasons why a directive on the patentability of computer-implemented inventions was proposed. This directive, as well as the 2000 attempt to change the European Patent Convention, was rejected not least because of the larger FFII network's activities.

Despite the fact that several attempts to formally legalise software patents in Europe proved unsuccessful, the EPO still has not adapted to the developments in the political arena. The EPO still grants software patents under the application of loopholes created by its Boards of Appeal decisions.

The EPO's granting practice gradually gains more acceptance in national courts thanks to a trickle down effect, while the legal certainty of national software patents remains to be determined. Validity rulings and opposition mostly reject questionable software patents out of novelty and inventive step considerations, but not on grounds of the substantive scope of patent law.

On October 22, 2008 the Enlarged Board of Appeal was asked by the President of the European Patent Office, Alison Brimelow (UK), for an opinion concerning the exclusion of computer programs as such according to Article 112(1)b EPC. She highlights that this matter is of fundamental importance as it defines the limits of patentability in the field of computing. The Referral is divided into four chapters. The first chapter describes the background to the Referral, the second chapter concerns definitions of auxiliary terms such as software, while part three includes four questions about substantive law interpretation.
Part four describes the legal framework and options for its development.
The President also added background information and an overview of BoA decisions related to this specific matter.

The FFII has a wiki page where comments on the questions can be added.

https://www.ffii.org/EPOReferral

The EPO Enlarged Board of Appeal decided to allow third parties to make statements concerning the points of law (November 11, 2008). We will provide legal considerations which challenge the controversial Boards of Appeal decisions and thus influence the decision-making process. In the absence of legislative clarifications, some courts in the UK recently accepted EPO 'case law'. The opinion of the Enlarged Board of Appeal will create the precedent for all future legislative developments.

As there is no legislative scenario in sight which might overrule the EBoA in case it permits software patents, this particular Referral needs our attention. Other parties interested in software patents are going to submit comments in favour of software patents. Philips, in fact, has already done so.

Our action plan

We will submit entries to the Enlarged Board of Appeal in order to bring about a more balanced assessment, and to help the EBoA arrive at legal solutions that are closer to our expectations. Our communication targets are patent technocrats with a different belief system to which we need to adapt. So far we have concluded that several different strategies can be applied. We have discussed these extensively with patent experts. For strategic reasons we cannot make them public, suffice it to say that we are currently in the process of finding collaborators in our attempt to stop software patents.

Challenge

* Recent EPO legal patent literature has done little to challenge or even criticise the teachings of the EPO. Patent scholars from other disciplines such as political science, economics, etc. are hardly discussed in the legal literature. Patent professionals' task is not normative legislation, but winning cases and applications. While there has been sustained disagreement with software patents in the field of business, legal literature still hardly reflects this shift.

* Inside the EPO there is no open debate and employees are bound by strict staff obligations (cmp. Communique 22). The EPO aggressively intervenes in political and scientific debates, while the patent community's belief system is still largely determined by an unchallenged endorsement of software patents.

* The EBoA's members are not necessarily eligible for judicial office, and some of them are merely technically qualified. The EBoA's lack of independence is a known issue and an EPO reform is underway to make these bodies more independent. Some patent scholars altogether question the legal quality of EBoA reasoning.

* The political debate over patent law is largely blocked. The fact that no corresponding parliament report was issued in response to an official communication from the Commission about the future of Industrial Property policy testifies to this.

* Members of the EBoA will probably only accept legal considerations and solutions.

* The EPO's dogmatic language is shielded against public criticism and is, even for legally trained people, like a net in which one easily gets caught. Its reasoning is often based on logical fallacies and hidden value judgments.

* Patent law interpretation practice is expansive.
In an allegedly unclear situation, the patent community will always argue against exclusion from patentability. It lacks a negative definition of "invention" and a sound basis in legal teaching which could be used to explain why a field is not to be covered by patent law.
Patent professionals generally do not understand the economic rationale behind incentive system application, while economists often assume for their model that the patent system has the claimed effects.

* The EPO and its staff have a strong commercial bias in favour of granting patents and are hardly ever subjected to public scrutiny and control. Patent opposition is less than ideal due to free riding effects and associated risks and transparency gaps (cmp. Guellec07)

* Complicated institutional conflicts between German and UK patent traditions loom in the background of the Referral. De facto European patent policy and litigation is strongly dominated by UK and German stakeholders and traditions.

Conferences

The following conferences - among others which are not public - will be or have already been attended by some of our members.

Current Policy Issues in the Governance of the European Patent System
Venue: European Parliament, Rue Wiertz 60, Room Anna Lindt, P1A002, Brussels B-1047, BELGIUM
17 March 2009
Alison Brimelow : Closing remarks
www.europarl.europa.eu/stoa/events/workshop/20090317/programme_en.pdf

WIPO - STANDING COMMITTEE ON THE LAW OF PATENTS Geneva, March 23 to 27, 2009 (We have a written report available)

The future of intellectual property
Creativity and innovation in the digital era April 23rd -24th, 2009, Committee of the Regions, Brussels

Making IPR work for SMEs
27th of April 2009, Brussels
http://ec.europa.eu/enterprise/enterprise_policy/industry/ipr_conference.htm

Patinnova
April 28th-30th, Prague
Alison Brimelow opening it.
Workshop on patents and software
http://www.epo.org/about-us/events/epf2009.html

Measuring the value of IPR: theory, business practice and public policy September 24-25, 2009, Bologna Sponsored by the EPO. Alison Brimelow has been invited.
http://www.epip.eu/conferences/epip04/

How to support the FFII

The FFII is divided into working groups. We welcome new active people in our working groups, which are listed at https://action.ffii.org

If you consider our work important but you are not able to help actively, you can become a passive sustaining member of the FFII, starting at 15 EUR per year. See

http://action.ffii.org/member_application

How to contact us

FFII e.V.
Blutenburgstr. 17
80636 Munich
Germany

https://www.ffii.org

office@ffii.org

Tel. +49 30 417 22 597
Fax: +49 30 417 22 597
IRC: #ffii @ irc.freenode.net
Blogs: http://planet.ffii.org/

Tax number: 143 / 843 / 17600 at the German tax office in Munich.
IBAN: DE78701500000031112097, SWIFT/BIC: SSKMDEMM Registered organisation in Munich, Amtsgericht München VR 16460
Board: Benjamin Henrion, Rene Mages, Ivan Villanueva, Andre Rebentisch,
Alex Macfie

librarians of the future - Christine and Kimberley

It was great to meet up again with Christine Borgman from UCLA at the Microsoft meeting. Christine and I have much in common about what needs to be done for digital scholarship.

Christine runs a Masters (I think) in LIS and serially hijacked many of the invitees to take part in virtual sessions with her students. So I gave 15-20 minutes of brain dump over the video link. I said that I would be talking in Oxford and asked for contributions. This one, from Kimberley Garmoe, arrived just too late for me to reference it... I am really flattered by the reference to the Enlightenment...

I was the small and nearly invisible voice from the back of the room at UCLA. I do not intentionally hide behind tall people, but somehow they always end up in front of me. I immensely enjoyed your talk, and agreed in principle with everything you had to say. I also think we are on the cusp of the communications revolution, the importance of which can only be compared to the first 300 years of printing. The revolution is already upon us, and clearly will not be televised. I hope that you do not take umbrage with my argumentative tendencies. Your style is so engaging and ideas so compelling that I found it impossible to remain passive and polite.

Dr. Borgman is correct, I am one of the students whose background is not in the library sciences, or in any science for that matter. I am a Ph.D. student in European history, and my work is on the communications revolution in German mass media at the end of the 18th century. I will spare you further gory details. However, I think that much of your argument reaches back into late Enlightenment thought, and I am happy to see that the legacy of the Enlightenment lives on.

I share your concern that information should not be monopolized, but I would point out that monopoly in the production of information seems to have existed from the early days of print. And by this I do not mean merely the official monopolies granted to individual printers, but also the tendency of first printers, and then publishing houses, to establish control over the selection and distribution of information over long periods of time. Even in the world of print there have been long decades of disaggregation and competition, but time and time again we end up with powerful monopolies. I would like to know what you think: why does information end up monopolized?

I am certain that your talk at the Bodleian went exceedingly well, and I only wish I was there to hear it. However I know I can look forward to it in the Bodleian print series.

With my best regards,

Kimberly Garmoe

See a very comprehensive account of (some of) the proceedings. I expect that video, etc. may appear as well.

OREChem

I will start to widen out from the library of the future and bring in chemistry and eScience. Librarians should not switch off as the topics are very relevant. Several in our group are off to Redmond - to two official meetings and other informal meet-ups. I'll blog (or twitter/FF) about these as we go.

The first meeting is OREChem, sponsored by Lee Dirks from Microsoft External Research and PI'ed by Carl Lagoze from Cornell. Lee is part of Tony Hey's empire in MSR and has responsibility for scholarly publication and education. There is a good coherence and overlap between the projects and we are committed to these being Open.

OAI-ORE (Open Archives Initiative - Object Reuse and Exchange) is brought to you by the people that brought you OAI-PMH - Carl and Herbert. One of the tricky problems on the web is being able to access a bounded set of information on the web. For example if you go to this blog address and download it, what do you get? I actually don't know and I expect it's a mess. This isn't a new problem, and the hypermedia gurus have been active for decades - when I started with SGML I spent many hours trying to understand "Bounded Object Sets" and "architectural forms".

ORE tackles this problem in the context of research and scholarship. It can be used for anything, but the thrust is on making web resources for digital libraries, research laboratories, etc. I have the honour of being on the ORE advisory board and I'd urge you to get involved. MSR are backing ORE and as an exemplar have applied this to chemistry, in OREChem. Here we are showing how to create bounded web resources in a context of linked data. I'll write more later, but to put a marker down we have transformed CrystalEye into RDF and will be working over the weekend to agree what the best approach to ORE-ifying it is. I'll leave you with Carl's recent paper (The oreChem Project) ...
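To make the ORE idea concrete, here is a minimal sketch (using the Python rdflib library; the URIs and file names are invented purely for illustration, and this is not the actual CrystalEye mapping) of describing a small aggregation - one crystal structure with its CML rendition - using the ORE vocabulary:

from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import RDF, DCTERMS

ORE = Namespace("http://www.openarchives.org/ore/terms/")

g = Graph()
g.bind("ore", ORE)
g.bind("dcterms", DCTERMS)

# Hypothetical URIs: a Resource Map (rem) that describes an Aggregation (agg)
# of one CIF file and its CML rendition.
rem = URIRef("http://example.org/crystaleye/rem/ex1")
agg = URIRef("http://example.org/crystaleye/aggregation/ex1")
cif = URIRef("http://example.org/crystaleye/ex1.cif")
cml = URIRef("http://example.org/crystaleye/ex1.cml")

g.add((rem, RDF.type, ORE.ResourceMap))
g.add((rem, ORE.describes, agg))
g.add((agg, RDF.type, ORE.Aggregation))
g.add((agg, ORE.aggregates, cif))
g.add((agg, ORE.aggregates, cml))
g.add((agg, DCTERMS.title, Literal("Crystal structure ex1 with CML rendition")))

print(g.serialize(format="turtle"))

The point is simply that the Aggregation is a first-class resource with its own URI, so a "bounded set" of related files can be named, described and linked to from anywhere else on the web.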

The oreChem Project:
Integrating Chemistry Scholarship with the Semantic Web

Carl Lagoze
Information Science, Cornell University
lagoze@cs.cornell.edu

The oreChem project, funded by Microsoft, is a collaboration¹ between chemistry scholars
and information scientists to develop and deploy the infrastructure, services, and
applications to enable new models for research and dissemination of scholarly materials in
the chemistry community. Although the focus of the project is chemistry, the work is being
undertaken with an attention to general cyber infrastructure for eScience, thereby enabling
the linkages among disciplines that are required to solve today’s key scientific challenges
such as global warming. A key aspect of this work, and a core aim of this project, is the
design and implementation of an interoperability infrastructure that will allow chemistry
scholars to share, reuse, manipulate, and enhance data that are located in repositories,
databases, and Web services distributed across the network.

The foundations of this planned infrastructure are the specifications developed as part of
the Open Archives Initiative‐Object Reuse and Exchange (OAI‐ORE) [9] effort. These
specifications provide a data model [8] and set of serialization syntaxes [10‐12] for
describing and identifying aggregations of Web resources and describing the relationships
among the resources that are constituents of aggregations. The OAI‐ORE specifications are
firmly grounded in the Web architecture [6] and in the principles of the semantic web [4, 7]
and the Linked Data Effort [3]. The relevant connections of the OAI‐ORE specifications to
mainstream Web and Semantic Web architecture include:

  • All aspects of data model are expressed in terms of resources, representations, URIs,
    and triples.
  • The fundamental entity in the data model, the Aggregation, is a resource without a
    Representation (a “non‐document” resource). This paradigm is similar to the
    manner in which real‐world entities or concepts are included in the Web via the
    mechanisms proposed by the Linked Data Effort [3].
  • The description of an Aggregation, a Resource Map, is a separate Resource, which is
    accessible via the URI of the Aggregation using the mechanisms defined for Cool
    URIs [15].
  • The result of an HTTP access of a Resource Map URI is a serialization of the triples
    describing the Aggregation. This serialization may be in any of the OAI‐ORE
    serialization syntaxes: RDF/XML [2], RDFa [1], and Atom [14] (triples can be
    extracted from this via an OAI‐ORE defined GRDDL‐compliant XSLT script).

Our initial work in the oreChem Project is the design of a graph‐based object model that
specializes the core OAI‐ORE data model for the chemistry domain. This model builds on
the centrality of the molecule, or chemical compound, in the record of chemistry
scholarship. In the nature of a relational database key, a molecule or compound, identified
in a universal manner [13], forms the central hub for linkages to other entities such as
investigations, experiments, scholars, and processes related to that molecule. We are then
using this model to design interfaces and APIs to exchange molecular information and their
relationships among distributed repositories, services, and agents.

We are demonstrating this infrastructure by adapting a number of existing chemistry data
repositories² to the APIs and models. We are also further populating these repositories by
developing and refining automated techniques for retrospectively extracting chemical
information and interlinking chemical data from existing chemistry research corpora.

Following this we will develop and deploy a number of tools, such as chemical structure
searching, over the repositories that have been adapted to the infrastructure. In the latter
stages of the project, we will extend the retrospective data extraction techniques with active
“in the lab” capture of chemistry data, and the addition of that “in‐process” data to the
knowledge network defined by the infrastructure data model.

Ultimately, we envision that this common data model, interchange protocols, and suite of
data extraction and data capture tools will enable an eChemistry Web – a semantic graph
with embedded subgraphs representing molecules which are then interrelated to
publications that refer to them, experiments that work with them, the context of these
experiments, the researchers working with these molecules, annotations about publications
and experiments, and the like. A particularly interesting aspect of this semantic graph is the
manner in which it mixes data, publication artifacts, and people – providing an information-rich
social network built around the notion of object‐centered sociality [5]. In the latter
phases of the project we hope to build innovative analysis tools that will extract new
“scientometric” information and knowledge from the eChemistry Web.

Our work in the oreChem Project and, in particular, our design of the interoperability
infrastructure, is being undertaken with the recognition that chemistry, like any scholarly
discipline, is not an island, but has complex linkages to scholarship in other disciplines and
into related activities such as education, and in fact to the general network‐based
information environment. By basing our work on OAI‐ORE, we hope that the
interoperability paradigm designed for oreChem will coexist with similar work in other
disciplines and in fact with the general Web information space and its ubiquitous search
tools, services, and applications.

¹ Collaborators in the oreChem Project are University of Cambridge (Peter Murray-Rust, Jim
Downing), Cornell University (Carl Lagoze, Theresa Velden), University of Indiana (Geoffrey
Fox, Marlon Pierce), Penn State University (C. Lee Giles, Prasenjit Mitra, Karl Mueller),
PubChem (Steve Bryant), and University of Southampton (Jeremy Frey, Simon Coles).

² These repositories include CrystalEye, 100,000 molecules and 100,000 fragments from
crystal structures with full crystallographic details and with 3D coordinates; SPECTRaT,
open theses with molecules; Pub3D, MMFF94‐optimized 3D structures for PubChem
compounds; ChemXSeer, an integrated digital library and database allowing for intelligent
search of documents in the chemistry domain and data obtained from chemical kinetics;
eCrystals, high level crystal structures and processed x‐ray diffraction data; and R4L,
experimental spectroscopic and analytical chemical data.

SPECTRa tools released

The SPECTRa tools allow chemists (perhaps group or departmental analytical spectroscopy groups) to submit their data (spectra, crystal structures, compchem) to a repository.

From Jim Downing: SPECTRa released

Now that a number of niggling bugs have been ironed out, we’ve released a stable version of the SPECTRa tools.

There are prebuilt binaries for spectra-filetool (command line tool helpful for performing batch validation, metadata extraction and conversion of JCAMP-DX, MDL Mol and CIF files), and spectra-sub (flexible web application for depositing chemistry data in repositories). The source code is available from the spectra-chem package, or from Subversion. All of these are available from the spectra-chem SourceForge site.

Mavenites can obtain the libraries (and source code) from the SPECTRa maven repo at http://spectra-chem.sourceforge.net/maven2/. The groupId is uk.ac.cam.spectra - browse around for artifact ids and versions.

PMR: This is an important tool in the chain and congratulations to Jim for designing and building it. It interfaces with a repository (as they say on kids' toys, "repository not included") so that you can customise your own business process. We hope to see departments appreciating the need for repositing their data (it gets lost, it could be re-used, etc.).

The legacy formats (CIF, JCAMP, Gaussian, etc.) are well structured and SPECTRa allows them to be used in a way which maximises the effort that went into creating them. The process is almost automatic for crystallography (a good CIF has all the metadata inside it) but needs a small amount of manual effort for spectra (the molecule is not normally embedded in the JCAMP so has to be provided separately).
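For a flavour of what the metadata extraction involves, here is a toy sketch (not the actual SPECTRa code; the example header and the labels shown are illustrative, and real JCAMP-DX files vary) of scraping the labelled data records from a JCAMP-DX header:

import re

# JCAMP-DX headers are "labelled data records" of the form ##LABEL= value
# (labels beginning ##. are technique-specific). This simply collects them.
LDR = re.compile(r"^##(\.?[^=]+)=\s*(.*)$")

def jcamp_header(text):
    meta = {}
    for line in text.splitlines():
        m = LDR.match(line.strip())
        if m:
            meta[m.group(1).strip().upper()] = m.group(2).strip()
    return meta

# invented example header, purely for illustration
example = """##TITLE= compound 172
##JCAMP-DX= 5.00
##DATA TYPE= NMR SPECTRUM
##.OBSERVE NUCLEUS= ^13C
##.OBSERVE FREQUENCY= 100.62
"""
print(jcamp_header(example))

A CIF can be treated the same way but more completely, since the data names are standardised and the structure, cell and experimental metadata are all present in one file.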

The system is potentially searchable by chemistry - it might look something like CrystalEye with a search provided by OpenBabel.

NMR Challenge: what can a machine deduce from a thesis?

One of the ways of extracting chemical structures from the literature is to use the NMR to constrain the possibilities. So, to give you an amusement for the weekend, here are some problems. I have a thesis (which I'm not identifying, but I know the author and he's happy for the thesis to yield Open Data). I am not sure whether the compounds are in the public literature yet, but they are in the public domain if you know where to find the paper thesis.

Imagine that some future archaeologist had discovered the thesis and only a few scraps had survived. What could be deduced? I'm starting with smallish (hopefully fairly simple) structures and only feeding you some of the information. Depending on what you answer, I'll either release more or select more complex compounds. All compounds are distinct.

Compound 172

dH (400 MHz, CDCl3): 1.15 (3H, t, J 7.1, OCH2CH3), 1.24 (3H, d, J 5.2, 6-H x 3), 2.84 (1H, qd, J 5.2, 2.0, 5-H), 3.05 (1H, dd, J 7.0, 2.0, 4-H), 4.07 (2H, q, J 7.1, OCH2CH3), 5.99 (1H, dd, J 15.7, 0.6, 2-H), 6.54 (1H, dd, J 15.7, 7.0, 3-H);

Compound 167

dC (100 MHz, CDCl3): 164.5 (CO), 160.3 (C), 107.5 (C), 95.6 (CH), 40.9 (CH2Cl), 24.7 (CH3 x 2);

compound 156

dC (100 MHz, CDCl3): 83.0 (3-C), 79.6 (2-C), 61.8 (5-C), 51.0 (1-C), 25.8 (SiC(CH3)3 x 1), 23.1 (4-C), 18.3 (SiC(CH3)3), -5.1 (Si(CH3) x 2);

Note that the molecular formula, molecular weight, etc. have all been destroyed by the ravages of time.

You can use any method you like, including searching in commercial databases.

What could a machine do with the information above?
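As a hint of what is mechanically recoverable, here is a minimal sketch (not OSCAR, just an illustration) that pulls the shift values and assignment comments out of a dC listing like the one for compound 156:

import re

# Each peak looks like "83.0 (3-C)" or "25.8 (SiC(CH3)3 x 1)"; the pattern allows
# one level of nested parentheses inside the assignment comment.
dC = ("83.0 (3-C), 79.6 (2-C), 61.8 (5-C), 51.0 (1-C), 25.8 (SiC(CH3)3 x 1), "
      "23.1 (4-C), 18.3 (SiC(CH3)3), -5.1 (Si(CH3) x 2)")

PEAK = re.compile(r"(-?\d+(?:\.\d+)?)\s*\(([^()]*(?:\([^()]*\)[^()]*)*)\)")

peaks = [(float(shift), comment) for shift, comment in PEAK.findall(dC)]
print(len(peaks), "carbon environments")
for shift, comment in peaks:
    print(f"{shift:7.1f}  {comment}")

Even that much already tells a machine the number of distinct carbon environments and, from the comments, hints such as a silyl protecting group (SiC(CH3)3, Si(CH3) x 2); the much harder step is turning those constraints into candidate structures.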

How OSCAR interprets text and data

I recently posted (Open NMR and OSCAR toolchains) about how OSCAR can extract data from chemical articles, and in particular chemical theses. Wolfgang Robien points out (November 24th, 2007 at 11:03 am):

I think, no, I am absolutely sure, this functionality can be achieved with a few basic UNIX-commands like ‘grep’, ‘cut’, ‘paste’, etc. What you need is the assignment of the signals to specific carbons in your structure, because this (and EXACTLY THIS) is the basis of spectrum prediction and structure verification - before this could be done, you need the structure itself.

Wolfgang is correct that this part of OSCAR is based on regular expressions (which are also used in grep). However developing such regular expressions that work across a variety of styles (journals, theses, etc.) is a lot of work - conservatively this took many months. The current set of regexes runs to many pages. Initially when I started this work about 7 years ago I thought that chemical papers could be solved by regexes alone, but this is quite infeasible. Even if the language is completely regular (as is possible, but not always observed in spectra data) we rapidly get a combinatorial explosion. Joe Townsend, Chris Waudby, Vanessa de Sousa and Sam Adams did much of the pioneering work here and showed the limitations. In fact the current OSCAR, which we are refactoring at this moment, consists of several components:

  • natural language parsing techniques (including part of speech tagging and, to come, more sophisticated grammars)
  • identification of chemical names by Bayesian techniques
  • chemical name deconstruction (OPSIN)
  • heuristic chunking of the document
  • lookup in ontologies
  • regular expressions

These can interact in quite complex manners - for example chemical names and formula can be found in the middle of the data. For this reason OSCAR - and any parsing technique - can never be 100% perfect. (We should mention, and I will continue to do so, that parsing PDF - even single column - is a nightmare).
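To make the regular-expression layer concrete, here is a toy pattern (far simpler than the real OSCAR grammar, and covering only one reporting style) that chunks a 1H peak listing of the style quoted earlier (e.g. "1.15 (3H, t, J 7.1, OCH2CH3)") into shift, integration, multiplicity, coupling constants and assignment:

import re

# Handles peaks of the form "1.15 (3H, t, J 7.1, OCH2CH3)"; real-world variants
# (Hz units, delta symbols, shift ranges, nested brackets) need many more alternatives.
H_PEAK = re.compile(
    r"(?P<shift>\d+\.\d+)\s*"
    r"\((?P<integral>\d+)H,\s*"
    r"(?P<mult>[a-z]+)"
    r"(?:,\s*J\s*(?P<J>\d+\.\d+(?:,\s*\d+\.\d+)*))?"
    r"(?:,\s*(?P<assign>[^)]+))?\)"
)

text = ("1.15 (3H, t, J 7.1, OCH2CH3), 2.84 (1H, qd, J 5.2, 2.0, 5-H), "
        "5.99 (1H, dd, J 15.7, 0.6, 2-H)")

for m in H_PEAK.finditer(text):
    print(m.group("shift"), m.group("integral") + "H", m.group("mult"),
          "J =", m.group("J"), "->", m.group("assign"))

Every extra journal or thesis style multiplies the alternatives, which is why a pure-regex approach eventually collapses and the other components listed above are needed.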

Wolfgang is right that we also need the assignment of the carbons to the peakList and also the chemical structure. Taking the structure first, we can try to determine it by the following methods:

  • interpreting the chemical name. OPSIN does a good job on simple compounds. I don't have metrics for the current literature but I think it's running at ca 20%. That may sound low, but name2structure requires the compilation of many sub-lexicons and sub-grammars (e.g. for multicyclic systems, saccharides, etc.) If there is a need for this, much can be done by community action.
  • interpreting the chemical diagram. Open tools are starting to emerge here and my own dabbling with PDF suggests that perhaps 20-25% can be extracted. The main problems are (a) finding the diagrams and linking them to the serial number and (b) the publishers' claim that images are copyright.
  • using the crystallography. If a crystal structure is available then the conversion to connection table, including bond orders and hydrogens, is virtually 100%. Again there may be a problem in linking the structure to the formula.
  • reconstruction from spectral data. For simple molecules this should be possible - after all we set this in exam questions so a robot should be able to do some. The combination of HNMR, CNMR and IR should constrain the possibilities. Whether this is a brute force approach (generate all structures and remove incompatible ones) or whether it is based on logic and rules may depend on the software available and the system.

(Of course if the publisher or student makes an InChI available all this is unnecessary).

There are two ways of trying to add the assignment. One is simply by trusting the shifts from the calculation (whether GIAO or HOSE). A problem here is that the authors may - and do - omit peaks or mis-transcribe them. I think I have an approach to manage simple cases here. The other is by trying to interpret the author's annotations. This is a nice exercise because there is no standard way of reporting it and there is almost certainly no numbering scheme. So we will need to build up databases of numbering schemes and also heuristics of how most authors annotate C13 spectra.
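As an illustration of the first route (an assumed sketch, not Nick's actual code), one crude way to attach an assignment is to pair each observed shift greedily with the closest unused calculated shift and flag large residuals as possible misassignments or omitted peaks:

def assign_peaks(observed, calculated, warn=5.0):
    """Greedy nearest-shift pairing of observed peaks to calculated atom shifts.

    observed:   list of observed 13C shifts (ppm)
    calculated: dict mapping atom label -> calculated shift (ppm)
    Returns (observed shift, atom, calculated shift, difference) tuples.
    """
    free = dict(calculated)
    pairs = []
    for obs in sorted(observed, reverse=True):
        atom = min(free, key=lambda a: abs(free[a] - obs))
        calc = free.pop(atom)
        diff = obs - calc
        if abs(diff) > warn:
            print(f"warning: {obs} ppm -> {atom} differs by {diff:.1f} ppm")
        pairs.append((obs, atom, calc, diff))
    return pairs

# invented numbers, purely for illustration
calc = {"C1": 166.2, "C2": 151.9, "C3": 121.0, "C4": 20.8}
obs = [165.0, 150.3, 122.5, 21.4]
for row in assign_peaks(obs, calc):
    print(row)

A real tool also has to cope with fewer observed peaks than atoms (omissions), author numbering schemes, and equivalent carbons reported once - which is where the databases and heuristics mentioned above come in.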

Open NMR and OSCAR toolchains

I am currently refactoring Nick Day's code that has supported "NMREye" - the collection of Open experiments and Data that he has generated as part of his thesis and that have been trailed on this blog (View post). One intention of this - which got lost in some of the other discussion - is to be able to see whether published results are "correct". This is, of course, not new to us - students here developed the OSCAR toolkit for checking experimental data (View post). The NMREye work suggests that it should be possible to validate the actual 13C NMR values reported in a scientific experiment.

Nick will take it as a compliment that I am refactoring his code. It was written on a very strict timescale - he had to write the code, collect and analyse the results in little more than a month. And his work has a wider applicability within our group. So I am trying to design a library system that supports his ideas while being generally re-usable. And this has very useful consequences for CML - the main question as always is "does CML support enough chemistry in a simple fashion and can it be coded?". As an example, here's some data from a thesis we are analyzing in the SPECTRaT project:

13C (150 MHz) d 138.4 (Ar-ipso-C), 136.7 (C-2), 136.1 (C-1), 128.3, 127.6, 127.5 (Ar‑ortho-C, Ar-meta-C, Ar-para-C), 87.2 (C-3), 80.1 (C-4), 72.1 (OCH2Ph), 69.7 (CH2OBn), 58.0 (C-5), 26.7 (C-6), 20.9 ((CH3)AC-6), 17.9 ((CH3)BC-6), 11.3 (CH3C‑2), 0.5 (Si(CH3)3).

(the "d" is a delta, but I think everything has been faithfully copied from the Word document). Note that OSCAR can:

  • understand that this is a 13C spectrum
  • extract the frequency
  • identify the peak values (shifts) and identify the comments

Try to think how you would explain this to a robot and what additional information you would need. Indeed try to explain this to a non-chemist - it's a useful exercise.

What OSCAR and the other tools cannot do yet is:

  • extract the solvent (this is mentioned elsewhere in the thesis)
  • understand the comments
  • manage the framework symmetry group of the phenyl ring
  • understand peakGroup (the aromatic ring)

So the toolchain has to cover this and much more. However the open source chemistry community (in this case all Blue Obelisk) has provided most of the components. More on this later.

Open NMR: update

I am very grateful to hko and Wolfgang Robien for their continued analysis of the results of Nick Day's automated calculation of NMR chemical shifts, using the GIAO approach (parameterized by Henry Rzepa). The discussion has shown that some structures are "wrong" and rather more are misassigned.

Wolfgang Robien Says:
November 11th, 2007 at 10:01 am

we need ‘CORRECT’ data - many assignments of the early 70’s are absolutely correct and useful for comparison [...]
As a consequence of your QM-calculations 10 assignment corrections and 1 structure revision within a few hundred compounds have been performed by ‘hko’ (see postings above) - this corresponds to an
error rate of approx. 5% ! [PMR: In the data set we extracted from NMRShiftDB]. [... discussion of how such errors are detected snipped...]

PMR: Part of the exercise that Nick Day has undertaken was to give an objective analysis of the errors in the GIAO method. The intention was to select a data set objectively. It is extremely difficult to select a representative data set by any means - every collection is made with some purpose in mind. We assumed that NMRShiftDB was "roughly representative" of 13C NMR (and so far this hasn't been an issue). It could be argued that it may not have many organometallics, minerals, proteins, etc. and I suspect that our discourse is mainly about "small organic molecules". But I don't know. It may certainly not be representative of the scope of GIAO or HOSE codes. Again I don't know. Having made the choice of data set the algorithm for selecting the test data was objective and Nick has stated it (< 20 heavy atoms, <= Cl except Br, no adjacent acyclic bonds). There may have been odd errors in implementing this (we got 2-3 compounds with adjacent acyclic bonds) but it was largely correct. And it could be re-run to remove these. We stress again that we did not know how many structures we would get and whether they would behave well in the GIAO method. In fact over 25% failed to complete the calculation. (We are continuing to find this - the atom count is not a perfect indication of how long a calculation will take, which can vary by nearly a factor of 10).

We would not claim that the remaining ca. 250 compounds were "representative". There are no organometallics, no electron-deficient compounds, no overcrowded compounds, no major ring currents, etc. (all of which are areas where we might expect GIAO to do better than some empirical methods). In fact the compounds are generally ones that we would expect connection-table-based methods to score well on as there are few unusual groups (so well trained) and no examples where the connection table cannot describe the molecule well (e.g. Li4Me4, Fe(Cp)2, etc.).

Our current conclusion is that the variance in the experimental data is sufficiently large (even after removal of misassignments) to hide errors in the GIAO method. This appears to give good agreement with an RMS of ca. 2 ppm. (but again we stress that the data set is not necessarily representative). If the Br/Cl correction had not been anticipated it would have been clearly visible and the exercise would have revealed it as a new effect. It is certainly possible that there are other undetected effects (especially for unusual chemistry). But, for common compounds I think we can claim that the GIAO method is a useful prediction tool. It should be particularly useful where connection tables break down and here are some systems I'd like to see it exposed to:

  • Li4Me4
  • Fe(Cp)2 - although Fe is difficult to calculate well.
  • p-cyclophane (C1c(cc2)ccc2CCc(cc3)ccc3C1)
  • 18-annulene

PMR: So what I would like is a representative test data set that could be used for the GIAO method. The necessary criteria are:

  • It is agreed what the chemical scope is. I think we would all exclude minerals, probably all solid state, proteins, macromolecules (there are other communities which do that). But I think we should include a wide chemical range if possible.
  • The data set is prepared by one or more NMR-expert groups that have no particular interest in promoting one method over another. That rules out Henry, Wolfgang, ACDLabs, and probably NMRShiftDB.
  • The data set should provide experimental chemical shifts and the experts should have agreed the assignments by whatever methods are currently appropriate - these could include a group opinion. The assignments should NOT have been based on any of the potential competitive methodologies.

For a competition there would be stronger requirements - it is essential it is seen to be fair as reputation and commercial success might hang on the result.

So I make my request again. Please can anyone give me some data that we can use in an Open experiment to test (and if necessary validate/invalidate) the GIAO method? At this stage we'd be happy to take material from anyone's collections, but it would have to be Open so that other groups have the chance to comment.

I hope someone can volunteer. If not we may have to resort to (machine) extraction of data from the current literature. Our experience with crystallography suggests that the reporting and quality of analytical data in general has increased over the last 10 years.

Open NMR: Nick Day's "final" results

Nick has more-or-less finished the computational NMR work on compounds from NMRShiftDB and we are exposing as much of the work as technically possible. Here is his interim report, some of which I trailed yesterday. The theoretical calculation (rmpw1pw91/6-31g(d,p)) involves:

  • correction for spin-orbit coupling in C-Cl (-3 ppm) and C-Br (-12 ppm)
  • averaging of chemically identical carbons (solves some, but not all conformational problems)
  • extra basis set for C and O [below]

    ====== Gaussian 03 ====

    --Link1--
    %chk=nmrshiftdb2189-1.chk
    # rmpw1pw91/6-31g(d,p) NMR scrf(cpcm,solvent=Acetone) ExtraBasis

    Calculating GIAO-shifts.

    0 1

    C 0
    SP 1 1.00
    0.05 1.00000000 1.00000000
    ****
    O 0
    SP 1 1.00
    0.070000 1.0000000 1.0000000
    ****
    ====== Gaussian 03 ====

In general his/our conclusions are:

  • the major variance in the observed-calculated variate is due to "experimental" problems ("wrong" structures, misassignments)
  • significant variance from unresolved conformers and tautomers
  • small systematic effects in the offset depending on the hybridization [below]

The final variance is shown here (interactive plot at http://wwmm.ch.cam.ac.uk/data/nmr/html/hsr1_hal_morgan/solvents-difference/index.html - requires Firefox):

nick1.PNG

(In the interactive plot clicking on any point brings up the structure, and the various diagnostic plots can then be loaded for that structure). It can be seen that the sp3 carbons (left) are systematically different from the sp2 (right) and we shall be playing with the basis sets to see if we can get this better. If not it will have to be an empirical calculation.

The variance can be plotted per structure in terms of absolute error (C) and intra-structure variance (RMSD). Here's the plot (http://wwmm.ch.cam.ac.uk/data/nmr/html/hsr1_hal_morgan/RMSD-vs-C/index.html) for this (which obviously includes some of the variance from the systematic error above):

nick2.PNG

The sp2/sp3 scatter can be seen at the left but the main RMSD (> 3.0 ppm) is probably due to bad structures and unresolved chemistry. There are 22 points there and we'd be very grateful for informed comment.
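For concreteness, here is a toy version of the post-processing described in the interim report above (an assumed sketch with invented numbers, not Nick's actual code): apply the heavy-halogen corrections, average shifts over chemically identical carbons, and compute the RMSD against experiment.

import math

# Corrections (ppm) for carbons bonded to Cl/Br, as listed in the report above.
CORRECTION = {"Cl": -3.0, "Br": -12.0}

def corrected(calc_shift, attached_halogens):
    return calc_shift + sum(CORRECTION[x] for x in attached_halogens)

def average_equivalent(shifts_by_atom, equivalence_classes):
    """equivalence_classes: lists of atom labels that are chemically identical."""
    out = dict(shifts_by_atom)
    for group in equivalence_classes:
        mean = sum(shifts_by_atom[a] for a in group) / len(group)
        for a in group:
            out[a] = mean
    return out

def rmsd(calc, obs):
    """calc, obs: dicts of atom label -> shift (ppm) with the same keys."""
    return math.sqrt(sum((calc[a] - obs[a]) ** 2 for a in obs) / len(obs))

# invented mini-example: a CH2Cl carbon and two equivalent methyls
calc = {"C1": corrected(44.1, ["Cl"]), "C2": 25.0, "C3": 24.6}
calc = average_equivalent(calc, [["C2", "C3"]])
obs = {"C1": 40.9, "C2": 24.7, "C3": 24.7}
print(f"RMSD = {rmsd(calc, obs):.2f} ppm")

Per-structure RMSDs computed this way are what the plot above summarises; large values point at misassignments, unresolved conformers and tautomers, or genuinely "wrong" structures rather than at the QM method itself.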

Assuming the main outliers can be discarded for legitimate reasons (not just because we don't like them) then I think we have the following conclusion. For molecules with:

  • one major conformation ...
  • ... and where there are no tautomers or we have got the major one ...
  • ... and where the molecule contains only C, H, B, N, O, F, Si, P, S, Cl, Br ...

then the error to be expected from the calculation is in the range 1-2 ppm.

We can't go any further without having a cleaner dataset. We'd be very interested if anyone can make one Open. But we also have some ideas how to start building one and we'd be interested in collaborators.

We've now essentially exposed all our methodology and data. OK, it wasn't Open Notebook Science because there were times when we didn't expose things. But from now on we shall try to do it as full Open Notebook Science. There may be some manual procedures in transferring results from the Condor system to web pages, but that's no different from writing down observations in a notebook - there will be a few minutes between the experiment and the broadcast. And this will be an experiment where anyone can be involved.