Category Archives: open notebook science

funding models for software, OSCAR meets OMII

In a previous post I introduced our chemical natural language tools OSCAR and OPSIN. They are widely used, but in academia there is a general problem - there isn't a simple way to finance the continued development and maintenance of software. Some disciplines (bioscience, big science) recognize the value of funding software but chemistry doesn't. I can count the following approaches (there may be combinations):

  • Institutional funding. That's the model that ICE: The Integrated Content Environment uses. The main reason is that the University has a major need for the tool, and it's cost-effective to do this as it allows important new features to be added.
  • Consortium funding. Often a natural progression from institutional funding. Thus all the major repository software (DSpace, ePrints, Fedora) and content/courseware (Moodle, Sakai) have a large formal member base of institutions paying subventions. These consortia may also be able to raise grants.
  • Marginal costs. Some individuals or groups are sufficiently committed that they devote a significant amount of their marginal time to creating software. An excellent example of this is George Sheldrick's SHELX, where he single-handedly developed the major community tool for crystallographic analysis. I remember the first distributions - in ca 1974 - when it was sent as a compressed deck of FORTRAN cards (think about that). For aficionados, there was a single variable A(32768) in which different locations had defined meanings only in George's head. Add EQUIVALENCE and blank COMMON, and any alteration to the code except by George led to immediate disaster. A good strategy to avoid forks. My own JUMBO largely falls into this category (but with some OS contribs).
  • Commercial release. Many groups have developed methods for generating a commercial income stream. Many of the computational chemistry codes (e.g. Gaussian) go down this route - an academic group either licenses the software to a commercial company, sets up a company themselves, or recovers costs from users. The model varies. In some cases charges are only made to non-academics, and in some cases there is an active academic developer community who contribute to the main branch, such as for CASTEP.
  • Open Source and Crowdsourcing. This is very common in ICT areas (e.g. Linux) but does not come naturally to chemistry. We have created the BlueObelisk as a loose umbrella organisation for Open Data, Open Standards and Open Source in chemistry. I believe it's now having an important impact on chemical informatics - it encourages innovation and public control of quality. Most of the components are created on marginal costs. It's why we have taken the view that - at the start - all our software is Open. I'll deal with the pros and cons later but note that not all OS projects are suited for crowdsourcing on day one - a reliable infrastructure needs to be created.
  • 800-pound gorilla. When a large player comes into an industry sector they can change the business models. We are delighted to be working with Microsoft Research - gorillas can be friendly - who see the whole chemical informatics arena as being based on outdated technology and stovepipe practices. We've been working together on Chem4Word, which will transform the role of the semantic document in chemistry. After a successful showing at BioIT we are discussing with Lee Dirks, Alex Wade and Tony Hey the future of C4W.
  • Public targeted productisation. In this there is specific public funding to take an academic piece of software to a properly engineered system. A special organisation, OMII, has been set up in the UK to do this...

So what, why, who and where is OMII?

OMII-UK is an open-source organisation that empowers the UK research community by providing software for use in all disciplines of research. Our mission is to cultivate and sustain community software important to research. All of OMII-UK's software is free, open source and fully supported.

OMII was set up to exploit and support the fruits of the UK eScience program. It concentrated on middleware, especially griddy stuff, and this is of little use to chemistry which needs Open chemistryware first. However last year I bumped into Dave DeRoure and Carole Goble and they told me of an initiative - ENGAGE - sponsored by JISC - whose role is to help eResearchers directly:

The widespread adoption of e-Research technologies will revolutionise the way that research is conducted. The ENGAGE project plans to accelerate this revolution by meeting with researchers and developing software to fulfil their needs. If you would like to benefit from the project, please contact ENGAGE or visit their website.

ENGAGE combines the expertise of OMII-UK and the NGS, the UK's foremost providers of e-Research software and e-Infrastructure. The first phase, which began in September, is currently identifying and interviewing researchers that could benefit from e-Research but are relatively new to the field. "The response from researchers has been very positive" says Chris Brown, project leader of the interview phase, "we are learning a lot about their perceptions of e-Research and the problems they have faced". Eleven groups, with research interests that include Oceanography, Biology and Chemistry, have already been interviewed.

The results of the interviews will be reviewed during ENGAGE's second phase. This phase will identify and publicise the 'big issues' that are hindering e-Research adoption, and the 'big wins' that could help it. Solutions to some of the big issues will be developed and made freely available so that the entire research community will benefit. The solutions may involve the development of new software, which will make use of OMII-UK's expertise, or may simply require the provision of more information and training. Any software that is developed will be deployed and evaluated by the community on the NGS. "It's very early in the interview phase, but we're already learning that researchers want to be better informed of new developments and are keen for more training and support." says Chris Brown.

ENGAGE is a JISC-funded project that will collaborate with two other JISC projects, e-IUS and e-Uptake, to further e-Research community engagement within the UK. "To improve the uptake of e-Research, we need to make sure that researchers understand what e-Research is and how it can benefit them" says Neil Chue Hong, OMII-UK's director, "We need to hear from as many researchers and as many fields of research as possible, and to do this, we need researchers to contact ENGAGE."

Dave and Carole indicated that OSCAR could be a candidate for an ENGAGE project, and so we've been working with OMII. We had our first f2f meeting on Thursday, where Neil and two colleagues, Steve and Steve, came up from Southampton (that's where OMII is centred, although they have projects and colleagues elsewhere). We had a very useful session where OMII have taken ownership of the process of refactoring OSCAR and also evangelising it. They've gone into OSCAR's architecture in depth and commented favourably on it. They are picking PeterC's brains so that they are able to navigate through OSCAR. The sorts of things that they will address are:

  • Singletons and startup resources
  • configuration (different options at startup, vocabularies, etc.)
  • documentation, examples and tutorials
  • regression testing
  • modularisation (e.g. OPSIN and pre- and post-processing)
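To illustrate the first two points, here is a minimal, hypothetical sketch (in Python for brevity, though OSCAR itself is Java) of the kind of refactoring involved: replacing a hard-wired singleton holding startup resources with explicit, injected configuration. The class and parameter names are invented for illustration and are not OSCAR's actual API.

```python
# Hypothetical sketch: instead of a classic singleton that loads one
# global vocabulary at import time, the startup resource is an explicit
# constructor argument, so tests and alternative vocabularies are easy.

class ChemicalTagger:
    """Illustrative stand-in for a recogniser that needs a vocabulary."""

    def __init__(self, vocabulary):
        # Configuration is injected, not read from a global at startup.
        self.vocabulary = set(vocabulary)

    def tag(self, tokens):
        # Mark each token as chemical ("CM") or other ("O").
        return [(t, "CM" if t in self.vocabulary else "O") for t in tokens]

# Two instances with different vocabularies can now coexist - impossible
# with a singleton holding one static resource.
default_tagger = ChemicalTagger(["ferrocene", "benzene"])
custom_tagger = ChemicalTagger(["pyridine"])
```

The same shape makes the other bullets easier too: modular components with injected resources are simpler to regression-test and to document with runnable examples.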

And then there is the evangelism. Part of OMII-ENGAGE's remit is to evangelise, through brochures and meetings. So we are tentatively planning an Open OSCAR-ENGAGE meeting in Cambridge in June. Anyone interested at this early stage should mail me and I'll pass it on to the OMII folks.

... and now OPSIN...

CML - a semantic approach to chemistry

Rich Apodaca has asked me to show how CML can deal with metallocene compounds - and I'm happy to do this - it comes at a very good time. He points to the Metallome blog and I'll copy some of the material on ferrocene. I'll show the post and then explain the approach.

Metallome: Drawing ferrocene

Ferrocene was discovered in 1951 and we still do not know the proper way to draw it. The CrossFire example recommends connecting every carbon atom of the ring to the central metal atom. Which is fair enough and will be a valid query for the CrossFire Gmelin database. Similarly, both ChEBI and the NIST Webbook use decacoordinate iron in the ferrocene structure (a). In this representation, all carbon—carbon bonds are single. But, according to IUPAC Recommendations, section GR-1.7.2,

    coordination bonds to contiguous atoms (most commonly representing a form of π-bonding) should be drawn to indicate most clearly that special bonding pattern. Depictions that imply a regular covalent bond — and especially, depictions that show a regular covalent bond to each member of a delocalized system — are not acceptable.

In other words, the preferred representation is the one with bicoordinate iron and a delocalised bond system (b). The problem with that is that there is no agreed way (as far as chemoinformaticians are concerned) to do that, even though solutions for different applications (e.g. for Marvin Sketch) do exist. In MolBase, the coordination number of iron in ferrocene is 6 (and I do remember Mark Winter confirming that this is true). On yet another hand, the Beilstein and ChemIDplus databases represent ferrocene as a standalone Fe2+ and two standalone cyclopenta-2,4-dienide anions (c), thus avoiding the question of coordination number altogether. Naturally, the decacoordinate-iron query will not work in Beilstein. (For InChI implications, see this discussion.)

(a) ferrocene with 10-coordinate iron
(b) ferrocene with bi-coordinate iron
(c) ferrocene as three standalone entities

PMR: Many thanks Kirill for this very clear explanation. The first and central point is that there is no agreed way to represent ferrocene, and the semantic approach honours this.

In CML we represent what we do know, and do that as fully as possible. Implicit semantics (i.e. information that has to be provided by the reader or reading program) creates enormous problems. A typical example of implicit semantics is omitting hydrogen atoms; although CML allows this, we are not allowing omitted H in Chem4Word.

So let's build up systematically. What do we know? We know we have a molecule (ferrocene exists in the gas phase so we can talk of single molecules and don't have to worry about substances at this stage).

<molecule id="mol123456789" title="ferrocene">

What an anticlimax! We knew that. But we have at least told the world we have a molecule. Let's see what the world has to offer... off to Pubchem for a Ferrocene search. This gives a huge number of entries and shows the chaos when we don't have semantic chemistry - several entries have a formula of MF: C10H10Fe-6, which is obviously caused by a non-semantic program trying to work out the formula from non-semantic input.

The lesson is simple: If you care about quality and validity you must use a semantic approach.

So how do we know which is actually "ferrocene"? The simple answer: we don't. We have a number of conflicting pieces of information - some entries have the name "ferrocene", but the formula associated with those names varies. Of course we as inorganic chemists know the "correct" formula, but it still means Pubchem (and much else) doesn't act as a simple lookup. We have to add metadata - who asserted what. That's where semantics starts to come in.

We'll take the first entry:

SID: 49854569
Ferrotsen; Catane; FERROCENE ...
Compound ID: 7611
Source: LeadScope (LS-357)
IUPAC: cyclopenta-1,3-diene; iron(2+)
MW: 186.031400 g/mol | MF: C10H10Fe

There is a lot of semantics we should encode here:

Pubchem ASSERTS that Leadscope (a depositor) has deposited a substance entry (SID: 49854569). Leadscope ASSERTS that this entry relates to the name "ferrocene"; that this entry relates to the name "catane" (and many more); that this entry is associated with formula C10H10Fe; that this entry has a connection table [identical with (c) above]; and so on.

The proper way to do this is with RDF, using triples and/or reification, bnodes or quads. In this way we can see who asserted what, or who asserted who said what. This is not chopping logic - there is no correct connection table for ferrocene; there are only assertions made by authorities (here Leadscope, although they may have taken it from somewhere else unspecified).
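A minimal sketch of the idea, using plain Python tuples to stand in for RDF quads (subject, predicate, object, asserter). The predicate names and the "sid:" identifier scheme are invented for illustration; real RDF would use URIs and a proper triple store.

```python
# Each statement carries its asserter (named-graph / quad style), so we
# can ask "who says the formula is C10H10Fe?" rather than pretending
# there is one true connection table for ferrocene.
quads = [
    ("sid:49854569", "hasName",     "ferrocene", "Leadscope"),
    ("sid:49854569", "hasName",     "catane",    "Leadscope"),
    ("sid:49854569", "hasFormula",  "C10H10Fe",  "Leadscope"),
    ("sid:49854569", "depositedBy", "Leadscope", "PubChem"),
]

def assertions_about(subject, predicate, quads):
    """Return (value, asserter) pairs - every claim keeps its provenance."""
    return [(o, who) for s, p, o, who in quads
            if s == subject and p == predicate]
```

Querying `assertions_about("sid:49854569", "hasFormula", quads)` yields the formula together with who asserted it, which is exactly the distinction the prose above insists on.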

Let's encode the formula. cml:formula is one of the benefits of CML, and C4W has been written to manage formulae properly. There are two formulae: the sum of the atoms (concise) and an inline form (which can be any text).

<molecule id="mol123456789" title="ferrocene" xmlns=''>
<formula concise="C 10 H 10 Fe 1" inline="Fe(C_5_H_5)_2_"/>
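The concise form is deliberately machine-parseable: alternating element symbols and counts. A sketch of reading it in Python (a hypothetical helper, not part of any CML toolkit; it handles only the symbol/count pairs shown here, not the optional trailing formal charge a concise formula may carry):

```python
def parse_concise(concise):
    """Parse a CML concise formula like 'C 10 H 10 Fe 1' into counts."""
    tokens = concise.split()
    # Alternating element-symbol / count pairs.
    return {tokens[i]: int(tokens[i + 1]) for i in range(0, len(tokens), 2)}

counts = parse_concise("C 10 H 10 Fe 1")

# Approximate atomic masses are enough to sanity-check the counts
# against the MW that PubChem reports for this record (~186.03 g/mol).
masses = {"C": 12.011, "H": 1.008, "Fe": 55.845}
mw = sum(masses[el] * n for el, n in counts.items())
```

Because the concise formula is structured, a consuming program can validate it against the atoms actually present, rather than re-deriving a formula from a possibly non-semantic connection table.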

Now we come to the bonding. This representation has three species, and CML supports sub-molecules (again an important feature). The iron is:

<molecule id="m1">
    <atom id="a0" elementType="Fe" formalCharge="2"/>
</molecule>

That tells us everything about the iron with no implicit semantics. The Cps (cyclopentadienyl rings) could be represented in a number of ways. I'll write

<molecule id="m2" formalCharge="-1">
    <atom id="a1" elementType="C"/>
    <atom id="a2" elementType="C"/>
    <atom id="a3" elementType="C"/>
    <atom id="a4" elementType="C"/>
    <atom id="a5" elementType="C"/>
    <atom id="a6" elementType="H"/>
    <atom id="a7" elementType="H"/>
    <atom id="a8" elementType="H"/>
    <atom id="a9" elementType="H"/>
    <atom id="a10" elementType="H"/>
    <bond id="a1_a2" atomRefs2="a1 a2"/>
    <bond id="a2_a3" atomRefs2="a2 a3"/>
    <bond id="a3_a4" atomRefs2="a3 a4"/>
    <bond id="a4_a5" atomRefs2="a4 a5"/>
    <bond id="a5_a1" atomRefs2="a5 a1"/>
    <bond id="a1_a6" atomRefs2="a1 a6"/>
    <bond id="a2_a7" atomRefs2="a2 a7"/>
    <bond id="a3_a8" atomRefs2="a3 a8"/>
    <bond id="a4_a9" atomRefs2="a4 a9"/>
    <bond id="a5_a10" atomRefs2="a5 a10"/>
</molecule>

and the other by analogy. Note that I have not added any bond orders - this is deliberate. The community will argue about whether the ring bonds are single, double, delocalised, pi, etc. They will argue about where the charge should be put. So I have added exactly enough information, stopping before they start fighting.

Putting it together we get the complete CML for this particular representation:

<molecule id="mol123456789" title="ferrocene" xmlns=''>
  <formula concise="C 10 H 10 Fe 1" inline="Fe(C_5_H_5)_2_"/>
  <molecule id="m1">
      <atom id="a0" elementType="Fe" formalCharge="2"/>
  </molecule>
  <molecule id="m2" formalCharge="-1">
      <atom id="a1" elementType="C"/>
      <atom id="a2" elementType="C"/>
      <atom id="a3" elementType="C"/>
      <atom id="a4" elementType="C"/>
      <atom id="a5" elementType="C"/>
      <atom id="a6" elementType="H"/>
      <atom id="a7" elementType="H"/>
      <atom id="a8" elementType="H"/>
      <atom id="a9" elementType="H"/>
      <atom id="a10" elementType="H"/>
      <bond id="a1_a2" atomRefs2="a1 a2"/>
      <bond id="a2_a3" atomRefs2="a2 a3"/>
      <bond id="a3_a4" atomRefs2="a3 a4"/>
      <bond id="a4_a5" atomRefs2="a4 a5"/>
      <bond id="a5_a1" atomRefs2="a5 a1"/>
      <bond id="a1_a6" atomRefs2="a1 a6"/>
      <bond id="a2_a7" atomRefs2="a2 a7"/>
      <bond id="a3_a8" atomRefs2="a3 a8"/>
      <bond id="a4_a9" atomRefs2="a4 a9"/>
      <bond id="a5_a10" atomRefs2="a5 a10"/>
  </molecule>
  <molecule id="m3" formalCharge="-1">
      <atom id="a11" elementType="C"/>
      <atom id="a12" elementType="C"/>
      <atom id="a13" elementType="C"/>
      <atom id="a14" elementType="C"/>
      <atom id="a15" elementType="C"/>
      <atom id="a16" elementType="H"/>
      <atom id="a17" elementType="H"/>
      <atom id="a18" elementType="H"/>
      <atom id="a19" elementType="H"/>
      <atom id="a20" elementType="H"/>
      <bond id="a11_a12" atomRefs2="a11 a12"/>
      <bond id="a12_a13" atomRefs2="a12 a13"/>
      <bond id="a13_a14" atomRefs2="a13 a14"/>
      <bond id="a14_a15" atomRefs2="a14 a15"/>
      <bond id="a15_a11" atomRefs2="a15 a11"/>
      <bond id="a11_a16" atomRefs2="a11 a16"/>
      <bond id="a12_a17" atomRefs2="a12 a17"/>
      <bond id="a13_a18" atomRefs2="a13 a18"/>
      <bond id="a14_a19" atomRefs2="a14 a19"/>
      <bond id="a15_a20" atomRefs2="a15 a20"/>
  </molecule>
</molecule>

Notice how CML has represented exactly what we know about this depiction. (If I had wanted I could have put in the precise positions of the double bonds and carbanion, but it's probably counterproductive). It's completely semantic, no implicit information.
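Because everything is explicit, a generic XML tool can check the representation without any chemistry built in. A small sketch using Python's standard library: it parses a trimmed version of the document above (one C and one H kept per ring, for brevity), counts the atoms, and confirms that the declared formal charges sum to zero.

```python
import xml.etree.ElementTree as ET

# Trimmed version of the complete CML above: one C and one H per ring.
cml = """
<molecule id="mol123456789" title="ferrocene">
  <formula concise="C 10 H 10 Fe 1" inline="Fe(C_5_H_5)_2_"/>
  <molecule id="m1">
    <atom id="a0" elementType="Fe" formalCharge="2"/>
  </molecule>
  <molecule id="m2" formalCharge="-1">
    <atom id="a1" elementType="C"/>
    <atom id="a6" elementType="H"/>
  </molecule>
  <molecule id="m3" formalCharge="-1">
    <atom id="a11" elementType="C"/>
    <atom id="a16" elementType="H"/>
  </molecule>
</molecule>
"""

root = ET.fromstring(cml)

# formalCharge may sit on an atom (the Fe) or on a sub-molecule (the
# rings); a generic walk over every element picks up both.
net_charge = sum(int(el.get("formalCharge", "0")) for el in root.iter())
atom_elements = [el.get("elementType") for el in root.iter("atom")]
```

Nothing here knows what ferrocene is; the consistency check falls out of the explicit semantics alone, which is the point.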

That's enough for now; I promise that in the next post, Rich, I will deal with pi-bonding etc. I hope you will agree that this is one valid representation of ferrocene.


I will start to widen out from the library of the future and bring in chemistry and eScience. Librarians should not switch off, as the topics are very relevant. Several in our group are off to Redmond - to two official meetings and other informal meet-ups. I'll blog (or twitter/FF) about these as we go.

The first meeting is OREChem, sponsored by Lee Dirks from Microsoft External Research and PI'ed by Carl Lagoze from Cornell. Lee is part of Tony Hey's empire in MSR and has responsibility for scholarly publication and education. There is a good coherence and overlap between the projects and we are committed to these being Open.

OAI-ORE (Open Archives Initiative - Object Reuse and Exchange) is brought to you by the people who brought you OAI-PMH - Carl and Herbert. One of the tricky problems on the web is being able to access a bounded set of information. For example, if you go to this blog's address and download it, what do you get? I actually don't know, and I expect it's a mess. This isn't a new problem - the hypermedia gurus have been active for decades; when I started with SGML I spent many hours trying to understand "Bounded Object Sets" and "architectural forms".

ORE tackles this problem in the context of research and scholarship. It can be used for anything, but the thrust is on making web resources for digital libraries, research laboratories, etc. I have the honour of being on the ORE advisory board and I'd urge you to get involved. MSR are backing ORE and, as an exemplar, have applied this to chemistry in OREChem. Here we are showing how to create bounded web resources in a context of linked data. I'll write more later, but to put a marker down: we have transformed CrystalEye into RDF and will be working over the weekend to agree on the best approach to ORE-ifying it. I'll leave you with Carl's recent paper (The oreChem Project) ...

The oreChem Project:
Integrating Chemistry Scholarship with the Semantic Web

Carl Lagoze
Information Science, Cornell University

The oreChem project, funded by Microsoft, is a collaboration[1] between chemistry scholars
and information scientists to develop and deploy the infrastructure, services, and
applications to enable new models for research and dissemination of scholarly materials in
the chemistry community. Although the focus of the project is chemistry, the work is being
undertaken with an attention to general cyber infrastructure for eScience, thereby enabling
the linkages among disciplines that are required to solve today’s key scientific challenges
such as global warming. A key aspect of this work, and a core aim of this project, is the
design and implementation of an interoperability infrastructure that will allow chemistry
scholars to share, reuse, manipulate, and enhance data that are located in repositories,
databases, and Web services distributed across the network.

The foundations of this planned infrastructure are the specifications developed as part of
the Open Archives Initiative‐Object Reuse and Exchange (OAI‐ORE) [9] effort. These
specifications provide a data model [8] and set of serialization syntaxes [10‐12] for
describing and identifying aggregations of Web resources and describing the relationships
among the resources that are constituents of aggregations. The OAI‐ORE specifications are
firmly grounded in the Web architecture [6] and in the principles of the semantic web [4, 7]
and the Linked Data Effort [3]. The relevant connections of the OAI‐ORE specifications to
mainstream Web and Semantic Web architecture include:

  • All aspects of data model are expressed in terms of resources, representations, URIs,
    and triples.
  • The fundamental entity in the data model, the Aggregation, is a resource without a
    Representation (a “non‐document” resource). This paradigm is similar to the
    manner in which real‐world entities or concepts are included in the Web via the
    mechanisms proposed by the Linked Data Effort [3],
  • The description of an Aggregation, a Resource Map, is a separate Resource, which is
    accessible via the URI of the Aggregation using the mechanisms defined for Cool
    URIs [15].
  • The result of an HTTP access of a Resource Map URI is a serialization of the triples
    describing the Aggregation. This serialization may be in any of the OAI‐ORE
    serialization syntaxes: RDF/XML [2], RDFa [1], and Atom [14] (triples can be
    extracted from this via an OAI‐ORE defined GRDDL‐compliant XSLT script).

Our initial work in the oreChem Project is the design of a graph‐based object model that
specializes the core OAI‐ORE data model for the chemistry domain. This model builds on
the centrality of the molecule, or chemical compound, in the record of chemistry
scholarship. In the nature of a relational database key, a molecule or compound, identified
in a universal manner [13], forms the central hub for linkages to other entities such as
investigations, experiments, scholars, and processes related to that molecule. We are then
using this model to design interfaces and APIs to exchange molecular information and their
relationships among distributed repositories, services, and agents.

We are demonstrating this infrastructure by adapting a number of existing chemistry data
repositories[2] to the APIs and models. We are also further populating these repositories by
developing and refining automated techniques for retrospectively extracting chemical
information and interlinking chemical data from existing chemistry research corpora.

Following this we will develop and deploy a number of tools, such as chemical structure
searching, over the repositories that have been adapted to the infrastructure. In the latter
stages of the project, we will extend the retrospective data extraction techniques with active
“in the lab” capture of chemistry data, and the addition of that “in‐process” data to the
knowledge network defined by the infrastructure data model.

Ultimately, we envision that this common data model, interchange protocols, and suite of
data extraction and data capture tools will enable an eChemistry Web – a semantic graph
with embedded subgraphs representing molecules which are then interrelated to
publications that refer to them, experiments that work with them, the context of these
experiments, the researchers working with these molecules, annotations about publications
and experiments, and the like. A particularly interesting aspect of this semantic graph is the
manner in which it mixes data, publication artifacts, and people, providing an information-rich
social network built around the notion of object-centered sociality [5]. In the latter
phases of the project we hope to build innovative analysis tools that will extract new
“scientometric” information and knowledge from the eChemistry Web.

Our work in the oreChem Project and, in particular, our design of the interoperability
infrastructure, is being undertaken with the recognition that chemistry, like any scholarly
discipline, is not an island, but has complex linkages to scholarship in other disciplines and
into related activities such as education, and in fact to the general network‐based
information environment. By basing our work on OAI‐ORE, we hope that the
interoperability paradigm designed for oreChem will coexist with similar work in other
disciplines and in fact with the general Web information space and its ubiquitous search
tools, services, and applications.

[1] Collaborators in the oreChem Project are University of Cambridge (Peter Murray-Rust, Jim
Downing), Cornell University (Carl Lagoze, Theresa Velden), University of Indiana (Geoffrey
Fox, Marlon Pierce), Penn State University (C. Lee Giles, Prasenjit Mitra, Karl Mueller),
PubChem (Steve Bryant), and University of Southampton (Jeremy Frey, Simon Coles).

[2] These repositories include CrystalEye, 100,000 molecules and 100,000 fragments from
crystal structures with full crystallographic details and with 3D coordinates; SPECTRaT,
open theses with molecules; Pub3D, MMFF94‐optimized 3D structures for PubChem
compounds; ChemXSeer, an integrated digital library and database allowing for intelligent
search of documents in the chemistry domain and data obtained from chemical kinetics;
eCrystals, high level crystal structures and processed x‐ray diffraction data; and R4L,
experimental spectroscopic and analytical chemical data.

Please send us your Vistas

I recently got an invitation to speak (anonymized as I don't want to fall out) which included:

"I would very much appreciate a copy of your presentation in advance of the event in Windows XP format as the venue is not migrated to Vista. "

This is yet another example of technology driving scholarly communication. Increasingly we are asked "please send us your Powerpoints".

I shall use modulated sound waves for the body of my message and I am tempted not to bring any visual material. But I shall, because I need to communicate what we are actually doing by showing it. Scientists, of course, frequently need to show images of molecules, animals, stars, clouds, etc. and this will and should be the mainstream.

But there is little need to echo the words that I speak by showing them on the screen. When speaking to an international audience it can be very useful to have text on the screen, but this is a UK event and whatever else I speak clearly and loudly and should be comprehensible.

What is inexcusable is how often conference organizers fail to provide any connection to the Internet, and some of these meetings are *about* the Internet and digital age. Even more inexcusable are those places which charge 100USD d^-1 for connections.

(BTW I am still fighting WordPress which loses paragraph breaks regularly...)

the library of the future - Oxford 2009-04-02

In this and subsequent posts I shall explore some ideas on the library of the future, catalyzed by the following invitation from Rachel Bruce of the JISC:

...I'm now writing on behalf of the JISC and the Bodleian Library to invite you to speak at an event that will explore the future library on 2 April 2009.

The event will be in the style of a question time panel, before questions are put to the panel a number of key stakeholders will present their perspective on their requirements. I thought you'd be a [...] speaker and member of the panel able to give the perspective of a researcher. ... we'd like you to speak about your information needs, how you undertake your research and what you, as a researcher, need to remain relevant and to produce new and innovative research.

The event will run from 2pm - 6.30pm and should have an audience of 150-200. It will be held at the University of Oxford.

The purpose of the event is to consider some of the key challenges that will shape the library of the future - in effect, key issues libraries need to respond to if they are to survive. The types of issues we expect to be raised include: skills for the future librarian, from marketing to data curation; the need to foster partnerships between public and private sectors, as well as working across the organisation (university); the need for a heightened understanding of the changing user base and increasingly diverse needs of users; and the future information needs of researchers - what they will need to undertake their research, and how to serve the citizen.

We are hoping this event, with the aid of high profile speakers, will serve to make a high profile statement to libraries about how they need to respond to support research and society more generally into the future and in the digital age.

I'm very excited about this and I'm starting to think and do some browsing (I won't call it "research"). I shall blog from time to time as I go through - I shall be provocative but, I hope, constructive.

The main question has to be "what is a library?", moderated by "what is it for?" **in the current century**. Unless we can answer those questions, and the second one in a constructive manner - then the rest of the discussion is likely to be ill-directed.

So I have started by trying to ask "what is the Bodleian Library for?" I may try to moderate it by looking at Oxbridge colleges, specifically Balliol and Churchill. Both have archives, but with a wide difference in content and approach.

I'm taking a pragmatic approach.

If, as a citizen of the world with no special privileges, I can't find a resource on the web within 5 minutes then it doesn't exist.

I am not a historical researcher who can travel to read medieval documents - I require them to be online and transcribed into accessible twenty-first century documents. And although in practice I would probably enlist the actual help of librarians/archivists at Bodley, Balliol and Churchill I am doing this deliberately blind. I ask forbearance from anyone whose collections I may apparently criticize - I have unreserved admiration for all who curate the past and present and know how difficult this is with limited resources.

I am a scientist so will start with a hypothesis:

"The stated purpose of libraries at Oxford and Cambridge is to glorify God and promote His Kingdom on earth." This purpose has not been formally modified.

Since Cambridge and Oxford are about 800 years old (Cambridge celebrates its 800th anniversary this year) there may have been minor deviations from this purpose (kings and prime ministers have sometimes tried to steer away from it) but our charters and other founding documents will confirm the hypothesis. (I do not have enough resource to do a proper study, so in the spirit of the collaborative electronic age I will be delighted to see whether the blogosphere can help.)

Let's start with the history and statutes...

Wellcome gets tough on Open Access depositions

When one is active in an area (in this case Open Access) it's often difficult to see how important it is from outside. So I was delighted to get an internal email to all staff making it clear that it is MANDATORY for Wellcome grantees to publish their papers as Open Access. Here are excerpts from the mail:

As you may be aware, the Wellcome Trust's award terms and conditions require that all research papers arising from Wellcome Trust funded research must be made available on the PubMed Central website within six months of publication.

The Wellcome Trust have been monitoring compliance rates, and have been disappointed to find that these are currently very low.  As a result of this, they intend to more actively monitor compliance, and in future will be contacting researchers who have not had articles published as Open Access papers.

The University of Cambridge has been given a grant to cover costs associated with Open Access publishing.  If your journal charges for making your article available on PubMed Central, please refer to this website: for how to claim these costs back from my office.

Further information on the Wellcome Trust's Open Access policy can be found here:, or at the Wellcome Trust's website here:

and the claims site announces:

Claiming Open Access Charges

This page describes how to claim back costs charged by publishers for placing papers on the UK PubMed Central website. Initially, you will have to pay the publisher’s Open Access charges. You can then claim these costs back as follows:

  1. Fill out a form (Open Access request form) with the requested information.
  2. Please return the form and an internal invoice ...
  3. Once we have this, the monies you have paid for Open Access charges will be reimbursed to your account.

I have the privilege of being on the UKPMC advisory board and we've been thinking about how to make the policies and practices more widely known. UKPMC is doing roadshows (the first in Oxford last month) and I am sure they would welcome enquiries from institutions or individuals wanting more info.

We have to realise that Open Access will take hard work. It's not just building deposition systems and expecting them to get filled. It needs a commitment from the grant holder. It's simple:

  • If you receive a grant you have to publish the results as Open Access.

If you don't want to, no-one is forcing you to apply for grants.

(Well, yes they probably are, so you had better get used to the practice of publishing Open Access)

APE2008 - Heuer, CERN

APE (Academic Publishing in Europe) was a stimulating meeting, but I wasn't able to blog any of it as (a) there wasn't any wireless and (b) there wasn't any electricity (we were in the Berlin-Brandenburg Academy of Sciences, which made up for the lack by the architecture and the legacy of bullet holes in the masonry). So I took notes while the battery lasted, but they read rather staccato.
The first keynote was very exciting. Rolf-Dieter Heuer is the new Director General of CERN - where they start hunting the Higgs Boson any time now. CERN has decided to run its own publishing venture - SCOAP3 - which I first heard of from Salvatore Mele - I'm hoping to visit him at CERN before they let the hadrons loose.

So my scattered notes...

SCOAP3 requires all COUNTRIES contribute (i.e. total commitment from the community and support for the poorer members)
closely knit community, 22,000 people
ca 10 MEUR for HEP - much smaller than experiments (500 MEUR) so easy for CERN to manage (so organising a publishing project is small beer compared with lowering a 1200-tonne magnet down a shaft)

22% use of Google by young people in physics as primary search engine
could we persuade people to spend 30 mins/week for tagging

what people want
full text
depth of content

build complete HEP platform
integrate present repositories
one-stop shop
integrate content and thesis material [PMR - I agree this is very important]

text- and data-mining
relate documents containing similar information
new hybrid metrics
deploy Web2.0
engage readers in subject tagging
review and comment

preserve and re-use research data
includes programs to read and analyse
data simulations, programs behind experiments
software problem
must have migration
must reuse terminated experiments

[PMR. Interesting that HEP is now keen to re-use data. We often heard that only physicists would understand the data, so why re-use it? But now we see things like the variation of the fundamental constants over time - I *think* this means that the measurement varies, not the actual constants]

same researchers
similar experiments
future experiments
theorists who want to check
theorists who want to test the future (e.g. weak force)
need to reanalyze data with time (JADE experiment, tapes saved weeks before destruction and had expert)
SERENDIPITOUS discovery showing that weak force grows less with shorter distance

Raw data 3200 TB

raw -> calibrated -> skimmed -> high-level objects -> physics analysis -> results
must store semantic knowledge
involve grey literature and oral tradition

MUST reuse data after experiment is stopped

re-usable by other micro domains
alliance for permanent access

PMR: I have missed the first part because the battery crashed. But the overall impression is that SCOAP3 will reach beyond physics just as arXiv does. It may rival Wellcome in its impact on Open Access publishing. SCOAP3 has the critical mass of community, probably finance, and it certainly has the will to succeed. Successes tend to breed successes.

... more notes will come at random intervals ...

Science 2.0

Bill Hooker points to an initiative by Scientific American to help collaborative science. Mitch Waldrop on Science 2.0

I'm way behind on this, but anyway: a while back, writer Mitch Waldrop interviewed me and a whole bunch of other people interested in (what I usually call) Open Science, for an upcoming article in Scientific American. A draft of the article is now available for reading, but even better -- in a wholly subject matter appropriate twist, it's also available for input from readers. Quoth Mitch:

Welcome to a Scientific American experiment in "networked journalism," in which readers -- you -- get to collaborate with the author to give a story its final form. The article, below, is a particularly apt candidate for such an experiment: it's my feature story on "Science 2.0," which describes how researchers are beginning to harness wikis, blogs and other Web 2.0 technologies as a potentially transformative way of doing science. The draft article appears here, several months in advance of its print publication, and we are inviting you to comment on it. Your inputs will influence the article's content, reporting, perhaps even its point of view.

PMR: It's a reasonably balanced article, touching many of the efforts mentioned in this blog. It's under no illusions that this won't be easy. I've just finished doing an interview where at the end I was asked what we would be like in 5 years' time, and I was rather pessimistic that the current metrics-based dystopia would persist and even get worse (the UK has increased its efforts on metrics-based assessment, in which case almost any innovation, almost by definition, is discouraged). But on the other hand I think the vitality of "2.0" in so many areas may provide unstoppable disruption.

Does the semantic web work for chemical reactions

A very exciting post from Jean-Claude Bradley asking whether we can formalize the semantics of chemical reactions and synthetic procedures. Excerpts, and then comment...

Modularizing Results and Analysis in Chemistry

Chemical research has traditionally been organized in either experiment-centric or molecule-centric models.

This makes sense from the chemist's standpoint.

When we think about doing chemistry, we conceptualize experiments as the fundamental unit of progress. This is reflected in the laboratory notebook, where each page is an experiment, with an objective, a procedure, the results, their analysis and a final conclusion optimally directly answering the stated objective.

When we think about searching for chemistry, we generally imagine molecules and transformations. This is reflected in the search engines that are available to chemists, with most allowing at least the drawing or representation of a single molecule or class of molecules (via substructure searching).

But these are not the only perspectives possible.

What would chemistry look like from a results-centric view?

Let's see with a specific example. Take EXP150, where we are trying to synthesize a Ugi product as a potential anti-malarial agent and identify Ugi products that crystallize from their reaction mixture.

If we extract the information contained here based on individual results, something very interesting happens. By using some standard representation for actions we can come up with something that looks like it should be machine readable without much difficulty:

  • ADD container (type=one dram screwcap vial)
  • ADD methanol (InChIKey=OKKJLVBELUTLKV-UHFFFAOYAX, volume=1 ml)
  • WAIT (time=15 min)
  • ADD benzylamine (InChIKey=WGQKYBSKWIADBV-UHFFFAOYAL, volume=54.6 ul)
  • VORTEX (time=15 s)
  • WAIT (time=4 min)
  • ADD phenanthrene-9-carboxaldehyde (InChIKey=QECIGCMPORCORE-UHFFFAOYAE, mass=103.1 mg)
  • VORTEX (time=4 min)
  • WAIT (time=22 min)
  • ADD crotonic acid (InChIKey=LDHQCZJRKDOVOX-JSWHHWTPCJ, mass=43.0 mg)
  • VORTEX (time=30 s)
  • WAIT (time=14 min)
  • ADD tert-butyl isocyanide (InChIKey=FAGLEPBREOXSAC-UHFFFAOYAL, volume=56.5 ul)
  • VORTEX (time=5.5 min)

It turns out that for this CombiUgi project very few commands are required to describe all possible actions:

  • ADD
  • WAIT
  • VORTEX

By focusing on each result independently, it no longer matters if the objective of the experiment was reached or if the experiment was aborted at a later point.

Also, if we recorded chemistry this way we could do searches that are currently not possible:

  • What happens (pictures, NMRs) when an amine and an aromatic aldehyde are mixed in an alcoholic solvent for more than 3 hours with at least 15 s vortexing after the addition of both reagents?
  • What happens (picture, NMRs) when an isonitrile, amine, aldehyde and carboxylic acid are mixed in that specific order, with at least 2 vortexing steps of any duration?

I am not sure if we can get to that level of query control, but ChemSpider will investigate representing our results in a database in this way to see how far we can get.
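To make the idea concrete, here is a minimal Python sketch of how such a results-centric log could become machine-queryable. The dict schema and field names are my own invention for illustration, not an established format:

```python
# Hypothetical schema: each action is a dict with a verb and parameters,
# mirroring the ADD/VORTEX/WAIT steps quoted above.
actions = [
    {"verb": "ADD", "substance": "methanol",
     "inchikey": "OKKJLVBELUTLKV-UHFFFAOYAX", "volume_ml": 1.0},
    {"verb": "WAIT", "time_s": 15 * 60},
    {"verb": "ADD", "substance": "benzylamine",
     "inchikey": "WGQKYBSKWIADBV-UHFFFAOYAL", "volume_ul": 54.6},
    {"verb": "VORTEX", "time_s": 15},
]

def total_vortex_time(log):
    """Sum the vortexing time (seconds) in a structured action log."""
    return sum(a["time_s"] for a in log if a["verb"] == "VORTEX")

def substances_added(log):
    """List the substances added, in order of addition."""
    return [a["substance"] for a in log if a["verb"] == "ADD"]

print(total_vortex_time(actions))   # 15
print(substances_added(actions))    # ['methanol', 'benzylamine']
```

Queries like "at least 15 s vortexing after the addition of both reagents" then become ordinary filters over these records, rather than free-text searches.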

Note that we can't represent everything using this approach. For example observations made in the experiment log don't show up here, as well as anything unexpected. Therefore, at least as long as we have human beings recording experiments, we're going to continue to use the wiki as the official lab notebook of my group. But hopefully I've shown how we can translate from freeform to structured format fairly easily.

Now one reason I think that this is a good time to generate results-centric databases is the inevitable rise of automation. It turns out that it is difficult for humans to record an experiment log accurately. (Take a look at the lab notebooks in a typical organic chemistry lab - can you really reproduce all those experiments without talking to the researcher?)

But machines are good at recording dates and times of actions and all the tedious details of executing a protocol. This is something that we would like to address in the automation component of our next proposal.

Does that mean that machines will replace chemists in the near future? Not any more than calculators have replaced mathematicians. I think that automating result production will leave more time for analysis, which is really the test of a true chemist (as opposed to a technician).

Here is an example


database, as long as attribution is provided. (If anyone knows of any accepted XML for experimental actions let me know and we'll adopt that.)


I think this takes us a step closer from freeform Open Notebook Science to the chemical semantic web, something that both Cameron Neylon and I have been discussing for a while now.

PMR: This is very important to follow - and I'll give some of our insights. Firstly, we have been tackling this for ca. 5 years, starting from the results as recorded in scientific papers or theses. Most recently we have been concentrating very hard on theses and have just taken delivery of a batch of about 20, all from the same lab.

I agree absolutely with J-C that traditional recording of chemical syntheses in papers and theses is very variable and almost always misses large amounts of essential detail. I also agree absolutely that the way to get the info is to record the experiment as it happens. That's what the Southampton projects CombeChem and R4L spent a lot of time doing. The trouble is it's hard. Hard socially. Hard to get chemists interested (if it were easy we'd be doing it by now). We are doing exactly the same with some industrial partners. They want to keep the lab book. The paper lab book. That's why electronic notebook systems have been so slow to take off. The lab book works - up to a point - and it also serves the critical issues of managing safety and intellectual property. Not very well, but well enough.

J-C asks

If anyone knows of any accepted XML for experimental actions let me know and we'll adopt that

CML has been designed to support this, and Lezan Hawizy in our group has been working in detail over the last 4 months to see how well CML works. It's capable of managing inter alia:

  • observations
  • actions
  • substances, molecules, amounts
  • parameters
  • properties (molecules and reactions)
  • reactions (in detail) with their conditions
  • scientific units
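To give a flavour of what a machine-readable reaction record looks like, here is a minimal sketch that builds a CML-flavoured fragment with Python's standard library. The element and attribute names here are illustrative approximations of the CML style, not authoritative schema:

```python
import xml.etree.ElementTree as ET

# Illustrative only: element/attribute names approximate the flavour of a
# CML reaction record; consult the real CML schema for authoritative names.
reaction = ET.Element("reaction")
reactants = ET.SubElement(reaction, "reactantList")
r = ET.SubElement(reactants, "reactant", title="crotonic acid")
ET.SubElement(r, "amount", units="unit:g").text = "0.0430"
conditions = ET.SubElement(reaction, "conditionList")
ET.SubElement(conditions, "parameter", dictRef="cml:time").text = "14 min"

xml_text = ET.tostring(reaction, encoding="unicode")
print(xml_text)
```

Because the amounts and formulae are structured rather than free text, software like JUMBO can then do the housekeeping - balancing reactions, totting up masses and molar amounts - mentioned below.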

We have now taken a good subset of literature reactions (abbreviated though they may be) and worked out some of the syntactic, semantic, ontological and lexical environment that is required. Here is a typical result, which has a lot in common with J-C's synthesis.


I have cut out the actual compounds, though in the real example they have full formulae in CML, and can be used to manage balance of reactions, masses, volumes, molar amounts, etc. JUMBO is capable of working out which reagents are present in excess, for example. It can also tell you how much of everything you will need and how long the reaction will take. No magic, just housekeeping.

CML is designed with a fluid vocabulary, so that anything which isn't already known is found in dictionaries and repositories. So we have collections of:

  • solvents
  • reagents
  • apparatus
  • procedures
  • appearances
  • units
  • common molecules

A word of warning. It looks attractive, almost trivial, when you start. But as you look at more examples and particularly widen your scope it gets less and less productive. I've probably looked through several hundred papers. There is always a balance between precision and recall and Zipf's law. You will never manage everything. There will be procedures, substances, etc, that defy representation. There are anonymous compounds and anaphora.

So we can't yet build a semantic robot that is capable of doing everything. We probably can build examples that work in specific labs where the reactions are systematically similar - as in combinatorial chemistry.

So, yes, J-C - we would love to explore how CML can support this...

Open Notebook Science and Glueware

Cameron laments the difficulty of creating an Open Notebook system when there is a lot of data:


The problem with data…

Our laboratory blog system has been doing a reasonable job of handling protocols and simple pieces of analysis thus far. While more automation in the posting would be a big benefit, this is more a mechanical issue than a fundamental problem. To re-cap our system is that every “item” has its own post. Until now these items have been samples, or materials. The items are linked by posts that describe procedures. This system provides a crude kind of triple; Sample X was generated using Procedure A from Material Z. Where we have some analytical data, like a gel, it was generally enough to drop that in at the bottom of the procedure post. I blithely assumed that when we had more complicated data, that might for instance need re-processing, we could treat it the same way as a product or sample.
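The "crude kind of triple" described here - Sample X was generated using Procedure A from Material Z - can be sketched in a few lines of Python; the predicate names are hypothetical:

```python
# Hypothetical predicates linking lab-blog posts as subject-predicate-object
# triples, in the spirit of the system described above.
triples = [
    ("SampleX", "generated_by", "ProcedureA"),
    ("ProcedureA", "used_material", "MaterialZ"),
]

def materials_for(sample, facts):
    """Trace a sample back to its source materials via its procedures."""
    procs = [o for s, p, o in facts if s == sample and p == "generated_by"]
    return [o for s, p, o in facts
            if s in procs and p == "used_material"]

print(materials_for("SampleX", triples))  # ['MaterialZ']
```

Even this toy model shows why analytical data that needs re-processing doesn't fit: a gel image hanging off a procedure post has no natural place in the sample/procedure/material chain.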



PMR: How I sympathize! We had a closely related problem with Nick Day's protocol for NMR calculations. There were also other reasons why we didn't do complete Open Notebook, but even if we had wanted we couldn't. Because the whole submissions and calculation process is such horrendous glueware. It's difficult enough keeping it under control yourself, let alone exposing the spaghetti to others. So, until the protocol has stabilised (and that's hard when it's perpetual beta), it's very hard to do ONS.


And what happens when you change the protocol? The data formats suddenly change. And that will foul all your possible collaborators. Do you have a duty of care to support any random visitor who wants to use your data - I have to argue "no" at this stage. You may expose what you have but it's a mess.


The only viable solution is to create a workflow - and to tee the output. But as Carole Goble told us at DCC - workflows are HARD. That's why glueware is so messy - if we had cracked the workflow problem we would have eliminated glueware.


The good news is that IF we crack it for a problem, then it should be much much easier to archive, preserve and re-use the output of ONS.