Category Archives: cyberscience

WWMM: The World Wide Molecular Matrix

Since I have been asked to talk about the WWMM here's a bit of background... When the UK e-Science project started (2001) we put in a proposal for a new vision of shared chemistry - the World Wide Molecular Matrix. The term "Matrix" comes from the futuristic computer network and virtual world in William Gibson's novels where humans and machines are coupled in cyberspace. Our proposal was for distributed chemistry based on a Napster-like model where chemical objects could be shared in server-browsers just as for music.

It seemed easy. If it worked for music it should be possible for chemistry. Admittedly the content was more variable and the metadata more complex. But nothing that shouldn't be possible with XML. And when we built the better mousetrap, the chemists would come. Others liked the idea, and there is an article in Wikipedia (Worldwide molecular matrix).

But it's taken at least 5 years. The idea seems simple, but there are lots of details. The eScience program helped - we had two postdocs through the Cambridge eScience Centre and the DTI (Molecular Informatics "Molecular Standards for the Grid"). As well as CML we listed 10 technologies (Java, Jumbo, Apache HTTP Server, Apache Tomcat, Xindice, CDK - Chemistry Development Kit, JMol, JChemPaint, Condor and Condor-G, PHP). We're not using much PHP, no Xindice, and prefer Jetty to Tomcat, but the rest remain core components. We've added a lot more - RDF, RSS, Atom, InChI, blogs, wikis, SVN, Eclipse, JUnit, and a good deal more. It's always more and more technology... OpenBabel, JSpecView, Bioclipse, OSCAR and OSCAR3...
But we needed it. The original vision was correct but impossible in 2002. Now the technology has risen to meet the expectations. CrystalEye, along with SPECTRa, is the first example of a fully functioning WWMM. It's free, virtually maintenance-free, and very high quality. We have developed it so it's portable and we'll be making the contents and software available wherever they are wanted.

But it also requires content. That's why we are developing ways of authoring chemical documents and why we are creating mechanisms for sharing. Sharing only comes about when there is mutual benefit, and until the blogosphere arrived there was little public appreciation. We now see the value of trading goods and services and the power of the gift economy. In our case we are adding things like quality and discoverability as added value. We've seen the first request for a mashup today.

WWMM requires Open Data, and probably we had to create the definition and management of Openness before we knew how to do it. We'll start to see more truly Open Data as publishers realise the value and encourage their authors to create Open content as part of the submission process. And funders will encourage the creation and deposition of data as part of the required Open publication process. Then scientists will see the value of authoring semantic data rather than paying post-publication aggregators to type it up again. At that stage WWMM will truly have arrived.

NSF/JISC meeting on eScience/cyberinfrastructure

I was privileged to be at a meeting between JISC (UK) and NSF (US). Every paragraph of the report is worth reading - I quote a few...

William Y. Arms and Ronald L. Larsen, The Future of Scholarly Communication: Building the Infrastructure of Cyberscholarship, September 26, 2007. Report of the NSF/JISC Repositories Workshop (Phoenix, April 17-19, 2007). It announces

The fundamental conclusions of the workshop were:
• The widespread availability of digital content creates opportunities for new forms of research and scholarship that are qualitatively different from traditional ways of using academic publications and research data. We call this "cyberscholarship".
• The widespread availability of content in digital formats provides an infrastructure for novel forms of research. To support cyberscholarship, such content must be captured, managed, and preserved in ways that are significantly different from conventional methods.
As with other forms of infrastructure, common interests are best served by agreement on general principles that are expressed as a set of standards and approaches that, once adopted, become transparent to the user. Without such agreements, locally optimal decisions may preclude global advancement. Therefore, the workshop concluded that:
• Development of the infrastructure requires coordination at a national and international level. In Britain, JISC can provide this coordination. In the United States, there is no single agency with this mission; we recommend an inter-agency coordinating committee. The Federal Coordinating Council for Science, Engineering and Technology (FCCSET), which coordinated much of the US government's role in developing high performance computing in the 1990s, provides a good model for the proposed Federal Coordinating Council on Cyberscholarship (FC3S). International coordination should also engage organizations such as the European Strategy Forum on Research Infrastructures (ESFRI), the German research foundation DFG, and the Max Planck Digital Library.
• Development of the content infrastructure requires a blend of interdisciplinary research and development that engages scientists, technologists, and humanities scholars. The time is right for a focused, international effort to experiment, explore, and finally build the infrastructure for cyberscholarship.
• We propose a seven-year timetable for implementation of the infrastructure. The first three years will focus on developing and testing a set of prototypes, followed by implementation of coordinated systems and services.


Computer programs analyze vast amounts of information that could never be processed manually. This is sometimes referred to as "data-driven science". Some have described data-driven science as a new paradigm of research. This may be an over-statement, but there is no doubt that digital information is leading to new forms of scholarship. In a completely different field, Gregory Crane, a humanities researcher, recently made the simple but profound statement, "When collections get large, only the computer reads every word." A scholar can read only one document at a time, but a supercomputer can analyze millions, discovering patterns that no human could observe.


The National Virtual Observatory describes itself as "a new way of doing astronomy, moving from an era of observations of small, carefully selected samples of objects in one or a few wavelength bands, to the use of multiwavelength data for millions, if not billions of objects. Such datasets will allow researchers to discover subtle but significant patterns in statistically rich and unbiased databases, and to understand complex astrophysical systems through the comparison of data to numerical simulations."


The workshop participants set the following goal:
Ensure that all publicly-funded research products and primary resources will be readily available, accessible, and usable via common infrastructure and tools through space, time, and across disciplines, stages of research, and modes of human expression.


The shortcomings of the current environment for scholarly communication are well-known and evident. Journal articles include too little information to replicate an experiment. Restrictions justified by copyright, patents, trade secrets, and security, and the high costs of access all add up to a situation that is far from optimal. Yet this suboptimal system has vigorous supporters, many of whom benefit from its idiosyncrasies.
For example, the high cost of access benefits people who belong to the wealthy organizations that can afford that access. Journal profits subsidize academic societies. Universities use publication patterns as an approximate measure of excellence.

Younger scholars, who grew up with the Web, are less likely to be restrained by the habits of the past. Often – but not always – they are early adopters of innovations such as web search engines, Google Scholar, Wikipedia, and blog-science. Yet, they come under intense pressure early in their careers to conform to the publication norms of the past.

... and so the final proposal

... a seven year target for the implementation of the infrastructure for cyberscholarship. The goal of establishing an infrastructure for cyberscholarship by 2015 is aggressive, but achievable, when coordinated with other initiatives in the U.S., Britain, and elsewhere. A three-phase program is proposed over seven years: a three-year research prototype phase that explores several competing alternatives, a one-year architecture specification phase that integrates the best ideas from the prototypes, followed by a three-year research and implementation phase in which content infrastructure is deployed and research on value-added services continues. Throughout the seven years, an evaluation component will provide the appropriate focus on measurable capability across comparable services. A "roadmap" for the program is suggested in the following figure.

[... it's too large to cut, so you'll have to read it for yourselves...]

... and the details ...

Open grant writing. Can the Chemical Blogosphere help with "Agents and Eyeballs"?

In the current spirit of Openness I'm appealing to the chemical blogosphere for help. Jim Downing and I are writing a grant proposal for the UK's JISC ("supporting education and research"), which funds digital libraries, repositories, eScience/cyberinfrastructure, collaborative working, etc. The grant will directly support the activities of the blogosphere, for example by providing better reporting and review tools, hopefully with chemical enhancement.
The basic theme is that the Chemical Blogosphere is now a major force for enhancing data quality in chemical databases and publications, and we are asking for 1 person-year to help build a "Web 2.0"-based system to help support the current practice and ethos. The current working title is "Agents and Eyeballs", reflecting that some of the work will be done by

  • machines, as in CrystalEye - WWMM, which aggregates and checks published crystal structures on a daily basis.
  • humans, as in the Hexacyclinol? Or Not? saga. Readers may remember that there was a report of the synthesis of a complicated molecule. This was heavily criticized in the blogosphere, and indeed the top 9 hits on Google for "hexacyclinol" are all blogs - the formal, Closed, peer-reviewed paper comes tenth.
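The machine side of this can be illustrated with a toy sketch (this is not CrystalEye's actual code; the bond-length ranges and the data layout are invented for the example). An aggregator can run simple sanity checks over every structure it collects:

```python
# A toy illustration of the kind of automated check a crystallography
# aggregator can run: flag bond lengths outside a plausible range for
# the element pair. Ranges here are rough, illustrative values only.

# Rough single-bond length ranges in angstroms.
TYPICAL_BOND = {
    ("C", "C"): (1.20, 1.70),
    ("C", "H"): (0.90, 1.20),
    ("C", "N"): (1.10, 1.55),
    ("C", "O"): (1.15, 1.50),
}

def check_bonds(bonds):
    """Return the (pair, length) entries that fall outside the expected range."""
    problems = []
    for elem1, elem2, length in bonds:
        pair = tuple(sorted((elem1, elem2)))
        lo, hi = TYPICAL_BOND.get(pair, (0.5, 3.0))  # permissive default
        if not (lo <= length <= hi):
            problems.append((pair, length))
    return problems

bonds = [("C", "C", 1.54), ("C", "O", 1.43), ("C", "C", 2.90)]
print(check_bonds(bonds))  # only the 2.90 A "bond" is flagged
```

The point is not the chemistry but the shape of the process: cheap, automatic, daily, and applied to everything.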

"Given enough eyeballs, all bugs are shallow" - Eric Raymond. In chemistry it is clear that the system of closed peer-review by 2-3 humans sometimes leads to poor data quality and poor science. We've found that in some chemistry journals almost every paper has an error - not always "serious", but ... So:
"Agents and eyeballs for better chemical peer-review".

Not very catchy but we'll think of something.
It's unusual to make your grant proposal Open (and we are not actually putting the grant itself online, especially the financial details). But there are parts of the case that we would like the blogosphere to help with. If you have already written a blog post on any of the aspects here, please give the link. You may even wish to write a post:

  • showing that the blogosphere is organised and effectively oversees all major Open discussion in chemistry. I take Chemical blogspace as the best place for a non-chemist (as the reviewers will be) to start.
  • showing that the Blogosphere cares about data. Here I would like to point to the Blue Obelisk and the way Chemspider has reacted positively to the concerns about data quality.
  • showing that important bad science cannot hide. I would very much like an overview of the hexacyclinol story - which is still happening - with some of the most useful historical links. Anything showing that the blogosphere was reported in the conventional chemical grey literature would be valuable.
  • Open Notebook Science.

We have three partners from the conventional publishing industry - I won't name them - who have offered to help explore how the Agents and Eyeballs approach could help with their data peer review.

You might ask "why is PMR not doing this, but asking the blogosphere?" It's precisely because I want to show how responsive and responsible the blogosphere is, when we ask questions like this.

There is considerable urgency. To include anything in the grant we'll need it within 36 hours, although contributions after that will be seen by the reviewers. I suggest that you leave comments on this post, with pointers where necessary. Later I suspect we'll wikify something, but it's actually the difficulty of doing this properly and easily that is - in part - motivating the grant.


scifoo: academic publishing and what can computer scientists do?

Jim Hendler has summarised several scifoo sessions related to publishing and peer-review and added thoughts for the future (there's more to come). It's long, but I didn't feel anything could be selectively deleted, so I've left only the last para, which has a slight change of subject - speculation on what computer scientists could do to help.

15:16 14/08/2007, Planet SciFoo
Here’s a pre-edited preprint of my editorial for the next issue of IEEE Intelligent Systems. I welcome your comments - Jim H.


[... very worthwhile summary snipped ...]
I believe it is time for us as computer scientists to take a leading role in helping to create innovation in this area. Some ideas are very simple, for example providing overlay journals that link already existing Web publications together, thus increasing the visibility (and therefore impact) of research that cuts across fields. Others may require more work, such as exploring how we can easily embed semantic markup into authoring tools and return some value (for example, automatic reference suggestions) via the use of user-extensible ontologies. In part II of this editorial, next issue, I’ll discuss some ideas being explored with respect to new technologies for the future of academic communication that we as a field may be able to help bring into being, and some of the obstacles thereto. I look forward to hearing your thoughts on the subject.

PMR: I'd love to see some decent semantic authoring tools - and before that just some decent authoring tools. For example I hoped to have contributed code and markup examples to this blog and I simply can't. Yes there are various plugins but I haven't got them to work reliably. So the first step is syntactic wikis, blogs, etc. We have to be able to write code in our blogs as naturally as we create it in - say - Eclipse. To have it checked for syntax. To allow others to extract it. And the same goes for RDF and MathML. SVG is a disaster. I hailed it in 1998 as a killer app - 9 years later we are struggling to get it working in the average browser. These things can be done if we try hard enough, but we shouldn't have to try.

It's even more difficult to create and embed semantic chemistry (CML) and semantic GIS. But these are truly killer apps. The chemical blogosphere is doing its best with really awful baseline technology. Ideas such as embedding metadata in PNGs are better than nothing but almost certain to decay within a year or so. Hiding stuff in PDFs? Hardly semantic. We don't even have a portable mechanism for transferring compound HTML documents reliably (*.mht and so on). So until we have solved some of this I think the semantic layer will continue to break. The message of Web 2.0 is that we love lashups and mashups, but it's not yet clear this scales to formal semantic systems.
What's the answer? I'm not sure since we are in the hands of the browser manufacturers at present and they have no commitment to semantics. They are focussed on centralised servers providing for individual visitors. It's great that blogs and wikis can work with current browsers but they are in spite of the browsers rather than enabled by them. The trend is towards wikis and blogs mounted on other sites rather than our own desktop, rather than enabling the power of the individual on their own machine.

Having been part of the UK eScience program (== cyberinfrastructure) for 5 years I've seen the heavy concentration on "the Grid" and very little on the browser. My opinion is that the middleware systems developed are too heavy for innovation. Like good citizens we installed SOAP, WSDL etc and then found we couldn't share any of it - the installation wasn't portable. So now we are moving to a much lighter, more rapid environment based on minimalist approaches such as REST. RDF rather than SQL, XOM rather than DOM, and a mixture of whatever scripts and templating tools fit the problem. But with a basic philosophy that we need to build it with sustainability in mind.

The Grid suits communities already used to heavy engineering - physics, space, etc. But it doesn't map onto the liberated Web 2.0. An important part of the Grid was controlling who could do what where. The modern web is liberated by assuming that we live our informatics lives in public. Perhaps the next rounds of funding should concentrate on increasing the emphasis on enabling individuals to share information.

cyberscience: Changing the business model for access to data

I have been reviewing the availability of Open Data for cyberscience - concentrating recently on crystallography and chemical spectra as examples. I'll propose a new business model here, still very ill-formed, and I welcome comments. It applies particularly to disciplines where the data are collected in a fragmented manner rather than being coordinated as in, for example, surveys of the earth or sky. I call this fragmentation "hypopublication".
However the Internet has the power to pull together this fragmentation if the following conditions are met:

  • the data are fully Open and exposed. There must be no cost, no impediment to access, no registration (even if free), no forms to fill in.
  • the data must conform to a published standard and the software to manage that standard must be Openly available (almost necessarily Open Source). The metadata should be Open.
  • the exposing sites must be robot-friendly (and in return the robots should be courteous).
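The robot-friendly/courteous-robot condition is concrete enough to sketch. A minimal "courteous robot" consults robots.txt and honours its crawl delay before fetching anything; the robots.txt content and the bot name "wwmm-bot" below are invented for the example, and a real crawler would fetch robots.txt from the site itself:

```python
# A minimal courteous robot: check robots.txt before touching a URL,
# and honour the site's requested crawl delay between requests.
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.parse("""
User-agent: *
Disallow: /private/
Crawl-delay: 5
""".splitlines())

def polite_fetch_allowed(url, agent="wwmm-bot"):
    """True if robots.txt permits this agent to fetch the URL."""
    return rp.can_fetch(agent, url)

delay = rp.crawl_delay("wwmm-bot") or 1.0  # seconds to wait between requests

print(polite_fetch_allowed("http://example.org/data/structure.cif"))  # True
print(polite_fetch_allowed("http://example.org/private/draft.cif"))   # False
```

In return for this courtesy, the exposing site need do nothing more than leave the data files where robots can find them.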

Such a state nearly exists in modern crystallography. For macromolecules, authors are required to deposit data in a central repository. For small molecules there is less Open Data but a significant amount is available because of the work put in by:

  • the International Union of Crystallography (IUCr), which for at least 30 years has pioneered the development of data standards and ontologies, culminating in its current Crystallographic Information File (CIF) specification.
  • a number of publishers who have Openly exposed CIF data files on their websites for every article which contains relevant crystallography. They include the IUCr itself, the Royal Society of Chemistry, the American Chemical Society, the Chemical Society of Japan, and the American Mineralogist. (There may be others - if so I apologize and ask them to come forward). The licences are occasionally a bit fuzzy but the spirit and intention is clear. The data are there as a scientific record and to be re-used.
  • The Crystallography Open Database - a volunteer activity which has aggregated approximately 50 K CIFs from donations.

The Internet now means that the data can be reliably aggregated, as in our CrystalEye knowledgebase. This also acts as an immediate alerting system - as soon as a new piece of interesting crystallography is published, subscribers to our RSS feeds are notified.
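The alerting mechanism amounts to polling a feed and remembering which items have been seen. A minimal sketch, using an inline stand-in for a real feed (the guids and titles are invented):

```python
# Sketch of RSS-based alerting: report feed items not seen before.
import xml.etree.ElementTree as ET

FEED = """<rss version="2.0"><channel>
  <item><guid>cif-1001</guid><title>New structure: C6H6</title></item>
  <item><guid>cif-1002</guid><title>New structure: C10H8</title></item>
</channel></rss>"""

def new_entries(feed_xml, seen):
    """Return (guid, title) pairs for items not yet notified; updates seen."""
    root = ET.fromstring(feed_xml)
    fresh = []
    for item in root.iter("item"):
        guid = item.findtext("guid")
        if guid not in seen:
            fresh.append((guid, item.findtext("title")))
            seen.add(guid)
    return fresh

seen = {"cif-1001"}  # guids already notified on a previous poll
print(new_entries(FEED, seen))  # [('cif-1002', 'New structure: C10H8')]
```

A real subscriber would fetch the feed URL on a schedule, but the logic is no more complicated than this.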

The criticism is sometimes made that unless data is inspected by humans it cannot be certified as fit for purpose. This depends entirely on what the purpose is. It's often better to have data of variable quality than no data at all. And it's always better to have data of variable KNOWN quality rather than none, even if the quality is often known to be low. It's a balance of precision and recall (Why 100% is never achievable). Joe Townsend here has shown in his PhD that if we lower the recall of crystallographic data (i.e. throw out everything that is known to have errors) we can get very high precision indeed without having to inspect the data.
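The trade-off can be illustrated numerically (the records and their "truly correct" labels below are invented for the example; in practice the true labels are unknown, which is exactly why an automatic check is useful):

```python
# Trading recall for precision: discard every entry that fails an
# automatic validation check, and measure what remains.

# (passes_check, truly_correct) pairs -- invented for illustration.
records = [
    (True, True), (True, True), (True, True), (True, False),
    (False, False), (False, False), (False, True),
]

kept = [ok for passed, ok in records if passed]
precision = sum(kept) / len(kept)                   # correct kept / all kept
recall = sum(kept) / sum(ok for _, ok in records)   # correct kept / all correct

print(round(precision, 2), round(recall, 2))  # 0.75 0.75
```

Without filtering, precision here would be 4/7 (about 0.57) at recall 1.0; filtering sacrifices one good record to discard three bad ones. With a stricter check the precision of the retained data can be pushed very high indeed.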
Our remaining problem is that not all publishers expose the data Openly. The rest of this post explores why they should think of doing so.

Before the Internet it was necessary to have central repositories to put data in, but now, with all publishers online, the data can just as easily be posted on their sites. Even if there is no intrinsic search mechanism on the publisher sites, researchers like Nick Day (here) can create tools for managing the data and metadata in CrystalEye. So why don't all publishers expose their crystallography? I think it's just a matter of priorities, and I hope this post will advance the case.

Data costs money. True, but the amount is falling. I don't know how much it costs the publishers above to manage the exposure of the crystallography files - and I'm not asking - but it's obviously not prohibitive. They've done it (I assume) because they think it's an important part of the publication process - allowing science to be verified, providing a record, allowing new research to build on old. So they have - presumably - included the cost within the general cost of publication (which is covered mainly by subscriptions but for some of the articles also paid-by-author/funder Open Access).

The main cost of the process - the creation of communal metadata - is already past. This is probably the largest barrier to any group trying to emulate the idea. But it's also happening in thermochemistry (ThermoML) where a number of journals:

  • Journal of Chemical & Engineering Data (ACS)
  • The Journal of Chemical Thermodynamics (Elsevier)
  • Fluid Phase Equilibria (Elsevier)
  • Thermochimica Acta (Elsevier)
  • International Journal of Thermophysics (Springer)

all require data to be published at source and made Openly available. Here's a sample issue which lists the Open data:


ThermoML Data for The Journal of Chemical Thermodynamics, Vol. 39, No. 6 June 2007
Developed in cooperation between The Journal of Chemical Thermodynamics and the Thermodynamics Research Center (TRC)

The full Table of Contents for this issue is available from JCT. The numbers below correspond to the numbers in the full Table of Contents.

Low pressure solubility and thermodynamics of solvation of oxygen, carbon dioxide, and carbon monoxide in fluorinated liquids
Pages 847-854
J. Deschamps, D.-H. Menz, A.A.H. Padua and M.F. Costa Gomes
ThermoML Data (To download: right-click on link and select "Save Link Target As" )

High pressure phase behaviour of the binary mixture for the 2-hydroxyethyl methacrylate, 2-hydroxypropyl acrylate, and 2-hydroxypropyl methacrylate in supercritical carbon dioxide
Pages 855-861
Hun-Soo Byun and Min-Yong Choi
ThermoML Data (To download: right-click on link and select "Save Link Target As" )


You'll see that the data are Open.

So couldn't this be a model for all of science? As I have posted recently I'm going to write to the editors of Elsevier's Tetrahedron suggesting that they make all their crystallographic data available Openly. They agree it's not their copyright, so it's just a question of how to do it - files on a website shouldn't be a major expense.

And funders should encourage this. If you are urging authors and journals to publish Open full-text, please extend this to data. Yes, there are some technical difficulties in some cases such as metadata, complexity and size but they probably aren't too scary. And in any case the community will help work out how to use them.

cyberscience: Why 100% is never achievable

In the current series of posts I argue that data should be Open and re-usable and that we should be allowed to use robots to extract it from publishers' websites. A common counter argument is that data should be aggregated by secondary publishers who add human criticism to the data, improve it, and so should be allowed to resell it. We'll look at that later. But we also often see a corollary expressed as the syllogism:

  • all raw data sets contain some errors
  • any errors in a data set render it worthless
  • therefore all raw datasets are worthless and are only useful if cleaned by humans

We would all agree that human data aggregation and cleaning is expensive and requires a coherent business model. However I have argued that the quality of data-collection technology and of the added metadata can now be very high, so that "raw" data sets such as CrystalEye can be of very high quality and, if we can assess the quality automatically, we can make useful decisions as to what purposes they are fit for.

In bioscience it is well appreciated that no data is perfect, and that everything has to be assessed in the context of current understanding. There is no "sequence of the human genome" - there are frequent releases which reflect advances in technology, understanding and curation. The Ensembl genome system is on version 45. We are going to have to embrace the idea that whenever we use a data set we need to work out "how good it is".

That is a major aspect of the work that our group does here. We collect data in three ways:

  • text-mining
  • data-extraction
  • calculation and simulation

None of these is perfect, but the more access we have to them and their metadata, the better off we are.

There are a number of ways of assessing the quality of a dataset.

  • using two independent methods to determine a quantity. If they disagree then at least one of them has errors. If they agree within estimated error then we can hypothesize that the methods, which may be experimental or theoretical, are not in error. We try to disprove this hypothesis by devising better experiments or using new methods.
  • relying on expert humans to pass judgement. This is good but expensive and, unfortunately, almost always requires cost-recovery that means the data are not re-usable. Thus the NIST Chemistry WebBook is free-to-read for individual compounds but the data set is not Open. (Note that while most US government works are free of copyright there are special dispensations for NIST and similar agencies to allow cost-recovery).
  • relying on the user community and blogosphere to pass judgement. This is, of course, a relatively new approach but also very powerful. Every time someone accesses a data item and reports an error, the data set can potentially be enhanced. Note that this does not mean that the data are necessarily edited, but that an annotation is added to the set. This leads to communal curation of data - fairly common in bioscience, virtually unknown in chemistry since - until now - almost all data sets were commercially owned, and few people will put effort into annotating something that they do not - in some sense - own. The annotation model will soon be made available on CrystalEye.
  • validating the protocol that is used to create or assess the data set. This is particularly important for large sets where there is no way of predicting the quantities. There are 1 million new compounds published per year, each with ca 50 or more data points - i.e. ca 50 million data items per year. This is too much for humans to check so we have to validate the protocol that extracts the data from the literature.
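The first approach above - two independent determinations agreeing within estimated error - can be sketched as a simple criterion (the two-sigma threshold and the numbers are invented, not real measurements):

```python
# Agreement test between two independent determinations of a quantity:
# they agree if the difference is within n_sigma combined errors.
import math

def agree(value1, err1, value2, err2, n_sigma=2.0):
    """True if the two determinations agree within n_sigma combined errors."""
    combined = math.hypot(err1, err2)  # errors added in quadrature
    return abs(value1 - value2) <= n_sigma * combined

print(agree(1.54, 0.01, 1.52, 0.02))  # True: consistent within error
print(agree(1.54, 0.01, 1.40, 0.02))  # False: at least one method is in error
```

If the test fails we know at least one method has errors; if it passes we hypothesize both are sound, and then try to disprove that with better experiments.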

The first three approaches are self-explanatory, but the fourth needs comment. How do we validate a protocol? In medicine and information retrieval (IR) it is common to create a "gold standard". Here is WP on the medical usage:

In medicine, a gold standard test is a diagnostic test or benchmark that is regarded as definitive. This can refer to diagnosing a disease process, or the criteria by which scientific evidence is evaluated. For example, in resuscitation research, the gold standard test of a medication or procedure is whether or not it leads to an increase in the number of neurologically intact survivors that walk out of the hospital.[1] Other types of medical research might regard a significant decrease in 30 day mortality as the gold standard. The AMA Style Guide prefers the phrase Criterion Standard instead of Gold Standard, and many medical journals now mandate this usage in their instructions for contributors. For instance, Archives of Physical Medicine and Rehabilitation specifies this usage.[1].

A hypothetical ideal gold standard test has a sensitivity of 100% (it identifies all individuals with a disease process; it does not have any false-negative results) and a specificity of 100% (it does not falsely identify someone with a condition that does not have the condition; it does not have any false-positive results). In practice, there are no ideal gold standard tests.

In IR the terms recall and precision are normally used. Again it is normally impossible to get 100%, and so values of 80% are common. Let's see a real example from OSCAR3 operating on an Open Access thesis from St Andrews University[1]. OSCAR3 is identifying ("retrieving") the chemicals (CM) in the text; at this stage we ignore whether OSCAR actually knows what the compounds are.

In 1923, J.B.S. Haldane introduced the concept of renewable hydrogen. Haldane stated that if hydrogen derived from wind power via electrolysis were liquefied and stored it would be the ideal fuel of the future. [5] The depletion of fossil fuel resources and the need to reduce climate-affecting emissions (also known as green house gases) has driven the search for alternative energy sources. Hydrogen is a leading candidate as an alternative to hydrocarbon fossil fuels. Hydrogen can have the advantages of renewable production from carbon-free sources, which result in emission levels far below existing emission standards. Hydrogen can be derived from a diverse range of sources offering a variety of production methods best suited to a particular area or situation.[6] Hydrogen has long been used as a fuel.
The gas supply in the early part of the 20th century consisted almost entirely of a coal gas comprised of more than 50% hydrogen, along with methane, carbon monoxide, and carbon dioxide, known as "town gas".

[OSCAR's inline highlighting is lost in this copy; its datastore entries marked terms such as hydrogen, hydrocarbon, carbon, methane, carbon monoxide, and carbon dioxide as chemicals (CM) - some with SMILES, InChI, and ChEBI identifiers - and fuel, fossil fuel, and emission as ontology (ONT) terms.]

Before we can evaluate how good this is we have to agree on what the "right" result is. Peter Corbett and Colin Batchelor spent much effort devising a set of rules that tell us what should be regarded as a CM. Thus, for example, "coal gas" and "hydrocarbon" are not regarded as CMs in their guidelines but "hydrogen" is. In this passage OSCAR has found 12 phrases (mainly single words) which it thinks are CMs. Of these, 11 are correct but one ("In") is a false positive (OSCAR thinks it is the element indium). OSCAR has not missed any, so we have:

True positives = 11

False positives = 1

False negatives = 0

So we have a recall of 100% (we got all the CMs) with a precision of 11/12 ≈ 92%. It is, of course, easy to get 100% recall by marking everything as a CM, so it is essential to report the precision as well. The harmonic mean of these two quantities (not the simple average) is often called the F or F1 score.
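These scores take only a few lines to compute. A minimal sketch (the function name is mine, not OSCAR's):

```python
def scores(tp, fp, fn):
    """Precision, recall and F1 from the raw counts of a tagging run."""
    precision = tp / (tp + fp)  # fraction of flagged phrases that are genuine CMs
    recall = tp / (tp + fn)     # fraction of genuine CMs that were flagged
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return precision, recall, f1

# The hydrogen passage: 11 correct hits, 1 false positive ("In"), nothing missed
p, r, f1 = scores(tp=11, fp=1, fn=0)
print(f"precision={p:.0%}  recall={r:.0%}  F1={f1:.0%}")
# precision=92%  recall=100%  F1=96%
```

For the second example below the counts are mirrored - `scores(tp=12, fp=0, fn=1)` gives precision 100% and recall ≈ 92%.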

In this example OSCAR scored well because the compounds are all simple and common. But if we have unusual chemical names OSCAR may miss them. Here's an example from another thesis [2]:

and [D-tryptophan] and some [tryptophan] and [indole] derivatives (1) and [D-tryptophan] and [D-tryptophan] and [D-tryptophan] (1) (1) (1)

Halometabolite PrnA / [PrnC]

[Pyrrolnitrin] (5) [Rebeccamycin] (6) [Pyrrindomycin] (7) [Thienodolin] (8) [Pyrrolnitrin] (9) [Pentachloropseudilin] (10) Pyoluteorin (11) -

(OSCAR's inline annotation markup has been stripped from this excerpt; the terms OSCAR marked are shown in square brackets.)

Here OSCAR has missed "Pyoluteorin" (see PubChem for the structure), so we have 12 true positives and 1 false negative. That gives a recall of 12/13 ≈ 92% and a precision of 100%.

Peter Corbett measures OSCAR3 ceaselessly. It's only as good as the metrics. Read his blog to find out where he's got to. But I had to search quite hard to find false negatives. It's also dependent on the corpus used - the two examples are quite different sorts of chemistry and score differently.
Unfortunately this type of careful study is uncommon in chemistry. Much of the software and information is commercial and closed. So you will hear vendors tell you how good their text-mining software or descriptors or machine learning are, and there are hundreds of papers in the literature claiming wonderful results. Try asking them what their gold standard is, and what the balance between precision and recall is. If they look perplexed or shifty, don't buy it.

So why is this important for cyberscience? Because we are going to use very large amounts of data and we need to know how good it is. That can normally only be done by using robots. In some cases these robots need a protocol that has been thoroughly tested on a small set - the gold standard - from which we can infer something about the quality of the rest of the data. Alternatively we develop tools for analysing the spread of the data, its consistency with known values, etc. It's hard work, but a necessary approach for cyberscience. And we shall find, if the publishers let us have access to the data, that everyone benefits from the added critical analysis that the robots bring.
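As a toy illustration of the "consistency with known values" check, a validating robot can flag any extracted value that falls outside an expected range. This is only a sketch - the reference ranges and property names below are invented for the example, not real validation rules:

```python
# Invented reference ranges for illustration only (property name -> (low, high))
EXPECTED = {
    "C-C bond length / A": (1.20, 1.70),
    "cell angle / deg": (30.0, 150.0),
}

def flag_outliers(records):
    """Return the (name, value) records whose value lies outside its expected range."""
    bad = []
    for name, value in records:
        lo, hi = EXPECTED[name]
        if not (lo <= value <= hi):
            bad.append((name, value))
    return bad

suspect = flag_outliers([
    ("C-C bond length / A", 1.54),   # plausible
    ("C-C bond length / A", 0.90),   # too short - probably an extraction error
    ("cell angle / deg", 90.0),      # plausible
])
print(suspect)  # only the 0.90 A bond is flagged
```

A real protocol would of course use tested, discipline-agreed tolerances rather than hard-coded guesses.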

[1] I can't find this by searching Google - so, repository managers, make sure your theses are well indexed by search engines. Anyway, Kelcey, many thanks for making your thesis Open Access.

[2] "PyrH and PrnB Crystal Structures", Walter De Laurentis, School of Chemistry and Centre for Biomolecular Sciences, University of St Andrews, December 2006. Supervisor: Prof. J. H. Naismith

cyberscience: CrystalEye at WWMM Cambridge

We've mentioned CrystalEye frequently on this blog but not announced it formally. We were about to post it about three weeks ago but had a serious server crash. We are also very concerned about quality and want to make sure there are as few bugs as possible (there will always be bugs, of course). But now here is a formal announcement. The work and the site have all been done by Nick Day (ned24 AT our email address). I am only blogging this because Nick is too busy using the data for research and hoping to get enough results by the end of the academic year.
We believe that this is one of the first resources in any discipline created and kept current by robotic extraction of data from primary publications and re-published for Open re-use. Every day a spider visits the sites of publishers who expose crystallographic data, downloads it, validates it, etc. Here's a schematic diagram:

This post stresses the right-hand side - the systematic scraping of data from publisher sites - while the left-hand side shows we can also do it for theses (the SPECTRa-T project). That's still in its infancy, but at least it's within academia's control (as long as they don't give away the right to control theses).
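The daily cycle on the right-hand side can be sketched as a small fetch-validate-store loop. This is a minimal outline, not CrystalEye's actual code: the URLs are placeholders and the fetch, validation and storage steps are pluggable stand-ins:

```python
def harvest(urls, fetch, validate, store):
    """One spider run: download each exposed data file, validate it,
    and store only the entries that pass. Returns (accepted, rejected)."""
    accepted, rejected = 0, 0
    for url in urls:
        raw = fetch(url)          # in real use, an HTTP GET against the publisher site
        if validate(raw):         # in real use, CIF syntax and chemistry checks
            store(url, raw)
            accepted += 1
        else:
            rejected += 1
    return accepted, rejected

# Toy run with stand-ins for the network and the archive:
files = {"http://publisher.example/a.cif": "data_ok",
         "http://publisher.example/b.cif": ""}
archive = {}
stats = harvest(files, fetch=files.get,
                validate=lambda raw: bool(raw),   # placeholder check: non-empty file
                store=archive.__setitem__)
print(stats)  # (1, 1): one file accepted, one rejected
```

Separating the fetch and validate steps like this is what makes the pipeline portable - the same workflow can run against any publisher that exposes data, or against a thesis repository.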
CrystalEye adds a LOT of validation. The central part above expands to a detailed workflow:


We are generating individual 3D molecules, 2D diagrams, InChIs and HTML - and none of this is available in its raw form on the publisher site. Remember that we are simply taking the raw data from the author - untouched by the publisher. So we are adding a great deal of value without infringing any rights.

There's a lot more to say about CrystalEye, so read the FAQ. We'll be back to explain how it changes the face of secondary publishing. We simply don't need humans retyping the literature.

Now if you are a publisher (or editor employed by a publishing house - I have to watch my terminology) this post will probably cause one of the following reactions:

  • This is really exciting. How can we help the development of cyberscience? We could benefit from a whole new market.
  • This is an appalling threat. The scientists are stealing our data. How can we stop them?

If you are not in the first category you are in the second. "This is boring", "data isn't important, only full-text matters", "we can't afford to be involved", "let's wait 10 years and see what happens". So I am gently inviting the publishers to tell me whether they will help me, or prevent me from using their data for cyberscience.

cyberscience: Where does the data come from?

[In several previous and future posts I use the tag "cyberscience" - a portmanteau of E-Science (UK, Europe) and Cyberinfrastructure (US) which emphasizes the international aspect and importance of the discipline.]

Cyberscience is a vision for this century:

The term describes the new research environments that support advanced data acquisition, data storage, data management, data integration, data mining, data visualization and other computing and information processing services over the Internet. In scientific usage, cyberinfrastructure is a technological solution to the problem of efficiently connecting data, computers, and people with the goal of enabling derivation of novel scientific theories and knowledge.

So we in Cambridge are attempting to derive novel scientific theories and knowledge. We may do much of this in public - on this blog and other media - and we welcome any help from readers.

Many cyberscientists will bemoan the lack of data. Often this is because science is difficult and expensive - if I want a crystal structure at 4 K determined by neutron diffraction I have to send the sample to a nuclear reactor. If I want images of the IR spectrum of parts of the environment I need a satellite with modern detectors. If I want to find a large hadron I need a large hadron collider.

But we are getting a lot better at automating data collection. Almost all instruments are now digital, and many disciplines have put effort into devising metadata and markup procedures. The dramatic, but largely unsung, developments in detectors - sensitivity, size, response, rate, wavelength, cost - make many data trivially cheap. And there is a corresponding increase in quality: if all data are determined with a given detector it is much easier to judge quality automatically. We'll explore this in later posts.

So where does data come from? I suggest the following, but would be grateful for more ideas:

  • directly from a survey designed to gather data. This is common in astronomy, environment, particle physics, genomes, social science, economics, etc. Frequently the data are deposited in the hope that someone else will use them. It's almost unknown in chemistry (and I would not be optimistic about getting funding). It sometimes happens through government agencies (e.g. NIST) but the results are often not open.
  • directly from the published literature. This is uncommon. The reasons are that (a) we haven't agreed the metadata, (b) scientists don't see the point, and (c) the journals don't encourage or even permit it. However it has been possible in our CrystalEye project, which the next post will highlight. Note, of course, that this is an implicit and valuable way of validating published information.
  • retyped from the current literature. Unfortunately this is all too common. It has no advantages, other than that it is often legal to retype facts but not for robots to ingest them as above. It is slow, expensive, error-prone and almost always leads to closed data. It may be argued that the process is critical and thus adds value - in some cases this is true - but most of it is unnecessary: robots can often do a good job of critiquing data.
  • output of simulations and other calculations. We do a lot of this - over 1 million calculations on molecules. If you believe the results it's a wonderful source of data. We've been trying to validate the calculations in CrystalEye.
  • mashups and meshups. The aggregation of data from any of the 4 sources above into a new work. The aggregation can bridge disciplines, help validation, etc. We are validating computation against crystallography and crystallography against computation. Both win.
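The computation-vs-crystallography cross-check in the last bullet reduces, at its simplest, to comparing two sets of numbers. A sketch (the bond lengths here are made up for illustration, not CrystalEye data):

```python
from math import sqrt

def rmsd(computed, observed):
    """Root-mean-square deviation between paired lists of values."""
    assert len(computed) == len(observed)
    n = len(computed)
    return sqrt(sum((c - o) ** 2 for c, o in zip(computed, observed)) / n)

# Invented bond lengths (Angstroms): QM-computed vs crystal-structure values
calc = [1.54, 1.39, 1.21]
expt = [1.53, 1.40, 1.22]
print(f"RMSD = {rmsd(calc, expt):.3f} A")  # a small RMSD means the two sources agree
```

When the deviation is small, each dataset validates the other; when it is large, one of them (or the extraction step) deserves a closer look. Both win.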

So given the power of cyberscience why is it still so hard to find data even when they exist? (I except the disciplines which are explicitly data-driven). Here are some ideas:

  • The scientists don't realise how useful their data are. Hopefully public demonstration will overcome this.
  • The scientists do realise how useful their data are (and want to protect them). A natural emotion, and one that repositories everywhere have to deal with. It's basically the Prisoner's Dilemma, where individual benefit competes against communal good. In many disciplines the societies and the publishers have collaborated to require deposition of data regardless of screams from authors, and the community can see the benefit.
  • The data are published, but only in human-readable form ("hamburgers"). This is all too common, and we make gentle progress with publishers to try to persuade authors and readers of the value of semantic data.
  • The publishers actively resist the re-use of data because they have a knee-jerk reaction against making any content free. They don't realise data are different (I am not, of course, asserting that I personally agree with copyrighting the non-data, but the argument is accepted by some others). They fear that unless they avidly protect everything they will lose out. We need to work ceaselessly to show them that this is misguided and that this century is about sharing, not possessing, scientific content.
  • The publishers actively resist the re-use of data because they understand its value and wish to sell us our own data back. This blog will try to clarify the cases where this happens and try to give the publishers a chance to argue their case.

The next post shows how our CrystalEye project is helping to make data available to cyberscience.