CML Blog and Update

Henry [Rzepa] and I are planning a major facelift for the public face of CML this year. CML is about 13 years old and has gone through several revisions and relocations, so the information is somewhat scattered. CML is now a large system (ca. 100 elements and 200 attributes) and we have good proof-of-concept implementations in all the key areas (a minimal example of the markup is sketched after the lists below):

  • molecular structures (atoms, bonds, etc.)
  • reactions
  • substances, mixtures, macroscopic amounts
  • properties of molecules, reactions and substances
  • crystallography and solid state
  • computational chemistry
  • analytical data and spectroscopy
  • procedures, actions and objects in physical science

In addition, CML can support:

  • interoperation with other markup languages, especially XHTML, MathML and SVG
  • dictionaries and ontologies
  • representation in RDF(S)

CML can also support a number of language features:

  • data interchange
  • ontology development
  • workflow and computation
  • computational grammar (e.g. combinatorial chemistry, fuzzy structures, variability)
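
To give a flavour of the markup, here is a minimal, hand-written sketch of a molecule in CML. It is deliberately simplified for illustration (most hydrogens omitted, no coordinates) and is not guaranteed to validate against the released schema:

  <?xml version="1.0" encoding="UTF-8"?>
  <!-- hand-written sketch of a CML molecule; simplified, not validated -->
  <molecule id="methanol" xmlns="http://www.xml-cml.org/schema">
    <atomArray>
      <atom id="a1" elementType="C"/>
      <atom id="a2" elementType="O"/>
      <atom id="a3" elementType="H"/>
      <!-- remaining hydrogens omitted for brevity -->
    </atomArray>
    <bondArray>
      <bond atomRefs2="a1 a2" order="S"/>
      <bond atomRefs2="a2 a3" order="S"/>
    </bondArray>
    <formula concise="C 1 H 4 O 1"/>
  </molecule>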

CML has been publicly available for many years, and over the last two years has stabilised in design and software. We do not expect major changes in the next year and so are rationalising access to the components and information. Recently we have had bad attacks from spammers on the wikis, so we will discontinue these as an interactive feature and use them as a read-only resource. Since the blog system here has worked well I shall use the CMLBlog as a means of developing public resources for CML.
Since the CMLBlog has been dormant for the last year I shall post messages on this blog and clone them to the CMLBlog, so that those who only want to follow CML can transfer there. I hope to post about 1 topic/day, which should get me through the schema by the end of the year. Each post will cover a clear topic and allow for feedback. And there will be regular requests for new topics.
BTW – if anyone knows a good forum software this might be an alternative to a blog.

Posted in XML

Open Data in Science

I have been invited to write an article for Elsevier’s Serials Review and mentioned it in an earlier post (Open Data: Datument submitted to Elsevier’s Serials Review). I had hoped to post the manuscript immediately afterward but (a) our DSpace crashed and (b) Nature Precedings doesn’t accept HTML. DSpace is up again now and you can see the article. This post is about the content, not the technology.
[NOTE: The document was created as a full hyperlinked datument, but DSpace cannot handle hyperlinks and it numbers each of the components as a completely separate object with an unpredictable address. So none of the images show up – it’s probably not a complete disaster – and you lose any force of the datument concept (available here as a zip), which contains an interactive molecule (Jmol).]
The abstract:

Open Data (OD) is an emerging term in the process of defining how scientific data may be published and re-used without price or permission barriers. Scientists generally see published data as belonging to the scientific community, but many publishers claim copyright over data and will not allow its re-use without permission. This is a major impediment to the progress of scholarship in the digital age. This article reviews the need for Open Data, shows examples of why Open Data are valuable and summarises some early initiatives in formalising the right of access to and re-use of scientific data.

PMR: The article tries not to be too polemical and to review the area of Open Data (in scientific scholarship) objectively, in the style that I have done for Wikipedia. The next section shows Open Data in action, both in individual articles and when aggregating large numbers (> 100,000) of articles. Although the illustrations are from chemistry and crystallography the message should transcend the details. Finally I try to review the various initiatives that have happened very recently and I would welcome comments and corrections. I think I understand the issues raised in the last month but they will take time to sink in.
So, for example, in the last section I describe and pay tribute to the Open Knowledge Foundation, Talis and colleagues, and Science/Creative Commons. I will blog this later, but there is now a formal apparatus for managing Open Data (unlike Open Access, where the lack of one causes serious problems for science data). In summary, we now have:

  • Community Norms (“this is how the community expects A and B and C to behave – the norms have no legal force, but if you don’t work with them you might be ostracized, get no grants, etc.”)
  • Protocols. These are high-level declarations which allow licences to be constructed. Both Science Commons and the Open Knowledge Foundation have such instruments. They describe the principles which conformant licences must honour. I use the term meta-licence (analogous to XML, a meta-markup language for creating markup languages).
  • Licences. These include PDDL and CC0 which conform to the protocol.

Throughout the article I stress the need for licences, and draw much analogy from the Open/Free Source communities which have meta-licences and then lists of conformant licences. I think the licence approach will be successful and will be rapidly adopted.
The relationship between Open Access and Open Data will require detailed work – they are distinct and can exist together or independently.  In conclusion I write:

Open Data in science is now recognised as a critically important area which needs much careful and coordinated work if it is to develop successfully. Much of this requires advocacy and it is likely that when scientists are made aware of the value of labeling their work the movement will grow rapidly. Besides the licences and buttons there are other tools which can make it easier to create Open Data (for example modifying software so that it can mark the work and also to add hash codes to protect the digital integrity).
Creative Commons is well known outside Open Access and has a large following. Outside of software, it is seen by many as the default way of protecting their work while making it available in the way they wish. CC has the resources, the community respect and the commitment to continue to develop appropriate tools and strategies.
But there is much more that needs to be done. Full Open Access is the simplest solution but if we have to coexist with closed full-text the problem of embedded data must be addressed, by recognising the right to extract and index data. And in any case conventional publication discourages the full publication of the scientific record. The adoption of Open Notebook Science in parallel with the formal publication of the work can do much to liberate the data. Although data quality and formats are not strictly part of Open Data, improving them will have marked benefits. The general realisation of the value of reuse will create strong pressure for more and better data. If publishers do not gladly accept this challenge, then scientists will rapidly find other ways of publishing data, probably through institutional, departmental, national or international subject repositories. In any case the community will rapidly move to Open Data, and publishers resisting this will be seen as a problem to be circumvented.

Posted in data, publishing | 13 Comments

Why publishers' technology is obsolete – I

I have just finished writing an article for a journal – and I suspect the comments apply to all publishers. To create the Citations (or “references”) they require:

CITATIONS Citations should be double-spaced at the end of the text, with the notes numbered sequentially without superscript format. Authors are responsible for accuracy of references in all aspects. Please verify quotations and page numbers before submitting.
Superscript numerals should be placed at the end of the quotation or of the materials in which the source is mentioned. The numeral should be placed after all punctuation. SR follows the latest edition of the Chicago Manual of Style, published by the University of Chicago Press. Examples of the correct format for most often used references are the following:
Article from a journal: Paul Metz, “Thirteen Steps to Avoiding Bad Luck in a Serials Cancellation Project,” Journal of Academic Librarianship 18 (May 1992): 76-82.
[Note: when each issue is paged separately, include the issue number after the volume number: 18, no. 3 (May 1992): 76-82. Do not abbreviate months. When citing page numbers, omit the digits that remain the same in both the beginning and ending numbers, e.g., 111-13.]

PMR: It’s the author who has to do all this. In a different journal it would be a different style – maybe Harvard, or Oxford, or goodness knows what – each with its own bizarre, pointless micro-syntax.
As we know there is a simple, effective way of identifying a citation in a journal – the Digital Object Identifier (Wikipedia). It’s a unique identifier, managed by each publisher, and there is a resolution service. OK, not all back journals are in the system, and OK, it doesn’t cover non-journal articles, but why not use it for the citations it can support? In many science disciplines almost all modern citations would have DOIs.
Not only would it speed up the process, it would also save errors. Authors tend to write abbreviations (J. Acad. Lib), mangle the volumes and pages, and put the fields in the wrong places. They hate it, and I suspect so do the technical editors when they have to correct the errors. I can’t actually believe the authors save the technical editors any time – I suspect it costs time.
You may argue that the publisher still has to type out the citation from the DOI. Not at all. This is all in standard form. Completely automatic.
Why also can publishers not emit their bibliographic metadata in standard XML on their web pages? It’s a solved problem. It would mean that anyone taking a citation would get it right. (I assume that garbled citations don’t get counted in the holy numbers game, so it pays to have your metadata scraped correctly. And XML is the simple, correct way to do that.)
It’s not as if the publishers don’t have an XML Schema (or rather DTD). They do.
It’s called PRISM. Honest: Publishing Requirements for Industry Standard Metadata – a worthy, if probably overengineered, approach. But maybe the name has got confused.
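To make this concrete, here is a hand-written sketch of the kind of record a publisher could expose for the Metz citation above, using PRISM and Dublin Core style element names. I have not checked it against the current PRISM specification, so treat the element names, namespaces and the placeholder DOI as illustrative only:

  <rdf:Description xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
                   xmlns:dc="http://purl.org/dc/elements/1.1/"
                   xmlns:prism="http://prismstandard.org/namespaces/basic/2.0/">
    <dc:title>Thirteen Steps to Avoiding Bad Luck in a Serials Cancellation Project</dc:title>
    <dc:creator>Paul Metz</dc:creator>
    <prism:publicationName>Journal of Academic Librarianship</prism:publicationName>
    <prism:volume>18</prism:volume>
    <prism:coverDate>1992-05</prism:coverDate>
    <prism:startingPage>76</prism:startingPage>
    <prism:endingPage>82</prism:endingPage>
    <prism:doi>10.xxxx/xxxxx</prism:doi> <!-- placeholder; this article may predate DOIs -->
  </rdf:Description>

A robot scraping that gets the citation right every time; a stylesheet can render it for humans.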
Of course the NIH/Pubmed has got this problem solved. Because they are the scientific information providers of the future.

Why not borrow their solution?
Posted in publishing, XML

Why PubMed is so important in the NIH mandate – cont.

In Why PubMed is so important in the NIH mandate  – which got sent off prematurely – I started to show why the NIH/PubMed relationship was so important. To pick up…
The difference between PubMed and almost all other repositories is that it has developed over many years as a top-class, domain-specific information engine. Here’s a typical top page (click to enlarge):
[Screenshot: pubmed.PNG – the PubMed top page]
Notice the range of topics offered. Many of these search collections of named scientific entities, such as genes, proteins, molecules, diseases, etc. One really clever idea – at least two decades old – was that you search in one domain, come back with the hits, search in another domain, and so on. An early idea of mashups, for example.
You can’t do this with Google. If you search for CAT you get all sorts of things. But in PubMed you can differentiate between the animal, the 3-base codon, the tripeptide, the enzyme, the gene, the scanning technique and so on. Vastly improved accuracy. You can search for CAT scans on cats. And there are the non-textual searches. You can do homology searches for sequences, find similar molecules using connection tables, etc. etc.
Then there is the enormous economy of scale. Let’s say I search for p450 (a liver enzyme). I get 23000+ hits. I can’t possibly read them all. But OSCAR can. OSCAR can read the abstracts anyway, but now it will be able to read many more fulltexts as well. It can pass them to chemistry engines, which pass them onto … and then onto …
You can’t do that with institutional repositories or with self-archiving. They don’t have the domain search engines, they don’t have the comprehensiveness. They don’t emit the science in standard XML.
For science it is likely that we have to have domain repositories. With domain-specific search engines, XML, RDF, ORE, the lot. It’s the natural way that scientists will work.
And PubMed – and its whole information infrastructure of MeSH, PubChem, Entrez, etc. is so well constructed and run that it serves as an excellent example of where we should be aiming. It’s part of the future of scientific information and data-driven science.

Posted in publishing

Do the Royal Society of Chemistry and Wiley care about my moral rights?

In a previous post I asked Did I write this paper??? because I had come across something like this:

[Screenshot: wiley2.png – the article as offered for sale at 30 USD]

(click to enlarge). Take a long hard look and tell me what is the journal, and who is the publisher. Note also that it costs 30 USD to look at it.
Now I (and others) wrote that paper. When we submitted it I was proud to publish it with the Royal Society of Chemistry. I cite it as:
Org. Biomol. Chem., 2004, 2, 3067 – 3070, DOI: 10.1039/B411699M


Experimental data checker: better information for organic chemists

S. E. Adams, J. M. Goodman, R. J. Kidd, A. D. McNaught, P. Murray-Rust, F. R. Norton, J. A. Townsend and C. A. Waudby
and you can still find it posted at:
http://www.rsc.org/Publishing/Journals/OB/article.asp?doi=B411699M
However when I visit the RSC page – on the RSC site – at:
http://www.rsc.org/publishing/journals/OB/article.asp?DOI=B411699M&type=ForwardLink
I find:
[Screenshot: wiley3.PNG – the RSC page stating that this is not an RSC journal article]
Since this is on the RSC’s own site and it says it’s not an RSC journal article it’s clearly deliberate, not a mistake. The RSC seems to have transferred the rights of the paper to Wiley, who are reselling it under the name ChemInform. Or maybe both are selling it. Or maybe the RSC don’t know what Wiley are doing. (The best I can see is that Wiley appear to be passing off my/our paper under their name. As far as I can see they are only selling the abstract, and even then it’s the wrong one – but maybe they would also be selling the full text if they were competent enough to get the web site right. And they are asking 30 USD.)
I care very deeply about this. I used to be proud to publish in the journals of the Chemical Society (now the RSC). Can I still be proud? They have disowned my article as not one of theirs. Someone reading the Wiley page would naturally assume that I had published in a Wiley journal and not with the RSC. We’ve worked closely with the RSC – many of the ideas for Project Prospect came from our group.
A major justification for Transfer of Copyright to publishers, whether or not you believe it, is that it allows the publisher to defend the integrity of the work against copyright infringement by others. I contend that what I have depicted here is a gross violation of someone’s copyright. Probably not mine since I gave it away.
Cockup or conspiracy – I don’t know. But I certainly feel my rights have been violated.

Posted in publishing | 3 Comments

Learning RDF and RDFS – help!

I’m getting myself up to speed on RDF (and RDFS) and building molecular repositories as an example. I’m using the Jena Semantic Web Framework (Open Source, Java, HP-inspired) and so far like it. But I have only done a little bit (subject-predicate-object). Jim tells me that what I have produced so far needs cleaning up. As a minimum I have to use RDF types (rdf:type).
I like learning by example – give me a few examples of RDF-XML and the corresponding Jena code and that will go a long way. But although I could do this easily for the simple stuff the Jena tutorial runs out before RDFS. And the Javadoc is enormous. I’m impressed, but I don’t know where to start. There are no obvious package or class names. Everything uses abstract language. How do I learn about ranges and domains unless I can see some working examples and create some?
So rather than exposing this on an RDF-specific list (which I may do later) I’m wondering if there are any kind readers who can point to some examples of RDFS-XML, and even better if they can suggest how to hack them in Jena.
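To make the request concrete, here is roughly the kind of thing I mean – a hand-written RDF/RDFS sketch with an invented chemistry vocabulary (I have not yet run it through Jena, so treat it as illustration rather than a worked answer):

  <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
           xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#"
           xmlns:ex="http://example.org/chem#">

    <!-- schema: a class and a property with domain and range -->
    <rdfs:Class rdf:about="http://example.org/chem#Molecule"/>
    <rdf:Property rdf:about="http://example.org/chem#hasFormula">
      <rdfs:domain rdf:resource="http://example.org/chem#Molecule"/>
      <rdfs:range rdf:resource="http://www.w3.org/2000/01/rdf-schema#Literal"/>
    </rdf:Property>

    <!-- instance data: the typed element implies rdf:type ex:Molecule -->
    <ex:Molecule rdf:about="http://example.org/chem#methanol">
      <rdfs:label>methanol</rdfs:label>
      <ex:hasFormula>CH4O</ex:hasFormula>
    </ex:Molecule>
  </rdf:RDF>

What I would really like is the Jena code that builds exactly this, particularly the typed resource and the domain/range statements.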
TIA

Posted in programming for scientists | 3 Comments

Open Data: Datument submitted to Elsevier's Serials Review

I have just finished writing an invited article for Serials Review (Elsevier). I’m making an exception and submitting to a closed access publisher because (a) this is a special issue – from the invitation from Connie Foster:

*Serials Review*
Serials Review (v.30, no.4, 2004) was a focus issue on Open Access. It remains one of the most heavily downloaded issues and articles even now. Open Access remains a “hot topic” and fundamental discussion in scholarly communication. Your names were suggested by either current board members or previous contributors to the Open Access issue.
At the time of that publication, editors and authors envisioned revisiting the Open Access environment a few years hence since issues, publisher responses, “experiments,” and government mandates were or are in flux.

PMR: and (b) we are all allowed to retain copyright.
[I’ll discuss the message later. This post is about the medium. And how today’s medium doesn’t carry messages very well at all.]
First to publicly thank Connie Foster for her patience. I warned her that I would not submit a conventional manuscript because I wanted to show what Scientific Data are actually like. And you can’t do that in a PDF, can you?
So I asked ahead of time if I could submit HTML. It caused the publisher (Elsevier) a lot of huffing and puffing. The answer seemed to be “yes”, but when I came to submit the manuscript the system only accepted dead documents. So I’ve ended up mailing it to Connie.
The document is a datument – a term that Henry Rzepa and I coined about 4 years ago (From Hypermedia to Datuments: Murray-Rust and Rzepa: JoDI). It emphasizes that information should be seamless – not arbitrarily split into “full-text” and “data” because it’s easier for twentieth century publishers. (I return to this in a later post). The ideal medium for datuments is XML – for example using ICE (Integrated Content Environment) and that’s why I’m going to visit Peter Sefton and colleagues.
But the simple way to create datuments is in valid XHTML. Every editor in the world can now produce XHTML, so there is no reason not to do it. It’s a standard. It’s on billions of machines over the world. It’s got everything we need. You see hundreds of examples every day.
XHTML manages:

  • images (it’s done this for 15 years)
  • multimedia (also for 15 years)
  • hyperlinks (for 15 years)
  • interactive objects (also for 15 years, though with some scratchy syntax)
  • foreign namespaces – probably about 10 years
  • vector graphics (SVG) nearly 10 years

It also manages STYLES. You don’t have to put the style in the content. You put it in a stylesheet. So my datument doesn’t have styles. Elsevier can add those if it wants. Personally I like reading black text on a white background – I know it’s very old-fashioned, but that’s how I was educated.
Also, since it’s in XML you can repurpose it. Extract just the images. Or discard the applet. Or reorganise the order of author’s names. Or mash it with another paper. Or extract the data. Or…
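To show how little machinery is needed, here is a hand-written skeleton of the kind of XHTML datument I mean: ordinary XHTML plus an external stylesheet, an embedded interactive object and a foreign-namespace island. The file names, MIME type and namespace prefix are illustrative only:

  <?xml version="1.0" encoding="UTF-8"?>
  <html xmlns="http://www.w3.org/1999/xhtml">
    <head>
      <title>Open Data in Science (datument sketch)</title>
      <!-- the style lives in the stylesheet, not in the content -->
      <link rel="stylesheet" type="text/css" href="datument.css"/>
    </head>
    <body>
      <h1>Open Data in Science</h1>
      <p>Ordinary hyperlinked text, images and citations ...</p>
      <img src="images/figure1.png" alt="Figure 1"/>
      <!-- an interactive molecule, e.g. rendered by Jmol or a similar viewer -->
      <object data="data/molecule1.cml" type="chemical/x-cml" width="300" height="300">
        fallback text for readers (and robots) without the viewer
      </object>
      <!-- a foreign-namespace island carrying the data itself -->
      <cml:molecule xmlns:cml="http://www.xml-cml.org/schema" id="m1"/>
    </body>
  </html>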
So XHTML is a liberating medium in which to publish, while PDF is a dead, restricting and dismal one. So, having created my manuscript as a standard XHTML hyperdocument – no technology that isn’t at least 10 years old – I try to submit it. Doesn’t work. The publisher doesn’t like HTML. This seems barmy since they actually publish in HTML.
I am not prepared to transform the datument into PDF. It destroys the whole point of the article. It would be like publishing a movie as a single snapshot, or a recording of a song using only a score. So I’ve had to zip it up and send it as email. Which is what we do every day anyway.
[In passing – why this elaborate ritual with the publishers’ technology? Authors have been producing acceptable manuscripts in HTML for years. Why publish in double-column PDF? I didn’t ask for it. It is purely for the benefit of the publishers. To help their branding. (It’s not even to make their life easier, as I’ll show later because it doesn’t).]
So, as a good Open Access advocate, I have reposited it in the Cambridge DSpace. DSpace does not deal with hyperdocuments (please tell me I’m wrong). I would have to go through all the documents, find the relative URLs and expand them to the Cambridge DSpace base URL. This, of course, means that the documents are not portable. So I had to reposit a ZIP file. 15 years after the invention of HTML and we cannot reposit HTML hyperdocuments.
[UPDATE: I have since found that it does accept HTML so we’ll see how it comes out. ]
[UPDATE2: Yes, it accepts HTML, but no the links don’t work. You have to know the address of each image before you deposit them. Then you have to edit the main paper to make them work. Which means it breaks if you export it. So basically you cannot reposit normal HTML in DSpace and expect it to work.]
So, dear reader, if you are a human, and want to read the file, download the zip file, unzip it, point your browser at it, swear at me when the browser breaks.
[UPDATE: Bill says it breaks. I don’t understand this.]
And, dear reader, if you are a robot you have no option but to ignore it. It’s a zip file. It’s potentially evil. And anyway you wouldn’t know what you were indexing or looking for. So maybe I will give you the top part of the HTML to look at. You won’t see the pictures, but you probably don’t care at this stage, though in a few years you will.
I also tried to reposit it at Nature Precedings. They wouldn’t let me post a zip file. Only DOC, PPT, PDF. Oh dear.

Posted in publishing | 14 Comments

Why getting information from publishers is soul-destroying

I’m reprinting parts of a post from Bill Hooker. The point here is not just the message, but also the meta-medium. To get the message Bill has had to do some messy, boring, unsatisfying, incomplete research. Here’s how he did it.

Does the AAP/PSP really represent its members?


[…]
The PSP lists its members here; it didn’t take long to compare that list with the list of publishers indexed by SHERPA/RoMEO. Of the 355 publishers in the RoMEO database, 46 are members of PSP; of these, 16 are listed as “grey” (won’t allow archiving), 23 are “green” (allow refereed postprint archiving — NIH mandate compliant) and 7 “pale green” (allow preprint archiving; many “pale green” publishers actually allow postprint archiving and are NIH compliant, but are not listed as green because of various restrictions).
It’s not possible to do what I wanted here — which was to answer the title question. The problem is that the PSP lists 102 members that aren’t indexed by RoMEO. I found that somewhat surprising, particularly since the list includes names I’d have expected to find in RoMEO: FASEB, Stanford U Press, Yale U Press, Cold Spring Harbor Lab Press, NEJM, Highwire Press and others.
Nonetheless, we can say that if the RoMEO-indexed sample (46 of 148, 31%) is representative, then at least 50% of PSP members are already complying with the NIH mandate, and a further 15% at least allow preprint archiving and may even be NIH-compliant.
It’s even more unbalanced if we compare the numbers of journals published by each company. Those 46 publishers account for 5901 journals; the grey publishers put out 222 (4%), the green publishers 4243 (72%) and the pale green publishers 1436 (24%).
If the PSP were honest and interested in fairly representing its members, I’d think they would find out (and make public) whether the remaining, non-RoMEO indexed members follow the same pattern. I won’t hold my breath.
____
Full disclosure: the numbers above are not 100% accurate, since the comparison between the two lists was not always straightforward. For instance, RoMEO indexes “Yale Law School” and the PSP lists “Yale University Press” as a member. I tried to err on the side of the PSP — for instance, Yale Law is grey, so I included them. There were a few such problematic instances; I very much doubt that they made any difference to the data expressed as percentages, I’d welcome correction and a better dataset, and if anybody wants the Excel files I used I’ll be happy to provide them.

PMR: I know exactly what Bill has gone through because I’ve done a lot of this myself. It might seem simple to find information from publishers. It’s not. My generalisations below extend a little into Open Access publishers, but it’s mainly aimed at Closed Access publishers.
A little while ago I thought it would be useful to see what degree of compliance Open Access publishers of chemistry had with the BBB declarations.  Should be easy – there’s only about 60 titles listed. So I mailed the Blue Obelisk and the Open Knowledge Foundation and suggested that if we divided the work – each took a few publishers – we could do this in a relatively short time. And maybe publish it.
Oh dear. The publisher websites were awful. It’s practically impossible to find out anything from most publishers (of any ROMEO/HARNAD colour). It’s spread over several pages, perhaps for authors, perhaps general blurb, wherever (and this is true for all publishers). We created a spreadsheet of what we wanted to record but found that the practice was so variable that we couldn’t systematize it.
So we gave up. The effort of finding out policies, even for Open Access publishers was too great. (But, closed access publishers, do not feel this is a defeat – we shall return).
The thing that really upsets me about closed access publishers is how profoundly unhelpful they are. They don’t want to communicate with the general authorship and readership. Each thinks it’s the centre of the world. Despite their acclaimed publisher organisations (AAP, ALPSP, STM, etc.) it is one of the technically worst industries I have encountered. There are no standards. No attempt to adjust to the modern world (I shall revisit this later). Here are some examples:

  • Many don’t reply to courteous requests for information. I admit that this blog is sometimes a bit brusque, but it’s come that way because of the unhelpfulness of publishers. Every publisher should have a page on which it lists its policies. And there should be open forums for discussion of these policies. Some repository managers spend large amounts of time trying to work out whether articles can be put in a repository – and I guess the publisher gets asked frequently. Wouldn’t it be easy to add a label to each journal saying whether manuscripts can be put in a repository? I suppose not, it would require agreement across the industry.
  • They work on Jurassic timescales. In the modern age people expect replies by return. It’s taken months to get answers for my latest manuscript  – and I’m an author. The ACS is taking a minimum of FIVE MONTHS to respond to Antony Williams’ courteous request as to the copyright position of our abstracting of factual data.
  • Requests, discussion, etc are all fragmented. I suspect the  same questions get asked again and again. If these were listed  on a policy FAQ as they were asked and answered it would save everybody’s time.
  • The technology is totally geared to each publisher’s byzantine and tortuous internal business processes. I’ll give examples in a later post. A typical example from my latest submission: I have to submit a title page without the body of the text, and the body of the text without my name. Those are reasonable objects for blind review. But why should I have to do this? Publishers, we are in the twenty-first century. This can be done AUTOMATICALLY. You use a technology called STYLESHEETS. It takes 2 lines of code to split a document into these two bits (see the sketch after this list). I’ll give more later. This spills over into the awful state of presenting policy.
  • The technical business model is slow to adjust to changing demands. So when publishers adopted their “hybrid” policy (a different one for each publisher of course) they generally failed to tell the technical department that they needed to adjust their labelling and their policies and permissions for individual articles. With the result that I spent a number of gloomy days on this blog pointing out to publishers how little effort they had put into this.
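
As an illustration of the stylesheet point above, here is a hand-written XSLT sketch that produces the blinded body by dropping the author block from an XHTML manuscript; the class name it matches is hypothetical, but the approach is generic, and the title page is just the complementary template:

  <?xml version="1.0" encoding="UTF-8"?>
  <!-- blind.xsl: hypothetical sketch, assuming the manuscript is XHTML with the
       authors marked up as <div class="authors"> -->
  <xsl:stylesheet version="1.0"
      xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
      xmlns:h="http://www.w3.org/1999/xhtml">

    <!-- identity transform: copy everything through unchanged -->
    <xsl:template match="@*|node()">
      <xsl:copy><xsl:apply-templates select="@*|node()"/></xsl:copy>
    </xsl:template>

    <!-- the two lines that matter: suppress the author block -->
    <xsl:template match="h:div[@class='authors']"/>
  </xsl:stylesheet>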


A concern that I’ll return to is that not only is the technical standard awful, I suspect it’s also expensive.


But the really sad thing is that publishing (unlike making toothpaste, or bicycles) is based on communication. I have been looking to see which closed access publishers have made any effort whatsoever over the last year to communicate with authors and readers. I don’t follow everything, but I do follow Peter Suber’s blog, and the only one I can think of is Nature.


Wouldn’t you think it would be good for business in this era of Web 2.0 to be seen to be interested in communicating? If – as a publisher – you want to respond, I will honour your posting.

Posted in publishing | 1 Comment

Why PubMed is so important in the NIH mandate

Some of us know the following phrase by heart:

all investigators funded by the NIH submit or have submitted for them to the National Library of Medicine’s PubMed Central an electronic version of their final, peer-reviewed manuscripts upon acceptance for publication to be made publicly available no later than 12 months after the official date of publication:

PMR: Why is the “PubMed” bit so important? Why wouldn’t an institutional repository do? Or a departmental web page? After all, Google will index these and find them, won’t it?
I realise that the following is going to upset some people but I have to tell it as it is….
The reason that PubMed Central is so important to me is that it is not “just an archive”, not “just an enforcement agency”, not “just a repository”. It is a living instrument of scientific research in a way that the others are not – and won’t be. And it’s evolving rapidly. It will, I predict, overtake commercial abstracting organisations (you know which ones) during the next decade. It will become the single most important scientific information resource in the chemical/biological arena. Together with arXiv and SCOAP in physics, it will be part of the twenty-first-century scientific information enlightenment.
Why do I say this? Because for the last 30? years the NIH has had first-class teams working on the modern practices of informatics. That’s shown by PubChem, which is now seen by many (not yet all, but most of the enlightened ones) as the first place to look for chemical information on molecules. Why? First because it’s Open, but more importantly because it has a modern approach to information. PubChem will continue to overtake the conventional molecular databases. That’s why there was such a fuss when PubChem was launched.

Posted in Uncategorized | 5 Comments

Does the semantic web work for chemical reactions?

A very exciting post from Jean-Claude Bradley asking whether we can formalize the semantics of chemical reactions and synthetic procedures. Excerpts, and then comment…

Modularizing Results and Analysis in Chemistry


Chemical research has traditionally been organized in either experiment-centric or molecule-centric models.

This makes sense from the chemist’s standpoint.
When we think about doing chemistry, we conceptualize experiments as the fundamental unit of progress. This is reflected in the laboratory notebook, where each page is an experiment, with an objective, a procedure, the results, their analysis and a final conclusion optimally directly answering the stated objective.
When we think about searching for chemistry, we generally imagine molecules and transformations. This is reflected in the search engines that are available to chemists, with most allowing at least the drawing or representation of a single molecule or class of molecules (via substructure searching).
But these are not the only perspectives possible.
What would chemistry look like from a results-centric view?
Let’s see with a specific example. Take EXP150, where we are trying to synthesize a Ugi product as a potential anti-malarial agent and identify Ugi products that crystallize from their reaction mixture.
If we extract the information contained here based on individual results, something very interesting happens. By using some standard representation for actions we can come up with something that looks like it should be machine readable without much difficulty:

  • ADD container (type=one dram screwcap vial)
  • ADD methanol (InChIKey=OKKJLVBELUTLKV-UHFFFAOYAX, volume=1 ml)
  • WAIT (time=15 min)
  • ADD benzylamine (InChIKey=WGQKYBSKWIADBV-UHFFFAOYAL, volume=54.6 ul)
  • VORTEX (time=15 s)
  • WAIT (time=4 min)
  • ADD phenanthrene-9-carboxaldehyde (InChIKey=QECIGCMPORCORE-UHFFFAOYAE, mass=103.1 mg)
  • VORTEX (time=4 min)
  • WAIT (time=22 min)
  • ADD crotonic acid (InChIKey=LDHQCZJRKDOVOX-JSWHHWTPCJ, mass=43.0 mg)
  • VORTEX (time=30 s)
  • WAIT (time=14 min)
  • ADD tert-butyl isocyanide (InChIKey=FAGLEPBREOXSAC-UHFFFAOYAL, volume=56.5 ul)
  • VORTEX (time=5.5 min)
  • TAKE PICTURE

It turns out that for this CombiUgi project very few commands are required to describe all possible actions:

  • ADD
  • WAIT
  • VORTEX
  • CENTRIFUGE
  • DECANT
  • TAKE PICTURE
  • TAKE NMR

By focusing on each result independently, it no longer matters if the objective of the experiment was reached or if the experiment was aborted at a later point.
Also, if we recorded chemistry this way we could do searches that are currently not possible:

  • What happens (pictures, NMRs) when an amine and an aromatic aldehyde are mixed in an alcoholic solvent for more than 3 hours with at least 15 s vortexing after the addition of both reagents?
  • What happens (picture, NMRs) when an isonitrile, amine, aldehyde and carboxylic acid are mixed in that specific order, with at least 2 vortexing steps of any duration?

I am not sure if we can get to that level of query control, but ChemSpider will investigate representing our results in a database in this way to see how far we can get.

Note that we can’t represent everything using this approach. For example observations made in the experiment log don’t show up here, as well as anything unexpected. Therefore, at least as long as we have human beings recording experiments, we’re going to continue to use the wiki as the official lab notebook of my group. But hopefully I’ve shown how we can translate from freeform to structured format fairly easily.
Now one reason I think that this is a good time to generate results-centric databases is the inevitable rise of automation. It turns out that it is difficult for humans to record an experiment log accurately. (Take a look at the lab notebooks in a typical organic chemistry lab – can you really reproduce all those experiments without talking to the researcher?)
But machines are good at recording dates and times of actions and all the tedious details of executing a protocol. This is something that we would like to address in the automation component of our next proposal.
Does that mean that machines will replace chemists in the near future? Not any more than calculators have replaced mathematicians. I think that automating result production will leave more time for analysis, which is really the test of a true chemist (as opposed to a technician).
Here is an example

[…]

database, as long as attribution is provided. (If anyone knows of any accepted XML for experimental actions let me know and we’ll adopt that.)


I think this takes us a step closer from freeform Open Notebook Science to the chemical semantic web, something that both Cameron Neylon and I have been discussing for a while now.

PMR: This is very important to follow – and I’ll give some of our insights. Firstly, we have been tackling this for ca. 5 years, starting from the results as recorded in scientific papers or theses. Most recently we have been concentrating very hard on theses and have just taken delivery of a batch of about 20, all from the same lab.
I agree absolutely with J-C that traditional recording of chemical syntheses in papers and theses is very variable and almost always misses large amounts of essential detail. I also agree absolutely that the way to get the info is to record the experiment as it happens. That’s what the Southampton projects CombeChem and R4L spent a lot of time doing. The trouble is it’s hard. Hard socially. Hard to get chemists interested (if it was easy we’d be doing it by now). We are doing exactly the same with some industrial partners. They want to keep the lab book. The paper lab book. That’s why electronic notebook systems have been so slow to take off. The lab book works – up to a point – and it also serves the critical issues of managing safety and intellectual property. Not very well, but well enough.
J-C asks

If anyone knows of any accepted XML for experimental actions let me know and we’ll adopt that

CML has been designed to support this, and Lezan Hawizy in our group has been working in detail over the last 4 months to see how well CML works. It’s capable of managing, inter alia:

  • observations
  • actions
  • substances, molecules, amounts
  • parameters
  • properties (molecules and reactions)
  • reactions (in detail) with their conditions
  • scientific units

We have now taken a good subset of literature reactions (abbreviated though they may be) and worked out some of the syntactic, semantic, ontological and lexical environment that is required. Here is a typical result, which has a lot in common with J-C’s synthesis.
[Screenshot: synthesis.PNG – a typical literature synthesis marked up in CML]
(Click to enlarge.) I have cut out the actual compounds, though in the real example they have full formulae in CML and can be used to manage balance of reactions, masses, volumes, molar amounts, etc. JUMBO is capable of working out which reagents are present in excess, for example. It can also tell you how much of everything you will need and how long the reaction will take. No magic, just housekeeping.
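To give a flavour, here is a hand-sketched paraphrase of how a couple of J-C’s actions might look in CML-style markup. The element and attribute names are my shorthand for this post and may not match the released schema exactly (the InChIKey is the one from J-C’s example):

  <actionList xmlns="http://www.xml-cml.org/schema">
    <!-- ADD methanol (volume=1 ml) -->
    <action title="add">
      <molecule id="methanol" title="InChIKey=OKKJLVBELUTLKV-UHFFFAOYAX"/>
      <amount units="ml">1</amount>
    </action>
    <!-- WAIT (time=15 min) -->
    <action title="wait">
      <parameter title="time">
        <scalar units="min">15</scalar>
      </parameter>
    </action>
    <!-- VORTEX (time=15 s) -->
    <action title="vortex">
      <parameter title="time">
        <scalar units="s">15</scalar>
      </parameter>
    </action>
  </actionList>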
CML is designed with a fluid vocabulary, so that anything which isn’t already known is found in dictionaries and repositories. So we have collections of:

  • solvents
  • reagents
  • apparatus
  • procedures
  • appearances
  • units
  • common molecules

A word of warning. It looks attractive, almost trivial, when you start. But as you look at more examples and particularly widen your scope it gets less and less productive. I’ve probably looked through several hundred papers. There is always a balance between precision and recall and Zipf’s law. You will never manage everything. There will be procedures, substances, etc, that defy representation. There are anonymous compounds and anaphora.
So we can’t yet build a semantic robot that is capable of doing everything. We probably can build examples that work in specific labs where the reactions are systematically similar – as in combinatorial chemistry.
So, yes, J-C – we would love to explore how CML can support this…

Posted in chemistry, data, open notebook science | 8 Comments