Category Archives: chemistry

Working with the NCI

I was intending to blog about our collaboration with Dan Zaharevitz and colleagues at the National Cancer Institute in the DTP (Developmental Therapeutics Program). Dan beat me to it: in a CMLBlog comment (February 4th, 2008 at 5:02 pm) to CML – what and why. In the comment he explains why the NCI has chosen to work with us on CML.

Dan and I first made contact ca. 5+ years ago. I think he had noticed my posting or contributing to CDK (Chemistry Development Kit) and had asked about what CML could do. We got into correspondence and as a result he supported Henry and me in the development of JUMBO – probably JUMBO 4.6.
It is refreshing to work with the NCI. Their agenda is ultimately simple – methods of combatting cancer. And they are very clear that the way to do this is through Openness – Open Data, Open Source, Open Standards. So it is wonderful to have a sponsor who says “we will help you to develop this code” and you can make it Open – indeed this is virtually a requirement.

NCI is well known for pioneering the release of their data in Open form. For many years the NCI database – with about 250,000 compounds and associated biological data – was the only data that could be used for free in chemistry. This database was the logical predecessor of Pubchem which now has over 18 million compounds. (An important difference is that the NCI database relates to physical samples while many entries in Pubchem do not).

Dan’s support has been invaluable. Firstly it has funded us to do the work. Secondly it gives us much moral support to continue. And thirdly it has given us important feedback. Since CML has many uses (publishing, computation, crystallography) it’s very useful to have an organisation that wants to manage data. NCI is interested not only in chemical structure but also in associated data, including analytical data.

So it was great to sit in Dan’s splendid basement and review how he was using CML and how we jointly felt it might develop. CML details will follow on the CMLBlog.

Automatic assignment of charges by JUMBO

Egon has spotted a bug in our code for assignment of charges to atoms:

Why chemistry-rich RSS feeds matter… data minging,

The example shown by Peter was nicely chosen: something is wrong with that example. It uncovers a bug in the pipeline, that could have been uncovered by a simple agent monitoring the RSS feed. That is why this technology is important! It allows pipelining of information between services.

Anyway, before you read on, check the structure in the example yourself (Bis(pyrimidine-2-carboxylato-K2N,O)copper(II)).

Done? Checked it? You saw the problem, right? Good.

The charges in the structure are indeed wrong. There are two challenges…

  • for structures with more than one moiety (isolated fragment) it is formally impossible to know the charges if the author doesn’t give them. The authors can give them in _chemical_formula_moiety but they are often difficult to parse correctly and in any case they often aren’t given. In those cases we don’t try to assign charges. (The crystallographic experiment itself cannot determine charges).
  • In cases where the fragment contains only light atoms it is usually (but not always) possible to allocate charges by machine. In cases with metals it’s usually impossible to do a good job. The molecule in question is:

Summary page for crystal structure from DataBlock I in CIF xu2383sup1 from article xu2383 in issue 2008/01-00 of Acta Crystallographica, Section E.


 

The molecule itself is neutral. The easiest way is not to put any charges. Anything else is uncomfortable. We can have + charges on the N’s, which is natural, but then there are 2 – charges on the Cu. That’s formally correct but since the metal is usually described as Cu(II) it’s not happy. Or we can play around with the aromaticity, or dissociate the Cu-N or C-O bonds, but that’s not happy either. And this is simple compared with many metal structures.

What we have been doing is to dissociate the metal, do the aromaticity and charges, and then add the metal back. In doing so it’s easy to forget the charges and that is what has happened. We’ll try to fix it.
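One cheap guard against losing charges during this kind of dissociate-and-reattach step is to check that the formal charges on the atoms still add up to the charge declared for the whole molecule. The sketch below is an illustration only and is not JUMBO's code. It assumes a CML fragment that carries `formalCharge` on the `molecule` and `atom` elements, and the four-atom document is a made-up, heavily cut-down stand-in for the real copper complex.

```python
import xml.etree.ElementTree as ET

CML_NS = {"cml": "http://www.xml-cml.org/schema"}

# Made-up stand-in for the complex: one Cu(II), one O- from each of the
# two carboxylate groups, and one uncharged N. Not the real structure.
CML_DOC = """
<molecule xmlns="http://www.xml-cml.org/schema" id="m1" formalCharge="0">
  <atomArray>
    <atom id="a1" elementType="Cu" formalCharge="2"/>
    <atom id="a2" elementType="O" formalCharge="-1"/>
    <atom id="a3" elementType="O" formalCharge="-1"/>
    <atom id="a4" elementType="N"/>
  </atomArray>
</molecule>
"""


def charge_mismatch(cml_text):
    """Return (declared, summed) if the charges disagree, else None."""
    molecule = ET.fromstring(cml_text.strip())
    declared = int(molecule.get("formalCharge", "0"))
    # Atoms without a formalCharge attribute are treated as neutral.
    summed = sum(
        int(atom.get("formalCharge", "0"))
        for atom in molecule.iterfind(".//cml:atom", CML_NS)
    )
    return None if declared == summed else (declared, summed)


if __name__ == "__main__":
    print(charge_mismatch(CML_DOC))
    # Simulate the bug: the ligand charges were forgotten on reassembly.
    print(charge_mismatch(CML_DOC.replace(' formalCharge="-1"', "")))
```

A check like this cannot say which atoms should carry the charge, only that something has gone missing, which is the kind of error a monitoring agent could flag.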

 

But in the end the only thing that matters is the total electron count and the spin state (which normally isn’t given except in the text). Cu2+ is d9 so it has one unpaired electron. But Fe is much more difficult and it’s virtually impossible to do anything automatic. We’ll probably simply leave the charges off…
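The electron counting behind that remark can be written down in a few lines. This is a textbook-style sketch, assuming an idealised octahedral field and the usual rule that the d-electron count is the group number minus the oxidation state; the function names are made up for illustration. It shows why Cu(II) is unambiguous while Fe depends on a spin state that the crystal structure does not give.

```python
def d_electron_count(group, oxidation_state):
    """d-electron count of a transition-metal ion (group 3 to 12)."""
    return group - oxidation_state


def unpaired_octahedral(d_count, high_spin):
    """Unpaired electrons for a d^n ion in an idealised octahedral field."""
    if not 0 <= d_count <= 10:
        raise ValueError("d_count must be between 0 and 10")
    if high_spin:
        # Fill all five d orbitals singly before pairing.
        return d_count if d_count <= 5 else 10 - d_count
    # Low spin: fill the three t2g orbitals completely before the two eg.
    t2g = min(d_count, 6)
    eg = d_count - t2g
    unpaired_t2g = t2g if t2g <= 3 else 6 - t2g
    unpaired_eg = eg if eg <= 2 else 4 - eg
    return unpaired_t2g + unpaired_eg


if __name__ == "__main__":
    cu2 = d_electron_count(11, 2)   # Cu is in group 11
    fe2 = d_electron_count(8, 2)    # Fe is in group 8
    for label, n in (("Cu(II)", cu2), ("Fe(II)", fe2)):
        print(label, "d%d" % n,
              "high spin:", unpaired_octahedral(n, True),
              "low spin:", unpaired_octahedral(n, False))
```

For d9 both branches agree, so no spin-state information is needed. For d6 they differ, so the answer has to come from somewhere other than the coordinates.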

 

Semantic Chemical Computing

Several threads come together to confirm we are seeing a change in the external face of scientific computing. Not what goes on inside a program, but what can be seen from the outside. Within simple limits what goes on inside need not affect what is visible. The natural way now for a program to interface with other programs and with humans is to use a mixture of XML and RDF. XML provides a vocabulary and a simple grammar; RDF provides the logic of the data and application.

The COST D37 group has just met in Berlin (I blogged the last meeting – COST D37 Meeting in Rome). COST is about interoperability in Comp Chem and it’s proceeding by collaborative work to fit XML/CML into FORTRAN programs – at present Dalton and Vamp. We do this by exchange visits paid by COST, so we are looking forward to having visitors in Cambridge shortly.

It coincided roughly with Toby White’s session at NeSC in Edinburgh  on how to fit XML/CML into FORTRAN using his FoX library. I look forward to hearing how he got on.

And then, on Friday, we had a group meeting including outside visitors where the theme was RDF. I was very impressed by what the various members of the group had got up to – five or six mini-presentations. Molecular repositories, chemical synthesis, polymers, ontologies, natural language and term extraction. Andrew Walkingshaw showed the power of Golem which combines XPath with RDF to make a very powerful search tool. We are grateful to Talis for making their RDF engine available and when I have some hard URLs I’ll blog how this works.

The main message is that the new technologies work. Certainly well enough to support collections in the order of 100,000 objects with many triples (Andrew had ca 10 megatriples). We are also making great progress in extracting chemistry out of free text (PDF is still awful, so please let’s have Word, or even better XHTML and XML). Or LaTeX. But in any case most of the toolset is now well prototyped. More later…

Does the semantic web work for chemical reactions

A very exciting post from Jean-Claude Bradley asking whether we can formalize the semantics of chemical reactions and synthetic procedures. Excerpts, and then comment…

Modularizing Results and Analysis in Chemistry


Chemical research has traditionally been organized in either experiment-centric or molecule-centric models.

This makes sense from the chemist’s standpoint.

When we think about doing chemistry, we conceptualize experiments as the fundamental unit of progress. This is reflected in the laboratory notebook, where each page is an experiment, with an objective, a procedure, the results, their analysis and a final conclusion optimally directly answering the stated objective.

When we think about searching for chemistry, we generally imagine molecules and transformations. This is reflected in the search engines that are available to chemists, with most allowing at least the drawing or representation of a single molecule or class of molecules (via substructure searching).

But these are not the only perspectives possible.

What would chemistry look like from a results-centric view?

Let’s see with a specific example. Take EXP150, where we are trying to synthesize a Ugi product as a potential anti-malarial agent and identify Ugi products that crystallize from their reaction mixture.

If we extract the information contained here based on individual results, something very interesting happens. By using some standard representation for actions we can come up with something that looks like it should be machine readable without much difficulty:

  • ADD container (type=one dram screwcap vial)
  • ADD methanol (InChIKey=OKKJLVBELUTLKV-UHFFFAOYAX, volume=1 ml)
  • WAIT (time=15 min)
  • ADD benzylamine (InChIKey=WGQKYBSKWIADBV-UHFFFAOYAL, volume=54.6 ul)
  • VORTEX (time=15 s)
  • WAIT (time=4 min)
  • ADD phenanthrene-9-carboxaldehyde (InChIKey=QECIGCMPORCORE-UHFFFAOYAE, mass=103.1 mg)
  • VORTEX (time=4 min)
  • WAIT (time=22 min)
  • ADD crotonic acid (InChIKey=LDHQCZJRKDOVOX-JSWHHWTPCJ, mass=43.0 mg)
  • VORTEX (time=30 s)
  • WAIT (time=14 min)
  • ADD tert-butyl isocyanide (InChIKey=FAGLEPBREOXSAC-UHFFFAOYAL, volume=56.5 ul)
  • VORTEX (time=5.5 min)
  • TAKE PICTURE

It turns out that for this CombiUgi project very few commands are required to describe all possible actions:

  • ADD
  • WAIT
  • VORTEX
  • CENTRIFUGE
  • DECANT
  • TAKE PICTURE
  • TAKE NMR

By focusing on each result independently, it no longer matters if the objective of the experiment was reached or if the experiment was aborted at a later point.

Also, if we recorded chemistry this way we could do searches that are currently not possible:

  • What happens (pictures, NMRs) when an amine and an aromatic aldehyde are mixed in an alcoholic solvent for more than 3 hours with at least 15 s vortexing after the addition of both reagents?
  • What happens (picture, NMRs) when an isonitrile, amine, aldehyde and carboxylic acid are mixed in that specific order, with at least 2 vortexing steps of any duration?
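As a rough illustration of how a query like the second one could run against such records, here is a minimal sketch. It is not anyone's actual database design: the `Action` record, the hand-assigned `COMPOUND_CLASS` table and the `matches` helper are all hypothetical, and a real system would derive compound classes from the structures behind the InChIKeys rather than from a lookup by name.

```python
from dataclasses import dataclass, field


@dataclass
class Action:
    verb: str                                   # "ADD", "WAIT", "VORTEX", ...
    what: str = ""                              # substance or object, if any
    params: dict = field(default_factory=dict)  # amounts, times, etc.


# Hand-assigned compound classes (an assumption made for this sketch).
COMPOUND_CLASS = {
    "benzylamine": "amine",
    "phenanthrene-9-carboxaldehyde": "aldehyde",
    "crotonic acid": "carboxylic acid",
    "tert-butyl isocyanide": "isonitrile",
}

# The EXP150 steps listed above, with all times converted to seconds.
EXP150 = [
    Action("ADD", "container", {"type": "one dram screwcap vial"}),
    Action("ADD", "methanol", {"volume_ml": 1.0}),
    Action("WAIT", params={"time_s": 900}),
    Action("ADD", "benzylamine", {"volume_ul": 54.6}),
    Action("VORTEX", params={"time_s": 15}),
    Action("WAIT", params={"time_s": 240}),
    Action("ADD", "phenanthrene-9-carboxaldehyde", {"mass_mg": 103.1}),
    Action("VORTEX", params={"time_s": 240}),
    Action("WAIT", params={"time_s": 1320}),
    Action("ADD", "crotonic acid", {"mass_mg": 43.0}),
    Action("VORTEX", params={"time_s": 30}),
    Action("WAIT", params={"time_s": 840}),
    Action("ADD", "tert-butyl isocyanide", {"volume_ul": 56.5}),
    Action("VORTEX", params={"time_s": 330}),
    Action("TAKE PICTURE"),
]


def classes_in_order(log):
    """Compound classes of the classified reagents, in order of addition."""
    return [COMPOUND_CLASS[a.what] for a in log
            if a.verb == "ADD" and a.what in COMPOUND_CLASS]


def matches(log, class_order, min_vortex_steps):
    """True if reagents were added in class_order with enough VORTEX steps."""
    vortex_steps = sum(1 for a in log if a.verb == "VORTEX")
    return (classes_in_order(log) == list(class_order)
            and vortex_steps >= min_vortex_steps)


if __name__ == "__main__":
    order_used = ["amine", "aldehyde", "carboxylic acid", "isonitrile"]
    print(matches(EXP150, order_used, 2))
    print(matches(EXP150, ["isonitrile", "amine", "aldehyde", "carboxylic acid"], 2))
```

Container and solvent additions are ignored here simply because they have no entry in the class table. The duration-based conditions in the first query would need the `time_s` values as well, which the records already carry.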

I am not sure if we can get to that level of query control, but ChemSpider will investigate representing our results in a database in this way to see how far we can get.

Note that we can’t represent everything using this approach. For example observations made in the experiment log don’t show up here, as well as anything unexpected. Therefore, at least as long as we have human beings recording experiments, we’re going to continue to use the wiki as the official lab notebook of my group. But hopefully I’ve shown how we can translate from freeform to structured format fairly easily.

Now one reason I think that this is a good time to generate results-centric databases is the inevitable rise of automation. It turns out that it is difficult for humans to record an experiment log accurately. (Take a look at the lab notebooks in a typical organic chemistry lab – can you really reproduce all those experiments without talking to the researcher?)

But machines are good at recording dates and times of actions and all the tedious details of executing a protocol. This is something that we would like to address in the automation component of our next proposal.

Does that mean that machines will replace chemists in the near future? Not any more than calculators have replaced mathematicians. I think that automating result production will leave more time for analysis, which is really the test of a true chemist (as opposed to a technician).

Here is an example

[...]

database, as long as attribution is provided. (If anyone knows of any accepted XML for experimental actions let me know and we’ll adopt that.)

 

I think this takes us a step closer from freeform Open Notebook Science to the chemical semantic web, something that both Cameron Neylon and I have been discussing for a while now.

PMR: This is very important to follow – and I’ll give some of our insights. Firstly, we have been tackling this for ca. 5 years, starting from the results as recorded in scientific papers or theses. Most recently we have been concentrating very hard on theses and have just taken delivery of a batch of about 20, all from the same lab.

I agree absolutely with J-C that traditional recording of chemical syntheses in papers and theses is very variable and almost always misses large amounts of essential details. I also agree absolutely that the way to get the info is to record the experiment as it happens. That’s what the Southampton projects CombeChem and R4L spent a lot of time doing. The trouble is it’s hard. Hard socially. Hard to get chemists interested (if it was easy we’d be doing it by now). We are doing exactly the same with some industrial partners. They want to keep the lab book. The paper lab book. That’s why electronic notebook systems have been so slow to take off. The lab book works – up to a point – and it also serves the critical issues of managing safety and intellectual property. Not very well, but well enough.

J-C asks

If anyone knows of any accepted XML for experimental actions let me know and we’ll adopt that

CML has been designed to support this, and Lezan Hawizy in our group has been working in detail over the last 4 months to see if CML works. It’s capable of managing inter alia:

  • observations
  • actions
  • substances, molecules, amounts
  • parameters
  • properties (molecules and reactions)
  • reactions (in detail) with their conditions
  • scientific units

We have now taken a good subset of literature reactions (abbreviated though they may be) and worked out some of the syntactic, semantic, ontological and lexical environment that is required. Here is a typical result, which has a lot in common with J-C’s synthesis.

synthesis.PNG

I have cut out the actual compounds, though in the real example they have full formulae in CML and can be used to manage balance of reactions, masses, volumes, molar amounts, etc. JUMBO is capable of working out which reagents are present in excess, for example. It can also tell you how much of everything you will need and how long the reaction will take. No magic, just housekeeping.
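To give a flavour of that housekeeping, here is a small sketch that is not JUMBO's implementation. It takes the two reagents that J-C's procedure specifies by mass, works out molar amounts from their formulae, and reports which one is limiting. The molecular formulae are my own assignments for the named compounds, the 1:1 stoichiometry is an assumption appropriate to a Ugi reaction, and the helper names are made up.

```python
import re

# Standard atomic masses (g/mol) for the elements needed here.
ATOMIC_MASS = {"C": 12.011, "H": 1.008, "O": 15.999}


def molar_mass(formula):
    """Molar mass in g/mol from a simple formula such as 'C4H6O2'."""
    total = 0.0
    for symbol, count in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        total += ATOMIC_MASS[symbol] * (int(count) if count else 1)
    return total


# name -> (formula, mass in mg); masses are those quoted in the procedure.
REAGENTS = {
    "phenanthrene-9-carboxaldehyde": ("C15H10O", 103.1),
    "crotonic acid": ("C4H6O2", 43.0),
}


def limiting_reagent(reagents):
    """Return (name, amounts_in_mmol), assuming 1:1 stoichiometry."""
    amounts = {name: mass_mg / molar_mass(formula)
               for name, (formula, mass_mg) in reagents.items()}
    return min(amounts, key=amounts.get), amounts


if __name__ == "__main__":
    name, amounts = limiting_reagent(REAGENTS)
    for reagent, mmol in amounts.items():
        print(f"{reagent}: {mmol:.4f} mmol")
    print("limiting:", name)
```

Both reagents come out very close to 0.5 mmol, so the experiment was evidently designed as equimolar. The liquid reagents would need densities as well, which is exactly the sort of thing the dictionaries of common molecules are for.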

CML is designed with a fluid vocabulary, so that anything which isn’t already known is found in dictionaries and repositories. So we have collections of:

  • solvents
  • reagents
  • apparatus
  • procedures
  • appearances
  • units
  • common molecules

A word of warning. It looks attractive, almost trivial, when you start. But as you look at more examples and particularly widen your scope it gets less and less productive. I’ve probably looked through several hundred papers. There is always a balance between precision and recall and Zipf’s law. You will never manage everything. There will be procedures, substances, etc, that defy representation. There are anonymous compounds and anaphora.

So we can’t yet build a semantic robot that is capable of doing everything. We probably can build examples that work in specific labs where the reactions are systematically similar – as in combinatorial chemistry.

So, yes, J-C – we would love to explore how CML can support this…

Is the scientific archive safe with publishers?

“In the pipeline” is an impressive and much-followed part of the chemical blogosphere. I’m a bit late on its post Kids These Days! which deals in depth with a case (Menger / Christl pyridinium incident) of published scientific error. The case even got as far as Der Spiegel – the German magazine. It’s worth reading (the link will take you to other links and also a very worthwhile set of comments from the blogosphere).

My summary is that: some chemists reported the synthesis of a novel set of compounds, published in Angewandte Chemie (Wiley) (2007) and Organic Letters (ACS) (2006). After publication, doubt was thrown on the identification of the products, claiming that analytical evidence had been misinterpreted. As a result the original authors withdrew their claim. [The blogosphere has the usual range of opinions - the referees should have picked this up, the authors were sloppy, the criticism was rude, the reaction had been known for 100 years, etc. All perfectly reasonable - this is a fundamental part of science - it must be open to criticism and falsifiability. We expect a range of opinions on acceptable practice.]

What worried me was one comment that the publisher had altered the scientific record.

17. Metalate on December 1, 2007 11:00 AM writes…

Has anyone noticed that OL has removed all but the first page of the Supporting Info from the 2006 paper? Is this policy on retracted papers? And if so, why?

PMR: I wasn’t reading this story originally, so went back to the article:

orglett1.PNG

As I am currently not in cam.ac.uk I cannot get the paper without paying 25 USD (and I don’t want to take the risk that there is nothing there. I’ll visit in a day or two).

But the ACS DOES allow anyone to read the supporting information for free (whether they can re-use it is unclear and it takes the ACS months to even reply on this). So I thought it would be an idea to see if our NMREye calculations would show that the products were inconsistent with the data. I go to the supporting information

and find:

orglett2.PNG

[On another day I would have criticized the use of hamburger bitmaps to store scientific information but that's not today's concern.]
There is only one page. As it ends in mid sentence I am sure Metalate is correct.

The publishers have altered the scientific record

I don’t know what they have done to the fulltext article. Replaced it by /dev/null? Or removed all but the title page?

This is the equivalent of going to a library and cutting out pages you don’t agree with. The irony is that there is almost certainly nothing wrong with the supporting information. It should be a factual record of what the authors did and observed. There is no suggestion that they didn’t do the work, make compounds, record their melting points, spectra, etc. All these are potentially valuable scientific data. They may have misinterpreted their result but the work is still part of the scientific record. For all I know (and I can’t because the publisher has censored the data) the compounds they made were actually novel (if uninteresting). Even if they weren’t novel it could be valuable to have additional measurements on them.

I have a perfectly legitimate scholarly quest. I want to see how well chemical data supports the claims made in the literature. We have been doing this with crystallography and other analytical data for several years. It’s hard because most data is thrown away or in PDF but when we can get it the approach works. We contend that if this paper had been made available to high throughput NMR calculation (“robot referees”) – by whatever method – it might have been shown to be false. It’s even possible that the compounds proposed might have been shown to be unstable – I don’t know enough without doing the calculations.

But the publisher’s censorship has prevented me from doing this.

The ACS takes archival seriously: C&EN: Editor’s Page – Socialized Science:

As I’ve [Rudy Baum] written on this page in the past, one important consequence of electronic publishing is to shift primary responsibility for maintaining the archive of STM literature from libraries to publishers. I know that publishers like the American Chemical Society are committed to maintaining the archive of material they publish.

PMR: I am not an archivist but I know some and I don’t know of any who deliberately censor the past. So I have some open questions to the American Chemical Society (and to other publishers who have taken on the self-appointed role of archivist):

  • what is the justification for this alteration of the record? Why is the original not still available with an annotation?
  • who – apart from the publisher – holds the actual formal record of publications? And how do I get it? (Remember that a University library who subscribes to a journal will probably lose all back issues – unlike paper journals the library has not purchased the articles, only rented them). I assume that some deposit libraries hold copies but I bet it’s not trivial to get this out of the British Library.
  • where and how can I get hold of the original supplemental data? And yes, I want it for scientific purposes – to do NMR calculations. Since it was originally free, I assume it is still free.

Surely the appropriate way to tackle this is through versions or annotations? One of the many strengths of Wikipedia is that it has a top-class approach to versions and annotations. If someone writes something that others disagree with, the latter can change it. BUT the original version still exists and can be easily located. If there is still disagreement, then WP may put a stamp of the form “this entry is disputed”. Readers know exactly where they are and they can see the whole history of the dispute.

So here, surely, the simple answer is to preserve, not censor, the scientific record. The work may be “junk science” but it is still reported science. Surely an editor should simply add “The authors have retracted this paper because…” on all documents and otherwise leave them in full.

It is obvious that this problem cannot arise with Open Access CC-BY papers because anyone can make a complete historical record as soon as they are published.

[UPDATE. I have now looked at the original article and this seems to have been treated satisfactorily - the fulltext is still available, with an annotation that "The authors have retracted this paper on November 15, 2007 (Org. Lett. 2007, 24, 5139) due to uncertainties regarding what products are formed in the reaction described." That's fair and I have relatively little quibble - although it would still be valuable to see the original and not simply an annotated version.

But the arguments about the supplemental data still persist. If it's deliberate it's very worrying. If it's a technical error in archival it's also very worrying. ]

Open Data: publishers are the problem

The Chemspider site and blog have been making rapid and valuable progress towards Open Data. This is particularly laudable for a commercial site where Openness in chemistry is a long way from being a proven business model and is actively resisted by many. Here is a typical tale of frustration – I comment below
Why We Can’t Publish Scraped CrystalEye Data Yet….And Science Commons Declare a Protocol for Implementing Open Access Data
Previously I blogged about our intention to scrape CrystalEye data and publish onto ChemSpider. The original comments regarding the data on CrystalEye were as follows:

  1. pm286 Says:
    October 26th, 2007 at 7:54 am (1) All data come from Free sources – i.e. visible without a subscription. Some journals (Acta Crystallographica and RSC for example) do not copyright the data. Others like ACS add copyright notices. It is our contention, and Elsevier has agreed for its own material, that facts are not copyrightable. We have therefore extracted and transformed facts and mounted these. Where the original material (CIF) does not carry copyright we mount it on our pages – where it does we do not, but we have the transformed data. In those cases it would be possible to recreate the original CIF data in semantic form, but not the exact typographical layout which contains meaningless whitespace. I am not aware that ACS or Elsevier have ever made statements of any kind about our Open Data efforts. You may scrape anything, but you must honour the source and the metadata and you should add the Open Data sticker. If you scrape the link (simplest) you may simply point to our site. If you scrape more data you should ensure that the integrity of the data is maintained and that if it is re-used the re-used data should still clearly show our metadata.

[PMR: Yesterday's announcement of the CCZero licence could mean that we change from a meta-licence ("Open Data") to an explicit CCZero licence. I will need to read the details. I don't think it changes the arguments below.]

We have already done the work to scrape certain data from the site but have chosen to be extra careful with taking the declaration of Open Data made to all data sources. My primary worry was with the data scraped from the ACS journals. With this caution in mind I sent a letter to the copyright department at ACS as outlined here. In fact I made a couple of phone calls, sent the email about 2 more times and finally managed to talk to a nice gentleman from the ACS copyright department and brought my concerns to light. Since then we have exchanged multiple emails, spoken again on the phone and I have been told that a meeting of minds from both Washington and Ohio was being scheduled to discuss the situation. That’s 2 months after my original email.

Today I received the following email and I am excerpting from it..

“Thank you for your inquiry about the proposed use by ChemSpider of information in the CrystalEye database that has been published within certain ACS journal publications. In light of your query, we are examining the manner in which ACS published material is represented within that database as well as the nature of your proposed use, so that we can respond in an informed manner to your request.

<snip>

If you will be attending the ACS National Meeting in New Orleans, perhaps we could confer with you at that time to discuss our findings and advise you appropriately?

Communicators Name withheld ”

What I thought was a simple question and done with the intention that ChemSpider was safe turns out not to be so simple. It could take until March 2008 to get an answer! At this stage we will not be publishing any of the CrystalEye data without confirmation from each of the publishers that this is allowed. I asked the question previously “Who gets to declare data open or not?“ and even received the question “Why even offer the option of closed?” The primary reason is that we have turbulent times ahead of us around such issues of “openness” and until these are navigated I am working to keep ChemSpider “safe “. I am willing to participate, support and contribute to the evangelism of openness but am equally concerned with keeping ChemSpider alive for the close to 3000 users per day now accessing the service.

It was an interesting day to receive this email about a potential FIVE MONTH delay to a decision about Open Data especially now that Science Commons have released a Protocol for Implementing Open Access Data just yesterday. …

So, while protocols are exposed to the community by Science Commons the challenge of utilizing them now begins…I will be in communication with members of the Science Commons soon to determine how ChemSpider can fit into the model…

PMR: This is, unfortunately, completely typical. Earlier this year I wrote to Tetrahedron (an Elsevier journal) asking if they would consider posting CIFs (crystallographic data):

Request for Open publication of crystallographic data in Elsevier’s Tetrahedron

=========== Open letter to editors of Tetrahedron ==========

Professor L. Ghosez ,
Professor Lin Guo-Qiang ,
Professor T. Lectka ,
Professor S.F. Martin ,
Professor W.B. Motherwell ,
Professor R.J.K. Taylor ,
Professor K. Tomioka

Subj: Request for Open publication of crystallographic data in Tetrahedron
Dear editors,
I have recently been reviewing access to supplemental data in chemistry publications, in particular crystallographic data (“CIFs”). Many publishers (IUCr, RSC, ACS…) expose these on their websites as Open Data (for examples see: http://wwmm.ch.cam.ac.uk/blogs/murrayrust/?p=455). The data are acknowledged not to be copyrightable (see http://wwmm.ch.cam.ac.uk/blogs/murrayrust/?p=447) where your colleague Jennifer Jones (copied) has confirmed:

Dear Peter Murray-Rust
Thanks for your email. Data is not copyrighted. If you are reusing the entire presentation of the data, then you have to seek permission, otherwise, you can use the data without seeking our permission.
Yours sincerely
Jennifer Jones
Rights Assistant
Global Rights Department
Elsevier Ltd
PO Box 800
Oxford OX5 1GB
UK
Tel: + 44 (1) 865 843830
Fax: +44 (1) 865 853333
email: j.jones@elsevier.com

Other Elsevier journals such as those publishing thermochemistry (see last blog post) are now actively making the supplemental data Openly available on the journal website. I am therefore asking whether Tetrahedron (and perhaps other Elsevier chemistry journals) might consider publishing their data Openly in this way and would be grateful for your views.

(This is an Open letter (http://wwmm.ch.cam.ac.uk/blogs/murrayrust/?p=456) and I would like to publish your reply so please mark any confidential material as such).

Thank you for considering this

PMR: Five editors – I haven’t had the courtesy of a reply. This is not uncommon – I didn’t get replies on Open topics from Wiley, Springer (first time round) either. Either journals are not in the habit of replying – they consider ordinary scientists too low in the foodchain to merit consideration (most likely) – or they regard anything Open as a pain and want to slow it by inaction (also most likely). They have their set way of doing things – God ordained in 1972 that the world belongs to the publishers and they don’t want to see it change.

Another typical example. I was invited to write an article for Serials Review on Open Data. I asked if I could write my article in HTML and embed my own copyright material, noted as such under appropriate licence. The editorial office said that they would come back to me. It’s now past the closing date of the submission. After ca. 6 weeks I got the reply:

Facts and data are not copyrightable but the expression of data is copyrightable. If you wish to use third-party data in a different format within your article, including full acknowledgement to the source of the data, then that would be acceptable. However, if you wish to retain the expression of the data, then you will need to include alternate diagrams within the article.

So I can use the data – IF I can get it. If I can only get a graph then I can’t unless I redraw it. Is redrawing a graph a useful activity for science – do I need to answer? The only value is that it adds some random errors to the data (or systematic ones) that would be fun to give as exercises in bad scientific practice for students. “Expression of the data” – i.e. the author’s graphs – are not re-usable.

So what’s the answer? Currently I use the “ask forgiveness, not ask permission” mode. And if the “owners” of the data (read “appropriators”) send the lawyers and ask for a take-down – make a huge public fuss. As the world did when Shelly Batts “stole” a graph from Wiley (Sued for 10 Data Points). And Wiley backed down. The publishers don’t like public fuss.

So a few months ago I would have advised Chemspider “go ahead”. But they ran foul of another publisher (I think it was the Royal Society of Chemistry). I never understood the details but Chemspider linked to publicly visible papers (not Open) and were asked to take the links out of the Chemspider database. This doesn’t even seem to make sense. I would have thought publishers would like people linking to their papers – maybe it was the metadata.

So I appreciate Chemspider’s wish to remain on the correct legal side of the publisher. But [the publishers'] actions destroy scientific data in the current century. Chemistry publishers [OA publishers and IUCr excepted] are actively and passively resisting the re-use of data. They copyright factual data, hide it, require take-downs, refuse to reply to reasonable letters – everything. They are simply in the way between the creator of the data and the consumer.

As I have blogged we now have an exciting project sponsored by Microsoft on eChemistry. We are going to fill repositories with data. And we are going to get that data (“not copyrightable” – see above) from any source we reasonably can. It will be available to the whole world. It will probably be stamped CCZero. CrystalEye will be in there. We shall, of course, include the source (provenance) as we really care about it and metadata. So people will know where it came from.

Why can’t the ACS reply “Yes” to Chemspider by return? Does it really make sense for chemistry publishers to be universally seen as Luddites? Because the world will sweep these restrictive practices away, and the business will have moved from the publishers to somewhere in the twenty-first century (the one we are in).

Open Access – Chemistry World reviews the dilemma

In this month’s Chemistry World (a magazine from the Royal Society of Chemistry) there is an important article by Rebecca Trager (US) reviewing the increasing fission within the chemistry publishing community: Chemistry’s open access dilemma

 

This was, I believe, a commissioned article (Rebecca interviewed a number of people, including me, by phone) and, as far as I can tell, it does not represent any explicit or implicit policy of the RSC itself. In my view the article gives a fair account of the current position in chemistry (the article is free-to-read and I give selected quotes):

But the saga [NIH bill] has highlighted a widening rift in the chemical community over open access publishing – and the contentious provision could yet be revived.

Major scholarly societies joined the Association of American Publishers (AAP) in lobbying against the proposal, including the American Chemical Society (ACS), the American Association for Clinical Chemistry, the Biochemical Society, and the RSC (publishers of Chemistry World).

PMR: I suspect, though I do not know, that this is distinct from the PRISM movement, which was also launched from the AAP.

But the battle lines are already being drawn. The ACS wants the NIH policy to remain voluntary. ‘Depending on how they implement this, it could represent a federal taking of copyrighted materials,’ ACS spokesman Glenn Ruskin told Chemistry World.

A compulsory policy would need costly monitoring and penalisation systems, Ruskin said. ‘Why expend monies on a mandatory policy, when they could get to their endpoint a lot quicker if they just worked more cooperatively with the publishers?’

‘The idea of public access to research information is a little bit specious,’ added Robert Parker, managing director of RSC publishing. ‘The UK government will be funding the London Olympics in 2012, but that doesn’t mean that everybody can have free tickets – there is a big difference between funding something and having it be freely available.’

PMR: Factually the current position is that almost all chemistry publishers (such as ACS and RSC) continue to hold the copyright on closed access articles funded by governments. Maybe the analogy with the Olympics is a little bit stretched.

The Partnership for Research Integrity in Science and Medicine (PRISM) argues that the Congress bill could damage peer review by compromising the viability of non-profit and commercial journals. Predictably, the campaign has sparked outrage among open access lobby groups. In the wake of the furore, nine publishers have disavowed PRISM, including Cambridge University Press, Oxford University Press, Columbia University Press and University of Chicago Press. The ACS – which had been closely involved with PRISM – has now also played down links with the campaign.

PMR: PRISM is playing Haydn’s farewell symphony. No one seems to support it (I don’t know about the RSC; maybe this is a chance for them to comment). Is anyone left?

As a result, the steps taken by the RSC and ACS to enter this new world of publishing have received a stilted response from chemists.

For roughly a year, the RSC has had an Open Science service that allows authors to pay to make their article freely accessible to all. The basic fee for a primary research article is £1600 with a 15 per cent discount for RSC members, owner societies of RSC journals, and authors from subscribing organisations. So far, just four authors have participated.

PMR: Just in case anyone is unfamiliar with the RSC’s use of “Open Science” – this is not full Open Access under the BBB declaration but is a free-to-read version where the journal retains copyright. Readers can decide whether this is a good bargain compared with full Open Access offerings (it’s not the worst).

Indeed, there are calls for bold and decisive leadership on this increasingly divisive issue from all sides of the chemistry community. ‘Vision is needed. Where we are at the moment is unacceptable,’ said the ACS’s Ruskin.

PMR: I have indeed argued frequently that bold and decisive leadership is necessary and that it should come from learned societies and International Unions who are respected by the community. But if it doesn’t come from there, the community will find another way and in the Internet era that can happen very quickly.

Dog food is tasty!

I can’t escape… I have committed myself publicly. Here’s Peter Sefton:  Crossing curation mountain


I’m looking forward to seeing Peter Murray Rust eat my dog food. He’s lucky cos at our place the hounds eat relatively benign dry food. [...]

I’m going to follow up on using ICE for blogging to WordPress soon which is what that dog food stuff is about, but Peter has just pointed out some issues with getting papers into institutional repositories and I wanted to discuss some of his points here.

[...]

I liked the last bit, so I added some emphasis:

And if I were funding repositories I would certainly put resource into communal authoring environments. If you do that, then it really is a one-click reposition instead of the half-day mess of trying to find the lost documents.

I’ll be sure to mention this to our friends at DEST.

PMR: Thanks. I’ll need help. First we need to make sure the WordPress version is correct. I have 2.03. There are no immediate plans to upgrade but this might swing it. I would re-open the CMLBlog (which is sleeping till I can author better). I probably need some hand-holding.

I think we are gradually getting places. Some years ago we (Henry, Egon and me) hacked CMLRSS. It works, but only with a complicated bespoke client. Now we’ve got a better handle on the technology and with Atom+PNG we can direct intravenous feeds of CrystalEye. Every new structure with full structural diagram (well every organic one). Here’s Jim’s post…

That’s what real publishers should be thinking about. What’s inside the post as well as on the surface. Come to think of it we can probably put it on ICE.

CrystalEye: data loss and corruption through legacy files

Andrew Dalke raised the issue of data corruption:

  1. Andrew Dalke Says:
    November 4th, 2007 at 2:32 am
  2. PMR: Moreover crystal structures contain problems such as disorder and partial occupancy which are impossible to hold in an SDFile as far as I know without corrupting the data.
  3. “Corruption” is a strong word. Why not think of it as the way you wrote in your “Round-trip format conversion” wikipedia article?

PMR: Here is a widespread and almost universal example of corruption which is almost entirely down to the use of SD (MOL) files and/or SMILES in particular (but is common to almost all legacy formats). Nitric oxide (WP) is a very important molecule – it is an essential signalling molecule in the vascular system, and also a serious pollutant from transport. Its formula is NO, one nitrogen atom and one oxygen atom.

A large number of freely accessible databases give other formulas:

PMR: These variations are not because there are different opinions about what “nitric oxide” is, or whether the name may be used differently by different communities. They are because the use of SD/MOL or SMILES has corrupted the information. Because SD files have no mechanism for indicating that an atom does not have implicit hydrogens, many programs are “clever” and add them according to “valence rules”. While these are OK for a subset of chemistry they are a disaster for others. Nitric oxide is just one of many examples where they fail. So that is why I cannot answer Chemspider’s request for SD files of CrystalEye – I KNOW it will corrupt the information. It is possible that there is a simple algorithm that could filter out “most” of the entries which would not be corrupted, but it will not be watertight. That is why we have developed CML – it is designed to avoid corruption.
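To make the failure concrete, here is a minimal sketch of the kind of “clever” hydrogen filling described above. It is not the algorithm of any particular program: the `DEFAULT_VALENCE` table, the `add_implicit_hydrogens` and `formula` functions, and the choice to store NO with a double bond are all assumptions made for illustration.

```python
# Hypothetical default valences, for illustration only; real toolkits use
# richer rules (charges, radicals, multiple allowed valences).
DEFAULT_VALENCE = {"C": 4, "N": 3, "O": 2}


def add_implicit_hydrogens(atoms, bonds):
    """Return the hydrogen count a naive valence filler assigns to each atom.

    atoms: list of element symbols, e.g. ["N", "O"]
    bonds: list of (atom_index, atom_index, bond_order) tuples
    """
    used = [0] * len(atoms)
    for i, j, order in bonds:
        used[i] += order
        used[j] += order
    return [max(DEFAULT_VALENCE[sym] - used[k], 0) for k, sym in enumerate(atoms)]


def formula(atoms, hydrogens):
    """Build a simple formula string with symbols in alphabetical order."""
    counts = {}
    for sym in atoms:
        counts[sym] = counts.get(sym, 0) + 1
    if sum(hydrogens):
        counts["H"] = counts.get("H", 0) + sum(hydrogens)
    return "".join(
        sym + (str(n) if n > 1 else "") for sym, n in sorted(counts.items())
    )


# Nitric oxide as it might be stored: one N, one O, one double bond
# (the bond order is an assumption made for this sketch).
atoms = ["N", "O"]
bonds = [(0, 1, 2)]

as_stored = formula(atoms, [0, 0])
after_filling = formula(atoms, add_implicit_hydrogens(atoms, bonds))
print(as_stored)       # NO
print(after_filling)   # HNO
```

The file said NO; the reader, applying its valence table, silently reports HNO. Nothing in the input asked for that hydrogen, and nothing in the output records that it was invented.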

  1. When a document in one format is converted to another there is likely to be information loss. Is “information loss” necessarily “corruption”? From my experience in dealing with PDB files, which has some of these crystallographic properties, I think there can be meaningful information despite the information loss. So long as the tools and the users understand that there are limitations in the conversion.

PMR: There are “obviously” parts of the information that can be omitted without corruption. An example is “iucr:_publ_contact_author_phone”. But what happens if you omit “occupancy” in an entry? It looks like:

nite.PNG

Notice that the _chemical_formula_sum contains non-integral atom counts – this is common in crystal structures and is supported by the _atom_site_occupancy flag in CIF, which points to the last field before the two dots.

_atom_site_type_symbol
_atom_site_label
_atom_site_fract_x
_atom_site_fract_y
_atom_site_fract_z
_atom_site_U_iso_or_equiv
_atom_site_adp_type
_atom_site_calc_flag
_atom_site_refinement_flags
_atom_site_occupancy
_atom_site_disorder_assembly
_atom_site_disorder_group
Ni Ni1 0.05235(10) 0.2500 0.4203(2) 0.0152(4) Uani d S 1 . .
Ni Ni2 0.14310(14) 0.2500 0.0802(3) 0.0185(4) Uani d SP 0.80 . .
Ni Ni3 0.15349(14) -0.2500 0.5956(3) 0.0191(5) Uani d SP 0.80 . .
Te Te1 0.25343(5) 0.2500 0.42149(10) 0.0131(3) Uani d S 1 . .
Te Te2 0.00373(5) 0.2500 0.78163(10) 0.0146(3) Uani d S 1 . .

Confirm that Ni(1+0.8+0.8) => Ni2.6 and Te(1+1) => Te2. CML is designed to hold this without loss (through the occupancy attribute) but SD files, SMILES and almost all other legacy formats (except PDB and a few other crystallographic files) are not. Therefore using SD to bundle this entry and transmit it is guaranteed to corrupt it.
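The arithmetic can be checked mechanically. The sketch below sums the occupancy column over the five atom-site rows quoted above, and then repeats the sum as a format with no occupancy field would see it (every site counted as fully occupied). The function names `formula_sum` and `as_text` are made up for this example, and the whitespace splitting is a shortcut that works for these rows, not a general CIF parser.

```python
from collections import defaultdict

# Column names copied from the CIF loop header quoted above, in order.
COLUMNS = [
    "_atom_site_type_symbol",
    "_atom_site_label",
    "_atom_site_fract_x",
    "_atom_site_fract_y",
    "_atom_site_fract_z",
    "_atom_site_U_iso_or_equiv",
    "_atom_site_adp_type",
    "_atom_site_calc_flag",
    "_atom_site_refinement_flags",
    "_atom_site_occupancy",
    "_atom_site_disorder_assembly",
    "_atom_site_disorder_group",
]

# The five atom-site rows quoted above.
ROWS = """\
Ni Ni1 0.05235(10) 0.2500 0.4203(2) 0.0152(4) Uani d S 1 . .
Ni Ni2 0.14310(14) 0.2500 0.0802(3) 0.0185(4) Uani d SP 0.80 . .
Ni Ni3 0.15349(14) -0.2500 0.5956(3) 0.0191(5) Uani d SP 0.80 . .
Te Te1 0.25343(5) 0.2500 0.42149(10) 0.0131(3) Uani d S 1 . .
Te Te2 0.00373(5) 0.2500 0.78163(10) 0.0146(3) Uani d S 1 . .
"""


def formula_sum(rows, use_occupancy=True):
    """Sum atom counts per element, optionally ignoring the occupancy column."""
    sym_index = COLUMNS.index("_atom_site_type_symbol")
    occ_index = COLUMNS.index("_atom_site_occupancy")
    totals = defaultdict(float)
    for line in rows.strip().splitlines():
        fields = line.split()
        occupancy = float(fields[occ_index]) if use_occupancy else 1.0
        totals[fields[sym_index]] += occupancy
    return dict(totals)


def as_text(totals):
    """Render totals compactly, e.g. {'Ni': 2.6, 'Te': 2.0} as 'Ni2.6 Te2'."""
    return " ".join(f"{sym}{count:g}" for sym, count in sorted(totals.items()))


with_occupancy = as_text(formula_sum(ROWS))
without_occupancy = as_text(formula_sum(ROWS, use_occupancy=False))
print(with_occupancy)      # Ni2.6 Te2
print(without_occupancy)   # Ni3 Te2
```

Dropping one column turns Ni2.6 Te2 into Ni3 Te2, which is a different composition, not a less detailed version of the same one.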

[Note added later. There is a well characterised HN=O molecule - see NIST Webbook - but it is nitrosyl hydride, not nitric oxide.]

COST D37 Meeting in Rome

Tomorrow Andrew Walkingshaw and I will be off to Rome for the COST D37 Working Group. From the site:

What is COST?

COST is one of the longest-running instruments supporting co-operation among scientists and researchers across Europe. COST now has 35 member countries and enables scientists to collaborate in a wide spectrum of activities in research and technology. [...]

PMR: I’m always proud to be involved in European collaborations. When I was born Europe was tearing itself apart. Whatever we may think of the bureaucracy involved it’s worth it. Science and scientists have always been a major force in international collaboration, and the prevention of conflict.

The meeting itself (COST D37) is aimed at interoperability on chemical computation:

Objective

Realistic modelling in chemistry often requires the orchestration of a variety of application programs into complex workflows (multi-scale modelling, hybrid methods). The main objective of this working group (WG) is the implementation, evaluation and scientific validation of workflow environments for selected illustrator scenarios.

Goals

In the CCWF group, the focus is on the implementation and evaluation of quantum chemical (QC) workflows in distributed (Grid) environments. This is accomplished by:

  • The implementation of workflow environments for QC by adapting standard Grid technologies.
  • Fostering standard techniques (interfaces) for handling quantum chemical data in a flexible and extensible format to ensure application program interoperability and support of an efficient access to chemical information based on a Computational Chemistry ontology.
  • The implementation of computational chemistry illustrator scenarios from areas of heterogeneous catalysis, QSAR/QSPR, and rational materials design to demonstrate the applicability of our approach.

PMR: So I’ll be talking about the World Wide Molecular Matrix (WWMM) and Andrew will talk on Golem – which will transduce the output of computational programs into ontologically supported components that can be fed into other programs without loss of information. I shall try to present as much as possible from the WWW, linking into CrystalEye and OpenNMR.