#solo10 GreenChainReaction: A greenness calculator

Scraped/typed into Arcturus

Jean-Claude Bradley has made a useful contribution to the Green Chain Reaction (http://usefulchem.blogspot.com/2010/08/green-solvent-metric-on-solvent.html)… Even if you aren't a chemist, you should be able to follow the logic:

In the spirit of contributing to Peter Murray-Rust’s initiative to collect Green Chemistry information, Andrew Lang and I have added a green solvent metric for 28 of the 72 solvents we include in our Solvent Selector service. The scale represents the combined contributions for Safety, Health and Environment (SHE) as calculated by ETH Zurich.

For example consider the following Ugi reaction solvent selection. Using the default thresholds, 6 solvents are proposed and 5 have SHE values. Assuming there are no additional selection factors, a chemist might start with ethyl acetate with a SHE value of 2.9 rather than acetonitrile with a value of 4.6.

Individual values of Safety, Health and Environment for each solvent are available from the ETH tool. We are just including the sum of the three for convenience.
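As a minimal sketch of the idea (the numbers below are illustrative placeholders, not the real ETH values): sum the three components for each solvent and rank candidates by the total, lower being greener.

```python
# Illustrative only: combine Safety, Health and Environment scores into one
# metric and rank candidate solvents by the sum (lower = greener).
# The numbers here are made up; real values come from the ETH Zurich tool.
she_components = {
    "ethyl acetate": {"safety": 1.0, "health": 0.9, "environment": 1.0},
    "acetonitrile":  {"safety": 1.5, "health": 1.6, "environment": 1.5},
}

def combined_she(scores):
    """Sum of the Safety, Health and Environment contributions."""
    return scores["safety"] + scores["health"] + scores["environment"]

for solvent, scores in sorted(she_components.items(), key=lambda kv: combined_she(kv[1])):
    print(f"{solvent}: SHE = {combined_she(scores):.1f}")
```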

 

Note that the license for using the data from this tool requires citing this reference:

 

Koller, G., U. Fischer and K. Hungerbühler (2000). Assessing Safety, Health and Environmental Impact during Early Process Development. Industrial & Engineering Chemistry Research 39: 960–972.

The method therefore appears to combine three measures into a single metric.

The basis of the method is not clear without reading the original paper (which I haven't done). The tool itself seems to be an Excel macro, and the reference explaining it (http://www.sust-chem.ethz.ch/research/process/ehs.html) didn't resolve for me (YMMV). It's not clear whether it's an Open Source tool or not. The licence and docs require that you cite the paper, and this is perfectly acceptable – you can reasonably require people to acknowledge your work under any CC-*-BY licence or equivalent.

Anyway it’s a good idea. I imagine that once calculated it should be possible to re-use the data values, so that we can label the SHE (safety, health, environment) for every reaction…

… now I just have to get the next bit of code to do this…

Posted in Uncategorized | 1 Comment

#solo10 GreenChainReaction: Can you spot how green the reaction is?

Scraped/typed into Arcturus

Mat Todd has made some great suggestions about what we can measure in the Green Chain Reaction. Here are his comments on the Etherpad (http://okfnpad.org/solo10) and my additions. Please comment and add extra ideas. I'll show some phrases below and let YOU test your ability. YOU DON'T NEED TO KNOW MUCH CHEMISTRY.

MatT

Has there been a shift away from chlorinated solvents in recent years?

PMR: It should be easy to find the solvent and classify it/them. The linguistic and contextual pointers to solvents (“3 g of X in Y” generally identifies Y as a solvent) have been identified by Lezan Hawizy.
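As a very crude illustration of that kind of contextual rule – the real extraction uses chemicalTagger rather than a one-line regular expression, and the pattern and the chlorination word list below are invented for the sketch:

```python
import re

# Naive illustration only: in phrases like "3 g of X in Y", treat Y as a
# candidate solvent, then flag it if the name looks chlorinated.
# The real pipeline uses part-of-speech + chemistry tagging (chemicalTagger).
PHRASE = re.compile(
    r"\d+(?:\.\d+)?\s*(?:g|mg|mL|mmol)\s+of\s+.+?\s+in\s+"
    r"([A-Za-z-]+(?:\s+[A-Za-z-]+)?)"
)

CHLORINATED_HINTS = ("chloro", "chloride", "dichloro", "trichloro")

def candidate_solvents(sentence):
    for match in PHRASE.finditer(sentence):
        solvent = match.group(1).strip()
        yield solvent, any(h in solvent.lower() for h in CHLORINATED_HINTS)

text = "A flask was charged with 5 g of 4-hydroxybenzonitrile in absolute ethanol."
for solvent, chlorinated in candidate_solvents(text):
    print(solvent, "(chlorinated)" if chlorinated else "(not chlorinated)")
```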

 Are there more catalytic reactions being carried out now vs 10 years ago?

PMR: Catalysts can be identified by:

  • Molar amount. If 1 mmol of A reacts with 1 mmol of B and there is 0.01 mmol of C, then C is almost certainly a catalyst (a toy sketch of this heuristic follows the list).
  • actual composition. If a reaction involves zeolites, platinum, or “Wilkinson’s catalyst” then these are likely to be catalysts
  • type of reaction. It may be possible to spot the fact that a reaction is catalysed by its type. Thus hydrogenation often requires catalysts.
  • Linguistic pointers. The phrases “catalysed by” and “added the catalyst” may be present – though often they are not.
  • Similarity with other reactions. This involves full analysis of the reaction language.
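Here is a toy sketch of the first (molar-amount) heuristic; the 5 mol% threshold is an arbitrary choice for illustration, not a fixed rule:

```python
# Illustrative sketch of the molar-amount heuristic only.
# The 5 mol% threshold is an arbitrary choice for this example.
def probable_catalysts(amounts_mmol, threshold=0.05):
    """amounts_mmol: dict of species name -> molar amount in mmol."""
    largest = max(amounts_mmol.values())
    # Anything present at a small fraction of the main reactants is suspect.
    return [name for name, mmol in amounts_mmol.items()
            if mmol < threshold * largest]

reaction = {"A": 1.0, "B": 1.0, "C": 0.01}   # mmol, as in the example above
print(probable_catalysts(reaction))           # ['C']
```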

What proportion of organic molecules are purified by crystallization (good) vs chromatography (bad)?

PMR: This should be relatively easy. The workup phrases are fairly standard and crystallization is usually mentioned explicitly.
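A crude sketch of what “fairly standard workup phrases” buys you – a keyword check over the workup text; the word lists are illustrative, not exhaustive:

```python
# Crude illustration: classify the purification step from workup text by
# keyword matching. The word lists are illustrative, not exhaustive.
CRYSTALLIZATION = ("recrystalliz", "recrystallis", "crystalliz", "crystallis")
CHROMATOGRAPHY = ("chromatograph", "flash column", "silica gel")

def purification_method(workup_text):
    text = workup_text.lower()
    if any(word in text for word in CRYSTALLIZATION):
        return "crystallization"
    if any(word in text for word in CHROMATOGRAPHY):
        return "chromatography"
    return "unknown"

print(purification_method(
    "These solids were further purified by recrystallization from ethyl acetate/hexanes."
))  # crystallization
```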

 

There are more complex questions concerning atom economy that it would be awesome to look at, but they're tricky – e.g. what proportion of the atoms put into the reaction end up as part of the product vs things that are byproducts and are thrown away, such as salts, water, and gases that you essentially “lose”.

PMR – agreed. The problem here is that it's tricky to work out the full reaction, as we normally only have the reactants and a (single) product. We would have to manage stoichiometry, etc. But it's not impossible.
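For reference, atom economy is simply the molecular weight of the desired product divided by the summed molecular weights of the reactants in the balanced equation; a toy calculation (with placeholder molecular weights, not the patent example) looks like this:

```python
# Atom economy = molecular weight of the desired product divided by the sum
# of the molecular weights of all reactants in the balanced equation.
# The weights below are placeholders chosen for illustration only.
def atom_economy(product_mw, reactant_mws):
    return 100.0 * product_mw / sum(reactant_mws)

print(f"{atom_economy(261.3, [119.1, 251.1, 68.1]):.0f}% of the input mass ends up in the product")
```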

 

Here’s a typical reaction, selected at random (the subscripts come out as <sub>3</sub>, etc.). See if you can answer Mat’s questions – solvent? Catalyst? Crystallization?

<p id="p0067" num="0067">A 3-neck 300 mL round-bottomed flask equipped with a reflux condenser, magnetic stir bar and a nitrogen inlet was charged with 5 g (1 equivalent) of 4-hydroxybenzonitrile, absolute ethanol 150 mL, and 15.7 mL (1 equivalent) of sodium ethoxide. This mixture was stirred at 25 °C for 15 minutes. Ethyl 8-bromooctanoate (10.5 g, 1 equivalent) was then added dropwise over 10 minutes. The resulting mixture was heated to reflux (75 °C) for 72 hours.</p>

<p id="p0068" num="0068">The reaction mixture was cooled and the solids filtered off. The solvent was removed on a rotary evaporator. The crude residue was dissolved in methylene chloride (200 mL) and washed with saturated NaHCO<sub>3</sub> (2 x 75 mL), H<sub>2</sub>O (1 x 100 mL) and brine (1 x 100 mL). The crude material was then dissolved in ethanol (125 mL) and water (10 mL). LiOH (5 g) was added and the resulting mixture was heated to reflux (75 °C) for 1 hour then stirred at ambient temperature overnight. The solvent was evaporated and 75 mL of H<sub>2</sub>O was added. The aqueous solution was acidified to a pH of about 3 with concentrated HCl and the flask cooled to 4 °C. An off-white colored solid precipitated. This material was collected by vacuum filtration and dried on the high vacuum overnight to give the crude acid. These solids were further purified by recrystallization from Ethyl acetate/hexanes (95/5) and again with chloroform to give 4.5 g of the product, 8-(4-cyanophenoxy)octanoic acid (41 % yield).</p>


 

Posted in Uncategorized | Leave a comment

#solo10 GreenChainReaction: Update and continued request for help

Typed into Arcturus

We are making excellent progress. Some things go faster, some slower as always.

We now need a second round of volunteers. I’ll detail what we have done and what needs to be done this week. Most of the activity is completely open at http://okfnpad.org/solo10

Code:

Dan, Mark and Nava have forged ahead and shown that our document extraction framework can be run independently of OS and location (a rough sketch of the pipeline follows the list below). So far it:

  • Downloads patents from weekly indexes
  • Unzips them
  • Converts the images to chemistry
  • Restructures the main document
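As a very rough sketch of the shape of that pipeline – the URL, paths and helper functions below are placeholders for illustration, not the actual framework (which is Java):

```python
# Outline of the weekly-index pipeline; the URL, paths and helper names are
# placeholders for illustration, not the real Java framework.
import zipfile
from pathlib import Path
from urllib.request import urlretrieve

INDEX_URL = "http://example.org/patents/weekly-index.zip"   # placeholder

def run_week(workdir: Path):
    workdir.mkdir(parents=True, exist_ok=True)
    archive = workdir / "index.zip"
    urlretrieve(INDEX_URL, archive)                 # 1. download the weekly index
    with zipfile.ZipFile(archive) as zf:
        zf.extractall(workdir / "raw")              # 2. unzip the patents
    for tif in (workdir / "raw").rglob("*.tif"):
        interpret_image(tif)                        # 3. images -> chemistry (via OSRA)
    for xml in (workdir / "raw").rglob("*.xml"):
        restructure_document(xml)                   # 4. restructure the main document

def interpret_image(path):        # placeholder: would shell out to OSRA
    pass

def restructure_document(path):   # placeholder: would rebuild sections/paragraphs
    pass

run_week(Path("week-2010-33"))
```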

To do:

  • Annotate the experimental sections and link up the chemistry
  • Run chemicalTagger (POS+Chemistry tagger)
  • Collect and upload solvent data

Data:

At the moment we are concentrating on patent data. The data is messy but tractable. We shall very soon be able to distribute code and data. We’ll be looking for volunteers who can run this on their local machines and then upload it.

Currently we are looking at RDF for managing simple solvent and temperature data on reactions. Something like this (pseudocode; a concrete sketch follows the lists):

ExperimentURI/UUID

  • Has Temperature (hasUnits)
  • Has Solvent
  • Has Duration (hasUnits)
  • Has Amount (hasUnits)

SolventURI/UUID

  • Has formula
  • HasWikipediaEntry
  • hasPubchemCID
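To make the pseudocode concrete, here is a small sketch using Python and rdflib; the namespace, property names and example values are invented for illustration and are not a final vocabulary:

```python
# Sketch only: the namespace, property names and values are invented to show
# the shape of the triples, not the final vocabulary.
from uuid import uuid4
from rdflib import Graph, Literal, Namespace, RDF
from rdflib.namespace import XSD

GCR = Namespace("http://example.org/greenchainreaction/")   # placeholder namespace

g = Graph()
experiment = GCR[f"experiment/{uuid4()}"]
solvent = GCR["solvent/ethanol"]

g.add((experiment, RDF.type, GCR.Experiment))
g.add((experiment, GCR.hasSolvent, solvent))
g.add((experiment, GCR.hasTemperature, Literal(75.0, datatype=XSD.double)))  # hasUnits: celsius
g.add((experiment, GCR.hasDuration, Literal(72.0, datatype=XSD.double)))     # hasUnits: hours
g.add((solvent, GCR.hasFormula, Literal("C2H6O")))
g.add((solvent, GCR.hasPubchemCID, Literal(702)))

print(g.serialize(format="turtle"))
```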

     

Reaction specification:

Not much progress (blocked on PMR). We can probably analyse solvents without complete semantics for the reaction, but it would be nice to try. Thanks to Mat and Jean-Claude for their patience.

Making documents open:

Progress in the background between Heather Piwowar and PMR. Currently blocked on PMR.

Resources and help wanted

  • Comments on the above welcomed
  • Where can the data be reposited? We've had one offer. At present we'd like this to include an upload server for triples. Anyone with a triple server would be much appreciated.
  • Help with analysing the results. Mainly descriptive stats, we hope.
  • Help with running the patent downloads and conversions (at least 10 volunteers wanted)

 

More later…

Posted in Uncategorized | Leave a comment

#solo10 GreenChainReaction: Some chemical information is appalling quality, but does anyone care?

Typed into Arcturus

Earlier I asked what compound a patent image represented (EP_2050749A1/0026imgb0032.tif).


“Could someone please tell me what the InChI or SMILES or CML is for this compound?” This was a slightly tricky question, as you have to realise that the image is corrupted. (You might have to re-read the patent to know this.)

Egon Willighagen says:

August 15, 2010 at 12:36 pm

Peter, I guess this is the compound:

http://pubchem.ncbi.nlm.nih.gov/summary/summary.cgi?cid=24888923

When seeing the image for the first time, I had the feeling a bond was too thin to show up in the rasterized image… this hit on PubChem might be a lead for further information.

Egon has solved it. There ought to be a vertical line (chemical bond) in the formula. It's missing. That's because TIFFs and GIFs and PDFs destroy and corrupt information. This is a classic. Here's a similar compound –


 

The point is that the quality of the chemical information in the patent is so poor that the image has been corrupted somewhere in the processing. This is symptomatic of chemistry, where there is little communal interest in tools that create quality, validated information. The EPO engaged with us some years ago to introduce Chemical Markup Language, but they couldn't convince the chemical companies, the chemical software companies, the secondary patent providers and the whole market sector to do it.

Egon is trying to get the chemoinformatics sector to do proper, Openly validatable science. I'm with him – and have been all along. But the mainstream community does not want to require details of what data were used, where they came from, how to re-run the calculations, or how to re-analyse the data. Chemoinformatics is in danger of being regarded as a pseudoscience. I'll blog more on this later – probably not until after the GCR.

Here's the PubChem entry. Note how this is presented in scalable vectors:

 

(I have rotated this to show the similarity to the original.)


Of course you can follow the links on PubChem to see what the commercial providers of patent info provide. If you haven't seen this sort of webpage, take time to have a look…

Category: [for same structure substances]

   DiscoveryGate ( 1 ) External ID: 24888923
   Thomson Pharma ( 1 ) External ID: 02644348

How much use is this to you?


 

Posted in Uncategorized | Leave a comment

#solo10 GreenChainReaction: update and What is that Chemical Compound?

Typed into Arcturus

The first pass of the automatic extraction of chemical information from patents is going well on a mechanical level.

 

  • One weekly index has 30-200 appropriate patents. Each has between 0 and 1500 images of chemical relevance
  • Each index therefore has ca 10,000 images, almost all of chemical compounds or general formulae or reactions.
  • We use OSRA (Open Source, NIH) to interpret the images. It takes about 1–30 seconds per image and the first index will complete in ca 24 hours. This means that we could do this task for the last 10 years in about 500 distributed machine-days (see the arithmetic sketch below). I'd like to do that before #solo10. (I could do it all at Cambridge, but I'd rather it were citizen science.)
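The rough arithmetic behind those figures (all the numbers are the approximate ones quoted in the list above):

```python
# Back-of-envelope estimate using the rough figures quoted above.
images_per_index = 10_000
seconds_per_image = 9            # somewhere in the 1-30 s range quoted
indexes = 10 * 50                # ten years of weekly indexes

hours_per_index = images_per_index * seconds_per_image / 3600
total_machine_days = indexes * hours_per_index / 24
print(f"~{hours_per_index:.0f} h per index, ~{total_machine_days:.0f} machine-days in total")
# With ~100 volunteers that is well under a week of wall-clock time.
```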

So far the “record” is a patent with 1500 images. Here’s one (EP_2050749A1/0026imgb0032.tif)

 

 


Could someone please tell me what the InChI or SMILES or CML is for this compound?

I am now working on the text-mining. More later today.


 

Posted in Uncategorized | 5 Comments

Supplementary Data must be published somewhere to validate the science

Dictated into Arcturus

There has been quite a lot of discussion over the last few days about the decision by the Journal of Neuroscience to stop posting supplemental data. Heather Piwowar has written a useful review of it (http://researchremix.wordpress.com/2010/08/13/supplementary-materials-is-a-stopgap-for-data-archiving/).

The issue seems to be that the journal can no longer make the effort to host the supplemental data on its website. That's neither surprising nor unreasonable.

However the critical responsibility of a journal, if it has any, is to make sure that the science has been properly conducted and is in principle (and hopefully in practice) reproducible. I believe it is irresponsible for a journal to publish science based on data which are not available to the reader. This is more important than any aspect of archiving or data re-use. The mainstream model of publication is that if the science is not valid it should not be published. (Models such as PLoS ONE legitimately take a different view, where debatable science can be published for open community view and review.) If the material does not warrant publication then there is little point in archiving it. If the material represents invalid science then there is little point in disseminating it.

The reactions to J. Neuroscience have been mixed. DrugMonkey exults in an engaging fashion, but it is difficult to see that DM is being responsible. It appears that DM wishes to be excused from the burden of providing the data that support DM's work. If this is true then I can give DM no support whatever.

There is some confusion and a suggestion that the supplemental material should be cut down and included in the material behind the publisher's paywall. This again would be wholly regrettable. If the data are needed to support the experiment then they should be published in full. Moreover I believe it to be retrogressive if there is an insistence that only a journal's subscribers can have access to the data on which the experiment was based.

So, if any data which is being published now would not be published in the future, then this is a seriously retrograde step.

Some other correspondents believe that the data should be in a repository, and I would agree with this so long as it was an Open repository and not closed (access, re-use, etc.). Some members of the institutional community believe that it should be in an institutional repository. Here for example is Dorothea Salo:

And this, Peter Murray-Rust, is partly why I believe institutions are not out of the data picture yet. The quickest, lowest-friction data-management service may well reside at one’s institution. It’s not to be found at a picky, red-tape-encumbered, heavily quality-controlled disciplinary data service on the ICPSR model, which is the model most disciplinary services I know of use. It’s certainly possible, even likely, that data will migrate through institutions to disciplinary services over time, and I have no problem with that whatever—but when the pressure is on to publish, I suspect researchers will come to a local laissez-faire service before they’ll put in the work to burnish up what they’ve got for the big dogs. (Can institutional data services disrupt the big-dog disciplinary data services? Interesting question. I don’t know. I think a lot depends on how loosely-coupled datasets can be. Loose coupling works better for some than others.)

Since it's addressed to me, I'll respond. I personally do not have an absolute "religious" divide between domain repositories (DSRs) and institutional repositories, but what I am passionate about is the validation of data before and at the time of publication. A "heavily quality-controlled disciplinary data service" exists because the community of scientists wishes to know that the data on which the science is based are valid. You cannot publish a crystal structure in a reputable journal without depositing the data in a formal manner which is susceptible to validation. You cannot publish a sequence without submitting it to a database which will review the scientific basis of the data. That's not "picky", it's an essential part of data-rich science. I continue to see papers published where the data are nonexistent, incomplete, or presented in such a way that it is impossible to validate the science.

The reality is that this is a time-consuming, tedious process. If you don't like it, don't do science. As an example, when I did my doctorate, I measured about 20,000 data points and wrote them into a book. I then typed up each data point, which appeared both in my thesis and in the full text of a paper journal. That was not unusual; everybody had to do it. The International Union of Crystallography developed this practice so that it has now become almost universal. In similar fashion the proteomics community is insisting that data are deposited in a DSR.

I have said before that this could, in principle, be done by the library community. But it isn't being done and I see no signs of it being done in the future. It is much harder to do in the library community because the data are dispersed over many individual repositories and there is no technology at the moment that creates an effective federation. It is possible to conceive of a world where data went straight into institutional repositories, and we are working on a JISC-funded project to investigate this. But it is essentially unknown at present. If IRs wish us to change the model then they are going to have to present a believable infrastructure very shortly.

As an example, I have appealed for a place where I can put my results from the Green Chain Reaction (#solo). This is a valid scientific endeavour (in the area of molecular informatics) and its data requirements are modest compared with many other scientific disciplines. There are perhaps 1000 institutional repositories worldwide and I’d be delighted if one of them stepped forward and offered to host the data. Host, not archive. I’m not asking for any domain expertise, simply ingesting and holding an HTML web page tree.

If IRs can help with the problem of supplementary data they will have to act very quickly.

And if all this leads to less, not more, data being published Openly that’s a disaster.

Posted in Uncategorized | 2 Comments

#solo10: Green Chain Reaction; where to store the data? DSR? IR? BioTorrent, OKF or ???

Dictated into Arcturus

The Green Chain Reaction will soon be generating a lot of high quality structured data. The question is how and where to store this. To give an idea of the scope let me illustrate this with the patent data.

The European Patent Office publishes about 100 patents each week in the categories that we are interested in. Our current software downloads the index, extracts all patents, and selects those which are classified as chemistry. Each patent contains anywhere between 5 and 500 files (the large number is because the chemical structures are represented as graphical images, usually TIFFs). So this means about 10,000 files each week, in a well structured hierarchy. The absolute size is not large, and is about 100 MB per index. We arrange the raw data and processed data in a directory structure for each index such that it can easily be exposed on the web. Every document will have a unique identifier, so that it is straightforward to transform this into URIs and URLs. This means that we will be able to create Linked Open Data in a straightforward manner.
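Because everything sits in a predictable directory layout, turning a file into a URI is essentially a one-line mapping; a sketch (the base URL is a placeholder, and the path mirrors the patent files mentioned in these posts):

```python
# Sketch: map a file in the structured directory tree onto a stable URI.
# The base URL is a placeholder; the path layout mirrors the example patent
# files mentioned elsewhere in these posts (e.g. EP_2050749A1/0026imgb0032.tif).
from pathlib import PurePosixPath

BASE_URL = "http://example.org/greenchainreaction/data"   # placeholder host

def file_to_uri(index_id: str, relative_path: str) -> str:
    return f"{BASE_URL}/{index_id}/{PurePosixPath(relative_path)}"

print(file_to_uri("2010-week33", "EP_2050749A1/0026imgb0032.tif"))
```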

So the whole project might deliver

10 years * 50 weeks * 10,000 files = 5,000,000 files (ca 50 Gigabytes)

This shouldn't terrify anybody, and in fact I routinely hold subsets of this on my laptop. It's simply a set of structured directories which can be held on a file system and exposed as web pages. There is no need for relational databases or other engines to deliver them. (Of course we shall build indexes as well, which may require engines such as triple stores.)

The question is where to store it? We’ve been having discussions recently as to whether data should be stored in domain-specific repositories (DSRs) or in institutional repositories (IRs) or in both or in neither. This is where we’d like your help.

I don't know how to store five million web pages in an institutional repository. It ought to be easy. (Jim Downing and I tried with 200,000 files in DSpace and it was a failure, because of the insistence on splash pages which destroy the natural structure.) It's critical to store them as web pages so that they are then indexed by the major search engines. We shall also index them chemically and by data. It's obviously a valid type of digital hyperobject to store in a repository, and ours must be similar to many other requests that scientists would be likely to make.

We could also store them in a domain repository. I don't know of any Open domain repository for chemical patents (there are many closed ones and a few that are free but not open). It's possible that we could create an equivalent service to the one we provide for Crystaleye (http://wwmm.ch.cam.ac.uk/crystaleye). However this does not address the problem of long-term archiving (although, assuming this experiment is successful, I don't think there will be any problem in finding people who wish to help).

Or we could store it through the Open Knowledge Foundation and its CKAN repository of metadata. CKAN is not normally used for storing data per se, so this would be a departure and the OKF would need to discuss it. It wouldn't be my first choice, but it's certainly better than not storing the results at all.

Or we could store it through something like BioTorrent (http://www.plosone.org/article/info:doi%2F10.1371%2Fjournal.pone.0010071 and http://www.biotorrents.net/ ). This is a new and exciting service which tackles the problem of sharing open data files in a community. One of its purposes is to solve the problem of distributing very large files, but it may also be suitable for distributing a very large number of small files – I don’t yet know but I’m sure I will get feedback from the community. If this is the best technical solution then I don’t think I would have much difficulty persuading them that chemical patents were a useful source of linked open data for the Bio community.

Or some other organizations that I haven’t thought of might offer to host the data. Examples could be the British Library, or UKPubMedCentral (UKPMC), or a bioinformatics institute, …

… or you. (I tried with 200,000 files in DSpace and it was a failure).

It would be a major step forward in Open data-driven science if we could find some common answers to this problem.

Posted in Uncategorized | 5 Comments

#solo10: Green Chain Reaction – much progress and continued request for help

Dictated into Arcturus

We've made great progress over the last week on the Green Chain Reaction. Our progress is all being recorded on the Etherpad provided by the Open Knowledge Foundation at http://okfnpad.org/solo10. Have a look! And feel free to add anything – it's very easy.

Each section here will contain an invitation to participate.

CODE

Dan and Mark have worked very hard in showing that the system works and in documenting what is necessary. I am regularly feeding in new bits of code and we are very close to having a system which will extract and analyse chemistry from published documents. The test bed is Acta Crystallographica E, which is open, with about 10,000 reactions. The main data will come from patents, and I have been modifying David Jessop's downloader and analyser so that they can be distributed.

We’re now looking for computer-savvy volunteers to see if the code can be widely distributed in its current form. Please volunteer by signing up on the Etherpad.

CONTENT

Mat Todd and Jean-Claude Bradley have already contributed material (via their links) and we will soon be analysing this in detail. Mat’s posted a number of places where we might get additional content and some of these will be straightforward as it is clear that the content is Open. However a number of offers, such as from Chemspider, are formally copyrighted and we will need explicit permission from the “owners” to make them open.

We'd like to hear from anyone who has chemical reaction content that we can extract and make completely open. We'd particularly like to hear from publishers who would like to take part in this high-profile activity to show the value of open data. And we'd also like to hear from organisations, such as government agencies, who by default make their data open.

ISITOPEN

Heather Piwowar and I are creating a series of letters to enquire of data providers whether their data is open. I hope to draft the principles today in a PantonPaper, and will blog this probably in the afternoon.

Contributions on describing and marking data as open will be particularly valuable. Feel free to join Heather and me in the Etherpad. Any pointers to existing protocols and manifestos on Open data will be particularly welcome.

OTHER HELP

We’ve had some other offers which are much appreciated. These include links to other resources, help with the Green concept, and a lot more. You may have ideas on what is possible in the next few weeks so we’d love to hear them.

Please make other suggestions and offers of help that we may not have thought of.

TIMELINE

I’m expecting that by the end of today I will have managed to modularise the parts of David’s code which will be used in this project. They will be described in the Etherpad, and reposited in several Bitbucket projects.

Also by the end of today I expect to have drafted the principles of data extraction from scientific documents on websites. Heather and I will work on this, and we expect to be able to put these principles to document providers such as publishers and get their early response.

I also hope to be able to spend some time on creating the semantic chemistry definitions that will result from the parses. I shall do this on the Etherpad and will welcome contributions.

 

Posted in Uncategorized | Leave a comment

#solo10: Green Chain Reaction is using an Etherpad to collect its thoughts

Typed into Arcturus

We’ve been bouncing emails around on the Green Chain Reaction coding part of the project and we’ve got to the stage where looking back through emails to see where bits of code got copied in, who was on the email header, etc. is a nightmare. Email is awful for collaboration. Awful.

Google Docs is better, but you have to invite people, so there's an entry barrier when you don't know who your collaborators are.

Etherpad is a wonderful collaborative tool for community projects. It takes a tiny bit of effort to set up (i.e. you have to have a machine, and you have to be allowed to set up a server on it). But don't let that put you off, read on… it's already done.

It’s completely open to anyone in the world. [Surely it will get spammed? Not much point, since it’s ephemeral and has no outward pointing links. Any vandalism is immediately apparent and can be reverted. ] So I’m setting one up for the Green Chain Reaction collaborators at #solo10.

And I can do this because the OKF makes Etherpads available for its projects and collaborators. After getting Rufus's OK (not needed technically, but polite) it takes 5 seconds to set one up:

http://okfnpad.org/solo10

Anyone – and that means YOU! – can edit the material. Two or more people can edit it at once! You can see the others' keystrokes appearing as you type! (If you both try to edit the same part of the document you can backtrack.) There's a timeslider that remembers everything typed in (and deleted). It's almost like a Memex.

So from now on most of my contributions to the project will be on the Etherpad. It's a great way of creating communal documents – READMEs, code snippets, expected outputs, failing tests, useful URLs, letters to publishers on IsItOpen, and so on.

And at the end of it you can have a polished document which could be mailed, pasted into a Wiki, turned into a PDF…

It's not Google Wave; it doesn't do multimedia; it's Flash-agnostic… It does honest-to-goodness text.

Try it…

 

 

Posted in Uncategorized | Leave a comment

#solo10: Publishers, is your data Openly re-usable – please help us

Dictated into Arcturus

There is an absolutely critical point for the Green Chain Reaction that has been raised by a reader of this blog.

Anon says:

August 12, 2010 at 9:45 am

Aren’t most supplementary information files freely available? There must be tons of synthetic procedures hiding in there.

From a JACS paper: “This [Supporting Information] material is available free of charge via the Internet at http://pubs.acs.org.”

As regards explicit permission…good luck finding any information at all about that.

The problem is copyright, followed by contract. By default copyright prevents you from doing anything with a document. And it's the law, so, not surprisingly, reputable institutions such as universities are absolutely clear that it must not be broken.

It has been argued that copyright law even forbids your saving a copy of any downloaded article on your hard disk. It’s almost certainly true that if you have thousands of PDF articles you’re breaking copyright in the eyes of some interpreters. And I should stress that this is an area where almost every question is unanswered. The only responsible default answer is “you can’t do that.”

This is an enormous impediment to data-driven science. Enormous. By default we cannot extract or use any data from the published literature unless there is explicit permission. It’s got to the stage where this problem is seriously holding back science.

You might argue that I'm being pedantic or splitting hairs. I'm afraid that's not true. Shelley Batts reproduced a single graph from a closed access article (http://wwmm.ch.cam.ac.uk/blogs/murrayrust/?p=338) and was sent a threatening legal letter from the publisher (Wiley). There was an explosion in the blogosphere, and Wiley dropped their threats against Shelley. However they have never explicitly given permission to anyone to post any closed material from any journal, and we must assume that they are capable of carrying out the same threats in the future, although they know how the blogosphere will react.

I myself have caused the University of Cambridge to be completely cut off from a publisher on two separate occasions (ACS and Royal Society Of Chemistry). In both cases what I did was completely legal but the publisher’s automatic systems assumed that I was stealing content and automatically closed down all access to anybody in the University of Cambridge for an indefinite period. I do not know on what basis this was done, but I assume that it was for a (wrongly) interpreted infringement of the specific contract that the University of Cambridge had with the publisher.

You should note that almost all universities sign contracts with publishers which are more restrictive than copyright. These contracts are often covered by secrecy agreements, and so the average scientist in the university probably has no access to the details of the contract that they may infringe. The contracts in many cases limit the amount of material a scientist may access and how frequently. Given that data-driven science requires access to large amounts of material on a frequent basis, it is almost certain that attempts to carry it out are likely to involve a breach of contract at particular institutions.

I appreciate that the current business model of publishing is that closed access publishers "own – or control – the content". I personally do not agree that this is a good thing for science (in fact I believe it is highly detrimental), but I do not intend to break the law (at present) and I do not intend to cause my university to break its contract. However this means that readers are regarded as potential content thieves, and the publishers put in place expensive content management and access systems whose purpose is partly to police the activities of readers. It is more important to most closed-access publishers to protect content than to disseminate science, and they err on that side.

The default position, therefore, is that one cannot automatically download and re-use content on the web without permission. It is very disappointing that none of the major publishers (other than of course the CC-BY Open Access publishers who are excused from all of these arguments and discussion) have changed their attitude or practice in making content available to scientists for the practice of modern science. I have on several occasions written to publishers asking whether I can use material and almost invariably I get no reply. It is difficult for me to see the closed-access publishers as part of an alliance which is trying to improve the way that data-driven science is done.

I should comment also that Openly reusable data must lead to higher quality science, in that it is easier to pick up errors. For example our Crystaleye system (or rather Nick Day's system, http://wwmm.ch.cam.ac.uk/crystaleye) not only aggregates all publicly visible crystallographic data on publishers' web sites but also validates it as it processes it. This validation is part of our recently funded #jiscxyz bid, where we are working to develop a means of validating scientific data before it appears in print. Just recently I have been pointed at two cases from very high profile closed-access publishers where the crystallography has been inappropriately used to support what is clearly invalid science. If the data had been openly available these errors would not have happened. There are also notable cases where the blogosphere has rightly criticized published science on the basis of invalid or incorrectly interpreted data.

Despite my campaign for greater openness in publication, I am not against a commercial publishing industry. I am in favour of a responsible publishing industry which makes efforts to innovate and support science. I am disappointed that publishers have not addressed the question of re-use of data, and I am saddened by the fact that they do not regard readers' emails and enquiries such as mine as worth replying to. It is not just me – it is now two years since Chemspider asked ACS about the rights to re-use ACS supplemental data and as far as I know ACS has not given a formal reply.

I am, however, an optimist and I believe the publishers will now approach this problem constructively and start to give clear information. Since data are not only valuable for citations but also necessary for the proper practice of science, I'm going to assume that the major players in chemistry will be keen to give definitive answers on this problem. If nothing else, it is actually in their best interests – being seen to be helpful and to increase their own citations is hardly a barrier to doing good business.

Therefore Heather and I will be preparing a set of letters to send to all the publishers using the IsItOpen tool. This allows the precise request and the precise response to be publicly visible and to act as a useful definitive record. It therefore saves both the readers and the publishers from having to continually reiterate their position. We appreciate that it may in some cases not be possible to give complete answers, but we would certainly expect the courtesy of a timely reply.

We hope that it will be possible to collect all the replies from the major publishers before the Science Online meeting and hope that this will be a valuable contribution to the delegates and those who are following the procedures from a distance.

More later on what exactly free, open, gratis and libre mean.

Posted in Uncategorized | 2 Comments