petermr's blog

A Scientist and the Web


Archive for August, 2010

#solo10 An introduction to textmining and data extraction

Thursday, August 19th, 2010

Scraped/typed into Arcturus

But now we’ll show what we can get out of patents. Even if you aren’t a chemist you should be able to follow this. It’ll show you what text-mining is about and how we are looking for greenness.

Here’s a typical report in PDF (I have cut and pasted it so that’s why it looks tacky – but that’s what you get with PDF):

A) 10-Octadecyl-_1,4,7,10-tetraazacyclododecane-_1,4,7-triacetic acid

_[0065] A mixture of 1,4,7,10-_tetraazacyclododecane-_1,4,7-_triacetic acid tris_(1,1-_dimethylethyl) ester (37.5 g; 72.8

mmol) and 1-_bromooctadecane (24.5 g; 73.5 mmol) in CH3CN (500 mL) was heated to reflux. After 2 h the reaction

mixture was evaporated and the residue was dissolved in CHCl3 and a portion of CF3COOH was added. After 16 h at

room temperature the reaction mixture was evaporated and the oily residue dissolved in CF3COOH. After 3 days at

room temperature, the solution was evaporated, the residue taken up in CHCl3 and the solution evaporated. This operation

was repeated three times. The oily residue was purified by flash chromatography as follows:_


(a) CH2Cl2/_MeOH = 3/1 (v/v) 3 litres

(b) CH2Cl2/_MeOH/NH4OH 25% (w/w) = 12/4/1 (v/v/v) 12 litres

(c) CH2Cl2/_MeOH/NH4OH 25% (w/w) = 6/3/1 (v/v/v) 2 litres

_[0066] The product was dissolved in H2O and acidified with 6N HCl; then, the solution was loaded onto an AmberliteK

XAD-_8 resin column and eluted with a CH3CN/H2O gradient. The product started eluting with 20% CH3CN.

Because PDF is non-semantic we have lost some of the formatting, but it doesn’t matter. The XML is MUCH more useful. That’s why you should author XML as well as PDF (or even instead). Don’t switch off just because it’s XML. The key phrases we shall use are highlighted…

<p id="p0065" num="0065">A mixture of 1,4,7,10-tetraazacyclododecane-1,4,7-triacetic acid tris(1,1-dimethylethyl) ester (37.5 g; 72.8 mmol) and 1-bromooctadecane (24.5 g; 73.5 mmol) in CH<sub>3</sub>CN (500 mL) was heated to reflux. After 2 h the reaction mixture was evaporated and the residue was dissolved in CHCl<sub>3</sub> and a portion of CF<sub>3</sub>COOH was added. After 16 h at room temperature the reaction mixture was evaporated and the oily residue dissolved in CF<sub>3</sub>COOH. After 3 days at room temperature, the solution was evaporated, the residue taken up in CHCl<sub>3</sub> and the solution evaporated. This operation was repeated three times. The oily residue was purified by flash chromatography as follows:<br/>


<ul id="ul0004" list-style="none" compact="compact">

<li>(a) CH<sub>2</sub>Cl<sub>2</sub> / MeOH = 3/1 (v/v) 3 litres</li>

<li>(b) CH<sub>2</sub>Cl<sub>2</sub> / MeOH / NH<sub>4</sub>OH 25% (w/w) = 12/4/1 (v/v/v) 12 litres</li>

<li>(c) CH<sub>2</sub>Cl<sub>2</sub> / MeOH / NH<sub>4</sub>OH 25% (w/w) = 6/3/1 (v/v/v) 2 litres</li>

</ul>


PMR: notice how the subscripts and the list have been properly captured. Anyway, it’s the text that matters. The key phrases for the Green Chain Reaction are in bold.
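To see why the XML is so much more useful: even Python’s standard library can recover clean text (with the subscripts flattened back to CH3CN, etc.) from a well-formed paragraph. A toy example, using a cut-down version of the paragraph above with straight quotes:

```python
import xml.etree.ElementTree as ET

# A cut-down, well-formed version of the patent paragraph above.
xml = ('<p id="p0065" num="0065">A mixture was dissolved in '
       'CHCl<sub>3</sub> and CF<sub>3</sub>COOH was added.</p>')

p = ET.fromstring(xml)
# itertext() walks the mixed content, flattening the <sub> markup
text = "".join(p.itertext())
print(text)  # A mixture was dissolved in CHCl3 and CF3COOH was added.
```

Try doing that reliably with cut-and-pasted PDF text.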

dissolved in CF<sub>3</sub>COOH is a classic linguistic template. This tells us that CF3COOH is a solvent! (I don’t know how green it is or isn’t.)

in CH<sub>3</sub>CN (500 mL) is also a classic template. The (number + mL) tells us it’s a liquid (if it were a solid the units would be grams (g) or milligrams (mg)). We also know that CH3CN is a liquid by looking it up: EU classification: Flammable, harmful. R-phrases: R11, R20/21/22, R36. S-phrases: (S1/2), S16, S36/37.

dissolved in CHCl<sub>3</sub>. … says:

The US National Toxicology Program’s eleventh report on carcinogens[20] implicates it as reasonably anticipated to be a human carcinogen, a designation equivalent to International Agency for Research on Cancer class 2A. It has been most readily associated with hepatocellular carcinoma.[21][22] Caution is mandated during its handling in order to minimize unnecessary exposure; safer alternatives, such as dichloromethane, have resulted in a substantial reduction of its use as a solvent.

So this is where we start to see the point of the GreenChainReaction… Let’s see what the ratio of use of chloroform to dichloromethane is over time.

So if we can extract the solvents out of every patent reaction we can get an indication of the greenness. The next post will show how we do this.
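The two templates above (“dissolved in X” and “in X (n mL)”) can be sketched as naive regexes. This is purely illustrative – the real extraction (chemicalTagger) uses proper natural-language processing, not regexes:

```python
import re

# Toy patterns for the two solvent templates discussed above.
DISSOLVED_IN = re.compile(r"dissolved in (\S+)")
IN_VOLUME = re.compile(r"in (\S+) \((\d+(?:\.\d+)?) mL\)")

text = ("A mixture of ... in CH3CN (500 mL) was heated to reflux. "
        "... the residue was dissolved in CHCl3 and a portion of "
        "CF3COOH was added.")

solvents = set()
for m in DISSOLVED_IN.finditer(text):
    solvents.add(m.group(1))
for m in IN_VOLUME.finditer(text):
    solvents.add(m.group(1))  # the (number + mL) marks a liquid

print(sorted(solvents))  # ['CH3CN', 'CHCl3']
```

Even this crude sketch pulls CH3CN and CHCl3 out of the paragraph; the hard part is everything the template doesn’t cover.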








#solo10 GreenChainReaction: Anatomy of a Patent

Wednesday, August 18th, 2010

Scraped/typed into Arcturus

As we are making progress I thought we’d let you have a look at what we are doing. Because we believe in Openness we’re wearing our heart in the Open and telling you as it is created. This is OpenNotebook chemical informatics. Let’s count:

  • The raw data are Open (patents)
  • The code is Open (bitbucket)
  • We’ve invited anyone who wants to be involved to join – and there’s good membership from the Blue Obelisk community and others
  • We sync our contributions on Etherpad which is immediate and Openly visible

We are working on patents. The details can be found on the Etherpad ( or ). A European Patent (EP) consists of:

An index file, with a name like:


This is a fully semantic, consistent name (EPO+date). You start to get a warm feeling: the data are well structured. However the index file itself is a nasty shock. It starts:




<document-id lang="en">




<original-kind />

<correction-code />

<sequence />

<original-sequence />




<application-reference appl-type="patent">








<text>A01D003464 A01B007300 A01B0059048</text>



Well, we are used to non-semantic identifiers and they’re a good thing. Except that these identifiers have implicit semantics: you have to pull the strings apart to discover whether the document is a chemical one or not. It’s hideous – and I’m publicly grateful to David Jessop, who has written the deconstruction code. David has also written the code that downloads the Zips – anything but trivial, as it involves repeated HTTP polling of the site to determine where the zipped patents are. I suspect we are losing some downloads because of the arcane nature of the index. But at least it can’t get much worse…
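To give a flavour of the deconstruction: the packed strings in the <text> element look like IPC-style classification codes (section letter, two-digit class, subclass letter, group digits). The layout in this sketch is my assumption, not the EPO’s documented format – David’s real code is the authority:

```python
def parse_ipc(code: str) -> dict:
    """Split a packed IPC-style code, e.g. 'A01D003464', into parts.
    The exact packing here is an assumption about the index format."""
    return {
        "section": code[0],   # A-H; section C = chemistry
        "class": code[1:3],   # two digits
        "subclass": code[3],  # one letter
        "rest": code[4:],     # group/subgroup digits
    }

def looks_chemical(codes: str) -> bool:
    # Treat a patent as 'chemical' if any code is in IPC section C
    return any(c.startswith("C") for c in codes.split())

print(looks_chemical("A01D003464 A01B007300 A01B0059048"))  # False
print(looks_chemical("C07D021100 A01B007300"))              # True
```

The example codes in the first call are the ones from the index snippet above (agricultural, not chemical).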


Oh yes it can. This is the filename of a downloaded zip. It’s got a SPACE in it. But not your honest-to-goodness char(32) or char(8): a beautiful char(160) hamburger space, a gold-plated non-breaking nbsp space. On we go…
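If you hit this yourself, normalizing the char(160) before doing any filename lookups is a one-liner (a defensive sketch, not part of our pipeline):

```python
def normalize_name(name: str) -> str:
    # Replace the non-breaking space (U+00A0, char 160) that turns
    # up in the zip filenames with an ordinary space (char 32).
    return name.replace("\u00a0", " ")

# Simulate a filename containing the hamburger space:
bad = "EP 2050749A1.zip".replace(" ", "\u00a0")
print(normalize_name(bad))  # EP 2050749A1.zip
```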


  • A Zip file which contains an unpredictable (or at least variable) number of goodies, with names and labels you generally have to guess at. We haven’t worked out the whole cryptanalysis but there are normally:
  • DOCUMENT.PDF. PDFs are normally human-readable but not machine-readable. These PDFs stretch the definition of human-readable: there are misspellings and illiteracies, and the images are a delight for those who like hamburgers rather than cows. Reassembling the cow is not for the faint-hearted. Here’s a snippet:


  • Luckily (or since we pay taxes for it, in addition) there is:
  • DOC00001.xml or something like EP00401685NWA1.xml. There are clearly at least two independent nomenclature systems for the files. These are XML files the EPO has created from the PDF. It’s done by humans – I have visited the office where they do it (it’s probably outsourced by now). A huge barn where humans are given the OCR of the patent and have to correct the text line by line (and add the line number).
  • And there are between 0 and 2000 images. These are not only non-semantic but a veritable chamber of horrors. Here are two. The first is typical; its quality is awful:


How many minus signs (as opposed to single bonds) are there in that picture? (HINT: There are some).

You think that’s bad? Here’s a chemical formula:

Now it wasn’t that bad when the author drew it. It has probably been deliberately obfuscated so that the patent is harder to understand. I THINK the characters are taken only from N, C, H, R, (, ), 1, 2, 3, 4, 5, m and n. I don’t think anyone on the planet can be absolutely sure what they are.

But then there is the one I posted some days ago and Egon cracked the puzzle. The quality was SO BAD that chemical bonds are simply missing.

This is the level of quality that is endemic in chemical informatics. (and patents).

However the XML is a lifeline, and I’ll show you where we have got to in a following post.

Ah, well

#solo10 GreenChainReaction: we’ll have something exciting to show at the meeting

Wednesday, August 18th, 2010

Scraped/typed into Arcturus

Typical hard slog day, gearing up to do text-mining on the chemical reactions extracted from the patent. Slower than I had thought but steady progress. The routines are all written – it’s the glueware that is the problem.

We still need YOU to volunteer.

As an incentive, we’ve got some stunning technology for displaying the results. Absolutely stunning. It’s based on thinking in a modern frame of mind with Open Data, links, etc. rather than walled gardens, free-but-not-open, etc. RDF, not proprietary formats. XML, not PDF, and so on. I’m not going to tell you what it is as we shall announce it at the meeting!

Heather and I have put together a letter to send to science publishers asking about the Openness of their data. Since all publishers seem to agree that by default data are open and not copyrightable, we expect general agreement. We’ll probably send it out later today or tomorrow. Whether we can use the publisher information I don’t know, but we have enough in the patents.

Oh, we still need volunteers.


#solo10 GreenChainReaction: We need more volunteers – and yes, it will be fun!

Tuesday, August 17th, 2010

Scraped/typed into Arcturus

Continued incremental progress…

I know it’s the summer vacation but we need your help
even more. Now is your chance to

  • Download documents robotically
  • Run semantifiers over them
  • Do natural language processing
  • Extract data from raw text
  • Create RDF and search it

These are real, transferable information skills that will be highly valued. Volunteers will be guided through this by experts (well, we’ve been going a week, so that makes us experts!).

All you have to do is volunteer. It will be fun. After all it is the holidays.

So where are we at?

  • The patent downloader has been compiled, distributed and works. Start trying it out now
  • We’ve downloaded and parsed patents back to 2000 so if they all work we’ll have gazillions of experiments
  • We’re finalising the chemistry extraction over the next few hours.

Lezan Hawizy, who wrote chemicalTagger, is now back in action. She and David Jessop are presenting their work (code, results) at the American Chemical Society meeting in Boston. Be there!


What’s more to do?

  • Build an aggregator for the green data. Probably solvents at first.
  • Attach greenness to each solvent
  • Develop data presentation. Sam Adams (who works with us and won the Dev8D memento challenge and was runner-up in OR10’s DevSci) has shown me an idea that looks like being sensational. But you will have to come to the session at #solo10 to see its premiere! (Unless you are actually helping with this project, which of course you now will be.)
  • Legitimise other sources of data. Heather Piwowar (@researchremix) and I are setting out our principles for accessing and producing Open Data – we’ll be asking for responses through IsItOpen.


#solo10 GreenChainReaction: A greenness calculator

Monday, August 16th, 2010

Scraped/typed into Arcturus

Jean-Claude Bradley has made a useful contribution to the Green Chain Reaction ( )… Even if you aren’t a chemist, you should be able to follow the logic

In the spirit of contributing to Peter Murray-Rust’s initiative to collect Green Chemistry information, Andrew Lang and I have added a green solvent metric for 28 of the 72 solvents we include in our Solvent Selector service. The scale represents the combined contributions for Safety, Health and Environment (SHE) as calculated by ETH Zurich.

For example consider the following Ugi reaction solvent selection. Using the default thresholds, 6 solvents are proposed and 5 have SHE values. Assuming there are no additional selection factors, a chemist might start with ethyl acetate with a SHE value of 2.9 rather than acetonitrile with a value of 4.6.

Individual values of Safety, Health and Environment for each solvent are available from the ETH tool. We are just including the sum of the three out of convenience.


Note that the license for using the data from this tool requires citing this reference:


Koller, G., U. Fischer, and K. Hungerbühler, (2000). Assessing Safety, Health and Environmental Impact during Early Process Development. Industrial & Engineering Chemistry Research 39: 960-972.

The method therefore appears to combine three measures into a single metric.

The basis of the method is not clear without reading the original paper (which I haven’t done). The tool itself seems to be an Excel macro, and the reference explaining it ( ) didn’t resolve for me (YMMV). It’s not clear whether it’s an Open Source tool or not. The licence and docs require that you quote the paper, and this is perfectly acceptable – you can reasonably require people to acknowledge your work under any CC-*-BY licence or equivalent.

Anyway it’s a good idea. I imagine that once calculated it should be possible to re-use the data values, so that we can label the SHE (safety, health, environment) for every reaction…

… now I just have to get the next bit of code to do this…

#solo10 GreenChainReaction: Can you spot how green the reaction is?

Monday, August 16th, 2010

Scraped/typed into Arcturus

Mat Todd has made some great suggestions about what we can measure in the Green Chain Reaction. Here are his comments from the Etherpad ( ) and my additions. Please comment and add extra ideas. I’ll show some phrases below and let YOU test your ability. YOU DON’T NEED TO KNOW MUCH CHEMISTRY.


Has there been a shift away from chlorinated solvents in recent years?

PMR Should be easy to find the solvent and classify it/them. The linguistic and contextual pointers to solvents (“3 g of X in Y” generally identifies Y as a solvent) have been identified by Lezan Hawizy.

 Are there more catalytic reactions being carried out now vs 10 years ago?

PMR Catalysts can be identified by

What proportion of organic molecules are purified by crystallization (good) vs chromatography (bad)?

PMR This should be relatively easy. The workup phrases are fairly standard and crystallization is usually mentioned explicitly.


There are more complex questions concerning atom economy that it would be awesome to look at, but they’re tricky – e.g. what proportion of the atoms put into the reaction end up as part of the product, vs byproducts that are thrown away (salts, water, gases that you essentially “lose”)?

PMR – agreed. The problem here is that it’s tricky to work out the full reaction, as we normally only have reactants and the product (singular). We have to manage stoichiometry, etc. But it’s not impossible.
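For reference, the standard atom-economy metric is 100 × MW(product) / Σ MW(reactants). A toy calculation – using a deliberately simple textbook reaction rather than a patent one, since patent text rarely gives you a balanced equation:

```python
# Atom economy: 100 * MW(product) / sum of MW(reactants).
ATOMIC_WEIGHT = {"C": 12.011, "H": 1.008, "O": 15.999,
                 "Na": 22.990, "Br": 79.904}

def mol_weight(formula: dict) -> float:
    # formula is a dict of element -> atom count, e.g. {"C": 1, "H": 4}
    return sum(ATOMIC_WEIGHT[el] * n for el, n in formula.items())

def atom_economy(reactants, product) -> float:
    return 100.0 * mol_weight(product) / sum(map(mol_weight, reactants))

# Toy example: CH3Br + NaOH -> CH3OH + NaBr (NaBr is "lost")
ch3br = {"C": 1, "H": 3, "Br": 1}
naoh = {"Na": 1, "O": 1, "H": 1}
ch3oh = {"C": 1, "H": 4, "O": 1}
print(round(atom_economy([ch3br, naoh], ch3oh), 1))  # 23.7
```

The arithmetic is trivial; as noted above, the hard part is recovering a balanced reaction (all reactants, all products, stoichiometry) from free text.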


Here’s a typical reaction, selected at random (the subscripts come out as <sub>3</sub>, etc.). See if you can answer Mat’s questions – solvent? Catalyst? Crystallization?

<p id="p0067" num="0067">A 3-neck 300 mL round-bottomed flask equipped with a reflux condenser, magnetic stir bar and a nitrogen inlet was charged with 5 g (1 equivalent) of 4-hydroxybenzonitrile, absolute ethanol 150 mL, and 15.7 mL (1 equivalent) of sodium ethoxide. This mixture was stirred at 25 °C for 15 minutes. Ethyl 8-bromooctanoate (10.5 g, 1 equivalent) was then added dropwise over 10 minutes. The resulting mixture was heated to reflux (75 °C) for 72 hours.</p>

<p id="p0068" num="0068">The reaction mixture was cooled and the solids filtered off. The solvent was removed on a rotary evaporator. The crude residue was dissolved in methylene chloride (200 mL) and washed with saturated NaHCO<sub>3</sub> (2 x 75 mL), H<sub>2</sub>O (1 x 100 mL) and brine (1 x 100 mL). The crude material was then dissolved in ethanol (125 mL) and water (10 mL). LiOH (5 g) was added and the resulting mixture was heated to reflux (75 °C) for 1 hour then stirred at ambient temperature overnight. The solvent was evaporated and 75 mL of H<sub>2</sub>O was added. The aqueous solution was acidified to a pH of about 3 with concentrated HCl and the flask cooled to 4 °C. An off-white colored solid precipitated. This material was collected by vacuum filtration and dried on the high vacuum overnight to give the crude acid. These solids were further purified by recrystallization from Ethyl acetate/hexanes (95/5) and again with chloroform to give 4.5 g of the product, 8-(4-cyanophenoxy)octanoic acid (41 % yield).</p>



#solo10 GreenChainReaction: Update and continued request for help

Monday, August 16th, 2010

Typed into Arcturus

We are making excellent progress. Some things go faster, some slower as always.

We now need a second round of volunteers. I’ll detail what we have done and what needs to be done this week. Most of the activity is completely open at


Dan, Mark and Nava have forged ahead and shown that our document extraction framework can be run independently of OS and location. So far it:

  • Downloads patents from weekly indexes
  • Unzips them
  • Converts the images to chemistry
  • Restructures the main document

Still to be done:
  • Annotate the experimental sections and link up the chemistry
  • Run chemicalTagger (POS+Chemistry tagger)
  • Collect and upload solvent data


At the moment we are concentrating on patent data. The data is messy but tractable. We shall very soon be able to distribute code and data. We’ll be looking for volunteers who can run this on their local machines and then upload it.

Currently we are looking at RDF for managing simple solvent and temperature data on reactions. Something like (pseudocode):


  • Has Temperature (hasUnits)
  • Has Solvent
  • Has Duration (hasUnits)
  • Has Amount (hasUnits)

And for each solvent/compound:
  • Has formula
  • HasWikipediaEntry
  • hasPubchemCID


Reaction specification:

Not much progress (blocked on PMR). We can probably analyse solvents without complete semantics for the reaction, but it would be nice to try. Thanks to Mat and Jean-Claude for their patience.

Making documents open:

Progress in the background between Heather Piwowar and PMR. Currently blocked on PMR.

Resources and help wanted

  • Comments on the above welcomed
  • Where can the data be reposited? We’ve had one offer. At present we’d like it to have an upload server for triples – anyone with a triple server would be much appreciated
  • Help with analysing the results. Mainly descriptive stats, we hope.
  • Help with running the patent downloads and conversions (at least 10 volunteers wanted)


More later…

#solo10 GreenChainReaction: Some chemical information is of appalling quality, but does anyone care?

Sunday, August 15th, 2010

Typed into Arcturus

Earlier I asked what compound a patent image represented (EP_2050749A1/0026imgb0032.tif):

“Could someone please tell me what the InChI or SMILES or CML is for this compound?” This was a slightly tricky question, as you have to realise that the image is corrupted. (You might have to re-read the patent to know this.)

Egon Willighagen says:

August 15, 2010 at 12:36 pm

Peter, I guess this is the compound:

When seeing the image for the first time, I had the feeling a bond was too thin to show up in the rasterized image… this hit on PubChem might be a lead for further information.

Egon has solved it. There ought to be a vertical line (a chemical bond) in the formula. It’s missing. That’s because TIFFs and GIFs and PDFs destroy and corrupt information. This is a classic. Here’s a similar compound –


The point is that the quality of the chemical information in the patent is so poor that the image has got corrupted somewhere in the processing. This is symptomatic of chemistry, where there is little communal interest in tools that create quality, validated information. The EPO engaged with us some years ago to introduce Chemical Markup Language but they couldn’t convince the chemical companies, the chemical software companies, the secondary patent providers and the whole market sector to do it.

Egon is trying to get the chemoinformatics sector to do proper, Openly validatable science. I’m with him – and have been all along. But the mainstream community does not want to require details of what data were used, where they came from, how to re-run the calculations, or how to re-analyse the data. Chemoinformatics is in danger of being regarded as a pseudoscience. I’ll blog more on this later – probably not until after the GCR.

Here’s the PubChem entry. Note how this is presented in scalable vectors:


(I have rotated this to show the similarity to the original.)

Of course you can follow the links on PubChem to see what the commercial providers of patent info provide. If you haven’t seen this sort of webpage, take time to have a look…

Category: [for same structure substances]

   DiscoveryGate ( 1 ) External ID: 24888923
   Thomson Pharma ( 1 ) External ID: 02644348

How much use is this to you?


#solo10 GreenChainReaction: update and What is that Chemical Compound?

Sunday, August 15th, 2010

Typed into Arcturus

The first pass of the automatic extraction of chemical information from patents is going well on a mechanical level.


  • One weekly index has 30-200 appropriate patents. Each has between 0 and 1500 images of chemical relevance
  • Each index therefore has ca 10,000 images, almost all of chemical compounds or general formulae or reactions.
  • We use OSRA (Open Source, NIH) to interpret the images. It takes about 1-30 secs each and the first index will complete in ca 24 hours. This means that we could do this task for the last 10 years in 500 distributed days. I’d like to do that before #solo10. (I could do it all at Cambridge, but I’d rather it were citizen-science.)
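The back-of-envelope arithmetic behind those estimates, taking ~10 s as an average time per image (an assumption within the stated 1-30 s range):

```python
# Rough throughput estimate for the OSRA image-interpretation step.
images_per_index = 10_000
secs_per_image = 10  # assumed average within the 1-30 s range

hours_per_index = images_per_index * secs_per_image / 3600
print(round(hours_per_index))  # ~28 h, i.e. "ca 24 hours" per index

# One weekly index per week over 10 years:
indexes_in_10_years = 52 * 10
print(indexes_in_10_years)  # ~520 machine-days if each index
                            # takes about a day: "500 distributed days"
```

Which is exactly why distributing the work over volunteers’ machines matters.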

So far the “record” is a patent with 1500 images. Here’s one (EP_2050749A1/0026imgb0032.tif)



Could someone please tell me what the InChI or SMILES or CML is for this compound?

I am now working on the text-mining. More later today.


Supplementary Data must be published somewhere to validate the science

Sunday, August 15th, 2010

Dictated into Arcturus

There has been quite a lot of discussion over the last few days about the decision by the Journal of Neuroscience to stop posting supplemental data. This has been usefully reviewed by Heather Piwowar ( ).

The issue seems to be that the journal can no longer make the effort to host the supplemental data on its website. That’s neither surprising nor unreasonable.

However the critical responsibility of a journal, if it has any, is to make sure that the science has been properly conducted and is in principle (and hopefully in practice) reproducible. I believe it is irresponsible for a journal to publish science based on data which are not available to the reader. This is more important than any aspect of archiving or data re-use. The mainstream model of publication is that if the science is not valid it should not be published. (Models such as PLoS ONE legitimately take a different view, where debatable science can be published for open community view and review.) If the material does not warrant publication then there is little point in archiving it. If the material represents invalid science then there is little point in disseminating it.

The reactions to J Neuroscience have been mixed. Drug Monkey exults in an engaging fashion, but it is difficult to see DM’s position as responsible. It appears that DM wishes to be excused from the burden of providing the data that supports DM’s work. If this is true then I can give DM no support whatever.

There is some confusion, and a suggestion that the supplemental material should be cut down and included in the material behind the publisher’s paywall. This again would be wholly regrettable. If the data are needed to support the experiment then they should be published in full. Moreover I believe it to be retrogressive to insist that only subscribers can have access to the data on which the experiment was based.

So, if any data which is being published now would not be published in the future, then this is a seriously retrograde step.

Some other correspondents believe that the data should be in a repository, and I would agree with this so long as it was an Open repository and not closed (access, re-use, etc.). Some members of the institutional community believe that it should be in an institutional repository. Here for example is Dorothea Salo:

And this, Peter Murray-Rust, is partly why I believe institutions are not out of the data picture yet. The quickest, lowest-friction data-management service may well reside at one’s institution. It’s not to be found at a picky, red-tape-encumbered, heavily quality-controlled disciplinary data service on the ICPSR model, which is the model most disciplinary services I know of use. It’s certainly possible, even likely, that data will migrate through institutions to disciplinary services over time, and I have no problem with that whatever—but when the pressure is on to publish, I suspect researchers will come to a local laissez-faire service before they’ll put in the work to burnish up what they’ve got for the big dogs. (Can institutional data services disrupt the big-dog disciplinary data services? Interesting question. I don’t know. I think a lot depends on how loosely-coupled datasets can be. Loose coupling works better for some than others.)

Since it’s addressed to me, I’ll respond. I personally do not have an absolute “religious” divide between domain repositories (DSR) and institutional repositories, but what I am passionate about is the validation of data before and at the time of publication. A “heavily quality-controlled disciplinary data service” exists because the community of scientists wishes to know that the data on which the science is based are valid. You cannot publish a crystal structure in a reputable journal without depositing the data in a formal manner which is susceptible to validation. You cannot publish a sequence without submitting it to a database which will review the scientific basis of the data. That’s not “picky”, it’s an essential part of data-rich science. I continue to see papers published where the data are either nonexistent or incomplete, or presented in such a way that it is impossible to validate the science.

The reality is that this is a time-consuming, tedious process. If you don’t like it, don’t do science. As an example, when I did my doctorate I measured about 20,000 data points and wrote them into a book. I then typed up each data point where it occurred, both in my thesis and in the full text of a paper journal. That was not unusual; everybody had to do it. The International Union of Crystallography developed this practice so that it has now become almost universal. In similar fashion the proteomics community is insisting that data are deposited in a DSR.

I have said before that this could, in principle, be done by the library community. But it isn’t being done and I see no signs of it being done in the future. It is much harder to do in the library community because the data are dispersed over many individual repositories and there is no technology at the moment that creates an effective federation. It is possible to conceive of a world where data went straight into institutional repositories, and we are working on a JISC-funded project to investigate this. But it is essentially unknown at present. If IRs wish us to change the model then they are going to have to present a believable infrastructure very shortly.

As an example, I have appealed for a place where I can put my results from the Green Chain Reaction (#solo). This is a valid scientific endeavour (in the area of molecular informatics) and its data requirements are modest compared with many other scientific disciplines. There are perhaps 1000 institutional repositories worldwide and I’d be delighted if one of them stepped forward and offered to host the data. Host, not archive. I’m not asking for any domain expertise, simply ingesting and holding an HTML web page tree.

If IRs can help with the problem of supplementary data they will have to act very quickly.

And if all this leads to less, not more, data being published Openly that’s a disaster.