#solo10 GreenChainReaction: an Open Index of patents

Scraped/typed into Arcturus

We are on the last lap of preparation for the Green Chain Reaction. We have now built a server to receive the results – the address is in the Etherpad (http://okfnpad.org/openPatents )

We are uploading the weekly indexes of all European Patents to this site. Each index contains several thousand patents and is about 1 MByte. There will be about 1500 such weekly indexes.

How does it work?

  • Each volunteer downloads the patentAnalysis code and verifies they can run it
  • They then take a weekly index and run the code against it. This code
    • Selects the chemical patents (by EPO code) – 10-100 per week
    • Extracts the text for the experimental sections
    • Parses the solvents from them
    • Aggregates the result for each patent (dissolveTotal.html)
    • Uploads this to the GreenChain Server (code being written)
  • Repeat. It takes about 30-60 minutes per weekly index. You could get through 10 a day just watching the Test Match (or the rain).
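To make that loop concrete, here is a minimal sketch of what one run over a weekly index amounts to. The toy index, the field names and the "IPC section C" test are illustrative assumptions only – they are not the real structure of the patentAnalysis code.

# Illustrative sketch of one volunteer run over a single weekly index.
# The data layout and the classification test are assumptions, not the real code.

weekly_index = [  # stands in for one ~1 MB weekly index of several thousand patents
    {"id": "EP000000002141978A1", "ipc": "A01D003464",
     "text": "A mechanical linkage was assembled."},
    {"id": "EP000000002141999A1", "ipc": "C07D048700",
     "text": "The residue was dissolved in CHCl3; the mixture in CH3CN (500 mL) was heated."},
]

KNOWN_SOLVENTS = {"CHCl3", "CH3CN", "CH2Cl2", "MeOH"}   # tiny stand-in solvent list

def is_chemical(patent):
    # the real code filters on the EPO/IPC classification codes; section C is chemistry
    return patent["ipc"].startswith("C")

def parse_solvents(text):
    # the real parser is far richer; this just spots known formulae in the text
    return [t.strip(";(),.") for t in text.split() if t.strip(";(),.") in KNOWN_SOLVENTS]

summaries = {}
for patent in weekly_index:
    if not is_chemical(patent):             # leaves roughly 10-100 patents per week
        continue
    summaries[patent["id"]] = parse_solvents(patent["text"])  # becomes dissolveTotal.html

print(summaries)   # {'EP000000002141999A1': ['CHCl3', 'CH3CN']}

The real run also downloads and unzips the patents and uploads the summary to the GreenChain Server; those steps are omitted here.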

We shall then trawl the results from the server and present them at the meeting. Since the data are all Open this is truly Open Notebook work. Anyone with a different approach to analysis is welcome to use the data.

Indexes for 1980-1990 are now loaded, and the rest should be done in an hour.

Many thanks to our volunteers and to Sam Adams for creating the Green Chain Server system and code to access it.


#solo10 GreenChainReaction: Almost ready to go, so please volunteer

Scraped/typed into Arcturus

We have now scoped out the project for the Green Chain Reaction and I am almost certain it can work. How well it works depends largely on the number of volunteers and the time they can give.

If you are interested in taking part let us know now … it should be fun and it doesn’t matter if any particular person or machine doesn’t succeed. BE BRAVE!

The code now works at the following levels:

  • It takes a weekly patent index and downloads all the chemical patents.
  • It trawls through these to see which contain experimental sections
  • It analyses the text in these to extract mentions of solvents, including chemical formula and amount (where given)
  • It aggregates all the solvent data from a single patent into a summary file (dissolveTotal.html)
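For a feel of that last (aggregation) step, here is a small sketch that collapses one patent's solvent mentions into an HTML summary. The real dissolveTotal.html layout isn't shown in this post, so the table format below is an assumption.

# Hedged sketch of the aggregation step: one patent's solvent mentions collapsed
# into a small HTML summary (the real dissolveTotal.html layout may differ).
from collections import Counter

mentions = [("CHCl3", "not stated"), ("CH3CN", "500 mL"), ("CHCl3", "not stated")]
counts = Counter(solvent for solvent, _ in mentions)

rows = "\n".join(f"<tr><td>{solvent}</td><td>{n}</td></tr>"
                 for solvent, n in counts.most_common())
html = f"<html><body><h2>Solvent summary</h2><table>\n{rows}\n</table></body></html>"

with open("dissolveTotal.html", "w") as out:
    out.write(html)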

I am hoping that we can add company and country information to dissolveTotal.html but this is not critical – just fun.

Volunteers should read http://okfnpad.org/openPatents and email me (pm286 a t cam dot ac dot uk). I will put them on a communal mailing group so they can mail for help. Each volunteer will select a patent index (there are about 1500: 30 years at roughly 50 weeks a year). It takes about 30-60 minutes to download, unzip and analyse a week's material. It generates about 200 MB and 12000 files; only about 50 of these will need to be (automatically) uploaded.
So, for the whole group, that is about 1000 hours' work and some 12 million files in total. It's this job we want YOU to help with.

Each dissolveTotal.html is quite small – a few kB – and there are perhaps 20 per patent index; these are what you will upload to our server. When we get a significant number of them we can start using our new software to analyse and display the results.

We’ve currently got about 8 people who can run these jobs. That’s quite a lot of effort per person – 200 jobs. So if we get more volunteers it will make it more fun. Of course we don’t have to do the whole lot, but it’s a fun challenge.

We'll probably make two runs at the data, as the parser will need tuning in the light of what we see.

Please join in…


PantonDiscussion#1: Richard Poynder

Dictated into Arcturus

We've just had the first of our Panton Discussions, with Richard Poynder as the visiting guest. This went extremely well and, as you can see from the photographs, was held in the small room in the Panton Arms. We had this all to ourselves and the session lasted about 2 hours.

It turned out that the session was mainly driven by Richard asking a number of questions to which Jordan Hatcher and I replied as appropriate, although there were times at which free discussion took place. This has been very useful for us because it allowed us to lay out several of the primary ideas underlying the Panton Principles, and because Jordan and I were able to balance each other from different viewpoints. It means that the session will serve as a good reference point.

Brian Brooks recorded the whole session on audio (ca 80 mins) and some of it on video. We hope that there will be snippets of video which will stand alone and can be posted on YouTube. The audio should be available more or less as is after minor edits, but we'd also like a written transcript. We'd like suggestions here as to the best way of doing this. Brian is going to try to read this into Dragon, and then we will need to correct the mistranscribed words.

Alternatively we may need to type it up from the audio in which case any volunteers would be extremely valuable!

UPDATE: we've had two volunteers for transcription and we'd welcome more, as that will help both them and us.


Panton Discussions#1: Richard Poynder 2010-08-24

Typed into Arcturus

Following on from our creation of the Panton Principles we are now starting a series of “Panton Discussions”. These are intended to explore in public, with a well-known guest, aspects of Open Data. They have a fluid format, which may well change as we progress.

We are delighted to be joined in the Panton Arms tomorrow (Tuesday 24th Aug) at 1200 by Richard Poynder (see http://www.richardpoynder.co.uk/). Anyone is welcome – we've booked a room. Richard describes himself as:

Richard Poynder writes about information technology, telecommunications, and intellectual property. In particular, he specialises in online services; electronic information systems; the Internet; Open Access; e-Science and e-Research; cyberinfrastructure; digital rights management; Creative Commons; Open Source Software; Free Software; copyright; patents, and patent information.

Richard has contributed to a wide range of specialist, national and international publications, and edited and co-authored two books: Hidden Value and Caught in a Web: Intellectual Property in Cyberspace. He has also contributed to radio programmes.

Richard has interviewed a number of leaders in the Open Access and other movements and these interviews are highly valued.

We don't yet know the precise form of tomorrow's Discussion. We've each put forward some topics/questions and I've listed these below. I doubt we'll get through everything. We'll probably start by sorting/triaging them.

The meeting will hopefully be recorded by us and we’d like to keep editing to a minimum. There won’t, by default, be a transcript unless someone volunteers. If there is a video it would be Openly published.

 

  • Open Data is a relatively new phrase but is now being heard everywhere. What do you think has led to this?
  • What can the Open Data movement learn from other Open movements (especially Access, but also Open Source)?
  • Where should the open data movement put its initial energies? Should it try to tackle some well-defined issues or must it draw a wide plan of what needs to be addressed?
  • Who are likely to be, or become, enthusiastic champions of Open Data?
  • How can one build a successful business model (i.e. generate revenue) from Open knowledge? It's worked in the limited area of ICT, but there are few examples outside that.

 

  • What is open data, and why is it important? What's the problem it needs to fix?
  • As [we] understand it there are both the Public Domain Dedication and Licence and Creative Commons Zero. What are these, how do they work, and how do they differ?
  • We also have the Science Commons Protocol for Implementing Open Access Data and the Open Knowledge/Data Definition. What are these? How do they differ? And why, again, do we need two initiatives?
  • And more recently we have the Panton Principles. What do they provide that was not available before?
  • Meanwhile Open Data has now acquired a much wider definition, covering for instance government and other public data. Is there some confusion around?
  • How many people have signed up to the Panton Principles?
  • Why have they not been endorsed by Science Commons?
  • There was, I think, some disagreement in developing the Panton Principles, and so they represent a compromise. Who had to compromise over what, and why?
  • Essentially we are talking about never using Non-Commercial licences?
  • What about Share Alike? I guess it does not have to be Share Alike. So would it perhaps not fit Richard Stallman's concept of freedom?
  • Bollier in his book Viral Spiral says that Open Data does not rely on copyright, and does not use a licence, but rather uses a set of protocols. Can you expand on this? It's basically a public commitment, is it?
  • The open data philosophy also talks of using norms. Can you expand on this?
  • In the Panton Principles it says, "You may wish to take the 'ask for forgiveness, not permission' approach". Isn't that somewhat dangerous advice?
  • Why not cover the social sciences and humanities too?
  • Where does Open Data fit with Open Access?
  • You will note that the OA movement has now differentiated between OA gratis and OA libre. Does that help the cause of Open Data (i.e. on re-usability)?
  • Where does Open Science fit in here?
  • What about Open Notebook Science?
  • Where does Open Data live: in institutional repositories, or in central archives like arXiv and PubMed Central?
  • Earlier this year you got a SPARC Innovators Award. Beyond that, do you have any figures to show the success you are having: the number of data sets carrying the Open Data logo, for instance, or the number of records now freely available?
  • What advice would you give to any scientist seeking to make their data available today? What pitfalls do they need to avoid?
  • Where do scholarly publishers fit in here? Are they problematic?
  • Have all OA publishers now adopted Open Data too?


Data Validation and publication – some early ideas

Dictated into Arcturus

On Friday David Shotton from Oxford and I visited Iain Hrynaszkiewicz and colleagues at BioMed Central to discuss collaboration on several grants funded by JISC. In our case they are Open Bibliography (#jiscopenbib) and #jiscxyz (on data); in David's they are Open Citations (#jiscopencite) and Dryad-UK. These involve several other partners (whom I shall mention later, but I highlight here the International Union of Crystallography). Our meeting with Iain was about access to bibliographic material and also their requirements for data publication. I'll be blogging a lot about this, but one thing was very clear:

Data should be validated as early as possible, for example before the manuscript is sent to a publisher for consideration.

This has a major influence on how and where data are stored, published and archived, and it influences whether we should have domain-specific repositories and where they should be. Our group on Friday was unanimous that repositories should be domain-specific. (I know this is controversial and I'll be happy to comment on alternative models – I shall certainly blog this topic.)

I am also keen that, wherever possible, data validation should be carried out by agreed algorithmic procedures. Humans do not review data well, and there are recurring examples of science based on unjustifiable data, where pre-publication data review would have helped the human reviewers take a more critical view. I shall publish a blog post on one such example.

Here are some things that a machine is capable of doing before a manuscript is submitted. I’ll analyse them in detail later. They will vary from discipline to discipline.

  • Labelling. Have the data components been clearly and unambiguously labelled? (For example, we have had *.gif files (images) wrongly labelled as *.cif (crystallography).)
  • Formats and content. Is the content of the files described by an Open specification? Does the content conform to those rules?
  • Units. Do all quantities have scientific units of measurement?
  • Uncertainty. Is an estimate given of the possible measurement and other variation?
  • Completeness checklist. Are all the required components present? If, say, there is a spectrum of a molecule, is the molecular formula also available?
  • "Expected values". In most disciplines data fall within a prescribed range. For example, human age is normally between zero and 120; a negative value would cause suspicion, as would an age of 999.
  • Interrelations between components. Very often components are linked in such a way that the relationship can be tested by a program.
  • Algorithmic validation of content. In many disciplines it's possible to compute, either from first principles or heuristically, what the expected value is. For example, the geometry and energy of a molecule can be predicted from quantum mechanics.
  • Presentation to humans. The robot reviewer should compile a report that a human can easily understand – again, I shall show a paper which, had this been done, would have been seriously criticized.
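As a deliberately simplified illustration of a few of these checks (completeness, units and expected values), here is a sketch of what a robot reviewer could do; the field names, required components and the age range are my own assumptions, not a proposal for any particular repository.

# Minimal sketch of pre-submission data validation: completeness, units, expected values.
# The required fields and the age range are illustrative assumptions only.

REQUIRED_FIELDS = {"age", "age_units", "molecular_formula", "spectrum"}
EXPECTED_RANGES = {"age": (0, 120)}            # the "expected values" check

def validate(record):
    report = []                                 # a human-readable report for the reviewers
    missing = REQUIRED_FIELDS - record.keys()
    if missing:
        report.append(f"Completeness: missing components {sorted(missing)}")
    if "age" in record and "age_units" not in record:
        report.append("Units: 'age' has no unit of measurement")
    for field, (low, high) in EXPECTED_RANGES.items():
        value = record.get(field)
        if value is not None and not low <= value <= high:
            report.append(f"Expected values: {field}={value} lies outside [{low}, {high}]")
    return report or ["All automatic checks passed"]

print(validate({"age": 999, "age_units": "years", "spectrum": "..."}))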


Blue Obelisks: Openness is taking off

Scraped/typed into Arcturus

The Blue Obelisk (http://www.blueobelisk.org ) is a relatively unstructured group of people interested in software and information in chemistry. The mantra is:

“Open Data, Open Standards, Open Source”

This guides the BlueOb in its thinking. It has web pages and a mailing list to which anyone can sign up

It is an unorganization:

  • It unmeets at dinners when it does
  • It has an unmembership
  • It has an unagenda and unminutes

There may or may not be a BlueObelisk dinner at the ACS meeting. It’s been planned. The unattendees are those who express interest and can interpret the uninstructions on how to get to a restaurant at a given time.

Sometimes people get BlueObelisks and sometimes they don’t.

Sometimes the blue obelisks are blue, and sometimes they aren’t. Sometimes they are obelisks and sometimes they aren’t. But most of the time they are.

I have spent today travelling to a secret source of Blue Obelisks. Here are some:


 

I shan’t physically be at the Blue Obelisk dinner at the ACS but I’ll be there in spirit.

The Blue Obelisk now provides good, Open solutions in most of chemoinformatics. Consider joining the Blue Obelisk. You may wish to contribute – there are many ways other than writing code. Being an enthusiastic user is just as useful. The Open movement is taking off everywhere – it’s worth being part of it.


#solo10 An introduction to textmining and data extraction

Scraped/typed into Arcturus

But now we'll show what we can get out of patents. Even if you aren't a chemist you should be able to follow this. It'll show you what text-mining is about and how we are looking for greenness.

Here’s a typical report in PDF (I have cut and pasted it so that’s why it looks tacky – but that’s what you get with PDF):

A) 10-Octadecyl-_1,4,7,10-tetraazacyclododecane-_1,4,7-triacetic acid

_[0065] A mixture of 1,4,7,10-_tetraazacyclododecane-_1,4,7-_triacetic acid tris_(1,1-_dimethylethyl) ester (37.5 g; 72.8

mmol) and 1-_bromooctadecane (24.5 g; 73.5 mmol) in CH3CN (500 mL) was heated to reflux. After 2 h the reaction

mixture was evaporated and the residue was dissolved in CHCl3 and a portion of CF3COOH was added. After 16 h at

room temperature the reaction mixture was evaporated and the oily residue dissolved in CF3COOH. After 3 days at

room temperature, the solution was evaporated, the residue taken up in CHCl3 and the solution evaporated. This operation

was repeated three times. The oily residue was purified by flash chromatography as follows:_

Eluents:_

(a) CH2Cl2/_MeOH = 3/1 (v/v) 3 litres

(b) CH2Cl2/_MeOH/NH4OH 25% (w/w) = 12/4/1 (v/v/v) 12 litres

(c) CH2Cl2/_MeOH/NH4OH 25% (w/w) = 6/3/1 (v/v/v) 2 litres

_[0066] The product was dissolved in H2O and acidified with 6N HCl; then, the solution was loaded onto an AmberliteK

XAD-_8 resin column and eluted with a CH3CN/H2O gradient. The product started eluting with 20% CH3CN.

Because PDF is non semantic we have lost some of the formatting, but it doesn’t matter. The XML is MUCH more useful. That’s why you should author XML, as well as PDF (or even instead). Don’t switch off just because it’s XML. The key phrases we shall use are highlighted…

<p id="p0065" num="0065">A mixture of 1,4,7,10-tetraazacyclododecane-1,4,7-triacetic acid tris(1,1-dimethylethyl) ester (37.5 g; 72.8 mmol) and 1-bromooctadecane (24.5 g; 73.5 mmol) in CH<sub>3</sub>CN (500 mL) was heated to reflux. After 2 h the reaction mixture was evaporated and the residue was dissolved in CHCl<sub>3</sub> and a portion of CF<sub>3</sub>COOH was added. After 16 h at room temperature the reaction mixture was evaporated and the oily residue dissolved in CF<sub>3</sub>COOH. After 3 days at room temperature, the solution was evaporated, the residue taken up in CHCl<sub>3</sub> and the solution evaporated. This operation was repeated three times. The oily residue was purified by flash chromatography as follows:<br/>

Eluents:

<ul id="ul0004" list-style="none" compact="compact">

<li>(a) CH<sub>2</sub>Cl<sub>2</sub> / MeOH = 3/1 (v/v) 3 litres</li>

<li>(b) CH<sub>2</sub>Cl<sub>2</sub> / MeOH / NH<sub>4</sub>OH 25% (w/w) = 12/4/1 (v/v/v) 12 litres</li>

<li>(c) CH<sub>2</sub>Cl<sub>2</sub> / MeOH / NH<sub>4</sub>OH 25% (w/w) = 6/3/1 (v/v/v) 2 litres</li>

</ul></p>

PMR: notice how the subscripts and the list have been properly captured. Anyway, it's the text that matters. The key phrases for the Green Chain Reaction are in bold.

dissolved in CF<sub>3</sub>COOH is a classic linguistic template. This tells us that CF3COOH is a solvent! (I don't know how green it is or isn't.) Here are its hazards:

MSDS: External MSDS
R-phrases: R20, R35, R52/53
S-phrases: S9, S26, S27, S28, S45, S61
NFPA 704

(http://en.wikipedia.org/wiki/Trifluoroacetic_acid )

in CH<sub>3</sub>CN (500 mL) is also a classic template. The (number + mL) tells us it's a liquid (if it were a solid it would have grams (g) or milligrams (mg) as units). We also know that CH3CN is a liquid by looking it up:

http://en.wikipedia.org/wiki/Acetonitrile :
EU classification: Flammable, harmful
R-phrases: R11, R20/21/22, R36
S-phrases: (S1/2), S16, S36/37

dissolved in CHCl<sub>3</sub> matches the same template, and http://en.wikipedia.org/wiki/Chloroform says:

The US National Toxicology Program’s eleventh report on carcinogens[20] implicates it as reasonably anticipated to be a human carcinogen, a designation equivalent to International Agency for Research on Cancer class 2A. It has been most readily associated with hepatocellular carcinoma.[21][22] Caution is mandated during its handling in order to minimize unnecessary exposure; safer alternatives, such as dichloromethane, have resulted in a substantial reduction of its use as a solvent.

So this is where we start to see the point of the GreenChainReaction… Let’s see what the ratio of use of chloroform to dichloromethane is over time.

So if we can extract the solvents out of every patent reaction we can get an indication of the greenness. The next post will show how we do this.
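As a taster of how those templates can be matched, here is an illustrative sketch – not our actual parser; the regular expressions and the tiny formula-to-name table are assumptions made for this example only.

# Illustrative sketch: match the two template types above and tally solvent mentions.
# The regexes and the formula->name table are assumptions, not the project's parser.
import re
from collections import Counter

text = ("the residue was dissolved in CHCl3 and a portion of CF3COOH was added. "
        "A mixture ... in CH3CN (500 mL) was heated to reflux.")

NAMES = {"CHCl3": "chloroform", "CH3CN": "acetonitrile",
         "CF3COOH": "trifluoroacetic acid", "CH2Cl2": "dichloromethane"}

mentions = Counter()
# Template 1: "dissolved in <formula>"
for formula in re.findall(r"dissolved in ([A-Za-z0-9]+)", text):
    mentions[NAMES.get(formula, formula)] += 1
# Template 2: "in <formula> (<number> mL)" -- the mL unit marks a liquid
for formula, _volume in re.findall(r"in ([A-Za-z0-9]+) \((\d+) mL\)", text):
    mentions[NAMES.get(formula, formula)] += 1

print(mentions)   # Counter({'chloroform': 1, 'acetonitrile': 1})
# Summed over thousands of patents and binned by year, counts like these would give
# the chloroform : dichloromethane ratio over time.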


#solo10 GreenChainReaction: Anatomy of a Patent

Scraped/typed into Arcturus

As we are making progress I thought we’d let you have a look at what we are doing. Because we believe in Openness we’re wearing our heart in the Open and telling you as it is created. This is OpenNotebook chemical informatics. Let’s count:

  • The raw data are Open (patents)
  • The code is Open (bitbucket)
  • We’ve invited anyone who wants to be involved to join – and there’s good membership from the Blue Obelisk community and others
  • We sync our contributions on Etherpad which is immediate and Openly visible

We are working on patents. The details can be found on the Etherpad (http://okfnpad.org/solo10 or http://okfnpad.org/openPatents ). A European Patent (EP) consists of:

An index file, with a name like:

EPO-2000-12-20.xml

This is a fully semantic, consistent name (EPO + date). You start to get a warm feeling. The data are well structured. However, the index file itself is a nasty shock. It starts:

<ep-patent-document-list>

<ep-patent-document-ref>

<publication-reference>

<document-id lang="en">

<country>EP</country>

<doc-number>000000002141978</doc-number>

<kind>A1</kind>

<original-kind />

<correction-code />

<sequence />

<original-sequence />

<date>20100113</date>

</document-id>

</publication-reference>

<application-reference appl-type="patent">

<document-id>

<country>EP</country>

<doc-number>000000008724152</doc-number>

<date>20080331</date>

</document-id>

</application-reference>

<classification-ipc>

<text>A01D003464 A01B007300 A01B0059048</text>

</classification-ipc>

</ep-patent-document-ref>

Well, we are used to non-semantic identifiers and they're a good thing. Except that these identifiers have implicit semantics: you have to pull the strings apart to discover whether the document is a chemical one or not. It's hideous – and I'm publicly grateful to David Jessop, who has written the deconstruction code. David has also written the code that downloads the Zips – it's anything but trivial, and involves repeated HTTP polling of the site to determine where the zipped patents are. I suspect we are losing some downloads because of the arcane nature of the index. But at least it can't get much worse….

EP 1060157B1.zip

Oh yes it can. This is the filename of a downloaded zip. It’s got a SPACE in it. But not your honest to goodness char(32). Or char(8), but a beautiful char(160) hamburger space. A gold-plated non-breaking nbsp space. On we go…
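Both snags are mechanical to handle once you know they are there. Here is a minimal sketch – my own illustration, not the deconstruction code David wrote – of normalising the char(160) in a filename and of testing the classification-ipc text from the index to decide whether a patent is chemical at all; treating IPC section C as "chemistry" is an assumption about how one might filter.

# Illustrative sketch only: filename normalisation plus a crude IPC-based
# "is this a chemical patent?" test. Not the project's actual deconstruction code.
import xml.etree.ElementTree as ET

def normalise_filename(name):
    # replace the non-breaking space, char(160), with an ordinary space, char(32)
    return name.replace("\u00a0", " ")

print(normalise_filename("EP\u00a01060157B1.zip"))   # -> 'EP 1060157B1.zip'

ENTRY = """<ep-patent-document-ref>
  <classification-ipc><text>A01D003464 A01B007300 A01B0059048</text></classification-ipc>
</ep-patent-document-ref>"""

def is_chemical(entry_xml):
    # IPC codes start with the section letter; section C covers chemistry
    codes = ET.fromstring(entry_xml).findtext(".//classification-ipc/text", default="")
    return any(code.startswith("C") for code in codes.split())

print(is_chemical(ENTRY))   # -> False: these are all section A (agriculture) codes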

 

  • A Zip file which contains an unpredictable (or at least variable) number of goodies with names and labels you generally have to guess at. We haven't worked out the whole cryptanalysis, but there are normally:
  • DOCUMENT.PDF. PDFs are normally human-readable but not machine-readable. These PDFs stretch the definition of human-readable. There are misspellings, illiteracies and the images are a delight for those who like hamburgers rather than cows. Reassembling the cow is not for the faint-hearted. Here’s a snippet

 

  • Luckily (or since we pay taxes for it, in addition) there is:
  • DOC00001.xml or something like EP00401685NWA1.xml. There are clearly at least two independent nomenclature systems for the files. These are XML files the EPO has created from the PDF. It’s done by humans – I have visited the office where they do it (it’s probably outsourced by now). A huge barn where humans are given the OCR of the patent and have to correct the text line by line (and add the line number).
  • And there are between 0 and 2000 images. These are not only non-semantic but a veritable chamber of horrors. Here are two. The first is typical. Its quality is awful

 


How many minus signs (as opposed to single bonds) are there in that picture? (HINT: There are some).

You think that’s bad? Here’s a chemical formula:


Now it wasn't that bad when the author drew it. It has probably been deliberately obfuscated so that the patent is harder to understand. I THINK the characters are taken from only N, C, H, R, (, ), 1, 2, 3, 4, 5, m and n. I don't think anyone on the planet can be absolutely sure what they are.

But then there is the one I posted some days ago, where Egon cracked the puzzle. The quality was SO BAD that chemical bonds are simply missing.

This is the level of quality that is endemic in chemical informatics (and patents).

However, the XML is a lifeline and I'll show you where we have got to in a following post.

Ah, well


#solo10 GreenChainReaction: we'll have something exciting to show at the meeting

Scraped/typed into Arcturus

Typical hard-slog day, gearing up to do text-mining on the chemical reactions extracted from the patents. Slower than I had thought, but steady progress. The routines are all written – it's the glueware that is the problem.

We still need YOU to volunteer.

As an incentive, we’ve got some stunning technology for displaying the results. Absolutely stunning. It’s based on thinking in a modern frame of mind with Open Data, links, etc. rather than walled gardens, free-but-not-open, etc. RDF, not proprietary formats. XML, not PDF, and so on. I’m not going to tell you what it is as we shall announce it at the meeting!

Heather and I have put together a letter to send to science publishers asking about the Openness of their data. Since all publishers seem to agree that, by default, data are open and not copyrightable, we expect general agreement. We'll probably send it out later today or tomorrow. Whether we can use the publisher information I don't know, but we have enough in the patents.

Oh, we still need volunteers.



#solo10 GreenChainReaction: We need more volunteers – and yes, it will be fun!

Scraped/typed into Arcturus

Continued incremental progress…

I know it's the summer vacation but we need your help even more. Now is your chance to:

  • Download documents robotically
  • Run semantifiers over them
  • Do natural language processing
  • Extract data from raw text
  • Create RDF and search it
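To make that last item concrete, here is a minimal sketch using rdflib with an entirely made-up vocabulary (http://example.org/greenchain/); the RDF model we actually end up with is not shown here.

# Minimal sketch of "create RDF and search it" using rdflib.
# The vocabulary (ex:Patent, ex:usesSolvent) is hypothetical, for illustration only.
from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import RDF

EX = Namespace("http://example.org/greenchain/")

g = Graph()
patent = URIRef("http://example.org/greenchain/patent/EP2141978A1")
g.add((patent, RDF.type, EX.Patent))
g.add((patent, EX.usesSolvent, Literal("chloroform")))
g.add((patent, EX.usesSolvent, Literal("acetonitrile")))

# ... and search it with SPARQL
results = g.query("""
    PREFIX ex: <http://example.org/greenchain/>
    SELECT ?patent ?solvent WHERE { ?patent ex:usesSolvent ?solvent }
""")
for row in results:
    print(row.patent, row.solvent)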

These are really transferable information skills that will be highly valued. Volunteers will be guided through this by experts (well, we've been going a week, so that makes us experts!).

All you have to do is volunteer. It will be fun. After all it is the holidays.

So where are we at?

  • The patent downloader has been compiled, distributed and works. Start trying it out now
  • We’ve downloaded and parsed patents back to 2000 so if they all work we’ll have gazillions of experiments
  • We’re finalising the chemistry extraction over the next few hours.

Lezan Hawizy, who wrote ChemicalTagger, is now back in action. She and David Jessop are presenting their work (code, results) at the American Chemical Society meeting in Boston. Be there!

 

What’s more to do?

  • Build an aggregator for the green data. Probably solvents at first.
  • Attach greenness to each solvent
  • Develop data presentation. Sam Adams (who works with us and won the Dev8D memento challenge and was runner up in OR10’s DevSci) has shown me an idea that looks like being sensational. But you will have to come to the session at #solo10 to see its premiere! (Unless you are actually helping with this project, which of course you now will be)
  • Legitimise other sources of data. Heather Piwowar (@researchremix) and I are setting out our principles for accessing and producing Open Data – we’ll be asking for responses through IsItOpen.

