ACS Talk on the Green Chain Reaction: summary of what we did

This is a collection of all the blog posts that I made on the Green Chain Reaction. They give some idea of the tight deadline and the volunteering. It’s followed by a scrape of the (now-spammed) site…

 

#solo2010: The Green Chain Reaction; please get involved!

Sunday, August 8th, 2010

#solo2010: Green Chain Reaction; details and offers of help

Monday, August 9th, 2010

#solo2010: How safe / green is my reaction? Feedback requested

Tuesday, August 10th, 2010

#solo2010: Computing volunteers needed for Green Chain Reaction

Tuesday, August 10th, 2010

#solo2010: Green Chain Reaction – update and more volunteers!

Wednesday, August 11th, 2010

#solo2010: Sharing Data is Good for All of Us

Wednesday, August 11th, 2010

#solo2010: Where can we get Open Chemical Reactions?

Thursday, August 12th, 2010

#solo10: The Green Chain Reaction is becoming a chain reaction!

Thursday, August 12th, 2010

#solo10: Publishers, is your data Openly re-usable – please help us

Thursday, August 12th, 2010

#solo10: Green Chain Reaction is using an Etherpad to collect its thoughts

Thursday, August 12th, 2010

#solo10: Green Chain Reaction – much progress and continued request for help

Saturday, August 14th, 2010

#solo10: Green Chain Reaction; where to store the data? DSR? IR? BioTorrent, OKF or ???

Saturday, August 14th, 2010

#solo10 GreenChainReaction: update and What is that Chemical Compound?

Sunday, August 15th, 2010

#solo10 GreenChainReaction: Some chemical information is appalling quality, but does anyone care?

Sunday, August 15th, 2010

#solo10 GreenChainReaction: Update and continued request for help

Monday, August 16th, 2010

#solo10 GreenChainReaction: We need more volunteers – and yes, it will be fun!

Tuesday, August 17th, 2010

#solo10 GreenChainReaction: Can you spot how green the reaction is?

Monday, August 16th, 2010

#solo10 GreenChainReaction: A greenness calculator

Monday, August 16th, 2010

#solo10 An introduction to textmining and data extraction

Thursday, August 19th, 2010

#solo10 GreenChainReaction: Almost ready to go, so please volunteer

Wednesday, August 25th, 2010

#solo10 GreenChainReaction : an Open Index of patents

Thursday, August 26th, 2010

#solo10: Green Chain Reaction starts to deliver Open Notebook Science and Open Patents

Thursday, September 2nd, 2010

#solo10: Green Chain Reaction here are the first results! Make your own judgment on greenness

Friday, September 3rd, 2010

#solo10 GreenChainReaction first results and comments

Friday, September 3rd, 2010

#solo10 Immediate reactions and thanks

Monday, September 6th, 2010

#solo10 GreenChainReaction how we take this forward

Monday, September 6th, 2010

#solo10: Green Chain Reaction – cleaning the data and next steps

Friday, September 10th, 2010

 

History in SCRAPES from the Science online site

 

General intro

 

You don’t have to be a chemist (or a practising scientist) to be involved in this experiment! The most important thing is that you offer to help. Here are some suggestions:

  • help organize and expand these pages
  • publicize the experiment
  • add your name and possible contribution here
  • help request data-mining permissions
  • read chemistry experiments and extract information manually
  • test and deploy software
  • run text-mining software to extract data from publications
  • organize the retrieved information
  • present the results in an exciting manner
  • create links
  • [add your suggestion]

Ready? Email Peter Murray-Rust and add your name to the contributors and participants below!

One of many offers: Hi Peter, sounds a really fun project. I’m happy to help out with some Java coding. Also I have a cloud-hosted virtual machine I’m not really making much use of right now which you’re welcome to use.

Who has volunteered

Current contributors (alphabetical, add links)

Unilever Credits

We must also give great credit to David Jessop (Unilever Centre) for having written the code which we have re-used in this experiment. David worked out how to

  • download patents and understand what comes down
  • restructure the XML into sections and paragraphs

and to Lezan Hawizy (Unilever Centre) who wrote the chemical natural language processing (NLP) (ChemicalTagger) which structures the text and identifies the chemical “parts of speech”.

and to Daniel Lowe, who created OPSIN, the best IUPAC chemical name-to-structure converter,

and to Sam Adams who has helped with lots of stuff including Hudson, Pivot and general insight.

ISITOPEN requests

We need people to help with legwork behind requests to IsItOpen.

IsItOpen aims to facilitate enquiries to data holders about the openness of the data they hold, and to record publicly the results of those efforts. It is inspired by What Do They Know? (whatdotheyknow.org), a site which allows citizens to make requests to find out information to which they have a right. It works by identifying the right place to ask, helping users make requests and publicly displaying the results.

The Green Chain Reaction project will be using IsItOpen to ask publishers whether we can data-mine their chemical reactions. We know we can do this with BMC, PLoS, Acta E and a few others, but we need to capture and record their formal replies. In particular it’s valuable to make sure they get courteous replies (even if they say no). So we’d like to start by asking the committed OA publishers, get “yes” from them and then start on the not-so-committed ones.

We don’t know how far we’ll get before the date. We’d like to present this at the meeting and at least be able to show successes.

CODE

Anyone can contribute or develop software as long as it is Open Source. Initially we start with code from the Unilever Centre.

This is code written by members of the Unilever Centre for text-mining, high-throughput computation, semantic web, etc. Most of this has been built under our Hudson continuous integration system and should be reasonably portable but has not been widely deployed. Early adopters should be familiar with:

  • Maven 2.0
  • Java 1.6
  • an IDE (we have used Eclipse, IntelliJ and Netbeans).

Code to be tested and deployed (please be gentle: the README hasn’t been written yet)

  • Crystaleye processor. Code to extract entries from Acta Crystallographica E. Please check out, run the tests and report problems. (DanH and MarkW have got this running – thx)
  • test data from ActaE for CrystaleyeProcessor
  • Code to extract patent entries and analyse them. Alpha release now on Bitbucket; needs customising.

Getting started quickly:

CODE AND DATA FOR HACKERS

Prerequisites: Java, Maven and Mercurial.

If you are unfamiliar with these technologies then check this page for some useful tips and links.

These instructions have been tested on Linux and Windows 7
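
Before checking anything out, it can help to confirm that the three prerequisites are actually on your PATH. The following is a minimal sketch, assuming a POSIX shell and the usual executable names (java, mvn, hg); the function name missing_tools is made up for this illustration.

```shell
#!/bin/sh
# Sketch: report which prerequisite command-line tools are missing.
# Assumes a POSIX shell and the usual executable names (java, mvn, hg).

# Print the name of every tool in the argument list that is not on PATH.
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
  done
}
```

Running `missing_tools java mvn hg` prints nothing when all three are installed, and otherwise lists the ones still to be set up.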

For analysing papers (output to crystaleyeTestData/html/x/y/z/…):

For patent analysis (output to patentData/EPO-yyyy-MM-dd):

To build executable jar for patent analysis, including all dependencies:

  • mvn -Dmaven.test.skip=true assembly:assembly
  • generates target/patent-analysis-0.0.1-jar-with-dependencies.jar
  • See this page for instructions on running the jar

To obtain new patent catalogs in the patentData folder:

If you want patent analysis to use self-built crystaleye-moieties then perform this command in crystaleye-moieties folder:

  • mvn -Dmaven.test.skip=true install
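
The order of the two Maven builds matters: `install` puts the self-built crystaleye-moieties artifact into the local Maven repository, so it has to happen before the patent-analysis jar is assembled. A dry-run sketch of that order follows; it only prints the commands. The two directory arguments are hypothetical checkout locations, and whether the patent-analysis build actually picks up the local artifact depends on the versions declared in its POM.

```shell
#!/bin/sh
# Sketch: print the two Maven builds in dependency order (dry run only).
# $1 = crystaleye-moieties checkout, $2 = patent analysis checkout
# (both locations are hypothetical and chosen by the reader).
build_plan() {
  printf '(cd %s && mvn -Dmaven.test.skip=true install)\n' "$1"
  printf '(cd %s && mvn -Dmaven.test.skip=true assembly:assembly)\n' "$2"
}
```

Each printed line can be pasted into a terminal as it stands; the parentheses keep the `cd` from changing your current directory.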

INSTRUCTIONS

What is going on here?

What you are going to do is download a small program that runs in Java. You almost certainly have Java installed on your computer if you have a web browser. The program reads an instruction file which tells it how to read through a list of patents that relate to chemistry. You will also need to download these two files; instructions are given below.

Why would I want to do this?

This project is attempting to ask a question by getting computers to “read” as many patents as possible from the recent to the quite old. The question we are asking is “Is chemistry becoming more green in the processes and reagents that it uses?” To do this work we are asking volunteers to become involved by contributing their computing resources to help read the patents. No knowledge of chemistry is necessary!

More generally we are trying to demonstrate the feasibility of collecting information from across a wide range of documents that relate to science to ask wider questions. The results of this work will be presented at Science Online London 2010 in a few weeks’ time.

Sounds great! How do I do it?

Prerequisites: Java

Instructions for analysing patents:

Latest instructions for the experienced

  1. Please always use the code from Hudson.
  2. Download the latest jar, which has been lightly tested, from https://hudson.ch.cam.ac.uk/job/patentanalysis/lastSuccessfulBuild/wwmm$patent-analysis/patent-analysis-0.0.1-jar-with-dependencies.jar
  3. Create a folder named e.g. patentData to hold the index and receive the results
  4. Download http://greenchain.ch.cam.ac.uk/patents/jars/parsePatent.xml to anywhere convenient – yourDir
  5. Download http://greenchain.ch.cam.ac.uk/patents/jars/uploadWeek.xml to anywhere convenient – yourDir
  6. Download a random patent catalogue (though pre-1990 may be lacking Chemistry patents) from http://greenchain.ch.cam.ac.uk/patents/indexes/ into the patentData folder
  7. Run “java -Xmx512m -jar patent-analysis-0.0.1-jar-with-dependencies.jar -p <yourDir>/parsePatent.xml -d <patentData>”
  8. Then run “java -Xmx512m -jar patent-analysis-0.0.1-jar-with-dependencies.jar -p <yourDir>/uploadWeek.xml -d <patentData>” to upload the results.
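
The two java invocations in steps 7 and 8 differ only in the instruction file, and the Hudson URL in step 2 contains a literal "$" that a shell will try to expand unless it is single-quoted. The sketch below prints the whole sequence as a dry run and executes nothing. It assumes a POSIX shell and curl for the downloads (the steps themselves only say "download"); the function names and the two folder arguments are placeholders for this illustration.

```shell
#!/bin/sh
# Dry-run sketch of steps 2-8: prints the commands, runs nothing.
# Assumptions: curl is used for the downloads; the two folder arguments
# are placeholders chosen by the volunteer. Paths containing spaces would
# need quoting before the printed lines are pasted into a terminal.

JAR=patent-analysis-0.0.1-jar-with-dependencies.jar
# Single quotes keep the literal "$" in the Hudson URL from being expanded.
JAR_URL='https://hudson.ch.cam.ac.uk/job/patentanalysis/lastSuccessfulBuild/wwmm$patent-analysis/'"$JAR"
XML_URL=http://greenchain.ch.cam.ac.uk/patents/jars

# Print the java command for one instruction file.
# $1 = parsePatent.xml or uploadWeek.xml, $2 = yourDir, $3 = patentData
patent_cmd() {
  printf 'java -Xmx512m -jar %s -p %s/%s -d %s\n' "$JAR" "$2" "$1" "$3"
}

# Print the whole sequence in order. $1 = yourDir, $2 = patentData
plan() {
  printf 'mkdir -p %s %s\n' "$1" "$2"
  printf "curl -L -O '%s'\n" "$JAR_URL"
  for xml in parsePatent.xml uploadWeek.xml; do
    printf 'curl -L -o %s/%s %s/%s\n' "$1" "$xml" "$XML_URL" "$xml"
  done
  # Step 6 (choosing a catalogue from the indexes page) is left manual.
  for xml in parsePatent.xml uploadWeek.xml; do
    patent_cmd "$xml" "$1" "$2"
  done
}
```

For example, `plan "$HOME/greenchain" "$HOME/patentData"` prints six lines: the folder creation, three downloads, the parse run and the upload run. Step 6 stays manual because you pick the catalogue yourself.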

THE FOLLOWING SECTION MAY BE OBSOLETE

More detailed instructions for the less confident volunteer (but check filenames against those above)

  1. Downloading the software tools and creating a working directory
    1. Open a browser and paste the following link into your address bar: http://dl.dropbox.com/u/1120779/solo/patentData.zip A download should start automatically. It might take a little while (around 40 seconds for me).
    2. Once you’ve downloaded the zip file, find it (your browser should help you with this) and unzip it. In most cases, double clicking, or right-clicking and selecting “Unzip” or something similar should do the job.
    3. Check that you have three files in the unzipped folder; they should be called “parsePatent.xml”, “uploadSolvent.xml”, and “patent-analysis-0.0.1-jar-with-dependencies.jar”
    4. Drag the folder to somewhere convenient, like the desktop or your documents folder
  2. Second step – getting a patent index
    1. Point your browser at http://greenchain.ch.cam.ac.uk/patents/indexes This takes you to the main index.
    2. You can select any year. Probably not much point going for ones much before 1990.
    3. Then select an index. It is probably easiest to right click (or click-hold on a Mac) and choose “Save target as…”. Save it into the directory with the tools that you just put somewhere you can remember. Now you are ready to…
  3. Do the analysis!
    1. Open a terminal window.
      1. Windows: In Start Menu select “Run” and type “cmd” and press return
      2. Mac: Open “Terminal” from Applications -> Utilities
    2. Navigate to your directory.
      1. On Windows or Mac, if the directory is on the desktop, try “cd Desktop/patentData”
    3. In the terminal type the command “java -Xmx512m -jar patent-analysis-0.0.1-jar-with-dependencies.jar parsePatent.xml”
    4. This should then run the extraction. Sit back and enjoy the nice warm feeling. The analysis will take between 10 and 60 minutes depending on how many patents are in the index.
    5. When the program has finished running you are ready to upload the results. At the command prompt type “java -jar patent-analysis-0.0.1-jar-with-dependencies.jar uploadSolvent.xml”
  4. All done! You can now go back to Step 2, pick a different patent index and start again…(you might want to delete all the folders and files that have been created first just to keep things clear and tidy)
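
If the analysis in step 3 fails straight away, a likely cause is a missing or misnamed file in the working folder. A minimal sketch of a check follows, assuming a POSIX shell (so Mac or Linux rather than the Windows “cmd” prompt). The function name missing_files is made up for this illustration, and the jar name is the one used in the commands in step 3.

```shell
#!/bin/sh
# Sketch: list which of the three expected files are absent from a folder.
# File names come from the steps above; the jar name follows the commands
# in step 3.
# $1 = the unzipped patentData folder
missing_files() {
  for f in parsePatent.xml uploadSolvent.xml \
           patent-analysis-0.0.1-jar-with-dependencies.jar; do
    [ -f "$1/$f" ] || printf '%s\n' "$f"
  done
}
```

`missing_files ~/Desktop/patentData` prints nothing when all three files are in place.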