#solo2010: Computing volunteers needed for Green Chain Reaction

Typed into Arcturus

Here is a wonderful offer for the Green Chain Reaction project at #solo2010.


Dan Hagon says:

August 10, 2010 at 12:28 am

Hi Peter, sounds a really fun project. I’m happy to help out with some Java coding. Also I have a cloud-hosted virtual machine I’m not really making much use of right now which you’re welcome to use.

This is exactly one of the skills we shall need for this project. If we are going to look at patents over many past years, we are going to have to use either a lot of humans or a lot of computing.

Dan worked with us as a summer student and then moved on to RAL. He helped us get much of the automation into crystal structure repositories. So I know that he knows this contribution is possible and valuable.

I’ll explain in more detail what we are going to do, but this post is about how. We have written most of the tools (in Java) and we’ll be able to offer them so they can run standalone on any machine. This may require wrapping them as a WAR or other self-starting distributable. We’ll also need to make sure they run remotely. (Java is described as write-once-run-anywhere and parodied as write-once-debug-everywhere, so people who know what debugging looks like are highly valued.)

The main distributed tool will be natural-language-processing (NLP) for chemical documents and specifically reactions. I’ll describe this in detail in a later post. The overall strategy looks something like:

  1. Download N documents from a remote site (e.g. patents, Acta Crystallographica E)
  2. Find all reactions in each document (can be hundreds in patents, only one in Acta)
  3. Carry out NLP on each reaction
  4. Create a datafile from each
  5. Index each datafile (probably using RDF)
  6. Search for green concepts in the RDF repository
  7. Present the results

We’ve got code for 1-4. We’ll need help and imagination with the later stages (5-7), especially since they may come slightly later than the initial parsing. But there will be many of you out there who have some experience of this sort of thing.
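
To make the shape of steps 2-4 concrete, here is a minimal sketch of how the stages might chain together for one document. It is not our actual code: the class name `PipelineSketch`, the method `processDocument`, and the toy finder and parser are all hypothetical placeholders, and the real reaction finder and NLP are far more involved.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;

// Sketch only: every name here is hypothetical, not the project's real API.
class PipelineSketch {

    // Steps 2-4 for one document: find reactions, run NLP on each,
    // and emit one datafile (here just a String) per reaction.
    static List<String> processDocument(
            String documentText,
            Function<String, List<String>> findReactions, // step 2 (placeholder)
            Function<String, String> parseReaction) {     // steps 3-4 (placeholder)
        List<String> datafiles = new ArrayList<>();
        for (String reaction : findReactions.apply(documentText)) {
            datafiles.add(parseReaction.apply(reaction));
        }
        return datafiles;
    }

    public static void main(String[] args) {
        // Toy stand-ins: split on blank lines, then "parse" by trimming and tagging.
        Function<String, List<String>> finder =
                text -> Arrays.asList(text.split("\\n\\s*\\n"));
        Function<String, String> parser =
                reaction -> "<reaction>" + reaction.trim() + "</reaction>";

        String doc = "A was added to B.\n\nC was heated with D.";
        processDocument(doc, finder, parser).forEach(System.out::println);
    }
}
```

Passing the stages in as functions keeps each one replaceable, which matters if volunteers want to swap in their own indexing or search for the later steps.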

Note that the cloud is an ideal place to do this sort of work as it is embarrassingly parallel – or can be cast as map-reduce. For example, each volunteer could take a year of patents (many tens of thousands of reactions in each year).
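
As a rough illustration of the one-year-per-volunteer idea, the sketch below runs one independent job per year and then sums the results, which is the map-reduce shape in miniature. Everything in it is an assumption for illustration: the class `YearlyJobsSketch`, the method `totalReactions`, and the per-year job are hypothetical, and a local thread pool stands in for volunteers' machines.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.IntUnaryOperator;

// Sketch only: the per-year job is a placeholder for
// "download and parse one year of patents".
class YearlyJobsSketch {

    // "Map": one independent job per year. "Reduce": sum the per-year counts.
    static long totalReactions(int firstYear, int lastYear,
                               IntUnaryOperator reactionsInYear) {
        // A local thread pool stands in for volunteers' machines (assumption).
        ExecutorService pool = Executors.newFixedThreadPool(4);
        try {
            List<Future<Integer>> results = new ArrayList<>();
            for (int year = firstYear; year <= lastYear; year++) {
                final int y = year;
                Callable<Integer> job = () -> reactionsInYear.applyAsInt(y);
                results.add(pool.submit(job));
            }
            long total = 0;
            for (Future<Integer> result : results) {
                // Years share no state, so the order of completion does not matter.
                total += result.get();
            }
            return total;
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException("A yearly job failed", e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        // Made-up workload: pretend every year yields 10 reactions.
        System.out.println(totalReactions(2000, 2009, year -> 10));
    }
}
```

Because no job depends on another, a failed or slow year can simply be rerun on a different machine without touching the rest.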

So please volunteer to help with the computing – it should be fun.


 


2 Responses to #solo2010: Computing volunteers needed for Green Chain Reaction

  1. Mark W says:

    I’m no chemistry expert (more bioinformatics) but can contribute Java coding and/or computing power. Please get in touch if I can help with either!
