ACS Talk on the Green Chain Reaction: summary of what we did

This is a collection of all the blog posts that I made on the Green Chain Reaction. They give some idea of the tight deadline and the volunteering. It’s followed by a scrape of the (now-spammed) site…

 

#solo2010: The Green Chain Reaction; please get involved!

Sunday, August 8th, 2010

#solo2010: Green Chain Reaction; details and offers of help

Monday, August 9th, 2010

#solo2010: How safe / green is my reaction? Feedback requested

Tuesday, August 10th, 2010

#solo2010: Computing volunteers needed for Green Chain Reaction

Tuesday, August 10th, 2010

#solo2010: Green Chain Reaction – update and more volunteers!

Wednesday, August 11th, 2010

#solo2010: Sharing Data is Good for All of Us

Wednesday, August 11th, 2010

#solo2010: Where can we get Open Chemical Reactions?

Thursday, August 12th, 2010

#solo10: The Green Chain Reaction is becoming a chain reaction!

Thursday, August 12th, 2010

#solo10: Publishers, is your data Openly re-usable – please help us

Thursday, August 12th, 2010

#solo10: Green Chain Reaction is using an Etherpad to collect its thoughts

Thursday, August 12th, 2010

#solo10: Green Chain Reaction – much progress and continued request for help

Saturday, August 14th, 2010

#solo10: Green Chain Reaction; where to store the data? DSR? IR? BioTorrent, OKF or ???

Saturday, August 14th, 2010

#solo10 GreenChainReaction: update and What is that Chemical Compound?

Sunday, August 15th, 2010

#solo10 GreenChainReaction: Some chemical information is appalling quality, but does anyone care?

Sunday, August 15th, 2010

#solo10 GreenChainReaction: Update and continued request for help

Monday, August 16th, 2010

#solo10 GreenChainReaction: We need more volunteers – and yes, it will be fun!

Tuesday, August 17th, 2010

#solo10 GreenChainReaction: Can you spot how green the reaction is?

Monday, August 16th, 2010

#solo10 GreenChainReaction: A greenness calculator

Monday, August 16th, 2010

#solo10 An introduction to textmining and data extraction

Thursday, August 19th, 2010

#solo10 GreenChainReaction: Almost ready to go, so please volunteer

Wednesday, August 25th, 2010

#solo10 GreenChainReaction : an Open Index of patents

Thursday, August 26th, 2010

#solo10: Green Chain Reaction starts to deliver Open Notebook Science and Open Patents

Thursday, September 2nd, 2010

#solo10: Green Chain Reaction here are the first results! Make your own judgment on greenness

Friday, September 3rd, 2010

#solo10 GreenChainReaction first results and comments

Friday, September 3rd, 2010

#solo10 Immediate reactions and thanks

Monday, September 6th, 2010

#solo10 GreenChainReaction how we take this forward

Monday, September 6th, 2010

#solo10: Green Chain Reaction – cleaning the data and next steps

Friday, September 10th, 2010

 

History in SCRAPES from the Science Online site

 

General intro

 

You don’t have to be a chemist (or a practising scientist) to be involved in this experiment! The most important thing is that you offer to help. Here are some suggestions:

  • help organize and expand these pages
  • publicize the experiment
  • add your name and possible contribution here
  • help request data-mining permissions
  • read chemistry experiments and extract information manually
  • test and deploy software
  • run text-mining software to extract data from publications
  • organize the retrieved information
  • present the results in an exciting manner
  • create links
  • [add your suggestion]

Ready? Email Peter Murray-Rust and add your name to the contributors and participants below!

One of many offers: “Hi Peter, sounds a really fun project. I’m happy to help out with some Java coding. Also I have a cloud-hosted virtual machine I’m not really making much use of right now which you’re welcome to use.”

Who has volunteered

Current contributors (alphabetical, add links)

Unilever Credits

We must also give great credit to David Jessop (Unilever Centre) for having written the code which we have re-used in this experiment. David worked out how to

  • download patents and understand what comes down
  • restructure the XML into sections and paragraphs

and to Lezan Hawizy (Unilever Centre), who wrote the chemical natural language processing (NLP) tool ChemicalTagger, which structures the text and identifies the chemical “parts of speech”;

and to Daniel Lowe, who created OPSIN, the best IUPAC chemical name-to-structure converter;

and to Sam Adams, who has helped with lots of stuff including Hudson, Pivot and general insight.

ISITOPEN requests

We need people to help with legwork behind requests to IsItOpen.

IsItOpen aims to facilitate enquiries of data holders about the openness of the data they hold — and to record publicly the results of those efforts. It is inspired by What Do They Know? (whatdotheyknow.com), a site which allows citizens to make requests for information to which they have a right. It works by identifying the right place to ask, assisting users in making requests and publicly displaying the results.

The Green Chain Reaction project will be using IsItOpen to ask publishers whether we can data-mine their chemical reactions. We know we can do this with BMC, PLoS, Acta E and a few others, but we need to capture and record their formal replies. In particular it’s valuable to make sure they get courteous replies (even if they say no). So we’d like to start by asking the committed OA publishers, get “yes” from them, and then move on to the not-so-committed ones.

How far we’ll get before the date, we don’t know. We’d like to present this at the meeting and at least be able to show successes.

CODE

Anyone can contribute or develop software as long as it is Open Source. Initially we start with code from the Unilever Centre.

Code written by members of the Unilever Centre for text-mining, high-throughput computation, semantic web, etc. Most of this has been built under our Hudson continuous integration system and should be reasonably portable, but it has not been widely deployed. Early adopters should be familiar with:

  • Maven 2.0
  • Java 1.6
  • an IDE (we have used Eclipse, IntelliJ and Netbeans).

Code to be tested and deployed (please be gentle — the README hasn’t been written yet):

  • Crystaleye processor. Code to extract entries from Acta Crystallographica E. Please check out and run the tests and report problems. (DanH and MarkW have got this running – thx)
  • Test data from ActaE for CrystaleyeProcessor
  • Code to extract patent entries and analyse them. An alpha release is now on Bitbucket; it needs customising.

Getting started quickly:

CODE AND DATA FOR HACKERS

Prerequisites: Java, Maven and Mercurial.

If you are unfamiliar with these technologies then check this page for some useful tips and links.

These instructions have been tested on Linux and Windows 7

For analysing papers (output to crystaleyeTestData/html/x/y/z/…):

For patent analysis (output to patentData/EPO-yyyy-MM-dd):

To build executable jar for patent analysis, including all dependencies:

  • mvn -Dmaven.test.skip=true assembly:assembly
  • generates target/patent-analysis-0.0.1-jar-with-dependencies.jar
  • See this page for instructions on running the jar

To obtain new patent catalogs in the patentData folder:

If you want patent analysis to use self-built crystaleye-moieties then perform this command in crystaleye-moieties folder:

  • mvn -Dmaven.test.skip=true install

INSTRUCTIONS

What is going on here?

What you are going to do is download a small program that runs in Java. You almost certainly have Java installed on your computer if you have a web browser. The program reads an instruction file which tells it how to read through a list of patents that relate to chemistry. You will also need to download two further files; instructions are given below.

Why would I want to do this?

This project is attempting to answer a question by getting computers to “read” as many patents as possible, from the recent to the quite old. The question we are asking is “Is chemistry becoming more green in the processes and reagents that it uses?” To do this work we are asking volunteers to become involved by contributing their computing resources to help read the patents. No knowledge of chemistry is necessary!

More generally we are trying to demonstrate the feasibility of collecting information from across a wide range of documents that relate to science, to ask wider questions. The results of this work will be presented at Science Online London 2010 in a few weeks’ time.

Sounds great! How do I do it?

Prerequisites: Java

Instructions for analysing patents:

Latest instructions for the experienced

  1. Please always use the code from Hudson.
  2. Download the latest jar from https://hudson.ch.cam.ac.uk/job/patentanalysis/lastSuccessfulBuild/wwmm$patent-analysis/patent-analysis-0.0.1-jar-with-dependencies.jar which has been lightly tested.
  3. Create a folder, named e.g. patentData, where the index is and where the results will go.
  4. Download http://greenchain.ch.cam.ac.uk/patents/jars/parsePatent.xml into anywhere convenient – yourDir.
  5. Download http://greenchain.ch.cam.ac.uk/patents/jars/uploadWeek.xml to anywhere convenient – yourDir.
  6. Download a random patent catalogue (though pre-1990 may be lacking chemistry patents) from http://greenchain.ch.cam.ac.uk/patents/indexes/ into the patentData folder.
  7. Run “java -Xmx512m -jar patent-analysis-0.0.1-jar-with-dependencies.jar -p <yourDir>/parsePatent.xml -d <patentData>”
  8. Then run “java -Xmx512m -jar patent-analysis-0.0.1-jar-with-dependencies.jar -p <yourDir>/uploadWeek.xml -d <patentData>”

to upload the results.

THE FOLLOWING SECTION MAY BE OBSOLETE

More detailed instructions for the less confident volunteer (but check filenames against those above)

  1. Downloading the software tools and creating a working directory
    1. Open a browser and paste the following link into your address bar: http://dl.dropbox.com/u/1120779/solo/patentData.zip A download should start automatically. It might take a little while (around 40 seconds for me).
    2. Once you’ve downloaded the zip file, find it (your browser should help you with this) and unzip it. In most cases, double clicking, or right-clicking and selecting “Unzip” or something similar should do the job.
    3. Check that you have three files in the unzipped folder, they should be called “parsePatent.xml”, “uploadSolvent.xml”, and “patent-analysis-0.0.1-with-dependencies.jar”
    4. Drag the folder to somewhere convenient, like the desktop or your documents folder
  2. Second step – getting a patent index
    1. Point your browser at http://greenchain.ch.cam.ac.uk/patents/indexes This takes you to the main index.
    2. You can select any year. Probably not much point going for ones much before 1990.
    3. Then select an index. Probably easiest to right click (or click-hold on a Mac) and choose “Save target as…” Save the file into the directory with the tools that you just put somewhere you can remember. Now you are ready to…
  3. Do the analysis!
    1. Open a terminal window.
      1. Windows: In Start Menu select “Run” and type “cmd” and press return
      2. Mac: Open “Terminal” from Applications -> Utilities
    2. Navigate to your directory.
      1. On Windows or Mac, if the directory is on the desktop, try “cd Desktop/patentData”
    3. In the terminal type the command “java -Xmx512m -jar patent-analysis-0.0.1-jar-with-dependencies.jar parsePatent.xml”
    4. This should then run the extraction. Sit back and enjoy the nice warm feeling. The analysis will take between 10 and 60 minutes depending on how many patents are in the index.
    5. When the program has finished running you are ready to upload the results. At the command prompt type “java -jar patent-analysis-0.0.1-jar-with-dependencies.jar uploadSolvent.xml”
  4. All done! You can now go back to Step 2, pick a different patent index and start again…(you might want to delete all the folders and files that have been created first just to keep things clear and tidy)

History of Internet Social Networks in Chemistry: can we create a collective memory?

#greenchain #acsanaheim #quixotechem #blueobelisk

I am leading off today’s ACS program on the Internet and Social Networking http://abstracts.acs.org/chem/241nm/program/divisionindex.php?act=presentations&val=Internet+and+Chemistry:+Social+Networking&ses=Internet+and+Chemistry:+Social+Networking&prog=54108 . I intend to talk about our social experiment last year at Science Online (the “greenchain reaction”) but this is also a wonderful opportunity to create a collective memory of the last ?20 years. Memory fades (at least mine) and the Internet record decays with ruthless and terrifying speed. So here’s the idea (only made possible because we have good wifi in the centre):

We create an Etherpad (http://okfnpad.org/internetChemistry ) of the timeline of the last 20 years (1990-) and populate it communally from within the real-life audience and also wider (anyone who sees this mail or related tweets). If you have never used an Etherpad before it’s trivial. Just enter your name or alias, and start typing. I am going to seed it with some timepoints which I think are critical, and here’s the current list. I will almost certainly get dates wrong and miss people. So here’s my first pass… current pad scraped…

 History of Chemistry on the Internet 1990- 

 
 

 Emphasis on Social networks or seminal technologies (not organizational presence)

 
 

 Please enter events or technologies or resources that are seminal to the development of social networks either directly (e.g. a wiki) or indirectly (e.g. Rasmol). Commercial entities are welcomed if they contribute to the history (e.g. Chemweb) but please avoid treating this as an opportunity for product placement – e.g. “AcmeChem put its catalog online”

 
 

 Many of the dates are certainly wrong – please correct them!

 
 

1990

1991

HTML TB-L 

1992

BioMOO

1993

Kinemage?

Rasmol

Mosaic

1994

WWW1 at Geneva demos of Rasmol, etc.

Chemical MIME

Chemical Markup Language

1995

Chime?

Hyperactive molecules (HSR et al)

Principles of Protein Structure Internet Course

1996

ECTOC-1/2/3

Structure Based Drug Design Course 

biomednet?

1997

Chemweb

1998

Chemweb launch of CML

1999

2000

2001

First datument (MW,HSR, PMR in RSC)

2002

2003

2004

Internet Journal Chemistry

ZINC?

2005

Blue Obelisk

2006

2007

Chemistry in Second life

2008

Chemspider

2009

2010

Green Chain Reaction

Quixote

2011 

ACS Internet meeting

 

Please visit the pad and contribute. Before, during and after the meeting. It’s not meant to be PMR-centric.

Peter Sefton has a great tool for formatting the pad – it will look pretty after that and is a useful static snapshot

 


Open Data: latest overview. Please comment

#acsanaheim #opendata #crystaleye

I’ve more-or-less put my thoughts together for the session on Open Data. It seems to me that the key question is whether the price we pay for traditional closed data is worth it. Not just the monetary cost, but the opportunity cost – particularly in access by everyone and re-use. I’ve created a list of issues which I’d like you to think about – I have tried to be fair. If you feel strongly, please edit the Etherpad:

Overview

VERY SORRY!! I HAVE TO LEAVE AT END OF TALK AS I AM TALKING IN ANOTHER SESSION

Web-based science relies on Linked Open Data.

Topics

  • Almost no scientific data is effectively published
  • “Almost Open”, “Freely Accessible” is not good enough
  • Open Knowledge Foundation – defines Open and DOES THINGS
  • Individuals and small groups can change the world
    • Wikipedia
    • OpenStreetMap – the Ordnance Survey generates 100M GBP per year, but open maps bring 500M GBP to the economy
    • What Do They Know? (Web democracy through FOI)
    • Quixote – reclaiming computational chemistry
  • Current publishing models are asymmetric; the author and reader have few rights or influence
  • Software as an agent of political change
  • Web democracy – cf Wikipedia
  • Bottom-up Web 2.0 (The Blue Obelisk and Quixote)
  • Text and data mining
  • Panton Principles
  • Near-zero cost of robots – crystalEye
  • eTheses

Resources

  • “Open Data” on Wikipedia
  • “Open Data in Science” (Murray-Rust, on Nature Precedings: http://precedings.nature.com)
  • Science Commons
  • Open Knowledge Foundation

Recent Blogs

  • /pmr/2011/03/28/open-data-what-i-shall-say-at-acs
  • /pmr/2011/03/28/draft-panton-paper-on-textmining/

Some fallacies:

  • “You can have SOME of the data” (ACS make 8000 CAS numbers freely available to Wikipedia)
  • “The data are free for NON-COMMERCIAL use” (see my /pmr/2010/12/17/why-i-and-you-should-avoid-nc-licences/)
  • “You can always ask permission and we’ll grant it”; PMR: doesn’t scale, doesn’t persist, can’t re-use

The key question: is the price of closed data worth it? Do the benefits outweigh the disadvantages? To help you decide:

issue                      | closed data         | open data
---------------------------|---------------------|---------------------
sustainability             | supported by income | few proven models
creation of business model | easyish             | hard
added human value          | often common        | possible
support                    | usually good        | depends on community
acceptability              | well proven         | often suspicious
cost                       | high; increasing?   | marginal
innovation                 | central authority   | fully open
reuse                      | normally NO         | fully OPEN
speed from source          | often slow          | immediate
mashupability/LODD         | very rare           | almost universal
reaction to new tech.      | often slow          | very fast
comprehensiveness          | often patchy        | potentially v. high
global availability        | often very poor     | universal

 

I have started an Etherpad at http://okfnpad.org/openClosedData. Please feel free to contribute


BiomedCentral use Open Data buttons in their publications

#opendata #acsanaheim #pantonprinciples

Last night I asked Jan Kuras of BiomedCentral (BMC) whether any of their publications specifically declared their data as Open (OKD-compliant). Here’s his immediate reply:

Hi Peter

 

The following papers at BMC Bioinformatics have Open Data within the Additional Files, Tables:

 

compomics-utilities: an open-source Java library for computational proteomics

Harald Barsnes, Marc Vaudel, Niklaas Colaert, Kenny Helsens, Albert Sickmann, Frode S Berven, Lennart Martens

BMC Bioinformatics 2011, 12:70 (8 March 2011)

http://www.biomedcentral.com/1471-2105/12/70

 

A comparison and user-based evaluation of models of textual information structure in the context of cancer risk assessment

Yufan Guo, Anna Korhonen, Maria Liakata, Ilona Silins, Johan Hogberg, Ulla Stenius

BMC Bioinformatics 2011, 12:69 (8 March 2011)

http://www.biomedcentral.com/1471-2105/12/69

 

Comparing genotyping algorithms for Illumina’s Infinium whole-genome SNP BeadChips

Matthew E Ritchie, Ruijie Liu, Benilton S Carvalho, The Australia and New Zealand Multiple Sclerosis Genetics Consortium (ANZgene), Rafael A Irizarry

BMC Bioinformatics 2011, 12:68 (8 March 2011)

http://www.biomedcentral.com/1471-2105/12/68

 

So I am delighted to blog this and – at least for me – launch the OpenData button as a viable, respected and used tool in clarifying and asserting Openness.

Let’s follow one of these links:

Comparing genotyping algorithms for Illumina’s Infinium whole-genome SNP BeadChips

Matthew E Ritchie*, Ruijie Liu*, Benilton S Carvalho, The Australia and New Zealand Multiple Sclerosis Genetics Consortium (ANZgene) and Rafael A Irizarry

Bioinformatics Division, The Walter and Eliza Hall Institute of Medical Research, 1G Royal Parade, Parkville, Victoria 3052, Australia

Department of Medical Biology, The University of Melbourne, Parkville, Victoria 3010, Australia

Department of Oncology, University of Cambridge, CRUK Cambridge Research Institute, Li Ka Shing Centre, Robinson Way, Cambridge CB2 0RE, UK

Department of Biostatistics, Johns Hopkins Bloomberg School of Public Health, North Wolfe Street E3035, Baltimore, MD 21205, USA

* Contributed equally

BMC Bioinformatics 2011, 12:68 doi:10.1186/1471-2105-12-68

 

Published:

8 March 2011

Additional files

Additional file 1:

Supplemental Figures.

Format: PDF Size: 547KB Download file

This file can be viewed with: Adobe Acrobat Reader


And when opened the file contained:

[image: the BMC Open Data button]
 

And the BMC explanation of Open Data is consistent with the OKDefinition:

  Brief summary of what Open Access means for the reader:

  Articles with this logo are immediately and permanently available online. Unrestricted use, distribution and reproduction in any medium is permitted, provided the article is properly cited. See our open access charter.

By open data we mean that it is freely available on the public internet permitting any user to download, copy, analyse, re-process, pass them to software or use them for any other purpose without financial, legal, or technical barriers other than those inseparable from gaining access to the internet itself. We encourage the use of fully open formats wherever possible.

Anyone is free:

  • to copy, distribute, and display the work;
  • to make derivative works;
  • to make commercial use of the work;

Under the following conditions: Attribution

  • the original author must be given credit;
  • for any reuse or distribution, it must be made clear to others what the license terms of this work are;
  • any of these conditions can be waived if the author gives permission.

Statutory fair use and other rights are in no way affected by the above.

This equates to a CC-BY, not a PDDL/CC0, licence but its effect is to make the data completely Open.

So great kudos to BMC. They have shown the way and it’s now easy for others to follow. Just create a protocol that all scientific data is Open and label it as such. I hope we can see Open Data buttons mushrooming everywhere.


Open Data: what I shall say at ACS

#pantonprinciples #opendata #acsanaheim

I am speaking on Open Data and the Panton Principles at ACS. It’s in a session solely on Open Data. But I suspect I am the only one who is promoting Free-as-in-speech. I get to kick off, so here’s an idea of what I am going to say… I have promised (gently) not to rant. But I cannot help myself if the arguments for Openness are overwhelming.

Note for non-chemists: chemistry is a very non-Open subject. 95% of the journals are closed, and of the remaining 5% it’s difficult to get people other than zealots to publish. There is almost no Open Data. Publishers defend “their content” avidly and there have been lawsuits. About 0.01% of practising chemists know or practise Openness (a guess, but probably a reasonable one).

The principles below hold for any science but the examples are chemistry.

Why is Open Data Important?

Data-rich sciences rely on data to:

  • Confirm or disprove experiments. When the data are not published the science cannot be verified or validated
  • Act as yardsticks for other methods (e.g. computational chemistry strives to replicate experiment. Experiment is moderated by theory)
  • Be re-used (mashed up, linked) in novel ways. There are over a thousand papers describing chemistry derived from the reported crystal structure literature

Traditionally only a small amount of data was published. Now, with printing costs irrelevant, it’s technically possible to publish the whole experiment.

Moreover much science reported in text is now processible by machines. I argue (/pmr/2011/03/28/draft-panton-paper-on-textmining/) that textmining can bring huge amounts of value to science.

What is the problem?

Most data is never published at all. That is partly laziness, partly selfishness, partly lack of technology, and partly the lack of a culture of publishing.

The data that is published is often published in “textual” form. By default we are forbidden – not for scientific reasons but for legal and commercial ones – to use modern methods of textmining to extract this data. This means that the legal restrictions are holding back science, perhaps by 10 years. The effect of this is:

  • We have much less data and much less variety of content
  • The current quality of data could be much higher (machines do a good job of validation)
  • The efficiency of data creation could be much higher
  • We could detect more fraud or other questionable data
  • Data would be more immediately available (humans have a slower clock cycle)

The downside – which I am sure most closed access publishers must concur with – is that some publishers cannot support their business model. So the equation is simple:

Support closed access publishers at the cost of fewer, poorer, later data.

I don’t think anyone can disagree with this conclusion. I make no public judgment here – the choice is yours

How to proceed?

If all publishers adopt a business model of Open Data content (and this is compatible with closed access publishing) then we have a step forward. So I and others are asking all publishers to declare that the data in the publications is Open.

What is Open?

Open is free-as-in-speech (libre), not free-as-in-beer (gratis) [Richard Stallman]. Gratis means you can use something but you have no rights. Most of the free services in chemistry are gratis. They can be switched off tomorrow (and frequently have been – I can name many services which were free-to-use and now are not). With gratis material you cannot, as of right:

  • Create a derivative work – this curtails innovation
  • Rely on the material being persistent. This curtails mashups and linking
  • Publish aggregations, compute derivatives…
  • Create new ways of presenting and using the content

Libre allows all this. Free-as-in-speech is exemplified for knowledge by the Open Knowledge Foundation’s Open Definition (http://www.opendefinition.org/):

The Open Knowledge Definition (OKD) sets out principles to define ‘openness’ in knowledge – that’s any kind of content or data ‘from sonnets to statistics, genes to geodata’. The definition can be summed up in the statement that “A piece of content or data is open if anyone is free to use, reuse, and redistribute it — subject only, at most, to the requirement to attribute and share-alike.”

It’s very simple. Just remember “free to use, free to re-use, free to redistribute”. It is clinically effective at deciding whether something is Open.

So what’s the problem?

The problem is that almost no data is marked as Open. So by default its state is unknown. And in that case the default action has to be to say it’s not Open. You cannot guess, or use algorithms to determine whether something is Open. The only foolproof way is to let someone sue you and lose. That’s because it’s a legal thing, not a moral one. And algorithms and reasonableness and morals don’t work in law.

So the way round this is for content providers (I avoid “owners”) to declare that data are Open. And that is what we are asking YOU to do.

So Why the Panton Principles?

The problem is that it’s not legally trivial to declare something Open. It’s taken the OKF and Creative/Science Commons two years to work out the best way. And the best way is to dedicate the data to the public domain. That needs a licence – and we suggest either PDDL or CC0 (not any-old CC licence, but CC0). So we met over some years and finally came up with the Panton Principles: http://pantonprinciples.org/

There is no reason why all authors, publishers and funders should not endorse these principles – and some have.

So what’s the problem?

The problem is that many content providers don’t realise that this is a problem. So we’ve built a site to ask them about the Openness of their data: http://www.isitopendata.org/ . Here we can ask questions of content providers and record their answers – in public. This means that we can save them time and hassle by only asking the question once.

The answer is for content providers who wish to make it clear that their data is Open to add a licence to that data. It’s also useful to add a visual indicator such as the OKF’s Open Data button.

And where is this going?

The steps are:

  • To get content providers to consider the importance of Openness
  • To get them to make a considered decision (hopefully Open)
  • To get them to mark the content as Open
  • To get them to spread the idea.

The framework is compelling. The Panton Principles have been successfully applied to bibliography – an important part of science. And they have activated a set of Panton deliberations – discussions (audio/visual) and papers.

We need Open Data for better, quicker, more complete science. That may mean changing the business models. If so we need to think soon…

 

 


 

 

 

 


Draft Panton Paper on Textmining

#pantonprinciples

 

I am using the ACS session on Open data as an opportunity to create principles that allow textmining in science. With Jenny Molloy and Graham Steel we are creating a draft Panton Paper, on an Etherpad at:

http://okfnpad.org/PPDataTextMining

Please feel free to contribute (it’s trivial to edit, but please leave your identity). I’d also like to show the power of the Etherpad …

 

 

Panton Paper for text and data Mining

 

Co-authors: Graham Steel and Peter Murray-Rust

 

PLEASE CONTRIBUTE

 

I am using this as a basis for my talk at the ACS on “Open Data and the Panton Principles”. I think text-mining is one of the biggest problems in scientific data, so this is a good excuse to air it and to argue for Open Data.

 

Background:

 

Scientific articles (papers) are the commonest and most highly valued ways of transmitting science. It has been accepted for at least 130 years (Beilstein, organic chemistry) that it is valuable to extract data from scientific articles and republish it without permission of the original author. This has led to countless review articles where factual data in primary sources are summarised and critiqued in secondary articles (reviews).

 

Factual data in articles occurs as:

  • tables
  • graphs (e.g. plots of X against Y)
  • numbers embedded in running text (“the melting point of benzene is 5 Celsius”)

 

Articles are governed by copyright and this restricts their re-use. The exact details are not – and never will be – precisely specified, but we can generally assume that copyright holders could (not necessarily would, but still could) take action on:

  • copying the whole article even for colleagues and collaborators
  • copying any diagram for re-publication (graphics are “creative works”). Copyright holders have objected to the re-publication of a graph, even to make a valid scientific point for which the graph was necessary
  • republishing paragraphs of text, even for scholarly purposes.

 

A comment on “fair use” (fair dealing). See http://en.wikipedia.org/wiki/Fair_use This is NOT applicable outside the US so is of little global value.

 

Scientific authors do not expect (or receive) payment for their articles, and almost all reviewers give their services freely (although there may be costs involved). There is no ethical or utilitarian reason for restricting the re-publication of science except the need to provide publishers with income. This paper does not debate the ethics of this (“Open Access”) but is confined to the right to extract data from non-Open material.

 

Text-mining

 

We take as agreed that a human may extract data from an article to which they have legitimate access, and that they may republish this without further permission.

 

Text-mining is the use of machines to extract data and other information from articles (as opposed to extracting it from databases). The technology can allow high precision/recall rates (> 90%) making it very useful as a way of reading and systematising the primary literature. There are various aspects to TM:

  • information retrieval. Classification of documents, either supervised (into predetermined categories) or unsupervised (e.g. cluster analysis)
  • information extraction. Extraction of information from subcomponents of the text of a document. A common approach is Named Entity Recognition, where words and phrases may be identified as people, places, species, chemicals, etc.
  • sentiment and argumentation. More general (and harder) interpretations of the role of the article (or parts of it) – “we believe that”, “this is incompatible with”, “this article has been discredited”

Note that if this is done automatically the machines often have to be “trained” by giving them examples of material which have been classified by humans (annotators) as positive or negative. Machines are not perfect (but nor are humans – our work shows interannotator agreement of 90% with machines not much behind).
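To make the information-extraction step concrete, here is a toy dictionary-based recogniser for chemical names (a minimal sketch under obvious simplifications; real systems such as OSCAR use trained models and large lexicons, not a three-word list):

    import java.util.Arrays;
    import java.util.List;

    // Toy dictionary-based named-entity recognition: tag tokens that match
    // a tiny chemical lexicon. Purely illustrative; not OSCAR or ChemicalTagger.
    public class ToyChemNer {
        private static final List<String> LEXICON =
                Arrays.asList("benzene", "acetone", "ethanol");

        public static void main(String[] args) {
            String sentence = "the melting point of benzene is 5 Celsius";
            for (String token : sentence.split("\\s+")) {
                String tag = LEXICON.contains(token.toLowerCase()) ? "CHEMICAL" : "O";
                System.out.println(token + "\t" + tag);
            }
        }
    }

A trained recogniser replaces the lexicon lookup with a statistical model – which is where the annotated training corpora mentioned above come in.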

 

We argue that any analysis of a document that can be freely published by a human can also be freely published by a machine; that copyright refers to the precise wording and formatting of the document, not to the abstract ideas or facts published in it; and that copyright can only be violated by quoting or reproducing chunks of the original verbatim.

 

The extraction of information does not normally require the verbatim quotation or reproduction of diagrams and so does not per se violate copyright. And it is permitted for human readers of all sorts of material – books, films, as well as scientific articles. If, however, machines are used then the process is “forbidden”.

 

There are many reasons why machine information extraction is valuable to science:

  • Humans cannot keep up with the volume of literature
  • humans cannot always keep up with correct terminology and usage
  • the information is complex and specialised and there are not enough human experts
  • the information requires many different resources (e.g. geo-locations, gazetteers, online algorithms, etc.). As an example, interpreting the geo-location of chemicals requires an expert in both fields
  • algorithms and processing are required (a simple example is conversion of units – Celsius and Fahrenheit for weather)
  • machines can monitor trends with time and place (and many other variables)

 

Aspects of forbidding of TM:

The provision of journal articles is controlled not only by copyright but also (for most scientists) by the contracts signed by their institution. These contracts are usually not public. We believe (from anecdotal evidence) that there are clauses forbidding systematic machine crawling of articles, even for legitimate scientific purposes.

 

There are many serious consequences of forbidding text-mining:

  • new scientific relationships are not discovered. This is particularly common in biomedical science, where searching the literature is as important as doing new research
  • “wrong” science (for whatever reason) is not detected (machine analysis of data is very powerful here)
  • the text-mining tools cannot be adequately published (acceptable practice is that the training corpus must be freely available – not just DOI references)
  • products of text-mining (e.g. classifications and lexicons) are themselves valuable for the next cycle of research and these derived works cannot be published
  • innovation in textmining itself and in TM for science is held back

 

The primary resource on which textmining relies is published science. 

 

We assert that there is no legal, ethical or moral reason to refuse to allow scientists to use machines to analyse the published output of their community.

 

 

 

 


Online resources for Chemical Education: Open Semantic Resources

Robert Belford invited me to present at the ACS “Online resources for Chemical Education” on Monday 2011-03-28 (tomorrow). I’m meeting Robert soon at the CHED reception but here’s what I plan to say and the resources. I may change things as a result of that…

The first message is that we must assume that EVERYTHING IS OR WILL BE ONLINE. That’s not a world-shattering thought, but it is not common among educators. Online resources are becoming BETTER than printed ones. And many of them – an increasing number – are Open. That’s important because it means that they can not only be used freely, but also re-used. If you want to use a graph, a table, a diagram, you normally have to ask the authors’ permission. And sometimes they say yes, sometimes they just don’t reply. Whereas Open means you can re-use anything you like, whenever you like, for whatever purpose. And if you feel it can be improved, then just improve it.

So I shall present a few resources that we have created. That’s not egocentric, because there are many others in the session presenting their resources. Including Wikipedia, which is destined – inexorably – to become the central reference work for education and research.

So I shall present:

  • Chem4Word (more accurately the Chemistry Addin for Word). This addin is Open Source (though Word is not… ). You get it at OuterCurve – the Microsoft equivalent of Apache. A collection of Free/Open software developed in a Microsoft culture. Here’s an announcement http://www.outercurve.org/Blogs/EntryId/26/Chemistry-Add-in-for-Word-tackles-long-term-challenges-for-compound-document-creation-and-manipulation which gives you the OuterCurve site and how to download the software.

     

    I’ll be showing the role of Chem4Word as an environment for teaching and learning. You can create documents with Word and embed molecules in them. We plan to develop this as a tool for student documents including lab reports. The results can be validated and so pre-marked by machine.

     

    You can use Chem4Word as the primary access to resources. There are currently two:

     

  1. Import via OPSIN. This translates IUPAC names into chemical structures through a webservice (http://opsin.ch.cam.ac.uk). C4W is an example of how we can customize such services to make them almost transparent to the user. (For the technical people, the OPSIN service includes REST and content negotiation; a sketch of calling it appears after this list.)
  2. Import from Pubchem. This requests a search from the NIH’s Pubchem. Generally the first compound in the hit list is the most useful (the most commonly requested)
  • Crystaleye. Our fairly comprehensive Open/Libre resource (http://wwmm.ch.cam.ac.uk/crystaleye) which trawls the web every night for new crystal structures (inorganic as well as organic). It’s likely to contain exemplars of all common chemical structures.

    I’ll try to investigate the following question tomorrow – “what determines the length of an As-Cl bond?”. We’ll use the bond-length tool in Crystaleye to look for very long ones and very short ones and see if there is a pattern.

    The Crystaleye approach leads to the Quixote/Chempound approach for collecting chemical computations. It should then be possible to provide a resource with many common molecules computed at different levels of theory. Sam will be presenting Chempound on Wednesday so I will just hint at what he’ll reveal.
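As mentioned in the OPSIN item above, here is roughly what a programmatic call to the name-to-structure webservice looks like (a sketch only: the /opsin/<name>.smi URL form is my assumption about the REST interface – check http://opsin.ch.cam.ac.uk for the current forms):

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.net.URL;

    // Sketch: ask the OPSIN webservice for the SMILES of an IUPAC name.
    // The URL pattern below is an assumption; see the OPSIN site for details.
    public class OpsinCall {
        public static void main(String[] args) throws Exception {
            String name = "propan-2-one"; // acetone; must be URL-safe
            URL url = new URL("http://opsin.ch.cam.ac.uk/opsin/" + name + ".smi");
            try (BufferedReader in = new BufferedReader(
                    new InputStreamReader(url.openStream(), "UTF-8"))) {
                System.out.println(in.readLine()); // e.g. CC(C)=O
            }
        }
    }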

More general ideas – web resources alter the balance between “teacher” and “learner”. Facts are now universal and free, so the role of the teacher changes from a feeder-of-facts to a mentor. That’s disruptive technology at work. It will be beneficial, but it will be painful for many.

 

 

 

 


My Presentations at the American Chemical Society

#quixotechem #greenchain #scholarlyhtml #oscar4 #pantonprinciples

Am attending the 241st National Meeting of the ACS at Anaheim (read Disney). In case you want me I am at the Sheraton Park. I have a hectic program, being author of 6 presentations and giving 4 (there used to be a rule that this wasn’t allowed, I think, but the sky won’t fall in). So here’s the program… (CHED = Chemical Education; CINF = Chemical Information; COMP = Computation)

  • Sunday (today): prepare talks for tomorrow. Go to CHED Reception (Disneyland Pier Hotel). We are trying to find a wifi dongle top-up card as we don’t trust conference wifi (apparently it’s free this year)
  • Monday: 0910 CINF Open Data session, Convention Centre Room 202A. I’m talking about the Panton Principles. It’s hard work promoting Openness at the ACS. There are several talks in this session; I think I am the only Libre presenter. One is not very obviously Open (looks commercial), one is defending the need to charge for scientific data and one is open-but-NC as far as I know. I’ll blog this later. (I normally blog my talks as a record for the attendees, because I don’t use PowerPoint: I don’t know what I am going to say before I start, and PowerPoint corrupts thought processes.) I don’t expect many converts (of course BMC is already Libre, but not many people publish chemistry (yet) in BMC). Let’s see if we can change that.

    1055 CHED Online Open Semantic Resources for chemical education. I apologize to the previous speakers as I am in CINF. They are: Bob Hanson (Jmol), J.Bradley (Jean-Claude?), Henry Rzepa (iPads), Tony Williams (Chemspider), Robert Belford (WikiGlossary), Martin Walker (Wikipedia), then me. I am talking on online semantic resources, which include Chem4Word, Crystaleye and the OSCAR/OPSIN/ChemicalTagger triptych. I am rather gutted to miss the first bit, but both sessions run in parallel and they are a few hundred Mickey Mice apart, so it takes time to commute.

    Evening: CINF party (I think) and then (I think) The Blue Obelisk dinner at Gumbo Shrimp (I think). Also have to prepare a talk for the next day.

  • Tuesday CINF. All-day session run by Henry and Steve Bachrach: Internet and Chemistry, social networking. I kick off with the Green Chain Reaction (a collaborative project run on this blog and completely Open in all senses). Then Marcus Hanwell presents Quixote at 0950. It was going to be Jens Thomas, but he is leaving Daresbury and off to see the world.

    At this stage I can probably start to wind down a bit.

  • Wednesday CINF. Internet and Chemistry, social networking continues. Steve and Henry have done very well to get two days of contributed talks. I must admit that when they asked me I wasn’t bursting with enthusiasm. We’ve been down this route before and the enthusiasm of the late nineties has disappeared. However I thought that if I was positive, and so were others, then we might get a critical mass. And I keep telling people that we have to keep believing. So two days of contributions is very gratifying. Mainly people I know, but that’s what you expect in a session on Internet and social networks. Anyway Sam Adams is talking at 1340 on the Clarion project, which will include his new Chempound repository system. I’ll write more later…
  • Thursday COMP. I put in a presentation on our collaborative approach to CompChem information. A mixture of Chempound and Quixote. It’s called Memex for Computational Chemistry. At 1145. The idea is that it is (not will be) possible to forget what work you have done and let the computer discover it for you. An imaginative assistant that has the whole past knowledge of the subject.

    My talk is preceded by one with the intriguing title “withdrawn”, so the room will be empty to start with. It’s a real challenge for me to fill a room with the last talk of the meeting, on the last day, after a withdrawn paper. But I am sure that people are already changing their plane reservations. Put in your spreadbets on the attendance count. (One very well known compchemist got 3 attendees (apart from himself and the chair) the last time he presented – his wife, and two young ladies who were looking for a different session.) But the show has to go on. I plan this talk to have the same impact as the original Memex, so make sure you are there!

Wompf! And then off to Seattle to talk with friends at Microsoft Research and write a blog on the Fourth Paradigm that I have been doing for the last “little while”. And then to PNNL for NWChem, OREChem and then back.

To the launch of OSCAR4 on April 13th. Which you should register for now… (I’ll blog this later)

 

 


Extracting data from scientific calculations and experimental log files

Many scientists have to work with data produced by programs written in the era of FORTRAN IV, which produced a mindset of punched cards for input and lineprinters for output. Both of these were designed primarily for humans – the human punched the cards, fed them into the machine and got fanfold paper out. (I actually go back to when programs were keyed in on toggle switches and paper tape, but…). The output was expected to be read by humans and – if there was something useful – it could be typed up again. This was good for the machine, but not much fun for the human.

Not much has changed. In computational chemistry much of the information is only available in log files (page-oriented, 80/132 column). Countless scientists in 2011 still retype data from logfiles.

The problem is that the files are machine-readable (ASCII if you are lucky) but not machine-understandable. Here’s a typical chunk from computational chemistry (I HOPE WordPress keeps the formatting on your browser).

———————————————————————

Rotational constants (GHZ): 0.8124822 0.2870678 0.2733351

Standard basis: CC-pVDZ (5D, 7F)

There are 312 symmetry adapted basis functions of A symmetry.

Integral buffers will be 131072 words long.

Raffenetti 2 integral format.

Two-electron integral symmetry is turned on.

312 basis functions, 678 primitive gaussians, 330 cartesian basis functions

64 alpha electrons 64 beta electrons

nuclear repulsion energy 1312.3003184698 Hartrees.

NAtoms= 30 NActive= 30 NUniq= 30 SFac= 1.00D+00 NAtFMM= 50 NAOKFM=F Big=F

——————————————————————————

 

Some of this information is important – some less so. I have no idea what some of it means. Some of the text is meaningful and varies; some is boilerplate. How do we get the information out? Let’s try:

nuclear repulsion energy 1312.3003184698 Hartrees.

 

You can “grep” it (i.e. run a UNIX-type search over it). That works for small amounts of discrete information. It works in most cases but is highly fragile.

You can write a Python program to read this stuff. (But only if you are a Pythonista). And that suffers from lack of scale, difficulty of maintenance and non-transparency to other users. But it’s probably the commonest approach.
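To illustrate (in Java rather than Python, since Java is what our own code uses), a hand-rolled extractor for the single line above might look like this – a sketch, not part of JUMBO:

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    // Hand-rolled extraction of one quantity from a compchem log file.
    // Fragile: any rewording of the line by the program breaks the pattern.
    public class NreGrep {
        private static final Pattern NRE = Pattern.compile(
                "nuclear repulsion energy\\s+([-+]?\\d+\\.\\d+)\\s+Hartrees");

        public static void main(String[] args) throws Exception {
            try (BufferedReader in = new BufferedReader(new FileReader(args[0]))) {
                String line;
                while ((line = in.readLine()) != null) {
                    Matcher m = NRE.matcher(line);
                    if (m.find()) {
                        System.out.println("nuclearRepulsionEnergy/hartree = " + m.group(1));
                    }
                }
            }
        }
    }

Multiply that by hundreds of quantities and more than twenty programs and the maintenance problem is obvious.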

Or you can use a framework designed to extract this sort of information. I have been trying to find one for 10 years (yes, I have asked on Stack Overflow and Googled, so if there is one it’s well hidden).

So I have been forced to build my own. (If you tell me there is a better solution I’ll be delighted.) It’s a declarative approach. That means the user doesn’t have to understand Java, or Python, or any procedural language. They create a set of instructions that define the output. I use XML because XML is my golden hammer, but a purist would use LISP. So the declarative approach asserts what the result should be, based on the input:

<record id="nre"> nuclear repulsion energy {F20.10,compchem:nuclearRepulsionEnergy,unit:hartree} Hartrees.</record>

That’s it. The F20.10 means a floating-point number of width 20 with 10 decimal places (standard FORTRAN notation). The compchem:nuclearRepulsionEnergy denotes an entry in the compchem dictionary (more about dictionaries later) and the unit:hartree declares the units. The dictionary will check that Hartrees are energy units. The result looks like:

<scalar dictRef="compchem:nuclearRepulsionEnergy" units="unit:hartree" dataType="xsd:double">1312.3003184698</scalar>

This is held in an XML DOM and can be searched and processed by a wide range of tools.

The reason I have written a framework is that I need parsers for all the compchem programs (>20). A conventional approach would take lots of programmers with fragmentation of effort. The declarative approach is much quicker and almost anyone can use it. (You may have to learn some simple regexes but that’s all).
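To show the principle (a toy sketch of the idea only, not the actual JUMBOConverters implementation), a typed field in a template can be compiled to a regex and a match emitted as a CML scalar:

    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    // Toy version of the declarative idea: the {F20.10,...} field in the
    // <record> template becomes a float-matching group; a match is emitted
    // as a CML-style <scalar> carrying the dictionary reference and units.
    public class TemplateSketch {
        public static void main(String[] args) {
            String dictRef = "compchem:nuclearRepulsionEnergy";
            String units = "unit:hartree";
            Pattern p = Pattern.compile(
                    "\\s*nuclear repulsion energy\\s+([-+]?\\d+\\.\\d+)\\s+Hartrees\\.");
            String logLine = " nuclear repulsion energy 1312.3003184698 Hartrees.";
            Matcher m = p.matcher(logLine);
            if (m.matches()) {
                System.out.printf(
                        "<scalar dictRef=\"%s\" units=\"%s\" dataType=\"xsd:double\">%s</scalar>%n",
                        dictRef, units, m.group(1));
            }
        }
    }

The real framework generalises this: templates are data, so adding a new quantity means adding a line of XML, not a line of Java.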

Yesterday we proved that JUMBOConverters works, and that people who hadn’t seen it before could pick it up quickly.

So I’m writing this in case there are groups outside compchem who need to parse “lineprinter” output. Cameron Neylon has given me two files, and writing parsers for them should only take minutes. I’ll show you this in later posts.

JUMBO is, of course, Open Source.


Peter Sefton goes forth with ScholarlyHTML; we’ll meet again

 

Peter Sefton (PT) is leaving Toowoomba – http://ptsefton.com/2011/03/24/onwards.htm. We talked about it while he was here but the news was only public this week. PT has been part of our past and will be part of our future – don’t know where, don’t know when, but it’s inevitable. The web brings people together and keeps them together.

PT has had an almost unique group in academia – a service group with satisfied customers that was also able to innovate. The group – which developed a new generation of authoring tools and document/information management – was of critical mass and critical quality. It was run very efficiently, without the political overhead of the large educational infrastructure projects that often go so slowly.

We worked last week with PT to develop scholarlyHTML (/pmr/2011/03/20/scholarly-html-%E2%80%93-latest-thoughts/ ) – the philosophy, the tools, the culture, the content. It needs all of those. PT’s group has developed tools that are beyond anything else the world has. They help us usher in a new approach to scholarly communication – humans and machines.

The great thing is that the tools are Open. That means that wherever PT is, he – and we – can work with and on those tools. They aren’t tied down by restrictive regulations and licences. They are his gift to the world and possibly a means of support.

It’s not easy changing the world. There is no guarantee of success – and success requires hard work – often extremely hard work – at unpleasant hours and with immediate deadlines. There is the constant reworking – the demos of two years ago do not work today – they need refactoring. But each refactoring, tedious though it is, leads to increased quality.

Yesterday I felt that after 16 years CML had finally “made it” (/pmr/2011/03/23/quixote-cml-is-now-an-infrastructure-for-computational-chemistry/ ). There will come a time in the next few years when ScholarlyHTML – PT’s vision – will have “made it”. We cannot predict where or when, but I am certain its time will come, just as I felt CML’s would.

Now is the time to believe and work towards it.

 
