RSC: What I and others said; and let's unleash the robots

Christoph Steinbeck has given a huge account of the RSC meeting so I don’t have to say anything. Here’s what I seem to have said (as I have no PowerPoint or linear trajectory, my talks do not follow a set course and are determined in part by audience reaction).

Open Access Publishing in the Chemical Sciences

I was invited to give my views on some new chemistry in European Bioinformatics at a meeting of the CICAG group of the Royal Society of Chemistry, held at Burlington House, London.
Peter Murray-Rust set the scene by emphasising the importance of Open Data. He showed some fantastic work on data extraction by OSCAR from theses, where his group had parsed a synthetic chemistry thesis into an interactive graph of a reaction network. He also showed an SVG animation of this graph as a reaction sequence, all automatically generated from an OSCAR run. Peter pointed out in the subsequent discussion that data cannot be copyrighted, which was acknowledged by all publishers in the audience. The reality is different, however, because publishers’ licenses often prevent downloading of more than a few articles in a row. Detection of a robotic download for text mining comes with the danger of the whole university being disconnected. It is unclear to me how robotically parsing papers and extracting data would damage the business model of publishers. It could, of course, lower the number of subscriptions from

PMR: The main thing we took away was the importance of factual data. No-one disputed that facts could not be copyrighted (though not all realised that copyright was only one of the methods used by publishers to control access and re-use – server-side beheading is completely effective). I asked the audience – > 30 composed of publishers, librarians, software companies, etc. – no actual chemists of course – whether anyone would object to our robots reading the literature and extracting the data from the papers whether as text, images or tables. Half the audience thought I should, the rest didn’t vote against.
So, publishers, I’m going to start mining data from your sites. I hope you welcome this as a way forward to a new exciting era of data-rich science publishing. I hope that if you don’t agree you’ll let me know. I wouldn’t like to start and then get the lawyers sent. So please comment – it’s very important. I shan’t attack anyone who sends a reply. And you can send it by confidential email if you like.
There are a million new compounds each year in the scholarly literature. Our robots can produce huge amounts of good information from it. In some cases we get over 90% recall and precision – it depends on the type. This must be good for science. So please, publishers, let us know we can do it and we’ll publicly thank you. And if you don’t like the idea, please let us know why.
I’m in Barcelona at COST D37 helping to develop computational tools using CML. It is really changing the way things are done.
More on both fronts later.

