petermr's blog

A Scientist and the Web


Hackday 2014-06-19 in Edinburgh – a radically new approach to Scholarly Communication in the Digital Enlightenment

June 13th, 2014
Summary: Help us change the way we communicate Science and the Humanities in the Digital Enlightenment. Free [1]. EVERYONE can help. Edinburgh is the capital of the Scottish Enlightenment, where free thinkers changed the way we think about and run the world. Next week (June 19th) we’ll be running a hackday to change the way that we communicate Science and the Humanities.

For 400 years we have relied on the “printed journal” and “articles” (e.g. “PDFs”) and now we’re doing something completely different. Authors should be able to do what *they* want and readers should be able to read in the way *they* want. And readers aren’t just lecturers, they are 4-year olds, patients and machines. 4-year olds LOVE DINOSAURS.

We’ve built most of the basics. We are going to:

  • SCRAPE material from PLOS (and other Open) articles. And some of these are FUN! They’re about DINOSAURS!!
  • EXTRACT the information. Which papers talk about DINOSAURS? Do they have pictures?
  • REPUBLISH as a book. Make your OWN E-BOOK with Pictures of DINOSAURS with their FULL LATIN NAMES!!
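The three steps above can be sketched in a few lines of Python – the DOIs and article text here are invented placeholders, not real PLOS content, and a real run would use quickscrape’s output as input:

```python
# SCRAPE stand-in: pretend quickscrape has already fetched these articles.
# Both DOIs and text are invented for illustration only.
articles = {
    "10.1371/example.0000001": "A new dinosaur, Tyrannosaurus rex, is described.",
    "10.1371/example.0000002": "Methods for cloning mosquitoes are reported.",
}

# EXTRACT: which papers talk about dinosaurs?
dino_papers = {doi: text for doi, text in articles.items()
               if "dinosaur" in text.lower()}

# REPUBLISH: assemble a minimal e-book "chapter" per matching paper.
chapters = ["<h2>%s</h2>\n<p>%s</p>" % (doi, text)
            for doi, text in sorted(dino_papers.items())]

print(len(chapters), "chapter(s) assembled")
```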

[I'm serious about the 4-year olds. I have two high quality data points where 4-year olds LOVE Binomial names. This hackday is NOT designed for kids... but future ones maybe]

For the Techies:

  • Ross Mounce has zillions of Open DOIs about dinosaurs (i.e. a list of papers).
  • Richard Smith-Unna has built the world’s latest and greatest scraper (quickscrape) for journal articles. Anyone who can edit a file can learn to use it in 5 minutes.
  • Peter Murray-Rust and friends have written AMI which can extract many types of information from articles. The simplest method is regexes, but we can do phylogenetic trees from diagrams, chemistry and much else. All in a giant Java Jar. This can filter out either the articles you want or just the bits you want!
  • Peter Sefton has built scholarly authoring systems that academics actually want to use!! We’ll probably use eBook technology which can reassemble the bits that AMI has found and you want to read. All the adverts are gone! We can make ebooks for a given subject, or today’s publications, or methods for cloning mosquitoes or all the graphs about climate change…
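To make the regex point concrete, here is a deliberately naive Python sketch of the kind of pattern AMI might start from for Latin binomial names – real pipelines check candidates against species dictionaries precisely because a bare regex over-matches:

```python
import re

# Naive shape of a Latin binomial: capitalised genus + lower-case epithet.
BINOMIAL = re.compile(r"\b([A-Z][a-z]+ [a-z]{3,})\b")

text = ("Specimens of Tyrannosaurus rex and Triceratops horridus were "
        "compared. The matrix included 45 taxa.")

matches = BINOMIAL.findall(text)
# Note the false positive "The matrix" - this is why dictionaries matter.
print(matches)
```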
In hackdays YOU decide what you want to do, find friends and explore. You might create something wonderful or you might just have fun.
YES! Edinburgh has DINOSAUR skeletons.
Mark writes:
“Room 1.15, Informatics Forum, University of Edinburgh, George Square, Edinburgh”
This room has tables and seats for 12 people comfortably, and another 8 folding seats for people to dot around – I was not sure how big we were aiming, but the Forum also has a fair bit of open space if we need to de-camp some people. There is a computer and a projector too, and a whiteboard.
I should keep track of how many people plan to attend, to make sure we have space. So, could we add the following to the summary:
“If you would like to join us, please email to confirm attendance”
[we think Cameron said food provided by PLoS! - we're checking ]

We launch The Content Mine in Vienna, Interviews, Talks and our first public Workshop

June 13th, 2014

Last week was one of the most exciting in my life – but also among the hardest I have worked. I travelled from Budapest to Vienna to be the guest of the Austrian Science Fund (FWF) and to give a lecture. I changed the title to “Open Notebook Science” in honour of the late Jean-Claude Bradley and to promote his ideas. My talk’s on Slideshare: [].

Before that I had given two interviews – one to ORF ( ), the Austrian public broadcasting network Österreichischer Rundfunk. Here’s the interview – I haven’t seen a translation, but web translators give a reasonable version. I explained why science was important beyond the walls of academia and why we needed to liberate scientific knowledge.

Then the “launch” of The Content Mine ( ), my Shuttleworth Fellowship project, which aims to extract 100,000,000 facts from the scientific literature. The philosophy is not that *I* do this but that *WE* do this. To do that we have to:

  • have reliable, compelling, distributable software. That’s hard. But we’ve got one of the best small teams in the world – it would be hard to think of a better one. That’s because we are developer-scholars – we are not only very experienced in the coding and design of information, but we are also experts in our own right in our fields (Chemistry, Phylogenetics, Plant Genetics, and Informatics/ScholarlyPublishing). That means we know where we are going, know what works (or rather what *doesn’t* work!) and know who else in the world is doing similar stuff. And because I’m funded by the Shuttleworth Foundation there’s a guarantee that we won’t get bought by Elsevier or Macmillan or Thomson-Reuters. I wouldn’t swap any of the team for ten million dollars – that’s how important they are to my life.
  • show YOU how to become part of US. The goal is to create a community. We’re in very good touch with Wikimedia, Mozilla, Software Carpentry, OpenStreetMap, Open Knowledge, Blue Obelisk, Apache, so our community will be recognisable in that environment. And also think of WellcomeTrust, Austrian Science Fund, RCUK, NIH, to get a feel for how we relate to science funders. We’ve only been going 3 months so we want to see a community evolve rather than design it prematurely. When it’s strong and energetic it will start to suggest where we should be going organisationally. We also work closely with domain repositories such as PubChem, EuropePubMedCentral, Treebank, Dryad, Crystallography Open Database, etc.
  • At present we are reaching out through workshops. We’re doing several this summer – Edinburgh, Berlin/OKFest, Wikimania, OK Brazil, and one or two more yet to be finalised. We’re informed by the Software Carpentry philosophy, where we run a workshop for a sponsor and, during the workshop, train apprentices. These apprentices will then be able to help run new workshops and later perhaps their own. So although Michelle and I ran this workshop, there will be later ones with different leaders.

So we ran our first public workshop on 2014-06-04 at the Institute of Science and Technology Austria (IST Austria). We advertised it as:

Workshop with Peter Murray-Rust and Michelle Brook: “Can we build an intelligent scientific reader?”

Venue: IST Austria, Am Campus 1, 3400 Klosterneuburg
Time: 4th of June 2014, 10:00 a.m. – 4 p.m. (ballroom)
Participants: 10 places are still available (first come, first served)
Registration: send an email (incl. first name, surname, institution, email) as soon as possible, and by 30/5/2014 at the latest, to

Workshop Description
The workshop will be suitable for anyone interested in biological science and not frightened of installing and running pre-prepared programs and data (following written guidance and with support from those present in the room). The aim is to introduce computational methods for processing scientific papers, enabling analysis of multiple papers in a rapid fashion. These techniques include how to download multiple files, extract concepts and facts from the literature and figures, using Natural Language Processing and Computer Vision.

Technical expertise required
Very little expertise is required beyond general use of a computer. Much more important is a willingness to learn and experiment. However we will ensure options are made available for those who are confident/technically able, including providing opportunities to develop their own tools for analysis.

We got 18 brave people, mainly compsci but also bioscientists, and it went well. Michelle is getting formal feedback. We’re hard at work taking our own criticism on board (Michelle collected a very thorough set of observations). It was hard work, but we now know we can do it and it works. The main emphasis was on understanding the concept (with highlighter pens and paper!), scraping, extraction, and how to work as a community. We’ve got attendees who want to follow up on how they can use it! That’s the philosophy.

Then the next day there was an all-day hack run by OKFN Austria (Stefan Kasberger and Peter Kraker, a Panton Fellow) – a wonderful hackspace (metalab), couches, soft drinks on an honour payment, bits of kit lying around, graffiti – you know the sort of thing.

And then at the end 4 invited speakers (including PMR). We are very impressed by OKFN Austria – the day drew perhaps 25 people. And a lovely city.

But Exhausting! At the end I crashed for a long night. (In writing my Shuttleworth Quarterly report I was asked “What was your greatest loss during this quarter?” Answer: SLEEP!)

Much more to come – a hackday in Edinburgh next week to be announced later today.


My MPs say “You can ignore Elsevier’s TDM click-through API and we urge your library to do so too”

June 11th, 2014


A little while ago I wrote to Minister David Willetts through my MP Julian Huppert on two issues:

  1. Elsevier’s misselling of Open Access Articles (later described by Elsevier as their “bumpy road to Open Access”)
  2. Elsevier’s unnecessary click-through API, which would constrain researchers and get them and libraries to sign away their rights.

Today I have got a reply on both points which I reproduce below.

1) TL;DR They’ve talked with Elsevier about the bumpy road (i.e. charging people for Open Access). You’ll have to read between the lines as to what was actually said, but it might be “David, we’re terribly sorry, grovel grovel [1]”

2) They held firm and said “yes, the point of the law was that researchers could mine facts (etc.) without having to sign publisher APIs”. “Yes, PMR has a right to do it and you can’t stop it”. After all, if they didn’t say that, what’s the point of the law? Elsevier and the other publishers have lost that battle and should move on.

Just in case any other publishers think the message wasn’t clear, here it is. So thank you very much David and Julian. You have worked hard and consistently for that. And I and other researchers in the UK will show that your effort has unleashed a massive potential for increasing wealth, human well-being and enhancing the status of the UK. I’ll be blogging on that RSN.




I have redacted my address so that GCHQ can’t say where I live and tell the NSA. (Ha!)




TL;DR Elsevier are very slowly responding to my criticisms. It seems the more money a company takes in, the harder it is for them to get their systems right. Good that they encourage Gold OA; Bad that they exercise no price control; Ugly that they think “Access to Research” is more than a cosmetic gesture. (That’s the one where citizens can cycle through the snow to their nearest library, have an hour to read a dumb screen, cannot cut and paste, cannot copy and cannot print; what we want is legal access over the Internet, not some Charles Dickens stupidity).

NOW the more exciting part…


TL;DR. A UK academic has the legal right to carry out TDM for non-commercial purposes unless THEIR LIBRARY stops them by agreeing to act as publisher police. And, LIBRARIES, the government is making sure you know this. Ignorance is unacceptable. So why might you sign? The publisher might sweet-talk you into it, just like washing-machine salespeople sell you “insurance” that is worse than your current legal rights. Remember PPI? Click-through licences are as honest as mis-sold PPI. They’ll offer you a “better price” if you agree to constrain your researchers.

The carrot for not signing is that your researchers will thank you and praise the library for freeing them from needing Elsevier’s agreement to their research projects. You will have a warm fuzzy feeling that you have stood up for freedom. Libraries are more important to researchers than publishers!

The stick… you can’t hide. The FOI flying squad can find out whether you’ve signed the click-through or other TDM restrictions. Resistance is futile. No “it’s too difficult to tell you”, “we can’t find our contracts”, etc. There’ll be a giant UK spreadsheet (promise!) with your institution on it.

It’s easy. When a publisher salesperson comes to you mumble the mantra: “Yes to TDM; no to click-through”. They’ll try anything, but use the Force and be strong.

[1] a well-known parliamentary expression

Elsevier’s new API approach to Content Mining should be avoided by all Librarians

June 10th, 2014

Yesterday Elsevier updated its approach to Text and Data Mining. This is a rapid response. Elsevier’s material is at  and is italicised here. My emphasis in Elsevier’s text is [thus]. My comments are interleaved.

TL;DR [summary] Elsevier’s new approach is unnecessary and should be avoided by all libraries and researchers.

Some arguments below suggest that mining is better and easier with Elsevier’s APIs. This is an untested assertion. There are many Free and Open tools that can mine content. Unix tools are quite satisfactory and we have developed Free and Open content mining tools at . 
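As a toy illustration of that point, counting species mentions across papers you have already lawfully downloaded needs nothing beyond ordinary scripting – the filenames and text below are invented, and a one-line grep pipeline would do the same job:

```python
import re
from collections import Counter

# Invented stand-ins for locally saved full texts.
papers = {
    "paper1.txt": "Panthera leo populations in West Africa are declining.",
    "paper2.txt": "We sequenced Panthera leo and Panthera pardus samples.",
}

# Tally mentions of Panthera species across all the papers.
species = re.compile(r"Panthera [a-z]+")
counts = Counter(m for text in papers.values() for m in species.findall(text))
print(counts.most_common())
```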

There is a SUMMARY at the end

How does Elsevier’s text mining policy work with new UK TDM law?

By Gemma Hersh | Posted on 9 June 2014

In January, Elsevier announced a new text and data mining policy, which allows academic researchers at subscribing institutions to text mine subscribed content for non-commercial research purposes.

PMR: we and others showed that this was deeply and utterly flawed and contained many clauses which were solely for Elsevier’s benefit.

Last week, a new UK text and data mining copyright exception came into force which allows researchers with lawful access to works to make copies of these for the purposes of non-commercial text and data mining. Accordingly, it is a good opportunity to reflect on how our policy and the exception work together.

YES, and my blog posts will reflect. Note that I have lawful access to all works I want to mine for my non-commercial research purposes.

Elsevier and the UK TDM copyright exception

A new UK text and data mining copyright exception came into force on June 1st. What is it and how do Elsevier’s systems accommodate this requirement?

  • An exception to copyright is when someone is allowed to copy a work without seeking the permission of the rights holder. In this instance, researchers with lawful access to works published by Elsevier can copy these without asking,  [using tools we have provided for this purpose], provided they are doing the copying to carry out non-commercial text and data mining.

The highlighted phrase is completely spurious. We can copy the material with OUR tools, which are Open, or with anyone else’s, such as GNU/Linux tools. Many readers may misread this phrase as part of the legislation – it is FUD and its introduction is completely irresponsible.

  • Elsevier offers an Application Programming Interface (API) to facilitate text and data mining of content held on Science Direct. This API makes the process [easier and more efficient] for researchers compared to manual downloading and mining of articles. It also helps us to provide a good experience to human readers and to miners at the same time.

This is an untested assertion and written from a marketing perspective rather than an actual study. It is unlikely to be easier than FreeOpen tools which work for all publishers’ output.

  • Under the UK legislation, publishers can use “reasonable measures to maintain the stability and security” of their networks, and so the [requirement to use] this API is fully compatible with the copyright exception. 

So this appears to be a MANDATORY API; if we do not use it Elsevier will take action. This is INCOMPATIBLE with the new legislation, which allows miners to ignore restrictions imposed by publishers.

  • Our approach to TDM remains under review and continual refinement. We have already made changes based on [researcher feedback during our pilot] and will continue to do so in order to support researchers.

PMR: Where is this “researcher feedback”? NAME them and publish the full details. No one has consulted me or many of the other proponents of unrestricted mining under the law. It’s always possible to find someone who will provide support for some case, but that’s neither scientific nor responsible.

  • We believe text and data mining is important for advancing science, and we are keen to provide tools to support researchers who wish to mine no matter where they are located.

 This is vacuous marketing mumble.

Related resources

Elsevier has provided [text and data mining support for researchers since 2006].

PMR: Not for me. I spent years trying to get a reasonable approach.

We designed our policy framework to span across all legal environments as research is global, and this framework complements the UK exception. Since the beginning of the year, in accordance with our policy, we have started to include text and data mining rights for non-commercial purposes in all new ScienceDirect subscription agreements and upon renewal for existing academic customers. [The UK law adds weight to our position; we are ensuring that those with "lawful access" (in UK legislation speak) have the right to mine our works].

PMR: The UK law allows ME to mine Elsevier content WITHOUT the rights included in contracts. Read those clauses carefully, LIBRARIANS. It is highly likely that you will be giving up some of MY rights.

Contrary to what some have suggested, [our policy was not designed to undermine library lobbying for copyright exceptions for text and data mining], but rather to position us to continue to offer flexible and scalable solutions to support researchers no matter where they are based.

PMR: Last year the massed mainstream publishers, INCLUDING ELSEVIER, fought against the European libraries, funders, JISC, SURF, etc. to require licences for content mining (“Licences 4 Europe”). The talks in Brussels broke down. Neelie Kroes stated that licences were not the answer.

PMR: So it was all a misunderstanding? Elsevier wasn’t fighting us? Orwell calls this DOUBLESPEAK. Just reading the previous sentence should convince you that publishers are not “our partners”.

What the law alone cannot do – in the UK or elsewhere – [is resolve some of the technical sticking points that often frustrate a researcher's mining experience]. That’s why our policy facilitates text mining via an Application Programming Interface (API).

PMR The FreeOpen software can already deal with the technical sticking points

The advantages of using APIs for text mining

As users of many popular websites will know, it is standard best practice for users (well, their machines) to be asked to use APIs or other download mechanisms when the website in question holds a lot of content. That’s the case with ScienceDirect, which holds over 12.5 million articles and almost 20,000 books, and we are among many other large platforms, including Wikipedia, PubMed Central and Twitter, in asking for our API to be used for downloading and mining content. We do this to provide researchers with an optimum text mining experience.

PMR Wikipedia and PubMedCentral (on whose advisory board I am) have public and democratic approaches to governance and control. Elsevier’s API is developed without any significant community input. If I saw an Elsevier API Advisory Board, with public minutes and transparency of the stature of PubMedCentral I would be prepared to engage

PMR APIs also allow websites to monitor (snoop on) who uses the API, for what purpose and when. They also allow the provider to present the particular view (often limited or distorted) that they wish to promote.

For starters, access via the API provides full-text content of ScienceDirect in XML and plaintext formats, [which researchers tell us they prefer to HTML] for mining.

PMR Weasel words (Wikipedia term). I (PMR) find good standards-conformant HTML totally acceptable and often superior. I will be happy to report publicly whether Elsevier’s HTML is standards-conformant.

Similarly, experience in our pilots has indicated that text miners prefer API access for automated text mining for several other reasons, one being that content is available from our APIs without all of the extraneous information that is added to web pages intended for human consumption but which make text mining more difficult (e.g., presentational JavaScript, navigational controls and images, website branding, advertisements). Access via our API also provides content to researchers in stable, well-documented formats; by contrast, HTML coding can change at any time, making it arduous to keep “screen-scraping” scripts up to date.

PMR Human readers are no doubt clamouring for the extraneous information ,  yearning for website branding, and reading the site for the advertisements. Our content mining tools can avoid this clutter.

It’s not just text miners who benefit from our API, but users of ScienceDirect who are there to read content rather than download and mine it. Their user experience of ScienceDirect can be maintained at the highest level, as bulk downloading needed for mining is done elsewhere, via our API. If bulk downloading over a short period of time took place on the ScienceDirect site, [the system's stability would be compromised, affecting researchers of every hue]. By contrast, our API is designed to cope with high-frequency requests from automated bots and crawlers in a very efficient manner which enables us to scale our systems to meet demand.

PMR I shan’t comment on what human ScienceDirect readers want;  Cameron Neylon has already demolished the idea that commercial publishers cannot provide robust servers for all types of use.

PMR: I do not understand why the hue (=colour) of researchers is important; In the UK and many other countries this is objectionable language and should not appear on a reputable publisher’s site. Please apologise and remove or I shall report this.

The Explanatory Notes published alongside the UK legislation make clear that publishers are able to impose “reasonable measures to maintain the stability and security” of their networks, as long as researchers are able to benefit from the exception to carry out non-commercial research. In other words, researchers with lawful access to works can copy these for the purposes of non-commercial text and data mining, and publishers have a role to play in managing this process. [1] [The “reasonable measures” include requesting that miners carry out text mining via a separate API], in line with Elsevier’s existing policy, and we have received numerous reassurances from the UK Government [2] [that use of our API will be in compliance with the law].

PMR [1] You may request but you may not require.

PMR [2] And ignoring your API is ALSO in compliance with the law.

PMR. If the law is interpreted as “the publisher decides whether an activity is compliant with the law” then the law is pointless.

We will continue to monitor how our API is used and to make tweaks and changes to our policy in response to community feedback. We have already made several adjustments. For example, we no longer request a project description as part of the API registration process, and we now allow TDM output to be hosted in an institutional repository. We also know, for example, [that researchers would like to mine third-party images and graphics that they cannot currently download automatically via our API].

PMR: Yes. I would like to mine images and I will mine images. If Elsevier does not provide images through their API this is an unassailable argument for getting them directly from the website as the law allows.

[We of course make this content available to researchers on request],

PMR You didn’t (“of course”) make anything available to me during the three years I “negotiated” with you.


but we are looking at how we might ensure that the rights of [third-party content owners] are respected whilst at the same time providing researchers with all of the content they want immediately via our API.

PMR. More FUD. We have a complete right to mine third-party content as well. Elsevier’s “ensuring rights” is a process that is of indeterminate duration.

And we are a signatory to the new CrossRef Prospect text and data mining service, which aims to allow researchers to mine content from a range of publishers through one single portal.

PMR CrossRef is set up by publishers and guided by the publishers who finance it.

Further, we’re looking at how we ensure that researchers [know what they can and cannot do with content, or where to go for further information], without giving the impression that we are claiming ownership over non-copyrightable facts and data.

PMR. I know what I can do and where I can go without Elsevier’s help. And it’s likely that miners may choose to come to  and similar community sites for information provided by the community for the community.


We’ve already altered our output terms, so that researchers can redistribute 200 characters in addition to text entity matches; [researchers] told us that our previous inclusion of text entity matches within that 200 character limit sometimes caused problems when displaying lengthy chemical formulas.

PMR “Researchers” was actually me. It’s polite to credit sources.

In short, we will continue to do what we have always done: work with the research community to support their research, listen to feedback and respond to changing needs. Our text and data mining policy is a reflection of this and will continue to evolve accordingly.

PMR More FUD and mumble.






Content Mining Starts Today! and we have the technology

June 1st, 2014

Today 2014-06-01 is a very important date. The UK government has pushed for reform of copyright and – despite significant opposition and lobbying from mainstream publishers – the proposals are now law. Today.

Laws are complicated and the language can be hard to understand, but for our purposes (scientific articles which we have the right to read):

  • If you have the right to read something in the UK then you have the right to extract and publish facts from it for non-commercial use.
  • This right overrides any restrictions in the contract signed between the publisher and the buyer/renter.

Of course we are still bound by copyright law in general, defamation, passing off and many other laws. But our machines can now download subscribed articles without legal hindrance and, as long as we don’t publish large non-factual chunks, we can go ahead.

Without asking permission.

That’s the key point. If we had to ask permission or were bound by contracts that forbid us then the law would be useless. But it isn’t.

I’m mentally starting today, but since I’m not in UK I’ll wait for a few days. I’ve got several non-commercial projects I want to work on – one today about pheromones – I need to scan a lot of papers for chemical structures and species.

It also wouldn’t be much use without the technology. There are perhaps 1000-5000 new articles per day – no one really knows. That’s roughly one to three a minute to crawl and scrape. We believe that a lot of the crawled metadata is freely available, so we are concentrating on scraping.
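The per-minute rate follows directly from those daily estimates:

```python
# Throughput needed to keep up with the literature, using the
# 1000-5000 articles/day estimates above.
minutes_per_day = 24 * 60  # 1440

low = 1000 / minutes_per_day   # articles per minute, low estimate
high = 5000 / minutes_per_day  # articles per minute, high estimate
print(round(low, 1), "to", round(high, 1), "articles per minute")
```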

We’ll launch the technology on Wednesday at . If you are in the Vienna area you might want to come – I think there may be a place or two but can’t guarantee it. We’ll post the details and probably open an Etherpad if any brave people want to try remotely.

All the people have worked very hard, but top kudos to Richard Smith-Unna (@blahah404) for building the scraper. It’s a scary ghost ride with a “headless browser” – PhantomJS, SpookyJS, CasperJS – but we’ll be doing this in daylight so it should be safe.

The workshop is truly interactive – we want to hear what the participants want, why it does/doesn’t work for them, and to build collaborative projects. Ideally we’d like a self-reproducing community developing applications and running workshops.

A small amount of the workshop – e.g. Computer Vision for Science – will be “bleeding edge”. It should be fun.



Shuttleworth Gathering Budapest, Content Mine Dogfood

May 29th, 2014

Twice a year the Shuttleworth Fellowship meets in a Gathering – it could be anywhere in the world (subject to a minimum-travel-costs algorithm). This is my first and we are in Budapest – one of Europe’s loveliest cities. (I’ve been here before, luckily, as our programme has been very full and we only got out once formally, for a river cruise.)

It’s Chatham House Rule so no details, but see our web page for the 13 fellows. This is one of the most coherent, inspiring groups I have ever been in. So much is common ground – we agree on doing Open; the questions are why, what and how, and we’ve explored those. I’ve found so much in common – we are in the area of liberating knowledge and inspiring innovation, mixed with democracy and justice. I’m finding out about how to build communities, annotation and education, while being able to help with computer vision, information extraction, metadata, etc.

We each ran a 75 minute slot on “eating our own dogfood”. NOT a lecture. We had to bring the practice of our project and ask the others – everyone – to grok it and hack it. Often this was in small groups and so for mine we had 5 groups of 5. Here’s my rough summary with comments:

  • Why are we doing ContentMining? economics, openness/democracy, innovations, disruption.  Hargreaves

Very useful discussion (as would be expected)

  • Manual markup (highlighters) of two articles

Worked very well. Lots of questions about “should we mark this?”. 

  • Demo (PMR) of semantic content  (chemistry)

  • Crawling exercise (manual)

Good involvement. “Why doesn’t publisher X have an RSS feed?”, etc.

  • Scraping exercise (manual and software)

Again worked very well

  • Extraction (software and manual design)

Mainly concentrated on manual markup but showed chemical tagger, etc.

  • Where are we going?


I deliberately put far too much in – so people could test that the software worked, etc. But the main idea was to see how non-biologists managed. I chose a paper on the evolutionary biology of lions in Africa and everyone got the point. In fact it reinforced how needlessly exclusive scientific language is. The first part of the introduction could be rewritten without loss to read something like

“African Lions are dying out because of hunting and environment change. DNA analyses show that lions in different parts of Africa have evolved in different ways. By studying the DNA and historical specimens we can understand the evolution and perhaps use this for conservation.”

There wasn’t enough time for everyone to run the software – deliberately – but we got very useful feedback.  I shall be tweaking it over the weekend to make sure it’s working for our Vienna workshop.


Content Mining will be legal in UK; I inform Cambridge Library and the world of my plans

May 19th, 2014

Early last week the UK House of Lords passed the final stages of a Statutory Instrument with exceptions to copyright. For me the most important was that those with legitimate access to electronic content can now use mining technology to extract data without permission from the owners. The actual legislation took less than a minute, but the process had been desperately fought by the traditional publishers, who attempted to require subscribers to get permission from them.


That means that I, who have legitimate access to the content of Cambridge University Library and their electronic subscriptions, can now use machines to read any or all of this without breaking copyright law. Moreover the publishers cannot override this with additional restrictive clauses in their contracts.

The new law restricts the use to “non-commercial” but this will not affect what I intend to do. To avoid any confusion I am publicly setting out my intentions; because I shall be using subscription content I am advising Cambridge University Library. I am not asking anyone’s permission because I don’t have to.

Yesterday I wrote to Yvonne Nobis, Head of Science Information in CUL.

I am informing you of my content mining research using subscription content in CUL. Please forward this to anyone else in CUL who may need to know. Also if there is any time this week I would be very happy to meet (or failing that Skype) – even for a short time.
As you know the UK government has passed a Statutory Instrument based on the Hargreaves review of copyright exempting certain activities from copyright, especially “data analytics” which covers content mining for facts. This comes into force on 2014-06-01.
I intend to use this to start non-commercial research and to publish the results in an Open Notebook Science philosophy (i.e. publicly and immediately on the web as the work is done, not retrospectively). This involves both personal research in several scientific fields and also collaborations in 3-4 funded projects:
  • PLUTo (BBSRC, Univ Bath) – Ross Mounce
  • Metabolism mining (Andy Howlett, Unilever-funded PhD, also with Christoph Steinbeck, EBI, Hinxton, UK)
  • Chemical mining (TSB grant) – Mark Williamson.
We are also in the final application stage for an NSF grant collaboration on chemical biodiversity in Lamiaceae (mints, etc.). This is very exciting, and mining may throw light on chemicals as signals of climate change.
I intend to mine responsibly and within UK law. I expect to mine about 1000-2000 papers per day – many will be subscription-based through CUL. I have access to these as I have an Emeritus position but as I am not paid by CU then this cannot be construed as commercial activity. Typically my software will ingest a paper, mine it for facts, and discard the paper – the process takes a few seconds.
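For the technically minded, the ingest–mine–discard cycle can be sketched in a few lines of Python. This is a hypothetical minimal illustration: the extractor patterns and the `mine` function are invented for the example, and real mining (e.g. with AMI) is far more sophisticated.

```python
import re

# Hypothetical extractors: each maps a paper's text to a list of facts.
# These regexes are illustrative only; real extraction is far richer.
EXTRACTORS = {
    # Binomial species names: capitalised genus, lowercase specific epithet
    "species": re.compile(r"\b[A-Z][a-z]{2,} [a-z]{3,}\b"),
    # Decimal latitude/longitude pairs such as "2.95, 36.06"
    "lat_long": re.compile(r"-?\d{1,3}\.\d+,\s*-?\d{1,3}\.\d+"),
}

def mine(text):
    """Extract facts from one paper; only the facts are kept and the
    copyrightable full text is discarded."""
    return {name: sorted(set(rx.findall(text)))
            for name, rx in EXTRACTORS.items()}

facts = mine("We sampled Panthera leo at 2.95, 36.06 in Kenya.")
# facts["species"] → ["Panthera leo"]
```

The key point of the design is that the paper itself never persists: only uncopyrightable facts survive the few seconds of processing.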
As a responsible scientist I am required by scientific ethics and reproducibility/verifiability to make my results Open and this includes the following Facts:
  • bibliographic metadata of the article (but not the abstract)
  • citations (bibliographic references) within the article
  • factual lists of tables, figures and supplemental data
  • sources of funding (to evaluate the motivations of researchers)
  • licences
  • scientific facts (below)
I shall not reproduce the whole content but shall reproduce necessary textual metadata without which the facts cannot be verified. These include:
  • figure and table captions (i.e. metadata)
  • experimental methodology (e.g. procedures carried out)
I shall not reproduce tables and figures. However my software is capable, for many papers, of interpreting tables and diagrams and extracting Factual information (e.g. in CSV files). [My output will be more flexible and re-usable than traditional pixel-based graphs.]
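As a minimal sketch of the CSV idea, assuming invented example values: once numbers have been recovered from a table or bar chart, re-publishing them as data rather than pixels is trivial.

```python
import csv
import io

# Invented values standing in for facts recovered from a table or chart.
rows = [
    ("species", "count"),
    ("Panthera leo", 12),
    ("Panthera pardus", 7),
]

# Serialise the recovered facts as CSV text, re-usable by any tool.
buf = io.StringIO()
csv.writer(buf).writerows(rows)
csv_text = buf.getvalue()
```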
I expect to extract and interpret the following types of Facts:
  • biological species
  • place names and geo-locations (e.g. lat/long)
  • protein and nucleic acid sequences
  • chemical names and structure diagrams
  • phylogenetic (e.g. evolutionary) trees
  • scatterplots, bar graphs, pie charts, etc.
 and several others as the technology progresses.
The load on publishers’ servers is negligible (this has been analysed by Cameron Neylon of PLoS).
I stress that the output is qualitatively no different from centuries of extraction from the literature – it is the automation of the procedure that is new. Facts are not copyrightable and nor will my output be.
I shall publish the results on my personal open web pages and repositories such as GitHub, and offer them to EuropePMC for incorporation if they wish. Everything I publish will be licensed under CC0 (effectively public domain). I would also like to explore exposing the results through CUL. I have already pioneered dspace@cam for large volumes of facts, but found that the search and indexing wasn’t appropriate at the time. If you have suggestions as to how the UL might help, it could be a valuable example for other scholars.
I am not expecting any push-back or take-downs from publishers as this activity is now wholly legal.  The Statutory Instrument overrides any restrictive clauses from suppliers, including robots.txt. I therefore do not need or intend to ask anyone for permission. This will be a very public process – I have nothing to hide. However I wish to behave responsibly, the most likely problem being load on publishers’ servers. Richard S-U (Plant Sciences, Cambridge, copied) and I are developing crawling and scraping protocols which are publisher-friendly (e.g. delays and retries) – we  have also discussed this with PLoS (Cameron).
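A publisher-friendly delay-and-retry policy can be sketched as follows. This is a hypothetical minimal illustration, not the actual protocol Richard and I are developing; `polite_fetch`, `backoff_delays` and their parameters are invented for the example.

```python
import time
import urllib.error
import urllib.request

def backoff_delays(base=5.0, retries=3, factor=2.0):
    """Seconds to wait before each attempt: e.g. 5, 10, 20."""
    return [base * factor ** i for i in range(retries)]

def polite_fetch(url, retries=3):
    """Fetch one article politely: pause before every request and
    back off exponentially when the server refuses or times out."""
    for attempt, delay in enumerate(backoff_delays(retries=retries)):
        time.sleep(delay)  # never hammer the publisher's server
        try:
            with urllib.request.urlopen(url, timeout=30) as resp:
                return resp.read()
        except urllib.error.URLError:
            if attempt == retries - 1:
                raise  # give up after the final attempt
```

Fixed pauses plus exponential backoff keep the load on any one server negligible even at thousands of papers per day.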
In the unlikely event of any problems from publishers I expect that CUL, as licensee/renter of content, would be the first point of contact. I will be happy to be available if CUL needs me. If publishers contact me directly I shall immediately refer them to CUL as CUL is the licensee.
I have written this in the first person (“I”) since the legislation emphasises personal use and because organised consortia may be seen as “commercial”. The law is for the UK. Fortunately my mining is wholly compatible:
  • I am a UK citizen from birth
  • I live in the UK
  • I have a pension from the UK government (non-commercial activity)
  • My affiliation is with a UK university
  • The projects I outline are funded by UK organisations.
  • My collaborators are all UK-based.

I play a public domain version of “Rule Britannia!” incessantly and have a Union Jack teddy bear. I shall, however, vote for Britain to continue as a member of the EU and also urge my representatives (MEPs) to continue to press for similar legislation in Europe. I personally thank Julian Huppert and David Willetts for their energy and consistency in pushing for this reform, which highlights the potential value of parliaments in a democracy.

I also thank my collaborators in the ContentMine, where I shall be demonstrating and discussing our technology, which is the best that I know of outside companies like G**gle. As an academic I welcome offers of collaboration, but stress that we cannot run a mining service for you (though we can show you how to run our toolkit). If the projects are interesting enough to excite me as a scientist I may be very happy to work with you as a co-investigator, though I cannot be paid for mining services.

Sadly, very few publishers come out of this with anything positive. Naturally the Open Access publishers (PLOS, BMC, eLife, MDPI, PeerJ, Ubiquity and others) have no problems as they can be and want to be mined. We have already had long discussions with them. The Royal Society (sic, not the RSC) has positively said that their content can be mined. All the rest, and especially the larger ones, have actively lobbied and FUDded to stop content mining. When you know that organisations are spending millions of dollars to stop you doing science it can be depressing, but we’ve had the faith to continue. I’m particularly proud of Jenny Molloy, Ross Mounce and others for their public energy in maintaining

“The Right To Read is the Right To Mine”

Now that the political battle (which has taken up 5 years of my life) is largely over, I’m devoting my energies to establishing the ContentMine as a universal resource and to building the next generation of intelligent scientific software.

And you can be an equal part of it, if you wish.




Jean-Claude Bradley: Hero of Open Notebook Science; it must become the central way of doing science

May 19th, 2014

It is with great sadness that we report the death of Jean-Claude Bradley, who invented the concept of “Open Notebook Science”.


[Blue Obelisk presented to J-C (left) by Egon Willighagen (right), 2007. Photo Credit CC BY Christoph Steinbeck]

I learnt of this last Wednesday, while preparing a keynote talk on “Open Data” at the European Bioinformatics Institute at Hinxton. I dropped half of what I was intending to present, to provide a fitting tribute to J-C. On the Blue Obelisk mailing list I wrote:

Jean-Claude was years ahead of his time. He did what he considered right, not what was expedient or what the world expected.

He and I discussed Open Data and Open Notebook Science. We found that they were different things and that each was a critically important subject. J-C set up a webpage on Wikipedia to describe ONS and its practice.

ONS is truly innovative. The research must be available to everyone – regardless of who they are or what they have studied. And it must be fair – “no insider knowledge”.

Several groups in chemistry are following J-C’s lead – and we honour him in that.

I have been invited to present a keynote on “Open Data” at Hinxton Genome Campus tomorrow and shall make J-C’s work the focus and inspiration.

I am truly glad we awarded him a Blue Obelisk. As a community we should think how to take the message further.

I stayed up late into the night finding material to include. J-C has left a clear legacy and it has been possible to find clear, simple, precise indications of his thinking. See slides 4–20 in my presentation; there is also an excellent video interview from last year (links at the end of my presentation).

As I found more material I suddenly got the revelation:

“This is the only proper way to do science in the Century of the Digital Enlightenment”

I perhaps knew this theoretically, but now it hit me emotionally. Jean-Claude’s vision was absolute, simple, and feasible. In fact ONS is a simpler way of doing science than we have at present. It’s vastly better and immediately provides a total record of what everyone has done. It’s literally edited by the minute. Everyone gets fair credit for what they have done, there is a massive reduction in wasted effort, and there is no opportunity for fraud.

ONS also solves the “Open Data” and “Open Access” problems at a stroke. It is impossible not to publish Open data, and impossible for publishers to try to steal it from the public. Open Access becomes virtually irrelevant – it’s an integral part of the system.

I’ll have a lot more to write. In preparing my talk I asked Mat Todd, Univ of Sydney, to comment. Mat has been another pioneer in OpenNotebookScience, using chemistry not for conventional academic glory (though he has that from many) but to cure human disease, particularly Neglected Tropical Diseases. Mat wrote:

JC was a pioneer in open science, and uncompromising about its importance. We had so many productive interactions over the years, starting from the end of January 2006, when we started our open chemistry project on The Synaptic Leap (JC was the first to comment!) and JC posted his very first experiment online at Usefulchem. I remember starting to think about how to do completely open projects, looking around the web in 2005 to see if anything open was going on in chemistry, and coming across JC’s lone voice, and I thought “Wow, who is this guy?” He had dedication and integrity – we’ll all miss him.


TheContentMine: Progress and our Philosophy

April 30th, 2014

TheContentMine is a project to extract all facts from the scientific literature. It has now been going for about 6 weeks – this is a soft launch. We continue to develop it and record our progress publicly. It’s a community project and we are starting to get offers of help right now. We welcome these but we shan’t be able to get everything going immediately.

We want people to know what they are committing to and what they can expect in return. So yesterday I drafted an initial Philosophy – we welcome comments.

Our philosophy is to create an Open resource for everyone created by everyone. Ownership and control of knowledge by unaccountable organisations is a major current threat; our strategy is to liberate and protect content.

The Content Mine is a community and we want you to know that your contribution will remain Open. We will build safeguards into The Content Mine to protect against acquisition.

We are a meritocracy. We are inspired by Open communities such as the Open Knowledge Foundation, Mozilla, Wikipedia and OpenStreetMap, all of which have huge communities that have developed trustworthy governance models.

We are going ahead on several fronts – “breadth-first”, although some areas have considerable depth. Just like Wikipedia or OSM you’ll come across stubs and broken links – it’s the sign of an Open growing organisation.

There’s so much to do, so we are meeting today to draft maps, guidelines, architecture. We’re gathering the community tools – wikis, mail lists, blogs, Github, etc. As the community grows we can scale in several directions:

  • primary source. Contributors can choose particular journals or institutions/theses to mine from.
  • subject/discipline. You may be interested in Chemistry or Phylogenetic Trees, Sequences or Species.
  • technology. Concentrate on OCR, Natural Language Processing, Crawling, Syntax, or develop your own extraction techniques.
  • advocacy and publicity. A major aim is to influence scientists and policy makers to make content Open.
  • community – its growth and practice.

We are developing a number of subprojects which will demonstrate our technology and how the site will work. Hope to report more tomorrow.


Is Elsevier going to take control of us and our data? The Vice-Chancellor of Cambridge thinks so and I’m terrified

April 29th, 2014

I am gutted that I missed the Q+A session with Professor Sir Leszek Borysiewicz, the Vice-Chancellor of Cambridge University. It doesn’t seem to have been advertised widely – only 17 people went – and it deserves to be repeated.

The indefatigable Richard Taylor – who reports everything in Cambridge – has reported it in detail. It was a really important meeting. I’ll highlight one statement, which chills me to the bone (note that this is RT’s transcript):

“the publishers are faster off the mark than governments are. Elsevier is already looking at ways in which it can control open data as a private company rather than the public bodies concerned.”

Now I know this already – I’ve spent 4 years finding out in detail about Elsevier’s publishing practices. It’s good that the VC realises it as well. Open Access is a mess – the Universities have given part of their priceless wealth to the publishers and are desperately scrabbling to get some of it back. The very lack of will and success makes me despondent – LB says:

“And I know disadvantaging the individual academic by not having publication in what is deemed to be the top publications available? So it’s a balance in the argument that we have.”

in other words we have to concede control to the publishers to get the “value” of academics publishing where they want.

Scholarly publishing costs about 15,000,000,000 USD per year. Scholarly knowledge/data is worth at least ten times that (> 100,000,000,000 USD/year).  [I'll justify the figure later]. And we are likely to hand it all over to Elsevier (or Macmillan Digital Science).

I’ve done what I can to highlight the concern. This was the reason for my promoting the phrase “Open Data” in 2006  – and in helping create the Panton Principles for Open Data in Science in 2008. The idea is to make everyone aware that Open Data is valuable and needs protecting.

Because if we don’t Elsevier and Figshare and the others will possess and control all our data. And then they will control us.

Isn’t this overly dramatic?

No. Elsevier has bought Mendeley – a social network for managing academic bibliographies. Scientists put their current reading into Mendeley and use it to look up others’ work. Mendeley is a social network which knows who you are, and who you are working with.

Do you trust Mendeley? Do you trust Elsevier? Do you trust large organisations without independent control (GCHQ, NSA, Google, Facebook)? If you do, stop reading and don’t worry.

In Mendeley, Elsevier has a window onto nearly everything that a scientist is interested in. Every time you read a new paper, Mendeley knows what you are interested in. Mendeley knows your working habits – how much time are you spending on your research?

And this isn’t just passive information. Elsevier has Scopus – a database of citations. How does a paper get into this? Scopus decides, not the scientific world. Scopus can decide what to highlight and what to hold back. Do you know how Journal Impact Factors are calculated? I don’t, because it’s a trade secret. Does Scopus’ Advisory Board guarantee transparency of practice? Not that I can see. Since JIFs now control much academic thinking and planning, those who control them are in a position to influence academic practice.

Does Mendeley have an advisory board? I couldn’t find one. And when I say “advisory board”, I mean a board which can uncover unacceptable practices. I have no evidence that anything wrong is being done, but I have no evidence that there are any checks against it. Elsevier has already created fake journals for Merck, so how can I be sure it will resist the pressure to use Mendeley for inappropriate purposes? Is Mendeley any different from Facebook as far as transparency is concerned?  Is there any guarantee that it is not snooping on academics and manipulating and selling opinion? “Dear VC – this is the latest Hot Topics from Mendeley; make your next round of hirings in these fields”.

I’m also concerned that Figshare will go the same way. I have huge respect for Mark Hahnel, who founded it. But Figshare also doesn’t appear to have an advisory board. Do I trust Macmillan? “we may anonymize your Personal Information so that you cannot be individually identified, and provide that information to our partners, investors, Content providers or other third parties.” Since information can be anonymised or useful, but not both, are you happy with that?

There aren’t any easy solutions. If we do nothing, we are trusting our academic future to commercial publishers who control the information and knowledge flow. We have to take back our own property – the knowledge that *we* produce. Publishers should be the servants of knowledge – at present they are becoming the tyrants.