ContentMine at WOSP2014: Text and Data Mining: III What Elsevier’s Chris Shillum thinks we can do; Responsible Mining

My last post described the unacceptable and intransigent attitude of Elsevier’s Gemma Hersh at WOSP2014 Text and Datamining of Scientific Documents.

But there was another face to Elsevier at the meeting: Chris Shillum. Charles Oppenheim and I talked with him throughout lunch, and later Richard Smith-Unna talked with him too. None of the following is on public record but it’s a reasonable approximation to what he told us.

Firstly he disagrees with Gemma Hersh that we HAVE to use Elsevier’s API and sign their Terms and Conditions (which give away several rights and severely limit what we can do and publish). We CAN mine Elsevier’s content through their web pages that we have the right to read and we cannot be stopped by law. We have a reasonable duty of care to respect the technical integrity of their system, but that’s all we have to worry about.

I *think* we can move on from that. If so, thanks, Chris. And if so, it’s a model for other publishers.

We told him what we were planning to do – read every paper as it is published and extract the facts.

CS:  Elsevier published ca 1000 papers a day.

PMR : that’s one per minute;  that won’t break your servers

CS: But it’s not how humans behave….

I think this means that if we appear to their servers as just another human then they don’t have a load problem. (For the record I sometimes download manually in sequence as many Elsevier papers as I can to (a) check the licence or (b) see whether the figures contain chemistry or sequences. Both of these are legitimate human activities for subscribers or non-subscribers. For (a) I can manage 3 per minute, 200/hour – I have to scroll to the end of the paper because that’s often where the “all rights reserved, C Elsevier” is. For (b) it’s about 40/hour if there are an average of 5 diagrams (I can tell within 500 milliseconds whether a diagram has chemistry, sequences, phylogenetic trees, dose-response curves…).)  [Note: I don't do this for fun - it isn't - but because I am fighting for our digital rights.]

But it does get boring and error-prone which is why we use machines.
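A machine can be made to behave as gently as a human reader. As a rough illustration (not anything Elsevier or ContentMine has specified), 1000 papers a day spread evenly over 24 hours is one request every 86.4 seconds, slightly less than one per minute. The sketch below computes that interval and enforces it between requests. The names `seconds_between_requests` and `Throttle` are made up for this example, and the clock and sleep functions are injectable only so the logic can be checked without actually waiting.

```python
import time


def seconds_between_requests(papers_per_day):
    """Spread a day's worth of papers evenly over 24 hours (86400 seconds)."""
    return 86400.0 / papers_per_day


class Throttle:
    """Hypothetical helper: guarantees a minimum gap between successive requests."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self.clock = clock      # returns the current time in seconds
        self.sleep = sleep      # blocks for the given number of seconds
        self._last = None       # time of the previous request, if any

    def wait(self):
        """Call before each request; sleeps only if the last one was too recent."""
        now = self.clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self.sleep(remaining)
                now = self.clock()
        self._last = now
```

In use, a crawler would create `Throttle(seconds_between_requests(1000))` once and call `wait()` before every download. The first call returns immediately; later calls pause just long enough to keep the average at or below the chosen rate.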

The second problem is what we can publish. The problem is copyright. If I download a complete Closed Access paper from Elsevier and post it on a public website I am breaking copyright. I accept the law. (I am only going to talk about the formal law, not morals or ethics at this stage).

If I read a scientific paper I can publish facts. I MAY be able to publish some text as comment, either as metadata, or because I am critiquing the text. Here’s an example:

In 1953, the following sentence appeared near the end of a neat little paper by James Watson and Francis Crick proposing the double helical structure of DNA (Nature 171: 737-738 (1953)):

“It has not escaped our notice that the specific pairing we have postulated immediately suggests a possible copying mechanism for the genetic material.”

Of course, this bon mot is now wildly famous amongst scientists, probably as much for its coyness and understatement …

Lablit can reasonably assume that this is fair comment and quote C+W (Copyright who? they didn’t have transfer in those days). I can quote Lablit in similar fashion. In the US this is often justified under “fair use”, but there is no such protection in the UK. Fair use is, anyway, extremely fuzzy.

So Elsevier gave a guide as to what they allow IF you sign their TDM restrictions. Originally it was 200 characters, but I pointed out that many entities (facts), such as chemical names or biological sequences, were larger. So Chris clarified that Elsevier would allow 200 characters of surrounding context. This could mean something like the following (from an abstract), showing ContentMine markup:

“…Generally the secondary metabolite capability of <a href=””> A.oryzae</a> presents several novel end products likely to result from the domestication process…”

That’s 136 characters without the Named Entity “<a…/a>”
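A limit like this is easy to apply mechanically. Here is a minimal sketch, assuming the 200 characters are counted as the text on both sides of the entity and that the entity itself is excluded from the count. That is one reading of the rule, not Elsevier’s wording, and the function name `context_snippet` and the even split between left and right context are choices made purely for illustration.

```python
def context_snippet(text, start, end, budget=200):
    """Return the entity text[start:end] with at most `budget` characters
    of surrounding context, plus the number of context characters used.

    Assumption: the budget is split evenly before and after the entity,
    and the entity's own characters do not count against it.
    """
    half = budget // 2
    left = max(0, start - half)           # do not run off the start of the text
    right = min(len(text), end + half)    # do not run off the end of the text
    before = text[left:start]
    entity = text[start:end]
    after = text[end:right]
    prefix = "…" if left > 0 else ""      # mark truncation on either side
    suffix = "…" if right < len(text) else ""
    return prefix + before + entity + after + suffix, len(before) + len(after)


# Usage: locate an entity by simple string search, then cut the snippet.
sentence = "Generally the secondary metabolite capability of A.oryzae presents several novel end products"
pos = sentence.find("A.oryzae")
snippet, used = context_snippet(sentence, pos, pos + len("A.oryzae"))
```

Because the count of context characters is returned alongside the snippet, a mining tool could log it with every extracted fact and show that the limit was respected.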

Which means that we need guidelines for Responsible Content Mining.

That’s what JISC have asked Jenny Molloy and me to do (and we’ve now invited Charles Oppenheim to be a third author). It’s not easy as there are few agreed current practices. It’s possible for us to summarise what people currently do, what has been challenged by rights holders and then for us to suggest what is reasonable. Note that this is not a negotiation. We are not at that stage.

So I’d start with:

  1. The right to read is the right to mine.
  2. Researchers should take reasonable care not to violate the integrity of publishers’ servers.
  3. Copyright law applies to all parts of the process, although it is frequently unclear.
  4. Researchers should take reasonable care not to violate publishers’ copyright.
  5. Facts are uncopyrightable.
  6. Science requires the publication of source material as far as possible to verify the integrity of the process. This may conflict with (3/4).


CC BY publishers such as PLoS and BMC will only be concerned with (2). Therefore they act as a yardstick, and that’s why we are working with them. Cameron Neylon of PLoS has publicly stated that single text miners cause no problems for PLoS servers and there are checks to counter irresponsible crawling.

Unfortunately (4) cannot be solved by discussions with publishers as they have shown themselves to be in conflict with Libraries, Funders, JISC, etc. Therefore I shall proceed by announcing what I intend to do. This is not a negotiation and it’s not asking for permission. It’s allowing a reasonable publisher to state any reasonable concerns.

And I hope we can see that as a reasonable way forward.









WOSP2014: Text and Data Mining: II Elsevier’s Presentation (Gemma Hersh)

WOSP2014 is a scholarly, peer-reviewed workshop. It consists of submitted, peer-reviewed talks and demos and invited talks from well-known people in the field (Lee Giles, Birger Larsen). At ContentMine we submitted three papers/demos which were peer-reviewed and accepted (and which I’ll blog later).

But there was also one Presentation which was, as I understand, neither invited nor peer-reviewed.

“Elsevier’s Text and Data Mining Policy” by Gemma Hersh

It is usually inappropriate for a manufacturer to present at a scholarly conference where the audience are effectively customers. It ends up as a product pitch which is offensive to attendees who have paid to attend, and offensive to those who have submitted papers which were rejected, while product pitches are allowed.

This was one of the most unacceptable presentations I have ever seen at a scholarly workshop and I said so.

Before I add my own comments I simply record the facts. Professor Charles Oppenheim agrees that this is a factual record.

GH = Gemma Hersh (Elsevier)
CO = Prof Charles Oppenheim
GH arrived 10 mins before her presentation and left immediately afterwards. She did not stop to talk. She later tweeted that she had a meeting.
GH Presentation

(1) Elsevier’s presentation was not an invitation or peer-reviewed submission but appeared to have been a result of pressuring the organizers.

(2) it was a manufacturer-specific product pitch not a scholarly presentation

(3) It made no attempt to be balanced but presented only Elsevier’s product. In particular:

  * no mention was made of Hargreaves

  * no mention that it had been rejected by LIBER and associates

  * no mention that the library community had walked out of Licences for Europe

(4) Elsevier said that their studies showed researchers preferred APIs. No mention was made that researchers had to sign an additional agreement

Public Discussion (no record, but ca 20 witnesses)

PMR challenged the presentation on the basis of bias and inaccuracy.

GH criticized PMR for being aggressive. She stated that it was the libraries’ fault that L4E had broken down.

PMR asked GH to confirm that if he had the right to read Elsevier material he could mine it without using Elsevier’s API

GH replied that he couldn’t.

CO told GH that PMR had a legal right to do so

GH said he didn’t and that CO was wrong

Discussion continued with no resolution. GH showed no intention of listening to PMR and CO

Later tweets from GH
@petermurrayrust check the explanatory notes that accompany the legislation.
@petermurrayrust would prefer a constructive chat rather than an attack though….
@petermurrayrust very happy to keep talking but I’ve just come straight from one appointment and now have another.


PMR: I believe that any neutral observer would agree that this was roughly factually correct.


PMR: now my comments…

It is completely unacceptable for a product manager to push their way into a scholarly workshop, arrive 10 minutes before their presentation, give a product pitch and leave immediately without deigning to talk to anyone.

The pitch itself was utterly one-sided, presenting Elsevier as the text-miner’s friend and failing to give a balanced view of the last several years. Those of us in the UK Hargreaves process and Licences4Europe know that STM publishers in general and Elsevier in particular have thrown money and people at trying to control the mining effort through licences. To give a blatantly biased presentation at a scholarly meeting rules them out as trustable partners.

Worse, the product pitch was false. I called her on this – I was forthright – and asked whether I could mine without Elsevier’s permission. She categorically denied this. When challenged by Professor Oppenheim she told him curtly he was wrong and Elsevier could do what they liked.

The law explicitly states that publishers cannot use terms and conditions or other contractual processes to override the right to mine for non-commercial research purposes.

So it’s a question of who do you believe:

Gemma Hersh, Elsevier or Professor Charles Oppenheim, Loughborough, Northampton, City?

(and PMR, Nottingham and Cambridge Universities)

If GH is right, then the law is pointless.

But she isn’t and it isn’t.

It gets worse. In later discussions with Chris Shillum, who takes a more constructive view, he made it clear that we had the right to mine without Elsevier’s permission as long as we didn’t sign their terms and conditions. The discussion – which I’ll cover in the next post – was useful.

He also said that Elsevier had changed their TaC several times since January, much of this as a result of my challenging them. This means:

  1. Elsevier themselves do not agree on the interpretation of the law.
  2. Elsevier’s terms and conditions are so mutable and frequently changed that they cannot be regarded as having any force.




ContentMine at WOSP2014: Text and Data Mining; I. General Impressions

On Friday 2014-09-12, four of us from The ContentMine presented 3 papers at WOSP2014. The meeting was well run by Petr Knoth and colleagues from the Open University and CORE (the JISC- and funder-supported project for aggregation of repositories). The meeting gave a useful overview of TextAndDataMining (TDM). From the program:

  1. The whole ecosystem of infrastructures including repositories, aggregators, text-and data-mining facilities, impact monitoring tools, datasets, services and APIs that enable analysis of large volumes of scientific publications.
  2. Semantic enrichment of scientific publications by means of text-mining, crowdsourcing or other methods.
  3. Analysis of large databases of scientific publications to identify research trends, high impact, cross-fertilisation between disciplines, research excellence etc.

This was an important meeting in several ways and I want to comment on:

  1. the general state of Content-mining
  2. Our own presentations
  3. Elsevier’s uninvited presentation on “Elsevier’s Text and Data Mining Policy”. I’ll split this into two parts in two separate posts.

PMR’s impressions

Content mining and extraction has a long history so it’s now an incremental rather than a revolutionary technology. Much of the basis is “machine learning” where statistical measures are used for classification and identification (essentially how spam detectors work). So several papers dealt with classification, based on the words in the main part (“full text”) of the document. [NOTE: full-text is far better than abstracts for classifying documents. Always demand full-text.]

Then there’s the problem of analysing semistructured information. A PDF is weakly structured – a machine doesn’t know what order the characters are in, whether there are words, etc. Tables are particularly difficult and there were two papers on this.

And then aggregation, repositories, crawling etc. The most important takeaway for me is CORE, which aggregates several hundred (UK and other) repositories. This is necessary because UK repositories are an uncontrolled mess. Each university does its own thing, has a different philosophy, uses different indexing and access standards. The universities can’t decide whether the repo is for authors’ benefit, readers’ benefit, the university’s benefit, HEFCE’s benefit or whether they “just have to have one because everyone else does”. (By contrast the French have HAL.) So UK repositories remain uncrawlable and unindexed until CORE (even then there is much internal inconsistency and uncertain philosophy).

It’s important to have metrics because otherwise we don’t know whether something works. But there was too much emphasis on metrics (often to 4 (insignificant) figures). One paper reported 0.38% recall with a strict method and 94% with a more sloppy one. Is this really useful?

But virtually no one (I’ll omit the keynotes) gave any indication of whether they were doing something useful to others outside their group. I talked to 2-3 groups – why were they working on (sentiment analysis | table extraction | classification)? Did they have users? Was their code available? Was anyone else using their code? Take-up seems very small. Couple that with the fact that many projects have a 2-3 year lifespan and that the basis is competition rather than collaboration, and we see endless reinvention. (I’ve done table extraction but I’d much rather someone else did it so we are working with tabulaPDF). The output in academia is a publication, not a running reusable chunk of code.

So it’s not surprising that there isn’t much public acknowledgement of TDM. The tools are tied up in a myriad of university labs, often without code or continuity.

One shining exception is Lee Giles, whose group has built CiteSeer. Lee and I have known each other for many years and we worked together on a MicrosoftResearch project, OREChem. So when we got talking we found we had two bits of the jigsaw.

Readers of this blog will know that Ross Mounce and I are analysing diagrams. To do that we have to identify the diagrams, and this is best done from the captions (captions are the most important part of a scientific document for understanding what it’s about). And we are hacking the contents of the images. So these two fit together perfectly. His colleague Sagnik Ray Choudhury is working on extracting and classifying the images; that saves us huge time and effort in knowing what to process. I’m therefore planning to visit later this year.

For me that was probably the most important positive outcome of the meeting.

The next post will deal with Elsevier’s Gemma Hersh who gave an uninvited “Presentation”, and the one after with Elsevier’s Chris Shillum’s comment on Gemma’s views and also on Elsevier’s TaC.





Wellcome’s recommendations on Data Access and Sharing

The Wellcome Trust and other funders have commissioned a study on


(This is a report of the Expert Advisory Group on Data Access (EAGDA). EAGDA was established by the MRC, ESRC, Cancer Research UK and the Wellcome Trust in 2012 to provide strategic advice on emerging scientific, ethical and legal issues in relation to data access for cohort and longitudinal studies.)

This is very welcome – data is a poor relation of the holy – and in many subjects often largely useless – PDF full text. Here the report states why, and how we need to care for data.

Our findings were that

– making data accessible to others can carry a significant cost to researchers (both in terms of financial resource and the time it requires) and there are constraints in terms of protecting the privacy and confidentiality of research participants;

– while funders have done much valuable work to encourage data access and have made significant investments to support key data resources (such as the UK Data Service for the social sciences), the data management and sharing plans they request of researchers are often not reviewed nor resourced adequately, and the delivery of these plans neither routinely monitored nor enforced;

– there is typically very little, if any, formal recognition for data outputs in key assessment processes – including in funding decisions, academic promotion, and in the UK Research Excellence Framework;

– data managers have an increasingly vital role as members of research teams, but are often afforded a low status and few career progression opportunities;

– working in data intensive research areas can create potential challenges for early career researchers in developing careers in these fields;

– the infrastructures needed to support researchers in data management and sharing, and to ensure the long-term preservation and curation of data, are often lacking (both at an institutional and a community level).

TL;DR It needs commitment in money, policies and management and it’s a large task

So …


We recommend that research funders should:

1. Strengthen approaches for scrutinising data management and sharing plans associated with their funded research – ensuring that these are resourced appropriately and implemented in a manner that maximises the long-term value of key data outputs.


2. Urge the UK Higher Education funding councils to adopt a clear policy at the earliest possible stage for high quality datasets that are shared with others to be explicitly recognised and assessed as valued research outputs in the post-2014 Research Excellence Framework


3. Take a proactive lead in recognising the contribution of those who generate and share high quality datasets, including as a formal criterion for assessing the track record and achievements of researchers during funding decisions.


4. Work in partnership with research institutions and other stakeholders to establish career paths for data managers.


5. Ensure key data repositories serving the data community have adequate funding to meet the long-term costs of data preservation, and develop user-friendly services that reduce the burden on researchers as far as possible.

PMR: This is the FUNDERS urging various bodies to act. Some items are conceivably possible. (4) is highly desirable but very challenging, as universities have consistently failed to value support roles and honour “research outputs” instead. (5) is possible but must be done by people and organizations who understand repositories, not university libraries whose repositories are effectively unused.

And we MUSTN’T hand this over to commercial companies.

We recommend that research leaders should:

6. Adopt robust approaches for planning and costing data management and sharing plans when submitting funding applications.

7. Ensure that the contributions of both early-career researchers and data managers are recognised and valued appropriately, and that the career development of individuals in both roles is nurtured.

8. Develop and adopt approaches that accelerate timely and appropriate access to key research datasets.

9. Champion greater recognition of data outputs in the assessment processes to which they contribute.

PMR: (6) will have lipservice unless the process is changed; (7) means changing culture and diverting money; (8) is possible; (9) requires a stick from funders.

We also emphasise that research institutions and journals have critical roles in supporting the cultural change required.

Specifically, we call for research institutions to develop clear policies on data sharing and preservation; to provide training and support for researchers to manage data effectively; to strengthen career pathways for data managers; and to recognise data outputs in performance reviews.

We call on journals to establish clear policies on data sharing and processes to enable the contribution of individual authors on the publication to be assessed, and to require the appropriate citation and acknowledgement of datasets used in the course of a piece of published research. In addition, journals should require that datasets underlying published papers are accessible, including through direct links in papers wherever possible.

PMR: Journals have failed us catastrophically both technically (theirs is among the worst technical output in the world – broken bizarre HTML and not even using Unicode) and politically (where their main product is glory, not technical quality). The only way to change this is to create different organisations.

This will be very difficult.

The funders are to be commended on these goals – it will be an awful lot of money and time and effort and politics.

How contentmine will extract millions of species

We are now describing our workflow for extracting facts from the scientific literature. Yesterday Ross Mounce and I hacked through what was necessary to extract species from PLoSone. Here’s the workflow we came up with:

Ross has described it in detail and you should read his post for the details. The key points are:

  • This is an open project. You can join in; be aware it’s alpha in places. There’s a discussion list at!forum/contentmine-community . Its style and content will be determined by what you post!
  • We are soft-launching it. You’ll wake up one day and find that it’s got critical mass of people and content (e.g. species). No fanfare and no vapourware.
  • It’s fluid. The diagram above is our best guess today. It will change. I mentioned in the previous post that we are working with WikiData for part of “where it’s going to be put”. If you have ideas please let us know.



How ContentMine will extract 100 million scientific facts


At ContentMine we have been working hard to create a platform for extracting facts from the literature. It’s been great to create a team – CottageLabs (CL) and I have worked together for over 5 years and they know better than me what needs to be built. Richard (RSU) is more recent but is a wonderful combination of scientist, hacker and generator of community.

Community is key to ContentMine. This will succeed because we are a community. We aren’t a startup that does a bit for free and then sells out to FooPLC or BarCorp and loses vision and control. We all passionately believe in community ownership and sharing. Exactly where we end up will depend on you as well as us. At present the future might look like OpenStreetMap, but it could also look like SoftWareCarpentry or Zooniverse.  Or even the Blue Obelisk.

You cannot easily ask volunteers to build infrastructure. Infrastructure is boring, hard work, relatively unrewarding and has to be built to high standards. So we are very grateful to Shuttleworth for funding this. When it’s prototyped, with a clear development path, the community will start to get involved.

And that’s started with quickscrape. The Mozilla science sprint created a nucleus of quickscrape hackers. This proved we (or rather Richard!) had built a great platform that people could build on to create per-journal and per-publisher scrapers.

So here’s our system. Don’t try to understand all of it in detail – I’ll give a high-level overview.


CRAWL: we generate a feed of some or all of the scientific literature. Possible sources are JournalToCs, CrossRef, doing it ourselves, gathering exhaust fragments. We’d be happy not to do it if there are stable, guaranteed sources. The result of crawling is a stream of DOIs and/or bibliographic data passed to a QUEUE to be passed to …

SCRAPE: This extracts the components of the publications – e.g. abstract, fulltext, citations, images, tables, supplemental data, etc. Each publisher (or sometimes each journal) requires a scraper. It’s easy to write these for Richard’s quickscrape platform, which includes the scary Spooky, Phantom and Headless. A scraper takes between 30 minutes and 2 hours to write so it’s great for a spare evening. The scraped components are passed to the next queue …

EXTRACT: These are plugins which extract science from the components. Each scientific discipline requires a different plugin. Some are simple and can be created either by lookup against Wikipedia or other open resources, or by creating regular expressions (not as scary as they sound). Others, such as those interpreting chemical structure diagrams or phylogenetic trees, have taken more effort (but we’ve written some of them).
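To make the three stages concrete, here is a toy sketch of CRAWL, SCRAPE and EXTRACT joined by two queues. It is an illustration only and does not use quickscrape or any real ContentMine code: the DOIs and page contents are invented, `crawl` and `scrape` read from an in-memory dictionary instead of the web, and the one-entry genus list stands in for lookup against an open resource such as Wikipedia. The extractor combines such a lookup with a deliberately crude regular expression for species binomials.

```python
import re
from collections import deque

# Invented DOIs and text, standing in for real scraped publications.
FAKE_PAGES = {
    "10.9999/example.1": {
        "abstract": "The secondary metabolism of Aspergillus oryzae differs from A. flavus."
    },
    "10.9999/example.2": {"abstract": "No organisms are named here."},
}

# Stand-in for a lookup against an open resource.
KNOWN_GENERA = {"Aspergillus"}

# Capitalised word (or abbreviated "A.") followed by a lowercase word.
# This over-matches on purpose; the genus lookup filters the candidates.
BINOMIAL = re.compile(r"\b([A-Z][a-z]+|[A-Z]\.)\s?([a-z]{3,})\b")


def crawl():
    """CRAWL: yield a stream of DOIs (here, just the keys of the fake store)."""
    yield from FAKE_PAGES


def scrape(doi):
    """SCRAPE: return the components of one publication."""
    return FAKE_PAGES[doi]


def extract_species(text):
    """EXTRACT: return species names, expanding 'A.' to the last genus seen."""
    found = []
    last_genus = {}  # initial letter -> most recent full genus name
    for genus, epithet in BINOMIAL.findall(text):
        if genus.endswith("."):
            genus = last_genus.get(genus[0])
            if genus is None:
                continue  # abbreviation with no earlier full name
        elif genus not in KNOWN_GENERA:
            continue      # e.g. "The secondary" is not a species
        last_genus[genus[0]] = genus
        found.append(f"{genus} {epithet}")
    return found


def run_pipeline():
    scrape_queue = deque(crawl())   # CRAWL feeds the first queue
    extract_queue = deque()
    while scrape_queue:
        doi = scrape_queue.popleft()
        extract_queue.append((doi, scrape(doi)))
    facts = {}
    while extract_queue:
        doi, components = extract_queue.popleft()
        facts[doi] = extract_species(components["abstract"])
    return facts
```

Calling `run_pipeline()` on this fake data gives `["Aspergillus oryzae", "Aspergillus flavus"]` for the first DOI and an empty list for the second. A plugin for another discipline would replace `extract_species` and leave the crawl and scrape stages untouched, which is the point of keeping the stages separate.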

The results can be used in many ways. They include:

  • new terms and data which can go directly into Wikidata – we’ve spent time at Wikimania exploring this. Since facts are uncopyrightable we can take them from any publication whether or not it’s #openaccess
  • annotation of the fulltext. This can be legally done on openaccess text.
  • new derivatives of the facts – mixing them, recomputing them, doing simulations and much more

Currently people are starting to help write scrapers, and if you are keen let us know on the mailing list!forum/contentmine-community


Wikimania (Wikipedia) has changed my life

I’ve just spent 3 hectic days at Wikimania (the world gathering of Wikimedians) and am so overwhelmed I’m spending today getting my thoughts in order. Wikimedia is the generic organization for Wikipedia, Wikidata, Wikimedia Commons, and lots else. I’ll use Wikimedia as the generic term.

Very simply:

Wikimedia is at the centre of the Century of the Digital Enlightenment.

Everything I do will be done under the influence of WM. It will be my gateway to the facts, the ideas, the people, the organizations, the processes of the Digital Enlightenment. It is the modern incarnation of Diderot and the Encyclopédie.

2000 Wikimanians gathered at the Barbican (London) for talks, bazaars, food, and fun. It’s far more than an encyclopedia.

We are building the values of the Digital Enlightenment

If we succeed in that everything follows. So I sat through incredible presentations on digital democracy, the future of scholarship, the liberation of thought, globalization, the fight against injustice, the role of corporations, and of the state. How Wikipedia is becoming universal in Southern Africa through smart phones (while publishing in the rich west is completely out of touch with the future).

And fracture lines are starting to appear between the conservatism of the C20 and the Digital Enlightenment. We heard how universities still cannot accept that Wikipedia is the future. Students are not allowed to cite WP in their assignments – it’s an elephant in the room. Everyone uses it but you’re not allowed to say so.

If you are not using Wikipedia as a central part of your educational process, then you must change the process.

It’s a tragedy that universities are so conservative.

For me

Wikipedia is the centre of scientific research and publishing in this century.

I’ll expand in a later post.

But among the top values in Wikipedia is the community. The Digital Enlightenment is about community. It’s about inclusiveness. It’s about networks. It’s about sharing. C20 academia spends its energy fighting its neighbours (“we have a higher ranking than you and I have a higher impact factor than you”). In Wikipedia community is honoured. The people who build the bots are welcomed by those who edit, and those who curate the tools mutually honour those who build the chapters.

And anyone who thinks Wikipedia is dying has never been to Wikimania.

Wikipedia constantly reinvents itself and it’s doing so now. The number of edits is irrelevant. What matters is community and values. We’re concerned about how organizations function in the Digital age. How do we resolve conflicts? What is truth?

I was honoured to be asked to give a talk, and here it is: (3h12m-> 3h42m)

I’ll blog more about it later.

My major discovery was

WIKIDATA, which will become a cornerstone of modern science

I’ll write more but huge thanks to:

  • Ed Saperia who spent 2 years organizing Wikimania
  • Wikichemistry (to whom I awarded a Blue Obelisk)
  • Wikidata

I am now so fortunate to be alive in the era of Wikimedia, Mozilla, Open Street Map, Open Knowledge Foundation, and Shuttleworth Fellowships. These and others are a key part of changing and building a new world.

And I will be part of it (more later).