OpenCon2014 was the Best and Most Important Meeting of My Life; the Revolution is launched

I'm serious.

From start to finish this was a superb three-day meeting of young people who know that current scholarship/publishing/university_practice has so many injustices and so much waste that we cannot go on this way.

Must be brief - airport. Search for OpenCon2014 for reports and tweets and pictures. In brief:

  • 175 young passionate people from round the globe (only 5% of applicants could be selected for 80 scholarships)
  • sensational vision and planning from SPARC, RightToResearch, OA Button, etc.
  • Brilliant and inspiring speakers. A+ to everyone.
  • Many people already suffering under the present system.
  • Culminating in a superbly organized day of Advocacy actually on Capitol Hill, Washington. Huge insights into how to get political support and change.

I use the word "Revolution". It needn't be bloody. I'll write more...


Elsevier's French TDM licence conditions

There's a very useful blog post on Elsevier's content-mining conditions in France. I assume this is layered on the recent French five-year mega-deal with Elsevier. My school French isn't good enough for technico-legal terms, so I cannot comment authoritatively (and I'm not sure whether G**gl* translate is better).


This is only one clause and there are others that matter. It looks fairly similar to what Gemma Hersh and Chris Shillum pushed at us last month. Reading the rest of the blog post (which doesn't contain the whole contract, but only snippets - maybe that's all you are allowed to read)  it appears:

  • That mining has to take place through the Elsevier API.
  • That non-API crawling/scraping of the website is forbidden.
  • That there are significant restrictions on the re-use of mined material.

Since I and others have highlighted the unacceptability of Elsevier contracts (they change every month, they have unacceptable restrictions (must not disadvantage Elsevier business), are internally inconsistent and unclear) I hope very much that the French authorities signing this were aware of all the problems.

I'd be grateful for an expert view on what is contained and it would be very useful to have a reasonably precise legal translation....

If anyone or any country is about to sign an agreement with any publisher that contains any mention of mining, crawling, spidering, extraction or APIs then:




Because otherwise you are likely to betray the trust of 5 years of researchers.



from Fric_Adèle.

Pardon my French, but yours is indeed not subtle enough ;-)
My pleasure to help.

You missed the most important word of the document : “notamment”, standing for “not least”.
It means that the API is one of the ways to perform TDM, not the only way.
Fairly interesting, isn’t it ?

Moreover, in the next section about forbidden uses you can read :
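“A l’exception de ce qui est expressément prévu dans le présent Contrat ou autorisé par écrit par Elsevier, l’Abonné et ses Utilisateurs Autorisés ne peuvent pas :
- utiliser des robots ou programmes de téléchargement automatisé destinés à, de façon continue et automatique, extraire l’intégralité ou une partie substantielle des Produits Souscrits (sauf exception autorisée pour le TDM) ou destinés à perturber le fonctionnement des Produits Souscrits”
(Roughly, in English: “Except as expressly provided in this Agreement or authorised in writing by Elsevier, the Subscriber and its Authorised Users may not: use robots or automated downloading programs designed to continuously and automatically extract all or a substantial part of the Subscribed Products (except as authorised for TDM) or designed to disrupt the operation of the Subscribed Products”)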
So, another tweak in Elsevier practices : crawling/scraping is forbidden in general EXCEPT for TDM purposes.
TDM is considered as an exception to what is generally forbidden.

To be more precise : this document is the license between Elsevier and each institution. It is an appendix of the general contract, in which the TDM is allowed as follows – you can again read the word “notamment” :
“6.6 Data et text mining
Tous les contenus accessibles et souscrits sur la plateforme du Titulaire dans le cadre de cet accord seront utilisables à des fins de data et text mining notamment via une interrogation des données par une API connectée à la plateforme ScienceDirect®, conformément aux stipulations du Contrat de Licence.”

which you can translate by :
“All content accessible and subscribed through the agreement can be used for TDM purposes, not least using an API connected to ScienceDirect Platform, in compliance with the Licence Agreement.”

All the best,



Thank you so much.

This reinforces the idea that Elsevier contracts change with the phases of the moon... Now any authorised user can carry out TDM either with Elsevier's API or without it, and robots can only be used for TDM.

TDM is the common phrase in the UK for "data analytics" (Hargreaves legislation). I can't think of many reasons for using robots that wouldn't be classified as data analytics. Indexing, classification, usage - these are all data analytics in my understanding. Most non-mining activities would relate to storage and transformation and here the restrictions come from copyright and agreements, not whether robots are used to collect the material.

But maybe I'll be enlightened?



Institute of Physics (IOP) Charges for Open

ContentMine has a new collaborator - an astrophysicist. We're going to work together on mining data from journals. Their leading two include "The Astrophysical Journal" (TAJ) published by the Institute of Physics (IOP). Now s/he's one of the #scholarlypoor - left academia, but still fiercely interested in their discipline. Now TAJ is not Open, but I am allowed to read it. I always practice being a scholarlypoor, so went to the website and saw


WOW! an "Open Issue" - just what s/he needs to use to develop a scraper. So I have a look...


I understand the words, so I am competent to do ContentMining on this - I'm an astrophysical miner... Let's look at a paper...


WHAT!??? Open = Pay us SIX QUID (and that's just rental)

I give publishers the "cock-up" benefit before the "conspiracy". I can hypothesize that "Open" was a typo for "Current" or that Openness is a new concept for IOP, who haven't switched the links and so describe this as a "Bumpy Road" (a phrase coined by Elsevier). Or... let's hear from IOP.



Update for last month: Shuttleworth, BL, MySociety and more

I have been silent for the last month because I have been very busy. I hope to blog more in the next few days.

The main activities have been

  • The twice-yearly Gathering of Shuttleworth fellows (this time in Malta). My first time (March) was great, but this was fantastic. I have so much in common with so many of the past and current Fellows. I've a huge todo list to see how we can work together.
  • An application to the Shuttleworth Foundation for re-funding for a further year. I'll take you through this in detail in the next posts.
  • Running ContentMining workshops (at EBI on 2014-10-06) and (Jenny Molloy and Puneet Kishor) in Delhi last weekend (2014-11-02). We have now ironed out most of the problems and feel confident about delivering a range of workshops.
  • Preparing for the launch of our mining activities (RRSN, promise!). Will blog in next day or two
  • Preparing for a trip to the US: doing workshops in Chicago and Washington (OpenCon) and visiting Penn State.
  • Attending Open Access Button launch 2014-10-21. OA_Button is massive because at least there is some true energy and anger. I gave a 30 sec tribute where I said it was massive and that we hadn't seen nothing yet. I think OA_Button will change the world. Young people are sick of the broken values of academia. (I shall expand on this at OpenCon).
  • Cobi Smith (who is currently with Francois Grey in CERN) gave an excellent talk at Open Research Cambridge in the Panton Arms on crowdsourcing/crafting for disasters.
  • Met with all the Cambridge University Librarians to say farewell to Peter Morgan, retiring, who has helped me get off the ground in Library and Informatics research. Peter's been a huge contributor to new ideas and practices and we'll miss him. Good opportunity to talk with the library and know they are supportive of my ContentMining research.
  • Interesting meeting on Big Data in science in Cambridge. Great talk by Florian Markowetz (Cancer Genomics) debunking the hype (Big Data is sliding down the curve into trough of disillusion). He demolished the "let's throw all our data into Machine Learning and wonderful things will automatically emerge" syndrome. I completely agree - Machine Learning has its place - e.g. OCR - but leads to no understanding and no transferability. I am largely renouncing it.
  • British Library Labs on "Big Data" in Arts and Humanities. Massively Wonderful. Wonderful speakers and projects. The BL runs its labs on a shoestring (Mainly Mahendra and Ben) and IMO is a world leader in innovative use of library resources. I'm certainly planning to see if they'll host a workshop on ContentMining.
  • MySociety - ran a great meeting yesterday evening in Cambridge and asked me to present TheContentMine. I had some help - see

More later...


ContentMine at WOSP2014: Text and Data Mining: III What Elsevier's Chris Shillum thinks we can do; Responsible Mining

My last post described the unacceptable and intransigent attitude of Elsevier's Gemma Hersh at WOSP2014 Text and Datamining of Scientific Documents.

But there was another face to Elsevier at the meeting, Chris Shillum. Charles Oppenheim and I talked with him throughout lunch, and later Richard Smith-Unna also talked with him. None of the following is on public record but it's a reasonable approximation to what he told us.

Firstly he disagrees with Gemma Hersh that we HAVE to use Elsevier's API and sign their Terms and Conditions (which give away several rights and severely limit what we can do and publish). We CAN mine Elsevier's content through their web pages that we have the right to read and we cannot be stopped by law. We have a reasonable duty of care to respect the technical integrity of their system, but that's all we have to worry about.

I *think* we can move on from that. If so, thanks, Chris. And if so, it's a model for other publishers.

We told him what we were planning to do - read every paper as it is published and extract the facts.

CS:  Elsevier published ca 1000 papers a day.

PMR : that's one per minute;  that won't break your servers

CS: But it's not how humans behave....

I think this means that if we appear to their servers as just another human then they don't have a load problem. (For the record I sometimes download manually in sequence as many Elsevier papers as I can to (a) check the licence or (b) see whether the figures contain chemistry or sequences. Both of these are legitimate human activities for subscribers or non-subscribers. For (a) I can manage 3 per minute, 200/hour - I have to scroll to the end of the paper because that's often where the "all rights reserved, © Elsevier" is. For (b) it's about 40/hour if there are an average of 5 diagrams (I can tell within 500 milliseconds whether a diagram has chemistry, sequences, phylogenetic trees, dose-response curves...).) [Note: I don't do this for fun - it isn't - but because I am fighting for our digital rights.]

But it does get boring and error-prone which is why we use machines.
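To make the load argument concrete, here is a minimal sketch of a machine that paces itself. Spread evenly over 24 hours, 1000 papers a day works out at one request every 86.4 seconds, i.e. slightly slower than one per minute. The names `min_interval` and `Throttle` are mine, purely for illustration; nothing here is an Elsevier or ContentMine API, and an even spread is an assumption about what counts as polite.

```python
import time

SECONDS_PER_DAY = 24 * 60 * 60


def min_interval(papers_per_day: int) -> float:
    """Seconds between requests if the day's papers are spread evenly."""
    return SECONDS_PER_DAY / papers_per_day


class Throttle:
    """Make successive calls to wait() at least `interval` seconds apart.

    Hypothetical helper: clock and sleep are injectable so the pacing
    logic can be checked without actually sleeping.
    """

    def __init__(self, interval: float, clock=time.monotonic, sleep=time.sleep):
        self.interval = interval
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous request, if any

    def wait(self) -> float:
        """Sleep if the last request was too recent; return seconds slept."""
        now = self._clock()
        delay = 0.0
        if self._last is not None:
            delay = max(0.0, self.interval - (now - self._last))
            if delay:
                self._sleep(delay)
        self._last = now + delay
        return delay
```

A crawler would call `Throttle(min_interval(1000)).wait()` before each download; the actual fetching is deliberately left out.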

The second problem is what we can publish. The problem is copyright. If I download a complete Closed Access paper from Elsevier and post it on a public website I am breaking copyright. I accept the law. (I am only going to talk about the formal law, not morals or ethics at this stage).

If I read a scientific paper I can publish facts. I MAY be able to publish some text as comment, either as metadata, or because I am critiquing the text. Here's an example:

In 1953, the following sentence appeared near the end of a neat little paper by James Watson and Francis Crick proposing the double helical structure of DNA (Nature 171: 737-738 (1953)):

"It has not escaped our notice that the specific pairing we have postulated immediately suggests a possible copying mechanism for the genetic material."

Of course, this bon mot is now wildly famous amongst scientists, probably as much for its coyness and understatement ...

Lablit can reasonably assume that this is fair comment and quote C+W (Copyright who? they didn't have transfer in those days). I can quote Lablit in similar fashion. In the US this is often justified under "fair use", but there is no such protection in the UK. Fair use is, anyway, extremely fuzzy.

So Elsevier gave a guide as to what they allow IF you sign their TDM restrictions. Originally it was 200 characters, but I pointed out that many entities (facts), such as chemical names or biological sequences, were larger. So Chris clarified that Elsevier would allow 200 characters of surrounding context. This could mean something like the following (from an abstract), showing ContentMine markup:

"...Generally the secondary metabolite capability of <a href=""> A.oryzae</a> presents several novel end products likely to result from the domestication process..."

That's 136 characters without the Named Entity "<a.../a>"
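A minimal sketch of what "200 characters of surrounding context" could look like in code. How the allowance is counted isn't spelled out above, so this assumes it is shared between the text before and the text after the entity, split evenly where possible, and that the entity itself is not counted. The function name `context_snippet` is hypothetical.

```python
def context_snippet(text: str, start: int, end: int, budget: int = 200) -> str:
    """Return the entity text[start:end] plus at most `budget` characters of context.

    Assumption: the budget is split evenly before/after the entity, and any
    allowance one side cannot use (entity near an edge) goes to the other side.
    """
    if not (0 <= start <= end <= len(text)):
        raise ValueError("entity span out of range")
    half = budget // 2
    before = min(start, half)
    after = min(len(text) - end, budget - before)
    # Hand any unused right-hand allowance back to the left-hand side.
    before = min(start, budget - after)
    return text[start - before:end + after]
```

For an entity in the middle of a long paragraph this returns 100 characters either side; for a short sentence it simply returns the whole sentence.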

Which means that we need guidelines for Responsible Content Mining.

That's what JISC have asked Jenny Molloy and me to do (and we've now invited Charles Oppenheim to be a third author). It's not easy as there are few agreed current practices. It's possible for us to summarise what people currently do, what has been challenged by rights holders and then for us to suggest what is reasonable. Note that this is not a negotiation. We are not at that stage.

So I'd start with:

  1. The right to read is the right to mine.
  2. Researchers should take reasonable care not to violate the integrity of publishers' servers.
  3. Copyright law applies to all parts of the process, although it is frequently unclear.
  4. Researchers should take reasonable care not to violate publishers' copyright.
  5. Facts are uncopyrightable.
  6. Science requires the publication of source material as far as possible to verify the integrity of the process. This may conflict with (3/4).


CC BY publishers such as PLoS and BMC will only be concerned with (2). Therefore they act as a yardstick, and that's why we are working with them. Cameron Neylon of PLoS has publicly stated that single text miners cause no problems for PLoS servers and there are checks to counter irresponsible crawling.

Unfortunately (4) cannot be solved by discussions with publishers as they have shown themselves to be in conflict with Libraries, Funders, JISC, etc. Therefore I shall proceed by announcing what I intend to do. This is not a negotiation and it's not asking for permission. It's allowing a reasonable publisher to state any reasonable concerns.

And I hope we can see that as a reasonable way forward.









WOSP2014: Text and Data Mining: II Elsevier's Presentation (Gemma Hersh)

WOSP2014 is a scholarly, peer-reviewed workshop. It consists of submitted, peer-reviewed talks and demos and invited talks from well-known people in the field (Lee Giles, Birger Larsen). At ContentMine we submitted three papers/demos which were peer-reviewed and accepted (and which I'll blog later).

But there was also one Presentation which was, as I understand, neither invited nor peer-reviewed.

"Elsevier's Text and Data Mining Policy" by Gemma Hersh

It is usually inappropriate for a manufacturer to present at a scholarly conference where the audience are effectively customers. It ends up as a product pitch which is offensive to attendees who have paid to attend, and offensive to those who have submitted papers which were rejected, while product pitches are allowed.

This was one of the most unacceptable presentations I have ever seen at a scholarly workshop and I said so.

Before I add my own comments I simply record the facts. Professor Charles Oppenheim agrees that this is a factual record.

GH = Gemma Hersh (Elsevier)
CO = Prof Charles Oppenheim
GH arrived 10 mins before her presentation and left immediately afterwards. She did not stop to talk. She later tweeted that she had a meeting.
GH Presentation

(1) Elsevier's presentation was not an invitation or peer-reviewed submission but appeared to have been a result of pressuring the organizers.

(2) it was a manufacturer-specific product pitch not a scholarly presentation

(3) It made no attempt to be balanced but presented only Elsevier's product. In particular:

  * no mention was made of Hargreaves

  * no mention that it had been rejected by LIBER and associates

  * no mention that the library community had walked out of Licences for Europe

(4) Elsevier said that their studies showed researchers preferred APIs. No mention was made that researchers had to sign an additional agreement

Public Discussion (no record, but ca 20 witnesses)

PMR challenged the presentation on the basis of bias and inaccuracy.

GH criticized PMR for being aggressive. She stated that it was the libraries' fault that L4E had broken down

PMR asked GH to confirm that if he had the right to read Elsevier material he could mine it without using Elsevier's API

GH replied that he couldn't.

CO told GH that PMR had a legal right to do so

GH said he didn't and that CO was wrong

Discussion continued with no resolution. GH showed no intention of listening to PMR and CO

Later tweets from GH
@petermurrayrust check the explanatory notes that accompany the legislation.
@petermurrayrust would prefer a constructive chat rather than an attack though....
@petermurrayrust very happy to keep talking but I've just come straight from one appointment and now have another.


PMR I believe that any neutral observer would agree that this was roughly factually correct.


PMR: now my comments...

It is completely unacceptable for a product manager to push their way into a scholarly workshop, arrive 10 minutes before their presentation, give a product pitch and leave immediately without deigning to talk to anyone.

The pitch itself was utterly one-sided, presenting Elsevier as the text-miner's friend and failing to give a balanced view of the last several years. Those of us in the UK Hargreaves process and Licences4Europe know that STM publishers in general and Elsevier in particular have thrown money and people in trying to control the mining effort through licences. To give a blatantly biased presentation at a scholarly meeting rules them out as trustable partners.

Worse, the product pitch was false. I called her on this - I was forthright - and asked whether I could mine without Elsevier's permission. She categorically denied this. When challenged by Professor Oppenheim she told him curtly he was wrong and Elsevier could do what they liked.

The law explicitly states that publishers cannot use terms and conditions or other contractual processes to override the right to mine for non-commercial research purposes.

So it's a question of who do you believe:

Gemma Hersh, Elsevier or Professor Charles Oppenheim, Loughborough, Northampton, City?

(and PMR, Nottingham and Cambridge Universities)

If GH is right, then the law is pointless.

But she isn't and it isn't.

It gets worse. In later discussions with Chris Shillum, who takes a more constructive view, he made it clear that we had the right to mine without Elsevier's permission as long as we didn't sign their terms and conditions. The discussion - which I'll cover in the next post - was useful.

He also said that Elsevier had changed their TaC several times since January, much of this as a result of my challenging them. This means:

  1. Elsevier themselves do not agree on the interpretation of the law
  2. Elsevier's terms and conditions are so mutable and frequently changed that they cannot be regarded as having any force.




ContentMine at WOSP2014: Text and Data Mining; I. General Impressions

On Friday 2014-09-12 four of us from The ContentMine presented 3 papers at WOSP2014. The meeting was well run by Petr Knoth and colleagues from the Open University and CORE (the JISC- and funder-supported project for aggregation of repositories). The meeting gave a useful overview of TextAndDataMining (TDM). From the program:

  1. The whole ecosystem of infrastructures including repositories, aggregators, text-and data-mining facilities, impact monitoring tools, datasets, services and APIs that enable analysis of large volumes of scientific publications.
  2. Semantic enrichment of scientific publications by means of text-mining, crowdsourcing or other methods.
  3. Analysis of large databases of scientific publications to identify research trends, high impact, cross-fertilisation between disciplines, research excellence etc.

This was an important meeting in several ways and I want to comment on:

  1. the general state of Content-mining
  2. Our own presentations
  3. Elsevier's uninvited presentation on "Elsevier's Text and Data Mining Policy". I'll split this into two parts in two separate posts.

PMR's impressions

Content mining and extraction has a long history so it's now an incremental rather than a revolutionary technology. Much of the basis is "machine learning" where statistical measures are used for classification and identification (essentially how spam detectors work). So several papers dealt with classification, based on the words in the main part ("full text") of the document. [NOTE: full-text is far better than abstracts for classifying documents. Always demand full-text.]

Then there's the problem of analysing semistructured information. A PDF is weakly structured - a machine doesn't know what order the characters are in, whether there are words, etc. Tables are particularly difficult and there were two papers on this.

And then aggregation, repositories, crawling etc. The most important takeaway for me is CORE, which aggregates several hundred (UK and other) repositories. This is necessary because UK repositories are an uncontrolled mess. Each university does its own thing, has a different philosophy, uses different indexing and access standards. The universities can't decide whether the repo is for authors' benefit, readers' benefit, the university's benefit, HEFCE's benefit or whether they "just have to have one because everyone else does". (By contrast the French have HAL.) So UK repositories remain uncrawlable and unindexed until CORE (even then there is much internal inconsistency and uncertain philosophy).

It's important to have metrics because otherwise we don't know whether something works. But there was too much emphasis on metrics (often to 4 (insignificant) figures). One paper reported 0.38% recall with strict method and 94% with a more sloppy one. Is this really useful?
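A toy illustration of why a "strict" and a "sloppy" method can give wildly different recall on the same output. The gold and predicted lists and both matching rules are made up for illustration; they are not the data or the methods of the paper mentioned above.

```python
def recall(gold, predicted, match):
    """Fraction of gold items for which at least one prediction matches."""
    if not gold:
        raise ValueError("recall is undefined for an empty gold set")
    found = sum(1 for g in gold if any(match(g, p) for p in predicted))
    return found / len(gold)


def strict(g: str, p: str) -> bool:
    """Exact string equality."""
    return g == p


def lenient(g: str, p: str) -> bool:
    """Case-insensitive containment in either direction."""
    g, p = g.lower(), p.lower()
    return g in p or p in g


# Hypothetical annotations, for illustration only.
gold = ["Aspergillus oryzae", "kojic acid", "DNA"]
predicted = ["A. oryzae", "Kojic acid", "dna polymerase"]

strict_recall = recall(gold, predicted, strict)    # nothing matches exactly
lenient_recall = recall(gold, predicted, lenient)  # two of the three are "found"
```

The same predictions score zero under one rule and two out of three under the other, so a recall figure means little unless the matching rule is stated alongside it.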

But virtually no one (I'll omit the keynotes) gave any indication of whether they were doing something useful to others outside their group. I talked to 2-3 groups - why were they working on (sentiment analysis | table extraction | classification)? Did they have users? Was their code available? Was anyone else using their code? Take-up seems very small. Couple that with the fact that many projects have a 2-3 year lifespan and that the basis is competition rather than collaboration, and we see endless reinvention. (I've done table extraction but I'd much rather someone else did it, so we are working with tabulaPDF.) The output in academia is a publication, not a running reusable chunk of code.

So it's not surprising that there isn't much public acknowledgement of TDM. The tools are tied up in a myriad of university labs, often without code or continuity.

One shining exception is Lee Giles, whose group has built CiteSeer. Lee and I have known each other for many years and we worked together on a MicrosoftResearch project, OREChem. So when we got talking we found we had two bits of the jigsaw.

Readers of this blog will know that Ross Mounce and I are analysing diagrams. To do that we have to identify the diagrams, and this is best done from the captions (captions are the most important part of a scientific document for understanding what it's about). And we are hacking the contents of the images. So these two fit together perfectly. His colleague Sagnik Ray Choudhury is working on extracting and classifying the images; that saves us huge time and effort in knowing what to process. I'm therefore planning to visit later this year.

For me that was probably the most important positive outcome of the meeting.

The next post will deal with Elsevier's Gemma Hersh who gave an uninvited "Presentation", and the one after with Elsevier's Chris Shillum's comment on Gemma's views and also on Elsevier's TaC.





Wellcome's recommendations on Data Access and Sharing

The Wellcome Trust and other funders have commissioned a study on


(This is a report of the Expert Advisory Group on Data Access (EAGDA). EAGDA was established by the MRC, ESRC, Cancer Research UK and the Wellcome Trust in 2012 to provide strategic advice on emerging scientific, ethical and legal issues in relation to data access for cohort and longitudinal studies.)

This is very welcome - data is a poor relation of the holy - and in many subjects often largely useless - PDF full text. Here the report states why, and how we need to care for data.

Our findings were that

– making data accessible to others can carry a significant cost to researchers (both in terms of financial resource and the time it requires) and there are constraints in terms of protecting the privacy and confidentiality of research participants;

– while funders have done much valuable work to encourage data access and have made significant investments to support key data resources (such as the UK Data Service for the social sciences), the data management and sharing plans they request of researchers are often not reviewed nor resourced adequately, and the delivery of these plans neither routinely monitored nor enforced;

– there is typically very little, if any, formal recognition for data outputs in key assessment processes – including in funding decisions, academic promotion, and in the UK Research Excellence Framework;

– data managers have an increasingly vital role as members of research teams, but are often afforded a low status and few career progression opportunities;

– working in data intensive research areas can create potential challenges for early career researchers in developing careers in these fields;

– the infrastructures needed to support researchers in data management and sharing, and to ensure the long-term preservation and curation of data, are often lacking (both at an institutional and a community level).

TL;DR It needs commitment in money, policies and management and it's a large task

So ...


We recommend that research funders should:

1. Strengthen approaches for scrutinising data management and sharing plans associated with their funded research – ensuring that these are resourced appropriately and implemented in a manner that maximises the long-term value of key data outputs.


2. Urge the UK Higher Education funding councils to adopt a clear policy at the earliest possible stage for high quality datasets that are shared with others to be explicitly recognised and assessed as valued research outputs in the post-2014 Research Excellence Framework


3. Take a proactive lead in recognising the contribution of those who generate and share high quality datasets, including as a formal criterion for assessing the track record and achievements of researchers during funding decisions.


4. Work in partnership with research institutions and other stakeholders to establish career paths for data managers.


5. Ensure key data repositories serving the data community have adequate funding to meet the long-term costs of data preservation, and develop user-friendly services that reduce the burden on researchers as far as possible.

PMR: This is the FUNDERS urging various bodies to act. Some items are conceivably possible. (4) is highly desirable but very challenging: universities have consistently failed to value support roles, honouring "research outputs" instead. (5) is possible but must be done by people and organizations who understand repositories, not university libraries whose repositories are effectively unused.

And we MUSTN'T hand this over to commercial companies.

We recommend that research leaders should:

6. Adopt robust approaches for planning and costing data management and sharing plans when submitting funding applications.

7. Ensure that the contributions of both early-career researchers and data managers are recognised and valued appropriately, and that the career development of individuals in both roles is nurtured.

8. Develop and adopt approaches that accelerate timely and appropriate access to key research datasets.

9. Champion greater recognition of data outputs in the assessment processes to which they contribute.

PMR: (6) will have lipservice unless the process is changed; (7) means changing culture and diverting money; (8) is possible; (9) requires a stick from funders.

We also emphasise that research institutions and journals have critical roles in supporting the cultural change required.

Specifically, we call for research institutions to develop clear policies on data sharing and preservation; to provide training and support for researchers to manage data effectively; to strengthen career pathways for data managers; and to recognise data outputs in performance reviews.

We call on journals to establish clear policies on data sharing and processes to enable the contribution of individual authors on the publication to be assessed, and to require the appropriate citation and acknowledgement of datasets used in the course of a piece of published research. In addition, journals should require that datasets underlying published papers are accessible, including through direct links in papers wherever possible.

PMR:  Journals have failed us catastrophically both technically (they are among the worst technical printed output in the world - broken bizarre HTML and not even using Unicode) and politically (where their main product is glory, not technical). The only way to change this is to create different organisations.

This will be very difficult.

The funders are to be commended on these goals - it will be an awful lot of money and time and effort and politics.