petermr's blog

A Scientist and the Web


Content Mining: Why you and I should NOT sign up for Elsevier’s TDM service

In the last few days Elsevier has announced their policy on Text And Data Mining (TDM). I use the term “content mining” because I wish to mine every part of published content (images, audio, video) and not just text. The policy was announced here.

This post contains a lot of material (from Elsevier and my comments) so I’ll try to summarise. Note that Elsevier’s material seems inconsistent in places (common with this publisher). I have had to go behind Elsevier’s paywall to find one statement of agreement and rights and it is probable that I have not found everything. In essence:

  • Elsevier asserts complete control over “its” content and requires both institutions and individuals to sign licences.
  • Elsevier is the sole author and controller of the policy – there has been no Open discussion or agreement with scholarly bodies.
  • Libraries have to – individually – sign agreements with Elsevier. There are no details of these policies or whether they entail additional institutional payment. It is also possible that Institutions may be asked to give up content-mining rights in return for lower overall prices. (Libraries have universally and unilaterally given away all these rights over the last decade and support publishers to forbid machine access to content).
  • Researchers have to register as a developer (I think) and ask permission of Elsevier for every project they wish to do. It is not clear whether permission is automatic or whether Elsevier exercise control over choice and scope of project (they certainly did when I “negotiated” with them).
  • Researchers can only access content through an Elsevier-controlled portal. They have to register as a Developer and get an APIKey (conflicts with “sign a click-through licence”).
  • Researchers can only mine text. Images are specifically prohibited. This is useless for me – as I and colleagues are mining chemical structure diagrams.
  • There is no indication of how current the material will be. I shall be mining the literature an hour after it appears. Will the API provide that?
  • The amount that can be republished is often useless (“200 characters”). I want to build corpora (impossible); vocabularies (essential to record precise words – impossible); chemical names (often > 200 characters so impossible); figure captions (impossible).
  • The researchers must commit to a CC-NC licence. This effectively kills downstream use (I shall use CC0). It also trains them into thinking CC-NC is a “good thing”. It isn’t.
  • If a researcher has a LEGITIMATE collection of papers that they wish to mine (say on their hard disk) they are forbidden. They have to go to each publisher (if this awful protocol is promoted elsewhere) and find the API and mine the individual papers. Absurd.


This is licence-controlled TDM. The publishers tried very hard to get Europe (Neelie Kroes) to agree to licences for TDM (“Licences for Europe”). They failed.

They tried to stop the UK Hargreaves copyright reform from exempting data analytics. They failed.

The leading library organizations and funders such as the British Library, JISC, LIBER, Wellcome Trust, RCUK are united in their opposition to licences. This is simply Licences under another head.

The danger is that University libraries, who have signed these restrictive clauses for years, will continue to sign them.

Don’t take my word for this. Ask the BL, or JISC or LIBER.

APIs make it HARDER to mine. We are releasing technology that will work directly on PDFs. It’s Open Source and works. And others are doing the same. If every publisher came up with a similar process it would make the burden of mining huge. This is probably what some publishers hope.

Here are the supporting docs (I have emphasized some parts).

(In front of paywall)

How to gain access

For Academic subscribers once your institutional agreement has been updated to allow text-mining access, individual researcher access is an automatic process, managed through our developer portal. Researchers will need to follow three steps:

  1. Register their details using the online form on the developer’s website
  2. Agree to our Text Mining conditions via a “click-through” agreement
  3. Receive an API token that will allow you to access ScienceDirect content (delivered in an XML format suitable for text mining)

Terms and conditions of text and data mining

  • Text mining access is provided to subscribers for non-commercial purposes
  • Access is via the ScienceDirect APIs only
  • Text mining output adheres to the following conditions:

1. Output can contain “snippets” of up to 200 characters of the original text

2. Licensed as CC-BY-NC

3. Includes DOI link to original content
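To make these three output conditions concrete, here is a minimal Python sketch of what a conforming output record might look like. It is only an illustration under stated assumptions: the names `MinedRecord`, `make_record` and `is_truncated` are hypothetical, the DOI is made up, and the 250-character string stands in for a long chemical name. Nothing here comes from Elsevier's API.

```python
from dataclasses import dataclass

MAX_SNIPPET_CHARS = 200  # snippet limit quoted in the conditions above
LICENCE = "CC-BY-NC"     # licence quoted in the conditions above


@dataclass
class MinedRecord:
    """Hypothetical container for one piece of text mining output."""
    snippet: str
    licence: str
    doi_link: str


def make_record(extracted_text: str, doi: str) -> MinedRecord:
    """Build an output record, cutting the text down to the snippet limit."""
    return MinedRecord(
        snippet=extracted_text[:MAX_SNIPPET_CHARS],
        licence=LICENCE,
        doi_link=f"https://doi.org/{doi}",
    )


def is_truncated(extracted_text: str) -> bool:
    """True when the extracted text cannot be republished intact."""
    return len(extracted_text) > MAX_SNIPPET_CHARS


# Hypothetical inputs: a 250-character stand-in for a long chemical name
# and a made-up DOI.
long_name = "x" * 250
record = make_record(long_name, "10.0000/example")
print(len(record.snippet), is_truncated(long_name))
```

Anything longer than 200 characters, such as a long systematic chemical name or a figure caption, loses its tail under this rule, which is why the limit matters for corpora and vocabularies.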

Note: We request that all access to content for text mining purposes takes place through our APIs and remind you that in order to maintain performance and availability for all users, the terms and conditions of access to ScienceDirect continue to prohibit the use of robots, spiders, crawlers or other automated programs, or algorithms to download content from the website itself.

(Behind paywall?)

Text mining of Elsevier publications


Definition: the client application is a system that ingests full-text publications in order to text-mine them: extract data and information using automated algorithms. Examples of text mining are entity recognition, relationship extraction, and sentiment analysis using linguistic methods.

We allow this use case under the following conditions:

  • Access to the APIs for text mining purposes is available free of charge to researchers at academic institutions that subscribe to The full-text content that is available for mining through the APIs is the content that the institute has subscribed to [PMR: it's TEXT ONLY].
  • Our APIs must be used to retrieve the content; crawling the website itself is not allowed.
  • The institution needs to have written permission from Elsevier for text mining, either through a clause embedded in an existing subscription agreement or as a separate add-on agreement.
  • After permission is granted, researchers at the institution will be able to obtain an APIKey by registering their text mining project through the ‘My Projects’ page of the Elsevier Developer Portal.
  • The use of Elsevier content in text mining, and of the resulting output, should adhere to Elsevier’s TDM policy as outlined on

If your institution wants to get written permission for text mining, the institution’s authorized representative can request Elsevier to provide one, by contacting his/her Elsevier account manager or our Academic & Government Sales department.

If you want to mine Elsevier content for commercial purpose, please contact our Corporate Sales department.

