In the last few days Elsevier has announced their policy on Text And Data Mining (TDM). I use the term “content mining” as I wish to mine every part of published content (images, audio, video) and not just text. The policy was announced here http://www.elsevier.com/about/universal-access/content-mining-policies .
This post contains a lot of material (from Elsevier and my comments) so I’ll try to summarise. Note that Elsevier’s material seems inconsistent in places (common with this publisher). I have had to go behind Elsevier’s paywall to find one statement of agreement and rights and it is probable that I have not found everything. In essence:
- Elsevier asserts complete control over “its” content and requires both institutions and individuals to sign licences.
- Elsevier is the sole author and controller of the policy – there has been no Open discussion or agreement with scholarly bodies
- Libraries have to – individually – sign agreements with Elsevier. There are no details of these policies or whether they entail additional institutional payment. It is also possible that Institutions may be asked to give up content-mining rights in return for lower overall prices. (Libraries have universally and unilaterally given away all these rights over the last decade and support publishers to forbid machine access to content).
- Researchers have to register as a developer (I think) and ask permission of Elsevier for every project they wish to do. It is not clear whether permission is automatic or whether Elsevier exercise control over choice and scope of project (they certainly did when I “negotiated” with them /pmr/2011/11/27/textmining-my-years-negotiating-with-elsevier/ ).
- Researchers can only access content through an Elsevier-controlled portal. They have to register as a Developer and get an APIKey (conflicts with “sign a click-through licence”).
- Researchers can only mine text. Images are specifically prohibited. This is useless for me – as I and colleagues are mining chemical structure diagrams.
- There is no indication of how current the material will be. I shall be mining the literature an hour after it appears. Will the API provide that?
- The amount that can be republished is often useless (“200 characters”). I want to build corpora (impossible); vocabularies (essential to record precise words – impossible); chemical names (often > 200 characters so impossible). Figure captions (impossible).
- The researchers must commit to a CC-NC licence. This effectively kills downstream use (I shall use CC0). It also trains them into thinking CC-NC is a “good thing”. It isn’t.
- If a researcher has a LEGITIMATE collection of papers that they wish to mine (say on their hard disk) they are forbidden. They have to go to each publisher (if this awful protocol is promoted elsewhere) and find the API and mine the individual papers. Absurd.
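To make the 200-character limit concrete, here is a minimal sketch in Python. It assumes the limit is counted as plain characters (Elsevier's text does not say how it is counted), and the function names and sample strings are hypothetical, made up purely for illustration.

```python
# Sketch only: the policy says output may contain "snippets" of up to
# 200 characters. How characters are counted is an assumption here.
SNIPPET_LIMIT = 200


def fits_snippet_limit(text: str, limit: int = SNIPPET_LIMIT) -> bool:
    """Return True if the text could be republished whole as one snippet."""
    return len(text) <= limit


def truncate_to_snippet(text: str, limit: int = SNIPPET_LIMIT) -> str:
    """Cut the text at the limit; everything after it is lost."""
    return text[:limit]


# Hypothetical mined items (not taken from any real paper).
short_term = "benzene"
long_caption = "Figure 1. " + "Reaction scheme showing intermediates and yields. " * 5

for label, text in [("term", short_term), ("caption", long_caption)]:
    print(label, len(text), fits_snippet_limit(text))
```

A single vocabulary term fits easily, but a figure caption of a few sentences does not, and truncating it discards exactly the precise wording a corpus or vocabulary needs.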
This is licence-controlled TDM. The publishers tried very hard to get Europe (Neelie Kroes) to agree to licences for TDM (“Licences for Europe”). They failed.
They tried to stop the UK Hargreaves process exempting data analytics from copyright reform. They failed.
The leading library organizations and funders such as the British Library, JISC, LIBER, Wellcome Trust, RCUK are united in their opposition to licences. This is simply Licences under another head.
The danger is that University libraries – who have signed these restrictive clauses for years – will continue to sign them.
DON’T.
Don’t take my word for this. Ask the BL, or JISC or LIBER.
BUT DON’T SIGN ELSEVIER’S TDM.
And:
YOU DO NOT NEED ANY API.
APIs make it HARDER to mine. We are releasing technology that will work directly on PDFs. It’s Open Source and works. And others are doing the same. If every publisher came up with a similar process it would make the burden of mining huge. This is probably what some publishers hope.
Here are the supporting docs. I have emphasized some parts:
http://www.elsevier.com/about/universal-access/content-mining-policies (In front of paywall)
How to gain access
For Academic subscribers once your institutional agreement has been updated to allow text-mining access, individual researcher access is an automatic process, managed through our developer portal. Researchers will need to follow three steps:
- Register their details using the online form on the developer’s website
- Agree to our Text Mining conditions via a “click-through” agreement
- Receive an API token that will allow you to access ScienceDirect content (delivered in an XML format suitable for text mining)
…
Terms and conditions of text and data mining
- Text mining access is provided to subscribers for non-commercial purposes
- Access is via the ScienceDirect APIs only
- Text mining output adheres to the following conditions:
1. Output can contain “snippets” of up to 200 characters of the original text
2. Licensed as CC-BY-NC
3. Includes DOI link to original content
Note: We request that all access to content for text mining purposes takes place through our APIs and remind you that in order to maintain performance and availability for all users, the terms and conditions of access to ScienceDirect continue to prohibit the use of robots, spiders, crawlers or other automated programs, or algorithms to download content from the website itself.
http://www.developers.elsevier.com/cms/content/text-mining-elsevier-publications (behind paywall?)
Text mining of Elsevier publications
Definition: the client application is a system that ingests full-text publications in order to text-mine them: extract data and information using automated algorithms. Examples of text mining are entity recognition, relationship extraction, and sentiment analysis using linguistic methods.
We allow this use case under the following conditions:
- Access to the APIs for text mining purposes is available free of charge to researchers at academic institutions that subscribe to sciencedirect.com. The full-text content that is available for mining through the APIs is the content that the institute has subscribed to [PMR it’s TEXT ONLY].
- Our APIs must be used to retrieve the content; crawling the sciencedirect.com website itself is not allowed.
- The institution needs to have written permission from Elsevier for text mining, either through a clause embedded in an existing subscription agreement or as a separate add-on agreement.
- After permission is granted, researchers at the institution will be able to obtain an APIKey by registering their text mining project through the ‘My Projects’ page of the Elsevier Developer Portal.
- The use of Elsevier content in text mining, and of the resulting output, should adhere to Elsevier’s TDM policy as outlined on http://www.elsevier.com/tdm.
If your institution wants to get written permission for text mining, the institution’s authorized representative can request Elsevier to provide one, by contacting his/her Elsevier account manager or our Academic & Government Sales department.
If you want to mine Elsevier content for commercial purpose, please contact our Corporate Sales department.
Hi Peter,
We think our new text mining policy goes a long way to addressing researcher needs in respect of TDM. You raise some good questions, though, and I’d like to take this opportunity to respond to them:
• Elsevier requires both institutions and individuals to sign licenses
Our objective is to provide practical support to researchers. We believe a licence-based, self-service solution removes access barriers for researchers who want to text and data mine while allowing publishers to ensure performance and quality of service for all users.
• Elsevier is the sole author and controller of the policy – there has been no Open discussion or agreement with scholarly bodies
This new policy is the result of extensive discussions with academic institutions – we have, for example, been running pilots with a number of institutions over the course of last year to test and refine both our technology and the terms and conditions under which this access is provided.
• Libraries have to – individually – sign agreements with Elsevier. There are no details of these policies or whether they entail additional institutional payment. It is also possible that Institutions may be asked to give up content-mining rights in return for lower overall prices. (Libraries have universally and unilaterally given away all these rights over the last decade and support publishers to forbid machine access to content).
There is no additional charge for this access, and it will be automatically included in all library contracts when they are renewed. Libraries who would like access immediately (perhaps their next renewal is some time away) are asked to simply send us a request and we will amend their current agreement to include this access.
• Researchers have to register as a developer (I think) and ask permission of Elsevier for every project they wish to do. It is not clear whether permission is automatic or whether Elsevier exercise control over choice and scope of project
The process is automatic – researchers are indeed asked to register and agree to the terms, and are then automatically sent an API key. You don’t need to contact anyone at Elsevier, and we do not exercise any control over the choice and scope of research projects.
• Researchers can only mine text. Images are specifically prohibited. This is useless for me – as I and colleagues are mining chemical structure diagrams.
Figure metadata (titles, captions, etc) is included in the XML returned from our APIs and may be mined as a matter of course. Due to some ambiguity about re-use rights for some of the images included in our content, we are not automatically making the images themselves available to those who self-register for our text-mining API, but do have an image retrieval API that we can make available upon request once we understand the way in which the researcher intends to use the images.
• There is no indication of how current the material will be. I shall be mining the literature an hour after it appears. Will the API provide that?
Yes. The APIs provide immediate access to content – they are hooked up to the same “back end” content store as ScienceDirect.com itself.
• The amount that can be republished is often useless (“200 characters”). I want to build corpora (impossible); vocabularies (essential to record precise words – impossible); chemical names (often > 200 characters so impossible). Figure captions (impossible).
• The researchers must commit to a CC-NC licence. This effectively kills downstream use (I shall use CC0). It also trains them into thinking CC-NC is a “good thing”. It isn’t.
We arrived at our terms in consultation with researchers, and we believe that they pose no issue in the vast majority of cases. Of course, it’s not possible to cover every situation in a general policy, so we’re always open to specific requests.
• If a researcher has a LEGITIMATE collection of papers that they wish to mine (say on their hard disk) they are forbidden. They have to go to each publisher (if this awful protocol is promoted elsewhere) and find the API and mine the individual papers. Absurd.
We recognise that an important issue for researchers is the need to deal with multiple publishers. So for us providing an API for our customers is only part of the solution – we’re also strong supporters of CrossRef’s Prospect initiative (https://prospect.crossref.org/splash/), which aims to provide a single interface to content from multiple publishers.
Interested readers can learn more here: http://www.elsevier.com/connect/elsevier-updates-text-mining-policy-to-improve-access-for-researchers
With kind wishes,
Alicia
Dr Alicia Wise
Director of Access & Policy
Elsevier
a.wise@elsevier.com
@wisealic