Nature’s recent “news” article on Text and Data Mining was unacceptable [redacted]; I ask them to renounce licensing.

[See Update 2014-02-10 at end]

I have sent the following letter to Philip Campbell, Editor of Nature:

Dear Philip,

I am writing to you to protest against your biased reporting of Text and Data Mining in Nature News (part of Nature Publishing Group (NPG)) [1]. This article, which purports to be news, is effectively an attempt by the Toll-Access Scientific, Technical & Medical Publishers (TA-STM) industry to promote publisher licences as a benefit to science. It is in the same category of market-led misinformation as Science Magazine’s analysis of flawed Open Access.

Here is the true story which I ask you to publish to redress the balance.

For years Nature and other TA-STM publishers have consistently fought to prevent Text and Data Mining (TDM) solely for their financial benefit. As an example, in 2006 NPG promoted “the Open Text Mining Interface”, which was designed to appear useful but actually jumbled the sentences (“while preserving any subscription model that funds the journals”) [2].

During the last few years publishers, including yourselves, have imposed draconian conditions restricting crawling and reuse far beyond copyright law. These effectively prevent legal TDM for science and this has killed open activity. Scientists doing TDM hide their activities for fear of being prosecuted or cut off. For example Max Haeussler (cited in your “news”) spent two years trying to get permission from Elsevier [3] to mine for biological sequences. It is no surprise that he can be persuaded to give a positive comment now that he can “click through”. Heather Piwowar “negotiated” for months with Elsevier – who sent several executives to negotiate. I quote her (with permission): “I hate negotiating with publishers – the stress gives me hives [Urticaria]”.

Last year the European Commission attempted to pave the way for responsible TDM (“data analytics”) by bringing the publishing community together with librarians and scientists and open groups (Ross Mounce and I represented the Open Knowledge Foundation). “Licences 4 Europe” was a series of meetings in Brussels. To summarise, the TA-STM publishers were not prepared to cooperate effectively [4] and halfway into the proceedings most of the committee wrote:

“We write to express our serious and deep-felt concerns in regards to Working Group 4 on text and data mining (TDM).  Despite the title, it appears the research and technology communities have been presented not with a stakeholder dialogue, but a process with an already predetermined outcome – namely that additional licensing is the only solution to the problems being faced by those wishing to undertake TDM of content to which they already have lawful access. ”


The signatories came from about 40 highly responsible European Scientific and Scholarly Organizations [4], among them The Association of European Research Libraries (LIBER), UUK, The Royal Society, SURF, The Hungarian Academy of Sciences, JISC, SPARC, Research Libraries UK and The Austrian Science Fund, and included experts on policy and intellectual property law. Despite this clear and compelling request the TA-STM publishers held their position, and the signatories later withdrew from negotiations.

This failure of cooperation was later noted by Mme Neelie Kroes, European Commissioner for Digital Agenda and Vice-President EC [5]:

“And, for me, the Text and Data mining Group has also shown something very clear. We need to find better ways to cope with immense data flows. They affect so many aspects of our daily lives and professional work. As the European Council put it, big data drives innovation, improves productivity, means better quality services. And scientists in particular can use these data flows for research, even for life-saving discoveries. They need every possibility to do that.

I understand the proposed initiative here by publishers is not supported by the users. And this cannot be seen as any kind of solution without agreement from that very important group of stakeholders. Now we need to seriously consider possible legislative exceptions.”

The TA-STM publishers, NPG included, have ignored Mme Kroes. The industry continues to promote licences. Elsevier’s recent announcement is not news (save for the click-through) and although previous Elsevier contracts are often secret, I suspect the click-through forfeits even more rights than before. By your complete lack of balance in failing to report any of the Licences4Europe dissension and choosing proponents who can be expected to see click-through as an advance, you are effectively marketing the licence solution under the guise of news.

My primary concern is the unacceptability of NPG using its “news organ” for self-interested promotion of the licence solution. However I have also analyzed Elsevier’s “click-through” licence in some detail and found it directly contrary to the requirements of TDM. It is badly written and designed to stop any large scale TDM. In my blog [6] (and several previous posts) I show that the licence prevents me legally from doing chemical TDM as it would disadvantage Elsevier’s commercial offerings in this area. I could easily end up in court. So, I suspect, could the enthusiasts from whom you got quotes – their outputs, if done responsibly, could compete with Elsevier products. My analysis is backed by Professor Charles Oppenheim, an expert in scholarly publishing.

There will be a strong incentive for other TA-STM publishers, including, I suspect, NPG, to follow the Elsevier route. This will either result in a plethora of per-publisher click-through licences or a single, probably highly restrictive Elsevier-like licence, available through a publisher supported gateway.

At present therefore I am finding it hard to continue to have confidence in NPG as a responsible organization in Science evaluation and communication. This is a great pity as I have previously worked productively with you and your colleagues.  Richard van Noorden had asked if he could do a story about our new initiatives in TDM (to be announced later this month) – I can’t now regard this as impartial.

I would ask you to do the following:

  • Publicly renounce the use of licences to control TDM and agree that “The right to read is the right to mine”. The Royal Society (a publisher) takes this position so surely NPG could.
  • Commission a balanced account of the Licences4Europe story from a disinterested expert and publish it in Nature.

 

In two months the UK parliament is expected to table and pass the Hargreaves recommendations for TDM,  when we will be able legally to carry this out in UK. Since my institution subscribes to a large number of NPG journals which I have the right to read I expect to start mining them, without further negotiations and without your further permission, in the near future.

 

This letter will appear on my blog. I would consider it appropriate for Nature Correspondence and I request you to publish it.

Peter

 

 

 

[1] http://www.nature.com/news/elsevier-opens-its-papers-to-text-mining-1.14659

[2] http://blogs.nature.com/nascent/2006/04/open_text_mining_interface_1.html and http://hublog.hubmed.org/archives/001345.html

[3] My submission to the UK Government IPO /pmr/2012/03/21/my-response-to-hargreaves-on-copyright-reform-i-request-the-removal-of-contractual-restrictions-and-independent-oversight/

[4] http://www.libereurope.eu/news/licences-for-europe-a-stakeholder-dialogue-text-and-data-mining-for-scientific-research-purpose
[5] http://commentneelie.eu/speech.php?sp=SPEECH/13/917
[6] /pmr/2014/02/07/contentmining-and-elseviers-terms-the-small-print-absolutely-prevents-responsible-science/

 

 

Update 2014-02-10

Richard van Noorden has tweeted that he is the sole author of the article. I accept his assertion and have removed the implication that he was involved in a marketing exercise. My other concerns about the unacceptability of a news article promoting NPG’s position remain.

Posted in Uncategorized | 8 Comments

#contentmining and #elsevier’s terms; The small print absolutely prevents responsible science

I am systematically going through Elsevier’s terms and conditions for content mining (TDM); see /pmr/2014/02/06/elseviers-tdm-terms-tac-can-they-force-us-to-copyright-data-2/ and previous posts. In this post I look at what I must sign up for. The term “Dataset” appears to refer to Elsevier’s collection of papers (probably only in XML), possibly some images if they allow access – effectively the target of what I would intend to mine.

2. USER RIGHTS AND RESPONSIBILITIES.

 2.2  The User may not other than for the uses as permitted above:

 
 

§  abridge, modify, translate or create any derivative work based on the Dataset;

I simply don’t understand this. My TDM output is a derivative work, isn’t it? So I can create a TDM output but not modify it. If I discover something went wrong I can’t amend it. I can’t abridge it. So I can’t filter my output for different purposes or because it’s too big to fit on a disk?

I expect Universal Access staff will tell me that I have misunderstood this and say it’s all OK really. But this is a legal document. They can’t interpret it for me. Only a lawyer can.

And I can’t translate it. Our OPSIN software can in principle be modified to translate chemical names to other languages. This is forbidden.

Now I expect that through detailed discussion with the helpful Universal Access people we could resolve this. It would take a few months. And I don’t have a few months. And for 100 other publishers with similar licences? (This is why we walked out of Licences for Europe – exactly to avoid the waste of time and restrictions I am showing you). [BTW in the time it has taken to write this para we can mine 50 papers from PLoS or BMC with zero hassle].

§  remove, obscure or modify in any way any copyright notices, other notices or disclaimers as they appear in the Dataset;

I necessarily remove copyright notices. Data is not copyrightable. Consider the absurdity of water belonging to Elsevier:

<boilingPoint substance="water" pressure="1 atm" units="Celsius" copyright="Elsevier 2012 All rights reserved">100</boilingPoint>

§  substantially or systematically reproduce, retain or redistribute the Dataset;

“substantially reproduce” can only be decided by a lawyer. “[not] retain” means delete after using, so the user may not be able to repeat their work (since the “Database” will change). Requiring people to destroy their data is bad science.

Any responsible text mining requires the corpus used to be available to others to validate the science. Without this a paper reads like:

“We analysed 5000 papers from Elsevier’s http://www.journals.elsevier.com/molecular-phylogenetics-and-evolution/ . We annotated 100 of these for binomial species names and found an interannotator agreement of 98.23%. We cannot make these available for reviewers but trust us, we are conscientious scientists”.

When people are demanding reproducibility in scientific computing there is a requirement for the primary data to be Openly available. Elsevier’s and other publishers’ restrictions have held back natural language processing by at least a decade.

§  extract, develop or use the Dataset in any direct or indirect commercial activity;

I have no idea what an indirect commercial activity is.

§  use any robots, spiders or other automated downloading programs, algorithms or devices to search, screen-scrape, extract, or index any Elsevier web site or web application;

I will deliberately not comment on this now as there is too much to say. Later.

§  utilize the TDM Output to enhance institutional or subject repositories in a way that would compete with the value of the final peer review journal article, or have the potential to substitute and/or replicate any other existing Elsevier products, services and/or solutions.

This is the killer. It absolutely stops me doing any serious systematic scientific TDM under this “licence”.

For background you must know that Elsevier produces many products other than journals. Many are secondary publications – they abstract, summarize, codify, etc. In chemistry they produce http://en.wikipedia.org/wiki/Beilstein_database

The Beilstein database is the largest database in the field of organic chemistry, in which compounds are uniquely identified by their Beilstein Registry Number. The database covers the scientific literature from 1771 to the present and contains experimentally validated information on millions of chemical reactions and substances from original scientific publications. The electronic database was originally created from Beilstein’s Handbook of Organic Chemistry, founded by Friedrich Konrad Beilstein in 1881, but has appeared online under a number of different names, including Crossfire Beilstein. Since 2009, the content has been maintained and distributed by Elsevier Information Systems in Frankfurt under the product name “Reaxys”.

The database contains information on reactions, substances, structures and properties. Up to 350 fields containing chemical and physical data (such as melting point, refractive index etc.) are available for each substance. References to the literature in which the reaction or substance data appear are also given.

It’s got roughly 10 million compounds. Let’s suppose I intend to mine Elsevier’s TDM API for chemical reactions. I guess they publish at least 100,000 a year (there are 10,000 pages per year in their journal Tetrahedron, and many papers will have lots of reactions). I want to mine them for scientific purposes for the EPSRC-funded “Dial a Molecule” that I have been involved with. This program wants to create artificially intelligent systems for making new chemicals – new drugs, new smart materials, etc. An essential part is an Open collection of existing chemical reaction data. I know what I want from the literature and I can technically extract it.

I can do it on my laptop.

But Elsevier will claim this is competing against Reaxys and they will stop me doing this.

And this will happen in other fields – anything useful extracted from the literature will compete against Elsevier products. (If you have enthused about Elsevier’s TDM are you fully aware of the products you may compete against?).

So what I and others are doing is an inevitable part of progress and innovation. It’s constant. Elsevier are trying to hold it back.

And trying to prevent it through lawyers shows a fundamental contempt for true science.

 

 

 
 


#elsevier’s TDM Terms (TaC): Can they force us to copyright data? (2)

I am continuing with my analysis of Elsevier’s Terms and Conditions (TaC) that researchers must accept to carry out content-mining. The first post urged you to stop and think /pmr/2014/02/06/content-mining-elseviers-tdm-why-researchers-and-libraries-should-think-very-carefully-and-then-not-sign-1/ (I hope you haven’t already signed). This post suggests that what they are requiring researchers to do is probably legally meaningless in parts.

The first thing to realise is that the terms are potentially incompatible with other indications on Elsevier’s site. Thus the TaC allow content mining of closed articles. To mine Open access articles requires a different licence or process. And specifically (http://www.elsevier.com/about/universal-access/content-mining-policies?a=120946 ):

User License | CC BY | CC BY NC SA | CC BY NC ND
Reuse the article in another work? | Y | Y | Y
Reuse portions or extracts from the article in other works? | Y | Y | No*
Make a modification of the article (e.g. translations)? | Y | Y | No
Text & data mine? | | Y | No
Choose a different license? | | No | No
‘Sell’ or re-use for “commercial purposes”? | | No | No

 

I am NOT allowed to mine Elsevier’s “Open Access” articles published under CC-BY-NC-ND (an option that Elsevier makes it easy for authors to select (unlike Springer who rightly forbids CC-NC for Open Access)).

Now I expect that someone from Elsevier will mail and say I have misunderstood this – which I will accept as many Elsevier papers contain direct legal contradictions such as CC-BY and “All rights reserved” juxtaposed. But the point is that legal documents must be clear and this one isn’t clear to me.

With this reservation I’ll take section 2:

2. USER RIGHTS AND RESPONSIBILITIES. 

2.1  Elsevier grants You a limited license to use the TDM Service, data, files and other materials provided by Elsevier (the “Dataset”), to use the TDM Service:

2.1.1 to continuously and automatically extract semantic entities from full-text articles retrieved through the TDM service for the purpose of recognition and classification of the relations between them and mount, load and integrate the results (the “TDM Output”) on a server used for the User’s text-mining system (i.e., not in libraries, repositories or archives) for access and use by the User or the company, institute or organization the User is affiliated with;

LIBRARIES NOTE: I cannot put my output in the University Repository (dspace@cam.ac.uk). The IR is a natural place to put valuable science. I have probably put more science into repositories than anyone else. I put 200,000 datasets in the IR nearly 10 years ago. So, simply, even with the requirements of funders I CANNOT archive my science.

2.1.2 to distribute the TDM Output externally, which may include a few lines of query-dependent text of individual full text articles or book chapters which shall be up to a maximum length of 200 characters surrounding and including the text entity matched(“Snippets”) or bibliographic metadata,.

Here’s a typical scientific sentence (I didn’t search for it – I didn’t even have to flip the page):

Liquid chromatography was performed on an Agilent (Torrence, CA, USA) 1100 HPLC system coupled to a triple-quadropole mass spectrometer (Waters-Micromass, Manchester, UK) with a Z-spray ESI operated in positive mode source using a flow of 700 L/h nitrogen desolvated at 350 °C.

278 characters. Every single word is necessary for accurate rendition of science. I can’t quote this responsibly if it’s truncated. Take the “°C” off and there’s a good chance someone will read the temperature as 350 K, which is about 77 °C. This happens. It’s actually scientifically irresponsible and illiterate to require truncation.
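To make the effect of the 200-character limit concrete, here is a minimal sketch of how a “Snippet” might be cut around a matched entity. The `snippet` function and its centring rule are my own assumptions for illustration; the TaC do not say how Elsevier would actually choose the window.

```python
# Illustrative sketch only: `snippet` and its windowing rule are assumptions,
# not Elsevier's actual implementation of the 200-character "Snippet" limit.

SENTENCE = (
    "Liquid chromatography was performed on an Agilent (Torrence, CA, USA) "
    "1100 HPLC system coupled to a triple-quadropole mass spectrometer "
    "(Waters-Micromass, Manchester, UK) with a Z-spray ESI operated in "
    "positive mode source using a flow of 700 L/h nitrogen desolvated at 350 °C."
)


def snippet(text: str, entity: str, max_len: int = 200) -> str:
    """Return at most max_len characters surrounding and including entity."""
    start = text.find(entity)
    if start < 0:
        raise ValueError("entity not found in text")
    if len(entity) > max_len:
        raise ValueError("entity is longer than the snippet limit")
    # Centre the window on the entity, then clamp it to the text boundaries.
    pad = (max_len - len(entity)) // 2
    left = max(0, start - pad)
    right = min(len(text), left + max_len)
    left = max(0, right - max_len)
    return text[left:right]


s = snippet(SENTENCE, "Liquid chromatography")
print(len(SENTENCE), len(s))  # full sentence length vs. truncated length
print("°C" in s)              # is the temperature unit still present?

# A bare "350" misread as kelvin corresponds to this many degrees Celsius.
misread_celsius = 350 - 273.15
print(round(misread_celsius))
```

With the match at the start of the sentence the window stops well before the temperature, so the unit never reaches the reader.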

Further the TDM Output should include a Creative Commons proprietary notice in the following form:

“©Some rights reserved. This work is distributed under the terms of the CC-BY-NC Attribution-NonCommercial- 3.0, which permits non-commercial use, distribution, and reproduction in any medium, provided the original author and source are credited.”

My TDM output is FACTS. DATA. It will look something like:

<method>Liquid chromatography</method>

<equipment>Agilent (Torrence, CA, USA) 1100 HPLC system</equipment>

<flowrate units="L/h">700</flowrate>

(actually it’s much better and smarter than this…)
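As a rough sketch of how one such fact could be pulled out of the sentence and written as XML, consider the following. The regular expression and the `extract_flow_rate` helper are hypothetical illustrations of the idea, not our real tooling; only the element and attribute names are taken from the example above.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical helper for illustration; real chemical text mining is far richer.
FLOW_PATTERN = re.compile(r"flow of (\d+(?:\.\d+)?)\s*(L/h)")


def extract_flow_rate(text):
    """Return a <flowrate> element for the first flow rate found, or None."""
    match = FLOW_PATTERN.search(text)
    if match is None:
        return None
    element = ET.Element("flowrate", units=match.group(2))
    element.text = match.group(1)
    return element


fragment = "operated in positive mode source using a flow of 700 L/h nitrogen"
flow = extract_flow_rate(fragment)
print(ET.tostring(flow, encoding="unicode"))
```

Nothing in the result is creative expression: it is a number and a unit.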

And I am required (“SHOULD”) to copyright this.

But this is ridiculous: FACTS and DATA cannot be copyrighted. See http://www.lib.umich.edu/copyright/facts-and-data – UMich are among the world experts in this area:

Exceptions to Copyright: Facts and Data

Copyright basics
Copyright law provides protection for original creative expression that is recorded in a physical or digital form, things like literary works, music, art, and film. Copyright does not protect facts, data, or ideas though it does protect databases.

Copyright and databases
Copyright law does not apply to facts, data, or ideas. According to the U.S. Constitution, the purpose of copyright law is “to promote the progress of science and useful arts.” If copyright could grant individuals or business exclusive control of facts and ideas, it would constrain all kinds of progress, or eliminate it altogether. That is why the second section of the US Copyright Act spells out what is not protected by copyright:

It is important to remember that even if a database or compilation is arranged with sufficient originality to qualify for copyright protection, the facts and data within that database are still in the public domain. Anyone can take those facts and reuse or republish them, as long as that person arranges them in a new way. Unless they are accessible only under a contract that conditions access on limiting how the facts and data may or may not be used; any such contract would control.

And I would contend that even if I wanted to copyright this as a database (perish the thought) I don’t think 3 lines of RDF represents a database.

So what Elsevier are asking me to do (if I signed up) is legally absurd.

And that makes the whole TaC unacceptable. If it has such glaring errors as copyrighting data can YOU trust any of it to be legally valid?

In the next post I shall show why I (specifically PMR and some others) would breach Elsevier’s TaC as soon as I started mining…

[BTW Heather Joseph tweeted – “why is PMR the only one blogging on this topic?” Do libraries and universities and academics simply not care?]

 

 


Content-Mining Elsevier’s TDM; Why researchers and libraries should think very carefully and then not sign (1)

Elsevier have posted their terms and conditions for content-mining (TDM). See http://www.developers.elsevier.com/cms/content/text-and-data-mining-service-agreement (I think you can only see the agreement if your institution subscribes). I don’t know whether I am allowed to post the TaC without permission but I am assuming fair use and public interest (neither of which actually exist).

I shall take this in two parts: (1) why you should not sign this agreement without consulting a lawyer and your institution and (2) what trouble you might get into. For (1) I will only use the head of the document.

“TEXT AND DATA MINING SERVICE AGREEMENT

PLEASE READ THE FOLLOWING TERMS AND CONDITIONS CAREFULLY. THESE TERMS AND CONDITIONS CONSTITUTE A LEGAL AGREEMENT BETWEEN YOU AND ELSEVIER.”

 

The key point is that this is a legal agreement. It binds you to adhere to the conditions in the document. If you do not you may be sued. This is independent of what your motive was. If you break the agreement you may be sued.

But surely Elsevier won’t sue ordinary researchers?

When ordinary researchers in the University of California posted their research articles, Elsevier sent the lawyers in. http://osc.universityofcalifornia.edu/2013/12/elsevier-takedown-notices/ . Read this carefully – it’s complicated but nowhere near as complicated as the TDM agreement. If the complications worry you, then the TDM agreement will worry you.

And surely Text and Data Mining for no reward isn’t really a crime?

Ask the friends of http://en.wikipedia.org/wiki/Aaron_Swartz who downloaded JSTOR content to carry out TDM. From Wikipedia:

On January 6, 2011, Swartz was arrested by MIT police on state breaking-and-entering charges, after systematically downloading academic journal articles from JSTOR. Federal prosecutors later charged him with two counts of wire fraud and 11 violations of the Computer Fraud and Abuse Act, carrying a cumulative maximum penalty of $1 million in fines, 35 years in prison, asset forfeiture, restitution and supervised release.

That is not a misprint. It really was THIRTY-FIVE YEARS IN PRISON. As I expect you know Aaron died last year.

But surely MIT would come to his support?

The Swartz family don’t think so:

Aaron’s death is not simply a personal tragedy, it is the product of a criminal justice system rife with intimidation and prosecutorial overreach. Decisions made by officials in the Massachusetts U.S. Attorney’s office and at MIT contributed to his death.

Statement by family and partner of Aaron Swartz

I have no informed comment.

So the bottom line is that if you get it wrong in the eyes of Elsevier you may be sued. You may win. Your University might support you. Or it might not… especially since:

 

” You confirm that You have the right and authority to enter into this Agreement”

Do you? I doubt it very much. The contractor to Elsevier for the subscription is the University, not you. Unless they give you the right to sign this licence then you are acting without their authority. I am not a lawyer, but at present I would not do this without getting my university to indemnify me.

If you foul up, then you might get sued, and/or the University would get sued. If the University had not agreed that you could do this they might also take action against you.

BTW – I suspect that, so far, every other TDM agreement with every other publisher will be just as problematic.

In (2) I shall look at some of the clauses and why they expressly prevent me signing up for this.


Can an #openaccess advocate publish subscription-only content?

I am an active advocate for #openaccess and #opendata. I was recently asked on Twitter “how many closed access publications have you authored?” and I replied “none in the last five years.” I was then challenged about one recent paper on which I am an author and which is Closed:

Chemical Name to Structure: OPSIN, an Open Source Solution, Daniel M. Lowe, Peter T. Corbett, Peter Murray-Rust*, and Robert C. Glen,

J. Chem. Inf. Model., 2011, 51 (3), pp 739–753 DOI: 10.1021/ci100384d http://pubs.acs.org/doi/abs/10.1021/ci100384d

 

This is a legitimate challenge, and I’ll explain my philosophy. (Twitter is a poor mechanism for detailed discourse).

 

First, all papers in the last 5 years which I am sole author of are Open. (The only exception is an invited one in an Elsevier journal where I was promised Green Open Access but it has now been closed by the publisher). In most cases I have paid an APC.

 

In my own group, (now formally wound down – I have no direct students or research fellows now though I have several collaborators), we had the tradition that it was the choice of the person who did most of the work. I have always said that it is hard enough being a junior researcher without the additional constraint of a senior person’s prejudices. If a junior researcher feels that they must publish in a closed manner, I don’t think it’s right to force them to do otherwise. In particular there are some subdisciplines (e.g. polymer chemistry) where there is no clear #openaccess outlet. In practice we had a very strong Open culture and almost all papers were published APC-paid OA.

 

When publishing with other groups, and where there is no funder mandate, it depends on who controls the process. #openaccess is messy and multi-author papers are particularly messy (I have been on one with 95 authors). In the case above the other authors wished to publish in an ACS journal, and Daniel – who did most of the work and is first author – was not formally in my group. (I suspect the asterisk is there because students can’t be corresponding authors, or it is simply an error).

 

Where funders mandate #openaccess then there should be no problem. The researchers know the rules before they start the research – if they don’t want to publish #openaccess they don’t have to take the grant. It’s more difficult where one minor author has personal funding (e.g. a fellowship or sited in a funded laboratory).

I’d be interested to know what other #openaccess advocates do when the choice involves other authors.

[Note that all the material in Daniel’s paper is Openly available – his thesis in Cambridge DSpace https://www.repository.cam.ac.uk/handle/1810/244727 contains far more detail and is more valuable than the article ; the software is all available at https://bitbucket.org/dan2097/opsin/ and there is a running service at http://opsin.ch.cam.ac.uk/ . ]

 

 


Content Mining (TDM). I analyse Elsevier’s reply and ask whether I am allowed to mine Chemistry

Elsevier has replied to my last blog post on their Content Mining (TDM) facility and regulations. I am going to critique these – mainly for the benefit of Universities and policy makers/funders who might think it is a step forward. It isn’t.

First a preamble about the TA (Closed) Scholarly publishing industry. This is almost unique in that it provides an essential service on an unregulated monopoly basis. IOW the industry can do what it likes (within the law) and largely get away with it. The “customers” are the University libraries who seem only to care about price and not what the service actually is. As long as they can “buy” (sorry “rent”) journals they largely don’t seem to care about the conditions of use (and in particular the right to carry out Content Mining). In many ways they act as internal delivery agents and first-line policing (on copyright) for the publishers. This means that the readers (both generally and with institutional subscription) have no formal voice.

Railways have to submit to scrutiny and have passenger liaison committees. So do energy providers. Ultimately they are answerable to governments as well as their shareholders.

Publishers have no regulation and have effective micromonopolies. Readers have no choice in what they read – there is no substitutability. They can either subscribe to read it or they are prevented by the paywalls. If they have access they can either mine it or they are subject to legal constraints (as in this case). When reading Elsevier’s reply remember that the only constraint on the Director of Access and Policy is that they must make money for Elsevier. Nothing else matters. Elsevier can go a very long way in upsetting its readers without losing market.

Elsevier has replied through its Directorate of Access and Policy. (This is the one acceptable feature – that there is a clear channel). It used to be called “Universal Access” but the Orwellian euphemism seems to have gone. The Director is currently Alicia Wise (who also tweets under @wisealic). I treat the Directorate in a polite manner and regard it in the same way as I regard “Customer Care” on the railways. To me its staff are people employed by Elsevier to maximise their profits by growing the market and limiting damage. They are not my collaborators and we do not share common goals. In many cases they are directly trying to make life difficult for me and other readers.


Alicia Wise says:

February 1, 2014 at 2:20 pm

Hi Peter,

Dear Director of Access and Policy,

Thank you for the reply on my blog, which I have copied in full. I comment and add QUESTIONS which I would be grateful if you would answer clearly and succinctly. Please avoid generalities. If I do not get answers within a few days, I shall announce that Elsevier have failed to answer.

We think our new text mining policy goes a long way to addressing researcher needs in respect of TDM. You raise some good questions, though, and I’d like to take this opportunity to respond to them:

PMR: Companies always assert that they address customer needs. It is an effectively empty phrase.

• Elsevier requires both institutions and individuals to sign licenses

Our objective is to provide practical support to researchers. We believe a licence-based, self-service solution removes access barriers for researchers who want to text and data mine while allowing publishers to ensure performance and quality of service for all users.

PMR: Another empty phrase.

• Elsevier is the sole author and controller of the policy – there has been no Open discussion or agreement with scholarly bodies

This new policy is the result of extensive discussions with academic institutions – we have, for example, been running pilots with a number of institutions over the course of last year to test and refine both our technology and the terms and conditions under which this access is provided.

PMR: It is relatively easy to find customers who will promote the role of a company.

QUESTION: Where are these pilots? Have any been published? Have any University, Funder or Government organizations acted to oversee them and provide an impartial opinion? (Without such evidence this is an empty claim).

• Libraries have to – individually – sign agreements with Elsevier. There are no details of these policies or whether they entail additional institutional payment. It is also possible that Institutions may be asked to give up content-mining rights in return for lower overall prices. (Libraries have universally and unilaterally given away all these rights over the last decade and support publishers to forbid machine access to content).

There is no additional charge for this access, and it will be automatically included in all library contracts when they are renewed. Libraries who would like access immediately (perhaps their next renewal is some time away) are asked to simply send us a request and we will amend their current agreement to include this access.

PMR: I will be getting this information from libraries

• Researchers have to register as a developer (I think) and ask permission of Elsevier for every project they wish to do. It is not clear whether permission is automatic or whether Elsevier exercise control over choice and scope of project

The process is automatic – researchers are indeed asked to register and agree to the terms, and are then automatically sent an API key. You don’t need to contact anyone at Elsevier, and we do not exercise any control over the choice and scope of research projects.

QUESTION: What are the detailed terms that researchers have to agree to? (Note: The previous terms that Elsevier asked me to sign, which restricted output and effectively forbade mining chemistry, were unacceptable).

• Researchers can only mine text. Images are specifically prohibited. This is useless for me – as I and colleagues are mining chemical structure diagrams.

Figure metadata (titles, captions, etc) is included in the XML returned from our APIs and may be mined as a matter of course. Due to some ambiguity about re-use rights for some of the images included in our content, we are not automatically making the images themselves available to those who self-register for our text-mining API, but do have an image retrieval API that we can make available upon request once we understand the way in which the researcher intends to use the images.

PMR: This requires the researchers to formally seek Elsevier’s permission to mine images. You also wish to decide whether you approve of my proposed use. I will therefore state this as a formal request.

QUESTION: I wish to mine all chemical diagrams in Elsevier publications and extract reactions and analyse these for novel chemical reactions. I have an institutional subscription. I will publish only facts which are uncopyrightable. I wish to analyse 100 articles a day – one every 15 minutes (which should not cause stress on your servers). Note: There are only two answers: YES and NOT-YES. Any prevarication as I have had before will be interpreted as a refusal.

• There is no indication of how current the material will be. I shall be mining the literature an hour after it appears. Will the API provide that?

Yes. The APIs provide immediate access to content – they are hooked up to the same “back end” content store as ScienceDirect.com itself.

PMR: Noted.

• The amount that can be republished is often useless (“200 characters”). I want to build corpora (impossible); vocabularies (essential to record precise words – impossible); chemical names (often > 200 characters so impossible). Figure captions (impossible).
• The researchers must commit to a CC-NC licence. This effectively kills downstream use (I shall use CC0). It also trains them into thinking CC-NC is a “good thing”. It isn’t.

We arrived at our terms in consultation with researchers, and we believe that they pose no issue in the vast majority of cases. Of course, it’s not possible to cover every situation in a general policy, so we’re always open to specific requests.

QUESTION: Which researchers did you consult? Please publish full details of their input.

PMR: Note. Yet again a subscriber has to make a specific request. Elsevier are regulating what can be accessed.

• If a researcher has a LEGITIMATE collection of papers that they wish to mine (say on their hard disk) they are forbidden. They have to go to each publisher (if this awful protocol is promoted elsewhere) and find the API and mine the individual papers. Absurd.

We recognise that an important issue for researchers is the need to deal with multiple publishers. So for us providing an API for our customers is only part of the solution – we’re also strong supporters of CrossRef’s Prospect initiative (https://prospect.crossref.org/splash/), which aims to provide a single interface to content from multiple publishers.

Interested readers can learn more here: http://www.elsevier.com/connect/elsevier-updates-text-mining-policy-to-improve-access-for-researchers

PMR: Noted.

With kind wishes,
Alicia

Dr Alicia Wise
Director of Access & Policy
Elsevier
a.wise@elsevier.com
@wisealic

Yours Sincerely

Peter Murray-Rust

Posted in Uncategorized | 6 Comments

Content Mining: Why you and I should NOT sign up for Elsevier’s TDM service

In the last few days Elsevier has announced their policy on Text And Data Mining (TDM). I use the term “content mining” as I wish to mine every part of published content (images, audio, video) and not just text. The policy was announced here http://www.elsevier.com/about/universal-access/content-mining-policies .

This post contains a lot of material (from Elsevier and my comments) so I’ll try to summarise. Note that Elsevier’s material seems inconsistent in places (common with this publisher). I have had to go behind Elsevier’s paywall to find one statement of agreement and rights and it is probable that I have not found everything. In essence:

  • Elsevier asserts complete control over “its” content and requires both institutions and individuals to sign licences.
  • Elsevier is the sole author and controller of the policy – there has been no Open discussion or agreement with scholarly bodies
  • Libraries have to – individually – sign agreements with Elsevier. There are no details of these policies or whether they entail additional institutional payment. It is also possible that Institutions may be asked to give up content-mining rights in return for lower overall prices. (Libraries have universally and unilaterally given away all these rights over the last decade and support publishers to forbid machine access to content).
  • Researchers have to register as a developer (I think) and ask permission of Elsevier for every project they wish to do. It is not clear whether permission is automatic or whether Elsevier exercise control over choice and scope of project (they certainly did when I “negotiated” with them /pmr/2011/11/27/textmining-my-years-negotiating-with-elsevier/ ).
  • Researchers can only access content through an Elsevier-controlled portal. They have to register as a Developer and get an APIKey (conflicts with “sign a click-through licence”).
  • Researchers can only mine text. Images are specifically prohibited. This is useless for me – as I and colleagues are mining chemical structure diagrams.
  • There is no indication of how current the material will be. I shall be mining the literature an hour after it appears. Will the API provide that?
  • The amount that can be republished is often useless (“200 characters”). I want to build corpora (impossible); vocabularies (essential to record precise words – impossible); chemical names (often > 200 characters so impossible). Figure captions (impossible).
  • The researchers must commit to a CC-NC licence. This effectively kills downstream use (I shall use CC0). It also trains them into thinking CC-NC is a “good thing”. It isn’t.
  • If a researcher has a LEGITIMATE collection of papers that they wish to mine (say on their hard disk) they are forbidden. They have to go to each publisher (if this awful protocol is promoted elsewhere) and find the API and mine the individual papers. Absurd.

This is licence-controlled TDM. The publishers tried very hard to get Europe (Neelie Kroes) to agree to licences for TDM (“Licences for Europe”). They failed.

They tried to stop the UK Hargreaves process exempting data analytics from copyright reform. They failed.

The leading library organizations and funders such as the British Library, JISC, LIBER, Wellcome Trust, RCUK are united in their opposition to licences. This is simply Licences under another head.

The danger is that University libraries, who have signed these restrictive clauses for years, will continue to sign them.

DON’T.

Don’t take my word for this. Ask the BL, or JISC or LIBER.

BUT DON’T SIGN ELSEVIER’S TDM.

And:

YOU DO NOT NEED ANY API.

APIs make it HARDER to mine. We are releasing technology that will work directly on PDFs. It’s Open Source and works. And others are doing the same. If every publisher came up with a similar process it would make the burden of mining huge. This is probably what some publishers hope.

Here are the supporting docs. I have emphasized some parts:

http://www.elsevier.com/about/universal-access/content-mining-policies (In front of paywall)

How to gain access

For Academic subscribers once your institutional agreement has been updated to allow text-mining access, individual researcher access is an automatic process, managed through our developer portal. Researchers will need to follow three steps:

  1. Register their details using the online form on the developer’s website
  2. Agree to our Text Mining conditions via a “click-through” agreement
  3. Receive an API token that will allow you to access ScienceDirect content (delivered in an XML format suitable for text mining)

Terms and conditions of text and data mining

  • Text mining access is provided to subscribers for non-commercial purposes
  • Access is via the ScienceDirect APIs only
  • Text mining output adheres to the following conditions:


1. Output can contain “snippets” of up to 200 characters of the original text

2. Licensed as CC-BY-NC

3. Includes DOI link to original content

Note: We request that all access to content for text mining purposes takes place through our APIs and remind you that in order to maintain performance and availability for all users, the terms and conditions of access to ScienceDirect continue to prohibit the use of robots, spiders, crawlers or other automated programs, or algorithms to download content from the website itself.
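PMR: To make the output conditions concrete, here is a minimal sketch of what a compliant output record might look like. It is my own illustration, not Elsevier’s code and not their API: the function name, the record fields, the DOI and the long “chemical name” are all hypothetical. Only the 200-character limit, the CC-BY-NC licence and the DOI link come from the conditions quoted above.

```python
# Illustrative sketch only: the function name, record layout, DOI and
# "chemical name" below are hypothetical, not part of any Elsevier API.
MAX_SNIPPET_CHARS = 200  # the snippet limit quoted in the conditions above


def make_output_record(doi, extracted_text):
    """Build one output record: truncated snippet, licence, DOI link."""
    snippet = extracted_text[:MAX_SNIPPET_CHARS]
    return {
        "snippet": snippet,
        # True when the extracted text had to be cut to fit the limit
        "truncated": len(extracted_text) > MAX_SNIPPET_CHARS,
        "licence": "CC-BY-NC",
        "source": "https://doi.org/" + doi,
    }


# A made-up stand-in for a long systematic chemical name
long_name = "-".join(["methyl"] * 40)
record = make_output_record("10.1000/example", long_name)
print(len(long_name), len(record["snippet"]), record["truncated"])
```

Any extracted string longer than the limit comes back cut short, and a truncated chemical name is useless as a fact.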

 

 

http://www.developers.elsevier.com/cms/content/text-mining-elsevier-publications (behind paywall?)

Text mining of Elsevier publications

Definition: the client application is a system that ingests full-text publications in order to text-mine them: extract data and information using automated algorithms. Examples of text mining are entity recognition, relationship extraction, and sentiment analysis using linguistic methods.

We allow this use case under the following conditions:

  • Access to the APIs for text mining purposes is available free of charge to researchers at academic institutions that subscribe to sciencedirect.com. The full-text content that is available for mining through the APIs is the content that the institute has subscribed to [PMR it’s TEXT ONLY].
  • Our APIs must be used to retrieve the content; crawling the sciencedirect.com website itself is not allowed.
  • The institution needs to have written permission from Elsevier for text mining, either through a clause embedded in an existing subscription agreement or as a separate add-on agreement.
  • After permission is granted, researchers at the institution will be able to obtain an APIKey by registering their text mining project through the ‘My Projects’ page of the Elsevier Developer Portal.
  • The use of Elsevier content in text mining, and of the resulting output, should adhere to Elsevier’s TDM policy as outlined on http://www.elsevier.com/tdm.

If your institution wants to get written permission for text mining, the institution’s authorized representative can request Elsevier to provide one, by contacting his/her Elsevier account manager or our Academic & Government Sales department.

If you want to mine Elsevier content for commercial purpose, please contact our Corporate Sales department.

 
 

Posted in Uncategorized | 3 Comments

We – and that includes you – must preserve Net Neutrality

I have just signed a petition on Net Neutrality; written to the MEP/Rapporteur for the ITRE process; and written to my MEPs. Ten years ago that would have taken me all day. Now it takes under half an hour.

Some of you may think – “isn’t it a bit radical to be doing all this? The world’s reasonably OK. The politicians will look after it, won’t they? And everybody knows the value of the Web – it’s unthinkable it will be removed. And it’s not really my business”.

Well it IS.

We are in the middle of a digital war. A war between corporate interests and freedom. Because there are huge amounts of money to be made by controlling people and the flow of information. A typical idea would be a “two-speed internet” – one for those who can pay, and one for the rest. And on the high-speed we wouldn’t get all that spam because it would be controlled by those who know best what we want. Like Movie corporations; Or mega-science-publishers. (Do you really want a dedicated net where only rich science publishers are present?).

The forces of control have money. Their weapon is the lobbyist. I know, for example, that publisher money is being spent to stop me – yes PM-R and colleagues – developing a free-to-everyone approach to Content Mining (Text and Data Mining). Loss of Net Neutrality could kill content mining as the content would only be available on the mega-publishers’ private web.

We have people’s minds and energy. That’s very powerful but it relies on YOU. I hope I have convinced you to care. Then it’s easy. Go to:

https://www.writetothem.com

They (the wonderful MySociety) will tell you what to do. All you have to know is your postcode. They’ll work out who your MEPs are. Write your letter. It’s worth making it personal – you’ll see below that I have included bits of me, and bits of the local region. (Cut-and-paste is collected, counted, but not read in detail).

So here’s how it’s done… (the names are worked out by WTT)

Attn:

  • Richard Howitt MEP
  • Vicky Ford MEP
  • Geoffrey Van Orden MEP
  • Stuart Agnew MEP
  • David Campbell Bannerman MEP
  • Andrew Duff MEP
  • Robert Sturdy MEP


Eastern

Thursday 30 January 2014

From: Peter Murray-Rust


Dear Geoffrey Van Orden, Richard Howitt, Robert Sturdy, Andrew Duff, Stuart Agnew, Vicky Ford and David Campbell Bannerman,

I am writing to you to urge you to vote and campaign for Net Neutrality, which is being debated in a few days’ time in ITRE. I have written to MEP/Rapporteur Pilar del Castillo (pilar.delcastillo@europarl.europa.eu):

I am writing to urge you to preserve Net Neutrality in the ITRE process at all costs. Europe invented the Web. I was privileged to be at CERN in 1994 to hear the scientist Sir Tim Berners-Lee launch the Semantic Web.

Tim’s vision is simple – at the 2012 Olympics his message was

“This is for everyone”.

Europe has made massive contributions to the Web. It stands to gain massively more. I have estimated to the UK government that in my own discipline of chemistry we stand to gain “low billions” worldwide by making knowledge free. In Europe alone the new uses of scientific information could generate huge wealth.

Restriction – such as a divided web – kills innovation. Innovation and free information are the foundation for a better future.

This is compelling in itself, but there are also special local reasons to support Net Neutrality. Cambridge and the Eastern Region are making outstanding advances in new technology and deploying them for both wealth generation and the betterment of our society and the planet. The free flow of information and ideas is fundamental to this. So by supporting Net Neutrality you will also be helping to strengthen the outstanding potential of our region.


 

Posted in Uncategorized | Leave a comment

Saulius Gražulis gets #blueobelisk award: if you want Open Crystallography go to COD

Today I presented Saulius Gražulis with a Blue Obelisk for his, and his colleagues’ , work on making Crystallography Open for everyone through the Crystallography Open Database (COD).

The Blue Obelisk (http://www.blueobelisk.org) is a very loose collaboration of people who work together on a semi-structured basis to create or liberate or otherwise provide:

OPEN DATA, OPEN STANDARDS, OPEN SOURCE (ODOSOS)

in chemistry and related sciences. This is very much valued and many people and companies use BO software. Because it’s open they don’t have to ask our permission. They don’t have to say thank you (though it’s nice). But they do have to acknowledge authors’ moral rights (i.e. acknowledge who wrote the software).

In macromolecules there’s an abundance of Open data – the Protein Databank for example. But in “small molecules” or minerals or materials there’s effectively no organized source – apart from the COD. It is hard to build up a voluntary database and keep it running but that’s what Saulius has done.

And much more than that – it is improving rapidly. With the addition of our Crystaleye structures and chemical software the COD is now able to offer a wide range of crystallographic knowledge.

For example together with the PDB it’s the only crystallographic database in the Linked Open Data Cloud. Today we had confirmation that LOD2 was happy to work with us to include the RDFised COD. So here it is :

You can see PDB close by and Bio2RDF at the bottom. (We’ve talked with Michel Dumontier about how to link to that). So the semantic Web will recognize that if people want semantic crystallography they can come to COD.

And we stress that COD data is not only free but can be used for any purpose without permission. You can build programs round it, sell them, derive forcefields, create reference data tables, use it for validation, compute the structures and properties with QM or MM programs, etc. It’s truly OPEN and the largest data set (I think) in the BO collection.

We also applaud BO’s NMRShiftDB and will actively work to link this to COD.

And COD covers all disciplines – organic, organometallic, inorganic – no other database does that. And no other database allows you to link out to other disciplines and link back in.

Moreover COD will start exposing molecular structures. Often chemists find crystal structures too complicated – they want single molecules (“moieties”). Nick Day did that in Crystaleye and we’ve transferred the software to COD.

And every new addition to the BO repertoire increases its value n-squared.

Thank you Saulius – your obelisk will be in the post.

Posted in Uncategorized | Leave a comment

Liberating Open Crystallography: My 2 weeks in Vilnius with COD; massive progress, Crystaleye moves

I have been in Vilnius LT for nearly two weeks. I had hoped to blog every day, but have failed to do so once. This is because we are working flat out on developing Open Crystallography (for “small molecules” – i.e. non-macromolecules). I have masses to write (and will do so) but here is the summary:

Much small-molecule crystallography is effectively Closed and certainly not conformant to the OKF’s Open Definition. I’ve written about this several times earlier – in essence people don’t have facile access to enough data and code. There’s a lot of people – not just practising crystallographers – who want to change this. Crystallography is a central science (and this year is recognised as The International Year of Crystallography). It’s used in:

  • Bioscience
  • Medicine
  • Materials
  • Chemistry
  • Mathematics

And much more.     

Ten years ago Armel Le Bail set up an initiative – the Crystallography Open Database (COD) – to collect and store completely Open crystallography. It’s had a lot of support in kind, and some financial support. It now has about 250,000 structures. These are being widely used. Some years ago Armel handed over the direction to Saulius Grazulis (there’s a hacek on the “z”) and I’ve been visiting Saulius and colleagues for 2 weeks.

Independently Nick Day in our group in Cambridge built an Open Database of structures (“Crystaleye” (CY)). Like so many things (e.g. Figshare) it wasn’t planned as a world-beating database. Nick wanted these structures to validate computational methods, so he thought why not collect every structure on the web. Then he thought, why not offer them to the world (http://wwmm.ch.cam.ac.uk/crystaleye/ ) and built a system which not only exposed the data, but also calculated a huge variety of chemistry. This was possible not only because of the code we had written but the huge contributions of the http://www.blueobelisk.org community. We extensively use CDK, OpenBabel, Jmol, Avogadro and many others. This meant that Crystaleye could display over 10 million computed webpages to allow people to browse and display the chemistry.

I’ve formally shut down my group at Cambridge but continue to be active in chemistry and it would be a great pity if Crystaleye atrophied and died. Nick put many completely novel features into it. So Saulius and I planned that the two efforts would merge – COD has an emphasis on crystallography and CY’s is on chemistry. So they complement each other well.

In the time here we have tackled:

  • Pulling the Crystaleye entries to Vilnius. Of the 250,000, 10,000 were unique to CY so COD has immediately increased.
  • Extracting the major chemistry routines from CY and installing them in COD-CY
  • Testing the extraction of chemistry from COD-CY
  • Designing novel functionality and display for the web pages
  • Expanding the community that COD-CY interacts with in both directions. I’ll write more about this. COD chemistry will be a massive resource for the whole chemical community and the BlueObelisk will contribute hugely to COD-CY.
  • Designing and implementing RDF for crystallography
  • Turning COD-CY into one of the first small-molecule chemical resources on the LinkedOpenData Cloud.
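To give a feel for what “RDF for crystallography” means in practice, here is a minimal sketch that writes one entry as N-Triples. Everything in it is an assumption for illustration: the example.org namespaces, the property names and the entry itself are made up, and this is not the vocabulary or the code we actually use in COD-CY.

```python
# Illustrative sketch only: the namespaces, property names and the entry
# are hypothetical, not the real COD-CY vocabulary or data.
BASE = "http://example.org/cod/"
VOCAB = "http://example.org/cryst#"


def literal(value):
    """Quote a value as an N-Triples string literal."""
    escaped = str(value).replace("\\", "\\\\").replace('"', '\\"')
    return '"' + escaped + '"'


def entry_to_ntriples(entry_id, properties):
    """Return one N-Triples line per property of a database entry."""
    subject = "<" + BASE + entry_id + ">"
    lines = []
    for name, value in sorted(properties.items()):
        predicate = "<" + VOCAB + name + ">"
        lines.append(f"{subject} {predicate} {literal(value)} .")
    return lines


# A made-up entry with two made-up properties
triples = entry_to_ntriples("0000001", {"formula": "C6 H6", "spaceGroup": "P b c a"})
for line in triples:
    print(line)
```

Each line is a subject, a predicate and an object, which is all a Linked Open Data consumer needs in order to link a structure to other resources.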

The group here is wonderful and the potential is huge. We are seeing how Open resources can liberate thought and action in chemistry and crystallography. There’s a commitment to being part of the world community.

I’ve particularly worked with Saulius – we’ve had many days where we have literally hacked from dawn to dusk. Saulius is an ace UNIX-hacker and the infrastructure of the COD is very impressive – with a lot of Perl and shell scripts. In contrast much of the BlueObelisk software is Java and many users run on Windows. So we’ve spent a lot of time making CY tools and JUMBO-converters run on the commandline. We’ve cracked the main problems and Saulius can now run Nick’s Crystaleye ideas on the COD server.

 

Much more later

Posted in Uncategorized | 1 Comment