#contentmining and #elsevier’s terms; The small print absolutely prevents responsible science

I am systematically going through Elsevier’s terms and conditions for content mining (TDM); see /pmr/2014/02/06/elseviers-tdm-terms-tac-can-they-force-us-to-copyright-data-2/ and previous posts. In this post I look at what I must sign up for. The term “Dataset” appears to refer to Elsevier’s collection of papers (probably only in XML), possibly with some images if they allow access – effectively the target of what I would intend to mine.

2. USER RIGHTS AND RESPONSIBILITIES.

2.2 The User may not other than for the uses as permitted above:

§ abridge, modify, translate or create any derivative work based on the Dataset;

I simply don’t understand this. My TDM output is a derivative work, isn’t it? So I can create a TDM output but not modify it. If I discover something went wrong I can’t amend it. I can’t abridge it. So I can’t filter my output for different purposes or because it’s too big to fit on a disk?

I expect Universal Access staff will tell me that I have misunderstood this and say it’s all OK really. But this is a legal document. They can’t interpret it for me. Only a lawyer can.

And I can’t translate it. Our OPSIN software can in principle be modified to translate chemical names to other languages. This is forbidden.

Now I expect that through detailed discussion with the helpful Universal Access people we could resolve this. It would take a few months. And I don’t have a few months. And for 100 other publishers with similar licences? (This is why we walked out of Licences for Europe – exactly to avoid the waste of time and restrictions I am showing you.) [BTW in the time it has taken to write this para we can mine 50 papers from PLoS or BMC with zero hassle.]

§ remove, obscure or modify in any way any copyright notices, other notices or disclaimers as they appear in the Dataset;

I necessarily remove copyright notices. Data is not copyrightable. The absurdity of water belonging to Elsevier

§ substantially or systematically reproduce, retain or redistribute the Dataset;

“substantially reproduce” can only be decided by a lawyer. “[not] retain” means delete after using, so the user may not be able to repeat their work (since the “Dataset” will change). Requiring people to destroy their data is bad science.

Any responsible text mining requires the corpus used to be available to others to validate the science. Without this a paper reads like:

“We analysed 5000 papers from Elsevier’s http://www.journals.elsevier.com/molecular-phylogenetics-and-evolution/ . We annotated 100 of these for binomial species names and found an interannotator agreement of 98.23%. We cannot make these available for reviewers but trust us, we are conscientious scientists”.

When people are demanding reproducibility in scientific computing there is a requirement for the primary data to be Openly available. Elsevier’s and other publishers’ restrictions have held back natural language processing by at least a decade.
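To make concrete what a reviewer would need in order to check a figure like the 98.23% in the mock quotation above, here is a minimal sketch of simple percentage agreement between two annotators. The function name, the label values and the four example items are all hypothetical, and real studies often report a chance-corrected measure instead. The point is only that the calculation cannot be repeated without the annotated items themselves.

```python
def percent_agreement(labels_a, labels_b):
    """Percentage of items on which two annotators gave the same label."""
    if len(labels_a) != len(labels_b):
        raise ValueError("both annotators must label the same items")
    if not labels_a:
        raise ValueError("no items to compare")
    # Count the positions where the two annotators agree
    matches = sum(a == b for a, b in zip(labels_a, labels_b))
    return 100.0 * matches / len(labels_a)


# Hypothetical labels for four candidate species names (not real data)
annotator_1 = ["species", "species", "not_species", "species"]
annotator_2 = ["species", "not_species", "not_species", "species"]

# The annotators agree on 3 of the 4 items
print(percent_agreement(annotator_1, annotator_2))
```

If the annotated papers have to be deleted, neither list survives and nobody else can rerun even this trivial check.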

§ extract, develop or use the Dataset in any direct or indirect commercial activity;

I have no idea what an indirect commercial activity is.

§ use any robots, spiders or other automated downloading programs, algorithms or devices to search, screen-scrape, extract, or index any Elsevier web site or web application;

I will deliberately not comment on this as there is too much to say. Later.

§ utilize the TDM Output to enhance institutional or subject repositories in a way that would compete with the value of the final peer review journal article, or have the potential to substitute and/or replicate any other existing Elsevier products, services and/or solutions.

This is the killer. It absolutely stops me doing any serious systematic scientific TDM under this “licence”.

For background you must know that Elsevier produces many products other than journals. Many are secondary publications – they abstract, summarize, codify, etc. In chemistry they produce http://en.wikipedia.org/wiki/Beilstein_database

The Beilstein database is the largest database in the field of organic chemistry, in which compounds are uniquely identified by their Beilstein Registry Number. The database covers the scientific literature from 1771 to the present and contains experimentally validated information on millions of chemical reactions and substances from original scientific publications. The electronic database was originally created from Beilstein’s Handbook of Organic Chemistry, founded by Friedrich Konrad Beilstein in 1881, but has appeared online under a number of different names, including Crossfire Beilstein. Since 2009, the content has been maintained and distributed by Elsevier Information Systems in Frankfurt under the product name “Reaxys”.

The database contains information on reactions, substances, structures and properties. Up to 350 fields containing chemical and physical data (such as melting point, refractive index etc.) are available for each substance. References to the literature in which the reaction or substance data appear are also given.

It’s got roughly 10 million compounds. Let’s suppose I intend to mine Elsevier’s TDM API for chemical reactions. I guess they publish at least 100,000 a year (there are 10,000 pages per year in their Tetrahedron and many papers will have lots of reactions). I want to mine them for scientific purposes for the EPSRC-funded “Dial a Molecule” that I have been involved with. This program wants to create artificially intelligent systems for making new chemicals – new drugs, new smart materials, etc. An essential part is an Open collection of existing chemical reaction data. I know what I want from the literature and I can technically extract it.

I can do it on my laptop.

But Elsevier will claim this is competing against Reaxys and they will stop me doing this.

And this will happen in other fields – anything useful extracted from the literature will compete against Elsevier products. (If you have enthused about Elsevier’s TDM, are you fully aware of the products you may compete against?)

So what I and others are doing is an inevitable part of progress and innovation. It’s constant. Elsevier are trying to hold it back.

And trying to prevent it through lawyers shows a fundamental contempt for true science.

5 Responses to #contentmining and #elsevier’s terms; The small print absolutely prevents responsible science

Charles Oppenheim says:

February 7, 2014 at 10:18 am

I too have read the T&Cs and totally agree with your assessment. This takes us no further forward. “Indirect commercial activity” could be interpreted as widely as preparing for a bid for funding, and including output within a bid. Another thing missing from the T&Cs is a statement along the lines of “nothing within this agreement shall prevent you from doing such actions that are permitted as exceptions under copyright law”. It is important to have such wording in, as it looks likely there will be an exception for TDM introduced as part of UK copyright law (as indeed already exists in many other jurisdictions).

- pm286 says:
  
  February 7, 2014 at 10:31 am
  
  Many thanks Charles,
  For those who don’t know, Professor Charles Oppenheim is an expert in the area of copyright and scholarly publishing. It’s really valuable to have this confirmation.
  
Jason Hoyt says:

February 7, 2014 at 11:39 am

I agree, Peter. The text mining license is baffling and an insincere attempt to advance knowledge. It suggests the Elsevier show is really run by the legal team – not by those in favor of advancing Elsevier’s own mission statement of “making genuine contributions to the science and health communities.” (http://www.elsevier.com/about/mission).
I urge everyone to stop and think before signing away your copyright when publishing. It may be harmful to your/our future.

- pm286 says:
  
  February 7, 2014 at 12:27 pm
  
  Thanks Jason – very good to get expert support
  
Hal says:

February 20, 2014 at 1:17 am

As far as I know, “indirect commercial activity” refers to activities like putting your work based upon their data onto a website that has ads on it. In this example, you would not profit directly off of the work by selling it, but indirectly by your readers being subject to paid advertisements.
At least that’s the context in which I’ve come into contact with the term in the past.