I am systematically going through Elsevier’s terms and conditions for content mining (TDM); see http://blogs.ch.cam.ac.uk/pmr/2014/02/06/elseviers-tdm-terms-tac-can-they-force-us-to-copyright-data-2/ and previous posts. In this post I look at what I must sign up for. The term “Dataset” appears to refer to Elsevier’s collection of papers (probably only in XML), and possibly some images if they allow access: effectively the target of what I would intend to mine.
2. USER RIGHTS AND RESPONSIBILITIES.
2.2 The User may not other than for the uses as permitted above:
§ abridge, modify, translate or create any derivative work based on the Dataset;
I simply don’t understand this. My TDM output is a derivative work, isn’t it? So I can create a TDM output but not modify it. If I discover something went wrong I can’t amend it. I can’t abridge it. So I can’t filter my output for different purposes or because it’s too big to fit on a disk?
I expect Universal Access staff will tell me that I have misunderstood this and say it’s all OK really. But this is a legal document. They can’t interpret it for me. Only a lawyer can.
And I can’t translate it. Our OPSIN software can in principle be modified to translate chemical names to other languages. This is forbidden.
Now I expect that through detailed discussion with the helpful Universal Access people we could resolve this. It would take a few months. And I don’t have a few months. And what about 100 other publishers with similar licences? (This is why we walked out of Licences for Europe: exactly to avoid the waste of time and the restrictions I am showing you.) [BTW in the time it has taken to write this para we can mine 50 papers from PLoS or BMC with zero hassle].
§ remove, obscure or modify in any way any copyright notices, other notices or disclaimers as they appear in the Dataset;
I necessarily remove copyright notices. Data is not copyrightable. Consider the absurdity of water belonging to Elsevier:
<boilingPoint substance="water" pressure="1 atm" units="Celsius" copyright="Elsevier 2012 All rights reserved">100</boilingPoint>
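To see why the notice necessarily disappears, here is a minimal sketch in Python of what extracting the fact from such a record might look like. The element and attribute names are just the illustrative ones from my example above, not a real Elsevier schema, and `extract_fact` is a hypothetical helper, not part of any existing tool.

```python
import xml.etree.ElementTree as ET

# Illustrative record only: element and attribute names are assumptions
# taken from the example above, not from any real publisher schema.
record = (
    '<boilingPoint substance="water" pressure="1 atm" units="Celsius" '
    'copyright="Elsevier 2012 All rights reserved">100</boilingPoint>'
)


def extract_fact(xml_text):
    """Return only the scientific fact carried by the element (hypothetical helper)."""
    element = ET.fromstring(xml_text)
    return {
        "property": element.tag,
        "substance": element.get("substance"),
        "pressure": element.get("pressure"),
        "units": element.get("units"),
        "value": float(element.text),
    }


fact = extract_fact(record)
print(fact)
```

The extracted fact has no field for a copyright notice, so the notice is “removed” simply because the data has been pulled out of its wrapper.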
§ substantially or systematically reproduce, retain or redistribute the Dataset;
“substantially reproduce” can only be decided by a lawyer. “[not] retain” means delete after using, so the user may not be able to repeat their work (since the “Dataset” will change). Requiring people to destroy their data is bad science.
Any responsible text mining requires the corpus used to be available to others to validate the science. Without this a paper reads like:
“We analysed 5000 papers from Elsevier’s http://www.journals.elsevier.com/molecular-phylogenetics-and-evolution/ . We annotated 100 of these for binomial species names and found an interannotator agreement of 98.23%. We cannot make these available for reviewers but trust us, we are conscientious scientists”.
When people are demanding reproducibility in scientific computing there is a requirement for the primary data to be Openly available. Elsevier’s and other publishers’ restrictions have held back natural language processing by at least a decade.
§ extract, develop or use the Dataset in any direct or indirect commercial activity;
I have no idea what an indirect commercial activity is.
§ use any robots, spiders or other automated downloading programs, algorithms or devices to search, screen-scrape, extract, or index any Elsevier web site or web application;
I will deliberately not comment on this now as there is too much to say. Later.
§ utilize the TDM Output to enhance institutional or subject repositories in a way that would compete with the value of the final peer review journal article, or have the potential to substitute and/or replicate any other existing Elsevier products, services and/or solutions.
This is the killer. It absolutely stops me doing any serious systematic scientific TDM under this “licence”.
For background you must know that Elsevier produces many products other than journals. Many are secondary publications: they abstract, summarize, codify, etc. In chemistry they produce the Beilstein database, which Wikipedia (http://en.wikipedia.org/wiki/Beilstein_database) describes as follows:
The Beilstein database is the largest database in the field of organic chemistry, in which compounds are uniquely identified by their Beilstein Registry Number. The database covers the scientific literature from 1771 to the present and contains experimentally validated information on millions of chemical reactions and substances from original scientific publications. The electronic database was originally created from Beilstein’s Handbook of Organic Chemistry, founded by Friedrich Konrad Beilstein in 1881, but has appeared online under a number of different names, including Crossfire Beilstein. Since 2009, the content has been maintained and distributed by Elsevier Information Systems in Frankfurt under the product name “Reaxys”.
The database contains information on reactions, substances, structures and properties. Up to 350 fields containing chemical and physical data (such as melting point, refractive index etc.) are available for each substance. References to the literature in which the reaction or substance data appear are also given.
It’s got roughly 10 million compounds. Let’s suppose I intend to mine Elsevier’s TDM API for chemical reactions. I guess they publish at least 100,000 reactions a year (there are 10,000 pages per year in their Tetrahedron, and many papers will have lots of reactions). I want to mine them for scientific purposes for the EPSRC-funded “Dial a Molecule” that I have been involved with. This program wants to create artificially intelligent systems for making new chemicals: new drugs, new smart materials, etc. An essential part is an Open collection of existing chemical reaction data. I know what I want from the literature and I can technically extract it.
I can do it on my laptop.
But Elsevier will claim this is competing against Reaxys and they will stop me doing this.
And this will happen in other fields: anything useful extracted from the literature will compete against Elsevier products. (If you have enthused about Elsevier’s TDM, are you fully aware of the products you may compete against?)
So what I and others are doing is an inevitable part of progress and innovation. It’s constant. Elsevier are trying to hold it back.
And trying to prevent it through lawyers shows a fundamental contempt for true science.