Textmining: My years negotiating with Elsevier

This post – which is long, but necessary – recounts my attempts to obtain permission to text-mine content published in Elsevier’s journals. (If you wish to trust my account the simple answer is – I have got nowhere and I am increasingly worried about Elsevier’s Sciverse as a monopolistic walled garden. If you don’t trust this judgement read the details). What matters is that the publishers are presenting themselves as “extremely helpful and responsive to request for textmining” – my experience is the opposite and I have said so to Efke Smit of the STM publishers’ assoc. In particular I believe that Elsevier made me and the chemical community a public promise 2 years ago and they have failed to honour it.

Although it is about chemistry it is immediately understandable by non-scientists. It is immediately relevant to my concerns about the collaboration between the University of Manchester and Elsevier but has much wider implications for scientific text-mining in general. New readers should read recent blog posts here, including /pmr/2011/11/25/the-scandal-of-publisher-forbidden-textmining-the-vision-denied/ which explains what scientific textmining can cover, and should also read forthcoming posts and comments.

I shall frequently use “we” to mean the group I created in Cambridge and extended virtual coworkers. I am not normally a self-promotionist, but it is important to realise that in the following history “we” are the leading group in chemical textmining, objectively confirmed by the ACS Skolnik award. “We” deserve a modicum of respect in this.

 

I start from common practice, logic, and legal facts. My basic premises are:

  • I have the fundamental and absolute right to extract factual data from the literature and republish it as Open content. “facts cannot be copyrighted” (though collections can). It has been common practice over two or more centuries for scientists to abstract factual data from the literature to which they have access (either by subscription or through public libraries). There are huge compilations of facts. A typical example is the NIST webbook; please look at http://webbook.nist.gov/cgi/cbook.cgi?ID=C64175&Units=SI&Mask=1#Thermo-Gas. This is a typical page (of probably >> 100,000) carefully abstracted from the literature by humans. It is legal, it is valuable and it is essential.
  • We have developed technology to automate this process. I argue logically that what a human can do, so can a machine. But logic has no force in business or in court, and I am forbidden to deploy my technology by restrictive publisher contracts (see previous posts). So what is a perfectly natural extension of human practice to machines is forbidden for no reason other than the protection of business interests. It has no logical basis.
  • I wish to mine factual data from Elsevier journals, specifically “Tetrahedron” and “Tetrahedron Letters”. I shall refer to these jointly as “Tetrahedron”. The factual content in these journals is created by academics and effectively 100% of this factual content is published verbatim without editorial correction. Authors are required to sign over their rights to Elsevier (and even if there may be exceptions they are tortuous in the extreme and most authors simply sign). Elsevier staff refer to this as “Elsevier content”. I shall always quote this phrase as otherwise it implies legitimacy which I dispute – I do not believe it is legally possible to sign over factual data to a monopolist third party. But it has never been challenged in court.
  • Everything I do is Open. I have no hidden secrets in my emails and anyone is welcome to write to the University of Cambridge under FOI and request any or all of my emails with Elsevier. I personally cannot publish many of them because they contain the phrase: “The information is intended to be for the exclusive use of the intended addressee(s).  If you are not an intended recipient, be aware that any disclosure, copying, distribution, or use of the contents of this message is strictly prohibited.” However I suspect an FOI request would overrule this.
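To make the second premise concrete for non-chemists, here is a toy sketch of what "a machine doing what a human abstractor does" can look like. It is emphatically not OSCAR and not our actual technology: the pattern, the function name `extract_melting_points` and the sample sentence are all invented for illustration, and real experimental sections are far more varied than this pattern assumes.

```python
import re

# Toy pattern (an assumption for this example): matches "mp 123-125 °C"
# or "m.p. 88.5 °C". Real chemical text needs far more robust handling.
MP_PATTERN = re.compile(
    r"\bm\.?p\.?\s*(?P<low>\d+(?:\.\d+)?)"
    r"(?:\s*[-–]\s*(?P<high>\d+(?:\.\d+)?))?\s*°\s*C"
)


def extract_melting_points(text):
    """Return a list of (low, high) melting-point ranges in degrees Celsius.

    A single reported value is returned as a range with low == high.
    """
    results = []
    for match in MP_PATTERN.finditer(text):
        low = float(match.group("low"))
        high = float(match.group("high")) if match.group("high") else low
        results.append((low, high))
    return results


# Invented sample sentence, written in the style of an experimental section.
sample = "White solid, mp 123-125 °C. The third isomer had m.p. 88.5 °C."
print(extract_melting_points(sample))
```

The output is a list of bare numbers: facts, with none of the surrounding prose. That is exactly the kind of material a human abstractor has always been free to copy into a compilation.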

     

I have corresponded verbally and by email with several employees of Elsevier. I have done this through my natural contacts as Elsevier provide no central place for me to discuss the questions. I shall anonymise some of the Elsevier employees. If they feel their position has been misrepresented they are welcome to post a comment here and it will be reported in full. If they send an email I reserve the right to publish it Openly.

The simple facts (which can partly be substantiated by FOI on my emails but are stated here without them) are:

  • About 5 years ago I wrote to all five editors of Tetrahedron and also the Elsevier office about the possibility of enhancing Tetrahedron content through text-mining. I did not receive a single reply.
  • Two years ago there was a textmining meeting at Manchester, organized by NaCTeM and UKOLN (http://www.nactem.ac.uk/tm-ukoln.php). At that meeting Rafael Sidi, Vice-President Product Management, Elsevier presented “Open Up” (30 mins). [He is the named Elsevier contact in the NaCTeM / Elsevier contract]. He gave no abstract and I do not have his slides. From a contemporaneous blog (http://namesproject.wordpress.com/2009/10/ ) “Rafael Sidi of Elsevier (who got through an eye-boggling 180 slides in 30 minutes!) emphasised the importance of openness in encouraging innovation”. With no other record I paraphrase the subsequent discussion between him and me (and I would be grateful for any eyewitness accounts or recordings). If Rafael Sidi wishes to give his account, he is welcome to use this blog.

     

    Essentially Rafael Sidi enthusiastically stated that we should adopt open principles for scientific content and mashup everything with everything. I then asked him if I could textmine Tetrahedron and mashup the content Openly. He said I could. I then publicly said I would follow this up. I have taken this as a public commitment by Sidi (who was representing Elsevier very clearly) that factual content in Tetrahedron could be mined without further permission.

     

  • I then followed it up with mails and phone calls to Sidi. Suffice to report that all the drive came from me and that after six months I had made no progress. I then tried another tack with an Elsevier contact. After another six months, still no progress. I then raised this in 2010-10 with a member of Elsevier staff involved with the Beyond the PDF initiative http://sites.google.com/site/beyondthepdf/ . Although not directly concerned with chemistry she took up the case (and I personally thank her for her efforts) and thought she had made progress (a) by getting Elsevier to draw up a contract allowing me to textmine Tetrahedron and (b) relaying this to David Tempest (Deputy Director “Universal Access”, Elsevier) who is “currently reviewing policies” and “we have finalised our policy and guidelines I would be happy to discuss this further with you.” [That was 9 months ago and I have heard nothing].

The contract is public, apparently available to anyone to negotiate (though there are no rights – all decisions are made by Elsevier). I was told:

You can mine 5 years of Tetrahedron, and will be helped to do so by Frankfurt. You can talk to them about formats.  There are two conditions:

1) You agree with the SciVerse developer’s agreement – on http://developer.sciverse.com/start this is http://developer.sciverse.com/developeragreement – this also means you are not allowed to provide access to the Tetrahedron content (no surprise)

2) You can send us a description of the project you are working on, specifically describing the entities you are interested in mining, and the way in which you will use them.

To summarise:

  • Elsevier decide whether I can mine “their” content. I have no right. I can only beg.
  • All my results belong to Elsevier and I cannot publish them. Specifically:

     

    3.1.3 the Developer has not used robots, spiders or any other device which could retrieve or index portions of the Elsevier website, the Elsevier content or the APIs for any unauthorized purpose, and Developer conforms to all ethical use guidelines as published on the Elsevier website;

So I cannot search their site except as they permit.

3.1.4 the Developer acknowledges that all right, title and interest in and to the Elsevier content, and any derivative works based upon the Elsevier content, remain with Elsevier and its suppliers, except as expressly set forth in this Agreement, and that the unauthorized redistribution of the Elsevier content is not permitted;

“And any derivative works” means that everything I do – chemical structures, spectral data – everything BELONGS TO ELSEVIER. Note the phrase “Elsevier content”. The whole agreement is based on the concept that Sciverse (their platform for publishing “Elsevier content”) is being developed as a walled garden where no-one has rights other than Elsevier.

Well, it has only taken me 18 months to get to that position. I might be able to negotiate something slightly better if I take another two or three years.

And, in any case, I am not begging for permission to do a project. I am asking for my right, both implied by current practice and also stated by Rafael Sidi.

[Incidentally It will be interesting to see if the University of Manchester has signed up to

 

And that’s where the matter rests. No progress…

 

 

But no, I received a request from Elsevier asking if they can use my software. (Why? Because our group is a/the leading one in chemical information extraction). The request was marked confidential, so I have omitted names, but here it is interleaved with my reply (copied to all the people in Elsevier including Rafael Sidi):

Dear Mr. Murray-Rust,

With great interest I have read your description of the OSCAR 4 chemical entity recognizer. We (redacted) would like to evaluate OSCAR for use in our entity recognizer system and compare it to other analysers.

Because OSCAR is Open Source you may do this without permission.

A few months ago, I have done some comparisons with other annotators and can only say that OSCAR compares quite favourably and is easily deployed – that is to say, if it runs as a Java server.

I assume these comparisons are confidential to Elsevier

This type of functionality is included in the OSCAR 3 implementation and is really easy to access because no coding layers are required to go between our code and yours – just an http webrequest.

We are using .Net for all our development so a web interface would be real nice. I gather from the article posted (OSCAR4: a flexible architecture for chemical text-mining) that there are several wrappers around by several users – is there any chance that there is a .Net or HTTP wrapper that we might use? A short-cut in Java to build one ourselves?

I understand this to be a request for free consultancy. Unfortunately we have run out of free consultancy at present.

Do you have any advice here?

Normally I would reply in a positive light to anyone asking polite questions, but I have had two years of unfulfilled promises from Elsevier so I will engage on one condition – that Elsevier honour the public promise that Rafael Sidi made two years ago.

Mr Sidi stated in public that I could have permission to use OSCAR on chemical reactions published in Elsevier journals (Tetrahedron, Tett Letters, etc.) and to make the results publicly Open. Over the last two years I have tried to get action on this (see copied people). The closest I got was an agreement which I would have to sign saying that all my work would belong exclusively to Elsevier and that I would not be able to publish any of it. (The current agreement that my library has signed for subscriptions to Elsevier is that all text-mining is explicitly and strictly forbidden). Not surprisingly I did not sign this.

By Elsevier making a public promise I assumed I would be able to do research in this field and publish all the results. In fact Elsevier has effectively held back my work for this period and looks to continue to do it. I regard Elsevier as the biggest obstacle to the academic deployment of textmining at present.

The work that you are asking me to help you with will be an Elsevier monopoly with restrictive redistribution conditions and I am not keen on supporting monopolies. If you can arrange for Elsevier to honour their promise I will be prepared to explore a business arrangement though I am making no promises at present.

Thank you very much,

I am sorry this mail is written in a less than friendly tone but I cannot at present donate time to an organisation which works against the direction of my research and academia in general. If Elsevier agrees that scientific content can be textmined without permission and redistributed (as it should be if it is to be useful) then you will have helped to make progress.

I have copied in your colleagues who have been involved in the correspondence over the last two years.

[Name redacted]

I am currently treating your request as confidential as it says so but I do not necessarily regard my reply as such. You will understand that I need a reply

Needless to say I have received no reply. You may regard my reply as rude, but it is the product of broken promises from Elsevier, delays, etc. So, Rafael Sidi, if you are reading this blog I would appreciate a reply and the uncontrolled permission to mine and publish data from Tetrahedron.

Because I shall forward your response (or the lack of one) to the UK government who will use your reply as an example of whether the publishers are helpful to those wanting to textmine the literature.

 

 

 

 


22 Responses to Textmining: My years negotiating with Elsevier

  1. They’re on the verge of collapse. Keep up the good work.
    Cheers,
    S.

  2. X says:

    Why not do it without Elsevier’s permission? If you are certain that you have the right through a fair use / facts cannot be copyrighted argument, then what’s stopping you?

    • pm286 says:

      (a) because it would break the contract that the University has signed. Elsevier would immediately cut off content from the University. (Publishers have already cut off the University for my legitimate actions – they would not hesitate for an illegal one.)
      (b) it would give them legitimacy. At present I and supporters have the complete upper moral hand. By “stealing” “their” content I would be labelled as a “pirate”.
      No one ever is certain in a legal case. I could be morally right, technically right but if my lawyer is paid less than theirs they have an increased chance of winning. There *are* issues I would go to court for but this is not yet one of them (my chemical colleagues couldn’t care).

  3. I can verify PMR’s account of the Text Mining meeting at Manchester, organized by NaCTeM and UKOLN. As proof, I can also add that PMR asked Rafael Sidi if all of the images in the “eye-boggling 180 slides” were released under open access licenses and could be copied/remixed without permission. In fact I can remember this event quite clearly, since at the time I had no idea who PMR was and was impressed that someone had the courage to make this stand so publicly.

    • pm286 says:

      Thanks,
      It will be interesting to see if Rafael Sidi remembers his answer. (he will be very ill-advised not to answer since this will be taken as non-cooperation in the Hargreaves process).

  4. Richard Kidd says:

    My recollection of the meeting is that Rafael agreed to discuss further with you. If he had agreed to open release of the results I would have fallen off my chair.

  5. Given the fact that Elsevier’s database is a protected database according to EC law, and the journal PMR wants for his data-mining is a non-essential part of the whole Elsevier database, all contractual restrictions are NOT VALID. Feel free to read the law. Each EC country has a similar legal ban on these restrictions based on the EC database directive. Therefore what PMR writes is in part simply wrong.

    • pm286 says:

      Klaus,
      Please can you clarify. Are you saying that it is illegal for Elsevier to impose the contractual conditions that they do?

  6. See
    http://en.wikipedia.org/wiki/Database_Directive
    Here is the UK version:
    “Avoidance of certain terms affecting lawful users
    19.—(1) A lawful user of a database which has been made available to the public in any manner shall be entitled to extract or re-utilise insubstantial parts of the contents of the database for any purpose.
    (2) Where under an agreement a person has a right to use a database, or part of a database, which has been made available to the public in any manner, any term or condition in the agreement shall be void in so far as it purports to prevent that person from extracting or re-utilising insubstantial parts of the contents of the database, or of that part of the database, for any purpose.”
    http://www.legislation.gov.uk/uksi/1997/3032/regulation/19/made
    Is Elsevier’s journal database a protected database according EC law?
    I would say YES.
    Is your data mining extracting insubstantial parts of the Elsevier database (please note that it matters what the subject of the protection is: I would say the whole database not the journal)?
    I would say YES.
    Is Elsevier’s restriction illegal according EC law?
    You can give the answer!

    • pm286 says:

      Not sure what is different between your two posts? I have posted the first one.
      You seem to contend that a journal is covered by database law. Perhaps you can give justification. I have no expertise.

  7. 1. Two posts: I had two browser tabs with the blog open and posted to the wrong entry. The posts are identical as I copied the first to post it here on the right place.
    2. An electronic journal or journal portal is a protected database if and only if the publisher has made a substantial investment.
    “While copyright protects the creativity of an author, database rights specifically protect the “qualitatively and/or quantitatively [a] substantial investment in either the obtaining, verification or presentation of the contents”: if there has not been substantial investment (which need not be financial), the database will not be protected [Art. 7(1)].” (Wikipedia)
    There is no doubt that the whole Elsevier database with all 20,000+ journals is such a database.
    Investor is Reed-Elsevier, an UK-based company:
    http://www.reedelsevier.com/Pages/Termsandconditions.aspx
    Thus the database right is applicable for the whole EC.
    3. I can only cite a German article on § 87e but you can trust that I have enough copyright expertise as author of the book “Urheberrechtsfibel” (2009).
    Harald Müller 2002 http://bibliotheksdienst.zlb.de/2002/02_03_06.pdf

  8. Katja says:

    I became interested in data mining on scientific papers when reading this blog post. Is there a good literature review paper on paper data mining?
