Textmining: Update, Wiley, Nature and Hargreaves. And Elsevier allows me unrestricted text-mining! Thanks!!!

I shall continue to update on a daily basis.


We have formed a small group to coordinate our reply to Hargreaves and this will take place on the OKF open-science and open-access lists (and @ccess). Please let us know of useful experience in access to published material.


Today Richard van Noorden (of Nature) posted a useful article http://www.nature.com/news/trouble-at-the-text-mine-1.10184 on the current frustration within the research community about not being able to textmine when and where they want. It’s moderately well balanced. However, it doesn’t say anything critical about Nature:

“Nature Publishing Group in London, which publishes this journal, says that it does not charge existing subscribers to mine content to which they already have access, subject to contract.”

RvN didn’t say that NPG sent Max Haeussler a quote for 85,000 USD to mine Nature content. I talked with RvN, gave him a lot of material for his article and pointed him to MaxH. I said that I would expect RvN to be objective in his report and not favour Nature. He said he would, but had to get his copy agreed. In the end he decided not to use any of my material – that’s fine, journalists collect more material than they can use.


I have had a useful set of email communications with Alicia Wise of Elsevier. Today she has agreed that I can go ahead with textmining as I wish! Thank you Alicia!

Hi Peter,

As I indicated to you when we met in Oxford, we (at Elsevier) have no problem in principle with you text mining for research purposes. There are some practical matters to resolve through discussion. With regret I have formed the view that you are not – at this time – really seeking practical solutions. If this changes please do let me know as we remain willing to work with you and other colleagues at Cambridge – and elsewhere – who need and want to text mine.

While I am here, I would like to stress the real value of librarians in these discussions. Your library colleagues at Cambridge have – both directly and through JISC Collections – relationships and existing agreements with a wide array of publishers. They are constructive partners for us all in facilitating text mining and scaling up as we move forward.

With kind wishes,


(Alicia Wise, Director of Universal Access, Elsevier, @wisealic)


  • I am actively seeking practical solutions. I’m going to start tomorrow! (I’ll let our library know in case there are teething glitches). Last week we (Daniel Lowe) mined 1,000,000 (1 million) chemical reactions from US patents. This is for RESEARCH purposes (we are not going to sell them). We are analysing how well the technology works and then what types of chemistry are most effective. This feeds into the EPSRC Dial-A-Molecule Grand Challenge looking at how we can create better chemical synthesis for drugs. It could lead to a radical improvement of chemistry. That’s RESEARCH. We are going to put the results up on DSpace and Figshare and our own Quixote so everyone else can do research on them as well. NOTE: I didn’t need any help from the USPTO or Cambridge Library.
  • I simply want to do the same with papers in Elsevier journals. I shan’t release any of the final PDF. I’m just going to publish the factual material – and conveniently in DSpace and Figshare and Quixote. This is research because the science is done for a different purpose than invention and generally is aimed towards novelty rather than production. So we get a whole new set of chemistry. It’s also done on a different scale – much novel chemistry doesn’t scale directly into production.

This is VERY good news. Thank you Alicia. It’s not everything I have asked for but it’s of real value. We can mine Elsevier journals for research purposes. We start today!!!

I assume you will trust me as to what RESEARCH in chemical text-mining is – I’m a world expert, honoured by the ACS for this work. And I assume you will trust me not to publish copyright content – I haven’t done so in 10 years of semantic research. I shan’t publish the VoR PDF nor the author’s final manuscript. But I shall publish all the factual data on which the RESEARCH relies and all the bibliography metadata which is required to manage the output.

So here’s what I am going to do:

  • Use our Pubcrawler software to systematically retrieve all publications from Elsevier journals. (We can do this – we don’t need any technical help from Elsevier or our Library and we don’t need Sciverse, Scopus, Reaxys, Science Direct or any other Elsevier product.) We shall only use the material for information mining.
  • We shall determine which papers contain chemistry using our OSCAR4 software. This is the best Open Source software for chemical textmining and probably as good as if not better than closed proprietary tools.
  • We shall filter the articles into those that have a significant proportion of chemistry and those that don’t and concentrate on the former.
  • We shall then extract and analyse the chemical names and formulae. Where possible we shall try to match redundant information (e.g. names and structure diagrams).
  • We shall extract the factual data (spectra) and check their validity against the chemical structure using our OSCAR2 software (Open Source). Many papers contain many errors (even Elsevier papers contain many errors). We’ll show where papers contain errors (and that’s a real benefit to scientific RESEARCH).
  • We shall use computational chemistry to compute the properties of the compounds and compare them with experiment. That’s really valuable RESEARCH. 15% of all supercomputer time is on compchem and there is a desperate need to calibrate its usefulness.
  • We shall extract the chemical reactions. There is very little research done in academia on the phenomenology of published reactions – we did some of this last year at the Open Science Summit where we analysed chemical reactions for eco-friendliness (the “Green Chain Reaction”). We’ll be able now to show whether the chemistry in Elsevier journals is more eco-friendly than in patents.
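
The filtering step (keep the articles with a significant proportion of chemistry, set the rest aside) can be sketched roughly as follows. This is only an illustration in Python, not our actual pipeline: the tiny lexicon, the `toy_is_chemical` recogniser, the `Article` class and the 5% threshold are all hypothetical stand-ins, and a real run would plug in a proper chemical entity recogniser such as OSCAR4 in place of the toy function.

```python
import re
from dataclasses import dataclass

# Hypothetical toy lexicon, for illustration only. A real run would use a
# chemical named-entity recogniser (e.g. OSCAR4) instead of a word list.
TOY_CHEMICAL_TERMS = {"ethanol", "benzene", "toluene", "acetone", "thf"}


@dataclass
class Article:
    identifier: str  # any label for the paper (assumed, e.g. a DOI)
    text: str        # plain text of the article


def tokenize(text):
    """Split text into simple word-like tokens."""
    return re.findall(r"[A-Za-z][A-Za-z0-9-]*", text)


def toy_is_chemical(token):
    """Stand-in recogniser: a token counts as chemistry if it is in the lexicon."""
    return token.lower() in TOY_CHEMICAL_TERMS


def chemistry_fraction(text, is_chemical=toy_is_chemical):
    """Fraction of tokens that the recogniser flags as chemical."""
    tokens = tokenize(text)
    if not tokens:
        return 0.0
    hits = sum(1 for token in tokens if is_chemical(token))
    return hits / len(tokens)


def split_by_chemistry(articles, is_chemical=toy_is_chemical, threshold=0.05):
    """Return (chemical, other): articles at or above the threshold, and the rest."""
    chemical, other = [], []
    for article in articles:
        if chemistry_fraction(article.text, is_chemical) >= threshold:
            chemical.append(article)
        else:
            other.append(article)
    return chemical, other


if __name__ == "__main__":
    papers = [
        Article("a1", "The product was dissolved in ethanol and washed with acetone."),
        Article("a2", "We thank the committee for its helpful comments on the budget."),
    ]
    chemical, other = split_by_chemistry(papers)
    print([a.identifier for a in chemical], [a.identifier for a in other])
```

The recogniser is passed in as a function so that the crude word list can be swapped for a real tool without touching the filter itself; the threshold would need tuning against papers whose chemical content is already known.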


That’s the start. It’ll take us a day or two to deploy the software on Elsevier journals but after that only a few days to do the analysis. Because it’s research we shall publish it (choice of publisher is currently Open) and the referees will demand that we make the data available. So we have to put it up publicly and we have DSpace and Figshare and our own Quixote system to do this.

This is really exciting! Thanks Alicia!


8 Responses to Textmining: Update, Wiley, Nature and Hargreaves. And Elsevier allows me unrestricted text-mining! Thanks!!!

  1. Mat Todd says:

    Well, if you want to do a similarity search, it would be great to know who has been making compounds that look like our super-potent new malaria lead (ZYH 7-2) which is being discussed here:
    The year of publication of anything mentioning related compounds is of interest because we are trying to find PEOPLE who have made related compounds recently, and who may therefore have samples in their fridges that could be screened with no new synthetic effort.

    • pm286 says:

      This is great, Mat,
      What I plan is to extract all compounds using OSCAR and then move to more sophisticated facts later

  2. Chris Rusbridge says:

    Peter, my reading isn’t quite the same as yours; Alicia is suggesting there’s “no problem in principle with you text mining for research purposes…” BUT “…There are some practical matters to resolve…” which suggests no actual agreement in practice (yet). You may have resolved this already, but as reported I would suggest there is still a risk of Elsevier’s robots detecting your robots and switching off the University! I do sincerely hope I’m wrong…
    Anyway, best of luck and do keep up the good fight!

    • pm286 says:

      There are no technical matters to resolve. We can mine the info so that we don’t cause problems with overload. We don’t need an agreement – she has said so. If Elsevier cut us off we will start again.

  3. “However it doesn’t say anything critical about Nature”
    This sentence startled me until I realised you were actually referring to Nature Publishing Group 😉

  4. pete carroll says:

    Recently, I came across an EU FP7 framework call under the Innovative Medicines Initiative for setting up a Joint European Compound Collection of about 500,000 compounds from private and public sector sources for High Throughput Screening as potential lead structures for new pharmaceuticals. For more details see: Identifier: IMI-Call-2012-5 (download 5th Call 2012 topics text)
    Now, I am not a cheminformatics expert, so forgive me if I’ve got it wrong, but it seems to me that this is yet another example of where researchers’ ability to text-mine the scientific literature would be invaluable. Compounds XY…Z are in the collection with some information, but what else is “out there” about them?
    Regarding Reed-Elsevier. On the principle of “know your enemy” you (or anybody else reading this and thinking of submitting to the IPO Hargreaves consultation) might find their submission to the BIS Parliamentary Select Committee consultation on the Hargreaves Review interesting reading.
    It concentrates on the issue of text mining and sets out their objections to an exception in some detail. Refute those objections point by point and your case is strengthened. For myself I find section 6.8 particularly disturbing.
    “6.8 It is not clear that the proposed exception would satisfy the International Berne Convention test whereby exceptions must not conflict with the normal exploitation of the work nor unreasonably prejudice the rights of the author. If it is true that data mining is the new reading and the likely future mainstream method of consumption of information, then it is set to become 21st century “normal” exploitation and any exception would be contrary to the treaty.”
    Are they really saying that “normal reading” in the 21st Century should fall within copyright law and be subject to their license and permission? I’d like to see that one taken to the European Court of Human Rights!
    One last thing. In the IPO consultation document sections 7.93 and 7.94 the UK government seems to be hinting that some collective licensing agreement may be an answer – possibly (although this isn’t explicitly stated) through the proposed Digital Copyright Exchange. Keep an eye on this one- it needs thinking about.
    “7.93 However, under current conditions, in some cases research projects could require specific permissions from a very large number (potentially hundreds) of publishers in order to proceed. The current requirement for specific permissions from each publisher may be an insurmountable obstacle, preventing some research from taking place at all.
    7.94 The Government is not aware that publishers currently offer a collective solution that overcomes this difficulty. Therefore the current arrangements for using analytic technologies may well not be the best way of serving the overall public good, and the overall public benefit of a text and data mining exception appears to outweigh the harm to the licensing market. However, the Government will be very interested to hear of any alternative solutions which solve this “hold-up” problem of multiple permissions…”
    Best of luck
