Information mining and Hargreaves: I set out the absolute rights for readers. Non-negotiable

As I have already blogged I have been asked by Ben Hawes at the UK Intellectual Property Office to respond to the Hargreaves report on "textmining". I shall be getting help from my OKF colleagues. The issues are, in my mind, simple:

  • Legitimate human readers of the literature ("subscribers") have a right to extract factual information from the literature and have exercised this for 200 years.
  • We can now do this with machines, often better than with humans. It's vastly faster and cheaper. It increases the value of the literature.
  • The publishers forbid us to do this and put in place legal and technical obstacles on top of normal copyright.
  • We are now demanding the removal of these obstacles.

This is not a negotiation, it's a statement of our absolute right. As a corollary it is an integral part of what we pay for human access so there is no reason to make any charge for this.

In essence we shall report to Hargreaves:

  • Our position and the justification for it
  • Whether the publishers have agreed that these are our rights. I have made them simple because they are simple. Publishers wish to appear "helpful"; this is their chance to show that they are working with us as they continually claim

We shall contact 9 publishers tomorrow through known contacts; this represents our best approach to non-repudiation. Since the publishers have no formal mechanism for readers to make formal enquiries (in itself institutionalised unhelpfulness to readers) it will be done through this blog and email. There is enough evidence to show that all publishers will be aware of this request and if they wish to be helpful they can.

Background and clarification

Human abstracters have for centuries abstracted from and commented on the scholarly literature and made the results public without requiring permission from publishers and/or authors. Indeed science is based on being able to do this. In our present request I shall confine the request to "facts" in scientific papers, but the permission I am asserting extends to abstracting papers and to commenting and other activities practised in the paper era. None of this requires new permissions; it is explicitly and implicitly part of current practice. If I read a paper I can write an abstract; I can also critique parts (e.g. reproduce paragraphs and comment in detail on what the authors said). Refusal to allow this is a direct attack on the integrity of science.

I do not have to be the owner of a scientific article to do this. If I borrow a journal from the public library I can sit at home and write abstracts on every paper. I would strongly urge anyone interested in abstraction, commentary, parody, etc. to make representation to Hargreaves – I personally only have time to address the extraction of scientific facts and indexing the literature.

To illustrate "facts", here is the Wikipedia article on aspirin. The article is essentially a collection of facts (as it should be – WP is very strong on removing opinions). The fact that someone reported them in natural language does not stop them being facts. Here are some examples:

  • [Aspirin] was first isolated by Arthur Eichengrün, a chemist with the German company Bayer.[1]
  • It has also been established that low doses of aspirin may be given immediately after a heart attack to reduce the risk of another heart attack or of the death of cardiac tissue.[4][5]
  • (the factual synthesis of aspirin from salicylic acid and acetic anhydride)
  • Physical data Density 1.40 g/cm³ Melt. point 135 °C (275 °F) Boiling point 140 °C (284 °F) (decomposes) Solubility in water 3 mg/mL (20 °C)


All of these are FACTs. As is:

Synthesis of (+)-η5-Cyclopentadienyl(η4-(3aS,4S,7aS)-methyl 2,2-dimethyl-3a,7a-dihydrobenzo[d][1,3]dioxole-3a-carboxylate)cobalt(I) 18

Diene 16 (301 mg, 1.43 mmol, 1 equiv) was dissolved in dry, degassed toluene (10 mL) in a side-armed Schlenk that had been purged and refilled with argon three times. The resulting solution was added via cannula to η5-cyclopentadienylbis(ethylene)cobalt 17 (258 mg, 1.43 mmol, 1 equiv) and the mixture was stirred for 30 min at room temperature until the evolution of ethylene had ceased. The solvent was removed in vacuo and the solid residue was redissolved in a minimal amount of hexane and left to crystallise at -28°C for 48 h. Complex 18 was isolated as red-orange crystals (124 mg, 26%); m. pt. 118-120°C; [α]D +42 (c = 1, CH2Cl2); 1H-NMR (300 MHz, C6D6, Additional file 2) δ 5.44 (1H, d, J = 5.0 Hz, O-CH-), 5.09-5.04 (2H, m, -CH=CH-CH=CH-), 4.41 (5H, s, Cp-H), 3.53 (3H, s, O-CH3), 3.14 (1H, dd, J = 5.5, 1.0 Hz,=CH-C-COOCH3), 2.82 (1H, td, J = 5.0, 2.0 Hz, -O-CH-CH=), 1.42 (3H, s, C-CH3), 1.28 (3H, s, C-CH3) ppm; 13C-NMR (75 MHz, CDCl3, Additional file 2) δ 174.9, 113.9, 82.1, 80.8, 79.8, 74.6, 51.6, 48.9, 48.6, 27.1, 25.6 ppm; νmax (film) 2986, 2937, 1732, 1436, 1370, 1307, 1259, 1229, 1206, 1166, 1109, 1064, 1009, 888, 821, 762 cm-1; HRMS (+ve ESI-TOF) m/z calcd for (C15H20CoO4+H)+, 335.0688, found 335.0694. Found: C, 57.58; H, 5.76. C15H20CoO4 requires C, 57.49; H, 5.73%).

Everything in this paragraph is factual – what was done, what was observed, what was measured. And our software can extract 95% of the meaning from this in a few seconds, whereas many final year undergraduates might struggle.
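
To make the point concrete, here is a minimal sketch of pulling a few facts out of the isolation sentence above. It is not the software referred to in this post; it is a toy built on regular expressions, and the names (extract_facts, SAMPLE, the three patterns) are assumptions made up for illustration. Real experimental sections vary far more than these patterns allow for.

```python
import re
from typing import Dict, Union

# Abridged excerpt of the experimental paragraph quoted above.
SAMPLE = (
    "Complex 18 was isolated as red-orange crystals (124 mg, 26%); "
    "m. pt. 118-120°C; HRMS (+ve ESI-TOF) m/z calcd for "
    "(C15H20CoO4+H)+, 335.0688, found 335.0694."
)

# Hypothetical patterns, tuned only to the phrasing used in this excerpt.
YIELD_RE = re.compile(r"\((\d+(?:\.\d+)?)\s*mg,\s*(\d+(?:\.\d+)?)%\)")
MELTING_RE = re.compile(r"m\.\s*pt\.\s*(\d+)\s*-\s*(\d+)\s*°C")
HRMS_RE = re.compile(
    r"calcd for \(([^)]+)\)\+,\s*(\d+\.\d+),\s*found\s*(\d+\.\d+)"
)


def extract_facts(text: str) -> Dict[str, Union[float, str]]:
    """Return whichever of the three kinds of fact can be found in text."""
    facts: Dict[str, Union[float, str]] = {}

    match = YIELD_RE.search(text)
    if match:
        facts["mass_mg"] = float(match.group(1))
        facts["yield_percent"] = float(match.group(2))

    match = MELTING_RE.search(text)
    if match:
        facts["melting_point_low_c"] = float(match.group(1))
        facts["melting_point_high_c"] = float(match.group(2))

    match = HRMS_RE.search(text)
    if match:
        facts["hrms_ion"] = match.group(1)
        facts["hrms_calculated"] = float(match.group(2))
        facts["hrms_found"] = float(match.group(3))

    return facts


if __name__ == "__main__":
    for key, value in extract_facts(SAMPLE).items():
        print(key, value)
```

The output is a plain dictionary of numbers and strings: facts, with none of the author's expression left in them.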

Factual information is frequently contained in graphs, tables, images, speech and video. Therefore "text-mining" is a subset of information-mining and I shall use that term. Indeed our software can understand simple human spoken discourse about chemical reactions and extract the facts.


Alicia Wise from Elsevier wants to know what I want to do with the content. There is no reason why I should have to justify what I do to Elsevier, but here it is:


I want to extract as many facts as I can from the scientific literature and publish them (as CC0) for me and others to do science with, to build new scientific tools and improve the quality of science.

It is my right. There is absolutely no reason why anyone should need to involve the publisher in information-mining. I have legally mined 200,000 scholarly documents without requiring help or permission from the publisher. I strongly urge anyone thinking of information-mining to explore what, if anything, they need the publisher for. Scientists should not have to ask permission, nor should they have to "use the publisher API", and they should never have to pay.

Legitimate publisher concerns about information-mining

There is only one valid reason for liaising with the publisher – the possibility of server overload. This is a negligible problem if done responsibly – for example if one allows a short pause between each download request (I use 1 second, but I'm willing to be informed of best practice).
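
A considerate downloader of this kind needs very little code. The following is a minimal sketch, not my actual crawler: the function names, the User-Agent string and the injectable fetch and sleep parameters are all assumptions for illustration, and the 1 second default is simply the figure mentioned above, not an agreed standard.

```python
import time
import urllib.request
from typing import Callable, Dict, Iterable


def fetch_url(url: str) -> bytes:
    """Download one document. The User-Agent value is a made-up example."""
    request = urllib.request.Request(
        url, headers={"User-Agent": "example-miner/0.1"}
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        return response.read()


def polite_fetch(
    urls: Iterable[str],
    fetch: Callable[[str], bytes] = fetch_url,
    delay_seconds: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Dict[str, bytes]:
    """Fetch URLs strictly one at a time, pausing between requests."""
    results: Dict[str, bytes] = {}
    for index, url in enumerate(urls):
        if index > 0:
            # Pause before every request except the first, so the server
            # never sees two requests closer together than delay_seconds.
            sleep(delay_seconds)
        results[url] = fetch(url)
    return results
```

Because requests are sequential and spaced out, the load on the server is no greater than that of one patient human reader clicking through the same articles.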

publisher concerns about information-mining

I suspect the following concerns:

  • Peter Murray-Rust will steal and publish "our" content. I find this deeply offensive. I have been confronted by publisher robots which, in essence, announce: "You are illegally downloading content; we have cut off the journal supply to the whole University; you will have to justify to a senior member of the university why we should reconnect you". [I did nothing illegal or anticontractual – these robots act at the slightest trigger]. There is no debate. PM-R is guilty until proved innocent. It is demeaning to be confronted by colleagues who accuse you of having their It is publisher HADOPI except it is "1 strike and you are out". It is publisher institutionalism at its worst. An unbelievable arrogance that the scientific world is out to defraud publishers. And to say "oh, it's not you but it's your graduate students" is worse. If there is anything that underpins science it is the need for ethical behaviour, and almost all scientists are highly ethical. If unethical behaviour is detected they are severely reprimanded by the community and it may be the end of their career. To accuse scientists of being thieves (even if you accept that sharing copyrighted papers is theft) is inexcusable.
  • Peter Murray-Rust and his robots will find errors in our papers. I hope that no-one is afraid of this. It is the purpose of science to find errors and our robots are better than humans at finding many types of error. The publishers' refusal to allow us to validate the literature is damaging science, not enhancing it.
  • Peter Murray-Rust will create disruptive technology that will seriously disturb our cosy monopolies. I think this is the real crux. Elsevier forbade me to data-mine chemistry because it threatens their cosy data monopolies. Here's what they said in 2010-12:

    [An Elsevier staff member] have not been able to get a clear story on the Tetrahedron supplementary content mining. There are two opposing views on this (roughly: ScienceDirect is fine with it, and Reaxys is not), and it is not clear what the resolution is. I am very sorry about this, and will keep trying to get a coherent response out of the multifarious monster that is a big company. As soon as I know, you'll know.

    Interpretation. Elsevier run a chemical database where they abstract information from the literature (Reaxys) which probably has a revenue stream of several hundred million USD [ACS do the same ("Chemical Abstracts", CAS) and estimates are in the range 200-500 million USD]. So to preserve their monopoly they prevent me mining information. Is it a real threat? One chemist against Elsevier? Yes. Because I have many people who think the same way.
    Other walled gardens include bibliography and citations. It's possible to extract both of these robotically and we have the technology to do this. But Scopus and Web of Science will be disrupted by this.

What Elsevier and other publishers should do, and what we should do.

The Elsevier contract states:

The CDL/Elsevier contract includes, at Schedule 1.2(a):

"Subscriber shall not use spider or web-crawling or other software programs, routines, robots or other mechanized devices to continuously and automatically search and index any content accessed online under this Agreement."

It forbids me to do anything. The answer is simple:


[Aside: How my University and any other University could meekly sign this without a titanic public fight is beyond me.]

If Elsevier don't scrap it I'll urge Universities to take them to court. It's against natural justice. We've paid enough for the subscription – we should be allowed our natural rights to do whatever we want.

Then the robots:


The robots have no benefit to the subscriber and are deeply insulting.

I am prepared to agree that we should be considerate in our crawling. I have been very considerate so far, and have verbally agreed this with at least two publishers. It's insulting to suggest that Universities are incapable of writing robots.

That's it. We demand our rights and will – in one area – agree to abide by a common technical protocol for information mining. There is no other legitimate reason for denying us.

And this applies to ALL toll-access publishers.



8 Responses to Information mining and Hargreaves: I set out the absolute rights for readers. Non-negotiable

  1. Richard Kidd says:

    Can I ask a question?

    As we're talking about legislation here, we're not just talking about what people we like do. The language sets the boundaries of the law, applicable to all, and trust ain't enough.

    What are the boundaries of text mining? Can the mined results be 100% of the content of the original? Am really interested how this will or could be enshrined in law.

    • pm286 says:

      >>Can I ask a question?

      Of course!

      >>As we’re talking about legislation here, we’re not just talking about what people we like do. The language sets the boundaries of the law, applicable to all, and trust ain’t enough.

      Hargreaves makes it clear that the law is not appropriate for the digital age. There are two variations - forgive me if I get terms wrong:
      * change the law - this requires an act of parliament
      * bring in statutory instruments (secondary legislation). This is easier and can be done in Whitehall. It is still subject to scrutiny

      Useful progress is possible under the second course.

      The rest are my views.

      Trust is bidirectional and is currently in short supply. On the other hand running everything by law leads to a very restricted environment.

      >>What are the boundaries of text mining? Can the mined results be 100% of the content of the original? Am really interested how this will or could be enshrined in law.

      The boundaries of text-mining will depend on the jurisdiction. The US is - I think - more liberal than UK. Neither has been fully tested in court. Doctrines such as fair use exist in the US and there is considerable case law. There is no fair use in UK. I suspect that answers are only possible in court, or if Hargreaves reforms the law to make it clearer.

      Note that I am not restricting mining to text. There is (IMO) nothing sacred about a graph and it's absurd to have to redraw it. I remember many many years ago doing Annual Reports for the Chem Soc and the graphs (with 1000 data points) being redrawn by hand so as to avoid copyright problems. Not surprisingly many of the points were omitted or misplaced.

      I shall be inviting 8 major organisations including yourselves to comment on our submission - hopefully announce that tomorrow.

      • Richard Kidd says:

        But you understand my point though? - if text (and image) mining could theoretically encompass just reformatting the entire publication and republishing it, then the implications are somewhat more serious (for a subscription publisher). And as Hargreaves is all about reforming copyright law, this is exactly what needs to be clear and understood if 'text mining' is an exception.

        • pm286 says:

          Yes, I completely understand your point.

          The process is that everyone, including publishers, readers, authors, filmmakers, librarians, musicians, etc. has an opportunity to make a representation. That's what we are doing. If we had already reached consensus then we wouldn't be making representations. We expect to be able to say:
          * these are our requirements

          If in addition we get comments from the publishers we are sending letters to, we'll try to publish them to Hargreaves. Hargreaves will collate material and send it to the legislators. They will make a democratic (in the parliamentary sense) decision on what to change. At that stage, no doubt, there is another round of lobbying/argument.

          We are stating what we want. We are not negotiating a communal agreement. It would be nice if we already had one but we don't and we shan't get one in two weeks.

