Our Protocol for Text-mining: Preamble and “Institutionalism”; Elsevier and other publishers should take note

I have been invited by the UK Intellectual Property Office to collect information and produce a reply to the Hargreaves report on copyright reform. The particular area that Ben Hawes (IPO) and I agreed on is "text-mining" [I shall refine this term later]. We are doing this under the aegis of the Open Knowledge Foundation and with the help of their software. However it is not appropriate for the OKF, as a partner in the UK Government Transparency activity, to lobby for change so it will be an ad hoc group of identified individuals (perhaps under the label @ccess). It is probable, however, that the protocols we intend to develop will be part of the OKF activity, perhaps under the "Panton" brand.

Our group will represent that very serious harm is done to science and the use of science by the refusal to allow textmining. We shall be preparing our material completely in the open, coordinated on http://lists.okfn.org/mailman/listinfo/open-access. Anyone can take part in the discussion and interested parties such as publishers are invited.

We shall argue that as from today all publishers know of our activity and have the opportunity to influence what we say. A major problem is that publishers make it extremely difficult for a reader to get a useful reply to any question on rights and practice. I know, however, that staff in all major publishers follow this blog. We shall concentrate on a small subset of high-profile publishers, probably limited to Wiley/Blackwell, Elsevier, Springer, Nature, AAAS (Science), PLoS, BMC and, because of my involvement in chemistry, ACS and RSC. Those organizations have the opportunity to make their views and practice known on the open-access list. Any private mails on this subject will be posted to the list.

The publishers argue, from their own surveys, that the scholarly community assert that publishers are extremely helpful over text-mining and agree to a large percentage of requests (data collated by Eefke Smit, STM publishers' association). Our group asserts the opposite – that publishers have been extremely unhelpful.

We shall also argue that the publishers "institutionally" oppose text-mining. In the UK we have a phrase, "institutional sexism/racism/ageism, etc.", which identifies practices and attitudes – whether conscious or not – that oppose fundamental rights (http://en.wikipedia.org/wiki/Institutional_racism). Thus the UK police have been described as "institutionally racist" and I assert that the scholarly publishing industry is "institutionally opposed" to text-mining. [If anyone has a better term please let me know]. The "glass ceiling" is a similar term. This is reflected in the large number of barriers, whether conscious or not, that publishers put in place or leave in place that effectively prevent text-mining. Institutionalism is defined as "the collective failure of an organisation to provide an appropriate and professional service to people" and I assert that the scholarly publishing industry is almost universally guilty of this for its READERS.

I will start by stating an unpleasant but true fact: many people no longer trust the scholarly publishing industry. There have been too many assertions of "we are doing everything we can", "I'll get back to you", "our marketing people will look at the problem" for us to trust that effective action will follow. This is "institutional" – I no longer care whether it's deliberate or unconscious, the effect is the same.

On Wednesday I talked with Alicia Wise, Elsevier's Director of "Universal Access". I put my concerns to her including the unacceptable manner in which Elsevier had treated me and I asserted my rights to text-mine scholarly content. [I intend to formalise these rights in the submission to Hargreaves]. It was an informal, unplanned conversation in the presence of other people and I shall not put words into her mouth. She agreed, I believe, to treat me with professional courtesy and to respond to my points in public. She said she would mail me yesterday (she hasn't) so I am assuming she will read this blog. I have her email and will email her.


If a publisher fails to take part in public discourse on text-mining and fails to comment on the principles and protocols we shall create on the list, we shall represent them to Hargreaves as "institutionally opposed to text-mining". If you wish to take part please make your contact details known on the list, not on this blog.

The response to Hargreaves will consist of a number of questions which (generally) require the response "YES" to be seen as helpful to the provision of text-mining. A typical one is:

  • "Do you agree that facts and data are uncopyrightable?"

The only answers are YES and "not-YES" (which will be labelled by us as "unhelpful"). The following are examples of "not-YES":


  • Failure to reply
  • Addition of conditions ("it depends on…")
  • "I don't have authority to answer this question". Sorry – that's institutionalism. It may not be YOUR personal fault, but it is your organization's fault
  • Promises to "get back to us" – you have two weeks max as we need a week to collate for Hargreaves. That's a fact. So start preparing now.
  • Asserting that OPEN-ACCESS should have approached person X rather than person Y.

Any publisher who is actually well-intentioned towards textmining should be trivially able to answer the questions in half an hour. Any publisher who has to worry about them is probably guilty of institutionalism.

This is the first of several posts. I shall next address our RIGHTS and what "information-mining" covers. I may then give further examples of my and my colleagues' experiences of publisher institutionalism.

The protocol that the list will create will be a draft of acceptable text-mining practice by readers, subscribers and publishers.

** PUBLISHERS and STM-PUBLISHERS ** your immediate action should be to register with the OPEN-ACCESS list and make known the identity of the persons who will answer questions for Hargreaves. That can be done today (It's a working day in most countries).



2 Responses to Our Protocol for Text-mining: Preamble and “Institutionalism”; Elsevier and other publishers should take note

  1. Tony Hirst says:

    I'm not sure how broad the remit of your response is expected to be, but I often wonder about how publishers can do more to help unlock the structural value contained within papers they publish, some of which is a direct result of their efforts. One of the things I've started pondering is the filter value associated with indices. I'm not sure if these are manually compiled, fully automated, bootstrapped by automation/textmining then passed over to human editorial control (maybe with additional text mining to complete the indexing) but there is value in an index, I think, that can be used to help map knowledge structures across works and also support discovery tools that search within texts.

    To ground this, take the subject of business books. Indices mention companies (often with stylistic conventions such as bold font to make a semantic distinction that this is a company, for example), executives' names (maybe italicised), and key terms ("innovation", "data driven", whatever). And page numbers of course (sometimes with some pages emphasised). Getting access to indices as data allows the construction of graphs within and across a work that provide new ways of working with the knowledge contained in those works. But can we get hold of indices? Not that I know of. (Note the separate issue about whether indices should be freely available, or available as paid-for works in their own right.) What I think I'm arguing for here is that the publishers do not seem to be exploring ways of extracting more value from the works that may benefit readers/cultural development and may usefully feed back into funded research that pays for the work that gets written up in the books the publishers sell. By making works available "as (structured) data", at least they wouldn't prevent others from trying to exploit the structural value of their publications. (Hmmm.... I'm reminded here about researchers who won't release data sets because they don't want others to be able to make discoveries from the data before them.....)

    See also: http://blog.ouseful.info/2011/12/08/whats-inside-a-book/

    • pm286 says:

      Many thanks Tony,
      You are absolutely right. Indexes are valuable. There are two ways (and possibly hybrid) to create them:

      Use expert humans
      Use expert machines

      I shall blog on this, but in brief.

      Humans are not perfect and nor are machines. In chemical tasks (e.g. interpreting names) machines are far superior and cost almost zero.

      I can do this better than Elsevier for many types of chemical content and for zero cash. Elsevier expressly forbid subscribers in Cambridge to index any content for whatever means. Alicia Wise wants to talk to me about this. Maybe she will let me index everything in Elsevier. I will keep you posted.

      There is no need for PUBLISHERS to create indexes. I can and have done this where I am able to avoid being sued. See http://wwmm.ch.cam.ac.uk/crystaleye for 200,000+ crystal structures indexed by technology far in advance of what any publisher has.

      I asked Elsevier for permission to do this and they said their database section wouldn't let me.

      The problem is that it's disruptive technology. History tells me that it's a fight not a negotiation. But I shall play it straight.
