Permission for information-mining : Update and response from Royal Society of Chemistry

In our current search for factual information from publishers on permission for “text-mining” [some requests only went out on Saturday], the position is:

  • Elsevier. Permission granted in principle for PM-R. [PMR and the community are now gearing up to extract factual chemistry from Elsevier journals. The first step will be to create a complete index of all content (e.g. in Open biblio/Bibsoup) and then decide on strategy. The top driver is Mat Todd’s need to find antimalarial compounds – so we’ll look in chemistry journals first.]
  • Wiley. Request [2012-03-07] to Bob Campbell transferred to Duncan Campbell.
  • Nature. Request to Philip Campbell [2012-03-10] transferred to appropriate department.
  • American Chemical Society. No reply yet.
  • Springer. Request [2012-03-10] transferred internally and significant useful response [next mail]
  • Royal Society of Chemistry. [2012-03-10] Significant response from Richard Kidd. See below. Note that Richard has often commented on this blog.

    Dear Peter 

    Thanks for your request. It’s good to see from this and the accompanying blog post you still have some positive memories of text mining with some publishers. So far, we have mainly supplied articles for academic text mining purposes as one-off deliveries – such as for the SESL project, and the 50,000+ articles we supplied to both the ChETA and TREC Chem projects. Often it is easier for miners to bulk load within their own systems than crawling to collect, but we recognise that times are changing.

    We ask you talk to your librarian colleagues, both in terms of them being happy with what you’re doing under the agreed licenses with RSC, and so they understand what ongoing value the results of any mining exercise derives from the RSC subscription.  

    This ongoing value issue is important in terms of text mining implications for us. Along with most publishers we supply counter stats to librarians of usage within their institution – and, as you know, when renewal times comes these are used to judge which journals are of most value. Our concern is if the mining extracts and republishes sufficient content from the publications as to reduce apparent usage (and citation) of the published papers in future. At the moment full text downloads are the major measure we have (rightly or wrongly in principle) for the librarian to judge if publications are of value to the institution, and republication of extracted facts and data at least potentially could affect this. Done right, the effect can be positive, but it could also be detrimental.

    Some of Cameron’s suggested principles of research data mining would have been a valuable addition to your proposed non-negotiables, to reduce concerns that future derived would reduce usage of the original papers by your institution and others:

    * Always link back to the version of record of the research output you have mined.

    * Include elements and snippets by reference, not by value. Restrict content replication to that reasonably allowed by Fair Use provisions or enabled by licences, and required for efficient services

    * Only redistribute content where copyright terms explicitly allow it

    * Respect API service limits where posted and develop polite tooling with exponential back-off where appropriate

    (a couple of principles deleted, due to non-relevance to this specific question rather than disagreement)

    Finally, a correction. You say we cut off access a  few years ago. My recollection is slightly different and I have the correspondence if you’d like to  see it, from 2006. We didn’t cut you off, though we suggested we would block one IP address if the downloading continued without any contact. We discussed it amicably – explanation made it clear and the download behaviour was modified for both sides to be happy with continuation. But it’s an excellent illustration of why we appreciate being asked about the approach – as in this case the downloader was trying to retrieve non-existent issues, filling our developers’ mailboxes with 404 alerts. So while you think we’re only concerned about server load with on-demand mining, you can end up killing other systems we have to improve customer service. Mike Taylor clearly values publishers who try to stay on top of broken links 😉

    I would also ask that you include our response verbatim if you are using it in any of your Hargreaves submissions, and of course we will be preparing our own submission. 

    In summary, we would strongly appreciate discussion on the extent of the factual information you intend to republish (I have seen the examples on the blog), together with the involvement of your librarian colleagues in the process – for current agreements, and effects on future usage and value measures.

    Best wishes



This is a useful response. It doesn’t, however, actually give me permission to text-mine RSC content. It suggests I contact my librarian. I have done so at regular intervals – I think they recognise I don’t need technical help from them – I am simply alerting them to what I am doing.

Text-mining distorts the publisher metrics on value? Surely that can be overcome technically. If that’s the only problem, let’s create a dark cache and I’ll play in the sandbox. This is the sort of thing where, with goodwill on both sides, a solution is straightforward.
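
On the same technical note, the “polite tooling with exponential back-off” that Richard’s reply asks for can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not anything RSC or any other publisher has specified: the names `backoff_delays` and `polite_fetch`, the default delays, the choice to retry only on HTTP 429/503, and the `fetch` callable (a stand-in for whatever HTTP client is really used) are all hypothetical. It gives up immediately on a 404, which is the behaviour that would have avoided the non-existent-issue problem he describes.

```python
import random
import time


def backoff_delays(base=1.0, factor=2.0, cap=60.0, retries=5):
    """Seconds to wait before each retry: base, base*factor, ... up to cap."""
    return [min(cap, base * factor ** i) for i in range(retries)]


def polite_fetch(url, fetch, sleep=time.sleep, retries=5, base=1.0, jitter=0.0):
    """Fetch url via fetch(url) -> (status, body), backing off when asked to.

    Assumption: only 429 (too many requests) and 503 (service unavailable)
    are worth retrying. Anything else, including 404, is returned as a
    failure (None) straight away so missing pages are never hammered.
    """
    # One initial attempt plus `retries` retries; None marks the last attempt.
    for delay in backoff_delays(base=base, retries=retries) + [None]:
        status, body = fetch(url)
        if status == 200:
            return body
        if status not in (429, 503) or delay is None:
            return None
        # Optional random jitter spreads out clients that start together.
        sleep(delay + random.uniform(0, jitter))
    return None
```

With the defaults the waits double from 1 second up to 16 seconds and then the fetch gives up. Passing `fetch` and `sleep` in as arguments keeps the sketch testable without touching a real server.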

Is it progress? Difficult to say – it’s no good to me or Mat Todd as it doesn’t advance my current ability to mine the RSC literature.


2 Responses to Permission for information-mining : Update and response from Royal Society of Chemistry

  1. Nick Barnes says:

    So it’s a Campbell family conspiracy, then?

  2. Alicia Wise says:

    Hi Peter,
    Elsevier, like many other publishers, has technical measures in place on our websites to limit crawling and these would prevent you from taking the technical approach you have outlined. We have these measures in place because web crawling, while ubiquitous, is generally not the most efficient method of harvesting large quantities of content from sites such as ours (you are experienced in this area, but other interested readers might refer for example to Nelson et al., Efficient, Automatic Web Resource Harvesting). We are keen to provide the right data to you in the right format in a way that doesn’t impair the experience of our other users and that is scalable. Recognising that you, and other users, need programmatic access to the content, we have invested in recent years in services specifically designed to support machine-to-machine access, including a comprehensive suite of APIs and a content syndication service. We would be more than happy to work with you to explain how these services work and how we believe they can support your needs. We have existing agreements that cover the use of these services. If these are not right for you, then we are happy to work with you to develop an agreement tailored for your project.
    With kind wishes,
