In our current search for factual information from publishers on permission for “text-mining” [some requests only went out on Saturday], the position is:
- Elsevier. Permission granted in principle for PM-R. [PMR and the community are now gearing up to extract factual chemistry from Elsevier journals. The first step will be to create a complete index of all content (e.g. in Open Biblio/BibSoup) and then decide on strategy. The top driver is Mat Todd’s need to find antimalarial compounds, so we’ll look in chemistry journals first.]
- Wiley. Request [2012-03-07] to Bob Campbell transferred to Duncan Campbell.
- Nature. Request to Philip Campbell [2012-03-10] transferred to appropriate department.
- American Chemical Society. No reply yet.
- Springer. Request [2012-03-10] transferred internally; significant and useful response received [next mail].
- Royal Society of Chemistry. [2012-03-10] Significant response from Richard Kidd. See below. Note that Richard has often commented on this blog.
Thanks for your request. It’s good to see from this and the accompanying blog post you still have some positive memories of text mining with some publishers. So far, we have mainly supplied articles for academic text mining purposes as one-off deliveries – such as for the SESL project, and the 50,000+ articles we supplied to both the ChETA and TREC Chem projects. Often it is easier for miners to bulk load within their own systems than crawling to collect, but we recognise that times are changing.
We ask you talk to your librarian colleagues, both in terms of them being happy with what you’re doing under the agreed licenses with RSC, and so they understand what ongoing value the results of any mining exercise derives from the RSC subscription.
This ongoing value issue is important in terms of text mining implications for us. Along with most publishers we supply counter stats to librarians of usage within their institution – and, as you know, when renewal times comes these are used to judge which journals are of most value. Our concern is if the mining extracts and republishes sufficient content from the publications as to reduce apparent usage (and citation) of the published papers in future. At the moment full text downloads are the major measure we have (rightly or wrongly in principle) for the librarian to judge if publications are of value to the institution, and republication of extracted facts and data at least potentially could affect this. Done right, the effect can be positive, but it could also be detrimental.
Some of Cameron’s suggested principles of research data mining would have been a valuable addition to your proposed non-negotiables, to reduce concerns that future derived would reduce usage of the original papers by your institution and others:
* Always link back to the version of record of the research output you have mined.
* Include elements and snippets by reference, not by value. Restrict content replication to that reasonably allowed by Fair Use provisions or enabled by licences, and required for efficient services
* Only redistribute content where copyright terms explicitly allow it
* Respect API service limits where posted and develop polite tooling with exponential back-off where appropriate
(a couple of principles deleted, due to non-relevance to this specific question rather than disagreement)
Finally, a correction. You say we cut off access a few years ago. My recollection is slightly different and I have the correspondence if you’d like to see it, from 2006. We didn’t cut you off, though we suggested we would block one IP address if the downloading continued without any contact. We discussed it amicably – explanation made it clear and the download behaviour was modified for both sides to be happy with continuation. But it’s an excellent illustration of why we appreciate being asked about the approach – as in this case the downloader was trying to retrieve non-existent issues, filling our developers’ mailboxes with 404 alerts. So while you think we’re only concerned about server load with on-demand mining, you can end up killing other systems we have to improve customer service. Mike Taylor clearly values publishers who try to stay on top of broken links
I would also ask that you include our response verbatim if you are using it in any of your Hargreaves submissions, and of course we will be preparing our own submission.
In summary, we would strongly appreciate discussion on the extent of the factual information you intend to republish (I have seen the examples on the blog), together with the involvement of your librarian colleagues in the process – for current agreements, and effects on future usage and value measures.
This is a useful response. It doesn’t, however, give me permission to text-mine RSC journals. It suggests I contact my librarian. I have done so at regular intervals. I think they recognise I don’t need technical help from them; I am simply alerting them to what I am doing.
Text-mining distorts the publisher metrics on value? Surely that can be overcome technically. If that’s the only problem, let’s create a dark cache and I’ll play in the sandbox. This is the sort of thing where, with goodwill on both sides, a solution is straightforward.
Is it progress? Difficult to say – it’s no good to me or Mat Todd as it doesn’t advance my current ability to mine the RSC literature.
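On the technical side, the principle quoted above about “polite tooling with exponential back-off”, and Richard’s point about requests for non-existent issues generating 404 alerts, could be sketched roughly as follows. This is only an illustration under my own assumptions, not anything RSC or any other publisher has specified: the names `backoff_delays` and `polite_fetch`, the injected `fetch` callable, the set of retryable status codes and the delay values are all hypothetical.

```python
import time
from typing import Callable, List, Optional, Tuple

# Assumption: these statuses mean "busy, try again later".
# A 404 is deliberately absent: a missing issue is never re-requested.
RETRYABLE = {429, 500, 502, 503, 504}


def backoff_delays(base: float = 1.0, factor: float = 2.0,
                   cap: float = 60.0, retries: int = 5) -> List[float]:
    """Waits in seconds before each retry: base, base*factor, ..., capped at cap."""
    return [min(cap, base * factor ** i) for i in range(retries)]


def polite_fetch(url: str,
                 fetch: Callable[[str], Tuple[int, Optional[str]]],
                 sleep: Callable[[float], None] = time.sleep,
                 retries: int = 5,
                 base: float = 1.0) -> Optional[str]:
    """Fetch url via the caller-supplied fetch(url) -> (status, body).

    Returns the body on 200. Returns None at once on any non-retryable
    status (such as 404), or after the retries are used up.
    """
    delays = backoff_delays(base=base, retries=retries)
    for attempt in range(retries + 1):
        status, body = fetch(url)
        if status == 200:
            return body
        if status not in RETRYABLE or attempt == retries:
            return None  # give up without hammering the server
        sleep(delays[attempt])  # wait longer after each failure
    return None
```

The `fetch` argument is passed in so that the retry logic can be exercised without touching anyone’s server; in real use it would wrap whatever HTTP client and access route the publisher has agreed to.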