Text-mining the scholarly literature: towards a set of universal Principles; Update and strategy

Posted on April 25, 2012 by pm286

For some years I have seen the primary literature as an enormous untapped resource of scholarly information. We humans are very good at some aspects of “reading the literature” but there are many areas where machines are better and should be used. These include scale (hundreds of thousands of manuscripts), checking, validation, transformation (e.g. scientific units), deduction (many papers have implicit semantics), aggregation of knowledge, and much more. We are now reaching the time when the technology of “text-mining” is mature enough to deploy and, for example, my group and I have developed some of the best tools in the world for mining chemistry. I am now expanding that to other fields, which I will describe in later posts.

In general the readers of the scholarly literature (who may include the #scholarlypoor) have been seriously frustrated by the restrictions imposed by publishers and universally agreed by librarians. Most subscriptions to most major journals have terms forbidding readers to mine/crawl/index/extract etc. This is not a consequence of copyright – it is an additional restriction imposed by publishers and apparently automatically assented to by academic purchasing systems (mainly libraries). This automatic assent has done scholarship a grave disservice, so I give the library community a chance to correct the historical record:

Has any library ever publicly challenged the terms of use [on mining] set by publishers? I haven’t seen any. But I’d be grateful to know public cases, and what happened. My current view is that publishers set conditions and that libraries accept them verbatim, which, unfortunately, means that they don’t have a track record of fighting for text-mining or other freedoms.

Moving on, the UK Hargreaves report has recommended removing these restrictions (which are not legally required) and also modifying copyright law. My grapevine suggests there is a high probability that significant changes will be made and that “text-mining” will become widely available without requiring explicit permission. We should prepare for this, and any responsible publisher and library/purchaser should be preparing for this.

A month ago I and colleagues in OKF submitted cases to the Hargreaves process. As part of that I asked 6 major publishers whether I could “text-mine” their journals. Naomi Lillie of OKF is summarising the results and I will keep you in suspense till then. It’s fair to say some were helpful, some were not and some were fuzzy (for whatever motivation).

A number of publishers said we should discuss it with the library. There is no need for this. My group and I can text-mine material by ourselves – in one week Daniel Lowe extracted 500,000 chemical reactions from the US Patent Office without needing any help. Nick Day has built PubCrawler and extracted 200,000 crystal structures from supplemental information without any help. The only thing I need is:

An assurance I won’t be sued for behaving like a responsible scholar
An assurance that my institution won’t get cut off for (my) responsible behaviour

In case anyone in the publishing or library communities doesn’t understand what “responsible” means, it means:

I do not intend deliberately to re-publish the publishers’ manuscripts (“the PDF”) in bulk without valid scholarly reason.

I am a responsible scholar. I conform to health and safety. I obey the law of the UK. I do not steal. I can justify the expenditures on my grants. I attempt to value and promote human equality in my scholarship. I try to give credit where it is due. Responsible scholarship is a fundamental principle which I believe applies to almost all readers of the scholarly literature. Occasionally I and others fail – there are ample mechanisms for addressing these without forbidding textmining.

So this post asserts my absolute right as a subscriber to the scholarly literature to carry out textmining and to disseminate the results to anyone. I do not need any other permissions.

A number of details follow which I’ll address in later posts.

At present, therefore, a group of us – under the aegis of the Open Knowledge Foundation – is drafting a set of principles for textmining. The group includes:

Heather Piwowar. Heather has written several blogposts (http://researchremix.wordpress.com/ ) about text-mining. They include negotiations with Elsevier (which include the need for Elsevier and librarians to give her permission) and more recently a manifesto (http://researchremix.wordpress.com/2012/04/20/new-fron/ ).
Maximilian Haussler. See (/pmr/2012/03/09/textmining-update-max-haussler%E2%80%99s-questions-to-publishers-they-have-a-duty-to-reply/ ). Max was quoted 85,000 USD by NPG to mine their content (I think this has been altered to 0?). He and colleagues have fought for the right and he has submitted a detailed case to the US government.
Diane Cabell and Jenny Molloy, OKF. Diane is a specialist in intellectual property law and has helped to craft the OKF open-science response to Hargreaves.
Ross Mounce. Panton fellow (http://about.me/rossmounce ). Ross has created a superb and damning summary of publishers’ distortion of the term “Open Access” in paid hybrid journals. Ross and I are now working on the technology and strategy of textmining.

We shall come up with a manifesto/set-of-principles. This will be a statement of our rights and our responsibilities. It is not a negotiation, any more than Tom Paine or the Founding Fathers negotiated in the construction of their declarations. Or, more recently, the BBB declarations of Open Access. Those declarations are priceless – it’s just a pity that there are not enough who believe in them enough to push for their universal acceptance. We shall not make the same mistake with the principles of textmining.


25 Responses to Text-mining the scholarly literature: towards a set of universal Principles; Update and strategy

Neil Stewart says:

April 25, 2012 at 3:21 pm

I’m not sure it’s fair to blame “librarians” for the lack of text-mining rights. Surely since text mining is a relatively new method of looking at the scholarly literature, the obstacles that Elsevier et al are putting in place are things that librarians need to start to look at as and when they re-negotiate contracts?
Librarians have been banging on about open access for years with little response from academics in general, but when one new method of leveraging open access is identified, they immediately get the blame for not having foreseen the lack of text-mining as being a potential issue.

- pm286 says:
  
  April 25, 2012 at 3:38 pm
  
  Thanks for commenting.
  This is nothing to do with “Open Access”. It’s to do with the rights of subscribers. When I buy a book I have certain rights, covered by copyright law. When my library subscribes to an electronic journal they agree to restrict my use far beyond the law. The contract forbids me to do almost anything.
  I am addressing librarians because – as I understand it – they negotiate the deals with the publishers. Those deals contain contracts proposed by publishers. They should read them and challenge them – as far as I know they don’t do both.
  I have no idea when these clauses came into the contracts but I can use UK-FOI to find out and I’ll find out whether any university actually challenged the publishers.
  This is not a “new method for leveraging open access”. It is the use of machines to extract information from documents. Text-mining is 20 years old. Google has been doing it for over 10 years. So have I. It’s not new. Anyone who understood Google 10 years ago would have seen the potential for automated information extraction.
  And in general if I get a contract that says I cannot do X, Y and Z I ask myself and others what I am signing away. Universities spend 10,000,000,000 USD annually on scholarly pubs, on our behalf, and that money is not getting good value.
  
  - Neil Stewart says:
    
    April 26, 2012 at 8:43 am
    
    Thanks Peter- I concede your corrections!
    The issue of value for money for journal subscriptions is also an issue librarians have been going on about for a while- and I think there is a new mood of openness with regard to this- see for example Harvard’s recent announcement about unsustainable journal subs costs. I think text-mining rights will be something that could and will be asked for in this new context.
    
    - pm286 says:
      
      April 26, 2012 at 9:31 am
      
      >>Thanks Peter- I concede your corrections!
      Thanks and graciously accepted. I get a fair amount of flak, but partly because I am one of the very few prepared to speak openly.
      >>The issue of value for money for journal subscriptions is also an issue librarians have been going on about for a while- and I think there is a new mood of openness with regard to this- see for example Harvard’s recent announcement about unsustainable journal subs costs. I think text-mining rights will be something that could and will be asked for in this new context.
      Value for money is important, and one of the skills of the publishers has been to divide and conquer. We are often told we are getting a “good negotiated deal”: a deal where the value of the product is unquantifiable, where the vendor demands huge amounts and a reduction is seen as a victory, whereas it’s still a huge deal.
      Price is not everything. I see little evidence publicly, however, that anything else matters. Indeed I think contractual rights have been forfeited to get a few more percent of the – completely artificial – “price”
      
Dan says:

April 25, 2012 at 10:55 pm

Peter, you are being silly. Librarians are powerless in the face of faculty who publish in closed-source journals to which these very faculty insist their library subscribe. Have you seen the latest from Harvard on this?

- pm286 says:
  
  April 26, 2012 at 8:54 am
  
  Thanks for the comment.
  The answers so far seem to suggest that librarians are powerless.
  They are not powerless to raise the issue. As far as I know this has not been identified as an issue. It’s nothing to do with closed access.
  From the comments so far I take:
  “faculty demand closed access because of the prestige”
  “the closed access publishers have bundled in additional terms restricting use”
  “librarians must go along with what publishers demand because otherwise faculty will be angry”
  If librarians do not challenge the terms and conditions then maybe we should shift the purchasing elsewhere. I’m in favour of a national purchasing office – as I believe they have in Brazil. But in any case there should be tough public negotiations – much of it is ultimately public money.
  
  - Stacy says:
    
    April 26, 2012 at 7:15 pm
    
    “But in any case there should be tough public negotiations – much of it is ultimately public money.”
    Could not agree more!
    “The answers so far seem to suggest that librarians are powerless.”
    I think that librarians, who are working in the service of researchers such as yourself and are worried as to what might happen if we can’t get you the access to journals that you require, more often than not do _feel_ powerless. Unless faculty is on our side, no one is listening (except for other librarians)! You’re absolutely right that we need to start making a stand–but we also need to know that researchers such as yourself, Heather, and everyone else who needs text mining access will be on our side, when we take that stand.
    It’s all about gaining critical mass. And about faculty getting rowdy, so we can show the library administration (which in most institutions tends to err on the side of conservatism)–not to mention publishers–that this is a pressing issue worth standing up for.
    
Stacy says:

April 26, 2012 at 12:26 am

“I am addressing librarians because – as I understand it – they negotiate the deals with the publishers. Those deals contain contracts proposed by publishers. They should read them and challenge them – as far as I know they don’t do both.”
I’m going to have to disagree with you here. Librarians certainly do read the terms, but they’re often unable to challenge them because they’ve been told that it will void the bundle deals that they’ve worked out with publishers–to allow for access to many journals for a fraction of the price. Text mining is definitely a new thing for many librarians, but by and large they are interested in supporting that need when researchers approach them.
The problem is, if only one or two researchers at an institution with hundreds of faculty are interested, they’re sometimes reluctant to go to the trouble and possible expense of demanding that all publishers change their licenses. That’s not to say that they don’t challenge publishers, even when only one or two researchers have shown a need–in my experience, they’ve been all too happy to help those researchers gain the access they need by entering into conversations with publishers to try to change the terms of the licenses.
“A number of publishers said we should discuss it with the library. There is no need for this.”
There is a need to discuss it with the library, if only because we’re the folks who pay for journal subscriptions and negotiate licenses on behalf of you and the rest of the university. Publishers are absolutely right to tell you to talk to librarians, so they have an understanding of your need, and can take that into account when renegotiating licenses. Librarians do not in any way shape or form wish to entrench themselves in your research; they only need to be in the loop as to your need.
Researchers need to actually approach librarians in order to demonstrate there’s an actual need. Our hands are often tied by forces you don’t know about; better to engage and understand, than to assume and continue with these researcher/librarian silos. It’s unproductive to point fingers.

- pm286 says:
  
  April 26, 2012 at 9:26 am
  
  See my reply to @Dan for general points
  >>Librarians certainly do read the terms, but they’re often unable to challenge them because they’ve been told that it will void the bundle deals that they’ve worked out with publishers–to allow for access to many journals for a fraction of the price. Text mining is definitely a new thing for many librarians, but by and large they are interested in supporting that need when researchers approach them.
  I don’t see why “unable to challenge them”. I have been involved in purchasing equipment, software. I certainly challenge terms. I appreciate that there is a pseudo-monopoly here, but I don’t buy the fact that the only thing that matters is price. What you are buying is as important as how much you pay for it.
  >>There is a need to discuss it with the library, if only because we’re the folks who pay for journal subscriptions and negotiate licenses on behalf of you and the rest of the university. Publishers are absolutely right to tell you to talk to librarians, so they have an understanding of your need, and can take that into account when renegotiating licenses. Librarians do not in any way shape or form wish to entrench themselves in your research; they only need to be in the loop as to your need.
  This is the root of the problem. Librarians go along with what publishers offer, except on price. My need is simple. I simply want my machines to read the literature that my institution and my public purse has paid for. I am completely competent to do this technically – it is simply the contractual restrictions that are holding me back.
  I believe I have a right to mine the literature – you believe that this is something that should be negotiated. Every negotiation will remove my rights. The publishers are already saying that I need to explain my research to them to get permission. And the librarians will be involved in negotiating what I can and what I can’t do. This has already happened at UBC.
  This is not a negotiation. Billions have already been negotiated away. It’s only because funders are fighting that we are getting anywhere.
  >>Publishers are absolutely right to tell you to talk to librarians, so they have an understanding of your need, and can take that into account when renegotiating licenses. Librarians do not in any way shape or form wish to entrench themselves in your research; they only need to be in the loop as to your need.
  The only “renegotiation” needed is to remove the restrictive clauses. And I expect the Hargreaves process to do that for us.
  >>Researchers need to actually approach librarians in order to demonstrate there’s an actual need.
  I have demonstrated a need for 5 years.
  >>Our hands are often tied by forces you don’t know about; better to engage and understand, than to assume and continue with these researcher/librarian silos.
  So there is a layer of secrecy holding back the process? OK, I’ll be using FOI to find out everything I have a right to. Secrecy is a very bad tool for progressive policy.
  
Ian Gibson says:

April 26, 2012 at 12:28 pm

Peter, I completely agree that text mining should not be something we have to negotiate for but I do know why those clauses exist.
I am a science librarian at a medium sized Canadian university. We have on multiple occasions since I have been here had our access to various resources blocked because of mass downloading. Unfortunately, the people doing the mass downloading were not textmining out facts but students (mostly international) creating vast stores of papers which they were sharing with unknown persons back home. Now they could have been sharing this stuff with local charities or other doers of good deeds (or perhaps other researchers) but given the majors of the folks involved I suspect that instead it was being used to help the East Asian electronics industry and the oil and gas industry in South Asia. Knowing this I am never surprised when I see the no robots, spiders, search algorithms, etc. clause in a contract from e.g. Elsevier.
The problem from the paywall’s perspective of course is how do you sort out the legitimate textminers from the mass downloader/”pirate” types?
I think going forward you will see more institutions insist on the ability to textmine but I’m not so convinced that the publishers will be eager to go along.
One last point – above you talk about a national purchasing office. In Canada we have a national consortium, CRKN, that negotiates a lot of our licenses to e-resources on behalf of its 75 members. It is not the panacea that some folks expected it to be.

- pm286 says:
  
  April 26, 2012 at 1:46 pm
  
  Thanks very much Ian, very helpful
  >>Peter, I completely agree that text mining should not be something we have to negotiate for but I do know why those clauses exist.
  I understand why as well. I am interested in when they happened.
  >>I am a science librarian at a medium sized Canadian university. We have on multiple occasions since I have been here had our access to various resources blocked because of mass downloading. Unfortunately, the people doing the mass downloading were not textmining out facts but students (mostly international) creating vast stores of papers which they were sharing with unknown persons back home. Now they could have been sharing this stuff with local charities or other doers of good deeds (or perhaps other researchers) but given the majors of the folks involved I suspect that instead it was being used to help the East Asian electronics industry and the oil and gas industry in South Asia. Knowing this I am never surprised when I see the no robots, spiders, search algorithms, etc. clause in a contract from e.g. Elsevier.
  I fully accept that this mass downloading happens. I am not supporting these actions. Any determined group of people could do this without triggering alarms. However they are preventable by other means, both legal and technical.
  It is, of course, symptomatic of the “piracy” mentality where everyone colludes in preventing new developments on the basis of protecting the present. In the current case of scholarly pubs which are created by us, not publishers, it is doubly unsatisfactory.
  >>The problem from the paywall’s perspective of course is how do you sort out the legitimate textminers from the mass downloader/”pirate” types?
  I can make suggestions. I would have no personal objection to registering *with my university* for being able to access bulk papers, any more than I object to registering for permission to use controlled substances. But it should be done through a single portal and not through the plethora of publisher-hackups.
  >>I think going forward you will see more institutions insist on the ability to textmine but I’m not so convinced that the publishers will be eager to go along.
  I wouldn’t disagree. I just hope that progress is faster than apparent at the moment, especially at the library end. They control the legal apparatus and at the moment nothing is happening.
  >>One last point – above you talk about a national purchasing office. In Canada we have a national consortium, CRKN, that negotiates a lot of our licenses to e-resources on behalf of its 75 members. It is not the panacea that some folks expected it to be.
  There are no panaceas. But I suspect it can’t be worse than our 200+ scattered institutions. How much person-time and inefficiency does this cost?
  
  - Ian Gibson says:
    
    April 26, 2012 at 4:06 pm
    
    CRKN certainly imposes quite a bit of efficiency – there is one central negotiating team drawn from the different regions, and CRKN’s permanent staff has drawn up a model license which publishers are expected to comply with (their homepage may not be the most illuminating thing but it does give interested parties a bit of a flavour of what they do). The problems pretty much all stem from the fact that you have one org negotiating for schools ranging in size and financial clout from the massive University of Toronto to tiny Acadia University.
    Your suggestion to create a registry of local text miners would in theory not be difficult to accomplish – and with a registry it would probably be easy to assuage the publishers’ fears about misuse… This should be low hanging fruit providing the publishers will play ball.
    
    - pm286 says:
      
      April 26, 2012 at 4:53 pm
      
      >>Your suggestion to create a registry of local text miners would in theory not be difficult to accomplish – and with a registry it would probably be easy to assuage the publishers’ fears about misuse… This should be low hanging fruit providing the publishers will play ball.
      It’s very cost-effective. It could even be done through something like the British Library – so Universities didn’t have to manage their own portals. It would immediately answer the (spurious) argument that textmining will reduce publishers’ servers to smoking ruins. It would act as a cache. Any responsible subscriber would simply get permission to use this cache. It would be efficient as there would be good mechanisms. It would reduce the absurd plethora of inappropriate interfaces developed by publishers (look at the average publisher/journal home page and it’s stuffed with Flash adverts for how wonderful they are – difficult to scrape and download, whereas a simple RSS / Atom feed would be ideal).
      We’ll see.
      
Peter M says:

April 26, 2012 at 2:21 pm

The products of text mining have commercial value (to industrial R&D as well as to academics). I’m guessing the publishers know that and want to keep it for themselves. They might be prepared to license you to mine text in return for a higher subscription fee, but probably (almost certainly) have other plans.
If you’ve developed good text mining tools, publishers may be interested in working with you to develop commercial products. But maybe that would grate on you, given your appeals to fundamental principles and absolute right.
Don’t researchers have an absolute right to ignore the commercial publishers and publish in the open on the web right now? The extent to which they instead publish in ‘high impact’ journals that attract readers but do not offer open access is an indication of how much they value the prestige that comes with appearing in such journals. What fundamental principle is being demonstrated there?
Some may point to the iniquities of research assessment and the ‘publish or perish’ culture that supposedly drive researchers to the high impact journals in fear of otherwise losing their careers. But that only applies to researchers who choose to inhabit institutions that work that way. Don’t they have the absolute right to leave the universities and start their own private research institutes? What’s stopping them?

- pm286 says:
  
  April 26, 2012 at 4:46 pm
  
  >>The products of text mining have commercial value (to industrial R&D as well as to academics).
  Yes. I told Hargreaves that this was “low billions” in chemistry world wide.
  >>I’m guessing the publishers know that and want to keep it for themselves. They might be prepared to license you to mine text in return for a higher subscription fee, but probably (almost certainly) have other plans.
  They may well want to – and some do. But they have no absolute right – they have possession which has been gifted to them and that’s all. They don’t enhance the content, quite the reverse.
  They may have other plans. Publishers in general are very poor at handling any information other than text and I suspect they will lose this market. There’ll be court cases but ultimately they have a poor basis for “owning” content.
  >>If you’ve developed good text mining tools, publishers may be interested in working with you to develop commercial products. But maybe that would grate on you, given your appeals to fundamental principles and absolute right.
  I’m happy to work with anyone in principle as long as the outcome is fair. But “all your data are belong to us” is not acceptable.
  >>Don’t researchers have an absolute right to ignore the commercial publishers and publish in the open on the web right now? The extent to which they instead publish in ‘high impact’ journals that attract readers but do not offer open access is an indication of how much they value the prestige that comes with appearing in such journals. What fundamental principle is being demonstrated there?
  Prisoners’ dilemma? Everyone has to play this non-zero-sum game?
  >>Some may point to the iniquities of research assessment and the ‘publish or perish’ culture that supposedly drive researchers to the high impact journals in fear of otherwise losing their careers. But that only applies to researchers who choose to inhabit institutions that work that way. Don’t they have the absolute right to leave the universities and start their own private research institutes? What’s stopping them?
  Nothing except conservatism, elitism, public funding. I think it’s a good idea to start breaking away from Universities.
  
  - Peter M says:
    
    April 27, 2012 at 2:49 pm
    
    pmr wrote:
    > Publishers in general are very poor at handling any information
    > other than text and I suspect they will lose this market.
    I agree that they don’t appear to have the imagination necessary to make it themselves. There’s value to be created from the sci literature and they are using their monopoly position to prevent that happening. It is a legitimate function of govt to stop that kind of bad use of monopoly power. Good luck in your efforts to drive a change.
    
    - pm286 says:
      
      April 27, 2012 at 3:10 pm
      
      Thanks! Encouragement is always valuable.
      
Stacy says:

April 26, 2012 at 7:08 pm

“I believe I have a right to mine the literature – you believe that this is something that should be negotiated.”
I believe in your right, as do most other librarians. Publishers are the ones who don’t, and they’ve been setting the terms and writing the contracts before librarians even knew that text mining was A Thing. It’s pretty great that you, Heather, and other researchers are bringing the need for text mining capabilities to light, so we can go about standing up for your rights when we handle contract renegotiations. Which: like it or not, right now it’s a fact that we have to negotiate for (what should be inherent) researcher rights with publishers.
“So there is a layer of secrecy holding back the process? OK, I’ll be using FOI to find out everything I have a right to. Secrecy is a very bad tool for progressive policy.”
I think that Heather Piwowar’s work in negotiating with Elsevier and making those negotiations completely public is FANTASTIC. The problem is, as Heather has explained (and most librarians will confirm), a lot of publishers require that you keep contracts, negotiations, etc completely confidential. FWIW, I disagree with that wholeheartedly and think it’s a way that publishers keep librarians from being able to band together to demand more rights–to make a poor metaphor, it’s a way of union-busting. But, it’s a reality nonetheless. (One that sarcasm won’t solve, unfortunately. 🙂 ). I hope that Heather’s example will help librarians feel more empowered to demand transparency in the future. And I encourage you to connect with librarians at your institution to do something similar, so you can get the access you need as a stop gap measure until this whole “text mining rights” document that’s been floating about takes root.

Richard Kidd says:

April 27, 2012 at 1:53 pm

If you want to make it easier for publishers to say ‘yes’:
Work with your librarian – they’ll measure the publisher’s value for next year (and renew subscriptions, or not) based on full-text downloads. If you want to squirrel it away and work on an offline cache, and never download again, they need to know maintaining the subscription is important. Also, as others have pointed out, they’ll be the ones negotiating, whether at an institutional level or through the many consortia. They need to know it’s important (and relative demand).
Define the scope of republication – using text mining to improve scientific research is an obvious good thing (and one we’ve participated in with Peter’s group and others). Few publishers would object (I would hope) to Heather’s research, and publishing research results, snippets and links should be fine. If it’s another way for people to find the original research on our own sites, we’ll love it.
But I’ve asked this question about defining the scope of republication before, and clearly no-one wants to answer it. Republishing “all facts in a publication” can equal “the publication”, in my opinion. How the extracted information is used can act as another discovery tool, and it doesn’t necessarily kill the need for the original publication – but we don’t know that yet. Being coy about the ‘what’ and the ‘how much’ risks the perception that text mining rights are being demanded as a Trojan horse to destroy publishing business models some don’t approve of. I hope you don’t want people to misunderstand your intentions.
Without a definition or boundaries on the extent of what will be republished freely, unspecific text mining rights threaten any existing subscription-based collection. This is what most concerns publishers really – and is at the root of this “what are you going to do with it” question we always ask.
I hope standards in licensing and exchange formats/methods will build the bridges.

- pm286 says:
  
  April 27, 2012 at 3:06 pm
  
  Thanks,
  >>If you want to make it easier for publishers to say ‘yes’: Work with your librarian – they’ll measure the publisher’s value for next year (and renew subscriptions, or not) based on full-text downloads. If you want to squirrel it away and work on an offline cache, and never download again, they need to know maintaining the subscription is important. Also, as others have pointed out, they’ll be the ones negotiating, whether at an institutional level or through the many consortia. They need to know it’s important (and relative demand).
  I note this, though I doubt it will lead to effective working in most institutions – librarians have already admitted on this list that they hadn’t anticipated text-mining – there will now be a delay while they work out what the issues are.
  >>Define the scope of republication – using text mining to improve scientific research is an obvious good thing (and one we’ve participated in with Peter’s group and others). Few publishers would object (I would hope) to Heather’s research, and publishing research results, snippets, and links should be fine. If it’s another way for people to find the original research on our own sites, we’ll love it.
  I have no doubt that the next ten years will – unfortunately – see many discussions in this area. One of the benefits of Open Access is that the cost of negotiation is ZERO. Closed access leads to greatly increased discussion costs and wasted time and science.
  Yes – I have no doubt that our ability to extract facts from your publications would give you more hits. As you know, Daniel Lowe extracted 500,000 reactions from USPTO and claims 95% accuracy for key components. I cannot imagine why you don’t give us permission to mine and publish that. You don’t even have a database revenue like CAS or Reaxys. But I have 100 publishers to worry about, and if it takes me a day to agree/disagree with each, that’s half a year’s work gone.
  >>But I’ve asked this question about defining the scope of republication before, and clearly no-one wants to answer it.
  I’m perfectly happy to comment. There cannot be a simple answer because closed access is necessarily highly detailed and complex. Are we allowed to extract chemical diagrams? I have no idea what your view is, what ACS’s view is, what Tetrahedron’s is, etc. It will take me months to find out if I can mine chemical diagrams UNLESS each publisher gives a clear, unequivocal answer. That’s a fact, regardless of one’s position.
  >> Republishing “all facts in a publication” can equal “the publication”, in my opinion.
  Sometimes yes, sometimes no. So on every question I have to go to you to find out what I can do. Can I mine a spectrum? Technically I can, but is it a creative work? 100 publishers, so 100 opinions. And if you doubt this figure, look at the plethora of licences (and pseudo-licences) for hybrid gold access that publishers have created. Some are frankly un-operational. We’ll get the same fuzziness in text-mining.
  >>How the extracted information is used can act as another discovery tool, and it doesn’t necessarily kill the need for the original publication – but we don’t know that yet. Being coy about the ‘what’ and the ‘how much’ risks the perception that text mining rights are being demanded as a Trojan horse to destroy publishing business models some don’t approve of. I hope you don’t want people to misunderstand your intentions.
  I will put it very simply:
  I WANT TO DO SCIENCE
  I have no interest in destroying publishing models.
  I want to build an artificially intelligent chemical/scientific system that can use the current literature to make new discoveries.
  Is that clear enough?
  At present the publishing models in chemistry prevent me doing anything without endless fruitless discussion. So I am moving to other fields where I have more chance.
  >>Without a definition or boundaries on the extent of what will be republished freely, unspecific text mining rights threaten any existing subscription-based collection. This is what most concerns publishers really – and is at the root of this “what are you going to do with it” question we always ask.
  I’ve told you what I would do if you let me. I am going to extract facts so we can do chemistry based on immediate global knowledge. I want to contribute this to Richard Whitby’s program to create better chemical synthesis. I can’t. I want to help Mat Todd search the literature for better antimalarials. I can’t. I want to use machines to interpret spectra better and check the quality of published science. I can’t. I want to find out the best strategies for chemical synthesis. I can’t. I have been asked to find second harmonic generation in published spectra. I can’t.
  Is that detailed enough? Either say I can do this without further discussion, or I assume I can’t.
  >>I hope standards in licensing and exchange formats/methods will build the bridges
  The publishing community so far has shown little interest in licensing standards. See Ross Mounce’s list of hybrids. A considerable number of the publishers have made up their own conditions.
  
  - Richard Kidd says:
    
    April 27, 2012 at 3:45 pm
    
    Standards – as I’ve said before, a cross publisher text mining licensing standard is in preparation, and formats/access also will be looked at.
    Another misunderstanding. What – scientifically – miners want to do isn’t the problem/concern (and I’m talking from what I think a general publishers’ point of view is, not RSC or your own text mining in this case). It is – baldly – what will be published as a freely accessible database and how that will affect future usage (and citation, let’s not forget that) of the original publication.
    
    - pm286 says:
      
      April 27, 2012 at 4:39 pm
      
      >>Standards – as I’ve said before, a cross publisher text mining licensing standard is in preparation, and formats/access also will be looked at.
      What efforts have been made to take this to the wider community? The last publisher to create a standard – NPG with OTMI – wasn’t very successful.
      >>Another misunderstanding. What – scientifically – miners want to do isn’t the problem/concern (and I’m talking from what I think a general publishers’ point of view is, not RSC or your own text mining in this case). It is – baldly – what will be published as a freely accessible database and how that will affect future usage (and citation, let’s not forget that) of the original publication.
      So I can do the research as long as I don’t publish the data?
      The picture – which inevitably is building up – is that publishers don’t trust researchers to do text-mining research responsibly. I don’t have to get permission from instrument manufacturers or chemical suppliers to do experiments. They both run the risk that scientists will build on those products to create other tools which may imbalance the market – that’s the way the world evolves.
      But – in essence – unless I can guarantee that my research won’t harm your business model, you won’t let me do it. I think that’s a fair summary. The flip-side is that my research might actually HELP your business model. By treating me as a problem rather than an opportunity we both lose – you can alter that situation – I can’t.
      There is no possible way that extracting factual data out of RSC journals could harm your business. That may not be true for publishers who sell data in large amounts; are you more concerned about their viability than doing science?
      
      - Richard says:
        
        May 2, 2012 at 8:28 am
        
        Once again, my contribution to this thread was meant to convey the concerns I hear when talking to a variety of publishers about the issue, and as a help to future discussions. I’m not trying to reply about the RSC’s position, I’ve jumped through that hoop already.
        But stakeholders matter, and the publishers are a stakeholder. Listen to what other stakeholders’ concerns are, and you will avoid some fruitless discussions and misunderstandings.
        And I couldn’t have put it better myself than: “By treating me as a problem rather than an opportunity we both lose”. Indeed.
      - pm286 says:
        
        May 2, 2012 at 8:40 am
        
        >>>Once again, my contribution to this thread was meant to convey the concerns I hear when talking to a variety of publishers about the issue, and as a help to future discussions.
        And it is appreciated as such. You are one of the very very few people in closed access #scholpub who engage in useful discussion.
        >>I’m not trying to reply about the RSC’s position, I’ve jumped through that hoop already.
        I know. It would help if the RSC *had* a position other than silence.
        >>>But stakeholders matter, and the publishers are a stakeholder. Listen to what other stakeholders’ concerns are, and you will avoid some fruitless discussions and misunderstandings.
        I do listen. I’ve read Eefke Smit’s report. I have taken the trouble to contact publishers systematically. They are extremely bad at, or reluctant to become involved in, constructive debate. And remember that we cannot assume that the publishers’ position is where we start from in this new venture.