Towards a manifesto on Open Mining of scholarship

Tomorrow a small group of people interested in "textmining" will have a Skype meeting under the auspices of the OKFN. We have sort-of-pushed this agenda for some years and now it's come to fruition – there is clear public awareness of the value of textmining and the barriers that prevent it being used. Indeed my blog has even got mentioned in a financial analyst's review of Elsevier (the implication being that if Elsevier continues to drag its feet the market will react against it). Of course it's not just Elsevier, but they are the ones that have had most prominence. So this post is to prepare my mind and hopefully come out with some useful ideas.

There is no doubt that the lack of positive approaches to textmining is having huge costs:

  • Opportunity. We cannot do the things that we want to. Moreover this stifles the imagination of the rest of the community – without exciting examples of what can be done – and they *are* exciting – people do not realise what they are missing. And that's all of us, not just subscribers to journals.
  • In wasted time. Anyone wishing to do textmining has to spend huge amounts of time trying to get permissions, worrying about being taken to court, and simply waiting for null responses.
  • Bad science. Much published scientific data is flawed. Not necessarily deliberately, but by the outdated methods of publication. Almost no scientific data are reviewed (a few publishers like Int. Union of Crystallography are shining exceptions). And their tools have unearthed bad and fraudulent science. There is no reason to believe it is different elsewhere – in fact I suspect it's worse – the chance of getting caught is often near zero. Textmining is a major tool in data review.
  • Unexploited information and products. Google et al. have shown that there are huge new markets. There is undoubtedly a large market in downstream information and information products from scientific research. I estimate it at low billions for chemistry alone.
  • Bad policy decisions. If the scientific literature is not used fully then decisions are flawed. These range from new drugs, to climate, to the effects of chemicals to… Machines can provide decision support that complements humans.
  • Bad scholarship and bad scholarly relations. When a new technology emerges of benefit to scholarship then its wilful prevention for non-scholarly reasons has harmful effects on the whole community. It's fair to say that many textminers see publishers as a major problem, solely bent on making money by restrictive practices.

There are more – but that should be more than enough to build an overwhelming case.

Now what is "textmining"? The word is very unfortunate for several reasons:

  • There are specific legal aspects of text which may differ from other forms of information.
  • There is a confusion with "fulltext".
  • It suggests that only the words in scholarship are involved. This is particularly damaging since much information is conveyed in images, diagrams, audio and video (in fact all of the major MIME-types!). For example commercial publishers often forbid the re-use of diagrams or charge large amounts because artistic images have special protection under copyright.

I would like to see a more general term – perhaps "information mining" (IM) which covers all the types above and also "data". Or possibly "publication mining". It would be a disaster if we only agreed how to manage "text" and left the rest unchallenged.

Some technical background. (I actually suspect that most of the people who make the rules about IM (libraries, publishers) haven't a clue how it's done). Simply:

  • You write (or borrow) a program that retrieves the things you want to mine. A simple F/OSS one is called wget. Ours (Nick Day, Sam Adams) is called "PubCrawler" and has been specially built for crawling scholarly publications. You point it at a website and it systematically retrieves files/pages one-by-one. The only problem is that if you do this too quickly then it may overload the website, so responsible crawlers have a delay (perhaps 5 seconds) – POINT 1. The argument that textmining will destroy servers is a smokescreen. (There are many ways of avoiding technical problems). Note that if you already have the papers on a local machine this step is unnecessary. Universities create caches to avoid repeated downloads but publishers want the downloads so they can count-the-clicks. This process does NOT violate copyright though it may technically violate the restrictive publisher contracts that Universities have signed.
  • You have another program that mines information from each paper. This is hard and tedious to write but once done is automatic to run. How well it performs depends on many factors (the format of the paper, the language/style of the journal/authors, the use of dumb (GIF/PNG) or semi-semantic (SVG) diagrams, etc.). For text you could use Lucene – an Apache project. Daniel Lowe has shown that it's possible to mine 500,000 chemical reactions from US patents using our F/OSS OSCAR/OPSIN/ChemicalTagger and the NIH's OSRA for chemical diagrams. Things are better than they were 5 years ago and I am fairly hopeful about the technical mass-mining of chemistry. This process does NOT violate copyright though it may technically violate the restrictive publisher contracts that Universities have signed.
  • You publish your results. Here there is a potential problem with copyright although I suspect it has never been tested. I suspect anything less than bulk republishing of verbatim full-text would be allowable in many courts. In particular republishing "factual" information would incur no legal penalties, whether or not for commercial purposes.
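
To make the first two steps concrete, here is a minimal sketch in Python. It is not PubCrawler, OSCAR or any of the tools named above: the function names `crawl` and `mine`, the melting-point pattern and the sample texts are all assumptions invented for illustration. It only shows the shape of the process: fetch pages one-by-one with a pause between requests, then run an automatic extractor over each page.

```python
import re
import time
from typing import Callable, Dict, Iterable, List

# Hypothetical pattern, for illustration only: melting points written
# like "m.p. 121-123 °C". Real chemical text mining is far harder.
MELTING_POINT = re.compile(r"m\.p\.\s*(\d+(?:\s*-\s*\d+)?)\s*°C")


def crawl(urls: Iterable[str],
          fetch: Callable[[str], str],
          delay_seconds: float = 5.0,
          sleep: Callable[[float], None] = time.sleep) -> Dict[str, str]:
    """Retrieve each URL in turn, pausing between requests.

    `fetch` is supplied by the caller (an assumption of this sketch),
    so the same loop works for a live site or a local cache.
    """
    pages: Dict[str, str] = {}
    for index, url in enumerate(urls):
        if index > 0:
            sleep(delay_seconds)  # pause before every request except the first
        pages[url] = fetch(url)
    return pages


def mine(text: str) -> List[str]:
    """Return the melting-point values found in one document."""
    return [re.sub(r"\s+", "", value) for value in MELTING_POINT.findall(text)]


if __name__ == "__main__":
    # Stand-in for a website: papers already held locally, so no network is used.
    local_papers = {
        "paper-1": "A white solid, m.p. 121-123 °C, was obtained.",
        "paper-2": "The product was a colourless oil.",
    }
    pages = crawl(local_papers, local_papers.__getitem__, delay_seconds=0.0)
    for name, text in pages.items():
        print(name, mine(text))
```

For a live site, `fetch` could be something like `lambda url: urllib.request.urlopen(url).read().decode()`, with the delay left at its default. Keeping the fetch step separate from the mining step also reflects the point above: if the papers are already on a local machine, the crawl is unnecessary and only `mine` needs to run.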

The miner's problem.

Simply stated:

  • IM MIGHT fall foul of copyright law. Because of the risk-averseness of libraries and the pressure from some publishers to limit activities such as UK/PMC no authorities are prepared to challenge or test this. Individual researchers are left to make their own judgments, with little hope that they will get support from institutions. This canopy of fear is a dampener for research.
  • There are NO explicit rules. Because of this researchers do not know what they can and cannot do. Logic does NOT work in courts of law – only laws and precedent. People who make facile assertions that you can/not do something only muddy the waters.
  • It MIGHT fall foul of database laws such as sui generis in Europe. Again, in our risk-averse culture no-one offers support to challenge this.
  • It probably WILL fall foul of the Publisher-imposed extensions to University contracts. These are basically unethical and imposed solely (IMO) for protecting the market.

Simply stated: Miners need clear, simple, permanent, automatic answers so they know what they can and cannot do.

Researchers are responsible people. There are many places where research has to take account of law and there are very few public breaches. The same should be assumed for IM.

The publishers' problem.

The primary problem is that publishers now have a market (not necessarily of their own making) which is profitable and where change may bring problems. The flip-side, that IM may bring benefits, is never mentioned! Thus Richard Kidd of the Royal Soc. Chemistry on this blog has voiced the fear that my textmining may undermine the RSC's viability and he wants an assurance that I won't do anything to harm their income. I think of all publishers in the world the RSC is best placed to benefit massively from IM instead of preventing it happening.

This is a typical problem with monopolies (which the publishers have). They want to see their income continue indefinitely in the same way rather than changing their models. It's natural, and history shows it's ultimately doomed. Only the conservatism of academia (see Michael Eisen's blog) keeps them in business. Whether or not we take the publishers' interests into account depends on the worth that society gives to their services – and that is changing rapidly.

There is no natural law that says we do or don't have to accommodate the publishers, whether or not they are learned socs. They no longer have the moral right to control unilaterally how scientific knowledge is published and used. There has been no constructive debate in this area and publishers should think about their source of material and its volatility.

The libraries' problem.

This is a completely new technology which is opaque to many libraries. There are, of course, some world-leaders in information management, especially the NLM and national libraries, but the average University has no experience of either the technology or the law. This makes it problematic when publishers suggest that text-miners should go through their libraries and have joint discussions with publishers. This is counterproductive as it drastically slows the process and means that many of the decisions are made by non-practitioners. [I have so far written several times to my librarian and am waiting for a reply]. The rigmarole that Elsevier put Heather Piwowar through with UBC librarians is out of order and in any case doesn't scale across publishers, libraries or researchers.

Current concerns and why we need principles

There is a high probability that some well-intentioned academics will "negotiate" terms with publishers which then are used as a precedent to constrain everyone else. I, for example, am unwilling to accept the terms that UBC have. For that reason we are setting out principles, which we believe are absolute and which will inform the practices and their adoption. In the spirit of the excellently crafted BOAI and other declarations we are working towards words which will last for decades.

Bases of the principles:

  • The scholarly literature is created to inform and enlighten humankind. Authors expect that their material will be used as widely, in as many ways and by as many people as possible.
  • Information mining is a natural and major advance in the use of the scholarly literature and brings very large benefits.
  • The only inexorable laws relating to IM are copyright and database rights. These were not designed to restrict the flow of scholarship and should not be used for this purpose.
  • Subscribers to the scholarly literature are responsible people and will not deliberately break the law. They need a globally published set of principles by which they can determine what they may do.
  • Technology and human attitudes are changing rapidly and we should be positively and proactively responsive to them. We cannot and should not try to guess the future and we should not jeopardise it by short-term considerations.

And perhaps a single definition. I suggest the term "Open Mining" as inclusive. Note that these principles are statements of what we wish to be the case, not a negotiation. BBB are statements of aspiration.

  • "By Open-mining we mean the unrestricted use of machines to extract, process and republish content in whatever form (text, diagrams, images, data, audio, video, etc.) without prior specific permissions other than community norms of responsible behaviour in the electronic age."

"Responsible behaviour" and "community norms" covers stuff like server overloading, personal data, deliberate corruption, and adherence to generally accepted Internet practice.
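
As one possible reading of "generally accepted Internet practice", a miner can consult a site's robots.txt before fetching and honour any crawl delay it declares. The sketch below uses Python's standard `urllib.robotparser`; the user-agent name `open-miner`, the helper names and the example robots.txt text are hypothetical, and the 5-second fallback is only the figure suggested earlier in this post, not an agreed norm.

```python
from urllib.robotparser import RobotFileParser


def build_policy(robots_txt: str) -> RobotFileParser:
    """Parse the text of a robots.txt file that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def polite_delay(parser: RobotFileParser,
                 user_agent: str,
                 default_seconds: float = 5.0) -> float:
    """Use the site's declared Crawl-delay if there is one, else a default."""
    declared = parser.crawl_delay(user_agent)
    return float(declared) if declared is not None else default_seconds


if __name__ == "__main__":
    # Hypothetical robots.txt content, for illustration only.
    example = "User-agent: *\nCrawl-delay: 10\nDisallow: /private/\n"
    policy = build_policy(example)
    print(policy.can_fetch("open-miner", "https://example.org/articles/1"))
    print(polite_delay(policy, "open-miner"))
```

This covers only the server-overloading part of responsible behaviour. Personal data and deliberate corruption are matters of judgement that no parser can check.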

That's the aspiration. BBB are aspirations. Some scholars and some publishers have adopted them enthusiastically. They have helped enormously.





  1. The term the publishing industry is using these days to imply information beyond text is "Content Mining", e.g.

    • pm286 says:

      Thanks Casey,
      Good idea - so maybe Open Content Mining.

      That was the article that said everyone found the publishers very helpful except Peter Murray-Rust

  2. Richard Kidd says:

    Peter, you misunderstood my contribution - I've replied before about the RSC's position, as you've noted.

    I was trying to explain, in general terms, publishers' concerns as stakeholders garnered from a *lot* of conversations, in an attempt to avoid some of the misunderstandings about text/data mining, and as a help for the community of potential content miners to understand why publishers ask for some of the things we do.

    I didn't make these general statements about the RSC, so the extrapolations on threats to us or our income are just that - extrapolations, and I refer you to my earlier answer on that.

    RSC are perfectly open to persuasion on benefits to the original research of text mining, that's why we've done so many collaborations on it in the past, and are involved in projects like Open PHACTS. Our own text mining puts compounds and links into ChemSpider for anyone to find.

    What I said is that you have to emphasise these benefits to stakeholders to help them to agree. And the first one who proves it will be knocked over by the rush of publishers.

    • Mike Taylor says:

      But, Richard, who wants to be knocked over? Not me.

      The brutal truth, which will be hard for publishers to hear, is that when it comes to mining, researchers are not looking for partnerships with publishers. All we want is for you to get out of the way.

      Note, I am not saying that publishers have no role. I am saying that their role is not in mining. That is research; researchers do it. Trying to get publishers involved in that is like trying to get them involved in phylogenetic analysis or specimen photography.

      • pm286 says:

        >>But, Richard, who wants to be knocked over? Not me.

        >>The brutal truth, which will be hard for publishers to hear, is that when it comes to mining, researchers are not looking for partnerships with publishers. All we want is for you to get out of the way.

        >>Note, I am not saying that publishers have no role. I am saying that their role is not in mining. That is research; researchers do it. Trying to get publishers involved in that is like trying to get them involved in phylogenetic analysis or specimen photography.

        This is true.

        The fact is that we do not need publishers in a *technical* sense. We have mined hundreds of thousands of documents without the publishers being involved. The Open Access publishers do not get involved.

        The only reason publishers are involved is because, currently, most of them prevent us from doing it by legal and contractual means.

      • Richard Kidd says:

        Hi Mike - ok, back to speaking from my own viewpoint…

        I've talked to, and collaborated on mining projects with, researchers who definitely want us to be involved. And, come to mention it, have heard the viewpoints of authors who don't welcome the prospect of their stuff being arbitrarily text mined and republished when they don't know how the results will be validated or presented - I have examples of mining of RSC publications where completely unambiguous primary bibliographic data was mangled. So in that sense I don't believe you speak for anything like all researchers on this.

        I also disagree that the issues of text extraction, analysis, validation, and publishing of standard chemical data, not to mention credit & attribution (all *extremely* pertinent to mining) are none of the business of responsible publishers, let alone a society like us with domain responsibility. We are embedded in our community, and fwiw also publish with collaborators ( Isn’t this like me telling researchers to stay out of publishing !?

        I can live with brutal opinions, but yer actual truth is rarely simple (and may be inconvenient).

        • Mike Taylor says:

          Hi, Richard. You do make some important points here. First, of course I can't speak for all researchers, and I shouldn't have implied that everyone shares my attitude on this. (I know Peter does, but a sample of two is hardly statistically significant!)

          On authors who don't want their work to be mined: I suppose that could be construed as their privilege, however stupid. (It's essentially the same as saying they don't want their work to be read.) I suppose going forward, there will be a market for publication options that explicitly remove people's work from being useful in this way. In time, as the value of such works becomes visibly marginalised, that option will die out on its own. In the mean time, we just have to be careful not to accidentally pollute our mining corpuses with such works. The best way to do that is of course with machine-readable licences.

          You are right that in some cases researchers have a responsibility to ensure that mined data is correctly credited and attributed. I don't understand why you think they need publishers' help with that, though, any more than we need your help when citing a work in a bibliography.

          In the end, the killer argument is Peter's: OA publishers such as PLoS don't get involved in the work of researchers who mine their output, and no-one misses them. I can think of legitimate reasons why publishers would want to be involved, but none why researchers would want them to be.

          (Again, to be clear: I recognise that publishers have a role. Just that this isn't it.)

  3. Hi Peter: FYI, I mention this post in a column published today in the Chronicle of Higher Education: The focus of the column is Heather Piwowar and UBC's negotiations with Elsevier. I look forward to hearing what comes out of your manifesto discussions.



    • pm286 says:

      Many thanks - I was forwarded this and it's well balanced.

      A lot will come out of what we are doing because there is so much unfulfilled promise in the system.

  4. Stephanie says:

    Peter, I strongly suspect that your statement that most libraries and publishers don't have a clue as to the technology behind the crawling is wrong. I can't speak for the publishers, but the academic libraries at which I have worked are all plenty sophisticated technologically, and in a few cases, ran IT or a large portion thereof for the whole darn campus on which they were located. Libraries are becoming far more technologically sophisticated and savvy than many people think, and your perception is as dated as the shushing librarian stereotype.

    • pm286 says:

      OK - possibly overstated. But if you look at the crawlability of Institutional Repositories it's awful. What systematic efforts have been made to index or extract any information from IRs? Simply putting up an OAI-PMH API doesn't give much.

      When someone can answer me a simple question like:
      "Please find me all chemistry theses in UK repositories"

      I may start to believe something has happened.

      Of course I can't text mine them. University repositories often put a blanket ban on any re-use of the information in the repositories. You then have to ask the authors...

    • pm286 says:

      Thanks for the support.

      Note that Heather has joined our small group which is fleshing out the manifesto. It will not make concessions.

