petermr's blog

A Scientist and the Web


Towards a manifesto on Open Mining of scholarship

Tomorrow a small group of people interested in “textmining” will have a Skype meeting under the auspices of the OKFN. We have sort-of-pushed this agenda for some years and now it’s come to fruition – there is clear public awareness of the value of textmining and the barriers that prevent it being used. Indeed my blog has even got mentioned in a financial analyst’s review of Elsevier (the implication being that if Elsevier continues to drag its feet its market will react against it). Of course it’s not just Elsevier, but they are the ones that have had most prominence. So this post is to prepare my mind and hopefully come out with some useful ideas.

There is no doubt that the lack of positive approaches to textmining is having huge costs:

  • Opportunity. We cannot do the things that we want to. Moreover this stifles the imagination of the rest of the community – without exciting examples of what can be done – and they *are* exciting – people do not realise what they are missing. And that’s all of us, not just subscribers to journals.
  • In wasted time. Anyone wishing to do textmining has to spend huge amounts of time trying to get permissions, worrying about being taken to court, and simply waiting for null responses.
  • Bad science. Much published scientific data is flawed. Not necessarily deliberately, but by the outdated methods of publication. Almost no scientific data are reviewed (a few publishers like Int. Union of Crystallography are shining exceptions). And their tools have unearthed bad and fraudulent science. There is no reason to believe it is different elsewhere – in fact I suspect it’s worse – the chance of getting caught is often near zero. Textmining is a major tool in data review.
  • Unexploited information and products. Google et al. have shown that there are huge new markets. There is undoubtedly a large market in downstream information and information products from scientific research. I estimate it at low billions for chemistry alone.
  • Bad policy decisions. If the scientific literature is not used fully then decisions are flawed. These range from new drugs, to climate, to the effects of chemicals to… Machines can provide decision support that complements humans.
  • Bad scholarship and bad scholarly relations. When a new technology emerges of benefit to scholarship then its wilful prevention for non-scholarly reasons has harmful effects on the whole community. It’s fair to say that many textminers see publishers as a major problem, solely bent on making money by restrictive practices.

There are more – but that should be more than enough to build an overwhelming case.

Now what is “textmining”? The word is very unfortunate for several reasons:

  • There are specific legal aspects of text which may differ from other forms of information.
  • There is a confusion with “fulltext”.
  • It suggests that only the words in scholarship are involved. This is particularly damaging since much information is conveyed in images, diagrams, audio and video (in fact all of the major MIME-types!). For example commercial publishers often forbid the re-use of diagrams or charge large amounts because artistic images have special protection under copyright.

I would like to see a more general term – perhaps “information mining” (IM) which covers all the types above and also “data”. Or possibly “publication mining”. It would be a disaster if we only agreed how to manage “text” and left the rest unchallenged.

Some technical background. (I actually suspect that most of the people who make the rules about IM (libraries, publishers) haven’t a clue how it’s done). Simply:

  • You write (or borrow) a program that retrieves the things you want to mine. A simple F/OSS one is called wget. Ours (Nick Day, Sam Adams) is called “PubCrawler” and has been specially built for crawling scholarly publications. You point it at a website and it systematically retrieves files/pages one-by-one. The only problem is that if you do this too quickly then it may overload the website, so responsible crawlers have a delay (perhaps 5 seconds) – POINT 1. The argument that textmining will destroy servers is a smokescreen. (There are many ways of avoiding technical problems). Note that if you already have the papers on a local machine this step is unnecessary. Universities create caches to avoid repeated downloads but publishers want the downloads so they can count-the-clicks. This process does NOT violate copyright though it may technically violate the restrictive publisher contracts that Universities have signed.
  • You have another program that mines information from each paper. This is hard and tedious to write but once done is automatic to run. How well it performs depends on many factors (the format of the paper, the language/style of the journal/authors, the use of dumb (GIF/PNG) or semi-semantic (SVG) diagrams, etc.). For text you could use Lucene – an Apache project. Daniel Lowe has shown that it’s possible to mine 500,000 chemical reactions from US patents using our F/OSS OSCAR/OPSIN/ChemicalTagger and the NIH’s OSRA for chemical diagrams. Things are better than they were 5 years ago and I am fairly hopeful about the technical mass-mining of chemistry. This process does NOT violate copyright though it may technically violate the restrictive publisher contracts that Universities have signed.
  • You publish your results. Here there is a potential problem with copyright although I suspect it has never been tested. I suspect anything less than bulk republishing of verbatim full-text would be allowable in many courts. In particular republishing “factual” information would incur no legal penalties, whether or not for commercial purposes.
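
The first two steps above can be sketched in a few lines of Python. This is only an illustration and is not PubCrawler or OSCAR: the function names (fetch_url, polite_crawl, mine_temperatures) are invented for this example, the 5-second default is simply the delay suggested above, and the regular expression is a toy stand-in for a real miner, assuming we only want temperatures written like "123.5 °C".

```python
import re
import time
import urllib.request
from typing import Callable, Dict, Iterable, List


def fetch_url(url: str) -> bytes:
    """Retrieve one page over HTTP using only the standard library."""
    with urllib.request.urlopen(url, timeout=30) as response:
        return response.read()


def polite_crawl(
    urls: Iterable[str],
    delay_seconds: float = 5.0,
    fetch: Callable[[str], bytes] = fetch_url,
    sleep: Callable[[float], None] = time.sleep,
) -> Dict[str, bytes]:
    """Fetch pages one by one, pausing between requests.

    The pause is what keeps a responsible crawler from overloading
    the server. `fetch` and `sleep` are parameters only so the
    function can be exercised without a network or a real wait.
    """
    pages: Dict[str, bytes] = {}
    for index, url in enumerate(urls):
        if index > 0:
            sleep(delay_seconds)  # wait before every request after the first
        pages[url] = fetch(url)
    return pages


# Toy pattern (an assumption for illustration): a number, optional
# whitespace, a degree sign, then "C".
TEMPERATURE = re.compile(r"(-?\d+(?:\.\d+)?)\s*\u00b0\s*C")


def mine_temperatures(text: str) -> List[float]:
    """Return every Celsius temperature mentioned in the text."""
    return [float(value) for value in TEMPERATURE.findall(text)]
```

With the papers already on a local machine you would skip polite_crawl entirely and run only the mining function over the files. Real miners are far harder to write than this pattern suggests, which is the "hard and tedious" part described above.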

The miner’s problem.

Simply stated:

  • IM MIGHT fall foul of copyright law. Because of the risk-averseness of libraries and the pressure from some publishers to limit activities such as UK/PMC no authorities are prepared to challenge or test this. Individual researchers are left to make their own judgments, with little hope that they will get support from institutions. This canopy of fear is a dampener for research.
  • There are NO explicit rules. Because of this researchers do not know what they can and cannot do. Logic does NOT work in courts of law – only laws and precedent. People who make facile assertions that you can/not do something only muddy the waters.
  • It MIGHT fall foul of database laws such as the sui generis database right in Europe. Again, in our risk-averse culture no-one offers support to challenge this.
  • It probably WILL fall foul of the Publisher-imposed extensions to University contracts. These are basically unethical and imposed solely (IMO) for protecting the market.

Simply stated: Miners need clear, simple, permanent, automatic answers so they know what they can and cannot do.

Researchers are responsible people. There are many places where research has to take account of law and there are very few public breaches. The same should be assumed for IM.

The publishers’ problem.

The primary problem is that publishers now have a market (not necessarily of their own making) which is profitable and where change may bring problems. The flip-side, that IM may bring benefits, is never mentioned! Thus Richard Kidd of the Royal Soc. Chemistry has voiced on this blog the fear that my textmining may undermine the RSC’s viability, and he wants an assurance that I won’t do anything to harm their income. I think of all publishers in the world the RSC is best placed to benefit massively from IM instead of preventing it happening.

This is a typical problem with monopolies (which the publishers have). They want to see their income continue indefinitely in the same way rather than changing their models. It’s natural, and history shows it’s ultimately doomed. Only the conservatism of academia (see Michael Eisen’s blog) keeps them in business. Whether or not we take the publishers’ interests into account depends on the worth that society gives to their services – and that is changing rapidly.

There is no natural law that says we do or don’t have to accommodate the publishers, whether or not they are learned socs. They no longer have the moral right to control unilaterally how scientific knowledge is published and used. There has been no constructive debate in this area and publishers should think about their source of material and its volatility.

The libraries’ problem.

This is a completely new technology which is opaque to many libraries. There are, of course, some world-leaders in information management, especially the NLM and national libraries, but the average University has no experience of either the technology or the law. This makes it problematic when publishers suggest that text-miners should go through their libraries and have joint discussions with publishers. This is counterproductive as it drastically slows the process and means that many of the decisions are made by non-practitioners. [I have so far written several times to my librarian and am waiting for a reply]. The rigmarole that Elsevier put Heather Piwowar through with UBC librarians is out of order and in any case doesn’t scale across publishers, libraries or researchers.

Current concerns and why we need principles

There is a high probability that some well-intentioned academics will “negotiate” terms with publishers which then are used as a precedent to constrain everyone else. I, for example, am unwilling to accept the terms that UBC have. For that reason we are setting out principles, which we believe are absolute and which will inform the practices and their adoption. In the spirit of the excellently crafted BOAI and other declarations we are working towards words which will last for decades.

Bases of the principles:

  • The scholarly literature is created to inform and enlighten humankind. Authors expect that their material will be as widely used in as many ways as possible and by as many people as possible.
  • Information mining is a natural and major advance in the use of the scholarly literature and brings very large benefits.
  • The only inexorable laws relating to IM are copyright and database rights. These were not designed to restrict the flow of scholarship and should not be used for this purpose.
  • Subscribers to the scholarly literature are responsible people and will not deliberately break the law. They need a globally published set of principles by which they can determine what they may do.
  • Technology and human attitudes are changing rapidly and we should be positively and proactively responsive to them. We cannot and should not try to guess the future and we should not jeopardise it by short-term considerations.

And perhaps a single definition. I suggest the term “Open Mining” as inclusive. Note that these principles are statements of what we wish to be the case, not a negotiation. BBB are statements of aspiration.

  • “By Open-mining we mean the unrestricted use of machines to extract, process and republish content in whatever form (text, diagrams, images, data, audio, video, etc.) without prior specific permissions other than community norms of responsible behaviour in the electronic age.”

“Responsible behaviour” and “community norms” cover matters such as server overloading, personal data, deliberate corruption, and adherence to generally accepted Internet practice.

That’s the aspiration. BBB are aspirations. Some scholars and some publishers have adopted them enthusiastically. They have helped enormously.




