Yesterday I spent a wonderful 2 hours with the publishing division of the Royal Society [1], the oldest and most influential scientific society in the world. The Royal Society supports research, helps formulate policy, works with public bodies…
… and publishes science. Wikipedia notes “The history of scientific journals dates from 1665, when the French Journal des sçavans and the English Philosophical Transactions of the Royal Society first began systematically publishing research results.”

So when the question of content-mining became important – in about 2010 – the Royal Society took a proactive view. In the “Licences for Europe” struggle in 2013 (see letter) it supported content-mining without requiring a licence from the publisher – effectively “The Right to Read is the Right to Mine”. [Geoffrey Boulton FRS, [2]] I think it was the only body with a conventional (“toll-access”) publishing arm to do so, and it therefore deserves our praise and thanks.

So having met with Geoffrey, and also with Louise Pakseresht, Policy Advisor, I spent 2 hours with Stuart Taylor, Publishing Director, and Helen Duriez, ePublishing Manager. We spent some time looking at the technology of mining, its value and its possible problems. Informally we’ve agreed to work together on content-mining and see how it works.
In Cambridge we’re going ahead to mine the daily scientific literature for facts and to publish the factual content. And since we are mining the whole scientific literature – closed as well as open – this will include the daily output of the Royal Society. Legally we can do this without their permission but it makes sense for us to work together to see what the problems are, and hopefully to remove some of the ignorance and unfounded worries.
ContentMine is committed to responsible ContentMining (see our paper) and this experience will be extremely valuable in helping everyone know what the issues are. We were able to reassure Stuart and colleagues that

  • our daily mining, at perhaps 10 papers an hour, would not burn out servers (they have planned for several orders of magnitude more traffic from single users).
  • we would not publish significant amounts of copyrighted material. The default is the publishing industry’s de facto limit of 200 characters (or 1-2 sentences) surrounding each entity. We have no intention of deliberately causing problems – this is not “pirating” or “stealing”.
  • we are committed to a detailed audit trail. This will take some time to develop, but hopefully a communal approach will emerge.
  • all research should be reproducible, so there would be a manifest of the resources and protocols used.
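
To make the “200 characters surrounding each entity” idea concrete, here is a minimal sketch in Python. It is not ContentMine code: the function name `snippets`, the regular-expression matching and the choice to split the 200 characters evenly before and after the entity are all my assumptions for illustration.

```python
import re


def snippets(text, pattern, window=200):
    """Find each match of `pattern` and keep a limited amount of context.

    Assumption: the window is a total budget, split evenly, so at most
    window // 2 characters are kept before and after each entity.
    """
    half = window // 2
    results = []
    for m in re.finditer(pattern, text):
        start = max(0, m.start() - half)        # clamp at start of text
        end = min(len(text), m.end() + half)    # clamp at end of text
        results.append({
            "match": m.group(0),
            "pre": text[start:m.start()],
            "post": text[m.end():end],
        })
    return results


if __name__ == "__main__":
    # Hypothetical sentence, used only to show the shape of the output.
    sample = "We observed Panthera leo near the river at dusk."
    for s in snippets(sample, r"Panthera leo"):
        print(repr(s["pre"]), repr(s["match"]), repr(s["post"]))
```

However long the paper, each extracted fact carries only a bounded fragment of the original text, which is the point of the de facto limit.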

There are some real unknowns. Any application of machine-scale analysis brings new benefits and concerns. One concern is that the process may corrupt information – and we’ll do whatever we can to measure and minimise this. Conversely, we have already shown that mining detects errors in the literature which can be put right – indeed our technology could be valuable in the reviewing and editing of material for publication. Another is the sheer scale – we could mine the whole literature for – say – breeding grounds and create systematic maps. That brings benefits (see a study where herbaria give new insights on climate change and invasive species). I expect content-mining will do the same. There are also dangers – it may pinpoint endangered areas or species. But this is the inevitable challenge of the Digital Century – we have to learn how to live with and manage massive new knowledge.
So this week we are releasing a client-side version of ContentMine. (It’s already released, so it’s a soft launch.) We are reasonably confident that it can be installed and run by commandline-aware citizens. I’ll be blogging more about the details. Helen has volunteered to try it out and this could be one of the first examples of a publisher using content-mining!

There are many publishers who want to take a responsible approach to reader-driven content mining but don’t know enough to take it forward. I believe that Royal Society publishing will set the de facto approach and act as an important reference for others.

[1] Not to be confused with the many other “Royal Society of ***” bodies.
[2] Prof Geoffrey Boulton, Chair of the “Science as an open enterprise” report’s Working Group, and Chair of the Science Policy Advisory Group, Royal Society.

