Elsevier still charge for papers on Zika

Posted on April 16, 2016 by pm286

I thought that all major publishers had agreed to make papers on Zika available for free as a public service.
But I was wrong. Elsevier are charging people. Admittedly I haven’t read this paper as I’d have to pay. But I think it’s about Zika.
Prediction: we’ll have a mail from Elsevier saying “this is part of our bumpy road. We’re still not fully competent to manage Open Access, so please forgive us; it will take some more years to get it right. Meanwhile we’ll continue to charge you.”
NOTE. This paper wasn’t difficult to discover – I found it on Europe PubmedCentral http://europepmc.org/. I just typed “Zika”.

Posted in Uncategorized | Leave a comment

ContentMine is (alpha)-ready for you to use. Hacking on Thursday/Friday.

Posted on April 12, 2016 by pm286

We’ve now built a version of ContentMine that we feel happy to promote for you to use. (There’s been no secret – all the code (https://github.com/ContentMine/) and most of the discussion has been public.) But now we are actively asking people to try it and give feedback.
Here’s a quote we got today:
“Hi, first congratulations. I have to say that I was skeptical at the beginning but I have tested the software and it is fantastic.”
That is the sort of thanks that keeps you going at midnight when the build doesn’t build and the tests start failing and …
… but ContentMine is a large system and distributing and maintaining large systems is hard, tedious, endlessly frustrating. We’re very conscious of when it fails for you. The good thing is that we’ve had a number of volunteers try it out and they deserve y/our thanks. It’s very hard being the first and getting fails.
A great difference has been made by Tom Arrow. Tom graduated last year from Imperial, London, and won the first Bradley-Mason prize. We have been delighted that he chose to join ContentMine as a developer and moved to Cambridge two months ago. We’re excited that this role is turning out to be critical. Tom has been helping people set up the system.
CM consists of 6 modules of which 2 (cat, canary) are primarily server-side and 4 (getpapers, quickscrape, norma, ami) are downloadable by anyone. The use has evolved over the last year – in April 2015 we ran a workshop where the first day was showing how to use the system. Now we suggest you can get up to speed on your own – and in maybe 15 minutes. That depends critically on your experience of installation on your system – if you know about the commandline and things like apt-get, Node.js, npm, JRE, and PATHs, it should be reasonably straightforward. So we’ll call it “alpha”, which means it works but you benefit from knowing what you are doing and how to avoid mistakes.
These are the 4 downloadable modules:

getpapers (Node.js) which queries a repository (by default Europe PubmedCentral). It returns anywhere from 10 to >1000 papers (mainly OpenAccess). This can take 15-200 s depending on line speeds.
quickscrape (Node.js) which uses known URLs to retrieve the components of a paper (HTML, PDF, supplemental data, etc.). Normally you’ll use one of these two.
norma (Java) which turns XML, PDF, HTML into scholarlyHtml. In most cases norma will do what she needs to without you having to worry.
ami (Java) a set of modules for searching and filtering on scientific criteria (species, genes, chemistry, disease, countries, etc.).

We use the commandline to launch processes, and the file system to communicate and store results. This general approach is tried and tested over 50 years! In essence we have built a toolbox for knowledge.
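As a rough illustration of the first step (querying a repository such as Europe PubmedCentral), here is a minimal Python sketch that only builds a search URL. It is not ContentMine or getpapers code: the function name `build_search_url` and the `page_size` parameter are hypothetical, and the endpoint shown is assumed to be the public Europe PMC REST search service rather than anything stated in this post.

```python
from urllib.parse import urlencode

# Assumed public Europe PMC REST search endpoint (not taken from this post).
EUROPE_PMC_SEARCH = "https://www.ebi.ac.uk/europepmc/webservices/rest/search"


def build_search_url(term, page_size=25):
    """Return a search URL for `term`; hypothetical helper, builds the URL only."""
    params = {
        "query": term,          # free-text query, e.g. "Zika"
        "format": "json",       # ask for JSON rather than XML
        "pageSize": page_size,  # number of results per page
    }
    return f"{EUROPE_PMC_SEARCH}?{urlencode(params)}"


if __name__ == "__main__":
    # Print the URL; fetching and saving results to disk would be a separate step.
    print(build_search_url("Zika", page_size=10))
```

In keeping with the file-system approach described above, a fuller tool would write each result into its own directory so that later steps can pick it up.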
We’re showing this in Brussels on Thursday with a hackday on Friday. Do come! We are delighted that Julia Reda, MEP is coming in the afternoon. Julia has been a huge supporter of mining the literature and the copyright reform required to support it. Julia has volunteered to install the system on her Ubuntu machine – so we are on public show!
It’s much easier installing programs on multiple Operating systems than it used to be. Here we have two languages (Node, Java) and 3 systems (UNIX, MacOS, Windows) so that’s 6 combinations. Tom has made a great job of making all these work but there are still alpha bugs we don’t know of!
And we know that different countries have variants – diacritics, keyboards, even file names. We can’t guarantee anything other than UK/US-EN, but we’ll certainly try!
In Brussels we’ll hope to have communication to anyone via Etherpad, possibly Skype or Uber. More details on Twitter – follow @TheContentMine.
I’ll talk about the science we are going to do in the next blog…


ContentMine and the Royal Society working together

Posted on April 12, 2016 by pm286

Yesterday I spent a wonderful 2 hours with the publishing division of the Royal Society [1], the oldest and most influential scientific society in the world. The Royal Society supports research, helps formulate policy, works with public bodies…
… and publishes science. Wikipedia notes “The history of scientific journals dates from 1665, when the French Journal des sçavans and the English Philosophical Transactions of the Royal Society first began systematically publishing research results.”

So when the question of Content-mining became important – in about 2010 – The Royal Society took a pro-active view. In the “Licences for Europe” struggle in 2013 (see letter) it supported content-mining without requiring a licence from the publisher – effectively “The Right to Read is the Right to Mine”. [Geoffrey Boulton FRS, [2]] I think it was the only body with a conventional (“toll-access”) publishing arm to do so, and therefore deserves our praise and thanks.

So having met with Geoffrey, and also with Louise Pakseresht, Policy Advisor, I spent 2 hours with Stuart Taylor, Publishing Director, and Helen Duriez, ePublishing Manager. We spent some time looking at the technology of mining and also the value and possible problems. Informally we’ve agreed to work together on content-mining and see how it works.
In Cambridge we’re going ahead to mine the daily scientific literature for facts and to publish the factual content. And since we are mining the whole scientific literature – closed as well as open – this will include the daily output of the Royal Society. Legally we can do this without their permission but it makes sense for us to work together to see what the problems are, and hopefully to remove some of the ignorance and unfounded worries.
ContentMine is committed to responsible ContentMining (see our paper) and this experience will be extremely valuable in helping everyone know what the issues are. We were able to reassure Stuart and colleagues that

the daily rate of perhaps 10 papers an hour would not burn out servers (they have planned for several orders of magnitude more traffic from single users).
we would not publish significant amounts of copyrighted material. The default is the publishing industry de facto 200 characters (or 1-2 sentences) surrounding each entity. We have no intention of deliberately causing problems – this is not “pirating” or “stealing”.
We are committed to a detailed audit trail. This will take some time to develop, but hopefully a communal approach will be developed.
All research should be reproducible, so there would be a manifest of resources used and protocols.
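The 200-character default mentioned above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not ContentMine’s implementation: the function name `snippets` is hypothetical, and it assumes the 200 characters include the entity itself, centred where the text allows.

```python
import re


def snippets(text, term, window=200):
    """Return up to `window` characters of context around each match of `term`.

    Hypothetical helper: the window includes the matched entity and is
    centred on it unless the match is near the start or end of the text.
    """
    results = []
    for match in re.finditer(re.escape(term), text, flags=re.IGNORECASE):
        spare = max(window - len(match.group()), 0)  # characters left for context
        start = max(match.start() - spare // 2, 0)
        end = min(match.end() + (spare - spare // 2), len(text))
        results.append(text[start:end])
    return results


if __name__ == "__main__":
    sample = "Several cases of Zika infection were reported in the region."
    for snippet in snippets(sample, "zika", window=40):
        print(snippet)
```

Publishing only such short windows, rather than whole paragraphs, is what keeps the amount of copyrighted text small.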

There are some real unknowns. Any application of machine-scale analysis brings new benefits and concerns. One of them is that the process may corrupt information – and we’ll do whatever we can to measure and minimise this. In reverse we have already shown that mining detects errors in the literature which can be put right – indeed our technology could be valuable in the reviewing and editing of material for publication. Another is the sheer scale – we could mine the whole literature for – say – breeding grounds and create systematic maps. That brings benefits (see a study where herbaria can give new insights on climate change and invasive species). I expect contentmining will do the same. But there are also dangers – it may pinpoint endangered areas or species. But this is the inevitable challenge of the Digital Century – we have to learn how to live with and manage massive new knowledge.
So this week we are releasing a client-side version of ContentMine. (It’s already released, so it’s a soft-launch) but we are reasonably confident that it can be installed and run by commandline-aware citizens. I’ll be blogging more about the details. Helen has volunteered to try it out and this could be one of the first examples of a publisher using contentMining!

There are many publishers who want to take a responsible approach to reader-driven content mining but don’t know enough to take it forward. I believe that Royal Society publishing will set the de facto approach and act as an important reference for others.

[1] do not confuse with many other Royal Society of ***
[2][Prof Geoffrey Boulton, Chair of Science as an open enterprise report’s Working Group, and Chair of the Science Policy Advisory Group, Royal Society.]


Public demonstration of ContentMining (TDM) in Brussels April 14/15

Posted on April 9, 2016 by pm286

I am excited and honoured to be invited to talk/present at http://www.openforum.be/ next week. I was going to talk about ContentMining anyway – the potential, the vision, the value to citizens – but the launch of Open Science Europe last week has made this even more relevant.

Simply:

Mining for Science is valuable for everyone – citizens (doctors, policymakers, secondary schools, patients, conservationists, transport, finance … etc.)
Mining can be done by anyone who is happy to install a program [1]. Takes 10 minutes.

I shall demo this in my talk but I am present for 2 days. On Friday I am happy to hack, be contacted – just drop in. We can demo in a 15 minute session and answer a subject YOU are interested in. (The emphasis is biomedical but other subjects – including social science and humanities – can be addressed.)
If you are in any way involved in the current European debate on Copyright or Open Science please come. This is aimed at citizens, not just “academics” or professionals. We would love to see Commission and Parliament people at all levels, but also any interested citizens and curious minds.

From the programme:
Open Data & Open Access : getting more from scientific papers with content mining, Thursday 14 April 2016, 18h30, University Foundation, Brussels
There are several thousand scientific papers published each day, and nobody can keep up with them. If they are Open Access they can be aggregated in a single place such as the repositories CORE (UK), HAL (FR), and Europe PubMedCentral (for biomedical papers).
It’s then possible to use machines to help us filter them on scientific grounds and select exactly those sections of each paper that the reader wants to read. It’s also possible to extract chunks of scientific knowledge such as molecular structures or evolutionary trees and compute completely new knowledge.
I shall demo this system using at least two examples:

The “Zika epidemic”. What do we actually know about Zika from the peer-reviewed literature? How does it link to other Open Scientific Knowledge?
Clinical trials. Europe and other countries have collected 400,000 clinical trials. Can we search them? What procedures were used? How many patients? And, very importantly, has the trial been reported in the recent literature?

This presentation will be accessible to anyone: school students, scientists, policy makers, data journalists, etc.
All content and tools are free and open, and can be used by anyone.

Hackday on Friday 15 April 2016, 9h

The hackday will explore the automatic extraction of facts from documents, especially (but not exclusively) science and medicine. By default we can extract:

species
DNA
places
genes
word frequencies
drugs
organizations

Participants can also create their own word lists and regular expressions.
By default we’ll use the Open Access scientific literature but we can also look at any easily retrieved public documents (e.g. government, NGO).
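To give a flavour of what word lists and regular expressions look like in practice, here is a small self-contained Python sketch. It is not the ContentMine (ami) code; the function names, the example word list and the DNA pattern (8 or more of A, C, G, T) are all hypothetical choices made for illustration.

```python
import re
from collections import Counter

# Hypothetical pattern: a run of 8 or more DNA bases standing alone as a word.
DNA_PATTERN = re.compile(r"\b[ACGT]{8,}\b")


def word_frequencies(text):
    """Count how often each lower-cased alphabetic word occurs."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def match_word_list(text, word_list):
    """Count case-insensitive whole-word occurrences of each listed term."""
    counts = Counter()
    for term in word_list:
        pattern = r"\b" + re.escape(term) + r"\b"
        hits = re.findall(pattern, text, flags=re.IGNORECASE)
        if hits:
            counts[term] = len(hits)
    return counts


if __name__ == "__main__":
    text = "Aedes aegypti carries Zika. Zika RNA primer ACGTACGTAC was used."
    print(match_word_list(text, ["Zika", "Aedes aegypti", "dengue"]))
    print(DNA_PATTERN.findall(text))
    print(word_frequencies(text).most_common(3))
```

A participant’s own word list is then just a plain list of terms, and a custom regular expression is a single pattern like `DNA_PATTERN`.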
[1] If you are interested in installing and running, come to the Friday session. You need to be able to use a commandline, and know how to install a program. That’s all.


What is “Open Science”? Carlos Moedas gets it, do you?

Posted on April 6, 2016 by pm286

I listened and watched (as best as possible from 20000 km in AU) the EU OpenScience meeting inspired by the Dutch presidency. I didn’t get all the presentations, but I got enough from the opening session that I could follow and make some tweets. There was an opening by Commissioner Carlos Moedas, followed by a panel of 6 great-and-good, necessarily “balanced”. Some positive, some negative (like BusinessEurope, which interprets “Open” as public-private partnerships, and Haank from Springer who used Open as a reason to promote higher revenue for publishers – yes, you heard that right – Open will cost more and Springer wants to “do what the community wants”.)

But Moedas’s speech was compelling and heartfelt. I’d heard him speak in Cambridge a month ago on research funding. I wanted to ask him then about Open and ContentMining (Text and data mining, TDM) but it wasn’t the time.

Now he has answered much of what I wanted in his speech at EU2016NL. I’ll quote passages and then comment.

He started with the question that many are asking – what about Sci-Hub?:

“Last week, The Washington Post published an article about Alexandra Elbakyan, a 27 year old student from Kazakhstan and Founder of Sci-Hub, an online database of nearly 50 million pirated academic journal articles. To some, she is “The Robin Hood of Science.” To others, she is a notorious cyber-criminal.
Elbakyan’s case raises many questions. To me the most important one is: is this a sign that academic journals will face the same fate as the music and media industries? If so – and there are strong parallels to be drawn − then scientific publishing is about to be transformed.”

This is remarkable. Alexandra (who I haven’t met, or corresponded with, but would like to) is shunned by academia and publishers. She is “breaking the law” and therefore must be treated as a criminal. Possibly extradited and then tried as a cyber-terrorist as they wanted to do with Aaron Swartz.
But Alexandra is in a long tradition of civil disobedience. Our current Copyright law is bad. It’s bad for science. It’s bad for citizens. It’s bad for the health of the planet and of the human race. Not enough people stand up and denounce it. The arguments for reform are made, but they are smashed by the massive money being thrown at Brussels by the entrenched industry. Reform, if it comes at all, will be minuscule – “TDM by public interest research organizations doing non-commercial research” was one Commission proposal. Is that even me – one of the most prominent public practitioners?

I will write later about Sci-hub and Alexandra. Currently I want to change the law by legal means and political representation. But…
So Moedas is the first person in authority I have seen raise Sci-hub. I haven’t seen any Vice-Chancellors, or Heads of Funders, or Research Organizations say anything positive.

Yet technically Sci-hub can do more for scientific knowledge than almost any other initiative. It’s conceptually simple – collect all the world’s scientific publications together and make them available to everyone.
That’s what the Open Access movement (or at least some of its founders) tried to do, and failed.
Failed because at best 15% of published science is “open” in some form. In fields such as engineering and materials it’s near zero. Failed because the means of discovery, access and re-use are either non-existent or left to commercial companies who are unregulated and by default non-transparent and therefore cannot be trusted to do what we want and how we want.
I predict that unless the Open Access movement actually creates a Sci-hub lookalike within 5 years, then it will either become irrelevant or will be a wrapper for commercial organisations who develop their own private infrastructure/s that we are forced to use, without trust or control. I fear the latter, because Universities show no signs of any real commitment to Open.

So what’s Open?
Open has worked in other fields, especially software. Moedas continues:
In my view, there is a strong economic, scientific and moral case for embracing open science. Which brings me to my first point: why open science is a good thing and one of the 3 core priorities of my mandate.

I agree these are the fundamentals. The economic case is very strong, but it is very badly presented by Open Access enthusiasts. Public investment in science gets a huge multiplier, IF properly re-used:

A recent study analysed the economic impact of opening-up research data. Using the example of the European Bioinformatics Institute of the European Molecular Biology Laboratory, the study demonstrated that the institute generates a benefit to users and their funders of around 1.3 billion euros per year − just by making scientific information freely available to the global life science community. This is equivalent to more than 20 times the direct operational cost of the institute!

A similar study showed an even higher multiplier for the Human genome project.
Academia generates wealth, but hides it, except to the chosen few within the ivory tower. I’d love to know of Universities (I hope there are some) who measured their value to citizens.
Then, there is the moral case for open access. I think the public have the right to see the results of the research they have invested in.
In short, open access makes complete sense. It generates income, raises excellence and integrity, and involves the public in what they pay for. The question is rather how do we make the transition? Who pays and who benefits, and how do we do this together?
The great opportunity of the digital century is that anyone can technically take part. Most of the world is literate, most is computer-literate and those who aren’t are desperately trying to become so. It may take a few years, but we must have the vision that everyone, from primary school up, is a digital citizen and a digital scientist.
Because being a scientist is an attitude of mind, not lab coats or professorships. It includes

Asking questions to which you don’t know the answer
Reading and analysing previous answers
Exposing your scientific activity to others – which can be a harsh but necessary exercise
Collaborating. Few individuals have “the right answer”
Challenging those you don’t agree with, and being prepared to have to accept that you may often “be wrong”
Where possible collecting data and doing experiments (although these may be necessarily regulated)
Talking with other scientists and frequently revising views
Telling the world what you (singly and together) have done. Ideally as soon as you do it: “Open Notebook Science”

And when science is truly open then everyone can partake.
But when access to data, papers etc. is restricted in any way then there can be no Open.

And when minds are not fully Open, then actions are bureaucratic and formulaic:
“You should publish Open Access because you’ll get more citations”
“You have to publish Open Access if it’s to be counted for your career”

So when you think of Open, ask yourself questions like:

Does my Open Science mean Better Science? (as it should)
Am I sharing with the world in all directions?
Is knowledge getting freely to those who can most use it?
Am I inclusive in who I work with?

Because this is Moedas’ vision. It follows in the tradition of Commissioner Neelie Kroes who was passionate in promoting the Digital agenda … especially to 12-year olds.

But not all commissioners share Carlos Moedas’ views.
So please support Open Science in Europe, by arguing for it, building the bridges to citizens and actually doing it.


Open Letter to EC Carlos @Moedas on Open Science and ContentMining (TDM)

Posted on April 4, 2016 by pm286

Dear Commissioner Moedas,

I am an academic at the University of Cambridge UK determined to see published scientific knowledge brought to citizens. I also run contentmine.org, a non-profit which does this technically by content-mining (TDM) the complete scientific literature for facts.

I was inspired by your speech to EU2016NL yesterday [1] where you wholeheartedly promoted the Open Science agenda for Europe. I support all of your vision, but wish specifically to urge the unrestrained development of published science and content mining as a key tool. I was delighted to see your praise for the European Bioinformatics Institute. EBI hosts Europe PubmedCentral (EPMC), a collection of the world’s published biomedical literature, and I have worked with them for over 10 years. Here is a short video of how citizens can extract published factual Open science on the Zika virus from EPMC in less than 5 minutes [2].

It is critical to reform copyright law in Europe. It must go beyond the UK 2014 “Hargreaves” legislation (“personal non-commercial use”). I am probably one of only 2 UK groups using this, because it is heavily weighted against us. It depends on Universities allowing their staff to mine without explicit publisher permission. My anecdotal evidence is that many libraries will give in to publishers, sign restrictive contracts and regulate academic access [3] thereby negating the law.
We then have a problem publishing the results – as this may break copyright. Hargreaves allows freedom of quotation, but this is untested. In short, we must have legal clarity.

Changing the law is not enough; we must change hearts and minds. Not enough academics actively work with citizens and it’s critical that science is equally available to conservationists, doctors, policy makers, schools, patient groups, etc. This must not be controlled, however lightly, through the current publishers. Please find ways of actively involving citizens outside academia.

There has been massive lobbying by “rightsholders” against reform of content-mining. This includes FUD that (a) mining will break servers [4], (b) there is no demand [5], (c) you need publisher APIs [6], (d) only experts can do it [7], and (e) we will use this to steal content [8]. This is an asymmetric battle. I have watched the lobbyists spend millions on lobbying for watering down of Julia Reda’s EP proposals, and diluting and delaying any reform from the Commission. To redress the balance I’ll offer to come to Brussels and demonstrate on my (or your) laptop the value of ContentMining (TDM) for Open Science.

Peter Murray-Rust
Reader Emeritus, University of Cambridge

[1] http://europa.eu/rapid/press-release_SPEECH-16-1225_en.htm
[2] https://www.youtube.com/watch?v=5lYzOZ2Cv_I This video is shot in real time (5mins) demonstrating that any citizen can access knowledge on that timescale.
[3] http://onsnetwork.org/chartgerink/2016/02/23/wiley-also-stopped-my-doing-my-research/ A Dutch statistician (Chris Hartgerink) was mining the literature to detect scientific malpractice, and both Wiley and Elsevier wrote to the University of Tilburg (NL) to get his research stopped. The University complied with the publishers without any public comment.
[4] In Cambridge I can mine the whole daily scientific literature on my laptop in an hour. This is probably less than one millionth of the daily accesses made by other subscribers. And if there is a trusted cache, as suggested in the recent French proposals, then there is no problem of overload.
[5] Publishers have made this so difficult that no one asks. (/pmr/2011/11/27/textmining-my-years-negotiating-with-elsevier/ where I chronicle 5 wasted years trying to get anything from Elsevier).
[6] Our software can scrape publisher sites directly. And without external regulation I don’t trust any company to respect my privacy, nor to control the view presented through an API.
[7] We shall make our (Open) software available to MEP Julia Reda and we’d be delighted if you and other Commission staff wish to use it to see how easy it is.
[8] I am a responsible citizen and have no intention of making copyrighted content available illegally. I coined the slogan “The Right to Read is the Right to Mine”. Yet I and others have been branded as potential thieves.


With ContentMine you can now mine 100 papers/minute

Posted on April 4, 2016 by pm286

I have been silent on this blog for many months, not because I had nothing to say, but because ContentMine is saying it in software. In short, ContentMine is a new approach to extracting knowledge from the literature, but using technology that anyone can use. And that means anyone – not just academics, but citizens: school students, doctors, local government, conservation, patient groups, social enterprises… Anyone.

We presented this at two workshops last month.

Firstly a meeting of plant scientists at The Genome Analysis Centre (TGAC) in Norwich, which also included the John Innes Centre (JIC) and Sainsbury laboratories. These organisations are committed to the use of science to improve plants and agriculture, and knowledge is an increasingly critical part of this research.

A week later at the Cochrane annual meeting. “The [Cochrane] group was formed to organize medical research information in a systematic way to facilitate the choices that health professionals, patients, policy makers and others face in health interventions according to the principles of evidence-based medicine.” (Wikipedia). The primary basis of Cochrane is reviewing already published medical and related work and giving a systematic and objective analysis.

Both of these groups were very receptive to the idea of Open mining of the current literature. It doesn’t remove the need for humans – rather it allows them to work on the precise areas where humans are essential and most productive.

With mining techniques we can make the (Open) peer-reviewed literature available to anyone. You – and we mean you – can download 100 papers in a minute and analyse them for scientific concepts. It is astonishing what mining reveals literally within minutes. It goes beyond traditional search engines such as Google because the software picks out the common threads in the papers. It is a compelling demo of the value of Open, of mining, and also the accessibility of the taxpayer-funded research to the taxpayer.

The success of the approach depends on three main resources:

The Open scientific literature, especially through EuropePubmedCentral which has over a million Open papers.
Wikimedia, which includes Wikipedia and Wikidata. This is now becoming my first stop for trustable science and I’ll convince you it should be yours.
And http://contentmine.org where we have developed Open usable simple powerful technology for linking all this together.

There’s a 5-minute video (https://www.youtube.com/watch?v=5lYzOZ2Cv_I ) based on exploring knowledge about Zika that shows how this works.

The workshops have shown that you can now install the software yourself if you wish to. You need to be generally competent in installing a range of programs and using the commandline. I’ll cover the details in later posts.


Article Level Metrics – how reliable are they? (I prefer to read the paper)

Posted on December 7, 2015 by pm286

I am on the board of a wonderful community voluntary organization – the Crystallography Open Database (COD) http://www.crystallography.net/ . For 10 years it has been collecting crystal structures from the literature and making them Open – more than 300,000. It’s the only Open database for small structures (the others, CSD and ICSD, are closed and based on subscriptions even though the data is taken from public papers). This morning we heard of a great paper using the COD for data-driven research. Here’s the landing page http://www.mdpi.com/2073-4352/5/4/617 :

The paper is trawling through hundreds of thousands of structures to find those with a high proportion of hydrogen atoms – a clever idea for finding possible Hydrogen Stores for Energy. The closed databases couldn’t be used without subscriptions.
I think this is a clever idea and tweeted it. I’m a crystallographer and structural chemist so it’s not surprising that a few other people retweeted it.
However I noticed that the article had a daily count of accesses and that there was a small glitch of 3 accesses today. I tweeted this and – surprise – the accesses went up. After 12 hours there have been over 100 accesses.

You’ll see there have been over 100 accesses today because I and 3-4 others have tweeted it. This is nothing to do with the contents of the paper because not many have actually read it today. People have clicked to view the graph, and every time they visit the graph goes up. It’s nothing to do with the quality of the science (which I think is good) or the fact that the paper is Open Access – it’s just a Heisentwitter.
So what does 100 accesses today mean? Nothing.
What does my opinion of the paper count? Something, I hope (I would have recommended publication).
The point is that to decide whether science is good or useful
you have to read the paper


I urge my MEPs to reform European Copyright – please do the same

Posted on December 2, 2015 by pm286

I have written to my members of The European Parliament to argue for reform of Copyright to allow Text and Data Mining (TDM, “ContentMining”) for commercial and non-commercial purposes. This issue has been very high-profile this year and Commissioner Oettinger will present his recommendations soon, so it’s important that we let him and MEPs know immediately that we need a change in the law.
I urge you also to write to your MEPs. It’s easy – just use writetothem.org and it will work out who you should write to. You can use some of my letter, but personalise it to represent your own views and goals. MEPs take these letters seriously – and they are critical evidence against all the lobbying that they get from vested interests.

Dear Geoffrey Van Orden, Stuart Agnew, Vicky Ford, Tim Aker, Richard Howitt, Patrick O’Flynn and David Campbell Bannerman,
Reform of European Copyright to allow Text and Data Mining (TDM)
I am a scientist at the University of Cambridge and write to urge you to promote the reform of European laws and directives relating to Copyright; and particularly the current restrictions on Text and Data Mining (“ContentMining”). The reforms [1] that MEP Reda promoted to the European Parliament earlier this year are sensible, pragmatic and beneficial and I urge you to represent them to Commissioner Oettinger before he produces the policy document on the Digital Single Market (expected in early December 2015).
Science and medicine publishes over 2 million papers a year and billions of Euro’s worth of publicly funded research lie unused, since no human can read the current literature. That’s an opportunity cost (at worst people die) and potentially a huge new industry. I and colleagues have been working for many years to develop the technology and practice of mining (especially in bio- and chemical sciences) . I am convinced that Europe is falling badly behind the US. “Fair use” (see the recent “Google” [2] and “Hathi” books case) is now often held to allow the US, but not Europeans (with only “fair dealing” at best), to mine science and publish results.
Over several years I and others have tried to find practical ways forward, but the rightsholders (mainly mega publishers such as Elsevier/RELX, Springer, Wiley, Nature) have been unwilling to engage. The key issues is “Licences” , where rightsholders require readers to apply for further permissions (and maybe additional payments) just to allow machines to read and process the literature. The EC’s initiative “Licences for Europe” failed in 2013, with institutions such as LIBER, RLUK, and British Library effectively walking out [3]. Nonetheless there has been massive industry lobbying this year to try to convince MEPs , and Commissioners, that Licences are the way forward [4].
The issue is simply encapsulated in my phrase “The Right to Read is the Right to Mine”; if a human has the right to read a document, she should be allowed to use her machines to help her. We have found scientists who have to read 10,000 papers to make useful judgments (for example in systematic reviews of clinical trials, animal testing, and other critical evaluations of the literature). This can take 10-20 days of a highly skilled scientist’s time, whereas a machine can filter out perhaps 90%, saving thousands of Euros. This type of activity is carried out in many European laboratories, so the total waste is very significant.
Unfortunately the rightsholders are confusing and frightening the scientific and library community. Two weeks ago a Dutch statistician [5] was analysing the scientific literature on a large scale to detect important errors in the conclusions reached by statistical methods. After he had downloaded 30,000 papers, the publisher Elsevier demanded that the University (Tilburg) stop him doing his research, and the University complied. This is against natural justice and is also effectively killing innovation – it is often said that Google and other industries could not start in Europe because of restrictive copyright.
In summary, European knowledge workers require the legal assurance that they can mine and republish anything they can read, for commercial as well as non-commercial purposes. This will create a new community and industry of mining which will bring major benefits to Europe. See [6].
Peter Murray-Rust
[1]
https://juliareda.eu/copyright-evaluation-report-explained/
https://juliareda.eu/2015/07/eu-parliament-defends-freedom-of-panorama-calls-for-copyright-reform/
[2] http://fortune.com/2015/10/16/google-fair-use/
[3] https://edri.org/failure-of-licenses-for-europe/, http://ipkitten.blogspot.co.uk/2013/11/licences-for-europe-insiders-report.html
[4] The use of “API”s is now being promoted by rightsholders as a solution to the impasse. APIs are irrelevant; it is the additional licences (Terms and Conditions) which are almost invariably added.
[5] “Elsevier stopped me doing my research” http://onsnetwork.org/chartgerink/2015/11/16/elsevier-stopped-me-doing-my-research/
[6] http://contentmine.org/2015/11/contentmining-in-the-uk-a-contentmine-perspective/
Yours sincerely,

Peter Murray-Rust

Posted in Uncategorized | 3 Comments

ContentMining: My Video to Shuttleworth about our proposed next year

Posted on November 23, 2015 by pm286

I have had two very generous years of funding from the Shuttleworth Foundation to develop TheContentMine. Funding is in yearly chunks and each Fellow must reapply if s/he wants another year (up to 3). The mission is simple: change the world. As with fresh applicants, we write a 2-page account of where the world is at, and what we want to change and how.
TL;DR I have reapplied and submitted a 7 minute video (https://vimeo.com/146552838 ).
These two years have been a roller-coaster – seriously changed my life. I can honestly say that the Fellowship is one of the most wonderful organizations I know. We meet twice a year with about 20 fellows/alumni/team committed to making sure the world is more just, more harmonious, and that humanity and the planet have a better chance of prospering.
There’s no set domain of interest for applying, but Fellows have a clear sense of something new that could be done or something that badly needs mending. Almost everyone uses technology, but as a means, not as an end. And almost everyone is in some way building or enhancing a community. I can truly say that my fellow Fellows have achieved amazing things. Since we naturally live our lives openly you’ll find our digital footprints all over the Internet.
I’m not going to describe all the projects – you can read the web site and you may know several Fellows anyway.

Some are trying to fill a vacuum – do something exciting that is truly visionary – and I’ll highlight Dan Whaley’s https://hypothes.is/ . This project (and ContentMine is proud to be an associate) will bring annotation to documents on the Web. That sounds boring – but it’s as exciting as what TimBL brought with HTML and HTTP (which changed the world). Annotation can create a read-write web where the client (that’s YOU!) can alter/enhance our existing knowledge and it’s so exciting it’s impossible to see where it will go. The web has evolved to a server-centric model where organizations pump information at dumb clients and build walled gardens where you are trapped in their model of the world. Annotation gives you the freedom to escape, either individually or in subcommunities.
Others are challenging injustice – I’ll highlight two. Jesse von Doom (https://cashmusic.org/ ) is changing the way music is distributed – giving artists control over their careers. Johnny West (https://openoil.net/ ) is bringing transparency to the extractive industries. Did you know “BP” consists of over 1000 companies? Or where the fracking contracts in the UK are?

So when I launched TheContentMine as a project in 2014 we were in the first category. Few people were really interested in ContentMining and fewer were doing it. We saw our challenge as training people, creating tools, running workshops, and that was the theme of my first application (https://vimeo.com/78353557 ). Our vision was to create a series of workshops which would train trainers and expand the knowledge and practice of mining. And the world would see how wonderful it was and everyone would adopt it.
Naive.
In the first year we searched around for likely early adopters, and found a few. We built a great team – where everyone can develop their own approaches and tools – and where we don’t know precisely what we want for the future. And gradually we got known. So for the second year our application centred on tools and mining the (Open) literature (vimeo.com/110908526). It’s based on the idea that we’d work with Open publishers, show the value, and systematically extend the range of publishers and documents that we can mine. And that’s now also part of our strategy.
But then in 2014 politics…
The UK has already pushed for and won a useful victory for mining. We are allowed to mine any documents we have legal access to for “non-commercial research”. There was a lot of opposition from the “rights-holders” (i.e. mainstream TollAccess publishers to whom authors have transferred the commercial rights of their scientific papers). They’d also been fighting in Europe under “Licences for Europe” to stop the Freedom to mine. Indeed I coined the phrase “The Right to Read is the Right to Mine” and the term “Content Mining”. So perhaps when the UK passed the “Hargreaves” exception for mining, the publishers would agree that it was time to move on.
Sadly no.
2015 has seen the eruption of a full-scale conflict in the EU over the right to mine. In 2014 Julia Reda MEP was asked to create a proposal for reform of copyright in Europe’s Digital Single Market. (The current system is basically unworkable – laws are different in every country and arcanely bizarre [1]). Julia’s proposal was very balanced – it did not ask for copyright to be destroyed – and preserved rights for “rights-holders” as well as for re-users.
ContentMining (aka Text and Data Mining, TDM) has emerged as a totemic issue. There was massive publisher pushback against Julia’s proposal, epitomised in the requirement for licences [2]. There were over 500 amendments, many being simply visceral attacks on any reform. And there has been huge lobbying, with millions of Euros. Julia could get a free dinner several times over every night!
There is no dialogue and no prospect of reconciliation. There is simply a battle. (I am very sad to have to write this sentence.)
So ContentMine is now an important resource for Freedom. We are invited to work with reforming groups (such as LIBER who have invited us to be part of FutureTDM, an H2020 project to research the need for mining). And we accept this challenge by:

Advocacy. This includes working with politicians, legal experts, reformers, etc.
Software. Our software is unique, Open, and designed to help people discover and use ContentMining either with our support or independently.
Science. We are tackling real problems such as endangered species, and clinical trials.
Hands-on. We’ve developed training modules and also run hands-on workshops to explore scientific and technical challenges.
Partners. We’re working with university and national libraries, open publishers, and others.

So I’ve put this and more into the video. [3] This tells you what we are going to do and with whom. And I’ll explain the detail of what we are going to do in a future post.

[1] Read https://euobserver.com/justice/126375 and laugh, then weep. You cannot publish photos of the Eiffel Tower taken at night….
[2] Licensing effectively means that the publishers have complete control over who is allowed to mine content, and when, where and how (and we have seen Elsevier forbidding Chris Hartgerink to do research without their permission, see /pmr/2015/11/22/content-mining-why-do-universities-agree-to-restrictive-publisher-contracts/ and earlier blog posts).
[3] It’s a non-trivial amount of work. Approximately 1 PMR-day per minute of final video. It took time for the narrative to evolve (thanks to Jenny Molloy and Richard Smith-Unna for the polar bear theme). And it’s CC-BY.

Posted in Uncategorized | 1 Comment
