Article Level Metrics - how reliable are they? (I prefer to read the paper)

I am on the board of a wonderful community voluntary organization - the Crystallography Open Database (COD). For 10 years it has been collecting crystal structures from the literature and making them Open - more than 300,000. It's the only Open database for small structures (the others, CSD and ICSD, are closed and based on subscriptions even though the data is taken from public papers). This morning we heard of a great paper using the COD for data-driven research. Here's the landing page:



The paper is trawling through hundreds of thousands of structures to find those with a high proportion of hydrogen atoms - a clever idea for finding possible Hydrogen Stores for Energy. The closed databases couldn't be used without subscriptions.
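To make "a high proportion of hydrogen atoms" concrete, here is a minimal sketch of how such a screen could work. It is not the method used in the paper: the function name `hydrogen_fraction`, the flat formula strings and the three sample entries are all assumptions made for illustration, and deuterium ("D") would be counted as a separate element.

```python
import re

def hydrogen_fraction(formula: str) -> float:
    """Return the fraction of atoms in `formula` that are hydrogen.

    `formula` is assumed to be a flat formula such as "C2 H6 O" or "C2H6O"
    (no brackets, integer counts only).
    """
    counts = {}
    # Each match is an element symbol followed by an optional integer count.
    for element, number in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        counts[element] = counts.get(element, 0) + (int(number) if number else 1)
    total = sum(counts.values())
    if total == 0:
        raise ValueError(f"no atoms found in {formula!r}")
    return counts.get("H", 0) / total

# Hypothetical sample entries; a real screen would read formulae from the database.
structures = {"ethanol": "C2 H6 O", "ammonia borane": "B H6 N", "rock salt": "Na Cl"}
ranked = sorted(structures, key=lambda name: hydrogen_fraction(structures[name]), reverse=True)
print(ranked)
```

Ranking by atom fraction is only one possible criterion; a study of hydrogen storage would probably also care about the mass fraction of hydrogen.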

I think this is a clever idea and tweeted it. I'm a crystallographer and structural chemist so it's not surprising that a few other people retweeted it.

However I noticed that the article had a daily count of accesses and that there was a small glitch of 3 accesses today. I tweeted this and - surprise - the accesses went up. After 12 hours there have been over 100 accesses.



You'll see there have been over 100 accesses today because I and 3-4 others have tweeted it. This is nothing to do with the contents of the paper because not many have actually read it today. People have clicked to view the graph, and every time they visit the graph goes up. It's nothing to do with the quality of the science (which I think is good) or the fact that the paper is Open Access - it's just a Heisentwitter.

So what does 100 accesses today mean? Nothing.

What does my opinion of the paper count? Something, I hope (I would have recommended publication).

The point is that to decide whether science is good or useful

you have to read the paper


Posted in Uncategorized | 1 Comment

I urge my MEPs to reform European Copyright - please do the same

I have written to my members of The European Parliament to argue for reform of Copyright to allow Text and Data Mining (TDM, "ContentMining") for commercial and non-commercial purposes. This issue has been very high-profile this year and Commissioner Oettinger will present his  recommendations soon, so it's important that we let him and MEPs know immediately that we need a change in the law.

I urge you also to write to your MEPs. It's easy - just use write and it will work out who you should write to. You can use some of my letter, but personalise it to represent your own views and goals. MEPs take these letters seriously - and they are critical evidence against all the lobbying that they get from vested interests.

Dear Geoffrey Van Orden, Stuart Agnew, Vicky Ford, Tim Aker, Richard Howitt, Patrick O'Flynn and David Campbell Bannerman,

Reform of European Copyright to allow Text and Data Mining (TDM)

I am a scientist at the University of Cambridge and write to urge you to promote the reform of European laws and directives relating to Copyright; and particularly the current restrictions on Text and Data Mining (“ContentMining”). The reforms [1] that MEP Reda promoted to the European Parliament earlier this year are sensible, pragmatic and beneficial and I urge you to represent them to Commissioner Oettinger before he produces the policy document on the Digital Single Market (expected in early December 2015).

Science and medicine publish over 2 million papers a year, and billions of Euros' worth of publicly funded research lies unused, since no human can read the current literature. That’s an opportunity cost (at worst people die) and potentially a huge new industry. I and colleagues have been working for many years to develop the technology and practice of mining (especially in bio- and chemical sciences). I am convinced that Europe is falling badly behind the US. “Fair use” (see the recent “Google” [2] and “Hathi” books cases) is now often held to allow the US, but not Europeans (with only "fair dealing" at best), to mine science and publish results.

Over several years I and others have tried to find practical ways forward, but the rightsholders (mainly mega publishers such as Elsevier/RELX, Springer, Wiley, Nature) have been unwilling to engage. The key issue is “Licences”, where rightsholders require readers to apply for further permissions (and maybe additional payments) just to allow machines to read and process the literature. The EC’s initiative “Licences for Europe” failed in 2013, with institutions such as LIBER, RLUK, and British Library effectively walking out [3]. Nonetheless there has been massive industry lobbying this year to try to convince MEPs, and Commissioners, that Licences are the way forward [4].

The issue is simply encapsulated in my phrase “The Right to Read is the Right to Mine”; if a human has the right to read a document, she should be allowed to use her machines to help her. We have found scientists who have to read 10,000 papers to make useful judgments (for example in systematic reviews of clinical trials, animal testing, and other critical evaluations of the literature). This can take 10-20 days of a highly skilled scientist’s time, whereas a machine can filter out perhaps 90%, saving thousands of Euros. This type of activity is carried out in many European laboratories, so the total waste is very significant.

Unfortunately the rightsholders are confusing and frightening the scientific and library community. Two weeks ago a NL statistician [5] was analysing the scientific literature on a large scale to detect important errors in the conclusions reached by statistical methods. After he had downloaded 30,000 papers, the publisher Elsevier demanded that the University (Tilburg) stop him doing his research, and the University complied. This is against natural justice and is also effectively killing innovation - it is often said that Google and other industries could not start in Europe because of restrictive copyright.

In summary, European knowledge workers require the legal assurance that they can mine and republish anything they can read, for commercial as well as non-commercial purposes. This will create a new community and industry of mining which will bring major benefits to Europe. See [6].

Yours sincerely,

Peter Murray-Rust

[4] The use of “API”s is now being promoted by rightsholders as a solution to the impasse. APIs are irrelevant; it is the additional licences (Terms and Conditions) which are almost invariably added.
[5] “Elsevier stopped me doing my research”





ContentMining: My Video to Shuttleworth about our proposed next year

I have had two very generous years of funding from the Shuttleworth Foundation to develop TheContentMine. Funding is in yearly chunks and each Fellow must reapply if s/he wants another year (up to 3). The mission is simple: change the world. As with fresh applicants we write a 2-page account of where the world is at, what we want to change and how.

TL;DR I have reapplied and submitted a 7 minute video ( ).

These two years have been a roller-coaster – they have seriously changed my life. I can honestly say that the Fellowship is one of the most wonderful organizations I know. We meet twice a year with about 20 fellows/alumni/team committed to making sure the world is more just, more harmonious, and that humanity and the planet have a better chance of prospering.

There's no set domain of interest for applying, but Fellows have a clear sense of something new that could be done or something that badly needs mending. Almost everyone uses technology, but as a means, not as an end. And almost everyone is in some way building or enhancing a community. I can truly say that my fellow Fellows have achieved amazing things. Since we naturally live our lives openly you'll find our digital footprints all over the Internet.

I'm not going to describe all the projects – you can read the web site and you may know several Fellows anyway.

  • Some are trying to fill a vacuum – do something exciting that is truly visionary – and I'll highlight Dan Whaley's . This project (and ContentMine is proud to be an associate) will bring annotation to documents on the Web. That sounds boring – but it's as exciting as what TimBL brought with HTML and HTTP (which changed the world). Annotation can create a read-write web where the client (that's YOU!) can alter/enhance our existing knowledge and it's so exciting it's impossible to see where it will go. The web has evolved to a server-centric model where organizations pump information at dumb clients and build walled gardens where you are trapped in their model of the world. Annotation gives you the freedom to escape, either individually or in subcommunities.
  • Others are challenging injustice – I'll highlight two. Jesse von Doom ( ) is changing the way music is distributed – giving artists control over their careers. Johnny West ( ) is bringing transparency to the extractive industries. Did you know “BP” consists of over 1000 companies? Or where the fracking contracts in the UK are?

So when I launched TheContentMine as a project in 2014 we were in the first category. Few people were really interested in ContentMining and fewer were doing it. We saw our challenge as training people, creating tools, running workshops, and that was the theme of my first application ( ). Our vision was to create a series of workshops which would train trainers and expand the knowledge and practice of mining. And the world would see how wonderful it was and everyone would adopt it.


In the first year we searched around for likely early adopters, and found a few. We built a great team – where everyone can develop their own approaches and tools – and where we don't know precisely what we want for the future. And gradually we got known. So for the second year our application centred on tools and mining the (Open) literature ( ). It's based on the idea that we'd work with Open publishers, show the value, and systematically extend the range of publishers and documents that we can mine. And that's now also part of our strategy.

But then in 2014 politics...

The UK has already pushed for and won a useful victory for mining. We are allowed to mine any documents we have legal access to for “non-commercial research”. There was a lot of opposition from the “rights-holders” (i.e. mainstream TollAccess publishers to whom authors have transferred the commercial rights of their scientific papers). They'd also been fighting in Europe under “Licences for Europe” to stop the Freedom to mine. Indeed I coined the phrase “The Right to Read is the Right to Mine” and the term “Content Mining”. So perhaps when the UK passed the “Hargreaves” exception for mining, the publishers would agree that it was time to move on.

Sadly no.

2015 has seen the eruption of a full-scale conflict in the EU over the right to mine. In 2014 Julia Reda MEP was asked to create a proposal for reform of copyright in Europe's Digital Single Market. (The current system is basically unworkable – laws are different in every country and arcanely bizarre [1]). Julia's proposal was very balanced – it did not ask for copyright to be destroyed – and preserved rights for “rights-holders” as well as for re-users.

ContentMining (aka Text and Data Mining, TDM) has emerged as a totemic issue. There was massive publisher pushback against Julia's proposal, epitomised in the requirement for licences [2]. There were over 500 amendments, many being simply visceral attacks on any reform. And there has been huge lobbying, with millions of Euros. Julia could get a free dinner several times over every night!

There is no dialogue and no prospect of reconciliation. There is simply a battle. (I am very sad to have to write this sentence)

So ContentMine is now an important resource for Freedom. We are invited to work with reforming groups (such as LIBER who have invited us to be part of FutureTDM, an H2020 project to research the need for mining). And we accept this challenge by:

  • Advocacy. This includes working with politicians, legal experts, reformers, etc.
  • Software. Our software is unique, Open, and designed to help people discover and use ContentMining either with our support or independently.
  • Science. We are tackling real problems such as endangered species, and clinical trials.
  • Hands-on. We've developed training modules and also run hands-on workshops to explore scientific and technical challenges.
  • Partners. We're working with university and national libraries, open publishers, and others.

So I've put this and more into the video. [3] This tells you what we are going to do and with whom. And I'll explain the detail of what we are going to do in a future post.


[1] Read and laugh, then weep. You cannot publish photos of the Eiffel Tower taken at night....

[2] Licensing effectively means that the publishers have complete control over who is allowed to mine content, and when, where and how (and we have seen Elsevier forbidding Chris Hartgerink to do research without their permission, see and earlier blog posts).

[3] It's a non-trivial amount of work. Approximately 1 PMR-day per minute of final video. It took time for the narrative to evolve (thanks to Jenny Molloy and Richard Smith-Unna for the polar bear theme). And it's CC-BY.



Content-mining; Why do Universities agree to restrictive publisher contracts?

[I published a general blog about the impasse between digital scholars and the Toll-Access publishers. This is followed by a series of detailed posts which look at the details and consequences. This is the second.]

If you have read these earlier posts you will know that the issue is whether I and others are allowed to use machines to read publications we have legal access to read with our eyes.

The (simplified) paradigm for Content-mining scholarly articles consists of:

  • finding links to papers (articles) we may be interested in (“crawling”). The papers may be on publishers' web sites (visible or behind paywall) or in repositories (visible). Most of this relates to paywalled articles.

  • downloading these papers from (publisher) servers onto local machines (clients) (“scraping”). If paywalled this requires paid access (subscription) which is only available to members of the subscribing institution. Thus I can read thousands of articles to which Cambridge University has a subscription.

  • Running software to extract useful information from the papers (“mining”). This information can be chunks of the original or reworked material.

  • (for responsible scientists – including me) publish the results in full.

This is technically possible. Messy, if you start from scratch, but we and others have created Open Source tools and services to help.
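As a minimal sketch of the last two steps (mining and publishing), assume the papers have already been downloaded and converted to plain text. The names `mine`, `mine_corpus` and `publish`, the vocabulary and the sample texts are all invented for illustration; this is not the ContentMine toolchain, which does a great deal more.

```python
import json
import re

def mine(text, vocabulary):
    """Return the vocabulary terms that occur in `text` (case-insensitive, whole words)."""
    found = []
    for term in vocabulary:
        if re.search(r"\b" + re.escape(term) + r"\b", text, flags=re.IGNORECASE):
            found.append(term)
    return sorted(found)

def mine_corpus(papers, vocabulary):
    """Apply `mine` to every paper; `papers` maps an identifier to its plain text."""
    return {paper_id: mine(text, vocabulary) for paper_id, text in papers.items()}

def publish(results):
    """Serialise the results in full as JSON so that others can reuse them."""
    return json.dumps(results, indent=2, sort_keys=True)

# Hypothetical stand-ins for papers that were scraped earlier.
papers = {
    "paper-001": "We surveyed Litoria rheocola along three rainforest streams.",
    "paper-002": "The crystal structure was solved at 100 K.",
}
vocabulary = ["Litoria rheocola", "crystal structure", "clinical trial"]
print(publish(mine_corpus(papers, vocabulary)))
```

Dictionary lookup is the simplest form of mining; the point of the sketch is only that the output is a small set of extracted facts, not a copy of the paper.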

The problem is that Toll-Access publishers don't want us to do it (or only under unworkable restrictions). So what stops us?


What follows is simplistic and IANAL (I am not a lawyer) though I talk with people who are. I am happy to be corrected by people more knowledgeable than me.

There are two main types of law relevant here:

  • Copyright law. TL;DR: any copying may infringe copyright and allow the “rights-holder” to sue. The burden of proof is lower than in criminal cases: “However, in a civil case, the plaintiff must simply convince the court or tribunal that their claim is valid, and that on balance of probability it is likely that the defendant is guilty”. Copyright law varies between countries and can be extraordinarily complex, and it is difficult to get clear answers. The simple, and sad, default assumed by many people and promoted by many vendors is that readers have no rights. (The primary method of removing these restrictions is to add a licence (such as CC-BY) which is compatible with copyright law and explicitly gives rights to the reader/user).

  • Contract law.
    Here the purchasers of goods and services (e.g. Universities) may agree a contract with the vendors (Publishers) that gives rights and responsibilities to both. In general these contracts are not publicised to users like me and may even be secret. Therefore some of what follows is guesswork. There are also hundreds of vendors and a wide variation in practice. However we believe that the main STMPublishers have roughly similar contracts.

    In general these contracts are heavily weighted in favour of the publisher. They are written by the publisher and offered to the purchaser to sign. If the University doesn't like the conditions they have to “negotiate” with the publisher. Because there is no substitutability of goods (you can't swap Nature with J. Amer. Chem. Soc.) the publisher often seems to have an advantage.

    The contracts contain phrases such as “you may not crawl our site, index it, spider it, mine it, etc.” These are introduced by the publisher to stop mining. (There is already copyright law to prevent the republishing of material without permission, so the new clauses are not required.) I queried a number of UK Universities as to what they had signed – some were constructive in their replies but many – unfortunately – unhelpful.

    However there is no legal reason why a University has to sign the contract put in front of them. But they do, and they have signed clauses which restrict what I and Chris Hartgerink and other scientists can do. And they do it without apparent internal or external consultation.

    And this was understood by the Hargreaves reform which specifically says that text-miners can ignore any contracts which stop them doing it. Presumably they reasoned that vendors pressure Universities into signing our rights away, and this law protects us. And, indeed it's critically important for letting us proceed.

But this law doesn't (yet) apply to NL and so can't help Chris (except when he comes to UK). We want it changed, and library organizations such as LIBER, RLUK, BL etc. want it changed.

So this mail is to ask Universities – and I expect their libraries will answer:




And then we'll work out how to help.


Content-mining; Why do Publishers insist on APIs and forbid screen scraping?

[I published a general blog about the impasse between digital scholars and the Toll-Access publishers . This is the first of a number of posts which look at the details and consequences]

Chris Hartgerink described how Elsevier have stopped him doing content-mining:


There is a lot of comment on both of these, to which I may refer but will not reproduce in detail. It informs my comments. The key issue is “APIs”, commented on by Elsevier's Director of Access & Policy (EDAP):

Dear Chris,

We are happy for you to text mine content that we publish via the ScienceDirect API, but not via screen scraping. You can get access to an API key via our developer’s portal ( ). If you have any questions or problems, do please let me know. If helpful, I am also happy to engage with the librarian who is helping you.

With kind wishes,

Dr Alicia Wise
Director of Access & Policy

The TAPublishers wish contentmining to be done through their APIs and forbid (not merely discourage) screenscraping. On the surface this may look like a reasonable request – and many of us use APIs – but there are critically important and unacceptable aspects.

What is screen scraping and what is an API?

Screen scraping simulates the action of a human reading web pages via a browser. You feed the program (ours is “quickscrape”) a URL and it will retrieve the HTML “landing page”. Then it finds links in the landing page which refer to additional documents and downloads them. If this is done responsibly (as quickscrape does) it causes no more problem for the server than a human. Any publisher who anticipates large numbers of human readers has to implement server software which must be robust. (I run a server, and the only time it's had problems is when I have been the focus of interest on Slashdot or Reddit, which are multi-human sites). A well-designed polite screen scraper like “quickscrape” will not cause problems to modern sites.
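The two steps described above (find the document links in a landing page, then fetch them no faster than a human would) can be sketched with the Python standard library. This is an illustration under stated assumptions, not how quickscrape is implemented: `LinkCollector`, `find_document_links`, `fetch_politely`, the file suffixes and the 5-second delay are all hypothetical choices, and the caller supplies the actual download function.

```python
import time
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkCollector(HTMLParser):
    """Collect the absolute URL of every <a href="..."> in a page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href:
            # Resolve relative links against the landing page URL.
            self.links.append(urljoin(self.base_url, href))

def find_document_links(landing_html, base_url, suffixes=(".pdf", ".xml", ".cif")):
    """Return links in the landing page that look like article documents."""
    collector = LinkCollector(base_url)
    collector.feed(landing_html)
    return [link for link in collector.links if link.lower().endswith(suffixes)]

def fetch_politely(urls, fetch, delay_seconds=5.0):
    """Call `fetch(url)` for each URL, pausing between requests like a human reader."""
    results = {}
    for index, url in enumerate(urls):
        if index:
            time.sleep(delay_seconds)
        results[url] = fetch(url)
    return results
```

Because the pause sits in the client, the load on the server is bounded by the scraper's own behaviour: one request every few seconds is far less than a popular article receives from human readers.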

Screen-scraping can scrape a number of components from the web page. These differ for every publisher or journal, and for science this MAY include:

  • the landing page
  • article metadata (often in the landing page)
  • abstract (often in the landing page)
  • fulltext HTML
  • fulltext PDF
  • fulltext XML (often only on Open Access publishers' websites, otherwise behind paywall)
  • references (citations)
  • required files (e.g. lists of contributors, protocols)
  • supporting scientific information / data (often very large). A mixture of TXT, PDF, CSV, etc.
  • images
  • interactive data, e.g. 3D molecules

An excellent set of such files is in Acta Crystallographica journals (e.g. ) where the buttons represent such files.

I and colleagues at Cambridge have been screen-scraping many journals in this way for about 10 years to get crystallographic data for research and have never been told we have caused a problem. We have contributed our output to the excellent Free/Open Crystallography Open Database.

So I reject the idea that screenscraping is a problem, and regard the EDAP's argument as FUD. I say that because despite the EDAP's assertion that they are trying to help us, the reverse is true. I have spent 5 years of my life batting emails back and forth and got nowhere ( ), and you should prepare for the same.

An API ( ) allows a browser or program to request specific information or services from a server. It's a precise software specification which should be documented precisely and where the client can rely on what the server can provide. At EuropePMC there is such an API, and we use it frequently, including in our “getpapers” tool. Richard Smith-Unna in Plant Sciences (Univ Cambridge) and in ContentMine has written a “wrapper” which issues queries to the API and stores the results.

When well written, and where there is a common agreement on rights, then APIs are often, but not always, a useful way to go. Where there is no common agreement they are unacceptable.
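For readers who have not used one, this is roughly what a client of a search API does: build a request to the documented specification, then read back the structured reply. The sketch below is not the getpapers code. The endpoint, the parameter names (`query`, `format`, `pageSize`) and the `resultList`/`result` reply shape are my understanding of the Europe PMC REST service and should be checked against its documentation; the function names and the sample reply are invented.

```python
import json
from urllib.parse import urlencode

# Assumed endpoint of the Europe PMC search service; verify before relying on it.
BASE = "https://www.ebi.ac.uk/europepmc/webservices/rest/search"

def build_search_url(query, page_size=25):
    """Build the URL for a search request (nothing is sent over the network)."""
    params = {"query": query, "format": "json", "pageSize": page_size}
    return BASE + "?" + urlencode(params)

def titles_from_response(body):
    """Extract article titles from a JSON reply of the assumed shape."""
    data = json.loads(body)
    results = data.get("resultList", {}).get("result", [])
    return [record["title"] for record in results if "title" in record]

# A hand-written reply standing in for what the server might return.
sample_reply = json.dumps(
    {"hitCount": 1, "resultList": {"result": [{"id": "0", "title": "An example title"}]}}
)
print(build_search_url("Litoria rheocola", page_size=10))
print(titles_from_response(sample_reply))
```

Note what the client cannot see: which records the server chose to return, and what it logged about the request. That is why the terms under which an API is offered matter more than its technical design.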

Why Elsevier and other TAPublishers' APIs are unacceptable.

There are several independent reasons why I and Chris Hartgerink and others will not use TAPublisher APIs. This is unlikely to change unless the publishers change the way they work with researchers and accept that researchers have fundamental rights.

  • An API gives total control to the server (the publisher) and no control to the client (reader/user).

That's the simple, single feature that ultimately decides whether an API is acceptable. The only test under which I would use one, and which I would urge you to consider, is:

  • is there a mutually acceptable public contract between the publisher and the researcher?

In this case, and the case of all STMPublishers, NO. Elsevier has written its own TandC. It has done this without even the involvement of the purchasing customer. I doubt that any library, any library organization, any university, or any university organization has publicly met with Elsevier or the STMPublishers' association and agreed mutually satisfactory terms.

All the rest is secondary. Very important secondary, which I'll discuss. But none of this can be mended without giving the researcher their rights.

Some of the consequences (which have already happened) include:

  • The TandC are very heavily biassed towards Elsevier's interests and say virtually nothing about the user's interests.
  • The TandC can change at any time (and do so) without negotiation.
  • The API can change at any time.
  • There is no guaranteed level of software design or service. When (not if) it breaks we are expected to find and report Elsevier bugs. There is no commitment to mend them.
  • The API is designed and built by the publisher without the involvement of the researcher. Quite apart from the contractual issues this is a known way of producing bad software.
  • The researcher has no indication of how complete or correct the process is. The server can give whatever view of the data they wish.
  • The researcher has no privacy.
  • (The researcher probably has no legal right to sign the TandC for the API – it is the University that contracts with the publisher.)
  • The researcher contracts only to publish results as CC-NC, which debars them from publishing in Open Access journals.
  • The researcher contracts not to publish anything that will harm Elsevier's marketplace. This immediately rules me out as publishing chemistry will compete with Elsevier database products.

So the Elsevier API is an instrument for control.

To summarise, an API:

  • Allows the server to control what, when, how, how much, in what format the user can access the resource. It is almost certain that this will not fit how researchers work. For example, the Elsevier API does not serve images. That already makes it unusable for me. I doubt it serves supplemental data such as CIFs either. If I find problems with EuropePMC API I discuss this with the European Bioinformatics Institute. If I have problems with Elsevier API I ...
  • Can monitor all the traffic. I trust EuropePMC to behave responsibly as it has a board of governance (including one I have been on). It allows anonymity. With Elsevier I … In general no large corporate can be trusted with my data, which here includes what I did, when, what I was looking at and allows a complete history of everything I have done. From that machines can work out a great deal more, and sell it to people I don't even know exist.


  • APIs can be well written or badly written. Do you, the user, have an involvement?
  • Their use can be voluntary or mandatory. Is the latter a problem?
  • Is there a guarantee of privacy and non-use of data?
  • Do you know whether the API gives the same information as the screen-scraper? (Almost certainly not, but how does it differ?)
  • what do you have to sign up to? Was it agreed by a body you trust?


APIs are being touted by Elsevier and other STMPublishers as the obvious friendly answer to Mining. In their present form, and with present Terms and Conditions they are completely unacceptable and very dangerous.

They should be absolutely rejected. Ask your library/university to cancel all clauses in contracts which forbid mining by scraping. They have the legal right to do so.



Content-mining; Rights versus Licences

[I intend to follow with several more detailed posts.]

Last week was a critical point for those who regard the scholarly literature as a public good, rather than a business. Those who care must now speak out, for if they do not, we shall see a cloud descend over the digital century where we are disenfranchised and living in enclosures and walled gardens run by commercial mega-corporations.

Chris Hartgerink, a statistician at the University of Tilburg NL, was using machines to read scholarly literature to do research (“content-mining”). Elsevier, the mega-publisher, contacted the University and required them to stop Chris. The University complied with the publisher and Chris is now forbidden to do research using mining without Elsevier's permission.

Some reports include:

The issues are simple:

  • Chris has rightful access to the literature and can read it with his eyes.

  • His research is serious, valuable and competent.

  • Machines can save Chris weeks of time and prevent many types of error.

What Chris has been doing has been massively resisted by mainstream “TAPublishers” [1]. This includes:

  • lobbying to reject proposed legislation (often by making it more restrictive).

  • producing FUD (“Fear Uncertainty and Doubt”) aimed at politicians, libraries and researchers such as Chris. Note that “stealing” is now commonly used in TAPublisher-FUD.

  • physically preventing mining (e.g. through CAPTCHAs).

  • preventing mining through contractual or legal means (as with Chris).

Many of us met in The Hague last year to promote this type of new and valuable research, and wrote The Hague Declaration . A wide range of organisations and individuals, including universities, libraries, and liberal publishers, have signed. This is often represented by my phrase “The Right to Read is the Right to Mine”.

Many reformers, led initially by Neelie Kroes (European Commissioner till 2014) and now by Julia Reda (MEP) have pushed for reforms of copyright to allow and promote mining. The European Parliament and the Commission have produced in-depth proposals for liberalising European law.

The reality is that reformers and the Publishers have little common ground on mining. Reformers are campaigning for their Rights; TAPublishers are trying to prevent this. This is often encapsulated in additional mining “Licences” proposed by TAPublishers. This is epitomised by the STMPublisher-lobbied “Licences for Europe”, proposed in 2013 in Commission discussions, which broke down completely as the reformers were not prepared to accept it.

The TAPublishers are trying to coerce the scholarly and wider community into accepting Licences; we are challenging this by asserting our Rights.

Unfettered Access to Knowledge is as important in the Digital Century as food, land, water, and slavery have been over the millennia.

The issue for Chris and others is:

  • Can I read the literature I have access to

    1. in the way I want,

    2. for any legal purpose

    3. using machines when appropriate

    4. without asking for further permission

    5. or telling corporations who I am and what I am doing

    6. and publishing the results in the open literature without constraints

Chris has the moral right to do 1-6, but not the legal right, because the TAPublishers have added restrictions to the subscription contracts, and his University has signed them. He is therefore (probably) bound by NL contract law.

In the UK the situation is somewhat better. Last year a copyright Exception was enacted which allows me to do much of this. (2) has to be for “non-commercial research” and (6) would only be permissible if I don't break copyright in publishing the results. So I can do something useful (although not nearly as much as I want to do, and as responsible science requires). I know also that I will have constant opposition from publishers, probably including lobbying of my institution.

European reformers are pushing for a similar legal right in Europe and many propose removing the “non-commercial” clause. There is MASSIVE opposition from publishers, primarily through lobbying, where key politicians and staff are constantly fed the publishers' story. There is no public forum (such as a UK Select Committee) where we can show the fallaciousness of TAPublisher arguments. (This is a major failing of European democracy – much of it happens in personal encounters with unelected representatives who have no formal responsibility to the people of Europe). The fight – and it is a fight – is therefore hugely asymmetric. If we want to represent our views we have to travel to Brussels at our own expense – TAPublishers have literally billions.

The issue is RIGHTS (not APIs, not bandwidth, not cost, not FUD about server-load, not convenience)


I hope you feel that this is the time to take a stand.

What can we do?

Some immediate and cost-free tasks:

  • Sign the Hague Declaration. Very few European University / Libraries have so far done so

  • Write to your MEP. Feel free to take this mail as a basis, but personalise it

  • Write to Commissioner Oettinger (“Digital Single Market”)

  • Write to your University and your University Library. Use Freedom of Information to require that they reply. Challenge the current practice

  • Alert your learned society to the muzzling of science and scholarship.

  • Alert organizations who are campaigning for Rights in the Digital age.

  • Tweet this post, and push for retweets

And think about what ContentMining could do for you. And explore with us.

And what are PMR and colleagues going to do?

Because I have the legal right to mine the Cambridge subscription literature for non-commercial purposes, I and colleagues are going to do that. Ross Mounce and I have already shown that totally new insights are possible (see We've developed a wide range of tools and we'll be working on our own research and also with the wider research community in areas that we can contribute to.

[1]. There is a spectrum of publishing ranging from large conventional, often highly profitable, publishers through learned societies, to new liberal startups in the last 10 years. I shall use “TAPublisher” (TollAccess publisher) to refer to publishers such as (but not limited to) Elsevier, Wiley, Springer, Macmillan, Nature. They are represented by an association (STMPublishers Association) which effectively represents their interests and has been active in developing and promoting licences.


Extracting 100 million facts from the Scientific literature -1

In TheContentMine, funded by the Shuttleworth Foundation, we aim to extract 100 million facts from the scientific literature. That's a large number so here's our thinking....

What is a fact?

Common Mist Frog (Litoria rheocola) of northern Australia. Credit: Damon Ramsey / Wikimedia / CC BY-SA


"Fact" means slightly different things to philosophers, lawyers, scientists and advertisers. Wikipedia describes facts in science,

a fact is a repeatable careful observation or measurement (by experimentation or other means), also called empirical evidence. Facts are central to building scientific theories.


a scientific fact is an objective and verifiable observation, in contrast with a hypothesis or theory, which is intended to explain or interpret facts.

In ContentMine we highlight the "objective", i.e. people will agree that they are talking about the same things in the same way. We concentrate on facts reported in scientific papers and my colleague Ross Mounce showed (Daily updates on IUCN Red List species) some excellent examples about a mistfrog [1]. Here are some examples he quoted:

  • Litoria rheocola is a small treefrog (average male body size: 2.0 g, 31 mm; average female body size: 3.1 g, 36 mm [20])
  • the common mistfrog (Litoria rheocola), an IUCN Endangered species [22] that occurs near rocky, fast-flowing rainforest streams in northeastern Queensland, Australia [23]…
  • we tracked frogs using harmonic direction finding [32,33].
  • individuals move along and at right angles to the stream
  • Fig 3. Distribution of estimated body temperatures of common mistfrogs (Litoria rheocola) within categories relevant to Batrachochytrium dendrobatidis growth in culture (<15°C, 15–25°C, >25°C). [Figure]

All of the above, including the components of the graph, are FACTS. They have the features:

  • they are objective. They may or may not be "true" - another author might dispute the sizes of the frogs or where they live - but the authors have stated them as facts.
  • they can be represented in formal representations without losing meaning or precision. There are normally very few different ways of representing such facts. "Alauda arvensis sing on the wing" is a fact. "Hail to thee blithe spirit, bird thou never wert" is not a fact.
  • they are uncopyrightable. We contend that all the facts we extract are uncopyrightable statements and therefore release them as CC0.

How do we represent facts? Generally they are a mixture of simple natural language statements and formal specifications. "A brown frog that lives in Queensland" is adequate; "L. rheocola. colour: brown; habitat: Queensland" says the same, slightly more formally.  Formal language is useful for us as it's easier to extract. The form:

object: name;    property1: value1 ; property2: value2

is very common and very useful. Often it's put in a table, graph or diagram. Transforming between these is one of the strengths of ContentMine software. The box plots could be put in words: "In winter in Windin Creek between 0 and 12% of the frogs had body temperatures below 15 Celsius", but the plot may be more useful to some scientists (note the redundancy!).
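To show why this form is easy to extract, here is a minimal sketch in Python. It is an illustration only, not ContentMine's actual software: the function name `parse_fact` is hypothetical, and it assumes that properties are separated by semicolons and that each property is a `key: value` pair.

```python
def parse_fact(statement: str) -> dict[str, str]:
    """Turn 'object: name; property1: value1; ...' into a dictionary.

    Hypothetical helper for illustration; assumes ';' separates the
    properties and the first ':' in each part separates key from value.
    """
    fact = {}
    for part in statement.split(";"):
        part = part.strip()
        if not part:
            continue  # tolerate a trailing semicolon
        key, sep, value = part.partition(":")
        if not sep:
            raise ValueError(f"no 'key: value' pair in {part!r}")
        fact[key.strip()] = value.strip()
    return fact


record = parse_fact("object: L. rheocola; colour: brown; habitat: Queensland")
print(record)
```

Once a statement is in this shape it can be written out as a table row or passed to other tools without losing meaning, which is the point of preferring the formal form.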

So the scientific observations - temperatures, locations, dates - are all Facts. The sentence contains 7 factual concepts: winter, Windin Creek, 0%, 12%, L. rheocola, body temperature, < 15 C. In ContentMine we refer to all of these as "Facts". Perhaps more formally we might say:
section-i/para-j/sentence-k in DOIjournal.pone.0127851 contains Windin Creek

section-i/para-j/sentence-k in DOIjournal.pone.0127851 contains L. rheocola

Those who like RDF (we sometimes use it)  may regard these as triples (document-component contains entity). In similar manner the linked data as in Wikidata should be regarded as Facts (which is why we are working with Wikidata to export extracted facts there).
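As a rough sketch of the triple idea (document-component contains entity), the two statements above could be held in Python like this. The names `Triple` and `contains_triples` are hypothetical, the `#` used to join the DOI to the component path is my own assumption, and the DOI is the one given in footnote [1]; this is not the format ContentMine or Wikidata actually uses.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    subject: str    # the document component (where the fact was found)
    predicate: str  # the relation, here always "contains"
    obj: str        # the entity found in that component


def contains_triples(doi: str, component: str, entities: list[str]) -> list[Triple]:
    """Build one 'contains' triple per entity for a single document component."""
    subject = f"{doi}#{component}"  # assumed way of addressing a component
    return [Triple(subject, "contains", entity) for entity in entities]


triples = contains_triples(
    "10.1371/journal.pone.0127851",
    "section-i/para-j/sentence-k",
    ["Windin Creek", "L. rheocola"],
)
for t in triples:
    print(t.subject, t.predicate, t.obj)
```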

How many facts does a scientific paper contain?

Every entity in a document is a Fact. Each author, each species, each temperature, date, colour. A graph may have 100 facts, or 1000. Perhaps 100 facts / page? A 10-page paper might have 1000 facts. Some chemistry papers have 300 pages of supporting information. So if we read 1 million papers we might get a billion facts - our estimate of 100 million is not hyperbole.
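The back-of-envelope arithmetic can be written out as below. The inputs are the rough guesses from the paragraph above (100 facts per page, 10 pages per paper), not measurements, and the variable names are just for illustration.

```python
# Rough guesses taken from the text above, not measured values.
facts_per_page = 100
pages_per_paper = 10
papers_read = 1_000_000
target_facts = 100_000_000

facts_per_paper = facts_per_page * pages_per_paper
total_facts = facts_per_paper * papers_read
papers_needed = target_facts // facts_per_paper

print(f"{facts_per_paper:,} facts per paper")
print(f"{total_facts:,} facts from {papers_read:,} papers")
print(f"{papers_needed:,} papers needed to reach {target_facts:,} facts")
```

On these guesses the 100 million target needs only about a tenth of a million papers, which is why the estimate looks conservative.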


[1] reported in PLoS ONE (Roznik EA, Alford RA (2015) Seasonal Ecology and Behavior of an Endangered Rainforest Frog (Litoria rheocola) Threatened by Disease. PLoS ONE 10(5): e0127851. doi:10.1371/journal.pone.0127851).


Should Wikipedia work with Elsevier?

This story has erupted in the last 2 days - if it had been earlier I would have covered it at my talk to Wikipedia Science.

TL;DR. Elsevier has granted accounts to 45 top editors at Wikipedia so they can read closed access publications as part of their editing. I strongly oppose this and say why. BTW I consider myself a committed Wikipedian.

Glyn Moody in Ars Technica has headlined:

“WikiGate” raises questions about Wikipedia’s commitment to open access

Glyn mailed me for my opinion and the piece, which is accurate, also highlights Michael Eisen's opposition to the new move. I'll cut and paste large chunks and then add additional comment.

Scientific publisher Elsevier has donated 45 free ScienceDirect accounts to "top Wikipedia editors" to aid them in their work. Michael Eisen, one of the founders of the open access movement, which seeks to make research publications freely available online, tweeted that he was "shocked to see @wikipedia working hand-in-hand with Elsevier to populate encylopedia w/links people cannot access," and dubbed it "WikiGate." Over the last few days, a row has broken out between Eisen and other academics over whether a free and open service such as Wikipedia should be partnering with a closed, non-free company such as Elsevier.

Eisen's fear is that the free accounts to ScienceDirect will encourage Wikipedia editors to add references to articles that are behind Elsevier's paywall. When members of the public seek to follow such links, they will be unable to see the article in question unless they have a suitable subscription to Elsevier's journals, or they make a one-time payment, usually tens of pounds for limited access.

Eisen went on to tweet: "@Wikipedia is providing free advertising for Elsevier and getting nothing in return," and that, rather than making it easy to access materials behind paywalls, "it SHOULD be difficult for @wikipedia editors to use #paywalled sources as, in long run, it will encourage openness." He called on Wikipedia's co-founder, Jimmy Wales, to "reconsider accommodating Elsevier's cynical use of @Wikipedia to advertise paywalled journals." His own suggestion was that Wikipedia should provide citations, but not active links to paywalled articles.

Agreed. It is not only providing free advertising, but worse, it implicitly legitimizes Elsevier's control of the scientific literature. Rather than making it MORE accessible to the citizens of the world, it makes it LESS.

Eisen is not alone in considering the Elsevier donation a poisoned chalice. Peter Murray-Rust is Reader Emeritus in Molecular Informatics at the University Of Cambridge, and another leading campaigner for open access. In an email to Ars, he called the free Elsevier accounts "crumbs from the rich man's table. It encourages a priesthood. Only the best editors can have this. It's patronising, ineffectual. And I wouldn't go near it."

This arbitrary distinction between the 45 top editors and everyone else is seriously divisive. Even if this were a useful approach (it isn't) why should Elsevier decide who can, and who can't, be a top Wikipedia editor? Wikipedia has rightful concerns about who and how editors are "appointed" - it's meritocratic and, though imperfect, any other solution (cf. Churchill on democracy) is worse.

You may think I am overreacting - that Elsevier will behave decently and collaboratively. I've spent 6 years trying to "negotiate" with Elsevier about Content Mining - and it's one smokescreen after another. They want to develop and retain control over scholarship.

And I have additional knowledge. I've been campaigning for reform in Europe (including UK) and everywhere the publishers are fighting us. Elsevier wants me and collaborators to "licence" the right to mine - these licences are designed to make Elsevier the central control. I would strongly urge any Wikipedian to read the small print and then run a mile.

This isn't the first time that Wikipedia has worked closely with a publisher in this way. The Wikipedia Library "helps editors access reliable sources to improve Wikipedia." It says that it supports "the broader move towards open access," but it also arranges Access Partnerships with publishers: "You would provide a set number of qualified and prolific Wikipedia editors free access to your resources for typically 1 year." As Wikipedia Library writes: "We also love to collaborate on social media, press releases, and blog posts highlighting our partnerships."

It is that cosy relationship with publishers and their paywalled articles that Eisen is concerned about, especially the latest one with Elsevier, whom he described in a tweet as "#openaccess's biggest enemy." Eisen wrote: "it is a corruption of @Wikipedia's principles to get in bed with Elsevier, and it will ultimately corrupt @Wikipedia." But in a reply to Wikipedia Library on Twitter, Eisen also emphasised: "don't get me wrong, i love @wikipedia and i totally understand everything you are doing."

Murray-Rust was one of the keynote speakers at the recent Wikipedia Science Conference, held in London, which was "prompted by the growing interest in Wikipedia, Wikidata, Commons, and other Wikimedia projects as platforms for opening up the scientific process." The central question raised by WikiGate is whether the Wikipedia Library project's arrangements with publishers like Elsevier that might encourage Wikipedia editors to include more links to paywalled articles really help to bring that about.

Elsevier and other mainstream publishers have no intention of major collaboration, nor of releasing the bulk of their material to the world. Witness the 35-year old paper, which is hidden behind a paywall, that predicted that Ebola could break out in Liberia. It's still behind an Elsevier paywall.

[These problems aren't confined to Elsevier; many of the major publishers do similar things to restrict the flow of knowledge. When it appeared that ContentMining might become a reality, Wiley added "Captchas" to its site to prevent ContentMining. But Elsevier is the largest and most unyielding publisher, often taking the lead in devising restrictions, and so it gets most coverage.]

Wikimedian Martin Poulter, who is the organiser of the Wikipedia Science Conference, has no doubts. In an email, he told Ars: "Personally, I think the Wikipedia Library project (which gives Wikipedia editors free access to pay-walled or restricted resources like Science Direct) is wonderful. As a university staff member, I don't use it myself, but I'm glad Wikipedians outside the ivory towers get to use academic sources. Wikipedia aims to be an open-access summary of reliable knowledge—not a summary of open-access knowledge. The best scholarly sources are often not open-access: Wikipedia has to operate in this real world, not the world we ideally want."

The debate will continue publicly in Wikip/media. That's good.

The STM publishers, Rightslink, and similar organisations are working to lobby politicians and librarians to prevent the liberation of knowledge. That must be fought every day.



Wikipedia and Wikidata. Massive Open resources for Science.

I think Wikipedia is a wonderful creation of the XXIst Century, the Digital Enlightenment. It has arisen out of the massive cultural change enabled by digital freedom - the technical ability for over half the world (and hopefully soon almost all) to read and write what they want.

I was invited to give the plenary lecture at Wikipedia Science - a new venture, and one which was wonderfully successful. Here's me, promoting "The Right to Read is The Right to Mine".


I'm not sure whether there is a recording or transcript - I'd certainly value them as I don't read prepared speeches.

My theme was that Wikidata - one of the dozen major sections of Wikimedia - should be the first stopping place for people who want to find and re-use scientific data. It doesn't mean that WD necessarily contains all the data itself, but it will have structured, validated links to where it can be found.

Here are my slides, which contain praise for Wikim/pedia, the problems of closed information, and the technology of liberating it through ContentMining. In ContentMine we are mining the daily literature for science and Wikidata will be one of the places that we shall look to for recording the results.

One of the great aspects of Wikipedia is that it has an Open approach to governance. Last year at Wikimania I was impressed by the self-analysis of Wikipedia - how can we run a distributed, vibrant, multicultural, multidisciplinary organisation? If anyone can find the answer it's Wikimedia.

But running societies has never been and never will be easy. People will always disagree about what is right and what is wrong; what will work and what won't.

And that's what the next post is about. Wikipedia has embarked on a collaboration with Elsevier to read the closed literature. Many people think it's a good way forward. Others, like Michael Eisen and me, think it's a dereliction of our fundamental values.

It's healthy that we debate this loudly in public. During that process we may lose friends and make new ones, but we advance our communal processes.

What's supremely unhealthy is that large closed monopolistic capitalist organisations make decisions in private, collude with governments, and constrain and control the Digital Enlightenment.





Announce: Microbial Supertree through ContentMining

I haven't blogged for some time as I have been writing Liberation Software (software to make knowledge and people free). Now we (Ross Mounce, Matt Wills and I) have got our first significant scientific result - a supertree:


I am going to leave Ross the opportunity to blog this in detail - he was hacking this late last night - so a brief overview:

For every new microorganism it's obligatory to compare it with other organisms in an evolutionary (phylogenetic) tree. Here's a typical one (don't be frightened - everyone can understand this if they are familiar with evolutionary ideas). The image was published in (Citation: International Journal of Systematic and Evolutionary Microbiology (2009), 59, 972–980, DOI 10.1099/ijs.0.000364-0)



[I have added "root" and magnified some of it].

31 microorganisms (mainly bacteria) are listed in the middle. Each has a binomial (scientific) name (Pyramidobacter piscolens), a strain identifier (W5455T), and an identifier in an RNA database (e.g. EU379932). The lines represent a "tree" with its root (not shown) at the left hand side and presumed divergence of the species. It's certainly a useful classification; you can debate whether it's a useful historical model of the actual evolution over many million years. Thus it says that Pyramidobacter piscolens is closely related to Jonquetella anthropi and much more distantly related to Escherichia coli, a bacterium in everybody's gut.

Each paper provides one such tree - which could take significant amounts of computation (often hours depending on the strictness). What we have done - and this is a strength of Matt's group - is to bring thousands of such trees together. They weren't calculated with this in mind, and we are adding value by extracting them from the literature and making comparisons and aggregations.

Ross downloaded about 4,300 papers and the trees in them. I wrote the software to extract trees from the images. This is not trivial - the images are made of pixels - there are no explicit lines or characters or words and this research is full of heuristics. So we can't always distinguish "O" (Oh) from "0" (zero). So there will be an unavoidable percentage of garbles.

BUT we have ways of detecting and correcting these ("cleaning") and the most valuable are:

  • comparing the scientific name with the RNA ID
  • looking up the name in the NCBI's Taxdump (a list of all biomedical species)

Ross has developed several methods of cleaning and we are reasonably confident that the error rate in species is no worse than 1 in 1000. (Note, by the way, that in a sibling image the authors have made a misprint: "Optiutus" should be "Opitutus", so the primary literature also contains errors.)
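To give a flavour of what a name lookup can do, here is a minimal sketch using Python's standard `difflib` module. It is not Ross's method: the short `KNOWN_GENERA` list is a toy stand-in for a real list such as the NCBI Taxdump, and the function name `clean_genus` and the 0.8 similarity cutoff are assumptions chosen for illustration.

```python
import difflib

# Toy stand-in for a real list of names (e.g. one built from the NCBI Taxdump).
KNOWN_GENERA = ["Escherichia", "Jonquetella", "Opitutus", "Pyramidobacter"]


def clean_genus(raw, known=KNOWN_GENERA, cutoff=0.8):
    """Return the known genus closest to `raw`, or None if nothing is close.

    Hypothetical helper: exact matches pass straight through; otherwise the
    best fuzzy match above `cutoff` (an assumed threshold) is returned.
    """
    if raw in known:
        return raw
    matches = difflib.get_close_matches(raw, known, n=1, cutoff=cutoff)
    return matches[0] if matches else None


for garbled in ["Optiutus", "Escherlchia", "Xyzzy"]:
    print(garbled, "->", clean_genus(garbled))
```

A name that finds no close match is returned as `None` rather than guessed, so it can be flagged for a human to check. A cutoff that is too low would silently "correct" genuinely new names, which is why the threshold matters.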

Everything we do is Open Notebook. We post stuff as soon as it is ready. We store it on Github (see above link, which has all 4300 trees), discuss it on ContentMine's Discourse ( - you can see that every detail is made open), software in ( and many other repos, fully Open and often updated several times a day), and where Ross will be blogging it.

I hope to write much more frequently now.





