How Wikidata can change the world of scientific information 1/n


We're getting involved in Wikidata! It will change the world of scientific (and other) information. So here is an emerging conversation, hopefully over several blog posts.

Wicki: Hang on! What's Wikidata? And Wikimedia? I've heard of Wikipedia, but...

Dater: Wikipedia is a free encyclopedia. It doesn't do everything. It's one of about 12 projects under the aegis of the Wikimedia Foundation. It's the one everyone has heard of, but there are lots of others which are also about making structured information and knowledge available for free and freely reusable by everyone. For example, Wikimedia Commons is a huge resource of free images, videos, etc. Many of them are linked from Wikipedia articles but there are lots more which can be re-used in all sorts of ways: teaching, research, new media ...

Wicki: OK, so Wikidata is the same thing for data? ...

Dater: Yes, but it's not "all the world's free data". It's carefully described data, carefully selected, and with clear provenance. When you find some Wikidata you know:

  • what it is
  • where it came from
  • how it can be used
  • what other data it is related to
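
As a rough illustration, those four points can be pictured as fields on a record. The sketch below is a hypothetical Python structure, not Wikidata's actual data model: the class name, the field names and all the example values (including the placeholder identifiers) are assumptions made up for this post.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class DescribedDatum:
    """Hypothetical record mirroring the four points in the list above."""
    what: str                                          # what it is
    source: str                                        # where it came from
    licence: str                                       # how it can be used
    related: List[str] = field(default_factory=list)   # what it is related to


def is_fully_described(datum: DescribedDatum) -> bool:
    """Return True only when all four points have some answer."""
    return all([datum.what, datum.source, datum.licence, datum.related])


# Illustrative values only; the identifiers are placeholders, not real ones.
virus = DescribedDatum(
    what="Zika virus (a virus)",
    source="placeholder: the cited source would go here",
    licence="placeholder: the reuse terms would go here",
    related=["PLACEHOLDER-fever", "PLACEHOLDER-forest"],
)

print(is_fully_described(virus))
```

The only point of the sketch is that a datum missing any one of the four answers would not count as "carefully described".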

W: So give me an example. If I want to find out where Zika is endemic, can I find it in Wikidata?

D: Yes. Good example. Actually "Zika" represents quite a lot of different things. It represents a virus...

W: Yes, but surely that's it?

D: No, it also represents the fever caused by the virus. They aren't the same ...

W: OK, I can see that. OK there would have to be two entries...

D: No, there's more. Do you know where Zika virus was first discovered?

W: In Africa? But no idea where...

D: In the Zika forest - in Uganda. The virus was named after the forest. So it's got a separate identifier. Lots of diseases are named after the place where they were first identified.

And then there are people called "Zika".

W: But they wouldn't cause any confusion?

D: Yes, some of them are authors of scientific papers. Which have nothing to do with Zika virus, Zika forest, Zika fever...

W: H'mm. So if I search for "Zika" in G**gle, I'll get all of these?

D: G**gle will guess what you want, and add in what it and its sponsors want you to see. So I didn't find any authors in the first 4 pages. It's powerful, but it's not objective, and it's not reproducible. If you search tomorrow you'll get different results.

W: And Wikidata is more objective?

Dater: Yes. Wikidata has different entries (items) for each of the categories above. The virus, the fever, the forest and the authors have different identifiers.

W: identifiers?

Dater: Yes. Good information systems have unique identifiers for each piece of information. Your passport number is unique. That's what the machines read at airports. So here are some identifiers (we searched : )





have a look at, that's got masses of information about Zika virus.


Oh, and here's a botanist, Peter Francis Zika,  whose Wikidata identifier is Q21613657.
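
For readers who want to try this themselves, here is a minimal sketch of how such a search might be scripted. It assumes the public Wikidata `wbsearchentities` API action and its JSON response shape (a `search` list whose entries carry `id`, `label` and `description`); the function names are made up for this post. The sample response is hand-written rather than fetched: only the botanist's identifier comes from the text above, and the other identifier is a deliberate placeholder.

```python
import json
from urllib.parse import urlencode

API = "https://www.wikidata.org/w/api.php"


def build_search_url(term, language="en", limit=10):
    """Build a wbsearchentities URL for a label search (no request is sent)."""
    params = {
        "action": "wbsearchentities",
        "search": term,
        "language": language,
        "format": "json",
        "limit": limit,
    }
    return API + "?" + urlencode(params)


def list_identifiers(response_text):
    """Extract (identifier, label, description) triples from a JSON response."""
    data = json.loads(response_text)
    return [
        (hit["id"], hit.get("label", ""), hit.get("description", ""))
        for hit in data.get("search", [])
    ]


# Hand-written sample, NOT real API output. "Q-PLACEHOLDER" is not a real ID.
SAMPLE = json.dumps({
    "search": [
        {"id": "Q21613657", "label": "Peter Francis Zika", "description": "botanist"},
        {"id": "Q-PLACEHOLDER", "label": "Zika virus", "description": "virus"},
    ]
})

print(build_search_url("Zika"))
for qid, label, description in list_identifiers(SAMPLE):
    print(qid, label, description, sep=" | ")
```

Each hit comes back with its own identifier, so the botanist and the virus never get mixed up even though both match "Zika".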

W: Help - that's too much at once

D: Understood.

W: H'm. So does everything in the scientific world have an identifier in Wikidata?

D: No - there's far too much. Even G**gle won't get everything. But everything with a Wikipedia article will (or should) have a Wikidata item. And lots of things are in Wikidata that don't have articles. The Wikidata community has imported lots of information directly from authoritative sources.

W: OK, so I can assume that every *important* scientific fact is in Wikidata?

D: That depends on what counts as "important". But there are already huge amounts of bioscientific information. Drugs, diseases ...

W: Hm, my brain is really starting to overheat. Let's take a break and come back. Maybe with some more examples??

D: Certainly with some more examples. I'll show you how items can be linked together by properties...

W: OK. We've not even talked about how it will change science. You may have to reteach me some of this when we next meet...

D: Just remember "Wikidata". Be seeing you.

The critical role of e-Theses: award acceptance speech at NDLTD

I am honoured by this award. I'll describe the current struggle for ownership of digital scholarly knowledge, emphasize young people and machine-understandable theses, and suggest practices.


Early Career Researchers see the digital literature – including theses - as a primary research resource. We’ve set up ContentMine – a non-profit supporting machine reading and analysis of scholarship. There are 10-20,000 journal articles a day – and several hundred theses - so machines are essential. Today we’re announcing 6 ContentMine fellows – all of whom have exciting projects to create new bioscience from the scholarly literature.


But this brave new world is often opposed by the Publisher-Academic complex. Academia feeds knowledge and public money into companies who in return define the scholarly infrastructure and the rules by which Academia has to play.

The key issue is who controls scholarship? Universities? Students? Researchers? Or corporations only answerable to their shareholders? How many universities have been arbitrarily cut off by publishers with the accusation that “their” content is being stolen? Knowledge that should be available to the whole world is being controlled and monitored. Increasingly, universities are acquiescent and even required by publishers to police “compliance”.


Last month one of our fellowship - a graduate student colleague in the Netherlands - was legally mining the literature to detect malpractice – such as unjustifiable statistical procedures. After 30,000 downloads a publisher cut off the University and – without discussion - wrote denouncing him for “stealing” content. They required his research be stopped. The University complied. Then another publisher. And a third. Last month Cambridge was cut off for 3 weeks by one publisher. No explanation. No dialogue.


Europe is trying to reform copyright to support research. I am working with them, but there’s massive lobbying by publishers. They want to control and monitor everything. Textual content, repositories for data, metadata, metrics for academic glory.


Machine-understandable e-theses represent one of the remaining areas not controlled by publishers. They are a new opportunity for universities and a knowledge resource for everyone – citizens as well as academics. They report billions of dollars of research, and are often the only place where it’s published. To maximize the spread of knowledge – which young people are passionate about – here are some suggestions:

  • Be proud of theses.
  • Think of “use” rather than “deposit”
  • Make theses globally discoverable.
  • Involve citizens everywhere. Think of the Global South.
  • Don’t repeat the mistakes of the “West”. Do it differently.
  • Release immediately.
  • Use DOCX, TeX, CSV, SVG, XHTML, besides PDF.
  • Use versioned text and data: Git, Dat …
  • Use openly controlled international repositories.
  • Use permissive licences allowing mining and re-use.
  • Do not hand over rights for content, discovery or access.
  • Don’t buy systems - Encourage young people to build them.
  • Experiment with Open Notebook Science.
  • Encourage and use e-theses as a primary tool for research.
  • Use Wikipedia / Wikidata as the default metadata for scholarship.


And a warning: Unless libraries take this type of opportunity now they will be increasingly replaced by commercial services and disappear. E-theses and young people are your chance.


"Dialogue" with Elsevier - story-2 ("Despicable" Legal Weasel Words)

ContentMine is going to mine the whole scholarly literature (10,000 articles every day). We'd hoped to do this some months ago, and one of the reasons we haven't is the massive pushback from major publishers: technical, legal, political.

UK government note: You are about to spend about 40 M GBP each year with Elsevier. The real costs are about 2.5 M GBP according to Bjoern Brembs. A significant amount of the rest (even after the huge profit of ca 38% (yes!)) is spent on lobbyists, reps, lawyers, firewalls, captchas, etc. Much of their time is spent trying to make it as difficult as possible to create the Scholarly Commons [1] where we can read, use and re-use the literature without constantly looking up to worry about publishers.

So one of the aspects is legal agreements. We need legal agreements in all sorts of areas: buying houses, hiring staff, etc. These are often between two parties who negotiate (e.g. on price and exactly what is included), and most of the time it's reasonably well understood what the bargain is.

But not with Elsevier. Elsevier produce devious, complex, bespoke legal agreements unlike any other publisher. They never use a standard form if they can complicate and mislead. You may think I'm being unfair and biased, but I have spent many days challenging them over text and data mining (TDM). They put in specific restrictions and clauses about what they hold onto. Despite the fact that TDM is legal in the UK, they try to persuade you that you have to make a separate agreement with them (an API). You don't. What they do is legal, probably, but it's immoral and unethical.

Here's the most recent unpleasantness. A common way to publish your work as Open Access is to pay the publisher (often a lot of money) to allow you to use a CC BY licence. And you retain all rights as author. Straightforward publishers like BMC have done this for 10 years and I have published with them perfectly happily.

So when you hear that Elsevier's licence is CC BY you think fine, I continue to own the paper and Elsevier have a non-exclusive right to use it.

But no. Elsevier has written weasel words into the small print. You no longer own the paper. It may be CC BY but it's Elsevier's. And the weasel words are there to look like you are getting what you paid for, but actually you have to be a lawyer to be sure that you have actually been fooled.

Does this matter? At first sight not. And if you trust Elsevier, maybe not. But I don't, and nor does Heather Morrison and nor does Michael Eisen. So let's listen to them:


Here's one of the ubiquitous Elsevier staff trying to convince Michael, and here's Michael's response. I'll leave it there; the TL;DR is that this contract is misleading and should be rejected. Michael calls it "despicable". I wish that Universities treated licences as serious and challenged them rather than leaving Michael, Mike Taylor, Heather, Charles Oppenheim, Ross Mounce, me, etc. to do their work voluntarily. After all it's the Universities who contract with the publishers, and they just don't seem to care whether their money is well spent.

Read the following from MikeE. If you teach law students, set it as an exercise to pick holes in...

<quote from="mikeEisen" >
Elsevier is tricking authors into surrendering their rights
By MICHAEL EISEN | Published: MAY 24, 2016
A recent post on the GOAL mailing list by Heather Morrison alerted me to the following sneaky aspect of Elsevier’s “open access” publishing practices.

To put it simply, Elsevier have distorted the widely recognized concept of open access, in which authors retain copyright in their work and give others permission to reuse it, and where publishers are a vehicle authors use to distribute their work, into “Elsevier access” in which Elsevier, and not authors, retain all rights not granted by the license. As a result, despite highlighting the “fact” that authors retain copyright, they have ceded all decisions about how their work is used, if and when to pursue legal action for misuse of their work and, crucially, if they use a non-commercial license they are making Elsevier the sole beneficiary of commercial reuse of their “open access” content.

For some historical context, when PLOS and BioMed Central launched open access journals over a decade ago, they adopted the use of Creative Commons licenses in which authors retain copyright in their work, but grant in advance the right for others to republish and use that work subject to restrictions that differ according to the license used. PLOS and BMC and most true open access publishers use the CC-BY license, whose only condition is that any reuse must be accompanied by proper attribution.

When PLOS, BioMed Central and other true open access publishers began to enjoy financial success, established subscription publishers like Elsevier began to see a business opportunity in open access publishing, and began offering a variety of “open access” options, where authors pay an article-processing charge in order to make their work available under one of several licenses. The license choices at Elsevier include CC-BY, but also CC-BY-NC (which does not allow commercial reuse) and a bespoke Elsevier license that is even more limiting (nobody else can reuse or redistribute these works).

At PLOS, authors do not need to transfer any rights to the publisher, since the agreement of authors to license their work under CC-BY grants PLOS (and anyone else) all the rights they need to publish the work. However, this is not true with more restrictive licenses like CC-BY-NC, which, by itself, does not give Elsevier the right to publish works. Thus, if either CC-BY-NC or Elsevier’s own license are used, the authors have to grant publishing rights to Elsevier.

However, as Morrison points out, the publishing agreement that Elsevier open access authors sign is far more restrictive. Instead of just granting Elsevier the right to publish their work:

Authors sign an exclusive license agreement, where authors have copyright but license exclusive rights in their article to the publisher**.

**This includes the right for the publisher to make and authorize commercial use, please see “Rights granted to Elsevier” for more details.

(Text from Elsevier’s page on Copyright).

This is not a subtle distinction. Elsevier and other publishers that offer it routinely push CC-BY-NC to authors under the premise that they don’t want to allow people to use their work for commercial purposes without their permission. Normally this would be the case with a work licensed under CC-BY-NC. But because exclusive rights to publish works licensed with CC-BY-NC are transferred to Elsevier, the company, and not the authors, are the ones who determine what commercial reuse is permissible. And, of course, it is Elsevier who profit from granting these rights.

It’s bad enough that Elsevier plays on misplaced fears of commercial reuse to convince authors not to grant the right to commercial reuse, which violates the spirit and goals of open access. But to convince people that they should retain the right to veto commercial reuses of their work, and then seize all those rights for themselves, is despicable.

- See more at:
</quote>



[1] Maryann Martone's phrase.


"Dialogue" with Elsevier - story-1 (Will Elsevier publish Crystallographic Data?)

TL;DR. I continue to try to get public data out of Elsevier. I think I should be able to - every other publisher has no problem. After some not-very-useful replies Elsevier simply give up answering me.

Over the last 7-8 years I have had major issues with Elsevier on many aspects - licensing, paywalls, availability etc. It's normally impossible to find anyone who gives me a straight answer. I believe that a modern company should have a clear channel of communication - where requests are handled formally and there is accountability when things go wrong.

Indeed some do: Cambridge and Oxford University Presses, for example. They are parts of the Universities and so have to abide by Freedom Of Information request rules - give clear public answers to questions within a given period of time (20 working days) - and I have used this. By contrast Elsevier find many ways of not answering questions.

So I try again - and I leave it to you to decide whether this is a company that the UK should give 40 M GBP of taxpayers' money to. When I go to meetings about scholarly publishing and informatics there are increasingly representatives from Elsevier who mingle with the other delegates.  They are "friendly" and "want to help us". So here's the first story - there may be more. Always remember that we are paying them.

Background: Everyone (except Elsevier) thinks that non-sensitive scientific data accompanying an article should be in the public domain. This is critical because the data:

  • is there to support the claims in the article.
  • can be re-used by others for many purposes (data-driven science, deriving parameters, aggregation, simulation - a huge list). I have spent much of my scientific career re-using public data.

I'm going to take crystallographic data - my field, but also central to much modern science. Its publication has been supported by International Unions, CODATA, and many other respected scientific bodies.

And almost all publishers make it public. American Chemical Society, Royal Society of Chemistry, Acta Crystallographica and many more.

But not Elsevier. They either [1] send it to the Cambridge Crystallographic Data Centre, who provide it under a subscription licence (a trivial amount - probably < 1% - is available for free, but NOT for re-use) or [2] hide it behind a paywall. I wrote to the "Director of Universal Access" some years ago and got waffle.

I and many others think this is outrageous. It's public data, not Elsevier's. The science in the paper is seriously diminished without the data. I help run the Crystallography Open Database (COD) which has hundreds of thousands of structures. Will Elsevier give these back to the public?

The only route that I have is the "helpful" reps I meet. So a month ago I met one. He agreed to take my concern into Elsevier. At least I would get a clear answer...

I am including all the letters. I have removed the name of the Elsevier rep.

[1,2,3] TL;DR he agreed to take something on. It wasn't his department. He's sent it off.

[4,5] He discovers that authors can send their files to [1] CCDC or [2] Elsevier (behind the paywall). This defines the scope of the question. (It doesn't tell me anything I didn't already know). Most of the data are in the second category.

[6] PMR reiterates that by hiding data Elsevier is going against all other responsible parties in the field.

[7] Elsevier replies that they assume I only want data of type 2 and that they can make them available "openly" behind a Mendeley login.

[8] PMR replies that we want all the data (this is consistent with every other publisher) and that data behind a Mendeley login [2] are not open. PMR lists 6 questions that he would like answered.

I have not had the courtesy of a reply even after 19 days, so I can only assume Elsevier regard me as not worth continuing to answer.


[1]  PMR 2016-04-19
I thank you for our conversation yesterday, where you agreed that all factual supplemental crystallographic data published with papers in Elsevier journals should be made available without restrictions (effectively CC0). You agreed that you would work with your technical colleagues to see how this could be done as soon as possible ("flipping a switch"). You agreed that this would bring Elsevier into line with most other major publishers (ACS, RSC, IUCr, Nature) who have for many years released all their crystallographic data ("CIF"s) into public view on their websites, without restrictions. This data would include both current published data (probably back to about 1990) and all future data.

The Crystallography Open Database (COD) (PMR is a board member) has a 10-year record of accepting, validating CIFs from all domains (organic, organometallic, inorganic, metals and alloys) and then offering to the world for re-use under effectively CC0 licence. It also provides a variety of modern search and analysis software.

I ask that you commit to this publicly now and am confident that COD will be willing to host the data if Elsevier does not wish to mount them on its web pages.

[2] Elsevier 2016-04-19 =======

To be very clear - what we agreed was that I would look into this and get back to you with a clear response one way or the other. I made no commitment as this is not in my area of responsibility. I'd appreciate in the interests of establishing trust between us that we are both careful in reporting our conversations accurately.

Per our conversation I have already reached out to my colleagues to understand the current situation w.r.t. Crystallographic supplemental files in our journals. I will let you know as soon as hear back.

[3] PMR: 2016-04-19 =========

Thank you,

I would not intend to publish anything representing your views and position that you weren't happy with.

For reference I shall forward you Elsevier's less-than-useful reply 3 years ago. If you can do better than this , fine, else it will be a waste of my time.

It would be useful for you to set a tight timescale. If you aren't able to give a clear yes/no in a month from now it will be yet another "we'll look into it for you" that disappears, and I shall regard it as "no".

After a month I shall announce Elsevier's decision as reported to me.

If you want technical help and explanation of what we want and how to make it available, then I am sure Saulius will be delighted to help.

[4] Elsevier 2016-04-19 ==========

I'll follow up as promised and let you know the outcome

[5] Elsevier 2016-05-05 ==========

I wanted to let you know that I am making some progress in discussions with internal colleagues w.r.t. how we currently treat CIF files, but I haven’t got fully to the bottom of the story. A couple of facts I have discovered:

  • We give authors the choice as to whether they deposit their CIF files with an external database and provide us with a link to the file, or have us host their files as Supplementary Material. You can see examples of the two cases as follows
  1. Articles linking out to CCDC using data banner links (e.g.
  2. Articles with CIF files delivered as supplementary material (e.g.

I will keep you informed as I learn more. However, for confirmation I assume that your main interest is in securing open access to files in category 2 above?


[6] PMR 2016-05-05 ============

To clarify our relationship. I am acting as a board member of the Crystallography Open Database, a Public Interest Research Organization (PIRO), and copying them. You are a formal representative of Elsevier / RELX. I regard our correspondence as in the public interest and intend to publish all of it.

I list below a number of direct, simple questions to which I request answers. These are ones that I would expect organizations subject to Freedom Of Information requests (including, for example OUP and CUP, and myself) to have to answer. Although Elsevier is not subject to FOI I expect the same comprehensiveness and clarity of response. There is also a request for crystallographic data.

I have set out our expectations. In summary: All major publishers except Elsevier make their crystallographic data fully and publicly available, effectively CC0. Elsevier's policy is in direct conflict with national and international science organizations such as CODATA, ICSU and the International Union of Crystallography. Elsevier's position in withholding scientific data is in direct opposition to the norms and expectations of the scientific world.

[7] Elsevier 2016-05-05 ==========

I wanted to let you know that I am making some progress in discussions with internal colleagues w.r.t. how we currently treat CIF files, but I haven’t got fully to the bottom of the story.

  1. Articles linking out to CCDC using data banner links (e.g.
  2. Articles with CIF files delivered as supplementary material (e.g.


I will keep you informed as I learn more. However, for confirmation I assume that your main interest is in securing open access to files in category 2 above?


[8] PMR 2016-05-05 ============

Thank you. I have allocated you the same length of time (one month/20 working days) as for a UK FOI request to provide information. If you are also, as I expect, working towards a change in Elsevier policy and practice, then it will be necessary at the end of the month to detail what you have set in motion and with what expected timescale. Until 21st May I accept that I will not make our discussions public.

>ELS> For articles with CIF files that we host as supplementary material [type 2], we are still evaluating both from the technical point of view and the legal point of view the feasibility of making these available openly via our new Mendeley data platform (

>PMR> The word "openly" is imprecise.  I note that Mendeley requires a login which is inconsistent with Openness. I assume therefore that Mendeley will impose its own terms and conditions, which by definition will be inconsistent with CC0.

Note that under the new 2014 UK exception to Copyright I can legally mine the data associated with any Elsevier publication that I have the right to read. Since the data itself is uncopyrightable, and since a journal is not a database covered by European sui generis database rights,  I can therefore download all CIFs as part of my personal non-commercial research and I can publish the data from that research. There is no benefit in using Mendeley. From Elsevier's point of view it would possibly be preferable to bundle this historic data now and ship it to COD and we would be happy to make this technically possible. Otherwise I shall extract it under the UK law.

>ELS> I will keep you informed as I learn more. However, for confirmation I assume that your main interest is in securing open access to files in category 2 above?

>PMR> **NO**. Our interest is in ALL supporting crystallographic data that is associated with ALL Elsevier publications, in line with all other major publishers.

I would therefore like answers to the following questions and will publish answers when the month is up. I have framed them so that many can be answered with Yes/No/DeclineToAnswer. I use the phrase "NonOpen database" to refer to databases such as those run by CCDC, ICSD and other organizations which do not make the total data available under CC0. (Note that if these questions were submitted to OUP I would expect them all to be fully answered under FOI).

  1. Does Elsevier hold copies of ALL raw CIFs associated with Elsevier publications, or if not can it obtain these CIFs?
  2. Please provide a complete list of all NonOpen databases that Elsevier requires or allows authors to submit crystallographic data to. Please indicate whether Elsevier has the right to obtain ALL the crystallographic data in BULK associated with their publications. Does Elsevier have formal contractual relations with these NonOpen databases? Please indicate what these contracts allow and forbid.
  3. Please indicate how Elsevier decides on its policy on crystallographic data. Does it consult with ICSU, CODATA or IUCr? When was the policy last reviewed? What is the mechanism for PIROs to formally request changes in policy?
  4. Please provide a list of all files of Type 1 and Type 2 and a service where updates of these lists can be obtained.

These are requests for information.

Our request for crystallographic data, which is consistent with all major scientific bodies, funding bodies and all major publishers other than yourselves, for the files themselves is:

  5. Please provide all files of type 1 and 2, or an open mechanism (e.g. an API) where all these files can be obtained. Please confirm that redistribution of the files is permitted without further permission.
  6. Please indicate that Elsevier is committing to changing the policy to make supplemental data files publicly, freely and openly available. Please indicate the process that has been initiated and how it will report back to the world.

I will publish the correspondence, unedited, on 21st May.



19 days have elapsed without the courtesy of a reply

Taxi Ken and I discuss the UK's negotiations with Elsevier

When I go to the airport by taxi [1] I try to get the same taxi driver, Ken [2]. Ken is a shining example of why every citizen of the world needs access to the whole scholarly literature - open and for free.

You often hear publishers (and some academics) say "ordinary people wouldn't understand the science". This is appallingly arrogant, and blatantly untrue. In the taxi we are discussing whether people listen more to scientists from Cambridge than less-well-known universities:

PMR: Doctors in a Western Australian hospital struggled for many years to convince the medical profession of the true cause of stomach ulcers...

KEN: You mean Campylobacter.

This is the point. Ken has no University education. But he knows the cause of ulcers and he knows the precise scientific name [3]. (Read the story of Barry Marshall and Robin Warren; everyone should be able to follow it. They published their results [4] in The Lancet, a well-known medical journal.)

Oh, dear, Ken. I'm sorry you can't read this classic paper unless you fork out 36 USD - and you would then have just 24 hours to read it. And you can't show it to your mates - that's copyright violation. Oh, and all the money goes to Elsevier - none to the authors.

Taxi drivers are an underclass. Only academics in Cambridge are allowed to read about Helicobacter. What? It was published in 1983? Yes, that's far too recent to make it Open and Free for taxi-drivers. One of the most important papers in science? Won a Nobel prize? You expect taxi-driver tax-payers to be allowed to read work they fund?? Sorry. Just keep driving taxis.

It's a moral imperative to publish science for everyone. Not just academics but also taxi drivers. Next time you are in a taxi, don't sit back but ask your driver: "are you interested in science?". Not everyone is, but everyone who is interested in science can be a scientist. It's a matter of attitude and philosophy, not a white coat.

PMR: So we had a meeting last week to discuss negotiations with Elsevier.

KEN: Elsevier, the publisher?... (Ken is interested in politics, and science/Cambridge. He knows about Elsevier.)

PMR: Yes...

And I continue to set the scene:

A week ago a small selected group of concerned Cambridge academics (including PMR), and library staff, met with Jisc (who are advising HEFCE - who fund English universities), to find out about Jisc's negotiations with Elsevier about university subscriptions. Almost 40M GBP a year of taxpayer and student money for academics to read journals. Until now I didn't even realise there was a negotiation - it has been kept very quiet indeed. The deal has to be concluded by 24:00 2016-12-31; if not the subscriptions are cancelled and even Cambridge academics won't be able to read The Lancet. (The journal which Ken still can't read anyway, thus bringing Cambridge academics and taxi-drivers even closer together).

The meeting opened my eyes to the massive and visceral resistance that Elsevier was putting up against any normal "negotiations". (I've negotiated with equipment suppliers before - one deal was >2M GBP in today's money.) What came over to me very clearly was that this is not about price. It's a battle for control. Who makes the decisions about the dissemination of scholarly knowledge, whether or not paid for by the taxpayers and student fees?

Elsevier (along with Digital Science from Nature/Springer, etc.) are rapidly taking over our academic infrastructure. Last week they bought SSRN, the social sciences repository. They now control preprints in a significant part of academia. They are selling the Universities PURE - a system of repositories that are under Elsevier control and where I fear that we are the product as well as the customers. What does Open matter if the gateway is controlled by a publisher who Openwashes the language to legitimize its control?

It's not because Elsevier are the biggest, it's because they are the most ruthless, arrogant, publisher. I have been dealing with them for ca 8 years. They treat me as a nobody - a nuisance. I'm far from the only one. It's critical that we wrest back control. After all it's us who are paying the money. And it's every taxi driver in the world who ultimately suffers.

The UK is not the first country to negotiate with Elsevier. Last year (2015) the Dutch did. They announced that unless they got a set of nonnegotiable demands they would walk away from Elsevier and cancel subscriptions.

What actually happened?

I don't know. The Dutch agreed something with Elsevier. What? Don't know, because Elsevier requires secrecy and the Dutch agreed. You can read reports before the deal and after the deal. Maybe, Ken, you can make more of them than me.
The worst possible thing would be an announcement on 2017-01-01:

"The UK and Elsevier have concluded a 40 M agreement about purchase of Elsevier publications and services. The details are commercially sensitive but [some important person] says: 'This is a good deal for the UK ...'"

The thinking world will say: "The Dutch dug their heels in a bit but finally Elsevier won. The same has happened in the UK".

The unthinking will say: "I didn't even realise that UK and Elsevier were negotiating. Ah well, I don't read the scientific literature because I can't afford it."

KEN had already said - his words - "The theft of Knowledge".

This has to be fought in public, starting now. Elsevier are effectively a monopoly. The people of Europe - 250,000 - fought software patents and won. The people of many countries took on Microsoft and won. The people of UK, and later Europe, should take on Elsevier and win. It won't be easy but it has to be done.

And if the taxi drivers take to the streets it can happen.

And in case you are wondering why we are paying 40 Million GBP for electronic content - which Elsevier neither authors nor referees - here is Bjoern Brembs with the breakdown of costs. Bjoern - a neuroscientist - asks "Why haven’t we already canceled all subscriptions?"

The question in the title is serious: of the ~US$10 billion we collectively pay publishers annually world-wide to hide publicly funded research behind paywalls, we already know that only between 200-800 million go towards actual costs. The rest goes towards profits (~3-4 billion) and paywalls/other inefficiencies (~5 billion).

So the UK will be paying 20M GBP to Elsevier for gross inefficiencies (which from my own experience I can confirm) and technology to stop Ken, and Chris Hartgerink ("Elsevier stopped me doing my research"), reading science.

And we should pay them just 3M GBP at the most.
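The 20M and 3M figures follow from scaling Bjoern's rough global proportions down to the UK deal. Here is a minimal sketch of that arithmetic. It assumes the UK payment splits in the same proportions as the global US$10 billion, and it ignores the difference between dollars and pounds because only the ratios matter. The names (`uk_share`, `breakdown`) are mine, purely for illustration.

```python
# Bjoern Brembs's rough global figures (US$ per year), as quoted above.
GLOBAL_SPEND = 10_000_000_000
ACTUAL_COSTS = (200_000_000, 800_000_000)      # low and high estimates
PROFITS = (3_000_000_000, 4_000_000_000)       # low and high estimates
INEFFICIENCIES = 5_000_000_000                 # paywalls and other inefficiencies

# Approximate annual UK payment to Elsevier discussed above (GBP).
UK_DEAL_GBP = 40_000_000


def uk_share(global_amount, deal=UK_DEAL_GBP):
    """Scale a global amount to the UK deal, ASSUMING identical proportions."""
    return deal * global_amount / GLOBAL_SPEND


def breakdown(deal=UK_DEAL_GBP):
    """Return the estimated split of the deal between costs, profits and waste."""
    return {
        "actual_costs": tuple(uk_share(x, deal) for x in ACTUAL_COSTS),
        "profits": tuple(uk_share(x, deal) for x in PROFITS),
        "inefficiencies": uk_share(INEFFICIENCIES, deal),
    }


if __name__ == "__main__":
    for name, amount in breakdown().items():
        print(name, amount)
```

On these assumptions the actual costs come to between 0.8M and 3.2M GBP (hence "3M at the most"), profits to 12-16M, and inefficiencies to 20M.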

Or, much better, do it ourselves. We'd do it cheaper, better, faster. Bjoern, a senior scientist, says so and I agree.


[1] Don't worry - It is the cheapest overall cost (over car parking or hotels).

[2] not his real name but he is happy for me to publish our discussion. He cares.

[3] it's been renamed to Helicobacter in the intervening period.

[4] Marshall BJ, Warren JR (June 1983). "Unidentified curved bacilli on gastric epithelium in active chronic gastritis". Lancet 321 (8336): 1273–5. doi:10.1016/S0140-6736(83)92719-8. PMID 6134060. [from Wikipedia].

Sci-Hub and my personal position on legality 5/n

I have just blogged on the legal aspects of ContentMining (that post also contains links to previous blogs).

These are general considerations but also relevant to the current issue of Sci Hub.

I am now going to set out my personal position and, where it impacts legally or organizationally, on how TheContentMine might behave. I am not going to impose any other non-legal requirement on my colleagues. In a non-ContentMine scenario they can think and act however they feel best.

When I came to Cambridge, ca 17 years ago I was in love with Universities. My first job was Assistant Lecturer at the University of Ghana when I was 22 years old (sic). My first UK job was 4 years later at the new University of Stirling. I helped build the University. It was wonderful and I loved it and have pride in what I helped with. After my time at Glaxo I moved to a part-time chair at the University of Nottingham to set up virtual science education and thence to Cambridge to a new Centre. I threw my heart into it.  I love all of them. I love the people.

But the system is getting worse. And one of the causes, or symptoms, is scholarly publishing. When I started this blog - almost 10 years ago - I was starry-eyed about the possibilities. I was going to build an artificially intelligent computer. It would have the chemical intelligence of a first year undergraduate. But - since much “intelligence” is based on knowledge - it needed knowledge in chemistry journals “published” by “publishers”. And they have completely failed me. I spend my days as a reformer as well as - hopefully - an innovator.

So I have been dissatisfied with scholarly publishing for about 10 years.

  • I’ve tried to change it constructively. Working with publishers. Little or no interest.
  • I’ve tried to get scientists interested. Little or no interest.
  • I’ve tried to get libraries involved. Little or no interest.
  • I’ve got some funding. Mainly JISC and M$. No one interested in the output.

… it’s clear that trying to do things through conventional university channels will take longer than my lifetime.
So what do I do?

  • I keep buggering on (W.S.Churchill). This issue is too important to drop.
  • I appeal to the non-academic world.
  • I build stuff. Stuff is wonderful.

And I think like a revolutionary.

How can I and others change the world?

It’s clear that laws and practice are broken. Copyright is being used to muzzle speech and creativity, not create it. Universities chase glory rather than public good.

I’ve got two main options:

  • Work within the law
  • Work outside the law

Both can be effective, and both can be ineffective. And reformers sometimes move from one to the other.

But you can’t do both at once.

There is a balance between the state and the individual. If you want the state to change the law then the state should respect the individual and the individual should respect the state. This happened in the UK in 2012 when the government asked for views on Copyright reform. They were prepared to listen to anyone - citizens, universities, publishers, etc. This is a fair and balanced approach - Government listens to everyone, makes a balanced decision through the Civil Service, recommends it to parliament (in this case the House of Lords), takes further amendments, asks for a vote and it passes into law (or at least a Statutory Instrument).

It’s a fair process. It’s democracy. Yes, democracy is the worst form of government, except for all the others (WSC). I wish the result had allowed commercial mining. Others wish the SI hadn’t been passed at all. But I accept it.

But the publishers have not accepted the result.

They have been using other methods to make it harder to use the law. These include lobbyists, disinformation and technical barriers. These are probably legal, but unethical and immoral. They have been spreading disinformation. This isn’t just my judgment, it’s that of Julia Reda MEP and many other policy makers. She’s had 80 offers to dinner in the first week after her report. She’s been fed information which may possibly be at variance with reality. Not l**s but or  those .

That’s why we worked to make ContentMine runnable by any MEP. And why Rik SU is building a GUI for it. So that Julia and her colleagues can challenge any assertions.

This is very sad. The Hargreaves SI was passed. The publishers fought in Europe to require miners to get permissions through licences (“Licences for Europe”). It was a standoff. Publishers are still trying to get libraries and academics to sign licences rather than accept Hargreaves.

What should I do?

In 2013 I thought that very few people understood what mining was about. So I was delighted that Shuttleworth offered me a Fellowship. I was delighted that I could continue to use University resources. I was delighted that I could build a system and devote my “retirement” to getting the practice of mining adopted by everyone.

Not just academics, but citizens. Taxi-drivers. Patients, planners…

But the publishers have made it very hard. And there is massive lobbying against legal reform in Brussels. The new phraseology “by Public Interest Research Organizations” is almost meaningless. It’s sufficiently frightening that very few will actually do it.

And currently I’m the only person in Europe doing legal content-mining (unless you tell me different).

And that’s a minor success for the Luddites in the publishing industry. They’ve frightened people (I’ve talked with some), bewildered others, offered no technical support.

So I have to make the decision.

  • Work within the law
  • Work without the law

I do not like breaking the law, I wouldn’t ask anyone else to, and in addition:

  • My funders expect me to work within the law. The project is to develop mining tools, strategy, policy, practice that will convince lawmakers
  • My University requires me to work within the law
  • The politicians that I talk to in UK and Brussels expect me to work within the law.

You can’t break the law just-a-little. So I’m not breaking it at all.

The downside is that the law we have in the UK, and the law we might have some-time-in-Europe, are very restrictive. I’m pioneering it to see what use it can be and where it needs enlarging. I can only do this if I am legal. I can say to lawmakers:
“It’s working here  but failing here and here and here and…”

It’s not what I want, but it’s the bargain that I make with legislation.

And in addition I work with other legal groups such as Open Forum Europe to help and be helped in getting legal progress.

History will decide whether trying to stop progress has been successful…

… or whether ContentMine has made a difference.


Sci-hub and Legal aspects of ContentMining 4/n

I have written today to my collaborators in ContentMine - staff, volunteers, advisory board and Shuttleworth funders and mentors. It's on the legal aspects of mining. It's long, but laws are complex. It's meant to put everyone's minds at rest - us, universities, Shuttleworth, etc. It's not authoritative, but may be a useful guide. We'd love to have your feedback. tl;dr I've assessed the main problems and most people should assume we have taken a responsible and public approach.
ContentMine is preparing to mine the complete scholarly literature every day - about 10,000 scholarly articles. People from inside CM and from outside have recently raised the question of whether CM is breaking or intends to break the law. This has arisen in part because of our intention to use the UK Copyright exception to mine the whole literature, and because of speculation about the possible use of our technology by "illegal" sites such as Sci-Hub.
NOTE: I am not a lawyer (IANAL) but I have spoken to several and am aware of general principles and practice.

The simple answer is simple:

CM does not intend to break the law and intends not to break the law.

And to my colleagues:
Do not worry. You will not end up in court. If anyone does - and it is unlikely - it will be me and I am prepared.

I shall expand on this in blog posts, but please be assured that I am actively assessing areas where the laws might be broken, especially inadvertently. Note, of course, that there are many other laws which we have to observe on a continual basis, including health and safety, employment, racial discrimination, libel, immigration, etc. I get frequent updates from the Chemistry Department as to what procedures we have to observe. You, I, and everyone are bound to observe and practice these laws. They are complex in detail, extent, interpretation and we generally manage by knowing the outline of the law. We don't steal, and we don't read the small print of what is and is not a theft (e.g. "illegal borrowing"). But in others, e.g. animal experiments or immigration, the small print is critical. "Ignorance of the law is no defence".

But I will take the responsibility of guiding you and making sure that you don't transgress inadvertently.

The laws particularly relevant here include:

* copyright law

* sui generis database rights (Europe only)

* computer fraud law

* technological protection measures (TPM) and digital rights management (DRM)

* national security laws

Most of these laws have a concern about geo-location. We shall attempt to make sure that all our activities are carried out by UK staff, "in the UK", on UK machines. But what is legal here may be illegal elsewhere and vice versa. Note also that many laws, especially new ones, cannot have definite answers until they are tested in a court case. Lawyers may give opinions (for fees) but ultimately the court decides.
These laws are complex and often recent and - like many laws - it is possible to transgress unknowingly. We have to educate ourselves and to behave responsibly in actions and language. If anyone is unsure they should raise the issue.
Note that by discussing this in public we will show our good faith and also be alerted by others to potential problems and misinterpretations.
Copyright law is exceedingly complex and also depends on the country. What is legal in the US may not be in Britain and vice versa. It includes:
* the process of copying for the purpose of mining for non-commercial research
* storage of copied material
* republication of the (transformed) output as part of the research/audit/verifiability requirement.

We continually discuss this with lawyers and with librarians. No one can predict precisely what is allowed and what is not - it may depend on "impact on the market of the rights-holder". All law includes a balance of risks - it is my responsibility and (for some content) the librarians' to make sure that we have a balanced assessment.

We believe that our mining is fully allowed under the UK 2014 reform ("Hargreaves"). It would not be allowed if we took money from commercial companies and mined the literature solely for their benefit. Europe has noted that much research is a public/private partnership (I worked for 15 years in the Cambridge Unilever Centre, for example). Was this non-commercial? I would take the view that all the projects I worked on were. If I was paid extra to do private contract research for a company which would not be published it would be commercial.

Since I and ContentMine are probably the only group in UK at present who publicly intend to use Hargreaves there is no case law to answer these questions. We read the current public discourse and form a balanced judgment.

What copyright material can we hold on our machines? It is common for researchers to have thousands of copies of copyright material on their machines and no one is challenged. Unlike them, our material is in a secure computer room in Cambridge with physical access only by trusted staff and e-access only to 2-3 named and authorised people. If anyone wishes to "steal" the literature from our server we will actively prevent and report this. We are not, of course, ourselves redistributing any of the University subscription content other than facts and fair quotations. If, as we hope, the resource becomes useful in the University, we will work with library staff to create a legally acceptable approach where any Cambridge scholar can use the system.

How long can we hold it for? Mining is often an iterative process, so we may wish to re-run searches with new parameters. It would be a technical waste to have to re-download everything every day. It would also put additional workload on the publisher's servers. We can't give an answer in days or months or years until we know what the likely usage patterns are.
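To make the point about re-downloading concrete, here is a minimal sketch of a local cache: each document is fetched once, and later mining runs read the stored copy instead of going back to the publisher's server. This is an illustration only, not ContentMine's actual code. The function name `cached_fetch` and the `download` callable are hypothetical, and it says nothing about how long the copies may legally be kept.

```python
import hashlib
from pathlib import Path


def cached_fetch(url, cache_dir, download):
    """Return the bytes for `url`, calling `download(url)` only on a cache miss.

    `download` is any callable that takes a URL and returns bytes; it is a
    stand-in (an assumption) for whatever fetching tool is really used.
    """
    cache_dir = Path(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)
    # Hash the URL so that any URL maps to a safe, fixed-length file name.
    path = cache_dir / hashlib.sha256(url.encode("utf-8")).hexdigest()
    if path.exists():
        # Already held locally: no extra load on the publisher's server.
        return path.read_bytes()
    data = download(url)
    path.write_bytes(data)
    return data
```

Re-running a search with new parameters then costs one download per paper in total, however many times the mining is repeated.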

What can we republish? Since facts are uncopyrightable we can publish them without permission (although in Europe we cannot systematically republish the contents of databases protected by sui generis rights. Journals and supplemental data are not databases). But:


is not a useful fact.

"The average snout-vent-length (SVL, see ) of the common lizards (Zootoca vivipara) found on Borchester Common ( )  was 42 mm (+- 5) measured by 3 independent researchers using the Graduated Ruler and Eyeball Method (see )"

is a useful fact. We intend to publish some or all of the facts we extract without formal permission from the publisher.

Note that a fact does not have to be "true". I don't actually know the sizes of newborn sandlizards. But what I have stated is a fact. The result might be a misprint for 142 mm (which is possible for an adult). It is still a (potentially falsifiable) fact. It remains a fact regardless of further lizard research.
I will blog more on facts as "facts" are uncopyrightable.
* sui generis database rights. We do NOT currently intend to systematically extract facts from factual databases described as such and specifically created for the purpose of holding facts.
* computer fraud laws. We scrupulously avoid breaking these laws. They carry the additional feature that they are criminal, and so prosecution would be by the police. The UK takes these very seriously and wishes to extend the maximum term of imprisonment to 10 years (I personally protest against this, but I do it legally). You should therefore take especial care not to share files "illegally". This means that ContentMine cannot have any dealings with Sci-Hub as it is seen by many as an "illegal" filesharing site. Read Ars Technica:

<quote>The UK government has responded to that issue by saying that it accepts there are concerns, and writes: "the policy intention is that criminal offences should not apply to low level infringement that has a minimal effect or causes minimum harm to copyright owners, in particular where the individuals involved are unaware of the impact of their behaviour."

Another major worry was the use of the term "affect prejudicially" in judging copyright infringements, which many felt was too vague and could mean a single infringing file would fulfil the requirement—for example, if it were widely shared online. Many thought this set the threshold for committing an offence far too low.

The UK government said it was not aware of any cases where minor infringement had resulted in a criminal prosecution, but "agrees that the undefined term ‘affect prejudicially’ could give rise to an element of ambiguity." The government is now proposing to introduce "re-worded offence provisions" to address that.


It is extremely unlikely that we will trigger this law as we don't deliberately intend to break it and deliberately don't intend to break it. However #icanhazpdf is almost certainly "illegal" and also breaks the rules of the University. I have never used #icanhazpdf in either direction and never sent files to people who weren't subscribed. ContentMine staff should not use #icanhazpdf.

In some cases crawling has been held to be a violation of the CFA acts of various flavours. I am not aware of any cases where scholarly publishers have used this to prosecute bona fide researchers, nor where the police have.

Note also that many publishers know that I and others (e.g. Crystallography Open Database) have been crawling their sites for many years and by implication permit it. This includes Nature, Elsevier, American Chemical Society, Royal Society of Chemistry, Acta Crystallographica, Science. We are careful to adhere to responsible mining practice.

Aaron Swartz's case was - for many, including me - a serious miscarriage of justice. From Wikipedia:


<quote>In the wake of the prosecution and subsequent suicide of Aaron Swartz, lawmakers have proposed to amend the Computer Fraud and Abuse Act. Representative Zoe Lofgren has drafted a bill that would help "prevent what happened to Aaron from happening to other Internet users".[35] Aaron's Law (H.R. 2454, S. 1196[36]) would exclude terms of service violations from the 1984 Computer Fraud and Abuse Act and from the wire fraud statute, despite the fact that Swartz was not prosecuted based on Terms of Service violations.[37]

In addition to Lofgren's efforts, Representatives Darrell Issa and Jared Polis (also on the House Judiciary Committee) raised questions about the government's handling of the case. Polis called the charges "ridiculous and trumped up," referring to Swartz as a "martyr."[38] Issa, who also chairs the House Oversight Committee, announced an investigation of the Justice Department's prosecution.[38][39]

As of May 2014, Aaron's Law was stalled in committee, reportedly due to tech company Oracle's financial interests.[40]


* TPM and DRM

These are technical methods of preventing access to material and can include firewalls, encryption, specific tools, and possibly Captcha. We have bought legal advice and the result is not clear about whether Hargreaves allows us to circumvent them. The rule for all of us is that if there is any technical barrier to mining we should identify it and alert the librarians and possibly computer officers. Deliberately breaking this law could have serious consequences. Rest assured that I will publicize and comment on publishers who impose TPM.

Charles Oppenheim (Chair ContentMine advisory board) adds:
...within the UK Copyright Act there are regulations allowing for someone to ask the Secretary of State responsible for copyright law to stop a rights owner using Technical Protection Measures [to] prevent people from exercising an exception to copyright, such as the TDM exception, and the Secretary of State, after examining the evidence, can require the copyright owner to lower the barrier, or be prosecuted. However, the procedure for doing this is complex and clunky, and has been very rarely used hitherto for that reason.

* national security. It is very unlikely that we shall trigger this very serious offence. However, overzealous prosecutors or government departments - particularly in the US - have used such provisions.

There is a simplistic tendency of some companies and government departments to demonize all "hacking" as security violations. My laptop carries "Wget is not a crime" , after

was jailed for its use. See Slashdot for the link to Snowden and hackerbabble:

* scraping

ContentMine is in the business of scraping websites - scholarly publishers, academic departments, etc. Is this legal? People have been prosecuted for scraping (reported by a company selling anti-scraping software). Wiley and Elsevier caused Tilburg to cut off Chris Hartgerink for downloading ("stealing") material to which he had legal access. Their accusations have not been made public and it seems most unlikely he had done anything illegal. However I have scraped publishers for 12 years (for legally accessible materials) with no complaints and I do not expect any.

* incitement to commit a crime.

In general it is a serious offence to encourage others to break the law. See the official (and complex) UK law. For example I believe that any formal contact with Sci-hub or recommendation to use it could be interpreted as a crime. Whether the same applies to breaking contract law is less clear, but ContentMine will not knowingly break this either.
Please let me know whether I have omitted an important item or have misrepresented one.

A commentary on Sci-Hub: 3/n Legal aspects


It’s impossible to discuss Sci-Hub without discussing legal aspects. Unfortunately these are complex and highly varied, so it is impossible to give simple clear answers. On one hand many claim that this is a criminal (or near criminal) activity and She-who-must-not-be-named should be incarcerated or worse;  others including Aleksandra herself claim that what she is doing is her (and our) right and is therefore not illegal.

Now I am not a lawyer (IANAL) but I have talked to lawyers about this and talked to legal academics and talked with people who are authorities and

… and PLEASE correct anything I get wrong and ...

The simple answer is that no-one can give a definite answer about the law. And I will deliberately try to avoid giving anything definite. A competent lawyer will advise about the risks and the client has to decide whether they are worth taking.

Then there are many laws involved. And many jurisdictions. What is illegal in the US might not be illegal on some Pacific islands or even in Kazakhstan. And copyright alone is fiendishly complicated.
Criminal as well as civil law may be involved. It’s not just about copyright. It’s also potentially about the US Computer Fraud and Abuse Act. That’s the Act under which Aaron Swartz was indicted - a criminal, not a civil case. That means Aaron could have been sent to jail - and there were demands for a 35-year sentence… for downloading academic articles.

And most relevantly, incitement to break the law is, in itself, a crime in many jurisdictions. I have been advised that those who support Sci-Hub and urge its use could be prosecuted for this incitement and could be jailed.

People have been asking questions about using Sci-hub and ContentMine on Twitter,  such as:

  • “We need a definite answer soon”. Well, you are unlikely to get one.
  • “If we re-use facts extracted from Sci-hub content, they are uncopyrightable, right?”. Even if true, you might be breaking CFAA or other laws.
  • “X  used facts which Y had extracted from an ‘illegal’ scrape and so X is OK even if Y isn’t”. I would be very unhappy with this reasoning.

So the simple answer is that Peter Murray-Rust and ContentMine colleagues are not going to pronounce on these questions, nor are they going to deliberately or knowingly break the law. I will write a further blog post on what I am going to do.

Note that ContentMine software is offered under the permissive Apache2 licence and is owned by the Shuttleworth Foundation. Like other permissive licences there is no restriction on fields of endeavour. Therefore if someone wishes to use ContentMine software with Sci-Hub content the licence does not restrict this (other laws might).

I’ll be happy to add more to this post if others feel there are omissions or errors. But I may not feel myself able to answer questions, for reasons given above.

There’s a commentary on both the politics and legality of Sci-Hub which may be useful (some extracts):

Leaving aside how they obtain the credentials, the fact remains that the process violates copyright. In November 2015, a New York District Court granted an injunction against Sci-Hub, LibGen and several other sites in response to a complaint from Elsevier, ordering them to stop offering access to infringing content and suspending their domain names. The Judge in the case stated that “the balance of hardships clearly tips in favor of the Plaintiffs. Elsevier has shown that it is likely to succeed on the merits, and that it continues to suffer irreparable harm due to the Defendants’ making its copyrighted material available for free” (8)


Furthermore, human rights are generally considered to be enforceable against a State, not private entities. This means that a case attempting to defend Sci-Hub on the basis of article 27  would likely need to be brought against a government, presumably the US, instead of the publishers, and be aimed at radically altering their copyright law.


Copyright is considered an intellectual property right (9), and article 17 of the UNDHR states that (1) everyone has the right to own property alone as well as in association with others and that (2) no one shall be arbitrarily deprived of his property. It is therefore arguable that article 17 protects rights-holders and their right to enforce copyright. Elbakyan would need to establish that her users’ article 27 rights overrode or were more important than copyright-holders’ article 17 rights.


Interacting with Sci-Hub may, therefore, bring me into conflict with the law, probably US law. I have written earlier about where it is morally legitimate, and in some cases morally imperative, to break the law (“civil disobedience”).

There are many cases in British law and elsewhere, where civil disobedience has led to reform, often with the campaigners being first found guilty and often jailed. The history of independence movements is marked by imprisonment.

Most relevantly, the “Right to Roam” movement, which inspired my mantra “The Right to Read is the Right to Mine”, was only won after imprisonments.

Currently I am fighting for the “Right to Mine” through legal and political means (and optimistically hoping that we will get positive help from some publishers). At present I am scrupulously trying to avoid, even inadvertently, breaking the law. I will explain why in the next blog.

A commentary on Sci-hub: 2/n. Why it matters to me and ContentMine

In my previous post, catalyzed by Sci-Hub, I argued that scholarly publishing is completely broken. It’s now lost a huge amount of respect, it’s unwieldy, unfair and mired in bickering. It pays no attention to readers. It’s becoming a write-only system where authors write not to communicate but for glory - self advancement. There’s no clear political goal …

… and no clear technological goal.

And that’s the problem.

Because we desperately need the ability to search and analyze the scientific and medical literature in a 21stC manner.  While we’ve been creating our we’ve discovered many researchers who have to “read” 10,000 papers in a day or two. They use 20thC methods - click and read - taking weeks where they should take hours. ContentMine software (completely Open) has been built to solve this problem by filtering out the papers you don’t want - often 90% of the first search. (and it does much more - it can extract complex objects). It’s Open to everyone and it works (see previous posts).

When I came to Cambridge I had the vision of building an “artificially intelligent chemical reader”, part of which was the World Wide Molecular Matrix, a system for capturing and sharing versioned semantic chemistry. Bits of it are being built in ContentMine. I built systems where I could draw chemical formulae by speaking to the machine. We’ve built the de facto tool for chemical name recognition (OSCAR) and interpretation (OPSIN). I thought it would take 5 years to create my chemical amanuensis - scholarly assistant. With help from the publishers and scientists it probably would have. Now, after 15 years, it’s still a dream, frustrated by stagnant thinking on all sides, and deliberate opposition (e.g. nullifying European legislation).

So Stackoverflow, Github, Bitbucket, Apache, GNU, Jenkins, OuterCurve, Mozilla and many others are creating the human-machine technology of tomorrow. This encourages innovation from predictable and unpredictable sources. It works - it’s exciting and we are all part of it.

In contrast the Scholarly publishing industry has created nothing in the last 20 years. (The Scholarly Kitchen hailed the “big deal” (a pricing strategy to increase sales) as one of the greatest achievements of schol pub).

20 billion dollars per year - that’s 200 billion since I started at Cambridge - and nothing positive to show for it.

The current technology of the mainstream publishing industry is just awful. Really awful. It’s often built by outsourcing parts to people and companies who do not care how the result is used. The methods used - awful PDF and really awful HTML - are for the publisher’s convenience, not for the reader. And every publisher complains about how awful the tools are. They can’t change, they can’t innovate, they’re locked in. Add that every publisher feels they have to use a different technology to differentiate themselves from the others and it’s a complete tower of Babel. (I have spent 2 years of my life trying to solve this awful mess - and ContentMine can untangle a good deal.)

What’s even worse is that most of the publishers spend effort on STOPPING people reading the literature. The obstacles to getting to a paper grow every month. These include (from my own experience):

  • Deliberately bad PDFs.
  • Pixel maps rather than characters.
  • “Glass screens” that can’t be copied (Readcube from Nature/Springer).
  • Captchas to stop readers after 25 papers (Wiley, i.e. 400 Captchas for a literature review).
  • Monitoring every download and requiring libraries to stop researchers (Elsevier, Wiley).
  • Automatically cutting off 200 universities for a single click (American Chemical Society).

Why does this matter?

Because there is so much we are missing out on. New medicinal knowledge, new ecology, new astronomy, materials, chemical reactions, … and innovation...

I should be able to ask a computer (in speech):

“Find me all chemical compounds that occur in Lantana species south of the Wallace line and compare their chemical and plant evolution. What types of compound might we see in the future, particularly due to invasive species?”

And get a result in minutes… it’s not as hard as it looks. It’s knowledge-driven science.

(Sadly, all I WILL get in minutes is a cease-and-desist letter from publishers demanding that I shouldn’t “steal their content”.)

So because we cannot innovate in this area we are 20 years behind the mainstream.

So why do I want Sci-hub? (Note carefully that I haven’t said what I am going to do and, until I do, you cannot judge my intention. I haven’t said I’m going to use it. You’ll have to wait till the next blog post).

I want Sci-hub because it’s technically BETTER than anything else we have. Much better.

And it’s the perfect complement to ContentMine.

Sci-hub has all the world's scientific knowledge in one logical place. It doesn't matter that it's spread over Torrents and other fragmentation - logically it's all there. And it's run by someone who knows what she's doing technically - unlike many publisher sites. And, I assume, she and colleagues will be receptive to technical requests and suggestions. (No one has any chance of getting conventional publishers to innovate.)

Using Sci-hub would advance my and ContentMine technology enormously. ContentMine and Sci-hub fit together perfectly - because they are both designed with the 21stC mentality. Because they react to what readers want. Yes, READERS; the marginalised community of scholarly publishing. 21stC projects create a community round them. They are organic and vibrant. They respect machines and humans equally.

ContentMine + Sci-hub could be the greatest search engine in scholarship, especially for science, technology and medicine. Because it’s semantic. Because all the literature is trivially accessible in one place and one format. I don’t know of anything that remotely comes close. We can search and index diagrams - extract 15 million chemical reactions a year. (Even if a publisher tried to develop it they could only use it on “their own” content.)


But for many, including the law, Sci-hub is forbidden fruit. Run by She-who-must-not-be-named. The arch-pirate. The criminal. (These terms are used.) Peter Murray-Rust cannot use it (and I haven’t). ContentMine cannot mine it (and we won’t). We’ve looked at the legal and political aspects and I’ll analyse these in a subsequent post.

But 21stC citizens - me, ContentMine, taxi-drivers - really, really want Sci-hub.

The only things stopping us are copyright law, prosecutors and an intransigent, uncaring, out-of-touch, money-driven and self-seeking publisher-academic complex.

I’ll deal with the politico-legal in the next post.


A commentary on Sci-hub: 1. Scholarly publishing is broken

Many of you will already have read Science Magazine’s account of Sci-Hub, the “pirate” site for scholarly publications. “Science” is often seen as one of the “top three” outlets, along with Nature and Cell. Here’s the original:

And here’s a typical commentary which applauds the research in the article but criticizes the accompanying editorial showing that Science has an ethically flawed business model.

This blog post (and the ones following) is one of the most important I have written, and I shall choose words carefully. I shall include facts, opinions, and what I intend to do and not do, and why. I am always open to criticism and try to be polite and constructive. My message is already spreading to more than one posting. This one sets the scene.

This blog is nearly 10 years old. I’d like to believe that I have tried to help make scholarly publishing fit for the 21st Century (C21). I’ve seen Tim Berners-Lee’s vision of the semantic web for scholarship - I was there at CERN in 1994 - and it made sense then and even more so now. (“I” includes many collaborators, but I use “I” to make it clear that the views here are mine and mine alone. Special thanks to Henry Rzepa, my wonderful ex-group in Cambridge, Open Knowledge, ContentMine, Blue Obelisk, Crystallography Open Database (COD), librarians in Cambridge and others. Please accept this pronoun.)

I write and use Computer Programs.

  • I write programs and deposit them in Github/Apache/OuterCurve/BitBucket, etc. People use them, build on them and acknowledge me. I’ll use “Github” as a generic pointer.
  • Others write programs and deposit them in Github. I use them and build on them and I acknowledge them.
  • I offer constructive criticism.
  • I ask questions on Stackoverflow and I also answer them.
  • I’ve set up the Blue Obelisk, where chemists can commit programs and make them interoperable.

This represents the pinnacle of what is possible in C21, with very modest/no funding and a collaborative intent. It works. It makes my heart soar. It’s wonderful. I’m proud to be a small part of it. Everyone wins.

There’s a similar ethos in Wiki/pedia/media/data (I’ll use “Wikipedia” for all of them). Everyone can be a Wikipedian - all you need is to do it.

  • I have used Wikipedia for enhancing my knowledge and have contributed my knowledge to it.
  • I have used systems built by Wikipedians and I have contributed systems for use.
  • I have been to Wikipedia meetings, worked with the Wikipedians.
  • I promote Wikipedia.

I have been on the Advisory Board of Open Knowledge Foundation since it started. I have used OKF resources. I have contributed to them.

And so on… groups that I use and would contribute to if I had the time:

  • OpenStreetMap
  • Geograph
  • OpenCorporates
  • mySociety
  • Mozilla
  • ...

Most of these are cash-starved, and find innovative ways to generate enough income to make their primary products free and Open. (“Open” == “Free to use, free to re-use, free to re-distribute”, “Free as in speech”, “Free as in liberty”).

C21 makes knowledge-sharing communities possible. It’s very very wonderful. If you don’t understand what I am saying then maybe you have to try it. Contribute to Wikipedia, add a photo to Geograph, write to your MEPs with “WriteToThem”, make an FOI request with “WhatDoTheyKnow”.

And you can start to be a C21 citizen at a very early age. The knowledge century is a wonderful place to live.

Sadly ...

Scholarly publishing in the 21st Century (C21) is completely broken

It’s a 20 billion USD industry.

20 billion USD of citizens' money.

It’s probably 1000 times more money than the average project mentioned above. Maybe even more.

So how is it broken? (If you know and love Github or Stackoverflow, use them as a comparison of the wonderful against the broken.) I am not going to apportion blame to publishers, libraries, authors, funders. They have all, wittingly or unwittingly, contributed to one of the most dysfunctional knowledge systems on the planet.

And it matters. It’s not just money. It’s:


  • Human lives. I coined the phrase “Closed Access Means People Die”. I have been attacked for it. If it makes you feel more comfortable “Open Knowledge saves lives”.
  • The planet. To work out what is going to happen from anthropogenic (“human-made”) change of all sorts we need as much knowledge as possible. We are being deprived of it.
  • Citizens. It’s an unacceptably divisive system. Only 1% of the UK population (those in universities) are involved. Most of those are passive. They get told what to do. Citizens - doctors, teachers, politicians, businesses, taxi-drivers are excluded. Yes! Until taxi-drivers have a right to be involved in scholarship we are a divisive society.
  • Values. It’s distorting values. Ask a librarian/researcher/administrator why scientific publications should be free to everyone and you’ll probably get:
    1. "The Funders require it".
    2. "You’ll get more citations if you publish Open Access".
    The moral and ethical imperative (“we have a responsibility to make knowledge free to everyone”) often isn’t mentioned.
  • Community. For me “Open” is not primarily about money, it’s about working together, and being transparent.

... and in detail ...

  • It’s criminally expensive. Publishers receive ca $5000 for each paper. It’s largely public or personal (e.g. student fees) money. It actually costs around $300 (administration: reviewers don’t get paid, authors don’t get paid). Maximum. Many people publish for $0 and give their time and marginal resources. That money could be used for research, could be used for teaching. The amounts spent on journal subscriptions in the UK (ca $1 billion/year) are similar to the cost of postgraduate education.
  • It’s criminally inefficient. Much of the work is carried out by humans when C21 systems could do the same for 5% of the cost. Stackoverflow manages 10 million questions.
  • It’s criminally slow. Some papers take years to appear. Postings to repositories take fractions of a second. The great Physics/Maths site arxiv can do this. But many publishers take years to publish a paper.
  • It’s elitist and probably corrupt. It stresses “top” journals. I am all for public competition and the best winning, but this isn’t that. It favours “top” institutions (I heard of one large research org that negotiates with a “top” publisher on how many papers they are allowed per year - before the work is done).
  • It destroys the real purpose of publication. I believe that science requires that you tell the world (not an elite) - fully (not in summary):
    • What you did
    • Who did it
    • Why you did it
    • How you did it (verification and re-use)
    • When you did it
    • Where you did it
    • What you discovered (or didn't discover)

And invite the world to confirm/refute/help/criticize continually and continuously. Some competition is valuable. But competition has now become an end in itself and is destroying the other values.

I am involved in trying to bring these ideas into scholarly publishing. I have very largely been unsuccessful, when measured against the other Open activities where I have been able to help create the C21 knowledge community.

  • I’ve developed semantics for chemistry (Chemical Markup Language, CML). Chemists, chemical publishers, universities ignore this.
  • I’ve developed open databases (CrystalEye/COD). Publishers and universities ignore these.
  • I’ve prototyped semantic publication. Ignored.
  • I’ve pushed for a fully Open community of scientific scholarship. The Blue Obelisk. Ignored.
  • We’ve developed new tools for University Libraries (Open Bibliography and BibJSON). Ignored.
  • I’ve campaigned for reform of Copyright. Ignored by academia and publishers.
  • I’ve developed tools for using machines to help everyone read the scholarly literature. Active opposition.

Everyone blames everyone else. Some suffer, some get super-rich. Everyone is losing out.

It must change. Completely. If not from within, then from without.

Sci-hub is one of the external factors that could change scholarly publishing.


