Sci-hub and Legal aspects of ContentMining 4/n

I have written today to my collaborators in ContentMine - staff, volunteers, advisory board and Shuttleworth funders and mentors. It's on the legal aspects of mining. It's long, but laws are complex. It's meant to put everyone's minds at rest - us, universities, Shuttleworth, etc. It's not authoritative, but may be a useful guide. We'd love to have your feedback. tl;dr I've assessed the main problems and most people should assume we have taken a responsible and public approach.
ContentMine is preparing to mine the complete scholarly literature every day - about 10,000 scholarly articles. People from inside CM and from outside have recently raised the question of whether CM is breaking or intends to break the law. This has arisen in part because of our intention to use the UK Copyright exception to mine the whole literature, and because of speculation about the possible use of our technology by "illegal" sites such as Sci-Hub.
NOTE: I am not a lawyer (IANAL) but I have spoken to several and am aware of general principles and practice.

The simple answer is simple:

CM does not intend to break the law and intends not to break the law.

And to my colleagues:
Do not worry. You will not end up in court. If anyone does - and it is unlikely - it will be me and I am prepared.

I shall expand on this in blog posts, but please be assured that I am actively assessing areas where the laws might be broken, especially inadvertently. Note, of course, that there are many other laws which we have to observe on a continual basis, including health and safety, employment, racial discrimination, libel, immigration, etc. I get frequent updates from the Chemistry Department as to what procedures we have to observe. You, I, and everyone are bound to observe these laws. They are complex in detail, extent and interpretation, and we generally manage by knowing the outline of the law. We don't steal, and we don't read the small print of what is and is not a theft (e.g. "illegal borrowing"). But in other areas, e.g. animal experiments or immigration, the small print is critical. "Ignorance of the law is no defence".

But I will take the responsibility of guiding you and making sure that you don't transgress inadvertently.

The laws particularly relevant here include:

* copyright law

* sui generis database rights (Europe only)

* computer fraud law

* technological protection measures (TPM) and digital rights management (DRM)

* national security laws

Most of these laws have a concern about geo-location. We shall attempt to make sure that all our activities are carried out by UK staff, "in the UK", on UK machines. But what is legal here may be illegal elsewhere and vice versa. Note also that many laws, especially new ones, cannot have definite answers until they are tested in a court case. Lawyers may give opinions (for fees) but ultimately the court decides.
These laws are complex and often recent and - like many laws - it is possible to transgress unknowingly. We have to educate ourselves and to behave responsibly in actions and language. If anyone is unsure they should raise the issue.
Note that by discussing this in public we will show our good faith and also be alerted by others to potential problems and misinterpretations.
Copyright law is exceedingly complex and also depends on the country. What is legal in the US may not be in Britain and vice versa. It includes:
* the process of copying for the purpose of mining for non-commercial research
* storage of copied material
* republication of the (transformed) output as part of the research/audit/verifiability requirement.

We continually discuss this with lawyers and with librarians. No one can predict precisely what is allowed and what is not - it may depend on "impact on the market of the rights-holder". All law includes a balance of risks. It is my responsibility, and (for some content) the librarians', to make sure that we have a balanced assessment.

We believe that our mining is fully allowed under the UK 2014 reform ("Hargreaves"). It would not be allowed if we took money from commercial companies and mined the literature solely for their benefit. Europe has noted that much research is a public/private partnership (I worked for 15 years in the Cambridge Unilever Centre, for example). Was this non-commercial? I would take the view that all the projects I worked on were. If I was paid extra to do private contract research for a company which would not be published it would be commercial.

Since I and ContentMine are probably the only group in the UK at present who publicly intend to use Hargreaves there is no case law to answer these questions. We read the current public discourse and form a balanced judgment.

What copyright material can we hold on our machines? It is common for researchers to have thousands of copies of copyright material on their machines and no one is challenged. Unlike them, our material is in a secure computer room in Cambridge with physical access only by trusted staff and e-access only to 2-3 named and authorised people. If anyone wishes to "steal" the literature from our server we will actively prevent and report this. We are not, of course, ourselves redistributing any of the University subscription content other than facts and fair quotations. If, as we hope, the resource becomes useful in the University, we will work with library staff to create a legally acceptable approach where any Cambridge scholar can use the system.

How long can we hold it for? Mining is often an iterative process, so we may wish to re-run searches with new parameters. It would be a technical waste to have to re-download everything every day. It would also put additional workload on the publishers' servers. We can't give an answer in days or months or years until we know what the likely usage patterns are.
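As a rough illustration of why holding a local copy saves re-downloading (a minimal sketch only, not ContentMine's actual code): the class below keeps one copy of each article on disk and only calls the downloader the first time an identifier is requested. The names `ArticleCache`, `fetch` and `downloads` are hypothetical and invented for this example; `fetch` stands in for whatever legally permitted download routine is in use.

```python
import hashlib
from pathlib import Path
from typing import Callable


class ArticleCache:
    """Keep one local copy of each downloaded article, keyed by its identifier."""

    def __init__(self, root: Path, fetch: Callable[[str], bytes]) -> None:
        self.root = root
        self.fetch = fetch      # hypothetical downloader supplied by the caller
        self.downloads = 0      # number of times the remote server was actually hit
        self.root.mkdir(parents=True, exist_ok=True)

    def _path(self, identifier: str) -> Path:
        # Hash the identifier (e.g. a DOI or URL) to get a safe file name.
        digest = hashlib.sha256(identifier.encode("utf-8")).hexdigest()
        return self.root / digest

    def get(self, identifier: str) -> bytes:
        path = self._path(identifier)
        if path.exists():
            # Already held locally: re-running a search costs the publisher nothing.
            return path.read_bytes()
        data = self.fetch(identifier)
        self.downloads += 1
        path.write_bytes(data)
        return data
```

Re-running a search with new parameters then reads from the local store, and `downloads` shows how few requests actually reached the publisher's server.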

What can we republish? Since facts are uncopyrightable we can publish them without permission (although in Europe we cannot systematically republish the contents of databases protected by sui generis rights; journals and supplemental data are not databases). But:


is not a useful fact.

"The average snout-vent-length (SVL, see ) of the common lizards (Zootoca vivipara) found on Borchester Common ( )  was 42 mm (+- 5) measured by 3 independent researchers using the Graduated Ruler and Eyeball Method (see )"

is a useful fact. We intend to publish some or all of the facts we extract without formal permission from the publisher.

Note that a fact does not have to be "true". I don't actually know the sizes of newborn sandlizards. But what I have stated is a fact. The result might be a misprint for 142 mm (which is possible for an adult). It is still a (potentially falsifiable) fact. It remains a fact regardless of further lizard research.
I will blog more on facts as "facts" are uncopyrightable.
* sui generis database rights. We do NOT currently intend to systematically extract facts from factual databases described as such and specifically created for the purpose of holding facts.
* computer fraud laws. We scrupulously avoid breaking these laws. They carry the additional feature that they are criminal, and so prosecution would be by the police. The UK takes these very seriously and wishes to extend the maximum term of imprisonment to 10 years (I personally protest against this, but I do it legally). You should therefore take especial care not to share files "illegally". This means that ContentMine cannot have any dealings with Sci-Hub as it is seen by many as an "illegal" filesharing site. Read Ars Technica:<quote>The UK government has responded to that issue by saying that it accepts there are concerns, and writes: "the policy intention is that criminal offences should not apply to low level infringement that has a minimal effect or causes minimum harm to copyright owners, in particular where the individuals involved are unaware of the impact of their behaviour." Another major worry was the use of the term "affect prejudicially" in judging copyright infringements, which many felt was too vague and could mean a single infringing file would fulfil the requirement—for example, if it were widely shared online. Many thought this set the threshold for committing an offence far too low. The UK government said it was not aware of any cases where minor infringement had resulted in a criminal prosecution, but "agrees that the undefined term ‘affect prejudicially’ could give rise to an element of ambiguity." The government is now proposing to introduce "re-worded offence provisions" to address that.


It is extremely unlikely that we will trigger this law as we don't deliberately intend to break it and deliberately don't intend to break it. However #icanhazpdf is almost certainly "illegal" and also breaks the rules of the University. I have never used #icanhazpdf in either direction and never sent files to people who weren't subscribed. ContentMine staff should not use #icanhazpdf.

In some cases crawling has been held to be a violation of computer fraud acts (CFAA and similar) of various flavours. I am not aware of any cases where scholarly publishers have used this to prosecute bona fide researchers, nor where the police have.

Note also that many publishers know that I and others (e.g. Crystallography Open Database) have been crawling their sites for many years and by implication permit it. This includes Nature, Elsevier, American Chemical Society, Royal Society of Chemistry, Acta Crystallographica, Science. We are careful to adhere to responsible mining practice (see )

Aaron Swartz's case was - for many, including me - a serious miscarriage of justice. From Wikipedia:


<quote>In the wake of the prosecution and subsequent suicide of Aaron Swartz, lawmakers have proposed to amend the Computer Fraud and Abuse Act. Representative Zoe Lofgren has drafted a bill that would help "prevent what happened to Aaron from happening to other Internet users".[35] Aaron's Law (H.R. 2454, S. 1196[36]) would exclude terms of service violations from the 1984 Computer Fraud and Abuse Act and from the wire fraud statute, despite the fact that Swartz was not prosecuted based on Terms of Service violations.[37]

In addition to Lofgren's efforts, Representatives Darrell Issa and Jared Polis (also on the House Judiciary Committee) raised questions about the government's handling of the case. Polis called the charges "ridiculous and trumped up," referring to Swartz as a "martyr."[38] Issa, who also chairs the House Oversight Committee, announced an investigation of the Justice Department's prosecution.[38][39]

As of May 2014, Aaron's Law was stalled in committee, reportedly due to tech company Oracle's financial interests.[40]


* TPM and DRM

These are technical methods of preventing access to material and can include firewalls, encryption, specific tools, and possibly Captcha. We have bought legal advice and the result is not clear about whether Hargreaves allows us to circumvent them. The rule for all of us is that if there is any technical barrier to mining we should identify it and alert the librarians and possibly computer officers. Deliberately breaking this law could have serious consequences. Rest assured that I will publicize and comment on publishers who impose TPM.

Charles Oppenheim (Chair ContentMine advisory board) adds:
...within the UK Copyright Act there are regulations allowing for someone to ask the Secretary of State responsible for copyright law to stop a rights owner using Technical Protection Measures [to] prevent people from exercising an exception to copyright, such as the TDM exception, and the Secretary of State, after examining the evidence, can require the copyright owner to lower the barrier, or be prosecuted. However, the procedure for doing this is complex and clunky, and has been very rarely used hitherto for that reason.

* national security. It is very unlikely that we shall trigger this very serious offence. However, overzealous prosecutors or government departments - particularly in the US - have used such provisions.

There is a simplistic tendency of some companies and government departments to demonize all "hacking" as security violations. My laptop carries "Wget is not a crime" , after

was jailed for its use. See Slashdot for the link to Snowden and hackerbabble:

* scraping

ContentMine is in the business of scraping websites - scholarly publishers, academic departments, etc. Is this legal? People have been prosecuted for scraping ( from a company selling anti-scraping software). Wiley and Elsevier caused Tilburg to cut off Chris Hartgerink for downloading ("stealing") material to which he had legal access. Their accusations have not been made public and it seems most unlikely he had done anything illegal. However I have scraped publishers for 12 years (for legally accessible materials) with no complaints and I do not expect any.
* incitement to commit a crime.
In general it is a serious offence to encourage others to break the law. See for the official (and complex) UK law. For example I believe that any formal contact with Sci-hub or recommendation to use it could be interpreted as a crime. Whether the same applies to breaking contract law is less clear, but ContentMine will not, knowingly, break this either.
Please let me know whether I have omitted an important item or have misrepresented one.

A commentary on Sci-Hub: 3/n Legal aspects


It’s impossible to discuss Sci-Hub without discussing legal aspects. Unfortunately these are complex and highly varied, so it is impossible to give simple clear answers. On one hand many claim that this is a criminal (or near criminal) activity and She-who-must-not-be-named should be incarcerated or worse;  others including Aleksandra herself claim that what she is doing is her (and our) right and is therefore not illegal.

Now I am not a lawyer (IANAL) but I have talked to lawyers about this and talked to legal academics and talked with people who are authorities and

… and PLEASE correct anything I get wrong and ...

The simple answer is that no-one can give a definite answer about the law. And I will deliberately try to avoid giving anything definite. A competent lawyer will advise about the risks and the client has to decide whether they are worth taking.

Then there are many laws involved. And many jurisdictions. What is illegal in the US might not be illegal on some Pacific islands or even in Kazakhstan. And copyright alone is fiendishly complicated.
Criminal as well as civil law may be involved. It’s not just about copyright. It’s also potentially about the US Computer Fraud and Abuse Act. That’s the Act under which Aaron Swartz was indicted - a criminal, not a civil case. That means Aaron could have been sent to jail - and there were demands for a 35-year sentence… for downloading academic articles.

And most relevantly, incitement to break the law is, in itself, a crime in many jurisdictions. I have been advised that those who support Sci-Hub and urge its use could be prosecuted for this incitement and could be jailed.

People have been asking questions about using Sci-hub and ContentMine on Twitter,  such as:

  • “We need a definite answer soon”. Well, you are unlikely to get one.
  • “If we re-use facts extracted from Sci-hub content, they are uncopyrightable, right?”. Even if true, you might be breaking CFAA or other laws.
  • “X  used facts which Y had extracted from an ‘illegal’ scrape and so X is OK even if Y isn’t”. I would be very unhappy with this reasoning.

So the simple answer is that Peter Murray-Rust and ContentMine colleagues are not going to pronounce on these questions, nor are they going to deliberately or knowingly break the law. I will write a further blog post on what I am going to do.

Note that ContentMine software is offered under the permissive Apache2 licence and is owned by the Shuttleworth Foundation. Like other permissive licences there is no restriction on fields of endeavour. Therefore if someone wishes to use ContentMine software with Sci-Hub content the licence does not restrict this (other laws might).

I’ll be happy to add more to this post if others feel there are omissions or errors. But I may not feel myself able to answer questions, for reasons given above.

There’s a commentary on both the politics and legality of Sci-Hub which may be useful (some extracts):

Leaving aside how they obtain the credentials, the fact remains that the process violates copyright. In November 2015, a New York District Court granted an injunction against Sci-Hub, LibGen and several other sites in response to a complaint from Elsevier, ordering them to stop offering access to infringing content and suspending their domain names. The Judge in the case stated that “the balance of hardships clearly tips in favor of the Plaintiffs. Elsevier has shown that it is likely to succeed on the merits, and that it continues to suffer irreparable harm due to the Defendants’ making its copyrighted material available for free” (8)


Furthermore, human rights are generally considered to be enforceable against a State, not private entities. This means that a case attempting to defend Sci-Hub on the basis of article 27  would likely need to be brought against a government, presumably the US, instead of the publishers, and be aimed at radically altering their copyright law.


Copyright is considered an intellectual property right (9), and article 17 of the UNDHR states that (1) everyone has the right to own property alone as well as in association with others and that (2) no one shall be arbitrarily deprived of his property. It is therefore arguable that article 17 protects rights-holders and their right to enforce copyright. Elbakyan would need to establish that her users’ article 27 rights overrode or were more important than copyright-holders’ article 17 rights.


Interacting with Sci-Hub may, therefore, bring me into conflict with the law, probably US law. I have written earlier about where it is morally legitimate, and in some cases morally imperative, to break the law (“civil disobedience”).

There are many cases in British law and elsewhere where civil disobedience has led to reform, often with the campaigners being first found guilty and often jailed. The history of independence movements is marked by imprisonment.

Most relevantly, the “Right to Roam” movement, which inspired my mantra “The Right to Read is the Right to Mine”, was only won after imprisonments.

Currently I am fighting for the “Right to Mine” through legal and political means (and am optimistic that we will get positive help from some publishers). At present I am scrupulously trying to avoid, even inadvertently, breaking the law. I will explain why in the next blog.

A commentary on Sci-hub: 2/n. Why it matters to me and ContentMine

In my previous post, catalyzed by Sci-Hub, I argued that scholarly publishing is completely broken. It’s now lost a huge amount of respect, it’s unwieldy, unfair and mired in bickering. It pays no attention to readers. It’s becoming a write-only system where authors write not to communicate but for glory - self advancement. There’s no clear political goal …

… and no clear technological goal.

And that’s the problem.

Because we desperately need the ability to search and analyze the scientific and medical literature in a 21stC manner.  While we’ve been creating our we’ve discovered many researchers who have to “read” 10,000 papers in a day or two. They use 20thC methods - click and read - taking weeks where they should take hours. ContentMine software (completely Open) has been built to solve this problem by filtering out the papers you don’t want - often 90% of the first search. (and it does much more - it can extract complex objects). It’s Open to everyone and it works (see previous posts).
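To make the filtering idea concrete, here is a minimal sketch (an illustration only, not ContentMine's actual code). It assumes each paper is a simple record with hypothetical `title` and `abstract` fields, and keeps only those that mention at least one term of interest; the function name `filter_papers` is invented for this example.

```python
from typing import Iterable


def filter_papers(papers: Iterable[dict], terms: Iterable[str]) -> list[dict]:
    """Return only the papers whose title or abstract mentions any of the terms."""
    wanted = [term.lower() for term in terms]
    kept = []
    for paper in papers:
        # Case-insensitive substring match over the (assumed) title and abstract fields.
        text = (paper.get("title", "") + " " + paper.get("abstract", "")).lower()
        if any(term in text for term in wanted):
            kept.append(paper)
    return kept
```

Real mining goes well beyond substring matching, but even a crude filter like this lets a machine discard the bulk of a first search before a human has to read anything.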

When I came to Cambridge I had the vision of building an “artificially intelligent chemical reader”, part of which was the World Wide Molecular Matrix, a system for capturing and sharing versioned semantic chemistry. Bits of it are being built in ContentMine. I built systems where I could draw chemical formulae by speaking to the machine. We’ve built the de facto tool for chemical name recognition (OSCAR) and interpretation (OPSIN). I thought it would take 5 years to create my chemical amanuensis - scholarly assistant. With help from the publishers and scientists it probably would have. Now, after 15 years, it’s still a dream, frustrated by stagnant thinking on all sides, and deliberate opposition (e.g. nullifying European legislation).

So Stackoverflow, Github, Bitbucket, Apache, GNU, Jenkins, OuterCurve, Mozilla and many others are creating the human-machine technology of tomorrow. This encourages innovation from predictable and unpredictable sources. It works - it’s exciting and we are all part of it.

In contrast the Scholarly publishing industry has created nothing in the last 20 years. (The Scholarly Kitchen hailed the “big deal” (a pricing strategy to increase sales) as one of the greatest achievements of schol pub).

20 billion dollars per year - that’s 200 billion since I started at Cambridge - and nothing positive to show for it.

The current technology of the mainstream publishing industry is just awful. Really awful. It’s often built by outsourcing parts to people and companies who do not care how the result is used. The methods used - awful PDF and really awful HTML - are for the publisher’s convenience, not for the reader. And every publisher complains about how awful the tools are. They can’t change, they can’t innovate, they’re locked in. Add that every publisher feels they have to use a different technology to differentiate themselves from the others and it’s a complete tower of Babel. (I have spent 2 years of my life trying to solve this awful mess - and ContentMine can untangle a good deal.)

What’s even worse is that most of the publishers spend effort on STOPPING people reading the literature. The obstacles to getting to a paper grow every month. These include (from my own experience):

  • Deliberately bad PDFs.
  • Pixel maps rather than characters.
  • “Glass screens” that can’t be copied (Readcube from Nature/Springer).
  • Captchas to stop readers after 25 papers (Wiley, i.e. 400 Captchas for a literature review).
  • Monitoring every download and requiring libraries to stop researchers. (Elsevier, Wiley).
  • Automatically cutting off 200 universities for a single click (Amer. Chem Society).

Why does this matter?

Because there is so much we are missing out on. New medicinal knowledge, new ecology, new astronomy, materials, chemical reactions, … and innovation...

I should be able to ask a computer (in speech):

“Find me all chemical compounds that occur in Lantana species south of the Wallace line and compare their chemical and plant evolution. What types of compound might we see in the future, particularly due to invasive species?”

And get a result in minutes… it’s not as hard as it looks. It’s knowledge-driven science.

(Sadly, all I WILL get in minutes is a cease-and-desist letter from publishers demanding that I shouldn’t “steal their content”.)

So because we cannot innovate in this area we are 20 years behind the mainstream.

So why do I want Sci-hub? (Note carefully that I haven’t said what I am going to do and, until I do, you cannot judge my intention. I haven’t said I’m going to use it. You’ll have to wait till the next blog post).

I want Sci-hub because it’s technically BETTER than anything else we have. Much better.

And it’s the perfect complement to ContentMine.

Sci-hub has all the world's scientific knowledge in one logical place. It doesn't matter that it's spread over Torrents and other fragmentation - logically it's all there. And it's run by someone who knows what she's doing technically - unlike many publisher sites. And, I assume, she and colleagues will be receptive to technical requests and suggestions. (No one has any chance of getting conventional publishers to innovate).

Using Sci-hub would advance my and ContentMine technology enormously. ContentMine and Sci-hub fit together perfectly - because they are both designed with the 21stC mentality. Because they react to what readers want. Yes, READERS; the marginalised community of scholarly publishing. 21stC projects create a community round them. They are organic and vibrant. They respect machines and humans equally.

ContentMine + Sci-hub could be the greatest search engine in scholarship, especially for science, technology and medicine. Because it’s semantic. Because all the literature is trivially accessible in one place and one format. I don’t know of anything that remotely comes close. We can search and index diagrams - extract 15 million chemical reactions a year. (Even if a publisher tried to develop it they could only use it on “their own” content.)


But for many, including the law, Sci-hub is forbidden fruit.  Run by She-who-must-not-be-named. The arch-pirate. The criminal. (These terms are used). Peter Murray-Rust cannot use it (and I haven’t). ContentMine cannot mine it (and we won’t). We’ve looked at the legal and political aspects and I’ll analyse these in a subsequent post.

But 21stCCitizens - me, ContentMine, taxi-drivers really really want Sci-Hub.

The only things stopping us are copyright law, prosecutors and an intransigent, uncaring, out-of-touch, money-driven and self-seeking publisher-academic complex.

I’ll deal with the politico-legal in the next post.


A commentary on Sci-hub: 1. Scholarly publishing is broken

Many of you will already have read Science Magazine’s account of Sci-Hub, the “pirate” site for scholarly publications. “Science” is often seen as one of the “top three” outlets, along with Nature and Cell. Here’s the original:

And here’s a typical commentary which applauds the research in the article but criticizes the accompanying editorial showing that Science has an ethically flawed business model.

This (and following) blog is one of the most important I have written, and I shall choose words carefully. I shall include facts, opinions, and what I intend to do and not do, and why. I am always open to criticism and try to be polite and constructive. My message is already spreading to more than one posting. This one sets the scene.

This blog is nearly 10 years old. I’d like to believe that I have tried to help make scholarly publishing fit for the 21st Century (C21). I’ve seen Tim Berners-Lee’s vision of the semantic web for scholarship - I was there in CERN in 1994 - and it made sense then and even more so now. (“I” includes many collaborators, but I use “I” to make it clear that the views here are mine and mine alone. Special thanks to Henry Rzepa, my wonderful ex-group in Cambridge, Open Knowledge, ContentMine, Blue Obelisk, Crystallography Open Database (COD), librarians in Cambridge and others. Please accept this pronoun.)

I write and use Computer Programs.

  • I write programs and deposit them in Github/Apache/OuterCurve/BitBucket, etc. People use them, build on them and acknowledge me. I’ll use “Github” as a generic pointer.
  • Others write programs and deposit them in Github. I use them and build on them and I acknowledge them.
  • I offer constructive criticism.
  • I ask questions on Stackoverflow and I also answer them.
  • I’ve set up the Blue Obelisk where chemists can commit programs and make them interoperable.

This represents the pinnacle of what is possible in C21, with very modest/no funding and a collaborative intent. It works. It makes my heart soar. It’s wonderful. I’m proud to be a small part of it. Everyone wins.

There’s a similar ethos in Wiki/pedia/media/data. (“Wikipedia”).  Everyone can be a Wikipedian - all you need is to do it.

  • I have used Wikipedia for enhancing my knowledge and have contributed my knowledge to it.
  • I have used systems built by Wikipedians and I have contributed systems for use.
  • I have been to Wikipedia meetings, worked with the Wikipedians.
  • I promote Wikipedia.

I have been on the Advisory Board of Open Knowledge Foundation since it started. I have used OKF resources. I have contributed to them.

And so on… groups that I use and would contribute to if I had the time

  • Open Streetmap
  • Geograph
  • Open Corporates
  • MySociety
  • Mozilla
  • ...

Most of these are cash-starved, and find innovative ways to generate enough income to make their primary products free and Open. (“Open” == “Free to use, free to re-use, free to re-distribute”, “Free as in speech”, “Free as in liberty”).

The C21 makes knowledge-sharing communities possible. It’s very very wonderful. If you don’t understand what I am saying then maybe you have to try it. Contribute to Wikipedia, add a photo to Geograph, “Write to Them” to your MEPs, FOI with “What do They Know”.

And you can start to be a C21 citizen at a very early age. The knowledge century is a wonderful place to live.

Sadly ...

Scholarly publishing in the 21st Century (C21) is completely broken

It’s a 20 Billion USD industry.



of citizens' money

It’s probably 1000 times more money than the average project mentioned above. Maybe even more.

So how is it broken? (If you know and love Github or Stackoverflow use them as a comparison of the wonderful against the broken). I am not going to apportion blame to publishers, libraries, authors, funders. They have all, wittingly or unwittingly contributed to one of the most dysfunctional knowledge systems on the planet.

And it matters. It’s not just money. It’s:


  • Human lives. I coined the phrase “Closed Access Means People Die”. I have been attacked for it. If it makes you feel more comfortable “Open Knowledge saves lives”.


  • The planet. To work out what is going to happen from anthropogenic (“human-made”) change of all sorts we need as much knowledge as possible. We are being deprived of it.
  • Citizens. It’s an unacceptably divisive system. Only 1% of the UK population (those in universities) are involved. Most of those are passive. They get told what to do. Citizens - doctors, teachers, politicians, businesses, taxi-drivers are excluded. Yes! Until taxi-drivers have a right to be involved in scholarship we are a divisive society.
  • Values. It’s distorting values. Ask a librarian/researcher/administrator why scientific publications should be free to everyone and you’ll probably get:

1. "The Funders require it".

2. "You’ll get more citations if you publish Open Access".

The moral and ethical imperative (“we have a responsibility to make knowledge free to everyone”) often isn’t mentioned.

  • Community.  For me “Open” is not primarily about money, it’s about working together, and being transparent.

... and in detail ...

  • It’s criminally expensive. Publishers receive ca $5000 for each paper. It’s largely public or personal (e.g. student fees) money. It actually costs around $300 (administration: reviewers don’t get paid, authors don’t get paid). Maximum. Many people publish for $0 and give their time and marginal resources. That money could be used for research, could be used for teaching. The amounts spent on journal subscriptions in the UK (ca $1 billion/year) are similar to the cost of postgraduate education.
  • It’s criminally inefficient. Much of the work is carried out by humans when C21 systems could do the same for 5% of the cost. Stackoverflow manages 10 million questions.
  • It’s criminally slow. Some papers take years to appear. Postings to repositories take fractions of a second. The great Physics/Maths site arxiv can do this. But many publishers take years to publish a paper.
  • It’s elitist and probably corrupt. It stresses “top” journals. I am all for public competition and the best winning, but this isn’t that. It favours “top” institutions (I heard of one large research org that negotiates with a “top” publisher on how many papers they are allowed per year - before the work is done).
  • It destroys the real purpose of publication. I believe that science requires that you tell the world (not an elite) - fully (not in summary):
    • What you did
    • Who did it
    • Why you did it
    • How you did it (verification and re-use)
    • When you did it
    • Where you did it
    • what you discovered (or didn't discover)

And invite the world to confirm/refute/help/criticize continually and continuously. Some competition is valuable. But competition has now become an end in itself and is destroying the other values.

I am involved in trying to bring these ideas into scholarly publishing. I have very largely been unsuccessful, when measured against the other Open activities where I have been able to help create the C21 knowledge community.

  • I’ve developed semantics for chemistry (Chemical Markup Language, CML). Chemists, chemical publishers, universities ignore this.
  • I’ve developed open data bases (CrystalEye/COD). Publishers and universities ignore these.
  • I’ve prototyped semantic publication. Ignored.
  • I’ve pushed for a fully Open community of scientific scholarship. The Blue Obelisk. Ignored.
  • We’ve developed new tools for University Libraries. (Open Bibliography and BibJSON). Ignored.
  • I’ve campaigned for reform of Copyright. Ignored by academia and publishers.
  • I’ve developed tools for using machines to help everyone read the scholarly literature. Active opposition.

Everyone blames everyone else. Some suffer, some get super-rich. Everyone is losing out.

It must change. Completely. If not from within, then from without.

Sci-hub is one of the external factors that could change scholarly publishing.


Tummy bug 2: The scientific literature teaches us about Isospora

In the previous post we showed how ContentMine could give immediate knowledge about a scientific topic - we analysed “Isospora”, which is a nasty tummy bug. Let’s just read Wikipedia to get some idea of the language we’ll need


Life Cycle

PHIL 3398 lores

  • An oocyst with one sporoblast is released in stool of infected person
  • After the oocyst has been released, the sporoblast matures further and divides into two
  • After the sporoblasts divide they create a cyst wall and become sporocysts
  • The sporocysts each divide twice, resulting in four sporozoites
  • Transmission occurs when these mature oocysts are ingested
  • The sporocysts excyst in the small intestine where sporozoites are released
  • The sporozoites then invade epithelial cells and schizogony is initiated
  • When the schizonts rupture, merozoites are released and continue to invade more epithelial cells
  • Trophozoites develop into schizonts, containing many merozoites
  • After about one week, development of male and female gametocytes begins in the merozoites
  • Fertilization results in the development of oocysts, which are released in the stool [1][6]

The sporulation time of this parasite’s egg is usually 1–4 days, and the entire life cycle takes about 9–10 days.[7]


Wow! That’s complicated! But that’s because Life is complicated! These parasites have complex life cycles. You have to learn the terms - but it’s no harder than learning the terms in a new game, or a law case, or soccer strategy. You just need to want to do it! And Wikipedia will help. Wikipedia is always there. These parasites are all Apicomplexans and here’s their language


So if you are interested in more than just Isospora, use ContentMine to search for “Apicomplexan”.


Most of the papers have well defined messages. The first was about opportunistic infections in HIV patients. Read the word cloudlet for each paper here and see if you can guess the subject of papers 2,3,4,5,6. If you know the species behind the Latin names, that helps. If you don’t, use your friend Wikipedia.


Here’s my thinking:

  1. Already done
  2. Caninum, Parasitology, Vets - probably about dogs. Toxoplasma I’ve heard of - it’s a parasite, which confirms it. Never heard of Neospora or Hammondia but I wouldn’t eat them. Check - yes, they are both Apicomplexa, the latter of cats. Did we get it right?

Canine faecal contamination and parasitic risk in the city of Naples (southern Italy).

  3. Seems to be about ferrets and mink (Mustela) getting influenza. Ferrets develop fatal influenza after inhaling small particle aerosols of highly pathogenic avian influenza virus A/Vietnam/1203/2004 (H5N1).

It is. But why are people worried about ferrets getting sick?? Because influenza uses non-human hosts such as birds and ferrets so we might get it from them. And when I was in the pharma industry they used ferrets as a model of human disease.
Where’s the Isospora?
The animals lacked signs of epizootic catarrhal enteritis, and were negative by microscopy for enteric protozoans such as Eimeria and Isospora species using fecasol, a sodium nitrate fecal flotation solution (EVSCO Pharmaceuticals, Buena, NJ).


Translation: we made sure the test animals didn’t have other infections that could distort our research (and we told you how we did it).

  4. I know Gallus is a hen. And we’re going to add an icon and a mouseover on the table so you don’t need to look it up. Eimeria is an apicomplexan, and because it occurs 6 times in the paper it’s pretty important. I’m guessing it’s about parasites of hens. But what’s the rest? There are lots of genes and my guess is that they are being used for comparative genetics or possibly modes of action.
    I don’t know what “QTL” is. I probably should, but why bother when we have Wikipedia?


A quantitative trait locus (QTL) is a section of DNA (the locus) that correlates with variation in a phenotype (the quantitative trait).[1] The QTL typically is linked to, or contains, the genes that control that phenotype.


Rough Translation: The phenotype is what we feel, touch, smell, observe in an organism, and the QTL is that part of the genes that affects it.
So the paper is probably about genomic studies on parasites and chickens. Let’s look: QTL detection for coccidiosis (Eimeria tenella) resistance in a Fayoumi × Leghorn F₂ cross, using a medium-density SNP panel.

Rough translation: analysing the genome of chickens for regions that confer resistance to the most serious parasite. Eimeria is an apicomplexan, so I expect the paper mentions a range of them, including Isospora. (Yes: “Coccidia are sub-classified into several genera, including Eimeria, Isospora, Cryptosporidium, Toxoplasma and Sarcocystis.”) So we’re becoming experts on Apicomplexan names!

  5. Turdus, Coccothraustes … Thrushes and Hawfinch. The cloudlet also shows “birds” and “iron”. “Deadly Outbreak of Iron Storage Disease (ISD) in Italian Birds of the Family Turdidae”. This is the paper where they examine the birdshit for parasites...


So that seems a lot of work - and we are only 5 papers through. But some of those are relevant to Natalie and some aren’t - her false positives. So can we get ContentMine to select just the ones she needs?
We hope so. If the paper has a lot about apicomplexans it’s probably relevant. If it’s about other diseases such as HIV or flu it’s probably not. So we could remove those automatically.

And that would save a lot of time. And hopefully help us learn bioscience in an efficient manner.
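
The filtering idea above can be sketched in a few lines of Python. This is only an illustration of the principle, not how ContentMine or ami actually does it: the term lists, the function names and the threshold of 3 hits are all my own assumptions.

```python
import re

# Hypothetical term lists, made up for this sketch.
# The real ContentMine dictionaries are not shown in this post.
APICOMPLEXAN_TERMS = {
    "apicomplexan", "coccidia", "isospora", "eimeria", "toxoplasma",
    "neospora", "hammondia", "cryptosporidium", "sarcocystis",
}
OFF_TOPIC_TERMS = {"hiv", "influenza", "h5n1"}


def count_terms(text, terms):
    """Count how many word tokens in the text appear in the given term set."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return sum(1 for token in tokens if token in terms)


def is_probably_relevant(text, min_hits=3):
    """Keep a paper if it mentions apicomplexans often enough
    and more often than the off-topic diseases (assumed rule)."""
    on_topic = count_terms(text, APICOMPLEXAN_TERMS)
    off_topic = count_terms(text, OFF_TOPIC_TERMS)
    return on_topic >= min_hits and on_topic > off_topic
```

With a rule like this the chicken/Eimeria paper would be kept, while the ferret influenza paper, which mentions Eimeria and Isospora only once each, would be dropped. A real filter would need tuning against papers Natalie has judged by hand.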



How ContentMine can help you! Our example looks for "tummy bug" for Natalie

Yesterday Tom, Natalie and I had coffee together. Natalie’s a Vet student - at Royal Veterinary College - and we got talking about her project - 8 weeks doing practical research on Isospora. I’ve never heard of it. No idea what it is.

But ContentMine will know, so we’ll ask it…

We’ll be showing you in later posts how it all works, but just accept that we type:

getpapers -q isospora -x

Wait a minute for ca 207 open access papers to be downloaded, and then

cmine isospora

And wait another minute for ami to crunch through the data. Ami has already created summary files and we’ll look at full.dataTables.html which gives an overall view of all the “plugins” we have used (species, genes, words, etc.). Here’s the first few papers:

Screen Shot 2016-04-29 at 14.04.19

No need to squint - We’ll describe them in larger detail. (Note: some of the links are broken and there are a few false positives, both are being cleaned up).


The first column of the results gives links to the papers (PMC2758902 is a PubMedCentral id and clicking it will link to the EuropePubMedCentral repository of full text papers). Yes, YOU can read them. 200 free papers. If you are interested in Isospora, they are all yours! So here’s the first paper of the 200..

PMC2758902 local


We still don’t know what Isospora is, so let’s click on Isospora belli. It’s linked to Wikipedia which says:

Cystoisospora belli, previously known as Isospora belli, is a parasite that causes an intestinal disease known as cystoisosporiasis.[1] This protozoan parasite is opportunistic in immune suppressed human hosts.[2] It primarily exists in the epithelial cells of the small intestine, and develops in the cell cytoplasm.[2] The distribution of this coccidian parasite is cosmopolitan, but is mainly found in tropical and subtropical areas of the world such as the Caribbean, Central and S. America, India, Africa, & S.E. Asia. In the U.S., it is usually associated with HIV infection and institutional living.[3]

So, to paraphrase,

“Isospora is the old name of a nasty tummy bug, found mainly, but not exclusively, in the sub/tropical world that can infect HIV-sufferers”

Biological science is often hard to read for newcomers, but with practice you learn how to translate. Here’s a sentence from one paper:

Coprological examination of fresh stool specimens revealed coccidian oocysts of the genus Isospora in 36% of the birds


We examined birdshit and found parasite eggs in 36%.

The long words are useful - they aren’t there just to put you off or be pompous. They help translate between human languages, and they increase precision. If we search for “parasite eggs in birds” we might end up with bird eggs, whereas “oocysts” is more precise. ContentMine loves precise words because they reduce false positives (results that aren’t relevant to what you want).

Column “words” is a list of the commonest word tokens. In this case it’s just “patients”. That confirms that the paper is probably about human infection (though Natalie and other Vets call animals “patients”). So were we right? Click on PMC2758902 and we’ll see:

Screen Shot 2016-04-29 at 14.49.41

So it’s about HIV, and drug treatment. Where’s the Isospora? Search down the full text and we find:

The reasons for hospitalization were: disseminated tuberculosis (month 5), reactivation of oropharyngeal Kaposi's sarcoma (month 3), and Isospora belli diarrhea with severe dehydration

So if you are interested in finding all papers where Isospora has infected HIV patients, ContentMine can immediately help you.

Natalie’s main interest is veterinary, so we’ll look at the next few papers. But that shows how much there is in just ONE paper. And why we need machines to help us. Natalie probably mainly wants papers about animals and we can address that as well…


… in the next blog post!


TDM at European Parliament - tweet-like report

Great meeting at Brussels EP yesterday. Would have liked to tweet but didn't have password. There *were* tweets by the MEPs. So I wrote my notes like tweets I would have made. May be useful to some, mystifying to others...


Also Julia Reda MEP was there at the start!

Here's the panel (7-8) run by Catherine Stihler MEP (who chaired well and let everyone else speak)

Marco Giorello Head copyright Unit DG Connect

Problem: data analytics techniques involve making copies
These copies are relevant to copyright
Legal situation unclear; some exceptions for temporary copying, and copying for research purposes
(a) contractual conditions and policies
(b) legislation - UK exception - because there was already research exception (but leads to Euro fragmentation).
Other states have "research exception". Other states e.g. France, and ?Germany we don't want 15 different legislations
Dec 2015 - EC trying to find balance - PIRO [Public Interest Research Organization, yes I don't know what that is either, so asked later...] - to address Univs and research insts.
But aware that Univs have private partners
UK "non-commercial" has caused problems.
Not only about copyright - but also technology , standards ...

John Boswell SAS (software company) - analysis of data.
TDM is just one form of data analysis. Copyright wider, bcos movies, images, voice all covered by copyright
analysis of 1 million docs to extract sentiment and time series, does not implicate (C).
(C) is protection of expression of an idea. Analysing this does not copy the expression or create a derivative work. (C) must not prevent TDM. Issue much bigger than Universities. World has so much (C) - ca 300,000 every minute FB, Tweets, Instagram, etc. Much covered by copyright
Analysis of social media is major good. Govs can use social media to predict economics
Debate must realise that TDM does  not implicate (C)

Therese Comodini Cachia (MEP and meeting convener)

Don't wish to have debate on copyright vs TDM
Startups need protection from copyright and also need to use TDM
Startup innovation are EU priority - social and economic development
TDM will lead to new economic development
Reda report focussed on academic research.
innovation not just economic but also health and social
would give good push to innovation

Jakub Czakon (Stermedia) - (data analyst Physics + finance + chess)
loves data
TDM = data -> information -> knowledge
example s/w that matches CVs onto job offers
extract important info from data
try to match qualifications- find connections and distances between documents
health care - diagnosis of tumour - used machine learning and public data - found public competition training set.
looks for cells and local structure. Created diagnostic indicators.
facial recognition
these skills and startups are critical for Europe

Adriana Homolova - data journalist and visualisation
dataScience >> data analysis (insight into data) >> data analytics (analysing large amounts of data) >> data mining
uses AI.
NeuralNets, RandomForests, NearestNeighbours
Data mining is starting in journalism
journalism qualitative vs quantitative - "Interview data"
makes journalism stronger
data analysis used to filter professors for side jobs for "interesting people"
e.g. 3 side jobs per prof
BBC analysed tennis for match-fixing for repeated underperforming
published on github
revolutionary in journalism
Panama papers had 400 (competing) journalists to abandon secrecy "newsroom collaboration"
data are the raw material of our age.
copyright can do much harm.
data analytics are extension of our thought processes
we must look how to open up - e.g. copyleft

Jean-Francois Dechamp DG Research and Innovation
both policy creation and funding agency
FutureTDM and OpenMinTed
objective - best conditions to do their job
researchers are both producers and consumers
researchers often don't own copyright of their research
competition fierce - merger of Springer and Nature
data journals
publishers => service providers

Sergey Filippov Lisbon Council (Brussels Innovation Think tank)
Report 2 years ago on TDM in Academic and Research Communities in Europe
Academic pubs 1.5 million / year, 60 million in total
"Publish or perish" leads to distraction from teaching and poor research
Traditional k/w search, TDM can recognise concepts, facts, relations, preparatory
idea -> lit rev  (TDM)-> hypothesis (TDM) -> data methodology -> analys conclusions
what's problem? copyright ...
researched this...
scientific publications 1200 pubs 47% from US EU 26% EU cited less than US
applicable to all subjects, not just hard sciences
10-fold increase in Data mining, TDM papers in last 5 years
US 21%, EU 28, CN 10, IN 13%
Patents in data mining huge growth in China
Then he interviewed 20 researchers
most people don't know about TDM or tech-savvy
many worried about copyright
leads to results of lower quality
academics want exceptions
growth in CN and IN and US
Europeans concerned but worried about clarity
if we don't manage to get TDM used, then far-reaching negative implications for EU


Christoph Bruch: Open Science Coordination Office of the Helmholtz association,

lot of researchers want assurance
Must not be universities only
(to  Marco EC) must not limit how society can use information
limit will do very much damage

Marco - commercial vs nc. Current draft is not final.
Why not business activities. Exception would also be (C) but certain classes of beneficiaries.
must look at (C) with care
cause friction
Pharma already use licences
Existing lucrative Market for re-use so EC can't easily sweep it away
attempt to give full legal certainty
will be positive for academia and neutral for others

Boswell SAS - there is broad exception for TDM as "fair use" if not used for other purpose
interim step - new work is not copy of expression
in EC temporary copy should be covered by 5.1 of InfoSoc directive
PPIs with universities - lines are blurred
Should not make lines between univs and others

PM-R gave TDMer point of view and asked about PIRO - more later


@TheContentMine preparing for largescale high-throughput Mining (TDM)

The ContentMine has almost finished the infrastructure and software for automatic daily mining of the scientific literature. We hope to start testing in the next few days. I'll try to post frequent information.

The software has been developed by the ContentMine Team, wonderfully funded by the Shuttleworth Foundation. The people involved include:

  • Mark MacGillivray
  • Anusha Ranganathan
  • Richard Smith-Unna
  • Tom Arrow
  • Peter Murray-Rust
  • Chris Kittel
  • and voluntary contributions

The daily operation (as opposed to user-driven getpapers) consists of:

  • DOIs and URLs provided by CrossRef
  • downloading software
  • indexing of fulltext documents (closed as well as open, legal under the UK "Hargreaves" exception)
  • fact extraction
  • display

We'll detail this later.

The sources include:

  • open repositories such as EuropePubMedCentral
  • arxiv and other repositories
  • closed documents to which Cambridge University subscribes. We are working intimately with Cambridge University Library staff and offer public applause and thanks.

All closed work will be carried out on closed machines run by the University's computer officers, primarily in Chemistry, and again public thanks to this wonderful group. We take great care to limit access so that no unauthorised access is possible and that there is also an audit trail of what we do and have done.

It is difficult to predict the daily volume. MarkMacG has found it to vary between 300 and 80,000 documents a day. My guess is about 2000-7000 on average.

This is NOT a resource problem. The whole scientific literature for a year can be held on a terabyte disk. The processing time is small - perhaps 1000 documents a minute on our system. The whole literature can be done within a long coffee break.

The impact on publisher servers is minimal. At, say, 5000 articles/day even the largest publisher would only get 1 request per minute. The others would be trivial (1 request every 5-10 minutes). There is no case that our responsible TDM would cause any problems at all.
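
The arithmetic behind these estimates is easy to check. Here is a minimal sketch, assuming requests are spread evenly over the day. The publisher shares used below (29% and 4% of the daily articles) are my own illustrative assumptions, chosen to match "1 request per minute" and "1 request every 5-10 minutes"; they are not measured figures.

```python
MINUTES_PER_DAY = 24 * 60


def requests_per_minute(articles_per_day, publisher_share):
    """Average request rate seen by one publisher, assuming one request
    per article and requests spread evenly over 24 hours."""
    return articles_per_day * publisher_share / MINUTES_PER_DAY


def processing_minutes(documents, docs_per_minute=1000):
    """Time to process a batch at the rough rate quoted above
    (perhaps 1000 documents a minute)."""
    return documents / docs_per_minute


# Assumed share for the largest publisher: about 29% of 5000 articles/day.
largest = requests_per_minute(5000, 0.29)

# Assumed share for a smaller publisher: 4% of 5000 articles/day.
smaller_interval = 1 / requests_per_minute(5000, 0.04)

# A typical day (5000 documents) and the largest day seen so far (80,000).
typical_day = processing_minutes(5000)
busiest_day = processing_minutes(80000)
```

On these assumptions the largest publisher sees about one request a minute, a smaller one sees a request every seven minutes or so, and a typical day's literature is processed in about five minutes.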

And, just to reassure everyone, I and colleagues are working hard to stay completely within the law as we see it. We are not stealing content.


Off to Brussels for ContentMining (TDM) meeting.

I'm spending a (long) day going to Brussels to a meeting run by MEPs and the European Parliament on Text and Data Mining. Here's the metadata:

“Demystifying Text and Data Mining in a copyright context”

When: Wednesday 27 April 2016, 13.00 – 15.00

Where: European Parliament, ASP, Room A5E2

Event co-hosted by Miapetra Kumpula-Natri & Therese Comodini Cachia & Catherine Stihler

First - I am a great supporter of the MEPs who propose reform - we can add Julia Reda (@senficon) to this.

The blurb is only present as a woolly GIF. Why??? I can't even cut-and-paste? We are in the digital century?

The UK has one of the few Exceptions to Copyright allowing TDM (for very limited purposes - personal non-commercial research for those who have legal access to the material). I am one of the very few people - perhaps one of two - who is actually using this legal permission.

Europe has been fighting for similar rights - and so have individual jurisdictions such as France:

Declaration pro-exception in #copyright for #TDM in France (and in French) by group of entrepreneurs and leaders:
(PMR summary - the great-and-good of France are fighting for rights to carry out TDM).

However I am deeply worried about the European initiative. Every time there is to be a draft, the time slips. The current wording is so vague as to be almost useless. We are all fighting massive opposition from publishers and lobbyists and reform gets watered down month by month...

Simply - I (PMR) am allowed to mine in UK because ANYONE has "The right to read is the right to mine". By contrast in Europe only "Public (Interest) Research Organisations" can mine.

  • Is a journalist a PIRO? No.
  • Is a teacher a PIRO? No.
  • Is PMR a PIRO? No.

Who is?

My guess is that this will turn out to require either/or

  • a regulator
  • a court case

If we rely on the EC then maybe I would have to register as an approved TDM'er and only carry out TDM at approved institutions.

Please tell me that I am overreacting.


I shall certainly ask this tomorrow if I am allowed to speak.

oh - and here is the awful GIF that accompanied the event. I hope against hope that it was a mistake. It sends out every wrong message...
Screen Shot 2016-04-26 at 19.24.39

TDM Copyright reform is about LICENSING?? NO, NO, NO

ContentMine at Force2016; notes for my session

I have a 30 mins session at Force2016 on Semantic Publishing. I'll concentrate on ContentMine. I shall not powerpoint people, but do some experiments.

Here are some useful links:


NOTE: You can do all this yourself. You don't need to be in a University or get publisher permission. I shall explore this with my taxi-driver.


