Extracting 100 million facts from the scientific literature - 1

In TheContentMine, funded by the Shuttleworth Foundation, we aim to extract 100 million facts from the scientific literature. That's a large number so here's our thinking....

What is a fact?

Common Mist Frog (Litoria rheocola) of northern Australia. Credit: Damon Ramsey / Wikimedia / CC BY-SA


"Fact" means slightly different things to philosophers, lawyers, scientists and advertisers. Wikipedia describes facts in science,

a fact is a repeatable careful observation or measurement (by experimentation or other means), also called empirical evidence. Facts are central to building scientific theories.


a scientific fact is an objective and verifiable observation, in contrast with a hypothesis or theory, which is intended to explain or interpret facts.

In ContentMine we highlight the "objective", i.e. people will agree that they are talking about the same things in the same way. We concentrate on facts reported in scientific papers; my colleague Ross Mounce (Daily updates on IUCN Red List species) showed some excellent examples about a mistfrog [1]. Here are some examples he quoted:

  • Litoria rheocola is a small treefrog (average male body size: 2.0 g, 31 mm; average female body size: 3.1 g, 36 mm [20])
  • the common mistfrog (Litoria rheocola), an IUCN Endangered species [22] that occurs near rocky, fast-flowing rainforest streams in northeastern Queensland, Australia [23]…
  • we tracked frogs using harmonic direction finding [32,33].
  • individuals move along and at right angles to the stream
  • Fig 3. Distribution of estimated body temperatures of common mistfrogs (Litoria rheocola) within categories relevant to Batrachochytrium dendrobatidis growth in culture (<15°C, 15–25°C, >25°C). [Figure]

All of the above, including the components of the graph, are FACTS. They have these features:

  • they are objective. They may or may not be "true" - another author might dispute the sizes of the frogs or where they live - but the authors have stated them as facts.
  • they can be represented in formal representations without losing meaning or precision. There are normally very few different ways of representing such facts. "Alauda arvensis sing on the wing" is a fact. "Hail to thee blithe spirit, bird thou never wert" is not a fact.
  • they are uncopyrightable. We contend that all the facts we extract are uncopyrightable statements and therefore release them as CC0.

How do we represent facts? Generally they are a mixture of simple natural language statements and formal specifications. "A brown frog that lives in Queensland" is adequate; "L. rheocola. colour: brown; habitat: Queensland" says the same, slightly more formally.  Formal language is useful for us as it's easier to extract. The form:

object: name; property1: value1; property2: value2

is very common and very useful. Often it's put in a table, graph or diagram. Transforming between these is one of the strengths of ContentMine software. The box plots could be put in words: "In winter in Windin Creek between 0 and 12% of the frogs had body temperatures below 15 Celsius", but the plot may be more useful to some scientists (note the redundancy!).
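To make that concrete, here is a minimal sketch in Python of how such a semi-structured string could be turned into a key-value record. This is an illustration only, not ContentMine's actual code, and the example string is invented to match the frog example above.

```python
# Minimal sketch (not ContentMine code): parse an "object: name; prop: value" string.
def parse_fact(text):
    """Split a semi-formal fact string into a dictionary of fields."""
    fields = [f.strip() for f in text.split(";") if f.strip()]
    fact = {}
    for field in fields:
        key, _, value = field.partition(":")
        fact[key.strip()] = value.strip()
    return fact

print(parse_fact("object: L. rheocola; colour: brown; habitat: Queensland"))
# {'object': 'L. rheocola', 'colour': 'brown', 'habitat': 'Queensland'}
```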

So the scientific observations - temperatures, locations, dates - are all Facts. The sentence contains seven factual concepts: winter, Windin Creek, 0%, 12%, L. rheocola, body temperature, < 15 °C. In ContentMine we refer to all of these as "Facts". Perhaps more formally we might say:
section-i/para-j/sentence-k in doi:10.1371/journal.pone.0127851 contains Windin Creek

section-i/para-j/sentence-k in doi:10.1371/journal.pone.0127851 contains L. rheocola

Those who like RDF (we sometimes use it) may regard these as triples (document-component contains entity). In a similar manner the linked data in Wikidata should be regarded as Facts (which is why we are working with Wikidata to export extracted facts there).
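For readers who want to see this concretely, here is a minimal hypothetical sketch using Python's rdflib. ContentMine does not necessarily use rdflib, and the namespace and sentence identifier are invented for illustration; the point is only the "document-component contains entity" triple shape.

```python
# Hypothetical sketch: "document-component contains entity" as RDF triples (rdflib).
from rdflib import Graph, Namespace, URIRef

EX = Namespace("http://example.org/contentmine/")   # invented namespace
g = Graph()

sentence = URIRef("https://doi.org/10.1371/journal.pone.0127851#section-i/para-j/sentence-k")
g.add((sentence, EX.contains, EX["WindinCreek"]))
g.add((sentence, EX.contains, EX["Litoria_rheocola"]))

print(g.serialize(format="turtle"))
```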

How many facts does a scientific paper contain?

Every entity in a document is a Fact. Each author, each species, each temperature, date, colour. A graph may have 100 facts, or 1000. Perhaps 100 facts per page? A 10-page paper might have 1000 facts. Some chemistry papers have 300 pages of supporting information. So if we read 1 million papers we might get a billion facts - our estimate of 100 million is not hyperbole.


[1] Reported in PLoS ONE: Roznik EA, Alford RA (2015) Seasonal Ecology and Behavior of an Endangered Rainforest Frog (Litoria rheocola) Threatened by Disease. PLoS ONE 10(5): e0127851. doi:10.1371/journal.pone.0127851.

Should Wikipedia work with Elsevier?

This story has erupted in the last 2 days - if it had been earlier I would have covered it in my talk at the Wikipedia Science Conference.

[TL;DR: Elsevier has granted accounts to 45 top editors at Wikipedia so they can read closed access publications as part of their editing. I strongly oppose this and say why. BTW I consider myself a committed Wikipedian.]

Glyn Moody in Ars Technica has headlined:

“WikiGate” raises questions about Wikipedia’s commitment to open access

Glyn mailed me for my opinion and the piece, which is accurate, also highlights Michael Eisen's opposition to the new move. I'll cut and paste large chunks and then add additional comment.

Scientific publisher Elsevier has donated 45 free ScienceDirect accounts to "top Wikipedia editors" to aid them in their work. Michael Eisen, one of the founders of the open access movement, which seeks to make research publications freely available online, tweeted that he was "shocked to see @wikipedia working hand-in-hand with Elsevier to populate encylopedia w/links people cannot access," and dubbed it "WikiGate." Over the last few days, a row has broken out between Eisen and other academics over whether a free and open service such as Wikipedia should be partnering with a closed, non-free company such as Elsevier.

Eisen's fear is that the free accounts to ScienceDirect will encourage Wikipedia editors to add references to articles that are behind Elsevier's paywall. When members of the public seek to follow such links, they will be unable to see the article in question unless they have a suitable subscription to Elsevier's journals, or they make a one-time payment, usually tens of pounds for limited access.

Eisen went on to tweet: "@Wikipedia is providing free advertising for Elsevier and getting nothing in return," and that, rather than making it easy to access materials behind paywalls, "it SHOULD be difficult for @wikipedia editors to use #paywalled sources as, in long run, it will encourage openness." He called on Wikipedia's co-founder, Jimmy Wales, to "reconsider accommodating Elsevier's cynical use of @Wikipedia to advertise paywalled journals." His own suggestion was that Wikipedia should provide citations, but not active links to paywalled articles.

Agreed. It is not only providing free advertising but, worse, it implicitly legitimizes Elsevier's control of the scientific literature. Rather than making it MORE accessible to the citizens of the world, it makes it LESS.

Eisen is not alone in considering the Elsevier donation a poisoned chalice. Peter Murray-Rust is Reader Emeritus in Molecular Informatics at the University Of Cambridge, and another leading campaigner for open access. In an email to Ars, he called the free Elsevier accounts "crumbs from the rich man's table. It encourages a priesthood. Only the best editors can have this. It's patronising, ineffectual. And I wouldn't go near it."

This arbitrary distinction between the 45 top editors and everyone else is seriously divisive. Even if this were a useful approach (it isn't), why should Elsevier decide who can, and who can't, be a top Wikipedia editor? Wikipedia has legitimate concerns about who becomes an editor and how editors are "appointed" - it's meritocratic and, though imperfect, any other solution (cf. Churchill on democracy) is worse.

You may think I am overreacting - that Elsevier will behave decently and collaboratively. I've spent 6 years trying to "negotiate" with Elsevier about Content Mining - and it's one smokescreen after another. They want to develop and retain control over scholarship.

And I have additional knowledge. I've been campaigning for reform in Europe (including the UK) and everywhere the publishers are fighting us. Elsevier wants me and collaborators to "licence" the right to mine - these licences are designed to make Elsevier the central control. I would strongly urge any Wikipedian to read the small print and then run a mile.

This isn't the first time that Wikipedia has worked closely with a publisher in this way. The Wikipedia Library "helps editors access reliable sources to improve Wikipedia." It says that it supports "the broader move towards open access," but it also arranges Access Partnerships with publishers: "You would provide a set number of qualified and prolific Wikipedia editors free access to your resources for typically 1 year." As Wikipedia Library writes: "We also love to collaborate on social media, press releases, and blog posts highlighting our partnerships."

It is that cosy relationship with publishers and their paywalled articles that Eisen is concerned about, especially the latest one with Elsevier, whom he described in a tweet as "#openaccess's biggest enemy." Eisen wrote: "it is a corruption of @Wikipedia's principles to get in bed with Elsevier, and it will ultimately corrupt @Wikipedia." But in a reply to Wikipedia Library on Twitter, Eisen also emphasised: "don't get me wrong, i love @wikipedia and i totally understand everything you are doing."

Murray-Rust was one of the keynote speakers at the recent Wikipedia Science Conference, held in London, which was "prompted by the growing interest in Wikipedia, Wikidata, Commons, and other Wikimedia projects as platforms for opening up the scientific process." The central question raised by WikiGate is whether the Wikipedia Library project's arrangements with publishers like Elsevier that might encourage Wikipedia editors to include more links to paywalled articles really help to bring that about.

Elsevier and other mainstream publishers have no intention of major collaboration, nor of releasing the bulk of their material to the world. Witness the 35-year-old paper that predicted that Ebola could break out in Liberia; it's still behind an Elsevier paywall.

[These problems aren't confined to Elsevier; many of the major publishers do similar things to restrict the flow of knowledge. When it appeared that ContentMining might become a reality, Wiley added CAPTCHAs to its site to prevent it. But Elsevier is the largest and most unyielding publisher, often taking the lead in devising restrictions, and so it gets most coverage.]

Wikimedian Martin Poulter, who is the organiser of the Wikipedia Science Conference, has no doubts. In an email, he told Ars: "Personally, I think the Wikipedia Library project (which gives Wikipedia editors free access to pay-walled or restricted resources like Science Direct) is wonderful. As a university staff member, I don't use it myself, but I'm glad Wikipedians outside the ivory towers get to use academic sources. Wikipedia aims to be an open-access summary of reliable knowledge—not a summary of open-access knowledge. The best scholarly sources are often not open-access: Wikipedia has to operate in this real world, not the world we ideally want."

The debate will continue publicly in Wikip/media. That's good.

The STM publishers, Rightslink, and similar organisations are lobbying politicians and librarians to prevent the liberation of knowledge. That must be fought every day.


Wikipedia and Wikidata. Massive Open resources for Science.

I think Wikipedia is a wonderful creation of the XXIst Century, the Digital Enlightenment. It has arisen out of the massive cultural change enabled by digital freedom - the technical ability for over half the world (and hopefully soon almost all) to read and write what they want.

I was invited to give the plenary lecture at Wikipedia Science - a new venture, and one which was wonderfully successful. Here's me, promoting "The Right to Read is The Right to Mine".


I'm not sure whether there is a recording or transcript - I'd certainly value them as I don't read prepared speeches.

My theme was that Wikidata - one of the dozen major sections of Wikimedia - should be the first stopping place for people who want to find and re-use scientific data. That doesn't mean that WD necessarily contains all the data itself, but it will have structured, validated links to where the data can be found.

Here are my slides, which contain praise for Wikim/pedia, the problems of closed information, and the technology of liberating it through ContentMining. In ContentMine we are mining the daily literature for science, and Wikidata will be one of the places that we shall look to for recording the results.

One of the great aspects of Wikipedia is that it has an Open approach to governance. Last year at Wikimania I was impressed by the self-analysis of Wikipedia - how can we run a distributed, vibrant, multicultural, multidisciplinary organisation? If anyone can find the answer it's Wikimedia.

But running societies has never been and never will be easy. People will always disagree about what is right and what is wrong; what will work and what won't.

And that's what the next post is about. Wikipedia has embarked on a collaboration with Elsevier to read the closed literature. Many people think it's a good way forward. Others like Michael Eisen and I think it's a dereliction of our fundamental values.

It's healthy that we debate this loudly in public. During that process we may lose friends and make new ones, but we advance our communal processes.

What's supremely unhealthy is that large closed monopolistic capitalist organisations make decisions in private, colluding with governments to constrain and control the Digital Enlightenment.




Announce: Microbial Supertree through ContentMining

I haven't blogged for some time as I have been writing Liberation Software (software to make knowledge and people free). Now we (Ross Mounce, Matt Wills and I) have got our first significant scientific result - a supertree:


I am going to leave Ross the opportunity to blog this in detail - he was hacking this late last night - so a brief overview:

For every new microorganism it's obligatory to compare it with other organisms in an evolutionary (phylogenetic) tree. Here's a typical one (don't be frightened - everyone can understand this if they are familiar with evolutionary ideas.)

https://github.com/ContentMine/ijsem/blob/master/batch1/ijs.0.000364-0-003.pbm/image/ijs.0.000364-0-003.pbm.png . The image was published in http://ijs.sgmjournals.org/content/journal/ijsem/10.1099/ijs.0.000364-0?crawler=true&mimetype=application/pdf (Citation: International Journal of Systematic and Evolutionary Microbiology (2009), 59, 972–980, DOI 10.1099/ijs.0.000364-0).



[I have added "root" and magnified some of it].

There are 31 microorganisms (mainly bacteria) listed in the middle. Each has a binomial (scientific) name (Pyramidobacter piscolens), a strain identifier (W5455T), and an identifier in an RNA database (e.g. EU379932). The lines represent a "tree" with its root (not shown) at the left-hand side and the presumed divergence of the species. It's certainly a useful classification; you can debate whether it's a useful historical model of the actual evolution over many millions of years. Thus it says that Pyramidobacter piscolens is closely related to Jonquetella anthropi and much more distantly related to Escherichia coli, a bacterium in everybody's gut.

Each paper provides one such tree - which could take significant amounts of computation (often hours, depending on the strictness). What we have done - and this is a strength of Matt's group - is to bring thousands of such trees together. They weren't calculated with this in mind, and we are adding value by extracting them from the literature and making comparisons and aggregations.
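To give a feel for how a single published tree becomes something a program can aggregate, here is a tiny sketch using Biopython's Phylo module and the standard Newick format. This is illustrative only - it is not the ContentMine or supertree code, and the three-taxon tree is invented to mirror the example above.

```python
# Illustrative only: read a small Newick tree and list its tips (Biopython).
from io import StringIO
from Bio import Phylo

newick = "((Pyramidobacter_piscolens,Jonquetella_anthropi),Escherichia_coli);"
tree = Phylo.read(StringIO(newick), "newick")

for tip in tree.get_terminals():
    print(tip.name)            # the taxa at the leaves of the tree

Phylo.draw_ascii(tree)         # crude text rendering of the topology
```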

Ross downloaded about 4,300 papers and the trees in them. I wrote the software to extract trees from the images. This is not trivial - the images are made of pixels - there are no explicit lines or characters or words, and this research is full of heuristics. So we can't always distinguish "O" (the letter) from "0" (zero), and there will be an unavoidable percentage of garbles.

BUT we have ways of detecting and correcting these ("cleaning") and the most valuable are:

  • comparing the scientific name with the RNA ID
  • looking up the name in the NCBI's Taxdump (a list of all biomedical species)

Ross has developed several methods of cleaning and we are reasonably confident that the error rate in species is no worse than 1 in 1000. (Note, by the way, that in a sibling image the authors have made a misprint: "Optiutus" should be "Opitutus", so the primary literature also contains errors.)
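A simplified sketch of that kind of cleaning step might look like the following, using Python's standard difflib. This is not Ross's actual code, and the tiny taxon list is made up for the example; a real run would load the full NCBI Taxdump name list.

```python
# Sketch of name cleaning: match an OCR-garbled species name against a taxon list.
import difflib

taxdump_names = ["Opitutus terrae", "Pyramidobacter piscolens", "Jonquetella anthropi"]

def clean_name(ocr_name, names=taxdump_names, cutoff=0.8):
    """Return the closest known taxon name, or None if nothing is close enough."""
    matches = difflib.get_close_matches(ocr_name, names, n=1, cutoff=cutoff)
    return matches[0] if matches else None

print(clean_name("Optiutus terrae"))   # -> "Opitutus terrae"
```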

Everything we do is Open Notebook. We post stuff as soon as it is ready. We store it on Github (see above link, which has all 4,300 trees), discuss it on ContentMine's Discourse (discuss.contentmine.org/t/error-analysis-in-ami-phylo/116/ - you can see that every detail is made open), keep the software at https://github.com/petermr/ami-plugin (and many other repos, fully Open and often updated several times a day), and post to contentmine.org, where Ross will be blogging it.

I hope to write much more frequently now.






Postal voting UK style


Whenever there is a national, international or local vote I feel it's my duty to vote. My grandmother was a suffragette and their actions won the right for today's British women to vote. (She didn't go to jail because she had small children and the suffragettes therefore wouldn't allow her to be put in the front line.) Universal suffrage was one of the main struggles of the C20 - and it's still going on today.

So even if you think your vote makes no difference, the very fact of voting in the UK supports the disadvantaged elsewhere.

I take the act of voting seriously and dress up for the occasion - see my voting suit above. The suit came about because it was unclear (and still is) whether a voter has to show their face. In fact every time I have voted as a bear I have been asked to remove my head and have done so.

This year I can't vote in person so I applied for a postal vote. I have found the process archaic and arcane. I am sure that many voters don't use a postal vote because of the hassle.

  • Step 1. find out how to do it.
  • Step 2. Send a postal request for a voting form. This cost me a first-class stamp, which seems absurd - why can we not ask for a form online? I sent the form off and heard nothing. I tweeted my MP (Julian Huppert) and he agreed that I should have heard. I rang up this Monday and they said forms had been sent out last Friday. The forms finally arrived on Wednesday, so it took 3 (or 4) working days for them to travel 2 miles (3 km). Relying on the Royal Mail to provide "next day" delivery for first-class local post seems broken.
  • Step 3. I have filled the forms in, for the National and Local elections. How did I vote? [see below]. I then dressed up to vote today, 2015-05-01, went to my local post-box, and put the envelope in it. It has 4 working days to get to the Guildhall (3 km distant). I will never know if it reached it in time. It's possible I may be disenfranchised by slow postage.

So the system must be changed. It should be possible to get a voting form instantaneously. I am not arguing for electronic voting, but it should be possible to know that your vote has arrived and will go into the counting process.

So who did I vote for?

The party system is so broken in the UK that it's impossible to vote for a party. I used to know what Labour, Conservative and Liberal stood for. Now I don't. Blair betrayed the system for ever - with the implied slogan "Trust me, I know what's best". The current leaders are pathetic. They are currently trying to find small gaps in their opponents' policies and add sweeties for the electorate. "No new income tax for 5 years": if that was so important it should have been in the manifesto. They are simply gibbering and most people who think about the issues have long ago given up. Clegg has destroyed the Lib Dems - left them with no solid bedrock.

So I don't vote for parties. I vote for people. That's on the basis that a responsible representative will recognize when policy is so far adrift that it has to be challenged. (Yes, you have spotted that I am an idealist). It makes it very difficult in Europe because you can only vote for parties.

Politics is carried out 365 days a year. Many issues are not party-political but require independent, committed analysis and also hard work. There is good politics in the UK, in Europe, and in the US (I don't have first-hand knowledge of most countries). There is also awful politics in all of them, and it's that that I am fighting.

So you will have to guess who I voted for.


Is Figshare Open? "it is not just about open or closed, it is about control"

[Quote in title is from Mark Hahnel, see below]

I have been meaning to write on this theme for some time, and more generally on DigitalScience's growing influence in parts of the academic infrastructure. This post is sparked by a twitter exchange (follow backwards from https://twitter.com/petermurrayrust/status/591197043579813888 ) in the last few hours, which addresses the question of whether "Figshare is Open".

This is not an easy question and I will try to be objective. First let me say - as I have said in public - that I have huge respect and admiration for how Mark Hahnel created Figshare while a PhD student. It's a great idea and I am delighted - in the abstract - that it gained so much traction so rapidly.

Mark and I have discussed issues of Figshare on more than one occasion and he's done me the honour of creating a "Peter Murray-Rust" slide (http://www.slideshare.net/repofringe/figshare-repository-fringe-2013 ) where he addresses some (but not all) of my concerns about Figshare after its "acquisition" by Macmillan Digital Science (I use this term, although there are rumours of a demerger or merger). I use "acquisition" because I have no knowledge of the formal position of Figshare as a legal entity (I assume it *is* one? Figshare FAQs ) and that's one of the questions to be addressed here.

From the FAQs:

figshare is an independent body that receives support from Digital Science. "Digital Science's relationship with figshare represents the first of its kind in the company's history: a community based, open science project that will retain its autonomy whilst receiving support from the division."

However http://www.digital-science.com/products/ lists Figshare among "our products" and brands it as if it is a DigitalScience division or company. Figshare appears to have no corporate address other than Macmillan and I assume trades through them.

So this post has been catalysed by a tweet reporting a talk by a DS employee(?), Dan Valen:

John Hammersley @DrHammersley tweeted:
Such a key message: "APIs are essential (for #opendata and #openscience)" - Dan Valen of @figshare at #shakingitup15 pic.twitter.com/HDyYEaXJRn

This generated a twitter exchange about why APIs were/not essential. I shan't explore that in detail, but my primary point is that:

If the only access to data is through a controlled API, then the data as a whole cannot be open, regardless of the openness of individual components.

There is no doubt that some traditional publishers see APIs as a way of enforcing control over the user community. Readers will remember that I had a robust discussion with Gemma Hirsh of Elsevier, who stated that I could not legally mine Elsevier's data without going through their API. She was wrong, categorically wrong, but it was clear that she and Elsevier saw, and probably still see, APIs as a control mechanism. Note that Elsevier's Mendeley never exposed their whole data - only an API.

An API is the software contract with a webserver offering a defined service. It is often accompanied by a legal contract for the user (with some reciprocity). The definition of that service is completely in the hands of the provider. The control of that service is entirely in the hands of the provider. This leads to the following technical possibilities:

  • control: The provider can decide what to offer, when, to whom, and on what basis. They can vary this by date, geography or IP of user, and I have no doubt that many publishers do exactly this. In particular, there is no guarantee that the user is able to see the whole data and no guarantee that it is not modified in some way from the "original". This is not, per se, reprehensible but it is a strong technical likelihood.
  • monitoring: ("snooping") The provider can monitor all traffic coming in from IP addresses, dwell times, number of revisits, quite apart from any cached information. I believe that a smart webserver, when coupled to other data about individuals, can deduce who the user is, where they are calling from and, with the sale of information between companies, what they have been doing elsewhere.

By default companies will do both of these. They could lead to increased revenue (e.g. Figshare could sell user data to other organizations) and increased lock-in of users. Because Figshare is one of several Digital Science products (DS's words, not mine) they could know about a user's publication record, their altmetric activity, what manuscripts they are writing, what they have submitted to the REF, what they are reading in their browser, etc. I am not asserting this is happening, but I have no evidence it is not.
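To make those two possibilities concrete, here is a deliberately simplified, hypothetical sketch (Python/Flask) of an API that rations what a caller may see and logs who asked for what. It has nothing to do with Figshare's or any publisher's real service; every name, record and quota in it is invented.

```python
# Hypothetical sketch: an API that both controls (quota, field filtering) and monitors usage.
import logging
from flask import Flask, jsonify, abort, request

app = Flask(__name__)
logging.basicConfig(level=logging.INFO)

RECORDS = {"1": {"title": "Example dataset", "full_data": [1, 2, 3], "summary": "3 points"}}
QUOTA = {"demo-key": 100}          # requests remaining per API key (provider's choice)

@app.route("/records/<rid>")
def get_record(rid):
    key = request.args.get("api_key", "")
    logging.info("key=%s ip=%s record=%s", key, request.remote_addr, rid)   # monitoring
    if QUOTA.get(key, 0) <= 0:                                              # control: who may ask, and how often
        abort(403)
    QUOTA[key] -= 1
    record = RECORDS.get(rid) or abort(404)
    return jsonify({"title": record["title"], "summary": record["summary"]})  # control: only a slice is exposed

if __name__ == "__main__":
    app.run()
```

The point is that the quota, the field filtering and the logging are all invisible to the caller and entirely at the provider's discretion.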

Mark says, in his slides,

"it is not just about open or closed, it is about control"

and I agree. But for me the questions are: who controls Figshare? And is Figshare controlling us?

Figshare appears to be one of the less transparent organizations I have encountered. I cannot find a corporate structure, and the company's address is:

C/o Macmillan Publishers Limited, Brunel Road, Basingstoke, Hampshire, RG21 6XS

I can't find a board of directors or any advisory or governing board. So in practice Figshare is legally responsible to no-one other than UK corporate law.

You may think I am being unfair to an excellent (and I agree it's excellent) service. But history inexorably shows that such beginnings become closed, mutating into commercial control and confidentiality. Suppose Mark moves on - who runs Figshare then? Or Springer buys Digital Science? What contract has Mark signed with DS? Maybe it binds Figshare to being completely run by the purchaser?

I have additional concerns about the growing influence of DigitalScience products, especially such as ReadCube, which amplify the potential for "snoop and control" - I'll leave those to another blogpost.

Mark has been good enough to answer some of my original concerns, so here are some others to which I think an "open" ("community-based") organization should be able to provide answers.

  • who owns Figshare?
  • who runs Figshare?
  • Is there any governance process from outside Macmillan/DS? An advisory board?
  • How tightly bound is Figshare into Macmillan/DS? Could Figshare walk away tomorrow?
  • What could and what would happen to Figshare if Mark Hahnel left?
  • What could and what would happen to Figshare if either/both of Macmillan / DS were acquired?
  • Where are the company accounts for the last trading year?
  • how, in practice, is Figshare "a community based, open science project that will retain its autonomy whilst receiving support from the (DS) division"?

I very much hope that the answers will allay any concerns I may have had.



The power of Digital Theses to change the world

I am speaking tomorrow at Lille to a group of Digital Humanists:

Séminaire DRTD-SHS

"Research data in the digital humanities" ("Les données de la recherche dans les humanités numériques")

Session of 21 April 2015: "Mastering the technologies to add value to the data" ("Maîtriser les technologies pour valoriser les données")

Venue: MESHS (room 2), 2 rue des Canonniers, 59000 Lille


I always wait to meet the audience before deciding what to say precisely but here are some themes:

  • Most research and scholarship is badly published, not reaching the people who need it and not creating a dialogue or community of practice. This is a moral crime and leads to impoverishment of the human spirit and the health of the citizens of the world.
  • The paradox of this century is that we have the potential for a new Digital Enlightenment, but in the Universities we are collaborating with those who, for their own personal gain, wish to restrict the distribution of knowledge. The large publishing corporations, taking support from media corporations, are building an infrastructure which they monitor and control.
  • We have the technical means to break out of this. In our contentmine.org we can scrape the whole of the literature published every day; create a semantic index for searching and extract facts in far greater number than humans can ever do.
  • We are held back by the lack of vision, and our solution lies not in science, but in humanities. We lack a communal goal, communal values.

How can we harness the vision of Diderot and the Enlightenment and the radicalism of Mai 1968? How can we create the true culture of the digital century?

I shall show some of the tools we have developed in contentmine.org which can scrape and "understand" the whole of scholarly publication. In the UK, after an intense battle against the mainstream publishing community, we have won the right for machines to read and analyze electronic documents without fear of copyright. I express this as: The Right to Read is the Right to Mine.
We need this in the rest of Europe - Julia Reda MEP has recently proposed this  (and much more). There is again intense backlash - so we need philosophers, political scientists, historians, literary studies, economists to show why this freedom has to triumph.
All our tools are Open (Apache2, CC BY, CC0) and we have shown that "anyone" can learn to use them within a morning. They are part of the technical weaponry of digital liberation.
Theses are the major resource over which publishers have no control. Much of our scholarship is published in theses as well as in journals; and much is only published in theses. My single laptop can process 5000 theses per day - or 1 million per year - which should suffice.
The solution will come through human-machine symbionts - communities of practice who understand what machines can and cannot do.

TheContentMine is Ready for Business and will make scientific and medical facts available to everyone on a massive scale.



It's a year since I started TheContentMine (contentmine.org) - a project funded by the Shuttleworth Foundation. In ContentMine we intend to extract all the world's scientific and medical facts from the scholarly literature and make them available to everyone under permissive Open licences. We have been so busy - writing code, lobbying politically, building the team, designing the system, giving workshops, creating content, writing tutorials, etc. that I haven't had time to blog.

This week we launched, without fanfare, at a workshop sponsored by Robert Kiley of the Wellcome Trust:


[RK presented with an AMI, the mascot of TheContentMine]

Robert (and WT) have been magnificent in supporting ContentMining. He has advocated, organised, corralled, pushed, challenged over many years. The success of the workshop owes a great deal to him.

On Monday and Tuesday (2015-04-13/14) we ran a 2-day workshop - training, hacking and advocacy/policy. We advertised the workshop, primarily for Early Career Researchers, and were overwhelmed - FOUR TIMES oversubscribed [1]. Jenny Molloy organised the days, roughly as follows:

  • Day 1
      • tutorials and simple hands-on about the technology
      • aspects of policy and protocols
      • planning projects
  • Day 2
      • hacking projects for 6 hours
      • 2-hour policy/advocacy session with key UK and EU attendees.

It worked very well and showed that ContentMine is now viable in many areas:

  • We have unique software that has a completely new approach to searching scientific and medical literature.
  • We have an infrastructure that allows automatic processing of the literature through CRAWLing, SCRAPE-ing, NORMAlising and MINING (AMI) - there is a conceptual sketch after this list.
  • We have a back-end/server CATalogue (contracted through CottageLabs) which has ingested and analysed a million articles.
  • We have novel search interfaces and display of results.
  • We have established, in the UK, that THE RIGHT TO READ IS THE RIGHT TO MINE.
  • We have built a team, and shown how to build communities.
  • We have tested training sessions that can be used to train trainers and spread the word.
  • And we are credible at the policy level.
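Here is the conceptual sketch promised above: a toy Python rendering of the CRAWL → SCRAPE → NORMALISE → MINE flow. The function bodies and the crude regex are stand-ins invented for illustration; the real tools (quickscrape, norma, AMI and the CAT catalogue) have their own interfaces and formats.

```python
# Conceptual sketch of the CRAWL -> SCRAPE -> NORMALISE -> MINE pipeline (toy code only).
import re

def crawl(feeds):
    """Pretend to discover newly published article URLs (here: just the feeds themselves)."""
    return list(feeds)

def scrape(url):
    """Pretend to download an article; here we return canned text for the demo."""
    return {"url": url, "text": "The common mistfrog (Litoria rheocola) occurs in Queensland."}

def normalise(raw):
    """Reduce publisher-specific markup to one structured form (trivially, here)."""
    return {"url": raw["url"], "sentences": re.split(r"(?<=[.!?])\s+", raw["text"])}

def mine(article):
    """Extract 'facts': here, parenthesised binomial names matched by a crude regex."""
    pattern = re.compile(r"\(([A-Z][a-z]+ [a-z]+)\)")
    for i, sentence in enumerate(article["sentences"]):
        for name in pattern.findall(sentence):
            yield (article["url"], f"sentence-{i}", "contains", name)

for feed_url in crawl(["https://example.org/feed"]):
    for fact in mine(normalise(scrape(feed_url))):
        print(fact)
```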


[Part of the policy session]

We are delighted that a dozen funders, policy makers, etc. came. They included JISC, IPO, LIBER, RLUK, RCUK, HEFCE, CUL, WT, BIS, UbiquityPress, NatureNews. The discussion took for granted that ContentMining is critically important and addressed how it could be supported and encouraged.

My slides for the policy session are at http://www.slideshare.net/petermurrayrust/content-mining-at-wellcome-trust.

I will blog more details later and show more pictures, and so will Graham McDawg Steel. But the highlight for me was the speed and efficiency of the Early Career Researchers in adopting, using, modifying and promoting the system. They came mainly from bioscience/medicine and ranged from UNIX geeks to those who hadn't seen a command line. In their projects they were able to make the CM software work for them and extract facts from the literature. One group wrote additional processing software; another created a novel display with D3.

Best of all they said they'd be happy to learn how to run a workshop and take the ideas and software (which is completely Open Apache2/CC BY/CC0) to their communities.

NOTE: Hargreaves allows UK researchers to mine ANYTHING (that they have legal right to read) for non-commercial use. The publishers cannot stop them, either by technical means or contracts with libraries.

This should make the UK the content-mining capital of the world. Please join us!



I am very angry with the publishing industry.

Last week the NY Times reported that the Ministry of Health in Liberia had discovered a 30-year-old paper that, if they had known about it, might have alerted Liberians to the possibility of Ebola. See a report in TechDirt (https://www.techdirt.com/articles/20150409/17514230608/dont-think-open-access-is-important-it-might-have-prevented-much-ebola-outbreak.shtml ) and also the article in the NY Times itself (http://www.nytimes.com/2015/04/08/opinion/yes-we-were-warned-about-ebola.html ). The paper itself (http://www.sciencedirect.com/science/article/pii/S0769261782800282 ) is in Science Direct and paywalled (31 USD for ca. 1000 words / 3.5 pages). I'll write more on what the Liberians had to say and how they feel about the publishing industry and Western academia (they are incredibly restrained). But I'm not restrained, and this makes me very angry.

This paper contains the words:

“The results seem to indicate that Liberia has to be included in the Ebola virus endemic zone.” In the future, the authors asserted, “medical personnel in Liberian health centers should be aware of the possibility that they may come across active cases and thus be prepared to avoid nosocomial epidemics,”

The Liberians argue that if they had known about this risk some of the effects of Ebola could have been prevented.

Suppose I'm a medical educational organization in Liberia and I want to distribute this paper to 50 centers in Liberia. I am forbidden to do this by Elsevier unless I pay 12 USD per 3-page reprint (from https://s100.copyright.com).


I adamantly maintain "Closed access means people die".

This is self-evidently true to me, though I am still criticized for not doing a scientific study (which would necessarily be unethical). But the Liberian Ministry is not impressed with academia:

There is an adage in public health: “The road to inaction is paved with research papers.”

We've paid 100 BILLION USD over the last 10 years to "publish" science and medicine. Ebola is a massive systems failure which I'll analyze shortly.



Content Mining Hackday in Cambridge this Friday 2015-01-23 - all welcome

We are having a ContentMine hackday - open to all - this Friday in Cambridge https://www.eventbrite.co.uk/e/contentmining-hackday-in-cambridge-facilitated-by-contentmine-tickets-716287435 .

We are VERY grateful to Laura James, from our Advisory Board, who also set up the Cambridge Makespace where the event will be held. This event will cover everything - technical, science, sociolegal, etc. We are delighted that Professor Charles Oppenheim, another of our Advisory Board, will be present. Charles is a world expert on scholarship, including the policy and legality of mining. For example, he flagged up today that the EU and its citizens are pushing for reform...

We're also expecting colleagues from Cambridge University Library so we can have a lively political stream. And we've got scientific publishers in Cambridge - love to see you.

There'll be a technical stream - integrating the components of quickscrape, Norma, AMI and our API created by Mark MacGillivray and colleagues at CottageLabs. All the technology is brand new and everything is offered Openly (including commercial use).

And there'll be a group of subprojects based on scientific disciplines. They include:

  • clinical trials
  • farming and agronomy
  • crystallography

If you have an area you'd like to mine, come along. You'll need to have a good idea of your sources (journals, theses, etc.) , and some idea of what you'd like to extract. And, ideally, you'll need energy and stamina and friends...

Oh, and in the unlikely event you get bored we are 15 metres from the Cambridge Winter Beer Festival.