How ContentMine will extract millions of species

We are now describing our workflow for extracting facts from the scientific literature. Yesterday Ross Mounce and I hacked through what was necessary to extract species from PLoS ONE. Here’s the workflow we came up with:

Ross has described it in detail and you should read his post for the details. The key points are:

  • This is an open project. You can join in; be aware it’s alpha in places. There’s a discussion list at !forum/contentmine-community. Its style and content will be determined by what you post!
  • We are soft-launching it. You’ll wake up one day and find that it’s got a critical mass of people and content (e.g. species). No fanfare and no vapourware.
  • It’s fluid. The diagram above is our best guess today. It will change. I mentioned in the previous post that we are working with WikiData for part of “where it’s going to be put”. If you have ideas please let us know.



How ContentMine will extract 100 million scientific facts


At ContentMine we have been working hard to create a platform for extracting facts from the literature. It’s been great to create a team – CottageLabs (CL) and I have worked together for over 5 years and they know better than I do what needs to be built. Richard (RSU) is more recent but is a wonderful combination of scientist, hacker and generator of community.

Community is key to ContentMine. This will succeed because we are a community. We aren’t a startup that does a bit for free and then sells out to FooPLC or BarCorp and loses vision and control. We all passionately believe in community ownership and sharing. Exactly where we end up will depend on you as well as us. At present the future might look like OpenStreetMap, but it could also look like Software Carpentry or Zooniverse. Or even the Blue Obelisk.

You cannot easily ask volunteers to build infrastructure. Infrastructure is boring, hard work, relatively unrewarding and has to be built to high standards. So we are very grateful to Shuttleworth for funding this. When it’s prototyped, with a clear development path, the community will start to get involved.

And that’s started with quickscrape. The Mozilla Science sprint created a nucleus of quickscrape hackers. This proved that we (or rather Richard!) had built a great platform on which people could create per-journal and per-publisher scrapers.

So here’s our system. Don’t try to understand all of it in detail – I’ll give a high-level overview.


CRAWL: we generate a feed of some or all of the scientific literature. Possible sources are JournalTOCs, CrossRef, doing it ourselves, or gathering exhaust fragments. We’d be happy not to do it ourselves if there are stable, guaranteed sources. The result of crawling is a stream of DOIs and/or bibliographic data passed to a QUEUE, to be passed to …
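As a minimal sketch of this stage (the feed records and function names are invented for illustration, not ContentMine’s actual code), CRAWL is just a producer pushing DOIs onto a queue for the next stage:

```python
from queue import Queue

def crawl(feed):
    """Yield a DOI plus minimal bibliographic data for each record.
    In reality the feed would come from CrossRef or JournalTOCs."""
    for record in feed:
        yield {"doi": record["doi"], "title": record.get("title", "")}

# Invented sample feed entries
feed = [
    {"doi": "10.1371/journal.pone.0000001", "title": "Example paper A"},
    {"doi": "10.1371/journal.pone.0000002", "title": "Example paper B"},
]

scrape_queue = Queue()
for item in crawl(feed):
    scrape_queue.put(item)  # handed on to the SCRAPE stage

print(scrape_queue.qsize())  # 2
```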

SCRAPE: this extracts the components of the publications – e.g. abstract, fulltext, citations, images, tables, supplemental data, etc. Each publisher (or sometimes each journal) requires a scraper. It’s easy to write these for Richard’s quickscrape platform, which includes the scary Spooky, Phantom and Headless. A scraper takes between 30 minutes and 2 hours to write, so it’s great for a spare evening. The scraped components are passed to the next queue …
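To give a feel for what a scraper does (quickscrape scrapers are actually declarative definitions with CSS/XPath selectors; this regex version, with invented HTML, is only an illustration of the per-journal idea):

```python
import re

# Toy per-journal "scraper": component name -> regex with one capture group
toy_scraper = {
    "title":    r'<meta name="citation_title" content="([^"]+)"',
    "abstract": r'<div class="abstract">(.*?)</div>',
}

def scrape(html, scraper):
    """Extract each component the scraper knows about from the page."""
    out = {}
    for component, pattern in scraper.items():
        m = re.search(pattern, html, re.DOTALL)
        if m:
            out[component] = m.group(1).strip()
    return out

# Invented sample page fragment
html = ('<meta name="citation_title" content="Sharks on Coral Reefs">'
        '<div class="abstract">Acoustic telemetry validates a citizen '
        'science approach.</div>')
result = scrape(html, toy_scraper)
print(result["title"])  # Sharks on Coral Reefs
```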

EXTRACT: these are plugins which extract science from the components. Each scientific discipline requires a different plugin. Some are simple and can be created either by lookup against Wikipedia or other open resources, or by writing regular expressions (not as scary as they sound). Others, such as those interpreting chemical structure diagrams or phylogenetic trees, have taken more effort (but we’ve written some of them).
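A simple regular-expression extractor for binomial species names might look like this (the pattern and the genus list are a sketch; real extraction plugins check candidates against open taxonomies to cut false positives):

```python
import re

# Rough first pass: capitalised genus followed by a lower-case epithet.
# This over-matches ("We observed" also fits), so we filter against a
# known-genus set, e.g. one built from Wikipedia/Wikispecies.
BINOMIAL = re.compile(r"\b([A-Z][a-z]+ [a-z]{3,})\b")
known_genera = {"Panthera", "Carcharhinus"}

text = ("We observed Panthera leo and Carcharhinus melanopterus near the "
        "reef. The Reef itself was undisturbed.")

candidates = BINOMIAL.findall(text)
species = [c for c in candidates if c.split()[0] in known_genera]
print(species)  # ['Panthera leo', 'Carcharhinus melanopterus']
```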

The results can be used in many ways. They include:

  • new terms and data which can go directly into Wikidata – we’ve spent time at Wikimania exploring this. Since facts are uncopyrightable we can take them from any publication, whether or not it’s #openaccess
  • annotation of the fulltext. This can be legally done on openaccess text.
  • new derivatives of the facts – mixing them, recomputing them, doing simulations and much more

Currently people are starting to help write scrapers – if you are keen, let us know on the mailing list at !forum/contentmine-community


Wikimania (Wikipedia) has changed my life

I’ve just spent 3 hectic days at Wikimania (the world gathering of Wikimedians) and am so overwhelmed I’m spending today getting my thoughts in order. Wikimedia is the generic organization for Wikipedia, Wikidata, Wikimedia Commons, and lots else; I’ll use Wikimedia as the generic term.

Very simply:

Wikimedia is at the centre of the Century of the Digital Enlightenment.

Everything I do will be done under the influence of WM. It will be my gateway to the facts, the ideas, the people, the organizations, the processes of the Digital Enlightenment. It is the modern incarnation of Diderot and the Encyclopédie.

2000 Wikimanians gathered at the Barbican (London) for talks, bazaars, food, and fun. It’s far more than an encyclopedia.

We are building the values of the Digital Enlightenment

If we succeed in that, everything follows. So I sat through incredible presentations on digital democracy, the future of scholarship, the liberation of thought, globalization, the fight against injustice, and the role of corporations and of the state. And on how Wikipedia is becoming universal in Southern Africa through smartphones (while publishing in the rich west is completely out of touch with the future).

And fracture lines are starting to appear between the conservatism of the C20 and the Digital Enlightenment. We heard how universities still cannot accept that Wikipedia is the future. Students are not allowed to cite WP in their assignments – it’s an elephant in the room. Everyone uses it but you’re not allowed to say so.

If you are not using Wikipedia as a central part of your educational process, then you must change the process.

It’s a tragedy that universities are so conservative.

For me

Wikipedia is the centre of scientific research and publishing in this century.

I’ll expand in a later post.

But among the top values in Wikipedia is the community. The Digital Enlightenment is about community. It’s about inclusiveness. It’s about networks. It’s about sharing. C20 academia spends its energy fighting its neighbours (“we have a higher ranking than you and I have a higher impact factor than you”). In Wikipedia community is honoured. The people who build the bots are welcomed by those who edit; those who curate the tools honour those who build the chapters, and vice versa.

And anyone who thinks Wikipedia is dying has never been to Wikimania.

Wikipedia constantly reinvents itself, and it’s doing so now. The number of edits is irrelevant. What matters is community and values. We’re concerned about how organizations function in the Digital age. How do we resolve conflicts? What is truth?

I was honoured to be asked to give a talk, and here it is: (3h12m-> 3h42m)

I’ll blog more about it later.

My major discovery was

WIKIDATA, which will become a cornerstone of modern science

I’ll write more but huge thanks to:

  • Ed Saperia who spent 2 years organizing Wikimania
  • Wikichemistry (to whom I awarded a Blue Obelisk)
  • Wikidata

I am now so fortunate to be alive in the era of Wikimedia, Mozilla, OpenStreetMap, the Open Knowledge Foundation, and the Shuttleworth Fellowships. These and others are a key part of changing and building a new world.

And I will be part of it (more later).


Wikimania: I argue for human-machine symbiotes to read and understand science

I have the huge opportunity to present a vision of the future of science at @WikimaniaLondon (Friday 2014-08-08, 17:30). I am deeply flattered. I am also deeply flattered that the Wikipedians have created a page about me (which means I never have to write a bio!), and that I have been catalogued as an Activist in the Free culture and open movements.

I have always supported Wikipedia. (Purists, please forgive "Wikipedia" as a synonym for Wikimedia, Wikispecies, Wikidata…) Ten years ago I wrote in support of WP:

The bit of Wikipedia that I wrote is correct.

That was offered in support of Wikipedia – its process, its people and its content. (In Wikipedia itself I would never use "I" but "we"; for the arrogant academics, though, "I" gets the message across.) I'll now revise it:

For facts in physical and biological science I trust Wikipedia.

Of course WP isn't perfect. But neither is any other scientific reference. The difference is that Wikipedia:

  • builds on other authorities
  • is continually updated

Where it is questionable the community can edit it. If you believe, as I do, that WP is the primary reference work of the Digital Century, then the statement "Wikipedia is wrong" is almost meaningless. Instead it's "we can edit or annotate this WP entry to help the next reader make a better decision".

We are seeing a deluge of scientific information. This is a good thing, as 80% of science is currently wasted. The conventional solution, disappointingly echoed by Timo Hannay (whom I know well and respect), is that we need a priesthood to decide what is worth reading.

"subscription business models at least help to concentrate the minds of publishers on the poor souls trying to keep up with their journals." [PMR: Nature is the archetypal subscription model, and is owned by Macmillan, who also owns Timo Hannay's Digital Science]. “The only practical solution is to take a more differentiated approach to publishing the results of research. On one hand funders and employers should encourage scientists to issue smaller numbers of more significant research papers. This could be achieved by placing even greater emphasis on the impact of a researcher’s very best work and less on their aggregate activity.”

In other words the publishers set up an elite priesthood (which they have already) and academics fight to get their best work published. Everything else is low-grade. This is so utterly against the Digital Enlightenment – where everyone can be involved – that I reject it totally.

I have a very different approach – knock down the ivory towers; dissolve the elitist publishers (the appointment of Kent Anderson to Science Magazine locks us in dystopian stasis).

Instead we must open scholarship to the world. Science is for everyone. The world experts in the binomial (Latin) names of dinosaurs are 4-year-olds. They have just as much right to our knowledge as professors and Macmillan.

So the next premise is

Most science can be understood by most human-machine symbiotes.

A human-machine scientific symbiote is a social machine consisting of (explained later):

  1. one (or preferably more) humans
  2. a discovery mechanism
  3. a reader-computer
  4. a knowledgebase

This isn’t science fiction. They exist today in primitive form. A hackathon is a great example of a symbiote – a group of humans hacking on a communal problem and sharing tools and knowledge. They are primitive not because of the technology, but because of our lack of vision and restrictive practices. They have to be built from OPEN components (“free to use, re-use, and redistribute”). So let’s take the components:

  1. Humans. These will come from those who think in a Digitally Enlightened way. They need to be open to sharing, putting group above self, exposing their efforts, not being frightened, and regarding “failure” as a valuable experience. Unfortunately such humans are beaten down by academia throughout much of the education process and through research; non-collaboration is often treated as a virtue, as is conformity. Disregard of the scholarly poor is universal. So either universities must change or the world outside will change and leave them isolated and irrelevant.
  2. Discovery. We’ve got used to universal knowledge through Google. But Google isn’t very good for science – it only indexes words, not chemical structures or graphs or identifiers or phylogenetic trees… We must build our own discovery system for science. It’s a simpler task than building a Google – there are 1.5 million papers a year; add theses and grey literature and it’s perhaps 2 million documents. That’s about 5000 a day, or 3 a minute. I can do that on my laptop. (I’m concentrating on documents here – data needs different treatment.)

The problem would be largely solved if we had an Open Bibliography of science (basically a list of all published scientific documents). That’s easy to conceive and relatively easy to build. The challenge is sociopolitical – libraries don’t do this any more; they rent products from commercial companies, who have their own non-open agendas. So we shall probably have to do this as a volunteer community – much like OpenStreetMap – but there are several ways we can speed it up using the crowd and exhaust data from other processes (such as the Open Access Button and PeerLibrary).

And an index. When we discover a fact we index it. We need vocabularies and identifier systems. In many subjects these exist and are OPEN, but in many more they aren’t – so we have to build them or liberate them. All of this is hard, drawn-out sociopolitical work. But when the indexes are built, they create the scientific search engines of the future. They are nowhere near as large and complex as Google. We citizens can build this if we really want.

3. A machine reader-computer. This is software which reads and processes the document for you. Again it’s not science fiction, just hard work to build. I’ve spent the last 2 years building some of it! and there are others. It’s needed because the technical standard of scholarly publishing is often appalling – almost no-one uses Unicode and standard fonts, which makes PDF awful to read. Diagrams which were created as vector diagrams are trashed to bitmaps (PNGs, and even worse, JPEGs). This simply destroys science. But, with hard work, we are recovering some of this into semantic form. And while we are doing it we are computing a normalised version. If we have chemically intelligent software (we do!) we compute the best chemical representation. If we have math-aware software (we do) we compute the best version. And we can validate and check for errors and…

4. A knowledge base. The machine can immediately look up any resource – as long as it’s OPEN. We’ve seen an increasing number of Open resources (examples in chemistry are PubChem (NIH), and ChEBI and ChEMBL (EBI)).

And of course Wikipedia. The quality of chemistry is very good. I’d trust any entry with a significant history and number of edits to be 99% correct in its infobox (facts).

So our knowledgebase is available for validation, computation and much else. What’s the mass of 0.1 mole of NaCl? Look up the WP infobox and the machine can compute the answer. That means that the machine can annotate most of the facts in the document – we’re going to examine this on Friday.
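The NaCl example is just a lookup plus one multiplication; a minimal sketch (the molar mass here is typed in by hand, standing in for a value the machine would read from the infobox):

```python
# Molar masses a machine might read from a Wikipedia chembox (g/mol)
MOLAR_MASS = {"NaCl": 58.44}

def mass_grams(formula, moles):
    """mass = moles x molar mass"""
    return moles * MOLAR_MASS[formula]

print(round(mass_grams("NaCl", 0.1), 3))  # 5.844
```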

What’s Panthera leo? I didn’t know, but WP does. So WP starts to make a scientific paper immediately understandable. I’d guess that a paper has hundreds of facts – we shall find out shortly.

But, alas, the STM publishers are trying to stop us doing this. They want to control it. They want to licence the process. Licence means control, not liberation.

But, in the UK, we can ignore the STM publisher lobbyists. Hargreaves allows us to mine papers for factual content without permission.

And Ross Mounce and I have started this. With papers on bacteria. We can extract tens of thousands of binomial names for bacteria.

But where can we find out what these names mean?

maybe you can suggest somewhere… :-)









July summary: an incredible month: ContentMine, OKFest, Shuttleworth, Hargreaves, Wikimania

I haven’t blogged for over a month because I have been busier than I have ever been in my life. This is because the opportunities and the challenges of the Digital Century appear daily. It’s also because our ContentMine project has progressed more rapidly, more broadly and more successfully than I could have imagined.

Shuttleworth fund me/us to change the world. And because of the incredible support that they give – meetings twice a week, advice, contacts, reassurance, wisdom – we are changing the world already. I have a wonderful team who I trust to do the right thing almost by instinct – like a real soccer team, each anticipates what is required when.

It’s getting very complex and hectic as we are active on several fronts (details in later posts and at Wikimania)

  • workshops. We offer workshops on ContentMining, agree dates and a place, and then have to deliver. Deadlines cannot slip. A workshop on new technology is a huge amount of effort. When we succeed we know we have something that not only works, but is wanted. It’s very close to Open Source and Open Notebook Science, where everything is made available to the whole world. That’s very ambitious and we are having to build the …
  • technology. This has developed very rapidly, but is also incredibly ambitious – the overall aim is to have Open technology for reading, understanding and reusing the factual scientific literature. This can only happen with a high-quality generic modular architecture and …
  • community. Our project relies on getting committed, democratic, meritocratic volunteers (like Wikipedia, OpenStreetMap, Mozilla, etc.). We haven’t invited them but they are starting to approach us, and we have an excellent core in Richard Smith-Unna’s quickscrape.
  • sociopoliticolegal. The STM publishers have increased their efforts to require licences for content mining. There is no justification for this and no benefit (except to publishers’ income and control). We have to challenge this, and we’ve written blogs and a seminal paper and…

Here’s a brief calendar …

  • 2014-06-04-> 06 FWF talk, workshop, OK hackday in Vienna
  • 2014-06-19->20 Workshop in Edinburgh oriented to libraries.
  • 2014-07-07->12 Software presented at BOSC (Boston)
  • 2014-07-14 Memorial for Jean-Claude Bradley and promotion of OpenNotebookScience
  • 2014-07-15 Presentation at CSVConf Berlin
  • 2014-07-16->19 OKFest at Berlin – 2 workshops and 2 presentations
  • 2014-07-22->23 Mozilla Sprint – Incredibly valuable for developing quickscrape and community
  • 2014-07-24 Plenary lecture to NDLTD (e-Theses and Dissertations), Leicester
  • 2014-07-25->27 Crystallography and Chemistry hack at Cambridge (especially liberating crystallographic data and images)
  • 2014-07-28->29 Visit of Karien Bezuidenhout from Shuttleworth – massive contribution
  • 2014-08-01 Development of PhyloTreeAnalyzer and visit to Bath to synchronise software
  • 2014-08-02 DNADigest hack Cambridge – great role that ContentMine can play in discovery of datasets
  • preparing to be a Featured Speaker at Wikimania on 2014-08-08, where I’ll present the idea that Wikipedia is central to understanding science. I’ll blog initial thoughts later today.

Jean-Claude Bradley Memorial Symposium ; Updates, including live streaming

Tomorrow we have the Memorial Symposium for Jean-Claude Bradley in Cambridge:

We have 13 speakers and other items related to JCB. The lecture theatre is nearly full (ca 48 people)

** We have arranged live streaming and recording so those who cannot attend in person can follow and we will also have a recording (don’t know how long that will take to edit) **

Here are the notes – please try them out:

Meeting Name: Unilever Centre Lecture Theatre

Invited By: IT Support Chemistry

To join the meeting:


If you have never attended an Adobe Connect meeting before:

Test your connection:

Get a quick overview:


I suggest a hashtag of #jcbmemorial

We meet tonight in the Anchor pub in Cambridge – I and TonyW will be there at 1800 – I will have to leave ca 1830.



Content Mining: Extraction of data from Images into CSV files – step 0

Last week I showed how we can automatically extract data from images. The example was a phylogenetic tree, and although lots of people think these are wonderful, even more will have switched off. So now I’m going to show how we can analyse a “graph” and extract a CSV file. This will be in instalments, so that you will be left on a daily cliff-edge… (actually it’s because I am still refining and testing the code). I am taking the example from “Acoustic Telemetry Validates a Citizen Science Approach for Monitoring Sharks on Coral Reefs” [I’ve not read it, but I assume they got volunteers to see how long they could evade being eaten, with and without the control].

Anyway, here’s our graph. I think most people can understand it. There’s:

  • an x-axis, with ticks, numbers (0-14), title (“Sharks detected”) and units (“Individuals/day”)
  • a y-axis, with ticks, numbers (0-20), title (“Sharks observed”) and units (“Individuals/day”)
  • 12 points (black diamonds)
  • 12 error bars (like Tie-fighters) appearing to be symmetric
  • one “best line” through the points


We’d like to capture this as CSV. If you want to sing along, follow: (the link will point to a static version – i.e. not updated as I add code).
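To fix the goal before we start, here is roughly the CSV we want to end up with (the column names and the two sample rows are invented; the real numbers will come from the extracted pixels):

```python
import csv
import io

# Invented sample rows: one per data point in the graph
rows = [
    {"detected_per_day": 2.0, "observed_per_day": 3.1, "error_half_width": 0.4},
    {"detected_per_day": 5.5, "observed_per_day": 8.0, "error_half_width": 1.2},
]

buf = io.StringIO()
writer = csv.DictWriter(
    buf, fieldnames=["detected_per_day", "observed_per_day", "error_half_width"])
writer.writeheader()
writer.writerows(rows)
print(buf.getvalue())
```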

This may look simple, but let’s magnify it:


Whatever has happened? The problem is that we have a finite number of pixels. We might paint them black (0) or white (255) but this gives a jaggy effect which humans don’t like. So the plotting software adds gray pixels to fool your eye. It’s called antialiasing (not a word I would have thought of). So this means the image is actually gray.

Interpreting a gray scale of images is tough, and most algorithms can only count up to 1 (binary), so we “binarize” the image. That means that each pixel becomes either 0 (black) or 1 (white). This has the advantage that the file/memory can be much smaller, and also that we can do topological analyses as in the last blog post. But it throws information away, and if we are looking at (say) small characters this can be problematic. However it’s a standard first step for many people and we’ll take it.

The simplest way to binarize a gray scale (which goes from 0 to 255 in unit steps) is to classify 0-127 as “black” and 128-255 as “white”. So let’s do that:
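In pure Python (no image library – just nested lists standing in for pixels) the thresholding step is one comparison per pixel:

```python
def binarize(image, threshold=128):
    """Map gray 0-127 to 0 (black) and 128-255 to 1 (white)."""
    return [[0 if px < threshold else 1 for px in row] for row in image]

# A tiny invented 2x4 gray-scale patch
gray = [
    [0, 60, 127, 128],
    [200, 255, 90, 10],
]
print(binarize(gray))  # [[0, 0, 0, 1], [1, 1, 0, 0]]
```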



Now if we zoom in we can see the pixels are binary:


So this is the next step on our journey – how are we going to turn this into a CSV file? Not quite as simple as I have made it out – keep your brain in gear…

I’ll leave you on the cliff edge…



Social Machines, SOCIAM, WWMM, machine-human symbiosis, Wikipedia and the Scientist’s Amanuensis

Over 10 years ago, when peer-to-peer was an exciting and (through Napster) a liberating idea, I proposed the World Wide Molecular Matrix (Cambridge; see Wikipedia) as a new approach to managing scientific information. It was bottom-up, semantic, and allowed scientists to share data as peers. It was ahead of the technology and ahead of the culture.

I also regularly listed tasks that a semi-artificially-intelligent chemical machine – the Scientist’s Amanuensis – could do, such as reading the literature, finding new information, computing the results and republishing them to the community. I ended with:

“pass a first year university chemistry exam”

That would be possible today – by the end of this year we could feed past questions into the machine and devise heuristics, machine learning and regurgitation that would get a 40% pass mark. Most of the software was envisaged in the 1970s in the Stanford and Harvard AI/chemistry labs.

The main thing stopping us doing it today is that the exam papers are Copyright. And that most of published science is Copyright. And I am spending my time fighting publishers rather than building the system. Oh dear!

Humans by themselves cannot solve the problem – the volume is too great – 1500 new scientific papers each day. And machines can’t solve it, as they have no judgment. Ask them to search for X and they’ll often find 0 hits or 100,000.

But a human-machine symbiosis can do wonderfully. Its time has now come, epitomised by the SOCIAM project, which involves Southampton and Edinburgh (and others). Its aim is to build human-machine communities. I have a close lead as Dave Murray-Rust (my son) is part of the project and asked if The Content Mine could provide some synergy/help for a meeting today in Oxford. I can’t be there, and suggested that Jenny Molloy could (and I think she’ll meet them in the bar after she has fed her mosquitoes).

There’s great synergy already. The world of social machines relies on trust – various collaborators provide bits of the solution and the whole is larger than the parts. Academic in-fighting and meaningless metrics destroy progress in the modern world – the only thing worse is publishers’ lawyers. The Content Mine is happy to collaborate with anyone – the more you use what we can provide, the better for everyone.

Dave and I have talked about possible SOCIAM/ContentMine projects. It’s hard to design them because a key part is human enthusiasm and willingness to help build the first examples. So it’s got to be something where there is a need, where the technology is close to the surface, where people want to share and where the results will wow the world. At present that looks like bioscience – and CM will be putting out result feeds of various sorts and seeing who is interested. We think that evolutionary biology, especially of dinosaurs, but also of interesting or threatened species, would resonate.

The technology is now so much better and, more importantly, so much better known. The culture is ready for social machines. We can output the results of searches and scrapings in JSON, link to DBpedia using RDF, and reformat and repurpose using XPath or CSS. The collaboration doesn’t need to be top-down – each partner says “here’s what we’ve got” and the others say “OK, here’s how we glue it together”. The vocabularies in bioscience are good. We can use social media such as Twitter – you don’t need an RDF schema to understand #tyrannosaurus_rex. One of the great things about species is that the binomial names are unique (unless you’re a taxonomist!) and that Wikipedia contains all the scientific knowledge we need.
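A result-feed item could be as simple as a small JSON record (all the field names and the DOI here are invented for illustration, not a fixed ContentMine schema):

```python
import json

# Hypothetical extracted-fact record, ready to publish to a feed
fact = {
    "type": "species",
    "value": "Tyrannosaurus rex",
    "hashtag": "#tyrannosaurus_rex",
    "source_doi": "10.1371/journal.pone.0000001",  # hypothetical DOI
}
feed_item = json.dumps(fact)
print(feed_item)
```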

There don’t seem to be any major problems [1]. If it breaks we’ll add glue just as TimBL did for URLs in the early web. Referential and semantic integrity are not important in social machines – we can converge onto solutions. If people want to communicate they’ll evolve to the technology that works for them – it may not be formally correct but it will work most of the time. And for science that’s good enough (half the science in the literature is potentially flawed anyway).



[1] One problem. The STM publishers are throwing money at politicians desperately trying to stop us. Join us in opposing them.


Why I am fortunate to live and work in Cambridge


Today was the third day of the Tour de France: Cambridge to London. A once-in-a-lifetime opportunity. Should I “take the morning off” to watch the race, or should I continue to hack code for freedom? After all, we are in a neck-and-neck race with those who wish to control scientific information and restrict our work in the interests of capitalist shareholders.

I’m very fortunate in that I can do both. I’m 7 minutes’ cycle from the historic centre of Cambridge. I can carry my laptop in my sack, find a convenient wall to sit on – and later stand on – and spend the waiting time hacking code. And when I got into the centre I found the “eduroam” network. Eduroam is an academic network which is common in parts of the anglophone world, especially the British Commonwealth. So I could sit in front of the Norman Round Church – 1000 years old – and pick up eduroam, perhaps from St John’s College.

The peloton rode ceremonially through Cambridge (it sped up 2 kilometres down the road) but even so it only took 20 seconds to pass.

So I can do my work anywhere in Cambridge – on a punt, in a pub, in the Market Square, at home

and sometimes even in the Chemistry Department…

So thank you everyone who makes the networks work in Cambridge.

And here, if you can see it half way up the left-hand side (to the left of the red shirt), is the bearsuit who came to watch the race.