Wikimania: I argue for human-machine symbiotes to read and understand science

I have the huge opportunity to present a vision of the future of science at @WikimaniaLondon (Friday 2014-08-08, 17:30). I am deeply flattered. I am also deeply flattered that the Wikipedians have created a page about me (which means I never have to write a bio!). And that I have been catalogued as an Activist in the Free culture and open movements.

I have always supported Wikipedia. [Purists, please forgive "Wikipedia" as synonym for Wikimedia, Wikispecies, Wikidata...] Ten years ago I wrote in support of WP (recorded in http://www.theguardian.com/media/2009/apr/05/digital-media-referenceandlanguages ):

The bit of Wikipedia that I wrote is correct.

That was offered in support of Wikipedia - its process, its people and its content. (In Wikipedia itself I would never use "I", but "we"; for the arrogant academics, though, it gets the message across.) I'll now revise it:

For facts in physical and biological science I trust Wikipedia.

Of course WP isn't perfect. But neither is any other scientific reference. The difference is that Wikipedia:

  • builds on other authorities
  • is continually updated

Where it is questionable then the community can edit it. If you believe, as I do, that WP is the primary reference work of the Digital Century then the statement "Wikipedia is wrong" is almost meaningless. It's "we can edit or annotate this WP entry to help the next reader make a better decision".

We are seeing a deluge of scientific information. This is a good thing, as 80% of science is currently wasted. The conventional solution, disappointingly echoed by Timo Hannay (whom I know well and respect), is that we need a priesthood to decide what is worth reading:

"subscription business models at least help to concentrate the minds of publishers on the poor souls trying to keep up with their journals." [PMR: Nature is the archetypal subscription model, and is owned by Macmillan, who also owns Timo Hannay's Digital Science]. “The only practical solution is to take a more differentiated approach to publishing the results of research. On one hand funders and employers should encourage scientists to issue smaller numbers of more significant research papers. This could be achieved by placing even greater emphasis on the impact of a researcher’s very best work and less on their aggregate activity.”

In other words the publishers set up an elite priesthood (which they have already) and academics fight to get their best work published. Everything else is lowgrade. This is so utterly against the Digital Enlightenment – where everyone can be involved – that I reject it totally.

I have a very different approach – knock down the ivory towers; dissolve the elitist publishers (the appointment of Kent Anderson to Science Magazine locks us in dystopian stasis).

Instead we must open scholarship to the world.  Science is for everyone. The world experts in Binomial names (Latin names) of dinosaurs are 4 years old. They have just as much right to our knowledge as professors and Macmillan.

So the next premise is

Most science can be understood by most human-machine symbiotes.

A human-machine scientific symbiote is a social machine consisting of (explained later):

  1. one (or preferably more) humans
  2. a discovery mechanism
  3. a reader-computer
  4. a knowledgebase

This isn’t science fiction. They exist today in primitive form. A hackathon is a great example of a symbiote – a group of humans hacking on a communal problem and sharing tools and knowledge. They are primitive not because of the technology, but because of our lack of vision and restrictive practices. They have to be built from OPEN components (“free to use, re-use, and redistribute”). So let’s take the components:

  1. Humans. These will come from those who think in a Digitally Enlightened way. They need to be open to sharing, to putting group above self, to exposing their efforts, to not being frightened, and to regarding “failure” as a valuable experience. Unfortunately such humans are beaten down by academia throughout much of the education process and on through research; non-collaboration is often a virtue, as is conformity. Disregard of the scholarly poor is universal. So either Universities must change or the world outside will change and leave them isolated and irrelevant.
  2. Discovery. We’ve got used to universal knowledge through Google. But Google isn’t very good for science – it only indexes words, not chemical structures or graphs or identifiers or phylogenetic trees … We must build our own discovery system for science. It’s a simpler task than building a Google – there’s 1.5 million papers a year, add theses and grey literature and it’s perhaps 2 million documents. That’s about 5000 a day or 3 a minute. I can do that on my laptop. (I’m concentrating on documents here – data needs different treatment).

The problem would be largely solved if we had an Open Bibliography of science (basically a list of all published scientific documents). That’s easy to conceive and relatively easy to build. The challenge is sociopolitical – libraries don’t do this any more – they rent products from commercial companies – who have their own non-open agendas. So we shall probably have to do this as a volunteer community – largely like OpenStreetMap – but there are several ways we can speed it up using the crowd and exhaust data from other processes (such as Open Access Button and PeerLibrary).

And an index. When we discover a fact we index it. We need vocabularies and identifier systems. In many subjects these exist and are OPEN but in many more they aren’t – so we have to build them or liberate them. All of this is hard, drawn-out sociopolitical work. But when the indexes are built, then they create the scientific search engines of the future. They are nowhere near as large and complex as Google. We citizens can build this if we really want.

3. A machine reader-computer. This is software which reads and processes the document for you. Again it’s not science fiction, just hard work to build. I’ve spent the last 2 years building some of it, and there are others. It’s needed because the technical standard of scholarly publishing is often appalling – almost no-one uses Unicode and standard fonts, which makes PDF awful to read. Diagrams which were created as vector diagrams are trashed to bitmaps (PNGs and even worse JPEGs). This simply destroys science. But, with hard work, we are recovering some of this into semantic form. And while we are doing it we are computing a normalised version. If we have chemically intelligent software (we do!) we compute the best chemical representation. If we have math-aware software (we do) we compute the best version. And we can validate and check for errors and…

4. A knowledge base. The machine can immediately look up any resource – as long as it’s OPEN. We’ve seen an increasing number of Open resources (examples in chemistry are PubChem (NIH) and ChEBI and ChEMBL (EBI)).

And of course Wikipedia. The quality of chemistry is very good. I’d trust any entry with a significant history and number of edits to be 99% correct in its infobox (facts).

So our knowledgebase is available for validation, computation and much else. What’s the mass of 0.1 mole of NaCl? Look up WP infobox and the machine can compute the answer. That means that the machine can annotate most of the facts in the document – we’re going to examine this on Friday.
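The NaCl question can be worked through as a tiny calculation. This is only a sketch: the molar mass (58.44 g/mol) is typed in by hand as a stand-in for a value read from an infobox, and the dictionary and function names are hypothetical, not part of any ContentMine software.

```python
# Stand-in for a knowledgebase lookup: molar masses in g/mol.
# The NaCl value (22.99 for Na + 35.45 for Cl) is entered by hand here;
# a real symbiote would read it from an Open resource instead.
MOLAR_MASS_G_PER_MOL = {"NaCl": 58.44}


def mass_in_grams(compound, moles):
    """Mass = amount of substance (mol) x molar mass (g/mol)."""
    return MOLAR_MASS_G_PER_MOL[compound] * moles


mass = mass_in_grams("NaCl", 0.1)
print(f"0.1 mol of NaCl weighs about {mass:.3f} g")
```

The point is not the arithmetic but that once the fact is available in machine-readable form, the machine can do the computation without asking the reader.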

What’s Panthera leo? I didn’t know, but WP does. It’s http://en.wikipedia.org/wiki/Lion.  So WP starts to make a scientific paper immediately understandable. I’d guess that a paper has hundreds of facts – we shall find out shortly.

But, alas, the STM publishers are trying to stop us doing this. They want to control it. They want to licence the process. Licence means control, not liberation.

But, in the UK, we can ignore the STM publisher lobbyists. Hargreaves allows us to mine papers for factual content without permission.

And Ross Mounce and I have started this. With papers on bacteria. We can extract tens of thousands of binomial names for bacteria.

But where can we find out what these names mean?

maybe you can suggest somewhere… :-)

July summary: an incredible month: ContentMine, OKFest, Shuttleworth, Hargreaves, Wikimania

I haven’t blogged for over a month because I have been busier than I have ever been in my life. This is because the opportunities and the challenges of the Digital Century appear daily. It’s also because our ContentMine (http://contentmine.org) project has progressed more rapidly, more broadly and more successfully than I could have imagined.

Shuttleworth fund me/us to change the world. And because of the incredible support that they give – meetings twice a week, advice, contacts, reassurance, wisdom we are changing the world already. I have a wonderful team who I trust to do the right thing almost by instinct – like a real soccer team – each anticipates what is required when.

It’s getting very complex and hectic as we are active on several fronts (details in later posts and at Wikimania)

  • workshops. We offer workshops on ContentMining, agree dates and place and then have to deliver. Deadlines cannot slip. A workshop on new technology is a huge amount of effort. When we succeed we know we have something that not only works, but is wanted. It’s very close to the OpenSource and OpenNotebook Science where everything is made available to the whole world. That’s very ambitious and we are having to build the …
  • technology. This has developed very rapidly, but is also incredibly ambitious - the overall aim is to have Open technology for reading and understanding and reusing the factual scientific literature. This can only happen with a high quality generic modular architecture and
  • community. Our project relies on getting committed democratic meritocratic volunteers (like Wikipedia, OpenStreetMap, Mozilla, etc.). We haven’t invited them but they are starting to approach us and we have an excellent core in Richard Smith-Unna’s quickscrape (https://github.com/ContentMine/quickscrape/).
  • sociopoliticolegal. The STM publishers have increased their effort to require licences for content mining. There is no justification for this and no benefit (except to publishers’ income and control). We have to challenge this and we’ve written blogs and a seminal paper and…

Here’s a brief calendar …

  • 2014-06-04-> 06 FWF talk, workshop, OK hackday in Vienna
  • 2014-06-19->20 Workshop in Edinburgh oriented to libraries.
  • 2014-07-07->12 Software presented at BOSC (Boston)
  • 2014-07-14 Memorial for Jean-Claude Bradley and promotion of OpenNotebookScience
  • 2014-07-15 Presentation at CSVConf Berlin
  • 2014-07-16->19 OKFest at Berlin – 2 workshops and 2 presentations
  • 2014-07-22->23 Mozilla Sprint – Incredibly valuable for developing quickscrape and community
  • 2014-07-24 Plenary lecture to NDLTD (e-Theses and Dissertations) Leicester
  • 2014-07-25->27 Crystallography and Chemistry hack at Cambridge (especially liberating crystallographic data and images)
  • 2014-07-28->29 Visit of Karien Bezuidenhout from Shuttleworth – massive contribution
  • 2014-08-01 Development of PhyloTreeAnalyzer and visit to Bath to synchronise software
  • 2014-08-02 DNADigest hack Cambridge – great role that ContentMine can play in discovery of datasets

 

Sleep?

No

  • preparing for Featured Speaker at Wikimania on 2014-08-08 where I’ll present the idea that Wikipedia is central to understanding science. I’ll blog initial thoughts later today

http://contentmine.org

Jean-Claude Bradley Memorial Symposium ; Updates, including live streaming

Tomorrow we have the Memorial Symposium for Jean-Claude Bradley in Cambridge:

http://inmemoriamjcb.wikispaces.com/Jean-Claude+Bradley+Memorial+Symposium

We have 13 speakers and other items related to JCB. The lecture theatre is nearly full (ca 48 people)

** We have arranged live streaming and recording so those who cannot attend in person can follow and we will also have a recording (don’t know how long that will take to edit) **

Here are the notes – please try them out:

===========================
Meeting Name: Unilever Centre Lecture Theatre

Invited By: IT Support Chemistry

To join the meeting:
https://collab8.adobeconnect.com/chem-conference/

—————-

If you have never attended an Adobe Connect meeting before:

Test your connection: https://collab8.adobeconnect.com/common/help/en/support/meeting_test.htm

Get a quick overview: http://www.adobe.com/go/connectpro_overview

============================

I  suggest a hashtag of #jcbmemorial

We meet tonight in the Anchor pub in Cambridge – I and TonyW will be there at 1800 – I will have to leave ca 1830.

 

 

Content Mining: Extraction of data from Images into CSV files – step 0

Last week I showed how we can automatically extract data from images. The example was a phylogenetic tree, and although lots of people think these are wonderful, even more will have switched off. So now I’m going to show how we can analyse a “graph” and extract a CSV file. This will be in instalments so that you will be left on a daily cliff-edge… (actually it’s because I am still refining and testing the code). I am taking the example from “Acoustic Telemetry Validates a Citizen Science Approach for Monitoring Sharks on Coral Reefs” (http://www.plosone.org/article/info%3Adoi%2F10.1371%2Fjournal.pone.0095565) [I’ve not read it, but I assume they got volunteers to see how long they could evade being eaten with and without the control].

Anyway here’s our graph. I think most people  can understand it. There’s:

  • an x-axis, with ticks, numbers (0-14), title (“Sharks detected”) and units (“Individuals/day”)
  • a y-axis, with ticks, numbers (0-20), title (“Sharks observed”) and units (“Individuals/day”)
  • 12 points (black diamonds)
  • 12 error bars (like Tie-fighters) appearing to be symmetric
  • one “best line” through the points

raw

We’d like to capture this as CSV. If you want to sing along, follow: http://www.bitbucket.org/petermr/diagramanalyzer/org.xmlcml.diagrams.plot.PlotTest (the link will point to a static version – i.e. not updated as I add code).

This may look simple, but let’s magnify it:

antialias

Whatever has happened? The problem is that we have a finite number of pixels. We might paint them black (0) or white (255) but this gives a jaggy effect which humans don’t like. So the plotting software adds gray pixels to fool your eye. It’s called antialiasing (not a word I would have thought of). So this means the image is actually gray.

Interpreting gray-scale images is tough, and most algorithms can only count up to 1 (binary), so we “binarize” the image. That means that each pixel becomes either 0 (black) or 1 (white). This has the advantage that the file/memory can be much smaller and also that we can do topological analyses as in the last blog post. But it throws information away and if we are looking at (say) small characters this can be problematic. However it’s a standard first step for many people and we’ll take it.

The simplest way to binarize a gray scale (which goes from 0 to 255 in unit steps) is to classify 0-127 as “black” and 128-255 as “white”. So let’s do that:

defaultUnthinnedBinary
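For readers who want to try the fixed-threshold rule themselves, here is a minimal sketch in Python. It is not the diagramanalyzer code; the function name and the tiny sample patch are made up for illustration, and the image is just a list of rows of gray values (0 to 255).

```python
def binarize(gray, threshold=128):
    """Map each gray value to 0 (black) if below the threshold, else 1 (white).

    With the default threshold, 0-127 becomes black and 128-255 becomes white.
    """
    return [[0 if value < threshold else 1 for value in row] for row in gray]


# A hypothetical 3x4 patch from the antialiased edge of a black shape.
patch = [
    [255, 200, 130, 127],
    [255, 128, 40, 0],
    [250, 90, 0, 0],
]

for row in binarize(patch):
    print(row)
```

Note how the gray values 130 and 127, which the eye would barely tell apart, end up on opposite sides of the threshold. That is the information loss mentioned above.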

 

Now if we zoom in we can see the pixels are binary:

binary

So this is the next step on our journey – how are we going to turn this into a CSV file? Not quite as simple as I have made it out – keep your brain in gear…

I’ll leave you on the cliff edge…

 

 

Social Machines, SOCIAM, WWMM, machine-human symbiosis, Wikipedia and the Scientist’s Amanuensis

Over 10 years ago, when peer-to-peer was an exciting and (through Napster) a liberating idea, I proposed the World Wide Molecular Matrix (Cambridge), (wikipedia) as a new approach to managing scientific information. It was bottom-up, semantic, and allowed scientists to share data as peers. It was ahead of the technology and ahead of the culture.

I also regularly listed tasks that a semi-artificially-intelligent chemical machine – the Scientists’ Amanuensis – could do,  such as read the literature, find new information and compute the results and republish to the community. I ended with:

“pass a first year university chemistry exam”

That would be possible today – by the end of this year – we could feed past questions into the machine and devise heuristics, machine learning and regurgitation that would get a 40% pass mark. Most of the software was envisaged in the 1970s in the Stanford and Harvard AI/Chemistry labs.

The main thing stopping us doing it today is that the exam papers are Copyright. And that most of published science is Copyright. And I am spending my time fighting publishers rather than building the system. Oh dear!

Humans by themselves cannot solve the problem – the volume is too great – 1500 new scientific papers each day. And machines can’t solve it, as they have no judgment. Ask them to search for X and they’ll often find 0 hits or 100,000.

But a human-machine symbiosis can do wonderfully. Its time has now come – and is epitomised by the SOCIAM project which involves Southampton and Edinburgh (and others). Its aim is to build human-machine communities. I have a close link as Dave Murray-Rust (son) is part of the project and asked if The Content Mine could provide some synergy/help for a meeting today in Oxford. I can’t be there, and suggested that Jenny Molloy could (and I think she’ll meet in the bar after she has fed her mosquitoes).

There’s great synergy already. The world of social machines relies on trust – that various collaborators provide bits of the solution and that the whole is larger than the parts. Academic in-fighting and meaningless metrics destroy progress in the modern world – the only thing worse is publishers’ lawyers. The Content Mine is happy to collaborate with anyone – the more you use what we can provide the better for everyone.

Dave and I have talked about possible SOCIAM/ContentMine projects. It’s hard to design them because a key part is human enthusiasm and willingness to help build the first examples. So it’s got to be something where there is a need, where the technology is close to the surface, where people want to share and where the results will wow the world. At present that looks like bioscience – and CM will be putting out result feeds of various sorts and seeing who is interested. We think that evolutionary biology, especially of dinosaurs, but also of interesting or threatened species , would resonate.

The technology is now so much better and more importantly so much better known. The culture is ready for social machines. We can output the results of searches and scrapings in JSON, link to DBPedia using RDF – reformat and repurpose using XPath or CSS. The collaboration doesn’t need to be top-down – each partner says “here’s what we’ve got” and the others say “OK here’s how we glue it together”. The vocabularies in bioscience are good. We can use social media such as Twitter – you don’t need to have an RDF schema to understand #tyrannosaurus_rex. One of the great things about species is that the binomial names are unique (unless you’re a taxonomist!) and that Wikipedia contains all the scientific knowledge we need.
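To give a feel for how little glue is needed, here is a rough sketch of a JSON result record built around a binomial name. Everything in it is an assumption for illustration: the field names are hypothetical, the source identifier is a placeholder, and the DBpedia link simply follows the usual http://dbpedia.org/resource/ naming pattern without checking that the resource exists.

```python
import json


def species_record(binomial, source):
    """Build a small shareable record for a species mention (hypothetical schema)."""
    genus, species = binomial.split()
    return {
        "binomial": binomial,
        "genus": genus,
        "species": species,
        "source": source,  # where the name was found
        # Assumed URI pattern; a real feed would verify the link resolves.
        "dbpedia": "http://dbpedia.org/resource/" + binomial.replace(" ", "_"),
        "hashtag": "#" + binomial.lower().replace(" ", "_"),
    }


record = species_record("Tyrannosaurus rex", "placeholder-document-id")
print(json.dumps(record, indent=2))
```

A partner who only wants the hashtag, or only the DBpedia link, can take that field and ignore the rest. That is the sense in which the pieces get glued together afterwards rather than designed top-down.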

There don’t seem to be any major problems [1]. If it breaks we’ll add glue just as TimBL did for URLs in the early web. Referential and semantic integrity are not important in social machines – we can converge onto solutions. If people want to communicate they’ll evolve to the technology that works for them – it may not be formally correct but it will work most of the time. And for science that’s good enough (half the science in the literature is potentially flawed anyway).

 

=============

[1] One problem. The STM publishers are throwing money at politicians desperately trying to stop us. Join us in opposing them.

 

Why I am fortunate to live and work in Cambridge

photo

Today was the Tour de France; third day – Cambridge to London. A once-in-a-lifetime opportunity. Should I “take the morning off” to watch the race – or should I continue to hack code for freedom? After all we are in a neck and neck race with those who wish to control scientific information and restrict our work in the interests of capitalist shareholders.

I’m very fortunate in that I can do both. I’m 7 mins cycle from the historic centre of Cambridge. I can carry my laptop in my sack, find a convenient wall to sit on – and later stand on – and spend the waiting time hacking code. And when I got into the Centre I found the “eduroam” network. Eduroam is an academic network which is common in parts of the anglophone world, especially the British Commonwealth. So I could sit in front of the Norman Round Church – 1000 years old – and pick up eduroam, perhaps from St John’s College.

The peloton rode ceremonially through Cambridge (it speeded up 2 kilometers down the road) but even so it only took 20 seconds to pass.

So I can do my work anywhere in Cambridge – on a punt, in a pub, in the Market Square, at home

and sometimes even in the Chemistry Department…

So thank you everyone who makes the networks work in Cambridge.

And here, if you can see it half way up the lefthand side (to the left of the red shirt), is the bearsuit who came to watch the race.

tdf1

Jean Claude Bradley Memorial Symposium; July 14th; let’s take Open Notebook Science to everyone

On July 14th we are holding a memorial meeting for Jean-Claude Bradley in Cambridge. Do come; it’s open for all. [NOTE: we hope to get live streaming for those who can't come.]
http://inmemoriamjcb.wikispaces.com/Jean-Claude+Bradley+Memorial+Symposium

Jean-Claude Bradley was one of the most influential open scientists of our time. He was an innovator in all that he did, from Open Education to bleeding edge Open Science; in 2006, he coined the phrase Open Notebook Science. His loss is felt deeply by friends and colleagues around the world.

On Monday July 14, 2014 we shall gather at Cambridge University to honour his memory and the legacy he leaves behind with a highly distinguished set of invited speakers to revisit and build upon the ideas which inspired and defined his life’s work.

Speakers

Simon Coles, University of Southampton, UK
Robert Hanson, St. Olaf College, USA
Nina Jeliazkova, Ideaconsult, Bulgaria
Andrew Lang, Oral Roberts University, USA
Daniel Lowe, NextMove Software, UK
Cameron Neylon, PLOS, USA
Peter Murray-Rust, Cambridge University, UK
Noel O’Boyle, NextMove Software, UK
Henry Rzepa, Imperial College London, UK
Valery Tkachenko, Royal Society of Chemistry, UK
Matthew Todd, University of Sydney, Australia
Antony Williams, Royal Society of Chemistry, UK
Egon Willighagen, Maastricht University, Netherlands

For me this is not to look back but forward. Science, and science communication, is in crisis. We need bold, simple visions to take us out of this, and Open Notebook Science (ONS) does exactly that. It:

  • is inclusive. Anyone can be involved at any level. You don’t have to be an academic.
  • is honest. Everything that is done is Open, so there is no fraud, no misrepresentation.
  • is immediate. The science is available as it happens. Publication is not an operation, but an attitude of mind.
  • is preserved. ONS ensures that the record, and the full record, persists.
  • is repeatable or falsifiable. The full details of what was done are there so the experiment can be challenged or repeated at any time.
  • is inexpensive. We waste 100 Billion USD / year of science through bad practice so we save that immediately. But also we get rid of paywalls, lawyers, opportunity costs, nineteenth century publishing practices, etc.

and a lot more. I shall take the opportunity to show the opportunities:

“Open Notebook Science NOW!” – Peter Murray-Rust, University of Cambridge and Shuttleworth Fellow
Open Notebook Science can revolutionise science in the same way as Open Source has changed software. Its impact will be massive: greatly increased quality, removal of waste and duplication, and an inclusive approach to involving citizens in science. It’s straightforward to do in many areas of science, especially computational. I shall present an ONS model which we can all follow and adapt. The challenge is changing minds and to do that we should start young.

 

Mozilla Global Science Hack – A must-attend event for scientists who want programs

In 3 weeks from now we’ll have a massive global hack for science. Many scientists probably think software is something that other people do. “I’m not a programmer” is a frequent cry. But things are changing. Programming is increasingly about finding out what the problem is, and finding tools and people who can help solve it. If you can run a chromatograph, or a mass spectrometer or a PCR machine you can use and build programs.

The main thing is your frame of mind. If you can organize and run an experiment, you can organize data. If you can organize data you are effectively doing computing. I had the great opportunity to go to a Software Carpentry course last year and it changed my life. It showed me that I needed to understand how I think and how I work and that the rest comes relatively naturally. And it showed the value of friends.

You want a program to do X? Thinking of writing it? Chances are that much of it exists already. Much of what programs do is universal – sorting, matching, transforming, searching. And we have great toolkits – R, Python, Apache, and for visualisation D3, etc. So much of the solution is knowing what, and who, is out there.

So I’m off to Mozilla, in the heart of London. I went there for the first time a month ago – a great place that is human-friendly. Here’s the blurb – join us!

A multi-site sprint this July

(Also posted on the Software Carpentry blog.)

We’ll be holding our first-ever global sprint on July 22-23, 2014. This event will be modeled on Random Hacks of Kindness: people will work with friends and colleagues at sites around the globe, then hand off to participants west of them as their days end and others’ begin. We will set up video conferencing between the various locations and a show-and-tell at the end (and yes, there will be stickers and t-shirts).

We have booked space for the sprint at the Mozilla offices in Paris, London, Toronto, Vancouver, and San Francisco. If you aren’t in one of those cities, but are willing to help organize in your area, please add yourself to this Etherpad. We’ll hash out the what and how at the next Software Carpentry lab meeting—it’s a community event, so we’d like the community to choose what to sprint on—but please get the date in your calendar: it just wouldn’t be a party without you.

Visit of Richard Stallman (RMS) to Cambridge

Richard Stallman (RMS) from MIT stayed with us for 2 days last week. Since RMS has a 9000-word rider on what he needs and doesn’t need when visiting, I hope I will help future hosts by adding some comments. TL;DR It’s hard work.

[RMS (St IGNUsias) selling PMR a GNU; (C) Murray-Rust, CC-BY]

I have a great regard for what RMS has done – Emacs, GNU, the 4 Freedoms. I heard him talk some years ago on Software Patents in Europe and it was great – he knew far more about the European system of government than I did; he had a clear political plan of action (who to write to, and when).  We’d corresponded but only met very briefly in a noisy room.

I posted on the dangers of publishers taking over our data, and he wrote and said he was coming to Cambridge (to talk at OWASP) and would like to talk. He mailed subsequently and said he was looking for somewhere to stay, so we offered him a bed. We’d read the rider – food requirements, temperature, music, dinner guests, etc. We were prepared for a somewhat eclectic visitor.

In retrospect we should have prepared for an Old Testament prophet or mediaeval itinerant monk. (The dressing up as St IGNUsias – above – is actually quite a close parallel and a valuable addition to the rider.) Be prepared to arrange/fund taxi rides, random food browsing, and a flexible timetable.  In fact RMS didn’t require an internet cable – he used our wireless.

But the strange thing was that we had nothing to say to each other. RMS no longer writes software and does not seem engaged in practical politics or action other than raising money for FSF through sale of swag. His message – at least for these two days – was “everyone is snooping on us” (PMR agrees and is equally concerned) and “We must only run Free software” (Free as in speech, epitomised by GPL). For me GPL has the virtue of forestalling SW patents but when I raised it he seemed to downplay it. If he has a current agenda it’s not clear to me. The “Open” word is verboten in discourse – I wished to explore whether there was any difference between Free Data and Open Data (a term I promoted 9 years ago) but we didn’t.  So there was neither a practical agenda nor a dialectic.

The visit probably had the same impact on the household as most itinerant Prophets have.

And the animals are very happy to have a new addition (Connochaetes gnou). If you believe in the GNU-slash-Linux binitarian theology here it is:

GNU

Content Mining: we can now mine images (of phylogenetic trees and more)

The reason I use “content mining” and not “Text and Data Mining” is that science consists of more than text – images, audio, video, code and much more. Text is the best known and the most immediately tractable and many scientific disciplines have developed Natural Language Processing (NLP). In our group Lezan Hawizy, Peter Corbett, David Jessop, Daniel Lowe and others have developed ChemicalTagger, OSCAR, Patent Analysis, and OPSIN (http://www-pmr.ch.cam.ac.uk/wiki/Main_Page ). So the contentmine.org is exactly that – an org that mines content.

But words are often a poor way of representing science and images are common. A general approach to processing all images is very hard and 2 years ago I thought it was effectively impossible. However with hard work some subsets can be tractable. Here we show you some of the possibilities in phylogenetic trees (evolutionary trees). What is described below is simple to follow and simple to carry out, but it took me some months of exploration to find the best strategy. And I owe a great debt to Noureddin Sadawi who introduced me to thinning – I haven’t used his code but his experience was invaluable.

But you don’t need to worry. Here’s a typical tree. It’s from PLoS ONE (http://www.plosone.org/article/info%3Adoi%2F10.1371%2Fjournal.pone.0036933): “Adaptive Evolution of HIV at HLA Epitopes Is Associated with Ethnicity in Canada”.

pone.0036933.g001

The tree has been wrapped into a circle with the Root at the centre and the leaves/tips on the edge of the circle. To transcribe this manually would take hours – we show it being done in a second.

There isn’t always a standard way of doing things but for many diagrams we have to:

  • flatten (remove shades of gray)
  • separate colours (often by flattening them)
  • threshold (remove noise and background)
  • thin (remove all pixels except the 1-pixel-thick backbone)

and here is the thinned diagram:

cleaned

You’ll see that the lines are all still there but exactly 1 pixel thick. (We’ve lost a few colours, but that’s irrelevant for this example). Now we are going to look at the tree (and ignore the labels):

cleaned0

This has been selected automatically on pixel count, but we can also use bounding boxes and many shape characteristics.

We now analyse the structure and break it into connected components – a topological tree – by standard traversal methods. We end up with nodes and edges – this is a snapshot of an SVG.

graphAndChars
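One common way to find the nodes in a thinned image is to count each line pixel's neighbours: a pixel with one neighbour is a tip, and a pixel with three or more is a branch point. The sketch below illustrates that idea only; it is not the code used to produce the picture above, and the function name and the little Y-shaped test image are invented.

```python
def classify_skeleton_pixels(skeleton):
    """Find tips and junctions in a 1-pixel-thick skeleton.

    skeleton is a list of rows; 1 marks a line pixel, 0 is background.
    Returns (tips, junctions) as lists of (row, col) tuples, using
    8-connectivity (diagonal neighbours count).
    """
    rows, cols = len(skeleton), len(skeleton[0])
    tips, junctions = [], []
    for r in range(rows):
        for c in range(cols):
            if not skeleton[r][c]:
                continue
            neighbours = sum(
                skeleton[rr][cc]
                for rr in range(max(r - 1, 0), min(r + 2, rows))
                for cc in range(max(c - 1, 0), min(c + 2, cols))
                if (rr, cc) != (r, c)
            )
            if neighbours == 1:
                tips.append((r, c))
            elif neighbours >= 3:
                junctions.append((r, c))
    return tips, junctions


# A tiny Y-shaped skeleton: two arms meeting a vertical stem.
y_shape = [
    [1, 0, 0, 0, 1],
    [0, 1, 0, 1, 0],
    [0, 0, 1, 0, 0],
    [0, 0, 1, 0, 0],
    [0, 0, 1, 0, 0],
]

tips, junctions = classify_skeleton_pixels(y_shape)
print("tips:", tips)
print("junctions:", junctions)
```

On real skeletons a single branch point often shows up as a small cluster of adjacent "junction" pixels, so a practical version would merge those before walking the edges between nodes.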

[The black lines are artifacts of Inkscape]. So we have identified every node and every edge. The next thing is to trace the edges – that’s easy if they are straight, but here they are curved. Ideally we plan to fit circles, but we’ll use segments for the time being:

segments

The curves are actually straight-line segments, but… no matter.

It’s now a proper phylogenetic tree! And we can serialize it as Newick (or NexML if we wanted).

((n122,((n121,n205),((n39,(n84,((((n35,n98),n191),n22),n17))),((n10,n182),((((n232,n76),n68),(n109,n30)),(n73,(n106,n58))))))),((((((n103,n86),(n218,(n215,n157))),((n164,n143),((n190,((n108,n177),(n192,n220))),((n233,n187),n41)))),((((n59,n184),((n134,n200),(n137,(n212,((n92,n209),n29))))),(n88,(n102,n161))),((((n70,n140),(n18,n188)),(n49,((n123,n132),(n219,n198)))),(((n37,(n65,n46)),(n135,(n11,(n113,n142)))),(n210,((n69,(n216,n36)),(n231,n160))))))),(((n107,n43),((n149,n199),n74)),(((n101,(n19,n54)),n96),(n7,((n139,n5),((n170,(n25,n75)),(n146,(n154,(n194,(((n14,n116),n112),(n126,n222))))))))))),(((((n165,(n168,n128)),n129),((n114,n181),(n48,n118))),((n158,(n91,(n33,n213))),(n87,n235))),((n197,(n175,n117)),(n196,((n171,(n163,n227)),((n53,n131),n159)))))));
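If the Newick string looks intimidating, it is just nested brackets. As a rough illustration (my own toy, not the serializer that produced the string above), a tree held as nested Python tuples can be written out like this; the sample tree reuses a few node names from the output but its shape is made up.

```python
def to_newick(node):
    """Serialize a tree of nested tuples; a string is a leaf label."""
    if isinstance(node, str):
        return node
    return "(" + ",".join(to_newick(child) for child in node) + ")"


# Hypothetical fragment: n122 is sister to a clade holding (n121, n205) and n39.
tree = ("n122", (("n121", "n205"), "n39"))

newick = to_newick(tree) + ";"  # Newick strings end with a semicolon
print(newick)
```

This version writes topology only. Newick can also carry branch lengths and internal node labels, which this sketch leaves out.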

And here is an interactive tree produced by posting that string into http://www.trex.uqam.ca/ (try it yourself).

tree1

So – to summarize – we have taken a phylogenetic tree – that may have taken hundreds of hours to compute – and extracted the key data. (Smart people will ask “what about the text labels?” – be patient, that’s coming).

… in a second.

That scales to over a million images per year on my single laptop! And the technology scales to many other disciplines and it’s completely Open Source (Apache2). So YOU can use it – as long as you give us the credit for writing it.