Shuttleworth application: How the Content Mine is going to work

The second half of the Shuttleworth application asks how you are going to make it happen. Here’s my proposal. If I am successful, I know that the Foundation and its fellows will be able to give advice and mentoring, and I would expect the details to be continually improved. [Note: I have not yet formally asked all of the organizations mentioned; I am simply saying that I shall be contacting them. There are many others which I haven’t had space to include, and they shouldn’t feel “left out”!]


What do you want to explore? To deploy a framework (AMI) where scientists can create their own “plugins” to extract facts and enhance their understanding of publications. We have created chemical tools that “think as a chemist” and in narrow fields (e.g. understanding chemical names) are already better than all but a few experts. Ross Mounce (Panton fellow) and I are now starting on biodiversity, where we have prototypes that can recognize species in papers, check them against known taxonomies, and publish collections in milliseconds. Simply by extracting all species, places and dates from the last 10 years of scholarly publications (10 million papers?) we would make a massive open contribution. This can be used for scholarship, policy making, and outreach to help create citizen scientists.
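To make the plugin idea concrete, here is a rough Python sketch of what a species-extraction step might look like. It is only an illustration under stated assumptions and is not AMI's actual code or API: the names `Fact`, `extract_species`, `verified_species` and the tiny `KNOWN_SPECIES` set are all hypothetical, and the regular expression is a deliberately naive stand-in for real species recognition.

```python
from __future__ import annotations

import re
from dataclasses import dataclass

# Hypothetical stand-in for a reference taxonomy; a real plugin would
# consult a curated taxonomic source instead of a hard-coded set.
KNOWN_SPECIES = {"Homo sapiens", "Panthera leo", "Quercus robur"}

# Naive candidate pattern: a capitalized word followed by a lowercase word
# of at least three letters. This over-matches ordinary prose on purpose;
# the taxonomy check below is what filters the candidates.
BINOMIAL = re.compile(r"\b([A-Z][a-z]+ [a-z]{3,})\b")


@dataclass(frozen=True)
class Fact:
    kind: str       # e.g. "species"
    value: str      # the text that was extracted
    verified: bool  # True if the value appears in the reference taxonomy


def extract_species(text: str) -> list[Fact]:
    """Return one Fact per distinct candidate, in order of first appearance."""
    seen: set[str] = set()
    facts: list[Fact] = []
    for match in BINOMIAL.finditer(text):
        name = match.group(1)
        if name in seen:
            continue
        seen.add(name)
        facts.append(Fact("species", name, name in KNOWN_SPECIES))
    return facts


def verified_species(text: str) -> list[str]:
    """Keep only the candidates confirmed by the reference taxonomy."""
    return [fact.value for fact in extract_species(text) if fact.verified]


if __name__ == "__main__":
    sample = "We observed Panthera leo near the river. The oak Quercus robur was common."
    print(verified_species(sample))
```

The design point is the split between a cheap, over-generous candidate finder and a separate check against a trusted vocabulary; a plugin for places or dates would follow the same pattern with a different pattern and reference source.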


As we ramp up, we’ll feed the results into organized semantic collections such as Wikipedia (which is becoming a primary reference for science), its offspring DBPedia which allows semantic querying of WP, and EuropePubMedCentral where our extracted facts (chemicals, species, etc.) will help create a far better index.

In many ways search engines control our thoughts. By building our own, better, scientific search engine we shall recapture our autonomy of thought.


What are you going to do to get there? I’ve just attended a 2-day boot camp in Software Carpentry (SWC) run by Greg Wilson and was very impressed with the way it was run and with the social dynamics. Much of my strategy is now based on his experience.


My group and I have built the software framework AMI. It works, but it has not yet been stress-tested by users and needs their innovation. I have “acquired” a good 3rd year PhD student who is putting some of the final pieces into the framework, chemistry and numerical sciences. In some subjects AMI will answer questions well and usefully; in other fields we’ll see the way forward but need collaboration. Initially we shall have workshops concentrated on subject areas I know (bioscience, chemistry, publishing) and then move into new ones with contacts from Greg and other scientists. I also expect to interest Open Access Publishers (PLoS, BMC, etc.) in running workshops to explore semantic publishing – this can be a source of sponsorship.


SWC runs “boot camps” with (say) 30 attendees who learn so much that some of them want to become instructors and run their own boot camps; this is a goal for when AMI is widely deployed. In the medium term, towards the end of the fellowship, I would expect to start running bootcamps in selected subjects.




Have you started implementation of the idea? Yes. The software framework has taken 5-20 years to develop and is now deployed as beta versions, with tools for (a) chemistry/metabolism and (b) biodiversity (phylogenetics). There is enough at beta that a new enthusiastic community could build a plugin within a month or two – with Greg I intend to explore this in (c) astronomy. Other tractable domains are bio-sequences (proteins and genes) and possibly environmental science (geolocation).


How have you funded your initiative in the past?
The code has been funded in part by scientific research grants (RCUK, JISC, Microsoft, Unilever, CSIRO/AU) and by volunteer contributions (mainly in the margins of existing research). Two modest grants from EPSRC (“Pathways to impact”) were critical in getting several tools released. But I have also harnessed the power of volunteer communities where energy and mutual respect are strong currencies. The key task now is to expand our communities – so that subject areas will each take on their own tasks.


I proposed Panton Fellowships (PF) 3 years ago (OKFN) and raised funds from (a) OSI(OSF) and now (b) CCIA. The Fellows are spreading the idea of bottom-up scientific knowledge and the initiative should become self-sustaining (not dependent on me) in a year or two. I helped “write a grant” (technically I’m not a Co-I) for BBSRC to support Ross Mounce (PF) for the AMI phylogenetic work at Bath and this will have a major impact in technology and community. Similarly I’m working with Kitware (a software company in Albany NY) for them to get grants in semantic chemistry.


Who are your current or potential key partners? I’m initially looking for organizations that would run joint workshops or fund them. I have excellent current contacts with: (a) community: OKFN (School-Of-Data), SWCarpentry, Tabula, Crowdcrafting, MozillaScience; (b) libraries and repositories: BL, EuropePMC; (c) publishers: PLoS, BMC, Ubiquity (Sam Moore is a Panton Fellow); (d) companies: Kitware (NY), Figshare. I have good links with Wikip(m)edia, which could become very important long-term, and am exploring Natural History Museums. I’ll be doing the first demo at the Oxford eScience Centre at the end of next month (2013-11). But I expect other potential partners to emerge from the workshops we’ll be running.


Not-for-profit. After talking with Greg Wilson, Francois Grey and Daniel Lombrada-Gonzales G, I see no immediate need for a Foundation. Greg’s bootcamps are usually funded by a host (e.g. a group of universities) who provide resources and modest sponsorship. Like Greg I don’t have a sense of personal ownership, but I absolutely want to protect against a digital landgrab by a big corporate. After 6 months we’ll have a clear idea whether The ContentMine generates its own identity or is naturally part of something else.


Where will you be based?
, UK. I currently travel abroad about once a month and love responding to invitations. London is a great digital centre and I go there at least once a week, so some events may be London-based.


Do you have an online presence? summarizes my science, but is not updated. My main structured presence is my blog /pmr which is active and widely followed. I don’t have a classical home page (spam). is one self-sustaining community I created. On Twitter @petermurrayrust has ca. 2000 followers. I’m also visible at and and


Does the idea/project have an online presence? hosts about 10 repositories directly involved in the project, the most accessible is I’ve deliberately not created a project page until we can show a working (alpha/beta) system because I hate vaporware and you only get one chance to release. (I have bought I’ll be working on tutorials for 2013-11-27 in Oxford. Ross Mounce has also blogged the AMI project and content mining –


The political aspect is frequently covered in my blog, by Ross Mounce, and by OKFN. A particular example is where a wide group of organizations withdrew from the (heavily lobbied) EC attempt to require licences for mining. I also blog on OKFN:
