petermr's blog

A Scientist and the Web

 

Wellcome’s recommendations on Data Access and Sharing

August 21st, 2014

The Wellcome Trust and other funders have commissioned a study on

ESTABLISHING INCENTIVES AND CHANGING CULTURES TO SUPPORT DATA ACCESS

http://www.wellcome.ac.uk/stellent/groups/corporatesite/@msh_peda/documents/web_document/wtp056495.pdf

(This is a report of the Expert Advisory Group on Data Access (EAGDA). EAGDA was established by the MRC, ESRC, Cancer Research UK and the Wellcome Trust in 2012 to provide strategic advice on emerging scientific, ethical and legal issues in relation to data access for cohort and longitudinal studies.)

This is very welcome: data is the poor relation of the holy (and in many subjects often largely useless) PDF full text. Here the report states why, and how, we need to care for data.

Our findings were that

– making data accessible to others can carry a significant cost to researchers (both in terms of financial resource and the time it requires) and there are constraints in terms of protecting the privacy and confidentiality of research participants;

– while funders have done much valuable work to encourage data access and have made significant investments to support key data resources (such as the UK Data Service for the social sciences), the data management and sharing plans they request of researchers are often not reviewed nor resourced adequately, and the delivery of these plans neither routinely monitored nor enforced;

– there is typically very little, if any, formal recognition for data outputs in key assessment processes – including in funding decisions, academic promotion, and in the UK Research Excellence Framework;

– data managers have an increasingly vital role as members of research teams, but are often afforded a low status and few career progression opportunities;

– working in data intensive research areas can create potential challenges for early career researchers in developing careers in these fields;

– the infrastructures needed to support researchers in data management and sharing, and to ensure the long-term preservation and curation of data, are often lacking (both at an institutional and a community level).

TL;DR: It needs commitment in money, policies and management, and it is a large task.

So …

RECOMMENDATIONS 

We recommend that research funders should:

1. Strengthen approaches for scrutinising data management and sharing plans associated with their funded research – ensuring that these are resourced appropriately and implemented in a manner that maximises the long-term value of key data outputs.

 

2. Urge the UK Higher Education funding councils to adopt a clear policy at the earliest possible stage for high quality datasets that are shared with others to be explicitly recognised and assessed as valued research outputs in the post-2014 Research Excellence Framework.

 

3. Take a proactive lead in recognising the contribution of those who generate and share high quality datasets, including as a formal criterion for assessing the track record and achievements of researchers during funding decisions.

 

4. Work in partnership with research institutions and other stakeholders to establish career paths for data managers.

 

5. Ensure key data repositories serving the data community have adequate funding to meet the long-term costs of data preservation, and develop user-friendly services that reduce the burden on researchers as far as possible.

PMR: This is the FUNDERS urging various bodies to act. Some items are conceivably possible. (4) is highly desirable but very challenging: universities have consistently failed to value support roles, honouring “research outputs” instead. (5) is possible but must be done by people and organizations who understand repositories, not by university libraries, whose repositories are effectively unused.

And we MUSTN’T hand this over to commercial companies.

We recommend that research leaders should:

6. Adopt robust approaches for planning and costing data management and sharing plans when submitting funding applications.

7. Ensure that the contributions of both early-career researchers and data managers are recognised and valued appropriately, and that the career development of individuals in both roles is nurtured.

8. Develop and adopt approaches that accelerate timely and appropriate access to key research datasets.

9. Champion greater recognition of data outputs in the assessment processes to which they contribute.

PMR: (6) will get lip service unless the process is changed; (7) means changing culture and diverting money; (8) is possible; (9) requires a stick from funders.

We also emphasise that research institutions and journals have critical roles in supporting the cultural change required.

Specifically, we call for research institutions to develop clear policies on data sharing and preservation; to provide training and support for researchers to manage data effectively; to strengthen career pathways for data managers; and to recognise data outputs in performance reviews.

We call on journals to establish clear policies on data sharing and processes to enable the contribution of individual authors on the publication to be assessed, and to require the appropriate citation and acknowledgement of datasets used in the course of a piece of published research. In addition, journals should require that datasets underlying published papers are accessible, including through direct links in papers wherever possible.

PMR: Journals have failed us catastrophically, both technically (they produce some of the worst technical output in the world: broken, bizarre HTML that does not even use Unicode) and politically (their main product is glory, not technical quality). The only way to change this is to create different organisations.

This will be very difficult.

The funders are to be commended on these goals – it will be an awful lot of money and time and effort and politics.

How contentmine will extract millions of species

August 13th, 2014

We are now describing our workflow for extracting facts from the scientific literature on http://contentmine.org/blog . Yesterday Ross Mounce and I hacked through what was necessary to extract species from PLoS ONE. Here’s the workflow we came up with:

Ross has described it in detail at http://contentmine.org/blog/AMI-species-workflow and you should read that for the details. The key points are:

  • This is an open project. You can join in; be aware it’s alpha in places. There’s a discussion list at  https://groups.google.com/forum/#!forum/contentmine-community . Its style and content will be determined by what you post!
  • We are soft-launching it. You’ll wake up one day and find that it’s got critical mass of people and content (e.g. species). No fanfare and no vapourware.
  • It’s fluid. The diagram above is our best guess today. It will change. I mentioned in the previous post that we are working with WikiData for part of “where it’s going to be put”. If you have ideas please let us know.

 

 

How contentmine.org will extract 100 million scientific facts

August 11th, 2014

 

At contentmine.org we have been working hard to create a platform for extracting facts from the literature. It’s been great to create a team – CottageLabs (CL) and I have worked together for over 5 years and they know better than me what needs to be built. Richard (RSU) is more recent but is a wonderful combination of scientist, hacker and generator of community.

Community is key to ContentMine. This will succeed because we are a community. We aren’t a startup that does a bit for free and then sells out to FooPLC or BarCorp and loses vision and control. We all passionately believe in community ownership and sharing. Exactly where we end up will depend on you as well as us. At present the future might look like OpenStreetMap, but it could also look like SoftWareCarpentry or Zooniverse.  Or even the Blue Obelisk.

You cannot easily ask volunteers to build infrastructure. Infrastructure is boring, hard work, relatively unrewarding and has to be built to high standards. So we are very grateful to Shuttleworth for funding this. When it’s prototyped, with a clear development path, the community will start to get involved.

And that’s started with quickscrape. The Mozilla Science sprint created a nucleus of quickscrape hackers. This proved that we (or rather Richard!) had built a great platform that people could build on, creating per-journal and per-publisher scrapers.

So here’s our system. Don’t try to understand all of it in detail – I’ll give a high-level overview.

[Figure: diagram of the ContentMine system]

CRAWL: we generate a feed of some or all of the scientific literature. Possible sources are JournalTOCs, CrossRef, doing it ourselves, or gathering exhaust fragments. We’d be happy not to do it if there are stable, guaranteed sources. The result of crawling is a stream of DOIs and/or bibliographic data passed to a QUEUE, to be passed on to …
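
To make the CRAWL step concrete, here is a minimal sketch (in Python, assuming the public CrossRef REST API as one possible source; this is illustrative, not the project’s actual code) of pulling a batch of recent DOIs:

    # Sketch of a CRAWL step: fetch a batch of recent DOIs from the public CrossRef REST API.
    # Illustrative only; the endpoint and filter are my assumption of one possible source.
    import requests

    def crawl_recent_dois(from_date="2014-08-01", rows=100):
        resp = requests.get(
            "https://api.crossref.org/works",
            params={"filter": "from-pub-date:" + from_date, "rows": rows},
        )
        resp.raise_for_status()
        items = resp.json()["message"]["items"]
        # Each item carries a DOI plus basic bibliographic metadata for the QUEUE.
        return [item["DOI"] for item in items if "DOI" in item]

    if __name__ == "__main__":
        for doi in crawl_recent_dois()[:10]:
            print(doi)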

SCRAPE: This extracts the components of the publications – e.g. abstract, fulltext, citations, images, tables, supplemental data, etc. Each publisher (or sometimes each journal) requires a scraper. It’s easy to write these for Richard’s quickscrape platform, which includes the scary Spooky, Phantom and Headless. A scraper takes between 30 minutes and 2 hours, so it’s great for a spare evening. The scraped components are passed to the next queue …

EXTRACT: These are plugins which extract science from the components. Each scientific discipline requires a different plugin. Some are simple and can be created either by lookup against Wikipedia or other open resources, or by creating regular expressions (not as scary as they sound). Others, such as those interpreting chemical structure diagrams or phylogenetic trees, have taken more effort (but we’ve written some of them).
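
As a flavour of the simplest kind of EXTRACT plugin, here is a minimal regular-expression sketch (in Python) for finding candidate binomial species names. It is illustrative rather than the actual ContentMine plugin, and the false positives it produces show why lookup against Wikipedia or another open resource matters:

    # Minimal sketch of a regex-based EXTRACT step for candidate binomial names.
    # A real plugin would validate candidates against an open taxonomy or Wikipedia.
    import re

    # Capitalised genus followed by a lower-case specific epithet, e.g. "Panthera leo".
    BINOMIAL = re.compile(r"\b([A-Z][a-z]+)\s+([a-z]{3,})\b")

    def extract_candidates(text):
        return sorted({" ".join(match) for match in BINOMIAL.findall(text)})

    if __name__ == "__main__":
        sample = "Acoustic telemetry detected Carcharhinus amblyrhynchos near the reef."
        print(extract_candidates(sample))
        # ['Acoustic telemetry', 'Carcharhinus amblyrhynchos']  (the first is a false positive)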

The results can be used in many ways. They include:

  • new terms and data which can go directly into Wikidata – we’ve spent time at Wikimania exploring this. Since facts are uncopyrightable we can take them from any publication, whether or not it’s #openaccess
  • annotation of the fulltext. This can be legally done on openaccess text.
  • new derivatives of the facts – mixing them, recomputing them, doing simulations and much more

Currently people are starting to help write scrapers; if you are keen, let us know on the mailing list https://groups.google.com/forum/#!forum/contentmine-community

 

Wikimania (Wikipedia) has changed my life

August 11th, 2014

I’ve just spent 3 hectic days at Wikimania (the world gathering of Wikimedians) and am so overwhelmed that I’m spending today getting my thoughts in order. Wikimedia is the generic organization for Wikipedia, Wikidata, Wikimedia Commons, and lots else; I’ll use Wikimedia as the generic term.

Very simply:

Wikimedia is at the centre of the Century of the Digital Enlightenment.

Everything I do will be done under the influence of WM. It will be my gateway to the facts, the ideas, the people, the organizations, the processes of the Digital Enlightenment. It is the modern incarnation of Diderot and the Encyclopédie.

2000 Wikimanians gathered at the Barbican (London) for talks, bazaars, food, and fun. It’s far more than an encyclopedia.

We are building the values of the Digital Enlightenment

If we succeed in that, everything follows. So I sat through incredible presentations on digital democracy, the future of scholarship, the liberation of thought, globalization, the fight against injustice, and the role of corporations and of the state. And on how Wikipedia is becoming universal in Southern Africa through smart phones (while publishing in the rich west is completely out of touch with the future).

And fracture lines are starting to appear between the conservatism of the C20 and the Digital Enlightenment. We heard how universities still cannot accept that Wikipedia is the future. Students are not allowed to cite WP in their assignments – it’s an elephant in the room. Everyone uses it but you’re not allowed to say so.

If you are not using Wikipedia as a central part of your educational process, then you must change the process.

It’s a tragedy that universities are so conservative.

For me

Wikipedia is the centre of scientific research and publishing in this century.

I’ll expand in a later post.

But among the top values in Wikipedia is the community. The Digital Enlightenment is about community. It’s about inclusiveness. It’s about networks. It’s about sharing. C20 academia spends its energy fighting its neighbours (“we have a higher ranking than you and I have a higher impact factor than you”). In Wikipedia community is honoured: the people who build the bots are welcomed by those who edit, and those who curate the tools mutually honour those who build the chapters.

And anyone who thinks Wikipedia is dying has never been to Wikimania.

Wikipedia constantly reinvents itself and it’s doing so now. The number of edits is irrelevant. What matters is community and values. We’re concerned about how organizations function in the Digital age. How do we resolve conflicts? What is truth?

I was honoured to be asked to give a talk, and here it is:

http://new.livestream.com/wikimania/friday2014 (3h12m-> 3h42m)

I’ll blog more about it later.

My major discovery was

WIKIDATA, which will become a cornerstone of modern science

I’ll write more but huge thanks to:

  • Ed Saperia who spent 2 years organizing Wikimania
  • Wikichemistry (to whom I awarded a Blue Obelisk)
  • Wikidata

I am now so fortunate to be alive in the era of Wikimedia, Mozilla, OpenStreetMap, the Open Knowledge Foundation, and the Shuttleworth Fellowships. These and others are a key part of changing and building a new world.

And I and contentmine.org will be part of it (more later).

 

Ebola Emergency: Lancet/Elsevier charges 31 USD to read about it

August 9th, 2014

[Image: ebola]

and

http://www.theguardian.com/society/2014/aug/08/who-ebola-outbreak-international-public-health-emergency

 

Wikimania: I argue for human-machine symbiotes to read and understand science

August 6th, 2014

I have the huge opportunity to present a vision of the future of science at @WikimaniaLondon (Friday 2014-08-08, 17:30). I am deeply flattered. I am also deeply flattered that the Wikipedians have created a page about me (which means I never have to write a bio!), and that I have been catalogued as an Activist in the Free culture and open movements.

I have always supported Wikipedia. [Purists, please forgive “Wikipedia” as a synonym for Wikimedia, Wikispecies, Wikidata...] Ten years ago I wrote in support of WP (recorded in http://www.theguardian.com/media/2009/apr/05/digital-media-referenceandlanguages ):

The bit of Wikipedia that I wrote is correct.

That was offered in support of Wikipedia: its process, its people and its content. (In Wikipedia itself I would never use “I” but “we”; for the arrogant academics, though, it gets the message across.) I’ll now revise it:

For facts in physical and biological science I trust Wikipedia.

Of course WP isn't perfect. But neither is any other scientific reference. The difference is that Wikipedia:

  • builds on other authorities
  • is continually updated

Where it is questionable, the community can edit it. If you believe, as I do, that WP is the primary reference work of the Digital Century, then the statement “Wikipedia is wrong” is almost meaningless. It becomes “we can edit or annotate this WP entry to help the next reader make a better decision”.

We are seeing a deluge of scientific information. This is a good thing, as 80% of science is currently wasted. The conventional solution, disappointingly echoed by Timo Hannay (whom I know well and respect), is that we need a priesthood to decide what is worth reading:

"subscription business models at least help to concentrate the minds of publishers on the poor souls trying to keep up with their journals." [PMR: Nature is the archetypal subscription model, and is owned by Macmillan, who also owns Timo Hannay's Digital Science]. “The only practical solution is to take a more differentiated approach to publishing the results of research. On one hand funders and employers should encourage scientists to issue smaller numbers of more significant research papers. This could be achieved by placing even greater emphasis on the impact of a researcher’s very best work and less on their aggregate activity.”

In other words the publishers set up an elite priesthood (which they have already) and academics fight to get their best work published. Everything else is low-grade. This is so utterly against the Digital Enlightenment – where everyone can be involved – that I reject it totally.

I have a very different approach – knock down the ivory towers; dissolve the elitist publishers (the appointment of Kent Anderson to Science Magazine locks us in dystopian stasis).

Instead we must open scholarship to the world. Science is for everyone. The world experts in binomial names (Latin names) of dinosaurs are 4 years old. They have just as much right to our knowledge as professors and Macmillan.

So the next premise is

Most science can be understood by most human-machine symbiotes.

A human-machine scientific symbiote is a social machine consisting of (explained later):

  1. one (or preferably more) humans
  2. a discovery mechanism
  3. a reader-computer
  4. a knowledgebase

This isn’t science fiction. They exist today in primitive form. A hackathon is a great example of a symbiote – a group of humans hacking on a communal problem and sharing tools and knowledge. They are primitive not because of the technology, but because of our lack of vision and restrictive practices. They have to be built from OPEN components (“free to use, re-use, and redistribute”). So let’s take the components:

  1. Humans. These will come from those who think in a Digitally Enlightened way. They need to be open to sharing, to putting group above self, to exposing their efforts, to not being frightened, and to regarding “failure” as a valuable experience. Unfortunately such humans are beaten down by academia throughout much of the education process and through research; non-collaboration is often treated as a virtue, as is conformity. Disregard of the scholarly poor is universal. So either universities must change or the world outside will change and leave them isolated and irrelevant.
  2. Discovery. We’ve got used to universal knowledge through Google. But Google isn’t very good for science – it only indexes words, not chemical structures or graphs or identifiers or phylogenetic trees … We must build our own discovery system for science. It’s a simpler task than building a Google – there are about 1.5 million papers a year; add theses and grey literature and it’s perhaps 2 million documents. That’s about 5,000 a day, or 3-4 a minute. I can do that on my laptop. (I’m concentrating on documents here – data needs different treatment).

The problem would be largely solved if we had an Open Bibliography of science (basically a list of all published scientific documents). That’s easy to conceive and relatively easy to build. The challenge is sociopolitical – libraries don’t do this any more; they rent products from commercial companies, who have their own non-open agendas. So we shall probably have to do this as a volunteer community – much like OpenStreetMap – but there are several ways we can speed it up using the crowd and exhaust data from other processes (such as the Open Access Button and PeerLibrary).

And an index. When we discover a fact we index it. We need vocabularies and identifier systems. In many subjects these exist and are OPEN, but in many more they aren’t – so we have to build them or liberate them. All of this is hard, drawn-out sociopolitical work. But once the indexes are built, they create the scientific search engines of the future. They are nowhere near as large and complex as Google. We citizens can build this if we really want.
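
As a toy illustration of “when we discover a fact we index it” (a hypothetical data structure, not an existing ContentMine component), an index can be as simple as a map from each extracted term to the documents it was found in:

    # Toy inverted index: extracted term -> set of document identifiers (e.g. DOIs).
    # Purely illustrative; a real index would sit behind a proper open search service.
    from collections import defaultdict

    class FactIndex:
        def __init__(self):
            self._terms = defaultdict(set)

        def add(self, term, doc_id):
            self._terms[term.lower()].add(doc_id)

        def lookup(self, term):
            return sorted(self._terms.get(term.lower(), set()))

    if __name__ == "__main__":
        idx = FactIndex()
        idx.add("Panthera leo", "doi:placeholder-0001")  # placeholder identifiers only
        idx.add("Panthera leo", "doi:placeholder-0002")
        print(idx.lookup("panthera leo"))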

3. A machine reader-computer. This is software which reads and processes the document for you. Again it’s not science fiction, just hard work to build. I’ve spent the last 2 years building some of it, and there are others. It’s needed because the technical standard of scholarly publishing is often appalling – almost no-one uses Unicode and standard fonts, which makes PDF awful to read. Diagrams which were created as vector diagrams are trashed to bitmaps (PNGs, and even worse JPEGs). This simply destroys science. But, with hard work, we are recovering some of this into semantic form. And while we are doing it we are computing a normalised version. If we have chemically intelligent software (we do!) we compute the best chemical representation. If we have math-aware software (we do) we compute the best version. And we can validate and check for errors and…

4. A knowledge base. The machine can immediately look up any resource – as long as it’s OPEN. We’ve seen an increasing number of Open resources (examples in chemistry are PubChem (NIH), ChEBI and ChEMBL (EBI)).

And of course Wikipedia. The quality of chemistry is very good. I’d trust any entry with a significant history and number of edits to be 99% correct in its infobox (facts).

So our knowledgebase is available for validation, computation and much else. What’s the mass of 0.1 mole of NaCl? Look up the WP infobox and the machine can compute the answer. That means that the machine can annotate most of the facts in the document – we’re going to examine this on Friday.
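
Here is that example worked through as a sketch (with standard rounded atomic weights hard-coded rather than fetched live from the infobox):

    # Worked example: mass of 0.1 mol of NaCl from standard atomic weights.
    # A machine reader would pull these numbers from an open knowledgebase such as a WP infobox.
    ATOMIC_WEIGHT = {"Na": 22.99, "Cl": 35.45}  # g/mol, rounded standard values

    def mass_grams(formula_counts, moles):
        molar_mass = sum(ATOMIC_WEIGHT[el] * n for el, n in formula_counts.items())
        return molar_mass * moles

    if __name__ == "__main__":
        print(mass_grams({"Na": 1, "Cl": 1}, 0.1))  # approximately 5.844 g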

What’s Panthera leo? I didn’t know, but WP does. It’s http://en.wikipedia.org/wiki/Lion.  So WP starts to make a scientific paper immediately understandable. I’d guess that a paper has hundreds of facts – we shall find out shortly.
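
A minimal sketch of that lookup, using the public MediaWiki API (the parameters are my best understanding of the API, not ContentMine code):

    # Sketch: resolve a binomial name to its canonical English Wikipedia article title.
    # "Panthera leo" should resolve to "Lion" by following the redirect.
    import requests

    def resolve_title(name):
        resp = requests.get(
            "https://en.wikipedia.org/w/api.php",
            params={"action": "query", "titles": name, "redirects": 1, "format": "json"},
        )
        resp.raise_for_status()
        pages = resp.json()["query"]["pages"]
        return next(iter(pages.values())).get("title")

    if __name__ == "__main__":
        print(resolve_title("Panthera leo"))  # expected: Lion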

But, alas, the STM publishers are trying to stop us doing this. They want to control it. They want to license the process. Licensing means control, not liberation.

But, in the UK, we can ignore the STM publisher lobbyists. Hargreaves allows us to mine papers for factual content without permission.

And Ross Mounce and I have started this. With papers on bacteria. We can extract tens of thousands of binomial names for bacteria.

But where can we find out what these names mean?

Maybe you can suggest somewhere… :-)


July summary: an incredible month: ContentMine, OKFest, Shuttleworth, Hargreaves, Wikimania

August 6th, 2014

I haven’t blogged for over a month because I have been busier than I have ever been in my life. This is because the opportunities and the challenges of the Digital Century appear daily. It’s also because our ContentMine (http://contentmine.org) project has progressed more rapidly, more broadly and more successfully than I could have imagined.

Shuttleworth fund me/us to change the world. And because of the incredible support that they give (meetings twice a week, advice, contacts, reassurance, wisdom) we are changing the world already. I have a wonderful team who I trust to do the right thing almost by instinct – like a real soccer team, each anticipates what is required and when.

It’s getting very complex and hectic as we are active on several fronts (details in later posts and at Wikimania)

  • workshops. We offer workshops on ContentMining, agree dates and places, and then have to deliver. Deadlines cannot slip. A workshop on new technology is a huge amount of effort. When we succeed we know we have something that not only works, but is wanted. It’s very close to OpenSource and OpenNotebook Science, where everything is made available to the whole world. That’s very ambitious and we are having to build the …
  • technology. This has developed very rapidly, but is also incredibly ambitious – the overall aim is to have Open technology for reading, understanding and reusing the factual scientific literature. This can only happen with a high quality generic modular architecture and …
  • community. Our project relies on getting committed democratic meritocratic volunteers (like Wikipedia, OpenStreetMap, Mozilla, etc.). We haven’t invited them but they are starting to approach us and we have an excellent core in RichardSmith-Unna’s quickscrape (https://github.com/ContentMine/quickscrape/).
  • sociopoliticolegal. The STM publishers have increased their efforts to require licences for content mining. There is no justification for this and no benefit (except to publishers’ income and control). We have to challenge this, and we’ve written blogs and a seminal paper and…

Here’s a brief calendar …

  • 2014-06-04-> 06 FWF talk, workshop, OK hackday in Vienna
  • 2014-06-19->20 Workshop in Edinburgh oriented to libraries.
  • 2014-07-07->12 Software presented at BOSC (Boston)
  • 2014-07-14 Memorial for Jean-Claude Bradley and promotion of OpenNotebookScience
  • 2014-07-15 Presentation at CSVConf Berlin
  • 2014-07-16->19 OKFest at Berlin – 2 workshops and 2 presentations
  • 2014-07-22->23 Mozilla Sprint – Incredibly valuable for developing quickscrape and community
  • 2014-07-24 Plenary lecture to NDLTD (e-Theses and Dissertations), Leicester
  • 2014-07-25->27 Crystallography and Chemistry hack at Cambridge (especially liberating crystallographic data and images)
  • 2014-07-28->29 Visit of Karien Bezuidenhout from Shuttleworth – massive contribution
  • 2014-08-01 Development of PhyloTreeAnalyzer and visit to Bath to synchronise software
  • 2014-08-02 DNADigest hack Cambridge – great role that ContentMine can play in discovery of datasets

 

Sleep?

No

  • preparing for my Featured Speaker slot at Wikimania on 2014-08-08, where I’ll present the idea that Wikipedia is central to understanding science. I’ll blog initial thoughts later today.

http://contentmine.org

Jean-Claude Bradley Memorial Symposium ; Updates, including live streaming

July 13th, 2014

Tomorrow we have the Memorial Symposium for Jean-Claude Bradley in Cambridge:

http://inmemoriamjcb.wikispaces.com/Jean-Claude+Bradley+Memorial+Symposium

We have 13 speakers and other items related to JCB. The lecture theatre is nearly full (ca 48 people)

** We have arranged live streaming and recording so those who cannot attend in person can follow and we will also have a recording (don’t know how long that will take to edit) **

Here are the notes – please try them out:

===========================
Meeting Name: Unilever Centre Lecture Theatre

Invited By: IT Support Chemistry

To join the meeting:
https://collab8.adobeconnect.com/chem-conference/

—————-

If you have never attended an Adobe Connect meeting before:

Test your connection: https://collab8.adobeconnect.com/common/help/en/support/meeting_test.htm

Get a quick overview: http://www.adobe.com/go/connectpro_overview

============================

I  suggest a hashtag of #jcbmemorial

We meet tonight in the Anchor pub in Cambridge – TonyW and I will be there at 18:00; I will have to leave ca. 18:30.

 

 

Content Mining: Extraction of data from Images into CSV files – step 0

July 9th, 2014

Last week I showed how we can automatically extract data from images. The example was a phylogenetic tree, and although lots of people think these are wonderful, even more will have switched off. So now I’m going to show how we can analyse a “graph” and extract a CSV file. This will be in instalments, so that you will be left on a daily cliff-edge… (actually it’s because I am still refining and testing the code). I am taking the example from “Acoustic Telemetry Validates a Citizen Science Approach for Monitoring Sharks on Coral Reefs” (http://www.plosone.org/article/info%3Adoi%2F10.1371%2Fjournal.pone.0095565) [I’ve not read it, but I assume they got volunteers to see how long they could evade being eaten, with and without the control].

Anyway, here’s our graph. I think most people can understand it. There’s:

  • an x-axis, with ticks, numbers (0-14), title (“Sharks detected”) and units (“Individuals/day”)
  • a y-axis, with ticks, numbers (0-20), title (“Sharks observed”) and units (“Individuals/day”)
  • 12 points (black diamonds)
  • 12 error bars (like Tie-fighters) appearing to be symmetric
  • one “best line” through the points

[Figure: the raw plot from the paper]

We’d like to capture this as CSV. If you want to sing along, follow: http://www.bitbucket.org/petermr/diagramanalyzer/org.xmlcml.diagrams.plot.PlotTest (the link will point to a static version – i.e. not updated as I add code).
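
For concreteness, the target might be written with Python’s csv module along these lines (the column names are my guess at a sensible schema, and the rows below are dummy placeholders, not values from the paper):

    # Sketch of the intended output: one CSV row per extracted data point.
    # Dummy placeholder values only; the real numbers come from the pixel analysis later in the series.
    import csv

    header = ["sharks_detected", "sharks_observed", "error_low", "error_high"]
    placeholder_points = [
        (0.0, 0.0, 0.0, 0.0),  # placeholder row, not data from the paper
    ]

    with open("plot_points.csv", "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(header)
        writer.writerows(placeholder_points)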

This may look simple, but let’s magnify it:

[Figure: magnified view of the plot, showing gray antialiased pixels]

Whatever has happened? The problem is that we have a finite number of pixels. We might paint them black (0) or white (255) but this gives a jaggy effect which humans don’t like. So the plotting software adds gray pixels to fool your eye. It’s called antialiasing (not a word I would have thought of). So this means the image is actually gray.

Interpreting a gray-scale image is tough, and most algorithms can only count up to 1 (binary), so we “binarize” the image. That means that each pixel becomes either 0 (black) or 1 (white). This has the advantage that the file/memory can be much smaller, and also that we can do topological analyses as in the last blog post. But it throws information away, and if we are looking at (say) small characters this can be problematic. However, it’s a standard first step for many people and we’ll take it.

The simplest way to binarize a gray scale (which goes from 0 to 255 in unit steps) is to classify 0-127 as “black” and 128-255 as “white”. So let’s do that:
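
A minimal sketch of that thresholding step, using Pillow and numpy (illustrative, not the diagramanalyzer code itself):

    # Sketch: binarize a grayscale plot image with a fixed threshold of 128.
    # Pixels 0-127 become 0 (black) and 128-255 become 1 (white).
    import numpy as np
    from PIL import Image

    def binarize(path, threshold=128):
        gray = np.array(Image.open(path).convert("L"))  # 2D array of 0-255 gray values
        return (gray >= threshold).astype(np.uint8)     # 0 = black, 1 = white

    if __name__ == "__main__":
        binary = binarize("raw_plot.png")  # hypothetical filename
        print(binary.shape, binary.min(), binary.max())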

[Figure: the binarized (unthinned) image]

 

Now if we zoom in we can see the pixels are binary:

[Figure: zoomed view showing purely binary pixels]

So this is the next step on our journey – how are we going to turn this into a CSV file? Not quite as simple as I have made it out – keep your brain in gear…

I’ll leave you on the cliff edge…

 

 

Social Machines, SOCIAM, WWMM, machine-human symbiosis, Wikipedia and the Scientist’s Amanuensis

July 8th, 2014

Over 10 years ago, when peer-to-peer was an exciting and (through Napster) a liberating idea, I proposed the World Wide Molecular Matrix (Cambridge), (wikipedia) as a new approach to managing scientific information. It was bottom-up, semantic, and allowed scientists to share data as peers. It was ahead of the technology and ahead of the culture.

I also regularly listed tasks that a semi-artificially-intelligent chemical machine – the Scientist’s Amanuensis – could do, such as read the literature, find new information, compute the results and republish them to the community. I ended with:

“pass a first year university chemistry exam”

That would be possible today – by the end of this year we could feed past questions into the machine and devise heuristics, machine learning and regurgitation that would get a 40% pass mark. Most of the software was envisaged in the 1970s in the Stanford and Harvard AI/Chemistry labs.

The main thing stopping us doing it today is that the exam papers are Copyright. And that most of published science is Copyright. And I am spending my time fighting publishers rather than building the system. Oh dear!

Humans by themselves cannot solve the problem – the volume is too great – 1500 new scientific papers each day. And machines can’t solve it, as they have no judgment. Ask them to search for X and they’ll often find 0 hits or 100,000.

But a human-machine symbiosis can do wonderfully. Its time has now come, epitomised by the SOCIAM project, which involves Southampton and Edinburgh (and others). Its aim is to build human-machine communities. I have a close link, as Dave Murray-Rust (son) is part of the project and asked if The Content Mine could provide some synergy/help for a meeting today in Oxford. I can’t be there, and suggested that Jenny Molloy could (and I think she’ll meet them in the bar after she has fed her mosquitoes).

There’s great synergy already. The world of social machines relies on trust: that various collaborators provide bits of the solution and that the whole is larger than the parts. Academic in-fighting and meaningless metrics destroy progress in the modern world – the only thing worse is publishers’ lawyers. The Content Mine is happy to collaborate with anyone – the more you use what we can provide, the better for everyone.

Dave and I have talked about possible SOCIAM/ContentMine projects. It’s hard to design them because a key part is human enthusiasm and willingness to help build the first examples. So it’s got to be something where there is a need, where the technology is close to the surface, where people want to share, and where the results will wow the world. At present that looks like bioscience – and CM will be putting out result feeds of various sorts and seeing who is interested. We think that evolutionary biology, especially of dinosaurs but also of interesting or threatened species, would resonate.

The technology is now so much better and, more importantly, so much better known. The culture is ready for social machines. We can output the results of searches and scrapings in JSON, link to DBpedia using RDF, and reformat and repurpose using XPath or CSS. The collaboration doesn’t need to be top-down – each partner says “here’s what we’ve got” and the others say “OK, here’s how we glue it together”. The vocabularies in bioscience are good. We can use social media such as Twitter – you don’t need an RDF schema to understand #tyrannosaurus_rex. One of the great things about species is that the binomial names are unique (unless you’re a taxonomist!) and that Wikipedia contains all the scientific knowledge we need.
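
To make “output the results … in JSON” concrete, a single extracted-fact record might look something like this (the field names are my invention, purely illustrative):

    # Sketch of a JSON record for one extracted species fact; all field names are hypothetical.
    import json

    fact = {
        "term": "Tyrannosaurus rex",
        "type": "binomial",
        "source_doi": "doi:placeholder",  # placeholder, not a real citation
        "wikipedia": "http://en.wikipedia.org/wiki/Tyrannosaurus",
        "dbpedia": "http://dbpedia.org/resource/Tyrannosaurus",
        "hashtag": "#tyrannosaurus_rex",
    }

    print(json.dumps(fact, indent=2))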

There don’t seem to be any major problems [1]. If it breaks we’ll add glue just as TimBL did for URLs in the early web. Referential and semantic integrity are not important in social machines – we can converge onto solutions. If people want to communicate they’ll evolve to the technology that works for them – it may not be formally correct but it will work most of the time. And for science that’s good enough (half the science in the literature is potentially flawed anyway).

 

=============

[1] One problem. The STM publishers are throwing money at politicians desperately trying to stop us. Join us in opposing them.