CKAN – an idea whose time has now come

CKAN, the Comprehensive Knowledge Archive Network, is the brainchild of Rufus Pollock (a young and incredibly energetic economist) at Cambridge. It’s part of Rufus’ vision of a world of distributed semantic Open knowledge. I think CKAN is an idea whose time has now come. It is impossible to predict accurately which new web resource will blossom, but here’s the case for CKAN.

We’ve seen how successful Wikipedia is. But it wasn’t the first encyclopedia on the web (I’d flag the WWW Virtual Library as that) and it started fairly slowly (as far as I recall). And the quality and meta-quality (the tools and protocols to create quality) were fairly primitive compared with what they are today.

Wikipedia is wonderful for many things, and I am really pleased that they created the infobox. This is an approach that encourages an “annealing towards consistency of representation”. The infobox is a collection of key data or metadata about the subject of a page. Infoboxes are not developed top-down but tend to arise from a subcommunity which wants to systematise their field – everything from steam engines to battles to chemical compounds. The volunteers in the subject decide on a representation and some metadata, and the community fills this in. What’s impressive is that even without clear direction they converge on a useful mean, rather like the synchronicity of fireflies. That seems to me, at least, to suggest that if a good, but not perfect, framework is presented to voluntary contributors they will not only add content, but also work to improve the framework.

So after that rather lengthy introduction (I’ve just landed in Melbourne and am readjusting), what’s CKAN?

CKAN is the Comprehensive Knowledge Archive Network, a registry of open knowledge packages and projects (and a few closed ones). CKAN is the place to search for open knowledge resources as well as register your own.

There are currently 368 active packages on the system.

Thanks to its underlying versioned domain model CKAN has a wiki-like interface that lets anyone add and correct packages. Examples of existing entries include a set of Shakespeare’s works, a global population density database, the voting records of UK MPs, or 30 years of US patents.

CKAN is not a data repository – it’s a meta-repository in that it points to the resources. But please don’t think of it as just another metadata repository.  It’s Open (and that’s the intention). It’s multidisciplinary. And it’s now got a growing network of volunteers.

The vision is that anyone with a piece of knowledge (I’ll concentrate on data sets here) that they think might be useful to the world should deposit it in CKAN under an OKFN Open Data protocol. In some cases – like Shakespeare – it can be used on its own. But increasingly data is valuable when combined – mashed up – with other data. For that we need common pieces of information – ideally identifiers, but often simple text fields or even running prose.
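To make the “mashed up” idea concrete, here is a minimal sketch of combining two data sets through a shared identifier. It is only an illustration under my own assumptions: the function name join_on_identifier, the field names and the placeholder records are all invented for the example and are not part of CKAN or of any real data set.

```python
def join_on_identifier(left, right, key):
    """Combine records from two data sets that share a common identifier.

    Only records whose identifier appears in both data sets are kept.
    """
    # Index the second data set by identifier for quick lookup.
    right_by_key = {row[key]: row for row in right}
    merged = []
    for row in left:
        match = right_by_key.get(row[key])
        if match is not None:
            # Merge the fields of both records into one combined record.
            merged.append({**row, **match})
    return merged


# Placeholder records: identifiers and values are made up for illustration.
dataset_a = [
    {"region_id": "R1", "population": 100},
    {"region_id": "R2", "population": 250},
]
dataset_b = [
    {"region_id": "R1", "area_km2": 10},
    {"region_id": "R3", "area_km2": 40},
]

combined = join_on_identifier(dataset_a, dataset_b, "region_id")
print(combined)
```

Only R1 appears in both sets, so only R1 survives the join. That is the point about identifiers: without a common one, the two data sets cannot be united automatically, and you fall back on matching text fields or reading prose.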

CKAN is not yet large and not yet systematised. But that doesn’t mean it’s not valuable – as I said earlier, WP started out small. The important thing is that CKAN is a community project for communal contribution and exploitation. It’s got an emphasis on liberation (or perhaps enlightenment) – it need not be comprehensive, but it should be useful.

You might reasonably suggest that Wikipedia is already systematising data. And to the extent that infoboxes represent data that’s true. But WP is avowedly and rightly an encyclopedia – you could not devote a page to a data set (unless it was as important as the Keeling curve). So where are you going to get those crucial bits of information that you need for research, teaching, learning?

CKAN can be the answer. We are stuck in the usual cyclic argument at present – “it’s not got enough in it, so I won’t use it” – as opposed to “It could do with some more knowledge – so here is some”. It’s also not organised in a fully Linked Data way – but then what data is? CKAN is a great place to start experimenting with Linking Data.

The exciting thing is that not only public data storage but even public triple storage is starting to become massively and freely available. As soon as I knew that Talis was offering their platform to host Open Data (under the PDDL, to which Talis made a critical and significant contribution) I started to think how we could get CKAN into it. Not everything will fit. But we can get enough overlap of concepts that we can start to unite the entries using SPARQL.

And move towards a genuine Open knowledge resource.

