Linked Open Repositories:

At http://www.repositoryfringe.org/ we have a competition, run by JISC/Mahendra Mahey. It’s primarily for hackers and they were hard at work last night, even forgoing the delights of the Edinburgh Fringe (to which the M-R clan succumbed instead). But one category is for a “good idea”. So I’ll enter for this one and blog it, so even if I don’t win the prize the blogosphere can go ahead and implement it. (Mahendra/Assessors: if you are short of time, just jump to “Proposal”; you know the problem.)

Background and Problem

Institutional repositories are designed for capturing the output of universities and other scholarly organizations. They have things like theses, preprints, priceless digital objects, teaching and learning objects. Quite varied. Not as full as they should be. So here’s a simple question:

Find me all chemistry theses in UK repositories.

And although we ought to be able to, we cannot answer this very simple question.

BTW if you want incentive, the Netherlands can expose all their theses. Portugal exposes its theses. Even France exposes its theses. But not the UK.

Why not? Repositories are searched by Bingle, aren’t they? Yes. But Bingle doesn’t give a complete list – it just gives a few pages. It doesn’t “expose its API”. And anyway Bingle might stop indexing them tomorrow. We can’t rely on Bingle, and more seriously we shouldn’t. We want something better. Owned by the academic community.

Well, can’t we use OAI-PMH to search them? What’s that? It’s a protocol for harvesting metadata from academic repositories. And, possibly, we could use it to look for chemistry theses. But the hit rates would be very low. The content probably isn’t metadata-labelled as “chemistry” or as “thesis”.

But *I* know a chemistry thesis when I see the first page of it. It will say something like “A thesis submitted for the degree of Doctor of Philosophy in the University of Laputa”. It has words such as “thesis” and “University” somewhere in the page. Machines can read a document and decide whether it’s a chemistry thesis. Better than most humans. All they need is the document.
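To make the point concrete, here is a deliberately crude sketch of a machine "knowing a thesis when it sees the first page". The keyword lists and function names are my own illustrative assumptions, not anything a repository provides, and a real classifier would need far better features than this.

```python
import re

# Hypothetical keyword lists, chosen only for illustration.
THESIS_PATTERNS = [
    r"\bthesis\b",
    r"\bdissertation\b",
    r"\bdoctor of philosophy\b",
]
CHEMISTRY_WORDS = {"chemistry", "chemical", "synthesis", "spectroscopy", "catalysis"}


def looks_like_thesis(first_page: str) -> bool:
    """True if the page has a thesis-like phrase and mentions a university."""
    text = first_page.lower()
    has_thesis_phrase = any(re.search(p, text) for p in THESIS_PATTERNS)
    return has_thesis_phrase and "university" in text


def looks_like_chemistry(first_page: str, min_hits: int = 1) -> bool:
    """True if at least min_hits distinct chemistry words appear on the page."""
    words = set(re.findall(r"[a-z]+", first_page.lower()))
    return len(words & CHEMISTRY_WORDS) >= min_hits


def is_chemistry_thesis(first_page: str) -> bool:
    """Combine both crude checks."""
    return looks_like_thesis(first_page) and looks_like_chemistry(first_page)
```

The only input is the text of the first page, which is the whole argument: given the documents, the filtering is the easy part.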

So if I can get all the content of all the UK repositories I can find all the theses, simply by iterating over the whole content. [Iterate is described in Alice: “Begin at the beginning,” the King said, very gravely, “and go on till you come to the end: then stop.”]

And there is a simple way to express the content – organize it as LINKED OPEN DATA. http://en.wikipedia.org/wiki/Linked_Data .

What’s Linked Open Data? It’s TB-L’s great idea of the semantic web, giving everything an identifier, an address and using RDF. Wikipedia is in LOD (as DBpedia). Genome data is present as LOD. Government data is there as well.

Nearly everything except university content.

Isn’t it terribly difficult? And what’s this RDF anyway? No – it’s very easy – as my proposal will show:

PROPOSAL FOR LINKED OPEN REPOSITORIES

Step 0. Create a node in the LOD diagram/graph called “UK repositories”. Tell TimBL about it.

Step 1. Under this node, create a list of UK repositories in RDF. There are about 100-200?? This could be done in an evening, in a bar. Use ORE to create an iterable list (“Table of Contents”). The list should point to each UK repository.

Step 2. Get all repository owners to provide an RDF list of their contents. This is technically trivial. All repository software has a button labelled “Dump as RDF”. Put the list (use ORE?) on the repository web page.

That’s it. We’ve spent 100 million GBP in real and implicit costs to create UK repositories and their content. This proposal could be completed in a week.
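For anyone wondering what the RDF list in Steps 1 and 2 might look like, here is a minimal sketch that writes an ORE aggregation as N-Triples using nothing but string formatting. The ORE and RDF term URIs are the real vocabulary; the example.org addresses and the function name are placeholders I have made up, and the sketch assumes the URIs contain no characters that would need escaping.

```python
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
ORE_AGGREGATION = "http://www.openarchives.org/ore/terms/Aggregation"
ORE_AGGREGATES = "http://www.openarchives.org/ore/terms/aggregates"


def aggregation_ntriples(aggregation_uri, member_uris):
    """Return N-Triples saying that aggregation_uri aggregates each member URI.

    Assumes every URI is already a valid absolute IRI (no spaces, no '<' or '>').
    """
    lines = [f"<{aggregation_uri}> <{RDF_TYPE}> <{ORE_AGGREGATION}> ."]
    for uri in member_uris:
        lines.append(f"<{aggregation_uri}> <{ORE_AGGREGATES}> <{uri}> .")
    return "\n".join(lines) + "\n"


# Step 1: the top-level "table of contents" pointing at each repository.
# All of these addresses are hypothetical placeholders.
uk_list = aggregation_ntriples(
    "http://example.org/uk-repositories",
    ["http://repo-a.example.org/contents", "http://repo-b.example.org/contents"],
)
print(uk_list)
```

Step 2 is the same function again, called once per repository with the URIs of its items as the members, so the whole thing is a two-level list that any program can walk from the top.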

As Tim says: JUST DO IT.

 


 


One Response to Linked Open Repositories:

  1. Dan Hagon says:

    Hi Peter, #rfringe11 sounds like a lot of fun. Wish I could be there.
    I’d say this could be done in an afternoon. Here I won’t cover the LOD conversion part, just the part about getting the raw data.
    First note that this has kind of already been done as part of the JISC-funded EThOS project now hosted at the BL http://ethos.bl.uk/ but at present I don’t see any way to access the underlying dataset other than by the search interface. Instead you can do the OAI-PMH harvesting for yourself.
    The OAI-PMH protocol is well-documented http://www.openarchives.org/OAI/openarchivesprotocol.html but you can save yourself the trouble of playing around with resumption tokens and such-like by using the Python library pyoai http://www.infrae.com/download/OAI/pyoai
    The most important thing to note about the EThOS project was its use of the UKETD_DC metadata standard http://ethostoolkit.cranfield.ac.uk/tiki-index.php?page=Guide+to+using+an+existing+repository which is an extension of Dublin Core. Armed with this we can get a list of repositories that use the standard in ListFriends XML format from the Illinois OAI-PMH Data Provider Registry as follows: http://gita.grainger.uiuc.edu/registry/ListReposBySchema.asp?sch=http%3A%2F%2Fnaca.central.cranfield.ac.uk%2Fethos-oai%2F2.0%2Fuketd_dc.xsd&pf=uketd_dc
    We can now iterate through this list of repos. First, for each we get a list of sets, since we only want the theses and/or items from chemistry departments. Unfortunately there doesn’t appear to be consistency in the naming of sets between different repos (maybe do a keyword search on the returned list?). For instance: http://eprints.ecs.soton.ac.uk/cgi/oai2?verb=ListSets gives “747970653D746865736973” as the setSpec for “Thesis”
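A small sketch of the keyword search suggested above. The setSpec quoted for "Thesis" happens to be the hexadecimal encoding of the string type=thesis, so it is worth decoding setSpecs where possible and searching both the decoded form and the setName. The sample XML is made up but follows the standard OAI-PMH ListSets shape. The function names are hypothetical, and whether other repositories hex-encode their setSpecs is an assumption to check per repository.

```python
import xml.etree.ElementTree as ET

OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}

# Made-up ListSets response in the standard OAI-PMH 2.0 shape.
SAMPLE_LIST_SETS = """\
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListSets>
    <set><setSpec>747970653D746865736973</setSpec><setName>Thesis</setName></set>
    <set><setSpec>747970653D61727469636C65</setSpec><setName>Article</setName></set>
  </ListSets>
</OAI-PMH>
"""


def decode_setspec(set_spec):
    """Decode a hex-encoded setSpec; return None if it is not hex-encoded UTF-8."""
    try:
        return bytes.fromhex(set_spec).decode("utf-8")
    except ValueError:  # also covers UnicodeDecodeError
        return None


def find_sets(list_sets_xml, keyword):
    """Return (setSpec, setName) pairs whose name or decoded spec contains keyword."""
    root = ET.fromstring(list_sets_xml)
    matches = []
    for set_el in root.iterfind(".//oai:set", OAI_NS):
        spec = set_el.findtext("oai:setSpec", default="", namespaces=OAI_NS)
        name = set_el.findtext("oai:setName", default="", namespaces=OAI_NS)
        decoded = decode_setspec(spec) or ""
        if keyword.lower() in f"{name} {decoded}".lower():
            matches.append((spec, name))
    return matches


print(decode_setspec("747970653D746865736973"))  # type=thesis
print(find_sets(SAMPLE_LIST_SETS, "thesis"))
```

Matching on the decoded spec as well as the name should catch repositories that label the set something other than "Thesis" but still encode the type in the spec.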
    With this in hand we can now get the data back we want:
    http://eprints.ecs.soton.ac.uk/cgi/oai2?verb=ListRecords&metadataPrefix=uketd_dc&set=747970653D746865736973
    Of course this isn’t the end of the story as you would now also need to determine the subject of the thesis to find just those that are Chemistry. Note also that the list of repos in ListFriends XML format certainly doesn’t cover all Universities in the UK.
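For those who would rather not pull in a library, here is a rough standard-library sketch of the ListRecords harvest, including the resumption-token loop. It only collects record identifiers. The function names are my own, the fetch step is passed in so it can be swapped out, and error handling (OAI error responses, retries, politeness delays) is left out entirely.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode
from urllib.request import urlopen

OAI = "{http://www.openarchives.org/OAI/2.0/}"


def list_records_url(base_url, metadata_prefix=None, set_spec=None, token=None):
    """Build a ListRecords request URL for an OAI-PMH endpoint."""
    if token:
        # In OAI-PMH a resumptionToken is sent on its own, with only the verb.
        params = {"verb": "ListRecords", "resumptionToken": token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
        if set_spec:
            params["set"] = set_spec
    return f"{base_url}?{urlencode(params)}"


def http_get(url):
    """Fetch a URL and return the raw response body."""
    with urlopen(url) as response:
        return response.read()


def harvest_identifiers(base_url, metadata_prefix, set_spec=None, fetch=http_get):
    """Yield every record identifier, following resumption tokens until done."""
    token = None
    while True:
        url = list_records_url(base_url, metadata_prefix, set_spec, token)
        root = ET.fromstring(fetch(url))
        for header in root.iter(f"{OAI}header"):
            yield header.findtext(f"{OAI}identifier")
        # A missing or empty resumptionToken means this was the last page.
        token = (root.findtext(f".//{OAI}resumptionToken") or "").strip()
        if not token:
            break
```

Calling list_records_url("http://eprints.ecs.soton.ac.uk/cgi/oai2", "uketd_dc", "747970653D746865736973") rebuilds the URL given above, and harvest_identifiers walks the pages behind it. The subject filtering still has to happen afterwards on the harvested metadata.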
    (Btw, I have some code snippets that encapsulate much of the above if someone is interested in taking this further.)
