I have suggested that we can and should create Linked Open Repositories (/pmr/2011/08/04/linked-open-repositories/) and that it might take a week. I expected this timescale to be challenged and that I would be seriously wrong.
I was. Dan, who was a wonderful summer student with us and who I am proud to say now works with Digital Science, says I am out by a factor of 14:
Dan Hagon says:
I’d say this could be done in an afternoon. Here I won’t cover the LOD conversion part, just the part about getting the raw data.
First note that this has kind of already been done as part of the JISC-funded EThOS project, now hosted at the BL http://ethos.bl.uk/ but at present I don’t see any way to access the underlying dataset other than through the search interface. Instead you can do the OAI-PMH harvesting for yourself.
Yes – that’s the sort of problem. Dan and I can use the OAI-PMH (for theses only?). How many others can?
The OAI-PMH protocol is well-documented http://www.openarchives.org/OAI/openarchivesprotocol.html but you can save yourself the trouble of playing around with resumption tokens and such-like by using the Python library pyoai http://www.infrae.com/download/OAI/pyoai
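For anyone wondering what "playing around with resumption tokens" means in practice, here is a minimal sketch using only the Python standard library. It is not Dan's code and it does not use pyoai. The function names (build_request_url, extract_resumption_token, harvest) and the idea of passing in a fetch callable are my own assumptions for illustration. The protocol details it relies on are that a partial list response carries a resumptionToken element, that the follow-up request sends only the verb and that token, and that an empty or missing token means the list is complete.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

# Namespace used by OAI-PMH 2.0 response documents.
OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"


def build_request_url(base_url, verb, **params):
    """Return an OAI-PMH request URL for the given verb and arguments."""
    query = {"verb": verb}
    query.update(params)
    return base_url + "?" + urlencode(query)


def extract_resumption_token(xml_text):
    """Return the resumption token in a response, or None if the list is complete."""
    root = ET.fromstring(xml_text)
    token = root.find(".//" + OAI_NS + "resumptionToken")
    if token is None or not (token.text or "").strip():
        return None
    return token.text.strip()


def harvest(base_url, fetch, metadata_prefix="oai_dc"):
    """Yield each response page, following resumption tokens until none is left.

    `fetch` is a hypothetical callable that takes a URL and returns the
    response body as text; it is injected so the loop can be tried offline.
    """
    url = build_request_url(base_url, "ListRecords", metadataPrefix=metadata_prefix)
    while url:
        page = fetch(url)
        yield page
        token = extract_resumption_token(page)
        # A follow-up request carries only the verb and the token.
        url = (
            build_request_url(base_url, "ListRecords", resumptionToken=token)
            if token
            else None
        )
```

Against a live repository, fetch could be something like lambda u: urllib.request.urlopen(u).read().decode("utf-8"). The point of the sketch is only to show the bookkeeping that a library such as pyoai does for you.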
We have to write a program to do it. Read Dan’s post for the details…
We can now iterate through this list of repos… [snipped]
[…] Unfortunately there doesn’t appear to be consistency in the naming of sets …
With this in hand we can now get the data back we want:
… you would now also need to determine the subject of the thesis to find just those that are Chemistry. Note also that the list of repos in ListFriends XML format certainly doesn’t cover all Universities in the UK.
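The subject-filtering step Dan mentions could look roughly like the sketch below. This is an assumption-laden illustration, not part of his post: it assumes each harvested record exposes Dublin Core dc:subject elements, and that a case-insensitive match on "chem" is a good enough first pass. Real repositories may record subjects differently or not at all, and the helper names (subjects, looks_like_chemistry) are hypothetical.

```python
import xml.etree.ElementTree as ET

# Namespace of the Dublin Core element set.
DC_NS = "{http://purl.org/dc/elements/1.1/}"


def subjects(record_xml):
    """Return the non-empty dc:subject values found anywhere in a record."""
    root = ET.fromstring(record_xml)
    return [
        el.text.strip()
        for el in root.iter(DC_NS + "subject")
        if el.text and el.text.strip()
    ]


def looks_like_chemistry(record_xml, keyword="chem"):
    """Crude filter: True if any subject contains the keyword, ignoring case."""
    return any(keyword in s.lower() for s in subjects(record_xml))
```

A keyword match like this will miss theses catalogued under narrower headings and will pick up neighbours such as biochemistry, so it is a starting point for a human to review rather than a classifier.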
So it can be done. But it’s an afternoon’s work. That doesn’t sound too bad ..
LET’S JUST DO IT?
P.
Hi Peter, this sounds like a great idea! Maybe I can help with getting the data from DSpace@Cambridge? The OAI-PMH request you need to pull out the Chemistry thesis data from DSpace@Cambridge is: http://www.dspace.cam.ac.uk/dspace-oai/request?verb=ListRecords&set=hdl_1810_218856&metadataPrefix=uketd_dc which returns records formatted according to the ETHOS guidelines.
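To see what that request is made of, the URL can be taken apart with the Python standard library. This is only a small illustrative sketch; the describe_request helper is a hypothetical name, and nothing here contacts the repository.

```python
from urllib.parse import urlsplit, parse_qs

URL = (
    "http://www.dspace.cam.ac.uk/dspace-oai/request"
    "?verb=ListRecords&set=hdl_1810_218856&metadataPrefix=uketd_dc"
)


def describe_request(url):
    """Split an OAI-PMH request URL into its base URL and its arguments."""
    parts = urlsplit(url)
    # parse_qs returns a list per key; each argument appears once here.
    args = {key: values[0] for key, values in parse_qs(parts.query).items()}
    base_url = f"{parts.scheme}://{parts.netloc}{parts.path}"
    return base_url, args


base_url, args = describe_request(URL)
for name, value in args.items():
    print(name, "=", value)
```

The three arguments are the verb (ListRecords), the set to restrict the harvest to (hdl_1810_218856) and the metadata format to return (uketd_dc).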
Thanks
It’s doable whether or not I win! And we should do it.