International harvesting of OA ETD repositories

From Peter Suber’s blog:


Leading the way with a European e-Theses demonstrator project, a press release from the Dutch SURF Foundation, July 31, 2007. Excerpt:

The organisations JISC (UK), the National Library of Sweden and the Dutch SURFfoundation have tested the interoperability of repositories for e-theses. The result is a freely accessible European e-Theses portal providing access to over 10,000 doctoral theses.
For the first time ever, various local repositories containing doctoral e-theses have been harvested on an international scale. Five countries were involved in the project: Denmark, Germany, the Netherlands, Sweden and the United Kingdom.
Doctoral theses contain some of the most current and valuable research produced within universities. Still, they are underused as research resources. Nowadays, theses and dissertations no longer have to gather dust in attics or on the shelves of university libraries. By making them available on the Internet, both the author and the university can showcase their research, benefiting not only fellow scientists, but a broad public as well. And when they are publicly available, they are used many times more often than printed theses available only at libraries or by inter-library loan.
The result of this pilot project is described in the report A Portal for Doctoral e-theses in Europe; Lessons Learned from a Demonstrator Project. The report gives practical recommendations to improve the interoperability between the service provider and the data supplier. The recommendations are entirely in line with the guidelines of the DRIVER project.
The report may be useful for institutions that wish to show the world the results of their research. By making their material accessible in a standardised manner and using the Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH), they can reach beyond any boundary….

PMR: This is very exciting. We have estimated that most of the raw data connected with chemistry is never published. But where is the pressure to publish data greatest? In theses of course. If you fail to show your data, then the examiners can rightly ask you to find them or, worse, remeasure them if you have lost them. So it should be standard that at the end of a thesis there is a full set of (at least partially) inspected and validated data. Yet most of this is subsequently lost. So we really welcome this, not least because we have a JISC project (SPECTRa-T : Submission, Preservation and Exposure of Chemistry ...) to extract chemistry (meta)data from theses.
As I am sure everyone else is, can I make the following suggestions? If we want to do them, they shouldn’t be too difficult:

  • Where possible, text-based versions should be available. I know that many historical theses may only be present as bitmaps (TIFFs, etc.) but it’s really valuable to have searchable text. And even if the OCR isn’t 100% accurate it’s possible to do a lot with slightly imperfect scans, and even to suggest corrections in some cases. Maybe if people are interested in a thesis they could OCR it, correct it, and resubmit it.
  • Content should be freely text- and data-minable. Now I know that copyright can be a slight problem here, but can we try to find creative ways round it? Every graduate student I have spoken to wants their thesis to be read and none have any problems with it being data mined. But when I produced my thesis no-one had thought of text-mining. I actually have no idea whether I hold the copyright, though I expect so. I don’t think there is actually very much worth mining as nearly all the data has got into the public domain in publications. But who knows. But please don’t let 20-year-old “copyright” serve as an unnecessary barrier to text-mining, certainly not in the sciences.
  • There should be a way of communicating such material back so that theses can be annotated. That may be more difficult but not inconceivable.

and so gradually we build up a resource which our robots, as well as us, can read. That would give us a fully searchable resource.
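
As a rough illustration of what such a robot does at the harvesting stage, here is a minimal Python sketch of an OAI-PMH `ListRecords` exchange. It is not the portal's code: the repository address, the function names and the sample record are all invented for illustration, and the response is parsed from a canned string rather than fetched over the network. Only the `verb`, `metadataPrefix` and `resumptionToken` parameters and the two XML namespaces come from the OAI-PMH and Dublin Core specifications.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

# Namespaces defined by OAI-PMH 2.0 and Dublin Core.
OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"
DC_NS = "{http://purl.org/dc/elements/1.1/}"


def list_records_url(base_url, metadata_prefix="oai_dc", resumption_token=None):
    """Build a ListRecords request URL (hypothetical helper).

    When continuing a harvest, the resumption token is sent on its own,
    without metadataPrefix.
    """
    if resumption_token:
        params = {"verb": "ListRecords", "resumptionToken": resumption_token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    return base_url + "?" + urlencode(params)


def parse_titles(xml_text):
    """Return (titles, next_token) from a ListRecords response (hypothetical helper)."""
    root = ET.fromstring(xml_text)
    titles = [t.text for t in root.iter(DC_NS + "title")]
    token = root.find(".//" + OAI_NS + "resumptionToken")
    next_token = token.text if token is not None and token.text else None
    return titles, next_token


# Invented sample response, trimmed to the elements the sketch reads.
SAMPLE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>An invented thesis title</dc:title>
        </oai_dc:dc>
      </metadata>
    </record>
    <resumptionToken>page-2</resumptionToken>
  </ListRecords>
</OAI-PMH>"""

if __name__ == "__main__":
    # repository.example.org is a placeholder, not a real repository.
    print(list_records_url("https://repository.example.org/oai"))
    print(parse_titles(SAMPLE))
```

A real harvester would fetch each URL, call something like `parse_titles` on the response, and repeat with the returned token until none is left. Note that this only retrieves metadata; getting at the full text and the data inside each thesis is a separate step, which is exactly why the suggestions above matter.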
