In the previous post I indicated that I might be able to search University and other repositories through OpenDOAR – a repository of repositories. I can technically crawl the content of these repositories if I can get:
- a list of the repositories
- for each repository, a list of the content.
I am quite prepared for these to be nested, or for modern technology such as RSS/Atom or similar to be used.
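Here is a minimal sketch of that two-level (possibly nested) crawl, assuming the lists are exposed as Atom feeds. Everything specific here is an assumption for illustration: the `example.org` URLs, the `walk` function, the `fetch` callable and the in-memory `FAKE_FEEDS` are all hypothetical, and treating a link of type `application/atom+xml` as a nested list is just one plausible convention, not something OpenDOAR is known to provide.

```python
import xml.etree.ElementTree as ET

ATOM_NS = "{http://www.w3.org/2005/Atom}"
ATOM_TYPE = "application/atom+xml"


def walk(feed_url, fetch, seen=None):
    """Yield item URLs reachable from feed_url, following nested feeds.

    fetch is any callable that maps a URL to the Atom XML text behind it.
    """
    seen = set() if seen is None else seen
    if feed_url in seen:  # guard against feeds that point at each other
        return
    seen.add(feed_url)
    root = ET.fromstring(fetch(feed_url))
    for entry in root.iter(ATOM_NS + "entry"):
        for link in entry.findall(ATOM_NS + "link"):
            href = link.get("href")
            if not href:
                continue
            if link.get("type") == ATOM_TYPE:
                # Assumption: an Atom-typed link is a nested list to descend into.
                yield from walk(href, fetch, seen)
            else:
                yield href


def _feed(*links):
    """Build a tiny Atom feed (hypothetical test data) from (href, type) pairs."""
    entries = "".join(
        f'<entry><link href="{href}" type="{mime}"/></entry>' for href, mime in links
    )
    return f'<feed xmlns="http://www.w3.org/2005/Atom">{entries}</feed>'


# Hypothetical stand-in for the network: a list of repositories, then their items.
FAKE_FEEDS = {
    "http://example.org/repositories.atom": _feed(
        ("http://example.org/repo-a/items.atom", ATOM_TYPE),
        ("http://example.org/repo-b/items.atom", ATOM_TYPE),
    ),
    "http://example.org/repo-a/items.atom": _feed(
        ("http://example.org/repo-a/item/1", "text/html"),
        ("http://example.org/repo-a/item/2", "text/html"),
    ),
    "http://example.org/repo-b/items.atom": _feed(
        ("http://example.org/repo-b/item/1", "text/html"),
    ),
}

if __name__ == "__main__":
    for url in walk("http://example.org/repositories.atom", FAKE_FEEDS.__getitem__):
        print(url)
```

Passing `fetch` in as a parameter keeps the traversal separate from the question of how, and how politely, pages are actually downloaded.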
So here we go… Here’s the list of repositories:
OpenDOAR – Countries and Organisations
Africa | Asia | Australasia | Caribbean | Central America | Europe | North America | South America
Click on a name to see the corresponding OpenDOAR summaries, or on a URL to visit the relevant website.
Cape Verde | Egypt | Ethiopia | Kenya | Namibia | South Africa | Uganda | Zimbabwe
Universidade Jean Piaget de Cabo Verde – http://www.unipiaget.cv/
Biblioteca Digital da Universidade Jean Piaget de Cabo Verde
http://bdigital.cv.unipiaget.org/dspace/
Bibliotheca Alexandrina (مكتبة الإسكندري) – http://www.bibalex.org/
Digital Assets Repository (DAR)
http://dar.bibalex.org/
… and presumably about 1400 more. So far so good. Our robots can navigate a table like this. I’d prefer RSS (and maybe there is something I’ve missed) but HTML will do.
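As a rough illustration of how a robot might read such a listing, here is a sketch that collects every link and its text from an HTML page using only the Python standard library. The `SAMPLE` markup is an assumption: I have not reproduced OpenDOAR's real HTML, only the two repository names and URLs quoted above, and `LinkCollector` and `extract_links` are hypothetical names.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect (link text, href) pairs for every <a href="..."> in a page."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None  # href of the <a> currently open, if any
        self._text = []    # text fragments seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self._href = href
                self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append(("".join(self._text).strip(), self._href))
            self._href = None


def extract_links(html):
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return parser.links


# Assumed markup; only the names and URLs come from the listing above.
SAMPLE = """
<table>
  <tr><td><a href="http://bdigital.cv.unipiaget.org/dspace/">Biblioteca Digital da Universidade Jean Piaget de Cabo Verde</a></td></tr>
  <tr><td><a href="http://dar.bibalex.org/">Digital Assets Repository (DAR)</a></td></tr>
</table>
"""

if __name__ == "__main__":
    for name, url in extract_links(SAMPLE):
        print(name, url)
```

A real crawler would still need to tell repository links apart from navigation links, which depends on page structure I am not assuming here.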
So let’s go to Cambridge, which uses DSpace repository technology…
Description: A community centred university repository with a wealth of supporting information and documentation. Most of the items are CML files from the WorldWideMolecularMatrix dataset of small molecules. Otherwise, it is especially rich in multimedia (images and video) objects, less well populated with full-text papers. Some articles are restricted access and are not freely visible. Users may set up RSS feeds to be alerted to new content.
Yes, Jim Downing populated the repository with about 150,000 molecules, one entry per molecule.
Now the metadata and policies:
Grade: Metadata re-use permitted for not-for-profit purposes
Anyone may access the metadata free of charge.
The metadata may be re-used in any medium without prior permission for not-for-profit purposes provided the OAI Identifier or a link to the original metadata record are given.
The metadata must not be re-used in any medium for commercial purposes without formal permission.
For more information, please see webpage: http://www.lib.cam.ac.uk/repository/about/policies.html
Standardised Data Policy for full-text and other full data items
Grade: Harvesting full data items by robots prohibited
Anyone may access full items free of charge.
Copies of full items generally can be:
reproduced in any format or medium
for personal research or study, or not-for-profit purposes without prior permission or charge.
Full items must not be harvested by robots except transiently for full-text indexing or citation analysis
Full items must not be sold commercially in any format or medium without formal permission of the copyright holders.
Mention of the repository is appreciated but not mandatory.
For more information see webpage: http://www.lib.cam.ac.uk/repository/about/policies.html.
And this is where we hit the major problem:
Harvesting full data items by robots prohibited
Full items must not be harvested by robots except transiently for full-text indexing or citation analysis
So, simply, I put 150,000 items in the database and I am not allowed to extract them by robots. OK, I doubt Cambridge will dismiss me if I do, but consider the import of that message:
We don’t want any old hacker using our repository.
I’ve chosen Cambridge because it’s my institution, but this restriction is extremely common in repositories.
We repository managers don’t want you using them.
If this is a problem with server overload there are well-known ways of getting round it. And most people who write crawlers will try hard to avoid damaging servers. So this can’t be the motivation.
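One of those well-known approaches is for the crawler to honour `robots.txt` and to pause between requests. The sketch below shows the idea with Python's standard `urllib.robotparser`; the `PoliteGate` class, the `chem-crawler` user agent, the `ROBOTS` text and the `example.org` URLs are all hypothetical, and this is one possible arrangement rather than a description of any particular repository's setup.

```python
import time
from urllib.robotparser import RobotFileParser


class PoliteGate:
    """Decide whether a URL may be fetched and enforce a delay between fetches."""

    def __init__(self, robots_txt, user_agent, default_delay=1.0,
                 clock=time.monotonic, sleep=time.sleep):
        self._rp = RobotFileParser()
        self._rp.parse(robots_txt.splitlines())  # parse text we already hold
        self._ua = user_agent
        delay = self._rp.crawl_delay(user_agent)  # None if no Crawl-delay given
        self.delay = float(delay) if delay is not None else default_delay
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous request, if any

    def allowed(self, url):
        return self._rp.can_fetch(self._ua, url)

    def wait(self):
        """Sleep just long enough that requests are at least self.delay apart."""
        if self._last is not None:
            remaining = self.delay - (self._clock() - self._last)
            if remaining > 0:
                self._sleep(remaining)
        self._last = self._clock()


# Hypothetical robots.txt content for illustration only.
ROBOTS = """\
User-agent: *
Crawl-delay: 2
Disallow: /private/
"""

if __name__ == "__main__":
    gate = PoliteGate(ROBOTS, "chem-crawler")
    for url in ["http://example.org/handle/1", "http://example.org/private/x"]:
        if gate.allowed(url):
            gate.wait()
            print("would fetch", url)
        else:
            print("skipping", url)
```

The clock and sleep functions are injectable only so the pacing can be checked without actually waiting.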
No, the motivation is that most repositories don’t want to take the risk of anyone downloading material and possibly breaking copyright. The repositories are for preservation – look, not touch. Here’s Edinburgh:
Grade: Metadata re-use policy explicitly undefined
Anyone may access the metadata free of charge.
No metadata re-use policy defined. Assume no rights at all have been granted.
Standardised Data Policy for full-text and other full data items
Grade: Full data item policies explicitly undefined
Anyone may access full items free of charge.
No full-item re-use policy defined. Assume no rights at all have been granted.
So what does “explicitly undefined” mean? It means that the repository managers will not help the user in determining what they can do with the material. Essentially “it’s your problem, not ours”.
Assume no rights at all have been granted.
So this is why scientists don’t use repositories, don’t use libraries. Why they use PubMedCentral, not their library. Why they use PubChem rather than their repository.
I actually want to give libraries something that they might be interested in – a tool which will extract chemistry from their theses. But their whole attitude is so web-unfriendly that I’m not sure it’s worth it. It’s far more important to uphold copyright than try to do something innovative in the C21.
I am still trying to get some positive input from libraries for my talk. So far nothing. Time is getting short.
Peter,
This is tangentially related to this topic, actually more to the topic of a number of your recent posts concerning institutional repositories. The article in the link below discusses Elsevier’s latest tactic regarding trying to usurp IRs:
http://www.timeshighereducation.co.uk/story.asp?sectioncode=26&storycode=407046&c=1
I think this topic would certainly be a warning to librarians on what dangers lurk…
Steven
@Steve thanks. I think that the story has been circulated – e.g. through Peter Suber’s blog. I agree it’s worrying as I am not yet sure that the libraries are good at alerting faculty to these problems.