What sort of repositories do we want?

I had the pleasure of meeting Greg Crane in Phoenix (see below) and again last week at our brainstorm on how to fund digital curation. Greg is a remarkable person – a classicist who is completely at home creating computer applications. He is familiar with many languages – trick question: “what is the most important language for the study of classics in the Near East?” [ans below]. Here he reports on the Phoenix workshop and also questions our first generation of institutional repositories…

Open Access and Institutional Repositories: The Future of Scholarly Communications, Academic Commons

Submitted by Greg Crane on December 16, 2007 – 10:19am.

Institutional repositories were the stated topic for a workshop convened in Phoenix, Arizona earlier this year (April 17-19, 2007) by the National Science Foundation (NSF) and the United Kingdom’s Joint Information Systems Committee (JISC). While in their report on the workshop, The Future of Scholarly Communication: Building the Infrastructure for Cyberscholarship, Bill Arms and Ron Larsen build out a larger landscape of concern, institutional repositories remain a crucial topic, which, without institutional cyberscholarship, will never approach their full potential.

PMR: Although I’m going to agree generally with Greg I don’t think the stated topic of the workshop was institutional repositories per se. It was digital scholarship, digital libraries and datasets. I would expect to find many datasets outside institutions (witness the bio-databases).

Repositories enable institutions and faculty to offer long-term access to digital objects that have persistent value. They extend the core missions of libraries into the digital environment by providing reliable, scalable, comprehensible, and free access to libraries’ holdings for the world as a whole. In some measure, repositories constitute a reaction against those publishers that create monopolies, charging for access to publications on research they have not conducted, funded, or supported. In the long run, many hope faculty will place the results of their scholarship into institutional repositories with open access to all. Libraries could then shift their business model away from paying publishers for exclusive access. When no one has a monopoly on content, the free market should kick in, with commercial entities competing on their ability to provide better access to that freely available content. Business models could include subscription to services and/or advertising.

Repositories offer one model of a sustainable future for libraries, faculty, academic institutions and disciplines. In effect, they reverse the polarity of libraries. Rather than import and aggregate physical content from many sources for local use, as their libraries have traditionally done, universities can, by expanding access to the digital content of their own faculty through repositories, effectively export their faculty’s scholarship. The centers of gravity in this new world remain unclear: each academic institution probably cannot maintain the specialized services needed to create digital objects for each academic discipline. A handful of institutions may well emerge as specialist centers for particular areas (as Michael Lesk suggests in his paper here).
The repository movement has, as yet, failed to exert a significant impact upon intellectual life. Libraries have failed to articulate what they can provide and, far more often, have failed to provide repository services of compelling interest. Repository efforts remain fragmented: small, locally customized projects that are not interoperable–insofar as they operate at all. Administrations have failed to show leadership. Happy to complain about exorbitant prices charged by publishers, they have not done the one thing that would lead to serious change: implement a transitional period by the end of which only publications deposited within the institutional repository under an open access license will count for tenure, promotion, and yearly reviews. Of course, senior faculty would object to such action, content with their privileged access to primary sources through expensive subscriptions. Also, publications in prestigious venues (owned and controlled by ruthless publishers) might be lost. Unfortunately, faculty have failed to look beyond their own immediate needs: verbally welcoming initiatives to open our global cultural heritage to the world but not themselves engaging in any meaningful action that will make that happen.
The published NSF/JISC report wisely skips past the repository impasse to describe the broader intellectual environment that we could now develop. Libraries, administrators and faculty can muddle through with variations on proprietary, publisher-centered distribution. However, existing distribution channels cannot support more advanced scholarship: intellectual life increasingly depends upon open access to large bodies of machine actionable data.
The larger picture depicted by the report demands an environment in which open access becomes an essential principle for intellectual life. The more pervasive that principle, the greater the pressure for instruments such as institutional repositories that can provide efficient access to large bodies of machine actionable data over long periods of time. The report’s authors summarize as follows the goal of the project around which this workshop was created:

To ensure that all publicly-funded research products and primary resources will be readily available, accessible, and usable via common infrastructure and tools through space, time, and across disciplines, stages of research, and modes of human expression.

To accomplish this goal, the report proposes a detailed seven-year plan to push cyberscholarship beyond prototypes and buzzwords, including action under the following rubrics:

Infrastructure: to develop and deploy a foundation for scalable, sustainable cyberscholarship

Research: to advance cyberscholarship capability through basic and applied research and development

Behaviors: to understand and incentivize personal, professional and organizational behaviors

Administration: to plan and manage the program at local, national and international levels

For members of the science, technology, engineering, and medical fields, the situation is promising. This report encourages the NSF to take the lead and, even if it does not pursue the particular recommendations advocated here, the NSF does have an Office of Cyberinfrastructure responsible for such issues, and, more importantly, enjoys a budget some twenty times larger than that of the National Endowment for the Humanities. In the United Kingdom, humanists may be reasonably optimistic, since JISC supports all academic disciplines with a healthy budget. Humanists in the US face a much more uncertain future.

PMR: I would agree with Greg that IRs are oversold and underdeliver. I never expected differently. I have never yet located a digital object I wanted in an IR except when I specifically went looking (e.g. for theses). And I went to Soton to see what papers of Stevan’s were public and what their metadata were. But I have never found one through Google.

Why is this? The search engines locate content. Try searching for NSC383501 (the entry for a molecule from the NCI) and you’ll find: DSpace at Cambridge: NSC383501
But the actual data itself (some of which is textual metadata) is not accessible to search engines so isn’t indexed. So if you know how to look for it through the ID, fine. If you don’t you won’t.
I don’t know what the situation is in humanities, so I looked up the Fitzwilliam (the major museum in Cambridge) newsletter and looked for “The Fitzwilliam Museum Newsletter Winter 2003/2004” in Google and found: DSpace at Cambridge: The Fitzwilliam Museum Newsletter 22 but when I looked for the first sentence “The building phase of The Fitzwilliam Museum Courtyard” Google returned zero hits.
So (unless I’m wrong and please correct me), deposition in DSpace does NOT allow Google to index the text that it would expose on normal web pages. Jim explained that this was due to the handle system and the use of one level of indirection – Google indexes the metadata but not the data. (I suspect this is true of ePrints – I don’t know about Fedora).
If this is true, then repositing at the moment may archive the data but it hides it from public view except to diligent humans. So people are simply not seeing the benefit of repositing – they don’t discover material through simple searches.
So I’m hoping that ORE will change all this. Because we can expose all the data as well as the metadata to search engines. That’s one of the many reasons why I’m excited about our molecular repositories (eChemistry) project.
As I said in a previous post, it will change the public face of chemical information. The key word for this post is “public”. In others we’ll look at “chemical” and “information”.
====================
[ans: German. Because the majority of scholarship in the C19 was in German.]

6 Responses to What sort of repositories do we want?

Dorothea Salo says:

December 18, 2007 at 3:27 pm

DSpace doesn’t natively stop crawlers from indexing textual content of ingested items. I did a Google on “Innkeeper at the Roach Motel” and turned up my article, which only appears on the Web in the DSpace repository MINDS@UW. It had Google’s usual “View as HTML” link for the PDF. Worked fine.
For another test, I did a search on Google for a phrase that appeared in my article but NOT in its metadata. Also worked fine.
PDFs can be a problem, though (not that this is news to you). In the newsletter case, I would need to know whether the PDF was created directly from whatever layout software the newsletter producers use, or whether it was scanned to PDF. If the latter, and no OCR was performed, Google can’t index the newsletter text because ceci n’est pas une texte — it’s a picture of a text, and Google crawlers don’t automatically OCR pictures.
Not sure what to tell you about your ChemML files. Possibly Google doesn’t know what to do with them and doesn’t try?

Dorothea Salo says:

December 18, 2007 at 3:30 pm

Argh. I meant to add that while DSpace doesn’t natively stop crawlers, individual DSpace administrators might — there have been problems in the past with overeager crawlers (hello, Googlebot) crashing DSpace installations. (Been there, done that, got the 400 “help, I’ve fallen and I can’t get up!” messages in my inbox.)
DSpace 1.4.2 took care of a lot of that nonsense, but I don’t know that admins who restricted crawlers in the past will have relaxed their restrictions now.

Tim Donohue says:

December 18, 2007 at 4:34 pm

I’m with Dorothea on this one…natively, DSpace will allow Google (and other crawlers) to full-text index your files in DSpace. Here’s a good example, from a tutorial handout I wrote with Dorothea:
Search the following in Google:
dspace how-to, “Messages.properties” (typed as-is in Google)
Here’s a direct link to the search:
http://www.google.com/search?hl=en&q=dspace+how-to%2C+%22Messages.properties%22&btnG=Google+Search
As you can see in the link, the top four hits are all PDFs or OpenOffice.org files that Google has indexed in various DSpace repositories (IDEALS and GMU’s MARS). Although the text “dspace how-to” exists in the title & abstract, the text “Messages.properties” ONLY exists in the full-text of the item. In fact, Google displays a small full-text snippet below each hit, showing where it located “Messages.properties” in the full-text. In addition, as Dorothea already mentioned, Google has auto-converted all these into HTML as well (in the “view as HTML” link).

Jim Downing says:

December 19, 2007 at 10:46 am

“Not sure what to tell you about your ChemML files. Possibly Google doesn’t know what to do with them and doesn’t try?”
That’s my understanding – interestingly, if you lie about the MIME type, Google does index CML (here, for example).
In DSpace@Cambridge, Tom De Mulder had to block bots from the DSpace browse pages because of the curious behaviour of overlapping browse pages. Instead, the template contains a link to a single HTML document with a link to every item. A low-tech atom archive, basically.
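The "single HTML document with a link to every item" idea can be illustrated with a minimal sketch. This is not the DSpace@Cambridge template: the function name `build_link_page` and the example URLs are hypothetical, and the real page would take its handles from the repository itself.

```python
from html import escape


def build_link_page(items, title="All items"):
    """Return one HTML document containing a link to every (url, label) pair."""
    rows = "\n".join(
        f'<li><a href="{escape(url, quote=True)}">{escape(label)}</a></li>'
        for url, label in items
    )
    return (
        "<!DOCTYPE html>\n"
        f"<html><head><title>{escape(title)}</title></head>\n"
        f"<body><ul>\n{rows}\n</ul></body></html>\n"
    )


# Hypothetical item records; real handles would come from the repository.
example_items = [
    ("https://repository.example.org/handle/0000/1", "Item one"),
    ("https://repository.example.org/handle/0000/2", "Item two & more"),
]
page = build_link_page(example_items)
```

A crawler that is blocked from the browse pages can still reach every item from this one flat page, which is the point of the approach described above.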

Chris Rusbridge says:

January 1, 2008 at 5:38 pm

Peter, I think this is an important post to come back to and comment on again. I read your main message as being “repositories don’t help your data be indexed and it may therefore not be found”. I have just run some tests on two DSpace and one eprints repository, and Google has indexed the text in all cases except where (for some unknown reason… possibly a publisher version?) the PDF is in image mode. I see your other commenters made similar points.
In the case of your CML file, my Mac does not know what it is, and I guess no more does Google… unless perhaps you tell Google that it’s text, as Jim Downing seemed to be suggesting in an earlier comment (I’m not sure this constitutes lying, more selective use of the truth). I can open your CML files in my text editor, fine, although of course to process them into something chemically interesting, I would need some additional software or plugins…
The point is, surely, that this would be just as true if the repository was simply a filestore full of CML files, which is how data is often made available. But unlike the filestore, there is potentially some useful metadata in the repository which could assist data users (ie people, in this case); in a filestore, this is either absent or in some conventional place such as README.TXT.
The case that you seemed to be making (that repositories are not useful for data) would be more convincing if the repository architecture got seriously in the way of processing (rather than just viewing) the data. Do you have any evidence that this is the case?
And finally… Happy New Year!

pm286 says:

January 1, 2008 at 6:27 pm

(5) Many thanks Chris,
You and others have slightly reassured me – my sample of 2 appears to have been limited and other people seem to get their material indexed.
There is a specific problem in that DSpace does not – at least when we filled it – allow collections to be easily iterated over. If someone asks for all the files we have deposited then there is no easy recipe to give them to extract the files systematically.
But hopefully these are problems of early adoption.
