Category Archives: repositories

Chemistry Repositories

Richard Van Noorden - writing in the RSC's Chemistry World - has described the eChemistry repository project, Microsoft ventures into open access chemistry. This is very topical as Jim Downing, Jeremy Frey, Simon Coles and I are off to join the US members of the project at the weekend. It's exciting, challenging, but eminently feasible. So what are the new ideas?

The main theme is repositories. Rather a fuzzy term and therefore valuable as a welcoming and comforting idea. Some of the things that repositories should encourage are:

  • ease of putting things in. It doesn't require a priesthood (as so many relational databases do). You should be able to put in a wide range of things - theses, molecules, spectra, blogs, etc. You shouldn't have to worry about datatypes, VARCHARs, third normal form, etc.
  • it should also be easy to get things out. That means a simple, understandable structure to the repository, and being able to find the vocabulary used to describe the objects.
  • flexibility. Web 2.0 teaches us that people will do things in different ways. Should a spectrum contain a molecule or should a molecule contain a spectrum? Some say one, some the other. So we have to support both. Sometimes required information is not available, so it must be omitted - and that shouldn't break the system.
  • interoperability. If there are several repositories built by independent groups, it should be possible for one lot to find out what the others have done without mailing them. And the machines should be able to work this out. That's hard but not impossible.
  • avoid preplanning. RDBs suffer from having to have a schema before you put data in. Repositories can describe a basic minimum and then we can work out later how to ingest or extract.
  • power is more important than performance (at least for me). I'd rather take many minutes to find something difficult than not be able to do it at all. When I started on relational databases for molecules it took all night to do a simple join. So everything is relative...

The core of the project is ORE - Object Re-use and Exchange (ORE Specification and User Guide). A lot of work has gone into this and it's been implemented at alpha, so we know it works. ORE is quite a meaty spec, but Jim understands it. Basically the repositories can be described in RDF, and some subgraphs (or additional ones) are "named graphs" (e.g. Named Graphs / Semantic Web Interest Group) which are used to describe the subsets of data that you may be interested in. There are quite strong constraints on naming conventions and you need to be well up on basic RDF. But then we can expect the power of the triple stores to start retrieving information in a flexible way. (As an example, Andrew Walkingshaw has extracted 10 million triples from CrystalEye and shown that these can be rapidly searched for bibliographic and other info.) Adding chemistry will be more challenging and I'm not sure how this integrates with RDF - but this is a research project. Maybe we'll precompute a number of indexes. And, in principle, RDF can be used to search substructures, but I suspect it will be a little slow to start with.

But maybe not... In which case we shall have made a very useful transition.
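To make the named-graph idea concrete, here is a toy sketch in plain Python - not ORE, not a real triple store, and all the identifiers are invented. Each named graph is just a set of triples, and a query is a pattern match across graphs; a molecule can point to a spectrum in one graph while the spectrum points back to the molecule in another, so both containment views coexist:

```python
from collections import defaultdict


class TripleStore:
    """Toy named-graph triple store: graph name -> set of (s, p, o) triples."""

    def __init__(self):
        self.graphs = defaultdict(set)

    def add(self, graph, s, p, o):
        self.graphs[graph].add((s, p, o))

    def match(self, s=None, p=None, o=None, graph=None):
        """Yield (graph, triple) for triples matching the pattern; None is a wildcard."""
        names = [graph] if graph else list(self.graphs)
        for name in names:
            for ts, tp, to in self.graphs[name]:
                if all(q is None or q == t for q, t in ((s, ts), (p, tp), (o, to))):
                    yield name, (ts, tp, to)


store = TripleStore()
# Two named graphs with opposite containment views (identifiers invented):
store.add("molecules", "mol:NSC383501", "chem:hasSpectrum", "spec:42")
store.add("spectra", "spec:42", "chem:ofMolecule", "mol:NSC383501")

hits = list(store.match(p="chem:hasSpectrum"))
print(hits)  # [('molecules', ('mol:NSC383501', 'chem:hasSpectrum', 'spec:42'))]
```

A real system would use RDF serializations and a SPARQL-capable store, but the shape of the data - triples grouped into named graphs, queried by pattern - is the same.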

From Peter Suber: More on the NIH OA mandate.

Many points but I pick one:


Jocelyn Kaiser, Uncle Sam's Biomedical Archive Wants Your Papers, Science Magazine, January 18, 2008 (accessible only to subscribers).  Excerpt:

If you have a grant from the U.S. National Institutes of Health (NIH), you will soon be required to take some steps to make the results public. Last week, NIH informed its grantees that, to comply with a new law, they must begin sending copies of their accepted, peer-reviewed manuscripts to NIH for posting in a free online archive. Failure to do so could delay a grant or jeopardize current research funding, NIH warns....


Scientists who have been sending their papers to PMC say the process is relatively easy, but keeping track of each journal's copyright policy is not....

PMR: Exactly. It should be trivial to find out what a journal's policy is - as easy as reading an Open Source licence. An enormous amount of human effort - by authors and repositarians - is wasted on repeatedly trying (and often failing) to get this conceptually simple information.


I've been doing articles and interviews on OA and Open Data recently, and one thing that becomes ever clearer is that we need licences or other tools. Labeling with "open access" doesn't work.


What sort of repositories do we want?


Open Access and Institutional Repositories: The Future of Scholarly Communications, Academic Commons.

Institutional repositories were the stated topic for a workshop convened in Phoenix, Arizona earlier this year (April 17-19, 2007) by the National Science Foundation (NSF) and the United Kingdom's Joint Information Systems Committee (JISC). While in their report on the workshop, The Future of Scholarly Communication: Building the Infrastructure for Cyberscholarship, Bill Arms and Ron Larsen build out a larger landscape of concern, institutional repositories remain a crucial topic, which, without institutional cyberscholarship, will never approach their full potential.


PMR: Although I'm going to agree generally with Greg I don't think the stated topic of the workshop was institutional repositories per se. It was digital scholarship, digital libraries and datasets. I would expect to find many datasets outside institutions (witness the bio-databases).

Repositories enable institutions and faculty to offer long-term access to digital objects that have persistent value. They extend the core missions of libraries into the digital environment by providing reliable, scalable, comprehensible, and free access to libraries' holdings for the world as a whole. In some measure, repositories constitute a reaction against those publishers that create monopolies, charging for access to publications on research they have not conducted, funded, or supported. In the long run, many hope faculty will place the results of their scholarship into institutional repositories with open access to all. Libraries could then shift their business model away from paying publishers for exclusive access. When no one has a monopoly on content, the free market should kick in, with commercial entities competing on their ability to provide better access to that freely available content. Business models could include subscription to services and/or advertising.

Repositories offer one model of a sustainable future for libraries, faculty, academic institutions and disciplines. In effect, they reverse the polarity of libraries. Rather than import and aggregate physical content from many sources for local use, as their libraries have traditionally done, universities can, by expanding access to the digital content of their own faculty through repositories, effectively export their faculty's scholarship. The centers of gravity in this new world remain unclear: each academic institution probably cannot maintain the specialized services needed to create digital objects for each academic discipline. A handful of institutions may well emerge as specialist centers for particular areas (as Michael Lesk suggests in his paper here).

The repository movement has, as yet, failed to exert a significant impact upon intellectual life. Libraries have failed to articulate what they can provide and, far more often, have failed to provide repository services of compelling interest. Repository efforts remain fragmented: small, locally customized projects that are not interoperable--insofar as they operate at all. Administrations have failed to show leadership. Happy to complain about exorbitant prices charged by publishers, they have not done the one thing that would lead to serious change: implement a transitional period by the end of which only publications deposited within the institutional repository under an open access license will count for tenure, promotion, and yearly reviews. Of course, senior faculty would object to such action, content with their privileged access to primary sources through expensive subscriptions. Also, publications in prestigious venues (owned and controlled by ruthless publishers) might be lost. Unfortunately, faculty have failed to look beyond their own immediate needs: verbally welcoming initiatives to open our global cultural heritage to the world but not themselves engaging in any meaningful action that will make that happen.

The published NSF/JISC report wisely skips past the repository impasse to describe the broader intellectual environment that we could now develop. Libraries, administrators and faculty can muddle through with variations on proprietary, publisher-centered distribution. However, existing distribution channels cannot support more advanced scholarship: intellectual life increasingly depends upon open access to large bodies of machine actionable data.

The larger picture depicted by the report demands an environment in which open access becomes an essential principle for intellectual life. The more pervasive that principle, the greater the pressure for instruments such as institutional repositories that can provide efficient access to large bodies of machine actionable data over long periods of time. The report's authors summarize as follows the goal of the project around which this workshop was created:

To ensure that all publicly-funded research products and primary resources will be readily available, accessible, and usable via common infrastructure and tools through space, time, and across disciplines, stages of research, and modes of human expression.

To accomplish this goal, the report proposes a detailed seven-year plan to push cyberscholarship beyond prototypes and buzzwords, including action under the following rubrics:

  • Infrastructure: to develop and deploy a foundation for scalable, sustainable cyberscholarship
  • Research: to advance cyberscholarship capability through basic and applied research and development
  • Behaviors: to understand and incentivize personal, professional and organizational behaviors
  • Administration: to plan and manage the program at local, national and international levels

For members of the science, technology, engineering, and medical fields, the situation is promising. This report encourages the NSF to take the lead and, even if it does not pursue the particular recommendations advocated here, the NSF does have an Office of Cyberinfrastructure responsible for such issues, and, more importantly, enjoys a budget some twenty times larger than that of the National Endowment for the Humanities. In the United Kingdom, humanists may be reasonably optimistic, since JISC supports all academic disciplines with a healthy budget. Humanists in the US face a much more uncertain future.

PMR: I would agree with Greg that IRs are oversold and underdeliver. I never expected differently. I have never yet located a digital object I wanted in an IR except when I specifically went looking (e.g. for theses). And I went to Soton to see which papers of Stevan's were public and what their metadata were. But I have never found one through Google.

Why is this? The search engines locate content. Try searching for NSC383501 (the entry for a molecule from the NCI) and you'll find: DSpace at Cambridge: NSC383501

But the actual data itself (some of which is textual metadata) is not accessible to search engines so isn't indexed. So if you know how to look for it through the ID, fine. If you don't you won't.

I don't know what the situation is in the humanities, so I looked up the Fitzwilliam (the major museum in Cambridge) newsletter. Searching Google for "The Fitzwilliam Museum Newsletter Winter 2003/2004" found: DSpace at Cambridge: The Fitzwilliam Museum Newsletter 22. But when I searched for its first sentence, "The building phase of The Fitzwilliam Museum Courtyard", Google returned zero hits.

So (unless I'm wrong and please correct me), deposition in DSpace does NOT allow Google to index the text that it would expose on normal web pages. Jim explained that this was due to the handle system and the use of one level of indirection - Google indexes the metadata but not the data. (I suspect this is true of ePrints - I don't know about Fedora).

If this is true, then repositing at the moment may archive the data but hides it from public view, except to diligent humans. So people are simply not seeing the benefit of repositing - they don't discover material through simple searches.

So I'm hoping that ORE will change all this. Because we can expose all the data as well as the metadata to search engines. That's one of the many reasons why I'm excited about our molecular repositories (eChemistry) project.

As I said in a previous post, it will change the public face of chemical information. The key word for this post is "public". In others we'll look at "chemical" and "information".



Scraping HTML

As we have mentioned earlier, we are looking at how experimental data can be extracted from web sources. There is a rough scale of feasibility:


I have been looking at several sites which produce chemical information (more later). One exposes SDF (a legacy ASCII file of molecular structures and data). The others all expose HTML. This is infinitely better than PDF, BUT...

I had not realised how awful it can be. The problems include:

  • encodings. If any characters outside the printable ASCII range (32-126) are used they will almost certainly cause problems. Few sites declare an encoding, and even when they do the interconversion is not necessarily trivial.
  • symbols. Many sites use "smart quotes" for quotes. These are outside the ASCII range and almost invariably cause problems. The author can be slightly forgiven, since many tools (including WordPress) convert to smart quotes ("") automatically. Even worse is the use of "mdash" (—) for "minus" in numerical values. This can be transformed into a "?" or a block character, or even lost. Dropping a minus sign can cause crashes and death. (We also find papers in Word where the numbers are in Symbol font and get converted to whatever, or deleted.)
  • non-HTML tags. Some tools make up their own tags (e.g. I found <startfornow>) and these can cause HTMLTidy to fail.
  • non-well-formed HTML. Although HTML allows some omissions (e.g. "br" can miss out the end tag) there are many that are not interpretable. The use of <p> to separate paragraphs rather than contain them is very bad style.
  • javascript, php, etc. Hopefully it can be ignored. But often it can't.
  • linear structure rather than groupings. Sections can be created with the "div" tag, but many pages assume that a bold heading (h2) is the right way to declare a section. This may be obvious when humans read it, but it causes great problems for machines - it is difficult to know when something finishes.
  • variable markup. For a long-established web resource - even where pages are autogenerated - the markup tends to evolve, and it may be difficult to find a single approach to understanding it. This is also true of multi-author sites where there is no clear specification for the markup - Wikipedia is a good example.
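A defensive extractor has to normalise these troublesome characters before it attempts any numeric parsing. Here is a minimal sketch using only the Python standard library - the character map is illustrative, not complete, and the sample page is invented:

```python
from html.parser import HTMLParser

# Characters that commonly corrupt scraped data: smart quotes -> ASCII quotes,
# en/em dash and the Unicode minus sign -> hyphen-minus, NBSP -> space.
CHAR_MAP = {
    "\u2018": "'", "\u2019": "'",
    "\u201c": '"', "\u201d": '"',
    "\u2013": "-", "\u2014": "-", "\u2212": "-",
    "\u00a0": " ",
}


def normalise(text: str) -> str:
    return "".join(CHAR_MAP.get(ch, ch) for ch in text)


class TextExtractor(HTMLParser):
    """Collect text content, normalising problem characters as we go."""

    def __init__(self):
        super().__init__(convert_charrefs=True)  # decode &minus; etc. for us
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(normalise(data))

    def text(self):
        return "".join(self.chunks)


# An invented page mixing an HTML entity, smart quotes and a Unicode minus:
page = "<p>Melting point: &minus;12.3 \u00b0C (lit. \u201cca. \u221212\u201d)</p>"
ex = TextExtractor()
ex.feed(page)
print(ex.text())
```

This copes with the "symbols" problem above; encodings have to be handled earlier, when the raw bytes are decoded, and malformed markup usually needs a tolerant parser or a tidying pass first.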

As a result it is not usually possible to extract all the information from HTML pages and precision and recall both fall well short of 100%. The only real solution is to persuade people to create machine-friendly pages based on RSS, RDF, XML and related technology. This solves 90% of the above problems. That's why we are looking very closely at Jim Downing's approach of using Atom Publishing Protocol for web sites.
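As a sketch of what the machine-friendly alternative looks like, here is a minimal Atom entry built with Python's standard library. The identifiers and dates are invented, and a real Atom Publishing Protocol server would also need service and collection documents - this only shows the shape of one entry:

```python
import xml.etree.ElementTree as ET

ATOM = "http://www.w3.org/2005/Atom"
ET.register_namespace("", ATOM)  # serialize Atom as the default namespace


def atom_entry(entry_id, title, updated, summary):
    """Build a minimal Atom entry; id, title and updated are required by RFC 4287."""
    entry = ET.Element(f"{{{ATOM}}}entry")
    for tag, text in (("id", entry_id), ("title", title),
                      ("updated", updated), ("summary", summary)):
        ET.SubElement(entry, f"{{{ATOM}}}{tag}").text = text
    return entry


# Hypothetical record: one crystal structure exposed as an Atom entry.
entry = atom_entry(
    "http://example.org/crystaleye/NSC383501",
    "Crystal structure NSC383501",
    "2008-01-20T00:00:00Z",
    "CML and derived metadata for one structure",
)
xml = ET.tostring(entry, encoding="unicode")
print(xml)
```

A page like this is trivially parseable by machines - no guessing where a section ends, no entity soup - which is exactly what the HTML pages above lack.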

Why is it so difficult to develop systems?

Dorothea Salo (who runs the Caveat Lector blog) is concerned (Permalink) that developers and users (an ugly word) don't understand each other:

(I posted a lengthy polemic to the DSpace-Tech mailing list in response to a gentle question about projected DSpace support for electronic theses and dissertations. I think the content is relevant to more than just the DSpace community, so I reproduce it here, with an added link or two.)


My sense is that DSpace development has only vaguely and loosely been guided by real-world use cases not arising from its inner circle of contributing institutions. E.g., repeated emails to the tech and dev lists concerning metadata-only deposits (the use case there generally being institutional-bibliography development), ETD management, true dark archiving, etc. etc. have not been answered by development initiatives, or often by anything but “why would you even want that?” incomprehension or “just hack it in like everybody else!” condescension.

PMR: This has been a perennial problem for many years and will continue to be so. I'm not commenting on DSpace specifically (although it is clearly acquiring a large code base). But my impression of the last 10-15 years (especially W3C and Grid/eScience projects) is that they rapidly become overcomplicated, overextended, and fail to get people using them.

On the other hand there are the piles of spaghetti bash, C, Python and so on which adorn academic projects and cause just as much heartache. Typical "head-down" or throwaway code.

The basic fact is that most systems are complicated, and there isn't a lot that can be done about that easily. It's summed up by the well-known Conservation Of Complexity:

This is a hypothesis that software complexity cannot be created or destroyed, it can only be shifted (or translated) from one place (or form) to another.

If, of course, you are familiar with the place that the complexity has shifted to, it's much easier. So if someone has spent time learning how to run spreadsheets, or workflows, or Python, and if the system has been adapted to those, it may be easier. But if those systems are new then they will have serious rough edges. We found this with the Taverna workflow, which works for bioscience but isn't well suited (yet) to chemistry. We spent months on it, but those involved have reverted to using Java code for much of our glueware. We understand it, our libraries work, and since it allows very good test-driven development and project management it's ultimately cost-effective.


We went through something like the process Dorothea mentions when we started to create a submission tool for crystallography in the SPECTRa : JISC project. We thought we could transfer the (proven) business process that Southampton had developed for the National Crystallographic Centre, and that the crystallographers would appreciate it. It would automate the management of the process from receiving the crystal to repositing the results in DSpace.


It doesn't work like that in the real world.


The crystallographers were happy to have a repositing tool, but they didn't want to change their process, and wouldn't thank us for providing a bright shiny new one that was "better". They wanted to stick with their paper copies and the way they disseminated their data. So we realised, and backtracked. It cost us three months, but that's what we have to factor into these projects. It's a lot better than wasting a year producing something people don't want.


Ultimately much of the database and repository technology is too complicated for what we need at the start of the process. I am involved in one project where the database requires an expert to spend six months tooling it up. I thought DSpace was the right way to go to reposit my data, but it wasn't. I (or rather Jim) put 150,000+ molecules into it, but they aren't indexed by Google and we can't get them out en masse. Next time we'll simply use web pages.


By contrast we find that individual scientists, if given the choice, revert to two or three simple, well-proven systems:

  • the hierarchical filesystem
  • the spreadsheet

A major reason these hide complexity is that they have no learning curve, and have literally millions of user-years of experience behind them. We take the filesystem for granted, but it's actually a brilliant invention. The credit goes to Dennis Ritchie in ca. 1969. (I well remember my backing store being composed of punched tape and cards.)

If you want differential access to resources, and record locking, and audit trails, and rollback, and integrity of committal, and you are building it all from scratch, it will be a lot of work. And you lose sight of your users.

So we're looking seriously at systems based on simpler technology than databases - such as RDF triple stores coupled to the filesystem and XML.
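At its simplest, "triple store coupled to the filesystem" can mean something like the following sketch - one plain-text file per named graph, one tab-separated triple per line. The layout and names are my invention for illustration, not a description of what we have deployed:

```python
import pathlib
import tempfile


def save_triples(path, triples):
    """Write triples as tab-separated subject/predicate/object lines."""
    path.write_text("\n".join("\t".join(t) for t in triples), encoding="utf-8")


def load_triples(path):
    """Read them back; the filesystem itself is the 'store'."""
    return [tuple(line.split("\t"))
            for line in path.read_text(encoding="utf-8").splitlines() if line]


graph_dir = pathlib.Path(tempfile.mkdtemp())  # stand-in for a data directory
mol_file = graph_dir / "molecules.tsv"        # one file per named graph
save_triples(mol_file, [("mol:1", "chem:formula", "C6H6"),
                        ("mol:1", "chem:mass", "78.11")])
triples = load_triples(mol_file)
print(triples)
```

No schema, no server, no priesthood: anything that can read a file can read the data, and a proper triple store can always be layered on top later.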

And the main rule is that both the users and the developers have to eat the same dogfood. It's slow and not always tasty. And you can't avoid Fred Brooks:

 Chemical engineers learned long ago that a process that works in the laboratory cannot be implemented in a factory in one step. An intermediate step called the pilot plant is necessary....In most [software] projects, the first system is barely usable. It may be too slow, too big, awkward to use, or all three. There is no alternative but to start again, smarting but smarter, and build a redesigned version in which these problems are solved.... Delivering the throwaway to customers buys time, but it does so only at the cost of agony for the user, distraction for the builders while they do the redesign, and a bad reputation for the product that the best redesign will find hard to live down. Hence, plan to throw one away; you will, anyhow.

Very simply, TTT: Things Take Time.

Repository depositions - what scales? A simple idea

One of the problems of repositories at present is that everything is new. And much of it is complex. And some changes rapidly. So here is a simple idea, motivated by Dorothea's reply to a post of mine...

Dorothea Salo Says:
November 12th, 2007 at 2:51 pm

[... why repositories need investment ...]

And some of the work is automatable over time. Once you know a particular journal’s inclinations, pushing anything from that journal out of the bucket becomes a two-second decision instead of a ten-minute slog through SHERPA and publisher websites.

PMR: Now this is an area that is a vast time-sink. Suppose I (as a simple scientific author) want to know if I can archive my Springer article (read also Wiley, Elsevier, ACS, RSC...). What do I have to do? When?

I imagine that hundreds of people struggle through this every year, frantically hacking through awful - yes, awful - pages from publishers. Many of these are not aimed at helping authors self-archive but at suggesting how they can pay money to the publisher for re-use of articles. (I could easily rack up a bill of 1000 USD for re-using my own article if I wanted to include it in a book, use it for distance education, use it for training, etc.) It is not easy to find out how to self-archive - I wonder why?

So I thought I would try to do this responsibly and find out what Springer actually allows. I have a paper in a Springer journal - what am I allowed to do and when? The following journey may be inexact, and I'd appreciate correction, but it's the one that a fairly intelligent, fairly knowledgeable, scientist who knows something about Open Access followed.

I went to the home page of J. Molecular Modeling and looked in "For authors and editors". Nothing about self-archiving. A fair amount about Open Choice (the option where the author pays Springer 3000 USD to have their full-text article visible (like all other current non-Open Choice articles in J. Mol. Mod.) and archived in PubMed, and re-usable for teaching without payment, but with the copyright retained by Springer). I went to Google and typed "Springer self-archiving". I won't list all the results in detail, but in order they were:

A report by Peter Suber (2005), the Journal of Gambling Studies, a critique by Stevan Harnad, a Springer PPT presentation (2004) on Open Choice (which stated:

Springer supports Self-archiving: Authors are allowed to post their own version on their personal website or in their institution’s online repository, with a link to the publisher’s version. (PMR: this is the ONLY page I found from Springer.)

... an attack by Richard Zach on the 3000 USD for Open Choice, an attack by Stevan Harnad on why Jan Velterop opposes Green self-archiving, and a page from Sherpa-Romeo which gives the conditions - and is the most helpful of all.

I immediately take away the fact that Springer is making no effort to help authors find the conditions for self-archiving. I have no idea where they are. I'd hate to do anything that violated the conditions.

So, to follow up Dorothea's post. A LOT of useful human effort is wasted because the publishers make it so difficult to find out how to self-archive. I'd like a fair bargain. If the publisher has agreed that you can self-archive, tell us how. Or we start to see publishers as a difficulty to be overcome.

So a suggestion. Suppose each institutional repositarian spent one day a year posting how to self-archive articles from the Journal of Irreproducible Results. (Don't be fooled - it takes a day to get a clear answer from most publishers or journals.) And each one took a different journal. And posted it on a communal Wiki. Then we would have a clear, up-to-date indication of what was allowed and what wasn't. Including things like "I asked if I could retain copyright and they said yes". Really vital info.

It's not a lot of work per person. It would pay back within a year. Someone has to set up the Wiki. And keep it free of spam. But that's not enormous. But sorry - I'm not volunteering. I'm in a discipline where there is very little chance of self-archiving legally. I've spent enough time trying.

Can I reposit my article?

Having re-explored the access to articles in the Journal Of Molecular Modeling I thought I would see if I am allowed to reposit my article in the Cambridge DSpace. So while the sun is shining here's a small pictorial journey...

I can read my article without paying (I'm not at work and have no special access AFAIK). So, I assume, can everyone else:


I click on Permissions & Reprints...

I assume I have got the right options here. I thought I did quite well to find the "Institutional Repository" option. I have no idea what "Prepress article" means but since it's the only option I don't need to think. So how much if anything do I have to pay...

... and thank you, Rightslink, for making it very clear that I cannot put my paper in an IR. As an exercise, see how long it takes you to find the relevant section in the SSBM "clear guidelines". First you have to find them. Then find "repository". The best I could come up with after 5 minutes was:

"Details of Use:

Details of use define where or how you intend to reuse the content. Details of use vary by the type of use.

Some of the details of use include: Advertising, Banner, Brochure or flyer, Catalogue, CME, Web site, Repository, Slides/Slide Set, Staff training, or Workshop.

Some details of use are geographic: National (in the country associated with your account set up), Internal (within your organization), or Global (worldwide).


Well, yes. But it doesn't answer my question about what I can put in my Institutional Repository, and why, and when, and how.

But since the answer is actually NO to everything, shouldn't I just accept that?

Using our own repository

Elin Stangeland from our Institutional Repository will be talking to us (my Unilever Centre colleagues) tomorrow on how to use it. Jim and I have seen her draft talk, but I'll keep it a surprise till afterwards.

I still think there is a barrier to using IRs, and I'll explain why.

We spent some of our group meeting on Friday discussing what papers we were writing and how. As part of that we looked at how to deposit them in the IR. It's not easy in chemistry, as most publishers don't allow simple deposition of the "publisher's PDF". So here are the sorts of problems we face, and how to tackle them.

Firstly, every publisher has different rules. It's appalling. I don't actually know for certain what ACS, RSC, Springer or Wiley allow me to do. Elin has a list which suggests that I might be able to archive some of my older ACS papers, etc. This is an area where I'm meant to know things, and I don't. (I've just been looking through the Springer hybrid system and I do not understand it. I literally do not know why all the articles are publicly visible, but some are Open Choice, yet Springer copyright. I would have no idea which of these can be put in an IR. Or when. Or what the re-use terms are. I may write more about this later.)

Here are some basic problems about repositing:

  • the process from starting a manuscript to final publication can take months or years
  • there are likely to be multiple authors
  • authors will appear and disappear during the process
  • manuscripts may fission or fuse.
  • authors may come from different institutions

A typical example is the manuscript we are writing on the FOO project. The project has finished. The paper has 6 authors. I do not know where one of them is. There are 2 institutions and 4 departments involved. One person has been entrusted with the management of authoring. They are unlikely to be physically here when the final paper is published. The intended publisher does not support Open Access and may or may not allow self-archiving.
We have to consider at least the following versions of the article:

  1. The manuscript submitted to the publisher (normally DOC or TeX). Note that this may not be a single version as the publisher may (a) refuse it as out of scope (b) require reformatting, etc. even before review. Moreover if after a refusal the material is submitted to a subsequent journal we must remember which manuscript is which.
  2. The publisher sends the article for review and returns reviewers comments. We incorporate these into a post-review manuscript. This process may be iterative - the journal may send the revision for further review. Eventually we get a manuscript that the journal accepts.
  3. We get a "galley proof" of the article which we need to correct. This may be substantially different from (2). Some of the alterations are useful, some are counterproductive (one publisher insists on setting computer code in Times Roman). There are no page numbers. We make corrections and send this back.
  4. At some stage the paper appears. We are not automatically notified when - some publishers notify us, some don't. We may not even be able to read it - this has happened.

By this stage the original person managing the authoring has left us, and so has one of the co-authors. Maybe at this stage we are allowed to reposit something. Possibly (1). The original manuscript. But the author has left - where did they keep the document? It's lost.

This is not an uncommon scenario - I think at DCC 2005 we were informed that 30+% of authors couldn't locate their manuscripts. Yes, I am disorganized, but so are a lot of others. It's a complex process and I need help. There are two sorts - human amanuenses and robot amanuenses. I love the former. Elin has suggested how she can help me with some of my back papers. Dorothea Salo wants to have a big bucket that everyone dumps their papers in, which she then sorts out afterwards (if I have got this right). But humans don't scale. So how can robots help?

Well, we are starting to write our papers using our own repository. Not an IR, but an SVN repository. So Nick, Joe and I will share versions of our manuscripts in the WWMM SVN repository. Joe wrote his thesis with SVN/TeX and I think Nick's doing the same. Joe thought it was a great way to do things.

The advantage of SVN is that you have a complete version history. The only disadvantage is that it's not easy to run between institutions. I am not a supporter of certificates. And remember that not all our authors are part of the higher education system. In fact Google Documents starts to look attractive (though the versioning is not as nice as SVN's).

Will it work? I don't know. Probably not 100% - we often get bitten by access permissions, forgetting where things are, etc. But it's worth a try.

And if I were funding repositories I would certainly put resource into communal authoring environments. If you do that, then it really is a one-click reposition instead of the half-day mess of trying to find the lost documents.