Constance Wiebrands, who has invited me to talk at ECU tomorrow, asks on this blog (see /pmr/2012/11/14/openaccess-opendata-reclaiming-our-scholarship-ii-do-we-undervalue-it/ and previous):
It is probably more useful for [Librarians] to think about future steps and what libraries ought to be doing to support our academic colleagues. Have we (librarians) got the will to work on a large scale project to create and control an independent metadata record?
I hope we can have a general discussion about that tomorrow. But what I want to do now is to show the value, and also to show that it is highly affordable. I'm an idealist and optimist, so this colours the discussion. (However, I have launched several successful software/informatics memes on the web, some succeeding in weeks, some taking decades, and I have watched others, so I know that a great deal is possible.)
First, why? After all, there are professional indexers of scholarly publication and their products cost a lot, so surely we should choose them?

PMR: No. The rapid change of modern technology means that products can become obsolete quickly.

OK, so that's what Google has done with Google Scholar and Microsoft with Microsoft Academic Search. And they are free.

PMR: Yes. But they decide what to include and what not, in both breadth and depth. There's also no contract that says they will continue to exist. They may, they may not. I have seen umpteen free services disappear or become commercialized. So the major benefits are:
- We know precisely what the basis is. We can inspect its quality.
- We can re-use the metadata for whatever purpose we like, without permission. We can use it to control assessment exercises, treat it as a definitive bibliography for citations, or annotate it for a whole host of purposes (is it Open? does it contain certain types of material, e.g. data, arguments?). Current awareness, reading lists, etc.
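Because we would own the records outright, annotation becomes a matter of a few lines of code. A minimal sketch, assuming simple dict-based records loosely in the spirit of BibJSON; the `tags` field and the openness rule are my own illustrative inventions, not any existing schema:

```python
# A sketch of annotating bibliographic records that we control ourselves.
# Field names ("license", "keywords", "tags") and the is_open() rule are
# assumptions for illustration, not a real standard.

def is_open(record):
    """Guess openness from the license field (crude illustrative rule)."""
    return record.get("license", "").lower().startswith("cc")

def annotate(records):
    """Add our own annotations, possible only because we own the metadata."""
    for r in records:
        tags = r.setdefault("tags", [])
        if is_open(r):
            tags.append("open")
        if "data" in r.get("keywords", []):
            tags.append("contains-data")
    return records

records = [
    {"title": "A", "license": "CC-BY", "keywords": ["data"]},
    {"title": "B", "license": "all-rights-reserved", "keywords": []},
]
annotated = annotate(records)
print([r["tags"] for r in annotated])  # [['open', 'contains-data'], []]
```

The point is not the particular rules but that *we* choose them; a commercial indexer's annotations are whatever the vendor decides to expose.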
Assuming that we do want to control our data, isn’t it going to be very expensive? PMR: Technically, no. Indexing from the web is now very straightforward. There are already several collections of web-collected metadata in academic institutions. And I’ll show you tomorrow some of the things we are doing. (Crystaleye, Bibsoup, etc.)
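To give a feel for how straightforward web indexing has become: many publisher pages already embed `citation_*` meta tags (the same tags Google Scholar reads), and the Python standard library can pull them out in a few lines. A sketch under those assumptions; the sample HTML and its DOI are invented, and in practice the page would come from a crawl rather than a literal string:

```python
from html.parser import HTMLParser

class CitationMetaParser(HTMLParser):
    """Collect <meta name="citation_*" content="..."> tags from a page."""
    def __init__(self):
        super().__init__()
        self.metadata = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        d = dict(attrs)
        name = d.get("name", "")
        if name.startswith("citation_"):
            self.metadata[name] = d.get("content", "")

# Invented sample page; a real one would be fetched during a crawl.
html = """
<html><head>
<meta name="citation_title" content="Open Bibliography for STM">
<meta name="citation_doi" content="10.0000/example.doi">
</head></html>
"""
parser = CitationMetaParser()
parser.feed(html)
print(parser.metadata)
```

No commercial infrastructure is needed for this step; the scale problems are in coverage and politeness of crawling, not in the parsing.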
Yes, but it isn't really our job to be doing this.
Um… if libraries are going to continue (and I hope they are) they have to be about *creating* information and meta-information. Buying and distributing will be done in the cloud or wherever. Amazon can distribute eBooks and could do the same for journals. Institutions have to be in control of their information.
PMR: In the information age small formal resources can create huge results. The factors that drive this include:
- High quality and ever-increasing Open technology. Open is being increasingly adopted by governments for both software and data
- Involvement of citizens (“crowdsourcing”). The prime example is, of course, Wikipedia. (Later I’ll mention Openstreetmap)
The Wikimedia Foundation's own figures (from its annual plan) give a sense of the scale:

> The 2010-11 plan called for us to increase revenue 28% from 2009-10, to $20.4 million, and to increase spending 124% from 2009-10, to $20.4 million. In fact, we significantly over-achieved from a revenue perspective, and we also underspent, resulting in a larger reserve than planned. We're projecting today that 2010-11 revenue will have actually increased 49% from 2009-10 actuals, to $23.8 million. Spending is projected to have increased 103% from 2009-10 actuals, to $18.5 million. This means we added $5.3 million to the reserve, for a projected end-of-year total of $19.5 million which represents 8.3 months of reserves at the 2011-12 spending level.
Let’s say 20 million USD. That’s a minute sum for academia as a whole (Univ College London alone pays Elsevier about 2 million dollars in subscriptions). So it’s among the best scholarly value anywhere. And the reason is, of course, that it has found a new way of doing things where the citizenry are an integral part. (Do universities give citizens a real stake in them? They should if they wish to retain public support).
But it also shows the value of concerted action. The publishing tools and the information infrastructure that WP has developed are far better than what the STM publishers use. Versioning and annotation are absent in traditional scholarly publishing, not because they aren't valuable but because they're too much effort for the publishers. And because the publishers don't have the goodwill of the community in helping add them.
So why isn't academia working with Wikipedia? It's a natural partnership. (There are some instances, such as in the biosciences, where data in databases is being actively linked, but there could be much more.) And if Wikipedia is using academic output (e.g. referring to papers) it's creating one of the largest and best reference lists in the world. We've gently explored whether they could use our Open Bibliography specifications.
But the alternative (often complementary) approach is automation. It’s now easy to crawl the web. It’s increasingly possible to extract content from what is retrieved. And to do this in a domain-specific manner. Now Google and Microsoft aren’t interested in science – it’s too small a market (they *are* interested in health). So no-one will index science for us for free. The severe danger is that someone will index the science and then sell it back to us.
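Domain-specific extraction from crawled text is often just pattern matching. A minimal sketch that pulls DOI-like identifiers out of free text; the sample sentence and its DOIs are invented, and a production pattern would need more care with trailing punctuation:

```python
import re

# DOIs start "10.", then a registrant prefix, a slash, and a suffix.
# This simplified pattern is illustrative, not exhaustive.
DOI_PATTERN = re.compile(r'\b10\.\d{4,9}/\S+\b')

def extract_dois(text):
    """Return all DOI-like strings found in a blob of crawled text."""
    return DOI_PATTERN.findall(text)

sample = ("See doi:10.1000/xyz123 and also 10.5555/demo.42 "
          "for the crystal structures discussed.")
print(extract_dois(sample))  # ['10.1000/xyz123', '10.5555/demo.42']
```

The same approach generalizes: swap the pattern (or a proper parser) for chemical names, species names, or crystallographic identifiers, and you have the beginnings of a domain-specific index.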
Here are some questions that we ought to be able to answer:
- Find all articles which discuss antimalarials and retrieve the chemical reactions used to make them.
- Find organic crystal structures that have been tested for second harmonic generation.
- Find phylogenetic trees which contain diptera species.
All of these are within our current technology.
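Once articles have been mined into structured records, queries like the first one above reduce to simple filters. A toy sketch over invented records; in a real system the records, fields, and reaction strings would come from crawling and extraction, not be written by hand:

```python
# Toy corpus: invented records standing in for mined article metadata.
articles = [
    {"title": "Synthesis of novel antimalarials",
     "abstract": "We describe antimalarial candidates ...",
     "reactions": ["quinoline + amine -> aminoquinoline"]},
    {"title": "Diptera phylogenetics revisited",
     "abstract": "A phylogenetic tree of diptera ...",
     "reactions": []},
]

def reactions_for_topic(records, topic):
    """Find articles discussing a topic and return their chemical reactions."""
    hits = [r for r in records
            if topic in r["title"].lower() or topic in r["abstract"].lower()]
    return [rx for r in hits for rx in r["reactions"]]

print(reactions_for_topic(articles, "antimalarial"))
# ['quinoline + amine -> aminoquinoline']
```

The hard part is populating the corpus, not querying it; that is exactly why the legal right to mine content matters.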
The major problem is that we may be challenged on the legal right to do this. Publishers want to restrict content mining. As long as we acquiesce in what publishers want we will get what publishers want us to have.
So where is the resource going to come from? Ultimately it has to come from reshaping the market. There is a huge amount of wasted money at present, and the more that academia takes control of the market, the more that costs are decided by academia.
But there are also the dynamics of the Net / Web 2.0. Many projects have made great impact with a committed human driving them, a free webserver, and a growing volunteer community. Examples include:
- Openstreetmap, whose volunteers have mapped much of the world, often at sub-meter resolution.
- GalaxyZoo, which classified a million galaxies in a year with hundreds of thousands of volunteers.
And our own efforts, where Nick Day's Crystaleye has indexed all public crystal structures in #scholpub on the web (ca. 200,000).
So you will need a champion. It may be that much of this can be done with resource already in the system. For example:
- Students on LIS courses should absolutely know this stuff, and the best way to learn is by doing.
- MSc students in engineering and CompSci need projects, and this is an ideal one.
- Undergraduate students can be financed, say, through Google Summer of Code.
And who knows – there are huge numbers of citizens interested in books and journals. I am sure some of them would be interested in volunteering. And these ideas are just the start.