Many of the projects we are involved in and interact with are about systematising metadata for scientific and other scholarly applications. There are several sorts of metadata; I include at least rights, provenance, semantics/format, and discovery. I’ll go along with the first three – I need to know what I can do with something, where it came from, and how to read and interpret it. But do we really need discovery metadata?
Until recently this has been assumed almost as an axiom – if we annotate a digital object with domain-specific knowledge, then it should be easier to find and there should be fewer false positives. If I need a thesis on the synthesis of terpenes then surely it should help if it is labelled “chemistry”, “synthesis” and “terpenes”. And it does.
But there are several downsides:
- It’s very difficult to agree on how to structure metadata, mainly because everyone has a (valid) opinion and no-one can quite agree with anyone else. So the mechanism involves either interminable committees in “smoke-filled rooms” (except without the smoke) or self-appointed experts who make it up themselves. The first is valuable if we need precise definitions and possibly controlled vocabularies, but it is not normally designed for discovery. The second – as is happening all over the world – leads to collisions and even conflict.
- Authors won’t comply. They either leave the metadata fields blank, make something up to get it over with, or simply abandon the operation altogether.
- It’s extremely expensive. If a domain expert is required to reposit a document, it doesn’t scale.
So is it really necessary? If I have a thesis I can tell without metadata, just by looking, whether it’s about chemistry (whatever language it’s in), whether it describes synthesis and whether it contains terpenes. And so can Google. I just type “terpene synthesis” and everything on the first page is about terpene synthesis.
The point is that indexing full text (or the full datument 🙂 ) is normally sufficient for most of our content discovery. Peter Corbett has deployed Lucene – a free-text indexer – and done some clever things with chemistry and chemical compounds. That means his engine is now geared up to discover chemistry on the Web from its content. I’ll speculate that it’s more powerful than the existing chemical metadata…
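To make this concrete, here is a minimal sketch of plain full-text indexing with Lucene (my illustration only, assuming a reasonably recent Lucene release; the index path, field names and sample text are all invented, and none of Peter’s chemistry-aware processing is included):

```java
// Minimal full-text index-and-search with Apache Lucene.
// Sketch only: field names ("id", "content") and the index path are illustrative.
import java.nio.file.Paths;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class FullTextDemo {
    public static void main(String[] args) throws Exception {
        Directory dir = FSDirectory.open(Paths.get("thesis-index"));
        StandardAnalyzer analyzer = new StandardAnalyzer();

        // Index the full text of a document -- no discovery metadata at all.
        try (IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(analyzer))) {
            Document doc = new Document();
            doc.add(new StringField("id", "thesis-42", Field.Store.YES));
            doc.add(new TextField("content",
                "This thesis describes new routes for terpene synthesis ...",
                Field.Store.NO));
            writer.addDocument(doc);
        }

        // Search exactly as a user would type it.
        try (DirectoryReader reader = DirectoryReader.open(dir)) {
            IndexSearcher searcher = new IndexSearcher(reader);
            Query query = new QueryParser("content", analyzer).parse("terpene synthesis");
            for (ScoreDoc hit : searcher.search(query, 10).scoreDocs) {
                System.out.println(searcher.doc(hit.doc).get("id"));
            }
        }
    }
}
```

No “chemistry” keyword, no controlled vocabulary – the words in the text are the discovery metadata.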
So it was great to see this on Peter Suber’s blog (I can’t stop blogging it!):
Full-text cross-archive search from OpenDOAR
OpenDOAR has created a Google Custom Search engine for the 800+ open-access repositories in its directory. From today’s announcement: OpenDOAR – the Directory of Open Access Repositories – is pleased to announce the release of a trial search service for academics and researchers around the world….
OpenDOAR already provides a global Directory of freely available open access repositories that hold research material: now it also offers a full-text search service from this list of quality-controlled repositories. This trial service has been made possible through the recent launch by Google of its innovative and exciting Custom Search Engine, which allows OpenDOAR to define a search service based on the Directory holdings.
It is well known that a simple full-text search of the whole web will turn up thousands upon thousands of junk results, with the valuable nuggets of information often being lost in the sheer number of results. Users of the OpenDOAR service can search through the world’s open access repositories of freely available research information, with the assurance that each of these repositories has been assessed by OpenDOAR staff as being of academic value. This quality controlled approach will help to minimise spurious or junk results and lead more directly to useful and relevant information. The repositories listed by OpenDOAR have been assessed for their full-text holdings, helping to ensure that results have come from academic repositories with open access to their contents.
This service does not use the OAI-PMH protocol to underpin the search, or use the metadata held within repositories. Instead, it relies on Google’s indexes, which in turn rely on repositories being suitably structured and configured for the Googlebot web crawler. Part of OpenDOAR’s work is to help repository administrators improve access to and use of their repositories’ holdings: advice about making a repository suitable for crawling by Google is given on the site. This service is designed as a simple and basic full-text search and is intended to complement and not compete with the many value-added search services currently under development.
A key feature of OpenDOAR is that all of the repositories we list have been visited by project staff, tested and assessed by hand. We currently decline about a quarter of candidate sites as being broken, empty, out of scope, etc. This gives a far higher quality assurance to the listings we hold than results gathered by just automatic harvesting. OpenDOAR has now surveyed over 1,100 repositories, producing a classified Directory of over 800 freely available archives of academic information.

Comment. This is a brilliant use of the new Google technology. When searching for research on deposit in OA repositories, it’s better than straight Google, by eliminating false positives – though straight Google is better if you want to find OA content outside repositories at publisher or personal websites. It’s potentially better than OAIster and other OAI-based search engines, by going beyond metadata to full text – though not all OA repositories are configured to facilitate Google crawling. If Google isn’t crawling your repository, consult OpenDOAR or try these suggestions.
I agree! Let’s simply archive our full texts and full datuments. We’ll never be able to add metadata by human means – let the machines do it. And this has enormous benefits for subjects like chemistry – Peter’s OSCAR3 can add chemical metadata automatically in great detail.
So now all we want is chemistry theses… and chemistry papers … in those repositories. You know what you have to do…
Well, err, yes and no. Certainly Google is well known for ignoring keywords put in the meta tags at the top of webpages in favour of indexing the text itself – and good semantic markup to turn a document into a datument will give the indexing algorithm plenty to work on. In short, losing the distinction between discovery metadata and semantic data won’t hurt; the sketch below shows why the markup costs nothing at indexing time.
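A toy illustration (mine alone – the `<compound>` element and its attribute are hypothetical stand-ins for whatever a real datument’s chemical markup looks like): strip the tags and a plain full-text indexer sees exactly the same words, while a chemistry-aware engine can still read the structure out of the markup.

```java
// Toy illustration: inline semantic markup is transparent to a plain
// full-text indexer. The <compound> element is a hypothetical stand-in
// for real datument markup.
public class MarkupDemo {
    public static void main(String[] args) {
        String datument =
            "We report the <compound smiles=\"CC1=CCC(CC1)C(=C)C\">limonene</compound>"
            + " synthesis described in Chapter 3.";

        // A naive indexer just drops the tags and tokenises what is left...
        String plainText = datument.replaceAll("<[^>]+>", "");
        System.out.println(plainText);
        // -> "We report the limonene synthesis described in Chapter 3."

        // ...while a semantic engine can still pull the structure from the markup.
    }
}
```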
However, natural language processing won’t get you 100%. In fact, you’ll be lucky to get near 90% in many circumstances. That is often better than nothing, but if you can involve humans in hybrid annotation you’ll get better results, especially if one of those humans is the author – something like the workflow sketched below.
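The hybrid workflow is easy to sketch (again my own toy, with an invented confidence threshold and made-up scores; a real annotator such as OSCAR3 will differ): auto-accept what the machine is sure about and queue the rest for a human, ideally the author, to confirm.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical hybrid-annotation workflow: machine proposals above a
// confidence threshold are accepted automatically; the rest go to a human.
public class HybridAnnotation {
    static final double AUTO_ACCEPT = 0.95; // invented threshold

    public static void main(String[] args) {
        // Stand-in for whatever the NLP annotator returns: term -> confidence.
        // These numbers are made up.
        Map<String, Double> proposed = new LinkedHashMap<>();
        proposed.put("terpene", 0.99);
        proposed.put("synthesis", 0.97);
        proposed.put("pinene", 0.62); // ambiguous -- needs a human

        List<String> accepted = new ArrayList<>();
        List<String> forReview = new ArrayList<>();
        proposed.forEach((term, conf) ->
            (conf >= AUTO_ACCEPT ? accepted : forReview).add(term));

        System.out.println("Auto-accepted: " + accepted);
        System.out.println("Ask the author about: " + forReview);
    }
}
```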
Also, Google has another source of metadata: other people’s link text. For example, consider the famous googlebomb “miserable failure”. Apparently “these results may seem politically slanted” – the googleblog tells all.
I suspect that at least part of the appeal of discovery metadata comes from the closed-access (and ultimately dead-tree) publishing world. If you can’t let the index have the full text then you’ve got to pull pertinent things out of the text to give the index something to go on. Open access on the web doesn’t have this problem.
(1) Of course you are right, Peter. But, as you say, 90% is better than 0%. And 0% is what a lot of activities currently get.
I have no idea what percentage of DOAR documents have links. If they are HTML they have a chance of being useful (cows); if they are PDF they probably have no links (hamburgers). Maybe someone from SHERPA can tell us. If they *do* have links, then we get a real chance to build a scholarly linkbase, and I am sure there is Open software which can do some exciting things with this – see the sketch below. It’s a different problem from general Google, as there are (I hope) no academic spammers of repositories!!
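For illustration, a standard-library-only sketch of the harvesting step (the URLs are hypothetical; crawling, politeness and relative-URL resolution are all glossed over): pull the href targets out of a repository page and record (source, target) pairs for the linkbase.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Sketch of harvesting (source, target) link pairs from repository HTML
// for a scholarly linkbase. A regex is a blunt instrument for HTML --
// a real harvester would use a proper parser -- but it shows the idea.
public class LinkbaseSketch {
    private static final Pattern HREF =
        Pattern.compile("href\\s*=\\s*\"([^\"]+)\"", Pattern.CASE_INSENSITIVE);

    // Returns "source -> target" records for every link found in the page.
    static List<String> extractLinks(String sourceUrl, String html) {
        List<String> records = new ArrayList<>();
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            records.add(sourceUrl + " -> " + m.group(1));
        }
        return records;
    }

    public static void main(String[] args) {
        String page = "<p>See <a href=\"http://example.org/thesis/42\">this thesis</a></p>";
        extractLinks("http://repository.example.ac.uk/item/7", page)
            .forEach(System.out::println);
    }
}
```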