Richard Van Noorden – writing in the RSC’s Chemistry World – has described the eChemistry repository project in “Microsoft ventures into open access chemistry”. This is very topical as Jim Downing, Jeremy Frey, Simon Coles and I are off to join the US members of the project at the weekend. It’s exciting, challenging, but eminently feasible. So what are the new ideas?
The main theme is repositories. Rather a fuzzy term and therefore valuable as a welcoming and comforting idea. Some of the things that repositories should encourage are:
- ease of putting things in. It doesn’t require a priesthood (as so many relational databases do). You should be able to put in a wide range of things – theses, molecules, spectra, blogs, etc. You shouldn’t have to worry about datatypes, VARCHARs, third normal form, etc.
- it should also be easy to get things out. That means a simple, understandable structure to the repository, and being able to find the vocabulary used to describe the objects.
- flexibility. Web 2.0 teaches us that people will do things in different ways. Should a spectrum contain a molecule or should a molecule contain a spectrum? Some say one, some the other. So we have to support both. Sometimes required information is not available, so it must be omitted and that shouldn’t break the system.
- interoperability. If there are several repositories built by independent groups it should be possible for one lot to find out what the others have done without mailing them. And the machines should be able to work this out. That’s hard but not impossible.
- avoid preplanning. RDBs suffer from having to have a schema before you put data in. Repositories can describe a basic minimum and then we can work out later how to ingest or extract.
- power is more important than performance (at least for me). I’d rather take many minutes to find something difficult than not be able to do it. When I started on relational databases for molecules it took all night to do a simple join. So everything is relative…
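To make the flexibility and no-preplanning points concrete, here is a minimal sketch in Python of a schema-free store where every fact is a (subject, predicate, object) triple. It is only an illustration, not the project's design: the identifiers (`mol:1`, `spec:7`, `ex:hasSpectrum`, `ex:isSpectrumOf`, `ex:meltingPoint`) and the helper functions are all invented for the example.

```python
# A toy schema-free store: each fact is a (subject, predicate, object) triple.
# All identifiers and predicate names below are hypothetical.

triples = set()


def add(s, p, o):
    """Deposit one fact; no schema has to exist beforehand."""
    triples.add((s, p, o))


def match(s=None, p=None, o=None):
    """Return all triples matching the pattern; None acts as a wildcard."""
    return sorted(
        t for t in triples
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    )


# One depositor says "a molecule has a spectrum" ...
add("mol:1", "ex:hasSpectrum", "spec:7")
# ... another says "a spectrum belongs to a molecule".
add("spec:8", "ex:isSpectrumOf", "mol:1")
# Missing information (e.g. a melting point) is simply never asserted.
add("mol:1", "ex:name", "caffeine")


def spectra_of(mol):
    """Collect spectra for a molecule whichever way round they were deposited."""
    forward = {o for (_, _, o) in match(s=mol, p="ex:hasSpectrum")}
    backward = {s for (s, _, _) in match(p="ex:isSpectrumOf", o=mol)}
    return sorted(forward | backward)


print(spectra_of("mol:1"))
print(match(s="mol:1", p="ex:meltingPoint"))
```

Both spectra come back for `mol:1` regardless of which depositor's convention was used, and asking for the absent melting point returns an empty list rather than an error. The price is that the query has to know about both predicates, which is where a findable vocabulary comes in.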
The core of the project is ORE – Object Re-use and Exchange (ORE Specification and User Guide). A lot of work has gone into this and it’s been implemented at alpha, so we know it works. ORE is quite a meaty spec, but Jim understands it. Basically the repositories can be described in RDF, and some subgraphs (or additional ones) are “named graphs” (e.g. Named Graphs / Semantic Web Interest Group) which are used to describe the subsets of data that you may be interested in. There are quite strong constraints on naming conventions and you need to be well up with basic RDF. But then we can expect the power of the triple stores to start retrieving information in a flexible way. (As an example, Andrew Walkingshaw has extracted 10 million triples from CrystalEye and shown that these can be rapidly searched for bibliographic and other info.) Adding chemistry will be more challenging and I’m not sure how this integrates with RDF – but this is a research project. Maybe we’ll precompute a number of indexes. And, in principle, RDF can be used to search substructures but I suspect it will be a little slow to start with.
But maybe not… In which case we shall have made a very useful transition.
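For readers new to named graphs, the idea can be sketched in a few lines of Python: each triple lives in a graph that has its own name, so a query can range over everything or be confined to one named subset. This is a toy under stated assumptions, not the ORE data model or any real triple store API; the graph names, identifiers and the `add`/`query` helpers are hypothetical.

```python
from collections import defaultdict

# Toy "named graph" store: graph name -> set of (subject, predicate, object).
# Graph names and identifiers are invented for illustration only.
store = defaultdict(set)


def add(graph, s, p, o):
    """Assert a triple inside the named graph `graph`."""
    store[graph].add((s, p, o))


def query(s=None, p=None, o=None, graph=None):
    """Match a triple pattern (None is a wildcard), optionally in one graph."""
    graphs = [graph] if graph is not None else list(store)
    hits = []
    for g in graphs:
        # .get() avoids creating an empty graph for an unknown name
        for triple in store.get(g, ()):
            if all(want is None or want == got
                   for want, got in zip((s, p, o), triple)):
                hits.append((g,) + triple)
    return sorted(hits)


add("ex:graph/entry-1", "ex:entry-1", "ex:hasMolecule", "ex:mol-1")
add("ex:graph/entry-1", "ex:entry-1", "dc:creator", "A. Depositor")
add("ex:graph/entry-2", "ex:entry-2", "ex:hasMolecule", "ex:mol-2")
add("ex:graph/entry-2", "ex:entry-2", "dc:creator", "B. Depositor")

# Search across the whole store ...
everyone = query(p="dc:creator")
# ... or pull out just the subset named by one graph.
entry_1 = query(graph="ex:graph/entry-1")

print(everyone)
print(entry_1)
```

A real triple store would index these patterns rather than scan every triple, which is presumably why the bibliographic searches are fast; substructure search has no such ready-made index, hence the thought of precomputing some.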