Unilever Centre for Molecular Informatics
 

petermr's blog

A Scientist and the Web

 


OREChem

Friday, March 27th, 2009

I will start to widen out from the library of the future and bring in chemistry and eScience. Librarians should not switch off as the topics are very relevant. Several in our group are off to Redmond – to two official meetings and other informal meet-ups. I’ll blog (or twitter/FF) about these as we go.

The first meeting is OREChem, sponsored by Lee Dirks from Microsoft External Research and PI’ed by Carl Lagoze from Cornell. Lee is part of Tony Hey’s empire in MSR and has responsibility for scholarly publication and education. There is good coherence and overlap between the projects and we are committed to these being Open.

OAI-ORE (Open Archives Initiative – Object Reuse and Exchange) is brought to you by the people that brought you OAI-PMH – Carl and Herbert. One of the tricky problems on the web is accessing a bounded set of information. For example, if you go to this blog address and download it, what do you get? I actually don’t know and I expect it’s a mess. This isn’t a new problem, and the hypermedia gurus have been active for decades – when I started SGML I spent many hours trying to understand “Bounded Object Sets” and “architectural forms”.

ORE tackles this problem in the context of research and scholarship. It can be used for anything, but the thrust is on making web resources for digital libraries, research laboratories, etc. I have the honour of being on the ORE advisory board and I’d urge you to get involved. MSR are backing ORE and as an exemplar have applied this to chemistry, in OREChem. Here we are showing how to create bounded web resources in a context of linked data. I’ll write more later, but to put a marker down we have transformed CrystalEye into RDF and will be working over the weekend to agree on the best approach to ORE-ifying it. I’ll leave you with Carl’s recent paper (The oreChem Project) …

The oreChem Project: Integrating Chemistry Scholarship with the Semantic Web

Carl Lagoze Information Science, Cornell University lagoze@cs.cornell.edu

The oreChem project, funded by Microsoft, is a collaboration¹ between chemistry scholars and information scientists to develop and deploy the infrastructure, services, and applications to enable new models for research and dissemination of scholarly materials in the chemistry community. Although the focus of the project is chemistry, the work is being undertaken with attention to general cyberinfrastructure for eScience, thereby enabling the linkages among disciplines that are required to solve today’s key scientific challenges such as global warming. A key aspect of this work, and a core aim of this project, is the design and implementation of an interoperability infrastructure that will allow chemistry scholars to share, reuse, manipulate, and enhance data that are located in repositories, databases, and Web services distributed across the network.

The foundations of this planned infrastructure are the specifications developed as part of the Open Archives Initiative‐Object Reuse and Exchange (OAI‐ORE) [9] effort. These specifications provide a data model [8] and set of serialization syntaxes [10‐12] for describing and identifying aggregations of Web resources and describing the relationships among the resources that are constituents of aggregations. The OAI‐ORE specifications are firmly grounded in the Web architecture [6] and in the principles of the semantic web [4, 7] and the Linked Data Effort [3]. The relevant connections of the OAI‐ORE specifications to mainstream Web and Semantic Web architecture include:
  • All aspects of the data model are expressed in terms of resources, representations, URIs, and triples.
  • The fundamental entity in the data model, the Aggregation, is a resource without a Representation (a “non‐document” resource). This paradigm is similar to the manner in which real‐world entities or concepts are included in the Web via the mechanisms proposed by the Linked Data Effort [3].
  • The description of an Aggregation, a Resource Map, is a separate Resource, which is accessible via the URI of the Aggregation using the mechanisms defined for Cool URIs [15].
  • The result of an HTTP access of a Resource Map URI is a serialization of the triples describing the Aggregation. This serialization may be in any of the OAI‐ORE serialization syntaxes: RDF/XML [2], RDFa [1], and Atom [14] (triples can be extracted from this via an OAI‐ORE defined GRDDL‐compliant XSLT script).
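
The list above can be made concrete with a small sketch. It builds the triples that a Resource Map might assert about one Aggregation, using the `ore:describes` and `ore:aggregates` terms from the OAI‐ORE vocabulary. The `example.org` URIs and the helper function names are hypothetical, and N-Triples is used only because it is easy to print; it is not one of the OAI‐ORE serialization syntaxes named above.

```python
# Illustrative sketch: the example.org URIs and helper names are assumptions.
# Only the ore: terms (ResourceMap, Aggregation, describes, aggregates)
# come from the OAI-ORE vocabulary.
ORE = "http://www.openarchives.org/ore/terms/"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"


def build_resource_map(rem_uri, aggregation_uri, member_uris):
    """Return the triples a Resource Map asserts about one Aggregation."""
    triples = [
        (rem_uri, RDF_TYPE, ORE + "ResourceMap"),
        (rem_uri, ORE + "describes", aggregation_uri),
        (aggregation_uri, RDF_TYPE, ORE + "Aggregation"),
    ]
    # One ore:aggregates triple per constituent resource.
    for member in member_uris:
        triples.append((aggregation_uri, ORE + "aggregates", member))
    return triples


def to_ntriples(triples):
    """Serialize URI-only triples as N-Triples lines (for display only)."""
    return "\n".join(f"<{s}> <{p}> <{o}> ." for s, p, o in triples)


def aggregated_resources(triples, aggregation_uri):
    """List the resources that the given Aggregation aggregates."""
    predicate = ORE + "aggregates"
    return [o for s, p, o in triples if s == aggregation_uri and p == predicate]


rem = build_resource_map(
    "http://example.org/rem/entry1",
    "http://example.org/rem/entry1#aggregation",
    ["http://example.org/data/entry1.cif", "http://example.org/data/entry1.cml"],
)
print(to_ntriples(rem))
```

The Aggregation here is given a hash URI, which is one of the Cool URIs patterns: dereferencing it drops the fragment and lands on the Resource Map URI. A 303 redirect would be the other pattern; which one a given repository uses is a deployment choice.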


Our initial work in the oreChem Project is the design of a graph‐based object model that specializes the core OAI‐ORE data model for the chemistry domain. This model builds on the centrality of the molecule, or chemical compound, in the record of chemistry scholarship. In the nature of a relational database key, a molecule or compound, identified in a universal manner [13], forms the central hub for linkages to other entities such as investigations, experiments, scholars, and processes related to that molecule. We are then using this model to design interfaces and APIs to exchange molecular information and their relationships among distributed repositories, services, and agents.
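
To illustrate the "molecule as hub" idea, here is a minimal sketch of such a graph. It assumes the universal identifier is something like an InChI string; the class, the relation names (`studiedIn`, `investigatedBy`), and the entity identifiers are all hypothetical and are not taken from the oreChem model itself.

```python
from collections import defaultdict


class MoleculeGraph:
    """Toy graph in which every link hangs off a molecule key (hypothetical design)."""

    def __init__(self):
        # molecule key -> relation name -> set of linked entity identifiers
        self._links = defaultdict(lambda: defaultdict(set))

    def link(self, molecule_key, relation, entity):
        """Attach an entity (experiment, scholar, ...) to a molecule."""
        self._links[molecule_key][relation].add(entity)

    def related(self, molecule_key, relation):
        """Entities linked to one molecule by the given relation."""
        return sorted(self._links.get(molecule_key, {}).get(relation, set()))

    def molecules_for(self, relation, entity):
        """Molecules that carry the given link to an entity."""
        return sorted(
            key for key, rels in self._links.items()
            if entity in rels.get(relation, set())
        )


# Assumed identifier scheme: standard InChI strings for water and methane.
WATER = "InChI=1S/H2O/h1H2"
METHANE = "InChI=1S/CH4/h1H4"

g = MoleculeGraph()
g.link(WATER, "studiedIn", "experiment:42")
g.link(WATER, "investigatedBy", "scholar:alice")
g.link(METHANE, "investigatedBy", "scholar:alice")

print(g.related(WATER, "studiedIn"))
print(g.molecules_for("investigatedBy", "scholar:alice"))
```

Because the molecule identifier plays the role of the key, two repositories that agree on it can merge their links without agreeing on anything else.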

We are demonstrating this infrastructure by adapting a number of existing chemistry data repositories² to the APIs and models. We are also further populating these repositories by developing and refining automated techniques for retrospectively extracting chemical information and interlinking chemical data from existing chemistry research corpora.

Following this we will develop and deploy a number of tools, such as chemical structure searching, over the repositories that have been adapted to the infrastructure. In the latter stages of the project, we will extend the retrospective data extraction techniques with active “in the lab” capture of chemistry data, and the addition of that “in‐process” data to the knowledge network defined by the infrastructure data model.

Ultimately, we envision that this common data model, interchange protocols, and suite of data extraction and data capture tools will enable an eChemistry Web – a semantic graph with embedded subgraphs representing molecules which are then interrelated to publications that refer to them, experiments that work with them, the context of these experiments, the researchers working with these molecules, annotations about publications and experiments, and the like. A particularly interesting aspect of this semantic graph is the manner in which it mixes data, publication artifacts, and people – providing an information‐rich social network built around the notion of object‐centered sociality [5]. In the latter phases of the project we hope to build innovative analysis tools that will extract new “scientometric” information and knowledge from the eChemistry Web.
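
As a rough illustration of what object‐centered sociality could look like in such a graph, the sketch below derives researcher‐to‐researcher links from the molecules they have in common. The input data, the identifiers, and the function name are invented for the example; this is not one of the project's analysis tools.

```python
from collections import Counter
from itertools import combinations


def shared_molecule_counts(works_with):
    """Count, for each pair of researchers, how many molecules both work with.

    works_with is an iterable of (researcher, molecule) links.
    """
    by_molecule = {}
    for researcher, molecule in works_with:
        by_molecule.setdefault(molecule, set()).add(researcher)

    pairs = Counter()
    for researchers in by_molecule.values():
        # Sorting makes each pair appear in one canonical order.
        for a, b in combinations(sorted(researchers), 2):
            pairs[(a, b)] += 1
    return pairs


# Hypothetical links between researchers and molecules.
links = [
    ("alice", "mol:1"),
    ("bob", "mol:1"),
    ("alice", "mol:2"),
    ("bob", "mol:2"),
    ("carol", "mol:2"),
]
counts = shared_molecule_counts(links)
for pair, n in sorted(counts.items()):
    print(pair, n)
```

The people are never linked directly; the social network falls out of the objects (molecules) they share, which is the sense in which the network is object‐centered.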

Our work in the oreChem Project and, in particular, our design of the interoperability infrastructure, is being undertaken with the recognition that chemistry, like any scholarly discipline, is not an island, but has complex linkages to scholarship in other disciplines and into related activities such as education, and in fact to the general network‐based information environment. By basing our work on OAI‐ORE, we hope that the interoperability paradigm designed for oreChem will coexist with similar work in other disciplines and in fact with the general Web information space and its ubiquitous search tools, services, and applications.

1 Collaborators in the oreChem Project are University of Cambridge (Peter Murray‐Rust, Jim Downing), Cornell University (Carl Lagoze, Theresa Velden), Indiana University (Geoffrey Fox, Marlon Pierce), Penn State University (C. Lee Giles, Prasenjit Mitra, Karl Mueller), PubChem (Steve Bryant), and University of Southampton (Jeremy Frey, Simon Coles).

2 These repositories include CrystalEye, 100,000 molecules and 100,000 fragments from crystal structures with full crystallographic details and with 3D coordinates; SPECTRaT, open theses with molecules; Pub3D, MMFF94‐optimized 3D structures for PubChem compounds; ChemXSeer, an integrated digital library and database allowing for intelligent search of documents in the chemistry domain and data obtained from chemical kinetics; eCrystals, high level crystal structures and processed x‐ray diffraction data; and R4L, experimental spectroscopic and analytical chemical data.