OREChem

I will start to widen out from the library of the future and bring in chemistry and eScience. Librarians should not switch off, as the topics are very relevant. Several in our group are off to Redmond – to two official meetings and other informal meet-ups. I’ll blog (or twitter/FF) about these as we go.

The first meeting is OREChem, sponsored by Lee Dirks from Microsoft External Research and PI’ed by Carl Lagoze from Cornell. Lee is part of Tony Hey’s empire in MSR and has responsibility for scholarly publication and education. There is a good coherence and overlap between the projects and we are committed to these being Open.

OAI-ORE (Open Archives Initiative – Object Reuse and Exchange) is brought to you by the people that brought you OAI-PMH – Carl and Herbert. One of the tricky problems on the web is being able to access a bounded set of information. For example, if you go to this blog address and download it, what do you get? I actually don’t know and I expect it’s a mess. This isn’t a new problem, and the hypermedia gurus have been active for decades – when I started with SGML I spent many hours trying to understand “Bounded Object Sets” and “architectural forms”.

ORE tackles this problem in the context of research and scholarship. It can be used for anything, but the thrust is on making web resources for digital libraries, research laboratories, etc. I have the honour of being on the ORE advisory board and I’d urge you to get involved. MSR are backing ORE and as an exemplar have applied this to chemistry, in OREChem. Here we are showing how to create bounded web resources in a context of linked data. I’ll write more later, but to put a marker down we have transformed CrystalEye into RDF and will be working over the weekend to agree what the best approach to ORE-ifying it is. I’ll leave you with Carl’s recent paper (The oreChem Project) …

The oreChem Project:
Integrating Chemistry Scholarship with the Semantic Web

Carl Lagoze
Information Science, Cornell University
lagoze@cs.cornell.edu

The oreChem project, funded by Microsoft, is a collaboration¹ between chemistry scholars
and information scientists to develop and deploy the infrastructure, services, and
applications to enable new models for research and dissemination of scholarly materials in
the chemistry community. Although the focus of the project is chemistry, the work is being
undertaken with attention to general cyberinfrastructure for eScience, thereby enabling
the linkages among disciplines that are required to solve today’s key scientific challenges
such as global warming. A key aspect of this work, and a core aim of this project, is the
design and implementation of an interoperability infrastructure that will allow chemistry
scholars to share, reuse, manipulate, and enhance data that are located in repositories,
databases, and Web services distributed across the network.

The foundations of this planned infrastructure are the specifications developed as part of
the Open Archives Initiative‐Object Reuse and Exchange (OAI‐ORE) [9] effort. These
specifications provide a data model [8] and set of serialization syntaxes [10‐12] for
describing and identifying aggregations of Web resources and describing the relationships
among the resources that are constituents of aggregations. The OAI‐ORE specifications are
firmly grounded in the Web architecture [6] and in the principles of the semantic web [4, 7]
and the Linked Data Effort [3]. The relevant connections of the OAI‐ORE specifications to
mainstream Web and Semantic Web architecture include:

  • All aspects of the data model are expressed in terms of resources, representations, URIs,
    and triples.
  • The fundamental entity in the data model, the Aggregation, is a resource without a
    Representation (a “non‐document” resource). This paradigm is similar to the
    manner in which real‐world entities or concepts are included in the Web via the
    mechanisms proposed by the Linked Data Effort [3].
  • The description of an Aggregation, a Resource Map, is a separate Resource, which is
    accessible via the URI of the Aggregation using the mechanisms defined for Cool
    URIs [15].
  • The result of an HTTP access of a Resource Map URI is a serialization of the triples
    describing the Aggregation. This serialization may be in any of the OAI‐ORE
    serialization syntaxes: RDF/XML [2], RDFa [1], and Atom [14] (triples can be
    extracted from this via an OAI‐ORE defined GRDDL‐compliant XSLT script).
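
To make the bullet points above concrete, here is a minimal sketch of the triples a Resource Map might assert about an Aggregation. The ore: terms (Aggregation, ResourceMap, describes, aggregates) come from the OAI‐ORE vocabulary; the example.org URIs and the two function names are placeholders invented for illustration, not oreChem resources or an official API. The output is written as N‐Triples purely for readability, which is not one of the three OAI‐ORE serialization syntaxes listed above.

```python
# Sketch only: all example.org URIs are hypothetical placeholders.
ORE = "http://www.openarchives.org/ore/terms/"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"


def describe_aggregation(aggregation_uri, resource_map_uri, member_uris):
    """Return the (subject, predicate, object) triples a Resource Map asserts."""
    triples = [
        (resource_map_uri, RDF_TYPE, ORE + "ResourceMap"),
        (resource_map_uri, ORE + "describes", aggregation_uri),
        (aggregation_uri, RDF_TYPE, ORE + "Aggregation"),
    ]
    # One ore:aggregates triple per constituent resource.
    triples += [(aggregation_uri, ORE + "aggregates", m) for m in member_uris]
    return triples


def to_ntriples(triples):
    """Serialize URI-only triples as N-Triples, one statement per line."""
    return "\n".join(f"<{s}> <{p}> <{o}> ." for s, p, o in triples)


# Hash-URI pattern from Cool URIs: dropping the fragment from the
# Aggregation URI yields the Resource Map URI.
REM = "http://example.org/rem/42"
AGG = REM + "#aggregation"

triples = describe_aggregation(
    AGG,
    REM,
    ["http://example.org/data/42.cif", "http://example.org/data/42.cml"],
)
print(to_ntriples(triples))
```

The Aggregation itself has no representation here; everything a client learns about it comes from the triples served at the Resource Map URI.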

Our initial work in the oreChem Project is the design of a graph‐based object model that
specializes the core OAI‐ORE data model for the chemistry domain. This model builds on
the centrality of the molecule, or chemical compound, in the record of chemistry
scholarship. In the nature of a relational database key, a molecule or compound, identified
in a universal manner [13], forms the central hub for linkages to other entities such as
investigations, experiments, scholars, and processes related to that molecule. We are then
using this model to design interfaces and APIs to exchange molecular information and their
relationships among distributed repositories, services, and agents.
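
The "molecule as hub" idea can be sketched as a small labelled graph. This is an illustration under assumptions, not the oreChem object model: the class name, the relation names, and the entity labels are all hypothetical, and using an InChI string as the universal identifier is my assumption (the text only says the molecule is "identified in a universal manner").

```python
from collections import defaultdict


class ChemGraph:
    """Toy graph: a molecule identifier is the hub, like a database key."""

    def __init__(self):
        # molecule id -> set of (relation, entity) pairs
        self._links = defaultdict(set)

    def link(self, molecule_id, relation, entity):
        """Attach an entity (experiment, scholar, ...) to a molecule."""
        self._links[molecule_id].add((relation, entity))

    def related(self, molecule_id, relation=None):
        """Entities linked to a molecule, optionally filtered by relation."""
        pairs = self._links.get(molecule_id, ())
        return sorted(e for r, e in pairs if relation in (None, r))


# Assumption: the InChI string for methane serves as the universal key.
METHANE = "InChI=1S/CH4/h1H4"

graph = ChemGraph()
graph.link(METHANE, "studiedIn", "experiment:hypothetical-run-7")
graph.link(METHANE, "investigatedBy", "scholar:hypothetical-researcher")
graph.link(METHANE, "describedIn", "publication:hypothetical-thesis-12")

print(graph.related(METHANE, "studiedIn"))
print(graph.related(METHANE))
```

Because every link hangs off the same identifier, two repositories that agree on the identifier can merge their links without agreeing on anything else.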

We are demonstrating this infrastructure by adapting a number of existing chemistry data
repositories² to the APIs and models. We are also further populating these repositories by
developing and refining automated techniques for retrospectively extracting chemical
information and interlinking chemical data from existing chemistry research corpora.

Following this we will develop and deploy a number of tools, such as chemical structure
searching, over the repositories that have been adapted to the infrastructure. In the latter
stages of the project, we will extend the retrospective data extraction techniques with active
“in the lab” capture of chemistry data, and the addition of that “in‐process” data to the
knowledge network defined by the infrastructure data model.

Ultimately, we envision that this common data model, interchange protocols, and suite of
data extraction and data capture tools will enable an eChemistry Web – a semantic graph
with embedded subgraphs representing molecules which are then interrelated to
publications that refer to them, experiments that work with them, the context of these
experiments, the researchers working with these molecules, annotations about publications
and experiments, and the like. A particularly interesting aspect of this semantic graph is the
manner in which it mixes data, publication artifacts, and people – providing an
information‐rich social network built around the notion of object‐centered sociality [5]. In the latter
phases of the project we hope to build innovative analysis tools that will extract new
“scientometric” information and knowledge from the eChemistry Web.

Our work in the oreChem Project and, in particular, our design of the interoperability
infrastructure, is being undertaken with the recognition that chemistry, like any scholarly
discipline, is not an island, but has complex linkages to scholarship in other disciplines and
into related activities such as education, and in fact to the general network‐based
information environment. By basing our work on OAI‐ORE, we hope that the
interoperability paradigm designed for oreChem will coexist with similar work in other
disciplines and in fact with the general Web information space and its ubiquitous search
tools, services, and applications.

¹ Collaborators in the oreChem Project are University of Cambridge (Peter Murray‐Rust, Jim
Downing), Cornell University (Carl Lagoze, Theresa Velden), Indiana University (Geoffrey
Fox, Marlon Pierce), Penn State University (C. Lee Giles, Prasenjit Mitra, Karl Mueller),
PubChem (Steve Bryant), and University of Southampton (Jeremy Frey, Simon Coles).

² These repositories include CrystalEye, 100,000 molecules and 100,000 fragments from
crystal structures with full crystallographic details and with 3D coordinates; SPECTRaT,
open theses with molecules; Pub3D, MMFF94‐optimized 3D structures for PubChem
compounds; ChemXSeer, an integrated digital library and database allowing for intelligent
search of documents in the chemistry domain and data obtained from chemical kinetics;
eCrystals, high level crystal structures and processed x‐ray diffraction data; and R4L,
experimental spectroscopic and analytical chemical data.
