Open Chemistry Data at NIST

I had a wonderful mail this morning from Steve Heller …

Peter


I am helping the NIST folks get additional GC/MS EI (electron impact only) mass spectra for their WebBook and mass spec database.
http://webbook.nist.gov/chemistry/
and
http://www.nist.gov/srd/nist1a.htm

The question I have for you is would you be willing to post something on your blog suggesting it would be useful for people to donate their EI MS to the NIST folks. The WebBook is Open Data which is where the spectra would go first/initially. In addition, the spectra would also go into the NIST mass spec database to add to the existing database they provide.
NIST is in the process of setting up an arrangement with the Open Access Chemistry Central folks to do this and I wanted to see if you also would be willing to cooperate/collaborate as well.
Cheers

Steve

PMR: Many of us have known the NIST webbook for many years. It was the first, and for some time the only, openly accessible chemistry resource on the web (outside bio-stuff like PDB). NIST are a US government agency whose role is – in large part – to produce standards (data, specs) for resources in science and engineering. Part of this role is to support US commerce through these activities.

The webbook has many thousands of entries for compounds. Even if you aren’t a chemist, have a look, as it’s an ideal exemplar of how data should be organised. The impressive thing is that it has complete references for all data and also concentrates on error estimation. In many ways it is the gold standard of chemical data. (I agree that things like Landolt-Börnstein are very important, but in the modern web-world monographs costing thousands of dollars are increasingly dated.) And it was Steve and colleagues (especially Steve Stein) who got the InChI process started – because they had so much experience in managing data publicly it made sense to promote the InChI identifier for compounds.

(In passing, NIST has also made an important contribution to our understanding of the universe by measuring the fundamental constants to incredible accuracy).

So is NIST in CKAN – the Open Knowledge Foundation’s growing list of packages of Open Data? YES (from http://www.ckan.net/package/read/nist)

About

The NIST Data Gateway provides easy access to NIST scientific and technical data. These data cover a broad range of substances and properties from many different scientific disciplines.

Openness

Much of the material appears to be in the public domain as it is produced by the US Federal Government, but it varies from dataset to dataset.

Note that there is some fuzziness about what is meant by openness here – the NIST pages carry “all rights reserved” and “the right to charge in future”. But Steve’s motivation is clear, and it’s part of the role of OKFN/CKAN to help determine what the rights are.

I’m also interested in the reference to Open Access Chemistry Central. This raises the very important question of where Open Data should be located. The bioscience community has shown that a mixture of (inter)governmental organizations can work extremely well, but this is less clear in chemistry at present. We are in an exploration phase, with a number of initiatives trying out models such as PubChem (gov), ChemSpider (independent/commercial), CrystalEye (academic), NIST (gov), Wikipedia Chemistry (independent), NMRShiftDB (academic), Chemistry Central (commercial/publisher), etc. I am sure there will be a need for multiple outlets – the variation in the sites above is too great for any single organization.

What is important is that this is Linked Open Data, because then it does not matter who exposes it. LOD has a number of requirements, including (a minimal sketch follows the list):

  • Open Data (not just accessible)

  • Semantic infrastructure (e.g. XML/RDF)

  • Identifier systems

  • Appropriate metadata and/or Ontologies
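
To make those requirements a little more concrete, here is a minimal sketch (my own illustration, assuming Python with the rdflib library) of how a single compound might be exposed as Linked Open Data: an InChI as the identifier, a typed property value with units, and a link back to the source record. The namespace and property names are invented for the example, not an existing chemistry ontology.

    # Minimal Linked Open Data sketch using rdflib (assumed available).
    # The namespace and property names are hypothetical, not a real ontology.
    from rdflib import Graph, Literal, Namespace, URIRef
    from rdflib.namespace import RDF, XSD

    EX = Namespace("http://example.org/chem/")              # hypothetical vocabulary
    compound = URIRef("http://example.org/compound/benzene")

    g = Graph()
    g.bind("ex", EX)

    # Identifier system: the InChI string identifies the compound.
    g.add((compound, RDF.type, EX.Compound))
    g.add((compound, EX.inchi, Literal("InChI=1S/C6H6/c1-2-4-6-5-3-1/h1-6H")))

    # A measured property with explicit units and a pointer to its source.
    g.add((compound, EX.meltingPoint, Literal("278.7", datatype=XSD.decimal)))  # kelvin
    g.add((compound, EX.meltingPointUnits, Literal("K")))
    g.add((compound, EX.source, URIRef("http://webbook.nist.gov/chemistry/")))

    print(g.serialize(format="turtle"))

With data in this form it matters much less who hosts it, because the identifiers and semantics travel with the triples.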

I’ll be talking about this at BioIT next week in Boston (where I shall meet up with Steve). I’ll be blogging more over the next two days.

In Cambridge we have just been funded by JISC to enhance our repository of chemistry data, which will include mass spectra. I don’t know how much of it is EI, but our mission is to make the data Open, and where this happens we will certainly send it off to Steve. There’s a certain amount of technology needed, but between us I think we could get an excellent public prototype.

More – much more – soon.

This blogpost was prepared with ICE+OpenOffice.

Posted in "virtual communities", Uncategorized | Tagged | 5 Comments

Blogging Encouragement to an eScientist – Toowoomba-style

I recently posted some ideas (Blogging encouragement to an eScientist) on how to get started with blogging. Here’s some more thoughts while I wait for the pre-final version of the Chem4Word demo…

Peter Sefton has shown me how to use Open Office (or Word) with the ICE toolbar/addin and I’m using this for this post. But the main thrust is a discussion we had in Magee’s bar.

What happens if no-one comments on your blog? This is a hard thing for any of us. You have no idea how many people are reading your blog or for what purpose. Yes, there are Feedburner and Technorati, but the picture they give is crude (I suspect a lot of my previous Technorati blogs were pseudo-linkspam).

Some blogs generate lots of comments on every post. Some of the chemistry blogs are like this – Tenderbutton (before its author had to quit to write his thesis) and Totally Synthetic. There could easily be 50-100 comments, almost by return. This really anticipated Twitter, and indeed Twitter and FriendFeed may take over some of this role. It clearly showed that a strong community had grown up round these blogs – in this case practising laboratory scientists for the most part. It encouraged chemical gossip – what were things like in other laboratories, and so on.

Sometimes, but not always, you can create posts which get reactive comments. These are normally when you air views that are likely to be controversial. (It’s not generally a good idea to invite – say – political discussion just for the sake of it, though obviously some science blogs interact well with current politics.) The new enlightenment in the US is crucially important for Open practices.

But sometimes you get quite unusual comments, as Peter Sefton found. He’s a committed cyclist and has a trailer – I think for children. Anyway he mentioned this – or showed a picture – and now his biggest blog karma comes from cyclists (and not from ICE fanatics).

So I was listening to Radio Queensland and it was focusing on Toowoomba, where one of the police officers is concerned about bad motoring habits endangering cyclists. So he has installed a camera on his bicycle and uses it to record bad driving, ultimately with a view to prosecution. He has also said that if any cyclist wishes to do the same, the police will take such videos seriously.

As a daily cyclist in Cambridge I think this is a great idea. I’ve had a few moderately near misses (I had actual contacts in London which damaged me). I can imagine the idea catching on very rapidly here – the home of web-based government such as FixMyStreet and WhatDoTheyKnow. And, of course, Cambridge is the most comprehensively OpenStreetMapped city anywhere (I think). So just as FixMyStreet encourages people to take photos of potholes, uncleared ice, bad signs, broken railings, etc. what about FixMyDriving?

[I thought I had remembered how to use OO+ICE to post this but it crashed and only posted half the document so I am back to the old methods until the ICE patch].

Oh, and one of the most depressing aspects is that when you do get a comment and think – great! – it’s often some mindlessly depressing linkspam. I have installed Akismet and it’s caught over 100,000 spams, but some still get through. Make sure you get used to recognising them, because otherwise you could get deindexed by Google. That’s happened to us. Really, really depressing.

Just found The Open Laboratory 2009 blogging competition. An excellent place to see what top-class science blogs are like. Enter yours if you feel like it!

Posted in Uncategorized | 1 Comment

Crystaleyesing The Fascinator

We have covered a lot of ground here at USQ in Toowoomba (and we haven’t finished, as we have a pub visit). We didn’t know what we were going to do at 9 am yesterday, but we took these strands:

  • Understanding Chem4Word in an authoring environment. We arrived at the (reasonable) conclusion that this is a one-way process: documents coming from Chem4Word can be repurposed using any number of downstream tools (JUMBO, ICE, etc.), but we are not looking to reinject documents into Word2007 (at this stage) nor to carry behavioural interoperability into other word processors. Of course we require semantic interoperability.

  • Review of ICE-TheOREm. This is a JISC project where we are working with USQ, who provide the bulk of the development. It’s a proof of concept to show how a thesis can be assembled from components using ORE, sent to a Board of Graduate Studies, and how components can be embargoed. All will be revealed at Open Repositories (OR09) by Jim Downing and Peter Sefton; too bad I can’t be there, but maybe we’ll show something at ETD2009…

  • Reusing UCC content in USQ tools. The most immediate attraction was to port CrystalEye into the Fascinator.

  • From the USQ description:

    The Fascinator is a software platform for eResearch. Development started in 2008 as an attempt to create a clean and usable Institutional Repository user interface. We succeeded in creating a faceted search interface for repositories such as ePrints and Fedora Commons.

  • I’m blogging this to make sure I have understood it… The Fascinator has a back-end repository (currently Fedora, though it might be exchanged for other engines, or a file system). It is currently populated mainly with metadata from other resources which hold the original blobs, movies, etc. – one of these is 20 TB of Vietnam history. The content of the repository is then indexed by local scripts (Python/Jython) and passed to Solr, a web interface for Lucene. This provides an indexed search engine for the content.

  • There is no reason why finely grained data should not be held, and Oliver has ingested CrystalEye as XML. We have written filters for all the important content such as atom counts, CIF dictionary items, etc. (a minimal sketch of such a filter follows this list). A preview can be seen at http://rspilot.usq.edu.au:8080/the-fascinator/search/cml where 29 entries have been indexed. Try browsing the entries or searching with blue – which will find blue crystals. Obviously it is possible to customise the interface to provide specific search boxes and terms.

  • The main point is that Daniel hacked this in about an elapsed day. That’s a tribute to Daniel and also to the flexibility of the platform. Daniel’s planning to put 100,000 entries into the system and virtualise it.

  • There are overlaps and differences between the Fascinator and our (Jim’s) Lensfield. The Fascinator is a faceted indexer and has been designed for indexing large documents. Lensfield is a native RDF approach which is more fine grained and which can hold more complex structures. Jim and Peter will be comparing notes at OR09 and I am sure that we will be able to use the synergy of both. It’s clear that tools for the semantic web are starting to arrive.
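
As a concrete illustration of the kind of filter mentioned above, here is a rough sketch, assuming Python with xml.etree and the requests library, of pulling a few facets (atom count, element set, formula) out of a CML file and posting them to a Solr core. The Solr URL, core name and field names are placeholders, and this is my own sketch rather than the actual Fascinator/CrystalEye code.

    # Sketch of a CrystalEye-style indexing filter: extract simple facets from
    # a CML file and post them to Solr. URL, core and field names are placeholders.
    import xml.etree.ElementTree as ET
    import requests

    CML_NS = {"cml": "http://www.xml-cml.org/schema"}
    SOLR_UPDATE = "http://localhost:8983/solr/crystaleye/update?commit=true"  # assumed

    def cml_to_solr_doc(path, doc_id):
        """Extract atom count, element set and concise formula from a CML file."""
        root = ET.parse(path).getroot()
        atoms = root.findall(".//cml:atom", CML_NS)
        formula = root.find(".//cml:formula", CML_NS)
        return {
            "id": doc_id,
            "atom_count": len(atoms),
            "elements": sorted({a.get("elementType") for a in atoms if a.get("elementType")}),
            "formula": formula.get("concise") if formula is not None else None,
        }

    def index(paths):
        docs = [cml_to_solr_doc(p, p) for p in paths]
        requests.post(SOLR_UPDATE, json=docs).raise_for_status()  # Solr JSON update handler

    # index(["entry0001.cml", "entry0002.cml"])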

Posted in Uncategorized | Leave a comment

Blogging using ICE

I am learning (thanks to Cynthia and Peter) how to use ICE as an authoring tool for blogging and hopefully the format, if not the content, of my blogs will improve.

Peter says enter a blockquote

This is a new paragraph. The main message is that I am authoring this offline. I first have to save this.

This was a quick post to show how the ICE system at USQ can be used to create blogs. The advantage is that it uses OpenOffice as an authoring tool so the worst excesses of WordPress are removed. The final document is then published into the drafts folder of the blog, where I must log in and republish it.
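
For anyone curious about the plumbing, pushing a document into a WordPress drafts folder can be done over the blog’s XML-RPC interface. The sketch below is my own illustration of that step, not a description of how ICE actually does it; the URL, credentials and content are placeholders.

    # Sketch: create a WordPress draft over XML-RPC (metaWeblog API).
    # This illustrates the "publish offline, review online" step; it is not
    # the ICE implementation. URL and credentials are placeholders.
    import xmlrpc.client

    BLOG_XMLRPC = "https://example.wordpress.com/xmlrpc.php"   # placeholder
    USER, PASSWORD = "author", "secret"                        # placeholders

    server = xmlrpc.client.ServerProxy(BLOG_XMLRPC)
    content = {
        "title": "Blogging using ICE",
        "description": "<p>Body converted from the OpenOffice document.</p>",
    }
    # The final False means "leave as a draft"; log in later to review and publish.
    post_id = server.metaWeblog.newPost("1", USER, PASSWORD, content, False)
    print("Created draft", post_id)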

We are now looking at doing this automatically by using the server at USQ, so the next post may show this.


Currently I have settled on publishing directly from OO, using the USQ plugin/toolbar. This has a few minor glitches (you need to retype URLs etc. for each post) but they will be fixed. When that is done I shall move to it wholeheartedly. It still goes through WordPress, so the posts are there and can be managed and tweaked, but the bulk of the editing is done offline in a much more congenial environment.

Posted in Uncategorized | Leave a comment

ICE-cold in Toowoomba

I am here for all too short a time, working with Peter Sefton and colleagues on a number of collaborations on authoring and publishing tools. Peter runs the Australian Digital Futures Institute at the University of Southern Queensland in Toowoomba – a lovely place in the mountains west of Brisbane.

We have a joint project funded by JISC – ICE-TheOREm – and I’ll blog later when we have had the demo. This is a great arrangement because we have been able to contract much of the work to Peter’s group. Having now met the current group (and it’s grown since I was last here) I can say that it has a critical mass of committed developers, which is very hard to put together in most academic institutions, especially those which depend on “research” output rather than technology. We’ve built up a strong mutual understanding over the last 3 or so years.

We have our differences of approach, but wherever possible we are looking for these to complement each other. Good academic web tools will depend on a mixture of diversity and synergy. That means trying out new ideas but not getting locked into one’s own approach because you want glory or money (the chances are relatively small).

What often happens in the academic content/publishing world is that technology “empires” spring up – managing repositories, courseware, etc. They often mutate into political organisations with large consortia, where the pace is governed not by technology but by the need to satisfy everybody’s interests. At the other end of the spectrum are the geeks – in the best sense – who want to build systems in days.

They often do. And Toowoomba is one of the places where it happens.

Peter has been showing me the Fascinator – it’s a lightweight desktop repository based on Fedora (but that’s exchangeable). We have an apparently similar approach in Jim Downing’s Lensfield. However we are looking to see how these two complement each other – Peter is document-centered, we are data-centered, and there is enough difference that it makes sense to go forward on both fronts.

But I have to rush …

Posted in "virtual communities", Uncategorized | Leave a comment

Semantic authoring

An interesting post from Duncan Hull, The Unreasonable Effectiveness of Google, about the challenges of a semantic web of data. Since I am talking on the Chemical Semantic Web at Bio-IT World Conference & Expo 2009, it has a strong influence on what I shall say. There is the question of whether to annotate data:

“The first lesson of Web-scale learning is to use available large-scale data rather than hoping for annotated data that isn’t available.”

In general I agree, and believe that it’s possible to use heuristics such as text-mining and ontologies to clear up some of this later (it can never be 100 percent). Data often contain internal checks and consistency relations that show up bad values. A typical example is temperature – when you find a small set of values about 273 degrees larger than the main set, you can be pretty sure someone has muddled Celsius and kelvin. But there are many cases where you can’t know an isolated value is wrong.
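
As a toy illustration of that kind of internal check (my own sketch, assuming nothing more than a plain Python list of temperatures), one can flag values that sit roughly 273 degrees above the bulk of the data:

    # Toy consistency check: flag values that look like kelvin mixed into a
    # Celsius data set (or vice versa), i.e. offset by about 273.15 from the bulk.
    import statistics

    def suspected_unit_mixups(values, offset=273.15, tolerance=5.0):
        """Return values lying within `tolerance` of `offset` above the median."""
        median = statistics.median(values)
        return [v for v in values if abs((v - median) - offset) <= tolerance]

    melting_points = [79.5, 80.1, 80.3, 353.3, 79.8]   # one value looks like kelvin
    print(suspected_unit_mixups(melting_points))        # -> [353.3]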

And so our current approach is painless semantic authoring: for authors to create the same documents as they do now, but checked ontologically and semantically. That’s technically possible, but it needs the tools, and that’s why I am at Toowoomba today.

Posted in Uncategorized | Leave a comment

Crystal26 – what I said – the Crystallographic Semantic Web

As usual I didn’t know in detail what I would say at Crystal26 – it depends on who is present, what has just been said, and how grateful I am to the organizers (10/10). I have an overview page (in HTML) and a menu of a few hundred topics with 10-20 “slides” each. In particular I download chunks of HTML from the web rather than try to emasculate them with PowerPoint. This makes it difficult to distribute a “talk”, so I try to blog the major points.

Overview of presentation:

  • The Semantic Web is here and ICT companies are investing heavily
  • Vision of universal access to knowledge for both humans and machines
  • Belief in emergent human/machine phenomena
  • SW already well developed in bioscience
  • crystallography very well placed in physical science

I do strongly believe in the nascent Semantic Web leading to a new phase of knowledge development and sharing. Some obvious areas will be the continued development of natural language tools and – as others think – a new generation of knowledgebases beyond Google. Current contenders include Wolfram Alpha and True Knowledge. Little is known in practice of either – I would guess that WA will have major applicability to physical science, while TK seems to be closer to a Semantic Web approach, but with fuzzy algorithms rather than the formalism of RDF/OWL. We shall see. But at present I’m guessing it’s still worth trying to create semantic documents with controlled ontologies.

Semantic web:

  • linked Open Data
  • (global) reasoning engine
  • social networks

I showed my Tweetdeck as an example of social networks – I have become converted to this as a valuable way of throwing medium-value contributions or requests to a like-minded community. For example, I tweeted my request for support material and it was picked up by various members of the direct and informal groups. However I concentrated on Linked Open Data (with emphasis on the Open), but as always didn’t cover it all – I present material until the time cutoff and then stop. (Whereas a PowerPoint requires you to flip through lots of slides to get to the end.)

Expectations:

  • global access to data in standardised or interconvertible form
  • “natural language” questions linking data
  • “giant global brain”

Requirements:

  • Open data
  • Agreed semantics
  • identifier system
  • agreed ontologies
  • ontology mapping
  • authoring tools
  • searchable repositories
  • validation systems

Crystallography does well implicitly in these areas, but it needs formalising. Data is often not aggressively Open, semantics are not in modern formats, and ontologies are implied at best (through CIF dictionaries). Data creation tools are patchy – the good news is that manufacturers are generally on board, the bad is that there is no semantic editing or authoring. Repositories do not exist except as highly managed, expensive data banks. This has to change, and it will.
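
As one small illustration of what “semantics in modern formats” might look like, the sketch below (my own assumption, not any agreed IUCr mapping) turns a couple of CIF data items into RDF triples whose predicates are derived from the CIF dictionary tags:

    # Sketch: expose CIF data items as RDF N-Triples, deriving predicate URIs
    # from the CIF dictionary tags. The namespace is invented for illustration;
    # no agreed IUCr mapping is implied.
    CIF_NS = "http://example.org/cif/"        # hypothetical predicate namespace

    def cif_items_to_ntriples(entry_uri, items):
        """items: dict of CIF tags (e.g. '_cell_length_a') to raw string values."""
        lines = []
        for tag, value in items.items():
            predicate = CIF_NS + tag.lstrip("_")   # '_cell_length_a' -> 'cell_length_a'
            lines.append(f'<{entry_uri}> <{predicate}> "{value}" .')
        return "\n".join(lines)

    example = {"_cell_length_a": "5.431", "_symmetry_space_group_name_H-M": "F d -3 m"}
    print(cif_items_to_ntriples("http://example.org/crystal/si", example))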

Challenges

  • getting semantic data
  • publishers’ attitudes (not IUCr/Acta)
  • creation of ontologies

Tractable approaches

  • Open semantic authoring tools
  • searchable Open Data (RDF) repositories

So then some demos. I showed Andrew Walkinshaw’s geo-crystal mashup, CrystalEye, CIF-CML-RDF-OWL in Protege, semantic authoring in Chem4Word and finally a movie of Lensfield molecular repository (Nico, Joe, Jim have made great short movies of what the system looks like).

And, as always, we are keen to collaborate in these areas. My great thanks to Peter Turner for his invite and support.

Posted in Uncategorized | 1 Comment

Crystal26/SCANZ at Barossa

I’ve immersed myself for the last 2 days in the Australia/NZ crystallography meeting – about 100 scientists – some old acquaintances. It’s been wonderful. Some of the imaging and related techniques have been awesome – instrumentation has moved on so much. And, of course, this means that data is more and more critical.

It’s been great to see experiments where theory and experiment are compared (this is almost completely unknown in modern chemoinformatics where there is little theory and virtually no experiment).

I’ll pick just one – quantitative diffuse scattering from Harald Reichert (ESRF). Don’t switch off – it relates to chemistry. As an atom moves in a crystal it experiences the chemical and lattice forces of its neighbours. Since diffuse scattering can now be measured with a dynamic range of (IIRC) 5 orders of magnitude, we can see places in the crystal where atoms occur rarely, and this maps out the anharmonic potential. By working in reciprocal space it’s possible to observe second- and third-order interatomic forces.

So what does this have to do with chemistry? Well, for years, according to Harald, the DFT experts have been telling the experimentalists that theory (“first principles”) is so good that experiments are irrelevant.

Well the latest synchrotrons are showing the reverse. That theory is all over the shop compared with experiment. And of course it’s not “first principles” – it’s a folklore of basis sets, functionals and pseudopotentials. So maybe the “correct” DFT parameterisation could be better determined from experiment. [Note this is for alloys at present, but in principle it can be done for ionic solids.]

I miss doing experiments.

Posted in Uncategorized | 1 Comment

Henry Rzepa's blog

Henry and I have worked together for many years – today he mailed me about the latest entry in his blog. He’s wondering whether blogs are a way of recording scientific ideas – which used to be published in letters. Anyway, here’s a link:

A molecule with an identity crisis: Aromatic or anti-aromatic?

In 1988, Wilke (DOI: 10.1002/anie.198801851) reported molecule 1

A [24]annulene. Click on image for model.

It was a highly unexpected outcome of a nickel-catalyzed reaction and was described as a 24-annulene with an unusual 3D shape. Little attention has been paid to this molecule since its original report, but the focus has now returned! The reason is that a 24-annulene belongs formally to a class of molecule with 4n (n=6) π-electrons, which makes it antiaromatic according to the (extended) Hückel rule. This is a select class of molecule, of which the first two members are cyclobutadiene and cyclo-octatetraene. The first of these is exceptionally reactive and unstable and is the archetypal anti-aromatic molecule. The second is not actually unstable, but it is reactive, and conventional wisdom has it that it avoids the antiaromaticity by adopting a highly non-planar tub shape and hence adopts reactive non-aromaticity. Both these examples have localized double bonds, a great contrast with the molecule which sandwiches them, cyclo-hexatriene (i.e. benzene). The reason for the resurgent interest is that a number of crystalline, apparently stable, antiaromatic molecules have recently been discovered, and ostensibly, molecule 1 belongs to this select class! …

Posted in Uncategorized | Leave a comment

Thoughts about DCC and USQ

Having moved computers, one of the things that got lost was my list of feeds, so I am catching up with some of the ones I used to read. I came across Chris Rusbridge’s Digital Curation Blog, which is essential reading for anyone interested in the subject. Chris is a clear and thoughtful writer and often uses the blog to explore new ideas. Here is his comment on something I said at #LOTF09:

Libraries of the Future: SourceForge as Repository?

In his talk (which he pre-announced on his resumed blog), Peter Murray-Rust (PMR) suggested (as he has done previously) that we might like to think of SourceForge as an alternative model for a scholarly repository (“Sourceforge. A true repository where I store all my code, versioned, preserved, sharable”). I’ve previously put forward a related suggestion, stimulated by PMR’s earlier remarks, on the JISC-Repositories email list, where it got a bit of consideration. But this time the reaction, especially via the Twitter #lotf09 feed, was quite a bit stronger. Reactions really fell into two groups: firstly something like: SourceForge was setting our sights way too low; it’s ugly, cluttered by adverts, and SLOOOOOWWW (I can’t find the actual quote in my Twitter search; maybe it was on Second Life). Secondly, the use cases are different, eg @lescarr: “Repositories will look like sourceforge when researchers look like developers” (it might be worth noting that PMR is a developer as well as a researcher).

PMR: My demo of SourceForge was primarily to show the features of an Open, versioned system – I wanted to show that writers needed versions, needed to share, needed a submit-and-forget system. What I was implying (and it had to be implicit because of the short time) was that we needed similar systems for compound documents. Chris continues:

Journal articles, by contrast, are mostly written using proprietary (sometimes open source) office systems which create and edit highly complex files whose formats are closely linked to the application. Apart from some graphic and tabular elements, an article will usually be one file, and if not, there is no convention on how it might be composed. Workflow for multi-author articles is entirely dependent on the lead author, who will usually work with colleagues to assign sections for authoring, gather these contributions, assemble and edit them into a coherent story, circulating this repeatedly for comments and suggested improvements (sometimes using tracked change features).

PMR: completely so, and I am happy that Chris and others have picked up the baton. We need something of the power of Word or OpenOffice and in an ideal world the repository would allow either tool to access documents in the system. Indeed we should be careful about the word “repository” which implies “repose” and therefore peaceful but dead documents.

Next week I am off to USQ to work with Peter Sefton, guru of the ICE: The Integrated Content Environment. I need to know more detail before I claim it is the answer to Chris’ requirements. A system which manages collaboration, multiple authors, different content types, links, behaviour, styles, versions, platform independence, etc. is not trivial to create and I’ve probably made some facile assumptions about operations.

But if you are a “repository manager” responsible for curating an institution’s published output, you should be actively looking at communal authoring systems. That’s what the researchers really want, and getting the results into a repository should be a manageable task.

Posted in Uncategorized | 2 Comments