Many years ago, Henry Rzepa and I discussed the idea of extending Dublin Core to chemistry, and we called it Dublin-Chem. The “Dublin” is Dublin, Ohio, home of OCLC. We discussed this with Stuart Weibel (OCLC), the DC guru, and it seemed a reasonable approach. An early publication (ca. 1999) listed 11 primary tags (although I thought there were more):
Table 2. A Chemical Metadata Schema

Element Name | Description of the element | Deployment in HTML 4.0 |
---|---|---|
HEAD | Specifies the location of a meta data profile. | |
DC.chem.coordinates | Molecular coordinates | |
DC.chem.substance.formula | Formula constitution | |
DC.chem.substance.smiles | Connection table for molecule | |
DC.chem.computation-simulation | Presence of computed or simulated property | |
DC.chem.biological-activity | Biological activity | |
DC.chem.safety | Type of chemical safety information | |
DC.chem.characterisation | Characterisation mode of molecule | |
DC.chem.instrumentation | Associated instrumentation | |
DC.chem.physicochemical-data | Molecular properties | |
DC.chem.reaction-data | Reaction classification | |
DC.chem.crystallography | Crystallographic information | |

We’d like to put these into the chem:* microformat pool. It’s probably a good idea to remove the hierarchy (giving e.g. chem:formula) and some of the verbosity (giving e.g. chem:reaction).
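As a rough sketch of what that flattening could look like, here is one possible rule in Python: keep only the last dotted segment and drop a trailing "-data". The function name `flatten_tag` and the rule itself are my assumptions for illustration. Only chem:formula and chem:reaction are names actually proposed above; the other outputs are simply what this rule happens to produce, not agreed tags.

```python
# The 11 DC.chem.* element names from the table above.
DC_TAGS = [
    "DC.chem.coordinates",
    "DC.chem.substance.formula",
    "DC.chem.substance.smiles",
    "DC.chem.computation-simulation",
    "DC.chem.biological-activity",
    "DC.chem.safety",
    "DC.chem.characterisation",
    "DC.chem.instrumentation",
    "DC.chem.physicochemical-data",
    "DC.chem.reaction-data",
    "DC.chem.crystallography",
]


def flatten_tag(dc_name: str) -> str:
    """Hypothetical rule: drop the hierarchy and a trailing '-data'."""
    last = dc_name.split(".")[-1]      # remove the hierarchy
    if last.endswith("-data"):         # remove some of the verbosity
        last = last[: -len("-data")]
    return "chem:" + last


if __name__ == "__main__":
    for name in DC_TAGS:
        print(f"{name} -> {flatten_tag(name)}")
```

A rule like this is only a starting point; some names (e.g. the computation-simulation one) would probably still want shortening by hand.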
I have talked with a future Open collaborator who is keen to try these ideas out on the chemical blogosphere. We calculated that the current blogosphere might contain ca. 1 million triples. This is not a serious problem at this stage, though 3 orders of magnitude more might require more engineering.
So how many tags have we got, and how many might we want? Maybe a good start is to think of hypothetical queries (aimed at present at the blogosphere, but potentially over a much wider set of documents). At present let’s assume that there are no synonyms and no numeric computation. Some suggestions:
- Find posts after [date] that mention patents from GSK
- Which posted syntheses mention DCM?
- Find posted reviews of syntheses which involve author X
Note that not everything has to be done in chem:* – we can probably rely on dates, bibliography etc. coming from elsewhere.
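To make the first two queries above concrete, here is a minimal sketch over a handful of in-memory triples, using exact string matching only (no synonyms). Everything in it is invented for illustration: the post identifiers, the predicates `chem:type`, `chem:mentions` and `chem:patent-assignee`, and the use of `dc:date` as the date coming "from elsewhere". It is not a proposal for the actual tag set.

```python
from datetime import date

# Hypothetical (subject, predicate, object) triples harvested from blog posts.
TRIPLES = [
    ("post1", "dc:date", "2007-01-10"),
    ("post1", "chem:type", "synthesis"),
    ("post1", "chem:mentions", "DCM"),
    ("post2", "dc:date", "2007-03-02"),
    ("post2", "chem:patent-assignee", "GSK"),
    ("post3", "dc:date", "2006-11-20"),
    ("post3", "chem:patent-assignee", "GSK"),
]


def subjects(triples, predicate, obj):
    """Subjects having exactly this predicate/object pair (no synonyms)."""
    return {s for s, p, o in triples if p == predicate and o == obj}


def posts_after(triples, cutoff):
    """Subjects whose dc:date is strictly later than the cutoff date."""
    return {
        s for s, p, o in triples
        if p == "dc:date" and date.fromisoformat(o) > cutoff
    }


# "Find posts after [date] that mention patents from GSK"
gsk_after = subjects(TRIPLES, "chem:patent-assignee", "GSK") & posts_after(
    TRIPLES, date(2007, 1, 1)
)

# "Which posted syntheses mention DCM?"
dcm_syntheses = subjects(TRIPLES, "chem:type", "synthesis") & subjects(
    TRIPLES, "chem:mentions", "DCM"
)

if __name__ == "__main__":
    print(sorted(gsk_after))
    print(sorted(dcm_syntheses))
```

Each query is just an intersection of sets of subjects, which is why the number of distinct predicates (tags) matters more at this stage than any clever query engine.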
I read Egon’s suggestion a few months back, and have been waiting for some progress in this area. I think I am much like a blogger you made reference to earlier, in that I am very much confused as to what I should do regarding microformats in the blog I am writing. I have placed InChIs and DOIs in the text – just what should I do now with this microformatting business?
My hunch is that I should just wait for some of this to simply mature – but the longer I wait, the more work it will be to go back to earlier materials and add this stuff in.
Any advice?
Why will 3 orders of magnitude more triples need more engineering? Are you assuming that the only way of getting value is by aggregating all the triples into a single triple-store?
I suspect the real question is how we structure sources of chem:triples so that we don’t have to aggregate them all together to do useful things, and that this is worth thinking about up front.
(2) Jim – ignore what I wrote. The world is much bigger, and I will post about this in an hour or so.