Open Canada

(and maybe a reader will give me the French translation – I do not know the gender).
I am delighted to be speaking at WWW 2007 in Banff, Canada, as Canada has a very high profile in Open activities. This is not a comprehensive survey and depends on people I have met or have encountered virtually – order is idiosyncratic.
Heather Morrison has been very active in promoting Openness – coincidentally she posted a request for a summary of open data just today.
Here’s her home page, splendidly called The Imaginary Journal of Poetic Economics

Imagine a world where anyone can instantly access all of the world’s scholarly knowledge – as profound a change as the invention of the printing press. Technically, this is within reach. All that is needed is a little imagination, to reconsider the economics of scholarly communications from a poetic viewpoint.

At the same time, she got a reply from Alison Ball:

Heather
There are descriptions of Canadian data sites at http://dac.cisti.nrc.ca/datact_e.cfm. Some examples of data sets that could be of interest to laypersons are:

  • Canadian Poisonous Plants Information System
  • Canadian Bird Trends
  • CADRMP Adverse Reaction Database
  • National Climate Data and Information Archive
  • Ontario Sport Fish Contaminant Monitoring Program

Wow! Just the sort of material I was looking for! These are exactly the sorts of data that should be Open. Alison is in the National Research Council of Canada which is also much more proactive than most at insisting that data become free.
And here are some more people and instances that I have been sent and Canada can be proud of:

Of the 16 people at the Budapest meeting that became BOAI – three are Canadian: Jean-Claude Guedon, Leslie Chan, and Stevan Harnad.

Francis Ouellette – see the Ouellette declaration. Andrew Waller, librarian from the University of Calgary and OA advocate. John Willinsky & the PKP Project / OJS. Canada has a recent, well-funded project called Synergies to support scholarly publishing – most of the nodes (SFU, University of Calgary) are strong supporters of open access.

Public Knowledge Project / Open Journal Systems:
First International PKP Scholarly Publishing Conference, July 11-13, Vancouver, BC:
(highly recommended!!)
Canadian Association of Research Libraries (CARL) IR Project (all Canadian university research libraries either have, or are developing, an IR) – the CARL metadata harvester site is at:
(Currently 12 IRs; work is being done to roughly double this number in the very near future.)
Canadian Institutes of Health Research – draft Access to Research Outputs policy:
If passed in a form similar to the draft, this will be a very strong policy, which calls for open data as well as published results of research.
Open Access Declaration for the Ouellette Laboratory
IJPE Canadian Leadership in the Open Access Movement series
And Heather is even now teaching a whole (1-credit) course on open access – is this the first?:
Posted in open issues, www2007

chem:microformats – what questions would YOU like?

Many years ago Henry Rzepa and I discussed the idea of extending Dublin Core to chemistry and we called it Dublin-Chem. The “Dublin” is Dublin, Ohio, home of OCLC. We discussed this with Stuart Weibel [OCLC] – the DC guru – and it seemed a reasonable approach. An early publication (ca. 1999) listed 11 primary tags (although I thought there were more):

Table 2. A Chemical Metadata Schema

Element Name                     Description of the element                     Deployment in HTML 4.0
HEAD                             Specifies the location of a meta data profile.
DC.chem.coordinates              Molecular coordinates
DC.chem.substance.formula        Formula constitution
DC.chem.substance.smiles         Connection table for molecule
DC.chem.computation-simulation   Presence of computed or simulated property
DC.chem.biological-activity      Biological activity
DC.chem.safety                   Type of chemical safety information
DC.chem.characterisation         Characterisation mode of molecule
DC.chem.instrumentation          Associated instrumentation
DC.chem.physicochemical-data     Molecular properties
DC.chem.reaction-data            Reaction classification
DC.chem.crystallography          Crystallographic information

We’d like to put these into the chem:* microformat pool. It’s probably a good idea to remove the hierarchy (e.g. chem:formula) and some of the verbosity (e.g. chem:reaction).
I have talked with a future Open collaborator who is keen to try these ideas out on the chemical blogosphere. We calculated that the current blogosphere might contain ca 1 million triples – this is not a serious problem at this stage – 3 orders of magnitude might require more engineering.
So how many tags have we got? And how many might we want? Maybe a good start is to think of hypothetical queries (aimed at present at the blogosphere, but potentially over a much wider set of documents). At present let’s assume that there are no synonyms and no numeric computation. Some suggestions:

  • Find posts after [date] with mention of patents from GSK
  • What posted syntheses mention DCM?
  • Find posted reviews of syntheses which involve author X.

Note that not everything has to be done in chem:* – we can probably rely on dates, bibliography etc. coming from elsewhere.
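
To get a feel for how few tags such queries need, here is a minimal sketch in Python. Everything in it is an assumption made for illustration: the post identifiers, the tag names (chem:type, chem:solvent, chem:assignee, dc:date) and the toy triples are hypothetical, and, as agreed above, there are no synonyms and no numeric computation.

```python
from datetime import date

# Toy triples (subject, predicate, object); every name here is hypothetical.
TRIPLES = [
    ("post:1", "chem:type", "synthesis"),
    ("post:1", "chem:solvent", "DCM"),
    ("post:1", "dc:date", "2007-04-02"),
    ("post:2", "chem:type", "synthesis"),
    ("post:2", "chem:solvent", "THF"),
    ("post:2", "dc:date", "2007-05-01"),
    ("post:3", "chem:type", "patent"),
    ("post:3", "chem:assignee", "GSK"),
    ("post:3", "dc:date", "2007-05-03"),
]

def subjects(predicate, obj):
    """Return the set of subjects that carry the triple (?, predicate, obj)."""
    return {s for s, p, o in TRIPLES if p == predicate and o == obj}

def posted_after(day):
    """Return subjects whose dc:date is strictly later than `day`."""
    return {s for s, p, o in TRIPLES
            if p == "dc:date" and date.fromisoformat(o) > day}

# "What posted syntheses mention DCM?"
syntheses_with_dcm = (subjects("chem:type", "synthesis")
                      & subjects("chem:solvent", "DCM"))

# "Find posts after [date] with mention of patents from GSK"
gsk_patents_since_may = (subjects("chem:type", "patent")
                         & subjects("chem:assignee", "GSK")
                         & posted_after(date(2007, 5, 1)))
```

Each query is just an intersection of sets of posts, and the date comes from a non-chem tag, which is the point of the note above.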

Posted in semanticWeb

TBL+13: If everybody did it it would be awesome

13 years ago I sat entranced listening to Tim Berners-Lee giving the closing address at the first WWW conference in CERN, Geneva. I was particularly influenced by one diagram which changed the way I thought about the world. I don’t know whether the one below – which I pinched from Dan Connolly’s tutorial – is Tim’s original but it captures the idea:

The Semantic Web… is an open world and universal space for machine-readable data.

things in documents

To a computer, then, the web is a flat, boring world devoid of meaning…This is a pity, as in fact documents on the web describe real objects and imaginary concepts, and give particular relationships between them…Adding semantics to the web involves two things: allowing documents which have information in machine-readable forms, and allowing links to be created with relationship values.

TimBL, WWW1994

This diagram showed that there was a “subterranean” semantic world of documents that moved in sync with the real world. Which drives which we are still discovering!

So what is today’s message? Since there have been 13 years of world effort it’s probably not as epoch-making for me. But the “two magics” are creativity and collaboration. Both are critical. And the chemical blogosphere and the Blue Obelisk are where creativity and collaboration meet for chemistry. It has to be the start of the future.
And the single message to take away from Tim’s talk:
“If everybody did it it would be awesome”.

Posted in semanticWeb, www2007

Chemical Microformats arrived some time ago!

Egon writes and puts me to shame…

  1. Name: Egon Willighagen | E-mail: egon.willighagen@gmail.com | URI: http://chem-bla-ics.blogspot.com/ | IP: 134.95.200.25
     The use of microformats in chemistry has already begun: http://chem-bla-ics.blogspot.com/2006/12/including-smiles-cml-and-inchi-in.html

He suggested this over FOUR MONTHS ago! I probably missed it as I teleported out of the blogosphere at that time for 3 months to do some hacking. So sorry!
Anyway this is great. He writes:

Including SMILES, CAS and InChI in blogs
Including SMILES is much easier as it is plain text, and has the advantage over InChI that it is much more readable. Chris wondered in the KinasePro blog on how to tag SMILES, while Paul did the same on ChemBark about CAS numbers.
Now, users of PostGenomic.com know how to add markup to their blogs to get PostGenomic index discussed literature, website and conferences. Something similar is easily done for chemistry things too, as I showed in Hacking InChI support into postgenomic.com (which was put on lower priority because of finishing my PhD). PostGenomic.com basically uses microformats, which I blogged about just a few days ago in Chemo::Blogs #2, where I suggested the use of asperin.
And this is the way SMILES, CAS and InChIs can be tagged on blogs. The <span> element is HTML code to indicate a bit of similar content in HTML, and can, among many other things, be formatted differently than other text. However, this can also be used to add semantics in a relatively cheap, but accepted, way. Microformats are formalized just by use, so whatever we, as chemistry bloggers, use will become the de facto standard. Here are my suggestions:

  • for SMILES: CCO
  • for CAS registry numbers: 50-00-0
  • for InChI: InChI=1/CH4/h1H4

The RDFa alternative
The future, however, might use RDFa over microformats, so here are the RDFa equivalents:

  • for SMILES: CCO
  • for CAS registry numbers: 50-00-0
  • for InChI: InChI=1/CH4/h1H4

which requires you to register the namespace xmlns:chem=”http://www.blueobelisk.org/chemistryblogs/” somewhere though. Formally, the URN for this namespace needs to be formalized; Peter, would the Blue Obelisk be the platform to do this? BTW, this is more advanced, and currently does not have practical advantages over the use of microformats.

Talking with Dan Connolly it seems that for best use of Microformats we need to regularize the vocabulary – see the FOAF specification for example. So – unless it has already been done and I have been sleeping – we should get this going on the BO Wiki.
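
As a minimal sketch of what a harvester for such tags might look like, the Python below pulls class-tagged spans out of a post using only the standard library. The class names (chem:smiles, chem:casnumber, chem:inchi) and the blog fragment are assumptions for illustration, not an agreed vocabulary, and nested spans are not handled.

```python
from html.parser import HTMLParser

class ChemSpanParser(HTMLParser):
    """Collect (class, text) pairs for <span> elements whose class starts with 'chem:'."""

    def __init__(self):
        super().__init__()
        self.found = []        # list of (class name, text) pairs
        self._current = None   # class of the chem: span being read, if any
        self._text = []

    def handle_starttag(self, tag, attrs):
        cls = dict(attrs).get("class") or ""
        if tag == "span" and cls.startswith("chem:"):
            self._current = cls
            self._text = []

    def handle_data(self, data):
        if self._current is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "span" and self._current is not None:
            self.found.append((self._current, "".join(self._text).strip()))
            self._current = None

# Hypothetical blog fragment; the class names are guesses, not a standard.
POST = """
<p>Ethanol is <span class="chem:smiles">CCO</span>, formaldehyde has CAS number
<span class="chem:casnumber">50-00-0</span>, and methane is
<span class="chem:inchi">InChI=1/CH4/h1H4</span>.</p>
"""

parser = ChemSpanParser()
parser.feed(POST)
parser.close()
tags = parser.found
```

Here `tags` ends up as a list of (class, value) pairs, which is already most of the way to triples once the URL of the post is added as the subject.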

Posted in chemistry, semanticWeb

Open Data WWW 2007

I am on a panel at WWW 2007 and have a 10-minute slot. Since our wireless in the hotel is pretty good I thought I would continue the experiment I made earlier by blogging parts of it. It also means that the blogosphere can read it and make comments (“peer-review”?) before I present it. Also, since I haven’t yet met up with my chair, I will list more links than I shall use. Then I shall select those most relevant in the 10-minutes allowed (in this way I do not overrun my time).

Building a Semantic Web in Which Our Data Can Participate

Time: Thursday, May 10, 2007 (10:30am-noon)
Location: Coleman
Moderator: Paul Miller (Talis)
Panelists:

  • Steve Coast (OpenStreetMap)
  • Peter Murray-Rust (University of Cambridge)
  • Rob Styles (Talis)
  • Jamie Taylor (Metaweb)

Abstract:
This panel session will introduce participants to the increasingly important concept of open data, illustrating a broader set of concerns with real-world examples from broadcasting, scholarly publishing, map making and the library sector. Panelists will use these examples to highlight current restrictions on the effective use and reuse of rich data in powering the Semantic Web, and will offer suggestions as to ways in which attitudes and practice can and must change if we are to realise the potential of the data already held in databases around the world.
Much attention is currently being paid to Open Source software, and to the value it can bring to the development and dissemination of software within a mixed economy comprising traditionally commercial, open source, and hybrid solutions of various forms.
In the academic sector, too, existing models of publication are being challenged by the rise of the Open Access movement. Here, as in the software world, early polarisation is increasingly giving way to a more pragmatic world view in which various models of publication co-exist to meet a set of requirements.
Far less attention has been paid to the manner in which data can be used and reused, with only a few projects such as OpenStreetMap really challenging the traditional models of control over creating and accessing the underlying data upon which so many applications rely. In scholarly publishing, too, there has tended to be an expectation that rights in the data behind a published paper will be controlled, rather than making the data available — and data produced by the research — in order that readers might test the author’s conclusions for themselves. Now, some funders are beginning to require that both reports of research and data produced by the research be made easily available for re-use, and organisations such as Creative Commons are taking a serious interest in this area with their Science Commons activity.

PaulM doesn’t actually use the term “Open Data” but it’s implied. It’s still not very clearly defined and this should be a matter of urgency – or we don’t know what we are talking about. To help catalyse people’s thoughts and get feedback, I started a page on Wikipedia and also suggested to SPARC that they start an Open Data Mailing list for which many thanks.
In the next few posts I gather resources to which I may refer.

Posted in data, open issues, www2007

We must protect our digital history

I’m at a rather special session on the history of WWW and more generally the web. 13 years ago at WWW CERN 94 I got a T-shirt. I’ve never worn it till now – when we are celebrating the history of the WWW. A very telling quote:

“The good thing about digital media is that you can save everything. The bad thing about digital media is that you can lose everything.”
– Brewster Kahle, Web pioneer, founder of The Internet Archive

I’m conscious that much of what I have been involved in over 15 years has been lost. It’s not “mine” – it’s ours. Things like the first Internet Course – Introduction to Object-Oriented Programming Using C++ – run by Marcus Speh under the auspices of the Globewide Network Academy. It won a “best of the web” prize at CERN WWW 94 – but I suspect almost all of that is lost.
More history tomorrow – maybe I’ll donate the T-shirt.

Posted in www2007

Microformats in the chemical blogosphere – the Chemical Semantic Web has arrived?

One of my readers writes privately…

Too many acronyms for my poor head in [blog] world. I am beginning to see this as a series of rocks in a swirling sea of T- and F-LAs. People on the solid rocks know where they are and what they’re trying to do; the rest are carried around on the acronymic currents, and continually changing the way they face. Confusing for the rest of us….

Yes.
This gives a true flavour of what I am being bombarded with – a screenful of acronyms, parsers, etc. for GRDDL. The main point of the blog is to leave pointers for the Blue Obelisk community.
The simple message is that this stuff is very powerful if the community wants to use it. I think the chemical blogosphere will. I talked with Dan Connolly and Harry Halpin at tea. Essentially we need a microformat vocabulary of no more than 20 terms (see hCard, etc.). Some time ago Henry and I proposed Dublin Chem which has concepts such as “does this document talk about substances?”, “are there any calculations?”, etc. So we could write:
<span class="bo:calculation">MOPAC</span>
which is a microformat saying that the document has something to do with the tag “calculation” as defined by the Blue Obelisk community. There is a lot of very clever magic (profiles) which can be added to the top of the HTML or XML document. There are also lots of very clever tools that already exist to process this. All the stuff is in the tutorial material.
I think this stuff is actually easier than adding InChIs to chemical documents. If we use it then the chemical blogosphere becomes extremely powerful. We can then ask questions like:
“how do I make trityl azide”?
and the GRDDL/SPARQL tool will search the chemical blogosphere for
<span class="bo:preparation">trityl azide</span>
(and of course the InChI could also be used)
How do we know what the “chemical blogosphere” is? We use FOAF (Friend of a Friend) tools. We can define ourselves as friends in the chemical blogosphere and search this bounded set.
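
A minimal sketch of that “bounded set” idea, in Python: walk the friend links outwards from one blog for a fixed number of hops, then search only the blogs reached. The blog identifiers, the friend links and the harvested tags are all hypothetical, and a real version would read foaf:knows statements from each blogger’s FOAF file rather than from a dictionary.

```python
from collections import deque

# Hypothetical foaf:knows links (blog -> blogs it declares as friends).
KNOWS = {
    "blog:pmr": ["blog:egon", "blog:henry"],
    "blog:egon": ["blog:pmr", "blog:chembark"],
    "blog:henry": [],
    "blog:chembark": ["blog:kinasepro"],
    "blog:kinasepro": [],
    "blog:cooking": ["blog:pmr"],   # links in, but nobody in the set links to it
}

def bounded_set(start, max_hops):
    """Breadth-first walk over foaf:knows links, at most max_hops from start."""
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        blog, hops = queue.popleft()
        if hops == max_hops:
            continue
        for friend in KNOWS.get(blog, []):
            if friend not in seen:
                seen.add(friend)
                queue.append((friend, hops + 1))
    return seen

# Hypothetical harvested tags: blog -> list of (class, text) pairs.
TAGGED = {
    "blog:egon": [("bo:preparation", "trityl azide")],
    "blog:cooking": [("bo:preparation", "trityl azide")],
    "blog:kinasepro": [("bo:preparation", "imatinib")],
}

def who_prepares(compound, start="blog:pmr", max_hops=2):
    """Blogs inside the bounded set that tag `compound` as bo:preparation."""
    community = bounded_set(start, max_hops)
    return sorted(blog for blog in community
                  if ("bo:preparation", compound) in TAGGED.get(blog, []))
```

With these toy links, `who_prepares("trityl azide")` finds only the blog inside the friend network, and the unrelated blog carrying the same tag is ignored.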
So here are some of Dan’s slides that inform our direction:

  1. Everything should have a URI: All entities of interest should be identified by URIs.
  2. Follow Your Nose Principle: URIs should be dereference-able, meaning that an application can look up a URI over the HTTP protocol and retrieve RDF data.
  3. Use standard formats: Data should be provided using the RDF/XML or Turtle syntax. If data is embedded using a format like Microformats, then these documents should include links to automatically extract RDF data from them, à la GRDDL.
  4. Link Your Data: Resource descriptions should contain links to related information in the form of dereference-able URIs within RDF statements and rdfs:seeAlso links.

and the most exciting vision:

The Web as One Big Mashup

Follow your nose and query the whole Web
For each triple pattern, the library executes the following algorithm:

  1. look up URIs that appear in the triple pattern. Add retrieved graphs to the local graph set.
  2. look up any URI y where the graph set includes the triple { x rdfs:seeAlso y } and x is a URI from the triple pattern. Add retrieved graphs to the local graph set.
  3. match the triple pattern against all graphs in the local graph set.
  4. for each triple that matches the triple pattern
    1. look up all new URIs that appear in the triple. Add retrieved graphs to the local graph set.
    2. look up any new URI y where the graph set includes the triple { x rdfs:seeAlso y } and x is a URI from a matching triple. Add retrieved graphs to the local graph set.
  5. match the triple pattern against all newly retrieved graphs.
  6. repeat step 4 and 5 until the maximum number of retrieval steps or the timeout is reached.
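
The steps above can be sketched in a few lines of Python. This is a toy under stated assumptions, not the library from the slide: the “web” is a dictionary from URI to a small graph, all URIs and tags are hypothetical, every term is treated as a URI that can be looked up, `None` is the wildcard in a triple pattern, and `max_steps` stands in for the retrieval limit (the timeout is not modelled).

```python
SEE_ALSO = "rdfs:seeAlso"

# Hypothetical "web": dereferencing a URI returns a small graph (list of triples).
WEB = {
    "ex:trityl_azide": [("ex:trityl_azide", "bo:preparation", "ex:post42"),
                        ("ex:trityl_azide", SEE_ALSO, "ex:more")],
    "ex:more": [("ex:trityl_azide", "bo:preparation", "ex:post99")],
    "ex:post42": [("ex:post42", "dc:creator", "ex:egon")],
}

def matches(pattern, triple):
    """True if the triple fits the pattern; None in the pattern is a wildcard."""
    return all(p is None or p == t for p, t in zip(pattern, triple))

def follow_your_nose(pattern, max_steps=10):
    graphs = {}   # local graph set: URI -> retrieved graph

    def look_up(uris):
        """Dereference URIs not seen before; return how many were retrieved."""
        new = 0
        for uri in uris:
            if uri not in graphs:
                graphs[uri] = WEB.get(uri, [])
                new += 1
        return new

    def all_triples():
        return [t for g in graphs.values() for t in g]

    def see_also(uris):
        return [o for s, p, o in all_triples() if p == SEE_ALSO and s in uris]

    def matching():
        return [t for t in all_triples() if matches(pattern, t)]

    # Steps 1-2: look up URIs in the pattern, then their rdfs:seeAlso targets.
    pattern_uris = [term for term in pattern if term is not None]
    look_up(pattern_uris)
    look_up(see_also(pattern_uris))

    # Steps 3-6: match, look up new URIs from matching triples, repeat.
    for _ in range(max_steps):
        uris = [term for t in matching() for term in t]
        if look_up(uris) + look_up(see_also(uris)) == 0:
            break   # nothing new was retrieved, so further rounds change nothing
    return sorted(set(matching()))

results = follow_your_nose(("ex:trityl_azide", "bo:preparation", None))
```

In this toy the second preparation is only found because the first graph carries an rdfs:seeAlso link to another document, which is the whole attraction of following your nose.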

So, “The chemical web as one big mashup”? I think we can.
(Check out the “Tabulator” which is a sort of RDF browser for the web.)

Posted in chemistry, semanticWeb

GRDDL and RDFa

This is how we create our own microformats under Blue Obelisk… from the GRDDL tutorial

Is it too much work to ask people to add the transformation and profile to their individual instance data?
Creators or maintainers of vocabularies can also give users of their data the option of having their data transformed into RDF without having to even add any new markup to individual documents
Since once the transformation has been linked to the profile or namespace document, all the users of the dialect get the added value of RDF for free
In either the namespace document or profile URI there has to be the following RDF property: http://www.w3.org/2003/g/data-view#profileTransformation whose subject is the namespace doc or profile document and whose object is the transformation itself.
While GRDDL has primarily in the wild been used to convert widely deployed microformats to RDF, it can actually be used with the W3C RDFa work item that allows one to “microformat-style” embed arbitrary RDF statements in HTML
RDFa is useful because microformats exist as a number of centralized vocabularies, and what if you want to mark-up meta-data in a web-page about a subject there isn’t a microformat about?
Since RDFa is still a moving target, we personally recommend people use Embedded RDF for the time being unless they are willing to track the changes in RDFa, but RDFa is more expressive than Embedded RDF (allowing XML Schema datatypes, etc.)

This document is licensed under a
<a xmlns:cc="http://cc.org/ns#" rel="cc:license" href="…">CC License</a>
  and was written by TimBL.
  • use existing HTML attributes whenever possible: rel.
  • “Bridging the Clickable and Semantic Webs”: there’s already a clickable link, now we type it.
  • self-contained:
    • copy-and-paste a chunk of HTML along with its RDFa.
    • build “RDFa wizards” to create copy-and-paste-able content-and-structure.
    • combine “widgets” on a page, each widget with its own HTML+RDFa (no interference).

The point is that we can embed our own namespace and profile. The main downside is we can’t do datatypes. But we could start immediately.
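
The profileTransformation statement quoted from the tutorial is a single RDF triple, so a minimal sketch is just one N-Triples line. The property URI is the one given above; the profile and stylesheet URIs are hypothetical placeholders, not published Blue Obelisk resources.

```python
# The property URI is the one quoted from the GRDDL tutorial above.
PROFILE_TRANSFORMATION = "http://www.w3.org/2003/g/data-view#profileTransformation"

def profile_statement(profile_uri, transformation_uri):
    """Return one N-Triples line: <profile> <profileTransformation> <transformation> ."""
    return "<{}> <{}> <{}> .".format(
        profile_uri, PROFILE_TRANSFORMATION, transformation_uri)

# Both URIs below are hypothetical placeholders.
line = profile_statement(
    "http://example.org/bo/profile",
    "http://example.org/bo/bo2rdf.xsl",
)
print(line)
```

The subject is the profile (or namespace) document and the object is the transformation, as the tutorial describes; publishing that one statement is what would let every document using the profile be turned into RDF without extra markup.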

Posted in semanticWeb, www2007

GRDDL and the scruffies –

I think this is very important and we should create a BO microformat for chemistry…
I’ve defected from the heavyweight stuff to the microformats tutorial – it also goes along with GRDDL. It is being given as a double act by Harry Halpin and Dan Connolly (see tutorial notes for all links). We are going to do a mashup of our calendars. Microformats include hCard, hCal, XFN, hReview, rel-tag. The lower case semantic web.
Microformats are exploding – but only apply to COMMON problems (e.g. dates, people). Microformats kind-of don’t validate – but do we care? There is apparently an emerging citation microformat.
Example is of how someone can use her foaf to get hotel reviews from her friends. GRDDL and SPARQL can do this. Look at the examples in the tutorial.
A strong emerging theme is Open Data – the data that we put into a web site. “I want my data back” – Jon Bosak 1997. This is a slightly different dialect of “Open Data” – that which is used by Web 2.0.
If we can hack this from the bottom-up we can add major value to chemistry.
“The semantic web is to spreadsheets and databases what the web of hypertext documents is to wordprocessor files”
Can we generate viral growth – as we see in microformats?
So here is the key:
GRDDL (Gleaning Resource Descriptions from Dialects of Languages) is a way to bootstrap RDF out of XML and in particular XHTML data by explicitly linking transformations from XML to RDF.
GRDDL terminology:

  1. Source Document: an XML document which references at least one GRDDL transformation and hence licenses a GRDDL-aware agent to extract RDF.
  2. GRDDL-aware agent: a software agent able to identify the GRDDL transformations and run them to extract RDF.
  3. GRDDL Transformation: an algorithm for getting RDF from a source document
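
As a minimal sketch of the first half of a GRDDL-aware agent, the Python below identifies the transformations a source document references; it does not run them. It assumes the XHTML convention of declaring the GRDDL profile on the head element and pointing to the stylesheet with rel="transformation"; the sample document and the stylesheet URL are hypothetical.

```python
import xml.etree.ElementTree as ET

XHTML = "{http://www.w3.org/1999/xhtml}"
GRDDL_PROFILE = "http://www.w3.org/2003/g/data-view"

def find_transformations(source):
    """Return hrefs of rel="transformation" links if the head declares the GRDDL profile."""
    root = ET.fromstring(source)
    head = root.find(XHTML + "head")
    if head is None or GRDDL_PROFILE not in head.get("profile", "").split():
        return []   # not a GRDDL source document as far as this sketch can tell
    return [el.get("href")
            for el in root.iter()
            if el.tag in (XHTML + "link", XHTML + "a")
            and "transformation" in el.get("rel", "").split()
            and el.get("href")]

# Hypothetical source document; the stylesheet URL is a placeholder.
SOURCE = """<html xmlns="http://www.w3.org/1999/xhtml">
  <head profile="http://www.w3.org/2003/g/data-view">
    <title>Trityl azide</title>
    <link rel="transformation" href="http://example.org/bo/bo2rdf.xsl"/>
  </head>
  <body><p><span class="bo:preparation">trityl azide</span></p></body>
</html>"""

transformations = find_transformations(SOURCE)
```

Running the stylesheet to get RDF needs an XSLT processor, which is not in the Python standard library, so that step is left out here.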

… everything is in the tutorial …
we need profiles, e.g. in head element of HTML (perhaps a BO profile)
“SPARQL isn’t as bad as it looks”
declare namespaces with prefixes (so we need a BO one)
SELECT DISTINCT ?a ?b ?c
FROM (contents of URI)
WHERE {
sparql stuff
}
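
To show what the WHERE part is doing, here is a minimal sketch in Python of matching triple patterns with ?variables and projecting DISTINCT rows. It is a toy, not a SPARQL engine: the function names, the bo: and dc: tags and the data are all hypothetical.

```python
def match_pattern(pattern, triple, binding):
    """Extend `binding` so that `pattern` equals `triple`, or return None."""
    new = dict(binding)
    for p, t in zip(pattern, triple):
        if p.startswith("?"):
            if new.setdefault(p, t) != t:
                return None   # variable already bound to something else
        elif p != t:
            return None       # constant term does not match
    return new

def select_distinct(variables, where, triples):
    """Evaluate a list of triple patterns and project the requested variables."""
    bindings = [{}]
    for pattern in where:
        extended = []
        for binding in bindings:
            for triple in triples:
                result = match_pattern(pattern, triple, binding)
                if result is not None:
                    extended.append(result)
        bindings = extended
    return sorted({tuple(b[v] for v in variables) for b in bindings})

# Hypothetical triples harvested from blog posts.
TRIPLES = [
    ("post:1", "bo:preparation", "trityl azide"),
    ("post:1", "dc:creator", "blog:egon"),
    ("post:2", "bo:preparation", "trityl azide"),
    ("post:2", "dc:creator", "blog:egon"),
    ("post:3", "bo:calculation", "MOPAC"),
    ("post:3", "dc:creator", "blog:henry"),
]

# Roughly: SELECT DISTINCT ?who WHERE { ?post bo:preparation "trityl azide" . ?post dc:creator ?who }
rows = select_distinct(
    ["?who"],
    [("?post", "bo:preparation", "trityl azide"), ("?post", "dc:creator", "?who")],
    TRIPLES,
)
```

Two posts match, but DISTINCT collapses them to a single row for the one author, which is the behaviour the query skeleton above asks for.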
I am impressed. It will work for BO. Henry and I came up with something like “dublin-chem” many years ago. But microformats are much more powerful. We need to think of the key concepts in chemistry and set up BO microformats.
more later…

Posted in semanticWeb

I'm a scruffy

I’m at a (very good) tutorial on the semantic web at the WWW 2007 meeting in Banff. In the first session there was a key slide from Carole Goble on the different axes – semantics and web. Putting both of these together hasn’t yet happened. Full semantics – OWL, etc. – requires a supported community – either a large corporation or an enlightened public discipline such as bioscience, healthcare or astronomy. They have invested in semantic technologies – ontologies, decision making, formal workflows, resource discovery, etc. (slide 21). The other axis is web – where we get a bottom-up approach of folksonomies, tagging, blogosphere, etc. (Formal chemistry is zero semantics and zero web, of course, since most formal organisations compete rather than collaborate.) So the main development is along the web axis, which Carole labels the “scruffies”.
I’m proud to be a scruffy (I even practice it IRL).
So I asked Sean Bechhofer – who gave the first presentation – what the scruffies could do in semantics. A fairly simple idea jumped out for creating a bottom-up semantics independent of publishers and other organisations, which I’ll probably take forward on the Blue Obelisk list. But here I’d like to know what other communities have done in creating bottom-up semantics. Sean gave the example of conferences on Semantics which have created a simple schema for describing a conference – taking away all the pictorial garbage from the websites and simply giving the basic facts in semantic form – name, place, dates, etc. This is simple, with high appeal and easily implementable. Any other ideas?

Posted in semanticWeb, www2007