I’m at a session at WWW 2007 on Linking Data – which I think will be enormously important for us. Something I had never heard of before (it’s new this year):
DBPedia
It scrapes 750,000 infoboxes from WP and turns them into structured RDF. The message is simple – the more implicit structure there is in WP, the easier it is for DBP to extract it. If there is a template for a given category (e.g. chemical compounds) then we can easily create an interface to extract structured RDF. For example DBP now has:
1,600,000 concepts
58,000 persons
75,000 YAGO categories
207,000 WP categories
and I am sure it will be relatively easy to extract the chemistry (Martin, how many compounds are there with infoboxes?)
DBP has a SPARQL endpoint on an OpenLink Virtuoso server (I am sitting next to these guys). A typical query:
“All German musicians born in Berlin in 19th Century”
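To make that concrete, here is a rough sketch of how such a question might be packaged for a SPARQL endpoint over HTTP. The endpoint address, the `ex:` prefix and every class and property name are my own placeholders (assumptions, not DBP's actual schema); only the general shape, a SELECT query sent in a `query` parameter, follows the SPARQL protocol.

```python
from urllib.parse import urlencode

# Assumed endpoint address; it was not given in the session notes.
ENDPOINT = "http://dbpedia.org/sparql"

# Hypothetical vocabulary: the ex: names below are placeholders only.
QUERY = """
PREFIX ex: <http://example.org/vocab/>
SELECT ?musician ?year WHERE {
  ?musician a ex:GermanMusician ;
            ex:birthPlace ex:Berlin ;
            ex:birthYear ?year .
  FILTER (?year >= 1801 && ?year <= 1900)
}
"""


def build_sparql_url(endpoint: str, query: str) -> str:
    """Return a GET URL that carries the query in the 'query' parameter."""
    return endpoint + "?" + urlencode({"query": query})


url = build_sparql_url(ENDPOINT, QUERY)
print(url)
```

Nothing is sent here; the URL could be handed to `urllib.request.urlopen` once the real property names are known.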
Extensions include
- free text search
- COUNT()
Key components are:
- All concepts are identified by URIs
- All URIs are dereferenceable over the web into a small RDF snippet.
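Dereferencing can be sketched as an ordinary HTTP GET that asks for RDF rather than HTML. This is a minimal illustration, assuming the server honours an `Accept: application/rdf+xml` header; the resource URI below is a made-up example, not one shown in the talk.

```python
from urllib.request import Request


def rdf_request(uri: str) -> Request:
    """Build (but do not send) a GET request that asks for RDF/XML."""
    return Request(uri, headers={"Accept": "application/rdf+xml"})


# Hypothetical resource URI, used only for illustration.
req = rdf_request("http://example.org/resource/Adolf_von_Baeyer")
print(req.get_method(), req.full_url, req.get_header("Accept"))
```

Passing `req` to `urllib.request.urlopen` would then return whatever RDF snippet the server publishes for that concept.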
The fantastic thing is that we now have a complete RDF resource FOR FREE. One example which was shown was “von Baeyer”, so whenever we refer to him we get his date of birth, history, probably even his FOAFs! DBP is becoming one of the central information hubs of the emerging web of data.
In that way DBP can become the “popular” chemical hub, while Pubchem-RDF will become the “specialist” chemical hub. Of course they will be linked and possibly even indistinguishable in some RDF snippets.
The queries are fantastic:
“A soccer player with #11 shirt in a club with a stadium of over 40,000 seats born in a country with over 10 M inhabitants”
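What makes a query like this work is that it joins facts about three different kinds of thing (player, club, country) through shared identifiers. Here is a toy sketch of that join over a handful of invented triples; the data and predicate names are entirely made up for illustration and are not DBP's.

```python
# Invented toy triples (subject, predicate, object); not real DBP data.
TRIPLES = [
    ("player:A", "shirtNumber", 11),
    ("player:A", "playsFor", "club:X"),
    ("player:A", "bornIn", "country:P"),
    ("player:B", "shirtNumber", 11),
    ("player:B", "playsFor", "club:Y"),
    ("player:B", "bornIn", "country:P"),
    ("club:X", "stadiumCapacity", 55000),
    ("club:Y", "stadiumCapacity", 12000),
    ("country:P", "population", 60000000),
]


def objects(subject, predicate):
    """Return all objects recorded for a given subject and predicate."""
    return [o for s, p, o in TRIPLES if s == subject and p == predicate]


def matching_players():
    """Players with shirt 11, a club stadium over 40,000 seats,
    and a birth country with over 10 million inhabitants."""
    players = sorted({s for s, p, o in TRIPLES
                      if p == "shirtNumber" and o == 11})
    result = []
    for player in players:
        big_stadium = any(cap > 40000
                          for club in objects(player, "playsFor")
                          for cap in objects(club, "stadiumCapacity"))
        big_country = any(pop > 10_000_000
                          for country in objects(player, "bornIn")
                          for pop in objects(country, "population"))
        if big_stadium and big_country:
            result.append(player)
    return result


print(matching_players())
```

A SPARQL engine does the same thing declaratively: each condition becomes a triple pattern, and the shared variables do the joining.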
Let’s think what the Blue Obelisk will be able to do for chemistry. TBL has said we can lash/mash things up “in an afternoon”. I am going to find out today what we can do with the chemistry we have got.
The other RDF resources in the same web are books, US census, geonames, CIA factbook, DBLP, dbtune, FOAF, Revyu
600 RDF triples. This is staggering. 100K links out of DBPedia.
And then in 2 months: music, Gutenberg, SW-lifesci, flickr, eurostat, freebase, HTMLweb GRDDL, blogosphere (SIOC), MusicBrainz…
So – let’s do dbchem…!!! There is still a lot for me to learn. There are starting to be several large hubs of links. Which is the hub for a community will depend on what they want and what they create.
As far as I can tell, there are around 3000 compounds with chemboxes, and over 2000 with drugboxes. I think we have many compounds on WP without chemboxes, but they are typically very brief articles (stubs) with little information. Of course linking into the mainstream of chemical information, as dbpedia seeks to do, may provide an incentive for more wikichemists to work on adding chemboxes. Sounds great!
Martin A. Walker (Walkerma on WP)