DBpedia recently released the new version of their dataset. The project aims to extract structured information from Wikipedia so that it can be queried like a database. On their blog they say:
The renewed DBpedia dataset describes 1,950,000 “things”, including at least 80,000 persons, 70,000 places, 35,000 music albums, 12,000 films. It contains 657,000 links to images, 1,600,000 links to relevant external web pages and 440,000 external links into other RDF datasets. Altogether, the DBpedia dataset now consists of around 103 million RDF triples.
As well as improving the quality of the data, the new release includes coordinates for geographical locations and a new classificatory schema based on Wordnet synonym sets. It is also extensively linked with many other open datasets, including: “Geonames, Musicbrainz, WordNet, World Factbook, EuroStat, Book Mashup, DBLP Bibliography and Project Gutenberg datasets”.
This is probably one of the largest open data projects currently out there – and it looks like they have done an excellent job at integrating structured data from Wikipedia with data from other sources. (For more on this see the W3C SWEO Linking Open Data project – which exists precisely in order to link more or less open datasets together.)
PMR: DBPedia1 was mind-blowing, but – not surprisingly – suffered from inconsistency and incompleteness. For example, there were several RDF predicates for the date of death: “deathDate”, “death_date”, “deathdate”, and so on. This is entirely forgivable for a first try. As DBPedia awareness spreads through WPedians they will converge on how infoboxes are created to give maximum semantic value. It only needs one or two evangelists in a discipline – e.g. in chemistry – to work this out, show the value, and then popularise it. The main body of WPedians will then adopt these methods and rapidly create a coherent semantic hyper-object.
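Until such convergence happens, variant predicates like these can still be queried together by listing the alternatives explicitly. A minimal SPARQL sketch, assuming the `dbpprop:` property namespace (the exact property URIs vary between DBpedia releases):

```sparql
# Sketch: collect death dates recorded under any of several variant
# predicates. The property names here mirror the variants mentioned
# above and are illustrative, not a definitive DBpedia schema.
PREFIX dbpprop: <http://dbpedia.org/property/>

SELECT ?person ?date
WHERE {
  { ?person dbpprop:deathDate ?date }
  UNION
  { ?person dbpprop:death_date ?date }
  UNION
  { ?person dbpprop:deathdate ?date }
}
```

A UNION like this papers over the inconsistency at query time; converging on a single predicate in the infoboxes would make it unnecessary.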
The exciting thing is that this is zero-cost.
This will revolutionise reference chemistry. We have recently shown – and will be demoing at AHM2007 – how we can extract semantic chemistry from eTheses. That means that any student writing a thesis can increasingly link – painlessly – to WPedia for their lightweight ontological resource. Authors will know they are using the terms correctly – readers will know what the terms mean – and much more.
So I predict that within a few years DBPedia will become the semantic resource for chemistry. Every entry in WPedia enhances it – you never go backwards. We’ll be able to combine fundamental information for compounds such as colour, melting point, density, etc. There will be enough semantic data that a machine could rediscover the periodic table.
And that’s just the start. So, I’ll be browsing DBPedia in the blank spaces at AHM2007.
“So I predict that within a few years DBPedia will become the semantic resource for chemistry.”
Very informative. Who are their angel investors?
I always want to be part of web2.0 finance-related projects!
(1) I meant to qualify this as “general reference chemistry”. It may not be the actual DBPedia group at present, but it will be a semantic transformation of social computing.
Pingback: Noel O'Blog
I always get excited when reading about ‘structured data’ 😉
And I need help …
http://miningdrugs.blogspot.com/2007/09/wikipedia-dbpedia-and-remaining.html
(5) We are now starting to extract semantic data from theses (we can do this as we have the rights). This is necessarily semi-structured and is a good example of the sort of thing that is possible. A typical query is “which pairs of theses have compounds in common”. Relatively easy to express in SPARQL.
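As a sketch, such a query might look like the following SPARQL. The `ex:` vocabulary and the `mentionsCompound` predicate are hypothetical placeholders, not an actual thesis ontology:

```sparql
# Sketch: find pairs of theses that mention the same compound.
# The ex: namespace is invented for illustration; a real repository
# would use its own ontology for theses and chemical compounds.
PREFIX ex: <http://example.org/thesis-ontology#>

SELECT DISTINCT ?thesis1 ?thesis2 ?compound
WHERE {
  ?thesis1 ex:mentionsCompound ?compound .
  ?thesis2 ex:mentionsCompound ?compound .
  FILTER (?thesis1 != ?thesis2)
}
```

The FILTER excludes the trivial case of a thesis paired with itself; with compounds identified by shared URIs (e.g. via InChI or a WPedia link), this kind of join is exactly what RDF makes cheap.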