open data: centralised or decentralised?

Deepak Singh highlights one of the emerging approaches to global data, Freebase. Recall that at SciFoo we also heard about Google’s offer to host scientific data:

Freebase at Scifoo


One of the sessions at Scifoo that left me a little confused was the demo by Danny Hillis and colleagues on Freebase, something that I have discussed previously at bbgm. I love the concept of Freebase, the ability to create structures on top of data in a collaborative, somewhat ad hoc way.
Something that I wasn’t aware of was that the folks at Metaweb are using Freebase (the website) as a test case, and expect that the primary use will be for developers to build applications using the Freebase API. The killer application that was mentioned was people search. I wonder how people search using Freebase would get significantly better traction than something like Spock, although it’s easy to see how a proper implementation could easily leap ahead of any people search engine (and someone should develop one right now).
The somewhat disappointing aspect, at least as I understand things today, was that all data had to be local to Freebase. That would mean that if I wanted to use Freebase as an annotation engine for multiple distributed data sets (e.g. at NCBI or EBI), it would not be very practical. However, I wonder if there is a way of using Freebase as a store for annotations, etc., which link out to all these data sources, e.g. a store for protein interactions based on literature data stored elsewhere.
I believe that, to be applicable in the biosciences and perhaps elsewhere, Freebase needs to be untethered. While the website can remain a source of information, and people can use it as a backend data source, an open data model, query language and API which could be run anywhere and put on top of any data source would open up many possibilities. Does it make sense for the folks at Freebase to do that? I don’t know, and haven’t had the opportunity to quite get my head around the problem, but if all data has to be local, it’s going to be hard to use that power in a practical way. The metaweb, as it were, should not be centralized. Perhaps Freebase is just one example, a test ground for what Metaweb Technologies will make available, and we just need to wait for that.
Can you make out that I am a little confused?

2 Responses to “Freebase at Scifoo”

  1. Jim Hendler Aug 12th, 2007 at 4:56 pm
    A lot of the Semantic Web vision is based on exactly what you are asking for – something like MetaWeb, but open and distributed – like the difference between a great ebook and the Web – each has its place, but the place for an open distributed store as a way of linking things seems to be important — check out the W3C’s Semantic Web Activity (http://www.w3.org/2001/sw)
  2. Deepak Singh Aug 12th, 2007 at 8:00 pm
    I am quite familiar with the W3C work (I’ve blogged about it before as well), and I completely agree that each has its place. What appeals to me about Freebase is the ability of people without expertise in ontologies and XML to build structure on top of data.

I am attracted by Freebase/Metaweb and also by DBpedia/OpenLink. These are technologies which build ontology-supported repositories where large amounts of metadata can be centrally stored. I talked with some of the people involved at the WWW2007 meeting, and some of them have the vision of vast central stores of metadata – loosely, tera-triplestores or larger. I think that technology now allows this.
However, I also picked up on this centralist approach. There was also a view that the whole of the world’s information could be given unique IDs. This won’t work in general, as there are many concepts which are important but too fuzzy to label: copies, containers, addresses, versions and so on all cause major problems.
And I think Deepak is right for bioscience – it can’t be centralised, and the Semantic Web has to be distributed.
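Deepak’s idea of a store of annotations which link out to distributed data sources can be made concrete with a very small sketch: the triples live locally, but their subjects and objects are URIs resolving to records held elsewhere (at NCBI, EBI, etc.). This is my own illustration, not anything Freebase actually offered; the URIs and predicate names below are illustrative only.

```python
# A minimal sketch of a distributed annotation store.  The triples are
# held locally, but point out to records stored elsewhere.
# All URIs and predicates below are illustrative, not real service calls.

# Each annotation is a (subject, predicate, object) triple of strings.
annotations = [
    ("http://www.ebi.ac.uk/uniprot/P69905",            # remote protein record
     "interacts_with",
     "http://www.ebi.ac.uk/uniprot/P68871"),           # another remote record
    ("http://www.ebi.ac.uk/uniprot/P69905",
     "described_in",
     "http://www.ncbi.nlm.nih.gov/pubmed/0000000"),    # hypothetical paper ID
]

def annotations_about(store, uri):
    """Return every triple whose subject or object is the given URI."""
    return [t for t in store if uri in (t[0], t[2])]

# The local store answers queries; the data itself stays at NCBI/EBI.
hits = annotations_about(annotations, "http://www.ebi.ac.uk/uniprot/P69905")
```

The point of the sketch is only that the annotation layer and the primary data need not live in the same place, which is exactly what an all-data-local Freebase rules out.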
But chemistry is smaller. I have already suggested that a year’s core information on newly published compounds could be squeezed into a few terabytes. Not everything, perhaps, but enough to make it worthwhile. And, in chemistry, most concepts can be given unique labels. So, as always, it’s discipline-dependent.
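The unique-label point is what makes a central chemistry repository tractable: an identifier such as the InChI is derived from the structure itself, so two depositors describing the same compound arrive at the same key. A minimal sketch, assuming InChI strings as the canonical keys (the specific entries below are illustrative):

```python
# Sketch of a compound repository keyed by canonical labels (InChI strings).
# Because the label is computed from the structure, independent deposits of
# the same compound collide on the same key and their metadata merges.
repository = {}

def deposit(inchi, metadata):
    """Merge metadata for a compound under its canonical InChI key."""
    repository.setdefault(inchi, {}).update(metadata)

# Illustrative: the standard InChI for ethanol.
ethanol = "InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3"
deposit(ethanol, {"name": "ethanol"})
deposit(ethanol, {"formula": "C2H6O"})   # a second, independent deposit
```

After both deposits the repository still holds a single entry for ethanol, carrying the merged metadata; contrast this with the fuzzy concepts above (copies, versions, containers), for which no such canonical key exists.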
Did I mention that such a repository has to be completely Open Data?
This entry was posted in data, open issues.

One Response to “open data: centralised or decentralised?”

  1. The differences between fields were evident at SciFoo. It appears that some fields, like cosmology, require a great deal of time to analyze raw data, while in organic synthesis, once we are able to visualize the raw data in the form of spectra, we can quickly come to a decision about the support for the claim that a certain compound was isolated. My guess is that what is expected of the peer review system will vary greatly between fields as well.
