An interesting post from Duncan Hull, The Unreasonable Effectiveness of Google, about the challenges of a semantic web of data. Since I am talking on the Chemical Semantic Web at Bio-IT World Conference & Expo 2009, there is a lot in it to influence me. There is the question of whether to annotate data:
“The first lesson of Web-scale learning is to use available large-scale data rather than hoping for annotated data that isn’t available.”
In general I agree, and believe that it’s possible to use heuristics such as text-mining and ontologies to clear up some of this later (it can never be 100 percent). Data often contain internal checks and consistency relations that show up bad values. A typical example is temperature: when you find a small set of values about 273 degrees larger than the main set you can be pretty sure someone has muddled Celsius and kelvin. But there are many cases where you can’t know whether an isolated value is wrong.
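To make the temperature check concrete, here is a minimal sketch of how such a heuristic might look. It is an illustration only, not a tool we actually use: the function name `flag_probable_kelvin`, the 15-degree tolerance and the sample readings are all made up, and it assumes most values in the set really are in Celsius so that the median stands for the main set.

```python
from statistics import median

KELVIN_OFFSET = 273.15  # 0 degrees Celsius expressed in kelvin


def flag_probable_kelvin(values, tolerance=15.0):
    """Return indices of values that look like kelvin readings in a Celsius set.

    Assumption (hypothetical): the majority of values are Celsius, so the
    median represents the main set. The tolerance is an arbitrary choice.
    """
    centre = median(values)
    flagged = []
    for i, v in enumerate(values):
        # Values already near the main set are left alone.
        if abs(v - centre) <= tolerance:
            continue
        # An outlier is flagged only if subtracting the offset
        # brings it back into the main set.
        if abs((v - KELVIN_OFFSET) - centre) <= tolerance:
            flagged.append(i)
    return flagged


# Invented sample data: five Celsius readings and two that look like kelvin.
readings = [21.5, 22.0, 20.8, 295.1, 23.1, 294.6, 21.9]
print(flag_probable_kelvin(readings))  # prints [3, 5]
```

Note that an isolated odd value such as 150.0 is not flagged by this check, because shifting it by 273.15 does not land it in the main set. That is exactly the case where the data alone cannot tell you the value is wrong.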
And so our current approach is painless semantic authoring: authors create the same documents as they do now, but checked ontologically and semantically. That’s technically possible, but it needs the tools, and that’s why I am at Toowoomba today.