I (and colleagues) are getting ready for the December Hackathon (JISC, OKF, SWAT4LS) which includes Open Research Reports and the Semantic Web For Life Sciences. The Hackathon can include any activity, but we are preparing material to bring along based on Open Research on diseases that is, or can be, semantified. We hope this will be an important step forward for making disease information more widely available and useful.
So what’s Semantics? It’s not a disease, is it?
No. It’s a formal way of talking about things. Humans are (usually) very good at understanding each other even when they use fuzzy language. For example there is a sign next to our bicycle shed which says:
NO BICYCLES HERE
*We* all know this means:
“Do not put bicycles here”
and the study of this is called pragmatics http://en.wikipedia.org/wiki/Pragmatics .
Here are three sentences where English speakers easily distinguish between the meanings of the symbol “cold”:
- She has a cold
- She has a cold sore
- She has a cold foot
We will return to these later.
Unfortunately pragmatics is beyond the range of most computer systems so we have to create formal systems for them – these are based on syntax (a common symbolic representation, http://en.wikipedia.org/wiki/Syntax ) and semantics (agreement on meaning, http://en.wikipedia.org/wiki/Semantics ). (Be warned that the border between these is fuzzy.)
Our syntax for the semantic web includes:
- URIs (http://en.wikipedia.org/wiki/Uniform_Resource_Identifier ). This is a universally agreed mechanism for giving things-on-the-web names. Thus the URI for the Wikipedia article on “syntax” is “http://en.wikipedia.org/wiki/Syntax”
B: HANG ON! That’s not a name, it’s an address. It’s a Uniform Resource Locator (URL, http://en.wikipedia.org/wiki/URL )
A: Yes. It’s an address and also a name. The URI identifies the resource and also locates it.
B: But it might not be there – you might get a 404.
A: Wikipedia never 404s
B: Or someone could copy the page to another address. It’s still the same page, but a different URL.
A: but it’s not the definitive URI
B: Why not? And anyway the XML crew spent 10,000 mail messages debating whether names and addresses were different.
A: well they are the same now. Tim says so.
B: That’s a distorted view of reality.
PMR: Hussssh! This has been a major debate for years and will continue to be so. Here’s Tim (http://en.wikipedia.org/wiki/Linked_Data ):
- Use URIs to identify things.
- Use HTTP URIs so that these things can be referred to and looked up (“dereferenced“) by people and user agents.
- Provide useful information about the thing when its URI is dereferenced, using standard formats such as RDF/XML.
- Include links to other, related URIs in the exposed data to improve discovery of other related information on the Web.
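To make “dereferenced” concrete, here is a minimal sketch in Python of what a user agent might prepare when it looks up an HTTP URI and asks for a standard format. The function name is hypothetical, the request is only built and never sent, and whether any given server actually returns RDF/XML for this URI is an assumption I am not making.

```python
from urllib.request import Request


def build_linked_data_request(uri, media_type="application/rdf+xml"):
    """Prepare (but do not send) a GET request for a machine-readable view of `uri`.

    The Accept header states which standard format the client would like;
    the server is free to ignore it.
    """
    return Request(uri, headers={"Accept": media_type}, method="GET")


# The URI quoted earlier in this post, used purely as an illustration.
req = build_linked_data_request("http://en.wikipedia.org/wiki/Syntax")
print(req.get_method(), req.full_url)
print("Accept:", req.get_header("Accept"))
```

The point is only that the same string serves as the name of the thing and as the address the request goes to.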
B: So these are conflated URIs (“HTTP URIs”). They only work if the thing is a web resource.
A: Here he is again:
- All kinds of conceptual things, they have names now that start with HTTP.
- I get important information back. I will get back some data in a standard format which is kind of useful data that somebody might like to know about that thing, about that event.
- I get back that information it’s not just got somebody’s height and weight and when they were born, it’s got relationships. And when it has relationships, whenever it expresses a relationship then the other thing that it’s related to is given one of those names that starts with HTTP.
Note that although the second rule mentions “standard formats”, it does not require any specific standard, such as RDF/XML.
B: so it’s only “conceptual things”. Like “syntax”. My cat cannot have an HTTP-URI.
A: not your cat. But TBL can be dereferenced: Look at http://en.wikipedia.org/wiki/Tim_Berners-Lee .
B: That’s not him – it’s a web page about him. You make it sound as if Wikipedia defines reality. If it’s not in Wikipedia it doesn’t exist. You are a Borgesian.
A: A what?
PMR: Shhhh! This is the sort of “robust discussion” we get into all the time. We are going to take a very simple approach to the semantic web. The advantage is it is easy to understand and will work. We are first of all going to give things precise labels.
B: Like a “cold”
PMR: exactly like that. We will call a cold “J00.0”
B: whatever for? I won’t remember that.
A: You don’t have to – the machines will. A cold will always be J00.0
B: Well why “J00.0”? Why not “common_cold”, like Wikipedia (http://en.wikipedia.org/wiki/Common_cold )?
A: Because that’s what the WHO call it, in their International Classification of Diseases, 10th Revision (ICD-10) http://en.wikipedia.org/wiki/ICD-10 . PMR actually worked with the WHO (in Uppsala) to convert ICD-10 to XML. He knows it by heart.
PMR: well I did. I’ve forgotten most of it.
B: OK, well I suppose the WHO has a right to create names for diseases. But surely they aren’t the only ones?
A: No – there’s http://en.wikipedia.org/wiki/Medical_Subject_Headings (MeSH) – which calls it D003139 . And ICD–9 …
B: The ninth edition I suppose …
A: Yes. Calls it 460.
B: I bet they don’t all agree on what a cold is.
PMR: No. There’s lots of variation in medical terminology. There’s the http://en.wikipedia.org/wiki/Unified_Medical_Language_System (UMLS). It:
is a compendium of many controlled vocabularies in the biomedical sciences (created 1986). It provides a mapping structure among these vocabularies and thus allows one to translate among the various terminology systems; it may also be viewed as a comprehensive thesaurus and ontology of biomedical concepts.
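As a rough sketch of what “translating among terminology systems” means, here is a tiny Python lookup using only the three codes for the cold quoted in this post. The table layout and the `translate` function are hypothetical; this is not the UMLS data model or API, and a real crosswalk would be loaded from UMLS itself.

```python
# Codes for the common cold exactly as quoted in this post.
CONCEPTS = {
    "common_cold": {"ICD-10": "J00.0", "MeSH": "D003139", "ICD-9": "460"},
}


def translate(code, source, target, concepts=CONCEPTS):
    """Return the `target` vocabulary's code for the concept that `source` calls `code`.

    Returns None when the code is unknown or the concept has no entry in `target`.
    """
    for codes in concepts.values():
        if codes.get(source) == code:
            return codes.get(target)
    return None


print(translate("J00.0", "ICD-10", "MeSH"))
print(translate("460", "ICD-9", "ICD-10"))
```

Note that this treats the three codes as meaning exactly the same thing, which is precisely the assumption B is doubting.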
B: and now this “ontology” word?
A: it’s a formal system (http://en.wikipedia.org/wiki/Ontology_%28computer_science%29 ):
an ontology formally represents knowledge as a set of concepts within a domain, and the relationships between those concepts. It can be used to reason about the entities within that domain and may be used to describe the domain.
B: It’s too complicated for me.
PMR: We are going to start simple. Ontologies tell computers how to distinguish between different meanings of the concept “cold”. We’ll just assume that we humans generally agree.
A: But doctors don’t agree on diagnoses – how can we?
PMR: This isn’t about whether you are actually infected by rhinovirus…
B: … ???
PMR: The virus that causes a cold. It looks like this:
A: Yes – it’s got icosahedral symmetry and was …
PMR: … back to the semantics. It’s about putting the concept of “cold” into computers. We need a unique identifier and we can use the WHO one.
B: but J00.0 isn’t unique. That’s the number of my neighbour’s car.
PMR: so we turn it into a URI. An HTTP-URI is unique because it’s based on domain names, and they are unique.
A: but what domain name? Since the WHO invented it, let’s use the HTTP-URL for the cold. That’s http://apps.who.int/classifications/icd10/browse/2010/en#/J00-J06
B: but that should be http://apps.who.int/classifications/icd10/browse/2010/en#/J00 – but that doesn’t resolve. And in any case I bet the “apps” bit changes. That’s why addresses are no use for URIs
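A small aside on those two WHO addresses: the disease code sits after the “#”, in the fragment, and the fragment is handled by the browser rather than sent to the server. The sketch below (Python standard library only, hypothetical function name) splits the two URLs quoted above to show that they share the same server-side address, so what appears for a given code depends on the page itself.

```python
from urllib.parse import urldefrag

# The two addresses quoted in the dialogue above.
BROWSE_RANGE = "http://apps.who.int/classifications/icd10/browse/2010/en#/J00-J06"
BROWSE_SINGLE = "http://apps.who.int/classifications/icd10/browse/2010/en#/J00"


def split_fragment(url):
    """Return (address requested from the server, fragment kept by the client)."""
    base, fragment = urldefrag(url)
    return base, fragment


for url in (BROWSE_RANGE, BROWSE_SINGLE):
    base, fragment = split_fragment(url)
    print(base, "| fragment:", fragment)
```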
PMR: It’s really up to authorities like WHO to give stable identifiers for this, that are persistent in name and address.
B: That’s a tough order. Do you think the WHO are up to it?
PMR: Probably not yet. We’ll probably need to invent a way round it. Perhaps with a PURL (http://en.wikipedia.org/wiki/Persistent_Uniform_Resource_Locator ).
B: and you said this was easy?
PMR: The Semantic Web community is working hard to make this easy for you, yes. Anyway, nearly there. Let’s just use http://purl.org/who/classifications/icd10/J00.0 as a shorthand for “common cold”
PMR: sorry, identifier. And address – which we can make resolvable by redirecting the PURL.
A: OK, we’ve now got an identifier system for all diseases. Will we always use ICD-10?
PMR: It’ll make it easier for our ORR project and we shan’t need mappings or ontologies.
A: So we can identify “cold sores” as http://purl.org/who/classifications/icd10/B00.1
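Here is a minimal Python sketch of that identifier scheme: mint a URI from an ICD-10 code using the PURL base proposed above, recover the code from a URI, and look up where the PURL currently redirects. All function names and the redirect table are hypothetical, the purl.org path is only the suggestion made in this post, and the single redirect target is the WHO address quoted earlier.

```python
PURL_BASE = "http://purl.org/who/classifications/icd10/"


def mint_uri(icd10_code):
    """Build the identifier for an ICD-10 code under the proposed PURL base."""
    code = icd10_code.strip().upper()
    if not code or "/" in code or " " in code:
        raise ValueError(f"not a usable ICD-10 code: {icd10_code!r}")
    return PURL_BASE + code


def code_from_uri(uri):
    """Recover the ICD-10 code from an identifier, or None if it is not one of ours."""
    if not uri.startswith(PURL_BASE):
        return None
    return uri[len(PURL_BASE):] or None


# Hypothetical redirect table: the name stays fixed, only the target is edited
# when the authority moves its pages.
REDIRECTS = {
    mint_uri("J00.0"): "http://apps.who.int/classifications/icd10/browse/2010/en#/J00-J06",
}


def resolve(uri, redirects=REDIRECTS):
    """Return the current address behind an identifier, or None if no redirect is set."""
    return redirects.get(uri)


print(mint_uri("J00.0"))
print(mint_uri("B00.1"))
print(resolve(mint_uri("J00.0")))
```

The identifier never changes even if the “apps” part of the WHO address does; only the entry in the redirect table has to be updated.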
B: You’ve convinced me that we can give each disease a unique identifier (whether we actually have the disease or not). But “cold sores” is not a disease – it’s a symptom. And the disease is “Herpesviral vesicular dermatitis” according to WHO. The virus isn’t a disease as such so does it have its own identifier?
PMR: Yes. The virus (http://en.wikipedia.org/wiki/Herpes_simplex_virus ) is actually a combination of protein and DNA. Its classification is:
B: But that’s not an identifier.
PMR: Agreed. So somewhere we need to find an identifier or work out a scheme for creating one.
B: So the semantic web won’t work?
A: We are all at the stage of creating it. There’s been a huge increase in identifier systems. There are now thousands in the Linked Open Data cloud. And that’s the sort of thing we’ll tackle in the Hackathon.
B: I’m knackered. I’ve learnt that we need HTTP-URIs for everything. We’ve just done diseases. If we want to do ORRs we need people, places, parasites and so on. All in semantic form, right?
B: so there’s a HUGE amount of work to be done.
A: But lots of people are involved. And once we’ve done it, it will be persistent.
B: until we change our concepts…
PMR: But by then we shall already have shown how powerful it is.