petermr's blog

A Scientist and the Web


Archive for the ‘www2007’ Category

Paul Miller speaking at UCC

Tuesday, October 9th, 2007

I should have blogged this earlier but was too wrapped up in my talk for yesterday. Still, if anyone in the Cambridge area is reading this, Paul Miller of Talis is visiting us today and giving a talk in the afternoon (1415, 2007-10-09, Unilever Centre, Chemistry).
I can’t remember when I ‘met’ Paul, but he invited me to a session on Open Data that he ran at WWW2007 and I was extremely impressed with the activities that he and his extended collaborators are working on. I also heard elsewhere that he had given a very good talk on the future of data on the web so I suggested he should be one of our speakers.


Paul Miller

Senior Manager, Technology Evangelist

blog: Thinking about the Future

Paul joined Talis in September 2005 from the Common Information Environment (CIE), where as Director he was instrumental in scoping policy and attracting new members such as the BBC, National Library of Scotland and English Heritage to this group of UK public sector organisations. Previously, Paul was at UKOLN where he was active in a range of cross-domain standardisation and advocacy activities, and before that he was Collections Manager at the Archaeology Data Service. At Talis, Paul is exploring new models of collaboration and identifying further areas in which our technology or knowledge would be of value. Paul has a Doctorate in Archaeology from the University of York.

PMR: Note that you don’t have to have a degree in Computer Science to be a world expert on the semantic web. The future requires people with a vast range of experience and skills.

Moving from a web of documents to the web of data
Dr. Paul Miller, Senior Manager & Technology Evangelist, Talis
The open web has been transformed by a tide of newly interactive applications, many of which approach the utility of software previously installed locally on your own computer. Flexible and interactive, they also leverage the power of the network to compare your behaviour with that of your peers (Amazon’s famous recommendation services are a prime example), and aggregate individually irrelevant actions together to deliver network-scale benefits.
Into this visibly evolving world of richly ‘Web 2.0’ web sites comes the long-held promise of Tim Berners-Lee’s ‘Semantic Web’. Escaping from the laboratory, previously esoteric technologies such as ‘RDF’, ‘OWL’ and ‘GRDDL’ are being put to work in building the next generation of the rich applications that have so transformed our use of the Web.
This presentation will illustrate and comment upon these trends, as well as introducing ways in which they might impinge upon all of you… and not just when you order a new book from Amazon in your lunch break.

PMR: Among many other talents Paul is able to blog talks as they are given. I can’t type fast enough to do this for Paul, but I’ll try to summarise my impressions.

And anyway, everything that Paul and Talis do is Open. They are leading the way in developing Open licences. I am sure it will be a good business investment.

Talis licence for Open Data

Monday, September 24th, 2007

I used to think Open Data was simple – “facts are not copyrightable” and everything follows. Now I am wiser and realise that data are complex and need a lot of attention – fast. So it’s very valuable to see groups who are addressing the problem. Here is Paul Miller of Talis (who convened a WWW2007 session on Open Data):

18:11 24/09/2007, Paul Miller, Nodalities
In the world of creative works, notions espoused by Lawrence Lessig and others over a number of years are becoming increasingly well understood. A Creative Commons license, for example, is recognised as giving the holder of rights an ability to prospectively grant certain permissions rather than limit use of their work by expecting all comers to request these permissions, again and again. Those rights are not cast aside, removing all opportunities to protect your work, your name, or your potential revenue stream. Rather, you are provided with a means to explicitly declare that your work may be used and reused by others in certain ways without their needing to request permission. Any other use is not forbidden; those uses must simply be negotiated in the ‘normal’ way… a normal way that also applied to those uses covered by Creative Commons licenses before the advent of those licenses.

Creative Commons licenses are an extension of copyright law, as enshrined in the legal frameworks of various jurisdictions internationally. As such, it doesn’t really work terribly well for a lot of (scientific, business, whatever) data… but the absence of anything better has led people to try slapping Creative Commons licenses of various types on data that they wish to share. It will be interesting to see what happens, the first time one of those licenses needs to be upheld via a court!

At Talis, we have an interest in seeing large bodies of structured data available for use. Through the Talis Platform, we offer one means whereby such data may be stored, used, aggregated and mined, although we clearly recognise that similar data may very well also be required in similar contexts.

Recognising that contributors of such data need to be reassured as to the uses to which we – and others – may put their hard work, we spent some time a couple of years ago drafting something then called the Talis Community Licence. This draft licence is based upon protections enshrined in European Law, and has been used ‘in anger’ for a while to cover contributions of millions of records to one particular application on the Talis Platform.

There has been plenty of talk around ‘open data’ here on Nodalities, and on our sister blog Panlibus. See, for example, this recent post from Rob Styles. There were also fascinating discussions at the WWW2007 conference earlier this year.

Despite interest in open (or ‘linked’) data, licenses to provide protection (and, of course, to explicitly encourage reuse) are few and far between. Amongst zealous early adopters, there does seem to be a tendency to either (mis)use a Creative Commons license, to say nothing whatsoever, or to cast their data into the public domain. None of these strategies are fit for application to business-critical data.

Building upon our original work on the TCL, we recently provided funding to lawyers Jordan Hatcher and Charlotte Waelde. They were tasked with validating the principles behind the license, developing an effective expression of those principles that could be applied beyond the database-aware shores of Europe, and working with us to identify a suitable home in which this new licence could be hosted, nurtured, and carried forward for the benefit of stakeholders far outside Talis.

Today, Jordan posted the latest draft of this license (now going by the name ‘Open Data Commons’), some rationale, and pointers to various ways in which he – and we – are seeking input and further validation.

As my colleague Rob (again!) has argued, curators of data need an option on the permissions continuum between free-for-all and locked down. The Open Data Commons, née Talis Community Licence, offers that option.

Take a look. Think about how you would use it. Consider what sort of administrative framework you would want behind such a license. Join the conversation.

PMR: First of all, many thanks to Talis for funding legal work on Open Data. Whatever else, we have to remain within the legal framework or we court disaster at a later stage.

There will not be a single approach to this, any more than there is a single Open Source licence. Motivations vary and, even more importantly, data is more varied than software. I know of two other efforts: Science Commons (in Cambridge, US), springing from CC, and the Open Knowledge Foundation, set up by the tireless Rufus Pollock (in Cambridge, UK), who invited me to be on the board. We honour this by using the OKFN “Open Data” on our own CrystalEye. I expect that people will choose different licences to emphasize different policies. (For example, I currently use Artistic as my software licence as I don’t want the name JUMBO to be misused for derivative works which are not compliant. I might well use BSD elsewhere, and so on.)

As Paul says, please converse.

Blogging in science and mathematics

Sunday, July 1st, 2007

In a splendid post – reproduced in full (indented) – Kyle Finchsigmate highlights the difference between chemistry and other sciences.

WTF is up with the Science blogosphere?

15:11 30/06/2007, Kyle Finchsigmate, sciency politics, The Chem Blog
[Image: vismap.jpg]

I [i.e. KF] was recently interviewed by Nature on the state of blogs and anonymity and whatnot and the interviewer had an interesting question: Why is the population of chemistry blogs so high relative to other disciplines? There have been a number of posts on why chemistry is so absent in popular media – (IMHO, it lacks the “God element” present in biology/medicine and physics, especially astrophysics, which makes it less interesting to the masses. The complexities and fuzzy logic employed by astrophysicists requires nothing less than a religiosity to believe some of the odd shit they sling out.)

Anyway, it’s disheartening that there are very few blogs about physics and biology compared to chemistry since the future is integrated approaches.  If I could, I would seek out a lively biologist to team up with on the Chem Blog, but I know of none.  It would be awesome to have a readable blog about either field, since I consider both of them too far off topic to be approachable.

Anyway, my response to the Nature interviewer (the interview should be available via pod cast next month) was that the chemistry blog-o-sphere had a number of very strong voices and drew a lot of inspiration to a lot of people.  Particularly catalytic in that was Dylan Stiles, Paul Bracher and Paul Docherty.  When asked if I was a strong voice, I arrogantly replied affirmatively, but that’s just my style and I was the subject of the interview in any case, so I had to have some degree of impact.  I know that I have made no secret that I started blogging because of Dylan’s post on Otera’s Catalyst, which he employed in his recent Org. Let. publication.  (I did not find it via Bengu Sezen [a researcher involved in ongoing controversy about the validity of published results - PMR], though I did exploit her to jump start my blog via trackbacks to Dylan’s, which is a wee shameful, but it worked.  Besides, if Blogging really had any superstar it was her.  It is, after all, the news that makes the reporter, not the other way around [though, with blogging, that argument can be contested].)

The walls are too high really to make a plea to people in other fields to start blogs, since I don’t think physicists or biologists frequent this blog, but if they WERE here, I would ask them to consider it.

[PMR: Question to Kyle - is the diagram the chemical blogosphere?]

PMR: I had exactly this experience when I was invited to talk to the Mathematical Knowledge Management group last week. I asked if anyone blogged – not really. There are maths blogs but I get the impression that they are mainly aimed at school, problem solving, etc. No-one was blogging the meeting (other than me! – look for the mkm2007 tag). That’s a pity as the talks were excellent. As a crystallographer I was fascinated to hear about Tom Hales’ work on proving Kepler’s conjecture (cubic/hexagonal close packing really are the densest ways of packing uniform spheres in 3D).

In chemistry the blogosphere gives up-to-the-minute reports – in a large meeting it’s not impossible that people transfer sessions when they read the blogs (except that the ACS normally refuses to provide wireless even to speakers and you have to buy your own). I won’t speculate on why this is so, but I certainly felt more confident of starting a blog because Tenderbutton had shown the way. (It’s well known that his supervisor disapproved – “went ballistic” is what I heard from one senior chemist.) Note, however, that in synthetic chemistry there are long periods of watching reactions “bubble away” and blogging can be a near-zero-cost multitasking activity.
There are many motivations for blogging – when I talk to young scientists I point out that several chemists are now on the first steps to science journalism having been scooped up by science publishers.  For myself the blog has several novel advantages.

  • I can post ideas in progress. That’s anathema in chemistry, though I see signs we are changing it.
  • It summarises my current position – especially where peer-reviewed publication may not be the best way. Difficult to publish technology in science journals.
  • It is a platform for advocacy
  • It reaches out to other disciplines (and I’ll say more about maths in later posts)
  • It acts as a record of my talks. In general I blog about what I am going to say at a meeting. This alerts people to the issues, and may also be a fallback if my machine breaks down. At WWW2007 I posted the summary of my ideas and many people in the audience had already read them or were following them as I went through the talk. This is especially useful where (as then) I only had 5-10 minutes – you can give details that you don’t have time to say physically. And, since my talks are stochastic, it reminds me if there is anything I have forgotten as I come to the end of the talk.

So I think my talk has catalysed at least a subset of the maths community to think about blogging. Michael Kohlhase’s blog is an example and I’ll be talking more later about the collaboration we have set up between MathML/OpenMath and CML – this might be exciting news for science publishers and reporters. So perhaps one of the most important aspects of blogging – for me – is:

  • A way of reaching beyond the boundaries of my own domain. It’s obviously an effective approach in Open Data, etc. as I have had several people in the LIS (library-information sci) community tell me they were glad I had restarted my blog. I think that Michael and I will make it work for chemistry and maths. He is intimately connected to the area of mathematical knowledge while I connect to the Blue Obelisk of chemical open source. Thus if we say “who would like to help with the management of geometrical algorithms in the BO repository?” it’s quite possible we’ll get someone from the maths community being interested. And when – as we hope – MathML and CML start to really interoperate we will have the basis of some of the formal knowledge architecture of the immediate future.

That couldn’t happen without the blogosphere. Blogging is an integral part of modern scientific knowledge. And the more enlightened scientific publishers know it. Unfortunately very few senior chemists do.

Audible Open Data at WWW2007

Saturday, May 19th, 2007

Danny Ayers, who ran the developers track at WWW2007, recorded our Open Data session. Some presentations had slides, and Steve Coast and I in particular used animated/interactive material, but I think the ideas come across. The Q&A had a lot of audience participation (which we all encouraged) but not all speakers were close to microphones – but hey, it’s the first time!

Danny Ayers: Open Data Podcasts

The Dev Track at WWW2007 began with a group of presentations on Open Data, chaired by Paul Miller of Talis. I did a rough & ready recording, which I’ve trimmed and chunked. Quality varies quite a bit (Q&A particularly), and some slides were shown, but I believe there’s plenty in the audio to make sense of the material.

Personally I was particularly pleased by this session because it revealed information that I for one would never have searched out without prompting. Turned out to be very interesting. “When you start using data, you need to start paying attention to these things…”

In retrospect the session was perhaps slightly mis-billed having “Semantic Web” in the title, when the material was about data in general. But the room was full and the audience engaged, so no harm done. In fact I think RDF was only mentioned once – see if you can spot it…  

Q: “Where do we go from here?”
A: “Evangelize!”

Many thanks to everyone involved.

I was particularly pleased to see the wide engagement – this is dev-track which is not about politics – but it was clear that access to data really matters to a lot of people. There is obviously the need for licenses – Talis have been working on one, for example.

Thanks Danny

WWW2007 postscript

Sunday, May 13th, 2007

I am delighted that I had the chance to go to WWW2007 – at one stage I’d wondered whether there would be anything of interest other than the session I was in (Open Data). Or that I would know anyone… After all it was 13 years since the last/first WWW meeting I went to (although obviously there is a lot of overlap with XML). And would I have lost touch with all those W3C Recommendations (== standards)? As it turned out I got so excited I found it difficult to sleep.

The features I take away are:

  • “Web 2.0” is big with the industry people – the keynotes (I’ve already mentioned TimBL) concentrated on the new webSociety where the technical stuff should be part of the plumbing. Nothing really new but optimism about pixelsEverywhere (i.e. we shan’t need laptops – we read our email on the gaspumps) – trust and identity, revenue generation, etc.
  • “Semantic Web” – overlaps with, but is different from, Web2.0. The immediate progress (for which I am glad) will be lowercasesw – just do it NOW! – for which the human nodes and arcs will be critical. The sw will be rather like a W. Heath Robinson machine – all string and sealing-wax – but every joint will be surrounded by humans pouring on oil, adding rubber bands, etc. We’ve no idea what it will evolve to, but we are optimistic.
  • “Linked Data” – a very strong and exciting theme. We are generating RDF triples in advance of knowing how we are going to connect them. It’s somewhat like a neural net. We think there will be an explosion of insight when this happens – beyond what we have done with Web2.0 mashups – useful though those are. I’m currently trying to load the basic tools so I can play with real stuff.
  • “Open Data”. Very positive and exciting. There is no doubt that the Web of the next few years will be data driven. Everyone was dismissive of walled gardens and sites without RDF-compatible APIs – including Creative and other Commons licenses. The semantic web can only function when data flows at the speed of the internet, not the speed of lawyers, editors and business managers. And I have no doubt that there will be businesses built on Open Data. Excitingly for me there seems to be no real difference between OpenData in maps, logfiles, and scholarly publications. (So I’m looking forward to XTech2007)
  • Sense of community and history. A strong desire to preserve our digital history. Google finds the following image from WWW94 and CERN

P. Murray-Rust

Yes – I was running a biology and the Web session, only to find that Amos Bairoch was in the audience! How much of this is still in the collective web semi-consciousness? Somehow I am assuming that everything I now do leaves preserved digital footprints – is that naive? And what, if anything, could I do?

What’s in a namespaceURI?

Sunday, May 13th, 2007

On more than one occasion we had heated debates about whether a namespaceURI may/must be resolvable. In the session on Linked Data TimBL made it clear that he thought that all namespaceURIs must be resolvable. This conflicted with my memory of the Namespaces in XML spec, which I remembered as saying that the namespace was simply a name (indeed there can be problems when software attempts to resolve such URIs). So I turned to Namespaces in XML 1.0 (Second Edition), which is more recent (and which I hadn’t read), and I’m not sure I’m any clearer. I can find:

“Because of the risk of confusion between URIs that would be equivalent if dereferenced, the use of %-escaped characters in namespace names is strongly discouraged.”

“It is not a goal that it be directly usable for retrieval of a schema (if any exists). Uniform Resource Names [RFC2141] is an example of a syntax that is designed with these goals in mind. However, it should be noted that ordinary URLs can be managed in such a way as to achieve these same goals.”

So this sounds like “may” rather than “must” be dereferenceable.
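The “just a name” reading can be illustrated with a minimal sketch, assuming Python’s standard `xml.etree.ElementTree` parser and a made-up namespace name (`http://example.org/ns/chem` is hypothetical and is never fetched). The parser only expands the prefix into the namespace name and compares it as a string, which matches the spec’s rule that namespace names are identical only if they are the same sequence of characters.

```python
import xml.etree.ElementTree as ET

# Hypothetical namespace name, used purely as an identifier; nothing resolves it.
NS_NAME = "http://example.org/ns/chem"

DOC = f'<c:molecule xmlns:c="{NS_NAME}" id="m1"><c:atom id="a1"/></c:molecule>'

# Parsing needs no network access: the prefix is expanded to {namespace}localname.
root = ET.fromstring(DOC)
qualified_tag = root.tag

# Lookup with the exact namespace name finds the child element.
atoms_exact = root.findall("c:atom", {"c": NS_NAME})

# A name differing only in case is a different namespace, because the
# comparison is character-for-character, not by URI equivalence.
atoms_other_case = root.findall("c:atom", {"c": NS_NAME.upper()})

print(qualified_tag)
print(len(atoms_exact), len(atoms_other_case))
```

Whether the same URI should also return something useful when dereferenced is a separate question that this parser never asks.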

Now namespaceURIs also exist in RDF documents (whether or not in XML format), and Tim was very clear that all URIs must be dereferenceable. I don’t know whether this is formalised.

Looking for RDF I find Resource Description Framework (RDF) / W3C Semantic Web Activity which contains:

“The RDF Specifications build on URI and XML technologies”

and the first link contains:

“Uniform Resource Identifiers (URIs, aka URLs) are short strings that identify resources in the web: documents, images, downloadable files, services, electronic mailboxes, and other resources. They make resources available under a variety of naming schemes and access methods such as HTTP, FTP, and Internet mail addressable in the same simple way. They reduce the tedium of “log in to this server, then issue this magic command …” down to a single click.”

All documents date from 2006.

So I think there are “XML namespaceURI” and “RDF namespaceURI”, which if not identified separately are confusing. Or maybe the time has come to make all namespaceURIs dereferenceable even if their owners assert they are only names. In which case what is the value of the resource? The simplest should be the “Hello World!” of the URI:

“Hello Tim!”

I shall try to make namespaceURIs resolvable although this is difficult when not connected to the Internet.

Web 2.0 and/or Semantic Web

Saturday, May 12th, 2007

Web 2.0 and Semantic Web are sometimes used synonymously, sometimes treated as distinct. I’ve come in halfway through a presentation (missed speaker’s name) and taken away:

Web 2.0

  • blogging
  • AJAX
  • small-scale mashups
  • proprietary APIs
  • niche vocabularies
  • screenscraping

whereas Semantic Web is

  • large-scale data linking
  • comprehensive ontologies
  • standard APIs
  • well-defined data export
  • data reconciliation

and suggested that we put them together as:

  • blogging
  • AJAX
  • large-scale data linking
  • standard APIs
  • niche vocabularies
  • well-defined data export
  • data reconciliation

“There’s just one Web after all”

Parsing Microformats (revised post)

Saturday, May 12th, 2007

Useful presentation online (in S5) from Ryan King (of Technorati) on parsing microformats. (I’ve been out of touch with HTML4 and I’m learning things.) We’ll need a day or two of virtual Blue Obelisk discussion to make sure we are adhering to the specs (yes, there are some). You don’t have to LIKE them – but they seem to be the way that it works. For example the value of a class may be a list of whitespace-separated tokens. Spans may be nested. All class names are lowercase.

I tried to give the examples in an earlier version of this post but the raw XHTML breaks WordPress. You’ll have to read Ryan’s talk – it’s very clear there.
The main thing is that we have to know what we are doing, not make it up from HTML vocabulary as we go along. So it’s definitely important that the Blue Obelisk has a Wiki page on how we should be using microformats. If Ryan has material relevant to BO I’ll blog it later.
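Two of the rules mentioned above (a class value is a list of whitespace-separated tokens, and spans may be nested) can be shown with a minimal sketch using Python’s standard `html.parser`. This is not Ryan’s parser and not a full microformats implementation; the markup and the `ClassTokenCollector` name are made up for illustration.

```python
from html.parser import HTMLParser


class ClassTokenCollector(HTMLParser):
    """Record the class tokens of each <span> with its nesting depth."""

    def __init__(self):
        super().__init__()
        self.open_spans = []  # token lists of the spans currently open
        self.found = []       # (depth, tokens) in document order

    def handle_starttag(self, tag, attrs):
        if tag != "span":
            return
        # split() with no argument splits on any run of whitespace,
        # so "fn   org" and "fn org" give the same tokens.
        tokens = (dict(attrs).get("class") or "").split()
        self.found.append((len(self.open_spans), tokens))
        self.open_spans.append(tokens)

    def handle_endtag(self, tag):
        if tag == "span" and self.open_spans:
            self.open_spans.pop()


# Illustrative markup only: an outer span containing a nested span
# whose class attribute carries two tokens.
SNIPPET = '<span class="vcard"><span class="fn   org">Blue Obelisk</span></span>'

collector = ClassTokenCollector()
collector.feed(SNIPPET)
print(collector.found)
```

Treating `class` as a token list rather than a single string is the point: a test such as `"fn" in tokens` works however many other tokens share the attribute.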

Yahoo! pipes – yet another workflow?

Friday, May 11th, 2007

Nice presentation about YP – looks like they are going to start allowing custom web services. Drastically reduces coding – often to zero:

  • 10% of the web 2.0 pyramid (coders, remixers, bloggers)
  • assume prior knowledge (loops, data types)

Heart of system is, of course, web data.

  • engine tuned for RSS but not necessarily.
  • editor – nearly everything can be done in the browser. Instant “ON” – no downloads. Dataflow apps tuned to visual programming. Learn and propagate by “view source” (this is a valuable metaphor)
  • design – easy to use. highlights valid connections. l2r readability. (find pizza within 1 mile of foo), dragability. debuggable on refresh.

Certainly looks slicker than Taverna. Uses the <canvas> tag in many browsers. Runs on any modern browser (IE6/7 via SVG). Performance degrades with transparent layers. The worst problem for canvas is that it occludes DOM events (only click). [Obviously fairly hairy programming was required - transparency, drag etc.] API rate limits (i.e. if your pipe is popular you might use up your API rate).
XML, JSON, KML, GeoRSS. Disposable applications? And perhaps XML-over-the-web has finally arrived?
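The dataflow idea (a feed goes in, small stages are chained, a filtered feed comes out) can be sketched in a few lines of plain Python. This is only an analogy under stated assumptions: it does not use any Yahoo! Pipes API, the feed is an inline made-up RSS fragment, and the stage names `fetch`, `keep_if` and `sort_by` are hypothetical.

```python
import xml.etree.ElementTree as ET

# Made-up RSS 2.0 fragment standing in for a real feed.
RSS = """<rss version="2.0"><channel>
<item><title>Pizza near foo</title><link>http://example.org/1</link></item>
<item><title>Sushi near bar</title><link>http://example.org/2</link></item>
<item><title>Best pizza ovens</title><link>http://example.org/3</link></item>
</channel></rss>"""


def fetch(rss_text):
    """Source stage: yield one dict per <item> in the feed."""
    for item in ET.fromstring(rss_text).iter("item"):
        yield {"title": item.findtext("title"), "link": item.findtext("link")}


def keep_if(items, word):
    """Filter stage: pass on items whose title contains the word."""
    return (i for i in items if word.lower() in i["title"].lower())


def sort_by(items, key):
    """Sort stage: order items by one of their fields."""
    return sorted(items, key=lambda i: i[key])


# Wiring the stages together is the whole "program".
result = sort_by(keep_if(fetch(RSS), "pizza"), "title")
titles = [i["title"] for i in result]
print(titles)
```

Each stage only consumes and produces items, which is what makes the boxes in a visual editor freely connectable.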

the chemical semantic web has arrived! just do it NOW

Friday, May 11th, 2007

I have been overwhelmed with excitement about the new maturity of semantic technology and RDF data that is available for our construction of the chemical semantic web. Note that I used to write “Chemical Semantic Web” with the assumption that it had to use the whole paraphernalia of the Semantic Web. But – as a newly discovered “scruffy” – I now know that we only need lightweight – very lightweight – approaches and this is usually labelled as the “lowercase semantic web”. So, from now on, I’ll probably write “csw”.

The vision is simple – make everything a URI and use RDF to support searches using the modern generation of tools. We’ve had several sessions – some under the theme “Linking Data”. The first contained tutorial material; the second – at which TBL and others demonstrated examples and vision – blew away all my scepticism.
Tim’s message was simple – don’t hang around – just do it NOW! Unfortunately I don’t have any material to hand and will have to rely on memory. But I have no doubt that we have the chance to transform the world of chemical information within months. And remember that we can now start using machines to help.
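To make “everything a URI plus RDF” concrete, here is a minimal sketch that models RDF-style triples as plain Python tuples and answers a query by pattern matching. It uses no RDF library; the `http://example.org/chem/` URIs, the property names and the `match` helper are all made up for illustration.

```python
# Hypothetical base URI; nothing here is a real chemical vocabulary.
EX = "http://example.org/chem/"

# Each statement is a (subject, predicate, object) triple.
triples = {
    (EX + "benzene", EX + "formula", "C6H6"),
    (EX + "benzene", EX + "label", "benzene"),
    (EX + "toluene", EX + "formula", "C7H8"),
}


def match(s=None, p=None, o=None):
    """Return the triples matching a pattern; None acts as a wildcard."""
    return sorted(
        t for t in triples
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    )


# "Which things have a formula, and what is it?"
formulas = match(p=EX + "formula")
for subject, _, formula in formulas:
    print(subject, formula)
```

Because subjects and predicates are URIs, triples published by different people about the same compound can be merged simply by taking the union of the sets.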

There’s a huge number of tools! And I am still struggling to know which to use. And some of the terminology (from ontoworld):

A SPARQL endpoint is a conformant SPARQL protocol service as defined in the SPROT specification. A SPARQL endpoint enables users (human or other) to query a knowledge base via the SPARQL language. Results are typically returned in one or more machine-processable formats. Therefore, a SPARQL endpoint is mostly conceived as a machine-friendly interface towards a knowledge base. Both the formulation of the queries and the human-readable presentation of the results should typically be implemented by the calling software, and not be done manually by human users.

So from what I can see

  • we find a URI (dereferenceable) linked to a set of RDF that we are interested in (e.g. “chempedia”)
  • point it at an endpoint (e.g. tabulator)

and then issue a query.

I’m probably wrong, but I’ll know more tomorrow. At least I am doing it NOW!
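As a sketch of what “issue a query” looks like on the wire: the SPARQL protocol lets a client send the query text as a `query` parameter in an HTTP GET. The endpoint URL and the property URI below are hypothetical, so the code only builds the request URL and decodes it again rather than contacting any server.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# Hypothetical endpoint; substitute a real SPARQL endpoint to run a query.
ENDPOINT = "http://example.org/sparql"

# Hypothetical property URI; the query asks for subjects and their formulas.
QUERY = """SELECT ?s ?formula
WHERE { ?s <http://example.org/chem/formula> ?formula }
LIMIT 10"""


def build_request_url(endpoint, query):
    """Encode the query text as the 'query' parameter of a GET request."""
    return endpoint + "?" + urlencode({"query": query})


url = build_request_url(ENDPOINT, QUERY)

# Round trip: decoding the URL gives back exactly the query that was sent.
decoded = parse_qs(urlsplit(url).query)["query"][0]

print(url)
```

Fetching `url` from a live endpoint (for example with `urllib.request.urlopen`) would return the results in a machine-processable format, which the calling software is then expected to present.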