Category Archives: XML

MathML and CML communities

I was delighted to meet old friends from the MathML/OpenMath community last week at Mathematical Knowledge Management 2007 – Patrick Ion, Robert Miner, James Davenport and Michael Kohlhase (apologies to any I have omitted). OpenMath (1993) was one of the first non-textual markup languages and was based on SGML, while MathML came along later (1999). The languages are distinct but deliberately converging and (from WP):

OpenMath consists of the definition of “OpenMath Objects”, an abstract datatype for describing the logical structure of a mathematical formula, and the definition of “OpenMath Content Dictionaries”, or collections of names for mathematical concepts. The names available from the latter type of collections are specifically intended for use in extending MathML, and conversely, a basic set of such “Content Dictionaries” has been designed to be compatible with the small set of mathematical concepts defined in Content MathML, the non-presentational subset of MathML.

so I shall tend to use them interchangeably. Note, however, that MathML is an activity of the W3C, while (WP)

OpenMath has been developed in a long series of workshops and (mostly European) research projects that began in 1993 and continues through today.

MathML and CML have had a long history of association. We tend to present on the same platforms (e.g. NSF / NSDL Workshop on Scientific Markup Languages). Each has its particular growth points – they are accepted as formal means of scholarly publication by several major publishers and there are a variety of toolkits.

Here I want to emphasize that each is required not just in its own domain, but by neighbouring ones. Thus chemistry needs MathML, geology needs CML, etc. This requires a different mindset when developing tools – it isn’t necessary to address all the cutting edge research in the mother subject – but important to make sure that you can solve a useful number of problems in everyday science and engineering.

As an example I asked the maths community whether I could search for a given differential equation, e.g.:

dx/dt = -k*x

You can, of course, type this directly into Google and get results like this but that only works when the variables are x and t. Thus

da/dt = -ka  … or …

da/a + k dt = 0
or many other forms represent the same abstract equation.
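
All of these are alpha-renamings of one abstract equation. Writing it with neutral symbols (my notation, purely for illustration):

\[ \frac{\mathrm{d}u}{\mathrm{d}v} = -c\,u \quad\Longrightarrow\quad u(v) = u_0\,e^{-cv} \]

A structure-aware search must therefore match the functional form – the first derivative of the dependent variable proportional to minus the variable itself – whatever the bound names (x, t, k or a, t, k) happen to be.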

So I was delighted to find that several people were actively working on this – it means we can search the world’s literature for given functional forms independently of how they are represented. It’s hard – in some cases very hard – and notational conventions vary between countries. It’s similar to the chemist’s use of InChI (see Unofficial InChI FAQ) to normalize and canonicalize chemical structure (it doesn’t matter whether you write HCl or ClH – the InChI is the same). And Google is quite good at finding these forms.
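
To make the chemical analogy concrete, here is a minimal sketch using RDKit – my choice of toolkit, and it must be built with InChI support – showing that two spellings of hydrogen chloride give one identifier:

from rdkit import Chem

# hydrogen chloride written two ways in SMILES
for smiles in ("Cl", "[H]Cl"):
    mol = Chem.MolFromSmiles(smiles)
    # both iterations print the same canonical identifier: InChI=1S/ClH/h1H
    print(Chem.MolToInchi(mol))

The canonicalization, not the toolkit, is the point: any conformant InChI implementation should give the same answer.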

Even more fundamental is the use of dictionaries – OM has the content dictionaries and CML has CMLDict/Golem. They aren’t identical, but close enough that it’s easy to convert between them. The dictionary concept is very powerful and allows languages to be extended almost indefinitely. It also allows different groups to develop their own systems – which may even be incompatible – you simply load in the appropriate dictionary, and the software, in effect, is already written.
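
To sketch why this keeps the software generic (the XML below is a made-up miniature in the spirit of CMLDict, not a real schema):

import xml.etree.ElementTree as ET

# a hypothetical miniature dictionary; element and attribute names are illustrative
dict_xml = """
<dictionary namespace="http://example.org/chemDict">
  <entry id="meltingPoint" term="melting point" units="K"/>
  <entry id="boilingPoint" term="boiling point" units="K"/>
</dictionary>
"""

# generic code: it knows nothing about melting points until a dictionary is loaded
entries = {e.get("id"): e.attrib for e in ET.fromstring(dict_xml)}
print(entries["meltingPoint"]["units"])  # prints: K

Swap in a different community’s dictionary and the same code serves it unchanged.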
So there is now a strong bond between the MathML and CML communities. They are starting to adopt the idea of blogging and social computing (chemistry has led the way here), while we shall adopt some of the formalities of OM in our representations of physical science.

We’re going to pursue the following (at least) and keep in touch through the blogosphere:

  • mixed mathematics and chemistry (see next post)
  • social computing, which could involve student projects, etc.
  • combining forces in the advocacy of markup languages in scholarly scientific publications and the communal dissemination of data.

So – to show this isn’t just talk, MichaelK and I are starting to see how a “simple” formula in physical chemistry can be represented. We’ll show you shortly.

Useful chemistry thesis in RDF

I shall be using Alicia’s Open Science Thesis in Useful Chemistry as a technical demonstrator at ETD2007. I really want to show how a born digital thesis is a qualitative step forward. Completely new techniques can be used to structure, navigate and mine the information. Here’s a taster:

A chemical reaction diagram (“scheme”) is a graphic object which looks like this:

[image: udc_scheme2.JPG]

As you can see this is semantically useless. A lot of work has gone into it, but none of it is useful to a machine (look closely and you’ll see it’s a JPEG). Even in the native software used to draw it, the semantics are unlikely to be easily recoverable. However XML and RDF allow a complete representation. It took me about 1 hour to handcraft the topology – if we had decent tools it would take seconds. The complete set of reaction schemes (I counted 11 in the thesis) can easily be converted to a single RDF file which looks something like this:

uc:scheme1_1 pmr:isA pmr:reactionScheme .
uc:scheme1_1 pmr:hasA uc:rxn1_1a .
uc:scheme1_1 pmr:hasA uc:rxn1_1b .

uc:rxn1_1a pmr:hasReactant uc:comp1 .
uc:rxn1_1a pmr:hasReactant uc:comp2 .
uc:rxn1_1a pmr:hasReactant uc:comp3 .
uc:rxn1_1a pmr:hasReactant uc:comp4 .
uc:rxn1_1a pmr:hasProduct uc:comp5 .
uc:rxn1_1b pmr:hasReactant uc:comp5 .
uc:rxn1_1b pmr:hasProduct uc:comp6 .

(uc: refers to the usefulChemistry namespace, pmr: to mine).
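
As a sketch of where this leads, the triples can be loaded and queried programmatically. The namespace URIs below are placeholders of my own invention (the post doesn’t fix them), and rdflib is just one convenient toolkit:

import rdflib

# prefix declarations are assumed; the example.org URIs are illustrative only
data = """
@prefix uc:  <http://usefulchem.example.org/ns#> .
@prefix pmr: <http://pmr.example.org/ns#> .

uc:rxn1_1a pmr:hasReactant uc:comp1 .
uc:rxn1_1a pmr:hasProduct  uc:comp5 .
uc:rxn1_1b pmr:hasReactant uc:comp5 .
uc:rxn1_1b pmr:hasProduct  uc:comp6 .
"""

g = rdflib.Graph()
g.parse(data=data, format="turtle")

# "which compound is the product of one reaction and a reactant of another?"
q = """
PREFIX pmr: <http://pmr.example.org/ns#>
SELECT ?rxn1 ?compound ?rxn2
WHERE { ?rxn1 pmr:hasProduct  ?compound .
        ?rxn2 pmr:hasReactant ?compound . }
"""
for row in g.query(q):
    print(row.rxn1, row.compound, row.rxn2)  # finds comp5 linking rxn1_1a to rxn1_1b

This is exactly the kind of question a SPARQL endpoint answers over the whole thesis at once.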

There are many Open Source tools for graphing this, and here is part of the output of one from the W3C:

[image: scheme1.png]

Here you can see that reaction rxn1_1a has four reactants (comp1–comp4) and one product (comp5); comp5 is in turn the reactant for another reaction (the graph is clipped here to keep it blog-sized). The complete picture for the whole thesis looks like this:

[image: reactions1.png]

and (assuming you have a large screen) you can see immediately what reactions every compound is involved in.

That’s only the start, as it is possible to ask sophisticated questions from a SPARQL endpoint (as in the sketch above) – and that’s where we are going next…

… IFF you make the theses true Open Access

Bioclipse and the Information Revolution

I am honoured to have been asked to talk at the Embrace Workshop on Bioclipse 2007 (EWB 07) at BMC, Uppsala, Sweden, next week (2007-05-23). This post explains why Bioclipse is so important (it goes beyond bio/chem) and also provides the title and abstract of my talk. So first the facts – http://en.wikipedia.org/wiki/Bioclipse :

The Bioclipse project is a Java-based, open source, visual platform for chemo- and bioinformatics based on the Eclipse Rich Client Platform (RCP). Bioclipse uses, as any RCP application, a plugin architecture that inherits basic functionality and visual interfaces from Eclipse, such as help system, software updates, preferences, cross-platform deployment etc. Via these plugins, Bioclipse provides functionality for chemo- and bioinformatics, and extension points that easily can be extended by other, possibly proprietary, plugins to provide additional functionality.

The first stable release of Bioclipse includes a CDK plugin (bc_cdk) to provide a chemoinformatic backend, a Jmol plugin (bc_jmol) for 3D-visualization, and a BioJava plugin (bc_biojava) for sequence analysis. Bioclipse is developed as a collaboration between the Proteochemometric Group, Dept. of Pharmaceutical Biosciences, Uppsala University, Sweden, and the Research Group for Molecular Informatics at Cologne University Bioinformatics Center (CUBIC).

Bioclipse is based on the enormously professional and influential Eclipse framework – developed by IBM and made Open Source. I use Eclipse every day for my software development. It contains a rich set of resources (editors, browsers, searchers) along with the management of key components (compilers, repositories (SVN/CVS)). But because the Eclipse framework is written so flexibly, many of these can be “stripped out” and replaced with domain-specific components (for bio- and chem- applications). Not surprisingly many of the Blue Obelisk projects have produced components which are now part of, or pluggable into, Bioclipse.

Over the last two weeks I have been heavily influenced by the vision of the “lowercase semantic web” and this will be an important aspect of my presentation:

“Bioclipse and the Information Revolution”

(Peter Murray-Rust,

Unilever Centre for Molecular Science Informatics, Department of Chemistry, Cambridge, CB2 1EW, UK)

Chemistry is a complex subject and its information management requires complex software. Traditionally this has been provided by groups (often commercial companies) offering monolithic software systems, and by large information aggregators who compile, curate and redistribute products and services. In recent years innovation and value creation have slowed, and much of the emphasis has been on integration within commercial customers (e.g. pharmaceutical) rather than the development of new functionality. In particular the academic community – on whose research the industry relies – is deprived of a software and information environment in which it can freely innovate.

By contrast the web has seen a recent explosion of innovation and wealth creation – often categorised as “Web 2.0” or “semantic web”. This is exemplified by the rise of the blogosphere (see Chemical blogspace) where many (young) scientists are trying new ways of communication and information re-use.

But the current web is based very heavily on text and graphics and has very little support for formalized disciplines such as chemistry. The browsers have little native support for XML (and what little there is can be found in vertical plugins, e.g. for mobile telephony). Much of the technology is based on centralised APIs such as Google Maps, whose centralist, thin-client model does not translate to chemistry. And, if it did, it could consolidate the central control of information, which many of us feel to be restrictive.

The current set of tools (wikis, blogs, etc.) is syntactically weak and (excluding a few experiments) has no semantic support. Current authors require semantic chemical tools, but are frustrated. Most rich chemical information rests on the laboratory bench – molecules, reactions, spectra, crystal structures, reports, recipes, etc. If this were made publicly available in semantic form, chemistry could move towards a peer-to-peer network that accurately represented the “ownership” of information.

The chemistry Open Source and Open Data community has now produced a critical mass of tools, many in wide use (post-beta) with more at alpha and beta. They have been brave in that they create components, often unglamorous but increasingly robust, which are interoperable and reconfigurable. They are increasingly being taken seriously, for example in pharma.

Until now the bench chemist – often trained on “clicks” within a Microsoft environment and unfamiliar with commandlines or scripting – would have found that too much integration was required. But Bioclipse can and will change that. Any tech support group in any institution will be familiar with “Eclipse” and can help with installation and integration, and maybe even wrap some plugins.

The challenge for Bioclipse is to generate “viral” penetration within the chemical community. To do this it must:

  • be trivially installable. (I am currently installing V1.1.1).
  • be navigable. Is the user interface – built around Eclipse perspectives – one that chemists can learn?
  • provide enough functionality to be useful.
  • require little or no maintenance.
  • ideally have a unique selling point (do something useful that other systems don’t)
  • interoperate with other systems (Bioclipse won’t be able to do everything)
  • create a semantically rich editor-browser platform (perhaps based on RDF)

This is a big challenge, but most of the Blue Obelisk and other Open Source community will be helping. (Bioclipse is Java, so non-Java applications such as OpenBabel and InChI require additional engineering). The areas where Bioclipse can take a lead include:

  • management of chemical documents (papers, theses, lab reports), using chemical linguistics such as OSCAR3
  • integration of structured ontologies such as GoldBook, ChEBI, CML dictionaries
  • validation of chemical information (using CML and other XML technologies, documents and data sets can be formally validated)
  • integration of robots (e.g. harvesting of public chemical information)
  • integration into the chemical blogosphere (e.g. support for microformats and RDF).
  • linking of information within chemistry (e.g. analytical data to spectra)
  • linking between disciplines (e.g. small molecules to bioscience applications)

Given these, and given support for “most” of what chemists already require, Bioclipse should have immediate appeal. This will be strengthened by the needs and support of other communities such as:

  • publishers (who need structured information that can be repurposed)
  • librarians (who need future-proof semantics for archival and preservation in institutional repositories)
  • regulators (who need searchable semantic information)

If it can spread virally, Bioclipse will be part of a Disruptive technology which will change the face of chemical information and effectively start the creation of the chemical semantic web.

XMLTech – XMLRDF

Alf Eaton and Gavin Bell (Nature) put together a lively BOF this evening on scientific publishing. They presented many of the key components – XML, persistent identifiers, ontologies, etc. Nice to see credit being given to PLoS for its pioneering use of these things (e.g. IDs for supplemental data).

A strong feeling from all that PDF must be supplemented or replaced by greater structure. “XML” is a useful mantra – although XML by itself is sometimes too constraining – and we need RDF. Maybe XMLRDF is a better mantra – it needs the XML to emphasise the difference from PDF and the RDF to point towards the future.

An anecdote of how the biter gets bitten – a publisher had acquired a chunk of content from another source (? merger/acquisition) and found that the PDFs were read-only – the hamburgers had been encrypted and the password lost. So they could be viewed but not re-used. Time for a change!

[ADDED IN PROOF] A much fuller post from Paul Miller

XTech2007 – XForms – do I need them?

Now at XTech2007 – arrived in time for the afternoon session on XForms by Steve Pemberton. XForms allows you to pass XML into/out of forms rather than relying on HTML. It includes things like validation – if you tell it something is a date, then you can check it makes sense as a date. And there’s stuff about credit cards, etc. So it makes sense to adapt them for – say – chemistry so that we can check data and molecules on submission.
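
XForms declares such constraints in markup rather than code, but as a rough Python analogue of what binding a field to the xsd:date type checks (a sketch of the idea, not XForms itself):

from datetime import date

def looks_like_xsd_date(value: str) -> bool:
    # xsd:date is essentially an ISO 8601 calendar date (YYYY-MM-DD)
    try:
        date.fromisoformat(value)
        return True
    except ValueError:
        return False

print(looks_like_xsd_date("2007-05-16"))  # True
print(looks_like_xsd_date("2007-13-01"))  # False – there is no thirteenth month

The attraction of XForms is that the form author states the constraint once, declaratively, and the processor enforces it everywhere.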
I hadn’t looked at them for ca 3-4 years as I hadn’t seen any implementations. In fact, according to Steve, XForms has been the MOST implemented W3C spec ever. The reason I had missed them is that they tend to be used in mobiles as well as browsers, and there is also a lot of star-centred business – a company whose customers all use XForms under central control. Nothing wrong with that, but it won’t be obvious to non-customers. Also the insurance industry has gone for them in a big way.

But most of the implementations come from the actual communities rather than being based on libraries (which is what we need). There is X-Smiles, which might help us – I think it’s now mature. But the scale seems a bit daunting: “we used to have 30 programmers working on UIs for 5 years, now we solved the problem in 1 year with only 10 programmers”. Sic.

But there do seem to be plugins for Firefox (or they are in the pipeline), using, I think, XBL, and some with SVG. So maybe there is still hope for the browser in this area.

But whether we can move quickly towards a validating chemical data entry tool… I will continue to hack with broken tools for a little while.

(In the original version of this post I used the erroneous “XMLForms”)

WWW2007 postscript

I am delighted that I had the chance to go to WWW2007 – at one stage I’d wondered whether there would be anything of interest other than the session I was in (Open Data), or whether I would know anyone… After all, it was 13 years since the last/first WWW meeting I went to (although obviously there is a lot of overlap with XML). And would I have lost touch with all those W3C Recommendations (== standards)? As it turned out I got so excited I found it difficult to sleep.

The features I take away are:

  • “Web 2.0″ is big with the industry people – the keynotes (I’ve already mentioned TimBL) concentrated on the new webSociety where the technical stuff should be part of the plumbing. Nothing really new but optimism about pixelsEverywhere (i.e. we shan’t need laptops – we read our email on the gaspumps) – trust and identity, revenue generation, etc.
  • “Semantic Web” – overlaps with, but is different from, Web 2.0. The immediate progress (for which I am glad) will be the lowercase sw – just do it NOW! – for which the human nodes and arcs will be critical. The sw will be rather like a W. Heath Robinson machine – all string and sealing-wax – but every joint will be surrounded by humans pouring on oil, adding rubber bands, etc. We’ve no idea what it will evolve into, but we are optimistic.
  • “Linked Data” – a very strong and exciting theme. We are generating RDF triples in advance of knowing how we are going to connect them. It’s somewhat like a neural net. We think there will be an explosion of insight when this happens – beyond what we have done with Web 2.0 mashups – useful though those are. I’m currently trying to load the basic tools so I can play with real stuff.
  • “Open Data”. Very positive and exciting. There is no doubt that the Web of the next few years will be data driven. Everyone was dismissive of walled gardens and sites without RDF-compatible APIs – including Creative and other Commons licenses. The semantic web can only function when data flows at the speed of the internet, not the speed of lawyers, editors and business managers. And I have no doubt that there will be businesses built on Open Data. Excitingly for me there seems to be no real difference between OpenData in maps, logfiles, and scholarly publications. (So I’m looking forward to XTech2007)
  • Sense of community and history. A strong desire to preserve our digital history. Google finds the following image from WWW94 and CERN:

[image from WWW94: P. Murray-Rust]

Yes – I was running a biology-and-the-Web session, only to find that Amos Bairoch was in the audience! How much of this is still in the collective web semi-consciousness? Somehow I am assuming that everything I now do leaves preserved digital footprints – is that naive? And what, if anything, could I do?

What’s in a namespaceURI?

On more than one occasion we had heated debates about whether a namespaceURI may/must be resolvable. In the session on Linked Data TimBL made it clear that he thought that all namespaceURIs must be resolvable. This conflicted with my memory of the Namespaces in XML spec, which I remembered as saying that the namespace was simply a name (indeed there can be problems when software attempts to resolve such URIs). So I turned to Namespaces in XML 1.0 (Second Edition), which is more recent (and which I hadn’t read), and I’m not sure I’m clearer. I can find:

“Because of the risk of confusion between URIs that would be equivalent if dereferenced, the use of %-escaped characters in namespace names is strongly discouraged. “

and

” It is not a goal that it be directly usable for retrieval of a schema (if any exists). Uniform Resource Names [RFC2141] is an example of a syntax that is designed with these goals in mind. However, it should be noted that ordinary URLs can be managed in such a way as to achieve these same goals.”

So this sounds like “may” rather than “must” be dereferenceable.

Now namespaceURIs also exist in RDF documents (whether or not in XML format), and Tim was very clear that all URIs must be dereferenceable. I don’t know whether this is formalised.

Looking for RDF I find Resource Description Framework (RDF) / W3C Semantic Web Activity which contains:

“The RDF Specifications build on URI and XML technologies”

and the first link contains:

“Uniform Resource Identifiers (URIs, aka URLs) are short strings that identify resources in the web: documents, images, downloadable files, services, electronic mailboxes, and other resources. They make resources available under a variety of naming schemes and access methods such as HTTP, FTP, and Internet mail addressable in the same simple way. They reduce the tedium of “log in to this server, then issue this magic command …” down to a single click.

All documents date from 2006.

So I think there are “XML namespaceURIs” and “RDF namespaceURIs”, which are confusing if not identified separately. Or maybe the time has come to make all namespaceURIs dereferenceable even if their owners assert they are only names. In which case, what is the value of the resource? The simplest would be the “Hello World!” of the URI:

“Hello Tim!”

I shall try to make namespaceURIs resolvable although this is difficult when not connected to the Internet.
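
For the XML side, at least, it is easy to demonstrate that a namespaceURI functions purely as a name. A small sketch (the namespace is CML’s real one; the point is that nothing is ever fetched from it):

import xml.etree.ElementTree as ET

doc = '<m:molecule xmlns:m="http://www.xml-cml.org/schema"/>'
root = ET.fromstring(doc)

# the parser folds the namespace into the element name as an opaque string;
# no network request to www.xml-cml.org is ever made
print(root.tag)  # {http://www.xml-cml.org/schema}molecule

Whether RDF processors should behave differently – actually dereferencing the URI – is exactly the open question above.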

Web 2.0 and/or Semantic Web

Web 2.0 and Semantic Web are sometimes used synonymously, sometimes as distinct ideas. I’ve come in halfway through a presentation (missed the speaker’s name) and taken away:

Web 2.0

  • blogging
  • AJAX
  • small-scale mashups
  • proprietary APIs
  • niche vocabularies
  • screenscraping

whereas Semantic Web is

  • large-scale data linking
  • comprehensive ontologies
  • standard APIs
  • well-defined data export
  • data reconciliation

and suggested that we put them together as:

  • blogging
  • AJAX
  • large-scale data linking
  • standard APIs
  • niche vocabularies
  • well-defined data export
  • data reconciliation

“There’s just one Web after all”

Parsing Microformats (revised post)

Useful presentation online (in S5) from Ryan King (of Technorati) on parsing microformats. (I’ve been out of touch with HTML4 and I’m learning things.) We’ll need a day or two of virtual Blue Obelisk discussion to make sure we are adhering to the specs (yes, there are some). You don’t have to LIKE them – but they seem to be the way that it works. For example, the value of a class may be a list of whitespace-separated tokens. Spans may be nested. All class names are lowercase.

I tried to give the examples in an earlier version of this post but the raw XHTML breaks WordPress. You’ll have to read Ryan’s talk – it’s very clear there.
The main thing is that we have to know what we are doing, not make it up from HTML vocabulary as we go along. So it’s definitely important that the Blue Obelisk has a Wiki page on how we should be using microformats. If Ryan has material relevant to BO I’ll blog it later.
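
Since raw XHTML breaks WordPress, here is the whitespace-separated-tokens rule as code instead. A sketch (the class name "fn" is from hCard; the parsing approach is mine, not Ryan’s):

from html.parser import HTMLParser

class ClassFinder(HTMLParser):
    """Collect tags whose class attribute contains the target token."""
    def __init__(self, token):
        super().__init__()
        self.token = token
        self.hits = []

    def handle_starttag(self, tag, attrs):
        # a class attribute is a whitespace-separated token list, so split() it;
        # comparing the whole string would wrongly miss class="fn url"
        if self.token in dict(attrs).get("class", "").split():
            self.hits.append(tag)

finder = ClassFinder("fn")
finder.feed('<span class="fn url">Blue Obelisk</span>')
print(finder.hits)  # ['span'] – "fn" matches even inside a multi-token class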

ICD-10 – past and present

I am really excited and pleased by Peter Suber’s latest post: WHO converts a disease database to a wiki.

WHO adopts Wikipedia approach for key update, CBC News, May 2, 2007. Excerpt:

If the collaborative wiki process works for compiling an encyclopedia, couldn’t the same approach work for classifying all the diseases and injuries that afflict humankind? The World Health Organization thinks it can.

It is embarking on one of its periodic updates of a system of medical coding called the International Classification of Diseases, or ICD, and it wants the world’s help doing it.

While work on previous versions has been the domain of hand-picked experts, this time the Geneva-based global health agency is throwing open its portal to anyone who wants to weigh in on the revision….

The new, more open approach to updating the disease classifications won’t be entirely wiki-esque. That process, with its anyone-can-edit approach, builds a degree of vulnerability into the end product, with some contributors deliberately planting false information for the fun of it.

With the ICD, people can propose changes and argue for them on a WHO-sponsored blog. But groups of subject matter experts will weigh and synthesize the suggestions, said [Robert Jakob, the WHO medical officer responsible for the ICD]….

OK, the WHO is a good thing and I’m very keen on Open collaborative science and medicine, but there is a special aspect for me. For several years I worked on ICD-10 to convert it to XML – before most people had ever heard of XML. ICD-10 has thousands of diseases in a hierarchical classification and a code for each. Nowadays it would probably be called an ontology.

Lesley West and I created a DTD to hold the ICD-10 and other dictionaries – we called our approach the “Virtual Hyperglossary”. We created a company to produce information products for the pharmaceutical industry and the regulatory process in which ICD-10 and other terminologies were important.

How naive our efforts look now! We used a DTD rather than a schema. We didn’t have language-processing tools to give the translation of the scanned material better structure. We used ISO 12620 as the basis – and I spent time in the ISO meetings. Pace was often slow. But it was ahead of its time.

Most things in the brave new world don’t make it – the VHG was one such. There are lots of others. I’ve moved a long way since then. But some of what we are doing now will make it and change the face of scholarship. We have only just started.