This is an attempt to explain why XML is important in a scientific context. I shall try to assemble as many reasons as possible, but there are also many other tutorials and overviews.
I believe that XML is a fundamental advance in our representation of knowledge. It’s not the first time this has been attempted – for example you can do anything in LISP that you can do in XML and a good deal more. But XML has caught on and is now found in every modern machine on the planet.
Let’s start with a typical piece of information:
Pavel Fiedler, Stanislav Böhm, Jiří Kulhánek and Otto Exner, Org. Biomol. Chem., 2006, 4, 2003
How do we interpret what this means? We guess that there are 4 authors (although it is not unknown for people to have “and” in their names), that the italic string is the abbreviation of a journal, and that 4 is a volume number. But what are “2006” and “2003”? Unless you know that the first number is the year and the third the starting page (see RSC site) you have to guess. And many of you would guess wrong.
If, however, this is created as:
<author>Pavel Fiedler</author>
<author>Stanislav Böhm</author>
<author>Jiří Kulhánek</author>
<author>Otto Exner</author>
<journal>Org. Biomol. Chem.</journal>
<year>2006</year>
<volume>4</volume>
<page>2003</page>
you can see that each piece of information is clearly defined. There is no reliance on position, formatting or other elements of style to denote what something means.
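To make the point concrete, here is a minimal sketch of how a program might read such a record by name rather than by position. It uses Python's standard `xml.etree.ElementTree` module. The wrapping `<citation>` root element, the `<volume>` element name for the number 4, and the `read_citation` function are my own assumptions for illustration; a well-formed XML document needs a single root, which the fragment above does not show.

```python
import xml.etree.ElementTree as ET

# The fragment from the text, wrapped in a hypothetical <citation> root
# because a well-formed XML document needs exactly one root element.
CITATION_XML = """<citation>
  <author>Pavel Fiedler</author>
  <author>Stanislav Böhm</author>
  <author>Jiří Kulhánek</author>
  <author>Otto Exner</author>
  <journal>Org. Biomol. Chem.</journal>
  <year>2006</year>
  <volume>4</volume>
  <page>2003</page>
</citation>"""


def read_citation(xml_text):
    """Return the citation fields as a dict, looked up by element name."""
    root = ET.fromstring(xml_text)
    return {
        "authors": [a.text for a in root.findall("author")],
        "journal": root.findtext("journal"),
        "year": int(root.findtext("year")),
        "volume": int(root.findtext("volume")),
        "page": int(root.findtext("page")),
    }


citation = read_citation(CITATION_XML)
print(citation["year"], citation["page"])  # prints: 2006 2003
```

The program never has to guess which number is the year: it asks for `year` and gets it, whatever order the elements appear in.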
But isn’t this harder to create and read? If everything is done by a human, perhaps. But almost all XML documents are authored by machines, either by editors or as the output of a program. And the good news is that the style (the italics, etc.) can be added automatically. XSLT allows very flexible and precise addition of style information through stylesheets.
So it won’t surprise you that publishers actually create their content in XML. When you submit a Word or LaTeX document it gets converted into XML, either by software (which isn’t always perfect) or by retyping :-(. The final formatting, as either PDF or HTML, can be done automatically by applying different stylesheets. So the document process is:
XML + PDFStylesheet -> PDF
XML + HTMLStylesheet -> HTML
The stylesheets don’t depend on the actual document being processed and work for any instance. Of course it takes some work and care to create them, but most of you don’t need to worry.
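The idea of one document and several interchangeable styles can be sketched in a few lines of Python. This is not XSLT: plain Python functions stand in for stylesheets here, purely to illustrate the separation, and a real publishing workflow would use an XSLT processor. The `<citation>` root, the `<volume>` element, and all function names are hypothetical.

```python
import xml.etree.ElementTree as ET
from html import escape

# Hypothetical citation document; the <citation> root and <volume> are assumed.
SAMPLE = """<citation>
  <author>Pavel Fiedler</author>
  <author>Stanislav Böhm</author>
  <author>Jiří Kulhánek</author>
  <author>Otto Exner</author>
  <journal>Org. Biomol. Chem.</journal>
  <year>2006</year>
  <volume>4</volume>
  <page>2003</page>
</citation>"""


def load(xml_text):
    """Extract the content only; no styling decisions are made here."""
    root = ET.fromstring(xml_text)
    fields = {tag: root.findtext(tag) for tag in ("journal", "year", "volume", "page")}
    fields["authors"] = [a.text for a in root.findall("author")]
    return fields


def join_authors(authors):
    """Join names as 'A, B, C and D'."""
    if len(authors) < 2:
        return "".join(authors)
    return ", ".join(authors[:-1]) + " and " + authors[-1]


def html_style(c):
    """Stand-in for an HTML stylesheet: the journal is set in italics."""
    return "<p>{}, <i>{}</i>, {}, {}, {}</p>".format(
        escape(join_authors(c["authors"])), escape(c["journal"]),
        escape(c["year"]), escape(c["volume"]), escape(c["page"]))


def text_style(c):
    """Stand-in for a plain-text stylesheet: same content, no markup."""
    return "{}, {}, {}, {}, {}".format(
        join_authors(c["authors"]), c["journal"], c["year"], c["volume"], c["page"])


def apply_style(xml_text, style):
    """XML + style -> output, mirroring the process described in the text."""
    return style(load(xml_text))


print(apply_style(SAMPLE, html_style))
print(apply_style(SAMPLE, text_style))
```

Neither style function knows anything about this particular paper, so the same two functions would work on any document with the same element names.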
So for anyone working with documents, XML allows the content to be stored independently of the style. That’s also a great advantage when it comes to preservation and archival. Because XML is a standard, open, plain-text format, it doesn’t suffer from information loss when it is moved from one machine to another (how many of you have lost newline characters when going from Windows to Mac to Zip to Word, etc.?). It’s possible to calculate a digital checksum for a canonical XML document, so any corruption can be spotted immediately.
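As a rough illustration of the checksum idea, the sketch below canonicalizes a document and then hashes it. It assumes Python 3.8 or later, where `xml.etree.ElementTree.canonicalize` (an implementation of Canonical XML 2.0) is available. The choice of SHA-256, the attribute-based sample documents, and the `canonical_digest` name are my own assumptions.

```python
import hashlib
import xml.etree.ElementTree as ET


def canonical_digest(xml_text):
    """SHA-256 of the canonical form of an XML document."""
    canonical = ET.canonicalize(xml_text)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Hypothetical documents: the same information written two ways
# (different attribute order and quote style), plus a corrupted copy.
original = '<citation year="2006" volume="4"><page>2003</page></citation>'
reserialized = "<citation volume='4' year='2006'><page>2003</page></citation>"
corrupted = '<citation year="2006" volume="4"><page>2008</page></citation>'

# Harmless differences in serialization disappear in the canonical form ...
print(canonical_digest(original) == canonical_digest(reserialized))
# ... while a change in the actual content changes the checksum.
print(canonical_digest(original) == canonical_digest(corrupted))
```

Canonicalization matters because two XML files can differ byte for byte and still carry identical information; hashing the raw bytes would report a false alarm for the second document.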
There are a number of other aspects. Notice that the second and third authors have diacritic marks in their names. XML supports a very wide range of encodings and character sets, which makes it a genuinely international specification.
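A small sketch of what this means in practice: the same author name stored in three different ways (UTF-8, UTF-16, and pure ASCII with numeric character references) comes out of the parser as the same string. The three sample documents are made up for illustration. The parser here is Python's standard `xml.etree.ElementTree`.

```python
import xml.etree.ElementTree as ET

# "Jiří Kulhánek", written with escapes so the code points are explicit.
NAME = "Ji\u0159\u00ed Kulh\u00e1nek"

# The same element serialized in three different ways.
utf8_doc = ('<?xml version="1.0" encoding="UTF-8"?><author>'
            + NAME + '</author>').encode("utf-8")
utf16_doc = ('<?xml version="1.0" encoding="UTF-16"?><author>'
             + NAME + '</author>').encode("utf-16")
# Pure ASCII bytes: the accented letters are numeric character references.
ascii_doc = (b'<?xml version="1.0" encoding="US-ASCII"?>'
             b'<author>Ji&#345;&#237; Kulh&#225;nek</author>')

names = [ET.fromstring(doc).text for doc in (utf8_doc, utf16_doc, ascii_doc)]

# All three byte sequences differ, but the parsed content is identical.
print(all(name == NAME for name in names))
```

The encoding is declared in the document itself, so the receiving program does not have to guess it.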
In later posts I’ll show the power of XML for validation, how software can be applied and how data can be structured. Please feel free to add comments or questions.