As I have blogged before, we are looking at ways of improving the information infrastructure in our Centre. We’re all very conscious of how little we know – I know I know very little and I’m quite prepared to admit it in public. Ignorance per se is not a crime – only wilful ignorance. As part of the process we created some self-help groups and the first feedback is that they would like a set of FAQs for a wide variety of questions. Remembering that this is a group of 40+ molecular informatics scientists, I’ll post some of the questions on an occasional basis. Because others can contribute to this blog maybe we’ll build some communal FAQs…
So I cannot resist “What are the advantages of XML and why should I care?”. I’ve invested several years of my life in developing XML, and layering Chemical Markup Language (CML) on top of it. So it’s very dear to my heart. This post won’t answer the general question directly so there will be more.
I got introduced to Markup Languages after WWW2. At WWW1 (1994) it was clear that HTML had succeeded very well with text and graphics but that more formality was required for other science disciplines. Recall that the early web was about science, not commerce, and although TimBL saw the commercial potential, it was low key at that time. So Henry Rzepa went off to WWW2 and came back saying that people were talking about “something called SGML”. It was also clear that CERN (where TimBL developed HTML) was strong on SGML and that it could support complex documents. I had been struggling for several years with the need to formalize chemistry into a component-based system and with SGML this seemed possible. You could create your own tags for whatever you liked as long as you defined them formally (with a DTD).
So I created some sample CML documents with my own tags. That was the easy bit. The DTD (which defined the language) was harder but possible. The real difficulty was actually doing anything useful with SGML. You could read it and… agree it was correct… and send it to other people… but it didn’t do anything. Why would chemists use it? At that stage they wouldn’t….
The users of SGML were somewhat esoteric groups. A typical example was the Text Encoding Initiative (a project to encode the world’s literature in SGML). At the other end were creators of aircraft maintenance manuals. (Although there were hints that SGML could be used for anything, it was primarily used for text.) The good news was that almost all major publishers of scientific articles used SGML in the production process.
I soon realised that to do anything useful – especially for chemistry – required procedural code. And there was very little. Some of it was extremely expensive – one company wanted $250K (sic) for a site license. The main clients were technical publishers – e.g. in aerospace. So I started to write my own system without any idea what I had got into. I found myself having to refer to “parents and children” of parts of documents – this seemed very strange to me at the time. I was extremely grateful to Joe English who developed a system called CoST and gave me huge amounts of virtual help. Joe, you were very patient – hope all is well! There were a few pioneers of Open Source like Joe and IMO they saved the day for SGML and paved the way for XML. Top of the list is James Clark, whom I’ve never physically met but who has underpinned much of XML with his code and ideas. His nsgmls system was the only code that had the power I required and which could transform the (potentially incredibly complex) SGML documents into something tractable.
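For readers who have not met the “parents and children” idea, here is a minimal sketch of what it means in practice. It uses Python’s standard xml.etree.ElementTree, which is not the tooling described above (that was CoST and nsgmls), and the tag and attribute names are invented for illustration rather than taken from CML. It only shows that a marked-up document is a tree in which each part has one parent and any number of children.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: these tag and attribute names are invented for
# illustration and are not taken from any CML specification.
DOC = """
<molecule id="water">
  <atom id="a1" element="O"/>
  <atom id="a2" element="H"/>
  <atom id="a3" element="H"/>
  <bond from="a1" to="a2"/>
  <bond from="a1" to="a3"/>
</molecule>
"""

root = ET.fromstring(DOC)

# ElementTree elements know their children but not their parent,
# so build a child -> parent lookup by walking the whole tree.
parent_of = {child: parent for parent in root.iter() for child in parent}

# Children of the root element, in document order.
child_tags = [child.tag for child in root]

# Pick out the atoms and read an attribute from each one.
atoms = root.findall("atom")
elements = [atom.get("element") for atom in atoms]

# Walk back up from a child to its parent.
first_atom_parent = parent_of[atoms[0]].tag

print(child_tags)
print(elements)
print(first_atom_parent)
```

The procedural code is what turns the markup from something you can merely read into something you can query and compute with, which is the gap described above.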
So by 1995 I had a system which could represent chemistry in SGML and process it with a mixture of tcl/CoST and nsgmls. It had fairly advanced graphics (in tk) and could even do document analysis of sorts. At that stage (another story) I was converted to Java and effectively wrote a complete system for CML/SGML in Java. This had a simple DOM, a menuing system and a tree widget (in AWT!) and could hold a complete chemical document.
Then, in 1996, Henry pointed me at a small activity on the W3C pages called XML. (Actually Henry and I had already used “XML” as part of CML, but we surrendered the term.) I got myself onto the working group and was therefore one of about 100 people who contributed to the development of XML.
When XML was first started it was “SGML on the Web”. It wasn’t expected to be important and it wasn’t even on the front page of the W3C. As SGML was seen as complex and limited, XML wasn’t really expected to flourish.
XML’s success was due to the foresight and energy of a number of people, but especially Jon Bosak – the “father of XML”. Jon worked on technical document management in Sun (I hope that’s right) and saw very clearly that XML was part of the future of the Web. He coordinated the effort, got funding and political support, and I remember his pride in showing the back cover of the first draft of XML – “sponsored by Sun and Microsoft”. This was a great technical and political achievement.
Tim Bray – another champion and parent of XML – writes:
“It is to Jon Bosak’s immense credit that he (like many of us) not only saw the need for simplification [of SGML] but (unlike anyone else) went and hounded the W3C until it became less trouble for them to give him his committee than to keep on saying SGML was irrelevant.”
XML was supported by one of the largest and most active virtual communities. Henry and I offered to run the mailing list, XML-DEV, on which much of the planning and software development took place. By insisting on Open software as a primary means of verification, the spec was kept to a manageable, implementable size. This meant that, unlike SGML, XML could be implemented by the DPH (“Desperate Perl Hacker”). And it was….
… the rest is history. XML has become universal. Jon (I think) described it as “the digital dial tone” – i.e. wherever information is being passed on the web it will increasingly be in XML.
So that explains why I care :-). Next post I’ll explain why you should also care.