We've had a great hackfest – extended over 10 days – working towards scholarly HTML. The idea is simple – we should be using HTML as the main substrate for exchanging information in areas of scholarship, research, education and learning. Most "formats" have shortcomings but only HTML preserves the democratic symmetry of ease of authoring and ease of reading. XML is necessary for complex objects (such as CML, SVG, MathML, etc.) but HTML has everything that is required for most other communication – documents, webpages, publications. PDF, Word, LaTeX may be useful for authoring but have major shortcomings for re-use.
So why don't we all use HTML? After all, in 1994 we all did. Everyone on the web knew how to author HTML and how to render it. Since that time we've seen the growth of the graphical over the semantic. Good graphics require tools, and these tools have been developed in an asymmetric manner – the reader cannot interact with their outputs other than to view them with the human eye. We've lost the simple skill of reading someone else's content, editing it and republishing it.
But academia is all about re-using content and it's bizarre that we produce it in ways that prevent this. Scholarly HTML – which is nothing more than using HTML sensibly – will change this to return power to the reader.
Peter Sefton and I have spent much of the last week discussing how to do this. Peter is going to continue to develop demos using his suite of tools. We are probably going to concentrate first on repurposing existing scholarly content from Open Access publishers. (We can't apply this to closed access material because that would breach copyright.) We'll be looking at how existing published material can interoperate simply by converting it to scholarly HTML. So the front runners are those who are partners on our JISC grants – and who publish Open Access HTML.
Why does this matter? On Friday the EBI held a session on text-mining and the continuing undercurrent was that most of the material provided by publishers made text-mining almost impossible – it couldn't be easily read (PDF) and the results couldn't be distributed.
With scholarly HTML, by contrast, text-mining is almost trivial.
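To give a feel for what "almost trivial" might mean, here is a minimal sketch in Python using only the standard library's `html.parser`. It is not one of our tools, and the names `ParagraphText`, `extract_paragraphs` and the `SAMPLE` document are all made up for illustration. It assumes the paper is sensible HTML with its running text in `<p>` elements, and simply pulls out each paragraph as plain text ready for a miner.

```python
from html.parser import HTMLParser


class ParagraphText(HTMLParser):
    """Collect the text of every <p> element (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self._in_p = False
        self.paragraphs = []

    def handle_starttag(self, tag, attrs):
        # A new <p> starts a new, empty paragraph buffer.
        if tag == "p":
            self._in_p = True
            self.paragraphs.append("")

    def handle_endtag(self, tag):
        if tag == "p":
            self._in_p = False

    def handle_data(self, data):
        # Text inside inline markup such as <em> still belongs to the paragraph.
        if self._in_p:
            self.paragraphs[-1] += data


def extract_paragraphs(html):
    """Return the paragraphs of an HTML document as whitespace-normalised strings."""
    parser = ParagraphText()
    parser.feed(html)
    parser.close()
    return [" ".join(p.split()) for p in parser.paragraphs]


# Invented sample document, purely for illustration.
SAMPLE = """<html><body>
<h1>An example paper</h1>
<p>Text-mining should be <em>easy</em>
   when the source is HTML.</p>
<p>Readers can reuse the content.</p>
</body></html>"""

if __name__ == "__main__":
    for paragraph in extract_paragraphs(SAMPLE):
        print(paragraph)
```

Real papers would need more care (nested structure, tables, embedded CML or MathML), but the point stands: the text is already there to be read, with no PDF extraction step in the way.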