The ICE-man: Scholarly HTML not PDF

I’m picking up on Peter Sefton’s monster post and one of his phrases suddenly hit me:

academia is one of the few places where PDF is considered acceptable as a means of communication

I thought about it and I realised it’s true. This awful mess we are in is of our own making. Or rather our own supine acceptance of the PDF served up by scholarly publishers. So why does academia use PDF?

  • Because the publishers like it

  • Because it looks like a good way to preserve things

And I can’t think of any other reason. PDF is awful to index, awful to add behaviour to, and awful as a basis for interoperability (which is trumpeted for repositories but hasn’t happened). It is directly against the spirit of the web. HTML has been one of the great successes of the web (HTTP was another, as were URIs).

PDF announces: I don’t care about modern information. I can’t think for myself. The medium is the message.

Apart from advertising brochures, another area where PDF flourishes is regulatory systems. Pharma companies like PDF because it’s much more difficult to search than XML (or HTML), and so it is harder to find those bits which they don’t want found. And regulators like it because the pages allow for easy certification.

Is that all academia is about? Helping the publishers certify their page count? And making it difficult to search their pages?

So here’s a fuller version of Peter’s section:

Scholarly HTML

Against this background I will confine myself to the dimensions I really care about, which are how to make word processors produce good quality HTML, and document interoperability. I’ve been over and over why this is important here, but here’s a summary.

On the authoring side, offline word processors like Microsoft Word and OpenOffice.org Writer are probably still the best all-round compromise for academic authoring in those disciplines which don’t use some other format like LaTeX. For now. I expect this to change soon; we are starting to see document drafting in Google Docs (which lacks citation services and styles and easy embedding of diagrams so far), and if Google Wave realises its promise then I think it could be an end-to-end scholarly communications platform.

PMR: Fully agreed. Word processors are complicated because documents are complicated (unless you default to bitmaps such as PDF; I have looked under the covers and PDF is truly awful).

On the delivery side, academia is one of the few places where PDF is considered acceptable as a means of communication, whereas on a normal website it is regarded as an impediment to usability. We need to be getting scholarly works into HTML so we can do more with them: meshing them with data and visualisations, and delivering them to mobile devices.

While we wait for Google Wave to take over the world, what I’d like to see is a Word toolbar, much like the ICE toolbar, to support scholarly authoring, but with better integration into Word than we have had the resources to make so far here in Toowoomba. It should let people create well-structured documents which can be pushed to academic systems (journals, repositories and learning systems), and not just in PDF or Word format, but in some kind of formally specified Scholarly HTML. I think that idea had some support at our meeting, but Lee Dirks in particular pointed out that it would need to be done with reference to a stakeholder group who can help define and own this Scholarly HTML thing. I’d be interested in ideas on who these stakeholders might be:

Publishers, obviously, where MS Research have great contacts.

Repository owners, particularly the discipline repositories like arXiv and PubMed Central.

The eResearch community; I hope that I can get the Australian National Data Service (ANDS) interested in this stuff.

The Electronic Thesis and Dissertation (ETD) movement. (My group is involved in this via our CAIRSS repository support service; the Australasian Digital Thesis program in Australia will come to CAIRSS at some point.)

The eLearning community, maybe.

But actually, where this matters most is on the long tail:

Thousands of small repositories and journals are stuck with paper-on-screen because that’s all their tools support.

The small but growing group of users who want to do more with the versions of their documents they deposit in repositories.

I’d appreciate any thoughts about who might be interested in defining a scholarly profile of HTML; a few people told me they’re following these posts, so please speak up in the comments.

I’m interested, obviously. My requirements, which Peter knows of course, are that we can embed CML (Chemical Markup Language) and other markup languages, and that we can start to use RDF (RDFa?). A minimal sketch of the sort of fragment I mean follows.
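
To make that concrete, here is a minimal sketch of such a fragment. The Dublin Core properties, the RDFa attributes and the inline CML island are illustrative assumptions, not an agreed Scholarly HTML profile:

    <!-- Illustrative sketch only: not an agreed Scholarly HTML profile. -->
    <div xmlns:dc="http://purl.org/dc/terms/"
         xmlns:cml="http://www.xml-cml.org/schema"
         about="#paper">
      <h1 property="dc:title">An Example Scholarly Article</h1>
      <p property="dc:creator">A. N. Author</p>
      <p>Ordinary narrative text, indexable like any other HTML.</p>
      <!-- the molecule travels as semantic data, not as a bitmap -->
      <cml:molecule id="m1">
        <cml:atomArray>
          <cml:atom id="a1" elementType="C"/>
          <cml:atom id="a2" elementType="O"/>
        </cml:atomArray>
      </cml:molecule>
    </div>

The point is that the text stays ordinary HTML while the chemistry stays data: a crawler can index the prose, and a chemical tool can pull the molecule straight out of the page.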

Please, academia, wake up and embrace the digital, Semantic future, not the ePaper one.


14 Responses to The ICE-man: Scholarly HTML not PDF

  1. David Gerard says:

    Um … left to themselves, academics tend towards issuing papers in PostScript or the original LaTeX. They’re fond of issuing stuff in a hard-to-edit medium …
    Academics are a subset of humans, and humans should issue semantic content, but in practice consistently issue presentation content.
    So the problem is not PDF per se. I don’t think it’s ill will or even negligence on the part of publishers either.
    To get them working more semantically, I would suggest working out how to spread the semantic content meme and make it useful to gain critical mass for it.

    • pm286 says:

@David Thanks. You are right, and we are trying hard to educate. But there is a vicious circle with publishers, in more ways than one.

  2. warrener says:

    Quote from Peter Sefton:
    “Against this background I will confine myself to the dimensions I really care about, which is how to make word processors produce good quality HMTL …
    On the authoring side, offline word processors like Microsoft Word and OpenOffice.org Writer are probably still the best all round compromised for academic authoring.”
    I have not seen the HTML output from OOo but, in my experience, the HTML produced by MS Word is absolutely horrendous! It is full of unnecessary garbage so it takes an age to render in the browser and may not be correct when it has finished. I think that it is very unlikely that MS Word will produce “good quality HTML” any time soon (if ever) so it seems to me that a decent WYSIWYG HTML editor would be much better than a word processor for HTML. I agree about PDF, though!

  3. Pingback: Unilever Centre for Molecular Informatics, Cambridge - PDFs « petermr’s blog

  4. David Groenewegen says:

One of the often-ignored reasons for the prevalence of PDF in academia is the need for clear citation information. If you are going to cite a specific piece of information within an article (especially a long one) or a book, you need a page number. PDF allows for a consistent page number, which HTML does not. This does not mean that HTML should not also be generated, just that the fixed formatting that PDF is best at has real value for academics.
    The other reason is that lots of people print stuff to read (I know I do), and PDFs make for nicer print copies. Until the perfect electronic reading app comes along, PDF will persist.

    • pm286 says:

      @David. (a) citations. I don’t buy this. The BMC article has: BMC Bioinformatics 2008, 9:545 doi:10.1186/1471-2105-9-545. It’s just as easy to use the DOI as the page number. When journals are ePaper only then there will be no serial pages and the BMC article numbers the pages from 1 to 9 (not 545-554).
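      Rendered in HTML, such a citation can carry the DOI as a live link; the markup below is purely illustrative:
        <!-- the DOI resolves wherever the article lives; no print pagination needed -->
        <cite>BMC Bioinformatics 2008, 9:545.
          <a href="http://dx.doi.org/10.1186/1471-2105-9-545">doi:10.1186/1471-2105-9-545</a>
        </cite>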
      (b) So e-journals are purely a way of saving the publisher’s printing costs and transferring them to the reader. It will pass.

  5. Andrew says:

    I think what is needed here is for the academic publishing community to agree on a file format standard. This standard needs to accommodate the metadata that publishers are interested in (with Dublin Core as an absolute minimum), but it also needs to provide a mechanism for storing the body of the document with all the data that relates to the kind of content it is, e.g. paragraph, graphic, table, references, chapter heading, etc.
    Agreeing on a file format is, IMO, better than agreeing on which app should be used. The format will probably have to use XML.
    When a suitable XML format is agreed on, it will be possible to write XSLT that can transform the data into a number of output forms, one of which will be HTML (a rough sketch follows at the end of this comment). This should be able to cope with most things.
    I admit there are problems when it comes to displaying chemical and mathematical formulae. The way I have solved this for a digital library prototype I have worked on is to invoke TeX via a little CGI-like program. The formulae are tagged to invoke the CGI automatically as the page renders. There is a JavaScript package called jsMath that some would use instead (IMO it is not as good, as it requires a client-side installation).
    Scholars often use TeX or LaTeX to produce their documents. They then have to jump through various hoops to get the publisher to publish the article. This extra work includes providing GIFs for the tables, figures and formulae, as well as providing the PDF. If only there was a way to process the TeX to produce this elusive standard XML file format. If only…
    A few years ago I wrote a script that converts LaTeX to DocBook format. So it can be done. But agreeing on the XML file format will be tricky. And there is the non-content-related metadata to consider. Not all of this can be done by the document author; it sometimes requires manual work from content editors working for the publisher.
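    Here is the promised rough sketch of the XSLT route; the element names ("article", "title", "para") are invented stand-ins for whatever the agreed format would define:

      <!-- illustrative only: element names are hypothetical -->
      <xsl:stylesheet version="1.0"
          xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
        <xsl:template match="/article">
          <html>
            <head><title><xsl:value-of select="title"/></title></head>
            <body>
              <h1><xsl:value-of select="title"/></h1>
              <xsl:apply-templates select="para"/>
            </body>
          </html>
        </xsl:template>
        <xsl:template match="para">
          <p><xsl:apply-templates/></p>
        </xsl:template>
      </xsl:stylesheet>

    A second stylesheet against the same source could just as easily emit DocBook or print-oriented output, which is the whole attraction of agreeing on the format rather than the app.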

    • pm286 says:

@Andrew, many thanks. The problems are still mainly people problems: if we all want this to happen, it will. In 1995 we had the great LaTeX2HTML, including conversion to GIFs. But the momentum slowed. Similarly, the momentum on vector graphics slowed. But it is picking up.

  6. David Groenewegen says:

    Peter,
(a) In a previous life I was an historian, and in that field citing the page was a crucial part of a citation – merely citing the paper was not enough. It can be overcome, of course, but citation is a very old standard, and a better one would need to be widely accepted for people to abandon it.
    (b) Yes, that is partially true – in the past, of course, a lot of people would have photocopied the print copy in the library and taken that home (remember the huge queues around photocopiers at exam time), so the new system is arguably more efficient: there are fewer print copies in the libraries, and some people who might have photocopied will now read on screen. In any case, this doesn’t negate the point that many people still prefer to read a piece of paper rather than a screen. This will change, as you say, but it is a people problem that will take time.

    • pm286 says:

@David. True, and in a previous life I also had to cite page numbers. I have no idea why we had to cite the end page – I think it was either to show we had gone to the library to find the paper (even if we didn’t read it) or to show the readers how huge the document was (e.g. if they wanted an interlibrary loan and had to pay per page). They could then decide whether it was worth ordering.
      I think there is a better standard. It’s a DOI.

  7. Regarding latex2html, IMO there is a much better app now called hevea; see http://hevea.inria.fr/index.html. A colleague of mine, Ciaran McHale, has produced some documentation using hevea where you can see both the HTML and the PDF output. I reckon it’s pretty good. Go to http://www.ciaranmchale.com/training-courses.html.

  8. Pingback: Open Knowledge Foundation Blog » Blog Archive » Open Data in Archaeology
