PDFs

Egon Willighagen is one of the people whose support has kept me going through tough times. Here he supports my criticism of PDF (No, PDFs really do suck!). For background: I had posted a criticism of PDF and a lively discussion took place on FriendFeed. It is questionable whether FF discussions should be quoted (they are public, but can be taken out of context), so I didn’t. I had a few supporters, but a surprising number of naysayers, the general tenor of whom was that PDF is easy to read and HTML is difficult to read. So I’ve done an experiment, but first Egon’s post:

No, PDFs really do suck!

EW: A typical blog post by Peter MR, The ICE-man: Scholarly HTML not PDF, made (again) the point that PDF is to data what a hamburger is to a cow, in reply to a blog post by Peter SF, Scholarly HTML.

This led to a discussion on FriendFeed. A couple of misconceptions:

FF: “But how are we going to cite without paaaaaaaaaaaage nuuuuuuuuuuumbers?”

EW: We don’t. Many online-only journals can do without; there is DOI. And if that is not enough, the legal business has means of identifying paragraphs, etc, which should provide us with all the methods we could possibly need in science.

FF: Typesetting of PDFs, in most journals, is superior to HTML, which is why I prefer to read a PDF version if it is available. It is nicer to the eyes.

EW: Ummm… this is supposed to be Science, not a California Glossy. It seems that pretty looks are causing a major body count in the States. Otherwise, HTML+CSS can likely beat any pretty looks of PDF, or at least match them.

FF: As I seem to be the only physicist/mathematician who comments on these sorts of things, I feel like a broken record, but math support in browsers currently sucks extremely badly and this is a primary reason why we will continue to use PDF for quite some time.

EW: HTML+MathML is well established, and default Firefox browsers have no problem showing mathematical equations. The Blue Obelisk QSAR descriptor ontology has been using such a setup for years. If you use TeX to author your equations, you can convert them to HTML too.

FF: We can mine the data from the PDF text.

EW: Theoretically, yes. Practically, it is money down the drain. PDF is particularly nasty here, as it breaks words at the end of a line, and can even make words consist of unlinked series of characters positioned at (x,y). PDF can, however, contain a lot of metadata, but that is merely a hack and an unneeded workaround. Worse, it is hardly used in chemistry. PDF can contain PNG images which can contain CML; the tools are there, but not used, and there are more efficient technologies anyway.

EW: I, for one, agree with Peter on PDF: it really sucks as a scientific communication medium.

So here’s an experiment with a sample size of one. I went to BioMed Central, took the first journal I came across which had a ta-ble. (A ta-ble is a coll-ect-ion of num-bers in rows and col-umns). Sometimes I read tables, but sometimes I put them into a spread-sheet. (A spread-sheet is soft-ware that lets you cal-cul-ate things). The article is http://www.biomedcentral.com/1471-2105/9/545 which is called the full-text and is in HTML. I went to Table 1 and found:

[Image: Table 1 as rendered from the HTML full text]

I then went to the PDF (which is not seen by BMC as the full-text) http://www.biomedcentral.com/content/pdf/1471-2105-9-545.pdf and found the same table:

[Image: the same table as rendered from the PDF]

[I have reproduced them at the same size as they came up in my open-source browser. The HTML was rendered naturally by the browser with no help from me. The PDF required me to download a closed-source proprietary plugin from Adobe.]

I am not an expert on readability but I would like to see the researched arguments that say the HTML is worse for humans than the PDF (actually I think it’s better).

But here is the clincher. As a scientist I don’t just want to read the paper with my eyes. I want to use the numbers. Maybe I want to see how column 1 varies with column 2. The natural way to do this is to cut-and-paste the table. (This is done in each case by sweeping out the table with the cur-sor and pressing the Ctrl-key and the C key on the key-board at the same time. The data is now on the clip-board). I then open up Ex-cel (because I am in the pay of Mi-cro-soft) and paste the clipboard into the spread-sheet. This is what I get from the PDF version.

[Image: the spreadsheet after pasting from the PDF, with all the data in one column]

All the data has gone into one column. The tabular nature has been completely destroyed. And the cut-and-paste was done with Adobe’s own tool, so even Adobe doesn’t know what a table is in the PDF. (I have been taken to task for criticizing PDF because some people don’t use Adobe tools).

Here’s the HTML version. I have highlighted a cell to show that all cells are in correct columns:

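[Image: the spreadsheet after pasting from the HTML, with one cell highlighted and every cell in its correct column]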
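For readers who want to see why the HTML survives the trip, here is a minimal sketch in Python using only the standard library. It is not what the browser or Excel actually does internally; it only illustrates that the rows and cells are explicit in the markup, so a few lines of code can recover them. The sample table is made-up data, not the table from the BMC article, and the names `TableExtractor` and `extract_rows` are hypothetical.

```python
from html.parser import HTMLParser


class TableExtractor(HTMLParser):
    """Collect the cells of every <tr> in an HTML document as a list of rows."""

    def __init__(self):
        super().__init__()
        self.rows = []      # finished rows, each a list of cell strings
        self._row = None    # row currently being read, or None
        self._cell = None   # text fragments of the current cell, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None  # discard any cell left unclosed


def extract_rows(html):
    """Return the table content of `html` as a list of rows of cell strings."""
    parser = TableExtractor()
    parser.feed(html)
    parser.close()
    return parser.rows


# Made-up sample data (an assumption for illustration only).
SAMPLE_HTML = """
<table>
  <tr><th>Method</th><th>Precision</th><th>Recall</th></tr>
  <tr><td>A</td><td>0.91</td><td>0.85</td></tr>
  <tr><td>B</td><td>0.88</td><td>0.90</td></tr>
</table>
"""

rows = extract_rows(SAMPLE_HTML)

# Tab-separated output pastes into a spreadsheet with columns intact.
for row in rows:
    print("\t".join(row))

# Roughly what the PDF paste gave me: the same values with the row
# boundaries thrown away, so everything lands in a single column.
flattened = [cell for row in rows for cell in row]
```

Going from `rows` to `flattened` is trivial. Going the other way means guessing how many columns there were, which is exactly the information the PDF paste lost.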

I was called a bit dogmatic. Yes I am. This seems to me so self-evidently a case for using HTML over PDF that I can’t think of any reason why PDF should be used.

And kudos to BMC. They have realised that HTML is a better digital medium than PDF. Are their readers cancelling subscriptions? No…

Oops… BMC is an Open Access publisher. It is forcing its authors to pay for their manuscripts to be converted into horrid HTML. I expect they’ll start sending their papers elsewhere…


14 Responses to PDFs

  1. David Hall says:

    I think the really interesting question is whether the HTML+CSS adds value over the original source documents. For dissertations, etc., you have been arguing for access to the Word and/or LaTeX. The BioMed Central example shows that often the journals add value through their HTML+CSS representations for data mining. When there is a figure mentioned in the article, they link to a popup of the figure. When articles are cited, they link to the anchor for the article from which the other article is usually linked via pubmed or doi. All in all, very useful for someone looking to mine data.
    I’m still going to read all my scientific articles in PDF and insist on that though. I hate it when an article comes out early access and I can’t have my pdf. I agree with everything you’ve said; my pdf doesn’t have to have page numbers.
    Currently you have your mineable HTML representation and everyone who actually cares about the aesthetics of an article (or who gets sick of staring at a computer screen all day and finds it easier to highlight stuff on a nicely formatted dual column print out) has their pdf. I agree that HTML and CSS could match the prettiness of PDF, but journals don’t do it. And they certainly don’t make use of the CSS properties that would make a paper nice when printed out. BMC doesn’t have a print stylesheet. Does any journal have a print stylesheet? Also, if MathML were really that well supported by browsers, why don’t places like Wikipedia use it? It’s clear the consensus among many in the community is that MathML browser support is horrid. Even the person you link to for converting TeX to MathML agrees the problem is in rendering, not authoring, and has long been so.

    • pm286 says:

      @David. All true. If there is both HTML and PDF then there is no problem. We can mine the HTML (unless the Nature protocol restricts us to trivial amounts).

  2. Klaus Graf says:

    Peter, I appreciate your fight against PDF.
    It wouldn’t be a bad idea if all OA repositories offered both versions.
    There are other reasons not to love PDF besides data mining issues.
    I hate clicking links in PDFs. Mostly I cannot use the right mouse-click to open the link in a new window. If articles are on web issues it is central to be able to click on links as easily as in HTML.

    • pm286 says:

      @Klaus. Absolutely. I have argued that where the original is not PDF then it should be preserved and exposed as well.

  3. Peter, I’m actually on your side — my comments in the FF thread were meant sarcastically. I have fairly good PDF-hater cred myself (note the date on that one!), and I got my start as a wee conversion peasant doing markup work for scholarly books and journals.
    Klaus, who bells the cat? Getting PDF and (decent) HTML out of the same workflow — or even out of different ones — is serious work. Typesetting is not conversion.
    My favored solution is to dump PDF. It may even happen in my lifetime, if the math issue can finally get sorted out.
    David, wholly agree with you about the hideousness of journal HTML. I have fallen utterly in love with the Readability bookmark in this context, and recommend it to you highly.

  4. Peter…I read a lot of papers in PDF format and as a reading format I like it. But I do wish that they were also made available in decent HTML, especially by the Open Access publishers.
    In the early days of Molbank they produced rich HTML. So, an article like this http://merian.pch.univie.ac.at/molbank/molbank2007/m556.htm could be reused to do this on our ChemSpider system : http://www.chemspider.com/21106869#description
    Now though Molbank is issued only as PDF. It greatly disables what we’d like to do. Molbank is still great though…

  5. Dorothea, I caught the quotes around your comment, and the sarcasm. Sad thing is that it was indeed quite an illustrative quote.
    Funny thing actually happened. HTML+CSS is not the perfect replacement to me (but I think it is sufficient for what we want, and I referred to journal publishing quality of some 30 years ago; HTML+CSS would certainly beat the looks of those papers, and I see no reason why HTML+CSS could not improve looks and/or cross-platform usefulness in that same period as PDF did to paper), but everyone jumped on me for mentioning it, where I explicitly say, “it’s not about the looks, but about the semantics”.
    Peter, I like your copy/paste HTML tables example very much. It would be even better if we could demo the equivalent with figures too, though I am currently unaware of online journals using something like vector graphics for scatterplots, etc…

    • pm286 says:

      @Egon. Thanks. You are one of the people I can always count on for support. I doubt there will be online journals with proper support for plots until the browsers support vector graphics properly.

  6. BTW, another aspect that should have come up is that HTML does semantic things. I have been using RDFa on my new homepage (under construction, preview: http://egonw.github.com/) which works perfectly well, and integrates pretty looks (I’m not an artist, and this is about the prettiest I can make it) with machine-readable Resource Description Framework markup of the facts. I would really love to see the discussion going in that direction, instead of looks…
    But hey, we’re just scientists; we care about looks. (No quotes, but I hope the sarcasm is clear 🙂)

