Publishers' typesetting destroys science: They are all as bad as each other. Can you spot the error?

I’ve just been trying to mine publicly visible scientific publications from scholarly publishers. (That’s right – “publicly visible” – Hargreaves comes later).
They destroy the text. They destroy the images and diagrams. And we pay them money for this – usually more than a thousand dollars. Sometimes many thousands. And when I talk to them – which is regularly – they all say something like:
“Oh, we can’t change our workflow – it would take years” (or something similar). As if this were a law of the universe.
Unfortunately it’s a law of publishing arrogance. They don’t give a stuff about the reader. There are no market forces – the only thing that the PublisherAcademic complex worries about is the shh-don’t-mention-the-Impact-Factor.
And it’s not just the TollAccess ones but also the OpenAccess ones. So today’s destruction of quality comes from BMC. (I shall be even-handed in my criticism).
I’m trying to get my machines to read HTML from BMC’s site. Why HTML? Well, publishers’ PDF is awful (I’ll come to that tomorrow or sometime), whereas HTML has been a standard for many years and so it’s straightforward to parse. Yes, unless it comes from a Scholarly publisher…
PUZZLE TODAY. What’s (seriously) wrong with the following? [Kaveh, you will spot it, but give the others a chance to puzzle!] It’s verbatim from BMC’s site (I have added some CRs to make it readable):

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
<html id="nojs" xmlns=""
    xmlns:og="" xml:lang="en-GB"
    lang="en-GB" xmlns:wb=“”>
<head> ... [rest of document snipped]

When you see it you’ll be as horrified as I was. There is no excuse for this rubbish. Why do we put up with this?


10 Responses to Publishers' typesetting destroys science: They are all as bad as each other. Can you spot the error?

  1. Daniel says:

    Yes, it’s awful!
    I must admit I cheated and used W3C’s validator.

  2. Andrew Walker says:

    “That must just be PMR’s text editor ‘fixing’ things for him” I thought. Nobody would do that! But yes. There is a bozo creating XHTML at BMC.

  3. Henry Rzepa says:

    There may be those unfamiliar with XML who are pondering the above, and the syntactic difference between " and “ (I find very useful in this regard).
    Peter, there is no preview on this blog, so I will not know if my escaped entities come out correctly or not.
    As someone who routinely writes everything in HTML, and then validates the result using Tidy, it is incomprehensible to me how a professional publisher would not actually validate their XML before publishing it.
    Still, like PostScript before it, those who write native XML (HTML5) are, I suspect, a vanishing breed. Thus my favourite text editor on OS X (BBEdit) has had an XML validator built in for years. That is, until the latest release (V11), when they dropped it!
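
Henry's point about checking markup before publishing can be sketched in a few lines. This is only a minimal sketch: it uses Python's standard-library XML parser rather than Tidy or a full validator, so it tests well-formedness and not conformance to the XHTML DTD. The function name `is_well_formed` and both sample strings are made up for illustration.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text):
    """Return True if xml_text parses as well-formed XML, False otherwise."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError:
        return False
    return True


# Hypothetical samples: the same element with straight and with "smart" quotes.
straight = '<html lang="en-GB"><head/></html>'
smart = "<html lang=\u201cen-GB\u201d><head/></html>"

print(is_well_formed(straight))
print(is_well_formed(smart))
```

A check like this, run as the last step before a page goes live, would presumably have rejected the smart-quoted attribute, because an XML parser only accepts straight single or double quotes as attribute delimiters.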

    • pm286 says:

      You’ve got it Henry (and so have the others). BiomedCentral have used “smart quotes” (“ and ”). These are so pretty that we have to use them, don’t we? Except that they completely bugger up HTML. In a NAMESPACE declaration, for goodness’ sake.
      What happened to me was that Tidy did its best to parse it. Tidy thought there were no quotes at all (reasonable since there aren’t any). It wrapped the smart quotes in proper quotes. This parses syntactically. But the result was something like:
      This is syntactically OK, but XOM recognised it as violating the namespace criterion (no leading http).
      Also BiomedCentral have an XML prefix without a declaration. That took me half a day to kludge round.
      The simple facts:
      I started content-mining to find new science. So far it’s just cleaning up PublisherCrap.
      I think I can find technical illiteracies in every major publisher.
      It’s not right, it’s not just.
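
For anyone who wants to catch this kind of thing before a parser chokes on it, here is a rough sketch of a scan for typographic quotes inside tags. It is not what Tidy or XOM do internally; the function name `find_smart_quotes_in_tags` and the regular expression are assumptions made for illustration, and the sample is adapted from the snippet in the post (without the DOCTYPE line). The pattern is deliberately crude and would be fooled by a `>` inside an attribute value.

```python
import re

# Typographic ("smart") quote characters that are not valid attribute delimiters.
SMART_QUOTES = "\u201c\u201d\u2018\u2019"

# Rough pattern for a tag: '<' up to the next '>' with no nested angle brackets.
TAG_PATTERN = re.compile(r"<[^<>]*>", re.DOTALL)


def find_smart_quotes_in_tags(markup):
    """Return (line_number, character) pairs for smart quotes found inside tags."""
    hits = []
    for match in TAG_PATTERN.finditer(markup):
        for offset, char in enumerate(match.group()):
            if char in SMART_QUOTES:
                position = match.start() + offset
                line_number = markup.count("\n", 0, position) + 1
                hits.append((line_number, char))
    return hits


# Adapted from the snippet in the post; the smart quotes sit on the third line.
sample = (
    '<html id="nojs" xmlns=""\n'
    '    xmlns:og="" xml:lang="en-GB"\n'
    "    lang=\"en-GB\" xmlns:wb=\u201c\u201d>\n"
    "<head></head></html>"
)

for line_number, char in find_smart_quotes_in_tags(sample):
    print(f"line {line_number}: U+{ord(char):04X}")
```

Smart quotes in ordinary running text are left alone, since they are perfectly legal there; only the ones inside tags are reported.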

  4. Henry Rzepa says:

    The problem may be in part the disconnect between the publisher and their “customers”. You might think that the customers are the people who read the journal. Wrong. The customers are still the institutional libraries, the same ones who pay the subscription to the publisher. Why should an administrator in a library care whether an XML or PDF document is “usable”? After all, its usability is defined by the ability of a human to read it, is it not?
    So perhaps we should let our institutional librarians know that we are now “using” articles in many new ways, and that before paying the annual subscriptions, the journal should be investigated to see if it is fit for all purposes, not just reading.
    I might add that possibly, the publishers realise that using an article in new ways means the opportunity to charge more for this. Hence DRM, restricting what use an article can be put to unless the customer has paid more for the rights to do so.

    • pm286 says:

      Thanks Henry,
      You are absolutely correct for TollAccessPublishers (taPublishers). Since librarians only RENT the content (sorry, Librarians only transfer the RENT from the taxpayer to the publisher) they have no direct interest in what they are RENTing. The only thing they care about is placating their senior academics (I keep getting told “we’d like to do X but our academics wouldn’t stand for it”). So libraries are simply channels for the passage of money for goods they never touch, never use.
      Open Access publishers are also not concerned with their readers. They want a good article metric count but they generally aren’t interested in what is in the article or how it’s packaged. After all if they only reach sighted humans that’s probably a large enough market. No, what they compete for is authors.
      Although that doesn’t explain why the authoring process is among the worst experiences in the world…

  5. Kaveh says:

    I have to admit it took me a long time to see that. We are so used to the computer “smartening” our quotations that they don’t stand out any more!
    I have to agree with you, Peter, that most XML that I see on publishers’ sites is useless for anything, except for ticking a box for managers who can say “we have all content in JATS” to their directors.
    The more depressing thing is that there is only a tiny proportion of XML that is visible to us, and I think entirely from OA publishers, and we should applaud the OA publishers at the same time as criticizing them. So even if you pay your subscription to a journal, the XML is hidden in a vault. I think perhaps the big boys are just too embarrassed to show us their XML!
    So here is a question for all publishers: why are you hiding the XML even when I have paid my money to get the article?

    • pm286 says:

      AFAIK many TollAccess publishers don’t expose XML on their “hybrid Open Access” publications.
      We seem to agree that the XML is so awful they would be ashamed. Except that they don’t care.
      That’s why I have spent 2-3 years of my life developing systems to read PDF, and now pixels.

