How publishers destroy science: Elsevier's XML, API and the disappearing chemical bond. DO NOT BUY XML

TL;DR Elsevier typesetting turns double bonds into garbage.
Those of you who follow this blog will know that I contend that publishers corrupt manuscripts and thereby destroy science.
Those of you who follow this blog will know that Elsevier publicly stated that I could not use the new “Hargreaves” law to mine articles on their web page and I must do this through their API. Originally there were zillions of conditions, which – under our constant criticism – have gradually (but nowhere near completely) disappeared. They now allow me to mine from the web page, but insist that their XML-API gives better content.
I have consistently refused to use Elsevier’s API for legal, political and social reasons (I don’t want to sign my rights away, be monitored, have to ask permission, etc.). But I also know from at least 5 years of trying to interpret publishers’ PDFs and HTML that information is corrupted. By this I mean that what the author submits is turned into something different lexically, typographically and often semantically. (Yes, that means that by changing the way something looks, you can change its meaning).
Anyway yesterday Chris Shillum, who was part of the team I challenged, tweeted that he would let me have a paper – in XML format – from the Elsevier API. For those who don’t know, XML is designed to hold information in a style-free form. It can be rendered by a stylesheet or program (e.g. FOP) into whatever font you like. I’m very familiar with XML having run the developers’ list with Henry Rzepa in 1997 and been co-author of the universal SAX protocol. Henry and I have developed Chemical Markup Language (CML) precisely for the purpose of chemical publishing (among many other things).
 
But Elsevier don’t use CML, they use typographers who know nothing about chemistry. At school you may have heard of a “double bond” (http://en.wikipedia.org/wiki/Double_bond). It’s normally represented by two lines between the atoms. We used to draw these with rapidographs, but now we type them. So every chemist in the world will type Carbon Dioxide as
O=C=O
capital-O equals capital-C equals capital-O
You can do it – nothing terrible happens. You can even search chemical databases using this. They all understand “equals”.
But that’s not good enough for Elsevier (and most of the others). It has to look “pretty”. It’s more important that a publication looks pretty than that it’s correct. And that’s one of the major ways they corrupt information. So here’s the paper that Chris Shillum sent me.
First as a PDF.
[Image: elsevierchem4, the passage as rendered in the PDF]
Can you see the C=O double bond in the middle? “(C=O stretching)”. It’s no longer an equals, but a special publisher-only symbol they think looks prettier. Among other things if I search for “C=O” I won’t find the double bond in the text. That’s bad enough. But what’s far worse is that this symbol has been included in their XML. And this gets transmitted to the HTML – which looks like (you can try this yourself http://www.sciencedirect.com/science/article/pii/S0014579301033130 ).
[Image: elsevierchem, the same passage as rendered in the HTML]
???
What’s happened??? Do you also see a square? The double bond has disappeared.
The square is Firefox saying “I have been given a character I don’t understand and the best I can do is draw a square” – sorry. Safari does the same. Do ANY of you get anything useful? I doubt it.
Because Elsevier has created a special Elsevier-only method of displaying chemistry. It probably only works inside Elsevier’s back room – it won’t work in any normal browser. Here’s what has happened.
Elsevier wanted a symbol to display a double bond. “Equals”, which all the rest of the world uses, isn’t good enough. So they created their own special Elsevier-double-bond. It’s not a standard Unicode codepoint – it’s in a Private Use Area (http://en.wikipedia.org/wiki/Private_Use_Areas). This is reserved for a single organisation to use. It is not intended for unrestricted public use. In certain cases groups, with mutual agreement, have developed communities of practice. But I know of no community outside Elsevier that uses this. (BTW the XML uses 6 Elsevier-only DTDs and can only be understood by reading a 500-page manual – the chemistry is hidden somewhere at the end. This is the monstrosity that Elsevier wishes to force us to use.)
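For readers who want to check a document for this themselves, here is a minimal sketch in Python. It relies on the standard library reporting the Unicode general category "Co" for private-use characters. The codepoint U+E001 is an arbitrary stand-in chosen for illustration; it is an assumption, not the codepoint Elsevier actually uses.

```python
import unicodedata

# Hypothetical stand-in: U+E001 is an arbitrary Private Use Area codepoint,
# NOT necessarily the one used for the publisher's double bond.
HYPOTHETICAL_DOUBLE_BOND = "\ue001"


def find_private_use(text):
    """Return (index, 'U+XXXX') for every private-use character in text."""
    return [
        (i, f"U+{ord(ch):04X}")
        for i, ch in enumerate(text)
        if unicodedata.category(ch) == "Co"  # "Co" = Other, private use
    ]


published = f"(C{HYPOTHETICAL_DOUBLE_BOND}O stretching)"
typed = "(C=O stretching)"

print(find_private_use(published))  # [(2, 'U+E001')]
print(find_private_use(typed))      # []
print("C=O" in published)           # False: a plain-text search misses the bond
print("C=O" in typed)               # True
```

The last two lines show the search problem: the string a chemist would type is simply not present in the text that contains the private-use character.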
It’s highly dangerous. If you change a double bond to a triple bond (ethylene => acetylene) it can explode and blow you up. But double and triple bonds are both represented by a hollow square if you try to view Elsevier-HTML. And goodness knows what else.
So Elsevier destroys information.
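The loss described above can be sketched as follows. This is a toy model of a renderer that draws a box for any character its font lacks, not how any particular browser works. The two bond codepoints and the `ordinary_font` coverage rule are assumptions made up for the illustration.

```python
# Hypothetical private-use codepoints standing in for publisher bond glyphs.
DOUBLE_BOND = "\ue001"
TRIPLE_BOND = "\ue002"
FALLBACK_BOX = "\u25a1"  # WHITE SQUARE, used here as the "missing glyph" box


def render(text, font_has_glyph):
    """Mimic a renderer that draws a box for any character its font lacks."""
    return "".join(ch if font_has_glyph(ch) else FALLBACK_BOX for ch in text)


def ordinary_font(ch):
    # Assumption: an ordinary font covers ASCII and no private-use codepoints.
    return ch.isascii()


ethylene_like = f"C{DOUBLE_BOND}C"
acetylene_like = f"C{TRIPLE_BOND}C"

# The underlying strings differ ...
print(ethylene_like == acetylene_like)  # False
# ... but what the reader sees is identical.
print(render(ethylene_like, ordinary_font) == render(acetylene_like, ordinary_font))  # True
```

Two different pieces of chemistry go in and one indistinguishable display comes out, which is the sense in which the information is destroyed for the human reader.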
Chris Shillum tells me on Twitter that it’s not a problem. But it is. Using the Private Use Area without the agreement of the community is utterly irresponsible. No one even knew that Elsevier was doing it.
Why’s it irresponsible? Because many languages use it for other purposes. See Wikipedia above. Estonian, Tibetan, Chinese … If an Elsevier-double-bond is used in these documents (e.g. an Estonian chemistry department) there will be certain corruption of both the chemistry and the Estonian. There are probably 10 million chemical compounds with double bonds and all will be corrupted.
But it’s also arrogant. “We’re Elsevier. We’re not going to work with existing DTDs (XML specifications) – we’re going to invent our own.” Who uses it outside Elsevier? “And we are going to force text-miners to use this monstrosity.”
And it’s the combined arrogance and incompetence of publishers that destroys science during the manuscript processing. I’ve been through it. I know.
 
 


6 Responses to How publishers destroy science: Elsevier's XML, API and the disappearing chemical bond. DO NOT BUY XML

  1. Daniel says:

    An equals sign may look like a double bond but semantically it isn’t one. Considering what else is present in Unicode I’m surprised characters for double/triple/quadruple bond are not yet in Unicode. Similarly the closest approximation to a triple bond is probably the Unicode “identical to” symbol (≡) and as far as I know there is no Unicode character appropriate for representing a quadruple bond. Elsevier’s documentation of their DTD appears to enumerate these characters:
    http://www.elsevier.com/__data/assets/pdf_file/0008/109844/ja50_tagbytag5.pdf (p25)
    I think these characters are part of the STIX project (http://www.ams.org/STIX/private/stixprv-E6.html) i.e. they were not decided by Elsevier in isolation but by a consortium of the following publishers:
    http://www.stixfonts.org/stipubs.html
    Anyway the point is if the character actually provides more nuance than an equals sign at what point should the normalization be performed? Obviously placing a private area Unicode character in the HTML is unacceptable but replacing the character with its nearest Unicode surrogate isn’t an entirely satisfactory solution either…although an equals sign is indeed a nearly ubiquitously used surrogate. I assume you’d hate it even more if they’d used an image instead of a character 😉

    • pm286 says:

      All your points are well taken Daniel, and I’d agree that STIX seems to be the core.
      >>An equals sign may look like a double bond but semantically it isn’t one.
      This is determined by usage. Many symbols (e.g. “+”) are hugely overloaded. The prevalence of homoglyphs (two characters that look the same but have different semantics) is just as much of a problem (e.g. Aring and angstr are visually identical and are both used for the Angstrom unit).
      >>Considering what else is present in Unicode I’m surprised characters for double/triple/quadruple bond are not yet in Unicode. Similarly the closest approximation to a triple bond is probably the Unicode “identical to” symbol (≡) and as far as I know there is no Unicode character appropriate for representing a quadruple bond. Elsevier’s documentation of their DTD appears to enumerate these characters:
      http://www.elsevier.com/__data/assets/pdf_file/0008/109844/ja50_tagbytag5.pdf (p25)
      I think these characters are part of the STIX project (http://www.ams.org/STIX/private/stixprv-E6.html) i.e. they were not decided by Elsevier in isolation but by a consortium of the following publishers:
      http://www.stixfonts.org/stipubs.html
      I will accept this.
      >> Anyway the point is if the character actually provides more nuance than an equals sign at what point should the normalization be performed? Obviously placing a private area Unicode character in the HTML is unacceptable but replacing the character with its nearest Unicode surrogate isn’t an entirely satisfactory solution either…although an equals sign is indeed a nearly ubiquitously used surrogate.
      I think all practising chemists (as opposed to publishers of chemistry) will use “equals”. And everyone – including I hope publishers – uses a “+” (plus) to represent both a positive charge and addition and chemical reactions and many other overloaded operations.
      >>I assume you’d hate it even more if they’d used an image instead of a character 😉
      If in the XML, absolutely yes. In the PDF – probably yes.
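The two normalization questions raised in this exchange behave differently in practice, and a small Python sketch may make that concrete. Standard Unicode normalization (NFC) folds the Angstrom sign onto A-with-ring, but it leaves private-use characters untouched, so those need an explicit mapping table. The two private-use codepoints and the `PUA_TO_STANDARD` table are hypothetical, invented for this illustration; they are not taken from the Elsevier or STIX documentation.

```python
import unicodedata

ARING = "\u00c5"          # LATIN CAPITAL LETTER A WITH RING ABOVE
ANGSTROM_SIGN = "\u212b"  # ANGSTROM SIGN

# Hypothetical mapping: the private-use codepoints are placeholders, and the
# targets are "=" and the "identical to" symbol suggested in the comment above.
PUA_TO_STANDARD = str.maketrans({"\ue001": "=", "\ue002": "\u2261"})


def normalize(text):
    """Map known private-use bond characters, then apply NFC."""
    return unicodedata.normalize("NFC", text.translate(PUA_TO_STANDARD))


# Homoglyphs: different codepoints, but NFC maps one onto the other.
print(ARING == ANGSTROM_SIGN)                                # False
print(unicodedata.normalize("NFC", ANGSTROM_SIGN) == ARING)  # True

# Private-use characters: NFC alone changes nothing.
print(unicodedata.normalize("NFC", "C\ue001O") == "C\ue001O")  # True

# Only the explicit table recovers something searchable.
print(normalize("C\ue001O"))  # C=O
```

The design point is that the mapping table has to come from whoever assigned the private-use codepoints; nothing in Unicode itself can supply it.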

  2. Henry Rzepa says:

    Can I give another example of what Peter describes: http://www.ch.imperial.ac.uk/rzepa/blog/?p=12728 I read an article about B40, and wished to acquire the coordinates. They were apparently available in a PDF file as supporting information.
    I found the columns of numbers, and did a simple copy/paste (fortunately, this PDF file was not DRM-protected to prevent me from doing so). After the paste, I had gibberish.
    I suspect what had happened is that the original authors were from China, and they used fonts to describe the numbers that Western PDF readers do not recognise.
    I did eventually extract numbers, but with the line throws missing and I had to edit (40) of these in by hand. In this instance, I was able to recognise (as a human) where they belonged, but I do wonder whether a machine could have managed?

