TL;DR Elsevier typesetting turns double bonds into garbage.
Those of you who follow this blog will know that I contend that publishers corrupt manuscripts and thereby destroy science.
Those of you who follow this blog will know that Elsevier publicly stated that I could not use the new “Hargreaves” law to mine articles on their web page and that I must do this through their API. Originally there were zillions of conditions, which – under our constant criticism – have gradually (but nowhere near completely) disappeared. They now allow me to mine from the web page, but insist that their XML-API gives better content.
I have consistently refused to use Elsevier’s API for legal, political and social reasons (I don’t want to sign my rights away, be monitored, have to ask permission, etc.). But I also know from at least 5 years of trying to interpret publishers’ PDFs and HTML that information is corrupted. By this I mean that what the author submits is turned into something different lexically, typographically and often semantically. (Yes, that means that by changing the way something looks, you can change its meaning.)
Anyway yesterday Chris Shillum, who was part of the team I challenged, tweeted that he would let me have a paper – in XML format – from the Elsevier API. For those who don’t know, XML is designed to hold information in a style-free form. It can be rendered by a stylesheet or program (e.g. FOP) into whatever font you like. I’m very familiar with XML having run the developers’ list with Henry Rzepa in 1997 and been co-author of the universal SAX protocol. Henry and I have developed Chemical Markup Language (CML) precisely for the purpose of chemical publishing (among many other things).
But Elsevier don’t use CML, they use typographers who know nothing about chemistry. At school you may have heard of a “double bond” (http://en.wikipedia.org/wiki/Double_bond). It’s normally represented by two lines between the atoms. We used to draw these with rapidographs, but now we type them. So every chemist in the world will type Carbon Dioxide as
O=C=O
capital-O equals capital-C equals capital-O
You can do it – nothing terrible happens. You can even search chemical databases using this. They all understand “equals”.
But that’s not good enough for Elsevier (and most of the others). It has to look “pretty”. It’s more important that a publication looks pretty than that it’s correct. And that’s one of the major ways they corrupt information. So here’s the paper that Chris Shillum sent me.
First as a PDF.
Can you see the C=O double bond in the middle? “(C=O stretching)”. It’s no longer an equals, but a special publisher-only symbol they think looks prettier. Among other things if I search for “C=O” I won’t find the double bond in the text. That’s bad enough. But what’s far worse is that this symbol has been included in their XML. And this gets transmitted to the HTML – which looks like (you can try this yourself http://www.sciencedirect.com/science/article/pii/S0014579301033130 ).
???
What’s happened??? Do you also see a square? The double bond has disappeared.
The square is Firefox saying “I have been given a character I don’t understand and the best I can do is draw a square” – sorry. Safari does the same. Do ANY of you get anything useful? I doubt it.
That’s because Elsevier has created a special Elsevier-only method of displaying chemistry. It probably only works inside Elsevier’s back room – it won’t work in any normal browser. Here’s what has happened.
Elsevier wanted a symbol to display a double bond. “Equals” – which all the rest of the world uses – isn’t good enough. So they created their own special Elsevier-double-bond. It’s not a standard Unicode codepoint – it’s in a Private Use Area (http://en.wikipedia.org/wiki/Private_Use_Areas). This is reserved for a single organisation to use. It is not intended for unrestricted public use. In certain cases groups, with mutual agreement, have developed communities of practice. But I know of no community outside Elsevier that uses this. (BTW the XML uses 6 Elsevier-only DTDs and can only be understood by reading a 500-page manual – the chemistry is hidden somewhere at the end. This is the monstrosity that Elsevier wishes to force us to use.)
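For anyone who wants to check a piece of text for this kind of character, here is a minimal Python sketch using only the standard `unicodedata` module. Private Use characters have the Unicode general category `Co`, so they are easy to spot. Note that I do not state the actual code point Elsevier uses here: `U+E001` below is a hypothetical stand-in, and the function name `private_use_chars` is my own.

```python
import unicodedata


def private_use_chars(text):
    """Return (index, code point) pairs for every Private Use character in text."""
    return [
        (i, f"U+{ord(ch):04X}")
        for i, ch in enumerate(text)
        if unicodedata.category(ch) == "Co"  # "Co" = Other, private use
    ]


# HYPOTHETICAL: U+E001 stands in for the publisher's double-bond character.
# The real code point is not given in this post.
ASSUMED_DOUBLE_BOND = "\ue001"

plain = "(C=O stretching)"
typeset = f"(C{ASSUMED_DOUBLE_BOND}O stretching)"

print("C=O" in plain)               # the author's text is searchable
print("C=O" in typeset)             # the typeset text is not
print(private_use_chars(typeset))   # where the private characters sit
```

The same check can be run over the text of a downloaded XML or HTML file. Anything it reports has no agreed meaning outside the organisation that assigned it.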
It’s highly dangerous. If you change a double bond to a triple bond (ethylene => acetylene) it can explode and blow you up. But double and triple bonds are both represented by a hollow square if you try to view Elsevier-HTML. And goodness knows what else:
So Elsevier destroys information.
Chris Shillum tells me on Twitter that it’s not a problem. But it is. Using the Private Use Area without the agreement of the community is utterly irresponsible. No one even knew that Elsevier was doing it.
Why’s it irresponsible? Because many languages use it for other purposes. See Wikipedia above. Estonian, Tibetan, Chinese … If an Elsevier-double-bond is used in these documents (e.g. an Estonian chemistry department) there will be certain corruption of both the chemistry and the Estonian. There are probably 10 million chemical compounds with double bonds and all will be corrupted.
But it’s also arrogant. “We’re Elsevier. We’re not going to work with existing DTDs (XML specifications) – we’re going to invent our own.” Who uses it outside Elsevier? “And we are going to force text-miners to use this monstrosity.”
And it’s the combined arrogance and incompetence of publishers that destroys science during the manuscript processing. I’ve been through it. I know.
An equals sign may look like a double bond but semantically it isn’t one. Considering what else is present in Unicode I’m surprised characters for double/triple/quadruple bond are not yet in Unicode. Similarly the closest approximation to a triple bond is probably the Unicode “identical to” symbol (≡) and as far as I know there is no Unicode character appropriate for representing a quadruple bond. Elsevier’s documentation of their DTD appears to enumerate these characters:
http://www.elsevier.com/__data/assets/pdf_file/0008/109844/ja50_tagbytag5.pdf (p25)
I think these characters are part of the STIX project (http://www.ams.org/STIX/private/stixprv-E6.html) i.e. they were not decided by Elsevier in isolation but by a consortium of the following publishers:
http://www.stixfonts.org/stipubs.html
Anyway the point is if the character actually provides more nuance than an equals sign at what point should the normalization be performed? Obviously placing a private area Unicode character in the HTML is unacceptable but replacing the character with its nearest Unicode surrogate isn’t an entirely satisfactory solution either…although an equals sign is indeed a nearly ubiquitously used surrogate. I assume you’d hate it even more if they’d used an image instead of a character 😉
All your points are well taken Daniel, and I’d agree that STIX seems to be the core.
>>An equals sign may look like a double bond but semantically it isn’t one.
This is determined by usage. Many symbols (e.g. “+”) are hugely overloaded. The prevalence of homoglyphs (two characters that have different semantics but the same appearance) is just as much of a problem (e.g. Aring (U+00C5) and the angstrom sign (U+212B) are visually identical and are both used for the Angstrom unit).
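The Å case can be checked with Python’s standard `unicodedata` module. This is only a small illustration: the two characters compare as different strings, but Unicode’s NFC normalization folds the angstrom sign onto the letter, so a search that normalizes first will find both.

```python
import unicodedata

a_ring = "\u00c5"    # LATIN CAPITAL LETTER A WITH RING ABOVE
angstrom = "\u212b"  # ANGSTROM SIGN

print(unicodedata.name(a_ring))
print(unicodedata.name(angstrom))

# Same glyph on screen, but different code points, so a naive comparison fails.
print(a_ring == angstrom)

# NFC normalization maps the angstrom sign to the letter A with ring.
normalized = unicodedata.normalize("NFC", angstrom)
print(normalized == a_ring)
```

No such standard normalization exists for a Private Use character, which is what makes that case worse than an ordinary homoglyph.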
>>Considering what else is present in Unicode I’m surprised characters for double/triple/quadruple bond are not yet in Unicode. Similarly the closest approximation to a triple bond is probably the Unicode “identical to” symbol (≡) and as far as I know there is no Unicode character appropriate for representing a quadruple bond. Elsevier’s documentation of their DTD appears to enumerate these characters:
>>http://www.elsevier.com/__data/assets/pdf_file/0008/109844/ja50_tagbytag5.pdf (p25)
>>I think these characters are part of the STIX project (http://www.ams.org/STIX/private/stixprv-E6.html) i.e. they were not decided by Elsevier in isolation but by a consortium of the following publishers:
>>http://www.stixfonts.org/stipubs.html
I will accept this.
>> Anyway the point is if the character actually provides more nuance than an equals sign at what point should the normalization be performed? Obviously placing a private area Unicode character in the HTML is unacceptable but replacing the character with its nearest Unicode surrogate isn’t an entirely satisfactory solution either…although an equals sign is indeed a nearly ubiquitously used surrogate.
I think all practising chemists (as opposed to publishers of chemistry) will use “equals”. And everyone – including I hope publishers – uses a “+” (plus) to represent both a positive charge and addition and chemical reactions and many other overloaded operations.
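As a rough sketch of what normalizing at delivery time could look like, here is a Python example that maps private bond characters to the surrogates discussed above before the text goes out as HTML. The code points `U+E001` and `U+E002` and the names `BOND_SURROGATES` and `normalize_bonds` are hypothetical, chosen only for illustration. They are not taken from Elsevier’s or STIX’s tables.

```python
# HYPOTHETICAL mapping: these Private Use code points are placeholders,
# not the ones any publisher actually uses.
BOND_SURROGATES = {
    0xE001: "=",       # assumed double-bond character -> equals sign
    0xE002: "\u2261",  # assumed triple-bond character -> IDENTICAL TO (≡)
}


def normalize_bonds(text):
    """Replace the assumed private bond characters with standard surrogates."""
    return text.translate(BOND_SURROGATES)


print(normalize_bonds("(C\ue001O stretching)"))
print(normalize_bonds("HC\ue002CH"))
```

Whether the equals sign loses nuance or not, the output is at least something every browser and every search tool can read.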
>>I assume you’d hate it even more if they’d used an image instead of a character 😉
If in the XML, absolutely yes. In the PDF – probably yes.
Can I give another example of what Peter describes: http://www.ch.imperial.ac.uk/rzepa/blog/?p=12728 I read an article about B40, and wished to acquire the coordinates. They were apparently available in a PDF file as supporting information.
I found the columns of numbers, and did a simple copy/paste (fortunately, this PDF file was not DRM-protected to prevent me from doing so). After the paste, I had gibberish.
I suspect what had happened is that the original authors were from China, and they used fonts to describe the numbers that Western PDF readers do not recognise.
I did eventually extract numbers, but with the line throws missing and I had to edit (40) of these in by hand. In this instance, I was able to recognise (as a human) where they belonged, but I do wonder whether a machine could have managed?