The average large STM publisher receives several thousand USD (either in subscriptions or in article processing charges (APCs)) to process an article. This huge and unjustified sum contains not only obscene profits (30-40%) but also gross inefficiencies and often destruction of scientific content. The point is that publishers decide what THEY want to do irrespective of what authors or sighted human readers want (let alone unsighted readers or machines).
I am highly critical of publishers' typesetting. Almost everything I say is conjecture as typesetting occurs in the publisher's backroom process, possibly including outsourcing. I can only guess what happens. I've tried to estimate the per-page cost and it's about 8-15 USD. So the cost per paper in typesetting alone is well over 100 USD. And the result is often awful.
The purpose of typesetting is to further the publisher's business. I spoke to one publisher and asked them "Why don't you use standard fonts such as Helvetica (Arial) and Unicode?" Their answer (unattributed) was:
"Helvetica is boring".
I think this sums up the whole problem. A major purpose of STM typesetting is for one publisher to compete against the others. For me, scientific typography was solved (brilliantly) by Don Knuth with TeX and Metafont about 40 years ago. Huge numbers of scientists (especially physicists) and mathematicians communicate in (La)TeX without going near a typesetter. When graduate students write their theses they use LaTeX.
Or Word. I've read many student theses in Word and it's perfectly satisfactory. Why shouldn't we use LaTeX and Word for scientific communication?
Well, of course we do. That's what arXiv takes. If scientists ran publishing it would cost a fraction of what it does now and increase scientific communication. That's what Tim Gowers and colleagues plan to do – dump the publishers completely. I completely support this of course.
The publishers have grown an artificial market in creating "final PDFs". They are often double-column. Why!? Because they can be printed. We are in the C21 and publishers are creating print. D-C PDF is awful on my laptop. Wrong aspect ratio and I have to scroll back for every page. It's destroying innovation and we are paying hundreds of millions.
And of course every publisher has to be different. It would make sense for authors to use a standard way of submitting bibliography. There is one, of course. It's http://en.wikipedia.org/wiki/BibTeX. Free as in beer and free as in speech. Been going for nearly 30 years.
But its main problem is that it makes publishing too democratic and easy. Publishers would have nothing to create gatekeeper rituals with. So instead they ask authors to destroy their bibliographic information by perpetuating arcane ways of "formatting references". Here's an example from Dino Mike's paper (https://peerj.com/articles/36.pdf) in PeerJ, and remember PeerJ is only a month or two old. But they still use the same 100-year-old approach to typesetting. Here's a "citation" from Mike's bibliography to a PLoS paper:
This is an example of the publisher-imposed wastage of human effort. What is the 5(10)? I actually have no idea (I can guess). It's not needed – the DOI is a complete reference. It's a redundant and error-prone librarian approach. But Mike had to type these meaningless numbers into the paper. And, not surprisingly, he got it wrong (or at least something is wrong). Because the target paper says:
"5(11)". It's a waste of Mike's time to ask him to do this. And the reason is simple. Publishers and metapublishers make huge amounts of money by managing bibliography, and they do it in company-specific ways. So the whole field is far more complex and error-prone than it should be. Muddle the field and create a market. (Or historically, perpetuate a muddled market). It's criminal that bibliography is still a problem.
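To make the point concrete, here is a minimal sketch of what Mike was asked to do by hand. It assumes the conventional reading of "5(10)" as volume(issue), which is my guess, and it uses a made-up DOI (`10.1234/example`) purely as a placeholder. The function names are mine.

```python
import re


def parse_volume_issue(text):
    """Split a 'V(I)' string into (volume, issue) integers.

    Assumes the conventional volume(issue) reading; that is a guess.
    """
    match = re.fullmatch(r"\s*(\d+)\s*\((\d+)\)\s*", text)
    if match is None:
        raise ValueError(f"not in volume(issue) form: {text!r}")
    return int(match.group(1)), int(match.group(2))


def doi_url(doi):
    """Build a resolver URL from a DOI; the DOI alone identifies the paper."""
    return "https://doi.org/" + doi


typed_by_author = parse_volume_issue("5(10)")   # what the bibliography says
on_target_paper = parse_volume_issue("5(11)")   # what the target paper says
mismatch = typed_by_author != on_target_paper

print(typed_by_author, on_target_paper, mismatch)
print(doi_url("10.1234/example"))  # hypothetical DOI, for illustration only
```

The hand-typed numbers can drift out of step with the target paper. The DOI does not have that problem because nobody has to retype its parts.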
Back to the typography:
- Mike wrote his paper in Word (not LaTeX, as I first wrote; see comments). He used ticks (U+2713) [see comments after my blog post for what follows]. His Unicode ticks were converted by the publisher (Kaveh) into "Dingbats". (Dingbats is not a standard PDF font – ZapfDingbats is, and I have to guess they are the same; there are mutants of Dingbats.)
I haven't met Kaveh, but I have considerable respect for his company. But I do not accept his argument. He justifies typesetting thus: "But the primary reason for "typesetting" is to produce an attractive "paginated" form of the information, according to traditions of hundreds of years".
It is precisely "hundreds of years" that is the problem. We should be looking to the present and future. And I will show you a typical example of the irresponsible destruction of scientific information. But first I will comment that IMO publishers treat unsighted humans with apparent disregard. In analysing several thousands of papers I have come across huge numbers of characters which I am absolutely sure could not be interpreted by current tools. Here's an example that I would be amazed if any machine-reader in the world could interpret correctly. It ought to speak "left-paren epsilon subscript(400) equals 816 plus-or-minus 56 cm superscript -1. "
But it doesn't get the "epsilon". The true interpretation is so horrible that I hesitate to post it – and will leave it as an exercise for you to guess. I am almost sure that the horror has been introduced by the publisher, as they use a special font (AdvOTb92eb7df.I) which neither I nor anyone else has ever heard of.
Who's the publisher? Well it's one with a Directorate of Universal Access.
But that doesn't seem to provide accessibility for unsighted humans or machines. And it's even beaten our current pdf2svg software – it needs matrix algebra to solve.
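For contrast, here is a rough sketch of how little a machine needs when the characters are plain Unicode. It uses Python's standard `unicodedata` module to look up each character's official name. The function name `spoken_names` and the test string are my own illustration, not anything a real screen reader does.

```python
import unicodedata


def spoken_names(text):
    """Return the lower-cased Unicode name of each non-space character."""
    return [
        unicodedata.name(ch, "unknown character").lower()
        for ch in text
        if not ch.isspace()
    ]


# "(ε₄₀₀ = 816 ± 56 cm⁻¹" written with escapes so the code points are explicit
expression = "(\u03b5\u2084\u2080\u2080 = 816 \u00b1 56 cm\u207b\u00b9"

for name in spoken_names(expression):
    print(name)

# A glyph stored as the character "3" can only ever be named as a three.
print(spoken_names("3"))
```

With real Unicode the epsilon, the subscripts, the plus-or-minus and the superscripts all name themselves. A character stored as "3" names itself as a digit, whatever the font does to its shape.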
Villu has got it right. "So this "epsilon" is actually "three (italic)", which is first flipped horizontally and then vertically? The italic-ness can be determined after the name of the font (ends with ".I"), and the number of flips by observing the fat tail of the stroke (should be at the lower left corner, but is at the upper right corner)?" PMR: It's only flipped once – see chunk below where you can see that FakeEpsilon and 3 are the same way up.
PMR: The fontMatrix is actually FLIP * SKEW so comes out as
(d is the skew angle.) This gave me a negative font size (so it failed to show in the output). Could a speech reader detect that and convert the "3" to an italic epsilon? Not a chance.
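Here is a small sketch of the algebra, under loud assumptions. I take FLIP to be a left-right mirror (the kind that turns a "3" into an epsilon-like shape) and SKEW to be a horizontal shear by angle d, both written as 2x2 matrices in the PDF row-vector convention. The 12 degree angle is invented, and the real matrix in the paper may well differ.

```python
import math


def matmul2(p, q):
    """Multiply two 2x2 matrices given as nested lists."""
    return [
        [p[0][0] * q[0][0] + p[0][1] * q[1][0], p[0][0] * q[0][1] + p[0][1] * q[1][1]],
        [p[1][0] * q[0][0] + p[1][1] * q[1][0], p[1][0] * q[0][1] + p[1][1] * q[1][1]],
    ]


def det2(m):
    """Determinant of a 2x2 matrix; a negative value means a mirror image."""
    return m[0][0] * m[1][1] - m[0][1] * m[1][0]


def font_matrix(skew_degrees):
    """FLIP * SKEW, assuming a left-right mirror and a horizontal shear."""
    t = math.tan(math.radians(skew_degrees))
    flip = [[-1.0, 0.0], [0.0, 1.0]]   # assumed: mirror in the vertical axis
    skew = [[1.0, 0.0], [t, 1.0]]      # assumed: shear x by tan(d) * y
    return matmul2(flip, skew)


m = font_matrix(12.0)  # 12 degrees is a made-up skew angle
print(m)
print(det2(m))  # negative: the glyph is mirrored
```

Under these assumptions the horizontal scale entry is -1 and the determinant is negative, so any tool that reads a signed scale straight off the matrix will report a negative size. Which entry a given tool actually reads is another guess on my part.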
For the record Unicode has every character you could conceivably want in normal scientific use. Here are the epsilons from http://www.fileformat.info/info/unicode/ :
I'm getting bored. There are several more pages of Unicode epsilons. It's inconceivable that one of those wouldn't be suitable. But no, the publisher takes our money to produce an abomination.
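If you want to check the epsilons for yourself, this sketch walks the whole Unicode range with Python's standard `unicodedata` module and keeps every character whose name contains "EPSILON". The function name is mine, and the number of hits depends on the Unicode version your Python ships with, so I don't quote a count.

```python
import sys
import unicodedata


def epsilon_characters():
    """Return (code point, name) for every character named '...EPSILON...'."""
    found = []
    for code_point in range(sys.maxunicode + 1):
        # Unnamed or unassigned code points fall back to an empty name.
        name = unicodedata.name(chr(code_point), "")
        if "EPSILON" in name:
            found.append((code_point, name))
    return found


for code_point, name in epsilon_characters():
    print(f"U+{code_point:04X}  {name}")
```

The list includes the plain Greek letters as well as the mathematical italic, bold and sans-serif variants, so an italic epsilon does not need a flipped "3".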