I’ve been to central Melbourne (Central Business District, CBD) for the last two days. To hack. But the first visit was to http://en.wikipedia.org/wiki/Queen_Victoria_Market to find AMI. (AMI2 is the scientifically intelligent program amanuensis we are building and AMI will give us inspiration and symbolism – these things matter.) Queen Vic has rows of tourist traps including soft Australian toys.
So I had no doubt that AMI would find me. I looked long and hard, thinking platypus, opossum, echidna, koala, but I had no choice and here is AMI:
She – and AMI has always been a she – has not named her joey yet. In case you need to know more about http://en.wikipedia.org/wiki/Kangaroo , Wikipedia says kangaroos are “shy and retiring by nature”, so she is an excellent companion. They also “release virtually [no methane]. The hydrogen byproduct of fermentation is instead converted into acetate”.
So yesterday and today I sat and hacked in the superb La Trobe reading room in the http://en.wikipedia.org/wiki/State_Library_of_Victoria :
Free wifi, free power and that’s the view where I was sitting. Perfect silence. The occasional visitor coming to see Ned Kelly’s home-made suit of armour.
So I have spent two days teaching AMI about publishers’ PDFs. Remember AMI has no emotions, doesn’t get angry, doesn’t rant. So her main impressions are:
- Highly variable
- Quite a lot of work
- A challenge but manageable
She doesn’t use words like bad/good, awful/beautiful, (good/bad)valueForMoney, but “tractable/intractable”, “standard/nonstandard”, “deterministic/guesswork”.
She’s been following the discussion on the last post (http://blogs.ch.cam.ac.uk/pmr/2012/11/04/ami2-opencontentmining-pdf2svg-ami-comments-on-her-experience-of-the-digital-printing-industry-and-stm-publishers/#comment-117090 ), where there are some very useful comments.
Villu says:
Having implemented some font support for PDFBox, it is my understanding that fontNames shouldn’t be used to judge about what’s “inside” them.
The fontName=MGCOJK+AdvTT7c108665 probably corresponds to some synthetic font object. The PDF specification makes it clear that when PDF documents are exported with the objective of “the smallest PDF file size possible”, then it is OK to compact embedded font objects by leaving out unnecessary glyphs, by remapping character codes etc. For example, it is not too uncommon to encounter embedded font objects that contain only one or two glyphs.
The extent of the compacting of font objects depends on the scientific publisher. Some do it, some don’t.
OK. So AMI2 will have to deal with synthetic font objects. If they are Unicode it’s probably OK. If they aren’t we’ll need some per-publisher hacks, we suspect.
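AMI can at least spot when a font has been subsetted. The PDF specification tags a font subset with six uppercase letters and a plus sign in front of the original name, which is what MGCOJK+ looks like. Here is a minimal sketch in Python; the function names are my own invention and are not part of PDFBox or PDF2SVG. The tag only says that glyphs may be missing or remapped, not what is inside.

```python
import re

# A subset tag is six uppercase letters, then "+", then the base font name,
# e.g. "MGCOJK+AdvTT7c108665".
SUBSET_TAG = re.compile(r"^([A-Z]{6})\+(.+)$")


def split_font_name(font_name):
    """Return (subset_tag, base_name); subset_tag is None when no tag is present."""
    match = SUBSET_TAG.match(font_name)
    if match:
        return match.group(1), match.group(2)
    return None, font_name


def is_subset(font_name):
    """True when the name carries a subset tag, so glyphs may be missing or remapped."""
    return split_font_name(font_name)[0] is not None


if __name__ == "__main__":
    print(split_font_name("MGCOJK+AdvTT7c108665"))
    print(is_subset("Helvetica-Bold"))
```

A tagged name is a signal to look at the glyphs and encodings themselves rather than at the name.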
Steve Pettifer says:
Villu is correct — you cannot ‘trust’ the names of fonts to mean anything. If you’re lucky they will have some words like ‘bold’ or ‘italic’ or end with ‘-b’ or ‘-i’. But they are just opaque identifiers, planted in there by whatever software created the document (typically Unix like stuff plants human-readable names, Microsoft stuff plants duff names, but you shouldn’t rely on them even if they look as though they mean something).
So AMI2 knows not to trust fontNames. She will try to trust the content. But she’s used to heuristics and hacks.
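One way to hold Steve's advice in code is to read the name for hints but mark them as untrusted. This is only a sketch: the function name, the returned keys and the `trusted` flag are hypothetical, and the patterns are just the ones Steve mentions (“bold”, “italic”, “-b”, “-i”).

```python
def style_hints(font_name):
    """Guess bold/italic from a font name; the result is a hint, never a fact."""
    # Drop a subset tag such as "MGCOJK+" if there is one.
    base = font_name.split("+", 1)[-1].lower()
    bold = "bold" in base or base.endswith("-b")
    italic = "italic" in base or base.endswith("-i")
    # "trusted" is always False: names are opaque identifiers chosen by
    # whatever software wrote the PDF.
    return {"bold": bold, "italic": italic, "trusted": False}


if __name__ == "__main__":
    print(style_hints("ABCDEF+Times-Bold"))
    print(style_hints("MGCOJK+AdvTT7c108665"))
```

For a name like MGCOJK+AdvTT7c108665 this gives no hints at all, which is the honest answer.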
Steve Pettifer says:
The point I’m trying to make is that the mess we see in PDFs (and HTML) representation is caused by a combination of a lack of sensible authoring tools, and the process of ‘publishing’; sometimes authors do things right, and publishers mangle them; sometimes the other way round, and the mistakes appear in all representations (even in the XML versions of things).
There is one place where PDF is considerably worse than HTML, and that’s in the very naughty use publishers sometimes make of combining glyphs to make characters ‘look right’. I’ve found instances where the Å (Angstrom) symbol is created in the PDF by drawing a capital A, and then a lower-case ‘o’, with instructions to shrink the o, move it back one character, and place it above the A. In the PDF this looks OK to a human, but comes out as guff to a machine (again, a heuristic needed to spot it). In this particular instance the HTML representation was also broken, coming out as two sequential characters ‘Ao’. And all this in spite of the fact that there’s a perfectly good unicode Å character.
The ß versus β problem is surprisingly common (though as you say, it’s relatively rare that ß would appear in STM articles to mean the German character); we’ve found several instances of it in modern articles. Again, unless you’re analysing these things with a machine, they look plausible in both PDF and HTML, and it’s only an eagle-eyed reader that would spot them.
Perhaps it would be useful for us to jointly create a list of dodgy characters / naughty encodings and heuristics for spotting them?
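As a first entry for such a list, here is a sketch of a ß/β heuristic. It assumes that a genuine German ß follows a lowercase letter (German words do not start with ß), so a ß at the start of a token, after a capital, or after punctuation is probably a misencoded β. The function names are hypothetical and the rule is guesswork, not a standard; it will miss some cases and may flag a few genuine ones.

```python
def suspicious_eszetts(text):
    """Return the indexes of 'ß' characters that are unlikely to be German."""
    hits = []
    for i, ch in enumerate(text):
        if ch != "ß":
            continue
        prev = text[i - 1] if i > 0 else " "
        # Assumption: a real ß is preceded by a lowercase letter.
        if not prev.islower():
            hits.append(i)
    return hits


def repair_beta(text):
    """Replace every suspicious 'ß' with the Greek letter 'β'."""
    chars = list(text)
    for i in suspicious_eszetts(text):
        chars[i] = "β"
    return "".join(chars)


if __name__ == "__main__":
    print(repair_beta("IFNß and ß-sheet, Straße"))
```

In practice it may be safer to report the suspicious positions than to rewrite them silently.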
AMI doesn’t understand words like “mess” and “mangle” and “naughty”. She does understand “right”. She now knows that she can expect a little “o” shifted back and up above an “A” for character “Aring”. That’s no problem. She also has to learn that “H” and “e” in that order without a space spells “He”. And she will soon be taught that “He” can mean a male human or Helium (and probably lots of other things). By separating the problem into bits (first get the characters right, then see how they join, then interpret them as science) AMI2 is quite confident we shall get there. RSN.
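The first two bits (get the characters right, then see how they join) might look like this in Python. It is a sketch under assumptions: the `Glyph` record, its fields, the function names and the `gap_factor` threshold are all hypothetical, and coordinates are taken to be PDF-style with larger y higher on the page. It is not the PDF2SVG code.

```python
from dataclasses import dataclass


@dataclass
class Glyph:
    text: str
    x: float      # left edge
    y: float      # baseline; assumed PDF-style, larger y is higher on the page
    size: float   # font size
    width: float  # advance width


def merge_aring(glyphs):
    """Replace 'A' plus a shrunken, raised, backed-up 'o' with a single Å."""
    out = []
    for g in glyphs:
        prev = out[-1] if out else None
        if (prev is not None and prev.text == "A" and g.text == "o"
                and g.size < prev.size            # shrunk
                and g.y > prev.y                  # raised
                and g.x < prev.x + prev.width):   # moved back over the A
            out[-1] = Glyph("\u00c5", prev.x, prev.y, prev.size, prev.width)
        else:
            out.append(g)
    return out


def join_words(glyphs, gap_factor=0.25):
    """Join glyphs on one line into words; a wide gap starts a new word."""
    words, current, prev = [], "", None
    for g in glyphs:
        if prev is not None and g.x - (prev.x + prev.width) > gap_factor * prev.size:
            words.append(current)
            current = ""
        current += g.text
        prev = g
    if current:
        words.append(current)
    return words


if __name__ == "__main__":
    line = [
        Glyph("2", 0, 0, 10, 5),
        Glyph("A", 8, 0, 10, 7),
        Glyph("o", 10, 6, 5, 3),   # small o drawn over the A
        Glyph("H", 20, 0, 10, 7),
        Glyph("e", 27, 0, 10, 5),  # no gap after the H
    ]
    print(join_words(merge_aring(line)))
```

Deciding whether “He” is a man or Helium is the third bit, and it is deliberately left out here.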
And when we do she will be able to act as amanuensis to an awful lot of humans.
Meanwhile she has to meet Chuff…
And tomorrow she is off to the seaside with PMR to do some more hacking. Will we find wifi? Who knows – but it’s not needed for the final hack on PDF2SVG, which we might even post tomorrow. Who knows?