Content-mining: #ami2 and #animalgarden continue to parse scientific PDFs into semantic form

PMR has been hacking bugs with MJ and AMI2…

PMR: PDF2SVG should now manage Type0 fonts and we’ve fixed bugs on some character processing. Now let’s look at the Dingbats… AMI, do you have a Dingbat lookup-table?

A: No

P: OK, we’ll have to create one. I’ve tried to find a conversion table to Unicode on the web… but failed. However we’ve found:

(that’s only part because the whole picture might be copyright).

A: so what’s the Unicode for (char)51?

P: It’s a “tick”. I’ll have to look it up in Unicode.

A: I can’t understand glyphs yet. You’ll have to do it for me. There are about 170. Is that “boring”?

P: VERY. But I will only do the ones I need and hope others will help out. It only has to be done once. I’ve found:

They don’t match up so I’ll have to do them one-by-one… AMI, please would you create a new table “dingbats.xml” in

pdf2svg1/src/main/resources/org/xmlcml/pdf2svg/codepoints/misc/dingbats.xml

A: done.

P: we’ll map char-51 to U+2713 because they seem to be the same (and char-52 to U+2714, etc.). And while I’m watching the cricket I can do some more. We now have good confidence in converting 99.99+% of the characters that are likely to occur in bioscience. There’s more work to do for maths.
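
The char-51 to U+2713 mapping can be pictured as a small lookup table. What follows is only a minimal sketch in Java: the class name DingbatLookup, the method toUnicode and the -1 "not yet mapped" convention are assumptions made for illustration, not the actual PDF2SVG code or the format of dingbats.xml. Only the two entries mentioned above are filled in.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: not the real pdf2svg classes or the dingbats.xml format.
class DingbatLookup {
    // Font-specific character code -> Unicode code point.
    private static final Map<Integer, Integer> TABLE = new HashMap<>();
    static {
        TABLE.put(51, 0x2713); // CHECK MARK (the "tick")
        TABLE.put(52, 0x2714); // HEAVY CHECK MARK
        // The remaining entries would be added by hand, one by one.
    }

    /** Returns the mapped code point, or -1 if this code has no entry yet. */
    static int toUnicode(int charCode) {
        return TABLE.getOrDefault(charCode, -1);
    }

    public static void main(String[] args) {
        int codePoint = toUnicode(51);
        // Print the glyph followed by its code point in U+XXXX notation.
        System.out.println(new String(Character.toChars(codePoint))
                + String.format(" U+%04X", codePoint));
        // A code without an entry is reported as -1 rather than guessed.
        System.out.println(toUnicode(53));
    }
}
```

Returning -1 for a missing entry keeps unmapped Dingbats visible instead of silently turning them into some unrelated character.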

A: Yes. I have a document which fails with CambriaMath. We default to Unicode so CambriaMath-4666 defaults to ETHIOPIC SYLLABLE SHI (U+123A) which looks like:

Is that what you want?

P: No, I have to translate each character into the table.
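
To see why the default goes wrong: 4666 in decimal is 123A in hexadecimal, so treating the font's own code as a Unicode code point lands in the Ethiopic block. A small Java sketch of that fallback (the class name DefaultToUnicode and the method describe are made up for illustration; Character.getName is the standard Java call):

```java
// Hypothetical sketch of the "default to Unicode" fallback, not pdf2svg code.
class DefaultToUnicode {
    /** Treats the font's character code as if it were a Unicode code point. */
    static String describe(int charCode) {
        // Character.getName returns the Unicode name, or null if unassigned.
        String name = Character.getName(charCode);
        return String.format("U+%04X %s", charCode, name);
    }

    public static void main(String[] args) {
        // CambriaMath-4666 read as a raw code point.
        System.out.println(describe(4666));
    }
}
```

On a standard JDK this should report U+123A ETHIOPIC SYLLABLE SHI, which is a valid character but not the maths symbol the font intended.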

A: There are only 10000 so that should take a few milliseconds…

P: SHI