In my last technical post I mentioned that we were trying to recognize the character “A”. Not too difficult for a sighted European human. Hard for Eeyore. Hard for AMI, our document-reading program.
It’s taken a bit longer than I thought. Here’s our “A”:
Simple, isn’t it? It’s one colour – black. What could be simpler?
Well, actually it isn’t one colour. It’s 256 shades of gray (I have given up and use US spelling for compsci things), or more correctly a gradation from 0 (black, no light at all) to 255 (as white as you can get). Look closely and you will see “jaggies” which aren’t completely black or completely white. They are there as “antialiasing”, a method to make it look nice for humans (and it works). Remember we aren’t allowed to draw straight lines; we have to use pixels. Many years ago (1970) we used pen plotters (Calcomp) or moving spots of light (Tektronix) to draw straight lines. The great Evans and Sutherland computers, for which modern structural biologists should be grateful, produced many of the classic protein structures with straight lines (vectors) during the 1980s.
But then pixels came back, with Silicon Graphics and desktops, and are now almost universal. So we have to draw lines and circles with pixels. There are clever algorithms (I still marvel at Bresenham’s circle) but the output is for humans, not machines. A machine simply sees an array (about 30×36 = 1080) of white, grey and black pixels.
No lines.
No A.
We’ve got to reconstruct the A. (AMI asks “Couldn’t the publishers publish proper A’s?” “No AMI, they can’t, because they want to do the same thing they were doing 20 years ago: simulate paper.”) Here’s a bit of it:
Remember “0” is black and “255” is white. You can see the SW corner of the A. I have to teach AMI how to recognise it.
AMI: Are all A’s the same size?
P: No. They can be as small as 7 pixels high (anything smaller is unreadable for humans).
AMI: Are they all the same aspect ratio?
P: No. Some are thin and some are fat.
AMI: Are all the lines the same width?
P: No. There’s Helvetica Light (thin), Helvetica (medium) and Helvetica Bold (thick).
AMI: Is the A always symmetrical?
P: No, it can be slanted (oblique, italic).
AMI: Are there other fonts besides Helvetica?
P: Zillions. At least ten thousand.
AMI: So you will have to teach me a hundred thousand fonts.
P: I can’t. Some of them are copyrighted.
AMI: That means you go to jail if you use them and redistribute them?
P: More or less.
AMI: I will not be able to recognise all As.
P: Many of them are VERY similar. I will teach you how to recognise similar characters. First we have to convert them to gray.
AMI: I have some convertToGray() modules.
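For readers following along, here is a minimal sketch of what a convertToGray() module might do, using the standard Rec. 601 luma weights. The class name is mine and AMI’s real modules may well differ:

```java
import java.awt.image.BufferedImage;

public class GrayConvert {
    // Convert an RGB image to 8-bit grayscale with the standard
    // Rec. 601 luma weights: 0.299 R + 0.587 G + 0.114 B.
    static BufferedImage toGray(BufferedImage src) {
        BufferedImage out = new BufferedImage(
                src.getWidth(), src.getHeight(), BufferedImage.TYPE_BYTE_GRAY);
        for (int y = 0; y < src.getHeight(); y++) {
            for (int x = 0; x < src.getWidth(); x++) {
                int rgb = src.getRGB(x, y);
                int r = (rgb >> 16) & 0xFF, g = (rgb >> 8) & 0xFF, b = rgb & 0xFF;
                int luma = (int) Math.round(0.299 * r + 0.587 * g + 0.114 * b);
                // Write the gray value directly into the single band; going
                // through setRGB() would re-convert via the gray colour space.
                out.getRaster().setSample(x, y, 0, luma);
            }
        }
        return out;
    }
}
```

With these weights pure white maps to 255 and pure red to gray level 76.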
P: Good. Then we have to clip them.
AMI: That’s what you are writing now? Trimming off the white pixel edges.
P: That’s right.
AMI: But some are “nearly white” (240). Is that white?
P: We shall set a heuristic cutoff, maybe 240, maybe higher.
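Sketched in Java, clipping with such a cutoff might look like this. The class name is mine, 240 is just the example value from the dialogue, and AMI’s actual code may differ:

```java
import java.awt.image.BufferedImage;

public class WhiteClipper {
    static final int WHITE_CUTOFF = 240; // heuristic: >= this counts as "white"

    // Trim border rows and columns of a grayscale image that are all "white".
    static BufferedImage clip(BufferedImage gray) {
        int w = gray.getWidth(), h = gray.getHeight();
        int top = 0, bottom = h - 1, left = 0, right = w - 1;
        while (top < bottom && rowIsWhite(gray, top)) top++;
        while (bottom > top && rowIsWhite(gray, bottom)) bottom--;
        while (left < right && colIsWhite(gray, left, top, bottom)) left++;
        while (right > left && colIsWhite(gray, right, top, bottom)) right--;
        return gray.getSubimage(left, top, right - left + 1, bottom - top + 1);
    }

    static boolean rowIsWhite(BufferedImage img, int y) {
        for (int x = 0; x < img.getWidth(); x++)
            if (img.getRaster().getSample(x, y, 0) < WHITE_CUTOFF) return false;
        return true;
    }

    static boolean colIsWhite(BufferedImage img, int x, int y0, int y1) {
        for (int y = y0; y <= y1; y++)
            if (img.getRaster().getSample(x, y, 0) < WHITE_CUTOFF) return false;
        return true;
    }
}
```

Note getSubimage() shares the parent’s raster rather than copying it, which is fine for read-only use.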
AMI: How do you tell?
P: Trial and error. It’s very boring. What is worse is that I am doing some of these things for the first time.
AMI: But you shouldn’t make mistakes. That’s why we use JUnit and Maven and Jenkins.
P: But I don’t know which methods work best. Maybe Otsu? Maybe Hough? There is no single solution.
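For the curious, Otsu’s method is simple enough to sketch in a few lines: it picks the gray-level cutoff that maximises the between-class variance of the histogram. This is a generic textbook version, not AMI’s code:

```java
public class OtsuThreshold {
    // Otsu's method over a 256-bin gray-level histogram: return the
    // threshold t that maximises between-class variance wB*wF*(mB-mF)^2.
    static int threshold(int[] histogram) {
        long total = 0, sumAll = 0;
        for (int i = 0; i < 256; i++) {
            total += histogram[i];
            sumAll += (long) i * histogram[i];
        }
        double best = -1;
        int bestT = 0;
        long wB = 0, sumB = 0;          // weight and sum of the "background" class
        for (int t = 0; t < 256; t++) {
            wB += histogram[t];
            if (wB == 0) continue;
            long wF = total - wB;        // "foreground" class weight
            if (wF == 0) break;
            sumB += (long) t * histogram[t];
            double mB = (double) sumB / wB;
            double mF = (double) (sumAll - sumB) / wF;
            double between = (double) wB * wF * (mB - mF) * (mB - mF);
            if (between > best) { best = between; bestT = t; }
        }
        return bestT;
    }
}
```

For a bimodal histogram (mostly-black glyph on mostly-white background) the chosen threshold lands between the two peaks.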
AMI: Perhaps you can get some hackers to help you. They might know more about Java images.
P: Good idea. Java’s ImageIO is not very cuddly; for example a missing file does not throw FileNotFound but NullPointer.
AMI: ARE THERE ANY JAVA IMAGE HACKERS WHO CAN HELP PM-R? You don’t need to know any chemistry!
P: Thanks AMI. Hackers, please leave a comment on the blog. And there’s a very exciting way of meeting them that I am not yet allowed to blog about for a few days.
AMI: Communal knowledge makes projects go faster. I like having several people writing my code. We have good tools for keeping in sync.
P: It took me days; an experienced image guru would have done this in a morning.
AMI: So we have clipped the images. Now we have to make them the same size?
P: Yes. And I found a very useful library, Imgscalr.
AMI: Yes, you installed and tested it on a few characters. Did it work?
P: Seems to. Now we have to compute correlation… and then see how unique the results are.
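The correlation step could be as simple as a Pearson coefficient over the two same-sized pixel arrays: 1.0 for identical images, near 0 for unrelated ones, −1.0 for an inverted copy. A sketch under that assumption, not AMI’s actual code:

```java
public class PixelCorrelation {
    // Pearson correlation between two pixel arrays of equal length.
    static double correlate(int[] a, int[] b) {
        if (a.length != b.length)
            throw new IllegalArgumentException("images must be the same size");
        double meanA = 0, meanB = 0;
        for (int i = 0; i < a.length; i++) { meanA += a[i]; meanB += b[i]; }
        meanA /= a.length;
        meanB /= b.length;
        double cov = 0, varA = 0, varB = 0;
        for (int i = 0; i < a.length; i++) {
            double da = a[i] - meanA, db = b[i] - meanB;
            cov  += da * db;
            varA += da * da;
            varB += db * db;
        }
        return cov / Math.sqrt(varA * varB); // undefined (NaN) for flat images
    }
}
```

How unique the results are across ten thousand fonts is exactly the open question.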
This is a really nice introduction to the basic problems of image enhancement and OCR. From my own experience, I might have a few suggestions:
Regarding the heuristic brightness cutoff (where you use 240 as an example), I’ve found it quite handy to simply use the _average_ brightness of the image. This always works if the image is evenly bright, i.e., the values of “black” and “white” are approximately the same throughout the image.
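[The mean-brightness cutoff suggested here is only a few lines of Java; an editorial sketch, with class and method names of my own choosing:]

```java
import java.awt.image.BufferedImage;

public class AverageCutoff {
    // Mean gray level of an 8-bit grayscale image, usable as a
    // binarisation cutoff when the illumination is roughly even.
    static int meanBrightness(BufferedImage gray) {
        long sum = 0;
        for (int y = 0; y < gray.getHeight(); y++)
            for (int x = 0; x < gray.getWidth(); x++)
                sum += gray.getRaster().getSample(x, y, 0);
        return (int) (sum / ((long) gray.getWidth() * gray.getHeight()));
    }
}
```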
The NullPointerException you mention regarding Java ImageIO does not occur if the image file to load does not exist, but rather if the file does exist, but ImageIO is unable to decode the image, as documented here:
http://docs.oracle.com/javase/6/docs/api/javax/imageio/ImageIO.html#read(java.io.File)
For image scaling, Java has quite a few onboard facilities, which can also be used for white edge truncation, as well as gray scale conversion. For starters, it’s worthwhile to look at these two classes, especially the details listed below each:
http://docs.oracle.com/javase/6/docs/api/java/awt/image/BufferedImage.html
– TYPE_BYTE_GRAY
– createGraphics()
http://docs.oracle.com/javase/6/docs/api/java/awt/Graphics2D.html
– drawImage()
– scale()
– drawString() (for creating reference images for comparison)
Really useful! Thanks
Yes, I have used createGraphics() before (e.g. to create thumbnails). As always, it’s a matter of finding the right methods and the order in which to call them.
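[A sketch of scaling with only core Java, along the lines the comment above suggests (TYPE_BYTE_GRAY plus createGraphics() and drawImage()); the class name and the choice of bilinear interpolation are mine:]

```java
import java.awt.Graphics2D;
import java.awt.RenderingHints;
import java.awt.image.BufferedImage;

public class OnboardScaler {
    // Scale an image to a fixed target size using only core Java.
    static BufferedImage scaleTo(BufferedImage src, int w, int h) {
        BufferedImage out = new BufferedImage(w, h, BufferedImage.TYPE_BYTE_GRAY);
        Graphics2D g = out.createGraphics();
        g.setRenderingHint(RenderingHints.KEY_INTERPOLATION,
                RenderingHints.VALUE_INTERPOLATION_BILINEAR);
        g.drawImage(src, 0, 0, w, h, null); // stretch src to fill the target
        g.dispose();
        return out;
    }
}
```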
Despite the grey borders it should be fairly easy to threshold the image efficiently. I would recommend you try FIJI (it’s part public domain, part FreeBSD), which is “Fiji Is Just ImageJ” and already comes with several threshold algorithms. The interface allows for easy experimentation.
After opening the image, the first thing you will need is to drop the colour information (the png image you have in this blog post is an RGB image). Do “Edit > Type > 8-bit”.
For the automatic threshold, go to “Image > Adjust > Threshold …”. You will get a long list of available algorithms (if you need them under another license, I have some of them implemented in GPL as well). The Otsu method worked fine.
If your characters don’t have details of 1-pixel width (they probably don’t), then I’d recommend doing a greyscale dilation with the smallest diamond structuring element before the threshold. To do this, before the automatic threshold, do “Process > Morphology > Gray morphology” and select a radius of 1.0 pixels, structure element type “diamond”, and the dilate operator.
Because seeing is believing, here’s a screenshot of the results I got with this very simple approach: http://picpaste.com/Screenshot_from_2014-03-04_12_36_19-u7R6utd5.png The red part is the final object.
About getting AMI to recognize each letter: you will have to train her. I once saw a public set of handwritten characters for the purpose of training systems. Despite the many different fonts, there should be no novelty in training such a system. Your only problem may be in distinguishing between some Latin and Greek characters. I think machine learning will be the way to go. If post offices have systems recognizing handwritten postal codes, I’m sure you can have one recognizing machine-printed characters.
I’ve found that Adobe Acrobat’s OCR is very slow and inaccurate on grayscale text. I run them through Scan Tailor ( scantailor.sourceforge.net/ ) to get crisp monochrome TIFs, which work much better. Perhaps there’s a java port of the relevant component?
Thanks. I try to avoid Adobe products but it may be useful as a reference