In my last technical post I mentioned that we were trying to recognize the character "A". Not too difficult for a sighted European human. Hard for Eeyore. Hard for AMI, our document-reading program.
It's taken a bit longer than I thought. Here's our "A":
Simple, isn't it? It's one colour - black. What could be simpler?
Well, actually it isn't one colour. It's 256 shades of gray (I have given up and use US spelling for compsci things), or more correctly it's a gradation from 0 (black, no light at all) to 255 (as white as you can get). Look closely and you will see "jaggies" - edge pixels which are neither completely black nor completely white. They are there as "antialiasing" - a method to make it look nice for humans (and it works). Remember we aren't allowed to draw straight lines - we have to use pixels. Many years ago - 1970 - we used pen plotters (Calcomp) or moving spots of light (Tektronix) to draw straight lines. The great Evans and Sutherland computers - for which modern structural biologists should be grateful - produced many of the classic protein structures with straight lines (vectors) during the 1980s.
But then pixels came back - Silicon Graphics and desktops - and are almost universal. So we have to draw lines and circles with pixels. There are clever algorithms (I still marvel at Bresenham's circle) but the output is for humans, not machines. A machine simply sees an array (about 30*36 = 1080) of white, gray and black pixels.
We've got to reconstruct the A. (AMI asks "Couldn't the publishers publish proper A's?". "No AMI, they can't, because they want to do the same thing they were doing 20 years ago - simulate paper"). Here's a bit of it:
Remember "0" is black and "255" is white. You can see the SW corner of the A. I have to teach AMI how to recognise it.
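A grid of gray values like this takes only a few lines of Java to produce. What follows is a minimal sketch, not AMI's actual code: the class name GrayGrid and its methods are hypothetical, and it assumes that "gray" means the plain average of the red, green and blue channels (other weightings are common).

```java
import java.awt.image.BufferedImage;

/** Hypothetical helper: view an image as a grid of gray values (0 = black, 255 = white). */
final class GrayGrid {

    /** Gray value of one pixel, assumed here to be the plain average of R, G and B. */
    static int gray(BufferedImage image, int x, int y) {
        int rgb = image.getRGB(x, y);
        int r = (rgb >> 16) & 0xFF;
        int g = (rgb >> 8) & 0xFF;
        int b = rgb & 0xFF;
        return (r + g + b) / 3;
    }

    /** One text row per pixel row, each value right-aligned in four columns. */
    static String dump(BufferedImage image) {
        StringBuilder sb = new StringBuilder();
        for (int y = 0; y < image.getHeight(); y++) {
            for (int x = 0; x < image.getWidth(); x++) {
                sb.append(String.format("%4d", gray(image, x, y)));
            }
            sb.append('\n');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Tiny synthetic image: one white pixel, one antialiased gray, the rest black.
        BufferedImage image = new BufferedImage(3, 2, BufferedImage.TYPE_INT_RGB);
        image.setRGB(0, 0, 0xFFFFFF);
        image.setRGB(1, 0, 0x808080);
        System.out.print(dump(image));
    }
}
```

The sketch ignores the alpha channel, so an image with transparent pixels would need extra handling.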
AMI: Are all A's the same size?
P. No. They can be as small as 7 pixels high (anything smaller is unreadable for humans).
AMI. Are they all the same aspect ratio?
P. No. Some are thin and some are fat.
AMI. Are all the lines the same width?
P. No. There's Helvetica Light (thin), Helvetica (medium) and Helvetica Bold (thick).
AMI: Is the A always symmetrical?
P: No, it can be slanted (oblique, italic)
AMI: Are there other fonts besides Helvetica?
P. Zillions. At least ten thousand.
AMI: So you will have to teach me a hundred thousand fonts.
P. I can't. Some of them are copyrighted.
AMI: That means you go to jail if you use them and redistribute them?
P. More or less.
AMI. I will not be able to recognise all As.
P. Many of them are VERY similar. I will teach you how to recognise similar characters. First we have to convert them to gray.
AMI. I have some convertToGray() modules.
P. Good. Then we have to clip them.
AMI. That's what you are writing now? Trimming off the white pixel edges.
P. That's right.
AMI: But some are "nearly white" (240). Is that white?
P: We shall set a heuristic cutoff - maybe 240, maybe higher.
AMI: How do you tell?
P. Trial and error. It's very boring. What is worse is that I am doing some of the things for the first time.
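The clipping step can be sketched as follows. This is an illustration under stated assumptions rather than the code P is actually writing: the class WhiteClipper is hypothetical, the cutoff is passed in so that it can be tuned by trial and error, and a pixel counts as "white" when its averaged gray value is at or above the cutoff.

```java
import java.awt.image.BufferedImage;

/** Hypothetical helper: trim near-white margins from a character image. */
final class WhiteClipper {

    /** Gray value of a packed RGB pixel, assumed to be the plain channel average. */
    static int gray(int rgb) {
        return (((rgb >> 16) & 0xFF) + ((rgb >> 8) & 0xFF) + (rgb & 0xFF)) / 3;
    }

    /**
     * Returns the smallest sub-image containing every pixel darker than the cutoff,
     * or null if every pixel counts as white. The result shares pixel data with
     * the original image (that is how getSubimage works).
     */
    static BufferedImage clip(BufferedImage image, int cutoff) {
        int minX = image.getWidth();
        int minY = image.getHeight();
        int maxX = -1;
        int maxY = -1;
        for (int y = 0; y < image.getHeight(); y++) {
            for (int x = 0; x < image.getWidth(); x++) {
                if (gray(image.getRGB(x, y)) < cutoff) {
                    minX = Math.min(minX, x);
                    maxX = Math.max(maxX, x);
                    minY = Math.min(minY, y);
                    maxY = Math.max(maxY, y);
                }
            }
        }
        if (maxX < 0) {
            return null; // nothing darker than the cutoff
        }
        return image.getSubimage(minX, minY, maxX - minX + 1, maxY - minY + 1);
    }
}
```

With a cutoff of 240 a pixel of value 240 is treated as white and trimmed; raising the cutoff to 250 would keep it as part of the character.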
AMI: But you shouldn't make mistakes. That's why we use JUnit and Maven and Jenkins.
P. But I don't know what methods work best. Maybe Otsu? Maybe Hough? There is no single solution.
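For readers who have not met Otsu's method: it picks the gray level that best separates "dark" from "light" pixels by maximising the variance between the two classes. Here is a minimal sketch of that idea, offered as one candidate and not as what AMI ends up using. The class name OtsuThreshold is hypothetical, and the input is assumed to be an array of gray values in the range 0 to 255.

```java
/** Hypothetical helper: Otsu's method on an array of gray values (0..255). */
final class OtsuThreshold {

    /** Returns a threshold t; values <= t form the dark class, values > t the light class. */
    static int threshold(int[] grays) {
        int[] hist = new int[256];
        for (int g : grays) {
            hist[g]++;
        }
        long total = grays.length;
        long sumAll = 0;
        for (int i = 0; i < 256; i++) {
            sumAll += (long) i * hist[i];
        }

        long countDark = 0;
        long sumDark = 0;
        double bestVariance = -1.0;
        int best = 0;
        for (int t = 0; t < 256; t++) {
            countDark += hist[t];
            sumDark += (long) t * hist[t];
            long countLight = total - countDark;
            if (countDark == 0) {
                continue; // no dark pixels yet
            }
            if (countLight == 0) {
                break; // everything is already in the dark class
            }
            double meanDark = (double) sumDark / countDark;
            double meanLight = (double) (sumAll - sumDark) / countLight;
            double diff = meanDark - meanLight;
            // Between-class variance, up to a constant factor.
            double between = (double) countDark * countLight * diff * diff;
            if (between > bestVariance) {
                bestVariance = between;
                best = t;
            }
        }
        return best;
    }
}
```

Whether an automatic threshold like this beats a hand-tuned cutoff on small antialiased glyphs is exactly the kind of thing that still needs trial and error.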
AMI. Perhaps you can get some hackers to help you. They might know more about Java Images.
P. Good idea. Java's ImageIO is not very cuddly - for example a missing file does not give you a FileNotFoundException; a failed read tends to show up later as a NullPointerException.
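One way to make ImageIO a little friendlier is to wrap it. ImageIO.read is documented to return null when no registered reader can decode the input, and that null is an easy source of a later NullPointerException. The wrapper below is a sketch only: the class name SafeImageReader is hypothetical and the choice of exceptions is an assumption about what a caller would find helpful.

```java
import java.awt.image.BufferedImage;
import java.io.File;
import java.io.FileNotFoundException;
import java.io.IOException;
import javax.imageio.ImageIO;

/** Hypothetical wrapper that turns silent ImageIO failures into explicit exceptions. */
final class SafeImageReader {

    static BufferedImage read(File file) throws IOException {
        // Check for the file ourselves so the error names the real problem.
        if (!file.isFile()) {
            throw new FileNotFoundException("No such image file: " + file);
        }
        BufferedImage image = ImageIO.read(file);
        // ImageIO.read returns null when no registered reader can decode the input.
        if (image == null) {
            throw new IOException("No ImageIO reader could decode: " + file);
        }
        return image;
    }
}
```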
AMI. ARE THERE ANY JAVA IMAGE HACKERS WHO CAN HELP PM-R? You don't need to know any chemistry!
P. Thanks AMI. Hackers, please leave a comment on the blog. And there's a very exciting way of meeting them that I am not yet allowed to blog about - wait a few days.
AMI. Communal knowledge makes projects go faster. I like having several people writing my code. We have good tools for keeping in synch.
P. I took days - an experienced image guru would have done this in a morning.
AMI. So we have clipped the images. Now we have to make them the same size?
P. Yes. And I found a very useful library Imgscalr.
AMI. Yes, you installed and tested it on a few characters. Did it work?
P. Seems to. Now we have to compute correlation... and then see how unique the results are.
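The last two steps, scaling to a common size and computing a correlation, might look roughly like this. It is a sketch under several assumptions: the class GlyphCorrelation is hypothetical, plain Java2D scaling stands in for Imgscalr so that the example has no external dependency, and "correlation" is taken to mean the Pearson correlation of the gray values, which may not be the measure finally chosen.

```java
import java.awt.Graphics2D;
import java.awt.RenderingHints;
import java.awt.image.BufferedImage;

/** Hypothetical helper: bring two glyph images to one size and correlate them. */
final class GlyphCorrelation {

    /** Stretches an image to a fixed size with plain Java2D (a stand-in for Imgscalr). */
    static BufferedImage scale(BufferedImage src, int width, int height) {
        BufferedImage out = new BufferedImage(width, height, BufferedImage.TYPE_INT_RGB);
        Graphics2D g = out.createGraphics();
        g.setRenderingHint(RenderingHints.KEY_INTERPOLATION,
                RenderingHints.VALUE_INTERPOLATION_BILINEAR);
        g.drawImage(src, 0, 0, width, height, null);
        g.dispose();
        return out;
    }

    /** Gray values in row order, assumed to be the plain average of R, G and B. */
    static double[] grays(BufferedImage image) {
        double[] values = new double[image.getWidth() * image.getHeight()];
        int i = 0;
        for (int y = 0; y < image.getHeight(); y++) {
            for (int x = 0; x < image.getWidth(); x++) {
                int rgb = image.getRGB(x, y);
                values[i++] = (((rgb >> 16) & 0xFF) + ((rgb >> 8) & 0xFF) + (rgb & 0xFF)) / 3.0;
            }
        }
        return values;
    }

    /** Pearson correlation of two equally sized images; 0 if either image is flat. */
    static double correlation(BufferedImage a, BufferedImage b) {
        double[] x = grays(a);
        double[] y = grays(b);
        if (x.length != y.length) {
            throw new IllegalArgumentException("images must have the same number of pixels");
        }
        double meanX = 0;
        double meanY = 0;
        for (int i = 0; i < x.length; i++) {
            meanX += x[i];
            meanY += y[i];
        }
        meanX /= x.length;
        meanY /= y.length;
        double sxy = 0;
        double sxx = 0;
        double syy = 0;
        for (int i = 0; i < x.length; i++) {
            double dx = x[i] - meanX;
            double dy = y[i] - meanY;
            sxy += dx * dy;
            sxx += dx * dx;
            syy += dy * dy;
        }
        if (sxx == 0 || syy == 0) {
            return 0.0; // a uniformly colored image has no variation to correlate
        }
        return sxy / Math.sqrt(sxx * syy);
    }
}
```

Two clipped characters would each be passed through scale with the same target size and then handed to correlation. A value near 1 suggests the same shape, and how close is close enough is the "how unique are the results" question.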