We are developing software to read scientific documents automatically. One problem is that many of the characters are in pixel form and we have to recognise them. This is called Optical Character Recognition, or OCR.
As with all problems I look to see whether there is a solution already. I need a FLOSS (Open) pure Java solution. I have found three possibilities:
- Tesseract. Not pure Java
- Java OCR (was simple but now so complicated we don’t even know how to run it! Help would be useful!)
- Lookup (a method for correlating character images – but where to start?).
So we have decided (regretfully) we have to write our own. If anyone has a pure Java OCR solution that doesn’t require training please put us out of our agony. Ultimately it should do a subset of Unicode including Greek, Maths, symbols, etc. I’ve hacked several ideas including thinning and topology (which is still on the boards). But recently I came to the conclusion that to make a start we should correlate known characters with unknown ones. Since most scientific images use Helvetica (or similar sans-serif) or Times-New-Roman (or similar serif) or Courier (or similar monospace) we can start with those to compare against.
From “The House at Pooh Corner”, quoted without permission but with Love (A A Milne was my aunts’ uncle):
Eeyore had three sticks on the ground, and was looking at them. Two of the sticks were touching at one end, but not at the other, and the third stick was laid across them. Piglet thought that perhaps it was a Trap of some kind.
“Oh, Eeyore,” he began again, “I just–”
“Is that little Piglet?” said Eeyore, still looking hard at his sticks.
“Yes, Eeyore, and I–”
“Do you know what this is?”
“No,” said Piglet.
“It’s an A.”
How do we know it’s an “A”? (Douglas Hofstadter wrote deeply on this.) My simple solution is to ask: “does it have a good pixel-wise correlation coefficient with any of the As in the three font families, after scaling and translation?”
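As a rough sketch of what "pixel-wise correlation coefficient" could mean in practice (this is not the code from this project), the class below computes a Pearson correlation between the grey levels of two images of the same size, using only the JDK. The class name `PixelCorrelation`, the grey-level averaging, and the choice to return 0 for a featureless image are all assumptions made for illustration.

```java
import java.awt.image.BufferedImage;

/** Sketch: pixel-wise Pearson correlation between two equally sized images. */
final class PixelCorrelation {

    /** Grey level (0..255) of one pixel, taken as the mean of R, G and B. */
    static double gray(BufferedImage img, int x, int y) {
        int rgb = img.getRGB(x, y);
        int r = (rgb >> 16) & 0xFF;
        int g = (rgb >> 8) & 0xFF;
        int b = rgb & 0xFF;
        return (r + g + b) / 3.0;
    }

    /**
     * Returns a value in [-1, 1]: 1 for identical patterns, -1 for an exact
     * negative, and 0 when either image is completely uniform (an assumption,
     * since the coefficient is undefined in that case).
     */
    static double correlate(BufferedImage a, BufferedImage b) {
        if (a.getWidth() != b.getWidth() || a.getHeight() != b.getHeight()) {
            throw new IllegalArgumentException("images must have the same size");
        }
        int w = a.getWidth();
        int h = a.getHeight();
        int n = w * h;

        // First pass: mean grey level of each image.
        double sumA = 0.0;
        double sumB = 0.0;
        for (int y = 0; y < h; y++) {
            for (int x = 0; x < w; x++) {
                sumA += gray(a, x, y);
                sumB += gray(b, x, y);
            }
        }
        double meanA = sumA / n;
        double meanB = sumB / n;

        // Second pass: covariance and the two variances (unnormalised).
        double cov = 0.0;
        double varA = 0.0;
        double varB = 0.0;
        for (int y = 0; y < h; y++) {
            for (int x = 0; x < w; x++) {
                double da = gray(a, x, y) - meanA;
                double db = gray(b, x, y) - meanB;
                cov += da * db;
                varA += da * da;
                varB += db * db;
            }
        }
        double denom = Math.sqrt(varA * varB);
        return denom == 0.0 ? 0.0 : cov / denom;
    }
}
```

An unknown glyph would be compared against each reference "A" with `PixelCorrelation.correlate(unknown, reference)`, and the best score kept. What counts as a "good" score is a threshold that would have to be tuned on real data.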
I’ve spent a week writing code (very badly) to do the scaling and translation. Part of the problem is that I was in the Australian desert with very limited wifi, no geek companions and broken chunks of time. I’d write a bit each day, spend my time remembering what I had done, hack away at tests and then start from scratch again.
I’ve now come back and found a simple online solution:
BufferedImage bimage = Scalr.resize(bimage, Method.QUALITY, Mode.FIT_EXACT, width,
from the imgscalr library (see this post http://e-blog-java.blogspot.com.au/2012/02/cropping-rotate-and-resizing-images.html )
Doh! I could have got all this working a week ago.
Yes, if I had known there *was* a solution and where to look. Sometimes it takes a week to find where to start. You look on Stack Overflow, Google a bit, tweet a bit, and sometimes it takes minutes, but sometimes it takes a week. It really helps to have a geek cloud round you. I do reasonably well on these things, but when you convolute the different languages (C, Java, Python, Perl, R, JS, etc.) with Open and Easy then the message becomes more diffuse. (There are some packages – e.g. OpenCV – where it can take a week to know where to start!)
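For comparison, the scaling and translation step can also be sketched with the JDK alone, without any extra library. This is not the imgscalr implementation and not the code from this project. It is a minimal illustration that assumes dark ink on a light background, handles "translation" by cropping to the bounding box of the ink, and then stretches the result to an exact size. The class name `GlyphNormalizer` and the threshold of 128 are assumptions.

```java
import java.awt.Graphics2D;
import java.awt.RenderingHints;
import java.awt.image.BufferedImage;

/** Sketch: crop a glyph to its ink and scale it to a fixed size (JDK only). */
final class GlyphNormalizer {

    /** Assumed cut-off: grey levels below this count as ink. */
    static final int DARK_THRESHOLD = 128;

    static boolean isInk(BufferedImage img, int x, int y) {
        int rgb = img.getRGB(x, y);
        int gray = (((rgb >> 16) & 0xFF) + ((rgb >> 8) & 0xFF) + (rgb & 0xFF)) / 3;
        return gray < DARK_THRESHOLD;
    }

    /** Crops to the bounding box of the ink; returns img unchanged if there is none. */
    static BufferedImage cropToInk(BufferedImage img) {
        int minX = img.getWidth();
        int minY = img.getHeight();
        int maxX = -1;
        int maxY = -1;
        for (int y = 0; y < img.getHeight(); y++) {
            for (int x = 0; x < img.getWidth(); x++) {
                if (isInk(img, x, y)) {
                    minX = Math.min(minX, x);
                    minY = Math.min(minY, y);
                    maxX = Math.max(maxX, x);
                    maxY = Math.max(maxY, y);
                }
            }
        }
        if (maxX < 0) {
            return img; // no ink found
        }
        return img.getSubimage(minX, minY, maxX - minX + 1, maxY - minY + 1);
    }

    /** Stretches img to exactly width x height, ignoring the aspect ratio. */
    static BufferedImage scaleExact(BufferedImage img, int width, int height) {
        BufferedImage out = new BufferedImage(width, height, BufferedImage.TYPE_INT_RGB);
        Graphics2D g = out.createGraphics();
        try {
            g.setRenderingHint(RenderingHints.KEY_INTERPOLATION,
                    RenderingHints.VALUE_INTERPOLATION_BILINEAR);
            g.drawImage(img, 0, 0, width, height, null);
        } finally {
            g.dispose();
        }
        return out;
    }

    /** Translation (crop to ink) followed by scaling to a common size. */
    static BufferedImage normalize(BufferedImage img, int width, int height) {
        return scaleExact(cropToInk(img), width, height);
    }
}
```

Once the unknown glyph and every reference glyph have been passed through `GlyphNormalizer.normalize(img, 32, 32)` (the size is arbitrary), they share the same dimensions and can be compared pixel by pixel. Stretching without preserving the aspect ratio is a deliberate simplification here, and it would make a narrow "l" and a wide "m" fill the same box.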
Rabbit came up importantly, nodded to Piglet, and said, “Ah, Eeyore,” in the voice of one who would be saying “Good-bye” in about two more minutes.
“There’s just one thing I wanted to ask you, Eeyore. What happens to Christopher Robin in the mornings nowadays?”
“What’s this that I’m looking at?” said Eeyore, still looking at it.
“Three sticks,” said Rabbit promptly.
“You see?” said Eeyore to Piglet. He turned to Rabbit. “I will now answer your question,” he said solemnly.
“Thank you,” said Rabbit.
“What does Christopher Robin do in the mornings? He learns. He becomes Educated. He instigorates–I think that is the word he mentioned, but I may be referring to something else–he instigorates Knowledge. In my small way I also, if I have the word right, am–am doing what he does. That, for instance, is?”
“An A,” said Rabbit, “but not a very good one. Well, I must get back and tell the others.”
So Eeyore has to refactor his “A”:
“What did Rabbit say it was?” he asked.
“An A,” said Piglet.
“Did you tell him?”
“No, Eeyore, I didn’t. I expect he just knew.”
“He knew? You mean this A thing is a thing Rabbit knew?”
“Yes, Eeyore. He’s clever, Rabbit is.”
“Clever!” said Eeyore scornfully, putting a foot heavily on his three sticks. “Education!” said Eeyore bitterly, jumping on his six sticks. “What is Learning?” asked Eeyore as he kicked his twelve sticks into the air. “A thing Rabbit knows! Ha!”
So I have refactored my code by jumping on it and kicking it into the air.
But I am not bitter – I am happy!