OCR in Java (2); Zarkonnen Longan is the best yet

The web is wonderful! The best way to write code is not to. I posted this morning about the problems I had in using Java for Optical Character Recognition, and within an hour I had this great response from David Stark (@Zarkonnen_com):

For starters, here’s what I tried to post on your blog about Longan earlier:

So a few years back I found myself in the same situation as you – wanting to do Java OCR, and the only real solution on the block is tesseract/ocropus, which is a nightmare to install/distribute. I eventually started work on a pure-Java OCR system called “Longan”. The project is currently on an extended hiatus, but if there is interest / potential collaborators / potential users, I’d be very interested in reviving it.

PMR: that’s wonderful. Not only for this project, but the idea that people can carry a project throughout a number of years. You don’t have to be in a University. There’s a huge need…

GitHub link: https://github.com/Zarkonnen/Longan

Longan’s features:

* Pure Java with zero dependencies. It doesn’t even reference any external jars.

* Based on convolutional neural networks, which are a pretty modern and robust approach to OCR.

* Usable as a library or command-line program.

* Reasonably modular system composed of stages.

* Takes care of eliminating images and speckles from the input, adjusting input rotation, and detecting multi-column layouts.

* Recognition system is pretty much all data-driven. You can plug in a different set of neural network weights to get a different alphabet or specialise the system for a particular font or group of fonts.

* Free and libre, licensed under Apache 2.0.

PMR: very much the features I try to use myself. (I do, however, use communal libraries such as Apache Commons, and with Maven that’s easy once you have learnt how. So David doesn’t need to include the code of apache.commons.cli any more; we simply declare a Maven dependency in the pom.xml.) Fully agreed about library and CLI and the modularity.
There’s a workflow aspect to image processing. Typically we may have to crop, denoise, deskew, equalise, binarise, thin, recognise, etc. I’m normally starting with born-digital so I only need the last one or two. It’s therefore important to modularise and pipeline. And it’s very important to be able to experiment, hence data-driven and parameterisation.
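To make the "modularise and pipeline" point concrete, here is a minimal sketch of how such stages could be chained so that any of them can be dropped or swapped. It is an illustration only, not Longan's actual API: the class `PipelineSketch`, the stage factories `binarise` and `despeckle`, the threshold of 128 and the tiny sample image are all assumptions made up for this example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.UnaryOperator;

/** Hypothetical sketch of a staged image pipeline; not Longan's actual API. */
class PipelineSketch {

    // Stages run in the order they were added.
    private final List<UnaryOperator<int[][]>> stages = new ArrayList<>();

    PipelineSketch add(UnaryOperator<int[][]> stage) {
        stages.add(stage);
        return this;
    }

    int[][] run(int[][] image) {
        int[][] current = image;
        for (UnaryOperator<int[][]> stage : stages) {
            current = stage.apply(current);
        }
        return current;
    }

    /** Binarise: grey values below the threshold become 1 (ink), the rest 0. */
    static UnaryOperator<int[][]> binarise(int threshold) {
        return img -> {
            int[][] out = new int[img.length][];
            for (int y = 0; y < img.length; y++) {
                out[y] = new int[img[y].length];
                for (int x = 0; x < img[y].length; x++) {
                    out[y][x] = img[y][x] < threshold ? 1 : 0;
                }
            }
            return out;
        };
    }

    /** Despeckle: clear ink pixels that have no ink among their 8 neighbours. */
    static UnaryOperator<int[][]> despeckle() {
        return img -> {
            int[][] out = new int[img.length][];
            for (int y = 0; y < img.length; y++) {
                out[y] = img[y].clone();
                for (int x = 0; x < img[y].length; x++) {
                    if (img[y][x] == 1 && inkNeighbours(img, y, x) == 0) {
                        out[y][x] = 0;
                    }
                }
            }
            return out;
        };
    }

    // Counts ink pixels in the 8-neighbourhood of (y, x), ignoring out-of-bounds cells.
    private static int inkNeighbours(int[][] img, int y, int x) {
        int count = 0;
        for (int dy = -1; dy <= 1; dy++) {
            for (int dx = -1; dx <= 1; dx++) {
                int ny = y + dy, nx = x + dx;
                if ((dy != 0 || dx != 0)
                        && ny >= 0 && ny < img.length
                        && nx >= 0 && nx < img[ny].length) {
                    count += img[ny][nx];
                }
            }
        }
        return count;
    }

    public static void main(String[] args) {
        // Made-up 4x5 greyscale fragment: a small dark blob plus one isolated dark pixel.
        int[][] scan = {
            {255, 255, 255, 255, 255},
            {255,  10,  20, 255, 255},
            {255,  15, 255, 255, 255},
            {255, 255, 255, 255,  30},
        };
        int[][] result = new PipelineSketch()
                .add(binarise(128))
                .add(despeckle())
                .run(scan);
        for (int[] row : result) {
            System.out.println(Arrays.toString(row));
        }
    }
}
```

For born-digital input one would simply leave out the early stages and keep the last one or two, and because each stage takes its parameters when it is constructed, experimenting means changing a single `add(...)` line.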

—

DS-Z: Anyway, I hiatus-ed the project a few years ago because I hit a point where I realised I had to start going about fine-tuning the neural networks in a more methodical way, and I didn’t have the time to do so. I had basically reached the point where I’d try it out on some example input, fix things up to make it work better, only to realise I’d made it perform much worse on other input!

PMR: Very common. When it’s a difficult problem (like OCR) the parameters are often finely balanced.

In terms of trying it out, try running it with com.zarkonnen.longan.Main as the entry point and the path to the attached file as the single command-line argument. Note that the file is just a random fragment I grabbed from a scan, so it’s not particularly optimised for Longan, or Longan optimised for it. Then try commit e2f819f5f865ae6e9211a435f098883979fdb1ed which is actually much better, as it’s the version before some major re-engineering efforts which had the aforementioned effect of making it work better for some inputs and a lot worse for others.

I can also have a spelunk around and check for any secondary projects/data that I used during the project.

– David

Marvellous! It actually took me five minutes to work out how to run Longan, whereas I have struggled for days with javaocr2012.
But I shan’t throw that away. I want to see how neural nets compare with Moments+Mahalanobis. With NN you have no insight into the model and that’s a problem when you need to refine things. I shall use both, since neither can be 100% perfect. And in any case how do you tell a zero (0) from an oh (o)? I’m also going to include a third way, topology of skeletons; I’m not sure whether it’s been used before. And we’ve also got information from the environment: io is more likely to be one-zero (== 10) than one-oh, although if you’re a planetologist it might be Io, the Jovian moon. In chemistry IO could be hypoiodite, and so on.
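For readers who haven't met the Mahalanobis part of that comparison, here is a rough sketch of the idea: each glyph class is summarised by the mean and covariance of its feature vector, and an unknown glyph goes to the class with the smallest distance (x − m)ᵀ S⁻¹ (x − m). Everything specific below is an assumption for illustration: the class names, the choice of just two features (say, aspect ratio and one moment-based value), and the numbers for "0" and "o". It is not the code in javaocr2012 or Longan.

```java
import java.util.List;

/** Hypothetical sketch: nearest-class lookup by Mahalanobis distance over two features. */
class MahalanobisSketch {

    /** Per-class statistics: mean vector and 2x2 covariance matrix (illustrative values only). */
    static final class GlyphClass {
        final String label;
        final double[] mean;
        final double[][] cov;

        GlyphClass(String label, double[] mean, double[][] cov) {
            this.label = label;
            this.mean = mean;
            this.cov = cov;
        }
    }

    /** Squared Mahalanobis distance (x - m)^T S^-1 (x - m) for a 2x2 covariance S. */
    static double squaredDistance(double[] x, double[] mean, double[][] cov) {
        double det = cov[0][0] * cov[1][1] - cov[0][1] * cov[1][0];
        if (det <= 0) {
            throw new IllegalArgumentException("covariance must be positive definite");
        }
        double dx = x[0] - mean[0];
        double dy = x[1] - mean[1];
        // Inverse of [[a, b], [c, d]] is (1/det) * [[d, -b], [-c, a]].
        return (cov[1][1] * dx * dx
                - (cov[0][1] + cov[1][0]) * dx * dy
                + cov[0][0] * dy * dy) / det;
    }

    /** Returns the label of the class whose distribution is closest to x. */
    static String classify(double[] x, List<GlyphClass> classes) {
        String best = null;
        double bestDistance = Double.POSITIVE_INFINITY;
        for (GlyphClass c : classes) {
            double d = squaredDistance(x, c.mean, c.cov);
            if (d < bestDistance) {
                bestDistance = d;
                best = c.label;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Assumed features: {aspect ratio, a moment-based shape value}; numbers are made up.
        double[][] cov = {{0.01, 0.0}, {0.0, 0.01}};
        List<GlyphClass> classes = List.of(
                new GlyphClass("0", new double[] {0.60, 0.30}, cov),
                new GlyphClass("o", new double[] {0.95, 0.30}, cov));
        double[] unknown = {0.65, 0.32};
        System.out.println("closest class: " + classify(unknown, classes));
    }
}
```

Unlike the weights of a neural net, the per-class means and covariances can be inspected directly, which is the kind of insight into the model the paragraph above is asking for. When two classes come out nearly equidistant, as zero and oh often will, that is where the surrounding context has to decide.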

So I’d love David to find spare time to hack on this…
