Monday, February 3, 2014

Optical Character Recognition (OCR)

Prompted by a reader's comment on my German blog, I experimented a bit with the very interesting topic of automatic text recognition, usually called "Optical Character Recognition (OCR)", and found that fairly good results can be achieved fairly easily.
OCR is the automatic recognition of text and characters in images. Of course, in this post I cover OCR in C#.
Writing your own OCR class would not be easy, but there are many ready-made libraries, also for C#, that can be used. I used Puma.NET.
First of all, download it from the Puma.NET homepage. To use Puma in a program, first add a reference to the library "Puma.Net.dll", which on my computer is located under "C:\Program Files (x86)\Puma.NET\Assemblies".
When you then compile the project, the three files "puma.interop.dll", "Puma.Net.dll" and "dibapi.dll" have to appear in the build folder. The last one is probably missing and therefore has to be copied manually into that folder from Puma's Assemblies folder.
One last thing has to be done before you can work with OCR: the current user must be given full access to the "COM Server" folder in the Puma directory.
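Before running any recognition, it can help to verify that the three libraries mentioned above really ended up in the build folder. The following is only a minimal sketch of such a check: the class name PumaPreflight is made up for illustration, and it assumes the program is started from the build folder (for example bin\Debug). It uses only standard .NET calls and does not touch Puma itself.

```csharp
using System;
using System.IO;

class PumaPreflight
{
    // File names exactly as listed in the post.
    static readonly string[] RequiredFiles =
    {
        "puma.interop.dll",
        "Puma.Net.dll",
        "dibapi.dll"
    };

    static void Main()
    {
        // Base directory of the running application, e.g. bin\Debug.
        string buildDir = AppDomain.CurrentDomain.BaseDirectory;

        foreach (string name in RequiredFiles)
        {
            string fullPath = Path.Combine(buildDir, name);
            string status = File.Exists(fullPath) ? "found" : "MISSING";
            Console.WriteLine("{0}: {1}", name, status);
        }
    }
}
```

If "dibapi.dll" is reported as missing, copy it over from Puma's Assemblies folder as described above.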
Now to the code: just 4 lines are enough for a complete text recognition. For simplicity we include Puma.Net via a using directive.
First we create an object of type PumaPage and pass it the path of the image we want to recognize.
Next we set the output format; here we simply choose TxtAscii.
Via the property Language we can even set the language of the text we want to scan.
With RecognizeToString() the program then tries to recognize the text in the image and returns it as a string.
Here is the complete code:

PumaPage P1 = new PumaPage(@"C:\Users\Oliver\Documents\Visual Studio 2013\Projects\COCR\COCR\bin\Debug\Test.bmp");
P1.FileFormat = PumaFileFormat.TxtAscii;
P1.Language = PumaLanguage.German;
string Result = P1.RecognizeToString();
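
For reference, the four lines above could be embedded in a small console program roughly like this. It is a sketch under a few assumptions, not the only way to do it: the class name OcrDemo and the fallback file name "Test.bmp" are placeholders, and the using block assumes that PumaPage implements IDisposable, so that the underlying COM server is released once recognition is done. The project still needs the reference to "Puma.Net.dll" described earlier.

```csharp
using System;
using System.IO;
using Puma.Net; // requires a project reference to Puma.Net.dll

class OcrDemo
{
    static void Main(string[] args)
    {
        // Placeholder default; pass a real image path as the first argument.
        string imagePath = args.Length > 0 ? args[0] : "Test.bmp";

        if (!File.Exists(imagePath))
        {
            Console.Error.WriteLine("Image not found: " + imagePath);
            return;
        }

        // Assumption: PumaPage is IDisposable, so the using block
        // cleans up the recognition engine when we are finished.
        using (PumaPage page = new PumaPage(imagePath))
        {
            page.FileFormat = PumaFileFormat.TxtAscii; // plain text output
            page.Language = PumaLanguage.German;       // language of the scanned text

            string result = page.RecognizeToString();
            Console.WriteLine(result);
        }
    }
}
```

Checking for the file up front gives a clearer error message than letting the recognition fail on a wrong path.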

For clean, computer-generated text the program works pretty well, but as soon as the font gets a little more "squiggly", the quality of the result drops quickly.


  1. I've used the Tesseract OCR engine, but it has a number of formatting and accuracy issues. So far, the best solution I have encountered is RE OCR: 20 free pages a month and very good accuracy.

  2. I found another free online OCR; it uses Tesseract OCR 3.02.