Tesseract Detailed Configuration

When it comes to OCR, having options and flexibility in how you approach and extract text from documents is essential. Because OCR is computationally expensive, you need control over performance and over the methods applied to specific documents to keep an OCR-based application scalable and efficient.

IronTesseract offers developers a range of properties and options to tinker with. For example, you can blacklist certain characters, read the barcodes within a document, or dictate how the OCR engine scans the page for potential blocks of text. All of that and more is available through the IronTesseract class.

5-Step Guide to Using IronOCR with Tesseract 5

  1. var ocrTesseract = new IronTesseract();
  2. ocrTesseract.Language = OcrLanguage.EnglishBest;
  3. ocrTesseract.Configuration.ReadBarCodes = false;
  4. ocrTesseract.Configuration.BlackListCharacters = "`ë|^";
  5. ocrTesseract.Configuration.TesseractVariables["tessedit_parallelize"] = false;

After instantiating the IronTesseract class, a few important options are immediately available to modify. The first property to configure is Language. By default, the language is English; however, IronTesseract supports up to 125 languages and even allows multiple languages with the UseMultipleLanguages method.

The second property we want to configure is Configuration, which holds an instance of the TesseractConfiguration class. With this class, we can modify how the Tesseract engine scans the document for potential blocks of text. In the example, we first set the language of the Tesseract engine to OcrLanguage.EnglishBest. This variation combines an LSTM and an OEM, which are shape recognition strategies used in OCR; combining these two strategies allows the OCR to produce more accurate results. Afterwards, we set ReadBarCodes to false to avoid reading barcodes during the OCR process.

We further narrow the characters we want to extract by blacklisting certain characters; in this example, we blacklist the backtick, ë, the pipe, and the caret so that they do not appear in the extracted text. Finally, we set TesseractVariables["tessedit_parallelize"] to false to disable parallel processing for the time being. TesseractVariables is a really powerful feature because it speaks directly to the Tesseract engine. A complete list of TesseractVariables is available for developers who want to customize the engine's behavior further when performing OCR.
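The five steps above can be put together into a single program, sketched below. Only the five configuration lines come from the guide; the input file name `sample-document.png`, the use of `OcrInput.LoadImage`, and the final `Read` call printing `result.Text` are assumptions added for illustration. They may need adjusting for your IronOCR version, and `OcrLanguage.EnglishBest` may require an additional language pack to be installed.

```csharp
using System;
using IronOcr;

// Step 1: create the OCR engine wrapper.
var ocrTesseract = new IronTesseract();

// Step 2: choose the language model (the default is English).
ocrTesseract.Language = OcrLanguage.EnglishBest;

// Step 3: skip barcode detection during the OCR pass.
ocrTesseract.Configuration.ReadBarCodes = false;

// Step 4: exclude the backtick, ë, pipe, and caret from the extracted text.
ocrTesseract.Configuration.BlackListCharacters = "`ë|^";

// Step 5: pass a variable straight to the Tesseract engine
// to disable parallel processing.
ocrTesseract.Configuration.TesseractVariables["tessedit_parallelize"] = false;

// Assumption: a hypothetical input image; replace with your own file.
// LoadImage is assumed to be available in your IronOCR version.
using var ocrInput = new OcrInput();
ocrInput.LoadImage("sample-document.png");

// Run OCR with the configuration above and print the recognized text.
OcrResult result = ocrTesseract.Read(ocrInput);
Console.WriteLine(result.Text);
```

Keeping all configuration on one IronTesseract instance before the first Read call means every document processed by that instance is read with the same settings.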
