OCR Configuration for Advanced Reading

IronOCR provides advanced scan reading methods such as ReadPassport, ReadLicensePlate, and ReadPhoto that go beyond standard OCR. These methods are powered by the IronOcr.Extensions.AdvancedScan package. To fine-tune how these methods process text, IronOCR exposes the TesseractConfiguration class, giving developers full control over character whitelisting, blacklisting, barcode detection, data table reading, and more.

This article covers the TesseractConfiguration properties available for advanced reading and practical examples for configuring OCR in real-world scenarios.


TesseractConfiguration Properties

The TesseractConfiguration class provides the following properties for customizing OCR behavior. These are set through IronTesseract.Configuration.

Property | Type | Description
WhiteListCharacters | string | Only characters present in this string will be recognized in the OCR output. All other characters are excluded.
BlackListCharacters | string | Characters in this string are actively ignored and removed from the OCR output.
ReadBarCodes | bool | Enables or disables barcode detection within the document during OCR processing.
ReadDataTables | bool | Enables or disables table structure detection within the document using Tesseract.
PageSegmentationMode | TesseractPageSegmentationMode | Determines how Tesseract segments the input image. Options include AutoOsd, Auto, SingleBlock, SingleLine, SingleWord, and more.
RenderSearchablePdf | bool | When enabled, OCR output can be saved as a searchable PDF with an invisible text layer.
RenderHocr | bool | When enabled, OCR output includes hOCR data for further processing or export.
TesseractVariables | Dictionary<string, object> | Provides direct access to low-level Tesseract configuration variables for fine-grained control. See the full list of Tesseract variables.

Beyond these high-level properties, IronOCR offers detailed customization through the TesseractVariables dictionary, which exposes hundreds of underlying Tesseract engine parameters for specialized use cases.
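As a minimal sketch of how the dictionary might be used, the snippet below sets a few engine variables by name. The variable names are standard Tesseract parameters, but the values are illustrative only, and which variables take effect depends on the Tesseract version in use, so treat this as a starting point rather than a recommended configuration.

```csharp
using IronOcr;

IronTesseract ocr = new IronTesseract();

// Engine variables are addressed by their Tesseract name.
// The values below are illustrative assumptions, not tuned recommendations.

// Disable Tesseract's internal parallel processing
ocr.Configuration.TesseractVariables["tessedit_parallelize"] = false;

// Skip the main dictionary and the frequent-word dictionary, which can help
// when the expected text is codes or identifiers rather than natural language
ocr.Configuration.TesseractVariables["load_system_dawg"] = false;
ocr.Configuration.TesseractVariables["load_freq_dawg"] = false;
```

Changing one variable at a time and comparing results on a representative sample image makes it easier to see which setting actually affects accuracy.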

Setting Up a Character Whitelist for License Plates

A common use case for WhiteListCharacters is restricting OCR output to only the characters that can appear on a license plate: uppercase letters, digits, hyphens, and spaces. This eliminates noise and improves accuracy by telling the engine to ignore anything outside the expected character set.

Input

The following vehicle registration record contains a mix of uppercase text, lowercase text, special symbols (@, $, #, |, ~, ^, *), and punctuation. Only uppercase letters, digits, hyphens, and spaces should survive the whitelist.

Vehicle registration record with mixed characters for OCR whitelist demonstration

Code

In this example, WhiteListCharacters is set to A-Z, 0-9, hyphens, and spaces. The BlackListCharacters property filters out known noise symbols like `, ~, @, #, $, %, &, and *.

using IronOcr;

// Initialize the Tesseract OCR engine
IronTesseract ocr = new IronTesseract();

ocr.Configuration = new TesseractConfiguration
{
    // Whitelist only characters that appear on license plates
    WhiteListCharacters = "ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789- ",

    // Blacklist common noise characters
    BlackListCharacters = "`~@#$%&*",
};

// OcrInput is disposable, so release it when the program ends
using var ocrInput = new OcrInput();
// Load the input image
ocrInput.LoadImage("advanced-input.png");
// Perform OCR on the input image with ReadPhoto method
var results = ocr.ReadPhoto(ocrInput);

// Print the filtered text result to the console
Console.WriteLine(results.Text);

Output

OCR output showing only whitelisted license plate characters

The whitelist filtering is clearly visible in the results:

  • "Plate: ABC-1234" becomes "P ABC-1234". The lowercase letters "late" and the colon are dropped, while the plate number is preserved exactly.
  • "VIN: 1HGBH41JXMN109186" becomes "VIN 1HGBH41JXMN109186". The colon is dropped, but the uppercase VIN and full number are kept.
  • "Owner: john.doe@email.com" becomes "O". The entire lowercase email and punctuation are removed.
  • "Region: CA-90210 | Zone #5" becomes "R CA-90210 Z 5". The pipe (|) and hash (#) are removed, while the uppercase letters and numbers survive.
  • "Fee: $125.00 + tax*" becomes "F 12500". The dollar sign, decimal point, plus sign, asterisk, and lowercase "tax" are all removed.
  • "Ref: ~record_v2^final" becomes "R 2". The tilde (~), underscore, caret (^), and all lowercase characters are stripped.
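The effect of the whitelist on each line can be approximated without running OCR at all. The sketch below is a plain string filter, not how the engine applies the whitelist internally. ApplyWhitelist is a hypothetical helper written for this illustration: it keeps only whitelisted characters and collapses the runs of spaces left behind.

```csharp
using System;
using System.Linq;

const string whitelist = "ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789- ";

// Hypothetical helper: keep whitelisted characters, then collapse leftover spaces.
static string ApplyWhitelist(string text, string allowed)
{
    var kept = new string(text.Where(c => allowed.Contains(c)).ToArray());
    return string.Join(' ', kept.Split(' ', StringSplitOptions.RemoveEmptyEntries));
}

string[] lines =
{
    "Plate: ABC-1234",
    "VIN: 1HGBH41JXMN109186",
    "Owner: john.doe@email.com",
    "Region: CA-90210 | Zone #5",
    "Fee: $125.00 + tax*",
    "Ref: ~record_v2^final",
};

foreach (var line in lines)
{
    Console.WriteLine($"\"{line}\" -> \"{ApplyWhitelist(line, whitelist)}\"");
}
```

A filter like this can be handy for predicting which fields of a document will survive a given whitelist before committing to it in the OCR configuration.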

Configuring Barcode and Data Table Reading

IronOCR can detect barcodes and structured tables within documents alongside text. These features are controlled through TesseractConfiguration:

IronTesseract ocr = new IronTesseract();

ocr.Configuration = new TesseractConfiguration
{
    // Enable barcode detection within documents
    ReadBarCodes = true,

    // Enable table structure detection
    ReadDataTables = true,
};
  • ReadBarCodes: When set to true, IronOCR scans the document for barcodes in addition to text. Set to false to skip barcode detection and speed up processing when barcodes are not expected.
  • ReadDataTables: When set to true, Tesseract attempts to detect and preserve table structures in the document. This is useful for invoices, reports, and other tabular documents.

These options can be combined with WhiteListCharacters and BlackListCharacters for precise control over what is extracted from complex documents.
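With both flags enabled, the detected items would typically be read back from the result object. The sketch below assumes that OcrResult exposes a Barcodes collection whose items have a Value property, and a Tables collection. Both member names are assumptions to check against the API reference for your IronOCR version. The file name is hypothetical.

```csharp
using System;
using System.Linq;
using IronOcr;

IronTesseract ocr = new IronTesseract();

ocr.Configuration = new TesseractConfiguration
{
    ReadBarCodes = true,
    ReadDataTables = true,
};

using OcrInput input = new OcrInput();
input.LoadPdf("invoice.pdf"); // hypothetical input file

OcrResult result = ocr.Read(input);

// Assumption: each detected barcode exposes its decoded content as Value
foreach (var barcode in result.Barcodes)
{
    Console.WriteLine($"Barcode: {barcode.Value}");
}

// Assumption: detected tables are exposed as an enumerable Tables collection
Console.WriteLine($"Tables detected: {result.Tables.Count()}");
```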

Controlling Page Segmentation Mode

PageSegmentationMode tells Tesseract how to interpret the layout of the input image before recognizing text. Choosing the right mode for your document type has a direct impact on accuracy.

Mode | Use Case
AutoOsd | Automatic layout analysis with orientation and script detection
Auto | Automatic layout analysis without OSD (default)
SingleColumn | Assumes the image is a single column of text
SingleBlock | Assumes the image is a single uniform block of text
SingleLine | Assumes the image is a single line of text
SparseText | Finds as much text as possible in any order

For a label or banner that contains a single line, SingleLine eliminates multi-block analysis and improves both speed and precision:

IronTesseract ocr = new IronTesseract();

ocr.Configuration = new TesseractConfiguration
{
    PageSegmentationMode = TesseractPageSegmentationMode.SingleLine,
};

using OcrInput input = new OcrInput();
input.LoadImage("single-line-label.png");

OcrResult result = ocr.Read(input);
Console.WriteLine(result.Text);

For a scanned page with irregular text placement (such as a receipt with scattered prices), SparseText typically recovers more content than Auto because it does not assume the text is arranged in ordered blocks:

IronTesseract ocr = new IronTesseract();

ocr.Configuration = new TesseractConfiguration
{
    PageSegmentationMode = TesseractPageSegmentationMode.SparseText,
};

using OcrInput input = new OcrInput();
input.LoadImage("receipt-scan.png");

OcrResult result = ocr.Read(input);
Console.WriteLine(result.Text);

Generating Searchable PDFs and hOCR Output

RenderSearchablePdf and RenderHocr control the output formats that IronOCR produces alongside the plain text result.

RenderSearchablePdf embeds an invisible text layer over the original image, producing a PDF where users can search and copy text while the scanned image remains visible. This is the standard output format for document archival workflows.

IronTesseract ocr = new IronTesseract();

ocr.Configuration = new TesseractConfiguration
{
    RenderSearchablePdf = true,
};

using OcrInput input = new OcrInput();
input.LoadPdf("scanned-document.pdf");

OcrResult result = ocr.Read(input);
result.SaveAsSearchablePdf("searchable-output.pdf");

RenderHocr produces an hOCR document, an HTML file that encodes the text content together with bounding box coordinates for every word. This is useful when downstream tools need precise word positioning (for example, redaction engines or document layout analysis).

IronTesseract ocr = new IronTesseract();

ocr.Configuration = new TesseractConfiguration
{
    RenderHocr = true,
};

using OcrInput input = new OcrInput();
input.LoadImage("document-page.png");

OcrResult result = ocr.Read(input);
result.SaveAsHocrFile("output.html");

Both flags can be enabled at the same time if you need all three output formats (plain text, searchable PDF, and hOCR) from a single read call.
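A minimal sketch of the combined case, using only the properties and save methods shown earlier in this section. The input and output file names are placeholders.

```csharp
using System;
using IronOcr;

IronTesseract ocr = new IronTesseract();

ocr.Configuration = new TesseractConfiguration
{
    // Request both extra output formats for the same read
    RenderSearchablePdf = true,
    RenderHocr = true,
};

using OcrInput input = new OcrInput();
input.LoadPdf("scanned-document.pdf"); // placeholder input file

OcrResult result = ocr.Read(input);

// Plain text
Console.WriteLine(result.Text);

// Searchable PDF with an invisible text layer
result.SaveAsSearchablePdf("searchable-output.pdf");

// hOCR file with word-level bounding boxes
result.SaveAsHocrFile("output.html");
```

If only one of the formats is needed, leaving the other flag off avoids asking the engine to produce output that is never saved.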

Unicode Character Filtering for International Documents

For international documents in Chinese, Japanese, or Korean, the WhiteListCharacters and BlackListCharacters properties work with Unicode characters. This allows you to restrict output to specific scripts, such as only Hiragana and Katakana for Japanese.

Please note: Ensure that the corresponding language pack has been installed (e.g., IronOcr.Languages.Japanese) before proceeding.

Input

OCR advanced configuration Japanese input

Code

In this example, the whitelist includes Hiragana, Katakana, digits, and common Japanese punctuation. Noise symbols like ★, ■, and § are blacklisted.

Warning: The console may not support displaying Unicode characters. Redirecting the output to a .txt file is a reliable way to verify results when dealing with such characters.

using IronOcr;
using System.IO;

IronTesseract ocr = new IronTesseract();

ocr.Configuration = new TesseractConfiguration
{
    // Whitelist only Hiragana, Katakana, numbers, and common Japanese punctuation
    WhiteListCharacters = "あいうえおかきくけこさしすせそたちつてとなにぬねのはひふへほまみむめもやゆよらりるれろわをん" +
                            "アイウエオカキクケコサシスセソタチツテトナニヌネノハヒフヘホマミムメモヤユヨラリルレロワヲン" +
                            "0123456789、。?!()¥ー",

    // Blacklist common noise/symbols you want to ignore
    BlackListCharacters = "★■§",
};

// OcrInput is disposable, so release it when the program ends
using var ocrInput = new OcrInput();

// Load Japanese input image
ocrInput.LoadImage("jp.png");

// Perform OCR on the input image with ReadPhoto method
var results = ocr.ReadPhoto(ocrInput);

// Write the text result directly to a file named "output.txt"
File.WriteAllText("output.txt", results.Text);

// You can add this line to confirm the file was saved:
Console.WriteLine("OCR results saved to output.txt");

Output

OCR advanced configuration Japanese output

Here's the output for the input above.

Because the whitelist includes only base Hiragana and Katakana characters, derived characters such as プ (pu) and デ (de) are dropped. Kanji, the Yen symbol, and full-width parentheses are also excluded since they are not in the whitelist. Blacklisted symbols like ★ and ■ are actively removed.
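If voiced and semi-voiced kana such as プ and デ should be kept, one option is to generate the whitelist from Unicode code point ranges instead of typing each character. The sketch below is plain C# with no OCR call. UnicodeRange is a hypothetical helper, and the ranges used are the letter ranges of the Unicode Hiragana (U+3041 to U+3096) and Katakana (U+30A1 to U+30FA) blocks, which also include small kana and a few rarely used characters.

```csharp
using System;
using System.Linq;

// Hypothetical helper: every character from first to last, inclusive.
static string UnicodeRange(char first, char last) =>
    new string(Enumerable.Range(first, last - first + 1).Select(c => (char)c).ToArray());

string hiragana = UnicodeRange('\u3041', '\u3096'); // ぁ to ゖ
string katakana = UnicodeRange('\u30A1', '\u30FA'); // ァ to ヺ

// The prolonged sound mark (ー) sits outside the Katakana letter range, so add it explicitly
string whitelist = hiragana + katakana + "0123456789、。ー";

Console.WriteLine(whitelist.Contains('プ')); // True
Console.WriteLine(whitelist.Contains('デ')); // True
```

The resulting string could then be assigned to WhiteListCharacters in place of the hand-typed list. Whether the wider character set improves or hurts accuracy on a given document is something to verify on sample images.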

Conclusion

The TesseractConfiguration class gives developers fine-grained control over how IronOCR processes documents in advanced reading scenarios. By combining character whitelists with barcode detection, table reading, and international language support, you can build OCR pipelines tailored to specific document types such as license plates, passports, invoices, and multilingual content.

For more on the advanced reading methods themselves, see the ReadPassport guide, the ReadLicensePlate guide, and the full list of Tesseract configuration variables.

Curtis Chau
Technical Writer

Curtis Chau holds a Bachelor’s degree in Computer Science (Carleton University) and specializes in front-end development with expertise in Node.js, TypeScript, JavaScript, and React. Passionate about crafting intuitive and aesthetically pleasing user interfaces, Curtis enjoys working with modern frameworks and creating well-structured, visually appealing manuals.
