OCR Configuration for Advanced Reading

IronOCR provides advanced scan reading methods such as ReadPassport, ReadLicensePlate, and ReadPhoto that go beyond standard OCR. These methods are powered by the IronOcr.Extensions.AdvancedScan package. To fine-tune how these methods process text, IronOCR exposes the TesseractConfiguration class, giving developers full control over character whitelisting, blacklisting, barcode detection, data table reading, and more.

This article covers the TesseractConfiguration properties available for advanced reading and practical examples for configuring OCR in real-world scenarios.

Quickstart: Restrict OCR Output to a Character Whitelist

Set WhiteListCharacters on TesseractConfiguration before calling Read. Any character not in the whitelist is silently dropped from the result, eliminating noise without any post-processing.

  1. Install IronOCR with NuGet Package Manager

    PM > Install-Package IronOcr
  2. Copy and run this code snippet.

    using IronOcr;

    var ocr = new IronTesseract
    {
        Configuration = new TesseractConfiguration
        {
            WhiteListCharacters = "ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789- "
        }
    };
    var result = ocr.Read(new OcrInput("image.png"));
    Console.WriteLine(result.Text);


TesseractConfiguration Properties

The TesseractConfiguration class provides the following properties for customizing OCR behavior. These are set through IronTesseract.Configuration.

Property | Type | Description
WhiteListCharacters | string | Only characters present in this string will be recognized in the OCR output. All other characters are excluded.
BlackListCharacters | string | Characters in this string are actively ignored and removed from the OCR output.
ReadBarCodes | bool | Enables or disables barcode detection within the document during OCR processing.
ReadDataTables | bool | Enables or disables table structure detection within the document using Tesseract.
PageSegmentationMode | TesseractPageSegmentationMode | Determines how Tesseract segments the input image. Options include AutoOsd, Auto, SingleBlock, SingleLine, SingleWord, and more.
RenderSearchablePdf | bool | When enabled, OCR output can be saved as a searchable PDF with an invisible text layer.
RenderHocr | bool | When enabled, OCR output includes hOCR data for further processing or export.
TesseractVariables | Dictionary<string, object> | Provides direct access to low-level Tesseract configuration variables for fine-grained control. See the full list of Tesseract variables.

The TesseractVariables dictionary goes further still, exposing hundreds of underlying Tesseract engine parameters for cases where the high-level properties are not sufficient.
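
As a rough illustration, a low-level variable can be set by key on the dictionary. The sketch below assumes the dictionary is already initialized on a new IronTesseract instance and that keys are standard Tesseract parameter names. tessedit_parallelize is a genuine Tesseract parameter, but check the variable list for the names and value types your version accepts. The file name image.png is a placeholder.

```csharp
using System;
using IronOcr;

IronTesseract ocr = new IronTesseract();

// Assumption: TesseractVariables is ready to use without extra setup.
// The key is a Tesseract parameter name; the value is boxed as object.
ocr.Configuration.TesseractVariables["tessedit_parallelize"] = false;

using OcrInput input = new OcrInput();
input.LoadImage("image.png"); // placeholder file name

OcrResult result = ocr.Read(input);
Console.WriteLine(result.Text);
```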

The examples below demonstrate each property group, starting with character whitelisting.

Setting Up a Character Whitelist for License Plates

A common use case for WhiteListCharacters is restricting OCR output to only the characters that can appear on a license plate: uppercase letters, digits, hyphens, and spaces. This eliminates noise and improves accuracy by telling the engine to ignore anything outside the expected character set.

Input

The following vehicle registration record contains a mix of uppercase text, lowercase text, special symbols (@, $, #, |, ~, ^, *), and punctuation.

Vehicle registration record with mixed characters for OCR whitelist demonstration

BlackListCharacters supplements the whitelist by actively excluding known noise symbols like `, ~, @, #, $, %, &, and *.

using IronOcr;

// Initialize the Tesseract OCR engine
IronTesseract ocr = new IronTesseract();

ocr.Configuration = new TesseractConfiguration
{
    // Whitelist only characters that appear on license plates
    WhiteListCharacters = "ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789- ",

    // Blacklist common noise characters
    BlackListCharacters = "`~@#$%&*",
};

var ocrInput = new OcrInput();
// Load the input image
ocrInput.LoadImage("advanced-input.png");
// Perform OCR on the input image with ReadPhoto method
var results = ocr.ReadPhoto(ocrInput);

// Print the filtered text result to the console
Console.WriteLine(results.Text);

Output

OCR output showing only whitelisted license plate characters

The whitelist filtering is clearly visible in the results:

  • "Plate: ABC-1234" becomes "P ABC-1234". The lowercase letters "late" and the colon are dropped, while the plate number is preserved exactly.
  • "VIN: 1HGBH41JXMN109186" becomes "VIN 1HGBH41JXMN109186". The colon is dropped, but the uppercase VIN and full number are kept.
  • "Owner: john.doe@email.com" becomes "O". The entire lowercase email and punctuation are removed.
  • "Region: CA-90210 | Zone #5" becomes "R CA-90210 Z 5". The pipe (|) and hash (#) are removed, while the uppercase letters and numbers survive.
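  • "Fee: $125.00 + tax*" becomes "F 12500". The dollar sign, decimal point, plus sign, asterisk, and lowercase "tax" are all removed. Note that losing the decimal point turns 125.00 into 12500, so add "." to the whitelist when amounts matter.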
  • "Ref: ~record_v2^final" becomes "R 2". The tilde (~), underscore, caret (^), and all lowercase characters are stripped.
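
To reason about what a given whitelist will keep before running OCR, the effect can be approximated with plain string filtering. The sketch below is not how the engine applies the whitelist internally. ApplyWhitelist is a hypothetical helper that drops non-whitelisted characters, collapses the leftover runs of spaces, and trims the ends, which reproduces the results listed above.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

const string whitelist = "ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789- ";

// Hypothetical helper: keep only allowed characters, then tidy the spacing.
static string ApplyWhitelist(string text, string allowed)
{
    string kept = new string(text.Where(c => allowed.Contains(c)).ToArray());
    return Regex.Replace(kept, " +", " ").Trim();
}

Console.WriteLine(ApplyWhitelist("Plate: ABC-1234", whitelist));            // P ABC-1234
Console.WriteLine(ApplyWhitelist("Region: CA-90210 | Zone #5", whitelist)); // R CA-90210 Z 5
Console.WriteLine(ApplyWhitelist("Fee: $125.00 + tax*", whitelist));        // F 12500
```

Running candidate whitelists through a helper like this against known sample text is a quick way to spot characters, such as the decimal point, that should have been allowed.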

The same WhiteListCharacters and BlackListCharacters approach works for any document type, not just license plates. The next section shows how to extend a read to detect barcodes and table structures in the same pass.

Configuring Barcode and Data Table Reading

IronOCR can detect barcodes and structured tables within documents alongside text. These features are controlled through TesseractConfiguration:

IronTesseract ocr = new IronTesseract();

ocr.Configuration = new TesseractConfiguration
{
    // Enable barcode detection within documents
    ReadBarCodes = true,

    // Enable table structure detection
    ReadDataTables = true,
};
  • ReadBarCodes: When set to true, IronOCR scans the document for barcodes in addition to text. Set to false to skip barcode detection and speed up processing when barcodes are not expected.
  • ReadDataTables: When set to true, Tesseract attempts to detect and preserve table structures in the document. This is useful for invoices, reports, and other tabular documents.

These options can be combined with WhiteListCharacters and BlackListCharacters for precise control over what is extracted from complex documents.
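
A combined configuration might look like the sketch below. It assumes detected barcodes are exposed on OcrResult.Barcodes with a Value property holding the decoded string; confirm those member names against the API reference for your IronOCR version. The file name invoice.png is a placeholder.

```csharp
using System;
using IronOcr;

IronTesseract ocr = new IronTesseract();

ocr.Configuration = new TesseractConfiguration
{
    ReadBarCodes = true,
    // Keep uppercase letters, digits, and basic separators in the text output
    WhiteListCharacters = "ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789-. ",
};

using OcrInput input = new OcrInput();
input.LoadImage("invoice.png"); // placeholder file name

OcrResult result = ocr.Read(input);
Console.WriteLine(result.Text);

// Assumption: each detected barcode exposes its decoded content as Value.
foreach (var barcode in result.Barcodes)
{
    Console.WriteLine($"Barcode: {barcode.Value}");
}
```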

While filtering and detection control what gets extracted, layout interpretation is a separate concern. The next section covers how to select the right PageSegmentationMode for the document type.

Controlling Page Segmentation Mode

PageSegmentationMode tells Tesseract how to segment the input image before recognition. Choosing the wrong mode for a given layout causes the engine to misread or skip text entirely.

Mode | Use Case
AutoOsd | Automatic layout analysis with orientation and script detection
Auto | Automatic layout analysis without OSD (default)
SingleColumn | Assumes the image is a single column of text
SingleBlock | Assumes the image is a single uniform block of text
SingleLine | Assumes the image is a single line of text
SparseText | Finds as much text as possible in any order

For a label or banner that contains a single line, SingleLine eliminates multi-block analysis and improves both speed and accuracy.

Input

single-line-label.png is a narrow shipping label with exactly one line of bold Courier text: SHIPPING LABEL: TRK-2024-XR9-001.

Single-line shipping label for OCR SingleLine segmentation mode
IronTesseract ocr = new IronTesseract();

ocr.Configuration = new TesseractConfiguration
{
    PageSegmentationMode = TesseractPageSegmentationMode.SingleLine,
};

using OcrInput input = new OcrInput();
input.LoadImage("single-line-label.png");

OcrResult result = ocr.Read(input);
Console.WriteLine(result.Text);

For a scanned page with irregular text placement, SparseText recovers more content than Auto.

Input

receipt-scan.png is a Corner Market thermal receipt with four line items (coffee, muffin, juice, granola bar), a dashed separator, subtotal, tax, and total. This is the kind of layout where fixed-block segmentation misses entries at different horizontal positions.

Thermal receipt for OCR SparseText segmentation mode
IronTesseract ocr = new IronTesseract();

ocr.Configuration = new TesseractConfiguration
{
    PageSegmentationMode = TesseractPageSegmentationMode.SparseText,
};

using OcrInput input = new OcrInput();
input.LoadImage("receipt-scan.png");

OcrResult result = ocr.Read(input);
Console.WriteLine(result.Text);
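
When it is unclear which mode suits a layout, one option is to read the same image under several modes and compare the results. The sketch below is one possible approach rather than a documented workflow, and it assumes OcrResult.Confidence reports the engine's overall confidence for the read.

```csharp
using System;
using IronOcr;

IronTesseract ocr = new IronTesseract();

// Candidate modes to compare, taken from the mode table in this section.
var modes = new[]
{
    TesseractPageSegmentationMode.Auto,
    TesseractPageSegmentationMode.SingleBlock,
    TesseractPageSegmentationMode.SparseText,
};

foreach (var mode in modes)
{
    ocr.Configuration.PageSegmentationMode = mode;

    using OcrInput input = new OcrInput();
    input.LoadImage("receipt-scan.png");

    OcrResult result = ocr.Read(input);

    // Assumption: Confidence is the overall confidence score for this read.
    Console.WriteLine($"{mode}: confidence {result.Confidence:F1}, {result.Text.Length} characters");
}
```

A higher confidence or a longer text is only a hint, so compare the printed text against the source image before settling on a mode.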

With layout segmentation tuned to the document type, the next step is controlling the output format for downstream processing.

Generating Searchable PDFs and hOCR Output

RenderSearchablePdf and RenderHocr control the output formats that IronOCR produces alongside the plain text result.

RenderSearchablePdf embeds an invisible text layer over the original image, producing a PDF where users can search and copy text while the scanned image remains visible. This is the standard output format for document archival workflows.

Input

scanned-document.pdf is a single-page business letter from IronOCR Solutions Ltd. (dated 15 March 2024, reference DOC-2024-OCR-0315). The result is saved as searchable-output.pdf.

IronTesseract ocr = new IronTesseract();

ocr.Configuration = new TesseractConfiguration
{
    RenderSearchablePdf = true,
};

using OcrInput input = new OcrInput();
input.LoadPdf("scanned-document.pdf");

OcrResult result = ocr.Read(input);
result.SaveAsSearchablePdf("searchable-output.pdf");

Output

The output is a PDF that looks identical to the input but contains a hidden text layer. Open searchable-output.pdf and use Ctrl+F to verify that the embedded text is searchable and copyable.

RenderHocr produces an hOCR document, an HTML file that encodes the text content together with bounding box coordinates for every word. This is useful when downstream tools need precise word positioning, for example, redaction engines or document layout analysis.

Input

document-page.png is a document page with the heading "Quarterly Summary Q1 2024" and two paragraphs of financial data covering revenue, operating costs, and growth drivers. The result is saved as output.html.

Document page input for hOCR bounding box output
IronTesseract ocr = new IronTesseract();

ocr.Configuration = new TesseractConfiguration
{
    RenderHocr = true,
};

using OcrInput input = new OcrInput();
input.LoadImage("document-page.png");

OcrResult result = ocr.Read(input);
result.SaveAsHocrFile("output.html");

Output

output.html encodes each recognized word with its bounding box coordinates. Open the file in a browser to inspect the hOCR structure, or pass it to a downstream tool for layout analysis or redaction.
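
For a sense of what a downstream tool does with this file, the sketch below pulls word text and bounding boxes out of hOCR markup using only the .NET base library. The sample fragment is hand-written to follow the hOCR convention (an ocrx_word span whose title carries "bbox x0 y0 x1 y1") and is not actual IronOCR output. ParseWords is a hypothetical helper. Real files may quote attributes differently or nest extra tags inside a word, so an HTML parser is the more robust choice in production.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hand-written hOCR fragment for illustration only.
string hocr =
    "<span class='ocrx_word' id='word_1_1' title='bbox 36 92 220 130; x_wconf 96'>Quarterly</span>\n" +
    "<span class='ocrx_word' id='word_1_2' title='bbox 232 92 410 130; x_wconf 95'>Summary</span>";

// Hypothetical helper: extract (word, x0, y0, x1, y1) from each ocrx_word span.
static List<(string Word, int X0, int Y0, int X1, int Y1)> ParseWords(string markup)
{
    var pattern = new Regex(
        @"<span class='ocrx_word'[^>]*title='bbox (\d+) (\d+) (\d+) (\d+)[^']*'[^>]*>([^<]*)</span>");
    var words = new List<(string Word, int X0, int Y0, int X1, int Y1)>();
    foreach (Match m in pattern.Matches(markup))
    {
        words.Add((m.Groups[5].Value,
                   int.Parse(m.Groups[1].Value), int.Parse(m.Groups[2].Value),
                   int.Parse(m.Groups[3].Value), int.Parse(m.Groups[4].Value)));
    }
    return words;
}

foreach (var w in ParseWords(hocr))
{
    Console.WriteLine($"{w.Word}: ({w.X0}, {w.Y0}) to ({w.X1}, {w.Y1})");
}
// Quarterly: (36, 92) to (220, 130)
// Summary: (232, 92) to (410, 130)
```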

Both flags can be enabled at the same time if you need all three output formats (plain text, searchable PDF, and hOCR) from a single read call.
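
A minimal sketch of that combined read, reusing only the calls and file names from the two examples above:

```csharp
using System;
using IronOcr;

IronTesseract ocr = new IronTesseract();

ocr.Configuration = new TesseractConfiguration
{
    RenderSearchablePdf = true,
    RenderHocr = true,
};

using OcrInput input = new OcrInput();
input.LoadPdf("scanned-document.pdf");

// One read call feeds all three outputs
OcrResult result = ocr.Read(input);

Console.WriteLine(result.Text);                       // plain text
result.SaveAsSearchablePdf("searchable-output.pdf");  // PDF with hidden text layer
result.SaveAsHocrFile("output.html");                 // hOCR with bounding boxes
```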

These output flags work independently of the language being read, including non-Latin scripts. The next section shows how to apply character filtering to Japanese text.

Unicode Character Filtering for International Documents

For international documents in Chinese, Japanese, or Korean, the WhiteListCharacters and BlackListCharacters properties work with Unicode characters. This allows you to restrict output to specific scripts, such as only Hiragana and Katakana for Japanese.

Please note: Ensure that the corresponding language pack has been installed (e.g., IronOcr.Languages.Japanese) before proceeding.

Input

The document contains a title (テスト), a Japanese sentence mixing Hiragana and Katakana with voiced-mark variants (プ, で), a price line with blacklisted noise symbols (★, ■) and Kanji (価格), and a memo line with another blacklisted symbol (§), more Kanji (購入), additional voiced-mark variants (プ, デ), and base Katakana (メモ, ール). The whitelist passes only base Hiragana, base Katakana, digits, and common Japanese punctuation; the three noise symbols are explicitly blacklisted.

OCR advanced configuration Japanese input

The base Hiragana and Katakana characters are listed explicitly as string literals in WhiteListCharacters, with the noise symbols listed in BlackListCharacters.

Warning: The console may not support displaying Unicode characters. Redirecting the output to a .txt file is a reliable way to verify results when dealing with such characters.
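
If console output is still wanted, switching the console to UTF-8 often helps. The sketch below uses only standard .NET calls and a fixed sample string in place of an OCR result. Whether the glyphs actually display still depends on the terminal font, so the file remains the safer check.

```csharp
using System;
using System.IO;
using System.Text;

// Ask the console to emit UTF-8 so kana and Kanji are not replaced with '?'.
// The terminal font must still contain the glyphs for them to display.
Console.OutputEncoding = Encoding.UTF8;
Console.WriteLine("テスト");

// Writing with an explicit encoding avoids depending on the console at all.
File.WriteAllText("output.txt", "テスト", Encoding.UTF8);
```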

using IronOcr;
using System.IO;

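IronTesseract ocr = new IronTesseract();

// Select the Japanese language pack (requires IronOcr.Languages.Japanese)
ocr.Language = OcrLanguage.Japanese;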

ocr.Configuration = new TesseractConfiguration
{
    // Whitelist only Hiragana, Katakana, numbers, and common Japanese punctuation
    WhiteListCharacters = "あいうえおかきくけこさしすせそたちつてとなにぬねのはひふへほまみむめもやゆよらりるれろわをん" +
                            "アイウエオカキクケコサシスセソタチツテトナニヌネノハヒフヘホマミムメモヤユヨラリルレロワヲン" +
                            "0123456789、。?!()¥ー",

    // Blacklist common noise/symbols you want to ignore
    BlackListCharacters = "★■§",
};

var ocrInput = new OcrInput();

// Load Japanese input image
ocrInput.LoadImage("jp.png");

// Perform OCR on the input image with ReadPhoto method
var results = ocr.ReadPhoto(ocrInput);

// Write the text result directly to a file named "output.txt"
File.WriteAllText("output.txt", results.Text);

// You can add this line to confirm the file was saved:
Console.WriteLine("OCR results saved to output.txt");

Output

OCR advanced configuration Japanese output

The full filtered output is available as a text file: jp-output.txt.

Because the whitelist includes only base Hiragana and Katakana characters, derived voiced-mark variants such as プ (pu) and デ (de) are dropped. Kanji characters like 価格 (price) and 購入 (purchase) are also excluded since they fall outside the whitelisted character set. Blacklisted symbols like ★, ■, and § are actively removed regardless of the whitelist.
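
To keep the voiced-mark variants, the whitelist needs to contain them as well, since プ and デ are separate code points from フ and テ. Instead of typing every character, the whitelist string can be generated from Unicode ranges. The sketch below uses only standard .NET calls. CharRange is a hypothetical helper, and the ranges U+3041 to U+3096 (Hiragana letters) and U+30A1 to U+30FA (Katakana letters) are taken from the Unicode block charts. Whether the engine recognizes every character in the resulting string depends on the installed language pack.

```csharp
using System;
using System.Linq;

// Hypothetical helper: build a string of every code point from first to last, inclusive.
static string CharRange(char first, char last) =>
    new string(Enumerable.Range(first, last - first + 1).Select(i => (char)i).ToArray());

string hiragana = CharRange('\u3041', '\u3096'); // small a through small ke, including voiced forms
string katakana = CharRange('\u30A1', '\u30FA'); // small a through vo, including voiced forms
string whitelist = hiragana + katakana + "0123456789、。?!()¥ー";

Console.WriteLine(whitelist.Contains('\u30D7')); // プ -> True
Console.WriteLine(whitelist.Contains('\u30C7')); // デ -> True
Console.WriteLine(whitelist.Contains('\u4FA1')); // 価 -> False
```

The resulting string can be assigned to WhiteListCharacters in place of the hand-typed literals.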

Where Should I Go Next?

You now know how to configure IronOCR for advanced reading scenarios. For production use, remember to obtain a license to remove watermarks and access full functionality.

Curtis Chau
Technical Writer

Curtis Chau holds a Bachelor’s degree in Computer Science (Carleton University) and specializes in front-end development with expertise in Node.js, TypeScript, JavaScript, and React. Passionate about crafting intuitive and aesthetically pleasing user interfaces, Curtis enjoys working with modern frameworks and creating well-structured, visually appealing manuals.
