OCR Configuration for Advanced Reading

Updated: February 26, 2026

IronOCR provides advanced scan reading methods such as ReadPassport, ReadLicensePlate, and ReadPhoto that go beyond standard OCR. These methods are powered by the IronOcr.Extensions.AdvancedScan package. To fine-tune how these methods process text, IronOCR exposes the TesseractConfiguration class, giving developers full control over character whitelisting, blacklisting, barcode detection, data table reading, and more.

This article covers the TesseractConfiguration properties available for advanced reading and practical examples for configuring OCR in real-world scenarios.

Quickstart: Restrict OCR Output to a Character Whitelist

Set WhiteListCharacters on TesseractConfiguration before calling Read. Any character not in the whitelist is silently dropped from the result, eliminating noise without any post-processing.

Install IronOCR with NuGet Package Manager
PM > Install-Package IronOcr

Copy and run this code snippet.

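using IronOcr;

// Configure the engine so that only license-plate style characters are returned
var ocr = new IronTesseract
{
    Configuration = new TesseractConfiguration
    {
        WhiteListCharacters = "ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789- "
    }
};

// Read the image and print the filtered text
var result = ocr.Read(new OcrInput("image.png"));
Console.WriteLine(result.Text);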

How to Configure OCR for Advanced Reading

Install IronOCR from NuGet
Install the IronOcr.Extensions.AdvancedScan package
Configure TesseractConfiguration properties such as WhiteListCharacters and ReadBarCodes
Load the input image with OcrInput
Read the image using an advanced method like ReadPhoto, ReadLicensePlate, or ReadPassport

TesseractConfiguration Properties

The TesseractConfiguration class provides the following properties for customizing OCR behavior. These are set through IronTesseract.Configuration.

Property	Type	Description
`WhiteListCharacters`	string	Only characters present in this string will be recognized in the OCR output. All other characters are excluded.
`BlackListCharacters`	string	Characters in this string are actively ignored and removed from the OCR output.
`ReadBarCodes`	bool	Enables or disables barcode detection within the document during OCR processing.
`ReadDataTables`	bool	Enables or disables table structure detection within the document using Tesseract.
`PageSegmentationMode`	TesseractPageSegmentationMode	Determines how Tesseract segments the input image. Options include `AutoOsd`, `Auto`, `SingleBlock`, `SingleLine`, `SingleWord`, and more.
`RenderSearchablePdf`	bool	When enabled, OCR output can be saved as a searchable PDF with an invisible text layer.
`RenderHocr`	bool	When enabled, OCR output includes hOCR data for further processing or export.
`TesseractVariables`	Dictionary<string, object>	Provides direct access to low-level Tesseract configuration variables for fine-grained control. See the full list of Tesseract variables.

The TesseractVariables dictionary goes further still, exposing hundreds of underlying Tesseract engine parameters for cases where the high-level properties are not sufficient.
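As a minimal sketch of the dictionary in use, the snippet below sets two engine variables by name before a read. The keys tessedit_parallelize and preserve_interword_spaces come from the Tesseract parameter list, not from this article, and the values are illustrative assumptions; check both against the engine version in your project.

```csharp
using IronOcr;

IronTesseract ocr = new IronTesseract();

// Assumption: these keys are standard Tesseract variable names; the values are illustrative only
ocr.Configuration.TesseractVariables["tessedit_parallelize"] = false;
ocr.Configuration.TesseractVariables["preserve_interword_spaces"] = true;

using OcrInput input = new OcrInput();
input.LoadImage("image.png");

OcrResult result = ocr.Read(input);
Console.WriteLine(result.Text);
```

Prefer the high-level properties where one exists, and reach for the dictionary only for settings they do not cover.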

The examples below demonstrate each property group, starting with character whitelisting.

Setting Up a Character Whitelist for License Plates

A common use case for WhiteListCharacters is restricting OCR output to only the characters that can appear on a license plate: uppercase letters, digits, hyphens, and spaces. This eliminates noise and improves accuracy by telling the engine to ignore anything outside the expected character set.

Input

The following vehicle registration record contains a mix of uppercase text, lowercase text, special symbols (@, $, #, |, ~, ^, *), and punctuation.

BlackListCharacters supplements the whitelist by actively excluding known noise symbols like `, ~, @, #, $, %, &, and *.

using IronOcr;

// Initialize the Tesseract OCR engine
IronTesseract ocr = new IronTesseract();

ocr.Configuration = new TesseractConfiguration
{
    // Whitelist only characters that appear on license plates
    WhiteListCharacters = "ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789- ",

    // Blacklist common noise characters
    BlackListCharacters = "`~@#$%&*",
};

var ocrInput = new OcrInput();
// Load the input image
ocrInput.LoadImage("advanced-input.png");
// Perform OCR on the input image with ReadPhoto method
var results = ocr.ReadPhoto(ocrInput);

// Print the filtered text result to the console
Console.WriteLine(results.Text);

Output

The whitelist filtering is clearly visible in the results:

"Plate: ABC-1234" becomes "P ABC-1234". The lowercase letters "late" and the colon are dropped, while the plate number is preserved exactly.
"VIN: 1HGBH41JXMN109186" becomes "VIN 1HGBH41JXMN109186". The colon is dropped, but the uppercase VIN and full number are kept.
"Owner: john.doe@email.com" becomes "O". The entire lowercase email and punctuation are removed.
"Region: CA-90210 | Zone #5" becomes "R CA-90210 Z 5". The pipe (|) and hash (#) are removed, while the uppercase letters and numbers survive.
"Fee: $125.00 + tax*" becomes "F 12500". The dollar sign, decimal point, plus sign, and lowercase "tax" are all removed.
"Ref: ~record_v2^final" becomes "R 2". The tilde (~), underscore, caret (^), and all lowercase characters are stripped.

The same WhiteListCharacters and BlackListCharacters approach works for any document type, not just license plates. The next section shows how to extend a read to detect barcodes and table structures in the same pass.
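Before moving on, the filtering rule behind the results listed above can be mimicked in plain C# to make it concrete. This sketch does not call IronOCR and is not how the engine applies the whitelist internally; the helper name ApplyWhitelist and the whitespace collapsing are assumptions chosen so that the output matches the listed lines.

```csharp
using System;
using System.Linq;

// Hypothetical helper: keep only whitelisted characters, then collapse runs of spaces
static string ApplyWhitelist(string text, string whitelist)
{
    var kept = new string(text.Where(c => whitelist.Contains(c)).ToArray());
    return string.Join(" ", kept.Split(' ', StringSplitOptions.RemoveEmptyEntries));
}

const string PlateWhitelist = "ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789- ";

Console.WriteLine(ApplyWhitelist("Plate: ABC-1234", PlateWhitelist));            // P ABC-1234
Console.WriteLine(ApplyWhitelist("Region: CA-90210 | Zone #5", PlateWhitelist)); // R CA-90210 Z 5
Console.WriteLine(ApplyWhitelist("Fee: $125.00 + tax*", PlateWhitelist));        // F 12500
```

Running expected field values through a helper like this is a quick way to check that a whitelist does not strip characters you need, before spending time on real scans.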

Configuring Barcode and Data Table Reading

IronOCR can detect barcodes and structured tables within documents alongside text. These features are controlled through TesseractConfiguration:

IronTesseract ocr = new IronTesseract();

ocr.Configuration = new TesseractConfiguration
{
    // Enable barcode detection within documents
    ReadBarCodes = true,

    // Enable table structure detection
    ReadDataTables = true,
};

ReadBarCodes: When set to true, IronOCR scans the document for barcodes in addition to text. Set to false to skip barcode detection and speed up processing when barcodes are not expected.
ReadDataTables: When set to true, Tesseract attempts to detect and preserve table structures in the document. This is useful for invoices, reports, and other tabular documents.

These options can be combined with WhiteListCharacters and BlackListCharacters for precise control over what is extracted from complex documents.
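As a sketch of how detected barcodes might be consumed after the read, the snippet below enables ReadBarCodes and iterates the results. It assumes that OcrResult exposes a Barcodes collection whose items have a Value property, and invoice.png is a placeholder file name. Neither is shown in this article, so verify both against the API reference for your version.

```csharp
using IronOcr;

IronTesseract ocr = new IronTesseract();
ocr.Configuration = new TesseractConfiguration
{
    ReadBarCodes = true,
};

using OcrInput input = new OcrInput();
input.LoadImage("invoice.png"); // placeholder file name

OcrResult result = ocr.Read(input);

// Assumption: result.Barcodes and barcode.Value exist in your IronOCR version
foreach (var barcode in result.Barcodes)
{
    Console.WriteLine(barcode.Value);
}
```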

While filtering and detection control what gets extracted, layout interpretation is a separate concern. The next section covers how to select the right PageSegmentationMode for the document type.

Controlling Page Segmentation Mode

PageSegmentationMode tells Tesseract how to segment the input image before recognition. Choosing a mode that does not match the layout can cause the engine to misread text or skip it entirely.

Mode	Use Case
`AutoOsd`	Automatic layout analysis with orientation and script detection
`Auto`	Automatic layout analysis without OSD (default)
`SingleColumn`	Assumes the image is a single column of text
`SingleBlock`	Assumes the image is a single uniform block of text
`SingleLine`	Assumes the image is a single line of text
`SparseText`	Finds as much text as possible in any order

For a label or banner that contains a single line, SingleLine eliminates multi-block analysis and improves both speed and accuracy.

Input

single-line-label.png is a narrow shipping label with exactly one line of bold Courier text: SHIPPING LABEL: TRK-2024-XR9-001.

IronTesseract ocr = new IronTesseract();

ocr.Configuration = new TesseractConfiguration
{
    PageSegmentationMode = TesseractPageSegmentationMode.SingleLine,
};

using OcrInput input = new OcrInput();
input.LoadImage("single-line-label.png");

OcrResult result = ocr.Read(input);
Console.WriteLine(result.Text);

For a scanned page with irregular text placement, SparseText can recover more content than Auto.

Input

receipt-scan.png is a Corner Market thermal receipt with four line items (coffee, muffin, juice, granola bar), a dashed separator, subtotal, tax, and total. This is the kind of layout where fixed-block segmentation misses entries at different horizontal positions.

IronTesseract ocr = new IronTesseract();

ocr.Configuration = new TesseractConfiguration
{
    PageSegmentationMode = TesseractPageSegmentationMode.SparseText,
};

using OcrInput input = new OcrInput();
input.LoadImage("receipt-scan.png");

OcrResult result = ocr.Read(input);
Console.WriteLine(result.Text);
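
Because the best mode depends on the layout, one practical approach is to run the same image through several candidate modes and compare the results. The sketch below does this for the receipt. It assumes OcrResult exposes a Confidence value (not shown in this article), and it treats text length and confidence only as rough indicators, not as a definitive quality measure.

```csharp
using IronOcr;

var candidateModes = new[]
{
    TesseractPageSegmentationMode.Auto,
    TesseractPageSegmentationMode.SingleBlock,
    TesseractPageSegmentationMode.SparseText,
};

foreach (var mode in candidateModes)
{
    IronTesseract ocr = new IronTesseract();
    ocr.Configuration = new TesseractConfiguration { PageSegmentationMode = mode };

    using OcrInput input = new OcrInput();
    input.LoadImage("receipt-scan.png");

    OcrResult result = ocr.Read(input);

    // Assumption: OcrResult.Confidence is available in your IronOCR version
    Console.WriteLine($"{mode}: {result.Text.Length} characters, confidence {result.Confidence:F1}");
}
```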

With layout segmentation tuned to the document type, the next step is controlling the output format for downstream processing.

Generating Searchable PDFs and hOCR Output

RenderSearchablePdf and RenderHocr control the output formats that IronOCR produces alongside the plain text result.

RenderSearchablePdf embeds an invisible text layer over the original image, producing a PDF where users can search and copy text while the scanned image remains visible. This is the standard output format for document archival workflows.

Input

scanned-document.pdf is a single-page business letter from IronOCR Solutions Ltd. (dated 15 March 2024, reference DOC-2024-OCR-0315). The result is saved as searchable-output.pdf.

IronTesseract ocr = new IronTesseract();

ocr.Configuration = new TesseractConfiguration
{
    RenderSearchablePdf = true,
};

using OcrInput input = new OcrInput();
input.LoadPdf("scanned-document.pdf");

OcrResult result = ocr.Read(input);
result.SaveAsSearchablePdf("searchable-output.pdf");

Output

The output is a PDF that looks identical to the input but contains a hidden text layer. Open searchable-output.pdf and use Ctrl+F to verify that the embedded text is searchable and copyable.

RenderHocr produces an hOCR document, an HTML file that encodes the text content together with bounding box coordinates for every word. This is useful when downstream tools need precise word positioning, for example, redaction engines or document layout analysis.

Input

document-page.png is a document page with the heading "Quarterly Summary Q1 2024" and two paragraphs of financial data covering revenue, operating costs, and growth drivers. The result is saved as output.html.

IronTesseract ocr = new IronTesseract();

ocr.Configuration = new TesseractConfiguration
{
    RenderHocr = true,
};

using OcrInput input = new OcrInput();
input.LoadImage("document-page.png");

OcrResult result = ocr.Read(input);
result.SaveAsHocrFile("output.html");

Output

output.html encodes each recognized word with its bounding box coordinates. Open the file in a browser to inspect the hOCR structure, or pass it to a downstream tool for layout analysis or redaction.
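
As an illustration of what a downstream tool might do with the file, the sketch below pulls word text and bounding boxes out of an hOCR fragment with a regular expression. The sample markup follows the general hOCR convention (ocrx_word spans with a bbox in the title attribute) and is an assumption, not output captured from IronOCR. A production tool would more likely use an HTML parser.

```csharp
using System;
using System.Text.RegularExpressions;

// Assumed sample in the general hOCR style; real output will contain more attributes
string hocr =
    "<span class='ocrx_word' id='word_1_1' title='bbox 36 92 180 116; x_wconf 96'>Quarterly</span> " +
    "<span class='ocrx_word' id='word_1_2' title='bbox 190 92 310 116; x_wconf 95'>Summary</span>";

// Capture the four bbox integers and the word text of each ocrx_word span
var wordPattern = new Regex(
    @"class=['""]ocrx_word['""][^>]*title=['""]bbox (\d+) (\d+) (\d+) (\d+)[^'""]*['""][^>]*>([^<]*)<");

MatchCollection words = wordPattern.Matches(hocr);

foreach (Match m in words)
{
    int x0 = int.Parse(m.Groups[1].Value), y0 = int.Parse(m.Groups[2].Value);
    int x1 = int.Parse(m.Groups[3].Value), y1 = int.Parse(m.Groups[4].Value);

    // bbox is given as left, top, right, bottom in image pixels
    Console.WriteLine($"{m.Groups[5].Value}: left={x0}, top={y0}, width={x1 - x0}, height={y1 - y0}");
}
```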

Both flags can be enabled at the same time if you need all three output formats (plain text, searchable PDF, and hOCR) from a single read call.
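
A minimal sketch of that combined configuration, reusing only the calls shown in the two examples above. The file names are the same placeholders used there.

```csharp
using IronOcr;

IronTesseract ocr = new IronTesseract();
ocr.Configuration = new TesseractConfiguration
{
    RenderSearchablePdf = true,
    RenderHocr = true,
};

using OcrInput input = new OcrInput();
input.LoadPdf("scanned-document.pdf");

// One read produces all three outputs
OcrResult result = ocr.Read(input);

Console.WriteLine(result.Text);                       // plain text
result.SaveAsSearchablePdf("searchable-output.pdf");  // searchable PDF
result.SaveAsHocrFile("output.html");                 // hOCR
```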

These output flags work independently of the language being read, including non-Latin scripts. The next section shows how to apply character filtering to Japanese text.

Unicode Character Filtering for International Documents

For international documents in Chinese, Japanese, or Korean, the WhiteListCharacters and BlackListCharacters properties work with Unicode characters. This allows you to restrict output to specific scripts, such as only Hiragana and Katakana for Japanese.

Please note: Ensure that the corresponding language pack has been installed (e.g., IronOcr.Languages.Japanese) before proceeding.

Input

The document contains a title (テスト), a Japanese sentence mixing Hiragana and Katakana with voiced-mark variants (プ, で), a price line with blacklisted noise symbols (★, ■) and Kanji (価格), and a memo line with another blacklisted symbol (§), more Kanji (購入), additional voiced-mark variants (プ, デ), and base Katakana (メモ, ール). The whitelist passes only base Hiragana, base Katakana, digits, and common Japanese punctuation; the three noise symbols are explicitly blacklisted.

The base Hiragana and Katakana characters are written out explicitly as string literals in WhiteListCharacters, with the noise symbols listed in BlackListCharacters.

Warning: The console may not support displaying Unicode characters. Redirecting the output to a .txt file is a reliable way to verify results when dealing with such characters.

using IronOcr;
using System.IO;

IronTesseract ocr = new IronTesseract();

ocr.Configuration = new TesseractConfiguration
{
    // Whitelist only Hiragana, Katakana, numbers, and common Japanese punctuation
    WhiteListCharacters = "あいうえおかきくけこさしすせそたちつてとなにぬねのはひふへほまみむめもやゆよらりるれろわをん" +
                            "アイウエオカキクケコサシスセソタチツテトナニヌネノハヒフヘホマミムメモヤユヨラリルレロワヲン" +
                            "0123456789、。？！（）¥ー",

    // Blacklist common noise/symbols you want to ignore
    BlackListCharacters = "★■§",
};

var ocrInput = new OcrInput();

// Load Japanese input image
ocrInput.LoadImage("jp.png");

// Perform OCR on the input image with ReadPhoto method
var results = ocr.ReadPhoto(ocrInput);

// Write the text result directly to a file named "output.txt"
File.WriteAllText("output.txt", results.Text);

// You can add this line to confirm the file was saved:
Console.WriteLine("OCR results saved to output.txt");

Output

The full filtered output is available as a text file: jp-output.txt.

Because the whitelist includes only base Hiragana and Katakana characters, derived voiced-mark variants such as プ (pu) and デ (de) are dropped. Kanji characters like 価格 (price) and 購入 (purchase) are also excluded since they fall outside the whitelisted character set. Blacklisted symbols like ★, ■, and § are actively removed regardless of the whitelist.
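
To see why the voiced variants fall outside the whitelist, the sketch below checks individual characters against the same whitelist string in plain C#, then builds a wider whitelist from Unicode code-point ranges. It does not call IronOCR. The helper name RangeToString is hypothetical, and the ranges U+3041 to U+3096 (Hiragana) and U+30A1 to U+30FA (Katakana) come from the Unicode charts, not from this article.

```csharp
using System;
using System.Linq;

const string BaseWhitelist =
    "あいうえおかきくけこさしすせそたちつてとなにぬねのはひふへほまみむめもやゆよらりるれろわをん" +
    "アイウエオカキクケコサシスセソタチツテトナニヌネノハヒフヘホマミムメモヤユヨラリルレロワヲン" +
    "0123456789、。？！（）¥ー";

// Hypothetical helper: every character from code point 'first' to 'last', inclusive
static string RangeToString(int first, int last) =>
    new string(Enumerable.Range(first, last - first + 1).Select(cp => (char)cp).ToArray());

// Full Hiragana and Katakana letter ranges, including voiced and small forms
string wideWhitelist = RangeToString(0x3041, 0x3096) + RangeToString(0x30A1, 0x30FA)
                     + "0123456789、。？！（）¥ー";

foreach (char c in "メデプで")
{
    Console.WriteLine($"{c}: base={BaseWhitelist.Contains(c)}, wide={wideWhitelist.Contains(c)}");
}
```

Of the four characters tested, only メ is in the base whitelist, while all four are in the wider one. Whether the wider whitelist suits a given document is a judgment call, since it also admits small kana and other forms the base list deliberately leaves out.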

Where Should I Go Next?

Now that you understand how to configure IronOCR for advanced reading scenarios, explore:

Reading specific document types such as passports and license plates
The full list of Tesseract configuration variables for fine-grained engine tuning
Barcode and QR code reading as a standalone OCR use case
Exporting hOCR and searchable PDFs from processed results

For production use, remember to obtain a license to remove watermarks and access full functionality.

Curtis Chau

Technical Writer

Curtis Chau holds a Bachelor’s degree in Computer Science (Carleton University) and specializes in front-end development with expertise in Node.js, TypeScript, JavaScript, and React. Passionate about crafting intuitive and aesthetically pleasing user interfaces, Curtis enjoys working with modern frameworks and creating well-structured, visually appealing manuals.
