OCR Configurations for Advanced Reading

Updated: June 3, 2026

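IronOCR provides advanced scan reading methods, such as ReadPhoto, that go beyond standard OCR. These methods are powered by the IronOcr.Extensions.AdvancedScan package. To fine-tune how these methods process text, IronOCR offers the TesseractConfiguration class, which gives developers full control over character whitelists, blacklists, barcode detection, data table reading, and more.

This article covers the TesseractConfiguration properties available for advanced reading, along with practical examples of configuring OCR for real-world scenarios.

Quickstart: Restrict OCR Output to a Character Whitelist

Set WhiteListCharacters on TesseractConfiguration before calling Read. Any character not in the whitelist is silently dropped, which removes noise without post-processing.

Install with the NuGet Package Manager: https://www.nuget.org/packages/IronOcr
PM > Install-Package IronOcr

Copy and run this code snippet.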

using IronOcr;

// Restrict output to uppercase letters, digits, hyphen, and space
var ocr = new IronTesseract
{
    Configuration = new TesseractConfiguration
    {
        WhiteListCharacters = "ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789- "
    }
};

var result = ocr.Read(new OcrInput("image.png"));
Console.WriteLine(result.Text);

How to Configure OCR for Advanced Reading

Install IronOCR from NuGet
Install the IronOcr.Extensions.AdvancedScan package
Configure TesseractConfiguration properties such as WhiteListCharacters and ReadBarCodes
Load the input image with OcrInput
Read the image with an advanced method such as ReadPhoto, ReadLicensePlate, or ReadPassport

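TesseractConfiguration Properties

The TesseractConfiguration class provides the following properties for customizing OCR behavior. They are set through IronTesseract.Configuration.

Property	Type	Description
`WhiteListCharacters`	string	Only characters present in this string are recognized in the OCR output. All other characters are excluded.
`BlackListCharacters`	string	Characters in this string are actively ignored and removed from the OCR output.
`ReadBarCodes`	bool	Enables or disables barcode detection in the document during OCR processing.
`ReadDataTables`	bool	Enables or disables detection of table structures in the document using Tesseract.
`PageSegmentationMode`	TesseractPageSegmentationMode	Determines how Tesseract segments the input image. Options include `AutoOsd`, `Auto`, `SingleBlock`, `SingleLine`, `SingleWord`, and others.
`RenderSearchablePdf`	bool	When enabled, the OCR output can be saved as a searchable PDF with a hidden text layer.
`RenderHocr`	bool	When enabled, the OCR output includes hOCR data for further processing or export.
`TesseractVariables`	Dictionary<string, object>	Provides direct access to low-level Tesseract configuration variables for fine-grained control.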

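The TesseractVariables dictionary goes a step further, exposing hundreds of underlying Tesseract engine parameters for cases where the high-level properties are not enough.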
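As a minimal sketch of what writing a low-level variable might look like, the snippet below sets one entry in TesseractVariables on the engine's default configuration. The variable name tessedit_parallelize is a Tesseract engine parameter chosen here purely for illustration; it is not taken from this article, and whether a given variable has any effect depends on the engine version in use. The file name document.png is hypothetical.

```csharp
using System;
using IronOcr;

IronTesseract ocr = new IronTesseract();

// Assumption: "tessedit_parallelize" is only an illustrative Tesseract
// variable name; substitute the engine parameter you actually need.
ocr.Configuration.TesseractVariables["tessedit_parallelize"] = false;

// "document.png" is a hypothetical file name.
using OcrInput input = new OcrInput();
input.LoadImage("document.png");

OcrResult result = ocr.Read(input);
Console.WriteLine(result.Text);
```

Because the dictionary values are typed as object, it is worth checking the expected value type for each variable against the Tesseract parameter documentation before relying on it.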

The following examples demonstrate each group of properties, starting with character whitelisting.

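Setting a Character Whitelist for License Plates

A common use of WhiteListCharacters is restricting OCR output to the characters that can appear on a license plate: uppercase letters, digits, hyphens, and spaces. Telling the engine to ignore anything outside the expected character set removes noise and improves accuracy.

Input

The vehicle registration record below contains uppercase text, lowercase text, special symbols (#, |, *), and punctuation.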

using IronOcr;

// Initialize the Tesseract OCR engine
IronTesseract ocr = new IronTesseract();

ocr.Configuration = new TesseractConfiguration
{
    // Whitelist only characters that appear on license plates
    WhiteListCharacters = "ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789- ",

    // Blacklist common noise characters
    BlackListCharacters = "`~@#$%&*",
};

var ocrInput = new OcrInput();
// Load the input image
ocrInput.LoadImage("advanced-input.png");
// Perform OCR on the input image with ReadPhoto method
var results = ocr.ReadPhoto(ocrInput);

// Print the filtered text result to the console
Console.WriteLine(results.Text);

Imports IronOcr

' Initialize the Tesseract OCR engine
Dim ocr As New IronTesseract()

ocr.Configuration = New TesseractConfiguration With {
    ' Whitelist only characters that appear on license plates
    .WhiteListCharacters = "ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789- ",
    
    ' Blacklist common noise characters
    .BlackListCharacters = "`~@#$%&*"
}

Dim ocrInput As New OcrInput()
' Load the input image
ocrInput.LoadImage("advanced-input.png")
' Perform OCR on the input image with ReadPhoto method
Dim results = ocr.ReadPhoto(ocrInput)

' Print the filtered text result to the console
Console.WriteLine(results.Text)

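Output

The whitelist filtering effect is clearly visible in the result:

"Plate: ABC-1234" becomes "P ABC-1234". The lowercase "late:" is removed, and the plate number is unchanged.
"VIN: 1HGBH41JXMN109186" becomes "VIN 1HGBH41JXMN109186". The colon is removed, but the uppercase VIN and all digits are kept.
"Owner: john.doe@email.com" becomes "O". The entire lowercase email address and its punctuation are removed.
"Region: CA-90210|Zone #5" becomes "R CA-90210 Z 5". The pipe (|) and hash (#) are removed, while the uppercase letters and digits are kept.
"Fee: $125.00 + tax" becomes "F 12500". The dollar sign, decimal point, plus sign, and lowercase "tax" are all removed.
"Ref: ~record_v2^final" becomes "R 2". The tilde (~), underscore (_), caret (^), and all lowercase characters are stripped.

The same WhiteListCharacters and BlackListCharacters filtering applies to any document type, not only license plates. The next section shows how to extend the same read to detect barcodes and table structures.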
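To make the per-character effect described in the output list above concrete, here is a rough, library-independent sketch that applies the same whitelist to two of the sample strings using plain string filtering. The KeepOnly helper is hypothetical and not part of IronOCR. The engine applies the whitelist during recognition rather than as a post-filter, so spacing in real OCR output can differ from this approximation.

```csharp
using System;
using System.Linq;

static class WhitelistSketch
{
    // Hypothetical helper: keeps only the characters found in the whitelist.
    public static string KeepOnly(string text, string whitelist) =>
        new string(text.Where(c => whitelist.Contains(c)).ToArray());

    static void Main()
    {
        const string whitelist = "ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789- ";

        Console.WriteLine(KeepOnly("Plate: ABC-1234", whitelist));        // P ABC-1234
        Console.WriteLine(KeepOnly("VIN: 1HGBH41JXMN109186", whitelist)); // VIN 1HGBH41JXMN109186
    }
}
```

A helper like this can also serve as a sanity check in tests: if OCR output contains a character that the filter would have dropped, the whitelist was probably not applied to that read.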

Configuring Barcode and Data Table Reading

In addition to text, IronOCR can detect barcodes and structured tables in a document. These features are controlled through TesseractConfiguration:

using IronOcr;

IronTesseract ocr = new IronTesseract();

ocr.Configuration = new TesseractConfiguration
{
    // Enable barcode detection within documents
    ReadBarCodes = true,

    // Enable table structure detection
    ReadDataTables = true,
};

Imports IronOcr

Dim ocr As New IronTesseract()

ocr.Configuration = New TesseractConfiguration With {
    .ReadBarCodes = True, ' Enable barcode detection within documents
    .ReadDataTables = True ' Enable table structure detection
}

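ReadBarCodes: When set to true, IronOCR scans the document for barcodes in addition to text. Set it to false to skip barcode detection and speed up processing when no barcodes are expected.
ReadDataTables: When set to true, Tesseract attempts to detect and preserve table structures in the document. This is useful for invoices, reports, and other tabular documents.

These options can be combined with WhiteListCharacters and BlackListCharacters for precise control over what is extracted from complex documents.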
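As a sketch of such a combination, the snippet below enables both detection flags alongside a character blacklist in a single configuration. The configuration uses only properties listed in this article. The file name invoice.png is hypothetical, and the loop over result.Barcodes assumes that OcrResult exposes detected barcodes through a Barcodes collection whose items have a Value property; that part is not shown in this article and should be checked against the IronOCR API reference.

```csharp
using System;
using IronOcr;

IronTesseract ocr = new IronTesseract();

ocr.Configuration = new TesseractConfiguration
{
    // Detect barcodes and table structures in the same pass
    ReadBarCodes = true,
    ReadDataTables = true,

    // Drop common noise characters from the text output
    BlackListCharacters = "`~@#$%&*",
};

// "invoice.png" is a hypothetical file name.
using OcrInput input = new OcrInput();
input.LoadImage("invoice.png");

OcrResult result = ocr.Read(input);
Console.WriteLine(result.Text);

// Assumption: OcrResult.Barcodes and its Value property are not shown
// in this article; verify them against the IronOCR API reference.
foreach (var barcode in result.Barcodes)
{
    Console.WriteLine(barcode.Value);
}
```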

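While filtering and detection control what is extracted, layout interpretation is a separate concern. The next section covers how to choose the right PageSegmentationMode for a document type.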

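Controlling the Page Segmentation Mode

PageSegmentationMode tells Tesseract how to segment the input image before recognition. Choosing the wrong mode for a given layout can cause the engine to misread text or skip it entirely.

Mode	Use case
`AutoOsd`	Automatic layout analysis with orientation and script detection
`Auto`	Automatic layout analysis without OSD (default)
`SingleColumn`	Assumes the image is a single column of text
`SingleBlock`	Assumes the image is a single uniform block of text
`SingleLine`	Assumes the image is a single line of text
`SparseText`	Finds as much text as possible in no particular order

For labels or banners that contain a single line, SingleLine eliminates multi-block analysis and improves speed and accuracy.

Input

The input image, single-line-label.png, contains a single line of text: SHIPPING LABEL: TRK-2024-XR9-001.

using IronOcr;

IronTesseract ocr = new IronTesseract();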

ocr.Configuration = new TesseractConfiguration
{
    PageSegmentationMode = TesseractPageSegmentationMode.SingleLine,
};

using OcrInput input = new OcrInput();
input.LoadImage("single-line-label.png");

OcrResult result = ocr.Read(input);
Console.WriteLine(result.Text);

Imports IronOcr

Dim ocr As New IronTesseract()

ocr.Configuration = New TesseractConfiguration With {
    .PageSegmentationMode = TesseractPageSegmentationMode.SingleLine
}

Using input As New OcrInput()
    input.LoadImage("single-line-label.png")

    Dim result As OcrResult = ocr.Read(input)
    Console.WriteLine(result.Text)
End Using

For scanned pages with irregular text layouts, SparseText recovers more content than Auto.

Input

receipt-scan.png is a Corner Market thermal receipt containing four line items (coffee, muffin, juice, granola bar), a dashed separator, a subtotal, tax, and a total. In this layout, fixed-block segmentation misses entries that sit at different horizontal positions.

using IronOcr;

IronTesseract ocr = new IronTesseract();

ocr.Configuration = new TesseractConfiguration
{
    PageSegmentationMode = TesseractPageSegmentationMode.SparseText,
};

using OcrInput input = new OcrInput();
input.LoadImage("receipt-scan.png");

OcrResult result = ocr.Read(input);
Console.WriteLine(result.Text);

Imports IronOcr

Dim ocr As New IronTesseract()

ocr.Configuration = New TesseractConfiguration With {
    .PageSegmentationMode = TesseractPageSegmentationMode.SparseText
}

Using input As New OcrInput()
    input.LoadImage("receipt-scan.png")

    Dim result As OcrResult = ocr.Read(input)
    Console.WriteLine(result.Text)
End Using

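With layout segmentation tuned to the document type, the next step is controlling the output format for downstream processing.

Generating Searchable PDF and hOCR Output

RenderSearchablePdf and RenderHocr control which output formats IronOCR produces alongside the plain-text result.

RenderSearchablePdf embeds a hidden text layer over the original image, producing a PDF in which users can search and copy text while the scanned image remains visible. This is the standard output format for document archiving workflows.

Input

scanned-document.pdf is a single-page business letter from IronOCR Solutions Ltd. (dated March 15, 2024, reference DOC-2024-OCR-0315). The result is saved as searchable-output.pdf.

using IronOcr;

IronTesseract ocr = new IronTesseract();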

ocr.Configuration = new TesseractConfiguration
{
    RenderSearchablePdf = true,
};

using OcrInput input = new OcrInput();
input.LoadPdf("scanned-document.pdf");

OcrResult result = ocr.Read(input);
result.SaveAsSearchablePdf("searchable-output.pdf");

Imports IronOcr

Dim ocr As New IronTesseract()

ocr.Configuration = New TesseractConfiguration With {
    .RenderSearchablePdf = True
}

Using input As New OcrInput()
    input.LoadPdf("scanned-document.pdf")

    Dim result As OcrResult = ocr.Read(input)
    result.SaveAsSearchablePdf("searchable-output.pdf")
End Using

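Output

The output is a PDF that looks identical to the input but contains a hidden text layer. Open searchable-output.pdf and use Ctrl+F to verify that the embedded text is searchable and copyable.

RenderHocr produces an hOCR file, an HTML document that encodes the text content together with bounding-box coordinates for each word. This is useful when downstream tools need precise text positions, for example a redaction engine or document layout analysis.

Input

document-page.png is a document page titled "Quarterly Summary Q1 2024" with two paragraphs of financial information covering revenue, operating costs, and growth drivers. The result is saved as output.html.

using IronOcr;

IronTesseract ocr = new IronTesseract();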

ocr.Configuration = new TesseractConfiguration
{
    RenderHocr = true,
};

using OcrInput input = new OcrInput();
input.LoadImage("document-page.png");

OcrResult result = ocr.Read(input);
result.SaveAsHocrFile("output.html");

Imports IronOcr

Dim ocr As New IronTesseract()

ocr.Configuration = New TesseractConfiguration With {
    .RenderHocr = True
}

Using input As New OcrInput()
    input.LoadImage("document-page.png")

    Dim result As OcrResult = ocr.Read(input)
    result.SaveAsHocrFile("output.html")
End Using

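Output

output.html encodes the bounding-box coordinates of each recognized word. Open the file in a browser to inspect the hOCR structure, or pass it to a downstream tool for layout analysis or redaction.

If all three output formats (plain text, searchable PDF, and hOCR) are needed from a single read call, both flags can be enabled at the same time.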
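A minimal sketch of that combined case follows. It reuses only the calls already shown in the two examples in this section; the input and output file names are placeholders.

```csharp
using System;
using IronOcr;

IronTesseract ocr = new IronTesseract();

ocr.Configuration = new TesseractConfiguration
{
    // Enable both additional output formats for the same read
    RenderSearchablePdf = true,
    RenderHocr = true,
};

using OcrInput input = new OcrInput();
input.LoadImage("document-page.png");

OcrResult result = ocr.Read(input);

// One read, three outputs: plain text, searchable PDF, and hOCR
Console.WriteLine(result.Text);
result.SaveAsSearchablePdf("searchable-output.pdf");
result.SaveAsHocrFile("output.html");
```

Running recognition once and saving every format from the same OcrResult avoids repeating the OCR pass for each output.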

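These output flags work independently of the language being read, including non-Latin scripts. The next section shows how to apply character filtering to Japanese text.

Unicode Character Filtering for International Documents

For international documents in languages such as Chinese, Japanese, or Korean, the WhiteListCharacters and BlackListCharacters properties work with Unicode characters. This lets you restrict output to a specific script, for example only Japanese hiragana and katakana.

Note: Make sure the corresponding language pack (for example, IronOcr.Languages.Japanese) is installed before continuing.

Input

The document contains a heading (テスト), a Japanese sentence mixing hiragana and katakana that includes diacritic variants (プ, で), a price line with blacklisted noise symbols (★, ■) and kanji (価格), and a memo line with another blacklisted symbol (§), more kanji (購入), additional diacritic variants (プ, デ), and base katakana (メモ, ール). The whitelist passes only base hiragana, katakana, digits, and common Japanese punctuation; the three noise symbols are listed explicitly in the blacklist.

The Unicode characters for hiragana and katakana are listed as string literals in WhiteListCharacters.

Warning: The console may not support displaying Unicode characters. Redirecting the output to a .txt file is a reliable way to verify results when working with such characters.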

using IronOcr;
using System.IO;

IronTesseract ocr = new IronTesseract();

ocr.Configuration = new TesseractConfiguration
{
    // Whitelist only Hiragana, Katakana, numbers, and common Japanese punctuation
    WhiteListCharacters = "あいうえおかきくけこさしすせそたちつてとなにぬねのはひふへほまみむめもやゆよらりるれろわをん" +
                            "アイウエオカキクケコサシスセソタチツテトナニヌネノハヒフヘホマミムメモヤユヨラリルレロワヲン" +
                            "0123456789、。？！（）¥ー",

    // Blacklist common noise/symbols you want to ignore
    BlackListCharacters = "★■§",
};

var ocrInput = new OcrInput();

// Load Japanese input image
ocrInput.LoadImage("jp.png");

// Perform OCR on the input image with ReadPhoto method
var results = ocr.ReadPhoto(ocrInput);

// Write the text result directly to a file named "output.txt"
File.WriteAllText("output.txt", results.Text);

// You can add this line to confirm the file was saved:
Console.WriteLine("OCR results saved to output.txt");

Imports IronOcr
Imports System.IO

Dim ocr As New IronTesseract()

ocr.Configuration = New TesseractConfiguration With {
    .WhiteListCharacters = "あいうえおかきくけこさしすせそたちつてとなにぬねのはひふへほまみむめもやゆよらりるれろわをん" &
                           "アイウエオカキクケコサシスセソタチツテトナニヌネノハヒフヘホマミムメモヤユヨラリルレロワヲン" &
                           "0123456789、。？！（）¥ー",
    .BlackListCharacters = "★■§"
}

Dim ocrInput As New OcrInput()

' Load Japanese input image
ocrInput.LoadImage("jp.png")

' Perform OCR on the input image with ReadPhoto method
Dim results = ocr.ReadPhoto(ocrInput)

' Write the text result directly to a file named "output.txt"
File.WriteAllText("output.txt", results.Text)

' You can add this line to confirm the file was saved:
Console.WriteLine("OCR results saved to output.txt")

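Output

The full filtered output is available as a text file: jp-output.txt.

Because the whitelist includes only base hiragana and katakana, derived diacritic variants such as プ (pu) and デ (de) are removed. Kanji such as 価格 (price) and 購入 (purchase) are also excluded because they fall outside the whitelisted character set. Blacklisted symbols such as § are actively removed regardless of the whitelist.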
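If those diacritic variants should be kept instead, one option (not covered in this article) is to append them to the whitelist. The sketch below is a hedged variation of the configuration above: it changes only the WhiteListCharacters string, and the added dakuten and handakuten characters are an illustrative selection rather than a complete inventory of Japanese characters. The variable names baseSet and voicedKana are introduced here for readability.

```csharp
using IronOcr;

IronTesseract ocr = new IronTesseract();

// Base kana, digits, and punctuation, copied from the example above
string baseSet =
    "あいうえおかきくけこさしすせそたちつてとなにぬねのはひふへほまみむめもやゆよらりるれろわをん" +
    "アイウエオカキクケコサシスセソタチツテトナニヌネノハヒフヘホマミムメモヤユヨラリルレロワヲン" +
    "0123456789、。？！（）¥ー";

// Illustrative addition: voiced (dakuten) and semi-voiced (handakuten) kana
string voicedKana =
    "がぎぐげござじずぜぞだぢづでどばびぶべぼぱぴぷぺぽ" +
    "ガギグゲゴザジズゼゾダヂヅデドバビブベボパピプペポ";

ocr.Configuration = new TesseractConfiguration
{
    // Whitelist now admits the diacritic variants as well
    WhiteListCharacters = baseSet + voicedKana,

    // Noise symbols stay blacklisted, as in the original example
    BlackListCharacters = "★■§",
};
```

Kanji would still be excluded under this configuration, since none are added to the whitelist.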

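Where Should I Go Next?

Now that you know how to configure IronOCR for advanced reading scenarios, explore:

Reading specific document types such as passports and license plates
Barcode and QR code reading as a standalone OCR use case
Exporting hOCR and searchable PDFs from processed results

For production use, remember to obtain a license to remove watermarks and access the full feature set.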

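Frequently Asked Questions

What is TesseractConfiguration in IronOCR?

TesseractConfiguration in IronOCR lets users customize OCR settings, enabling advanced reading capabilities such as character whitelisting, barcode reading, and multi-language support.

How do I set a character whitelist in IronOCR?

In IronOCR, you can set a character whitelist through TesseractConfiguration so that the OCR engine recognizes only specific characters, which is useful for tasks such as reading license plates.

Can IronOCR read barcodes and data tables?

Yes. IronOCR can read barcodes and data tables when the relevant settings are adjusted in the TesseractConfiguration properties, allowing precise OCR data extraction.

Does IronOCR support international languages such as Chinese, Japanese, and Korean?

IronOCR supports international languages, including Chinese, Japanese, and Korean, through its multi-language TesseractConfiguration options.

What are the benefits of using advanced OCR configurations in IronOCR?

Advanced OCR configurations in IronOCR enable more accurate and efficient text recognition and support specialized tasks such as language-specific text recognition and structured data extraction.

Can IronOCR be optimized for specific OCR tasks?

Yes. IronOCR can be optimized by configuring settings such as character whitelists and by enabling barcode or table recognition to improve performance for a specific application.

How do I enable multi-language support in IronOCR?

To enable multi-language support in IronOCR, you can adjust the language settings in TesseractConfiguration so that the OCR engine can recognize text in multiple languages.

What are character whitelists, and what are they used for in IronOCR?

A character whitelist in IronOCR is a list of specific characters that the OCR engine is configured to recognize, suited to focused tasks such as reading digits or particular text patterns.

Can IronOCR be used to read structured data formats?

Yes. IronOCR can be configured to read and process structured data formats such as barcodes and tables, providing versatile OCR capabilities for a variety of data extraction needs.

What configurations does IronOCR offer for advanced text recognition?

IronOCR offers configurations such as character whitelisting, multi-language support, and barcode recognition to enhance advanced text recognition for specific needs.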

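Curtis Chau

Technical Writer

Curtis Chau holds a bachelor's degree in computer science from Carleton University and specializes in front-end development, with expertise in Node.js, TypeScript, JavaScript, and React. Curtis is passionate about building intuitive and visually appealing user interfaces, and enjoys working with modern frameworks and creating well-structured, visually engaging manuals.

Outside of development, Curtis has a strong interest in the Internet of Things (IoT) and explores innovative ways to integrate hardware and software. In his spare time, he enjoys gaming and building Discord bots, combining his love of technology with creativity.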
