OCR Tools

How to Extract Text from Images with Tesseract

Curtis Chau

Updated: July 2, 2025

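With libraries such as IronOCR and Tesseract, developers can use advanced algorithms and machine-learning techniques to extract text from images and scanned documents. This tutorial shows how to extract text from an image with the Tesseract library, then introduces IronOCR's distinct approach.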

1. OCR with Tesseract

1.1. Installing Tesseract

In the NuGet Package Manager Console, enter the following command:

Install-Package Tesseract

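Alternatively, download the package through the NuGet Package Manager.

Figure 1: Installing the Tesseract package in the NuGet Package Manager

After installing the NuGet package, you must manually download the language files and save them in the project folder. This can be considered a drawback of this particular library.

Download the language files from the linked website. Once the download finishes, unzip the archive and add the "tessdata" folder to the project's debug folder.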
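
Because a missing language file is a common reason for the engine failing to start, it can help to check the folder before creating the engine. The sketch below shows one way to do that. TessdataCheck and HasLanguageData are hypothetical helper names, not part of the Tesseract package, and the sketch assumes the language files follow the usual <language>.traineddata naming, for example eng.traineddata.

```csharp
using System;
using System.IO;

// Hypothetical helper, not part of the Tesseract package.
static class TessdataCheck
{
    // Returns true when <tessdataDir>/<language>.traineddata exists.
    // Assumes the usual "<language>.traineddata" file naming.
    public static bool HasLanguageData(string tessdataDir, string language)
    {
        string path = Path.Combine(tessdataDir, language + ".traineddata");
        return File.Exists(path);
    }
}
```

Calling TessdataCheck.HasLanguageData("tessdata", "eng") before constructing the engine lets the program report a clear message instead of failing during initialization. A relative path is resolved against the working directory, which is normally the debug folder when the program is started from the IDE.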

1.2. Using Tesseract (Quick Start)

The following source code runs OCR on a given image:

using System;
using Tesseract;

class Program
{
    static void Main()
    {
        // Initialize Tesseract engine with English language data
        using var ocrEngine = new TesseractEngine(@"tessdata", "eng", EngineMode.Default);

        // Load the image to be processed
        using var img = Pix.LoadFromFile("Demo.png");

        // Process the image to extract text
        using var res = ocrEngine.Process(img);

        // Output the recognized text
        Console.WriteLine(res.GetText());
        Console.ReadKey();
    }
}

Imports System
Imports Tesseract

Friend Class Program
	Shared Sub Main()
		' Initialize Tesseract engine with English language data
		Using ocrEngine As New TesseractEngine("tessdata", "eng", EngineMode.Default)
			' Load the image to be processed
			Using img = Pix.LoadFromFile("Demo.png")
				' Process the image to extract text
				Using res = ocrEngine.Process(img)
					' Output the recognized text
					Console.WriteLine(res.GetText())
				End Using
			End Using
		End Using

		Console.ReadKey()
	End Sub
End Class

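First, create a TesseractEngine object, which loads the language data into the engine. Then load the required image file with Pix.LoadFromFile.
Pass the image to the TesseractEngine's Process method to extract the text.
Call the GetText method to get the recognized text and print it to the console.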

Figure 2: Text extracted from the image

1.3. Tesseract Considerations

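Tesseract has supported output text formatting, OCR positional data, and page layout analysis since version 3.00.
Tesseract runs on Windows, Linux, and macOS, but because of limited development support it has been confirmed to work as expected mainly on Windows and Ubuntu.
Tesseract can distinguish monospaced fonts from proportional fonts.
With a front end such as OCRopus, Tesseract works well as a back end and can be used for more demanding OCR tasks such as layout analysis.
Some shortcomings of Tesseract:
- The latest versions are not designed to compile on Windows.
- The C# API wrapper for Tesseract is maintained infrequently and lags several years behind new Tesseract releases.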
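
The positional data mentioned in the list above can be read through the .NET wrapper's result iterator. The following is a minimal sketch, assuming the same Tesseract NuGet package, "tessdata" folder, and Demo.png used in the quick start. WordPositions and FormatWord are illustrative names introduced here, not part of the library.

```csharp
using System;
using System.Globalization;
using Tesseract;

class WordPositions
{
    // Illustrative helper: formats one word with its bounding box and confidence.
    public static string FormatWord(string text, int x1, int y1, int x2, int y2, float confidence)
    {
        return string.Format(CultureInfo.InvariantCulture,
            "{0} [({1},{2})-({3},{4})] conf={5:F1}", text, x1, y1, x2, y2, confidence);
    }

    static void Main()
    {
        // Same setup as the quick start: English data from the tessdata folder
        using var engine = new TesseractEngine(@"tessdata", "eng", EngineMode.Default);
        using var img = Pix.LoadFromFile("Demo.png");
        using var page = engine.Process(img);
        using var iter = page.GetIterator();

        // Walk the result word by word
        iter.Begin();
        do
        {
            // Skip positions where no word-level box is available
            if (!iter.TryGetBoundingBox(PageIteratorLevel.Word, out Rect box))
                continue;

            string word = iter.GetText(PageIteratorLevel.Word);
            float conf = iter.GetConfidence(PageIteratorLevel.Word);
            Console.WriteLine(FormatWord(word, box.X1, box.Y1, box.X2, box.Y2, conf));
        } while (iter.Next(PageIteratorLevel.Word));
    }
}
```

Each output line pairs a recognized word with the pixel coordinates of its bounding box, which is the kind of information needed to highlight text on the source image or to rebuild the layout.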

To learn more about Tesseract in C#, see the Tesseract tutorial.

2. OCR with IronOCR

2.1. Installing IronOCR

Enter the following command in the NuGet Package Manager Console:

Install-Package IronOcr

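Alternatively, you can install the IronOCR library through the NuGet Package Manager, along with additional packages for other languages, which are simple and convenient to use.

Figure 3: Installing IronOcr and language packages through the NuGet Package Manager

2.2. Using IronOCR

The following sample code recognizes the text in a given image: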

using System;
using IronOcr;

class Program
{
    static void Main()
    {
        // Create an IronTesseract instance with predefined settings
        var ocr = new IronTesseract()
        {
            Language = OcrLanguage.EnglishBest,
            Configuration = { TesseractVersion = TesseractVersion.Tesseract5 }
        };

        // Create an OcrInput instance for image processing
        using var input = new OcrInput();

        // Load the image to be processed
        input.AddImage("Demo.png");

        // Process the image and extract text
        var result = ocr.Read(input);

        // Output the recognized text
        Console.WriteLine(result.Text);
        Console.ReadKey();
    }
}

Imports System
Imports IronOcr

Friend Class Program
	Shared Sub Main()
		' Create an IronTesseract instance with predefined settings
		Dim ocr = New IronTesseract() With {
			.Language = OcrLanguage.EnglishBest
		}
		' VB has no nested object initializer, so set the version afterwards
		ocr.Configuration.TesseractVersion = TesseractVersion.Tesseract5

		' Create an OcrInput instance for image processing
		Using input = New OcrInput()
			' Load the image to be processed
			input.AddImage("Demo.png")

			' Process the image and extract text
			Dim result = ocr.Read(input)

			' Output the recognized text
			Console.WriteLine(result.Text)
		End Using

		Console.ReadKey()
	End Sub
End Class

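This code initializes an IronTesseract object and sets the language and Tesseract version. It then creates an OcrInput object and loads the image file with the AddImage method.
The Read method processes the image and extracts the text, which is then printed to the console.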

Figure 4: Text output extracted with the IronOCR library

2.3. IronOCR Considerations

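IronOCR is an extension of the Tesseract library that adds greater stability and higher accuracy.
IronOCR can read text content from PDFs and photos. It can also read more than 20 different types of barcodes and QR codes.
Output can be presented as plain text, structured data, barcodes, or QR codes.
The library recognizes 125 languages from around the world.
IronOCR runs in all .NET environments (console, web, desktop, and so on) and also supports frameworks and platforms such as Mono, Xamarin, Azure, and MAUI.
IronOCR offers a free trial, and the development edition is priced lower. See the licensing information to learn more.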