How to Build an Azure OCR Service with IronOCR

Chaknith Bin

September 23, 2021

Updated December 22, 2024

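Iron Software has created an OCR (Optical Character Recognition) library that addresses interoperability problems in Azure OCR integration. Using OCR libraries on Azure has always been somewhat difficult for developers. IronOCR is the solution to this and many other OCR challenges.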

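IronOCR features for Microsoft Azure

IronOCR includes the following features for building OCR services on Microsoft Azure:

Converts PDFs into searchable documents so that text is easy to extract.
Converts images into searchable documents by extracting the text they contain.
Reads barcodes and QR codes.
Outstanding accuracy.
Runs locally and does not require SaaS (software as a service), a software distribution model in which a cloud provider (such as Microsoft Azure) hosts applications and makes them available to end users.
Lightning-fast speed.

Let's look at how the best OCR engine, Iron Software's IronOCR, makes it easier for developers to extract text from any input document.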

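Getting started with our Azure OCR service

To get started with the example, we first need to install IronOCR.

Create a new console application using C#.
Install IronOCR through NuGet, either by entering Install-Package IronOcr or by choosing Manage NuGet Packages and searching for IronOCR. This is shown below.
Edit your Program.cs file to contain the following:
- We import the IronOcr namespace to use its OCR features for reading and extracting the contents of image and PDF files.
- We create a new IronTesseract object so that we can extract text from an image.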

using IronOcr;
using System;

namespace IronOCR_Ex
{
    class Program
    {
        static void Main(string[] args)
        {
            // Create the OCR engine
            var ocr = new IronTesseract();
            // Load the PNG image to read
            using (var input = new OcrInput("..\\Images\\Purgatory.PNG"))
            {
                var result = ocr.Read(input); // Run OCR on the image
                Console.WriteLine(result.Text); // Print the extracted text to the console
                Console.ReadLine(); // Keep the console window open
            }
        }
    }
}

The equivalent program in VB.NET:

Imports IronOcr
Imports System

Namespace IronOCR_Ex
	Friend Class Program
		Shared Sub Main(ByVal args() As String)
			' Create the OCR engine
			Dim ocr = New IronTesseract()
			' Load the PNG image to read
			Using input = New OcrInput("..\Images\Purgatory.PNG")
				Dim result = ocr.Read(input) ' Run OCR on the image
				Console.WriteLine(result.Text) ' Print the extracted text to the console
				Console.ReadLine() ' Keep the console window open
			End Using
		End Sub
	End Class
End Namespace

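Next, we open an image named Purgatory.PNG. The image is an excerpt from Dante's Divine Comedy, one of my favorite books. The image looks like the following.
Figure 2 - Text to be extracted using IronOCR's optical character recognition feature
The output after extracting the text from the input image above:
Figure 3 - Extracted text
Let's do the same with a PDF document. The PDF document contains the same text to extract as the image.
This time the input is a PDF document instead of an image. Enter the following code: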

var ocr = new IronTesseract();
using (var input = new OcrInput())
{
    input.Title = "Divine Comedy - Purgatory"; // Give the input document a title
    // Supply the path of the PDF and its optional password
    input.AddPdf("..\\Documents\\Purgatorio.pdf", "dante");
    var result = ocr.Read(input); // Read the input file

    // Save the result as a searchable PDF
    result.SaveAsSearchablePdf("SearchablePDFDocument.pdf");
}

Dim ocr = New IronTesseract()
Using input = New OcrInput()
	input.Title = "Divine Comedy - Purgatory" ' Give the input document a title
	' Supply the path of the PDF and its optional password
	input.AddPdf("..\Documents\Purgatorio.pdf", "dante")
	Dim result = ocr.Read(input) ' Read the input file

	' Save the result as a searchable PDF
	result.SaveAsSearchablePdf("SearchablePDFDocument.pdf")
End Using

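This is almost identical to the earlier code that extracted text from an image.

Here we use the AddPdf method of OcrInput to read the PDF document, in this case Purgatorio.pdf. If the PDF has metadata such as a title, or is protected by a password, we can supply those as well.

The result is saved as a PDF document in which we can search for text.

Note that an exception may be thrown if the PDF file is too large.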
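
Because a large PDF may cause an exception, it can be worth wrapping the read in a try/catch block. The following is a minimal sketch, not a definitive implementation: it reuses only the calls from the example above, and since the specific exception type is not stated here, it catches the general Exception type as an assumption. The class name SafePdfOcr is hypothetical.

```csharp
using System;
using IronOcr;

class SafePdfOcr
{
    static void Main()
    {
        var ocr = new IronTesseract();
        try
        {
            using (var input = new OcrInput())
            {
                // Same calls as in the PDF example above
                input.AddPdf("..\\Documents\\Purgatorio.pdf", "dante");
                var result = ocr.Read(input);
                result.SaveAsSearchablePdf("SearchablePDFDocument.pdf");
            }
        }
        catch (Exception ex)
        {
            // Assumption: the exact exception type is not specified, so catch broadly
            Console.Error.WriteLine($"OCR failed: {ex.Message}");
        }
    }
}
```

If the failure is size-related, one option is to split the document and process it in smaller parts, though whether that helps depends on the file and the environment.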

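Enough about Windows applications; let's see how to use OCR on Microsoft Azure.
An advantage of IronOCR is that it works very well with Microsoft Azure as an Azure Function in a microservices architecture. Here is a very quick example of a Microsoft Azure Function that uses IronOCR. This function extracts text from an image.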

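using System.Net.Http;
using System.Threading.Tasks;
using IronOcr;
using Microsoft.AspNetCore.Http;
using Microsoft.AspNetCore.Mvc;
using Microsoft.Azure.WebJobs;
using Microsoft.Azure.WebJobs.Extensions.Http;

public static class OCRFunction
{
    // Reuse a single HttpClient across invocations
    public static HttpClient hcClient = new HttpClient();

    [FunctionName("IronOCRFunction_EX")]
    public static async Task<IActionResult> Run([HttpTrigger] HttpRequest hrRequest, ExecutionContext ecContext)
    {
        // URL of the image to read, taken from the "image" query parameter
        string URI = hrRequest.Query["image"];
        var saStream = await hcClient.GetStreamAsync(URI);

        var ocr = new IronTesseract();
        using (var inputOCR = new OcrInput(saStream))
        {
            var outputOCR = ocr.Read(inputOCR);
            // Return the extracted text in the HTTP response
            return new OkObjectResult(outputOCR.Text);
        }
    }
}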

Public Module OCRFunction
	' Reuse a single HttpClient across invocations
	Public hcClient As New HttpClient()

	<FunctionName("IronOCRFunction_EX")>
	Public Async Function Run(<HttpTrigger> ByVal hrRequest As HttpRequest, ByVal ecContext As ExecutionContext) As Task(Of IActionResult)
		' URL of the image to read, taken from the "image" query parameter
		Dim URI As String = hrRequest.Query("image")
		Dim saStream = Await hcClient.GetStreamAsync(URI)

		Dim ocr = New IronTesseract()
		Using inputOCR = New OcrInput(saStream)
			Dim outputOCR = ocr.Read(inputOCR)
			' Return the extracted text in the HTTP response
			Return New OkObjectResult(outputOCR.Text)
		End Using
	End Function
End Module

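The function downloads the image at the URL passed in the image query parameter, feeds it directly into the OCR engine, and returns the extracted text as its output.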
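
To illustrate how a client might call this function, here is a minimal sketch using HttpClient. Both URLs are hypothetical placeholders: the function address assumes the default /api/<function name> route, and the image address stands in for any image the function can reach. Depending on the function's authorization level, a function key may also need to be supplied.

```csharp
using System;
using System.Net.Http;
using System.Threading.Tasks;

class CallOcrFunction
{
    static async Task Main()
    {
        // Hypothetical values: replace with your deployed function URL and a reachable image
        const string functionUrl = "https://example.azurewebsites.net/api/IronOCRFunction_EX";
        const string imageUrl = "https://example.com/Purgatory.png";

        using (var client = new HttpClient())
        {
            // The function reads the image address from the "image" query parameter,
            // so the address is URL-encoded before being appended
            string requestUri = functionUrl + "?image=" + Uri.EscapeDataString(imageUrl);
            string text = await client.GetStringAsync(requestUri);
            Console.WriteLine(text);
        }
    }
}
```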

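A quick recap of Microsoft Azure.

According to Microsoft: Microsoft Azure microservices are an architectural approach to building applications in which each core function, or service, is built and deployed independently. A microservices architecture is distributed and loosely coupled, so the failure of one component does not bring down the entire application. Independent components work together and communicate through well-defined API contracts. Build microservice applications to meet rapidly changing business needs and bring new functionality to market faster.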

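Some other features of IronOCR with .NET or Microsoft Azure include the following:

Ability to perform OCR on practically any file, image, or PDF.
Lightning-fast processing of OCR input.
Outstanding accuracy.
Reads barcodes and QR codes.
Runs locally, with no SaaS required.
Can convert PDFs and images into searchable documents.
An excellent alternative to Azure OCR from Microsoft Cognitive Services.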

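Image filters for improving OCR performance

OcrInput.Rotate - Rotates the image clockwise by a number of degrees. For counterclockwise rotation, use a negative number.
OcrInput.Binarize() - This image filter turns every pixel either black or white, with no middle ground. This improves OCR performance.
OcrInput.ToGrayScale() - This image filter turns every pixel into a shade of gray. This improves OCR speed.
OcrInput.Contrast() - Increases contrast automatically. This filter improves OCR speed and accuracy in low-contrast scans.
OcrInput.DeNoise() - Removes digital noise. This filter should only be used where noise is expected in the input document.
OcrInput.Invert() - Inverts every color.
OcrInput.Dilate() - Dilation adds pixels to the boundaries of any object in the image.
OcrInput.Erode() - Erosion removes pixels from the boundaries of objects.
OcrInput.Deskew() - Rotates the image so that it is the right way up and orthogonal. This is very useful for OCR because Tesseract's tolerance for skewed scans can be as low as 5 degrees.
OcrInput.DeepCleanBackgroundNoise() - Heavy background noise removal.
OcrInput.EnhanceResolution - Enhances the resolution of low-quality images.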
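
As a rough illustration of how these filters might be combined, the sketch below straightens and denoises the sample image before reading it. It assumes that Deskew() and DeNoise() are instance methods on OcrInput that can be called without arguments, as the notation in the list above suggests; the exact signatures may differ between IronOCR versions. The class name FilterExample is hypothetical.

```csharp
using System;
using IronOcr;

class FilterExample
{
    static void Main()
    {
        var ocr = new IronTesseract();
        using (var input = new OcrInput("..\\Images\\Purgatory.PNG"))
        {
            // Assumption: filters are applied to the input before it is read
            input.Deskew();   // Straighten a skewed scan
            input.DeNoise();  // Remove digital noise; only useful when noise is expected

            var result = ocr.Read(input);
            Console.WriteLine(result.Text);
        }
    }
}
```

Filters add processing time, so it is usually better to apply only the ones the input actually needs.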

Speed performance

Here is an example that configures the engine and uses the EnglishFast language pack:

var ocr = new IronTesseract();
// Characters the engine should never output
ocr.Configuration.BlackListCharacters = "~`$#^*_}{][\\";
ocr.Configuration.PageSegmentationMode = TesseractPageSegmentationMode.Auto;
ocr.Configuration.TesseractVersion = TesseractVersion.Tesseract5;
ocr.Configuration.EngineMode = TesseractEngineMode.LstmOnly;
// Use the fast variant of the English language pack
ocr.Language = OcrLanguage.EnglishFast;
using (var input = new OcrInput("..\\Images\\Purgatory.PNG"))
{
    var result = ocr.Read(input);
    Console.WriteLine(result.Text);
}

Dim ocr = New IronTesseract()
' Characters the engine should never output
ocr.Configuration.BlackListCharacters = "~`$#^*_}{][\"
ocr.Configuration.PageSegmentationMode = TesseractPageSegmentationMode.Auto
ocr.Configuration.TesseractVersion = TesseractVersion.Tesseract5
ocr.Configuration.EngineMode = TesseractEngineMode.LstmOnly
' Use the fast variant of the English language pack
ocr.Language = OcrLanguage.EnglishFast
Using input = New OcrInput("..\Images\Purgatory.PNG")
	Dim result = ocr.Read(input)
	Console.WriteLine(result.Text)
End Using

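Pricing and licensing options

There are essentially three paid license tiers, all of which are based on a one-time purchase with a lifetime license.

All of them are free to use for development purposes.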

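IronOCR features for .NET applications running OCR on Azure and other systems

IronOCR supports 127 international languages. Each language is available in fast, standard, and best quality. Some of the available language packs include:
- Bulgarian
- Armenian
- Croatian
- Afrikaans
- Danish
- Czech
- Filipino
- Finnish
- French
- German
More language packs are available through the following link: IronOCR Language Packs
It works out of the box in .NET.
- Supports Xamarin
- Supports Mono
- Supports Microsoft Azure
- Supports Docker on Microsoft Azure
- Supports PDF documents
- Supports multi-frame TIFF files
- Supports all major image formats
Supports the following .NET frameworks:
- .NET Framework 4.5 and later
- .NET Standard 2
- .NET Core 2
- .NET Core 3
- .NET 5
You don't need to install Tesseract (an open-source OCR engine that supports Unicode and more than 100 languages) for IronOCR to work.
- Higher accuracy than Tesseract
- Faster than Tesseract
- Corrects low-quality scans of documents or files
- Corrects low-quality, skewed scans of documents or files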