How to OCR Documents on AWS Lambda

Chaknith Bin

November 21, 2023

Updated December 17, 2024

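This how-to article is a step-by-step guide to setting up an AWS Lambda function with IronOCR. By following it, you will learn how to configure IronOCR and read documents stored in an S3 bucket.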

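How to OCR Documents on AWS Lambda

Download a C# library to OCR documents
Create a project and choose the project template
Modify the FunctionHandler code
Configure and deploy the project
Invoke the function and check the results in S3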

Installation

This article uses an S3 bucket, so the AWSSDK.S3 package is required.

If you are using IronOCR ZIP, you must set the temporary folder.

var awsTmpPath = @"/tmp/";
IronOcr.Installation.InstallationPath = awsTmpPath;
IronOcr.Installation.LogFilePath = awsTmpPath;


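Create an AWS Lambda Project

With Visual Studio, creating a containerized AWS Lambda function is a straightforward process:

Install the AWS Toolkit for Visual Studio
Select "AWS Lambda Project (.NET Core - C#)"
Select a ".NET 8 (Container Image)" blueprint, then select "Finish".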

Add Package Dependencies

Using the IronOCR library on AWS Lambda with .NET 8 requires no additional dependencies to be installed. Modify the project's Dockerfile as follows:


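FROM public.ecr.aws/lambda/dotnet:8

# Install necessary packages
RUN dnf update -y

WORKDIR /var/task

# This COPY command copies the .NET Lambda project's build artifacts from the host machine into the image.
# The source of the COPY should match where the .NET Lambda project publishes its build artifacts. If the Lambda function is being built
# with the AWS .NET Lambda Tooling, the `--docker-host-build-output-dir` switch controls where the .NET Lambda project
# will be built. The .NET Lambda project templates default to having `--docker-host-build-output-dir`
# set in the aws-lambda-tools-defaults.json file to "bin/Release/lambda-publish".
#
# Alternatively, a Docker multi-stage build could be used to build the .NET Lambda project inside the image.
# For more information on this approach, check the project's README.md file.
COPY "bin/Release/lambda-publish"  .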

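Modify the FunctionHandler Code

This example retrieves a PDF from an S3 bucket, runs OCR on it, and saves a searchable PDF back to the same bucket. When using IronOCR ZIP, setting the temporary folder is essential, because the library needs write permission to copy its runtime folder out of the DLLs.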

using Amazon.Lambda.Core;
using Amazon.S3;
using Amazon.S3.Model;
using IronOcr;

// Assembly attribute to enable the Lambda function's JSON input to be converted into a .NET class.
[assembly: LambdaSerializer(typeof(Amazon.Lambda.Serialization.SystemTextJson.DefaultLambdaJsonSerializer))]

namespace IronOcrZipAwsLambda;

public class Function
{
    private static readonly IAmazonS3 _s3Client = new AmazonS3Client(Amazon.RegionEndpoint.APSoutheast1);

    /// <summary>
    /// Downloads a PDF from S3, runs OCR on it, and uploads a searchable PDF to the same bucket.
    /// </summary>
    /// <param name="context">The ILambdaContext that provides methods for logging and describing the Lambda environment.</param>
    public async Task FunctionHandler(ILambdaContext context)
    {
        // Point IronOCR at Lambda's writable /tmp directory for its runtime files and log
        var awsTmpPath = @"/tmp/";
        IronOcr.Installation.InstallationPath = awsTmpPath;
        IronOcr.Installation.LogFilePath = awsTmpPath;

        IronOcr.License.LicenseKey = "IRONOCR-MYLICENSE-KEY-1EF01";

        string bucketName = "deploymenttestbucket"; // Your bucket name
        string pdfName = "sample";
        string objectKey = $"IronPdfZip/{pdfName}.pdf";
        string objectKeyForSearchablePdf = $"IronPdfZip/{pdfName}-SearchablePdf.pdf";

        try
        {
            // Retrieve the PDF file from S3
            var pdfData = await GetPdfFromS3Async(bucketName, objectKey);

            IronTesseract ironTesseract = new IronTesseract();
            using OcrInput ocrInput = new OcrInput(); // dispose the input once OCR is done
            ocrInput.LoadPdf(pdfData);
            OcrResult result = ironTesseract.Read(ocrInput);

            // Log the text extracted by OCR
            context.Logger.LogLine($"OCR result: {result.Text}");

            // Upload the searchable PDF to S3
            await UploadPdfToS3Async(bucketName, objectKeyForSearchablePdf, result.SaveAsSearchablePdfBytes());

            context.Logger.LogLine($"PDF uploaded successfully to {bucketName}/{objectKeyForSearchablePdf}");
        }
        catch (Exception e)
        {
            context.Logger.LogLine($"[ERROR] FunctionHandler: {e.Message}");
        }
    }

    // Function to download the PDF file from S3 as a byte array
    private async Task<byte[]> GetPdfFromS3Async(string bucketName, string objectKey)
    {
        var request = new GetObjectRequest
        {
            BucketName = bucketName,
            Key = objectKey
        };

        using (var response = await _s3Client.GetObjectAsync(request))
        using (var memoryStream = new MemoryStream())
        {
            await response.ResponseStream.CopyToAsync(memoryStream);
            return memoryStream.ToArray();
        }
    }

    // Function to upload the PDF file to S3
    private async Task UploadPdfToS3Async(string bucketName, string objectKey, byte[] pdfBytes)
    {
        using (var memoryStream = new MemoryStream(pdfBytes))
        {
            var request = new PutObjectRequest
            {
                BucketName = bucketName,
                Key = objectKey,
                InputStream = memoryStream,
                ContentType = "application/pdf",
            };

            await _s3Client.PutObjectAsync(request);
        }
    }
}


Before the try block, the file "sample.pdf" in the IronPdfZip directory is specified as the input. The GetPdfFromS3Async method then retrieves the PDF bytes, which are passed to the LoadPdf method.
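The bucket and file names in the handler are hard-coded. As a minimal sketch, assuming you would rather set them per deployment, the helper below reads them from environment variables and falls back to the article's values. The class OcrJobSettings and the variable names OCR_BUCKET_NAME and OCR_PDF_NAME are hypothetical; they are not part of IronOCR or AWS.

```csharp
using System;

// Hypothetical helper: resolves the bucket and object keys used by the handler.
public static class OcrJobSettings
{
    // Returns the environment variable's value, or the fallback when it is unset or blank.
    public static string GetOrDefault(string variableName, string fallback)
    {
        var value = Environment.GetEnvironmentVariable(variableName);
        return string.IsNullOrWhiteSpace(value) ? fallback : value;
    }

    // Builds the same key layout as the article: IronPdfZip/<name>.pdf and
    // IronPdfZip/<name>-SearchablePdf.pdf.
    public static (string Bucket, string InputKey, string OutputKey) Resolve()
    {
        // OCR_BUCKET_NAME and OCR_PDF_NAME are assumed names chosen for this sketch.
        string bucket = GetOrDefault("OCR_BUCKET_NAME", "deploymenttestbucket");
        string pdfName = GetOrDefault("OCR_PDF_NAME", "sample");
        return (bucket, $"IronPdfZip/{pdfName}.pdf", $"IronPdfZip/{pdfName}-SearchablePdf.pdf");
    }
}
```

Inside FunctionHandler, the four string declarations could then be replaced with `var (bucketName, objectKey, objectKeyForSearchablePdf) = OcrJobSettings.Resolve();`.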

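Increase Memory and Timeout

The amount of memory to allocate to the Lambda function varies with the size of the documents being processed and the number of documents processed at the same time. As a baseline, set the memory to 512 MB and the timeout to 300 seconds in aws-lambda-tools-defaults.json.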


"function-memory-size" : 512,
"function-timeout" : 300

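When memory is insufficient, the program fails with the error "Runtime exited with error: signal: killed". Increasing the memory size resolves this issue. For more details, refer to the troubleshooting article: AWS Lambda - Runtime Exited Signal: Killed.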
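To judge how much headroom a given document leaves, it may help to log resource usage around the OCR call. The sketch below is one possible way to do that, not part of the original sample; the class LambdaDiagnostics and its method name are hypothetical, while MemoryLimitInMB, RemainingTime, and Logger are members of ILambdaContext.

```csharp
using System;
using Amazon.Lambda.Core;

// Hypothetical helper for sizing the function's memory and timeout settings.
public static class LambdaDiagnostics
{
    // Logs the configured memory limit, the process's current working set,
    // and the time left before the function times out.
    public static void LogResourceUsage(ILambdaContext context, string stage)
    {
        long workingSetMb = Environment.WorkingSet / (1024 * 1024);
        context.Logger.LogLine(
            $"[{stage}] memory limit: {context.MemoryLimitInMB} MB, " +
            $"working set: {workingSetMb} MB, " +
            $"time remaining: {context.RemainingTime.TotalSeconds:F0} s");
    }
}
```

Calling `LambdaDiagnostics.LogResourceUsage(context, "before OCR")` and again after `ironTesseract.Read(ocrInput)` gives a rough picture of how close a run comes to the limits. The working set is only an approximation of what Lambda counts against the memory limit.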

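Publish

To publish from Visual Studio, right-click the project and select "Publish to AWS Lambda...", then configure the necessary settings. You can read more about publishing a Lambda function on the AWS website.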

Try It Out!

You can invoke the Lambda function through the Lambda console or through Visual Studio.
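After invoking the function, you can confirm that the searchable PDF landed in the bucket. The following is a minimal sketch of such a check using the AWSSDK.S3 package; the class OcrResultCheck and its method are hypothetical names introduced for illustration.

```csharp
using System;
using System.Net;
using System.Threading.Tasks;
using Amazon.S3;

// Hypothetical helper: checks whether the OCR output object exists in S3.
public static class OcrResultCheck
{
    // Returns the object's size in bytes, or null when S3 reports that it does not exist.
    public static async Task<long?> GetObjectSizeAsync(IAmazonS3 s3Client, string bucketName, string objectKey)
    {
        try
        {
            // Requests only the object's metadata, so the PDF body is not downloaded.
            var metadata = await s3Client.GetObjectMetadataAsync(bucketName, objectKey);
            return metadata.ContentLength;
        }
        catch (AmazonS3Exception e) when (e.StatusCode == HttpStatusCode.NotFound)
        {
            return null;
        }
    }
}
```

With the article's values, the key to check would be "IronPdfZip/sample-SearchablePdf.pdf" in the bucket "deploymenttestbucket". Note that S3 may answer with 403 rather than 404 for a missing key when the caller lacks permission to list the bucket; in that case the exception is not caught here and propagates to the caller.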

Chaknith Bin

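Software Engineer

Chaknith works on IronXL and IronBarcode. He has deep expertise in C# and .NET, helping to improve the software and support customers. The insights he gains from user interactions help improve the products, the documentation, and the overall experience.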
