如何在 AWS Lambda 上识别文档 OCR

This article was translated from English: Does it need improvement?
Translated
View the article in English
Amazon Lambda Architecture Logo related to 如何在 AWS Lambda 上识别文档 OCR

本文提供了使用 IronOCR 设置 AWS Lambda 函数的逐步指南。 通过本指南,您将学习如何配置 IronOCR 并高效地读取存储在 S3 存储桶中的文档。

安装

本文将使用 S3 存储桶,因此需要 AWSSDK.S3 包。

如果您使用的是 IronOCR ZIP,则必须设置临时文件夹。

// Set temporary folder path and log file path for IronOCR.
var awsTmpPath = @"/tmp/";
IronOcr.Installation.InstallationPath = awsTmpPath;
IronOcr.Installation.LogFilePath = awsTmpPath;
// Set temporary folder path and log file path for IronOCR.
var awsTmpPath = @"/tmp/";
IronOcr.Installation.InstallationPath = awsTmpPath;
IronOcr.Installation.LogFilePath = awsTmpPath;
' Set temporary folder path and log file path for IronOCR.
Dim awsTmpPath = "/tmp/"
IronOcr.Installation.InstallationPath = awsTmpPath
IronOcr.Installation.LogFilePath = awsTmpPath
$vbLabelText   $csharpLabel

今天在您的项目中使用 IronOCR,免费试用。

第一步:
green arrow pointer

创建 AWS Lambda 项目

使用 Visual Studio 创建容器化的 AWS Lambda 是一个简单的过程:

选择容器图像

添加包依赖项

在 .NET 8 中使用 IronOCR 库不需要安装额外的依赖项即可在 AWS Lambda 上使用。 修改项目的 Dockerfile 文件,添加以下内容:

FROM public.ecr.aws/lambda/dotnet:8

# Update all installed packages
RUN dnf update -y

WORKDIR /var/task

# Copy build artifacts from the host machine into the Docker image
COPY "bin/Release/lambda-publish" .

修改 FunctionHandler 代码

此示例从 S3 存储桶中检索图像,对其进行处理,并将可搜索的 PDF 保存回同一个存储桶。 使用 IronOCR ZIP 时,设置临时文件夹至关重要,因为该库需要写入权限才能从 DLL 复制运行时文件夹。

using Amazon.Lambda.Core;
using Amazon.S3;
using Amazon.S3.Model;
using IronOcr;
using System;
using System.IO;
using System.Threading.Tasks;

// Assembly attribute to enable the Lambda function's JSON input to be converted into a .NET class.
[assembly: LambdaSerializer(typeof(Amazon.Lambda.Serialization.SystemTextJson.DefaultLambdaJsonSerializer))]

namespace IronOcrZipAwsLambda
{
    public class Function
    {
        // Initialize the S3 client with a specific region endpoint
        private static readonly IAmazonS3 _s3Client = new AmazonS3Client(Amazon.RegionEndpoint.APSoutheast1);

        /// <summary>
        /// Function handler to process OCR on the PDF stored in S3.
        /// </summary>
        /// <param name="context">The ILambdaContext that provides methods for logging and describing the Lambda environment.</param>
        public async Task FunctionHandler(ILambdaContext context)
        {
            // Set up necessary paths for IronOCR
            var awsTmpPath = @"/tmp/";
            IronOcr.Installation.InstallationPath = awsTmpPath;
            IronOcr.Installation.LogFilePath = awsTmpPath;

            // Set license key for IronOCR
            IronOcr.License.LicenseKey = "IRONOCR-MYLICENSE-KEY-1EF01";

            string bucketName = "deploymenttestbucket"; // Your bucket name
            string pdfName = "sample";
            string objectKey = $"IronPdfZip/{pdfName}.pdf";
            string objectKeyForSearchablePdf = $"IronPdfZip/{pdfName}-SearchablePdf.pdf";

            try
            {
                // Retrieve the PDF file from S3
                var pdfData = await GetPdfFromS3Async(bucketName, objectKey);

                // Initialize IronTesseract for OCR processing
                IronTesseract ironTesseract = new IronTesseract();
                OcrInput ocrInput = new OcrInput();
                ocrInput.LoadPdf(pdfData);
                OcrResult result = ironTesseract.Read(ocrInput);

                // Log the OCR result
                context.Logger.LogLine($"OCR result: {result.Text}");

                // Upload the searchable PDF to S3
                await UploadPdfToS3Async(bucketName, objectKeyForSearchablePdf, result.SaveAsSearchablePdfBytes());
                context.Logger.LogLine($"PDF uploaded successfully to {bucketName}/{objectKeyForSearchablePdf}");
            }
            catch (Exception e)
            {
                context.Logger.LogLine($"[ERROR] FunctionHandler: {e.Message}");
            }
        }

        /// <summary>
        /// Retrieves a PDF from S3 and returns it as a byte array.
        /// </summary>
        private async Task<byte[]> GetPdfFromS3Async(string bucketName, string objectKey)
        {
            var request = new GetObjectRequest
            {
                BucketName = bucketName,
                Key = objectKey
            };

            using (var response = await _s3Client.GetObjectAsync(request))
            using (var memoryStream = new MemoryStream())
            {
                await response.ResponseStream.CopyToAsync(memoryStream);
                return memoryStream.ToArray();
            }
        }

        /// <summary>
        /// Uploads the generated searchable PDF back to S3.
        /// </summary>
        private async Task UploadPdfToS3Async(string bucketName, string objectKey, byte[] pdfBytes)
        {
            using (var memoryStream = new MemoryStream(pdfBytes))
            {
                var request = new PutObjectRequest
                {
                    BucketName = bucketName,
                    Key = objectKey,
                    InputStream = memoryStream,
                    ContentType = "application/pdf"
                };

                await _s3Client.PutObjectAsync(request);
            }
        }
    }
}
using Amazon.Lambda.Core;
using Amazon.S3;
using Amazon.S3.Model;
using IronOcr;
using System;
using System.IO;
using System.Threading.Tasks;

// Assembly attribute to enable the Lambda function's JSON input to be converted into a .NET class.
[assembly: LambdaSerializer(typeof(Amazon.Lambda.Serialization.SystemTextJson.DefaultLambdaJsonSerializer))]

namespace IronOcrZipAwsLambda
{
    public class Function
    {
        // Initialize the S3 client with a specific region endpoint
        private static readonly IAmazonS3 _s3Client = new AmazonS3Client(Amazon.RegionEndpoint.APSoutheast1);

        /// <summary>
        /// Function handler to process OCR on the PDF stored in S3.
        /// </summary>
        /// <param name="context">The ILambdaContext that provides methods for logging and describing the Lambda environment.</param>
        public async Task FunctionHandler(ILambdaContext context)
        {
            // Set up necessary paths for IronOCR
            var awsTmpPath = @"/tmp/";
            IronOcr.Installation.InstallationPath = awsTmpPath;
            IronOcr.Installation.LogFilePath = awsTmpPath;

            // Set license key for IronOCR
            IronOcr.License.LicenseKey = "IRONOCR-MYLICENSE-KEY-1EF01";

            string bucketName = "deploymenttestbucket"; // Your bucket name
            string pdfName = "sample";
            string objectKey = $"IronPdfZip/{pdfName}.pdf";
            string objectKeyForSearchablePdf = $"IronPdfZip/{pdfName}-SearchablePdf.pdf";

            try
            {
                // Retrieve the PDF file from S3
                var pdfData = await GetPdfFromS3Async(bucketName, objectKey);

                // Initialize IronTesseract for OCR processing
                IronTesseract ironTesseract = new IronTesseract();
                OcrInput ocrInput = new OcrInput();
                ocrInput.LoadPdf(pdfData);
                OcrResult result = ironTesseract.Read(ocrInput);

                // Log the OCR result
                context.Logger.LogLine($"OCR result: {result.Text}");

                // Upload the searchable PDF to S3
                await UploadPdfToS3Async(bucketName, objectKeyForSearchablePdf, result.SaveAsSearchablePdfBytes());
                context.Logger.LogLine($"PDF uploaded successfully to {bucketName}/{objectKeyForSearchablePdf}");
            }
            catch (Exception e)
            {
                context.Logger.LogLine($"[ERROR] FunctionHandler: {e.Message}");
            }
        }

        /// <summary>
        /// Retrieves a PDF from S3 and returns it as a byte array.
        /// </summary>
        private async Task<byte[]> GetPdfFromS3Async(string bucketName, string objectKey)
        {
            var request = new GetObjectRequest
            {
                BucketName = bucketName,
                Key = objectKey
            };

            using (var response = await _s3Client.GetObjectAsync(request))
            using (var memoryStream = new MemoryStream())
            {
                await response.ResponseStream.CopyToAsync(memoryStream);
                return memoryStream.ToArray();
            }
        }

        /// <summary>
        /// Uploads the generated searchable PDF back to S3.
        /// </summary>
        private async Task UploadPdfToS3Async(string bucketName, string objectKey, byte[] pdfBytes)
        {
            using (var memoryStream = new MemoryStream(pdfBytes))
            {
                var request = new PutObjectRequest
                {
                    BucketName = bucketName,
                    Key = objectKey,
                    InputStream = memoryStream,
                    ContentType = "application/pdf"
                };

                await _s3Client.PutObjectAsync(request);
            }
        }
    }
}
Imports Amazon.Lambda.Core
Imports Amazon.S3
Imports Amazon.S3.Model
Imports IronOcr
Imports System
Imports System.IO
Imports System.Threading.Tasks

' Assembly attribute to enable the Lambda function's JSON input to be converted into a .NET class.
<Assembly: LambdaSerializer(GetType(Amazon.Lambda.Serialization.SystemTextJson.DefaultLambdaJsonSerializer))>

Namespace IronOcrZipAwsLambda
	Public Class [Function]
		' Initialize the S3 client with a specific region endpoint
		Private Shared ReadOnly _s3Client As IAmazonS3 = New AmazonS3Client(Amazon.RegionEndpoint.APSoutheast1)

		''' <summary>
		''' Function handler to process OCR on the PDF stored in S3.
		''' </summary>
		''' <param name="context">The ILambdaContext that provides methods for logging and describing the Lambda environment.</param>
		Public Async Function FunctionHandler(ByVal context As ILambdaContext) As Task
			' Set up necessary paths for IronOCR
			Dim awsTmpPath = "/tmp/"
			IronOcr.Installation.InstallationPath = awsTmpPath
			IronOcr.Installation.LogFilePath = awsTmpPath

			' Set license key for IronOCR
			IronOcr.License.LicenseKey = "IRONOCR-MYLICENSE-KEY-1EF01"

			Dim bucketName As String = "deploymenttestbucket" ' Your bucket name
			Dim pdfName As String = "sample"
			Dim objectKey As String = $"IronPdfZip/{pdfName}.pdf"
			Dim objectKeyForSearchablePdf As String = $"IronPdfZip/{pdfName}-SearchablePdf.pdf"

			Try
				' Retrieve the PDF file from S3
				Dim pdfData = Await GetPdfFromS3Async(bucketName, objectKey)

				' Initialize IronTesseract for OCR processing
				Dim ironTesseract As New IronTesseract()
				Dim ocrInput As New OcrInput()
				ocrInput.LoadPdf(pdfData)
				Dim result As OcrResult = ironTesseract.Read(ocrInput)

				' Log the OCR result
				context.Logger.LogLine($"OCR result: {result.Text}")

				' Upload the searchable PDF to S3
				Await UploadPdfToS3Async(bucketName, objectKeyForSearchablePdf, result.SaveAsSearchablePdfBytes())
				context.Logger.LogLine($"PDF uploaded successfully to {bucketName}/{objectKeyForSearchablePdf}")
			Catch e As Exception
				context.Logger.LogLine($"[ERROR] FunctionHandler: {e.Message}")
			End Try
		End Function

		''' <summary>
		''' Retrieves a PDF from S3 and returns it as a byte array.
		''' </summary>
		Private Async Function GetPdfFromS3Async(ByVal bucketName As String, ByVal objectKey As String) As Task(Of Byte())
			Dim request = New GetObjectRequest With {
				.BucketName = bucketName,
				.Key = objectKey
			}

			Using response = Await _s3Client.GetObjectAsync(request)
			Using memoryStream As New MemoryStream()
				Await response.ResponseStream.CopyToAsync(memoryStream)
				Return memoryStream.ToArray()
			End Using
			End Using
		End Function

		''' <summary>
		''' Uploads the generated searchable PDF back to S3.
		''' </summary>
		Private Async Function UploadPdfToS3Async(ByVal bucketName As String, ByVal objectKey As String, ByVal pdfBytes() As Byte) As Task
			Using memoryStream As New MemoryStream(pdfBytes)
				Dim request = New PutObjectRequest With {
					.BucketName = bucketName,
					.Key = objectKey,
					.InputStream = memoryStream,
					.ContentType = "application/pdf"
				}

				Await _s3Client.PutObjectAsync(request)
			End Using
		End Function
	End Class
End Namespace
$vbLabelText   $csharpLabel

在 try 代码块之前,指定要从 IronPdfZip 目录读取的文件为"sample.pdf"。 然后使用GetPdfFromS3Async方法检索 PDF 字节,并将其传递给LoadPdf方法。

增加内存和超时

根据正在处理的文档大小和同时处理的文档数量,Lambda 函数中分配的内存量会有所不同。 作为基础,将内存设置为 512 MB,超时设置为 300 秒在 aws-lambda-tools-defaults.json 中。

{
    "function-memory-size": 512,
    "function-timeout": 300
}

当内存不足时,程序会引发错误:'Runtime exited with error: signal: killed.' 增加内存大小可以解决此问题。 有关更多详细信息,请参阅故障排除文章: AWS Lambda - 运行时退出信号:已终止

发布

要在 Visual Studio 中发布,右键单击项目并选择 'Publish to AWS Lambda...',然后配置必要的设置。 您可以在 AWS 网站 上阅读有关发布 Lambda 的更多信息。

试一试!

您可以通过 Lambda 控制台 或通过 Visual Studio 激活 Lambda 函数。

常见问题解答

如何在 AWS 上使用 C# 对文档执行 OCR?

您可以使用 IronOCR 将 OCR 应用于存储在 Amazon S3 存储桶中的文档,将其与 AWS Lambda 集成。这涉及创建一个 C# 的 Lambda 函数,从 S3 检索文档,使用 IronOCR 处理这些文档,然后将结果上传回 S3。

在 AWS Lambda 中使用 C# 设置 OCR 需要哪些步骤?

要使用 C# 在 AWS Lambda 上设置 OCR,您需要下载 IronOCR 库,在 Visual Studio 中创建 AWS Lambda 项目,配置您的函数处理程序以使用 IronOCR 进行处理,然后部署您的函数。此设置允许您将图像转换为可搜索的 PDF。

在 AWS Lambda 上运行 OCR 的推荐配置是什么?

为了在 AWS Lambda 上使用 IronOCR 运行 OCR 时获得最佳性能,建议至少设置 512 MB 的内存分配和 300 秒的超时。这些设置有助于管理大文件或多个文档的处理。

如何处理 AWS Lambda 中的“运行时退出错误:信号:已终止”?

此错误通常表示您的 Lambda 函数已耗尽分配的内存。增加 Lambda 函数配置中的内存分配可以解决此问题,特别是在处理大型文档时使用 IronOCR。

我可以在部署前本地测试我的 AWS Lambda OCR 函数吗?

是的,您可以使用 Visual Studio 的 AWS 工具包在本地测试您的 AWS Lambda OCR 函数。此工具包提供了一个本地环境以模拟 Lambda 的执行,使您能够在部署前调试和完善函数。

AWS Lambda 项目中的 Dockerfile 有什么用途?

AWS Lambda 项目中的 Dockerfile 用于创建一个容器镜像,该镜像定义了您的 Lambda 函数的执行环境和依赖项。这确保了您的函数在 AWS 中正常运行所需的所有组件。

在 AWS Lambda 上使用 .NET 8 的 IronOCR 需要其他依赖项吗?

在 AWS Lambda 上使用 .NET 8 时,除了 IronOCR 库和必要的 AWS SDK 包之外不需要其他依赖项。这简化了执行 OCR 任务的集成过程。

将 C# OCR 与 AWS Lambda 集成的前提条件是什么?

在将 C# OCR 与 AWS Lambda 集成之前,您需要安装 S3 的 AWS SDK、IronOCR 库和 Visual Studio 的 AWS 工具包。您还需要配置一个用于存储和检索文档的 S3 存储桶。

Curtis Chau
技术作家

Curtis Chau 拥有卡尔顿大学的计算机科学学士学位,专注于前端开发,精通 Node.js、TypeScript、JavaScript 和 React。他热衷于打造直观且美观的用户界面,喜欢使用现代框架并创建结构良好、视觉吸引力强的手册。

除了开发之外,Curtis 对物联网 (IoT) 有浓厚的兴趣,探索将硬件和软件集成的新方法。在空闲时间,他喜欢玩游戏和构建 Discord 机器人,将他对技术的热爱与创造力相结合。

准备开始了吗?
Nuget 下载 5,167,857 | Version: 2025.11 刚刚发布