如何在 AWS Lambda 上 OCR 文档

This article was translated from English: Does it need improvement?
Translated
View the article in English
Amazon Lambda Architecture Logo related to 如何在 AWS Lambda 上 OCR 文档

这篇说明文章提供了使用 IronOCR 设置 AWS Lambda 函数的分步指南。 通过本指南,您将学会如何配置 IronOCR 并高效读取存储在 S3 存储桶中的文档。

安装

本文将使用 S3 文件桶,因此 *AWSSDK.S3需要 * 软件包。

如果使用 IronOCR ZIP,则必须设置临时文件夹。

var awsTmpPath = @"/tmp/";
IronOcr.Installation.InstallationPath = awsTmpPath;
IronOcr.Installation.LogFilePath = awsTmpPath;
var awsTmpPath = @"/tmp/";
IronOcr.Installation.InstallationPath = awsTmpPath;
IronOcr.Installation.LogFilePath = awsTmpPath;
Dim awsTmpPath = "/tmp/"
IronOcr.Installation.InstallationPath = awsTmpPath
IronOcr.Installation.LogFilePath = awsTmpPath
VB   C#

立即在您的项目中开始使用IronOCR,并享受免费试用。

第一步:
green arrow pointer

创建 AWS Lambda 项目

使用 Visual Studio,创建容器化 AWS Lambda 是一个简单的过程:

添加软件包依赖关系

在 .NET 8 中使用 IronOCR 库在 AWS Lambda 上使用不需要安装额外的依赖项。 用以下内容修改项目的 Dockerfile:


FROM public.ecr.aws/lambda/dotnet:8

# 安装必要的软件包

运行 dnf update -y

WORKDIR /var/task

# 该 COPY 命令将 .NET Lambda 项目的构建工件从主机复制到映像中。

# COPY 的来源应与 .NET Lambda 项目发布其构建工件的位置相匹配。

如果正在构建 Lambda 函数

# 使用 AWS .NET Lambda 工具时,"--docker-host-build-output-dir "开关控制 .NET Lambda 项目的位置。

# 将建立。

.NET Lambda 项目模板默认有`--docker-host-build-output-dir`。

# 在 aws-lambda-tools-defaults.json 文件中设置为 "bin/Release/lambda-publish"。

#

# 也可以使用 Docker 多阶段构建来在映像中构建 .NET Lambda 项目。

# 有关此方法的更多信息,请查看项目的 README.md 文件。

复制 "bin/Release/lambda-publish" 。

修改 FunctionHandler 代码

本示例从 S3 存储桶中检索图片,对其进行处理,然后将可搜索的 PDF 保存回同一存储桶。 使用 IronOCR ZIP 时,设置临时文件夹至关重要,因为该库需要写入权限才能从 DLL 复制运行时文件夹。

using Amazon.Lambda.Core;
using Amazon.S3;
using Amazon.S3.Model;
using IronOcr;

// Assembly attribute to enable the Lambda function's JSON input to be converted into a .NET class.
[assembly: LambdaSerializer(typeof(Amazon.Lambda.Serialization.SystemTextJson.DefaultLambdaJsonSerializer))]

namespace IronOcrZipAwsLambda;

public class Function
{
    private static readonly IAmazonS3 _s3Client = new AmazonS3Client(Amazon.RegionEndpoint.APSoutheast1);

    /// <param name="context">The ILambdaContext that provides methods for logging and describing the Lambda environment.</param>
    /// <returns></returns>
    public async Task FunctionHandler(ILambdaContext context)
    {
        // Set temp file
        var awsTmpPath = @"/tmp/";
        IronOcr.Installation.InstallationPath = awsTmpPath;
        IronOcr.Installation.LogFilePath = awsTmpPath;

        IronOcr.License.LicenseKey = "IRONOCR-MYLICENSE-KEY-1EF01";

        string bucketName = "deploymenttestbucket"; // Your bucket name
        string pdfName = "sample";
        string objectKey = $"IronPdfZip/{pdfName}.pdf";
        string objectKeyForSearchablePdf = $"IronPdfZip/{pdfName}-SearchablePdf.pdf";

        try
        {
            // Retrieve the PDF file from S3
            var pdfData = await GetPdfFromS3Async(bucketName, objectKey);

            IronTesseract ironTesseract = new IronTesseract();
            OcrInput ocrInput = new OcrInput();
            ocrInput.LoadPdf(pdfData);
            OcrResult result = ironTesseract.Read(ocrInput);

            // Use pdfData (byte array) as needed
            context.Logger.LogLine($"OCR result: {result.Text}");

            // Upload the PDF to S3
            await UploadPdfToS3Async(bucketName, objectKeyForSearchablePdf, result.SaveAsSearchablePdfBytes());

            context.Logger.LogLine($"PDF uploaded successfully to {bucketName}/{objectKeyForSearchablePdf}");
        }
        catch (Exception e)
        {
            context.Logger.LogLine($"[ERROR] FunctionHandler: {e.Message}");
        }
    }
    private async Task<byte[]> GetPdfFromS3Async(string bucketName, string objectKey)
    {
        var request = new GetObjectRequest
        {
            BucketName = bucketName,
            Key = objectKey
        };

        using (var response = await _s3Client.GetObjectAsync(request))
        using (var memoryStream = new MemoryStream())
        {
            await response.ResponseStream.CopyToAsync(memoryStream);
            return memoryStream.ToArray();
        }
    }

    // Function to upload the PDF file to S3
    private async Task UploadPdfToS3Async(string bucketName, string objectKey, byte[] pdfBytes)
    {
        using (var memoryStream = new MemoryStream(pdfBytes))
        {
            var request = new PutObjectRequest
            {
                BucketName = bucketName,
                Key = objectKey,
                InputStream = memoryStream,
                ContentType = "application/pdf",
            };

            await _s3Client.PutObjectAsync(request);
        }
    }
}
using Amazon.Lambda.Core;
using Amazon.S3;
using Amazon.S3.Model;
using IronOcr;

// Assembly attribute to enable the Lambda function's JSON input to be converted into a .NET class.
[assembly: LambdaSerializer(typeof(Amazon.Lambda.Serialization.SystemTextJson.DefaultLambdaJsonSerializer))]

namespace IronOcrZipAwsLambda;

public class Function
{
    private static readonly IAmazonS3 _s3Client = new AmazonS3Client(Amazon.RegionEndpoint.APSoutheast1);

    /// <param name="context">The ILambdaContext that provides methods for logging and describing the Lambda environment.</param>
    /// <returns></returns>
    public async Task FunctionHandler(ILambdaContext context)
    {
        // Set temp file
        var awsTmpPath = @"/tmp/";
        IronOcr.Installation.InstallationPath = awsTmpPath;
        IronOcr.Installation.LogFilePath = awsTmpPath;

        IronOcr.License.LicenseKey = "IRONOCR-MYLICENSE-KEY-1EF01";

        string bucketName = "deploymenttestbucket"; // Your bucket name
        string pdfName = "sample";
        string objectKey = $"IronPdfZip/{pdfName}.pdf";
        string objectKeyForSearchablePdf = $"IronPdfZip/{pdfName}-SearchablePdf.pdf";

        try
        {
            // Retrieve the PDF file from S3
            var pdfData = await GetPdfFromS3Async(bucketName, objectKey);

            IronTesseract ironTesseract = new IronTesseract();
            OcrInput ocrInput = new OcrInput();
            ocrInput.LoadPdf(pdfData);
            OcrResult result = ironTesseract.Read(ocrInput);

            // Use pdfData (byte array) as needed
            context.Logger.LogLine($"OCR result: {result.Text}");

            // Upload the PDF to S3
            await UploadPdfToS3Async(bucketName, objectKeyForSearchablePdf, result.SaveAsSearchablePdfBytes());

            context.Logger.LogLine($"PDF uploaded successfully to {bucketName}/{objectKeyForSearchablePdf}");
        }
        catch (Exception e)
        {
            context.Logger.LogLine($"[ERROR] FunctionHandler: {e.Message}");
        }
    }
    private async Task<byte[]> GetPdfFromS3Async(string bucketName, string objectKey)
    {
        var request = new GetObjectRequest
        {
            BucketName = bucketName,
            Key = objectKey
        };

        using (var response = await _s3Client.GetObjectAsync(request))
        using (var memoryStream = new MemoryStream())
        {
            await response.ResponseStream.CopyToAsync(memoryStream);
            return memoryStream.ToArray();
        }
    }

    // Function to upload the PDF file to S3
    private async Task UploadPdfToS3Async(string bucketName, string objectKey, byte[] pdfBytes)
    {
        using (var memoryStream = new MemoryStream(pdfBytes))
        {
            var request = new PutObjectRequest
            {
                BucketName = bucketName,
                Key = objectKey,
                InputStream = memoryStream,
                ContentType = "application/pdf",
            };

            await _s3Client.PutObjectAsync(request);
        }
    }
}
Imports Amazon.Lambda.Core
Imports Amazon.S3
Imports Amazon.S3.Model
Imports IronOcr

' Assembly attribute to enable the Lambda function's JSON input to be converted into a .NET class.
<Assembly: LambdaSerializer(GetType(Amazon.Lambda.Serialization.SystemTextJson.DefaultLambdaJsonSerializer))>

Namespace IronOcrZipAwsLambda

	Public Class [Function]
		Private Shared ReadOnly _s3Client As IAmazonS3 = New AmazonS3Client(Amazon.RegionEndpoint.APSoutheast1)

		''' <param name="context">The ILambdaContext that provides methods for logging and describing the Lambda environment.</param>
		''' <returns></returns>
		Public Async Function FunctionHandler(ByVal context As ILambdaContext) As Task
			' Set temp file
			Dim awsTmpPath = "/tmp/"
			IronOcr.Installation.InstallationPath = awsTmpPath
			IronOcr.Installation.LogFilePath = awsTmpPath

			IronOcr.License.LicenseKey = "IRONOCR-MYLICENSE-KEY-1EF01"

			Dim bucketName As String = "deploymenttestbucket" ' Your bucket name
			Dim pdfName As String = "sample"
			Dim objectKey As String = $"IronPdfZip/{pdfName}.pdf"
			Dim objectKeyForSearchablePdf As String = $"IronPdfZip/{pdfName}-SearchablePdf.pdf"

			Try
				' Retrieve the PDF file from S3
				Dim pdfData = Await GetPdfFromS3Async(bucketName, objectKey)

				Dim ironTesseract As New IronTesseract()
				Dim ocrInput As New OcrInput()
				ocrInput.LoadPdf(pdfData)
				Dim result As OcrResult = ironTesseract.Read(ocrInput)

				' Use pdfData (byte array) as needed
				context.Logger.LogLine($"OCR result: {result.Text}")

				' Upload the PDF to S3
				Await UploadPdfToS3Async(bucketName, objectKeyForSearchablePdf, result.SaveAsSearchablePdfBytes())

				context.Logger.LogLine($"PDF uploaded successfully to {bucketName}/{objectKeyForSearchablePdf}")
			Catch e As Exception
				context.Logger.LogLine($"[ERROR] FunctionHandler: {e.Message}")
			End Try
		End Function
		Private Async Function GetPdfFromS3Async(ByVal bucketName As String, ByVal objectKey As String) As Task(Of Byte())
			Dim request = New GetObjectRequest With {
				.BucketName = bucketName,
				.Key = objectKey
			}

			Using response = Await _s3Client.GetObjectAsync(request)
			Using memoryStream As New MemoryStream()
				Await response.ResponseStream.CopyToAsync(memoryStream)
				Return memoryStream.ToArray()
			End Using
			End Using
		End Function

		' Function to upload the PDF file to S3
		Private Async Function UploadPdfToS3Async(ByVal bucketName As String, ByVal objectKey As String, ByVal pdfBytes() As Byte) As Task
			Using memoryStream As New MemoryStream(pdfBytes)
				Dim request = New PutObjectRequest With {
					.BucketName = bucketName,
					.Key = objectKey,
					.InputStream = memoryStream,
					.ContentType = "application/pdf"
				}

				Await _s3Client.PutObjectAsync(request)
			End Using
		End Function
	End Class
End Namespace
VB   C#

在 try 块之前,指定从 IronPdfZip 目录中读取文件 "sample.pdf"。 然后使用 GetPdfFromS3Async 方法检索 PDF 字节,并将其传递给 LoadPdf 方法。

增加内存和超时

Lambda 函数分配的内存量将根据正在处理的文档大小和同时处理的文档数量而变化。 作为基准,在 aws-lambda-tools-defaults.json 中将内存设置为 512 MB,超时时间设置为 300 秒。


"function-memory-size" : 512,

"function-timeout" : 300

当内存不足时,程序会出现错误:"Runtime exited with error: signal: killed"。增加内存大小可以解决这个问题。 有关详细信息,请参阅故障排除文章:AWS Lambda - 运行时退出信号:已杀死.

出版

要在 Visual Studio 中发布,请右键单击项目并选择 "发布到 AWS Lambda...",然后配置必要的设置。 您可以了解更多有关发布 Lambda 的信息。AWS 网站.

试用!

您可以通过以下方式激活 Lambda 函数:Lambda 控制台或通过 Visual Studio。