How to OCR Documents on AWS Lambda

Chaknith Bin
November 21, 2023
Updated December 17, 2024

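This how-to article is a step-by-step guide to setting up an AWS Lambda function with IronOCR. By following it, you will learn how to configure IronOCR and read documents stored in an S3 bucket.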

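Installation

This article uses an S3 bucket, so the AWSSDK.S3 package is required.

If you are using IronOCR ZIP, you must point the temporary folder at a writable location, which on Lambda is /tmp.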

var awsTmpPath = @"/tmp/";
IronOcr.Installation.InstallationPath = awsTmpPath;
IronOcr.Installation.LogFilePath = awsTmpPath;

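Create an AWS Lambda Project

With Visual Studio, creating a containerized AWS Lambda function is straightforward:

  • Install the AWS Toolkit for Visual Studio
  • Select "AWS Lambda Project (.NET Core - C#)"
  • Select the ".NET 8 (Container Image)" blueprint, then select "Finish"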

Add Package Dependencies

On .NET 8, IronOCR needs no additional dependencies to run on AWS Lambda. Change the project's Dockerfile to the following:


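FROM public.ecr.aws/lambda/dotnet:8

# Install necessary packages
RUN dnf update -y

WORKDIR /var/task

# This COPY command copies the .NET Lambda project's build artifacts from the host machine into the image.
# The source of the COPY should match where the .NET Lambda project publishes its build artifacts. If the Lambda function is being built
# with the AWS .NET Lambda Tooling, the `--docker-host-build-output-dir` switch controls where the .NET Lambda project
# will be built. The .NET Lambda project templates default to having `--docker-host-build-output-dir`
# set in the aws-lambda-tools-defaults.json file to "bin/Release/lambda-publish".
#
# Alternatively, a Docker multi-stage build could be used to build the .NET Lambda project inside the image.
# For more information on this approach, check out the project's README.md file.
COPY "bin/Release/lambda-publish"  .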

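Modify the FunctionHandler Code

This example retrieves a PDF from an S3 bucket, runs OCR on it, and saves a searchable PDF back to the same bucket. Setting the temporary folder is required when using IronOCR ZIP, because the library needs write permission to copy its runtime folder out of the DLLs.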

using Amazon.Lambda.Core;
using Amazon.S3;
using Amazon.S3.Model;
using IronOcr;

// Assembly attribute to enable the Lambda function's JSON input to be converted into a .NET class.
[assembly: LambdaSerializer(typeof(Amazon.Lambda.Serialization.SystemTextJson.DefaultLambdaJsonSerializer))]

namespace IronOcrZipAwsLambda;

public class Function
{
    private static readonly IAmazonS3 _s3Client = new AmazonS3Client(Amazon.RegionEndpoint.APSoutheast1);

    /// <summary>
    /// Downloads a PDF from S3, runs OCR on it, and uploads a searchable PDF to the same bucket.
    /// </summary>
    /// <param name="context">The ILambdaContext that provides methods for logging and describing the Lambda environment.</param>
    public async Task FunctionHandler(ILambdaContext context)
    {
        // IronOCR needs a writable folder; on Lambda that is /tmp
        var awsTmpPath = @"/tmp/";
        IronOcr.Installation.InstallationPath = awsTmpPath;
        IronOcr.Installation.LogFilePath = awsTmpPath;

        IronOcr.License.LicenseKey = "IRONOCR-MYLICENSE-KEY-1EF01";

        string bucketName = "deploymenttestbucket"; // Your bucket name
        string pdfName = "sample";
        string objectKey = $"IronPdfZip/{pdfName}.pdf";
        string objectKeyForSearchablePdf = $"IronPdfZip/{pdfName}-SearchablePdf.pdf";

        try
        {
            // Retrieve the PDF file from S3
            var pdfData = await GetPdfFromS3Async(bucketName, objectKey);

            IronTesseract ironTesseract = new IronTesseract();
            OcrInput ocrInput = new OcrInput();
            ocrInput.LoadPdf(pdfData);
            OcrResult result = ironTesseract.Read(ocrInput);

            // Log the recognized text
            context.Logger.LogLine($"OCR result: {result.Text}");

            // Upload the searchable PDF to S3
            await UploadPdfToS3Async(bucketName, objectKeyForSearchablePdf, result.SaveAsSearchablePdfBytes());

            context.Logger.LogLine($"PDF uploaded successfully to {bucketName}/{objectKeyForSearchablePdf}");
        }
        catch (Exception e)
        {
            context.Logger.LogLine($"[ERROR] FunctionHandler: {e.Message}");
        }
    }

    // Function to download the PDF file from S3
    private async Task<byte[]> GetPdfFromS3Async(string bucketName, string objectKey)
    {
        var request = new GetObjectRequest
        {
            BucketName = bucketName,
            Key = objectKey
        };

        using (var response = await _s3Client.GetObjectAsync(request))
        using (var memoryStream = new MemoryStream())
        {
            await response.ResponseStream.CopyToAsync(memoryStream);
            return memoryStream.ToArray();
        }
    }

    // Function to upload the PDF file to S3
    private async Task UploadPdfToS3Async(string bucketName, string objectKey, byte[] pdfBytes)
    {
        using (var memoryStream = new MemoryStream(pdfBytes))
        {
            var request = new PutObjectRequest
            {
                BucketName = bucketName,
                Key = objectKey,
                InputStream = memoryStream,
                ContentType = "application/pdf",
            };

            await _s3Client.PutObjectAsync(request);
        }
    }
}

Before the try block, the handler names 'sample.pdf' in the IronPdfZip directory as the file to read. The GetPdfFromS3Async method then retrieves the PDF bytes, which are passed to the LoadPdf method.
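
The object keys in the handler are hardcoded for 'sample.pdf'. If you want to process other files, one option is to derive the output key from the input key. The following is only a minimal sketch: S3KeyHelper and ToSearchablePdfKey are hypothetical names, not part of IronOCR or the AWS SDK, and the "-SearchablePdf" suffix simply mirrors the naming used in the handler above.

```csharp
using System;

// Hypothetical helper (not part of IronOCR or the AWS SDK): derives the key of the
// searchable PDF from the key of the source PDF, mirroring the naming used above.
public static class S3KeyHelper
{
    public static string ToSearchablePdfKey(string sourceKey)
    {
        if (string.IsNullOrWhiteSpace(sourceKey))
            throw new ArgumentException("Source key must not be empty.", nameof(sourceKey));

        const string extension = ".pdf";
        const string suffix = "-SearchablePdf";

        // Strip a trailing ".pdf" (case-insensitive) if present, then append the suffix.
        string stem = sourceKey.EndsWith(extension, StringComparison.OrdinalIgnoreCase)
            ? sourceKey.Substring(0, sourceKey.Length - extension.Length)
            : sourceKey;

        return stem + suffix + extension;
    }
}
```

With this helper, S3KeyHelper.ToSearchablePdfKey("IronPdfZip/sample.pdf") returns "IronPdfZip/sample-SearchablePdf.pdf", the same key the handler builds by hand.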

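Increase Memory and Timeout

The amount of memory the Lambda function needs depends on the size of the documents being processed and on how many are processed at once. As a baseline, set the memory to 512 MB and the timeout to 300 seconds in aws-lambda-tools-defaults.json.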


"function-memory-size": 512,
"function-timeout": 300

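When memory is insufficient, the program fails with the error "Runtime exited with error: signal: killed". Increasing the memory size resolves this issue. For more details, see the troubleshooting article: AWS Lambda - Runtime Exited Signal: Killed.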
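
To confirm that the configured memory and timeout are actually in effect, it can help to log them at the start of an invocation. The sketch below only formats the two values: LambdaBudget and Describe are hypothetical names introduced for illustration, while MemoryLimitInMB and RemainingTime are properties of ILambdaContext.

```csharp
using System;

// Hypothetical helper: formats the memory limit and the remaining execution time so
// they can be written to the Lambda log. It takes plain values, so it has no AWS dependency.
public static class LambdaBudget
{
    public static string Describe(int memoryLimitInMb, TimeSpan remainingTime)
    {
        // Truncate to whole seconds to keep the log line short.
        long seconds = (long)remainingTime.TotalSeconds;
        return $"Memory limit: {memoryLimitInMb} MB, time remaining: {seconds} s";
    }
}

// Inside FunctionHandler (assumes the ILambdaContext parameter is named `context`):
// context.Logger.LogLine(LambdaBudget.Describe(context.MemoryLimitInMB, context.RemainingTime));
```

After each invocation, Lambda also writes a REPORT line to CloudWatch Logs that includes the maximum memory used, which is a practical guide when adjusting the baseline above.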

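Publish

To publish from Visual Studio, right-click the project, select "Publish to AWS Lambda...", and configure the necessary settings. You can read more about publishing a Lambda function on the AWS website.

Try It Out!

You can invoke the Lambda function from the Lambda console or from Visual Studio.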

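Chaknith Bin
Software Engineer
Chaknith works on IronXL and IronBarcode. He has deep expertise in C# and .NET, helping to improve the software and support customers. The insights he gains from user interactions contribute to better products, documentation, and overall experience.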