如何在 AWS Lambda 上識別文件 OCR

This article was translated from English: Does it need improvement?
Translated
View the article in English
Amazon Lambda Architecture Logo related to 如何在 AWS Lambda 上識別文件 OCR

本文提供了使用 IronOCR 設定 AWS Lambda 函數的逐步指南。 透過本指南,您將學習如何配置 IronOCR 並有效率地讀取儲存在 S3 儲存桶中的文件。

安裝

本文將使用 S3 儲存桶,因此需要AWSSDK.S3包。

如果您使用的是 IronOCR ZIP,則必須設定臨時資料夾。

// Set temporary folder path and log file path for IronOCR.
var awsTmpPath = @"/tmp/";
IronOcr.Installation.InstallationPath = awsTmpPath;
IronOcr.Installation.LogFilePath = awsTmpPath;
// Set temporary folder path and log file path for IronOCR.
var awsTmpPath = @"/tmp/";
IronOcr.Installation.InstallationPath = awsTmpPath;
IronOcr.Installation.LogFilePath = awsTmpPath;
' Set temporary folder path and log file path for IronOCR.
Dim awsTmpPath = "/tmp/"
IronOcr.Installation.InstallationPath = awsTmpPath
IronOcr.Installation.LogFilePath = awsTmpPath
$vbLabelText   $csharpLabel

!{--01001100010010010100001001010010010000010101001001011001010111110101001101010100010001010101010 10100010111110101010001010010010010010100000101001100010111110100001001001100010011111010000100100110001001111010101

建立 AWS Lambda 項目

使用 Visual Studio,建立容器化的 AWS Lambda 函數非常簡單:

選擇容器影像

新增包依賴項

在 .NET 8 中使用 IronOCR 程式庫不需要安裝額外的相依性即可在 AWS Lambda 上使用。 修改專案的 Dockerfile 文件,新增以下內容:

FROM public.ecr.aws/lambda/dotnet:8

# Update all installed packages
RUN dnf update -y

WORKDIR /var/task

# Copy build artifacts from the host machine into the Docker image
COPY "bin/Release/lambda-publish" .

修改函數處理程序程式碼

此範例從 S3 儲存桶中檢索影像,對其進行處理,並將可搜尋的 PDF 保存回同一個儲存桶。 使用 IronOCR ZIP 時,設定臨時資料夾至關重要,因為該程式庫需要寫入權限才能從 DLL 複製執行時間資料夾。

using Amazon.Lambda.Core;
using Amazon.S3;
using Amazon.S3.Model;
using IronOcr;
using System;
using System.IO;
using System.Threading.Tasks;

// Assembly attribute to enable the Lambda function's JSON input to be converted into a .NET class.
[assembly: LambdaSerializer(typeof(Amazon.Lambda.Serialization.SystemTextJson.DefaultLambdaJsonSerializer))]

namespace IronOcrZipAwsLambda
{
    public class Function
    {
        // Initialize the S3 client with a specific region endpoint
        private static readonly IAmazonS3 _s3Client = new AmazonS3Client(Amazon.RegionEndpoint.APSoutheast1);

        /// <summary>
        /// Function handler to process OCR on the PDF stored in S3.
        /// </summary>
        /// <param name="context">The ILambdaContext that provides methods for logging and describing the Lambda environment.</param>
        public async Task FunctionHandler(ILambdaContext context)
        {
            // Set up necessary paths for IronOCR
            var awsTmpPath = @"/tmp/";
            IronOcr.Installation.InstallationPath = awsTmpPath;
            IronOcr.Installation.LogFilePath = awsTmpPath;

            // Set license key for IronOCR
            IronOcr.License.LicenseKey = "IRONOCR-MYLICENSE-KEY-1EF01";

            string bucketName = "deploymenttestbucket"; // Your bucket name
            string pdfName = "sample";
            string objectKey = $"IronPdfZip/{pdfName}.pdf";
            string objectKeyForSearchablePdf = $"IronPdfZip/{pdfName}-SearchablePdf.pdf";

            try
            {
                // Retrieve the PDF file from S3
                var pdfData = await GetPdfFromS3Async(bucketName, objectKey);

                // Initialize IronTesseract for OCR processing
                IronTesseract ironTesseract = new IronTesseract();
                OcrInput ocrInput = new OcrInput();
                ocrInput.LoadPdf(pdfData);
                OcrResult result = ironTesseract.Read(ocrInput);

                // Log the OCR result
                context.Logger.LogLine($"OCR result: {result.Text}");

                // Upload the searchable PDF to S3
                await UploadPdfToS3Async(bucketName, objectKeyForSearchablePdf, result.SaveAsSearchablePdfBytes());
                context.Logger.LogLine($"PDF uploaded successfully to {bucketName}/{objectKeyForSearchablePdf}");
            }
            catch (Exception e)
            {
                context.Logger.LogLine($"[ERROR] FunctionHandler: {e.Message}");
            }
        }

        /// <summary>
        /// Retrieves a PDF from S3 and returns it as a byte array.
        /// </summary>
        private async Task<byte[]> GetPdfFromS3Async(string bucketName, string objectKey)
        {
            var request = new GetObjectRequest
            {
                BucketName = bucketName,
                Key = objectKey
            };

            using (var response = await _s3Client.GetObjectAsync(request))
            using (var memoryStream = new MemoryStream())
            {
                await response.ResponseStream.CopyToAsync(memoryStream);
                return memoryStream.ToArray();
            }
        }

        /// <summary>
        /// Uploads the generated searchable PDF back to S3.
        /// </summary>
        private async Task UploadPdfToS3Async(string bucketName, string objectKey, byte[] pdfBytes)
        {
            using (var memoryStream = new MemoryStream(pdfBytes))
            {
                var request = new PutObjectRequest
                {
                    BucketName = bucketName,
                    Key = objectKey,
                    InputStream = memoryStream,
                    ContentType = "application/pdf"
                };

                await _s3Client.PutObjectAsync(request);
            }
        }
    }
}
using Amazon.Lambda.Core;
using Amazon.S3;
using Amazon.S3.Model;
using IronOcr;
using System;
using System.IO;
using System.Threading.Tasks;

// Assembly attribute to enable the Lambda function's JSON input to be converted into a .NET class.
[assembly: LambdaSerializer(typeof(Amazon.Lambda.Serialization.SystemTextJson.DefaultLambdaJsonSerializer))]

namespace IronOcrZipAwsLambda
{
    public class Function
    {
        // Initialize the S3 client with a specific region endpoint
        private static readonly IAmazonS3 _s3Client = new AmazonS3Client(Amazon.RegionEndpoint.APSoutheast1);

        /// <summary>
        /// Function handler to process OCR on the PDF stored in S3.
        /// </summary>
        /// <param name="context">The ILambdaContext that provides methods for logging and describing the Lambda environment.</param>
        public async Task FunctionHandler(ILambdaContext context)
        {
            // Set up necessary paths for IronOCR
            var awsTmpPath = @"/tmp/";
            IronOcr.Installation.InstallationPath = awsTmpPath;
            IronOcr.Installation.LogFilePath = awsTmpPath;

            // Set license key for IronOCR
            IronOcr.License.LicenseKey = "IRONOCR-MYLICENSE-KEY-1EF01";

            string bucketName = "deploymenttestbucket"; // Your bucket name
            string pdfName = "sample";
            string objectKey = $"IronPdfZip/{pdfName}.pdf";
            string objectKeyForSearchablePdf = $"IronPdfZip/{pdfName}-SearchablePdf.pdf";

            try
            {
                // Retrieve the PDF file from S3
                var pdfData = await GetPdfFromS3Async(bucketName, objectKey);

                // Initialize IronTesseract for OCR processing
                IronTesseract ironTesseract = new IronTesseract();
                OcrInput ocrInput = new OcrInput();
                ocrInput.LoadPdf(pdfData);
                OcrResult result = ironTesseract.Read(ocrInput);

                // Log the OCR result
                context.Logger.LogLine($"OCR result: {result.Text}");

                // Upload the searchable PDF to S3
                await UploadPdfToS3Async(bucketName, objectKeyForSearchablePdf, result.SaveAsSearchablePdfBytes());
                context.Logger.LogLine($"PDF uploaded successfully to {bucketName}/{objectKeyForSearchablePdf}");
            }
            catch (Exception e)
            {
                context.Logger.LogLine($"[ERROR] FunctionHandler: {e.Message}");
            }
        }

        /// <summary>
        /// Retrieves a PDF from S3 and returns it as a byte array.
        /// </summary>
        private async Task<byte[]> GetPdfFromS3Async(string bucketName, string objectKey)
        {
            var request = new GetObjectRequest
            {
                BucketName = bucketName,
                Key = objectKey
            };

            using (var response = await _s3Client.GetObjectAsync(request))
            using (var memoryStream = new MemoryStream())
            {
                await response.ResponseStream.CopyToAsync(memoryStream);
                return memoryStream.ToArray();
            }
        }

        /// <summary>
        /// Uploads the generated searchable PDF back to S3.
        /// </summary>
        private async Task UploadPdfToS3Async(string bucketName, string objectKey, byte[] pdfBytes)
        {
            using (var memoryStream = new MemoryStream(pdfBytes))
            {
                var request = new PutObjectRequest
                {
                    BucketName = bucketName,
                    Key = objectKey,
                    InputStream = memoryStream,
                    ContentType = "application/pdf"
                };

                await _s3Client.PutObjectAsync(request);
            }
        }
    }
}
Imports Amazon.Lambda.Core
Imports Amazon.S3
Imports Amazon.S3.Model
Imports IronOcr
Imports System
Imports System.IO
Imports System.Threading.Tasks

' Assembly attribute to enable the Lambda function's JSON input to be converted into a .NET class.
<Assembly: LambdaSerializer(GetType(Amazon.Lambda.Serialization.SystemTextJson.DefaultLambdaJsonSerializer))>

Namespace IronOcrZipAwsLambda
	Public Class [Function]
		' Initialize the S3 client with a specific region endpoint
		Private Shared ReadOnly _s3Client As IAmazonS3 = New AmazonS3Client(Amazon.RegionEndpoint.APSoutheast1)

		''' <summary>
		''' Function handler to process OCR on the PDF stored in S3.
		''' </summary>
		''' <param name="context">The ILambdaContext that provides methods for logging and describing the Lambda environment.</param>
		Public Async Function FunctionHandler(ByVal context As ILambdaContext) As Task
			' Set up necessary paths for IronOCR
			Dim awsTmpPath = "/tmp/"
			IronOcr.Installation.InstallationPath = awsTmpPath
			IronOcr.Installation.LogFilePath = awsTmpPath

			' Set license key for IronOCR
			IronOcr.License.LicenseKey = "IRONOCR-MYLICENSE-KEY-1EF01"

			Dim bucketName As String = "deploymenttestbucket" ' Your bucket name
			Dim pdfName As String = "sample"
			Dim objectKey As String = $"IronPdfZip/{pdfName}.pdf"
			Dim objectKeyForSearchablePdf As String = $"IronPdfZip/{pdfName}-SearchablePdf.pdf"

			Try
				' Retrieve the PDF file from S3
				Dim pdfData = Await GetPdfFromS3Async(bucketName, objectKey)

				' Initialize IronTesseract for OCR processing
				Dim ironTesseract As New IronTesseract()
				Dim ocrInput As New OcrInput()
				ocrInput.LoadPdf(pdfData)
				Dim result As OcrResult = ironTesseract.Read(ocrInput)

				' Log the OCR result
				context.Logger.LogLine($"OCR result: {result.Text}")

				' Upload the searchable PDF to S3
				Await UploadPdfToS3Async(bucketName, objectKeyForSearchablePdf, result.SaveAsSearchablePdfBytes())
				context.Logger.LogLine($"PDF uploaded successfully to {bucketName}/{objectKeyForSearchablePdf}")
			Catch e As Exception
				context.Logger.LogLine($"[ERROR] FunctionHandler: {e.Message}")
			End Try
		End Function

		''' <summary>
		''' Retrieves a PDF from S3 and returns it as a byte array.
		''' </summary>
		Private Async Function GetPdfFromS3Async(ByVal bucketName As String, ByVal objectKey As String) As Task(Of Byte())
			Dim request = New GetObjectRequest With {
				.BucketName = bucketName,
				.Key = objectKey
			}

			Using response = Await _s3Client.GetObjectAsync(request)
			Using memoryStream As New MemoryStream()
				Await response.ResponseStream.CopyToAsync(memoryStream)
				Return memoryStream.ToArray()
			End Using
			End Using
		End Function

		''' <summary>
		''' Uploads the generated searchable PDF back to S3.
		''' </summary>
		Private Async Function UploadPdfToS3Async(ByVal bucketName As String, ByVal objectKey As String, ByVal pdfBytes() As Byte) As Task
			Using memoryStream As New MemoryStream(pdfBytes)
				Dim request = New PutObjectRequest With {
					.BucketName = bucketName,
					.Key = objectKey,
					.InputStream = memoryStream,
					.ContentType = "application/pdf"
				}

				Await _s3Client.PutObjectAsync(request)
			End Using
		End Function
	End Class
End Namespace
$vbLabelText   $csharpLabel

在 try 程式碼區塊之前,指定要從 IronPdfZip 目錄讀取的檔案為"sample.pdf"。 然後使用GetPdfFromS3Async方法檢索 PDF 位元組,並將其傳遞給LoadPdf方法。

增加記憶體和超時

Lambda 函數中分配的記憶體量將根據正在處理的文件的大小和同時處理的文件數量而變化。 作為基準,在aws-lambda-tools-defaults.json中將記憶體設為 512 MB,逾時設為 300 秒。

{
    "function-memory-size": 512,
    "function-timeout": 300
}

當記憶體不足時,程式會拋出錯誤:"運行時錯誤退出:訊號:終止"。增加記憶體大小可以解決此問題。 如需更多詳細信息,請參閱故障排除文章: AWS Lambda - 運行時退出訊號:已終止

發布

若要在 Visual Studio 中發布,請右鍵按一下專案並選擇"發佈到 AWS Lambda...",然後配置必要的設定。 您可以在AWS 網站上閱讀更多關於發布 Lambda 函數的資訊。

試試看!

您可以透過Lambda 控制台或 Visual Studio 啟動 Lambda 函數。

常見問題解答

如何在 AWS 上使用 C# 對文件進行 OCR?

您可以通過將其與 AWS Lambda 結合使用 IronOCR 對存儲在 Amazon S3 存儲桶中的文件進行 OCR。這涉及在 C# 中創建一個 Lambda 函數,從 S3 檢索文件,使用 IronOCR 處理它們,然後將結果上傳回 S3。

在AWS Lambda上使用C#設置OCR的步驟有哪些?

要在 AWS Lambda 上使用 C# 設置 OCR,您需要下載 IronOCR 庫,在 Visual Studio 中創建一個 AWS Lambda 項目,配置您的函數處理程序以使用 IronOCR 進行處理,並部署您的函數。此設置使您能夠將圖像轉換為可搜索的 PDF。

運行 OCR 在 AWS Lambda 的推薦配置是什麼?

為了在 AWS Lambda 中運行 IronOCR 時獲得最佳性能,建議分配至少 512 MB 的內存以及 300 秒的超時期。這些設置有助於管理大型或多個文件的處理。

如何處理 AWS Lambda 中的 'Runtime exited with error: signal: killed'?

此錯誤通常表示您的 Lambda 函數已經用盡了分配的內存。增大 Lambda 函數的內存配置可以解決這個問題,特別是在使用 IronOCR 處理大文件時。

我可以在部署前本地測試我的 AWS Lambda OCR 函數嗎?

可以,您可以使用 Visual Studio 的 AWS 工具包本地測試您的 AWS Lambda OCR 函數。此工具包提供了一個本地環境來模擬 Lambda 執行,使您在部署前能夠調試和改進函數。

Dockerfile 在 AWS Lambda 項目中的目的是什麼?

AWS Lambda 項目中的 Dockerfile 用於創建一個容器映像,該映像定義了 Lambda 函數的執行環境和依賴項。這可確保您的函數在 AWS 中正常運行所需的所有組件。

在 AWS Lambda 上使用 .NET 8 的 IronOCR 是否需要任何其他依賴項?

除了 IronOCR 庫和必要的 AWS SDK 包之外,AWS Lambda 上的 .NET 8不需要額外的依賴項。這簡化了執行 OCR 任務的集成過程。

將 C# OCR 與 AWS Lambda 集成的前提條件是什麼?

在將 C# OCR 集成到 AWS Lambda 之前,您需要安裝 AWS SDK for S3、IronOCR 庫和 Visual Studio 的 AWS 工具包。您還需要一個配置好的 S3 存儲桶來存儲和檢索文件。

Curtis Chau
技術撰稿人

Curtis Chau 擁有電腦科學學士學位(卡爾頓大學),專長於前端開發,精通 Node.js、TypeScript、JavaScript 和 React。Curtis 對製作直覺且美觀的使用者介面充滿熱情,他喜歡使用現代化的架構,並製作結構良好且視覺上吸引人的手冊。

除了開發之外,Curtis 對物聯網 (IoT) 也有濃厚的興趣,他喜歡探索整合硬體與軟體的創新方式。在空閒時間,他喜歡玩遊戲和建立 Discord bots,將他對技術的熱愛與創意結合。

準備好開始了嗎?
Nuget 下載 5,384,824 | 版本: 2026.2 剛剛發布