AWS LambdaでドキュメントをOCRする方法

チャクニット・ビン

2023年11月21日

更新済み 2024年12月17日

共有:

Translated

View the article in English

この記事は、IronOCRを使用してAWS Lambda関数を設定するためのステップバイステップガイドを提供します。このガイドに従うことで、IronOCRの設定方法とS3バケットに保存されているドキュメントを効率的に読み取る方法を学ぶことができます。

AWS LambdaでドキュメントをOCRする方法

ドキュメントでOCRを実行するためのC#ライブラリをダウンロード
プロジェクトテンプレートを作成して選択する
FunctionHandlerコードを修正する
プロジェクトの構成とデプロイ
関数を呼び出して、S3で結果を確認します

インストール

この記事では S3 バケットを使用するため、AWSSDK.S3 パッケージが必要です。

IronOCR ZIPを使用する場合、一時フォルダを設定することが重要です。

var awsTmpPath = @"/tmp/";
IronOcr.Installation.InstallationPath = awsTmpPath;
IronOcr.Installation.LogFilePath = awsTmpPath;

var awsTmpPath = @"/tmp/";
IronOcr.Installation.InstallationPath = awsTmpPath;
IronOcr.Installation.LogFilePath = awsTmpPath;

Dim awsTmpPath = "/tmp/"
IronOcr.Installation.InstallationPath = awsTmpPath
IronOcr.Installation.LogFilePath = awsTmpPath

$vbLabelText $csharpLabel

今日から無料トライアルでIronOCRをあなたのプロジェクトで使い始めましょう。

最初のステップ:

AWS Lambdaプロジェクトを作成する

Visual Studioを使用すれば、コンテナ化されたAWS Lambdaの作成は簡単なプロセスです。

AWS Toolkit for Visual Studioをインストールしてください
'AWS Lambda Project (.NET Core - C#)'を選択します
「.NET 8 (コンテナイメージ)」のブループリントを選択し、「終了」を選択します。

パッケージ依存関係を追加する

.NET 8でIronOCRライブラリを使用する際、AWS Lambdaでの使用には追加の依存関係をインストールする必要はありません。プロジェクトのDockerfileを次のように変更します。


FROM public.ecr.aws/lambda/dotnet:8

# 必要なパッケージをインストールする

RUN dnf update -y

作業ディレクトリ /var/task

# このCOPYコマンドは、ホストマシンからイメージ内に.NET Lambdaプロジェクトのビルドアーティファクトをコピーします。

# COPYのソースは、.NET Lambdaプロジェクトがビルドアーティファクトを公開する場所と一致する必要があります。

Lambda関数が構築されている場合

# AWS .NET Lambda ツーリングでは、`--docker-host-build-output-dir` スイッチで .NET Lambda プロジェクトの出力先を制御します。

# 構築されます。

.NET Lambda プロジェクト テンプレートはデフォルトで `--docker-host-build-output-dir` を持っています。

# aws-lambda-tools-defaults.json ファイルで "bin/Release/lambda-publish" に設定します。

もちろんです。テキストを提供してください。

# また、Dockerのマルチステージビルドを使用して、イメージ内で.NET Lambdaプロジェクトをビルドすることもできます。

# この手法についての詳細は、プロジェクトのREADME.mdファイルを確認してください。

COPY "bin/Release/lambda-publish"  .

関数ハンドラーコードの修正

この例では、S3バケットから画像を取得し、それを処理して、検索可能なPDFとして同じバケットに保存します。 IronOCR ZIPを使用する際には、DLLからランタイムフォルダをコピーするためにライブラリが書き込み権限を必要とするため、一時フォルダを設定することが不可欠です。

using Amazon.Lambda.Core;
using Amazon.S3;
using Amazon.S3.Model;
using IronOcr;

// Assembly attribute to enable the Lambda function's JSON input to be converted into a .NET class.
[assembly: LambdaSerializer(typeof(Amazon.Lambda.Serialization.SystemTextJson.DefaultLambdaJsonSerializer))]

namespace IronOcrZipAwsLambda;

public class Function
{
    private static readonly IAmazonS3 _s3Client = new AmazonS3Client(Amazon.RegionEndpoint.APSoutheast1);

    /// <param name="context">The ILambdaContext that provides methods for logging and describing the Lambda environment.</param>
    /// <returns></returns>
    public async Task FunctionHandler(ILambdaContext context)
    {
        // Set temp file
        var awsTmpPath = @"/tmp/";
        IronOcr.Installation.InstallationPath = awsTmpPath;
        IronOcr.Installation.LogFilePath = awsTmpPath;

        IronOcr.License.LicenseKey = "IRONOCR-MYLICENSE-KEY-1EF01";

        string bucketName = "deploymenttestbucket"; // Your bucket name
        string pdfName = "sample";
        string objectKey = $"IronPdfZip/{pdfName}.pdf";
        string objectKeyForSearchablePdf = $"IronPdfZip/{pdfName}-SearchablePdf.pdf";

        try
        {
            // Retrieve the PDF file from S3
            var pdfData = await GetPdfFromS3Async(bucketName, objectKey);

            IronTesseract ironTesseract = new IronTesseract();
            OcrInput ocrInput = new OcrInput();
            ocrInput.LoadPdf(pdfData);
            OcrResult result = ironTesseract.Read(ocrInput);

            // Use pdfData (byte array) as needed
            context.Logger.LogLine($"OCR result: {result.Text}");

            // Upload the PDF to S3
            await UploadPdfToS3Async(bucketName, objectKeyForSearchablePdf, result.SaveAsSearchablePdfBytes());

            context.Logger.LogLine($"PDF uploaded successfully to {bucketName}/{objectKeyForSearchablePdf}");
        }
        catch (Exception e)
        {
            context.Logger.LogLine($"[ERROR] FunctionHandler: {e.Message}");
        }
    }
    private async Task<byte[]> GetPdfFromS3Async(string bucketName, string objectKey)
    {
        var request = new GetObjectRequest
        {
            BucketName = bucketName,
            Key = objectKey
        };

        using (var response = await _s3Client.GetObjectAsync(request))
        using (var memoryStream = new MemoryStream())
        {
            await response.ResponseStream.CopyToAsync(memoryStream);
            return memoryStream.ToArray();
        }
    }

    // Function to upload the PDF file to S3
    private async Task UploadPdfToS3Async(string bucketName, string objectKey, byte[] pdfBytes)
    {
        using (var memoryStream = new MemoryStream(pdfBytes))
        {
            var request = new PutObjectRequest
            {
                BucketName = bucketName,
                Key = objectKey,
                InputStream = memoryStream,
                ContentType = "application/pdf",
            };

            await _s3Client.PutObjectAsync(request);
        }
    }
}

using Amazon.Lambda.Core;
using Amazon.S3;
using Amazon.S3.Model;
using IronOcr;

// Assembly attribute to enable the Lambda function's JSON input to be converted into a .NET class.
[assembly: LambdaSerializer(typeof(Amazon.Lambda.Serialization.SystemTextJson.DefaultLambdaJsonSerializer))]

namespace IronOcrZipAwsLambda;

public class Function
{
    private static readonly IAmazonS3 _s3Client = new AmazonS3Client(Amazon.RegionEndpoint.APSoutheast1);

    /// <param name="context">The ILambdaContext that provides methods for logging and describing the Lambda environment.</param>
    /// <returns></returns>
    public async Task FunctionHandler(ILambdaContext context)
    {
        // Set temp file
        var awsTmpPath = @"/tmp/";
        IronOcr.Installation.InstallationPath = awsTmpPath;
        IronOcr.Installation.LogFilePath = awsTmpPath;

        IronOcr.License.LicenseKey = "IRONOCR-MYLICENSE-KEY-1EF01";

        string bucketName = "deploymenttestbucket"; // Your bucket name
        string pdfName = "sample";
        string objectKey = $"IronPdfZip/{pdfName}.pdf";
        string objectKeyForSearchablePdf = $"IronPdfZip/{pdfName}-SearchablePdf.pdf";

        try
        {
            // Retrieve the PDF file from S3
            var pdfData = await GetPdfFromS3Async(bucketName, objectKey);

            IronTesseract ironTesseract = new IronTesseract();
            OcrInput ocrInput = new OcrInput();
            ocrInput.LoadPdf(pdfData);
            OcrResult result = ironTesseract.Read(ocrInput);

            // Use pdfData (byte array) as needed
            context.Logger.LogLine($"OCR result: {result.Text}");

            // Upload the PDF to S3
            await UploadPdfToS3Async(bucketName, objectKeyForSearchablePdf, result.SaveAsSearchablePdfBytes());

            context.Logger.LogLine($"PDF uploaded successfully to {bucketName}/{objectKeyForSearchablePdf}");
        }
        catch (Exception e)
        {
            context.Logger.LogLine($"[ERROR] FunctionHandler: {e.Message}");
        }
    }
    private async Task<byte[]> GetPdfFromS3Async(string bucketName, string objectKey)
    {
        var request = new GetObjectRequest
        {
            BucketName = bucketName,
            Key = objectKey
        };

        using (var response = await _s3Client.GetObjectAsync(request))
        using (var memoryStream = new MemoryStream())
        {
            await response.ResponseStream.CopyToAsync(memoryStream);
            return memoryStream.ToArray();
        }
    }

    // Function to upload the PDF file to S3
    private async Task UploadPdfToS3Async(string bucketName, string objectKey, byte[] pdfBytes)
    {
        using (var memoryStream = new MemoryStream(pdfBytes))
        {
            var request = new PutObjectRequest
            {
                BucketName = bucketName,
                Key = objectKey,
                InputStream = memoryStream,
                ContentType = "application/pdf",
            };

            await _s3Client.PutObjectAsync(request);
        }
    }
}

Imports Amazon.Lambda.Core
Imports Amazon.S3
Imports Amazon.S3.Model
Imports IronOcr

' Assembly attribute to enable the Lambda function's JSON input to be converted into a .NET class.
<Assembly: LambdaSerializer(GetType(Amazon.Lambda.Serialization.SystemTextJson.DefaultLambdaJsonSerializer))>

Namespace IronOcrZipAwsLambda

	Public Class [Function]
		Private Shared ReadOnly _s3Client As IAmazonS3 = New AmazonS3Client(Amazon.RegionEndpoint.APSoutheast1)

		''' <param name="context">The ILambdaContext that provides methods for logging and describing the Lambda environment.</param>
		''' <returns></returns>
		Public Async Function FunctionHandler(ByVal context As ILambdaContext) As Task
			' Set temp file
			Dim awsTmpPath = "/tmp/"
			IronOcr.Installation.InstallationPath = awsTmpPath
			IronOcr.Installation.LogFilePath = awsTmpPath

			IronOcr.License.LicenseKey = "IRONOCR-MYLICENSE-KEY-1EF01"

			Dim bucketName As String = "deploymenttestbucket" ' Your bucket name
			Dim pdfName As String = "sample"
			Dim objectKey As String = $"IronPdfZip/{pdfName}.pdf"
			Dim objectKeyForSearchablePdf As String = $"IronPdfZip/{pdfName}-SearchablePdf.pdf"

			Try
				' Retrieve the PDF file from S3
				Dim pdfData = Await GetPdfFromS3Async(bucketName, objectKey)

				Dim ironTesseract As New IronTesseract()
				Dim ocrInput As New OcrInput()
				ocrInput.LoadPdf(pdfData)
				Dim result As OcrResult = ironTesseract.Read(ocrInput)

				' Use pdfData (byte array) as needed
				context.Logger.LogLine($"OCR result: {result.Text}")

				' Upload the PDF to S3
				Await UploadPdfToS3Async(bucketName, objectKeyForSearchablePdf, result.SaveAsSearchablePdfBytes())

				context.Logger.LogLine($"PDF uploaded successfully to {bucketName}/{objectKeyForSearchablePdf}")
			Catch e As Exception
				context.Logger.LogLine($"[ERROR] FunctionHandler: {e.Message}")
			End Try
		End Function
		Private Async Function GetPdfFromS3Async(ByVal bucketName As String, ByVal objectKey As String) As Task(Of Byte())
			Dim request = New GetObjectRequest With {
				.BucketName = bucketName,
				.Key = objectKey
			}

			Using response = Await _s3Client.GetObjectAsync(request)
			Using memoryStream As New MemoryStream()
				Await response.ResponseStream.CopyToAsync(memoryStream)
				Return memoryStream.ToArray()
			End Using
			End Using
		End Function

		' Function to upload the PDF file to S3
		Private Async Function UploadPdfToS3Async(ByVal bucketName As String, ByVal objectKey As String, ByVal pdfBytes() As Byte) As Task
			Using memoryStream As New MemoryStream(pdfBytes)
				Dim request = New PutObjectRequest With {
					.BucketName = bucketName,
					.Key = objectKey,
					.InputStream = memoryStream,
					.ContentType = "application/pdf"
				}

				Await _s3Client.PutObjectAsync(request)
			End Using
		End Function
	End Class
End Namespace

$vbLabelText $csharpLabel

try ブロックの前に、ファイル 'sample.pdf' が IronPdfZip ディレクトリから読み取るために指定されています。 GetPdfFromS3Async メソッドは、その後にPDFバイトを取得するために使用され、そのバイトは LoadPdf メソッドに渡されます。

メモリとタイムアウトの増加

Lambda関数で割り当てられるメモリの量は、処理されるドキュメントのサイズと同時に処理されるドキュメントの数によって異なります。基準として、メモリを512 MBに設定し、タイムアウトを300秒に設定します。aws-lambda-tools-defaults.jsonに記載します。


「function-memory-size」：512

「function-timeout」: 300

メモリが不足している場合、プログラムはエラーをスローします: 'Runtime exited with error: signal: killed.' メモリサイズを増やすことでこの問題は解決できます。詳細については、トラブルシューティング記事をご参照ください：AWS Lambda - ランタイム終了シグナル: Killed。