How to OCR Documents on AWS Lambda

Chaknith Bin

November 21, 2023

Updated December 17, 2024

This how-to article is a step-by-step guide to setting up an AWS Lambda function with IronOCR. By following it, you will learn how to configure IronOCR and efficiently read documents stored in an S3 bucket.

How to OCR Documents on AWS Lambda

Download a C# library to perform OCR on documents
Create the project and choose the project template
Modify the FunctionHandler code
Configure and deploy the project
Invoke the function and check the results in S3

Setup

This article uses an S3 bucket, so the AWSSDK.S3 package is required.

If you are using IronOCR ZIP, it is essential to set the temporary folder.

var awsTmpPath = @"/tmp/";
IronOcr.Installation.InstallationPath = awsTmpPath;
IronOcr.Installation.LogFilePath = awsTmpPath;

Dim awsTmpPath = "/tmp/"
IronOcr.Installation.InstallationPath = awsTmpPath
IronOcr.Installation.LogFilePath = awsTmpPath

Create an AWS Lambda Project

With Visual Studio, creating a containerized AWS Lambda function is a straightforward process:

Install the AWS Toolkit for Visual Studio
Select an 'AWS Lambda Project (.NET Core - C#)'
Select a '.NET 8 (Container Image)' blueprint, then click 'Finish'.

Add Package Dependencies

Using the IronOCR library in .NET 8 does not require any additional dependencies for deployment on AWS Lambda. Modify the project's Dockerfile as follows:


FROM public.ecr.aws/lambda/dotnet:8

# Install necessary packages
RUN dnf update -y

WORKDIR /var/task

# This COPY command copies the .NET Lambda project's build artifacts from the host machine into the image.
# The source of the COPY should match where the .NET Lambda project publishes its build artifacts.
# If the Lambda function is being built with the AWS .NET Lambda Tooling, the `--docker-host-build-output-dir`
# switch controls where the .NET Lambda project will be built.
# The .NET Lambda project templates default to having `--docker-host-build-output-dir`
# set in the aws-lambda-tools-defaults.json file to "bin/Release/lambda-publish".
#
# Alternatively, a Docker multi-stage build could be used to build the .NET Lambda project inside the image.
# For more information on this approach, see the project's README.md file.
COPY "bin/Release/lambda-publish" .

Modify the FunctionHandler Code

This example retrieves a PDF document from an S3 bucket, runs OCR on it, and saves a searchable PDF to the same bucket. Setting the temporary folder is essential when using IronOCR ZIP, because the library needs write permission to copy the runtime folder out of the DLLs.

using Amazon.Lambda.Core;
using Amazon.S3;
using Amazon.S3.Model;
using IronOcr;

// Assembly attribute to enable the Lambda function's JSON input to be converted into a .NET class.
[assembly: LambdaSerializer(typeof(Amazon.Lambda.Serialization.SystemTextJson.DefaultLambdaJsonSerializer))]

namespace IronOcrZipAwsLambda;

public class Function
{
    private static readonly IAmazonS3 _s3Client = new AmazonS3Client(Amazon.RegionEndpoint.APSoutheast1);

    /// <param name="context">The ILambdaContext that provides methods for logging and describing the Lambda environment.</param>
    /// <returns></returns>
    public async Task FunctionHandler(ILambdaContext context)
    {
        // Lambda only permits writes under /tmp, so point IronOCR's runtime and log files there
        var awsTmpPath = @"/tmp/";
        IronOcr.Installation.InstallationPath = awsTmpPath;
        IronOcr.Installation.LogFilePath = awsTmpPath;

        IronOcr.License.LicenseKey = "IRONOCR-MYLICENSE-KEY-1EF01";

        string bucketName = "deploymenttestbucket"; // Your bucket name
        string pdfName = "sample";
        string objectKey = $"IronPdfZip/{pdfName}.pdf";
        string objectKeyForSearchablePdf = $"IronPdfZip/{pdfName}-SearchablePdf.pdf";

        try
        {
            // Retrieve the PDF file from S3
            var pdfData = await GetPdfFromS3Async(bucketName, objectKey);

            IronTesseract ironTesseract = new IronTesseract();
            OcrInput ocrInput = new OcrInput();
            ocrInput.LoadPdf(pdfData);
            OcrResult result = ironTesseract.Read(ocrInput);

            // Log the recognized text
            context.Logger.LogLine($"OCR result: {result.Text}");

            // Upload the PDF to S3
            await UploadPdfToS3Async(bucketName, objectKeyForSearchablePdf, result.SaveAsSearchablePdfBytes());

            context.Logger.LogLine($"PDF uploaded successfully to {bucketName}/{objectKeyForSearchablePdf}");
        }
        catch (Exception e)
        {
            context.Logger.LogLine($"[ERROR] FunctionHandler: {e.Message}");
        }
    }

    // Function to download the PDF file from S3 as a byte array
    private async Task<byte[]> GetPdfFromS3Async(string bucketName, string objectKey)
    {
        var request = new GetObjectRequest
        {
            BucketName = bucketName,
            Key = objectKey
        };

        using (var response = await _s3Client.GetObjectAsync(request))
        using (var memoryStream = new MemoryStream())
        {
            await response.ResponseStream.CopyToAsync(memoryStream);
            return memoryStream.ToArray();
        }
    }

    // Function to upload the PDF file to S3
    private async Task UploadPdfToS3Async(string bucketName, string objectKey, byte[] pdfBytes)
    {
        using (var memoryStream = new MemoryStream(pdfBytes))
        {
            var request = new PutObjectRequest
            {
                BucketName = bucketName,
                Key = objectKey,
                InputStream = memoryStream,
                ContentType = "application/pdf",
            };

            await _s3Client.PutObjectAsync(request);
        }
    }
}

Imports System.IO
Imports Amazon.Lambda.Core
Imports Amazon.S3
Imports Amazon.S3.Model
Imports IronOcr

' Assembly attribute to enable the Lambda function's JSON input to be converted into a .NET class.
<Assembly: LambdaSerializer(GetType(Amazon.Lambda.Serialization.SystemTextJson.DefaultLambdaJsonSerializer))>

Namespace IronOcrZipAwsLambda

	Public Class [Function]
		Private Shared ReadOnly _s3Client As IAmazonS3 = New AmazonS3Client(Amazon.RegionEndpoint.APSoutheast1)

		''' <param name="context">The ILambdaContext that provides methods for logging and describing the Lambda environment.</param>
		''' <returns></returns>
		Public Async Function FunctionHandler(ByVal context As ILambdaContext) As Task
			' Lambda only permits writes under /tmp, so point IronOCR's runtime and log files there
			Dim awsTmpPath = "/tmp/"
			IronOcr.Installation.InstallationPath = awsTmpPath
			IronOcr.Installation.LogFilePath = awsTmpPath

			IronOcr.License.LicenseKey = "IRONOCR-MYLICENSE-KEY-1EF01"

			Dim bucketName As String = "deploymenttestbucket" ' Your bucket name
			Dim pdfName As String = "sample"
			Dim objectKey As String = $"IronPdfZip/{pdfName}.pdf"
			Dim objectKeyForSearchablePdf As String = $"IronPdfZip/{pdfName}-SearchablePdf.pdf"

			Try
				' Retrieve the PDF file from S3
				Dim pdfData = Await GetPdfFromS3Async(bucketName, objectKey)

				Dim ironTesseract As New IronTesseract()
				Dim ocrInput As New OcrInput()
				ocrInput.LoadPdf(pdfData)
				Dim result As OcrResult = ironTesseract.Read(ocrInput)

				' Log the recognized text
				context.Logger.LogLine($"OCR result: {result.Text}")

				' Upload the PDF to S3
				Await UploadPdfToS3Async(bucketName, objectKeyForSearchablePdf, result.SaveAsSearchablePdfBytes())

				context.Logger.LogLine($"PDF uploaded successfully to {bucketName}/{objectKeyForSearchablePdf}")
			Catch e As Exception
				context.Logger.LogLine($"[ERROR] FunctionHandler: {e.Message}")
			End Try
		End Function

		' Function to download the PDF file from S3 as a byte array
		Private Async Function GetPdfFromS3Async(ByVal bucketName As String, ByVal objectKey As String) As Task(Of Byte())
			Dim request = New GetObjectRequest With {
				.BucketName = bucketName,
				.Key = objectKey
			}

			Using response = Await _s3Client.GetObjectAsync(request)
			Using memoryStream As New MemoryStream()
				Await response.ResponseStream.CopyToAsync(memoryStream)
				Return memoryStream.ToArray()
			End Using
			End Using
		End Function

		' Function to upload the PDF file to S3
		Private Async Function UploadPdfToS3Async(ByVal bucketName As String, ByVal objectKey As String, ByVal pdfBytes() As Byte) As Task
			Using memoryStream As New MemoryStream(pdfBytes)
				Dim request = New PutObjectRequest With {
					.BucketName = bucketName,
					.Key = objectKey,
					.InputStream = memoryStream,
					.ContentType = "application/pdf"
				}

				Await _s3Client.PutObjectAsync(request)
			End Using
		End Function
	End Class
End Namespace


Before the try block, the file 'sample.pdf' in the IronPdfZip directory is specified as the document to read. The GetPdfFromS3Async method then retrieves the PDF as a byte array, which is passed to the LoadPdf method.
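The handler above hard-codes both object keys. As a minimal sketch of how the output key could be derived from the input key instead, the helper below maps 'IronPdfZip/sample.pdf' to 'IronPdfZip/sample-SearchablePdf.pdf'. S3KeyHelper and ToSearchablePdfKey are hypothetical names introduced for illustration; they are not part of IronOCR or the AWS SDK.

```csharp
using System;

// Hypothetical helper (not part of IronOCR or the AWS SDK).
public static class S3KeyHelper
{
    // Derives the searchable-PDF key from the source key,
    // e.g. "folder/name.pdf" becomes "folder/name-SearchablePdf.pdf".
    public static string ToSearchablePdfKey(string objectKey)
    {
        if (string.IsNullOrWhiteSpace(objectKey))
            throw new ArgumentException("Object key must not be empty.", nameof(objectKey));

        int lastSlash = objectKey.LastIndexOf('/');
        int lastDot = objectKey.LastIndexOf('.');

        // Treat the dot as an extension separator only if it sits in the final key segment.
        string stem = lastDot > lastSlash ? objectKey.Substring(0, lastDot) : objectKey;

        return stem + "-SearchablePdf.pdf";
    }
}
```

In the handler, objectKeyForSearchablePdf could then be assigned from S3KeyHelper.ToSearchablePdfKey(objectKey), so that only one key has to be changed when processing a different file.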

Increase Memory and Timeout

The amount of memory a Lambda function needs depends on the size of the documents being processed and on how many are processed concurrently. As a baseline, set the memory to 512 MB and the timeout to 300 seconds in aws-lambda-tools-defaults.json.


"function-memory-size" : 512,
"function-timeout" : 300

If the memory is insufficient, the function fails with the error 'Runtime exited with error: signal: killed'. Increasing the memory size can resolve this issue. For more details, see the troubleshooting article: AWS Lambda - Runtime Exited Signal: Killed.
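When tuning these two settings, it can help to log the limits the function is actually running with. The sketch below is one possible way to do that, not something taken from the IronOCR documentation: LambdaDiagnostics and LogResourceBudget are hypothetical names, while MemoryLimitInMB and RemainingTime are properties of ILambdaContext and Environment.WorkingSet is a standard .NET property.

```csharp
using System;
using Amazon.Lambda.Core;

// Hypothetical diagnostics helper; call it from FunctionHandler before and after the OCR step.
public static class LambdaDiagnostics
{
    public static void LogResourceBudget(ILambdaContext context, string stage)
    {
        // Configured memory for this function ("function-memory-size").
        context.Logger.LogLine($"[{stage}] Memory limit: {context.MemoryLimitInMB} MB");

        // Time left before Lambda stops the invocation ("function-timeout").
        context.Logger.LogLine($"[{stage}] Time remaining: {context.RemainingTime.TotalSeconds:F0} s");

        // Physical memory currently mapped to the process, including native allocations.
        long workingSetMb = Environment.WorkingSet / (1024 * 1024);
        context.Logger.LogLine($"[{stage}] Working set: {workingSetMb} MB");
    }
}
```

If the working set logged after OCR approaches the memory limit, that suggests raising "function-memory-size" before the runtime is killed on larger documents.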

Publish

To publish from Visual Studio, right-click the project and select 'Publish to AWS Lambda...', then configure the necessary settings. More information about publishing a Lambda function is available on the AWS website.

Try It Out!

You can invoke the Lambda function either through the Lambda console or through Visual Studio.
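After invoking the function, the result can be checked in the S3 console, or programmatically. The following is a rough sketch of the programmatic route using the same AWSSDK.S3 package; OcrOutputCheck and SearchablePdfExistsAsync are hypothetical names, and the bucket name and key are whatever values the handler was configured with.

```csharp
using System;
using System.Net;
using System.Threading.Tasks;
using Amazon.S3;
using Amazon.S3.Model;

// Hypothetical helper for verifying that the searchable PDF was written to S3.
public static class OcrOutputCheck
{
    public static async Task<bool> SearchablePdfExistsAsync(
        IAmazonS3 s3Client, string bucketName, string objectKey)
    {
        try
        {
            // Fetches only the object's metadata, not its contents.
            GetObjectMetadataResponse metadata =
                await s3Client.GetObjectMetadataAsync(bucketName, objectKey);

            Console.WriteLine($"{objectKey}: {metadata.ContentLength} bytes, last modified {metadata.LastModified}");
            return metadata.ContentLength > 0;
        }
        catch (AmazonS3Exception e) when (e.StatusCode == HttpStatusCode.NotFound)
        {
            // The object does not exist, so the function has not produced output yet.
            return false;
        }
    }
}
```

With the values from the handler above, this would be called with "deploymenttestbucket" and "IronPdfZip/sample-SearchablePdf.pdf".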

Chaknith Bin

Software Engineer

Chaknith works on IronXL and IronBarcode. He has deep expertise in C# and .NET, helping to improve the software and support customers. His insights from user interactions contribute to better products, documentation, and overall experience.