Cómo realizar OCR de documentos en AWS Lambda

Chaknith Bin

21 de noviembre, 2023

Actualizado 17 de diciembre, 2024

Translated

View the article in English

Este artículo de instrucciones proporciona una guía paso a paso para configurar una función de AWS Lambda utilizando IronOCR. Al seguir esta guía, aprenderás a configurar IronOCR y a leer eficientemente documentos almacenados en un bucket S3.

Cómo realizar OCR de documentos en AWS Lambda

Descargar una biblioteca C# para realizar OCR en documentos
Crear y elegir la plantilla de proyecto
Modificar el código de FunctionHandler
Configurar y desplegar el proyecto
invoca la función y verifica los resultados en S3

Instalación

Este artículo utilizará un bucket S3, por lo que se requiere el paquete AWSSDK.S3.

Si está utilizando IronOCR ZIP, es esencial configurar la carpeta temporal.

var awsTmpPath = @"/tmp/";
IronOcr.Installation.InstallationPath = awsTmpPath;
IronOcr.Installation.LogFilePath = awsTmpPath;

var awsTmpPath = @"/tmp/";
IronOcr.Installation.InstallationPath = awsTmpPath;
IronOcr.Installation.LogFilePath = awsTmpPath;

Dim awsTmpPath = "/tmp/"
IronOcr.Installation.InstallationPath = awsTmpPath
IronOcr.Installation.LogFilePath = awsTmpPath

$vbLabelText $csharpLabel

Comience a usar IronOCR en su proyecto hoy con una prueba gratuita.

Primer Paso:

Crear un proyecto AWS Lambda

Con Visual Studio, crear un AWS Lambda con contenedor es un proceso sencillo:

Instala el AWS Toolkit for Visual Studio
Seleccione un 'Proyecto AWS Lambda (.NET Core - C#)'
Seleccione un plano de " .NET 8 (Container Image)", luego seleccione "Finalizar".

Agregar dependencias de paquetes

El uso de la biblioteca IronOCR en .NET 8 no requiere la instalación de dependencias adicionales para su uso en AWS Lambda. Modifica el Dockerfile del proyecto con lo siguiente:


FROM public.ecr.aws/lambda/dotnet:8

# instalar los paquetes necesarios

RUN dnf update -y

WORKDIR /var/task

# Este comando COPY copia los artefactos de construcción del proyecto Lambda de .NET desde la máquina anfitriona hasta la imagen.

# El origen de la COPIA debe coincidir con el lugar donde el proyecto .NET Lambda publica sus artefactos de construcción.

Si se está construyendo la función Lambda

# con el AWS .NET Lambda Tooling, el interruptor `--docker-host-build-output-dir` controla dónde el proyecto .NET Lambda

# se construirá.

Los plantillas de proyecto Lambda de .NET tienen por defecto `--docker-host-build-output-dir`

# establecer en el archivo aws-lambda-tools-defaults.json a "bin/Release/lambda-publish".

#

# Alternativamente, se podría utilizar una compilación multietapa de Docker para construir el proyecto .NET Lambda dentro de la imagen.

# Para obtener más información sobre este enfoque, consulta el archivo README.md del proyecto.

COPIAR "bin/Release/lambda-publish" .

Modificar el código de FunctionHandler

Este ejemplo recupera una imagen de un bucket S3, la procesa y guarda un PDF con capacidad de búsqueda de vuelta en el mismo bucket. Configurar la carpeta temporal es esencial al usar IronOCR ZIP, ya que la biblioteca requiere permisos de escritura para copiar la carpeta de tiempo de ejecución desde los DLL.

using Amazon.Lambda.Core;
using Amazon.S3;
using Amazon.S3.Model;
using IronOcr;

// Assembly attribute to enable the Lambda function's JSON input to be converted into a .NET class.
[assembly: LambdaSerializer(typeof(Amazon.Lambda.Serialization.SystemTextJson.DefaultLambdaJsonSerializer))]

namespace IronOcrZipAwsLambda;

public class Function
{
    private static readonly IAmazonS3 _s3Client = new AmazonS3Client(Amazon.RegionEndpoint.APSoutheast1);

    /// <param name="context">The ILambdaContext that provides methods for logging and describing the Lambda environment.</param>
    /// <returns></returns>
    public async Task FunctionHandler(ILambdaContext context)
    {
        // Set temp file
        var awsTmpPath = @"/tmp/";
        IronOcr.Installation.InstallationPath = awsTmpPath;
        IronOcr.Installation.LogFilePath = awsTmpPath;

        IronOcr.License.LicenseKey = "IRONOCR-MYLICENSE-KEY-1EF01";

        string bucketName = "deploymenttestbucket"; // Your bucket name
        string pdfName = "sample";
        string objectKey = $"IronPdfZip/{pdfName}.pdf";
        string objectKeyForSearchablePdf = $"IronPdfZip/{pdfName}-SearchablePdf.pdf";

        try
        {
            // Retrieve the PDF file from S3
            var pdfData = await GetPdfFromS3Async(bucketName, objectKey);

            IronTesseract ironTesseract = new IronTesseract();
            OcrInput ocrInput = new OcrInput();
            ocrInput.LoadPdf(pdfData);
            OcrResult result = ironTesseract.Read(ocrInput);

            // Use pdfData (byte array) as needed
            context.Logger.LogLine($"OCR result: {result.Text}");

            // Upload the PDF to S3
            await UploadPdfToS3Async(bucketName, objectKeyForSearchablePdf, result.SaveAsSearchablePdfBytes());

            context.Logger.LogLine($"PDF uploaded successfully to {bucketName}/{objectKeyForSearchablePdf}");
        }
        catch (Exception e)
        {
            context.Logger.LogLine($"[ERROR] FunctionHandler: {e.Message}");
        }
    }
    private async Task<byte[]> GetPdfFromS3Async(string bucketName, string objectKey)
    {
        var request = new GetObjectRequest
        {
            BucketName = bucketName,
            Key = objectKey
        };

        using (var response = await _s3Client.GetObjectAsync(request))
        using (var memoryStream = new MemoryStream())
        {
            await response.ResponseStream.CopyToAsync(memoryStream);
            return memoryStream.ToArray();
        }
    }

    // Function to upload the PDF file to S3
    private async Task UploadPdfToS3Async(string bucketName, string objectKey, byte[] pdfBytes)
    {
        using (var memoryStream = new MemoryStream(pdfBytes))
        {
            var request = new PutObjectRequest
            {
                BucketName = bucketName,
                Key = objectKey,
                InputStream = memoryStream,
                ContentType = "application/pdf",
            };

            await _s3Client.PutObjectAsync(request);
        }
    }
}

using Amazon.Lambda.Core;
using Amazon.S3;
using Amazon.S3.Model;
using IronOcr;

// Assembly attribute to enable the Lambda function's JSON input to be converted into a .NET class.
[assembly: LambdaSerializer(typeof(Amazon.Lambda.Serialization.SystemTextJson.DefaultLambdaJsonSerializer))]

namespace IronOcrZipAwsLambda;

public class Function
{
    private static readonly IAmazonS3 _s3Client = new AmazonS3Client(Amazon.RegionEndpoint.APSoutheast1);

    /// <param name="context">The ILambdaContext that provides methods for logging and describing the Lambda environment.</param>
    /// <returns></returns>
    public async Task FunctionHandler(ILambdaContext context)
    {
        // Set temp file
        var awsTmpPath = @"/tmp/";
        IronOcr.Installation.InstallationPath = awsTmpPath;
        IronOcr.Installation.LogFilePath = awsTmpPath;

        IronOcr.License.LicenseKey = "IRONOCR-MYLICENSE-KEY-1EF01";

        string bucketName = "deploymenttestbucket"; // Your bucket name
        string pdfName = "sample";
        string objectKey = $"IronPdfZip/{pdfName}.pdf";
        string objectKeyForSearchablePdf = $"IronPdfZip/{pdfName}-SearchablePdf.pdf";

        try
        {
            // Retrieve the PDF file from S3
            var pdfData = await GetPdfFromS3Async(bucketName, objectKey);

            IronTesseract ironTesseract = new IronTesseract();
            OcrInput ocrInput = new OcrInput();
            ocrInput.LoadPdf(pdfData);
            OcrResult result = ironTesseract.Read(ocrInput);

            // Use pdfData (byte array) as needed
            context.Logger.LogLine($"OCR result: {result.Text}");

            // Upload the PDF to S3
            await UploadPdfToS3Async(bucketName, objectKeyForSearchablePdf, result.SaveAsSearchablePdfBytes());

            context.Logger.LogLine($"PDF uploaded successfully to {bucketName}/{objectKeyForSearchablePdf}");
        }
        catch (Exception e)
        {
            context.Logger.LogLine($"[ERROR] FunctionHandler: {e.Message}");
        }
    }
    private async Task<byte[]> GetPdfFromS3Async(string bucketName, string objectKey)
    {
        var request = new GetObjectRequest
        {
            BucketName = bucketName,
            Key = objectKey
        };

        using (var response = await _s3Client.GetObjectAsync(request))
        using (var memoryStream = new MemoryStream())
        {
            await response.ResponseStream.CopyToAsync(memoryStream);
            return memoryStream.ToArray();
        }
    }

    // Function to upload the PDF file to S3
    private async Task UploadPdfToS3Async(string bucketName, string objectKey, byte[] pdfBytes)
    {
        using (var memoryStream = new MemoryStream(pdfBytes))
        {
            var request = new PutObjectRequest
            {
                BucketName = bucketName,
                Key = objectKey,
                InputStream = memoryStream,
                ContentType = "application/pdf",
            };

            await _s3Client.PutObjectAsync(request);
        }
    }
}

Imports Amazon.Lambda.Core
Imports Amazon.S3
Imports Amazon.S3.Model
Imports IronOcr

' Assembly attribute to enable the Lambda function's JSON input to be converted into a .NET class.
<Assembly: LambdaSerializer(GetType(Amazon.Lambda.Serialization.SystemTextJson.DefaultLambdaJsonSerializer))>

Namespace IronOcrZipAwsLambda

	Public Class [Function]
		Private Shared ReadOnly _s3Client As IAmazonS3 = New AmazonS3Client(Amazon.RegionEndpoint.APSoutheast1)

		''' <param name="context">The ILambdaContext that provides methods for logging and describing the Lambda environment.</param>
		''' <returns></returns>
		Public Async Function FunctionHandler(ByVal context As ILambdaContext) As Task
			' Set temp file
			Dim awsTmpPath = "/tmp/"
			IronOcr.Installation.InstallationPath = awsTmpPath
			IronOcr.Installation.LogFilePath = awsTmpPath

			IronOcr.License.LicenseKey = "IRONOCR-MYLICENSE-KEY-1EF01"

			Dim bucketName As String = "deploymenttestbucket" ' Your bucket name
			Dim pdfName As String = "sample"
			Dim objectKey As String = $"IronPdfZip/{pdfName}.pdf"
			Dim objectKeyForSearchablePdf As String = $"IronPdfZip/{pdfName}-SearchablePdf.pdf"

			Try
				' Retrieve the PDF file from S3
				Dim pdfData = Await GetPdfFromS3Async(bucketName, objectKey)

				Dim ironTesseract As New IronTesseract()
				Dim ocrInput As New OcrInput()
				ocrInput.LoadPdf(pdfData)
				Dim result As OcrResult = ironTesseract.Read(ocrInput)

				' Use pdfData (byte array) as needed
				context.Logger.LogLine($"OCR result: {result.Text}")

				' Upload the PDF to S3
				Await UploadPdfToS3Async(bucketName, objectKeyForSearchablePdf, result.SaveAsSearchablePdfBytes())

				context.Logger.LogLine($"PDF uploaded successfully to {bucketName}/{objectKeyForSearchablePdf}")
			Catch e As Exception
				context.Logger.LogLine($"[ERROR] FunctionHandler: {e.Message}")
			End Try
		End Function
		Private Async Function GetPdfFromS3Async(ByVal bucketName As String, ByVal objectKey As String) As Task(Of Byte())
			Dim request = New GetObjectRequest With {
				.BucketName = bucketName,
				.Key = objectKey
			}

			Using response = Await _s3Client.GetObjectAsync(request)
			Using memoryStream As New MemoryStream()
				Await response.ResponseStream.CopyToAsync(memoryStream)
				Return memoryStream.ToArray()
			End Using
			End Using
		End Function

		' Function to upload the PDF file to S3
		Private Async Function UploadPdfToS3Async(ByVal bucketName As String, ByVal objectKey As String, ByVal pdfBytes() As Byte) As Task
			Using memoryStream As New MemoryStream(pdfBytes)
				Dim request = New PutObjectRequest With {
					.BucketName = bucketName,
					.Key = objectKey,
					.InputStream = memoryStream,
					.ContentType = "application/pdf"
				}

				Await _s3Client.PutObjectAsync(request)
			End Using
		End Function
	End Class
End Namespace

$vbLabelText $csharpLabel

Antes del bloque try, se especifica el archivo 'sample.pdf' para leer desde el directorio IronPdfZip. El método GetPdfFromS3Async se utiliza luego para recuperar el byte PDF, que se pasa al método LoadPdf.

Aumentar la memoria y el tiempo de espera

La cantidad de memoria asignada en la función Lambda variará según el tamaño de los documentos que se están procesando y la cantidad de documentos procesados simultáneamente. Como línea base, configure la memoria en 512 MB y el tiempo de espera en 300 segundos en aws-lambda-tools-defaults.json.


"function-memory-size" : 512,

"función-tiempo de espera" : 300

Cuando la memoria es insuficiente, el programa lanzará el error: 'Runtime exited with error: signal: killed.' Aumentar el tamaño de la memoria puede resolver este problema. Para más detalles, consulta el artículo de solución de problemas: AWS Lambda - Runtime Exited Signal: Killed.

Publicar

Para publicar en Visual Studio, haga clic derecho en el proyecto y seleccione 'Publicar en AWS Lambda...', luego configure los ajustes necesarios. Puedes leer más sobre cómo publicar una Lambda en el sitio web de AWS.

¡Pruébalo!

Puedes activar la función Lambda ya sea a través de la consola Lambda o a través de Visual Studio.

Chaknith Bin

Chatea con el equipo de ingeniería ahora

Ingeniero de software

Chaknith trabaja en IronXL e IronBarcode. Tiene una gran experiencia en C# y .NET, ayudando a mejorar el software y a apoyar a los clientes. Sus conocimientos de las interacciones con los usuarios contribuyen a mejorar los productos, la documentación y la experiencia general.