How to OCR documents on AWS Lambda
This how-to article provides a step-by-step guide for setting up an AWS Lambda function using IronOCR. By following this guide, you will learn how to configure IronOCR and efficiently read documents stored in an S3 bucket.
How to OCR documents on AWS Lambda
Installation
This article will use an S3 bucket, so the AWSSDK.S3 package is required.
If you are using IronOCR ZIP, it is essential to set the temporary folder.
var awsTmpPath = @"/tmp/";
IronOcr.Installation.InstallationPath = awsTmpPath;
IronOcr.Installation.LogFilePath = awsTmpPath;
var awsTmpPath = @"/tmp/";
IronOcr.Installation.InstallationPath = awsTmpPath;
IronOcr.Installation.LogFilePath = awsTmpPath;
Dim awsTmpPath = "/tmp/"
IronOcr.Installation.InstallationPath = awsTmpPath
IronOcr.Installation.LogFilePath = awsTmpPath
Start using IronOCR in your project today with a free trial.
Create an AWS Lambda Project
With Visual Studio, creating a containerized AWS Lambda is an easy process:
- Install the AWS Tookit for Visual Studio
- Select an 'AWS Lambda Project (.NET Core - C#)'
- Select a '.NET 8 (Container Image)' blueprint, then select 'Finish'.
Add Package Dependencies
Using the IronOCR library in .NET 8 does not require additional dependencies to be installed for use on AWS Lambda. Modify the project's Dockerfile with the following:
FROM public.ecr.aws/lambda/dotnet:8
# install necessary packages
RUN dnf update -y
WORKDIR /var/task
# This COPY command copies the .NET Lambda project's build artifacts from the host machine into the image.
# The source of the COPY should match where the .NET Lambda project publishes its build artifacts. If the Lambda function is being built
# with the AWS .NET Lambda Tooling, the `--docker-host-build-output-dir` switch controls where the .NET Lambda project
# will be built. The .NET Lambda project templates default to having `--docker-host-build-output-dir`
# set in the aws-lambda-tools-defaults.json file to "bin/Release/lambda-publish".
#
# Alternatively Docker multi-stage build could be used to build the .NET Lambda project inside the image.
# For more information on this approach checkout the project's README.md file.
COPY "bin/Release/lambda-publish" .
Modify the FunctionHandler Code
This example retrieves an image from an S3 bucket, processes it, and saves a searchable PDF back to the same bucket. Setting the temp folder is essential when using IronOCR ZIP, as the library requires write permissions to copy the runtime folder from the DLLs.
using Amazon.Lambda.Core;
using Amazon.S3;
using Amazon.S3.Model;
using IronOcr;
// Assembly attribute to enable the Lambda function's JSON input to be converted into a .NET class.
[assembly: LambdaSerializer(typeof(Amazon.Lambda.Serialization.SystemTextJson.DefaultLambdaJsonSerializer))]
namespace IronOcrZipAwsLambda;
public class Function
{
private static readonly IAmazonS3 _s3Client = new AmazonS3Client(Amazon.RegionEndpoint.APSoutheast1);
/// <param name="context">The ILambdaContext that provides methods for logging and describing the Lambda environment.</param>
/// <returns></returns>
public async Task FunctionHandler(ILambdaContext context)
{
// Set temp file
var awsTmpPath = @"/tmp/";
IronOcr.Installation.InstallationPath = awsTmpPath;
IronOcr.Installation.LogFilePath = awsTmpPath;
IronOcr.License.LicenseKey = "IRONOCR-MYLICENSE-KEY-1EF01";
string bucketName = "deploymenttestbucket"; // Your bucket name
string pdfName = "sample";
string objectKey = $"IronPdfZip/{pdfName}.pdf";
string objectKeyForSearchablePdf = $"IronPdfZip/{pdfName}-SearchablePdf.pdf";
try
{
// Retrieve the PDF file from S3
var pdfData = await GetPdfFromS3Async(bucketName, objectKey);
IronTesseract ironTesseract = new IronTesseract();
OcrInput ocrInput = new OcrInput();
ocrInput.LoadPdf(pdfData);
OcrResult result = ironTesseract.Read(ocrInput);
// Use pdfData (byte array) as needed
context.Logger.LogLine($"OCR result: {result.Text}");
// Upload the PDF to S3
await UploadPdfToS3Async(bucketName, objectKeyForSearchablePdf, result.SaveAsSearchablePdfBytes());
context.Logger.LogLine($"PDF uploaded successfully to {bucketName}/{objectKeyForSearchablePdf}");
}
catch (Exception e)
{
context.Logger.LogLine($"[ERROR] FunctionHandler: {e.Message}");
}
}
private async Task<byte[]> GetPdfFromS3Async(string bucketName, string objectKey)
{
var request = new GetObjectRequest
{
BucketName = bucketName,
Key = objectKey
};
using (var response = await _s3Client.GetObjectAsync(request))
using (var memoryStream = new MemoryStream())
{
await response.ResponseStream.CopyToAsync(memoryStream);
return memoryStream.ToArray();
}
}
// Function to upload the PDF file to S3
private async Task UploadPdfToS3Async(string bucketName, string objectKey, byte[] pdfBytes)
{
using (var memoryStream = new MemoryStream(pdfBytes))
{
var request = new PutObjectRequest
{
BucketName = bucketName,
Key = objectKey,
InputStream = memoryStream,
ContentType = "application/pdf",
};
await _s3Client.PutObjectAsync(request);
}
}
}
using Amazon.Lambda.Core;
using Amazon.S3;
using Amazon.S3.Model;
using IronOcr;
// Assembly attribute to enable the Lambda function's JSON input to be converted into a .NET class.
[assembly: LambdaSerializer(typeof(Amazon.Lambda.Serialization.SystemTextJson.DefaultLambdaJsonSerializer))]
namespace IronOcrZipAwsLambda;
public class Function
{
private static readonly IAmazonS3 _s3Client = new AmazonS3Client(Amazon.RegionEndpoint.APSoutheast1);
/// <param name="context">The ILambdaContext that provides methods for logging and describing the Lambda environment.</param>
/// <returns></returns>
public async Task FunctionHandler(ILambdaContext context)
{
// Set temp file
var awsTmpPath = @"/tmp/";
IronOcr.Installation.InstallationPath = awsTmpPath;
IronOcr.Installation.LogFilePath = awsTmpPath;
IronOcr.License.LicenseKey = "IRONOCR-MYLICENSE-KEY-1EF01";
string bucketName = "deploymenttestbucket"; // Your bucket name
string pdfName = "sample";
string objectKey = $"IronPdfZip/{pdfName}.pdf";
string objectKeyForSearchablePdf = $"IronPdfZip/{pdfName}-SearchablePdf.pdf";
try
{
// Retrieve the PDF file from S3
var pdfData = await GetPdfFromS3Async(bucketName, objectKey);
IronTesseract ironTesseract = new IronTesseract();
OcrInput ocrInput = new OcrInput();
ocrInput.LoadPdf(pdfData);
OcrResult result = ironTesseract.Read(ocrInput);
// Use pdfData (byte array) as needed
context.Logger.LogLine($"OCR result: {result.Text}");
// Upload the PDF to S3
await UploadPdfToS3Async(bucketName, objectKeyForSearchablePdf, result.SaveAsSearchablePdfBytes());
context.Logger.LogLine($"PDF uploaded successfully to {bucketName}/{objectKeyForSearchablePdf}");
}
catch (Exception e)
{
context.Logger.LogLine($"[ERROR] FunctionHandler: {e.Message}");
}
}
private async Task<byte[]> GetPdfFromS3Async(string bucketName, string objectKey)
{
var request = new GetObjectRequest
{
BucketName = bucketName,
Key = objectKey
};
using (var response = await _s3Client.GetObjectAsync(request))
using (var memoryStream = new MemoryStream())
{
await response.ResponseStream.CopyToAsync(memoryStream);
return memoryStream.ToArray();
}
}
// Function to upload the PDF file to S3
private async Task UploadPdfToS3Async(string bucketName, string objectKey, byte[] pdfBytes)
{
using (var memoryStream = new MemoryStream(pdfBytes))
{
var request = new PutObjectRequest
{
BucketName = bucketName,
Key = objectKey,
InputStream = memoryStream,
ContentType = "application/pdf",
};
await _s3Client.PutObjectAsync(request);
}
}
}
Imports Amazon.Lambda.Core
Imports Amazon.S3
Imports Amazon.S3.Model
Imports IronOcr
' Assembly attribute to enable the Lambda function's JSON input to be converted into a .NET class.
<Assembly: LambdaSerializer(GetType(Amazon.Lambda.Serialization.SystemTextJson.DefaultLambdaJsonSerializer))>
Namespace IronOcrZipAwsLambda
Public Class [Function]
Private Shared ReadOnly _s3Client As IAmazonS3 = New AmazonS3Client(Amazon.RegionEndpoint.APSoutheast1)
''' <param name="context">The ILambdaContext that provides methods for logging and describing the Lambda environment.</param>
''' <returns></returns>
Public Async Function FunctionHandler(ByVal context As ILambdaContext) As Task
' Set temp file
Dim awsTmpPath = "/tmp/"
IronOcr.Installation.InstallationPath = awsTmpPath
IronOcr.Installation.LogFilePath = awsTmpPath
IronOcr.License.LicenseKey = "IRONOCR-MYLICENSE-KEY-1EF01"
Dim bucketName As String = "deploymenttestbucket" ' Your bucket name
Dim pdfName As String = "sample"
Dim objectKey As String = $"IronPdfZip/{pdfName}.pdf"
Dim objectKeyForSearchablePdf As String = $"IronPdfZip/{pdfName}-SearchablePdf.pdf"
Try
' Retrieve the PDF file from S3
Dim pdfData = Await GetPdfFromS3Async(bucketName, objectKey)
Dim ironTesseract As New IronTesseract()
Dim ocrInput As New OcrInput()
ocrInput.LoadPdf(pdfData)
Dim result As OcrResult = ironTesseract.Read(ocrInput)
' Use pdfData (byte array) as needed
context.Logger.LogLine($"OCR result: {result.Text}")
' Upload the PDF to S3
Await UploadPdfToS3Async(bucketName, objectKeyForSearchablePdf, result.SaveAsSearchablePdfBytes())
context.Logger.LogLine($"PDF uploaded successfully to {bucketName}/{objectKeyForSearchablePdf}")
Catch e As Exception
context.Logger.LogLine($"[ERROR] FunctionHandler: {e.Message}")
End Try
End Function
Private Async Function GetPdfFromS3Async(ByVal bucketName As String, ByVal objectKey As String) As Task(Of Byte())
Dim request = New GetObjectRequest With {
.BucketName = bucketName,
.Key = objectKey
}
Using response = Await _s3Client.GetObjectAsync(request)
Using memoryStream As New MemoryStream()
Await response.ResponseStream.CopyToAsync(memoryStream)
Return memoryStream.ToArray()
End Using
End Using
End Function
' Function to upload the PDF file to S3
Private Async Function UploadPdfToS3Async(ByVal bucketName As String, ByVal objectKey As String, ByVal pdfBytes() As Byte) As Task
Using memoryStream As New MemoryStream(pdfBytes)
Dim request = New PutObjectRequest With {
.BucketName = bucketName,
.Key = objectKey,
.InputStream = memoryStream,
.ContentType = "application/pdf"
}
Await _s3Client.PutObjectAsync(request)
End Using
End Function
End Class
End Namespace
Before the try block, the file 'sample.pdf' is specified for reading from the IronPdfZip directory. The GetPdfFromS3Async
method is then used to retrieve the PDF byte, which is passed to the LoadPdf
method.
Increase Memory and Timeout
The amount of memory allocated in the Lambda function will vary based on the size of the documents being processed and the number of documents processed simultaneously. As a baseline, set the memory to 512 MB and the timeout to 300 seconds in aws-lambda-tools-defaults.json
.
"function-memory-size" : 512,
"function-timeout" : 300
When the memory is insufficient, the program will throw the error: 'Runtime exited with error: signal: killed.' Increasing the memory size can resolve this issue. For more details, refer to the troubleshooting article: AWS Lambda - Runtime Exited Signal: Killed.
Publish
To publish in Visual Studio, right-click on the project and select 'Publish to AWS Lambda...', then configure the necessary settings. You can read more about publishing a Lambda on the AWS website.
Try It Out!
You can activate the Lambda function either through the Lambda console or through Visual Studio.