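How to OCR Documents on AWS Lambda
This how-to article is a step-by-step guide to setting up an AWS Lambda function with IronOCR. By following it, you will learn how to configure IronOCR and read documents stored in an S3 bucket.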
Installation
This article uses an S3 bucket, so the AWSSDK.S3 package is required.
If you are using the IronOCR ZIP, be sure to set the temporary folder.
var awsTmpPath = @"/tmp/";
IronOcr.Installation.InstallationPath = awsTmpPath;
IronOcr.Installation.LogFilePath = awsTmpPath;
Dim awsTmpPath = "/tmp/"
IronOcr.Installation.InstallationPath = awsTmpPath
IronOcr.Installation.LogFilePath = awsTmpPath
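In the Lambda execution environment, /tmp is the directory a function can write to, which is why the snippet above points IronOCR there. As a minimal sketch, a function could confirm that a folder is actually writable before assigning it. The class name TempFolderProbe and the method IsWritable are hypothetical helpers introduced for illustration; they are not part of IronOCR or the AWS SDK.

```csharp
using System;
using System.IO;

// Hypothetical helper (not part of IronOCR): checks that a directory accepts writes.
public static class TempFolderProbe
{
    // Returns true when a file can be created and deleted inside the given directory.
    public static bool IsWritable(string directory)
    {
        try
        {
            Directory.CreateDirectory(directory); // no-op if the directory already exists
            string probe = Path.Combine(directory, Path.GetRandomFileName());
            File.WriteAllText(probe, "probe");
            File.Delete(probe);
            return true;
        }
        catch (Exception e) when (e is IOException || e is UnauthorizedAccessException)
        {
            // The directory is read-only or otherwise unavailable.
            return false;
        }
    }
}
```

One way to use it would be to call TempFolderProbe.IsWritable(awsTmpPath) at the start of the handler and log a clear message if it returns false, rather than letting the first OCR call fail.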
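Create an AWS Lambda Project
With Visual Studio, creating a containerized AWS Lambda function is straightforward:
- Install the AWS Toolkit for Visual Studio
- Select "AWS Lambda Project (.NET Core - C#)"
- Select the ".NET 8 (Container Image)" blueprint, then select "Finish".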
Add Package Dependencies
On .NET 8, the IronOCR library needs no additional dependencies to run on AWS Lambda. Change the project's Dockerfile to the following:
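FROM public.ecr.aws/lambda/dotnet:8

# Install necessary packages
RUN dnf update -y

WORKDIR /var/task

# This COPY command copies the .NET Lambda project's build artifacts from the host machine into the image.
# The source of the COPY should match where the .NET Lambda project publishes its build artifacts.
# If the Lambda function is being built with the AWS .NET Lambda Tooling, the
# --docker-host-build-output-dir switch controls where the .NET Lambda project will be built.
# The .NET Lambda project templates default to having --docker-host-build-output-dir
# set in the aws-lambda-tools-defaults.json file to "bin/Release/lambda-publish".
#
# Alternatively, a Docker multi-stage build could be used to build the .NET Lambda project inside the image.
# For more information on this approach, check the project's README.md file.
COPY "bin/Release/lambda-publish" .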
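Modify the FunctionHandler Code
This example retrieves a PDF from an S3 bucket, runs OCR on it, and saves a searchable PDF back to the same bucket. Setting the temporary folder is required when using the IronOCR ZIP, because the library needs write permission to copy the runtime folder from the DLLs.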
using Amazon.Lambda.Core;
using Amazon.S3;
using Amazon.S3.Model;
using IronOcr;

// Assembly attribute to enable the Lambda function's JSON input to be converted into a .NET class.
[assembly: LambdaSerializer(typeof(Amazon.Lambda.Serialization.SystemTextJson.DefaultLambdaJsonSerializer))]

namespace IronOcrZipAwsLambda;

public class Function
{
    private static readonly IAmazonS3 _s3Client = new AmazonS3Client(Amazon.RegionEndpoint.APSoutheast1);

    /// <summary>Reads a PDF from S3, runs OCR on it, and uploads a searchable PDF to the same bucket.</summary>
    /// <param name="context">The ILambdaContext that provides methods for logging and describing the Lambda environment.</param>
    public async Task FunctionHandler(ILambdaContext context)
    {
        // Point IronOCR's installation and log paths at the temporary folder
        var awsTmpPath = @"/tmp/";
        IronOcr.Installation.InstallationPath = awsTmpPath;
        IronOcr.Installation.LogFilePath = awsTmpPath;

        IronOcr.License.LicenseKey = "IRONOCR-MYLICENSE-KEY-1EF01";

        string bucketName = "deploymenttestbucket"; // Your bucket name
        string pdfName = "sample";
        string objectKey = $"IronPdfZip/{pdfName}.pdf";
        string objectKeyForSearchablePdf = $"IronPdfZip/{pdfName}-SearchablePdf.pdf";

        try
        {
            // Retrieve the PDF file from S3
            var pdfData = await GetPdfFromS3Async(bucketName, objectKey);

            // Run OCR on the PDF bytes
            IronTesseract ironTesseract = new IronTesseract();
            OcrInput ocrInput = new OcrInput();
            ocrInput.LoadPdf(pdfData);
            OcrResult result = ironTesseract.Read(ocrInput);

            // Log the recognized text
            context.Logger.LogLine($"OCR result: {result.Text}");

            // Upload the searchable PDF to S3
            await UploadPdfToS3Async(bucketName, objectKeyForSearchablePdf, result.SaveAsSearchablePdfBytes());
            context.Logger.LogLine($"PDF uploaded successfully to {bucketName}/{objectKeyForSearchablePdf}");
        }
        catch (Exception e)
        {
            context.Logger.LogLine($"[ERROR] FunctionHandler: {e.Message}");
        }
    }

    // Function to download the PDF file from S3 as a byte array
    private async Task<byte[]> GetPdfFromS3Async(string bucketName, string objectKey)
    {
        var request = new GetObjectRequest
        {
            BucketName = bucketName,
            Key = objectKey
        };

        using (var response = await _s3Client.GetObjectAsync(request))
        using (var memoryStream = new MemoryStream())
        {
            await response.ResponseStream.CopyToAsync(memoryStream);
            return memoryStream.ToArray();
        }
    }

    // Function to upload the PDF file to S3
    private async Task UploadPdfToS3Async(string bucketName, string objectKey, byte[] pdfBytes)
    {
        using (var memoryStream = new MemoryStream(pdfBytes))
        {
            var request = new PutObjectRequest
            {
                BucketName = bucketName,
                Key = objectKey,
                InputStream = memoryStream,
                ContentType = "application/pdf",
            };
            await _s3Client.PutObjectAsync(request);
        }
    }
}
Imports Amazon.Lambda.Core
Imports Amazon.S3
Imports Amazon.S3.Model
Imports IronOcr

' Assembly attribute to enable the Lambda function's JSON input to be converted into a .NET class.
<Assembly: LambdaSerializer(GetType(Amazon.Lambda.Serialization.SystemTextJson.DefaultLambdaJsonSerializer))>

Namespace IronOcrZipAwsLambda
    Public Class [Function]
        Private Shared ReadOnly _s3Client As IAmazonS3 = New AmazonS3Client(Amazon.RegionEndpoint.APSoutheast1)

        ''' <param name="context">The ILambdaContext that provides methods for logging and describing the Lambda environment.</param>
        ''' <returns></returns>
        Public Async Function FunctionHandler(ByVal context As ILambdaContext) As Task
            ' Set temp file
            Dim awsTmpPath = "/tmp/"
            IronOcr.Installation.InstallationPath = awsTmpPath
            IronOcr.Installation.LogFilePath = awsTmpPath

            IronOcr.License.LicenseKey = "IRONOCR-MYLICENSE-KEY-1EF01"

            Dim bucketName As String = "deploymenttestbucket" ' Your bucket name
            Dim pdfName As String = "sample"
            Dim objectKey As String = $"IronPdfZip/{pdfName}.pdf"
            Dim objectKeyForSearchablePdf As String = $"IronPdfZip/{pdfName}-SearchablePdf.pdf"

            Try
                ' Retrieve the PDF file from S3
                Dim pdfData = Await GetPdfFromS3Async(bucketName, objectKey)

                Dim ironTesseract As New IronTesseract()
                Dim ocrInput As New OcrInput()
                ocrInput.LoadPdf(pdfData)
                Dim result As OcrResult = ironTesseract.Read(ocrInput)

                ' Use pdfData (byte array) as needed
                context.Logger.LogLine($"OCR result: {result.Text}")

                ' Upload the PDF to S3
                Await UploadPdfToS3Async(bucketName, objectKeyForSearchablePdf, result.SaveAsSearchablePdfBytes())
                context.Logger.LogLine($"PDF uploaded successfully to {bucketName}/{objectKeyForSearchablePdf}")
            Catch e As Exception
                context.Logger.LogLine($"[ERROR] FunctionHandler: {e.Message}")
            End Try
        End Function

        Private Async Function GetPdfFromS3Async(ByVal bucketName As String, ByVal objectKey As String) As Task(Of Byte())
            Dim request = New GetObjectRequest With {
                .BucketName = bucketName,
                .Key = objectKey
            }

            Using response = Await _s3Client.GetObjectAsync(request)
                Using memoryStream As New MemoryStream()
                    Await response.ResponseStream.CopyToAsync(memoryStream)
                    Return memoryStream.ToArray()
                End Using
            End Using
        End Function

        ' Function to upload the PDF file to S3
        Private Async Function UploadPdfToS3Async(ByVal bucketName As String, ByVal objectKey As String, ByVal pdfBytes() As Byte) As Task
            Using memoryStream As New MemoryStream(pdfBytes)
                Dim request = New PutObjectRequest With {
                    .BucketName = bucketName,
                    .Key = objectKey,
                    .InputStream = memoryStream,
                    .ContentType = "application/pdf"
                }
                Await _s3Client.PutObjectAsync(request)
            End Using
        End Function
    End Class
End Namespace
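Before the try block, the code specifies 'sample.pdf' in the IronPdfZip directory as the file to read. The GetPdfFromS3Async method then retrieves the PDF bytes, which are passed to the LoadPdf method.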
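The handler hard-codes both the input key and the output key. As a sketch of one alternative, the output key could be derived from the input key so that the two never drift apart. The class S3KeyHelper and the method ToSearchablePdfKey are hypothetical names introduced for illustration; the "-SearchablePdf.pdf" suffix simply mirrors the naming used in the example above.

```csharp
using System;

// Hypothetical helper: derives the searchable-PDF key from the source object key.
public static class S3KeyHelper
{
    public static string ToSearchablePdfKey(string objectKey)
    {
        if (string.IsNullOrEmpty(objectKey))
            throw new ArgumentException("Object key must not be empty.", nameof(objectKey));

        // S3 keys always use '/' as the separator, so split manually
        // instead of relying on OS-specific path helpers.
        int slash = objectKey.LastIndexOf('/');
        string prefix = objectKey.Substring(0, slash + 1);
        string fileName = objectKey.Substring(slash + 1);

        // Drop the extension, if any, before appending the suffix.
        int dot = fileName.LastIndexOf('.');
        string stem = dot > 0 ? fileName.Substring(0, dot) : fileName;

        return $"{prefix}{stem}-SearchablePdf.pdf";
    }
}
```

With this helper, S3KeyHelper.ToSearchablePdfKey("IronPdfZip/sample.pdf") yields "IronPdfZip/sample-SearchablePdf.pdf", the same output key the handler builds by hand.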
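Increase Memory and Timeout
The amount of memory to allocate to the Lambda function depends on the size of the documents being processed and on how many are processed at the same time. As a baseline, set the memory to 512 MB and the timeout to 300 seconds in aws-lambda-tools-defaults.json:
"function-memory-size": 512,
"function-timeout": 300
When memory is insufficient, the program throws the error "Runtime exited with error: signal: killed". Increasing the memory size resolves this issue. For more details, see the troubleshooting article: AWS Lambda - Runtime Exited Signal: Killed.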
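To judge whether the 512 MB baseline fits a given workload, it can help to log how much memory the process actually used next to the configured limit. The following is a minimal sketch under that assumption: MemoryReport and Describe are hypothetical names, and the peak working set reported by .NET is only an approximation of what Lambda counts against the limit.

```csharp
using System;
using System.Diagnostics;

// Hypothetical helper: formats the process's peak memory use against a configured limit.
public static class MemoryReport
{
    public static string Describe(int memoryLimitMb)
    {
        // Peak physical memory used by the current process, in bytes.
        long peakBytes = Process.GetCurrentProcess().PeakWorkingSet64;
        double peakMb = peakBytes / (1024.0 * 1024.0);

        return $"Peak working set: {peakMb:F0} MB of {memoryLimitMb} MB configured";
    }
}
```

Inside the handler this could be logged after the OCR step with context.Logger.LogLine(MemoryReport.Describe(context.MemoryLimitInMB)), where MemoryLimitInMB is the limit exposed by ILambdaContext. If the reported peak sits close to the limit, raising function-memory-size is the likely fix.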
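Publish
To publish from Visual Studio, right-click the project, select "Publish to AWS Lambda...", and configure the required settings. You can read more about publishing a Lambda function on the AWS website.
Try It Out!
You can invoke the Lambda function either from the Lambda console or from Visual Studio.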