Amazon Textract vs IronOCR: .NET OCR Library
AWS Textract's per-page pricing model can look inexpensive at low volume, but the charges keep accruing for as long as you process documents. Every document your application handles leaves your network, travels to an Amazon data center, and is processed on Amazon infrastructure, and every page is billed. For teams evaluating OCR options in .NET, the question is not just whether Textract produces accurate results (it does) but whether the per-page cost model, mandatory cloud transmission, and async polling architecture for multi-page documents match what your application actually needs.
Understanding AWS Textract
AWS Textract is Amazon's managed document analysis service, accessible via the AWS SDK for .NET through the AWSSDK.Textract NuGet package. It operates as a cloud API: your application sends document data to Amazon's infrastructure and receives structured results. The service requires an AWS account, IAM credentials with Textract permissions, and an internet connection for every single OCR operation.
Textract exposes several distinct analysis modes, each priced separately:
- DetectDocumentText: Basic text extraction (see AWS Textract pricing for current per-page rates)
- AnalyzeDocument (Tables): Structured table extraction at a higher per-page rate than basic text
- AnalyzeDocument (Forms): Key-value form extraction at a higher per-page rate than table extraction
- AnalyzeExpense: Invoice and receipt parsing at $0.01 per page
- AnalyzeID: Identity document extraction at $0.025 per page
- StartDocumentTextDetection / StartDocumentAnalysis: Asynchronous API required for any multi-page PDF, mandating an S3 staging bucket, job polling, and result pagination
The result model uses a flat list of Block objects with relationship IDs that must be traversed to reconstruct tables, forms, or any structured output. A simple table extraction requires iterating BlockType.TABLE blocks, finding child BlockType.CELL blocks via RelationshipType.CHILD relationship IDs, then fetching BlockType.WORD blocks for each cell's text. This relationship graph model handles complex document structures, but it is not lightweight.
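The traversal described above can be sketched without the AWS SDK. The following is a minimal illustration, not the SDK's own API: FlatBlock is a hypothetical stand-in type that carries only the fields the walk needs (an ID, a block type, text, row and column indexes, and the IDs of CHILD relationships). It shows why a table has to be rebuilt by hand from the flat list.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-in for a Textract block; not the SDK's Block class.
public sealed record FlatBlock(
    string Id,
    string BlockType,                 // "TABLE", "CELL" or "WORD"
    string Text,                      // populated for WORD blocks
    int RowIndex,                     // 1-based, populated for CELL blocks
    int ColumnIndex,                  // 1-based, populated for CELL blocks
    IReadOnlyList<string> ChildIds);  // IDs taken from CHILD relationships

public static class TableReconstructor
{
    // Returns one grid (rows of column strings) per TABLE block in the flat list.
    public static List<string[][]> BuildTables(IEnumerable<FlatBlock> blocks)
    {
        var all = blocks.ToList();
        var byId = all.ToDictionary(b => b.Id);
        var tables = new List<string[][]>();

        foreach (var table in all.Where(b => b.BlockType == "TABLE"))
        {
            // TABLE -> CELL: follow child IDs and keep only cell blocks.
            var cells = Children(table, byId, "CELL").ToList();
            if (cells.Count == 0)
            {
                tables.Add(Array.Empty<string[]>());
                continue;
            }

            var rowCount = cells.Max(c => c.RowIndex);
            var colCount = cells.Max(c => c.ColumnIndex);
            var grid = Enumerable.Range(0, rowCount)
                .Select(r => Enumerable.Repeat(string.Empty, colCount).ToArray())
                .ToArray();

            foreach (var cell in cells)
            {
                // CELL -> WORD: join the words in relationship order.
                var words = Children(cell, byId, "WORD").Select(w => w.Text);
                grid[cell.RowIndex - 1][cell.ColumnIndex - 1] = string.Join(" ", words);
            }
            tables.Add(grid);
        }
        return tables;
    }

    private static IEnumerable<FlatBlock> Children(
        FlatBlock parent, Dictionary<string, FlatBlock> byId, string type) =>
        parent.ChildIds
            .Where(id => byId.ContainsKey(id))
            .Select(id => byId[id])
            .Where(b => b.BlockType == type);
}
```

With the real SDK the same walk would read each block's relationship list and filter on the CHILD relationship type; merged cells and multi-page tables would need additional handling that this sketch leaves out.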
The S3-Async Pipeline
Single-image OCR via DetectDocumentTextAsync can pass document bytes directly in the request. Multi-page PDFs cannot. They require the full asynchronous pipeline:
// AWS Textract: Multi-page PDF requires S3 + async job
public async Task<string> ProcessPdfAsync(string pdfPath)
{
// Step 1: Upload to S3 — credentials for two services required
var key = $"uploads/{Guid.NewGuid()}.pdf";
using (var fileStream = File.OpenRead(pdfPath))
{
await _s3Client.PutObjectAsync(new PutObjectRequest
{
BucketName = _bucketName,
Key = key,
InputStream = fileStream
});
}
try
{
// Step 2: Start async Textract job
var startResponse = await _textractClient.StartDocumentTextDetectionAsync(
new StartDocumentTextDetectionRequest
{
DocumentLocation = new DocumentLocation
{
S3Object = new S3Object { Bucket = _bucketName, Name = key }
}
});
var jobId = startResponse.JobId;
// Step 3: Poll every 5 seconds until complete
GetDocumentTextDetectionResponse getResponse;
do
{
await Task.Delay(5000);
getResponse = await _textractClient.GetDocumentTextDetectionAsync(
new GetDocumentTextDetectionRequest { JobId = jobId });
} while (getResponse.JobStatus == JobStatus.IN_PROGRESS);
if (getResponse.JobStatus != JobStatus.SUCCEEDED)
throw new Exception($"Textract job failed: {getResponse.StatusMessage}");
// Step 4: Paginate through result blocks
var allText = new StringBuilder();
string nextToken = null;
do
{
var pageResponse = await _textractClient.GetDocumentTextDetectionAsync(
new GetDocumentTextDetectionRequest
{
JobId = jobId,
NextToken = nextToken
});
foreach (var block in pageResponse.Blocks.Where(b => b.BlockType == BlockType.LINE))
allText.AppendLine(block.Text);
nextToken = pageResponse.NextToken;
} while (nextToken != null);
return allText.ToString();
}
finally
{
// Step 5: Always clean up S3
await _s3Client.DeleteObjectAsync(_bucketName, key);
}
}
Imports System
Imports System.IO
Imports System.Text
Imports System.Threading.Tasks
Imports Amazon.S3
Imports Amazon.Textract
Imports Amazon.Textract.Model
Public Class PdfProcessor
Private _s3Client As IAmazonS3
Private _textractClient As IAmazonTextract
Private _bucketName As String
Public Async Function ProcessPdfAsync(pdfPath As String) As Task(Of String)
' Step 1: Upload to S3 — credentials for two services required
Dim key = $"uploads/{Guid.NewGuid()}.pdf"
Using fileStream = File.OpenRead(pdfPath)
Await _s3Client.PutObjectAsync(New PutObjectRequest With {
.BucketName = _bucketName,
.Key = key,
.InputStream = fileStream
})
End Using
Try
' Step 2: Start async Textract job
Dim startResponse = Await _textractClient.StartDocumentTextDetectionAsync(
New StartDocumentTextDetectionRequest With {
.DocumentLocation = New DocumentLocation With {
.S3Object = New S3Object With {.Bucket = _bucketName, .Name = key}
}
})
Dim jobId = startResponse.JobId
' Step 3: Poll every 5 seconds until complete
Dim getResponse As GetDocumentTextDetectionResponse
Do
Await Task.Delay(5000)
getResponse = Await _textractClient.GetDocumentTextDetectionAsync(
New GetDocumentTextDetectionRequest With {.JobId = jobId})
Loop While getResponse.JobStatus = JobStatus.IN_PROGRESS
If getResponse.JobStatus <> JobStatus.SUCCEEDED Then
Throw New Exception($"Textract job failed: {getResponse.StatusMessage}")
End If
' Step 4: Paginate through result blocks
Dim allText = New StringBuilder()
Dim nextToken As String = Nothing
Do
Dim pageResponse = Await _textractClient.GetDocumentTextDetectionAsync(
New GetDocumentTextDetectionRequest With {
.JobId = jobId,
.NextToken = nextToken
})
For Each block In pageResponse.Blocks.Where(Function(b) b.BlockType = BlockType.LINE)
allText.AppendLine(block.Text)
Next
nextToken = pageResponse.NextToken
Loop While nextToken IsNot Nothing
Return allText.ToString()
Finally
' Step 5: Always clean up S3
Await _s3Client.DeleteObjectAsync(_bucketName, key)
End Try
End Function
End Class
This is the minimum viable implementation for reliable PDF processing — five distinct phases, two AWS service clients, and cleanup logic in a finally block. The complete production version with proper error handling, rate limit retry logic, and timeout management runs 150-300 lines.
Understanding IronOCR
IronOCR is a commercial .NET OCR library that runs entirely on your infrastructure. It wraps an optimized Tesseract 5 engine with automatic image preprocessing, native PDF support, and a synchronous API that produces results directly without external service calls or staging steps.
Key characteristics of the IronOCR architecture:
- Local processing only: No document data leaves the machine running your application
- Single NuGet package: dotnet add package IronOcr installs everything, including native binaries
- Automatic preprocessing: Deskew, denoise, contrast enhancement, binarization, and resolution scaling happen automatically on poor-quality inputs
- Native PDF support: Reads PDFs directly via file path or stream without S3 staging or async jobs
- Thread-safe: A single IronTesseract instance handles concurrent requests across threads without contention
- Perpetual licensing: $999 Lite / $1,499 Plus / $2,999 Professional / $5,999 Unlimited; one payment, no per-page charges, no usage metering
- 125+ language packs: Installed as separate NuGet packages, loaded locally, no network calls
Feature Comparison
| Feature | AWS Textract | IronOCR |
|---|---|---|
| Processing location | Amazon cloud (mandatory) | Local / on-premise |
| Multi-page PDF | Requires S3 + async job | Direct synchronous call |
| Cost model | Per-page (contact AWS for current pricing) | Perpetual license, no per-page fee |
| Internet required | Always | Never |
| Credential setup | IAM user/role + optional S3 | Single license key string |
| Air-gapped deployment | Not possible | Fully supported |
| Encrypted PDF support | Not supported | Built-in (password parameter) |
Detailed Feature Comparison
| Feature | AWS Textract | IronOCR |
|---|---|---|
| Text Extraction | | |
| Basic OCR (images) | Yes (DetectDocumentTextAsync) | Yes (ocr.Read(path)) |
| Multi-page PDF | Requires S3 + async polling | Direct input.LoadPdf(path) |
| Password-protected PDF | Not supported | input.LoadPdf(path, Password: "x") |
| Stream input | Yes (byte array in request) | Yes (input.LoadImage(stream)) |
| Structured Extraction | | |
| Table extraction | AnalyzeDocument + block graph traversal | Word position-based reconstruction |
| Form field extraction | AnalyzeDocument + KEY_VALUE_SET blocks | Region-based CropRectangle zones |
| Line-level results | Block filtering by BlockType.LINE | result.Lines direct collection |
| Word-level with coordinates | Block filtering by BlockType.WORD | result.Words with .X, .Y, .Width |
| Confidence scores | Per-block confidence | Per-word and overall result.Confidence |
| Processing Model | | |
| Synchronous (images) | Yes (single page only) | Yes (all document types) |
| Asynchronous | Required for PDFs | Optional — Task.Run() wrapper |
| Batch processing | Requires rate limit management (5 TPS default) | Unconstrained Parallel.ForEach |
| Preprocessing | | |
| Auto deskew | Not exposed | input.Deskew() |
| Noise removal | Internal (not configurable) | input.DeNoise() |
| Contrast enhancement | Internal (not configurable) | input.Contrast() |
| Resolution enhancement | Internal (not configurable) | input.EnhanceResolution(300) |
| Binarization | Internal | input.Binarize() |
| Output Formats | | |
| Plain text | Yes | Yes |
| Searchable PDF | No | result.SaveAsSearchablePdf(path) |
| hOCR | No | result.SaveAsHocrFile(path) |
| Structured JSON | Via block serialization | result.Words / result.Lines |
| Deployment | | |
| On-premise | No | Yes |
| Air-gapped | No | Yes |
| Docker | Yes (with AWS credentials injected) | Yes (no credentials required) |
| AWS Lambda | Native | Supported |
| Azure | Yes | Yes |
| Linux | Yes (AWS-managed) | Yes |
| Compliance | | |
| HIPAA | Requires BAA with AWS | No external processor |
| GDPR | Data crosses to AWS regions | Data stays in-boundary |
| ITAR | Prohibited without special authorization | Fully on-premise |
| Air-gapped / CMMC Level 3 | Not possible | Supported |
Cost at Scale
The per-page pricing model is the defining structural constraint of AWS Textract. Costs that appear small per page accumulate significantly across a real document workflow.
AWS Textract Approach
// Every call to this method costs money — per page, permanently
public async Task<string> DetectTextAsync(string imagePath)
{
var imageBytes = File.ReadAllBytes(imagePath); // Image leaves your network
var request = new DetectDocumentTextRequest
{
Document = new Document
{
Bytes = new MemoryStream(imageBytes)
}
};
var response = await _client.DetectDocumentTextAsync(request); // per-page charge
return string.Join("\n", response.Blocks
.Where(b => b.BlockType == BlockType.LINE)
.Select(b => b.Text));
}
Imports System.IO
Imports System.Threading.Tasks
' Every call to this method costs money — per page, permanently
Public Async Function DetectTextAsync(imagePath As String) As Task(Of String)
Dim imageBytes = File.ReadAllBytes(imagePath) ' Image leaves your network
Dim request = New DetectDocumentTextRequest With {
.Document = New Document With {
.Bytes = New MemoryStream(imageBytes)
}
}
Dim response = Await _client.DetectDocumentTextAsync(request) ' per-page charge
Return String.Join(vbLf, response.Blocks _
.Where(Function(b) b.BlockType = BlockType.LINE) _
.Select(Function(b) b.Text))
End Function
Consult the AWS Textract pricing page for current per-page rates. Different API features (basic text detection, table extraction, forms extraction) have different rates. A document containing tables and form fields incurs higher charges than basic text detection, and costs grow with volume with no upper bound and no way to pay ahead.
At high page volumes, three-year total costs can be substantial, and the meter keeps running.
IronOCR Approach
// One license. No per-page cost. Same code handles 1 page or 1,000,000.
IronOcr.License.LicenseKey = "YOUR-LICENSE-KEY";
var text = new IronTesseract().Read("document.jpg").Text;
Imports IronOcr
' One license. No per-page cost. Same code handles 1 page or 1,000,000.
IronOcr.License.LicenseKey = "YOUR-LICENSE-KEY"
Dim text As String = New IronTesseract().Read("document.jpg").Text
The $2,999 Professional license covers 10 developers, unlimited projects, and unlimited page volume. After year one, the ongoing cost for pages processed is zero. For teams processing significant page volumes, the IronOCR license pays for itself quickly compared to ongoing per-page cloud charges.
The IronOCR licensing page covers tier details, SaaS subscription options for usage-based billing scenarios, and OEM redistribution terms.
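To make the break-even claim concrete, the arithmetic can be sketched as below. This is an illustration under stated assumptions: the $0.01 per-page rate is the AnalyzeExpense figure quoted earlier in this article (other Textract features are priced differently), the 50,000 pages per month volume is hypothetical, and the calculation ignores hosting, support renewals, and engineering time on either side.

```csharp
using System;

public static class BreakEven
{
    // Number of pages at which cumulative per-page charges reach the one-time license price.
    public static long PagesToBreakEven(decimal licensePrice, decimal ratePerPage)
    {
        if (ratePerPage <= 0)
            throw new ArgumentOutOfRangeException(nameof(ratePerPage));
        return (long)Math.Ceiling(licensePrice / ratePerPage);
    }

    // Whole months needed to reach that page count at a steady monthly volume.
    public static int MonthsToBreakEven(decimal licensePrice, decimal ratePerPage, long pagesPerMonth)
    {
        if (pagesPerMonth <= 0)
            throw new ArgumentOutOfRangeException(nameof(pagesPerMonth));
        var pages = PagesToBreakEven(licensePrice, ratePerPage);
        return (int)Math.Ceiling((decimal)pages / pagesPerMonth);
    }
}

// Example with the assumptions above:
//   BreakEven.PagesToBreakEven(2999m, 0.01m)          -> 299900 pages
//   BreakEven.MonthsToBreakEven(2999m, 0.01m, 50000)  -> 6 months
```

Substituting the current rate from the AWS pricing page for the feature you actually use gives a figure specific to your workload.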
Data Sovereignty and Compliance
AWS Textract's architecture makes one guarantee impossible: that your documents stay within your infrastructure. Every OCR operation transmits document content to Amazon's servers.
AWS Textract Approach
// This code sends PHI, legal documents, financial records — whatever is in
// the file — to Amazon Web Services infrastructure
public async Task<string> ProcessSensitiveDocumentAsync(string documentPath)
{
var imageBytes = File.ReadAllBytes(documentPath);
// Data crosses your security perimeter here
var request = new DetectDocumentTextRequest
{
Document = new Document
{
Bytes = new MemoryStream(imageBytes)
}
};
// Amazon processes it; you receive text back
var response = await _client.DetectDocumentTextAsync(request);
return string.Join("\n", response.Blocks
.Where(b => b.BlockType == BlockType.LINE)
.Select(b => b.Text));
}
Imports System.IO
Imports System.Threading.Tasks
Imports Amazon.Textract
Imports Amazon.Textract.Model
Public Class DocumentProcessor
Private _client As AmazonTextractClient
Public Sub New(client As AmazonTextractClient)
_client = client
End Sub
' This code sends PHI, legal documents, financial records — whatever is in
' the file — to Amazon Web Services infrastructure
Public Async Function ProcessSensitiveDocumentAsync(documentPath As String) As Task(Of String)
Dim imageBytes = File.ReadAllBytes(documentPath)
' Data crosses your security perimeter here
Dim request As New DetectDocumentTextRequest With {
.Document = New Document With {
.Bytes = New MemoryStream(imageBytes)
}
}
' Amazon processes it; you receive text back
Dim response = Await _client.DetectDocumentTextAsync(request)
Return String.Join(vbLf, response.Blocks _
.Where(Function(b) b.BlockType = BlockType.LINE) _
.Select(Function(b) b.Text))
End Function
End Class
AWS offers a HIPAA Business Associate Agreement for covered entities, and GovCloud regions provide FedRAMP High authorization. These frameworks do not change the fundamental architecture: documents leave your infrastructure for every operation. For ITAR-controlled technical data, this is not a compliance nuance — it is a prohibition. For CMMC Level 3 workloads with CUI, cloud transmission requires specific authorizations most defense contractors do not hold. For air-gapped systems — research networks, industrial control environments, classified facilities — Textract is simply unavailable.
AWS Textract is available only in a subset of AWS regions; consult the AWS regional services list for the current set. Organizations with data residency requirements outside those regions have no compliant option.
IronOCR Approach
// IronOCR: document bytes never leave this process
public string ProcessSensitiveDocument(string documentPath)
{
// Processes entirely on local hardware — no network call
var ocr = new IronTesseract();
return ocr.Read(documentPath).Text;
}
' IronOCR: document bytes never leave this process
Public Function ProcessSensitiveDocument(documentPath As String) As String
' Processes entirely on local hardware — no network call
Dim ocr As New IronTesseract()
Return ocr.Read(documentPath).Text
End Function
Because IronOCR executes locally, it fits naturally into healthcare workflows processing PHI, legal document systems handling privileged communications, financial applications handling payment card images, and defense contractor pipelines processing CUI. There is no external processor to audit, no BAA to negotiate, no data residency constraint to satisfy. The compliance scope is your organization's own infrastructure.
For teams deploying on AWS infrastructure but needing local processing, IronOCR runs on AWS EC2 and Lambda without any dependency on Textract — the processing happens within your own AWS account boundary rather than Amazon's managed service.
Async Polling vs. Synchronous Processing
The architectural split between Textract's synchronous (single-image) and asynchronous (multi-page PDF) APIs is not a minor API detail. It shapes how services are built, how errors are handled, and how much code maintainers must read and reason about.
AWS Textract Approach
// Full production-grade async processor for Textract PDF handling
public class TextractAsyncProcessor
{
private readonly AmazonTextractClient _textractClient;
private readonly AmazonS3Client _s3Client;
private readonly string _bucketName;
private readonly TimeSpan _pollInterval = TimeSpan.FromSeconds(5);
private readonly TimeSpan _maxWaitTime = TimeSpan.FromMinutes(10);
public async Task<DocumentResult> ProcessDocumentAsync(
string localFilePath,
CancellationToken cancellationToken = default)
{
var s3Key = $"textract-uploads/{Guid.NewGuid()}{Path.GetExtension(localFilePath)}";
try
{
// Phase 1: Upload to S3
await UploadToS3Async(localFilePath, s3Key, cancellationToken);
// Phase 2: Start Textract job
var jobId = await StartTextractJobAsync(s3Key, cancellationToken);
// Phase 3: Poll until complete (up to 10 minutes)
var pollResult = await PollForCompletionAsync(jobId, cancellationToken);
if (!pollResult.Success)
throw new Exception($"Textract job failed: {pollResult.ErrorMessage}");
// Phase 4: Retrieve paginated results
return await GetAllResultsAsync(jobId, cancellationToken);
}
finally
{
// Phase 5: S3 cleanup — must succeed or storage costs accumulate
await DeleteFromS3Async(s3Key, cancellationToken);
}
}
private async Task<(bool Success, string ErrorMessage)> PollForCompletionAsync(
string jobId, CancellationToken cancellationToken)
{
var startTime = DateTime.UtcNow;
int pollCount = 0;
while (DateTime.UtcNow - startTime < _maxWaitTime)
{
cancellationToken.ThrowIfCancellationRequested();
var response = await _textractClient.GetDocumentTextDetectionAsync(
new GetDocumentTextDetectionRequest { JobId = jobId }, cancellationToken);
pollCount++;
switch (response.JobStatus)
{
case JobStatus.SUCCEEDED: return (true, null);
case JobStatus.FAILED: return (false, response.StatusMessage ?? "Unknown error");
case JobStatus.IN_PROGRESS:
await Task.Delay(_pollInterval, cancellationToken);
break;
default:
throw new Exception($"Unknown job status: {response.JobStatus}");
}
}
return (false, "Job timed out");
}
}
Imports System
Imports System.IO
Imports System.Threading
Imports System.Threading.Tasks
Imports Amazon.Textract
Imports Amazon.S3
Imports Amazon.Textract.Model
' Full production-grade async processor for Textract PDF handling
Public Class TextractAsyncProcessor
Private ReadOnly _textractClient As AmazonTextractClient
Private ReadOnly _s3Client As AmazonS3Client
Private ReadOnly _bucketName As String
Private ReadOnly _pollInterval As TimeSpan = TimeSpan.FromSeconds(5)
Private ReadOnly _maxWaitTime As TimeSpan = TimeSpan.FromMinutes(10)
Public Async Function ProcessDocumentAsync(localFilePath As String, Optional cancellationToken As CancellationToken = Nothing) As Task(Of DocumentResult)
Dim s3Key = $"textract-uploads/{Guid.NewGuid()}{Path.GetExtension(localFilePath)}"
Try
' Phase 1: Upload to S3
Await UploadToS3Async(localFilePath, s3Key, cancellationToken)
' Phase 2: Start Textract job
Dim jobId = Await StartTextractJobAsync(s3Key, cancellationToken)
' Phase 3: Poll until complete (up to 10 minutes)
Dim pollResult = Await PollForCompletionAsync(jobId, cancellationToken)
If Not pollResult.Success Then
Throw New Exception($"Textract job failed: {pollResult.ErrorMessage}")
End If
' Phase 4: Retrieve paginated results
Return Await GetAllResultsAsync(jobId, cancellationToken)
Finally
' Phase 5: S3 cleanup — must succeed or storage costs accumulate
Await DeleteFromS3Async(s3Key, cancellationToken)
End Try
End Function
Private Async Function PollForCompletionAsync(jobId As String, cancellationToken As CancellationToken) As Task(Of (Success As Boolean, ErrorMessage As String))
Dim startTime = DateTime.UtcNow
Dim pollCount As Integer = 0
While DateTime.UtcNow - startTime < _maxWaitTime
cancellationToken.ThrowIfCancellationRequested()
Dim response = Await _textractClient.GetDocumentTextDetectionAsync(New GetDocumentTextDetectionRequest With {.JobId = jobId}, cancellationToken)
pollCount += 1
Select Case response.JobStatus
Case JobStatus.SUCCEEDED
Return (True, Nothing)
Case JobStatus.FAILED
Return (False, If(response.StatusMessage, "Unknown error"))
Case JobStatus.IN_PROGRESS
Await Task.Delay(_pollInterval, cancellationToken)
Case Else
Throw New Exception($"Unknown job status: {response.JobStatus}")
End Select
End While
Return (False, "Job timed out")
End Function
End Class
This is not boilerplate that can be generated and forgotten. When a Textract job fails mid-flight, the S3 cleanup must still run. When a job times out after 10 minutes, the caller needs a clean error. When the network drops during polling, the retry strategy must not create duplicate jobs. Each of these failure modes requires explicit handling — the structure shown above is the minimum responsible implementation.
Batch processing adds another layer: Textract's default StartDocumentTextDetection TPS limit is 5 requests per second. Processing 100 documents requires a SemaphoreSlim throttle, a rate-replenishment timer, and retry logic for ProvisionedThroughputExceededException.
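The throttle-and-retry pattern described above might look roughly like the sketch below. It is deliberately SDK-free: ThrottledRunner is a hypothetical helper, the operation is any async delegate (in practice, the call that starts a Textract job), and the isThrottle predicate is where a caller would test for ProvisionedThroughputExceededException. Each acquired slot is held for one full window, which caps the number of calls started in any window at the configured limit.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical helper: caps call starts per time window and retries throttled calls.
public sealed class ThrottledRunner
{
    private readonly SemaphoreSlim _slots;
    private readonly TimeSpan _window;
    private readonly TimeSpan _baseBackoff;

    public ThrottledRunner(int maxPerWindow, TimeSpan window, TimeSpan baseBackoff)
    {
        _slots = new SemaphoreSlim(maxPerWindow, maxPerWindow);
        _window = window;
        _baseBackoff = baseBackoff;
    }

    public async Task<T> RunAsync<T>(
        Func<Task<T>> operation,
        Func<Exception, bool> isThrottle,
        int maxAttempts = 5,
        CancellationToken ct = default)
    {
        for (var attempt = 1; ; attempt++)
        {
            await _slots.WaitAsync(ct);

            // Give the slot back once the window has elapsed, whatever the outcome.
            _ = Task.Delay(_window, CancellationToken.None)
                    .ContinueWith(t => _slots.Release(), TaskScheduler.Default);

            try
            {
                return await operation();
            }
            catch (Exception ex) when (isThrottle(ex) && attempt < maxAttempts)
            {
                // Exponential backoff: base, 2x base, 4x base, ...
                var factor = Math.Pow(2, attempt - 1);
                await Task.Delay(
                    TimeSpan.FromMilliseconds(_baseBackoff.TotalMilliseconds * factor), ct);
            }
        }
    }
}
```

For the default limit mentioned above, a runner would be built with five slots and a one-second window. Non-throttle exceptions, and the final throttled attempt, propagate to the caller unchanged.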
IronOCR Approach
// IronOCR: same synchronous API regardless of document type or size
public string ProcessDocument(string filePath)
{
using var input = new OcrInput();
if (Path.GetExtension(filePath).Equals(".pdf", StringComparison.OrdinalIgnoreCase))
input.LoadPdf(filePath);
else
input.LoadImage(filePath);
return new IronTesseract().Read(input).Text;
}
Imports System.IO
' IronOCR: same synchronous API regardless of document type or size
Public Function ProcessDocument(filePath As String) As String
Using input As New OcrInput()
If Path.GetExtension(filePath).Equals(".pdf", StringComparison.OrdinalIgnoreCase) Then
input.LoadPdf(filePath)
Else
input.LoadImage(filePath)
End If
Return New IronTesseract().Read(input).Text
End Using
End Function
There is no polling loop, no job ID tracking, no S3 bucket, no result pagination. The same code handles a single JPEG and a 200-page PDF. Processing completes or throws — no intermediate "in progress" state to manage. For batch processing, IronOCR is thread-safe and a single IronTesseract instance handles Parallel.ForEach without locks or semaphores.
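A batch loop along those lines could be sketched as follows. To keep the example self-contained, the OCR call is passed in as a delegate: readText is a hypothetical parameter that, in a real application, would wrap something like path => ocr.Read(path).Text on a shared IronTesseract instance. The BatchOcr class name and the parallelism setting are illustrative choices, not part of any library.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Threading.Tasks;

public static class BatchOcr
{
    // Runs readText over every path in parallel and returns the text keyed by path.
    public static IReadOnlyDictionary<string, string> ReadAll(
        IEnumerable<string> paths,
        Func<string, string> readText,
        int maxDegreeOfParallelism)
    {
        // ConcurrentDictionary lets worker threads store results without explicit locks.
        var results = new ConcurrentDictionary<string, string>();
        var options = new ParallelOptions { MaxDegreeOfParallelism = maxDegreeOfParallelism };

        Parallel.ForEach(paths, options, path =>
        {
            results[path] = readText(path);
        });

        return results;
    }
}
```

If any document fails, Parallel.ForEach surfaces the failures as an AggregateException after the loop stops, so callers that want per-file error reporting would catch inside the delegate instead.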
The IronTesseract setup guide covers configuration, and the PDF input guide documents page range selection, password-protected PDFs, and stream-based input for PDFs retrieved from databases or HTTP responses.
Credential Management Overhead
Starting an OCR operation with AWS Textract involves IAM configuration before a single page is processed.
AWS Textract Approach
Before calling DetectDocumentTextAsync, a developer must:
- Create an AWS account or obtain access to an existing one
- Create an IAM user or role with textract:DetectDocumentText and textract:AnalyzeDocument permissions
- Generate and securely store access key ID and secret access key
- Configure credential resolution: environment variables, AWS credentials file, or EC2 instance profile
- If processing PDFs: create an S3 bucket, configure bucket policy, add s3:PutObject and s3:DeleteObject permissions
- Implement credential rotation policies to meet security standards
- Store credentials securely in each deployment environment — Docker secrets, Kubernetes secrets, AWS Secrets Manager, or CI/CD pipeline variables
// Every environment needs these configured before this constructor succeeds
public TextractOcrService()
{
// Reads credentials from environment, ~/.aws/credentials, or IAM role
_client = new AmazonTextractClient(Amazon.RegionEndpoint.USEast1);
}
' Every environment needs these configured before this constructor succeeds
Public Sub New()
' Reads credentials from environment, ~/.aws/credentials, or IAM role
_client = New AmazonTextractClient(Amazon.RegionEndpoint.USEast1)
End Sub
When credentials expire, rotate, or are misconfigured, every OCR call fails with an AmazonTextractException (for example, ErrorCode == "AccessDeniedException" when a required permission is missing). In a production system, this means implementing specific catch blocks for credential failures and monitoring for IAM policy drift.
IronOCR Approach
// One-time setup at application startup
IronOcr.License.LicenseKey = "YOUR-LICENSE-KEY";
// Or from environment — recommended for deployments
IronOcr.License.LicenseKey = Environment.GetEnvironmentVariable("IRONOCR_LICENSE");
' One-time setup at application startup
IronOcr.License.LicenseKey = "YOUR-LICENSE-KEY"
' Or from environment — recommended for deployments
IronOcr.License.LicenseKey = Environment.GetEnvironmentVariable("IRONOCR_LICENSE")
The license key is a static string. It does not expire mid-operation, does not require rotation, and carries no permissions to manage. A Docker container that processes documents does not need injected AWS credentials, an IAM role bound to an execution context, or network access to AWS STS for token refresh.
The complete credential overhead reduction when moving from Textract to IronOCR: three NuGet packages removed (AWSSDK.Textract, AWSSDK.S3, AWSSDK.Core), all AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY / AWS_DEFAULT_REGION environment variables removed, and IAM roles and S3 bucket configurations decommissioned. The image input guide and stream input guide cover the full range of input methods that replace Textract's byte-array and S3-object document models.
API Mapping Reference
| AWS Textract API | IronOCR Equivalent |
|---|---|
| AmazonTextractClient | IronTesseract |
| AmazonS3Client | Not required |
| DetectDocumentTextRequest | OcrInput |
| DetectDocumentTextResponse | OcrResult |
| AnalyzeDocumentRequest | OcrInput with CropRectangle for zones |
| StartDocumentTextDetectionRequest | OcrInput (synchronous, no start needed) |
| GetDocumentTextDetectionRequest | Not required (results are immediate) |
| Document.Bytes | input.LoadImage(bytes) or input.LoadImage(stream) |
| S3Object (document staging) | File path string or stream |
| Block (BlockType.LINE) | result.Lines |
| Block (BlockType.WORD) | result.Words |
| Block (BlockType.TABLE) | Word position grouping via result.Words |
| Block (BlockType.KEY_VALUE_SET) | CropRectangle region extraction |
| Block.Confidence | word.Confidence / result.Confidence |
| JobStatus.SUCCEEDED | Not applicable (synchronous return) |
| JobStatus.IN_PROGRESS | Not applicable (no async state) |
| response.NextToken (pagination) | Not applicable (results not paginated) |
| ProvisionedThroughputExceededException | Not applicable (no TPS limits) |
| client.DetectDocumentTextAsync(request) | ocr.Read(path) |
| client.AnalyzeDocumentAsync(request) | ocr.Read(input) |
| client.StartDocumentTextDetectionAsync(request) | ocr.Read(input) |
| client.GetDocumentTextDetectionAsync(request) | Not applicable |
When Teams Consider Moving from AWS Textract to IronOCR
When the Monthly Bill Becomes a Budget Line Item
Teams that started with Textract at low volume often encounter a specific moment: the AWS bill for OCR processing appears in a quarterly budget review and someone asks whether this cost is fixed. It is not. At high page volumes, annual Textract costs can be substantial — consult the AWS Textract pricing page for current rates. The IronOCR Professional license at $2,999 one-time pays for itself quickly at moderate to high page volumes.
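The break-even point is simple arithmetic: divide the one-time license cost by the per-page rate. The sketch below uses the $2,999 Professional license figure from this article and, purely as an illustrative assumption, the $0.01 per-page AnalyzeExpense rate listed earlier; substitute the current rate for whichever Textract API you actually use. `OcrCostModel` is a hypothetical helper name.

```csharp
using System;

// Hypothetical helper for comparing a one-time license cost with per-page billing.
public static class OcrCostModel
{
    // Number of pages at which cumulative per-page fees reach the license cost.
    public static long BreakEvenPages(decimal licenseCost, decimal perPageRate)
    {
        if (perPageRate <= 0m) throw new ArgumentOutOfRangeException(nameof(perPageRate));
        return (long)Math.Ceiling(licenseCost / perPageRate);
    }

    // Months until break-even at a steady monthly page volume.
    public static decimal MonthsToBreakEven(decimal licenseCost, decimal perPageRate, long pagesPerMonth)
    {
        if (perPageRate <= 0m) throw new ArgumentOutOfRangeException(nameof(perPageRate));
        if (pagesPerMonth <= 0) throw new ArgumentOutOfRangeException(nameof(pagesPerMonth));
        return licenseCost / (perPageRate * pagesPerMonth);
    }
}
```

With those inputs, `OcrCostModel.BreakEvenPages(2999m, 0.01m)` returns 299,900 pages, and at an assumed 50,000 pages per month `MonthsToBreakEven` returns 5.998, just under six months.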
When a Compliance Requirement Blocks Cloud Processing
Healthcare organizations implementing document digitization workflows frequently discover mid-project that HIPAA PHI cannot flow through cloud services without a BAA and additional legal review, or that their security team prohibits cloud transmission entirely. Defense contractors handling technical drawings, specifications, or any CUI face ITAR and CMMC constraints that exclude AWS Textract from consideration. Legal firms processing privileged communications have similar concerns. These are not theoretical compliance edge cases — they appear regularly in procurement reviews, security audits, and contract negotiations. IronOCR processes locally, so the compliance question for document data reduces to whether your own infrastructure is in scope, not whether Amazon's infrastructure is in scope.
When the Async PDF Complexity Exceeds Its Value
The five-phase S3-async pipeline — upload, start job, poll, paginate results, clean up — is not technically difficult to implement. It is difficult to maintain, test, and operate. Every phase is a failure point. S3 upload failures require retry logic. Textract job failures require distinguishing transient from permanent errors. Polling timeouts require timeout handling separate from cancellation. Result pagination requires accumulating state across multiple API calls. S3 cleanup failures require alerting because orphaned objects accumulate costs. Teams that have shipped this pipeline into production typically spend more ongoing engineering time maintaining it than they spent building it. The IronOCR equivalent — input.LoadPdf(path) followed by ocr.Read(input) — eliminates all five phases and their associated failure modes.
When Deployment Environments Lack Internet Access
Docker containers running in isolated network segments, on-premise servers without outbound internet, air-gapped research environments, and industrial systems with strict network controls all share one characteristic: AWS Textract is not available. IronOCR installs as a standard NuGet package and operates without any network calls after installation. Teams running .NET applications in these environments have no Textract option and need a library that processes locally. The Docker deployment guide and Linux deployment guide cover the specific configuration for containerized environments.
When Rate Limit Throttling Disrupts Batch Workflows
The default StartDocumentTextDetection TPS limit is 5 requests per second. DetectDocumentText synchronous calls are also rate-limited. Batch jobs processing hundreds or thousands of documents must implement SemaphoreSlim throttling, exponential backoff on ProvisionedThroughputExceededException, and rate-replenishment timers. AWS supports TPS limit increase requests, but they require justification, review, and are not guaranteed. IronOCR processes as fast as local CPU allows — a 32-core server processes 32 documents concurrently without throttle configuration or service tier negotiation.
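To make that overhead concrete, here is a minimal sketch of the throttling wrapper a Textract batch job tends to need: a SemaphoreSlim that caps concurrent calls, plus exponential backoff when a call is throttled. `ThrottledCaller`, its parameters, and the default delay values are hypothetical and chosen for illustration; the operation and the throttle check are passed in as delegates so the sketch carries no AWS SDK dependency.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical helper: caps concurrency and retries throttled calls with exponential backoff.
public static class ThrottledCaller
{
    public static async Task<T> RunWithBackoffAsync<T>(
        Func<Task<T>> operation,
        Func<Exception, bool> isThrottle,
        SemaphoreSlim gate,
        int maxAttempts = 5,
        int baseDelayMs = 200)
    {
        // Wait for a free slot so no more than the gate's capacity run at once.
        await gate.WaitAsync();
        try
        {
            for (int attempt = 1; ; attempt++)
            {
                try
                {
                    return await operation();
                }
                catch (Exception ex) when (isThrottle(ex) && attempt < maxAttempts)
                {
                    // Delay doubles on each retry: base, 2x base, 4x base, ...
                    int delayMs = baseDelayMs * (1 << (attempt - 1));
                    await Task.Delay(delayMs);
                }
            }
        }
        finally
        {
            // Always free the slot, whether the call succeeded or finally failed.
            gate.Release();
        }
    }
}
```

In a Textract pipeline, the throttle check would typically be `ex => ex is ProvisionedThroughputExceededException`. The slot is held while backing off, which keeps total pressure on the service bounded; once the last attempt fails, the exception propagates to the caller.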
Common Migration Considerations
Replacing the Block Graph with Direct Collections
Textract represents all results as a flat List<Block> where lines, words, cells, tables, and key-value pairs are distinguished by BlockType and linked by relationship ID arrays. IronOCR provides direct typed collections.
// Textract: filter flat block list by type
var textractLines = response.Blocks.Where(b => b.BlockType == BlockType.LINE);
var textractWords = response.Blocks.Where(b => b.BlockType == BlockType.WORD);
// IronOCR: direct access to typed collections
// (distinct variable names so both halves compile in one scope)
var result = ocr.Read(imagePath);
var lines = result.Lines; // IEnumerable<OcrResult.OcrResultLine>
var words = result.Words; // IEnumerable<OcrResult.OcrResultWord>
foreach (var word in words)
Console.WriteLine($"'{word.Text}' at ({word.X},{word.Y}) confidence {word.Confidence}%");
' Textract: filter flat block list by type
Dim textractLines = response.Blocks.Where(Function(b) b.BlockType = BlockType.LINE)
Dim textractWords = response.Blocks.Where(Function(b) b.BlockType = BlockType.WORD)
' IronOCR: direct access to typed collections
' (distinct variable names so both halves compile in one scope)
Dim result = ocr.Read(imagePath)
Dim lines = result.Lines ' IEnumerable(Of OcrResult.OcrResultLine)
Dim words = result.Words ' IEnumerable(Of OcrResult.OcrResultWord)
For Each word In words
Console.WriteLine($"'{word.Text}' at ({word.X},{word.Y}) confidence {word.Confidence}%")
Next
The structured results guide covers result.Pages, result.Paragraphs, result.Lines, result.Words, and coordinate access for building layout-aware document processing.
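Where Textract code relied on BlockType.TABLE, the mapping table above suggests grouping words by position instead. A minimal sketch of one such approach follows: words whose Y coordinates fall within a tolerance of the first word in a row are treated as the same row, and each row is then ordered left to right. `WordBox` and `RowGrouper` are hypothetical types introduced for illustration; in practice each `WordBox` would be built from the `Text`, `X`, and `Y` values of `result.Words`, and the tolerance would need tuning per document.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical plain data type holding the fields read from each OCR word.
public record WordBox(string Text, int X, int Y);

public static class RowGrouper
{
    // Groups words into rows by vertical position, then sorts each row left to right.
    public static List<List<WordBox>> GroupIntoRows(IEnumerable<WordBox> words, int yTolerance)
    {
        var rows = new List<List<WordBox>>();

        foreach (var word in words.OrderBy(w => w.Y).ThenBy(w => w.X))
        {
            var current = rows.Count > 0 ? rows[rows.Count - 1] : null;

            // Compare against the first word of the current row so the row cannot drift downward.
            if (current != null && Math.Abs(word.Y - current[0].Y) <= yTolerance)
                current.Add(word);
            else
                rows.Add(new List<WordBox> { word });
        }

        foreach (var row in rows)
            row.Sort((a, b) => a.X.CompareTo(b.X));

        return rows;
    }
}
```

This handles simple grid-like tables only. Skewed scans, wrapped cell text, and merged cells would need column detection on top of it.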
Replacing S3-Staged PDF Processing with Direct LoadPdf
Any Textract code that uploads to S3 before starting a detection job can be replaced with a direct PDF load. No staging bucket, no upload timing, no cleanup logic.
// Textract: upload to S3 → start job → poll → paginate → cleanup (50+ lines)
// IronOCR equivalent:
public string ProcessPdf(string pdfPath)
{
var ocr = new IronTesseract();
using var input = new OcrInput();
input.LoadPdf(pdfPath);
return ocr.Read(input).Text;
}
// Specific page ranges (no Textract equivalent without async job per range)
public string ProcessPdfPages(string pdfPath, int startPage, int endPage)
{
var ocr = new IronTesseract();
using var input = new OcrInput();
input.LoadPdfPages(pdfPath, startPage, endPage);
return ocr.Read(input).Text;
}
Imports IronOcr
Public Class PdfProcessor
' Textract: upload to S3 → start job → poll → paginate → cleanup (50+ lines)
' IronOCR equivalent:
Public Function ProcessPdf(pdfPath As String) As String
Dim ocr As New IronTesseract()
Using input As New OcrInput()
input.LoadPdf(pdfPath)
Return ocr.Read(input).Text
End Using
End Function
' Specific page ranges (no Textract equivalent without async job per range)
Public Function ProcessPdfPages(pdfPath As String, startPage As Integer, endPage As Integer) As String
Dim ocr As New IronTesseract()
Using input As New OcrInput()
input.LoadPdfPages(pdfPath, startPage, endPage)
Return ocr.Read(input).Text
End Using
End Function
End Class
Adding Preprocessing for Documents That Produced Low Confidence in Textract
Textract's preprocessing is internal and not configurable. When a scanned document produces poor results, the only options are retrying or accepting low-confidence output. IronOCR exposes the preprocessing pipeline directly.
// For documents that returned low-confidence results from Textract
using var input = new OcrInput();
input.LoadImage("low-quality-scan.jpg");
input.Deskew(); // Fix rotation from scanner misalignment
input.DeNoise(); // Remove scanner noise artifacts
input.Contrast(); // Boost faint text
input.EnhanceResolution(300); // Scale to optimal OCR resolution
var result = new IronTesseract().Read(input);
Console.WriteLine($"Confidence: {result.Confidence}%");
Imports IronOcr
' For documents that returned low-confidence results from Textract
Using input As New OcrInput()
input.LoadImage("low-quality-scan.jpg")
input.Deskew() ' Fix rotation from scanner misalignment
input.DeNoise() ' Remove scanner noise artifacts
input.Contrast() ' Boost faint text
input.EnhanceResolution(300) ' Scale to optimal OCR resolution
Dim result = New IronTesseract().Read(input)
Console.WriteLine($"Confidence: {result.Confidence}%")
End Using
The image quality correction guide and image filters tutorial document the full preprocessing pipeline and combinations that work best for specific document types. For confidence score interpretation and per-element confidence access, the confidence scores guide covers the result.Confidence property and per-word confidence values.
Handling the Async-to-Synchronous Pattern Change
Existing Textract code is typically async Task<T> throughout, because on .NET Core and later the AWS SDK for .NET exposes only asynchronous methods. The IronOCR calls shown in this article are synchronous. For application code that already has an async call chain, wrap the IronOCR call in Task.Run to keep the async boundary.
// Preserves async call site for minimal refactoring
public async Task<string> ExtractTextAsync(string path)
{
return await Task.Run(() => new IronTesseract().Read(path).Text);
}
Imports System.Threading.Tasks
' Preserves async call site for minimal refactoring
Public Async Function ExtractTextAsync(path As String) As Task(Of String)
Return Await Task.Run(Function() New IronTesseract().Read(path).Text)
End Function
This is a convenience wrapper, not a requirement. For server-side processing where the calling code already runs on a background thread, call the synchronous API directly.
Additional IronOCR Capabilities
Beyond the comparison points above, IronOCR provides capabilities that either have no direct AWS Textract equivalent or that Textract bills as a separate API:
- Barcode reading during OCR: Set ocr.Configuration.ReadBarCodes = true and barcodes in the document are extracted alongside text in one pass, with no separate barcode scanning step
- Progress tracking for long jobs: Subscribe to progress events for multi-page processing without polling an external service
- Scanned document processing: Optimized pipeline for typical office scanner output including duplex scans and mixed-orientation pages
- Multi-language simultaneous extraction: Combine language packs at read time (OcrLanguage.French + OcrLanguage.German) with no API tier change
- Passport and ID reading: Dedicated pipeline for machine-readable zones on identity documents, extracting structured fields without manual region definition
.NET Compatibility and Future Readiness
IronOCR targets .NET 8 and .NET 9, with active compatibility for .NET Standard 2.0 projects and .NET Framework 4.6.2 through 4.8. The library ships native binaries for Windows x64, Windows x86, Linux x64, and macOS via a single NuGet package — no runtime identifier switching or platform-specific package references. AWS Textract's AWSSDK.Textract package supports the same modern .NET targets, but the deployment model carries the full AWS SDK dependency tree, IAM credential infrastructure, and the architectural constraints documented throughout this article. IronOCR maintains active development with regular releases tracking Tesseract 5 engine updates and .NET runtime advances, including compatibility with .NET 10 when released.
Conclusion
AWS Textract and IronOCR solve the same problem — extracting text from documents in .NET applications — with fundamentally incompatible architectural assumptions. Textract assumes documents can leave your network, that cloud service costs scale linearly with volume, and that multi-page PDFs justify a five-phase async pipeline with S3 staging. IronOCR assumes documents stay where they are processed, that license costs should be decoupled from volume, and that PDF processing should require the same three lines of code as image processing.
The cost arithmetic is the clearest dividing line. At low volumes, Textract's per-page fees are manageable. As volume grows, annual costs compound significantly. At high page volumes with table extraction, multi-year Textract costs can vastly exceed even IronOCR's Unlimited license at $5,999. The opening math holds: the per-page model adds up fast, and it never stops.
Data sovereignty is the second structural constraint. For healthcare, legal, financial, and government workloads, the question of where documents are processed is not a preference — it is a compliance requirement. IronOCR processes locally by design, not by configuration. There is no "local mode" to enable; local processing is the only mode. That makes the compliance answer simple: your documents stay in your infrastructure because there is nowhere else for them to go.
For teams evaluating OCR at genuine scale, or operating in environments where document data cannot leave internal infrastructure, IronOCR's documentation provides the complete API reference, deployment guides for Docker, AWS, Azure, and Linux, and tutorials covering the full range of OCR use cases from basic image reading to searchable PDF generation and multi-language extraction.
Frequently Asked Questions
What is Amazon Textract?
Amazon Textract is Amazon's managed, cloud-based document analysis service. .NET applications call it through the AWSSDK.Textract NuGet package, sending each document to AWS and receiving structured results. It is one of several OCR options evaluated alongside IronOCR for .NET application development.
How does IronOCR compare to Amazon Textract for .NET developers?
IronOCR is a NuGet-native .NET OCR library using IronTesseract as its core engine. Compared to Amazon Textract, it offers simpler deployment (no AWS account, IAM credentials, or S3 staging bucket), flat-rate pricing, and a synchronous C# API that runs locally without cloud dependencies.
Is IronOCR easier to set up than Amazon Textract?
IronOCR installs via a single NuGet package and needs only a license key. There is no AWS account to create, no IAM policy or credentials to manage, and no S3 bucket to provision. The entire OCR engine is bundled in the package.
What accuracy differences exist between Amazon Textract and IronOCR?
IronOCR achieves high recognition accuracy for standard business documents, invoices, receipts, and scanned forms. For highly degraded documents or uncommon scripts, accuracy varies by source quality. IronOCR includes image preprocessing filters to improve recognition on low-quality inputs.
Does IronOCR support PDF text extraction?
Yes. IronOCR extracts text from both native PDFs and scanned PDF images in a single call. It also supports multi-page TIFF files, images, and streams. For scanned PDFs, OCR is applied page-by-page with per-page result objects.
How does Amazon Textract licensing compare to IronOCR?
IronOCR uses a flat-rate perpetual license with no per-page or per-scan charges. Organizations processing high document volumes pay the same license cost regardless of volume. Details and volume pricing are on the IronOCR licensing page.
What languages does IronOCR support?
IronOCR supports 127 languages via separate NuGet language packs. Adding a language requires a single 'dotnet add package IronOcr.Languages.{Language}' command. No manual file placement or path configuration is needed.
How do I install IronOCR in a .NET project?
Install via NuGet: 'Install-Package IronOcr' in Package Manager Console or 'dotnet add package IronOcr' in the CLI. Additional language packs are installed the same way. No native SDK installer is required.
Is IronOCR suitable for Docker and containerized deployments, unlike Amazon Textract?
Yes. IronOCR works in Docker containers via its NuGet package. The license key is set via an environment variable. No license files, SDK paths, or volume mounts are required for the OCR engine itself.
Can I try IronOCR before purchasing, compared to Amazon Textract?
Yes. IronOCR trial mode processes documents and returns OCR results with a watermark overlay on output. You can verify accuracy on your own documents before purchasing a license.
Does IronOCR support barcode reading alongside text extraction?
Yes. Setting ocr.Configuration.ReadBarCodes = true extracts barcodes alongside text in the same pass. For dedicated barcode workloads, Iron Software provides IronBarcode as a companion library. Both are available individually or as part of the Iron Suite bundle.
Is it easy to migrate from Amazon Textract to IronOCR?
Migration from Amazon Textract to IronOCR typically involves replacing AmazonTextractClient with IronTesseract, removing the S3 staging, job polling, and result pagination code, and swapping Block graph traversal for the typed result collections. Most migrations reduce code complexity significantly.

