High Peak Memory During Bulk OCR
Running OCR over many PDF segments at once multiplies memory use. Each task renders full-page bitmaps through an OcrInput, and a fresh IronTesseract engine per segment reloads the language model files every time. At full processor concurrency this pushes peak memory into the multi-GB range, with spikes that fail in memory-limited environments.
OCR is memory-heavy by nature. Every OcrInput renders full-page bitmaps, and every IronTesseract engine loads language model files into memory. Creating a new engine per segment reloads those models repeatedly, and running one OCR task per logical processor (Environment.ProcessorCount) keeps that many bitmap-heavy jobs alive side by side. With nothing limiting how many tasks are active, peak memory grows roughly in proportion to concurrency.
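To make the scaling concrete, here is a back-of-the-envelope sketch of the bitmap memory held by in-flight tasks. Every input is an assumption chosen for illustration (US Letter pages, 300 DPI, 4 bytes per pixel, 10 pages per segment), not a measurement of IronOCR, and EstimatePeakBitmapBytes is a hypothetical helper.

```csharp
using System;

// Compare an uncapped run (16 tasks) with a capped run (4 tasks).
long uncapped = EstimatePeakBitmapBytes(16, 10, 8.5, 11.0, 300, 4);
long capped = EstimatePeakBitmapBytes(4, 10, 8.5, 11.0, 300, 4);

const double GiB = 1024.0 * 1024 * 1024;
Console.WriteLine($"16 tasks: {uncapped / GiB:F2} GiB of page bitmaps");
Console.WriteLine($" 4 tasks: {capped / GiB:F2} GiB of page bitmaps");

// Hypothetical helper: bytes of uncompressed page bitmaps alive at once,
// assuming every in-flight task holds all pages of its segment.
static long EstimatePeakBitmapBytes(int concurrency, int pagesPerSegment,
                                    double widthInches, double heightInches,
                                    int dpi, int bytesPerPixel)
{
    long widthPx = (long)Math.Round(widthInches * dpi);
    long heightPx = (long)Math.Round(heightInches * dpi);
    long bytesPerPage = widthPx * heightPx * bytesPerPixel;
    return concurrency * pagesPerSegment * bytesPerPage;
}
```

With these assumed inputs, the 16-task case works out to roughly 5 GiB of bitmaps and the 4-task case to about 1.25 GiB, before counting any language model data.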
The fix is to bound the number of in-flight jobs: cap concurrency, reuse engines from a pool, and gate work with a semaphore.
Solution
1. Cap OCR concurrency
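Clamp the number of simultaneous OCR tasks to a small ceiling. Fewer concurrent tasks mean fewer full-page bitmaps in memory at once, which directly lowers the peak. The snippet below uses half the logical processor count, clamped to between 1 and 4; raise or lower that ceiling to match the memory available on the target machine.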
// Clamp concurrency to avoid memory saturation and CPU over-subscription.
int concurrency = Math.Clamp(Environment.ProcessorCount / 2, 1, 4);
Imports System
' Clamp concurrency to avoid memory saturation and CPU over-subscription.
Dim concurrency As Integer = Math.Clamp(Environment.ProcessorCount \ 2, 1, 4)
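As a quick sanity check, the same clamp can be evaluated for a range of core counts. OcrConcurrencyFor is a hypothetical wrapper introduced only for this illustration; the formula inside it is the one shown above.

```csharp
using System;

// Print the resulting OCR concurrency for several machine sizes.
foreach (int cores in new[] { 1, 2, 4, 8, 16, 64 })
{
    Console.WriteLine($"{cores} logical cores -> {OcrConcurrencyFor(cores)} concurrent OCR tasks");
}

// Hypothetical wrapper around the clamp so it can be tested for any core count.
// Integer division rounds down, so a single-core machine yields 0 before clamping.
static int OcrConcurrencyFor(int logicalCores) => Math.Clamp(logicalCores / 2, 1, 4);
```

Machines with one or two cores run a single OCR task, a four-core machine runs two, and anything with eight or more cores is held at four.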
2. Pool the engines
Create exactly one IronTesseract engine per concurrent slot at startup and reuse them across every segment, rather than constructing a new engine and reloading the language model each time.
using System.Collections.Concurrent;
using System.Linq;
using IronOcr;
// Pre-create one engine per concurrent slot and reuse them across segments.
var enginePool = new ConcurrentBag<IronTesseract>(
    Enumerable.Range(0, concurrency).Select(_ => new IronTesseract())
);
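Imports System.Collections.Concurrent
Imports System.Linq
Imports IronOcr
' Pre-create one engine per concurrent slot and reuse them across segments.
' The lambda parameter is named i because a lone underscore is not a valid VB identifier.
Dim enginePool As New ConcurrentBag(Of IronTesseract)(
    Enumerable.Range(0, concurrency).Select(Function(i) New IronTesseract())
)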
Building the pool once amortizes the language-model load cost across the whole run instead of paying it per segment.
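The amortization can be checked without IronOCR installed by substituting a stand-in engine that counts how often it is constructed. StubEngine below is hypothetical and does no OCR; only the rent-and-return pattern matches the pool described above. Segments are processed sequentially here to keep the sketch short.

```csharp
using System;
using System.Collections.Concurrent;
using System.Linq;
using System.Threading;

int concurrency = 3;

// Build the pool once, one engine per concurrent slot.
var pool = new ConcurrentBag<StubEngine>(
    Enumerable.Range(0, concurrency).Select(_ => new StubEngine()));

// Process 12 segments; each one rents an engine and returns it afterwards.
for (int segment = 0; segment < 12; segment++)
{
    if (!pool.TryTake(out var engine))
        engine = new StubEngine(); // fallback, not expected to run here

    try { engine.Read(segment); }
    finally { pool.Add(engine); }
}

int constructed = StubEngine.Constructed;
Console.WriteLine($"Segments: 12, engines constructed: {constructed}");

// Hypothetical stand-in for an OCR engine; it only counts constructions.
sealed class StubEngine
{
    private static int _constructed;
    public static int Constructed => Volatile.Read(ref _constructed);

    public StubEngine() => Interlocked.Increment(ref _constructed);

    public void Read(int segment) { /* real OCR work would happen here */ }
}
```

Twelve segments are served by three constructions, so the expensive setup is paid once per slot rather than once per segment.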
3. Gate work with a semaphore
Create a single SemaphoreSlim initialized to the concurrency limit, share it across all segment tasks, and dispose it with using once the run completes. Each task calls WaitAsync() before it starts and Release() in a finally, so no more than the allowed number of segments is ever in flight at once.
using System.Threading;
using IronOcr;
using var semaphore = new SemaphoreSlim(concurrency);
await semaphore.WaitAsync();
try
{
    // Rent a pre-loaded engine from the pool.
    if (!enginePool.TryTake(out var ocr))
        ocr = new IronTesseract(); // Defensive fallback; should never be reached.
    try
    {
        using var input = new OcrInput();
        input.LoadPdf(segmentStream); // page-range segment produced upstream
        var ocrResult = await ocr.ReadAsync(input);
        ocrResult.SaveAsSearchablePdf(outputPath);
    }
    finally
    {
        enginePool.Add(ocr); // Return engine to pool for the next waiting segment.
    }
}
finally
{
    semaphore.Release();
}
Imports System.Threading
Imports IronOcr
Using semaphore As New SemaphoreSlim(concurrency)
    Await semaphore.WaitAsync()
    Try
        ' Rent a pre-loaded engine from the pool.
        Dim ocr As IronTesseract = Nothing
        If Not enginePool.TryTake(ocr) Then
            ocr = New IronTesseract() ' Defensive fallback; should never be reached.
        End If
        Try
            Using input As New OcrInput()
                input.LoadPdf(segmentStream) ' page-range segment produced upstream
                Dim ocrResult = Await ocr.ReadAsync(input)
                ocrResult.SaveAsSearchablePdf(outputPath)
            End Using
        Finally
            enginePool.Add(ocr) ' Return engine to pool for the next waiting segment.
        End Try
    Finally
        semaphore.Release()
    End Try
End Using
Awaiting WaitAsync() suspends the task, without blocking a thread, until a slot frees up. Returning the engine in the inner finally hands a pre-loaded engine straight to the next waiting segment.
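The snippet above shows a single segment. A minimal sketch of how the gate behaves when many segments are queued at once follows, with Task.Delay standing in for the OCR work. ProcessSegmentAsync and the in-flight counters are hypothetical names added for illustration; only the SemaphoreSlim pattern is taken from the text.

```csharp
using System;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

int concurrency = 3;
int inFlight = 0;
int maxInFlight = 0;
var counterLock = new object();

// One semaphore for the whole run, shared by every segment task.
using var semaphore = new SemaphoreSlim(concurrency);

async Task ProcessSegmentAsync(int segment)
{
    await semaphore.WaitAsync(); // waits asynchronously for a free slot
    try
    {
        lock (counterLock)
        {
            inFlight++;
            maxInFlight = Math.Max(maxInFlight, inFlight);
        }

        await Task.Delay(20); // stand-in for loading the input and running OCR

        lock (counterLock) { inFlight--; }
    }
    finally
    {
        semaphore.Release(); // always free the slot, even if the work throws
    }
}

// Start all 20 segments at once; the semaphore limits how many actually run.
await Task.WhenAll(Enumerable.Range(0, 20).Select(i => ProcessSegmentAsync(i)));

Console.WriteLine($"Max segments in flight: {maxInFlight} (limit {concurrency})");
```

All twenty tasks are created up front, but the recorded maximum never exceeds the semaphore's initial count, which is the property that bounds peak memory.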
4. Dispose OcrInput per segment
Wrap each OcrInput in using so its rendered page bitmaps are released the moment the segment is read, before the next task claims the slot.
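Disposing each OcrInput is what keeps bitmap memory from accumulating across segments. Without using, a finished segment's page bitmaps can stay in memory after its slot has been released, until the garbage collector eventually reclaims them.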
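The timing of that release can be illustrated with a stand-in disposable that tracks how many pages are alive. StubInput is hypothetical and only models the lifetime; it does not reflect how OcrInput manages bitmaps internally.

```csharp
using System;

int liveInsideScope;
{
    using var input = new StubInput(pageCount: 10);
    // The segment would be read here, while its pages are still alive.
    liveInsideScope = StubInput.LivePages;
} // input.Dispose() runs at the end of this block, before the slot moves on

int liveAfterScope = StubInput.LivePages;
Console.WriteLine($"Pages alive inside scope: {liveInsideScope}, after scope: {liveAfterScope}");

// Hypothetical stand-in for an input object that owns page bitmaps.
sealed class StubInput : IDisposable
{
    public static int LivePages { get; private set; }

    private readonly int _pages;
    private bool _disposed;

    public StubInput(int pageCount)
    {
        _pages = pageCount;
        LivePages += pageCount; // pages are "rendered" on creation
    }

    public void Dispose()
    {
        if (_disposed) return; // make repeated disposal harmless
        _disposed = true;
        LivePages -= _pages;   // pages are released deterministically
    }
}
```

Because the using scope ends before the engine is returned and the semaphore is released, the next segment starts with the previous segment's pages already freed.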

