Handling Large PDF Files in IronOCR

Running OCR on large multi-page PDFs can spike memory and crash the process. The usual culprit is loading every page into memory at once.

System.OutOfMemoryException

OcrInput.LoadPdf("large.pdf") reads all pages in one go. IronOCR's imaging system then renders every page simultaneously, and because the DPI defaults to 200, the memory footprint climbs fast. On a big document that means an OutOfMemoryException or resource deadlock.
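
To get a feel for why DPI matters so much, the arithmetic can be sketched as follows. This is a rough estimate only: it assumes each rendered page is held as an uncompressed bitmap at 4 bytes per pixel on a US Letter page, which is an assumption for illustration and not a documented detail of IronOCR's imaging system. EstimatePageBytes is a hypothetical helper, not part of any library.

```csharp
using System;

// Hypothetical helper: bytes needed for one rendered page, assuming an
// uncompressed bitmap at bytesPerPixel bytes per pixel (an assumption).
static long EstimatePageBytes(double widthInches, double heightInches, int dpi, int bytesPerPixel = 4)
{
    long widthPx = (long)Math.Round(widthInches * dpi);
    long heightPx = (long)Math.Round(heightInches * dpi);
    return widthPx * heightPx * bytesPerPixel;
}

long at200 = EstimatePageBytes(8.5, 11, 200); // US Letter at 200 DPI
long at80 = EstimatePageBytes(8.5, 11, 80);   // same page at 80 DPI

Console.WriteLine($"200 DPI: {at200} bytes per page");
Console.WriteLine($"80 DPI:  {at80} bytes per page");
Console.WriteLine($"500 pages at 200 DPI: {at200 * 500 / 1e9:F2} GB");
```

Under these assumptions a page costs roughly 15 MB at 200 DPI and about 2.4 MB at 80 DPI. Pixel count scales with the square of the DPI, so halving the DPI cuts the estimate to a quarter.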

Solution

The fix is to process the PDF one page at a time and keep the render DPI as low as the text quality allows.

1. Get the Page Count

Open the document with IronPdf.PdfDocument (or any PDF library) to read how many pages it has.

using var pdf = IronPdf.PdfDocument.FromFile(pdfPath);
var pageCount = pdf.PageCount;

VB.NET:

Imports IronPdf

Using pdf = IronPdf.PdfDocument.FromFile(pdfPath)
    Dim pageCount = pdf.PageCount
End Using

2. Load One Page at a Time

Loop through the pages and load each individually with OcrInput.LoadPdfPage("file.pdf", pageIndex, dpi). Where the visual quality holds up, drop the DPI as low as 80 to conserve memory.

3. Read and Concatenate

Pass each single-page input to IronTesseract.Read() and append the result to a StringBuilder.

using IronOcr;
using System;
using System.Text;

var ocr = new IronTesseract();
var pdfPath = "large.pdf";

// Open the PDF only to read its page count
using var pdf = IronPdf.PdfDocument.FromFile(pdfPath);
var pageCount = pdf.PageCount;

var textBuilder = new StringBuilder();
for (int i = 0; i < pageCount; i++)
{
    // A fresh OcrInput per page is disposed at the end of each iteration
    using var input = new OcrInput();
    input.LoadPdfPage(pdfPath, i, 80); // render page i at 80 DPI
    var result = ocr.Read(input);
    textBuilder.Append(result.Text);
    textBuilder.Append(' '); // Add space between pages
}
Console.WriteLine(textBuilder.ToString().Trim());

VB.NET:

Imports IronOcr
Imports IronPdf
Imports System.Text

Dim ocr As New IronTesseract()
Dim pdfPath As String = "large.pdf"
Using pdf = PdfDocument.FromFile(pdfPath)
    Dim pageCount As Integer = pdf.PageCount
    Dim textBuilder As New StringBuilder()
    For i As Integer = 0 To pageCount - 1
        Using input As New OcrInput()
            input.LoadPdfPage(pdfPath, i, 80)
            Dim result = ocr.Read(input)
            textBuilder.Append(result.Text)
            textBuilder.Append(" ") ' Add space between pages
        End Using
    Next
    Console.WriteLine(textBuilder.ToString().Trim())
End Using

The using on each OcrInput releases the page's image data before the next iteration, so memory stays flat across the loop instead of growing with page count.
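
One way to check that memory really stays flat is to sample the process working set after each page. The sketch below is library-independent: MeasureWorkingSet and its processPage delegate are hypothetical names, and the stand-in workload simply allocates a buffer so the example runs without a PDF. Working set is only a coarse indicator and will not drop immediately after every dispose.

```csharp
using System;
using System.Diagnostics;

// Hypothetical helper: runs processPage for each index and records the
// process working set (in bytes) after every page.
static long[] MeasureWorkingSet(int pageCount, Action<int> processPage)
{
    var workingSets = new long[pageCount];
    using var process = Process.GetCurrentProcess();
    for (int page = 0; page < pageCount; page++)
    {
        processPage(page);
        process.Refresh(); // re-read the cached process counters
        workingSets[page] = process.WorkingSet64;
    }
    return workingSets;
}

// Stand-in workload; replace the lambda body with the per-page OCR work.
long[] samples = MeasureWorkingSet(5, pageIndex =>
{
    var buffer = new byte[1_000_000];
    buffer[0] = (byte)pageIndex;
});

for (int n = 0; n < samples.Length; n++)
    Console.WriteLine($"page {n}: {samples[n] / (1024 * 1024)} MB working set");
```

In real use, the lambda would hold the body of the page loop (create the OcrInput, load one page, read, dispose). A series that levels off after the first few pages suggests disposal is working; a series that keeps climbing suggests something is still holding page data.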

Option: Compress the PDF First

For very complex or image-heavy files, the page-by-page loop may still struggle. Compress the PDF with IronPDF's Compress API before OCR to cut down the image data IronOCR has to handle. This pays off most on scanned or image-heavy documents.

Compress straight to a stream, then load that stream into OcrInput:

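using IronOcr;
using IronPdf;

// Dispose the source document once the compressed stream has been consumed
using var pdf = PdfDocument.FromFile(pdfPath);
var stream = pdf.CompressPdfToStream(CompressStructTree: true);
var ocrTesseract = new IronTesseract();
using var ocrInput = new OcrInput();
ocrInput.LoadPdfPage(stream, 1); // page index 1; the page loop earlier counts from 0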

VB.NET:

Imports IronOcr
Imports IronPdf

Dim pdf = PdfDocument.FromFile(pdfPath)
Dim stream = pdf.CompressPdfToStream(CompressStructTree:=True)
Dim ocrTesseract = New IronTesseract()
Using ocrInput As New OcrInput()
    ocrInput.LoadPdfPage(stream, 1)
End Using

When a PDF carries embedded images, lower JpegQuality during compression to shrink the data further:

var pdf = PdfDocument.FromFile(@"D:\hugePdf.pdf");
var stream = pdf.CompressPdfToStream(JpegQuality: 75, CompressStructTree: true);

VB.NET:

Imports IronPdf

Dim pdf = PdfDocument.FromFile("D:\hugePdf.pdf")
Dim stream = pdf.CompressPdfToStream(JpegQuality:=75, CompressStructTree:=True)

Warning: Reusing the same MemoryStream across loop iterations? Reset its position to 0 before each read. A stream is consumed once read, so the next read fails if the position isn't reset.
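
The position behavior described in the warning can be seen with a plain MemoryStream, independent of any OCR or PDF library. The bytes below are placeholders standing in for a compressed PDF. How a particular loader moves the position is not shown here; the sketch only demonstrates why rewinding before each reuse is the safe habit.

```csharp
using System;
using System.IO;

// Placeholder bytes standing in for a compressed PDF stream
var stream = new MemoryStream(new byte[] { 1, 2, 3, 4, 5 });
var buffer = new byte[16];

int firstRead = stream.Read(buffer, 0, buffer.Length);  // consumes all 5 bytes
int secondRead = stream.Read(buffer, 0, buffer.Length); // position is at the end, nothing left

stream.Position = 0; // rewind before reusing the stream
int thirdRead = stream.Read(buffer, 0, buffer.Length);  // the data is readable again

Console.WriteLine($"{firstRead}, {secondRead}, {thirdRead}"); // prints "5, 0, 5"
```

In a page loop that passes the same stream to a loader on every iteration, the `stream.Position = 0;` line would go at the top of the loop body.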

Debug Tips

  • Lower the DPI: values in the 80 to 100 range cut memory use sharply when the text is still legible.
  • Avoid LoadPdf() on large files: read the whole document at once only when it is genuinely small.
  • Dispose early: wrap OcrInput in using statements so memory is freed between pages.
  • Parallelize with care: run pages concurrently only when the machine has spare memory and CPU. On large PDFs every concurrent page adds its own rendered image to peak memory, which can bring back the original failure.
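
If concurrency is worth trying, capping the number of pages in flight keeps peak memory bounded. The following is a minimal sketch using the standard Parallel.For with MaxDegreeOfParallelism. ReadPagesBounded and the ocrPage delegate are hypothetical names; the delegate stands in for "load one page, read it, dispose the input". Whether a single IronTesseract instance can be shared across threads is not covered here, so the sketch assumes each call creates whatever it needs.

```csharp
using System;
using System.Threading.Tasks;

// Hypothetical helper: processes pages with at most maxParallel running at once.
static string[] ReadPagesBounded(int pageCount, int maxParallel, Func<int, string> ocrPage)
{
    var texts = new string[pageCount];
    var options = new ParallelOptions { MaxDegreeOfParallelism = maxParallel };
    Parallel.For(0, pageCount, options, page =>
    {
        texts[page] = ocrPage(page); // each slot is written by exactly one task
    });
    return texts; // index order matches page order regardless of completion order
}

// Stand-in delegate so the sketch runs without a PDF
string[] pages = ReadPagesBounded(6, 2, page => $"text of page {page}");
Console.WriteLine(string.Join(" ", pages));
```

Starting with a cap of 2 and raising it only while memory stays comfortable is a cautious way to tune this; a cap of 1 is equivalent to the sequential loop.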
Curtis Chau
Technical Writer
