How to OCR a PDF in C#: Extract Text from Scanned Documents with .NET

Scanned PDF documents present a persistent challenge for .NET developers: text exists only as images, making it impossible to search, copy, or process programmatically. Optical Character Recognition (OCR) solves this by converting scanned images into editable and searchable data -- transforming paper documents, camera-captured images, or any image-based PDF file into machine-readable text. Whether the goal is digitizing paper archives, automating data extraction, or building document processing pipelines, the ability to perform OCR on PDF files in C# is a critical capability.

IronOCR is a .NET OCR library built on the Tesseract 5 engine with additional accuracy enhancements. It lets developers extract text from any PDF document -- scanned or otherwise -- with a small number of lines of code. This article walks through the core workflows: basic PDF OCR, page-selective processing, region-targeted extraction, and image preprocessing for challenging scans.

How Do You Perform OCR on a PDF in C#?

The fastest path to PDF text extraction in .NET starts with installing IronOCR via NuGet. Open a terminal in your project directory and run:

dotnet add package IronOcr

With the package installed, the following top-level statement program reads a scanned PDF and prints its extracted text:

using IronOcr;

// Initialize the OCR engine
var ocr = new IronTesseract();

// Load the PDF and perform OCR
using var input = new OcrInput();
input.LoadPdf("scanned-report.pdf");

// Run recognition
OcrResult result = ocr.Read(input);

// Access the extracted text
string text = result.Text;
Console.WriteLine(text);
Imports IronOcr

' Initialize the OCR engine
Dim ocr As New IronTesseract()

' Load the PDF and perform OCR
Using input As New OcrInput()
    input.LoadPdf("scanned-report.pdf")

    ' Run recognition
    Dim result As OcrResult = ocr.Read(input)

    ' Access the extracted text
    Dim text As String = result.Text
    Console.WriteLine(text)
End Using

The IronTesseract class wraps Tesseract 5 with .NET-native optimizations for both .NET Core and .NET Framework targets. The OcrInput object manages PDF loading and internal page rendering. When Read is called, the OCR process analyzes each page and returns an OcrResult containing the full extracted text, plus structured data about paragraphs, lines, words, and their pixel coordinates.

The result can be written to a text file, passed to downstream processing logic, stored in a database, or fed into a document indexing pipeline. For further reading on the underlying engine, see the Tesseract OCR documentation and the IronOCR API reference.

Input

[Image 1: Sample PDF Input]

Output

[Image 2: Console Output]

How Do You Read Specific Pages from a PDF?

Processing every page of a long document wastes time and memory when only certain pages contain relevant content. IronOCR lets you target specific pages by passing zero-based page indices to LoadPdf:

using IronOcr;
using System.Collections.Generic;

var ocr = new IronTesseract();

// Specify pages to process (zero-based: 0 = first page)
var targetPages = new List<int> { 0, 2, 4 };

using var input = new OcrInput();
input.LoadPdf("lengthy-document.pdf", pageIndices: targetPages);

OcrResult result = ocr.Read(input);
Console.WriteLine(result.Text);
Imports IronOcr
Imports System.Collections.Generic

Dim ocr As New IronTesseract()

' Specify pages to process (zero-based: 0 = first page)
Dim targetPages As New List(Of Integer) From {0, 2, 4}

Using input As New OcrInput()
    input.LoadPdf("lengthy-document.pdf", pageIndices:=targetPages)

    Dim result As OcrResult = ocr.Read(input)
    Console.WriteLine(result.Text)
End Using

Selective page loading reduces both processing time and memory consumption, which matters when working with multi-hundred-page archives where only a handful of pages contain the data needed. The zero-based index convention matches standard .NET collections: page index 0 is the first page of the document.

For documents where the relevant pages are not known in advance, consider running a fast full-document pass first with reduced DPI to identify page numbers, then re-running with full settings on only those pages.
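The selection step of that two-pass approach can be sketched without touching the OCR engine at all. The helper below is hypothetical (FindCandidatePages is not part of IronOCR) and assumes the text of each page from the quick pass has already been collected into a list, with index 0 holding the first page. It returns zero-based indices that could then be passed as pageIndices on the second, full-quality pass.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper (not an IronOCR API): given the rough text of each page
// from a fast first pass, return the zero-based indices of pages that mention
// at least one keyword.
static List<int> FindCandidatePages(IReadOnlyList<string> pageTexts, IReadOnlyList<string> keywords)
{
    var hits = new List<int>();
    for (int i = 0; i < pageTexts.Count; i++)
    {
        foreach (string keyword in keywords)
        {
            if (pageTexts[i].Contains(keyword, StringComparison.OrdinalIgnoreCase))
            {
                hits.Add(i);
                break; // one matching keyword is enough to keep the page
            }
        }
    }
    return hits;
}

// Sample first-pass output; in practice these strings would come from the low-DPI run.
var roughTexts = new[] { "Cover page", "INVOICE No. 1042", "Terms and conditions", "Invoice total: 88.00" };
List<int> candidatePages = FindCandidatePages(roughTexts, new[] { "invoice" });
Console.WriteLine(string.Join(", ", candidatePages)); // prints "1, 3"
```

Low-resolution OCR tends to misread characters, so short, distinctive keywords are more likely to survive the first pass than long phrases.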

Learn more about page-level control in the IronOCR page selection documentation.

How Do You Extract Data from a Specific Region of a Page?

Invoice processing, form digitization, and structured document parsing frequently require extracting text from a defined area rather than scanning an entire page. IronOCR supports region-targeted OCR through the ContentAreas parameter, which accepts an array of Rectangle objects specifying which portions of each page to analyze:

using IronOcr;
using IronSoftware.Drawing;

var ocr = new IronTesseract();

// Define the scan region: X, Y, Width, Height (all in pixels from top-left)
var invoiceFields = new Rectangle[]
{
    new Rectangle(130, 290, 250, 50)   // Invoice number field
};

using var input = new OcrInput();
input.LoadPdf("invoice.pdf", contentAreas: invoiceFields);

OcrResult result = ocr.Read(input);
Console.WriteLine(result.Text);
Imports IronOcr
Imports IronSoftware.Drawing

Dim ocr As New IronTesseract()

' Define the scan region: X, Y, Width, Height (all in pixels from top-left)
Dim invoiceFields As Rectangle() = {
    New Rectangle(130, 290, 250, 50)   ' Invoice number field
}

Using input As New OcrInput()
    input.LoadPdf("invoice.pdf", contentAreas:=invoiceFields)

    Dim result As OcrResult = ocr.Read(input)
    Console.WriteLine(result.Text)
End Using

The Rectangle constructor takes four integer parameters: the X coordinate, Y coordinate, width, and height -- all measured in pixels from the top-left corner of the rendered page. Targeting a small region rather than a full page reduces both OCR time and the chance of the engine picking up surrounding noise or unrelated text fields.
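Because the coordinates are in pixels of the rendered page, a region measured on paper has to be converted using the render resolution. The sketch below shows that arithmetic only. ToPixels is a hypothetical helper, and it assumes the content area is interpreted at the same DPI the page is rendered at, which is worth confirming against a test document before relying on it.

```csharp
using System;

// Hypothetical helper (not an IronOCR API): convert a region measured in inches
// from the top-left corner of the page into pixel values at a given render DPI.
static (int X, int Y, int Width, int Height) ToPixels(
    double xInches, double yInches, double widthInches, double heightInches, int dpi)
{
    return (
        (int)Math.Round(xInches * dpi),
        (int)Math.Round(yInches * dpi),
        (int)Math.Round(widthInches * dpi),
        (int)Math.Round(heightInches * dpi));
}

// A field 1.0 in from the left, 2.5 in from the top, 2.0 in wide and 0.5 in tall.
var at200 = ToPixels(1.0, 2.5, 2.0, 0.5, dpi: 200);
var at300 = ToPixels(1.0, 2.5, 2.0, 0.5, dpi: 300);
Console.WriteLine(at200); // prints "(200, 500, 400, 100)"
Console.WriteLine(at300); // prints "(300, 750, 600, 150)"
```

The four values map directly onto the X, Y, width, and height arguments of the Rectangle constructor described above.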

For batch invoice processing workflows, combine region extraction with iteration over result.Pages to pull structured data from the same field position across hundreds of documents. Each page result exposes the recognized text for its content area independently.
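Text read from a small region usually still needs cleaning before it can be stored as structured data. The following sketch shows one way to normalize an invoice-number field. The helper name and the number format (an optional "INV" prefix followed by 4 to 10 digits) are assumptions made for illustration, not something IronOCR defines.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper (not an IronOCR API): pull an invoice number out of the
// raw text recognized in a content area.
// Assumed format: optional "INV" prefix, optional dash or space, then 4-10 digits.
static bool TryExtractInvoiceNumber(string fieldText, out string number)
{
    Match match = Regex.Match(fieldText, @"(?:INV[-\s]?)?(\d{4,10})", RegexOptions.IgnoreCase);
    number = match.Success ? match.Groups[1].Value : string.Empty;
    return match.Success;
}

// Sample strings standing in for the text of the same region on three pages.
foreach (string sample in new[] { "Invoice No: INV-004217", "Invoice # 88213", "Page 1 of 3" })
{
    if (TryExtractInvoiceNumber(sample, out string number))
        Console.WriteLine($"{sample} -> {number}");
    else
        Console.WriteLine($"{sample} -> (no match)");
}
```

Any run of four or more digits matches this pattern, including years in dates, so a tight content area matters as much as the pattern itself.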

The IronOCR content areas example provides additional configuration options for multi-region scenarios.

Input

[Image 3: Sample Invoice]

Output

[Image 4: Extracted Data Output]

How Do You Improve OCR Accuracy on Scanned Documents?

Real-world scanned documents frequently arrive with quality problems: skewed pages, low resolution, or digital noise introduced by the scanning hardware or software. IronOCR includes a set of image preprocessing filters that correct these issues before the recognition engine runs:

using IronOcr;

var ocr = new IronTesseract();

using var input = new OcrInput();
// Load PDF at higher DPI for improved text recognition on small fonts
input.LoadPdf("poor-quality-scan.pdf", dpi: 300);

// Apply image correction filters
input.Deskew();    // Automatically straighten rotated pages
input.DeNoise();   // Remove scanning artifacts and speckles

OcrResult result = ocr.Read(input);
Console.WriteLine(result.Text);
Imports IronOcr

Dim ocr = New IronTesseract()

Using input = New OcrInput()
    ' Load PDF at higher DPI for improved text recognition on small fonts
    input.LoadPdf("poor-quality-scan.pdf", dpi:=300)

    ' Apply image correction filters
    input.Deskew()    ' Automatically straighten rotated pages
    input.DeNoise()   ' Remove scanning artifacts and speckles

    Dim result As OcrResult = ocr.Read(input)
    Console.WriteLine(result.Text)
End Using

The dpi parameter controls the resolution at which PDF pages are rendered before recognition runs. Higher values -- 200 to 300 DPI -- improve accuracy for documents with small or dense text, at the cost of more memory and processing time, because the pixel count of each rendered page grows with the square of the DPI. The Deskew method detects and corrects page rotation automatically. DeNoise removes speckles and artifacts that can confuse the character recognition step.
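To get a feel for the DPI trade-off, it helps to work out how large a rendered page becomes. The estimate below is a rough sketch under stated assumptions: a US Letter page (8.5 x 11 in) held as an uncompressed bitmap at 4 bytes per pixel. How IronOCR stores pages internally is not covered here, so treat the figures as an order-of-magnitude guide; EstimateRender is a hypothetical helper.

```csharp
using System;

// Hypothetical helper (not an IronOCR API): estimate the pixel dimensions and
// uncompressed size of one page rendered at a given DPI.
// Assumption: 4 bytes per pixel, no compression.
static (int Width, int Height, double Megabytes) EstimateRender(
    double widthInches, double heightInches, int dpi, int bytesPerPixel = 4)
{
    int width = (int)Math.Round(widthInches * dpi);
    int height = (int)Math.Round(heightInches * dpi);
    double megabytes = (double)width * height * bytesPerPixel / 1_000_000.0;
    return (width, height, megabytes);
}

foreach (int dpi in new[] { 150, 200, 300 })
{
    var estimate = EstimateRender(8.5, 11.0, dpi);
    Console.WriteLine($"{dpi} DPI: {estimate.Width} x {estimate.Height} px, about {estimate.Megabytes:F1} MB");
}
```

Under these assumptions a Letter page is 1275 x 1650 pixels at 150 DPI and 2550 x 3300 pixels at 300 DPI: doubling the DPI quadruples the pixel count, from roughly 8.4 MB to roughly 33.7 MB per page.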

For documents requiring more aggressive image correction, IronOCR also provides contrast enhancement, binarization (converting pages to black-and-white), and scale adjustments. Combining multiple filters in sequence can recover usable text from scans that would otherwise produce garbled output. Review the IronOCR image filters reference for the complete list of available preprocessing operations.

How Do You Handle Password-Protected and Multi-Format Documents?

IronOCR is not limited to standard PDF files. The library handles a range of input scenarios that appear frequently in document processing workflows.

Password-protected PDFs are supported by passing the password to LoadPdf:

using IronOcr;

var ocr = new IronTesseract();

using var input = new OcrInput();
input.LoadPdf("protected.pdf", password: "secret123");

OcrResult result = ocr.Read(input);
Console.WriteLine(result.Text);
Imports IronOcr

Dim ocr As New IronTesseract()

Using input As New OcrInput()
    input.LoadPdf("protected.pdf", password:="secret123")

    Dim result As OcrResult = ocr.Read(input)
    Console.WriteLine(result.Text)
End Using

Image formats -- PNG, JPEG, TIFF, BMP, GIF, and multipage TIFF -- are loaded with the corresponding LoadImage or LoadImageFrames methods. The same preprocessing filters and region targeting options apply regardless of input format.

Multi-language documents are handled through IronOCR's language pack system. The library ships with English by default and supports more than 125 additional language packs covering Latin, Cyrillic, CJK, Arabic, and other scripts. Set the Language property on the IronTesseract instance before calling Read:

var ocr = new IronTesseract();
ocr.Language = OcrLanguage.German;
Dim ocr As New IronTesseract()
ocr.Language = OcrLanguage.German

For documents mixing multiple languages on the same page, MultiLanguage mode is available. This is particularly valuable for invoice processing in international environments where headers, line items, and addresses may appear in different languages.

Deployment works across Windows, Linux, macOS, and cloud environments including Azure and Docker containers.

How Do You Create Searchable PDFs from Scanned Documents?

Beyond extracting text into strings, IronOCR can produce searchable PDF output -- a PDF where the original scanned image is preserved as the visual layer while an invisible text layer is embedded for search and copy operations. This is the standard format produced by professional document scanners.

The searchable PDF feature is exposed as a method on OcrResult that writes a new PDF file:

using IronOcr;

var ocr = new IronTesseract();

using var input = new OcrInput();
input.LoadPdf("scanned-archive.pdf");

OcrResult result = ocr.Read(input);

// Save as a searchable PDF
result.SaveAsSearchablePdf("output-searchable.pdf");
Imports IronOcr

Dim ocr As New IronTesseract()

Using input As New OcrInput()
    input.LoadPdf("scanned-archive.pdf")

    Dim result As OcrResult = ocr.Read(input)

    ' Save as a searchable PDF
    result.SaveAsSearchablePdf("output-searchable.pdf")
End Using

The output file can be opened in any PDF reader. Text selection, search, and copy operations work on the embedded text layer while the original scan appearance is preserved. This format is commonly required for compliance archives, legal document repositories, and enterprise content management systems.

Beyond file output, the OcrResult object also exposes per-page confidence scores, word-level bounding boxes, and structured paragraph data -- all useful for downstream classification or indexing tasks.
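Per-page confidence scores are a natural input for a manual-review queue. The sketch below shows only the filtering logic. PagesNeedingReview is a hypothetical helper, the page numbers and scores are made-up sample values, and the 0 to 100 scale is an assumption to check against the API reference before choosing a threshold.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper (not an IronOCR API): return the page numbers whose
// confidence falls below a threshold.
// Assumption: confidence is expressed on a 0-100 scale.
static List<int> PagesNeedingReview(
    IEnumerable<(int PageNumber, double Confidence)> pages, double threshold)
{
    return pages
        .Where(page => page.Confidence < threshold)
        .Select(page => page.PageNumber)
        .ToList();
}

// Sample values standing in for scores read from an OCR result.
var scores = new[]
{
    (PageNumber: 1, Confidence: 97.2),
    (PageNumber: 2, Confidence: 61.5),
    (PageNumber: 3, Confidence: 88.0)
};
List<int> review = PagesNeedingReview(scores, threshold: 80.0);
Console.WriteLine(string.Join(", ", review)); // prints "2"
```

Pages flagged this way are good candidates for a second pass with a higher DPI or additional preprocessing filters.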

How Do You Read Barcodes and QR Codes Alongside Text?

Document processing pipelines often need to extract both human-readable text and machine-readable codes from the same document. IronOCR can detect and decode barcodes and QR codes during the same OCR pass, without requiring a separate library.

Enable barcode reading on the IronTesseract instance before processing:

using IronOcr;

var ocr = new IronTesseract();
ocr.Configuration.ReadBarCodes = true;

using var input = new OcrInput();
input.LoadPdf("shipment-labels.pdf");

OcrResult result = ocr.Read(input);

// Access recognized text
Console.WriteLine(result.Text);

// Access barcode data
foreach (var barcode in result.Barcodes)
{
    Console.WriteLine($"Type: {barcode.Format}, Value: {barcode.Value}");
}
Imports IronOcr

Dim ocr As New IronTesseract()
ocr.Configuration.ReadBarCodes = True

Using input As New OcrInput()
    input.LoadPdf("shipment-labels.pdf")

    Dim result As OcrResult = ocr.Read(input)

    ' Access recognized text
    Console.WriteLine(result.Text)

    ' Access barcode data
    For Each barcode In result.Barcodes
        Console.WriteLine($"Type: {barcode.Format}, Value: {barcode.Value}")
    Next
End Using

This is particularly useful for shipping label processing, inventory management, and any workflow where barcodes and printed text appear together on scanned documents. The IronOCR barcode reading guide covers supported formats including Code 128, QR codes, Data Matrix, and PDF417.

What Is the Difference Between IronOCR Input Types?

IronOCR provides two main approaches for loading PDF files, alongside dedicated methods for image input. Each is suited to different scenarios:

IronOCR PDF Input Methods Compared

Approach       | Class                      | Best For                     | Notes
General input  | OcrInput.LoadPdf()         | Most use cases               | Supports all preprocessing filters, page selection, content areas
PDF-specific   | OcrPdfInput                | Simple scenarios             | Convenience wrapper; fewer configuration options
Image files    | OcrInput.LoadImage()       | PNG, JPEG, TIFF, BMP         | Same preprocessing and region targeting as PDF input
Multipage TIFF | OcrInput.LoadImageFrames() | Fax archives, scanner output | Processes each frame as a separate page

For most production scenarios, OcrInput.LoadPdf() is the recommended approach because it exposes the full preprocessing and configuration API. OcrPdfInput works well for quick prototyping or situations where the default settings are sufficient.
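In a pipeline that receives mixed file types, the choice between these methods can be made from the file extension. The routing sketch below is illustrative only: ChooseLoadMethod is a hypothetical helper that returns the name of the OcrInput method from the table above rather than calling it, and the multiFrame flag is an assumption about how the caller knows a TIFF has several frames.

```csharp
using System;
using System.IO;

// Hypothetical helper (not an IronOCR API): map a file path to the name of the
// OcrInput loading method listed in the comparison table.
static string ChooseLoadMethod(string path, bool multiFrame = false)
{
    string extension = Path.GetExtension(path).ToLowerInvariant();
    switch (extension)
    {
        case ".pdf":
            return "LoadPdf";
        case ".tif":
        case ".tiff":
            // Multipage TIFFs are processed frame by frame.
            return multiFrame ? "LoadImageFrames" : "LoadImage";
        case ".png":
        case ".jpg":
        case ".jpeg":
        case ".bmp":
        case ".gif":
            return "LoadImage";
        default:
            throw new NotSupportedException($"No loader mapped for '{extension}'.");
    }
}

Console.WriteLine(ChooseLoadMethod("invoice.pdf"));                       // prints "LoadPdf"
Console.WriteLine(ChooseLoadMethod("fax-archive.tiff", multiFrame: true)); // prints "LoadImageFrames"
Console.WriteLine(ChooseLoadMethod("receipt.JPG"));                        // prints "LoadImage"
```

Failing loudly on unknown extensions keeps unsupported files from being silently skipped in a batch run.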

What Are Your Next Steps?

The code examples above cover the core IronOCR workflows for PDF OCR in C#. Here is a brief checklist for taking the next step:

  • Install the package: dotnet add package IronOcr or search for IronOcr on NuGet
  • Run the basic example: Confirm text extraction from a sample PDF before building out full pipeline logic
  • Apply preprocessing: If working with scanned documents, add Deskew and DeNoise calls and test with representative samples
  • Explore additional features: Searchable PDF output, barcode reading, multi-language support, and structured data output
  • Review deployment guidance: Azure, Docker, and Linux deployment articles cover environment-specific configuration
  • Try the free trial: Start a free trial to test the full feature set before committing to a license
  • Get a license: IronOCR licensing options cover individual developers through enterprise deployments, with royalty-free redistribution

For questions about specific use cases, the IronOCR how-to library provides step-by-step articles covering dozens of scenarios. The full API surface is documented in the IronOCR API reference.

Frequently Asked Questions

What is the minimum code needed to OCR a PDF in C#?

Using IronOCR, the minimum code is: create an IronTesseract instance, create an OcrInput, call input.LoadPdf with the file path, then call ocr.Read(input). The result.Text property returns the extracted string.

How do you install IronOCR in a .NET project?

Run 'dotnet add package IronOcr' in the terminal, or search for IronOcr in the NuGet Package Manager within Visual Studio.

Can IronOCR process only specific pages of a PDF?

Yes. Pass a List of zero-based page indices to the pageIndices parameter of LoadPdf. Only the specified pages are rendered and processed, reducing time and memory usage.

How do you extract text from a specific region of a scanned PDF?

Pass an array of Rectangle objects to the contentAreas parameter of LoadPdf. Each rectangle specifies the X position, Y position, width, and height in pixels from the top-left corner of the page.

What preprocessing filters does IronOCR provide for scanned documents?

IronOCR provides Deskew (corrects page rotation), DeNoise (removes scanning artifacts), contrast enhancement, binarization, and scale adjustment. These can be chained to improve accuracy on poor-quality scans.

Does IronOCR support password-protected PDF files?

Yes. Pass the password string to the password parameter of LoadPdf. The library decrypts the document before rendering pages for OCR.

Can IronOCR create searchable PDF output?

Yes. After calling ocr.Read(input), call result.SaveAsSearchablePdf with an output file path. The resulting PDF preserves the original scan as the visual layer with an embedded invisible text layer for search and copy operations.

What languages does IronOCR support?

IronOCR supports more than 125 language packs covering Latin, Cyrillic, CJK, Arabic, and other scripts. Set the Language property on the IronTesseract instance before calling Read.

Can IronOCR read barcodes and QR codes from PDF documents?

Yes. Set ocr.Configuration.ReadBarCodes to true before calling Read. The OcrResult.Barcodes collection contains the decoded values and format types for all detected codes.

Does IronOCR work on Linux and in Docker containers?

Yes. IronOCR supports deployment on Windows, Linux, macOS, and cloud environments including Azure and Docker containers. The IronSoftware documentation includes environment-specific setup guides.

Kannapat Udonpant
Software Engineer
Before becoming a Software Engineer, Kannapat completed an Environmental Resources PhD at Hokkaido University in Japan. While pursuing his degree, Kannapat also became a member of the Vehicle Robotics Laboratory, which is part of the Department of Bioproduction Engineering. In 2022, he leveraged his C# skills to join Iron Software's engineering ...