How to OCR a PDF: Extract Text from Scanned Documents with C# .NET OCR PDF

Scanned PDF documents present a common challenge for .NET developers: the text exists only as images, so it cannot be searched, copied, or processed programmatically. Optical Character Recognition (OCR) solves this by converting scanned pages and image files into editable, searchable text, whether the source is a scanned paper document, a photo taken with a digital camera, or an image-only PDF file. Whether you are digitizing paper archives, automating data extraction, or building AI-powered document processing applications, the ability to run OCR on PDF files is essential.

IronOCR is a powerful .NET OCR library that provides a streamlined approach to PDF OCR in C#. Built on the Tesseract OCR engine with enhanced accuracy, this .NET Optical Character Recognition library lets you extract text from any PDF document with just a few lines of code.

How Can I Perform OCR on a PDF in C#?

First, install the IronOCR library through the NuGet Package Manager to add the OCR engine to your project:

Install-Package IronOcr

The following example demonstrates how to load a PDF file and recognize text from an entire scanned document:

using IronOcr;
using System;
// Initialize the OCR engine
IronTesseract ocr = new IronTesseract();
// Load the PDF and perform OCR
using var pdfInput = new OcrPdfInput("scanned-report.pdf");
OcrResult result = ocr.Read(pdfInput);
// Output the extracted text
string extractedText = result.Text;
Console.WriteLine(extractedText);

The IronTesseract class serves as the primary OCR engine, wrapping Tesseract 5 with optimizations for .NET Core and .NET Framework applications. The OcrPdfInput object handles PDF loading and page rendering internally, so you do not need to convert pages to image formats manually. When you call the Read method, the OCR engine analyzes each page and returns an OcrResult containing the extracted text as a string, along with structured data about paragraphs, lines, words, and their positions. From there, you can save the text to a TXT file in a target folder, pass it on to other formats such as Word documents, or process the structured data further through the API.

Input

[Image 1: Sample PDF input]

Output

[Image 2: Console output]
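
As a rough sketch of the save-to-file step described above, the following writes the full text to one TXT file and then one file per page. The output folder name and file names are hypothetical, and the Pages, PageNumber, and Text members are used here as they appear in IronOCR's result model; check the API reference for your installed version before relying on them.

```csharp
using IronOcr;
using System;
using System.IO;

IronTesseract ocr = new IronTesseract();
using var pdfInput = new OcrPdfInput("scanned-report.pdf");
OcrResult result = ocr.Read(pdfInput);

// Hypothetical output folder; created if it does not already exist
string outputFolder = "ocr-output";
Directory.CreateDirectory(outputFolder);

// Save the text of the whole document to a single TXT file
File.WriteAllText(Path.Combine(outputFolder, "scanned-report.txt"), result.Text);

// Save one TXT file per page
foreach (var page in result.Pages)
{
    string pagePath = Path.Combine(outputFolder, $"page-{page.PageNumber}.txt");
    File.WriteAllText(pagePath, page.Text);
    Console.WriteLine($"Page {page.PageNumber}: {page.Text.Length} characters saved to {pagePath}");
}
```

Splitting the output per page keeps each file small and makes it easier to trace a piece of text back to the page it came from.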

How Do I Read Specific Pages from a PDF?

Processing large documents is more efficient when you target only the pages you need. Pass a list of page indices to the PageIndices parameter to run OCR on selected pages of a scanned PDF:

using IronOcr;
using System;
using System.Collections.Generic;
IronTesseract ocr = new IronTesseract();
// Specify pages to process (zero-based indexing)
List<int> targetPages = new List<int>() { 0, 2, 4 };
using var pdfInput = new OcrPdfInput("lengthy-document.pdf", PageIndices: targetPages);
OcrResult result = ocr.Read(pdfInput);
// Save or process the OCR results
Console.WriteLine(result.Text);

Note that IronOCR uses zero-based indexing, so index 0 is the first page of your PDF document. Reading only the pages you need reduces processing time and memory consumption on multi-page scanned documents where just a few pages contain relevant data.
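
Because page numbers printed on a document or shown in a PDF viewer normally start at 1, an off-by-one mistake is easy to make here. The snippet below is a minimal sketch in plain C# with no IronOCR calls; ToPageIndices is a hypothetical helper that converts one-based page numbers into the zero-based indices that the PageIndices parameter expects.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper: converts one-based page numbers to zero-based indices,
// ignoring values below 1 and removing duplicates.
static List<int> ToPageIndices(IEnumerable<int> pageNumbers) =>
    pageNumbers
        .Where(n => n >= 1)
        .Select(n => n - 1)
        .Distinct()
        .OrderBy(i => i)
        .ToList();

// Pages 1, 3, and 5 as a reader would count them
List<int> targetPages = ToPageIndices(new[] { 1, 3, 5 });
Console.WriteLine(string.Join(", ", targetPages)); // prints "0, 2, 4"
```

The resulting list can be passed as the PageIndices argument in the same way as targetPages in the example above.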

How Can I Extract Data from a Specific Region?

Invoice processing, form digitization, and document parsing often require extracting text from defined areas rather than entire pages. This OCR tool allows you to create targeted scans using the ContentAreas parameter, which accepts an array of rectangles specifying the regions to process:

using IronOcr;
using IronSoftware.Drawing;
using System;
IronTesseract ocr = new IronTesseract();
// Define the scan region (x, y, width, height in pixels)
Rectangle[] invoiceFields = {
    new Rectangle(130, 290, 250, 50)   // Invoice number area
};
using var pdfInput = new OcrPdfInput("invoice.pdf", ContentAreas: invoiceFields);
OcrResult result = ocr.Read(pdfInput);
// Extract and output the structured data
Console.WriteLine(result.Text);

The Rectangle constructor accepts four parameters: X position, Y position, width, and height, all measured in pixels from the top-left corner of the page. This targeted approach can improve both speed and accuracy, because the OCR engine works only on the specified content areas instead of processing irrelevant background elements. For batch invoice processing, combine region extraction with a loop over the result pages to build structured data from multiple PDF files.

Input

[Image 3: Sample invoice]

Output

[Image 4: Extracted data output]
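
Once a region has been read, the raw text usually needs a little cleanup before it counts as structured data. The sketch below is plain C# with no IronOCR calls: ExtractInvoiceNumber is a hypothetical helper, the sample string stands in for result.Text, and the pattern (two to four capital letters, an optional hyphen, then at least four digits) is only an assumption about what an invoice number looks like.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper: returns the first token that looks like an invoice number,
// or an empty string when nothing matches. The pattern is an illustrative assumption.
static string ExtractInvoiceNumber(string regionText)
{
    Match match = Regex.Match(regionText, @"\b[A-Z]{2,4}-?\d{4,}\b");
    return match.Success ? match.Value : string.Empty;
}

// Stand-in for result.Text from a region scan (made-up sample text)
string regionText = "Invoice No: INV-20391\n";
string invoiceNumber = ExtractInvoiceNumber(regionText);
Console.WriteLine(invoiceNumber.Length > 0 ? invoiceNumber : "(not found)"); // prints "INV-20391"
```

Adjust the pattern to the numbering scheme your invoices actually use, and treat an empty result as a signal to check the region coordinates.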

How Do I Improve OCR Accuracy on Scanned Documents?

Real-world scanned paper documents often arrive with quality issues: skewed pages, low resolution, or digital noise introduced by the scanner. IronOCR includes preprocessing filters that correct these problems before recognition, which leads to more accurate text extraction:

using IronOcr;
using System;
IronTesseract ocr = new IronTesseract();
using var input = new OcrInput();
// Load PDF with higher DPI for better text recognition
input.LoadPdf("poor-quality-scan.pdf", DPI: 300);
// Apply image correction filters to process scanned images
input.Deskew();   // Straighten rotated pages
input.DeNoise();  // Remove scanning artifacts
OcrResult result = ocr.Read(input);
Console.WriteLine(result.Text);

The DPI parameter controls the resolution at which PDF pages are rendered before OCR runs. Higher values (200 to 300 DPI) improve accuracy for documents with small text. The Deskew method automatically detects and corrects page rotation, while DeNoise removes speckles and artifacts that interfere with character recognition. For documents that need further image adjustments, IronOCR also provides contrast enhancement, binarization, and other image filters.
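
To get a feel for what the DPI setting means in practice, the arithmetic below converts a page size in inches to rendered pixel dimensions. This is plain C# with no IronOCR calls; PixelSize is a hypothetical helper, and US Letter (8.5 x 11 inches) is just an example page size.

```csharp
using System;

// Hypothetical helper: pixel dimensions of a page rendered at a given DPI
// (pixels = inches * DPI, rounded to the nearest whole pixel).
static (int Width, int Height) PixelSize(double widthInches, double heightInches, int dpi) =>
    ((int)Math.Round(widthInches * dpi), (int)Math.Round(heightInches * dpi));

// US Letter page at three common rendering resolutions
foreach (int dpi in new[] { 150, 200, 300 })
{
    var (width, height) = PixelSize(8.5, 11.0, dpi);
    Console.WriteLine($"{dpi} DPI -> {width} x {height} px");
}
// The last line printed is "300 DPI -> 2550 x 3300 px"
```

Going from 200 to 300 DPI multiplies the pixel count by 2.25, so higher settings trade extra memory and processing time for the additional detail.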

This .NET OCR library also handles password-protected PDF documents by accepting credentials during input construction. The software supports 125+ language packs, enabling OCR on international documents. Beyond standard PDF files, IronOCR can process PNG, TIFF (including multipage TIFF), and other image formats. The library runs on Windows, Linux, and macOS, on cloud platforms such as Azure, and in Docker containers.

Conclusion

IronOCR transforms the complex task of PDF text extraction into a straightforward operation. From basic document reading to targeted region extraction and preprocessing for challenging scanned images, this OCR library handles the technical complexity while exposing a clean C# API that works across .NET Core and .NET Framework.

The code examples above demonstrate core functionality, but IronOCR goes further with barcode and QR code reading, searchable PDF creation that turns scanned PDF files into searchable documents, and structured data output including confidence scores and text positioning. Explore the complete API reference for advanced implementations, or try the full feature set during a free trial.
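
As a brief sketch of two of those features, the snippet below reads a scanned PDF, prints the overall confidence score, and writes a searchable copy. The file names are hypothetical, and the Confidence property and SaveAsSearchablePdf method are used as they are commonly documented for OcrResult; confirm both against the API reference for your version.

```csharp
using IronOcr;
using System;

IronTesseract ocr = new IronTesseract();
using var pdfInput = new OcrPdfInput("scanned-report.pdf");
OcrResult result = ocr.Read(pdfInput);

// Overall recognition confidence reported for the document
Console.WriteLine($"Confidence: {result.Confidence}");

// Write a copy of the document with a searchable text layer
result.SaveAsSearchablePdf("scanned-report-searchable.pdf");
```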

Purchase a license to deploy IronOCR in production .NET applications, or chat with our engineering team for project-specific guidance.

Ready to perform OCR in your .NET applications? Start with a free trial to explore the full feature set and download the SDK.

Frequently Asked Questions

What is OCR and why is it important for .NET developers?

OCR, or Optical Character Recognition, is a technology that converts scanned images and PDF files into editable and searchable text. This is crucial for .NET developers who need to process document images programmatically, enabling functionalities like searching and copying text.

How does IronOCR enhance the OCR process?

IronOCR enhances the OCR process by building on the Tesseract OCR engine, providing improved accuracy and a streamlined approach to extracting text from scanned documents in C#.

Can IronOCR handle PDF files directly for text extraction?

Yes, IronOCR can handle PDF files directly, allowing developers to extract text from scanned PDF documents using just a few lines of C# code.

What types of documents can IronOCR process?

IronOCR can process a variety of documents, including scanned paper documents, images captured by digital cameras, and scanned PDF files, converting them into machine-readable text.

Is IronOCR suitable for automating data extraction tasks?

Absolutely, IronOCR is ideal for automating data extraction tasks as it can convert scanned images into structured, editable data, streamlining workflows and enhancing productivity.

What advantages does using IronOCR offer for AI-powered document processing applications?

IronOCR offers the advantage of converting documents into machine-readable text, which is essential for building AI-powered document processing applications that require text recognition and analysis capabilities.

How easy is it to implement IronOCR in a C# project?

Implementing IronOCR in a C# project is straightforward, requiring only a few lines of code to integrate its OCR capabilities and start extracting text from documents.

Does IronOCR improve upon the Tesseract OCR engine?

Yes, IronOCR builds upon the Tesseract OCR engine, enhancing its accuracy and performance to offer superior text recognition results.

Can IronOCR be used for digitizing paper archives?

Yes, IronOCR is well-suited for digitizing paper archives as it can convert scanned paper documents into searchable and editable digital text, facilitating easier document management.

What coding languages does IronOCR support for OCR implementation?

IronOCR is a .NET library with a C# API, which makes it a good fit for developers working in the .NET ecosystem.

Kannapat Udonpant
Software Engineer
Before becoming a Software Engineer, Kannapat completed an Environmental Resources PhD at Hokkaido University in Japan. While pursuing his degree, Kannapat also became a member of the Vehicle Robotics Laboratory, which is part of the Department of Bioproduction Engineering. In 2022, he leveraged his C# skills to join Iron Software's engineering ...