How to Read Scanned Documents in C# | IronOCR

Read Scanned Documents in C# Using IronOCR

IronOCR enables C# developers to extract text from scanned PDFs and images using OCR technology, converting non-searchable image-based documents into searchable, accessible content with just a few lines of code.

Many PDFs contain non-searchable, image-based text. IronOCR converts this into searchable content, making it easier to locate specific information and enhancing document accessibility, especially for individuals with visual impairments.

Instead of manually copying or recreating text and images, automated extraction ensures accuracy and efficiency. This is particularly useful for research, legal documents, and content creation where reusing specific portions of PDFs is common.

Businesses can extract critical data from PDFs for analysis or system integration, streamlining workflows. Designers and marketers can also extract images for enhancement and reuse in various projects.

In this tutorial, we'll explore the ReadDocument method, covering the available options and parameters to showcase how IronOCR simplifies text extraction from scanned PDFs and images for various applications.

To use the ReadDocument method, you must also install the IronOcr.Extensions.AdvancedScan package alongside the main IronOcr package.

Quickstart: Extract Text from a Scanned PDF or Image

Get started in seconds: with one line of code, you load your scanned PDF or image using IronOCR's OcrInput.LoadPdf or LoadImage and extract the text via ReadDocument. Perfect for developers who want OCR up and running fast.

Get started with IronOCR from NuGet now:

  1. Install IronOCR with NuGet Package Manager

    PM > Install-Package IronOcr

  2. Copy and run this code snippet.

    var text = new IronOcr.IronTesseract().ReadDocument(new IronOcr.OcrInput().LoadPdf("scanned.pdf")).Text;

  3. Deploy to test on your live environment

    Start using IronOCR in your project today with a free trial

How Do I Extract Text from Scanned Documents?

To extract text from all images within a document, use the ReadDocument method. This method processes the document and returns an object containing the extracted text, which can be accessed through the Text property. The example below demonstrates how to use this method with a sample TIFF file.

IronOCR supports a wide variety of document formats for scanning. For images, you can work with JPG, PNG, GIF, TIFF, and BMP formats, while PDF support includes both single and multi-page documents. The library uses advanced Tesseract 5 technology to ensure high accuracy across all supported formats.

Please note

  • The method currently only works for English, Chinese, Japanese, Korean, and LatinAlphabet.
  • Using advanced scan on .NET Framework requires the project to run on x64 architecture.

What Does the Input Document Look Like?

Page from Harry Potter book showing Chapter Eight 'The Deathday Party' with narrative text about Hogwarts in October

How Do I Implement the OCR Code?

using IronOcr;
using System;

// Instantiate OCR engine
var ocr = new IronTesseract();

// Load the scanned image into an OCR input
using var input = new OcrInput();
input.LoadImage("potter.tiff");

// Perform OCR
OcrResult result = ocr.ReadDocument(input);

Console.WriteLine(result.Text);

' VB.NET equivalent of the C# example above
Imports IronOcr
Imports System

' Instantiate OCR engine
Dim ocr = New IronTesseract()

' Load the scanned image into an OCR input
Using input = New OcrInput()
    input.LoadImage("potter.tiff")

    ' Perform OCR
    Dim result As OcrResult = ocr.ReadDocument(input)

    Console.WriteLine(result.Text)
End Using

What Results Can I Expect from OCR Processing?

Visual Studio Debug window displaying OCR-processed Harry Potter text output from scanned document example

If you need to perform OCR on a PDF file instead, simply replace the LoadImage method with LoadPdf. This allows IronOCR to process and extract text from scanned PDFs in the same way.
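
As a minimal sketch of that swap, the snippet below picks the loader from the file extension and then calls ReadDocument in the same way for both cases. The file name scanned.pdf is a placeholder, and the extension check is an illustrative convention rather than something IronOCR requires.

```csharp
using IronOcr;
using System;
using System.IO;

var ocr = new IronTesseract();
using var input = new OcrInput();

// Hypothetical input path; replace with your own scanned file.
string path = "scanned.pdf";

// Pick the loader from the file extension: PDFs go through LoadPdf,
// everything else is treated as an image.
bool isPdf = string.Equals(Path.GetExtension(path), ".pdf", StringComparison.OrdinalIgnoreCase);
if (isPdf)
{
    input.LoadPdf(path);
}
else
{
    input.LoadImage(path);
}

// ReadDocument is called the same way for both input types.
OcrResult result = ocr.ReadDocument(input);
Console.WriteLine(result.Text);
```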

Advanced Document Processing Options

When working with scanned documents, you often need more control over the OCR process. IronOCR provides several advanced features to enhance your text extraction results.

Processing Multi-Page Documents

For documents with multiple pages, IronOCR efficiently handles batch processing:

using IronOcr;
using System;

var ocr = new IronTesseract();
using var input = new OcrInput();

// Load a multi-page PDF
input.LoadPdf("multi-page-document.pdf");

// Process all pages
OcrResult result = ocr.ReadDocument(input);

// Access individual page results
foreach (var page in result.Pages)
{
    Console.WriteLine($"Page {page.PageNumber}: {page.Text}");
}

Optimizing OCR Performance

The quality of your scanned documents directly impacts OCR accuracy. IronOCR includes built-in image optimization filters to enhance text recognition:

using IronOcr;

var ocr = new IronTesseract();
using var input = new OcrInput();

// Load and enhance image quality
input.LoadImage("low-quality-scan.jpg");
input.Deskew();  // Correct image skew
input.DeNoise(); // Remove background noise
input.Binarize(); // Convert to black and white

OcrResult result = ocr.ReadDocument(input);

Creating Searchable PDFs

One of the most valuable features when processing scanned documents is the ability to create searchable PDFs. This maintains the original document appearance while adding a text layer:

using IronOcr;

var ocr = new IronTesseract();
using var input = new OcrInput();
input.LoadPdf("scanned-document.pdf");

// Process and save as searchable PDF
OcrResult result = ocr.ReadDocument(input);
result.SaveAsSearchablePdf("searchable-output.pdf");

Working with Different Document Types

IronOCR excels at processing various document types commonly encountered in business environments. Whether you're dealing with invoices, contracts, or historical documents, the library provides specialized features for extracting data from different sources.

Processing Legacy Documents

Many organizations have archives of scanned documents in older formats. IronOCR handles these efficiently, including support for multi-page TIFF files commonly used in document management systems.

Language Support

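While this example focuses on English text, IronOCR supports over 125 international languages, which makes it suitable for multilingual and non-English documents. Keep in mind that the ReadDocument method itself is currently limited to the languages listed in the notes earlier in this guide (English, Chinese, Japanese, Korean, and LatinAlphabet).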
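
A rough sketch of selecting a non-English language is shown below. It assumes the engine language is set through the Language property of IronTesseract with a value from the OcrLanguage enum, and that the matching language pack (assumed here to be the IronOcr.Languages.Japanese NuGet package) is installed. The file name is a placeholder.

```csharp
using IronOcr;
using System;

var ocr = new IronTesseract();

// Assumption: the Japanese language pack (IronOcr.Languages.Japanese) is installed.
ocr.Language = OcrLanguage.Japanese;

using var input = new OcrInput();
input.LoadImage("japanese-scan.png"); // hypothetical file name

OcrResult result = ocr.ReadDocument(input);
Console.WriteLine(result.Text);
```

Japanese is used here because it is one of the languages the notes list as supported by ReadDocument.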

Best Practices for Document Scanning

To achieve optimal results when processing scanned documents:

  1. Scan Quality: Use a minimum resolution of 300 DPI for best results
  2. File Format: TIFF and PNG formats preserve quality better than JPEG for text documents
  3. Pre-processing: Apply appropriate filters based on your document condition
  4. Performance: For large batches, consider using multithreading capabilities
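
The batch-processing tip can be sketched with the standard .NET Parallel.ForEach API. This is one possible approach under stated assumptions, not an IronOCR-specific feature: the scans folder and the *.tiff pattern are hypothetical, and each file gets its own IronTesseract and OcrInput because this guide does not say whether an engine instance can be shared between threads.

```csharp
using IronOcr;
using System;
using System.Collections.Concurrent;
using System.IO;
using System.Threading.Tasks;

// Hypothetical folder of scanned TIFF files; it must exist before running.
string[] files = Directory.GetFiles("scans", "*.tiff");

// Thread-safe map from file path to extracted text.
var texts = new ConcurrentDictionary<string, string>();

Parallel.ForEach(files, file =>
{
    // One engine and one input per file, so no OCR state is shared between threads.
    var ocr = new IronTesseract();
    using var input = new OcrInput();
    input.LoadImage(file);

    texts[file] = ocr.ReadDocument(input).Text;
});

// Report how much text each file produced.
foreach (var pair in texts)
{
    Console.WriteLine($"{pair.Key}: {pair.Value.Length} characters");
}
```

Creating an engine per file trades some startup cost for isolation; if your measurements show that cost matters, reusing instances is worth testing separately.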

Troubleshooting Common Issues

When working with scanned documents, you might encounter various challenges. Here are solutions to common problems:

  • Poor quality scans: Apply enhancement filters before OCR processing
  • Skewed documents: Use the Deskew() method to correct orientation
  • Mixed content: Process specific regions if documents contain both text and non-text elements
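
For the mixed-content case, a hedged sketch of region-based OCR follows. It assumes that LoadImage accepts a content-area rectangle from the IronSoftware.Drawing namespace and uses the general-purpose Read method of IronTesseract rather than ReadDocument. The file name and pixel coordinates are placeholders, so check the overload against the API reference for your IronOCR version.

```csharp
using IronOcr;
using IronSoftware.Drawing;
using System;

var ocr = new IronTesseract();
using var input = new OcrInput();

// Hypothetical region in pixels: x, y, width, height.
var contentArea = new Rectangle(100, 200, 800, 300);

// Assumption: this LoadImage overload restricts OCR to the given rectangle.
input.LoadImage("mixed-content-scan.png", contentArea);

// Read is the general OCR entry point on IronTesseract.
OcrResult result = ocr.Read(input);
Console.WriteLine(result.Text);
```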

For more detailed guidance, explore our comprehensive C# OCR tutorial or check out simple OCR examples to get started quickly.

Next Steps

Now that you understand how to extract text from scanned documents, you can explore more advanced features like making any PDF searchable or processing PDF streams for web applications. IronOCR's flexibility makes it suitable for everything from simple document digitization to complex enterprise document processing workflows.

Frequently Asked Questions

How do I extract text from a scanned PDF in C#?

IronOCR makes it simple to extract text from scanned PDFs in C#. Use the LoadPdf method to import your scanned PDF, then call ReadDocument to extract the text. For example: var text = new IronOcr.IronTesseract().ReadDocument(new IronOcr.OcrInput().LoadPdf("scanned.pdf")).Text; This single line of code loads your PDF and extracts all text content.

What file formats does the OCR library support for text extraction?

IronOCR supports a comprehensive range of document formats for OCR scanning. For images, it works with JPG, PNG, GIF, TIFF, and BMP formats. For PDFs, it handles both single and multi-page documents. The library uses advanced Tesseract 5 technology to ensure high accuracy across all supported formats.

Do I need to install additional packages for OCR functionality?

Yes. The ReadDocument method used in this guide requires the IronOcr.Extensions.AdvancedScan package in addition to the main IronOcr package. This extension package provides enhanced scanning capabilities for processing scanned documents.

Can I extract text from scanned images as well as PDFs?

Yes, IronOCR handles both scanned images and PDFs equally well. Use the LoadImage method for image files (JPG, PNG, GIF, TIFF, BMP) or LoadPdf for PDF documents. The ReadDocument method works with both input types to extract text content.

How does OCR help with non-searchable PDF documents?

IronOCR converts non-searchable, image-based PDFs into searchable content by extracting the text using OCR technology. This transformation makes it easier to locate specific information within documents and significantly enhances document accessibility, particularly for individuals with visual impairments.

What are the main business applications for OCR text extraction?

IronOCR enables businesses to extract critical data from PDFs for analysis and system integration, streamlining workflows. It's particularly useful for processing legal documents, research papers, and automating data entry. Designers and marketers can also extract images for enhancement and reuse in various projects.

Curtis Chau
Technical Writer

Curtis Chau holds a Bachelor’s degree in Computer Science (Carleton University) and specializes in front-end development with expertise in Node.js, TypeScript, JavaScript, and React. Passionate about crafting intuitive and aesthetically pleasing user interfaces, Curtis enjoys working with modern frameworks and creating well-structured, visually appealing manuals.
