Read Scanned Documents in C# Using IronOCR
IronOCR enables C# developers to extract text from scanned PDFs and images using OCR technology, converting non-searchable image-based documents into searchable, accessible content with just a few lines of code.
Many PDFs contain non-searchable, image-based text. IronOCR converts this into searchable content, making it easier to locate specific information and enhancing document accessibility, especially for individuals with visual impairments.
Instead of manually copying or recreating text and images, automated extraction ensures accuracy and efficiency. This is particularly useful for research, legal documents, and content creation where reusing specific portions of PDFs is common.
Businesses can extract critical data from PDFs for analysis or system integration, streamlining workflows. Designers and marketers can also extract images for enhancement and reuse in various projects.
In this tutorial, we'll explore the ReadDocument method together with the LoadImage and LoadPdf input methods, covering the available options to show how IronOCR simplifies text extraction from scanned documents.
To use the ReadDocument method, you must also install the IronOcr.Extensions.AdvancedScan package.
Quickstart: Extract Text from a Scanned PDF or Image
Get started in seconds: with one line of code, you load your scanned PDF or image using IronOCR's OcrInput.LoadPdf or LoadImage and extract the text via ReadDocument. Perfect for developers who want OCR up and running fast.
Install IronOCR with NuGet Package Manager
Copy and run this code snippet.
var text = new IronOcr.IronTesseract().ReadDocument(new IronOcr.OcrInput().LoadPdf("scanned.pdf")).Text;
Minimal Workflow (5 steps)
- Download the C# library for reading scanned documents
- Import the scanned document for processing
- Use the LoadImage method for images or LoadPdf for scanned PDFs
- Extract text using the ReadDocument method
- Save or export the extracted text as needed for further use
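The last workflow step, saving the extracted text, can be sketched as follows. This is a minimal example rather than a definitive implementation: the file names scanned.pdf and extracted.txt are placeholders, and the text is written with the standard .NET File.WriteAllText call.

```csharp
using System.IO;
using IronOcr;

// ReadDocument requires the IronOcr.Extensions.AdvancedScan package
var ocr = new IronTesseract();

// "scanned.pdf" is a placeholder input path for this sketch
using var input = new OcrInput();
input.LoadPdf("scanned.pdf");

// Run OCR over the loaded document
OcrResult result = ocr.ReadDocument(input);

// Save the extracted text; "extracted.txt" is a placeholder output path
File.WriteAllText("extracted.txt", result.Text);
```

Writing plain text keeps the output easy to feed into search indexes or other downstream tools.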
How Do I Extract Text from Scanned Documents?
To extract text from all images within a document, use the ReadDocument method. This method processes the document and returns an object containing the extracted text, which can be accessed through the Text property. The example below demonstrates how to use this method with a sample TIFF file.
IronOCR supports a wide variety of document formats for scanning. For images, you can work with JPG, PNG, GIF, TIFF, and BMP formats, while PDF support includes both single and multi-page documents. The library uses advanced Tesseract 5 technology to ensure high accuracy across all supported formats.
- The method currently only works for English, Chinese, Japanese, Korean, and LatinAlphabet.
- Using advanced scan on .NET Framework requires the project to run on x64 architecture.
What Does the Input Document Look Like?

How Do I Implement the OCR Code?
using IronOcr;
using System;
// Instantiate OCR engine
var ocr = new IronTesseract();
// Load the scanned image into an OCR input
using var input = new OcrInput();
input.LoadImage("potter.tiff");
// Perform OCR
OcrResult result = ocr.ReadDocument(input);
Console.WriteLine(result.Text);
The equivalent VB.NET code:
Imports IronOcr
Imports System
' Instantiate OCR engine
Dim ocr = New IronTesseract()
' Load the scanned image into an OCR input
Using input = New OcrInput()
    input.LoadImage("potter.tiff")
    ' Perform OCR
    Dim result As OcrResult = ocr.ReadDocument(input)
    Console.WriteLine(result.Text)
End Using
What Results Can I Expect from OCR Processing?

If you need to perform OCR on a PDF file instead, simply replace the LoadImage method with LoadPdf. This allows IronOCR to process and extract text from scanned PDFs in the same way.
Advanced Document Processing Options
When working with scanned documents, you often need more control over the OCR process. IronOCR provides several advanced features to enhance your text extraction results.
Processing Multi-Page Documents
For documents with multiple pages, IronOCR efficiently handles batch processing:
using IronOcr;
using System;
var ocr = new IronTesseract();
using var input = new OcrInput();
// Load a multi-page PDF
input.LoadPdf("multi-page-document.pdf");
// Process all pages
OcrResult result = ocr.ReadDocument(input);
// Access individual page results
foreach (var page in result.Pages)
{
    Console.WriteLine($"Page {page.PageNumber}: {page.Text}");
}
Optimizing OCR Performance
The quality of your scanned documents directly impacts OCR accuracy. IronOCR includes built-in image optimization filters to enhance text recognition:
using IronOcr;
var ocr = new IronTesseract();
using var input = new OcrInput();
// Load and enhance image quality
input.LoadImage("low-quality-scan.jpg");
input.Deskew(); // Correct image skew
input.DeNoise(); // Remove background noise
input.Binarize(); // Convert to black and white
OcrResult result = ocr.ReadDocument(input);
Creating Searchable PDFs
One of the most valuable features when processing scanned documents is the ability to create searchable PDFs. This maintains the original document appearance while adding a text layer:
using IronOcr;
var ocr = new IronTesseract();
using var input = new OcrInput();
input.LoadPdf("scanned-document.pdf");
// Process and save as searchable PDF
OcrResult result = ocr.ReadDocument(input);
result.SaveAsSearchablePdf("searchable-output.pdf");
Working with Different Document Types
IronOCR excels at processing various document types commonly encountered in business environments. Whether you're dealing with invoices, contracts, or historical documents, the library provides specialized features for extracting data from different sources.
Processing Legacy Documents
Many organizations have archives of scanned documents in older formats. IronOCR handles these efficiently, including support for multi-page TIFF files commonly used in document management systems.
Language Support
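While this example focuses on English text, IronOCR as a whole supports over 125 international languages, which makes it suitable for multilingual documents and documents in non-English languages. Note that, as stated earlier, the ReadDocument method itself currently works only for English, Chinese, Japanese, Korean, and LatinAlphabet.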
Best Practices for Document Scanning
To achieve optimal results when processing scanned documents:
- Scan Quality: Use a minimum resolution of 300 DPI for best results
- File Format: TIFF and PNG formats preserve quality better than JPEG for text documents
- Pre-processing: Apply appropriate filters based on your document condition
- Performance: For large batches, consider using multithreading capabilities
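As a rough illustration of the multithreading suggestion, a batch of scans could be processed with the standard .NET Parallel.ForEach. This is a sketch under stated assumptions, not the library's prescribed approach: the scans folder is hypothetical, the concurrency limit of 4 is arbitrary, and creating a separate IronTesseract and OcrInput per file is assumed to be safe for concurrent use (check the library's threading guidance before relying on it).

```csharp
using System.IO;
using System.Threading.Tasks;
using IronOcr;

// "scans" is a hypothetical folder containing TIFF images
string[] files = Directory.GetFiles("scans", "*.tiff");

// Cap concurrency because OCR is CPU-heavy; 4 is an arbitrary choice
var options = new ParallelOptions { MaxDegreeOfParallelism = 4 };

Parallel.ForEach(files, options, file =>
{
    // Assumption: one engine and one input per file, so no state is shared across threads
    var ocr = new IronTesseract();
    using var input = new OcrInput();
    input.LoadImage(file);

    OcrResult result = ocr.ReadDocument(input);

    // Write the text next to the source image, e.g. scans/page1.txt
    File.WriteAllText(Path.ChangeExtension(file, ".txt"), result.Text);
});
```

Each file is written to its own output path, so the parallel iterations do not contend for the same file.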
Troubleshooting Common Issues
When working with scanned documents, you might encounter various challenges. Here are solutions to common problems:
- Poor quality scans: Apply enhancement filters before OCR processing
- Skewed documents: Use the Deskew() method to correct orientation
- Mixed content: Process specific regions if documents contain both text and non-text elements
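For the mixed-content case, one possible approach is to restrict OCR to a rectangular region of the image. The sketch below rests on several assumptions not stated in this article: that LoadImage accepts a crop region as a second argument, that the region type is Rectangle from the IronSoftware.Drawing namespace, and that the engine's general Read method is used because it is not stated here whether ReadDocument honors a crop region. The file name and pixel coordinates are placeholders.

```csharp
using System;
using IronOcr;
using IronSoftware.Drawing;

var ocr = new IronTesseract();
using var input = new OcrInput();

// Placeholder region in pixels: x, y, width, height
var textRegion = new Rectangle(215, 1250, 1335, 280);

// Assumption: this overload limits processing to the given region
input.LoadImage("mixed-content-scan.png", textRegion);

// Assumption: Read is used instead of ReadDocument for region-based input
OcrResult result = ocr.Read(input);
Console.WriteLine(result.Text);
```

Restricting the region keeps logos, photos, and other non-text elements out of the recognition pass.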
For more detailed guidance, explore our comprehensive C# OCR tutorial or check out simple OCR examples to get started quickly.
Next Steps
Now that you understand how to extract text from scanned documents, you can explore more advanced features like making any PDF searchable or processing PDF streams for web applications. IronOCR's flexibility makes it suitable for everything from simple document digitization to complex enterprise document processing workflows.
Frequently Asked Questions
How do I extract text from a scanned PDF in C#?
IronOCR makes it simple to extract text from scanned PDFs in C#. Use the LoadPdf method to import your scanned PDF, then call ReadDocument to extract the text. For example: var text = new IronOcr.IronTesseract().ReadDocument(new IronOcr.OcrInput().LoadPdf("scanned.pdf")).Text; This single line of code loads your PDF and extracts all text content.
What file formats does the OCR library support for text extraction?
IronOCR supports a comprehensive range of document formats for OCR scanning. For images, it works with JPG, PNG, GIF, TIFF, and BMP formats. For PDFs, it handles both single and multi-page documents. The library uses advanced Tesseract 5 technology to ensure high accuracy across all supported formats.
Do I need to install additional packages for OCR functionality?
Yes, to use the full OCR functionality with IronOCR, you need to install the IronOcr.Extensions.AdvancedScan package in addition to the main IronOCR library. This extension package provides enhanced scanning capabilities for processing scanned documents.
Can I extract text from scanned images as well as PDFs?
Yes, IronOCR handles both scanned images and PDFs equally well. Use the LoadImage method for image files (JPG, PNG, GIF, TIFF, BMP) or LoadPdf for PDF documents. The ReadDocument method works with both input types to extract text content.
How does OCR help with non-searchable PDF documents?
IronOCR converts non-searchable, image-based PDFs into searchable content by extracting the text using OCR technology. This transformation makes it easier to locate specific information within documents and significantly enhances document accessibility, particularly for individuals with visual impairments.
What are the main business applications for OCR text extraction?
IronOCR enables businesses to extract critical data from PDFs for analysis and system integration, streamlining workflows. It's particularly useful for processing legal documents, research papers, and automating data entry. Designers and marketers can also extract images for enhancement and reuse in various projects.






