# How to Read PDFs in C# with IronOCR
IronOCR enables you to extract text from PDF files in C# with a single line of code, supporting all PDF versions and providing accurate OCR results through its Tesseract-based engine.
PDF stands for "Portable Document Format." It is a file format developed by Adobe that preserves the fonts, images, graphics, and layout of any source document, regardless of the application and platform used to create it. PDF files are typically used for sharing and viewing documents in a consistent format, irrespective of the software or hardware used to open them. IronOCR handles various versions of PDF documents, from older PDF 1.0 specifications to the latest PDF 2.0 standards.
## Quickstart: OCR a PDF File in Seconds
Construct an `OcrPdfInput` that points to your PDF, then pass it to `Read`. The example below extracts text from a PDF in a single statement.
```cs
// Try IronOCR PDF OCR in one line
using var result = new IronOcr.IronTesseract().Read(new IronOcr.OcrPdfInput("document.pdf", PdfContents.TextAndImages));
```
<div class="hsg-featured-snippet">
<h3>Minimal Workflow (5 steps)</h3>
<ol>
<li><a class="js-modal-open" data-modal-id="trial-license-after-download" href="https://nuget.org/packages/IronOcr/">Download a C# library for reading PDFs</a></li>
<li>Prepare the PDF document for reading</li>
<li>Construct the <strong>OcrPdfInput</strong> object with PDF file path</li>
<li>Employ the <code>Read</code> method to perform OCR on the imported PDF</li>
<li>Read specific pages by providing the page indices list</li>
</ol>
</div>
<br class="clear">
## How Do I Read an Entire PDF File?
Begin by instantiating the IronTesseract class to perform OCR. Then, utilize a 'using' statement to create an `OcrPdfInput` object, passing the PDF file path to it. Finally, perform OCR using the `Read` method. This approach works with both scanned PDFs (image-based) and searchable PDFs (text-based), suitable for [extracting text from various PDF types](https://ironsoftware.com/csharp/ocr/examples/csharp-pdf-ocr/).
```csharp
/* :path=/static-assets/ocr/content-code-examples/how-to/input-pdfs-read-pdf.cs */
using IronOcr;

// Instantiate IronTesseract
IronTesseract ocrTesseract = new IronTesseract();

// Add PDF
using var pdfInput = new OcrPdfInput("Potter.pdf");

// Perform OCR
OcrResult ocrResult = ocrTesseract.Read(pdfInput);

// Access the extracted text
string extractedText = ocrResult.Text;
System.Console.WriteLine(extractedText);
```
In most cases, you do not need to specify the DPI. However, passing a higher DPI value when constructing `OcrPdfInput` can improve reading accuracy. The default DPI is sufficient for most standard PDF documents, but specialized documents may benefit from adjustment.
### When Should I Adjust the DPI Settings?
DPI (Dots Per Inch) settings become crucial when dealing with low-resolution scanned documents or PDFs containing small text. For optimal results, consider adjusting DPI settings when:
- Working with scanned documents below 200 DPI
- Processing historical or archival PDFs
- Dealing with complex layouts or small fonts
- Encountering accuracy issues with default settings
A DPI of 300 is recommended for most OCR operations, while 600 DPI may be necessary for documents with very small text or intricate details.
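To get a feel for what these DPI values mean in practice, the sketch below computes the pixel dimensions of a page rendered at several resolutions. It assumes page sizes are given in PDF points (72 points per inch). The helper name `PixelsAtDpi` is hypothetical and is not part of IronOCR.

```csharp
using System;

// Hypothetical helper (not an IronOCR API): converts a page size in PDF points
// (1 point = 1/72 inch) to the pixel size of the page rendered at a given DPI.
static (int Width, int Height) PixelsAtDpi(double widthPt, double heightPt, int dpi)
{
    if (dpi <= 0) throw new ArgumentOutOfRangeException(nameof(dpi));
    int width = (int)Math.Round(widthPt / 72.0 * dpi);
    int height = (int)Math.Round(heightPt / 72.0 * dpi);
    return (width, height);
}

// US Letter is 612 x 792 points (8.5 x 11 inches).
foreach (int targetDpi in new[] { 200, 300, 600 })
{
    var (pxWidth, pxHeight) = PixelsAtDpi(612, 792, targetDpi);
    Console.WriteLine($"{targetDpi} DPI -> {pxWidth} x {pxHeight} px");
}
```

At 300 DPI a US Letter page renders to 2550 x 3300 pixels. Doubling to 600 DPI quadruples the pixel count, which is why higher DPI helps with small text but takes longer to process.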
### What File Formats Does IronOCR Support Besides PDF?
IronOCR provides comprehensive support for numerous file formats beyond PDFs. You can process images in various formats including:
- JPEG/JPG for standard photographs
- PNG for images with transparency
- TIFF for multi-page documents
- BMP for uncompressed images
- GIF for simple graphics
Additionally, IronOCR can read PDF streams directly from memory, which suits web applications and cloud services.
### Working with PDF Content Types
When processing PDFs, you can optimize performance by specifying the content type. The PdfContents enum allows you to target specific content:
```csharp
// For text-only PDFs (faster processing)
var textOnlyPdf = new OcrPdfInput("document.pdf", PdfContents.Text);
// For image-only PDFs (scanned documents)
var imageOnlyPdf = new OcrPdfInput("scanned.pdf", PdfContents.Images);
// For mixed content (default)
var mixedPdf = new OcrPdfInput("mixed.pdf", PdfContents.TextAndImages);
```

## How Do I Read Specific Pages from a PDF?
When reading specific pages from a PDF document, specify the page index number for import. To do this, pass the list of page indices to the PageIndices parameter when constructing the OcrPdfInput. Keep in mind that page indices use zero-based numbering. This feature is particularly useful when working with large documents where only certain pages contain relevant information.
```csharp
/* :path=/static-assets/ocr/content-code-examples/how-to/input-pdfs-read-pdf-pages.cs */
using IronOcr;
using System.Collections.Generic;

// Instantiate IronTesseract
IronTesseract ocrTesseract = new IronTesseract();

// Create page indices list
List<int> pageIndices = new List<int>() { 0, 2 };

// Add PDF
using var pdfInput = new OcrPdfInput("Potter.pdf", PageIndices: pageIndices);

// Perform OCR
OcrResult ocrResult = ocrTesseract.Read(pdfInput);
```

### Why Does Page Numbering Start at Zero?
Zero-based indexing is a standard convention in C# and most programming languages. This means the first page is index 0, the second page is index 1, and so on. This consistency with array indexing makes it easier for developers to work with page collections programmatically. When converting from human-readable page numbers (1, 2, 3...) to indices, simply subtract 1 from the page number.
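The subtraction described above can be wrapped in a small helper. This is a minimal sketch: `ToPageIndices` is a hypothetical name, not an IronOCR API, and it only prepares the list that would be passed to `PageIndices`.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper (not an IronOCR API): converts 1-based page numbers,
// as a reader would see them, into the zero-based indices PageIndices expects.
static List<int> ToPageIndices(IEnumerable<int> pageNumbers)
{
    return pageNumbers
        .Select(n => n >= 1
            ? n - 1
            : throw new ArgumentOutOfRangeException(
                nameof(pageNumbers), $"Page numbers start at 1, got {n}."))
        .ToList();
}

// Pages 1, 3, and 10 of the document become indices 0, 2, and 9.
List<int> indices = ToPageIndices(new[] { 1, 3, 10 });
Console.WriteLine(string.Join(", ", indices)); // prints "0, 2, 9"
```

Rejecting page numbers below 1 up front catches the common mistake of passing an index where a page number was expected.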
### How Can I Read Non-Consecutive Pages?
Reading non-consecutive pages is straightforward with IronOCR. Simply add the desired page indices to your list in any order. For example:
```csharp
using System.Collections.Generic;
using System.Linq;

// Read pages 1, 3, 5, and 10 (using zero-based indices)
List<int> pageIndices = new List<int>() { 0, 2, 4, 9 };
// Or use LINQ for rule-based selection: even indices 0, 2, 4, 6, 8 (pages 1, 3, 5, 7, 9)
var evenIndices = Enumerable.Range(0, 10).Where(x => x % 2 == 0).ToList();
```

The OCR engine will process only the specified pages, significantly improving performance for large documents.
### What Happens If I Specify Invalid Page Numbers?
If you specify page indices that exceed the document's page count, IronOCR will throw an exception. Implement error handling or validate page counts before processing. You can check the total page count of a PDF before performing OCR to ensure your indices are valid.
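One way to validate indices before running OCR is sketched below. It assumes you already know the document's page count. How to obtain that count is not shown here, and `CheckPageIndices` is a hypothetical helper rather than part of IronOCR.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper (not an IronOCR API): splits requested indices into those
// that exist in a document with pageCount pages and those that do not.
static (List<int> Valid, List<int> Invalid) CheckPageIndices(IEnumerable<int> requested, int pageCount)
{
    var valid = new List<int>();
    var invalid = new List<int>();
    foreach (int index in requested.Distinct())
    {
        if (index >= 0 && index < pageCount) valid.Add(index);
        else invalid.Add(index);
    }
    return (valid, invalid);
}

// Assumed for illustration: the document has 5 pages (valid indices 0 through 4).
int totalPages = 5;
var (validIndices, invalidIndices) = CheckPageIndices(new[] { 0, 2, 7 }, totalPages);

string validText = string.Join(", ", validIndices);
string invalidText = string.Join(", ", invalidIndices);
Console.WriteLine($"Valid: {validText}");     // prints "Valid: 0, 2"
Console.WriteLine($"Skipped: {invalidText}"); // prints "Skipped: 7"
```

Only the valid list would then be passed as `PageIndices`. Whether to skip invalid entries silently, log them, or fail fast depends on the application.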
## How Do I OCR a Specific Region of a PDF?
By narrowing down the area to be read, you can significantly enhance the reading efficiency. To achieve this, specify the precise region of the imported PDF that needs to be read. In the code example below, IronOCR focuses solely on extracting the chapter number and title. This technique, similar to defining OCR regions for images, improves both speed and accuracy.
```csharp
/* :path=/static-assets/ocr/content-code-examples/how-to/input-pdfs-read-specific-region.cs */
using IronOcr;
using IronSoftware.Drawing;
using System;

// Instantiate IronTesseract
IronTesseract ocrTesseract = new IronTesseract();

// Specify crop regions
Rectangle[] scanRegions = { new Rectangle(550, 100, 600, 300) };

// Add PDF
using (var pdfInput = new OcrPdfInput("Potter.pdf", ContentAreas: scanRegions))
{
    // Perform OCR
    OcrResult ocrResult = ocrTesseract.Read(pdfInput);

    // Output the result to console
    Console.WriteLine(ocrResult.Text);
}
```

### How Do I Determine the Correct Rectangle Coordinates?

Finding the correct coordinates requires understanding the PDF's coordinate system. The Rectangle constructor takes four parameters: X (horizontal position), Y (vertical position), Width, and Height. All measurements are in pixels. Tools like PDF viewers with ruler features or debugging utilities can help identify exact coordinates. Alternatively, use trial and error with small adjustments to refine your selection area.
For more precise region definition, you can use IronOCR's highlight-text debugging feature to visualize the areas being processed.
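When trial and error gets tedious, it can help to describe a region as fractions of the page and compute the pixel values from that. The sketch below assumes you know the page's pixel dimensions at the DPI being used. `RegionFromFractions` is a hypothetical helper, not an IronOCR API, and it returns plain integers that you would pass to `new Rectangle(x, y, width, height)`.

```csharp
using System;

// Hypothetical helper (not an IronOCR API): turns a region expressed as fractions
// of the page (0.0 to 1.0) into pixel values for Rectangle(X, Y, Width, Height).
static (int X, int Y, int Width, int Height) RegionFromFractions(
    int pagePxWidth, int pagePxHeight,
    double left, double top, double width, double height)
{
    if (left < 0 || top < 0 || width <= 0 || height <= 0 || left + width > 1 || top + height > 1)
        throw new ArgumentOutOfRangeException(nameof(left), "Region must lie within the page.");

    return ((int)Math.Round(left * pagePxWidth),
            (int)Math.Round(top * pagePxHeight),
            (int)Math.Round(width * pagePxWidth),
            (int)Math.Round(height * pagePxHeight));
}

// Assumed page size: 2550 x 3300 px (US Letter rendered at 300 DPI).
// Select the top quarter of the page, inset 10% from the left and right edges.
var (x, y, w, h) = RegionFromFractions(2550, 3300, left: 0.10, top: 0.0, width: 0.80, height: 0.25);
Console.WriteLine($"new Rectangle({x}, {y}, {w}, {h})"); // prints "new Rectangle(255, 0, 2040, 825)"
```

Expressing regions as fractions keeps the selection stable if the same layout is later processed at a different resolution.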
### Can I Specify Multiple Regions in One Operation?
Yes, IronOCR supports multiple regions in a single OCR operation. Simply add multiple Rectangle objects to your array:
```csharp
Rectangle[] scanRegions = {
    new Rectangle(50, 50, 200, 100),  // Header region
    new Rectangle(50, 200, 500, 300), // Main content region
    new Rectangle(50, 550, 200, 50)   // Footer region
};
```

Each region will be processed separately, and the results will be combined in the order specified.
### Why Use Region-Specific OCR Instead of Full Page?
Region-specific OCR offers several advantages:
- Performance: Processing smaller areas is significantly faster
- Accuracy: Focusing on specific regions reduces noise from irrelevant content
- Structure: Extract data from forms and tables more reliably
- Cost efficiency: Less processing time means lower computational costs
This approach is particularly valuable when working with structured documents like invoices, forms, or reports where data appears in predictable locations. For complex document structures, explore reading tables in documents for specialized table extraction techniques.
## What Advanced PDF OCR Features Are Available?
IronOCR offers additional capabilities for PDF processing that extend beyond basic text extraction. You can create searchable PDFs from scanned documents, preserving the original layout while adding a text layer for searching and copying. The library also supports multithreading for faster processing of large PDF collections.
For developers looking to get started with OCR in their .NET applications, exploring the simple OCR examples provides a solid foundation for understanding IronOCR's capabilities and best practices.
### Handling Complex PDF Scenarios
When dealing with challenging PDF documents, IronOCR provides several advanced features:
- Image Preprocessing: Apply image filters to enhance text clarity
- Multiple Languages: Process documents containing multiple languages simultaneously
- Custom Configurations: Fine-tune OCR settings for specific document types
- Export Options: Save results in various formats including searchable PDFs and hOCR HTML
These features make IronOCR a comprehensive solution for enterprise-level PDF processing requirements.
## Frequently Asked Questions

### How do I extract text from a PDF file in C#?
You can extract text from PDF files using IronOCR with just one line of code. Simply create an IronTesseract instance and use the Read method with OcrPdfInput: `using var result = new IronOcr.IronTesseract().Read(new IronOcr.OcrPdfInput("document.pdf", PdfContents.TextAndImages));`. IronOCR handles both scanned PDFs (image-based) and searchable PDFs (text-based).
### What PDF versions are supported for text extraction?
IronOCR supports all PDF versions, from older PDF 1.0 specifications to the latest PDF 2.0 standards. The OCR engine is built on Tesseract technology, ensuring accurate text extraction regardless of the PDF version you're working with.
### Can I read only specific pages from a PDF instead of the entire document?
Yes, IronOCR allows you to read specific pages from a PDF by providing page indices. Instead of processing the entire document, you can specify which pages to extract text from using the OcrPdfInput object, making the OCR process more efficient for large documents.
### What is the minimal workflow for OCR on a PDF file?
The minimal workflow with IronOCR consists of 5 steps: 1) Download the C# library, 2) Prepare your PDF document, 3) Create an OcrPdfInput object with the PDF file path, 4) Use the Read method to perform OCR, and 5) Optionally specify page indices for selective reading.
### When should I adjust DPI settings for PDF OCR?
While IronOCR's default DPI settings work well for most standard PDFs, you should consider adjusting DPI when working with low-resolution scanned documents (below 200 DPI) or PDFs containing small text. Higher DPI settings in the OcrPdfInput construction can significantly enhance reading accuracy for specialized documents.
### Does the OCR engine work with both scanned and searchable PDFs?
Yes, IronOCR effectively processes both scanned PDFs (image-based) and searchable PDFs (text-based). The Tesseract-based engine automatically handles different PDF types, making it versatile for extracting text from various PDF formats without requiring different approaches.







