How to Read PDFs

PDF stands for "Portable Document Format." It is a file format developed by Adobe that preserves the fonts, images, graphics, and layout of any source document, regardless of the application and platform used to create it. PDF files are typically used for sharing and viewing documents in a consistent format, irrespective of the software or hardware used to open them. IronOcr handles various versions of PDF documents with ease.

Get started with IronOCR

Start using IronOCR in your project today with a free trial.

First Step:
green arrow pointer



Read PDF Example

Begin by instantiating the IronTesseract class to perform OCR. Then, utilize a 'using' statement to create an OcrPdfInput object, passing the PDF file path to it. Finally, perform OCR using the Read method.

:path=/static-assets/ocr/content-code-examples/how-to/input-pdfs-read-pdf.cs
using IronOcr;

// Instantiate IronTesseract
IronTesseract ocrTesseract = new IronTesseract();

// Add PDF
using var pdfInput = new OcrPdfInput("Potter.pdf");
// Perform OCR
OcrResult ocrResult = ocrTesseract.Read(pdfInput);
Imports IronOcr

' Instantiate IronTesseract
Private ocrTesseract As New IronTesseract()

' Add PDF
Private pdfInput = New OcrPdfInput("Potter.pdf")
' Perform OCR
Private ocrResult As OcrResult = ocrTesseract.Read(pdfInput)
$vbLabelText   $csharpLabel
Read PDF file

In most cases, there's no need to specify the DPI property. However, providing a high DPI number in the construction of OcrPdfInput can enhance reading accuracy.

Read PDF Pages Example

When reading specific pages from a PDF document, the user can specify the page index number for import. To do this, pass the list of page indices to the PageIndices parameter when constructing the OcrPdfInput. Keep in mind that page indices use zero-based numbering.

:path=/static-assets/ocr/content-code-examples/how-to/input-pdfs-read-pdf-pages.cs
using IronOcr;
using System.Collections.Generic;

// Instantiate IronTesseract
IronTesseract ocrTesseract = new IronTesseract();

// Create page indices list
List<int> pageIndices = new List<int>() { 0, 2 };

// Add PDF
using var pdfInput = new OcrPdfInput("Potter.pdf", PageIndices: pageIndices);
// Perform OCR
OcrResult ocrResult = ocrTesseract.Read(pdfInput);
Imports IronOcr
Imports System.Collections.Generic

' Instantiate IronTesseract
Private ocrTesseract As New IronTesseract()

' Create page indices list
Private pageIndices As New List(Of Integer)() From {0, 2}

' Add PDF
Private pdfInput = New OcrPdfInput("Potter.pdf", PageIndices:= pageIndices)
' Perform OCR
Private ocrResult As OcrResult = ocrTesseract.Read(pdfInput)
$vbLabelText   $csharpLabel

Specify Scan Region

By narrowing down the area to be read, you can significantly enhance the reading efficiency. To achieve this, you can specify the precise region of the imported PDF that needs to be read. In the code example below, I have instructed IronOcr to focus solely on extracting the chapter number and title.

:path=/static-assets/ocr/content-code-examples/how-to/input-pdfs-read-specific-region.cs
using IronOcr;
using IronSoftware.Drawing;
using System;

// Instantiate IronTesseract
IronTesseract ocrTesseract = new IronTesseract();

// Specify crop regions
Rectangle[] scanRegions = { new Rectangle(550, 100, 600, 300) };

// Add PDF
using (var pdfInput = new OcrPdfInput("Potter.pdf", ContentAreas: scanRegions))
{
    // Perform OCR
    OcrResult ocrResult = ocrTesseract.Read(pdfInput);

    // Output the result to console
    Console.WriteLine(ocrResult.Text);
}
Imports IronOcr
Imports IronSoftware.Drawing
Imports System

' Instantiate IronTesseract
Private ocrTesseract As New IronTesseract()

' Specify crop regions
Private scanRegions() As Rectangle = { New Rectangle(550, 100, 600, 300) }

' Add PDF
Using pdfInput = New OcrPdfInput("Potter.pdf", ContentAreas:= scanRegions)
	' Perform OCR
	Dim ocrResult As OcrResult = ocrTesseract.Read(pdfInput)

	' Output the result to console
	Console.WriteLine(ocrResult.Text)
End Using
$vbLabelText   $csharpLabel

OCR Result

Read specific region

Frequently Asked Questions

What is a PDF?

PDF stands for 'Portable Document Format.' It is a file format developed by Adobe that preserves the fonts, images, graphics, and layout of any source document, irrespective of the application and platform used to create it.

How can I read a PDF in C#?

To read a PDF, use IronOCR by instantiating the IronTesseract class, creating an OcrPdfInput object with the PDF file path, and using the Read method to perform OCR on the document.

How do I read specific pages from a PDF?

To read specific pages, use IronOCR to pass a list of page indices to the PageIndices parameter when constructing the OcrPdfInput. This allows you to import and perform OCR on specified pages only.

Can I specify a scan region in a PDF for OCR?

Yes, with IronOCR, you can specify a precise region of the PDF to be read by using the SelectRegion method. This narrows down the area for OCR and enhances reading efficiency.

What is the benefit of specifying a high DPI when reading PDFs?

Using IronOCR, specifying a high DPI can enhance the accuracy of reading PDFs. However, in most cases, it's not necessary to set this property explicitly.

What is the zero-based numbering system used for page indices?

In IronOCR, page indices are zero-based, meaning the first page of the PDF is indexed as 0. This is used when specifying which pages to read from a PDF.

Do I need to dispose of resources when using a library for OCR?

Yes, when using IronOCR, it is advisable to use a 'using' statement when working with OcrInput objects to ensure resources are disposed of properly after performing OCR operations.

Chaknith Bin
Software Engineer
Chaknith works on IronXL and IronBarcode. He has deep expertise in C# and .NET, helping improve the software and support customers. His insights from user interactions contribute to better products, documentation, and overall experience.
Reviewed by
Jeff Fritz
Jeffrey T. Fritz
Principal Program Manager - .NET Community Team
Jeff is also a Principal Program Manager for the .NET and Visual Studio teams. He is the executive producer of the .NET Conf virtual conference series and hosts 'Fritz and Friends' a live stream for developers that airs twice weekly where he talks tech and writes code together with viewers. Jeff writes workshops, presentations, and plans content for the largest Microsoft developer events including Microsoft Build, Microsoft Ignite, .NET Conf, and the Microsoft MVP Summit