How to Read PDFs

By Chaknith Bin
October 25, 2023 (updated June 22, 2025)

PDF stands for "Portable Document Format." It is a file format developed by Adobe that preserves the fonts, images, graphics, and layout of a source document, regardless of the application and platform used to create it. PDF files are typically used to share and view documents in a consistent format, irrespective of the software or hardware used to open them. IronOCR handles various versions of PDF documents.

How to Read PDFs

1. Download a C# library for reading PDFs
2. Prepare the PDF document for reading
3. Construct the OcrPdfInput object with the PDF file path
4. Use the Read method to perform OCR on the imported PDF
5. Read specific pages by providing a list of page indices

Read PDF Example

Begin by instantiating the IronTesseract class to perform OCR. Then use a 'using' statement to create an OcrPdfInput object, passing the PDF file path to it. Finally, perform OCR with the Read method.

C#:

    using IronOcr;

    // The IronTesseract class is the primary IronOCR class for performing OCR operations.
    var ocrTesseract = new IronTesseract();

    // The using statement ensures the OcrPdfInput object is disposed of after use,
    // which frees its resources. This input represents the PDF file to be processed.
    using (var pdfInput = new OcrPdfInput("Potter.pdf"))
    {
        // Perform OCR on the PDF input. The Read method processes the given PDF and returns the result.
        // OcrResult holds the extracted text and other metadata.
        OcrResult ocrResult = ocrTesseract.Read(pdfInput);

        // The extracted text can be accessed through ocrResult.Text.
        // Uncomment the line below to print the recognized text to the console.
        // Console.WriteLine(ocrResult.Text);
    }

VB.NET:

    Imports IronOcr

    ' The IronTesseract class is the primary IronOCR class for performing OCR operations.
    Dim ocrTesseract As New IronTesseract()

    ' The Using statement ensures the OcrPdfInput object is disposed of after use,
    ' which frees its resources. This input represents the PDF file to be processed.
    Using pdfInput = New OcrPdfInput("Potter.pdf")
        ' Perform OCR on the PDF input. The Read method processes the given PDF and returns the result.
        ' OcrResult holds the extracted text and other metadata.
        Dim ocrResult As OcrResult = ocrTesseract.Read(pdfInput)

        ' The extracted text can be accessed through ocrResult.Text.
        ' Uncomment the line below to print the recognized text to the console.
        ' Console.WriteLine(ocrResult.Text)
    End Using

In most cases, there is no need to specify the DPI property. However, providing a higher DPI value when constructing the OcrPdfInput can improve reading accuracy.

Read PDF Pages Example

To read only specific pages from a PDF document, specify the page indices to import. To do this, pass the list of page indices to the PageIndices parameter when constructing the OcrPdfInput. Keep in mind that page indices use zero-based numbering.

C#:

    // Import the necessary namespaces
    using IronOcr;
    using System.Collections.Generic;

    // Instantiate IronTesseract as the OCR engine
    var ocrTesseract = new IronTesseract();

    // Create a list of page indices to specify which pages of the PDF to process.
    // Example: process only the first and third pages (zero-based indices 0 and 2).
    var pageIndices = new List<int> { 0, 2 };

    // Use a using statement to ensure the OcrPdfInput object is disposed of correctly.
    // This object reads the specified PDF and lets you select specific pages for OCR.
    using (var pdfInput = new OcrPdfInput("Potter.pdf"))
    {
        // Assign the page indices to the PageIndices property of the pdfInput object
        pdfInput.PageIndices = pageIndices;

        // Perform OCR on the specified pages of the PDF.
        // The Read method processes the input and returns an OcrResult object containing the recognized text.
        OcrResult ocrResult = ocrTesseract.Read(pdfInput);

        // Add your own logic here if necessary.
        // Uncomment the following line to print the recognized text to the console.
        // Console.WriteLine(ocrResult.Text);
    }

VB.NET:

    ' Import the necessary namespaces
    Imports IronOcr
    Imports System.Collections.Generic

    ' Instantiate IronTesseract as the OCR engine
    Dim ocrTesseract As New IronTesseract()

    ' Create a list of page indices to specify which pages of the PDF to process.
    ' Example: process only the first and third pages (zero-based indices 0 and 2).
    Dim pageIndices As New List(Of Integer) From {0, 2}

    ' Use a Using statement to ensure the OcrPdfInput object is disposed of correctly.
    ' This object reads the specified PDF and lets you select specific pages for OCR.
    Using pdfInput = New OcrPdfInput("Potter.pdf")
        ' Assign the page indices to the PageIndices property of the pdfInput object
        pdfInput.PageIndices = pageIndices

        ' Perform OCR on the specified pages of the PDF.
        ' The Read method processes the input and returns an OcrResult object containing the recognized text.
        Dim ocrResult As OcrResult = ocrTesseract.Read(pdfInput)

        ' Add your own logic here if necessary.
        ' Uncomment the following line to print the recognized text to the console.
        ' Console.WriteLine(ocrResult.Text)
    End Using

Specify Scan Region

Narrowing down the area to be read can significantly improve reading efficiency. To do this, specify the precise region of the imported PDF that needs to be read.
The code example below instructs IronOCR to focus solely on extracting the chapter number and title.

C#:

    using IronOcr;
    using IronSoftware.Drawing; // Provides the Rectangle type
    using System;

    // Initialize an instance of IronTesseract for performing OCR operations
    var ocrTesseract = new IronTesseract();

    // Define the crop regions as rectangles where OCR should be performed
    Rectangle[] scanRegions = { new Rectangle(550, 100, 600, 300) };

    // Process the PDF inside a using block to ensure proper resource management
    using (var pdfInput = new OcrInput("Potter.pdf", scanRegions))
    {
        // Read the specified regions of the PDF and perform OCR to extract text
        OcrResult ocrResult = ocrTesseract.Read(pdfInput);

        // Output the recognized text to the console
        Console.WriteLine(ocrResult.Text);
    }

VB.NET:

    Imports IronOcr
    Imports IronSoftware.Drawing ' Provides the Rectangle type
    Imports System

    ' Initialize an instance of IronTesseract for performing OCR operations
    Dim ocrTesseract As New IronTesseract()

    ' Define the crop regions as rectangles where OCR should be performed
    Dim scanRegions() As Rectangle = { New Rectangle(550, 100, 600, 300) }

    ' Process the PDF inside a Using block to ensure proper resource management
    Using pdfInput = New OcrInput("Potter.pdf", scanRegions)
        ' Read the specified regions of the PDF and perform OCR to extract text
        Dim ocrResult As OcrResult = ocrTesseract.Read(pdfInput)

        ' Output the recognized text to the console
        Console.WriteLine(ocrResult.Text)
    End Using

Frequently Asked Questions

What is a PDF?
PDF stands for "Portable Document Format." It is a file format developed by Adobe that preserves the fonts, images, graphics, and layout of a source document, irrespective of the application and platform used to create it.

How can I read a PDF in C#?
To read a PDF with IronOCR, instantiate the IronTesseract class, create an OcrPdfInput object with the PDF file path, and call the Read method to perform OCR on the document.

How do I read specific pages from a PDF?
Pass a list of page indices to the PageIndices parameter when constructing the OcrPdfInput. OCR is then performed on the specified pages only.

Can I specify a scan region in a PDF for OCR?
Yes. With IronOCR, you can restrict OCR to a precise region of the PDF by passing an array of Rectangle scan regions when constructing the input, as in the Specify Scan Region example. This narrows down the area to be read and improves reading efficiency.

What is the benefit of specifying a high DPI when reading PDFs?
Specifying a higher DPI can improve the accuracy of reading PDFs with IronOCR. In most cases, however, it is not necessary to set this property explicitly.

What is the zero-based numbering system used for page indices?
In IronOCR, page indices are zero-based: the first page of the PDF has index 0, the second has index 1, and so on. Use these indices when specifying which pages to read from a PDF.

Do I need to dispose of resources when using a library for OCR?
Yes. When using IronOCR, it is advisable to wrap OcrInput objects in a 'using' statement so that resources are disposed of properly after OCR operations.
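
As a small illustration of the zero-based page indices discussed above, the sketch below converts page numbers as a reader would count them (starting at 1) into the indices that a PageIndices list expects. The helper name ToPageIndices is hypothetical and is not part of IronOCR; the sketch uses only standard .NET collections, and the resulting list would then be supplied to the OcrPdfInput as in the Read PDF Pages example.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Page numbers as a reader would count them: the first and third pages.
var pageNumbers = new List<int> { 1, 3 };

// Convert to zero-based indices for use as a PageIndices list.
List<int> pageIndices = ToPageIndices(pageNumbers);

Console.WriteLine(string.Join(", ", pageIndices)); // prints: 0, 2

// Hypothetical helper (not an IronOCR API): maps 1-based page numbers
// to zero-based indices and rejects any value below 1.
static List<int> ToPageIndices(IEnumerable<int> pageNumbers)
{
    return pageNumbers
        .Select(n => n >= 1
            ? n - 1
            : throw new ArgumentOutOfRangeException(nameof(pageNumbers), "Page numbers start at 1."))
        .ToList();
}
```

Validating the input up front is a design choice in this sketch: it surfaces an off-by-one mistake before any OCR work starts, rather than relying on how the library might treat a negative index.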