How to Read Scanned Documents Using IronOCR

Many PDFs contain non-searchable, image-based text. IronOCR can convert this into searchable content, making it easier to locate specific information and enhancing document accessibility, especially for individuals with visual impairments.

Compared with manually copying or recreating text and images, automated extraction saves time and reduces transcription errors. This is particularly useful for research, legal documents, and content creation, where reusing specific portions of PDFs is common.

Businesses can extract critical data from PDFs for analysis or system integration, streamlining workflows. Designers and marketers can also extract images for enhancement and reuse in various projects.

In this tutorial, we'll use the ReadDocument method to read a scanned document, cover its requirements and limitations, and show how the same approach applies to scanned PDFs.

To use the ReadDocument method, you must also install the IronOcr.Extensions.AdvancedScan package.

Read Scanned Documents Example

To extract text from all images within a document, use the ReadDocument method. This method processes the document and returns an object containing the extracted text, which can be accessed through the Text property. The example below demonstrates how to use this method with a sample TIFF file.

Please note

  • The method currently only works for English, Chinese, Japanese, Korean, and LatinAlphabet.
  • Using advanced scan on .NET Framework requires the project to run on x64 architecture.

Input

[Image: sample input document, potter.tiff]

Code

C#

using IronOcr;
using System;

// Instantiate OCR engine
var ocr = new IronTesseract();

// Load the scanned image into an OCR input
using var input = new OcrInput();
input.LoadImage("potter.tiff");

// Perform OCR (requires the IronOcr.Extensions.AdvancedScan package)
OcrResult result = ocr.ReadDocument(input);

// Print the extracted text
Console.WriteLine(result.Text);

VB.NET

Imports IronOcr
Imports System

Module Program
    Sub Main()
        ' Instantiate OCR engine
        Dim ocr = New IronTesseract()

        ' Load the scanned image into an OCR input
        Using input As New OcrInput()
            input.LoadImage("potter.tiff")

            ' Perform OCR (requires the IronOcr.Extensions.AdvancedScan package)
            Dim result As OcrResult = ocr.ReadDocument(input)

            ' Print the extracted text
            Console.WriteLine(result.Text)
        End Using
    End Sub
End Module

Output

[Image: text extracted from the sample document]

If you need to perform OCR on a PDF file instead, simply replace the LoadImage method with LoadPdf. This allows IronOCR to process and extract text from scanned PDFs in the same way.
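
As a minimal sketch of that swap, the snippet below loads a scanned PDF instead of an image. The file name scanned.pdf is a placeholder chosen for illustration, and the example assumes that both the IronOcr and IronOcr.Extensions.AdvancedScan packages are installed. Everything else mirrors the image example above.

```csharp
using IronOcr;
using System;

// Instantiate OCR engine
var ocr = new IronTesseract();

// Load a scanned PDF instead of an image
// ("scanned.pdf" is a hypothetical file name used for illustration)
using var input = new OcrInput();
input.LoadPdf("scanned.pdf");

// Perform OCR with the advanced scan extension
OcrResult result = ocr.ReadDocument(input);

// Print the extracted text
Console.WriteLine(result.Text);
```

The same language and architecture notes listed earlier should apply here, since only the loading step changes.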

Frequently Asked Questions

How can I read scanned documents using C#?

You can read scanned documents in C# by using IronOCR. First, download the C# library from NuGet, then import your scanned document using the LoadImage method for images or LoadPdf for PDFs. Finally, extract the text using the ReadDocument method.

What is the purpose of converting image-based text in PDFs to searchable content?

Converting image-based text in PDFs to searchable content with IronOCR enhances accessibility, making it easier to locate specific information and aiding individuals with visual impairments.

Can I extract text from images and PDFs with IronOCR?

Yes, IronOCR allows you to extract text from both images and PDFs. Use the LoadImage method for images and the LoadPdf method for PDFs, followed by the ReadDocument method to perform the extraction.

Which languages does the ReadDocument method support?

The ReadDocument method currently works only for English, Chinese, Japanese, Korean, and LatinAlphabet.

What architecture is required to use advanced scanning features in IronOCR?

To use advanced scanning features in IronOCR on the .NET Framework, your project must run on x64 architecture.

How can I use IronOCR for automated text extraction in business applications?

IronOCR can be used in business applications for automated text extraction by importing scanned documents, using the LoadPdf or LoadImage methods, and extracting text with the ReadDocument method. This streamlines workflows by allowing businesses to analyze and integrate critical data efficiently.

What steps are involved in extracting text from a scanned PDF using IronOCR?

To extract text from a scanned PDF using IronOCR, download the library, import the PDF using the LoadPdf method, then extract the text with the ReadDocument method. The extracted text can then be saved or exported as needed.

How does IronOCR benefit designers and marketers?

IronOCR benefits designers and marketers by allowing them to extract images and text from PDFs for enhancement and reuse in various projects, increasing efficiency and creative possibilities.

What package is necessary to install for using IronOCR's advanced features?

To access IronOCR's advanced features, you need to install the IronOcr.Extensions.AdvancedScan package from NuGet.

Curtis Chau
Technical Writer

Curtis Chau holds a Bachelor’s degree in Computer Science (Carleton University) and specializes in front-end development with expertise in Node.js, TypeScript, JavaScript, and React. Passionate about crafting intuitive and aesthetically pleasing user interfaces, Curtis enjoys working with modern frameworks and creating well-structured, visually appealing manuals.
