How to OCR a PDF Tutorial (Free Online Tools)

Optical Character Recognition (OCR) is the process of converting images of text, such as scanned pages or photographs, into machine-readable text. Applying OCR to PDF files is a common way to improve business processes. One benefit is accessibility: a scanned PDF contains only page images, which screen readers and search tools cannot read. PDF OCR produces a copy of the document with real, selectable text that more people and programs can use.

Another use of PDF OCR is document tracking. When a document is filed, scanned, or transcribed, it can be difficult to tell which version of the document belongs to which file. Once OCR has made the text searchable, the contents of different versions can be searched and compared, which makes it easier to match versions to files. This is useful for managing document archives and preventing the loss of important information.

In this article, you'll learn how to run OCR on a PDF file using Adobe Acrobat Pro. The article also introduces IronOCR, a .NET OCR library that is one of the more efficient and feature-rich options available. Let's begin with Adobe Acrobat Pro.

OCR a PDF using Adobe Acrobat Pro DC

How to OCR a PDF - Figure 1

Adobe Acrobat Pro DC is the paid, full-featured counterpart to the free Adobe Acrobat Reader DC. It is one of the most popular and powerful tools for PDF manipulation. With this software, you can create, edit, sign, and review PDF documents. It can also convert PDFs to PowerPoint presentations, Word documents, or Excel files, and it can edit scanned documents.

The new version of Acrobat DC is also a document scanner that can quickly turn scanned documents into digital files using OCR technology. It features Optical Character Recognition as well as intelligent business card scanning that automatically detects and saves contact information from cards in seconds.

Along with being able to extract text from PDF files, Acrobat Pro DC has many features that make it a valuable tool for PDF transcription.

Let's see how to run OCR on a scanned document using Adobe Acrobat Pro.

  • Open the desired PDF document, in our example a scanned PDF file, in Adobe Acrobat.
  • Select "Edit PDF" from the right pane of the document.

    How to OCR a PDF - Figure 2

  • This will open Acrobat's Edit PDF tool, which applies OCR to the scanned pages.
  • Click on the "Edit" button on the top ribbon.
  • This converts the scanned PDF into a fully editable PDF. You can then edit the text and images directly in the PDF file.

    How to OCR a PDF - Figure 3

  • You can also change the text block location, text font, etc.

After making any changes, save the file and you'll see these changes reflected in the document.

IronOCR: A .NET OCR Library

How to OCR a PDF - Figure 4

IronOCR is a .NET OCR library that reads text from documents and images and converts it into a machine-readable format.

This Optical Character Recognition library was developed with the following considerations in mind:

  • The need for a robust and accurate OCR engine that can be used with different languages without needing any external software.
  • The need for an easy-to-use API that works across different platforms such as Windows, Linux, and macOS.
  • The need for an OCR engine that can be easily integrated into various .NET applications and supports both WPF and console apps.

IronOCR makes it easier for developers to create software that scans documents, extracts text and metadata, indexes scanned image files, converts images to searchable PDFs, and converts scanned documents into readable text. It offers many options for encoding, image format conversion, and text recognition and extraction. IronOCR supports 125 languages.
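
As a rough illustration of the language support mentioned above, the sketch below configures a primary and a secondary recognition language. It assumes that the Language property, the AddSecondaryLanguage method, and the OcrLanguage.French value are available in the IronOCR version you use, and that the French language pack (distributed as a separate NuGet package) is installed. The class and method names MultilingualOcr and CreateEngine are hypothetical.

```csharp
using IronOcr;

public static class MultilingualOcr
{
    // Hypothetical helper: builds an engine that reads English first, then French.
    public static IronTesseract CreateEngine()
    {
        var ocr = new IronTesseract();

        // Primary language used for recognition.
        ocr.Language = OcrLanguage.English;

        // Assumption: the French language pack is installed alongside IronOCR.
        ocr.AddSecondaryLanguage(OcrLanguage.French);

        return ocr;
    }
}
```

The returned engine could be used wherever the code examples in this article create a new IronTesseract instance.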

IronOCR provides an intuitive, robust, and accurate OCR process to recognize text from scanned documents, photographs, and screenshots while reducing time-consuming tasks like page segmentation and layout analysis. The library is developed in C# and its API design is straightforward with good readability.

Let's explore some code examples using IronOCR:

Code Examples

using System;
using IronOcr;

var Ocr = new IronTesseract();

using (var Input = new OcrInput())
{
    // OCR the entire document
    Input.AddPdf("example.pdf", "password");

    // Alternatively, OCR only selected pages (use this instead of AddPdf above,
    // otherwise those pages are added to the input twice)
    // Input.AddPdfPages("example.pdf", new[] { 1, 2, 3 }, "password");

    // Run recognition and print the extracted text
    var Result = Ocr.Read(Input);
    Console.WriteLine(Result.Text);
}
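' VB.NET version of the same example
Imports System
Imports IronOcr

Dim Ocr = New IronTesseract()

Using Input = New OcrInput()
	' OCR the entire document
	Input.AddPdf("example.pdf", "password")

	' Alternatively, OCR only selected pages (use this instead of AddPdf above,
	' otherwise those pages are added to the input twice)
	' Input.AddPdfPages("example.pdf", { 1, 2, 3 }, "password")

	' Run recognition and print the extracted text
	Dim Result = Ocr.Read(Input)
	Console.WriteLine(Result.Text)
End Using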

IronOCR lets you run OCR on an entire PDF document or on a selected set of pages from a PDF file.
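
Page selections often arrive as text such as "1-3,5", for example from a command-line argument. As a minimal sketch, the hypothetical helper below, which is not part of IronOCR, expands such a string into the kind of page-number array passed to AddPdfPages in the example above. It assumes page numbers start at 1, as that example suggests.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class PageRange
{
    // Hypothetical helper (not an IronOCR API): turns "1-3,5" into { 1, 2, 3, 5 }.
    public static int[] Parse(string spec)
    {
        // A sorted set removes duplicates and keeps pages in ascending order.
        var pages = new SortedSet<int>();

        foreach (var part in spec.Split(new[] { ',' }, StringSplitOptions.RemoveEmptyEntries))
        {
            var bounds = part.Split('-');
            if (bounds.Length > 2)
                throw new FormatException("Invalid page range: '" + part + "'");

            // A single number such as "5" is treated as the range 5-5.
            int first = int.Parse(bounds[0].Trim());
            int last = bounds.Length == 2 ? int.Parse(bounds[1].Trim()) : first;

            if (first < 1 || last < first)
                throw new FormatException("Invalid page range: '" + part + "'");

            for (int page = first; page <= last; page++)
                pages.Add(page);
        }

        return pages.ToArray();
    }
}
```

For instance, PageRange.Parse("1-3,5") returns the pages 1, 2, 3 and 5, which could then be supplied as the second argument of Input.AddPdfPages.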

PDF File (input)

How to OCR a PDF - Figure 5

Output in the Console

How to OCR a PDF - Figure 6

You can also use IronOCR to convert a scanned PDF into a searchable PDF with selectable text; the process is simple and straightforward. See the code snippet for the PDF conversion below:

using IronOcr;

var Ocr = new IronTesseract();

using (var Input = new OcrInput())
{
    Input.AddPdf("scan.pdf", "password");

    // Straighten skewed (rotated) pages before recognition
    Input.Deskew();

    // Run recognition and save the result as a PDF with a searchable text layer
    var Result = Ocr.Read(Input);
    Result.SaveAsSearchablePdf("searchable.pdf");
}
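' VB.NET version of the same example
Imports IronOcr

Dim Ocr = New IronTesseract()

Using Input = New OcrInput()
	Input.AddPdf("scan.pdf", "password")

	' Straighten skewed (rotated) pages before recognition
	Input.Deskew()

	' Run recognition and save the result as a PDF with a searchable text layer
	Dim Result = Ocr.Read(Input)
	Result.SaveAsSearchablePdf("searchable.pdf")
End Using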

IronOCR offers many other tools and features. You can explore IronOCR features by visiting the following link.
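
The searchable-PDF conversion shown earlier handles one file. As a sketch of how it might be applied to a whole folder, the code below loops over every PDF in a directory and uses only the IronOCR calls that already appear in this article (AddPdf, Deskew, Read, SaveAsSearchablePdf). The BatchOcr class, its method names, and the ".searchable.pdf" naming rule are hypothetical choices, and the sketch assumes every file in the folder shares the same password.

```csharp
using System;
using System.IO;
using IronOcr;

public static class BatchOcr
{
    // Hypothetical naming rule: "scan.pdf" becomes "scan.searchable.pdf" in the same folder.
    public static string SearchablePathFor(string pdfPath)
    {
        string folder = Path.GetDirectoryName(pdfPath) ?? "";
        string name = Path.GetFileNameWithoutExtension(pdfPath);
        return Path.Combine(folder, name + ".searchable.pdf");
    }

    // Converts every PDF in the folder into a searchable copy next to the original.
    public static void ConvertFolder(string folder, string password)
    {
        var ocr = new IronTesseract();

        foreach (string pdfPath in Directory.GetFiles(folder, "*.pdf"))
        {
            // Skip copies produced by an earlier run of this method.
            if (pdfPath.EndsWith(".searchable.pdf", StringComparison.OrdinalIgnoreCase))
                continue;

            using (var input = new OcrInput())
            {
                input.AddPdf(pdfPath, password);
                input.Deskew();

                var result = ocr.Read(input);
                result.SaveAsSearchablePdf(SearchablePathFor(pdfPath));
            }
        }
    }
}
```

Creating a fresh OcrInput for each file keeps one document's pages from being mixed into the next, and the using block releases each input as soon as its file has been saved.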

Conclusion

The IronOCR library has several advantages over other libraries available on the market. You can modify and extend its functionality by adding your own modules with just a few lines of code. IronOCR can currently read text in 125 languages. It is designed to produce higher-quality, more reliable results while using less time and memory than other libraries.

IronOCR is free for development. IronOCR also offers a free trial for testing in production. For more details about pricing and a free trial of IronOCR, follow the link.

How to OCR a PDF - Figure 7