C# PDF OCR

OCR of PDF documents is a very common requirement in real-world .NET software development.

We may need to perform OCR on a PDF for a number of reasons, including:

  • Extracting content so it can be repurposed or modernized
  • Making scanned PDF documents searchable
  • Populating a search index

IronOCR provides a robust API for .NET and C# developers to perform this task. IronOCR extends Tesseract 5 and adds PDF functionality for .NET developers.

You can download the software product from this link.

How to Perform OCR on a PDF in C#

IronOCR provides a robust API to extract text from PDFs and also to make scanned PDFs searchable using C# and other .NET languages.

In the following C# example, we will OCR an existing PDF.

using IronOcr;

var Ocr = new IronTesseract();
using (var Input = new OcrInput())
{
    Input.AddPdf("example.pdf"); 
    var Result = Ocr.Read(Input);
    Console.WriteLine(Result.Text);
}
' Visual Basic equivalent
Imports IronOcr

Dim Ocr = New IronTesseract()
Using Input = New OcrInput()
	Input.AddPdf("example.pdf")
	Dim Result = Ocr.Read(Input)
	Console.WriteLine(Result.Text)
End Using

We can also extract text from one or more specific pages:

using IronOcr;

var Ocr = new IronTesseract();
using (var Input = new OcrInput())
{
    Input.AddPdfPages("example.pdf", new[] { 1, 2, 3 });
    var Result = Ocr.Read(Input);
    Console.WriteLine(Result.Text);
}
' Visual Basic equivalent
Imports IronOcr

Dim Ocr = New IronTesseract()
Using Input = New OcrInput()
	Input.AddPdfPages("example.pdf", { 1, 2, 3 })
	Dim Result = Ocr.Read(Input)
	Console.WriteLine(Result.Text)
End Using

PDF OCR Results Class

The Result value does not just contain the PDF text. It also contains information about the Pages, Paragraphs, Lines, Words, Characters and Barcodes that IronOCR discovered in the PDF document.

We can explore the IronOcr.OcrResult Class in: https://ironsoftware.com/csharp/ocr/examples/results-objects/.
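
As a minimal sketch of what that structure looks like in practice, the following walks a result page by page and paragraph by paragraph. The member names used here (Pages, PageNumber, WordCount, Paragraphs, Text and Confidence) are assumptions based on the results-objects example linked above; check them against the IronOCR version you have installed.

```csharp
using System;
using IronOcr;

var Ocr = new IronTesseract();
using (var Input = new OcrInput())
{
    Input.AddPdf("example.pdf");
    var Result = Ocr.Read(Input);

    // Walk the result one page at a time (member names assumed, see above).
    foreach (var Page in Result.Pages)
    {
        Console.WriteLine($"Page {Page.PageNumber}: {Page.WordCount} words");

        // Each page exposes its own paragraphs, with text and a confidence score.
        foreach (var Paragraph in Page.Paragraphs)
        {
            Console.WriteLine($"  [{Paragraph.Confidence:F1}] {Paragraph.Text}");
        }
    }
}
```

The same pattern should apply to the Lines, Words and Characters collections if finer-grained positions or confidence values are needed.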

Creating Searchable PDFs using OCR

One of our most popular OCR features is creating searchable PDFs from scans. This searchability makes PDFs more accessible to users and much easier to index in search engines such as Elasticsearch or even Google.

Making an Existing Scanned PDF Searchable

This code example will take a PDF document, improve the image quality, and return a searchable PDF.

using IronOcr;

var Ocr = new IronTesseract();
using (var Input = new OcrInput())
{
    Input.AddPdf("scan.pdf");

    // clean up twisted pages
    Input.Deskew();

    var Result = Ocr.Read(Input);
    Result.SaveAsSearchablePdf("searchable.pdf");
}
' Visual Basic equivalent
Imports IronOcr

Dim Ocr = New IronTesseract()
Using Input = New OcrInput()
	Input.AddPdf("scan.pdf")

	' clean up twisted pages
	Input.Deskew()

	Dim Result = Ocr.Read(Input)
	Result.SaveAsSearchablePdf("searchable.pdf")
End Using

Convert Images to a Searchable PDF

We can also use OCR to convert image files into a searchable PDF document in C# / .NET.

using IronOcr;

var Ocr = new IronTesseract();
using (var Input = new OcrInput())
{
    // each image becomes one page of the output PDF
    Input.Add(@"images\page1.png");
    Input.Add(@"images\page2.bmp");
    Input.Add(@"images\page3.tiff");

    // clean up twisted pages
    Input.Deskew();
    var Result = Ocr.Read(Input);
    Result.SaveAsSearchablePdf("searchable.pdf");
}
' Visual Basic equivalent
Imports IronOcr

Dim Ocr = New IronTesseract()
Using Input = New OcrInput()
	Input.Add("images\page1.png")
	Input.Add("images\page2.bmp")
	Input.Add("images\page3.tiff")

	' clean up twisted pages
	Input.Deskew()
	Dim Result = Ocr.Read(Input)
	Result.SaveAsSearchablePdf("searchable.pdf")
End Using