C# PDF OCR
OCR of PDF documents is a common requirement in real-world .NET software development.
We may need to perform OCR on a PDF for a number of reasons, including:
- Extracting content so it can be re-purposed or modernized
- Making scanned PDF documents searchable
- Populating a search index
IronOCR provides a robust API for .NET and C# developers to perform this task. IronOCR extends Tesseract 5 and adds PDF functionality for .NET developers.
You can download the software product from this link.
How to Perform OCR on a PDF in C#
IronOCR can extract text from PDFs and can also make scanned PDFs searchable, using C# and other .NET languages.
In the following C# example, we will OCR an existing PDF.
using System;
using IronOcr;

var Ocr = new IronTesseract();
using (var Input = new OcrInput())
{
    // Load every page of the PDF into the OCR input
    Input.AddPdf("example.pdf");
    var Result = Ocr.Read(Input);
    // Print the recognized text
    Console.WriteLine(Result.Text);
}
' VB.NET version (the statements below belong inside a Sub)
Imports IronOcr

Dim Ocr = New IronTesseract()
Using Input = New OcrInput()
    Input.AddPdf("example.pdf")
    Dim Result = Ocr.Read(Input)
    Console.WriteLine(Result.Text)
End Using
We can also extract text from one or more specific pages:
using System;
using IronOcr;

var Ocr = new IronTesseract();
using (var Input = new OcrInput())
{
    // Load only the listed pages of the PDF
    Input.AddPdfPages("example.pdf", new[] { 1, 2, 3 });
    var Result = Ocr.Read(Input);
    Console.WriteLine(Result.Text);
}
' VB.NET version (the statements below belong inside a Sub)
Imports IronOcr

Dim Ocr = New IronTesseract()
Using Input = New OcrInput()
    Input.AddPdfPages("example.pdf", { 1, 2, 3 })
    Dim Result = Ocr.Read(Input)
    Console.WriteLine(Result.Text)
End Using
PDF OCR Results Class
The Result value does not just contain the PDF text. It also contains information about Pages, Paragraphs, Lines, Words, Characters and Barcodes discovered in the PDF document by IronOCR.
We can explore the IronOcr.OcrResult class at: https://ironsoftware.com/csharp/ocr/examples/results-objects/.
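As a rough sketch of how that structure might be walked, the example below prints the overall confidence and then each paragraph, page by page. The property names used here (Pages, PageNumber, Paragraphs, Text, Confidence) reflect our reading of the OcrResult documentation linked above and are assumptions to verify against the installed IronOCR version; "example.pdf" is a placeholder file name.

```csharp
using System;
using IronOcr;

var Ocr = new IronTesseract();
using (var Input = new OcrInput())
{
    Input.AddPdf("example.pdf"); // placeholder path
    var Result = Ocr.Read(Input);

    // Overall OCR confidence for the document (assumed property: Confidence)
    Console.WriteLine($"Document confidence: {Result.Confidence}");

    // Walk the result hierarchy: pages first, then the paragraphs on each page
    foreach (var Page in Result.Pages)
    {
        Console.WriteLine($"Page {Page.PageNumber}");
        foreach (var Paragraph in Page.Paragraphs)
        {
            // Show each paragraph's confidence next to its text
            Console.WriteLine($"  [{Paragraph.Confidence}] {Paragraph.Text}");
        }
    }
}
```

Low-confidence paragraphs found this way can be flagged for manual review before the text is re-used elsewhere.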
Creating Searchable PDFs using OCR
One of our most popular OCR features is creating searchable PDFs from scans. Searchability makes PDFs more accessible to users and also makes them much easier to index in search engines such as Elasticsearch or even Google.
Making an Existing Scanned PDF Searchable
This code example takes a scanned PDF document, deskews its pages to improve the image quality, and saves a searchable PDF.
using IronOcr;

var Ocr = new IronTesseract();
using (var Input = new OcrInput())
{
    Input.AddPdf("scan.pdf");
    // clean up twisted pages
    Input.Deskew();
    var Result = Ocr.Read(Input);
    // write a copy of the PDF with a searchable text layer
    Result.SaveAsSearchablePdf("searchable.pdf");
}
' VB.NET version (the statements below belong inside a Sub)
Imports IronOcr

Dim Ocr = New IronTesseract()
Using Input = New OcrInput()
    Input.AddPdf("scan.pdf")
    ' clean up twisted pages
    Input.Deskew()
    Dim Result = Ocr.Read(Input)
    Result.SaveAsSearchablePdf("searchable.pdf")
End Using
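When a whole folder of scans needs the same treatment, the steps above can be wrapped in a loop. The following is a minimal sketch rather than a recommended pipeline: the folder names "scans" and "searchable" are hypothetical, and it assumes one IronTesseract instance can be reused across files, which is worth confirming for the installed version.

```csharp
using System;
using System.IO;
using IronOcr;

// Hypothetical folder names, used only for illustration
string InputDir = "scans";
string OutputDir = "searchable";
Directory.CreateDirectory(OutputDir);

var Ocr = new IronTesseract();
foreach (string PdfPath in Directory.EnumerateFiles(InputDir, "*.pdf"))
{
    // A fresh OcrInput per file keeps each output PDF separate
    using (var Input = new OcrInput())
    {
        Input.AddPdf(PdfPath);
        // clean up twisted pages
        Input.Deskew();
        var Result = Ocr.Read(Input);

        // Save under the same file name in the output folder
        string OutputPath = Path.Combine(OutputDir, Path.GetFileName(PdfPath));
        Result.SaveAsSearchablePdf(OutputPath);
        Console.WriteLine($"Wrote {OutputPath}");
    }
}
```

Writing to a separate output folder leaves the original scans untouched, so a failed run can simply be repeated.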
Convert Images to a Searchable PDF
We can also use OCR to convert image files into a searchable PDF document in C# / .NET.
using IronOcr;

var Ocr = new IronTesseract();
using (var Input = new OcrInput())
{
    // each image is added to the same OCR input
    Input.Add(@"images\page1.png");
    Input.Add(@"images\page2.bmp");
    Input.Add(@"images\page3.tiff");
    // clean up twisted pages
    Input.Deskew();
    var Result = Ocr.Read(Input);
    Result.SaveAsSearchablePdf("searchable.pdf");
}
' VB.NET version (the statements below belong inside a Sub)
Imports IronOcr

Dim Ocr = New IronTesseract()
Using Input = New OcrInput()
    Input.Add("images\page1.png")
    Input.Add("images\page2.bmp")
    Input.Add("images\page3.tiff")
    ' clean up twisted pages
    Input.Deskew()
    Dim Result = Ocr.Read(Input)
    Result.SaveAsSearchablePdf("searchable.pdf")
End Using