Optical character recognition (OCR) is a technology that converts images of text, such as photographs, scans of printed pages, and handwritten documents, into machine-readable text. It is a common method of digitizing printed texts so that they can be electronically edited, searched, stored more compactly, displayed online, and used in machine processes such as cognitive computing, machine translation, and (extracted) text-to-spreadsheet conversion. It is widely used as a form of data entry from printed paper records: passport documents, invoices, bank statements, computerized receipts, business cards, mail, printouts of static data, or any other suitable documentation. OCR is a field of research within pattern recognition, artificial intelligence, and computer vision.
In this article, we compare two common tools for running OCR on PDF documents and images: ABBYY FineReader PDF and IronOCR.
ABBYY FineReader PDF is an optical character recognition (OCR) application created by ABBYY. It converts image documents (pictures, scans, PDF files) and screen captures into editable file formats such as Microsoft Word, Microsoft Excel, Microsoft PowerPoint, Rich Text Format, HTML, PDF/A, searchable PDF, CSV, and plain text.
ABBYY FineReader is a desktop application available for Windows and macOS. It can convert PDF files into editable formats, and it can also be used to read PDFs, much like Adobe Acrobat. ABBYY FineReader integrates scanned documents into digital workflows.
Manage and complete documents in a simple and efficient manner to save time and effort. Work with any document in the same methodical way, whether it was created digitally or converted from paper. You can alter the text, tables, and full layout of your PDF without having to convert it first.
ABBYY FineReader PDF can create PDFs from more than 25 different file formats, straight from paper documents, or by printing to a PDF printer from practically any application. PDF/A-1 to PDF/A-3 are supported for long-term archiving, and PDF/UA ensures that content is accessible when using assistive software such as screen readers. It also empowers professionals to maximize efficiency in the digital workplace.
Create and update your own interactive PDF forms using ABBYY FineReader to successfully collect information and standardize documents. Create forms by combining interactive fields of various types, setting actions, editing existing PDF forms, or adding form elements to a conventional PDF.
ABBYY FineReader can instantly convert paper documents, scans, and scanned PDFs into searchable PDFs, allowing you to retrieve documents from digital archives and access the information they contain. FineReader PDF supports all compliance levels and variants of the PDF/A format, the industry standard for long-term archiving, from PDF/A-1 through PDF/A-3.
Built on ABBYY’s latest AI-based OCR technology, FineReader PDF makes it easier to digitize, retrieve, edit, protect, share, and collaborate on all kinds of documents in the same workflow. FineReader also includes document comparison, which lets us compare original documents with converted PDFs and image files.
IronOCR is a .NET library from Iron Software that lets developers read text from photos and PDFs in .NET apps and websites. It scans images for text and barcodes and supports numerous languages; it can output the result as either plain text or structured data. The library can be used in MVC, web, console, and desktop .NET applications. For commercial deployments, licensing is provided with direct assistance from the development team.
Open Visual Studio and go to the File menu. Select "New Project", then select "Console Application".
Enter the project name and select the file path in the appropriate text box. Then click the Create button and select the required .NET Framework version, as in the screenshot below.
Visual Studio will now generate the structure for the selected application type. If you selected a console, Windows, or web application, it will open the Program.cs file, where you can enter the code and build/run the application.
Next, we can add the library to test the code.
We can download the ABBYY FineReader here.
The above image shows that there are two versions, Individual and Business, that you can download as per your requirements. Select the "download free trial" option. It will redirect you to a form as in the image below:
We will need to fill out the form to get the EXE file location. Click the download option to download the file.
Once the file download is completed, we can double click the EXE file to start the installation. Once completed, it will display a popup message, and it is now ready to use.
The IronOCR library can be downloaded and installed in four ways, covered below.
The Visual Studio software provides the NuGet Package manager option to install the package directly to the solution. The below screenshot shows how to open the NuGet Package Manager.
It provides a search box to show the list of packages from the NuGet website. In the package manager, we need to search for the keyword IronOCR, as in the screenshot below:
From the above image, we will get the list of related search items. We need to select the required option to install the package to the solution.
In Visual Studio, go to Tools > NuGet Package Manager > Package Manager Console.
Enter the following line in the Package Manager Console tab:
Install-Package IronOcr
Next, the package will download/install in the current project and be ready to use.
The third way is to download the NuGet package directly from the website.
Click the link here to download the latest package direct from the website. Once downloaded, follow the steps below to add the package to the project.
Both IronOCR and ABBYY FineReader include OCR technology that converts images into searchable text.
Next, open the ABBYY FineReader PDF app which will open with multiple options, as in the image below.
Next, select the option "Open" from the OCR Editor options. This will prompt an option to select image files:
After selecting a file, it will automatically start scanning the image into editable text, and then show the result in the window as in the screenshot below:
The above image shows the source image converted into editable text. However, the result is not too accurate. Some of the numbers are not recognized by the ABBYY FineReader PDF app. This is clearly shown in the comparison windows — on the left side is the source image, and on the right side is the OCR converted text.
using System;
using IronOcr;

var Ocr = new IronTesseract(); // nothing to configure
Ocr.Language = OcrLanguage.EnglishBest;
Ocr.Configuration.TesseractVersion = TesseractVersion.Tesseract5;
using (var Input = new OcrInput())
{
    // Add the image to read, then run OCR on it
    Input.AddImage(@"3.png");
    var Result = Ocr.Read(Input);
    Console.WriteLine(Result.Text);
    Console.ReadKey();
}
Dim Ocr = New IronTesseract() ' nothing to configure
Ocr.Language = OcrLanguage.EnglishBest
Ocr.Configuration.TesseractVersion = TesseractVersion.Tesseract5
Using Input = New OcrInput()
    Input.AddImage("3.png")
    Dim Result = Ocr.Read(Input)
    Console.WriteLine(Result.Text)
    Console.ReadKey()
End Using
The code above demonstrates the Tesseract 5 API converting an image file into text. We first create an IronTesseract object, then an OcrInput object to which one or more image files can be added. Each call to the OcrInput method AddImage takes the path of an image, and any number of images can be added. The Read method of the IronTesseract object then parses the input images and returns the OCR result, whose Text property holds the extracted text as a string.
IronOCR can also read multi-frame images. A separate method, AddMultiFrameTiff, handles this case: the library reads the frames in order, from the first to the last, and treats each frame as a distinct page. This method supports only the TIFF image format.
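The multi-frame case can be sketched as follows. This is a minimal, untested illustration rather than an official sample: the file name "multi-frame.tiff" is a placeholder, and the Pages collection with its PageNumber and Text members is assumed to be available on the result object in the IronOCR version used elsewhere in this article.

```csharp
using System;
using IronOcr;

var ocr = new IronTesseract();
using (var input = new OcrInput())
{
    // Placeholder path: point this at a real multi-frame TIFF file.
    input.AddMultiFrameTiff("multi-frame.tiff");
    var result = ocr.Read(input);

    // Assumption: each TIFF frame is exposed as one entry in result.Pages.
    foreach (var page in result.Pages)
    {
        Console.WriteLine($"Page {page.PageNumber}:");
        Console.WriteLine(page.Text);
    }
}
```

Printing the text page by page makes it easy to check that every frame was picked up as a separate page.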
The above image is the output of the IronOCR result, which is accurate and shows the data correctly converted into editable text.
IronOCR and ABBYY FineReader PDF help to convert a PDF file into editable text. ABBYY FineReader PDF provides a list of options to the user such as save the page, edit image, recognize page, etc. It also provides save options such as txt, document, HTML format, etc. IronOCR also allows us to save converted OCR files into HTML, txt, pdf, etc.
Open the ABBYY FineReader PDF software. This will open a page like the image below, offering multiple options.
Next, select the option "Open" from the OCR Editor options. This will prompt an option to select the image/PDF. We can select either a PDF or an image, or we can select both files.
After selecting the file, click the OK button. It will automatically start scanning the image into editable text and show the result in a window like the screenshot below.
The above image shows the source PDF converted into editable text. However, the result is not completely accurate. Some of the numbers are not recognized by the ABBYY FineReader PDF application. This is clearly shown in the comparison windows — on the left side is the source PDF, and on the right side is the OCR converted text.
We can also use OcrInput to process PDF files. The IronTesseract class reads every page of the document and extracts its text. The AddPdf method adds a PDF to the list of input documents and accepts a password as a second argument if the PDF is protected. The following code demonstrates how to open a password-protected PDF document:
using System;
using IronOcr;

var Ocr = new IronTesseract(); // nothing to configure
using (var Input = new OcrInput())
{
    // The second argument is the password of the protected PDF
    Input.AddPdf("example.pdf", "password");
    var Result = Ocr.Read(Input);
    Console.WriteLine(Result.Text);
}
Dim Ocr = New IronTesseract() ' nothing to configure
Using Input = New OcrInput()
    Input.AddPdf("example.pdf", "password")
    Dim Result = Ocr.Read(Input)
    Console.WriteLine(Result.Text)
End Using
OcrInput also provides methods for reading selected pages:
AddPdfPage reads and extracts content from a single page of a PDF document; only the page number needs to be specified. AddPdfPages extracts text from several pages, with the page numbers passed as an IEnumerable<int>.
using System;
using System.Collections.Generic;
using IronOcr;

IEnumerable<int> numbers = new List<int> { 2, 8, 10 };
var Ocr = new IronTesseract();
using (var Input = new OcrInput())
{
    // Single page
    Input.AddPdfPage("example.pdf", 10);
    // Multiple pages
    Input.AddPdfPages("example.pdf", numbers);
    var Result = Ocr.Read(Input);
    Console.WriteLine(Result.Text);
    Result.SaveAsTextFile("ocrtext.txt");
}
Dim numbers As IEnumerable(Of Integer) = New List(Of Integer) From {2, 8, 10}
Dim Ocr = New IronTesseract()
Using Input = New OcrInput()
    ' Single page
    Input.AddPdfPage("example.pdf", 10)
    ' Multiple pages
    Input.AddPdfPages("example.pdf", numbers)
    Dim Result = Ocr.Read(Input)
    Console.WriteLine(Result.Text)
    Result.SaveAsTextFile("ocrtext.txt")
End Using
Using the SaveAsTextFile function, we can store the result as a text file at the output path we specify. We can also save the result as an HTML-based hOCR file using SaveAsHocrFile.
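The two save calls might be combined as in the sketch below. It reuses the "3.png" sample image from earlier in the article; the output file names are placeholders chosen for illustration, and both are assumed to be written relative to the application's working directory.

```csharp
using IronOcr;

var ocr = new IronTesseract();
using (var input = new OcrInput())
{
    input.AddImage("3.png");
    var result = ocr.Read(input);

    // Plain text output (placeholder file name).
    result.SaveAsTextFile("ocrtext.txt");

    // hOCR output: HTML markup carrying the recognized text (placeholder file name).
    result.SaveAsHocrFile("ocrtext.html");
}
```

Saving both formats from a single Read call avoids running OCR twice on the same input.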
FineReader has some additional options such as: Draw Text Area, Draw Picture Area, Draw Table Area, Draw Recognize Area, etc. These help the user to improve the performance of the OCR. Further, in addition to performing OCR, the application also enables users to complete operations such as combining PDFs, splitting PDFs, editing PDFs, etc.
IronOCR has a distinctive feature: it can read barcodes and QR codes from scanned documents. The code below shows how to read barcodes from a given image or document.
using System;
using IronOcr;

var Ocr = new IronTesseract(); // nothing to configure
Ocr.Language = OcrLanguage.EnglishBest;
Ocr.Configuration.ReadBarCodes = true; // barcode reading is off by default
Ocr.Configuration.TesseractVersion = TesseractVersion.Tesseract5;
using (var Input = new OcrInput())
{
    Input.AddImage("barcode.gif");
    var Result = Ocr.Read(Input);
    // Print the value of every barcode found in the image
    foreach (var Barcode in Result.Barcodes)
    {
        Console.WriteLine(Barcode.Value);
    }
}
Dim Ocr = New IronTesseract() ' nothing to configure
Ocr.Language = OcrLanguage.EnglishBest
Ocr.Configuration.ReadBarCodes = True
Ocr.Configuration.TesseractVersion = TesseractVersion.Tesseract5
Using Input = New OcrInput()
    Input.AddImage("barcode.gif")
    Dim Result = Ocr.Read(Input)
    For Each Barcode In Result.Barcodes
        Console.WriteLine(Barcode.Value)
    Next Barcode
End Using
The code above reads barcodes from a given image or PDF document, and it can read more than one barcode from a page or image. Barcode reading is controlled by the setting Ocr.Configuration.ReadBarCodes, which defaults to false and must be set to true.
After reading the input, the data is stored in an OcrResult object, whose Barcodes property collects all the barcodes found. Using a foreach loop, we can get each barcode's details one by one. The same Read call both recognizes the text and reads the barcode values, so two operations are completed in one pass.
Threading is also supported, meaning we can run multiple OCR processes at the same time. IronOCR can also restrict recognition to a specified rectangular region of an image.
using System;
using IronOcr;

var Ocr = new IronTesseract();
using (var Input = new OcrInput())
{
    // Region of the image to read, in pixels
    var ContentArea = new System.Drawing.Rectangle() { X = 215, Y = 1250, Height = 280, Width = 1335 };
    Input.Add("document.png", ContentArea);
    var Result = Ocr.Read(Input);
    Console.WriteLine(Result.Text);
}
Dim Ocr = New IronTesseract()
Using Input = New OcrInput()
    Dim ContentArea = New System.Drawing.Rectangle() With {
        .X = 215,
        .Y = 1250,
        .Height = 280,
        .Width = 1335
    }
    Input.Add("document.png", ContentArea)
    Dim Result = Ocr.Read(Input)
    Console.WriteLine(Result.Text)
End Using
The above is the sample code for performing OCR on a specific region. We only need to specify the rectangular region on the image or PDF; the Tesseract engine in IronOCR then recognizes the text within it.
IronOCR makes Tesseract straightforward to use in .NET applications. It supports images and PDF documents in a variety of ways and provides a number of settings for improving the performance of the Tesseract OCR library. Many languages are supported, and several languages can be combined in a single operation. To discover more about the Tesseract OCR, visit their website.
ABBYY FineReader PDF is a software application that uses an artificial intelligence engine to recognize an image/PDF document. It also provides various settings to improve the performance of the OCR process. Further, it provides the option to select multiple languages. ABBYY FineReader PDF does have some limitations on the usage of the page conversions. There are different prices for different operating systems. To know more about the ABBYY FineReader PDF price details, click here.
In this comparison, IronOCR produced better results than ABBYY FineReader PDF. FineReader did not recognize some of the low-quality images, and it failed to recognize some characters, reporting them as unknown. IronOCR, on the other hand, returned complete and accurate results. It also allows us to recognize barcode data and read the values of barcodes from images. The IronOCR package provides a lifetime license with no ongoing costs, and it supports multiple platforms at a single price. To know more about IronOCR price details, click here.