Published June 7, 2023
How to Get OCR Text Recognition
1. Introduction
OCR (Optical Character Recognition) lets computers recognize and extract text from sources such as scanned PDF documents and images, so that the content can be searched and edited. Adobe Acrobat, for example, includes OCR that can quickly extract text from scanned documents and turn them into editable, searchable PDFs.
Developers can get the same capability through OCR libraries such as IronOCR and Tesseract, whose APIs apply advanced algorithms and machine learning techniques to recognize text accurately. This makes it easier to extract information from both existing scans and new documents. Typical uses include extracting data from invoices, digitizing paper-based records, and improving document accessibility.
2. Tesseract
2.1. Introduction and Features
In 1995, Tesseract was one of the three most accurate OCR engines. It is offered for Windows, Linux, and Mac OS X. However, due to a lack of resources, developers have only thoroughly tested it with Windows and Ubuntu.
Tesseract supports output text formatting, OCR positional data, and page layout analysis as of version 3.00. Using the Leptonica library, support for a variety of new image formats was introduced. Tesseract has the ability to distinguish between monospaced and proportionally spaced text.
Tesseract works well as a back-end: paired with a front-end such as OCRopus, it can handle more challenging OCR jobs, including layout analysis.
Tesseract also has shortcomings. The latest builds have not been designed to compile on Windows, and the C# API wrappers for the library are not frequently maintained or updated, so they are likely to be years behind the newest releases.
2.2. Install Tesseract
Using the NuGet Package Manager Console, enter the following command.
Install-Package Tesseract
Alternatively, search for the package in the NuGet Package Manager tool, select the first result, and install it.
After installing the NuGet package, you must manually download the language files and save them in the project folder. This can be considered a shortcoming of this specific library.
Visit the following website to download the language files. Once downloaded, unzip the files and add the "tessdata" folder to your project's debug folder.
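A missing language file is a common cause of failures when the engine starts, so it can help to check for it up front. The following is a minimal sketch, not part of the Tesseract package: the `TessdataCheck` class and its `HasLanguage` method are hypothetical helpers, and the sketch assumes the "tessdata" folder sits next to the compiled executable and that language files follow the usual `<language>.traineddata` naming.

```csharp
using System;
using System.IO;

// Hypothetical helper (not provided by the Tesseract NuGet package).
static class TessdataCheck
{
    // Returns true if "<language>.traineddata" exists in the given folder.
    public static bool HasLanguage(string tessdataDir, string language)
    {
        string file = Path.Combine(tessdataDir, language + ".traineddata");
        return File.Exists(file);
    }
}

static class Program
{
    static void Main()
    {
        // Assumption: "tessdata" was copied into the build output (debug) folder.
        string tessdataDir = Path.Combine(AppContext.BaseDirectory, "tessdata");

        if (TessdataCheck.HasLanguage(tessdataDir, "eng"))
        {
            Console.WriteLine("English language data found.");
        }
        else
        {
            Console.WriteLine("Missing eng.traineddata in " + tessdataDir);
        }
    }
}
```

Running this check before constructing the engine turns a hard-to-read initialization error into a clear message about which file is missing.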
2.3. OCR With Tesseract
Enter the code below in the Visual Studio C# project you created. It extracts the text from the given image.
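using System;
using Tesseract;
// Create the engine and load the English language data from the "tessdata" folder.
var ocrengine = new TesseractEngine(@"tessdata", "eng", EngineMode.Default);
// Load the image to be recognized.
var img = Pix.LoadFromFile("Demo.png");
// Run OCR on the image and print the recognized text.
var res = ocrengine.Process(img);
Console.WriteLine(res.GetText());
Console.ReadKey();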
' VB.NET equivalent of the C# example above
Imports Tesseract
Dim ocrengine = New TesseractEngine("tessdata", "eng", EngineMode.Default)
Dim img = Pix.LoadFromFile("Demo.png")
Dim res = ocrengine.Process(img)
Console.WriteLine(res.GetText())
Console.ReadKey()
First, we create a TesseractEngine object and load the language data into it. Next, we load the image to be recognized with Pix.LoadFromFile. We then pass the loaded image to the engine's Process method and read the recognized text from the result with the GetText method. The code prints the recognized text to the console.
To learn more about Tesseract in C#, click here.
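The engine, the image, and the result page in the example above all hold native resources. As a minimal sketch of a variation on that example, the code below wraps each of them in a `using` statement so they are released promptly, and also prints the mean confidence that the wrapper reports for the page. It assumes the same "tessdata" folder and "Demo.png" file as before.

```csharp
using System;
using Tesseract;

// Assumptions: "tessdata" contains eng.traineddata and "Demo.png" exists
// in the working directory, as in the example above.
using (var engine = new TesseractEngine(@"tessdata", "eng", EngineMode.Default))
using (var img = Pix.LoadFromFile("Demo.png"))
using (var page = engine.Process(img))
{
    // Recognized text for the whole page.
    Console.WriteLine(page.GetText());

    // Mean confidence is reported as a fraction between 0 and 1.
    Console.WriteLine("Mean confidence: " + page.GetMeanConfidence());
}
```

Disposing the page before processing another image with the same engine matters in practice, because the wrapper processes only one page per engine at a time.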
4. IronOCR
4.1. Introduction and Features
IronOCR builds on Tesseract and provides a native C# OCR library with more stability and higher accuracy than the vanilla Tesseract library. It can read text from PDFs and images in .NET applications and websites, can produce either plain text or structured data, and supports a wide range of languages. It can also scan images for barcodes as well as text. The OCR library from Iron Software can be used in console, web, MVC, and desktop .NET applications. Commercial deployments are licensed and receive direct support from the development team.
- IronOCR reads scanned paper documents, barcodes, and QR codes from any image or PDF file using the most recent Tesseract 5 engine. With this library, desktop, console, and web applications can easily implement OCR.
- IronOCR supports 127 languages worldwide, as well as word lists and custom languages.
- IronOCR can read more than 20 distinct kinds of barcodes and QR codes.
- IronOCR can output both barcode data and plain text. Alternatively, a structured data object model lets developers receive all content as structured Headings, Paragraphs, Lines, Words, and Characters for direct use in .NET applications.
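The structured output described in the last bullet can be sketched as follows. This is an illustrative example, not a definitive one: it assumes the IronOcr package is installed (covered in the next section), that "Demo.png" exists, and that the result object exposes `Paragraphs` and `Words` collections whose items carry a `Text` property, as the feature list suggests.

```csharp
using System;
using System.Linq;
using IronOcr;

var ocr = new IronTesseract();
using (var input = new OcrInput())
{
    // Assumption: "Demo.png" is an image in the working directory.
    input.AddImage("Demo.png");
    var result = ocr.Read(input);

    // Walk the structured result paragraph by paragraph
    // instead of using the flat result.Text string.
    foreach (var paragraph in result.Paragraphs)
    {
        Console.WriteLine("Paragraph: " + paragraph.Text);
    }

    // The same result also exposes finer-grained units such as words.
    Console.WriteLine("Words found: " + result.Words.Count());
}
```

Working with paragraphs or words rather than one large string makes it easier to map recognized content onto fields in an application, for example invoice lines.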
4.2. Install IronOCR
Enter the following command into the NuGet Package Manager Console.
PM > Install-Package IronOcr
Alternatively, search for the package in the NuGet Package Manager and install it from there. To extract text in other languages, install the corresponding language package as well, which can also be found in the NuGet Package Manager.
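As a minimal sketch of using an additional language, the example below reads a German document. It assumes the German language pack has been installed from NuGet (the package name `IronOcr.Languages.German` is used here as an example of the language-pack naming) and that a file named "German.png" exists. The file name is hypothetical.

```csharp
using System;
using IronOcr;

// Assumption: the language pack was installed first, for example with
//   Install-Package IronOcr.Languages.German
var ocr = new IronTesseract();
ocr.Language = OcrLanguage.German;

using (var input = new OcrInput())
{
    // Hypothetical file name; replace with a German-language scan.
    input.AddImage("German.png");
    var result = ocr.Read(input);
    Console.WriteLine(result.Text);
}
```

Unlike the plain Tesseract setup, no "tessdata" folder has to be copied by hand here, because the language data ships inside the NuGet package.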
4.3. Using IronOCR
Below is sample code that recognizes the text in the given image.
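using System;
using IronOcr;
// Create the OCR engine and configure language and Tesseract version.
var Ocr = new IronTesseract();
Ocr.Language = OcrLanguage.EnglishBest;
Ocr.Configuration.TesseractVersion = TesseractVersion.Tesseract5;
using (var Input = new OcrInput())
{
    // Add the image to read, run OCR, and print the recognized text.
    Input.AddImage(@"Demo.png");
    var R = Ocr.Read(Input);
    Console.WriteLine(R.Text);
    Console.ReadKey();
}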
' VB.NET equivalent of the C# example above
Imports IronOcr
Dim Ocr = New IronTesseract()
Ocr.Language = OcrLanguage.EnglishBest
Ocr.Configuration.TesseractVersion = TesseractVersion.Tesseract5
Using Input = New OcrInput()
    Input.AddImage("Demo.png")
    Dim R = Ocr.Read(Input)
    Console.WriteLine(R.Text)
    Console.ReadKey()
End Using
In the code above, we create an IronTesseract object and set its language and Tesseract version. We also create an OcrInput object, which holds one or more image files. Each image is added by passing its path to the OcrInput object's AddImage method, and you can add as many images as you want. Finally, we call the Read method of the IronTesseract object we created earlier, which parses the images and returns an OCR result. The recognized text is available as a string through the result's Text property.
The result shows the text extracted from the same image that was used in the Tesseract example.
For a detailed IronOCR tutorial, refer to this article.
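Because an OcrInput can hold several images, one read can cover a whole batch. The following is a minimal sketch under stated assumptions: the file names "Page1.png" and "Page2.png" are hypothetical, and the optional `Deskew` and `DeNoise` calls are shown as an example of cleaning up poorly scanned input before reading.

```csharp
using System;
using IronOcr;

var ocr = new IronTesseract();
using (var input = new OcrInput())
{
    // Hypothetical file names; each added image becomes one page of input.
    input.AddImage("Page1.png");
    input.AddImage("Page2.png");

    // Optional clean-up for low-quality scans.
    input.Deskew();   // straighten rotated pages
    input.DeNoise();  // reduce digital noise

    var result = ocr.Read(input);

    // Print the recognized text page by page.
    foreach (var page in result.Pages)
    {
        Console.WriteLine("Page " + page.PageNumber + ":");
        Console.WriteLine(page.Text);
    }
}
```

Image filters add processing time, so it is reasonable to apply them only when the raw scans produce poor results.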
5. Conclusion
Tesseract is open source, so developers can build on its source code. It is written in C++ and has no official library for the .NET environment. If there is an issue with the code, there is no dedicated team to fix it or to support the user. The Tesseract documentation also has no code tutorials for common use cases, which makes the code and the library difficult for beginners to understand.
IronOCR, on the other hand, supports .NET projects targeting .NET Standard 2, .NET Framework 4.5, .NET Core 2 and 3, and .NET 5. It also works with technologies such as Mono, Xamarin, and Azure. IronOCR's tools help get the best results from Tesseract and can repair poorly scanned documents or photos. It manages the intricate Tesseract dictionary system through NuGet packages, which simplifies developing an OCR tool.
IronOCR can be used without any additional settings and supports multi-frame TIFF files, PDF files, and all popular image formats. It can also detect barcodes in images and read their values. IronOCR offers a 30-day free trial and a lower-priced development edition. The lifetime license included in the IronOCR bundle has no further costs, and the package covers various systems with a single payment. Visit this page to find out more about IronOCR's pricing.