How to Get Text from Images Using Tesseract

Leveraging libraries such as IronOCR and Tesseract grants developers access to advanced algorithms and machine learning techniques for extracting textual information from images and scanned documents. This tutorial will show readers how to use the Tesseract library to perform text extraction from images, and will then conclude by introducing IronOCR's unique approach.

1. OCR with Tesseract

1.1. Install Tesseract

Using the NuGet Package Manager Console, enter the following command.

Install-Package Tesseract

Or download the package via the NuGet Package Manager.

Figure 1: Install Tesseract package in the NuGet Package Manager

You must manually install and save the language files in the project folder after installing the NuGet Package. This can be considered a shortcoming of this specific library.

Visit the following website to download the language files. Once downloaded, unzip the files and add the "tessdata" folder to your project's debug folder, since the code below loads the language data from a relative "tessdata" path.
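Because a missing language file is a common cause of start-up failures, it can help to check for it before creating the engine. The following is a minimal sketch rather than part of the library: it assumes English is the language in use, that its data file is named eng.traineddata (Tesseract's usual `<language>.traineddata` naming), and that the relative "tessdata" path is resolved against the current working directory. It uses C# top-level statements, so it needs C# 9 or later.

```csharp
using System;
using System.IO;

// Assumption: "tessdata" is resolved relative to the current working directory,
// the same relative path the quick-start passes to the engine.
string tessdataDir = Path.GetFullPath("tessdata");

// Assumption: English is the language being loaded, so the file is eng.traineddata.
string englishData = Path.Combine(tessdataDir, "eng.traineddata");

if (File.Exists(englishData))
{
    Console.WriteLine($"Found language data in: {tessdataDir}");
}
else
{
    Console.Error.WriteLine($"Missing language file: {englishData}");
    Console.Error.WriteLine("Copy the unzipped tessdata folder into the folder the program runs from.");
}
```

Running this once after copying the folder confirms that the path the program sees matches the place where the files were saved.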

1.2. Using Tesseract (Quick-start)

OCR on a given image can be performed using the source code below:

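using System;
using Tesseract;

// Load the English language data from the "tessdata" folder
var ocrEngine = new TesseractEngine(@"tessdata", "eng", EngineMode.Default);
// Load the image to recognize
var img = Pix.LoadFromFile("Demo.png");
// Run OCR on the image and print the recognized text
var res = ocrEngine.Process(img);
Console.WriteLine(res.GetText());
Console.ReadKey();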

The equivalent code in VB.NET:

Imports Tesseract

Dim ocrEngine = New TesseractEngine("tessdata", "eng", EngineMode.Default)
Dim img = Pix.LoadFromFile("Demo.png")
Dim res = ocrEngine.Process(img)
Console.WriteLine(res.GetText())
Console.ReadKey()

First, a TesseractEngine object is created, which loads the language data into the engine. Next, the desired image file is loaded with the help of Tesseract's Pix class. The image is then passed to the engine's Process method, and the recognized text is read by calling GetText on the result that Process returns. This is the output from the code.

Figure 2: Extracted text from the image
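The engine, the image, and the processed page each wrap native resources, so a longer-running program would normally dispose of them. The sketch below is one possible variation on the quick-start, not the only way to structure it: it wraps the three objects in using statements and also prints the page's mean confidence via GetMeanConfidence, a call the quick-start does not use. The file name Demo.png and the "tessdata" path are carried over from the quick-start.

```csharp
using System;
using Tesseract;

// Each object holds native resources; the using statements release them
// in reverse order of creation once the block finishes.
using (var engine = new TesseractEngine(@"tessdata", "eng", EngineMode.Default))
using (var img = Pix.LoadFromFile("Demo.png"))
using (var page = engine.Process(img))
{
    // Recognized text for the whole page
    string text = page.GetText();

    // Average confidence reported by the engine, as a fraction between 0 and 1
    float confidence = page.GetMeanConfidence();

    Console.WriteLine($"Mean confidence: {confidence:P0}");
    Console.WriteLine(text);
}
```

A low confidence value is a hint that the image may need preprocessing (for example, a higher resolution or better contrast) before the text can be trusted.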

1.3. Tesseract Considerations

  1. Tesseract supports output text formatting, OCR positional data, and page layout analysis as of version 3.00.
  2. Tesseract is available on Windows, Linux, and Mac OS X. However, Tesseract has only been confirmed to work as intended on Windows and Ubuntu due to limited development support.
  3. Tesseract can distinguish between monospaced and proportionally spaced text.
  4. Tesseract is well suited for use as a back-end and, paired with a front-end such as OCRopus, can be used for more challenging OCR jobs, including layout analysis.
  5. Some of Tesseract's shortcomings:

    • The latest builds have not been designed to compile on Windows
    • Tesseract's C# API wrappers are maintained infrequently, and are years behind new releases of Tesseract

To learn more about Tesseract in C#, please visit the Tesseract tutorial.

2. OCR with IronOCR

2.1. Installing IronOCR

Enter the following command into the NuGet Package Manager Console.

Install-Package IronOcr

Or install the IronOCR library via the NuGet Package Manager, along with additional packages for other languages, which are simple and convenient to use.

Figure 3: Install IronOcr and languages packages via NuGet Package Manager

2.2. Using IronOCR

Below is the sample code to recognize the text from the given image.

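using System;
using IronOcr;

// Create the OCR engine and configure the language and Tesseract version
var ocr = new IronTesseract();
ocr.Language = OcrLanguage.EnglishBest;
ocr.Configuration.TesseractVersion = TesseractVersion.Tesseract5;
using (var input = new OcrInput())
{
    // Load the image from a local file path
    input.LoadImage(@"Demo.png");
    // Run OCR and print the recognized text
    var result = ocr.Read(input);
    Console.WriteLine(result.Text);
    Console.ReadKey();
}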

The equivalent code in VB.NET:

Imports IronOcr

Dim ocr = New IronTesseract()
ocr.Language = OcrLanguage.EnglishBest
ocr.Configuration.TesseractVersion = TesseractVersion.Tesseract5
Using input = New OcrInput()
	input.LoadImage("Demo.png")
	Dim result = ocr.Read(input)
	Console.WriteLine(result.Text)
	Console.ReadKey()
End Using

The code above instantiates an IronTesseract object. It also creates an OcrInput object, to which one or more image files can be added by passing each local file path to the LoadImage method. You are free to load as many images as you want. The Read method of the IronTesseract object then parses the input and returns the extracted text in an OCR result.

Figure 4: Extracted text output using IronOCR library
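Since an OcrInput can hold more than one image, several files can be read in a single pass. The sketch below uses only the calls from the sample above; the file names Page1.png and Page2.png are hypothetical placeholders.

```csharp
using System;
using IronOcr;

var ocr = new IronTesseract();
ocr.Language = OcrLanguage.EnglishBest;

using (var input = new OcrInput())
{
    // Hypothetical file names; each LoadImage call adds one more image to the same input.
    input.LoadImage("Page1.png");
    input.LoadImage("Page2.png");

    // A single Read call processes everything that was loaded.
    var result = ocr.Read(input);
    Console.WriteLine(result.Text);
}
```

Grouping related images into one input keeps the recognized text in a single result, which is convenient when the images are pages of the same document.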

2.3. IronOCR Considerations

  1. IronOCR is an extension of the Tesseract library, introducing more stability and higher accuracy.
  2. IronOCR can read text content from PDFs and photos. It can also read more than 20 distinct kinds of barcodes and QR codes.
  3. Output can be rendered as plain text, structured data, barcodes, or QR codes.
  4. The library recognizes 127 languages worldwide.
  5. IronOCR works flexibly in all .NET environments (console, web, desktop, etc.) and also supports platforms and frameworks such as Mono, Xamarin, Azure, and MAUI.
  6. IronOCR offers a free trial and a lower-priced development edition. Learn more about licensing.

For a detailed IronOCR tutorial, refer to this article to read text from an image in C#.