Updated January 29, 2024
How to Get Text from Images Using Tesseract
Leveraging libraries such as IronOCR and Tesseract grants developers access to advanced algorithms and machine learning techniques for extracting textual information from images and scanned documents. This tutorial will show readers how to use the Tesseract library to perform text extraction from images, and will then conclude by introducing IronOCR's unique approach.
1. OCR with Tesseract
1.1. Install Tesseract
Using the NuGet Package Manager Console, enter the following command.
Install-Package Tesseract
Or download the package via the NuGet Package Manager.
Install the Tesseract package in the NuGet Package Manager
You must manually install and save the language files in the project folder after installing the NuGet Package. This can be considered a shortcoming of this specific library.
Visit the following website to download the language files. Once downloaded, unzip the files, and add the "tessdata" folder to your project's debug folder.
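Before running any OCR code, it can help to confirm that the language data is where the engine will look for it. The following is a minimal sketch, assuming the "tessdata" folder sits next to the compiled executable and that the English data file is named eng.traineddata (Tesseract's usual `<language>.traineddata` naming). The class and method names (`TessdataCheck`, `HasLanguageData`) are hypothetical helpers introduced only for illustration and are not part of the Tesseract package.

```csharp
using System;
using System.IO;

// Assumption: "tessdata" was copied next to the executable (e.g. the debug output folder).
string tessdataDir = Path.Combine(AppContext.BaseDirectory, "tessdata");

Console.WriteLine(TessdataCheck.HasLanguageData(tessdataDir, "eng")
    ? "English language data found."
    : "Missing eng.traineddata in " + tessdataDir);

// Hypothetical helper, not part of the Tesseract NuGet package.
public static class TessdataCheck
{
    // Returns true when "<tessdataDir>/<language>.traineddata" exists on disk.
    public static bool HasLanguageData(string tessdataDir, string language)
    {
        string path = Path.Combine(tessdataDir, language + ".traineddata");
        return File.Exists(path);
    }
}
```

Running a check like this first turns a missing or misplaced folder into a clear message rather than an engine initialization failure.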
1.2. Using Tesseract (Quick-start)
OCR on a given image can be performed using the source code below:
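using System;
using Tesseract;

// Load the English language data from the "tessdata" folder
using var ocrEngine = new TesseractEngine(@"tessdata", "eng", EngineMode.Default);
// Load the image to recognize
using var img = Pix.LoadFromFile("Demo.png");
// Run OCR and print the recognized text
using var res = ocrEngine.Process(img);
Console.WriteLine(res.GetText());
Console.ReadKey();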
Imports System
Imports Tesseract

Module Program
    Sub Main()
        Dim ocrEngine = New TesseractEngine("tessdata", "eng", EngineMode.Default)
        Dim img = Pix.LoadFromFile("Demo.png")
        Dim res = ocrEngine.Process(img)
        Console.WriteLine(res.GetText())
        Console.ReadKey()
    End Sub
End Module
First, a TesseractEngine object is created, which loads the language data into the engine. Next, the desired image file is loaded with Tesseract's Pix class. The image is then passed to the engine's Process method, and the recognized text is read by calling the GetText method on the page that Process returns. This is the output from the code.
Extracted text from the image
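The page returned by Process exposes more than the raw text. As a rough sketch, the example below also reads the mean confidence that the engine reports for the page, which can help decide whether a result is worth trusting. It assumes the Tesseract NuGet package is installed, that a "tessdata" folder containing eng.traineddata is available, and that "Demo.png" exists next to the executable.

```csharp
using System;
using Tesseract;

// Assumptions: Tesseract NuGet package installed, "tessdata/eng.traineddata"
// and "Demo.png" present next to the executable.
using var engine = new TesseractEngine(@"tessdata", "eng", EngineMode.Default);
using var img = Pix.LoadFromFile("Demo.png");
using var page = engine.Process(img);

string text = page.GetText();
float confidence = page.GetMeanConfidence(); // value between 0.0 and 1.0

Console.WriteLine($"Mean confidence: {confidence:P0}");
Console.WriteLine(text);
```

A low confidence value usually points to a poor-quality input image rather than a coding error, so it is a useful first thing to check when the extracted text looks wrong.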
1.3. Tesseract Considerations
- Tesseract supports output text formatting, OCR positional data, and page layout analysis as of version 3.00.
- Tesseract is available on Windows, Linux, and Mac OS X. However, Tesseract has only been confirmed to work as intended on Windows and Ubuntu due to limited development support.
- Tesseract can distinguish between monospaced and proportionally spaced text.
- Tesseract is well suited for use as a back-end and, paired with a front-end such as OCRopus, can handle more challenging OCR jobs, such as layout analysis.
Some of Tesseract's shortcomings:
- The latest builds have not been designed to compile on Windows.
- Tesseract's C# API wrappers are maintained infrequently, and are years behind new releases of Tesseract.
To learn more about Tesseract in C#, please visit the Tesseract tutorial.
2. OCR with IronOCR
2.1. Installing IronOCR
Enter the following command into the NuGet Package Manager Console.
Install-Package IronOcr
Or install the IronOCR library via the NuGet Package Manager, along with additional packages for other languages, which are simple and convenient to use.
Install IronOcr and language packages via NuGet Package Manager
2.2. Using IronOCR
Below is the sample code to recognize the text from the given image.
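using System;
using IronOcr;

// Create the OCR engine and select the language and Tesseract version
var ocr = new IronTesseract();
ocr.Language = OcrLanguage.EnglishBest;
ocr.Configuration.TesseractVersion = TesseractVersion.Tesseract5;

// OcrInput holds one or more images to read
using (var input = new OcrInput())
{
    input.LoadImage(@"Demo.png");
    var result = ocr.Read(input);
    Console.WriteLine(result.Text);
    Console.ReadKey();
}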
Imports System
Imports IronOcr

Module Program
    Sub Main()
        Dim ocr = New IronTesseract()
        ocr.Language = OcrLanguage.EnglishBest
        ocr.Configuration.TesseractVersion = TesseractVersion.Tesseract5
        Using input = New OcrInput()
            input.LoadImage("Demo.png")
            Dim result = ocr.Read(input)
            Console.WriteLine(result.Text)
            Console.ReadKey()
        End Using
    End Sub
End Module
The code above instantiates an IronTesseract object. It then creates an OcrInput object, which can hold one or more image files, and adds an image by providing its local file path to the LoadImage method. You are free to load as many pictures as you want. The Read method of the IronTesseract object parses the input and returns the OCR result, whose Text property contains the extracted text.
Extracted text output using IronOCR library
2.3. IronOCR Considerations
- IronOCR is an extension of the Tesseract library, introducing more stability and higher accuracy.
- IronOCR can read text content from PDFs and photos. It can also read more than 20 distinct kinds of barcodes and QR codes.
- Output can be rendered as plain text, structured data, barcodes, or QR codes.
- The library recognizes 127 languages worldwide.
- IronOCR works flexibly in all .NET environments (console, web, desktop, etc.), and also supports frameworks and platforms such as Mono, Xamarin, Azure, and MAUI.
- IronOCR offers a free trial and a lower-priced development edition. Learn more about licensing.
For a detailed IronOCR tutorial, refer to this article to read text from an image in C#.