How to Get Text from Images Using Tesseract
Libraries such as IronOCR and Tesseract give developers access to OCR algorithms and machine learning techniques for extracting text from images and scanned documents. This tutorial shows how to use the Tesseract library to extract text from images, and concludes by introducing IronOCR's approach.
1. OCR with Tesseract
1.1. Install Tesseract
Using the NuGet Package Manager Console, enter the following command:
Install-Package Tesseract
Or download the package via the NuGet Package Manager.
Install the Tesseract package in the NuGet Package Manager
After installing the NuGet package, you must manually download the language files and save them in the project folder. This extra step can be considered a shortcoming of this library.
Visit the following website to download the language files. Once downloaded, unzip the files, and add the "tessdata" folder to your project's debug folder.
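Because a missing or misplaced "tessdata" folder only shows up as a failure when the engine is created, it can help to check for the language file first. The following is a minimal sketch that uses only the .NET base class library. The TessdataCheck class and its method names are hypothetical helpers written for this illustration, and the sketch assumes the English data file is named eng.traineddata and sits in a "tessdata" folder next to the executable.

```csharp
using System;
using System.IO;

// Hypothetical helper for illustration; not part of the Tesseract package.
static class TessdataCheck
{
    // Builds the expected path of a language file,
    // e.g. "eng" -> <baseDir>/tessdata/eng.traineddata
    public static string TrainedDataPath(string baseDir, string language) =>
        Path.Combine(baseDir, "tessdata", language + ".traineddata");

    // True when the language file exists at the expected location
    public static bool IsLanguageInstalled(string baseDir, string language) =>
        File.Exists(TrainedDataPath(baseDir, language));
}

class TessdataCheckDemo
{
    static void Main()
    {
        // Assumption: tessdata was copied next to the executable (the build output folder)
        string baseDir = AppContext.BaseDirectory;

        if (TessdataCheck.IsLanguageInstalled(baseDir, "eng"))
            Console.WriteLine("English language data found.");
        else
            Console.Error.WriteLine("Missing: " + TessdataCheck.TrainedDataPath(baseDir, "eng"));
    }
}
```

Running a check like this before creating the engine gives a clear message that names the exact file path that was expected.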
1.2. Using Tesseract (Quick-start)
OCR on a given image can be performed using the source code below:
using System;
using Tesseract;

class Program
{
    static void Main()
    {
        // Initialize Tesseract engine with English language data
        using var ocrEngine = new TesseractEngine(@"tessdata", "eng", EngineMode.Default);
        // Load the image to be processed
        using var img = Pix.LoadFromFile("Demo.png");
        // Process the image to extract text
        using var res = ocrEngine.Process(img);
        // Output the recognized text
        Console.WriteLine(res.GetText());
        // Keep the console window open until a key is pressed
        Console.ReadKey();
    }
}
The equivalent VB.NET code:
Imports System
Imports Tesseract

Friend Class Program
    Shared Sub Main()
        ' Initialize Tesseract engine with English language data
        Using ocrEngine As New TesseractEngine("tessdata", "eng", EngineMode.Default)
            ' Load the image to be processed
            Using img = Pix.LoadFromFile("Demo.png")
                ' Process the image to extract text
                Using res = ocrEngine.Process(img)
                    ' Output the recognized text
                    Console.WriteLine(res.GetText())
                End Using
            End Using
        End Using
        ' Keep the console window open until a key is pressed
        Console.ReadKey()
    End Sub
End Class
- First, a TesseractEngine object must be created, loading the language data into the engine.
- The desired image file is then loaded with the help of Pix.LoadFromFile.
- The image is passed into the TesseractEngine to extract text using the Process method.
- The recognized text is obtained with the GetText method and printed to the console.
Extracted text from the image
1.3. Tesseract Considerations
- Tesseract supports output text formatting, OCR positional data, and page layout analysis as of version 3.00.
- Tesseract is available on Windows, Linux, and macOS, though due to limited development support it is mainly confirmed to work as intended on Windows and Ubuntu.
- Tesseract can distinguish between monospaced and proportionally spaced text.
- Tesseract works well as a back-end: paired with a front-end such as OCRopus, it can be used for more challenging OCR jobs, such as layout analysis.
- Some of Tesseract's shortcomings:
  - The latest builds have not been designed to compile on Windows.
  - Tesseract's C# API wrappers are maintained infrequently and are years behind new releases of Tesseract.
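The positional data mentioned in the list above can be read through the result iterator of the same Tesseract NuGet wrapper used in the quick-start. The following is a minimal sketch, assuming the "tessdata" folder and the "Demo.png" file from the quick-start are in place; the class name WordBoxes is chosen only for this illustration.

```csharp
using System;
using Tesseract;

class WordBoxes
{
    static void Main()
    {
        // Assumes the same "tessdata" folder and "Demo.png" as the quick-start
        using var engine = new TesseractEngine(@"tessdata", "eng", EngineMode.Default);
        using var img = Pix.LoadFromFile("Demo.png");
        using var page = engine.Process(img);

        // Mean confidence for the whole page, reported as a value between 0 and 1
        Console.WriteLine($"Mean confidence: {page.GetMeanConfidence():P0}");

        // Walk the page word by word and print each word with its bounding box
        using var iter = page.GetIterator();
        iter.Begin();
        do
        {
            if (iter.TryGetBoundingBox(PageIteratorLevel.Word, out Rect box))
            {
                string word = iter.GetText(PageIteratorLevel.Word);
                Console.WriteLine($"{word} at ({box.X1}, {box.Y1}) - ({box.X2}, {box.Y2})");
            }
        } while (iter.Next(PageIteratorLevel.Word));
    }
}
```

Changing PageIteratorLevel.Word to PageIteratorLevel.TextLine or PageIteratorLevel.Block in both calls should give coarser regions, which is one way to inspect the page layout analysis.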
To learn more about Tesseract in C#, please visit the Tesseract tutorial.
2. OCR with IronOCR
2.1. Installing IronOCR
Enter the following command into the NuGet Package Manager Console:
Install-Package IronOcr
Or install the IronOCR library via the NuGet Package Manager, along with additional packages for other languages, which are simple and convenient to use.
Install IronOcr and language packages via NuGet Package Manager
2.2. Using IronOCR
The sample code below recognizes the text in a given image:
using System;
using IronOcr;

class Program
{
    static void Main()
    {
        // Create an IronTesseract instance with predefined settings
        var ocr = new IronTesseract()
        {
            Language = OcrLanguage.EnglishBest,
            Configuration = { TesseractVersion = TesseractVersion.Tesseract5 }
        };
        // Create an OcrInput instance for image processing
        using var input = new OcrInput();
        // Load the image to be processed
        input.AddImage("Demo.png");
        // Process the image and extract text
        var result = ocr.Read(input);
        // Output the recognized text
        Console.WriteLine(result.Text);
        // Keep the console window open until a key is pressed
        Console.ReadKey();
    }
}
The equivalent VB.NET code:
Imports System
Imports IronOcr

Friend Class Program
    Shared Sub Main()
        ' Create an IronTesseract instance with predefined settings
        Dim ocr = New IronTesseract() With {
            .Language = OcrLanguage.EnglishBest
        }
        ' VB.NET has no nested object initializer, so set the version separately
        ocr.Configuration.TesseractVersion = TesseractVersion.Tesseract5
        ' Create an OcrInput instance for image processing
        Using input = New OcrInput()
            ' Load the image to be processed
            input.AddImage("Demo.png")
            ' Process the image and extract text
            Dim result = ocr.Read(input)
            ' Output the recognized text
            Console.WriteLine(result.Text)
        End Using
        ' Keep the console window open until a key is pressed
        Console.ReadKey()
    End Sub
End Class
- This code initializes an IronTesseract object, setting up the language and Tesseract version.
- An OcrInput object is then created to load image files using the AddImage method.
- The Read method of IronTesseract processes the image and extracts the text, which is then printed to the console.
Extracted text output using IronOCR library
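When comparing results between engines, it can be useful to print how confident the engine was alongside the text. The following is a minimal sketch that reuses the calls from the sample above and additionally reads a Confidence property on the result object; that property is an assumption based on current IronOCR releases and should be checked against the installed version. The class name ConfidenceDemo is chosen only for this illustration.

```csharp
using System;
using IronOcr;

class ConfidenceDemo
{
    static void Main()
    {
        // Same engine setup as the sample above
        var ocr = new IronTesseract()
        {
            Language = OcrLanguage.EnglishBest
        };

        using var input = new OcrInput();
        input.AddImage("Demo.png");

        var result = ocr.Read(input);

        // Assumption: Confidence is exposed on the result object
        // in the installed IronOCR version
        Console.WriteLine($"Confidence: {result.Confidence}");
        Console.WriteLine(result.Text);
    }
}
```

A low confidence value is usually a hint that the input image needs cleanup (for example, higher resolution or better contrast) before the recognized text can be trusted.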
2.3. IronOCR Considerations
- IronOCR is an extension of the Tesseract library, introducing more stability and higher accuracy.
- IronOCR can read text content from PDFs and photos. It can also read more than 20 distinct kinds of barcodes and QR codes.
- Output can be rendered as plain text, structured data, barcodes, or QR codes.
- The library recognizes 125 languages worldwide.
- IronOCR works flexibly in all .NET environments (console, web, desktop, etc.), and also supports platforms and frameworks such as Mono, Xamarin, Azure, and MAUI.
- IronOCR offers a free trial and a lower-priced development edition. Learn more about licensing.
For a detailed IronOCR tutorial, refer to this article on reading text from an image in C#.