Tesseract OCR for Multiple Languages (Developer Tutorial)

Published September 29, 2024

Introduction

Optical Character Recognition (OCR) converts scanned images, PDF files, and digital camera photos into editable, searchable data. It is widely used to digitize printed documents for editing, searching, and storage, which reduces the physical space that paper archives occupy. OCR also plays a major role in automating data entry, saving companies and organizations considerable time and manual labor.

OCR relies on machine learning and pattern recognition to extract text from images accurately. Recent advances have improved its accuracy and extended support to more languages and complex scripts such as Arabic. In finance, healthcare, law, and education, where large volumes of printed documents must be processed quickly, OCR has become an indispensable tool. This article uses Tesseract to perform OCR on images containing text in multiple languages.

How to Use Tesseract OCR with Multiple Languages

  1. Install the IronOCR/Tesseract NuGet package in your .NET project.
  2. Create an instance of the IronTesseract class to initialize the OCR engine.
  3. Set the Language property to include every language you need.
  4. Specify the path of the image file you want to process, then create an OcrInput object from it.
  5. Perform OCR on the input image using the Read method of the IronTesseract instance.
  6. Take the result and display the recognized text.

What is Tesseract?

Tesseract is an open-source Optical Character Recognition engine originally developed by Hewlett-Packard and later maintained by Google. It is known for its accuracy and adaptability, which has made it one of the most widely used OCR engines. Tesseract supports script detection and can recognize text in many languages, including several in the same document, so it is a common choice for projects that handle multilingual documents.

The Tesseract OCR engine analyzes the pixels of an image, looking for patterns that represent characters, words, and sentences, and converts them to machine-readable text. It accepts many image file types, such as TIFF, JPEG, and PNG, and can produce output in formats such as plain text, HTML, and searchable PDF.

One significant advantage of Tesseract is that it can be trained to recognize particular fonts or additional languages. It is used in a wide range of applications, from simple text extraction to digitizing historical documents, processing invoices, and accessibility software that enables reading for the visually impaired.

Creating a New Project in Visual Studio

Open Visual Studio and go to the File menu. Select "New Project," then choose "Console Application." In this tutorial, we will run OCR from a console program.

Tesseract OCR for Multiple Languages (Developer Tutorial): Figure 1 - Create a new project

Enter the project's name and location in the text boxes provided. Then click the Create button and select the .NET Framework version you need, as shown in the image below.

Tesseract OCR for Multiple Languages (Developer Tutorial): Figure 2 - Select the .NET Framework along with providing a project and save location.

Once the framework version has been selected, Visual Studio generates the project structure. Whether you chose a console, Windows, or web application, it opens the Program.cs file so you can add code and build/run the application.

Install Tesseract OCR For .NET

The first step is downloading and installing the Tesseract OCR software on your computer. Here is the official Tesseract GitHub repository with the Tesseract installer: https://github.com/tesseract-ocr/tesseract.

Install Tesseract OCR by following the setup instructions for your operating system, whether that is Windows, macOS, or Linux. Once it is installed, add the Tesseract.NET package to your C# project using Visual Studio's NuGet Package Manager.

Open the NuGet Package Manager in your Visual Studio project from Tools -> NuGet Package Manager -> Manage NuGet Packages for Solution. Afterward, search for "Tesseract" in the NuGet Package Manager to obtain either the "Tesseract" or "Tesseract.NET" package. Select this package and click the Install button to install it in your project.

Tesseract OCR for Multiple Languages (Developer Tutorial): Figure 3 - Search for Tesseract in the browse tab

Tesseract OCR using C#

After installing the Tesseract.NET wrapper, configure Tesseract in your C# project by specifying the location of the Tesseract language data files. Here's an example:

using System;
using Tesseract;
class Program
{
    static void Main()
    {
        // Set the path to the Tesseract data files (traineddata files)
        string tessDataPath = @"./tessdata"; // Ensure this directory contains the language data files
        // Load the image
        string imagePath = @"path_to_your_image.png";
        using (var img = Pix.LoadFromFile(imagePath))
        {
            // Add tesseract languages into engine
            using (var engine = new TesseractEngine(tessDataPath,  "eng+spa+fra", EngineMode.Default))
            {
                using (var page = engine.Process(img))
                {
                    // Extract the text
                    string text = page.GetText();
                    Console.WriteLine("Recognized Text:");
                    Console.WriteLine(text);
                }
            }
        }
    }
}
Imports System
Imports System.Drawing
Imports Tesseract
Friend Class Program
	Shared Sub Main()
		' Set the path to the Tesseract data files (traineddata files)
		Dim tessDataPath As String = "./tessdata" ' Ensure this directory contains the language data files
		' Load the image
		Dim imagePath As String = "path_to_your_image.png"
		Using img = Pix.LoadFromFile(imagePath)
			' Add tesseract languages into engine
			Using engine = New TesseractEngine(tessDataPath, "eng+spa+fra", EngineMode.Default)
				Using page = engine.Process(img)
					' Extract the text
					Dim text As String = page.GetText()
					Console.WriteLine("Recognized Text:")
					Console.WriteLine(text)
				End Using
			End Using
		End Using
	End Sub
End Class

The code above shows how Tesseract OCR can detect and extract text from images containing multiple languages. It first sets the path to the Tesseract language data files. That directory must contain the .traineddata file for each language used, here English, Spanish, and French.
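
Because the engine depends on those files being present, it can help to verify them before the engine is created. The following is a minimal sketch that assumes the usual Tesseract naming convention of one "<code>.traineddata" file per language code. The class TessDataCheck and the method FindMissingLanguages are hypothetical helpers written for this illustration; they are not part of the Tesseract wrapper.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

// Hypothetical helper for illustration only; not part of the Tesseract package.
public static class TessDataCheck
{
    // Splits a language string such as "eng+spa+fra" into its codes and
    // returns the codes whose "<code>.traineddata" file is not found
    // in the given tessdata directory.
    public static List<string> FindMissingLanguages(string tessDataPath, string languages)
    {
        return languages
            .Split(new[] { '+' }, StringSplitOptions.RemoveEmptyEntries)
            .Select(code => code.Trim())
            .Where(code => code.Length > 0)
            .Where(code => !File.Exists(Path.Combine(tessDataPath, code + ".traineddata")))
            .ToList();
    }
}
```

Calling TessDataCheck.FindMissingLanguages("./tessdata", "eng+spa+fra") before constructing the engine returns an empty list when all three files are in place, so any missing language pack can be reported with a clear message.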

Tesseract OCR for Multiple Languages (Developer Tutorial): Figure 4 - Example input

The image at imagePath is loaded with the Pix.LoadFromFile method. In this example, the image is expected to contain English, Spanish, and French text. Next, a TesseractEngine instance is initialized with the path to the language data files and the language string "eng+spa+fra", which tells it which languages to recognize. The engine runs in its default mode.

Tesseract OCR for Multiple Languages (Developer Tutorial): Figure 5 - Example console output

The image is then passed to the engine's Process method, which analyzes it and returns a page object. Calling GetText on that page extracts the text content, which is stored in the variable text. Finally, the extracted text is printed to the console so you can see the OCR result.
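
Spanish and French text contains accented characters, and depending on the console's configuration these may not display correctly. The sketch below shows one possible way to handle the recognized text as UTF-8, both when printing it and when saving it to a file. The class OcrOutput and its methods are hypothetical names introduced for this illustration; only standard .NET APIs are used.

```csharp
using System;
using System.IO;
using System.Text;

// Hypothetical helper for illustration only.
public static class OcrOutput
{
    // Saves the recognized text as UTF-8 (without a byte order mark)
    // so accented characters such as "ñ" or "ç" are preserved.
    public static void Save(string text, string outputPath)
    {
        File.WriteAllText(outputPath, text, new UTF8Encoding(false));
    }

    // Switches the console to UTF-8 before printing. Whether the characters
    // render correctly still depends on the terminal and its font.
    public static void Print(string text)
    {
        Console.OutputEncoding = Encoding.UTF8;
        Console.WriteLine(text);
    }
}
```

In the example above, OcrOutput.Print(text) could stand in for the final Console.WriteLine call, and OcrOutput.Save(text, "result.txt") would keep a copy of the output, where "result.txt" is only a placeholder file name.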

What is IronOCR?

IronOCR is a proprietary OCR library for .NET. It adds OCR capabilities to .NET applications and allows text extraction from images, scanned documents, PDFs, and other visual media. IronOCR builds on the Tesseract engine and adds several features that make it suitable for enterprise applications.

One of IronOCR's strengths is its language support: more than 120 languages, with automatic language detection and the ability to process documents that contain several languages at once. This makes IronOCR versatile and suitable for global deployments where multilingual document processing is critical.

Tesseract OCR for Multiple Languages (Developer Tutorial): Figure 6 - IronOCR: The C# OCR Library

IronOCR also emphasizes ease of use and integration. Its API is supplemented by detailed documentation and a set of example projects that help developers get up and running quickly. It supports a wide array of image formats as well as PDF documents. Built-in image preprocessing, noise reduction, and error correction features improve OCR accuracy and performance.

Install IronOCR

You can install the package directly into your solution using Visual Studio's NuGet Package Manager. The following screenshot shows how to open it.

Tesseract OCR for Multiple Languages (Developer Tutorial): Figure 7 - How to get to the NuGet package manager through Visual Studio

The NuGet Package Manager has a built-in search box that lists packages from the NuGet website. As the screenshot below shows, we search the package manager for the phrase IronOCR:

Tesseract OCR for Multiple Languages (Developer Tutorial): Figure 8 - Search for IronOCR in the solutions explorer

The screenshot above shows the list of matching packages. Select the IronOCR package and install it into the solution.

Also install the required Tesseract language packs one by one, as shown below for this example.

In this example, we will use Spanish, French, and English language codes. English is the default language pack, so it does not require installation.

Tesseract OCR for Multiple Languages (Developer Tutorial): Figure 9 - Install French language package

Then install the Spanish language pack from NuGet in the same way.

Tesseract OCR for Multiple Languages (Developer Tutorial): Figure 10 - Install Spanish language package

Read Multiple Languages with IronOCR and the Tesseract Engine

The following example demonstrates how to recognize text in multiple languages from an image using C#, IronOCR, and the Tesseract engine.

using System;
using IronOcr;
class Program
{
    static void Main(string[] args)
    {
        // Initialize IronTesseract engine
        var Ocr = new IronTesseract();
        // Add multiple languages
        Ocr.Language = OcrLanguage.English + OcrLanguage.Spanish + OcrLanguage.French;
        // Path to the image
        var inputFile = @"path\to\your\image.png";
        // Read the image and perform OCR
        using (var input = new OcrInput(inputFile))
        {
            // Perform OCR
            var result = Ocr.Read(input);
            // Display the result
            Console.WriteLine("Text:");
            Console.WriteLine(result.Text);
        }
    }
}
Imports IronOcr
Friend Class Program
	Shared Sub Main(ByVal args() As String)
		' Initialize IronTesseract engine
		Dim Ocr = New IronTesseract()
		' Add multiple languages
		Ocr.Language = OcrLanguage.English + OcrLanguage.Spanish + OcrLanguage.French
		' Path to the image
		Dim inputFile = "path\to\your\image.png"
		' Read the image and perform OCR
		Using input = New OcrInput(inputFile)
			' Perform OCR
			Dim result = Ocr.Read(input)
			' Display the result
			Console.WriteLine("Text:")
			Console.WriteLine(result.Text)
		End Using
	End Sub
End Class

The C# program above uses the IronOCR library to perform Optical Character Recognition on an image containing English, Spanish, and French text. It begins by importing the IronOcr namespace, then declares a class named Program with a Main method, which is the application's entry point.

In the Main method, an instance of the IronTesseract class is created and assigned to the variable Ocr. The Language property of this instance is then set to include English, Spanish, and French by combining OcrLanguage.English, OcrLanguage.Spanish, and OcrLanguage.French. This ensures the OCR engine can recognize and process text in any of these three languages.

Next, the path to the input image is stored in the inputFile variable. On the following line, the image is loaded into an OcrInput object inside a using statement, which ensures its resources are disposed of properly. The Read method of the IronTesseract instance is then called with the input object to perform OCR on the image.

Finally, the recognized text is printed to the console. The Console.WriteLine statements output whatever text was detected so that the user can see the OCR results. The program illustrates how to use IronOCR's multilingual capability to extract text from images containing words in different languages.

Tesseract OCR for Multiple Languages (Developer Tutorial): Figure 11 - Recognized text output

Why Is IronOCR Better than Tesseract?

IronOCR is more user-friendly than Tesseract and has several advantages over it. First, IronOCR provides broad language support: it supports 127 languages out of the box. Tesseract is reported to support around 100 languages, but it often requires complex configuration and additional training for the best performance. IronOCR also integrates easily into .NET applications and comes with complete documentation.

IronOCR has a gentler learning curve than Tesseract and requires less technical setup. It also comes with advanced image preprocessing and regular updates that improve accuracy and reliability with complex and varied document types. For developers seeking a solid, versatile, easily applied OCR solution, IronOCR is the better choice.

Conclusion

Tesseract OCR for Multiple Languages (Developer Tutorial): Figure 12 - IronOCR licensing page

Tesseract and IronOCR are both robust OCR technologies, and each has its own capabilities and strengths. Tesseract stands out because it is open-source, widely used, and a leader in language availability and flexibility. It is a reliable option for anyone seeking a free solution with an active community and continuous improvement.

In contrast, IronOCR is a proprietary library for .NET that improves the developer experience with easier integration and support for most image file types. It also performs well in text recognition, including on low-quality images. IronOCR supports many languages and has extra features that make it a more user-friendly and complete option for developers who want ease of use and full support.

A cost-effective development edition of IronOCR is available. When you purchase an IronOCR license, it is a lifetime license. Licenses start at $749, a one-time cost that covers multiple systems, which makes it good value for money. Licensed IronOCR users also receive 24/7 online engineer support. Please refer to the IronOCR website to learn more about pricing and about other Iron Software products.