Tesseract OCR for Multiple Languages (Developer Tutorial)

Optical Character Recognition (OCR) converts scanned images, PDF files, and digital camera photos into editable, searchable data. It is widely used to digitize printed documents for editing, searching, and storage, which also reduces the physical space that paper records occupy. OCR plays a major role in automating data entry, saving companies and organizations significant time and manual labor.

OCR uses machine learning and pattern recognition to extract text from images. Recent advances have improved its accuracy and extended support to more languages and to complex scripts such as Arabic. OCR has become an indispensable tool in the finance, healthcare, legal, and education sectors, where large volumes of printed documents must be processed quickly. This article uses Tesseract to perform OCR on images containing text in multiple languages.

How to Use Tesseract OCR with Multiple Languages

  1. First, install the IronOCR/Tesseract NuGet package inside your .NET project.
  2. Create an instance of the IronTesseract class, which will initialize the OCR engine.
  3. Set the Language property to the languages you want to recognize; it accepts more than one language.
  4. Specify the image file path you want to process, then create an OcrInput object.
  5. Now, perform OCR on the input image using the Read function of the IronTesseract instance.
  6. Take the result and display the recognized text.

What is Tesseract?

Tesseract is an open-source Optical Character Recognition engine originally developed by Hewlett-Packard and later sponsored by Google. It is known for its accuracy and adaptability, which has made it one of the most widely used OCR engines. Tesseract supports script detection and can recognize text in many languages, including several in the same document, so it is commonly chosen for projects that must handle multilingual documents.

The Tesseract OCR engine analyzes the pixels of an image, identifies patterns that correspond to characters, words, and sentences, and converts them into machine-readable text. It accepts common image file types such as TIFF, JPEG, and PNG, and can produce output in formats like plain text, HTML, and searchable PDF.

One significant advantage of Tesseract is that it can be trained to recognize particular fonts or additional languages. It is used in applications ranging from simple text extraction to more complex tasks such as digitizing historical documents, processing invoices, and powering accessibility software that enables reading for the visually impaired.

Creating a New Project in Visual Studio

Open Visual Studio and go to the File menu. Select New Project, then choose Console Application. In this tutorial, we will run OCR from a console program.

Tesseract OCR for Multiple Languages (Developer Tutorial): Figure 1 - Create a new project

Enter your project's name and location in the text boxes provided. Then, as shown in the image below, select the .NET Framework version you need and click the Create button.

Tesseract OCR for Multiple Languages (Developer Tutorial): Figure 2 - Select the .NET Framework along with providing a project and save location.

Once the framework version has been selected, Visual Studio creates the project structure and opens the Program.cs file, where you can add code and then build and run the application.

Install Tesseract OCR For .NET

The first step is downloading and installing the Tesseract OCR software on your computer. Here is the official Tesseract GitHub repository with the Tesseract installer: https://github.com/tesseract-ocr/tesseract.

Install Tesseract OCR by following the setup instructions for your operating system (Windows, macOS, or Linux). Once it is installed, add the Tesseract .NET wrapper package to your C# project using Visual Studio's NuGet Package Manager.

Open the NuGet Package Manager in your Visual Studio project from Tools -> NuGet Package Manager -> Manage NuGet Packages for Solution. Afterward, search for "Tesseract" in the NuGet Package Manager to obtain either the "Tesseract" or "Tesseract.NET" package. Select this package and click the Install button to install it in your project.

Tesseract OCR for Multiple Languages (Developer Tutorial): Figure 3 - Search for Tesseract in the browse tab

Tesseract OCR using C#

After installing the Tesseract .NET wrapper, configure it in your C# project by specifying the location of the Tesseract language data files. Here's an example:

using System;
using Tesseract;

class Program
{
    static void Main()
    {
        // Set the path to the Tesseract data files (traineddata files)
        string tessDataPath = @"./tessdata"; // Ensure this directory contains the language data files

        // Load the image
        string imagePath = @"path_to_your_image.png";
        using (var img = Pix.LoadFromFile(imagePath))
        {
            // Add languages to the Tesseract engine
            using (var engine = new TesseractEngine(tessDataPath, "eng+spa+fra", EngineMode.Default))
            {
                using (var page = engine.Process(img))
                {
                    // Extract the text
                    string text = page.GetText();
                    Console.WriteLine("Recognized Text:");
                    Console.WriteLine(text);
                }
            }
        }
    }
}

The equivalent Visual Basic code:

Imports System
Imports Tesseract

Friend Class Program
	Shared Sub Main()
		' Set the path to the Tesseract data files (traineddata files)
		Dim tessDataPath As String = "./tessdata" ' Ensure this directory contains the language data files

		' Load the image
		Dim imagePath As String = "path_to_your_image.png"
		Using img = Pix.LoadFromFile(imagePath)
			' Add languages to the Tesseract engine
			Using engine = New TesseractEngine(tessDataPath, "eng+spa+fra", EngineMode.Default)
				Using page = engine.Process(img)
					' Extract the text
					Dim text As String = page.GetText()
					Console.WriteLine("Recognized Text:")
					Console.WriteLine(text)
				End Using
			End Using
		End Using
	End Sub
End Class

The above code shows how Tesseract OCR can detect and extract text from images containing multiple languages. It first sets the path to the Tesseract language data files. That directory must contain the .traineddata file for each language used, here English (eng), Spanish (spa), and French (fra).
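
Because a missing .traineddata file is a common cause of engine initialization failures, it can help to check the data directory before creating the engine. The following is a minimal sketch using only the .NET base class library; the class name TessDataCheck and the method MissingLanguages are hypothetical helpers introduced here for illustration and are not part of the Tesseract wrapper.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

// Hypothetical helper (not part of any OCR library).
public static class TessDataCheck
{
    // Returns the language codes from a "+"-joined string such as "eng+spa+fra"
    // that have no matching <code>.traineddata file in tessDataPath.
    public static List<string> MissingLanguages(string tessDataPath, string languages)
    {
        return languages
            .Split(new[] { '+' }, StringSplitOptions.RemoveEmptyEntries)
            .Select(code => code.Trim())
            .Where(code => !File.Exists(Path.Combine(tessDataPath, code + ".traineddata")))
            .ToList();
    }
}
```

Calling TessDataCheck.MissingLanguages("./tessdata", "eng+spa+fra") before constructing the engine returns an empty list when all three files are present, and otherwise lists the codes whose files still need to be added.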

Tesseract OCR for Multiple Languages (Developer Tutorial): Figure 4 - Example input

The image specified by imagePath is loaded using the Pix.LoadFromFile method; in this example it is expected to contain English, Spanish, and French text. Next, a TesseractEngine instance is initialized with the path to the language data files and the language string "eng+spa+fra", which joins the three language codes with plus signs. The engine runs in its default mode.
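
When the set of languages comes from configuration or user input rather than a hard-coded literal, the plus-separated string can be assembled in code. This is a small sketch under that assumption; LanguageSpec and Build are hypothetical names chosen for illustration, and the method only formats the string without checking that the codes are valid Tesseract languages.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper (not part of any OCR library).
public static class LanguageSpec
{
    // Joins language codes into the "eng+spa+fra" form, trimming whitespace
    // and dropping blanks and duplicates while keeping the original order.
    public static string Build(IEnumerable<string> codes)
    {
        var cleaned = codes
            .Select(code => code.Trim())
            .Where(code => code.Length > 0)
            .Distinct()
            .ToList();

        if (cleaned.Count == 0)
            throw new ArgumentException("At least one language code is required.", nameof(codes));

        return string.Join("+", cleaned);
    }
}
```

The result of LanguageSpec.Build(new[] { "eng", "spa", "fra" }) can be passed as the second argument to the TesseractEngine constructor in place of the literal.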

Tesseract OCR for Multiple Languages (Developer Tutorial): Figure 5 - Example console output

The image is then processed with the engine's Process method, which analyzes it and returns a page object. Calling GetText on that page returns the recognized text, which is stored in the variable text and printed to the console.
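
Multilingual output often contains accented characters (for example in Spanish or French), so it can be useful to save the recognized text as UTF-8 rather than rely on the console. The sketch below uses only the .NET base class library; OcrTextOutput and SaveAsUtf8 are hypothetical names introduced for illustration.

```csharp
using System.IO;
using System.Text;

// Hypothetical helper (not part of any OCR library).
public static class OcrTextOutput
{
    // Writes the recognized text to outputPath as UTF-8 without a byte order mark,
    // so accented and non-Latin characters are preserved.
    public static void SaveAsUtf8(string text, string outputPath)
    {
        File.WriteAllText(outputPath, text, new UTF8Encoding(false));
    }
}
```

In the example above, this could be called as OcrTextOutput.SaveAsUtf8(text, "result.txt") after GetText. If accented characters appear garbled in the console itself, setting Console.OutputEncoding to Encoding.UTF8 before printing may help, depending on the terminal.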

What is IronOCR?

IronOCR is a proprietary OCR library for .NET. It adds OCR capabilities to .NET applications and allows text extraction from images, scanned documents, PDFs, and other visual media. IronOCR is built on the Tesseract engine and adds several features that make it suitable for use in enterprise applications.

IronOCR offers broad language support: more than 120 languages, with automatic language detection and the ability to process documents containing multiple languages simultaneously. This makes IronOCR well suited to global deployments where multilingual document processing is critical.

Tesseract OCR for Multiple Languages (Developer Tutorial): Figure 6 - IronOCR: The C# OCR Library

IronOCR also emphasizes ease of use and integration. Its API is supplemented by detailed documentation and a set of example projects that help developers get up and running quickly. It supports a wide array of image formats as well as PDF documents. Built-in image preprocessing, noise reduction, and error correction features improve OCR accuracy and performance.

Install IronOCR

You can install the packages directly into your solution using Visual Studio's NuGet Package management tool. The following snapshot shows how to open the NuGet Package Manager.

Tesseract OCR for Multiple Languages (Developer Tutorial): Figure 7 - How to get to the NuGet package manager through Visual Studio

The package manager has a search box that lists packages from the NuGet website. As seen in the screenshot below, we search for the phrase IronOCR:

Tesseract OCR for Multiple Languages (Developer Tutorial): Figure 8 - Search for IronOCR in the solutions explorer

The search results list several related packages. Select the IronOCR package and install it.

Also install the required Tesseract language packs one by one, as shown below.

This example uses the Spanish, French, and English language packs. English is the default language pack and does not require installation.

Tesseract OCR for Multiple Languages (Developer Tutorial): Figure 9 - Install French language package

Install the Spanish language pack from NuGet in the same way.

Tesseract OCR for Multiple Languages (Developer Tutorial): Figure 10 - Install Spanish language package

Read Multiple Languages with IronOCR with Tesseract Engine

The following example demonstrates how to recognize text in multiple languages from an image using C# and IronOCR with its Tesseract engine.

using System;
using IronOcr;

class Program
{
    static void Main(string[] args)
    {
        // Initialize IronTesseract engine
        var Ocr = new IronTesseract();

        // Add multiple languages
        Ocr.Language = OcrLanguage.English + OcrLanguage.Spanish + OcrLanguage.French;

        // Path to the image
        var inputFile = @"path\to\your\image.png";

        // Read the image and perform OCR
        using (var input = new OcrInput(inputFile))
        {
            // Perform OCR
            var result = Ocr.Read(input);

            // Display the result
            Console.WriteLine("Recognized Text:");
            Console.WriteLine(result.Text);
        }
    }
}

The equivalent Visual Basic code:

Imports IronOcr

Friend Class Program
	Shared Sub Main(ByVal args() As String)
		' Initialize IronTesseract engine
		Dim Ocr = New IronTesseract()

		' Add multiple languages
		Ocr.Language = OcrLanguage.English + OcrLanguage.Spanish + OcrLanguage.French

		' Path to the image
		Dim inputFile = "path\to\your\image.png"

		' Read the image and perform OCR
		Using input = New OcrInput(inputFile)
			' Perform OCR
			Dim result = Ocr.Read(input)

			' Display the result
			Console.WriteLine("Recognized Text:")
			Console.WriteLine(result.Text)
		End Using
	End Sub
End Class

The above C# program uses the IronOCR library to perform Optical Character Recognition on an image containing English, Spanish, and French characters. The program begins by importing the required namespace for IronOCR and declaring a class named Program with a Main method, which is the application's entry point.

In the Main method, an instance of the IronTesseract class is created and assigned to the variable Ocr. The Language property is set to include English, Spanish, and French by combining OcrLanguage.English, OcrLanguage.Spanish, and OcrLanguage.French. This allows the OCR engine to recognize and process text in any of these three languages.

The path to the input image file is stored in the inputFile variable. The image is then loaded into an OcrInput instance inside a using statement, which ensures the resource is disposed of properly. Finally, the Read method of the IronTesseract instance is called with the input object to perform OCR on the image.

The recognized text is then printed to the console using the Console.WriteLine method. This program illustrates an effective method for using IronOCR's multilingual capability to extract text from images containing words in different languages.

Tesseract OCR for Multiple Languages (Developer Tutorial): Figure 11 - Recognized text output

Why Is IronOCR Better than Tesseract?

IronOCR is more user-friendly than Tesseract and offers several advantages. First, IronOCR provides language support for 127 languages through installable language packs, while Tesseract may require complex configuration and additional training for optimal performance with some of its 100 supported languages. Additionally, IronOCR integrates easily into .NET applications and comes with comprehensive documentation.

IronOCR has a gentler learning curve and requires less technical setup than Tesseract. It also features advanced image preprocessing and regular updates for better accuracy and reliability with complex document types. IronOCR is a good choice for developers seeking a solid, versatile, and easily applied OCR solution.

Conclusion

Tesseract OCR for Multiple Languages (Developer Tutorial): Figure 12 - IronOCR licensing page

While both Tesseract and IronOCR are robust OCR technologies, each has its own strengths. Tesseract is open source, making it a reliable choice for anyone seeking a free solution, and it benefits from an active community and continuous improvement.

In contrast, IronOCR is a proprietary library for .NET that offers an improved user experience, easier integration, and support for most image file types. It also performs well in text recognition, particularly with low-quality images. IronOCR supports many languages and has extra features that make it more user-friendly.

IronOCR offers a cost-effective development edition, and a purchased license is valid for a lifetime. The IronOCR package starts at $749 as a one-time cost for multiple systems and includes 24/7 online engineer support for licensed users. For more information, refer to the IronOCR website.

Frequently Asked Questions

What is OCR technology?

OCR (Optical Character Recognition) is a technology used to convert different types of documents into editable and searchable data. It processes scanned images, PDF files, and digital camera photos to extract text for editing, searching, and storing.

How does OCR work?

OCR works by analyzing the pixels of an image, recognizing patterns that correspond to characters, words, and sentences, and converting them into machine-readable text. It supports various image file types and can produce text in formats like plain text, HTML, and searchable PDF.

How can OCR be used with multiple languages?

To use OCR with multiple languages, you can utilize IronOCR by installing the IronOCR/Tesseract NuGet package in your .NET project, creating an instance of the IronTesseract class, setting the language property to include multiple languages, and processing the image using the Read function.

What makes IronOCR different from other OCR tools?

IronOCR is more user-friendly, offering excellent language support with 127 languages out-of-the-box. It integrates easily into .NET applications, has comprehensive documentation, and features advanced image preprocessing for better accuracy, making it a suitable choice for developers.

What are the benefits of using OCR technology?

Modern OCR engines such as Tesseract are highly accurate, support script detection, recognize text in many languages, and can handle multilingual documents, making them well suited to projects requiring comprehensive language support.

How do you set up a new Visual Studio project for OCR?

To set up a new Visual Studio project, open Visual Studio, go to the File menu, select New Project, and choose Console Application. Enter your project's name and location, select the .NET Framework version, and then add your code and build and run the application.

What languages does IronOCR support?

IronOCR supports more than 120 languages, including automatic language detection and processing of multilingual documents, making it highly versatile for global deployment.

How do you install OCR for .NET?

To install OCR for .NET, you can add the IronOCR package to your C# project using Visual Studio's NuGet Package Manager, along with the language packs for any additional languages you need.

What is the cost of the IronOCR package?

The IronOCR package starts at $749 as a one-time cost for multiple systems, offering a cost-effective development edition with a lifetime license and 24/7 online engineer support for licensed users.

What are the use cases for OCR technology?

OCR technology is used extensively in finance, health, legislation, and education sectors to automate data entry, digitize printed documents, process invoices, and enable accessibility software for the visually impaired.

Kannapat Udonpant
Software Engineer
Before becoming a Software Engineer, Kannapat completed an Environmental Resources PhD from Hokkaido University in Japan. While pursuing his degree, Kannapat also became a member of the Vehicle Robotics Laboratory, which is part of the Department of Bioproduction Engineering. In 2022, he leveraged his C# skills to join Iron Software's engineering team, where he focuses on IronPDF. Kannapat values his job because he learns directly from the developer who writes most of the code used in IronPDF. In addition to peer learning, Kannapat enjoys the social aspect of working at Iron Software. When he's not writing code or documentation, Kannapat can usually be found gaming on his PS5 or rewatching The Last of Us.