How to use Multiple Languages with Tesseract

Kannapat Udonpant

October 25, 2023

Updated December 10, 2024

In the realm of Optical Character Recognition (OCR) technology, IronOCR is a well-regarded tool known for its ability to extract text from various languages and scripts. We use the Tesseract Engine to provide a reliable and easy-to-use OCR tool.

In this article, we'll look at how IronOCR handles text in multiple languages, thanks to Tesseract. Whether you're an experienced developer looking for a reliable multilingual OCR solution or simply curious about how it all works, this article will help you understand IronOCR and its Tesseract engine.


How to use Multiple Languages with Tesseract

Download a C# library for reading multiple languages
Prepare the PDF document and image for reading
Install additional language pack via NuGet
Use the AddSecondaryLanguage method to enable the desired languages
Set the Language property to change the default language

Read Multi-Language PDF Example

IronOCR provides about 125 language packs, but only English is installed by default; the rest can be downloaded from NuGet. You can have a look at all the available language packs here.

The following example uses multiple languages in IronOCR to extract text from a PDF file.


using IronOcr;
using System;

// Instantiate IronTesseract
IronTesseract ocrTesseract = new IronTesseract();

// Set secondary language to Russian
ocrTesseract.AddSecondaryLanguage(OcrLanguage.Russian);

// Add PDF
using var pdfInput = new OcrPdfInput(@"example.pdf");
// Perform OCR
OcrResult result = ocrTesseract.Read(pdfInput);

// Output extracted text to console
Console.WriteLine(result.Text);

Imports IronOcr
Imports System

' Instantiate IronTesseract
Dim ocrTesseract As New IronTesseract()

' Set secondary language to Russian
ocrTesseract.AddSecondaryLanguage(OcrLanguage.Russian)

' Add PDF
Dim pdfInput = New OcrPdfInput("example.pdf")
' Perform OCR
Dim result As OcrResult = ocrTesseract.Read(pdfInput)

' Output extracted text to console
Console.WriteLine(result.Text)


You can add any number of secondary languages using the AddSecondaryLanguage method. However, each additional language may slow down recognition. Languages are prioritized in the order they are added, so the first one added has the highest priority.
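As a minimal sketch of how that ordering might look in practice, the snippet below adds two secondary languages to the same kind of PDF read shown earlier. It assumes the Russian and Japanese language packs have already been installed from NuGet, and "example.pdf" is a placeholder file name rather than a real document.

```csharp
using IronOcr;
using System;

// Instantiate IronTesseract; the primary language stays at its default (English)
IronTesseract ocrTesseract = new IronTesseract();

// Secondary languages in priority order: Russian is added first, Japanese second.
// Assumption: both language packs are already installed from NuGet.
ocrTesseract.AddSecondaryLanguage(OcrLanguage.Russian);
ocrTesseract.AddSecondaryLanguage(OcrLanguage.Japanese);

// Add PDF ("example.pdf" is a placeholder path)
using var pdfInput = new OcrPdfInput(@"example.pdf");

// Perform OCR with English, Russian, and Japanese enabled
OcrResult result = ocrTesseract.Read(pdfInput);

// Output extracted text to console
Console.WriteLine(result.Text);
```

Because each extra language can cost speed, it is usually worth adding only the languages you expect to find in the document.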

Read Multi-Language Image Example

The primary language is set to English by default. To change the primary language, set the Language property to the desired language. Afterward, you can also add secondary languages.


using IronOcr;
using System;

// Instantiate IronTesseract
IronTesseract ocrTesseract = new IronTesseract();

// Set primary language to Russian, then add Japanese as a secondary language
ocrTesseract.Language = OcrLanguage.Russian;
ocrTesseract.AddSecondaryLanguage(OcrLanguage.Japanese);

// Add image
using var imageInput = new OcrImageInput(@"example.png");
// Perform OCR
OcrResult result = ocrTesseract.Read(imageInput);

// Output extracted text to console
Console.WriteLine(result.Text);

Imports IronOcr
Imports System

' Instantiate IronTesseract
Dim ocrTesseract As New IronTesseract()

' Set primary language to Russian, then add Japanese as a secondary language
ocrTesseract.Language = OcrLanguage.Russian
ocrTesseract.AddSecondaryLanguage(OcrLanguage.Japanese)

' Add image
Dim imageInput = New OcrImageInput("example.png")
' Perform OCR
Dim result As OcrResult = ocrTesseract.Read(imageInput)

' Output extracted text to console
Console.WriteLine(result.Text)


If everything is set up correctly, you can expect results like the ones below.

Russian and Japanese

Conclusion

In brief, IronOCR, backed by the powerful Tesseract engine, excels at extracting text from documents in multiple languages. It's an indispensable tool for handling the complexities of reading text in many languages, offering developers and curious minds a versatile solution. Whether you're processing PDFs with text in various languages or working with multilingual content in images, IronOCR simplifies the task of recognizing and extracting text in multiple languages.

Kannapat Udonpant


Software Engineer

Before becoming a Software Engineer, Kannapat completed an Environmental Resources PhD from Hokkaido University in Japan. While pursuing his degree, Kannapat also became a member of the Vehicle Robotics Laboratory, which is part of the Department of Bioproduction Engineering. In 2022, he leveraged his C# skills to join Iron Software's engineering team, where he focuses on IronPDF. Kannapat values his job because he learns directly from the developer who writes most of the code used in IronPDF. In addition to peer learning, Kannapat enjoys the social aspect of working at Iron Software. When he's not writing code or documentation, Kannapat can usually be found gaming on his PS5 or rewatching The Last of Us.