How to use Multiple Languages with Tesseract
In the realm of Optical Character Recognition (OCR) technology, IronOCR is a well-regarded tool known for its ability to extract text from various languages and scripts. We use the Tesseract Engine to provide a reliable and easy-to-use OCR tool.
In this article, we'll how IronOCR effectively handles text in multiple languages, thanks to Tesseract. Whether you're an experienced developer looking for a reliable multilingual OCR solution or simply curious about how it all works, this article will help you understand IronOCR and its Tesseract engine, shedding light on the capabilities of this invaluable tool
How to use Multiple Languages with Tesseract
- Download a C# library for reading multiple languages
- Prepare the PDF document and image for reading
- Install additional language pack via NuGet
- Use the
AddSecondaryLanguage
method to enable the desired languages - Set the Language property to change the default language
Install with NuGet
Install-Package IronOcr
Download DLL
Manually install into your project
Read Multi-Language PDF Example
IronOcr provides about 125 language packs however only English is installed by default, the rest can be download from NuGet. You can have a look at all the available language packs here..
In the following example I will show you the code for using multiple languages in IronOcr to extract text from a PDF file.
:path=/static-assets/ocr/content-code-examples/how-to/ocr-multiple-languages-pdf-input.cs
using IronOcr;
using System;
// Instantiate IronTesseract
IronTesseract ocrTesseract = new IronTesseract();
// Set secondary language to Russian
ocrTesseract.AddSecondaryLanguage(OcrLanguage.Russian);
// Add PDF
using var pdfInput = new OcrPdfInput(@"example.pdf");
// Perform OCR
OcrResult result = ocrTesseract.Read(pdfInput);
// Output extracted text to console
Console.WriteLine(result.Text);
Imports IronOcr
Imports System
' Instantiate IronTesseract
Private ocrTesseract As New IronTesseract()
' Set secondary language to Russian
ocrTesseract.AddSecondaryLanguage(OcrLanguage.Russian)
' Add PDF
Dim pdfInput = New OcrPdfInput("example.pdf")
' Perform OCR
Dim result As OcrResult = ocrTesseract.Read(pdfInput)
' Output extracted text to console
Console.WriteLine(result.Text)
You can add any number of secondary languages using the AddSecondaryLanguage
method. However, please note that this addition may affect speed and performance. The priority of the language depends on the order in which it is added, with the first added having higher priority.
Read Multi-Language Image Example
The primary language is set to English by default. To change the primary language, set the Language property to the desired language. Afterward, you can also add secondary languages.
:path=/static-assets/ocr/content-code-examples/how-to/ocr-multiple-languages-image-input.cs
using IronOcr;
using System;
// Instantiate IronTesseract
IronTesseract ocrTesseract = new IronTesseract();
// Set primary language to Hindi
ocrTesseract.Language = OcrLanguage.Russian;
ocrTesseract.AddSecondaryLanguage(OcrLanguage.Japanese);
// Add image
using var imageInput = new OcrImageInput(@"example.png");
// Perform OCR
OcrResult result = ocrTesseract.Read(imageInput);
// Output extracted text to console
Console.WriteLine(result.Text);
Imports IronOcr
Imports System
' Instantiate IronTesseract
Private ocrTesseract As New IronTesseract()
' Set primary language to Hindi
ocrTesseract.Language = OcrLanguage.Russian
ocrTesseract.AddSecondaryLanguage(OcrLanguage.Japanese)
' Add image
Dim imageInput = New OcrImageInput("example.png")
' Perform OCR
Dim result As OcrResult = ocrTesseract.Read(imageInput)
' Output extracted text to console
Console.WriteLine(result.Text)
If you do this right you can expect results like the ones below.
Conclusion
In brief, IronOCR, backed by the powerful Tesseract engine, excels at extracting text from documents in multiple languages. It's an indispensable tool for handling the complexities of reading text in many languages, offering developers and curious minds a versatile solution. Whether you're processing PDFs with text in various languages or working with multilingual content in images, IronOCR simplifies the task of recognizing and extracting text in multiple languages.