How to Use Multiple Languages with Tesseract in C#

IronOCR enables text extraction from documents in multiple languages using the Tesseract engine by configuring primary and secondary languages with just one line of code, supporting over 125 language packs for seamless multilingual OCR processing.

Introduction

IronOCR extracts text from a wide range of languages and scripts using the Tesseract engine.

This article explores how IronOCR handles text in multiple languages through Tesseract. You'll learn how to implement multilingual OCR solutions and understand the capabilities of IronOCR and its Tesseract engine integration.

Processing documents in multiple languages is essential for modern applications. International business documents, multilingual websites, and global communication platforms require accurate text extraction across language barriers. IronOCR addresses this need by integrating with Tesseract's extensive language support, enabling text extraction from documents containing multiple scripts and character sets simultaneously.

Quickstart: Using IronOCR to Recognize Text in Multiple Languages

Configure IronOCR with a primary language and add secondary languages in one line to extract text from multilingual documents or images.

Get started with IronOCR from NuGet now:

  1. Install IronOCR with NuGet Package Manager

    PM > Install-Package IronOcr

  2. Copy and run this code snippet.

    string text = new IronTesseract { Language = OcrLanguage.Spanish }.AddSecondaryLanguage(OcrLanguage.French).Read("doc_or_image_path").Text;

  3. Deploy to test on your live environment

How Do I Read Multi-Language PDFs with IronOCR?

IronOCR provides over 125 language packs; only English is installed by default. Additional languages are installed as NuGet packages, and the full list is available on the IronOCR languages page.

PDFs containing multiple languages require specific OCR engine configuration. IronOCR allows you to specify primary and secondary languages before processing documents, ensuring optimal recognition accuracy across different scripts and character sets.

Which Languages Are Available for PDF Extraction?

The following example keeps the default primary language (English) and adds Russian as a secondary language to extract text from a PDF file.

using IronOcr;
using System;

// Instantiate IronTesseract
IronTesseract ocrTesseract = new IronTesseract();

// Set secondary language to Russian
ocrTesseract.AddSecondaryLanguage(OcrLanguage.Russian);

// Add PDF
using var pdfInput = new OcrPdfInput(@"example.pdf");
// Perform OCR
OcrResult result = ocrTesseract.Read(pdfInput);

// Output extracted text to console
Console.WriteLine(result.Text);

The equivalent VB.NET code:

Imports IronOcr
Imports System

Module Program
    Sub Main()
        ' Instantiate IronTesseract
        Dim ocrTesseract As New IronTesseract()

        ' Set secondary language to Russian
        ocrTesseract.AddSecondaryLanguage(OcrLanguage.Russian)

        ' Add PDF and dispose of it when OCR is finished
        Using pdfInput As New OcrPdfInput("example.pdf")
            ' Perform OCR
            Dim result As OcrResult = ocrTesseract.Read(pdfInput)

            ' Output extracted text to console
            Console.WriteLine(result.Text)
        End Using
    End Sub
End Module

For complex PDF processing scenarios, see our guide on PDF OCR Text Extraction covering advanced techniques for various PDF formats and structures.

How Does Language Priority Affect OCR Results?

Add any number of secondary languages using the AddSecondaryLanguage method. Note that additional languages may affect speed and performance. Language priority depends on the order added, with the first having higher priority.

Understanding language priority is crucial when processing multilingual documents. The primary language receives the highest priority during text extraction, so the OCR engine favors it when interpreting characters. Secondary languages cover text that does not match the primary language's patterns.

For optimal performance:

  • Set the most common language in your document as primary
  • Add secondary languages ordered by frequency in the document
  • Limit secondary languages to those necessary for your use case
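
The ordering advice above can be sketched as a small helper. This is a minimal sketch under stated assumptions, not part of IronOCR: `LanguagePlan.OrderByShare` and the share estimates are hypothetical, and the Spanish, French, and Russian language packs are assumed to be installed. Only `Language`, `AddSecondaryLanguage`, `OcrImageInput`, and `Read` are taken from the examples in this article.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using IronOcr;

// Hypothetical estimates of how much of the document each language covers.
var estimatedShare = new List<(OcrLanguage Language, double Share)>
{
    (OcrLanguage.French, 0.25),
    (OcrLanguage.Spanish, 0.60),
    (OcrLanguage.Russian, 0.15),
};

// Most frequent language first; it becomes the primary language.
List<OcrLanguage> ordered = LanguagePlan.OrderByShare(estimatedShare);

var ocr = new IronTesseract { Language = ordered[0] };

// Remaining languages are added in descending order of frequency.
foreach (OcrLanguage secondary in ordered.Skip(1))
{
    ocr.AddSecondaryLanguage(secondary);
}

using var input = new OcrImageInput("example.png");
Console.WriteLine(ocr.Read(input).Text);

// Hypothetical helper, not part of IronOCR.
static class LanguagePlan
{
    // Returns the languages sorted from the largest share to the smallest.
    public static List<T> OrderByShare<T>(IEnumerable<(T Language, double Share)> shares) =>
        shares.OrderByDescending(s => s.Share).Select(s => s.Language).ToList();
}
```

Keeping the ordering logic separate from the OCR call makes it easy to drop rarely used languages before they are added, which is in line with the advice to limit secondary languages.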

For high-performance applications with multiple languages, see our Fast OCR Configuration guide to optimize processing speed.

How Do I Process Multi-Language Images with Tesseract?

English is the default primary language. To change it, set the Language property to your desired language, then add secondary languages as needed.

Images containing multilingual text require careful configuration. Unlike PDFs, images may contain varied text orientations, different fonts, and mixed scripts. IronOCR's Tesseract integration provides comprehensive language configuration options for these scenarios.

When Should I Change the Default Language Setting?

Change the default language when:

  • The document majority is in a non-English language
  • Processing documents from a specific region or country
  • Your application targets users working with non-English content
  • Optimizing recognition accuracy for specific character sets
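
One way to act on these criteria is to derive the primary language from the region a document comes from. The sketch below is illustrative only: `PrimaryLanguageFor` and its mapping table are hypothetical, and every non-English entry assumes the matching language pack has been installed from NuGet.

```csharp
using System;
using System.Globalization;
using IronOcr;

// Hypothetical mapping from an ISO 639-1 language code to an IronOCR language.
// Unknown codes fall back to English, which is IronOCR's default.
static OcrLanguage PrimaryLanguageFor(CultureInfo culture) =>
    culture.TwoLetterISOLanguageName switch
    {
        "es" => OcrLanguage.Spanish,
        "fr" => OcrLanguage.French,
        "ru" => OcrLanguage.Russian,
        "ja" => OcrLanguage.Japanese,
        _ => OcrLanguage.English,
    };

// Example: documents from Japan, with occasional English text.
var culture = new CultureInfo("ja-JP");

var ocr = new IronTesseract { Language = PrimaryLanguageFor(culture) };
ocr.AddSecondaryLanguage(OcrLanguage.English);

using var imageInput = new OcrImageInput("example.png");
Console.WriteLine(ocr.Read(imageInput).Text);
```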

Here's a complete multi-language image processing example:

using IronOcr;
using System;

// Instantiate IronTesseract
IronTesseract ocrTesseract = new IronTesseract();

// Set primary language to Russian
ocrTesseract.Language = OcrLanguage.Russian;
// Add Japanese as a secondary language
ocrTesseract.AddSecondaryLanguage(OcrLanguage.Japanese);

// Add image
using var imageInput = new OcrImageInput(@"example.png");
// Perform OCR
OcrResult result = ocrTesseract.Read(imageInput);

// Output extracted text to console
Console.WriteLine(result.Text);

For custom languages or specialized fonts, see our tutorial on Using Custom Language Files.

What Results Can I Expect from Multi-Language OCR?

Proper configuration produces results like these:

[Figure: Multi-language text processing app showing Russian and Japanese content, with console output displaying the extracted characters]

Multi-language OCR result quality depends on several factors:

  1. Image Quality: Higher resolution (300+ DPI) produces better results. See our DPI Settings guide.
  2. Text Clarity: Clear, well-defined text without artifacts yields more accurate recognition.
  3. Language Configuration: Proper primary and secondary language setup ensures correct character recognition patterns.
  4. Pre-processing: Appropriate filters improve results significantly. See our Image Correction Filters guide for enhancement techniques.
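
The factors above could be combined along these lines. Treat this as a sketch rather than a reference: it assumes that `TargetDPI`, `DeNoise()`, and `Deskew()` are available on the input object and that `OcrResult` exposes a `Confidence` value in your IronOCR version, so check the DPI Settings and Image Correction Filters guides before relying on it. The Russian and Japanese language packs are assumed to be installed.

```csharp
using System;
using IronOcr;

// Language configuration: Russian as primary, Japanese as secondary.
var ocr = new IronTesseract { Language = OcrLanguage.Russian };
ocr.AddSecondaryLanguage(OcrLanguage.Japanese);

using var imageInput = new OcrImageInput("example.png");

// Assumption: these members exist on the input type in your IronOCR version.
imageInput.TargetDPI = 300; // image quality: aim for at least 300 DPI
imageInput.DeNoise();       // text clarity: reduce digital noise
imageInput.Deskew();        // pre-processing: straighten rotated text

OcrResult result = ocr.Read(imageInput);

Console.WriteLine(result.Text);

// Assumption: Confidence reports how certain the engine is about the result.
Console.WriteLine($"Confidence: {result.Confidence}");
```

Comparing the reported confidence with and without each filter is a simple way to decide which pre-processing steps are worth their cost for a given document set.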

What Are the Key Takeaways for Multi-Language OCR?

IronOCR, using the Tesseract engine, extracts text from multilingual documents effectively. It handles the complexities of reading text in many languages, providing a versatile solution. Whether processing PDFs with various languages or working with multilingual image content, IronOCR simplifies recognizing and extracting text across languages.

Key advantages of IronOCR for multi-language text extraction:

  • Extensive Language Support: Over 125 international OCR languages via NuGet packages
  • Flexible Configuration: Simple API for primary and secondary language settings
  • High Accuracy: Uses Tesseract 5's advanced recognition algorithms
  • Performance Optimization: Built-in multithreading support
  • Cross-Platform Compatibility: Works on Windows, Linux, and macOS

IronOCR provides a comprehensive solution combining ease of use with powerful features for multi-language OCR implementation. Build document management systems, translation tools, or any application requiring multilingual text extraction with the flexibility and reliability needed for success.

Start your multi-language OCR project by downloading IronOCR from NuGet and exploring our documentation and examples. For specific use cases or advanced scenarios, our troubleshooting guides provide insights for optimal results.

Frequently Asked Questions

How do I perform OCR on documents containing multiple languages?

IronOCR allows you to configure multilingual OCR with just one line of code. Set a primary language using the Language property and add secondary languages using the AddSecondaryLanguage method. This enables IronOCR to accurately extract text from documents containing multiple scripts and character sets simultaneously.

Which languages are supported for text extraction?

IronOCR supports over 125 language packs through its Tesseract engine integration. While English is installed by default, you can download additional language packs from NuGet to enable OCR capabilities for languages ranging from Spanish and French to Arabic, Chinese, Japanese, and many more.

How do I add secondary languages for OCR processing?

Use the AddSecondaryLanguage method in IronOCR to enable additional languages. For example: new IronTesseract { Language = OcrLanguage.Spanish }.AddSecondaryLanguage(OcrLanguage.French). This configuration allows IronOCR to recognize text in both Spanish and French within the same document.

Can I extract text from multilingual PDFs?

Yes, IronOCR can process PDFs containing multiple languages. Simply configure the OCR engine with your primary and secondary languages before processing. IronOCR will automatically handle different scripts and character sets within the PDF, ensuring accurate text extraction across all languages present in the document.

Do I need to install language packs separately?

Yes, while IronOCR includes English by default, additional language packs must be installed via NuGet. Each language pack contains the necessary data for IronOCR's Tesseract engine to recognize text in that specific language. You can view and download all available language packs from the IronOCR languages page.

What is the minimal workflow for multilingual OCR?

The minimal workflow involves 5 steps: 1) Download IronOCR library, 2) Prepare your PDF or image document, 3) Install required language packs via NuGet, 4) Use AddSecondaryLanguage method to enable additional languages, and 5) Set the Language property for your primary language. This setup enables accurate multilingual text extraction.

Kannapat Udonpant
Software Engineer
Before becoming a Software Engineer, Kannapat completed an Environmental Resources PhD from Hokkaido University in Japan. While pursuing his degree, Kannapat also became a member of the Vehicle Robotics Laboratory, which is part of the Department of Bioproduction Engineering. In 2022, he leveraged his C# skills to join Iron Software's engineering ...
Reviewed by
Jeff Fritz
Jeffrey T. Fritz
Principal Program Manager - .NET Community Team
Jeff is a Principal Program Manager for the .NET and Visual Studio teams. He is the executive producer of the .NET Conf virtual conference series and hosts 'Fritz and Friends', a live stream for developers that airs twice weekly where he talks tech and writes code together with viewers. Jeff writes workshops, presentations, and plans content for the largest Microsoft developer events including Microsoft Build, Microsoft Ignite, .NET Conf, and the Microsoft MVP Summit.