How to Use Multiple Languages with Tesseract

This tutorial shows how to use Tesseract through IronOCR to recognize text in multiple languages from PDFs and images. First, install IronOCR and the language packs you need in your project using the NuGet package manager. Then import the required namespaces and set a valid IronOCR license key to unlock its full capabilities. Instantiate an IronTesseract object to perform optical character recognition, initially with English as the primary language. To add support for another language, such as Russian, call the AddSecondaryLanguage method.

The following example shows these steps in C#, followed by the VB.NET equivalent:

// Ensure you have added a reference to the IronOCR package and installed the necessary language packs.

// Import required namespaces
using IronOcr;
using System;
using System.IO;

class OCRExample
{
    static void Main()
    {
        // Create the OCR engine. A license key, if you have one, should be
        // set before this point to unlock IronOCR's full capabilities.
        var Ocr = new IronTesseract();

        // Set the primary language to English
        Ocr.Language = OcrLanguage.English;

        // Add a secondary language (Russian)
        Ocr.AddSecondaryLanguage(OcrLanguage.Russian);

        // Load a PDF file and perform OCR
        var pdfInput = new OcrInput();
        pdfInput.AddPdf("example.PDF");

        var result = Ocr.Read(pdfInput);

        // Ensure accurate display of multilingual characters in the console
        Console.OutputEncoding = System.Text.Encoding.UTF8;

        // Print the extracted text from the PDF
        Console.WriteLine("Extracted Text from PDF:");
        Console.WriteLine(result.Text);

        // Adjust primary language to Russian and add Japanese as a secondary language
        Ocr.Language = OcrLanguage.Russian;
        Ocr.AddSecondaryLanguage(OcrLanguage.Japanese);

        // Load an image file and perform OCR
        var imageInput = new OcrInput();
        imageInput.AddImage("example.png");

        var imageResult = Ocr.Read(imageInput);

        // Print the extracted text from the image
        Console.WriteLine("Extracted Text from Image:");
        Console.WriteLine(imageResult.Text);
    }
}
' Ensure you have added a reference to the IronOCR package and installed the necessary language packs.

' Import required namespaces
Imports IronOcr
Imports System
Imports System.IO

Friend Class OCRExample
	Shared Sub Main()
		' Create the OCR engine. A license key, if you have one, should be
		' set before this point to unlock IronOCR's full capabilities.
		Dim Ocr = New IronTesseract()

		' Set the primary language to English
		Ocr.Language = OcrLanguage.English

		' Add a secondary language (Russian)
		Ocr.AddSecondaryLanguage(OcrLanguage.Russian)

		' Load a PDF file and perform OCR
		Dim pdfInput = New OcrInput()
		pdfInput.AddPdf("example.PDF")

		Dim result = Ocr.Read(pdfInput)

		' Ensure accurate display of multilingual characters in the console
		Console.OutputEncoding = System.Text.Encoding.UTF8

		' Print the extracted text from the PDF
		Console.WriteLine("Extracted Text from PDF:")
		Console.WriteLine(result.Text)

		' Adjust primary language to Russian and add Japanese as a secondary language
		Ocr.Language = OcrLanguage.Russian
		Ocr.AddSecondaryLanguage(OcrLanguage.Japanese)

		' Load an image file and perform OCR
		Dim imageInput = New OcrInput()
		imageInput.AddImage("example.png")

		Dim imageResult = Ocr.Read(imageInput)

		' Print the extracted text from the image
		Console.WriteLine("Extracted Text from Image:")
		Console.WriteLine(imageResult.Text)
	End Sub
End Class

Explanation

  • License Configuration: A valid license key is required to use IronOCR's full capabilities and should be set before the IronTesseract object is initialized.
  • Language Settings: The first pass uses English as the primary language with Russian added as a secondary language. The second pass switches the primary language to Russian and adds Japanese as a secondary language.
  • OCR on PDF: The input PDF is loaded using AddPdf, and the OCR process is executed using Read, capturing the text content.
  • Output Encoding: Console output is set to UTF-8 to ensure the proper display of multilingual characters.
  • OCR on Image: Similar to the PDF, an image is loaded, OCR is performed, and the extracted text is printed to the console.
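
The license step described above does not appear in the sample code. Below is a minimal sketch of where it could go, assuming the static IronOcr.License.LicenseKey property is the way your IronOCR version accepts a key. The key string YOUR-LICENSE-KEY is a hypothetical placeholder that must be replaced with a real key, and the class name LicenseSetupSketch is used only for illustration.

```csharp
using IronOcr;

class LicenseSetupSketch
{
    static void Main()
    {
        // Assumption: the key is applied through the static License.LicenseKey
        // property. "YOUR-LICENSE-KEY" is a placeholder, not a working key.
        IronOcr.License.LicenseKey = "YOUR-LICENSE-KEY";

        // Create the engine only after the key has been set.
        var ocr = new IronTesseract();

        // Same language setup as the first pass of the main example.
        ocr.Language = OcrLanguage.English;
        ocr.AddSecondaryLanguage(OcrLanguage.Russian);
    }
}
```

Setting the key once at application startup, before any IronTesseract object is created, keeps the licensing concern out of the OCR code itself.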

By following these steps, you can extract and recognize text in English, Russian, and Japanese from PDFs and images using Tesseract and IronOCR. For more tutorials and to start using IronOCR, subscribe to Iron Software and consider signing up for a trial.


Kannapat Udonpant
Software Engineer
Before becoming a Software Engineer, Kannapat completed an Environmental Resources PhD at Hokkaido University in Japan. While pursuing his degree, Kannapat also became a member of the Vehicle Robotics Laboratory, which is part of the Department of Bioproduction Engineering. In 2022, he leveraged his C# skills to join Iron Software's engineering team, where he focuses on IronPDF. Kannapat values his job because he learns directly from the developer who writes most of the code used in IronPDF. In addition to peer learning, Kannapat enjoys the social aspect of working at Iron Software. When he's not writing code or documentation, Kannapat can usually be found gaming on his PS5 or rewatching The Last of Us.