How to Use a Custom Language with Tesseract

When it comes to optical character recognition (OCR), you sometimes need to deal with custom languages, specialized scripts, or ciphers. To read an input image containing a custom language, the Tesseract engine must be provided with training data for that specific language. This data is stored in a special .traineddata file.

While the complex process of creating (training) this file is done using Tesseract's own tools, IronOCR fully supports using these custom language files. This lets you apply your trained model to decipher and read text from any input. In this how-to guide, we'll showcase how to load and use a custom .traineddata file with IronOCR.

Custom Language with Tesseract

To use a custom language with Tesseract, we must first load our .traineddata file by calling the UseCustomTesseractLanguageFile method. This is an essential step, as this file contains all the training data that allows Tesseract to recognize the custom language's unique characters.

Afterward, we load our input document just as we would for a regular OCR operation. In this example, we use LoadPdf to load a PDF containing paragraphs written in the custom language.

Finally, we use the Read method to extract the text from the input. The result can then be printed to the console or, as the example shows, saved (piped) to a text file for reference.
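The three steps above assume that both files are present on disk. As a minimal sketch, the checks below use the standard File.Exists call to fail early with a clear message before handing anything to IronOCR. The file names are placeholders, and how IronOCR itself reports a missing file is not covered in this guide, so treat the up-front check as a precaution rather than a requirement.

```csharp
using IronOcr;
using System;
using System.IO;

// Placeholder paths (assumptions); substitute your own files
string languageFile = "AMGDT.traineddata";
string inputPdf = "custom.pdf";

// Stop early if the custom training data is missing
if (!File.Exists(languageFile))
{
    Console.Error.WriteLine($"Training data not found: {languageFile}");
    return;
}

// Stop early if the input document is missing
if (!File.Exists(inputPdf))
{
    Console.Error.WriteLine($"Input PDF not found: {inputPdf}");
    return;
}

var ocrTesseract = new IronTesseract();

// Step 1: load the custom language training data
ocrTesseract.UseCustomTesseractLanguageFile(languageFile);

// Step 2: load the input document
using var ocrInput = new OcrInput();
ocrInput.LoadPdf(inputPdf);

// Step 3: read the text
var ocrResult = ocrTesseract.Read(ocrInput);
Console.WriteLine(ocrResult.Text);
```

Checking the paths first keeps file-system problems separate from recognition problems, which can make a poor OCR result easier to diagnose.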

Input

We'll use this sample PDF, which contains text in our custom language, as the input.

We'll be using this custom language .traineddata file for our example.

Code Example

using IronOcr;
using System;
using System.IO;

var ocrTesseract = new IronTesseract();

// Load the traineddata file for the custom language
ocrTesseract.UseCustomTesseractLanguageFile("AMGDT.traineddata");

using var ocrInput = new OcrInput();
// Load the PDF containing text in the custom language
ocrInput.LoadPdf("custom.pdf");

var ocrResult = ocrTesseract.Read(ocrInput);

// Print text to the console
Console.WriteLine("--- OCR Result ---");
Console.WriteLine(ocrResult.Text);
Console.WriteLine("------------------");

// Pipe text to a .txt file
string outputFilePath = "ocr_output.txt";
File.WriteAllText(outputFilePath, ocrResult.Text);

Console.WriteLine($"\nSuccessfully saved text to {outputFilePath}");

Output

The output shows the result from our custom language model. With the correct training data loaded, IronOCR successfully deciphered the text and returned it in plain English. The code also writes the same text to ocr_output.txt.

Frequently Asked Questions

What is the purpose of using a custom language with Tesseract in IronOCR?

Using a custom language with Tesseract in IronOCR allows you to recognize and extract text from images or PDFs that contain specialized scripts or languages not supported by default. This is achieved by loading a custom `.traineddata` file containing the necessary training data for that language.

How do I load a custom language training data file in IronOCR?

You can load a custom language training data file in IronOCR by using the `UseCustomTesseractLanguageFile` method. This step is crucial as it provides the Tesseract engine with the training data needed to recognize the unique characters of the custom language.

What are the steps to perform OCR on an image with a custom language using IronOCR?

To perform OCR on an image with a custom language using IronOCR, first download the C# library, initialize the OCR engine, load the custom language training data with `UseCustomTesseractLanguageFile`, load the input image with `LoadImage`, and finally extract the text using the `Read` method.
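The image workflow described in this answer can be sketched as follows. It mirrors the PDF example, with `LoadImage` in place of `LoadPdf`. The image file name is a placeholder chosen for illustration, and the `.traineddata` file is assumed to be the same one used earlier in this guide.

```csharp
using IronOcr;
using System;

var ocrTesseract = new IronTesseract();

// Load the custom language training data
ocrTesseract.UseCustomTesseractLanguageFile("AMGDT.traineddata");

using var ocrInput = new OcrInput();

// "custom.png" is a hypothetical file name; use your own image
ocrInput.LoadImage("custom.png");

// Extract the text using the custom language model
var ocrResult = ocrTesseract.Read(ocrInput);
Console.WriteLine(ocrResult.Text);
```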

Can IronOCR handle PDFs containing custom language text?

Yes, IronOCR can handle PDFs containing custom language text. You can load the PDF using the `LoadPdf` method and then use the `Read` method to extract the text based on the custom language training data provided.

What is a `.traineddata` file in the context of Tesseract and IronOCR?

A `.traineddata` file is a data file used by Tesseract OCR that contains the training data for a specific language. It allows the OCR engine to recognize and process characters from that language, and can be utilized in IronOCR to work with custom languages.

Do I need to create my own `.traineddata` file for every custom language in IronOCR?

No, you don't need to create your own `.traineddata` file for every custom language. You can use existing `.traineddata` files if available. However, if a specific language is not supported, you may need to create one using Tesseract's tools.

What output formats are supported by IronOCR when using custom languages?

IronOCR supports various output formats when using custom languages, such as plain text output which can be printed to the console or saved to a text file. The extracted text can be manipulated further as needed.

Curtis Chau
Technical Writer

Curtis Chau holds a Bachelor’s degree in Computer Science (Carleton University) and specializes in front-end development with expertise in Node.js, TypeScript, JavaScript, and React. Passionate about crafting intuitive and aesthetically pleasing user interfaces, Curtis enjoys working with modern frameworks and creating well-structured, visually appealing manuals.
