How to Train a Custom Font with Tesseract 5 in C#

The default Tesseract English model misreads plenty of real-world inputs: hospital handwritten intake forms, vintage book digitizations, a game studio's bespoke decorative typeface, or industry-specific symbols a generic OCR engine has never seen. The fix is to train Tesseract on the exact font yourself, producing a single .traineddata artifact you can ship anywhere IronOCR runs.

This guide walks through Tesseract 5 custom font training end to end in C#: install the WSL2 Ubuntu toolchain, render .box and .tif training files from your .ttf or .otf, build the .traineddata model with tesstrain against a base eng.traineddata, then load the result in IronOCR. Once trained, the file is portable across Windows, macOS, Linux, and Docker.

Quickstart: Use Your Trained Font File in C#

Configure IronOCR by pointing UseCustomTesseractLanguageFile at your trained .traineddata file, then call Read on any image as you would with a stock language pack.

  1. Install IronOCR with NuGet Package Manager

    PM > Install-Package IronOcr
  2. Copy and run this code snippet.

    using IronOcr;
    
    var ocr = new IronTesseract();
    ocr.UseCustomTesseractLanguageFile("path/to/YourCustomFont.traineddata");
    string text = ocr.Read(new OcrInput("image-with-special-font.png")).Text;
  3. Deploy to test on your live environment


How Do I Set Up the Training Environment?

How Do I Install IronOCR?

Install IronOCR via NuGet:

Install-Package IronOcr

The DLL package is a manual alternative if you cannot use NuGet. For the underlying engine, see the Tesseract 5 features guide and the custom language reference.

How Do I Install and Set Up WSL2 and Ubuntu?

Refer to the tutorial on Setting up WSL2 and Ubuntu.

Please note: Custom font training requires Linux.

WSL2 is enough: once training is done, the resulting .traineddata file ships with your IronOCR app on Windows, macOS, Linux, or Docker. For deployment details, see the Linux deployment guide.

How Do I Install Tesseract 5 on Ubuntu?

Use these commands to install Tesseract 5:

sudo apt install tesseract-ocr
sudo apt install libtesseract-dev

The tesseract-ocr package is the engine that runs recognition; libtesseract-dev exposes the headers that tesstrain needs to build a model. Once your trained file is in use, the Tesseract configuration guide covers runtime tuning.

How Do I Prepare the Font for Training?

Which Font Should I Download?

This tutorial uses the AMGDT font, in either .ttf or .otf format.

Windows File Explorer showing downloaded AMGDT Regular.ttf font file highlighted in red box for training

When picking a font to train:

  • Pick fonts the default English model already misreads. Training a font that's already recognized wastes time.
  • Confirm the font's license permits redistribution if your .traineddata will ship with an application.
  • Decorative, handwritten, and industry-specific fonts (medical, legal, cartographic) gain the most accuracy from training.
  • Match training samples to what production will actually see, including resolution and lighting.

How Do I Mount the Disk Drive?

Mount Drive D: as your working space:

cd /
cd /mnt/d

WSL2 mounts every Windows drive under /mnt/<letter>, so you can edit files on Windows and run training commands against them in the same session.

How Do I Copy the Font File to the Ubuntu Font Folders?

Tesseract renders sample text in your font to build training images, so the font has to be installed on the Linux side, not just on Windows. Copy the font file to both Ubuntu font directories: /usr/share/fonts and /usr/local/share/fonts. The simplest way is to type \\wsl$ in File Explorer's address bar to browse the Ubuntu filesystem from Windows, then drag the .ttf file across.

Windows File Explorer showing \\wsl$ network path for accessing Ubuntu filesystem from Windows

Here's how the font copy should look once it lands in the Ubuntu fonts directory:

AMGDT font file being copied into the Ubuntu fonts folder and recognized by the system
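
Before moving on, it can help to confirm that fontconfig actually sees the new font. The sketch below is an illustration, not part of the tutorial's scripts: it assumes the standard fc-list utility from fontconfig is on the PATH, and the helper names font_is_listed and installed_fonts are hypothetical.

```python
import subprocess


def font_is_listed(fc_list_output: str, font_name: str) -> bool:
    """Return True if any line of fc-list output mentions font_name (case-insensitive)."""
    needle = font_name.lower()
    return any(needle in line.lower() for line in fc_list_output.splitlines())


def installed_fonts() -> str:
    """Run fc-list (assumed to be installed) and return its raw text output."""
    completed = subprocess.run(
        ["fc-list"], capture_output=True, text=True, check=True
    )
    return completed.stdout
```

Calling font_is_listed(installed_fonts(), "AMGDT") inside the Ubuntu shell should return True once the copy has succeeded. If it returns False, refreshing the font cache with fc-cache -f and checking again is a reasonable next step.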

What If I Get Destination Folder Access Denied?

If File Explorer rejects the copy, run it from a root shell instead:

cd /
su root
cd /mnt/c/Users/Admin/Downloads/'AMGDT Regular'
cp 'AMGDT Regular.ttf' /usr/share/fonts
cp 'AMGDT Regular.ttf' /usr/local/share/fonts
exit

How Do I Clone the Training Repositories from GitHub?

The training pipeline depends on three repositories. Clone the tutorial wrapper first, then the two upstream Tesseract repos inside it, then create the output folder:

git clone https://github.com/astutejoe/tesseract_tutorial.git
cd tesseract_tutorial
git clone https://github.com/tesseract-ocr/tesstrain
git clone https://github.com/tesseract-ocr/tesseract
mkdir tesstrain/data

  • tesseract_tutorial bundles the Python scripts and config files that drive each training step (text generation, image rendering, training-pair creation).
  • tesstrain contains the Makefile that drives the actual training run.
  • tesseract contains the tessdata folder with stock .traineddata files used as the starting model for custom training.
  • tesstrain/data is where generated .box files (character bounding boxes), .tif images, and intermediate LSTM checkpoints all land.

Here's how the clone sequence should look in the terminal:

Terminal running the four git clone commands and creating the tesstrain data folder
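
A quick way to confirm the layout before training is to check that the folders the later steps refer to exist. This is a minimal sketch under the assumption that it runs against the tesseract_tutorial checkout; the EXPECTED_DIRS list and the missing_dirs helper are hypothetical names taken from the paths used in this guide.

```python
from pathlib import Path
from typing import List

# Folders the later steps refer to, relative to the tesseract_tutorial checkout.
EXPECTED_DIRS = ["tesstrain", "tesseract/tessdata", "tesstrain/data"]


def missing_dirs(root: str) -> List[str]:
    """Return the expected sub-folders that do not exist under root."""
    base = Path(root)
    return [d for d in EXPECTED_DIRS if not (base / d).is_dir()]
```

Running missing_dirs(".") from inside tesseract_tutorial should return an empty list when all three clones and the mkdir step have completed.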

For working with multiple language packs alongside a custom one, see our international languages guide.

How Do I Generate Training Files?

How Do I Run the split_training_text.py Script?

From the tesseract_tutorial folder, run:

python split_training_text.py

The script generates one .box / .tif pair per training sample and writes them to the data folder.

Here's how the script run should look as it generates the training pairs:

Terminal running split_training_text.py and generating .box and .tif files in the data folder
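
For orientation, the core of a script like this is typically a loop that writes each line of training text to its own file and calls Tesseract's text2image tool to render it in the target font. The sketch below is not the tutorial's actual script: the helper name build_text2image_command and the page dimensions are assumptions chosen for illustration, and it only builds the command rather than running it.

```python
from typing import List


def build_text2image_command(font: str, text_file: str, output_base: str) -> List[str]:
    """Build a text2image invocation that renders one text file in `font`.

    text2image writes <output_base>.tif and <output_base>.box.
    The page size values below are illustrative assumptions.
    """
    return [
        "text2image",
        f"--font={font}",
        f"--text={text_file}",
        f"--outputbase={output_base}",
        "--max_pages=1",
        "--xsize=3600",
        "--ysize=480",
    ]
```

Passing the returned list to subprocess.run(..., check=True) once per training line would produce one .box / .tif pair per line, which is the shape of output the data folder should contain after the script finishes.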

How Do I Fix the Fontconfig Warning?

Terminal showing fontconfig warnings about missing Apex font and empty font directory errors

If you see the warning Fontconfig warning: "/tmp/fonts.conf, line 4: empty font directory name ignored", fontconfig cannot resolve the font directories. Fix it by editing tesseract_tutorial/fonts.conf:

<dir>/usr/share/fonts</dir>
<dir>/usr/local/share/fonts</dir>
<dir prefix="xdg">fonts</dir>

<dir>~/.fonts</dir>

Copy it to /etc/fonts:

cp fonts.conf /etc/fonts

Then point split_training_text.py at the same path:

fontconf_dir = '/etc/fonts'

How Many Training Files Should I Generate?

By default the script generates 100 training pairs. Change the count near the top of split_training_text.py:

Python code setting count=100 and slicing lines array to limit training data size

Sizing guidance:

  • 100-500 samples are enough to confirm the pipeline works end-to-end.
  • 1000-5000 samples are the working range for production accuracy.
  • Training text must cover every character your font needs to recognize, ideally several times each.
  • More samples mean more training time; pick the smallest count that hits your accuracy target.
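
The coverage requirement in the list above can be checked mechanically before a long training run. The following is a minimal sketch, assuming the training text is available as a list of lines; the helper name undercovered_chars and the default threshold of three occurrences are illustrative choices, not values from the tutorial's script.

```python
from collections import Counter
from typing import Dict, Iterable


def undercovered_chars(
    lines: Iterable[str], required: str, min_count: int = 3
) -> Dict[str, int]:
    """Return required characters seen fewer than min_count times, with their counts."""
    counts = Counter(ch for line in lines for ch in line)
    return {ch: counts[ch] for ch in required if counts[ch] < min_count}
```

An empty result means every required character appears at least min_count times. Any character that does show up in the result is a candidate for extra lines in the training text.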

Where Do I Download the eng.traineddata File?

Download eng.traineddata from the tessdata_best repository and place it in tesseract_tutorial/tesseract/tessdata.

The base model gives the trainer linguistic context (which character sequences form plausible words), so accuracy is much better than training from scratch. Pick a base model in the same language as your training text. If you hit issues, see the custom OCR language packs troubleshooting guide.

How Do I Build My Custom Font Trained Data File?

From the tesstrain folder, run:

TESSDATA_PREFIX=../tesseract/tessdata make training MODEL_NAME=AMGDT START_MODEL=eng TESSDATA=../tesseract/tessdata MAX_ITERATIONS=100

  • MODEL_NAME is the name of your custom font (used for the output filename).
  • START_MODEL is the base model to start from, here eng for the eng.traineddata you downloaded above.
  • TESSDATA is the folder that holds the base model, here ../tesseract/tessdata.
  • MAX_ITERATIONS caps the training run; higher values typically reduce error rate.

What If I Get "Failed to Read Data" in Makefile?

To resolve "Failed to read data" errors, edit the tesstrain Makefile so that these three variables are defined as follows:

WORDLIST_FILE := $(OUTPUT_DIR2)/$(MODEL_NAME).lstm-word-dawg
NUMBERS_FILE := $(OUTPUT_DIR2)/$(MODEL_NAME).lstm-number-dawg
PUNC_FILE := $(OUTPUT_DIR2)/$(MODEL_NAME).lstm-punc-dawg

The patch points the Makefile at the actual output directory so it can locate the dictionary files.

How Do I Fix "Failed to Load Script Unicharset"?

Download Latin.unicharset from langdata_lstm and place it in the tesstrain/data/langdata folder.

The .unicharset file defines which characters the trainer is allowed to emit. Use the file that covers every character in your font, for example Cyrillic.unicharset for Cyrillic fonts or Devanagari.unicharset for Devanagari.

Here's how a successful training run should look as tesstrain produces the .traineddata file:

tesstrain build pipeline running through training iterations and emitting the AMGDT.traineddata file

How Do I Verify the Accuracy of My Trained Data File?

With 1000 .box and .tif files and 3000 training iterations, the output AMGDT.traineddata reaches a training error rate (BCER) of around 5.77%.

Tesseract training log showing BCER improvement from 6.388% to 5.771% over iterations 2194-2298
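
The BCER in the log is reported by the trainer against its own training data. To check a model on images it has not seen, a common measure is a character error rate based on edit distance between the expected text and the OCR output. The sketch below is a generic illustration of that idea, not Tesseract's internal calculation; both function names are hypothetical.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: insertions, deletions, and substitutions each cost 1."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def char_error_rate(truth: str, recognized: str) -> float:
    """Edit distance divided by ground-truth length, as a percentage."""
    if not truth:
        raise ValueError("ground truth must not be empty")
    return 100.0 * edit_distance(truth, recognized) / len(truth)
```

Averaging char_error_rate over a handful of held-out samples gives a figure that can be compared between training runs with different sample counts or iteration limits.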

To test the trained model with IronOCR, point UseCustomTesseractLanguageFile at the file and read a sample image:

using IronOcr;

// Load the trained model; AutoOsd handles orientation
var ocr = new IronTesseract();
ocr.UseCustomTesseractLanguageFile("path/to/AMGDT.traineddata");
ocr.Configuration.PageSegmentationMode = TesseractPageSegmentationMode.AutoOsd;

// Preprocess so the model sees clean glyphs
using var input = new OcrInput();
input.LoadImage("test-image-with-amgdt-font.png");
input.EnhanceResolution(300);
input.DeNoise();

// Confidence reflects training quality
var result = ocr.Read(input);
Console.WriteLine($"Text: {result.Text}");
Console.WriteLine($"Confidence: {result.Confidence}%");

The same example in VB.NET:

Imports IronOcr

' Load the trained model; AutoOsd handles orientation
Dim ocr As New IronTesseract()
ocr.UseCustomTesseractLanguageFile("path/to/AMGDT.traineddata")
ocr.Configuration.PageSegmentationMode = TesseractPageSegmentationMode.AutoOsd

' Preprocess so the model sees clean glyphs
Using input As New OcrInput()
    input.LoadImage("test-image-with-amgdt-font.png")
    input.EnhanceResolution(300)
    input.DeNoise()

    ' Confidence reflects training quality
    Dim result = ocr.Read(input)
    Console.WriteLine($"Text: {result.Text}")
    Console.WriteLine($"Confidence: {result.Confidence}%")
End Using

The Confidence property is the per-document score; if it stays low even on clean inputs, the most common causes are too few training samples or a base model that doesn't match the script. Once your .traineddata is verified, see our custom language guide for the general workflow of loading any custom language file.

What Are the Key Takeaways for Custom Font Training?

Training a custom font is a one-time setup: generate .box / .tif pairs from your target font, build a .traineddata model with tesstrain, then load it through UseCustomTesseractLanguageFile. From there IronOCR reads images with the new model exactly the same way it reads stock English.

Key advantages of using IronOCR with a custom Tesseract model:

  • Reuses standard Tesseract artifacts: any .traineddata file you can build with tesstrain works in IronOCR without conversion.
  • Cross-platform output: training requires Linux (or WSL2), but the trained file ships with your application on Windows, macOS, Linux, and Docker.
  • Drop-in with the rest of the API: combine custom fonts with multiple secondary languages, image quality correction, and DPI tuning without changing the recognition path.
  • Tunable accuracy: error rate depends on both the number of training samples and the number of iterations. Both knobs are exposed (the script's sample count plus MAX_ITERATIONS), so you can trade training time against BCER without leaving Tesseract.

For larger pipelines, consider progress tracking and async processing when applying your trained model across many documents.

Frequently Asked Questions

How do I use a custom trained font file in C#?

You can use your custom-trained Tesseract font file in IronOCR with just a few lines of code. Simply create an IronTesseract instance, call UseCustomTesseractLanguageFile() with the path to your .traineddata file, and then use the Read() method to perform OCR on images containing your special font.

What are the requirements for training custom fonts for OCR?

Custom font training requires a Linux environment (WSL2 with Ubuntu is recommended for Windows users), Tesseract 5 installed with development libraries, and the font file you want to train (either .ttf or .otf format). The resulting .traineddata files created in Linux work seamlessly with IronOCR across all platforms.

Why should I train custom fonts instead of using standard OCR?

Training custom fonts improves OCR accuracy for specific fonts, especially decorative or special fonts that differ significantly from standard Tesseract models. IronOCR can then use these trained font files to accurately recognize text in images containing these unique fonts that would otherwise be difficult to read with standard OCR models.

Can I use custom trained fonts across different platforms?

Yes, while the training process requires Linux, the resulting .traineddata files work seamlessly across all platforms with IronOCR. This means you can train once on Linux and use the trained data file on Windows, macOS, or Linux deployments.

What installation method is recommended for getting started?

For quick setup, you can download the IronOCR DLL directly or install through NuGet Package Manager. NuGet is recommended as it handles dependencies automatically and makes updates easier. IronOCR provides comprehensive support for Tesseract 5 features and custom language implementations.

Does IronOCR support multiple languages?

IronOCR supports multiple languages, making it a versatile tool for global applications that require text recognition in different languages.

Can IronOCR be integrated into existing applications?

IronOCR is designed to be easily integrated into existing applications using C#, allowing developers to add OCR functionality to their software with minimal effort.

What are the benefits of using IronOCR for document management?

Using IronOCR for document management streamlines the workflow by converting scanned documents into searchable and editable text, reducing the need for manual data entry and improving document accessibility.

How can IronOCR improve data accuracy?

IronOCR improves data accuracy through its advanced recognition algorithms and image correction features, ensuring that the text extraction process is both reliable and precise.

Is there a free trial available for IronOCR?

Yes, Iron Software offers a free trial of IronOCR, allowing users to test its features and capabilities before making a purchase decision.

Kannapat Udonpant
Software Engineer
Before becoming a Software Engineer, Kannapat completed an Environmental Resources PhD at Hokkaido University in Japan. While pursuing his degree, Kannapat also became a member of the Vehicle Robotics Laboratory, which is part of the Department of Bioproduction Engineering. In 2022, he leveraged his C# skills to join Iron Software's engineering ...
Reviewed by
Jeff Fritz
Jeffrey T. Fritz
Principal Program Manager - .NET Community Team
Jeff is also a Principal Program Manager for the .NET and Visual Studio teams. He is the executive producer of the .NET Conf virtual conference series and hosts 'Fritz and Friends' a live stream for developers that airs twice weekly where he talks tech and writes code together with viewers. Jeff writes workshops, presentations, and plans content for the largest Microsoft developer events including Microsoft Build, Microsoft Ignite, .NET Conf, and the Microsoft MVP Summit