How to Train Custom Fonts for Tesseract 5 in C#

C# Custom Font Training in Tesseract 5 for Windows Developers

Custom font training in Tesseract 5 improves OCR accuracy for specific fonts. The process creates training data that teaches the engine font characteristics. The resulting .traineddata file works with IronOCR to recognize decorative or special fonts accurately.

Quickstart: Use Your .traineddata Font File in C#

Use your custom-trained Tesseract font file in IronOCR with just a few lines of code to improve OCR accuracy on special or decorative fonts.

Get started with IronOCR from NuGet:

  1. Install IronOCR with NuGet Package Manager

    PM > Install-Package IronOcr

  2. Copy and run this code snippet.

    var ocr = new IronOcr.IronTesseract();
    ocr.UseCustomTesseractLanguageFile("path/to/YourCustomFont.traineddata");
    string text = ocr.Read(new IronOcr.OcrInput("image-with-special-font.png")).Text;
  3. Deploy to test on your live environment


How Do I Download the Latest Version of IronOCR?

Which Installation Method Should I Use?

Download the IronOcr DLL directly to your machine.

Why Use NuGet Instead?

Alternatively, install through NuGet with this command:

Install-Package IronOcr

IronOCR provides comprehensive support for Tesseract 5 features and custom language implementations, making it ideal for specialized OCR requirements.


How Do I Install and Set Up WSL2 and Ubuntu?

Refer to the tutorial on Setting up WSL2 and Ubuntu.

Please note: Custom font training requires Linux, but the resulting .traineddata files work on all platforms. For detailed Linux setup instructions, see our Linux deployment guide.

How Do I Install Tesseract 5 on Ubuntu?

Use these commands to install Tesseract 5:

sudo apt install tesseract-ocr
sudo apt install libtesseract-dev

These packages provide the core Tesseract OCR engine and development libraries needed for training. For advanced Tesseract configuration options, refer to our detailed configuration guide.

Which Font Should I Download for Training?

This tutorial uses the AMGDT font. The font file can be either .ttf or .otf.

Windows File Explorer showing downloaded AMGDT Regular.ttf font file highlighted in red box for training

When selecting fonts for training:

  • Choose fonts that differ significantly from standard Tesseract models
  • Ensure proper licensing for the font
  • Consider decorative, handwritten, or specialized industry fonts
  • Test with fonts your application encounters in production

How Do I Mount the Disk Drive for Custom Font Training?

WSL2 makes Windows drives available under /mnt. Use these commands to switch to drive D: as your working space:

cd /
cd /mnt/d

This allows you to work with files stored on Windows drives directly from the Ubuntu WSL2 environment.

How Do I Copy the Font File to Ubuntu Font Folder?

Copy the font file to these Ubuntu directories: /usr/share/fonts and /usr/local/share/fonts.

Access files in Ubuntu by typing \\wsl$ in the file explorer's address bar.

Windows File Explorer showing \\wsl$ network path for accessing Ubuntu filesystem from Windows

What If I Get Destination Folder Access Denied?

If you encounter access denied errors, use the command line to copy files:

# Windows drives are mounted under /mnt in WSL2
cd /mnt/c/Users/Admin/Downloads/'AMGDT Regular'
# sudo supplies the root permissions that the system font folders require
sudo cp 'AMGDT Regular.ttf' /usr/share/fonts
sudo cp 'AMGDT Regular.ttf' /usr/local/share/fonts

Font installation is crucial for the training process. The system needs access to render the font when generating training images.
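
Before generating training images, it can help to confirm that the font file actually landed in the two font folders. The following is a minimal Python sketch and is not part of the tutorial repository. The helper name find_font_files is hypothetical, and the check only looks at file names; it does not ask fontconfig whether the font can be rendered.

```python
import os

# Font directories named in this tutorial.
FONT_DIRS = ["/usr/share/fonts", "/usr/local/share/fonts"]


def find_font_files(name_fragment, font_dirs=FONT_DIRS):
    """Return sorted paths of .ttf/.otf files whose name contains name_fragment.

    Hypothetical helper: it inspects file names only and does not verify
    that fontconfig can load the font.
    """
    matches = []
    fragment = name_fragment.lower()
    for font_dir in font_dirs:
        # os.walk yields nothing for a missing directory, so absent folders are skipped.
        for root, _dirs, files in os.walk(font_dir):
            for file_name in files:
                lower = file_name.lower()
                if lower.endswith((".ttf", ".otf")) and fragment in lower:
                    matches.append(os.path.join(root, file_name))
    return sorted(matches)


if __name__ == "__main__":
    found = find_font_files("AMGDT")
    if found:
        print("Found font files:")
        for path in found:
            print("  " + path)
    else:
        print("No matching font file found in", FONT_DIRS)
```

If the list comes back empty, repeat the copy step above before moving on.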

How Do I Clone tesseract_tutorial from GitHub?

Clone the tesseract_tutorial repository using this command:

git clone https://github.com/astutejoe/tesseract_tutorial.git

This repository contains essential Python scripts and configuration files for the training process. The scripts automate many manual steps in font training.

How Do I Clone tesstrain and tesseract from GitHub?

Navigate to the tesseract_tutorial directory, then clone the tesstrain and tesseract repositories:

git clone https://github.com/tesseract-ocr/tesstrain
git clone https://github.com/tesseract-ocr/tesseract

The two repositories serve different purposes:
  • tesstrain contains the Makefile used to create the .traineddata file
  • tesseract contains the tessdata folder with original .traineddata files used as references during custom font training

For more information on working with multiple language packs and custom training data, see our international languages guide.

How Do I Create a "data" Folder for Storing Outputs?

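Create a data folder within tesseract_tutorial/tesstrain. The command below assumes you run it from the directory that contains tesseract_tutorial; if you are still inside tesseract_tutorial from the previous step, drop the leading tesseract_tutorial/ from the path: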

mkdir tesseract_tutorial/tesstrain/data

This folder stores all generated training files including .box, .tif, and intermediate training artifacts.

How Do I Run split_training_text.py?

Return to the tesseract_tutorial folder and execute this command:

python split_training_text.py

Running split_training_text.py creates .box and .tif files in the data folder.
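
A quick way to confirm the script produced what training expects is to count the generated files and flag any .box file without a matching .tif (or the reverse). This is a minimal sketch, not part of the tutorial repository. The function name summarize_training_files and the tesstrain/data path used in the example run are assumptions; point it at wherever your output actually lands.

```python
import os
from collections import defaultdict

PAIR = {".box", ".tif"}


def summarize_training_files(data_dir):
    """Return (complete, incomplete) lists of base paths under data_dir.

    A base path is "complete" when both a .box and a .tif file exist for it.
    Hypothetical helper for illustration only.
    """
    extensions_by_stem = defaultdict(set)
    for root, _dirs, files in os.walk(data_dir):
        for file_name in files:
            stem, ext = os.path.splitext(file_name)
            if ext in PAIR:
                extensions_by_stem[os.path.join(root, stem)].add(ext)

    complete = sorted(s for s, exts in extensions_by_stem.items() if exts == PAIR)
    incomplete = sorted(s for s, exts in extensions_by_stem.items() if exts != PAIR)
    return complete, incomplete


if __name__ == "__main__":
    # Assumed location, relative to the tesseract_tutorial folder.
    complete, incomplete = summarize_training_files("tesstrain/data")
    print("Complete .box/.tif pairs:", len(complete))
    for stem in incomplete:
        print("Missing counterpart for:", stem)
```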

How Do I Fix Fontconfig Warning?

Terminal showing fontconfig warnings about missing Apex font and empty font directory errors

If you see the warning Fontconfig warning: "/tmp/fonts.conf, line 4: empty font directory name ignored", it indicates missing font directories. Fix this by editing the tesseract_tutorial/fonts.conf file and adding:

<dir>/usr/share/fonts</dir>
<dir>/usr/local/share/fonts</dir>
<dir prefix="xdg">fonts</dir>
<!-- the following element will be removed in the future -->
<dir>~/.fonts</dir>

Copy it to /etc/fonts with:

sudo cp fonts.conf /etc/fonts

Additionally, update split_training_text.py:

fontconf_dir = '/etc/fonts'

How Many Training Files Should I Generate?

The current configuration generates 100 training files. You can modify this in split_training_text.py.

Python code setting count=100 and slicing lines array to limit training data size

For production-quality training:

  • Start with 100-500 samples for testing
  • Use 1000-5000 samples for better accuracy
  • Include diverse text samples covering all required characters
  • Balance training time with accuracy requirements
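
The sample count is controlled by a count value and a slice of the list of training lines. The sketch below illustrates that idea in isolation; it is not the code from split_training_text.py. The function name select_training_lines, the shuffling step, and the seed parameter are assumptions made for this example.

```python
import random


def select_training_lines(lines, count, seed=None):
    """Return up to `count` non-empty lines in shuffled order.

    Hypothetical helper: it drops blank lines, shuffles so the sample is not
    biased toward the start of the file, then slices to the requested size.
    """
    cleaned = [line.strip() for line in lines if line.strip()]
    rng = random.Random(seed)  # a fixed seed makes the selection repeatable
    rng.shuffle(cleaned)
    return cleaned[:count]


if __name__ == "__main__":
    sample_text = ["First line", "", "Second line", "Third line", "Fourth line"]
    count = 3  # raise this value to generate more training files
    for line in select_training_lines(sample_text, count, seed=0):
        print(line)
```

Raising count increases the number of .box and .tif pairs generated, and with it the time each training run takes.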

Where Do I Download eng.traineddata?

Download eng.traineddata from this repository and place it in tesseract_tutorial/tesseract/tessdata.

The base model provides linguistic context that improves recognition accuracy. Choose a base model that matches your target language. For troubleshooting custom language pack issues, consult our custom OCR language packs guide.

How Do I Create My Custom Font .traineddata?

Navigate to the tesstrain folder and use this command in WSL2:

TESSDATA_PREFIX=../tesseract/tessdata make training MODEL_NAME=AMGDT START_MODEL=eng TESSDATA=../tesseract/tessdata MAX_ITERATIONS=100

The command uses these parameters:
  • MODEL_NAME is your custom font name
  • START_MODEL is the original .traineddata reference
  • MAX_ITERATIONS defines the number of training iterations (more iterations generally improve accuracy but lengthen training; the accuracy figures later in this tutorial use 3000)
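
When experimenting with different iteration counts, it can be convenient to assemble the command from its parameters instead of editing it by hand. The following is a minimal sketch using only the variables shown in the command above. The function names build_training_command and format_command are hypothetical, and the sketch only prints the command; it does not run make.

```python
import shlex


def build_training_command(model_name, start_model, tessdata_dir, max_iterations):
    """Return (env_overrides, argv) for the `make training` call in this tutorial."""
    if max_iterations <= 0:
        raise ValueError("max_iterations must be positive")
    env_overrides = {"TESSDATA_PREFIX": tessdata_dir}
    argv = [
        "make",
        "training",
        f"MODEL_NAME={model_name}",
        f"START_MODEL={start_model}",
        f"TESSDATA={tessdata_dir}",
        f"MAX_ITERATIONS={max_iterations}",
    ]
    return env_overrides, argv


def format_command(env_overrides, argv):
    """Render the command as a single shell line for copy and paste."""
    env_part = " ".join(f"{key}={shlex.quote(value)}" for key, value in env_overrides.items())
    arg_part = " ".join(shlex.quote(arg) for arg in argv)
    return f"{env_part} {arg_part}".strip()


if __name__ == "__main__":
    env, argv = build_training_command("AMGDT", "eng", "../tesseract/tessdata", 100)
    print(format_command(env, argv))
```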

What If I Get "Failed to Read Data" in Makefile?

To resolve "Failed to read data" issues, modify the Makefile:

WORDLIST_FILE := $(OUTPUT_DIR2)/$(MODEL_NAME).lstm-word-dawg
NUMBERS_FILE := $(OUTPUT_DIR2)/$(MODEL_NAME).lstm-number-dawg
PUNC_FILE := $(OUTPUT_DIR2)/$(MODEL_NAME).lstm-punc-dawg

This modification ensures the Makefile looks for files in the correct output directory structure.

How Do I Fix "Failed to Load Script Unicharset"?

Insert Latin.unicharset into the tesstrain/data/langdata folder. Find Latin.unicharset here.

The unicharset file defines the character set for your language or script. Ensure it matches your font's character coverage.
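
Character coverage can also be checked from the training-text side: if a character never appears in the training text, the model has no examples to learn it from. The sketch below is an illustration under that assumption and does not read the unicharset file itself. The function name missing_characters and the required character set are hypothetical; substitute the characters your application needs.

```python
import string


def missing_characters(training_text, required_characters):
    """Return the required characters that never occur in training_text, sorted."""
    present = set(training_text)
    return sorted(set(required_characters) - present)


if __name__ == "__main__":
    # Assumed requirement for this example: ASCII letters and digits.
    required = string.ascii_letters + string.digits
    training_text = "The quick brown fox jumps over the lazy dog 0123456789"
    gaps = missing_characters(training_text, required)
    if gaps:
        print("Characters with no training examples:", "".join(gaps))
    else:
        print("All required characters appear in the training text.")
```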

How Do I Verify the Accuracy of Created .traineddata?

With 1000 .box and .tif files and 3000 training iterations, the output .traineddata (AMGDT.traineddata) achieves a minimal training error rate (BCER) of around 5.77%.

Tesseract training log showing BCER improvement from 6.388% to 5.771% over iterations 2194-2298
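
To follow how the error rate develops over a run, the BCER values can be pulled out of a saved training log. This is a minimal sketch that assumes each progress line contains the text "At iteration N/" followed later by "BCER train=X%"; check that against your own log before relying on it. The function name extract_bcer is hypothetical, and the sample lines are shortened, made-up stand-ins that reuse the two values from the screenshot described above.

```python
import re

# Assumed shape of a progress line; adjust if your log differs.
BCER_PATTERN = re.compile(r"At iteration (\d+)/.*?BCER train=(\d+(?:\.\d+)?)%")


def extract_bcer(log_lines):
    """Return a list of (iteration, bcer_percent) tuples found in log_lines."""
    points = []
    for line in log_lines:
        match = BCER_PATTERN.search(line)
        if match:
            points.append((int(match.group(1)), float(match.group(2))))
    return points


if __name__ == "__main__":
    # Hypothetical, abbreviated log lines for illustration only.
    sample_log = [
        "At iteration 2194/2200/2200, BCER train=6.388%",
        "some unrelated line",
        "At iteration 2298/2300/2300, BCER train=5.771%",
    ]
    for iteration, bcer in extract_bcer(sample_log):
        print(f"iteration {iteration}: BCER {bcer}%")
```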

To test your trained model with IronOCR:

using IronOcr;

// Initialize IronOCR with custom trained data
var ocr = new IronTesseract();

// Load your custom trained font
ocr.UseCustomTesseractLanguageFile(@"path\to\AMGDT.traineddata");

// Configure for optimal results
ocr.Configuration.BlackListCharacters = "";
ocr.Configuration.WhiteListCharacters = "";
ocr.Configuration.PageSegmentationMode = TesseractPageSegmentationMode.AutoOsd;

// Process an image with your custom font
using var input = new OcrInput();
input.LoadImage("test-image-with-amgdt-font.png");

// Optional: Apply filters if needed
input.EnhanceResolution(300);
input.DeNoise();

// Perform OCR
var result = ocr.Read(input);
Console.WriteLine($"Recognized Text: {result.Text}");
Console.WriteLine($"Confidence: {result.Confidence}%");

For implementing custom fonts in production applications, explore our guide on using custom language files.
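
The BCER figure is reported on the training data, so an independent check on images the model has not seen is worthwhile. One common approach is a character error rate based on edit distance between the known text and the OCR output. The sketch below is a generic illustration and is not how Tesseract computes BCER; the function names levenshtein and character_error_rate are hypothetical, and the sample strings are made up.

```python
def levenshtein(a, b):
    """Edit distance between a and b (insert, delete, substitute each cost 1)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(
                min(
                    previous[j] + 1,         # delete char_a
                    current[j - 1] + 1,      # insert char_b
                    previous[j - 1] + cost,  # substitute or match
                )
            )
        previous = current
    return previous[-1]


def character_error_rate(expected, recognized):
    """Edit distance divided by the length of the expected text."""
    if not expected:
        raise ValueError("expected text must not be empty")
    return levenshtein(expected, recognized) / len(expected)


if __name__ == "__main__":
    expected = "DIAMETER 25 MM"    # hypothetical ground truth
    recognized = "DIAMETER 2S MM"  # hypothetical OCR output
    rate = character_error_rate(expected, recognized)
    print(f"Character error rate: {rate:.2%}")
```

Paste the text returned by the C# snippet above into recognized, and compare the rate before and after loading the custom .traineddata file.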

For further reading and reference, see the tutorial: YouTube Video

Frequently Asked Questions

How do I use a custom trained font file in C#?

You can use your custom-trained Tesseract font file in IronOCR with just a few lines of code. Simply create an IronTesseract instance, call UseCustomTesseractLanguageFile() with the path to your .traineddata file, and then use the Read() method to perform OCR on images containing your special font.

What are the requirements for training custom fonts for OCR?

Custom font training requires a Linux environment (WSL2 with Ubuntu is recommended for Windows users), Tesseract 5 installed with development libraries, and the font file you want to train (either .ttf or .otf format). The resulting .traineddata files created in Linux work seamlessly with IronOCR across all platforms.

Why should I train custom fonts instead of using standard OCR?

Training custom fonts improves OCR accuracy for specific fonts, especially decorative or special fonts that differ significantly from standard Tesseract models. IronOCR can then use these trained font files to accurately recognize text in images containing these unique fonts that would otherwise be difficult to read with standard OCR models.

Can I use custom trained fonts across different platforms?

Yes, while the training process requires Linux, the resulting .traineddata files work seamlessly across all platforms with IronOCR. This means you can train once on Linux and use the trained data file on Windows, macOS, or Linux deployments.

What installation method is recommended for getting started?

For quick setup, you can download the IronOCR DLL directly or install through NuGet Package Manager. NuGet is recommended as it handles dependencies automatically and makes updates easier. IronOCR provides comprehensive support for Tesseract 5 features and custom language implementations.

Kannapat Udonpant
Software Engineer
Reviewed by
Jeffrey T. Fritz
Principal Program Manager - .NET Community Team