How to Train a Custom Font with Tesseract 5 in C#
The default Tesseract English model misreads plenty of real-world inputs: handwritten hospital intake forms, digitized vintage books, a game studio's bespoke decorative typeface, or industry-specific symbols a generic OCR engine has never seen. The fix is to train Tesseract on the exact font yourself, producing a single .traineddata artifact you can ship anywhere IronOCR runs.
This guide walks through Tesseract 5 custom font training end to end in C#: install the WSL2 Ubuntu toolchain, render .box and .tif training files from your .ttf or .otf, build the .traineddata model with tesstrain against a base eng.traineddata, then load the result in IronOCR. Once trained, the file is portable across Windows, macOS, Linux, and Docker.
Quickstart: Use Your Trained Font File in C#
Configure IronOCR by pointing UseCustomTesseractLanguageFile at your trained .traineddata file, then call Read on any image as you would with a stock language pack.
- Install IronOCR with NuGet Package Manager
PM > Install-Package IronOcr
- Copy and run this code snippet.
using IronOcr;
var ocr = new IronTesseract();
ocr.UseCustomTesseractLanguageFile("path/to/YourCustomFont.traineddata");
string text = ocr.Read(new OcrInput("image-with-special-font.png")).Text;
- Deploy to test on your live environment
Minimal Workflow (5 steps)
- Download IronOCR via NuGet to read with custom-trained fonts
- Install Tesseract 5 on WSL2 Ubuntu and clone the tesstrain training repositories
- Generate training files for your target font with split_training_text.py
- Build your custom .traineddata file using tesstrain and a base language model
- Load the trained file in IronOCR with UseCustomTesseractLanguageFile and call Read
How Do I Set Up the Training Environment?
How Do I Install IronOCR?
Install IronOCR via NuGet:
Install-Package IronOcr
The DLL package is a manual alternative if you cannot use NuGet. For the underlying engine, see the Tesseract 5 features guide and the custom language reference.
How Do I Install and Set Up WSL2 and Ubuntu?
Refer to the tutorial on Setting up WSL2 and Ubuntu.
WSL2 is enough: once training is done, the resulting .traineddata file ships with your IronOCR app on Windows, macOS, Linux, or Docker. For deployment details, see the Linux deployment guide.
How Do I Install Tesseract 5 on Ubuntu?
Use these commands to install Tesseract 5:
sudo apt install tesseract-ocr
sudo apt install libtesseract-dev
The tesseract-ocr package is the engine that runs recognition; libtesseract-dev exposes the headers that tesstrain needs to build a model. Once your trained file is in use, the Tesseract configuration guide covers runtime tuning.
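Depending on the Ubuntu release, the packaged build may not be a 5.x version, so it can be worth confirming what was installed before training. The Python sketch below parses the banner printed by `tesseract --version`. The helper names are illustrative, and the parsing assumes the usual `tesseract X.Y.Z` first line.

```python
import re
import subprocess


def parse_major_version(version_output: str) -> int:
    """Extract the major version from the banner printed by `tesseract --version`."""
    lines = version_output.strip().splitlines()
    if not lines:
        raise ValueError("Empty version output")
    # Assumption: the first line looks like "tesseract 5.3.0" (some builds add a "v").
    match = re.search(r"tesseract\s+v?(\d+)\.", lines[0])
    if match is None:
        raise ValueError(f"Unrecognized version banner: {lines[0]!r}")
    return int(match.group(1))


def installed_major_version() -> int:
    """Run the tesseract CLI and return its major version number."""
    completed = subprocess.run(
        ["tesseract", "--version"],
        capture_output=True, text=True, check=True,
    )
    # Some builds write the banner to stderr, so fall back to it if stdout is empty.
    return parse_major_version(completed.stdout or completed.stderr)
```

Calling `installed_major_version()` inside the Ubuntu shell should return 5 for a Tesseract 5 install. Any other value suggests the package source needs another look before continuing.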
How Do I Prepare the Font for Training?
Which Font Should I Download?
This tutorial uses the AMGDT font, in either .ttf or .otf format.

When picking a font to train:
- Pick fonts the default English model already misreads. Training a font that's already recognized wastes time.
- Confirm the font's license permits redistribution if your .traineddata will ship with an application.
- Decorative, handwritten, and industry-specific fonts (medical, legal, cartographic) gain the most accuracy from training.
- Match training samples to what production will actually see, including resolution and lighting.
How Do I Mount the Disk Drive?
Mount Drive D: as your working space:
cd /
cd /mnt/d
WSL2 mounts every Windows drive under /mnt/<letter>, so you can edit files on Windows and run training commands against them in the same session.
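As a small illustration of the /mnt/<letter> convention, the following Python sketch converts a Windows drive-letter path into the path WSL2 would expose. The function name is hypothetical, and the sketch only handles plain drive-letter paths, not network shares.

```python
from pathlib import PureWindowsPath


def windows_to_wsl_path(windows_path: str) -> str:
    """Convert a path such as D:\\fonts\\x.ttf to /mnt/d/fonts/x.ttf."""
    path = PureWindowsPath(windows_path)
    drive = path.drive  # e.g. "D:"
    if len(drive) != 2 or drive[1] != ":":
        raise ValueError(f"Expected a drive-letter path, got {windows_path!r}")
    # parts[0] is the anchor ("D:\\"); the rest are the folder and file names.
    return "/".join(["/mnt", drive[0].lower(), *path.parts[1:]])
```

Inside a WSL shell, the built-in `wslpath` utility performs the same kind of conversion, so the sketch is mainly useful when paths are prepared from a script.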
How Do I Copy the Font File to Ubuntu Font Folder?
Tesseract renders sample text in your font to build training images, so the font has to be installed on the Linux side, not just on Windows. Copy the font file to both Ubuntu font directories: /usr/share/fonts and /usr/local/share/fonts. The simplest way is to type \\wsl$ in File Explorer's address bar to browse the Ubuntu filesystem from Windows, then drag the .ttf or .otf file across.

What If I Get Destination Folder Access Denied?
If File Explorer rejects the copy, run it from a root shell instead:
cd /
su root
cd /mnt/c/Users/Admin/Downloads/'AMGDT Regular'
cp 'AMGDT Regular.ttf' /usr/share/fonts
cp 'AMGDT Regular.ttf' /usr/local/share/fonts
exit
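Whichever way the font was copied, it helps to confirm that fontconfig can actually see it before generating training images. The sketch below parses the text printed by `fc-list`. It assumes the default `<file>: <family>:style=<style>` line format, and the helper names are illustrative.

```python
def font_families(fc_list_output: str) -> set:
    """Collect family names from `fc-list` output lines."""
    families = set()
    for line in fc_list_output.splitlines():
        # Assumed line format: "<file path>: <family>[,<alias>...]:style=<style>"
        fields = line.split(":")
        if len(fields) < 2:
            continue
        for name in fields[1].split(","):
            families.add(name.strip())
    return families


def is_font_installed(fc_list_output: str, family: str) -> bool:
    """Case-insensitive check for a family name in `fc-list` output."""
    return family.lower() in {name.lower() for name in font_families(fc_list_output)}
```

Feed it the captured output of `fc-list`. The family name reported for your font may differ from the file name, and if the font is missing, refreshing the cache with `fc-cache -f` is a reasonable first step.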
How Do I Clone the Training Repositories from GitHub?
The training pipeline depends on three repositories. Clone the tutorial wrapper first, then the two upstream Tesseract repos inside it, then create the output folder:
git clone https://github.com/astutejoe/tesseract_tutorial.git
cd tesseract_tutorial
git clone https://github.com/tesseract-ocr/tesstrain
git clone https://github.com/tesseract-ocr/tesseract
mkdir tesstrain/data
- tesseract_tutorial bundles the Python scripts and config files that drive each training step (text generation, image rendering, training-pair creation).
- tesstrain contains the Makefile that drives the actual training run.
- tesseract contains the tessdata folder with stock .traineddata files used as the starting model for custom training.
- tesstrain/data is where generated .box files (character bounding boxes), .tif images, and intermediate LSTM checkpoints all land.
For working with multiple language packs alongside a custom one, see our international languages guide.
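Before generating training files, the folder layout described above can be sanity-checked with a few lines of Python. This is a minimal sketch: the list of expected paths is inferred from the clone commands and the later steps in this guide, and the function name is hypothetical.

```python
from pathlib import Path

# Relative paths inferred from the clone commands and later steps in this guide.
EXPECTED_PATHS = [
    "split_training_text.py",
    "tesstrain/Makefile",
    "tesstrain/data",
    "tesseract/tessdata",
]


def missing_paths(tutorial_root: str) -> list:
    """Return the expected paths that do not exist under the tutorial folder."""
    root = Path(tutorial_root)
    return [rel for rel in EXPECTED_PATHS if not (root / rel).exists()]
```

An empty list from `missing_paths("tesseract_tutorial")` indicates that the three repositories and the data folder are where the later commands expect them.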
How Do I Generate Training Files?
How Do I Run the split_training_text.py Script?
From the tesseract_tutorial folder, run:
python split_training_text.py
The script generates one .box / .tif pair per training sample and writes them to the data folder.
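Because training relies on every image having its matching box file, a quick consistency check after the script finishes can catch a partial run. The sketch below looks for samples that have only one of the two files. It searches recursively on the assumption that the pairs may sit in a subfolder of the data directory, and the function name is illustrative.

```python
from pathlib import Path


def unpaired_samples(data_dir: str) -> list:
    """List sample base names that have a .tif without a .box, or vice versa."""
    folder = Path(data_dir)
    tif_names = {p.stem for p in folder.rglob("*.tif")}
    box_names = {p.stem for p in folder.rglob("*.box")}
    # Symmetric difference: names present in exactly one of the two sets.
    return sorted(tif_names ^ box_names)
```

Running `unpaired_samples("tesstrain/data")` should return an empty list when every sample was rendered completely.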
How Do I Fix Fontconfig Warning?

If you see the warning Fontconfig warning: "/tmp/fonts.conf, line 4: empty font directory name ignored", fontconfig cannot resolve the font directories. Fix it by editing tesseract_tutorial/fonts.conf:
<dir>/usr/share/fonts</dir>
<dir>/usr/local/share/fonts</dir>
<dir prefix="xdg">fonts</dir>
<dir>~/.fonts</dir>
Copy it to /etc/fonts:
cp fonts.conf /etc/fonts
Then point split_training_text.py at the same path:
fontconf_dir = '/etc/fonts'
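The warning text suggests that a `<dir>` entry in the generated configuration resolved to an empty string. As a rough check of an edited fonts.conf, the following Python sketch lists the `<dir>` entries and flags any empty ones. It uses only the standard-library XML parser, and the helper names are hypothetical.

```python
import xml.etree.ElementTree as ET


def font_directories(fonts_conf_xml: str) -> list:
    """Return the text of every <dir> element in a fontconfig file."""
    root = ET.fromstring(fonts_conf_xml)
    return [(element.text or "").strip() for element in root.iter("dir")]


def has_empty_directory(fonts_conf_xml: str) -> bool:
    """True when at least one <dir> element carries no path."""
    return any(path == "" for path in font_directories(fonts_conf_xml))
```

Passing the contents of the edited file to `has_empty_directory` should return False once every `<dir>` entry names a real folder.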
How Many Training Files Should I Generate?
By default the script generates 100 training pairs. To change that, edit the sample count set near the top of split_training_text.py.

Sizing guidance:
- 100-500 samples are enough to confirm the pipeline works end-to-end.
- 1000-5000 samples are the working range for production accuracy.
- Training text must cover every character your font needs to recognize, ideally several times each.
- More samples mean more training time; pick the smallest count that hits your accuracy target.
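The coverage point above can be checked mechanically before a long training run. The sketch below counts how often each required character appears in the training lines and reports those that fall short. The threshold of three occurrences is an arbitrary assumption for illustration, not a Tesseract requirement.

```python
from collections import Counter


def under_covered_characters(training_lines, required_characters, minimum_count=3):
    """Map each required character seen fewer than minimum_count times to its count."""
    counts = Counter("".join(training_lines))
    return {
        char: counts[char]
        for char in required_characters
        if counts[char] < minimum_count
    }
```

For example, `under_covered_characters(lines, "ABC0123")` returns a dictionary of the characters that need more samples. An empty dictionary means every required character met the threshold.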
Where Do I Download the eng.traineddata File?
Download eng.traineddata from the tessdata_best repository and place it in tesseract_tutorial/tesseract/tessdata.
The base model gives the trainer linguistic context (which character sequences form plausible words), so accuracy is much better than training from scratch. Pick a base model in the same language as your training text. If you hit issues, see the custom OCR language packs troubleshooting guide.
How Do I Build My Custom Font Trained Data File?
From the tesstrain folder, run:
TESSDATA_PREFIX=../tesseract/tessdata make training MODEL_NAME=AMGDT START_MODEL=eng TESSDATA=../tesseract/tessdata MAX_ITERATIONS=100
- MODEL_NAME is the name of your custom font (used for the output filename).
- START_MODEL is the base .traineddata you downloaded above.
- MAX_ITERATIONS caps the training run; higher values typically reduce error rate.
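When experimenting with several iteration counts, it can be convenient to drive the same make call from a script. The following Python sketch assembles the command shown above and runs it from the tesstrain folder. The function names and default values are illustrative; only the make variables come from this guide.

```python
import os
import subprocess


def build_training_command(model_name, start_model, tessdata_dir, max_iterations):
    """Return the argument list for the `make training` call shown above."""
    return [
        "make", "training",
        f"MODEL_NAME={model_name}",
        f"START_MODEL={start_model}",
        f"TESSDATA={tessdata_dir}",
        f"MAX_ITERATIONS={max_iterations}",
    ]


def run_training(tesstrain_dir, model_name="AMGDT", start_model="eng",
                 tessdata_dir="../tesseract/tessdata", max_iterations=100):
    """Run the training from the tesstrain folder with TESSDATA_PREFIX set."""
    env = dict(os.environ, TESSDATA_PREFIX=tessdata_dir)
    command = build_training_command(model_name, start_model, tessdata_dir, max_iterations)
    # check=True raises CalledProcessError if make exits with a non-zero status.
    subprocess.run(command, cwd=tesstrain_dir, env=env, check=True)
```

Calling `run_training("tesstrain", max_iterations=3000)` would repeat the run with a higher iteration cap while leaving the other settings unchanged.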
What If I Get "Failed to Read Data" in Makefile?
To resolve "Failed to read data" errors, patch the Makefile:
WORDLIST_FILE := $(OUTPUT_DIR2)/$(MODEL_NAME).lstm-word-dawg
NUMBERS_FILE := $(OUTPUT_DIR2)/$(MODEL_NAME).lstm-number-dawg
PUNC_FILE := $(OUTPUT_DIR2)/$(MODEL_NAME).lstm-punc-dawg
The patch points the Makefile at the actual output directory so it can locate the dictionary files.
How Do I Fix "Failed to Load Script Unicharset"?
Download Latin.unicharset from langdata_lstm and place it in the tesstrain/data/langdata folder.
The .unicharset file defines which characters the trainer is allowed to emit. Use the file that covers every character in your font, for example Cyrillic.unicharset for Cyrillic fonts or Devanagari.unicharset for Devanagari.
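To see whether a given .unicharset file covers your training text, the characters can be compared directly. This sketch assumes the common layout in which the first line holds the entry count and each following line starts with the symbol. Whitespace characters are skipped, and the function names are hypothetical.

```python
def unicharset_symbols(unicharset_text: str) -> set:
    """Collect the symbols listed in a .unicharset file."""
    lines = unicharset_text.splitlines()
    # Assumption: line 1 holds the entry count and every later line starts
    # with the symbol, followed by space-separated properties.
    return {line.split(" ", 1)[0] for line in lines[1:] if line.strip()}


def characters_missing_from_unicharset(training_text: str, unicharset_text: str) -> set:
    """Return non-whitespace training characters that the unicharset does not list."""
    symbols = unicharset_symbols(unicharset_text)
    return {char for char in training_text if not char.isspace() and char not in symbols}
```

A non-empty result points at characters that the chosen .unicharset file probably does not cover, which is a hint to pick a different script file.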
How Do I Verify the Accuracy of My Trained Data File?
With 1000 .box and .tif files and 3000 training iterations, the output AMGDT.traineddata reaches a training error rate (BCER) of around 5.77%.

To test the trained model with IronOCR, point UseCustomTesseractLanguageFile at the file and read a sample image:
using IronOcr;
// Load the trained model; AutoOsd handles orientation
var ocr = new IronTesseract();
ocr.UseCustomTesseractLanguageFile("path/to/AMGDT.traineddata");
ocr.Configuration.PageSegmentationMode = TesseractPageSegmentationMode.AutoOsd;
// Preprocess so the model sees clean glyphs
using var input = new OcrInput();
input.LoadImage("test-image-with-amgdt-font.png");
input.EnhanceResolution(300);
input.DeNoise();
// Confidence reflects training quality
var result = ocr.Read(input);
Console.WriteLine($"Text: {result.Text}");
Console.WriteLine($"Confidence: {result.Confidence}%");
The same test in VB.NET:
Imports IronOcr
' Load the trained model; AutoOsd handles orientation
Dim ocr As New IronTesseract()
ocr.UseCustomTesseractLanguageFile("path/to/AMGDT.traineddata")
ocr.Configuration.PageSegmentationMode = TesseractPageSegmentationMode.AutoOsd
' Preprocess so the model sees clean glyphs
Using input As New OcrInput()
input.LoadImage("test-image-with-amgdt-font.png")
input.EnhanceResolution(300)
input.DeNoise()
' Confidence reflects training quality
Dim result = ocr.Read(input)
Console.WriteLine($"Text: {result.Text}")
Console.WriteLine($"Confidence: {result.Confidence}%")
End Using
The Confidence property is the per-document score; if it stays low even on clean inputs, the most common causes are too few training samples or a base model that doesn't match the script. Once your .traineddata is verified, see our custom language guide for the general workflow of loading any custom language file.
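To put a number on accuracy for your own test images, a character error rate can be computed from the recognized text and a hand-typed ground truth. The Python sketch below uses the Levenshtein edit distance for this. It is an independent check on held-out samples and is not claimed to reproduce the BCER figure printed during training.

```python
def levenshtein(reference: str, hypothesis: str) -> int:
    """Minimum number of single-character insertions, deletions, and substitutions."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        current = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            cost = 0 if ref_char == hyp_char else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def character_error_rate(reference: str, hypothesis: str) -> float:
    """Edit distance divided by the reference length, as a percentage."""
    if not reference:
        raise ValueError("Reference text must not be empty")
    return 100.0 * levenshtein(reference, hypothesis) / len(reference)
```

Pass the ground-truth string as `reference` and the text returned by the OCR run as `hypothesis`. Averaging the result over a handful of test images gives a more stable picture than a single sample.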
What Are the Key Takeaways for Custom Font Training?
Training a custom font is a one-time setup: generate .box / .tif pairs from your target font, build a .traineddata model with tesstrain, then load it through UseCustomTesseractLanguageFile. From there IronOCR reads images with the new model exactly the same way it reads stock English.
Key advantages of using IronOCR with a custom Tesseract model:
- Reuses standard Tesseract artifacts: any .traineddata file you can build with tesstrain works in IronOCR without conversion.
- Cross-platform output: training requires Linux (or WSL2), but the trained file ships with your application on Windows, macOS, Linux, and Docker.
- Drop-in with the rest of the API: combine custom fonts with multiple secondary languages, image quality correction, and DPI tuning without changing the recognition path.
- Tunable accuracy: error rate depends on both the number of training samples and the number of iterations. Both knobs are exposed (the script's sample count plus MAX_ITERATIONS) so you can dial in the trade-off between training time and BCER without leaving Tesseract.
For larger pipelines, consider progress tracking and async processing when applying your trained model across many documents.
Frequently Asked Questions
How do I use a custom trained font file in C#?
You can use your custom-trained Tesseract font file in IronOCR with just a few lines of code. Simply create an IronTesseract instance, call UseCustomTesseractLanguageFile() with the path to your .traineddata file, and then use the Read() method to perform OCR on images containing your special font.
What are the requirements for training custom fonts for OCR?
Custom font training requires a Linux environment (WSL2 with Ubuntu is recommended for Windows users), Tesseract 5 installed with development libraries, and the font file you want to train (either .ttf or .otf format). The resulting .traineddata files created in Linux work seamlessly with IronOCR across all platforms.
Why should I train custom fonts instead of using standard OCR?
Training custom fonts improves OCR accuracy for specific fonts, especially decorative or special fonts that differ significantly from standard Tesseract models. IronOCR can then use these trained font files to accurately recognize text in images containing these unique fonts that would otherwise be difficult to read with standard OCR models.
Can I use custom trained fonts across different platforms?
Yes, while the training process requires Linux, the resulting .traineddata files work seamlessly across all platforms with IronOCR. This means you can train once on Linux and use the trained data file on Windows, macOS, or Linux deployments.
What installation method is recommended for getting started?
For quick setup, you can download the IronOCR DLL directly or install through NuGet Package Manager. NuGet is recommended as it handles dependencies automatically and makes updates easier. IronOCR provides comprehensive support for Tesseract 5 features and custom language implementations.
Does IronOCR support multiple languages?
IronOCR supports multiple languages, making it a versatile tool for global applications that require text recognition in different languages.
Can IronOCR be integrated into existing applications?
IronOCR is designed to be easily integrated into existing applications using C#, allowing developers to add OCR functionality to their software with minimal effort.
What are the benefits of using IronOCR for document management?
Using IronOCR for document management streamlines the workflow by converting scanned documents into searchable and editable text, reducing the need for manual data entry and improving document accessibility.
How can IronOCR improve data accuracy?
IronOCR improves data accuracy through its advanced recognition algorithms and image correction features, ensuring that the text extraction process is both reliable and precise.
Is there a free trial available for IronOCR?
Yes, Iron Software offers a free trial of IronOCR, allowing users to test its features and capabilities before making a purchase decision.

