C# Custom Font Training for Tesseract 5 (for Windows Users)

By Kannapat Udonpant

March 5, 2023

Updated June 22, 2025

Utilize custom font training for Tesseract 5 to improve the accuracy and recognition capabilities of the OCR engine when working with specific fonts or font styles that may not be well-supported by default.

The process involves providing Tesseract with training data, such as font samples and corresponding text, so that it can learn the specific characteristics and patterns of the custom fonts.

How to Use Tesseract Custom Font in C#

Download a C# library to train custom fonts with Tesseract
Prepare the target font file to be used for training
Follow the steps in this article
Consult the troubleshooting notes for commonly encountered errors
Export the trained data file for further use

Step 1: Download the Latest Version of IronOCR

Install via DLL

Download the IronOcr DLL directly to your machine.

Install via NuGet

Alternatively, you can install through NuGet with the following command:

Install-Package IronOcr

Step 2: Install and Set Up WSL2 and Ubuntu

Refer to the tutorial on Setting up WSL2 and Ubuntu.

Please note

Custom font training can currently be done only on Linux.

Step 3: Install Tesseract 5 on Ubuntu

Use the following commands to install Tesseract 5:

sudo apt install tesseract-ocr
sudo apt install libtesseract-dev
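
The version that apt installs depends on the Ubuntu release, so it can be worth confirming that the major version is 5 before continuing. The following is a minimal Python sketch rather than part of the original tutorial: the helper names are hypothetical, and it assumes the first line printed by `tesseract --version` looks like `tesseract 5.x.y`.

```python
import re
import subprocess


def parse_tesseract_major(version_output):
    """Return the major version found on the first line of the output, or None."""
    lines = version_output.strip().splitlines()
    first_line = lines[0] if lines else ""
    # Assumed format: "tesseract 5.3.4" (an optional leading "v" is tolerated).
    match = re.search(r"tesseract\s+v?(\d+)\.", first_line)
    return int(match.group(1)) if match else None


def installed_tesseract_major():
    """Run `tesseract --version` and parse its output (requires tesseract on PATH)."""
    result = subprocess.run(
        ["tesseract", "--version"], capture_output=True, text=True, check=True
    )
    # Some builds print the version banner to stderr, so both streams are checked.
    return parse_tesseract_major(result.stdout or result.stderr)
```

If `installed_tesseract_major()` returns a value below 5, the packaged version is older than the one this tutorial targets.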

Step 4: Download the Font You Would Like to Train

This tutorial uses the AMGDT font. The font file can be either .ttf or .otf.

Step 5: Mount the Disk Drive of Your Working Space for Custom Font Training

WSL2 makes Windows drives available under /mnt. Use the commands below to change into drive D: as your working space.

cd /
cd /mnt/d
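
To work out where any Windows folder appears inside WSL2, the drive letter is lower-cased and placed under the mount root. The sketch below illustrates that mapping in Python; the function name is hypothetical, and it assumes the default mount root of /mnt, which can be changed in the WSL configuration.

```python
from pathlib import PurePosixPath, PureWindowsPath


def windows_to_wsl_path(windows_path, mount_root="/mnt"):
    """Translate a drive-letter Windows path into its WSL2 mount path."""
    win = PureWindowsPath(windows_path)
    # Only plain drive letters such as "D:" are handled; UNC and relative
    # paths have no automatic /mnt equivalent.
    if len(win.drive) != 2 or not win.drive.endswith(":"):
        raise ValueError(f"Expected a path with a drive letter, got: {windows_path!r}")
    drive_letter = win.drive[0].lower()
    # parts[0] is the drive (and root); the remaining parts are folder names.
    return str(PurePosixPath(mount_root, drive_letter, *win.parts[1:]))
```

For example, `windows_to_wsl_path("D:\\")` gives `/mnt/d`, the directory used in this step.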

Step 6: Copy the Font File to Ubuntu Font Folder

Copy the font file to the following Ubuntu directories: /usr/share/fonts and /usr/local/share/fonts.

Access files in Ubuntu by typing \\wsl$ in the file explorer's address bar.
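
After copying, you may want to confirm that the font file actually landed in one of the two directories. The following is a minimal sketch using only the Python standard library; the function name and the name-fragment matching are illustrative assumptions, and it checks only that a .ttf or .otf file exists, not that fontconfig has indexed it.

```python
from pathlib import Path

# The two directories named in this step.
FONT_DIRS = ("/usr/share/fonts", "/usr/local/share/fonts")


def find_font_files(name_fragment, font_dirs=FONT_DIRS):
    """Return .ttf/.otf files under font_dirs whose file name contains name_fragment."""
    fragment = name_fragment.lower()
    matches = []
    for directory in font_dirs:
        root = Path(directory)
        if not root.is_dir():
            continue  # a missing directory is skipped rather than treated as an error
        for path in root.rglob("*"):
            is_font = path.suffix.lower() in {".ttf", ".otf"}
            if is_font and fragment in path.name.lower():
                matches.append(path)
    return sorted(matches)
```

Calling `find_font_files("AMGDT")` inside Ubuntu should return at least one path once the copy has succeeded; an empty list suggests the file is not where the training script will look.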

Troubleshooting: Destination Folder Access Denied

If you encounter an "access denied" error, copy the files from the command line with root privileges instead.

# Windows drives are exposed under /mnt in WSL2 (as in Step 5)
cd /mnt/c/Users/Admin/Downloads/'AMGDT Regular'
# sudo runs each copy as root without switching users
sudo cp 'AMGDT Regular.ttf' /usr/share/fonts
sudo cp 'AMGDT Regular.ttf' /usr/local/share/fonts

Step 7: Clone `tesseract_tutorial` from GitHub

Clone the tesseract_tutorial repository using the following command:

git clone https://github.com/astutejoe/tesseract_tutorial.git

Step 8: Clone `tesstrain` and `tesseract` from GitHub

Navigate to the tesseract_tutorial directory, then clone the tesstrain and tesseract repositories:

git clone https://github.com/tesseract-ocr/tesstrain
git clone https://github.com/tesseract-ocr/tesseract

tesstrain contains the "Makefile" used to create the .traineddata file.
tesseract contains the "tessdata" folder, which stores the original .traineddata files used for reference during custom font training (eng.traineddata is added to it in Step 11).

Step 9: Create a "data" Folder for Storing Outputs

Create a "data" folder within tesseract_tutorial/tesstrain.

Step 10: Run `split_training_text.py`

Return to the tesseract_tutorial folder and execute the following command:

python split_training_text.py

Running split_training_text.py creates the .box and .tif training files in the "data" folder.
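
To make it clearer what this step produces, here is a hedged sketch of the general technique: each line of training text becomes one ground-truth file, and Tesseract's text2image tool renders it into a .tif image with a matching .box file. This is not the tutorial script itself. The function names, the file naming scheme, and the unicharset path are assumptions for illustration, and the script in the repository may pass additional text2image options.

```python
from pathlib import Path


def build_text2image_command(font_name, text_file, output_base, unicharset_file):
    """Return the argument list for one text2image call (one line, one .tif/.box pair)."""
    return [
        "text2image",
        f"--font={font_name}",
        f"--text={text_file}",
        f"--outputbase={output_base}",  # text2image writes <outputbase>.tif and .box
        "--max_pages=1",
        "--strip_unrenderable_words",
        f"--unicharset_file={unicharset_file}",
    ]


def plan_training_samples(lines, font_name, output_dir, unicharset_file, count=100):
    """Pair each of the first `count` non-empty lines with its files and command."""
    output_dir = Path(output_dir)
    usable = [line.strip() for line in lines if line.strip()][:count]
    plan = []
    for index, line in enumerate(usable):
        # Hypothetical naming scheme: <font>_<index>, with spaces replaced.
        stem = f"{font_name.replace(' ', '_')}_{index}"
        gt_file = output_dir / f"{stem}.gt.txt"
        command = build_text2image_command(
            font_name, gt_file, output_dir / stem, unicharset_file
        )
        plan.append((line, gt_file, command))
    return plan
```

To act on the plan, each line would be written to its `.gt.txt` file and the command passed to `subprocess.run(command, check=True)`. The default `count=100` mirrors the number of training files the tutorial script generates.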

Troubleshooting: Fontconfig Warning

If you see the warning Fontconfig warning: "/tmp/fonts.conf, line 4: empty font directory name ignored", the font directories are missing from the configuration. Resolve this by editing the tesseract_tutorial/fonts.conf file and adding:

<dir>/usr/share/fonts</dir>
<dir>/usr/local/share/fonts</dir>
<dir prefix="xdg">fonts</dir>
<!-- the following element will be removed in the future -->
<dir>~/.fonts</dir>

Copy it to /etc/fonts (root privileges are needed to write there):

sudo cp fonts.conf /etc/fonts

Additionally, update split_training_text.py:

fontconf_dir = '/etc/fonts'

Note: Number of Training (.box and .tif) Files

The current number of training files is set to 100. You can modify this in split_training_text.py.

Step 11: Download `eng.traineddata`

Download eng.traineddata from this repository and place it in tesseract_tutorial/tesseract/tessdata.

Step 12: Create Your Custom Font `.traineddata`

Navigate to the tesstrain folder and use the command below in WSL2:

TESSDATA_PREFIX=../tesseract/tessdata make training MODEL_NAME=AMGDT START_MODEL=eng TESSDATA=../tesseract/tessdata MAX_ITERATIONS=100

MODEL_NAME is your custom font name.
START_MODEL is the original .traineddata reference.
MAX_ITERATIONS defines the number of iterations (more iterations can improve the accuracy of .traineddata).
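
If you expect to retrain with different settings, it can help to assemble the same command programmatically. The sketch below is one way to do that from Python, assuming it is pointed at the tesstrain folder; the function names are hypothetical, while the make variables are exactly the ones used in the command above.

```python
import os
import subprocess


def build_training_command(model_name, start_model, tessdata_dir, max_iterations):
    """Return the `make training` argument list with the variables from this step."""
    return [
        "make",
        "training",
        f"MODEL_NAME={model_name}",
        f"START_MODEL={start_model}",
        f"TESSDATA={tessdata_dir}",
        f"MAX_ITERATIONS={max_iterations}",
    ]


def run_training(
    tesstrain_dir,
    model_name="AMGDT",
    start_model="eng",
    tessdata_dir="../tesseract/tessdata",
    max_iterations=100,
):
    """Run the training from inside tesstrain_dir with TESSDATA_PREFIX set."""
    # TESSDATA_PREFIX is passed through the environment, as in the shell command.
    env = dict(os.environ, TESSDATA_PREFIX=tessdata_dir)
    command = build_training_command(
        model_name, start_model, tessdata_dir, max_iterations
    )
    return subprocess.run(command, cwd=tesstrain_dir, env=env, check=True)
```

With the defaults, `run_training("tesstrain")` issues the same command as shown in this step; raising `max_iterations` is the usual lever for a longer training run.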

Troubleshooting: "Failed to Read Data" in Makefile

To resolve "Failed to read data" issues, modify the Makefile:

WORDLIST_FILE := $(OUTPUT_DIR2)/$(MODEL_NAME).lstm-word-dawg
NUMBERS_FILE := $(OUTPUT_DIR2)/$(MODEL_NAME).lstm-number-dawg
PUNC_FILE := $(OUTPUT_DIR2)/$(MODEL_NAME).lstm-punc-dawg

Troubleshooting: "Failed to Load Script Unicharset"

Insert Latin.unicharset into the tesstrain/data/langdata folder. The Latin.unicharset can be found here.

Step 13: Verify the Accuracy of Created `.traineddata`

With 1000 .box and .tif files and 3000 training iterations, the resulting AMGDT.traineddata reaches a minimum training error rate (BCER) of around 5.77%.
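
If you save the training output to a file, the lowest error rate can be pulled out automatically instead of being read off the console. The following is a small sketch that assumes the progress lines contain a field of the form `BCER train=<number>%`; that format and the function name are assumptions, so adjust the pattern if your log looks different.

```python
import re

# Assumed field format in the training log: "BCER train=5.770%"
BCER_PATTERN = re.compile(r"BCER train=([0-9]+(?:\.[0-9]+)?)%")


def best_bcer(log_text):
    """Return the lowest BCER percentage found in the log text, or None if absent."""
    values = [float(value) for value in BCER_PATTERN.findall(log_text)]
    return min(values) if values else None
```

A falling value across runs with more training files or iterations indicates that the extra training is paying off.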

For further reading and reference, see the tutorial: YouTube Video

Frequently Asked Questions

What is the purpose of custom font training in Tesseract?

Custom font training in Tesseract is used to improve the accuracy and recognition capabilities of the OCR engine when dealing with specific fonts or font styles that may not be well-supported by default.

How can I begin using custom fonts in C#?

To begin using custom fonts with Tesseract in C#, you need to download a C# library like IronOCR, prepare the target font file for training, and follow the outlined steps in the guide.

What are the installation steps for a C# OCR library?

IronOCR can be installed either by downloading the DLL directly or through NuGet using the command: Install-Package IronOcr.

Can custom font training be done on Windows?

Custom font training for Tesseract is currently supported only on Linux, which can be set up on Windows using WSL2 and Ubuntu.

What command is used to install Tesseract 5 on Ubuntu?

To install Tesseract 5 on Ubuntu, use the following commands: sudo apt install tesseract-ocr and sudo apt install libtesseract-dev.

How do you handle 'Destination Folder Access Denied' errors during font file copying?

If you encounter 'Destination Folder Access Denied' errors, resolve this by using the command line to copy the files with root access.

What is the role of the 'tesstrain' repository?

The 'tesstrain' repository contains the 'Makefile' used to create the '.traineddata' file necessary for custom font training in Tesseract.

How do you address the 'Fontconfig warning' during training?

To address the 'Fontconfig warning', add font directory paths in the 'fonts.conf' file and copy it to '/etc/fonts'.

What should be done if 'Failed to Load Script Unicharset' occurs?

Insert 'Latin.unicharset' into the 'tesstrain/data/langdata' folder to resolve the 'Failed to Load Script Unicharset' issue.

How is the accuracy of the trained data verified?

Accuracy is judged by the training error rate (BCER) reported during training: the lower the BCER reached with a sufficient number of training files and iterations, the better the trained data.

Kannapat Udonpant

Software Engineer

Before becoming a Software Engineer, Kannapat completed an Environmental Resources PhD from Hokkaido University in Japan. While pursuing his degree, Kannapat also became a member of the Vehicle Robotics Laboratory, which is part of the Department of Bioproduction Engineering. In 2022, he leveraged his C# skills to join Iron Software's engineering team, where he focuses on IronPDF. Kannapat values his job because he learns directly from the developer who writes most of the code used in IronPDF. In addition to peer learning, Kannapat enjoys the social aspect of working at Iron Software. When he's not writing code or documentation, Kannapat can usually be found gaming on his PS5 or rewatching The Last of Us.
