C# Custom Font Training for Tesseract 5 (for Windows Users)

Custom font training for Tesseract 5 improves the OCR engine's accuracy on specific fonts or font styles that the default models do not support well.

The process involves providing Tesseract with training data, such as font samples and corresponding text, so that it can learn the specific characteristics and patterns of the custom fonts.

Get Started with IronOCR

Start using IronOCR in your project today with a free trial.

Step 1: Download the Latest Version of IronOCR

Install via DLL

Download the IronOcr DLL directly to your machine.

Install via NuGet

Alternatively, you can install through NuGet with the following command:

Install-Package IronOcr

Step 2: Install and Set Up WSL2 and Ubuntu

Refer to the tutorial on Setting up WSL2 and Ubuntu.

Please note: Custom font training can currently be done only on Linux.

Step 3: Install Tesseract 5 on Ubuntu

Use the following commands to install Tesseract 5:

sudo apt install tesseract-ocr
sudo apt install libtesseract-dev
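Depending on the Ubuntu release, the packaged Tesseract may be older than version 5, so it can be worth checking what was installed. The following is a minimal sketch that extracts the major version from the text printed by tesseract --version. The function name and the sample string are illustrative assumptions, not part of the original tutorial.

```python
import re
from typing import Optional


def tesseract_major_version(version_output: str) -> Optional[int]:
    """Return the major version found in `tesseract --version` output, or None."""
    # Matches "tesseract 5.3.0" as well as "tesseract v5.0.0".
    match = re.search(r"tesseract\s+v?(\d+)\.", version_output)
    return int(match.group(1)) if match else None


# Assumed sample output; the real text depends on the installed package.
sample = "tesseract 5.3.0\n leptonica-1.82.0"
print(tesseract_major_version(sample))
```

In practice the text could be obtained with subprocess.run(["tesseract", "--version"], capture_output=True, text=True), checking both stdout and stderr.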

Step 4: Download the Font You Would Like to Train

We are using the AMGDT font for this tutorial. The font file can be either .ttf or .otf.

Step 5: Mount the Disk Drive of Your Working Space for Custom Font Training

WSL2 makes Windows drives available under /mnt. Use the commands below to switch to Drive D: as your working space.

cd /
cd /mnt/d
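Because every Windows path has to be rewritten in its /mnt form inside WSL2, a small helper can reduce mistakes. This is a sketch only: the function name is hypothetical, and it assumes the default WSL2 behavior of exposing drive letters as /mnt/<letter>.

```python
from pathlib import PureWindowsPath


def windows_to_wsl_path(windows_path: str) -> str:
    """Convert a drive-letter Windows path to its default WSL2 mount path."""
    path = PureWindowsPath(windows_path)
    drive = path.drive.rstrip(":").lower()
    # Only plain drive letters are handled; UNC and relative paths are rejected.
    if len(drive) != 1 or not drive.isalpha():
        raise ValueError(f"expected a drive-letter path, got {windows_path!r}")
    # parts[0] is the anchor (for example "D:\\"); the rest are folder names.
    return "/".join(["/mnt", drive, *path.parts[1:]])


print(windows_to_wsl_path(r"D:\fonts\AMGDT Regular.ttf"))
```

Inside WSL itself, the wslpath utility performs the same conversion from the shell.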

Step 6: Copy the Font File to Ubuntu Font Folder

Copy the font file to the following Ubuntu directories: /usr/share/fonts and /usr/local/share/fonts.

Access files in Ubuntu by typing \\wsl$ in the file explorer's address bar.

Troubleshooting: Destination Folder Access Denied

If you encounter access denied errors, resolve this by using the command line to copy the files.

# Windows Drive C: is available under /mnt/c in WSL2
cd /mnt/c/Users/Admin/Downloads/'AMGDT Regular'
# sudo is needed because both font folders are owned by root
sudo cp 'AMGDT Regular.ttf' /usr/share/fonts
sudo cp 'AMGDT Regular.ttf' /usr/local/share/fonts
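The copy in Step 6 can also be scripted. The sketch below copies one font file into a list of folders using only the Python standard library. The function name copy_font is an assumption for illustration, and writing to the two system font folders would still require running the script with root privileges.

```python
import shutil
from pathlib import Path


def copy_font(font_file, font_dirs):
    """Copy font_file into every folder in font_dirs and return the new paths."""
    font_file = Path(font_file)
    copied = []
    for directory in font_dirs:
        directory = Path(directory)
        # Create the folder if it is missing, then copy the file into it.
        directory.mkdir(parents=True, exist_ok=True)
        copied.append(Path(shutil.copy2(font_file, directory)))
    return copied
```

A call such as copy_font("AMGDT Regular.ttf", ["/usr/share/fonts", "/usr/local/share/fonts"]) would mirror the two cp commands from this step.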

Step 7: Clone tesseract_tutorial from GitHub

Clone the tesseract_tutorial repository using the following command:

git clone https://github.com/astutejoe/tesseract_tutorial.git

Step 8: Clone tesstrain and tesseract from GitHub

Navigate to the tesseract_tutorial directory, then clone the tesstrain and tesseract repositories:

git clone https://github.com/tesseract-ocr/tesstrain
git clone https://github.com/tesseract-ocr/tesseract

  • tesstrain contains the "Makefile" used to create the .traineddata file.
  • tesseract contains the "tessdata" folder, which holds the original .traineddata files used as a reference during custom font training.

Step 9: Create a "data" Folder for Storing Outputs

Create a "data" folder within tesseract_tutorial/tesstrain.

Step 10: Run split_training_text.py

Return to the tesseract_tutorial folder and execute the following command:

python split_training_text.py

Running split_training_text.py creates the .box and .tif training files in the "data" folder.
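Before training, it may help to confirm that every .tif image has a matching .box file. The sketch below is one possible check, assuming the two files of a pair share the same base name. The function name is hypothetical, and the search is recursive because the script may place its output in a subfolder of "data".

```python
from pathlib import Path


def unpaired_training_files(data_dir):
    """Return base names that have a .tif without a .box, or the reverse."""
    data_dir = Path(data_dir)
    tif_names = {p.stem for p in data_dir.rglob("*.tif")}
    box_names = {p.stem for p in data_dir.rglob("*.box")}
    # The symmetric difference keeps names present in only one of the two sets.
    return sorted(tif_names ^ box_names)
```

An empty list from unpaired_training_files("tesstrain/data") would suggest that the generated files are complete.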

Troubleshooting: Fontconfig Warning

If you see the warning Fontconfig warning: "/tmp/fonts.conf, line 4: empty font directory name ignored", the fontconfig file does not list the font directories. Solve this by editing the tesseract_tutorial/fonts.conf file and adding:

<dir>/usr/share/fonts</dir>
<dir>/usr/local/share/fonts</dir>
<dir prefix="xdg">fonts</dir>
<!-- the following element will be removed in the future -->
<dir>~/.fonts</dir>

Copy the edited fonts.conf to /etc/fonts (root access is needed to write there):

sudo cp fonts.conf /etc/fonts

Additionally, update split_training_text.py:

fontconf_dir = '/etc/fonts'
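To confirm that a fonts.conf file really lists the font directories, its dir elements can be read with the standard XML parser. This is a rough sketch: the function name and the sample document are assumptions, and the sample includes one empty dir element of the kind that could trigger the warning quoted above.

```python
import xml.etree.ElementTree as ET

# Assumed minimal sample; a real fonts.conf contains more elements.
SAMPLE = """<?xml version="1.0"?>
<fontconfig>
  <dir>/usr/share/fonts</dir>
  <dir>/usr/local/share/fonts</dir>
  <dir prefix="xdg">fonts</dir>
  <dir></dir>
</fontconfig>
"""


def font_dirs(fonts_conf_text):
    """Return (directory names, number of empty <dir> entries)."""
    root = ET.fromstring(fonts_conf_text)
    names, empty = [], 0
    for element in root.iter("dir"):
        text = (element.text or "").strip()
        if text:
            names.append(text)
        else:
            empty += 1
    return names, empty


print(font_dirs(SAMPLE))
```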

Note: Number of Training (.box and .tif) Files

The current number of training files is set to 100. You can modify this in split_training_text.py.
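As an illustration of how a file count like this typically limits the training set, the sketch below shuffles the lines of a training text and keeps the first N. It is an assumption about the general approach, not a copy of split_training_text.py, and the function name and seed parameter are hypothetical.

```python
import random


def pick_training_lines(lines, count=100, seed=None):
    """Shuffle the non-empty lines and keep at most `count` of them."""
    cleaned = [line.strip() for line in lines if line.strip()]
    # A fixed seed makes the selection repeatable between runs.
    rng = random.Random(seed)
    rng.shuffle(cleaned)
    return cleaned[:count]
```

Each selected line would then become one .tif and .box pair, so raising count from 100 to 1000 produces ten times as many training files.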

Step 11: Download eng.traineddata

Download eng.traineddata from this repository and place it in tesseract_tutorial/tesseract/tessdata.
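At this point the working folder should contain everything the training step needs. The following sketch lists whichever expected paths are missing. The paths are taken from the steps above, while the function name and the idea of checking them in a script are illustrative assumptions.

```python
from pathlib import Path

# Paths relative to the tesseract_tutorial folder, as described in Steps 7 to 11.
EXPECTED = (
    "split_training_text.py",
    "tesstrain/Makefile",
    "tesstrain/data",
    "tesseract/tessdata/eng.traineddata",
)


def missing_paths(root, expected=EXPECTED):
    """Return the expected paths that do not exist under root."""
    root = Path(root)
    return [relative for relative in expected if not (root / relative).exists()]
```

Running missing_paths("tesseract_tutorial") and getting an empty list would indicate that the layout matches the one this tutorial assumes.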

Step 12: Create Your Custom Font .traineddata

Navigate to the tesstrain folder and use the command below in WSL2:

TESSDATA_PREFIX=../tesseract/tessdata make training MODEL_NAME=AMGDT START_MODEL=eng TESSDATA=../tesseract/tessdata MAX_ITERATIONS=100

  • MODEL_NAME is the name of the model to create; here it is the custom font name, so the output file is AMGDT.traineddata.
  • START_MODEL is the existing .traineddata file that training starts from (eng.traineddata here).
  • TESSDATA is the folder that contains the START_MODEL .traineddata file.
  • MAX_ITERATIONS sets the number of training iterations; more iterations can improve the accuracy of the resulting .traineddata.
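When the same training run is repeated with different settings, assembling the command in a script can help avoid typing errors. The sketch below only builds the argument list and the extra environment variable from the command above. The function name and its defaults are assumptions for illustration.

```python
def training_command(model_name, start_model="eng",
                     tessdata="../tesseract/tessdata", max_iterations=100):
    """Return (argument list, extra environment) for the make training call."""
    args = [
        "make", "training",
        f"MODEL_NAME={model_name}",
        f"START_MODEL={start_model}",
        f"TESSDATA={tessdata}",
        f"MAX_ITERATIONS={max_iterations}",
    ]
    # TESSDATA_PREFIX is passed through the environment, as in the shell command.
    env = {"TESSDATA_PREFIX": tessdata}
    return args, env


args, env = training_command("AMGDT", max_iterations=3000)
print(" ".join(args))
```

The result could then be run from the tesstrain folder with subprocess.run(args, cwd="tesstrain", env={**os.environ, **env}, check=True).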

Troubleshooting: "Failed to Read Data" in Makefile

To resolve "Failed to read data" errors, edit the Makefile in the tesstrain folder so that these three variables are defined as follows:

WORDLIST_FILE := $(OUTPUT_DIR2)/$(MODEL_NAME).lstm-word-dawg
NUMBERS_FILE := $(OUTPUT_DIR2)/$(MODEL_NAME).lstm-number-dawg
PUNC_FILE := $(OUTPUT_DIR2)/$(MODEL_NAME).lstm-punc-dawg

Troubleshooting: "Failed to Load Script Unicharset"

Insert Latin.unicharset into the tesstrain/data/langdata folder. The Latin.unicharset can be found here.

Step 13: Verify the Accuracy of Created .traineddata

With 1,000 .box and .tif files and 3,000 training iterations, the resulting AMGDT.traineddata reaches a minimum training error rate (BCER) of around 5.77%.
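The lowest BCER can also be pulled out of a saved training log instead of being read by eye. The sketch below assumes the log contains entries of the form "BCER train=<number>%". The function name is hypothetical, and the sample lines are made up for illustration rather than taken from a real run.

```python
import re

BCER_PATTERN = re.compile(r"BCER train=([0-9]+(?:\.[0-9]+)?)%")

# Invented sample lines; only the assumed "BCER train=...%" shape matters here.
SAMPLE_LOG = """\
At iteration 100/100/100, BCER train=42.310%, BWER train=88.000%
At iteration 200/200/200, BCER train=17.905%, BWER train=51.200%
At iteration 300/300/300, BCER train=5.770%, BWER train=20.100%
"""


def best_bcer(log_text):
    """Return the lowest BCER percentage found in the log, or None."""
    values = [float(value) for value in BCER_PATTERN.findall(log_text)]
    return min(values) if values else None


print(best_bcer(SAMPLE_LOG))
```

A lower value means fewer character errors on the training data, so comparing this number across runs gives a quick view of whether more files or more iterations are helping.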

For further reading and reference, see the tutorial: YouTube Video

Frequently Asked Questions

How can I train a custom font for use with Tesseract in C#?

To train a custom font for Tesseract using C#, you must first download IronOCR, prepare your font file, and set up a Linux environment via WSL2 and Ubuntu on Windows, as Tesseract's custom font training is supported only on Linux.

What are the steps to install Tesseract 5 on a Windows system using WSL2?

To install Tesseract 5 on Windows using WSL2, you need to set up Ubuntu and then use the commands sudo apt install tesseract-ocr and sudo apt install libtesseract-dev to complete the installation.

What should I do if I encounter 'Destination Folder Access Denied' errors while copying font files?

If you face 'Destination Folder Access Denied' errors, use the command line with root access to copy the font files into the necessary directories to bypass permission issues.

Why is a Linux environment necessary for custom font training in Tesseract?

A Linux environment is required for custom font training in Tesseract because the training tools are designed to run on Unix-like systems. On Windows, WSL2 with Ubuntu provides this environment.

How do I fix 'Fontconfig warning' errors when training custom fonts?

To resolve 'Fontconfig warning' errors, you should add the font directory paths to the fonts.conf file and ensure it is copied to the /etc/fonts directory.

What is the purpose of the 'tesstrain' repository in custom font training?

The 'tesstrain' repository is used to create the .traineddata file needed for custom font training in Tesseract, providing the scripts and Makefile necessary for the process.

How can I resolve the 'Failed to Load Script Unicharset' error?

To fix the 'Failed to Load Script Unicharset' error, you need to insert the Latin.unicharset into the tesstrain/data/langdata folder to ensure the necessary character set is available.

How do I verify the accuracy of my custom trained data in Tesseract?

You can verify the accuracy of your custom trained data by checking the training error rate, known as BCER, and ensuring it is minimal after sufficient iterations and training file adjustments.

Kannapat Udonpant
Software Engineer
Before becoming a Software Engineer, Kannapat completed an Environmental Resources PhD from Hokkaido University in Japan. While pursuing his degree, Kannapat also became a member of the Vehicle Robotics Laboratory, which is part of the Department of Bioproduction Engineering. In 2022, he leveraged his C# skills to join Iron Software's engineering ...
Reviewed by
Jeffrey T. Fritz
Principal Program Manager - .NET Community Team
Jeff is also a Principal Program Manager for the .NET and Visual Studio teams. He is the executive producer of the .NET Conf virtual conference series and hosts 'Fritz and Friends', a live stream for developers that airs twice weekly, where he talks tech and writes code together with viewers. Jeff writes workshops and presentations and plans content for the largest Microsoft developer events, including Microsoft Build, Microsoft Ignite, .NET Conf, and the Microsoft MVP Summit.