C# Custom font training for Tesseract 5 (for Windows users)

by Kannapat Udonpant

Custom font training for Tesseract 5 improves the accuracy and recognition capabilities of the OCR engine when working with specific fonts or font styles that are not well supported by default.

The process involves providing Tesseract with training data, such as font samples and corresponding text, so that it can learn the specific characteristics and patterns of the custom fonts.


Step 1: Download the Latest Version of IronOCR

Install via DLL

Download the IronOcr DLL directly to your machine.

Install via NuGet

Alternatively, you can install through NuGet.

Install-Package IronOcr

Step 2: Install and set up WSL2 and Ubuntu

Follow a tutorial for setting up WSL2 and Ubuntu before continuing.

** Currently, custom font training can be done only on Linux.

Step 3: Install Tesseract 5 on Ubuntu

sudo apt install tesseract-ocr
sudo apt install libtesseract-dev
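Before continuing, it can help to confirm that the tesseract executable is reachable from the Ubuntu shell. The following is a minimal Python sketch that is not part of the original tutorial; the helper name find_tools is hypothetical.

```python
import shutil


def find_tools(names):
    """Map each command name to its full path, or None if it is not on PATH."""
    return {name: shutil.which(name) for name in names}


if __name__ == "__main__":
    for name, path in find_tools(["tesseract"]).items():
        print(f"{name}: {path or 'not found on PATH'}")
```

If tesseract is reported as not found, repeat the install commands above before moving on.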

Step 4: Download the font you would like to train

This tutorial uses the AMGDT font. The font file can be either .ttf or .otf.

Step 5: Mount the disk drive of your working space for the custom font training

The following commands switch to drive D:, which WSL mounts at /mnt/d, to use it as the working space.

cd /
cd /mnt/d
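The mapping between a Windows path and its location inside WSL can be sketched as below. This is an illustration only: it assumes WSL's default mount root of /mnt, and the function name windows_to_wsl_path is hypothetical.

```python
from pathlib import PurePosixPath, PureWindowsPath


def windows_to_wsl_path(windows_path, mount_root="/mnt"):
    """Translate a drive-letter path such as D:\\work into /mnt/d/work."""
    path = PureWindowsPath(windows_path)
    drive = path.drive  # e.g. "D:"
    if len(drive) != 2 or drive[1] != ":":
        raise ValueError(f"expected a drive-letter path, got {windows_path!r}")
    # parts[0] is the drive anchor ("D:\\"); the remaining parts are folder names.
    return str(PurePosixPath(mount_root, drive[0].lower(), *path.parts[1:]))


if __name__ == "__main__":
    print(windows_to_wsl_path(r"D:\ocr-training"))  # /mnt/d/ocr-training
```

The folder name ocr-training is only an example; any folder on the mounted drive can serve as the working space.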

Step 6: Copy the font file to the Ubuntu font folder

The Ubuntu font folders are Ubuntu/usr/share/fonts and Ubuntu/usr/local/share/fonts.

** To access files on Ubuntu, type \\wsl$ in the File Explorer address bar.

Troubleshooting: Destination Folder Access Denied

This issue can be solved by copying the file from the command line (replace username with your own Ubuntu user name):

cd /
su root
cd /mnt/c/Users/Admin/Downloads/'AMGDT Regular'
cp 'AMGDT Regular.ttf' /usr/share/fonts
cp 'AMGDT Regular.ttf' /usr/local/share/fonts
su username
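The same copy can be scripted. The sketch below is one possible approach rather than the tutorial's method: install_font is a hypothetical helper, and writing to the system font folders still requires root privileges (for example, running the script with sudo).

```python
import shutil
from pathlib import Path

# Font folders named in Step 6.
FONT_DIRS = ["/usr/share/fonts", "/usr/local/share/fonts"]


def install_font(font_file, font_dirs=FONT_DIRS):
    """Copy font_file into each folder in font_dirs and return the new paths."""
    source = Path(font_file)
    if source.suffix.lower() not in {".ttf", ".otf"}:
        raise ValueError(f"not a .ttf or .otf file: {source}")
    copied = []
    for folder in font_dirs:
        target_dir = Path(folder)
        target_dir.mkdir(parents=True, exist_ok=True)
        # copy2 returns the path of the file it created inside target_dir.
        copied.append(Path(shutil.copy2(source, target_dir)))
    return copied

# Example call (needs root; adjust the path to wherever the font was downloaded):
# install_font("AMGDT Regular.ttf")
```

Passing the font path as a single Python string avoids the shell quoting needed for the space in the file name.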

Step 7: Clone tesseract_tutorial from GitHub

The tesseract_tutorial repository can be cloned from https://github.com/astutejoe/tesseract_tutorial.git with the following command:

git clone https://github.com/astutejoe/tesseract_tutorial.git

Step 8: Clone tesstrain and tesseract from GitHub

Go to the tesseract_tutorial folder, then git clone https://github.com/tesseract-ocr/tesstrain and https://github.com/tesseract-ocr/tesseract.

  • tesstrain contains the “Makefile” file, which is used to create the .traineddata file (the objective of this tutorial)
  • tesseract contains the “tessdata” folder, which holds the original .traineddata files used as a reference for custom font training

Step 9: Create a “data” folder for storing outputs

The “data” folder should be created in tesseract_tutorial/tesstrain.
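At this point the working folder should contain several nested folders, and a missing one tends to surface only later as a confusing error. A minimal sketch for checking the layout, assuming the folder names used in Steps 7 to 9 (the helper missing_dirs is hypothetical):

```python
from pathlib import Path

# Folders this tutorial expects after Steps 7 to 9, relative to tesseract_tutorial.
EXPECTED_DIRS = [
    "tesstrain",
    "tesstrain/data",
    "tesseract",
    "tesseract/tessdata",
]


def missing_dirs(tutorial_root, expected=EXPECTED_DIRS):
    """Return the expected sub-folders that do not exist under tutorial_root."""
    root = Path(tutorial_root)
    return [rel for rel in expected if not (root / rel).is_dir()]


if __name__ == "__main__":
    gaps = missing_dirs(".")  # run from inside tesseract_tutorial
    print("Missing folders:", gaps if gaps else "none")
```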

Step 10: Run split_training_text.py

Return to the tesseract_tutorial folder, then run the following command:

python split_training_text.py

After running split_training_text.py, .box and .tif files are created in the “data” folder.
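Since each training image needs a matching box file, a quick consistency check can catch an interrupted run. The following sketch is an illustration with a hypothetical helper name; it searches sub-folders too, on the assumption that the files may be written below “data” rather than directly in it.

```python
from pathlib import Path


def unpaired_training_files(data_dir):
    """Return (tif_without_box, box_without_tif) as sorted lists of base names."""
    root = Path(data_dir)

    def stems(pattern):
        # Base name relative to data_dir, without the file extension.
        return {p.with_suffix("").relative_to(root).as_posix() for p in root.rglob(pattern)}

    tif, box = stems("*.tif"), stems("*.box")
    return sorted(tif - box), sorted(box - tif)


if __name__ == "__main__":
    no_box, no_tif = unpaired_training_files("tesstrain/data")
    print("Images without a box file:", no_box)
    print("Box files without an image:", no_tif)
```

Two empty lists mean every .tif has a .box partner and vice versa.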

Troubleshooting: Fontconfig warning: “/tmp/fonts.conf, line 4: empty font directory name ignored”

This issue occurs when the font directory in Ubuntu cannot be found. It can be solved by inserting these lines of code in tesseract_tutorial/fonts.conf:

<dir>/usr/share/fonts</dir>
<dir>/usr/local/share/fonts</dir>
<dir prefix="xdg">fonts</dir>
<!-- the following element will be removed in the future -->
<dir>~/.fonts</dir>

Then copy it to /etc/fonts

cp fonts.conf /etc/fonts

Finally, add these lines of code to split_training_text.py:

fontconf_dir = '/etc/fonts'
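To confirm that the edited fonts.conf really lists the font folders, and has no empty <dir> entry of the kind the warning complains about, the file can be inspected with Python's standard XML parser. This is a sketch with hypothetical helper names, not part of the tutorial's scripts.

```python
import xml.etree.ElementTree as ET


def font_dirs_in_conf(conf_path):
    """Return the text of every non-empty <dir> element in a fonts.conf file."""
    root = ET.parse(conf_path).getroot()
    return [d.text.strip() for d in root.iter("dir") if d.text and d.text.strip()]


def missing_font_dirs(conf_path, required=("/usr/share/fonts", "/usr/local/share/fonts")):
    """Return the required font folders that fonts.conf does not list."""
    listed = set(font_dirs_in_conf(conf_path))
    return [d for d in required if d not in listed]


if __name__ == "__main__":
    print("Not listed in /etc/fonts/fonts.conf:", missing_font_dirs("/etc/fonts/fonts.conf"))
```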

Note: Number of training (.box and .tif) files

Currently, the number of training files is 100. This number can be changed by editing or deleting the relevant lines of code in split_training_text.py.
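The idea behind such a limit can be sketched as follows: shuffle the available text lines and keep only the first N, so that N controls how many training files are produced. This is an illustration with hypothetical names, not the actual code of split_training_text.py.

```python
import random


def pick_training_lines(lines, count, seed=None):
    """Shuffle the non-empty lines and keep at most `count` of them."""
    cleaned = [line.strip() for line in lines if line.strip()]
    rng = random.Random(seed)  # a fixed seed makes the selection repeatable
    rng.shuffle(cleaned)
    return cleaned[:count]


if __name__ == "__main__":
    sample = ["first line", "", "second line", "third line"]
    print(pick_training_lines(sample, count=2, seed=0))
```

Raising the count gives the training more samples at the cost of a longer run; removing the limit uses every line of the training text.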

Step 11: Download eng.traineddata

eng.traineddata can be downloaded from https://github.com/tesseract-ocr/tessdata_best. Save it into tesseract_tutorial/tesseract/tessdata, because the eng.traineddata in tessdata_best is more accurate than the original one in the tessdata folder.

Step 12: Create your custom font .traineddata

Go to the tesstrain folder and run this command in WSL2:

TESSDATA_PREFIX=../tesseract/tessdata make training MODEL_NAME=AMGDT START_MODEL=eng TESSDATA=../tesseract/tessdata MAX_ITERATIONS=100
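  • make training = runs the training target in tesstrain/Makefile
  • MODEL_NAME = the name of your custom font model
  • START_MODEL = the name of the original .traineddata to start from
  • TESSDATA = the folder that contains the START_MODEL .traineddata file
  • MAX_ITERATIONS = the number of training iterations (more iterations generally produce a more accurate .traineddata)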
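When experimenting with different iteration counts, it may be convenient to build the same command from a script. The sketch below assembles the command shown above and runs it with subprocess; the function names are hypothetical, and the values simply mirror this tutorial.

```python
import os
import subprocess


def build_training_command(model_name, start_model, tessdata, max_iterations):
    """Return (args, env) equivalent to the make command in Step 12."""
    args = [
        "make", "training",
        f"MODEL_NAME={model_name}",
        f"START_MODEL={start_model}",
        f"TESSDATA={tessdata}",
        f"MAX_ITERATIONS={max_iterations}",
    ]
    # TESSDATA_PREFIX is passed as an environment variable, as in the shell command.
    env = dict(os.environ, TESSDATA_PREFIX=tessdata)
    return args, env


def run_training(tesstrain_dir, **settings):
    """Run the command inside tesstrain_dir; raises CalledProcessError on failure."""
    args, env = build_training_command(**settings)
    subprocess.run(args, cwd=tesstrain_dir, env=env, check=True)

# Example call, run from the tesseract_tutorial folder:
# run_training("tesstrain", model_name="AMGDT", start_model="eng",
#              tessdata="../tesseract/tessdata", max_iterations=100)
```

Because the command runs with tesstrain as its working directory, the relative tessdata path is resolved exactly as it is in the shell version.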

“Failed to read data from: ” can be solved by editing lines of code in Makefile

Before:

After:

Edited lines in tesstrain/Makefile:
WORDLIST_FILE := $(OUTPUT_DIR2)/$(MODEL_NAME).lstm-word-dawg
NUMBERS_FILE := $(OUTPUT_DIR2)/$(MODEL_NAME).lstm-number-dawg
PUNC_FILE := $(OUTPUT_DIR2)/$(MODEL_NAME).lstm-punc-dawg

“Failed to load script unicharset from:data/langdata/Latin.unicharset” can be solved by placing Latin.unicharset in the tesstrain/data/langdata folder. The file is available in the tesseract-ocr/langdata_lstm repository on GitHub.

Step 13: The accuracy of the created .traineddata

With 1000 .box and .tif files and 3000 training iterations, the output .traineddata (AMGDT.traineddata) has a minimal training error rate (BCER) of around 5.77%.
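To track the error rate across a run, the BCER values can be pulled out of the saved training output. The sketch below assumes that the log contains fragments of the form "BCER train=5.770%"; that format is an assumption to check against your own output, and the helper names are hypothetical.

```python
import re

# Assumed log fragment: "BCER train=5.770%". Adjust the pattern if your log differs.
BCER_PATTERN = re.compile(r"BCER train=(\d+(?:\.\d+)?)%")


def bcer_values(log_text):
    """Return every BCER percentage found in log_text, in order of appearance."""
    return [float(value) for value in BCER_PATTERN.findall(log_text)]


def best_bcer(log_text):
    """Return the lowest BCER found, or None if the log has no BCER entries."""
    values = bcer_values(log_text)
    return min(values) if values else None
```

Comparing the lowest value between runs gives a rough view of whether more training files or more iterations are paying off.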

For further reading and reference: https://www.youtube.com/watch?v=KE4xEzFGSU8

Kannapat Udonpant

Software Engineer

Before becoming a Software Engineer, Kannapat completed an Environmental Resources PhD from Hokkaido University in Japan. While pursuing his degree, Kannapat also became a member of the Vehicle Robotics Laboratory, which is part of the Department of Bioproduction Engineering. In 2022, he leveraged his C# skills to join Iron Software's engineering team, where he focuses on IronPDF. Kannapat values his job because he learns directly from the developer who writes most of the code used in IronPDF. In addition to peer learning, Kannapat enjoys the social aspect of working at Iron Software. When he's not writing code or documentation, Kannapat can usually be found gaming on his PS5 or rewatching The Last of Us.