C# Custom font training for Tesseract 5 (for Windows users)

Step 1: Download the Latest Version of IronOCR

Install via DLL

Download the IronOcr DLL directly to your machine.

Install via NuGet

Alternatively, you can install through NuGet.

 PM > Install-Package IronOcr

Step 2: Install and set up WSL2 and Ubuntu

Follow the tutorial for setting up WSL2 and Ubuntu.

** Currently, custom font training can be done only on Linux.

Step 3: Install Tesseract 5 on Ubuntu

sudo apt install tesseract-ocr
sudo apt install libtesseract-dev

Step 4: Download the font you would like to train

This tutorial uses the AMGDT font. The font file can be either .ttf or .otf.

Step 5: Mount the disk drive of your working space for the custom font training

WSL2 mounts Windows drives under /mnt automatically, so drive D: is available at /mnt/d. The following commands switch to drive D: as the working space.

cd /
cd /mnt/d

Step 6: Copy the font file to Ubuntu font folder

Copy the font file into both Ubuntu font folders: Ubuntu/usr/share/fonts and Ubuntu/usr/local/share/fonts.

** To access files on Ubuntu, type \\wsl$ in the File Explorer address bar.

Troubleshooting: Destination Folder Access Denied

This issue can be solved by copying the file from the command line:

cd /
su root
cd /mnt/c/Users/Admin/Downloads/'AMGDT Regular'
cp 'AMGDT Regular.ttf' /usr/share/fonts
cp 'AMGDT Regular.ttf' /usr/local/share/fonts
su username
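
Before moving on, it can help to confirm that fontconfig actually sees the copied font. The sketch below is one possible check, assuming the fc-list utility from fontconfig is installed and that the font's listing contains the text "AMGDT". The helper names font_is_listed and check_font are hypothetical and not part of any tool used in this tutorial.

```python
import subprocess


def font_is_listed(fc_list_output: str, name: str) -> bool:
    """Return True if any line of fc-list output mentions `name` (case-insensitive)."""
    needle = name.lower()
    return any(needle in line.lower() for line in fc_list_output.splitlines())


def check_font(name: str = "AMGDT") -> bool:
    """Run fc-list and search its output for the font name (assumed to contain 'AMGDT')."""
    # fc-list prints one line per font file that fontconfig knows about.
    result = subprocess.run(["fc-list"], capture_output=True, text=True, check=True)
    return font_is_listed(result.stdout, name)
```

Calling check_font() from a Python session in WSL2 should return True once the font is installed. If it returns False, refreshing the font cache with fc-cache -f and checking again is a reasonable next step.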

Step 7: Clone tesseract_tutorial from GitHub

The tesseract_tutorial repository can be cloned from https://github.com/astutejoe/tesseract_tutorial.git with the following command:

git clone https://github.com/astutejoe/tesseract_tutorial.git

Step 8: Clone tesstrain and tesseract from GitHub

Go to the tesseract_tutorial folder, then git clone https://github.com/tesseract-ocr/tesstrain and https://github.com/tesseract-ocr/tesseract.

  • tesstrain contains the “Makefile” used to create the .traineddata file (the objective of this tutorial)
  • tesseract contains the “tessdata” folder, which holds the original .traineddata files used as a reference for custom font training

Step 9: Create “data” folder for storing outputs

The “data” folder should be created in tesseract_tutorial/tesstrain.

Step 10: Run split_training_text.py

Return to the tesseract_tutorial folder, then run the following command:

python split_training_text.py

After running split_training_text.py, .box and .tif files are created in the “data” folder.
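
A quick way to confirm that the script produced its output is to count the generated files. The following is a minimal sketch, assuming the files end up somewhere under tesstrain/data (possibly in a subfolder); the function name count_training_files is hypothetical.

```python
from pathlib import Path


def count_training_files(data_dir: str) -> dict:
    """Count .box and .tif files anywhere under data_dir."""
    root = Path(data_dir)
    return {
        ".box": sum(1 for _ in root.rglob("*.box")),
        ".tif": sum(1 for _ in root.rglob("*.tif")),
    }
```

For example, count_training_files("tesstrain/data") returns a dictionary with one count per extension. If the two counts differ, some training lines probably failed to render and are worth investigating before training.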

Troubleshooting: Fontconfig warning: “/tmp/fonts.conf, line 4: empty font directory name ignored”

This issue occurs when the font directory in the Ubuntu folder cannot be found. It can be solved by inserting these lines of code in tesseract_tutorial/fonts.conf:


<dir prefix="xdg">fonts</dir>
<!-- the following element will be removed in the future -->

Then copy it to /etc/fonts

cp fonts.conf /etc/fonts

Finally, add these lines of code to split_training_text.py


fontconf_dir = '/etc/fonts'

Note: Number of training (.box and .tif) files

The number of training files is currently 100. It can be changed by editing or deleting the lines in split_training_text.py that set this limit.
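
To illustrate the idea of limiting the number of training files, here is a minimal sketch of capping a list of training-text lines at a fixed count. It is not the script's actual code: the function name pick_training_lines and its parameters are hypothetical, and it assumes each selected line later yields one .box/.tif pair.

```python
import random


def pick_training_lines(lines, count=100, seed=None):
    """Shuffle the lines and keep at most `count` of them."""
    rng = random.Random(seed)   # a seed makes the selection reproducible
    shuffled = list(lines)      # copy so the caller's list is left untouched
    rng.shuffle(shuffled)
    return shuffled[:count]
```

Raising count (for example to 1000) produces more training samples at the cost of longer image generation and training time.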

Step 11: Download eng.traineddata

eng.traineddata can be found at https://github.com/tesseract-ocr/tessdata_best. Download it into tesseract_tutorial/tesseract/tessdata, because the eng.traineddata in tessdata_best is more accurate than the original one in the tessdata folder.
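
The download can also be scripted. The sketch below is one way to do it, assuming the file is served from the repository's raw path on the main branch (the ENG_URL value is an assumption, so verify it in a browser first). The function names target_path and download_eng are hypothetical.

```python
import urllib.request
from pathlib import Path

# Assumed raw-download URL; check that it matches the repository layout.
ENG_URL = "https://github.com/tesseract-ocr/tessdata_best/raw/main/eng.traineddata"


def target_path(tutorial_dir: str) -> Path:
    """Destination of eng.traineddata inside the tesseract_tutorial folder."""
    return Path(tutorial_dir) / "tesseract" / "tessdata" / "eng.traineddata"


def download_eng(tutorial_dir: str = ".") -> Path:
    """Download eng.traineddata into tesseract/tessdata, overwriting any existing copy."""
    dest = target_path(tutorial_dir)
    dest.parent.mkdir(parents=True, exist_ok=True)
    urllib.request.urlretrieve(ENG_URL, str(dest))
    return dest
```

Run download_eng() from inside the tesseract_tutorial folder, or pass the folder's path as the argument.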

Step 12: Create your custom font .traineddata

Go to the tesstrain folder and run this command in WSL2:

TESSDATA_PREFIX=../tesseract/tessdata make training MODEL_NAME=AMGDT START_MODEL=eng TESSDATA=../tesseract/tessdata MAX_ITERATIONS=100
  • make training = runs the training target in tesstrain/Makefile
  • MODEL_NAME = the name of your custom font
  • START_MODEL = the name of the original .traineddata to start from
  • MAX_ITERATIONS = the number of training iterations (a larger number generally yields a more accurate .traineddata)
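
When experimenting with different values, it may be convenient to assemble the same command programmatically. This is a minimal sketch that mirrors the variables listed above. The function names build_training_command and run_training are hypothetical, and the tesstrain_dir default assumes the script is started from the tesseract_tutorial folder.

```python
import os
import subprocess


def build_training_command(model_name, start_model="eng",
                           tessdata="../tesseract/tessdata", max_iterations=100):
    """Return the argument list for `make training` with the variables from this step."""
    return [
        "make", "training",
        f"MODEL_NAME={model_name}",
        f"START_MODEL={start_model}",
        f"TESSDATA={tessdata}",
        f"MAX_ITERATIONS={max_iterations}",
    ]


def run_training(model_name, tesstrain_dir="tesstrain",
                 tessdata="../tesseract/tessdata", **kwargs):
    """Run the training inside the tesstrain folder (assumed relative path)."""
    cmd = build_training_command(model_name, tessdata=tessdata, **kwargs)
    # TESSDATA_PREFIX is passed through the environment, as in the shell command.
    env = dict(os.environ, TESSDATA_PREFIX=tessdata)
    return subprocess.run(cmd, cwd=tesstrain_dir, env=env, check=True)
```

For example, run_training("AMGDT", max_iterations=3000) corresponds to the command above with a larger iteration count.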

Troubleshooting: “Failed to read data from: ”

This issue can be solved by editing the following lines of code in the Makefile:

WORDLIST_FILE := $(OUTPUT_DIR2)/$(MODEL_NAME).lstm-word-dawg
NUMBERS_FILE := $(OUTPUT_DIR2)/$(MODEL_NAME).lstm-number-dawg
PUNC_FILE := $(OUTPUT_DIR2)/$(MODEL_NAME).lstm-punc-dawg

Troubleshooting: “Failed to load script unicharset from:data/langdata/Latin.unicharset”

This issue can be solved by placing Latin.unicharset in the tesstrain/data/langdata folder. Latin.unicharset can be found at https://github.com/tesseract-ocr/langdata_lstm.

Step 13: The accuracy of the created .traineddata

With 1000 .box and .tif files and 3000 training iterations, the output .traineddata (AMGDT.traineddata) reaches a minimum training error rate (BCER) of around 5.77.
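
To check the trained model against your own samples, a character error rate can be computed from the recognized text and the expected text. The sketch below uses the generic edit-distance definition. It is an illustration only and may not match how Tesseract computes the BCER value in its training log; the function names are hypothetical.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: insertions, deletions and substitutions each cost 1."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,           # deletion
                            curr[j - 1] + 1,       # insertion
                            prev[j - 1] + cost))   # substitution or match
        prev = curr
    return prev[-1]


def char_error_rate(reference: str, recognized: str) -> float:
    """Character error rate in percent, relative to the reference length."""
    if not reference:
        raise ValueError("reference text must not be empty")
    return 100.0 * edit_distance(reference, recognized) / len(reference)
```

Here reference is the known correct text of a test image and recognized is the OCR output obtained with the new .traineddata. A lower value means fewer character mistakes.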


ref: https://www.youtube.com/watch?v=KE4xEzFGSU8