C# Custom font training for Tesseract 5 (for Windows users)
Use custom font training for Tesseract 5 to improve the OCR engine's accuracy on specific fonts or font styles that are not well supported by default.
The process involves providing Tesseract with training data, such as font samples and the corresponding text, so that it can learn the characteristics and patterns of the custom font.
How to Use Tesseract Custom Font in C#
- Download a C# library to train a custom font with Tesseract
- Prepare the target font file to be used for training
- Follow the steps specified in this article
- Consult the troubleshooting notes for commonly encountered errors
- Export the trained data file for further use
Step 1: Download the Latest Version of IronOCR
Install via DLL
Download the IronOcr DLL directly to your machine.
Install via NuGet
Alternatively, you can install through NuGet.
Install-Package IronOcr
Step 2: Install and set up WSL2 and Ubuntu
Follow the tutorial for setting up WSL2 and Ubuntu.
Note: Currently, custom font training can be done only on Linux.
Step 3: Install Tesseract 5 on Ubuntu
sudo apt install tesseract-ocr
sudo apt install libtesseract-dev
Step 4: Download the font you would like to train
This tutorial uses the AMGDT font. The font file can be either .ttf or .otf.
Step 5: Mount the disk drive of your working space for the custom font training
The following commands switch to drive D: as the working space. WSL2 makes Windows drives available under /mnt, so drive D: is found at /mnt/d.
cd /
cd /mnt/d
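To make the drive mapping concrete, here is a minimal Python sketch that converts a Windows path into the path WSL2 would expose by default. The helper name windows_to_wsl_path is hypothetical, and the sketch assumes the default /mnt/<drive letter> automount; it handles only simple drive-letter paths.

```python
from pathlib import PureWindowsPath


def windows_to_wsl_path(windows_path: str) -> str:
    """Map an absolute drive-letter Windows path to its default WSL2 mount path.

    Hypothetical helper for illustration only. It assumes the default
    /mnt/<letter> automount and rejects UNC and drive-relative paths.
    """
    path = PureWindowsPath(windows_path)
    drive = path.drive  # e.g. "D:"
    if len(drive) != 2 or not drive[0].isalpha() or drive[1] != ":" or not path.root:
        raise ValueError(f"expected an absolute drive-letter path, got {windows_path!r}")
    # parts[0] is the anchor ("D:\\"); the rest are folder and file names.
    return "/".join(["/mnt", drive[0].lower(), *path.parts[1:]])
```

For example, windows_to_wsl_path("D:\\") gives /mnt/d, which is the folder used in the commands above.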
Step 6: Copy the font file to the Ubuntu font folder
The Ubuntu font folders are Ubuntu/usr/share/fonts and Ubuntu/usr/local/share/fonts.
Note: To access files on Ubuntu, type \\wsl$ in the File Explorer address bar.
Troubleshooting: Destination Folder Access Denied
This issue can be solved by copying the file from the command line:
cd /
su root
cd /mnt/c/Users/Admin/Downloads/'AMGDT Regular'
cp 'AMGDT Regular.ttf' /usr/share/fonts
cp 'AMGDT Regular.ttf' /usr/local/share/fonts
su username
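After copying, it can be worth confirming that fontconfig actually sees the font before continuing. The Python sketch below is one possible check, not part of the original tutorial: it parses the output of "fc-list : family", which prints one font per line with alternative family names separated by commas. The function names are hypothetical, and the family name "AMGDT" is an assumption based on the font file name.

```python
import subprocess


def family_is_listed(fc_list_output: str, family: str) -> bool:
    """Return True if `family` appears in the output of `fc-list : family`."""
    wanted = family.casefold()
    for line in fc_list_output.splitlines():
        # One font per line; alternative family names are comma-separated.
        names = [name.strip().casefold() for name in line.split(",")]
        if wanted in names:
            return True
    return False


def font_installed(family: str) -> bool:
    """Ask fontconfig whether `family` is available (requires fc-list)."""
    result = subprocess.run(
        ["fc-list", ":", "family"], capture_output=True, text=True, check=True
    )
    return family_is_listed(result.stdout, family)
```

Calling font_installed("AMGDT") inside Ubuntu should return True once the copy has succeeded. If it does not, refreshing the font cache with fc-cache -f may help.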
Step 7: Clone tesseract_tutorial from GitHub
The tesseract_tutorial repository can be cloned from https://github.com/astutejoe/tesseract_tutorial.git by using the following command:
git clone https://github.com/astutejoe/tesseract_tutorial.git
Step 8: Clone tesstrain and tesseract from GitHub
Go to the tesseract_tutorial folder, then git clone https://github.com/tesseract-ocr/tesstrain and https://github.com/tesseract-ocr/tesseract
- tesstrain contains the Makefile that is used to create the .traineddata file (the objective of this tutorial)
- tesseract contains the tessdata folder, which holds the original .traineddata files used as a reference for custom font training
Step 9: Create a “data” folder for storing outputs
The “data” folder should be created in tesseract_tutorial/tesstrain.
Step 10: Run split_training_text.py
Return to the tesseract_tutorial folder, then run the following command:
python split_training_text.py
After it runs, split_training_text.py creates .box and .tif files in the “data” folder.
Troubleshooting: Fontconfig warning: “/tmp/fonts.conf, line 4: empty font directory name ignored”
This issue occurs when the font directories in Ubuntu cannot be found. It can be solved by inserting these lines of code in tesseract_tutorial/fonts.conf:
<dir>/usr/share/fonts</dir>
<dir>/usr/local/share/fonts</dir>
<dir prefix="xdg">fonts</dir>
<!-- the following element will be removed in the future -->
<dir>~/.fonts</dir>
Then copy it to /etc/fonts:
cp fonts.conf /etc/fonts
Finally, add this line of code to split_training_text.py:
fontconf_dir = '/etc/fonts'
Note: Number of training (.box and .tif) files
Currently the number of training files is 100. This number can be changed by editing or deleting the relevant lines of code in split_training_text.py.
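To show how a line count translates into a number of training files, here is a simplified Python sketch of this kind of preparation step. It is not the code of split_training_text.py: the function names, the eng_<n> file naming, and the font name "AMGDT" passed to text2image are assumptions made for illustration. Each selected line is written to its own .gt.txt file, and a text2image command is built that would render it to a matching .tif and .box pair.

```python
import random
from pathlib import Path
from typing import List, Optional


def pick_training_lines(training_text: str, count: int = 100,
                        seed: Optional[int] = None) -> List[str]:
    """Shuffle the non-empty lines and keep at most `count` of them."""
    lines = [line.strip() for line in training_text.splitlines() if line.strip()]
    random.Random(seed).shuffle(lines)
    return lines[:count]


def text2image_command(font: str, text_file: Path, output_base: Path) -> List[str]:
    """Build a text2image call; it writes <output_base>.tif and <output_base>.box."""
    return [
        "text2image",
        f"--font={font}",
        f"--text={text_file}",
        f"--outputbase={output_base}",
        "--max_pages=1",
    ]


def write_ground_truth(lines: List[str], output_dir: Path,
                       font: str = "AMGDT") -> List[List[str]]:
    """Write one .gt.txt file per line and return the matching commands."""
    output_dir.mkdir(parents=True, exist_ok=True)
    commands = []
    for index, line in enumerate(lines):
        base = output_dir / f"eng_{index}"  # hypothetical naming scheme
        text_file = output_dir / f"eng_{index}.gt.txt"
        text_file.write_text(line + "\n", encoding="utf-8")
        commands.append(text2image_command(font, text_file, base))
    return commands
```

In this sketch, raising count from 100 to 1000 produces ten times as many training files, provided the training text contains enough lines. Each returned command can be passed to subprocess.run inside Ubuntu, where text2image is available.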
Step 11: Download eng.traineddata
eng.traineddata can be found at the following URL: https://github.com/tesseract-ocr/tessdata_best. Download it into tesseract_tutorial/tesseract/tessdata, because the eng.traineddata in tessdata_best is more accurate than the original one in the tessdata folder.
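Before starting the training, it can help to confirm that the working folder has the layout built up in the steps so far. The Python sketch below is an optional check and an assumption about how one might automate it; the function name and the path list are illustrative, with the paths taken from the folders and files named in this tutorial.

```python
from pathlib import Path
from typing import List

# Paths named in the steps above, relative to the tesseract_tutorial folder.
EXPECTED_PATHS = [
    "split_training_text.py",
    "tesstrain/Makefile",
    "tesstrain/data",
    "tesseract/tessdata/eng.traineddata",
]


def missing_prerequisites(tutorial_root: Path) -> List[str]:
    """Return the expected paths that do not exist under `tutorial_root`."""
    return [rel for rel in EXPECTED_PATHS if not (tutorial_root / rel).exists()]
```

Running missing_prerequisites(Path("/mnt/d/tesseract_tutorial")) returns an empty list when everything is in place; the /mnt/d location is only an example and depends on where the repository was cloned.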
Step 12: Create your custom font .traineddata
Go to the tesstrain folder and run this command in WSL2:
TESSDATA_PREFIX=../tesseract/tessdata make training MODEL_NAME=AMGDT START_MODEL=eng TESSDATA=../tesseract/tessdata MAX_ITERATIONS=100
- make training = runs the training target in tesstrain/Makefile
- MODEL_NAME = the name of your custom font
- START_MODEL = the name of the original .traineddata
- MAX_ITERATIONS = the number of training iterations (a bigger number generally means a more accurate .traineddata)
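When the training is repeated with different settings, it can be convenient to assemble the command from its parameters. The following Python sketch rebuilds the command shown above; the function names are hypothetical, and it assumes that make and Tesseract are installed in Ubuntu and that the relative tessdata path is resolved from the tesstrain folder.

```python
import os
import subprocess
from typing import Dict, List, Tuple


def training_command(model_name: str, start_model: str, tessdata: str,
                     max_iterations: int) -> Tuple[Dict[str, str], List[str]]:
    """Return the extra environment variables and argv for `make training`."""
    env = {"TESSDATA_PREFIX": tessdata}
    argv = [
        "make", "training",
        f"MODEL_NAME={model_name}",
        f"START_MODEL={start_model}",
        f"TESSDATA={tessdata}",
        f"MAX_ITERATIONS={max_iterations}",
    ]
    return env, argv


def run_training(tesstrain_dir: str, **params) -> None:
    """Run the training from the tesstrain folder; raises if make fails."""
    env, argv = training_command(**params)
    subprocess.run(argv, cwd=tesstrain_dir, env={**os.environ, **env}, check=True)
```

For example, run_training("tesstrain", model_name="AMGDT", start_model="eng", tessdata="../tesseract/tessdata", max_iterations=100) issues the same command as the one above when called from the tesseract_tutorial folder.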
Troubleshooting: “Failed to read data from: ”
This issue can be solved by editing these lines of code in tesstrain/Makefile:
WORDLIST_FILE := $(OUTPUT_DIR2)/$(MODEL_NAME).lstm-word-dawg
NUMBERS_FILE := $(OUTPUT_DIR2)/$(MODEL_NAME).lstm-number-dawg
PUNC_FILE := $(OUTPUT_DIR2)/$(MODEL_NAME).lstm-punc-dawg
Troubleshooting: “Failed to load script unicharset from:data/langdata/Latin.unicharset”
This issue can be solved by placing Latin.unicharset in the tesstrain/data/langdata folder. Latin.unicharset can be found at the following URL: https://github.com/tesseract-ocr/langdata_lstm
Step 13: The accuracy of the created .traineddata
With 1000 .box and .tif files and 3000 training iterations, the output .traineddata (AMGDT.traineddata) has a minimal training error rate (BCER) of around 5.77.
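One quick way to try the trained file is to run the tesseract command-line tool against a sample image. The Python sketch below builds such a call; the function names and the example paths are hypothetical, and it assumes that AMGDT.traineddata is located in the folder passed as tessdata_dir.

```python
import subprocess
from typing import List


def ocr_command(image_path: str, tessdata_dir: str,
                model_name: str = "AMGDT") -> List[str]:
    """Build a tesseract CLI call that prints the recognized text to stdout."""
    return [
        "tesseract", image_path, "stdout",
        "--tessdata-dir", tessdata_dir,  # folder containing <model_name>.traineddata
        "-l", model_name,
    ]


def recognize(image_path: str, tessdata_dir: str,
              model_name: str = "AMGDT") -> str:
    """Run tesseract and return its text output; raises if the call fails."""
    result = subprocess.run(
        ocr_command(image_path, tessdata_dir, model_name),
        capture_output=True, text=True, check=True,
    )
    return result.stdout
```

Comparing the text returned by recognize("sample.png", "data") with the text that is actually in the image gives a practical impression of the accuracy, in addition to the training error rate.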
For further reading and reference: https://www.youtube.com/watch?v=KE4xEzFGSU8