C# Custom Font Training for Tesseract 5 (for Windows Users)
Utilize custom font training for Tesseract 5 to improve the accuracy and recognition capabilities of the OCR engine when working with specific fonts or font styles that may not be well-supported by default.
The process involves providing Tesseract with training data, such as font samples and corresponding text, so that it can learn the specific characteristics and patterns of the custom fonts.
How to Use Tesseract Custom Font in C#
- Download a C# library for training a custom font with Tesseract
- Prepare the font file to be used for training
- Follow the steps in this article
- Consult the troubleshooting notes for commonly encountered errors
- Export the trained data file for further use
Step 1: Download the Latest Version of IronOCR
Install via DLL
Download the IronOcr DLL directly to your machine.
Install via NuGet
Alternatively, you can install through NuGet with the following command:
Install-Package IronOcr
Step 2: Install and Set Up WSL2 and Ubuntu
Refer to the tutorial on Setting up WSL2 and Ubuntu.
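Please note: custom font training for Tesseract is currently supported only on Linux, which is why this tutorial runs it on Windows through WSL2 and Ubuntu.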
Step 3: Install Tesseract 5 on Ubuntu
Use the following commands to install Tesseract 5:
sudo apt install tesseract-ocr
sudo apt install libtesseract-dev
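Depending on the Ubuntu release, the packaged tesseract-ocr may be older than version 5, so it can be worth checking what was actually installed. The sketch below assumes the first line of the tesseract --version output looks like "tesseract 5.3.4"; the helper name tesseract_major is made up for this example.

```shell
# Extract the major version from the first line of `tesseract --version`
# output, e.g. "tesseract 5.3.4" gives "5". The output format is an assumption.
tesseract_major() {
    printf '%s\n' "$1" | head -n 1 | sed -E 's/^tesseract[[:space:]]+v?([0-9]+).*/\1/'
}

# Only query the real binary when it is installed.
if command -v tesseract >/dev/null 2>&1; then
    major="$(tesseract_major "$(tesseract --version 2>&1)")"
    if [ "$major" -ge 5 ] 2>/dev/null; then
        echo "Tesseract $major detected"
    else
        echo "Tesseract $major detected; this tutorial assumes version 5" >&2
    fi
fi
```

If the reported version is below 5, a newer package source or a build from source would be needed before continuing.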
Step 4: Download the Font You Would Like to Train
We are using the AMGDT font for this tutorial. The font file can be either .ttf or .otf.
Step 5: Mount the Disk Drive of Your Working Space for Custom Font Training
Use the commands below to switch to drive D:, which WSL makes available at /mnt/d by default, as your working space.
cd /
cd /mnt/d
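For other drives or folders, the same mapping applies: under the default WSL configuration, a Windows drive letter becomes a lowercase folder under /mnt. WSL ships a wslpath utility for this conversion; the function below is only an illustrative sketch of the default mapping, and the name win_to_wsl is hypothetical.

```shell
# Convert a Windows path such as 'D:\work' to its default WSL mount path.
# Assumes the default automount root (/mnt); a custom wsl.conf may differ.
win_to_wsl() {
    drive="$(printf '%s' "$1" | cut -c1 | tr '[:upper:]' '[:lower:]')"
    rest="$(printf '%s' "$1" | cut -c3- | tr '\\' '/')"
    printf '/mnt/%s%s\n' "$drive" "$rest"
}

win_to_wsl 'D:'   # prints /mnt/d
```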
Step 6: Copy the Font File to Ubuntu Font Folder
Copy the font file to the following Ubuntu directories: /usr/share/fonts and /usr/local/share/fonts.
Access files in Ubuntu by typing \\wsl$ in the File Explorer address bar.
Troubleshooting: Destination Folder Access Denied
If you encounter access denied errors, resolve this by using the command line to copy the files.
cd /
su root
cd /mnt/c/Users/Admin/Downloads/'AMGDT Regular'
cp 'AMGDT Regular.ttf' /usr/share/fonts
cp 'AMGDT Regular.ttf' /usr/local/share/fonts
exit
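After copying, it may help to confirm that fontconfig can see the font before moving on. The sketch below relies on fc-list from the fontconfig package and assumes the family name reported for this font contains "AMGDT"; the helper name font_listed is made up for this example.

```shell
# Succeeds when the family name in $1 appears in the font listing read from stdin.
font_listed() {
    grep -qiF -- "$1"
}

# fc-list is part of fontconfig; skip the check when it is not installed.
if command -v fc-list >/dev/null 2>&1; then
    if fc-list | font_listed "AMGDT"; then
        echo "AMGDT is visible to fontconfig"
    else
        echo "AMGDT not found; try 'sudo fc-cache -f' and re-check the copy" >&2
    fi
fi
```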
Step 7: Clone tesseract_tutorial from GitHub
Clone the tesseract_tutorial repository using the following command:
git clone https://github.com/astutejoe/tesseract_tutorial.git
Step 8: Clone tesstrain and tesseract from GitHub
Navigate to the tesseract_tutorial directory, then clone the tesstrain and tesseract repositories:
git clone https://github.com/tesseract-ocr/tesstrain
git clone https://github.com/tesseract-ocr/tesseract
- tesstrain contains the "Makefile" used to create the .traineddata file.
- tesseract contains the "tessdata" folder, which includes the original .traineddata files used for reference during custom font training.
Step 9: Create a "data" Folder for Storing Outputs
Create a "data" folder within tesseract_tutorial/tesstrain.
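From the command line, this step could look like the following, assuming the current directory is still tesseract_tutorial after Step 8:

```shell
# Run from inside the tesseract_tutorial folder.
# -p creates missing parent folders and does nothing if "data" already exists.
mkdir -p tesstrain/data
```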
Step 10: Run split_training_text.py
Return to the tesseract_tutorial folder and execute the following command:
python split_training_text.py
Running split_training_text.py creates .box and .tif files in the "data" folder.
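A quick way to confirm the script produced matching training files is to count them. The sketch below is an assumption-laden example: the folder name tesstrain/data/AMGDT-ground-truth follows the tesstrain naming convention but depends on what split_training_text.py is configured to write, and count_ext is a hypothetical helper.

```shell
# Count regular files with extension $2 (no dot) directly inside directory $1.
count_ext() {
    find "$1" -maxdepth 1 -type f -name "*.$2" | wc -l | tr -d '[:space:]'
}

# Hypothetical output folder; adjust to wherever split_training_text.py writes.
GROUND_TRUTH_DIR="tesstrain/data/AMGDT-ground-truth"

if [ -d "$GROUND_TRUTH_DIR" ]; then
    boxes="$(count_ext "$GROUND_TRUTH_DIR" box)"
    tifs="$(count_ext "$GROUND_TRUTH_DIR" tif)"
    echo "box files: $boxes, tif files: $tifs"
    [ "$boxes" -eq "$tifs" ] || echo "warning: .box and .tif counts differ" >&2
fi
```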
Troubleshooting: Fontconfig Warning
If you see the warning
Fontconfig warning: "/tmp/fonts.conf, line 4: empty font directory name ignored"
the font directories are missing from the configuration. Solve this by editing the tesseract_tutorial/fonts.conf file and adding:
<dir>/usr/share/fonts</dir>
<dir>/usr/local/share/fonts</dir>
<dir prefix="xdg">fonts</dir>
<!-- the following element will be removed in the future -->
<dir>~/.fonts</dir>
Copy it to /etc/fonts with:
cp fonts.conf /etc/fonts
Additionally, update split_training_text.py:
fontconf_dir = '/etc/fonts'
Note: Number of Training (.box and .tif) Files
The current number of training files is set to 100. You can modify this in split_training_text.py.
Step 11: Download eng.traineddata
Download eng.traineddata from this repository and place it in tesseract_tutorial/tesseract/tessdata.
Step 12: Create Your Custom Font .traineddata
Navigate to the tesstrain folder and use the command below in WSL2:
TESSDATA_PREFIX=../tesseract/tessdata make training MODEL_NAME=AMGDT START_MODEL=eng TESSDATA=../tesseract/tessdata MAX_ITERATIONS=100
- MODEL_NAME is your custom font name.
- START_MODEL is the original .traineddata used as the reference.
- MAX_ITERATIONS defines the number of training iterations (more iterations can improve the accuracy of the resulting .traineddata).
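To make these parameters easier to vary between runs, the command can be assembled from shell variables. The sketch below only prints the command for review rather than running it, and first checks that the start model from Step 11 is in place; the helper names start_model_present and training_command are made up for this example.

```shell
MODEL_NAME=AMGDT
START_MODEL=eng
TESSDATA=../tesseract/tessdata
MAX_ITERATIONS=3000   # value reported in Step 13; the command above uses 100

# Succeeds when the start model file exists: $1 = tessdata dir, $2 = model name.
start_model_present() {
    [ -f "$1/$2.traineddata" ]
}

# Print the training command instead of running it, so it can be reviewed first.
training_command() {
    printf 'TESSDATA_PREFIX=%s make training MODEL_NAME=%s START_MODEL=%s TESSDATA=%s MAX_ITERATIONS=%s\n' \
        "$TESSDATA" "$MODEL_NAME" "$START_MODEL" "$TESSDATA" "$MAX_ITERATIONS"
}

if start_model_present "$TESSDATA" "$START_MODEL"; then
    training_command
else
    echo "missing $TESSDATA/$START_MODEL.traineddata (see Step 11)" >&2
fi
```

Run it from the tesstrain folder so the relative tessdata path resolves, then paste the printed command to start training.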
Troubleshooting: "Failed to Read Data" in Makefile
To resolve "Failed to read data" issues, set the following variables in the Makefile as shown:
WORDLIST_FILE := $(OUTPUT_DIR2)/$(MODEL_NAME).lstm-word-dawg
NUMBERS_FILE := $(OUTPUT_DIR2)/$(MODEL_NAME).lstm-number-dawg
PUNC_FILE := $(OUTPUT_DIR2)/$(MODEL_NAME).lstm-punc-dawg
Troubleshooting: "Failed to Load Script Unicharset"
Insert Latin.unicharset into the tesstrain/data/langdata folder. The Latin.unicharset file can be found here.
Step 13: Verify the Accuracy of Created .traineddata
With 1000 .box and .tif files and 3000 iterations of training, the output .traineddata (AMGDT.traineddata) achieves a minimal training error rate (BCER) of around 5.77.
For further reading and reference, see the tutorial: YouTube Video
Frequently Asked Questions
What is the purpose of custom font training in Tesseract?
Custom font training in Tesseract is used to improve the accuracy and recognition capabilities of the OCR engine when dealing with specific fonts or font styles that may not be well-supported by default.
How can I begin using custom fonts with Tesseract in C#?
To begin using custom fonts with Tesseract in C#, you need to download a C# library like IronOCR, prepare the target font file for training, and follow the outlined steps in the guide.
What are the installation steps for IronOCR?
IronOCR can be installed either by downloading the DLL directly or through NuGet using the command: Install-Package IronOcr.
Can custom font training be done on Windows?
Custom font training for Tesseract is currently supported only on Linux, which can be set up on Windows using WSL2 and Ubuntu.
What command is used to install Tesseract 5 on Ubuntu?
To install Tesseract 5 on Ubuntu, use the following commands: sudo apt install tesseract-ocr and sudo apt install libtesseract-dev.
How do you handle 'Destination Folder Access Denied' errors during font file copying?
If you encounter 'Destination Folder Access Denied' errors, resolve this by using the command line to copy the files with root access.
What is the role of the 'tesstrain' repository?
The 'tesstrain' repository contains the 'Makefile' used to create the '.traineddata' file necessary for custom font training in Tesseract.
How do you address the 'Fontconfig warning' during training?
To address the 'Fontconfig warning', add font directory paths in the 'fonts.conf' file and copy it to '/etc/fonts'.
What should be done if 'Failed to Load Script Unicharset' occurs?
Insert 'Latin.unicharset' into the 'tesstrain/data/langdata' folder to resolve the 'Failed to Load Script Unicharset' issue.
How is the accuracy of the trained data verified?
Accuracy is judged by the minimal training error rate (BCER) reached during training. In this tutorial, 1000 training files and 3000 iterations produced a BCER of around 5.77.