In this tutorial, we walk through training Tesseract 5 OCR with custom fonts. After downloading IronOCR for Windows, we set up a Linux environment using WSL and Ubuntu for training. The tutorial lists the commands that install the required packages and libraries. Custom fonts are added by copying the font files to designated directories and updating configuration files. We then download the tutorial files from GitHub repositories and adjust paths and settings for the custom fonts. The guide explains how to generate the box and TIFF image files that training requires, and how to change file extensions for compatibility. Replacing the default training data with the enhanced files from GitHub lets us create a .traineddata file for the custom font. The training run is set to 100 iterations; increasing the iterations and the number of training sets is recommended for better accuracy. By the end, you should be able to train OCR systems to recognize custom fonts, extending what OCR libraries can handle.
Below are the steps with code examples to guide you through the setup and training process:
# Update and upgrade your system
sudo apt update && sudo apt upgrade -y
# Install essential packages
sudo apt install wget libleptonica-dev libtesseract-dev tesseract-ocr tesseract-ocr-eng git -y
# Clone the main Tesseract repository (the later steps run from inside it)
git clone https://github.com/tesseract-ocr/tesseract.git
The Bash script above updates your Ubuntu system, installs the libraries Tesseract OCR needs, and clones the Tesseract repository.
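Before moving on, it can help to confirm that the tools used later in the tutorial are actually on your PATH. The sketch below is an illustration, not part of the original tutorial: `missing_tools` is a hypothetical helper name, and the tool list assumes `mogrify` will come from the imagemagick package installed in a later step.

```shell
#!/bin/sh
# Print the name of every listed command that is not found on PATH.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# Tools used in this tutorial; 'mogrify' ships with the imagemagick package.
missing=$(missing_tools wget git tesseract mogrify)
if [ -n "$missing" ]; then
    echo "Not found on PATH:" $missing
else
    echo "All required tools are available."
fi
```

If a tool is reported as missing, rerun the matching `apt install` command before continuing.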
# Navigate to the root of the newly-cloned repository
cd tesseract
# Create directories for custom fonts
mkdir -p training_data/custom_fonts
# Copy your custom font files to the directory
# Make sure you replace 'path_to_fonts' with your actual font path
cp /path_to_fonts/*.ttf training_data/custom_fonts/
# Update configuration files, if necessary, to include new fonts
# Using a text editor such as nano or vim, edit the file that maps font paths
The second script copies the custom font files into the training directory and notes where configuration files may need to be updated to include the new fonts.
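A quick check that the copy actually landed can save a confusing failure later. This is a minimal sketch, assuming the fonts are `.ttf` files placed directly in `training_data/custom_fonts` as in the script above; `count_fonts` is a hypothetical helper name.

```shell
#!/bin/sh
# Count the .ttf files directly inside a directory.
count_fonts() {
    find "$1" -maxdepth 1 -type f -name '*.ttf' | wc -l | tr -d ' '
}

font_dir="training_data/custom_fonts"   # directory created in the step above
if [ -d "$font_dir" ] && [ "$(count_fonts "$font_dir")" -gt 0 ]; then
    echo "Found $(count_fonts "$font_dir") font file(s) in $font_dir"
else
    echo "No fonts found in $font_dir; check the source path used with cp"
fi
```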
# Install necessary packages for font and image manipulation
sudo apt install fonttools imagemagick -y
# Convert training images to TIFF format, necessary for Tesseract training
# This command assumes 'images' directory contains your raw image files
mogrify -format tiff images/*.png
This segment installs additional image processing tools and converts images to the TIFF format, which is needed for Tesseract training.
# Generate box files from your TIFF images
# Necessary to pair with TIFF files for training
for file in images/*.tiff; do
    # Generates a .box file for each .tiff image
    # (the lstmbox config writes box files; box.train expects them to exist already)
    tesseract "$file" "${file%.*}" -l eng lstmbox
done
Above, we loop through each TIFF image to generate corresponding box files, which are essential for training.
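Because training needs every image to have a box file with the same base name, a quick pairing check is worth running at this point. The following is a sketch under the assumption that images and box files sit side by side in `images/`; `missing_boxes` is a hypothetical helper name.

```shell
#!/bin/sh
# Print each .tiff in a directory that has no .box file with the same base name.
missing_boxes() {
    for img in "$1"/*.tiff; do
        [ -e "$img" ] || continue          # glob matched nothing
        [ -f "${img%.tiff}.box" ] || printf '%s\n' "$img"
    done
}

unpaired=$(missing_boxes images)
if [ -n "$unpaired" ]; then
    echo "Images without a box file:"
    echo "$unpaired"
else
    echo "Every TIFF in images/ has a box file."
fi
```

Box files generated by OCR inherit any recognition mistakes, so it is worth reviewing and correcting them before training.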
# Generate LSTM training files (.lstmf) from each TIFF and box file pair
# Tesseract accepts one image per invocation, so process the files in a loop
for file in images/*.tiff; do
    tesseract "$file" "${file%.*}" --psm 6 lstm.train
done
# The iterative training run itself is performed by lstmtraining,
# whose --max_iterations option sets the number of iterations
Finally, this script prepares the training files that the training run consumes. Adjust the number of iterations to suit the complexity of your font.
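The `lstm.train` config writes `.lstmf` files; the iterations themselves are run by the separate `lstmtraining` tool, which reads a text file listing those `.lstmf` paths. The sketch below builds that list file and shows, commented out, one way a 100-iteration fine-tuning run might look. The helper name `write_listfile`, every path, and the choice of fine-tuning from an `eng.traineddata` model (assumed to be the float model from the tessdata_best repository) are assumptions for illustration, not the tutorial's exact setup.

```shell
#!/bin/sh
# Write the paths of all .lstmf files in a directory to a list file, one per line.
write_listfile() {
    : > "$2"                                # create or truncate the list file
    for f in "$1"/*.lstmf; do
        [ -e "$f" ] || continue             # glob matched nothing
        printf '%s\n' "$f" >> "$2"
    done
}

# Hypothetical paths: adjust to wherever your .lstmf files were written.
if [ -d images ]; then
    write_listfile images images/train_list.txt
fi

# Assumed fine-tuning run, left commented out because the model paths are placeholders:
# combine_tessdata -e eng.traineddata eng.lstm
# lstmtraining \
#   --continue_from eng.lstm \
#   --traineddata eng.traineddata \
#   --train_listfile images/train_list.txt \
#   --model_output output/custom_font \
#   --max_iterations 100
# lstmtraining --stop_training \
#   --continue_from output/custom_font_checkpoint \
#   --traineddata eng.traineddata \
#   --model_output custom_font.traineddata
```

Raising `--max_iterations` and adding more training images are the two knobs the tutorial recommends for better accuracy.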
Further Reading: C# Custom font training for Tesseract 5 (for Windows users)