Optical Character Recognition (OCR) is a technology that converts different types of documents, such as scanned images, PDF files, and digital camera photos, into editable and searchable data. It is widely used to turn printed documents into digital form for editing, searching, and storage, which reduces the physical space documents occupy. OCR also plays a major role in automating data entry, saving companies and organizations time by reducing manual labor.
OCR uses machine learning and pattern recognition to extract text from images. Recent developments have improved its accuracy and extended its support to more languages and complex scripts such as Arabic. OCR has become an indispensable tool in finance, healthcare, law, and education, where large volumes of printed documents must be processed quickly. This article uses Tesseract and IronOCR to recognize text in images that contain multiple languages.
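How to OCR multiple languages with IronOCR:
1. Install the IronOCR/Tesseract NuGet package in your .NET project.
2. Create an instance of the IronTesseract class, which will initialize the OCR engine.
3. Set the Language property to include the languages you need.
4. Load the image into an OcrInput object.
5. Process the image using the Read function of the IronTesseract instance.
Tesseract is an open-source Optical Character Recognition engine developed by Hewlett-Packard and later maintained by Google. It is known for its high accuracy and adaptability, making it one of the most prominent OCR engines. Tesseract supports script detection, recognizes text in many languages, and can handle several languages in the same document, so it is commonly used in projects that must process multilingual documents.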
The Tesseract OCR engine analyzes the pixels of an image, matching patterns that represent characters, words, and sentences, and converts them into machine-readable text. It accepts many image file types, such as TIFF, JPEG, and PNG, and can produce output in formats like plain text, HTML, and searchable PDF.
One of Tesseract's significant advantages is that it can be trained to recognize particular fonts or additional languages. It is used in applications ranging from simple text extraction to complex tasks such as digitizing historical documents, processing invoices, and accessibility software that enables reading for the visually impaired.
Open Visual Studio and go to the File menu. Select New Project, then choose Console Application. In this article, we will run the OCR examples from a console program.
Enter your project's name and location in the text boxes provided. Then select the .NET Framework version you need and click the Create button.
Once the framework version has been selected, Visual Studio generates the project structure and opens the Program.cs file, where you can add code and build/run the application.
The first step is downloading and installing the Tesseract OCR software on your computer. Here is the official Tesseract GitHub repository with the Tesseract installer: https://github.com/tesseract-ocr/tesseract.
Install Tesseract OCR by following the setup instructions specific to your operating system, whether that is Windows, macOS, or Linux. Once it is installed, add the Tesseract.NET package to your C# project using Visual Studio's NuGet Package Manager.
Open the NuGet Package Manager in your Visual Studio project from Tools -> NuGet Package Manager -> Manage NuGet Packages for Solution. Search for "Tesseract" to find either the "Tesseract" or "Tesseract.NET" package. Select the package and click the Install button to add it to your project.
After installing the Tesseract.NET wrapper, configure Tesseract in your C# project by specifying the location of the language data files (the tessdata folder). Here's an example:
using System;
using System.Drawing;
using Tesseract;

class Program
{
    static void Main()
    {
        // Set the path to the Tesseract data files (traineddata files)
        string tessDataPath = @"./tessdata"; // Ensure this directory contains the language data files

        // Load the image
        string imagePath = @"path_to_your_image.png";
        using (var img = Pix.LoadFromFile(imagePath))
        {
            // Add languages to the Tesseract engine
            using (var engine = new TesseractEngine(tessDataPath, "eng+spa+fra", EngineMode.Default))
            {
                using (var page = engine.Process(img))
                {
                    // Extract the text
                    string text = page.GetText();
                    Console.WriteLine("Recognized Text:");
                    Console.WriteLine(text);
                }
            }
        }
    }
}
The same example in VB.NET:
Imports System
Imports System.Drawing
Imports Tesseract

Friend Class Program
    Shared Sub Main()
        ' Set the path to the Tesseract data files (traineddata files)
        Dim tessDataPath As String = "./tessdata" ' Ensure this directory contains the language data files

        ' Load the image
        Dim imagePath As String = "path_to_your_image.png"
        Using img = Pix.LoadFromFile(imagePath)
            ' Add languages to the Tesseract engine
            Using engine = New TesseractEngine(tessDataPath, "eng+spa+fra", EngineMode.Default)
                Using page = engine.Process(img)
                    ' Extract the text
                    Dim text As String = page.GetText()
                    Console.WriteLine("Recognized Text:")
                    Console.WriteLine(text)
                End Using
            End Using
        End Using
    End Sub
End Class
The above code shows how Tesseract OCR can detect and extract text from images containing multiple languages. It first sets the path to the Tesseract language data files. The .traineddata file for each language used, here English, Spanish, and French, must be present in that directory.
It then loads the image specified by imagePath using the Pix.LoadFromFile method; the image is expected to contain English, Spanish, and French text. Next, a TesseractEngine instance is initialized with the path to the language data files and the language string "eng+spa+fra", which tells the engine which languages to recognize. The engine runs in its default mode.
The image is then passed to the engine's Process method, which analyzes it and returns a page object. Calling GetText on that page extracts the recognized text, which is stored in the variable text and printed to the console.
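Because the engine depends on one .traineddata file per language, it can help to check the tessdata folder before constructing the engine. The following is a minimal sketch using only the .NET base class library; the TessdataCheck class and its method names are hypothetical helpers introduced here for illustration and are not part of the Tesseract wrapper.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

// Hypothetical helper (not part of the Tesseract package).
public static class TessdataCheck
{
    // Joins language codes with '+', the separator used in "eng+spa+fra".
    public static string BuildLanguageString(IEnumerable<string> codes)
    {
        return string.Join("+", codes);
    }

    // Returns the codes whose <code>.traineddata file is absent from tessDataPath.
    public static List<string> FindMissingLanguages(string tessDataPath, IEnumerable<string> codes)
    {
        return codes
            .Where(code => !File.Exists(Path.Combine(tessDataPath, code + ".traineddata")))
            .ToList();
    }
}
```

Calling TessdataCheck.FindMissingLanguages("./tessdata", new[] { "eng", "spa", "fra" }) before creating the engine returns an empty list when all three files are present; the result of BuildLanguageString can then be passed as the language argument.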
IronOCR is a proprietary OCR library for .NET. It adds OCR capabilities to .NET applications and extracts text from images, scanned documents, PDFs, and other visual media. Built on the Tesseract engine, IronOCR also includes several additional features that make it suitable for use in enterprise applications.
IronOCR supports more than 120 languages, with automatic language detection and the ability to process documents containing multiple languages at once. This makes it well suited to global deployments where multilingual document processing is critical.
IronOCR also emphasizes ease of use and integration. Its API is supplemented by detailed documentation and a set of example projects that help developers get up and running quickly. It supports a wide array of image formats as well as PDF documents, and its built-in image preprocessing, noise reduction, and error correction features improve OCR accuracy and performance.
You can install the packages directly into your solution using Visual Studio's NuGet Package Manager.
The package manager has an embedded search box that lists packages from the NuGet website. Search for IronOCR.
From the search results, select the IronOCR package and install it.
Then install the required Tesseract language packs one by one.
In this example, we will use Spanish, French, and English. English is the default language pack and does not require installation.
Install the Spanish language pack from NuGet, and do the same for French.
The following example demonstrates how to recognize text in multiple languages from an image using C# and IronOCR, which runs on the Tesseract engine.
using System;
using IronOcr;

class Program
{
    static void Main(string[] args)
    {
        // Initialize IronTesseract engine
        var Ocr = new IronTesseract();

        // Add multiple languages
        Ocr.Language = OcrLanguage.English + OcrLanguage.Spanish + OcrLanguage.French;

        // Path to the image
        var inputFile = @"path\to\your\image.png";

        // Load the image; the using block disposes the input when done
        using (var input = new OcrInput(inputFile))
        {
            // Perform OCR
            var result = Ocr.Read(input);

            // Display the result
            Console.WriteLine("Recognized Text:");
            Console.WriteLine(result.Text);
        }
    }
}
The same example in VB.NET:
Imports IronOcr

Friend Class Program
    Shared Sub Main(ByVal args() As String)
        ' Initialize IronTesseract engine
        Dim Ocr = New IronTesseract()

        ' Add multiple languages
        Ocr.Language = OcrLanguage.English + OcrLanguage.Spanish + OcrLanguage.French

        ' Path to the image
        Dim inputFile = "path\to\your\image.png"

        ' Load the image; the Using block disposes the input when done
        Using input = New OcrInput(inputFile)
            ' Perform OCR
            Dim result = Ocr.Read(input)

            ' Display the result
            Console.WriteLine("Recognized Text:")
            Console.WriteLine(result.Text)
        End Using
    End Sub
End Class
The above program uses the IronOCR library to perform Optical Character Recognition on an image containing English, Spanish, and French text. It begins by importing the IronOcr namespace and declaring a class named Program with a Main method, which is the application's entry point.
In the Main method, an instance of the IronTesseract class is created and assigned to the variable Ocr. The Language property is set to include English, Spanish, and French by combining OcrLanguage.English, OcrLanguage.Spanish, and OcrLanguage.French. This lets the OCR engine recognize and process text in any of these three languages.
The path to the input image file is stored in the inputFile variable. The image is then loaded into an OcrInput instance inside a using statement, which ensures proper resource management and disposal. Finally, the Read method of the IronTesseract instance is called with the input object to perform OCR on the image.
The recognized text is then printed to the console using the Console.WriteLine method. This program illustrates how IronOCR's multilingual capability can be used to extract text from images containing words in different languages.
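Multilingual output often contains accented characters (for example, Spanish "ñ" or French "é"), and some consoles do not display them correctly by default. The sketch below shows one way to handle the recognized text once you have it as a string. It assumes nothing about the OCR library itself; the OcrOutput class and its method names are hypothetical and use only the .NET base class library.

```csharp
using System;
using System.IO;
using System.Text;

// Hypothetical helper for handling recognized text (not part of IronOCR or Tesseract).
public static class OcrOutput
{
    // Writes the recognized text to a UTF-8 file (without a byte-order mark)
    // so that accented characters are preserved.
    public static void SaveAsUtf8(string path, string recognizedText)
    {
        File.WriteAllText(path, recognizedText, new UTF8Encoding(encoderShouldEmitUTF8Identifier: false));
    }

    // Switches the console to UTF-8 before printing the text.
    public static void PrintToConsole(string recognizedText)
    {
        Console.OutputEncoding = Encoding.UTF8;
        Console.WriteLine(recognizedText);
    }
}
```

In the examples above, the string passed in would be the text variable from the Tesseract sample or result.Text from the IronOCR sample.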
IronOCR is more user-friendly than Tesseract and offers several advantages. First, IronOCR supports 127 languages out of the box, while Tesseract may require complex configuration and additional training for optimal performance with some of its 100 supported languages. Additionally, IronOCR integrates easily into .NET applications and comes with comprehensive documentation.
IronOCR has a gentler learning curve and requires less technical setup than Tesseract. It also features advanced image preprocessing and receives regular updates aimed at better accuracy and reliability with complex document types. IronOCR is a good choice for developers seeking a solid, versatile OCR solution that is easy to apply.
While both Tesseract and IronOCR are robust OCR technologies, each has unique capabilities and strengths. Tesseract, being open-source, is a reliable choice for anyone seeking a free solution, and it benefits from an active community and continuous improvement.
In contrast, IronOCR is a proprietary library for .NET that offers an improved user experience, with easier integration and support for most image file types. It also performs well in text recognition, particularly with low-quality images. IronOCR supports many languages and has extra features that make it more user-friendly.
IronOCR offers a cost-effective development edition and, when purchased, provides a lifetime license. The IronOCR package starts at $749 as a one-time cost for multiple systems and includes 24/7 online engineer support for licensed users. For more information, refer to the IronOCR website.
OCR (Optical Character Recognition) is a technology used to convert different types of documents into editable and searchable data. It processes scanned images, PDF files, and digital camera photos to extract text for editing, searching, and storing.
OCR works by analyzing the pixels of an image, recognizing patterns that depict characters, words, and sentences, and converting them into machine-readable text. Engines such as Tesseract accept various image file types and can produce text in formats like plain text, HTML, and searchable PDF.
To use OCR with multiple languages, you can use IronOCR: install the IronOCR/Tesseract NuGet package in your .NET project, create an instance of the IronTesseract class, set the Language property to include multiple languages, and process the image using the Read function.
IronOCR is more user-friendly, offering excellent language support with 127 languages out-of-the-box. It integrates easily into .NET applications, has comprehensive documentation, and features advanced image preprocessing for better accuracy, making it a suitable choice for developers.
Tesseract is known for its high accuracy. It supports script detection, recognizes text in many languages, and can handle multilingual documents, making it well suited to projects requiring comprehensive language support.
To set up a new Visual Studio project, open Visual Studio, go to the File menu, select New Project, and choose Console Application. Enter your project's name and location, select the .NET Framework version, and then add code and build/run your application.
IronOCR supports more than 120 languages, including automatic language detection and processing of multilingual documents, making it highly versatile for global deployment.
To install OCR for .NET with IronOCR, add the IronOCR package to your C# project using Visual Studio's NuGet Package Manager, then install any additional language packs you need.
The IronOCR package starts at $749 as a one-time cost for multiple systems, offering a cost-effective development edition with a lifetime license and 24/7 online engineer support for licensed users.
OCR technology is used extensively in finance, health, legislation, and education sectors to automate data entry, digitize printed documents, process invoices, and enable accessibility software for the visually impaired.