How to Get Text from Images Using Tesseract

Libraries such as Tesseract and IronOCR give developers access to advanced algorithms and machine learning techniques for extracting text from images and scanned documents. This tutorial shows how to use the Tesseract library to extract text from images, then concludes by introducing IronOCR's approach.

1. OCR with Tesseract

1.1. Install Tesseract

Using the NuGet Package Manager Console, enter the following command:

Install-Package Tesseract

Or download the package via the NuGet Package Manager.

Figure 1: Install Tesseract package in the NuGet Package Manager

You must manually install and save the language files in the project folder after installing the NuGet Package. This can be considered a shortcoming of this specific library.

Visit the following website to download the language files. Once downloaded, unzip the files, and add the "tessdata" folder to your project's debug folder.
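A missing or misplaced "tessdata" folder is an easy mistake to make at this step, so a small pre-flight check can help before the engine is created. The sketch below is only an illustration: `TessdataCheck` and `FindMissingLanguages` are hypothetical names, not part of the Tesseract package, and it assumes that each language is stored as a `<code>.traineddata` file and that the folder sits next to the executable.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

// Hypothetical helper for illustration; not part of the Tesseract package.
static class TessdataCheck
{
    // Returns the language codes whose "<code>.traineddata" file
    // is not present in the given tessdata directory.
    public static List<string> FindMissingLanguages(string tessdataDir, params string[] languages)
    {
        var missing = new List<string>();
        foreach (var lang in languages)
        {
            var path = Path.Combine(tessdataDir, lang + ".traineddata");
            if (!File.Exists(path))
            {
                missing.Add(lang);
            }
        }
        return missing;
    }
}

class Program
{
    static void Main()
    {
        // Assumption: "tessdata" was copied next to the executable (the debug output folder).
        var tessdataDir = Path.Combine(AppContext.BaseDirectory, "tessdata");
        var missing = TessdataCheck.FindMissingLanguages(tessdataDir, "eng");

        if (missing.Count > 0)
        {
            Console.WriteLine($"Missing language data in {tessdataDir}: {string.Join(", ", missing)}");
        }
        else
        {
            Console.WriteLine("All requested language files were found.");
        }
    }
}
```

Running this once after copying the folder confirms that the files ended up where the program will look for them.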

1.2. Using Tesseract (Quick-start)

OCR on a given image can be performed using the source code below:

using System;
using Tesseract;

class Program
{
    static void Main()
    {
        // Initialize Tesseract engine with English language data
        using var ocrEngine = new TesseractEngine(@"tessdata", "eng", EngineMode.Default);

        // Load the image to be processed
        using var img = Pix.LoadFromFile("Demo.png");

        // Process the image to extract text
        using var res = ocrEngine.Process(img);

        // Output the recognized text
        Console.WriteLine(res.GetText());
        Console.ReadKey();
    }
}
Imports Tesseract

Friend Class Program
	Shared Sub Main()
		' Initialize Tesseract engine with English language data
		Dim ocrEngine = New TesseractEngine("tessdata", "eng", EngineMode.Default)

		' Load the image to be processed
		Dim img = Pix.LoadFromFile("Demo.png")

		' Process the image to extract text
		Dim res = ocrEngine.Process(img)

		' Output the recognized text
		Console.WriteLine(res.GetText())
		Console.ReadKey()
	End Sub
End Class
  • First, a TesseractEngine object must be created, loading the language data into the engine.
  • The desired image file is then loaded with the help of Pix.LoadFromFile.
  • The image is passed into the TesseractEngine to extract text using the Process method.
  • The recognized text is obtained with the GetText method and printed to the console.

Figure 2: Extracted text from the image

1.3. Tesseract Considerations

  1. Tesseract supports output text formatting, OCR positional data, and page layout analysis as of version 3.00.
  2. Tesseract is available on Windows, Linux, and macOS, though it is primarily confirmed to work as intended on Windows and Ubuntu due to limited development support.
  3. Tesseract can distinguish between monospaced and proportionally spaced text.
  4. Tesseract works well as a back-end engine. Paired with a front-end such as OCRopus, it can handle more challenging OCR jobs, including layout analysis.
  5. Some of Tesseract's shortcomings:
    • The latest builds have not been designed to compile on Windows
    • Tesseract's C# API wrappers are maintained infrequently and are years behind new releases of Tesseract
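Point 1 in the list above mentions positional data. As a minimal sketch, assuming the same Tesseract NuGet wrapper, "tessdata" folder, and "Demo.png" image used in the quick-start, the recognized words can be walked with a result iterator to print each word's bounding box and confidence:

```csharp
using System;
using Tesseract;

class Program
{
    static void Main()
    {
        // Same assumptions as the quick-start: English data in "tessdata" and an image named "Demo.png"
        using var engine = new TesseractEngine(@"tessdata", "eng", EngineMode.Default);
        using var img = Pix.LoadFromFile("Demo.png");
        using var page = engine.Process(img);

        // The iterator walks the recognition results at the requested level (here: words)
        using var iter = page.GetIterator();
        iter.Begin();
        do
        {
            var word = iter.GetText(PageIteratorLevel.Word);

            // Skip empty entries and words without a usable bounding box
            if (!string.IsNullOrWhiteSpace(word) &&
                iter.TryGetBoundingBox(PageIteratorLevel.Word, out var box))
            {
                var confidence = iter.GetConfidence(PageIteratorLevel.Word);
                Console.WriteLine(
                    $"{word} at ({box.X1},{box.Y1})-({box.X2},{box.Y2}), confidence {confidence:F1}");
            }
        } while (iter.Next(PageIteratorLevel.Word));
    }
}
```

The coordinates are in pixels of the input image, so they can be used to highlight or crop the recognized words.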

To learn more about Tesseract in C#, please visit the Tesseract tutorial.

2. OCR with IronOCR

2.1. Installing IronOCR

Enter the following command into the NuGet Package Manager Console:

Install-Package IronOcr

Or install the IronOCR library via the NuGet Package Manager, along with any additional language packages you need.

Figure 3: Install IronOcr and languages packages via NuGet Package Manager

2.2. Using IronOCR

The sample code below recognizes the text in the given image:

using System;
using IronOcr;

class Program
{
    static void Main()
    {
        // Create an IronTesseract instance with predefined settings
        var ocr = new IronTesseract()
        {
            Language = OcrLanguage.EnglishBest,
            Configuration = { TesseractVersion = TesseractVersion.Tesseract5 }
        };

        // Create an OcrInput instance for image processing
        using var input = new OcrInput();

        // Load the image to be processed
        input.AddImage("Demo.png");

        // Process the image and extract text
        var result = ocr.Read(input);

        // Output the recognized text
        Console.WriteLine(result.Text);
        Console.ReadKey();
    }
}
Imports IronOcr

Friend Class Program
	Shared Sub Main()
		' Create an IronTesseract instance with predefined settings
		Dim ocr = New IronTesseract() With {
			.Language = OcrLanguage.EnglishBest
		}

		' Select the Tesseract version on the existing Configuration object
		ocr.Configuration.TesseractVersion = TesseractVersion.Tesseract5

		' Create an OcrInput instance for image processing
		Dim input = New OcrInput()

		' Load the image to be processed
		input.AddImage("Demo.png")

		' Process the image and extract text
		Dim result = ocr.Read(input)

		' Output the recognized text
		Console.WriteLine(result.Text)
		Console.ReadKey()
	End Sub
End Class
  • This code initializes an IronTesseract object, setting up the language and Tesseract version.
  • An OcrInput object is then created to load image files using the AddImage method.
  • The Read method of IronTesseract processes the image and extracts the text, which is then printed to the console.

Figure 4: Extracted text output using IronOCR library

2.3. IronOCR Considerations

  1. IronOCR is an extension of the Tesseract library, introducing more stability and higher accuracy.
  2. IronOCR can read text content from PDFs and photos. It can also read more than 20 distinct kinds of barcodes and QR codes.
  3. Output can be rendered as plain text, structured data, barcodes, or QR codes.
  4. The library recognizes 127 languages worldwide.
  5. IronOCR works in all .NET environments (console, web, desktop, etc.) and also supports platforms and frameworks such as Mono, Xamarin, Azure, and MAUI.
  6. IronOCR offers a free trial and a lower-priced development edition. Learn more about licensing.

For a detailed IronOCR tutorial, refer to this article to read text from an image in C#.

Kannapat Udonpant
Software Engineer
Before becoming a Software Engineer, Kannapat completed an Environmental Resources PhD from Hokkaido University in Japan. While pursuing his degree, Kannapat also became a member of the Vehicle Robotics Laboratory, which is part of the Department of Bioproduction Engineering. In 2022, he leveraged his C# skills to join Iron Software's engineering team, where he focuses on IronPDF. Kannapat values his job because he learns directly from the developer who writes most of the code used in IronPDF. In addition to peer learning, Kannapat enjoys the social aspect of working at Iron Software. When he's not writing code or documentation, Kannapat can usually be found gaming on his PS5 or rewatching The Last of Us.