C# Tesseract OCR Example

by Jim Baker

Tesseract is an excellent academic OCR (optical character recognition) library, available to developers free of charge for almost all use cases.

C# developers are fortunate to have one of the fastest and most accurate Tesseract libraries available.

IronOCR extends Google Tesseract with IronTesseract - a native C# OCR library with improved stability and higher accuracy than the free Tesseract library.

This article compares the two and explains why .NET developers should strongly consider using IronOCR's IronTesseract over vanilla Tesseract.

C# Tesseract OCR

Code Example for .NET OCR Usage - Extract Text from Images in C#

Use NuGet Package Manager to install the IronOCR NuGet Package into your Visual Studio solution.

using IronOcr;
using System;

var ocr = new IronTesseract();

// Hundreds of languages available
ocr.Language = OcrLanguage.English;

using var input = new OcrInput();
var pageindices = new int[] { 1, 2 };
input.LoadImageFrames(@"img\example.tiff", pageindices);
// input.DeNoise();  optional filter
// input.Deskew();   optional filter

OcrResult result = ocr.Read(input);
Console.WriteLine(result.Text);
// Explore the OcrResult using IntelliSense

The same example in VB:

Imports IronOcr
Imports System

Dim ocr = New IronTesseract()

' Hundreds of languages available
ocr.Language = OcrLanguage.English

Dim input = New OcrInput()
Dim pageindices = New Integer() { 1, 2 }
input.LoadImageFrames("img\example.tiff", pageindices)
' input.DeNoise()  optional filter
' input.Deskew()   optional filter

Dim result As OcrResult = ocr.Read(input)
Console.WriteLine(result.Text)
' Explore the OcrResult using IntelliSense

Installation Options

Install with NuGet:

Install-Package IronOcr

Or download the IronOCR DLL and install it into your project manually.

Using Tesseract Engine for OCR with .NET

When using Tesseract Engine, most of us are working with a C++ library.

Interop is not a lot of fun in .NET, and it has poor cross-platform and Azure compatibility. It requires us to choose the bitness of our application, meaning that a given build can be deployed to either 32-bit or 64-bit targets, but not both.
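
As a rough illustration of the bitness constraint, the sketch below reports which architecture the current process is running as, using only the .NET base class library. The helper name DescribeBitness is hypothetical and not part of any OCR library; the point is that any native DLL loaded through interop has to match whatever this reports.

```csharp
using System;
using System.Runtime.InteropServices;

// Hypothetical helper: describes the bitness of the running process.
// A native DLL loaded via interop must be built for the same bitness.
static string DescribeBitness()
{
    // IntPtr.Size is 4 bytes in a 32-bit process and 8 bytes in a 64-bit one.
    int pointerBits = IntPtr.Size * 8;
    return $"{pointerBits}-bit process ({RuntimeInformation.ProcessArchitecture})";
}

Console.WriteLine(DescribeBitness());
```

Logging this at startup can make a mismatched native binary easier to diagnose than the load failure it would otherwise cause.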

We may need to ensure that Visual C++ runtimes are installed, and even compile Tesseract ourselves to get the latest version. Free C# wrappers for these may be years behind the latest release.

We also have to find, download and manage C++ DLLs and EXEs we may not understand, and deploy them in environments where permissions may not allow them to run.

IronOCR, by contrast, installs easily through the NuGet Package Manager and extracts text from images and PDF files using optical character recognition.

IronOCR Tesseract for C#

With IronOCR, all Tesseract installation happens entirely using the NuGet Package Manager.

Install-Package IronOcr

There are no native dlls or exes to install. Everything is handled by a single .NET component library.

The entire API is native .NET: a simple C# API built on Tesseract.

It supports these kinds of Visual Studio projects to add optical character recognition in C#:

  • .NET Framework 4.6.2 and above
  • .NET Standard 2.0 and above
  • .NET Core 2.0 and above (including 3.x, .NET 5, 6, 7 & 8)

Up To Date & Maintained

Google Tesseract with C#

The latest builds of Tesseract 5 have never been designed to compile on Windows.

Installing Tesseract 5 for C# for free requires manually modifying and compiling Leptonica and Tesseract for Windows. The MinGW cross-compile chain is not successful at producing Windows interop binaries as of today.

In addition, free C# API wrappers on GitHub may be years behind or incompatible.

IronOCR Tesseract for .NET

Runs Tesseract 5 out of the box on Windows, macOS, Linux, Azure, AWS, Lambda, Mono, and Xamarin Mac with little or no configuration. No native binaries to manage. Framework and Core compatible.

There is little else to say other than it has been done right.

Accuracy

Google Tesseract in .NET Projects

Tesseract was designed for perfect documents, such as high-resolution text that a machine renders to a screen and then reads back. That is why Tesseract is good at reading perfect documents.

The problem is that in the real world, that is not what we have. If Tesseract encounters an image that is rotated, skewed, is of a low DPI, scanned, or has background noise, it becomes almost impossible for Tesseract to get data from that image. In addition, Tesseract will also take a very long time to process that document before giving you back nonsense information.

Even a simple document that is very easy to read by eye may not be read well by Tesseract.

Tesseract is a free library optimal for reading straight and perfect text of standardized typefaces.

To use Tesseract when we are using scanned or photographed documents where the images are not digitally perfect like screenshots, we need to perform image preprocessing. This is normally done with Photoshop batch scripts or advanced ImageMagick usage.

Generally, this needs to be developed on a case-by-case basis for each type of document you are trying to deal with and can take weeks of development.
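
To give a sense of what one such preprocessing step involves, here is a minimal sketch of global thresholding (binarization) on a grayscale pixel grid. It is not the Photoshop or ImageMagick workflow itself, and the helper name Binarize and the fixed threshold of 128 are assumptions chosen for illustration; real documents usually need a tuned or adaptive threshold, which is part of why this work is done case by case.

```csharp
using System;

// Hypothetical helper: converts 8-bit grayscale pixels to pure black (0)
// or pure white (255) using a fixed global threshold.
static byte[,] Binarize(byte[,] gray, byte threshold)
{
    int rows = gray.GetLength(0), cols = gray.GetLength(1);
    var result = new byte[rows, cols];
    for (int y = 0; y < rows; y++)
        for (int x = 0; x < cols; x++)
            result[y, x] = gray[y, x] >= threshold ? (byte)255 : (byte)0;
    return result;
}

// Light background noise (values near 200) is pushed to white;
// dark text strokes (values near 30) are pushed to black.
byte[,] sample = { { 30, 200 }, { 210, 25 } };
byte[,] clean = Binarize(sample, 128);
Console.WriteLine($"{clean[0, 0]} {clean[0, 1]} {clean[1, 0]} {clean[1, 1]}");
```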

IronOCR Tesseract in .NET Projects

IronOCR takes this headache away. Users often achieve 99.8-100% accuracy with minimal configuration.

using IronOcr;
using System;

var ocr = new IronTesseract();
using var input = new OcrInput();
var pageindices = new int[] { 1, 2 };
input.LoadImageFrames(@"img\example.tiff", pageindices);
input.DeNoise();  //fixes digital noise
input.Deskew();   //fixes rotation and perspective

// there are dozens more filters, but most users won't need them
OcrResult result = ocr.Read(input);
Console.WriteLine(result.Text);

The same example in VB:

Imports IronOcr
Imports System

Dim ocr = New IronTesseract()
Dim input = New OcrInput()
Dim pageindices = New Integer() { 1, 2 }
input.LoadImageFrames("img\example.tiff", pageindices)
input.DeNoise() ' fixes digital noise
input.Deskew() ' fixes rotation and perspective

' there are dozens more filters, but most users won't need them
Dim result As OcrResult = ocr.Read(input)
Console.WriteLine(result.Text)

Image Compatibility

Google Tesseract in .NET

Tesseract only accepts the Leptonica PIX image format, which appears in C# as an IntPtr to a C++ object. PIX objects are not managed memory, and failure to handle them with care in C# results in memory leaks.

Leptonica has good general image compatibility but throws many console warnings and errors. There are known issues with TIFF files and limited support for PDF OCR.

IronOCR Tesseract for .NET

Images are memory managed. PDF and TIFF are supported. System.Drawing, Stream, and byte array inputs are supported for every file format.

Broad image support:

  • PDF Documents
  • PDF Pages
  • MultiFrame TIFF files
  • JPEG & JPEG2000
  • GIF
  • PNG
  • BMP
  • WBMP
  • System.Drawing.Image
  • System.Drawing.Bitmap
  • System.IO.Streams of images
  • Binary image Data (byte[])
  • And many more...

OCR Image Compatibility Code Example

using IronOcr;
using System;

var ocr = new IronTesseract();
using var input = new OcrInput();
input.LoadPdf("example.pdf", password: "password");
var pageindices = new int[] { 1, 2 };
input.LoadImageFrames("multi-frame.tiff", pageindices);
input.LoadImage("image1.png");
input.LoadImage("image2.jpeg");
//... many more

var result = ocr.Read(input);
Console.WriteLine(result.Text);

The same example in VB:

Imports IronOcr
Imports System

Dim ocr = New IronTesseract()
Dim input = New OcrInput()
input.LoadPdf("example.pdf", password:="password")
Dim pageindices = New Integer() { 1, 2 }
input.LoadImageFrames("multi-frame.tiff", pageindices)
input.LoadImage("image1.png")
input.LoadImage("image2.jpeg")
'... many more

Dim result = ocr.Read(input)
Console.WriteLine(result.Text)

Performance

Free Google Tesseract

Google Tesseract can produce fast and accurate results if it is properly tuned and the input images have been preprocessed using Photoshop or ImageMagick.

You will notice that most Tesseract examples online are actually from high-resolution screenshots with no digital noise, in fonts that Tesseract has been designed to work well with.

Tesseract's own documentation states that input images should be sampled at 300 DPI or higher for OCR to be effective.
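
The 300 DPI guideline can be checked with simple arithmetic when the physical size of the scanned page is known: resolution is pixel width divided by width in inches. The sketch below is illustrative only; the helper name EffectiveDpi and the US Letter page width of 8.5 inches are assumptions for the example.

```csharp
using System;

// Hypothetical helper: effective scan resolution in dots per inch,
// given the image width in pixels and the physical page width in inches.
static double EffectiveDpi(int widthPixels, double widthInches)
{
    if (widthInches <= 0)
        throw new ArgumentOutOfRangeException(nameof(widthInches));
    return widthPixels / widthInches;
}

// A US Letter page is 8.5 inches wide, so 2550 pixels across is 300 DPI
// and 1275 pixels across is only 150 DPI.
foreach (int width in new[] { 2550, 1275 })
{
    double dpi = EffectiveDpi(width, 8.5);
    string verdict = dpi >= 300 ? "meets the guideline" : "below the guideline";
    Console.WriteLine($"{width} px wide: {dpi:F0} DPI, {verdict}");
}
```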

IronOCR Tesseract Library

The IronOcr .NET Tesseract DLL works accurately and at speed for most images out of the box. We have implemented multithreading to make use of the multi-core processors that most machines now use.

Even low-resolution images generally work with a high degree of accuracy in your program. No Photoshop required.

Developers often achieve over 99% accuracy with little configuration, which matches current Machine Learning web APIs without the ongoing costs, security risks and bandwidth issues.

Speeds are fast but can be improved with a little coding.

Performance Tuning Example

using IronOcr;
using System;

var ocr = new IronTesseract();

// Configure for speed.  35% faster and only 0.2% loss of accuracy
ocr.Configuration.BlackListCharacters = "~`$#^*_}{][|\\@¢©«»°±·×‑–—‘’“”•…′″€™←↑→↓↔⇄⇒∅∼≅≈≠≤≥≪≫⌁⌘○◔◑◕●☐☑☒☕☮☯☺♡⚓✓✰";
ocr.Configuration.PageSegmentationMode = TesseractPageSegmentationMode.Auto;
ocr.Configuration.ReadBarCodes = false;
ocr.Language = OcrLanguage.EnglishFast;

using var input = new OcrInput();
var pageindices = new int[] { 1, 2 };
input.LoadImageFrames(@"img\Potter.tiff", pageindices);
var result = ocr.Read(input);
Console.WriteLine(result.Text);

The same example in VB:

Imports IronOcr
Imports System

Dim ocr = New IronTesseract()

' Configure for speed.  35% faster and only 0.2% loss of accuracy
ocr.Configuration.BlackListCharacters = "~`$#^*_}{][|\@¢©«»°±·×‑–—‘’“”•…′″€™←↑→↓↔⇄⇒∅∼≅≈≠≤≥≪≫⌁⌘○◔◑◕●☐☑☒☕☮☯☺♡⚓✓✰"
ocr.Configuration.PageSegmentationMode = TesseractPageSegmentationMode.Auto
ocr.Configuration.ReadBarCodes = False
ocr.Language = OcrLanguage.EnglishFast

Dim input = New OcrInput()
Dim pageindices = New Integer() { 1, 2 }
input.LoadImageFrames("img\Potter.tiff", pageindices)
Dim result = ocr.Read(input)
Console.WriteLine(result.Text)

API

Google Tesseract OCR in .NET

We have 2 free choices:

  • Work with Interop layers - Many of those found on GitHub are out of date and have unresolved tickets, memory leaks and console warnings. They may not support .NET Core or Standard.
  • Work with the command line EXE - Hard to deploy and constantly interrupted by virus scanners and security policies.

Neither of the above may work well in Web Applications, Azure, Mono, Xamarin, Linux, Docker, or Mac.
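
For reference, the command line route might look roughly like the sketch below, which shells out to the Tesseract executable and captures its output. It assumes an executable named "tesseract" is installed and on the PATH, and it uses ProcessStartInfo.ArgumentList, which is available in .NET Core 2.1 and later but not in .NET Framework. The helper name BuildTesseractCommand is hypothetical. The launch itself is left commented out because it fails wherever the executable is missing or blocked, which is exactly the deployment problem described above.

```csharp
using System;
using System.Diagnostics;

// Hypothetical helper: builds the process settings for the Tesseract CLI.
// Assumes an executable named "tesseract" is installed and on the PATH.
static ProcessStartInfo BuildTesseractCommand(string imagePath, string language)
{
    var startInfo = new ProcessStartInfo("tesseract")
    {
        RedirectStandardOutput = true, // capture the recognized text
        UseShellExecute = false        // required for output redirection
    };
    startInfo.ArgumentList.Add(imagePath);
    startInfo.ArgumentList.Add("stdout"); // write text to standard output
    startInfo.ArgumentList.Add("-l");     // language selection flag
    startInfo.ArgumentList.Add(language);
    return startInfo;
}

var command = BuildTesseractCommand("img.png", "eng");
Console.WriteLine($"{command.FileName} {string.Join(" ", command.ArgumentList)}");

// To actually run it (throws if the executable cannot be started):
// using var process = Process.Start(command);
// string text = process.StandardOutput.ReadToEnd();
// process.WaitForExit();
```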

IronOCR Tesseract OCR Library for .NET

A managed and tested .NET Library for Tesseract called IronTesseract.

Fully documented with IntelliSense support.

Simplest Hello World for Tesseract in .NET

using IronOcr;

var text = new IronTesseract().Read("img.png").Text;

The same example in VB:

Imports IronOcr

Dim text = (New IronTesseract()).Read("img.png").Text

IronOCR is actively developed and is supported by professional software engineers with a median experience level of over 20 years.

Compatibility

Google Tesseract + Interop for .NET

This may be made to work in most platforms if you are willing to find dependencies, build from source or update a free C# interop wrapper. These resources may not be fully compatible with .NET Core or .NET Standard projects.

At present, we have not encountered any logical and simple way to install LibTesseract5 for Windows safely without IronTesseract.

IronOCR Tesseract .NET OCR Library

Unit Tested with CI, and has everything you need to run on:

  • Desktop applications,
  • Console Apps
  • Server Processes
  • Web Applications & MVC
  • JetBrains Rider
  • Xamarin Mac

On:

  • Windows
  • Azure
  • Linux
  • Docker
  • Mac
  • BSD and FreeBSD

.NET Support for:

  • .NET Framework 4.6.2 and above
  • .NET Core - All active versions above 2.0
  • .NET Standard - All active versions above 2.0
  • Mono
  • Xamarin Mac

Language Support

Google Tesseract

Tesseract dictionaries are managed as files and must be cloned from https://github.com/tesseract-ocr/tessdata, which is about 4 GB.

Some Linux distributions offer help managing Tesseract dictionaries via apt-get.

Exact folder structures must be maintained or Tesseract fails.
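
Because a missing dictionary file makes Tesseract fail, a deployment can check for the files up front. The sketch below assumes the usual Tesseract convention of one "<code>.traineddata" file per language inside a tessdata folder; the helper name FindMissingLanguages and the folder location next to the application are assumptions for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

// Hypothetical helper: lists the requested language codes whose
// "<code>.traineddata" file is absent from the tessdata folder.
static List<string> FindMissingLanguages(string tessdataDir, params string[] languageCodes)
{
    var missing = new List<string>();
    foreach (string code in languageCodes)
    {
        string path = Path.Combine(tessdataDir, code + ".traineddata");
        if (!File.Exists(path))
            missing.Add(code);
    }
    return missing;
}

// The folder path here is an assumption; adjust it to your deployment.
string tessdata = Path.Combine(AppContext.BaseDirectory, "tessdata");
List<string> missingLanguages = FindMissingLanguages(tessdata, "eng", "ara");
Console.WriteLine(missingLanguages.Count == 0
    ? "All language files found"
    : "Missing: " + string.Join(", ", missingLanguages));
```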

IronOCR Tesseract

Supports more languages than https://github.com/tesseract-ocr/tessdata and they are each managed as a NuGet Package via NuGet Package Manager or easily installable downloads.

Unicode Language Example

using IronOcr;

var ocr = new IronTesseract();
ocr.Language = OcrLanguage.Arabic;

using var input = new OcrInput();
var pageindices = new int[] { 1, 2 };
input.LoadImageFrames("img/arabic.gif", pageindices);

// Add image filters if needed
// In this case, even though the input is very low quality,
// IronTesseract can read what conventional Tesseract cannot.

var result = ocr.Read(input);

// Console can't print Arabic on Windows easily.
// Let's save to disk instead.
result.SaveAsTextFile("arabic.txt");

The same example in VB:

Imports IronOcr

Dim ocr = New IronTesseract()
ocr.Language = OcrLanguage.Arabic

Dim input = New OcrInput()
Dim pageindices = New Integer() { 1, 2 }
input.LoadImageFrames("img/arabic.gif", pageindices)

' Add image filters if needed
' In this case, even though the input is very low quality,
' IronTesseract can read what conventional Tesseract cannot.

Dim result = ocr.Read(input)

' Console can't print Arabic on Windows easily.
' Let's save to disk instead.
result.SaveAsTextFile("arabic.txt")

Multiple Language Example

It is also possible for OCR to use multiple languages at the same time. This can really help get English language metadata and URLs in Unicode documents.

using IronOcr;

// For the Chinese Language Pack:
// PM> Install-Package IronOcr.Languages.ChineseSimplified

var ocr = new IronTesseract();
ocr.Language = OcrLanguage.ChineseSimplified;
ocr.AddSecondaryLanguage(OcrLanguage.English);

// We can add any number of languages
using var input = new OcrInput();
input.LoadPdf("multi-language.pdf");
var result = ocr.Read(input);
result.SaveAsTextFile("results.txt");

The same example in VB:

Imports IronOcr

' For the Chinese Language Pack:
' PM> Install-Package IronOcr.Languages.ChineseSimplified

Dim ocr = New IronTesseract()
ocr.Language = OcrLanguage.ChineseSimplified
ocr.AddSecondaryLanguage(OcrLanguage.English)

' We can add any number of languages
Dim input = New OcrInput()
input.LoadPdf("multi-language.pdf")
Dim result = ocr.Read(input)
result.SaveAsTextFile("results.txt")

What Else

IronOCR Tesseract has additional features for .NET software developers.

  • Automatic image analysis to configure Tesseract for common errors
  • Image to Searchable PDF Conversion
  • PDF OCR
  • Can make any PDF searchable and indexable on search engines
  • OCR to HTML output
  • TIFF to PDF conversion
  • Barcode Reading
  • QR Code Reading
  • Multithreading
  • An advanced OcrResult Class that allows inspection of Blocks, Paragraphs, Lines, Words, Characters, Fonts and OCR statistics.

Conclusion

Google Tesseract for C# OCR

This is the right library to use for free & academic projects in C#.

Tesseract is an excellent resource for C++ developers, but it is not a complete OCR library for .NET.

This shows when dealing with scanned or photographed images, because these images need to be processed so as to be orthogonal, standardized, high-resolution, and free of digital noise before Tesseract can work with them accurately.

IronOCR Tesseract OCR Library for .NET Framework & Core

In contrast, IronOCR can do this and more in a single line of code.

It is true: IronOCR uses Tesseract for its internal OCR engine.
It is a very finely tuned Tesseract build for C#, with many performance improvements and features added as standard.

It is the right choice for any project where developer time is valuable. When was the last time you found a .NET software engineer with weeks of time on their hands?

Get Started on your C# Tesseract Project

Use NuGet Package Manager in any Visual Studio project:

Install-Package IronOcr

Or you can download the IronOCR Tesseract .NET DLL and install it manually.

Any .NET coder should be able to get started with IronOCR Tesseract OCR in 5 minutes using examples on this page.

.NET Developer at Iron with a passion for OCR and natural language manipulation

Jim Baker

IronOCR Product Developer

Jim has been at the forefront of the IronOCR product development since its release in 2016. Jim worked on Tesseract 5 support for .NET Core & Standard through 2019-2020