C# Tesseract OCR Example
by Jim Baker
Tesseract is an excellent academic OCR (optical character recognition) library, available to developers free of charge for almost all use cases.
C# developers are lucky to have one of the most accurate and fastest Tesseract libraries available.
IronOCR extends Google Tesseract with IronTesseract, a native C# OCR library with improved stability and higher accuracy than the free Tesseract library.
This article compares the two and explains why .NET developers should strongly consider using IronOCR IronTesseract over vanilla Tesseract.
How to Use Tesseract OCR in C# for .NET?
- Install Google Tesseract and IronOCR for .NET into Visual Studio
- Check the latest builds in C#
- Review accuracy and image compatibility
- Test performance and API function
- Consider Multi-Language Support
Code Example for .NET OCR Usage - Extract Text from Images in C#
Use NuGet Package Manager to install the IronOCR NuGet Package into your Visual Studio solution.
:path=/static-assets/ocr/content-code-examples/tutorials/c-sharp-tesseract-ocr-1.cs
using IronOcr;
using System;
var ocr = new IronTesseract();
// Hundreds of languages available
ocr.Language = OcrLanguage.English;
using var input = new OcrInput();
var pageindices = new int[] { 1, 2 };
input.LoadImageFrames(@"img\example.tiff", pageindices);
// input.DeNoise(); optional filter
// input.Deskew(); optional filter
OcrResult result = ocr.Read(input);
Console.WriteLine(result.Text);
// Explore the OcrResult using IntelliSense
Imports IronOcr
Imports System
Dim ocr = New IronTesseract()
' Hundreds of languages available
ocr.Language = OcrLanguage.English
Dim input = New OcrInput()
Dim pageindices = New Integer() { 1, 2 }
input.LoadImageFrames("img\example.tiff", pageindices)
' input.DeNoise(); optional filter
' input.Deskew(); optional filter
Dim result As OcrResult = ocr.Read(input)
Console.WriteLine(result.Text)
' Explore the OcrResult using IntelliSense
Installation Options
Using Tesseract Engine for OCR with .NET
When using Tesseract Engine, most of us are working with a C++ library.
Interop is not a lot of fun in .NET, and it has poor cross-platform and Azure compatibility. It requires us to choose the bitness of our application, meaning that we may only deploy to either 32-bit or 64-bit targets.
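To make the bitness constraint concrete, here is a minimal sketch of the check an interop-based deployment typically has to make at startup. It uses only standard .NET APIs. The folder names ("x86", "x64", "arm64") and the helper NativeFolderFor are hypothetical and only illustrate the idea; each wrapper defines its own layout.

```csharp
using System;
using System.Runtime.InteropServices;

// Hypothetical helper: maps the current process architecture to the folder
// where a matching native Tesseract build would have to be deployed.
// The folder names are assumptions for illustration only.
static string NativeFolderFor(Architecture arch) => arch switch
{
    Architecture.X86 => "x86",
    Architecture.X64 => "x64",
    Architecture.Arm64 => "arm64",
    _ => throw new PlatformNotSupportedException(
        $"No native build available for {arch}")
};

// A 32-bit process cannot load a 64-bit native DLL, and vice versa,
// so the process architecture decides which binaries must be present.
Console.WriteLine($"64-bit process: {Environment.Is64BitProcess}");
Console.WriteLine(
    $"Native DLLs expected under: {NativeFolderFor(RuntimeInformation.ProcessArchitecture)}");
```

If the application is compiled as AnyCPU, this choice is deferred to run time, which is why interop wrappers usually have to ship a native build for every architecture they want to support.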
We may need to ensure that Visual C++ runtimes are installed and even compile Tesseract ourselves to get the latest version. Free C# wrappers for these may be years behind the latest release.
We also have to find, download and manage C++ DLLs and EXEs we may not understand, and deploy them in environments where permissions may not allow them to run.
It is easy to install using NuGet Package Manager to extract text from images and PDF files using Optical Character Recognition.
IronOCR Tesseract for C#
With IronOCR, all Tesseract installation happens entirely using the NuGet Package Manager.
Install-Package IronOcr
There are no native DLLs or EXEs to install. Everything is handled by a single .NET component library.
The entire API is native .NET: a simple C# API built on Tesseract.
It supports these kinds of Visual Studio projects to add optical character recognition in C#:
- .NET Framework 4.6.2 and above
- .NET Standard 2.0 and above
- .NET Core 2.0 and above (including 3.x, .NET 5, 6, 7 & 8)
Up To Date & Maintained
Google Tesseract with C#
The latest builds of Tesseract 5 have never been designed to compile on Windows.
Installing Tesseract 5 for C# for free requires manually modifying and compiling Leptonica and Tesseract for Windows. The MinGW cross-compile chain is not successful at producing Windows interop binaries as of today.
In addition, free C# API wrappers on GitHub may be years behind or incompatible.
IronOCR Tesseract for .NET
IronOCR offers numerous advantages, including a user-friendly API for seamless integration into applications. It supports various image formats like JPEG, PNG, TIFF, and PDF, and provides advanced features such as automatic image preprocessing. Additionally, it's backed by a dedicated team offering commercial support and updates.
It runs Tesseract 5 out of the box on Windows, macOS, Linux, Azure, AWS, Lambda, Mono, and Xamarin Mac with little or no configuration. There are no native binaries to manage, and it is compatible with both .NET Framework and .NET Core.
There is little else to say other than it has been done right.
Google OCR
Google Cloud OCR (Optical Character Recognition) is a service provided by Google Cloud Platform (GCP) that allows developers to extract text from images and scanned documents using machine learning algorithms.
Accuracy
Google Tesseract in .NET Projects
Tesseract as a library was designed for perfect documents, where a machine renders high-resolution text to a screen and then reads it back. That is why Tesseract is good at reading perfect documents.
The problem is that in the real world, that is not what we have. If Tesseract encounters an image that is rotated, skewed, is of a low DPI, scanned, or has background noise, it becomes almost impossible for Tesseract to get data from that image. In addition, Tesseract will also take a very long time to process that document before giving you back nonsense information.
A simple document that is very easy to read by eye may still not be read well by Tesseract.
Tesseract is a free library optimal for reading straight and perfect text of standardized typefaces.
To use Tesseract when we are using scanned or photographed documents where the images are not digitally perfect like screenshots, we need to perform image preprocessing. This is normally done with Photoshop batch scripts or advanced ImageMagick usage.
Generally, this needs to be developed on a case-by-case basis for each type of document you are trying to deal with and can take weeks of development.
IronOCR Tesseract in .NET Projects
IronOCR takes this headache away. Users often achieve 99.8-100% accuracy with minimal configuration.
:path=/static-assets/ocr/content-code-examples/tutorials/c-sharp-tesseract-ocr-2.cs
using IronOcr;
using System;
var ocr = new IronTesseract();
using var input = new OcrInput();
var pageindices = new int[] { 1, 2 };
input.LoadImageFrames(@"img\example.tiff", pageindices);
input.DeNoise(); //fixes digital noise
input.Deskew(); //fixes rotation and perspective
// there are dozens more filters, but most users won't need them
OcrResult result = ocr.Read(input);
Console.WriteLine(result.Text);
Imports IronOcr
Imports System
Dim ocr = New IronTesseract()
Dim input = New OcrInput()
Dim pageindices = New Integer() { 1, 2 }
input.LoadImageFrames("img\example.tiff", pageindices)
input.DeNoise() 'fixes digital noise
input.Deskew() 'fixes rotation and perspective
' there are dozens more filters, but most users won't need them
Dim result As OcrResult = ocr.Read(input)
Console.WriteLine(result.Text)
Image Compatibility
Google Tesseract in .NET
Tesseract only accepts the Leptonica PIX image format, which appears in C# as an IntPtr to a C++ object. PIX objects are not managed memory, and failure to handle them with care in C# results in memory leaks.
Leptonica has good general image compatibility but throws many console warnings and errors. There are known issues with TIFF files and limited support for PDF OCR.
IronOCR Tesseract for .NET
Images are memory managed. PDF and TIFF are supported. System.Drawing, Stream, and byte array inputs are included for every file format.
Broad image support:
- PDF Documents
- PDF Pages
- MultiFrame TIFF files
- JPEG & JPEG2000
- GIF
- PNG
- BMP
- WBMP
- System.Drawing.Image
- System.Drawing.Bitmap
- System.IO.Streams of images
- Binary image data (byte[])
- And many more...
OCR Image Compatibility Code Example
:path=/static-assets/ocr/content-code-examples/tutorials/c-sharp-tesseract-ocr-3.cs
using IronOcr;
using System;
var ocr = new IronTesseract();
using var input = new OcrInput();
input.LoadPdf("example.pdf", Password: "password");
var pageindices = new int[] { 1, 2 };
input.LoadImageFrames("multi-frame.tiff", pageindices);
input.LoadImage("image1.png");
input.LoadImage("image2.jpeg");
//... many more
var result = ocr.Read(input);
Console.WriteLine(result.Text);
Imports IronOcr
Imports System
Dim ocr = New IronTesseract()
Dim input = New OcrInput()
input.LoadPdf("example.pdf", Password:= "password")
Dim pageindices = New Integer() { 1, 2 }
input.LoadImageFrames("multi-frame.tiff", pageindices)
input.LoadImage("image1.png")
input.LoadImage("image2.jpeg")
'... many more
Dim result = ocr.Read(input)
Console.WriteLine(result.Text)
Performance
Free Google Tesseract
Google Tesseract can produce fast and accurate results if it is properly tuned and the input images have been preprocessed using Photoshop or ImageMagick.
You will notice that most Tesseract examples online are actually from high-resolution screenshots with no digital noise, in fonts that Tesseract has been designed to work well with.
Tesseract's own documentation states that input images should be sampled at 300 DPI or higher for OCR to be effective.
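As a rough worked example of what the 300 DPI guideline means in pixels, the sketch below converts a physical page size into pixel dimensions and estimates the effective DPI of an existing scan. It uses only standard .NET APIs. The helper names PixelsAtDpi and EffectiveDpi are hypothetical, and the A4 size in inches (8.27 x 11.69) is the only outside figure assumed.

```csharp
using System;

// Hypothetical helper: pixel dimensions needed to represent a page
// of the given physical size at the given DPI.
static (int Width, int Height) PixelsAtDpi(double widthInches, double heightInches, int dpi) =>
    ((int)Math.Round(widthInches * dpi), (int)Math.Round(heightInches * dpi));

// Hypothetical helper: effective DPI of a scan, given its pixel width
// and the physical width of the page it was scanned from.
static double EffectiveDpi(int pixelWidth, double widthInches) =>
    pixelWidth / widthInches;

// An A4 page is roughly 8.27 x 11.69 inches.
var a4 = PixelsAtDpi(8.27, 11.69, 300);
Console.WriteLine($"A4 at 300 DPI: {a4.Width} x {a4.Height} px"); // 2481 x 3507 px

// A 1240 px wide scan of an A4 page is only about half the recommended resolution.
double dpi = EffectiveDpi(1240, 8.27);
Console.WriteLine($"Effective DPI: {dpi:F0}, upscale factor to reach 300: {300 / dpi:F1}");
```

A quick estimate like this shows whether a batch of scans is likely to need upscaling or rescanning before it is passed to plain Tesseract.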
IronOCR Tesseract Library
The IronOcr .NET Tesseract DLL works accurately and at speed for most images out of the box. We have implemented multithreading to make use of the multi-core processors that most machines now use.
Even low-resolution images generally work with a high degree of accuracy in your program. No Photoshop required.
Developers often achieve over 99% accuracy with little configuration, which matches current Machine Learning web APIs without the ongoing costs, security risks and bandwidth issues.
Speeds are fast but can be improved with a little coding.
Performance Tuning Example
:path=/static-assets/ocr/content-code-examples/tutorials/c-sharp-tesseract-ocr-4.cs
using IronOcr;
using System;
var ocr = new IronTesseract();
// Configure for speed. 35% faster and only 0.2% loss of accuracy
ocr.Configuration.BlackListCharacters = "~`$#^*_}{][|\\@¢©«»°±·×‑–—‘’“”•…′″€™←↑→↓↔⇄⇒∅∼≅≈≠≤≥≪≫⌁⌘○◔◑◕●☐☑☒☕☮☯☺♡⚓✓✰";
ocr.Configuration.PageSegmentationMode = TesseractPageSegmentationMode.Auto;
ocr.Configuration.ReadBarCodes = false;
ocr.Language = OcrLanguage.EnglishFast;
using var input = new OcrInput();
var pageindices = new int[] { 1, 2 };
input.LoadImageFrames(@"img\Potter.tiff", pageindices);
var result = ocr.Read(input);
Console.WriteLine(result.Text);
Imports IronOcr
Imports System
Dim ocr = New IronTesseract()
' Configure for speed. 35% faster and only 0.2% loss of accuracy
ocr.Configuration.BlackListCharacters = "~`$#^*_}{][|\@¢©«»°±·×‑–—‘’“”•…′″€™←↑→↓↔⇄⇒∅∼≅≈≠≤≥≪≫⌁⌘○◔◑◕●☐☑☒☕☮☯☺♡⚓✓✰"
ocr.Configuration.PageSegmentationMode = TesseractPageSegmentationMode.Auto
ocr.Configuration.ReadBarCodes = False
ocr.Language = OcrLanguage.EnglishFast
Dim input = New OcrInput()
Dim pageindices = New Integer() { 1, 2 }
input.LoadImageFrames("img\Potter.tiff", pageindices)
Dim result = ocr.Read(input)
Console.WriteLine(result.Text)
API
Google Tesseract OCR in .NET
We have 2 free choices:
- Work with interop layers - Many of those found on GitHub are out of date and have unresolved tickets, memory leaks, and console warnings. They may not support .NET Core or .NET Standard.
- Work with the command line EXE - Hard to deploy and constantly interrupted by virus scanners and security policies.
Neither of the above may work well in Web Applications, Azure, Mono, Xamarin, Linux, Docker, or Mac.
IronOCR Tesseract OCR Library for .NET
A managed and tested .NET library for Tesseract called IronTesseract.
Fully documented with IntelliSense support.
Simplest Hello World for Tesseract in .NET
:path=/static-assets/ocr/content-code-examples/tutorials/c-sharp-tesseract-ocr-5.cs
using IronOcr;
var text = new IronTesseract().Read("img.png").Text;
Imports IronOcr
Dim text = (New IronTesseract()).Read("img.png").Text
IronOCR is under active development and is supported by professional software engineers with a median experience level of over 20 years.
Compatibility
Google Tesseract + Interop for .NET
This may be made to work on most platforms if you are willing to find dependencies, build from source or update a free C# interop wrapper. These resources may not be fully compatible with .NET Core or .NET Standard projects.
At present, we have not encountered any logical and simple way to install LibTesseract5 for Windows safely without IronTesseract.
IronOCR Tesseract .NET OCR Library
Unit Tested with CI, and has everything you need to run on:
- Desktop Applications
- Console Apps
- Server Processes
- Web Applications & MVC
- JetBrains Rider
- Xamarin Mac
On:
- Windows
- Azure
- Linux
- Docker
- Mac
- BSD and FreeBSD
.NET Support for:
- .NET Framework 4.6.2 and above
- .NET Core - All active versions above 2.0
- .NET Standard - All active versions above 2.0
- Mono
- Xamarin Mac
Language Support
Google Tesseract
Tesseract dictionaries are managed as files and must be cloned from https://github.com/tesseract-ocr/tessdata. The repository is about 4 GB.
Some Linux distros offer some help in managing Tesseract dictionaries via apt-get.
Exact folder structures must be maintained or Tesseract fails.
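Because a missing or misplaced dictionary only shows up as a failure at run time, a small pre-flight check can help. The following is a minimal sketch under stated assumptions: language files are named <code>.traineddata (for example eng.traineddata) and live in a directory pointed to by the TESSDATA_PREFIX environment variable, with a local "tessdata" folder as a fallback. The helper names TrainedDataPath and HasLanguage are hypothetical, and only standard .NET APIs are used.

```csharp
using System;
using System.IO;

// Hypothetical helper: expected location of a language file,
// assuming the "<code>.traineddata" naming convention.
static string TrainedDataPath(string tessdataDir, string languageCode) =>
    Path.Combine(tessdataDir, languageCode + ".traineddata");

// Hypothetical helper: true when the language file exists where Tesseract would look.
static bool HasLanguage(string tessdataDir, string languageCode) =>
    File.Exists(TrainedDataPath(tessdataDir, languageCode));

// Assumption: TESSDATA_PREFIX points at the tessdata directory itself;
// fall back to a "tessdata" folder next to the application.
var tessdataDir = Environment.GetEnvironmentVariable("TESSDATA_PREFIX") ?? "tessdata";

foreach (var code in new[] { "eng", "ara" })
{
    var status = HasLanguage(tessdataDir, code) ? "found" : "MISSING";
    Console.WriteLine($"{code}: {status} at {TrainedDataPath(tessdataDir, code)}");
}
```

Running a check like this at startup turns an obscure engine failure into a clear message about which language file is missing and where it was expected.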
IronOCR Tesseract
Supports more languages than https://github.com/tesseract-ocr/tessdata and they are each managed as a NuGet Package via NuGet Package Manager or easily installable downloads.
Unicode Language Example
:path=/static-assets/ocr/content-code-examples/tutorials/c-sharp-tesseract-ocr-6.cs
using IronOcr;
var ocr = new IronTesseract();
ocr.Language = OcrLanguage.Arabic;
using var input = new OcrInput();
var pageindices = new int[] { 1, 2 };
input.LoadImageFrames("img/arabic.gif", pageindices);
// Add image filters if needed
// In this case, even though the input is very low quality,
// IronTesseract can read what conventional Tesseract cannot.
var result = ocr.Read(input);
// Console can't print Arabic on Windows easily.
// Let's save to disk instead.
result.SaveAsTextFile("arabic.txt");
Imports IronOcr
Dim ocr = New IronTesseract()
ocr.Language = OcrLanguage.Arabic
Dim input = New OcrInput()
Dim pageindices = New Integer() { 1, 2 }
input.LoadImageFrames("img/arabic.gif", pageindices)
' Add image filters if needed
' In this case, even though the input is very low quality,
' IronTesseract can read what conventional Tesseract cannot.
Dim result = ocr.Read(input)
' Console can't print Arabic on Windows easily.
' Let's save to disk instead.
result.SaveAsTextFile("arabic.txt")
Multiple Language Example
It is also possible for OCR to use multiple languages at the same time. This can really help get English language metadata and URLs in Unicode documents.
:path=/static-assets/ocr/content-code-examples/tutorials/c-sharp-tesseract-ocr-7.cs
using IronOcr;
// For the Chinese Language Pack:
// PM> Install-Package IronOcr.Languages.ChineseSimplified
var ocr = new IronTesseract();
ocr.Language = OcrLanguage.ChineseSimplified;
ocr.AddSecondaryLanguage(OcrLanguage.English);
// We can add any number of languages
using var input = new OcrInput();
input.LoadPdf("multi-language.pdf");
var result = ocr.Read(input);
result.SaveAsTextFile("results.txt");
Imports IronOcr
' For the Chinese Language Pack:
' PM> Install-Package IronOcr.Languages.ChineseSimplified
Dim ocr = New IronTesseract()
ocr.Language = OcrLanguage.ChineseSimplified
ocr.AddSecondaryLanguage(OcrLanguage.English)
' We can add any number of languages
Dim input = New OcrInput()
input.LoadPdf("multi-language.pdf")
Dim result = ocr.Read(input)
result.SaveAsTextFile("results.txt")
What Else
IronOCR Tesseract has additional features for .NET software developers.
- Automatic image analysis to configure Tesseract for common errors
- Image to Searchable PDF Conversion
- PDF OCR
- Can make any PDF searchable and indexable on search engines
- OCR to HTML output
- TIFF to PDF conversion
- Barcode Reading
- QR Code Reading
- Multithreading
- An advanced OcrResult class that allows inspection of Blocks, Paragraphs, Lines, Words, Characters, Fonts and OCR statistics.
Conclusion
Google Tesseract for C# OCR
This is the right library to use for free & academic projects in C#.
Tesseract is an excellent resource for C++ developers, but it is not a complete OCR library for .NET.
This is especially true when dealing with scanned or photographed images, because these images need to be processed so as to be orthogonal, standardized, high-resolution, and free of digital noise before Tesseract can accurately work with them.
IronOCR Tesseract OCR Library for .NET Framework & Core
In contrast, IronOCR can do this and more in a single line of code.
It is true: IronOCR uses Tesseract for its internal OCR engine.
It is a very finely tuned Tesseract build for C#, with many performance improvements and features added as standard.
It is the right choice for any project where developer time is valuable. When was the last time you found a .NET software engineer with weeks of time on their hands?
Get Started on your C# Tesseract Project
Use NuGet Package Manager in any Visual Studio project:
Install-Package IronOcr
Or you can download the IronOCR Tesseract .NET DLL and install it manually.
Any .NET coder should be able to get started with IronOCR Tesseract OCR in 5 minutes using the examples on this page.
To learn about more services that offer OCR technology, check out the following comparison article: AWS vs Google Vision (OCR Features Comparison).