How to Use Tesseract OCR in C#: Alternatives with IronOCR
Tesseract is an excellent academic OCR (optical character recognition) library, available to developers for free for almost all use cases.
C# is lucky to have one of the most accurate and fast Tesseract Libraries available.
IronOCR extends Google Tesseract with IronTesseract, a native C# OCR library with improved stability and higher accuracy than the free Tesseract library.
This article compares the two and explains why .NET developers should strongly consider using IronOCR's IronTesseract over vanilla Tesseract.
How to Use Tesseract OCR in C# for .NET?
- Install Google Tesseract and IronOCR for .NET into Visual Studio
- Check the latest builds in C#
- Review accuracy and image compatibility
- Test performance and API function
- Consider Multi-Language Support
Code Example for .NET OCR Usage - Extract Text from Images in C#
Use NuGet Package Manager to install the IronOCR NuGet Package into your Visual Studio solution.
:path=/static-assets/ocr/content-code-examples/tutorials/c-sharp-tesseract-ocr-1.cs
using IronOcr;
using System;
// Initialize IronTesseract for performing OCR (Optical Character Recognition)
var ocr = new IronTesseract
{
    // Set the language for the OCR process to English
    Language = OcrLanguage.English
};
// Create a new OCR input that can hold the images to be processed
using var input = new OcrInput();
// Specify the page indices to be processed from the TIFF image
var pageIndices = new int[] { 1, 2 };
// Load specific pages of the TIFF image into the OCR input object; this is useful for multi-page image files
input.LoadImageFrames(@"img\example.tiff", pageIndices);
// Optional pre-processing steps (uncomment as needed)
// input.DeNoise(); // Use this to remove noise from the image if necessary
// input.Deskew(); // Use this to straighten images that are slightly tilted
// Perform OCR on the provided input
OcrResult result = ocr.Read(input);
// Output the recognized text to the console
Console.WriteLine(result.Text);
// Note: The OcrResult object contains additional information about the OCR process
// and can be explored further using IntelliSense features in your IDE.
Imports IronOcr
Imports System
' Initialize IronTesseract for performing OCR (Optical Character Recognition)
Dim ocr = New IronTesseract With {.Language = OcrLanguage.English}
' Create a new OCR input that can hold the images to be processed
Dim input = New OcrInput()
' Specify the page indices to be processed from the TIFF image
Dim pageIndices = New Integer() { 1, 2 }
' Load specific pages of the TIFF image into the OCR input object; this is useful for multi-page image files
input.LoadImageFrames("img\example.tiff", pageIndices)
' Optional pre-processing steps (uncomment as needed)
' input.DeNoise() ' Use this to remove noise from the image if necessary
' input.Deskew() ' Use this to straighten images that are slightly tilted
' Perform OCR on the provided input
Dim result As OcrResult = ocr.Read(input)
' Output the recognized text to the console
Console.WriteLine(result.Text)
' Note: The OcrResult object contains additional information about the OCR process
' and can be explored further using IntelliSense features in your IDE.
Installation Options
Using Tesseract Engine for OCR with .NET
When using Tesseract Engine, most of us are working with a C++ library.
Interop is not a lot of fun in .NET, and it has poor cross-platform and Azure compatibility. It requires us to choose the bitness of our application, meaning that we may deploy to either 32-bit or 64-bit targets, but not both.
We may need to ensure that Visual C++ runtimes are installed, and even compile Tesseract ourselves to get the latest version. Free C# wrappers for these may be years behind the latest release.
We also have to find, download and manage C++ DLLs and EXEs we may not understand and deploy them in environments where permissions may not allow them to run.
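As a rough illustration of the bitness problem, the sketch below shows the kind of check an application might run before loading a native Tesseract binary. The `x86` and `x64` folder names are hypothetical, chosen only for illustration; real wrappers use their own layouts, and ARM targets are ignored here.

```csharp
using System;
using System.IO;
using System.Runtime.InteropServices;

// Pick the sub-folder that would hold native binaries matching this process.
// The "x64"/"x86" folder names are assumptions, not a fixed convention.
string nativeFolder = Environment.Is64BitProcess ? "x64" : "x86";

// Hypothetical location of the native libraries next to the executable.
string nativePath = Path.Combine(AppContext.BaseDirectory, nativeFolder);

Console.WriteLine($"Process architecture: {RuntimeInformation.ProcessArchitecture}");
Console.WriteLine($"Native binaries would be loaded from: {nativePath}");

// A 32-bit DLL cannot be loaded into a 64-bit process (and vice versa),
// so a missing folder here usually points to a bitness mismatch at deploy time.
if (!Directory.Exists(nativePath))
{
    Console.WriteLine("No native binaries found for this architecture.");
}
```

Every deployment target needs a matching set of native files, which is the maintenance burden described above.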
By contrast, IronOCR is easy to install using NuGet Package Manager and extracts text from images and PDF files using Optical Character Recognition.
IronOCR Tesseract for C#
With IronOCR, all Tesseract installation happens entirely using the NuGet Package Manager.
Install-Package IronOcr
There are no native DLLs or EXEs to install. Everything is handled by a single .NET component library.
The entire API is native .NET: a simple C# API built on Tesseract.
It supports these kinds of Visual Studio projects to add optical character recognition in C#:
- .NET Framework 4.6.2 and above
- .NET Standard 2.0 and above
- .NET Core 2.0 and above (including 3.x, .NET 5, 6, 7 & 8)
Up To Date & Maintained
Google Tesseract with C#
The latest builds of Tesseract 5 were never designed to compile on Windows.
Installing Tesseract 5 for C# for free requires manually modifying and compiling Leptonica and Tesseract for Windows. The MinGW cross-compile chain is not successful at producing Windows interop binaries as of today.
In addition, free C# API wrappers on GitHub may be years behind or incompatible.
IronOCR Tesseract for .NET
IronOCR offers numerous advantages, including a user-friendly API for seamless integration into applications. It supports various image formats like JPEG, PNG, TIFF, and PDF, and provides advanced features such as automatic image preprocessing. Additionally, it's backed by a dedicated team offering commercial support and updates.
It runs Tesseract 5 out of the box on Windows, macOS, Linux, Azure, AWS, Lambda, Mono, and Xamarin Mac with little or no configuration. There are no native binaries to manage, and it is compatible with both .NET Framework and .NET Core.
There is little else to say other than it has been done right.
Google OCR
Google Cloud OCR (Optical Character Recognition) is a service provided by Google Cloud Platform (GCP) that allows developers to extract text from images and scanned documents using machine learning algorithms.
Accuracy
Google Tesseract in .NET Projects
Tesseract was designed for perfect documents: high-resolution text rendered by a machine and then read back. That is why Tesseract is good at reading perfect documents.
The problem is that this is not what we have in the real world. If Tesseract encounters an image that is rotated, skewed, scanned, of low DPI, or affected by background noise, it becomes almost impossible for Tesseract to get data from that image. Tesseract will also take a very long time to process that document before giving you back nonsense.
A simple document that is very easy to read by eye may still be read poorly by Tesseract.
Tesseract is a free library optimal for reading straight and perfect text of standardized typefaces.
To use Tesseract on scanned or photographed documents, which are not digitally perfect the way screenshots are, we need to perform image preprocessing. This is normally done with Photoshop batch scripts or advanced ImageMagick usage.
Generally, this needs to be developed on a case-by-case basis for each type of document you are trying to deal with and can take weeks of development.
IronOCR Tesseract in .NET Projects
IronOCR takes this headache away. Users often achieve 99.8-100% accuracy with minimal configuration.
:path=/static-assets/ocr/content-code-examples/tutorials/c-sharp-tesseract-ocr-2.cs
// Import the IronOcr and System namespaces. IronOcr is a library for Optical Character Recognition.
using IronOcr;
using System;
// Create an instance of the IronTesseract class from the IronOcr library.
// This class provides methods to perform OCR on image inputs.
var ocr = new IronTesseract();
// Create an OcrInput object to load and manipulate images for OCR processing.
// The 'using' construct ensures that resources are automatically cleaned up after use.
using var input = new OcrInput();
// Specify the indices of the pages to be loaded from the multi-page TIFF image.
// See the LoadImageFrames reference for whether indices start at 0 or 1.
var pageIndices = new int[] { 1, 2 };
// Load the specified image frames from the TIFF file located at the given relative path.
// Relative paths are resolved against the working directory of the application.
input.LoadImageFrames(@"img\example.tiff", pageIndices);
// Apply a de-noising filter to the images to reduce digital noise.
// This enhances the accuracy of subsequent OCR processing.
input.DeNoise();
// Apply a deskewing filter to straighten images that are slightly rotated.
// This helps in improving OCR precision.
input.Deskew();
// Perform the OCR operation on the prepared input and obtain the result.
// The result contains the recognized text from the processed images.
OcrResult result = ocr.Read(input);
// Output the recognized text to the console.
Console.WriteLine(result.Text);
' Import the IronOcr and System namespaces. IronOcr is a library for Optical Character Recognition.
Imports IronOcr
Imports System
' Create an instance of the IronTesseract class from the IronOcr library.
' This class provides methods to perform OCR on image inputs.
Dim ocr = New IronTesseract()
' Create an OcrInput object to load and manipulate images for OCR processing.
' Dispose of it when finished, for example by wrapping this code in a Using block.
Dim input = New OcrInput()
' Specify the indices of the pages to be loaded from the multi-page TIFF image.
' See the LoadImageFrames reference for whether indices start at 0 or 1.
Dim pageIndices = New Integer() { 1, 2 }
' Load the specified image frames from the TIFF file located at the given relative path.
' Relative paths are resolved against the working directory of the application.
input.LoadImageFrames("img\example.tiff", pageIndices)
' Apply a de-noising filter to the images to reduce digital noise.
' This enhances the accuracy of subsequent OCR processing.
input.DeNoise()
' Apply a deskewing filter to straighten images that are slightly rotated.
' This helps in improving OCR precision.
input.Deskew()
' Perform the OCR operation on the prepared input and obtain the result.
' The result contains the recognized text from the processed images.
Dim result As OcrResult = ocr.Read(input)
' Output the recognized text to the console.
Console.WriteLine(result.Text)
Image Compatibility
Google Tesseract in .NET
Tesseract only accepts the Leptonica PIX image format, which is exposed in C# as an IntPtr to a C++ object. PIX objects are not managed memory, and failure to handle them with care in C# results in memory leaks.
Leptonica has good general image compatibility but throws many console warnings and errors. There are known issues with TIFF files and limited support for PDF OCR.
With IronOCR, images are memory managed. PDF and TIFF are supported. System.Drawing, Stream, and byte array inputs are accepted for every file format.
Broad image support:
- PDF Documents
- PDF pages
- MultiFrame TIFF files
- JPEG & JPEG2000
- GIF
- PNG
- BMP
- WBMP
- System.Drawing.Image
- System.Drawing.Bitmap
- System.IO.Streams of images
- Binary image data (byte[])
- And many more...
OCR Image Compatibility Code Example
:path=/static-assets/ocr/content-code-examples/tutorials/c-sharp-tesseract-ocr-3.cs
using IronOcr;
using System;
// This code demonstrates how to perform OCR using IronTesseract on various input types.
// These include password-protected PDFs, multi-frame TIFF images, and other image files.
// Create an instance of IronTesseract to handle OCR operations.
var ocr = new IronTesseract();
// Create an OcrInput object to load PDF and image files.
using var input = new OcrInput();
// Load a password-protected PDF file.
// "example.pdf" is the file name, and "password" is the PDF password.
// If the PDF is not password-protected, omit the second argument.
input.LoadPdf("example.pdf", "password");
// Define the indices of the pages to be loaded from a multi-frame TIFF image.
// See the LoadImageFrames reference for whether indices start at 0 or 1.
var pageIndices = new int[] { 1, 2 };
// Load specific pages from a multi-frame TIFF image, specified by the pageIndices array.
input.LoadImageFrames("multi-frame.tiff", pageIndices);
// Load individual image files into the OCR input.
// The images are loaded in the order they are added. You can load as many as needed.
input.LoadImage("image1.png");
input.LoadImage("image2.jpeg");
// Perform OCR on all the loaded inputs (PDF, TIFF frames, and images).
// The result contains the recognized text from all the input sources.
var result = ocr.Read(input);
// Output the recognized text to the console.
Console.WriteLine(result.Text);
Imports IronOcr
Imports System
' This code demonstrates how to perform OCR using IronTesseract on various input types.
' These include password-protected PDFs, multi-frame TIFF images, and other image files.
' Create an instance of IronTesseract to handle OCR operations.
Dim ocr = New IronTesseract()
' Create an OcrInput object to load PDF and image files.
Dim input = New OcrInput()
' Load a password-protected PDF file.
' "example.pdf" is the file name, and "password" is the PDF password.
' If the PDF is not password-protected, omit the second argument.
input.LoadPdf("example.pdf", "password")
' Define the indices of the pages to be loaded from a multi-frame TIFF image.
' See the LoadImageFrames reference for whether indices start at 0 or 1.
Dim pageIndices = New Integer() { 1, 2 }
' Load specific pages from a multi-frame TIFF image, specified by the pageIndices array.
input.LoadImageFrames("multi-frame.tiff", pageIndices)
' Load individual image files into the OCR input.
' The images are loaded in the order they are added. You can load as many as needed.
input.LoadImage("image1.png")
input.LoadImage("image2.jpeg")
' Perform OCR on all the loaded inputs (PDF, TIFF frames, and images).
' The result contains the recognized text from all the input sources.
Dim result = ocr.Read(input)
' Output the recognized text to the console.
Console.WriteLine(result.Text)
Performance
Free Google Tesseract
Google Tesseract can produce fast and accurate results if it is properly tuned and the input images have been preprocessed using Photoshop or ImageMagick.
You will notice that most Tesseract examples online are actually from high-resolution screenshots with no digital noise, in fonts that Tesseract has been designed to work well with.
Tesseract's own documentation states that input images should be sampled at 300DPI or higher for OCR to be effective.
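To make the 300 DPI guideline concrete, the following sketch estimates the effective resolution of a scan from its pixel width and the physical width of the page. The helper name `EstimateDpi` and the sample values are illustrative assumptions, not part of any library.

```csharp
using System;

// Minimum resolution suggested for OCR in the paragraph above.
const double RecommendedDpi = 300.0;

// Hypothetical sample: a scan 1240 pixels wide of a page 8.27 inches wide (A4).
int pixelWidth = 1240;
double pageWidthInches = 8.27;

double dpi = EstimateDpi(pixelWidth, pageWidthInches);
Console.WriteLine($"Estimated resolution: {dpi:F0} DPI");
Console.WriteLine(dpi >= RecommendedDpi
    ? "Resolution meets the 300 DPI guideline."
    : "Resolution is below the 300 DPI guideline; consider rescanning at a higher resolution.");

// Effective DPI is the number of pixels per physical inch.
static double EstimateDpi(int pixels, double inches)
{
    if (inches <= 0) throw new ArgumentOutOfRangeException(nameof(inches));
    return pixels / inches;
}
```

A check like this can run before OCR to flag inputs that are likely to need preprocessing.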
IronOCR Tesseract Library
The IronOcr .NET Tesseract DLL works accurately and at speed for most images out of the box. We have implemented multithreading to make use of the multi-core processors that most machines now use.
Even low-resolution images generally work with a high degree of accuracy in your program. No Photoshop required.
Developers often achieve over 99% accuracy with little configuration, which matches current Machine Learning web APIs without the ongoing costs, security risks, and bandwidth issues.
Speeds are fast but can be improved with a little coding.
Performance Tuning Example
:path=/static-assets/ocr/content-code-examples/tutorials/c-sharp-tesseract-ocr-4.cs
using IronOcr;
using System;
// This code configures and uses the IronTesseract OCR library to read text from specific pages of a multi-page TIFF image.
// Initialize IronTesseract for OCR
var ocr = new IronTesseract();
// Configure for speed: blacklist certain characters to improve processing speed
ocr.Configuration.BlackListCharacters = "~`$#^*_}{][\\@¢©«»°±·×‑–—‘’“”•…′″€™←↑→↓↔⇄⇒∅∼≅≈≠≤≥≪≫⌁⌘○◔◑◕●☐☑☒☕☮☯☺♡⚓✓✰";
// Set the page segmentation mode to automatically detect elements in the image
ocr.Configuration.PageSegmentationMode = TesseractPageSegmentationMode.Auto;
// Disable reading of barcodes to speed up the OCR process
ocr.Configuration.ReadBarCodes = false;
// Use a fast variant of the English language pack for quick OCR processing
ocr.Language = OcrLanguage.EnglishFast;
// Create a new OcrInput object to hold the images to be processed
using var input = new OcrInput();
// Define the page indices to load from the multi-page TIFF image
var pageIndices = new int[] { 1, 2 }; // See the LoadImageFrames reference for the index base
// Load specific frames (pages) from the TIFF image
input.LoadImageFrames(@"img\Potter.tiff", pageIndices);
// Perform OCR on the input images
var result = ocr.Read(input);
// Output the OCR result text to the console
Console.WriteLine(result.Text);
Imports IronOcr
Imports System
' This code configures and uses the IronTesseract OCR library to read text from specific pages of a multi-page TIFF image.
' Initialize IronTesseract for OCR
Dim ocr = New IronTesseract()
' Configure for speed: blacklist certain characters to improve processing speed
ocr.Configuration.BlackListCharacters = "~`$#^*_}{][\@¢©«»°±·×‑–—‘’“”•…′″€™←↑→↓↔⇄⇒∅∼≅≈≠≤≥≪≫⌁⌘○◔◑◕●☐☑☒☕☮☯☺♡⚓✓✰"
' Set the page segmentation mode to automatically detect elements in the image
ocr.Configuration.PageSegmentationMode = TesseractPageSegmentationMode.Auto
' Disable reading of barcodes to speed up the OCR process
ocr.Configuration.ReadBarCodes = False
' Use a fast variant of the English language pack for quick OCR processing
ocr.Language = OcrLanguage.EnglishFast
' Create a new OcrInput object to hold the images to be processed
Dim input = New OcrInput()
' Define the page indices to load from the multi-page TIFF image
Dim pageIndices = New Integer() { 1, 2 } ' See the LoadImageFrames reference for the index base
' Load specific frames (pages) from the TIFF image
input.LoadImageFrames("img\Potter.tiff", pageIndices)
' Perform OCR on the input images
Dim result = ocr.Read(input)
' Output the OCR result text to the console
Console.WriteLine(result.Text)
API
Google Tesseract OCR in .NET
We have 2 free choices:
- Work with Interop layers - Many that are found on GitHub are out of date, have unresolved tickets, Memory Leaks & Console warnings. May not support .NET Core or Standard.
- Work with the command line EXE - Hard to deploy and constantly interrupted by virus scanners and security policies.
Neither of the above may work well in Web Applications, Azure, Mono, Xamarin, Linux, Docker, or Mac.
IronOCR Tesseract OCR Library for .NET
A managed and tested .NET Library for Tesseract called IronTesseract.
Fully documented with IntelliSense support.
Simplest Hello World for Tesseract in .NET
:path=/static-assets/ocr/content-code-examples/tutorials/c-sharp-tesseract-ocr-5.cs
// Import the IronOcr namespace for OCR, and System for Console and Exception.
using IronOcr;
using System;
// Create an instance of the IronTesseract class to perform OCR.
var ocr = new IronTesseract();
// Read the specified image file and extract the text using the OCR engine.
// Ensure that the image file "img.png" exists in the same directory as the executable or provide an appropriate path.
try
{
    OcrResult ocrResult = ocr.Read("img.png");
    // Extract the recognized text from the OCR result.
    string extractedText = ocrResult.Text;
    // Output the extracted text to the console for review.
    Console.WriteLine(extractedText);
}
catch (Exception ex)
{
    // If any error occurs during the OCR process, write the error message to the console.
    Console.WriteLine("An error occurred while trying to read the image: " + ex.Message);
}
' Import the IronOcr namespace to use OCR functionalities provided by the IronTesseract class.
Imports IronOcr
' Create an instance of the IronTesseract class to perform OCR.
Dim ocr = New IronTesseract()
' Read the specified image file and extract the text using the OCR engine.
' Ensure that the image file "img.png" exists in the same directory as the executable or provide an appropriate path.
Try
    Dim ocrResult As OcrResult = ocr.Read("img.png")
    ' Extract the recognized text from the OCR result.
    Dim extractedText As String = ocrResult.Text
    ' Output the extracted text to the console for review.
    Console.WriteLine(extractedText)
Catch ex As Exception
    ' If any error occurs during the OCR process, write the error message to the console.
    Console.WriteLine("An error occurred while trying to read the image: " & ex.Message)
End Try
IronOCR is under active development and is supported by professional software engineers with a median experience level of over 20 years.
Compatibility
Google Tesseract + Interop for .NET
This may be made to work on most platforms if you are willing to find dependencies, build from source, or update a free C# interop wrapper. These resources may not be fully compatible with .NET Core or .NET Standard projects.
At present, we have not encountered any logical and simple way to install LibTesseract5 for Windows safely without IronTesseract.
IronOCR Tesseract .NET OCR Library
Unit Tested with CI, and has everything you need to run on:
- Desktop applications
- Console Apps
- Server Processes
- Web Applications & MVC
- JetBrains Rider
- Xamarin Mac
On:
- Windows
- Azure
- Linux
- Docker
- Mac
- BSD and FreeBSD
.NET Support for:
- .NET Framework 4.6.2 and above
- .NET Core - All active versions above 2.0
- .NET Standard - All active versions above 2.0
- Mono
- Xamarin Mac
Language Support
Google Tesseract
Tesseract dictionaries are managed as files and must be cloned from https://github.com/tesseract-ocr/tessdata, which is about 4 GB.
Some Linux distros help manage Tesseract dictionaries via apt-get.
Exact folder structures must be maintained or Tesseract fails.
IronOCR Tesseract
IronOCR supports more languages than https://github.com/tesseract-ocr/tessdata, and each is managed as a NuGet package via NuGet Package Manager or as an easily installable download.
Unicode Language Example
:path=/static-assets/ocr/content-code-examples/tutorials/c-sharp-tesseract-ocr-6.cs
using IronOcr;
// This code snippet demonstrates how to use IronTesseract OCR engine to process multi-frame images,
// specifically to perform OCR on Arabic texts.
// Create an instance of the IronTesseract OCR engine.
var ocr = new IronTesseract
{
    // Set the OCR language to Arabic.
    Language = OcrLanguage.Arabic
};
// Create an OcrInput object to manage the input images for OCR.
// The using statement ensures that OcrInput is disposed of correctly, releasing any resources.
using var input = new OcrInput();
// Specify the frame indices of multi-frame images (e.g., GIF) that you want to process with OCR.
// See the LoadImageFrames reference for whether indices start at 0 or 1.
var pageIndices = new int[] { 1, 2 };
// Load the specified frames from the file into the OcrInput object for processing.
input.LoadImageFrames("img/arabic.gif", pageIndices);
// Optional: Add image filters if needed. This can help improve OCR quality, especially with low-quality images.
// IronTesseract includes advanced image filtering capabilities.
// Perform OCR on the input images and store the result.
var result = ocr.Read(input);
// If the console can't print Arabic on Windows easily, you can save the results instead.
// Save the OCR result to a text file for later review or processing.
result.SaveAsTextFile("arabic.txt");
Imports IronOcr
' This code snippet demonstrates how to use IronTesseract OCR engine to process multi-frame images,
' specifically to perform OCR on Arabic texts.
' Create an instance of the IronTesseract OCR engine.
Dim ocr = New IronTesseract With {.Language = OcrLanguage.Arabic}
' Create an OcrInput object to manage the input images for OCR.
' Dispose of it when finished, for example by wrapping this code in a Using block.
Dim input = New OcrInput()
' Specify the frame indices of multi-frame images (e.g., GIF) that you want to process with OCR.
' See the LoadImageFrames reference for whether indices start at 0 or 1.
Dim pageIndices = New Integer() { 1, 2 }
' Load the specified frames from the file into the OcrInput object for processing.
input.LoadImageFrames("img/arabic.gif", pageIndices)
' Optional: Add image filters if needed. This can help improve OCR quality, especially with low-quality images.
' IronTesseract includes advanced image filtering capabilities.
' Perform OCR on the input images and store the result.
Dim result = ocr.Read(input)
' If the console can't print Arabic on Windows easily, you can save the results instead.
' Save the OCR result to a text file for later review or processing.
result.SaveAsTextFile("arabic.txt")
Multiple Language Example
It is possible for OCR to use multiple languages at the same time. This can really help get English language metadata and URLs in Unicode documents.
:path=/static-assets/ocr/content-code-examples/tutorials/c-sharp-tesseract-ocr-7.cs
using IronOcr;
// Ensure that the IronOCR library is installed and appropriate language packs are added through NuGet.
// For the Chinese Language Pack use the following NuGet Package Manager command:
// PM> Install-Package IronOcr.Languages.ChineseSimplified
// Instantiate the IronTesseract OCR engine
var ocr = new IronTesseract();
// Set the primary OCR language to Simplified Chinese
ocr.Language = OcrLanguage.ChineseSimplified;
// Add English as a secondary OCR language.
// This helps the OCR engine better recognize documents containing both Chinese and English text.
ocr.AddSecondaryLanguage(OcrLanguage.English);
// Create a new OcrInput instance to hold the documents to be scanned.
// The use of a `using` statement ensures that resources are automatically released when the operation is completed.
using var input = new OcrInput();
// Load the PDF document into the OcrInput instance.
// Ensure the path is correct and that the file exists in the specified location.
input.LoadPdf("multi-language.pdf");
// Run the OCR process on the loaded input. This will return an OcrResult object containing the extracted text.
var result = ocr.Read(input);
// Save the extracted text to a text file.
// The result.SaveAsTextFile method writes the OCR results to "results.txt".
result.SaveAsTextFile("results.txt");
// This methodology allows for the processing of multi-language documents using IronOCR,
// enabling the extraction of text from PDFs in languages specified by the IronTesseract configuration.
Imports IronOcr
' Ensure that the IronOCR library is installed and appropriate language packs are added through NuGet.
' For the Chinese Language Pack use the following NuGet Package Manager command:
' PM> Install-Package IronOcr.Languages.ChineseSimplified
' Instantiate the IronTesseract OCR engine
Dim ocr = New IronTesseract()
' Set the primary OCR language to Simplified Chinese
ocr.Language = OcrLanguage.ChineseSimplified
' Add English as a secondary OCR language.
' This helps the OCR engine better recognize documents containing both Chinese and English text.
ocr.AddSecondaryLanguage(OcrLanguage.English)
' Create a new OcrInput instance to hold the documents to be scanned.
' Dispose of it when finished, for example by wrapping this code in a Using block.
Dim input = New OcrInput()
' Load the PDF document into the OcrInput instance.
' Ensure the path is correct and that the file exists in the specified location.
input.LoadPdf("multi-language.pdf")
' Run the OCR process on the loaded input. This will return an OcrResult object containing the extracted text.
Dim result = ocr.Read(input)
' Save the extracted text to a text file.
' The result.SaveAsTextFile method writes the OCR results to "results.txt".
result.SaveAsTextFile("results.txt")
' This methodology allows for the processing of multi-language documents using IronOCR,
' enabling the extraction of text from PDFs in languages specified by the IronTesseract configuration.
What Else
IronOCR Tesseract has additional features for .NET software developers.
- Automatic image analysis to configure Tesseract for common errors
- Image to Searchable PDF Conversion
- PDF OCR
- Can make any PDF searchable and indexable on search engines
- OCR to HTML output
- TIFF to PDF conversion
- Barcode Reading
- QR Code Reading
- Multithreading
- An advanced OcrResult class that allows inspection of Blocks, Paragraphs, Lines, Words, Characters, Fonts, and OCR statistics.
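As a sketch of how two of these features might be used together, the example below walks an OCR result down to word level and then saves a searchable PDF. It assumes the `Pages`, `Paragraphs`, `Lines`, and `Words` collections, the `Confidence` property, and the `SaveAsSearchablePdf` method as described in IronOCR's documentation; the file names are placeholders.

```csharp
using IronOcr;
using System;

var ocr = new IronTesseract();

using var input = new OcrInput();
// "scan.png" is a placeholder file name.
input.LoadImage("scan.png");

OcrResult result = ocr.Read(input);

// Overall confidence reported for the whole document.
Console.WriteLine($"Overall confidence: {result.Confidence}");

foreach (var page in result.Pages)
{
    Console.WriteLine($"Page {page.PageNumber}: {page.WordCount} words");

    foreach (var paragraph in page.Paragraphs)
    {
        foreach (var line in paragraph.Lines)
        {
            foreach (var word in line.Words)
            {
                // Each word carries its own text and confidence score.
                Console.WriteLine($"  {word.Text} ({word.Confidence})");
            }
        }
    }
}

// Write the recognized text into a searchable PDF (placeholder output name).
result.SaveAsSearchablePdf("searchable.pdf");
```

Per-word confidence values can be used to flag low-confidence regions for manual review.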
Conclusion
Google Tesseract for C# OCR
This is the right library to use for free & academic projects in C#.
Tesseract is an excellent resource for C++ developers, but it is not a complete OCR library for .NET, particularly when dealing with scanned or photographed images. These images need to be processed to be orthogonal, standardized, high-resolution, and free of digital noise before Tesseract can work with them accurately.
IronOCR Tesseract OCR Library for .NET Framework & Core
In contrast, IronOCR can do this and more in a single line of code.
It is true: IronOCR uses Tesseract for its internal OCR engine, in the form of a finely tuned Tesseract build for C# with many performance improvements and features added as standard.
It is the right choice for any project where developer time is valuable. When was the last time you found a .NET software engineer with weeks of time on their hands?
Get Started on your C# Tesseract Project
Use NuGet Package Manager in any Visual Studio project:
Install-Package IronOcr
Or you can download the IronOCR Tesseract .NET DLL and install it manually.
Any .NET coder should be able to get started with IronOCR Tesseract OCR in 5 minutes using examples on this page.
To learn about more services that offer OCR technology, check out the following comparison article: AWS vs Google Vision (OCR Features Comparison).
Frequently Asked Questions
What is Tesseract OCR?
Tesseract is an open-source OCR (optical character recognition) library available for free, often used in academic and various development projects to convert images containing text into machine-readable text.
Why use IronOCR for Tesseract in C#?
IronOCR extends the capabilities of the Tesseract OCR by providing a native C# library called IronTesseract. It offers improved stability, higher accuracy, and easier integration with .NET projects without the complexities associated with using Tesseract directly.
How do you install IronOCR in Visual Studio?
You can install IronOCR in Visual Studio via the NuGet Package Manager by running the command 'Install-Package IronOcr' in the Package Manager Console. This handles all necessary dependencies without requiring additional native DLLs or EXEs.
What are the benefits of using IronOCR over free Tesseract libraries?
IronOCR provides a managed .NET library that simplifies deployment and improves performance. It includes features like multithreading, automatic image preprocessing, and supports a wide range of image formats and languages, making it more suitable for professional and commercial applications.
Can IronOCR handle multiple languages?
Yes, IronOCR supports multiple languages through NuGet packages. Developers can configure IronOCR to use several languages simultaneously, which is beneficial for processing documents containing mixed-language text.
What is the accuracy of IronOCR compared to Tesseract?
IronOCR often achieves 99.8-100% accuracy with minimal configuration, even on low-resolution or imperfect images, whereas Tesseract requires extensive image preprocessing to achieve similar accuracy.
What types of projects is IronOCR compatible with?
IronOCR is compatible with various project types, including .NET Framework 4.6.2 and above, .NET Standard 2.0 and above, .NET Core 2.0 and above, as well as platforms like Windows, macOS, Linux, Azure, AWS, Mono, and Xamarin Mac.
How does IronOCR handle image formats?
IronOCR supports a wide range of image formats, including PDF, TIFF, JPEG, PNG, GIF, BMP, and more. It can process images from various sources like System.Drawing.Image, streams, and byte arrays.
Is IronOCR suitable for real-world applications?
Yes, IronOCR is designed for real-world applications, offering high accuracy, ease of use, and robust support for complex OCR tasks. It is well-suited for commercial projects where quick integration and reliable performance are required.
How can developers get started with IronOCR?
Developers can quickly get started with IronOCR by installing it via NuGet Package Manager in Visual Studio and following the provided examples. The installation and configuration process is straightforward, allowing developers to implement OCR in their projects within minutes.