OCR in C# and VB.Net

IronOCR is a C# software library that lets .NET developers recognize and read text from images and PDF documents. It is a pure .NET OCR library built on the Tesseract engine, with Tesseract 5, 4 and 3 included out of the box.

Installation

The first thing we have to do is install our OCR library into a Visual Studio project. To do this, we can choose one of two approaches:

  1. The easiest way to install IronOCR (https://ironsoftware.com/csharp/ocr/) is with the NuGet Package Manager for Visual Studio. The package name is "IronOcr":

 PM > Install-Package IronOcr

  2. Or download the IronOcr DLL directly from our homepage.

Why Choose IronOCR?

Iron OCR is an easy-to-install, complete and well-documented .NET software library.

Choose IronOCR to achieve 99.8%+ OCR accuracy without using any external web services, ongoing fees or sending confidential documents over the internet.

Why C# developers choose IronOCR over Vanilla Tesseract:

  • Install as a single DLL or NuGet package
  • Includes the Tesseract 5, 4 and 3 engines out of the box
  • 99.8% accuracy, significantly outperforming regular Tesseract
  • Blazing speed and multithreading
  • MVC, WebApp, Desktop, Console & Server Application compatible
  • No EXEs or C++ code to work with
  • Full PDF OCR support
  • Performs OCR on almost any image file or PDF
  • Full .NET Core, .NET Standard and .NET Framework support
  • Deploy on Windows, Mac, Linux, Azure, Docker, Lambda, AWS
  • Read barcodes and QR codes
  • Export OCR results to XHTML
  • Export OCR results to searchable PDF documents
  • Multithreading support
  • 125 international languages, all managed via NuGet or OcrData files
  • Extract Images, Coordinates, Statistics and Fonts. Not just text.
  • Can be used to redistribute Tesseract OCR inside commercial & proprietary applications.

IronOCR shines when working with real-world images and imperfect documents, such as photographs or low-resolution scans which may have digital noise or other imperfections.

Other free OCR libraries for the .NET platform, such as other .NET Tesseract APIs and web services, do not perform as well on these real-world use cases.

OCR with Tesseract 5 - Start Coding in C#

The code sample below shows how easy it is to read text from an image using C# or VB .NET.

OneLiner

string Text = new IronTesseract().Read(@"img\Screenshot.png").Text;

Dim Text As String = (New IronTesseract()).Read("img\Screenshot.png").Text

Configurable Hello World

using IronOcr;

var Ocr = new IronTesseract();
using (var Input = new OcrInput()){
    Input.AddImage("images/sample.jpeg");
    // ... you can add any number of images
    var Result = Ocr.Read(Input);
    Console.WriteLine(Result.Text);
}

Imports IronOcr

Dim Ocr = New IronTesseract()
Using Input = New OcrInput()
	Input.AddImage("images/sample.jpeg")
	' ... you can add any number of images
	Dim Result = Ocr.Read(Input)
	Console.WriteLine(Result.Text)
End Using

C# PDF OCR

The same approach can be used to extract text from any PDF document.

using IronOcr;
using System.Linq; // provides the Count() extension method

var Ocr = new IronTesseract();
using (var input = new OcrInput())
{
    input.AddPdf("example.pdf", "password");
    // We can also select specific PDF page numbers to OCR

    var Result = Ocr.Read(input);

    Console.WriteLine(Result.Text);
    Console.WriteLine($"{Result.Pages.Count()} Pages");
    // 1 page for every page of the PDF
}

Imports IronOcr

Dim Ocr = New IronTesseract()
Using input = New OcrInput()
	input.AddPdf("example.pdf", "password")
	' We can also select specific PDF page numbers to OCR

	Dim Result = Ocr.Read(input)

	Console.WriteLine(Result.Text)
	Console.WriteLine($"{Result.Pages.Count()} Pages")
	' 1 page for every page of the PDF
End Using

OCR for MultiPage TIFFs

using IronOcr;

var Ocr = new IronTesseract();
using (var Input = new OcrInput()){
    Input.AddMultiFrameTiff("multi-frame.tiff");
    var Result = Ocr.Read(Input);
    Console.WriteLine(Result.Text);
}

Imports IronOcr

Dim Ocr = New IronTesseract()
Using Input = New OcrInput()
	Input.AddMultiFrameTiff("multi-frame.tiff")
	Dim Result = Ocr.Read(Input)
	Console.WriteLine(Result.Text)
End Using

Barcodes and QR

A unique feature of IronOCR is that it can read barcodes and QR codes from documents while it is scanning for text. Instances of the OcrResult.OcrBarcode class give the developer detailed information about each scanned barcode.

// using IronOcr;
var Ocr = new IronTesseract();
Ocr.Configuration.ReadBarCodes = true;

using (var input = new OcrInput())
{
    input.AddImage("img/Barcode.png");
    var Result = Ocr.Read(input);
    foreach (var Barcode in Result.Barcodes)
    {
        Console.WriteLine(Barcode.Value);
        // type and location properties also exposed
    }
}

' Imports IronOcr
Dim Ocr = New IronTesseract()
Ocr.Configuration.ReadBarCodes = True

Using input = New OcrInput()
	input.AddImage("img/Barcode.png")
	Dim Result = Ocr.Read(input)
	For Each Barcode In Result.Barcodes
		Console.WriteLine(Barcode.Value)
		' type and location properties also exposed
	Next Barcode
End Using

OCR on Specific Areas of Images

All of IronOCR's scanning and reading methods provide the ability to specify exactly which part of a page or pages we wish to read text from. This is very useful when we are looking at standardized forms, and it can save a lot of time and improve efficiency.

To use crop regions, we will need to add a system reference to System.Drawing so that we can use the System.Drawing.Rectangle object.

using IronOcr;

var Ocr = new IronTesseract();

using (var Input = new OcrInput())
{
    var ContentArea = new System.Drawing.Rectangle() { X = 215, Y = 1250, Height = 280, Width = 1335 };
    // Dimensions are in px

    Input.Add("document.png", ContentArea);

    var Result = Ocr.Read(Input);
    Console.WriteLine(Result.Text);
}

Imports IronOcr

Dim Ocr = New IronTesseract()

Using Input = New OcrInput()
	Dim ContentArea = New System.Drawing.Rectangle() With {
		.X = 215,
		.Y = 1250,
		.Height = 280,
		.Width = 1335
	}
	' Dimensions are in px

	Input.Add("document.png", ContentArea)

	Dim Result = Ocr.Read(Input)
	Console.WriteLine(Result.Text)
End Using

OCR for Low Quality Scans

The IronOCR OcrInput class can fix scans that normal Tesseract cannot read.

using IronOcr;
var Ocr = new IronTesseract();

using (var Input = new OcrInput(@"img\Potter.LowQuality.tiff"))
{
    Input.DeNoise(); // fixes digital noise and poor scanning
    Input.Deskew();  // fixes rotation and perspective
    var Result = Ocr.Read(Input);
    Console.WriteLine(Result.Text);
}

Imports IronOcr
Dim Ocr = New IronTesseract()

Using Input = New OcrInput("img\Potter.LowQuality.tiff")
	Input.DeNoise() ' fixes digital noise and poor scanning
	Input.Deskew() ' fixes rotation and perspective
	Dim Result = Ocr.Read(Input)
	Console.WriteLine(Result.Text)
End Using

Export OCR results as a Searchable PDF

using IronOcr;

var Ocr = new IronTesseract();
using (var Input = new OcrInput()){
    Input.Title = "Quarterly Report";
    Input.AddImage("image1.jpeg");
    Input.AddImage("image2.png");
    Input.AddImage("image3.gif");

    var Result = Ocr.Read(Input);
    Result.SaveAsSearchablePdf("searchable.pdf");
}

Imports IronOcr

Dim Ocr = New IronTesseract()
Using Input = New OcrInput()
	Input.Title = "Quarterly Report"
	Input.AddImage("image1.jpeg")
	Input.AddImage("image2.png")
	Input.AddImage("image3.gif")

	Dim Result = Ocr.Read(Input)
	Result.SaveAsSearchablePdf("searchable.pdf")
End Using

TIFF to searchable PDF Conversion

using IronOcr;

var Ocr = new IronTesseract();
using (var Input = new OcrInput()){
    Input.AddMultiFrameTiff("example.tiff");
    Ocr.Read(Input).SaveAsSearchablePdf("searchable.pdf");
}

Imports IronOcr

Dim Ocr = New IronTesseract()
Using Input = New OcrInput()
	Input.AddMultiFrameTiff("example.tiff")
	Ocr.Read(Input).SaveAsSearchablePdf("searchable.pdf")
End Using

Export OCR results as HTML

using IronOcr;

var Ocr = new IronTesseract();
using (var Input = new OcrInput()){
    Input.Title = "Html Title";
    Input.AddImage("image1.jpeg");
    var Result = Ocr.Read(Input);
    Result.SaveAsHocrFile("results.html");
}

Imports IronOcr

Dim Ocr = New IronTesseract()
Using Input = New OcrInput()
	Input.Title = "Html Title"
	Input.AddImage("image1.jpeg")
	Dim Result = Ocr.Read(Input)
	Result.SaveAsHocrFile("results.html")
End Using

OCR Image Enhancement Filters

IronOCR provides unique filters for OcrInput objects to improve OCR performance.

Image Enhancement Code Example

using IronOcr;
var Ocr = new IronTesseract();

using (var Input = new OcrInput(@"LowQuality.jpeg"))
{
    Input.DeNoise(); // fixes digital noise and poor scanning
    Input.Deskew();  // fixes rotation and perspective
    var Result = Ocr.Read(Input);
    Console.WriteLine(Result.Text);
}

Imports IronOcr
Dim Ocr = New IronTesseract()

Using Input = New OcrInput("LowQuality.jpeg")
	Input.DeNoise() ' fixes digital noise and poor scanning
	Input.Deskew() ' fixes rotation and perspective
	Dim Result = Ocr.Read(Input)
	Console.WriteLine(Result.Text)
End Using

List of OCR Image Filters

Input filters to enhance OCR performance which are built into IronOCR include:

  • OcrInput.Rotate(double degrees) - Rotates images by a number of degrees clockwise. For anti-clockwise, use negative numbers.
  • OcrInput.Binarize() - Turns every pixel black or white with no middle ground. May improve OCR performance in cases of very low contrast between text and background.
  • OcrInput.ToGrayScale() - Turns every pixel into a shade of gray. Unlikely to improve OCR accuracy but may improve speed.
  • OcrInput.Contrast() - Increases contrast automatically. This filter often improves OCR speed and accuracy in low-contrast scans.
  • OcrInput.DeNoise() - Removes digital noise. This filter should only be used where noise is expected.
  • OcrInput.Invert() - Inverts every color, e.g. white becomes black and black becomes white.
  • OcrInput.Dilate() - Advanced morphology. Dilation adds pixels to the boundaries of objects in an image. Opposite of Erode.
  • OcrInput.Erode() - Advanced morphology. Erosion removes pixels on object boundaries. Opposite of Dilate.
  • OcrInput.Deskew() - Rotates an image so it is the right way up and orthogonal. This is very useful for OCR because Tesseract's tolerance for skewed scans can be as low as 5 degrees.
  • OcrInput.DeepCleanBackgroundNoise() - Heavy background noise removal. Only use this filter when extreme background noise is known to be present, because it risks reducing OCR accuracy on clean documents and is very CPU expensive.
  • OcrInput.EnhanceResolution - Enhances the resolution of low-quality images. This filter is not often needed because OcrInput.MinimumDPI and OcrInput.TargetDPI will automatically catch and resolve low-resolution inputs.
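
Several of the filters listed above can be applied to the same OcrInput before it is read. The following is a minimal sketch rather than a recommended recipe: the file name "scan.png" is hypothetical, the choice and order of filters are assumptions made for illustration, and only method names that appear in the list above are used.

```csharp
using System;
using IronOcr;

var Ocr = new IronTesseract();

// "scan.png" is a hypothetical file name used only for illustration
using (var Input = new OcrInput("scan.png"))
{
    Input.Rotate(90);   // rotate 90 degrees clockwise; use a negative value for anti-clockwise
    Input.Deskew();     // straighten a slightly skewed page
    Input.Contrast();   // raise contrast on a faded, low-contrast scan
    Input.DeNoise();    // apply only where digital noise is expected

    var Result = Ocr.Read(Input);
    Console.WriteLine(Result.Text);
}
```

Filters are best added one at a time while checking the output. As the list notes, some of them (DeNoise and DeepCleanBackgroundNoise in particular) are intended only for inputs that actually show the problem they address.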

CleanBackgroundNoise. This is a setting which is somewhat time-consuming; however, it allows the library to automatically clean digital noise, paper crumples, and other imperfections within a digital image which would otherwise render it incapable of being read by other OCR libraries.

EnhanceContrast is a setting which causes Iron OCR to automatically increase the contrast of text against the background of an image, increasing the accuracy of OCR and generally increasing performance and the speed of OCR.

EnhanceResolution is a setting which will automatically detect low-resolution images (which are under 275 dpi) and automatically upscale the image and then sharpen all of the text so it can be read perfectly by an OCR library. Although this operation is in itself time-consuming, it generally reduces the overall time for an OCR operation on an image.

Language. IronOCR supports 125 international language packs, and the language setting can be used to select one or more languages to be applied for an OCR operation.

Strategy. IronOCR supports two strategies. We may choose either a fast and less accurate scan of a document, or an advanced strategy which uses artificial intelligence models to automatically improve the accuracy of OCR text by looking at the statistical relationship of words to one another in a sentence.

ColorSpace is a setting whereby we can choose to OCR in grayscale or color. Generally, grayscale is the best option. However, sometimes when there are texts or backgrounds of similar hue but very different color, a full-color color space will provide better results.

DetectWhiteTextOnDarkBackgrounds. Generally, all OCR libraries expect to see black text on white backgrounds. This setting allows Iron OCR to automatically detect negatives, or dark pages with white text, and read them.

InputImageType. This setting allows the developer to guide the OCR library as to whether it is looking at a full document or a snippet, such as a screenshot.

RotateAndStraighten is an advanced setting which gives IronOCR the ability to read documents which are not only rotated but may also contain perspective distortion, such as photographs of text documents.

ReadBarcodes is a useful feature which allows Iron OCR to automatically read barcodes and QR codes on pages as it also reads text, without adding a large additional time burden.

ColorDepth. This setting determines how many bits per pixel the OCR library will use to determine the depth of a color. A higher color depth may increase OCR quality, but will also increase the time required for the OCR operation to complete.

125 Language Packs

IronOCR supports 125 international languages via language packs which are distributed as DLLs and can be downloaded from this website (/csharp/ocr/languages/) or from the NuGet Package Manager.

Languages include German, French, English, Chinese, Japanese and many more. Specialist language packs exist for passport MRZ, MICR checks, financial data, license plates and many more. You can also use any Tesseract ".traineddata" file, including ones you create yourself.

Language Example

// using IronOcr;
// PM> Install-Package IronOcr.Languages.Arabic

var Ocr = new IronTesseract();
Ocr.Language = OcrLanguage.Arabic;

using (var input = new OcrInput())
{
    input.AddImage("img/arabic.gif");
    // Add image filters if needed.
    // In this case, even though the input is very low quality,
    // IronTesseract can read what conventional Tesseract cannot.

    var Result = Ocr.Read(input);

    // Console can't print Arabic on Windows easily.
    // Let's save to disk instead.
    Result.SaveAsTextFile("arabic.txt");
}   

' Imports IronOcr
' PM> Install-Package IronOcr.Languages.Arabic

Dim Ocr = New IronTesseract()
Ocr.Language = OcrLanguage.Arabic

Using input = New OcrInput()
	input.AddImage("img/arabic.gif")
	' Add image filters if needed.
	' In this case, even though the input is very low quality,
	' IronTesseract can read what conventional Tesseract cannot.

	Dim Result = Ocr.Read(input)

	' Console can't print Arabic on Windows easily.
	' Let's save to disk instead.
	Result.SaveAsTextFile("arabic.txt")
End Using

Multiple Language Example

It is also possible to OCR using multiple languages at the same time. This can really help to capture English-language metadata and URLs in Unicode documents.

// using IronOcr;
// PM> Install-Package IronOcr.Languages.ChineseSimplified

var Ocr = new IronTesseract();
Ocr.Language = OcrLanguage.ChineseSimplified;
Ocr.AddSecondaryLanguage(OcrLanguage.English);
// We can add any number of languages

using (var input = new OcrInput())
{
    input.Add("multi-language.pdf");
    var Result = Ocr.Read(input);
    Result.SaveAsTextFile("results.txt");
}   

' Imports IronOcr
' PM> Install-Package IronOcr.Languages.ChineseSimplified

Dim Ocr = New IronTesseract()
Ocr.Language = OcrLanguage.ChineseSimplified
Ocr.AddSecondaryLanguage(OcrLanguage.English)
' We can add any number of languages

Using input = New OcrInput()
	input.Add("multi-language.pdf")
	Dim Result = Ocr.Read(input)
	Result.SaveAsTextFile("results.txt")
End Using

Detailed OCR Results Objects

IronOCR returns an OCR result object for each OCR operation. Generally, developers only use the Text property of this object to get the text scanned from the image. However, the OCR results DOM is much more advanced than this.

using IronOcr;
using System.Drawing; //Add Assembly Reference

var Ocr = new IronTesseract();
Ocr.Configuration.EngineMode = TesseractEngineMode.TesseractAndLstm;
Ocr.Configuration.ReadBarCodes = true; //!Important

using (var Input = new OcrInput(@"images\sample.tiff"))
{
    OcrResult Result = Ocr.Read(Input);
    var Pages = Result.Pages;
    var Words = Pages[0].Words;
    var Barcodes = Result.Barcodes;
    // Explore here to find a massive, detailed API:
    // - Pages, Blocks, Paragraphs, Lines, Words, Chars
    // - Image Export, Fonts, Coordinates, Statistical Data
}

Imports IronOcr
Imports System.Drawing 'Add Assembly Reference

Dim Ocr = New IronTesseract()
Ocr.Configuration.EngineMode = TesseractEngineMode.TesseractAndLstm
Ocr.Configuration.ReadBarCodes = True '!Important

Using Input = New OcrInput("images\sample.tiff")
	Dim Result As OcrResult = Ocr.Read(Input)
	Dim Pages = Result.Pages
	Dim Words = Pages(0).Words
	Dim Barcodes = Result.Barcodes
	' Explore here to find a massive, detailed API:
	' - Pages, Blocks, Paragraphs, Lines, Words, Chars
	' - Image Export, Fonts, Coordinates, Statistical Data
End Using
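
To give a feel for walking the results object, here is a minimal sketch that loops over every page and barcode instead of reading only the first page. It uses only members that appear in the samples above (Result.Pages, Words, Result.Barcodes, Barcode.Value). It assumes that Pages and Words are enumerable collections, which is why the LINQ Count() extension is used. The input path is the same placeholder as above.

```csharp
using System;
using System.Linq;
using IronOcr;

var Ocr = new IronTesseract();
Ocr.Configuration.ReadBarCodes = true; // without this, no barcodes are scanned

using (var Input = new OcrInput(@"images\sample.tiff"))
{
    OcrResult Result = Ocr.Read(Input);

    // Report how many words were recognized on each page
    int pageNumber = 1;
    foreach (var Page in Result.Pages)
    {
        // Assumption: Words is enumerable, so LINQ Count() applies
        Console.WriteLine($"Page {pageNumber}: {Page.Words.Count()} words");
        pageNumber++;
    }

    // Print the value of every barcode found while reading
    foreach (var Barcode in Result.Barcodes)
    {
        Console.WriteLine($"Barcode: {Barcode.Value}");
    }
}
```

The same loop shape should extend to the deeper levels named in the comments above (blocks, paragraphs, lines, characters). Their exact property names are not shown in this article, so check them against the API reference.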

Performance

IronOCR works out of the box with no need to performance tune or heavily modify input images.

Speed is Blazing: IronOcr.2020 + is up to 10 times faster and makes over 250% fewer errors than previous builds.

Learn More

To learn more about OCR in C#, VB, F#, or any other .NET language, please read our community tutorials, which give real world examples of how Iron OCR can be used and may show the nuances of how to get the best out of this library.

A full API reference for .NET developers is also available.