OCR in C# and VB.NET
IronOCR is a C# software library that lets .NET developers recognize and read text from images and PDF documents. It is a pure .NET OCR library built on the Tesseract engine.
Installation
The first step is to install the OCR library into a Visual Studio project. There are two ways to do this:
- The easiest way to install IronOCR (https://ironsoftware.com/csharp/ocr/) is with the NuGet Package Manager for Visual Studio. The package name is “IronOcr”.
PM > Install-Package IronOcr
- Or download the IronOcr DLL directly from our homepage.
Why Choose IronOCR?
Iron OCR is an easy-to-install, complete and well-documented .NET software library.
Choose IronOCR to achieve 99.8%+ OCR accuracy without external web services, without ongoing fees, and without sending confidential documents over the internet.
Why C# developers choose IronOCR over vanilla Tesseract:
- Install as a single DLL or NuGet package
- Includes the Tesseract 5, 4 and 3 engines out of the box
- 99.8% accuracy, significantly outperforming regular Tesseract
- Fast, with multithreading support
- MVC, WebApp, Desktop, Console & Server Application compatible
- No EXEs or C++ code to work with
- Full PDF OCR support
- Performs OCR on almost any image file or PDF
- Full .NET Core, .NET Standard and .NET Framework support
- Deploy on Windows, Mac, Linux, Azure, Docker, Lambda, AWS
- Read barcodes and QR codes
- Export OCR results to XHTML
- Export OCR results to searchable PDF documents
- 125 international languages, all managed via NuGet or OcrData files
- Extract images, coordinates, statistics and fonts, not just text
- Can be used to redistribute Tesseract OCR inside commercial & proprietary applications
Iron OCR shines when working with real-world images and imperfect documents, such as photographs or low-resolution scans that may contain digital noise or other imperfections.
Other free OCR libraries for the .NET platform, such as other .NET Tesseract APIs and web services, do not perform as well on these real-world use cases.
OCR with Tesseract 5 - Start Coding in C#
The code sample below shows how easy it is to read text from an image using C# or VB .NET.
OneLiner
string Text = new IronTesseract().Read(@"img\Screenshot.png").Text;
Dim Text As String = (New IronTesseract()).Read("img\Screenshot.png").Text
Configurable Hello World
using System;
using IronOcr;
var Ocr = new IronTesseract();
using (var Input = new OcrInput())
{
    Input.AddImage("images/sample.jpeg");
    // ... you can add any number of images
    var Result = Ocr.Read(Input);
    Console.WriteLine(Result.Text);
}
Imports IronOcr
Dim Ocr = New IronTesseract()
Using Input = New OcrInput()
    Input.AddImage("images/sample.jpeg")
    ' ... you can add any number of images
    Dim Result = Ocr.Read(Input)
    Console.WriteLine(Result.Text)
End Using
C# PDF OCR
The same approach can be used to extract text from any PDF document.
using System;
using System.Linq; // needed for Pages.Count()
using IronOcr;
var Ocr = new IronTesseract();
using (var input = new OcrInput())
{
    input.AddPdf("example.pdf", "password");
    // We can also select specific PDF page numbers to OCR
    var Result = Ocr.Read(input);
    Console.WriteLine(Result.Text);
    Console.WriteLine($"{Result.Pages.Count()} Pages");
    // 1 page for every page of the PDF
}
Imports System.Linq ' needed for Pages.Count()
Imports IronOcr
Dim Ocr = New IronTesseract()
Using input = New OcrInput()
    input.AddPdf("example.pdf", "password")
    ' We can also select specific PDF page numbers to OCR
    Dim Result = Ocr.Read(input)
    Console.WriteLine(Result.Text)
    Console.WriteLine($"{Result.Pages.Count()} Pages")
    ' 1 page for every page of the PDF
End Using
OCR for MultiPage TIFFs
using System;
using IronOcr;
var Ocr = new IronTesseract();
using (var Input = new OcrInput())
{
    Input.AddMultiFrameTiff("multi-frame.tiff");
    var Result = Ocr.Read(Input);
    Console.WriteLine(Result.Text);
}
Imports IronOcr
Dim Ocr = New IronTesseract()
Using Input = New OcrInput()
    Input.AddMultiFrameTiff("multi-frame.tiff")
    Dim Result = Ocr.Read(Input)
    Console.WriteLine(Result.Text)
End Using
Barcodes and QR
A unique feature of Iron OCR is that it can read barcodes and QR codes from documents while it is scanning for text. Instances of the OcrResult.OcrBarcode class give the developer detailed information about each scanned barcode.
using System;
using IronOcr;
var Ocr = new IronTesseract();
Ocr.Configuration.ReadBarCodes = true;
using (var input = new OcrInput())
{
    input.AddImage("img/Barcode.png");
    var Result = Ocr.Read(input);
    foreach (var Barcode in Result.Barcodes)
    {
        Console.WriteLine(Barcode.Value);
        // type and location properties also exposed
    }
}
Imports IronOcr
Dim Ocr = New IronTesseract()
Ocr.Configuration.ReadBarCodes = True
Using input = New OcrInput()
    input.AddImage("img/Barcode.png")
    Dim Result = Ocr.Read(input)
    For Each Barcode In Result.Barcodes
        Console.WriteLine(Barcode.Value)
        ' type and location properties also exposed
    Next Barcode
End Using
OCR on Specific Areas of Images
All of Iron OCR's scanning and reading methods provide the ability to specify exactly which part of a page or pages we wish to read text from. This is very useful for standardized forms, where it can save a great deal of time and improve efficiency.
To use crop regions, we will need to add a system reference to System.Drawing so that we can use the System.Drawing.Rectangle object.
using System;
using IronOcr;
var Ocr = new IronTesseract();
using (var Input = new OcrInput())
{
    var ContentArea = new System.Drawing.Rectangle() { X = 215, Y = 1250, Height = 280, Width = 1335 };
    // Dimensions are in px
    Input.Add("document.png", ContentArea);
    var Result = Ocr.Read(Input);
    Console.WriteLine(Result.Text);
}
Imports IronOcr
Dim Ocr = New IronTesseract()
Using Input = New OcrInput()
    Dim ContentArea = New System.Drawing.Rectangle() With {
        .X = 215,
        .Y = 1250,
        .Height = 280,
        .Width = 1335
    }
    ' Dimensions are in px
    Input.Add("document.png", ContentArea)
    Dim Result = Ocr.Read(Input)
    Console.WriteLine(Result.Text)
End Using
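Hand-measured crop coordinates can easily run past the edge of a page. The sketch below clamps a requested region to the page bounds with System.Drawing.Rectangle.Intersect before it would be passed to OcrInput. The helper name ClampToPage and the page size used here are assumptions for illustration only; neither is part of IronOCR.

```csharp
using System;
using System.Drawing;

// Hypothetical helper: keep only the part of the requested region that lies on the page.
// Returns an empty rectangle when the region misses the page entirely.
Rectangle ClampToPage(Rectangle requested, Size pageSize)
{
    var pageBounds = new Rectangle(Point.Empty, pageSize);
    return Rectangle.Intersect(requested, pageBounds);
}

// The content area from the example above: X, Y, Width, Height in px.
var contentArea = new Rectangle(215, 1250, 1335, 280);

// Assumed page size for illustration: 1500 x 1600 px.
var clamped = ClampToPage(contentArea, new Size(1500, 1600));

Console.WriteLine(clamped);         // the region that is safe to crop
Console.WriteLine(clamped.IsEmpty); // True would mean nothing is left to read
```

Checking IsEmpty before running OCR avoids spending time on a region that contains no part of the page.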
OCR for Low Quality Scans
The Iron OCR OcrInput class can fix scans that normal Tesseract cannot read.
using System;
using IronOcr;
var Ocr = new IronTesseract();
using (var Input = new OcrInput(@"img\Potter.LowQuality.tiff"))
{
    Input.DeNoise(); // fixes digital noise and poor scanning
    Input.Deskew(); // fixes rotation and perspective
    var Result = Ocr.Read(Input);
    Console.WriteLine(Result.Text);
}
Imports IronOcr
Dim Ocr = New IronTesseract()
Using Input = New OcrInput("img\Potter.LowQuality.tiff")
    Input.DeNoise() ' fixes digital noise and poor scanning
    Input.Deskew() ' fixes rotation and perspective
    Dim Result = Ocr.Read(Input)
    Console.WriteLine(Result.Text)
End Using
Export OCR results as a Searchable PDF
using IronOcr;
var Ocr = new IronTesseract();
using (var Input = new OcrInput())
{
    Input.Title = "Quarterly Report";
    Input.AddImage("image1.jpeg");
    Input.AddImage("image2.png");
    Input.AddImage("image3.gif");
    var Result = Ocr.Read(Input);
    Result.SaveAsSearchablePdf("searchable.pdf");
}
Imports IronOcr
Dim Ocr = New IronTesseract()
Using Input = New OcrInput()
    Input.Title = "Quarterly Report"
    Input.AddImage("image1.jpeg")
    Input.AddImage("image2.png")
    Input.AddImage("image3.gif")
    Dim Result = Ocr.Read(Input)
    Result.SaveAsSearchablePdf("searchable.pdf")
End Using
TIFF to searchable PDF Conversion
using IronOcr;
var Ocr = new IronTesseract();
using (var Input = new OcrInput())
{
    Input.AddMultiFrameTiff("example.tiff");
    var Result = Ocr.Read(Input);
    Result.SaveAsSearchablePdf("searchable.pdf");
}
Imports IronOcr
Dim Ocr = New IronTesseract()
Using Input = New OcrInput()
    Input.AddMultiFrameTiff("example.tiff")
    Dim Result = Ocr.Read(Input)
    Result.SaveAsSearchablePdf("searchable.pdf")
End Using
Export OCR results as HTML
using IronOcr;
var Ocr = new IronTesseract();
using (var Input = new OcrInput())
{
    Input.Title = "Html Title";
    Input.AddImage("image1.jpeg");
    var Result = Ocr.Read(Input);
    Result.SaveAsHocrFile("results.html");
}
Imports IronOcr
Dim Ocr = New IronTesseract()
Using Input = New OcrInput()
    Input.Title = "Html Title"
    Input.AddImage("image1.jpeg")
    Dim Result = Ocr.Read(Input)
    Result.SaveAsHocrFile("results.html")
End Using
OCR Image Enhancement Filters
IronOCR provides unique filters for OcrInput objects that improve OCR performance.
Image Enhancement Code Example
using System;
using IronOcr;
var Ocr = new IronTesseract();
using (var Input = new OcrInput(@"LowQuality.jpeg"))
{
    Input.DeNoise(); // fixes digital noise and poor scanning
    Input.Deskew(); // fixes rotation and perspective
    var Result = Ocr.Read(Input);
    Console.WriteLine(Result.Text);
}
Imports IronOcr
Dim Ocr = New IronTesseract()
Using Input = New OcrInput("LowQuality.jpeg")
    Input.DeNoise() ' fixes digital noise and poor scanning
    Input.Deskew() ' fixes rotation and perspective
    Dim Result = Ocr.Read(Input)
    Console.WriteLine(Result.Text)
End Using
List of OCR Image Filters
IronOCR's built-in input filters for enhancing OCR performance include:
- OcrInput.Rotate(double degrees) - Rotates images by a number of degrees clockwise. For anti-clockwise, use negative numbers.
- OcrInput.Binarize() - Turns every pixel black or white with no middle ground. May improve OCR performance where the contrast between text and background is very low.
- OcrInput.ToGrayScale() - Turns every pixel into a shade of gray. Unlikely to improve OCR accuracy but may improve speed.
- OcrInput.Contrast() - Increases contrast automatically. This filter often improves OCR speed and accuracy in low-contrast scans.
- OcrInput.DeNoise() - Removes digital noise. This filter should only be used where noise is expected.
- OcrInput.Invert() - Inverts every color, e.g. white becomes black and black becomes white.
- OcrInput.Dilate() - Advanced morphology. Dilation adds pixels to the boundaries of objects in an image. Opposite of Erode.
- OcrInput.Erode() - Advanced morphology. Erosion removes pixels on object boundaries. Opposite of Dilate.
- OcrInput.Deskew() - Rotates an image so it is the right way up and orthogonal. This is very useful for OCR because Tesseract's tolerance for skewed scans can be as low as 5 degrees.
- OcrInput.DeepCleanBackgroundNoise() - Heavy background noise removal. Only use this filter when extreme document background noise is known to be present, because it risks reducing OCR accuracy on clean documents and is very CPU expensive.
- OcrInput.EnhanceResolution() - Enhances the resolution of low-quality images. This filter is not often needed because OcrInput.MinimumDPI and OcrInput.TargetDPI will automatically catch and resolve low-resolution inputs.
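To make the Binarize, Dilate and Erode descriptions above concrete, here is a minimal sketch on a tiny grayscale grid. It does not call IronOCR and is not how the library implements its filters; the helper names, the fixed threshold of 128 and the 3x3 neighbourhood are all assumptions made for illustration.

```csharp
using System;

// Hypothetical helpers for illustration only; not IronOCR's implementation.

// Binarize: every pixel becomes ink (true) or background (false).
// Assumption: pixels darker than the threshold are treated as ink.
bool[,] Binarize(byte[,] gray, byte threshold)
{
    int h = gray.GetLength(0), w = gray.GetLength(1);
    var result = new bool[h, w];
    for (int y = 0; y < h; y++)
        for (int x = 0; x < w; x++)
            result[y, x] = gray[y, x] < threshold;
    return result;
}

// Shared 3x3 morphology: dilation sets a pixel if ANY in-bounds neighbour is ink,
// erosion keeps a pixel only if ALL in-bounds neighbours (itself included) are ink.
bool[,] Morph(bool[,] src, bool dilate)
{
    int h = src.GetLength(0), w = src.GetLength(1);
    var result = new bool[h, w];
    for (int y = 0; y < h; y++)
        for (int x = 0; x < w; x++)
        {
            bool any = false, all = true;
            for (int dy = -1; dy <= 1; dy++)
                for (int dx = -1; dx <= 1; dx++)
                {
                    int ny = y + dy, nx = x + dx;
                    if (ny < 0 || ny >= h || nx < 0 || nx >= w) continue;
                    any |= src[ny, nx];
                    all &= src[ny, nx];
                }
            result[y, x] = dilate ? any : all;
        }
    return result;
}

bool[,] Dilate(bool[,] src) => Morph(src, true);
bool[,] Erode(bool[,] src) => Morph(src, false);

int CountInk(bool[,] img)
{
    int n = 0;
    foreach (bool p in img) if (p) n++;
    return n;
}

// A 5x5 light page with a single dark pixel in the centre.
var page = new byte[5, 5];
for (int row = 0; row < 5; row++)
    for (int col = 0; col < 5; col++)
        page[row, col] = 230;
page[2, 2] = 20;

var ink = Binarize(page, 128);
var thick = Dilate(ink);   // the single pixel grows into a 3x3 block
var thin = Erode(thick);   // eroding the block shrinks it back to one pixel

Console.WriteLine($"ink={CountInk(ink)} dilated={CountInk(thick)} eroded={CountInk(thin)}");
```

Dilation thickens faint strokes and erosion thins heavy ones, which is why the two filters are described as opposites.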
CleanBackgroundNoise. This setting is somewhat time-consuming; however, it allows the library to automatically clean digital noise, paper crumples, and other imperfections within a digital image which would otherwise make it unreadable by other OCR libraries.
EnhanceContrast. This setting causes Iron OCR to automatically increase the contrast of text against the background of an image, which generally improves both the accuracy and the speed of OCR.
EnhanceResolution. This setting automatically detects low-resolution images (under 275 dpi), upscales them, and then sharpens the text so that it can be read reliably by an OCR library. Although this operation is itself time-consuming, it generally reduces the overall time for an OCR operation on an image.
Language. Iron OCR supports 125 international language packs, and the Language setting can be used to select one or more languages to be applied for an OCR operation.
Strategy. Iron OCR supports two strategies. We may choose either a fast, less accurate scan of a document, or an advanced strategy which uses artificial intelligence models to improve the accuracy of OCR text by looking at the statistical relationship of words to one another in a sentence.
ColorSpace. This setting lets us choose to OCR in grayscale or color. Generally, grayscale is the best option. However, when text and background have a similar hue but very different colors, a full-color color space will sometimes provide better results.
DetectWhiteTextOnDarkBackgrounds. Generally, all OCR libraries expect to see black text on white backgrounds. This setting allows Iron OCR to automatically detect negatives, or dark pages with white text, and read them.
InputImageType. This setting allows the developer to tell the OCR library whether it is looking at a full document or a snippet, such as a screenshot.
RotateAndStraighten. This advanced setting allows Iron OCR to read documents which are not only rotated but may also contain perspective, such as photographs of text documents.
ReadBarcodes. This setting allows Iron OCR to automatically read barcodes and QR codes on pages as it reads text, without adding a large additional time burden.
ColorDepth. This setting determines how many bits per pixel the OCR library will use to determine the depth of a color. A higher color depth may increase OCR quality, but will also increase the time required for the OCR operation to complete.
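The resolution handling described above comes down to simple arithmetic: if an image falls below a minimum DPI, enlarge it until it reaches a target DPI. The sketch below shows only that calculation. The function name UpscaleFactor is hypothetical, and using 275 dpi as both the minimum and the target is an assumption based on the threshold quoted above; IronOCR's actual behaviour and defaults may differ.

```csharp
using System;

// Hypothetical helper: how much to enlarge an image so that it reaches targetDpi.
// Images already at or above minimumDpi are left untouched (factor 1.0).
double UpscaleFactor(double currentDpi, double minimumDpi, double targetDpi)
{
    if (currentDpi <= 0) throw new ArgumentOutOfRangeException(nameof(currentDpi));
    return currentDpi >= minimumDpi ? 1.0 : targetDpi / currentDpi;
}

// A 1240 x 1754 px scan of an A4 page is roughly 150 dpi.
double factor = UpscaleFactor(150, 275, 275);
int newWidth = (int)Math.Round(1240 * factor);
int newHeight = (int)Math.Round(1754 * factor);

Console.WriteLine($"scale x{factor:F2} -> {newWidth} x {newHeight} px");
```

Because pixel count grows with the square of the factor, upscaling adds processing time, which matches the note above that the operation is itself time-consuming.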
125 Language Packs
Iron OCR supports 125 international languages via language packs, which are distributed as DLLs and can be downloaded from this website (/csharp/ocr/languages/) or from the NuGet Package Manager.
Languages include German, French, English, Chinese, Japanese and many more. Specialist language packs exist for passport MRZ, MICR checks, financial data, license plates and other uses. You can also use any Tesseract ".traineddata" file, including ones you create yourself.
Language Example
// PM> Install-Package IronOcr.Languages.Arabic
using IronOcr;
var Ocr = new IronTesseract();
Ocr.Language = OcrLanguage.Arabic;
using (var input = new OcrInput())
{
    input.AddImage("img/arabic.gif");
    // Add image filters if needed.
    // In this case, even though the input is very low quality,
    // IronTesseract can read what conventional Tesseract cannot.
    var Result = Ocr.Read(input);
    // Console can't print Arabic on Windows easily.
    // Let's save to disk instead.
    Result.SaveAsTextFile("arabic.txt");
}
' PM> Install-Package IronOcr.Languages.Arabic
Imports IronOcr
Dim Ocr = New IronTesseract()
Ocr.Language = OcrLanguage.Arabic
Using input = New OcrInput()
    input.AddImage("img/arabic.gif")
    ' Add image filters if needed.
    ' In this case, even though the input is very low quality,
    ' IronTesseract can read what conventional Tesseract cannot.
    Dim Result = Ocr.Read(input)
    ' Console can't print Arabic on Windows easily.
    ' Let's save to disk instead.
    Result.SaveAsTextFile("arabic.txt")
End Using
Multiple Language Example
It is also possible to OCR using multiple languages at the same time. This can really help to capture English-language metadata and URLs in Unicode documents.
// PM> Install-Package IronOcr.Languages.ChineseSimplified
using IronOcr;
var Ocr = new IronTesseract();
Ocr.Language = OcrLanguage.ChineseSimplified;
Ocr.AddSecondaryLanguage(OcrLanguage.English);
// We can add any number of languages
using (var input = new OcrInput())
{
    input.Add("multi-language.pdf");
    var Result = Ocr.Read(input);
    Result.SaveAsTextFile("results.txt");
}
' PM> Install-Package IronOcr.Languages.ChineseSimplified
Imports IronOcr
Dim Ocr = New IronTesseract()
Ocr.Language = OcrLanguage.ChineseSimplified
Ocr.AddSecondaryLanguage(OcrLanguage.English)
' We can add any number of languages
Using input = New OcrInput()
    input.Add("multi-language.pdf")
    Dim Result = Ocr.Read(input)
    Result.SaveAsTextFile("results.txt")
End Using
Detailed OCR Results Objects
Iron OCR returns an OCR result object for each OCR operation. Generally, developers only use the text property of this object to get the text scanned from the image. However, the OCR results DOM is much more advanced than this.
using IronOcr;
using System.Drawing; // Add Assembly Reference
var Ocr = new IronTesseract();
Ocr.Configuration.EngineMode = TesseractEngineMode.TesseractAndLstm;
Ocr.Configuration.ReadBarCodes = true; // Important: required for Result.Barcodes
using (var Input = new OcrInput(@"images\sample.tiff"))
{
    OcrResult Result = Ocr.Read(Input);
    var Pages = Result.Pages;
    var Words = Pages[0].Words;
    var Barcodes = Result.Barcodes;
    // Explore here to find a massive, detailed API:
    // - Pages, Blocks, Paragraphs, Lines, Words, Chars
    // - Image Export, Fonts, Coordinates, Statistical Data
}
Imports IronOcr
Imports System.Drawing ' Add Assembly Reference
Dim Ocr = New IronTesseract()
Ocr.Configuration.EngineMode = TesseractEngineMode.TesseractAndLstm
Ocr.Configuration.ReadBarCodes = True ' Important: required for Result.Barcodes
Using Input = New OcrInput("images\sample.tiff")
    Dim Result As OcrResult = Ocr.Read(Input)
    Dim Pages = Result.Pages
    Dim Words = Pages(0).Words
    Dim Barcodes = Result.Barcodes
    ' Explore here to find a massive, detailed API:
    ' - Pages, Blocks, Paragraphs, Lines, Words, Chars
    ' - Image Export, Fonts, Coordinates, Statistical Data
End Using
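Once per-word data is available, a common next step is to flag words the engine was unsure about. The sketch below uses plain tuples as a stand-in for the library's word objects, so the field names (Text, Confidence), the 0 to 100 confidence scale, the review threshold and the sample values are all assumptions for illustration. They are not the IronOCR API or real OCR output.

```csharp
using System;
using System.Linq;

// Stand-in for OCR word results; the text and confidence values are made up for illustration.
var words = new (string Text, double Confidence)[]
{
    ("Invoice", 98.2),
    ("N0.", 61.5),
    ("10442", 95.0),
    ("Tota1", 58.9),
};

// Assumed cut-off: anything under 80 is sent for human review.
const double reviewThreshold = 80.0;

var needsReview = words.Where(w => w.Confidence < reviewThreshold)
                       .Select(w => w.Text)
                       .ToList();
double meanConfidence = words.Average(w => w.Confidence);

Console.WriteLine($"Mean confidence: {meanConfidence:F1}");
Console.WriteLine("Review: " + string.Join(", ", needsReview));
```

With a real result object, the same LINQ query would run over the words of each page instead of the sample array.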
Performance
IronOCR works out of the box, with no need to tune performance or heavily modify input images.
Speed is blazing: IronOcr 2020+ is up to 10 times faster and makes over 250% fewer errors than previous builds.
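Accuracy and error figures like these are usually checked by comparing OCR output with a hand-verified transcript. This page does not say how its figures were calculated, so the sketch below shows one common approach rather than the vendor's method: character accuracy derived from Levenshtein edit distance. The function names and sample strings are assumptions for illustration.

```csharp
using System;

// Classic dynamic-programming Levenshtein distance (insertions, deletions, substitutions).
int EditDistance(string a, string b)
{
    var previous = new int[b.Length + 1];
    var current = new int[b.Length + 1];
    for (int j = 0; j <= b.Length; j++) previous[j] = j;

    for (int i = 1; i <= a.Length; i++)
    {
        current[0] = i;
        for (int j = 1; j <= b.Length; j++)
        {
            int substitution = previous[j - 1] + (a[i - 1] == b[j - 1] ? 0 : 1);
            int deletion = previous[j] + 1;
            int insertion = current[j - 1] + 1;
            current[j] = Math.Min(substitution, Math.Min(deletion, insertion));
        }
        // The row just filled becomes the "previous" row for the next character.
        (previous, current) = (current, previous);
    }
    return previous[b.Length];
}

// Character accuracy = 1 - (edit distance / length of the reference text), floored at 0.
double CharacterAccuracy(string reference, string ocrOutput)
{
    if (reference.Length == 0) return ocrOutput.Length == 0 ? 1.0 : 0.0;
    double errorRate = (double)EditDistance(reference, ocrOutput) / reference.Length;
    return Math.Max(0.0, 1.0 - errorRate);
}

string truth = "Total due: 1,250.00";
string ocr = "Tota1 due: 1,250.0O";   // two misread characters

Console.WriteLine($"{CharacterAccuracy(truth, ocr):P1}");
```

Running the same measurement over your own representative documents is the most reliable way to compare engines or filter settings.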
Learn More
To learn more about OCR in C#, VB, F#, or any other .NET language, please read our community tutorials, which give real world examples of how Iron OCR can be used and may show the nuances of how to get the best out of this library.
A full API reference for .NET developers is also available.