Get Started with OCR in C# and VB.NET
IronOCR is a C# software library that lets .NET developers recognize and read text from images and PDF documents. It is a pure .NET OCR library built on the most advanced Tesseract engine available.
Installation
Install with NuGet Package Manager
Install IronOcr in Visual Studio or at the command line with the NuGet Package Manager. In Visual Studio, navigate to the console with:
- Tools ->
- NuGet Package Manager ->
- Package Manager Console
Install-Package IronOcr
See IronOcr on NuGet for more about version updates and installation.
There are other IronOCR NuGet Packages available for different platforms:
- Windows: https://www.nuget.org/packages/IronOcr
- Linux: https://www.nuget.org/packages/IronOcr.Linux
- macOS: https://www.nuget.org/packages/IronOcr.MacOs
- macOS (ARM): https://www.nuget.org/packages/IronOcr.MacOs.ARM
Download the IronOCR .ZIP
You can also download IronOCR as a .ZIP file containing the DLL. Once you have the .zip downloaded:
Instructions for .NET Framework 4.0+ Installation:
- Include the IronOcr.dll from the net40 folder in your project
- Then add assembly references to:
  - System.Configuration
  - System.Drawing
  - System.Web
Instructions for .NET Standard, .NET Core 2.0+ and .NET 5:
- Include the IronOcr.dll from the netstandard2.0 folder in your project
- Then add a NuGet package reference to:
  - System.Drawing.Common 4.7 or higher
Download the IronOCR Installer (Windows only)
Another option is to download the IronOCR installer, which installs all the resources IronOCR needs to work out of the box. This option is for Windows systems only. Once you have the .zip downloaded:
Instructions for .NET Framework 4.0+ Installation:
- Include the IronOcr.dll from the net40 folder in your project
- Then add assembly references to:
  - System.Configuration
  - System.Drawing
  - System.Web
Instructions for .NET Standard, .NET Core 2.0+ and .NET 5:
- Include the IronOcr.dll from the netstandard2.0 folder in your project
- Then add a NuGet package reference to:
  - System.Drawing.Common 4.7 or higher
Why Choose IronOCR?
IronOCR is an easy-to-install, complete and well-documented .NET software library.
Choose IronOCR to achieve 99.8%+ OCR accuracy without relying on external web services, paying ongoing fees, or sending confidential documents over the internet.
Why C# developers choose IronOCR over Vanilla Tesseract:
- Install as a single DLL or NuGet package
- Includes Tesseract 5, 4 and 3 engines out of the box
- 99.8% accuracy, significantly outperforming regular Tesseract
- Blazing speed and multithreading
- Compatible with MVC, web, desktop, console and server applications
- No EXEs or C++ code to work with
- Full PDF OCR support
- Performs OCR on almost any image file or PDF
- Full .NET Core, .NET Standard and .NET Framework support
- Deploy on Windows, Mac, Linux, Azure, Docker, Lambda, AWS
- Read barcodes and QR codes
- Export OCR results as XHTML
- Export OCR results to searchable PDF documents
- 125 international languages, all managed via NuGet or OcrData files
- Extract images, coordinates, statistics and fonts, not just text
- Can be used to redistribute Tesseract OCR inside commercial and proprietary applications
IronOCR shines when working with real-world images and imperfect documents, such as photographs or low-resolution scans that contain digital noise or other imperfections.
Other free OCR options for the .NET platform, such as other .NET Tesseract APIs and web services, do not perform as well on these real-world use cases.
OCR with Tesseract 5 - Start Coding in C#
The code sample below shows how easy it is to read text from an image using C# or VB .NET.
OneLiner
using IronOcr;
string Text = new IronTesseract().Read(@"img\Screenshot.png").Text;
Imports IronOcr
Dim Text As String = (New IronTesseract()).Read("img\Screenshot.png").Text
Configurable Hello World
using IronOcr;
IronTesseract ocr = new IronTesseract();
using OcrInput input = new OcrInput();
// Add one or more images (call LoadImage once per image)
input.LoadImage("images/sample.jpeg");
OcrResult result = ocr.Read(input);
Console.WriteLine(result.Text);
Imports IronOcr
Dim ocr As New IronTesseract()
Using input As New OcrInput()
    ' Add one or more images (call LoadImage once per image)
    input.LoadImage("images/sample.jpeg")
    Dim result As OcrResult = ocr.Read(input)
    Console.WriteLine(result.Text)
End Using
C# PDF OCR
The same approach can similarly be used to extract text from any PDF document.
using IronOcr;
IronTesseract ocr = new IronTesseract();
using OcrInput input = new OcrInput();
// We can also select specific PDF page numbers to OCR
input.LoadPdf("example.pdf", Password: "password");
OcrResult result = ocr.Read(input);
Console.WriteLine(result.Text);
// 1 page for every page of the PDF
Console.WriteLine($"{result.Pages.Length} Pages");
Imports IronOcr
Dim ocr As New IronTesseract()
Using input As New OcrInput()
    ' We can also select specific PDF page numbers to OCR
    input.LoadPdf("example.pdf", Password:="password")
    Dim result As OcrResult = ocr.Read(input)
    Console.WriteLine(result.Text)
    ' 1 page for every page of the PDF
    Console.WriteLine($"{result.Pages.Length} Pages")
End Using
OCR for MultiPage TIFFs
using IronOcr;
IronTesseract ocr = new IronTesseract();
using OcrInput input = new OcrInput();
var pageindices = new int[] { 1, 2 };
input.LoadImageFrames("multi-frame.tiff", pageindices);
OcrResult result = ocr.Read(input);
Console.WriteLine(result.Text);
Imports IronOcr
Dim ocr As New IronTesseract()
Using input As New OcrInput()
    Dim pageindices = New Integer() { 1, 2 }
    input.LoadImageFrames("multi-frame.tiff", pageindices)
    Dim result As OcrResult = ocr.Read(input)
    Console.WriteLine(result.Text)
End Using
Barcodes and QR
A unique feature of IronOCR is that it can read barcodes and QR codes from documents while it is scanning for text. Instances of the OcrResult.OcrBarcode class give the developer detailed information about each scanned barcode.
using IronOcr;
IronTesseract ocr = new IronTesseract();
ocr.Configuration.ReadBarCodes = true;
using OcrInput input = new OcrInput();
input.LoadImage("img/Barcode.png");
OcrResult Result = ocr.Read(input);
foreach (var Barcode in Result.Barcodes)
{
    // type and location properties also exposed
    Console.WriteLine(Barcode.Value);
}
Imports IronOcr
Dim ocr As New IronTesseract()
ocr.Configuration.ReadBarCodes = True
Using input As New OcrInput()
    input.LoadImage("img/Barcode.png")
    Dim Result As OcrResult = ocr.Read(input)
    For Each Barcode In Result.Barcodes
        ' type and location properties also exposed
        Console.WriteLine(Barcode.Value)
    Next Barcode
End Using
OCR on Specific Areas of Images
All of IronOCR's scanning and reading methods provide the ability to specify exactly which part of a page or pages we wish to read text from. This is very useful for standardized forms, where it can save a lot of time and improve efficiency.
To use crop regions, we need to add a system reference to System.Drawing so that we can use the System.Drawing.Rectangle object.
using IronOcr;
IronTesseract ocr = new IronTesseract();
using OcrInput input = new OcrInput();
// Dimensions are in pixels
var contentArea = new System.Drawing.Rectangle() { X = 215, Y = 1250, Height = 280, Width = 1335 };
input.LoadImage("document.png", contentArea);
OcrResult result = ocr.Read(input);
Console.WriteLine(result.Text);
Imports IronOcr
Dim ocr As New IronTesseract()
Using input As New OcrInput()
    ' Dimensions are in pixels
    Dim contentArea = New System.Drawing.Rectangle() With {
        .X = 215,
        .Y = 1250,
        .Height = 280,
        .Width = 1335
    }
    input.LoadImage("document.png", contentArea)
    Dim result As OcrResult = ocr.Read(input)
    Console.WriteLine(result.Text)
End Using
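When the same standardized form is scanned at different resolutions, fixed pixel coordinates stop lining up. One possible approach, sketched below, is to describe the region as fractions of the page and convert it to pixels for each image. The CropRegionSketch class and its FromFractions method are hypothetical helpers written for this illustration; they are not part of IronOCR.

```csharp
using System;
using System.Drawing;

// Hypothetical helper for illustration only; not part of IronOCR.
public static class CropRegionSketch
{
    // Converts a region expressed as fractions (0.0 to 1.0) of the page
    // into a pixel-based System.Drawing.Rectangle for a given image size.
    public static Rectangle FromFractions(
        int imageWidth, int imageHeight,
        double left, double top, double width, double height)
    {
        return new Rectangle(
            (int)Math.Round(left * imageWidth),
            (int)Math.Round(top * imageHeight),
            (int)Math.Round(width * imageWidth),
            (int)Math.Round(height * imageHeight));
    }
}
```

For a 1600 x 2000 pixel scan, CropRegionSketch.FromFractions(1600, 2000, 0.125, 0.5, 0.75, 0.25) yields a rectangle with X = 200, Y = 1000, Width = 1200 and Height = 500. That rectangle can then be passed to input.LoadImage as the crop region, in the same way as contentArea in the example above.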
OCR for Low Quality Scans
The IronOCR OcrInput class can fix scans that normal Tesseract cannot read.
using IronOcr;
IronTesseract ocr = new IronTesseract();
using OcrInput input = new OcrInput();
var pageindices = new int[] { 1, 2 };
input.LoadImageFrames(@"img\Potter.tiff", pageindices);
// fixes digital noise and poor scanning
input.DeNoise();
// fixes rotation and perspective
input.Deskew();
OcrResult result = ocr.Read(input);
Console.WriteLine(result.Text);
Imports IronOcr
Dim ocr As New IronTesseract()
Using input As New OcrInput()
    Dim pageindices = New Integer() { 1, 2 }
    input.LoadImageFrames("img\Potter.tiff", pageindices)
    ' fixes digital noise and poor scanning
    input.DeNoise()
    ' fixes rotation and perspective
    input.Deskew()
    Dim result As OcrResult = ocr.Read(input)
    Console.WriteLine(result.Text)
End Using
Export OCR results as a Searchable PDF
using IronOcr;
IronTesseract ocr = new IronTesseract();
using OcrInput input = new OcrInput();
input.Title = "Quarterly Report";
input.LoadImage("image1.jpeg");
input.LoadImage("image2.png");
var pageindices = new int[] { 1, 2 };
input.LoadImageFrames("image3.gif", pageindices);
OcrResult result = ocr.Read(input);
result.SaveAsSearchablePdf("searchable.pdf");
Imports IronOcr
Dim ocr As New IronTesseract()
Using input As New OcrInput()
    input.Title = "Quarterly Report"
    input.LoadImage("image1.jpeg")
    input.LoadImage("image2.png")
    Dim pageindices = New Integer() { 1, 2 }
    input.LoadImageFrames("image3.gif", pageindices)
    Dim result As OcrResult = ocr.Read(input)
    result.SaveAsSearchablePdf("searchable.pdf")
End Using
TIFF to searchable PDF Conversion
using IronOcr;
IronTesseract ocr = new IronTesseract();
using OcrInput input = new OcrInput();
var pageindices = new int[] { 1, 2 };
input.LoadImageFrames("example.tiff", pageindices);
ocr.Read(input).SaveAsSearchablePdf("searchable.pdf");
Imports IronOcr
Dim ocr As New IronTesseract()
Using input As New OcrInput()
    Dim pageindices = New Integer() { 1, 2 }
    input.LoadImageFrames("example.tiff", pageindices)
    ocr.Read(input).SaveAsSearchablePdf("searchable.pdf")
End Using
Export OCR results as HTML
using IronOcr;
IronTesseract ocr = new IronTesseract();
using OcrInput input = new OcrInput();
input.Title = "Html Title";
input.LoadImage("image1.jpeg");
OcrResult Result = ocr.Read(input);
Result.SaveAsHocrFile("results.html");
Imports IronOcr
Dim ocr As New IronTesseract()
Using input As New OcrInput()
    input.Title = "Html Title"
    input.LoadImage("image1.jpeg")
    Dim Result As OcrResult = ocr.Read(input)
    Result.SaveAsHocrFile("results.html")
End Using
OCR Image Enhancement Filters
IronOCR provides unique filters for OcrInput objects to improve OCR performance.
Image Enhancement Code Example
using IronOcr;
IronTesseract ocr = new IronTesseract();
using OcrInput input = new OcrInput();
input.LoadImage("LowQuality.jpeg");
// fixes digital noise and poor scanning
input.DeNoise();
// fixes rotation and perspective
input.Deskew();
OcrResult result = ocr.Read(input);
Console.WriteLine(result.Text);
Imports IronOcr
Dim ocr As New IronTesseract()
Using input As New OcrInput()
    input.LoadImage("LowQuality.jpeg")
    ' fixes digital noise and poor scanning
    input.DeNoise()
    ' fixes rotation and perspective
    input.Deskew()
    Dim result As OcrResult = ocr.Read(input)
    Console.WriteLine(result.Text)
End Using
List of OCR Image Filters
Input filters to enhance OCR performance which are built into IronOCR include:
- OcrInput.Rotate(double degrees) - Rotates images by a number of degrees clockwise. For anti-clockwise, use negative numbers.
- OcrInput.Binarize() - This image filter turns every pixel black or white with no middle ground. May improve OCR performance in cases of very low contrast between text and background.
- OcrInput.ToGrayScale() - This image filter turns every pixel into a shade of grayscale. Unlikely to improve OCR accuracy but may improve speed.
- OcrInput.Contrast() - Increases contrast automatically. This filter often improves OCR speed and accuracy in low contrast scans.
- OcrInput.DeNoise() - Removes digital noise. This filter should only be used where noise is expected.
- OcrInput.Invert() - Inverts every color. E.g. white becomes black and black becomes white.
- OcrInput.Dilate() - Advanced Morphology. Dilation adds pixels to the boundaries of objects in an image. Opposite of Erode.
- OcrInput.Erode() - Advanced Morphology. Erosion removes pixels on object boundaries. Opposite of Dilate.
- OcrInput.Deskew() - Rotates an image so it is the right way up and orthogonal. This is very useful for OCR because Tesseract tolerance for skewed scans can be as low as 5 degrees.
- OcrInput.EnhanceResolution - Enhances the resolution of low quality images. This filter is not often needed because OcrInput.MinimumDPI and OcrInput.TargetDPI will automatically catch and resolve low resolution inputs.
- EnhanceResolution is a setting that automatically detects low-resolution images (under 275 dpi), upscales them, and then sharpens the text so that it can be read reliably by an OCR library. Although this operation is itself time-consuming, it generally reduces the overall time for an OCR operation on an image.
- Language. IronOCR supports 125 international language packs, and the language setting can be used to select one or more languages to be applied for an OCR operation.
- Strategy. IronOCR supports two strategies. We may choose either a fast, less accurate scan of a document, or an advanced strategy that uses artificial intelligence models to improve the accuracy of OCR text by looking at the statistical relationship of words to one another in a sentence.
- ColorSpace is a setting whereby we can choose to OCR in grayscale or color. Generally, grayscale is the best option. However, when text and background have similar brightness but very different colors, a full-color color space will sometimes provide better results.
- DetectWhiteTextOnDarkBackgrounds. Generally, all OCR libraries expect to see black text on white backgrounds. This setting allows IronOCR to automatically detect negatives, or dark pages with white text, and read them.
- InputImageType. This setting allows the developer to guide the OCR library as to whether it is looking at a full document or a snippet, such as a screenshot.
- RotateAndStraighten is an advanced setting that allows IronOCR to read documents which are not only rotated but may also contain perspective distortion, such as photographs of text documents.
- ReadBarcodes is a useful feature that allows IronOCR to read barcodes and QR codes on pages as it reads text, without adding a large additional time burden.
- ColorDepth. This setting determines how many bits per pixel the OCR library will use to determine the depth of a color. A higher color depth may increase OCR quality, but will also increase the time required for the OCR operation to complete.
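To make the Binarize, Dilate and Erode descriptions above more concrete, here is a minimal sketch of what such filters do conceptually, written in plain C# over a grid of grayscale bytes. It is an illustration only and not IronOCR's implementation: the FilterSketch class, the fixed threshold parameter and the 3x3 neighborhood are all assumptions made for this example.

```csharp
// Illustrative only: plain C# over a byte grid, not IronOCR's implementation.
// Convention used here: 0 is black (text), 255 is white (background).
public static class FilterSketch
{
    // Binarize: every pixel becomes black or white with no middle ground.
    public static byte[,] Binarize(byte[,] gray, byte threshold)
    {
        int h = gray.GetLength(0), w = gray.GetLength(1);
        var result = new byte[h, w];
        for (int y = 0; y < h; y++)
            for (int x = 0; x < w; x++)
                result[y, x] = gray[y, x] >= threshold ? (byte)255 : (byte)0;
        return result;
    }

    // Dilate: a pixel turns black if any pixel in its 3x3 neighborhood is black,
    // which adds pixels to the boundaries of black objects.
    public static byte[,] Dilate(byte[,] binary) => Morph(binary, grow: true);

    // Erode: a pixel stays black only if its whole 3x3 neighborhood is black,
    // which removes pixels from the boundaries of black objects.
    public static byte[,] Erode(byte[,] binary) => Morph(binary, grow: false);

    private static byte[,] Morph(byte[,] binary, bool grow)
    {
        int h = binary.GetLength(0), w = binary.GetLength(1);
        var result = new byte[h, w];
        for (int y = 0; y < h; y++)
            for (int x = 0; x < w; x++)
            {
                bool anyBlack = false, allBlack = true;
                for (int dy = -1; dy <= 1; dy++)
                    for (int dx = -1; dx <= 1; dx++)
                    {
                        int ny = y + dy, nx = x + dx;
                        // Neighbors outside the image are ignored.
                        if (ny < 0 || ny >= h || nx < 0 || nx >= w) continue;
                        if (binary[ny, nx] == 0) anyBlack = true;
                        else allBlack = false;
                    }
                bool black = grow ? anyBlack : allBlack;
                result[y, x] = black ? (byte)0 : (byte)255;
            }
        return result;
    }
}
```

With a single black pixel in the middle of a 3x3 white grid, Dilate turns the whole grid black, while Erode removes the pixel and leaves the grid entirely white. This is why dilation can help reconnect thin or broken strokes and erosion can help thin out heavy, smudged ones.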
125 Language Packs
IronOCR supports 125 international languages via language packs. These are distributed as DLLs and can be downloaded from this website or installed with the NuGet Package Manager.
Languages include German, French, English, Chinese, Japanese and many more. Specialist language packs exist for passport MRZ, MICR checks, financial data, license plates and many more. You can also use any Tesseract ".traineddata" file, including ones you create yourself.
Language Example
using IronOcr;
// PM> Install IronOcr.Languages.Arabic
IronTesseract ocr = new IronTesseract();
ocr.Language = OcrLanguage.Arabic;
using OcrInput input = new OcrInput();
var pageindices = new int[] { 1, 2 };
input.LoadImageFrames("img/arabic.gif", pageindices);
// Add image filters if needed
// In this case, even though the input is very low quality,
// IronTesseract can read what conventional Tesseract cannot.
OcrResult result = ocr.Read(input);
// Console can't print Arabic on Windows easily.
// Let's save to disk instead.
result.SaveAsTextFile("arabic.txt");
Imports IronOcr
' PM> Install IronOcr.Languages.Arabic
Dim ocr As New IronTesseract()
ocr.Language = OcrLanguage.Arabic
Using input As New OcrInput()
    Dim pageindices = New Integer() { 1, 2 }
    input.LoadImageFrames("img/arabic.gif", pageindices)
    ' Add image filters if needed
    ' In this case, even though the input is very low quality,
    ' IronTesseract can read what conventional Tesseract cannot.
    Dim result As OcrResult = ocr.Read(input)
    ' Console can't print Arabic on Windows easily.
    ' Let's save to disk instead.
    result.SaveAsTextFile("arabic.txt")
End Using
Multiple Language Example
It is also possible to OCR using multiple languages at the same time. This can really help to capture English-language metadata and URLs in Unicode documents.
using IronOcr;
// PM> Install IronOcr.Languages.ChineseSimplified
IronTesseract ocr = new IronTesseract();
ocr.Language = OcrLanguage.ChineseSimplified;
// We can add any number of languages
ocr.AddSecondaryLanguage(OcrLanguage.English);
using OcrInput input = new OcrInput();
input.LoadPdf("multi-language.pdf");
OcrResult result = ocr.Read(input);
result.SaveAsTextFile("results.txt");
Imports IronOcr
' PM> Install IronOcr.Languages.ChineseSimplified
Dim ocr As New IronTesseract()
ocr.Language = OcrLanguage.ChineseSimplified
' We can add any number of languages
ocr.AddSecondaryLanguage(OcrLanguage.English)
Using input As New OcrInput()
    input.LoadPdf("multi-language.pdf")
    Dim result As OcrResult = ocr.Read(input)
    result.SaveAsTextFile("results.txt")
End Using
Detailed OCR Results Objects
IronOCR returns an OCR result object for each OCR operation. Generally, developers only use the Text property of this object to get the text scanned from the image. However, the OCR results DOM exposes far more than this.
using IronOcr;
IronTesseract ocr = new IronTesseract();
// Must be set to true to read barcode
ocr.Configuration.ReadBarCodes = true;
using OcrInput input = new OcrInput();
var pageindices = new int[] { 1, 2 };
input.LoadImageFrames(@"img\sample.tiff", pageindices);
OcrResult result = ocr.Read(input);
var pages = result.Pages;
var words = pages[0].Words;
var barcodes = result.Barcodes;
// Explore here to find a massive, detailed API:
// - Pages, Blocks, Paragraphs, Lines, Words, Chars
// - Image Export, Fonts, Coordinates, Statistical Data, Tables
Imports IronOcr
Dim ocr As New IronTesseract()
' Must be set to true to read barcode
ocr.Configuration.ReadBarCodes = True
Using input As New OcrInput()
    Dim pageindices = New Integer() { 1, 2 }
    input.LoadImageFrames("img\sample.tiff", pageindices)
    Dim result As OcrResult = ocr.Read(input)
    Dim pages = result.Pages
    Dim words = pages(0).Words
    Dim barcodes = result.Barcodes
    ' Explore here to find a massive, detailed API:
    ' - Pages, Blocks, Paragraphs, Lines, Words, Chars
    ' - Image Export, Fonts, Coordinates, Statistical Data, Tables
End Using
Performance
IronOCR works out of the box with no need to performance tune or heavily modify input images.
Speed is blazing: IronOcr 2020+ is up to 10 times faster and makes over 250% fewer errors than previous builds.
Learn More
To learn more about OCR in C#, VB, F#, or any other .NET language, please read our community tutorials, which give real world examples of how IronOCR can be used and may show the nuances of how to get the best out of this library.
A full API reference for .NET developers is also available.