Get Started with OCR in C# and VB.NET
IronOCR is a C# software library that lets .NET developers recognize and read text from images and PDF documents. It is a pure .NET OCR library built on the Tesseract engine, including Tesseract 5.
Installation
Install with NuGet Package Manager
Install IronOcr in Visual Studio or at the command line with the NuGet Package Manager. In Visual Studio, open the console via Tools -> NuGet Package Manager -> Package Manager Console, then run:
Install-Package IronOcr
See IronOcr on NuGet for more about version updates and installation.
There are other IronOCR NuGet Packages available for different platforms:
- Windows: https://www.nuget.org/packages/IronOcr
- Linux: https://www.nuget.org/packages/IronOcr.Linux
- MacOS: https://www.nuget.org/packages/IronOcr.MacOs
- MacOS (ARM): https://www.nuget.org/packages/IronOcr.MacOs.ARM
Download the IronOCR .ZIP
You may also choose to download IronOCR via .ZIP file instead. Click to directly download the DLL. Once you have the .zip downloaded:
Instructions for .NET Framework 4.0+ Installation:
- Add the IronOcr.dll from the net40 folder to your project
Then add Assembly references to:
- System.Configuration
- System.Drawing
- System.Web
Instructions for .NET Standard, .NET Core 2.0+, and .NET 5:
- Add the IronOcr.dll from the netstandard2.0 folder to your project
Then add a NuGet Package Reference to:
- System.Drawing.Common 4.7 or higher
Download the IronOCR Installer (Windows only)
Another option is to download the IronOCR installer, which installs all the resources IronOCR needs to work out of the box. This option is available for Windows systems only. To download the installer please click here. Once you have the .zip downloaded:
Instructions for .NET Framework 4.0+ Installation:
- Add the IronOcr.dll from the net40 folder to your project
Then add Assembly references to:
- System.Configuration
- System.Drawing
- System.Web
Instructions for .NET Standard, .NET Core 2.0+, and .NET 5:
- Add the IronOcr.dll from the netstandard2.0 folder to your project
Then add a NuGet Package Reference to:
- System.Drawing.Common 4.7 or higher
Why Choose IronOCR?
IronOCR is an easy-to-install, complete and well-documented .NET software library.
Choose IronOCR to achieve 99.8%+ OCR accuracy without external web services, ongoing fees, or sending confidential documents over the internet.
Why C# developers choose IronOCR over Vanilla Tesseract:
- Install as a single DLL or NuGet
- Includes Tesseract 5, 4, and 3 Engines out of the box.
- Accuracy of 99.8%, significantly outperforming regular Tesseract.
- Blazing Speed and MultiThreading
- MVC, WebApp, Desktop, Console & Server Application compatible
- No Exes or C++ code to work with
- Full PDF OCR support
- Perform OCR on almost any Image file or PDF
- Full .NET Core, Standard, and Framework support
- Deploy on Windows, Mac, Linux, Azure, Docker, Lambda, AWS
- Read barcodes and QR codes
- Export OCR results as XHTML
- Export OCR to searchable PDF documents
- 125 international languages all managed via NuGet or OcrData files
- Extract Images, Coordinates, Statistics, and Fonts. Not just text.
- Can be used to redistribute Tesseract OCR inside commercial & proprietary applications.
IronOCR shines when working with real-world images and imperfect documents such as photographs, or scans of low resolution which may have digital noise or imperfections.
Other free OCR libraries for the .NET platform, such as other .NET Tesseract APIs and web services, do not perform as well on these real-world use cases.
OCR with Tesseract 5 - Start Coding in C#
The code sample below shows how easy it is to read text from an image using C# or VB .NET.
OneLiner
// Import the necessary libraries
using System;
using IronOcr; // Ensure the IronOCR library is installed for OCR functionality

namespace OCRExample
{
    public class Program
    {
        public static void Main(string[] args)
        {
            // Create an instance of the IronTesseract class for OCR processing
            var ocr = new IronTesseract();

            // Define the path to the image that we want to process
            string imagePath = @"img\Screenshot.png";

            // Attempt to read text from the specified image using OCR
            try
            {
                // Read performs the OCR operation on the image
                var result = ocr.Read(imagePath);

                // Retrieve the text that was recognized from the image
                string text = result.Text;

                // Output the recognized text to the console
                Console.WriteLine("Extracted Text:");
                Console.WriteLine(text);
            }
            catch (Exception ex)
            {
                // Report any exception raised during OCR processing
                Console.WriteLine("An error occurred during OCR processing: " + ex.Message);
            }
        }
    }
}
Configurable Hello World
using System;
using IronOcr;

// Initialize the IronTesseract OCR engine
var ocr = new IronTesseract();

// Create an OcrInput object for processing images
using var input = new OcrInput();

// Load an image for OCR processing
// Ensure the image path is correct and accessible
input.AddImage("images/sample.jpeg");

// Perform OCR on the loaded image
OcrResult result = ocr.Read(input);

// Output the extracted text to the console
Console.WriteLine(result.Text);
C# PDF OCR
The same approach can similarly be used to extract text from any PDF document.
using System;
using System.Linq;
using IronOcr;

// IronTesseract is the main class for performing OCR operations.
IronTesseract ocr = new IronTesseract();

// The 'using' statement ensures that the OcrInput is disposed of properly.
using (OcrInput input = new OcrInput())
{
    try
    {
        // Load a PDF document into the OCR input.
        // "example.pdf" is the path to the PDF file.
        // "password" is used if the PDF is password protected.
        input.LoadPdf("example.pdf", "password");

        // Perform OCR on the input document.
        OcrResult result = ocr.Read(input);

        // Output the recognized text to the console.
        Console.WriteLine(result.Text);

        // Output the number of pages that were processed by OCR.
        // Count() from System.Linq works whether Pages is an array or a list.
        Console.WriteLine($"{result.Pages.Count()} Pages");
    }
    catch (Exception ex)
    {
        // Errors such as a missing file or a wrong password are caught here.
        Console.WriteLine($"An error occurred: {ex.Message}");
    }
}
// Note: This snippet assumes that the IronOcr library is installed,
// referenced in the project, and licensed if required.
OCR for MultiPage TIFFs
using System;
using IronOcr;

// Initialize an instance of IronTesseract for OCR processing
var ocr = new IronTesseract();

// The 'using' statement ensures that OcrInput resources are disposed of after use
using (var input = new OcrInput())
{
    // Specify the indices of the image frames to load from the TIFF file
    // Note: check the API reference for whether frame indices are 0-based or 1-based in your IronOCR version.
    int[] pageIndices = new int[] { 0, 1 };

    // Load specific frames of the multi-frame TIFF file for OCR processing
    // "multi-frame.tiff" is the file path; replace it with the actual path if different
    input.LoadImageFrames("multi-frame.tiff", pageIndices);

    // Perform OCR reading on the input frames and store the result
    OcrResult result = ocr.Read(input);

    // Output the recognized text to the console
    Console.WriteLine(result.Text);
}
Barcodes and QR
A unique feature of IronOCR is that it can read barcodes and QR codes from documents while it is scanning for text. Instances of the OcrResult.OcrBarcode class give the developer detailed information about each scanned barcode.
using System;
using IronOcr;

// Initialize an instance of IronTesseract
IronTesseract ocr = new IronTesseract();

// Enable barcode reading functionality
ocr.Configuration.ReadBarCodes = true;

// Create an instance of OcrInput to load images for processing
using (OcrInput input = new OcrInput())
{
    // Load the image containing barcodes. Ensure the file path is correct.
    input.AddImage("img/Barcode.png");

    // Process the input image to extract text and barcode information
    OcrResult result = ocr.Read(input);

    // Iterate over the extracted barcodes and print their values
    foreach (var barcode in result.Barcodes)
    {
        // Output the value of the detected barcode
        Console.WriteLine(barcode.Value);

        // Optionally, more properties of the barcode, such as type and location, can be accessed and printed
        // Example:
        // Console.WriteLine($"Type: {barcode.Type}, Location: {barcode.Location}");
    }
}
OCR on Specific Areas of Images
All of IronOCR's scanning and reading methods provide the ability to specify exactly which part of a page or pages we wish to read text from. This is very useful when we are looking at standardized forms and can save a lot of time and improve efficiency.
To use crop regions, we will need to add a system reference to System.Drawing so that we can use the System.Drawing.Rectangle object.
using IronOcr;
using System;

// This snippet demonstrates how to perform OCR on a specified area of an image.

// Create a new instance of IronTesseract, which is the core OCR engine
IronTesseract ocr = new IronTesseract();

// Create a new OCR input object. The 'using' declaration ensures that resources are disposed of correctly.
using OcrInput input = new OcrInput();

// Define a rectangle to specify the content area to be read by OCR within the image.
// Dimensions are in pixels: X and Y are the top-left coordinates, Width and Height define the rectangle size.
var contentArea = new System.Drawing.Rectangle
{
    X = 215,      // Horizontal offset from the left of the image
    Y = 1250,     // Vertical offset from the top of the image
    Width = 1335, // Width of the area to be scanned
    Height = 280  // Height of the area to be scanned
};

// Load the image into the OCR input object, using the defined content area.
// The image file name is "document.png". Ensure the file is in the correct path.
input.AddImage("document.png", contentArea);

// Perform OCR on the input. Read returns an OcrResult object which contains the recognized text.
OcrResult result = ocr.Read(input);

// Output the recognized text from the OCR result to the console.
Console.WriteLine(result.Text);
OCR for Low Quality Scans
The IronOCR OcrInput class can fix scans that normal Tesseract cannot read.
using System;
using IronOcr;

// Create an instance of IronTesseract to perform OCR
var ocr = new IronTesseract();

// Create an OcrInput for processing images; the 'using' statement disposes of it afterwards.
using (var input = new OcrInput())
{
    // Specify the page indices for frames we want to load from a multi-page image
    int[] pageIndices = new int[] { 1, 2 };

    // Load specified image frames from the given multi-page TIFF image
    input.LoadImageFrames(@"img\Potter.tiff", pageIndices);

    // Improve image quality by removing digital noise
    input.DeNoise();

    // Correct any skew or rotation in the image for better OCR accuracy
    input.Deskew();

    // Perform OCR on the input and get the result
    OcrResult result = ocr.Read(input);

    // Print the recognized text to the console
    Console.WriteLine(result.Text);
}
Export OCR results as a Searchable PDF
using IronOcr;

// Initialize the OCR engine
IronTesseract ocr = new IronTesseract();

// The 'using' statement ensures OcrInput resources are disposed of properly
using (OcrInput input = new OcrInput())
{
    // Set the title of the OCR input set
    input.Title = "Quarterly Report";

    // Load individual images for OCR processing
    input.AddImage("image1.jpeg");
    input.AddImage("image2.png");

    // Load specific frames from a GIF for OCR processing
    // The indices here specify the frames to be extracted
    int[] pageIndices = { 1, 2 };
    input.LoadImageFrames("image3.gif", pageIndices);

    // Perform OCR on the input images
    OcrResult result = ocr.Read(input);

    // Save the OCR output as a searchable PDF
    result.SaveAsSearchablePdf("searchable.pdf");
}
TIFF to searchable PDF Conversion
using IronOcr;

// This script performs OCR on specific pages of a multi-page TIFF file
// and converts them into a searchable PDF using the IronOCR library.

// Create an instance of IronTesseract for OCR processing.
var ocr = new IronTesseract();

// Initialize OCR input processing. The OcrInput class is a disposable resource.
using (var input = new OcrInput())
{
    // Specify the image frames to load from the multi-page TIFF file.
    // Here, we are loading the first and second pages (1-based index).
    var pageIndices = new int[] { 1, 2 };

    // Load specified frames from the TIFF image into the OCR input.
    input.LoadImageFrames("example.tiff", pageIndices);

    // Perform OCR on the loaded frames.
    var result = ocr.Read(input);

    // Save the result as a searchable PDF, where the text content can be searched.
    result.SaveAsSearchablePdf("searchable.pdf");
}
Export OCR results as HTML
// This code performs OCR on an image file and saves the result in the HOCR format,
// which is an HTML-based representation of the OCR data.
using IronOcr;

// Instantiate an IronTesseract object, which is the main class for OCR functionality.
var ocr = new IronTesseract();

// Create an OcrInput object, which represents the input data for OCR.
// The 'using' statement disposes of resources once processing is complete.
using (var input = new OcrInput())
{
    // Optionally set the title of the input, which can be used for referencing in output.
    input.Title = "Html Title";

    // Load the image file that you want to process with OCR.
    input.AddImage("image1.jpeg");

    // Perform OCR on the input data, obtaining an OcrResult object.
    OcrResult result = ocr.Read(input);

    // Save the OCR result as an HOCR file.
    // HOCR is an HTML-based format that includes OCR text positions and content.
    result.SaveAsHocrFile("results.html");
}
OCR Image Enhancement Filters
IronOCR provides unique filters for OcrInput objects to improve OCR performance.
Image Enhancement Code Example
using System;
using IronOcr;

// Create a new instance of IronTesseract for optical character recognition (OCR).
var ocr = new IronTesseract();

// Manage the OcrInput object within a using block so resources are released after use.
using (var input = new OcrInput())
{
    // Load the image from which text needs to be extracted.
    // Ensure the image path is correct and the file exists at the given location.
    input.AddImage("LowQuality.jpeg");

    // Apply digital noise reduction to improve OCR accuracy.
    input.DeNoise();

    // Correct the rotation of the image if needed.
    // Deskewing is useful for images that are slightly rotated or skewed.
    input.Deskew();

    // Perform OCR on the input image to extract text.
    OcrResult result = ocr.Read(input);

    // Guard against a null result before printing the recognized text.
    if (result != null)
    {
        Console.WriteLine(result.Text);
    }
    else
    {
        Console.WriteLine("OCR failed to recognize the text.");
    }
}
List of OCR Image Filters
Input filters to enhance OCR performance which are built into IronOCR include:
OcrInput.Rotate(double degrees)
- Rotates images by a number of degrees clockwise. For anti-clockwise rotation, use negative numbers.
OcrInput.Binarize()
- Converts every pixel to either black or white with no middle ground, potentially improving OCR performance in very low contrast images.
OcrInput.ToGrayScale()
- Converts every pixel into a shade of grayscale. It may not improve accuracy but could improve speed.
OcrInput.Contrast()
- Automatically increases contrast, often improving speed and accuracy in low contrast scans.
OcrInput.DeNoise()
- Removes digital noise, recommended only when noise is expected.
OcrInput.Invert()
- Inverts every color (white becomes black and vice versa).
OcrInput.Dilate()
- Advanced morphology: adds pixels to object boundaries. Opposite of Erode.
OcrInput.Erode()
- Advanced morphology: removes pixels from object boundaries. Opposite of Dilate.
OcrInput.Deskew()
- Rotates an image to orient it correctly. Useful because Tesseract's skew tolerance is limited.
OcrInput.EnhanceResolution
- Enhances resolution of low-quality images. This setting is generally used to manage low DPI input automatically. EnhanceResolution detects low-resolution images (below 275 dpi), upscales them, and sharpens text for better OCR results. Though time-consuming, it often reduces overall OCR operation time.
Language
- Selects the OCR language from the available international language packs.
Strategy
- Chooses between a fast but less accurate strategy and an advanced strategy that uses AI and the statistical relationship of words to improve accuracy.
ColorSpace
- Chooses whether to OCR in grayscale or color. Grayscale is generally optimal, though color can be better in certain contrast scenarios.
DetectWhiteTextOnDarkBackgrounds
- Adjusts for negative images, automatically detecting and reading white text on dark backgrounds.
InputImageType
- Guides the OCR library by specifying whether it is working on a full document or a snippet.
RotateAndStraighten
- Allows IronOCR to properly handle documents that are rotated or affected by perspective distortions.
ReadBarcodes
- Automatically reads barcodes and QR codes concurrently with text scanning without significant added time.
ColorDepth
- Determines bits per pixel for color depth in the OCR process. A higher depth can increase quality but also processing time.
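As a rough illustration of how the filters listed above can be combined, the sketch below prepares a low-contrast scan with light text on a dark background. It uses only filter names that appear in the list (Invert, Contrast, Binarize, Deskew). The file name dark-scan.png is a placeholder, and the order of the filters is an assumption to experiment with, not a recommendation from the library.

```csharp
using System;
using IronOcr;

var ocr = new IronTesseract();

using (var input = new OcrInput())
{
    // Placeholder path (hypothetical); replace with a real image file.
    input.AddImage("dark-scan.png");

    // Swap light text on a dark background to dark text on a light background.
    input.Invert();

    // Increase contrast before reducing the image to pure black and white.
    input.Contrast();
    input.Binarize();

    // Straighten the page last; this ordering is an assumption, not a documented rule.
    input.Deskew();

    OcrResult result = ocr.Read(input);
    Console.WriteLine(result.Text);
}
```

Each filter adds processing time, so it is usually worth comparing the output with and without a given filter on a few representative scans before applying it everywhere.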
125 Language Packs
IronOCR supports 125 international languages via language packs which are distributed as DLLs, available for download from this website, or from the NuGet Package Manager.
Languages include German, French, English, Chinese, Japanese, among others. Specialist language packs exist for MRZ, MICR checks, financial data, license plates, etc. Additionally, custom tesseract ".traineddata" files can be used.
Language Example
using IronOcr;

// PM> Install-Package IronOcr.Languages.Arabic
// Ensure you have the necessary NuGet package for Arabic language support.

// Create a new instance of IronTesseract for OCR
IronTesseract ocr = new IronTesseract();

// Set the language to Arabic
ocr.Language = OcrLanguage.Arabic;

// Initialize a new OcrInput instance
using (OcrInput input = new OcrInput())
{
    // Define the indices of the image frames to be loaded (in this example, frame indices start at 1 not 0)
    var pageIndices = new int[] { 1, 2 };

    // Load specific frames of a multi-frame image from the specified path
    input.LoadImageFrames("img/arabic.gif", pageIndices);

    // Note: Add image filters if needed to improve OCR accuracy; for this example, no filters are added.
    // Even though the input is of very low quality, IronTesseract can read what conventional Tesseract cannot.

    // Perform the OCR reading process on the input image
    OcrResult result = ocr.Read(input);

    // Save the result to a file, because the Windows console does not print Arabic text easily
    result.SaveAsTextFile("arabic.txt");
}
Multiple Language Example
It is also possible to OCR using multiple languages at the same time. This can enhance OCR of English metadata and URLs in Unicode documents.
using System;
using IronOcr;

// To use the Chinese Simplified language pack, run the following command in the Package Manager Console:
// PM> Install-Package IronOcr.Languages.ChineseSimplified

// Create a new instance of IronTesseract to perform OCR operations.
IronTesseract ocr = new IronTesseract();

// Set the primary language for OCR to Chinese Simplified.
ocr.Language = OcrLanguage.ChineseSimplified;

// Add a secondary language (English) to improve recognition for documents with mixed languages.
ocr.AddSecondaryLanguage(OcrLanguage.English);

// The 'using' statement ensures that resources are disposed of correctly after usage.
using (OcrInput input = new OcrInput())
{
    // Load a PDF document for OCR processing.
    input.AddPdf("multi-language.pdf");

    // Perform OCR on the loaded PDF and store the result.
    OcrResult result = ocr.Read(input);

    // Only save the result if some text was actually recognized.
    if (!string.IsNullOrWhiteSpace(result.Text))
    {
        // Save the recognized text to a text file.
        result.SaveAsTextFile("results.txt");
    }
    else
    {
        // Handle the empty-result case, e.g., log or provide feedback.
        Console.WriteLine("OCR failed to recognize the text.");
    }
}
Detailed OCR Results Objects
IronOCR returns an OCR result object for each operation. Generally, developers access the Text property to get scanned text. However, the results object contains much more detailed information.
using System.Linq;
using IronOcr;

// Initialize IronTesseract for OCR tasks
IronTesseract ocr = new IronTesseract();

// Enable reading barcodes from images
ocr.Configuration.ReadBarCodes = true;

// The 'using' statement ensures OcrInput is disposed correctly after use
using (OcrInput input = new OcrInput())
{
    // Specify frames/pages of the multi-frame image to analyze (1-based index)
    int[] pageIndices = new int[] { 1, 2 };

    // Load specified pages of the image from a TIFF file
    input.LoadImageFrames(@"img\sample.tiff", pageIndices);

    // Perform OCR on the input image(s) and get the results
    OcrResult result = ocr.Read(input);

    // Access recognized pages from the OCR result
    var pages = result.Pages;

    // Check that at least one page was recognized before indexing into it.
    // Count() from System.Linq works whether Pages is an array or a list.
    if (pages.Count() > 0)
    {
        // Access words on the first recognized page
        var words = pages[0].Words;
        // Add logic here to process the words if needed
    }

    // Access recognized barcodes from the OCR result
    var barcodes = result.Barcodes;
    // Add logic here to process the barcodes if needed

    // Explore a detailed API offering various OCR result details:
    // - Pages, Blocks, Paragraphs, Lines, Words, Chars
    // - Image Export, Font Coordinates, Statistical Data, Tables
}
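To make the "process the words" placeholder above more concrete, here is a minimal sketch that walks every page and prints each recognized word. Pages and Words come from the example above. The Text property on each word is an assumption (this guide only shows Text on the result object itself), and image1.jpeg is a placeholder file name, so check both against the API reference before relying on them.

```csharp
using System;
using IronOcr;

var ocr = new IronTesseract();

using (var input = new OcrInput())
{
    // Placeholder path (hypothetical); replace with a real image file.
    input.AddImage("image1.jpeg");

    OcrResult result = ocr.Read(input);

    // Walk the result hierarchy: each page exposes the words found on it.
    foreach (var page in result.Pages)
    {
        foreach (var word in page.Words)
        {
            // Assumption: each word exposes its recognized string via a Text property.
            Console.WriteLine(word.Text);
        }
    }
}
```

The same nested-loop pattern should extend to the other levels mentioned in the comments above (paragraphs, lines, characters), provided the corresponding collections are exposed under those names.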
Performance
IronOCR works out of the box with no need for performance tuning or image modification.
Speed is blazing: IronOcr.2020+ is up to 10 times faster and makes over 250% fewer errors than previous builds.
Learn More
To learn more about OCR in C#, VB, F#, or any other .NET language, please read our community tutorials, which give real-world examples of using IronOCR and show the nuances of optimizing the library.
A full API reference for .NET developers is also available.