How to Read PDFs and Scanned Images (OCR) in C# and VB.NET

Getting Started with OCR in C# and VB.Net

IronOCR is a C# software library allowing .NET platform software developers to recognize and read text from images and PDF documents. It converts images to text. As a bonus, Iron OCR can also read barcodes and QR codes and return them to the developer.

Installation

The first thing we have to do is install our OCR library into a Visual Studio project. To do this, we can choose one of two approaches:

  1. The easiest way to install Iron OCR (http://ironsoftware.com/csharp/ocr) is to use the NuGet Package Manager for Visual Studio. The package name is "IronOcr":

     PM > Install-Package IronOcr

  2. Download the IronOcr DLL directly from our homepage.

Why Choose IronOCR?

Iron OCR is an easy-to-install, complete and well-documented .NET software library. Iron OCR shines when working with real-world images and imperfect documents such as photographs or low-resolution scans, which may have digital noise or imperfections. Other free OCR libraries for the .NET platform, such as Tesseract, do not perform as well on these real-world use cases.

Iron OCR provides an excellent balance of performance against accuracy for image-to-text conversion in .NET.

Automated OCR

The Iron OCR AutoOcr class is the easiest way for developers to get started with optical character recognition in .NET. AutoOcr automatically detects image properties and adjusts for them, making best guesses about the most appropriate settings to read that document. It automatically corrects for digital noise, rotation, perspective, and even low resolution.

The code sample below shows how easy it is to read text from an image using C# or VB .NET.

C#:

using System;
using IronOcr;
//..

var Ocr = new AutoOcr();
var Result = Ocr.Read(@"C:\path\to\any\image.png");
Console.WriteLine(Result.Text);

VB.NET:

Imports System
Imports IronOcr
'..

Dim Ocr = New AutoOcr()
Dim Result = Ocr.Read("C:\path\to\any\image.png")
Console.WriteLine(Result.Text)

The same approach can be used to extract text from a PDF document.

C#:

using System;
using IronOcr;

var Ocr = new IronOcr.AutoOcr();
var Results = Ocr.ReadPdf(@"C:\Users\Me\Desktop\Invoice.pdf");
var Barcodes = Results.Barcodes;
var Text = Results.Text;
Console.WriteLine(Text);

VB.NET:

Imports System
Imports IronOcr

Dim Ocr = New IronOcr.AutoOcr()
Dim Results = Ocr.ReadPdf("C:\Users\Me\Desktop\Invoice.pdf")
Dim Barcodes = Results.Barcodes
Dim Text = Results.Text
Console.WriteLine(Text)

Advanced OCR

The Iron OCR AdvancedOcr class gives the developer granular control over the OCR operation, allowing them to achieve the highest degree of accuracy for very specific use cases.

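C#:

using System;
using System.Linq;
using IronOcr;
//..
var Ocr = new AdvancedOcr()
{
    CleanBackgroundNoise = true,
    EnhanceContrast = true,
    EnhanceResolution = true,
    Language = IronOcr.Languages.English.OcrLanguagePack,
    Strategy = IronOcr.AdvancedOcr.OcrStrategy.Advanced,
    ColorSpace = AdvancedOcr.OcrColorSpace.Color,
    DetectWhiteTextOnDarkBackgrounds = true,
    InputImageType = AdvancedOcr.InputTypes.AutoDetect,
    RotateAndStraighten = true,
    ReadBarCodes = true,
    ColorDepth = 4
};
var testImage = @"C:\path\to\scan.tiff";
var Results = Ocr.Read(testImage);
// Project each scanned barcode to its decoded value
var Barcodes = Results.Barcodes.Select(b => b.Value);
Console.WriteLine(Results.Text);
Console.WriteLine("Barcodes:" + String.Join(",", Barcodes));

VB.NET:

Imports System
Imports System.Linq
Imports IronOcr
'..
Dim Ocr = New AdvancedOcr() With {
    .CleanBackgroundNoise = True,
    .EnhanceContrast = True,
    .EnhanceResolution = True,
    .Language = IronOcr.Languages.English.OcrLanguagePack,
    .Strategy = IronOcr.AdvancedOcr.OcrStrategy.Advanced,
    .ColorSpace = AdvancedOcr.OcrColorSpace.Color,
    .DetectWhiteTextOnDarkBackgrounds = True,
    .InputImageType = AdvancedOcr.InputTypes.AutoDetect,
    .RotateAndStraighten = True,
    .ReadBarCodes = True,
    .ColorDepth = 4
}
Dim testImage = "C:\path\to\scan.tiff"
Dim Results = Ocr.Read(testImage)
' Project each scanned barcode to its decoded value
Dim Barcodes = Results.Barcodes.Select(Function(b) b.Value)
Console.WriteLine(Results.Text)
Console.WriteLine("Barcodes:" & String.Join(",", Barcodes))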

AdvancedOcr can also be used to extract text from selected pages of a PDF or from entire PDF documents.

C#:

using IronOcr;

var Ocr = new AdvancedOcr()
{
    CleanBackgroundNoise = false,
    ColorDepth = 4,
    ColorSpace = AdvancedOcr.OcrColorSpace.Color,
    EnhanceContrast = false,
    DetectWhiteTextOnDarkBackgrounds = false,
    RotateAndStraighten = false,
    Language = IronOcr.Languages.English.OcrLanguagePack,
    EnhanceResolution = false,
    InputImageType = AdvancedOcr.InputTypes.Document,
    ReadBarCodes = true,
    Strategy = AdvancedOcr.OcrStrategy.Fast
};

// Only these pages of the PDF are read
var PagesToRead = new []{1,2,3};
var Results = Ocr.ReadPdf(@"C:\Users\Me\Desktop\Invoice.pdf", PagesToRead);
var Pages = Results.Pages;
var Barcodes = Results.Barcodes;
var FullPdfText = Results.Text;

VB.NET:

Imports IronOcr

Dim Ocr = New AdvancedOcr() With {
    .CleanBackgroundNoise = False,
    .ColorDepth = 4,
    .ColorSpace = AdvancedOcr.OcrColorSpace.Color,
    .EnhanceContrast = False,
    .DetectWhiteTextOnDarkBackgrounds = False,
    .RotateAndStraighten = False,
    .Language = IronOcr.Languages.English.OcrLanguagePack,
    .EnhanceResolution = False,
    .InputImageType = AdvancedOcr.InputTypes.Document,
    .ReadBarCodes = True,
    .Strategy = AdvancedOcr.OcrStrategy.Fast
}

' Only these pages of the PDF are read
Dim PagesToRead = {1, 2, 3}
Dim Results = Ocr.ReadPdf("C:\Users\Me\Desktop\Invoice.pdf", PagesToRead)
Dim Pages = Results.Pages
Dim Barcodes = Results.Barcodes
Dim FullPdfText = Results.Text

Advanced OCR settings include:

CleanBackgroundNoise. This setting is somewhat time-consuming; however, it allows the library to automatically clean digital noise, paper crumples, and other imperfections within a digital image which would otherwise make it unreadable to other OCR libraries.

EnhanceContrast. This setting causes Iron OCR to automatically increase the contrast of text against the background of an image, which improves OCR accuracy and generally speeds up the OCR operation.

EnhanceResolution. This setting automatically detects low-resolution images (under 275 dpi), upscales them, and then sharpens the text so that it can be read accurately. Although this operation is in itself time-consuming, it generally reduces the overall time for an OCR operation on an image.

Language. Iron OCR supports 22 international language packs, and the Language setting can be used to select one or more languages to be applied for an OCR operation.

Strategy. Iron OCR supports two strategies. We may choose either a fast, less accurate scan of a document (Fast), or an advanced strategy (Advanced) which uses artificial intelligence models to improve the accuracy of OCR text by looking at the statistical relationship of words to one another in a sentence.

ColorSpace. This setting lets us choose to OCR in grayscale or color. Generally, grayscale is the best option. However, when text and background have similar brightness but different hues, they become hard to tell apart in grayscale, and a full-color color space will provide better results.

DetectWhiteTextOnDarkBackgrounds. Generally, OCR libraries expect to see black text on white backgrounds. This setting allows Iron OCR to automatically detect negatives, or dark pages with white text, and read them.

InputImageType. This setting allows the developer to guide the OCR library as to whether it is looking at a full document or a snippet, such as a screenshot.

RotateAndStraighten. This advanced setting allows Iron OCR to read documents which are not only rotated but also distorted by perspective, such as photographs of text documents.

ReadBarCodes. This feature allows Iron OCR to read barcodes and QR codes on pages as it reads text, without adding a large additional time burden.

ColorDepth. This setting determines how many bits per pixel the OCR library will use to determine the depth of a color. A higher color depth may increase OCR quality, but will also increase the time required for the OCR operation to complete.
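
To make the 275 dpi figure for EnhanceResolution concrete, the sketch below works out the effective resolution of a scan from its pixel width and the physical width of the page, and the scale factor that would bring it up to a target resolution. This only illustrates the arithmetic and is not Iron OCR's internal logic; the class DpiCheck, its method names, and the 300 dpi target are assumptions made for this example.

```csharp
using System;

// Hypothetical helper for illustration only; not part of Iron OCR.
public static class DpiCheck
{
    // Threshold below which the article describes an image as low-resolution.
    public const double LowResolutionThresholdDpi = 275.0;

    // Effective DPI of a scan: pixels across divided by physical width in inches.
    public static double EffectiveDpi(int pixelWidth, double physicalWidthInches)
    {
        if (pixelWidth <= 0) throw new ArgumentOutOfRangeException(nameof(pixelWidth));
        if (physicalWidthInches <= 0) throw new ArgumentOutOfRangeException(nameof(physicalWidthInches));
        return pixelWidth / physicalWidthInches;
    }

    public static bool IsLowResolution(double dpi)
    {
        return dpi < LowResolutionThresholdDpi;
    }

    // Scale factor needed to reach targetDpi (assumed default: 300). Never downscales.
    public static double UpscaleFactor(double dpi, double targetDpi = 300.0)
    {
        if (dpi <= 0) throw new ArgumentOutOfRangeException(nameof(dpi));
        return Math.Max(1.0, targetDpi / dpi);
    }

    public static void Main()
    {
        // A US Letter page (8.5 in wide) scanned at 1275 px across
        double dpi = EffectiveDpi(1275, 8.5);
        Console.WriteLine("Effective DPI: " + dpi);
        Console.WriteLine("Low resolution: " + IsLowResolution(dpi));
        Console.WriteLine("Upscale factor: " + UpscaleFactor(dpi));
    }
}
```

A check like this can help decide whether EnhanceResolution is worth its extra processing time for a given batch of scans.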

Language Packs

Iron OCR has support for 22 international languages. By default, only English is installed. You can install additional languages via NuGet or by downloading DLLs from our Language Packs page.


C#:

using IronOcr;
var Ocr = new AutoOcr()
{
    Language = IronOcr.Languages.Arabic.OcrLanguagePack,
};
var results = Ocr.Read(@"path\to\arabic\document.png");

VB.NET:

Imports IronOcr
Dim Ocr = New AutoOcr() With {
    .Language = IronOcr.Languages.Arabic.OcrLanguagePack
}
Dim results = Ocr.Read("path\to\arabic\document.png")

Likewise for AdvancedOcr:


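C#:

using IronOcr;
var Ocr = new AdvancedOcr()
{
    Language = IronOcr.Languages.Arabic.OcrLanguagePack,
    //...
};
var results = Ocr.Read(@"path\to\arabic\document.png");

VB.NET:

Imports IronOcr
' Other AdvancedOcr settings can be added to the initializer
Dim Ocr = New AdvancedOcr() With {
    .Language = IronOcr.Languages.Arabic.OcrLanguagePack
}
Dim results = Ocr.Read("path\to\arabic\document.png")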

Barcodes and QR

A bonus feature of Iron OCR is that it can read barcodes and QR codes from documents while it is scanning for text. Instances of the OcrResult.OcrBarcode class give the developer detailed information about each scanned barcode.

C#:

using IronOcr;
using System;
using System.Collections.Generic;
using System.Drawing; //Add Assembly Reference

// We can delve deep into OCR results as an object model of
// Pages, Barcodes, Paragraphs, Lines, Words and Characters
var Ocr = new AdvancedOcr()
{
    ReadBarCodes = true,
    Strategy = AdvancedOcr.OcrStrategy.Fast,
    InputImageType = AdvancedOcr.InputTypes.Document
};

// The input is a PDF, so ReadPdf is used here rather than Read
var Results = Ocr.ReadPdf(@"path\to\document.pdf");

foreach (var Page in Results.Pages)
{
    // page object
    foreach (var Barcode in Page.Barcodes){
        Console.WriteLine("Barcode of type {0} with value {1} found on page {2}", Barcode.Format.ToString(), Barcode.Value, Barcode.PageNumber);
        // location and an image of the barcode may also be returned as required
    }
}

VB.NET:

Imports IronOcr
Imports System
Imports System.Collections.Generic
Imports System.Drawing

Dim Ocr = New AdvancedOcr() With {
    .ReadBarCodes = True,
    .Strategy = AdvancedOcr.OcrStrategy.Fast,
    .InputImageType = AdvancedOcr.InputTypes.Document
}

' The input is a PDF, so ReadPdf is used here rather than Read
Dim Results = Ocr.ReadPdf("path\to\document.pdf")

For Each Page In Results.Pages
    ' page object
    For Each Barcode In Page.Barcodes
        Console.WriteLine("Barcode of type {0} with value {1} found on page {2}", Barcode.Format.ToString(), Barcode.Value, Barcode.PageNumber)
        ' location and an image of the barcode may also be returned as required
    Next
Next
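
Once barcodes have been read, it is often convenient to group their values by page. The sketch below shows one way this could be done with LINQ. BarcodeHit is a hypothetical stand-in that mirrors only the Format, Value and PageNumber properties used above (with Format simplified to a string); it is not an Iron OCR type, and BarcodeSummary is likewise a name invented for this example.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-in for a scanned barcode; not an Iron OCR type.
public sealed class BarcodeHit
{
    public string Format { get; }
    public string Value { get; }
    public int PageNumber { get; }

    public BarcodeHit(string format, string value, int pageNumber)
    {
        Format = format;
        Value = value;
        PageNumber = pageNumber;
    }
}

public static class BarcodeSummary
{
    // Groups barcode values by the page they were found on,
    // keeping the values in the order they were scanned.
    public static Dictionary<int, List<string>> ValuesByPage(IEnumerable<BarcodeHit> hits)
    {
        return hits
            .GroupBy(h => h.PageNumber)
            .ToDictionary(g => g.Key, g => g.Select(h => h.Value).ToList());
    }

    public static void Main()
    {
        // Sample data invented for the example
        var hits = new List<BarcodeHit>
        {
            new BarcodeHit("QRCode", "INV-001", 1),
            new BarcodeHit("Code128", "978020137962", 2),
            new BarcodeHit("Code128", "INV-001-A", 1)
        };

        foreach (var page in ValuesByPage(hits).OrderBy(p => p.Key))
        {
            Console.WriteLine("Page {0}: {1}", page.Key, string.Join(", ", page.Value));
        }
    }
}
```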

Crop Regions

All of Iron OCR's scanning and reading methods provide the ability to add a crop region, which specifies exactly which part of a page or pages we wish to read text from. This is very useful when we are looking at standardized forms, and can save a great deal of processing time.

To use crop regions, we need to add an assembly reference to System.Drawing so that we can use the System.Drawing.Rectangle object.


C#:

using IronOcr;
using System;
using System.Drawing; //Add Assembly Reference

// How to read just a rectangular portion of an image or PDF
var Ocr = new AutoOcr();
var X = 100; //px
var Y = 225;
var Width = 300;
var Height = 125;
var CropArea = new Rectangle(X,Y,Width,Height);
var Result = Ocr.Read(@"C:\path\to\image.png", CropArea );
// This approach works equally well with IronOcr.AdvancedOcr and PDF documents
Console.WriteLine(Result.Text);

VB.NET:

Imports IronOcr
Imports System
Imports System.Drawing

' How to read just a rectangular portion of an image or PDF
Dim Ocr = New AutoOcr()
Dim X = 100 ' px
Dim Y = 225
Dim Width = 300
Dim Height = 125
Dim CropArea = New Rectangle(X, Y, Width, Height)
Dim Result = Ocr.Read("C:\path\to\image.png", CropArea)
' This approach works equally well with IronOcr.AdvancedOcr and PDF documents
Console.WriteLine(Result.Text)
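
Crop regions are given in pixels, but the fields on a standardized form are usually measured on paper. The sketch below converts a field position measured in inches into a pixel rectangle for a scan of known resolution, and clamps it to the page bounds. CropHelper and its methods are hypothetical names for this example, and the 200 dpi scan resolution is an assumption; only System.Drawing.Rectangle comes from the sample above.

```csharp
using System;
using System.Drawing;

// Hypothetical helper for illustration only; not part of Iron OCR.
public static class CropHelper
{
    // Converts a region measured in inches into pixels at the given scan resolution.
    public static Rectangle FromInches(double left, double top, double width, double height, int dpi)
    {
        if (dpi <= 0) throw new ArgumentOutOfRangeException(nameof(dpi));
        return new Rectangle(
            (int)Math.Round(left * dpi),
            (int)Math.Round(top * dpi),
            (int)Math.Round(width * dpi),
            (int)Math.Round(height * dpi));
    }

    // Trims a crop region so that it never extends past the page edges.
    // Returns an empty rectangle if the region lies entirely off the page.
    public static Rectangle ClampToPage(Rectangle crop, int pageWidthPx, int pageHeightPx)
    {
        return Rectangle.Intersect(crop, new Rectangle(0, 0, pageWidthPx, pageHeightPx));
    }

    public static void Main()
    {
        // A field 0.5 in from the left and 1.125 in from the top,
        // 1.5 in wide and 0.625 in tall, on a scan assumed to be 200 dpi
        Rectangle crop = FromInches(0.5, 1.125, 1.5, 0.625, 200);
        Console.WriteLine(crop);

        // The same region clamped to a page of 350 x 300 px
        Console.WriteLine(ClampToPage(crop, 350, 300));
    }
}
```

With these inputs the converted rectangle has the same X, Y, Width and Height as the crop area in the sample above (100, 225, 300 and 125), so the result could be passed as the crop region argument in the same way.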

Getting Detailed Results Objects from IronOCR

Iron OCR returns an OCR results object for each OCR operation. Generally, developers only use the Text property of this object to get the text scanned from the image. However, the OCR results object exposes much more than this.

In the code sample below, we can see how we may iterate an OCR results object to look at the paragraphs, lines, words and characters of text which have been read during OCR, inspect them for statistical accuracy, and even look at them as images on a page by page basis.

C#:

using IronOcr;
using System;
using System.Collections.Generic;
using System.Drawing; //Add Assembly Reference

// We can delve deep into OCR results as an object model of
// Pages, Barcodes, Paragraphs, Lines, Words and Characters
var Ocr = new AdvancedOcr()
{
    Language = IronOcr.Languages.English.OcrLanguagePack,
    ColorSpace = AdvancedOcr.OcrColorSpace.GrayScale,
    EnhanceResolution = true,
    EnhanceContrast = true,
    CleanBackgroundNoise = true,
    ColorDepth = 4,
    RotateAndStraighten = false,
    DetectWhiteTextOnDarkBackgrounds = false,
    ReadBarCodes = true,
    Strategy = AdvancedOcr.OcrStrategy.Fast,
    InputImageType = AdvancedOcr.InputTypes.Document
};
var results = Ocr.Read(@"path\to\document.png");
foreach (var page in results.Pages)
{
    // page object
    int page_number = page.PageNumber;
    String page_text = page.Text;
    int page_wordcount = page.WordCount;
    List<OcrResult.OcrBarcode> barcodes = page.Barcodes;
    System.Drawing.Image page_image = page.Image;
    int page_width_px = page.Width;
    int page_height_px = page.Height;
    foreach (var paragraph in page.Paragraphs)
    {
        // pages -> paragraphs
        int paragraph_number = paragraph.ParagraphNumber;
        String paragraph_text = paragraph.Text;
        System.Drawing.Image paragraph_image = paragraph.Image;
        int paragraph_x_location = paragraph.X;
        int paragraph_y_location = paragraph.Y;
        int paragraph_width = paragraph.Width;
        int paragraph_height = paragraph.Height;
        double paragraph_ocr_accuracy = paragraph.Confidence;
        string paragraph_font_name = paragraph.FontName;
        double paragraph_font_size = paragraph.FontSize;
        OcrResult.TextFlow paragraph_text_direction = paragraph.TextDirection;
        double paragraph_rotation_degrees = paragraph.TextOrientation;
        foreach (var line in paragraph.Lines)
        {
            // pages -> paragraphs -> lines
            int line_number = line.LineNumber;
            String line_text = line.Text;
            System.Drawing.Image line_image = line.Image;
            int line_x_location = line.X;
            int line_y_location = line.Y;
            int line_width = line.Width;
            int line_height = line.Height;
            double line_ocr_accuracy = line.Confidence;
            double line_skew = line.BaselineAngle;
            double line_offset = line.BaselineOffset;
            foreach (var word in line.Words)
            {
                // pages -> paragraphs -> lines -> words
                int word_number = word.WordNumber;
                String word_text = word.Text;
                System.Drawing.Image word_image = word.Image;
                int word_x_location = word.X;
                int word_y_location = word.Y;
                int word_width = word.Width;
                int word_height = word.Height;
                double word_ocr_accuracy = word.Confidence;
                String word_font_name = word.FontName;
                double word_font_size = word.FontSize;
                bool word_is_bold = word.FontIsBold;
                bool word_is_fixed_width_font = word.FontIsFixedWidth;
                bool word_is_italic = word.FontIsItalic;
                bool word_is_serif_font = word.FontIsSerif;
                bool word_is_underlined = word.FontIsUnderlined;
                foreach (var character in word.Characters)
                {
                    // pages -> paragraphs -> lines -> words -> characters
                    int character_number = character.CharacterNumber;
                    String character_text = character.Text;
                    System.Drawing.Image character_image = character.Image;
                    int character_x_location = character.X;
                    int character_y_location = character.Y;
                    int character_width = character.Width;
                    int character_height = character.Height;
                    double character_ocr_accuracy = character.Confidence;
                }
            }
        }
    }
}
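
VB.NET:

Imports IronOcr
Imports System
Imports System.Collections.Generic
Imports System.Drawing

' We can delve deep into OCR results as an object model of
' Pages, Barcodes, Paragraphs, Lines, Words and Characters
Dim Ocr = New AdvancedOcr() With {
    .Language = IronOcr.Languages.English.OcrLanguagePack,
    .ColorSpace = AdvancedOcr.OcrColorSpace.GrayScale,
    .EnhanceResolution = True,
    .EnhanceContrast = True,
    .CleanBackgroundNoise = True,
    .ColorDepth = 4,
    .RotateAndStraighten = False,
    .DetectWhiteTextOnDarkBackgrounds = False,
    .ReadBarCodes = True,
    .Strategy = AdvancedOcr.OcrStrategy.Fast,
    .InputImageType = AdvancedOcr.InputTypes.Document
}
Dim results = Ocr.Read("path\to\document.png")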
For Each page In results.Pages
    ' page object
    Dim page_number As Integer = page.PageNumber
    Dim page_text As String = page.Text
    Dim page_wordcount As Integer = page.WordCount
    Dim barcodes As List(Of OcrResult.OcrBarcode) = page.Barcodes
    Dim page_image As System.Drawing.Image = page.Image
    Dim page_width_px As Integer = page.Width
    Dim page_height_px As Integer = page.Height
    For Each paragraph In page.Paragraphs
        ' pages -> paragraphs
        Dim paragraph_number As Integer = paragraph.ParagraphNumber
        Dim paragraph_text As String = paragraph.Text
        Dim paragraph_image As System.Drawing.Image = paragraph.Image
        Dim paragraph_x_location As Integer = paragraph.X
        Dim paragraph_y_location As Integer = paragraph.Y
        Dim paragraph_width As Integer = paragraph.Width
        Dim paragraph_height As Integer = paragraph.Height
        Dim paragraph_ocr_accuracy As Double = paragraph.Confidence
        Dim paragraph_font_name As String = paragraph.FontName
        Dim paragraph_font_size As Double = paragraph.FontSize
        Dim paragraph_text_direction As OcrResult.TextFlow = paragraph.TextDirection
        Dim paragraph_rotation_degrees As Double = paragraph.TextOrientation
        For Each line In paragraph.Lines
            ' pages -> paragraphs -> lines
            Dim line_number As Integer = line.LineNumber
            Dim line_text As String = line.Text
            Dim line_image As System.Drawing.Image = line.Image
            Dim line_x_location As Integer = line.X
            Dim line_y_location As Integer = line.Y
            Dim line_width As Integer = line.Width
            Dim line_height As Integer = line.Height
            Dim line_ocr_accuracy As Double = line.Confidence
            Dim line_skew As Double = line.BaselineAngle
            Dim line_offset As Double = line.BaselineOffset
            For Each word In line.Words
                ' pages -> paragraphs -> lines -> words
                Dim word_number As Integer = word.WordNumber
                Dim word_text As String = word.Text
                Dim word_image As System.Drawing.Image = word.Image
                Dim word_x_location As Integer = word.X
                Dim word_y_location As Integer = word.Y
                Dim word_width As Integer = word.Width
                Dim word_height As Integer = word.Height
                Dim word_ocr_accuracy As Double = word.Confidence
                Dim word_font_name As String = word.FontName
                Dim word_font_size As Double = word.FontSize
                Dim word_is_bold As Boolean = word.FontIsBold
                Dim word_is_fixed_width_font As Boolean = word.FontIsFixedWidth
                Dim word_is_italic As Boolean = word.FontIsItalic
                Dim word_is_serif_font As Boolean = word.FontIsSerif
                Dim word_is_underlined As Boolean = word.FontIsUnderlined
                For Each character In word.Characters
                    ' pages -> paragraphs -> lines -> words -> characters
                    Dim character_number As Integer = character.CharacterNumber
                    Dim character_text As String = character.Text
                    Dim character_image As System.Drawing.Image = character.Image
                    Dim character_x_location As Integer = character.X
                    Dim character_y_location As Integer = character.Y
                    Dim character_width As Integer = character.Width
                    Dim character_height As Integer = character.Height
                    Dim character_ocr_accuracy As Double = character.Confidence
                Next
            Next
        Next
    Next
Next
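
One practical use of the Confidence values shown above is to flag words that may need manual review. The sketch below assumes the words have already been copied out of the results object into a simple list. RecognizedWord and ConfidenceReview are hypothetical names for this example, and both the 0.8 threshold and the 0 to 1 confidence scale are assumptions; the object reference should be checked for the scale Iron OCR actually reports.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-in for a recognized word; not an Iron OCR type.
public sealed class RecognizedWord
{
    public string Text { get; }
    public double Confidence { get; } // assumed scale: 0.0 (no confidence) to 1.0 (certain)

    public RecognizedWord(string text, double confidence)
    {
        Text = text;
        Confidence = confidence;
    }
}

public static class ConfidenceReview
{
    // Returns the words whose confidence is below the threshold, least confident first.
    public static List<RecognizedWord> NeedsReview(IEnumerable<RecognizedWord> words, double threshold)
    {
        return words
            .Where(w => w.Confidence < threshold)
            .OrderBy(w => w.Confidence)
            .ToList();
    }

    public static void Main()
    {
        // Sample data invented for the example
        var words = new List<RecognizedWord>
        {
            new RecognizedWord("Invoice", 0.98),
            new RecognizedWord("T0tal", 0.41),
            new RecognizedWord("Due", 0.77)
        };

        foreach (var word in NeedsReview(words, 0.8))
        {
            Console.WriteLine("Review \"{0}\" (confidence {1})", word.Text, word.Confidence);
        }
    }
}
```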

Performance

Reading (the action of recognizing text from a visual image) is a human art which computers are only just learning to achieve. All OCR is inherently slow, and even on a modern i7 or Xeon-based server, we can expect OCR to achieve only human reading speeds. This can be a surprise to developers at first, but it's perfectly normal.

The higher the quality of the input document, the faster the results will come out. It can be counterintuitive that larger images at a higher resolution (ideally in the range of about 250 to 300 dpi) will actually scan faster than smaller, lower-resolution images.

Note that wherever possible, Iron OCR will use multithreading to speed up OCR operations on a page-by-page basis. This is particularly useful when batch processing images or reading multi-page PDF documents.
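
For batch jobs, a thin wrapper can spread many files across a bounded number of worker threads. The sketch below is one possible shape for such a wrapper, not part of Iron OCR: BatchOcr and ReadAll are hypothetical names, and the OCR call itself is passed in as a delegate so that the example runs without the library. Whether this helps on top of the library's own multithreading is something to measure rather than assume.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Threading.Tasks;

// Hypothetical batch wrapper for illustration only; not part of Iron OCR.
public static class BatchOcr
{
    // Runs readText over every path using at most maxParallelism threads
    // and returns the recognized text keyed by path.
    public static IDictionary<string, string> ReadAll(
        IEnumerable<string> paths, Func<string, string> readText, int maxParallelism)
    {
        if (paths == null) throw new ArgumentNullException(nameof(paths));
        if (readText == null) throw new ArgumentNullException(nameof(readText));
        if (maxParallelism <= 0) throw new ArgumentOutOfRangeException(nameof(maxParallelism));

        var results = new ConcurrentDictionary<string, string>();
        var options = new ParallelOptions { MaxDegreeOfParallelism = maxParallelism };

        Parallel.ForEach(paths, options, path =>
        {
            results[path] = readText(path);
        });

        return results;
    }

    public static void Main()
    {
        var paths = new[] { "page1.png", "page2.png", "page3.png" };

        // Placeholder reader so the sketch runs on its own. With the library installed,
        // the delegate might instead be: path => new AutoOcr().Read(path).Text
        // (one instance per call, because thread safety of a shared instance is not stated here).
        Func<string, string> fakeReader = path => "text of " + path;

        foreach (var entry in ReadAll(paths, fakeReader, 2))
        {
            Console.WriteLine("{0}: {1}", entry.Key, entry.Value);
        }
    }
}
```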

Learn More

To learn more about OCR in C#, VB, F#, or any other .NET language, please read our community tutorials, which give real world examples of how Iron OCR can be used and may show the nuances of how to get the best out of this library.

A full object reference for .NET developers is also available.