Read Text from Images with C# OCR
In this tutorial, we will learn how to convert images to text in C# and other .NET languages.
How to Convert Images to Text in C#
- Download the OCR Image to Text IronOCR Library
- Adjust Crop Regions to Read Parts of an Image
- Use up to 125 international languages via Language Packs
- Export OCR Scan Results as Text or Searchable PDF
Reading Text from Images in .NET Applications
We will use the IronOcr.IronTesseract class to recognize text within images and look at the nuances of using Iron Tesseract OCR to get the best accuracy and speed when reading text from images in .NET.
To achieve "Image to Text" we will install the IronOCR library into a Visual Studio project.
To do this, we download the IronOcr DLL or use NuGet.
Install-Package IronOcr
Why IronOCR?
We use IronOCR for Tesseract management because it is unique in that it:
- Works straight out of the box in pure .NET
- Doesn't require Tesseract to be installed on your machine
- Runs the latest engines: Tesseract 5 (as well as Tesseract 4 & 3)
- Is available for any .NET project: .NET Framework 4.5+, .NET Standard 2+ and .NET Core 2, 3 & 5
- Has improved accuracy and speed over traditional Tesseract
- Supports Xamarin, Mono, Azure and Docker
- Manages the complex Tesseract dictionary system using NuGet packages
- Supports PDFs, multi-frame TIFFs and all major image formats without configuration
- Can correct low quality and skewed scans to get the best results from Tesseract
Start using IronOCR in your project today with a free trial.
Using Tesseract in C#
In this simple example, you can see we use the IronOcr.IronTesseract class to read the text from an image and automatically return its value as a string.
// PM> Install-Package IronOcr
using IronOcr;
OcrResult result = new IronTesseract().Read(@"img\Screenshot.png");
Console.WriteLine(result.Text);
' PM> Install-Package IronOcr
Imports IronOcr
Dim result As OcrResult = New IronTesseract().Read("img\Screenshot.png")
Console.WriteLine(result.Text)
Which results in 100% accuracy with the following text:
IronOCR Simple Example
In this simple example we will test the accuracy of our C# OCR library to read text from a PNG
Image. This is a very basic test, but things will get more complicated as the tutorial continues.
The quick brown fox jumps over the lazy dog
Although this may seem simplistic, there is sophisticated behavior going on 'under the surface': scanning the image for alignment, quality and resolution, looking at its properties, optimizing the OCR engine, and using a trained artificial intelligence network to then read the text as a human would.
OCR is not a simple process for a computer to achieve, and reading speeds may be similar to those of a human. In other words, OCR is not an instantaneous process. In this case though, it is 100% accurate.
Advanced Use of IronOCR Tesseract for C#
In most real world use cases, developers are going to want the best performance possible for their project. In this case, we recommend that you move forward to use the OcrInput and IronTesseract classes within the IronOcr namespace.
OcrInput gives you the facility to set the specific characteristics of an OCR job, such as:
- Working with almost any type of image including JPEG, TIFF, GIF, BMP & PNG
- Importing whole or parts of PDF documents
- Enhancing contrast, resolution & size
- Correcting for rotation, scan noise, digital noise, skew, negative images
IronTesseract lets you:
- Pick from hundreds of prepackaged language and language variants
- Use Tesseract 5, 4 or 3 OCR engines "out-of-the-box"
- Specify a document type whether we are looking at a screenshot, a snippet, or an entire document
- Read Barcodes
- Output results to: Searchable PDFs, hOCR HTML, a DOM & strings
Example: Getting Started with OcrInput + IronTesseract
This all may seem daunting, but in the example below you will see the default settings which we would recommend you start with, which will work with almost any image you input to IronOCR.
using IronOcr;
IronTesseract ocr = new IronTesseract();
using OcrInput input = new OcrInput();
var pageindices = new int[] { 1, 2 };
input.LoadImageFrames(@"img\Potter.tiff", pageindices);
OcrResult result = ocr.Read(input);
Console.WriteLine(result.Text);
Imports IronOcr
Private ocr As New IronTesseract()
Using input As New OcrInput()
Dim pageindices = New Integer() { 1, 2 }
input.LoadImageFrames("img\Potter.tiff", pageindices)
Dim result As OcrResult = ocr.Read(input)
Console.WriteLine(result.Text)
End Using
We can use this even on a medium quality scan with 100% accuracy.
As you can see, reading the text (and optionally barcodes) from a scanned image such as a TIFF was rather easy. This OCR job yields an accuracy of 100%.
OCR is not a perfect science when it comes to real world documents, yet IronTesseract is about as good as it gets.
You will also note that IronOCR can automatically read multi-page documents, such as TIFFs and even extract text from PDF documents automatically.
Example: A Low Quality Scan
Now we will try a much lower quality scan of the same page, at a low DPI, which has lots of distortion and digital noise and damage to the original paper.
This is where IronOCR truly shines against other OCR libraries such as Tesseract, and it is a topic that alternative OCR projects tend to shy away from: OCR on real world scanned images, rather than unrealistically 'perfect' test cases created digitally to give 100% OCR accuracy.
using IronOcr;
IronTesseract ocr = new IronTesseract();
using OcrInput input = new OcrInput();
var pageindices = new int[] { 1, 2 };
input.LoadImageFrames(@"img\Potter.LowQuality.tiff", pageindices);
input.Deskew(); // removes rotation and perspective
OcrResult result = ocr.Read(input);
Console.WriteLine(result.Text);
Imports IronOcr
Private ocr As New IronTesseract()
Using input As New OcrInput()
Dim pageindices = New Integer() { 1, 2 }
input.LoadImageFrames("img\Potter.LowQuality.tiff", pageindices)
input.Deskew() ' removes rotation and perspective
Dim result As OcrResult = ocr.Read(input)
Console.WriteLine(result.Text)
End Using
Without adding input.Deskew() to straighten the image we get 52.5% accuracy. Not good enough.
Adding input.Deskew() brings us to 99.8% accuracy, which is almost as accurate as the OCR of a high quality scan.
Image Filters may take a little time to run - but also reduce OCR processing time. It is a fine balance for a developer to get to know their input documents.
If you are not certain:
- input.Deskew() is a safe and very successful filter to use.
- Secondly, try input.DeNoise() to fix considerable digital noise.
Performance Tuning
The most important factor in the speed of an OCR job is in fact the quality of the input image. The less background noise is present and the higher the DPI (the ideal target is about 200 DPI), the faster and more accurate the OCR results will be.
This is not, however, necessary, as IronOCR shines at correcting imperfect documents (though this is time-consuming and will cause your OCR jobs to use more CPU cycles).
If possible, choosing input image formats with less digital noise such as TIFF or PNG can also yield faster results than lossy image formats such as JPEG.
Image Filters
The following Image filters can really improve performance:
- OcrInput.Rotate(double degrees) - Rotates images by a number of degrees clockwise. For anti-clockwise, use negative numbers.
- OcrInput.Binarize() - This image filter turns every pixel black or white with no middle ground. May improve OCR performance in cases of very low contrast between text and background.
- OcrInput.ToGrayScale() - This image filter turns every pixel into a shade of grayscale. Unlikely to improve OCR accuracy but may improve speed.
- OcrInput.Contrast() - Increases contrast automatically. This filter often improves OCR speed and accuracy in low contrast scans.
- OcrInput.DeNoise() - Removes digital noise. This filter should only be used where noise is expected.
- OcrInput.Invert() - Inverts every color. E.g. white becomes black and black becomes white.
- OcrInput.Dilate() - Advanced Morphology. Dilation adds pixels to the boundaries of objects in an image. Opposite of Erode.
- OcrInput.Erode() - Advanced Morphology. Erosion removes pixels on object boundaries. Opposite of Dilate.
- OcrInput.Deskew() - Rotates an image so it is the right way up and orthogonal. This is very useful for OCR because Tesseract tolerance for skewed scans can be as low as 5 degrees.
- OcrInput.DeepCleanBackgroundNoise() - Heavy background noise removal. Only use this filter in case extreme document background noise is known, because this filter will also risk reducing OCR accuracy of clean documents, and is very CPU expensive.
- OcrInput.EnhanceResolution() - Enhances the resolution of low quality images. This filter is not often needed because OcrInput.MinimumDPI and OcrInput.TargetDPI will automatically catch and resolve low resolution inputs.
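As a rough illustration of how several of these filters might be combined, the sketch below applies Contrast, DeNoise and Deskew to one input before reading it. It only uses calls that appear elsewhere in this tutorial; the file name is a hypothetical placeholder, and the right choice of filters depends on your own documents.

```csharp
using System;
using IronOcr;

IronTesseract ocr = new IronTesseract();
using OcrInput input = new OcrInput();

// Hypothetical path: substitute one of your own low quality scans.
input.LoadImage("img/LowContrastScan.png");

input.Contrast(); // raise contrast on a faded scan
input.DeNoise();  // remove digital noise (only useful where noise is expected)
input.Deskew();   // straighten a skewed page

OcrResult result = ocr.Read(input);
Console.WriteLine(result.Text);
```

A reasonable way to work is to start with Deskew alone, then add one filter at a time and compare the output, since every extra filter costs CPU time.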
Performance Tuning for Speed
Using Iron Tesseract, we may wish to speed up OCR on higher quality scans.
If optimizing for speed we might start at this position and then turn features back on until the perfect balance is found.
using IronOcr;
IronTesseract ocr = new IronTesseract();
// Configure for speed
ocr.Configuration.BlackListCharacters = "~`$#^*_}{][|\\";
ocr.Configuration.PageSegmentationMode = TesseractPageSegmentationMode.Auto;
ocr.Language = OcrLanguage.EnglishFast;
using OcrInput input = new OcrInput();
var pageindices = new int[] { 1, 2 };
input.LoadImageFrames(@"img\Potter.tiff", pageindices);
OcrResult result = ocr.Read(input);
Console.WriteLine(result.Text);
Imports IronOcr
Private ocr As New IronTesseract()
' Configure for speed
ocr.Configuration.BlackListCharacters = "~`$#^*_}{][|\"
ocr.Configuration.PageSegmentationMode = TesseractPageSegmentationMode.Auto
ocr.Language = OcrLanguage.EnglishFast
Using input As New OcrInput()
Dim pageindices = New Integer() { 1, 2 }
input.LoadImageFrames("img\Potter.tiff", pageindices)
Dim result As OcrResult = ocr.Read(input)
Console.WriteLine(result.Text)
End Using
This result is 99.8% accurate compared to baseline 100% - but 35% faster.
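To run the same kind of comparison on your own documents, you need a way to score OCR output against a known-good transcript. The sketch below shows one common approach, character-level accuracy based on edit distance. It is an assumption for illustration, not necessarily how the percentages in this tutorial were measured, and CharacterAccuracy is a hypothetical helper rather than part of IronOCR.

```csharp
using System;

// Character-level accuracy: 1 - (edit distance / reference length), floored at 0.
static double CharacterAccuracy(string expected, string actual)
{
    if (expected.Length == 0)
        return actual.Length == 0 ? 1.0 : 0.0;

    // Levenshtein distance by dynamic programming, keeping only two rows.
    int[] previous = new int[actual.Length + 1];
    int[] current = new int[actual.Length + 1];
    for (int j = 0; j <= actual.Length; j++)
        previous[j] = j;

    for (int i = 1; i <= expected.Length; i++)
    {
        current[0] = i;
        for (int j = 1; j <= actual.Length; j++)
        {
            int substitution = previous[j - 1] + (expected[i - 1] == actual[j - 1] ? 0 : 1);
            int insertion = current[j - 1] + 1;
            int deletion = previous[j] + 1;
            current[j] = Math.Min(substitution, Math.Min(insertion, deletion));
        }
        // The row just computed becomes the previous row for the next pass.
        (previous, current) = (current, previous);
    }

    int distance = previous[actual.Length];
    return Math.Max(0.0, 1.0 - (double)distance / expected.Length);
}

string referenceText = "The quick brown fox jumps over the lazy dog";
string ocrText = "The quick brovvn fox jumps over the lazy dog"; // hypothetical OCR output
Console.WriteLine($"{CharacterAccuracy(referenceText, ocrText):P1}");
```

In practice the second argument would be result.Text from an OCR run, and the first a hand-checked transcript of the same page. Timing the Read call for each configuration gives the matching speed figure.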
Reading Cropped Regions of Images
As you can see from the following code sample, Iron's fork of Tesseract OCR is adept at reading specific areas of images.
We may use a System.Drawing.Rectangle to specify, in pixels, the exact area of an image to read.
This can be incredibly useful when we are dealing with a standardized form which is filled out, where only a certain area has text which changes from case to case.
Example: Scanning an Area of a Page
We can use a System.Drawing.Rectangle to specify a region in which we will read a document. The unit of measurement is always pixels.
We will see that this provides speed improvements as well as avoiding reading unnecessary text. In this example we will read a student's name from a central area of a standardized document.
using IronOcr;
using IronSoftware.Drawing;
IronTesseract ocr = new IronTesseract();
using OcrInput input = new OcrInput();
// a 41% improvement on speed
Rectangle contentArea = new Rectangle(x: 215, y: 1250, height: 280, width: 1335);
input.LoadImage("img/ComSci.png", contentArea);
OcrResult result = ocr.Read(input);
Console.WriteLine(result.Text);
Imports IronOcr
Imports IronSoftware.Drawing
Private ocr As New IronTesseract()
Using input As New OcrInput()
' a 41% improvement on speed
Dim contentArea As New Rectangle(x:= 215, y:= 1250, height:= 280, width:= 1335)
input.LoadImage("img/ComSci.png", contentArea)
Dim result As OcrResult = ocr.Read(input)
Console.WriteLine(result.Text)
End Using
This yields a 41% speed increase - and allows us to be specific. This is incredibly useful for .NET OCR scenarios where documents are similar and consistent such as Invoices, Receipts, Checks, Forms, Expense Claims etc.
ContentAreas (OCR cropping) is also supported when reading PDFs.
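Because crop regions are always given in pixels, the same form scanned at a different resolution needs different numbers. One possible way to handle this, sketched below as an assumption rather than an IronOCR feature, is to store the region as fractions of the page and convert to pixels per image. ToPixelRegion and the page dimensions are hypothetical.

```csharp
using System;

// Converts a region given as fractions of the page (0.0 to 1.0) into pixel values.
static (int X, int Y, int Width, int Height) ToPixelRegion(
    int imageWidth, int imageHeight,
    double left, double top, double width, double height)
{
    int x = (int)Math.Round(left * imageWidth);
    int y = (int)Math.Round(top * imageHeight);
    // Clamp so the region never extends past the image edges.
    int w = Math.Min((int)Math.Round(width * imageWidth), imageWidth - x);
    int h = Math.Min((int)Math.Round(height * imageHeight), imageHeight - y);
    return (x, y, w, h);
}

// Hypothetical form scanned at 1600 x 2000 pixels.
var region = ToPixelRegion(1600, 2000, left: 0.125, top: 0.5, width: 0.75, height: 0.125);
Console.WriteLine(region); // (200, 1000, 1200, 250)
```

The four values can then be passed to the Rectangle constructor shown in the example above.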
International Languages
IronOCR supports 125 international languages via language packs, which are distributed as DLLs and can be downloaded from this website or from the NuGet Package Manager for Visual Studio.
We can install them by browsing NuGet (search for "IronOcr.Languages") or from the OCR language packs page.
Supported languages include:
- Afrikaans
- Amharic Also known as አማርኛ
- Arabic Also known as العربية
- ArabicAlphabet Also known as العربية
- ArmenianAlphabet Also known as Հայերեն
- Assamese Also known as অসমীয়া
- Azerbaijani Also known as azərbaycan dili
- AzerbaijaniCyrillic Also known as azərbaycan dili
- Belarusian Also known as беларуская мова
- Bengali Also known as Bangla,বাংলা
- BengaliAlphabet Also known as Bangla,বাংলা
- Tibetan Also known as Tibetan Standard, Tibetan, Central ཡིག་
- Bosnian Also known as bosanski jezik
- Breton Also known as brezhoneg
- Bulgarian Also known as български език
- CanadianAboriginalAlphabet Also known as Canadian First Nations, Indigenous Canadians, Native Canadian, Inuit
- Catalan Also known as català, valencià
- Cebuano Also known as Bisaya, Binisaya
- Czech Also known as čeština, český jazyk
- CherokeeAlphabet Also known as ᏣᎳᎩ ᎦᏬᏂᎯᏍᏗ, Tsalagi Gawonihisdi
- ChineseSimplified Also known as 中文 (Zhōngwén), 汉语, 漢語
- ChineseSimplifiedVertical Also known as 中文 (Zhōngwén), 汉语, 漢語
- ChineseTraditional Also known as 中文 (Zhōngwén), 汉语, 漢語
- ChineseTraditionalVertical Also known as 中文 (Zhōngwén), 汉语, 漢語
- Cherokee Also known as ᏣᎳᎩ ᎦᏬᏂᎯᏍᏗ, Tsalagi Gawonihisdi
- Corsican Also known as corsu, lingua corsa
- Welsh Also known as Cymraeg
- CyrillicAlphabet Also known as Cyrillic scripts
- Danish Also known as dansk
- DanishFraktur Also known as dansk
- German Also known as Deutsch
- GermanFraktur Also known as Deutsch
- DevanagariAlphabet Also known as Nagari,देवनागरी
- Divehi Also known as ދިވެހި
- Dzongkha Also known as རྫོང་ཁ
- Greek Also known as ελληνικά
- English
- MiddleEnglish Also known as English (1100-1500 AD)
- Esperanto
- Estonian Also known as eesti, eesti keel
- EthiopicAlphabet Also known as Ge'ez,ግዕዝ, Gəʿəz
- Basque Also known as euskara, euskera
- Faroese Also known as føroyskt
- Persian Also known as فارسی
- Filipino Also known as National Language of the Philippines, Standardized Tagalog
- Finnish Also known as suomi, suomen kieli
- Financial Also known as Financial, Numerical and Technical Documents
- French Also known as français, langue française
- FrakturAlphabet Also known as Generic Fraktur, Calligraphic hand of the Latin alphabet
- Frankish Also known as Frenkisk, Old Franconian
- MiddleFrench Also known as Moyen Français,Middle French (ca. 1400-1600 AD)
- WesternFrisian Also known as Frysk
- GeorgianAlphabet Also known as ქართული
- ScottishGaelic Also known as Gàidhlig
- Irish Also known as Gaeilge
- Galician Also known as galego
- AncientGreek Also known as Ἑλληνική
- GreekAlphabet Also known as ελληνικά
- Gujarati Also known as ગુજરાતી
- GujaratiAlphabet Also known as ગુજરાતી
- GurmukhiAlphabet Also known as Gurmukhī, ਗੁਰਮੁਖੀ, Shahmukhi, گُرمُکھی, Sikh Script
- HangulAlphabet Also known as Korean Alphabet,한글,Hangeul,조선글,Chosŏn'gŭl
- HangulVerticalAlphabet Also known as Korean Alphabet,한글,Hangeul,조선글,Chosŏn'gŭl
- HanSimplifiedAlphabet Also known as Samhan ,한어, 韓語
- HanSimplifiedVerticalAlphabet Also known as Samhan ,한어, 韓語
- HanTraditionalAlphabet Also known as Samhan ,한어, 韓語
- HanTraditionalVerticalAlphabet Also known as Samhan ,한어, 韓語
- Haitian Also known as Kreyòl ayisyen
- Hebrew Also known as עברית
- HebrewAlphabet Also known as עברית
- Hindi Also known as हिन्दी, हिंदी
- Croatian Also known as hrvatski jezik
- Hungarian Also known as magyar
- Armenian Also known as Հայերեն
- Inuktitut Also known as ᐃᓄᒃᑎᑐᑦ
- Indonesian Also known as Bahasa Indonesia
- Icelandic Also known as Íslenska
- Italian Also known as italiano
- ItalianOld Also known as italiano
- JapaneseAlphabet Also known as 日本語 (にほんご)
- JapaneseVerticalAlphabet Also known as 日本語 (にほんご)
- Javanese Also known as basa Jawa
- Japanese Also known as 日本語 (にほんご)
- JapaneseVertical Also known as 日本語 (にほんご)
- Kannada Also known as ಕನ್ನಡ
- KannadaAlphabet Also known as ಕನ್ನಡ
- Georgian Also known as ქართული
- GeorgianOld Also known as ქართული
- Kazakh Also known as қазақ тілі
- Khmer Also known as ខ្មែរ, ខេមរភាសា, ភាសាខ្មែរ
- KhmerAlphabet Also known as ខ្មែរ, ខេមរភាសា, ភាសាខ្មែរ
- Kyrgyz Also known as Кыргызча, Кыргыз тили
- NorthernKurdish Also known as Kurmanji, کورمانجی ,Kurmancî
- Korean Also known as 한국어 (韓國語), 조선어 (朝鮮語)
- KoreanVertical Also known as 한국어 (韓國語), 조선어 (朝鮮語)
- Lao Also known as ພາສາລາວ
- LaoAlphabet Also known as ພາສາລາວ
- Latin Also known as latine, lingua latina
- LatinAlphabet Also known as latine, lingua latina
- Latvian Also known as latviešu valoda
- Lithuanian Also known as lietuvių kalba
- Luxembourgish Also known as Lëtzebuergesch
- Malayalam Also known as മലയാളം
- MalayalamAlphabet Also known as മലയാളം
- Marathi Also known as मराठी
- MICR Also known as Magnetic Ink Character Recognition, MICR Cheque Encoding
- Macedonian Also known as македонски јазик
- Maltese Also known as Malti
- Mongolian Also known as монгол
- Maori Also known as te reo Māori
- Malay Also known as bahasa Melayu, بهاس ملايو
- Myanmar Also known as Burmese ,ဗမာစာ
- MyanmarAlphabet Also known as Burmese ,ဗမာစာ
- Nepali Also known as नेपाली
- Dutch Also known as Nederlands, Vlaams
- Norwegian Also known as Norsk
- Occitan Also known as occitan, lenga d'òc
- Oriya Also known as ଓଡ଼ିଆ
- OriyaAlphabet Also known as ଓଡ଼ିଆ
- Panjabi Also known as ਪੰਜਾਬੀ, پنجابی
- Polish Also known as język polski, polszczyzna
- Portuguese Also known as português
- Pashto Also known as پښتو
- Quechua Also known as Runa Simi, Kichwa
- Romanian Also known as limba română
- Russian Also known as русский язык
- Sanskrit Also known as संस्कृतम्
- Sinhala Also known as සිංහල
- SinhalaAlphabet Also known as සිංහල
- Slovak Also known as slovenčina, slovenský jazyk
- SlovakFraktur Also known as slovenčina, slovenský jazyk
- Slovene Also known as slovenski jezik, slovenščina
- Sindhi Also known as सिन्धी, سنڌي، سندھی
- Spanish Also known as español, castellano
- SpanishOld Also known as español, castellano
- Albanian Also known as gjuha shqipe
- Serbian Also known as српски језик
- SerbianLatin Also known as српски језик
- Sundanese Also known as Basa Sunda
- Swahili Also known as Kiswahili
- Swedish Also known as Svenska
- Syriac Also known as Syrian, Syriac Aramaic,ܠܫܢܐ ܣܘܪܝܝܐ, Leššānā Suryāyā
- SyriacAlphabet Also known as Syrian, Syriac Aramaic,ܠܫܢܐ ܣܘܪܝܝܐ, Leššānā Suryāyā
- Tamil Also known as தமிழ்
- TamilAlphabet Also known as தமிழ்
- Tatar Also known as татар теле, tatar tele
- Telugu Also known as తెలుగు
- TeluguAlphabet Also known as తెలుగు
- Tajik Also known as тоҷикӣ, toğikī, تاجیکی
- Tagalog Also known as Wikang Tagalog, ᜏᜒᜃᜅ᜔ ᜆᜄᜎᜓᜄ᜔
- Thai Also known as ไทย
- ThaanaAlphabet Also known as Taana , Tāna , ތާނަ
- ThaiAlphabet Also known as ไทย
- TibetanAlphabet Also known as Tibetan Standard, Tibetan, Central ཡིག་
- Tigrinya Also known as ትግርኛ
- Tonga Also known as faka Tonga
- Turkish Also known as Türkçe
- Uyghur Also known as Uyƣurqə, ئۇيغۇرچە
- Ukrainian Also known as українська мова
- Urdu Also known as اردو
- Uzbek Also known as O‘zbek, Ўзбек, أۇزبېك
- UzbekCyrillic Also known as O‘zbek, Ўзбек, أۇزبېك
- Vietnamese Also known as Tiếng Việt
- VietnameseAlphabet Also known as Tiếng Việt
- Yiddish Also known as ייִדיש
- Yoruba Also known as Yorùbá
Example: OCR in Arabic (+ many more)
In the following example, we will show how we can scan an Arabic document.
PM> Install-Package IronOcr.Languages.Arabic
// PM> Install-Package IronOcr.Languages.Arabic
using IronOcr;
IronTesseract ocr = new IronTesseract();
ocr.Language = OcrLanguage.Arabic;
using OcrInput input = new OcrInput();
input.LoadImageFrame("img/arabic.gif", 1);
// add image filters if needed
// In this case, even though the input is very low quality,
// IronTesseract can read what conventional Tesseract cannot.
OcrResult result = ocr.Read(input);
// Console can't print Arabic on Windows easily.
// Let's save to disk instead.
result.SaveAsTextFile("arabic.txt");
' PM> Install-Package IronOcr.Languages.Arabic
Imports IronOcr
Private ocr As New IronTesseract()
ocr.Language = OcrLanguage.Arabic
Using input As New OcrInput()
input.LoadImageFrame("img/arabic.gif", 1)
' add image filters if needed
' In this case, even though the input is very low quality,
' IronTesseract can read what conventional Tesseract cannot.
Dim result As OcrResult = ocr.Read(input)
' Console can't print Arabic on Windows easily.
' Let's save to disk instead.
result.SaveAsTextFile("arabic.txt")
End Using
Example: OCR in more than one language in the same document.
In the following example, we will show how to OCR scan multiple languages in the same document.
This is actually very common, where for example a Chinese document might contain English words and URLs.
PM> Install-Package IronOcr.Languages.ChineseSimplified
using IronOcr;
IronTesseract ocr = new IronTesseract();
ocr.Language = OcrLanguage.ChineseSimplified;
// We can add any number of languages.
ocr.AddSecondaryLanguage(OcrLanguage.English);
// Optionally add custom tesseract .traineddata files by specifying a file path
using OcrInput input = new OcrInput();
input.LoadImage("img/MultiLanguage.jpeg");
OcrResult result = ocr.Read(input);
result.SaveAsTextFile("MultiLanguage.txt");
Imports IronOcr
Private ocr As New IronTesseract()
ocr.Language = OcrLanguage.ChineseSimplified
' We can add any number of languages.
ocr.AddSecondaryLanguage(OcrLanguage.English)
' Optionally add custom tesseract .traineddata files by specifying a file path
Using input As New OcrInput()
input.LoadImage("img/MultiLanguage.jpeg")
Dim result As OcrResult = ocr.Read(input)
result.SaveAsTextFile("MultiLanguage.txt")
End Using
Multi Page Documents
IronOcr can combine multiple pages / images into a single OcrResult. This is extremely useful where a document has been made from multiple images. We will see later that this special feature of IronTesseract is extremely useful to produce searchable PDFs and HTML files from OCR inputs.
IronOcr makes it possible to "mix and match" images, TIFF frames and PDF pages into a single OCR input.
using IronOcr;
IronTesseract ocr = new IronTesseract();
using OcrInput input = new OcrInput();
input.LoadImage("image1.jpeg");
input.LoadImage("image2.png");
var pageindices = new int[] { 1, 2 };
input.LoadImageFrames("image3.gif", pageindices);
OcrResult result = ocr.Read(input);
Console.WriteLine($"{result.Pages.Length} Pages"); // one page per loaded image or frame (4 here)
Imports IronOcr
Private ocr As New IronTesseract()
Using input As New OcrInput()
input.LoadImage("image1.jpeg")
input.LoadImage("image2.png")
Dim pageindices = New Integer() { 1, 2 }
input.LoadImageFrames("image3.gif", pageindices)
Dim result As OcrResult = ocr.Read(input)
Console.WriteLine($"{result.Pages.Length} Pages") ' one page per loaded image or frame (4 here)
End Using
We can also easily OCR every page of a TIFF.
using IronOcr;
IronTesseract ocr = new IronTesseract();
using OcrInput input = new OcrInput();
var pageindices = new int[] { 1, 2 };
input.LoadImageFrames("MultiFrame.Tiff", pageindices);
OcrResult result = ocr.Read(input);
Console.WriteLine(result.Text);
Console.WriteLine($"{result.Pages.Length} Pages");
// 1 page for every frame (page) in the TIFF
Imports IronOcr
Private ocr As New IronTesseract()
Using input As New OcrInput()
Dim pageindices = New Integer() { 1, 2 }
input.LoadImageFrames("MultiFrame.Tiff", pageindices)
Dim result As OcrResult = ocr.Read(input)
Console.WriteLine(result.Text)
Console.WriteLine($"{result.Pages.Length} Pages")
' 1 page for every frame (page) in the TIFF
End Using
The same applies to the pages of a PDF document.
using IronOcr;
IronTesseract ocr = new IronTesseract();
using OcrInput input = new OcrInput();
input.LoadPdf("example.pdf", Password: "password");
// We can also select specific PDF page numbers to OCR
OcrResult result = ocr.Read(input);
Console.WriteLine(result.Text);
Console.WriteLine($"{result.Pages.Length} Pages");
// 1 page for every page of the PDF
Imports IronOcr
Private ocr As New IronTesseract()
Using input As New OcrInput()
input.LoadPdf("example.pdf", Password:= "password")
' We can also select specific PDF page numbers to OCR
Dim result As OcrResult = ocr.Read(input)
Console.WriteLine(result.Text)
Console.WriteLine($"{result.Pages.Length} Pages")
' 1 page for every page of the PDF
End Using
Searchable PDFs
Exporting OCR results as searchable PDFs in C# and VB.NET is a popular feature of IronOCR. This can really help with database population, SEO and PDF usability for businesses and governments.
using IronOcr;
IronTesseract ocr = new IronTesseract();
using OcrInput input = new OcrInput();
input.Title = "Quarterly Report";
input.LoadImage("image1.jpeg");
input.LoadImage("image2.png");
var pageindices = new int[] { 1, 2 };
input.LoadImageFrames("image3.gif", pageindices);
OcrResult result = ocr.Read(input);
result.SaveAsSearchablePdf("searchable.pdf");
Imports IronOcr
Private ocr As New IronTesseract()
Using input As New OcrInput()
input.Title = "Quarterly Report"
input.LoadImage("image1.jpeg")
input.LoadImage("image2.png")
Dim pageindices = New Integer() { 1, 2 }
input.LoadImageFrames("image3.gif", pageindices)
Dim result As OcrResult = ocr.Read(input)
result.SaveAsSearchablePdf("searchable.pdf")
End Using
Another OCR trick is to make an existing PDF document searchable.
using IronOcr;
IronTesseract ocr = new IronTesseract();
using OcrInput input = new OcrInput();
input.Title = "Pdf Metadata Name";
input.LoadPdf("example.pdf", Password: "password");
OcrResult result = ocr.Read(input);
result.SaveAsSearchablePdf("searchable.pdf");
Imports IronOcr
Private ocr As New IronTesseract()
Using input As New OcrInput()
input.Title = "Pdf Metadata Name"
input.LoadPdf("example.pdf", Password:= "password")
Dim result As OcrResult = ocr.Read(input)
result.SaveAsSearchablePdf("searchable.pdf")
End Using
The same applies to converting TIFF documents with 1 or more pages to searchable PDFs using IronTesseract.
using IronOcr;
IronTesseract ocr = new IronTesseract();
using OcrInput input = new OcrInput();
input.Title = "Pdf Title";
var pageindices = new int[] { 1, 2 };
input.LoadImageFrames("example.tiff", pageindices);
OcrResult result = ocr.Read(input);
result.SaveAsSearchablePdf("searchable.pdf");
Imports IronOcr
Private ocr As New IronTesseract()
Using input As New OcrInput()
input.Title = "Pdf Title"
Dim pageindices = New Integer() { 1, 2 }
input.LoadImageFrames("example.tiff", pageindices)
Dim result As OcrResult = ocr.Read(input)
result.SaveAsSearchablePdf("searchable.pdf")
End Using
Exporting hOCR HTML
We can similarly export OCR result documents to hOCR HTML. This is an XML document which can be parsed by an XML reader, or marked up into visually appealing HTML.
This allows some degree of PDF to HTML and TIFF to HTML conversion.
using IronOcr;
IronTesseract ocr = new IronTesseract();
using OcrInput input = new OcrInput();
input.Title = "Html Title";
// Add more content as required...
input.LoadImage("image2.jpeg");
input.LoadPdf("example.pdf",Password: "password");
var pageindices = new int[] { 1, 2 };
input.LoadImageFrames("example.tiff", pageindices);
OcrResult result = ocr.Read(input);
result.SaveAsHocrFile("hocr.html");
Imports IronOcr
Private ocr As New IronTesseract()
Using input As New OcrInput()
input.Title = "Html Title"
' Add more content as required...
input.LoadImage("image2.jpeg")
input.LoadPdf("example.pdf", Password:= "password")
Dim pageindices = New Integer() { 1, 2 }
input.LoadImageFrames("example.tiff", pageindices)
Dim result As OcrResult = ocr.Read(input)
result.SaveAsHocrFile("hocr.html")
End Using
Reading Barcodes in OCR Documents
IronOCR has a unique additional advantage over traditional Tesseract in that it also reads barcodes and QR codes:
using IronOcr;
IronTesseract ocr = new IronTesseract();
ocr.Configuration.ReadBarCodes = true;
using OcrInput input = new OcrInput();
input.LoadImage("img/Barcode.png");
OcrResult result = ocr.Read(input);
foreach (var barcode in result.Barcodes)
{
Console.WriteLine(barcode.Value);
// type and location properties also exposed
}
Imports IronOcr
Private ocr As New IronTesseract()
ocr.Configuration.ReadBarCodes = True
Using input As New OcrInput()
input.LoadImage("img/Barcode.png")
Dim result As OcrResult = ocr.Read(input)
For Each barcode In result.Barcodes
Console.WriteLine(barcode.Value)
' type and location properties also exposed
Next barcode
End Using
A Detailed Look at Image to Text OCR Results
The last thing we will look at in this tutorial is the OCR results object. When we read OCR, we normally only want the text out, but IronOCR actually contains a huge amount of information which may be of use to advanced developers.
Within an OCR results object, we have a collection of pages which can be iterated. Within each page, we may find barcodes, paragraphs, lines of text, words, and characters.
Each of these objects in fact contains: a location; an X coordinate; a Y coordinate; a width and a height; an image associated with it which can be inspected; a font name; the font size; the direction in which the text is written; the rotation of the text; and the statistical confidence that IronOCR has for that specific word, line, or paragraph.
In short, this allows developers to be creative and work with OCR data in any way they choose to inspect and export information.
We can also work with and export any element from the .NET OCR Results object such as a paragraph, word or barcode as an Image or Bitmap.
using IronOcr;
using IronSoftware.Drawing;
// We can delve deep into OCR results as an object model of Pages, Barcodes, Paragraphs, Lines, Words and Characters
// This allows us to explore, export and draw OCR content using other APIs
IronTesseract ocr = new IronTesseract();
ocr.Configuration.ReadBarCodes = true;
using OcrInput input = new OcrInput();
var pageindices = new int[] { 1, 2 };
input.LoadImageFrames(@"img\Potter.tiff", pageindices);
OcrResult result = ocr.Read(input);
foreach (var page in result.Pages)
{
// Page object
int pageNumber = page.PageNumber;
string pageText = page.Text;
int pageWordCount = page.WordCount;
// null if we don't set ocr.Configuration.ReadBarCodes = true;
OcrResult.Barcode[] barcodes = page.Barcodes;
AnyBitmap pageImage = page.ToBitmap(input);
System.Drawing.Bitmap pageImageLegacy = page.ToBitmap(input);
double pageWidth = page.Width;
double pageHeight = page.Height;
foreach (var paragraph in page.Paragraphs)
{
// Pages -> Paragraphs
int paragraphNumber = paragraph.ParagraphNumber;
String paragraphText = paragraph.Text;
System.Drawing.Bitmap paragraphImage = paragraph.ToBitmap(input);
int paragraphXLocation = paragraph.X;
int paragraphYLocation = paragraph.Y;
int paragraphWidth = paragraph.Width;
int paragraphHeight = paragraph.Height;
double paragraphOcrAccuracy = paragraph.Confidence;
var paragraphTextDirection = paragraph.TextDirection;
foreach (var line in paragraph.Lines)
{
// Pages -> Paragraphs -> Lines
int lineNumber = line.LineNumber;
String lineText = line.Text;
AnyBitmap lineImage = line.ToBitmap(input);
System.Drawing.Bitmap lineImageLegacy = line.ToBitmap(input);
int lineXLocation = line.X;
int lineYLocation = line.Y;
int lineWidth = line.Width;
int lineHeight = line.Height;
double lineOcrAccuracy = line.Confidence;
double lineSkew = line.BaselineAngle;
double lineOffset = line.BaselineOffset;
foreach (var word in line.Words)
{
// Pages -> Paragraphs -> Lines -> Words
int wordNumber = word.WordNumber;
String wordText = word.Text;
AnyBitmap wordImage = word.ToBitmap(input);
System.Drawing.Image wordImageLegacy = word.ToBitmap(input);
int wordXLocation = word.X;
int wordYLocation = word.Y;
int wordWidth = word.Width;
int wordHeight = word.Height;
double wordOcrAccuracy = word.Confidence;
if (word.Font != null)
{
// Word.Font is only set when using Tesseract Engine Modes rather than LTSM
String fontName = word.Font.FontName;
double fontSize = word.Font.FontSize;
bool isBold = word.Font.IsBold;
bool isFixedWidth = word.Font.IsFixedWidth;
bool isItalic = word.Font.IsItalic;
bool isSerif = word.Font.IsSerif;
bool isUnderlined = word.Font.IsUnderlined;
bool fontIsCaligraphic = word.Font.IsCaligraphic;
}
foreach (var character in word.Characters)
{
// Pages -> Paragraphs -> Lines -> Words -> Characters
int characterNumber = character.CharacterNumber;
String characterText = character.Text;
AnyBitmap characterImage = character.ToBitmap(input);
System.Drawing.Bitmap characterImageLegacy = character.ToBitmap(input);
int characterXLocation = character.X;
int characterYLocation = character.Y;
int characterWidth = character.Width;
int characterHeight = character.Height;
double characterOcrAccuracy = character.Confidence;
// Output alternative symbols choices and their probability.
// Very useful for spell checking
OcrResult.Choice[] characterChoices = character.Choices;
}
}
}
}
}
Imports IronOcr
Imports IronSoftware.Drawing
' We can delve deep into OCR results as an object model of Pages, Barcodes, Paragraphs, Lines, Words and Characters
' This allows us to explore, export and draw OCR content using other APIs
Dim ocr As New IronTesseract()
ocr.Configuration.ReadBarCodes = True
Using input As New OcrInput()
    Dim pageindices = New Integer() { 1, 2 }
    input.LoadImageFrames("img\Potter.tiff", pageindices)
    Dim result As OcrResult = ocr.Read(input)
    For Each page In result.Pages
        ' Page object
        Dim pageNumber As Integer = page.PageNumber
        Dim pageText As String = page.Text
        Dim pageWordCount As Integer = page.WordCount
        ' Barcodes is Nothing unless we set ocr.Configuration.ReadBarCodes = True
        Dim barcodes() As OcrResult.Barcode = page.Barcodes
        Dim pageImage As AnyBitmap = page.ToBitmap(input)
        Dim pageImageLegacy As System.Drawing.Bitmap = page.ToBitmap(input)
        Dim pageWidth As Double = page.Width
        Dim pageHeight As Double = page.Height
        For Each paragraph In page.Paragraphs
            ' Pages -> Paragraphs
            Dim paragraphNumber As Integer = paragraph.ParagraphNumber
            Dim paragraphText As String = paragraph.Text
            Dim paragraphImage As System.Drawing.Bitmap = paragraph.ToBitmap(input)
            Dim paragraphXLocation As Integer = paragraph.X
            Dim paragraphYLocation As Integer = paragraph.Y
            Dim paragraphWidth As Integer = paragraph.Width
            Dim paragraphHeight As Integer = paragraph.Height
            Dim paragraphOcrAccuracy As Double = paragraph.Confidence
            Dim paragraphTextDirection = paragraph.TextDirection
            For Each line In paragraph.Lines
                ' Pages -> Paragraphs -> Lines
                Dim lineNumber As Integer = line.LineNumber
                Dim lineText As String = line.Text
                Dim lineImage As AnyBitmap = line.ToBitmap(input)
                Dim lineImageLegacy As System.Drawing.Bitmap = line.ToBitmap(input)
                Dim lineXLocation As Integer = line.X
                Dim lineYLocation As Integer = line.Y
                Dim lineWidth As Integer = line.Width
                Dim lineHeight As Integer = line.Height
                Dim lineOcrAccuracy As Double = line.Confidence
                Dim lineSkew As Double = line.BaselineAngle
                Dim lineOffset As Double = line.BaselineOffset
                For Each word In line.Words
                    ' Pages -> Paragraphs -> Lines -> Words
                    Dim wordNumber As Integer = word.WordNumber
                    Dim wordText As String = word.Text
                    Dim wordImage As AnyBitmap = word.ToBitmap(input)
                    Dim wordImageLegacy As System.Drawing.Image = word.ToBitmap(input)
                    Dim wordXLocation As Integer = word.X
                    Dim wordYLocation As Integer = word.Y
                    Dim wordWidth As Integer = word.Width
                    Dim wordHeight As Integer = word.Height
                    Dim wordOcrAccuracy As Double = word.Confidence
                    If word.Font IsNot Nothing Then
                        ' Word.Font is only set when using Tesseract Engine Modes rather than LSTM
                        Dim fontName As String = word.Font.FontName
                        Dim fontSize As Double = word.Font.FontSize
                        Dim isBold As Boolean = word.Font.IsBold
                        Dim isFixedWidth As Boolean = word.Font.IsFixedWidth
                        Dim isItalic As Boolean = word.Font.IsItalic
                        Dim isSerif As Boolean = word.Font.IsSerif
                        Dim isUnderlined As Boolean = word.Font.IsUnderlined
                        Dim fontIsCaligraphic As Boolean = word.Font.IsCaligraphic
                    End If
                    For Each character In word.Characters
                        ' Pages -> Paragraphs -> Lines -> Words -> Characters
                        Dim characterNumber As Integer = character.CharacterNumber
                        Dim characterText As String = character.Text
                        Dim characterImage As AnyBitmap = character.ToBitmap(input)
                        Dim characterImageLegacy As System.Drawing.Bitmap = character.ToBitmap(input)
                        Dim characterXLocation As Integer = character.X
                        Dim characterYLocation As Integer = character.Y
                        Dim characterWidth As Integer = character.Width
                        Dim characterHeight As Integer = character.Height
                        Dim characterOcrAccuracy As Double = character.Confidence
                        ' Output alternative symbol choices and their probability.
                        ' Very useful for spell checking
                        Dim characterChoices() As OcrResult.Choice = character.Choices
                    Next character
                Next word
            Next line
        Next paragraph
    Next page
End Using
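One practical use of the per-word confidence values is to flag words that may need a manual check. The following is a minimal sketch rather than part of the IronOCR API: the `ConfidenceReview` class, the `BelowThreshold` helper, the sample values, and the threshold of 80 are all hypothetical, and the threshold assumes confidence is reported on a 0 to 100 scale, which should be checked against the API reference.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper: not part of IronOCR.
public static class ConfidenceReview
{
    // Returns the words whose confidence is below the threshold, in their original order.
    public static List<(string Text, double Confidence)> BelowThreshold(
        IEnumerable<(string Text, double Confidence)> words, double threshold)
    {
        return words.Where(w => w.Confidence < threshold).ToList();
    }
}

public static class Program
{
    public static void Main()
    {
        // Made-up sample values for illustration only. With a real OcrResult, the
        // sequence could be built from the members used in the example above,
        // assuming the collections implement IEnumerable<T>:
        //   result.Pages.SelectMany(p => p.Paragraphs)
        //               .SelectMany(p => p.Lines)
        //               .SelectMany(l => l.Words)
        //               .Select(w => (w.Text, w.Confidence));
        var words = new List<(string Text, double Confidence)>
        {
            ("Harry", 98.2),
            ("Pottcr", 61.5),
            ("and", 99.0),
        };

        // The threshold is an assumption; tune it to the documents being scanned.
        foreach (var word in ConfidenceReview.BelowThreshold(words, 80.0))
        {
            Console.WriteLine($"Review: {word.Text} ({word.Confidence})");
        }
    }
}
```

Keeping the filter independent of the IronOCR types means the same helper can be applied to lines or characters by changing only the projection.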
Summary
IronOCR provides C# developers the most advanced Tesseract API we know of on any platform.
IronOCR can be deployed on Windows, Linux, Mac, Azure, AWS, Lambda and supports .NET Framework projects as well as .NET Standard and .NET Core.
We have seen that even when we give IronOCR an imperfect document that is badly formatted, skewed, and affected by digital noise, it can read its content with a statistical accuracy of about 99%.
We can also read barcodes in OCR scans, and even export our OCR as HTML and searchable PDFs.
This is unique to IronOCR and is a feature you will not find in standard OCR libraries or vanilla Tesseract.
Moving Forward
To continue to learn more about IronOCR, we recommend that you:
- Get started with our C# Tesseract OCR Quickstart guide.
- Explore the C# & VB code examples.
- Read the in-depth MSDN-style API Reference.
Source Code Download
You may also enjoy the other .NET OCR tutorials in this section.