Class IronTesseract
IronTesseract is a comprehensive managed class for performing Tesseract OCR in .NET applications.
IronTesseract natively supports the Tesseract 3, 4 and 5 engines, and automatically installs all required binaries and language pack (tessdata) files.
Namespace: IronOcr
Assembly: IronOcr.dll
Syntax
public class IronTesseract
Constructors
IronTesseract()
Public constructor. Creates a default instance of IronTesseract.
Declaration
public IronTesseract()
IronTesseract(TesseractConfiguration)
Public constructor. Creates an instance of IronTesseract with a customized TesseractConfiguration.
This allows advanced developers to fine-tune Tesseract behavior.
Declaration
public IronTesseract(TesseractConfiguration Configuration)
Parameters
Type | Name | Description |
---|---|---|
TesseractConfiguration | Configuration |
Fields
Configuration
An instance of TesseractConfiguration which allows fine-grained control of the underlying Tesseract OCR Engine.
Options include the Tesseract engine version (4 or 5), the language file detail level, the Page Segmentation Mode, and access to the entire API of Tesseract settings variables.
Declaration
public TesseractConfiguration Configuration
Field Value
Type | Description |
---|---|
TesseractConfiguration |
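Examples
A minimal sketch of adjusting the engine through the Configuration field. The PageSegmentationMode member and the TesseractPageSegmentationMode type are assumptions here: they belong to the TesseractConfiguration API, which is not documented in this section, so check them against your installed IronOcr version.

```csharp
using IronOcr;

public static class ConfigurationExample
{
    public static void Main()
    {
        var ocr = new IronTesseract();

        // Assumption: PageSegmentationMode / TesseractPageSegmentationMode are
        // TesseractConfiguration members; they are not described in this section.
        ocr.Configuration.PageSegmentationMode = TesseractPageSegmentationMode.Auto;
    }
}
```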
Properties
AzureFunctionWebRootPath
Sets the Azure Function web root path, typically taken from the ExecutionContext class. For example:
ExecutionContext executionContext; // supplied by the Azure Functions runtime
var engine = new IronTesseract();
engine.AzureFunctionWebRootPath = executionContext.FunctionAppDirectory;
Declaration
public string AzureFunctionWebRootPath { get; set; }
Property Value
Type | Description |
---|---|
System.String |
Language
The natural language of the documents which IronTesseract will read.
Default is English. Additional languages can be installed easily using NuGet (https://www.nuget.org/packages?q=IronOcr.Languages) or downloaded from https://ironsoftware.com/csharp/ocr/languages/
Multiple language packs can be used simultaneously with the AddSecondaryLanguage method.
Custom Tesseract .traineddata language packs can be used with the UseCustomTesseractLanguageFile(String) method.
Declaration
public OcrLanguage Language { get; set; }
Property Value
Type | Description |
---|---|
OcrLanguage |
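Examples
A minimal sketch of selecting a non-default primary language. It assumes that the corresponding IronOcr.Languages NuGet package (French here) is installed and that OcrLanguage.French is the matching member. The file path is hypothetical, and the Text property belongs to OcrResult, which is documented separately.

```csharp
using System;
using IronOcr;

public static class LanguageExample
{
    public static void Main()
    {
        var ocr = new IronTesseract();

        // Assumption: the French language pack is installed from NuGet.
        ocr.Language = OcrLanguage.French;

        // "facture.png" is a hypothetical file path used for illustration.
        OcrResult result = ocr.Read("facture.png");
        Console.WriteLine(result.Text);
    }
}
```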
MultiThreaded
Allows IronTesseract to read documents using multiple threads on multiple CPU cores. This can be a significant performance improvement over traditional Tesseract.
Declaration
public bool MultiThreaded { get; set; }
Property Value
Type | Description |
---|---|
System.Boolean |
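Examples
A short sketch of toggling multithreading explicitly. The file path is hypothetical, and whether threading helps will depend on the document and the host machine.

```csharp
using System;
using IronOcr;

public static class MultiThreadedExample
{
    public static void Main()
    {
        var ocr = new IronTesseract();

        // Let IronTesseract spread work across CPU cores.
        // Set to false to force single-threaded reading instead.
        ocr.MultiThreaded = true;

        // "scans.tiff" is a hypothetical file path used for illustration.
        OcrResult result = ocr.Read("scans.tiff");
        Console.WriteLine(result.Text);
    }
}
```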
Methods
AddSecondaryLanguage(OcrLanguage)
Adds a secondary language so that IronTesseract uses multiple Tesseract language files simultaneously (multilingual OCR).
Any number of secondary languages may be added. Speed and performance may be affected.
Declaration
public void AddSecondaryLanguage(OcrLanguage SecondaryLanguage)
Parameters
Type | Name | Description |
---|---|---|
OcrLanguage | SecondaryLanguage | An additional OcrLanguage |
AddSecondaryLanguage(String)
Adds a secondary language from a custom Tesseract 3, 4 or 5 .traineddata language file, so that IronTesseract uses multiple Tesseract language files simultaneously (multilingual OCR).
Any number of secondary languages may be added. Speed and performance may be affected.
Declaration
public void AddSecondaryLanguage(string CustomLanguagePath)
Parameters
Type | Name | Description |
---|---|---|
System.String | CustomLanguagePath | File path to a .traineddata tesseract language pack. |
ClearSecondaryLanguages()
Removes all languages added by AddSecondaryLanguage(OcrLanguage) or AddSecondaryLanguage(String).
Declaration
public void ClearSecondaryLanguages()
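Examples
A sketch of multilingual OCR with the three methods above. It assumes the French language pack is installed; the .traineddata path and the image path are hypothetical.

```csharp
using System;
using IronOcr;

public static class SecondaryLanguagesExample
{
    public static void Main()
    {
        var ocr = new IronTesseract();
        ocr.Language = OcrLanguage.English;

        // Assumption: the French language pack is installed from NuGet.
        ocr.AddSecondaryLanguage(OcrLanguage.French);

        // Hypothetical path to a custom Tesseract language file.
        ocr.AddSecondaryLanguage("tessdata/custom.traineddata");

        // "bilingual.png" is a hypothetical file path used for illustration.
        OcrResult result = ocr.Read("bilingual.png");
        Console.WriteLine(result.Text);

        // Return to single-language OCR for later reads.
        ocr.ClearSecondaryLanguages();
    }
}
```

Because each extra language may slow down reading, clearing secondary languages once they are no longer needed keeps later reads on the primary language only.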
Read(OcrInput)
Reads text from an OcrInput object and returns an OcrResult object.
OcrInput is the preferred input type because it allows OCR of multi-page documents, and allows images to be enhanced before OCR to obtain faster, more accurate results.
Other overloads of this method allow images to be read directly from file paths and bitmaps.
Declaration
public OcrResult Read(OcrInput Input)
Parameters
Type | Name | Description |
---|---|---|
OcrInput | Input | An OcrInput document which can contain one or more images and PDFs |
Returns
Type | Description |
---|---|
OcrResult | An OcrResult object containing the text and detailed, structured information about the extracted text content. |
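Examples
A minimal sketch of reading through an OcrInput. The OcrInput members used here (a constructor taking a file path, disposal with using, and the Deskew filter) are assumptions: they are part of the OcrInput class, which is not documented in this section and has changed between IronOcr versions.

```csharp
using System;
using IronOcr;

public static class ReadOcrInputExample
{
    public static void Main()
    {
        var ocr = new IronTesseract();

        // Assumption: OcrInput accepts a file path and is disposable.
        // "scan.png" is a hypothetical file path used for illustration.
        using (var input = new OcrInput("scan.png"))
        {
            // Assumption: Deskew is one of the OcrInput enhancement filters.
            input.Deskew();

            OcrResult result = ocr.Read(input);
            Console.WriteLine(result.Text);
        }
    }
}
```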
Read(Image)
Reads text from a System.Drawing.Image object and returns an OcrResult object.
Declaration
public OcrResult Read(Image Image)
Parameters
Type | Name | Description |
---|---|---|
System.Drawing.Image | Image | A System.Drawing.Image or Bitmap |
Returns
Type | Description |
---|---|
OcrResult | An OcrResult object containing the text and detailed, structured information about the extracted text content. |
Read(Image, Rectangle)
Reads text from a region of a System.Drawing.Image object and returns an OcrResult object.
Declaration
public OcrResult Read(Image Image, Rectangle ContentArea)
Parameters
Type | Name | Description |
---|---|---|
System.Drawing.Image | Image | A System.Drawing.Image or Bitmap |
System.Drawing.Rectangle | ContentArea | Specifies a region within the image to extract text from, as a System.Drawing.Rectangle with X, Y, Width and Height in pixels. Setting a ContentArea can improve OCR speed. |
Returns
Type | Description |
---|---|
OcrResult | An OcrResult object containing the text and detailed, structured information about the extracted text content. |
Read(String)
Reads text from an image file and returns an OcrResult object.
Declaration
public OcrResult Read(string ImagePath)
Parameters
Type | Name | Description |
---|---|---|
System.String | ImagePath | Path to an image file. |
Returns
Type | Description |
---|---|
OcrResult | An OcrResult object containing the text and detailed, structured information about the extracted text content. |
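Examples
The simplest use of the class is probably a default instance reading a single image by path. The file path is hypothetical, and the Text property belongs to OcrResult, which is documented separately.

```csharp
using System;
using IronOcr;

public static class ReadFileExample
{
    public static void Main()
    {
        // Default instance: English language, default configuration.
        var ocr = new IronTesseract();

        // "receipt.png" is a hypothetical file path used for illustration.
        OcrResult result = ocr.Read("receipt.png");
        Console.WriteLine(result.Text);
    }
}
```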
Read(String, Rectangle)
Reads text from a region of an image file and returns an OcrResult object.
Declaration
public OcrResult Read(string ImagePath, Rectangle ContentArea)
Parameters
Type | Name | Description |
---|---|---|
System.String | ImagePath | Path to an image file. |
System.Drawing.Rectangle | ContentArea | Specifies a region within the image to extract text from, as a System.Drawing.Rectangle with X, Y, Width and Height in pixels. Setting a ContentArea can improve OCR speed. |
Returns
Type | Description |
---|---|
OcrResult | An OcrResult object containing the text and detailed, structured information about the extracted text content. |
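Examples
A sketch of restricting OCR to one region of an image, for instance a header strip. The file path and the pixel coordinates are hypothetical.

```csharp
using System;
using System.Drawing;
using IronOcr;

public static class ReadRegionExample
{
    public static void Main()
    {
        var ocr = new IronTesseract();

        // X = 0, Y = 0, Width = 800, Height = 120, all in pixels.
        // These values are illustrative only.
        var contentArea = new Rectangle(0, 0, 800, 120);

        // "invoice.png" is a hypothetical file path used for illustration.
        OcrResult result = ocr.Read("invoice.png", contentArea);
        Console.WriteLine(result.Text);
    }
}
```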
ReadAsync(OcrInput)
Asynchronously reads text from an OcrInput object and returns a task whose result is an OcrResult object.
OcrInput is the preferred input type because it allows OCR of multi-page documents, and allows images to be enhanced before OCR to obtain faster, more accurate results.
Other overloads of this method allow images to be read directly from file paths and bitmaps.
Declaration
public Task<OcrResult> ReadAsync(OcrInput Input)
Parameters
Type | Name | Description |
---|---|---|
OcrInput | Input | An OcrInput document which can contain one or more images and PDFs |
Returns
Type | Description |
---|---|
System.Threading.Tasks.Task<OcrResult> | A task that represents the asynchronous read operation. The task's Result property contains an OcrResult object containing the text and detailed, structured information about the extracted text content. |
ReadAsync(Image, Nullable<Rectangle>)
Asynchronously reads text from a System.Drawing.Image object, optionally restricted to a region, and returns a task whose result is an OcrResult object.
Declaration
public Task<OcrResult> ReadAsync(Image Image, Rectangle? ContentArea = null)
Parameters
Type | Name | Description |
---|---|---|
System.Drawing.Image | Image | A System.Drawing.Image object |
System.Nullable<System.Drawing.Rectangle> | ContentArea | Optional. Specifies a region within the image to extract text from, as a System.Drawing.Rectangle with X, Y, Width and Height in pixels. Setting a ContentArea can improve OCR speed. |
Returns
Type | Description |
---|---|
System.Threading.Tasks.Task<OcrResult> | A task that represents the asynchronous read operation. The task's Result property contains an OcrResult object containing the text and detailed, structured information about the extracted text content. |
ReadAsync(String, Nullable<Rectangle>)
Asynchronously reads text from an image file, optionally restricted to a region, and returns a task whose result is an OcrResult object.
Declaration
public Task<OcrResult> ReadAsync(string ImagePath, Rectangle? ContentArea = null)
Parameters
Type | Name | Description |
---|---|---|
System.String | ImagePath | Path to an image file. |
System.Nullable<System.Drawing.Rectangle> | ContentArea | Optional. Specifies a region within the image to extract text from, as a System.Drawing.Rectangle with X, Y, Width and Height in pixels. Setting a ContentArea can improve OCR speed. |
Returns
Type | Description |
---|---|
System.Threading.Tasks.Task<OcrResult> | A task that represents the asynchronous read operation. The task's Result property contains an OcrResult object containing the text and detailed, structured information about the extracted text content. |
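Examples
A minimal sketch of awaiting an asynchronous read so the calling thread is not blocked. The file path is hypothetical. ContentArea is omitted, so it takes its default of null and the whole image is read.

```csharp
using System;
using System.Threading.Tasks;
using IronOcr;

public static class ReadAsyncExample
{
    public static async Task Main()
    {
        var ocr = new IronTesseract();

        // "scan.png" is a hypothetical file path used for illustration.
        OcrResult result = await ocr.ReadAsync("scan.png");
        Console.WriteLine(result.Text);
    }
}
```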
UseCustomTesseractLanguageFile(String)
IronTesseract will use a Tesseract .traineddata language file as its only OCR language.
Language files are available from https://github.com/tesseract-ocr/tessdata
Declaration
public void UseCustomTesseractLanguageFile(string TrainedDataPath)
Parameters
Type | Name | Description |
---|---|---|
System.String | TrainedDataPath | File path to a .traineddata file. These can be downloaded from https://github.com/tesseract-ocr/tessdata or generated using Tesseract command line. |
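Examples
A sketch of reading with a single custom language file. Both file paths are hypothetical; the .traineddata file is assumed to have been downloaded or trained beforehand.

```csharp
using System;
using IronOcr;

public static class CustomLanguageExample
{
    public static void Main()
    {
        var ocr = new IronTesseract();

        // Hypothetical path to a custom .traineddata file.
        // After this call it is the only OCR language in use.
        ocr.UseCustomTesseractLanguageFile("tessdata/custom.traineddata");

        // "label.png" is a hypothetical file path used for illustration.
        OcrResult result = ocr.Read("label.png");
        Console.WriteLine(result.Text);
    }
}
```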
Events
OcrProgress
An event which can be used to track OCR progress and inform users of OCR performance and progress.
Progress is reported via the OcrProgresEventsArgs class.
Declaration
public event EventHandler<OcrProgresEventsArgs> OcrProgress
Event Type
Type | Description |
---|---|
System.EventHandler<OcrProgresEventsArgs> |
Examples
myIronTesseract.OcrProgress += (object o, IronOcr.Events.OcrProgresEventsArgs e) =>
{
    // Report the percentage complete and the elapsed time in seconds.
    Console.WriteLine(e.ProgressPercent + "% " + e.Duration.TotalSeconds + "s");
};