Class IronTesseract
IronTesseract is a comprehensive managed class for performing Tesseract OCR in .NET applications.
IronTesseract natively supports the Tesseract 3, 4 and 5 engines, and automatically installs all required binaries and language pack (tessdata) files.
Inheritance
Namespace: IronOcr
Assembly: IronOcr.dll
Syntax
public class IronTesseract : Object
Constructors
IronTesseract()
Public constructor. Creates a default instance of IronTesseract.
Declaration
public IronTesseract()
IronTesseract(TesseractConfiguration)
Public constructor. Creates an instance of IronTesseract with a customized TesseractConfiguration.
This allows advanced developers to fine-tune Tesseract behavior.
Declaration
public IronTesseract(TesseractConfiguration Configuration)
Parameters
Type | Name | Description |
---|---|---|
TesseractConfiguration | Configuration |
Fields
Configuration
An instance of TesseractConfiguration which allows fine-grained control of the underlying Tesseract OCR engine.
Options include the language file detail level, the Page Segmentation Mode, and access to the full set of Tesseract settings variables.
Declaration
public TesseractConfiguration Configuration
Field Value
Type | Description |
---|---|
TesseractConfiguration |
Properties
Language
The natural language of the documents which IronTesseract will read.
Default is English. Additional languages can be installed using NuGet https://www.nuget.org/packages?q=IronOcr.Languages or downloaded from https://ironsoftware.com/csharp/ocr/languages/
Multiple language packs can be used simultaneously with the AddSecondaryLanguage method.
Custom Tesseract .traineddata language packs can be used with the UseCustomTesseractLanguageFile(String) method.
Declaration
public OcrLanguage Language { get; set; }
Property Value
Type | Description |
---|---|
OcrLanguage |
MultiThreaded
When true, multiple PDF pages and images are read simultaneously on different threads.
Declaration
public bool MultiThreaded { get; set; }
Property Value
Type | Description |
---|---|
System.Boolean |
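As a rough illustration of how the Language and MultiThreaded properties and the Configuration field fit together, the sketch below sets each one before a read. The file name is hypothetical. `OcrResult.Text`, `TesseractConfiguration.PageSegmentationMode` and `TesseractPageSegmentationMode.Auto` are not documented on this page, so they are assumptions to check against their own reference pages.

```csharp
using System;
using IronOcr;

var ocr = new IronTesseract();

// Language and MultiThreaded are the two properties documented above.
ocr.Language = OcrLanguage.English;
ocr.MultiThreaded = true;

// Assumption: TesseractConfiguration exposes a PageSegmentationMode member
// that accepts TesseractPageSegmentationMode.Auto.
ocr.Configuration.PageSegmentationMode = TesseractPageSegmentationMode.Auto;

// "document.png" is a hypothetical input file.
OcrResult result = ocr.Read("document.png");

// Assumption: OcrResult exposes the recognized text through a Text property.
Console.WriteLine(result.Text);
```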
Methods
AddSecondaryLanguage(OcrLanguage)
IronTesseract will use multiple Tesseract language files simultaneously, enabling multilingual OCR.
Any number of secondary languages may be added, but each additional language may reduce OCR speed.
Declaration
public void AddSecondaryLanguage(OcrLanguage SecondaryLanguage)
Parameters
Type | Name | Description |
---|---|---|
OcrLanguage | SecondaryLanguage | An additional OcrLanguage |
AddSecondaryLanguage(String)
IronTesseract will use multiple Tesseract language files simultaneously, enabling multilingual OCR with a custom .traineddata language file for Tesseract 3, 4 or 5.
Any number of secondary languages may be added, but each additional language may reduce OCR speed.
Declaration
public void AddSecondaryLanguage(string CustomLanguagePath)
Parameters
Type | Name | Description |
---|---|---|
System.String | CustomLanguagePath | File path to a .traineddata tesseract language pack. |
ClearSecondaryLanguages()
Removes all languages added by AddSecondaryLanguage(OcrLanguage) or AddSecondaryLanguage(String).
Declaration
public void ClearSecondaryLanguages()
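The secondary-language methods above might be combined as in the following sketch. The file names and the .traineddata path are hypothetical. `OcrLanguage.French` is assumed to be available (typically through the matching IronOcr.Languages package), and `OcrResult.Text` is an assumption because it is not documented on this page.

```csharp
using System;
using IronOcr;

var ocr = new IronTesseract();
ocr.Language = OcrLanguage.English;

// Add a packaged language as a secondary language.
// Assumption: the French language pack is installed.
ocr.AddSecondaryLanguage(OcrLanguage.French);

// Add a custom language file as well; the path is hypothetical.
ocr.AddSecondaryLanguage("tessdata/custom.traineddata");

OcrResult mixed = ocr.Read("bilingual-letter.png");
Console.WriteLine(mixed.Text);

// Return to single-language OCR for later reads.
ocr.ClearSecondaryLanguages();
OcrResult englishOnly = ocr.Read("english-letter.png");
Console.WriteLine(englishOnly.Text);
```

Clearing the secondary languages when they are no longer needed avoids the speed cost noted above on later reads.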
Read(OcrInput)
Reads text from an OcrInput object and returns an OcrResult object.
OcrInput is the preferred input type because it allows for OCR of multi-paged documents, and allows images to be enhanced before OCR to obtain faster, more accurate results.
There are also other overloads of this method that allow for Images and PDFs to be read directly as File paths and Bitmaps.
Declaration
public OcrResult Read(OcrInput Input)
Parameters
Type | Name | Description |
---|---|---|
OcrInput | Input | An OcrInput document which can contain one or more images and PDFs |
Returns
Type | Description |
---|---|
OcrResult | An OcrResult object containing text and detailed, structured information about the extracted text content. |
Read(AnyBitmap)
Reads text from an IronSoftware.Drawing.AnyBitmap image and returns an OcrResult object.
Declaration
public OcrResult Read(AnyBitmap Image)
Parameters
Type | Name | Description |
---|---|---|
IronSoftware.Drawing.AnyBitmap | Image | An IronSoftware.Drawing.AnyBitmap, SixLabors.ImageSharp.Image, SkiaSharp.SKBitmap, SkiaSharp.SKImage, Microsoft.Maui.Graphics.Platform.PlatformImage, System.Drawing.Bitmap, or System.Drawing.Image |
Returns
Type | Description |
---|---|
OcrResult | An OcrResult object containing text and detailed, structured information about the extracted text content. |
Read(AnyBitmap, CropRectangle)
Reads text from a region of an IronSoftware.Drawing.AnyBitmap image and returns an OcrResult object.
Declaration
public OcrResult Read(AnyBitmap Image, CropRectangle ContentArea)
Parameters
Type | Name | Description |
---|---|---|
IronSoftware.Drawing.AnyBitmap | Image | An IronSoftware.Drawing.AnyBitmap, SixLabors.ImageSharp.Image, SkiaSharp.SKBitmap, SkiaSharp.SKImage, Microsoft.Maui.Graphics.Platform.PlatformImage, System.Drawing.Bitmap, or System.Drawing.Image |
IronSoftware.Drawing.CropRectangle | ContentArea | Specifies the region of the image to extract text from, as an IronSoftware.Drawing.CropRectangle with X, Y, Width and Height in pixels. Setting a ContentArea can improve OCR speed. |
Returns
Type | Description |
---|---|
OcrResult | An OcrResult object containing text and detailed, structured information about the extracted text content. |
Read(String)
Reads text from an Image file and returns an OcrResult object.
Declaration
public OcrResult Read(string ImagePath)
Parameters
Type | Name | Description |
---|---|---|
System.String | ImagePath | Path to an image file. |
Returns
Type | Description |
---|---|
OcrResult | An OcrResult object containing text and detailed, structured information about the extracted text content. |
Read(String, CropRectangle)
Reads text from a region of an Image file and returns an OcrResult object.
Declaration
public OcrResult Read(string ImagePath, CropRectangle ContentArea)
Parameters
Type | Name | Description |
---|---|---|
System.String | ImagePath | Path to an image file. |
IronSoftware.Drawing.CropRectangle | ContentArea | Specifies the region of the image to extract text from, as an IronSoftware.Drawing.CropRectangle with X, Y, Width and Height in pixels. Setting a ContentArea can improve OCR speed. |
Returns
Type | Description |
---|---|
OcrResult | An OcrResult object containing text and detailed, structured information about the extracted text content. |
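A minimal sketch of the Read(String) and Read(String, CropRectangle) overloads follows. The file name and pixel coordinates are hypothetical. The four-integer `CropRectangle` constructor (x, y, width, height) belongs to IronSoftware.Drawing and is not documented on this page, so it is an assumption here, as is `OcrResult.Text`.

```csharp
using System;
using IronOcr;
using IronSoftware.Drawing;

var ocr = new IronTesseract();

// Read the whole image from a file path.
OcrResult fullPage = ocr.Read("receipt.png");
Console.WriteLine(fullPage.Text);

// Read only a region of the same image.
// Assumption: CropRectangle takes x, y, width and height in pixels, in that order.
var totalsArea = new CropRectangle(0, 400, 600, 200);
OcrResult region = ocr.Read("receipt.png", totalsArea);
Console.WriteLine(region.Text);
```

Restricting the read to the region that holds the text of interest is the case where the ContentArea parameter is described as improving OCR speed.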
ReadAsync(OcrInput, Int32)
Reads text from an OcrInput object and returns an OcrResult object.
OcrInput is the preferred input type because it allows for OCR of multi-paged documents, and allows images to be enhanced before OCR to obtain faster, more accurate results.
There are also other overloads of this method that allow for Images and PDFs to be read directly as File paths and Bitmaps.
Declaration
public OcrReadTask ReadAsync(OcrInput Input, int TimeoutMs = -1)
Parameters
Type | Name | Description |
---|---|---|
OcrInput | Input | An OcrInput document which can contain one or more images and PDFs |
System.Int32 | TimeoutMs | Optional timeout in milliseconds, after which the OCR read will be cancelled. Not supported in .NET 4.0 |
Returns
Type | Description |
---|---|
OcrReadTask | A task that represents the asynchronous read operation. The task's Result property contains an OcrResult object containing text and detailed, structured information about the extracted text content. |
ReadAsync(AnyBitmap, CropRectangle, Int32)
Reads text from an IronSoftware.Drawing.AnyBitmap object and returns an OcrResult object.
Declaration
public Task<OcrResult> ReadAsync(AnyBitmap Image, CropRectangle ContentArea = null, int TimeoutMs = -1)
Parameters
Type | Name | Description |
---|---|---|
IronSoftware.Drawing.AnyBitmap | Image | An IronSoftware.Drawing.AnyBitmap object |
IronSoftware.Drawing.CropRectangle | ContentArea | Specifies the region of the image to extract text from, as an IronSoftware.Drawing.CropRectangle with X, Y, Width and Height in pixels. Setting a ContentArea can improve OCR speed. |
System.Int32 | TimeoutMs | Optional timeout in milliseconds, after which the OCR read will be cancelled. Not supported in .NET 4.0 |
Returns
Type | Description |
---|---|
System.Threading.Tasks.Task<OcrResult> | A task that represents the asynchronous read operation. The task's Result property contains an OcrResult object containing text and detailed, structured information about the extracted text content. |
ReadAsync(String, CropRectangle, Int32)
Reads text from an Image file and returns an OcrResult object.
Declaration
public OcrReadTask ReadAsync(string ImagePath, CropRectangle ContentArea = null, int TimeoutMs = -1)
Parameters
Type | Name | Description |
---|---|---|
System.String | ImagePath | Path to an image file. |
IronSoftware.Drawing.CropRectangle | ContentArea | Specifies the region of the image to extract text from, as an IronSoftware.Drawing.CropRectangle with X, Y, Width and Height in pixels. Setting a ContentArea can improve OCR speed. |
System.Int32 | TimeoutMs | Optional timeout in milliseconds, after which the OCR read will be cancelled. Not supported in .NET 4.0 |
Returns
Type | Description |
---|---|
OcrReadTask | A task that represents the asynchronous read operation. The task's Result property contains an OcrResult object containing text and detailed, structured information about the extracted text content. |
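The asynchronous overloads could be awaited as in this sketch, which uses the ReadAsync(AnyBitmap, CropRectangle, Int32) overload because its declared return type is Task<OcrResult>. The file name and the 30-second timeout are hypothetical. `AnyBitmap.FromFile` belongs to IronSoftware.Drawing and is not documented on this page, so it is an assumption, as is `OcrResult.Text`.

```csharp
using System;
using System.Threading.Tasks;
using IronOcr;
using IronSoftware.Drawing;

var ocr = new IronTesseract();

// Assumption: AnyBitmap.FromFile loads an image from disk.
AnyBitmap image = AnyBitmap.FromFile("scan.png");

// ContentArea keeps its default (null); only the timeout is supplied,
// using the parameter name from the declaration above.
OcrResult result = await ocr.ReadAsync(image, TimeoutMs: 30_000);

Console.WriteLine(result.Text);
```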
ReadPdfAndOverlayText(Byte[], String, String)
Takes a PDF document as a byte array and overlays the text read by Tesseract on top of its pages. Bookmarks from the original document are preserved.
Declaration
public byte[] ReadPdfAndOverlayText(byte[] pdf, string password = null, string ownerPassword = null)
Parameters
Type | Name | Description |
---|---|---|
System.Byte[] | pdf | PDF to read and overlay text on, as bytes. |
System.String | password | Password of the input PDF if there is one. |
System.String | ownerPassword | Owner password of the input PDF if there is one. |
Returns
Type | Description |
---|---|
System.Byte[] | Returns the new PDF as a byte[]. To save in one step use ReadPdfAndOverlayTextSaveAs(Byte[], String, String, String). |
ReadPdfAndOverlayText(String, String, String)
Takes a PDF document from a file path and overlays the text read by Tesseract on top of its pages. Bookmarks from the original document are preserved.
Declaration
public byte[] ReadPdfAndOverlayText(string pdfFilePath, string password = null, string ownerPassword = null)
Parameters
Type | Name | Description |
---|---|---|
System.String | pdfFilePath | Path to the PDF that needs selectable text overlaid on it. |
System.String | password | Password of the input PDF if there is one. |
System.String | ownerPassword | Owner password of the input PDF if there is one. |
Returns
Type | Description |
---|---|
System.Byte[] | Returns the new PDF as a byte[]. To save in one step use ReadPdfAndOverlayTextSaveAs(String, String, String, String). |
ReadPdfAndOverlayTextSaveAs(Byte[], String, String, String)
Takes a PDF document as a byte array, overlays the text read by Tesseract on top of its pages, and saves the new PDF to pdfSavePath. Bookmarks from the original document are preserved.
Declaration
public void ReadPdfAndOverlayTextSaveAs(byte[] pdf, string pdfSavePath, string password = null, string ownerPassword = null)
Parameters
Type | Name | Description |
---|---|---|
System.Byte[] | pdf | PDF to read and overlay text on, as bytes. |
System.String | pdfSavePath | Path to save the new PDF. |
System.String | password | Password of the input PDF if there is one. |
System.String | ownerPassword | Owner password of the input PDF if there is one. |
ReadPdfAndOverlayTextSaveAs(String, String, String, String)
Takes a PDF document from a file path, overlays the text read by Tesseract on top of its pages, and saves the new PDF to pdfSavePath. Bookmarks from the original document are preserved.
Declaration
public void ReadPdfAndOverlayTextSaveAs(string pdfFilePath, string pdfSavePath, string password = null, string ownerPassword = null)
Parameters
Type | Name | Description |
---|---|---|
System.String | pdfFilePath | Path to the PDF that needs selectable text overlaid on it. |
System.String | pdfSavePath | Path to save the new PDF. |
System.String | password | Password of the input PDF if there is one. |
System.String | ownerPassword | Owner password of the input PDF if there is one. |
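The four overlay methods differ only in how the PDF comes in and how the result goes out. The sketch below uses the path-to-path variant and the bytes-to-bytes variant. All file names are hypothetical and the input is assumed to be unprotected, so the optional password arguments are left at their defaults.

```csharp
using System.IO;
using IronOcr;

var ocr = new IronTesseract();

// Path in, file out: read the scanned PDF and save the overlaid copy in one step.
ocr.ReadPdfAndOverlayTextSaveAs("scanned.pdf", "searchable.pdf");

// Bytes in, bytes out: useful when the PDF never touches disk,
// for example when it arrives from a stream or a database.
byte[] inputBytes = File.ReadAllBytes("scanned.pdf");
byte[] outputBytes = ocr.ReadPdfAndOverlayText(inputBytes);
File.WriteAllBytes("searchable-copy.pdf", outputBytes);
```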
UseCustomTesseractLanguageFile(String)
IronTesseract will use a tesseract .traineddata language file as its only OCR language.
https://github.com/tesseract-ocr/tessdata
Declaration
public void UseCustomTesseractLanguageFile(string TrainedDataPath)
Parameters
Type | Name | Description |
---|---|---|
System.String | TrainedDataPath | File path to a .traineddata file. These can be downloaded from https://github.com/tesseract-ocr/tessdata or generated using Tesseract command line. |
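Using a custom language file as the only OCR language might look like the following sketch. The .traineddata path and the image name are hypothetical, and `OcrResult.Text` is an assumption because it is not documented on this page.

```csharp
using System;
using IronOcr;

var ocr = new IronTesseract();

// Replace the default language with a custom trained data file.
// The path is hypothetical; the file could come from the tessdata
// repository linked above or from your own training run.
ocr.UseCustomTesseractLanguageFile("tessdata/custom.traineddata");

OcrResult result = ocr.Read("label.png");
Console.WriteLine(result.Text);
```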
Events
OcrProgress
An event which can be used to track OCR progress and report it to users.
Progress is reported via the OcrProgressEventsArgs class.
Declaration
public event EventHandler<OcrProgressEventsArgs> OcrProgress
Event Type
Type | Description |
---|---|
System.EventHandler<OcrProgressEventsArgs> |
Examples
myIronTesseract.OcrProgress += (object o, IronOcr.Events.OcrProgressEventsArgs e) =>
{
    // Print the percentage complete and the elapsed time in seconds.
    Console.WriteLine(e.ProgressPercent + "% " + e.Duration.TotalSeconds + "s");
};