Class IronTesseract
IronTesseract is a comprehensive managed class for performing Tesseract OCR in .NET applications.
IronTesseract natively supports the Tesseract 3, 4, and 5 engines, and will automatically install all required binaries and language pack (tessdata) files.
Inheritance
System.Object
IronTesseract
Implements
IOcrEngine
Namespace: IronOcr
Assembly: IronOcr.dll
Syntax
public class IronTesseract : Object, IOcrEngine
Constructors
IronTesseract()
Public constructor. Creates a default instance of IronTesseract
Declaration
public IronTesseract()
IronTesseract(TesseractConfiguration)
Public constructor. Creates an instance of IronTesseract with a customized TesseractConfiguration.
This allows advanced developers to fine-tune Tesseract behavior.
Declaration
public IronTesseract(TesseractConfiguration Configuration)
Parameters
Type | Name | Description |
---|---|---|
TesseractConfiguration | Configuration |
Fields
Configuration
An instance of TesseractConfiguration which allows fine-grained control of the underlying Tesseract OCR Engine.
Options include the language file detail level, the Page Segmentation Mode, and access to the entire set of Tesseract settings variables.
Declaration
public TesseractConfiguration Configuration
Field Value
Type | Description |
---|---|
TesseractConfiguration |
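As a rough sketch of how Configuration might be used, the snippet below adjusts two engine settings before reading. The PageSegmentationMode and BlackListCharacters members and the TesseractPageSegmentationMode enum are not documented on this page; they are assumed from the TesseractConfiguration class and should be checked against its own reference. The file name is a hypothetical placeholder.

```csharp
using System;
using IronOcr;

var ocr = new IronTesseract();

// Assumed TesseractConfiguration members (not listed on this page):
// automatic page segmentation with orientation and script detection,
// and a set of characters the engine should never emit.
ocr.Configuration.PageSegmentationMode = TesseractPageSegmentationMode.AutoOsd;
ocr.Configuration.BlackListCharacters = "|^`";

// "invoice.png" is a placeholder path.
OcrResult result = ocr.Read("invoice.png");
Console.WriteLine(result.Text);
```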
Properties
EnableTesseractConsoleMessages
Gets or sets a value indicating whether Tesseract developer messages and warnings will be sent to console output.
Declaration
public bool EnableTesseractConsoleMessages { get; set; }
Property Value
Type | Description |
---|---|
System.Boolean |
Remarks
Setting this property to true enables console output for Tesseract messages and warnings. Conversely, setting it to false disables this output.
Language
The natural language of the documents that IronTesseract will read.
Default is English. Additional languages can be installed using NuGet https://www.nuget.org/packages?q=IronOcr.Languages or downloaded from https://ironsoftware.com/csharp/ocr/languages/
Multiple language packs can be used simultaneously with the AddSecondaryLanguage(OcrLanguage) method.
Custom Tesseract .traineddata language files can be used with the UseCustomTesseractLanguageFile(String) method.
Declaration
public OcrLanguage Language { get; set; }
Property Value
Type | Description |
---|---|
OcrLanguage |
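A minimal sketch of switching the primary language follows. It assumes the French language pack (for example the IronOcr.Languages.French NuGet package) is installed and that OcrLanguage exposes a French member; the image path is a hypothetical placeholder.

```csharp
using System;
using IronOcr;

var ocr = new IronTesseract();

// Replace the default (English) with French.
// Assumes the matching language pack is installed.
ocr.Language = OcrLanguage.French;

// "lettre.png" is a placeholder path.
OcrResult result = ocr.Read("lettre.png");
Console.WriteLine(result.Text);
```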
MultiThreaded
When true, reads multiple PDF pages and images simultaneously on different threads.
Declaration
public bool MultiThreaded { get; set; }
Property Value
Type | Description |
---|---|
System.Boolean |
Methods
AddSecondaryLanguage(OcrLanguage)
Enables multilingual OCR: IronTesseract will use multiple Tesseract language files simultaneously.
Any number of secondary languages may be added, but speed and performance may be affected.
Declaration
public void AddSecondaryLanguage(OcrLanguage SecondaryLanguage)
Parameters
Type | Name | Description |
---|---|---|
OcrLanguage | SecondaryLanguage | An additional OcrLanguage |
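The following sketch shows one way a secondary language could be combined with the primary one, then cleared again. It assumes a German language pack is installed and that OcrLanguage exposes a German member; the file name is a hypothetical placeholder.

```csharp
using System;
using IronOcr;

var ocr = new IronTesseract();
ocr.Language = OcrLanguage.English;

// Add German as a secondary language (assumes its language pack is installed).
ocr.AddSecondaryLanguage(OcrLanguage.German);

// "bilingual.png" is a placeholder path.
OcrResult result = ocr.Read("bilingual.png");
Console.WriteLine(result.Text);

// Return to single-language OCR for later reads.
ocr.ClearSecondaryLanguages();
```

Because each extra language may slow the read, it seems reasonable to add only the languages a document is expected to contain.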
AddSecondaryLanguage(String)
Enables multilingual OCR using a custom Tesseract 3, 4, or 5 .traineddata language file: IronTesseract will use multiple Tesseract language files simultaneously.
Any number of secondary languages may be added, but speed and performance may be affected.
Declaration
public void AddSecondaryLanguage(string CustomLanguagePath)
Parameters
Type | Name | Description |
---|---|---|
System.String | CustomLanguagePath | File path to a .traineddata tesseract language pack. |
ClearSecondaryLanguages()
Removes all languages added by AddSecondaryLanguage(OcrLanguage) or AddSecondaryLanguage(String).
Declaration
public void ClearSecondaryLanguages()
ConvertToSearchablePdf(Byte[], String, String)
Perform OCR on images within the PDF, overlay the text onto the original PDF, and save the new PDF to file
Declaration
public void ConvertToSearchablePdf(byte[] PdfData, string SavePath, string Password = null)
Parameters
Type | Name | Description |
---|---|---|
System.Byte[] | PdfData | PDF file data |
System.String | SavePath | Save path of the searchable PDF |
System.String | Password | PDF password |
Remarks
Useful for generating a searchable PDF which retains bookmarks, annotations, etc.
ConvertToSearchablePdf(String, String, String)
Perform OCR on images within the PDF, overlay the text onto the original PDF, and save the new PDF to file
Declaration
public void ConvertToSearchablePdf(string PdfPath, string SavePath, string Password = null)
Parameters
Type | Name | Description |
---|---|---|
System.String | PdfPath | PDF file path |
System.String | SavePath | Save path of the searchable PDF |
System.String | Password | PDF password |
Remarks
Useful for generating a searchable PDF which retains bookmarks, annotations, etc.
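A short sketch of the searchable-PDF overloads, using only the signatures declared on this page. All file names and the password are hypothetical placeholders.

```csharp
using System.IO;
using IronOcr;

var ocr = new IronTesseract();

// Path overload: OCR the scanned PDF and save a searchable copy.
ocr.ConvertToSearchablePdf("scanned.pdf", "searchable.pdf");

// Same call for a password-protected PDF ("secret" is a placeholder).
ocr.ConvertToSearchablePdf("protected.pdf", "protected-searchable.pdf", "secret");

// Byte overloads: useful when the PDF is already in memory.
byte[] source = File.ReadAllBytes("scanned.pdf");
byte[] searchable = ocr.ConvertToSearchablePdfBytes(source);
File.WriteAllBytes("searchable-from-bytes.pdf", searchable);
```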
ConvertToSearchablePdfBytes(Byte[], String)
Perform OCR on images within the PDF, overlay the text onto the original PDF, and return a byte array of the new PDF
Declaration
public byte[] ConvertToSearchablePdfBytes(byte[] PdfData, string Password = null)
Parameters
Type | Name | Description |
---|---|---|
System.Byte[] | PdfData | PDF file data |
System.String | Password | PDF password |
Returns
Type | Description |
---|---|
System.Byte[] | Byte array of the generated Searchable PDF |
Remarks
Useful for generating a searchable PDF which retains bookmarks, annotations, etc.
ConvertToSearchablePdfBytes(String, String)
Perform OCR on images within the PDF, overlay the text onto the original PDF, and return a byte array of the new PDF
Declaration
public byte[] ConvertToSearchablePdfBytes(string PdfPath, string Password = null)
Parameters
Type | Name | Description |
---|---|---|
System.String | PdfPath | PDF file path |
System.String | Password | PDF password |
Returns
Type | Description |
---|---|
System.Byte[] | Byte array of the generated Searchable PDF |
Remarks
Useful for generating a searchable PDF which retains bookmarks, annotations, etc.
Read(OcrInputBase)
Reads text from an OcrInput object and returns an OcrResult object.
OcrInput is the preferred input type because it allows for OCR of multi-paged documents, and allows images to be enhanced before OCR to obtain faster, more accurate results.
There are also other overloads of this method that allow for Images and PDFs to be read directly as File paths and Bitmaps.
Declaration
public OcrResult Read(OcrInputBase Input)
Parameters
Type | Name | Description |
---|---|---|
OcrInputBase | Input | An OcrInput document which can contain one or more images and PDFs |
Returns
Type | Description |
---|---|
OcrResult | An OcrResult object containing text, and detailed, structured information about the extracted text content. |
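As an illustration of the OcrInput workflow described above, the sketch below loads two images, applies one filter, and reads them as a single document. OcrInput's members are not documented on this page: LoadImage, Deskew, and disposability are assumed from recent IronOCR versions (older releases may use different method names), and the file names are hypothetical placeholders.

```csharp
using System;
using IronOcr;

var ocr = new IronTesseract();

// Assumed OcrInput API: disposable, with LoadImage and Deskew.
using var input = new OcrInput();
input.LoadImage("page1.png");
input.LoadImage("page2.png");

// Straighten rotated scans before OCR.
input.Deskew();

OcrResult result = ocr.Read(input);
Console.WriteLine(result.Text);
```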
Read(OcrInputBase[])
Reads text from an array of OcrInput objects and returns an array of OcrResult objects.
OcrInput is the preferred input type because it allows for OCR of multi-paged documents, and allows images to be enhanced before OCR to obtain faster, more accurate results.
There are also other overloads of this method that allow for Images and PDFs to be read directly as File paths and Bitmaps.
Declaration
public OcrResult[] Read(OcrInputBase[] Inputs)
Parameters
Type | Name | Description |
---|---|---|
OcrInputBase[] | Inputs | An array of OcrInput documents which can contain one or more images and PDFs each |
Returns
Type | Description |
---|---|
OcrResult[] | An array of OcrResult objects containing text, and detailed, structured information about the extracted text content. |
Read(AnyBitmap)
Reads text from an IronSoftware.Drawing.AnyBitmap image and returns an OcrResult object.
Declaration
public OcrResult Read(AnyBitmap Image)
Parameters
Type | Name | Description |
---|---|---|
IronSoftware.Drawing.AnyBitmap | Image | An IronSoftware.Drawing.AnyBitmap, SixLabors.ImageSharp.Image, SkiaSharp.SKBitmap, SkiaSharp.SKImage, Microsoft.Maui.Graphics.Platform.PlatformImage, System.Drawing.Bitmap, or System.Drawing.Image |
Returns
Type | Description |
---|---|
OcrResult | An OcrResult object containing text, and detailed, structured information about the extracted text content. |
Read(AnyBitmap, Rectangle)
Reads text from a region of an IronSoftware.Drawing.AnyBitmap image and returns an OcrResult object.
Declaration
public OcrResult Read(AnyBitmap Image, Rectangle ContentArea)
Parameters
Type | Name | Description |
---|---|---|
IronSoftware.Drawing.AnyBitmap | Image | An IronSoftware.Drawing.AnyBitmap, SixLabors.ImageSharp.Image, SkiaSharp.SKBitmap, SkiaSharp.SKImage, Microsoft.Maui.Graphics.Platform.PlatformImage, System.Drawing.Bitmap, or System.Drawing.Image |
IronSoftware.Drawing.Rectangle | ContentArea | Specifies a region within the image to extract text from, as an IronSoftware.Drawing.Rectangle with X, Y, Width, and Height in pixels. Setting a ContentArea can improve OCR speed. |
Returns
Type | Description |
---|---|
OcrResult | An OcrResult object containing text, and detailed, structured information about the extracted text content. |
Read(IDocumentId)
Reads an existing PDF document and returns the OCR results.
Declaration
public IOcrResult Read(IDocumentId Document)
Parameters
Type | Name | Description |
---|---|---|
IronSoftware.IDocumentId | Document | PDF document to read |
Returns
Type | Description |
---|---|
IronSoftware.IOcrResult | OCR results |
Read(IDocumentId, PdfContents)
Reads the selected contents of an existing PDF document and returns the OCR results.
Declaration
public IOcrResult Read(IDocumentId Document, PdfContents Contents)
Parameters
Type | Name | Description |
---|---|---|
IronSoftware.IDocumentId | Document | PDF document to read |
PdfContents | Contents | Contents to OCR |
Returns
Type | Description |
---|---|
IronSoftware.IOcrResult | OCR results |
Read(String)
Reads text from an Image file and returns an OcrResult object.
Declaration
public OcrResult Read(string ImagePath)
Parameters
Type | Name | Description |
---|---|---|
System.String | ImagePath | Path to an image file. |
Returns
Type | Description |
---|---|
OcrResult | An OcrResult object containing text, and detailed, structured information about the extracted text content. |
Read(String, Rectangle)
Reads text from a region of an Image file and returns an OcrResult object.
Declaration
public OcrResult Read(string ImagePath, Rectangle ContentArea)
Parameters
Type | Name | Description |
---|---|---|
System.String | ImagePath | Path to an image file. |
IronSoftware.Drawing.Rectangle | ContentArea | Specifies a region within the image to extract text from, as an IronSoftware.Drawing.Rectangle with X, Y, Width, and Height in pixels. Setting a ContentArea can improve OCR speed. |
Returns
Type | Description |
---|---|
OcrResult | An OcrResult object containing text, and detailed, structured information about the extracted text content. |
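A small sketch of reading only part of an image via ContentArea. The Rectangle constructor with named x, y, width, and height arguments is assumed from IronSoftware.Drawing and is not documented on this page; the coordinates and file name are hypothetical placeholders.

```csharp
using System;
using IronOcr;

var ocr = new IronTesseract();

// Region of interest in pixels (placeholder values).
// Fully qualified to avoid confusion with System.Drawing.Rectangle.
var contentArea = new IronSoftware.Drawing.Rectangle(x: 50, y: 40, width: 600, height: 120);

// Only the text inside contentArea is read.
OcrResult result = ocr.Read("form.png", contentArea);
Console.WriteLine(result.Text);
```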
ReadAsync(OcrInputBase, Int32)
Asynchronously reads text from an OcrInput object and returns a task whose result is an OcrResult object.
OcrInput is the preferred input type because it allows for OCR of multi-paged documents, and allows images to be enhanced before OCR to obtain faster, more accurate results.
There are also other overloads of this method that allow for Images and PDFs to be read directly as File paths and Bitmaps.
Declaration
public OcrReadTask ReadAsync(OcrInputBase Input, int TimeoutMs = -1)
Parameters
Type | Name | Description |
---|---|---|
OcrInputBase | Input | An OcrInput document which can contain one or more images and PDFs |
System.Int32 | TimeoutMs | Optional timeout in milliseconds, after which the OCR read will be cancelled. Not supported in .NET 4.0 |
Returns
Type | Description |
---|---|
OcrReadTask | A task that represents the asynchronous read operation. The value of the System.Threading.Tasks.Task<TResult>.Result property contains an OcrResult object containing text, and detailed, structured information about the extracted text content. |
ReadAsync(AnyBitmap, Rectangle, Int32)
Asynchronously reads text from an IronSoftware.Drawing.AnyBitmap object and returns a task whose result is an OcrResult object.
Declaration
public Task<OcrResult> ReadAsync(AnyBitmap Image, Rectangle ContentArea = null, int TimeoutMs = -1)
Parameters
Type | Name | Description |
---|---|---|
IronSoftware.Drawing.AnyBitmap | Image | An IronSoftware.Drawing.AnyBitmap object |
IronSoftware.Drawing.Rectangle | ContentArea | Specifies a region within the image to extract text from, as an IronSoftware.Drawing.Rectangle with X, Y, Width, and Height in pixels. Setting a ContentArea can improve OCR speed. |
System.Int32 | TimeoutMs | Optional timeout in milliseconds, after which the OCR read will be cancelled. Not supported in .NET 4.0 |
Returns
Type | Description |
---|---|
System.Threading.Tasks.Task<OcrResult> | A task that represents the asynchronous read operation. The value of the System.Threading.Tasks.Task<TResult>.Result property contains an OcrResult object containing text, and detailed, structured information about the extracted text content. |
ReadAsync(String, Rectangle, Int32)
Asynchronously reads text from an image file and returns a task whose result is an OcrResult object.
Declaration
public OcrReadTask ReadAsync(string ImagePath, Rectangle ContentArea = null, int TimeoutMs = -1)
Parameters
Type | Name | Description |
---|---|---|
System.String | ImagePath | Path to an image file. |
IronSoftware.Drawing.Rectangle | ContentArea | Specifies a region within the image to extract text from, as an IronSoftware.Drawing.Rectangle with X, Y, Width, and Height in pixels. Setting a ContentArea can improve OCR speed. |
System.Int32 | TimeoutMs | Optional timeout in milliseconds, after which the OCR read will be cancelled. Not supported in .NET 4.0 |
Returns
Type | Description |
---|---|
OcrReadTask | A task that represents the asynchronous read operation. The value of the System.Threading.Tasks.Task<TResult>.Result property contains an OcrResult object containing text, and detailed, structured information about the extracted text content. |
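The asynchronous overloads might be used along the following lines. This sketch assumes OcrReadTask can be awaited and exposes a Result of type OcrResult, as the return description above suggests; the 30-second timeout and the file name are hypothetical placeholders.

```csharp
using System;
using IronOcr;

var ocr = new IronTesseract();

// Start the read without blocking; cancel it if it runs longer than 30 seconds.
var readTask = ocr.ReadAsync("scan.png", TimeoutMs: 30000);

// Other work could run here while OCR is in progress.

// Assumption: OcrReadTask is awaitable like Task<OcrResult>.
await readTask;
OcrResult result = readTask.Result;
Console.WriteLine(result.Text);
```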
ReadDocument(OcrInputBase)
A robust IronTesseract document reading method that specializes in scanned documents or photos of paper documents which contain a lot of text.
Use the OcrInput filters to improve inputs before reading, see: https://ironsoftware.com/csharp/ocr/tutorials/c-sharp-ocr-image-filters/
For reading images, use ReadPhoto(OcrInputBase). For reading passports, use ReadPassport(OcrInputBase). For reading license plates, use ReadLicensePlate(OcrInputBase). For reading scanned documents that contain tables with clear outlines, use ReadDocumentAdvanced(OcrInputBase).
Declaration
public OcrResult ReadDocument(OcrInputBase ocrInput)
Parameters
Type | Name | Description |
---|---|---|
OcrInputBase | ocrInput | OCR Input |
Returns
Type | Description |
---|---|
OcrResult | OCR result |
ReadDocumentAdvanced(OcrInputBase)
An optimized read utilizing machine learning models together with computer vision methods for documents that contain tables with clear outlines.
Use the OcrInput filters to improve inputs before reading, see: https://ironsoftware.com/csharp/ocr/tutorials/c-sharp-ocr-image-filters/
For reading scanned documents, use ReadDocument(OcrInputBase). For reading images, use ReadPhoto(OcrInputBase). For reading passports, use ReadPassport(OcrInputBase). For reading license plates, use ReadLicensePlate(OcrInputBase).
Declaration
public OcrDocAdvancedResult ReadDocumentAdvanced(OcrInputBase ocrInput)
Parameters
Type | Name | Description |
---|---|---|
OcrInputBase | ocrInput | OCR Input |
Returns
Type | Description |
---|---|
OcrDocAdvancedResult | OCR document advanced result |
Remarks
Currently supported languages are English, Chinese, Japanese, Korean, and Latin.
This is an extension method to the base IronOCR package and requires that you also install the IronOcr.Extensions.AdvancedScan package.
ReadImagesFromPdf(Byte[], String, IEnumerable<Int32>)
Extract all images from a PDF, perform OCR on the images, and return the results
Declaration
public OcrResult ReadImagesFromPdf(byte[] PdfData, string Password = null, IEnumerable<int> PageIndices = null)
Parameters
Type | Name | Description |
---|---|---|
System.Byte[] | PdfData | PDF file data |
System.String | Password | PDF password |
System.Collections.Generic.IEnumerable<System.Int32> | PageIndices | Pages to extract images from |
Returns
Type | Description |
---|---|
OcrResult | OCR results |
Remarks
Useful for generating a searchable PDF which retains bookmarks, annotations, etc.
ReadImagesFromPdf(String, String, IEnumerable<Int32>)
Extract all images from a PDF, perform OCR on the images, and return the results
Declaration
public OcrResult ReadImagesFromPdf(string PdfPath, string Password = null, IEnumerable<int> PageIndices = null)
Parameters
Type | Name | Description |
---|---|---|
System.String | PdfPath | PDF file path |
System.String | Password | PDF password |
System.Collections.Generic.IEnumerable<System.Int32> | PageIndices | Pages to extract images from |
Returns
Type | Description |
---|---|
OcrResult | OCR results |
Remarks
Useful for generating a searchable PDF which retains bookmarks, annotations, etc.
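A brief sketch of ReadImagesFromPdf using the signature declared above. Whether PageIndices is zero-based is an assumption here, and the file name is a hypothetical placeholder.

```csharp
using System;
using IronOcr;

var ocr = new IronTesseract();

// OCR the images found on every page of the PDF.
OcrResult allPages = ocr.ReadImagesFromPdf("report.pdf");
Console.WriteLine(allPages.Text);

// Restrict OCR to selected pages (indices assumed to be zero-based).
OcrResult firstTwoPages = ocr.ReadImagesFromPdf("report.pdf", PageIndices: new[] { 0, 1 });
Console.WriteLine(firstTwoPages.Text);
```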
ReadLicensePlate(OcrInputBase)
An optimized read that extracts a license plate from photos.
Use the OcrInput filters to improve inputs before reading, see: https://ironsoftware.com/csharp/ocr/tutorials/c-sharp-ocr-image-filters/
For reading scanned documents, use ReadDocument(OcrInputBase). For reading images, use ReadPhoto(OcrInputBase) or ReadScreenShot(OcrInputBase). For reading passports, use ReadPassport(OcrInputBase). For reading scanned documents that contain tables with clear outlines, use ReadDocumentAdvanced(OcrInputBase).
Declaration
public OcrLicensePlateResult ReadLicensePlate(OcrInputBase ocrInput)
Parameters
Type | Name | Description |
---|---|---|
OcrInputBase | ocrInput | OCR input |
Returns
Type | Description |
---|---|
OcrLicensePlateResult | OCR license plate result |
Remarks
Currently supported languages are English, Chinese, Japanese, Korean, and Latin.
This is an extension method to the base IronOCR package and requires that you also install the IronOcr.Extensions.AdvancedScan package.
ReadPassport(OcrInputBase)
An optimized read that extracts passport information from passport photos by scanning the MRZ contents.
Use the OcrInput filters to improve inputs before reading, see: https://ironsoftware.com/csharp/ocr/tutorials/c-sharp-ocr-image-filters/
For reading scanned documents, use ReadDocument(OcrInputBase). For reading images, use ReadPhoto(OcrInputBase) or ReadScreenShot(OcrInputBase). For reading license plates, use ReadLicensePlate(OcrInputBase). For reading scanned documents that contain tables with clear outlines, use ReadDocumentAdvanced(OcrInputBase).
Declaration
public OcrPassportResult ReadPassport(OcrInputBase ocrInput)
Parameters
Type | Name | Description |
---|---|---|
OcrInputBase | ocrInput | OCR input |
Returns
Type | Description |
---|---|
OcrPassportResult | OCR passport result |
Remarks
This method only supports the English language.
This is an extension method to the base IronOCR package and requires that you also install the IronOcr.Extensions.AdvancedScan package.
ReadPhoto(OcrInputBase)
An optimized read for images that contain hard-to-read text.
Use the OcrInput filters to improve inputs before reading, see: https://ironsoftware.com/csharp/ocr/tutorials/c-sharp-ocr-image-filters/
For reading scanned documents, use ReadDocument(OcrInputBase). For reading passports, use ReadPassport(OcrInputBase). For reading license plates, use ReadLicensePlate(OcrInputBase). For reading scanned documents that contain tables with clear outlines, use ReadDocumentAdvanced(OcrInputBase).
Declaration
public OcrPhotoResult ReadPhoto(OcrInputBase ocrInput)
Parameters
Type | Name | Description |
---|---|---|
OcrInputBase | ocrInput | OCR input |
Returns
Type | Description |
---|---|
OcrPhotoResult | OCR photo result |
Remarks
Currently supported languages are English, Chinese, Japanese, Korean, and Latin.
This is an extension method to the base IronOCR package and requires that you also install the IronOcr.Extensions.AdvancedScan package.
ReadScreenShot(OcrInputBase)
An optimized read for screenshots that contain hard-to-read text.
Use the OcrInput filters to improve inputs before reading, see: https://ironsoftware.com/csharp/ocr/tutorials/c-sharp-ocr-image-filters/
For reading scanned documents, use ReadDocument(OcrInputBase). For reading passports, use ReadPassport(OcrInputBase). For reading license plates, use ReadLicensePlate(OcrInputBase). For reading scanned documents that contain tables with clear outlines, use ReadDocumentAdvanced(OcrInputBase).
Declaration
public OcrPhotoResult ReadScreenShot(OcrInputBase ocrInput)
Parameters
Type | Name | Description |
---|---|---|
OcrInputBase | ocrInput | OCR input |
Returns
Type | Description |
---|---|
OcrPhotoResult | OCR photo result |
Remarks
Currently supported languages are English, Chinese, Japanese, Korean, and Latin.
This is an extension method to the base IronOCR package and requires that you also install the IronOcr.Extensions.AdvancedScan package.
UseCustomTesseractLanguageFile(String)
IronTesseract will use a tesseract .traineddata language file as its only OCR language.
https://github.com/tesseract-ocr/tessdata
Declaration
public void UseCustomTesseractLanguageFile(string TrainedDataPath)
Parameters
Type | Name | Description |
---|---|---|
System.String | TrainedDataPath | File path to a .traineddata file. These can be downloaded from https://github.com/tesseract-ocr/tessdata or generated using Tesseract command line. |
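A minimal sketch of using a custom language file as the only OCR language. Both paths are hypothetical placeholders; the .traineddata file is assumed to have been downloaded or trained beforehand.

```csharp
using System;
using IronOcr;

var ocr = new IronTesseract();

// Use a custom .traineddata file as the sole OCR language (placeholder path).
ocr.UseCustomTesseractLanguageFile("tessdata/custom.traineddata");

// "sample.png" is a placeholder path.
OcrResult result = ocr.Read("sample.png");
Console.WriteLine(result.Text);
```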
Events
OcrProgress
An Event which can be used to track OCR progress and inform users of OCR performance and progress.
Progress is reported via the OcrProgressEventsArgs class
Declaration
public event EventHandler<OcrProgressEventsArgs> OcrProgress
Event Type
Type | Description |
---|---|
System.EventHandler<OcrProgressEventsArgs> |
Examples
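var myIronTesseract = new IronTesseract();
// Print the completion percentage and elapsed time each time progress is reported.
myIronTesseract.OcrProgress += (object o, IronOcr.Events.OcrProgressEventsArgs e) =>
{
    Console.WriteLine(e.ProgressPercent + "% " + e.Duration.TotalSeconds + "s");
};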