
.NET OCR SDK: A Text Recognition Library for C#

A .NET OCR SDK is a software development kit that lets C# and .NET applications extract text from images, scanned PDFs, and other document formats programmatically. IronOCR is a production-ready .NET OCR SDK that wraps a tuned Tesseract 5 engine with preprocessing filters, barcode reading, searchable PDF output, and support for 125+ languages -- all accessible through a clean C# API that works on Windows, Linux, macOS, and cloud platforms.

What Makes IronOCR the Right .NET OCR SDK for Your Project?

Building text recognition from scratch means managing image preprocessing pipelines, language data files, threading models, and output parsing -- months of work before you extract your first word. IronOCR eliminates that overhead by shipping a battle-tested engine that your team can drop into a project in minutes.

Key capabilities that set it apart from raw Tesseract bindings:

  • Recognition of 125+ languages and scripts including handwritten text
  • Built-in filters: noise removal, deskewing, binarization, resolution enhancement, and contrast correction
  • Barcode and QR code detection within the same read pass
  • Searchable PDF generation with invisible text layers for archiving workflows
  • Async and parallel batch processing for high-throughput pipelines
  • Zonal OCR for targeting specific page regions to cut processing time
  • Cross-platform support on Windows, Linux, macOS, Docker, and Azure

According to the Tesseract OCR project documentation, raw Tesseract requires manual configuration for language packs, DPI settings, and output modes. IronOCR handles all of this automatically, letting you focus on what the extracted text means rather than how to extract it.

How Does IronOCR Compare to Raw Tesseract?

Raw Tesseract via a P/Invoke wrapper or the Tesseract NuGet package leaves you responsible for: downloading and placing tessdata language files, selecting the correct page segmentation mode, handling multi-page TIFF and PDF splitting yourself, and wiring up threading if you want parallel processing. None of those details are unique to your business problem.

IronOCR wraps all of that plumbing. You get a typed API surface, automatic tessdata management, built-in PDF split-and-recombine, and a thread-safe engine that you can reuse across requests. The tradeoff is a paid license for production use -- the licensing page shows current pricing tiers including a free development license.

For teams that need open-source-only dependencies, raw Tesseract plus custom preprocessing is a viable path. For teams that need to ship reliable OCR quickly, IronOCR reduces the integration surface to a few lines of C#.

How Do You Install the IronOCR .NET SDK?

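Installation comes through NuGet, the standard .NET package manager. Run the following command in the Visual Studio Package Manager Console:

Install-Package IronOcr

From a terminal in your project directory, the .NET CLI equivalent is:

dotnet add package IronOcr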

For Visual Studio users, search for IronOcr in the NuGet Package Manager GUI and install from there. For full installation options including manual DLL references, see the IronOCR installation documentation.

After installation, add the license key to your application startup or appsettings.json. You can start a free trial to get a trial key that unlocks all features during evaluation.

Verifying the Installation

A quick sanity check after installation confirms everything is wired up correctly. Create a console application targeting .NET 10:

using IronOcr;

// Minimal smoke test -- reads a single image and prints extracted text
var ocr = new IronTesseract();
using var input = new OcrInput();
input.LoadImage("sample.png");
var result = ocr.Read(input);
Console.WriteLine(result.Text);

If text appears in the console, the SDK is installed and the license key is valid. You are ready to build production workflows.

How Do You Extract Text From Images and PDFs in C#?

The core extraction pattern is consistent across all input types. You create an IronTesseract instance, load content into an OcrInput object, and call Read(). The service below picks the loader from the file extension: LoadPdf for PDFs (including multi-page files) and LoadImage for raster formats such as JPEG, PNG, TIFF, and BMP.

using IronOcr;

// Reusable OCR service encapsulating the IronTesseract engine
public class OcrService
{
    private readonly IronTesseract _ocr = new IronTesseract();

    public string ExtractText(string filePath)
    {
        using var input = new OcrInput();

        // LoadPdf for PDF files; LoadImage for raster formats
        if (filePath.EndsWith(".pdf", StringComparison.OrdinalIgnoreCase))
            input.LoadPdf(filePath);
        else
            input.LoadImage(filePath);

        return _ocr.Read(input).Text;
    }

    public async Task<string> ExtractTextAsync(string filePath)
    {
        using var input = new OcrInput();

        if (filePath.EndsWith(".pdf", StringComparison.OrdinalIgnoreCase))
            input.LoadPdf(filePath);
        else
            input.LoadImage(filePath);

        var result = await _ocr.ReadAsync(input);
        return result.Text;
    }
}

Top-level entry point to exercise the service:

using IronOcr;

var service = new OcrService();
string text = await service.ExtractTextAsync("invoice.pdf");
Console.WriteLine(text);

The IronTesseract instance is thread-safe and designed for reuse. Create it once at application startup (via dependency injection in ASP.NET Core, for example) rather than instantiating it per request.

For multi-page PDFs, result.Pages gives you per-page access to the text, confidence score, and bounding boxes. See the multi-page PDF OCR guide for details on page-by-page iteration.

How Do You Improve OCR Accuracy With Preprocessing Filters?

Raw scans from flatbed scanners, smartphone cameras, or fax machines frequently suffer from noise, rotation, low contrast, and insufficient resolution. IronOCR's image quality correction pipeline addresses each issue with targeted filters you chain before the read call.

using IronOcr;

public class AccuracyOptimizedOcr
{
    private readonly IronTesseract _ocr = new IronTesseract();

    public string ProcessLowQualityDocument(string filePath)
    {
        using var input = new OcrInput();

        if (filePath.EndsWith(".pdf", StringComparison.OrdinalIgnoreCase))
            input.LoadPdf(filePath);
        else
            input.LoadImage(filePath);

        // Chain preprocessing filters in order of operation
        input.DeNoise();              // Remove scan artifacts and speckling
        input.Deskew();               // Correct page tilt up to 35 degrees
        input.Scale(150);             // Enlarge small text for better recognition
        input.Binarize();             // Convert to black/white for cleaner edges
        input.EnhanceResolution(300); // Sharpen blurry or low-DPI input

        var result = _ocr.Read(input);

        // Confidence below 70 often signals a preprocessing mismatch
        if (result.Confidence < 70)
            Console.WriteLine($"Warning: low confidence ({result.Confidence:F1}%)");

        return result.Text;
    }
}

Filter selection guidance:

  • DeNoise() -- use for scans with heavy speckling or compression artifacts
  • Deskew() -- use when documents are photographed at an angle; see page rotation detection for auto-detection
  • Scale() -- use for small print or sub-150 DPI input; values of 150-200 typically yield the best results
  • Binarize() -- use for colored or gradient backgrounds; converts image to strict black/white
  • EnhanceResolution() -- use for blurry or low-contrast text; targets 300 DPI as the Tesseract sweet spot

Research published in the International Journal on Document Analysis and Recognition consistently shows that binarization and deskewing are the two highest-impact preprocessing steps for improving character recognition rates. Apply both as a baseline for any production pipeline.

Table: IronOCR preprocessing filters and their primary use cases

Filter               | Problem Solved                   | When to Apply
---------------------|----------------------------------|-------------------------------------
DeNoise()            | Scanner artifacts, speckle noise | Any flatbed or fax scan
Deskew()             | Page tilt and rotation           | Photographed or misaligned documents
Scale()              | Small text or low DPI            | Input below 150 DPI
Binarize()           | Color backgrounds, gradients     | Colored paper or watermarked forms
EnhanceResolution()  | Blur and low contrast            | Camera captures and compressed JPEGs

How Do You Build a Production Batch Processing Pipeline?

Single-document extraction is straightforward, but production scenarios involve hundreds or thousands of files arriving in queues, shared folders, or cloud storage. IronOCR's async API and thread-safe engine make it suitable for parallel workloads.

using IronOcr;
using Microsoft.Extensions.Logging;

public class ProductionOcrService
{
    private readonly IronTesseract _ocr;
    private readonly ILogger<ProductionOcrService> _logger;

    public ProductionOcrService(ILogger<ProductionOcrService> logger)
    {
        _logger = logger;
        _ocr = new IronTesseract
        {
            Configuration =
            {
                RenderSearchablePdfsAndHocr = true,
                ReadBarCodes = true
            }
        };
    }

    public async Task<IReadOnlyList<string>> ProcessBatchAsync(
        IEnumerable<string> filePaths,
        int maxDegreeOfParallelism = 4)
    {
        // Materialize the paths so each result lands at its input's index;
        // a ConcurrentBag would hand results back in arbitrary order
        var paths = filePaths.ToArray();
        var results = new string[paths.Length];

        var options = new ParallelOptions
        {
            MaxDegreeOfParallelism = maxDegreeOfParallelism
        };

        await Parallel.ForEachAsync(Enumerable.Range(0, paths.Length), options, async (i, ct) =>
        {
            var filePath = paths[i];
            try
            {
                using var input = new OcrInput();

                if (filePath.EndsWith(".pdf", StringComparison.OrdinalIgnoreCase))
                    input.LoadPdf(filePath);
                else
                    input.LoadImage(filePath);

                var result = await _ocr.ReadAsync(input);
                results[i] = result.Text;
                _logger.LogInformation("Processed {FilePath} at {Confidence:F1}% confidence",
                    filePath, result.Confidence);
            }
            catch (Exception ex)
            {
                _logger.LogError(ex, "OCR failed for {FilePath}", filePath);
                results[i] = string.Empty; // Keep the slot so indexes still match the input
            }
        });

        return results;
    }

    public void CreateSearchablePdf(string inputPath, string outputPath)
    {
        using var input = new OcrInput();

        if (inputPath.EndsWith(".pdf", StringComparison.OrdinalIgnoreCase))
            input.LoadPdf(inputPath);
        else
            input.LoadImage(inputPath);

        _ocr.Read(input).SaveAsSearchablePdf(outputPath);
        _logger.LogInformation("Searchable PDF written to {OutputPath}", outputPath);
    }
}

The MaxDegreeOfParallelism cap prevents memory exhaustion when files are large. A value of 4 works well on a four-core server; increase it only after profiling memory usage. For Azure Functions or AWS Lambda deployments, set concurrency to 1 per function instance and scale horizontally instead.
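The sizing advice above could be captured in a small helper. This is a minimal sketch and not part of IronOCR: ParallelismPolicy and Choose are hypothetical names, and the default cap of 4 simply mirrors the default used in the batch service.

```csharp
using System;

// Hypothetical helper (not an IronOCR API): picks a MaxDegreeOfParallelism value.
public static class ParallelismPolicy
{
    // logicalCores: typically Environment.ProcessorCount.
    // isServerless: true for per-instance hosts such as Azure Functions or AWS Lambda,
    //               where one document at a time plus horizontal scaling is suggested.
    // cap:          upper bound to raise only after profiling memory usage.
    public static int Choose(int logicalCores, bool isServerless, int cap = 4)
    {
        if (cap < 1)
            throw new ArgumentOutOfRangeException(nameof(cap), "cap must be at least 1");

        if (isServerless)
            return 1;

        // Never exceed the cap, and never drop below one worker
        return Math.Clamp(logicalCores, 1, cap);
    }
}
```

A caller would pass the result as the maxDegreeOfParallelism argument of ProcessBatchAsync, for example ParallelismPolicy.Choose(Environment.ProcessorCount, isServerless: false).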

CreateSearchablePdf generates a PDF where the original image is preserved as a visible layer and recognized text is embedded invisibly beneath it. This allows full-text search in PDF viewers and indexing by search engines -- a common requirement in document management systems.

Monitoring Confidence Scores in Production

Every OcrResult exposes a Confidence property (0-100) that reflects how certain the engine is about the recognized text. Tracking this metric in your logging infrastructure gives you an early warning signal when document quality degrades -- for example, if a scanner's calibration drifts or a new document supplier sends lower-DPI scans than expected.

A practical threshold strategy: log a warning at confidence below 80, trigger a preprocessing-retry pass at below 70, and flag documents for human review at below 60. This tiered approach catches quality issues before they produce silent data corruption in downstream systems.
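The tiered thresholds can be expressed as a small pure function that the batch service calls with result.Confidence. This is an illustrative sketch: ConfidenceAction and ConfidencePolicy are hypothetical names, not IronOCR types, and the cut-offs are the 80/70/60 values from the paragraph above.

```csharp
using System;

// Hypothetical names for illustration; none of these are IronOCR types.
public enum ConfidenceAction
{
    Accept,
    LogWarning,
    RetryWithPreprocessing,
    HumanReview
}

public static class ConfidencePolicy
{
    // Maps a 0-100 confidence score to an action.
    // The lowest matching tier wins, so a score of 55 goes straight to human review.
    public static ConfidenceAction Classify(double confidence)
    {
        // Treat an unusable score as the worst case rather than silently accepting it
        if (double.IsNaN(confidence) || confidence < 60.0)
            return ConfidenceAction.HumanReview;

        if (confidence < 70.0)
            return ConfidenceAction.RetryWithPreprocessing;

        if (confidence < 80.0)
            return ConfidenceAction.LogWarning;

        return ConfidenceAction.Accept;
    }
}
```

Keeping the policy separate from the OCR call makes the thresholds easy to unit test and to tune per document source.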

The Microsoft .NET logging documentation covers the ILogger patterns used in the batch service above for teams integrating with ASP.NET Core's built-in DI container.

How Do You Extract Structured Data From Scanned Documents?

Text extraction is the first step. The second step is parsing that text into typed fields your application can act on. This pattern combines IronOCR's read pass with .NET's Regex to pull structured data from invoices, forms, and reports.

using IronOcr;
using System.Text.RegularExpressions;

public record Invoice(
    string? InvoiceNumber,
    DateOnly? Date,
    decimal? TotalAmount,
    string RawText
);

public class InvoiceOcrService
{
    private readonly IronTesseract _ocr = new IronTesseract();

    public Invoice ExtractInvoiceData(string invoicePath)
    {
        using var input = new OcrInput();

        if (invoicePath.EndsWith(".pdf", StringComparison.OrdinalIgnoreCase))
            input.LoadPdf(invoicePath);
        else
            input.LoadImage(invoicePath);

        input.DeNoise();
        input.Deskew();

        var result = _ocr.Read(input);
        string text = result.Text;

        return new Invoice(
            InvoiceNumber: ExtractInvoiceNumber(text),
            Date: ExtractDate(text),
            TotalAmount: ExtractAmount(text),
            RawText: text
        );
    }

    private static string? ExtractInvoiceNumber(string text)
    {
        var match = Regex.Match(text, @"Invoice\s*#?:?\s*(\S+)", RegexOptions.IgnoreCase);
        return match.Success ? match.Groups[1].Value : null;
    }

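    private static DateOnly? ExtractDate(string text)
    {
        // Parse with en-US rules so MM/DD/YYYY is not read as DD/MM/YYYY on other locales
        var usCulture = System.Globalization.CultureInfo.GetCultureInfo("en-US");
        var styles = System.Globalization.DateTimeStyles.None;

        // Numeric format: MM/DD/YYYY
        var numeric = Regex.Match(text, @"\b(\d{1,2}/\d{1,2}/\d{2,4})\b");
        if (numeric.Success && DateTime.TryParse(numeric.Groups[1].Value, usCulture, styles, out var d1))
            return DateOnly.FromDateTime(d1);

        // Written format: January 15, 2025
        var written = Regex.Match(text,
            @"\b(January|February|March|April|May|June|July|August|September|October|November|December)\s+(\d{1,2}),?\s+(\d{4})\b",
            RegexOptions.IgnoreCase);
        if (written.Success && DateTime.TryParse(written.Value, usCulture, styles, out var d2))
            return DateOnly.FromDateTime(d2);

        return null;
    }

    private static decimal? ExtractAmount(string text)
    {
        // Takes the first dollar amount in the text; accepts thousands separators (e.g. $1,234.56)
        var match = Regex.Match(text, @"\$\s*(\d{1,3}(?:,\d{3})+(?:\.\d{2})?|\d+(?:\.\d{2})?)");
        return match.Success && decimal.TryParse(
                match.Groups[1].Value,
                System.Globalization.NumberStyles.Number,
                System.Globalization.CultureInfo.InvariantCulture,
                out var amt)
            ? amt
            : null;
    }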
}

This approach pairs well with zonal OCR when you know exactly where each field appears on a form. By supplying a bounding rectangle, you skip full-page recognition and target only the region containing the invoice number or total -- dramatically reducing processing time for fixed-layout documents.

For more advanced extraction scenarios including tables and structured forms, review the IronOCR data extraction examples on the product site.

How Do You Handle Multi-Language OCR in .NET?

Many organizations process documents in more than one language -- import/export forms, international contracts, or multilingual customer submissions. IronOCR handles this by allowing you to configure the language pack before the read call.

using IronOcr;

// Configure multi-language recognition
var ocr = new IronTesseract();
ocr.Language = OcrLanguage.EnglishBest;  // Swap for any of 125+ supported languages

// For mixed-language documents, combine language packs
ocr.AddSecondaryLanguage(OcrLanguage.German);

using var input = new OcrInput();
input.LoadPdf("multilingual-contract.pdf");
var result = ocr.Read(input);
Console.WriteLine(result.Text);

The IronOCR language support page lists all 125+ available language packs with download instructions. Language packs ship as NuGet packages (for example, IronOcr.Languages.German) so they integrate with the same package management workflow you already use.

For character sets outside the Latin alphabet -- Arabic, Chinese, Japanese, Korean -- IronOCR provides optimized models that handle right-to-left text direction and ideographic scripts. See the CJK OCR guide for configuration specifics.

What Are Your Next Steps?

You now have the patterns needed to add production-grade OCR to any .NET 10 application: basic text extraction, preprocessing for difficult scans, async batch processing, structured data parsing, and multi-language support.

From here, start with the free trial license to evaluate the full feature set on your own documents before committing to a tier.


Frequently Asked Questions

What is the .NET OCR SDK?

The .NET OCR SDK by IronOCR is a library designed to integrate optical character recognition capabilities into C# applications, allowing developers to extract text from images, PDFs, and scanned documents.

What are the key features of IronOCR's .NET SDK?

IronOCR's .NET SDK offers a simple API, support for multiple languages, cross-platform compatibility, and advanced features for handling various file formats and low-quality scans.

How does IronOCR handle different languages?

IronOCR's .NET SDK supports 125+ languages. You set the primary language on the IronTesseract instance, add secondary languages for mixed-language documents, and install the matching language pack from NuGet.

Can IronOCR process low-quality scans?

Yes, IronOCR is designed to effectively handle low-quality scans, employing advanced algorithms to enhance text recognition accuracy even in challenging scenarios.

Is IronOCR's .NET SDK cross-platform?

IronOCR's .NET SDK is cross-platform, meaning it can be used on different operating systems, making it versatile for various development environments.

What file formats does IronOCR support?

IronOCR supports a wide range of file formats including images, PDFs, and scanned documents, providing flexibility for text recognition tasks across different media.

How can developers integrate IronOCR into their projects?

Developers can integrate IronOCR into their C# projects using its typed API, which simplifies the process of adding OCR functionality to applications.

What are some use cases for IronOCR?

IronOCR can be used in document management systems, automated data entry, content digitization, and any application requiring text extraction from images or PDFs.

Kannapat Udonpant
Software Engineer
Before becoming a Software Engineer, Kannapat completed an Environmental Resources PhD from Hokkaido University in Japan. While pursuing his degree, Kannapat also became a member of the Vehicle Robotics Laboratory, which is part of the Department of Bioproduction Engineering. In 2022, he leveraged his C# skills to join Iron Software's engineering ...