How to Implement OCR in C# GitHub Projects with IronOCR

Searching for OCR solutions on GitHub often yields fragmented documentation, complex Tesseract configurations, and projects that have not been updated in years. For C# developers who need reliable text extraction from images and PDFs, navigating the repository ecosystem can consume hours that would be better spent coding. Many open-source optical character recognition projects require manual binary management, tessdata file downloads, and platform-specific troubleshooting.

This tutorial demonstrates how to implement OCR functionality in C# projects using IronOCR, a managed library that eliminates the configuration overhead common with raw Tesseract wrappers. Whether building document processing pipelines or adding text recognition to existing applications, this guide provides working code examples that can be dropped into a C# project hosted on GitHub.

How Do You Get Started with IronOCR?

IronOCR provides a managed .NET library distributed via NuGet, making it straightforward to integrate into any GitHub repository. Unlike open-source Tesseract OCR wrappers that require manual management of binaries and tessdata configuration, IronOCR handles these dependencies internally and works out of the box on Windows, Linux, and macOS.

The library maintains official example repositories on GitHub that developers can clone and reference. These examples demonstrate real-world implementations, including image-to-text conversion, support for multiple languages, and PDF processing. Contributors can test features immediately after cloning without any additional setup.

To get started in Visual Studio, install IronOCR through the NuGet Package Manager:

Install-Package IronOcr

OCR C# GitHub: Implement Text Recognition with IronOCR: Image 1 - Installation

Once installed, this single package includes everything needed for OCR operations. The library supports .NET Framework 4.6.2+, .NET Core, and .NET 5 through 10 for maximum compatibility across project types.

How Do You Extract Text from Image Formats in C#?

The following example demonstrates basic text extraction using IronOCR's IronTesseract class. This OCR engine reads various image formats, including PNG, JPG, JPEG, BMP, GIF, and TIFF:

using IronOcr;

// Initialize the OCR engine
var ocr = new IronTesseract();

// Load and process an image
using var input = new OcrInput("document-scan.png");

// Perform OCR and retrieve results
var result = ocr.Read(input);

// Output the extracted text to console
Console.WriteLine($"Extracted Text:\n{result.Text}");
Console.WriteLine($"Confidence: {result.Confidence}%");

The IronTesseract class serves as the primary OCR engine, built on an optimized Tesseract 5 implementation. After creating an instance, the OcrInput object loads the target image from disk, a URL, or a byte array. The Read method processes the input and returns an OcrResult containing the extracted plain text along with a confidence percentage, which is the engine's own estimate of how reliable the recognition is rather than a measured accuracy. Higher confidence values (above 90%) typically indicate clean, well-formatted source documents.

The OcrResult object provides structured access to recognized content. Beyond plain text, developers can access individual words, lines, paragraphs, and characters, along with their positions and confidence scores. Each Word includes bounding rectangle coordinates, making it valuable for applications that require precise text location data, such as document annotation or form field extraction.
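As a minimal sketch of how the per-word data might be used, the helper below flags words whose confidence falls under a threshold so they can be routed to manual review. The `RecognizedWord` record and the `OcrReview` class are hypothetical names introduced for illustration. The commented mapping from `result.Words` assumes the word objects expose `Text`, `Confidence`, `X`, `Y`, `Width`, and `Height`; check those property names against the IronOCR version in use.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical plain record holding the per-word data described above.
public record RecognizedWord(string Text, double Confidence, int X, int Y, int Width, int Height);

public static class OcrReview
{
    // Returns the words whose confidence (0-100) is below minConfidence,
    // preserving the order in which they were supplied.
    public static List<RecognizedWord> NeedsReview(
        IEnumerable<RecognizedWord> words, double minConfidence = 90.0)
    {
        return words.Where(w => w.Confidence < minConfidence).ToList();
    }
}

// Assumed mapping from an IronOCR result (property names are an assumption):
// var words = result.Words.Select(w =>
//     new RecognizedWord(w.Text, w.Confidence, w.X, w.Y, w.Width, w.Height));
// foreach (var w in OcrReview.NeedsReview(words))
//     Console.WriteLine($"Check '{w.Text}' at ({w.X}, {w.Y}), {w.Confidence:F1}%");
```

Keeping the filter independent of the OCR types makes it easy to unit test with hand-built data, which suits a repository where contributors may not have a license key available.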

Input

OCR C# GitHub: Implement Text Recognition with IronOCR: Image 2 - Sample Input

Output

OCR C# GitHub: Implement Text Recognition with IronOCR: Image 3 - Console Output

IronOCR also supports loading images from streams and byte arrays, which is particularly useful in web applications that receive file uploads. This means OCR processing can occur entirely in memory without writing temporary files to disk, reducing input-output overhead in high-throughput environments.
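The in-memory flow described above can be sketched as follows. `UploadOcr.ReadFromStream` is a hypothetical helper, and the recognizer is passed in as a delegate so the buffering logic stays independent of the OCR library. The commented wiring assumes an `OcrInput.LoadImage(byte[])` overload, which may be named differently depending on the IronOCR release.

```csharp
using System;
using System.IO;

public static class UploadOcr
{
    // Buffers the upload in memory and hands the bytes to the supplied recognizer.
    // No temporary file is written at any point.
    public static string ReadFromStream(Stream upload, Func<byte[], string> recognize)
    {
        if (upload is null) throw new ArgumentNullException(nameof(upload));
        if (recognize is null) throw new ArgumentNullException(nameof(recognize));

        using var buffer = new MemoryStream();
        upload.CopyTo(buffer);
        return recognize(buffer.ToArray());
    }
}

// Assumed wiring with IronOCR (LoadImage(byte[]) is an assumption; older
// releases may expose a differently named method):
// var ocr = new IronTesseract();
// string text = UploadOcr.ReadFromStream(uploadStream, bytes =>
// {
//     using var input = new OcrInput();
//     input.LoadImage(bytes);
//     return ocr.Read(input).Text;
// });
```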

OCR C# GitHub: Implement Text Recognition with IronOCR: Image 4 - Features

How Does Image Preprocessing Improve Optical Character Recognition Accuracy?

Scanned documents often arrive skewed, noisy, or at suboptimal resolutions. IronOCR includes built-in preprocessing filters that correct these issues before the OCR engine processes the image:

using IronOcr;

var ocr = new IronTesseract();
using var input = new OcrInput("skewed-receipt.jpg");

// Apply preprocessing filters to enhance scan quality
input.Deskew();                    // Straighten rotated images
input.DeNoise();                   // Remove digital artifacts
input.EnhanceResolution(225);      // Optimize DPI for OCR

var result = ocr.Read(input);
Console.WriteLine(result.Text);

The Deskew method automatically detects and corrects image rotation up to 15 degrees, handling the common case of pages placed slightly crooked on a scanner. The DeNoise filter removes speckling and artifacts common in photographed documents or older scans. EnhanceResolution upscales low-DPI images to the 200-300 DPI range, which is optimal for optical character recognition accuracy.

These filters can be chained together and run entirely in memory without requiring temporary files. In many cases, applying multiple preprocessing passes can substantially improve text recognition results on documents with severe quality issues such as faded ink, background noise, or camera distortion. The improvement is most noticeable on documents scanned below 150 DPI or photographs taken under uneven lighting conditions.

How Does Region-of-Interest Cropping Help Performance?

For documents where only a portion of the image contains relevant text, defining a crop region reduces both processing time and potential false positives from background noise:

using IronOcr;
using IronSoftware.Drawing;

var ocr = new IronTesseract();
using var input = new OcrInput("invoice.png");

// Define crop region (x, y, width, height in pixels)
var cropArea = new CropRectangle(50, 100, 600, 300);
input.AddRegion(cropArea);

var result = ocr.Read(input);
Console.WriteLine(result.Text);

Targeting a specific region is particularly valuable when processing structured documents such as invoices or forms, where the text fields occupy known positions. This approach can reduce OCR processing time by 40-70% compared to full-image analysis, depending on how much of the image is irrelevant.

Can You Extract Barcodes and QR Codes Alongside Text?

IronOCR can simultaneously recognize text and scan barcodes within the same document. This dual functionality is valuable for processing invoices, shipping labels, and inventory documents:

using IronOcr;

var ocr = new IronTesseract();
ocr.Configuration.ReadBarCodes = true;  // Enable barcode detection

using var input = new OcrInput("shipping-label.png");
var result = ocr.Read(input);

// Access extracted text
Console.WriteLine($"Text: {result.Text}");

// Access any barcodes found in the image
foreach (var barcode in result.Barcodes)
{
    Console.WriteLine($"Barcode ({barcode.Format}): {barcode.Value}");
}

When ReadBarCodes is set to true, barcode detection activates without significantly impacting processing time. The Barcodes collection in the result contains the value and format type for each detected barcode. Supported formats include QR codes, Code 128, EAN-13, UPC, Data Matrix, and PDF417. This dual capability eliminates the need for separate barcode scanning libraries when processing documents that contain both human-readable text and machine-readable codes.

Input

OCR C# GitHub: Implement Text Recognition with IronOCR: Image 5 - Sample Barcode Image

Output

OCR C# GitHub: Implement Text Recognition with IronOCR: Image 6 - Console Barcode Text Output

For warehouse and logistics applications, combining text and barcode extraction in a single pass reduces API calls and simplifies application architecture. A single Read operation returns all recognizable data from the document, whether that data is printed text, handwriting, or machine-readable codes. The OcrResult.Barcodes property exposes a typed collection, so downstream code can iterate results without format-specific parsing logic.

How Do You Generate Searchable PDFs from Scanned Images?

Converting scanned documents to searchable PDFs enables text selection, copying, and full-text search within document management systems. This works with various image formats as input:

using IronOcr;

var ocr = new IronTesseract();
using var input = new OcrInput("scanned-contract.tiff");
var result = ocr.Read(input);

// Export as searchable PDF with invisible text layer
result.SaveAsSearchablePdf("contract-searchable.pdf");

The SaveAsSearchablePdf method embeds an invisible text layer matching the recognized content, preserving the original document appearance while enabling text operations. This produces documents suitable for archival and enterprise document management systems. IronOCR also supports exporting results as HTML or JSON for integration with downstream systems.

For multi-page documents, IronOCR processes each page individually and assembles the output into a single file. TIFF files with multiple frames are handled automatically, making batch conversion of scanned document archives straightforward. The resulting PDF preserves the visual layout of the original scan while the embedded text layer makes each page fully searchable in any PDF viewer or document management platform.

How Do You Use IronOCR in Multilingual Applications?

IronOCR supports 125+ languages including English, Spanish, French, German, Chinese, Japanese, Arabic, and many others. Language packs install through NuGet as separate packages, keeping the core library lightweight:

using IronOcr;

// Install-Package IronOcr.Languages.French
var ocr = new IronTesseract();
ocr.Language = OcrLanguage.French;

using var input = new OcrInput("french-document.png");
var result = ocr.Read(input);
Console.WriteLine(result.Text);

For documents containing mixed languages on the same page, IronOCR supports loading multiple language models simultaneously. This is relevant for internationalized applications that process documents from multiple regions without needing to pre-sort files by language. Each language pack is maintained alongside the core library and supports the same preprocessing and output capabilities.
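A minimal sketch of a mixed-language setup, assuming the IronOcr.Languages.French package is installed and that `AddSecondaryLanguage` is available in the IronOCR version in use. The `OcrEngines` class and its method name are hypothetical and exist only to group the configuration in one place.

```csharp
using IronOcr;

public static class OcrEngines
{
    // Builds an engine that reads French as the primary language
    // and English as an additional language for mixed pages.
    public static IronTesseract CreateFrenchEnglish()
    {
        var ocr = new IronTesseract();
        ocr.Language = OcrLanguage.French;
        ocr.AddSecondaryLanguage(OcrLanguage.English);
        return ocr;
    }
}
```

The returned engine is used like any other, for example `OcrEngines.CreateFrenchEnglish().Read(input)`. Loading more language models generally costs processing time, so it is usually worth adding only the languages a document set is expected to contain.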

What Are the Best Practices for OCR in GitHub Projects?

When maintaining OCR projects on GitHub, a few organizational decisions improve contributor experience and long-term project health. These practices apply whether you are building a small utility script or a large enterprise document processing service.

Use Git LFS for large test images to avoid bloating the repository size. Binary assets in standard Git history inflate clone times and storage costs, particularly when test datasets include high-resolution scans. Store license keys in environment variables or GitHub Secrets, never in committed C# code; refer to the license key configuration guide for setup instructions.
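One way to keep the key out of source control is to read it from an environment variable at startup, sketched below. The variable name `IRONOCR_LICENSE_KEY` and the `LicenseConfig` class are hypothetical. The commented assignment assumes the `IronOcr.License.LicenseKey` property described in the license key configuration guide.

```csharp
using System;

public static class LicenseConfig
{
    // Hypothetical variable name; match it to the secret configured in the workflow.
    public const string VariableName = "IRONOCR_LICENSE_KEY";

    // Reads the key through the supplied lookup and fails fast with a clear
    // message when it is missing, so a misconfigured run stops at startup.
    public static string ReadLicenseKey(Func<string, string?> getEnv)
    {
        string? key = getEnv(VariableName);
        if (string.IsNullOrWhiteSpace(key))
            throw new InvalidOperationException(
                $"Environment variable {VariableName} is not set.");
        return key.Trim();
    }
}

// At application startup (assumed property name):
// IronOcr.License.LicenseKey =
//     LicenseConfig.ReadLicenseKey(Environment.GetEnvironmentVariable);
```

Passing the lookup as a delegate keeps the helper testable without touching the real process environment.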

Include sample images in a dedicated test-data folder so contributors can verify OCR functionality without sourcing their own documents. Document the supported image formats and .NET version requirements in README files to reduce onboarding questions. Build and run tests in CI pipelines using GitHub Actions to confirm the library functions correctly across target environments.

For GitHub Actions workflows, IronOCR runs in containerized environments on both Windows and Linux runners. Refer to the Linux deployment guide for configuration details when targeting Ubuntu or other non-Windows runners.

What Are Your Next Steps?

IronOCR brings reliable text recognition to C# GitHub projects through a NuGet-distributed library that handles Tesseract configuration, preprocessing, barcode detection, and multilingual support without external binary dependencies. The code examples in this guide cover the core use cases: basic text extraction, image preprocessing, barcode scanning, searchable PDF creation, and multilingual processing.

To explore the complete feature set, start a free trial with no time pressure or credit card requirement. When ready for production deployment, review the licensing options that cover individual developers through enterprise teams.

OCR C# GitHub: Implement Text Recognition with IronOCR: Image 7 - Licensing

Frequently Asked Questions

What is IronOCR?

IronOCR is a .NET OCR library for C# that extracts text from images and PDFs using an optimized Tesseract 5 engine. It installs via NuGet and handles binary dependencies internally, requiring no manual tessdata configuration.

How do I install IronOCR in a C# project?

Run `Install-Package IronOcr` in the NuGet Package Manager Console in Visual Studio, or use the NuGet Package Manager UI to search for IronOcr. The package includes all required binaries for Windows, Linux, and macOS.

Does IronOCR work on Linux for GitHub Actions?

Yes, IronOCR supports Linux runners in GitHub Actions. Refer to the Linux deployment guide at https://ironsoftware.com/csharp/ocr/how-to/linux/ for required package dependencies on Ubuntu and other distributions.

Can IronOCR read barcodes and QR codes?

Yes. Set ocr.Configuration.ReadBarCodes = true before calling Read(). The OcrResult.Barcodes collection contains the value and format type for each detected code, supporting QR, Code 128, EAN-13, UPC, Data Matrix, and PDF417.

How do I generate a searchable PDF from a scanned image?

After calling ocr.Read(input), use result.SaveAsSearchablePdf("output.pdf") to create a PDF with an invisible text layer over the original scan. The output is suitable for archival and enterprise document management systems.

Does IronOCR support languages other than English?

Yes. IronOCR supports 125+ languages through dedicated NuGet language packs. Install the language package (for example, Install-Package IronOcr.Languages.French), then set ocr.Language = OcrLanguage.French before processing.

How should I store IronOCR license keys in a GitHub repository?

Store license keys in GitHub Secrets and inject them as environment variables in your GitHub Actions workflow. Never commit license key strings directly in C# code or appsettings files.

What image formats does IronOCR support?

IronOCR supports PNG, JPG, JPEG, BMP, GIF, TIFF (including multi-frame), PDF, and other common formats. Images can be loaded from file paths, URLs, streams, or byte arrays.

Kannapat Udonpant
Software Engineer
Before becoming a Software Engineer, Kannapat completed an Environmental Resources PhD at Hokkaido University in Japan. While pursuing his degree, Kannapat also became a member of the Vehicle Robotics Laboratory, which is part of the Department of Bioproduction Engineering. In 2022, he leveraged his C# skills to join Iron Software's engineering ...