Save OCR Results as hOCR HTML in C# with IronOCR
IronOCR enables developers to save OCR results as hOCR HTML files by setting RenderHocr to true and using SaveAsHocrFile or SaveAsHocrString methods, preserving text layout and character coordinates in structured HTML format.
Quickstart: Save OCR Output as hOCR HTML File
Enable hOCR rendering and export results directly to an HTML file with one setup and one method call.
Install IronOCR with NuGet Package Manager, then copy and run this code snippet:
var hocr = new IronTesseract { Configuration = { RenderHocr = true } }.Read(new OcrInput("image.png")).SaveAsHocrString();
Minimal Workflow (5 steps)
- Download a C# library to save results as hOCR in an HTML file
- Prepare the target image or PDF document
- Set the RenderHocr property to true
- Use the SaveAsHocrFile method to output an HTML file
- Use the SaveAsHocrString method to output an HTML string
What Is hOCR and Why Use It?
hOCR, which stands for "HTML-based OCR," is a file format used to represent the results of Optical Character Recognition (OCR) in a structured manner. hOCR files are written in HTML and provide a way to store recognized text, layout information, and the coordinates of each recognized character within an image or document. This structured format makes hOCR particularly valuable for applications requiring text position data, such as document indexing, accessibility tools, and advanced search implementations.
The hOCR format is essential for developers building applications that need to understand not just what text is present, but where that text appears on the original document. This spatial information enables features like highlighting text for debugging, creating clickable overlays on original images, and maintaining document layout integrity when converting scanned documents to accessible formats. For enterprise applications processing scanned documents, hOCR provides the foundation for advanced document understanding and extraction workflows.
How Do I Export OCR Results as hOCR Files?
To export the result as hOCR, first set the Configuration.RenderHocr property to true. After obtaining the OcrResult object from the Read method, call SaveAsHocrFile to write the OCR result to an HTML file containing the recognized text and layout of the input document. The code below uses a sample TIFF file, Potter.tiff.
using IronOcr;

// Instantiate IronTesseract
IronTesseract ocrTesseract = new IronTesseract();

// Enable render as hOCR (must be set before calling Read)
ocrTesseract.Configuration.RenderHocr = true;

// Add image
using var imageInput = new OcrImageInput("Potter.tiff");
imageInput.Title = "Html Title";

// Perform OCR
OcrResult ocrResult = ocrTesseract.Read(imageInput);

// Export as HTML
ocrResult.SaveAsHocrFile("result.html");

Imports IronOcr

' Instantiate IronTesseract
Dim ocrTesseract As New IronTesseract()

' Enable render as hOCR (must be set before calling Read)
ocrTesseract.Configuration.RenderHocr = True

' Add image
Dim imageInput = New OcrImageInput("Potter.tiff")
imageInput.Title = "Html Title"

' Perform OCR
Dim ocrResult As OcrResult = ocrTesseract.Read(imageInput)

' Export as HTML
ocrResult.SaveAsHocrFile("result.html")

The OcrInput class provides extensive options for preparing images before OCR processing. You can apply filters, specify regions of interest, and handle various input formats including multi-page TIFF files. The same hOCR export methods apply when extracting text from PDF documents.
Why Does Setting RenderHocr Matter?
Setting the RenderHocr property to true instructs IronOCR to generate the necessary hOCR structure during the OCR process. Without this configuration, the SaveAsHocrFile and SaveAsHocrString methods won't produce properly formatted hOCR output with layout preservation. This configuration must be set before calling the Read method, as it affects how the Tesseract engine processes and structures the output data.
The hOCR format preserves crucial metadata including:
- Character-level bounding boxes
- Word confidence scores
- Line and paragraph structure
- Page dimensions and DPI information
- Font characteristics when detectable
This metadata is particularly useful when implementing computer vision workflows or building systems that need to understand document structure beyond simple text extraction.
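The word confidence scores listed above can be read back out of the markup. The following is a minimal sketch, assuming each word element carries an x_wconf property in its title attribute, as Tesseract-generated hOCR commonly does; whether IronOCR emits it in exactly this form is an assumption, not something this article confirms. HocrConfidence and TryGetWordConfidence are hypothetical helper names, not part of IronOCR.

```csharp
using System.Globalization;
using System.Text.RegularExpressions;

// Hypothetical helper (not an IronOCR API): reads the x_wconf value from an
// hOCR title attribute such as "bbox 157 114 294 161; x_wconf 96".
public static class HocrConfidence
{
    private static readonly Regex WconfPattern =
        new Regex(@"\bx_wconf\s+(\d+(?:\.\d+)?)", RegexOptions.Compiled);

    // Returns false when the title attribute has no x_wconf property.
    public static bool TryGetWordConfidence(string title, out double confidence)
    {
        confidence = 0;
        Match match = WconfPattern.Match(title ?? string.Empty);
        if (!match.Success)
        {
            return false;
        }

        confidence = double.Parse(match.Groups[1].Value, CultureInfo.InvariantCulture);
        return true;
    }
}
```

A caller could use the returned value to flag words below a chosen threshold for manual review.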
What File Types Support hOCR Export?
IronOCR supports hOCR export from various image formats including TIFF, PNG, JPEG, BMP, and GIF. PDF documents can also be processed and exported as hOCR, with each page's text and layout information preserved in the HTML structure. The library handles both single-page images and multi-page documents seamlessly.
For optimal results with different file types:
- TIFF: Ideal for scanned documents, supports multi-page processing
- PDF: Excellent for mixed content (text and images)
- PNG/JPEG: Best for photographs or screenshots requiring OCR
- BMP: Uncompressed format suitable for high-quality scans
When dealing with specialized document types like passports or license plates, the hOCR format helps preserve the spatial relationships between different text elements, making it easier to extract specific fields based on their location.
How Can I Export OCR Results as HTML Strings?
Using the same TIFF sample image and the ocrResult object from the previous example, call the SaveAsHocrString method to get the hOCR markup as an HTML string instead of writing it to a file.
// Export as HTML string
string hocr = ocrResult.SaveAsHocrString();

' Export as HTML string
Dim hocr As String = ocrResult.SaveAsHocrString()

The string output contains complete hOCR markup that can be further processed, stored in databases, or integrated into web applications. This approach is particularly useful when building searchable PDF systems or implementing custom document indexing solutions. For developers working with 125 international languages, the hOCR format preserves language-specific text attributes and reading direction information.
When Should I Use String Output Instead of Files?
String output is ideal when you need to process or manipulate the hOCR data in memory, integrate with web services, or store results in a database. This approach avoids file system dependencies and enables dynamic HTML generation for web applications. Common use cases include:
- Web API Integration: Return hOCR data directly in API responses
- Database Storage: Store OCR results with document metadata
- Real-time Processing: Process results without disk I/O overhead
- Cloud Functions: Work within serverless environments with limited file access
- Content Management Systems: Integrate OCR results into existing document workflows
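As one illustration of the database storage case, hOCR markup is repetitive HTML and tends to compress well. The sketch below uses only standard .NET APIs to gzip an hOCR string into a byte array suitable for a binary column and to restore it later. HocrStorage is a hypothetical helper name, and the choice of gzip is an assumption made for the example, not something IronOCR requires.

```csharp
using System.IO;
using System.IO.Compression;
using System.Text;

// Hypothetical helper (not an IronOCR API): packs an hOCR string for storage.
public static class HocrStorage
{
    // Encodes the markup as UTF-8 and gzips it into a byte array.
    public static byte[] Compress(string hocr)
    {
        byte[] raw = Encoding.UTF8.GetBytes(hocr);
        using var buffer = new MemoryStream();
        using (var gzip = new GZipStream(buffer, CompressionLevel.Optimal, leaveOpen: true))
        {
            gzip.Write(raw, 0, raw.Length);
        } // Disposing the GZipStream flushes the final compressed bytes.
        return buffer.ToArray();
    }

    // Reverses Compress and returns the original markup.
    public static string Decompress(byte[] compressed)
    {
        using var input = new MemoryStream(compressed);
        using var gzip = new GZipStream(input, CompressionMode.Decompress);
        using var reader = new StreamReader(gzip, Encoding.UTF8);
        return reader.ReadToEnd();
    }
}
```

The string returned by SaveAsHocrString can be passed to Compress directly, so no temporary file is involved.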
For applications requiring progress tracking, string output allows immediate processing of partial results as they become available. This is particularly beneficial when implementing multithreaded OCR processing where multiple documents are processed concurrently.
How Do I Process Multiple Pages to HTML Strings?
When working with multi-page documents, SaveAsHocrString consolidates all pages into a single HTML string with proper page divisions. Each page's content is wrapped in appropriate hOCR elements, maintaining the document structure and page boundaries.
using IronOcr;

// Processing multi-page documents
using var multiPageInput = new OcrPdfInput("multi-page-document.pdf");
multiPageInput.Title = "Multi-Page Document";

// Configure for hOCR output
IronTesseract tesseract = new IronTesseract();
tesseract.Configuration.RenderHocr = true;

// Read all pages
OcrResult result = tesseract.Read(multiPageInput);

// Export as single HTML string with all pages
string fullHocr = result.SaveAsHocrString();

// Or process page by page
foreach (var page in result.Pages)
{
    string pageHocr = page.SaveAsHocrString();
    // Process individual page hOCR
}

This approach also works with PDF streams and supports advanced scenarios like processing specific page ranges or applying different OCR configurations to different pages.
Advanced hOCR Implementation Tips
What Are Best Practices for hOCR Output Quality?
To maximize the quality of your hOCR output, consider applying image optimization filters before processing:
using IronOcr;

using var input = new OcrImageInput("document.png");
input.DeNoise(); // Remove image noise
input.Deskew(); // Correct image rotation
input.Scale(2); // Upscale for better recognition

IronTesseract ocr = new IronTesseract();
ocr.Configuration.RenderHocr = true;
var result = ocr.Read(input);

For low-quality scans, additional preprocessing steps can significantly improve hOCR accuracy. The filter wizard helps determine optimal filter combinations for your specific document types.
How Does hOCR Structure Support Advanced Processing?
The generated hOCR follows the standard specification with nested div elements representing the document hierarchy:
<div class='ocr_page' title='bbox 0 0 2480 3508'>
  <div class='ocr_carea' title='bbox 156 114 2324 3395'>
    <p class='ocr_par' title='bbox 157 114 2323 164'>
      <span class='ocr_line' title='bbox 157 114 2323 164'>
        <span class='ocr_word' title='bbox 157 114 294 161'>Hello</span>
        <span class='ocr_word' title='bbox 334 119 483 161'>World</span>
      </span>
    </p>
  </div>
</div>

This structure enables precise text location extraction and advanced document analysis capabilities, making it valuable for applications requiring spatial text relationships or layout preservation. When working with table extraction, the hOCR format helps maintain the tabular structure and cell relationships.
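Because the hierarchy is expressed through class attributes, it can be walked with an ordinary XML parser. The following is a minimal sketch using System.Xml.Linq, assuming the input is a well-formed XML fragment like the sample markup; a complete hOCR document with HTML entities or non-XML constructs may need an HTML parser instead. HocrLineReader is a hypothetical helper name, not part of IronOCR.

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

// Hypothetical helper (not an IronOCR API): rebuilds the text of each
// ocr_line from its ocr_word descendants in a well-formed hOCR fragment.
public static class HocrLineReader
{
    public static List<string> ReadLines(string hocrFragment)
    {
        XElement root = XElement.Parse(hocrFragment);

        return root.DescendantsAndSelf()
            // Select every element tagged as a text line.
            .Where(element => (string)element.Attribute("class") == "ocr_line")
            // Join the words inside each line with single spaces.
            .Select(line => string.Join(" ",
                line.Descendants()
                    .Where(word => (string)word.Attribute("class") == "ocr_word")
                    .Select(word => word.Value.Trim())))
            .ToList();
    }
}
```

Matching on the class attribute rather than the element name keeps the sketch independent of whether lines are rendered as span or div elements.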
The bbox (bounding box) attributes contain coordinates in the format "bbox left top right bottom", providing pixel-precise location data for each text element. This information is crucial for:
- Creating interactive document viewers with text selection
- Implementing redaction systems that preserve layout
- Building accessibility tools that maintain reading order
- Developing document comparison systems
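To use these coordinates in code, the bbox property has to be pulled out of the title attribute first. The sketch below parses the "bbox left top right bottom" format described above with a regular expression. HocrBox and HocrBoxParser are hypothetical names introduced for illustration and are not IronOCR types.

```csharp
using System.Globalization;
using System.Text.RegularExpressions;

// Hypothetical value type for illustration; not an IronOCR type.
public readonly struct HocrBox
{
    public HocrBox(int left, int top, int right, int bottom)
    {
        Left = left;
        Top = top;
        Right = right;
        Bottom = bottom;
    }

    public int Left { get; }
    public int Top { get; }
    public int Right { get; }
    public int Bottom { get; }
    public int Width => Right - Left;
    public int Height => Bottom - Top;
}

public static class HocrBoxParser
{
    private static readonly Regex BboxPattern =
        new Regex(@"\bbbox\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)", RegexOptions.Compiled);

    // Returns false when the title attribute contains no bbox property.
    public static bool TryParse(string title, out HocrBox box)
    {
        box = default;
        Match match = BboxPattern.Match(title ?? string.Empty);
        if (!match.Success)
        {
            return false;
        }

        int Read(int group) =>
            int.Parse(match.Groups[group].Value, CultureInfo.InvariantCulture);

        box = new HocrBox(Read(1), Read(2), Read(3), Read(4));
        return true;
    }
}
```

For the word Hello in the sample markup, the title "bbox 157 114 294 161" yields a box 137 pixels wide and 47 pixels high, which is the rectangle a viewer would highlight or a redaction tool would cover.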
For developers requiring even more detailed configuration options, the Tesseract detailed configuration guide provides advanced settings that affect hOCR output quality and structure.
Frequently Asked Questions
What is hOCR and why is it useful for OCR applications?
hOCR (HTML-based OCR) is a file format that represents OCR results in structured HTML, storing both recognized text and spatial information like character coordinates. IronOCR supports hOCR export, which is valuable for applications requiring text position data, document indexing, accessibility tools, and maintaining layout integrity when processing scanned documents.
How do I enable hOCR output in my C# OCR application?
To enable hOCR output with IronOCR, set the Configuration.RenderHocr property to true on your IronTesseract instance. This tells IronOCR to prepare the OCR results in hOCR format, allowing you to export them using SaveAsHocrFile or SaveAsHocrString methods.
What methods are available for exporting hOCR results?
IronOCR provides two methods for exporting hOCR results: SaveAsHocrFile() which saves the output directly to an HTML file on disk, and SaveAsHocrString() which returns the hOCR HTML as a string for further processing or storage in your application.
Can I export OCR results as hOCR with just one line of code?
Yes, IronOCR allows one-line hOCR export using method chaining. You can create an IronTesseract instance with RenderHocr enabled, read your input, and call SaveAsHocrString() all in a single statement: var hocr = new IronTesseract { Configuration = { RenderHocr = true } }.Read(new OcrInput("image.png")).SaveAsHocrString();
What type of spatial information does hOCR preserve from OCR results?
hOCR preserves layout information and coordinates of each recognized character within the original image or document. IronOCR's hOCR export maintains this spatial data, enabling features like text highlighting for debugging, creating clickable overlays on images, and understanding where text appears on the original document.







