How to Read Tables in Documents with C#
IronOCR lets C# developers extract data from tables in PDFs and images using machine learning models. It handles simple tables with basic cells through standard table detection, and complex structures such as invoices with merged cells through the ReadDocumentAdvanced method.
Extracting data from tables using plain Tesseract can be challenging because text often resides in cells and is sparsely scattered across the document. However, our library includes a machine learning model trained and fine-tuned for detecting and extracting table data accurately. Whether processing financial reports, inventory lists, or invoice data, IronOCR provides the tools to parse structured data efficiently.
For simple tables, rely on straightforward table detection using the standard OcrInput class. For more complex structures, our exclusive ReadDocumentAdvanced method provides robust results, effectively parsing tables and delivering data. This advanced method leverages machine learning to understand table layouts, merged cells, and complex formatting that traditional OCR often struggles with.
Quickstart: Extract Complex Table Cells in One Call
Get up and running in minutes—this example shows how a single IronOCR call using ReadDocumentAdvanced gives you detailed table cell data from a complex document. It demonstrates ease of use by loading a PDF, applying advanced table detection, and returning a list of cell information directly.
Install IronOCR with NuGet Package Manager, then copy and run this code snippet:
var cells = new IronTesseract().ReadDocumentAdvanced(new OcrInput().LoadPdf("invoiceTable.pdf")).Tables.First().CellInfos;
The following steps guide you in getting started with reading tables using IronOCR:
Minimal Workflow (5 steps)
- Download a C# library to extract data from tables
- Prepare the image and PDF document for extraction
- Set the ReadDataTables property to true to enable table detection
- Use the ReadDocumentAdvanced method for complex tables
- Extract the data detected by these methods
How Do I Extract Data from Simple Tables?
Setting the ReadDataTables property to true enables table detection using Tesseract. This approach works well for basic tables with clear cell boundaries and no merged cells; the example below uses a simple test table, 'simple-table.pdf'. For more complex tables, use the ReadDocumentAdvanced method described in the next section.
The standard table detection method is particularly effective for:
- Spreadsheet exports
- Basic data tables with consistent row/column structure
- Reports with tabular data
- Simple inventory lists
If working with PDF OCR text extraction in general, this method integrates seamlessly with IronOCR's broader document processing capabilities.
using IronOcr;
using System;
using System.Data;
// Instantiate OCR engine
var ocr = new IronTesseract();
// Enable table detection
ocr.Configuration.ReadDataTables = true;
using var input = new OcrPdfInput("simple-table.pdf");
var result = ocr.Read(input);
// Retrieve the data
var table = result.Tables[0].DataTable;
// Print out the table data
foreach (DataRow row in table.Rows)
{
    foreach (var item in row.ItemArray)
    {
        Console.Write(item + "\t");
    }
    Console.WriteLine();
}

Imports Microsoft.VisualBasic
Imports IronOcr
Imports System
Imports System.Data
' Instantiate OCR engine
Dim ocr = New IronTesseract()
' Enable table detection
ocr.Configuration.ReadDataTables = True
Dim input = New OcrPdfInput("simple-table.pdf")
Dim result = ocr.Read(input)
' Retrieve the data
Dim table = result.Tables(0).DataTable
' Print out the table data
For Each row As DataRow In table.Rows
    For Each item In row.ItemArray
        Console.Write(item & vbTab)
    Next item
    Console.WriteLine()
Next row

How Can I Read Complex Invoice Tables?
Invoices are among the most common complex tables in business settings. They contain rows and columns of data, often with merged cells, varying column widths, and nested structures. IronOCR handles them with the ReadDocumentAdvanced method, which scans the document, identifies the table structure, and extracts the data. This example uses the 'invoiceTable.pdf' file to show how IronOCR retrieves all information from the invoice.
The ReadDocumentAdvanced method requires the IronOcr.Extensions.AdvancedScan package to be installed alongside the base IronOCR package. This extension provides advanced machine learning capabilities specifically trained for complex document layouts.
using IronOcr;
using System.Linq;
// Instantiate OCR engine
var ocr = new IronTesseract();
using var input = new OcrInput();
input.LoadPdf("invoiceTable.pdf");
// Perform OCR
var result = ocr.ReadDocumentAdvanced(input);
var cellList = result.Tables.First().CellInfos;

This method separates the text data of the document into two categories: one enclosed by borders and another without borders. For the bordered content, the library further divides it into subsections based on the table's structure. The method excels at handling:
- Invoice line items with varying descriptions
- Multi-column price breakdowns
- Shipping and billing address blocks
- Tax and total calculation sections
- Header and footer information
The results are shown below. Since this method focuses on information enclosed by borders, any merged cells spanning multiple rows will be treated as a single cell.
What Does the Extracted Data Look Like?

How Do I Organize and Process the Extracted Table Cells?
In the current implementation, the extracted cells are not yet organized into rows and columns. However, each cell contains valuable information such as X and Y coordinates, dimensions, and more. Using this data, we can create a helper class that sorts the cells and groups them into rows. The cell information includes:
- Precise X/Y coordinates for positioning
- Width and height dimensions
- Text content
- Confidence scores
- Cell relationships
This detailed information enables you to reconstruct the table structure programmatically and apply custom logic for data extraction. You can also use these coordinates to define specific regions for targeted OCR processing in subsequent operations.
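To make the idea of reconstructing rows from coordinates concrete, here is a minimal, self-contained sketch. It does not use IronOCR types: `Cell` is a hypothetical record standing in for a detected cell, and `RowGrouping`, `GroupIntoRows`, and the `tolerance` parameter are illustrative names, not part of the library. It assumes C# 9 or later for the record syntax. The helper methods in this section sort by Y and then by X, which can put cells out of left-to-right order when the tops of cells in the same row differ by a pixel or two; the sketch assigns cells to rows first and sorts by X afterwards.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-in for a detected table cell; NOT an IronOCR type.
public record Cell(int X, int Y, int Width, int Height, string Text);

public static class RowGrouping
{
    // Groups cells into rows. A cell joins the current row when its top edge
    // lies within (tolerance * height of the row's first cell) of that row's top edge.
    public static List<List<Cell>> GroupIntoRows(IEnumerable<Cell> cells, double tolerance = 0.5)
    {
        var rows = new List<List<Cell>>();

        // Walk the cells from top to bottom
        foreach (var cell in cells.OrderBy(c => c.Y))
        {
            var current = rows.LastOrDefault();
            if (current != null && Math.Abs(cell.Y - current[0].Y) <= current[0].Height * tolerance)
            {
                current.Add(cell);                 // same row
            }
            else
            {
                rows.Add(new List<Cell> { cell }); // start a new row
            }
        }

        // Order each row left to right only after row membership is fixed
        foreach (var row in rows)
        {
            row.Sort((a, b) => a.X.CompareTo(b.X));
        }

        return rows;
    }
}
```

The tolerance is measured against the first cell of each row rather than the previous cell, so small offsets do not accumulate across a row. The default of 0.5 is an arbitrary starting point and would need tuning for documents with tightly packed or uneven rows.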
Below are some basic helper methods:
using System;
using System.Collections.Generic;
using System.Linq;
// A helper class to process table data by sorting cells based on coordinates
public static class TableProcessor
{
    // Method to organize cells by their coordinates (Y top to bottom, X left to right)
    public static List<CellInfo> OrganizeCellsByCoordinates(List<CellInfo> cells)
    {
        // Sort cells by Y (top to bottom), then by X (left to right)
        var sortedCells = cells
            .OrderBy(cell => cell.CellRect.Y)
            .ThenBy(cell => cell.CellRect.X)
            .ToList();
        return sortedCells;
    }
    // Example method demonstrating how to process multiple tables
    public static void ProcessTables(Tables tables)
    {
        foreach (var table in tables)
        {
            var sortedCells = OrganizeCellsByCoordinates(table.CellInfos);
            Console.WriteLine("Organized Table Cells:");
            // Initialize previous Y coordinate
            int previousY = sortedCells.Any() ? sortedCells.First().CellRect.Y : 0;
            foreach (var cell in sortedCells)
            {
                // Print a new line if the Y-coordinate changes, indicating a new row
                if (Math.Abs(cell.CellRect.Y - previousY) > cell.CellRect.Height * 0.8)
                {
                    Console.WriteLine(); // Start a new row
                    previousY = cell.CellRect.Y;
                }
                // Print the cell text followed by a tab
                Console.Write($"{cell.CellText}\t");
            }
            Console.WriteLine("\n--- End of Table ---"); // End of a table
        }
    }
    // Method to extract a specific row by the given index
    public static List<CellInfo> ExtractRowByIndex(TableInfo table, int rowIndex)
    {
        if (table == null || table.CellInfos == null || !table.CellInfos.Any())
        {
            throw new ArgumentException("Table is empty or invalid.");
        }
        var sortedCells = OrganizeCellsByCoordinates(table.CellInfos);
        List<List<CellInfo>> rows = new List<List<CellInfo>>();
        // Group cells into rows based on Y coordinates
        int previousY = sortedCells.First().CellRect.Y;
        List<CellInfo> currentRow = new List<CellInfo>();
        foreach (var cell in sortedCells)
        {
            if (Math.Abs(cell.CellRect.Y - previousY) > cell.CellRect.Height * 0.8)
            {
                // Store the completed row and start a new one
                rows.Add(new List<CellInfo>(currentRow));
                currentRow.Clear();
                previousY = cell.CellRect.Y;
            }
            currentRow.Add(cell);
        }
        // Add the last row if it wasn't added yet
        if (currentRow.Any())
        {
            rows.Add(currentRow);
        }
        // Retrieve the specified row
        if (rowIndex < 0 || rowIndex >= rows.Count)
        {
            throw new IndexOutOfRangeException($"Row index {rowIndex} is out of range.");
        }
        return rows[rowIndex];
    }
}

Imports Microsoft.VisualBasic
Imports System
Imports System.Collections.Generic
Imports System.Linq
' A helper class to process table data by sorting cells based on coordinates
Public Module TableProcessor
    ' Method to organize cells by their coordinates (Y top to bottom, X left to right)
    Public Function OrganizeCellsByCoordinates(ByVal cells As List(Of CellInfo)) As List(Of CellInfo)
        ' Sort cells by Y (top to bottom), then by X (left to right)
        Dim sortedCells = cells.OrderBy(Function(cell) cell.CellRect.Y).ThenBy(Function(cell) cell.CellRect.X).ToList()
        Return sortedCells
    End Function
    ' Example method demonstrating how to process multiple tables
    Public Sub ProcessTables(ByVal tables As Tables)
        For Each table In tables
            Dim sortedCells = OrganizeCellsByCoordinates(table.CellInfos)
            Console.WriteLine("Organized Table Cells:")
            ' Initialize previous Y coordinate
            Dim previousY As Integer = If(sortedCells.Any(), sortedCells.First().CellRect.Y, 0)
            For Each cell In sortedCells
                ' Print a new line if the Y-coordinate changes, indicating a new row
                If Math.Abs(cell.CellRect.Y - previousY) > cell.CellRect.Height * 0.8 Then
                    Console.WriteLine() ' Start a new row
                    previousY = cell.CellRect.Y
                End If
                ' Print the cell text followed by a tab
                Console.Write($"{cell.CellText}" & vbTab)
            Next cell
            Console.WriteLine(vbLf & "--- End of Table ---") ' End of a table
        Next table
    End Sub
    ' Method to extract a specific row by the given index
    Public Function ExtractRowByIndex(ByVal table As TableInfo, ByVal rowIndex As Integer) As List(Of CellInfo)
        If table Is Nothing OrElse table.CellInfos Is Nothing OrElse Not table.CellInfos.Any() Then
            Throw New ArgumentException("Table is empty or invalid.")
        End If
        Dim sortedCells = OrganizeCellsByCoordinates(table.CellInfos)
        Dim rows As New List(Of List(Of CellInfo))()
        ' Group cells into rows based on Y coordinates
        Dim previousY As Integer = sortedCells.First().CellRect.Y
        Dim currentRow As New List(Of CellInfo)()
        For Each cell In sortedCells
            If Math.Abs(cell.CellRect.Y - previousY) > cell.CellRect.Height * 0.8 Then
                ' Store the completed row and start a new one
                rows.Add(New List(Of CellInfo)(currentRow))
                currentRow.Clear()
                previousY = cell.CellRect.Y
            End If
            currentRow.Add(cell)
        Next cell
        ' Add the last row if it wasn't added yet
        If currentRow.Any() Then
            rows.Add(currentRow)
        End If
        ' Retrieve the specified row
        If rowIndex < 0 OrElse rowIndex >= rows.Count Then
            Throw New IndexOutOfRangeException($"Row index {rowIndex} is out of range.")
        End If
        Return rows(rowIndex)
    End Function
End Module

Best Practices for Table Extraction
When working with table extraction in IronOCR, consider these best practices:
- Document Quality: Higher resolution documents yield better results. For scanned documents, ensure a minimum of 300 DPI.
- Pre-processing: For documents with poor quality or skewed tables, consider using IronOCR's image correction features before processing.
- Performance: For large documents with multiple tables, consider using multithreading and async support to process pages in parallel.
- Output Options: After extracting table data, you can export results in various formats. Learn more about data output options and how to create searchable PDFs from your processed documents.
- Stream Processing: For web applications or scenarios working with in-memory documents, consider using OCR for PDF streams to avoid file system operations.
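As one illustration of the output options above, a System.Data.DataTable such as the one returned by `result.Tables[0].DataTable` in the simple-table example can be written out as CSV with the standard library alone. The sketch below is an assumption-laden example, not an IronOCR feature: `DataTableCsv`, `ToCsv`, and `Escape` are hypothetical helper names, and column headers are deliberately not written.

```csharp
using System;
using System.Data;
using System.Linq;
using System.Text;

// Hypothetical helper; not part of IronOCR.
public static class DataTableCsv
{
    // Quote a field only when it contains a comma, double quote, or line break,
    // doubling any embedded double quotes (RFC 4180 style).
    private static string Escape(object value)
    {
        var text = value?.ToString() ?? string.Empty;
        bool needsQuotes = text.IndexOfAny(new[] { ',', '"', '\r', '\n' }) >= 0;
        return needsQuotes ? "\"" + text.Replace("\"", "\"\"") + "\"" : text;
    }

    // Returns one CSV line per DataRow; column headers are not written.
    public static string ToCsv(DataTable table)
    {
        var sb = new StringBuilder();
        foreach (DataRow row in table.Rows)
        {
            sb.AppendLine(string.Join(",", row.ItemArray.Select(Escape)));
        }
        return sb.ToString();
    }
}
```

The returned string can be passed to `File.WriteAllText` or streamed to a web response. Quoting matters for OCR output in particular, because recognized cell text often contains commas (for example in prices or addresses).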
Summary
IronOCR provides powerful table extraction capabilities through both standard Tesseract-based detection and advanced machine learning methods. The standard approach works well for simple tables, while the ReadDocumentAdvanced method excels at complex documents like invoices. With the helper methods provided, you can organize and process the extracted data to suit your specific needs.
Explore more IronOCR features to enhance your document processing workflows and leverage the full potential of optical character recognition in your .NET applications.
Frequently Asked Questions
How can I extract table data from PDFs and images in C#?
IronOCR enables C# developers to extract table data from PDFs and images using advanced machine learning models. For simple tables, set the ReadDataTables property to true on the IronTesseract configuration and read the document through an OCR input. For complex tables with merged cells, use the ReadDocumentAdvanced method for more accurate results.
What's the difference between simple and complex table extraction?
Simple table extraction in IronOCR uses the ReadDataTables property with Tesseract and works well for basic tables with clear cell boundaries. Complex table extraction requires the ReadDocumentAdvanced method, which uses machine learning to handle merged cells, invoices, and complex formatting.
How do I quickly extract data from complex tables?
Use IronOCR's ReadDocumentAdvanced method in a single call: var cells = new IronTesseract().ReadDocumentAdvanced(new OcrInput().LoadPdf("invoiceTable.pdf")).Tables.First().CellInfos; This leverages machine learning to understand table layouts and complex formatting.
What types of documents work best with simple table detection?
IronOCR's simple table detection method works particularly well with spreadsheet exports, basic data tables with consistent row/column structure, reports with tabular data, and simple inventory lists without merged cells.
How do I enable table detection for basic tables?
To enable table detection in IronOCR for basic tables, set the ReadDataTables property to true. This uses Tesseract's table detection capabilities and works well for tables with clear cell boundaries and no merged cells.
Can the library handle invoices and financial reports with complex layouts?
Yes, IronOCR's ReadDocumentAdvanced method is specifically designed to handle complex documents like invoices and financial reports. It uses machine learning models trained to detect and extract data from tables with merged cells and complex formatting.







