How to Read Tables in Documents
Let’s talk about reading tables in documents. Extracting data from tables using plain Tesseract can be challenging, as the text often resides in cells and is sparsely scattered across the document. However, our library is equipped with a machine learning model that has been trained and fine-tuned for detecting and extracting table data accurately.
For simple tables, you can rely on straightforward table detection, while for more complex structures, our exclusive ReadDocumentAdvanced method provides robust results, effectively parsing the table and delivering data.
The following steps guide you in getting started with reading tables using IronOCR:
- Download a C# library to extract data from tables
- Prepare the image and PDF document for extraction
- Set the ReadDataTables property to true to enable table detection
- Use the ReadDocumentAdvanced method for complex tables
- Extract the data detected by these methods
Simple Table Example
Setting the ReadDataTables property to true enables table detection using Tesseract. I created a simple table PDF to test this feature, which you can download here: 'simple-table.pdf'. Simple tables without merged cells can be detected using this method. For more complex tables, please refer to the method described below.
using IronOcr;
using System;
using System.Data;

// This C# code uses the IronOCR library to read a table from a PDF file
// and display its contents as text in the console.
class Program
{
    static void Main()
    {
        // Instantiate the IronTesseract OCR engine, which is used for optical character recognition.
        var ocr = new IronTesseract();

        // Enable table detection so that the OCR engine can recognize tabular data within the PDF.
        ocr.Configuration.ReadDataTables = true;

        // Using a 'using' statement to ensure that resources are automatically disposed of
        // once the operation is complete. Here, we're loading the PDF file 'simple-table.pdf'.
        using var input = new OcrPdfInput("simple-table.pdf");

        // Perform the OCR operation on the input file to retrieve the OCR results.
        var result = ocr.Read(input);

        // Ensure that there is at least one table detected and available.
        if (result.Tables.Count > 0)
        {
            // Retrieve the first table recognized in the PDF as a DataTable object.
            // This object allows easy manipulation and retrieval of tabular data.
            var table = result.Tables[0].DataTable;

            // Iterate through each row of the table and print out all columns separated by tabs.
            // This is to display the table data in a format that visually represents columns in the console.
            foreach (DataRow row in table.Rows)
            {
                foreach (var item in row.ItemArray)
                {
                    Console.Write(item + "\t"); // Print each cell in the row followed by a tab
                }
                Console.WriteLine(); // Move to the next line after each row
            }
        }
        else
        {
            Console.WriteLine("No tables detected in the PDF.");
        }
    }
}
Imports Microsoft.VisualBasic
Imports IronOcr
Imports System
Imports System.Data

' This VB.NET code uses the IronOCR library to read a table from a PDF file
' and display its contents as text in the console.
Friend Class Program
    Shared Sub Main()
        ' Instantiate the IronTesseract OCR engine, which is used for optical character recognition.
        Dim ocr = New IronTesseract()

        ' Enable table detection so that the OCR engine can recognize tabular data within the PDF.
        ocr.Configuration.ReadDataTables = True

        ' A Using block ensures that resources are automatically disposed of
        ' once the operation is complete. Here, we're loading the PDF file 'simple-table.pdf'.
        Using input = New OcrPdfInput("simple-table.pdf")
            ' Perform the OCR operation on the input file to retrieve the OCR results.
            Dim result = ocr.Read(input)

            ' Ensure that there is at least one table detected and available.
            If result.Tables.Count > 0 Then
                ' Retrieve the first table recognized in the PDF as a DataTable object.
                ' This object allows easy manipulation and retrieval of tabular data.
                Dim table = result.Tables(0).DataTable

                ' Iterate through each row of the table and print out all columns separated by tabs.
                ' This is to display the table data in a format that visually represents columns in the console.
                For Each row As DataRow In table.Rows
                    For Each item In row.ItemArray
                        Console.Write(item.ToString() & vbTab) ' Print each cell in the row followed by a tab
                    Next item
                    Console.WriteLine() ' Move to the next line after each row
                Next row
            Else
                Console.WriteLine("No tables detected in the PDF.")
            End If
        End Using
    End Sub
End Class
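Once the table is available as a DataTable, it can be handed to ordinary .NET code. The following is a minimal sketch, not part of IronOCR, that turns a DataTable into CSV text. The names DataTableCsv, Escape, and ToCsv are hypothetical, and the table is built by hand here so the sketch runs without the library; in practice it would be the value read from result.Tables[0].DataTable. Only the rows are written, because how a detected header row maps to column names is not covered in this article.

```csharp
using System;
using System.Data;
using System.Linq;
using System.Text;

// Hypothetical helper (not an IronOCR API): converts a DataTable to CSV text.
public static class DataTableCsv
{
    // Quote a field only when it contains a comma, a double quote, or a line break,
    // doubling any embedded double quotes (the usual CSV convention).
    public static string Escape(string field)
    {
        if (field.IndexOfAny(new[] { ',', '"', '\n', '\r' }) < 0)
        {
            return field;
        }
        return "\"" + field.Replace("\"", "\"\"") + "\"";
    }

    // Write each DataRow as one comma-separated line.
    public static string ToCsv(DataTable table)
    {
        var sb = new StringBuilder();
        foreach (DataRow row in table.Rows)
        {
            var fields = row.ItemArray.Select(item => Escape(item?.ToString() ?? string.Empty));
            sb.AppendLine(string.Join(",", fields));
        }
        return sb.ToString();
    }
}

public static class CsvDemo
{
    public static void Main()
    {
        // Stand-in data (assumption); replace with result.Tables[0].DataTable.
        var table = new DataTable();
        table.Columns.Add("Column1");
        table.Columns.Add("Column2");
        table.Rows.Add("Item", "Price");
        table.Rows.Add("Widget, large", "4.50");

        Console.Write(DataTableCsv.ToCsv(table));
    }
}
```

Quoting only the fields that need it keeps the output readable while still surviving cells that contain commas, which OCR'd tables with prices or addresses often do.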
Complex Table Example
For complex tables, the ReadDocumentAdvanced method is the better choice. In this example, we’ll use the 'table.pdf' file.
Please note: The ReadDocumentAdvanced method requires the IronOcr.Extensions.AdvancedScan package to be installed alongside the base IronOCR package. Currently, this extension is only available on Windows.
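Since the extension is Windows-only, code that may run elsewhere could check the platform before taking the advanced path. The sketch below is an assumption-laden illustration rather than anything IronOCR provides: AdvancedScanGuard and its methods are hypothetical names, and the 64-bit check reflects the x64 requirement that the FAQ in this article states for .NET Framework projects, applied here to every runtime as a conservative choice.

```csharp
using System;
using System.Runtime.InteropServices;

// Hypothetical guard (not an IronOCR API) for deciding whether to attempt
// the advanced scan on the current machine.
public static class AdvancedScanGuard
{
    // Pure decision logic, kept separate so it can be tested on any platform.
    public static bool IsSupported(bool isWindows, bool is64BitProcess)
    {
        return isWindows && is64BitProcess;
    }

    // Evaluates the decision for the process that is currently running.
    public static bool IsSupportedHere()
    {
        return IsSupported(
            RuntimeInformation.IsOSPlatform(OSPlatform.Windows),
            Environment.Is64BitProcess);
    }
}
```

A caller might use ReadDocumentAdvanced when AdvancedScanGuard.IsSupportedHere() returns true and otherwise fall back to the ReadDataTables approach from the simple example.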
using IronOcr;
using System;
using System.Linq;

// This code performs optical character recognition (OCR) on a PDF document containing tables.
// It uses IronOcr's IronTesseract engine and its ReadDocumentAdvanced method to process
// the document, particularly focusing on table cell information.

// Instantiate the OCR engine
var ocr = new IronTesseract();

// Load the PDF input into OcrInput for processing
using (var input = new OcrInput())
{
    // Load PDF file into the input container
    input.LoadPdf("table.pdf");

    // Perform advanced OCR on the PDF document
    // (requires the IronOcr.Extensions.AdvancedScan package)
    var result = ocr.ReadDocumentAdvanced(input);

    // The document may contain no tables, so check before reading the first one
    if (result.Tables.Any())
    {
        // Extract the first table's cell information
        var cellList = result.Tables.First().CellInfos;

        // Process the extracted cell data as needed
        foreach (var cell in cellList)
        {
            // Example: Output each cell text to console
            Console.WriteLine(cell.CellText);
        }
    }
    else
    {
        Console.WriteLine("No tables found in the document.");
    }
}
Imports IronOcr
Imports System
Imports System.Linq

' This code performs optical character recognition (OCR) on a PDF document containing tables.
' It uses IronOcr's IronTesseract engine and its ReadDocumentAdvanced method to process
' the document, particularly focusing on table cell information.
Module Program
    Sub Main()
        ' Instantiate the OCR engine
        Dim ocr = New IronTesseract()

        ' Load the PDF input into OcrInput for processing
        Using input = New OcrInput()
            ' Load PDF file into the input container
            input.LoadPdf("table.pdf")

            ' Perform advanced OCR on the PDF document
            ' (requires the IronOcr.Extensions.AdvancedScan package)
            Dim result = ocr.ReadDocumentAdvanced(input)

            ' The document may contain no tables, so check before reading the first one
            If result.Tables.Any() Then
                ' Extract the first table's cell information
                Dim cellList = result.Tables.First().CellInfos

                ' Process the extracted cell data as needed
                For Each cell In cellList
                    ' Example: Output each cell text to console
                    Console.WriteLine(cell.CellText)
                Next cell
            Else
                Console.WriteLine("No tables found in the document.")
            End If
        End Using
    End Sub
End Module
This method separates the text data of the document into two categories: one enclosed by borders and another without borders. For the bordered content, the library further divides it into subsections based on the table's structure. The results are shown below. It is important to note that, since this method focuses on information enclosed by borders, any merged cells spanning multiple rows will be treated as a single cell.
Result

Helper Class
In the current implementation, the extracted cells are not yet organized properly. However, each cell contains valuable information such as X and Y coordinates, dimensions, and more. Using this data, we can create a helper class for various purposes. Below are some basic helper methods:
using System;
using System.Collections.Generic;
using System.Linq;

// A helper class to process table data by sorting cells based on coordinates
public static class TableProcessor
{
    // Method to organize cells by their coordinates (Y top to bottom, X left to right)
    public static List<CellInfo> OrganizeCellsByCoordinates(List<CellInfo> cells)
    {
        // Sort cells by Y (top to bottom), then by X (left to right)
        var sortedCells = cells
            .OrderBy(cell => cell.CellRect.Y)
            .ThenBy(cell => cell.CellRect.X)
            .ToList();
        return sortedCells;
    }

    // Example method demonstrating how to process multiple tables
    public static void ProcessTables(Tables tables)
    {
        foreach (var table in tables)
        {
            var sortedCells = OrganizeCellsByCoordinates(table.CellInfos);
            Console.WriteLine("Organized Table Cells:");

            // Initialize previous Y coordinate
            int previousY = sortedCells.Any() ? sortedCells.First().CellRect.Y : 0;
            foreach (var cell in sortedCells)
            {
                // Print a new line if the Y-coordinate changes, indicating a new row
                if (Math.Abs(cell.CellRect.Y - previousY) > cell.CellRect.Height * 0.8)
                {
                    Console.WriteLine(); // Start a new row
                    previousY = cell.CellRect.Y;
                }
                // Print the cell text followed by a tab
                Console.Write($"{cell.CellText}\t");
            }
            Console.WriteLine("\n--- End of Table ---"); // End of a table
        }
    }

    // Method to extract a specific row by the given index
    public static List<CellInfo> ExtractRowByIndex(TableInfo table, int rowIndex)
    {
        if (table == null || table.CellInfos == null || !table.CellInfos.Any())
        {
            throw new ArgumentException("Table is empty or invalid.");
        }

        var sortedCells = OrganizeCellsByCoordinates(table.CellInfos);
        List<List<CellInfo>> rows = new List<List<CellInfo>>();

        // Group cells into rows based on Y coordinates
        int previousY = sortedCells.First().CellRect.Y;
        List<CellInfo> currentRow = new List<CellInfo>();
        foreach (var cell in sortedCells)
        {
            if (Math.Abs(cell.CellRect.Y - previousY) > cell.CellRect.Height * 0.8)
            {
                // Store the completed row and start a new one
                rows.Add(new List<CellInfo>(currentRow));
                currentRow.Clear();
                previousY = cell.CellRect.Y;
            }
            currentRow.Add(cell);
        }

        // Add the last row if it wasn't added yet
        if (currentRow.Any())
        {
            rows.Add(currentRow);
        }

        // Retrieve the specified row
        if (rowIndex < 0 || rowIndex >= rows.Count)
        {
            throw new IndexOutOfRangeException($"Row index {rowIndex} is out of range.");
        }
        return rows[rowIndex];
    }
}
Imports Microsoft.VisualBasic
Imports System
Imports System.Collections.Generic
Imports System.Linq

' A helper class to process table data by sorting cells based on coordinates
Public Module TableProcessor
    ' Method to organize cells by their coordinates (Y top to bottom, X left to right)
    Public Function OrganizeCellsByCoordinates(ByVal cells As List(Of CellInfo)) As List(Of CellInfo)
        ' Sort cells by Y (top to bottom), then by X (left to right)
        Dim sortedCells = cells.OrderBy(Function(cell) cell.CellRect.Y).ThenBy(Function(cell) cell.CellRect.X).ToList()
        Return sortedCells
    End Function

    ' Example method demonstrating how to process multiple tables
    Public Sub ProcessTables(ByVal tables As Tables)
        For Each table In tables
            Dim sortedCells = OrganizeCellsByCoordinates(table.CellInfos)
            Console.WriteLine("Organized Table Cells:")

            ' Initialize previous Y coordinate
            Dim previousY As Integer = If(sortedCells.Any(), sortedCells.First().CellRect.Y, 0)
            For Each cell In sortedCells
                ' Print a new line if the Y-coordinate changes, indicating a new row
                If Math.Abs(cell.CellRect.Y - previousY) > cell.CellRect.Height * 0.8 Then
                    Console.WriteLine() ' Start a new row
                    previousY = cell.CellRect.Y
                End If
                ' Print the cell text followed by a tab
                Console.Write($"{cell.CellText}" & vbTab)
            Next cell
            Console.WriteLine(vbLf & "--- End of Table ---") ' End of a table
        Next table
    End Sub

    ' Method to extract a specific row by the given index
    Public Function ExtractRowByIndex(ByVal table As TableInfo, ByVal rowIndex As Integer) As List(Of CellInfo)
        If table Is Nothing OrElse table.CellInfos Is Nothing OrElse Not table.CellInfos.Any() Then
            Throw New ArgumentException("Table is empty or invalid.")
        End If

        Dim sortedCells = OrganizeCellsByCoordinates(table.CellInfos)
        Dim rows As New List(Of List(Of CellInfo))()

        ' Group cells into rows based on Y coordinates
        Dim previousY As Integer = sortedCells.First().CellRect.Y
        Dim currentRow As New List(Of CellInfo)()
        For Each cell In sortedCells
            If Math.Abs(cell.CellRect.Y - previousY) > cell.CellRect.Height * 0.8 Then
                ' Store the completed row and start a new one
                rows.Add(New List(Of CellInfo)(currentRow))
                currentRow.Clear()
                previousY = cell.CellRect.Y
            End If
            currentRow.Add(cell)
        Next cell

        ' Add the last row if it wasn't added yet
        If currentRow.Any() Then
            rows.Add(currentRow)
        End If

        ' Retrieve the specified row
        If rowIndex < 0 OrElse rowIndex >= rows.Count Then
            Throw New IndexOutOfRangeException($"Row index {rowIndex} is out of range.")
        End If
        Return rows(rowIndex)
    End Function
End Module
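The row-grouping idea used by the helper can be tried without the OCR library. The following sketch uses a hypothetical stand-in type, SketchCell, in place of the library's cell objects, and applies the same threshold of 80% of the cell height to decide when a new row starts. It also adds one step that the helper above does not perform: each finished row is re-sorted by X, since cells in the same visual row can differ by a pixel or two in Y and would otherwise come out in the wrong left-to-right order.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-in for a detected cell: its text, the top-left corner
// of its bounding box, and the box height. Not an IronOCR type.
public sealed class SketchCell
{
    public SketchCell(string text, int x, int y, int height)
    {
        Text = text;
        X = x;
        Y = y;
        Height = height;
    }

    public string Text { get; }
    public int X { get; }
    public int Y { get; }
    public int Height { get; }
}

public static class RowGroupingSketch
{
    // Group cells into rows. A cell starts a new row when its Y differs from the
    // Y that opened the current row by more than 80% of the cell's own height.
    public static List<List<SketchCell>> GroupIntoRows(IEnumerable<SketchCell> cells)
    {
        var rows = new List<List<SketchCell>>();
        int rowY = 0;

        foreach (var cell in cells.OrderBy(c => c.Y).ThenBy(c => c.X))
        {
            if (rows.Count == 0 || Math.Abs(cell.Y - rowY) > cell.Height * 0.8)
            {
                rows.Add(new List<SketchCell>());
                rowY = cell.Y;
            }
            rows[rows.Count - 1].Add(cell);
        }

        // Re-sort every row left to right so small Y differences inside a row
        // do not affect the column order.
        return rows.Select(row => row.OrderBy(c => c.X).ToList()).ToList();
    }

    public static void Main()
    {
        var cells = new[]
        {
            new SketchCell("B1", 120, 12, 30), // 2 px higher than A1, still the same row
            new SketchCell("A1", 10, 14, 30),
            new SketchCell("A2", 10, 50, 30),
            new SketchCell("B2", 120, 50, 30),
        };

        // Print one tab-separated line per detected row.
        foreach (var row in GroupIntoRows(cells))
        {
            Console.WriteLine(string.Join("\t", row.Select(c => c.Text)));
        }
    }
}
```

With the sample data, A1 and B1 land in the first row and A2 and B2 in the second, even though B1 sorts ahead of A1 by Y. The 0.8 factor is carried over from the helper; tables with very uneven row heights may need a different tolerance.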
Frequently Asked Questions
What is the main challenge of extracting data from tables in documents?
Extracting data from tables using plain Tesseract can be challenging because the text often resides in cells and is sparsely scattered across the document.
How can table data extraction be improved in documents?
Using a specialized library equipped with a machine learning model trained for detecting and extracting table data accurately can provide robust results even for complex table structures.
What steps are involved in getting started with reading tables using a specialized library?
To get started, download the appropriate C# library, prepare your document, set the relevant properties to enable table detection, use advanced methods for complex tables, and extract the data detected by these methods.
What is the purpose of enabling table detection properties?
Enabling table detection properties allows the library to identify and extract data from simple tables that do not have merged cells.
What is the advantage of using advanced methods for table detection?
Advanced methods provide enhanced table detection capabilities, handling complex tables effectively and separating text data into bordered and non-bordered categories.
What additional packages might be required for advanced table detection methods?
Advanced table detection methods may require additional extensions or packages to be installed alongside the base library.
How can you ensure compatibility with .NET Framework when using advanced features?
To ensure compatibility with .NET Framework, the project must run on x64 architecture, and the 'Prefer 32-bit' option should be unchecked in the project configuration.
What helper methods are available for processing extracted table data?
Helper methods such as organizing cells by coordinates, processing multiple tables, and extracting specific rows by index can assist in effectively handling and organizing extracted table data.
What information do the extracted cells contain?
Extracted cells contain valuable information such as X and Y coordinates, dimensions, and the text within the cell.