How to Read Table in Documents
Let’s talk about reading tables in documents. Extracting data from tables using plain Tesseract can be challenging, as the text often resides in cells and is sparsely scattered across the document. However, our library is equipped with a machine learning model that has been trained and fine-tuned for detecting and extracting table data accurately.
For simple tables, you can rely on straightforward table detection, while for more complex structures, our exclusive ReadDocumentAdvanced
method provides robust results, effectively parsing the table and delivering data.
Get started with IronOCR
Start using IronOCR in your project today with a free trial.
How to Read Table in Documents
- Download a C# library to extract data from tables
- Prepare the image and PDF document for extraction
- Set the ReadDataTables property to true to enable table detection
- Use the
ReadDocumentAdvanced
method for complex tables - Extract the data detected by these methods
Simple Table Example
Setting the ReadDataTables property to true enables table detection using Tesseract. I created a simple table PDF to test this feature, which you can download here: 'simple-table.pdf.'. Simple tables without merged cells can be detected using this method. For more complex tables, please refer to the method described below.
:path=/static-assets/ocr/content-code-examples/how-to/read-table-in-document-with-tesseract.cs
using IronOcr;
using System;
using System.Data;
// Instantiate OCR engine
var ocr = new IronTesseract();
// Enable table detection
ocr.Configuration.ReadDataTables = true;
using var input = new OcrPdfInput("simple-table.pdf");
var result = ocr.Read(input);
// Retrieve the data
var table = result.Tables[0].DataTable;
// Print out the table data
foreach (DataRow row in table.Rows)
{
foreach (var item in row.ItemArray)
{
Console.Write(item + "\t");
}
Console.WriteLine();
}
Imports Microsoft.VisualBasic
Imports IronOcr
Imports System
Imports System.Data
' Instantiate OCR engine
Private ocr = New IronTesseract()
' Enable table detection
ocr.Configuration.ReadDataTables = True
Dim input = New OcrPdfInput("simple-table.pdf")
Dim result = ocr.Read(input)
' Retrieve the data
Dim table = result.Tables(0).DataTable
' Print out the table data
For Each row As DataRow In table.Rows
For Each item In row.ItemArray
Console.Write(item & vbTab)
Next item
Console.WriteLine()
Next row
Complex Table Example
For complex tables, the ReadDocumentAdvanced
method handles them beautifully. In this example, we’ll use the 'table.pdf' file.
The ReadDocumentAdvanced
method requires the IronOcr.Extensions.AdvancedScan package to be installed alongside the base IronOCR package. Currently, this extension is only available on Windows.
Please note
:path=/static-assets/ocr/content-code-examples/how-to/read-table-in-document-with-ml.cs
using IronOcr;
using System.Linq;
// Instantiate OCR engine
var ocr = new IronTesseract();
using var input = new OcrInput();
input.LoadPdf("table.pdf");
// Perform OCR
var result = ocr.ReadDocumentAdvanced(input);
var cellList = result.Tables.First().CellInfos;
Imports IronOcr
Imports System.Linq
' Instantiate OCR engine
Private ocr = New IronTesseract()
Private input = New OcrInput()
input.LoadPdf("table.pdf")
' Perform OCR
Dim result = ocr.ReadDocumentAdvanced(input)
Dim cellList = result.Tables.First().CellInfos
This method separates the text data of the document into two categories: one enclosed by borders and another without borders. For the bordered content, the library further divides it into subsections based on the table's structure. The results are shown below. It is important to note that, since this method focuses on information enclosed by borders, any merged cells spanning multiple rows will be treated as a single cell.
Result
Helper Class
In the current implementation, the extracted cells are not yet organized properly. However, each cell contains valuable information such as X and Y coordinates, dimensions, and more. Using this data, we can create a helper class for various purposes. Below are some basic helper methods:
public static class TableProcessor
{
public static List<CellInfo> OrganizeCellsByCoordinates(List<CellInfo> cells)
{
// Sort cells by Y (top to bottom), then by X (left to right)
var sortedCells = cells
.OrderBy(cell => cell.CellRect.Y)
.ThenBy(cell => cell.CellRect.X)
.ToList();
return sortedCells;
}
// Example of how to use the function
public static void ProcessTables(Tables tables)
{
foreach (var table in tables)
{
var sortedCells = OrganizeCellsByCoordinates(table.CellInfos);
Console.WriteLine("Organized Table Cells:");
// int previousY = sortedCells.FirstOrDefault()?.CellRect.Y ?? 0;
int previousY = sortedCells.Any() ? sortedCells.First().CellRect.Y : 0;
foreach (var cell in sortedCells)
{
// Print a new line if the Y-coordinate changes, indicating a new row
if (Math.Abs(cell.CellRect.Y - previousY) > cell.CellRect.Height * 0.8)
{
Console.WriteLine(); // Start a new row
previousY = cell.CellRect.Y;
}
Console.Write($"{cell.CellText}\t");
}
Console.WriteLine("\n--- End of Table ---");
}
}
public static List<CellInfo> ExtractRowByIndex(TableInfo table, int rowIndex)
{
if (table == null || table.CellInfos == null || !table.CellInfos.Any())
{
throw new ArgumentException("Table is empty or invalid.");
}
var sortedCells = OrganizeCellsByCoordinates(table.CellInfos);
List<List<CellInfo>> rows = new List<List<CellInfo>>();
// Group cells into rows
int previousY = sortedCells.First().CellRect.Y;
List<CellInfo> currentRow = new List<CellInfo>();
foreach (var cell in sortedCells)
{
if (Math.Abs(cell.CellRect.Y - previousY) > cell.CellRect.Height * 0.8)
{
// Store the completed row and start a new one
rows.Add(new List<CellInfo>(currentRow));
currentRow.Clear();
previousY = cell.CellRect.Y;
}
currentRow.Add(cell);
}
// Add the last row
if (currentRow.Any())
{
rows.Add(currentRow);
}
// Retrieve the specific row
if (rowIndex < 0 || rowIndex >= rows.Count)
{
throw new IndexOutOfRangeException($"Row index {rowIndex} is out of range.");
}
return rows[rowIndex];
}
}
public static class TableProcessor
{
public static List<CellInfo> OrganizeCellsByCoordinates(List<CellInfo> cells)
{
// Sort cells by Y (top to bottom), then by X (left to right)
var sortedCells = cells
.OrderBy(cell => cell.CellRect.Y)
.ThenBy(cell => cell.CellRect.X)
.ToList();
return sortedCells;
}
// Example of how to use the function
public static void ProcessTables(Tables tables)
{
foreach (var table in tables)
{
var sortedCells = OrganizeCellsByCoordinates(table.CellInfos);
Console.WriteLine("Organized Table Cells:");
// int previousY = sortedCells.FirstOrDefault()?.CellRect.Y ?? 0;
int previousY = sortedCells.Any() ? sortedCells.First().CellRect.Y : 0;
foreach (var cell in sortedCells)
{
// Print a new line if the Y-coordinate changes, indicating a new row
if (Math.Abs(cell.CellRect.Y - previousY) > cell.CellRect.Height * 0.8)
{
Console.WriteLine(); // Start a new row
previousY = cell.CellRect.Y;
}
Console.Write($"{cell.CellText}\t");
}
Console.WriteLine("\n--- End of Table ---");
}
}
public static List<CellInfo> ExtractRowByIndex(TableInfo table, int rowIndex)
{
if (table == null || table.CellInfos == null || !table.CellInfos.Any())
{
throw new ArgumentException("Table is empty or invalid.");
}
var sortedCells = OrganizeCellsByCoordinates(table.CellInfos);
List<List<CellInfo>> rows = new List<List<CellInfo>>();
// Group cells into rows
int previousY = sortedCells.First().CellRect.Y;
List<CellInfo> currentRow = new List<CellInfo>();
foreach (var cell in sortedCells)
{
if (Math.Abs(cell.CellRect.Y - previousY) > cell.CellRect.Height * 0.8)
{
// Store the completed row and start a new one
rows.Add(new List<CellInfo>(currentRow));
currentRow.Clear();
previousY = cell.CellRect.Y;
}
currentRow.Add(cell);
}
// Add the last row
if (currentRow.Any())
{
rows.Add(currentRow);
}
// Retrieve the specific row
if (rowIndex < 0 || rowIndex >= rows.Count)
{
throw new IndexOutOfRangeException($"Row index {rowIndex} is out of range.");
}
return rows[rowIndex];
}
}
Imports Microsoft.VisualBasic
Public Module TableProcessor
Public Function OrganizeCellsByCoordinates(ByVal cells As List(Of CellInfo)) As List(Of CellInfo)
' Sort cells by Y (top to bottom), then by X (left to right)
Dim sortedCells = cells.OrderBy(Function(cell) cell.CellRect.Y).ThenBy(Function(cell) cell.CellRect.X).ToList()
Return sortedCells
End Function
' Example of how to use the function
Public Sub ProcessTables(ByVal tables As Tables)
For Each table In tables
Dim sortedCells = OrganizeCellsByCoordinates(table.CellInfos)
Console.WriteLine("Organized Table Cells:")
' int previousY = sortedCells.FirstOrDefault()?.CellRect.Y ?? 0;
Dim previousY As Integer = If(sortedCells.Any(), sortedCells.First().CellRect.Y, 0)
For Each cell In sortedCells
' Print a new line if the Y-coordinate changes, indicating a new row
If Math.Abs(cell.CellRect.Y - previousY) > cell.CellRect.Height * 0.8 Then
Console.WriteLine() ' Start a new row
previousY = cell.CellRect.Y
End If
Console.Write($"{cell.CellText}" & vbTab)
Next cell
Console.WriteLine(vbLf & "--- End of Table ---")
Next table
End Sub
Public Function ExtractRowByIndex(ByVal table As TableInfo, ByVal rowIndex As Integer) As List(Of CellInfo)
If table Is Nothing OrElse table.CellInfos Is Nothing OrElse Not table.CellInfos.Any() Then
Throw New ArgumentException("Table is empty or invalid.")
End If
Dim sortedCells = OrganizeCellsByCoordinates(table.CellInfos)
Dim rows As New List(Of List(Of CellInfo))()
' Group cells into rows
Dim previousY As Integer = sortedCells.First().CellRect.Y
Dim currentRow As New List(Of CellInfo)()
For Each cell In sortedCells
If Math.Abs(cell.CellRect.Y - previousY) > cell.CellRect.Height * 0.8 Then
' Store the completed row and start a new one
rows.Add(New List(Of CellInfo)(currentRow))
currentRow.Clear()
previousY = cell.CellRect.Y
End If
currentRow.Add(cell)
Next cell
' Add the last row
If currentRow.Any() Then
rows.Add(currentRow)
End If
' Retrieve the specific row
If rowIndex < 0 OrElse rowIndex >= rows.Count Then
Throw New IndexOutOfRangeException($"Row index {rowIndex} is out of range.")
End If
Return rows(rowIndex)
End Function
End Module