How to Read Tables in Documents

Extracting data from tables with plain Tesseract can be challenging, because the text sits inside cells and is scattered sparsely across the page. Our library includes a machine learning model that has been trained and fine-tuned to detect and extract table data accurately.

For simple tables, straightforward table detection is enough. For more complex structures, the ReadDocumentAdvanced method parses the table and returns the data in each cell.

Simple Table Example

Setting the ReadDataTables property to true enables table detection using Tesseract. This method can detect simple tables without merged cells. I created a simple table PDF to test this feature, which you can download here: 'simple-table.pdf'. For more complex tables, refer to the ReadDocumentAdvanced method described in the next section.

:path=/static-assets/ocr/content-code-examples/how-to/read-table-in-document-with-tesseract.cs
using IronOcr;
using System;
using System.Data;

// This C# code uses the IronOCR library to read a table from a PDF file
// and display its contents as text in the console.

class Program
{
    static void Main()
    {
        // Instantiate the IronTesseract OCR engine, which is used for optical character recognition.
        var ocr = new IronTesseract();

        // Enable table detection so that the OCR engine can recognize tabular data within the PDF.
        ocr.Configuration.ReadDataTables = true;

        // Using a 'using' statement to ensure that resources are automatically disposed of
        // once the operation is complete. Here, we're loading the PDF file 'simple-table.pdf'.
        using var input = new OcrPdfInput("simple-table.pdf");

        // Perform the OCR operation on the input file to retrieve the OCR results.
        var result = ocr.Read(input);

        // Ensure that there is at least one table detected and available.
        if (result.Tables.Count > 0)
        {
            // Retrieve the first table recognized in the PDF as a DataTable object.
            // This object allows easy manipulation and retrieval of tabular data.
            var table = result.Tables[0].DataTable;

            // Iterate through each row of the table and print out all columns separated by tabs.
            // This is to display the table data in a format that visually represents columns in the console.
            foreach (DataRow row in table.Rows)
            {
                foreach (var item in row.ItemArray)
                {
                    Console.Write(item + "\t"); // Print each cell in the row followed by a tab
                }
                Console.WriteLine(); // Move to the next line after each row
            }
        }
        else
        {
            Console.WriteLine("No tables detected in the PDF.");
        }
    }
}
Imports Microsoft.VisualBasic
Imports IronOcr
Imports System
Imports System.Data

' This VB.NET code uses the IronOCR library to read a table from a PDF file
' and display its contents as text in the console.

Friend Class Program
	Shared Sub Main()
		' Instantiate the IronTesseract OCR engine, which is used for optical character recognition.
		Dim ocr = New IronTesseract()

		' Enable table detection so that the OCR engine can recognize tabular data within the PDF.
		ocr.Configuration.ReadDataTables = True

		' A Using block ensures that resources are automatically disposed of
		' once the operation is complete. Here, we're loading the PDF file 'simple-table.pdf'.
		Using input = New OcrPdfInput("simple-table.pdf")
			' Perform the OCR operation on the input file to retrieve the OCR results.
			Dim result = ocr.Read(input)

			' Ensure that there is at least one table detected and available.
			If result.Tables.Count > 0 Then
				' Retrieve the first table recognized in the PDF as a DataTable object.
				' This object allows easy manipulation and retrieval of tabular data.
				Dim table = result.Tables(0).DataTable

				' Iterate through each row of the table and print out all columns separated by tabs.
				' This is to display the table data in a format that visually represents columns in the console.
				For Each row As DataRow In table.Rows
					For Each item In row.ItemArray
						Console.Write(item & vbTab) ' Print each cell in the row followed by a tab
					Next item
					Console.WriteLine() ' Move to the next line after each row
				Next row
			Else
				Console.WriteLine("No tables detected in the PDF.")
			End If
		End Using
	End Sub
End Class
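Once the detected table is available as a DataTable, a common next step is to export it. The following is a minimal sketch, not part of IronOCR: the `DataTableCsv` class and its `ToCsv` method are hypothetical names, and the code relies only on `System.Data`. It writes the column names the DataTable happens to carry as a header line, followed by one line per row, quoting any field that contains a comma, a quote, or a line break.

```csharp
using System;
using System.Data;
using System.Linq;
using System.Text;

// Hypothetical helper (not an IronOCR type): serializes a DataTable as CSV text.
public static class DataTableCsv
{
    // Quote a field when it contains a comma, a double quote, or a line break.
    private static string Escape(object value)
    {
        string text = value == null ? string.Empty : value.ToString() ?? string.Empty;
        bool needsQuotes = text.IndexOfAny(new[] { ',', '"', '\n', '\r' }) >= 0;
        return needsQuotes ? "\"" + text.Replace("\"", "\"\"") + "\"" : text;
    }

    // Build the CSV: an optional header line, then one line per DataRow.
    // Lines end with '\n' so the output is the same on every platform.
    public static string ToCsv(DataTable table, bool includeHeader = true)
    {
        var sb = new StringBuilder();

        if (includeHeader)
        {
            var names = table.Columns.Cast<DataColumn>().Select(c => Escape(c.ColumnName));
            sb.Append(string.Join(",", names)).Append('\n');
        }

        foreach (DataRow row in table.Rows)
        {
            // DBNull values become empty fields because DBNull.ToString() returns "".
            var fields = row.ItemArray.Select(v => Escape(v));
            sb.Append(string.Join(",", fields)).Append('\n');
        }

        return sb.ToString();
    }
}
```

With the simple table example, this could be called as `File.WriteAllText("table.csv", DataTableCsv.ToCsv(table))`, where `table` is the DataTable taken from the first detected table.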

Complex Table Example

For complex tables, use the ReadDocumentAdvanced method. This example uses the 'table.pdf' file.

The ReadDocumentAdvanced method requires the IronOcr.Extensions.AdvancedScan package to be installed alongside the base IronOCR package. Currently, this extension is only available on Windows.

Please note
Using advanced scan on .NET Framework requires the project to run on x64 architecture. Navigate to the project configuration and uncheck the "Prefer 32-bit" option to achieve this. Learn more in the following troubleshooting guide: "Advanced Scan on .NET Framework."
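To catch a 32-bit configuration early, a project can check the process bitness at startup before calling the advanced scan. The sketch below is an illustration only: `ArchitectureGuard` is a hypothetical helper name, not part of IronOCR, and it uses the standard `Environment.Is64BitProcess` property. Note that this property reports any 64-bit process, so it confirms that "Prefer 32-bit" is not in effect but does not by itself distinguish x64 from other 64-bit architectures.

```csharp
using System;

// Hypothetical helper (not an IronOCR type): fails fast when the process is 32-bit.
public static class ArchitectureGuard
{
    // True when the current process runs as 64-bit.
    public static bool Is64Bit => Environment.Is64BitProcess;

    // Throws with an actionable message if the process is running as 32-bit.
    public static void EnsureSixtyFourBit()
    {
        if (!Is64Bit)
        {
            throw new PlatformNotSupportedException(
                "This process is running as 32-bit. Target x64 and uncheck " +
                "\"Prefer 32-bit\" in the project configuration.");
        }
    }
}
```

Calling `ArchitectureGuard.EnsureSixtyFourBit()` at the top of `Main` turns a confusing runtime failure into a clear configuration message.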

:path=/static-assets/ocr/content-code-examples/how-to/read-table-in-document-with-ml.cs
using IronOcr;
using System;
using System.Linq;

// This code performs optical character recognition (OCR) on a PDF document containing tables.
// It uses IronOcr's IronTesseract engine to process a PDF document and extract text from it,
// particularly focusing on table cell information.

// Instantiate the OCR engine
var ocr = new IronTesseract();

// Load the PDF input into OcrInput for processing
using (var input = new OcrInput())
{
    // Load PDF file into the input container
    input.LoadPdf("table.pdf");

    // Perform OCR on the PDF document using the advanced scan method
    var result = ocr.ReadDocumentAdvanced(input);

    // Extract the first table's cell information.
    // The document may contain no tables, so check that Tables is not empty first.
    if (result.Tables.Any())
    {
        var cellList = result.Tables.First().CellInfos;

        // Process the extracted cell data as needed
        foreach (var cell in cellList)
        {
            // Example: Output each cell text to console
            Console.WriteLine(cell.CellText);
        }
    }
    else
    {
        Console.WriteLine("No tables found in the document.");
    }
}
Imports IronOcr
Imports System
Imports System.Linq

' This code performs optical character recognition (OCR) on a PDF document containing tables.
' It uses IronOcr's IronTesseract engine to process a PDF document and extract text from it,
' particularly focusing on table cell information.

' Instantiate the OCR engine
Dim ocr = New IronTesseract()

' Load the PDF input into OcrInput for processing
Using input = New OcrInput()
	' Load PDF file into the input container
	input.LoadPdf("table.pdf")

	' Perform OCR on the PDF document using the advanced scan method
	Dim result = ocr.ReadDocumentAdvanced(input)

	' Extract the first table's cell information.
	' The document may contain no tables, so check that Tables is not empty first.
	If result.Tables.Any() Then
		Dim cellList = result.Tables.First().CellInfos

		' Process the extracted cell data as needed
		For Each cell In cellList
			' Example: Output each cell text to console
			Console.WriteLine(cell.CellText)
		Next cell
	Else
		Console.WriteLine("No tables found in the document.")
	End If
End Using

This method separates the text data of the document into two categories: text enclosed by borders and text without borders. For the bordered content, the library further divides it into subsections based on the table's structure. Because this method focuses on information enclosed by borders, a merged cell spanning multiple rows is treated as a single cell.

Result

[Image: Read Table in Document]

Helper Class

In the current implementation, the extracted cells are not yet organized into rows and columns. However, each cell carries useful information such as its X and Y coordinates, its dimensions, and its text. Using this data, we can create a helper class for various purposes. Below are some basic helper methods:

using System;
using System.Collections.Generic;
using System.Linq;

// A helper class to process table data by sorting cells based on coordinates
public static class TableProcessor
{
    // Method to organize cells by their coordinates (Y top to bottom, X left to right)
    public static List<CellInfo> OrganizeCellsByCoordinates(List<CellInfo> cells)
    {
        // Sort cells by Y (top to bottom), then by X (left to right)
        var sortedCells = cells
            .OrderBy(cell => cell.CellRect.Y)
            .ThenBy(cell => cell.CellRect.X)
            .ToList();

        return sortedCells;
    }

    // Example method demonstrating how to process multiple tables
    public static void ProcessTables(Tables tables)
    {
        foreach (var table in tables)
        {
            var sortedCells = OrganizeCellsByCoordinates(table.CellInfos);

            Console.WriteLine("Organized Table Cells:");

            // Initialize previous Y coordinate
            int previousY = sortedCells.Any() ? sortedCells.First().CellRect.Y : 0;

            foreach (var cell in sortedCells)
            {
                // Print a new line if the Y-coordinate changes, indicating a new row
                if (Math.Abs(cell.CellRect.Y - previousY) > cell.CellRect.Height * 0.8)
                {
                    Console.WriteLine();  // Start a new row
                    previousY = cell.CellRect.Y;
                }

                // Print the cell text followed by a tab
                Console.Write($"{cell.CellText}\t");
            }

            Console.WriteLine("\n--- End of Table ---");  // End of a table
        }
    }

    // Method to extract a specific row by the given index
    public static List<CellInfo> ExtractRowByIndex(TableInfo table, int rowIndex)
    {
        if (table == null || table.CellInfos == null || !table.CellInfos.Any())
        {
            throw new ArgumentException("Table is empty or invalid.");
        }

        var sortedCells = OrganizeCellsByCoordinates(table.CellInfos);
        List<List<CellInfo>> rows = new List<List<CellInfo>>();

        // Group cells into rows based on Y coordinates
        int previousY = sortedCells.First().CellRect.Y;
        List<CellInfo> currentRow = new List<CellInfo>();

        foreach (var cell in sortedCells)
        {
            if (Math.Abs(cell.CellRect.Y - previousY) > cell.CellRect.Height * 0.8)
            {
                // Store the completed row and start a new one
                rows.Add(new List<CellInfo>(currentRow));
                currentRow.Clear();

                previousY = cell.CellRect.Y;
            }

            currentRow.Add(cell);
        }

        // Add the last row if it wasn't added yet
        if (currentRow.Any())
        {
            rows.Add(currentRow);
        }

        // Retrieve the specified row
        if (rowIndex < 0 || rowIndex >= rows.Count)
        {
            throw new IndexOutOfRangeException($"Row index {rowIndex} is out of range.");
        }

        return rows[rowIndex];
    }
}
Imports Microsoft.VisualBasic
Imports System
Imports System.Collections.Generic
Imports System.Linq

' A helper class to process table data by sorting cells based on coordinates
Public Module TableProcessor
	' Method to organize cells by their coordinates (Y top to bottom, X left to right)
	Public Function OrganizeCellsByCoordinates(ByVal cells As List(Of CellInfo)) As List(Of CellInfo)
		' Sort cells by Y (top to bottom), then by X (left to right)
		Dim sortedCells = cells.OrderBy(Function(cell) cell.CellRect.Y).ThenBy(Function(cell) cell.CellRect.X).ToList()

		Return sortedCells
	End Function

	' Example method demonstrating how to process multiple tables
	Public Sub ProcessTables(ByVal tables As Tables)
		For Each table In tables
			Dim sortedCells = OrganizeCellsByCoordinates(table.CellInfos)

			Console.WriteLine("Organized Table Cells:")

			' Initialize previous Y coordinate
			Dim previousY As Integer = If(sortedCells.Any(), sortedCells.First().CellRect.Y, 0)

			For Each cell In sortedCells
				' Print a new line if the Y-coordinate changes, indicating a new row
				If Math.Abs(cell.CellRect.Y - previousY) > cell.CellRect.Height * 0.8 Then
					Console.WriteLine() ' Start a new row
					previousY = cell.CellRect.Y
				End If

				' Print the cell text followed by a tab
				Console.Write($"{cell.CellText}" & vbTab)
			Next cell

			Console.WriteLine(vbLf & "--- End of Table ---") ' End of a table
		Next table
	End Sub

	' Method to extract a specific row by the given index
	Public Function ExtractRowByIndex(ByVal table As TableInfo, ByVal rowIndex As Integer) As List(Of CellInfo)
		If table Is Nothing OrElse table.CellInfos Is Nothing OrElse Not table.CellInfos.Any() Then
			Throw New ArgumentException("Table is empty or invalid.")
		End If

		Dim sortedCells = OrganizeCellsByCoordinates(table.CellInfos)
		Dim rows As New List(Of List(Of CellInfo))()

		' Group cells into rows based on Y coordinates
		Dim previousY As Integer = sortedCells.First().CellRect.Y
		Dim currentRow As New List(Of CellInfo)()

		For Each cell In sortedCells
			If Math.Abs(cell.CellRect.Y - previousY) > cell.CellRect.Height * 0.8 Then
				' Store the completed row and start a new one
				rows.Add(New List(Of CellInfo)(currentRow))
				currentRow.Clear()

				previousY = cell.CellRect.Y
			End If

			currentRow.Add(cell)
		Next cell

		' Add the last row if it wasn't added yet
		If currentRow.Any() Then
			rows.Add(currentRow)
		End If

		' Retrieve the specified row
		If rowIndex < 0 OrElse rowIndex >= rows.Count Then
			Throw New IndexOutOfRangeException($"Row index {rowIndex} is out of range.")
		End If

		Return rows(rowIndex)
	End Function
End Module
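One limitation of sorting the whole cell list by Y and then by X is that two cells in the same visual row whose Y values differ by a pixel or two come out ordered by Y, not by X. A possible variation, sketched below under stated assumptions, settles row membership first and only then sorts each row left to right. The `Cell` class is a hypothetical stand-in for the detected cell data (text, X, Y, height) so the example runs without IronOCR; `RowGrouper`, `GroupIntoRows`, and the `tolerance` parameter are likewise illustrative names, and the default tolerance of half a cell height is an arbitrary starting point to tune per document.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-in for a detected cell; not an IronOCR type.
public sealed class Cell
{
    public Cell(string text, int x, int y, int height)
    {
        Text = text; X = x; Y = y; Height = height;
    }

    public string Text { get; }
    public int X { get; }
    public int Y { get; }
    public int Height { get; }
}

public static class RowGrouper
{
    // Groups cells into rows by Y with a tolerance, then sorts each row by X.
    // tolerance is the fraction of a cell's height that its Y may exceed the
    // Y of the row's first cell before the cell starts a new row.
    public static List<List<Cell>> GroupIntoRows(IEnumerable<Cell> cells, double tolerance = 0.5)
    {
        var rows = new List<List<Cell>>();
        int rowY = 0;

        foreach (var cell in cells.OrderBy(c => c.Y))
        {
            // Start a new row for the first cell, or when this cell sits
            // too far below the first cell of the current row.
            if (rows.Count == 0 || cell.Y - rowY > cell.Height * tolerance)
            {
                rows.Add(new List<Cell>());
                rowY = cell.Y;
            }

            rows[rows.Count - 1].Add(cell);
        }

        // Order left to right only after row membership is settled.
        return rows.Select(r => r.OrderBy(c => c.X).ToList()).ToList();
    }
}
```

For example, given cells "B" at (100, 10), "A" at (0, 12), "C" at (0, 40), and "D" at (100, 41), each 20 units tall, this grouping yields the rows A, B and C, D, whereas a single global sort by Y then X would place B ahead of A.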

Frequently Asked Questions

What is the main challenge of extracting data from tables in documents?

Extracting data from tables using plain Tesseract can be challenging because the text often resides in cells and is sparsely scattered across the document.

How can table data extraction be improved in documents?

IronOCR includes a machine learning model trained to detect and extract table data, which gives robust results even for complex table structures.

What steps are involved in getting started with reading tables using IronOCR?

Install the IronOCR library, prepare your document, set the ReadDataTables property to true to enable table detection, use the ReadDocumentAdvanced method for complex tables, and extract the data these methods detect.

What is the purpose of enabling the ReadDataTables property?

Setting ReadDataTables to true enables Tesseract-based table detection, which identifies and extracts data from simple tables that do not have merged cells.

What is the advantage of using the ReadDocumentAdvanced method?

The ReadDocumentAdvanced method handles complex tables and separates text data into bordered and non-bordered categories.

What additional packages are required for the ReadDocumentAdvanced method?

The ReadDocumentAdvanced method requires the IronOcr.Extensions.AdvancedScan package to be installed alongside the base IronOCR package. Currently, this extension is only available on Windows.

How can you ensure compatibility with .NET Framework when using advanced features?

To ensure compatibility with .NET Framework, the project must run on x64 architecture, and the 'Prefer 32-bit' option should be unchecked in the project configuration.

What helper methods are available for processing extracted table data?

Helper methods such as organizing cells by coordinates, processing multiple tables, and extracting specific rows by index can assist in effectively handling and organizing extracted table data.

What information do the extracted cells contain?

Extracted cells contain valuable information such as X and Y coordinates, dimensions, and the text within the cell.

Curtis Chau
Technical Writer
