How to Read Table in Documents

This article was translated from English: Does it need improvement?
Translated
View the article in English

讓我們來談談如何在文件中讀取表格。 使用普通的 Tesseract 提取表格中的數據可能會具有挑戰性,因為文本通常位於單元格中,並且分散在整個文檔中。 然而,我們的庫配備了一個訓練和微調過的機器學習模型,可以準確地檢測和提取表格數據。

對於簡單的表格,您可以依賴直接的表格檢測,而對於更複雜的結構,我們的專有 ReadDocumentAdvanced 方法提供穩健的結果,有效解析表格並傳遞數據。

標題:2(快速入門:一次調用提取複雜表格單元格)

幾分鐘內開始運行——此示例顯示如何使用 ReadDocumentAdvanced 的單次 IronOCR 調用從複雜文檔中獲取詳細的表格單元格數據。 通過加載 PDF、應用高級表格檢測並直接返回單元格信息列表,展示了使用的簡便性。

Nuget IconGet started making PDFs with NuGet now:

  1. Install IronOCR with NuGet Package Manager

    PM > Install-Package IronOcr

  2. Copy and run this code snippet.

    var cells = new IronTesseract().ReadDocumentAdvanced(new OcrInput().LoadPdf("invoiceTable.pdf")).Tables.First().CellInfos;
  3. Deploy to test on your live environment

    Start using IronOCR in your project today with a free trial
    arrow pointer

以下步驟將引導您開始使用 IronOCR 讀取表格:

class="hsg-featured-snippet">

最小化工作流程(5 步驟)

  1. 下載 C# 庫以從表格中提取數據
  2. 準備圖像和 PDF 文檔以進行提取
  3. ReadDataTables 屬性設置為 true 以啟用表格檢測
  4. 對於複雜表格,使用 ReadDocumentAdvanced 方法
  5. 提取這些方法檢測到的數據


簡單表格範例

ReadDataTables 屬性設置為 true,使用 Tesseract 啟用表格檢測。 我創建了一個簡單的表格 PDF 為此功能進行測試,您可以在這裡下載:'simple-table.pdf'。 使用此方法可以檢測到沒有合併單元格的簡單表格。 請參考以下描述的方法以獲取更複雜的表格。

:path=/static-assets/ocr/content-code-examples/how-to/read-table-in-document-with-tesseract.cs
using IronOcr;
using System;
using System.Data;

// Instantiate OCR engine
var ocr = new IronTesseract();

// Enable table detection
ocr.Configuration.ReadDataTables = true;

using var input = new OcrPdfInput("simple-table.pdf");
var result = ocr.Read(input);

// Retrieve the data
var table = result.Tables[0].DataTable;

// Print out the table data
foreach (DataRow row in table.Rows)
{
    foreach (var item in row.ItemArray)
    {
        Console.Write(item + "\t");
    }
    Console.WriteLine();
}
Imports Microsoft.VisualBasic
Imports IronOcr
Imports System
Imports System.Data

' Instantiate OCR engine
Private ocr = New IronTesseract()

' Enable table detection
ocr.Configuration.ReadDataTables = True

Dim input = New OcrPdfInput("simple-table.pdf")
Dim result = ocr.Read(input)

' Retrieve the data
Dim table = result.Tables(0).DataTable

' Print out the table data
For Each row As DataRow In table.Rows
	For Each item In row.ItemArray
		Console.Write(item & vbTab)
	Next item
	Console.WriteLine()
Next row
$vbLabelText   $csharpLabel

發票範例

在商業環境中更常見的複雜表格之一是發票。 發票本身即是具有數據行和列的複雜表格。 利用 IronOCR,我們使用 ReadDocumentAdvanced 方法來完美處理它們。 這一過程涉及掃描文檔、識別表格結構並提取數據。 在此示例中,我們將使用 'invoiceTable.pdf' 文件來展示 IronOCR 如何從發票中檢索所有信息。

ReadDocumentAdvanced 方法需要安裝 IronOcr.Extensions.AdvancedScan 包以及基礎的 IronOCR 包。

請注意 在 .NET Framework 上使用高級掃描需要項目運行在 x64 架構上。 導航到項目配置並取消選中“偏好 32 位”選項以實現此目標。 在以下故障排除指南中了解更多信息:"在 .NET Framework 上的高級掃描。”

:path=/static-assets/ocr/content-code-examples/how-to/read-table-in-document-with-ml.cs
using IronOcr;
using System.Linq;

// Instantiate OCR engine
var ocr = new IronTesseract();

using var input = new OcrInput();
input.LoadPdf("invoiceTable.pdf");

// Perform OCR
var result = ocr.ReadDocumentAdvanced(input);

var cellList = result.Tables.First().CellInfos;
IRON VB CONVERTER ERROR developers@ironsoftware.com
$vbLabelText   $csharpLabel

此方法將文檔的文本數據分為兩類:一類被邊界包圍,另一類無邊界。 對於有邊界的內容,庫會根據表格的結構進一步將其劃分為子部分。 結果如下所示。 需要注意的是,由於此方法側重於被邊界包圍的信息,跨多行的合併單元格將被視為單一單元格。

結果

class="content-img-align-center">
class="center-image-wrapper"> Read Table in Document

輔助類

在當前實施中,提取的單元格尚未正確組織。 然而,每個單元格都包含有價值的信息,如 X 和 Y 坐標、尺寸等。 利用這些數據,我們可以為各種目的創建輔助類。 以下是一些基本的輔助方法:

using System;
using System.Collections.Generic;
using System.Linq;

// A helper class to process table data by sorting cells based on coordinates
public static class TableProcessor
{
    // Method to organize cells by their coordinates (Y top to bottom, X left to right)
    public static List<CellInfo> OrganizeCellsByCoordinates(List<CellInfo> cells)
    {
        // Sort cells by Y (top to bottom), then by X (left to right)
        var sortedCells = cells
            .OrderBy(cell => cell.CellRect.Y)
            .ThenBy(cell => cell.CellRect.X)
            .ToList();

        return sortedCells;
    }

    // Example method demonstrating how to process multiple tables
    public static void ProcessTables(Tables tables)
    {
        foreach (var table in tables)
        {
            var sortedCells = OrganizeCellsByCoordinates(table.CellInfos);

            Console.WriteLine("Organized Table Cells:");

            // Initialize previous Y coordinate
            int previousY = sortedCells.Any() ? sortedCells.First().CellRect.Y : 0;

            foreach (var cell in sortedCells)
            {
                // Print a new line if the Y-coordinate changes, indicating a new row
                if (Math.Abs(cell.CellRect.Y - previousY) > cell.CellRect.Height * 0.8)
                {
                    Console.WriteLine();  // Start a new row
                    previousY = cell.CellRect.Y;
                }

                // Print the cell text followed by a tab
                Console.Write($"{cell.CellText}\t");
            }

            Console.WriteLine("\n--- End of Table ---");  // End of a table
        }
    }

    // Method to extract a specific row by the given index
    public static List<CellInfo> ExtractRowByIndex(TableInfo table, int rowIndex)
    {
        if (table == null || table.CellInfos == null || !table.CellInfos.Any())
        {
            throw new ArgumentException("Table is empty or invalid.");
        }

        var sortedCells = OrganizeCellsByCoordinates(table.CellInfos);
        List<List<CellInfo>> rows = new List<List<CellInfo>>();

        // Group cells into rows based on Y coordinates
        int previousY = sortedCells.First().CellRect.Y;
        List<CellInfo> currentRow = new List<CellInfo>();

        foreach (var cell in sortedCells)
        {
            if (Math.Abs(cell.CellRect.Y - previousY) > cell.CellRect.Height * 0.8)
            {
                // Store the completed row and start a new one
                rows.Add(new List<CellInfo>(currentRow));
                currentRow.Clear();

                previousY = cell.CellRect.Y;
            }

            currentRow.Add(cell);
        }

        // Add the last row if it wasn't added yet
        if (currentRow.Any())
        {
            rows.Add(currentRow);
        }

        // Retrieve the specified row
        if (rowIndex < 0 || rowIndex >= rows.Count)
        {
            throw new IndexOutOfRangeException($"Row index {rowIndex} is out of range.");
        }

        return rows[rowIndex];
    }
}
using System;
using System.Collections.Generic;
using System.Linq;

// A helper class to process table data by sorting cells based on coordinates
public static class TableProcessor
{
    // Method to organize cells by their coordinates (Y top to bottom, X left to right)
    public static List<CellInfo> OrganizeCellsByCoordinates(List<CellInfo> cells)
    {
        // Sort cells by Y (top to bottom), then by X (left to right)
        var sortedCells = cells
            .OrderBy(cell => cell.CellRect.Y)
            .ThenBy(cell => cell.CellRect.X)
            .ToList();

        return sortedCells;
    }

    // Example method demonstrating how to process multiple tables
    public static void ProcessTables(Tables tables)
    {
        foreach (var table in tables)
        {
            var sortedCells = OrganizeCellsByCoordinates(table.CellInfos);

            Console.WriteLine("Organized Table Cells:");

            // Initialize previous Y coordinate
            int previousY = sortedCells.Any() ? sortedCells.First().CellRect.Y : 0;

            foreach (var cell in sortedCells)
            {
                // Print a new line if the Y-coordinate changes, indicating a new row
                if (Math.Abs(cell.CellRect.Y - previousY) > cell.CellRect.Height * 0.8)
                {
                    Console.WriteLine();  // Start a new row
                    previousY = cell.CellRect.Y;
                }

                // Print the cell text followed by a tab
                Console.Write($"{cell.CellText}\t");
            }

            Console.WriteLine("\n--- End of Table ---");  // End of a table
        }
    }

    // Method to extract a specific row by the given index
    public static List<CellInfo> ExtractRowByIndex(TableInfo table, int rowIndex)
    {
        if (table == null || table.CellInfos == null || !table.CellInfos.Any())
        {
            throw new ArgumentException("Table is empty or invalid.");
        }

        var sortedCells = OrganizeCellsByCoordinates(table.CellInfos);
        List<List<CellInfo>> rows = new List<List<CellInfo>>();

        // Group cells into rows based on Y coordinates
        int previousY = sortedCells.First().CellRect.Y;
        List<CellInfo> currentRow = new List<CellInfo>();

        foreach (var cell in sortedCells)
        {
            if (Math.Abs(cell.CellRect.Y - previousY) > cell.CellRect.Height * 0.8)
            {
                // Store the completed row and start a new one
                rows.Add(new List<CellInfo>(currentRow));
                currentRow.Clear();

                previousY = cell.CellRect.Y;
            }

            currentRow.Add(cell);
        }

        // Add the last row if it wasn't added yet
        if (currentRow.Any())
        {
            rows.Add(currentRow);
        }

        // Retrieve the specified row
        if (rowIndex < 0 || rowIndex >= rows.Count)
        {
            throw new IndexOutOfRangeException($"Row index {rowIndex} is out of range.");
        }

        return rows[rowIndex];
    }
}
Imports Microsoft.VisualBasic
Imports System
Imports System.Collections.Generic
Imports System.Linq

' A helper class to process table data by sorting cells based on coordinates
Public Module TableProcessor
	' Method to organize cells by their coordinates (Y top to bottom, X left to right)
	Public Function OrganizeCellsByCoordinates(ByVal cells As List(Of CellInfo)) As List(Of CellInfo)
		' Sort cells by Y (top to bottom), then by X (left to right)
		Dim sortedCells = cells.OrderBy(Function(cell) cell.CellRect.Y).ThenBy(Function(cell) cell.CellRect.X).ToList()

		Return sortedCells
	End Function

	' Example method demonstrating how to process multiple tables
	Public Sub ProcessTables(ByVal tables As Tables)
		For Each table In tables
			Dim sortedCells = OrganizeCellsByCoordinates(table.CellInfos)

			Console.WriteLine("Organized Table Cells:")

			' Initialize previous Y coordinate
			Dim previousY As Integer = If(sortedCells.Any(), sortedCells.First().CellRect.Y, 0)

			For Each cell In sortedCells
				' Print a new line if the Y-coordinate changes, indicating a new row
				If Math.Abs(cell.CellRect.Y - previousY) > cell.CellRect.Height * 0.8 Then
					Console.WriteLine() ' Start a new row
					previousY = cell.CellRect.Y
				End If

				' Print the cell text followed by a tab
				Console.Write($"{cell.CellText}" & vbTab)
			Next cell

			Console.WriteLine(vbLf & "--- End of Table ---") ' End of a table
		Next table
	End Sub

	' Method to extract a specific row by the given index
	Public Function ExtractRowByIndex(ByVal table As TableInfo, ByVal rowIndex As Integer) As List(Of CellInfo)
		If table Is Nothing OrElse table.CellInfos Is Nothing OrElse Not table.CellInfos.Any() Then
			Throw New ArgumentException("Table is empty or invalid.")
		End If

		Dim sortedCells = OrganizeCellsByCoordinates(table.CellInfos)
		Dim rows As New List(Of List(Of CellInfo))()

		' Group cells into rows based on Y coordinates
		Dim previousY As Integer = sortedCells.First().CellRect.Y
		Dim currentRow As New List(Of CellInfo)()

		For Each cell In sortedCells
			If Math.Abs(cell.CellRect.Y - previousY) > cell.CellRect.Height * 0.8 Then
				' Store the completed row and start a new one
				rows.Add(New List(Of CellInfo)(currentRow))
				currentRow.Clear()

				previousY = cell.CellRect.Y
			End If

			currentRow.Add(cell)
		Next cell

		' Add the last row if it wasn't added yet
		If currentRow.Any() Then
			rows.Add(currentRow)
		End If

		' Retrieve the specified row
		If rowIndex < 0 OrElse rowIndex >= rows.Count Then
			Throw New IndexOutOfRangeException($"Row index {rowIndex} is out of range.")
		End If

		Return rows(rowIndex)
	End Function
End Module
$vbLabelText   $csharpLabel

常見問題解答

如何改進使用 C# 從文件中提取表格資料的效率?

您可以使用 C# 和 IronOCR 的機器學習模型來增強文件中表格資料的擷取。該模型經過專門訓練,能夠準確檢測和提取複雜的表格資料。這種方法比使用 Tesseract 等標準 OCR 工具更有效。

IronOCR 中的 `ReadDocumentAdvanced` 方法的目的是什麼?

IronOCR 中的 `ReadDocumentAdvanced` 方法旨在透過高效的解析和資料提取來處理複雜表格,從而提供可靠的結果。它尤其適用於處理結構複雜的表格。

我該如何使用 IronOCR 開始提取表格資料?

要開始使用 IronOCR 提取表格,請下載 C# 庫,準備文檔,透過設定ReadDataTables屬性啟用表格檢測,並使用ReadDocumentAdvanced方法處理複雜表格。

IronOCR 的進階表格擷取功能需要哪些額外的軟體套件?

IronOCR 中的高級表格提取需要 `IronOcr.Extensions.AdvancedScan` 包,該包是 Windows 特有的,用於有效管理複雜的表格結構。

如何使用 IronOCR 整理擷取的表格資料?

IronOCR 提供輔助方法,按座標組織提取的表數據,可讓您處理多個表並按索引提取特定行,從而更好地進行資料管理。

使用 IronOCR 擷取的表格儲存格中包含哪些元資料?

使用 IronOCR 提取的表格單元包含元數據,例如 X 和 Y 座標、單元格尺寸以及每個單元格內的文本內容,從而可以進行詳細的數據分析和組織。

在使用 IronOCR 進行表格擷取時,如何確保與 .NET Framework 相容?

為確保與 .NET Framework 相容,請確保您的專案在 x64 架構上運行,並在使用 IronOCR 進行表格提取時,取消選取專案配置中的「首選 32 位元」選項。

Curtis Chau
技術作家

Curtis Chau 擁有卡爾頓大學計算機科學學士學位,專注於前端開發,擅長於 Node.js、TypeScript、JavaScript 和 React。Curtis 熱衷於創建直觀且美觀的用戶界面,喜歡使用現代框架並打造結構良好、視覺吸引人的手冊。

除了開發之外,Curtis 對物聯網 (IoT) 有著濃厚的興趣,探索將硬體和軟體結合的創新方式。在閒暇時間,他喜愛遊戲並構建 Discord 機器人,結合科技與創意的樂趣。

審核人

A PHP Error was encountered

Severity: Warning

Message: Illegal string offset 'name'

Filename: sections/author_component.php

Line Number: 70

Backtrace:

File: /var/www/ironpdf.com/application/views/main/sections/author_component.php
Line: 70
Function: _error_handler

File: /var/www/ironpdf.com/application/libraries/Render.php
Line: 63
Function: view

File: /var/www/ironpdf.com/application/views/products/sections/three_column_docs_page_structure.php
Line: 64
Function: main_view

File: /var/www/ironpdf.com/application/libraries/Render.php
Line: 88
Function: view

File: /var/www/ironpdf.com/application/views/products/how-to/index.php
Line: 2
Function: view

File: /var/www/ironpdf.com/application/libraries/Render.php
Line: 88
Function: view

File: /var/www/ironpdf.com/application/libraries/Render.php
Line: 552
Function: view

File: /var/www/ironpdf.com/application/controllers/Products/Howto.php
Line: 31
Function: render_products_view

File: /var/www/ironpdf.com/index.php
Line: 292
Function: require_once

">

A PHP Error was encountered

Severity: Warning

Message: Illegal string offset 'title'

Filename: sections/author_component.php

Line Number: 84

Backtrace:

File: /var/www/ironpdf.com/application/views/main/sections/author_component.php
Line: 84
Function: _error_handler

File: /var/www/ironpdf.com/application/libraries/Render.php
Line: 63
Function: view

File: /var/www/ironpdf.com/application/views/products/sections/three_column_docs_page_structure.php
Line: 64
Function: main_view

File: /var/www/ironpdf.com/application/libraries/Render.php
Line: 88
Function: view

File: /var/www/ironpdf.com/application/views/products/how-to/index.php
Line: 2
Function: view

File: /var/www/ironpdf.com/application/libraries/Render.php
Line: 88
Function: view

File: /var/www/ironpdf.com/application/libraries/Render.php
Line: 552
Function: view

File: /var/www/ironpdf.com/application/controllers/Products/Howto.php
Line: 31
Function: render_products_view

File: /var/www/ironpdf.com/index.php
Line: 292
Function: require_once

A PHP Error was encountered

Severity: Warning

Message: Illegal string offset 'comment'

Filename: sections/author_component.php

Line Number: 85

Backtrace:

File: /var/www/ironpdf.com/application/views/main/sections/author_component.php
Line: 85
Function: _error_handler

File: /var/www/ironpdf.com/application/libraries/Render.php
Line: 63
Function: view

File: /var/www/ironpdf.com/application/views/products/sections/three_column_docs_page_structure.php
Line: 64
Function: main_view

File: /var/www/ironpdf.com/application/libraries/Render.php
Line: 88
Function: view

File: /var/www/ironpdf.com/application/views/products/how-to/index.php
Line: 2
Function: view

File: /var/www/ironpdf.com/application/libraries/Render.php
Line: 88
Function: view

File: /var/www/ironpdf.com/application/libraries/Render.php
Line: 552
Function: view

File: /var/www/ironpdf.com/application/controllers/Products/Howto.php
Line: 31
Function: render_products_view

File: /var/www/ironpdf.com/index.php
Line: 292
Function: require_once