如何读取文档中的表格

Curtis Chau

2025年一月13日

更新 2025年一月13日

Translated

View the article in English

让我们谈谈在文档中读取表格。使用普通的Tesseract从表格中提取数据可能具有挑战性，因为文本通常位于单元格中，并且在文档中零散分布。然而，我们的库配备了一个经过训练和优化的机器学习模型，可以准确地检测和提取表格数据。

对于简单表格，您可以依赖简单的表格检测，而对于更复杂的结构，我们独有的ReadDocumentAdvanced方法提供强大的结果，有效解析表格并传递数据。

开始使用IronOCR

立即在您的项目中开始使用IronOCR，并享受免费试用。

第一步：

如何读取文档中的表格

下载用于从表中提取数据的C#库
准备图像和PDF文档以进行提取
将ReadDataTables属性设置为true以启用表检测
使用ReadDocumentAdvanced方法处理复杂表格
提取通过这些方法检测到的数据

简单的表格示例

将ReadDataTables属性设置为true可以启用使用Tesseract的表格检测功能。我创建了一个简单的表格 PDF 来测试这个功能，您可以在这里下载：'simple-table.pdf.' 可以使用此方法检测没有合并单元格的简单表格。对于更复杂的表格，请参考下面描述的方法。

:path=/static-assets/ocr/content-code-examples/how-to/read-table-in-document-with-tesseract.cs

using IronOcr;
using System;
using System.Data;

// Instantiate OCR engine
var ocr = new IronTesseract();

// Enable table detection
ocr.Configuration.ReadDataTables = true;

using var input = new OcrPdfInput("simple-table.pdf");
var result = ocr.Read(input);

// Retrieve the data
var table = result.Tables[0].DataTable;

// Print out the table data
foreach (DataRow row in table.Rows)
{
    foreach (var item in row.ItemArray)
    {
        Console.Write(item + "\t");
    }
    Console.WriteLine();
}

Imports Microsoft.VisualBasic
Imports IronOcr
Imports System
Imports System.Data

' Instantiate OCR engine
Private ocr = New IronTesseract()

' Enable table detection
ocr.Configuration.ReadDataTables = True

Dim input = New OcrPdfInput("simple-table.pdf")
Dim result = ocr.Read(input)

' Retrieve the data
Dim table = result.Tables(0).DataTable

' Print out the table data
For Each row As DataRow In table.Rows
	For Each item In row.ItemArray
		Console.Write(item & vbTab)
	Next item
	Console.WriteLine()
Next row

$vbLabelText $csharpLabel

复杂表格示例

对于复杂的表格，ReadDocumentAdvanced 方法能很好地处理它们。在这个例子中，我们将使用 'table.pdf' 文件。

ReadDocumentAdvanced 方法要求安装 IronOcr.Extensions.AdvancedScan 包与基本的 IronOCR 包一起使用。当前，此扩展仅适用于 Windows。

请注意

使用高级扫描在 .NET Framework 上需要项目运行在 x64 架构上。导航到项目配置并取消选中“优先使用32位”选项以实现此目的。在以下故障排除指南中了解更多信息："在 .NET Framework 上进行高级扫描."

:path=/static-assets/ocr/content-code-examples/how-to/read-table-in-document-with-ml.cs

using IronOcr;
using System.Linq;

// Instantiate OCR engine
var ocr = new IronTesseract();

using var input = new OcrInput();
input.LoadPdf("table.pdf");

// Perform OCR
var result = ocr.ReadDocumentAdvanced(input);

var cellList = result.Tables.First().CellInfos;

Imports IronOcr
Imports System.Linq

' Instantiate OCR engine
Private ocr = New IronTesseract()

Private input = New OcrInput()
input.LoadPdf("table.pdf")

' Perform OCR
Dim result = ocr.ReadDocumentAdvanced(input)

Dim cellList = result.Tables.First().CellInfos

$vbLabelText $csharpLabel

此方法将文档的文本数据分为两类：一种是被边框包围的，另一种是没有边框的。对于有边框的内容，该库会根据表格的结构进一步将其划分为小节。结果如下所示。需要注意的是，由于此方法专注于信息被边框包围，任何跨多行的合并单元格都将被视为单个单元格。

结果

辅助类

在当前的实现中，提取的单元格尚未得到妥善组织。然而，每个单元格都包含有价值的信息，如 X 和 Y 坐标、尺寸等。使用这些数据，我们可以创建一个用于各种目的的辅助类。以下是一些基本的辅助方法：

public static class TableProcessor
{
    public static List<CellInfo> OrganizeCellsByCoordinates(List<CellInfo> cells)
    {
        // Sort cells by Y (top to bottom), then by X (left to right)
        var sortedCells = cells
            .OrderBy(cell => cell.CellRect.Y)
            .ThenBy(cell => cell.CellRect.X)
            .ToList();

        return sortedCells;
    }

    // Example of how to use the function
    public static void ProcessTables(Tables tables)
    {
        foreach (var table in tables)
        {
            var sortedCells = OrganizeCellsByCoordinates(table.CellInfos);

            Console.WriteLine("Organized Table Cells:");

            // int previousY = sortedCells.FirstOrDefault()?.CellRect.Y ?? 0;
            int previousY = sortedCells.Any() ? sortedCells.First().CellRect.Y : 0;

            foreach (var cell in sortedCells)
            {
                // Print a new line if the Y-coordinate changes, indicating a new row
                if (Math.Abs(cell.CellRect.Y - previousY) > cell.CellRect.Height * 0.8)
                {
                    Console.WriteLine();  // Start a new row
                    previousY = cell.CellRect.Y;
                }

                Console.Write($"{cell.CellText}\t");
            }

            Console.WriteLine("\n--- End of Table ---");
        }
    }

    public static List<CellInfo> ExtractRowByIndex(TableInfo table, int rowIndex)
    {
        if (table == null 
 table.CellInfos == null 
 !table.CellInfos.Any())
        {
            throw new ArgumentException("Table is empty or invalid.");
        }

        var sortedCells = OrganizeCellsByCoordinates(table.CellInfos);
        List<List<CellInfo>> rows = new List<List<CellInfo>>();

        // Group cells into rows
        int previousY = sortedCells.First().CellRect.Y;
        List<CellInfo> currentRow = new List<CellInfo>();

        foreach (var cell in sortedCells)
        {
            if (Math.Abs(cell.CellRect.Y - previousY) > cell.CellRect.Height * 0.8)
            {
                // Store the completed row and start a new one
                rows.Add(new List<CellInfo>(currentRow));
                currentRow.Clear();

                previousY = cell.CellRect.Y;
            }

            currentRow.Add(cell);
        }

        // Add the last row
        if (currentRow.Any())
        {
            rows.Add(currentRow);
        }

        // Retrieve the specific row
        if (rowIndex < 0 
 rowIndex >= rows.Count)
        {
            throw new IndexOutOfRangeException($"Row index {rowIndex} is out of range.");
        }

        return rows[rowIndex];
    }
}

public static class TableProcessor
{
    public static List<CellInfo> OrganizeCellsByCoordinates(List<CellInfo> cells)
    {
        // Sort cells by Y (top to bottom), then by X (left to right)
        var sortedCells = cells
            .OrderBy(cell => cell.CellRect.Y)
            .ThenBy(cell => cell.CellRect.X)
            .ToList();

        return sortedCells;
    }

    // Example of how to use the function
    public static void ProcessTables(Tables tables)
    {
        foreach (var table in tables)
        {
            var sortedCells = OrganizeCellsByCoordinates(table.CellInfos);

            Console.WriteLine("Organized Table Cells:");

            // int previousY = sortedCells.FirstOrDefault()?.CellRect.Y ?? 0;
            int previousY = sortedCells.Any() ? sortedCells.First().CellRect.Y : 0;

            foreach (var cell in sortedCells)
            {
                // Print a new line if the Y-coordinate changes, indicating a new row
                if (Math.Abs(cell.CellRect.Y - previousY) > cell.CellRect.Height * 0.8)
                {
                    Console.WriteLine();  // Start a new row
                    previousY = cell.CellRect.Y;
                }

                Console.Write($"{cell.CellText}\t");
            }

            Console.WriteLine("\n--- End of Table ---");
        }
    }

    public static List<CellInfo> ExtractRowByIndex(TableInfo table, int rowIndex)
    {
        if (table == null 
 table.CellInfos == null 
 !table.CellInfos.Any())
        {
            throw new ArgumentException("Table is empty or invalid.");
        }

        var sortedCells = OrganizeCellsByCoordinates(table.CellInfos);
        List<List<CellInfo>> rows = new List<List<CellInfo>>();

        // Group cells into rows
        int previousY = sortedCells.First().CellRect.Y;
        List<CellInfo> currentRow = new List<CellInfo>();

        foreach (var cell in sortedCells)
        {
            if (Math.Abs(cell.CellRect.Y - previousY) > cell.CellRect.Height * 0.8)
            {
                // Store the completed row and start a new one
                rows.Add(new List<CellInfo>(currentRow));
                currentRow.Clear();

                previousY = cell.CellRect.Y;
            }

            currentRow.Add(cell);
        }

        // Add the last row
        if (currentRow.Any())
        {
            rows.Add(currentRow);
        }

        // Retrieve the specific row
        if (rowIndex < 0 
 rowIndex >= rows.Count)
        {
            throw new IndexOutOfRangeException($"Row index {rowIndex} is out of range.");
        }

        return rows[rowIndex];
    }
}

Imports Microsoft.VisualBasic

Public Module TableProcessor
	Public Function OrganizeCellsByCoordinates(ByVal cells As List(Of CellInfo)) As List(Of CellInfo)
		' Sort cells by Y (top to bottom), then by X (left to right)
		Dim sortedCells = cells.OrderBy(Function(cell) cell.CellRect.Y).ThenBy(Function(cell) cell.CellRect.X).ToList()

		Return sortedCells
	End Function

	' Example of how to use the function
	Public Sub ProcessTables(ByVal tables As Tables)
		For Each table In tables
			Dim sortedCells = OrganizeCellsByCoordinates(table.CellInfos)

			Console.WriteLine("Organized Table Cells:")

			' int previousY = sortedCells.FirstOrDefault()?.CellRect.Y ?? 0;
			Dim previousY As Integer = If(sortedCells.Any(), sortedCells.First().CellRect.Y, 0)

			For Each cell In sortedCells
				' Print a new line if the Y-coordinate changes, indicating a new row
				If Math.Abs(cell.CellRect.Y - previousY) > cell.CellRect.Height * 0.8 Then
					Console.WriteLine() ' Start a new row
					previousY = cell.CellRect.Y
				End If

				Console.Write($"{cell.CellText}" & vbTab)
			Next cell

			Console.WriteLine(vbLf & "--- End of Table ---")
		Next table
	End Sub

	Public Function ExtractRowByIndex(ByVal table As TableInfo, ByVal rowIndex As Integer) As List(Of CellInfo)
		If table Is Nothing table.CellInfos Is Nothing (Not table.CellInfos.Any()) Then
			Throw New ArgumentException("Table is empty or invalid.")
		End If

		Dim sortedCells = OrganizeCellsByCoordinates(table.CellInfos)
		Dim rows As New List(Of List(Of CellInfo))()

		' Group cells into rows
		Dim previousY As Integer = sortedCells.First().CellRect.Y
		Dim currentRow As New List(Of CellInfo)()

		For Each cell In sortedCells
			If Math.Abs(cell.CellRect.Y - previousY) > cell.CellRect.Height * 0.8 Then
				' Store the completed row and start a new one
				rows.Add(New List(Of CellInfo)(currentRow))
				currentRow.Clear()

				previousY = cell.CellRect.Y
			End If

			currentRow.Add(cell)
		Next cell

		' Add the last row
		If currentRow.Any() Then
			rows.Add(currentRow)
		End If

		' Retrieve the specific row
		If rowIndex < 0 rowIndex >= rows.Count Then
			Throw New IndexOutOfRangeException($"Row index {rowIndex} is out of range.")
		End If

		Return rows(rowIndex)
	End Function
End Module

$vbLabelText $csharpLabel

Curtis Chau

立即与工程团队聊天

技术作家

Curtis Chau拥有计算机科学学士学位（卡尔顿大学），专注于前端开发，擅长Node.js、TypeScript、JavaScript和React。Curtis热衷于打造直观审美的用户界面，喜欢使用现代框架和创建结构良好的视觉手册。

在开发之外，Curtis对物联网（IoT）抱有浓厚兴趣，探索将硬件与软件集成的创新方法。在空闲时间，他喜欢玩游戏和制作Discord机器人，将他对技术的热爱与创造力结合起来。