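How to Read Tables in Documents
Extracting data from tables with plain Tesseract can be challenging, because the text usually sits inside cells and is scattered across the document. Our library, however, ships with a machine learning model that has been trained and tuned to detect and extract table data accurately.
For simple tables you can rely on basic table detection. For more complex structures, our ReadDocumentAdvanced method gives robust results: it parses the table and returns its data.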
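How to Read Tables in Documents
- Download a C# library to extract data from tables
- Prepare the image and PDF documents for extraction
- Set the ReadDataTables property to true to enable table detection
- Use the ReadDocumentAdvanced method for complex tables
- Extract the data detected by these methods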
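Simple Table Example
Setting the ReadDataTables property to true enables table detection using Tesseract. I created a simple table PDF to test this feature, which you can download here: simple-table.pdf. This approach can detect simple tables that have no merged cells. For more complex tables, see the method described below.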
using IronOcr;
using System;
using System.Data;

// Instantiate OCR engine
var ocr = new IronTesseract();

// Enable table detection
ocr.Configuration.ReadDataTables = true;

using var input = new OcrPdfInput("simple-table.pdf");
var result = ocr.Read(input);

// Retrieve the data
var table = result.Tables[0].DataTable;

// Print out the table data
foreach (DataRow row in table.Rows)
{
    foreach (var item in row.ItemArray)
    {
        Console.Write(item + "\t");
    }
    Console.WriteLine();
}
Imports Microsoft.VisualBasic
Imports IronOcr
Imports System
Imports System.Data

' Instantiate OCR engine
Dim ocr = New IronTesseract()

' Enable table detection
ocr.Configuration.ReadDataTables = True

' Using disposes the input at the end of the block, like "using var" in the C# version
Using input = New OcrPdfInput("simple-table.pdf")
    Dim result = ocr.Read(input)

    ' Retrieve the data
    Dim table = result.Tables(0).DataTable

    ' Print out the table data
    For Each row As DataRow In table.Rows
        For Each item In row.ItemArray
            Console.Write(item & vbTab)
        Next item
        Console.WriteLine()
    Next row
End Using
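Once the data is in a System.Data.DataTable, it can be handed to ordinary .NET code. The following is a minimal sketch of one possible next step, writing the rows out as CSV. The DataTableExport class and its ToCsv method are hypothetical helpers, not part of IronOCR, and the sketch writes only table.Rows, as the printing loop above does. Whether header text ends up in the first row or in the column names is not stated here, so check your own output before relying on it.

```csharp
using System;
using System.Data;
using System.Linq;
using System.Text;

// Hypothetical helper (not part of IronOCR): turns any DataTable into CSV text.
public static class DataTableExport
{
    // Quote a field when it contains a comma, a double quote, or a line break.
    private static string Escape(object value)
    {
        var text = Convert.ToString(value) ?? string.Empty;
        bool needsQuotes = text.IndexOfAny(new[] { ',', '"', '\n', '\r' }) >= 0;
        return needsQuotes ? "\"" + text.Replace("\"", "\"\"") + "\"" : text;
    }

    // Emit one CSV line per DataRow, in the order the rows appear in the table.
    public static string ToCsv(DataTable table)
    {
        var builder = new StringBuilder();
        foreach (DataRow row in table.Rows)
        {
            builder.AppendLine(string.Join(",", row.ItemArray.Select(Escape)));
        }
        return builder.ToString();
    }
}
```

With the table variable from the example above, System.IO.File.WriteAllText("simple-table.csv", DataTableExport.ToCsv(table)) would save the result to disk.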
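Complex Table Example
For complex tables, the ReadDocumentAdvanced method handles them well. In this example, we will use the table.pdf file.
The ReadDocumentAdvanced method requires the IronOcr.Extensions.AdvancedScan package to be installed alongside the base IronOCR package. Currently, this extension is only available on Windows.
Please note
Using Advanced Scan on .NET Framework requires the project to run on the x64 architecture. To arrange this, open the project configuration and uncheck the "Prefer 32-bit" option. For more information, see the following troubleshooting guide: Advanced Scan on .NET Framework.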
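The two requirements above (Windows only, and a 64-bit process on .NET Framework) can be checked at startup before any advanced scan is attempted. This is a minimal sketch using only standard .NET APIs. The AdvancedScanPreflight class is a hypothetical name, not something IronOCR provides, and RuntimeInformation is assumed to be available, which on .NET Framework means version 4.7.1 or later.

```csharp
using System;
using System.Runtime.InteropServices;

// Hypothetical helper (not part of IronOCR): checks the requirements stated above.
public static class AdvancedScanPreflight
{
    // Returns true when both requirements hold; otherwise explains the first failure.
    public static bool IsSupported(out string problem)
    {
        if (!RuntimeInformation.IsOSPlatform(OSPlatform.Windows))
        {
            problem = "The AdvancedScan extension is currently Windows-only.";
            return false;
        }

        if (!Environment.Is64BitProcess)
        {
            problem = "The process is 32-bit; uncheck \"Prefer 32-bit\" and target x64.";
            return false;
        }

        problem = string.Empty;
        return true;
    }
}
```

Calling AdvancedScanPreflight.IsSupported(out var problem) before ReadDocumentAdvanced lets an application report a clear message instead of failing inside the OCR call.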
using IronOcr;
using System.Linq;
// Instantiate OCR engine
var ocr = new IronTesseract();
using var input = new OcrInput();
input.LoadPdf("table.pdf");
// Perform OCR
var result = ocr.ReadDocumentAdvanced(input);
var cellList = result.Tables.First().CellInfos;
Imports IronOcr
Imports System.Linq

' Instantiate OCR engine
Dim ocr = New IronTesseract()

' Using disposes the input at the end of the block, like "using var" in the C# version
Using input = New OcrInput()
    input.LoadPdf("table.pdf")

    ' Perform OCR
    Dim result = ocr.ReadDocumentAdvanced(input)
    Dim cellList = result.Tables.First().CellInfos
End Using
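This method splits the document's text data into two categories: text enclosed by borders and text without borders. For bordered content, the library further divides it into subsections based on the structure of the table. Note that because this method focuses on information enclosed by borders, any merged cell that spans multiple rows is treated as a single cell.
Result
[Figure: output of ReadDocumentAdvanced for table.pdf]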
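Helper Class
In the current implementation, the extracted cells are not yet properly organized. However, each cell carries useful information such as its X and Y coordinates and its dimensions. With this data we can build a helper class for various purposes. Below are some basic helper methods: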
using System;
using System.Collections.Generic;
using System.Linq;
using IronOcr; // CellInfo, TableInfo and Tables are assumed to live here; adjust if your version differs

public static class TableProcessor
{
    public static List<CellInfo> OrganizeCellsByCoordinates(List<CellInfo> cells)
    {
        // Sort cells by Y (top to bottom), then by X (left to right)
        var sortedCells = cells
            .OrderBy(cell => cell.CellRect.Y)
            .ThenBy(cell => cell.CellRect.X)
            .ToList();
        return sortedCells;
    }

    // Example of how to use the function
    public static void ProcessTables(Tables tables)
    {
        foreach (var table in tables)
        {
            var sortedCells = OrganizeCellsByCoordinates(table.CellInfos);
            Console.WriteLine("Organized Table Cells:");

            int previousY = sortedCells.Any() ? sortedCells.First().CellRect.Y : 0;
            foreach (var cell in sortedCells)
            {
                // Print a new line if the Y-coordinate changes, indicating a new row
                if (Math.Abs(cell.CellRect.Y - previousY) > cell.CellRect.Height * 0.8)
                {
                    Console.WriteLine(); // Start a new row
                    previousY = cell.CellRect.Y;
                }
                Console.Write($"{cell.CellText}\t");
            }
            Console.WriteLine("\n--- End of Table ---");
        }
    }

    public static List<CellInfo> ExtractRowByIndex(TableInfo table, int rowIndex)
    {
        // Reject a missing table or one without any detected cells
        if (table == null || table.CellInfos == null || !table.CellInfos.Any())
        {
            throw new ArgumentException("Table is empty or invalid.");
        }

        var sortedCells = OrganizeCellsByCoordinates(table.CellInfos);
        List<List<CellInfo>> rows = new List<List<CellInfo>>();

        // Group cells into rows
        int previousY = sortedCells.First().CellRect.Y;
        List<CellInfo> currentRow = new List<CellInfo>();
        foreach (var cell in sortedCells)
        {
            if (Math.Abs(cell.CellRect.Y - previousY) > cell.CellRect.Height * 0.8)
            {
                // Store the completed row and start a new one
                rows.Add(new List<CellInfo>(currentRow));
                currentRow.Clear();
                previousY = cell.CellRect.Y;
            }
            currentRow.Add(cell);
        }

        // Add the last row
        if (currentRow.Any())
        {
            rows.Add(currentRow);
        }

        // Retrieve the specific row
        if (rowIndex < 0 || rowIndex >= rows.Count)
        {
            throw new IndexOutOfRangeException($"Row index {rowIndex} is out of range.");
        }
        return rows[rowIndex];
    }
}
Imports Microsoft.VisualBasic
Imports System
Imports System.Collections.Generic
Imports System.Linq
Imports IronOcr ' CellInfo, TableInfo and Tables are assumed to live here; adjust if your version differs

Public Module TableProcessor
    Public Function OrganizeCellsByCoordinates(ByVal cells As List(Of CellInfo)) As List(Of CellInfo)
        ' Sort cells by Y (top to bottom), then by X (left to right)
        Dim sortedCells = cells.OrderBy(Function(cell) cell.CellRect.Y).ThenBy(Function(cell) cell.CellRect.X).ToList()
        Return sortedCells
    End Function

    ' Example of how to use the function
    Public Sub ProcessTables(ByVal tables As Tables)
        For Each table In tables
            Dim sortedCells = OrganizeCellsByCoordinates(table.CellInfos)
            Console.WriteLine("Organized Table Cells:")

            Dim previousY As Integer = If(sortedCells.Any(), sortedCells.First().CellRect.Y, 0)
            For Each cell In sortedCells
                ' Print a new line if the Y-coordinate changes, indicating a new row
                If Math.Abs(cell.CellRect.Y - previousY) > cell.CellRect.Height * 0.8 Then
                    Console.WriteLine() ' Start a new row
                    previousY = cell.CellRect.Y
                End If
                Console.Write($"{cell.CellText}" & vbTab)
            Next cell
            Console.WriteLine(vbLf & "--- End of Table ---")
        Next table
    End Sub

    Public Function ExtractRowByIndex(ByVal table As TableInfo, ByVal rowIndex As Integer) As List(Of CellInfo)
        ' Reject a missing table or one without any detected cells
        If table Is Nothing OrElse table.CellInfos Is Nothing OrElse Not table.CellInfos.Any() Then
            Throw New ArgumentException("Table is empty or invalid.")
        End If

        Dim sortedCells = OrganizeCellsByCoordinates(table.CellInfos)
        Dim rows As New List(Of List(Of CellInfo))()

        ' Group cells into rows
        Dim previousY As Integer = sortedCells.First().CellRect.Y
        Dim currentRow As New List(Of CellInfo)()
        For Each cell In sortedCells
            If Math.Abs(cell.CellRect.Y - previousY) > cell.CellRect.Height * 0.8 Then
                ' Store the completed row and start a new one
                rows.Add(New List(Of CellInfo)(currentRow))
                currentRow.Clear()
                previousY = cell.CellRect.Y
            End If
            currentRow.Add(cell)
        Next cell

        ' Add the last row
        If currentRow.Any() Then
            rows.Add(currentRow)
        End If

        ' Retrieve the specific row
        If rowIndex < 0 OrElse rowIndex >= rows.Count Then
            Throw New IndexOutOfRangeException($"Row index {rowIndex} is out of range.")
        End If
        Return rows(rowIndex)
    End Function
End Module
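The row-grouping rule used by the helpers (start a new row when a cell's Y differs from the current row's Y by more than 80% of the cell height) can be tried without IronOCR installed. The sketch below is a self-contained variant, not the library's own logic. Cell is a hypothetical stand-in for the detected cell type, and RowGrouping.GroupIntoRows is an illustrative name. It differs from the helpers above in one respect: it orders each row by X only after the rows are formed, so a one- or two-pixel Y difference between neighbouring cells cannot swap their column order.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-in for a detected cell; this is not an IronOCR type.
public sealed class Cell
{
    public Cell(string text, int x, int y, int height)
    {
        Text = text; X = x; Y = y; Height = height;
    }

    public string Text { get; }
    public int X { get; }
    public int Y { get; }
    public int Height { get; }
}

public static class RowGrouping
{
    // Groups cells into rows top to bottom, then orders each row left to right.
    public static List<List<Cell>> GroupIntoRows(IEnumerable<Cell> cells)
    {
        var rows = new List<List<Cell>>();
        int rowY = 0; // Y of the first cell placed in the current row

        foreach (var cell in cells.OrderBy(c => c.Y))
        {
            // Same 80%-of-height tolerance as the helper methods above
            bool startsNewRow = rows.Count == 0
                || Math.Abs(cell.Y - rowY) > cell.Height * 0.8;

            if (startsNewRow)
            {
                rows.Add(new List<Cell>());
                rowY = cell.Y;
            }
            rows[rows.Count - 1].Add(cell);
        }

        // Sort by X inside each row only after grouping
        return rows.Select(row => row.OrderBy(c => c.X).ToList()).ToList();
    }
}
```

For example, four cells at (X, Y) positions B (120, 11), A (10, 12), C (10, 40) and D (120, 41), each 20 pixels high, are grouped into two rows: A, B and C, D. To apply the same idea to real output, map each detected cell's CellText and CellRect values onto the stand-in type.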