How to Read Tables in Documents
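Let's talk about reading tables in documents. Extracting data from tables with plain Tesseract can be challenging, because the text usually sits inside cells and is sparsely distributed across the document. Our library, however, ships with a machine learning model that has been trained and fine-tuned to detect and extract table data accurately.
For simple tables, you can rely on direct table detection. For more complex structures, our exclusive ReadDocumentAdvanced method gives robust results, parsing the table and returning its data.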
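How to Read Tables in Documents
- Download a C# library to extract data from tables
- Prepare the image and PDF documents for extraction
- Set the ReadDataTables property to true to enable table detection
- Use the ReadDocumentAdvanced method for complex tables
- Extract the data detected by these methods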
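Simple Table Example
Setting the ReadDataTables property to true enables table detection using Tesseract. I created a simple table PDF to test this feature, which you can download here: simple-table.pdf. This approach can detect simple tables that contain no merged cells. For more complex tables, see the method described below.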
:path=/static-assets/ocr/content-code-examples/how-to/read-table-in-document-with-tesseract.cs
using IronOcr;
using System;
using System.Data;

// Instantiate OCR engine
var ocr = new IronTesseract();

// Enable table detection
ocr.Configuration.ReadDataTables = true;

using var input = new OcrPdfInput("simple-table.pdf");
var result = ocr.Read(input);

// Retrieve the data
var table = result.Tables[0].DataTable;

// Print out the table data, one tab-separated line per row
foreach (DataRow row in table.Rows)
{
    foreach (var item in row.ItemArray)
    {
        Console.Write(item + "\t");
    }
    Console.WriteLine();
}
Imports Microsoft.VisualBasic
Imports IronOcr
Imports System
Imports System.Data

' VB.NET version of the same example
Module Program
    Sub Main()
        ' Instantiate OCR engine
        Dim ocr = New IronTesseract()
        ' Enable table detection
        ocr.Configuration.ReadDataTables = True
        ' Dispose the input once reading is finished
        Using input = New OcrPdfInput("simple-table.pdf")
            Dim result = ocr.Read(input)
            ' Retrieve the data
            Dim table = result.Tables(0).DataTable
            ' Print out the table data, one tab-separated line per row
            For Each row As DataRow In table.Rows
                For Each item In row.ItemArray
                    Console.Write($"{item}{vbTab}")
                Next item
                Console.WriteLine()
            Next row
        End Using
    End Sub
End Module
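Once the data is available as a System.Data.DataTable, it can be handed to ordinary .NET code. The following is a minimal sketch, not part of IronOCR, that serializes a DataTable to CSV text. The DataTableCsv class and its methods are hypothetical names introduced for illustration, and BuildSampleTable only stands in for the table that result.Tables[0].DataTable would return.

```csharp
using System;
using System.Data;
using System.Linq;
using System.Text;

public static class DataTableCsv
{
    // Quote a field only when it contains a comma, a quote, or a line break.
    public static string Escape(string field)
    {
        if (field.IndexOfAny(new[] { ',', '"', '\n', '\r' }) < 0)
        {
            return field;
        }
        // Embedded quotes are doubled, then the whole field is wrapped in quotes.
        return "\"" + field.Replace("\"", "\"\"") + "\"";
    }

    // Serialize every row of the table as one CSV line (column names are not written).
    public static string ToCsv(DataTable table)
    {
        var builder = new StringBuilder();
        foreach (DataRow row in table.Rows)
        {
            var fields = row.ItemArray
                .Select(item => Escape(item?.ToString() ?? string.Empty));
            builder.AppendLine(string.Join(",", fields));
        }
        return builder.ToString();
    }

    // Sample data standing in for a table detected by OCR (assumption for this sketch).
    public static DataTable BuildSampleTable()
    {
        var table = new DataTable();
        table.Columns.Add("Column1", typeof(string));
        table.Columns.Add("Column2", typeof(string));
        table.Rows.Add("Item", "Price");
        table.Rows.Add("Widget, large", "4.50");
        return table;
    }

    public static void Main()
    {
        Console.Write(ToCsv(BuildSampleTable()));
    }
}
```

In a real project, the call would presumably become DataTableCsv.ToCsv(table) with the table variable from the example above. Fields are quoted only when needed, so OCR text that happens to contain commas does not shift columns.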
Complex Table Example
For complex tables, the ReadDocumentAdvanced method handles the job well. In this example, we will use the "table.pdf" file.
The ReadDocumentAdvanced method requires the IronOcr.Extensions.AdvancedScan package to be installed alongside the base IronOCR package. Currently, this extension is available only on Windows.
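Note: Using Advanced Scan on .NET Framework requires the project to run on the x64 architecture. To do this, open the project configuration and uncheck the "Prefer 32-bit" option. Learn more in the troubleshooting guide: Advanced Scan on .NET Framework.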
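Both requirements above (Windows only, and an x64 process on .NET Framework) can be checked at startup before calling ReadDocumentAdvanced. The sketch below uses only standard .NET APIs (Environment.Is64BitProcess and RuntimeInformation.IsOSPlatform). The AdvancedScanGuard class is a hypothetical helper, not part of IronOCR, and applying the 64-bit check on every runtime rather than only on .NET Framework is an assumption made for simplicity.

```csharp
using System;
using System.Runtime.InteropServices;

public static class AdvancedScanGuard
{
    // Hypothetical helper: returns an empty string when both requirements hold,
    // otherwise a short description of the first unmet requirement.
    public static string DescribeProblem(bool isWindows, bool is64BitProcess)
    {
        if (!isWindows)
        {
            return "The AdvancedScan extension is currently available only on Windows.";
        }
        if (!is64BitProcess)
        {
            return "The process is 32-bit; target x64 and uncheck \"Prefer 32-bit\".";
        }
        return string.Empty;
    }

    public static void Main()
    {
        // Query the real environment and report the result.
        string problem = DescribeProblem(
            RuntimeInformation.IsOSPlatform(OSPlatform.Windows),
            Environment.Is64BitProcess);

        Console.WriteLine(problem.Length == 0
            ? "Environment looks suitable for ReadDocumentAdvanced."
            : problem);
    }
}
```

Keeping the decision in a method that takes plain booleans makes it easy to test without changing the machine it runs on.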
:path=/static-assets/ocr/content-code-examples/how-to/read-table-in-document-with-ml.cs
using IronOcr;
using System.Linq;
// Instantiate OCR engine
var ocr = new IronTesseract();
using var input = new OcrInput();
input.LoadPdf("table.pdf");
// Perform OCR
var result = ocr.ReadDocumentAdvanced(input);
var cellList = result.Tables.First().CellInfos;
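Imports IronOcr
Imports System.Linq

' VB.NET version of the same example
Module Program
    Sub Main()
        ' Instantiate OCR engine
        Dim ocr = New IronTesseract()
        ' Dispose the input once reading is finished
        Using input = New OcrInput()
            input.LoadPdf("table.pdf")
            ' Perform OCR
            Dim result = ocr.ReadDocumentAdvanced(input)
            Dim cellList = result.Tables.First().CellInfos
        End Using
    End Sub
End Module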
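This method separates the document's text data into two categories: content enclosed by borders and content without borders. For bordered content, the library further divides it into subsections based on the structure of the table. The results are shown below. Note that because this method focuses only on information within borders, any merged cell that spans multiple rows is treated as a single cell.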
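Because a merged cell is returned as one cell, it can be useful to flag such cells before arranging the output into rows. The sketch below is one possible heuristic, not something IronOCR provides: it treats any cell that is at least 1.5 times taller than the median cell as probably spanning several rows. SampleCell is a stand-in type for this example, and both the 1.5 factor and the MergedCellFinder name are assumptions.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Stand-in for a detected cell; not an IronOCR type.
public sealed class SampleCell
{
    public SampleCell(string text, int y, int height)
    {
        Text = text;
        Y = y;
        Height = height;
    }

    public string Text { get; }
    public int Y { get; }
    public int Height { get; }
}

public static class MergedCellFinder
{
    // Return the cells whose height is at least `factor` times the median height.
    // For an even number of cells the upper median is used.
    public static List<SampleCell> FindTallCells(
        IReadOnlyList<SampleCell> cells, double factor = 1.5)
    {
        if (cells.Count == 0)
        {
            return new List<SampleCell>();
        }

        var heights = cells.Select(c => c.Height).OrderBy(h => h).ToList();
        int median = heights[heights.Count / 2];

        return cells.Where(c => c.Height >= median * factor).ToList();
    }

    public static void Main()
    {
        var cells = new List<SampleCell>
        {
            new SampleCell("Region", 0, 60), // twice as tall as the others
            new SampleCell("Q1", 0, 30),
            new SampleCell("Q2", 30, 30),
            new SampleCell("Total", 60, 30),
        };

        foreach (var cell in FindTallCells(cells))
        {
            Console.WriteLine($"{cell.Text} probably spans more than one row");
        }
    }
}
```

With the sample data, only the "Region" cell is flagged. The threshold would need tuning for tables whose rows legitimately vary in height.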
Result
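Helper Class
In the current implementation, the extracted cells are not yet properly organized. However, each cell carries useful information such as its X and Y coordinates, its dimensions, and more. Using this data, we can create a helper class for various purposes. Below are some basic helper methods: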
using System;
using System.Collections.Generic;
using System.Linq;
using IronOcr; // Assumed namespace for CellInfo, TableInfo, and Tables

public static class TableProcessor
{
    public static List<CellInfo> OrganizeCellsByCoordinates(List<CellInfo> cells)
    {
        // Sort cells by Y (top to bottom), then by X (left to right)
        var sortedCells = cells
            .OrderBy(cell => cell.CellRect.Y)
            .ThenBy(cell => cell.CellRect.X)
            .ToList();
        return sortedCells;
    }

    // Example of how to use the function
    public static void ProcessTables(Tables tables)
    {
        foreach (var table in tables)
        {
            var sortedCells = OrganizeCellsByCoordinates(table.CellInfos);
            Console.WriteLine("Organized Table Cells:");
            int previousY = sortedCells.Any() ? sortedCells.First().CellRect.Y : 0;
            foreach (var cell in sortedCells)
            {
                // Print a new line if the Y-coordinate changes, indicating a new row
                if (Math.Abs(cell.CellRect.Y - previousY) > cell.CellRect.Height * 0.8)
                {
                    Console.WriteLine(); // Start a new row
                    previousY = cell.CellRect.Y;
                }
                Console.Write($"{cell.CellText}\t");
            }
            Console.WriteLine("\n--- End of Table ---");
        }
    }

    public static List<CellInfo> ExtractRowByIndex(TableInfo table, int rowIndex)
    {
        // Reject a missing table or one without any cells
        if (table == null
            || table.CellInfos == null
            || !table.CellInfos.Any())
        {
            throw new ArgumentException("Table is empty or invalid.");
        }

        var sortedCells = OrganizeCellsByCoordinates(table.CellInfos);
        List<List<CellInfo>> rows = new List<List<CellInfo>>();

        // Group cells into rows
        int previousY = sortedCells.First().CellRect.Y;
        List<CellInfo> currentRow = new List<CellInfo>();
        foreach (var cell in sortedCells)
        {
            if (Math.Abs(cell.CellRect.Y - previousY) > cell.CellRect.Height * 0.8)
            {
                // Store the completed row and start a new one
                rows.Add(new List<CellInfo>(currentRow));
                currentRow.Clear();
                previousY = cell.CellRect.Y;
            }
            currentRow.Add(cell);
        }

        // Add the last row
        if (currentRow.Any())
        {
            rows.Add(currentRow);
        }

        // Retrieve the specific row
        if (rowIndex < 0
            || rowIndex >= rows.Count)
        {
            throw new IndexOutOfRangeException($"Row index {rowIndex} is out of range.");
        }
        return rows[rowIndex];
    }
}
Imports Microsoft.VisualBasic
Imports System
Imports System.Collections.Generic
Imports System.Linq
Imports IronOcr ' Assumed namespace for CellInfo, TableInfo, and Tables

' VB.NET version of the same helper class
Public Module TableProcessor
    Public Function OrganizeCellsByCoordinates(ByVal cells As List(Of CellInfo)) As List(Of CellInfo)
        ' Sort cells by Y (top to bottom), then by X (left to right)
        Dim sortedCells = cells.OrderBy(Function(cell) cell.CellRect.Y).ThenBy(Function(cell) cell.CellRect.X).ToList()
        Return sortedCells
    End Function

    ' Example of how to use the function
    Public Sub ProcessTables(ByVal tables As Tables)
        For Each table In tables
            Dim sortedCells = OrganizeCellsByCoordinates(table.CellInfos)
            Console.WriteLine("Organized Table Cells:")
            Dim previousY As Integer = If(sortedCells.Any(), sortedCells.First().CellRect.Y, 0)
            For Each cell In sortedCells
                ' Print a new line if the Y-coordinate changes, indicating a new row
                If Math.Abs(cell.CellRect.Y - previousY) > cell.CellRect.Height * 0.8 Then
                    Console.WriteLine() ' Start a new row
                    previousY = cell.CellRect.Y
                End If
                Console.Write($"{cell.CellText}" & vbTab)
            Next cell
            Console.WriteLine(vbLf & "--- End of Table ---")
        Next table
    End Sub

    Public Function ExtractRowByIndex(ByVal table As TableInfo, ByVal rowIndex As Integer) As List(Of CellInfo)
        ' Reject a missing table or one without any cells
        If table Is Nothing OrElse table.CellInfos Is Nothing OrElse (Not table.CellInfos.Any()) Then
            Throw New ArgumentException("Table is empty or invalid.")
        End If

        Dim sortedCells = OrganizeCellsByCoordinates(table.CellInfos)
        Dim rows As New List(Of List(Of CellInfo))()

        ' Group cells into rows
        Dim previousY As Integer = sortedCells.First().CellRect.Y
        Dim currentRow As New List(Of CellInfo)()
        For Each cell In sortedCells
            If Math.Abs(cell.CellRect.Y - previousY) > cell.CellRect.Height * 0.8 Then
                ' Store the completed row and start a new one
                rows.Add(New List(Of CellInfo)(currentRow))
                currentRow.Clear()
                previousY = cell.CellRect.Y
            End If
            currentRow.Add(cell)
        Next cell

        ' Add the last row
        If currentRow.Any() Then
            rows.Add(currentRow)
        End If

        ' Retrieve the specific row
        If rowIndex < 0 OrElse rowIndex >= rows.Count Then
            Throw New IndexOutOfRangeException($"Row index {rowIndex} is out of range.")
        End If
        Return rows(rowIndex)
    End Function
End Module
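The same coordinate data can also be used to pull out a column rather than a row. The sketch below is one way this might look, using the same 0.8 × height row tolerance as the helper class above. GridCell is a stand-in type so the example runs without IronOCR, and GridBuilder, GroupIntoRows, and ExtractColumnByIndex are hypothetical names. Unlike the helper above, it sorts each row by X after grouping, so that small vertical offsets between cells in the same row do not change the column order.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Stand-in for a detected cell; not an IronOCR type.
public sealed class GridCell
{
    public GridCell(string text, int x, int y, int height)
    {
        Text = text;
        X = x;
        Y = y;
        Height = height;
    }

    public string Text { get; }
    public int X { get; }
    public int Y { get; }
    public int Height { get; }
}

public static class GridBuilder
{
    // Group cells into rows: a new row starts when the Y offset from the
    // first cell of the current row exceeds 80% of the cell's height.
    public static List<List<GridCell>> GroupIntoRows(IEnumerable<GridCell> cells)
    {
        var rows = new List<List<GridCell>>();
        int previousY = 0;

        foreach (var cell in cells.OrderBy(c => c.Y))
        {
            if (rows.Count == 0 || Math.Abs(cell.Y - previousY) > cell.Height * 0.8)
            {
                rows.Add(new List<GridCell>());
                previousY = cell.Y;
            }
            rows[rows.Count - 1].Add(cell);
        }

        // Order each row left to right once its members are known.
        foreach (var row in rows)
        {
            row.Sort((a, b) => a.X.CompareTo(b.X));
        }
        return rows;
    }

    // Take the cell at columnIndex from each row; rows that are too short are skipped.
    public static List<string> ExtractColumnByIndex(IEnumerable<GridCell> cells, int columnIndex)
    {
        return GroupIntoRows(cells)
            .Where(row => columnIndex >= 0 && columnIndex < row.Count)
            .Select(row => row[columnIndex].Text)
            .ToList();
    }

    public static void Main()
    {
        var cells = new List<GridCell>
        {
            new GridCell("Name", 0, 2, 30),   // slightly lower than its row neighbour
            new GridCell("Qty", 100, 0, 30),
            new GridCell("Apple", 0, 30, 30),
            new GridCell("3", 100, 31, 30),
            new GridCell("Pear", 0, 60, 30),
            new GridCell("5", 100, 60, 30),
        };

        Console.WriteLine(string.Join(", ", ExtractColumnByIndex(cells, 1)));
    }
}
```

Positional column indexes only line up when every row has the same number of cells. Tables with merged cells would need to match cells by X range instead.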