How to Read Tables in Documents

Curtis Chau

January 13, 2025

Updated January 13, 2025

Let's talk about reading tables in documents. Extracting data from tables with plain Tesseract can be challenging, because the text often sits inside cells and is sparsely scattered across the document. Our library, however, ships with a machine learning model that has been trained and fine-tuned to detect and extract table data accurately.

For simple tables, you can rely on the straightforward table detection. For more complex structures, our exclusive ReadDocumentAdvanced method delivers robust results by analyzing the table and returning its data.

How to Read Tables in Documents

Download a C# library to extract data from tables
Prepare the image and PDF document for extraction
Set the ReadDataTables property to true to enable table detection
Use the ReadDocumentAdvanced method for complex tables
Extract the data detected by these methods

Simple Table Example

Setting the ReadDataTables property to true enables table detection with Tesseract. I created a simple table PDF to test this feature, which you can download here: 'simple-table.pdf'. This method can detect simple tables without merged cells. For more complex tables, see the method described below.

using IronOcr;
using System;
using System.Data;

// Instantiate OCR engine
var ocr = new IronTesseract();

// Enable table detection
ocr.Configuration.ReadDataTables = true;

using var input = new OcrPdfInput("simple-table.pdf");
var result = ocr.Read(input);

// Retrieve the data
var table = result.Tables[0].DataTable;

// Print out the table data
foreach (DataRow row in table.Rows)
{
    foreach (var item in row.ItemArray)
    {
        Console.Write(item + "\t");
    }
    Console.WriteLine();
}

Imports Microsoft.VisualBasic
Imports IronOcr
Imports System
Imports System.Data

Module Program
	Sub Main()
		' Instantiate OCR engine
		Dim ocr = New IronTesseract()

		' Enable table detection
		ocr.Configuration.ReadDataTables = True

		' Dispose of the input once reading is finished
		Using input = New OcrPdfInput("simple-table.pdf")
			Dim result = ocr.Read(input)

			' Retrieve the data
			Dim table = result.Tables(0).DataTable

			' Print out the table data
			For Each row As DataRow In table.Rows
				For Each item In row.ItemArray
					Console.Write(item.ToString() & vbTab)
				Next item
				Console.WriteLine()
			Next row
		End Using
	End Sub
End Module

Complex Table Example

The ReadDocumentAdvanced method handles complex tables very well. This example uses the file 'table.pdf'.

The ReadDocumentAdvanced method requires the IronOcr.Extensions.AdvancedScan package to be installed alongside the base IronOCR package. This extension is currently available only on Windows.

Please note

Using advanced scan on .NET Framework requires the project to run on an x64 architecture. To do so, open the project configuration and uncheck the "Prefer 32-bit" option. Learn more in the following troubleshooting guide: "Advanced Scan on .NET Framework."

using IronOcr;
using System.Linq;

// Instantiate OCR engine
var ocr = new IronTesseract();

using var input = new OcrInput();
input.LoadPdf("table.pdf");

// Perform OCR
var result = ocr.ReadDocumentAdvanced(input);

var cellList = result.Tables.First().CellInfos;

Imports IronOcr
Imports System.Linq

Module Program
	Sub Main()
		' Instantiate OCR engine
		Dim ocr = New IronTesseract()

		' Dispose of the input once reading is finished
		Using input = New OcrInput()
			input.LoadPdf("table.pdf")

			' Perform OCR
			Dim result = ocr.ReadDocumentAdvanced(input)

			Dim cellList = result.Tables.First().CellInfos
		End Using
	End Sub
End Module

This method divides the document's text data into two categories: text enclosed by borders and text without borders. For bordered content, the library further divides it into subsections based on the structure of the table. The results are shown below. Note that because this method focuses on information enclosed by borders, any merged cell that spans multiple rows is treated as a single cell.

Result

Helper Class

In the current implementation, the extracted cells are not yet properly organized. However, each cell carries useful information such as its X and Y coordinates, dimensions, and more. With this data, we can build a helper class for various purposes. Below are some basic helper methods:

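using System;
using System.Collections.Generic;
using System.Linq;
using IronOcr; // assumed namespace for CellInfo, TableInfo, and Tables

public static class TableProcessor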
{
    public static List<CellInfo> OrganizeCellsByCoordinates(List<CellInfo> cells)
    {
        // Sort cells by Y (top to bottom), then by X (left to right)
        var sortedCells = cells
            .OrderBy(cell => cell.CellRect.Y)
            .ThenBy(cell => cell.CellRect.X)
            .ToList();

        return sortedCells;
    }

    // Example of how to use the function
    public static void ProcessTables(Tables tables)
    {
        foreach (var table in tables)
        {
            var sortedCells = OrganizeCellsByCoordinates(table.CellInfos);

            Console.WriteLine("Organized Table Cells:");

            // int previousY = sortedCells.FirstOrDefault()?.CellRect.Y ?? 0;
            int previousY = sortedCells.Any() ? sortedCells.First().CellRect.Y : 0;

            foreach (var cell in sortedCells)
            {
                // Print a new line if the Y-coordinate changes, indicating a new row
                if (Math.Abs(cell.CellRect.Y - previousY) > cell.CellRect.Height * 0.8)
                {
                    Console.WriteLine();  // Start a new row
                    previousY = cell.CellRect.Y;
                }

                Console.Write($"{cell.CellText}\t");
            }

            Console.WriteLine("\n--- End of Table ---");
        }
    }

    public static List<CellInfo> ExtractRowByIndex(TableInfo table, int rowIndex)
    {
        if (table == null
            || table.CellInfos == null
            || !table.CellInfos.Any())
        {
            throw new ArgumentException("Table is empty or invalid.");
        }

        var sortedCells = OrganizeCellsByCoordinates(table.CellInfos);
        List<List<CellInfo>> rows = new List<List<CellInfo>>();

        // Group cells into rows
        int previousY = sortedCells.First().CellRect.Y;
        List<CellInfo> currentRow = new List<CellInfo>();

        foreach (var cell in sortedCells)
        {
            if (Math.Abs(cell.CellRect.Y - previousY) > cell.CellRect.Height * 0.8)
            {
                // Store the completed row and start a new one
                rows.Add(new List<CellInfo>(currentRow));
                currentRow.Clear();

                previousY = cell.CellRect.Y;
            }

            currentRow.Add(cell);
        }

        // Add the last row
        if (currentRow.Any())
        {
            rows.Add(currentRow);
        }

        // Retrieve the specific row
        if (rowIndex < 0 || rowIndex >= rows.Count)
        {
            throw new IndexOutOfRangeException($"Row index {rowIndex} is out of range.");
        }

        return rows[rowIndex];
    }
}

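Imports Microsoft.VisualBasic
Imports System
Imports System.Collections.Generic
Imports System.Linq
Imports IronOcr ' assumed namespace for CellInfo, TableInfo, and Tables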

Public Module TableProcessor
	Public Function OrganizeCellsByCoordinates(ByVal cells As List(Of CellInfo)) As List(Of CellInfo)
		' Sort cells by Y (top to bottom), then by X (left to right)
		Dim sortedCells = cells.OrderBy(Function(cell) cell.CellRect.Y).ThenBy(Function(cell) cell.CellRect.X).ToList()

		Return sortedCells
	End Function

	' Example of how to use the function
	Public Sub ProcessTables(ByVal tables As Tables)
		For Each table In tables
			Dim sortedCells = OrganizeCellsByCoordinates(table.CellInfos)

			Console.WriteLine("Organized Table Cells:")

			' int previousY = sortedCells.FirstOrDefault()?.CellRect.Y ?? 0;
			Dim previousY As Integer = If(sortedCells.Any(), sortedCells.First().CellRect.Y, 0)

			For Each cell In sortedCells
				' Print a new line if the Y-coordinate changes, indicating a new row
				If Math.Abs(cell.CellRect.Y - previousY) > cell.CellRect.Height * 0.8 Then
					Console.WriteLine() ' Start a new row
					previousY = cell.CellRect.Y
				End If

				Console.Write($"{cell.CellText}" & vbTab)
			Next cell

			Console.WriteLine(vbLf & "--- End of Table ---")
		Next table
	End Sub

	Public Function ExtractRowByIndex(ByVal table As TableInfo, ByVal rowIndex As Integer) As List(Of CellInfo)
		If table Is Nothing OrElse table.CellInfos Is Nothing OrElse (Not table.CellInfos.Any()) Then
			Throw New ArgumentException("Table is empty or invalid.")
		End If

		Dim sortedCells = OrganizeCellsByCoordinates(table.CellInfos)
		Dim rows As New List(Of List(Of CellInfo))()

		' Group cells into rows
		Dim previousY As Integer = sortedCells.First().CellRect.Y
		Dim currentRow As New List(Of CellInfo)()

		For Each cell In sortedCells
			If Math.Abs(cell.CellRect.Y - previousY) > cell.CellRect.Height * 0.8 Then
				' Store the completed row and start a new one
				rows.Add(New List(Of CellInfo)(currentRow))
				currentRow.Clear()

				previousY = cell.CellRect.Y
			End If

			currentRow.Add(cell)
		Next cell

		' Add the last row
		If currentRow.Any() Then
			rows.Add(currentRow)
		End If

		' Retrieve the specific row
		If rowIndex < 0 OrElse rowIndex >= rows.Count Then
			Throw New IndexOutOfRangeException($"Row index {rowIndex} is out of range.")
		End If

		Return rows(rowIndex)
	End Function
End Module
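
To try the row-grouping idea without installing IronOCR, here is a minimal, self-contained sketch. It makes several assumptions: the tuples (X, Y, Height, Text) are hypothetical stand-ins for the cell data and are not the library's CellInfo type, GroupIntoRows is an illustrative name, and the sample coordinates are made up. The sketch also differs from the helper class in one respect: it orders each row from left to right only after the rows have been formed, so a cell that sits a few pixels higher than its neighbors is not printed ahead of them.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-in cells (NOT IronOCR's CellInfo): position, height, text.
// "Qty" sits 2 px higher than "Item", as can happen with slightly skewed scans.
var cells = new List<(int X, int Y, int Height, string Text)>
{
    (120, 48, 30, "Qty"),
    (10, 50, 30, "Item"),
    (10, 90, 30, "Apple"),
    (120, 91, 30, "3"),
};

// Group cells into rows. A cell starts a new row when its Y differs from the
// current row's reference Y by more than 80% of the cell height (the same
// threshold the helper class uses).
static List<List<(int X, int Y, int Height, string Text)>> GroupIntoRows(
    IEnumerable<(int X, int Y, int Height, string Text)> source)
{
    var rows = new List<List<(int X, int Y, int Height, string Text)>>();
    int referenceY = 0;

    foreach (var cell in source.OrderBy(c => c.Y))
    {
        if (rows.Count == 0 || Math.Abs(cell.Y - referenceY) > cell.Height * 0.8)
        {
            // Open a new row and remember its reference Y
            rows.Add(new List<(int X, int Y, int Height, string Text)>());
            referenceY = cell.Y;
        }

        rows[rows.Count - 1].Add(cell);
    }

    // Order each row left to right once its members are known
    return rows.Select(r => r.OrderBy(c => c.X).ToList()).ToList();
}

var grid = GroupIntoRows(cells);

// Print one tab-separated line per detected row
foreach (var row in grid)
{
    Console.WriteLine(string.Join("\t", row.Select(c => c.Text)));
}
```

With the sample data above, the program prints "Item" and "Qty" on the first line and "Apple" and "3" on the second, even though "Qty" has the smallest Y value. The 0.8 factor is a heuristic; tables with very tall or very short cells may need a different threshold.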
