Cómo leer una tabla en documentos

Curtis Chau

13 de enero, 2025

Actualizado 13 de enero, 2025

Translated

View the article in English

Hablemos sobre la lectura de tablas en documentos. Extraer datos de tablas usando Tesseract en su forma básica puede ser un desafío, ya que el texto a menudo se encuentra en celdas y está disperso esporádicamente a lo largo del documento. Sin embargo, nuestra biblioteca está equipada con un modelo de aprendizaje automático que ha sido entrenado y ajustado para detectar y extraer datos de tablas con precisión.

Para tablas simples, puedes confiar en la detección de tablas directa, mientras que para estructuras más complejas, nuestro exclusivo método ReadDocumentAdvanced ofrece resultados sólidos, analizando la tabla de manera efectiva y proporcionando los datos.

Comience con IronOCR

Comience a usar IronOCR en su proyecto hoy con una prueba gratuita.

Primer Paso:

Cómo leer una tabla en documentos

Descargar una biblioteca de C# para extraer datos de tablas
Prepare la imagen y el documento PDF para la extracción
Establezca la propiedad ReadDataTables en true para habilitar la detección de tablas
Utilice el método ReadDocumentAdvanced para tablas complejas
Extraer los datos detectados por estos métodos.

Ejemplo de Tabla Simple

Establecer la propiedad ReadDataTables en true habilita la detección de tablas utilizando Tesseract. He creado un PDF de tabla simple para probar esta característica, el cual puedes descargar aquí: 'simple-table.pdf'. Las tablas simples sin celdas combinadas pueden ser detectadas utilizando este método. Para tablas más complejas, por favor consulte el método descrito a continuación.

:path=/static-assets/ocr/content-code-examples/how-to/read-table-in-document-with-tesseract.cs

using IronOcr;
using System;
using System.Data;

// Instantiate OCR engine
var ocr = new IronTesseract();

// Enable table detection
ocr.Configuration.ReadDataTables = true;

using var input = new OcrPdfInput("simple-table.pdf");
var result = ocr.Read(input);

// Retrieve the data
var table = result.Tables[0].DataTable;

// Print out the table data
foreach (DataRow row in table.Rows)
{
    foreach (var item in row.ItemArray)
    {
        Console.Write(item + "\t");
    }
    Console.WriteLine();
}

Imports Microsoft.VisualBasic
Imports IronOcr
Imports System
Imports System.Data

' Instantiate OCR engine
Private ocr = New IronTesseract()

' Enable table detection
ocr.Configuration.ReadDataTables = True

Dim input = New OcrPdfInput("simple-table.pdf")
Dim result = ocr.Read(input)

' Retrieve the data
Dim table = result.Tables(0).DataTable

' Print out the table data
For Each row As DataRow In table.Rows
	For Each item In row.ItemArray
		Console.Write(item & vbTab)
	Next item
	Console.WriteLine()
Next row

$vbLabelText $csharpLabel

Ejemplo de Tabla Compleja

Para tablas complejas, el método ReadDocumentAdvanced las maneja de manera excelente. En este ejemplo, usaremos el archivo 'table.pdf'.

El método ReadDocumentAdvanced requiere que el paquete IronOcr.Extensions.AdvancedScan esté instalado junto con el paquete base de IronOCR. Actualmente, esta extensión sólo está disponible en Windows.

Atención

El uso de la exploración avanzada en .NET Framework requiere que el proyecto se ejecute en arquitectura x64. Para ello, vaya a la configuración del proyecto y desmarque la opción "Preferir 32 bits". Obtenga más información en la siguiente guía de resolución de problemas: "Advanced Scan on .NET Framework".

:path=/static-assets/ocr/content-code-examples/how-to/read-table-in-document-with-ml.cs

using IronOcr;
using System.Linq;

// Instantiate OCR engine
var ocr = new IronTesseract();

using var input = new OcrInput();
input.LoadPdf("table.pdf");

// Perform OCR
var result = ocr.ReadDocumentAdvanced(input);

var cellList = result.Tables.First().CellInfos;

Imports IronOcr
Imports System.Linq

' Instantiate OCR engine
Private ocr = New IronTesseract()

Private input = New OcrInput()
input.LoadPdf("table.pdf")

' Perform OCR
Dim result = ocr.ReadDocumentAdvanced(input)

Dim cellList = result.Tables.First().CellInfos

$vbLabelText $csharpLabel

Este método separa los datos de texto del documento en dos categorías: una encerrada por bordes y otra sin bordes. Para el contenido con bordes, la biblioteca lo divide aún más en subsecciones basadas en la estructura de la tabla. Los resultados se muestran a continuación. Es importante tener en cuenta que, dado que este método se centra en la información encerrada por bordes, cualquier celda combinada que abarque múltiples filas se tratará como una sola celda.

Resultado

Clase auxiliar

En la implementación actual, las celdas extraídas aún no están organizadas correctamente. Sin embargo, cada celda contiene información valiosa como las coordenadas X e Y, dimensiones y más. Con estos datos, podemos crear una clase auxiliar para diversos propósitos. A continuación, se presentan algunos métodos de ayuda básicos:

public static class TableProcessor
{
    public static List<CellInfo> OrganizeCellsByCoordinates(List<CellInfo> cells)
    {
        // Sort cells by Y (top to bottom), then by X (left to right)
        var sortedCells = cells
            .OrderBy(cell => cell.CellRect.Y)
            .ThenBy(cell => cell.CellRect.X)
            .ToList();

        return sortedCells;
    }

    // Example of how to use the function
    public static void ProcessTables(Tables tables)
    {
        foreach (var table in tables)
        {
            var sortedCells = OrganizeCellsByCoordinates(table.CellInfos);

            Console.WriteLine("Organized Table Cells:");

            // int previousY = sortedCells.FirstOrDefault()?.CellRect.Y ?? 0;
            int previousY = sortedCells.Any() ? sortedCells.First().CellRect.Y : 0;

            foreach (var cell in sortedCells)
            {
                // Print a new line if the Y-coordinate changes, indicating a new row
                if (Math.Abs(cell.CellRect.Y - previousY) > cell.CellRect.Height * 0.8)
                {
                    Console.WriteLine();  // Start a new row
                    previousY = cell.CellRect.Y;
                }

                Console.Write($"{cell.CellText}\t");
            }

            Console.WriteLine("\n--- End of Table ---");
        }
    }

    public static List<CellInfo> ExtractRowByIndex(TableInfo table, int rowIndex)
    {
        if (table == null 
 table.CellInfos == null 
 !table.CellInfos.Any())
        {
            throw new ArgumentException("Table is empty or invalid.");
        }

        var sortedCells = OrganizeCellsByCoordinates(table.CellInfos);
        List<List<CellInfo>> rows = new List<List<CellInfo>>();

        // Group cells into rows
        int previousY = sortedCells.First().CellRect.Y;
        List<CellInfo> currentRow = new List<CellInfo>();

        foreach (var cell in sortedCells)
        {
            if (Math.Abs(cell.CellRect.Y - previousY) > cell.CellRect.Height * 0.8)
            {
                // Store the completed row and start a new one
                rows.Add(new List<CellInfo>(currentRow));
                currentRow.Clear();

                previousY = cell.CellRect.Y;
            }

            currentRow.Add(cell);
        }

        // Add the last row
        if (currentRow.Any())
        {
            rows.Add(currentRow);
        }

        // Retrieve the specific row
        if (rowIndex < 0 
 rowIndex >= rows.Count)
        {
            throw new IndexOutOfRangeException($"Row index {rowIndex} is out of range.");
        }

        return rows[rowIndex];
    }
}

public static class TableProcessor
{
    public static List<CellInfo> OrganizeCellsByCoordinates(List<CellInfo> cells)
    {
        // Sort cells by Y (top to bottom), then by X (left to right)
        var sortedCells = cells
            .OrderBy(cell => cell.CellRect.Y)
            .ThenBy(cell => cell.CellRect.X)
            .ToList();

        return sortedCells;
    }

    // Example of how to use the function
    public static void ProcessTables(Tables tables)
    {
        foreach (var table in tables)
        {
            var sortedCells = OrganizeCellsByCoordinates(table.CellInfos);

            Console.WriteLine("Organized Table Cells:");

            // int previousY = sortedCells.FirstOrDefault()?.CellRect.Y ?? 0;
            int previousY = sortedCells.Any() ? sortedCells.First().CellRect.Y : 0;

            foreach (var cell in sortedCells)
            {
                // Print a new line if the Y-coordinate changes, indicating a new row
                if (Math.Abs(cell.CellRect.Y - previousY) > cell.CellRect.Height * 0.8)
                {
                    Console.WriteLine();  // Start a new row
                    previousY = cell.CellRect.Y;
                }

                Console.Write($"{cell.CellText}\t");
            }

            Console.WriteLine("\n--- End of Table ---");
        }
    }

    public static List<CellInfo> ExtractRowByIndex(TableInfo table, int rowIndex)
    {
        if (table == null 
 table.CellInfos == null 
 !table.CellInfos.Any())
        {
            throw new ArgumentException("Table is empty or invalid.");
        }

        var sortedCells = OrganizeCellsByCoordinates(table.CellInfos);
        List<List<CellInfo>> rows = new List<List<CellInfo>>();

        // Group cells into rows
        int previousY = sortedCells.First().CellRect.Y;
        List<CellInfo> currentRow = new List<CellInfo>();

        foreach (var cell in sortedCells)
        {
            if (Math.Abs(cell.CellRect.Y - previousY) > cell.CellRect.Height * 0.8)
            {
                // Store the completed row and start a new one
                rows.Add(new List<CellInfo>(currentRow));
                currentRow.Clear();

                previousY = cell.CellRect.Y;
            }

            currentRow.Add(cell);
        }

        // Add the last row
        if (currentRow.Any())
        {
            rows.Add(currentRow);
        }

        // Retrieve the specific row
        if (rowIndex < 0 
 rowIndex >= rows.Count)
        {
            throw new IndexOutOfRangeException($"Row index {rowIndex} is out of range.");
        }

        return rows[rowIndex];
    }
}

Imports Microsoft.VisualBasic

Public Module TableProcessor
	Public Function OrganizeCellsByCoordinates(ByVal cells As List(Of CellInfo)) As List(Of CellInfo)
		' Sort cells by Y (top to bottom), then by X (left to right)
		Dim sortedCells = cells.OrderBy(Function(cell) cell.CellRect.Y).ThenBy(Function(cell) cell.CellRect.X).ToList()

		Return sortedCells
	End Function

	' Example of how to use the function
	Public Sub ProcessTables(ByVal tables As Tables)
		For Each table In tables
			Dim sortedCells = OrganizeCellsByCoordinates(table.CellInfos)

			Console.WriteLine("Organized Table Cells:")

			' int previousY = sortedCells.FirstOrDefault()?.CellRect.Y ?? 0;
			Dim previousY As Integer = If(sortedCells.Any(), sortedCells.First().CellRect.Y, 0)

			For Each cell In sortedCells
				' Print a new line if the Y-coordinate changes, indicating a new row
				If Math.Abs(cell.CellRect.Y - previousY) > cell.CellRect.Height * 0.8 Then
					Console.WriteLine() ' Start a new row
					previousY = cell.CellRect.Y
				End If

				Console.Write($"{cell.CellText}" & vbTab)
			Next cell

			Console.WriteLine(vbLf & "--- End of Table ---")
		Next table
	End Sub

	Public Function ExtractRowByIndex(ByVal table As TableInfo, ByVal rowIndex As Integer) As List(Of CellInfo)
		If table Is Nothing table.CellInfos Is Nothing (Not table.CellInfos.Any()) Then
			Throw New ArgumentException("Table is empty or invalid.")
		End If

		Dim sortedCells = OrganizeCellsByCoordinates(table.CellInfos)
		Dim rows As New List(Of List(Of CellInfo))()

		' Group cells into rows
		Dim previousY As Integer = sortedCells.First().CellRect.Y
		Dim currentRow As New List(Of CellInfo)()

		For Each cell In sortedCells
			If Math.Abs(cell.CellRect.Y - previousY) > cell.CellRect.Height * 0.8 Then
				' Store the completed row and start a new one
				rows.Add(New List(Of CellInfo)(currentRow))
				currentRow.Clear()

				previousY = cell.CellRect.Y
			End If

			currentRow.Add(cell)
		Next cell

		' Add the last row
		If currentRow.Any() Then
			rows.Add(currentRow)
		End If

		' Retrieve the specific row
		If rowIndex < 0 rowIndex >= rows.Count Then
			Throw New IndexOutOfRangeException($"Row index {rowIndex} is out of range.")
		End If

		Return rows(rowIndex)
	End Function
End Module

$vbLabelText $csharpLabel

Curtis Chau

Chatea con el equipo de ingeniería ahora

Redactor técnico

Curtis Chau tiene una licenciatura en Ciencias de la Computación (Universidad de Carleton) y se especializa en desarrollo front-end con experiencia en Node.js, TypeScript, JavaScript y React. Apasionado por crear interfaces de usuario intuitivas y estéticamente atractivas, Curtis disfruta trabajando con frameworks modernos y creando manuales bien estructurados y visualmente atractivos.

Más allá del desarrollo, Curtis tiene un gran interés en el Internet de las Cosas (IoT), explorando formas innovadoras de integrar hardware y software. En su tiempo libre, disfruta de los videojuegos y de construir bots para Discord, combinando su amor por la tecnología con la creatividad.