ドキュメント内のテーブルを読み取る方法

Curtis Chau

2025年1月13日

更新済み 2025年1月13日

共有:

Translated

View the article in English

ドキュメント内のテーブルを読むことについて話しましょう。プレーンなTesseractを使用してテーブルからデータを抽出することは、テキストがしばしばセル内にあり、文書全体に散在しているため、困難です。ただし、当ライブラリは、テーブルデータを正確に検出して抽出するために、機械学習モデルを訓練し、微調整しています。

簡単な表の場合は、単純な表検出に頼ることができますが、より複雑な構造の場合には、弊社独自のReadDocumentAdvancedメソッドが堅牢な結果を提供し、表を効果的に解析してデータを提供します。

IronOCRを始めましょう

今日から無料トライアルでIronOCRをあなたのプロジェクトで使い始めましょう。

最初のステップ:

ドキュメント内のテーブルを読み取る方法

テーブルからデータを抽出するC#ライブラリをダウンロード
画像とPDF文書を抽出のために準備する
ReadDataTables プロパティを true に設定してテーブル検出を有効にします。
複雑なテーブルにはReadDocumentAdvancedメソッドを使用します
これらの方法で検出されたデータを抽出します。

シンプルなテーブル例

ReadDataTables プロパティを true に設定すると、Tesseract を使用したテーブル検出が有効になります。この機能をテストするためにシンプルなテーブルPDFを作成しました。こちらからダウンロードできます：'simple-table.pdf'。結合セルのないシンプルな表はこの方法で検出できます。より複雑なテーブルについては、以下に説明する方法を参照してください。

:path=/static-assets/ocr/content-code-examples/how-to/read-table-in-document-with-tesseract.cs

using IronOcr;
using System;
using System.Data;

// Instantiate OCR engine
var ocr = new IronTesseract();

// Enable table detection
ocr.Configuration.ReadDataTables = true;

using var input = new OcrPdfInput("simple-table.pdf");
var result = ocr.Read(input);

// Retrieve the data
var table = result.Tables[0].DataTable;

// Print out the table data
foreach (DataRow row in table.Rows)
{
    foreach (var item in row.ItemArray)
    {
        Console.Write(item + "\t");
    }
    Console.WriteLine();
}

Imports Microsoft.VisualBasic
Imports IronOcr
Imports System
Imports System.Data

' Instantiate OCR engine
Private ocr = New IronTesseract()

' Enable table detection
ocr.Configuration.ReadDataTables = True

Dim input = New OcrPdfInput("simple-table.pdf")
Dim result = ocr.Read(input)

' Retrieve the data
Dim table = result.Tables(0).DataTable

' Print out the table data
For Each row As DataRow In table.Rows
	For Each item In row.ItemArray
		Console.Write(item & vbTab)
	Next item
	Console.WriteLine()
Next row

$vbLabelText $csharpLabel

複雑なテーブル例

複雑なテーブルについては、ReadDocumentAdvanced メソッドがそれらを美しく処理します。この例では、'table.pdf' ファイルを使用します。

ReadDocumentAdvanced メソッドは、基本の IronOCR パッケージと共に IronOcr.Extensions.AdvancedScan パッケージをインストールする必要があります。現在、この拡張機能はWindowsでのみ利用可能です。

次の内容にご注意ください。

高度なスキャンを .NET Framework で使用するには、プロジェクトを x64アーキテクチャで実行する必要があります。プロジェクトの設定に移動し、「32ビットを優先」のオプションのチェックを外してください。次のトラブルシューティングガイドで詳細をご確認ください：.NET Framework における高度なスキャン。

:path=/static-assets/ocr/content-code-examples/how-to/read-table-in-document-with-ml.cs

using IronOcr;
using System.Linq;

// Instantiate OCR engine
var ocr = new IronTesseract();

using var input = new OcrInput();
input.LoadPdf("table.pdf");

// Perform OCR
var result = ocr.ReadDocumentAdvanced(input);

var cellList = result.Tables.First().CellInfos;

Imports IronOcr
Imports System.Linq

' Instantiate OCR engine
Private ocr = New IronTesseract()

Private input = New OcrInput()
input.LoadPdf("table.pdf")

' Perform OCR
Dim result = ocr.ReadDocumentAdvanced(input)

Dim cellList = result.Tables.First().CellInfos

$vbLabelText $csharpLabel

このメソッドは、ドキュメントのテキストデータを境界で囲まれたものと囲まれていないものの2つのカテゴリに分けます。枠付きのコンテンツの場合、ライブラリはテーブルの構造に基づいてさらに小節に分けます。結果は以下の通りです。このメソッドは境界で囲まれた情報に焦点を当てているため、複数の行にまたがる結合セルは単一のセルとして扱われることに注意することが重要です。

結果

ヘルパークラス

現在の実装では、抽出されたセルがまだ適切に整理されていません。しかし、各セルにはX座標やY座標、寸法などの貴重な情報が含まれています。このデータを使用して、さまざまな目的に応じたヘルパークラスを作成できます。以下はいくつかの基本的なヘルパーメソッドです。

public static class TableProcessor
{
    public static List<CellInfo> OrganizeCellsByCoordinates(List<CellInfo> cells)
    {
        // Sort cells by Y (top to bottom), then by X (left to right)
        var sortedCells = cells
            .OrderBy(cell => cell.CellRect.Y)
            .ThenBy(cell => cell.CellRect.X)
            .ToList();

        return sortedCells;
    }

    // Example of how to use the function
    public static void ProcessTables(Tables tables)
    {
        foreach (var table in tables)
        {
            var sortedCells = OrganizeCellsByCoordinates(table.CellInfos);

            Console.WriteLine("Organized Table Cells:");

            // int previousY = sortedCells.FirstOrDefault()?.CellRect.Y ?? 0;
            int previousY = sortedCells.Any() ? sortedCells.First().CellRect.Y : 0;

            foreach (var cell in sortedCells)
            {
                // Print a new line if the Y-coordinate changes, indicating a new row
                if (Math.Abs(cell.CellRect.Y - previousY) > cell.CellRect.Height * 0.8)
                {
                    Console.WriteLine();  // Start a new row
                    previousY = cell.CellRect.Y;
                }

                Console.Write($"{cell.CellText}\t");
            }

            Console.WriteLine("\n--- End of Table ---");
        }
    }

    public static List<CellInfo> ExtractRowByIndex(TableInfo table, int rowIndex)
    {
        if (table == null 
 table.CellInfos == null 
 !table.CellInfos.Any())
        {
            throw new ArgumentException("Table is empty or invalid.");
        }

        var sortedCells = OrganizeCellsByCoordinates(table.CellInfos);
        List<List<CellInfo>> rows = new List<List<CellInfo>>();

        // Group cells into rows
        int previousY = sortedCells.First().CellRect.Y;
        List<CellInfo> currentRow = new List<CellInfo>();

        foreach (var cell in sortedCells)
        {
            if (Math.Abs(cell.CellRect.Y - previousY) > cell.CellRect.Height * 0.8)
            {
                // Store the completed row and start a new one
                rows.Add(new List<CellInfo>(currentRow));
                currentRow.Clear();

                previousY = cell.CellRect.Y;
            }

            currentRow.Add(cell);
        }

        // Add the last row
        if (currentRow.Any())
        {
            rows.Add(currentRow);
        }

        // Retrieve the specific row
        if (rowIndex < 0 
 rowIndex >= rows.Count)
        {
            throw new IndexOutOfRangeException($"Row index {rowIndex} is out of range.");
        }

        return rows[rowIndex];
    }
}

public static class TableProcessor
{
    public static List<CellInfo> OrganizeCellsByCoordinates(List<CellInfo> cells)
    {
        // Sort cells by Y (top to bottom), then by X (left to right)
        var sortedCells = cells
            .OrderBy(cell => cell.CellRect.Y)
            .ThenBy(cell => cell.CellRect.X)
            .ToList();

        return sortedCells;
    }

    // Example of how to use the function
    public static void ProcessTables(Tables tables)
    {
        foreach (var table in tables)
        {
            var sortedCells = OrganizeCellsByCoordinates(table.CellInfos);

            Console.WriteLine("Organized Table Cells:");

            // int previousY = sortedCells.FirstOrDefault()?.CellRect.Y ?? 0;
            int previousY = sortedCells.Any() ? sortedCells.First().CellRect.Y : 0;

            foreach (var cell in sortedCells)
            {
                // Print a new line if the Y-coordinate changes, indicating a new row
                if (Math.Abs(cell.CellRect.Y - previousY) > cell.CellRect.Height * 0.8)
                {
                    Console.WriteLine();  // Start a new row
                    previousY = cell.CellRect.Y;
                }

                Console.Write($"{cell.CellText}\t");
            }

            Console.WriteLine("\n--- End of Table ---");
        }
    }

    public static List<CellInfo> ExtractRowByIndex(TableInfo table, int rowIndex)
    {
        if (table == null 
 table.CellInfos == null 
 !table.CellInfos.Any())
        {
            throw new ArgumentException("Table is empty or invalid.");
        }

        var sortedCells = OrganizeCellsByCoordinates(table.CellInfos);
        List<List<CellInfo>> rows = new List<List<CellInfo>>();

        // Group cells into rows
        int previousY = sortedCells.First().CellRect.Y;
        List<CellInfo> currentRow = new List<CellInfo>();

        foreach (var cell in sortedCells)
        {
            if (Math.Abs(cell.CellRect.Y - previousY) > cell.CellRect.Height * 0.8)
            {
                // Store the completed row and start a new one
                rows.Add(new List<CellInfo>(currentRow));
                currentRow.Clear();

                previousY = cell.CellRect.Y;
            }

            currentRow.Add(cell);
        }

        // Add the last row
        if (currentRow.Any())
        {
            rows.Add(currentRow);
        }

        // Retrieve the specific row
        if (rowIndex < 0 
 rowIndex >= rows.Count)
        {
            throw new IndexOutOfRangeException($"Row index {rowIndex} is out of range.");
        }

        return rows[rowIndex];
    }
}

Imports Microsoft.VisualBasic

Public Module TableProcessor
	Public Function OrganizeCellsByCoordinates(ByVal cells As List(Of CellInfo)) As List(Of CellInfo)
		' Sort cells by Y (top to bottom), then by X (left to right)
		Dim sortedCells = cells.OrderBy(Function(cell) cell.CellRect.Y).ThenBy(Function(cell) cell.CellRect.X).ToList()

		Return sortedCells
	End Function

	' Example of how to use the function
	Public Sub ProcessTables(ByVal tables As Tables)
		For Each table In tables
			Dim sortedCells = OrganizeCellsByCoordinates(table.CellInfos)

			Console.WriteLine("Organized Table Cells:")

			' int previousY = sortedCells.FirstOrDefault()?.CellRect.Y ?? 0;
			Dim previousY As Integer = If(sortedCells.Any(), sortedCells.First().CellRect.Y, 0)

			For Each cell In sortedCells
				' Print a new line if the Y-coordinate changes, indicating a new row
				If Math.Abs(cell.CellRect.Y - previousY) > cell.CellRect.Height * 0.8 Then
					Console.WriteLine() ' Start a new row
					previousY = cell.CellRect.Y
				End If

				Console.Write($"{cell.CellText}" & vbTab)
			Next cell

			Console.WriteLine(vbLf & "--- End of Table ---")
		Next table
	End Sub

	Public Function ExtractRowByIndex(ByVal table As TableInfo, ByVal rowIndex As Integer) As List(Of CellInfo)
		If table Is Nothing table.CellInfos Is Nothing (Not table.CellInfos.Any()) Then
			Throw New ArgumentException("Table is empty or invalid.")
		End If

		Dim sortedCells = OrganizeCellsByCoordinates(table.CellInfos)
		Dim rows As New List(Of List(Of CellInfo))()

		' Group cells into rows
		Dim previousY As Integer = sortedCells.First().CellRect.Y
		Dim currentRow As New List(Of CellInfo)()

		For Each cell In sortedCells
			If Math.Abs(cell.CellRect.Y - previousY) > cell.CellRect.Height * 0.8 Then
				' Store the completed row and start a new one
				rows.Add(New List(Of CellInfo)(currentRow))
				currentRow.Clear()

				previousY = cell.CellRect.Y
			End If

			currentRow.Add(cell)
		Next cell

		' Add the last row
		If currentRow.Any() Then
			rows.Add(currentRow)
		End If

		' Retrieve the specific row
		If rowIndex < 0 rowIndex >= rows.Count Then
			Throw New IndexOutOfRangeException($"Row index {rowIndex} is out of range.")
		End If

		Return rows(rowIndex)
	End Function
End Module

$vbLabelText $csharpLabel

Curtis Chau

今すぐエンジニアリングチームとチャット

テクニカルライター

Curtis Chauはコンピューターサイエンスの学士号を取得（カールトン大学）し、Node.js、TypeScript、JavaScript、Reactに精通したフロントエンド開発を専門としています。直感的で美的に優れたユーザーインターフェースの作成に熱心で、Curtisは現代的なフレームワークで作業し、よく構造化された視覚的に魅力的なマニュアルを作成することを楽しんでいます。

開発以外にも、Curtisはモノのインターネット（IoT）に強い関心を持ち、ハードウェアとソフトウェアを統合する革新的な方法を探求しています。彼の余暇には、技術への愛を創造性と組み合わせて、ゲームを楽しんだりDiscordボットを作ったりしています。