How to Read PDFs

Chaknith Bin

October 25, 2023

Updated January 8, 2025

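PDF stands for Portable Document Format. It is a file format developed by Adobe that preserves the fonts, images, graphics, and layout of a source document, regardless of the application or platform used to create it. PDF files are commonly used to share and view documents in a consistent format, whatever software or hardware is used to open them. IronOCR handles PDF documents of various versions with ease.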

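How to Read PDFs

Download the C# library for reading PDFs
Prepare the PDF document for reading
Construct an OcrPdfInput object with the PDF file path
Use the Read method to perform OCR on the imported PDF
Specify a list of page indices to read specific pages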

Read PDF Example

Instantiate the IronTesseract class to perform OCR. Next, in a 'using' statement, create an OcrPdfInput object by passing it the path of the PDF file. Finally, call the Read method to perform OCR.

using IronOcr;

// Instantiate IronTesseract
IronTesseract ocrTesseract = new IronTesseract();

// Add PDF
using var pdfInput = new OcrPdfInput("Potter.pdf");
// Perform OCR
OcrResult ocrResult = ocrTesseract.Read(pdfInput);

Imports IronOcr

Module Program
    Sub Main()
        ' Instantiate IronTesseract
        Dim ocrTesseract As New IronTesseract()

        ' Add PDF; the Using block disposes the input when it ends
        Using pdfInput As New OcrPdfInput("Potter.pdf")
            ' Perform OCR
            Dim ocrResult As OcrResult = ocrTesseract.Read(pdfInput)
        End Using
    End Sub
End Module
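
The example above stops once the OcrResult has been obtained. As a minimal sketch of inspecting it, the following prints the combined text and then the text of each page. The Text property appears later in this article; the Pages collection and its PageNumber and Text members are assumed here and should be checked against the IronOCR API reference for your version.

```csharp
using IronOcr;
using System;

// Instantiate IronTesseract
IronTesseract ocrTesseract = new IronTesseract();

// Add PDF and perform OCR, as in the example above
using var pdfInput = new OcrPdfInput("Potter.pdf");
OcrResult ocrResult = ocrTesseract.Read(pdfInput);

// All recognized text as a single string
Console.WriteLine(ocrResult.Text);

// Assumption: OcrResult exposes a Pages collection whose items
// carry PageNumber and Text.
foreach (var page in ocrResult.Pages)
{
    Console.WriteLine($"--- Page {page.PageNumber} ---");
    Console.WriteLine(page.Text);
}
```

Printing page by page makes it easier to see which page a given piece of recognized text came from.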

In most cases, you do not need to specify the DPI property. However, providing a higher DPI value when constructing the OcrPdfInput can improve reading accuracy.
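
As a rough sketch of what this could look like, the snippet below passes a DPI value when the input is constructed. It assumes the constructor accepts an optional named argument called DPI, in the same way the later examples pass PageIndices and ContentAreas; the value 300 is purely illustrative and does not come from this article.

```csharp
using IronOcr;

// Instantiate IronTesseract
IronTesseract ocrTesseract = new IronTesseract();

// Assumption: OcrPdfInput takes an optional named argument "DPI".
// 300 is an illustrative value, not a documented recommendation.
using var pdfInput = new OcrPdfInput("Potter.pdf", DPI: 300);

// Perform OCR on the PDF rendered at the requested resolution
OcrResult ocrResult = ocrTesseract.Read(pdfInput);
```

A higher DPI generally means larger rendered pages and therefore more work per page, so it is worth raising only when the default gives poor results.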

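Read PDF Pages Example

When reading specific pages of a PDF document, you can specify the index numbers of the pages to import. To do so, pass a list of page indices to the PageIndices parameter when constructing the OcrPdfInput. Keep in mind that page indices are zero-based, so index 0 is the first page.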

using IronOcr;
using System.Collections.Generic;

// Instantiate IronTesseract
IronTesseract ocrTesseract = new IronTesseract();

// Create page indices list
List<int> pageIndices = new List<int>() { 0, 2 };

// Add PDF
using var pdfInput = new OcrPdfInput("Potter.pdf", PageIndices: pageIndices);
// Perform OCR
OcrResult ocrResult = ocrTesseract.Read(pdfInput);

Imports IronOcr
Imports System.Collections.Generic

Module Program
    Sub Main()
        ' Instantiate IronTesseract
        Dim ocrTesseract As New IronTesseract()

        ' Create page indices list (zero-based: first and third pages)
        Dim pageIndices As New List(Of Integer) From {0, 2}

        ' Add PDF; the Using block disposes the input when it ends
        Using pdfInput As New OcrPdfInput("Potter.pdf", PageIndices:=pageIndices)
            ' Perform OCR
            Dim ocrResult As OcrResult = ocrTesseract.Read(pdfInput)
        End Using
    End Sub
End Module
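
Because page indices are zero-based while PDF viewers usually number pages from 1, an off-by-one mistake is easy to make. The sketch below shows one way to guard against it. ToPageIndices is a hypothetical helper written for this illustration and is not part of IronOCR.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper (not part of IronOCR): converts 1-based page numbers,
// as displayed in a PDF viewer, into the zero-based indices that the
// PageIndices parameter expects. Duplicates are dropped and the result is sorted.
static List<int> ToPageIndices(params int[] pageNumbers)
{
    if (pageNumbers.Any(n => n < 1))
        throw new ArgumentOutOfRangeException(nameof(pageNumbers), "Page numbers start at 1.");

    return pageNumbers.Select(n => n - 1).Distinct().OrderBy(i => i).ToList();
}

// Pages 1 and 3 of the document map to indices 0 and 2.
List<int> pageIndices = ToPageIndices(1, 3);
Console.WriteLine(string.Join(", ", pageIndices)); // prints "0, 2"

// The list can then be passed exactly as in the example above:
// using var pdfInput = new OcrPdfInput("Potter.pdf", PageIndices: pageIndices);
```

Keeping the conversion in one place means the rest of the code can use the page numbers a reader sees on screen.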

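Specify Scan Region

Narrowing the area to be read can significantly improve reading efficiency. To do this, specify the exact regions of the imported PDF that need to be read. The code example below instructs IronOCR to extract only the chapter number and title.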

using IronOcr;
using IronSoftware.Drawing;
using System;

// Instantiate IronTesseract
IronTesseract ocrTesseract = new IronTesseract();

// Specify crop regions
Rectangle[] scanRegions = { new Rectangle(550, 100, 600, 300) };

// Add PDF
using (var pdfInput = new OcrPdfInput("Potter.pdf", ContentAreas: scanRegions))
{
    // Perform OCR
    OcrResult ocrResult = ocrTesseract.Read(pdfInput);

    // Output the result to console
    Console.WriteLine(ocrResult.Text);
}

Imports IronOcr
Imports IronSoftware.Drawing
Imports System

Module Program
    Sub Main()
        ' Instantiate IronTesseract
        Dim ocrTesseract As New IronTesseract()

        ' Specify crop regions
        Dim scanRegions() As Rectangle = { New Rectangle(550, 100, 600, 300) }

        ' Add PDF
        Using pdfInput As New OcrPdfInput("Potter.pdf", ContentAreas:=scanRegions)
            ' Perform OCR
            Dim ocrResult As OcrResult = ocrTesseract.Read(pdfInput)

            ' Output the result to console
            Console.WriteLine(ocrResult.Text)
        End Using
    End Sub
End Module

OCR Result

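Chaknith Bin

Software Engineer

Chaknith is the Sherlock Holmes of developers. He first realized he could have a future in software engineering while doing coding challenges for fun. His focus is on IronXL and IronBarcode, but he takes pride in helping customers with every product. Chaknith draws on what he learns from talking directly with customers to help improve the products themselves. His anecdotal feedback goes beyond Jira tickets and supports product development, documentation, and marketing, improving the overall customer experience. When he is not in the office, he can be found learning about machine learning and coding, or out hiking.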