How to Read PDFs

Chaknith Bin

October 25, 2023

Updated January 8, 2025

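PDF stands for "Portable Document Format". It is a file format developed by Adobe that preserves the fonts, images, graphics, and layout of any source document, regardless of the application and platform used to create it. PDF files are commonly used to share and view documents in a consistent format, whatever software or hardware is used to open them. IronOCR handles the various versions of PDF documents with ease.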

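How to Read PDFs

1. Download the C# library for reading PDFs
2. Prepare the PDF document for reading
3. Construct an OcrPdfInput object with the PDF file path
4. Use the Read method to perform OCR on the imported PDF
5. Read specific pages by providing a list of page indices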

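Read PDF Example

Begin by instantiating the IronTesseract class to perform OCR. Then use a 'using' statement to create an OcrPdfInput object, passing the PDF file path to it. Finally, perform OCR with the Read method.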

using IronOcr;

// Instantiate IronTesseract
IronTesseract ocrTesseract = new IronTesseract();

// Add PDF
using var pdfInput = new OcrPdfInput("Potter.pdf");
// Perform OCR
OcrResult ocrResult = ocrTesseract.Read(pdfInput);

' VB.NET equivalent of the C# example above
Imports IronOcr

' Instantiate IronTesseract
Dim ocrTesseract As New IronTesseract()

' Add PDF (the Using block disposes the input when OCR is done)
Using pdfInput As New OcrPdfInput("Potter.pdf")
	' Perform OCR
	Dim ocrResult As OcrResult = ocrTesseract.Read(pdfInput)
End Using

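In most cases, there is no need to specify the DPI property. However, providing a high DPI value when constructing OcrPdfInput can improve reading accuracy.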
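
To get a feel for what a higher DPI means in practice, the sketch below estimates the pixel dimensions a PDF page would have when rasterized at a given DPI. It relies only on the fact that PDF page sizes are measured in points (1/72 inch). The helper name PixelSizeAtDpi and the A4 page size are illustrative assumptions, not part of IronOCR, and the sketch says nothing about how IronOCR rasterizes pages internally.

```csharp
using System;

public static class DpiMath
{
    // PDF page sizes are expressed in points; one point is 1/72 inch.
    private const double PointsPerInch = 72.0;

    // Hypothetical helper (not an IronOCR API): pixel size of a page
    // with the given size in points when rendered at the given DPI.
    public static (int Width, int Height) PixelSizeAtDpi(double widthPt, double heightPt, int dpi)
    {
        if (dpi <= 0)
            throw new ArgumentOutOfRangeException(nameof(dpi), "DPI must be positive.");

        int width = (int)Math.Round(widthPt / PointsPerInch * dpi);
        int height = (int)Math.Round(heightPt / PointsPerInch * dpi);
        return (width, height);
    }

    public static void Main()
    {
        // An A4 page is roughly 595 x 842 points (assumed page size for illustration).
        foreach (int dpi in new[] { 150, 300 })
        {
            var (w, h) = PixelSizeAtDpi(595, 842, dpi);
            Console.WriteLine($"{dpi} DPI -> {w} x {h} px");
        }
    }
}
```

Doubling the DPI roughly quadruples the number of pixels per page, so a higher value gives the OCR engine more detail to work with at the cost of more memory and processing time.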

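Read PDF Pages Example

When reading specific pages from a PDF document, you can specify the index numbers of the pages to import. To do so, pass a list of page indices to the PageIndices parameter when constructing OcrPdfInput. Keep in mind that page indices use zero-based numbering, so the list { 0, 2 } selects the first and third pages.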

using IronOcr;
using System.Collections.Generic;

// Instantiate IronTesseract
IronTesseract ocrTesseract = new IronTesseract();

// Create page indices list
List<int> pageIndices = new List<int>() { 0, 2 };

// Add PDF
using var pdfInput = new OcrPdfInput("Potter.pdf", PageIndices: pageIndices);
// Perform OCR
OcrResult ocrResult = ocrTesseract.Read(pdfInput);

' VB.NET equivalent of the C# example above
Imports IronOcr
Imports System.Collections.Generic

' Instantiate IronTesseract
Dim ocrTesseract As New IronTesseract()

' Create page indices list
Dim pageIndices As New List(Of Integer)() From {0, 2}

' Add PDF (the Using block disposes the input when OCR is done)
Using pdfInput As New OcrPdfInput("Potter.pdf", PageIndices:=pageIndices)
	' Perform OCR
	Dim ocrResult As OcrResult = ocrTesseract.Read(pdfInput)
End Using
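
Because PageIndices is zero-based while people usually refer to pages by their 1-based page numbers, a small conversion helper can prevent off-by-one mistakes. The sketch below is plain C# and makes no IronOCR calls. The name PageNumbers.ToZeroBasedIndices is hypothetical, and removing duplicates and sorting is a design choice of this sketch rather than a documented requirement. The resulting list is meant to be passed as the PageIndices argument in the same way as the list in the example above.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class PageNumbers
{
    // Hypothetical helper (not part of IronOCR): converts 1-based page
    // numbers into zero-based page indices.
    public static List<int> ToZeroBasedIndices(params int[] pageNumbers)
    {
        if (pageNumbers.Any(n => n < 1))
            throw new ArgumentOutOfRangeException(nameof(pageNumbers), "Page numbers start at 1.");

        // Remove duplicates, sort ascending, then shift each number down by one.
        return pageNumbers.Distinct().OrderBy(n => n).Select(n => n - 1).ToList();
    }

    public static void Main()
    {
        // Pages 1 and 3 map to indices 0 and 2, the same list used in the example above.
        List<int> pageIndices = ToZeroBasedIndices(1, 3);
        Console.WriteLine(string.Join(", ", pageIndices)); // prints "0, 2"
    }
}
```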

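Specify Scan Region

By narrowing down the area to be read, you can significantly improve reading efficiency. To do so, specify the exact region of the imported PDF that needs to be read. In the code example below, I have instructed IronOCR to focus solely on extracting the chapter number and title.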

using IronOcr;
using IronSoftware.Drawing;
using System;

// Instantiate IronTesseract
IronTesseract ocrTesseract = new IronTesseract();

// Specify crop regions
Rectangle[] scanRegions = { new Rectangle(550, 100, 600, 300) };

// Add PDF
using (var pdfInput = new OcrPdfInput("Potter.pdf", ContentAreas: scanRegions))
{
    // Perform OCR
    OcrResult ocrResult = ocrTesseract.Read(pdfInput);

    // Output the result to console
    Console.WriteLine(ocrResult.Text);
}

' VB.NET equivalent of the C# example above
Imports IronOcr
Imports IronSoftware.Drawing
Imports System

' Instantiate IronTesseract
Dim ocrTesseract As New IronTesseract()

' Specify crop regions
Dim scanRegions() As Rectangle = { New Rectangle(550, 100, 600, 300) }

' Add PDF
Using pdfInput = New OcrPdfInput("Potter.pdf", ContentAreas:= scanRegions)
	' Perform OCR
	Dim ocrResult As OcrResult = ocrTesseract.Read(pdfInput)

	' Output the result to console
	Console.WriteLine(ocrResult.Text)
End Using

OCR Result

Chaknith Bin

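Software Engineer

Chaknith is the Sherlock Holmes of developers. It first occurred to him that he might have a future in software engineering when he was doing code challenges for fun. His focus is on IronXL and IronBarcode, but he takes pride in helping customers with every product. Chaknith uses the knowledge he gains from talking directly with customers to help further improve the products. His anecdotal feedback goes beyond Jira tickets and supports product development, documentation, and marketing, improving the overall customer experience. When he is not in the office, he can be found learning about machine learning, coding, or hiking.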