Create Searchable PDFs by OCR
We can use IronOCR's Tesseract engine to convert images to searchable PDFs. It can also make existing PDFs searchable.
Searchable text lets search engines index the document content, which helps SEO, and it supports internal search indexing within intranets and databases.
How to Create Searchable PDFs with IronOCR Tesseract
- Install the OCR library to create searchable PDFs.
- Create an OcrInput object and use AddImage to register the image path.
- Call all the required methods to process the image.
- Pass the OcrInput object to the Read method of an IronTesseract instance.
- Call SaveAsSearchablePdf on the result to save the images as a single PDF.
To implement the above steps in code, you can follow this C# example:
using IronOcr; // Import the IronOcr namespace

class Program
{
    static void Main()
    {
        // Step 1: Create an instance of the IronTesseract class
        var Ocr = new IronTesseract();

        // Step 2: Initialize OcrInput
        var Input = new OcrInput();

        // Step 3: Add the image path(s) that you want to convert to a searchable PDF
        Input.AddImage("path/to/your/image.jpg");

        // Step 4: Perform OCR processing using the Read method
        // This method returns an OcrResult, which includes the text read from the image
        OcrResult Result = Ocr.Read(Input);

        // Step 5: Save the output to a searchable PDF
        // The SaveAsSearchablePdf method creates a PDF with the OCR text
        Result.SaveAsSearchablePdf("path/to/output.pdf");
    }
}
The equivalent VB.NET code:
Imports IronOcr ' Import the IronOcr namespace

Friend Class Program
    Shared Sub Main()
        ' Step 1: Create an instance of the IronTesseract class
        Dim Ocr = New IronTesseract()

        ' Step 2: Initialize OcrInput
        Dim Input = New OcrInput()

        ' Step 3: Add the image path(s) that you want to convert to a searchable PDF
        Input.AddImage("path/to/your/image.jpg")

        ' Step 4: Perform OCR processing using the Read method
        ' This method returns an OcrResult, which includes the text read from the image
        Dim Result As OcrResult = Ocr.Read(Input)

        ' Step 5: Save the output to a searchable PDF
        ' The SaveAsSearchablePdf method creates a PDF with the OCR text
        Result.SaveAsSearchablePdf("path/to/output.pdf")
    End Sub
End Class
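The steps above mention saving several images as a single PDF, while the example registers only one. The following is a minimal sketch of the multi-image case, not a definitive implementation. SearchablePdfBuilder and Build are hypothetical names introduced for illustration; the sketch assumes that each image registered with AddImage becomes one page of the output, in the order added, and that OcrInput can be disposed with a using block.

```csharp
using System.Collections.Generic;
using IronOcr;

static class SearchablePdfBuilder
{
    // Hypothetical helper (not part of IronOCR): runs OCR over every image in
    // imagePaths and writes the result as one searchable PDF at outputPath.
    public static void Build(IEnumerable<string> imagePaths, string outputPath)
    {
        var ocr = new IronTesseract();

        // Assumption: OcrInput is disposable, so release it once OCR is done.
        using (var input = new OcrInput())
        {
            foreach (string path in imagePaths)
            {
                // Same AddImage call as in the single-image example.
                input.AddImage(path);
            }

            // One Read call processes all registered images together.
            OcrResult result = ocr.Read(input);

            // Assumption: pages appear in the order the images were added.
            result.SaveAsSearchablePdf(outputPath);
        }
    }
}
```

It could be called as SearchablePdfBuilder.Build(new[] { "page1.png", "page2.png" }, "combined.pdf"), where the file names are placeholders.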
Explanation
- IronTesseract Class: This is the main class used to perform OCR. It allows configuration of the OCR settings.
- OcrInput Object: This object holds the images you want to process. You add images to this object using AddImage.
- Read Method: This method takes the OcrInput object and processes the images, extracting text from them.
- SaveAsSearchablePdf Method: This saves the OCR result as a searchable PDF, embedding the recognized text under the images, so the PDF text is searchable while the original image layout is preserved.
Make sure to replace "path/to/your/image.jpg" and "path/to/output.pdf" with the actual file paths you intend to use.
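The introduction notes that existing PDFs can also be made searchable, but the example covers only images. The sketch below shows how that might look. It assumes an AddPdf method on OcrInput, the PDF counterpart of AddImage; check the API reference for your installed IronOCR version, since method names may differ between releases. PdfOcr and MakeSearchable are hypothetical names introduced for illustration.

```csharp
using IronOcr;

static class PdfOcr
{
    // Hypothetical helper (not part of IronOCR): reads a scanned PDF and
    // writes a copy with a searchable text layer.
    public static void MakeSearchable(string inputPdfPath, string outputPdfPath)
    {
        var ocr = new IronTesseract();

        using (var input = new OcrInput())
        {
            // Assumption: AddPdf registers every page of the existing PDF.
            input.AddPdf(inputPdfPath);

            // OCR the pages, then save them with the recognized text embedded.
            OcrResult result = ocr.Read(input);
            result.SaveAsSearchablePdf(outputPdfPath);
        }
    }
}
```

For example, PdfOcr.MakeSearchable("path/to/scanned.pdf", "path/to/searchable.pdf") would write the searchable copy to a new file and leave the source PDF untouched.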