Create Searchable PDFs by OCR

IronOCR's Tesseract-based engine, exposed through the IronTesseract class, can convert images into searchable PDFs. It can also make existing PDFs searchable.

Searchable text can improve SEO, and it lets intranet and database search tools index the document's contents.

The following C# example converts an image to a searchable PDF:

using IronOcr;  // Import the IronOcr namespace

class Program
{
    static void Main()
    {
        // Step 1: Create an instance of the IronTesseract class
        var Ocr = new IronTesseract();

        // Step 2: Initialize OcrInput
        var Input = new OcrInput();

        // Step 3: Add the image path(s) that you want to convert to a searchable PDF
        Input.AddImage("path/to/your/image.jpg");

        // Step 4: Perform OCR processing using the Read method
        // This method returns an OcrResult, which includes the text read from the image
        OcrResult Result = Ocr.Read(Input);

        // Step 5: Save the output to a searchable PDF
        // The SaveAsSearchablePdf method creates a PDF with the OCR text
        Result.SaveAsSearchablePdf("path/to/output.pdf");
    }
}

The same example in VB.NET:

Imports IronOcr ' Import the IronOcr namespace

Friend Class Program
	Shared Sub Main()
		' Step 1: Create an instance of the IronTesseract class
		Dim Ocr = New IronTesseract()

		' Step 2: Initialize OcrInput
		Dim Input = New OcrInput()

		' Step 3: Add the image path(s) that you want to convert to a searchable PDF
		Input.AddImage("path/to/your/image.jpg")

		' Step 4: Perform OCR processing using the Read method
		' This method returns an OcrResult, which includes the text read from the image
		Dim Result As OcrResult = Ocr.Read(Input)

		' Step 5: Save the output to a searchable PDF
		' The SaveAsSearchablePdf method creates a PDF with the OCR text
		Result.SaveAsSearchablePdf("path/to/output.pdf")
	End Sub
End Class

Explanation

  • IronTesseract Class: This is the main class used to perform OCR. It allows configuration of the OCR settings.
  • OcrInput Object: This object holds the images you want to process. You add images to this object using AddImage.
  • Read Method: This method takes the OcrInput object and processes the images, extracting text from them.
  • SaveAsSearchablePdf Method: This saves the OCR result as a searchable PDF. The recognized text is embedded beneath the page images, so the PDF can be searched while keeping the original image layout.

Make sure to replace "path/to/your/image.jpg" and "path/to/output.pdf" with the actual file paths you intend to use.
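
The introduction notes that existing PDFs can also be made searchable. The sketch below shows one way that might look. It assumes OcrInput exposes an AddPdf method as the PDF counterpart of the AddImage call used above. Newer IronOCR releases may name these methods LoadPdf and LoadImage instead, so check the API reference for your installed version. The file paths and the file-existence check are illustrative and not part of the original example.

```csharp
using System;
using System.IO;
using IronOcr;

class Program
{
    static void Main()
    {
        // Hypothetical paths, used only for illustration
        const string sourcePdf = "path/to/scanned.pdf";
        const string outputPdf = "path/to/searchable.pdf";

        // Stop early with a clear message if the input file is missing
        if (!File.Exists(sourcePdf))
        {
            Console.Error.WriteLine($"Input file not found: {sourcePdf}");
            return;
        }

        var ocr = new IronTesseract();

        // OcrInput holds page images in memory, so dispose it when finished
        using (var input = new OcrInput())
        {
            // Assumption: AddPdf loads every page of the PDF, mirroring AddImage
            input.AddPdf(sourcePdf);

            // Run OCR over the loaded pages
            OcrResult result = ocr.Read(input);

            // Write a new PDF with the recognized text embedded
            result.SaveAsSearchablePdf(outputPdf);
        }
    }
}
```

Writing to a separate output path leaves the original scan untouched, which makes it easy to compare the two files or rerun OCR with different settings.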