Chinese OCR in C# and .NET

IronOCR is a C# software component that allows .NET developers to read text from images and PDF documents in 126 languages, including Chinese. The Chinese Language Pack contains both Chinese Simplified and Chinese Traditional characters.

It is an advanced fork of Tesseract, built exclusively for .NET developers, and regularly outperforms other Tesseract engines for both speed and accuracy. The library recognizes text in images and documents of many different formats. IronOCR's API has been designed with extensibility and customization in mind: you can improve IronOCR's results by supplying your own tuning data or by submitting feature requests. IronOCR combines several optical character recognition techniques and runs on Windows, Linux, macOS, and other well-known platforms.

Contents of IronOcr.Languages.Chinese

This package contains the following 12 OCR language variants for .NET:

  • ChineseSimplified
  • ChineseSimplifiedBest
  • ChineseSimplifiedFast
  • ChineseSimplifiedVertical
  • ChineseSimplifiedVerticalBest
  • ChineseSimplifiedVerticalFast
  • ChineseTraditional
  • ChineseTraditionalBest
  • ChineseTraditionalFast
  • ChineseTraditionalVertical
  • ChineseTraditionalVerticalBest
  • ChineseTraditionalVerticalFast
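
The names in this list can be assigned to the Language property of an IronTesseract instance, in the same way the tutorial code later assigns OcrLanguage.ChineseTraditional. The sketch below is only an illustration: the ChineseLanguagePicker class and its Pick method are hypothetical helpers, not part of IronOCR, and it assumes the Best and Fast variants follow the usual Tesseract convention of trading speed for accuracy.

```csharp
using System;
using IronOcr;

static class ChineseLanguagePicker
{
    // Hypothetical helper: maps two flags onto language names from the list above.
    public static OcrLanguage Pick(bool traditional, bool vertical)
    {
        if (traditional)
        {
            return vertical
                ? OcrLanguage.ChineseTraditionalVertical
                : OcrLanguage.ChineseTraditional;
        }

        return vertical
            ? OcrLanguage.ChineseSimplifiedVertical
            : OcrLanguage.ChineseSimplified;
    }

    static void Main()
    {
        var ocr = new IronTesseract();

        // Vertical Simplified Chinese, e.g. for text set in columns.
        ocr.Language = Pick(traditional: false, vertical: true);

        Console.WriteLine($"Selected language: {ocr.Language}");
    }
}
```

Swapping in a Best or Fast name (for example OcrLanguage.ChineseSimplifiedBest) works the same way; which variant suits a project depends on how much recognition time is acceptable.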

Download

The Chinese Language Pack [中文 (Zhōngwén)] is distributed as the IronOcr.Languages.Chinese NuGet package. The Installation section below shows how to add it to a project.

Using IronOCR for the Chinese Language

Create or Open a C# Project

To start with IronOCR, create a C# .NET project. This tutorial uses Visual Studio 2022; other versions also work, and the latest version is recommended for a smooth experience. We will build a GUI for selecting the image. IronOCR can also be used in a console application by passing the image path directly. Follow these steps to create a C# project in Visual Studio 2022:

  • Open Visual Studio 2022.
  • Click on the "Create a new project" button.

Image 1

  • Type "Windows" in the search bar, select the "Windows Forms App" template from the search results, and click on the "Next" button.

Image 2

  • Give the project a name. We use "ChineseOCR" in this tutorial. After naming, click on the "Next" button.

Image 3

  • Select the target framework on the next screen, according to the needs of your project. We select .NET 5.0 for this tutorial.

Image 4

  • After selecting, click on the "Create" button. Visual Studio will create the C# Windows Forms project.

The project has been created, and now it is ready to use with the IronOCR library. You can also use an existing C# project. Open the project and proceed with the installation of the IronOCR library. The following section explains how to install the IronOCR library in C# projects.

Installation

Using NuGet Package Manager

To install the IronOCR library with NuGet Package Manager, we must open the NuGet Package Manager interface. Follow these steps to install the IronOCR library:

  • Click on "Tools" in the main menu, hover over "NuGet Package Manager," and select "Manage NuGet Packages for Solution."

Image 5

  • This will open the NuGet Package Manager interface. Go to the "Browse" tab and search for "IronOcr.Languages.Chinese". Select the matching package from the search results and click on the "Install" button to install it.

Image 6

  • The library installation will begin. After installation, you will be able to use the IronOCR library in your project.

Using the Package Manager Console

Using a console is always an easy option. We can install the IronOCR library using the Package Manager Console as well. Follow these steps to install the IronOCR library:

  • Open the Package Manager Console in Visual Studio. It is usually located at the bottom of Visual Studio.
  • Execute the following command in the console:

    Install-Package IronOcr.Languages.Chinese

  • The console shows the installation progress and installs the library automatically. After installation, the project is ready to use the IronOCR library.

Code Example: OCR for the Chinese language

Now, it's time to write the code for implementing the IronOCR library for the Chinese language. First, we have to develop the frontend for selecting the image file. Let's see how we can do this.

Developing the Frontend

We will use the "Toolbox" elements to design the frontend. We need a Button, a PictureBox, a RichTextBox, and two Labels. Drag and drop these elements from the Toolbox onto the Windows Form and arrange them as needed.

The Button will be used for selecting the image file from the PC, the Picture Box will load the selected image, and the Rich TextBox will show the output text. You can adjust the size of each element according to your needs. The final frontend design will look like this:

Image 7

This window will pop up when you run the project. We have set the alignment of the Windows Form to appear in the center of the screen.
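
For readers who prefer code over the designer, the layout described above could be sketched roughly as follows. This is a minimal sketch, not the designer output from this tutorial: the class name ChineseOcrForm, the label captions, and all sizes and positions are assumptions. The control names btn_image, img_image, and txt_output match the ones used by the click handler later in this article, and StartPosition is the standard Windows Forms property for centering a form on the screen.

```csharp
using System;
using System.Drawing;
using System.Windows.Forms;

public class ChineseOcrForm : Form
{
    // Control names match those used by the click handler in the backend section.
    private readonly Button btn_image = new Button();
    private readonly PictureBox img_image = new PictureBox();
    private readonly RichTextBox txt_output = new RichTextBox();

    public ChineseOcrForm()
    {
        Text = "Chinese OCR";
        ClientSize = new Size(800, 450);

        // Show the form in the center of the screen.
        StartPosition = FormStartPosition.CenterScreen;

        btn_image.Text = "Select Image";
        btn_image.SetBounds(20, 20, 120, 30);

        // Left side: preview of the selected image.
        img_image.SetBounds(20, 90, 370, 340);
        img_image.SizeMode = PictureBoxSizeMode.Zoom;
        img_image.BorderStyle = BorderStyle.FixedSingle;

        // Right side: recognized text.
        txt_output.SetBounds(410, 90, 370, 340);
        txt_output.ReadOnly = true;

        var lblImage = new Label { Text = "Image", AutoSize = true, Location = new Point(20, 65) };
        var lblText = new Label { Text = "Text", AutoSize = true, Location = new Point(410, 65) };

        Controls.AddRange(new Control[] { btn_image, img_image, txt_output, lblImage, lblText });
    }

    [STAThread]
    static void Main()
    {
        Application.EnableVisualStyles();
        Application.Run(new ChineseOcrForm());
    }
}
```

In a designer-based project the same properties end up in the generated Designer file, so either approach produces an equivalent form.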

Our frontend is ready. Now, let's add the backend functionality to the button.

Backend code for IronOCR

First, import the IronOcr namespace by adding the following line at the top of the file:

using IronOcr;

We will use the "Select Image" button for selecting and loading the image into the Picture Box. IronOCR will process the Traditional Chinese text in the image and display the output text in the Rich TextBox. Add the functionality by double-clicking the button in the designer and writing the following code:

private void btn_image_Click(object sender, EventArgs e)
{
    OpenFileDialog open = new OpenFileDialog();
    if (open.ShowDialog() == DialogResult.OK)
    {
        // Display image in picture box  
        img_image.Image = new Bitmap(open.FileName);

        var Ocr = new IronTesseract();

        // Set OCR language to Chinese Traditional
        Ocr.Language = OcrLanguage.ChineseTraditional;

        using (var Input = new OcrInput(open.FileName))
        {
            // Perform OCR on the image input
            var Result = Ocr.Read(Input);

            // Output the recognized text
            txt_output.Text = Result.Text;
        }
    }
}

When the user clicks the button, a dialog appears for selecting an image. The selected image is loaded into the Picture Box as a Bitmap. IronOCR then extracts the Chinese text from the image. We set the OCR language to Chinese Traditional so that traditional characters are recognized. The Ocr.Read function performs the OCR and returns the outcome, which we store in the Result variable; its Text property holds the recognized text. If necessary, you can also save the result as PDF, text, or HTML using the result's SaveAs methods, such as SaveAsSearchablePdf.
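
As mentioned at the start of this tutorial, the same flow can run in a console application when the image path is known. The following is a minimal sketch under that assumption, reusing only the IronOCR calls shown in the handler above. The ChineseOcrConsole class, the TryReadChineseText helper, and the usage message are hypothetical names introduced for illustration.

```csharp
using System;
using System.IO;
using System.Text;
using IronOcr;

static class ChineseOcrConsole
{
    // Hypothetical helper: returns false (and empty text) when the file is missing.
    public static bool TryReadChineseText(string imagePath, out string text)
    {
        text = string.Empty;
        if (!File.Exists(imagePath))
        {
            return false;
        }

        var ocr = new IronTesseract();
        ocr.Language = OcrLanguage.ChineseTraditional;

        using (var input = new OcrInput(imagePath))
        {
            // Same Read call as in the button handler.
            text = ocr.Read(input).Text;
        }

        return true;
    }

    static int Main(string[] args)
    {
        if (args.Length != 1)
        {
            Console.Error.WriteLine("Usage: ChineseOcrConsole <image-path>");
            return 1;
        }

        if (!TryReadChineseText(args[0], out string text))
        {
            Console.Error.WriteLine($"File not found: {args[0]}");
            return 1;
        }

        // Chinese characters need a Unicode-capable console encoding.
        Console.OutputEncoding = Encoding.UTF8;
        Console.WriteLine(text);
        return 0;
    }
}
```

Checking for the file before creating the OcrInput keeps a mistyped path from surfacing as a library exception. Whether the console renders the characters correctly also depends on the terminal font.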

Run the Project

Now it's time to run the project. Click the Run button in Visual Studio. You should see this screen:

Image 8

Click the "Select Image" button to open the file selection dialog. Choose an image file and press Enter.

Image 9

It will load the image into the Picture Box, automatically scan it, and display the output in the text box.

Image 10

This is the output from the image we selected. IronOCR also supports reading and scanning PDF files, in any of its supported languages, and it can turn an existing PDF document into a searchable PDF. It provides various image filters to enhance the clarity of the input images. Here are some of the filters:

  • Input.Binarize()
  • Input.Contrast()
  • Input.Deskew()
  • Input.DeNoise()
  • Input.Dilate()
  • Input.EnhanceResolution(300)

All these functions improve the visibility of the characters before recognition. The following example applies Deskew to a scanned PDF and saves the result as a searchable PDF:

using IronOcr;
var Ocr = new IronTesseract();
using (var Input = new OcrInput())
{
    Input.AddPdf("scan.pdf");
    // Clean up twisted pages
    Input.Deskew();
    var Result = Ocr.Read(Input);
    Result.SaveAsSearchablePdf("searchable.pdf");
}
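
The example above leaves the OCR language at its default. For a scanned Chinese document, the language setting and more than one filter can be combined. The sketch below assumes the same AddPdf, Deskew, DeNoise, and SaveAsSearchablePdf calls named in this article; newer IronOCR releases may name some of them differently. The ChinesePdfOcr class, the MakeSearchable helper, and the file names are hypothetical.

```csharp
using System;
using System.IO;
using IronOcr;

static class ChinesePdfOcr
{
    // Hypothetical helper: returns false when the input PDF does not exist.
    public static bool MakeSearchable(string pdfPath, string outputPath)
    {
        if (!File.Exists(pdfPath))
        {
            return false;
        }

        var ocr = new IronTesseract();
        ocr.Language = OcrLanguage.ChineseSimplified;

        using (var input = new OcrInput())
        {
            input.AddPdf(pdfPath);

            // Straighten rotated pages, then reduce scanner noise.
            input.Deskew();
            input.DeNoise();

            var result = ocr.Read(input);
            result.SaveAsSearchablePdf(outputPath);
        }

        return true;
    }

    static void Main()
    {
        bool done = MakeSearchable("scan.pdf", "searchable.pdf");
        Console.WriteLine(done ? "Searchable PDF written." : "Input PDF not found.");
    }
}
```

Filters cost processing time, so it is usually worth adding them one at a time and keeping only those that visibly improve the recognized text.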

Licensing

IronOCR is free for development, with all of its features available. A free trial is also offered for production use, with no payment required. Iron Software additionally offers a popular suite deal: five software products for the price of two. Pay once for two products and you get all five, including IronPDF and IronXL. You can find more information about licensing here.