Chinese OCR in C# and .NET

IronOCR is a C# software component allowing .NET coders to read text from images and PDF documents in 126 languages, including Chinese. The Chinese Language Pack contains both Chinese Simplified and Chinese Traditional characters.

It is an advanced fork of Tesseract, built exclusively for .NET developers, and regularly outperforms other Tesseract engines for both speed and accuracy. The library recognizes text in images and documents of many formats and in more than 125 languages, including Chinese. IronOCR's API has been designed with extensibility and customization in mind, and you can help improve its throughput and accuracy by contributing your tuning data or feature requests to the tracker. IronOCR combines many optical character recognition techniques and runs on Windows, Linux, macOS, and other popular platforms.

Contents of IronOcr.Languages.Chinese

This package contains 12 OCR language variants for .NET:

  • ChineseSimplified
  • ChineseSimplifiedBest
  • ChineseSimplifiedFast
  • ChineseSimplifiedVertical
  • ChineseSimplifiedVerticalBest
  • ChineseSimplifiedVerticalFast
  • ChineseTraditional
  • ChineseTraditionalBest
  • ChineseTraditionalFast
  • ChineseTraditionalVertical
  • ChineseTraditionalVerticalBest
  • ChineseTraditionalVerticalFast
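
The list above does not say how to choose among these variants. As a hedged sketch: going by the names alone, and by the upstream Tesseract model sets tessdata_fast and tessdata_best, the Fast members presumably trade some accuracy for speed, the Best members do the reverse, and the Vertical members target text written top to bottom. The ChineseVariantPicker helper below is hypothetical and not part of IronOCR; it only maps three choices onto the OcrLanguage members listed above.

```csharp
using System;
using IronOcr;

// Hypothetical helper, not part of IronOCR: picks one of the twelve
// OcrLanguage members from the Chinese language pack.
public static class ChineseVariantPicker
{
    public enum Quality { Standard, Fast, Best }

    public static OcrLanguage Pick(bool traditional, bool vertical, Quality quality)
    {
        return (traditional, vertical, quality) switch
        {
            (false, false, Quality.Standard) => OcrLanguage.ChineseSimplified,
            (false, false, Quality.Best)     => OcrLanguage.ChineseSimplifiedBest,
            (false, false, Quality.Fast)     => OcrLanguage.ChineseSimplifiedFast,
            (false, true,  Quality.Standard) => OcrLanguage.ChineseSimplifiedVertical,
            (false, true,  Quality.Best)     => OcrLanguage.ChineseSimplifiedVerticalBest,
            (false, true,  Quality.Fast)     => OcrLanguage.ChineseSimplifiedVerticalFast,
            (true,  false, Quality.Standard) => OcrLanguage.ChineseTraditional,
            (true,  false, Quality.Best)     => OcrLanguage.ChineseTraditionalBest,
            (true,  false, Quality.Fast)     => OcrLanguage.ChineseTraditionalFast,
            (true,  true,  Quality.Standard) => OcrLanguage.ChineseTraditionalVertical,
            (true,  true,  Quality.Best)     => OcrLanguage.ChineseTraditionalVerticalBest,
            (true,  true,  Quality.Fast)     => OcrLanguage.ChineseTraditionalVerticalFast,
            // Guards against Quality values outside the three defined above.
            _ => throw new ArgumentOutOfRangeException(nameof(quality))
        };
    }
}
```

With this in place, an engine could be configured with `Ocr.Language = ChineseVariantPicker.Pick(traditional: false, vertical: true, quality: ChineseVariantPicker.Quality.Best);`. Whether the Best or Fast variant is the better choice for a given document is something to measure rather than assume.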

Download

We can download the Chinese Language Pack [中文 (Zhōngwén)] from the following links:

Using IronOCR for the Chinese Language

Create or Open a C# Project

To start with IronOCR, we first need a C# .NET project. This tutorial uses Visual Studio 2022; you can choose another version according to your needs, although the latest version is recommended for a smooth experience. We will build a graphical interface for selecting the image. IronOCR also works in a console application when it is given the path of the image directly. Follow these steps to create a C# project in Visual Studio 2022:

  • Open Visual Studio 2022.
  • Click on the "Create a new project" button.
  • Type "Windows" in the search bar, select the "Windows Forms App" template from the search results, and click on the "Next" button.
  • Give the project a name. This tutorial uses the name "ChineseOCR". Then click on the "Next" button.
  • On the next screen, select the target framework according to the needs of your project. This tutorial uses .NET 5.0.
  • Click on the "Create" button. Visual Studio will create the C# Windows Forms project.

The project has been created and is now ready for the IronOCR library. We can also use an existing C# project: open it and install the IronOCR library. The following section explores the ways to install the IronOCR library in C# projects.

Installation

Using NuGet Package Manager

To install the IronOCR library with the NuGet Package Manager, we must first open the NuGet Package Manager interface. Follow these steps:

  • Click on "Tools" in the main menu, hover over "NuGet Package Manager" in the drop-down menu, and select "Manage NuGet Packages for Solution".
  • This opens the NuGet Package Manager interface. Go to the "Browse" tab and search for "IronOCR Chinese". Select the IronOcr.Languages.Chinese package from the search results and click on the "Install" button.
  • The installation will begin. Once it finishes, you can use the IronOCR library in your project.

Using the Package Manager Console

The console is often the quicker option. We can also install the IronOCR library using the Package Manager Console. Follow these steps:

  • Open the Package Manager Console in Visual Studio. It is usually docked at the bottom of the window.

  • Enter the following command in the console:

    Install-Package IronOcr.Languages.Chinese

  • The console shows the installation progress. Once it finishes, the project is ready to use the IronOCR library.

Code Example: OCR for the Chinese language

Now it's time to write the code that uses the IronOCR library for the Chinese language. First, we have to build the frontend for selecting the image file. Let's take a look at how we can do this.

Developing the Frontend

We will use elements from the "Toolbox" to design the frontend: a Button, a Picture Box, a Rich Textbox, and two labels. Drag these elements from the Toolbox, drop them onto the Windows form, and arrange them as you like.

The button selects the image file from the PC, the Picture Box displays the selected image, and the Rich Textbox shows the output text. You can adjust the size of every element according to your needs. The final frontend design will look like this:

This window will appear when you run the project. We have set the form to start in the center of the screen, so it will show up there.

Our frontend is ready. Next, it's time to add the backend functionality of the button.

Backend code for IronOCR

We have to import the IronOcr namespace before we can use it in our code. Add the following line at the top of the file:

using IronOcr;

The VB.NET equivalent:

Imports IronOcr

We will use the "Select Image" button to choose an image and load it into the Picture Box. IronOCR will then process the image of Traditional Chinese text and show the output in the Rich Text Box. Double-click the button in the designer to create its click handler, and write the following code to add the described functionality:

private void btn_image_Click(object sender, EventArgs e)
{
    OpenFileDialog open = new OpenFileDialog();
    if (open.ShowDialog() == DialogResult.OK)
    {
        // display the selected image in the picture box
        img_image.Image = new Bitmap(open.FileName);

        // configure the OCR engine for Traditional Chinese
        var Ocr = new IronTesseract();
        Ocr.Language = OcrLanguage.ChineseTraditional;

        // read the image and show the recognized text
        using (var Input = new OcrInput(open.FileName))
        {
            var Result = Ocr.Read(Input);
            txt_output.Text = Result.Text;
        }
    }
}

The VB.NET equivalent:

Private Sub btn_image_Click(ByVal sender As Object, ByVal e As EventArgs)
	Dim open As New OpenFileDialog()
	If open.ShowDialog() = DialogResult.OK Then
		' display the selected image in the picture box
		img_image.Image = New Bitmap(open.FileName)

		Dim Ocr = New IronTesseract()
		Ocr.Language = OcrLanguage.ChineseTraditional

		Using Input = New OcrInput(open.FileName)
			Dim Result = Ocr.Read(Input)
			txt_output.Text = Result.Text
		End Using
	End If
End Sub

When the user clicks the button, a dialog appears for selecting the image. The selected image is loaded into the picture box automatically; we use the Bitmap class to display it. After that, IronOCR converts the image into Chinese text. We set Ocr.Language to ChineseTraditional to recognize text in Traditional Chinese. The Ocr.Read function processes the input and returns the OCR result, which we store in the Result variable. If you need to save the text in PDF, text, or HTML format, use the corresponding SaveAs function of the result; IronOCR supports multiple output formats.
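
The same calls can be used without a form. The following is a minimal console sketch, assuming the image path is passed as the first command-line argument. The Program class, the argument handling, and the default output file name are illustrative choices, and the text is written with the standard File.WriteAllText call rather than an IronOCR save method.

```csharp
using System;
using System.IO;
using System.Text;
using IronOcr;

public static class Program
{
    public static int Main(string[] args)
    {
        if (args.Length < 1 || !File.Exists(args[0]))
        {
            Console.Error.WriteLine("Usage: ChineseOcrConsole <image-path> [output.txt]");
            return 1;
        }

        string imagePath = args[0];
        // Default output file: same name as the image, with a .txt extension.
        string outputPath = args.Length > 1
            ? args[1]
            : Path.ChangeExtension(imagePath, ".txt");

        // Lets the console print Chinese characters where the terminal font allows it.
        Console.OutputEncoding = Encoding.UTF8;

        var ocr = new IronTesseract();
        ocr.Language = OcrLanguage.ChineseTraditional;

        // Same constructor and Read call as in the button handler above.
        using (var input = new OcrInput(imagePath))
        {
            var result = ocr.Read(input);
            File.WriteAllText(outputPath, result.Text, Encoding.UTF8);
            Console.WriteLine(result.Text);
        }

        return 0;
    }
}
```

Writing the file as UTF-8 keeps the Chinese characters intact regardless of the system's default code page.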

Run the Project

Now it's time to run the project. Click the Run button in Visual Studio. The application window will appear.

Click on the "Select Image" button. It will open the file selection dialog box. Select an image file and press Enter.

The application will load the image into the picture box, scan it automatically, and show the output in the text box.

This is the output for the image we selected. IronOCR can also read and scan PDF files, recognize their text in different languages, and turn an existing PDF document into a searchable PDF. IronOCR also has many image filters that make images clearer and easier to read. Here are the filters:

  • Input.Binarize()
  • Input.Contrast()
  • Input.Deskew()
  • Input.DeNoise()
  • Input.Dilate()
  • Input.EnhanceResolution(300)

All these functions increase the visibility of the characters. IronOCR can use them to clean up a document before making a searchable PDF. Let's take a look at how this can be done:

using IronOcr;

var Ocr = new IronTesseract();
using (var Input = new OcrInput())
{
    // load the pages of the scanned PDF
    Input.AddPdf("scan.pdf");
    // straighten rotated or skewed pages
    Input.Deskew();
    var Result = Ocr.Read(Input);
    // write a copy of the PDF with a searchable text layer
    Result.SaveAsSearchablePdf("searchable.pdf");
}

The VB.NET equivalent:

Imports IronOcr

Dim Ocr = New IronTesseract()
Using Input = New OcrInput()
	Input.AddPdf("scan.pdf")
	' straighten rotated or skewed pages
	Input.Deskew()
	Dim Result = Ocr.Read(Input)
	Result.SaveAsSearchablePdf("searchable.pdf")
End Using
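
The same filters can be applied to a Chinese image before recognition. The sketch below is an assumption-laden illustration rather than a recommended pipeline: "photo.png" and "photo-searchable.pdf" are placeholder file names, the CleanAndReadChinese class is hypothetical, and the choice of DeNoise followed by Deskew is arbitrary. Which filters actually help depends on the image, so it is worth comparing the output with and without each one.

```csharp
using System;
using IronOcr;

public static class CleanAndReadChinese
{
    public static void Main()
    {
        var ocr = new IronTesseract();
        ocr.Language = OcrLanguage.ChineseSimplified;

        // "photo.png" is a placeholder path for a noisy or skewed scan.
        using (var input = new OcrInput("photo.png"))
        {
            // Two of the filters listed above, applied before reading.
            input.DeNoise();
            input.Deskew();

            var result = ocr.Read(input);
            Console.WriteLine(result.Text);

            // Optionally keep a searchable PDF of the cleaned page.
            result.SaveAsSearchablePdf("photo-searchable.pdf");
        }
    }
}
```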

Licensing

IronOCR is free for development. You can actively use all its features for free. IronOCR also offers a free trial for production without any payment needed. Iron Software also currently offers a popular deal: a suite of five software products for the price of two. Pay once for two software products, and you get all five, including IronPDF and IronXL. You can find more information about licensing from this link.