How to Extract Text from Word in C#

Extracting text from Word documents is a common task in document processing, data extraction, and text analysis applications. In C#, libraries such as IronWord let developers open files in the .docx format and read the text inside a document instance. Automating this retrieval supports report generation, data mining, and document management systems.

With IronWord, extracting text from a Word document instance takes only a few steps: load the document object, access its paragraphs or sections, and retrieve the desired text while maintaining its original layout. This is especially useful in the legal, healthcare, and financial fields, where document processing is integral to everyday workflows. C# is well suited to building scalable, efficient applications that extract text from Word files, and the extraction code can be integrated into larger systems or applications.

How to Extract Text from Word in C#

  1. Install the IronWord library via NuGet in your C# project.
  2. Add using IronWord; at the top of your C# file to extract text from Word.
  3. Set your license key.
  4. Load the existing Word document.
  5. Access paragraphs using the Paragraphs property.
  6. Loop through paragraphs and their text elements using nested loops.
  7. Extract and display text with Console.

What is IronWord?

IronWord is a library for retrieving text from files such as PDF, Word, and TXT. It is designed to extract structured or unstructured text quickly and precisely while retaining the document's original formatting. IronWord is also used for document analysis, data extraction, and automatic content indexing.

How to Extract Text from Word in C#: Figure 1 - IronWord

The tool supports a range of common file types, which eases integration with applications and makes it a good fit for business automation and high-volume document processing. Its scalability allows large volumes of documents to be handled easily, an important asset for enterprises working with bulk data extraction.

IronWord is also fully compatible with C# and other programming languages, meeting the needs of developers and organizations looking to streamline their document workflows smoothly.

Features of IronWord

Support of Multiple Document Formats

IronWord accepts files in a range of document formats, including:

  • PDFs: It can read text from PDFs with regular text, embedded fonts, or vector-based content.
  • Microsoft Word Files (DOCX): It reads text from Word documents easily while keeping the document structure and formatting intact.
  • Text Files (TXT): IronWord also reads and processes plain text files.

Accurate Text Extraction

The IronWord extraction engine is adept at extracting text content even if it's buried inside complex documents with sophisticated page layouts, embedded fonts, or a mix of contents such as pictures and tables. The library preserves:

  • Text Formatting: Styles such as bold, italics, underlines, and other stylistic aspects applied to the text.
  • Document Hierarchy: Headers, paragraphs, and lists to maintain organization and readability.

Handling Structured and Unstructured Data

IronWord handles both structured and unstructured data. It can extract:

  • Structured Data: Documents with predictable formatting patterns, such as forms and contracts.
  • Unstructured Data: Documents with unpredictable text layouts, such as reports or articles.

It has proven useful in tasks involving data mining, information retrieval, and classification due to its ability to process a wide array of content.
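
As a rough illustration of the kind of data-mining step that can follow extraction, the sketch below counts word occurrences in text that has already been pulled out of a document. It is plain C# and does not call IronWord; the `TermCounter` class, the `CountTerms` method, and the separator list are assumptions made for this example.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper: counts words in text that was already extracted.
public static class TermCounter
{
    // Characters treated as word boundaries (an illustrative choice, not exhaustive).
    private static readonly char[] Separators =
        { ' ', '\t', '\r', '\n', '.', ',', ';', ':', '!', '?', '(', ')', '"' };

    // Returns a case-insensitive map from word to number of occurrences.
    public static Dictionary<string, int> CountTerms(IEnumerable<string> lines)
    {
        var counts = new Dictionary<string, int>(StringComparer.OrdinalIgnoreCase);
        foreach (string line in lines)
        {
            foreach (string word in line.Split(Separators, StringSplitOptions.RemoveEmptyEntries))
            {
                // Start at 1 for a new word, otherwise increment the stored count.
                counts[word] = counts.TryGetValue(word, out int seen) ? seen + 1 : 1;
            }
        }
        return counts;
    }
}
```

The lines passed to `CountTerms` could be the strings that the extraction loop later in this article prints to the console.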

Scalability for Big Volumes

IronWord is built to process large volumes of documents efficiently, offering great scalability for enterprise applications. Examples include:

  • Batching of Documents: Processing many documents at once.
  • Handling Large Files: No degradation in performance with large document sizes.
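
A minimal sketch of document batching, assuming the documents sit in a single folder. The `BatchExtractor` class and the `extractText` delegate are hypothetical names introduced here; the delegate stands in for whatever per-file extraction the project uses, such as the WordDocument loop shown later in this article. The sketch skips Word's temporary `~$` owner files and records failures instead of stopping the whole batch.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

// Hypothetical helper for processing every .docx file in a folder.
public static class BatchExtractor
{
    // extractText is an assumed seam: it receives a file path and returns the file's text.
    public static (Dictionary<string, string> Texts, Dictionary<string, string> Errors) ExtractFolder(
        string folder, Func<string, string> extractText)
    {
        var texts = new Dictionary<string, string>();
        var errors = new Dictionary<string, string>();

        foreach (string path in Directory.EnumerateFiles(folder, "*.docx"))
        {
            // Word creates "~$name.docx" owner files while a document is open; skip them.
            if (Path.GetFileName(path).StartsWith("~$", StringComparison.Ordinal))
            {
                continue;
            }

            try
            {
                texts[path] = extractText(path);
            }
            catch (Exception ex)
            {
                // Keep going: one unreadable file should not abort the batch.
                errors[path] = ex.Message;
            }
        }

        return (texts, errors);
    }
}
```

Collecting errors per file is a design choice for long-running batches; a stricter pipeline might rethrow on the first failure instead.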

Seamless Integration with Programming Languages

IronWord integrates seamlessly into development environments, especially Python, through easy-to-use APIs. This allows developers to:

  • Import IronWord into Python Applications: Use IronWord functions directly within Python scripts.
  • Cross-Language Interoperability: Beyond Python, IronWord can be used from other languages, easing interoperability across a tech stack.

This ease of integration allows developers to focus on functionality, rather than infrastructure.

High Performance and Speed

IronWord has been optimized for performance, providing fast text extraction even from large documents, which is essential for real-time applications requiring rapid execution. The library offers:

  • Multithreading Support: Enhancing concurrent extraction processes.
  • Small Memory Footprint: Efficient use of system resources during processing, which helps scale to large datasets.
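
One way concurrent extraction could be arranged on the application side is sketched below with the standard `Parallel.ForEach`. This is not IronWord's own multithreading API. The `ParallelExtractor` class and the `extractText` delegate are hypothetical, and the sketch assumes each call opens its own document and that the extraction code tolerates concurrent use, which should be verified before relying on it.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Threading.Tasks;

// Hypothetical helper that extracts text from several files concurrently.
public static class ParallelExtractor
{
    // extractText is an assumed seam: it receives a file path and returns the file's text.
    // maxWorkers caps concurrency; it must be positive, or -1 for no limit.
    public static IDictionary<string, string> ExtractAll(
        IEnumerable<string> paths, Func<string, string> extractText, int maxWorkers)
    {
        // ConcurrentDictionary allows safe writes from multiple worker threads.
        var results = new ConcurrentDictionary<string, string>();
        var options = new ParallelOptions { MaxDegreeOfParallelism = maxWorkers };

        Parallel.ForEach(paths, options, path =>
        {
            results[path] = extractText(path);
        });

        return results;
    }
}
```

If any call to `extractText` throws, `Parallel.ForEach` surfaces the failures as an `AggregateException`, so callers that need per-file error handling should catch inside the delegate.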

Optional OCR Support

For documents containing images, IronWord can be used alongside OCR technologies to:

  • Process Scanned Documents: Extract text from images, scanned PDFs, or other image-based formats.
  • Multilingual Support: Recognize and extract text in supported OCR languages.

Metadata Preservation

Beyond text extraction, IronWord preserves document metadata, which supports:

  • Document Versioning and Compliance Information: Useful for compliance or archival purposes.
  • Document Management Systems: Where metadata is as important as content.

Creating a New Project in Visual Studio

Launch Visual Studio, open the File menu, select "New Project," and then choose the "Console App" template.

How to Extract Text from Word in C#: Figure 2 - Console App

Enter the name of the .NET project and select its location, click Next, select the required .NET framework version, and then click the Create button.

How to Extract Text from Word in C#: Figure 3 - Project Configuration

The structure of a Visual Studio project varies with the selected application type. To add or run the application code, open the Program.cs file, which is present in console, Windows, and web applications.

How to Extract Text from Word in C#: Figure 4 - Target Framework

Once the code has been added, the library can be tested.

Install IronWord Library

From the Visual Studio Tools menu, choose NuGet Package Manager, then open the Package Manager Console and run the following command:

Install-Package IronWord

Once downloaded and installed, the package can be used for text extraction in an ongoing project.

How to Extract Text from Word in C#: Figure 5 - Install IronWord

Alternatively, install the package directly into the solution through Visual Studio's NuGet Package Manager interface. The graphic below illustrates how to access the Package Manager.

How to Extract Text from Word in C#: Figure 6 - NuGet Package Manager

Use the search field in the NuGet Package Manager to locate packages. Search for "IronWord" as shown in the screenshot below.

How to Extract Text from Word in C#: Figure 7 - Search IronWord

The accompanying graphic displays the related search results. Select the IronWord package and install it to add the library to your project.

Extract Text from a Word Document

To extract text from a document using IronWord, follow these steps. The example code below demonstrates text extraction from a Word document (.docx) using the IronWord library in C#.

// Include necessary libraries
using IronWord;

// Set the license key for IronWord
IronWord.License.LicenseKey = "License key here";

// Load the Word document
var docx1 = new WordDocument("D:\\C# Projects\\ConsoleApp\\ConsoleApp\\File\\existing.docx");

// Access the collection of paragraphs in the document
var paragraphObj = docx1.Paragraphs;

// Loop through each paragraph and its text elements
for (int i = 0; i < paragraphObj.Count; i++)
{
    for (int j = 0; j < paragraphObj[i].Texts.Count; j++)
    {
        // Print each text element to the console
        Console.WriteLine(paragraphObj[i].Texts[j].Text.ToString());
    }
}

// Wait for user input before closing the console
Console.ReadKey();

The code initializes the license key for IronWord and loads a .docx document from a specified path, creating a WordDocument object. After the document loads, it accesses all paragraphs through the Paragraphs property.
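
The example hard-codes both the license key and the document path. A possible alternative, sketched below, reads the key from an environment variable and validates the path before loading. The `StartupChecks` class and the `IRONWORD_LICENSE_KEY` variable name are assumptions made for illustration; neither is part of IronWord.

```csharp
using System;
using System.IO;

// Hypothetical startup helpers; not part of the IronWord API.
public static class StartupChecks
{
    // Assumed variable name chosen for this sketch.
    public const string LicenseVariable = "IRONWORD_LICENSE_KEY";

    // Reads the license key from the environment and fails early if it is missing.
    public static string ReadLicenseKey()
    {
        var key = Environment.GetEnvironmentVariable(LicenseVariable);
        if (string.IsNullOrWhiteSpace(key))
        {
            throw new InvalidOperationException(
                $"Set the {LicenseVariable} environment variable before starting.");
        }
        return key.Trim();
    }

    // Confirms the path points at an existing .docx file and returns its full path.
    public static string RequireDocx(string path)
    {
        if (!File.Exists(path))
        {
            throw new FileNotFoundException("Word document not found.", path);
        }
        if (!string.Equals(Path.GetExtension(path), ".docx", StringComparison.OrdinalIgnoreCase))
        {
            throw new ArgumentException("Expected a .docx file.", nameof(path));
        }
        return Path.GetFullPath(path);
    }
}

// Usage with the calls from the example above:
//   IronWord.License.LicenseKey = StartupChecks.ReadLicenseKey();
//   var docx1 = new WordDocument(StartupChecks.RequireDocx(pathFromConfig));
```

Keeping the key out of source code avoids committing it to version control, and checking the path first produces a clearer error than a failure inside the document loader.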

How to Extract Text from Word in C#: Figure 8 - Sample Word Document

A nested loop iterates over paragraphs and their text elements. The outer loop traverses each paragraph, while the inner loop processes each paragraph's text elements. Text elements are printed to the console after conversion to strings.

How to Extract Text from Word in C#: Figure 9 - Console Output

Console.ReadKey() pauses the program until a key is pressed, so the output stays visible before the application window closes. This approach extracts and prints the Word document's contents in order.
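
Because the example prints every text element on its own line, a paragraph made of several text elements is spread over several console lines. The sketch below joins the elements of each paragraph into one string and writes the result to a text file. It uses only the IronWord members that appear in the example above (`WordDocument`, `Paragraphs`, `Texts`, `Text`); the `ParagraphText` class and its method names are hypothetical, and a valid license key is assumed to have been set beforehand.

```csharp
using System.Collections.Generic;
using System.IO;
using System.Text;
using IronWord;

// Hypothetical helper built on the same IronWord members as the example above.
public static class ParagraphText
{
    // Returns one string per paragraph, with that paragraph's text elements concatenated.
    public static List<string> ReadParagraphs(WordDocument document)
    {
        var lines = new List<string>();
        var paragraphs = document.Paragraphs;

        for (int i = 0; i < paragraphs.Count; i++)
        {
            var builder = new StringBuilder();
            var texts = paragraphs[i].Texts;

            for (int j = 0; j < texts.Count; j++)
            {
                // Append each text element without adding a line break.
                builder.Append(texts[j].Text);
            }

            lines.Add(builder.ToString());
        }

        return lines;
    }

    // Loads a .docx file and saves its paragraphs as lines of a plain text file.
    public static void SaveAsText(string docxPath, string txtPath)
    {
        var document = new WordDocument(docxPath);
        File.WriteAllLines(txtPath, ReadParagraphs(document));
    }
}
```

Returning a list rather than printing directly keeps the extraction reusable, for example as input to reporting or indexing code.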

Conclusion

IronWord is a versatile and efficient tool for text extraction across various document formats, particularly suitable for Word documents. Its user-friendly API and structured text extraction features make it a reliable solution for developers seeking automated document content retrieval. The tool maintains formatting while processing complex documents, proving valuable for legal, enterprise-level content management, and other applications. Implementing IronWord enhances document analysis, data extraction, and processing tasks, boosting productivity and accuracy when handling large text volumes.

IronWord's pricing starts at $599. The license fee includes access to technical support and software updates. IronWord is commercial software and cannot be distributed for free. Refer to IronWord's license page for specific pricing details. Learn about other Iron Software products on the products page.

Frequently Asked Questions

How can I extract text from Word documents using C#?

You can extract text from Word documents in C# by installing the IronWord library via NuGet, adding using IronWord; to your C# file, initializing the library with your license key, loading the Word document, and looping through its paragraphs to extract and display the text.

Which document formats does IronWord support for text extraction?

IronWord supports text extraction from various document formats, including Microsoft Word files (DOCX), PDF files, and plain text files (TXT).

How does IronWord ensure accurate text extraction from Word documents?

IronWord maintains the original layout and formatting of the text, providing high accuracy when extracting text from Word documents. It handles both structured and unstructured data, which makes it well suited to generating reports and managing documents.

Can IronWord be integrated with programming languages other than C#?

Yes, IronWord is designed for seamless integration with other programming languages, such as Python, improving cross-language interoperability and letting developers use it in a variety of environments.

Does IronWord support text extraction from scanned documents that contain images?

IronWord can be used alongside OCR technologies to process scanned documents, enabling text extraction from images and supporting multiple languages, which broadens its usefulness for document processing tasks.

What are the key features of IronWord for C# developers?

IronWord offers features such as accurate text extraction, support for multiple document formats, scalability, multithreading support, optional OCR for images, and seamless integration with other programming languages, making it effective for document analysis and data extraction.

How can I install IronWord in a C# project?

To install IronWord in a C# project, use the NuGet Package Manager in Visual Studio. Search for 'IronWord' and add the package to your project to start extracting text from Word documents.

What is the pricing model for IronWord?

IronWord pricing starts at $599 and includes access to technical support and software updates, so that you have the latest features and fixes.

How does IronWord handle large volumes of documents for text extraction?

IronWord is optimized for performance, with features such as multithreading support that let it handle and scale to large volumes of documents efficiently, making it suitable for enterprise-level applications.

What advantages does IronWord offer for document processing in sectors such as legal or healthcare?

IronWord improves document processing efficiency by supporting text extraction from various formats while maintaining the original formatting. Its scalability and performance optimizations make it well suited to sectors such as legal and healthcare, where document management is essential.

Jordi Bardia
Software Engineer
Jordi is most proficient in Python, C#, and C++, and when he isn't putting his skills to work at Iron Software, he programs games. Sharing responsibilities for product testing, product development, and research, Jordi adds immense value to the...