How to Extract Text from Image Files

How to Use Iron Tesseract in C#


To use Iron Tesseract in C#, create an IronTesseract instance, configure it with language and OCR settings, and then call the Read() method on an OcrInput object containing your images or PDFs. This converts images of text into searchable PDFs using the optimized Tesseract 5 engine.

IronOCR provides an intuitive API for Iron Tesseract, a customized and optimized build of Tesseract 5. With IronOCR and IronTesseract, you can convert images of text and scanned documents into plain text and searchable PDFs. The library supports 125 international languages and includes advanced features such as barcode reading and computer vision.

Quick Start: Configuring IronTesseract in C#

This example demonstrates how to configure IronTesseract with specific settings and perform OCR in a single statement.

Get started with IronOCR via NuGet now:

  1. Install IronOCR using the NuGet Package Manager

    PM > Install-Package IronOcr

  2. Copy and run this code.

    var result = new IronOcr.IronTesseract
    {
        Language = IronOcr.OcrLanguage.English,
        Configuration = new IronOcr.TesseractConfiguration
        {
            ReadBarCodes = false,
            RenderSearchablePdf = true,
            WhiteListCharacters = "ABCabc123"
        }
    }.Read(new IronOcr.OcrInput("image.png"));
  3. Deploy to your live environment and test

    Start using IronOCR in your project today with a free trial!

How Do I Create an IronTesseract Instance?

Initialize a Tesseract object with this code:

```csharp :path=/static-assets/ocr/content-code-examples/how-to/irontesseract-initialize-irontesseract.cs
```

You can customize the behavior of `IronTesseract` by selecting a different language, enabling barcode reading, and whitelisting or blacklisting characters. IronOCR offers comprehensive [configuration options](https://ironsoftware.com/csharp/ocr/examples/csharp-configure-setup-tesseract/) for fine-tuning the OCR process:

```csharp :path=/static-assets/ocr/content-code-examples/how-to/irontesseract-configure-irontesseract.cs
```

Once configured, you can read an `OcrInput` object using Tesseract's features. The [OcrInput class](https://ironsoftware.com/csharp/ocr/examples/csharp-ocr-input-for-iron-tesseract/) provides flexible methods for loading a variety of input formats:

```csharp :path=/static-assets/ocr/content-code-examples/how-to/irontesseract-read.cs
```

For complex scenarios, you can take advantage of [multithreading capabilities](https://ironsoftware.com/csharp/ocr/examples/csharp-tesseract-multithreading-for-speed/) to process multiple documents simultaneously, significantly improving performance for batch operations.

What Are Advanced Tesseract Configuration Variables?

The IronOCR Tesseract interface allows full control over Tesseract's configuration variables through the [IronOcr.TesseractConfiguration class](/csharp/ocr/object-reference/api/IronOcr.TesseractConfiguration.html). These advanced settings let you optimize OCR performance for specific use cases, such as [fixing low-quality scans](https://ironsoftware.com/csharp/ocr/examples/ocr-low-quality-scans-tesseract/) or [reading specific document types](https://ironsoftware.com/csharp/ocr/tutorials/read-specific-document/).

How Do I Use Tesseract Configuration in Code?

```csharp :path=/static-assets/ocr/content-code-examples/how-to/irontesseract-tesseract-configuration.cs
```

IronOCR also offers specialized configuration for different document types. For example, when [reading passports](https://ironsoftware.com/csharp/ocr/examples/read-passport/) or [processing MICR cheques](https://ironsoftware.com/csharp/ocr/examples/read-micr-cheque/), you can apply specific preprocessing filters and region detection to improve accuracy.

Example configuration for financial documents:

```csharp
// Example: Configure for financial documents
IronTesseract ocr = new IronTesseract
{
    Language = OcrLanguage.English,
    Configuration = new TesseractConfiguration
    {
        PageSegmentationMode = TesseractPageSegmentationMode.SingleBlock,
        TesseractVariables = new Dictionary<string, object>
        {
            ["tessedit_char_whitelist"] = "0123456789.$,",
            ["textord_heavy_nr"] = false,
            ["edges_max_children_per_outline"] = 10
        }
    }
};

// Apply preprocessing filters for better accuracy
using OcrInput input = new OcrInput();
input.LoadPdf("financial-document.pdf");
input.Deskew();
input.EnhanceResolution(300);

OcrResult result = ocr.Read(input);
```

What Is the Complete List of Tesseract Configuration Variables?

These can be set via `IronTesseract.Configuration.TesseractVariables["key"] = value;`. Configuration variables allow you to fine-tune OCR behavior for best results on specific documents. For detailed guidance on optimizing OCR performance, see our [fast OCR configuration guide](https://ironsoftware.com/csharp/ocr/examples/tune-tesseract-for-speed-in-dotnet/).

| Tesseract Configuration Variable | Default | Meaning |
|---|---|---|
| classify_num_cp_levels | 3 | Number of Class Pruner Levels |
| textord_debug_tabfind | 0 | Debug tab finding |
| textord_debug_bugs | 0 | Turn on output related to bugs in tab finding |
| textord_testregion_left | -1 | Left edge of debug reporting rectangle |
| textord_testregion_top | -1 | Top edge of debug reporting rectangle |
| textord_testregion_right | 2147483647 | Right edge of debug rectangle |
| textord_testregion_bottom | 2147483647 | Bottom edge of debug rectangle |
| textord_tabfind_show_partitions | 0 | Show partition bounds, waiting if > 1 |
| devanagari_split_debuglevel | 0 | Debug level for split shiro-rekha process |
| edges_max_children_per_outline | 10 | Max number of children inside a character outline |
| edges_max_children_layers | 5 | Max layers of nested children inside a character outline |
| edges_children_per_grandchild | 10 | Importance ratio for chucking outlines |
| edges_children_count_limit | 45 | Max holes allowed in blob |
| edges_min_nonhole | 12 | Min pixels for potential char in box |
| edges_patharea_ratio | 40 | Max lensq/area for acceptable child outline |
| textord_fp_chop_error | 2 | Max allowed bending of chop cells |
| textord_tabfind_show_images | 0 | Show image blobs |
| textord_skewsmooth_offset | 4 | For smooth factor |
| textord_skewsmooth_offset2 | 1 | For smooth factor |
| textord_test_x | -2147483647 | Coordinates of test point |
| textord_test_y | -2147483647 | Coordinates of test point |
| textord_min_blobs_in_row | 4 | Min blobs before gradient counted |
| textord_spline_minblobs | 8 | Min blobs in each spline segment |
| textord_spline_medianwin | 6 | Size of window for spline segmentation |
| textord_max_blob_overlaps | 4 | Max number of blobs a big blob can overlap |
| textord_min_xheight | 10 | Min credible pixel xheight |
| textord_lms_line_trials | 12 | Number of linew fits to do |
| oldbl_holed_losscount | 10 | Max lost before fallback line used |
| pitsync_linear_version | 6 | Use new fast algorithm |
| pitsync_fake_depth | 1 | Max advance fake generation |
| textord_tabfind_show_strokewidths | 0 | Show stroke widths |
| textord_dotmatrix_gap | 3 | Max pixel gap for broken pixed pitch |
| textord_debug_block | 0 | Block to do debug on |
| textord_pitch_range | 2 | Max range test on pitch |
| textord_words_veto_power | 5 | Rows required to outvote a veto |
| equationdetect_save_bi_image | 0 | Save input bi image |
| equationdetect_save_spt_image | 0 | Save special character image |
| equationdetect_save_seed_image | 0 | Save the seed image |
| equationdetect_save_merged_image | 0 | Save the merged image |
| poly_debug | 0 | Debug old poly |
| poly_wide_objects_better | 1 | More accurate approx on wide things |
| wordrec_display_splits | 0 | Display splits |
| textord_debug_printable | 0 | Make debug windows printable |
| textord_space_size_is_variable | 0 | If true, word delimiter spaces are assumed to have variable width, even though characters have fixed pitch. |
| textord_tabfind_show_initial_partitions | 0 | Show partition bounds |
| textord_tabfind_show_reject_blobs | 0 | Show blobs rejected as noise |
| textord_tabfind_show_columns | 0 | Show column bounds |
| textord_tabfind_show_blocks | 0 | Show final block bounds |
| textord_tabfind_find_tables | 1 | Run table detection |
| devanagari_split_debugimage | 0 | Whether to create a debug image for split shiro-rekha process. |
| textord_show_fixed_cuts | 0 | Draw fixed pitch cell boundaries |
| edges_use_new_outline_complexity | 0 | Use the new outline complexity module |
| edges_debug | 0 | Turn on debugging for this module |
| edges_children_fix | 0 | Remove boxy parents of char-like children |
| gapmap_debug | 0 | Say which blocks have tables |
| gapmap_use_ends | 0 | Use large space at start and end of rows |
| gapmap_no_isolated_quanta | 0 | Ensure gaps not less than 2 quanta wide |
| textord_heavy_nr | 0 | Vigorously remove noise |
| textord_show_initial_rows | 0 | Display row accumulation |
| textord_show_parallel_rows | 0 | Display page correlated rows |
| textord_show_expanded_rows | 0 | Display rows after expanding |
| textord_show_final_rows | 0 | Display rows after final fitting |
| textord_show_final_blobs | 0 | Display blob bounds after pre-association |
| textord_test_landscape | 0 | Tests refer to land/port |
| textord_parallel_baselines | 1 | Force parallel baselines |
| textord_straight_baselines | 0 | Force straight baselines |
| textord_old_baselines | 1 | Use old baseline algorithm |
| textord_old_xheight | 0 | Use old xheight algorithm |
| textord_fix_xheight_bug | 1 | Use spline baseline |
| textord_fix_makerow_bug | 1 | Prevent multiple baselines |
| textord_debug_xheights | 0 | Test xheight algorithms |
| textord_biased_skewcalc | 1 | Bias skew estimates with line length |
| textord_interpolating_skew | 1 | Interpolate across gaps |
| textord_new_initial_xheight | 1 | Use test xheight mechanism |
| textord_debug_blob | 0 | Print test blob information |
| textord_really_old_xheight | 0 | Use original wiseowl xheight |
| textord_oldbl_debug | 0 | Debug old baseline generation |
| textord_debug_baselines | 0 | Debug baseline generation |
| textord_oldbl_paradef | 1 | Use para default mechanism |
| textord_oldbl_split_splines | 1 | Split stepped splines |
| textord_oldbl_merge_parts | 1 | Merge suspect partitions |
| oldbl_corrfix | 1 | Improve correlation of heights |
| oldbl_xhfix | 0 | Fix bug in modes threshold for xheights |
| textord_ocropus_mode | 0 | Make baselines for ocropus |
| textord_tabfind_only_strokewidths | 0 | Only run stroke widths |
| textord_tabfind_show_initialtabs | 0 | Show tab candidates |
| textord_tabfind_show_finaltabs | 0 | Show tab vectors |
| textord_show_tables | 0 | Show table regions |
| textord_tablefind_show_mark | 0 | Debug table marking steps in detail |
| textord_tablefind_show_stats | 0 | Show page stats used in table finding |
| textord_tablefind_recognize_tables | 0 | Enables the table recognizer for table layout and filtering. |
| textord_all_prop | 0 | All doc is proportial text |
| textord_debug_pitch_test | 0 | Debug on fixed pitch test |
| textord_disable_pitch_test | 0 | Turn off dp fixed pitch algorithm |
| textord_fast_pitch_test | 0 | Do even faster pitch algorithm |
| textord_debug_pitch_metric | 0 | Write full metric stuff |
| textord_show_row_cuts | 0 | Draw row-level cuts |
| textord_show_page_cuts | 0 | Draw page-level cuts |
| textord_pitch_cheat | 0 | Use correct answer for fixed/prop |
| textord_blockndoc_fixed | 0 | Attempt whole doc/block fixed pitch |
| textord_show_initial_words | 0 | Display separate words |
| textord_show_new_words | 0 | Display separate words |
| textord_show_fixed_words | 0 | Display forced fixed pitch words |
| textord_blocksall_fixed | 0 | Moan about prop blocks |
| textord_blocksall_prop | 0 | Moan about fixed pitch blocks |
| textord_blocksall_testing | 0 | Dump stats when moaning |
| textord_test_mode | 0 | Do current test |
| textord_pitch_rowsimilarity | 0.08 | Fraction of xheight for sameness |
| words_initial_lower | 0.5 | Max initial cluster size |
| words_initial_upper | 0.15 | Min initial cluster spacing |
| words_default_prop_nonspace | 0.25 | Fraction of xheight |
| words_default_fixed_space | 0.75 | Fraction of xheight |
| words_default_fixed_limit | 0.6 | Allowed size variance |
| textord_words_definite_spread | 0.3 | Non-fuzzy spacing region |
| textord_spacesize_ratiofp | 2.8 | Min ratio space/nonspace |
| textord_spacesize_ratioprop | 2 | Min ratio space/nonspace |
| textord_fpiqr_ratio | 1.5 | Pitch IQR/Gap IQR threshold |
| textord_max_pitch_iqr | 0.2 | Xh fraction noise in pitch |
| textord_fp_min_width | 0.5 | Min width of decent blobs |
| textord_underline_offset | 0.1 | Fraction of x to ignore |
| ambigs_debug_level | 0 | Debug level for unichar ambiguities |
| classify_debug_level | 0 | Classify debug level |
| classify_norm_method | 1 | Normalization Method ... |
| matcher_debug_level | 0 | Matcher Debug Level |
| matcher_debug_flags | 0 | Matcher Debug Flags |
| classify_learning_debug_level | 0 | Learning Debug Level |
| matcher_permanent_classes_min | 1 | Min # of permanent classes |
| matcher_min_examples_for_prototyping | 3 | Reliable Config Threshold |
| matcher_sufficient_examples_for_prototyping | 5 | Enable adaption even if the ambiguities have not been seen |
| classify_adapt_proto_threshold | 230 | Threshold for good protos during adaptive 0-255 |
| classify_adapt_feature_threshold | 230 | Threshold for good features during adaptive 0-255 |
| classify_class_pruner_threshold | 229 | Class Pruner Threshold 0-255 |
| classify_class_pruner_multiplier | 15 | Class Pruner Multiplier 0-255 |
| classify_cp_cutoff_strength | 7 | Class Pruner CutoffStrength |
| classify_integer_matcher_multiplier | 10 | Integer Matcher Multiplier 0-255 |
| dawg_debug_level | 0 | Set to 1 for general debug info, to 2 for more details, to 3 to see all the debug messages |
| hyphen_debug_level | 0 | Debug level for hyphenated words. |
| stopper_smallword_size | 2 | Size of dict word to be treated as non-dict word |
| stopper_debug_level | 0 | Stopper debug level |
| tessedit_truncate_wordchoice_log | 10 | Max words to keep in list |
| max_permuter_attempts | 10000 | Maximum number of different character choices to consider during permutation. This limit is especially useful when user patterns are specified, since overly generic patterns can result in dawg search exploring an overly large number of options. |
| repair_unchopped_blobs | 1 | Fix blobs that aren't chopped |
| chop_debug | 0 | Chop debug |
| chop_split_length | 10000 | Split Length |
| chop_same_distance | 2 | Same distance |
| chop_min_outline_points | 6 | Min Number of Points on Outline |
| chop_seam_pile_size | 150 | Max number of seams in seam_pile |
| chop_inside_angle | -50 | Min Inside Angle Bend |
| chop_min_outline_area | 2000 | Min Outline Area |
| chop_centered_maxwidth | 90 | Width of (smaller) chopped blobs above which we don't care that a chop is not near the center. |
| chop_x_y_weight | 3 | X / Y length weight |
| wordrec_debug_level | 0 | Debug level for wordrec |
| wordrec_max_join_chunks | 4 | Max number of broken pieces to associate |
| segsearch_debug_level | 0 | SegSearch debug level |
| segsearch_max_pain_points | 2000 | Maximum number of pain points stored in the queue |
| segsearch_max_futile_classifications | 20 | Maximum number of pain point classifications per chunk that did not result in finding a better word choice. |
| language_model_debug_level | 0 | Language model debug level |
| language_model_ngram_order | 8 | Maximum order of the character ngram model |
| language_model_viterbi_list_max_num_prunable | 10 | Maximum number of prunable (those for which PrunablePath() is true) entries in each viterbi list recorded in BLOB_CHOICEs |
| language_model_viterbi_list_max_size | 500 | Maximum size of viterbi lists recorded in BLOB_CHOICEs |
| language_model_min_compound_length | 3 | Minimum length of compound words |
| wordrec_display_segmentations | 0 | Display Segmentations |
| tessedit_pageseg_mode | 6 | Page seg mode: 0=osd only, 1=auto+osd, 2=auto_only, 3=auto, 4=column, 5=block_vert, 6=block, 7=line, 8=word, 9=word_circle, 10=char, 11=sparse_text, 12=sparse_text+osd, 13=raw_line (values from PageSegMode enum in tesseract/publictypes.h) |
| tessedit_ocr_engine_mode | 2 | Which OCR engine(s) to run (Tesseract, LSTM, both). Defaults to loading and running the most accurate available. |
| pageseg_devanagari_split_strategy | 0 | Whether to use the top-line splitting process for Devanagari documents while performing page segmentation. |
| ocr_devanagari_split_strategy | 0 | Whether to use the top-line splitting process for Devanagari documents while performing OCR. |
| bidi_debug | 0 | Debug level for BiDi |
| applybox_debug | 1 | Debug level |
| applybox_page | 0 | Page number to apply boxes from |
| tessedit_bigram_debug | 0 | Amount of debug output for bigram correction. |
| debug_noise_removal | 0 | Debug reassignment of small outlines |
| noise_maxperblob | 8 | Max diacritics to apply to a blob |
| noise_maxperword | 16 | Max diacritics to apply to a word |
| debug_x_ht_level | 0 | Reestimate debug |
| quality_min_initial_alphas_reqd | 2 | Alphas in a good word |
| tessedit_tess_adaption_mode | 39 | Adaptation decision algorithm for tess |
| multilang_debug_level | 0 | Print multilang debug info. |
| paragraph_debug_level | 0 | Print paragraph debug info. |
| tessedit_preserve_min_wd_len | 2 | Only preserve wds longer than this |
| crunch_rating_max | 10 | For adj length in rating per ch |
| crunch_pot_indicators | 1 | How many potential indicators needed |
| crunch_leave_lc_strings | 4 | Don't crunch words with long lower case strings |
| crunch_leave_uc_strings | 4 | Don't crunch words with long lower case strings |
| crunch_long_repetitions | 3 | Crunch words with long repetitions |
| crunch_debug | 0 | As it says |
| fixsp_non_noise_limit | 1 | How many non-noise blbs either side? |
| fixsp_done_mode | 1 | What constitues done for spacing |
| debug_fix_space_level | 0 | Contextual fixspace debug |
| x_ht_acceptance_tolerance | 8 | Max allowed deviation of blob top outside of font data |
| x_ht_min_change | 8 | Min change in xht before actually trying it |
| superscript_debug | 0 | Debug level for sub & superscript fixer |
| jpg_quality | 85 | Set JPEG quality level |
| user_defined_dpi | 0 | Specify DPI for input image |
| min_characters_to_try | 50 | Specify minimum characters to try during OSD |
| suspect_level | 99 | Suspect marker level |
| suspect_short_words | 2 | Don't suspect dict wds longer than this |
| tessedit_reject_mode | 0 | Rejection algorithm |
| tessedit_image_border | 2 | Rej blbs near image edge limit |
| min_sane_x_ht_pixels | 8 | Reject any x-ht lt or eq than this |
| tessedit_page_number | -1 | -1 -> All pages, else specific page to process |
| tessedit_parallelize | 1 | Run in parallel where possible |
| lstm_choice_mode | 2 | Allows to include alternative symbols choices in the hOCR output. Valid input values are 0, 1 and 2. 0 is the default value. With 1 the alternative symbol choices per timestep are included. With 2 alternative symbol choices are extracted from the CTC process instead of the lattice. The choices are mapped per character. |
| lstm_choice_iterations | 5 | Sets the number of cascading iterations for the Beamsearch in lstm_choice_mode. Note that lstm_choice_mode must be set to a value greater than 0 to produce results. |
| tosp_debug_level | 0 | Debug data |
| tosp_enough_space_samples_for_median | 3 | Or should we use mean |
| tosp_redo_kern_limit | 10 | No. samples reqd to reestimate for row |
| tosp_few_samples | 40 | No. gaps reqd with 1 large gap to treat as a table |
| tosp_short_row | 20 | No. gaps reqd with few cert spaces to use certs |
| tosp_sanity_method | 1 | How to avoid being silly |
| textord_max_noise_size | 7 | Pixel size of noise |
| textord_baseline_debug | 0 | Baseline debug level |
| textord_noise_sizefraction | 10 | Fraction of size for maxima |
| textord_noise_translimit | 16 | Transitions for normal blob |
| textord_noise_sncount | 1 | Super norm blobs to save row |
| use_ambigs_for_adaption | 0 | Use ambigs for deciding whether to adapt to a character |
| prioritize_division | 0 | Prioritize blob division over chopping |
| classify_enable_learning | 1 | Enable adaptive classifier |
| tess_cn_matching | 0 | Character Normalized Matching |
| tess_bn_matching | 0 | Baseline Normalized Matching |
| classify_enable_adaptive_matcher | 1 | Enable adaptive classifier |
| classify_use_pre_adapted_templates | 0 | Use pre-adapted classifier templates |
| classify_save_adapted_templates | 0 | Save adapted templates to a file |
| classify_enable_adaptive_debugger | 0 | Enable match debugger |
| classify_nonlinear_norm | 0 | Non-linear stroke-density normalization |
| disable_character_fragments | 1 | Do not include character fragments in the results of the classifier |
| classify_debug_character_fragments | 0 | Bring up graphical debugging windows for fragments training |
| matcher_debug_separate_windows | 0 | Use two different windows for debugging the matching: one for the protos and one for the features. |
| classify_bln_numeric_mode | 0 | Assume the input is numbers [0-9]. |
| load_system_dawg | 1 | Load system word dawg. |
| load_freq_dawg | 1 | Load frequent word dawg. |
| load_unambig_dawg | 1 | Load unambiguous word dawg. |
| load_punc_dawg | 1 | Load dawg with punctuation patterns. |
| load_number_dawg | 1 | Load dawg with number patterns. |
| load_bigram_dawg | 1 | Load dawg with special word bigrams. |
| use_only_first_uft8_step | 0 | Use only the first UTF8 step of the given string when computing log probabilities. |
| stopper_no_acceptable_choices | 0 | Make AcceptableChoice() always return false. Useful when there is a need to explore all segmentations. |
| segment_nonalphabetic_script | 0 | Don't use any alphabetic-specific tricks. Set to true in the traineddata config file for scripts that are cursive or inherently fixed-pitch. |
| save_doc_words | 0 | Save Document Words |
| merge_fragments_in_matrix | 1 | Merge the fragments in the ratings matrix and delete them after merging |
| wordrec_enable_assoc | 1 | Associator Enable |
| force_word_assoc | 0 | Force associator to run regardless of what enable_assoc is. This is used for CJK where component grouping is necessary. |
| chop_enable | 1 | Chop enable |
| chop_vertical_creep | 0 | Vertical creep |
| chop_new_seam_pile | 1 | Use new seam_pile |
| assume_fixed_pitch_char_segment | 0 | Include fixed-pitch heuristics in char segmentation |
| wordrec_skip_no_truth_words | 0 | Only run OCR for words that had truth recorded in BlamerBundle |
| wordrec_debug_blamer | 0 | Print blamer debug messages |
| wordrec_run_blamer | 0 | Try to set the blame for errors |
| save_alt_choices | 1 | Save alternative paths found during chopping and segmentation search |
| language_model_ngram_on | 0 | Turn on/off the use of character ngram model |
| language_model_ngram_use_only_first_uft8_step | 0 | Use only the first UTF8 step of the given string when computing log probabilities. |
| language_model_ngram_space_delimited_language | 1 | Words are delimited by space |
| language_model_use_sigmoidal_certainty | 0 | Use sigmoidal score for certainty |
| tessedit_resegment_from_boxes | 0 | Take segmentation and labeling from box file |
| tessedit_resegment_from_line_boxes | 0 | Conversion of word/line box file to char box file |
| tessedit_train_from_boxes | 0 | Generate training data from boxed chars |
| tessedit_make_boxes_from_boxes | 0 | Generate more boxes from boxed chars |
| tessedit_train_line_recognizer | 0 | Break input into lines and remap boxes if present |
| tessedit_dump_pageseg_images | 0 | Dump intermediate images made during page segmentation |
| tessedit_do_invert | 1 | Try inverting the image in LSTMRecognizeWord |
| tessedit_ambigs_training | 0 | Perform training for ambiguities |
| tessedit_adaption_debug | 0 | Generate and print debug information for adaption |
| applybox_learn_chars_and_char_frags_mode | 0 | Learn both character fragments (as is done in the special low exposure mode) as well as unfragmented characters. |
| applybox_learn_ngrams_mode | 0 | Each bounding box is assumed to contain ngrams. Only learn the ngrams whose outlines overlap horizontally. |
| tessedit_display_outwords | 0 | Draw output words |
| tessedit_dump_choices | 0 | Dump char choices |
| tessedit_timing_debug | 0 | Print timing stats |
| tessedit_fix_fuzzy_spaces | 1 | Try to improve fuzzy spaces |
| tessedit_unrej_any_wd | 0 | Don't bother with word plausibility |
| tessedit_fix_hyphens | 1 | Crunch double hyphens? |
| tessedit_enable_doc_dict | 1 | Add words to the document dictionary |
| tessedit_debug_fonts | 0 | Output font info per char |
| tessedit_debug_block_rejection | 0 | Block and Row stats |
| tessedit_enable_bigram_correction | 1 | Enable correction based on the word bigram dictionary. |
| tessedit_enable_dict_correction | 0 | Enable single word correction based on the dictionary. |
| enable_noise_removal | 1 | Remove and conditionally reassign small outlines when they confuse layout analysis, determining diacritics vs noise |
| tessedit_minimal_rej_pass1 | 0 | Do minimal rejection on pass 1 output |
| tessedit_test_adaption | 0 | Test adaption criteria |
| test_pt | 0 | Test for point |
| paragraph_text_based | 1 | Run paragraph detection on the post-text-recognition (more accurate) |
| lstm_use_matrix | 1 | Use ratings matrix/beam search with lstm |
| tessedit_good_quality_unrej | 1 | Reduce rejection on good docs |
| tessedit_use_reject_spaces | 1 | Reject spaces? |
| tessedit_preserve_blk_rej_perfect_wds | 1 | Only rej partially rejected words in block rejection |
| tessedit_preserve_row_rej_perfect_wds | 1 | Only rej partially rejected words in row rejection |
| tessedit_dont_blkrej_good_wds | 0 | Use word segmentation quality metric |
| tessedit_dont_rowrej_good_wds | 0 | Use word segmentation quality metric |
| tessedit_row_rej_good_docs | 1 | Apply row rejection to good docs |
| tessedit_reject_bad_qual_wds | 1 | Reject all bad quality wds |
| tessedit_debug_doc_rejection | 0 | Page stats |
| tessedit_debug_quality_metrics | 0 | Output data to debug file |
| bland_unrej | 0 | Unrej potential with no checks |
| unlv_tilde_crunching | 0 | Mark v.bad words for tilde crunch |
| hocr_font_info | 0 | Add font info to hocr output |
| hocr_char_boxes | 0 | Add coordinates for each character to hocr output |
| crunch_early_merge_tess_fails | 1 | Before word crunch? |
| crunch_early_convert_bad_unlv_chs | 0 | Take out ~^ early? |
| crunch_terrible_garbage | 1 | As it says |
| crunch_leave_ok_strings | 1 | Don't touch sensible strings |
| crunch_accept_ok | 1 | Use acceptability in okstring |
| crunch_leave_accept_strings | 0 | Don't pot crunch sensible strings |
| crunch_include_numerals | 0 | Fiddle alpha figures |
| tessedit_prefer_joined_punct | 0 | Reward punctuation joins |
| tessedit_write_block_separators | 0 | Write block separators in output |
| tessedit_write_rep_codes | 0 | Write repetition char code |
| tessedit_write_unlv | 0 | Write .unlv output file |
| tessedit_create_txt | 0 | Write .txt output file |
| tessedit_create_hocr | 0 | Write .html hOCR output file |
| tessedit_create_alto | 0 | Write .xml ALTO file |
| tessedit_create_lstmbox | 0 | Write .box file for LSTM training |
| tessedit_create_tsv | 0 | Write .tsv output file |
| tessedit_create_wordstrbox | 0 | Write WordStr format .box output file |
| tessedit_create_pdf | 0 | Write .pdf output file |
| textonly_pdf | 0 | Create PDF with only one invisible text layer |
| suspect_constrain_1Il | 0 | UNLV keep 1Il chars rejected |
| tessedit_minimal_rejection | 0 | Only reject tess failures |
| tessedit_zero_rejection | 0 | Don't reject ANYTHING |
| tessedit_word_for_word | 0 | Make output have exactly one word per WERD |
| tessedit_zero_kelvin_rejection | 0 | Don't reject ANYTHING AT ALL |
| tessedit_rejection_debug | 0 | Adaption debug |
| tessedit_flip_0O | 1 | Contextual 0O O0 flips |
| rej_trust_doc_dawg | 0 | Use DOC dawg in 11l conf. detector |
| rej_1Il_use_dict_word | 0 | Use dictword test |
| rej_1Il_trust_permuter_type | 1 | Don't double check |
| rej_use_tess_accepted | 1 | Individual rejection control |
| rej_use_tess_blanks | 1 | Individual rejection control |
| rej_use_good_perm | 1 | Individual rejection control |
| rej_use_sensible_wd | 0 | Extend permuter check |
| rej_alphas_in_number_perm | 0 | Extend permuter check |
| tessedit_create_boxfile | 0 | Output text with boxes |
| tessedit_write_images | 0 | Capture the image from the IPE |
| interactive_display_mode | 0 | Run interactively? |
| tessedit_override_permuter | 1 | According to dict_word |
| tessedit_use_primary_params_model | 0 | In multilingual mode use params model of the primary language |
| textord_tabfind_show_vlines | 0 | Debug line finding |
| textord_use_cjk_fp_model | 0 | Use CJK fixed pitch model |
| poly_allow_detailed_fx | 0 | Allow feature extractors to see the original outline |
| tessedit_init_config_only | 0 | Only initialize with the config file. Useful if the instance is not going to be used for OCR but say only for layout analysis. |
| textord_equation_detect | 0 | Turn on equation detector |
| textord_tabfind_vertical_text | 1 | Enable vertical detection |
| textord_tabfind_force_vertical_text | 0 | Force using vertical text page mode |
| preserve_interword_spaces | 0 | Preserve multiple interword spaces |
| pageseg_apply_music_mask | 1 | Detect music staff and remove intersecting components |
| textord_single_height_mode | 0 | Script has no xheight, so use a single mode |
| tosp_old_to_method | 0 | Space stats use prechopping? |
| tosp_old_to_constrain_sp_kn | 0 | Constrain relative values of inter and intra-word gaps for old_to_method. |
| tosp_only_use_prop_rows | 1 | Block stats to use fixed pitch rows? |
| tosp_force_wordbreak_on_punct | 0 | Force word breaks on punct to break long lines in non-space delimited langs |
| tosp_use_pre_chopping | 0 | Space stats use prechopping? |
| tosp_old_to_bug_fix | 0 | Fix suspected bug in old code |
| tosp_block_use_cert_spaces | 1 | Only stat OBVIOUS spaces |
| tosp_row_use_cert_spaces | 1 | Only stat OBVIOUS spaces |
| tosp_narrow_blobs_not_cert | 1 | Only stat OBVIOUS spaces |
| tosp_row_use_cert_spaces1 | 1 | Only stat OBVIOUS spaces |
| tosp_recovery_isolated_row_stats | 1 | Use row alone when inadequate cert spaces |
| tosp_only_small_gaps_for_kern | 0 | Better guess |
| tosp_all_flips_fuzzy | 0 | Pass ANY flip to context? |
| tosp_fuzzy_limit_all | 1 | Don't restrict kn->sp fuzzy limit to tables |
| textord_no_rejects | 0 | Don't remove noise blobs |
| textord_show_blobs | 0 | Display unsorted blobs |
| textord_show_boxes | 0 | Display unsorted blobs |
| textord_noise_rejwords | 1 | Reject noise-like words |
| textord_noise_rejrows | 1 | Reject noise-like rows |
| textord_noise_debug | 0 | Debug row garbage detector |
| classify_learn_debug_str |  | Class str to debug learning |
| user_words_file |  | A filename of user-provided words. |
| user_words_suffix |  | A suffix of user-provided words located in tessdata. |
| user_patterns_file |  | A filename of user-provided patterns. |
| user_patterns_suffix |  | A suffix of user-provided patterns located in tessdata. |
| output_ambig_words_file |  | Output file for ambiguities found in the dictionary |
| word_to_debug |  | Word for which stopper debug information should be printed to stdout |
| tessedit_char_blacklist |  | Blacklist of chars not to recognize |
| tessedit_char_whitelist |  | Whitelist of chars to recognize |
| tessedit_char_unblacklist |  | List of chars to override tessedit_char_blacklist |
| tessedit_write_params_to_file |  | Write all parameters to the given file. |
| applybox_exposure_pattern | .exp | Exposure value follows this pattern in the image filename. The name of the image files are expected to be in the form [lang].[fontname].exp[num].tif |
| chs_leading_punct | ('`" | Leading punctuation |
| chs_trailing_punct1 | ).,;:?! | 1st Trailing punctuation |
| chs_trailing_punct2 | )'`" | 2nd Trailing punctuation |
| page_separator | \f | Page separator (default is form feed control character) |
| outlines_odd | %\| | Non standard number of outlines |
| outlines_2 | ij!?%":; | Non standard number of outlines |
| numeric_punctuation | ., | Punct. chs expected WITHIN numbers |
| unrecognised_char | \| | Output char for unidentified blobs |
| ok_repeated_ch_non_alphanum_wds | -?*= | Allow NN to unrej |
| conflict_set_I_l_1 | Il1[] | Il1 conflict set |
| file_type | .tif | Filename extension |
| tessedit_load_sublangs |  | List of languages to load with this one |
| classify_char_norm_range | 0.2 | Character Normalization Range ... |
| classify_max_rating_ratio | 1.5 | Veto ratio between classifier ratings |
| classify_max_certainty_margin | 5.5 | Veto difference between classifier certainties |
| matcher_good_threshold | 0.125 | Good Match (0-1) |
| matcher_reliable_adaptive_result | 0 | Great Match (0-1) |
| matcher_perfect_threshold | 0.02 | Perfect Match (0-1) |
| matcher_bad_match_pad | 0.15 | Bad Match Pad (0-1) |
| matcher_rating_margin | 0.1 | New template margin (0-1) |
| matcher_avg_noise_size | 12 | Avg. noise blob length |
| matcher_clustering_max_angle_delta | 0.015 | Maximum angle delta for prototype clustering |
| classify_misfit_junk_penalty | 0 | Penalty to apply when a non-alnum is vertically out of its expected textline position |
| rating_scale | 1.5 | Rating scaling factor |
| certainty_scale | 20 | Certainty scaling factor |
| tessedit_class_miss_scale | 0.00390625 | Scale factor for features not used |
| classify_adapted_pruning_factor | 2.5 | Prune poor adapted results this much worse than best result |
| classify_adapted_pruning_threshold | -1 | Threshold at which classify_adapted_pruning_factor starts |
| classify_character_fragments_garbage_certainty_threshold | -3 | Exclude fragments that do not look like whole characters from training and adaption |
| speckle_large_max_size | 0.3 | Max large speckle size |
| speckle_rating_penalty | 10 | Penalty to add to worst rating for noise |
| xheight_penalty_subscripts | 0.125 | Score penalty (0.1 = 10%) added if there are subscripts or superscripts in a word, but it is otherwise OK. |
| xheight_penalty_inconsistent | 0.25 | Score penalty (0.1 = 10%) added if an xheight is inconsistent. |
| segment_penalty_dict_frequent_word | 1 | Score multiplier for word matches which have good case and are frequent in the given language (lower is better). |
| segment_penalty_dict_case_ok | 1.1 | Score multiplier for word matches that have good case (lower is better). |
| segment_penalty_dict_case_bad | 1.3125 | Default score multiplier for word matches, which may have case issues (lower is better). |
| segment_penalty_dict_nonword | 1.25 | Score multiplier for glyph fragment segmentations which do not match a dictionary word (lower is better). |
| stopper_nondict_certainty_base | -2.5 | Certainty threshold for non-dict words |
| stopper_phase2_certainty_rejection_offset | 1 | Reject certainty offset |
| stopper_certainty_per_char | -0.5 | Certainty to add for each dict char above small word size. |
| stopper_allowable_character_badness | 3 | Max certainty variation allowed in a word (in sigma) |
| doc_dict_pending_threshold | 0 | Worst certainty for using pending dictionary |
| doc_dict_certainty_threshold | -2.25 | Worst certainty for words that can be inserted into the document dictionary |
| tessedit_certainty_threshold | -2.25 | Good blob limit |
| chop_split_dist_knob | 0.5 | Split length adjustment |
| chop_overlap_knob | 0.9 | Split overlap adjustment |
| chop_center_knob | 0.15 | Split center adjustment |
| chop_sharpness_knob | 0.06 | Split sharpness adjustment |
| chop_width_change_knob | 5 | Width change adjustment |
| chop_ok_split | 100 | OK split limit |
| chop_good_split | 50 | Good split limit |
| segsearch_max_char_wh_ratio | 2 | Max character width-to-height ratio |
For best results, it is recommended to apply IronOCR's [image preprocessing filters](https://ironsoftware.com/csharp/ocr/examples/ocr-image-filters-for-net-tesseract/) before running OCR. These filters can significantly improve accuracy, especially when working with [low-quality scans](https://ironsoftware.com/csharp/ocr/examples/ocr-low-quality-scans-tesseract/) or complex documents such as [tables](https://ironsoftware.com/csharp/ocr/examples/read-table-in-document/).

Frequently Asked Questions

How do I configure IronTesseract for OCR in C#?

To configure IronTesseract, create an IronTesseract instance and set properties such as Language and Configuration. You can specify the OCR language (choosing from 125 supported languages), enable barcode reading, configure searchable PDF output, and set a character whitelist. For example: `var tesseract = new IronOcr.IronTesseract { Language = IronOcr.OcrLanguage.English, Configuration = new IronOcr.TesseractConfiguration { ReadBarCodes = false, RenderSearchablePdf = true } };`

What input formats does IronTesseract support?

IronTesseract accepts a variety of input formats through the OcrInput class. You can process images (PNG, JPG, and more), PDF files, and scanned documents. The OcrInput class provides flexible methods for loading these different formats, making it easy to perform OCR on almost any document that contains text.
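
As a minimal sketch of mixing formats (assuming the IronOcr NuGet package is installed; the file names are placeholders), several inputs can be loaded into a single OcrInput before one Read() call:

```csharp
using System;
using IronOcr;

var ocr = new IronTesseract();

// Load a mix of input formats into one OcrInput
using var input = new OcrInput();
input.LoadImage("scan.png");    // raster image
input.LoadPdf("contract.pdf");  // PDF (all pages)

// One Read() covers everything that was loaded
OcrResult result = ocr.Read(input);
Console.WriteLine(result.Text);
```

LoadImage and LoadPdf are the documented OcrInput loaders; the exact file names above are illustrative only.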

Can IronTesseract read barcodes while reading text?

Yes, IronTesseract includes advanced barcode reading. You can enable barcode detection by setting ReadBarCodes = true in the TesseractConfiguration. This lets you extract both text and barcode data from the same document in a single OCR pass.
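
A short sketch of that setting (the file name is a placeholder, and this assumes the OcrResult exposes detected barcodes through its Barcodes collection, as in current IronOCR releases):

```csharp
using System;
using IronOcr;

var ocr = new IronTesseract();
ocr.Configuration.ReadBarCodes = true;  // detect barcodes alongside text

using var input = new OcrInput();
input.LoadImage("invoice-with-barcode.png");

OcrResult result = ocr.Read(input);
Console.WriteLine(result.Text);            // the recognized text
foreach (var barcode in result.Barcodes)   // barcodes found in the same pass
    Console.WriteLine(barcode.Value);
```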

How do I create a searchable PDF from a scanned document?

IronTesseract can convert scanned documents and images into searchable PDFs by setting RenderSearchablePdf = true in the TesseractConfiguration. The resulting PDF keeps the original document's appearance while making its text selectable and searchable.
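
For example (a sketch with placeholder file names; SaveAsSearchablePdf is the OcrResult method documented for this output):

```csharp
using IronOcr;

var ocr = new IronTesseract();
ocr.Configuration.RenderSearchablePdf = true;  // render a searchable text layer

using var input = new OcrInput();
input.LoadImage("scanned-page.jpg");

OcrResult result = ocr.Read(input);
// Writes a PDF whose text is selectable and searchable
// while preserving the scanned page's appearance
result.SaveAsSearchablePdf("searchable-output.pdf");
```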

Which languages does IronTesseract support for OCR?

IronTesseract supports text recognition in 125 international languages. You specify the language by setting the Language property of the IronTesseract instance, for example IronOcr.OcrLanguage.English, Spanish, Chinese, or Arabic.

Can I restrict which characters OCR recognizes?

Yes, IronTesseract supports character whitelisting and blacklisting through the WhiteListCharacters property in TesseractConfiguration. This helps improve accuracy when you know the expected character set, for example when recognition should be limited to alphanumeric characters.
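
A minimal sketch of a whitelist restricted to digits, which can help on serial numbers or amounts (the file name is a placeholder):

```csharp
using System;
using IronOcr;

var ocr = new IronTesseract
{
    Configuration = new TesseractConfiguration
    {
        // Only these characters may appear in the OCR result
        WhiteListCharacters = "0123456789"
    }
};

using var input = new OcrInput();
input.LoadImage("serial-number.png");
Console.WriteLine(ocr.Read(input).Text);
```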

How do I run OCR on multiple documents at once?

IronTesseract supports multithreading for batch processing. You can use parallel processing to OCR several documents simultaneously, significantly improving performance when working through large numbers of images or PDF files.
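
One way to sketch that batch pattern with Parallel.ForEach (the file list is illustrative; creating a fresh IronTesseract per worker keeps each thread independent):

```csharp
using System;
using System.Threading.Tasks;
using IronOcr;

string[] files = { "page1.png", "page2.png", "page3.png" };

Parallel.ForEach(files, file =>
{
    // One engine instance per worker thread
    var ocr = new IronTesseract();
    using var input = new OcrInput();
    input.LoadImage(file);
    OcrResult result = ocr.Read(input);
    Console.WriteLine($"{file}: {result.Text.Length} characters");
});
```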

Which version of Tesseract does IronOCR use?

IronOCR uses a customized and optimized build of Tesseract 5, known as Iron Tesseract. This enhanced engine improves accuracy and performance over standard Tesseract implementations while maintaining compatibility with .NET applications.

Curtis Chau
Technical Writer

Curtis Chau holds a Bachelor of Computer Science from Carleton University and specializes in front-end development, with expertise in Node.js, TypeScript, JavaScript, and React. Passionate about crafting intuitive and aesthetically pleasing user interfaces, he enjoys working with modern frameworks and creating well-structured, visually appealing manuals.

Beyond development, Curtis has a keen interest in the Internet of Things (IoT), exploring new ways to integrate hardware and software. In his spare time, he enjoys gaming and building Discord bots, combining his love of technology with creativity.

Reviewed by
Jeff Fritz
Jeffrey T. Fritz
Principal Program Manager - .NET Community Team
Jeff is also a Principal Program Manager on the .NET and Visual Studio teams. He is the executive producer of the .NET Conf virtual conference series and hosts the "Fritz and Friends" live stream, talking tech and writing code with viewers twice a week. Jeff writes workshops and presentations and plans content for the largest Microsoft developer events, including Microsoft Build, Microsoft Ignite, .NET Conf, and the Microsoft MVP Summit.
Ready to Get Started?
NuGet downloads 5,299,091 | Version: 2025.12 just released