IronWebScraper Tutorial: Advanced Webscraping Features

Curtis Chau | Updated: June 9, 2025

## The HttpIdentity Feature

Some website systems require the user to be logged in before content can be viewed. In that case, we can use HttpIdentity. Here is how to set it up:

```csharp
// Create a new instance of HttpIdentity
HttpIdentity id = new HttpIdentity();

// Set the network username and password for authentication
id.NetworkUsername = "username";
id.NetworkPassword = "pwd";

// Add the identity to the collection of identities
Identities.Add(id);
```

One of the most impressive and powerful features of IronWebScraper is the ability to use thousands of unique user credentials and/or browser engines to spoof or scrape a website using multiple login sessions.

```csharp
public override void Init()
{
    // Set the license key for IronWebScraper
    License.LicenseKey = "LicenseKey";

    // Set the logging level to capture all logs
    this.LoggingLevel = WebScraper.LogLevel.All;

    // Assign the working directory for the output files
    this.WorkingDirectory = AppSetting.GetAppRoot() + @"\ShoppingSiteSample\Output\";

    // Define an array of proxies
    var proxies = "IP-Proxy1:8080,IP-Proxy2:8081".Split(',');

    // Iterate over common Chrome desktop user agents
    foreach (var UA in IronWebScraper.CommonUserAgents.ChromeDesktopUserAgents)
    {
        // Iterate over the proxies
        foreach (var proxy in proxies)
        {
            // Add a new HTTP identity with a specific user agent and proxy
            Identities.Add(new HttpIdentity()
            {
                UserAgent = UA,
                UseCookies = true,
                Proxy = proxy
            });
        }
    }

    // Make an initial request to the website with a parse method
    this.Request("http://www.Website.com", Parse);
}
```
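Both snippets above hand a Parse callback to this.Request but never show one. As a hedged sketch of what such a handler can look like (the Css and TextContentClean calls follow IronWebScraper's usual Response API; the ".product h3" selector and the "Title" field are hypothetical and must be adapted to the target page's markup):

```csharp
// A minimal parse handler. The CSS selector and the scraped field name
// below are illustrative only, not taken from a real site.
public override void Parse(Response response)
{
    // Query elements with a CSS selector and save each matching title
    foreach (var title in response.Css(".product h3"))
    {
        // Queue the scraped record for output
        Scrape(new ScrapedData() { { "Title", title.TextContentClean } });
    }
}
```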
You have several properties that produce different behaviors, which helps prevent websites from blocking you:

- NetworkDomain: the network domain used for user authentication. Supports Windows, NTLM, Kerberos, Linux, BSD, and Mac OS X networks. Must be used together with NetworkUsername and NetworkPassword.
- NetworkUsername: the network/HTTP username used for user authentication. Supports HTTP, Windows networks, NTLM, Kerberos, Linux networks, BSD networks, and Mac OS.
- NetworkPassword: the network/HTTP password used for user authentication. Supports HTTP, Windows networks, NTLM, Kerberos, Linux networks, BSD networks, and Mac OS.
- Proxy: sets the proxy settings.
- UserAgent: sets the browser engine (e.g., Chrome desktop, Chrome mobile, Chrome tablet, IE, Firefox, etc.).
- HttpRequestHeaders: custom header values to use with this identity; accepts a Dictionary<string, string>.
- UseCookies: enables/disables the use of cookies.

IronWebScraper runs the scraper using random identities. If we need to specify that a page should be parsed with a particular identity, we can do so:

```csharp
public override void Init()
{
    // Set the license key for IronWebScraper
    License.LicenseKey = "LicenseKey";

    // Set the logging level to capture all logs
    this.LoggingLevel = WebScraper.LogLevel.All;

    // Assign the working directory for the output files
    this.WorkingDirectory = AppSetting.GetAppRoot() + @"\ShoppingSiteSample\Output\";

    // Create a new instance of HttpIdentity
    HttpIdentity identity = new HttpIdentity();

    // Set the network username and password for authentication
    identity.NetworkUsername = "username";
    identity.NetworkPassword = "pwd";

    // Add the identity to the collection of identities
    Identities.Add(identity);

    // Make a request to the website with the specified identity
    this.Request("http://www.Website.com", Parse, identity);
}
```
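Several of the identity properties listed above can be combined on a single identity. A sketch for illustration (the user agent string, proxy address, and header names/values are example values; HttpRequestHeaders takes the Dictionary<string, string> noted above):

```csharp
using System.Collections.Generic;

HttpIdentity identity = new HttpIdentity();

// Example values only; substitute real proxies and headers as needed
identity.UserAgent = "Mozilla/5.0 (Windows NT 10.0; Win64; x64)";
identity.Proxy = "IP-Proxy1:8080";
identity.UseCookies = true;

// Custom headers for this identity, as a Dictionary<string, string>
identity.HttpRequestHeaders = new Dictionary<string, string>
{
    { "Accept-Language", "en-US,en;q=0.9" },
    { "Referer", "http://www.Website.com" }
};

Identities.Add(identity);
```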
## Enable the Web Cache Feature

This feature caches requested pages. It is typically used during development and testing, allowing developers to cache required pages so they can be reused after the code is updated. This lets you execute your code against cached pages after restarting your web scraper, without connecting to the live website each time (action replay).

Use it in the Init() method:

```csharp
// Enable the web cache without an expiration time
EnableWebCache();

// OR enable the web cache with a specified expiration time
EnableWebCache(new TimeSpan(1, 30, 30));
```

It saves your cached data to the WebCache folder under the working directory folder.

```csharp
public override void Init()
{
    // Set the license key for IronWebScraper
    License.LicenseKey = "LicenseKey";

    // Set the logging level to capture all logs
    this.LoggingLevel = WebScraper.LogLevel.All;

    // Assign the working directory for the output files
    this.WorkingDirectory = AppSetting.GetAppRoot() + @"\ShoppingSiteSample\Output\";

    // Enable the web cache with an expiration time of 1 hour, 30 minutes, and 30 seconds
    EnableWebCache(new TimeSpan(1, 30, 30));

    // Make an initial request to the website with a parse method
    this.Request("http://www.Website.com", Parse);
}
```

IronWebScraper also has a feature that lets the engine continue scraping after the code is restarted, by setting the engine's start-process name using Start(CrawlID):

```csharp
static void Main(string[] args)
{
    // Create an object from the Scraper class
    EngineScraper scrape = new EngineScraper();

    // Start the scraping process with the specified crawl ID
    scrape.Start("enginestate");
}
```
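Putting the caching and resume features together, a minimal sketch of a complete, resumable scraper class (EngineScraper, AppSetting.GetAppRoot(), and the URL are the sample names used throughout this tutorial; the Parse body is left as a stub):

```csharp
using System;
using IronWebScraper;

class EngineScraper : WebScraper
{
    public override void Init()
    {
        License.LicenseKey = "LicenseKey";
        this.WorkingDirectory = AppSetting.GetAppRoot() + @"\ShoppingSiteSample\Output\";

        // Cache pages for development replay (1h 30m 30s expiry)
        EnableWebCache(new TimeSpan(1, 30, 30));

        this.Request("http://www.Website.com", Parse);
    }

    public override void Parse(Response response)
    {
        // Parsing logic goes here
    }

    static void Main(string[] args)
    {
        // Named crawl: state is saved so the crawl resumes after a restart
        new EngineScraper().Start("enginestate");
    }
}
```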
The execution requests and responses are saved in the SavedState folder inside the working directory.

## Throttling

We can control the minimum and maximum connection counts and the connection speed per domain.

```csharp
public override void Init()
{
    // Set the license key for IronWebScraper
    License.LicenseKey = "LicenseKey";

    // Set the logging level to capture all logs
    this.LoggingLevel = WebScraper.LogLevel.All;

    // Assign the working directory for the output files
    this.WorkingDirectory = AppSetting.GetAppRoot() + @"\ShoppingSiteSample\Output\";

    // Set the total number of allowed open HTTP requests (threads)
    this.MaxHttpConnectionLimit = 80;

    // Set the minimum polite delay (pause) between requests to a given domain or IP address
    this.RateLimitPerHost = TimeSpan.FromMilliseconds(50);

    // Set the allowed number of concurrent HTTP requests (threads) per hostname or IP address
    this.OpenConnectionLimitPerHost = 25;

    // Do not obey robots.txt files
    this.ObeyRobotsDotTxt = false;

    // Make the WebScraper intelligently throttle requests not only by hostname,
    // but also by the host servers' IP addresses
    this.ThrottleMode = Throttle.ByDomainHostName;

    // Make an initial request to the website with a parse method
    this.Request("https://www.Website.com", Parse);
}
```
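As a rough sanity check on the numbers above (back-of-envelope arithmetic, not an IronWebScraper API): RateLimitPerHost = 50 ms caps a host at about 20 new requests per second, while 25 open connections at an assumed 2-second average page latency sustain about 12.5 completions per second, so in this configuration the concurrency limit is the effective per-host bottleneck:

```csharp
using System;

class ThrottleMath
{
    static void Main()
    {
        double rateLimitMs = 50;          // RateLimitPerHost, in milliseconds
        int perHostConnections = 25;      // OpenConnectionLimitPerHost
        double avgResponseSeconds = 2.0;  // assumed average page latency

        // Ceiling imposed by the polite delay between requests
        double delayCap = 1000.0 / rateLimitMs;                       // 20 req/s

        // Ceiling imposed by per-host concurrency at the assumed latency
        double concurrencyCap = perHostConnections / avgResponseSeconds; // 12.5 req/s

        // Effective per-host request rate is the smaller of the two
        Console.WriteLine(Math.Min(delayCap, concurrencyCap)); // prints 12.5
    }
}
```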
### Throttling properties

- MaxHttpConnectionLimit: the total number of allowed open HTTP requests (threads).
- RateLimitPerHost: the minimum polite delay (pause), in milliseconds, between requests to a given domain or IP address.
- OpenConnectionLimitPerHost: the number of concurrent HTTP requests (threads) allowed per hostname.
- ThrottleMode: makes the WebScraper intelligently throttle requests not only by hostname, but also by the host servers' IP addresses. This is polite when multiple scraped domains are hosted on the same machine.

## Get Started with IronWebScraper

Use IronWebScraper in your project today with a free trial.

## FAQ

How do I authenticate a user on a login-protected website in C#?
You can use IronWebScraper's HttpIdentity feature to authenticate a user by setting NetworkDomain, NetworkUsername, and NetworkPassword.

What are the benefits of using the web cache during development?
The web cache feature lets you cache requested pages for reuse, saving time and resources by avoiding repeated connections to the live website; it is especially useful during development and testing.

How do I manage multiple login sessions when web scraping?
IronWebScraper supports using thousands of unique user credentials and browser engines to simulate multiple login sessions, which helps prevent websites from detecting and blocking the scraper.

What advanced throttling options are available for web scraping?
IronWebScraper provides the ThrottleMode setting, which intelligently manages request throttling by both hostname and IP address, ensuring polite interaction with shared hosting environments.

How do I use proxies with IronWebScraper?
To use proxies, define an array of proxies and associate them with HttpIdentity instances, allowing requests to be routed through different IP addresses for anonymity and access control.

How does IronWebScraper handle request delays to avoid overloading servers?
The RateLimitPerHost setting specifies the minimum delay between requests to a given domain or IP address, helping prevent server overload by spacing out requests.

Can web scraping continue after an interruption?
Yes. IronWebScraper can resume scraping after an interruption using the Start(CrawlID) method, which saves the execution state and continues from the last saved point.

How do I control the number of concurrent HTTP connections in my web scraper?
You can set the MaxHttpConnectionLimit property to control the total number of allowed open HTTP requests, helping manage server load and resources.

What options are available for logging web scraping activity?
IronWebScraper lets you set the logging level with the LoggingLevel property, enabling comprehensive logging during scraping operations for detailed analysis and troubleshooting.

About the author: Curtis Chau, Technical Writer
Curtis Chau holds a Bachelor of Computer Science from Carleton University and specializes in front-end development, with expertise in Node.js, TypeScript, JavaScript, and React. Passionate about crafting intuitive and aesthetically pleasing user interfaces, he enjoys working with modern frameworks and creating well-structured, visually appealing manuals. Beyond development, Curtis has a keen interest in the Internet of Things (IoT), exploring new ways to integrate hardware and software. In his spare time, he enjoys gaming and building Discord bots, combining his love of technology with creativity.