Advanced Webscraping Features

This article was translated from English: Does it need improvement?
Translated
View the article in English

HttpIdentity 功能

某些网站系统需要用户登录才能查看内容; 在这种情况下,我们可以使用HttpIdentity。 以下是设置方法:

// Create a new instance of HttpIdentity
HttpIdentity id = new HttpIdentity();

// Set the network username and password for authentication
id.NetworkUsername = "username";
id.NetworkPassword = "pwd";

// Add the identity to the collection of identities
Identities.Add(id);
// Create a new instance of HttpIdentity
HttpIdentity id = new HttpIdentity();

// Set the network username and password for authentication
id.NetworkUsername = "username";
id.NetworkPassword = "pwd";

// Add the identity to the collection of identities
Identities.Add(id);
' Create a new instance of HttpIdentity
Dim id As New HttpIdentity()

' Set the network username and password for authentication
id.NetworkUsername = "username"
id.NetworkPassword = "pwd"

' Add the identity to the collection of identities
Identities.Add(id)
$vbLabelText   $csharpLabel

IronWebScraper 中最令人印象深刻和强大的功能之一是能够使用成千上万个独特的用户凭据和/或浏览器引擎,通过多个登录会话来伪装或抓取网站。

public override void Init()
{
    // Set the license key for IronWebScraper
    License.LicenseKey = "LicenseKey";

    // Set the logging level to capture all logs
    this.LoggingLevel = WebScraper.LogLevel.All;

    // Assign the working directory for the output files
    this.WorkingDirectory = AppSetting.GetAppRoot() + @"\ShoppingSiteSample\Output\";

    // Define an array of proxies
    var proxies = "IP-Proxy1:8080,IP-Proxy2:8081".Split(',');

    // Iterate over common Chrome desktop user agents
    foreach (var UA in IronWebScraper.CommonUserAgents.ChromeDesktopUserAgents)
    {
        // Iterate over the proxies
        foreach (var proxy in proxies)
        {
            // Add a new HTTP identity with specific user agent and proxy
            Identities.Add(new HttpIdentity()
            {
                UserAgent = UA,
                UseCookies = true,
                Proxy = proxy
            });
        }
    }

    // Make an initial request to the website with a parse method
    this.Request("http://www.Website.com", Parse);
}
public override void Init()
{
    // Set the license key for IronWebScraper
    License.LicenseKey = "LicenseKey";

    // Set the logging level to capture all logs
    this.LoggingLevel = WebScraper.LogLevel.All;

    // Assign the working directory for the output files
    this.WorkingDirectory = AppSetting.GetAppRoot() + @"\ShoppingSiteSample\Output\";

    // Define an array of proxies
    var proxies = "IP-Proxy1:8080,IP-Proxy2:8081".Split(',');

    // Iterate over common Chrome desktop user agents
    foreach (var UA in IronWebScraper.CommonUserAgents.ChromeDesktopUserAgents)
    {
        // Iterate over the proxies
        foreach (var proxy in proxies)
        {
            // Add a new HTTP identity with specific user agent and proxy
            Identities.Add(new HttpIdentity()
            {
                UserAgent = UA,
                UseCookies = true,
                Proxy = proxy
            });
        }
    }

    // Make an initial request to the website with a parse method
    this.Request("http://www.Website.com", Parse);
}
Public Overrides Sub Init()
	' Set the license key for IronWebScraper
	License.LicenseKey = "LicenseKey"

	' Set the logging level to capture all logs
	Me.LoggingLevel = WebScraper.LogLevel.All

	' Assign the working directory for the output files
	Me.WorkingDirectory = AppSetting.GetAppRoot() & "\ShoppingSiteSample\Output\"

	' Define an array of proxies
	Dim proxies = "IP-Proxy1:8080,IP-Proxy2:8081".Split(","c)

	' Iterate over common Chrome desktop user agents
	For Each UA In IronWebScraper.CommonUserAgents.ChromeDesktopUserAgents
		' Iterate over the proxies
		For Each proxy In proxies
			' Add a new HTTP identity with specific user agent and proxy
			Identities.Add(New HttpIdentity() With {
				.UserAgent = UA,
				.UseCookies = True,
				.Proxy = proxy
			})
		Next proxy
	Next UA

	' Make an initial request to the website with a parse method
	Me.Request("http://www.Website.com", Parse)
End Sub
$vbLabelText   $csharpLabel

您有多个属性可以提供不同的行为,从而防止网站阻止您。

这些属性包括:

  • NetworkDomain:用于用户身份验证的网络域。 支持 Windows、NTLM、Kerberos、Linux、BSD 和 Mac OS X 网络。 必须与NetworkUsernameNetworkPassword一起使用。
  • NetworkUsername:用于用户身份验证的网络/http 用户名。 支持 HTTP、Windows 网络、NTLM、Kerberos、Linux 网络、BSD 网络和 Mac OS。
  • NetworkPassword:用于用户身份验证的网络/http 密码。 支持 HTTP、Windows 网络、NTLM、Kerberos、Linux 网络、BSD 网络和 Mac OS。
  • Proxy:设置代理设置。
  • UserAgent:设置浏览器引擎(例如,Chrome 桌面版、Chrome 移动版、Chrome 平板版、IE 和 Firefox 等)。
  • HttpRequestHeaders:用于此身份的自定义标头值,接受字典对象Dictionary<string, string>
  • UseCookies:启用/禁用使用 cookie。

IronWebScraper 使用随机身份运行抓取器。 如果我们需要指定使用特定的身份解析页面,可以这样做:

public override void Init()
{
    // Set the license key for IronWebScraper
    License.LicenseKey = "LicenseKey";

    // Set the logging level to capture all logs
    this.LoggingLevel = WebScraper.LogLevel.All;

    // Assign the working directory for the output files
    this.WorkingDirectory = AppSetting.GetAppRoot() + @"\ShoppingSiteSample\Output\";

    // Create a new instance of HttpIdentity
    HttpIdentity identity = new HttpIdentity();

    // Set the network username and password for authentication
    identity.NetworkUsername = "username";
    identity.NetworkPassword = "pwd";

    // Add the identity to the collection of identities
    Identities.Add(identity);

    // Make a request to the website with the specified identity
    this.Request("http://www.Website.com", Parse, identity);
}
public override void Init()
{
    // Set the license key for IronWebScraper
    License.LicenseKey = "LicenseKey";

    // Set the logging level to capture all logs
    this.LoggingLevel = WebScraper.LogLevel.All;

    // Assign the working directory for the output files
    this.WorkingDirectory = AppSetting.GetAppRoot() + @"\ShoppingSiteSample\Output\";

    // Create a new instance of HttpIdentity
    HttpIdentity identity = new HttpIdentity();

    // Set the network username and password for authentication
    identity.NetworkUsername = "username";
    identity.NetworkPassword = "pwd";

    // Add the identity to the collection of identities
    Identities.Add(identity);

    // Make a request to the website with the specified identity
    this.Request("http://www.Website.com", Parse, identity);
}
Public Overrides Sub Init()
	' Set the license key for IronWebScraper
	License.LicenseKey = "LicenseKey"

	' Set the logging level to capture all logs
	Me.LoggingLevel = WebScraper.LogLevel.All

	' Assign the working directory for the output files
	Me.WorkingDirectory = AppSetting.GetAppRoot() & "\ShoppingSiteSample\Output\"

	' Create a new instance of HttpIdentity
	Dim identity As New HttpIdentity()

	' Set the network username and password for authentication
	identity.NetworkUsername = "username"
	identity.NetworkPassword = "pwd"

	' Add the identity to the collection of identities
	Identities.Add(identity)

	' Make a request to the website with the specified identity
	Me.Request("http://www.Website.com", Parse, identity)
End Sub
$vbLabelText   $csharpLabel

启用网页缓存功能

此功能用于缓存请求的页面。 它通常用于开发和测试阶段,允许开发人员缓存所需页面以便在更新代码后重用。 这使您可以在重新启动 Web 抓取器后在缓存页面上执行代码,而无需每次都连接到实时网站(操作回放)。

您可以在Init()方法中使用它:

// Enable web cache without an expiration time
EnableWebCache();

// OR enable web cache with a specified expiration time
EnableWebCache(new TimeSpan(1, 30, 30));
// Enable web cache without an expiration time
EnableWebCache();

// OR enable web cache with a specified expiration time
EnableWebCache(new TimeSpan(1, 30, 30));
' Enable web cache without an expiration time
EnableWebCache()

' OR enable web cache with a specified expiration time
EnableWebCache(New TimeSpan(1, 30, 30))
$vbLabelText   $csharpLabel

它会将您的缓存数据保存到工作目录文件夹下的 WebCache 文件夹中。

public override void Init()
{
    // Set the license key for IronWebScraper
    License.LicenseKey = "LicenseKey";

    // Set the logging level to capture all logs
    this.LoggingLevel = WebScraper.LogLevel.All;

    // Assign the working directory for the output files
    this.WorkingDirectory = AppSetting.GetAppRoot() + @"\ShoppingSiteSample\Output\";

    // Enable web cache with a specific expiration time of 1 hour, 30 minutes, and 30 seconds
    EnableWebCache(new TimeSpan(1, 30, 30));

    // Make an initial request to the website with a parse method
    this.Request("http://www.Website.com", Parse);
}
public override void Init()
{
    // Set the license key for IronWebScraper
    License.LicenseKey = "LicenseKey";

    // Set the logging level to capture all logs
    this.LoggingLevel = WebScraper.LogLevel.All;

    // Assign the working directory for the output files
    this.WorkingDirectory = AppSetting.GetAppRoot() + @"\ShoppingSiteSample\Output\";

    // Enable web cache with a specific expiration time of 1 hour, 30 minutes, and 30 seconds
    EnableWebCache(new TimeSpan(1, 30, 30));

    // Make an initial request to the website with a parse method
    this.Request("http://www.Website.com", Parse);
}
Public Overrides Sub Init()
	' Set the license key for IronWebScraper
	License.LicenseKey = "LicenseKey"

	' Set the logging level to capture all logs
	Me.LoggingLevel = WebScraper.LogLevel.All

	' Assign the working directory for the output files
	Me.WorkingDirectory = AppSetting.GetAppRoot() & "\ShoppingSiteSample\Output\"

	' Enable web cache with a specific expiration time of 1 hour, 30 minutes, and 30 seconds
	EnableWebCache(New TimeSpan(1, 30, 30))

	' Make an initial request to the website with a parse method
	Me.Request("http://www.Website.com", Parse)
End Sub
$vbLabelText   $csharpLabel

IronWebScraper 还具有功能,可以通过使用Start(CrawlID)设置引擎启动进程名称,使引擎在重新启动代码后继续抓取。

static void Main(string[] args)
{
    // Create an object from the Scraper class
    EngineScraper scrape = new EngineScraper();

    // Start the scraping process with the specified crawl ID
    scrape.Start("enginestate");
}
static void Main(string[] args)
{
    // Create an object from the Scraper class
    EngineScraper scrape = new EngineScraper();

    // Start the scraping process with the specified crawl ID
    scrape.Start("enginestate");
}
Shared Sub Main(ByVal args() As String)
	' Create an object from the Scraper class
	Dim scrape As New EngineScraper()

	' Start the scraping process with the specified crawl ID
	scrape.Start("enginestate")
End Sub
$vbLabelText   $csharpLabel

执行请求和响应将保存在工作目录内的 SavedState 文件夹中。

节流

我们可以控制每个域的最小和最大连接数和连接速度。

public override void Init()
{
    // Set the license key for IronWebScraper
    License.LicenseKey = "LicenseKey";

    // Set the logging level to capture all logs
    this.LoggingLevel = WebScraper.LogLevel.All;

    // Assign the working directory for the output files
    this.WorkingDirectory = AppSetting.GetAppRoot() + @"\ShoppingSiteSample\Output\";

    // Set the total number of allowed open HTTP requests (threads)
    this.MaxHttpConnectionLimit = 80;

    // Set minimum polite delay (pause) between requests to a given domain or IP address
    this.RateLimitPerHost = TimeSpan.FromMilliseconds(50);

    // Set the allowed number of concurrent HTTP requests (threads) per hostname or IP address
    this.OpenConnectionLimitPerHost = 25;

    // Do not obey the robots.txt files
    this.ObeyRobotsDotTxt = false;

    // Makes the WebScraper intelligently throttle requests not only by hostname, but also by host servers' IP addresses
    this.ThrottleMode = Throttle.ByDomainHostName;

    // Make an initial request to the website with a parse method
    this.Request("https://www.Website.com", Parse);
}
public override void Init()
{
    // Set the license key for IronWebScraper
    License.LicenseKey = "LicenseKey";

    // Set the logging level to capture all logs
    this.LoggingLevel = WebScraper.LogLevel.All;

    // Assign the working directory for the output files
    this.WorkingDirectory = AppSetting.GetAppRoot() + @"\ShoppingSiteSample\Output\";

    // Set the total number of allowed open HTTP requests (threads)
    this.MaxHttpConnectionLimit = 80;

    // Set minimum polite delay (pause) between requests to a given domain or IP address
    this.RateLimitPerHost = TimeSpan.FromMilliseconds(50);

    // Set the allowed number of concurrent HTTP requests (threads) per hostname or IP address
    this.OpenConnectionLimitPerHost = 25;

    // Do not obey the robots.txt files
    this.ObeyRobotsDotTxt = false;

    // Makes the WebScraper intelligently throttle requests not only by hostname, but also by host servers' IP addresses
    this.ThrottleMode = Throttle.ByDomainHostName;

    // Make an initial request to the website with a parse method
    this.Request("https://www.Website.com", Parse);
}
Public Overrides Sub Init()
	' Set the license key for IronWebScraper
	License.LicenseKey = "LicenseKey"

	' Set the logging level to capture all logs
	Me.LoggingLevel = WebScraper.LogLevel.All

	' Assign the working directory for the output files
	Me.WorkingDirectory = AppSetting.GetAppRoot() & "\ShoppingSiteSample\Output\"

	' Set the total number of allowed open HTTP requests (threads)
	Me.MaxHttpConnectionLimit = 80

	' Set minimum polite delay (pause) between requests to a given domain or IP address
	Me.RateLimitPerHost = TimeSpan.FromMilliseconds(50)

	' Set the allowed number of concurrent HTTP requests (threads) per hostname or IP address
	Me.OpenConnectionLimitPerHost = 25

	' Do not obey the robots.txt files
	Me.ObeyRobotsDotTxt = False

	' Makes the WebScraper intelligently throttle requests not only by hostname, but also by host servers' IP addresses
	Me.ThrottleMode = Throttle.ByDomainHostName

	' Make an initial request to the website with a parse method
	Me.Request("https://www.Website.com", Parse)
End Sub
$vbLabelText   $csharpLabel

节流 properties

  • MaxHttpConnectionLimit 允许打开的 HTTP 请求(线程)的总数
  • RateLimitPerHost 给定域或 IP 地址之间请求的最小礼貌延迟或暂停(毫秒)
  • OpenConnectionLimitPerHost 每个主机名允许的并发 HTTP 请求(线程)数
  • ThrottleMode 使 WebScraper 智能地限速请求,不仅按主机名,还按主机服务器的 IP 地址。 如果多个抓取的域托管在同一台机器上,这样做是礼貌的。

开始使用IronWebscraper

今天在您的项目中使用 IronWebScraper,免费试用。

第一步:
green arrow pointer

常见问题解答

如何在 C# 中认证需要登录的网站上的用户?

您可以利用 IronWebScraper 的 HttpIdentity 功能,通过设置 NetworkDomainNetworkUsernameNetworkPassword 来认证用户。

在开发过程中使用网页缓存有什么好处?

网页缓存功能允许您缓存请求的页面以供重复使用,帮助节省时间和资源,避免与实时网站重复连接,尤其在开发和测试阶段非常有用。

如何在网络抓取中管理多个登录会话?

IronWebScraper 支持使用数千个唯一的用户凭据和浏览器引擎来模拟多个登录会话,这有助于防止网站检测和阻止抓取器。

网络抓取中的高级节流选项有哪些?

IronWebScraper 提供 ThrottleMode 设置,可根据主机名和 IP 地址智能管理请求节流,确保与共享主机环境的礼貌交互。

如何在 IronWebScraper 中使用代理?

要使用代理,定义一个代理数组,并将它们与 IronWebScraper 中的 HttpIdentity 实例关联,从而允许通过不同的 IP 地址路由请求以实现匿名和访问控制。

IronWebScraper 如何处理请求延迟以防止服务器过载?

IronWebScraper 中的 RateLimitPerHost 设置指定了对特定域名或 IP 地址请求之间的最小延迟,帮助通过间隔请求来防止服务器过载。

发生中断后可以继续进行网络抓取吗?

可以,IronWebScraper 可以使用 Start(CrawlID) 方法在发生中断后继续抓取,它保存执行状态并从最后保存的点继续执行。

如何控制网络抓取器中的并发 HTTP 连接数?

在 IronWebScraper 中,您可以设置 MaxHttpConnectionLimit 属性来控制允许的开放 HTTP 请求的总数,帮助管理服务器负载和资源。

有哪些选项可用于日志记录网络抓取活动?

IronWebScraper 允许您使用 LoggingLevel 属性设置日志记录级别,从而在抓取操作期间启用全面的日志记录以进行详细分析和故障排除。

Curtis Chau
技术作家

Curtis Chau 拥有卡尔顿大学的计算机科学学士学位,专注于前端开发,精通 Node.js、TypeScript、JavaScript 和 React。他热衷于打造直观且美观的用户界面,喜欢使用现代框架并创建结构良好、视觉吸引力强的手册。

除了开发之外,Curtis 对物联网 (IoT) 有浓厚的兴趣,探索将硬件和软件集成的新方法。在空闲时间,他喜欢玩游戏和构建 Discord 机器人,将他对技术的热爱与创造力相结合。

准备开始了吗?
Nuget 下载 122,916 | 版本: 2025.11 刚刚发布