Advanced Webscraping Features in C#
HttpIdentity Feature
Some website systems require the user to be logged in before the content can be viewed; in this case, we can use an HttpIdentity. Here is how to set one up:
// Create a new instance of HttpIdentity
HttpIdentity id = new HttpIdentity();
// Set the network username and password for authentication
id.NetworkUsername = "username";
id.NetworkPassword = "pwd";
// Add the identity to the collection of identities
Identities.Add(id);
One of the most impressive and powerful features of IronWebScraper is the ability to use thousands of unique user credentials and/or browser engines to spoof or scrape websites using multiple login sessions.
public override void Init()
{
    // Set the license key for IronWebScraper
    License.LicenseKey = "LicenseKey";
    // Set the logging level to capture all logs
    this.LoggingLevel = WebScraper.LogLevel.All;
    // Assign the working directory for the output files
    this.WorkingDirectory = AppSetting.GetAppRoot() + @"\ShoppingSiteSample\Output\";
    // Define an array of proxies
    var proxies = "IP-Proxy1:8080,IP-Proxy2:8081".Split(',');
    // Iterate over common Chrome desktop user agents
    foreach (var UA in IronWebScraper.CommonUserAgents.ChromeDesktopUserAgents)
    {
        // Iterate over the proxies
        foreach (var proxy in proxies)
        {
            // Add a new HTTP identity with a specific user agent and proxy
            Identities.Add(new HttpIdentity()
            {
                UserAgent = UA,
                UseCookies = true,
                Proxy = proxy
            });
        }
    }
    // Make an initial request to the website with a parse method
    this.Request("http://www.Website.com", Parse);
}
You have multiple properties that each provide different behavior, which helps prevent websites from blocking you.
These properties include:
NetworkDomain: The network domain to be used for user authentication. Supports Windows, NTLM, Kerberos, Linux, BSD, and Mac OS X networks. Must be used with NetworkUsername and NetworkPassword.
NetworkUsername: The network/HTTP username to be used for user authentication. Supports HTTP, Windows networks, NTLM, Kerberos, Linux networks, BSD networks, and Mac OS.
NetworkPassword: The network/HTTP password to be used for user authentication. Supports HTTP, Windows networks, NTLM, Kerberos, Linux networks, BSD networks, and Mac OS.
Proxy: Sets the proxy settings.
UserAgent: Sets the browser engine (e.g., Chrome desktop, Chrome mobile, Chrome tablet, IE, Firefox, etc.).
HttpRequestHeaders: Custom header values to be used with this identity; accepts a dictionary object, Dictionary<string, string>.
UseCookies: Enables/disables the use of cookies.
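The existing samples do not show HttpRequestHeaders in use, so here is a minimal sketch of combining several of the properties above on one identity. The proxy address and header values are illustrative placeholders, not values from the original article:

```csharp
// Illustrative sketch only: the proxy address and header values are placeholders.
HttpIdentity identity = new HttpIdentity();
// Pick a browser engine and enable cookie handling
identity.UserAgent = "Mozilla/5.0 (Windows NT 10.0; Win64; x64)";
identity.UseCookies = true;
// Route requests through a proxy (placeholder address)
identity.Proxy = "IP-Proxy1:8080";
// HttpRequestHeaders accepts a Dictionary<string, string> of custom header values
identity.HttpRequestHeaders = new Dictionary<string, string>
{
    { "Accept-Language", "en-US,en;q=0.9" },
    { "Referer", "http://www.Website.com" }
};
// Add the identity to the collection of identities
Identities.Add(identity);
```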
IronWebScraper runs the scraper using random identities. If we need to specify that a particular identity be used to parse a page, we can do so like this:
public override void Init()
{
    // Set the license key for IronWebScraper
    License.LicenseKey = "LicenseKey";
    // Set the logging level to capture all logs
    this.LoggingLevel = WebScraper.LogLevel.All;
    // Assign the working directory for the output files
    this.WorkingDirectory = AppSetting.GetAppRoot() + @"\ShoppingSiteSample\Output\";
    // Create a new instance of HttpIdentity
    HttpIdentity identity = new HttpIdentity();
    // Set the network username and password for authentication
    identity.NetworkUsername = "username";
    identity.NetworkPassword = "pwd";
    // Add the identity to the collection of identities
    Identities.Add(identity);
    // Make a request to the website with the specified identity
    this.Request("http://www.Website.com", Parse, identity);
}
Enable the Web Cache Feature
This feature is used to cache requested pages. It is often used in the development and testing phases, enabling developers to cache required pages for reuse after updating code. This lets you execute your code on cached pages after restarting your web scraper, without having to connect to the live website every time (action replay).
You can use it in the Init() method:
// Enable the web cache without an expiration time
EnableWebCache();
// OR enable the web cache with a specified expiration time
EnableWebCache(new TimeSpan(1, 30, 30));
It will save your cached data to the WebCache folder under the working directory folder.
public override void Init()
{
    // Set the license key for IronWebScraper
    License.LicenseKey = "LicenseKey";
    // Set the logging level to capture all logs
    this.LoggingLevel = WebScraper.LogLevel.All;
    // Assign the working directory for the output files
    this.WorkingDirectory = AppSetting.GetAppRoot() + @"\ShoppingSiteSample\Output\";
    // Enable the web cache with an expiration time of 1 hour, 30 minutes, and 30 seconds
    EnableWebCache(new TimeSpan(1, 30, 30));
    // Make an initial request to the website with a parse method
    this.Request("http://www.Website.com", Parse);
}
IronWebScraper also has features that let your engine continue scraping after you restart your code, by naming the engine's start process using Start(CrawlID).
static void Main(string[] args)
{
    // Create an object from the Scraper class
    EngineScraper scrape = new EngineScraper();
    // Start the scraping process with the specified crawl ID
    scrape.Start("enginestate");
}
The request and response for the execution will be saved in the SavedState folder inside the working directory.
Throttling
We can control the minimum and maximum connection numbers and the connection speed per domain.
public override void Init()
{
    // Set the license key for IronWebScraper
    License.LicenseKey = "LicenseKey";
    // Set the logging level to capture all logs
    this.LoggingLevel = WebScraper.LogLevel.All;
    // Assign the working directory for the output files
    this.WorkingDirectory = AppSetting.GetAppRoot() + @"\ShoppingSiteSample\Output\";
    // Set the total number of allowed open HTTP requests (threads)
    this.MaxHttpConnectionLimit = 80;
    // Set the minimum polite delay (pause) between requests to a given domain or IP address
    this.RateLimitPerHost = TimeSpan.FromMilliseconds(50);
    // Set the allowed number of concurrent HTTP requests (threads) per hostname or IP address
    this.OpenConnectionLimitPerHost = 25;
    // Do not obey the robots.txt files
    this.ObeyRobotsDotTxt = false;
    // Makes the WebScraper intelligently throttle requests not only by hostname, but also by the host servers' IP addresses
    this.ThrottleMode = Throttle.ByDomainHostName;
    // Make an initial request to the website with a parse method
    this.Request("https://www.Website.com", Parse);
}
Throttling Properties
MaxHttpConnectionLimit: The total number of allowed open HTTP requests (threads).
RateLimitPerHost: The minimum polite delay or pause (in milliseconds) between requests to a given domain or IP address.
OpenConnectionLimitPerHost: The allowed number of concurrent HTTP requests (threads) per hostname.
ThrottleMode: Makes the WebScraper intelligently throttle requests not only by hostname, but also by the host servers' IP addresses. This is polite in case multiple scraped domains are hosted on the same machine.
Get Started with IronWebScraper
Use IronWebScraper in your project today with a free trial.
Frequently Asked Questions
How do I authenticate a user on a website that requires a login in C#?
You can use IronWebScraper's HttpIdentity feature to authenticate a user by setting the NetworkDomain, NetworkUsername, and NetworkPassword properties.
What are the benefits of using the web cache during development?
The web cache feature lets you cache requested pages for reuse, saving time and resources by avoiding repeated connections to the live website. It is especially useful during the development and testing phases.
How do I manage multiple login sessions when web scraping?
IronWebScraper supports using thousands of unique user credentials and browser engines to simulate multiple login sessions, which helps prevent websites from detecting and blocking the scraper.
What advanced throttling options are available for web scraping?
IronWebScraper provides the ThrottleMode setting, which intelligently manages request throttling by both hostname and IP address, ensuring polite interaction with shared hosting environments.
How do I use proxies with IronWebScraper?
To use proxies, define an array of proxies and associate them with HttpIdentity instances in IronWebScraper, allowing requests to be routed through different IP addresses for anonymity and access control.
How does IronWebScraper handle request delays to avoid overloading servers?
The RateLimitPerHost setting in IronWebScraper specifies the minimum delay between requests to a given domain or IP address, helping to prevent server overload by spacing out requests.
Can web scraping continue after an interruption?
Yes, IronWebScraper can continue scraping after an interruption using the Start(CrawlID) method, which saves the execution state and resumes from the last saved point.
How do I control the number of concurrent HTTP connections in a web scraper?
In IronWebScraper, you can set the MaxHttpConnectionLimit property to control the total number of allowed open HTTP requests, helping to manage server load and resources.
What options are available for logging web scraping activity?
IronWebScraper lets you set the logging level using the LoggingLevel property, enabling comprehensive logging during scraping operations for detailed analysis and troubleshooting.