Advanced Webscraping Features

Darrius Serrant
January 25, 2023
Updated December 10, 2024

HttpIdentity Feature

Some website systems require the user to be logged in before content can be viewed; in that case we can use an HttpIdentity:

HttpIdentity id = new HttpIdentity();
id.NetworkUsername = "username";
id.NetworkPassword = "pwd";
Identities.Add(id); 
Dim id As New HttpIdentity()
id.NetworkUsername = "username"
id.NetworkPassword = "pwd"
Identities.Add(id)

One of the most impressive and powerful features in IronWebScraper is the ability to use thousands of unique identities (user credentials and/or browser engines) to spoof or scrape websites using multiple login sessions.

public override void Init()
{
    License.LicenseKey = " LicenseKey ";
    this.LoggingLevel = WebScraper.LogLevel.All;
    this.WorkingDirectory = AppSetting.GetAppRoot() + @"\ShoppingSiteSample\Output\";
    var proxies = "IP-Proxy1: 8080,IP-Proxy2: 8081".Split(',');
    foreach (var UA in IronWebScraper.CommonUserAgents.ChromeDesktopUserAgents)
    {
        foreach (var proxy in proxies)
        {
            Identities.Add(new HttpIdentity()
            {
                UserAgent = UA,
                UseCookies = true,
                Proxy = proxy
            });
        }
    }
    this.Request("http://www.Website.com", Parse);
}
Public Overrides Sub Init()
	License.LicenseKey = " LicenseKey "
	Me.LoggingLevel = WebScraper.LogLevel.All
	Me.WorkingDirectory = AppSetting.GetAppRoot() & "\ShoppingSiteSample\Output\"
	Dim proxies = "IP-Proxy1: 8080,IP-Proxy2: 8081".Split(","c)
	For Each UA In IronWebScraper.CommonUserAgents.ChromeDesktopUserAgents
		For Each proxy In proxies
			Identities.Add(New HttpIdentity() With {
				.UserAgent = UA,
				.UseCookies = True,
				.Proxy = proxy
			})
		Next proxy
	Next UA
	Me.Request("http://www.Website.com", Parse)
End Sub

You have several properties available that give each identity different behavior, which helps prevent websites from blocking you.

Some of these properties:

  • NetworkDomain: the network domain used for user authentication. Supports Windows, NTLM, Kerberos, Linux, BSD, and Mac OS X networks. Must be used together with NetworkUsername and NetworkPassword.
  • NetworkUsername: the network/HTTP username used for user authentication. Supports HTTP, Windows networks, NTLM, Kerberos, Linux networks, BSD networks, and Mac OS.
  • NetworkPassword: the network/HTTP password used for user authentication. Supports HTTP, Windows networks, NTLM, Kerberos, Linux networks, BSD networks, and Mac OS.
  • Proxy: sets the proxy settings.
  • UserAgent: sets the browser engine (Chrome desktop, Chrome mobile, Chrome tablet, IE, Firefox, and so on).
  • HttpRequestHeaders: custom header values used for this identity; accepts a dictionary object (Dictionary<string, string>), as shown in the sketch after this list.
  • UseCookies: enables/disables the use of cookies.
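
Below is a minimal sketch (not part of the original sample project) that combines several of these properties on a single HttpIdentity, including HttpRequestHeaders. The user-agent string, proxy address, and header names are placeholder values, and the snippet assumes the same WebScraper subclass used in the other examples, plus using System.Collections.Generic; for the dictionary type.

public override void Init()
{
    License.LicenseKey = " LicenseKey ";
    // One identity combining the properties listed above; all values below are
    // placeholders for illustration, not recommendations.
    HttpIdentity identity = new HttpIdentity();
    identity.UserAgent = "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"; // any browser user-agent string
    identity.Proxy = "IP-Proxy1:8080";                                // optional proxy endpoint
    identity.UseCookies = true;                                       // keep cookies for this identity
    // HttpRequestHeaders accepts a Dictionary<string, string> of custom header values
    identity.HttpRequestHeaders = new Dictionary<string, string>
    {
        { "Accept-Language", "en-US" },
        { "X-Example-Header", "value" } // hypothetical custom header
    };
    Identities.Add(identity);
    this.Request("http://www.Website.com", Parse, identity);
}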

IronWebScraper runs the scraper using random identities. If we need to specify that a particular identity is used to parse a page, we can do so as follows.

public override void Init()
{
    License.LicenseKey = " LicenseKey ";
    this.LoggingLevel = WebScraper.LogLevel.All;
    this.WorkingDirectory = AppSetting.GetAppRoot() + @"\ShoppingSiteSample\Output\";
    HttpIdentity identity = new HttpIdentity();
    identity.NetworkUsername = "username";
    identity.NetworkPassword = "pwd";
    Identities.Add(identity);
    this.Request("http://www.Website.com", Parse, identity);
}
Public Overrides Sub Init()
	License.LicenseKey = " LicenseKey "
	Me.LoggingLevel = WebScraper.LogLevel.All
	Me.WorkingDirectory = AppSetting.GetAppRoot() & "\ShoppingSiteSample\Output\"
	Dim identity As New HttpIdentity()
	identity.NetworkUsername = "username"
	identity.NetworkPassword = "pwd"
	Identities.Add(identity)
	Me.Request("http://www.Website.com", Parse, identity)
End Sub

Enable the Web Cache Feature

This feature is used to cache requested pages. It is often used during the development and testing phases; it enables developers to cache the pages they need so they can be reused after updating code. This lets you run your code against the cached pages after restarting your web scraper, without needing to connect to the live website every time (action replay).

You can use it in the Init() method:

EnableWebCache();

EnableWebCache(TimeSpan Expiry);

It saves the cached data to the WebCache folder under the working directory.

public override void Init()
{
    License.LicenseKey = " LicenseKey ";
    this.LoggingLevel = WebScraper.LogLevel.All;
    this.WorkingDirectory = AppSetting.GetAppRoot() + @"\ShoppingSiteSample\Output\";
    EnableWebCache(new TimeSpan(1,30,30));
    this.Request("http://www.WebSite.com", Parse);
}
Public Overrides Sub Init()
	License.LicenseKey = " LicenseKey "
	Me.LoggingLevel = WebScraper.LogLevel.All
	Me.WorkingDirectory = AppSetting.GetAppRoot() & "\ShoppingSiteSample\Output\"
	EnableWebCache(New TimeSpan(1,30,30))
	Me.Request("http://www.WebSite.com", Parse)
End Sub

IronWebScraper also has a feature that lets your engine continue scraping after you restart your code, by setting the engine start process name with Start(CrawlID).

static void Main(string [] args)
{
    // Create Object From Scraper class
    EngineScraper scrape = new EngineScraper();
    // Start Scraping
    scrape.Start("enginestate");
}
Shared Sub Main(ByVal args() As String)
	' Create Object From Scraper class
	Dim scrape As New EngineScraper()
	' Start Scraping
	scrape.Start("enginestate")
End Sub

The executed requests and responses will be saved in the SavedState folder inside the working directory.

Throttling

We can control the minimum and maximum number of connections, as well as the connection speed, per domain.

public override void Init()
{
    License.LicenseKey = "LicenseKey";
    this.LoggingLevel = WebScraper.LogLevel.All;
    this.WorkingDirectory = AppSetting.GetAppRoot() + @"\ShoppingSiteSample\Output\";
    // Gets or sets the total number of allowed open HTTP requests (threads)
    this.MaxHttpConnectionLimit = 80;
    // Gets or sets the minimum polite delay (pause) between requests to a given domain or IP address.
    this.RateLimitPerHost = TimeSpan.FromMilliseconds(50);            
    //     Gets or sets the allowed number of concurrent HTTP requests (threads) per hostname
    //     or IP address. This helps protect hosts against too many requests.
    this.OpenConnectionLimitPerHost = 25;
    this.ObeyRobotsDotTxt = false;
    //     Makes the WebScraper intelligently throttle requests not only by hostname, but
    //     also by host servers' IP addresses. This is polite in case multiple scraped domains
    //     are hosted on the same machine.
    this.ThrottleMode = Throttle.ByDomainHostName;
    this.Request("https://www.Website.com", Parse);
}
Public Overrides Sub Init()
	License.LicenseKey = "LicenseKey"
	Me.LoggingLevel = WebScraper.LogLevel.All
	Me.WorkingDirectory = AppSetting.GetAppRoot() & "\ShoppingSiteSample\Output\"
	' Gets or sets the total number of allowed open HTTP requests (threads)
	Me.MaxHttpConnectionLimit = 80
	' Gets or sets the minimum polite delay (pause) between requests to a given domain or IP address.
	Me.RateLimitPerHost = TimeSpan.FromMilliseconds(50)
	'     Gets or sets the allowed number of concurrent HTTP requests (threads) per hostname
	'     or IP address. This helps protect hosts against too many requests.
	Me.OpenConnectionLimitPerHost = 25
	Me.ObeyRobotsDotTxt = False
	'     Makes the WebScraper intelligently throttle requests not only by hostname, but
	'     also by host servers' IP addresses. This is polite in case multiple scraped domains
	'     are hosted on the same machine.
	Me.ThrottleMode = Throttle.ByDomainHostName
	Me.Request("https://www.Website.com", Parse)
End Sub

Throttling Properties

  • MaxHttpConnectionLimit: the total number of allowed open HTTP requests (threads).
  • RateLimitPerHost: the minimum polite delay or pause (in milliseconds) between requests to a given domain or IP address.
  • OpenConnectionLimitPerHost: the allowed number of concurrent HTTP requests (threads) per hostname or IP address.
  • ThrottleMode: makes the WebScraper intelligently throttle requests not only by hostname, but also by the host servers' IP addresses. This is polite in case multiple scraped domains are hosted on the same machine.


Get Started with IronWebScraper

Start using IronWebScraper in your project today and enjoy a free trial.


Darrius Serrant
Full Stack Software Engineer (WebOps)

Darrius Serrant holds a bachelor's degree in Computer Science from the University of Miami and currently works as a Full Stack WebOps Marketing Engineer at Iron Software. A love of coding from a young age made computing feel both mysterious and approachable to him, the perfect medium for creativity and problem solving.

At Iron Software, Darrius enjoys creating new things and simplifying complex concepts so they are easier to understand. As one of our resident developers, he also volunteers to teach students, passing his expertise on to the next generation.

For Darrius, his work is fulfilling because it carries value and makes a real impact.