高级网页抓取功能

This article was translated from English: Does it need improvement?
Translated
View the article in English

适用于的C# NuGet库

安装使用 NuGet

Install-Package IronWebScraper
Java PDF JAR

下载 DLL

下载DLL

手动安装到你的项目中

适用于的C# NuGet库

安装使用 NuGet

Install-Package IronWebScraper
Java PDF JAR

下载 DLL

下载DLL

手动安装到你的项目中

开始在您的项目中使用IronPDF,并立即获取免费试用。

第一步:
green arrow pointer

查看 IronWebScraperNuget 用于快速安装和部署。它有超过800万次下载,正在使用C#改变。

适用于的C# NuGet库 nuget.org/packages/IronWebScraper/
Install-Package IronWebScraper

考虑安装 IronWebScraper DLL 直接。下载并手动安装到您的项目或GAC表单中: webscraperIronWebScraper.zip

手动安装到你的项目中

下载DLL

HttpIdentity 特性

有些网站系统要求用户登录后才能查看内容;在这种情况下,我们可以使用 HttpIdentity:-

HttpIdentity id = new HttpIdentity();
id.NetworkUsername = "username";
id.NetworkPassword = "pwd";
Identities.Add(id); 
HttpIdentity id = new HttpIdentity();
id.NetworkUsername = "username";
id.NetworkPassword = "pwd";
Identities.Add(id); 
Dim id As New HttpIdentity()
id.NetworkUsername = "username"
id.NetworkPassword = "pwd"
Identities.Add(id)
VB   C#

IronWebScraper 最令人印象深刻、功能最强大的功能之一,就是可以使用数千种独特的网络技术。 (用户凭据和/或浏览器引擎) 使用多重登录会话欺骗或搜索网站。

public override void Init()
{
    License.LicenseKey = " LicenseKey ";
    this.LoggingLevel = WebScraper.LogLevel.All;
    this.WorkingDirectory = AppSetting.GetAppRoot() + @"\ShoppingSiteSample\Output\";
    var proxies = "IP-Proxy1: 8080,IP-Proxy2: 8081".Split(',');
    foreach (var UA in IronWebScraper.CommonUserAgents.ChromeDesktopUserAgents)
    {
        foreach (var proxy in proxies)
        {
            Identities.Add(new HttpIdentity()
            {
                UserAgent = UA,
                UseCookies = true,
                Proxy = proxy
            });
        }
    }
    this.Request("http://www.Website.com", Parse);
}
public override void Init()
{
    License.LicenseKey = " LicenseKey ";
    this.LoggingLevel = WebScraper.LogLevel.All;
    this.WorkingDirectory = AppSetting.GetAppRoot() + @"\ShoppingSiteSample\Output\";
    var proxies = "IP-Proxy1: 8080,IP-Proxy2: 8081".Split(',');
    foreach (var UA in IronWebScraper.CommonUserAgents.ChromeDesktopUserAgents)
    {
        foreach (var proxy in proxies)
        {
            Identities.Add(new HttpIdentity()
            {
                UserAgent = UA,
                UseCookies = true,
                Proxy = proxy
            });
        }
    }
    this.Request("http://www.Website.com", Parse);
}
Public Overrides Sub Init()
	License.LicenseKey = " LicenseKey "
	Me.LoggingLevel = WebScraper.LogLevel.All
	Me.WorkingDirectory = AppSetting.GetAppRoot() & "\ShoppingSiteSample\Output\"
	Dim proxies = "IP-Proxy1: 8080,IP-Proxy2: 8081".Split(","c)
	For Each UA In IronWebScraper.CommonUserAgents.ChromeDesktopUserAgents
		For Each proxy In proxies
			Identities.Add(New HttpIdentity() With {
				.UserAgent = UA,
				.UseCookies = True,
				.Proxy = proxy
			})
		Next proxy
	Next UA
	Me.Request("http://www.Website.com", Parse)
End Sub
VB   C#

您有多种属性,可以为您提供不同的行为,从而防止网站屏蔽您。

其中一些属性是 -

  • 网络域 :用于用户身份验证的网络域。支持 Windows、NTLM、Keroberos、Linux、BSD 和 Mac OS X 网络。必须与 (网络用户名和网络密码)
  • 网络用户名 : 用于用户身份验证的网络/http 用户名。支持 Http、Windows 网络、NTLM、Kerberos、Linux 网络、BSD 网络和 Mac OS。
  • 网络密码 : 用于用户身份验证的网络/http 密码。支持 Http、Windows 网络、NTLM、Keroberos、Linux 网络、BSD 网络和 Mac OS。
  • 代理:设置代理设置
  • UserAgent :设置浏览器引擎 (chrome桌面、chrome手机、chrome平板电脑、IE和火狐等。)
  • HttpRequestHeaders:用于自定义将与此身份一起使用的头信息值,它接受字典对象(字典 <字符串, string>)

  • UseCookies : 启用/禁用 cookie

IronWebScraper 使用随机身份运行刮板。 如果我们需要指定使用特定身份来解析页面,我们可以这样做。

public override void Init()
{
    License.LicenseKey = " LicenseKey ";
    this.LoggingLevel = WebScraper.LogLevel.All;
    this.WorkingDirectory = AppSetting.GetAppRoot() + @"\ShoppingSiteSample\Output\";
    HttpIdentity identity = new HttpIdentity();
    identity.NetworkUsername = "username";
    identity.NetworkPassword = "pwd";
    Identities.Add(id);
    this.Request("http://www.Website.com", Parse, identity);
}
public override void Init()
{
    License.LicenseKey = " LicenseKey ";
    this.LoggingLevel = WebScraper.LogLevel.All;
    this.WorkingDirectory = AppSetting.GetAppRoot() + @"\ShoppingSiteSample\Output\";
    HttpIdentity identity = new HttpIdentity();
    identity.NetworkUsername = "username";
    identity.NetworkPassword = "pwd";
    Identities.Add(id);
    this.Request("http://www.Website.com", Parse, identity);
}
Public Overrides Sub Init()
	License.LicenseKey = " LicenseKey "
	Me.LoggingLevel = WebScraper.LogLevel.All
	Me.WorkingDirectory = AppSetting.GetAppRoot() & "\ShoppingSiteSample\Output\"
	Dim identity As New HttpIdentity()
	identity.NetworkUsername = "username"
	identity.NetworkPassword = "pwd"
	Identities.Add(id)
	Me.Request("http://www.Website.com", Parse, identity)
End Sub
VB   C#

启用网络缓存功能

该功能用于缓存请求的页面。它通常用于开发和测试阶段;使开发人员能够缓存所需的页面,以便在更新代码后重复使用。这样,您就可以在重新启动 Web scraper 后在缓存页面上执行代码,而无需每次都连接到实时网站。 (动作重放).

您可以在 Init() 方法

启用网络缓存();`

启用网络缓存(有效期);`

它将把缓存数据保存到工作目录文件夹下的 WebCache 文件夹中

public override void Init()
{
    License.LicenseKey = " LicenseKey ";
    this.LoggingLevel = WebScraper.LogLevel.All;
    this.WorkingDirectory = AppSetting.GetAppRoot() + @"\ShoppingSiteSample\Output\";
    EnableWebCache(new TimeSpan(1,30,30));
    this.Request("http://www.WebSite.com", Parse);
}
public override void Init()
{
    License.LicenseKey = " LicenseKey ";
    this.LoggingLevel = WebScraper.LogLevel.All;
    this.WorkingDirectory = AppSetting.GetAppRoot() + @"\ShoppingSiteSample\Output\";
    EnableWebCache(new TimeSpan(1,30,30));
    this.Request("http://www.WebSite.com", Parse);
}
Public Overrides Sub Init()
	License.LicenseKey = " LicenseKey "
	Me.LoggingLevel = WebScraper.LogLevel.All
	Me.WorkingDirectory = AppSetting.GetAppRoot() & "\ShoppingSiteSample\Output\"
	EnableWebCache(New TimeSpan(1,30,30))
	Me.Request("http://www.WebSite.com", Parse)
End Sub
VB   C#

IronWebScraper 还具有一些功能,可通过使用 "开始 "设置引擎启动进程名称,使引擎在重启代码后继续进行刮擦。(CrawlID)

static void Main(string [] args)
{
    // Create Object From Scraper class
    EngineScraper scrape = new EngineScraper();
    // Start Scraping
    scrape.Start("enginestate");
}
static void Main(string [] args)
{
    // Create Object From Scraper class
    EngineScraper scrape = new EngineScraper();
    // Start Scraping
    scrape.Start("enginestate");
}
Shared Sub Main(ByVal args() As String)
	' Create Object From Scraper class
	Dim scrape As New EngineScraper()
	' Start Scraping
	scrape.Start("enginestate")
End Sub
VB   C#

执行请求和响应将保存在工作目录内的 SavedState 文件夹中。

Throttling

我们可以控制每个域的最小和最大连接数以及连接速度。

public override void Init()
{
    License.LicenseKey = "LicenseKey";
    this.LoggingLevel = WebScraper.LogLevel.All;
    this.WorkingDirectory = AppSetting.GetAppRoot() + @"\ShoppingSiteSample\Output\";
    // Gets or sets the total number of allowed open HTTP requests (threads)
    this.MaxHttpConnectionLimit = 80;
    // Gets or sets minimum polite delay (pause)between request to a given domain or IP address.
    this.RateLimitPerHost = TimeSpan.FromMilliseconds(50);            
    //     Gets or sets the allowed number of concurrent HTTP requests (threads) per hostname
    //     or IP address. This helps protect hosts against too many requests.
    this.OpenConnectionLimitPerHost = 25;
    this.ObeyRobotsDotTxt = false;
    //     Makes the WebSraper intelligently throttle requests not only by hostname, but
    //     also by host servers' IP addresses. This is polite in-case multiple scraped domains
    //     are hosted on the same machine.
    this.ThrottleMode = Throttle.ByDomainHostName;
    this.Request("https://www.Website.com", Parse);
}
public override void Init()
{
    License.LicenseKey = "LicenseKey";
    this.LoggingLevel = WebScraper.LogLevel.All;
    this.WorkingDirectory = AppSetting.GetAppRoot() + @"\ShoppingSiteSample\Output\";
    // Gets or sets the total number of allowed open HTTP requests (threads)
    this.MaxHttpConnectionLimit = 80;
    // Gets or sets minimum polite delay (pause)between request to a given domain or IP address.
    this.RateLimitPerHost = TimeSpan.FromMilliseconds(50);            
    //     Gets or sets the allowed number of concurrent HTTP requests (threads) per hostname
    //     or IP address. This helps protect hosts against too many requests.
    this.OpenConnectionLimitPerHost = 25;
    this.ObeyRobotsDotTxt = false;
    //     Makes the WebSraper intelligently throttle requests not only by hostname, but
    //     also by host servers' IP addresses. This is polite in-case multiple scraped domains
    //     are hosted on the same machine.
    this.ThrottleMode = Throttle.ByDomainHostName;
    this.Request("https://www.Website.com", Parse);
}
Public Overrides Sub Init()
	License.LicenseKey = "LicenseKey"
	Me.LoggingLevel = WebScraper.LogLevel.All
	Me.WorkingDirectory = AppSetting.GetAppRoot() & "\ShoppingSiteSample\Output\"
	' Gets or sets the total number of allowed open HTTP requests (threads)
	Me.MaxHttpConnectionLimit = 80
	' Gets or sets minimum polite delay (pause)between request to a given domain or IP address.
	Me.RateLimitPerHost = TimeSpan.FromMilliseconds(50)
	'     Gets or sets the allowed number of concurrent HTTP requests (threads) per hostname
	'     or IP address. This helps protect hosts against too many requests.
	Me.OpenConnectionLimitPerHost = 25
	Me.ObeyRobotsDotTxt = False
	'     Makes the WebSraper intelligently throttle requests not only by hostname, but
	'     also by host servers' IP addresses. This is polite in-case multiple scraped domains
	'     are hosted on the same machine.
	Me.ThrottleMode = Throttle.ByDomainHostName
	Me.Request("https://www.Website.com", Parse)
End Sub
VB   C#

节流特性

  • 最大连接限制


    允许打开的 HTTP 请求总数 (线程)

  • 每个主机的费率限制


    最小礼貌延迟或暂停 (毫秒) 之间的请求

  • 每主机打开连接限制


    允许的并发 HTTP 请求数 (线程)

  • 节流阀模式


    使 WebSraper 不仅能根据主机名,还能根据主机服务器的 IP 地址对请求进行智能节流。这在同一台机器上托管多个刮擦域名的情况下非常有用

.NET软件工程师 对许多人来说,这是从.NET生成PDF文件的最有效方式,因为无需学习额外的API或复杂的设计系统。

艾哈迈德·阿布马格德

跨国IT公司的.NET软件解决方案架构师

艾哈迈德是一位经验丰富且经认证的微软技术专家,拥有超过10年的IT和软件开发经验。他曾在多家公司工作,现在是一家跨国IT公司的国家经理。

艾哈迈德已经使用IronPDF和IronWebScraper超过一年了,并在他公司的多个项目中使用它们。