Advanced Webscraping Features
HttpIdentity Feature
Some website systems require the user to be logged in before the content can be viewed; in that case, we can use an HttpIdentity:
HttpIdentity id = new HttpIdentity();
id.NetworkUsername = "username";
id.NetworkPassword = "pwd";
Identities.Add(id);
Dim id As New HttpIdentity()
id.NetworkUsername = "username"
id.NetworkPassword = "pwd"
Identities.Add(id)
One of the most impressive and powerful features of IronWebScraper is the ability to use thousands of unique identities (user credentials and/or browser engines) to spoof or scrape websites using multiple login sessions.
public override void Init()
{
    License.LicenseKey = "LicenseKey";
    this.LoggingLevel = WebScraper.LogLevel.All;
    this.WorkingDirectory = AppSetting.GetAppRoot() + @"\ShoppingSiteSample\Output\";
    var proxies = "IP-Proxy1: 8080,IP-Proxy2: 8081".Split(',');
    foreach (var UA in IronWebScraper.CommonUserAgents.ChromeDesktopUserAgents)
    {
        foreach (var proxy in proxies)
        {
            Identities.Add(new HttpIdentity()
            {
                UserAgent = UA,
                UseCookies = true,
                Proxy = proxy
            });
        }
    }
    this.Request("http://www.Website.com", Parse);
}
Public Overrides Sub Init()
    License.LicenseKey = "LicenseKey"
    Me.LoggingLevel = WebScraper.LogLevel.All
    Me.WorkingDirectory = AppSetting.GetAppRoot() & "\ShoppingSiteSample\Output\"
    Dim proxies = "IP-Proxy1: 8080,IP-Proxy2: 8081".Split(","c)
    For Each UA In IronWebScraper.CommonUserAgents.ChromeDesktopUserAgents
        For Each proxy In proxies
            Identities.Add(New HttpIdentity() With {
                .UserAgent = UA,
                .UseCookies = True,
                .Proxy = proxy
            })
        Next proxy
    Next UA
    Me.Request("http://www.Website.com", Parse)
End Sub
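The Init() examples on this page pass a Parse method to Request() without showing it. A minimal sketch of such a handler, assuming the usual IronWebScraper pattern (the CSS selector and the "Title" field name are placeholders, not taken from this article), might look like this:

public override void Parse(Response response)
{
    // Select elements with a placeholder CSS selector and store one field per match
    foreach (var titleLink in response.Css("h2.entry-title a"))
    {
        Scrape(new ScrapedData() { { "Title", titleLink.TextContentClean } });
    }
}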
You can set multiple properties on an identity to give it different behavior, which helps prevent websites from blocking you.
Some of these properties are listed below, followed by a short sketch combining several of them:
- NetworkDomain: the network domain to be used for user authentication. Supports Windows, NTLM, Kerberos, Linux, BSD, and Mac OS X networks. Must be used with NetworkUsername and NetworkPassword.
- NetworkUsername: the network/HTTP username to be used for user authentication. Supports HTTP, Windows networks, NTLM, Kerberos, Linux networks, BSD networks, and Mac OS.
- NetworkPassword: the network/HTTP password to be used for user authentication. Supports HTTP, Windows networks, NTLM, Kerberos, Linux networks, BSD networks, and Mac OS.
- Proxy: sets the proxy settings.
- UserAgent: sets the browser engine (Chrome desktop, Chrome mobile, Chrome tablet, IE, Firefox, etc.).
- HttpRequestHeaders: custom header values to be used with this identity; accepts a dictionary object (Dictionary<string, string>).
- UseCookies: enables/disables the use of cookies.
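As a rough sketch, an identity combining several of these properties could be built inside Init() like this. The domain, proxy address, user-agent string, and header are placeholder values, and assigning a new Dictionary to HttpRequestHeaders is an assumption based on the description above:

HttpIdentity identity = new HttpIdentity();
identity.NetworkDomain = "MYDOMAIN";           // placeholder; used together with username/password
identity.NetworkUsername = "username";
identity.NetworkPassword = "pwd";
identity.Proxy = "IP-Proxy1: 8080";            // placeholder proxy address
identity.UserAgent = "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"; // placeholder user-agent string
identity.UseCookies = true;                    // keep cookies for this identity
// Assumption: HttpRequestHeaders can be assigned a Dictionary<string, string>
identity.HttpRequestHeaders = new Dictionary<string, string>()
{
    { "Accept-Language", "en-US" }             // placeholder custom header
};
Identities.Add(identity);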
IronWebScraper runs the scraper using random identities. If you need to specify that a particular identity is used to parse a page, you can do so:
public override void Init()
{
    License.LicenseKey = "LicenseKey";
    this.LoggingLevel = WebScraper.LogLevel.All;
    this.WorkingDirectory = AppSetting.GetAppRoot() + @"\ShoppingSiteSample\Output\";
    HttpIdentity identity = new HttpIdentity();
    identity.NetworkUsername = "username";
    identity.NetworkPassword = "pwd";
    Identities.Add(identity);
    this.Request("http://www.Website.com", Parse, identity);
}
Public Overrides Sub Init()
    License.LicenseKey = "LicenseKey"
    Me.LoggingLevel = WebScraper.LogLevel.All
    Me.WorkingDirectory = AppSetting.GetAppRoot() & "\ShoppingSiteSample\Output\"
    Dim identity As New HttpIdentity()
    identity.NetworkUsername = "username"
    identity.NetworkPassword = "pwd"
    Identities.Add(identity)
    Me.Request("http://www.Website.com", Parse, identity)
End Sub
Enable the Web Cache Feature
This feature is used to cache requested pages. It is often used in the development and testing phases, enabling developers to cache required pages for reuse after updating code. This lets you execute your code against cached pages after restarting your web scraper, without connecting to the live website every time (action replay).
You can use it in the Init() method:
EnableWebCache();
or
EnableWebCache(expiry);
It will save your cached data to the WebCache folder under the working directory.
public override void Init()
{
    License.LicenseKey = "LicenseKey";
    this.LoggingLevel = WebScraper.LogLevel.All;
    this.WorkingDirectory = AppSetting.GetAppRoot() + @"\ShoppingSiteSample\Output\";
    // Cache requested pages with an expiry of 1 hour, 30 minutes and 30 seconds
    EnableWebCache(new TimeSpan(1, 30, 30));
    this.Request("http://www.WebSite.com", Parse);
}
Public Overrides Sub Init()
    License.LicenseKey = "LicenseKey"
    Me.LoggingLevel = WebScraper.LogLevel.All
    Me.WorkingDirectory = AppSetting.GetAppRoot() & "\ShoppingSiteSample\Output\"
    ' Cache requested pages with an expiry of 1 hour, 30 minutes and 30 seconds
    EnableWebCache(New TimeSpan(1, 30, 30))
    Me.Request("http://www.WebSite.com", Parse)
End Sub
IronWebScraper also includes features that let the engine continue scraping after your code is restarted, by setting the engine start process name using Start(CrawlID).
static void Main(string[] args)
{
    // Create an object from the scraper class
    EngineScraper scrape = new EngineScraper();
    // Start scraping
    scrape.Start("enginestate");
}
Shared Sub Main(ByVal args() As String)
    ' Create an object from the scraper class
    Dim scrape As New EngineScraper()
    ' Start scraping
    scrape.Start("enginestate")
End Sub
The executed requests and responses will be saved in the SavedState folder inside the working directory.
Throttling
We can control the minimum and maximum number of connections, and the connection speed, per domain.
public override void Init()
{
    License.LicenseKey = "LicenseKey";
    this.LoggingLevel = WebScraper.LogLevel.All;
    this.WorkingDirectory = AppSetting.GetAppRoot() + @"\ShoppingSiteSample\Output\";
    // Gets or sets the total number of allowed open HTTP requests (threads)
    this.MaxHttpConnectionLimit = 80;
    // Gets or sets the minimum polite delay (pause) between requests to a given domain or IP address
    this.RateLimitPerHost = TimeSpan.FromMilliseconds(50);
    // Gets or sets the allowed number of concurrent HTTP requests (threads) per hostname
    // or IP address. This helps protect hosts against too many requests.
    this.OpenConnectionLimitPerHost = 25;
    this.ObeyRobotsDotTxt = false;
    // Makes the WebScraper intelligently throttle requests not only by hostname, but
    // also by host servers' IP addresses. This is polite in case multiple scraped domains
    // are hosted on the same machine.
    this.ThrottleMode = Throttle.ByDomainHostName;
    this.Request("https://www.Website.com", Parse);
}
Public Overrides Sub Init()
    License.LicenseKey = "LicenseKey"
    Me.LoggingLevel = WebScraper.LogLevel.All
    Me.WorkingDirectory = AppSetting.GetAppRoot() & "\ShoppingSiteSample\Output\"
    ' Gets or sets the total number of allowed open HTTP requests (threads)
    Me.MaxHttpConnectionLimit = 80
    ' Gets or sets the minimum polite delay (pause) between requests to a given domain or IP address
    Me.RateLimitPerHost = TimeSpan.FromMilliseconds(50)
    ' Gets or sets the allowed number of concurrent HTTP requests (threads) per hostname
    ' or IP address. This helps protect hosts against too many requests.
    Me.OpenConnectionLimitPerHost = 25
    Me.ObeyRobotsDotTxt = False
    ' Makes the WebScraper intelligently throttle requests not only by hostname, but
    ' also by host servers' IP addresses. This is polite in case multiple scraped domains
    ' are hosted on the same machine.
    Me.ThrottleMode = Throttle.ByDomainHostName
    Me.Request("https://www.Website.com", Parse)
End Sub
Throttling Properties
- MaxHttpConnectionLimit: the total number of allowed open HTTP requests (threads).
- RateLimitPerHost: the minimum polite delay or pause (in milliseconds) between requests to a given domain or IP address.
- OpenConnectionLimitPerHost: the allowed number of concurrent HTTP requests (threads) per hostname or IP address.
- ThrottleMode: makes the WebScraper intelligently throttle requests not only by hostname, but also by host servers' IP addresses. This is polite in case multiple scraped domains are hosted on the same machine.