高級網頁抓取功能
HttpIdentity 功能
一些網站系統要求使用者登入後才能查看內容; 在這種情況下,我們可以使用 HttpIdentity: -
HttpIdentity id = new HttpIdentity();
id.NetworkUsername = "username";
id.NetworkPassword = "pwd";
Identities.Add(id);
HttpIdentity id = new HttpIdentity();
id.NetworkUsername = "username";
id.NetworkPassword = "pwd";
Identities.Add(id);
Dim id As New HttpIdentity()
id.NetworkUsername = "username"
id.NetworkPassword = "pwd"
Identities.Add(id)
在 IronWebScraper 中最令人印象深刻和强大的功能之一,是能够使用数千种独特的(用戶的憑證和/或瀏覽器引擎)利用多重登入會話來欺騙或抓取網站。
public override void Init()
{
License.LicenseKey = " LicenseKey ";
this.LoggingLevel = WebScraper.LogLevel.All;
this.WorkingDirectory = AppSetting.GetAppRoot() + @"\ShoppingSiteSample\Output\";
var proxies = "IP-Proxy1: 8080,IP-Proxy2: 8081".Split(',');
foreach (var UA in IronWebScraper.CommonUserAgents.ChromeDesktopUserAgents)
{
foreach (var proxy in proxies)
{
Identities.Add(new HttpIdentity()
{
UserAgent = UA,
UseCookies = true,
Proxy = proxy
});
}
}
this.Request("http://www.Website.com", Parse);
}
public override void Init()
{
License.LicenseKey = " LicenseKey ";
this.LoggingLevel = WebScraper.LogLevel.All;
this.WorkingDirectory = AppSetting.GetAppRoot() + @"\ShoppingSiteSample\Output\";
var proxies = "IP-Proxy1: 8080,IP-Proxy2: 8081".Split(',');
foreach (var UA in IronWebScraper.CommonUserAgents.ChromeDesktopUserAgents)
{
foreach (var proxy in proxies)
{
Identities.Add(new HttpIdentity()
{
UserAgent = UA,
UseCookies = true,
Proxy = proxy
});
}
}
this.Request("http://www.Website.com", Parse);
}
Public Overrides Sub Init()
License.LicenseKey = " LicenseKey "
Me.LoggingLevel = WebScraper.LogLevel.All
Me.WorkingDirectory = AppSetting.GetAppRoot() & "\ShoppingSiteSample\Output\"
Dim proxies = "IP-Proxy1: 8080,IP-Proxy2: 8081".Split(","c)
For Each UA In IronWebScraper.CommonUserAgents.ChromeDesktopUserAgents
For Each proxy In proxies
Identities.Add(New HttpIdentity() With {
.UserAgent = UA,
.UseCookies = True,
.Proxy = proxy
})
Next proxy
Next UA
Me.Request("http://www.Website.com", Parse)
End Sub
您有多個屬性可以提供不同的行為,因此可以防止網站封鎖您。
以下是其中一些屬性: -
- NetworkDomain:用於用戶認證的網路域。 支持Windows、NTLM、Kerberos、Linux、BSD和Mac OS X網絡。 必須與此同時使用(網絡用戶名和網絡密碼)
- NetworkUsername:用於使用者身份驗證的網路/HTTP 使用者名稱。 支持 Http、Windows 網路、NTLM、Kerberos、Linux 網路、BSD 網路和 Mac OS。
- NetworkPassword:用於用戶認證的網絡/HTTP 密碼。 支持Http、Windows網絡、NTLM、Keroberos、Linux網絡、BSD網絡和Mac OS。
- 代理:設定代理伺服器設置
- UserAgent:設置瀏覽器引擎(Chrome 桌面版、Chrome Mobile、Chrome 平板版、IE 和 Firefox 等。)
- HttpRequestHeaders:用於將與此身份一起使用的自定義標頭值,並接受字典物件。(字典<string,string>)
UseCookies:啟用/禁用使用 cookies
IronWebScraper 使用隨機身份運行抓取器。 如果我們需要指定使用特定的身份來解析頁面,我們可以這樣做。
public override void Init()
{
License.LicenseKey = " LicenseKey ";
this.LoggingLevel = WebScraper.LogLevel.All;
this.WorkingDirectory = AppSetting.GetAppRoot() + @"\ShoppingSiteSample\Output\";
HttpIdentity identity = new HttpIdentity();
identity.NetworkUsername = "username";
identity.NetworkPassword = "pwd";
Identities.Add(id);
this.Request("http://www.Website.com", Parse, identity);
}
public override void Init()
{
License.LicenseKey = " LicenseKey ";
this.LoggingLevel = WebScraper.LogLevel.All;
this.WorkingDirectory = AppSetting.GetAppRoot() + @"\ShoppingSiteSample\Output\";
HttpIdentity identity = new HttpIdentity();
identity.NetworkUsername = "username";
identity.NetworkPassword = "pwd";
Identities.Add(id);
this.Request("http://www.Website.com", Parse, identity);
}
Public Overrides Sub Init()
License.LicenseKey = " LicenseKey "
Me.LoggingLevel = WebScraper.LogLevel.All
Me.WorkingDirectory = AppSetting.GetAppRoot() & "\ShoppingSiteSample\Output\"
Dim identity As New HttpIdentity()
identity.NetworkUsername = "username"
identity.NetworkPassword = "pwd"
Identities.Add(id)
Me.Request("http://www.Website.com", Parse, identity)
End Sub
啟用網頁快取功能
此功能用於緩存請求的頁面。 它經常在開發和測試階段使用; 允許開發者在更新代碼後緩存所需的頁面以供重用。 這使您能夠在重新啟動您的網絡抓取工具後,在已緩存的頁面上執行代碼,而無需每次都連接到實時網站。(即時回放).
您可以在 Init 中使用它。()方法
啟用網頁快取();`
或
啟用網頁快取(時限到期);`
它將您的快取資料儲存到工作目錄資料夾下的WebCache資料夾中。
public override void Init()
{
License.LicenseKey = " LicenseKey ";
this.LoggingLevel = WebScraper.LogLevel.All;
this.WorkingDirectory = AppSetting.GetAppRoot() + @"\ShoppingSiteSample\Output\";
EnableWebCache(new TimeSpan(1,30,30));
this.Request("http://www.WebSite.com", Parse);
}
public override void Init()
{
License.LicenseKey = " LicenseKey ";
this.LoggingLevel = WebScraper.LogLevel.All;
this.WorkingDirectory = AppSetting.GetAppRoot() + @"\ShoppingSiteSample\Output\";
EnableWebCache(new TimeSpan(1,30,30));
this.Request("http://www.WebSite.com", Parse);
}
Public Overrides Sub Init()
License.LicenseKey = " LicenseKey "
Me.LoggingLevel = WebScraper.LogLevel.All
Me.WorkingDirectory = AppSetting.GetAppRoot() & "\ShoppingSiteSample\Output\"
EnableWebCache(New TimeSpan(1,30,30))
Me.Request("http://www.WebSite.com", Parse)
End Sub
IronWebScraper 也具有在重新啟動代碼後通過使用 Start 設置引擎啟動過程名稱來使您的引擎繼續抓取的功能。(爬行ID)
static void Main(string [] args)
{
// Create Object From Scraper class
EngineScraper scrape = new EngineScraper();
// Start Scraping
scrape.Start("enginestate");
}
static void Main(string [] args)
{
// Create Object From Scraper class
EngineScraper scrape = new EngineScraper();
// Start Scraping
scrape.Start("enginestate");
}
Shared Sub Main(ByVal args() As String)
' Create Object From Scraper class
Dim scrape As New EngineScraper()
' Start Scraping
scrape.Start("enginestate")
End Sub
執行請求和回應將被保存到工作目錄內的 SavedState 資料夾中。
限流
我們可以控制每個域的最小和最大連接數以及連接速度。
public override void Init()
{
License.LicenseKey = "LicenseKey";
this.LoggingLevel = WebScraper.LogLevel.All;
this.WorkingDirectory = AppSetting.GetAppRoot() + @"\ShoppingSiteSample\Output\";
// Gets or sets the total number of allowed open HTTP requests (threads)
this.MaxHttpConnectionLimit = 80;
// Gets or sets minimum polite delay (pause)between request to a given domain or IP address.
this.RateLimitPerHost = TimeSpan.FromMilliseconds(50);
// Gets or sets the allowed number of concurrent HTTP requests (threads) per hostname
// or IP address. This helps protect hosts against too many requests.
this.OpenConnectionLimitPerHost = 25;
this.ObeyRobotsDotTxt = false;
// Makes the WebSraper intelligently throttle requests not only by hostname, but
// also by host servers' IP addresses. This is polite in-case multiple scraped domains
// are hosted on the same machine.
this.ThrottleMode = Throttle.ByDomainHostName;
this.Request("https://www.Website.com", Parse);
}
public override void Init()
{
License.LicenseKey = "LicenseKey";
this.LoggingLevel = WebScraper.LogLevel.All;
this.WorkingDirectory = AppSetting.GetAppRoot() + @"\ShoppingSiteSample\Output\";
// Gets or sets the total number of allowed open HTTP requests (threads)
this.MaxHttpConnectionLimit = 80;
// Gets or sets minimum polite delay (pause)between request to a given domain or IP address.
this.RateLimitPerHost = TimeSpan.FromMilliseconds(50);
// Gets or sets the allowed number of concurrent HTTP requests (threads) per hostname
// or IP address. This helps protect hosts against too many requests.
this.OpenConnectionLimitPerHost = 25;
this.ObeyRobotsDotTxt = false;
// Makes the WebSraper intelligently throttle requests not only by hostname, but
// also by host servers' IP addresses. This is polite in-case multiple scraped domains
// are hosted on the same machine.
this.ThrottleMode = Throttle.ByDomainHostName;
this.Request("https://www.Website.com", Parse);
}
Public Overrides Sub Init()
License.LicenseKey = "LicenseKey"
Me.LoggingLevel = WebScraper.LogLevel.All
Me.WorkingDirectory = AppSetting.GetAppRoot() & "\ShoppingSiteSample\Output\"
' Gets or sets the total number of allowed open HTTP requests (threads)
Me.MaxHttpConnectionLimit = 80
' Gets or sets minimum polite delay (pause)between request to a given domain or IP address.
Me.RateLimitPerHost = TimeSpan.FromMilliseconds(50)
' Gets or sets the allowed number of concurrent HTTP requests (threads) per hostname
' or IP address. This helps protect hosts against too many requests.
Me.OpenConnectionLimitPerHost = 25
Me.ObeyRobotsDotTxt = False
' Makes the WebSraper intelligently throttle requests not only by hostname, but
' also by host servers' IP addresses. This is polite in-case multiple scraped domains
' are hosted on the same machine.
Me.ThrottleMode = Throttle.ByDomainHostName
Me.Request("https://www.Website.com", Parse)
End Sub
節流屬性
MaxHttpConnectionLimit
允許打開的 HTTP 請求總數 (線程)RateLimitPerHost
最小禮貌延遲或暫停 (以毫秒計算)在請求到指定域名或IP地址之間OpenConnectionLimitPerHost
允許的同時 HTTP 請求數量 (線程)限速模式
使WebScraper不僅根據主機名稱,而且根據主機伺服器的IP地址智能地限制請求。 這是出於禮貌,以防多個抓取的網域託管在同一台機器上。開始使用IronWebscraper
立即在您的專案中使用IronWebScraper,並享受免費試用。