高級網頁抓取功能

Darrius Serrant
Darrius Serrant
2023年1月25日
已更新 2024年12月10日
分享:
This article was translated from English: Does it need improvement?
Translated
View the article in English

HttpIdentity 功能

一些網站系統要求使用者登入後才能查看內容; 在這種情況下,我們可以使用 HttpIdentity: -

HttpIdentity id = new HttpIdentity();
id.NetworkUsername = "username";
id.NetworkPassword = "pwd";
Identities.Add(id); 
HttpIdentity id = new HttpIdentity();
id.NetworkUsername = "username";
id.NetworkPassword = "pwd";
Identities.Add(id); 
Dim id As New HttpIdentity()
id.NetworkUsername = "username"
id.NetworkPassword = "pwd"
Identities.Add(id)
$vbLabelText   $csharpLabel

IronWebScraper最令人印象深刻且強大的功能之一是能夠使用成千上萬的唯一(用戶憑據和/或瀏覽器引擎)來偽裝或抓取網站,使用多重登入會話。

public override void Init()
{
    License.LicenseKey = " LicenseKey ";
    this.LoggingLevel = WebScraper.LogLevel.All;
    this.WorkingDirectory = AppSetting.GetAppRoot() + @"\ShoppingSiteSample\Output\";
    var proxies = "IP-Proxy1: 8080,IP-Proxy2: 8081".Split(',');
    foreach (var UA in IronWebScraper.CommonUserAgents.ChromeDesktopUserAgents)
    {
        foreach (var proxy in proxies)
        {
            Identities.Add(new HttpIdentity()
            {
                UserAgent = UA,
                UseCookies = true,
                Proxy = proxy
            });
        }
    }
    this.Request("http://www.Website.com", Parse);
}
public override void Init()
{
    License.LicenseKey = " LicenseKey ";
    this.LoggingLevel = WebScraper.LogLevel.All;
    this.WorkingDirectory = AppSetting.GetAppRoot() + @"\ShoppingSiteSample\Output\";
    var proxies = "IP-Proxy1: 8080,IP-Proxy2: 8081".Split(',');
    foreach (var UA in IronWebScraper.CommonUserAgents.ChromeDesktopUserAgents)
    {
        foreach (var proxy in proxies)
        {
            Identities.Add(new HttpIdentity()
            {
                UserAgent = UA,
                UseCookies = true,
                Proxy = proxy
            });
        }
    }
    this.Request("http://www.Website.com", Parse);
}
Public Overrides Sub Init()
	License.LicenseKey = " LicenseKey "
	Me.LoggingLevel = WebScraper.LogLevel.All
	Me.WorkingDirectory = AppSetting.GetAppRoot() & "\ShoppingSiteSample\Output\"
	Dim proxies = "IP-Proxy1: 8080,IP-Proxy2: 8081".Split(","c)
	For Each UA In IronWebScraper.CommonUserAgents.ChromeDesktopUserAgents
		For Each proxy In proxies
			Identities.Add(New HttpIdentity() With {
				.UserAgent = UA,
				.UseCookies = True,
				.Proxy = proxy
			})
		Next proxy
	Next UA
	Me.Request("http://www.Website.com", Parse)
End Sub
$vbLabelText   $csharpLabel

您有多個屬性可以提供不同的行為,因此可以防止網站封鎖您。

以下是其中一些屬性: -

  • NetworkDomain : 用於使用者驗證的網域。 支持Windows、NTLM、Kerberos、Linux、BSD和Mac OS X網絡。 必須與 (NetworkUsername 和 NetworkPassword) 一起使用
  • NetworkUsername:用於用戶驗證的網絡/HTTP 用戶名。 支持 Http、Windows 網路、NTLM、Kerberos、Linux 網路、BSD 網路和 Mac OS。
  • NetworkPassword:用於用戶身份驗證的網路/HTTP 密碼。 支持Http、Windows網絡、NTLM、Keroberos、Linux網絡、BSD網絡和Mac OS。
  • 代理:設定代理設定
  • UserAgent:設定瀏覽器引擎(Chrome 桌面版、Chrome 手機版、Chrome 平板版、IE 和 Firefox 等)。
  • HttpRequestHeaders:適用於自定義標頭值,將與此身份一起使用,並接受字典對象(Dictionary <string, string>)
  • UseCookies:啟用/禁用使用 cookies

    IronWebScraper 使用隨機身份運行抓取器。 如果我們需要指定使用特定的身份來解析頁面,我們可以這樣做。

public override void Init()
{
    License.LicenseKey = " LicenseKey ";
    this.LoggingLevel = WebScraper.LogLevel.All;
    this.WorkingDirectory = AppSetting.GetAppRoot() + @"\ShoppingSiteSample\Output\";
    HttpIdentity identity = new HttpIdentity();
    identity.NetworkUsername = "username";
    identity.NetworkPassword = "pwd";
    Identities.Add(id);
    this.Request("http://www.Website.com", Parse, identity);
}
public override void Init()
{
    License.LicenseKey = " LicenseKey ";
    this.LoggingLevel = WebScraper.LogLevel.All;
    this.WorkingDirectory = AppSetting.GetAppRoot() + @"\ShoppingSiteSample\Output\";
    HttpIdentity identity = new HttpIdentity();
    identity.NetworkUsername = "username";
    identity.NetworkPassword = "pwd";
    Identities.Add(id);
    this.Request("http://www.Website.com", Parse, identity);
}
Public Overrides Sub Init()
	License.LicenseKey = " LicenseKey "
	Me.LoggingLevel = WebScraper.LogLevel.All
	Me.WorkingDirectory = AppSetting.GetAppRoot() & "\ShoppingSiteSample\Output\"
	Dim identity As New HttpIdentity()
	identity.NetworkUsername = "username"
	identity.NetworkPassword = "pwd"
	Identities.Add(id)
	Me.Request("http://www.Website.com", Parse, identity)
End Sub
$vbLabelText   $csharpLabel

啟用網頁快取功能

此功能用於緩存請求的頁面。 它經常在開發和測試階段使用; 允許開發者在更新代碼後緩存所需的頁面以供重用。 這使您能夠在重新啟動您的網頁抓取工具後在緩存的頁面上執行您的代碼,而不需要每次都連接到實時網站(行動重播)。

您可以在 Init() 方法中使用它

EnableWebCache();

EnableWebCache(Timespan Expiry);

它將您的快取資料儲存到工作目錄資料夾下的WebCache資料夾中。

public override void Init()
{
    License.LicenseKey = " LicenseKey ";
    this.LoggingLevel = WebScraper.LogLevel.All;
    this.WorkingDirectory = AppSetting.GetAppRoot() + @"\ShoppingSiteSample\Output\";
    EnableWebCache(new TimeSpan(1,30,30));
    this.Request("http://www.WebSite.com", Parse);
}
public override void Init()
{
    License.LicenseKey = " LicenseKey ";
    this.LoggingLevel = WebScraper.LogLevel.All;
    this.WorkingDirectory = AppSetting.GetAppRoot() + @"\ShoppingSiteSample\Output\";
    EnableWebCache(new TimeSpan(1,30,30));
    this.Request("http://www.WebSite.com", Parse);
}
Public Overrides Sub Init()
	License.LicenseKey = " LicenseKey "
	Me.LoggingLevel = WebScraper.LogLevel.All
	Me.WorkingDirectory = AppSetting.GetAppRoot() & "\ShoppingSiteSample\Output\"
	EnableWebCache(New TimeSpan(1,30,30))
	Me.Request("http://www.WebSite.com", Parse)
End Sub
$vbLabelText   $csharpLabel

IronWebScraper 也具備一些功能,能夠讓您的引擎在重啟代碼後繼續抓取,通過使用 Start(CrawlID) 設定引擎的啟動過程名稱來實現。

static void Main(string [] args)
{
    // Create Object From Scraper class
    EngineScraper scrape = new EngineScraper();
    // Start Scraping
    scrape.Start("enginestate");
}
static void Main(string [] args)
{
    // Create Object From Scraper class
    EngineScraper scrape = new EngineScraper();
    // Start Scraping
    scrape.Start("enginestate");
}
Shared Sub Main(ByVal args() As String)
	' Create Object From Scraper class
	Dim scrape As New EngineScraper()
	' Start Scraping
	scrape.Start("enginestate")
End Sub
$vbLabelText   $csharpLabel

執行請求和回應將被保存到工作目錄內的 SavedState 資料夾中。

限流

我們可以控制每個域的最小和最大連接數以及連接速度。

public override void Init()
{
    License.LicenseKey = "LicenseKey";
    this.LoggingLevel = WebScraper.LogLevel.All;
    this.WorkingDirectory = AppSetting.GetAppRoot() + @"\ShoppingSiteSample\Output\";
    // Gets or sets the total number of allowed open HTTP requests (threads)
    this.MaxHttpConnectionLimit = 80;
    // Gets or sets minimum polite delay (pause)between request to a given domain or IP address.
    this.RateLimitPerHost = TimeSpan.FromMilliseconds(50);            
    //     Gets or sets the allowed number of concurrent HTTP requests (threads) per hostname
    //     or IP address. This helps protect hosts against too many requests.
    this.OpenConnectionLimitPerHost = 25;
    this.ObeyRobotsDotTxt = false;
    //     Makes the WebSraper intelligently throttle requests not only by hostname, but
    //     also by host servers' IP addresses. This is polite in-case multiple scraped domains
    //     are hosted on the same machine.
    this.ThrottleMode = Throttle.ByDomainHostName;
    this.Request("https://www.Website.com", Parse);
}
public override void Init()
{
    License.LicenseKey = "LicenseKey";
    this.LoggingLevel = WebScraper.LogLevel.All;
    this.WorkingDirectory = AppSetting.GetAppRoot() + @"\ShoppingSiteSample\Output\";
    // Gets or sets the total number of allowed open HTTP requests (threads)
    this.MaxHttpConnectionLimit = 80;
    // Gets or sets minimum polite delay (pause)between request to a given domain or IP address.
    this.RateLimitPerHost = TimeSpan.FromMilliseconds(50);            
    //     Gets or sets the allowed number of concurrent HTTP requests (threads) per hostname
    //     or IP address. This helps protect hosts against too many requests.
    this.OpenConnectionLimitPerHost = 25;
    this.ObeyRobotsDotTxt = false;
    //     Makes the WebSraper intelligently throttle requests not only by hostname, but
    //     also by host servers' IP addresses. This is polite in-case multiple scraped domains
    //     are hosted on the same machine.
    this.ThrottleMode = Throttle.ByDomainHostName;
    this.Request("https://www.Website.com", Parse);
}
Public Overrides Sub Init()
	License.LicenseKey = "LicenseKey"
	Me.LoggingLevel = WebScraper.LogLevel.All
	Me.WorkingDirectory = AppSetting.GetAppRoot() & "\ShoppingSiteSample\Output\"
	' Gets or sets the total number of allowed open HTTP requests (threads)
	Me.MaxHttpConnectionLimit = 80
	' Gets or sets minimum polite delay (pause)between request to a given domain or IP address.
	Me.RateLimitPerHost = TimeSpan.FromMilliseconds(50)
	'     Gets or sets the allowed number of concurrent HTTP requests (threads) per hostname
	'     or IP address. This helps protect hosts against too many requests.
	Me.OpenConnectionLimitPerHost = 25
	Me.ObeyRobotsDotTxt = False
	'     Makes the WebSraper intelligently throttle requests not only by hostname, but
	'     also by host servers' IP addresses. This is polite in-case multiple scraped domains
	'     are hosted on the same machine.
	Me.ThrottleMode = Throttle.ByDomainHostName
	Me.Request("https://www.Website.com", Parse)
End Sub
$vbLabelText   $csharpLabel

節流屬性

  • MaxHttpConnectionLimit


    允許的 HTTP 請求(線程)總數

  • RateLimitPerHost


    對指定網域或 IP 位址的請求之間的最小禮貌延遲或暫停(以毫秒為單位)

  • OpenConnectionLimitPerHost


    允許的同時 HTTP 請求(執行緒)數量

  • ThrottleMode


    使 WebSraper 能夠智能地調節請求速度,不僅依據主機名稱,還根據主機伺服器的 IP 地址。 這是出於禮貌,以防多個抓取的網域託管在同一台機器上。


    開始使用IronWebscraper

    立即在您的專案中使用IronWebScraper,並享受免費試用。

    第一步:
    green arrow pointer

Darrius Serrant
全端軟體工程師(WebOps)

Darrius Serrant 擁有邁阿密大學的計算機科學學士學位,目前擔任 Iron Software 的全端 WebOps 行銷工程師。自幼對編程產生興趣,他認為計算機既神秘又易於接觸,使其成為創造力和解決問題的完美媒介。

在 Iron Software,Darrius 享受創造新事物並簡化複雜概念使其更易理解的過程。作為我們的其中一位常駐開發人員,他也自願教導學生,將他的專業知識傳授給下一代。

對 Darrius 來說,他的工作之所以令人滿足,是因為它受到重視並且產生了真正的影響。