高級網頁抓取功能

This article was translated from English: Does it need improvement?
Translated
View the article in English

C# NuGet 程式庫用于

安裝與 NuGet

Install-Package IronWebScraper
Java PDF JAR

下載 DLL

下載DLL

手動安裝到您的項目中

C# NuGet 程式庫用于

安裝與 NuGet

Install-Package IronWebScraper
Java PDF JAR

下載 DLL

下載DLL

手動安裝到您的項目中

立即開始在您的專案中使用IronPDF,並享受免費試用。

第一步:
green arrow pointer

查看 IronWebScraperNuget 快速安裝和部署。已被下載超過800萬次,它正用C#改變。

C# NuGet 程式庫用于 nuget.org/packages/IronWebScraper/
Install-Package IronWebScraper

請考慮安裝 IronWebScraper DLL 直接下載並手動安裝到您的專案或GAC表單: webscraperIronWebScraper.zip

手動安裝到您的項目中

下載DLL

HttpIdentity 功能

某些網站系統需要用戶登錄才能查看內容;在這種情況下,我們可以使用HttpIdentity:

HttpIdentity id = new HttpIdentity();
id.NetworkUsername = "username";
id.NetworkPassword = "pwd";
Identities.Add(id); 
HttpIdentity id = new HttpIdentity();
id.NetworkUsername = "username";
id.NetworkPassword = "pwd";
Identities.Add(id); 
Dim id As New HttpIdentity()
id.NetworkUsername = "username"
id.NetworkPassword = "pwd"
Identities.Add(id)
VB   C#

在 IronWebScraper 中最令人印象深刻和强大的功能之一,是能够使用数千种独特的 (用戶的憑證和/或瀏覽器引擎) 利用多重登入會話來欺騙或抓取網站。

public override void Init()
{
    License.LicenseKey = " LicenseKey ";
    this.LoggingLevel = WebScraper.LogLevel.All;
    this.WorkingDirectory = AppSetting.GetAppRoot() + @"\ShoppingSiteSample\Output\";
    var proxies = "IP-Proxy1: 8080,IP-Proxy2: 8081".Split(',');
    foreach (var UA in IronWebScraper.CommonUserAgents.ChromeDesktopUserAgents)
    {
        foreach (var proxy in proxies)
        {
            Identities.Add(new HttpIdentity()
            {
                UserAgent = UA,
                UseCookies = true,
                Proxy = proxy
            });
        }
    }
    this.Request("http://www.Website.com", Parse);
}
public override void Init()
{
    License.LicenseKey = " LicenseKey ";
    this.LoggingLevel = WebScraper.LogLevel.All;
    this.WorkingDirectory = AppSetting.GetAppRoot() + @"\ShoppingSiteSample\Output\";
    var proxies = "IP-Proxy1: 8080,IP-Proxy2: 8081".Split(',');
    foreach (var UA in IronWebScraper.CommonUserAgents.ChromeDesktopUserAgents)
    {
        foreach (var proxy in proxies)
        {
            Identities.Add(new HttpIdentity()
            {
                UserAgent = UA,
                UseCookies = true,
                Proxy = proxy
            });
        }
    }
    this.Request("http://www.Website.com", Parse);
}
Public Overrides Sub Init()
	License.LicenseKey = " LicenseKey "
	Me.LoggingLevel = WebScraper.LogLevel.All
	Me.WorkingDirectory = AppSetting.GetAppRoot() & "\ShoppingSiteSample\Output\"
	Dim proxies = "IP-Proxy1: 8080,IP-Proxy2: 8081".Split(","c)
	For Each UA In IronWebScraper.CommonUserAgents.ChromeDesktopUserAgents
		For Each proxy In proxies
			Identities.Add(New HttpIdentity() With {
				.UserAgent = UA,
				.UseCookies = True,
				.Proxy = proxy
			})
		Next proxy
	Next UA
	Me.Request("http://www.Website.com", Parse)
End Sub
VB   C#

您有多種屬性可供選擇,以產生不同的行為,從而防止網站封鎖您。

一些這些屬性包括:

  • NetworkDomain:用於用戶身份驗證的網域。支援 Windows、NTLM、Kerberos、Linux、BSD 和 Mac OS X 網路。 必須與 (網絡用戶名和網絡密碼)
  • NetworkUsername: 用於用戶驗證的網絡/HTTP用戶名。支持Http、Windows網絡、NTLM、Kerberos、Linux網絡、BSD網絡和Mac OS。
  • NetworkPassword: 用於用戶驗證的網絡/HTTP密碼。支持Http、Windows網絡、NTLM、Kerberos、Linux網絡、BSD網絡和Mac OS。
  • Proxy: 設置代理設置
  • UserAgent: 設置瀏覽器引擎 (Chrome 桌面版、Chrome Mobile、Chrome 平板版、IE 和 Firefox 等。)
  • HttpRequestHeaders :用於此身份將使用的自訂標頭值,並接受字典物件(字典<string,string>)
  • UseCookies : 啟用/禁用使用 Cookie

IronWebScraper 使用隨機身份運行爬蟲。如果我們需要指定使用特定身份來解析頁面,我們可以這樣做。

public override void Init()
{
    License.LicenseKey = " LicenseKey ";
    this.LoggingLevel = WebScraper.LogLevel.All;
    this.WorkingDirectory = AppSetting.GetAppRoot() + @"\ShoppingSiteSample\Output\";
    HttpIdentity identity = new HttpIdentity();
    identity.NetworkUsername = "username";
    identity.NetworkPassword = "pwd";
    Identities.Add(id);
    this.Request("http://www.Website.com", Parse, identity);
}
public override void Init()
{
    License.LicenseKey = " LicenseKey ";
    this.LoggingLevel = WebScraper.LogLevel.All;
    this.WorkingDirectory = AppSetting.GetAppRoot() + @"\ShoppingSiteSample\Output\";
    HttpIdentity identity = new HttpIdentity();
    identity.NetworkUsername = "username";
    identity.NetworkPassword = "pwd";
    Identities.Add(id);
    this.Request("http://www.Website.com", Parse, identity);
}
Public Overrides Sub Init()
	License.LicenseKey = " LicenseKey "
	Me.LoggingLevel = WebScraper.LogLevel.All
	Me.WorkingDirectory = AppSetting.GetAppRoot() & "\ShoppingSiteSample\Output\"
	Dim identity As New HttpIdentity()
	identity.NetworkUsername = "username"
	identity.NetworkPassword = "pwd"
	Identities.Add(id)
	Me.Request("http://www.Website.com", Parse, identity)
End Sub
VB   C#

啟用 Web 快取功能

此功能用於快取請求的頁面。它通常在開發和測試階段使用;讓開發人員可以在更新代碼後快取所需的頁面以供重用。這使您能夠在重新啟動 Web 爬取工具後在快取頁面上執行代碼,而不需要每次都連接到實時網站。 (即時回放)你可以在初始化中使用它() 方法

`EnableWebCache();

或者

`EnableWebCache(時限到期)它會將您的快取數據保存到工作目錄文件夾下的 WebCache 文件夾

public override void Init()
{
    License.LicenseKey = " LicenseKey ";
    this.LoggingLevel = WebScraper.LogLevel.All;
    this.WorkingDirectory = AppSetting.GetAppRoot() + @"\ShoppingSiteSample\Output\";
    EnableWebCache(new TimeSpan(1,30,30));
    this.Request("http://www.WebSite.com", Parse);
}
public override void Init()
{
    License.LicenseKey = " LicenseKey ";
    this.LoggingLevel = WebScraper.LogLevel.All;
    this.WorkingDirectory = AppSetting.GetAppRoot() + @"\ShoppingSiteSample\Output\";
    EnableWebCache(new TimeSpan(1,30,30));
    this.Request("http://www.WebSite.com", Parse);
}
Public Overrides Sub Init()
	License.LicenseKey = " LicenseKey "
	Me.LoggingLevel = WebScraper.LogLevel.All
	Me.WorkingDirectory = AppSetting.GetAppRoot() & "\ShoppingSiteSample\Output\"
	EnableWebCache(New TimeSpan(1,30,30))
	Me.Request("http://www.WebSite.com", Parse)
End Sub
VB   C#

IronWebScraper 也具有在重新啟動代碼後通過使用 Start 設置引擎啟動過程名稱來使您的引擎繼續抓取的功能。(爬行ID)

static void Main(string [] args)
{
    // Create Object From Scraper class
    EngineScraper scrape = new EngineScraper();
    // Start Scraping
    scrape.Start("enginestate");
}
static void Main(string [] args)
{
    // Create Object From Scraper class
    EngineScraper scrape = new EngineScraper();
    // Start Scraping
    scrape.Start("enginestate");
}
Shared Sub Main(ByVal args() As String)
	' Create Object From Scraper class
	Dim scrape As New EngineScraper()
	' Start Scraping
	scrape.Start("enginestate")
End Sub
VB   C#

執行請求和回應將被保存到工作目錄內的 SavedState 資料夾中。

縮減

我們可以控制每個域的最小和最大連接數以及連接速度。

public override void Init()
{
    License.LicenseKey = "LicenseKey";
    this.LoggingLevel = WebScraper.LogLevel.All;
    this.WorkingDirectory = AppSetting.GetAppRoot() + @"\ShoppingSiteSample\Output\";
    // Gets or sets the total number of allowed open HTTP requests (threads)
    this.MaxHttpConnectionLimit = 80;
    // Gets or sets minimum polite delay (pause)between request to a given domain or IP address.
    this.RateLimitPerHost = TimeSpan.FromMilliseconds(50);            
    //     Gets or sets the allowed number of concurrent HTTP requests (threads) per hostname
    //     or IP address. This helps protect hosts against too many requests.
    this.OpenConnectionLimitPerHost = 25;
    this.ObeyRobotsDotTxt = false;
    //     Makes the WebSraper intelligently throttle requests not only by hostname, but
    //     also by host servers' IP addresses. This is polite in-case multiple scraped domains
    //     are hosted on the same machine.
    this.ThrottleMode = Throttle.ByDomainHostName;
    this.Request("https://www.Website.com", Parse);
}
public override void Init()
{
    License.LicenseKey = "LicenseKey";
    this.LoggingLevel = WebScraper.LogLevel.All;
    this.WorkingDirectory = AppSetting.GetAppRoot() + @"\ShoppingSiteSample\Output\";
    // Gets or sets the total number of allowed open HTTP requests (threads)
    this.MaxHttpConnectionLimit = 80;
    // Gets or sets minimum polite delay (pause)between request to a given domain or IP address.
    this.RateLimitPerHost = TimeSpan.FromMilliseconds(50);            
    //     Gets or sets the allowed number of concurrent HTTP requests (threads) per hostname
    //     or IP address. This helps protect hosts against too many requests.
    this.OpenConnectionLimitPerHost = 25;
    this.ObeyRobotsDotTxt = false;
    //     Makes the WebSraper intelligently throttle requests not only by hostname, but
    //     also by host servers' IP addresses. This is polite in-case multiple scraped domains
    //     are hosted on the same machine.
    this.ThrottleMode = Throttle.ByDomainHostName;
    this.Request("https://www.Website.com", Parse);
}
Public Overrides Sub Init()
	License.LicenseKey = "LicenseKey"
	Me.LoggingLevel = WebScraper.LogLevel.All
	Me.WorkingDirectory = AppSetting.GetAppRoot() & "\ShoppingSiteSample\Output\"
	' Gets or sets the total number of allowed open HTTP requests (threads)
	Me.MaxHttpConnectionLimit = 80
	' Gets or sets minimum polite delay (pause)between request to a given domain or IP address.
	Me.RateLimitPerHost = TimeSpan.FromMilliseconds(50)
	'     Gets or sets the allowed number of concurrent HTTP requests (threads) per hostname
	'     or IP address. This helps protect hosts against too many requests.
	Me.OpenConnectionLimitPerHost = 25
	Me.ObeyRobotsDotTxt = False
	'     Makes the WebSraper intelligently throttle requests not only by hostname, but
	'     also by host servers' IP addresses. This is polite in-case multiple scraped domains
	'     are hosted on the same machine.
	Me.ThrottleMode = Throttle.ByDomainHostName
	Me.Request("https://www.Website.com", Parse)
End Sub
VB   C#

節流屬性

  • MaxHttpConnectionLimit


    允許打開的 HTTP 請求總數 (線程)

  • RateLimitPerHost


    最小禮貌延遲或暫停 (以毫秒計算) 在請求到特定域名或 IP 地址之間

  • OpenConnectionLimitPerHost


    允許的同時 HTTP 請求數量 (線程)

  • 限速模式


    使 WebScraper 不僅按主機名稱,還按主機伺服器的 IP 地址智能地調節請求速率。這樣做是為了禮貌地考慮到多個被抓取的域名可能託管在同一台機器上。

.NET軟體工程師 對很多人來說,這是從 .NET 生成 PDF 文件的最有效方法,因為不需要學習額外的 API 或導航複雜的設計系統

艾哈邁德·阿布爾馬格德

跨國IT公司 .NET 軟體解決方案架構師

Ahmed 是一位經驗豐富且具有認證的微軟技術專家,擁有超過10年在 IT 和軟體開發領域的經驗。他曾在多家不同公司工作,現在是某跨國 IT 公司的國家經理。

Ahmed已經使用IronPDF和IronWebScraper超過一年,並在他公司中的多個專案中使用它們。