Advanced Webscraping Features in C#

HttpIdentity Feature

Some websites require users to be logged in before they can view content; in this case, we can use an HttpIdentity. Here is how to set it up:

// Create a new instance of HttpIdentity
HttpIdentity id = new HttpIdentity();

// Set the network username and password for authentication
id.NetworkUsername = "username";
id.NetworkPassword = "pwd";

// Add the identity to the collection of identities
Identities.Add(id);

' Create a new instance of HttpIdentity
Dim id As New HttpIdentity()

' Set the network username and password for authentication
id.NetworkUsername = "username"
id.NetworkPassword = "pwd"

' Add the identity to the collection of identities
Identities.Add(id)

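One of the most powerful features of IronWebScraper is the ability to use thousands of unique user credentials and/or browser engines to spoof or scrape websites through multiple login sessions.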

public override void Init()
{
    // Set the license key for IronWebScraper
    License.LicenseKey = "LicenseKey";

    // Set the logging level to capture all logs
    this.LoggingLevel = WebScraper.LogLevel.All;

    // Assign the working directory for the output files
    this.WorkingDirectory = AppSetting.GetAppRoot() + @"\ShoppingSiteSample\Output\";

    // Define an array of proxies
    var proxies = "IP-Proxy1:8080,IP-Proxy2:8081".Split(',');

    // Iterate over common Chrome desktop user agents
    foreach (var UA in IronWebScraper.CommonUserAgents.ChromeDesktopUserAgents)
    {
        // Iterate over the proxies
        foreach (var proxy in proxies)
        {
            // Add a new HTTP identity with specific user agent and proxy
            Identities.Add(new HttpIdentity()
            {
                UserAgent = UA,
                UseCookies = true,
                Proxy = proxy
            });
        }
    }

    // Make an initial request to the website with a parse method
    this.Request("http://www.Website.com", Parse);
}

Public Overrides Sub Init()
    ' Set the license key for IronWebScraper
    License.LicenseKey = "LicenseKey"

    ' Set the logging level to capture all logs
    Me.LoggingLevel = WebScraper.LogLevel.All

    ' Assign the working directory for the output files
    Me.WorkingDirectory = AppSetting.GetAppRoot() & "\ShoppingSiteSample\Output\"

    ' Define an array of proxies
    Dim proxies = "IP-Proxy1:8080,IP-Proxy2:8081".Split(","c)

    ' Iterate over common Chrome desktop user agents
    For Each UA In IronWebScraper.CommonUserAgents.ChromeDesktopUserAgents
        ' Iterate over the proxies
        For Each proxy In proxies
            ' Add a new HTTP identity with specific user agent and proxy
            Identities.Add(New HttpIdentity() With {
                .UserAgent = UA,
                .UseCookies = True,
                .Proxy = proxy
            })
        Next
    Next

    ' Make an initial request to the website with a parse method
    Me.Request("http://www.Website.com", AddressOf Parse)
End Sub
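
To make the nested loops above concrete, here is a minimal, library-free sketch that pairs every user agent with every proxy and counts the resulting identities. The user-agent strings are made-up placeholders, and the `identities` list of tuples is a hypothetical stand-in for the scraper's Identities collection; only the proxy strings come from the example above.

```csharp
using System;
using System.Collections.Generic;

// Placeholder user agents (assumption: stand-ins for CommonUserAgents.ChromeDesktopUserAgents)
string[] userAgents = { "ExampleChrome/1.0", "ExampleChrome/2.0", "ExampleChrome/3.0" };

// Same proxy list as in the example above
string[] proxies = "IP-Proxy1:8080,IP-Proxy2:8081".Split(',');

// Hypothetical stand-in for the Identities collection:
// one entry per (user agent, proxy) pair
var identities = new List<(string UserAgent, string Proxy)>();
foreach (var ua in userAgents)
{
    foreach (var proxy in proxies)
    {
        identities.Add((ua, proxy));
    }
}

// Prints: 3 user agents x 2 proxies = 6 identities
Console.WriteLine($"{userAgents.Length} user agents x {proxies.Length} proxies = {identities.Count} identities");
```

The number of identities grows as the product of the two lists, so a modest set of proxies combined with a long user-agent list quickly yields a large pool.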

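You can set several properties to produce different behaviors and so avoid being blocked by websites.

These properties include:

  • NetworkDomain: The network domain to be used for user authentication. Supports Windows, NTLM, Kerberos, Linux, BSD, and Mac OS X networks. Must be used together with NetworkUsername and NetworkPassword.
  • NetworkUsername: The network/HTTP username to be used for user authentication. Supports HTTP, Windows networks, NTLM, Kerberos, Linux networks, BSD networks, and Mac OS.
  • NetworkPassword: The network/HTTP password to be used for user authentication. Supports HTTP, Windows networks, NTLM, Kerberos, Linux networks, BSD networks, and Mac OS.
  • Proxy: Sets the proxy settings.
  • UserAgent: Sets the browser engine (for example, Chrome desktop, Chrome mobile, Chrome tablet, IE, and Firefox).
  • HttpRequestHeaders: Custom header values to be used with this identity; it accepts a dictionary object, Dictionary<string, string>.
  • UseCookies: Enables or disables cookies.

IronWebScraper runs the scraper using random identities. If you need a specific identity to be used for parsing a page, you can do so as follows: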

public override void Init()
{
    // Set the license key for IronWebScraper
    License.LicenseKey = "LicenseKey";

    // Set the logging level to capture all logs
    this.LoggingLevel = WebScraper.LogLevel.All;

    // Assign the working directory for the output files
    this.WorkingDirectory = AppSetting.GetAppRoot() + @"\ShoppingSiteSample\Output\";

    // Create a new instance of HttpIdentity
    HttpIdentity identity = new HttpIdentity();

    // Set the network username and password for authentication
    identity.NetworkUsername = "username";
    identity.NetworkPassword = "pwd";

    // Add the identity to the collection of identities
    Identities.Add(identity);

    // Make a request to the website with the specified identity
    this.Request("http://www.Website.com", Parse, identity);
}

Public Overrides Sub Init()
    ' Set the license key for IronWebScraper
    License.LicenseKey = "LicenseKey"

    ' Set the logging level to capture all logs
    Me.LoggingLevel = WebScraper.LogLevel.All

    ' Assign the working directory for the output files
    Me.WorkingDirectory = AppSetting.GetAppRoot() & "\ShoppingSiteSample\Output\"

    ' Create a new instance of HttpIdentity
    Dim identity As New HttpIdentity()

    ' Set the network username and password for authentication
    identity.NetworkUsername = "username"
    identity.NetworkPassword = "pwd"

    ' Add the identity to the collection of identities
    Identities.Add(identity)

    ' Make a request to the website with the specified identity
    Me.Request("http://www.Website.com", AddressOf Parse, identity)
End Sub

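Enable the Web Cache Feature

This feature caches requested pages. It is often used during development and testing, letting developers cache the pages they need and reuse them after updating their code. You can then run your code against the cached pages after restarting the web scraper, without connecting to the live website each time (action replay).

You can use it in the Init() method: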

// Enable web cache without an expiration time
EnableWebCache();

// OR enable web cache with a specified expiration time
EnableWebCache(new TimeSpan(1, 30, 30));

' Enable web cache without an expiration time
EnableWebCache()

' OR enable web cache with a specified expiration time
EnableWebCache(New TimeSpan(1, 30, 30))
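
As a quick check on what the expiration argument means, the sketch below works out the duration of new TimeSpan(1, 30, 30) using only the .NET base library. The `cachedAt` timestamp is a hypothetical value, and treating the expiration as counted from the moment a page is cached is an assumption made for illustration, not something the text above states.

```csharp
using System;

// TimeSpan(hours, minutes, seconds): 1 hour, 30 minutes, 30 seconds
TimeSpan expiry = new TimeSpan(1, 30, 30);

// Hypothetical moment at which a page was written to the cache
DateTime cachedAt = new DateTime(2024, 1, 1, 9, 0, 0);

// Assumption: the entry is considered stale once this moment has passed
DateTime expiresAt = cachedAt + expiry;

Console.WriteLine(expiry.TotalSeconds);  // 5430
Console.WriteLine(expiresAt.TimeOfDay);  // 10:30:30
```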

It saves your cached data to the WebCache folder under the working directory.

public override void Init()
{
    // Set the license key for IronWebScraper
    License.LicenseKey = "LicenseKey";

    // Set the logging level to capture all logs
    this.LoggingLevel = WebScraper.LogLevel.All;

    // Assign the working directory for the output files
    this.WorkingDirectory = AppSetting.GetAppRoot() + @"\ShoppingSiteSample\Output\";

    // Enable web cache with a specific expiration time of 1 hour, 30 minutes, and 30 seconds
    EnableWebCache(new TimeSpan(1, 30, 30));

    // Make an initial request to the website with a parse method
    this.Request("http://www.Website.com", Parse);
}

Public Overrides Sub Init()
	' Set the license key for IronWebScraper
	License.LicenseKey = "LicenseKey"

	' Set the logging level to capture all logs
	Me.LoggingLevel = WebScraper.LogLevel.All

	' Assign the working directory for the output files
	Me.WorkingDirectory = AppSetting.GetAppRoot() & "\ShoppingSiteSample\Output\"

	' Enable web cache with a specific expiration time of 1 hour, 30 minutes, and 30 seconds
	EnableWebCache(New TimeSpan(1, 30, 30))

	' Make an initial request to the website with a parse method
	Me.Request("http://www.Website.com", AddressOf Parse)
End Sub

IronWebScraper also lets your engine continue scraping after the code is restarted, by setting the engine start process name with Start(CrawlID).

static void Main(string[] args)
{
    // Create an object from the Scraper class
    EngineScraper scrape = new EngineScraper();

    // Start the scraping process with the specified crawl ID
    scrape.Start("enginestate");
}

Shared Sub Main(ByVal args() As String)
	' Create an object from the Scraper class
	Dim scrape As New EngineScraper()

	' Start the scraping process with the specified crawl ID
	scrape.Start("enginestate")
End Sub

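The executed requests and responses are saved in the SavedState folder inside the working directory.

Throttling

We can control the minimum and maximum number of connections and the connection speed per domain.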

public override void Init()
{
    // Set the license key for IronWebScraper
    License.LicenseKey = "LicenseKey";

    // Set the logging level to capture all logs
    this.LoggingLevel = WebScraper.LogLevel.All;

    // Assign the working directory for the output files
    this.WorkingDirectory = AppSetting.GetAppRoot() + @"\ShoppingSiteSample\Output\";

    // Set the total number of allowed open HTTP requests (threads)
    this.MaxHttpConnectionLimit = 80;

    // Set minimum polite delay (pause) between requests to a given domain or IP address
    this.RateLimitPerHost = TimeSpan.FromMilliseconds(50);

    // Set the allowed number of concurrent HTTP requests (threads) per hostname or IP address
    this.OpenConnectionLimitPerHost = 25;

    // Do not obey the robots.txt files
    this.ObeyRobotsDotTxt = false;

    // Choose how requests are throttled: here by domain host name (ThrottleMode can also take host servers' IP addresses into account)
    this.ThrottleMode = Throttle.ByDomainHostName;

    // Make an initial request to the website with a parse method
    this.Request("https://www.Website.com", Parse);
}

Public Overrides Sub Init()
	' Set the license key for IronWebScraper
	License.LicenseKey = "LicenseKey"

	' Set the logging level to capture all logs
	Me.LoggingLevel = WebScraper.LogLevel.All

	' Assign the working directory for the output files
	Me.WorkingDirectory = AppSetting.GetAppRoot() & "\ShoppingSiteSample\Output\"

	' Set the total number of allowed open HTTP requests (threads)
	Me.MaxHttpConnectionLimit = 80

	' Set minimum polite delay (pause) between requests to a given domain or IP address
	Me.RateLimitPerHost = TimeSpan.FromMilliseconds(50)

	' Set the allowed number of concurrent HTTP requests (threads) per hostname or IP address
	Me.OpenConnectionLimitPerHost = 25

	' Do not obey the robots.txt files
	Me.ObeyRobotsDotTxt = False

	' Choose how requests are throttled: here by domain host name (ThrottleMode can also take host servers' IP addresses into account)
	Me.ThrottleMode = Throttle.ByDomainHostName

	' Make an initial request to the website with a parse method
	Me.Request("https://www.Website.com", AddressOf Parse)
End Sub

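Throttling Properties

  • MaxHttpConnectionLimit: The total number of allowed open HTTP requests (threads).
  • RateLimitPerHost: The minimum polite delay (pause) between requests to a given domain or IP address, set as a TimeSpan (50 milliseconds in the example above).
  • OpenConnectionLimitPerHost: The allowed number of concurrent HTTP requests (threads) per hostname.
  • ThrottleMode: Makes the WebScraper intelligently throttle requests not only by hostname but also by the host servers' IP addresses. This is polite in case multiple scraped domains are hosted on the same machine.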
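
To get a feel for what these numbers imply, the following back-of-the-envelope sketch uses only the .NET base library and the values from the throttling example. It rests on two assumptions that the text above does not confirm: that the delay applies between consecutive requests to one host, and that the total connection limit is shared across hosts. The variable names are illustrative, not library properties.

```csharp
using System;

// Values taken from the throttling example
TimeSpan rateLimitPerHost = TimeSpan.FromMilliseconds(50);
int openConnectionLimitPerHost = 25;
int maxHttpConnectionLimit = 80;

// Assumption: one request may start per host every rateLimitPerHost,
// which caps the request rate for that host
double maxRequestsPerSecondPerHost = 1000.0 / rateLimitPerHost.TotalMilliseconds;

// Assumption: the total limit is shared, so only this many hosts
// can be at their per-host connection limit at the same time
int hostsAtFullConcurrency = maxHttpConnectionLimit / openConnectionLimitPerHost;

Console.WriteLine(maxRequestsPerSecondPerHost); // 20
Console.WriteLine(hostsAtFullConcurrency);      // 3
```

Under these assumptions, increasing the delay lowers the per-host ceiling proportionally: a 200 ms delay would give 5 requests per second per host.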

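Frequently Asked Questions

How do I authenticate users on websites that require a login in C#?

You can use the HttpIdentity feature to authenticate users by setting properties such as NetworkDomain, NetworkUsername, and NetworkPassword.

What are the benefits of using the web cache during development?

The web cache lets you save requested pages for reuse, so you avoid repeatedly connecting to the live website. This saves time and resources and is especially useful during development and testing.

How can I manage multiple login sessions in web scraping?

IronWebScraper allows the use of thousands of unique user credentials and browser engines to simulate multiple login sessions, which helps prevent websites from detecting and blocking the scraper.

What advanced throttling options are available in web scraping?

IronWebScraper provides a ThrottleMode setting that manages request throttling by hostname and IP address, keeping interactions with shared hosting environments polite.

How do I use a proxy with IronWebScraper?

To use proxies, define an array of proxies and associate them with HttpIdentity instances in IronWebScraper, so that requests are routed through different IP addresses for anonymity and access control.

How does IronWebScraper handle request delays to prevent server overload?

The RateLimitPerHost setting in IronWebScraper specifies the minimum delay between requests to a given domain or IP address, which helps prevent server overload by spacing requests out.

Can web scraping be resumed after an interruption?

Yes, IronWebScraper can resume scraping after an interruption by using the Start(CrawlID) method, which saves the execution state and continues from the last saved point.

How do I control the number of concurrent HTTP connections in a web scraper?

In IronWebScraper, you can set the MaxHttpConnectionLimit property to control the total number of allowed open HTTP requests, which helps manage server load and resources.

What logging options are available for web scraping activities?

IronWebScraper lets you set the logging level through the LoggingLevel property, enabling comprehensive logging during scraping operations for detailed analysis and troubleshooting.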

Curtis Chau
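Technical Writer

Curtis Chau holds a bachelor's degree in computer science from Carleton University and specializes in front-end development, with expertise in Node.js, TypeScript, JavaScript, and React. He is passionate about crafting intuitive and aesthetically pleasing user interfaces, enjoys working with modern frameworks, and creates well-structured, visually appealing manuals.

Beyond development, Curtis has a strong interest in the Internet of Things (IoT) and explores innovative ways to integrate hardware and software. In his free time, he enjoys gaming and building Discord bots, combining his love of technology with creativity.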
