Advanced Webscraping Features in C#
HttpIdentity Feature
Some websites require users to be logged in before they can view content. In that case, we can use an HttpIdentity. Here is how to set one up:
// Create a new instance of HttpIdentity
HttpIdentity id = new HttpIdentity();
// Set the network username and password for authentication
id.NetworkUsername = "username";
id.NetworkPassword = "pwd";
// Add the identity to the collection of identities
Identities.Add(id);
' Create a new instance of HttpIdentity
Dim id As New HttpIdentity()
' Set the network username and password for authentication
id.NetworkUsername = "username"
id.NetworkPassword = "pwd"
' Add the identity to the collection of identities
Identities.Add(id)
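One of the most powerful features of IronWebScraper is the ability to use thousands of unique user credentials and/or browser engines, so that a site can be scraped through multiple login sessions under different identities.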
public override void Init()
{
// Set the license key for IronWebScraper
License.LicenseKey = "LicenseKey";
// Set the logging level to capture all logs
this.LoggingLevel = WebScraper.LogLevel.All;
// Assign the working directory for the output files
this.WorkingDirectory = AppSetting.GetAppRoot() + @"\ShoppingSiteSample\Output\";
// Define an array of proxies
var proxies = "IP-Proxy1:8080,IP-Proxy2:8081".Split(',');
// Iterate over common Chrome desktop user agents
foreach (var UA in IronWebScraper.CommonUserAgents.ChromeDesktopUserAgents)
{
// Iterate over the proxies
foreach (var proxy in proxies)
{
// Add a new HTTP identity with specific user agent and proxy
Identities.Add(new HttpIdentity()
{
UserAgent = UA,
UseCookies = true,
Proxy = proxy
});
}
}
// Make an initial request to the website with a parse method
this.Request("http://www.Website.com", Parse);
}
Public Overrides Sub Init()
' Set the license key for IronWebScraper
License.LicenseKey = "LicenseKey"
' Set the logging level to capture all logs
Me.LoggingLevel = WebScraper.LogLevel.All
' Assign the working directory for the output files
Me.WorkingDirectory = AppSetting.GetAppRoot() & "\ShoppingSiteSample\Output\"
' Define an array of proxies
Dim proxies = "IP-Proxy1:8080,IP-Proxy2:8081".Split(","c)
' Iterate over common Chrome desktop user agents
For Each UA In IronWebScraper.CommonUserAgents.ChromeDesktopUserAgents
' Iterate over the proxies
For Each proxy In proxies
' Add a new HTTP identity with specific user agent and proxy
Identities.Add(New HttpIdentity() With {
.UserAgent = UA,
.UseCookies = True,
.Proxy = proxy
})
Next
Next
' Make an initial request to the website with a parse method
Me.Request("http://www.Website.com", AddressOf Parse)
End Sub
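The nested loops above register one identity for every pairing of user agent and proxy, so the pool grows as the product of the two lists. The following is a minimal sketch of that pairing in plain .NET, without the IronWebScraper library; the user agent strings are hypothetical placeholders standing in for the values of CommonUserAgents.ChromeDesktopUserAgents.

```csharp
using System;
using System.Linq;

// Hypothetical sample values; in the example above the user agents come from
// IronWebScraper.CommonUserAgents.ChromeDesktopUserAgents.
string[] userAgents =
{
    "ExampleAgent/1.0 (Windows)",
    "ExampleAgent/1.0 (macOS)",
    "ExampleAgent/1.0 (Linux)"
};
string[] proxies = "IP-Proxy1:8080,IP-Proxy2:8081".Split(',');

// Same pairing as the nested foreach loops: every user agent with every proxy.
var combinations = userAgents
    .SelectMany(ua => proxies, (ua, proxy) => (UserAgent: ua, Proxy: proxy))
    .ToList();

Console.WriteLine($"Identities that would be registered: {combinations.Count}");
foreach (var c in combinations)
{
    Console.WriteLine($"{c.UserAgent} via {c.Proxy}");
}
```

With three user agents and two proxies this yields six pairings, which is a quick way to estimate how large the identity pool will be before starting a crawl.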
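You can set several properties to obtain different behaviors, which helps prevent websites from blocking you.
These properties include:
NetworkDomain: The network domain used for user authentication. Supports Windows, NTLM, Kerberos, Linux, BSD, and Mac OS X networks. Must be used together with NetworkUsername and NetworkPassword.
NetworkUsername: The network/HTTP username used for user authentication. Supports HTTP, Windows networks, NTLM, Kerberos, Linux networks, BSD networks, and Mac OS.
NetworkPassword: The network/HTTP password used for user authentication. Supports HTTP, Windows networks, NTLM, Kerberos, Linux networks, BSD networks, and Mac OS.
Proxy: Sets the proxy server settings.
UserAgent: Sets the browser engine (for example Chrome desktop, Chrome mobile, Chrome tablet, IE, or Firefox).
HttpRequestHeaders: Custom header values to be sent with this identity. It accepts a dictionary object, Dictionary<string, string>.
UseCookies: Enables or disables cookies.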
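Since HttpRequestHeaders takes a Dictionary<string, string>, the headers can be prepared and inspected as an ordinary dictionary before they are attached to an identity. The sketch below runs without the library; the header names and values are hypothetical, and the final commented line shows an assumed way of assigning the dictionary, not a confirmed signature.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical header values chosen only for illustration.
var headers = new Dictionary<string, string>
{
    ["Accept-Language"] = "en-US,en;q=0.9",
    ["Referer"] = "http://www.Website.com"
};

// Print what would be sent so the values can be checked before a crawl.
foreach (var header in headers)
{
    Console.WriteLine($"{header.Key}: {header.Value}");
}

// Assumption: inside a scraper's Init(), the dictionary would be attached to an
// identity through the HttpRequestHeaders property listed above, for example:
//   Identities.Add(new HttpIdentity { HttpRequestHeaders = headers, UseCookies = true });
```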
IronWebScraper picks identities at random when running the scraper. To parse a page with one specific identity, pass that identity to the request:
public override void Init()
{
// Set the license key for IronWebScraper
License.LicenseKey = "LicenseKey";
// Set the logging level to capture all logs
this.LoggingLevel = WebScraper.LogLevel.All;
// Assign the working directory for the output files
this.WorkingDirectory = AppSetting.GetAppRoot() + @"\ShoppingSiteSample\Output\";
// Create a new instance of HttpIdentity
HttpIdentity identity = new HttpIdentity();
// Set the network username and password for authentication
identity.NetworkUsername = "username";
identity.NetworkPassword = "pwd";
// Add the identity to the collection of identities
Identities.Add(identity);
// Make a request to the website with the specified identity
this.Request("http://www.Website.com", Parse, identity);
}
Public Overrides Sub Init()
' Set the license key for IronWebScraper
License.LicenseKey = "LicenseKey"
' Set the logging level to capture all logs
Me.LoggingLevel = WebScraper.LogLevel.All
' Assign the working directory for the output files
Me.WorkingDirectory = AppSetting.GetAppRoot() & "\ShoppingSiteSample\Output\"
' Create a new instance of HttpIdentity
Dim identity As New HttpIdentity()
' Set the network username and password for authentication
identity.NetworkUsername = "username"
identity.NetworkPassword = "pwd"
' Add the identity to the collection of identities
Identities.Add(identity)
' Make a request to the website with the specified identity
Me.Request("http://www.Website.com", AddressOf Parse, identity)
End Sub
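Enable the Web Cache Feature
This feature caches the pages that are requested. It is mostly used during development and testing, where it lets developers store the pages they need and reuse them after updating their code. After restarting the web scraper, your code runs against the cached pages instead of connecting to the live website every time (action replay).
You can use it in the Init() method: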
// Enable web cache without an expiration time
EnableWebCache();
// OR enable web cache with a specified expiration time
EnableWebCache(new TimeSpan(1, 30, 30));
' Enable web cache without an expiration time
EnableWebCache()
' OR enable web cache with a specified expiration time
EnableWebCache(New TimeSpan(1, 30, 30))
The cached data is saved to the WebCache folder under the working directory.
public override void Init()
{
// Set the license key for IronWebScraper
License.LicenseKey = "LicenseKey";
// Set the logging level to capture all logs
this.LoggingLevel = WebScraper.LogLevel.All;
// Assign the working directory for the output files
this.WorkingDirectory = AppSetting.GetAppRoot() + @"\ShoppingSiteSample\Output\";
// Enable web cache with a specific expiration time of 1 hour, 30 minutes, and 30 seconds
EnableWebCache(new TimeSpan(1, 30, 30));
// Make an initial request to the website with a parse method
this.Request("http://www.Website.com", Parse);
}
Public Overrides Sub Init()
' Set the license key for IronWebScraper
License.LicenseKey = "LicenseKey"
' Set the logging level to capture all logs
Me.LoggingLevel = WebScraper.LogLevel.All
' Assign the working directory for the output files
Me.WorkingDirectory = AppSetting.GetAppRoot() & "\ShoppingSiteSample\Output\"
' Enable web cache with a specific expiration time of 1 hour, 30 minutes, and 30 seconds
EnableWebCache(New TimeSpan(1, 30, 30))
' Make an initial request to the website with a parse method
Me.Request("http://www.Website.com", AddressOf Parse)
End Sub
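The expiration passed to EnableWebCache above uses the TimeSpan(hours, minutes, seconds) constructor, which is easy to misread as days or milliseconds. The sketch below, in plain .NET, shows what that value amounts to. The IsFresh helper is hypothetical and only illustrates what an expiration check could look like; it is not taken from the library.

```csharp
using System;

// Same constructor arguments as the EnableWebCache call: hours, minutes, seconds.
TimeSpan expiry = new TimeSpan(1, 30, 30);
Console.WriteLine(expiry);              // prints 01:30:30
Console.WriteLine(expiry.TotalSeconds); // prints 5430

// Hypothetical freshness rule (not taken from the library): an entry is reusable
// while its age does not exceed the expiration time.
static bool IsFresh(DateTime savedAtUtc, DateTime nowUtc, TimeSpan maxAge)
    => nowUtc - savedAtUtc <= maxAge;

DateTime now = new DateTime(2024, 1, 1, 12, 0, 0, DateTimeKind.Utc);
Console.WriteLine(IsFresh(now.AddHours(-1), now, expiry)); // prints True
Console.WriteLine(IsFresh(now.AddHours(-2), now, expiry)); // prints False
```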
IronWebScraper can also continue scraping after your code is restarted. To enable this, give the engine a start process name by calling Start(CrawlID).
static void Main(string[] args)
{
// Create an object from the Scraper class
EngineScraper scrape = new EngineScraper();
// Start the scraping process with the specified crawl ID
scrape.Start("enginestate");
}
Shared Sub Main(ByVal args() As String)
' Create an object from the Scraper class
Dim scrape As New EngineScraper()
' Start the scraping process with the specified crawl ID
scrape.Start("enginestate")
End Sub
The executed requests and responses are saved in the SavedState folder inside the working directory.
Throttling
We can control the minimum and maximum number of connections, as well as the connection speed, per domain.
public override void Init()
{
// Set the license key for IronWebScraper
License.LicenseKey = "LicenseKey";
// Set the logging level to capture all logs
this.LoggingLevel = WebScraper.LogLevel.All;
// Assign the working directory for the output files
this.WorkingDirectory = AppSetting.GetAppRoot() + @"\ShoppingSiteSample\Output\";
// Set the total number of allowed open HTTP requests (threads)
this.MaxHttpConnectionLimit = 80;
// Set minimum polite delay (pause) between requests to a given domain or IP address
this.RateLimitPerHost = TimeSpan.FromMilliseconds(50);
// Set the allowed number of concurrent HTTP requests (threads) per hostname or IP address
this.OpenConnectionLimitPerHost = 25;
// Do not obey the robots.txt files
this.ObeyRobotsDotTxt = false;
// Set how requests are throttled; ThrottleMode lets the WebScraper throttle by hostname and also by host servers' IP addresses (set to domain hostname here)
this.ThrottleMode = Throttle.ByDomainHostName;
// Make an initial request to the website with a parse method
this.Request("https://www.Website.com", Parse);
}
Public Overrides Sub Init()
' Set the license key for IronWebScraper
License.LicenseKey = "LicenseKey"
' Set the logging level to capture all logs
Me.LoggingLevel = WebScraper.LogLevel.All
' Assign the working directory for the output files
Me.WorkingDirectory = AppSetting.GetAppRoot() & "\ShoppingSiteSample\Output\"
' Set the total number of allowed open HTTP requests (threads)
Me.MaxHttpConnectionLimit = 80
' Set minimum polite delay (pause) between requests to a given domain or IP address
Me.RateLimitPerHost = TimeSpan.FromMilliseconds(50)
' Set the allowed number of concurrent HTTP requests (threads) per hostname or IP address
Me.OpenConnectionLimitPerHost = 25
' Do not obey the robots.txt files
Me.ObeyRobotsDotTxt = False
' Set how requests are throttled; ThrottleMode lets the WebScraper throttle by hostname and also by host servers' IP addresses (set to domain hostname here)
Me.ThrottleMode = Throttle.ByDomainHostName
' Make an initial request to the website with a parse method
Me.Request("https://www.Website.com", AddressOf Parse)
End Sub
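Throttling Properties
MaxHttpConnectionLimit: The total number of allowed open HTTP requests (threads).
RateLimitPerHost: The minimum polite delay (pause) between requests to a given domain or IP address, set as a TimeSpan (50 milliseconds in the example above).
OpenConnectionLimitPerHost: The allowed number of concurrent HTTP requests (threads) per hostname.
ThrottleMode: Makes the WebScraper intelligently throttle requests not only by hostname, but also by host servers' IP addresses. This is the polite choice when several scraped domains may be hosted on the same machine.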
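To get a feel for what the values in the Init() example above imply, the arithmetic can be worked out in plain .NET. This is a rough sketch under a stated assumption: that the delay is enforced strictly between consecutive request starts to one host. The library's actual scheduling may differ.

```csharp
using System;

// Values copied from the Init() example above.
TimeSpan rateLimitPerHost = TimeSpan.FromMilliseconds(50);
int openConnectionLimitPerHost = 25;
int maxHttpConnectionLimit = 80;

// Assumption: one new request per host may start every rateLimitPerHost interval.
double maxRequestStartsPerSecondPerHost = 1000.0 / rateLimitPerHost.TotalMilliseconds;

// Hosts that could sit at their per-host connection limit at once within the global limit.
int hostsAtFullConcurrency = maxHttpConnectionLimit / openConnectionLimitPerHost;

Console.WriteLine(maxRequestStartsPerSecondPerHost); // prints 20
Console.WriteLine(hostsAtFullConcurrency);           // prints 3
```

Under that assumption, a 50 ms delay caps each host at about 20 new requests per second, and a global limit of 80 connections lets three hosts run at the full 25 concurrent connections, with 5 connections left over.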
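Frequently Asked Questions
How do I authenticate users in C# on websites that require a login?
You can use the HttpIdentity feature and set properties such as NetworkDomain, NetworkUsername, and NetworkPassword.
What is the benefit of using the web cache during development?
The web cache stores requested pages for reuse, so you avoid connecting to the live website repeatedly. This saves time and resources, and is especially useful during development and testing.
How can I manage multiple login sessions when scraping?
IronWebScraper allows thousands of unique user credentials and browser engines to be used to simulate multiple login sessions, which helps prevent websites from detecting and blocking the scraper.
What advanced throttling options are available for web crawling?
IronWebScraper provides a ThrottleMode setting that manages request throttling by hostname and by IP address, which keeps scraping polite when several domains share the same host.
How do I use proxies with IronWebScraper?
To use proxies, define an array of proxy servers and assign each one to an HttpIdentity instance in IronWebScraper, so that requests are routed through different IP addresses for anonymity and access control.
How does IronWebScraper handle request delays to prevent server overload?
The RateLimitPerHost setting in IronWebScraper specifies the minimum delay between requests to a given domain or IP address. Spacing requests out in this way helps prevent server overload.
Can scraping resume after an interruption?
Yes. IronWebScraper can resume after an interruption through the Start(CrawlID) method, which saves the execution state and continues from the last saved point.
How do I control the number of concurrent HTTP connections in a web scraper?
In IronWebScraper, you can set the MaxHttpConnectionLimit property to control the total number of open HTTP requests allowed, which helps manage server load and resources.
What logging options are available for scraping activity?
IronWebScraper lets you set the logging level through the LoggingLevel property, enabling full logging during scraping operations for detailed analysis and troubleshooting.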

