Advanced Web Scraping Features in C#
The HttpIdentity Feature
Some website systems require the user to log in before content can be viewed; in that case, we can use HttpIdentity. Here is how to set it up:
// Create a new instance of HttpIdentity
HttpIdentity id = new HttpIdentity();
// Set the network username and password for authentication
id.NetworkUsername = "username";
id.NetworkPassword = "pwd";
// Add the identity to the collection of identities
Identities.Add(id);

' Create a new instance of HttpIdentity
Dim id As New HttpIdentity()
' Set the network username and password for authentication
id.NetworkUsername = "username"
id.NetworkPassword = "pwd"
' Add the identity to the collection of identities
Identities.Add(id)

One of the most impressive and powerful features of IronWebScraper is its ability to use thousands of unique user credentials and/or browser engines to spoof or scrape websites across multiple login sessions.
public override void Init()
{
    // Set the license key for IronWebScraper
    License.LicenseKey = "LicenseKey";
    // Set the logging level to capture all logs
    this.LoggingLevel = WebScraper.LogLevel.All;
    // Assign the working directory for the output files
    this.WorkingDirectory = AppSetting.GetAppRoot() + @"\ShoppingSiteSample\Output\";
    // Define an array of proxies
    var proxies = "IP-Proxy1:8080,IP-Proxy2:8081".Split(',');
    // Iterate over common Chrome desktop user agents
    foreach (var UA in IronWebScraper.CommonUserAgents.ChromeDesktopUserAgents)
    {
        // Iterate over the proxies
        foreach (var proxy in proxies)
        {
            // Add a new HTTP identity with a specific user agent and proxy
            Identities.Add(new HttpIdentity()
            {
                UserAgent = UA,
                UseCookies = true,
                Proxy = proxy
            });
        }
    }
    // Make an initial request to the website with a parse method
    this.Request("http://www.Website.com", Parse);
}

Public Overrides Sub Init()
    ' Set the license key for IronWebScraper
    License.LicenseKey = "LicenseKey"
    ' Set the logging level to capture all logs
    Me.LoggingLevel = WebScraper.LogLevel.All
    ' Assign the working directory for the output files
    Me.WorkingDirectory = AppSetting.GetAppRoot() & "\ShoppingSiteSample\Output\"
    ' Define an array of proxies
    Dim proxies = "IP-Proxy1:8080,IP-Proxy2:8081".Split(","c)
    ' Iterate over common Chrome desktop user agents
    For Each UA In IronWebScraper.CommonUserAgents.ChromeDesktopUserAgents
        ' Iterate over the proxies
        For Each proxy In proxies
            ' Add a new HTTP identity with a specific user agent and proxy
            Identities.Add(New HttpIdentity() With {
                .UserAgent = UA,
                .UseCookies = True,
                .Proxy = proxy
            })
        Next proxy
    Next UA
    ' Make an initial request to the website with a parse method
    Me.Request("http://www.Website.com", Parse)
End Sub

You have a number of properties at your disposal that give the scraper different behaviors, helping to prevent websites from blocking you.
These properties include:
- NetworkDomain: the network domain to be used for user authentication. Supports Windows, NTLM, Kerberos, Linux, BSD, and Mac OS X networks. Must be used with NetworkUsername and NetworkPassword.
- NetworkUsername: the network/HTTP username to be used for user authentication. Supports HTTP, Windows networks, NTLM, Kerberos, Linux networks, BSD networks, and Mac OS.
- NetworkPassword: the network/HTTP password to be used for user authentication. Supports HTTP, Windows networks, NTLM, Kerberos, Linux networks, BSD networks, and Mac OS.
- Proxy: sets the proxy settings.
- UserAgent: sets the browser engine (e.g., Chrome desktop, Chrome mobile, Chrome tablet, IE, and Firefox).
- HttpRequestHeaders: specifies custom header values to be used with this identity; accepts a dictionary object (Dictionary<string, string>), as shown in the sketch below.
- UseCookies: enables/disables the use of cookies.
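As a quick illustration, the following minimal sketch combines several of these properties on a single identity inside Init(). The proxy address, user-agent string, and header values are placeholder assumptions, not values supplied by IronWebScraper (Dictionary<string, string> comes from System.Collections.Generic):

// A minimal sketch: one identity combining several of the properties listed above
HttpIdentity identity = new HttpIdentity();
identity.Proxy = "IP-Proxy1:8080";                                 // placeholder proxy address
identity.UserAgent = "Mozilla/5.0 (Windows NT 10.0; Win64; x64)";  // placeholder user-agent string
identity.UseCookies = true;
// Custom headers are passed as a Dictionary<string, string>
identity.HttpRequestHeaders = new Dictionary<string, string>()
{
    { "Accept-Language", "en-US" }   // placeholder header value
};
Identities.Add(identity);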
IronWebScraper runs the scraper using random identities. If we need to specify that a particular identity should be used to parse a page, we can do so as follows:
public override void Init()
{
    // Set the license key for IronWebScraper
    License.LicenseKey = "LicenseKey";
    // Set the logging level to capture all logs
    this.LoggingLevel = WebScraper.LogLevel.All;
    // Assign the working directory for the output files
    this.WorkingDirectory = AppSetting.GetAppRoot() + @"\ShoppingSiteSample\Output\";
    // Create a new instance of HttpIdentity
    HttpIdentity identity = new HttpIdentity();
    // Set the network username and password for authentication
    identity.NetworkUsername = "username";
    identity.NetworkPassword = "pwd";
    // Add the identity to the collection of identities
    Identities.Add(identity);
    // Make a request to the website with the specified identity
    this.Request("http://www.Website.com", Parse, identity);
}

Public Overrides Sub Init()
    ' Set the license key for IronWebScraper
    License.LicenseKey = "LicenseKey"
    ' Set the logging level to capture all logs
    Me.LoggingLevel = WebScraper.LogLevel.All
    ' Assign the working directory for the output files
    Me.WorkingDirectory = AppSetting.GetAppRoot() & "\ShoppingSiteSample\Output\"
    ' Create a new instance of HttpIdentity
    Dim identity As New HttpIdentity()
    ' Set the network username and password for authentication
    identity.NetworkUsername = "username"
    identity.NetworkPassword = "pwd"
    ' Add the identity to the collection of identities
    Identities.Add(identity)
    ' Make a request to the website with the specified identity
    Me.Request("http://www.Website.com", Parse, identity)
End Sub

Enabling the Web Cache Feature
This feature is used to cache requested pages. It is commonly used during the development and testing phases, allowing developers to cache the pages they need so they can be reused after code updates. This way, you can run your code against the cached pages even after restarting your web scraper, without connecting to the live website every time (action replay).
You can use it in the Init() method:
// Enable web cache without an expiration time
EnableWebCache();
// OR enable web cache with a specified expiration time
EnableWebCache(new TimeSpan(1, 30, 30));

' Enable web cache without an expiration time
EnableWebCache()
' OR enable web cache with a specified expiration time
EnableWebCache(New TimeSpan(1, 30, 30))

The cached data is saved in the WebCache folder under the working directory.
public override void Init()
{
    // Set the license key for IronWebScraper
    License.LicenseKey = "LicenseKey";
    // Set the logging level to capture all logs
    this.LoggingLevel = WebScraper.LogLevel.All;
    // Assign the working directory for the output files
    this.WorkingDirectory = AppSetting.GetAppRoot() + @"\ShoppingSiteSample\Output\";
    // Enable web cache with a specific expiration time of 1 hour, 30 minutes, and 30 seconds
    EnableWebCache(new TimeSpan(1, 30, 30));
    // Make an initial request to the website with a parse method
    this.Request("http://www.Website.com", Parse);
}

Public Overrides Sub Init()
    ' Set the license key for IronWebScraper
    License.LicenseKey = "LicenseKey"
    ' Set the logging level to capture all logs
    Me.LoggingLevel = WebScraper.LogLevel.All
    ' Assign the working directory for the output files
    Me.WorkingDirectory = AppSetting.GetAppRoot() & "\ShoppingSiteSample\Output\"
    ' Enable web cache with a specific expiration time of 1 hour, 30 minutes, and 30 seconds
    EnableWebCache(New TimeSpan(1, 30, 30))
    ' Make an initial request to the website with a parse method
    Me.Request("http://www.Website.com", Parse)
End Sub

IronWebScraper also includes features that allow the engine to continue scraping after a code restart, by naming the engine's start process using Start(CrawlID).
static void Main(string[] args)
{
    // Create an object from the Scraper class
    EngineScraper scrape = new EngineScraper();
    // Start the scraping process with the specified crawl ID
    scrape.Start("enginestate");
}

Shared Sub Main(ByVal args() As String)
    ' Create an object from the Scraper class
    Dim scrape As New EngineScraper()
    ' Start the scraping process with the specified crawl ID
    scrape.Start("enginestate")
End Sub

Execution requests and responses are saved in the SavedState folder inside the working directory.
Throttling
We can control the minimum and maximum number of connections per domain, as well as the connection speed.
public override void Init()
{
    // Set the license key for IronWebScraper
    License.LicenseKey = "LicenseKey";
    // Set the logging level to capture all logs
    this.LoggingLevel = WebScraper.LogLevel.All;
    // Assign the working directory for the output files
    this.WorkingDirectory = AppSetting.GetAppRoot() + @"\ShoppingSiteSample\Output\";
    // Set the total number of allowed open HTTP requests (threads)
    this.MaxHttpConnectionLimit = 80;
    // Set the minimum polite delay (pause) between requests to a given domain or IP address
    this.RateLimitPerHost = TimeSpan.FromMilliseconds(50);
    // Set the allowed number of concurrent HTTP requests (threads) per hostname or IP address
    this.OpenConnectionLimitPerHost = 25;
    // Do not obey the robots.txt files
    this.ObeyRobotsDotTxt = false;
    // Make the WebScraper intelligently throttle requests not only by hostname, but also by the host servers' IP addresses
    this.ThrottleMode = Throttle.ByDomainHostName;
    // Make an initial request to the website with a parse method
    this.Request("https://www.Website.com", Parse);
}

Public Overrides Sub Init()
    ' Set the license key for IronWebScraper
    License.LicenseKey = "LicenseKey"
    ' Set the logging level to capture all logs
    Me.LoggingLevel = WebScraper.LogLevel.All
    ' Assign the working directory for the output files
    Me.WorkingDirectory = AppSetting.GetAppRoot() & "\ShoppingSiteSample\Output\"
    ' Set the total number of allowed open HTTP requests (threads)
    Me.MaxHttpConnectionLimit = 80
    ' Set the minimum polite delay (pause) between requests to a given domain or IP address
    Me.RateLimitPerHost = TimeSpan.FromMilliseconds(50)
    ' Set the allowed number of concurrent HTTP requests (threads) per hostname or IP address
    Me.OpenConnectionLimitPerHost = 25
    ' Do not obey the robots.txt files
    Me.ObeyRobotsDotTxt = False
    ' Make the WebScraper intelligently throttle requests not only by hostname, but also by the host servers' IP addresses
    Me.ThrottleMode = Throttle.ByDomainHostName
    ' Make an initial request to the website with a parse method
    Me.Request("https://www.Website.com", Parse)
End Sub

Throttling Properties
- MaxHttpConnectionLimit: the total number of allowed open HTTP requests (threads).
- RateLimitPerHost: the minimum polite delay or pause (in milliseconds) between requests to a given domain or IP address.
- OpenConnectionLimitPerHost: the allowed number of concurrent HTTP requests (threads) per hostname or IP address.
- ThrottleMode: makes the WebScraper throttle requests intelligently, not only by hostname but also by the host servers' IP addresses. This is polite in case multiple scraped domains are hosted on the same machine.
Frequently Asked Questions
How do I authenticate users in C# on websites that require login?
You can use the HttpIdentity feature in IronWebScraper to authenticate users by setting properties such as NetworkDomain, NetworkUsername, and NetworkPassword.
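As a minimal sketch inside Init(), with placeholder credentials and a hypothetical domain name:

// A minimal sketch: authenticate against a network domain (all values are placeholders)
HttpIdentity id = new HttpIdentity();
id.NetworkDomain = "CORP";        // hypothetical Windows/NTLM/Kerberos domain
id.NetworkUsername = "username";
id.NetworkPassword = "pwd";
Identities.Add(id);
// Requests made with this identity carry the supplied credentials
this.Request("http://www.Website.com", Parse, id);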
What are the benefits of using the web cache during development?
The web cache feature lets you cache requested pages for reuse, which saves time and resources by avoiding repeated connections to the live website, especially during the development and testing phases.

How do I manage multiple login sessions in web scraping?
IronWebScraper can simulate multiple login sessions using thousands of unique user credentials and browser engines, which helps prevent websites from detecting and blocking the scraper.

What advanced throttling options are available for web scraping?
IronWebScraper provides the ThrottleMode setting, which intelligently manages request throttling by hostname and IP address, ensuring polite interaction with shared hosting environments.

How do I use proxies with IronWebScraper?
To use proxies, define an array of proxies and associate them with HttpIdentity instances in IronWebScraper, allowing requests to be routed through different IP addresses for anonymity and access control.

How does IronWebScraper handle request delays to prevent server overload?
The RateLimitPerHost setting in IronWebScraper specifies the minimum delay between requests to a given domain or IP address, helping prevent server overload by spacing out requests.

Can web scraping resume after an interruption?
Yes. IronWebScraper can resume scraping by using the Start(CrawlID) method, which saves the execution state and continues from the last saved point.

How do I control the number of concurrent HTTP connections in my web scraper?
In IronWebScraper, you can set the MaxHttpConnectionLimit property to control the total number of allowed open HTTP requests, helping manage server load and resources.

What options are available for logging web scraping activity?
IronWebScraper lets you set the logging level with the LoggingLevel property, providing comprehensive logging for detailed analysis and troubleshooting during scraping operations.






