Advanced Web Scraping Features in C#
The HttpIdentity Feature
Some websites require users to log in before their content becomes visible; in these cases we can use HttpIdentity. Here is how to set it up:
// Create a new instance of HttpIdentity
HttpIdentity id = new HttpIdentity();
// Set the network username and password for authentication
id.NetworkUsername = "username";
id.NetworkPassword = "pwd";
// Add the identity to the collection of identities
Identities.Add(id);
One of IronWebScraper's most impressive and powerful features is its ability to spoof or scrape websites through multiple login sessions, using thousands of unique user credentials and/or browser engines.
public override void Init()
{
// Set the license key for IronWebScraper
License.LicenseKey = "LicenseKey";
// Set the logging level to capture all logs
this.LoggingLevel = WebScraper.LogLevel.All;
// Assign the working directory for the output files
this.WorkingDirectory = AppSetting.GetAppRoot() + @"\ShoppingSiteSample\Output\";
// Define an array of proxies
var proxies = "IP-Proxy1:8080,IP-Proxy2:8081".Split(',');
// Iterate over common Chrome desktop user agents
foreach (var UA in IronWebScraper.CommonUserAgents.ChromeDesktopUserAgents)
{
// Iterate over the proxies
foreach (var proxy in proxies)
{
// Add a new HTTP identity with specific user agent and proxy
Identities.Add(new HttpIdentity()
{
UserAgent = UA,
UseCookies = true,
Proxy = proxy
});
}
}
// Make an initial request to the website with a parse method
this.Request("http://www.Website.com", Parse);
}
A number of properties are available that each produce different behavior, helping prevent websites from blocking you.
These properties include:
- NetworkDomain: the network domain used for user authentication. Supports Windows, NTLM, Kerberos, Linux, BSD, and Mac OS X networks. Must be used together with NetworkUsername and NetworkPassword.
- NetworkUsername: the network/HTTP username used for user authentication. Supports HTTP, Windows networks, NTLM, Kerberos, Linux networks, BSD networks, and Mac OS.
- NetworkPassword: the network/HTTP password used for user authentication. Supports HTTP, Windows networks, NTLM, Kerberos, Linux networks, BSD networks, and Mac OS.
- Proxy: sets the proxy settings.
- UserAgent: sets the browser engine (e.g., Chrome desktop, Chrome mobile, Chrome tablet, IE, Firefox, etc.).
- HttpRequestHeaders: specifies custom header values to use with this identity; accepts a Dictionary<string, string>.
- UseCookies: enables/disables the use of cookies.
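As a minimal sketch of the HttpRequestHeaders property described above, the custom headers are built as an ordinary Dictionary<string, string>. The header names and values below are illustrative assumptions, not values required by the library; the IronWebScraper-specific part is shown as a comment so the snippet stays self-contained:

```csharp
using System;
using System.Collections.Generic;

class HeadersExample
{
    static void Main()
    {
        // Custom header values to attach to an identity.
        // HttpIdentity.HttpRequestHeaders accepts a Dictionary<string, string>.
        var headers = new Dictionary<string, string>
        {
            { "Accept-Language", "en-US,en;q=0.9" },
            { "Referer", "http://www.Website.com" }
        };

        // Inside a WebScraper's Init() method you would then attach them (sketch):
        // Identities.Add(new HttpIdentity
        // {
        //     HttpRequestHeaders = headers,
        //     UseCookies = true
        // });

        foreach (var kv in headers)
            Console.WriteLine($"{kv.Key}: {kv.Value}");
    }
}
```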
IronWebScraper runs the scraper using random identities. If we need to parse a page using one specific identity, we can do so:
public override void Init()
{
// Set the license key for IronWebScraper
License.LicenseKey = "LicenseKey";
// Set the logging level to capture all logs
this.LoggingLevel = WebScraper.LogLevel.All;
// Assign the working directory for the output files
this.WorkingDirectory = AppSetting.GetAppRoot() + @"\ShoppingSiteSample\Output\";
// Create a new instance of HttpIdentity
HttpIdentity identity = new HttpIdentity();
// Set the network username and password for authentication
identity.NetworkUsername = "username";
identity.NetworkPassword = "pwd";
// Add the identity to the collection of identities
Identities.Add(identity);
// Make a request to the website with the specified identity
this.Request("http://www.Website.com", Parse, identity);
}
Enabling the Web Cache Feature
This feature caches requested pages. It is commonly used during development and testing, letting developers cache the pages they need so they can be reused after code updates. That way, your code can run against the cached pages even after the web scraper restarts, without connecting to the live website every time (action replay).
You can use it in the Init() method:
// Enable web cache without an expiration time
EnableWebCache();
// OR enable web cache with a specified expiration time
EnableWebCache(new TimeSpan(1, 30, 30));
It saves your cached data to the WebCache folder under the working directory folder.
public override void Init()
{
// Set the license key for IronWebScraper
License.LicenseKey = "LicenseKey";
// Set the logging level to capture all logs
this.LoggingLevel = WebScraper.LogLevel.All;
// Assign the working directory for the output files
this.WorkingDirectory = AppSetting.GetAppRoot() + @"\ShoppingSiteSample\Output\";
// Enable web cache with a specific expiration time of 1 hour, 30 minutes, and 30 seconds
EnableWebCache(new TimeSpan(1, 30, 30));
// Make an initial request to the website with a parse method
this.Request("http://www.Website.com", Parse);
}
IronWebScraper also provides a feature that lets the engine continue scraping after a code restart: setting the engine's start process name using Start(CrawlID).
static void Main(string[] args)
{
// Create an object from the Scraper class
EngineScraper scrape = new EngineScraper();
// Start the scraping process with the specified crawl ID
scrape.Start("enginestate");
}
Execution requests and responses are saved in the SavedState folder inside the working directory.
Throttling
We can control the minimum and maximum number of connections per domain, as well as the connection speed.
public override void Init()
{
// Set the license key for IronWebScraper
License.LicenseKey = "LicenseKey";
// Set the logging level to capture all logs
this.LoggingLevel = WebScraper.LogLevel.All;
// Assign the working directory for the output files
this.WorkingDirectory = AppSetting.GetAppRoot() + @"\ShoppingSiteSample\Output\";
// Set the total number of allowed open HTTP requests (threads)
this.MaxHttpConnectionLimit = 80;
// Set minimum polite delay (pause) between requests to a given domain or IP address
this.RateLimitPerHost = TimeSpan.FromMilliseconds(50);
// Set the allowed number of concurrent HTTP requests (threads) per hostname or IP address
this.OpenConnectionLimitPerHost = 25;
// Do not obey the robots.txt files
this.ObeyRobotsDotTxt = false;
// Makes the WebScraper intelligently throttle requests not only by hostname, but also by host servers' IP addresses
this.ThrottleMode = Throttle.ByDomainHostName;
// Make an initial request to the website with a parse method
this.Request("https://www.Website.com", Parse);
}
Throttling properties
- MaxHttpConnectionLimit: the total number of open HTTP requests (threads) allowed.
- RateLimitPerHost: the minimum polite delay or pause (in milliseconds) between requests to a given domain or IP address.
- OpenConnectionLimitPerHost: the number of concurrent HTTP requests (threads) allowed per hostname.
- ThrottleMode: makes the WebScraper intelligently throttle requests not only by hostname, but also by the host servers' IP addresses. This is polite when multiple scraped domains are hosted on the same machine.
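As a quick sanity check on these settings, the per-host request ceiling implied by RateLimitPerHost can be computed directly. This sketch reuses the 50 ms, 25-connection, and 80-connection values from the Init() example above:

```csharp
using System;

class ThrottleMath
{
    static void Main()
    {
        // Values from the Init() example above.
        TimeSpan rateLimitPerHost = TimeSpan.FromMilliseconds(50);
        int openConnectionLimitPerHost = 25;
        int maxHttpConnectionLimit = 80;

        // With a 50 ms polite delay between requests to one host, the scraper
        // issues at most 1000 / 50 = 20 requests per second to that host.
        double maxRequestsPerSecondPerHost = 1000.0 / rateLimitPerHost.TotalMilliseconds;
        Console.WriteLine($"Per-host ceiling: {maxRequestsPerSecondPerHost} requests/second");

        // The connection limits cap concurrency, not rate: no more than 25
        // simultaneous requests to one host, and 80 across all hosts.
        Console.WriteLine($"Concurrent per host: {openConnectionLimitPerHost}, total: {maxHttpConnectionLimit}");
    }
}
```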
Get Started with IronWebScraper
Frequently Asked Questions
How do I authenticate against websites that require a login, using C#?
You can use IronWebScraper's HttpIdentity feature, setting properties such as NetworkDomain, NetworkUsername, and NetworkPassword to authenticate the user.
What are the benefits of using the web cache during development?
The web cache feature lets you cache requested pages for reuse, avoiding repeated connections to the live website. This saves time and resources, which is especially useful during the development and testing phases.
How do I manage multiple login sessions in a web scraper?
IronWebScraper can simulate multiple login sessions using thousands of unique user credentials and browser engines, which helps prevent websites from detecting and blocking the scraper.
What advanced rate-limiting options are available for web scrapers?
IronWebScraper provides a ThrottleMode setting that intelligently manages request throttling by both hostname and IP address, ensuring polite interaction with shared hosting environments.
How do I use proxies with IronWebScraper?
To use proxies, define an array of proxies and associate them with HttpIdentity instances, allowing requests to be routed through different IP addresses for anonymity and access control.
How does IronWebScraper handle request delays to prevent server overload?
The RateLimitPerHost setting specifies the minimum delay between requests to a given domain or IP address, helping prevent server overload by spacing out requests.
Can a web scraper resume after an interruption?
Yes. IronWebScraper can resume scraping after an interruption using the Start(CrawlID) method, which saves the execution state and restores it from the last saved point.
How do I control the number of concurrent HTTP connections in a web scraper?
In IronWebScraper, you can set the MaxHttpConnectionLimit property to control the total number of open HTTP requests allowed, which helps manage server load and resources.
What options are available for logging web scraper activity?
IronWebScraper lets you set the logging level with the LoggingLevel property, enabling comprehensive logging during scraping jobs for detailed analysis and troubleshooting.