Advanced Webscraping Features in C#
The HttpIdentity Feature
Some website systems require the user to be logged in before content can be viewed; in that case, we can use HttpIdentity. Here is how to set it up:
// Create a new instance of HttpIdentity
HttpIdentity id = new HttpIdentity();
// Set the network username and password for authentication
id.NetworkUsername = "username";
id.NetworkPassword = "pwd";
// Add the identity to the collection of identities
Identities.Add(id);
' Create a new instance of HttpIdentity
Dim id As New HttpIdentity()
' Set the network username and password for authentication
id.NetworkUsername = "username"
id.NetworkPassword = "pwd"
' Add the identity to the collection of identities
Identities.Add(id)
One of the most impressive and powerful features of IronWebScraper is the ability to use thousands of unique user credentials and/or browser engines to simulate or scrape data from websites using multiple login sessions.
public override void Init()
{
// Set the license key for IronWebScraper
License.LicenseKey = "LicenseKey";
// Set the logging level to capture all logs
this.LoggingLevel = WebScraper.LogLevel.All;
// Assign the working directory for the output files
this.WorkingDirectory = AppSetting.GetAppRoot() + @"\ShoppingSiteSample\Output\";
// Define an array of proxies
var proxies = "IP-Proxy1:8080,IP-Proxy2:8081".Split(',');
// Iterate over common Chrome desktop user agents
foreach (var UA in IronWebScraper.CommonUserAgents.ChromeDesktopUserAgents)
{
// Iterate over the proxies
foreach (var proxy in proxies)
{
// Add a new HTTP identity with specific user agent and proxy
Identities.Add(new HttpIdentity()
{
UserAgent = UA,
UseCookies = true,
Proxy = proxy
});
}
}
// Make an initial request to the website with a parse method
this.Request("http://www.Website.com", Parse);
}
Public Overrides Sub Init()
' Set the license key for IronWebScraper
License.LicenseKey = "LicenseKey"
' Set the logging level to capture all logs
Me.LoggingLevel = WebScraper.LogLevel.All
' Assign the working directory for the output files
Me.WorkingDirectory = AppSetting.GetAppRoot() & "\ShoppingSiteSample\Output\"
' Define an array of proxies
Dim proxies = "IP-Proxy1:8080,IP-Proxy2:8081".Split(","c)
' Iterate over common Chrome desktop user agents
For Each UA In IronWebScraper.CommonUserAgents.ChromeDesktopUserAgents
' Iterate over the proxies
For Each proxy In proxies
' Add a new HTTP identity with specific user agent and proxy
Identities.Add(New HttpIdentity() With {
.UserAgent = UA,
.UseCookies = True,
.Proxy = proxy
})
Next proxy
Next UA
' Make an initial request to the website with a parse method
Me.Request("http://www.Website.com", Parse)
End Sub
Many properties are available that enable different behaviors, helping you avoid being blocked by websites.
Some of these properties are:
NetworkDomain: The network domain to be used for user authentication. Supports Windows, NTLM, Kerberos, Linux, BSD, and Mac OS X networks. Must be used with NetworkUsername and NetworkPassword.
NetworkUsername: The network/HTTP username to be used for user authentication. Supports HTTP, Windows networks, NTLM, Kerberos, Linux networks, BSD networks, and Mac OS.
NetworkPassword: The network/HTTP password to be used for user authentication. Supports HTTP, Windows networks, NTLM, Kerberos, Linux networks, BSD networks, and Mac OS.
Proxy: Sets the proxy settings.
UserAgent: Sets the browser engine (e.g., Chrome desktop, Chrome mobile, Chrome tablet, IE, Firefox, etc.).
HttpRequestHeaders: Custom header values to be used with this identity; accepts a Dictionary<string, string> object.
UseCookies: Enables/disables the use of cookies.
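As a sketch of how these properties combine, the snippet below configures a single identity with a proxy, a user agent, custom request headers, and cookies enabled, following the object-initializer style used in the examples above. The proxy address, user agent string, and header values are placeholders for illustration only:

```csharp
using System.Collections.Generic;

// A minimal sketch: configure one HttpIdentity with several of the
// properties listed above. The proxy address and header values here
// are placeholders, not real endpoints.
HttpIdentity identity = new HttpIdentity()
{
    UserAgent = "Mozilla/5.0 (Windows NT 10.0; Win64; x64)", // browser engine to emulate
    Proxy = "IP-Proxy1:8080",                                // route requests through this proxy
    UseCookies = true,                                       // keep cookies between requests
    // Custom headers sent with every request made by this identity
    HttpRequestHeaders = new Dictionary<string, string>()
    {
        { "Accept-Language", "en-US,en;q=0.9" },
        { "X-Custom-Header", "example-value" }
    }
};
// Register the identity so the scraper can use it
Identities.Add(identity);
```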
IronWebScraper runs the scraper using random identities. If we need to specify that a particular identity should be used to parse a page, we can do so:
public override void Init()
{
// Set the license key for IronWebScraper
License.LicenseKey = "LicenseKey";
// Set the logging level to capture all logs
this.LoggingLevel = WebScraper.LogLevel.All;
// Assign the working directory for the output files
this.WorkingDirectory = AppSetting.GetAppRoot() + @"\ShoppingSiteSample\Output\";
// Create a new instance of HttpIdentity
HttpIdentity identity = new HttpIdentity();
// Set the network username and password for authentication
identity.NetworkUsername = "username";
identity.NetworkPassword = "pwd";
// Add the identity to the collection of identities
Identities.Add(identity);
// Make a request to the website with the specified identity
this.Request("http://www.Website.com", Parse, identity);
}
Public Overrides Sub Init()
' Set the license key for IronWebScraper
License.LicenseKey = "LicenseKey"
' Set the logging level to capture all logs
Me.LoggingLevel = WebScraper.LogLevel.All
' Assign the working directory for the output files
Me.WorkingDirectory = AppSetting.GetAppRoot() & "\ShoppingSiteSample\Output\"
' Create a new instance of HttpIdentity
Dim identity As New HttpIdentity()
' Set the network username and password for authentication
identity.NetworkUsername = "username"
identity.NetworkPassword = "pwd"
' Add the identity to the collection of identities
Identities.Add(identity)
' Make a request to the website with the specified identity
Me.Request("http://www.Website.com", Parse, identity)
End Sub
Enable the Cache Feature
This feature is used to cache requested pages. It is often used during the development and testing phases, enabling developers to cache required pages for reuse after updating code. This lets you execute your code against cached pages after restarting your web scraper, without needing to connect to the live website every time (action-replay).
You can use it in the Init() method:
// Enable web cache without an expiration time
EnableWebCache();
// OR enable web cache with a specified expiration time
EnableWebCache(new TimeSpan(1, 30, 30));
' Enable web cache without an expiration time
EnableWebCache()
' OR enable web cache with a specified expiration time
EnableWebCache(New TimeSpan(1, 30, 30))
It will save the cached data to the WebCache directory inside the working directory.
public override void Init()
{
// Set the license key for IronWebScraper
License.LicenseKey = "LicenseKey";
// Set the logging level to capture all logs
this.LoggingLevel = WebScraper.LogLevel.All;
// Assign the working directory for the output files
this.WorkingDirectory = AppSetting.GetAppRoot() + @"\ShoppingSiteSample\Output\";
// Enable web cache with a specific expiration time of 1 hour, 30 minutes, and 30 seconds
EnableWebCache(new TimeSpan(1, 30, 30));
// Make an initial request to the website with a parse method
this.Request("http://www.Website.com", Parse);
}
Public Overrides Sub Init()
' Set the license key for IronWebScraper
License.LicenseKey = "LicenseKey"
' Set the logging level to capture all logs
Me.LoggingLevel = WebScraper.LogLevel.All
' Assign the working directory for the output files
Me.WorkingDirectory = AppSetting.GetAppRoot() & "\ShoppingSiteSample\Output\"
' Enable web cache with a specific expiration time of 1 hour, 30 minutes, and 30 seconds
EnableWebCache(New TimeSpan(1, 30, 30))
' Make an initial request to the website with a parse method
Me.Request("http://www.Website.com", Parse)
End Sub
IronWebScraper also has features that let you continue scraping after restarting your code, by setting the engine's start process name using Start(CrawlID).
static void Main(string[] args)
{
// Create an object from the Scraper class
EngineScraper scrape = new EngineScraper();
// Start the scraping process with the specified crawl ID
scrape.Start("enginestate");
}
Shared Sub Main(ByVal args() As String)
' Create an object from the Scraper class
Dim scrape As New EngineScraper()
' Start the scraping process with the specified crawl ID
scrape.Start("enginestate")
End Sub
The execution request and response will be saved to the SavedState directory inside the working directory.
Throttling
We can control the minimum and maximum number of connections, as well as the connection speed, per domain.
public override void Init()
{
// Set the license key for IronWebScraper
License.LicenseKey = "LicenseKey";
// Set the logging level to capture all logs
this.LoggingLevel = WebScraper.LogLevel.All;
// Assign the working directory for the output files
this.WorkingDirectory = AppSetting.GetAppRoot() + @"\ShoppingSiteSample\Output\";
// Set the total number of allowed open HTTP requests (threads)
this.MaxHttpConnectionLimit = 80;
// Set minimum polite delay (pause) between requests to a given domain or IP address
this.RateLimitPerHost = TimeSpan.FromMilliseconds(50);
// Set the allowed number of concurrent HTTP requests (threads) per hostname or IP address
this.OpenConnectionLimitPerHost = 25;
// Do not obey the robots.txt files
this.ObeyRobotsDotTxt = false;
// Makes the WebScraper intelligently throttle requests not only by hostname, but also by host servers' IP addresses
this.ThrottleMode = Throttle.ByDomainHostName;
// Make an initial request to the website with a parse method
this.Request("https://www.Website.com", Parse);
}
Public Overrides Sub Init()
' Set the license key for IronWebScraper
License.LicenseKey = "LicenseKey"
' Set the logging level to capture all logs
Me.LoggingLevel = WebScraper.LogLevel.All
' Assign the working directory for the output files
Me.WorkingDirectory = AppSetting.GetAppRoot() & "\ShoppingSiteSample\Output\"
' Set the total number of allowed open HTTP requests (threads)
Me.MaxHttpConnectionLimit = 80
' Set minimum polite delay (pause) between requests to a given domain or IP address
Me.RateLimitPerHost = TimeSpan.FromMilliseconds(50)
' Set the allowed number of concurrent HTTP requests (threads) per hostname or IP address
Me.OpenConnectionLimitPerHost = 25
' Do not obey the robots.txt files
Me.ObeyRobotsDotTxt = False
' Makes the WebScraper intelligently throttle requests not only by hostname, but also by host servers' IP addresses
Me.ThrottleMode = Throttle.ByDomainHostName
' Make an initial request to the website with a parse method
Me.Request("https://www.Website.com", Parse)
End Sub
Throttling Properties
MaxHttpConnectionLimit
The total number of allowed open HTTP requests (threads).
RateLimitPerHost
The minimum polite delay or pause (in milliseconds) between requests to a given domain or IP address.
OpenConnectionLimitPerHost
The allowed number of concurrent HTTP requests (threads) per hostname.
ThrottleMode
Makes the WebScraper intelligently throttle requests not only by hostname, but also by the host servers' IP addresses. This is polite in case multiple scraped domains are hosted on the same machine.
Get Started with IronWebScraper
Start using IronWebScraper in your project today with a free trial.
Frequently Asked Questions
How can I authenticate users on websites requiring login in C#?
You can utilize the HttpIdentity feature in IronWebScraper to authenticate users by setting up properties like NetworkDomain, NetworkUsername, and NetworkPassword.
What is the benefit of using web cache during development?
The web cache feature allows you to cache requested pages for reuse, which helps save time and resources by avoiding repeated connections to live websites, especially useful during the development and testing phases.
How can I manage multiple login sessions in web scraping?
IronWebScraper allows the use of thousands of unique user credentials and browser engines to simulate multiple login sessions, which helps prevent websites from detecting and blocking the scraper.
What are the advanced throttling options in web scraping?
IronWebScraper offers a ThrottleMode setting that intelligently manages request throttling based on hostnames and IP addresses, ensuring respectful interaction with shared hosting environments.
How can I use a proxy with IronWebScraper?
To use a proxy, define an array of proxies and associate them with HttpIdentity instances in IronWebScraper, allowing requests to be routed through different IP addresses for anonymity and access control.
How does IronWebScraper handle request delays to prevent server overload?
The RateLimitPerHost setting in IronWebScraper specifies the minimum delay between requests to a specific domain or IP address, helping to prevent server overload by spacing out requests.
Can web scraping be resumed after an interruption?
Yes, IronWebScraper can resume scraping after an interruption by using the Start(CrawlID) method, which saves the execution state and resumes from the last saved point.
How do I control the number of concurrent HTTP connections in a web scraper?
In IronWebScraper, you can set the MaxHttpConnectionLimit property to control the total number of allowed open HTTP requests, helping to manage server load and resources.
What options are available for logging web scraping activities?
IronWebScraper allows you to set the logging level using the LoggingLevel property, enabling comprehensive logging for detailed analysis and troubleshooting during scraping operations.

