Advanced Webscraping Features in C

This article was translated from English: Does it need improvement?
Translated
View the article in English

Funkcja HttpIdentity

Niektóre systemy stron internetowych wymagają, aby użytkownik był zalogowany, aby wyświetlić zawartość; w takim przypadku możemy użyć HttpIdentity. Oto jak to skonfigurować:

// Create a new instance of HttpIdentity
HttpIdentity id = new HttpIdentity();

// Set the network username and password for authentication
id.NetworkUsername = "username";
id.NetworkPassword = "pwd";

// Add the identity to the collection of identities
Identities.Add(id);
// Create a new instance of HttpIdentity
HttpIdentity id = new HttpIdentity();

// Set the network username and password for authentication
id.NetworkUsername = "username";
id.NetworkPassword = "pwd";

// Add the identity to the collection of identities
Identities.Add(id);
' Create a new instance of HttpIdentity
Dim id As New HttpIdentity()

' Set the network username and password for authentication
id.NetworkUsername = "username"
id.NetworkPassword = "pwd"

' Add the identity to the collection of identities
Identities.Add(id)
$vbLabelText   $csharpLabel

Jedną z najbardziej imponujących i potężnych funkcji w IronWebscraper jest możliwość użycia tysięcy unikalnych danych logowania użytkownika i/lub silników przeglądarek do symulowania lub zbierania danych z witryn korzystając z wielu sesji logowania.

public override void Init()
{
    // Set the license key for IronWebScraper
    License.LicenseKey = "LicenseKey";

    // Set the logging level to capture all logs
    this.LoggingLevel = WebScraper.LogLevel.All;

    // Assign the working directory for the output files
    this.WorkingDirectory = AppSetting.GetAppRoot() + @"\ShoppingSiteSample\Output\";

    // Define an array of proxies
    var proxies = "IP-Proxy1:8080,IP-Proxy2:8081".Split(',');

    // Iterate over common Chrome desktop user agents
    foreach (var UA in IronWebScraper.CommonUserAgents.ChromeDesktopUserAgents)
    {
        // Iterate over the proxies
        foreach (var proxy in proxies)
        {
            // Add a new HTTP identity with specific user agent and proxy
            Identities.Add(new HttpIdentity()
            {
                UserAgent = UA,
                UseCookies = true,
                Proxy = proxy
            });
        }
    }

    // Make an initial request to the website with a parse method
    this.Request("http://www.Website.com", Parse);
}
public override void Init()
{
    // Set the license key for IronWebScraper
    License.LicenseKey = "LicenseKey";

    // Set the logging level to capture all logs
    this.LoggingLevel = WebScraper.LogLevel.All;

    // Assign the working directory for the output files
    this.WorkingDirectory = AppSetting.GetAppRoot() + @"\ShoppingSiteSample\Output\";

    // Define an array of proxies
    var proxies = "IP-Proxy1:8080,IP-Proxy2:8081".Split(',');

    // Iterate over common Chrome desktop user agents
    foreach (var UA in IronWebScraper.CommonUserAgents.ChromeDesktopUserAgents)
    {
        // Iterate over the proxies
        foreach (var proxy in proxies)
        {
            // Add a new HTTP identity with specific user agent and proxy
            Identities.Add(new HttpIdentity()
            {
                UserAgent = UA,
                UseCookies = true,
                Proxy = proxy
            });
        }
    }

    // Make an initial request to the website with a parse method
    this.Request("http://www.Website.com", Parse);
}
Public Overrides Sub Init()
	' Set the license key for IronWebScraper
	License.LicenseKey = "LicenseKey"

	' Set the logging level to capture all logs
	Me.LoggingLevel = WebScraper.LogLevel.All

	' Assign the working directory for the output files
	Me.WorkingDirectory = AppSetting.GetAppRoot() & "\ShoppingSiteSample\Output\"

	' Define an array of proxies
	Dim proxies = "IP-Proxy1:8080,IP-Proxy2:8081".Split(","c)

	' Iterate over common Chrome desktop user agents
	For Each UA In IronWebScraper.CommonUserAgents.ChromeDesktopUserAgents
		' Iterate over the proxies
		For Each proxy In proxies
			' Add a new HTTP identity with specific user agent and proxy
			Identities.Add(New HttpIdentity() With {
				.UserAgent = UA,
				.UseCookies = True,
				.Proxy = proxy
			})
		Next proxy
	Next UA

	' Make an initial request to the website with a parse method
	Me.Request("http://www.Website.com", Parse)
End Sub
$vbLabelText   $csharpLabel

Masz wiele właściwości, które umożliwiają różne zachowania, aby uniknąć blokowania przez strony internetowe.

Niektóre z tych właściwości to:

  • NetworkDomain: Domeną sieciową do użycia w uwierzytelnianiu użytkownika. Obsługuje sieci Windows, NTLM, Kerberos, Linux, BSD i Mac OS X. Musi być używane z NetworkUsername i NetworkPassword.
  • NetworkUsername: Nazwa użytkownika sieciowa/http do użycia w uwierzytelnianiu użytkownika. Obsługuje HTTP, sieci Windows, NTLM, Kerberos, sieci Linux, sieci BSD i Mac OS.
  • NetworkPassword: Hasło sieciowe/http do użycia w uwierzytelnianiu użytkownika. Obsługuje HTTP, sieci Windows, NTLM, Kerberos, sieci Linux, sieci BSD i Mac OS.
  • Proxy: Do ustawienia ustawień proxy.
  • UserAgent: Do ustawienia silnika przeglądarki (np. Chrome desktop, Chrome mobile, Chrome tablet, IE, Firefox itp.).
  • HttpRequestHeaders: Dla niestandardowych wartości w nagłówkach, które będą używane z tą tożsamością, akceptuje obiekt słownika Dictionary<string, string>.
  • UseCookies: Włączanie/wyłączanie używania ciasteczek.

IronWebscraper uruchamia scraper korzystając z losowych tożsamości. Jeśli musimy określić użycie konkretnej tożsamości do analizy strony, możemy to zrobić:

public override void Init()
{
    // Set the license key for IronWebScraper
    License.LicenseKey = "LicenseKey";

    // Set the logging level to capture all logs
    this.LoggingLevel = WebScraper.LogLevel.All;

    // Assign the working directory for the output files
    this.WorkingDirectory = AppSetting.GetAppRoot() + @"\ShoppingSiteSample\Output\";

    // Create a new instance of HttpIdentity
    HttpIdentity identity = new HttpIdentity();

    // Set the network username and password for authentication
    identity.NetworkUsername = "username";
    identity.NetworkPassword = "pwd";

    // Add the identity to the collection of identities
    Identities.Add(identity);

    // Make a request to the website with the specified identity
    this.Request("http://www.Website.com", Parse, identity);
}
public override void Init()
{
    // Set the license key for IronWebScraper
    License.LicenseKey = "LicenseKey";

    // Set the logging level to capture all logs
    this.LoggingLevel = WebScraper.LogLevel.All;

    // Assign the working directory for the output files
    this.WorkingDirectory = AppSetting.GetAppRoot() + @"\ShoppingSiteSample\Output\";

    // Create a new instance of HttpIdentity
    HttpIdentity identity = new HttpIdentity();

    // Set the network username and password for authentication
    identity.NetworkUsername = "username";
    identity.NetworkPassword = "pwd";

    // Add the identity to the collection of identities
    Identities.Add(identity);

    // Make a request to the website with the specified identity
    this.Request("http://www.Website.com", Parse, identity);
}
Public Overrides Sub Init()
	' Set the license key for IronWebScraper
	License.LicenseKey = "LicenseKey"

	' Set the logging level to capture all logs
	Me.LoggingLevel = WebScraper.LogLevel.All

	' Assign the working directory for the output files
	Me.WorkingDirectory = AppSetting.GetAppRoot() & "\ShoppingSiteSample\Output\"

	' Create a new instance of HttpIdentity
	Dim identity As New HttpIdentity()

	' Set the network username and password for authentication
	identity.NetworkUsername = "username"
	identity.NetworkPassword = "pwd"

	' Add the identity to the collection of identities
	Identities.Add(identity)

	' Make a request to the website with the specified identity
	Me.Request("http://www.Website.com", Parse, identity)
End Sub
$vbLabelText   $csharpLabel

Włącz funkcję pamięci podręcznej

Ta funkcja służy do buforowania żądanych stron. Często używana w fazach rozwoju i testowania, umożliwiając programistom buforowanie wymaganych stron do ponownego użycia po zaktualizowaniu kodu. Umożliwia wykonanie kodu na stronie buforowanej po ponownym uruchomieniu webscrapera bez potrzeby łączenia się z aktywną stroną za każdym razem (action-replay).

Można używać jej w metodzie Init().

// Enable web cache without an expiration time
EnableWebCache();

// OR enable web cache with a specified expiration time
EnableWebCache(new TimeSpan(1, 30, 30));
// Enable web cache without an expiration time
EnableWebCache();

// OR enable web cache with a specified expiration time
EnableWebCache(new TimeSpan(1, 30, 30));
' Enable web cache without an expiration time
EnableWebCache()

' OR enable web cache with a specified expiration time
EnableWebCache(New TimeSpan(1, 30, 30))
$vbLabelText   $csharpLabel

Zapisze dane w katalogu WebCache w folderze roboczym.

public override void Init()
{
    // Set the license key for IronWebScraper
    License.LicenseKey = "LicenseKey";

    // Set the logging level to capture all logs
    this.LoggingLevel = WebScraper.LogLevel.All;

    // Assign the working directory for the output files
    this.WorkingDirectory = AppSetting.GetAppRoot() + @"\ShoppingSiteSample\Output\";

    // Enable web cache with a specific expiration time of 1 hour, 30 minutes, and 30 seconds
    EnableWebCache(new TimeSpan(1, 30, 30));

    // Make an initial request to the website with a parse method
    this.Request("http://www.Website.com", Parse);
}
public override void Init()
{
    // Set the license key for IronWebScraper
    License.LicenseKey = "LicenseKey";

    // Set the logging level to capture all logs
    this.LoggingLevel = WebScraper.LogLevel.All;

    // Assign the working directory for the output files
    this.WorkingDirectory = AppSetting.GetAppRoot() + @"\ShoppingSiteSample\Output\";

    // Enable web cache with a specific expiration time of 1 hour, 30 minutes, and 30 seconds
    EnableWebCache(new TimeSpan(1, 30, 30));

    // Make an initial request to the website with a parse method
    this.Request("http://www.Website.com", Parse);
}
Public Overrides Sub Init()
	' Set the license key for IronWebScraper
	License.LicenseKey = "LicenseKey"

	' Set the logging level to capture all logs
	Me.LoggingLevel = WebScraper.LogLevel.All

	' Assign the working directory for the output files
	Me.WorkingDirectory = AppSetting.GetAppRoot() & "\ShoppingSiteSample\Output\"

	' Enable web cache with a specific expiration time of 1 hour, 30 minutes, and 30 seconds
	EnableWebCache(New TimeSpan(1, 30, 30))

	' Make an initial request to the website with a parse method
	Me.Request("http://www.Website.com", Parse)
End Sub
$vbLabelText   $csharpLabel

IronWebscraper posiada również funkcje umożliwiające kontynuację scrapingu po ponownym uruchomieniu kodu, ustawiając nazwę procesu uruchamiania silnika za pomocą Start(CrawlID).

static void Main(string[] args)
{
    // Create an object from the Scraper class
    EngineScraper scrape = new EngineScraper();

    // Start the scraping process with the specified crawl ID
    scrape.Start("enginestate");
}
static void Main(string[] args)
{
    // Create an object from the Scraper class
    EngineScraper scrape = new EngineScraper();

    // Start the scraping process with the specified crawl ID
    scrape.Start("enginestate");
}
Shared Sub Main(ByVal args() As String)
	' Create an object from the Scraper class
	Dim scrape As New EngineScraper()

	' Start the scraping process with the specified crawl ID
	scrape.Start("enginestate")
End Sub
$vbLabelText   $csharpLabel

Żądanie wykonania i odpowiedź zostaną zapisane w katalogu SavedState w folderze roboczym.

Regulacja obciążenia

Możemy kontrolować minimalną i maksymalną liczbę połączeń oraz prędkość połączeń na domenę.

public override void Init()
{
    // Set the license key for IronWebScraper
    License.LicenseKey = "LicenseKey";

    // Set the logging level to capture all logs
    this.LoggingLevel = WebScraper.LogLevel.All;

    // Assign the working directory for the output files
    this.WorkingDirectory = AppSetting.GetAppRoot() + @"\ShoppingSiteSample\Output\";

    // Set the total number of allowed open HTTP requests (threads)
    this.MaxHttpConnectionLimit = 80;

    // Set minimum polite delay (pause) between requests to a given domain or IP address
    this.RateLimitPerHost = TimeSpan.FromMilliseconds(50);

    // Set the allowed number of concurrent HTTP requests (threads) per hostname or IP address
    this.OpenConnectionLimitPerHost = 25;

    // Do not obey the robots.txt files
    this.ObeyRobotsDotTxt = false;

    // Makes the WebScraper intelligently throttle requests not only by hostname, but also by host servers' IP addresses
    this.ThrottleMode = Throttle.ByDomainHostName;

    // Make an initial request to the website with a parse method
    this.Request("https://www.Website.com", Parse);
}
public override void Init()
{
    // Set the license key for IronWebScraper
    License.LicenseKey = "LicenseKey";

    // Set the logging level to capture all logs
    this.LoggingLevel = WebScraper.LogLevel.All;

    // Assign the working directory for the output files
    this.WorkingDirectory = AppSetting.GetAppRoot() + @"\ShoppingSiteSample\Output\";

    // Set the total number of allowed open HTTP requests (threads)
    this.MaxHttpConnectionLimit = 80;

    // Set minimum polite delay (pause) between requests to a given domain or IP address
    this.RateLimitPerHost = TimeSpan.FromMilliseconds(50);

    // Set the allowed number of concurrent HTTP requests (threads) per hostname or IP address
    this.OpenConnectionLimitPerHost = 25;

    // Do not obey the robots.txt files
    this.ObeyRobotsDotTxt = false;

    // Makes the WebScraper intelligently throttle requests not only by hostname, but also by host servers' IP addresses
    this.ThrottleMode = Throttle.ByDomainHostName;

    // Make an initial request to the website with a parse method
    this.Request("https://www.Website.com", Parse);
}
Public Overrides Sub Init()
	' Set the license key for IronWebScraper
	License.LicenseKey = "LicenseKey"

	' Set the logging level to capture all logs
	Me.LoggingLevel = WebScraper.LogLevel.All

	' Assign the working directory for the output files
	Me.WorkingDirectory = AppSetting.GetAppRoot() & "\ShoppingSiteSample\Output\"

	' Set the total number of allowed open HTTP requests (threads)
	Me.MaxHttpConnectionLimit = 80

	' Set minimum polite delay (pause) between requests to a given domain or IP address
	Me.RateLimitPerHost = TimeSpan.FromMilliseconds(50)

	' Set the allowed number of concurrent HTTP requests (threads) per hostname or IP address
	Me.OpenConnectionLimitPerHost = 25

	' Do not obey the robots.txt files
	Me.ObeyRobotsDotTxt = False

	' Makes the WebScraper intelligently throttle requests not only by hostname, but also by host servers' IP addresses
	Me.ThrottleMode = Throttle.ByDomainHostName

	' Make an initial request to the website with a parse method
	Me.Request("https://www.Website.com", Parse)
End Sub
$vbLabelText   $csharpLabel

Właściwości regulacji obciążenia

  • MaxHttpConnectionLimit
    Całkowita liczba dozwolonych otwartych żądań HTTP (wątków)
  • RateLimitPerHost
    Minimalne uprzejme opóźnienie lub pauza (w milisekundach) pomiędzy żądaniami do danej domeny lub adresu IP
  • OpenConnectionLimitPerHost
    Dozwolona liczba równoczesnych żądań HTTP (wątków) na nazwę hosta
  • ThrottleMode
    WebScraper inteligentnie reguluje żądania nie tylko według nazwy hosta, ale także według adresów IP serwerów hosta. Jest to uprzejme w przypadku, gdy wiele skanowanych domen jest hostowanych na tej samej maszynie.

Rozpocznij prace z IronWebscraper

Rozpocznij używanie IronWebScraper w swoim projekcie już dziś dzięki darmowej wersji próbnej.

Pierwszy krok:
green arrow pointer

Często Zadawane Pytania

How can I authenticate users on websites requiring login in C#?

You can utilize the HttpIdentity feature in IronWebScraper to authenticate users by setting up properties like NetworkDomain, NetworkUsername, and NetworkPassword.

What is the benefit of using web cache during development?

The web cache feature allows you to cache requested pages for reuse, which helps save time and resources by avoiding repeated connections to live websites, especially useful during the development and testing phases.

How can I manage multiple login sessions in web scraping?

IronWebScraper allows the use of thousands of unique user credentials and browser engines to simulate multiple login sessions, which helps prevent websites from detecting and blocking the scraper.

What are the advanced throttling options in web scraping?

IronWebScraper offers a ThrottleMode setting that intelligently manages request throttling based on hostnames and IP addresses, ensuring respectful interaction with shared hosting environments.

How can I use a proxy with IronWebScraper?

To use a proxy, define an array of proxies and associate them with HttpIdentity instances in IronWebScraper, allowing requests to be routed through different IP addresses for anonymity and access control.

How does IronWebScraper handle request delays to prevent server overload?

The RateLimitPerHost setting in IronWebScraper specifies the minimum delay between requests to a specific domain or IP address, helping to prevent server overload by spacing out requests.

Can web scraping be resumed after an interruption?

Yes, IronWebScraper can resume scraping after an interruption by using the Start(CrawlID) method, which saves the execution state and resumes from the last saved point.

How do I control the number of concurrent HTTP connections in a web scraper?

In IronWebScraper, you can set the MaxHttpConnectionLimit property to control the total number of allowed open HTTP requests, helping to manage server load and resources.

What options are available for logging web scraping activities?

IronWebScraper allows you to set the logging level using the LoggingLevel property, enabling comprehensive logging for detailed analysis and troubleshooting during scraping operations.

Curtis Chau
Autor tekstów technicznych

Curtis Chau posiada tytuł licencjata z informatyki (Uniwersytet Carleton) i specjalizuje się w front-endowym rozwoju, z ekspertką w Node.js, TypeScript, JavaScript i React. Pasjonuje się tworzeniem intuicyjnych i estetycznie przyjemnych interfejsów użytkownika, Curtis cieszy się pracą z nowoczesnymi frameworkami i tworzeniem dobrze zorganizowanych, atrakcyjnych wizualnie podrę...

Czytaj więcej
Gotowy, aby rozpocząć?
Nuget Pliki do pobrania 133,274 | Wersja: 2026.4 just released
Still Scrolling Icon

Wciąż przewijasz?

Czy chcesz szybko dowodu? PM > Install-Package IronWebScraper
uruchom przykład obserwuj, jak twoja docelowa strona przekształca się w dane strukturalne.