Advanced Webscraping Features

This article was translated from English: Does it need improvement?
Translated
View the article in English

HttpIdentity Feature

Some website systems require the user to be logged in to view the content; in this case, we can use an HttpIdentity. Here is how you can set it up:

// Create a new instance of HttpIdentity
HttpIdentity id = new HttpIdentity();

// Set the network username and password for authentication
id.NetworkUsername = "username";
id.NetworkPassword = "pwd";

// Add the identity to the collection of identities
Identities.Add(id);
// Create a new instance of HttpIdentity
HttpIdentity id = new HttpIdentity();

// Set the network username and password for authentication
id.NetworkUsername = "username";
id.NetworkPassword = "pwd";

// Add the identity to the collection of identities
Identities.Add(id);
' Create a new instance of HttpIdentity
Dim id As New HttpIdentity()

' Set the network username and password for authentication
id.NetworkUsername = "username"
id.NetworkPassword = "pwd"

' Add the identity to the collection of identities
Identities.Add(id)
$vbLabelText   $csharpLabel

One of the most impressive and powerful features in IronWebScraper is the ability to use thousands of unique user credentials and/or browser engines to spoof or scrape websites using multiple login sessions.

public override void Init()
{
    // Set the license key for IronWebScraper
    License.LicenseKey = "LicenseKey";

    // Set the logging level to capture all logs
    this.LoggingLevel = WebScraper.LogLevel.All;

    // Assign the working directory for the output files
    this.WorkingDirectory = AppSetting.GetAppRoot() + @"\ShoppingSiteSample\Output\";

    // Define an array of proxies
    var proxies = "IP-Proxy1:8080,IP-Proxy2:8081".Split(',');

    // Iterate over common Chrome desktop user agents
    foreach (var UA in IronWebScraper.CommonUserAgents.ChromeDesktopUserAgents)
    {
        // Iterate over the proxies
        foreach (var proxy in proxies)
        {
            // Add a new HTTP identity with specific user agent and proxy
            Identities.Add(new HttpIdentity()
            {
                UserAgent = UA,
                UseCookies = true,
                Proxy = proxy
            });
        }
    }

    // Make an initial request to the website with a parse method
    this.Request("http://www.Website.com", Parse);
}
public override void Init()
{
    // Set the license key for IronWebScraper
    License.LicenseKey = "LicenseKey";

    // Set the logging level to capture all logs
    this.LoggingLevel = WebScraper.LogLevel.All;

    // Assign the working directory for the output files
    this.WorkingDirectory = AppSetting.GetAppRoot() + @"\ShoppingSiteSample\Output\";

    // Define an array of proxies
    var proxies = "IP-Proxy1:8080,IP-Proxy2:8081".Split(',');

    // Iterate over common Chrome desktop user agents
    foreach (var UA in IronWebScraper.CommonUserAgents.ChromeDesktopUserAgents)
    {
        // Iterate over the proxies
        foreach (var proxy in proxies)
        {
            // Add a new HTTP identity with specific user agent and proxy
            Identities.Add(new HttpIdentity()
            {
                UserAgent = UA,
                UseCookies = true,
                Proxy = proxy
            });
        }
    }

    // Make an initial request to the website with a parse method
    this.Request("http://www.Website.com", Parse);
}
Public Overrides Sub Init()
	' Set the license key for IronWebScraper
	License.LicenseKey = "LicenseKey"

	' Set the logging level to capture all logs
	Me.LoggingLevel = WebScraper.LogLevel.All

	' Assign the working directory for the output files
	Me.WorkingDirectory = AppSetting.GetAppRoot() & "\ShoppingSiteSample\Output\"

	' Define an array of proxies
	Dim proxies = "IP-Proxy1:8080,IP-Proxy2:8081".Split(","c)

	' Iterate over common Chrome desktop user agents
	For Each UA In IronWebScraper.CommonUserAgents.ChromeDesktopUserAgents
		' Iterate over the proxies
		For Each proxy In proxies
			' Add a new HTTP identity with specific user agent and proxy
			Identities.Add(New HttpIdentity() With {
				.UserAgent = UA,
				.UseCookies = True,
				.Proxy = proxy
			})
		Next proxy
	Next UA

	' Make an initial request to the website with a parse method
	Me.Request("http://www.Website.com", Parse)
End Sub
$vbLabelText   $csharpLabel

You have multiple properties to give you different behaviors, so preventing websites from blocking you.

Some of these properties include:

  • NetworkDomain: The network domain to be used for user authentication. Supports Windows, NTLM, Kerberos, Linux, BSD, and Mac OS X networks. Must be used with NetworkUsername and NetworkPassword.
  • NetworkUsername: The network/http username to be used for user authentication. Supports HTTP, Windows networks, NTLM, Kerberos, Linux networks, BSD networks, and Mac OS.
  • NetworkPassword: The network/http password to be used for user authentication. Supports HTTP, Windows networks, NTLM, Kerberos, Linux networks, BSD networks, and Mac OS.
  • Proxy: To set proxy settings.
  • UserAgent: To set a browser engine (e.g., Chrome desktop, Chrome mobile, Chrome tablet, IE, and Firefox, etc.).
  • HttpRequestHeaders: For custom header values that will be used with this identity, it accepts a dictionary object Dictionary<string, string>.
  • UseCookies: Enable/disable using cookies.

IronWebScraper runs the scraper using random identities. If we need to specify the use of a specific identity to parse a page, we can do so:

public override void Init()
{
    // Set the license key for IronWebScraper
    License.LicenseKey = "LicenseKey";

    // Set the logging level to capture all logs
    this.LoggingLevel = WebScraper.LogLevel.All;

    // Assign the working directory for the output files
    this.WorkingDirectory = AppSetting.GetAppRoot() + @"\ShoppingSiteSample\Output\";

    // Create a new instance of HttpIdentity
    HttpIdentity identity = new HttpIdentity();

    // Set the network username and password for authentication
    identity.NetworkUsername = "username";
    identity.NetworkPassword = "pwd";

    // Add the identity to the collection of identities
    Identities.Add(identity);

    // Make a request to the website with the specified identity
    this.Request("http://www.Website.com", Parse, identity);
}
public override void Init()
{
    // Set the license key for IronWebScraper
    License.LicenseKey = "LicenseKey";

    // Set the logging level to capture all logs
    this.LoggingLevel = WebScraper.LogLevel.All;

    // Assign the working directory for the output files
    this.WorkingDirectory = AppSetting.GetAppRoot() + @"\ShoppingSiteSample\Output\";

    // Create a new instance of HttpIdentity
    HttpIdentity identity = new HttpIdentity();

    // Set the network username and password for authentication
    identity.NetworkUsername = "username";
    identity.NetworkPassword = "pwd";

    // Add the identity to the collection of identities
    Identities.Add(identity);

    // Make a request to the website with the specified identity
    this.Request("http://www.Website.com", Parse, identity);
}
Public Overrides Sub Init()
	' Set the license key for IronWebScraper
	License.LicenseKey = "LicenseKey"

	' Set the logging level to capture all logs
	Me.LoggingLevel = WebScraper.LogLevel.All

	' Assign the working directory for the output files
	Me.WorkingDirectory = AppSetting.GetAppRoot() & "\ShoppingSiteSample\Output\"

	' Create a new instance of HttpIdentity
	Dim identity As New HttpIdentity()

	' Set the network username and password for authentication
	identity.NetworkUsername = "username"
	identity.NetworkPassword = "pwd"

	' Add the identity to the collection of identities
	Identities.Add(identity)

	' Make a request to the website with the specified identity
	Me.Request("http://www.Website.com", Parse, identity)
End Sub
$vbLabelText   $csharpLabel

Enable the Web Cache Feature

This feature is used to cache requested pages. It is often used in the development and testing phases, enabling developers to cache required pages for reuse after updating code. This enables you to execute your code on cached pages after restarting your web scraper without needing to connect to the live website every time (action-replay).

You can use it in the Init() method:

// Enable web cache without an expiration time
EnableWebCache();

// OR enable web cache with a specified expiration time
EnableWebCache(new TimeSpan(1, 30, 30));
// Enable web cache without an expiration time
EnableWebCache();

// OR enable web cache with a specified expiration time
EnableWebCache(new TimeSpan(1, 30, 30));
' Enable web cache without an expiration time
EnableWebCache()

' OR enable web cache with a specified expiration time
EnableWebCache(New TimeSpan(1, 30, 30))
$vbLabelText   $csharpLabel

It will save your cached data to the WebCache folder under the working directory folder.

public override void Init()
{
    // Set the license key for IronWebScraper
    License.LicenseKey = "LicenseKey";

    // Set the logging level to capture all logs
    this.LoggingLevel = WebScraper.LogLevel.All;

    // Assign the working directory for the output files
    this.WorkingDirectory = AppSetting.GetAppRoot() + @"\ShoppingSiteSample\Output\";

    // Enable web cache with a specific expiration time of 1 hour, 30 minutes, and 30 seconds
    EnableWebCache(new TimeSpan(1, 30, 30));

    // Make an initial request to the website with a parse method
    this.Request("http://www.Website.com", Parse);
}
public override void Init()
{
    // Set the license key for IronWebScraper
    License.LicenseKey = "LicenseKey";

    // Set the logging level to capture all logs
    this.LoggingLevel = WebScraper.LogLevel.All;

    // Assign the working directory for the output files
    this.WorkingDirectory = AppSetting.GetAppRoot() + @"\ShoppingSiteSample\Output\";

    // Enable web cache with a specific expiration time of 1 hour, 30 minutes, and 30 seconds
    EnableWebCache(new TimeSpan(1, 30, 30));

    // Make an initial request to the website with a parse method
    this.Request("http://www.Website.com", Parse);
}
Public Overrides Sub Init()
	' Set the license key for IronWebScraper
	License.LicenseKey = "LicenseKey"

	' Set the logging level to capture all logs
	Me.LoggingLevel = WebScraper.LogLevel.All

	' Assign the working directory for the output files
	Me.WorkingDirectory = AppSetting.GetAppRoot() & "\ShoppingSiteSample\Output\"

	' Enable web cache with a specific expiration time of 1 hour, 30 minutes, and 30 seconds
	EnableWebCache(New TimeSpan(1, 30, 30))

	' Make an initial request to the website with a parse method
	Me.Request("http://www.Website.com", Parse)
End Sub
$vbLabelText   $csharpLabel

IronWebScraper also has features to enable your engine to continue scraping after restarting the code by setting the engine start process name using Start(CrawlID).

static void Main(string[] args)
{
    // Create an object from the Scraper class
    EngineScraper scrape = new EngineScraper();

    // Start the scraping process with the specified crawl ID
    scrape.Start("enginestate");
}
static void Main(string[] args)
{
    // Create an object from the Scraper class
    EngineScraper scrape = new EngineScraper();

    // Start the scraping process with the specified crawl ID
    scrape.Start("enginestate");
}
Shared Sub Main(ByVal args() As String)
	' Create an object from the Scraper class
	Dim scrape As New EngineScraper()

	' Start the scraping process with the specified crawl ID
	scrape.Start("enginestate")
End Sub
$vbLabelText   $csharpLabel

The execution request and response will be saved in the SavedState folder inside the working directory.

Throttling

We can control the minimum and maximum connection numbers and connection speed per domain.

public override void Init()
{
    // Set the license key for IronWebScraper
    License.LicenseKey = "LicenseKey";

    // Set the logging level to capture all logs
    this.LoggingLevel = WebScraper.LogLevel.All;

    // Assign the working directory for the output files
    this.WorkingDirectory = AppSetting.GetAppRoot() + @"\ShoppingSiteSample\Output\";

    // Set the total number of allowed open HTTP requests (threads)
    this.MaxHttpConnectionLimit = 80;

    // Set minimum polite delay (pause) between requests to a given domain or IP address
    this.RateLimitPerHost = TimeSpan.FromMilliseconds(50);

    // Set the allowed number of concurrent HTTP requests (threads) per hostname or IP address
    this.OpenConnectionLimitPerHost = 25;

    // Do not obey the robots.txt files
    this.ObeyRobotsDotTxt = false;

    // Makes the WebScraper intelligently throttle requests not only by hostname, but also by host servers' IP addresses
    this.ThrottleMode = Throttle.ByDomainHostName;

    // Make an initial request to the website with a parse method
    this.Request("https://www.Website.com", Parse);
}
public override void Init()
{
    // Set the license key for IronWebScraper
    License.LicenseKey = "LicenseKey";

    // Set the logging level to capture all logs
    this.LoggingLevel = WebScraper.LogLevel.All;

    // Assign the working directory for the output files
    this.WorkingDirectory = AppSetting.GetAppRoot() + @"\ShoppingSiteSample\Output\";

    // Set the total number of allowed open HTTP requests (threads)
    this.MaxHttpConnectionLimit = 80;

    // Set minimum polite delay (pause) between requests to a given domain or IP address
    this.RateLimitPerHost = TimeSpan.FromMilliseconds(50);

    // Set the allowed number of concurrent HTTP requests (threads) per hostname or IP address
    this.OpenConnectionLimitPerHost = 25;

    // Do not obey the robots.txt files
    this.ObeyRobotsDotTxt = false;

    // Makes the WebScraper intelligently throttle requests not only by hostname, but also by host servers' IP addresses
    this.ThrottleMode = Throttle.ByDomainHostName;

    // Make an initial request to the website with a parse method
    this.Request("https://www.Website.com", Parse);
}
Public Overrides Sub Init()
	' Set the license key for IronWebScraper
	License.LicenseKey = "LicenseKey"

	' Set the logging level to capture all logs
	Me.LoggingLevel = WebScraper.LogLevel.All

	' Assign the working directory for the output files
	Me.WorkingDirectory = AppSetting.GetAppRoot() & "\ShoppingSiteSample\Output\"

	' Set the total number of allowed open HTTP requests (threads)
	Me.MaxHttpConnectionLimit = 80

	' Set minimum polite delay (pause) between requests to a given domain or IP address
	Me.RateLimitPerHost = TimeSpan.FromMilliseconds(50)

	' Set the allowed number of concurrent HTTP requests (threads) per hostname or IP address
	Me.OpenConnectionLimitPerHost = 25

	' Do not obey the robots.txt files
	Me.ObeyRobotsDotTxt = False

	' Makes the WebScraper intelligently throttle requests not only by hostname, but also by host servers' IP addresses
	Me.ThrottleMode = Throttle.ByDomainHostName

	' Make an initial request to the website with a parse method
	Me.Request("https://www.Website.com", Parse)
End Sub
$vbLabelText   $csharpLabel

Throttling properties

  • MaxHttpConnectionLimit
    Total number of allowed open HTTP requests (threads)
  • RateLimitPerHost
    Minimum polite delay or pause (in milliseconds) between requests to a given domain or IP address
  • OpenConnectionLimitPerHost
    Allowed number of concurrent HTTP requests (threads) per hostname
  • ThrottleMode
    Makes the WebScraper intelligently throttle requests not only by hostname, but also by host servers' IP addresses. This is polite in case multiple scraped domains are hosted on the same machine.

Get started with IronWebscraper

Comience a usar IronWebScraper en su proyecto hoy con una prueba gratuita.

Primer Paso:
green arrow pointer

Preguntas Frecuentes

¿Cómo puedo autenticar usuarios en sitios web que requieren inicio de sesión en C#?

Puede utilizar la función HttpIdentity en IronWebScraper para autenticar usuarios configurando propiedades como NetworkDomain, NetworkUsername y NetworkPassword.

¿Cuál es el beneficio de usar caché web durante el desarrollo?

La función de caché web le permite almacenar en caché las páginas solicitadas para reutilizarlas, lo que ayuda a ahorrar tiempo y recursos al evitar conexiones repetidas a sitios web en vivo, especialmente útil durante las fases de desarrollo y prueba.

¿Cómo puedo gestionar múltiples sesiones de inicio de sesión en web scraping?

IronWebScraper permite el uso de miles de credenciales de usuario únicas y motores de navegador para simular múltiples sesiones de inicio de sesión, lo que ayuda a evitar que los sitios web detecten y bloqueen el scraper.

¿Cuáles son las opciones avanzadas de limitación en web scraping?

IronWebScraper ofrece una configuración ThrottleMode que gestiona inteligentemente la limitación de solicitudes según los nombres de host y las direcciones IP, asegurando una interacción respetuosa con los entornos de alojamiento compartido.

¿Cómo puedo usar un proxy con IronWebScraper?

Para usar un proxy, defina un array de proxies y asócielos con instancias de HttpIdentity en IronWebScraper, permitiendo que las solicitudes se dirijan a través de diferentes direcciones IP para el anonimato y el control de acceso.

¿Cómo maneja IronWebScraper los retrasos en las solicitudes para evitar la sobrecarga del servidor?

La configuración RateLimitPerHost en IronWebScraper especifica el retraso mínimo entre solicitudes a un dominio o dirección IP específicos, ayudando a prevenir la sobrecarga del servidor al espaciar las solicitudes.

¿Se puede reanudar el web scraping después de una interrupción?

Sí, IronWebScraper puede reanudar el scraping después de una interrupción utilizando el método Start(CrawlID), que guarda el estado de ejecución y reanuda desde el último punto guardado.

¿Cómo controlo el número de conexiones HTTP concurrentes en un web scraper?

En IronWebScraper, puede configurar la propiedad MaxHttpConnectionLimit para controlar el número total de solicitudes HTTP abiertas permitidas, ayudando a gestionar la carga y los recursos del servidor.

¿Qué opciones están disponibles para registrar actividades de web scraping?

IronWebScraper le permite establecer el nivel de registro utilizando la propiedad LoggingLevel, lo que permite un registro exhaustivo para un análisis detallado y solución de problemas durante las operaciones de scraping.

Darrius Serrant
Ingeniero de Software Full Stack (WebOps)

Darrius Serrant tiene una licenciatura en Ciencias de la Computación de la Universidad de Miami y trabaja como Ingeniero de Marketing WebOps Full Stack en Iron Software. Atraído por la programación desde joven, vio la computación como algo misterioso y accesible, convirtiéndolo en el ...

Leer más
¿Listo para empezar?
Nuget Descargas 122,916 | Versión: 2025.11 recién lanzado