Advanced Web Scraping Features
HttpIdentity Feature
Some websites require the user to be logged in to view content; in that case, we can use an HttpIdentity. Here is how to set it up:
// Create a new instance of HttpIdentity
HttpIdentity id = new HttpIdentity();
// Set the network username and password for authentication
id.NetworkUsername = "username";
id.NetworkPassword = "pwd";
// Add the identity to the collection of identities
Identities.Add(id);
' Create a new instance of HttpIdentity
Dim id As New HttpIdentity()
' Set the network username and password for authentication
id.NetworkUsername = "username"
id.NetworkPassword = "pwd"
' Add the identity to the collection of identities
Identities.Add(id)
One of the most powerful features in IronWebScraper is the ability to use thousands of unique user credentials and/or browser engines to scrape websites through multiple login sessions.
public override void Init()
{
// Set the license key for IronWebScraper
License.LicenseKey = "LicenseKey";
// Set the logging level to capture all logs
this.LoggingLevel = WebScraper.LogLevel.All;
// Assign the working directory for the output files
this.WorkingDirectory = AppSetting.GetAppRoot() + @"\ShoppingSiteSample\Output\";
// Define an array of proxies
var proxies = "IP-Proxy1:8080,IP-Proxy2:8081".Split(',');
// Iterate over common Chrome desktop user agents
foreach (var UA in IronWebScraper.CommonUserAgents.ChromeDesktopUserAgents)
{
// Iterate over the proxies
foreach (var proxy in proxies)
{
// Add a new HTTP identity with specific user agent and proxy
Identities.Add(new HttpIdentity()
{
UserAgent = UA,
UseCookies = true,
Proxy = proxy
});
}
}
// Make an initial request to the website with a parse method
this.Request("http://www.Website.com", Parse);
}
Public Overrides Sub Init()
' Set the license key for IronWebScraper
License.LicenseKey = "LicenseKey"
' Set the logging level to capture all logs
Me.LoggingLevel = WebScraper.LogLevel.All
' Assign the working directory for the output files
Me.WorkingDirectory = AppSetting.GetAppRoot() & "\ShoppingSiteSample\Output\"
' Define an array of proxies
Dim proxies = "IP-Proxy1:8080,IP-Proxy2:8081".Split(","c)
' Iterate over common Chrome desktop user agents
For Each UA In IronWebScraper.CommonUserAgents.ChromeDesktopUserAgents
' Iterate over the proxies
For Each proxy In proxies
' Add a new HTTP identity with specific user agent and proxy
Identities.Add(New HttpIdentity() With {
.UserAgent = UA,
.UseCookies = True,
.Proxy = proxy
})
Next proxy
Next UA
' Make an initial request to the website with a parse method
Me.Request("http://www.Website.com", Parse)
End Sub
You can set multiple properties to vary the scraper's behavior, helping to prevent websites from blocking you. Some of these properties include (a combined sketch follows the list):
NetworkDomain: The network domain used for user authentication. Supports Windows, NTLM, Kerberos, Linux, BSD, and Mac OS X networks. Must be used with NetworkUsername and NetworkPassword.
NetworkUsername: The network/HTTP username used for user authentication. Supports HTTP, Windows networks, NTLM, Kerberos, Linux networks, BSD networks, and Mac OS.
NetworkPassword: The network/HTTP password used for user authentication. Supports HTTP, Windows networks, NTLM, Kerberos, Linux networks, BSD networks, and Mac OS.
Proxy: Sets proxy settings.
UserAgent: Sets the browser engine (e.g., Chrome desktop, Chrome mobile, Chrome tablet, IE, or Firefox).
HttpRequestHeaders: Custom header values to be used with this identity; accepts a Dictionary<string, string>.
UseCookies: Enables or disables the use of cookies.
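As an illustration only, here is a minimal sketch combining several of these properties on a single identity. The property names come from the list above; the user agent string, proxy address, and header values are placeholder assumptions, not values required by the library:
// Minimal sketch: configure one identity with a user agent, proxy, cookies, and custom headers.
// Requires System.Collections.Generic for Dictionary<string, string>.
// All literal values below are placeholders.
HttpIdentity identity = new HttpIdentity();
identity.UserAgent = "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"; // placeholder user agent string
identity.Proxy = "IP-Proxy1:8080";                                // placeholder proxy address
identity.UseCookies = true;                                       // keep cookies for this session
identity.HttpRequestHeaders = new Dictionary<string, string>
{
    { "Accept-Language", "en-US" }                                // placeholder custom header
};
Identities.Add(identity);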
By default, IronWebScraper runs the scraper using random identities. If you need a specific identity to parse a given page, you can pass it explicitly:
public override void Init()
{
// Set the license key for IronWebScraper
License.LicenseKey = "LicenseKey";
// Set the logging level to capture all logs
this.LoggingLevel = WebScraper.LogLevel.All;
// Assign the working directory for the output files
this.WorkingDirectory = AppSetting.GetAppRoot() + @"\ShoppingSiteSample\Output\";
// Create a new instance of HttpIdentity
HttpIdentity identity = new HttpIdentity();
// Set the network username and password for authentication
identity.NetworkUsername = "username";
identity.NetworkPassword = "pwd";
// Add the identity to the collection of identities
Identities.Add(identity);
// Make a request to the website with the specified identity
this.Request("http://www.Website.com", Parse, identity);
}
Public Overrides Sub Init()
' Set the license key for IronWebScraper
License.LicenseKey = "LicenseKey"
' Set the logging level to capture all logs
Me.LoggingLevel = WebScraper.LogLevel.All
' Assign the working directory for the output files
Me.WorkingDirectory = AppSetting.GetAppRoot() & "\ShoppingSiteSample\Output\"
' Create a new instance of HttpIdentity
Dim identity As New HttpIdentity()
' Set the network username and password for authentication
identity.NetworkUsername = "username"
identity.NetworkPassword = "pwd"
' Add the identity to the collection of identities
Identities.Add(identity)
' Make a request to the website with the specified identity
Me.Request("http://www.Website.com", Parse, identity)
End Sub
Enable the Web Cache Feature
This feature caches requested pages. It is most often used during development and testing, letting developers cache the pages they need and reuse them after updating code. You can then run your code against the cached pages after restarting your web scraper, without connecting to the live website every time (action replay).
You can enable it in the Init() method:
// Enable web cache without an expiration time
EnableWebCache();
// OR enable web cache with a specified expiration time
EnableWebCache(new TimeSpan(1, 30, 30));
' Enable web cache without an expiration time
EnableWebCache()
' OR enable web cache with a specified expiration time
EnableWebCache(New TimeSpan(1, 30, 30))
Cached data is saved to the WebCache folder under the working directory.
public override void Init()
{
// Set the license key for IronWebScraper
License.LicenseKey = "LicenseKey";
// Set the logging level to capture all logs
this.LoggingLevel = WebScraper.LogLevel.All;
// Assign the working directory for the output files
this.WorkingDirectory = AppSetting.GetAppRoot() + @"\ShoppingSiteSample\Output\";
// Enable web cache with a specific expiration time of 1 hour, 30 minutes, and 30 seconds
EnableWebCache(new TimeSpan(1, 30, 30));
// Make an initial request to the website with a parse method
this.Request("http://www.Website.com", Parse);
}
Public Overrides Sub Init()
' Set the license key for IronWebScraper
License.LicenseKey = "LicenseKey"
' Set the logging level to capture all logs
Me.LoggingLevel = WebScraper.LogLevel.All
' Assign the working directory for the output files
Me.WorkingDirectory = AppSetting.GetAppRoot() & "\ShoppingSiteSample\Output\"
' Enable web cache with a specific expiration time of 1 hour, 30 minutes, and 30 seconds
EnableWebCache(New TimeSpan(1, 30, 30))
' Make an initial request to the website with a parse method
Me.Request("http://www.Website.com", Parse)
End Sub
IronWebScraper also has a feature that lets your engine continue scraping after a code restart, by setting the engine start process name using Start(CrawlID).
static void Main(string[] args)
{
// Create an object from the Scraper class
EngineScraper scrape = new EngineScraper();
// Start the scraping process with the specified crawl ID
scrape.Start("enginestate");
}
Shared Sub Main(ByVal args() As String)
' Create an object from the Scraper class
Dim scrape As New EngineScraper()
' Start the scraping process with the specified crawl ID
scrape.Start("enginestate")
End Sub
The execution request and response will be saved in the SavedState folder inside the working directory.
Throttling
We can control the minimum and maximum number of connections, as well as the connection speed, per domain.
public override void Init()
{
// Set the license key for IronWebScraper
License.LicenseKey = "LicenseKey";
// Set the logging level to capture all logs
this.LoggingLevel = WebScraper.LogLevel.All;
// Assign the working directory for the output files
this.WorkingDirectory = AppSetting.GetAppRoot() + @"\ShoppingSiteSample\Output\";
// Set the total number of allowed open HTTP requests (threads)
this.MaxHttpConnectionLimit = 80;
// Set minimum polite delay (pause) between requests to a given domain or IP address
this.RateLimitPerHost = TimeSpan.FromMilliseconds(50);
// Set the allowed number of concurrent HTTP requests (threads) per hostname or IP address
this.OpenConnectionLimitPerHost = 25;
// Do not obey the robots.txt files
this.ObeyRobotsDotTxt = false;
// Makes the WebScraper intelligently throttle requests not only by hostname, but also by host servers' IP addresses
this.ThrottleMode = Throttle.ByDomainHostName;
// Make an initial request to the website with a parse method
this.Request("https://www.Website.com", Parse);
}
Public Overrides Sub Init()
' Set the license key for IronWebScraper
License.LicenseKey = "LicenseKey"
' Set the logging level to capture all logs
Me.LoggingLevel = WebScraper.LogLevel.All
' Assign the working directory for the output files
Me.WorkingDirectory = AppSetting.GetAppRoot() & "\ShoppingSiteSample\Output\"
' Set the total number of allowed open HTTP requests (threads)
Me.MaxHttpConnectionLimit = 80
' Set minimum polite delay (pause) between requests to a given domain or IP address
Me.RateLimitPerHost = TimeSpan.FromMilliseconds(50)
' Set the allowed number of concurrent HTTP requests (threads) per hostname or IP address
Me.OpenConnectionLimitPerHost = 25
' Do not obey the robots.txt files
Me.ObeyRobotsDotTxt = False
' Makes the WebScraper intelligently throttle requests not only by hostname, but also by host servers' IP addresses
Me.ThrottleMode = Throttle.ByDomainHostName
' Make an initial request to the website with a parse method
Me.Request("https://www.Website.com", Parse)
End Sub
Throttling properties
MaxHttpConnectionLimit: Total number of allowed open HTTP requests (threads).
RateLimitPerHost: Minimum polite delay or pause (in milliseconds) between requests to a given domain or IP address.
OpenConnectionLimitPerHost: Allowed number of concurrent HTTP requests (threads) per hostname.
ThrottleMode: Makes the WebScraper intelligently throttle requests not only by hostname, but also by host servers' IP addresses. This is polite in case multiple scraped domains are hosted on the same machine.
Get Started with IronWebScraper
Start using IronWebScraper in your project today with a free trial.
Frequently Asked Questions
How can I authenticate users on websites requiring login in C#?
You can use the HttpIdentity feature in IronWebScraper to authenticate users by setting properties such as NetworkDomain, NetworkUsername, and NetworkPassword, as sketched below.
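As an illustration only, here is a minimal sketch of domain-based authentication inside Init(); the domain, credential, and URL values are placeholders:
// Minimal sketch: authenticate with a network domain (all values are placeholders)
HttpIdentity identity = new HttpIdentity();
identity.NetworkDomain = "MyDomain";     // placeholder domain for Windows/NTLM/Kerberos authentication
identity.NetworkUsername = "username";   // placeholder username
identity.NetworkPassword = "pwd";        // placeholder password
Identities.Add(identity);
// Use this identity for a specific request (URL is a placeholder)
this.Request("http://www.Website.com", Parse, identity);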
What is the benefit of using web cache during development?
The web cache feature allows you to cache requested pages for reuse, which helps save time and resources by avoiding repeated connections to live websites, especially useful during the development and testing phases.
How can I manage multiple login sessions in web scraping?
IronWebScraper allows the use of thousands of unique user credentials and browser engines to simulate multiple login sessions, which helps prevent websites from detecting and blocking the scraper.
What are the advanced throttling options in web scraping?
IronWebScraper offers a ThrottleMode setting that intelligently manages request throttling based on hostnames and IP addresses, ensuring respectful interaction with shared hosting environments.
How can I use a proxy with IronWebScraper?
To use a proxy, define an array of proxies and associate them with HttpIdentity instances in IronWebScraper, allowing requests to be routed through different IP addresses for anonymity and access control.
How does IronWebScraper handle request delays to prevent server overload?
The RateLimitPerHost setting in IronWebScraper specifies the minimum delay between requests to a specific domain or IP address, helping to prevent server overload by spacing out requests.
Can web scraping be resumed after an interruption?
Yes, IronWebScraper can resume scraping after an interruption by using the Start(CrawlID) method, which saves the execution state and resumes from the last saved point.
How do I control the number of concurrent HTTP connections in a web scraper?
In IronWebScraper, you can set the MaxHttpConnectionLimit property to control the total number of allowed open HTTP requests, helping to manage server load and resources.
What options are available for logging web scraping activities?
IronWebScraper allows you to set the logging level using the LoggingLevel property, enabling comprehensive logging for detailed analysis and troubleshooting during scraping operations.