Advanced Web Scraping Features
HttpIdentity Feature
Some websites require the user to be logged in to view content; in this case, we can use an HttpIdentity. Here is how to set it up:
// Create a new instance of HttpIdentity
HttpIdentity id = new HttpIdentity();
// Set the network username and password for authentication
id.NetworkUsername = "username";
id.NetworkPassword = "pwd";
// Add the identity to the collection of identities
Identities.Add(id);
' Create a new instance of HttpIdentity
Dim id As New HttpIdentity()
' Set the network username and password for authentication
id.NetworkUsername = "username"
id.NetworkPassword = "pwd"
' Add the identity to the collection of identities
Identities.Add(id)
One of the most impressive and powerful features in IronWebScraper is the ability to use thousands of unique user credentials and/or browser engines to spoof or scrape websites using multiple login sessions.
public override void Init()
{
// Set the license key for IronWebScraper
License.LicenseKey = "LicenseKey";
// Set the logging level to capture all logs
this.LoggingLevel = WebScraper.LogLevel.All;
// Assign the working directory for the output files
this.WorkingDirectory = AppSetting.GetAppRoot() + @"\ShoppingSiteSample\Output\";
// Define an array of proxies
var proxies = "IP-Proxy1:8080,IP-Proxy2:8081".Split(',');
// Iterate over common Chrome desktop user agents
foreach (var UA in IronWebScraper.CommonUserAgents.ChromeDesktopUserAgents)
{
// Iterate over the proxies
foreach (var proxy in proxies)
{
// Add a new HTTP identity with specific user agent and proxy
Identities.Add(new HttpIdentity()
{
UserAgent = UA,
UseCookies = true,
Proxy = proxy
});
}
}
// Make an initial request to the website with a parse method
this.Request("http://www.Website.com", Parse);
}
Public Overrides Sub Init()
' Set the license key for IronWebScraper
License.LicenseKey = "LicenseKey"
' Set the logging level to capture all logs
Me.LoggingLevel = WebScraper.LogLevel.All
' Assign the working directory for the output files
Me.WorkingDirectory = AppSetting.GetAppRoot() & "\ShoppingSiteSample\Output\"
' Define an array of proxies
Dim proxies = "IP-Proxy1:8080,IP-Proxy2:8081".Split(","c)
' Iterate over common Chrome desktop user agents
For Each UA In IronWebScraper.CommonUserAgents.ChromeDesktopUserAgents
' Iterate over the proxies
For Each proxy In proxies
' Add a new HTTP identity with specific user agent and proxy
Identities.Add(New HttpIdentity() With {
.UserAgent = UA,
.UseCookies = True,
.Proxy = proxy
})
Next proxy
Next UA
' Make an initial request to the website with a parse method
Me.Request("http://www.Website.com", Parse)
End Sub
HttpIdentity exposes multiple properties that give you different behaviors, preventing websites from blocking you.
Some of these properties include (several are combined in the sketch after this list):
NetworkDomain: The network domain to be used for user authentication. Supports Windows, NTLM, Kerberos, Linux, BSD, and Mac OS X networks. Must be used with NetworkUsername and NetworkPassword.
NetworkUsername: The network/HTTP username to be used for user authentication. Supports HTTP, Windows networks, NTLM, Kerberos, Linux networks, BSD networks, and Mac OS.
NetworkPassword: The network/HTTP password to be used for user authentication. Supports HTTP, Windows networks, NTLM, Kerberos, Linux networks, BSD networks, and Mac OS.
Proxy: Sets the proxy settings.
UserAgent: Sets the browser engine (e.g., Chrome desktop, Chrome mobile, Chrome tablet, IE, Firefox).
HttpRequestHeaders: Custom header values to be used with this identity; accepts a dictionary object (Dictionary<string, string>).
UseCookies: Enables/disables the use of cookies.
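As a quick illustration, here is a minimal sketch combining several of these properties on a single identity inside Init(). All values (domain, credentials, proxy address, user-agent string, and header entries) are illustrative placeholders, and the dictionary assignment assumes a using System.Collections.Generic; directive:
// A minimal sketch combining several HttpIdentity properties (placeholder values)
HttpIdentity identity = new HttpIdentity();
// NetworkDomain is used together with NetworkUsername and NetworkPassword
identity.NetworkDomain = "DOMAIN";
identity.NetworkUsername = "username";
identity.NetworkPassword = "pwd";
// Route this identity's requests through a proxy
identity.Proxy = "IP-Proxy1:8080";
// Present a specific browser engine (placeholder user-agent string)
identity.UserAgent = "Mozilla/5.0 (Windows NT 10.0; Win64; x64)";
// Keep cookies between requests made with this identity
identity.UseCookies = true;
// Custom headers are supplied as a Dictionary<string, string>
identity.HttpRequestHeaders = new Dictionary<string, string>
{
    { "Accept-Language", "en-US" }
};
// Add the identity to the collection of identities
Identities.Add(identity);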
IronWebScraper makes requests using random identities. If we need to use a specific identity to parse a page, we can do so:
public override void Init()
{
// Set the license key for IronWebScraper
License.LicenseKey = "LicenseKey";
// Set the logging level to capture all logs
this.LoggingLevel = WebScraper.LogLevel.All;
// Assign the working directory for the output files
this.WorkingDirectory = AppSetting.GetAppRoot() + @"\ShoppingSiteSample\Output\";
// Create a new instance of HttpIdentity
HttpIdentity identity = new HttpIdentity();
// Set the network username and password for authentication
identity.NetworkUsername = "username";
identity.NetworkPassword = "pwd";
// Add the identity to the collection of identities
Identities.Add(identity);
// Make a request to the website with the specified identity
this.Request("http://www.Website.com", Parse, identity);
}
Public Overrides Sub Init()
' Set the license key for IronWebScraper
License.LicenseKey = "LicenseKey"
' Set the logging level to capture all logs
Me.LoggingLevel = WebScraper.LogLevel.All
' Assign the working directory for the output files
Me.WorkingDirectory = AppSetting.GetAppRoot() & "\ShoppingSiteSample\Output\"
' Create a new instance of HttpIdentity
Dim identity As New HttpIdentity()
' Set the network username and password for authentication
identity.NetworkUsername = "username"
identity.NetworkPassword = "pwd"
' Add the identity to the collection of identities
Identities.Add(identity)
' Make a request to the website with the specified identity
Me.Request("http://www.Website.com", Parse, identity)
End Sub
Enable the Web Cache Feature
This feature is used to cache requested pages. It is often used in the development and testing phases, enabling developers to cache required pages for reuse after updating code. This enables you to execute your code on cached pages after restarting your web scraper without needing to connect to the live website every time (action-replay).
You can use it in the Init() method:
// Enable web cache without an expiration time
EnableWebCache();
// OR enable web cache with a specified expiration time
EnableWebCache(new TimeSpan(1, 30, 30));
' Enable web cache without an expiration time
EnableWebCache()
' OR enable web cache with a specified expiration time
EnableWebCache(New TimeSpan(1, 30, 30))
It will save your cached data to the WebCache folder under the working directory.
public override void Init()
{
// Set the license key for IronWebScraper
License.LicenseKey = "LicenseKey";
// Set the logging level to capture all logs
this.LoggingLevel = WebScraper.LogLevel.All;
// Assign the working directory for the output files
this.WorkingDirectory = AppSetting.GetAppRoot() + @"\ShoppingSiteSample\Output\";
// Enable web cache with a specific expiration time of 1 hour, 30 minutes, and 30 seconds
EnableWebCache(new TimeSpan(1, 30, 30));
// Make an initial request to the website with a parse method
this.Request("http://www.Website.com", Parse);
}
Public Overrides Sub Init()
' Set the license key for IronWebScraper
License.LicenseKey = "LicenseKey"
' Set the logging level to capture all logs
Me.LoggingLevel = WebScraper.LogLevel.All
' Assign the working directory for the output files
Me.WorkingDirectory = AppSetting.GetAppRoot() & "\ShoppingSiteSample\Output\"
' Enable web cache with a specific expiration time of 1 hour, 30 minutes, and 30 seconds
EnableWebCache(New TimeSpan(1, 30, 30))
' Make an initial request to the website with a parse method
Me.Request("http://www.Website.com", Parse)
End Sub
IronWebScraper also has features to enable your engine to continue scraping after restarting the code by setting the engine start process name using Start(CrawlID).
static void Main(string[] args)
{
// Create an object from the Scraper class
EngineScraper scrape = new EngineScraper();
// Start the scraping process with the specified crawl ID
scrape.Start("enginestate");
}
Shared Sub Main(ByVal args() As String)
' Create an object from the Scraper class
Dim scrape As New EngineScraper()
' Start the scraping process with the specified crawl ID
scrape.Start("enginestate")
End Sub
The execution request and response will be saved in the SavedState folder inside the working directory.
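For completeness, here is a minimal sketch of what the EngineScraper class used above might look like. The class shape follows the standard WebScraper pattern (an Init() override plus a Parse(Response) handler); the start URL, CSS selector, and output file name are illustrative placeholders:
// A minimal sketch of the EngineScraper class referenced above (assumed shape)
public class EngineScraper : WebScraper
{
    public override void Init()
    {
        // Placeholder license key and start URL
        License.LicenseKey = "LicenseKey";
        this.Request("http://www.Website.com", Parse);
    }
    public override void Parse(Response response)
    {
        // Placeholder extraction: scrape the text of every link on the page
        foreach (var link in response.Css("a"))
        {
            Scrape(new ScrapedData() { { "Text", link.TextContentClean } }, "links.jsonl");
        }
    }
}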
Throttling
We can control the minimum and maximum number of connections and the connection speed per domain.
public override void Init()
{
// Set the license key for IronWebScraper
License.LicenseKey = "LicenseKey";
// Set the logging level to capture all logs
this.LoggingLevel = WebScraper.LogLevel.All;
// Assign the working directory for the output files
this.WorkingDirectory = AppSetting.GetAppRoot() + @"\ShoppingSiteSample\Output\";
// Set the total number of allowed open HTTP requests (threads)
this.MaxHttpConnectionLimit = 80;
// Set minimum polite delay (pause) between requests to a given domain or IP address
this.RateLimitPerHost = TimeSpan.FromMilliseconds(50);
// Set the allowed number of concurrent HTTP requests (threads) per hostname or IP address
this.OpenConnectionLimitPerHost = 25;
// Do not obey the robots.txt files
this.ObeyRobotsDotTxt = false;
// Makes the WebScraper intelligently throttle requests not only by hostname, but also by host servers' IP addresses
this.ThrottleMode = Throttle.ByDomainHostName;
// Make an initial request to the website with a parse method
this.Request("https://www.Website.com", Parse);
}
Public Overrides Sub Init()
' Set the license key for IronWebScraper
License.LicenseKey = "LicenseKey"
' Set the logging level to capture all logs
Me.LoggingLevel = WebScraper.LogLevel.All
' Assign the working directory for the output files
Me.WorkingDirectory = AppSetting.GetAppRoot() & "\ShoppingSiteSample\Output\"
' Set the total number of allowed open HTTP requests (threads)
Me.MaxHttpConnectionLimit = 80
' Set minimum polite delay (pause) between requests to a given domain or IP address
Me.RateLimitPerHost = TimeSpan.FromMilliseconds(50)
' Set the allowed number of concurrent HTTP requests (threads) per hostname or IP address
Me.OpenConnectionLimitPerHost = 25
' Do not obey the robots.txt files
Me.ObeyRobotsDotTxt = False
' Makes the WebScraper intelligently throttle requests not only by hostname, but also by host servers' IP addresses
Me.ThrottleMode = Throttle.ByDomainHostName
' Make an initial request to the website with a parse method
Me.Request("https://www.Website.com", Parse)
End Sub
Throttling properties
MaxHttpConnectionLimit: Total number of allowed open HTTP requests (threads).
RateLimitPerHost: Minimum polite delay or pause (in milliseconds) between requests to a given domain or IP address.
OpenConnectionLimitPerHost: Allowed number of concurrent HTTP requests (threads) per hostname.
ThrottleMode: Makes the WebScraper intelligently throttle requests not only by hostname, but also by the host servers' IP addresses. This is polite in case multiple scraped domains are hosted on the same machine.
Get started with IronWebScraper
Start using IronWebScraper in your project today with a free trial.
Frequently Asked Questions
What is the HttpIdentity feature in IronWebScraper?
The HttpIdentity feature allows users to log in and authenticate with websites by setting a network username and password, supporting various network types like Windows, NTLM, and Kerberos.
How can I enable web caching in IronWebScraper?
Web caching can be enabled using the EnableWebCache() method, which allows caching of requested pages for reuse, especially useful during development and testing phases.
What are the benefits of using the web cache feature?
The web cache feature prevents the need to connect to the live website repeatedly, saving time and resources during development by allowing execution on cached pages.
How does IronWebScraper handle multiple identities?
IronWebScraper can use thousands of unique user credentials and browser engines to simulate multiple login sessions, preventing websites from blocking the scraper.
What is the purpose of the ThrottleMode setting?
ThrottleMode makes the WebScraper intelligently throttle requests based on both hostname and host servers' IP addresses, ensuring polite scraping across shared hosting environments.
How can I set up a proxy in IronWebScraper?
Proxies can be set by defining an array of proxies and associating them with HttpIdentity instances, allowing requests to be routed through different IP addresses.
What is the role of RateLimitPerHost?
RateLimitPerHost specifies the minimum delay or pause between requests to a specific domain or IP address, helping to prevent overloading the server with requests.
Can IronWebScraper continue scraping after a restart?
Yes, IronWebScraper can continue scraping after a restart by using the Start(CrawlID) method, which saves execution state and can resume from the last saved state.
How do I set the maximum number of HTTP connections in IronWebScraper?
The maximum number of HTTP connections can be set using the MaxHttpConnectionLimit property, which controls the total number of allowed open HTTP requests (threads).
What logging options are available in IronWebScraper?
IronWebScraper allows setting the logging level via the LoggingLevel property, enabling capture of all logs for detailed analysis during scraping operations.