Advanced Webscraping Features
Install with NuGet
Install-Package IronWebScraper
Download DLL
Manually install into your project
Install with NuGet
Install-Package IronWebScraper
Download DLL
Manually install into your project
Start using IronPDF in your project today with a free trial.
Check out IronWebScraper on Nuget for quick installation and deployment. With over 8 million downloads, it's transforming with C#.
Install-Package IronWebScraper
Consider installing the IronWebScraper DLL directly. Download and manually install it for your project or GAC form: webscraperIronWebScraper.zip
Manually install into your project
Download DLLHttpIdentity Feature
Some website systems require the user to be logged in to view the content; in this case we can use a HttpIdentity: -
HttpIdentity id = new HttpIdentity();
id.NetworkUsername = "username";
id.NetworkPassword = "pwd";
Identities.Add(id);
HttpIdentity id = new HttpIdentity();
id.NetworkUsername = "username";
id.NetworkPassword = "pwd";
Identities.Add(id);
Dim id As New HttpIdentity()
id.NetworkUsername = "username"
id.NetworkPassword = "pwd"
Identities.Add(id)
One of the most impressive and powerful features in IronWebScraper, is the ability to use thousands of unique (user’s credentials and/or browser engines) to spoof or scrape websites using multi login sessions.
public override void Init()
{
License.LicenseKey = " LicenseKey ";
this.LoggingLevel = WebScraper.LogLevel.All;
this.WorkingDirectory = AppSetting.GetAppRoot() + @"\ShoppingSiteSample\Output\";
var proxies = "IP-Proxy1: 8080,IP-Proxy2: 8081".Split(',');
foreach (var UA in IronWebScraper.CommonUserAgents.ChromeDesktopUserAgents)
{
foreach (var proxy in proxies)
{
Identities.Add(new HttpIdentity()
{
UserAgent = UA,
UseCookies = true,
Proxy = proxy
});
}
}
this.Request("http://www.Website.com", Parse);
}
public override void Init()
{
License.LicenseKey = " LicenseKey ";
this.LoggingLevel = WebScraper.LogLevel.All;
this.WorkingDirectory = AppSetting.GetAppRoot() + @"\ShoppingSiteSample\Output\";
var proxies = "IP-Proxy1: 8080,IP-Proxy2: 8081".Split(',');
foreach (var UA in IronWebScraper.CommonUserAgents.ChromeDesktopUserAgents)
{
foreach (var proxy in proxies)
{
Identities.Add(new HttpIdentity()
{
UserAgent = UA,
UseCookies = true,
Proxy = proxy
});
}
}
this.Request("http://www.Website.com", Parse);
}
Public Overrides Sub Init()
License.LicenseKey = " LicenseKey "
Me.LoggingLevel = WebScraper.LogLevel.All
Me.WorkingDirectory = AppSetting.GetAppRoot() & "\ShoppingSiteSample\Output\"
Dim proxies = "IP-Proxy1: 8080,IP-Proxy2: 8081".Split(","c)
For Each UA In IronWebScraper.CommonUserAgents.ChromeDesktopUserAgents
For Each proxy In proxies
Identities.Add(New HttpIdentity() With {
.UserAgent = UA,
.UseCookies = True,
.Proxy = proxy
})
Next proxy
Next UA
Me.Request("http://www.Website.com", Parse)
End Sub
You have multiple properties to give you different behaviors, so preventing websites from blocking you.
Some of these properties: -
- NetworkDomain : The network domain to be used for user authentication. Supports Windows, NTLM , Keroberos, Linux, BSD and Mac OS X networks. Must be used with (NetworkUsername and NetworkPassword)
- NetworkUsername : The network/http username to be used for user authentication. Supports Http, Windows networks, NTLM , Kerberos , Linux networks, BSD networks and Mac OS.
- NetworkPassword : The network/http password to be used for user authentication. Supports Http , Windows networks, NTLM , Keroberos , Linux networks, BSD networks and Mac OS.
- Proxy : to set proxy settings
- UserAgent : to set browser engine (chrome desktop , chrome mobile , chrome tablet , IE and Firefox , etc.)
- HttpRequestHeaders : for custom header values that will be used with this identity , and it accept dictionary object (Dictionary <string, string>)
- UseCookies : enable/disable using cookies
IronWebScraper runs the scraper using random identities. If we need to specify the use of a specific identity to parse a page we can do so.
public override void Init()
{
License.LicenseKey = " LicenseKey ";
this.LoggingLevel = WebScraper.LogLevel.All;
this.WorkingDirectory = AppSetting.GetAppRoot() + @"\ShoppingSiteSample\Output\";
HttpIdentity identity = new HttpIdentity();
identity.NetworkUsername = "username";
identity.NetworkPassword = "pwd";
Identities.Add(id);
this.Request("http://www.Website.com", Parse, identity);
}
public override void Init()
{
License.LicenseKey = " LicenseKey ";
this.LoggingLevel = WebScraper.LogLevel.All;
this.WorkingDirectory = AppSetting.GetAppRoot() + @"\ShoppingSiteSample\Output\";
HttpIdentity identity = new HttpIdentity();
identity.NetworkUsername = "username";
identity.NetworkPassword = "pwd";
Identities.Add(id);
this.Request("http://www.Website.com", Parse, identity);
}
Public Overrides Sub Init()
License.LicenseKey = " LicenseKey "
Me.LoggingLevel = WebScraper.LogLevel.All
Me.WorkingDirectory = AppSetting.GetAppRoot() & "\ShoppingSiteSample\Output\"
Dim identity As New HttpIdentity()
identity.NetworkUsername = "username"
identity.NetworkPassword = "pwd"
Identities.Add(id)
Me.Request("http://www.Website.com", Parse, identity)
End Sub
Enable the Web Cache Feature
This feature is used to cache Requested Pages. It is often used in development and testing phases; enabling developers to cache required pages for reuse after updating code. This enables you to execute your code on cached pages after restarting your Web scraper and not needing to connect to the live website every time (action-replay).
You can use it in Init() Method
EnableWebCache();
OR
EnableWebCache(Timespan Expiry);
It will save your cached data to the WebCache folder under the working directory folder
public override void Init()
{
License.LicenseKey = " LicenseKey ";
this.LoggingLevel = WebScraper.LogLevel.All;
this.WorkingDirectory = AppSetting.GetAppRoot() + @"\ShoppingSiteSample\Output\";
EnableWebCache(new TimeSpan(1,30,30));
this.Request("http://www.WebSite.com", Parse);
}
public override void Init()
{
License.LicenseKey = " LicenseKey ";
this.LoggingLevel = WebScraper.LogLevel.All;
this.WorkingDirectory = AppSetting.GetAppRoot() + @"\ShoppingSiteSample\Output\";
EnableWebCache(new TimeSpan(1,30,30));
this.Request("http://www.WebSite.com", Parse);
}
Public Overrides Sub Init()
License.LicenseKey = " LicenseKey "
Me.LoggingLevel = WebScraper.LogLevel.All
Me.WorkingDirectory = AppSetting.GetAppRoot() & "\ShoppingSiteSample\Output\"
EnableWebCache(New TimeSpan(1,30,30))
Me.Request("http://www.WebSite.com", Parse)
End Sub
IronWebScraper also has features to enable your engine to continue scraping after restarting code by setting the engine start process name using Start(CrawlID)
static void Main(string [] args)
{
// Create Object From Scraper class
EngineScraper scrape = new EngineScraper();
// Start Scraping
scrape.Start("enginestate");
}
static void Main(string [] args)
{
// Create Object From Scraper class
EngineScraper scrape = new EngineScraper();
// Start Scraping
scrape.Start("enginestate");
}
Shared Sub Main(ByVal args() As String)
' Create Object From Scraper class
Dim scrape As New EngineScraper()
' Start Scraping
scrape.Start("enginestate")
End Sub
The execution request and response will be saved in the SavedState folder inside the working directory.
Throttling
We can control the minimum and maximum connection numbers and connection speed per domain.
public override void Init()
{
License.LicenseKey = "LicenseKey";
this.LoggingLevel = WebScraper.LogLevel.All;
this.WorkingDirectory = AppSetting.GetAppRoot() + @"\ShoppingSiteSample\Output\";
// Gets or sets the total number of allowed open HTTP requests (threads)
this.MaxHttpConnectionLimit = 80;
// Gets or sets minimum polite delay (pause)between request to a given domain or IP address.
this.RateLimitPerHost = TimeSpan.FromMilliseconds(50);
// Gets or sets the allowed number of concurrent HTTP requests (threads) per hostname
// or IP address. This helps protect hosts against too many requests.
this.OpenConnectionLimitPerHost = 25;
this.ObeyRobotsDotTxt = false;
// Makes the WebSraper intelligently throttle requests not only by hostname, but
// also by host servers' IP addresses. This is polite in-case multiple scraped domains
// are hosted on the same machine.
this.ThrottleMode = Throttle.ByDomainHostName;
this.Request("https://www.Website.com", Parse);
}
public override void Init()
{
License.LicenseKey = "LicenseKey";
this.LoggingLevel = WebScraper.LogLevel.All;
this.WorkingDirectory = AppSetting.GetAppRoot() + @"\ShoppingSiteSample\Output\";
// Gets or sets the total number of allowed open HTTP requests (threads)
this.MaxHttpConnectionLimit = 80;
// Gets or sets minimum polite delay (pause)between request to a given domain or IP address.
this.RateLimitPerHost = TimeSpan.FromMilliseconds(50);
// Gets or sets the allowed number of concurrent HTTP requests (threads) per hostname
// or IP address. This helps protect hosts against too many requests.
this.OpenConnectionLimitPerHost = 25;
this.ObeyRobotsDotTxt = false;
// Makes the WebSraper intelligently throttle requests not only by hostname, but
// also by host servers' IP addresses. This is polite in-case multiple scraped domains
// are hosted on the same machine.
this.ThrottleMode = Throttle.ByDomainHostName;
this.Request("https://www.Website.com", Parse);
}
Public Overrides Sub Init()
License.LicenseKey = "LicenseKey"
Me.LoggingLevel = WebScraper.LogLevel.All
Me.WorkingDirectory = AppSetting.GetAppRoot() & "\ShoppingSiteSample\Output\"
' Gets or sets the total number of allowed open HTTP requests (threads)
Me.MaxHttpConnectionLimit = 80
' Gets or sets minimum polite delay (pause)between request to a given domain or IP address.
Me.RateLimitPerHost = TimeSpan.FromMilliseconds(50)
' Gets or sets the allowed number of concurrent HTTP requests (threads) per hostname
' or IP address. This helps protect hosts against too many requests.
Me.OpenConnectionLimitPerHost = 25
Me.ObeyRobotsDotTxt = False
' Makes the WebSraper intelligently throttle requests not only by hostname, but
' also by host servers' IP addresses. This is polite in-case multiple scraped domains
' are hosted on the same machine.
Me.ThrottleMode = Throttle.ByDomainHostName
Me.Request("https://www.Website.com", Parse)
End Sub
Throttling properties
- MaxHttpConnectionLimit
Total number of allowed open HTTP requests (threads) - RateLimitPerHost
Minimum polite delay or pause (in milliseconds) between request to a given domain or IP address - OpenConnectionLimitPerHost
Allowed number of concurrent HTTP requests (threads) - ThrottleMode
Makes the WebSraper intelligently throttle requests not only by hostname, but also by host servers' IP addresses. This is polite in-case multiple scraped domains are hosted on the same machine