Class WebScraper

An easy-to-use base class that developers can extend to rapidly build custom web-scraping applications.

Inheritance

System.Object

WebScraper

Namespace: IronWebScraper

Assembly: IronWebScraper.dll

Syntax

public abstract class WebScraper : Object

Constructors

WebScraper()

Declaration

protected WebScraper()

Fields

AllowedDomains

If not empty, the hostname of every requested Url must match at least one of the AllowedDomains patterns. Patterns may be added using glob wildcard strings or Regex.

Declaration

public UrlMatchPatternCollection AllowedDomains

Field Value

Type	Description
UrlMatchPatternCollection

AllowedUrls

If not empty, every requested Url must match at least one of the AllowedUrls patterns. Patterns may be added using glob wildcard strings or Regex.

Declaration

public UrlMatchPatternCollection AllowedUrls

Field Value

Type	Description
UrlMatchPatternCollection

BannedDomains

If not empty, the hostname of a requested Url may not match any of the BannedDomains patterns. Patterns may be added using glob wildcard strings or Regex.

Declaration

public UrlMatchPatternCollection BannedDomains

Field Value

Type	Description
UrlMatchPatternCollection

BannedUrls

If not empty, no requested Url may match any of the BannedUrls patterns. Patterns may be added using glob wildcard strings or Regex.

Declaration

public UrlMatchPatternCollection BannedUrls

Field Value

Type	Description
UrlMatchPatternCollection

CrawlId

A unique string used to identify a crawl job.

Declaration

public string CrawlId

Field Value

Type	Description
System.String

FilesDownloaded

The total number of files downloaded successfully with the DownloadImage and DownloadFile methods.

Declaration

public int FilesDownloaded

Field Value

Type	Description
System.Int32

Identities

A list of http identities to be used to fetch web resources.

Each Identity may have a different proxy IP address, user agent, http headers, persistent cookies, username and password.

Best practice is to create Identities in your WebScraper.Init method and add them to this WebScraper.Identities list.

Declaration

public List<HttpIdentity> Identities

Field Value

Type	Description
System.Collections.Generic.List<HttpIdentity>

LoggingLevel

The level of logging made by the WebScraper engine to the Console.

LogLevel.Critical is normally the most useful setting, allowing the developer to write their own meaningful, application-relevant messages inside Parse methods.

LogLevel.ScrapedData is useful when coding and testing a new WebScraper.

Declaration

public WebScraper.LogLevel LoggingLevel

Field Value

Type	Description
WebScraper.LogLevel

ObeyRobotsDotTxt

Causes the WebScraper to always obey /robots.txt directives including url and path restrictions and crawl rates.

Declaration

public bool ObeyRobotsDotTxt

Field Value

Type	Description
System.Boolean

WorkingDirectory

Path to a local directory where scraped data and state information will be saved.

Declaration

public string WorkingDirectory

Field Value

Type	Description
System.String

Properties

FailedUrls

Gets the number of failed http requests which have exceeded their total maximum number of retries.

Declaration

public int FailedUrls { get; }

Property Value

Type	Description
System.Int32

HttpRetryAttempts

The number of times WebScraper will retry a failed URL (normally with a new identity) before considering it non-scrapable.

Declaration

public int HttpRetryAttempts { get; set; }

Property Value

Type	Description
System.Int32

HttpTimeOut

Gets or sets the time after which an HTTP request will be considered failed or lost (host non-contactable or DNS unavailable).

Declaration

public TimeSpan HttpTimeOut { get; set; }

Property Value

Type	Description
System.TimeSpan

MaxHttpConnectionLimit

Gets or sets the total number of allowed open HTTP requests (threads).

Declaration

public int MaxHttpConnectionLimit { get; set; }

Property Value

Type	Description
System.Int32

OpenConnectionLimitPerHost

Gets or sets the allowed number of concurrent HTTP requests (threads) per hostname or IP address. This helps protect hosts against too many requests.

Declaration

public int OpenConnectionLimitPerHost { get; set; }

Property Value

Type	Description
System.Int32

RateLimitPerHost

Gets or sets the minimum polite delay (pause) between requests to a given domain or IP address.

Declaration

public TimeSpan RateLimitPerHost { get; set; }

Property Value

Type	Description
System.TimeSpan

SuccessfulFileDownloadCount

Gets the number of successful http downloads using the DownloadFile and DownloadImage methods.

Declaration

public int SuccessfulFileDownloadCount { get; }

Property Value

Type	Description
System.Int32

SuccessfulfulRequestCount

Gets the number of successful http requests.

Declaration

public int SuccessfulfulRequestCount { get; }

Property Value

Type	Description
System.Int32

ThrottleMode

Makes the WebScraper intelligently throttle requests not only by hostname, but also by host servers' IP addresses. This is polite in case multiple scraped domains are hosted on the same machine.

Declaration

public WebScraper.Throttle ThrottleMode { get; set; }

Property Value

Type	Description
WebScraper.Throttle	Selects whether hosts' IP addresses are looked up for throttling in addition to hostnames.
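The fields and properties above are typically set together before the first request is queued. The following is a minimal sketch, not a definitive setup: the class name PoliteScraper, the directory name, the start url and every numeric value are illustrative assumptions, and only members listed in this reference are used.

```csharp
using System;
using IronWebScraper;

// Hypothetical subclass; the name and all values below are illustrative only.
public class PoliteScraper : WebScraper
{
    public override void Init()
    {
        // Where scraped data and state information are saved (assumed path).
        WorkingDirectory = "scrape-output";
        LoggingLevel = WebScraper.LogLevel.Critical;
        ObeyRobotsDotTxt = true;

        // Retry and timeout behaviour for failed requests.
        HttpRetryAttempts = 3;
        HttpTimeOut = TimeSpan.FromSeconds(30);

        // Concurrency and politeness limits.
        MaxHttpConnectionLimit = 20;
        OpenConnectionLimitPerHost = 2;
        RateLimitPerHost = TimeSpan.FromSeconds(1);

        // Queue at least one start url, handled by Parse.
        Request(new[] { "https://example.com/" }, Parse);
    }

    public override void Parse(Response response)
    {
        // Page-specific extraction would go here.
    }
}
```

Lower per-host limits and a longer RateLimitPerHost trade crawl speed for politeness toward the scraped host.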

Methods

AcceptUrl(String)

Decides if the WebScraper will accept a given url. May be overridden to apply custom middleware logic.

Declaration

public virtual bool AcceptUrl(string url)

Parameters

Type	Name	Description
System.String	url

Returns

Type	Description
System.Boolean
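As an illustration of overriding AcceptUrl, the sketch below rejects links ending in ".pdf" and defers every other decision to the base implementation. The class name NoPdfScraper and the ".pdf" rule are assumptions made for the example; the class is left abstract so that Init and Parse can be supplied by a further subclass.

```csharp
using System;
using IronWebScraper;

// Hypothetical intermediate class adding one custom filtering rule.
public abstract class NoPdfScraper : WebScraper
{
    public override bool AcceptUrl(string url)
    {
        // Custom rule (assumed for illustration): never request PDF documents.
        if (url.EndsWith(".pdf", StringComparison.OrdinalIgnoreCase))
        {
            return false;
        }

        // Everything else is decided by the base implementation.
        return base.AcceptUrl(url);
    }
}
```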

ChooseIdentityForRequest(Request)

Picks a random identity from WebScraper.Identities for each request. Create Identities with proxy IP addresses, user agents, headers, cookies, username and password in your Init method and add them to the WebScraper.Identities list.

Override this method to create your own logic for non-random selection of a HttpIdentity for each request.

Declaration

public virtual HttpIdentity ChooseIdentityForRequest(Request request)

Parameters

Type	Name	Description
Request	request	The http Request

Returns

Type	Description
HttpIdentity	An HttpIdentity
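One possible non-random strategy is round-robin selection, sketched below. The class name RoundRobinScraper and the counter field are hypothetical; the sketch assumes Identities is not modified while the crawl is running, and it falls back to the base implementation when the list is empty.

```csharp
using System.Threading;
using IronWebScraper;

// Hypothetical intermediate class; Init and Parse are left to a further subclass.
public abstract class RoundRobinScraper : WebScraper
{
    // Shared counter; starts at -1 so the first request uses index 0.
    private int _next = -1;

    public override HttpIdentity ChooseIdentityForRequest(Request request)
    {
        int count = Identities.Count;
        if (count == 0)
        {
            // No custom identities: keep the default behaviour.
            return base.ChooseIdentityForRequest(request);
        }

        // Interlocked keeps the counter consistent across request threads;
        // the unsigned cast keeps the index valid after integer overflow.
        uint ticket = unchecked((uint)Interlocked.Increment(ref _next));
        return Identities[(int)(ticket % (uint)count)];
    }
}
```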

DownloadFile(String, String, Boolean, HttpIdentity)

Requests a file to be downloaded from the given Url to the local file-system. Often used for scraping documents, assets and images.

Normally called within a Parse method of IronWebScraper.WebScraper.

Declaration

public virtual string DownloadFile(string url, string path, bool overWrite = false, HttpIdentity identity = null)

Parameters

Type	Name	Description
System.String	url	The absolute url of the resource to be downloaded.
System.String	path	The path to which the downloaded file should be saved. You may give a directory name or a file name. Relative paths will be resolved relative to WorkingDirectory.
System.Boolean	overWrite	If set to `true` any existing file at the given path will be overwritten. If set to `false` a unique name such as "file(1).html" will be created in the case of a naming conflict.
HttpIdentity	identity	An HttpIdentity to send the Request. If null, the ChooseIdentityForRequest method will be used to find a suitable identity.

Returns

Type	Description
System.String	The file path (relative to WorkingDirectory) to which the file will be saved.

DownloadFile(Uri, String, Boolean, HttpIdentity)

Requests a file to be downloaded from the given Url to the local file-system. Often used for scraping documents, assets and images.

Normally called within a Parse method of IronWebScraper.WebScraper.

Declaration

public virtual string DownloadFile(Uri uri, string path, bool overWrite = false, HttpIdentity identity = null)

Parameters

Type	Name	Description
System.Uri	uri	The absolute uri of the resource to be downloaded.
System.String	path	The path to which the downloaded file should be saved. You may give a directory name or a file name. Relative paths will be resolved relative to WorkingDirectory.
System.Boolean	overWrite	If set to `true` any existing file at the given path will be overwritten. If set to `false` a unique name such as "file(1).html" will be created in the case of a naming conflict.
HttpIdentity	identity	An HttpIdentity to send the Request. If null, the ChooseIdentityForRequest method will be used to find a suitable identity.

Returns

Type	Description
System.String	The file path (relative to WorkingDirectory) to which the file will be saved.

DownloadFileUnique(String, String, HttpIdentity)

Much like DownloadFile except if the file has already been downloaded or exists locally, it will not be re-downloaded.

Requests a file to be downloaded from the given Url to the local file-system. Often used for scraping documents, assets and images.

Normally called within a Parse method of IronWebScraper.WebScraper.

Declaration

public virtual string DownloadFileUnique(string url, string path, HttpIdentity identity = null)

Parameters

Type	Name	Description
System.String	url	The URL.
System.String	path	The path.
HttpIdentity	identity	The identity.

Returns

Type	Description
System.String

DownloadImage(String, String, Int32, Int32, Boolean, HttpIdentity)

Requests an image to be downloaded from the given Url to the local file-system, optionally scaled to fit within maxWidth and maxHeight.

Normally called within a Parse method of IronWebScraper.WebScraper.

Declaration

public virtual string DownloadImage(string url, string path, int maxWidth = 0, int maxHeight = 0, bool overWrite = false, HttpIdentity identity = null)

Parameters

Type	Name	Description
System.String	url	The absolute url of the resource to be downloaded.
System.String	path	The path to which the downloaded file should be saved. You may give a directory name or a file name. Relative paths will be resolved relative to WorkingDirectory.
System.Int32	maxWidth	The Downloaded image will be scaled proportionally to this maximum width. Zero means no constraint.
System.Int32	maxHeight	The Downloaded image will be scaled proportionally to this maximum height. Zero means no constraint.
System.Boolean	overWrite	If set to `true` any existing file at the given path will be overwritten. If set to `false` a unique name such as "file(1).html" will be created in the case of a naming conflict.
HttpIdentity	identity	An HttpIdentity to send the Request. If null, the ChooseIdentityForRequest method will be used to find a suitable identity.

Returns

Type	Description
System.String	The file path (relative to WorkingDirectory) to which the image will be saved.

DownloadImage(Uri, String, Int32, Int32, Boolean, HttpIdentity)

Requests an image to be downloaded from the given Uri to the local file-system, optionally scaled to fit within maxWidth and maxHeight.

Normally called within a Parse method of IronWebScraper.WebScraper.

Declaration

public virtual string DownloadImage(Uri uri, string path, int maxWidth = 0, int maxHeight = 0, bool overWrite = false, HttpIdentity identity = null)

Parameters

Type	Name	Description
System.Uri	uri	The absolute uri of the resource to be downloaded.
System.String	path	The path to which the downloaded file should be saved. You may give a directory name or a file name. Relative paths will be resolved relative to WorkingDirectory.
System.Int32	maxWidth	The Downloaded image will be scaled proportionally to this maximum width. Zero means no constraint.
System.Int32	maxHeight	The Downloaded image will be scaled proportionally to this maximum height. Zero means no constraint.
System.Boolean	overWrite	If set to `true` any existing file at the given path will be overwritten. If set to `false` a unique name such as "file(1).html" will be created in the case of a naming conflict.
HttpIdentity	identity	An HttpIdentity to send the Request. If null, the ChooseIdentityForRequest method will be used to find a suitable identity.

Returns

Type	Description
System.String	The file path (relative to WorkingDirectory) to which the image will be saved.
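The download methods above might be combined inside a Parse method roughly as follows. This is a sketch under stated assumptions: the class name AssetScraper, the urls and the target directories are all hypothetical, and a real scraper would normally take the urls from the Response being parsed.

```csharp
using System;
using IronWebScraper;

// Hypothetical intermediate class; Init is left to a further subclass.
public abstract class AssetScraper : WebScraper
{
    public override void Parse(Response response)
    {
        // Overwrite any existing copy of the document.
        string pdfPath = DownloadFile(
            "https://example.com/docs/guide.pdf", "docs/", overWrite: true);

        // Skipped if the file was already downloaded or exists locally.
        string logoPath = DownloadFileUnique(
            "https://example.com/logo.svg", "assets/");

        // Scale proportionally to at most 800 pixels wide; height unconstrained.
        string photoPath = DownloadImage(
            "https://example.com/photo.jpg", "images/", maxWidth: 800);

        // Returned paths are relative to WorkingDirectory.
        Console.WriteLine($"{pdfPath}, {logoPath}, {photoPath}");
    }
}
```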

EnableWebCache()

Caches web http responses for reuse. This allows WebScraper classes to be modified and restarted without re-downloading previously scraped urls.

Declaration

public void EnableWebCache()

EnableWebCache(TimeSpan)

Caches web http responses for reuse. This allows WebScraper classes to be modified and restarted without re-downloading previously scraped urls.

Declaration

public void EnableWebCache(TimeSpan cacheDuration)

Parameters

Type	Name	Description
System.TimeSpan	cacheDuration	Duration that responses will be cached for.

FetchUrlContents(String, HttpIdentity)

A handy shortcut method that fetches the text content from any Url (synchronously).

Declaration

public static string FetchUrlContents(string url, HttpIdentity identity = null)

Parameters

Type	Name	Description
System.String	url	The absolute URL.
HttpIdentity	identity	Optional HTTP identity to choose a proxy, user agent, headers, username and password for the request.

Returns

Type	Description
System.String
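Because FetchUrlContents is declared static, it can be called without creating a scraper subclass. The snippet below is a minimal sketch; the url is a placeholder, and the null-conditional check is a defensive assumption, since this reference does not say what is returned when a request fails.

```csharp
using System;
using IronWebScraper;

public static class FetchExample
{
    public static void Main()
    {
        // Blocks until the response arrives; identity is left at its default (null).
        string html = WebScraper.FetchUrlContents("https://example.com/");

        // Failure behaviour is not documented here, so guard against null.
        Console.WriteLine($"Fetched {html?.Length ?? 0} characters.");
    }
}
```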

FetchUrlContentsBinary(String, HttpIdentity)

A handy shortcut method that fetches the content from any Url (synchronously) as binary data in a byte array (byte[]).

Declaration

public byte[] FetchUrlContentsBinary(string url, HttpIdentity identity = null)

Parameters

Type	Name	Description
System.String	url	The absolute URL.
HttpIdentity	identity	Optional HTTP identity to choose a proxy, user agent, headers, username and password for the request.

Returns

Type	Description
System.Byte[]

Init()

Override this method to initialize your web-scraper. Important tasks are to Request at least one start url and to set allowed/banned domain or url patterns.

Declaration

public abstract void Init()

Log(String, WebScraper.LogLevel)

Logs the specified message to the console. Logs can be enabled using EnableLogging. This function has been exposed and is overridable to allow for easy Email and Slack notification integration.

Declaration

public virtual void Log(string message, WebScraper.LogLevel type)

Parameters

Type	Name	Description
System.String	message	The string message.
WebScraper.LogLevel	type	The LogLevel.

ObeyRobotsDotTxtForHost(String)

Causes the WebScraper to always obey /robots.txt directives including path restrictions and crawl rates on a domain by domain basis. May be overridden for advanced control.

Declaration

public virtual bool ObeyRobotsDotTxtForHost(string host)

Parameters

Type	Name	Description
System.String	host

Returns

Type	Description
System.Boolean

Parse(Response)

Override this method to create the default Response handler for your web scraper. If you have multiple page types, you can add additional similar methods.

Declaration

public abstract void Parse(Response response)

Parameters

Type	Name	Description
Response	response	The http Response object to parse

PostRequest(String, Action<Response>, Dictionary<String, String>)

Adds a new request to the scrape-job queue using the http POST method.

Declaration

public virtual void PostRequest(string url, Action<Response> parse, Dictionary<string, string> postVaraibles)

Parameters

Type	Name	Description
System.String	url	The absolute url to be fetched. Developers may use Response.ToAbsoluteUrl to resolve all relative links to Absolute Url strings.
System.Action<Response>	parse	The method to be used to parse the Response (often this is WebScraper.Parse)
System.Collections.Generic.Dictionary<System.String, System.String>	postVaraibles	The POST variables as a dictionary of key-value pairs.

PostRequest(String, Action<Response>, Dictionary<String, String>, HttpIdentity, MetaData)

Adds a new request to the scrape-job queue using the http POST method.

Declaration

public virtual void PostRequest(string url, Action<Response> parse, Dictionary<string, string> postVaraibles, HttpIdentity identity = null, MetaData metaData = null)

Parameters

Type	Name	Description
System.String	url	The absolute url to be fetched. Developers may use Response.ToAbsoluteUrl to resolve all relative links to Absolute Url strings.
System.Action<Response>	parse	The method to be used to parse the Response (often this is WebScraper.Parse)
System.Collections.Generic.Dictionary<System.String, System.String>	postVaraibles	The POST variables as a dictionary of key-value pairs.
HttpIdentity	identity	An optional HttpIdentity to send the Request. If null, the ChooseIdentityForRequest method will be used to find a suitable identity.
MetaData	metaData	Additional information of any Type can be sent with the request and then re-read when the response is parsed.

PostRequest(String, Action<Response>, Dictionary<String, String>, MetaData)

Adds a new request to the scrape-job queue using the http POST method.

Declaration

public virtual void PostRequest(string url, Action<Response> parse, Dictionary<string, string> postVaraibles, MetaData metaData)

Parameters

Type	Name	Description
System.String	url	The absolute url to be fetched. Developers may use Response.ToAbsoluteUrl to resolve all relative links to Absolute Url strings.
System.Action<Response>	parse	The method to be used to parse the Response (often this is WebScraper.Parse)
System.Collections.Generic.Dictionary<System.String, System.String>	postVaraibles	The POST variables as a dictionary of key-value pairs.
MetaData	metaData	Additional information of any Type can be sent with the request and then re-read when the response is parsed.

PostRequest(Uri, Action<Response>, Dictionary<String, String>)

Adds a new request to the scrape-job queue using the http POST method.

Declaration

public virtual void PostRequest(Uri url, Action<Response> parse, Dictionary<string, string> postVaraibles)

Parameters

Type	Name	Description
System.Uri	url	The absolute url to be fetched. Developers may use Response.ToAbsoluteUrl to resolve all relative links to Absolute Url strings.
System.Action<Response>	parse	The method to be used to parse the Response (often this is WebScraper.Parse)
System.Collections.Generic.Dictionary<System.String, System.String>	postVaraibles	The POST variables as a dictionary of key-value pairs.

PostRequest(Uri, Action<Response>, Dictionary<String, String>, HttpIdentity, MetaData)

Adds a new request to the scrape-job queue using the http POST method.

Declaration

public virtual void PostRequest(Uri url, Action<Response> parse, Dictionary<string, string> postVaraibles, HttpIdentity identity = null, MetaData metaData = null)

Parameters

Type	Name	Description
System.Uri	url	The absolute url to be fetched. Developers may use Response.ToAbsoluteUrl to resolve all relative links to Absolute Url strings.
System.Action<Response>	parse	The method to be used to parse the Response (often this is WebScraper.Parse)
System.Collections.Generic.Dictionary<System.String, System.String>	postVaraibles	The POST variables as a dictionary of key-value pairs.
HttpIdentity	identity	An optional HttpIdentity to send the Request. If null, the ChooseIdentityForRequest method will be used to find a suitable identity.
MetaData	metaData	Additional information of any Type can be sent with the request and then re-read when the response is parsed.

PostRequest(Uri, Action<Response>, Dictionary<String, String>, MetaData)

Adds a new request to the scrape-job queue using the http POST method.

Declaration

public virtual void PostRequest(Uri url, Action<Response> parse, Dictionary<string, string> postVaraibles, MetaData metaData)

Parameters

Type	Name	Description
System.Uri	url	The absolute url to be fetched. Developers may use Response.ToAbsoluteUrl to resolve all relative links to Absolute Url strings.
System.Action<Response>	parse	The method to be used to parse the Response (often this is WebScraper.Parse)
System.Collections.Generic.Dictionary<System.String, System.String>	postVaraibles	The POST variables as a dictionary of key-value pairs.
MetaData	metaData	Additional information of any Type can be sent with the request and then re-read when the response is parsed.

Request(IEnumerable<String>, Action<Response>)

A key method called from within the Init and Parse methods. Request adds new requests to the scrape-job queue, and decides which method (e.g. Parse) will be used to parse the Response object.

Declaration

public virtual void Request(IEnumerable<string> urls, Action<Response> parse)