Class WebScraper
An easy-to-use base class that developers can extend to rapidly build custom web-scraping applications.
Inheritance
Namespace: IronWebScraper
Assembly: IronWebScraper.dll
Syntax
public abstract class WebScraper : Object
Constructors
WebScraper()
Declaration
protected WebScraper()
Fields
AllowedDomains
If not empty, the hostname of every requested Url must match at least one of the AllowedDomains patterns. Patterns may be added using glob wildcard strings or Regex.
Declaration
public UrlMatchPatternCollection AllowedDomains
Field Value
| Type | Description |
|---|---|
| UrlMatchPatternCollection |
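To make the relationship between glob wildcard strings and Regex concrete, here is a minimal sketch that uses only the .NET base class library. The `GlobSketch` class and its methods are hypothetical names written for this example. It assumes `*` matches any run of characters and `?` matches exactly one; it is not the implementation of UrlMatchPatternCollection, whose exact matching rules may differ.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper written for this example; not part of IronWebScraper.
public static class GlobSketch
{
    // Converts a glob such as "*.example.com" into an anchored, case-insensitive Regex.
    // Assumption: '*' matches any run of characters and '?' matches exactly one character.
    public static Regex ToRegex(string glob)
    {
        // Escape every Regex metacharacter first, then re-enable the two glob wildcards.
        string pattern = "^" + Regex.Escape(glob).Replace("\\*", ".*").Replace("\\?", ".") + "$";
        return new Regex(pattern, RegexOptions.IgnoreCase);
    }

    // Returns true when the whole input string matches the glob.
    public static bool IsMatch(string glob, string input)
    {
        return ToRegex(glob).IsMatch(input);
    }
}
```

With this sketch, `GlobSketch.IsMatch("*.example.com", "shop.example.com")` is true, while `example.com` on its own does not match because the pattern requires a literal dot before `example`.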
AllowedUrls
If not empty, all requested Urls must match at least one of the AllowedUrls patterns. Patterns may be added using glob wildcard strings or Regex.
Declaration
public UrlMatchPatternCollection AllowedUrls
Field Value
| Type | Description |
|---|---|
| UrlMatchPatternCollection |
BannedDomains
If not empty, the hostname of a requested Url must not match any of the BannedDomains patterns. Patterns may be added using glob wildcard strings or Regex.
Declaration
public UrlMatchPatternCollection BannedDomains
Field Value
| Type | Description |
|---|---|
| UrlMatchPatternCollection |
BannedUrls
If not empty, no requested Url may match any of the BannedUrls patterns. Patterns may be added using glob wildcard strings or Regex.
Declaration
public UrlMatchPatternCollection BannedUrls
Field Value
| Type | Description |
|---|---|
| UrlMatchPatternCollection |
CrawlId
A unique string used to identify a crawl job.
Declaration
public string CrawlId
Field Value
| Type | Description |
|---|---|
| System.String |
FilesDownloaded
The total number of files downloaded successfully with the DownloadImage and DownloadFile methods.
Declaration
public int FilesDownloaded
Field Value
| Type | Description |
|---|---|
| System.Int32 |
Identities
A list of http identities to be used to fetch web resources.
Each Identity may have a different proxy IP address, user agent, http headers, persistent cookies, username and password.
Best practice is to create Identities in your WebScraper.Init method and add them to this WebScraper.Identities list.
Declaration
public List<HttpIdentity> Identities
Field Value
| Type | Description |
|---|---|
| System.Collections.Generic.List<HttpIdentity> |
LoggingLevel
The level of logging made by the WebScraper engine to the Console.
LogLevel.Critical is normally the most useful setting, allowing developers to write their own meaningful, application-relevant messages inside Parse methods.
LogLevel.ScrapedData is useful when coding and testing a new WebScraper.
Declaration
public WebScraper.LogLevel LoggingLevel
Field Value
| Type | Description |
|---|---|
| WebScraper.LogLevel |
ObeyRobotsDotTxt
Causes the WebScraper to always obey /robots.txt directives including url and path restrictions and crawl rates.
Declaration
public bool ObeyRobotsDotTxt
Field Value
| Type | Description |
|---|---|
| System.Boolean |
WorkingDirectory
Path to a local directory where scraped data and state information will be saved.
Declaration
public string WorkingDirectory
Field Value
| Type | Description |
|---|---|
| System.String |
Properties
FailedUrls
Gets the number of failed http requests which have exceeded their total maximum number of retries.
Declaration
public int FailedUrls { get; }
Property Value
| Type | Description |
|---|---|
| System.Int32 |
HttpRetryAttempts
The number of times WebScraper will retry a failed URL (normally with a new identity) before considering it non-scrapable.
Declaration
public int HttpRetryAttempts { get; set; }
Property Value
| Type | Description |
|---|---|
| System.Int32 |
HttpTimeOut
Gets or sets the time after which an HTTP request will be considered failed or lost (host not contactable or DNS unavailable).
Declaration
public TimeSpan HttpTimeOut { get; set; }
Property Value
| Type | Description |
|---|---|
| System.TimeSpan |
MaxHttpConnectionLimit
Gets or sets the total number of allowed open HTTP requests (threads).
Declaration
public int MaxHttpConnectionLimit { get; set; }
Property Value
| Type | Description |
|---|---|
| System.Int32 |
OpenConnectionLimitPerHost
Gets or sets the allowed number of concurrent HTTP requests (threads) per hostname or IP address. This helps protect hosts against too many requests.
Declaration
public int OpenConnectionLimitPerHost { get; set; }
Property Value
| Type | Description |
|---|---|
| System.Int32 |
RateLimitPerHost
Gets or sets the minimum polite delay (pause) between requests to a given domain or IP address.
Declaration
public TimeSpan RateLimitPerHost { get; set; }
Property Value
| Type | Description |
|---|---|
| System.TimeSpan |
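The connection, timeout, retry and rate-limit settings on this page are usually configured together. The following is a minimal sketch that uses only members declared on this page. `PoliteDefaults` is a hypothetical helper name, and every number is an illustrative placeholder, not a recommended or default value.

```csharp
using System;
using IronWebScraper;

// Hypothetical helper written for this example.
public static class PoliteDefaults
{
    // Applies conservative crawl settings to any WebScraper instance.
    // All values below are placeholders chosen for illustration only.
    public static void Apply(WebScraper scraper)
    {
        scraper.ObeyRobotsDotTxt = true;                       // respect /robots.txt directives
        scraper.MaxHttpConnectionLimit = 20;                   // total open HTTP requests
        scraper.OpenConnectionLimitPerHost = 2;                // concurrent requests per host
        scraper.RateLimitPerHost = TimeSpan.FromSeconds(1);    // minimum pause between requests to one host
        scraper.HttpTimeOut = TimeSpan.FromSeconds(30);        // when a request counts as failed or lost
        scraper.HttpRetryAttempts = 3;                         // retries before a URL is given up
    }
}
```

A scraper could call `PoliteDefaults.Apply(this)` at the top of its Init override so that the limits are in place before the first Request is queued.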
SuccessfulFileDownloadCount
Gets the number of successful http downloads using the DownloadFile and DownloadImage methods.
Declaration
public int SuccessfulFileDownloadCount { get; }
Property Value
| Type | Description |
|---|---|
| System.Int32 |
SuccessfulfulRequestCount
Gets the number of successful http requests.
Declaration
public int SuccessfulfulRequestCount { get; }
Property Value
| Type | Description |
|---|---|
| System.Int32 |
ThrottleMode
Makes the WebScraper intelligently throttle requests not only by hostname, but also by host servers' IP addresses. This is polite in case multiple scraped domains are hosted on the same machine.
Declaration
public WebScraper.Throttle ThrottleMode { get; set; }
Property Value
| Type | Description |
|---|---|
| WebScraper.Throttle |
Methods
AcceptUrl(String)
Decides if the WebScraper will accept a given url. May be overridden to apply custom middleware logic.
Declaration
public virtual bool AcceptUrl(string url)
Parameters
| Type | Name | Description |
|---|---|---|
| System.String | url |
Returns
| Type | Description |
|---|---|
| System.Boolean |
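As a sketch of custom middleware logic, the override below keeps whatever the base implementation decides and adds one extra rule. `HttpsOnlyScraper` is a hypothetical class name, and the HTTPS-only rule is just an example. The class is left abstract because Init and Parse are not implemented here.

```csharp
using System;
using IronWebScraper;

// Hypothetical intermediate base class written for this example.
public abstract class HttpsOnlyScraper : WebScraper
{
    public override bool AcceptUrl(string url)
    {
        // Keep the built-in decision, then require an https:// scheme as an additional rule.
        return base.AcceptUrl(url)
            && url != null
            && url.StartsWith("https://", StringComparison.OrdinalIgnoreCase);
    }
}
```

Calling `base.AcceptUrl` first means the extra rule can only narrow the set of accepted urls, never widen it.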
ChooseIdentityForRequest(Request)
Picks a random identity from WebScraper.Identities for each request. Create Identities with proxy IP addresses, user agents, headers, cookies, usernames and passwords in your Init method and add them to the WebScraper.Identities list.
Override this method to create your own logic for non-random selection of a HttpIdentity for each request.
Declaration
public virtual HttpIdentity ChooseIdentityForRequest(Request request)
Parameters
| Type | Name | Description |
|---|---|---|
| Request | request | The http Request |
Returns
| Type | Description |
|---|---|
| HttpIdentity | An HttpIdentity |
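One possible non-random strategy is round-robin selection, sketched below using only the Identities field and the ChooseIdentityForRequest signature shown above. `RoundRobinScraper` is a hypothetical class name. Falling back to the base implementation when no identities exist is an assumption made for this example.

```csharp
using System.Threading;
using IronWebScraper;

// Hypothetical intermediate base class written for this example.
public abstract class RoundRobinScraper : WebScraper
{
    // Starts at -1 so the first increment selects index 0.
    private int _next = -1;

    public override HttpIdentity ChooseIdentityForRequest(Request request)
    {
        // Assumption: defer to the default behaviour when nothing has been configured.
        if (Identities == null || Identities.Count == 0)
        {
            return base.ChooseIdentityForRequest(request);
        }

        // Interlocked keeps the counter consistent if requests are prepared on several threads.
        int ticket = Interlocked.Increment(ref _next);

        // The unsigned cast keeps the index non-negative even after the counter wraps around.
        int index = (int)((uint)ticket % (uint)Identities.Count);
        return Identities[index];
    }
}
```

Round-robin spreads requests evenly across identities, which can be easier to reason about than random selection when each identity maps to a separate proxy.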
DownloadFile(String, String, Boolean, HttpIdentity)
Requests a file to be downloaded from the given Url to the local file-system. Often used for scraping documents, assets and images.
Normally called within a Parse method of IronWebScraper.WebScraper.
Declaration
public virtual string DownloadFile(string url, string path, bool overWrite = false, HttpIdentity identity = null)
Parameters
| Type | Name | Description |
|---|---|---|
| System.String | url | The absolute url of the resource to be downloaded. |
| System.String | path | The path to which the downloaded file should be saved. You may give a directory name or a file name. Relative paths will be resolved relative to WorkingDirectory. |
| System.Boolean | overWrite | If set to |
| HttpIdentity | identity | An HttpIdentity to send the Request. If null, the ChooseIdentityForRequest method will be used to find a suitable identity. |
Returns
| Type | Description |
|---|---|
| System.String | The file path (relative to WorkingDirectory) which the file will be saved to. |
DownloadFile(Uri, String, Boolean, HttpIdentity)
Requests a file to be downloaded from the given Url to the local file-system. Often used for scraping documents, assets and images.
Normally called within a Parse method of IronWebScraper.WebScraper.
Declaration
public virtual string DownloadFile(Uri uri, string path, bool overWrite = false, HttpIdentity identity = null)
Parameters
| Type | Name | Description |
|---|---|---|
| System.Uri | uri | The absolute uri of the resource to be downloaded. |
| System.String | path | The path to which the downloaded file should be saved. You may give a directory name or a file name. Relative paths will be resolved relative to WorkingDirectory. |
| System.Boolean | overWrite | If set to |
| HttpIdentity | identity | An HttpIdentity to send the Request. If null, the ChooseIdentityForRequest method will be used to find a suitable identity. |
Returns
| Type | Description |
|---|---|
| System.String | The file path (relative to WorkingDirectory) which the file will be saved to. |
DownloadFileUnique(String, String, HttpIdentity)
Much like DownloadFile except if the file has already been downloaded or exists locally, it will not be re-downloaded.
Requests a file to be downloaded from the given Url to the local file-system. Often used for scraping documents, assets and images.
Normally called within a Parse method of IronWebScraper.WebScraper.
Declaration
public virtual string DownloadFileUnique(string url, string path, HttpIdentity identity = null)
Parameters
| Type | Name | Description |
|---|---|---|
| System.String | url | The URL. |
| System.String | path | The path. |
| HttpIdentity | identity | The identity. |
Returns
| Type | Description |
|---|---|
| System.String |
DownloadImage(String, String, Int32, Int32, Boolean, HttpIdentity)
Requests a file to be downloaded from the given Url to the local file-system. Often used for scraping documents, assets and images.
Normally called within a Parse method of IronWebScraper.WebScraper.
Declaration
public virtual string DownloadImage(string url, string path, int maxWidth = 0, int maxHeight = 0, bool overWrite = false, HttpIdentity identity = null)
Parameters
| Type | Name | Description |
|---|---|---|
| System.String | url | The absolute url of the resource to be downloaded. |
| System.String | path | The path to which the downloaded file should be saved. You may give a directory name or a file name. Relative paths will be resolved relative to WorkingDirectory. |
| System.Int32 | maxWidth | The Downloaded image will be scaled proportionally to this maximum width. Zero means no constraint. |
| System.Int32 | maxHeight | The Downloaded image will be scaled proportionally to this maximum height. Zero means no constraint. |
| System.Boolean | overWrite | If set to |
| HttpIdentity | identity | An HttpIdentity to send the Request. If null, the ChooseIdentityForRequest method will be used to find a suitable identity. |
Returns
| Type | Description |
|---|---|
| System.String | The file path (relative to WorkingDirectory) which the image will be saved to. |
DownloadImage(Uri, String, Int32, Int32, Boolean, HttpIdentity)
Requests a file to be downloaded from the given Url to the local file-system. Often used for scraping documents, assets and images.
Normally called within a Parse method of IronWebScraper.WebScraper.
Declaration
public virtual string DownloadImage(Uri uri, string path, int maxWidth = 0, int maxHeight = 0, bool overWrite = false, HttpIdentity identity = null)
Parameters
| Type | Name | Description |
|---|---|---|
| System.Uri | uri | The absolute uri of the resource to be downloaded. |
| System.String | path | The path to which the downloaded file should be saved. You may give a directory name or a file name. Relative paths will be resolved relative to WorkingDirectory. |
| System.Int32 | maxWidth | The Downloaded image will be scaled proportionally to this maximum width. Zero means no constraint. |
| System.Int32 | maxHeight | The Downloaded image will be scaled proportionally to this maximum height. Zero means no constraint. |
| System.Boolean | overWrite | If set to |
| HttpIdentity | identity | An HttpIdentity to send the Request. If null, the ChooseIdentityForRequest method will be used to find a suitable identity. |
Returns
| Type | Description |
|---|---|
| System.String | The file path (relative to WorkingDirectory) which the image will be saved to. |
EnableWebCache()
Caches web http responses for reuse. This allows WebScraper classes to be modified and restarted without re-downloading previously scraped urls.
Declaration
public void EnableWebCache()
EnableWebCache(TimeSpan)
Caches web http responses for reuse. This allows WebScraper classes to be modified and restarted without re-downloading previously scraped urls.
Declaration
public void EnableWebCache(TimeSpan cacheDuration)
Parameters
| Type | Name | Description |
|---|---|---|
| System.TimeSpan | cacheDuration | Duration that responses will be cached for. |
FetchUrlContents(String, HttpIdentity)
A handy shortcut method that fetches the text content from any Url (synchronously).
Declaration
public static string FetchUrlContents(string url, HttpIdentity identity = null)
Parameters
| Type | Name | Description |
|---|---|---|
| System.String | url | The absolute URL. |
| HttpIdentity | identity | Optional HTTP identity to choose a proxy, user agent, headers, username and password for the request. |
Returns
| Type | Description |
|---|---|
| System.String |
FetchUrlContentsBinary(String, HttpIdentity)
A handy shortcut method that fetches the content of any Url (synchronously) as binary data in a byte array (byte[]).
Declaration
public byte[] FetchUrlContentsBinary(string url, HttpIdentity identity = null)
Parameters
| Type | Name | Description |
|---|---|---|
| System.String | url | The absolute URL. |
| HttpIdentity | identity | Optional HTTP identity to choose a proxy, user agent, headers, username and password for the request. |
Returns
| Type | Description |
|---|---|
| System.Byte[] |
Init()
Override this method to initialize your web-scraper. Important tasks will be to Request at least one start url... and set allowed/banned domain or url patterns.
Declaration
public abstract void Init()
Log(String, WebScraper.LogLevel)
Logs the specified message to the console. Logging can be enabled using EnableLogging. This function is exposed and overridable to allow for easy Email and Slack notification integration.
Declaration
public virtual void Log(string message, WebScraper.LogLevel type)
Parameters
| Type | Name | Description |
|---|---|---|
| System.String | message | The string message. |
| WebScraper.LogLevel | type | The LogLevel. |
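An override of Log is one place where a notification hook could live. The sketch below forwards every message to the base implementation and additionally writes critical messages to standard error, which stands in for an email or Slack call. `NotifyingScraper` is a hypothetical class name; only LogLevel.Critical, which is mentioned under LoggingLevel, is assumed to exist.

```csharp
using System;
using IronWebScraper;

// Hypothetical intermediate base class written for this example.
public abstract class NotifyingScraper : WebScraper
{
    public override void Log(string message, WebScraper.LogLevel type)
    {
        // Preserve the normal console logging behaviour.
        base.Log(message, type);

        if (type == WebScraper.LogLevel.Critical)
        {
            // Placeholder for a real notification (email, Slack webhook, and so on).
            Console.Error.WriteLine("[ALERT] " + message);
        }
    }
}
```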
ObeyRobotsDotTxtForHost(String)
Causes the WebScraper to always obey /robots.txt directives including path restrictions and crawl rates on a domain by domain basis. May be overridden for advanced control.
Declaration
public virtual bool ObeyRobotsDotTxtForHost(string host)
Parameters
| Type | Name | Description |
|---|---|---|
| System.String | host |
Returns
| Type | Description |
|---|---|
| System.Boolean |
Parse(Response)
Override this method to create the default Response handler for your web scraper. If you have multiple page types, you can add additional similar methods.
Declaration
public abstract void Parse(Response response)
Parameters
| Type | Name | Description |
|---|---|---|
| Response | response | The http Response object to parse |
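Put together, Init and Parse form the smallest working scraper. The sketch below is illustrative: `TitleScraper`, `PageTitle` and the start url are hypothetical, and `response.Css(...)` and `TextContentClean` are assumed members of the Response and HTML node types, which are not documented on this page and should be checked against their own reference.

```csharp
using IronWebScraper;

// Hypothetical data item; Scrape is documented as accepting any .NET object.
public class PageTitle
{
    public string Title { get; set; }
}

// Hypothetical scraper written for this example.
public class TitleScraper : WebScraper
{
    public override void Init()
    {
        LoggingLevel = WebScraper.LogLevel.Critical;
        // Placeholder start url; Parse is registered as the response handler.
        Request("https://example.com/", Parse);
    }

    public override void Parse(Response response)
    {
        // Assumption: Css returns the nodes matching a CSS selector,
        // and TextContentClean returns a node's trimmed text.
        foreach (var heading in response.Css("h1"))
        {
            Scrape(new PageTitle { Title = heading.TextContentClean });
        }
    }
}
```

Running it would then be a matter of `new TitleScraper().Start();`, with results appended to a JsonLines file as described under Scrape.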
PostRequest(String, Action<Response>, Dictionary<String, String>)
Request adds a new request to the scrape-job queue using the POST http method.
Declaration
public virtual void PostRequest(string url, Action<Response> parse, Dictionary<string, string> postVaraibles)
Parameters
| Type | Name | Description |
|---|---|---|
| System.String | url | The absolute url to be fetched. Developers may use Response.ToAbsoluteUrl to resolve all relative links to Absolute Url strings. |
| System.Action<Response> | parse | The method to be used to parse the Response (often this is WebScraper.Parse) |
| System.Collections.Generic.Dictionary<System.String, System.String> | postVaraibles | The POST variables as a dictionary of key-value pairs. |
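As a small illustration of this overload, the sketch below queues a form submission from Init. `SearchScraper`, the url and the form field are hypothetical placeholders. The class is left abstract because Parse is not implemented here.

```csharp
using System.Collections.Generic;
using IronWebScraper;

// Hypothetical intermediate base class written for this example.
public abstract class SearchScraper : WebScraper
{
    public override void Init()
    {
        // Form fields to send in the POST body; the key and value are placeholders.
        var form = new Dictionary<string, string>
        {
            { "q", "shoes" }
        };

        // The response to the POST is handed to Parse, like any other request.
        PostRequest("https://example.com/search", Parse, form);
    }
}
```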
PostRequest(String, Action<Response>, Dictionary<String, String>, HttpIdentity, MetaData)
Request adds a new request to the scrape-job queue using the POST http method.
Declaration
public virtual void PostRequest(string url, Action<Response> parse, Dictionary<string, string> postVaraibles, HttpIdentity identity = null, MetaData metaData = null)
Parameters
| Type | Name | Description |
|---|---|---|
| System.String | url | The absolute url to be fetched. Developers may use Response.ToAbsoluteUrl to resolve all relative links to Absolute Url strings. |
| System.Action<Response> | parse | The method to be used to parse the Response (often this is WebScraper.Parse) |
| System.Collections.Generic.Dictionary<System.String, System.String> | postVaraibles | The POST variables as a dictionary of key-value pairs. |
| HttpIdentity | identity | An optional HttpIdentity to send the Request. If null, the ChooseIdentityForRequest method will be used to find a suitable identity. |
| MetaData | metaData | Additional information of any Type can be sent with the request and then re-read when the response is parsed. |
PostRequest(String, Action<Response>, Dictionary<String, String>, MetaData)
Request adds a new request to the scrape-job queue using the POST http method.
Declaration
public virtual void PostRequest(string url, Action<Response> parse, Dictionary<string, string> postVaraibles, MetaData metaData)
Parameters
| Type | Name | Description |
|---|---|---|
| System.String | url | The absolute url to be fetched. Developers may use Response.ToAbsoluteUrl to resolve all relative links to Absolute Url strings. |
| System.Action<Response> | parse | The method to be used to parse the Response (often this is WebScraper.Parse) |
| System.Collections.Generic.Dictionary<System.String, System.String> | postVaraibles | The POST variables as a dictionary of key-value pairs. |
| MetaData | metaData | Additional information of any Type can be sent with the request and then re-read when the response is parsed. |
PostRequest(Uri, Action<Response>, Dictionary<String, String>)
Request adds a new request to the scrape-job queue using the POST http method.
Declaration
public virtual void PostRequest(Uri url, Action<Response> parse, Dictionary<string, string> postVaraibles)
Parameters
| Type | Name | Description |
|---|---|---|
| System.Uri | url | The absolute url to be fetched. Developers may use Response.ToAbsoluteUrl to resolve all relative links to Absolute Url strings. |
| System.Action<Response> | parse | The method to be used to parse the Response (often this is WebScraper.Parse) |
| System.Collections.Generic.Dictionary<System.String, System.String> | postVaraibles | The POST variables as a dictionary of key-value pairs. |
PostRequest(Uri, Action<Response>, Dictionary<String, String>, HttpIdentity, MetaData)
Request adds a new request to the scrape-job queue using the POST http method.
Declaration
public virtual void PostRequest(Uri url, Action<Response> parse, Dictionary<string, string> postVaraibles, HttpIdentity identity = null, MetaData metaData = null)
Parameters
| Type | Name | Description |
|---|---|---|
| System.Uri | url | The absolute url to be fetched. Developers may use Response.ToAbsoluteUrl to resolve all relative links to Absolute Url strings. |
| System.Action<Response> | parse | The method to be used to parse the Response (often this is WebScraper.Parse) |
| System.Collections.Generic.Dictionary<System.String, System.String> | postVaraibles | The POST variables as a dictionary of key-value pairs. |
| HttpIdentity | identity | An optional HttpIdentity to send the Request. If null, the ChooseIdentityForRequest method will be used to find a suitable identity. |
| MetaData | metaData | Additional information of any Type can be sent with the request and then re-read when the response is parsed. |
PostRequest(Uri, Action<Response>, Dictionary<String, String>, MetaData)
Request adds a new request to the scrape-job queue using the POST http method.
Declaration
public virtual void PostRequest(Uri url, Action<Response> parse, Dictionary<string, string> postVaraibles, MetaData metaData)
Parameters
| Type | Name | Description |
|---|---|---|
| System.Uri | url | The absolute url to be fetched. Developers may use Response.ToAbsoluteUrl to resolve all relative links to Absolute Url strings. |
| System.Action<Response> | parse | The method to be used to parse the Response (often this is WebScraper.Parse) |
| System.Collections.Generic.Dictionary<System.String, System.String> | postVaraibles | The POST variables as a dictionary of key-value pairs. |
| MetaData | metaData | Additional information of any Type can be sent with the request and then re-read when the response is parsed. |
Request(IEnumerable<String>, Action<Response>)
A key method called from within the Init and Parse methods. Request adds new requests to the scrape-job queue, and decides which method (e.g. Parse) will be used to parse the Response object.
Declaration
public virtual void Request(IEnumerable<string> urls, Action<Response> parse)
Parameters
| Type | Name | Description |
|---|---|---|
| System.Collections.Generic.IEnumerable<System.String> | urls | The Absolute url or urls to be fetched. Developers may use Response.ToAbsoluteUrl to resolve all relative links to Absolute Url strings. |
| System.Action<Response> | parse | The method to be used to parse the Response (often this is WebScraper.Parse) |
Request(String, Action<Response>)
A key method called from within the Init and Parse methods. Request adds a new request to the scrape-job queue, and decides which method (e.g. Parse) will be used to parse the Response object.
Declaration
public virtual void Request(string url, Action<Response> parse)
Parameters
| Type | Name | Description |
|---|---|---|
| System.String | url | The absolute url to be fetched. Developers may use Response.ToAbsoluteUrl to resolve all relative links to Absolute Url strings. |
| System.Action<Response> | parse | The method to be used to parse the Response (often this is WebScraper.Parse) |
Request(String, Action<Response>, HttpIdentity, MetaData)
A key method called from within the Init and Parse methods. Request adds a new request to the scrape-job queue, and decides which method (e.g. Parse) will be used to parse the Response object.
Declaration
public virtual void Request(string url, Action<Response> parse, HttpIdentity identity = null, MetaData metaData = null)
Parameters
| Type | Name | Description |
|---|---|---|
| System.String | url | The absolute url to be fetched. Developers may use Response.ToAbsoluteUrl to resolve all relative links to Absolute Url strings. |
| System.Action<Response> | parse | The method to be used to parse the Response (often this is WebScraper.Parse) |
| HttpIdentity | identity | An HttpIdentity to send the Request. If null, the ChooseIdentityForRequest method will be used to find a suitable identity. |
| MetaData | metaData | Additional information of any Type can be sent with the request and then re-read when the response is parsed. |
Request(String, Action<Response>, MetaData)
A key method called from within the Init and Parse methods. Request adds a new request to the scrape-job queue, and decides which method (e.g. Parse) will be used to parse the Response object.
Declaration
public virtual void Request(string url, Action<Response> parse, MetaData metaData)
Parameters
| Type | Name | Description |
|---|---|---|
| System.String | url | The absolute url to be fetched. Developers may use Response.ToAbsoluteUrl to resolve all relative links to Absolute Url strings. |
| System.Action<Response> | parse | The method to be used to parse the Response (often this is WebScraper.Parse) |
| MetaData | metaData | Additional information of any Type can be sent with the request and then re-read when the response is parsed. |
Request(Uri, Action<Response>)
A key method called from within the Init and Parse methods. Request adds a new request to the scrape-job queue, and decides which method (e.g. Parse) will be used to parse the Response object.
Declaration
public virtual void Request(Uri url, Action<Response> parse)
Parameters
| Type | Name | Description |
|---|---|---|
| System.Uri | url | The absolute url to be fetched. Developers may use Response.ToAbsoluteUrl to resolve all relative links to Absolute Url strings. |
| System.Action<Response> | parse | The method to be used to parse the Response (often this is WebScraper.Parse) |
Request(Uri, Action<Response>, HttpIdentity, MetaData)
A key method called from within the Init and Parse methods. Request adds a new request to the scrape-job queue, and decides which method (e.g. Parse) will be used to parse the Response object.
Declaration
public virtual void Request(Uri url, Action<Response> parse, HttpIdentity identity = null, MetaData metaData = null)
Parameters
| Type | Name | Description |
|---|---|---|
| System.Uri | url | The absolute url to be fetched. Developers may use Response.ToAbsoluteUrl to resolve all relative links to Absolute Url strings. |
| System.Action<Response> | parse | The method to be used to parse the Response (often this is WebScraper.Parse) |
| HttpIdentity | identity | An HttpIdentity to send the Request. If null, the ChooseIdentityForRequest method will be used to find a suitable identity. |
| MetaData | metaData | Additional information of any Type can be sent with the request and then re-read when the response is parsed. |
Request(Uri, Action<Response>, MetaData)
A key method called from within the Init and Parse methods. Request adds a new request to the scrape-job queue, and decides which method (e.g. Parse) will be used to parse the Response object.
Declaration
public virtual void Request(Uri url, Action<Response> parse, MetaData metaData)
Parameters
| Type | Name | Description |
|---|---|---|
| System.Uri | url | The absolute url to be fetched. Developers may use Response.ToAbsoluteUrl to resolve all relative links to Absolute Url strings. |
| System.Action<Response> | parse | The method to be used to parse the Response (often this is WebScraper.Parse) |
| MetaData | metaData | Additional information of any Type can be sent with the request and then re-read when the response is parsed. |
Retry(Response)
Retries a Response.
Usually called in a Parse method, this method is useful if a Captcha or error screen was encountered during Html parsing.
Declaration
public void Retry(Response response)
Parameters
| Type | Name | Description |
|---|---|---|
| Response | response |
Scrape(Object, String)
Appends any scraped data to a file in the JsonLines format. (1 json object per line). Will save any .Net object of any kind. This method is typically used with IronWebScraper.ScrapedData or developer defined classes for scraped data items. The default filename will follow the pattern "NameSpace.TypeName.jsonl". E.g: IronWebScraper.ScrapedData.jsonl
Declaration
public void Scrape(object item, string fileName = null)
Parameters
| Type | Name | Description |
|---|---|---|
| System.Object | item | |
| System.String | fileName |
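For readers unfamiliar with the JsonLines format, the sketch below shows the idea with the .NET base class library only: each item is serialized to one line of JSON and appended to a file. `JsonLinesSketch` is a hypothetical helper that uses System.Text.Json. It illustrates the file format and is not how Scrape is implemented; the serializer and exact output of Scrape may differ.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text.Json;

// Hypothetical helper written for this example; not part of IronWebScraper.
public static class JsonLinesSketch
{
    // Appends one item as a single line of compact JSON.
    public static void Append<T>(string path, T item)
    {
        // The default serializer options emit no line breaks, so each item stays on one line.
        File.AppendAllText(path, JsonSerializer.Serialize(item) + "\n");
    }

    // Reads every non-empty line back into an object of type T.
    public static List<T> ReadAll<T>(string path)
    {
        var items = new List<T>();
        foreach (string line in File.ReadLines(path))
        {
            if (line.Length > 0)
            {
                items.Add(JsonSerializer.Deserialize<T>(line));
            }
        }
        return items;
    }
}
```

Because each record is a self-contained line, a file in this format can be appended to during a long crawl and read back incrementally without parsing the whole file as one document.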
ScrapeUnique(Object, String)
Appends scraped data to a file in the JsonLines format. (1 json object per line). Automatically ignores duplicates. Will save any .Net object of any kind. This method is typically used with IronWebScraper.ScrapedData or developer defined classes for scraped data items. The default filename will follow the pattern "WorkingDirectory/NameSpace.TypeName.jsonl". E.g: Scrape/IronWebScraper.ScrapedData.jsonl
Declaration
public void ScrapeUnique(object item, string fileName = null)
Parameters
| Type | Name | Description |
|---|---|---|
| System.Object | item | |
| System.String | fileName |
SetSiteSpecificCrawlRateLimit(String, TimeSpan)
Sets a throttle limit for a specific domain.
Declaration
public void SetSiteSpecificCrawlRateLimit(string hostName, TimeSpan crawlRate)
Parameters
| Type | Name | Description |
|---|---|---|
| System.String | hostName | The http host (domain name). |
| System.TimeSpan | crawlRate | The maximum frequency of http requests for the given hostName. |
Start(String)
Starts the WebScraper.
Set CrawlId to make this crawl resumable. Will also resume a previous crawl with the same CrawlId if it exists.
Giving a CrawlId also causes the WebScraper to auto-save its state every 5 minutes in case of a crash, system failure or power outage. This feature is particularly useful for long running web-scraping tasks, allowing hours, days or even weeks of work to be recovered effortlessly.
Declaration
public void Start(string crawlId = null)
Parameters
| Type | Name | Description |
|---|---|---|
| System.String | crawlId |
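A resumable run might be started as in the sketch below. `CrawlRunner`, the directory name and the CrawlId are hypothetical placeholders. Setting WorkingDirectory before calling Start is an assumption made for this example; it could equally be set inside Init.

```csharp
using IronWebScraper;

// Hypothetical helper written for this example.
public static class CrawlRunner
{
    // Starts (or resumes) a crawl for any concrete WebScraper subclass.
    public static void RunResumable(WebScraper scraper)
    {
        // Placeholder directory where scraped data and state information are saved.
        scraper.WorkingDirectory = "scrape-output";

        // Reusing the same CrawlId on a later run resumes the previous crawl if it exists.
        scraper.Start("nightly-catalogue");
    }
}
```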
StartAsync(String)
Starts the WebScraper asynchronously. Set CrawlId to make this crawl resumable. Will resume a previous crawl with the same CrawlId if it exists.
Declaration
public Task StartAsync(string crawlId = null)
Parameters
| Type | Name | Description |
|---|---|---|
| System.String | crawlId |
Returns
| Type | Description |
|---|---|
| System.Threading.Tasks.Task |
Stop()
Stops this WebScraper instance gracefully. The WebScraper may be restarted later with no loss of data by calling Start(CrawlId) or StartAsync(CrawlId).
Declaration
public void Stop()
UnScrape(Boolean)
Retrieves IronWebScraper.ScrapedData objects which were saved using the WebScraper.Scrape method.
Declaration
public IEnumerable<ScrapedData> UnScrape(bool ignoreErrors)
Parameters
| Type | Name | Description |
|---|---|---|
| System.Boolean | ignoreErrors | if set to |
Returns
| Type | Description |
|---|---|
| System.Collections.Generic.IEnumerable<ScrapedData> |
UnScrape(String, Boolean)
Retrieves IronWebScraper.ScrapedData objects which were saved using the WebScraper.Scrape method.
Declaration
public IEnumerable<ScrapedData> UnScrape(string fileName = null, bool ignoreErrors = false)
Parameters
| Type | Name | Description |
|---|---|---|
| System.String | fileName | Path of the saved data file. |
| System.Boolean | ignoreErrors | if set to |
Returns
| Type | Description |
|---|---|
| System.Collections.Generic.IEnumerable<ScrapedData> |
UnScrape<T>(Boolean)
Retrieves native C# objects which were saved using the WebScraper.Scrape method in the JsonLines format.
Declaration
public IEnumerable<T> UnScrape<T>(bool ignoreErrors)
Parameters
| Type | Name | Description |
|---|---|---|
| System.Boolean | ignoreErrors | if set to |
Returns
| Type | Description |
|---|---|
| System.Collections.Generic.IEnumerable<T> |
Type Parameters
| Name | Description |
|---|---|
| T | The Type of object to be returned. Giving no value will return an IEnumerable of IronWebScraper.ScrapedData |
UnScrape<T>(String, Boolean)
Retrieves native C# objects which were saved using the WebScraper.Scrape method in the JsonLines format.
Declaration
public IEnumerable<T> UnScrape<T>(string fileName = null, bool ignoreErrors = false)
Parameters
| Type | Name | Description |
|---|---|---|
| System.String | fileName | Path of the saved data file. |
| System.Boolean | ignoreErrors | if set to |
Returns
| Type | Description |
|---|---|
| System.Collections.Generic.IEnumerable<T> |
Type Parameters
| Name | Description |
|---|---|
| T | The Type of object to be returned. Giving no value will return an IEnumerable of IronWebScraper.ScrapedData |
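Reading typed results back might look like the sketch below. `Product` and `ProductReport` are hypothetical names. The example assumes the items were previously saved with Scrape as `Product` objects, and that omitting fileName resolves to the same default file that Scrape used for that type; that behaviour is not confirmed on this page.

```csharp
using System;
using IronWebScraper;

// Hypothetical developer-defined data item.
public class Product
{
    public string Name { get; set; }
    public decimal Price { get; set; }
}

// Hypothetical helper written for this example.
public static class ProductReport
{
    // Prints every Product that the given scraper has saved.
    public static void Print(WebScraper scraper)
    {
        // Both parameters are left at their declared defaults (fileName = null, ignoreErrors = false).
        foreach (Product product in scraper.UnScrape<Product>())
        {
            Console.WriteLine(product.Name + ": " + product.Price);
        }
    }
}
```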