Class WebScraper
An easy-to-use base class that developers can extend to rapidly build custom web-scraping applications.
Inheritance
Namespace: IronWebScraper
Assembly: IronWebScraper.dll
Syntax
public abstract class WebScraper : Object
Constructors
WebScraper()
Declaration
protected WebScraper()
Fields
AllowedDomains
If not empty, the hostname of every requested Url must match at least one of the AllowedDomains patterns. Patterns may be added using glob wildcard strings or Regex.
Declaration
public UrlMatchPatternCollection AllowedDomains
Field Value
Type | Description |
---|---|
UrlMatchPatternCollection |
AllowedUrls
If not empty, every requested Url must match at least one of the AllowedUrls patterns. Patterns may be added using glob wildcard strings or Regex.
Declaration
public UrlMatchPatternCollection AllowedUrls
Field Value
Type | Description |
---|---|
UrlMatchPatternCollection |
BannedDomains
If not empty, the hostname of a requested Url may not match any of the BannedDomains patterns. Patterns may be added using glob wildcard strings or Regex.
Declaration
public UrlMatchPatternCollection BannedDomains
Field Value
Type | Description |
---|---|
UrlMatchPatternCollection |
BannedUrls
If not empty, no requested Url may match any of the BannedUrls patterns. Patterns may be added using glob wildcard strings or Regex.
Declaration
public UrlMatchPatternCollection BannedUrls
Field Value
Type | Description |
---|---|
UrlMatchPatternCollection |
CrawlId
A unique string used to identify a crawl job.
Declaration
public string CrawlId
Field Value
Type | Description |
---|---|
System.String |
FilesDownloaded
The total number of files downloaded successfully with the DownloadImage and DownloadFile methods.
Declaration
public int FilesDownloaded
Field Value
Type | Description |
---|---|
System.Int32 |
Identities
A list of http identities to be used to fetch web resources.
Each Identity may have a different proxy IP address, user agent, set of http headers, persistent cookies, username and password.
Best practice is to create Identities in your WebScraper.Init method and add them to this WebScraper.Identities list.
Declaration
public List<HttpIdentity> Identities
Field Value
Type | Description |
---|---|
System.Collections.Generic.List<HttpIdentity> |
LoggingLevel
The level of logging made by the WebScraper engine to the Console.
LogLevel.Critical is normally the most useful setting, allowing the developer to write their own meaningful, application-relevant messages inside Parse methods.
LogLevel.ScrapedData is useful when coding and testing a new WebScraper.
Declaration
public WebScraper.LogLevel LoggingLevel
Field Value
Type | Description |
---|---|
WebScraper.LogLevel |
ObeyRobotsDotTxt
Causes the WebScraper to always obey /robots.txt directives including url and path restrictions and crawl rates.
Declaration
public bool ObeyRobotsDotTxt
Field Value
Type | Description |
---|---|
System.Boolean |
WorkingDirectory
Path to a local directory where scraped data and state information will be saved.
Declaration
public string WorkingDirectory
Field Value
Type | Description |
---|---|
System.String |
Properties
FailedUrls
Gets the number of failed http requests which have exceeded their total maximum number of retries.
Declaration
public int FailedUrls { get; }
Property Value
Type | Description |
---|---|
System.Int32 |
HttpRetryAttempts
The number of times WebScraper will retry a failed URL (normally with a new identity) before considering it non-scrapable.
Declaration
public int HttpRetryAttempts { get; set; }
Property Value
Type | Description |
---|---|
System.Int32 |
HttpTimeOut
Gets or sets the time after which an HTTP request is considered failed or lost (for example, the host is not contactable or DNS is unavailable).
Declaration
public TimeSpan HttpTimeOut { get; set; }
Property Value
Type | Description |
---|---|
System.TimeSpan |
MaxHttpConnectionLimit
Gets or sets the total number of allowed open HTTP requests (threads).
Declaration
public int MaxHttpConnectionLimit { get; set; }
Property Value
Type | Description |
---|---|
System.Int32 |
OpenConnectionLimitPerHost
Gets or sets the allowed number of concurrent HTTP requests (threads) per hostname or IP address. This helps protect hosts against too many requests.
Declaration
public int OpenConnectionLimitPerHost { get; set; }
Property Value
Type | Description |
---|---|
System.Int32 |
RateLimitPerHost
Gets or sets the minimum polite delay (pause) between requests to a given domain or IP address.
Declaration
public TimeSpan RateLimitPerHost { get; set; }
Property Value
Type | Description |
---|---|
System.TimeSpan |
SuccessfulFileDownloadCount
Gets the number of successful http downloads using the DownloadFile and DownloadImage methods.
Declaration
public int SuccessfulFileDownloadCount { get; }
Property Value
Type | Description |
---|---|
System.Int32 |
SuccessfulfulRequestCount
Gets the number of successful http requests.
Declaration
public int SuccessfulfulRequestCount { get; }
Property Value
Type | Description |
---|---|
System.Int32 |
ThrottleMode
Makes the WebScraper intelligently throttle requests not only by hostname, but also by the host servers' IP addresses. This is polite in case multiple scraped domains are hosted on the same machine.
Declaration
public WebScraper.Throttle ThrottleMode { get; set; }
Property Value
Type | Description |
---|---|
WebScraper.Throttle |
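The throttling and retry properties above are typically set together. The following is a minimal sketch, assuming they can be assigned from Init; the class name `PoliteScraper` and every numeric value are illustrative choices, not recommendations from the library. The class is left abstract so that Parse (and the first Request call) can be supplied by a concrete subclass that calls `base.Init()`.

```csharp
using System;
using IronWebScraper;

// Hypothetical base class that applies conservative crawl settings.
public abstract class PoliteScraper : WebScraper
{
    public override void Init()
    {
        ObeyRobotsDotTxt = true;                     // honour /robots.txt rules
        MaxHttpConnectionLimit = 20;                 // total open requests
        OpenConnectionLimitPerHost = 2;              // concurrent requests per host
        RateLimitPerHost = TimeSpan.FromSeconds(1);  // pause between requests to one host
        HttpTimeOut = TimeSpan.FromSeconds(30);      // give up on slow responses
        HttpRetryAttempts = 3;                       // retries before a url counts as failed
    }
}
```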
Methods
AcceptUrl(String)
Decides if the WebScraper will accept a given url. May be overridden to apply custom middleware logic.
Declaration
public virtual bool AcceptUrl(string url)
Parameters
Type | Name | Description |
---|---|---|
System.String | url |
Returns
Type | Description |
---|---|
System.Boolean |
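As a hedged illustration of the override described above, the sketch below rejects any url ending in `.zip` on top of whatever the base implementation decides. It assumes that calling `base.AcceptUrl` preserves the built-in checks; the class name `NoArchiveScraper` and the `.zip` rule are hypothetical. The class is abstract so that Init and Parse can be provided by a concrete subclass.

```csharp
using System;
using IronWebScraper;

// Hypothetical intermediate class adding one custom acceptance rule.
public abstract class NoArchiveScraper : WebScraper
{
    public override bool AcceptUrl(string url)
    {
        // Keep the default decision first, then apply the extra rule.
        if (!base.AcceptUrl(url))
        {
            return false;
        }
        return !url.EndsWith(".zip", StringComparison.OrdinalIgnoreCase);
    }
}
```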
CheckLicense()
Checks the license before using IronWebScraper.
Declaration
protected static void CheckLicense()
Exceptions
Type | Condition |
---|---|
IronSoftware.Exceptions.LicensingException |
ChooseIdentityForRequest(Request)
Picks a random identity from WebScraper.Identities for each request. Create Identities with proxy IP addresses, user agents, headers, cookies, usernames and passwords in your Init method and add them to the WebScraper.Identities list.
Override this method to create your own logic for non-random selection of a HttpIdentity for each request.
Declaration
public virtual HttpIdentity ChooseIdentityForRequest(Request request)
Parameters
Type | Name | Description |
---|---|---|
Request | request | The http Request |
Returns
Type | Description |
---|---|
HttpIdentity | An HttpIdentity |
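One possible non-random strategy is round-robin selection. The sketch below is an assumption-laden example rather than library behaviour: the class name `RoundRobinScraper` is hypothetical, and it falls back to the base implementation when no identities have been added.

```csharp
using System.Threading;
using IronWebScraper;

// Hypothetical intermediate class; Init and Parse are left to a concrete subclass.
public abstract class RoundRobinScraper : WebScraper
{
    private int _ticket = -1;

    public override HttpIdentity ChooseIdentityForRequest(Request request)
    {
        if (Identities == null || Identities.Count == 0)
        {
            return base.ChooseIdentityForRequest(request);
        }

        // Interlocked keeps the counter consistent if several threads ask at once.
        int ticket = Interlocked.Increment(ref _ticket);

        // The unsigned cast keeps the index valid even after the counter wraps around.
        int index = (int)((uint)ticket % (uint)Identities.Count);
        return Identities[index];
    }
}
```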
DownloadFile(String, String, Boolean, HttpIdentity)
Requests a file to be downloaded from the given Url to the local file-system. Often used for scraping documents, assets and images.
Normally called within a Parse method of IronWebScraper.WebScraper.
Declaration
public virtual string DownloadFile(string url, string path, bool overWrite = false, HttpIdentity identity = null)
Parameters
Type | Name | Description |
---|---|---|
System.String | url | The absolute url of the resource to be downloaded. |
System.String | path | The path to which the downloaded file should be saved. You may give a directory name or a file name. Relative paths will be resolved relative to WorkingDirectory. |
System.Boolean | overWrite | If set to |
HttpIdentity | identity | An HttpIdentity to send the Request. If null, the ChooseIdentityForRequest method will be used to find a suitable identity. |
Returns
Type | Description |
---|---|
System.String | The file path (relative to WorkingDirectory) which the file will be saved to. |
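A short sketch of calling DownloadFile from a Parse method follows. The url, the `reports` directory, and the `DownloadRecord` and `ReportScraper` types are placeholders invented for illustration; the call itself uses only the signature documented above.

```csharp
using IronWebScraper;

// Hypothetical record of where a download was saved.
public class DownloadRecord
{
    public string LocalPath { get; set; }
}

// Hypothetical intermediate class; Init is left to a concrete subclass.
public abstract class ReportScraper : WebScraper
{
    public override void Parse(Response response)
    {
        // A relative path is resolved against WorkingDirectory, per the parameter table.
        string savedTo = DownloadFile(
            "https://example.com/files/report.pdf",
            "reports",
            overWrite: true);

        // Keep a note of the returned path alongside other scraped data.
        Scrape(new DownloadRecord { LocalPath = savedTo });
    }
}
```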
DownloadFile(Uri, String, Boolean, HttpIdentity)
Requests a file to be downloaded from the given Url to the local file-system. Often used for scraping documents, assets and images.
Normally called within a Parse method of IronWebScraper.WebScraper.
Declaration
public virtual string DownloadFile(Uri uri, string path, bool overWrite = false, HttpIdentity identity = null)
Parameters
Type | Name | Description |
---|---|---|
System.Uri | uri | The absolute uri of the resource to be downloaded. |
System.String | path | The path to which the downloaded file should be saved. You may give a directory name or a file name. Relative paths will be resolved relative to WorkingDirectory. |
System.Boolean | overWrite | If set to |
HttpIdentity | identity | An HttpIdentity to send the Request. If null, the ChooseIdentityForRequest method will be used to find a suitable identity. |
Returns
Type | Description |
---|---|
System.String | The file path (relative to WorkingDirectory) which the file will be saved to. |
DownloadFileUnique(String, String, HttpIdentity)
Much like DownloadFile except if the file has already been downloaded or exists locally, it will not be re-downloaded.
Requests a file to be downloaded from the given Url to the local file-system. Often used for scraping documents, assets and images.
Normally called within a Parse method of IronWebScraper.WebScraper.
Declaration
public virtual string DownloadFileUnique(string url, string path, HttpIdentity identity = null)
Parameters
Type | Name | Description |
---|---|---|
System.String | url | The URL. |
System.String | path | The path. |
HttpIdentity | identity | The identity. |
Returns
Type | Description |
---|---|
System.String |
DownloadImage(String, String, Int32, Int32, Boolean, HttpIdentity)
Requests a file to be downloaded from the given Url to the local file-system. Often used for scraping documents, assets and images.
Normally called within a Parse method of IronWebScraper.WebScraper.
Declaration
public virtual string DownloadImage(string url, string path, int maxWidth = 0, int maxHeight = 0, bool overWrite = false, HttpIdentity identity = null)
Parameters
Type | Name | Description |
---|---|---|
System.String | url | The absolute url of the resource to be downloaded. |
System.String | path | The path to which the downloaded file should be saved. You may give a directory name or a file name. Relative paths will be resolved relative to WorkingDirectory. |
System.Int32 | maxWidth | The Downloaded image will be scaled proportionally to this maximum width. Zero means no constraint. |
System.Int32 | maxHeight | The Downloaded image will be scaled proportionally to this maximum height. Zero means no constraint. |
System.Boolean | overWrite | If set to |
HttpIdentity | identity | An HttpIdentity to send the Request. If null, the ChooseIdentityForRequest method will be used to find a suitable identity. |
Returns
Type | Description |
---|---|
System.String | The file path (relative to WorkingDirectory) which the image will be saved to. |
DownloadImage(Uri, String, Int32, Int32, Boolean, HttpIdentity)
Requests a file to be downloaded from the given Url to the local file-system. Often used for scraping documents, assets and images.
Normally called within a Parse method of IronWebScraper.WebScraper.
Declaration
public virtual string DownloadImage(Uri uri, string path, int maxWidth = 0, int maxHeight = 0, bool overWrite = false, HttpIdentity identity = null)
Parameters
Type | Name | Description |
---|---|---|
System.Uri | uri | The absolute uri of the resource to be downloaded. |
System.String | path | The path to which the downloaded file should be saved. You may give a directory name or a file name. Relative paths will be resolved relative to WorkingDirectory. |
System.Int32 | maxWidth | The Downloaded image will be scaled proportionally to this maximum width. Zero means no constraint. |
System.Int32 | maxHeight | The Downloaded image will be scaled proportionally to this maximum height. Zero means no constraint. |
System.Boolean | overWrite | If set to |
HttpIdentity | identity | An HttpIdentity to send the Request. If null, the ChooseIdentityForRequest method will be used to find a suitable identity. |
Returns
Type | Description |
---|---|
System.String | The file path (relative to WorkingDirectory) which the image will be saved to. |
EnableWebCache()
Caches web http responses for reuse. This allows WebScraper classes to be modified and restarted without re-downloading previously scraped urls.
Declaration
public void EnableWebCache()
EnableWebCache(TimeSpan)
Caches web http responses for reuse. This allows WebScraper classes to be modified and restarted without re-downloading previously scraped urls.
Declaration
public void EnableWebCache(TimeSpan cacheDuration)
Parameters
Type | Name | Description |
---|---|---|
System.TimeSpan | cacheDuration | Duration that responses will be cached for. |
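For example, caching could be switched on while a scraper is still being developed. This sketch assumes EnableWebCache may be called from Init; the 12-hour duration and the class name `CachedScraper` are arbitrary illustrative choices.

```csharp
using System;
using IronWebScraper;

// Hypothetical intermediate class; Parse is left to a concrete subclass,
// which would call base.Init() before queuing its first Request.
public abstract class CachedScraper : WebScraper
{
    public override void Init()
    {
        // Responses fetched within the last 12 hours are reused instead of re-downloaded.
        EnableWebCache(TimeSpan.FromHours(12));
    }
}
```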
FetchUrlContents(String, HttpIdentity)
A handy shortcut method that fetches the text content from any Url (synchronously).
Declaration
public static string FetchUrlContents(string url, HttpIdentity identity = null)
Parameters
Type | Name | Description |
---|---|---|
System.String | url | The absolute URL. |
HttpIdentity | identity | Optional HTTP identity to choose a proxy, user agent, headers, username and password for the request. |
Returns
Type | Description |
---|---|
System.String |
FetchUrlContentsBinary(String, HttpIdentity)
A handy shortcut method that fetches the content of any Url (synchronously) as binary data in a byte array (byte[]).
Declaration
public byte[] FetchUrlContentsBinary(string url, HttpIdentity identity = null)
Parameters
Type | Name | Description |
---|---|---|
System.String | url | The absolute URL. |
HttpIdentity | identity | Optional HTTP identity to choose a proxy, user agent, headers, username and password for the request. |
Returns
Type | Description |
---|---|
System.Byte[] |
Init()
Override this method to initialize your web scraper. Important tasks are to Request at least one start url and to set allowed/banned domain or url patterns.
Declaration
public abstract void Init()
Log(String, WebScraper.LogLevel)
Logs the specified message to the console. Logs can be enabled using the EnableLogging. This function is exposed and overridable to allow for easy Email and Slack notification integration.
Declaration
public virtual void Log(string message, WebScraper.LogLevel type)
Parameters
Type | Name | Description |
---|---|---|
System.String | message | The string message. |
WebScraper.LogLevel | type | The LogLevel. |
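The notification idea above might be sketched as follows. Rather than calling a real email or Slack client, this example only queues critical messages for some other component to send; the `AlertingScraper` class and its `PendingAlerts` queue are hypothetical, and the equality test assumes a message is logged with exactly `LogLevel.Critical`.

```csharp
using System.Collections.Concurrent;
using IronWebScraper;

// Hypothetical intermediate class; Init and Parse are left to a concrete subclass.
public abstract class AlertingScraper : WebScraper
{
    // Critical messages waiting to be forwarded by a notifier of your choice.
    public ConcurrentQueue<string> PendingAlerts { get; } = new ConcurrentQueue<string>();

    public override void Log(string message, WebScraper.LogLevel type)
    {
        // Preserve the default console logging.
        base.Log(message, type);

        if (type == WebScraper.LogLevel.Critical)
        {
            PendingAlerts.Enqueue(message);
        }
    }
}
```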
ObeyRobotsDotTxtForHost(String)
Causes the WebScraper to always obey /robots.txt directives including path restrictions and crawl rates on a domain by domain basis. May be overridden for advanced control.
Declaration
public virtual bool ObeyRobotsDotTxtForHost(string host)
Parameters
Type | Name | Description |
---|---|---|
System.String | host |
Returns
Type | Description |
---|---|
System.Boolean |
Parse(Response)
Override this method to create the default Response handler for your web scraper. If you have multiple page types, you can add additional similar methods.
Declaration
public abstract void Parse(Response response)
Parameters
Type | Name | Description |
---|---|---|
Response | response | The http Response object to parse |
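Putting Init and Parse together, a minimal concrete scraper could look like the sketch below. It uses only members documented on this page; the `PageVisit` type, the `ExampleScraper` name and the example.com url are placeholders, and a real Parse method would extract values from the response instead of recording a timestamp.

```csharp
using System;
using IronWebScraper;

// Hypothetical item type; Scrape accepts any .NET object.
public class PageVisit
{
    public DateTime SeenAtUtc { get; set; }
}

public class ExampleScraper : WebScraper
{
    public override void Init()
    {
        LoggingLevel = WebScraper.LogLevel.Critical;

        // Queue the start url and name the method that will handle its response.
        Request("https://example.com/", Parse);
    }

    public override void Parse(Response response)
    {
        // Placeholder: record that a page was handled.
        Scrape(new PageVisit { SeenAtUtc = DateTime.UtcNow });
    }
}
```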
PostRequest(String, Action<Response>, Dictionary<String, String>)
Request adds a new request to the scrape-job queue using the POST http method.
Declaration
public virtual void PostRequest(string url, Action<Response> parse, Dictionary<string, string> postVaraibles)
Parameters
Type | Name | Description |
---|---|---|
System.String | url | The absolute url to be fetched. Developers may use Response.ToAbsoluteUrl to resolve all relative links to Absolute Url strings. |
System.Action<Response> | parse | The method to be used to parse the Response (often this is WebScraper.Parse) |
System.Collections.Generic.Dictionary<System.String, System.String> | postVaraibles | The POST variables as a dictionary of key-value pairs. |
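A hedged sketch of queuing a form submission with this overload is shown below. The url and the form field names `q` and `page` are invented for illustration, as is the class name `SearchFormScraper`.

```csharp
using System.Collections.Generic;
using IronWebScraper;

// Hypothetical intermediate class; Parse is left to a concrete subclass.
public abstract class SearchFormScraper : WebScraper
{
    public override void Init()
    {
        // POST variables are passed as plain key-value pairs.
        var form = new Dictionary<string, string>
        {
            ["q"] = "widgets",
            ["page"] = "1"
        };

        PostRequest("https://example.com/search", Parse, form);
    }
}
```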
PostRequest(String, Action<Response>, Dictionary<String, String>, HttpIdentity, MetaData)
Request adds a new request to the scrape-job queue using the POST http method.
Declaration
public virtual void PostRequest(string url, Action<Response> parse, Dictionary<string, string> postVaraibles, HttpIdentity identity = null, MetaData metaData = null)
Parameters
Type | Name | Description |
---|---|---|
System.String | url | The absolute url to be fetched. Developers may use Response.ToAbsoluteUrl to resolve all relative links to Absolute Url strings. |
System.Action<Response> | parse | The method to be used to parse the Response (often this is WebScraper.Parse) |
System.Collections.Generic.Dictionary<System.String, System.String> | postVaraibles | The POST variables as a dictionary of key-value pairs. |
HttpIdentity | identity | An optional HttpIdentity to send the Request. If null, the ChooseIdentityForRequest method will be used to find a suitable identity. |
MetaData | metaData | Additional information of any Type can be sent with the request and then re-read when the response is parsed. |
PostRequest(String, Action<Response>, Dictionary<String, String>, MetaData)
Request adds a new request to the scrape-job queue using the POST http method.
Declaration
public virtual void PostRequest(string url, Action<Response> parse, Dictionary<string, string> postVaraibles, MetaData metaData)
Parameters
Type | Name | Description |
---|---|---|
System.String | url | The absolute url to be fetched. Developers may use Response.ToAbsoluteUrl to resolve all relative links to Absolute Url strings. |
System.Action<Response> | parse | The method to be used to parse the Response (often this is WebScraper.Parse) |
System.Collections.Generic.Dictionary<System.String, System.String> | postVaraibles | The POST variables as a dictionary of key-value pairs. |
MetaData | metaData | Additional information of any Type can be sent with the request and then re-read when the response is parsed. |
PostRequest(Uri, Action<Response>, Dictionary<String, String>)
Request adds a new request to the scrape-job queue using the POST http method.
Declaration
public virtual void PostRequest(Uri url, Action<Response> parse, Dictionary<string, string> postVaraibles)
Parameters
Type | Name | Description |
---|---|---|
System.Uri | url | The absolute url to be fetched. Developers may use Response.ToAbsoluteUrl to resolve all relative links to Absolute Url strings. |
System.Action<Response> | parse | The method to be used to parse the Response (often this is WebScraper.Parse) |
System.Collections.Generic.Dictionary<System.String, System.String> | postVaraibles | The POST variables as a dictionary of key-value pairs. |
PostRequest(Uri, Action<Response>, Dictionary<String, String>, HttpIdentity, MetaData)
Request adds a new request to the scrape-job queue using the POST http method.
Declaration
public virtual void PostRequest(Uri url, Action<Response> parse, Dictionary<string, string> postVaraibles, HttpIdentity identity = null, MetaData metaData = null)
Parameters
Type | Name | Description |
---|---|---|
System.Uri | url | The absolute url to be fetched. Developers may use Response.ToAbsoluteUrl to resolve all relative links to Absolute Url strings. |
System.Action<Response> | parse | The method to be used to parse the Response (often this is WebScraper.Parse) |
System.Collections.Generic.Dictionary<System.String, System.String> | postVaraibles | The POST variables as a dictionary of key-value pairs. |
HttpIdentity | identity | An optional HttpIdentity to send the Request. If null, the ChooseIdentityForRequest method will be used to find a suitable identity. |
MetaData | metaData | Additional information of any Type can be sent with the request and then re-read when the response is parsed. |
PostRequest(Uri, Action<Response>, Dictionary<String, String>, MetaData)
Request adds a new request to the scrape-job queue using the POST http method.
Declaration
public virtual void PostRequest(Uri url, Action<Response> parse, Dictionary<string, string> postVaraibles, MetaData metaData)
Parameters
Type | Name | Description |
---|---|---|
System.Uri | url | The absolute url to be fetched. Developers may use Response.ToAbsoluteUrl to resolve all relative links to Absolute Url strings. |
System.Action<Response> | parse | The method to be used to parse the Response (often this is WebScraper.Parse) |
System.Collections.Generic.Dictionary<System.String, System.String> | postVaraibles | The POST variables as a dictionary of key-value pairs. |
MetaData | metaData | Additional information of any Type can be sent with the request and then re-read when the response is parsed. |
Request(IEnumerable<String>, Action<Response>)
A key method called from within the Init and Parse methods. Request adds new requests to the scrape-job queue, and decides which method (e.g. Parse) will be used to parse the Response object.
Declaration
public virtual void Request(IEnumerable<string> urls, Action<Response> parse)
Parameters
Type | Name | Description |
---|---|---|
System.Collections.Generic.IEnumerable<System.String> | urls | The Absolute url or urls to be fetched. Developers may use Response.ToAbsoluteUrl to resolve all relative links to Absolute Url strings. |
System.Action<Response> | parse | The method to be used to parse the Response (often this is WebScraper.Parse) |
Request(String, Action<Response>)
A key method called from within the Init and Parse methods. Request adds a new request to the scrape-job queue, and decides which method (e.g. Parse) will be used to parse the Response object.
Declaration
public virtual void Request(string url, Action<Response> parse)
Parameters
Type | Name | Description |
---|---|---|
System.String | url | The absolute url to be fetched. Developers may use Response.ToAbsoluteUrl to resolve all relative links to Absolute Url strings. |
System.Action<Response> | parse | The method to be used to parse the Response (often this is WebScraper.Parse) |
Request(String, Action<Response>, HttpIdentity, MetaData)
A key method called from within the Init and Parse methods. Request adds a new request to the scrape-job queue, and decides which method (e.g. Parse) will be used to parse the Response object.
Declaration
public virtual void Request(string url, Action<Response> parse, HttpIdentity identity = null, MetaData metaData = null)
Parameters
Type | Name | Description |
---|---|---|
System.String | url | The absolute url to be fetched. Developers may use Response.ToAbsoluteUrl to resolve all relative links to Absolute Url strings. |
System.Action<Response> | parse | The method to be used to parse the Response (often this is WebScraper.Parse) |
HttpIdentity | identity | An HttpIdentity to send the Request. If null, the ChooseIdentityForRequest method will be used to find a suitable identity. |
MetaData | metaData | Additional information of any Type can be sent with the request and then re-read when the response is parsed. |
Request(String, Action<Response>, MetaData)
A key method called from within the Init and Parse methods. Request adds a new request to the scrape-job queue, and decides which method (e.g. Parse) will be used to parse the Response object.
Declaration
public virtual void Request(string url, Action<Response> parse, MetaData metaData)
Parameters
Type | Name | Description |
---|---|---|
System.String | url | The absolute url to be fetched. Developers may use Response.ToAbsoluteUrl to resolve all relative links to Absolute Url strings. |
System.Action<Response> | parse | The method to be used to parse the Response (often this is WebScraper.Parse) |
MetaData | metaData | Additional information of any Type can be sent with the request and then re-read when the response is parsed. |
Request(Uri, Action<Response>)
A key method called from within the Init and Parse methods. Request adds a new request to the scrape-job queue, and decides which method (e.g. Parse) will be used to parse the Response object.
Declaration
public virtual void Request(Uri url, Action<Response> parse)
Parameters
Type | Name | Description |
---|---|---|
System.Uri | url | The absolute url to be fetched. Developers may use Response.ToAbsoluteUrl to resolve all relative links to Absolute Url strings. |
System.Action<Response> | parse | The method to be used to parse the Response (often this is WebScraper.Parse) |
Request(Uri, Action<Response>, HttpIdentity, MetaData)
A key method called from within the Init and Parse methods. Request adds a new request to the scrape-job queue, and decides which method (e.g. Parse) will be used to parse the Response object.
Declaration
public virtual void Request(Uri url, Action<Response> parse, HttpIdentity identity = null, MetaData metaData = null)
Parameters
Type | Name | Description |
---|---|---|
System.Uri | url | The absolute url to be fetched. Developers may use Response.ToAbsoluteUrl to resolve all relative links to Absolute Url strings. |
System.Action<Response> | parse | The method to be used to parse the Response (often this is WebScraper.Parse) |
HttpIdentity | identity | An HttpIdentity to send the Request. If null, the ChooseIdentityForRequest method will be used to find a suitable identity. |
MetaData | metaData | Additional information of any Type can be sent with the request and then re-read when the response is parsed. |
Request(Uri, Action<Response>, MetaData)
A key method called from within the Init and Parse methods. Request adds a new request to the scrape-job queue, and decides which method (e.g. Parse) will be used to parse the Response object.
Declaration
public virtual void Request(Uri url, Action<Response> parse, MetaData metaData)
Parameters
Type | Name | Description |
---|---|---|
System.Uri | url | The absolute url to be fetched. Developers may use Response.ToAbsoluteUrl to resolve all relative links to Absolute Url strings. |
System.Action<Response> | parse | The method to be used to parse the Response (often this is WebScraper.Parse) |
MetaData | metaData | Additional information of any Type can be sent with the request and then re-read when the response is parsed. |
Retry(Response)
Retries a Response.
Usually called in a Parse method, this method is useful if a Captcha or error screen was encountered during Html parsing.
Declaration
public void Retry(Response response)
Parameters
Type | Name | Description |
---|---|---|
Response | response |
Scrape(Object, String)
Appends any scraped data to a file in the JsonLines format. (1 json object per line). Will save any .Net object of any kind. This method is typically used with IronWebScraper.ScrapedData or developer defined classes for scraped data items. The default filename will follow the pattern "NameSpace.TypeName.jsonl". E.g: IronWebScraper.ScrapedData.jsonl
Declaration
public void Scrape(object item, string fileName = null)
Parameters
Type | Name | Description |
---|---|---|
System.Object | item | |
System.String | fileName |
ScrapeUnique(Object, String)
Appends scraped data to a file in the JsonLines format. (1 json object per line). Automatically ignores duplicates. Will save any .Net object of any kind. This method is typically used with IronWebScraper.ScrapedData or developer defined classes for scraped data items. The default filename will follow the pattern "WorkingDirectory/NameSpace.TypeName.jsonl". E.g: Scrape/IronWebScraper.ScrapedData.jsonl
Declaration
public void ScrapeUnique(object item, string fileName = null)
Parameters
Type | Name | Description |
---|---|---|
System.Object | item | |
System.String | fileName |
SetSiteSpecificCrawlRateLimit(String, TimeSpan)
Sets a throttle limit for a specific domain.
Declaration
public void SetSiteSpecificCrawlRateLimit(string hostName, TimeSpan crawlRate)
Parameters
Type | Name | Description |
---|---|---|
System.String | hostName | The http host (domain name). |
System.TimeSpan | crawlRate | The maximum frequency of http requests for the given hostName. |
Start(String)
Starts the WebScraper.
Set CrawlId to make this crawl resumable. Will also resume a previous crawl with the same CrawlId if it exists.
Giving a CrawlId also causes the WebScraper to auto-save its state every 5 minutes in case of a crash, system failure or power outage. This feature is particularly useful for long running web-scraping tasks, allowing hours, days or even weeks of work to be recovered effortlessly.
Declaration
public void Start(string crawlId = null)
Parameters
Type | Name | Description |
---|---|---|
System.String | crawlId |
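As a usage sketch, a resumable crawl could be launched as below. The id string `catalogue-crawl`, the `CatalogueScraper` class and the start url are placeholders; according to the description above, running the program again with the same id would resume the earlier crawl.

```csharp
using IronWebScraper;

// Hypothetical minimal scraper used only to demonstrate Start.
public class CatalogueScraper : WebScraper
{
    public override void Init()
    {
        Request("https://example.com/", Parse);
    }

    public override void Parse(Response response)
    {
        // Extract and Scrape items here.
    }
}

public static class Program
{
    public static void Main()
    {
        var scraper = new CatalogueScraper();

        // Passing a crawl id makes the job resumable and enables periodic state saves.
        scraper.Start("catalogue-crawl");
    }
}
```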
StartAsync(String)
Starts the WebScraper asynchronously. Set CrawlId to make this crawl resumable. Will resume a previous crawl with the same CrawlId if it exists.
Declaration
public Task StartAsync(string crawlId = null)
Parameters
Type | Name | Description |
---|---|---|
System.String | crawlId |
Returns
Type | Description |
---|---|
System.Threading.Tasks.Task |
Stop()
Stops this WebScraper instance gracefully. The WebScraper may be restarted later with no loss of data by calling Start(CrawlId) or StartAsync(CrawlId).
Declaration
public void Stop()
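One way to combine StartAsync and Stop is to time-box a crawl, as in the sketch below. The ten-minute budget, the crawl id and the `TimeboxedScraper` class are illustrative assumptions. This reference does not state when the task returned by StartAsync completes after Stop is called, so the example does not wait on it in that branch.

```csharp
using System;
using System.Threading.Tasks;
using IronWebScraper;

// Hypothetical minimal scraper used only to demonstrate StartAsync and Stop.
public class TimeboxedScraper : WebScraper
{
    public override void Init()
    {
        Request("https://example.com/", Parse);
    }

    public override void Parse(Response response)
    {
        // Extract and Scrape items here.
    }
}

public static class Program
{
    public static async Task Main()
    {
        var scraper = new TimeboxedScraper();
        Task crawl = scraper.StartAsync("timeboxed-crawl");

        // Wait for whichever finishes first: the crawl or the time budget.
        Task first = await Task.WhenAny(crawl, Task.Delay(TimeSpan.FromMinutes(10)));

        if (first == crawl)
        {
            await crawl;      // surface any exception from the crawl
        }
        else
        {
            scraper.Stop();   // can be resumed later with the same crawl id
        }
    }
}
```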
UnScrape(Boolean)
Retrieves IronWebScraper.ScrapedData objects which were saved using the WebScraper.Scrape method.
Declaration
public IEnumerable<ScrapedData> UnScrape(bool ignoreErrors)
Parameters
Type | Name | Description |
---|---|---|
System.Boolean | ignoreErrors | if set to |
Returns
Type | Description |
---|---|
System.Collections.Generic.IEnumerable<ScrapedData> |
UnScrape(String, Boolean)
Retrieves IronWebScraper.ScrapedData objects which were saved using the WebScraper.Scrape method.
Declaration
public IEnumerable<ScrapedData> UnScrape(string fileName = null, bool ignoreErrors = false)
Parameters
Type | Name | Description |
---|---|---|
System.String | fileName | Path of the saved data file. |
System.Boolean | ignoreErrors | if set to |
Returns
Type | Description |
---|---|
System.Collections.Generic.IEnumerable<ScrapedData> |
UnScrape<T>(Boolean)
Retrieves native C# objects which were saved using the WebScraper.Scrape method in the JsonLines format.
Declaration
public IEnumerable<T> UnScrape<T>(bool ignoreErrors)
Parameters
Type | Name | Description |
---|---|---|
System.Boolean | ignoreErrors | if set to |
Returns
Type | Description |
---|---|
System.Collections.Generic.IEnumerable<T> |
Type Parameters
Name | Description |
---|---|
T | The Type of object to be returned. Giving no value will return an IEnumerable of IronWebScraper.ScrapedData |
UnScrape<T>(String, Boolean)
Retrieves native C# objects which were saved using the WebScraper.Scrape method in the JsonLines format.
Declaration
public IEnumerable<T> UnScrape<T>(string fileName = null, bool ignoreErrors = false)
Parameters
Type | Name | Description |
---|---|---|
System.String | fileName | Path of the saved data file. |
System.Boolean | ignoreErrors | if set to |
Returns
Type | Description |
---|---|
System.Collections.Generic.IEnumerable<T> |
Type Parameters
Name | Description |
---|---|
T | The Type of object to be returned. Giving no value will return an IEnumerable of IronWebScraper.ScrapedData |
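The Scrape and UnScrape methods can be used as a pair: items are written as JsonLines during the crawl and read back as typed objects afterwards. The sketch below assumes that UnScrape&lt;T&gt; with no fileName looks in the same default file that ScrapeUnique wrote for that type; the `Product` type, its values, the `ProductScraper` class and the url are hypothetical.

```csharp
using System;
using IronWebScraper;

// Hypothetical item type with public setters so it can be read back from JSON.
public class Product
{
    public string Name { get; set; }
    public decimal Price { get; set; }
}

public class ProductScraper : WebScraper
{
    public override void Init()
    {
        Request("https://example.com/products", Parse);
    }

    public override void Parse(Response response)
    {
        // Placeholder values; a real scraper would read them from the response.
        ScrapeUnique(new Product { Name = "Sample", Price = 9.99m });
    }
}

public static class Program
{
    public static void Main()
    {
        var scraper = new ProductScraper();
        scraper.Start();

        // Read the saved items back as strongly typed objects.
        foreach (Product product in scraper.UnScrape<Product>())
        {
            Console.WriteLine($"{product.Name}: {product.Price}");
        }
    }
}
```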