    Class WebScraper

    An easy-to-use base class that developers can extend to rapidly build custom web-scraping applications.

    Inheritance
    System.Object
    WebScraper
    Namespace: IronWebScraper
    Assembly: IronWebScraper.dll
    Syntax
    public abstract class WebScraper : Object

    Constructors

    WebScraper()

    Declaration
    protected WebScraper()

    Fields

    AllowedDomains

    If not empty, the hostname of every requested Url must match at least one of the AllowedDomains patterns. Patterns may be added as glob wildcard strings or Regex.

    Declaration
    public UrlMatchPatternCollection AllowedDomains
    Field Value
    Type Description
    UrlMatchPatternCollection

    AllowedUrls

    If not empty, every requested Url must match at least one of the AllowedUrls patterns. Patterns may be added as glob wildcard strings or Regex.

    Declaration
    public UrlMatchPatternCollection AllowedUrls
    Field Value
    Type Description
    UrlMatchPatternCollection

    BannedDomains

    If not empty, the hostname of a requested Url may not match any of the BannedDomains patterns. Patterns may be added as glob wildcard strings or Regex.

    Declaration
    public UrlMatchPatternCollection BannedDomains
    Field Value
    Type Description
    UrlMatchPatternCollection

    BannedUrls

    If not empty, no requested Url may match any of the BannedUrls patterns. Patterns may be added as glob wildcard strings or Regex.

    Declaration
    public UrlMatchPatternCollection BannedUrls
    Field Value
    Type Description
    UrlMatchPatternCollection

    CrawlId

    A unique string used to identify a crawl job.

    Declaration
    public string CrawlId
    Field Value
    Type Description
    System.String

    FilesDownloaded

    The total number of files downloaded successfully with the DownloadImage and DownloadFile methods.

    Declaration
    public int FilesDownloaded
    Field Value
    Type Description
    System.Int32

    Identities

    A list of http identities to be used to fetch web resources.

    Each Identity may have a different proxy IP address, user agent, http headers, persistent cookies, username and password.

    Best practice is to create Identities in your WebScraper.Init Method and Add them to this WebScraper.Identities List.

    Declaration
    public List<HttpIdentity> Identities
    Field Value
    Type Description
    System.Collections.Generic.List<HttpIdentity>

    LoggingLevel

    The level of logging made by the WebScraper engine to the Console.

    LogLevel.Critical is normally the most useful setting, allowing the developer to write their own meaningful, application-relevant messages inside Parse methods.

    LogLevel.ScrapedData is useful when coding and testing a new WebScraper.

    Declaration
    public WebScraper.LogLevel LoggingLevel
    Field Value
    Type Description
    WebScraper.LogLevel

    ObeyRobotsDotTxt

    Causes the WebScraper to always obey /robots.txt directives including url and path restrictions and crawl rates.

    Declaration
    public bool ObeyRobotsDotTxt
    Field Value
    Type Description
    System.Boolean

    WorkingDirectory

    Path to a local directory where scraped data and state information will be saved.

    Declaration
    public string WorkingDirectory
    Field Value
    Type Description
    System.String

    Properties

    FailedUrls

    Gets the number of failed http requests which have exceeded their total maximum number of retries.

    Declaration
    public int FailedUrls { get; }
    Property Value
    Type Description
    System.Int32

    HttpRetryAttempts

    The number of times WebScraper will retry a failed URL (normally with a new identity) before considering it non-scrapable.

    Declaration
    public int HttpRetryAttempts { get; set; }
    Property Value
    Type Description
    System.Int32

    HttpTimeOut

    Gets or sets the time after which an HTTP request will be considered failed or lost (host not contactable or DNS unavailable).

    Declaration
    public TimeSpan HttpTimeOut { get; set; }
    Property Value
    Type Description
    System.TimeSpan

    MaxHttpConnectionLimit

    Gets or sets the total number of allowed open HTTP requests (threads).

    Declaration
    public int MaxHttpConnectionLimit { get; set; }
    Property Value
    Type Description
    System.Int32

    OpenConnectionLimitPerHost

    Gets or sets the allowed number of concurrent HTTP requests (threads) per hostname or IP address. This helps protect hosts against too many requests.

    Declaration
    public int OpenConnectionLimitPerHost { get; set; }
    Property Value
    Type Description
    System.Int32

    RateLimitPerHost

    Gets or sets the minimum polite delay (pause) between requests to a given domain or IP address.

    Declaration
    public TimeSpan RateLimitPerHost { get; set; }
    Property Value
    Type Description
    System.TimeSpan
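
    The connection, retry and rate-limit settings above are usually assigned together. The following is a minimal sketch, assuming they are set inside Init; the class name PoliteScraper, the URL and every numeric value are illustrative placeholders rather than recommended defaults.

```csharp
using System;
using IronWebScraper;

// Hypothetical scraper: the class name, URL and values are placeholders.
public class PoliteScraper : WebScraper
{
    public override void Init()
    {
        ObeyRobotsDotTxt = true;                      // follow /robots.txt directives
        MaxHttpConnectionLimit = 20;                  // total open HTTP requests
        OpenConnectionLimitPerHost = 2;               // concurrent requests per host or IP
        RateLimitPerHost = TimeSpan.FromSeconds(1);   // minimum pause between requests to one host
        HttpTimeOut = TimeSpan.FromSeconds(30);       // treat a request as failed after 30 seconds
        HttpRetryAttempts = 3;                        // retries before a URL counts as non-scrapable

        Request("https://example.com/", Parse);
    }

    public override void Parse(Response response)
    {
        // Page-specific parsing would go here.
    }
}
```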

    SuccessfulFileDownloadCount

    Gets the number of successful http downloads using the DownloadFile and DownloadImage methods.

    Declaration
    public int SuccessfulFileDownloadCount { get; }
    Property Value
    Type Description
    System.Int32

    SuccessfulfulRequestCount

    Gets the number of successful http requests.

    Declaration
    public int SuccessfulfulRequestCount { get; }
    Property Value
    Type Description
    System.Int32

    ThrottleMode

    Makes the WebScraper intelligently throttle requests not only by hostname, but also by host servers' IP addresses. This is polite in case multiple scraped domains are hosted on the same machine.

    Declaration
    public WebScraper.Throttle ThrottleMode { get; set; }
    Property Value
    Type Description
    WebScraper.Throttle

    true if we wish to look up hosts' IP addresses for throttling; otherwise, false.

    Methods

    AcceptUrl(String)

    Decides if the WebScraper will accept a given url. May be overridden to apply custom middleware logic.

    Declaration
    public virtual bool AcceptUrl(string url)
    Parameters
    Type Name Description
    System.String url
    Returns
    Type Description
    System.Boolean
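
    As a rough illustration of custom middleware logic, the sketch below overrides AcceptUrl to reject very long urls before deferring to the base implementation. The class name, the URL and the 200-character threshold are hypothetical, and it is an assumption that the base method applies the built-in checks.

```csharp
using System;
using IronWebScraper;

// Hypothetical scraper: name, URL and threshold are illustrative only.
public class FilteringScraper : WebScraper
{
    private const int MaxUrlLength = 200; // arbitrary example threshold

    public override void Init()
    {
        Request("https://example.com/", Parse);
    }

    public override bool AcceptUrl(string url)
    {
        // Custom rule first; the base call is only reached for short enough urls.
        return url.Length <= MaxUrlLength && base.AcceptUrl(url);
    }

    public override void Parse(Response response)
    {
        // Page-specific parsing would go here.
    }
}
```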

    ChooseIdentityForRequest(Request)

    Picks a random identity from WebScraper.Identities for each request. Create Identities with proxy IP addresses, user agents, headers, cookies, username and password in your Init method and add them to the WebScraper.Identities list.

    Override this method to create your own logic for non-random selection of a HttpIdentity for each request.

    Declaration
    public virtual HttpIdentity ChooseIdentityForRequest(Request request)
    Parameters
    Type Name Description
    Request request

    The http Request

    Returns
    Type Description
    HttpIdentity

    An HttpIdentity
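
    One possible non-random selection strategy is round-robin. The sketch below is only an assumption about how such an override might look: the class name, the URL and the two empty identities are placeholders, and real identities would be configured before being added.

```csharp
using System.Threading;
using IronWebScraper;

// Hypothetical scraper that cycles through Identities in order.
public class RoundRobinScraper : WebScraper
{
    private int _counter = -1;

    public override void Init()
    {
        // Placeholder identities; configure proxies, user agents etc. as needed.
        Identities.Add(new HttpIdentity());
        Identities.Add(new HttpIdentity());
        Request("https://example.com/", Parse);
    }

    public override HttpIdentity ChooseIdentityForRequest(Request request)
    {
        if (Identities.Count == 0)
        {
            // Nothing to rotate through, so fall back to the default behaviour.
            return base.ChooseIdentityForRequest(request);
        }

        // Interlocked keeps the counter consistent if requests run on several threads;
        // masking with int.MaxValue keeps the index non-negative after overflow.
        int ticket = Interlocked.Increment(ref _counter) & int.MaxValue;
        return Identities[ticket % Identities.Count];
    }

    public override void Parse(Response response)
    {
        // Page-specific parsing would go here.
    }
}
```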

    DownloadFile(String, String, Boolean, HttpIdentity)

    Requests a file to be downloaded from the given Url to the local file-system. Often used for scraping documents, assets and images.

    Normally called within a Parse method of IronWebScraper.WebScraper.

    Declaration
    public virtual string DownloadFile(string url, string path, bool overWrite = false, HttpIdentity identity = null)
    Parameters
    Type Name Description
    System.String url

    The absolute url of the resource to be downloaded.

    System.String path

    The path to which the downloaded file should be saved. You may give a directory name or a file name.

    Relative paths will be resolved relative to WorkingDirectory.

    System.Boolean overWrite

    If set to true any existing file at the given path will be overwritten. If set to false a unique name such as "file(1).html" will be created in the case of a naming conflict.

    HttpIdentity identity

    An HttpIdentity to send the Request. If null, the ChooseIdentityForRequest method will be used to find a suitable identity.

    Returns
    Type Description
    System.String

    The file path (relative to WorkingDirectory) which the file will be saved to.

    DownloadFile(Uri, String, Boolean, HttpIdentity)

    Requests a file to be downloaded from the given Url to the local file-system. Often used for scraping documents, assets and images.

    Normally called within a Parse method of IronWebScraper.WebScraper.

    Declaration
    public virtual string DownloadFile(Uri uri, string path, bool overWrite = false, HttpIdentity identity = null)
    Parameters
    Type Name Description
    System.Uri uri

    The absolute uri of the resource to be downloaded.

    System.String path

    The path to which the downloaded file should be saved. You may give a directory name or a file name.

    Relative paths will be resolved relative to WorkingDirectory.

    System.Boolean overWrite

    If set to true any existing file at the given path will be overwritten. If set to false a unique name such as "file(1).html" will be created in the case of a naming conflict.

    HttpIdentity identity

    An HttpIdentity to send the Request. If null, the ChooseIdentityForRequest method will be used to find a suitable identity.

    Returns
    Type Description
    System.String

    The file path (relative to WorkingDirectory) which the file will be saved to.

    DownloadFileUnique(String, String, HttpIdentity)

    Much like DownloadFile except if the file has already been downloaded or exists locally, it will not be re-downloaded.

    Requests a file to be downloaded from the given Url to the local file-system. Often used for scraping documents, assets and images.

    Normally called within a Parse method of IronWebScraper.WebScraper.

    Declaration
    public virtual string DownloadFileUnique(string url, string path, HttpIdentity identity = null)
    Parameters
    Type Name Description
    System.String url

    The URL.

    System.String path

    The path.

    HttpIdentity identity

    The identity.

    Returns
    Type Description
    System.String

    DownloadImage(String, String, Int32, Int32, Boolean, HttpIdentity)

    Requests a file to be downloaded from the given Url to the local file-system. Often used for scraping documents, assets and images.

    Normally called within a Parse method of IronWebScraper.WebScraper.

    Declaration
    public virtual string DownloadImage(string url, string path, int maxWidth = 0, int maxHeight = 0, bool overWrite = false, HttpIdentity identity = null)
    Parameters
    Type Name Description
    System.String url

    The absolute url of the resource to be downloaded.

    System.String path

    The path to which the downloaded file should be saved. You may give a directory name or a file name.

    Relative paths will be resolved relative to WorkingDirectory.

    System.Int32 maxWidth

    The Downloaded image will be scaled proportionally to this maximum width. Zero means no constraint.

    System.Int32 maxHeight

    The Downloaded image will be scaled proportionally to this maximum height. Zero means no constraint.

    System.Boolean overWrite

    If set to true any existing file at the given path will be overwritten. If set to false a unique name such as "file(1).html" will be created in the case of a naming conflict.

    HttpIdentity identity

    An HttpIdentity to send the Request. If null, the ChooseIdentityForRequest method will be used to find a suitable identity.

    Returns
    Type Description
    System.String

    The file path (relative to WorkingDirectory) which the image will be saved to.

    DownloadImage(Uri, String, Int32, Int32, Boolean, HttpIdentity)

    Requests a file to be downloaded from the given Url to the local file-system. Often used for scraping documents, assets and images.

    Normally called within a Parse method of IronWebScraper.WebScraper.

    Declaration
    public virtual string DownloadImage(Uri uri, string path, int maxWidth = 0, int maxHeight = 0, bool overWrite = false, HttpIdentity identity = null)
    Parameters
    Type Name Description
    System.Uri uri

    The absolute uri of the resource to be downloaded.

    System.String path

    The path to which the downloaded file should be saved. You may give a directory name or a file name.

    Relative paths will be resolved relative to WorkingDirectory.

    System.Int32 maxWidth

    The Downloaded image will be scaled proportionally to this maximum width. Zero means no constraint.

    System.Int32 maxHeight

    The Downloaded image will be scaled proportionally to this maximum height. Zero means no constraint.

    System.Boolean overWrite

    If set to true any existing file at the given path will be overwritten. If set to false a unique name such as "file(1).html" will be created in the case of a naming conflict.

    HttpIdentity identity

    An HttpIdentity to send the Request. If null, the ChooseIdentityForRequest method will be used to find a suitable identity.

    Returns
    Type Description
    System.String

    The file path (relative to WorkingDirectory) which the image will be saved to.
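
    The download methods might be used from a Parse method roughly as follows. This is a sketch using only the signatures documented above; the class name, urls and folder names are hypothetical, and in practice the urls would come from the parsed page rather than being hard-coded.

```csharp
using System;
using IronWebScraper;

// Hypothetical scraper: class name, urls and folders are placeholders.
public class AssetScraper : WebScraper
{
    public override void Init()
    {
        WorkingDirectory = "Output"; // relative download paths are resolved against this
        Request("https://example.com/", Parse);
    }

    public override void Parse(Response response)
    {
        // Save a document into the "docs" directory; a unique name is used on conflict
        // because overWrite keeps its default value of false.
        string docPath = DownloadFile("https://example.com/report.pdf", "docs");

        // Save an image scaled proportionally to at most 400 pixels wide;
        // maxHeight stays 0, which means no height constraint.
        string imagePath = DownloadImage("https://example.com/logo.png", "images", maxWidth: 400);

        Console.WriteLine("Saving to " + docPath + " and " + imagePath);
    }
}
```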

    EnableWebCache()

    Caches web http responses for reuse. This allows WebScraper classes to be modified and restarted without re-downloading previously scraped urls.

    Declaration
    public void EnableWebCache()

    EnableWebCache(TimeSpan)

    Caches web http responses for reuse. This allows WebScraper classes to be modified and restarted without re-downloading previously scraped urls.

    Declaration
    public void EnableWebCache(TimeSpan cacheDuration)
    Parameters
    Type Name Description
    System.TimeSpan cacheDuration

    Duration that responses will be cached for.
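
    A minimal sketch of enabling the cache before a crawl starts. Calling EnableWebCache on the instance before Start is an assumption about typical usage; the class name, URL and six-hour duration are placeholders.

```csharp
using System;
using IronWebScraper;

// Hypothetical scraper used only to demonstrate the cache call.
public class CachedScraper : WebScraper
{
    public override void Init()
    {
        Request("https://example.com/", Parse);
    }

    public override void Parse(Response response)
    {
        // Page-specific parsing would go here.
    }
}

public static class Program
{
    public static void Main()
    {
        var scraper = new CachedScraper();
        scraper.EnableWebCache(TimeSpan.FromHours(6)); // keep cached responses for six hours
        scraper.Start();
    }
}
```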

    FetchUrlContents(String, HttpIdentity)

    A handy shortcut method that fetches the text content from any Url (synchronously).

    Declaration
    public static string FetchUrlContents(string url, HttpIdentity identity = null)
    Parameters
    Type Name Description
    System.String url

    The absolute URL.

    HttpIdentity identity

    Optional HTTP identity to choose a proxy, user agent, headers, username and password for the request.

    Returns
    Type Description
    System.String
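
    Because FetchUrlContents is declared static, it can be called without creating a scraper instance. A small sketch, with a placeholder URL and a hypothetical FetchDemo class:

```csharp
using System;
using IronWebScraper;

public static class FetchDemo
{
    public static void Main()
    {
        // Blocks until the response text has been fetched; identity defaults to null.
        string html = WebScraper.FetchUrlContents("https://example.com/");
        Console.WriteLine("Fetched " + html.Length + " characters.");
    }
}
```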

    FetchUrlContentsBinary(String, HttpIdentity)

    A handy shortcut method that fetches the content of any Url (synchronously) as binary data in a byte array (byte[]).

    Declaration
    public byte[] FetchUrlContentsBinary(string url, HttpIdentity identity = null)
    Parameters
    Type Name Description
    System.String url

    The absolute URL.

    HttpIdentity identity

    Optional HTTP identity to choose a proxy, user agent, headers, username and password for the request.

    Returns
    Type Description
    System.Byte[]

    Init()

    Override this method to initialize your web scraper. Important tasks are to Request at least one start url and to set allowed/banned domain or url patterns.

    Declaration
    public abstract void Init()

    Log(String, WebScraper.LogLevel)

    Logs the specified message to the console. Logs can be enabled using EnableLogging. This function has been exposed and is overridable to allow for easy Email and Slack notification integration.

    Declaration
    public virtual void Log(string Message, WebScraper.LogLevel Type)
    Parameters
    Type Name Description
    System.String Message

    The string message.

    WebScraper.LogLevel Type

    The LogLevel.
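
    The override hook could be used along these lines. In this sketch an in-memory queue stands in for a real Email or Slack client; the class name, the queue and the URL are hypothetical.

```csharp
using System.Collections.Concurrent;
using IronWebScraper;

// Hypothetical scraper that captures critical log messages for later notification.
public class NotifyingScraper : WebScraper
{
    // Stand-in for an Email or Slack client; thread-safe because logging may
    // happen from several request threads.
    public readonly ConcurrentQueue<string> CriticalMessages = new ConcurrentQueue<string>();

    public override void Init()
    {
        Request("https://example.com/", Parse);
    }

    public override void Log(string Message, WebScraper.LogLevel Type)
    {
        base.Log(Message, Type); // keep the normal console output

        if (Type == WebScraper.LogLevel.Critical)
        {
            CriticalMessages.Enqueue(Message);
        }
    }

    public override void Parse(Response response)
    {
        // Page-specific parsing would go here.
    }
}
```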

    ObeyRobotsDotTxtForHost(String)

    Causes the WebScraper to always obey /robots.txt directives including path restrictions and crawl rates on a domain by domain basis. May be overridden for advanced control.

    Declaration
    public virtual bool ObeyRobotsDotTxtForHost(string Host)
    Parameters
    Type Name Description
    System.String Host
    Returns
    Type Description
    System.Boolean

    Parse(Response)

    Override this method to create the default Response handler for your web scraper. If you have multiple page types, you can add additional similar methods.

    Declaration
    public abstract void Parse(Response response)
    Parameters
    Type Name Description
    Response response

    The http Response object to parse
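
    A skeleton showing how Init, the default Parse handler and an additional handler for a second page type might fit together. The class name, urls and the ParseItem method are hypothetical, and extracting data from the Response is left out because the Response members are not described on this page.

```csharp
using IronWebScraper;

// Hypothetical two-stage scraper: a listing page handled by Parse,
// and detail pages handled by a second method.
public class CatalogScraper : WebScraper
{
    public override void Init()
    {
        WorkingDirectory = "Output";
        LoggingLevel = WebScraper.LogLevel.Critical;
        Request("https://example.com/catalog", Parse); // at least one start url
    }

    // Default handler, used here for the listing page.
    public override void Parse(Response response)
    {
        // In practice this url would be read from the listing page.
        Request("https://example.com/catalog/item-1", ParseItem);
    }

    // Additional handler with the same shape, used for detail pages.
    public void ParseItem(Response response)
    {
        // Extract fields from the detail page and save them, for example with Scrape.
    }
}
```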

    PostRequest(String, Action<Response>, Dictionary<String, String>)

    Adds a new request to the scrape-job queue using the POST http method.

    Declaration
    public virtual void PostRequest(string url, Action<Response> parse, Dictionary<string, string> postVaraibles)
    Parameters
    Type Name Description
    System.String url

    The absolute url to be fetched. Developers may use Response.ToAbsoluteUrl to resolve all relative links to Absolute Url strings.

    System.Action<Response> parse

    The method to be used to parse the Response (often this is WebScraper.Parse)

    System.Collections.Generic.Dictionary<System.String, System.String> postVaraibles

    The POST variables as a dictionary of key-value pairs.
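
    A short sketch of queueing a form submission with this overload. The class name, URL and form field names are made up for illustration.

```csharp
using System.Collections.Generic;
using IronWebScraper;

// Hypothetical scraper that starts from a POSTed search form.
public class SearchScraper : WebScraper
{
    public override void Init()
    {
        // Key-value pairs sent as the POST variables.
        var form = new Dictionary<string, string>
        {
            { "query", "web scraping" },
            { "page", "1" }
        };

        PostRequest("https://example.com/search", Parse, form);
    }

    public override void Parse(Response response)
    {
        // Handle the search results page here.
    }
}
```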

    PostRequest(String, Action<Response>, Dictionary<String, String>, HttpIdentity, MetaData)

    Adds a new request to the scrape-job queue using the POST http method.

    Declaration
    public virtual void PostRequest(string url, Action<Response> parse, Dictionary<string, string> postVaraibles, HttpIdentity identity = null, MetaData metaData = null)
    Parameters
    Type Name Description
    System.String url

    The absolute url to be fetched. Developers may use Response.ToAbsoluteUrl to resolve all relative links to Absolute Url strings.

    System.Action<Response> parse

    The method to be used to parse the Response (often this is WebScraper.Parse)

    System.Collections.Generic.Dictionary<System.String, System.String> postVaraibles

    The POST variables as a dictionary of key-value pairs.

    HttpIdentity identity

    An optional HttpIdentity to send the Request. If null, the ChooseIdentityForRequest method will be used to find a suitable identity.

    MetaData metaData

    Additional information of any Type can be sent with the request and then re-read when the response is parsed.

    PostRequest(String, Action<Response>, Dictionary<String, String>, MetaData)

    Adds a new request to the scrape-job queue using the POST http method.

    Declaration
    public virtual void PostRequest(string url, Action<Response> parse, Dictionary<string, string> postVaraibles, MetaData metaData)
    Parameters
    Type Name Description
    System.String url

    The absolute url to be fetched. Developers may use Response.ToAbsoluteUrl to resolve all relative links to Absolute Url strings.

    System.Action<Response> parse

    The method to be used to parse the Response (often this is WebScraper.Parse)

    System.Collections.Generic.Dictionary<System.String, System.String> postVaraibles

    The POST variables as a dictionary of key-value pairs.

    MetaData metaData

    Additional information of any Type can be sent with the request and then re-read when the response is parsed.

    PostRequest(Uri, Action<Response>, Dictionary<String, String>)

    Adds a new request to the scrape-job queue using the POST http method.

    Declaration
    public virtual void PostRequest(Uri url, Action<Response> parse, Dictionary<string, string> postVaraibles)
    Parameters
    Type Name Description
    System.Uri url

    The absolute url to be fetched. Developers may use Response.ToAbsoluteUrl to resolve all relative links to Absolute Url strings.

    System.Action<Response> parse

    The method to be used to parse the Response (often this is WebScraper.Parse)

    System.Collections.Generic.Dictionary<System.String, System.String> postVaraibles

    The POST variables as a dictionary of key-value pairs.

    PostRequest(Uri, Action<Response>, Dictionary<String, String>, HttpIdentity, MetaData)

    Adds a new request to the scrape-job queue using the POST http method.

    Declaration
    public virtual void PostRequest(Uri url, Action<Response> parse, Dictionary<string, string> postVaraibles, HttpIdentity identity = null, MetaData metaData = null)
    Parameters
    Type Name Description
    System.Uri url

    The absolute url to be fetched. Developers may use Response.ToAbsoluteUrl to resolve all relative links to Absolute Url strings.

    System.Action<Response> parse

    The method to be used to parse the Response (often this is WebScraper.Parse)

    System.Collections.Generic.Dictionary<System.String, System.String> postVaraibles

    The POST variables as a dictionary of key-value pairs.

    HttpIdentity identity

    An optional HttpIdentity to send the Request. If null, the ChooseIdentityForRequest method will be used to find a suitable identity.

    MetaData metaData

    Additional information of any Type can be sent with the request and then re-read when the response is parsed.

    PostRequest(Uri, Action<Response>, Dictionary<String, String>, MetaData)

    Adds a new request to the scrape-job queue using the POST http method.

    Declaration
    public virtual void PostRequest(Uri url, Action<Response> parse, Dictionary<string, string> postVaraibles, MetaData metaData)
    Parameters
    Type Name Description
    System.Uri url

    The absolute url to be fetched. Developers may use Response.ToAbsoluteUrl to resolve all relative links to Absolute Url strings.

    System.Action<Response> parse

    The method to be used to parse the Response (often this is WebScraper.Parse)

    System.Collections.Generic.Dictionary<System.String, System.String> postVaraibles

    The POST variables as a dictionary of key-value pairs.

    MetaData metaData

    Additional information of any Type can be sent with the request and then re-read when the response is parsed.

    Request(IEnumerable<String>, Action<Response>)

    A key method called from within the Init and Parse methods. Request adds new requests to the scrape-job queue, and decides which method (e.g. Parse) will be used to parse the Response object.

    Declaration
    public virtual void Request(IEnumerable<string> urls, Action<Response> parse)
    Parameters
    Type Name Description
    System.Collections.Generic.IEnumerable<System.String> urls

    The Absolute url or urls to be fetched. Developers may use Response.ToAbsoluteUrl to resolve all relative links to Absolute Url strings.

    System.Action<Response> parse

    The method to be used to parse the Response (often this is WebScraper.Parse)

    Request(String, Action<Response>)

    A key method called from within the Init and Parse methods. Request adds a new request to the scrape-job queue, and decides which method (e.g. Parse) will be used to parse the Response object.

    Declaration
    public virtual void Request(string url, Action<Response> parse)
    Parameters
    Type Name Description
    System.String url

    The absolute url to be fetched. Developers may use Response.ToAbsoluteUrl to resolve all relative links to Absolute Url strings.

    System.Action<Response> parse

    The method to be used to parse the Response (often this is WebScraper.Parse)

    Request(String, Action<Response>, HttpIdentity, MetaData)

    A key method called from within the Init and Parse methods. Request adds a new request to the scrape-job queue, and decides which method (e.g. Parse) will be used to parse the Response object.

    Declaration
    public virtual void Request(string url, Action<Response> parse, HttpIdentity identity = null, MetaData metaData = null)
    Parameters
    Type Name Description
    System.String url

    The absolute url to be fetched. Developers may use Response.ToAbsoluteUrl to resolve all relative links to Absolute Url strings.

    System.Action<Response> parse

    The method to be used to parse the Response (often this is WebScraper.Parse)

    HttpIdentity identity

    An HttpIdentity to send the Request. If null, the ChooseIdentityForRequest method will be used to find a suitable identity.

    MetaData metaData

    Additional information of any Type can be sent with the request and then re-read when the response is parsed.

    Request(String, Action<Response>, MetaData)

    A key method called from within the Init and Parse methods. Request adds a new request to the scrape-job queue, and decides which method (e.g. Parse) will be used to parse the Response object.

    Declaration
    public virtual void Request(string url, Action<Response> parse, MetaData metaData)
    Parameters
    Type Name Description
    System.String url

    The absolute url to be fetched. Developers may use Response.ToAbsoluteUrl to resolve all relative links to Absolute Url strings.

    System.Action<Response> parse

    The method to be used to parse the Response (often this is WebScraper.Parse)

    MetaData metaData

    Additional information of any Type can be sent with the request and then re-read when the response is parsed.

    Request(Uri, Action<Response>)

    A key method called from within the Init and Parse methods. Request adds a new request to the scrape-job queue, and decides which method (e.g. Parse) will be used to parse the Response object.

    Declaration
    public virtual void Request(Uri url, Action<Response> parse)
    Parameters
    Type Name Description
    System.Uri url

    The absolute url to be fetched. Developers may use Response.ToAbsoluteUrl to resolve all relative links to Absolute Url strings.

    System.Action<Response> parse

    The method to be used to parse the Response (often this is WebScraper.Parse)

    Request(Uri, Action<Response>, HttpIdentity, MetaData)

    A key method called from within the Init and Parse methods. Request adds a new request to the scrape-job queue, and decides which method (e.g. Parse) will be used to parse the Response object.

    Declaration
    public virtual void Request(Uri url, Action<Response> parse, HttpIdentity identity = null, MetaData metaData = null)
    Parameters
    Type Name Description
    System.Uri url

    The absolute url to be fetched. Developers may use Response.ToAbsoluteUrl to resolve all relative links to Absolute Url strings.

    System.Action<Response> parse

    The method to be used to parse the Response (often this is WebScraper.Parse)

    HttpIdentity identity

    An HttpIdentity to send the Request. If null, the ChooseIdentityForRequest method will be used to find a suitable identity.

    MetaData metaData

    Additional information of any Type can be sent with the request and then re-read when the response is parsed.

    Request(Uri, Action<Response>, MetaData)

    A key method called from within the Init and Parse methods. Request adds a new request to the scrape-job queue, and decides which method (e.g. Parse) will be used to parse the Response object.

    Declaration
    public virtual void Request(Uri url, Action<Response> parse, MetaData metaData)
    Parameters
    Type Name Description
    System.Uri url

    The absolute url to be fetched. Developers may use Response.ToAbsoluteUrl to resolve all relative links to Absolute Url strings.

    System.Action<Response> parse

    The method to be used to parse the Response (often this is WebScraper.Parse)

    MetaData metaData

    Additional information of any Type can be sent with the request and then re-read when the response is parsed.

    Retry(Response)

    Retries a Response.

    Usually called in a Parse method, this method is useful if a Captcha or error screen was encountered during Html parsing.

    Declaration
    public void Retry(Response Response)
    Parameters
    Type Name Description
    Response Response

    Scrape(Object, String)

    Appends any scraped data to a file in the JsonLines format. (1 json object per line). Will save any .Net object of any kind. This method is typically used with IronWebScraper.ScrapedData or developer defined classes for scraped data items. The default filename will follow the pattern "NameSpace.TypeName.jsonl". E.g: IronWebScraper.ScrapedData.jsonl

    Declaration
    public void Scrape(object Item, string fileName = null)
    Parameters
    Type Name Description
    System.Object Item
    System.String fileName

    ScrapeUnique(Object, String)

    Appends scraped data to a file in the JsonLines format (1 json object per line). Automatically ignores duplicates. Will save any .Net object of any kind. This method is typically used with IronWebScraper.ScrapedData or developer defined classes for scraped data items. The default filename will follow the pattern "WorkingDirectory/NameSpace.TypeName.jsonl". E.g: Scrape/IronWebScraper.ScrapedData.jsonl

    Declaration
    public void ScrapeUnique(object Item, string fileName = null)
    Parameters
    Type Name Description
    System.Object Item
    System.String fileName
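
    Scrape and ScrapeUnique can be given a developer defined class. The sketch below assumes a hypothetical Product type and hard-coded values; in a real scraper the values would be extracted from the Response, and the file name "products.jsonl" is a placeholder.

```csharp
using IronWebScraper;

// Hypothetical data item; any .Net object may be saved.
public class Product
{
    public string Name { get; set; }
    public decimal Price { get; set; }
}

public class ProductScraper : WebScraper
{
    public override void Init()
    {
        WorkingDirectory = "Output";
        Request("https://example.com/products", Parse);
    }

    public override void Parse(Response response)
    {
        var item = new Product { Name = "Example widget", Price = 9.99m };

        // Appends one json object per call; with no file name, the default
        // "NameSpace.TypeName.jsonl" pattern described above applies.
        Scrape(item);

        // Same idea, but duplicates are ignored and an explicit file name is given.
        ScrapeUnique(item, "products.jsonl");
    }
}
```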

    SetSiteSpecificCrawlRateLimit(String, TimeSpan)

    Sets a throttle limit for a specific domain.

    Declaration
    public void SetSiteSpecificCrawlRateLimit(string hostName, TimeSpan crawlRate)
    Parameters
    Type Name Description
    System.String hostName

    The http host (domain name).

    System.TimeSpan crawlRate

    The maximum frequency of http requests for the given hostName.
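
    A per-site limit might be combined with the general RateLimitPerHost setting as sketched here. The host names and delays are placeholders, and calling the method from Init is an assumption about typical usage.

```csharp
using System;
using IronWebScraper;

// Hypothetical scraper that slows down for one particular host.
public class MixedRateScraper : WebScraper
{
    public override void Init()
    {
        // General politeness delay for every host.
        RateLimitPerHost = TimeSpan.FromSeconds(1);

        // A slower crawl rate for one specific host name.
        SetSiteSpecificCrawlRateLimit("slow.example.com", TimeSpan.FromSeconds(10));

        Request("https://example.com/", Parse);
        Request("https://slow.example.com/", Parse);
    }

    public override void Parse(Response response)
    {
        // Page-specific parsing would go here.
    }
}
```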

    Start(String)

    Starts the WebScraper.

    Set CrawlId to make this crawl resumable. Will also resume a previous crawl with the same CrawlId if it exists.

    Giving a CrawlId also causes the WebScraper to auto-save its state every 5 minutes in case of a crash, system failure or power outage. This feature is particularly useful for long running web-scraping tasks, allowing hours, days or even weeks of work to be recovered effortlessly.

    Declaration
    public void Start(string CrawlId = null)
    Parameters
    Type Name Description
    System.String CrawlId

    StartAsync(String)

    Starts the WebScraper asynchronously. Set CrawlId to make this crawl resumable. Will resume a previous crawl with the same CrawlId if it exists.

    Declaration
    public Task StartAsync(string CrawlId = null)
    Parameters
    Type Name Description
    System.String CrawlId
    Returns
    Type Description
    System.Threading.Tasks.Task
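
    A sketch of starting a resumable crawl and reporting the counters afterwards. The class name, URL and the CrawlId value "nightly-crawl" are hypothetical; running the same program again with the same CrawlId would, per the description above, resume the earlier crawl.

```csharp
using System;
using System.Threading.Tasks;
using IronWebScraper;

// Hypothetical scraper used to demonstrate a resumable crawl.
public class ResumableScraper : WebScraper
{
    public override void Init()
    {
        WorkingDirectory = "Output"; // state information is saved here
        Request("https://example.com/", Parse);
    }

    public override void Parse(Response response)
    {
        // Page-specific parsing would go here.
    }
}

public static class Program
{
    public static async Task Main()
    {
        var scraper = new ResumableScraper();

        // Passing a CrawlId makes the crawl resumable.
        await scraper.StartAsync("nightly-crawl");

        Console.WriteLine(scraper.SuccessfulfulRequestCount + " requests succeeded, "
            + scraper.FailedUrls + " urls failed.");
    }
}
```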

    Stop()

    Stops this WebScraper instance gracefully. The WebScraper may be restarted later with no loss of data by calling Start(CrawlId) or StartAsync(CrawlId).

    Declaration
    public void Stop()

    UnScrape(Boolean)

    Retrieves IronWebScraper.ScrapedData objects which were saved using the WebScraper.Scrape method.

    Declaration
    public IEnumerable<ScrapedData> UnScrape(bool IgnoreErrors)
    Parameters
    Type Name Description
    System.Boolean IgnoreErrors

    If set to true, any objects that cannot be cast to the specified Type T will be ignored.

    Returns
    Type Description
    System.Collections.Generic.IEnumerable<ScrapedData>

    UnScrape(String, Boolean)

    Retrieves IronWebScraper.ScrapedData objects which were saved using the WebScraper.Scrape method.

    Declaration
    public IEnumerable<ScrapedData> UnScrape(string fileName = null, bool IgnoreErrors = false)
    Parameters
    Type Name Description
    System.String fileName

    Path of the saved data file.

    System.Boolean IgnoreErrors

    If set to true, any objects that cannot be cast to the specified Type T will be ignored.

    Returns
    Type Description
    System.Collections.Generic.IEnumerable<ScrapedData>

    UnScrape<T>(Boolean)

    Retrieves native C# objects which were saved using the WebScraper.Scrape method in the JsonLines format.

    Declaration
    public IEnumerable<T> UnScrape<T>(bool IgnoreErrors)
    Parameters
    Type Name Description
    System.Boolean IgnoreErrors

    If set to true, any objects that cannot be cast to the specified Type T will be ignored.

    Returns
    Type Description
    System.Collections.Generic.IEnumerable<T>
    Type Parameters
    Name Description
    T

    The Type of object to be returned. Giving no value will return an IEnumerable of IronWebScraper.ScrapedData

    UnScrape<T>(String, Boolean)

    Retrieves native C# objects which were saved using the WebScraper.Scrape method in the JsonLines format.

    Declaration
    public IEnumerable<T> UnScrape<T>(string fileName = null, bool IgnoreErrors = false)
    Parameters
    Type Name Description
    System.String fileName

    Path of the saved data file.

    System.Boolean IgnoreErrors

    If set to true, any objects that cannot be cast to the specified Type T will be ignored.

    Returns
    Type Description
    System.Collections.Generic.IEnumerable<T>
    Type Parameters
    Name Description
    T

    The Type of object to be returned. Giving no value will return an IEnumerable of IronWebScraper.ScrapedData
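
    Saved items might be read back after a crawl roughly as follows. The Book class, the urls and the file name "books.jsonl" are hypothetical, and it is an assumption that passing the same file name to Scrape and UnScrape refers to the same file.

```csharp
using System;
using IronWebScraper;

// Hypothetical data item saved during the crawl.
public class Book
{
    public string Title { get; set; }
}

public class BookScraper : WebScraper
{
    public override void Init()
    {
        Request("https://example.com/books", Parse);
    }

    public override void Parse(Response response)
    {
        // Hard-coded for brevity; normally read from the Response.
        Scrape(new Book { Title = "Example title" }, "books.jsonl");
    }
}

public static class Program
{
    public static void Main()
    {
        var scraper = new BookScraper();
        scraper.Start();

        // Objects that cannot be cast to Book are skipped because IgnoreErrors is true.
        foreach (Book book in scraper.UnScrape<Book>("books.jsonl", IgnoreErrors: true))
        {
            Console.WriteLine(book.Title);
        }
    }
}
```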
