Webscraping in C#

What is IronWebScraper?

IronWebScraper is a class library and framework for C# and the .NET programming platform that allows developers to programmatically read websites and extract their content. This is ideal for reverse engineering websites or existing intranets and turning them back into databases or JSON data. It's also useful for downloading large volumes of documents from the internet.

In many respects, IronWebScraper is similar to the Scrapy library for Python, but it leverages the advantages of C#, particularly the ability to step through and debug your code while a scrape is in progress.

Installation

Your first step will be to install IronWebScraper, which you can do from NuGet or by downloading the DLL from our website.

All of the classes you will need can be found in the IronWebScraper namespace.

PM > Install-Package IronWebScraper

Migrating Websites to Databases

IronWebScraper provides the tools and methods to allow you to re-engineer your websites back into structured databases. This technology is useful when migrating content from legacy websites and intranets into your new C# application.

Migrating Websites

Being able to easily extract the content of a partial or complete website in C# reduces the time and cost of migrating or upgrading website and intranet resources. This can be significantly more efficient than direct SQL transformations, because it flattens the data down to what can be seen on each webpage. It does not require the previous SQL data structures to be understood, nor complex SQL queries to be built.

Populating Search Indexes

IronWebScraper can be pointed at your own website or intranet to read every page and extract its structured data, so that a search engine within your organization can be populated accurately.

IronWebScraper is an ideal tool to scrape content for your search index. A search application such as IronSearch can read structured content from IronWebScraper to build a powerful enterprise search system.

Using IronWebScraper

To learn how to use IronWebScraper, it is best to look at examples. This basic example creates a class to scrape post titles from a website's blog.

using IronWebScraper;

namespace WebScrapingProject
{
    class MainClass
    {
        public static void Main(string[] args)
        {
            // Create the scraper and start crawling.
            var scraper = new BlogScraper();
            scraper.Start();
        }
    }

    class BlogScraper : WebScraper
    {
        // Init: apply initial settings and issue the first request.
        public override void Init()
        {
            this.LoggingLevel = WebScraper.LogLevel.All;
            this.Request("https://ironpdf.com/blog/", Parse);
        }

        // Parse: read a downloaded page and extract data from it.
        public override void Parse(Response response)
        {
            // Record the title of every post listed on this page.
            foreach (var title_link in response.Css("h2.entry-title a"))
            {
                string strTitle = title_link.TextContentClean;
                Scrape(new ScrapedData() { { "Title", strTitle } });
            }

            // Follow the link to the next page of posts, if there is one.
            if (response.CssExists("div.prev-post > a[href]"))
            {
                var next_page = response.Css("div.prev-post > a[href]")[0].Attributes["href"];
                this.Request(next_page, Parse);
            }
        }
    }
}

The same scraper in VB.NET:

Imports IronWebScraper

Namespace WebScrapingProject
	Friend Class MainClass
		Public Shared Sub Main(ByVal args() As String)
			Dim scraper = New BlogScraper()
			scraper.Start()
		End Sub
	End Class

	Friend Class BlogScraper
		Inherits WebScraper

		Public Overrides Sub Init()
			Me.LoggingLevel = WebScraper.LogLevel.All
			Me.Request("https://ironpdf.com/blog/", AddressOf Parse)
		End Sub

		Public Overrides Sub Parse(ByVal response As Response)
			For Each title_link In response.Css("h2.entry-title a")
				Dim strTitle As String = title_link.TextContentClean
				Scrape(New ScrapedData() From {
					{ "Title", strTitle }
				})
			Next title_link

			If response.CssExists("div.prev-post > a[href]") Then
				Dim next_page = response.Css("div.prev-post > a[href]")(0).Attributes("href")
				Me.Request(next_page, AddressOf Parse)
			End If
		End Sub
	End Class
End Namespace

To scrape a specific website, we have to create our own class to read that website. This class extends WebScraper. We add some methods to this class, including Init, where we can apply initial settings and issue the first request. That request in turn sets off a chain reaction in which the entire website is scraped.
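
The chain reaction can be pictured as a work queue: the first request is queued, each parsed page may queue further requests, and the crawl ends when nothing is left. The sketch below is a simplified model of that idea in plain C#, not IronWebScraper's actual implementation. CrawlSketch, Crawl, and getLinks are hypothetical names introduced only for illustration, and getLinks stands in for "download the page and let a Parse method pick its links".

```csharp
using System;
using System.Collections.Generic;

// Hypothetical model of a crawl; not part of IronWebScraper.
public static class CrawlSketch
{
    // Returns the pages visited, in order, starting from startUrl.
    public static List<string> Crawl(string startUrl, Func<string, IEnumerable<string>> getLinks)
    {
        var visited = new List<string>();
        var seen = new HashSet<string> { startUrl };
        var pending = new Queue<string>();
        pending.Enqueue(startUrl);              // the first request, as issued from Init

        while (pending.Count > 0)
        {
            string url = pending.Dequeue();
            visited.Add(url);

            // Links that a Parse method chose to follow from this page.
            foreach (string link in getLinks(url))
            {
                if (seen.Add(link))             // skip pages that were already queued
                {
                    pending.Enqueue(link);
                }
            }
        }

        return visited;
    }
}
```

Calling Crawl with a small in-memory site map, for example a dictionary from each page to its links, shows how a single starting request fans out to every reachable page exactly once.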

We must also add at least one Parse method. Parse methods read webpages that have been downloaded from the internet and use jQuery-like CSS selectors to select content and extract the relevant text and/or images.

Within a Parse method, we may also specify which hyperlinks we wish the crawler to continue to follow and which ones it will ignore.
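
One way to decide which hyperlinks to follow is a small predicate that is checked before each new request is issued. The sketch below uses only the .NET base class library. LinkFilter and ShouldFollow are hypothetical names, not part of IronWebScraper, and the "/blog/" path rule is an assumption chosen to match the example above.

```csharp
using System;

// Hypothetical helper; not part of IronWebScraper.
public static class LinkFilter
{
    // Returns true if a link found on pageUrl should be crawled.
    public static bool ShouldFollow(string pageUrl, string href)
    {
        var baseUri = new Uri(pageUrl);

        // Resolve relative links such as "/blog/page/2/" against the current page.
        if (!Uri.TryCreate(baseUri, href, out var target))
        {
            return false;
        }

        // Ignore mailto: links and other non-web schemes.
        if (target.Scheme != Uri.UriSchemeHttp && target.Scheme != Uri.UriSchemeHttps)
        {
            return false;
        }

        // Stay on the host that is being scraped.
        if (!string.Equals(target.Host, baseUri.Host, StringComparison.OrdinalIgnoreCase))
        {
            return false;
        }

        // Assumed rule for this sketch: only follow links under /blog/.
        return target.AbsolutePath.StartsWith("/blog/", StringComparison.OrdinalIgnoreCase);
    }
}
```

Within a Parse method, a check like this would sit just before the call to this.Request, so that links failing the test are simply never queued.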

We may use the Scrape method to extract any data and dump it into a convenient JSON-style file format for later use.
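
For later use, the scraped records can be loaded back into ordinary C# collections. The following is a minimal sketch that assumes the output holds one JSON object per line and that every value is a string, as with the "Title" records in the example above. Check the files your scraper actually writes before relying on this layout. ScrapeOutputReader and ReadRecords are hypothetical names, and the code uses System.Text.Json rather than any IronWebScraper API.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text.Json;

// Hypothetical helper; not part of IronWebScraper.
public static class ScrapeOutputReader
{
    // Reads one record per non-empty line. Assumes each line is a JSON object
    // whose values are all strings, for example {"Title":"First post"}.
    public static List<Dictionary<string, string>> ReadRecords(TextReader reader)
    {
        var records = new List<Dictionary<string, string>>();

        for (var line = reader.ReadLine(); line != null; line = reader.ReadLine())
        {
            if (string.IsNullOrWhiteSpace(line))
            {
                continue;   // tolerate blank lines between records
            }

            var record = JsonSerializer.Deserialize<Dictionary<string, string>>(line);
            if (record != null)
            {
                records.Add(record);
            }
        }

        return records;
    }
}
```

To read from disk, pass a StreamReader opened on the output file. Each dictionary can then be mapped to a database row or a search index entry.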

Moving Forward

To learn more about IronWebScraper, we recommend you read the API Reference Documentation, and then start looking at the examples within the tutorial section of our documentation.

The next example we recommend you look at is the C# "blog" webscraping example, where we learn how we might extract the text content from a blog, such as a WordPress blog. This might be very useful in a site migration.

From there, you might go on to look at the other advanced webscraping tutorial examples where we can look at concepts like websites with many different types of pages, e-commerce websites, and also how to use multiple proxies, identities, and logins when scraping data from the internet.