Webscraping in C#

What is IronWebScraper?

IronWebScraper is a class library and framework for C# and the .NET programming platform that allows developers to programmatically read websites and extract their content. This is ideal for reverse engineering websites or existing intranets and turning them back into databases or JSON data. It's also useful for downloading large volumes of documents from the internet.

In many respects, IronWebScraper is similar to the Scrapy library for Python, but it leverages the advantages of C#, particularly the ability to step through and debug code while the web scraping process is in progress.

Installation

Your first step will be to install IronWebScraper, which you may do from NuGet or by downloading the DLL from our website.

All of the classes you will need can be found in the IronWebScraper namespace.

Install-Package IronWebScraper
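
Alternatively, you can install the same package with the .NET CLI:

dotnet add package IronWebScraper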

Migrating Websites to Databases

IronWebScraper provides the tools and methods to allow you to re-engineer your websites back into structured databases. This technology is useful when migrating content from legacy websites and intranets into your new C# application.

Migrating Websites

Being able to easily extract the content of a partial or complete website in C# reduces the time and cost involved in migrating or upgrading website and intranet resources. This can be significantly more efficient than direct SQL transformations, as it flattens the data down to what can be seen on each webpage, and requires neither an understanding of the previous SQL data structures nor the construction of complex SQL queries.
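
As a minimal sketch of the database side of such a migration, the following console program loads scraped records into SQLite. It assumes the scrape output was saved as a JSON-lines file containing a "Title" field (both the path and the field name here are hypothetical), and it uses the System.Text.Json and Microsoft.Data.Sqlite packages rather than any IronWebScraper API:

using System;
using System.IO;
using System.Text.Json;
using Microsoft.Data.Sqlite;

class MigrateScrapedData
{
    public static void Main()
    {
        // Hypothetical path: adjust to wherever your scrape output was saved
        string jsonlPath = "ScrapeOutput/BlogTitles.jsonl";

        using var connection = new SqliteConnection("Data Source=migrated.db");
        connection.Open();

        using (var create = connection.CreateCommand())
        {
            create.CommandText = "CREATE TABLE IF NOT EXISTS Pages (Title TEXT)";
            create.ExecuteNonQuery();
        }

        // Each line of the file is one scraped record serialized as JSON
        foreach (string line in File.ReadLines(jsonlPath))
        {
            using var doc = JsonDocument.Parse(line);
            string title = doc.RootElement.GetProperty("Title").GetString();

            using var insert = connection.CreateCommand();
            insert.CommandText = "INSERT INTO Pages (Title) VALUES ($title)";
            insert.Parameters.AddWithValue("$title", title);
            insert.ExecuteNonQuery();
        }

        Console.WriteLine("Migration complete.");
    }
}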

Populating Search Indexes

IronWebScraper may be pointed at your own website or intranet to crawl every page and extract structured data, so that a search engine within your organization can be populated accurately.

IronWebScraper is an ideal tool to scrape content for your search index. A search application such as IronSearch can read structured content from IronWebScraper to build a powerful enterprise search system.
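
IronSearch's API is not covered in this article, so as a stand-in, this sketch indexes scraped titles with the open-source Lucene.NET library (4.8 API) to illustrate the general shape of populating a search index from scraped records:

using Lucene.Net.Analysis.Standard;
using Lucene.Net.Documents;
using Lucene.Net.Index;
using Lucene.Net.Store;
using Lucene.Net.Util;

class IndexScrapedTitles
{
    public static void Main()
    {
        var luceneVersion = LuceneVersion.LUCENE_48;

        // Open (or create) an on-disk index directory
        using var indexDir = FSDirectory.Open("search-index");
        var analyzer = new StandardAnalyzer(luceneVersion);
        var config = new IndexWriterConfig(luceneVersion, analyzer);
        using var writer = new IndexWriter(indexDir, config);

        // In a real migration these strings would come from your scrape output
        string[] scrapedTitles = { "First Post", "Second Post" };

        foreach (string title in scrapedTitles)
        {
            var doc = new Document
            {
                // Store the title so it can be returned in search results
                new TextField("title", title, Field.Store.YES)
            };
            writer.AddDocument(doc);
        }

        writer.Commit();
    }
}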

Using IronWebScraper

To learn how to use IronWebScraper, it is best to look at examples. This basic example creates a class to scrape post titles from a blog.

using IronWebScraper;

namespace WebScrapingProject
{
    class MainClass
    {
        public static void Main(string [] args)
        {
            var scraper = new BlogScraper();
            scraper.Start();
        }
    }

    class BlogScraper : WebScraper
    {
        // Initialize scraper settings and make the first request
        public override void Init()
        {
            // Set logging level to show all log messages
            this.LoggingLevel = WebScraper.LogLevel.All;

            // Request the initial page to start scraping
            this.Request("https://ironpdf.com/blog/", Parse);
        }

        // Method to handle parsing of the page response
        public override void Parse(Response response)
        {
            // Loop through each blog post title link found by CSS selector
            foreach (var title_link in response.Css("h2.entry-title a"))
            {
                // Clean and extract the title text
                string strTitle = title_link.TextContentClean;

                // Store the extracted title for later use
                Scrape(new ScrapedData() { { "Title", strTitle } });
            }

            // Check if there is a link to the previous post page and if exists, follow it
            if (response.CssExists("div.prev-post > a[href]"))
            {
                // Get the URL for the next page
                var next_page = response.Css("div.prev-post > a[href]")[0].Attributes["href"];

                // Request the next page to continue scraping
                this.Request(next_page, Parse);
            }
        }
    }
}
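
The same example, translated into VB.NET:
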
Imports IronWebScraper

Namespace WebScrapingProject
	Friend Class MainClass
		Public Shared Sub Main(ByVal args() As String)
			Dim scraper = New BlogScraper()
			scraper.Start()
		End Sub
	End Class

	Friend Class BlogScraper
		Inherits WebScraper

		' Initialize scraper settings and make the first request
		Public Overrides Sub Init()
			' Set logging level to show all log messages
			Me.LoggingLevel = WebScraper.LogLevel.All

			' Request the initial page to start scraping
			Me.Request("https://ironpdf.com/blog/", AddressOf Parse)
		End Sub

		' Method to handle parsing of the page response
		Public Overrides Sub Parse(ByVal response As Response)
			' Loop through each blog post title link found by CSS selector
			For Each title_link In response.Css("h2.entry-title a")
				' Clean and extract the title text
				Dim strTitle As String = title_link.TextContentClean

				' Store the extracted title for later use
				Scrape(New ScrapedData() From {
					{ "Title", strTitle }
				})
			Next title_link

			' Check if there is a link to the previous post page and if exists, follow it
			If response.CssExists("div.prev-post > a[href]") Then
				' Get the URL for the next page
				Dim next_page = response.Css("div.prev-post > a[href]")(0).Attributes("href")

				' Request the next page to continue scraping
				Me.Request(next_page, AddressOf Parse)
			End If
		End Sub
	End Class
End Namespace

To scrape a specific website, we create our own class to read that website. This class will extend WebScraper. We add methods to this class, including Init, where we can apply initial settings and start the first request, which will in turn cause a chain reaction in which the entire website is scraped.
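
Init is also where scraper-wide settings can be applied. In this sketch, the WorkingDirectory property directs where cache and output files are written; the property name is taken from IronWebScraper's published tutorials, so verify it against the API reference for your version:

using System;
using IronWebScraper;

class BlogScraperWithSettings : WebScraper
{
    public override void Init()
    {
        // Show all log messages while developing
        this.LoggingLevel = WebScraper.LogLevel.All;

        // WorkingDirectory controls where IronWebScraper writes cache and
        // output files (property name taken from the published tutorials)
        this.WorkingDirectory = AppContext.BaseDirectory + "ScrapeOutput/";

        // Issue the first request; Parse handles the response
        this.Request("https://ironpdf.com/blog/", Parse);
    }

    public override void Parse(Response response)
    {
        // Parsing logic as in the full example above
    }
}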

We must also add at least one Parse method. Parse methods read webpages that have been downloaded from the internet and use jQuery-style CSS selectors to select content and extract the relevant text and/or images.

Within a Parse method, we may also specify which hyperlinks we wish the crawler to follow and which ones to ignore, as sketched below.
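
In this sketch of that pattern, the Parse method follows only the links matched by one selector and hands them to a second, page-specific handler. The start URL and the CSS selectors here are hypothetical and would need to match the target site's actual markup:

using IronWebScraper;

class CatalogScraper : WebScraper
{
    public override void Init()
    {
        // Hypothetical start URL
        this.Request("https://www.example.com/catalog/", Parse);
    }

    public override void Parse(Response response)
    {
        // Follow only the product links; every other hyperlink on the
        // page is ignored because it is never requested
        foreach (var link in response.Css("a.product"))
        {
            this.Request(link.Attributes["href"], ParseProduct);
        }
    }

    public void ParseProduct(Response response)
    {
        // Extract the product name from the detail page
        string name = response.Css("h1.product-name")[0].TextContentClean;
        Scrape(new ScrapedData() { { "Product", name } });
    }
}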

We may use the Scrape method to save any extracted data in a convenient JSON-style file format for later use.
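
For example, the Scrape call from the sample above can also name its output file. This two-argument form appears in IronWebScraper's published tutorials, but verify the exact signature against the API reference for your version:

// Store each record and direct output to a named JSON-lines file
Scrape(new ScrapedData() { { "Title", strTitle } }, "BlogTitles.Jsonl");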

Moving Forward

To learn more about IronWebScraper, we recommend you read the API Reference Documentation, and then start looking at the examples within the tutorial section of our documentation.

The next example we recommend is the C# "blog" web scraping example, where we learn how to extract the text content from a blog, such as a WordPress blog. This can be very useful in a site migration.

From there, you might go on to explore the other advanced web scraping tutorials, which cover concepts such as websites with many different types of pages, e-commerce websites, and the use of multiple proxies, identities, and logins when scraping data from the internet.