Webscraping in C#
What is IronWebScraper?
IronWebScraper is a class library and framework for C# and the .NET programming platform that allows developers to programmatically read websites and extract their content. This is ideal for reverse engineering websites or existing intranets and turning them back into databases or JSON data. It's also useful for downloading large volumes of documents from the internet.
In many respects, IronWebScraper is similar to the Scrapy library for Python, but it leverages the advantages of C#, particularly the ability to step through and debug your code while the web scraping process is running.
Installation
Your first step will be to install IronWebScraper, which you can do from NuGet or by downloading the DLL from our website.
All of the classes you will need can be found in the IronWebScraper namespace.
PM > Install-Package IronWebScraper
Popular Use Cases
Migrating Websites to Databases
IronWebScraper provides the tools and methods to allow you to re-engineer your websites back into structured databases. This technology is useful when migrating content from legacy websites and intranets into your new C# application.
Migrating Websites
Being able to easily extract the content of a partial or complete website in C# reduces the time and cost of migrating or upgrading website and intranet resources. This can be significantly more efficient than direct SQL transformations, because it flattens the data down to what can be seen on each webpage. The previous SQL data structures do not need to be understood, and no complex SQL queries need to be built.
Populating Search Indexes
IronWebScraper may be pointed at your own website or intranet to read every page and extract its structured data, so that a search engine within your organization can be populated accurately.
IronWebScraper is an ideal tool to scrape content for your search index. A search application such as IronSearch can read structured content from IronWebScraper to build a powerful enterprise search system.
Using IronWebScraper
To learn how to use IronWebScraper, it is best to look at examples. This basic example creates a class to scrape titles from a website blog.
using IronWebScraper;

namespace WebScrapingProject
{
    class MainClass
    {
        public static void Main(string[] args)
        {
            var scraper = new BlogScraper();
            scraper.Start();
        }
    }

    class BlogScraper : WebScraper
    {
        // Apply initial settings and queue the first request.
        public override void Init()
        {
            this.LoggingLevel = WebScraper.LogLevel.All;
            this.Request("https://ironpdf.com/blog/", Parse);
        }

        // Called for each page that has been downloaded.
        public override void Parse(Response response)
        {
            // Record the title of every post listed on the page.
            foreach (var title_link in response.Css("h2.entry-title a"))
            {
                string strTitle = title_link.TextContentClean;
                Scrape(new ScrapedData() { { "Title", strTitle } });
            }

            // Follow the pagination link, if the page has one.
            if (response.CssExists("div.prev-post > a[href]"))
            {
                var next_page = response.Css("div.prev-post > a[href]")[0].Attributes["href"];
                this.Request(next_page, Parse);
            }
        }
    }
}
' VB.NET version of the same example
Imports IronWebScraper

Namespace WebScrapingProject
    Friend Class MainClass
        Public Shared Sub Main(ByVal args() As String)
            Dim scraper = New BlogScraper()
            scraper.Start()
        End Sub
    End Class

    Friend Class BlogScraper
        Inherits WebScraper

        ' Apply initial settings and queue the first request.
        Public Overrides Sub Init()
            Me.LoggingLevel = WebScraper.LogLevel.All
            Me.Request("https://ironpdf.com/blog/", AddressOf Parse)
        End Sub

        ' Called for each page that has been downloaded.
        Public Overrides Sub Parse(ByVal response As Response)
            ' Record the title of every post listed on the page.
            For Each title_link In response.Css("h2.entry-title a")
                Dim strTitle As String = title_link.TextContentClean
                Scrape(New ScrapedData() From {
                    { "Title", strTitle }
                })
            Next title_link

            ' Follow the pagination link, if the page has one.
            If response.CssExists("div.prev-post > a[href]") Then
                Dim next_page = response.Css("div.prev-post > a[href]")(0).Attributes("href")
                Me.Request(next_page, AddressOf Parse)
            End If
        End Sub
    End Class
End Namespace
To scrape a specific website, we create our own class to read that website. This class extends WebScraper. We add some methods to this class, including Init, where we can apply initial settings and start the first request. That request in turn sets off a chain reaction in which the entire website is scraped.
We must also add at least one Parse method. Parse methods read webpages that have been downloaded from the internet and use jQuery-like CSS selectors to select content and extract the relevant text and/or images.
Within a Parse method, we may also specify which hyperlinks the crawler should continue to follow and which it should ignore.
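One way to decide which hyperlinks to follow is to pass each href through a small predicate before requesting it. The sketch below uses only the .NET base library. The LinkFilter class, the ShouldFollow method, and the rule itself (same host, path under /blog/) are hypothetical names and choices made for illustration, not part of IronWebScraper.

```csharp
using System;

static class LinkFilter
{
    // Hypothetical rule: follow only http(s) links on the same host
    // as the current page whose path starts with /blog/.
    public static bool ShouldFollow(Uri pageUrl, string href)
    {
        if (string.IsNullOrWhiteSpace(href)) return false;

        // Resolve relative links such as "/blog/page/2/" against the current page.
        if (!Uri.TryCreate(pageUrl, href, out Uri target)) return false;

        bool isWeb = target.Scheme == Uri.UriSchemeHttp
                  || target.Scheme == Uri.UriSchemeHttps;
        bool sameHost = string.Equals(target.Host, pageUrl.Host,
                                      StringComparison.OrdinalIgnoreCase);
        bool inBlog = target.AbsolutePath.StartsWith("/blog/",
                                      StringComparison.OrdinalIgnoreCase);

        return isWeb && sameHost && inBlog;
    }
}
```

Inside a Parse method, the href values read from response.Css("a[href]") could be passed through a check like this before calling this.Request(href, Parse), so that links which fail the check are simply ignored.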
We may use the Scrape method to extract any data and save it in a convenient JSON-style file format for later use.
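To give a rough picture of what a JSON-style record means here, the sketch below serializes a set of key/value pairs, shaped like the { "Title", strTitle } pair passed to Scrape in the example, into a single line of JSON. It uses System.Text.Json as a stand-in and does not reproduce how IronWebScraper itself writes its output. The RecordSketch class and ToJsonLine method are hypothetical names, and the exact layout of the real output file is an assumption to be checked against the API reference.

```csharp
using System;
using System.Collections.Generic;
using System.Text.Json;

static class RecordSketch
{
    // Stand-in for one scraped record: field names mapped to values.
    // Serializes the record as one compact JSON object.
    public static string ToJsonLine(Dictionary<string, object> record)
    {
        return JsonSerializer.Serialize(record);
    }
}
```

Each scraped item then becomes one self-describing object, which is straightforward to load into a database or a search index later.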
Moving Forward
To learn more about IronWebScraper, we recommend you read the API Reference Documentation, and then start looking at the examples within the tutorial section of our documentation.
The next example we recommend you look at is the C# "blog" web scraping example, where we learn how to extract the text content from a blog, such as a WordPress blog. This can be very useful in a site migration.
From there, you might go on to the other advanced web scraping tutorial examples. These cover concepts such as websites with many different types of pages, e-commerce websites, and how to use multiple proxies, identities, and logins when scraping data from the internet.