Webscrpaing has never been a simple task, with no dominant frameworks for use in C# or .Net programming environments. Iron Web Scraper was created to change this

C# PDF HTML

8th August 2018 by Ahmed Aboelmagd

How to Scrape a Blog in C#

Let’s use Iron WebScraper to extract Blog content using C# or VB.Net.

This tutorial shows how a WordPress blog (or similar) may be scraped back into content using .Net

public class BlogScraper : WebScraper
{
    /// <summary>
    /// Override this method initializes your web-scraper.
    /// Important tasks will be to Request at least one start url... and set allowed/banned domain or url patterns.
    /// </summary>
    public override void Init()
    {
        License.LicenseKey = " LicenseKey ";
        this.LoggingLevel = WebScraper.LogLevel.All;
        this.WorkingDirectory = AppSetting.GetAppRoot() + @"\BlogSample\Output\";
        EnableWebCache(new TimeSpan(1, 30, 30));
        this.Request("http://blogSite.com/", Parse);
    }
}
Public Class BlogScraper
	Inherits WebScraper

	''' <summary>
	''' Override this method initializes your web-scraper.
	''' Important tasks will be to Request at least one start url... and set allowed/banned domain or url patterns.
	''' </summary>
	Public Overrides Sub Init()
		License.LicenseKey = " LicenseKey "
		Me.LoggingLevel = WebScraper.LogLevel.All
		Me.WorkingDirectory = AppSetting.GetAppRoot() & "\BlogSample\Output\"
		EnableWebCache(New TimeSpan(1, 30, 30))
		Me.Request("http://blogSite.com/", Parse)
	End Sub
End Class
VB   C#

As usual, we create a Scraper and inherit from the WebScraper class. In this case it is "BlogScraper"

We set a working directory to “\BlogSample\Output\” which is where all of out output and cache files can go.

Then we Enable the Webcache to save requested pages inside cache folder “WebCache”

Now let’s write a parse function:

/// <summary>
/// Override this method to create the default Response handler for your web scraper.
/// If you have multiple page types, you can add additional similar methods.
/// </summary>
/// <param name="response">The http Response object to parse</param>
public override void Parse(Response response)
{
    foreach (var link in response.Css("div.section-nav > ul > li > a "))
    {
        switch(link.TextContentClean)
        {
            case "Reviews":
                {

                }break;
            case "Science":
                {

                }break;
            default:
                {
                    // Save Result to File
                    Scrape(new ScrapedData() { { "Title", link.TextContentClean } }, "BlogScraper.Jsonl");
                }
                break;
        }
    }
}
''' <summary>
''' Override this method to create the default Response handler for your web scraper.
''' If you have multiple page types, you can add additional similar methods.
''' </summary>
''' <param name="response">The http Response object to parse</param>
Public Overrides Sub Parse(ByVal response As Response)
	For Each link In response.Css("div.section-nav > ul > li > a ")
		Select Case link.TextContentClean
			Case "Reviews"

			Case "Science"

			Case Else
					' Save Result to File
					Scrape(New ScrapedData() From {
						{ "Title", link.TextContentClean }
					},
					"BlogScraper.Jsonl")
		End Select
	Next link
End Sub
VB   C#

Inside the parse Method; we parse the top menu to get all the links to all category pages (Movies, Science, Reviews, etc.)

Then we switch to a suitable parse method based on the link category.

Let's prepare our object model for the Science Page:

/// <summary>
/// ScienceModel
/// </summary>
public class ScienceModel
{
    /// <summary>
    /// Gets or sets the title.
    /// </summary>
    /// <value>
    /// The title.
    /// </value>
    public string Title { get; set; }
    /// <summary>
    /// Gets or sets the author.
    /// </summary>
    /// <value>
    /// The author.
    /// </value>
    public string Author { get; set; }
    /// <summary>
    /// Gets or sets the date.
    /// </summary>
    /// <value>
    /// The date.
    /// </value>
    public string Date { get; set; }
    /// <summary>
    /// Gets or sets the image.
    /// </summary>
    /// <value>
    /// The image.
    /// </value>
    public string Image { get; set; }
    /// <summary>
    /// Gets or sets the text.
    /// </summary>
    /// <value>
    /// The text.
    /// </value>
    public string Text { get; set; }

}
''' <summary>
''' ScienceModel
''' </summary>
Public Class ScienceModel
	''' <summary>
	''' Gets or sets the title.
	''' </summary>
	''' <value>
	''' The title.
	''' </value>
	Public Property Title() As String
	''' <summary>
	''' Gets or sets the author.
	''' </summary>
	''' <value>
	''' The author.
	''' </value>
	Public Property Author() As String
	''' <summary>
	''' Gets or sets the date.
	''' </summary>
	''' <value>
	''' The date.
	''' </value>
	Public Property [Date]() As String
	''' <summary>
	''' Gets or sets the image.
	''' </summary>
	''' <value>
	''' The image.
	''' </value>
	Public Property Image() As String
	''' <summary>
	''' Gets or sets the text.
	''' </summary>
	''' <value>
	''' The text.
	''' </value>
	Public Property Text() As String

End Class
VB   C#

Now let’s implement a single page scrape:

/// <summary>
/// Parses the reviews.
/// </summary>
/// <param name="response">The response.</param>
public void ParseReviews(Response response)
{
    // List of Science Link
    var scienceList = new List<ScienceModel>();

    foreach (var postBox in response.Css("section.main > div > div.post-list"))
    {
        var item = new ScienceModel();
        item.Title = postBox.Css("h1.headline > a")[0].TextContentClean;
        item.Author = postBox.Css("div.author > a")[0].TextContentClean;
        item.Date = postBox.Css("div.time > a")[0].TextContentClean;
        item.Image = postBox.Css("div.image-wrapper.default-state > img")[0].Attributes["src"];
        item.Text = postBox.Css("div.summary > p")[0].TextContentClean;
        scienceList.Add(item);
    }

    Scrape(scienceList, "BlogScience.Jsonl");
}
''' <summary>
''' Parses the reviews.
''' </summary>
''' <param name="response">The response.</param>
Public Sub ParseReviews(ByVal response As Response)
	' List of Science Link
	Dim scienceList = New List(Of ScienceModel)()

	For Each postBox In response.Css("section.main > div > div.post-list")
		Dim item = New ScienceModel()
		item.Title = postBox.Css("h1.headline > a")(0).TextContentClean
		item.Author = postBox.Css("div.author > a")(0).TextContentClean
		item.Date = postBox.Css("div.time > a")(0).TextContentClean
		item.Image = postBox.Css("div.image-wrapper.default-state > img")(0).Attributes("src")
		item.Text = postBox.Css("div.summary > p")(0).TextContentClean
		scienceList.Add(item)
	Next postBox

	Scrape(scienceList, "BlogScience.Jsonl")
End Sub
VB   C#

After we have created our model, we can parse the response object to drill down into its main elements (title, author, date, image, text)

Then we save our result in separate file using Scrape(object, fileName).

Click here for Ahmed's full tutorial on the use of IronWebscraper

.Net Software Engineer For many this is the most efficient way to generate PDF files from .Net, because there is no additional API to learn, or complex design system to navigate

Ahmed Aboelmagd

.Net Software Solution Architect at an Multinational IT Company

Ahmed is an experienced and certified Microsoft Technology specialist with more than 10 years experience in IT and Software development. He's worked in a range of companies, and is now a country manager for multinational IT Company.

Ahmed has been working with IronPDF and IronWebscraper for over a year now, using them in multiple projects with his company.