How to Scrape a Blog in C#

Darrius Serrant
November 13, 2018
Updated December 10, 2024

Let's use IronWebScraper to extract blog content with C# or VB.NET.

This tutorial shows how to scrape a WordPress blog (or a similar site) back into content using .NET.

[Screenshot: Gizmodo blog]

public class BlogScraper : WebScraper
{
    /// <summary>
    /// Override this method to initialize your web scraper.
    /// Important tasks are to request at least one start URL and to set allowed/banned domain or URL patterns.
    /// </summary>
    public override void Init()
    {
        License.LicenseKey = " LicenseKey ";
        this.LoggingLevel = WebScraper.LogLevel.All;
        this.WorkingDirectory = AppSetting.GetAppRoot() + @"\BlogSample\Output\";
        EnableWebCache(new TimeSpan(1, 30, 30));
        this.Request("http://blogSite.com/", Parse);
    }
}
Public Class BlogScraper
	Inherits WebScraper

	''' <summary>
	''' Override this method to initialize your web scraper.
	''' Important tasks are to request at least one start URL and to set allowed/banned domain or URL patterns.
	''' </summary>
	Public Overrides Sub Init()
		License.LicenseKey = " LicenseKey "
		Me.LoggingLevel = WebScraper.LogLevel.All
		Me.WorkingDirectory = AppSetting.GetAppRoot() & "\BlogSample\Output\"
		EnableWebCache(New TimeSpan(1, 30, 30))
		' AddressOf is required to pass Parse as a delegate in VB.NET
		Me.Request("http://blogSite.com/", AddressOf Parse)
	End Sub
End Class

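As usual, we create a scraper that inherits from the WebScraper class. In this case, it is "BlogScraper".

We set the working directory to "\BlogSample\Output\", where all output and cache files can go.

Then we enable the web cache, which saves requested pages in the cache folder "WebCache".

Now let's write a parse function: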

/// <summary>
/// Override this method to create the default Response handler for your web scraper.
/// If you have multiple page types, you can add additional similar methods.
/// </summary>
/// <param name="response">The http Response object to parse</param>
public override void Parse(Response response)
{
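    // Walk the links in the top navigation menu, one per blog category.
    foreach (var link in response.Css("div.section-nav > ul > li > a"))
    {
        switch (link.TextContentClean)
        {
            case "Reviews":
                // Left empty in this sample; request the Reviews page here,
                // for example with Request and a dedicated parse method.
                break;
            case "Science":
                // Left empty in this sample; request the Science page here.
                break;
            default:
                // Save the result to a file
                Scrape(new ScrapedData() { { "Title", link.TextContentClean } }, "BlogScraper.Jsonl");
                break;
        }
    }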
}
''' <summary>
''' Override this method to create the default Response handler for your web scraper.
''' If you have multiple page types, you can add additional similar methods.
''' </summary>
''' <param name="response">The http Response object to parse</param>
Public Overrides Sub Parse(ByVal response As Response)
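	' Walk the links in the top navigation menu, one per blog category.
	For Each link In response.Css("div.section-nav > ul > li > a")
		Select Case link.TextContentClean
			Case "Reviews"
				' Left empty in this sample; request the Reviews page here.
			Case "Science"
				' Left empty in this sample; request the Science page here.
			Case Else
				' Save the result to a file
				Scrape(New ScrapedData() From {
					{ "Title", link.TextContentClean }
				}, "BlogScraper.Jsonl")
		End Select
	Next link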
End Sub

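Inside the Parse method, we parse the top menu to get the links to all category pages (Movies, Science, Reviews, etc.).

Then we switch to the appropriate parse method based on the link category.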
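
The switch on the link text can also be expressed as a lookup table, which may be easier to extend as categories are added. The following is a minimal sketch in plain C#, not part of IronWebScraper: the class name `CategoryDispatch`, the `Route` method, and the string-returning handlers are all hypothetical stand-ins for real parse methods, and the URLs are placeholders.

```csharp
using System;
using System.Collections.Generic;

public static class CategoryDispatch
{
    // Hypothetical handlers: in a real scraper each one would request the
    // category page and hand it to a dedicated parse method.
    private static readonly Dictionary<string, Func<string, string>> Handlers =
        new Dictionary<string, Func<string, string>>(StringComparer.OrdinalIgnoreCase)
        {
            { "Reviews", url => "parse reviews at " + url },
            { "Science", url => "parse science at " + url },
        };

    // Returns a description of the action chosen for a menu link.
    public static string Route(string category, string url)
    {
        if (Handlers.TryGetValue(category, out var handler))
        {
            return handler(url);
        }
        // Unknown categories fall through to a default action,
        // like the "default:" branch of a switch statement.
        return "record title only: " + category;
    }

    public static void Main()
    {
        Console.WriteLine(Route("Science", "http://blogSite.com/science"));
        Console.WriteLine(Route("Movies", "http://blogSite.com/movies"));
    }
}
```

Compared with a switch, the table keeps the category names and their handlers in one place, at the cost of an extra level of indirection.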

Let's prepare our object model for the Science page:

/// <summary>
/// ScienceModel
/// </summary>
public class ScienceModel
{
    /// <summary>
    /// Gets or sets the title.
    /// </summary>
    /// <value>
    /// The title.
    /// </value>
    public string Title { get; set; }
    /// <summary>
    /// Gets or sets the author.
    /// </summary>
    /// <value>
    /// The author.
    /// </value>
    public string Author { get; set; }
    /// <summary>
    /// Gets or sets the date.
    /// </summary>
    /// <value>
    /// The date.
    /// </value>
    public string Date { get; set; }
    /// <summary>
    /// Gets or sets the image.
    /// </summary>
    /// <value>
    /// The image.
    /// </value>
    public string Image { get; set; }
    /// <summary>
    /// Gets or sets the text.
    /// </summary>
    /// <value>
    /// The text.
    /// </value>
    public string Text { get; set; }

}
''' <summary>
''' ScienceModel
''' </summary>
Public Class ScienceModel
	''' <summary>
	''' Gets or sets the title.
	''' </summary>
	''' <value>
	''' The title.
	''' </value>
	Public Property Title() As String
	''' <summary>
	''' Gets or sets the author.
	''' </summary>
	''' <value>
	''' The author.
	''' </value>
	Public Property Author() As String
	''' <summary>
	''' Gets or sets the date.
	''' </summary>
	''' <value>
	''' The date.
	''' </value>
	Public Property [Date]() As String
	''' <summary>
	''' Gets or sets the image.
	''' </summary>
	''' <value>
	''' The image.
	''' </value>
	Public Property Image() As String
	''' <summary>
	''' Gets or sets the text.
	''' </summary>
	''' <value>
	''' The text.
	''' </value>
	Public Property Text() As String

End Class

Now let's implement scraping of a single page:

/// <summary>
/// Parses the reviews.
/// </summary>
/// <param name="response">The response.</param>
public void ParseReviews(Response response)
{
    // List of scraped science posts
    var scienceList = new List<ScienceModel>();

    foreach (var postBox in response.Css("section.main > div > div.post-list"))
    {
        var item = new ScienceModel();
        item.Title = postBox.Css("h1.headline > a")[0].TextContentClean;
        item.Author = postBox.Css("div.author > a")[0].TextContentClean;
        item.Date = postBox.Css("div.time > a")[0].TextContentClean;
        item.Image = postBox.Css("div.image-wrapper.default-state > img")[0].Attributes["src"];
        item.Text = postBox.Css("div.summary > p")[0].TextContentClean;
        scienceList.Add(item);
    }

    Scrape(scienceList, "BlogScience.Jsonl");
}
''' <summary>
''' Parses the reviews.
''' </summary>
''' <param name="response">The response.</param>
Public Sub ParseReviews(ByVal response As Response)
	' List of scraped science posts
	Dim scienceList = New List(Of ScienceModel)()

	For Each postBox In response.Css("section.main > div > div.post-list")
		Dim item = New ScienceModel()
		item.Title = postBox.Css("h1.headline > a")(0).TextContentClean
		item.Author = postBox.Css("div.author > a")(0).TextContentClean
		item.Date = postBox.Css("div.time > a")(0).TextContentClean
		item.Image = postBox.Css("div.image-wrapper.default-state > img")(0).Attributes("src")
		item.Text = postBox.Css("div.summary > p")(0).TextContentClean
		scienceList.Add(item)
	Next postBox

	Scrape(scienceList, "BlogScience.Jsonl")
End Sub

After creating our model, we can parse the response object to drill down into its main elements (title, author, date, image, text).
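
Each field above is read with a `[0]` index, which would throw if a selector matched nothing on a given post. A guard along the following lines is one possible way to handle that. This is a sketch in plain C# under assumptions: `SafeSelect` and `FirstOrFallback` are hypothetical names, not IronWebScraper APIs, and plain strings stand in for the nodes a CSS query would return.

```csharp
using System;
using System.Collections.Generic;

public static class SafeSelect
{
    // Returns select(matches[0]) when there is at least one match,
    // otherwise the fallback value. Does not throw on an empty result.
    public static TResult FirstOrFallback<T, TResult>(
        IReadOnlyList<T> matches, Func<T, TResult> select, TResult fallback)
    {
        if (matches == null || matches.Count == 0)
        {
            return fallback;
        }
        return select(matches[0]);
    }

    public static void Main()
    {
        // Plain strings stand in for the nodes a CSS query would return.
        var found = new[] { "  First headline ", "Second headline" };
        var none = new string[0];

        Console.WriteLine(FirstOrFallback(found, s => s.Trim(), "(missing)"));
        Console.WriteLine(FirstOrFallback(none, s => s.Trim(), "(missing)"));
    }
}
```

If the result of `Css` is an array or another list type, as the `[0]` indexing suggests, the same helper could wrap each lookup so that a post with a missing image or summary is still recorded.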

Then we use Scrape(object, fileName) to save the results in a separate file.
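
The ".Jsonl" extension on the output files suggests the JSON Lines format, where each record is one JSON object on its own line. The sketch below only illustrates that format, assuming it is what is meant here; it is not IronWebScraper's writer. It uses `System.Text.Json` (built into .NET Core 3.0 and later), and `JsonLinesSketch`, `ToJsonLines`, and the sample records are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Text.Json;

public static class JsonLinesSketch
{
    // Serializes each record as one compact JSON object per line.
    public static string ToJsonLines(IEnumerable<Dictionary<string, string>> records)
    {
        var lines = new List<string>();
        foreach (var record in records)
        {
            lines.Add(JsonSerializer.Serialize(record));
        }
        return string.Join("\n", lines);
    }

    public static void Main()
    {
        // Made-up sample records with two of the model's fields.
        var posts = new List<Dictionary<string, string>>
        {
            new Dictionary<string, string> { { "Title", "First post" }, { "Author", "A. Writer" } },
            new Dictionary<string, string> { { "Title", "Second post" }, { "Author", "B. Writer" } },
        };
        Console.WriteLine(ToJsonLines(posts));
    }
}
```

A file in this shape can be read back one line at a time, so a large scrape does not have to be loaded into memory as a single JSON document.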

See the full tutorial on using IronWebScraper.

Web scraping has never been a simple task, and no dominant framework exists for it in C# or .NET programming environments. IronWebScraper was created to change this.
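Darrius Serrant
Full Stack Software Engineer (WebOps)

Darrius Serrant holds a bachelor's degree in computer science from the University of Miami and works as a Full Stack WebOps Marketing Engineer at Iron Software. Drawn to coding from a young age, he saw computing as both mysterious and accessible, making it the perfect medium for creativity and problem-solving.

At Iron Software, Darrius enjoys creating new things and simplifying complex concepts to make them more understandable. As one of our resident developers, he has also volunteered to teach students, sharing his expertise with the next generation.

For Darrius, his work is fulfilling because it is valued and has a real impact.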