如何使用 C# 抓取部落格

This article was translated from English: Does it need improvement?
Translated
View the article in English

讓我們使用 IronWebScraper 透過 C# 或 VB.NET 擷取部落格內容。

本教學將展示如何利用 .NET 將 WordPress 部落格(或類似平台)的內容進行擷取並轉化為可用內容

FireShotScreenCaptureGizmodo related to 如何使用 C# 抓取部落格

// Define a class that extends WebScraper from IronWebScraper
public class BlogScraper : WebScraper
{
    /// <summary>
    /// Override this method to initialize your web-scraper.
    /// Set at least one start URL and configure domain or URL patterns.
    /// </summary>
    public override void Init()
    {
        // Set your license key for IronWebScraper
        License.LicenseKey = "YourLicenseKey";

        // Enable logging for all actions
        this.LoggingLevel = WebScraper.LogLevel.All;

        // Set a directory to store output and cache files
        this.WorkingDirectory = AppSetting.GetAppRoot() + @"\BlogSample\Output\";

        // Enable caching with a specific duration
        EnableWebCache(new TimeSpan(1, 30, 30));

        // Request the start URL and specify the response handler
        this.Request("http://blogSite.com/", Parse);
    }
}
// Define a class that extends WebScraper from IronWebScraper
public class BlogScraper : WebScraper
{
    /// <summary>
    /// Override this method to initialize your web-scraper.
    /// Set at least one start URL and configure domain or URL patterns.
    /// </summary>
    public override void Init()
    {
        // Set your license key for IronWebScraper
        License.LicenseKey = "YourLicenseKey";

        // Enable logging for all actions
        this.LoggingLevel = WebScraper.LogLevel.All;

        // Set a directory to store output and cache files
        this.WorkingDirectory = AppSetting.GetAppRoot() + @"\BlogSample\Output\";

        // Enable caching with a specific duration
        EnableWebCache(new TimeSpan(1, 30, 30));

        // Request the start URL and specify the response handler
        this.Request("http://blogSite.com/", Parse);
    }
}
' Define a class that extends WebScraper from IronWebScraper
Public Class BlogScraper
	Inherits WebScraper

	''' <summary>
	''' Override this method to initialize your web-scraper.
	''' Set at least one start URL and configure domain or URL patterns.
	''' </summary>
	Public Overrides Sub Init()
		' Set your license key for IronWebScraper
		License.LicenseKey = "YourLicenseKey"

		' Enable logging for all actions
		Me.LoggingLevel = WebScraper.LogLevel.All

		' Set a directory to store output and cache files
		Me.WorkingDirectory = AppSetting.GetAppRoot() & "\BlogSample\Output\"

		' Enable caching with a specific duration
		EnableWebCache(New TimeSpan(1, 30, 30))

		' Request the start URL and specify the response handler
		Me.Request("http://blogSite.com/", Parse)
	End Sub
End Class
$vbLabelText   $csharpLabel

如同往常,我們建立一個 Scraper 並繼承自 WebScraper 類別。 在此情況下,該代碼為"BlogScraper"。

我們將工作目錄設定為"\BlogSample\Output\",所有輸出檔與快取檔皆可存放於此。

接著,我們啟用網頁快取功能,將請求的頁面節省至"WebCache"快取資料夾中。

現在讓我們編寫一個 Parse 函式:

/// <summary>
/// Override this method to handle the Http Response for your web scraper.
/// Add additional methods if you handle multiple page types.
/// </summary>
/// <param name="response">The HTTP Response object to parse.</param>
public override void Parse(Response response)
{
    // Iterate over each link found in the section navigation
    foreach (var link in response.Css("div.section-nav > ul > li > a"))
    {
        switch(link.TextContentClean)
        {
            case "Reviews":
                {
                    // Handle reviews case
                }
                break;
            case "Science":
                {
                    // Handle science case
                }
                break;
            default:
                {
                    // Save the link title to a file
                    Scrape(new ScrapedData() { { "Title", link.TextContentClean } }, "BlogScraper.Jsonl");
                }
                break;
        }
    }
}
/// <summary>
/// Override this method to handle the Http Response for your web scraper.
/// Add additional methods if you handle multiple page types.
/// </summary>
/// <param name="response">The HTTP Response object to parse.</param>
public override void Parse(Response response)
{
    // Iterate over each link found in the section navigation
    foreach (var link in response.Css("div.section-nav > ul > li > a"))
    {
        switch(link.TextContentClean)
        {
            case "Reviews":
                {
                    // Handle reviews case
                }
                break;
            case "Science":
                {
                    // Handle science case
                }
                break;
            default:
                {
                    // Save the link title to a file
                    Scrape(new ScrapedData() { { "Title", link.TextContentClean } }, "BlogScraper.Jsonl");
                }
                break;
        }
    }
}
''' <summary>
''' Override this method to handle the Http Response for your web scraper.
''' Add additional methods if you handle multiple page types.
''' </summary>
''' <param name="response">The HTTP Response object to parse.</param>
Public Overrides Sub Parse(ByVal response As Response)
	' Iterate over each link found in the section navigation
	For Each link In response.Css("div.section-nav > ul > li > a")
		Select Case link.TextContentClean
			Case "Reviews"
					' Handle reviews case
			Case "Science"
					' Handle science case
			Case Else
					' Save the link title to a file
					Scrape(New ScrapedData() From {
						{ "Title", link.TextContentClean }
					},
					"BlogScraper.Jsonl")
		End Select
	Next link
End Sub
$vbLabelText   $csharpLabel

Parse 方法中,我們會從頂部選單取得所有分類頁面(電影、科學、評論等)的連結。

接著,我們會根據連結類別切換至合適的解析方法。

讓我們為"科學頁面"準備物件模型:

/// <summary>
/// Represents a model for Science Page
/// </summary>
public class ScienceModel
{
    /// <summary>
    /// Gets or sets the title.
    /// </summary>
    public string Title { get; set; }

    /// <summary>
    /// Gets or sets the author.
    /// </summary>
    public string Author { get; set; }

    /// <summary>
    /// Gets or sets the date.
    /// </summary>
    public string Date { get; set; }

    /// <summary>
    /// Gets or sets the image.
    /// </summary>
    public string Image { get; set; }

    /// <summary>
    /// Gets or sets the text.
    /// </summary>
    public string Text { get; set; }
}
/// <summary>
/// Represents a model for Science Page
/// </summary>
public class ScienceModel
{
    /// <summary>
    /// Gets or sets the title.
    /// </summary>
    public string Title { get; set; }

    /// <summary>
    /// Gets or sets the author.
    /// </summary>
    public string Author { get; set; }

    /// <summary>
    /// Gets or sets the date.
    /// </summary>
    public string Date { get; set; }

    /// <summary>
    /// Gets or sets the image.
    /// </summary>
    public string Image { get; set; }

    /// <summary>
    /// Gets or sets the text.
    /// </summary>
    public string Text { get; set; }
}
''' <summary>
''' Represents a model for Science Page
''' </summary>
Public Class ScienceModel
	''' <summary>
	''' Gets or sets the title.
	''' </summary>
	Public Property Title() As String

	''' <summary>
	''' Gets or sets the author.
	''' </summary>
	Public Property Author() As String

	''' <summary>
	''' Gets or sets the date.
	''' </summary>
	Public Property [Date]() As String

	''' <summary>
	''' Gets or sets the image.
	''' </summary>
	Public Property Image() As String

	''' <summary>
	''' Gets or sets the text.
	''' </summary>
	Public Property Text() As String
End Class
$vbLabelText   $csharpLabel

現在讓我們實作單一頁面的擷取:

/// <summary>
/// Parses the reviews from the response.
/// </summary>
/// <param name="response">The HTTP Response object.</param>
public void ParseReviews(Response response)
{
    // A list to hold Science models
    var scienceList = new List<ScienceModel>();

    foreach (var postBox in response.Css("section.main > div > div.post-list"))
    {
        var item = new ScienceModel
        {
            Title = postBox.Css("h1.headline > a")[0].TextContentClean,
            Author = postBox.Css("div.author > a")[0].TextContentClean,
            Date = postBox.Css("div.time > a")[0].TextContentClean,
            Image = postBox.Css("div.image-wrapper.default-state > img")[0].Attributes["src"],
            Text = postBox.Css("div.summary > p")[0].TextContentClean
        };

        scienceList.Add(item);
    }

    // Save the science list to a JSONL file
    Scrape(scienceList, "BlogScience.Jsonl");
}
/// <summary>
/// Parses the reviews from the response.
/// </summary>
/// <param name="response">The HTTP Response object.</param>
public void ParseReviews(Response response)
{
    // A list to hold Science models
    var scienceList = new List<ScienceModel>();

    foreach (var postBox in response.Css("section.main > div > div.post-list"))
    {
        var item = new ScienceModel
        {
            Title = postBox.Css("h1.headline > a")[0].TextContentClean,
            Author = postBox.Css("div.author > a")[0].TextContentClean,
            Date = postBox.Css("div.time > a")[0].TextContentClean,
            Image = postBox.Css("div.image-wrapper.default-state > img")[0].Attributes["src"],
            Text = postBox.Css("div.summary > p")[0].TextContentClean
        };

        scienceList.Add(item);
    }

    // Save the science list to a JSONL file
    Scrape(scienceList, "BlogScience.Jsonl");
}
''' <summary>
''' Parses the reviews from the response.
''' </summary>
''' <param name="response">The HTTP Response object.</param>
Public Sub ParseReviews(ByVal response As Response)
	' A list to hold Science models
	Dim scienceList = New List(Of ScienceModel)()

	For Each postBox In response.Css("section.main > div > div.post-list")
		Dim item = New ScienceModel With {
			.Title = postBox.Css("h1.headline > a")(0).TextContentClean,
			.Author = postBox.Css("div.author > a")(0).TextContentClean,
			.Date = postBox.Css("div.time > a")(0).TextContentClean,
			.Image = postBox.Css("div.image-wrapper.default-state > img")(0).Attributes("src"),
			.Text = postBox.Css("div.summary > p")(0).TextContentClean
		}

		scienceList.Add(item)
	Next postBox

	' Save the science list to a JSONL file
	Scrape(scienceList, "BlogScience.Jsonl")
End Sub
$vbLabelText   $csharpLabel

建立模型後,我們可以解析 Response 物件,以深入檢視其主要元素(標題、作者、日期、圖片、文字)。

接著,我們使用 Scrape(object, fileName) 將結果儲存至獨立檔案中。

點擊此處查看 using IronWebscraper 的使用教學的完整內容

開始使用 IronWebscraper

網頁抓取向來並非易事,在 C# 或 .NET Framework 程式設計環境中也缺乏主流的框架。IronWebScraper 的誕生正是為了改變這項現狀

常見問題

如何使用 C# 建立部落格網頁擷取工具?

若要使用 C# 建立部落格網頁擷取工具,您可以使用 IronWebscraper 程式庫。首先定義一個繼承自 WEBSCRAPER 類別的類別,設定起始網址,配置擷取器以處理不同類型的網頁,並使用 Parse 方法從 HTTP 回應中擷取所需資訊。

在網頁抓取中,Parse 方法的功能是什麼?

在使用 IronWebscraper 進行網頁抓取時,Parse 方法對於處理 HTTP 回應至關重要。它透過解析網頁內容、識別連結,並將頁面類型(如部落格文章或其他區塊)進行分類,從而協助提取資料。

如何有效管理網頁抓取資料?

IronWebscraper 透過配置快取來儲存請求的頁面,並設定輸出檔案的工作目錄,從而實現高效的数据管理。此組織方式有助於追蹤抓取的數據,並減少不必要的頁面重新抓取。

IronWebscraper 如何協助抓取 WordPress 部落格?

IronWebscraper 透過提供導覽部落格結構、擷取文章詳情及處理各類頁面類型的工具,簡化了 WordPress 部落格的擷取作業。您可以使用 IronWebscraper 程式庫解析文章,以取得標題、作者、日期、圖片及文字等資訊。

IronWebScraper 是否可同時用於 C# 和 VB.NET?

是的,IronWebscraper 同時相容於 C# 和 VB.NET,對於偏好這兩種 .NET 語言的開發者而言,是極具彈性的選擇。

如何處理部落格內不同類型的頁面?

您可透過覆寫 IronWebScraper 中的 Parse 方法,來處理部落格內不同類型的頁面。此方法讓您能將頁面分類至不同區塊(例如「評論」與「科學」),並針對各區塊套用特定的解析邏輯。

是否有方法能將抓取的部落格資料以結構化格式儲存?

是的,使用 IronWebscraper,您可以將抓取的部落格資料儲存為 JSONL 等結構化格式。此格式能將每筆資料以逐行 JSON 格式儲存,便於日後管理與處理。

如何為我的網頁抓取工具設定工作目錄?

在 IronWebscraper 中,您可以透過設定擷取器來指定輸出檔與快取檔的儲存位置,藉此設定工作目錄。此功能有助於有效率地整理擷取到的資料。

網頁爬取中有哪些常見的疑難排解情境?

網頁抓取常見的疑難排解情境包括:處理網站結構變更、管理速率限制,以及應對反抓取措施。使用 IronWebscraper,您可以實作錯誤處理與記錄功能,以診斷並解決這些問題。

我該去哪裡尋找資源,以進一步了解如何使用 IronWebScraper?

您可在 Iron Software 網站上找到關於使用 IronWebscraper 的資源與教學,該網站的「網頁擷取教學」專區提供了詳細的指南與範例。

Curtis Chau
技術撰稿人

Curtis Chau 擁有卡爾頓大學(Carleton University)的電腦科學學士學位,專精於前端開發,並精通 Node.js、TypeScript、JavaScript 及 React。他熱衷於打造直觀且美觀的用戶介面,喜歡運用現代框架,並創建結構完善、視覺上吸引人的手冊。

除了開發工作之外,Curtis 對物聯網(IoT)抱有濃厚興趣,致力於探索整合硬體與軟體的創新方法。閒暇時,他喜歡玩遊戲和開發 Discord 機器人,將對科技的熱愛與創意相結合。

準備開始了嗎?
Nuget 下載 137,906 | 版本: 2026.6 just released
Still Scrolling Icon

還在捲動嗎?

想要快速證明? PM > Install-Package IronWebScraper
執行範例 觀看您的目標網站成為結構化資料。