How to Scrape a Blog in C#

Ahmed Aboelmagd

Let's use Iron WebScraper to extract blog content with C# or VB.NET.

This tutorial shows how a WordPress blog (or a similar site) can be scraped back into content using .NET.


Step 1:

Check out IronWebScraper on NuGet for quick installation and deployment. With over 8 million downloads, it is transforming web scraping with C#.

C# NuGet Library: nuget.org/packages/IronWebScraper/
Install-Package IronWebScraper

Alternatively, consider installing the IronWebScraper DLL directly. Download it and manually install it into your project or the GAC: webscraperIronWebScraper.zip

Manually install into your project

Download DLL

public class BlogScraper : WebScraper
{
    /// <summary>
    /// Override this method to initialize your web scraper.
    /// Important tasks are to request at least one start URL and to set allowed/banned domain or URL patterns.
    /// </summary>
    public override void Init()
    {
        License.LicenseKey = " LicenseKey ";
        this.LoggingLevel = WebScraper.LogLevel.All;
        this.WorkingDirectory = AppSetting.GetAppRoot() + @"\BlogSample\Output\";
        EnableWebCache(new TimeSpan(1, 30, 30));
        this.Request("http://blogSite.com/", Parse);
    }
}
Public Class BlogScraper
	Inherits WebScraper

	''' <summary>
	''' Override this method to initialize your web scraper.
	''' Important tasks are to request at least one start URL and to set allowed/banned domain or URL patterns.
	''' </summary>
	Public Overrides Sub Init()
		License.LicenseKey = " LicenseKey "
		Me.LoggingLevel = WebScraper.LogLevel.All
		Me.WorkingDirectory = AppSetting.GetAppRoot() & "\BlogSample\Output\"
		EnableWebCache(New TimeSpan(1, 30, 30))
		Me.Request("http://blogSite.com/", Parse)
	End Sub
End Class

As usual, we create a scraper and inherit it from the WebScraper class; in this example it is called "BlogScraper".

We set the working directory to "\BlogSample\Output\", where all output and cache files will be kept.

Then we enable the web cache, which saves requested pages to the "WebCache" cache folder.

Now let's write a parse function:

/// <summary>
/// Override this method to create the default Response handler for your web scraper.
/// If you have multiple page types, you can add additional similar methods.
/// </summary>
/// <param name="response">The http Response object to parse</param>
public override void Parse(Response response)
{
    foreach (var link in response.Css("div.section-nav > ul > li > a "))
    {
        switch(link.TextContentClean)
        {
            case "Reviews":
                {

                }break;
            case "Science":
                {

                }break;
            default:
                {
                    // Save Result to File
                    Scrape(new ScrapedData() { { "Title", link.TextContentClean } }, "BlogScraper.Jsonl");
                }
                break;
        }
    }
}
''' <summary>
''' Override this method to create the default Response handler for your web scraper.
''' If you have multiple page types, you can add additional similar methods.
''' </summary>
''' <param name="response">The http Response object to parse</param>
Public Overrides Sub Parse(ByVal response As Response)
	For Each link In response.Css("div.section-nav > ul > li > a ")
		Select Case link.TextContentClean
			Case "Reviews"

			Case "Science"

			Case Else
					' Save Result to File
					Scrape(New ScrapedData() From {
						{ "Title", link.TextContentClean }
					},
					"BlogScraper.Jsonl")
		End Select
	Next link
End Sub

In the Parse method, we parse the top menu to get the links to all of the category pages (Movies, Science, Reviews, etc.).

Then we switch to a suitable parse method based on the link's category, as sketched below.
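
The "Reviews" and "Science" case bodies above are left empty at this point in the tutorial. As a minimal sketch of the dispatch (not the tutorial's final code), each category link could be queued with its own handler via the same Request method used in Init. Here ParseReviews is the handler implemented later in this tutorial, ParseScience is a hypothetical handler following the same pattern, and the menu links are assumed to carry absolute URLs in their href attributes:

public override void Parse(Response response)
{
    foreach (var link in response.Css("div.section-nav > ul > li > a"))
    {
        switch (link.TextContentClean)
        {
            case "Reviews":
                // Queue the Reviews page for the ParseReviews handler shown later in this tutorial
                this.Request(link.Attributes["href"], ParseReviews);
                break;
            case "Science":
                // Queue the Science page for a page-specific handler (ParseScience is hypothetical here)
                this.Request(link.Attributes["href"], ParseScience);
                break;
            default:
                // As in the original example, save the link title to a JSONL file
                Scrape(new ScrapedData() { { "Title", link.TextContentClean } }, "BlogScraper.Jsonl");
                break;
        }
    }
}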

Let's prepare the object model for the Science page:

/// <summary>
/// ScienceModel
/// </summary>
public class ScienceModel
{
    /// <summary>
    /// Gets or sets the title.
    /// </summary>
    /// <value>
    /// The title.
    /// </value>
    public string Title { get; set; }
    /// <summary>
    /// Gets or sets the author.
    /// </summary>
    /// <value>
    /// The author.
    /// </value>
    public string Author { get; set; }
    /// <summary>
    /// Gets or sets the date.
    /// </summary>
    /// <value>
    /// The date.
    /// </value>
    public string Date { get; set; }
    /// <summary>
    /// Gets or sets the image.
    /// </summary>
    /// <value>
    /// The image.
    /// </value>
    public string Image { get; set; }
    /// <summary>
    /// Gets or sets the text.
    /// </summary>
    /// <value>
    /// The text.
    /// </value>
    public string Text { get; set; }

}
''' <summary>
''' ScienceModel
''' </summary>
Public Class ScienceModel
	''' <summary>
	''' Gets or sets the title.
	''' </summary>
	''' <value>
	''' The title.
	''' </value>
	Public Property Title() As String
	''' <summary>
	''' Gets or sets the author.
	''' </summary>
	''' <value>
	''' The author.
	''' </value>
	Public Property Author() As String
	''' <summary>
	''' Gets or sets the date.
	''' </summary>
	''' <value>
	''' The date.
	''' </value>
	Public Property [Date]() As String
	''' <summary>
	''' Gets or sets the image.
	''' </summary>
	''' <value>
	''' The image.
	''' </value>
	Public Property Image() As String
	''' <summary>
	''' Gets or sets the text.
	''' </summary>
	''' <value>
	''' The text.
	''' </value>
	Public Property Text() As String

End Class

Now, let's scrape a single page:

/// <summary>
/// Parses the reviews.
/// </summary>
/// <param name="response">The response.</param>
public void ParseReviews(Response response)
{
    // List of science articles scraped from this page
    var scienceList = new List<ScienceModel>();

    foreach (var postBox in response.Css("section.main > div > div.post-list"))
    {
        var item = new ScienceModel();
        item.Title = postBox.Css("h1.headline > a")[0].TextContentClean;
        item.Author = postBox.Css("div.author > a")[0].TextContentClean;
        item.Date = postBox.Css("div.time > a")[0].TextContentClean;
        item.Image = postBox.Css("div.image-wrapper.default-state > img")[0].Attributes ["src"];
        item.Text = postBox.Css("div.summary > p")[0].TextContentClean;
        scienceList.Add(item);
    }

    Scrape(scienceList, "BlogScience.Jsonl");
}
''' <summary>
''' Parses the reviews.
''' </summary>
''' <param name="response">The response.</param>
Public Sub ParseReviews(ByVal response As Response)
	' List of science articles scraped from this page
	Dim scienceList = New List(Of ScienceModel)()

	For Each postBox In response.Css("section.main > div > div.post-list")
		Dim item = New ScienceModel()
		item.Title = postBox.Css("h1.headline > a")(0).TextContentClean
		item.Author = postBox.Css("div.author > a")(0).TextContentClean
		item.Date = postBox.Css("div.time > a")(0).TextContentClean
		item.Image = postBox.Css("div.image-wrapper.default-state > img")(0).Attributes ("src")
		item.Text = postBox.Css("div.summary > p")(0).TextContentClean
		scienceList.Add(item)
	Next postBox

	Scrape(scienceList, "BlogScience.Jsonl")
End Sub

After creating our model, we can parse the response object and drill down into its main elements (title, author, date, image, and text).

Then we save the results to a file using Scrape(object, fileName).
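
To actually run the scraper, a console entry point such as the following is typical. This is a minimal sketch: the Program class and Main method are assumptions for this example, and it relies on WebScraper's Start() method to launch the crawl and process every queued request.

using IronWebScraper;

class Program
{
    static void Main(string[] args)
    {
        // Create the scraper defined above and start the crawl;
        // JSONL output files (e.g. BlogScraper.Jsonl, BlogScience.Jsonl)
        // are written under the WorkingDirectory configured in Init().
        var scraper = new BlogScraper();
        scraper.Start();
    }
}

Once Start() returns, the output files can be inspected in the working directory.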

Click here to view Ahmed's complete IronWebScraper tutorial.

Web scraping has never been a simple task, and no mainstream framework for it existed in the C# or .NET programming environment. Iron Web Scraper was created to change that. For many .NET software engineers, it is the most effective way to extract web content from .NET, because there is no additional API or complex design system to learn.

Ahmed Aboelmagd

.NET Software Solution Architect at a multinational IT company

Ahmed is an experienced and certified Microsoft technology specialist with more than 10 years of experience in IT and software development. He has worked for several companies and is now a country manager at a multinational IT company.

Ahmed has been using IronPDF and IronWebScraper for more than a year and uses them in several of his company's projects.