Cómo scrapear un blog en C#

Name: IronWebScraper
Brand: Iron Software
Availability: InStock
Rating: 4.72 (37 reviews)

Darrius Serrant

13 de noviembre, 2018

Actualizado 10 de diciembre, 2024

Translated

View the article in English

Usemos Iron WebScraper para extraer el contenido de un Blog usando C# o VB.NET.

Este tutorial muestra cómo un blog de WordPress (o similar) puede ser extraído en contenido utilizando .NET

public class BlogScraper : WebScraper
{
    /// <summary>
    /// Override this method initializes your web-scraper.
    /// Important tasks will be to Request at least one start url... and set allowed/banned domain or url patterns.
    /// </summary>
    public override void Init()
    {
        License.LicenseKey = " LicenseKey ";
        this.LoggingLevel = WebScraper.LogLevel.All;
        this.WorkingDirectory = AppSetting.GetAppRoot() + @"\BlogSample\Output\";
        EnableWebCache(new TimeSpan(1, 30, 30));
        this.Request("http://blogSite.com/", Parse);
    }
}

public class BlogScraper : WebScraper
{
    /// <summary>
    /// Override this method initializes your web-scraper.
    /// Important tasks will be to Request at least one start url... and set allowed/banned domain or url patterns.
    /// </summary>
    public override void Init()
    {
        License.LicenseKey = " LicenseKey ";
        this.LoggingLevel = WebScraper.LogLevel.All;
        this.WorkingDirectory = AppSetting.GetAppRoot() + @"\BlogSample\Output\";
        EnableWebCache(new TimeSpan(1, 30, 30));
        this.Request("http://blogSite.com/", Parse);
    }
}

Public Class BlogScraper
	Inherits WebScraper

	''' <summary>
	''' Override this method initializes your web-scraper.
	''' Important tasks will be to Request at least one start url... and set allowed/banned domain or url patterns.
	''' </summary>
	Public Overrides Sub Init()
		License.LicenseKey = " LicenseKey "
		Me.LoggingLevel = WebScraper.LogLevel.All
		Me.WorkingDirectory = AppSetting.GetAppRoot() & "\BlogSample\Output\"
		EnableWebCache(New TimeSpan(1, 30, 30))
		Me.Request("http://blogSite.com/", Parse)
	End Sub
End Class

$vbLabelText $csharpLabel

Como de costumbre, creamos un Scraper y heredamos de la clase WebScraper. En este caso es "BlogScraper".

Establecemos un directorio de trabajo a "\BlogSample\Output\" que es donde todos los archivos de salida y la memoria caché puede ir.

A continuación, activamos la caché web para guardar las páginas solicitadas en la carpeta de caché "WebCache".

Ahora vamos a escribir una función de análisis sintáctico:

/// <summary>
/// Override this method to create the default Response handler for your web scraper.
/// If you have multiple page types, you can add additional similar methods.
/// </summary>
/// <param name="response">The http Response object to parse</param>
public override void Parse(Response response)
{
    foreach (var link in response.Css("div.section-nav > ul > li > a "))
    {
        switch(link.TextContentClean)
        {
            case "Reviews":
                {

                }break;
            case "Science":
                {

                }break;
            default:
                {
                    // Save Result to File
                    Scrape(new ScrapedData() { { "Title", link.TextContentClean } }, "BlogScraper.Jsonl");
                }
                break;
        }
    }
}

/// <summary>
/// Override this method to create the default Response handler for your web scraper.
/// If you have multiple page types, you can add additional similar methods.
/// </summary>
/// <param name="response">The http Response object to parse</param>
public override void Parse(Response response)
{
    foreach (var link in response.Css("div.section-nav > ul > li > a "))
    {
        switch(link.TextContentClean)
        {
            case "Reviews":
                {

                }break;
            case "Science":
                {

                }break;
            default:
                {
                    // Save Result to File
                    Scrape(new ScrapedData() { { "Title", link.TextContentClean } }, "BlogScraper.Jsonl");
                }
                break;
        }
    }
}

''' <summary>
''' Override this method to create the default Response handler for your web scraper.
''' If you have multiple page types, you can add additional similar methods.
''' </summary>
''' <param name="response">The http Response object to parse</param>
Public Overrides Sub Parse(ByVal response As Response)
	For Each link In response.Css("div.section-nav > ul > li > a ")
		Select Case link.TextContentClean
			Case "Reviews"

			Case "Science"

			Case Else
					' Save Result to File
					Scrape(New ScrapedData() From {
						{ "Title", link.TextContentClean }
					},
					"BlogScraper.Jsonl")
		End Select
	Next link
End Sub

$vbLabelText $csharpLabel

Dentro del método parse; analizamos el menú superior para obtener todos los enlaces a todas las páginas de categorías (Películas, Ciencia, Reseñas, etc.)

A continuación, cambiamos a un método de análisis adecuado basado en la categoría del enlace.

Preparemos nuestro modelo de objetos para la Página de Ciencia:

/// <summary>
/// ScienceModel
/// </summary>
public class ScienceModel
{
    /// <summary>
    /// Gets or sets the title.
    /// </summary>
    /// <value>
    /// The title.
    /// </value>
    public string Title { get; set; }
    /// <summary>
    /// Gets or sets the author.
    /// </summary>
    /// <value>
    /// The author.
    /// </value>
    public string Author { get; set; }
    /// <summary>
    /// Gets or sets the date.
    /// </summary>
    /// <value>
    /// The date.
    /// </value>
    public string Date { get; set; }
    /// <summary>
    /// Gets or sets the image.
    /// </summary>
    /// <value>
    /// The image.
    /// </value>
    public string Image { get; set; }
    /// <summary>
    /// Gets or sets the text.
    /// </summary>
    /// <value>
    /// The text.
    /// </value>
    public string Text { get; set; }

}

/// <summary>
/// ScienceModel
/// </summary>
public class ScienceModel
{
    /// <summary>
    /// Gets or sets the title.
    /// </summary>
    /// <value>
    /// The title.
    /// </value>
    public string Title { get; set; }
    /// <summary>
    /// Gets or sets the author.
    /// </summary>
    /// <value>
    /// The author.
    /// </value>
    public string Author { get; set; }
    /// <summary>
    /// Gets or sets the date.
    /// </summary>
    /// <value>
    /// The date.
    /// </value>
    public string Date { get; set; }
    /// <summary>
    /// Gets or sets the image.
    /// </summary>
    /// <value>
    /// The image.
    /// </value>
    public string Image { get; set; }
    /// <summary>
    /// Gets or sets the text.
    /// </summary>
    /// <value>
    /// The text.
    /// </value>
    public string Text { get; set; }

}

''' <summary>
''' ScienceModel
''' </summary>
Public Class ScienceModel
	''' <summary>
	''' Gets or sets the title.
	''' </summary>
	''' <value>
	''' The title.
	''' </value>
	Public Property Title() As String
	''' <summary>
	''' Gets or sets the author.
	''' </summary>
	''' <value>
	''' The author.
	''' </value>
	Public Property Author() As String
	''' <summary>
	''' Gets or sets the date.
	''' </summary>
	''' <value>
	''' The date.
	''' </value>
	Public Property [Date]() As String
	''' <summary>
	''' Gets or sets the image.
	''' </summary>
	''' <value>
	''' The image.
	''' </value>
	Public Property Image() As String
	''' <summary>
	''' Gets or sets the text.
	''' </summary>
	''' <value>
	''' The text.
	''' </value>
	Public Property Text() As String

End Class

$vbLabelText $csharpLabel

Ahora vamos a implementar un scrape de una sola página:

/// <summary>
/// Parses the reviews.
/// </summary>
/// <param name="response">The response.</param>
public void ParseReviews(Response response)
{
    // List of Science Link
    var scienceList = new List<ScienceModel>();

    foreach (var postBox in response.Css("section.main > div > div.post-list"))
    {
        var item = new ScienceModel();
        item.Title = postBox.Css("h1.headline > a")[0].TextContentClean;
        item.Author = postBox.Css("div.author > a")[0].TextContentClean;
        item.Date = postBox.Css("div.time > a")[0].TextContentClean;
        item.Image = postBox.Css("div.image-wrapper.default-state > img")[0].Attributes ["src"];
        item.Text = postBox.Css("div.summary > p")[0].TextContentClean;
        scienceList.Add(item);
    }

    Scrape(scienceList, "BlogScience.Jsonl");
}

/// <summary>
/// Parses the reviews.
/// </summary>
/// <param name="response">The response.</param>
public void ParseReviews(Response response)
{
    // List of Science Link
    var scienceList = new List<ScienceModel>();

    foreach (var postBox in response.Css("section.main > div > div.post-list"))
    {
        var item = new ScienceModel();
        item.Title = postBox.Css("h1.headline > a")[0].TextContentClean;
        item.Author = postBox.Css("div.author > a")[0].TextContentClean;
        item.Date = postBox.Css("div.time > a")[0].TextContentClean;
        item.Image = postBox.Css("div.image-wrapper.default-state > img")[0].Attributes ["src"];
        item.Text = postBox.Css("div.summary > p")[0].TextContentClean;
        scienceList.Add(item);
    }

    Scrape(scienceList, "BlogScience.Jsonl");
}

''' <summary>
''' Parses the reviews.
''' </summary>
''' <param name="response">The response.</param>
Public Sub ParseReviews(ByVal response As Response)
	' List of Science Link
	Dim scienceList = New List(Of ScienceModel)()

	For Each postBox In response.Css("section.main > div > div.post-list")
		Dim item = New ScienceModel()
		item.Title = postBox.Css("h1.headline > a")(0).TextContentClean
		item.Author = postBox.Css("div.author > a")(0).TextContentClean
		item.Date = postBox.Css("div.time > a")(0).TextContentClean
		item.Image = postBox.Css("div.image-wrapper.default-state > img")(0).Attributes ("src")
		item.Text = postBox.Css("div.summary > p")(0).TextContentClean
		scienceList.Add(item)
	Next postBox

	Scrape(scienceList, "BlogScience.Jsonl")
End Sub

$vbLabelText $csharpLabel

Después de haber creado nuestro modelo, podemos analizar el objeto de respuesta para profundizar en sus elementos principales (título, autor, fecha, imagen, texto).

Luego guardamos nuestro resultado en un archivo separado usando Scrape(object, fileName).

Haga clic aquí para el tutorial completo sobre el uso de IronWebscraper

Comienza con IronWebscraper

Comience a usar IronWebScraper en su proyecto hoy con una prueba gratuita.

Primer Paso:

Darrius Serrant

Chatea con el equipo de ingeniería ahora

Ingeniero de Software Full Stack (WebOps)

Darrius Serrant tiene una licenciatura en Informática de la Universidad de Miami y trabaja como Ingeniero de Marketing WebOps Full Stack en Iron Software. Atraído por la programación desde una edad temprana, veía la computación como algo misterioso y accesible, lo que la convertía en el medio perfecto para la creatividad y la resolución de problemas.

En Iron Software, Darrius disfruta creando cosas nuevas y simplificando conceptos complejos para hacerlos más comprensibles. Como uno de nuestros desarrolladores residentes, también se ha ofrecido como voluntario para enseñar a los estudiantes, compartiendo su experiencia con la próxima generación.

Para Darrius, su trabajo es gratificante porque es valorado y tiene un impacto real.