Updated October 20, 2024

Scraping content from a shopping website

We picked a shopping site to scrape its content.

[Image: ShoppingSite]

As the image shows, the left sidebar contains links to the site's product categories.

So our first step is to inspect the site's HTML and plan how we want to scrape it.

[Image: ShoppingSiteLeftBar]

The site's Fashion category has subcategories (Men, Women, Kids):

<li class="menu-item" data-id="">
    <a href="" class="main-category">
        <i class="cat-icon osh-font-fashion"></i> <span class="nav-subTxt">FASHION </span> <i class="osh-font-light-arrow-left"></i><i class="osh-font-light-arrow-right"></i>
    </a> <div class="navLayerWrapper" style="width: 633px; display: none;"><div class="submenu"><div class="column"><div class="categories"><a class="category" href=";dir=desc&amp;viewType=gridView3">New Arrivals !</a>  </div><div class="categories"><a class="category" href="">Men</a>   <a class="subcategory" href="">Shoes</a>   <a class="subcategory" href="">Clothing</a>   <a class="subcategory" href="">Accessories</a>  </div><div class="categories"><a class="category" href="">Women</a>   <a class="subcategory" href="">Shoes</a>   <a class="subcategory" href="">Clothing</a>   <a class="subcategory" href="">Accessories</a>  </div><div class="categories"><a class="category" href="">Kids</a>   <a class="subcategory" href="">Boys</a>   <a class="subcategory" href="">Girls</a>  </div><div class="categories"><a class="category" href="">Maternity Clothes</a>  </div></div><div class="column"><div class="categories"> <span class="category defaultCursor">Men Best Sellers</span>  <a class="subcategory" href="">Casual Shoes</a>   <a class="subcategory" href="">Sneakers</a>   <a class="subcategory" href="">T-shirts</a>   <a class="subcategory" href="">Polos</a>  </div><div class="categories"> <span class="category defaultCursor">Women Best Sellers</span>  <a class="subcategory" href="">Sandals</a>   <a class="subcategory" href="">Sneakers</a>   <a class="subcategory" href="">Dresses</a>   <a class="subcategory" href="">Tops</a>  </div><div class="categories"><a class="category" href="">Women's Curvy Clothing</a>  </div><div class="categories"><a class="category" href="">Fashion Bundles</a>  </div><div class="categories"><a class="category" href="">Hijab Fashion</a>  </div></div><div class="column"><div class="categories"><a class="category" href="">SEE ALL BRANDS</a>   <a class="subcategory" href="">Adidas</a>   <a class="subcategory" href="">Converse</a>   <a class="subcategory" href="">Ravin</a>   <a class="subcategory" href="">Dejavu</a>   <a class="subcategory" 
href="">Agu</a>   <a class="subcategory" href="">Activ</a>   <a class="subcategory" href="">Tie House</a>   <a class="subcategory" href="">Shoe Room</a>   <a class="subcategory" href="">Town Team</a>  </div></div></div></div>
</li>

Let's create a project

  1. Create a new console application, or add a new folder for our sample, named "ShoppingSiteSample".

  2. Add a new class named "ShoppingScraper".

  3. The first step is to scrape the site's categories and their subcategories.

    Let's create a Category model:

using System.Collections.Generic;

public class Category
{
    /// <summary>
    /// Gets or sets the name.
    /// </summary>
    /// <value>
    /// The name.
    /// </value>
    public string Name { get; set; }

    /// <summary>
    /// Gets or sets the URL.
    /// </summary>
    /// <value>
    /// The URL.
    /// </value>
    public string URL { get; set; }

    /// <summary>
    /// Gets or sets the sub categories.
    /// </summary>
    /// <value>
    /// The sub categories.
    /// </value>
    public List<Category> SubCategories { get; set; }
}

Public Class Category
	''' <summary>
	''' Gets or sets the name.
	''' </summary>
	''' <value>
	''' The name.
	''' </value>
	Public Property Name() As String
	''' <summary>
	''' Gets or sets the URL.
	''' </summary>
	''' <value>
	''' The URL.
	''' </value>
	Public Property URL() As String
	''' <summary>
	''' Gets or sets the sub categories.
	''' </summary>
	''' <value>
	''' The sub categories.
	''' </value>
	Public Property SubCategories() As List(Of Category)
End Class
  4. Now let's build our scraping logic:
public class ShoppingScraper : WebScraper
{
    /// <summary>
    /// Override this method to initialize your web scraper.
    /// Important tasks will be to request at least one start URL and to set allowed/banned domain or URL patterns.
    /// </summary>
    public override void Init()
    {
        License.LicenseKey = "LicenseKey";
        this.LoggingLevel = WebScraper.LogLevel.All;
        this.WorkingDirectory = AppSetting.GetAppRoot() + @"\ShoppingSiteSample\Output\";
        // The start URL is left empty here; supply the address of the site to scrape.
        this.Request("", Parse);
    }

    /// <summary>
    /// Override this method to create the default Response handler for your web scraper.
    /// If you have multiple page types, you can add additional similar methods.
    /// </summary>
    /// <param name="response">The HTTP Response object to parse</param>
    public override void Parse(Response response)
    {
        var categoryList = new List<Category>();

        // Each top-level menu link becomes one category.
        foreach (var Links in response.Css("#menuFixed > ul > li > a"))
        {
            var cat = new Category();
            cat.URL = Links.Attributes["href"];
            cat.Name = Links.InnerText;
            categoryList.Add(cat);
        }

        Scrape(categoryList, "Shopping.Jsonl");
    }
}
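
The `Scrape` call above saves the category list to `Shopping.Jsonl`. As a rough sketch of what one record could look like, assuming the file holds one JSON object per line (as the extension suggests; the exact output written by the scraping library may differ), the snippet below serializes a `Category` with `System.Text.Json`. `JsonLinesSketch` and `ToJsonLines` are hypothetical names used only for illustration, and the `Category` class is repeated so the snippet compiles on its own.

```csharp
using System.Collections.Generic;
using System.Text.Json;

// Same shape as the Category model used in this article.
public class Category
{
    public string Name { get; set; }
    public string URL { get; set; }
    public List<Category> SubCategories { get; set; }
}

// Hypothetical helper, not part of the scraping library.
public static class JsonLinesSketch
{
    // Serializes each category to one compact JSON string (one record per line).
    public static List<string> ToJsonLines(IEnumerable<Category> categories)
    {
        var lines = new List<string>();
        foreach (var category in categories)
        {
            // JsonSerializer.Serialize produces unindented JSON by default,
            // so each record fits on a single line.
            lines.Add(JsonSerializer.Serialize(category));
        }
        return lines;
    }
}
```

Calling `JsonLinesSketch.ToJsonLines(categoryList)` and writing the result with `File.WriteAllLines` would give a file you can inspect record by record.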

Scraping links from the menu

[Image: ShoppingSiteScrapeMenu]

Let's update our code to scrape the main categories and all of their sub-links:

public override void Parse(Response response)
{
    // List of category links (root)
    var categoryList = new List<Category>();

    foreach (var li in response.Css("#menuFixed > ul > li"))
    {
        // List of main links
        foreach (var Links in li.Css("a"))
        {
            var cat = new Category();
            cat.URL = Links.Attributes["href"];
            cat.Name = Links.InnerText;
            cat.SubCategories = new List<Category>();

            // List of subcategory links
            foreach (var subCategory in li.Css("a[class=subcategory]"))
            {
                var subcat = new Category();
                // Read from the subcategory node, not from the parent link.
                subcat.URL = subCategory.Attributes["href"];
                subcat.Name = subCategory.InnerText;

                // Check whether the link was already added
                if (cat.SubCategories.Find(c => c.Name == subcat.Name && c.URL == subcat.URL) == null)
                {
                    // Add sub-link
                    cat.SubCategories.Add(subcat);
                }
            }

            // Add category
            categoryList.Add(cat);
        }
    }

    Scrape(categoryList, "Shopping.Jsonl");
}

Public Overrides Sub Parse(ByVal response As Response)
	' List of category links (root)
	Dim categoryList = New List(Of Category)()

	For Each li In response.Css("#menuFixed > ul > li")
		' List of main links
		For Each Links In li.Css("a")
			Dim cat = New Category()
			cat.URL = Links.Attributes("href")
			cat.Name = Links.InnerText
			cat.SubCategories = New List(Of Category)()
			' List of subcategory links
			For Each subCategory In li.Css("a[class=subcategory]")
				Dim subcat = New Category()
				' Read from the subcategory node, not from the parent link.
				subcat.URL = subCategory.Attributes("href")
				subcat.Name = subCategory.InnerText
				' Check whether the link was already added
				If cat.SubCategories.Find(Function(c) c.Name = subcat.Name AndAlso c.URL = subcat.URL) Is Nothing Then
					' Add sub-link
					cat.SubCategories.Add(subcat)
				End If
			Next subCategory
			' Add category
			categoryList.Add(cat)
		Next Links
	Next li
	Scrape(categoryList, "Shopping.Jsonl")
End Sub
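
The duplicate check in the inner loop can be factored into a small helper. This is only a minimal sketch, not part of the scraping library: `CategoryListExtensions` and `AddIfMissing` are hypothetical names, and the `Category` class is repeated so the snippet compiles on its own.

```csharp
using System.Collections.Generic;
using System.Linq;

// Same shape as the Category model used in this article.
public class Category
{
    public string Name { get; set; }
    public string URL { get; set; }
    public List<Category> SubCategories { get; set; }
}

// Hypothetical helper for illustration.
public static class CategoryListExtensions
{
    // Adds the candidate only when no entry with the same Name and URL exists.
    // Returns true when the candidate was added, false when it was a duplicate.
    public static bool AddIfMissing(this List<Category> list, Category candidate)
    {
        bool exists = list.Any(c => c.Name == candidate.Name && c.URL == candidate.URL);
        if (!exists)
        {
            list.Add(candidate);
        }
        return !exists;
    }
}
```

With this in place, the body of the inner loop could shrink to `cat.SubCategories.AddIfMissing(subcat);`. For large menus a `HashSet` keyed on name and URL would avoid the linear scan, but for a navigation bar of this size the simple version is likely sufficient.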

Now that we have links to all of the site's categories, let's start scraping the products within each category.

Let's navigate to any category and check its content.

[Image: ProductSubCategoryList]

Let's look at its markup:

<section class="products">
    <div class="sku -gallery -validate-size " data-sku="AG249FA0T2PSGNAFAMZ" ft-product-sizes="41,42,43,44,45" ft-product-color="Multicolour">
        <a class="link" href="">
            <div class="image-wrapper default-state">
                <img class="lazy image -loaded" alt="Bundle Of 2 Sneakers - Black &amp;amp; Navy Blue" data-image-vertical="1" width="210" height="262" src="" data-sku="AG249FA0T2PSGNAFAMZ" data-src="" data-placeholder="placeholder_m_1.jpg"><noscript>&lt;img src="" width="210" height="262" class="image" /&gt;</noscript>
            </div> <h2 class="title">
                <span class="brand ">Agu&nbsp;</span>
                <span class="name" dir="ltr">Bundle Of 2 Sneakers - Black &amp; Navy Blue</span>
            </h2><div class="price-container clearfix">
                <span class="price-box">
                    <span class="price">
                        <span data-currency-iso="EGP">EGP</span>
                        <span dir="ltr" data-price="299">299</span>
                    </span>   <span class="price -old  -no-special"></span>
                </span>
            </div><div class="rating-stars"><div class="stars-container"><div class="stars" style="width: 62%"></div></div> <div class="total-ratings">(30)</div> </div>    <span class="shop-first-logo-container"><img src="" data-src="" class="lazy shop-first-logo-img -mbxs -loaded"> </span>
            <span class="osh-icon -ShoppingSite-local shop_local--logo -block -mbs -mts"></span>
            <div class="list -sizes" data-selected-sku="">
                <span class="js-link sku-size" data-href="">41</span>     <span class="js-link sku-size" data-href="">42</span>
                <span class="js-link sku-size" data-href="">43</span>     <span class="js-link sku-size" data-href="">44</span>
                <span class="js-link sku-size" data-href="">45</span>
            </div>
        </a>
    </div>
    <div class="sku -gallery -validate-size " data-sku="LE047FA01SRK4NAFAMZ" ft-product-sizes="110,115,120,125,130,135" ft-product-color="Black">
        <a class="link" href="">
            <div class="image-wrapper default-state"><img class="lazy image -loaded" alt="Genuine Leather Belt - Black" data-image-vertical="1" width="210" height="262" src="" data-sku="LE047FA01SRK4NAFAMZ" data-src="" data-placeholder="placeholder_m_1.jpg"><noscript>&lt;img src="" width="210" height="262" class="image" /&gt;</noscript></div>
            <h2 class="title"><span class="brand ">Leather Shop&nbsp;</span> <span class="name" dir="ltr">Genuine Leather Belt - Black</span></h2><div class="price-container clearfix">
                <span class="sale-flag-percent">-29%</span>  <span class="price-box"> <span class="price"><span data-currency-iso="EGP">EGP</span> <span dir="ltr" data-price="96">96</span> </span>   <span class="price -old "><span data-currency-iso="EGP">EGP</span> <span dir="ltr" data-price="135">135</span> </span> </span>
            </div><div class="rating-stars"><div class="stars-container"><div class="stars" style="width: 100%"></div></div> <div class="total-ratings">(1)</div> </div>
            <span class="osh-icon -ShoppingSite-local shop_local--logo -block -mbs -mts"></span>    <div class="list -sizes" data-selected-sku="">
                <span class="js-link sku-size" data-href="">110</span>     <span class="js-link sku-size" data-href="">115</span>
                <span class="js-link sku-size" data-href="">120</span>     <span class="js-link sku-size" data-href="">125</span>     <span class="js-link sku-size" data-href="">130</span>
                <span class="js-link sku-size" data-href="">135</span>
            </div>
        </a>
    </div>
</section>

Let's build our product model for this content.

public class Product
{
    /// <summary>
    /// Gets or sets the name.
    /// </summary>
    /// <value>
    /// The name.
    /// </value>
    public string Name { get; set; }

    /// <summary>
    /// Gets or sets the price.
    /// </summary>
    /// <value>
    /// The price.
    /// </value>
    public string Price { get; set; }

    /// <summary>
    /// Gets or sets the image.
    /// </summary>
    /// <value>
    /// The image.
    /// </value>
    public string Image { get; set; }
}

Public Class Product
	''' <summary>
	''' Gets or sets the name.
	''' </summary>
	''' <value>
	''' The name.
	''' </value>
	Public Property Name() As String
	''' <summary>
	''' Gets or sets the price.
	''' </summary>
	''' <value>
	''' The price.
	''' </value>
	Public Property Price() As String
	''' <summary>
	''' Gets or sets the image.
	''' </summary>
	''' <value>
	''' The image.
	''' </value>
	Public Property Image() As String
End Class

To scrape category pages, we add a new scraping method:

public void ParseCategory(Response response)
{
    // List of product links (root)
    var productList = new List<Product>();

    foreach (var Links in response.Css("body > main > section.osh-content > section.products > div > a"))
    {
        var product = new Product();
        product.Name = Links.InnerText;
        product.Image = Links.Css("div.image-wrapper.default-state > img")[0].Attributes["src"];
        // Collect the product so it is included in the output file.
        productList.Add(product);
    }

    Scrape(productList, "Products.Jsonl");
}

Public Sub ParseCategory(ByVal response As Response)
	' List of product links (root)
	Dim productList = New List(Of Product)()

	For Each Links In response.Css("body > main > section.osh-content > section.products > div > a")
		Dim product As New Product()
		product.Name = Links.InnerText
		product.Image = Links.Css("div.image-wrapper.default-state > img")(0).Attributes("src")
		' Collect the product so it is included in the output file.
		productList.Add(product)
	Next Links

	Scrape(productList, "Products.Jsonl")
End Sub
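
The product markup shown earlier carries the price in a `data-price` attribute (for example `data-price="299"`), while the `Product` model stores `Price` as a string. If a numeric value is wanted later, a conversion along these lines may help. This is a sketch under assumptions: `PriceParser` and `TryParsePrice` are hypothetical names, and it assumes the attribute uses a dot as the decimal separator.

```csharp
using System.Globalization;

// Hypothetical helper for illustration; not part of the scraping library.
public static class PriceParser
{
    // Tries to convert a raw data-price value such as "299" into a decimal.
    // Returns false for null, empty, or non-numeric input.
    public static bool TryParsePrice(string raw, out decimal price)
    {
        return decimal.TryParse(
            (raw ?? string.Empty).Trim(),
            NumberStyles.Number,
            CultureInfo.InvariantCulture, // parse independently of the machine's locale
            out price);
    }
}
```

Using the invariant culture keeps the result the same on machines whose regional settings use a comma as the decimal separator.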
Chaknith Bin

Software Engineer

Chaknith works on IronXL and IronBarcode. He has deep expertise in C# and .NET, helping to improve the software and support customers. His insight into user interactions contributes to better products, documentation, and overall experience.