Web Scraping a Shopping Site in C#
Learn how to extract product categories and items from shopping sites using web scraping in C# with the WebScraper framework, mapping data from HTML elements into custom models. This guide walks you through building an e-commerce web scraper with the IronWebScraper library.
Quick Start: Extract Data from Shopping Sites in C#
Install IronWebScraper with the NuGet Package Manager:
PM > Install-Package IronWebScraper
Copy and run this code snippet:
using System.Linq;
using IronWebScraper;
// Run the scraper
var scraper = new QuickShoppingScraper();
scraper.Start();
public class QuickShoppingScraper : WebScraper
{
    public override void Init()
    {
        // Apply your license key
        License.LicenseKey = "YOUR-LICENSE-KEY";
        // Set the starting URL
        this.Request("https://shopping-site.com", Parse);
    }
    public override void Parse(Response response)
    {
        // Extract product data
        foreach (var product in response.Css(".product-item"))
        {
            var item = new
            {
                Name = product.Css(".product-name").First().InnerText,
                Price = product.Css(".price").First().InnerText,
                Image = product.Css("img").First().Attributes["src"]
            };
            Scrape(item, "products.jsonl");
        }
    }
}
Deploy to test in your production environment.
- Create a new console application project named "ShoppingSiteSample".
- Add a class named "ShoppingScraper" that inherits from WebScraper.
- Create the data models Category and Product.
- Override Init() to set the start URL and Parse() to do the scraping.
- Run the scraper to extract categories and products into JSONL files.
How to Analyze the HTML Structure of a Shopping Site for Web Scraping
Pick a shopping site and analyze how its content is structured. Understanding the HTML structure is crucial for extracting data successfully. Before writing any code, spend some time inspecting the target site with your browser's developer tools.
As shown in the image, the left sidebar contains links to the site's product categories. The first step is to inspect the site's HTML and plan the extraction approach.
Why does understanding the HTML structure matter?
The site's fashion categories have subcategories (Men, Women, Kids). Understanding this hierarchy helps you design suitable data models and extraction logic. When working with advanced web scraping features, proper HTML analysis becomes even more important.
<li class="menu-item" data-id="">
<a href="https://domain.com/fashion-by-/" class="main-category">
<i class="cat-icon osh-font-fashion"></i>
<span class="nav-subTxt">FASHION </span>
<i class="osh-font-light-arrow-left"></i><i class="osh-font-light-arrow-right"></i>
</a>
<div class="navLayerWrapper" style="width: 633px; display: none;">
<div class="submenu">
<div class="column">
<div class="categories">
<a class="category" href="https://domain.com/fashion-by-/?sort=newest&dir=desc&viewType=gridView3">New Arrivals !</a>
</div>
<div class="categories">
<a class="category" href="https://domain.com/men-fashion/">Men</a>
<a class="subcategory" href="https://domain.com/mens-shoes/">Shoes</a>
<a class="subcategory" href="https://domain.com/mens-clothing/">Clothing</a>
<a class="subcategory" href="https://domain.com/mens-accessories/">Accessories</a>
</div>
<div class="categories">
<a class="category" href="https://domain.com/women-fashion/">Women</a>
<a class="subcategory" href="https://domain.com/womens-shoes/">Shoes</a>
<a class="subcategory" href="https://domain.com/womens-clothing/">Clothing</a>
<a class="subcategory" href="https://domain.com/womens-accessories/">Accessories</a>
</div>
<div class="categories">
<a class="category" href="https://domain.com/girls-boys-fashion/">Kids</a>
<a class="subcategory" href="https://domain.com/boys-fashion/">Boys</a>
<a class="subcategory" href="https://domain.com/girls/">Girls</a>
</div>
<div class="categories">
<a class="category" href="https://domain.com/maternity-clothes/">Maternity Clothes</a>
</div>
</div>
<div class="column">
<div class="categories">
<span class="category defaultCursor">Men Best Sellers</span>
<a class="subcategory" href="https://domain.com/mens-casual-shoes/">Casual Shoes</a>
<a class="subcategory" href="https://domain.com/mens-sneakers/">Sneakers</a>
<a class="subcategory" href="https://domain.com/mens-t-shirts/">T-shirts</a>
<a class="subcategory" href="https://domain.com/mens-polos/">Polos</a>
</div>
<div class="categories">
<span class="category defaultCursor">Women Best Sellers</span>
<a class="subcategory" href="https://domain.com/womens-sandals/">Sandals</a>
<a class="subcategory" href="https://domain.com/womens-sneakers/">Sneakers</a>
<a class="subcategory" href="https://domain.com/women-dresses/">Dresses</a>
<a class="subcategory" href="https://domain.com/women-tops/">Tops</a>
</div>
<div class="categories">
<a class="category" href="https://domain.com/womens-curvy-clothing/">Women's Curvy Clothing</a>
</div>
<div class="categories">
<a class="category" href="https://domain.com/fashion-bundles/v/">Fashion Bundles</a>
</div>
<div class="categories">
<a class="category" href="https://domain.com/hijab-fashion/">Hijab Fashion</a>
</div>
</div>
<div class="column">
<div class="categories">
<a class="category" href="https://domain.com/brands/fashion-by-/">SEE ALL BRANDS</a>
<a class="subcategory" href="https://domain.com/adidas/">Adidas</a>
<a class="subcategory" href="https://domain.com/converse/">Converse</a>
<a class="subcategory" href="https://domain.com/ravin/">Ravin</a>
<a class="subcategory" href="https://domain.com/dejavu/">Dejavu</a>
<a class="subcategory" href="https://domain.com/agu/">Agu</a>
<a class="subcategory" href="https://domain.com/activ/">Activ</a>
<a class="subcategory" href="https://domain.com/oxford--bellini--tie-house--milano/">Tie House</a>
<a class="subcategory" href="https://domain.com/shoe-room/">Shoe Room</a>
<a class="subcategory" href="https://domain.com/town-team/">Town Team</a>
</div>
</div>
</div>
</div>
</li>
How to Set Up a Web Scraper Project in C#
Set up a project following best practices for web scraping in C#.
- Create a new console application, or add a new folder for the sample, named "ShoppingSiteSample".
- Add a new class named "ShoppingScraper".
- Start by extracting the site's categories and their subcategories.
- Install IronWebScraper via the NuGet Package Manager or the Package Manager Console:
Install-Package IronWebScraper
Which data model should I use for the categories?
Create a Category model that reflects the hierarchical structure found in the menu:
public class Category
{
/// <summary>
/// Gets or sets the name.
/// </summary>
/// <value>
/// The name.
/// </value>
public string Name { get; set; }
/// <summary>
/// Gets or sets the URL.
/// </summary>
/// <value>
/// The URL.
/// </value>
public string URL { get; set; }
/// <summary>
/// Gets or sets the subcategories.
/// </summary>
/// <value>
/// The subcategories.
/// </value>
public List<Category> SubCategories { get; set; }
// Additional properties for enhanced data collection
public int ProductCount { get; set; }
public DateTime LastScraped { get; set; }
public string CategoryType { get; set; }
}
Public Class Category
''' <summary>
''' Gets or sets the name.
''' </summary>
''' <value>
''' The name.
''' </value>
Public Property Name As String
''' <summary>
''' Gets or sets the URL.
''' </summary>
''' <value>
''' The URL.
''' </value>
Public Property URL As String
''' <summary>
''' Gets or sets the subcategories.
''' </summary>
''' <value>
''' The subcategories.
''' </value>
Public Property SubCategories As List(Of Category)
' Additional properties for enhanced data collection
Public Property ProductCount As Integer
Public Property LastScraped As DateTime
Public Property CategoryType As String
End Class
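To make the shape of this model more concrete, the following minimal sketch builds a small category tree from the menu markup shown earlier and serializes it as one JSON line. It uses System.Text.Json rather than IronWebScraper, so the exact output written by Scrape may differ; the names CategoryJsonlDemo, BuildSample, and ToJsonLine are hypothetical, and the Category class is a trimmed copy included only so the sketch compiles on its own.

```csharp
using System;
using System.Collections.Generic;
using System.Text.Json;

// Trimmed copy of the Category model so this sketch compiles on its own.
public class Category
{
    public string Name { get; set; }
    public string URL { get; set; }
    public List<Category> SubCategories { get; set; } = new List<Category>();
}

// Hypothetical helper class, not part of IronWebScraper.
public static class CategoryJsonlDemo
{
    // Builds a two-level tree using links from the sample menu HTML.
    public static Category BuildSample()
    {
        var men = new Category { Name = "Men", URL = "https://domain.com/men-fashion/" };
        men.SubCategories.Add(new Category { Name = "Shoes", URL = "https://domain.com/mens-shoes/" });
        men.SubCategories.Add(new Category { Name = "Clothing", URL = "https://domain.com/mens-clothing/" });
        return men;
    }

    // Serializes one category (including its subcategories) as a single JSON line.
    public static string ToJsonLine(Category category)
    {
        return JsonSerializer.Serialize(category);
    }
}
```

Calling CategoryJsonlDemo.ToJsonLine(CategoryJsonlDemo.BuildSample()) returns one line of compact JSON in which the subcategories are nested under their parent, which is the idea behind the recursive SubCategories property.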
How do I build the basic scraper logic?
Create the scraper logic, and remember to apply your license key before running it:
public class ShoppingScraper : WebScraper
{
/// <summary>
/// Initialize the web scraper, setting the start URLs and allowed/banned domains or URL patterns.
/// </summary>
public override void Init()
{
// Apply your license key - get one from https://ironsoftware.com/csharp/webscraper/licensing/
License.LicenseKey = "LicenseKey";
this.LoggingLevel = WebScraper.LogLevel.All;
this.WorkingDirectory = AppSetting.GetAppRoot() + @"\ShoppingSiteSample\Output\";
// Queue the start URL; the response is handled by Parse
this.Request("www.webSite.com", Parse);
}
/// <summary>
/// Parses the HTML document of the response to scrape the necessary data.
/// </summary>
/// <param name="response">The HTTP Response object to parse.</param>
public override void Parse(Response response)
{
var categoryList = new List<Category>();
// Iterate through each link in the menu and extract the category data.
foreach (var Links in response.Css("#menuFixed > ul > li > a"))
{
var cat = new Category
{
URL = Links.Attributes["href"],
Name = Links.InnerText,
LastScraped = DateTime.Now
};
categoryList.Add(cat);
}
// Save the scraped data into a JSONL file.
Scrape(categoryList, "Shopping.jsonl");
}
}
Imports System
Imports System.Collections.Generic
Public Class ShoppingScraper
Inherits WebScraper
''' <summary>
''' Initialize the web scraper, setting the start URLs and allowed/banned domains or URL patterns.
''' </summary>
Public Overrides Sub Init()
' Apply your license key - get one from https://ironsoftware.com/csharp/webscraper/licensing/
License.LicenseKey = "LicenseKey"
Me.LoggingLevel = WebScraper.LogLevel.All
Me.WorkingDirectory = AppSetting.GetAppRoot() & "\ShoppingSiteSample\Output\"
' Queue the start URL; the response is handled by Parse
Me.Request("www.webSite.com", AddressOf Parse)
End Sub
''' <summary>
''' Parses the HTML document of the response to scrape the necessary data.
''' </summary>
''' <param name="response">The HTTP Response object to parse.</param>
Public Overrides Sub Parse(response As Response)
Dim categoryList As New List(Of Category)()
' Iterate through each link in the menu and extract the category data.
For Each Links In response.Css("#menuFixed > ul > li > a")
Dim cat As New Category With {
.URL = Links.Attributes("href"),
.Name = Links.InnerText,
.LastScraped = DateTime.Now
}
categoryList.Add(cat)
Next
' Save the scraped data into a JSONL file.
Scrape(categoryList, "Shopping.jsonl")
End Sub
End Class
Which elements am I targeting in the menu?
Extracting the menu links requires precise CSS selectors. The API reference provides detailed information on the available selection methods.
How do I extract data from both the main categories and the subcategories?
Update the code to extract the main categories and all of their sub-links, so that the full navigation structure is captured:
public override void Parse(Response response)
{
// List of Category Links (Root)
var categoryList = new List<Category>();
// Traverse each 'li' under the fixed menu
foreach (var li in response.Css("#menuFixed > ul > li"))
{
// List of Main Links
foreach (var Links in li.Css("a"))
{
var cat = new Category
{
URL = Links.Attributes["href"],
Name = Links.InnerText,
SubCategories = new List<Category>(),
LastScraped = DateTime.Now
};
// List of Subcategories Links
foreach (var subCategory in li.Css("a[class=subcategory]"))
{
var subcat = new Category
{
URL = subCategory.Attributes["href"],
Name = subCategory.InnerText,
CategoryType = "Subcategory"
};
// Check if subcategory link already exists
if (cat.SubCategories.Find(c => c.Name == subcat.Name && c.URL == subcat.URL) == null)
{
// Add sublinks
cat.SubCategories.Add(subcat);
}
}
// Note: this stores the number of subcategories, not an actual product count
cat.ProductCount = cat.SubCategories.Count;
// Add Main Category to the list
categoryList.Add(cat);
}
}
// Save the scraped data into a JSONL file.
Scrape(categoryList, "Shopping.jsonl");
}
Option Strict On
Public Overrides Sub Parse(response As Response)
' List of Category Links (Root)
Dim categoryList As New List(Of Category)()
' Traverse each 'li' under the fixed menu
For Each li In response.Css("#menuFixed > ul > li")
' List of Main Links
For Each Links In li.Css("a")
Dim cat As New Category With {
.URL = Links.Attributes("href"),
.Name = Links.InnerText,
.SubCategories = New List(Of Category)(),
.LastScraped = DateTime.Now
}
' List of Subcategories Links
For Each subCategory In li.Css("a[class=subcategory]")
Dim subcat As New Category With {
.URL = subCategory.Attributes("href"),
.Name = subCategory.InnerText,
.CategoryType = "Subcategory"
}
' Check if subcategory link already exists
If cat.SubCategories.Find(Function(c) c.Name = subcat.Name AndAlso c.URL = subcat.URL) Is Nothing Then
' Add sublinks
cat.SubCategories.Add(subcat)
End If
Next
' Note: this stores the number of subcategories, not an actual product count
cat.ProductCount = cat.SubCategories.Count
' Add Main Category to the list
categoryList.Add(cat)
Next
Next
' Save the scraped data into a JSONL file.
Scrape(categoryList, "Shopping.jsonl")
End Sub
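The duplicate check in the code above scans the subcategory list with Find for every link. As an alternative sketch, not taken from the original tutorial, a HashSet can remember which (name, URL) pairs have already been seen while the list keeps the original order. The class name SubcategoryDeduper and the tuple shape are illustrative assumptions.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper, independent of IronWebScraper.
public static class SubcategoryDeduper
{
    // Returns the links in their original order, dropping repeated (name, url) pairs.
    public static List<(string Name, string Url)> Distinct(IEnumerable<(string Name, string Url)> links)
    {
        var seen = new HashSet<(string, string)>();
        var result = new List<(string Name, string Url)>();
        foreach (var link in links)
        {
            // HashSet.Add returns false when the pair was already present.
            if (seen.Add((link.Name, link.Url)))
            {
                result.Add(link);
            }
        }
        return result;
    }
}
```

For the short menus shown here the difference is negligible; the set-based check mainly pays off when a menu holds many links, because each lookup no longer walks the whole list.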
How do I extract product information from the category pages?
With links available for every category on the site, start extracting the product data in each category. When processing product pages, thread safety becomes important for good performance. Navigate to any category and examine its content.
What does the product HTML structure look like?
Examine the HTML structure to understand how products are organized:
<section class="products">
<div class="sku -gallery -validate-size " data-sku="AG249FA0T2PSGNAFAMZ" ft-product-sizes="41,42,43,44,45" ft-product-color="Multicolour">
<a class="link" href="http://www.WebSite.com/agu-bundle-of-2-sneakers-black-navy-blue-653884.html">
<div class="image-wrapper default-state">
<img class="lazy image -loaded" alt="Bundle Of 2 Sneakers - Black & Navy Blue" data-image-vertical="1" width="210" height="262" src="https://static.WebSite.com/p/agu-6208-488356-1-catalog_grid_3.jpg" data-sku="AG249FA0T2PSGNAFAMZ" data-src="https://static.WebSite.com/p/agu-6208-488356-1-catalog_grid_3.jpg" data-placeholder="placeholder_m_1.jpg">
<noscript><img src="https://static.WebSite.com/p/agu-6208-488356-1-catalog_grid_3.jpg" width="210" height="262" class="image" /></noscript>
</div>
<h2 class="title">
<span class="brand ">Agu </span>
<span class="name" dir="ltr">Bundle Of 2 Sneakers - Black & Navy Blue</span>
</h2>
<div class="price-container clearfix">
<span class="price-box">
<span class="price">
<span data-currency-iso="EGP">EGP</span>
<span dir="ltr" data-price="299">299</span>
</span>
<span class="price -old -no-special"></span>
</span>
</div>
<div class="rating-stars">
<div class="stars-container">
<div class="stars" style="width: 62%"></div>
</div>
<div class="total-ratings">(30)</div>
</div>
<span class="shop-first-logo-container">
<img src="http://www.WebSite.com/images/local/logos/shop_first/ShoppingSite/logo_normal.png" data-src="http://www.WebSite.com/images/local/logos/shop_first/ShoppingSite/logo_normal.png" class="lazy shop-first-logo-img -mbxs -loaded">
</span>
<span class="osh-icon -ShoppingSite-local shop_local--logo -block -mbs -mts"></span>
<div class="list -sizes" data-selected-sku="">
<span class="js-link sku-size" data-href="http://www.WebSite.com/agu-bundle-of-2-sneakers-black-navy-blue-653884.html?size=41">41</span>
<span class="js-link sku-size" data-href="http://www.WebSite.com/agu-bundle-of-2-sneakers-black-navy-blue-653884.html?size=42">42</span>
<span class="js-link sku-size" data-href="http://www.WebSite.com/agu-bundle-of-2-sneakers-black-navy-blue-653884.html?size=43">43</span>
<span class="js-link sku-size" data-href="http://www.WebSite.com/agu-bundle-of-2-sneakers-black-navy-blue-653884.html?size=44">44</span>
<span class="js-link sku-size" data-href="http://www.WebSite.com/agu-bundle-of-2-sneakers-black-navy-blue-653884.html?size=45">45</span>
</div>
</a>
</div>
<div class="sku -gallery -validate-size " data-sku="LE047FA01SRK4NAFAMZ" ft-product-sizes="110,115,120,125,130,135" ft-product-color="Black">
<a class="link" href="http://www.WebSite.com/leather-shop-genuine-leather-belt-black-712030.html">
<div class="image-wrapper default-state">
<img class="lazy image -loaded" alt="Genuine Leather Belt - Black" data-image-vertical="1" width="210" height="262" src="https://static.WebSite.com/p/leather-shop-1831-030217-1-catalog_grid_3.jpg" data-sku="LE047FA01SRK4NAFAMZ" data-src="https://static.WebSite.com/p/leather-shop-1831-030217-1-catalog_grid_3.jpg" data-placeholder="placeholder_m_1.jpg">
<noscript><img src="https://static.WebSite.com/p/leather-shop-1831-030217-1-catalog_grid_3.jpg" width="210" height="262" class="image" /></noscript>
</div>
<h2 class="title"><span class="brand ">Leather Shop </span> <span class="name" dir="ltr">Genuine Leather Belt - Black</span></h2>
<div class="price-container clearfix">
<span class="sale-flag-percent">-29%</span>
<span class="price-box">
<span class="price"><span data-currency-iso="EGP">EGP</span> <span dir="ltr" data-price="96">96</span> </span>
<span class="price -old"><span data-currency-iso="EGP">EGP</span> <span dir="ltr" data-price="135">135</span> </span>
</span>
</div>
<div class="rating-stars">
<div class="stars-container">
<div class="stars" style="width: 100%"></div>
</div>
<div class="total-ratings">(1)</div>
</div>
<span class="osh-icon -ShoppingSite-local shop_local--logo -block -mbs -mts"></span>
<div class="list -sizes" data-selected-sku="">
<span class="js-link sku-size" data-href="http://www.WebSite.com/leather-shop-genuine-leather-belt-black-712030.html?size=110">110</span>
<span class="js-link sku-size" data-href="http://www.WebSite.com/leather-shop-genuine-leather-belt-black-712030.html?size=115">115</span>
<span class="js-link sku-size" data-href="http://www.WebSite.com/leather-shop-genuine-leather-belt-black-712030.html?size=120">120</span>
<span class="js-link sku-size" data-href="http://www.WebSite.com/leather-shop-genuine-leather-belt-black-712030.html?size=125">125</span>
<span class="js-link sku-size" data-href="http://www.WebSite.com/leather-shop-genuine-leather-belt-black-712030.html?size=130">130</span>
<span class="js-link sku-size" data-href="http://www.WebSite.com/leather-shop-genuine-leather-belt-black-712030.html?size=135">135</span>
</div>
</a>
</div>
</section>
Which product model should I create?
Build a Product model for this content. When extracting data from shopping sites, capture all the relevant product details:
public class Product
{
/// <summary>
/// Gets or sets the name.
/// </summary>
/// <value>
/// The name.
/// </value>
public string Name { get; set; }
/// <summary>
/// Gets or sets the price.
/// </summary>
/// <value>
/// The price.
/// </value>
public string Price { get; set; }
/// <summary>
/// Gets or sets the image.
/// </summary>
/// <value>
/// The image.
/// </value>
public string Image { get; set; }
// Additional properties for comprehensive data collection
public string Brand { get; set; }
public string OldPrice { get; set; }
public string Discount { get; set; }
public float Rating { get; set; }
public int ReviewCount { get; set; }
public List<string> AvailableSizes { get; set; }
public string ProductUrl { get; set; }
public string SKU { get; set; }
public DateTime ScrapedDate { get; set; }
}
Public Class Product
''' <summary>
''' Gets or sets the name.
''' </summary>
''' <value>
''' The name.
''' </value>
Public Property Name As String
''' <summary>
''' Gets or sets the price.
''' </summary>
''' <value>
''' The price.
''' </value>
Public Property Price As String
''' <summary>
''' Gets or sets the image.
''' </summary>
''' <value>
''' The image.
''' </value>
Public Property Image As String
' Additional properties for comprehensive data collection
Public Property Brand As String
Public Property OldPrice As String
Public Property Discount As String
Public Property Rating As Single
Public Property ReviewCount As Integer
Public Property AvailableSizes As List(Of String)
Public Property ProductUrl As String
Public Property SKU As String
Public Property ScrapedDate As DateTime
End Class
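The model keeps Price and OldPrice as strings, exactly as they appear on the page. If you later want to sort or compare prices, something like the following could normalize them. This is only a minimal sketch: PriceParser and ParsePrice are hypothetical names, not part of IronWebScraper, and the code assumes prices use "." as the decimal separator and "," as the thousands separator.

```csharp
using System;
using System.Globalization;
using System.Linq;

// Hypothetical helper, not part of IronWebScraper.
public static class PriceParser
{
    // Converts a scraped price such as "$1,299.50" into a decimal.
    // Assumption: "." is the decimal separator and "," separates thousands.
    // Returns null when no numeric value can be recovered.
    public static decimal? ParsePrice(string raw)
    {
        if (string.IsNullOrWhiteSpace(raw))
        {
            return null;
        }

        // Keep ASCII digits and dots; drop currency symbols, commas and spaces.
        var cleaned = new string(raw.Where(c => (c >= '0' && c <= '9') || c == '.').ToArray());

        // Remove stray dots left over from abbreviations such as "Rs.".
        cleaned = cleaned.Trim('.');

        return decimal.TryParse(
            cleaned,
            NumberStyles.AllowDecimalPoint,
            CultureInfo.InvariantCulture,
            out var value)
            ? value
            : (decimal?)null;
    }
}
```

Keeping the raw string in the model and parsing it in a separate step means a site with an unexpected price format does not break the scrape itself; the value simply comes back as null and can be reviewed later.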
How do I add product data extraction?
To extract data from the category pages, add a new scraping method with error handling and data validation:
public void ParseCategory(Response response)
{
// List of Products
var productList = new List<Product>();
// Iterate through product links in the product section
foreach (var Links in response.Css("section.products > div > a"))
{
try
{
var product = new Product
{
Name = Links.Css("h2.title > span.name").First().InnerText,
Brand = Links.Css("h2.title > span.brand").FirstOrDefault()?.InnerText ?? "Unknown",
Price = Links.Css("div.price-container > span.price-box > span.price > span[data-price]").First().InnerText,
Image = Links.Css("div.image-wrapper.default-state > img").First().Attributes["src"],
ProductUrl = Links.Attributes["href"],
SKU = Links.ParentNode.Attributes["data-sku"],
ScrapedDate = DateTime.Now
};
// Extract old price if available
var oldPriceElement = Links.Css("span.price.-old > span[data-price]").FirstOrDefault();
if (oldPriceElement != null)
{
product.OldPrice = oldPriceElement.InnerText;
}
// Extract discount percentage
var discountElement = Links.Css("span.sale-flag-percent").FirstOrDefault();
if (discountElement != null)
{
product.Discount = discountElement.InnerText;
}
// Extract rating information
var ratingWidth = Links.Css("div.stars").FirstOrDefault()?.Attributes["style"];
if (!string.IsNullOrEmpty(ratingWidth))
{
var width = System.Text.RegularExpressions.Regex.Match(ratingWidth, @"(\d+)%").Groups[1].Value;
if (int.TryParse(width, out int ratingPercent))
{
product.Rating = ratingPercent / 20.0f; // Convert percentage to 5-star scale
}
}
// Extract review count
var reviewText = Links.Css("div.total-ratings").FirstOrDefault()?.InnerText;
if (!string.IsNullOrEmpty(reviewText))
{
var reviewCount = System.Text.RegularExpressions.Regex.Match(reviewText, @"\d+").Value;
if (int.TryParse(reviewCount, out int count))
{
product.ReviewCount = count;
}
}
// Extract available sizes
product.AvailableSizes = Links.Css("div.list.-sizes > span.sku-size")
.Select(s => s.InnerText)
.ToList();
productList.Add(product);
}
catch (Exception ex)
{
// Log error and continue with next product
Console.WriteLine($"Error parsing product: {ex.Message}");
}
}
// Save the scraped product data into a JSONL file.
Scrape(productList, "Products.jsonl");
// Handle pagination if needed
var nextPageLink = response.Css("a.pagination-next").FirstOrDefault();
if (nextPageLink != null)
{
var nextPageUrl = nextPageLink.Attributes["href"];
this.Request(nextPageUrl, ParseCategory);
}
}
Public Sub ParseCategory(response As Response)
' List of Products
Dim productList As New List(Of Product)()
' Iterate through product links in the product section
For Each Links In response.Css("section.products > div > a")
Try
Dim product As New Product With {
.Name = Links.Css("h2.title > span.name").First().InnerText,
.Brand = If(Links.Css("h2.title > span.brand").FirstOrDefault()?.InnerText, "Unknown"),
.Price = Links.Css("div.price-container > span.price-box > span.price > span[data-price]").First().InnerText,
.Image = Links.Css("div.image-wrapper.default-state > img").First().Attributes("src"),
.ProductUrl = Links.Attributes("href"),
.SKU = Links.ParentNode.Attributes("data-sku"),
.ScrapedDate = DateTime.Now
}
' Extract old price if available
Dim oldPriceElement = Links.Css("span.price.-old > span[data-price]").FirstOrDefault()
If oldPriceElement IsNot Nothing Then
product.OldPrice = oldPriceElement.InnerText
End If
' Extract discount percentage
Dim discountElement = Links.Css("span.sale-flag-percent").FirstOrDefault()
If discountElement IsNot Nothing Then
product.Discount = discountElement.InnerText
End If
' Extract rating information
Dim ratingWidth = Links.Css("div.stars").FirstOrDefault()?.Attributes("style")
If Not String.IsNullOrEmpty(ratingWidth) Then
Dim width = System.Text.RegularExpressions.Regex.Match(ratingWidth, "(\d+)%").Groups(1).Value
Dim ratingPercent As Integer
If Integer.TryParse(width, ratingPercent) Then
product.Rating = ratingPercent / 20.0F ' Convert percentage to 5-star scale
End If
End If
' Extract review count
Dim reviewText = Links.Css("div.total-ratings").FirstOrDefault()?.InnerText
If Not String.IsNullOrEmpty(reviewText) Then
Dim reviewCount = System.Text.RegularExpressions.Regex.Match(reviewText, "\d+").Value
Dim count As Integer
If Integer.TryParse(reviewCount, count) Then
product.ReviewCount = count
End If
End If
' Extract available sizes
product.AvailableSizes = Links.Css("div.list.-sizes > span.sku-size") _
.Select(Function(s) s.InnerText) _
.ToList()
productList.Add(product)
Catch ex As Exception
' Log error and continue with next product
Console.WriteLine($"Error parsing product: {ex.Message}")
End Try
Next
' Save the scraped product data into a JSONL file.
Scrape(productList, "Products.jsonl")
' Handle pagination if needed
Dim nextPageLink = response.Css("a.pagination-next").FirstOrDefault()
If nextPageLink IsNot Nothing Then
Dim nextPageUrl = nextPageLink.Attributes("href")
Me.Request(nextPageUrl, AddressOf ParseCategory)
End If
End Sub
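The rating step in ParseCategory reads the width percentage from the star element's style attribute and divides it by 20 to get a value on a 5-star scale. That conversion can be pulled out into a small function that is easy to test without hitting a live site. The sketch below is illustrative only: RatingParser and FromStyle are hypothetical names, and it assumes the site encodes the rating as a "width: NN%" declaration, as the selectors above suggest.

```csharp
using System.Globalization;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of IronWebScraper.
public static class RatingParser
{
    // Matches a CSS width percentage such as "width: 80%" or "width:87.5%".
    private static readonly Regex WidthPercent =
        new Regex(@"width:\s*(\d+(?:\.\d+)?)%", RegexOptions.IgnoreCase);

    // Converts a style attribute into a rating on a 5-star scale.
    // Returns null when the style has no width percentage.
    public static float? FromStyle(string style)
    {
        if (string.IsNullOrEmpty(style))
        {
            return null;
        }

        var match = WidthPercent.Match(style);
        if (!match.Success)
        {
            return null;
        }

        var percent = float.Parse(match.Groups[1].Value, CultureInfo.InvariantCulture);

        // 100% corresponds to 5 stars, so each star is worth 20%.
        return percent / 20f;
    }
}
```

Unlike the inline pattern in ParseCategory, this version anchors on "width:" so that another percentage in the same style attribute is not picked up by mistake, and it accepts fractional widths.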
This approach captures each product's name, brand, prices, image, rating, review count, and sizes, and a failure on one product is logged without stopping the rest of the page. For more advanced scenarios, explore the advanced web scraping features available in IronWebScraper.
Frequently asked questions
How can I extract product data from shopping sites in C#?
IronWebScraper lets you extract product data from shopping sites using CSS selectors. Create a class that inherits from WebScraper, override the Parse method, and use response.Css() to select specific HTML elements such as product names, prices, and images. The extracted data can be saved in several formats, including JSON and JSONL files.
What are the basic steps to build a shopping site scraper?
To build a shopping site scraper with IronWebScraper: 1) create a Console Application project, 2) add a class that inherits from WebScraper, 3) create data models for categories and products, 4) override the Init() method to set the starting URL, 5) override the Parse() method to extract data using CSS selectors, and 6) run the scraper to save the data in the format you want.
How can I handle hierarchical category structures when scraping e-commerce sites?
With IronWebScraper you can handle hierarchical structures by creating data models that reflect the parent-child relationships (such as Fashion > Men > Shoes). Navigate the nested HTML elements with CSS selectors and build the category tree programmatically, which is especially useful when working with IronWebScraper's advanced features.
What is the best way to analyze a shopping site's HTML structure before scraping?
Before scraping a shopping site with IronWebScraper, use the browser's developer tools to inspect the HTML structure. Look for consistent patterns in the CSS classes and element hierarchies. This analysis helps you identify the right CSS selectors to use in the `Parse()` method, so you can target product information, categories, and other data elements precisely.
Can I extract both the product list and the category navigation from the same page?
Yes, IronWebScraper lets you extract several types of data from a single page. In your Parse() method, use different CSS selectors to target category links (such as '.category-item') and product listings (such as '.product-item') in the same pass, then save them to separate output files or data structures.
How do I save the extracted product data to a file?
IronWebScraper provides a built-in Scrape() method that saves the extracted data automatically. Pass the data object and the file name, as in Scrape(item, "products.jsonl"). The library supports several output formats, including JSON, JSONL, and CSV, which makes it easy to export your e-commerce data for further processing.





