Web Scraping an E-commerce Site in C#
Learn how to extract product categories and items from online shopping sites using C# and the WebScraper framework, mapping data from HTML elements into custom models. This guide walks through building an e-commerce web scraper with the IronWebScraper library.
Quick start: Extract data from an online shopping site in C#
- Install IronWebScraper with the NuGet Package Manager:
PM > Install-Package IronWebScraper
- Copy and run this code snippet:
using System.Linq;
using IronWebScraper;

public class QuickShoppingScraper : WebScraper
{
    public override void Init()
    {
        // Apply your license key
        License.LicenseKey = "YOUR-LICENSE-KEY";
        // Set the starting URL
        this.Request("https://shopping-site.com", Parse);
    }

    public override void Parse(Response response)
    {
        // Extract product data
        foreach (var product in response.Css(".product-item"))
        {
            var item = new
            {
                Name = product.Css(".product-name").First().InnerText,
                Price = product.Css(".price").First().InnerText,
                Image = product.Css("img").First().Attributes["src"]
            };
            Scrape(item, "products.jsonl");
        }
    }
}

// Run the scraper (place these two lines in Main, or before the class when using top-level statements)
var scraper = new QuickShoppingScraper();
scraper.Start();
- Deploy to test in your production environment.
Start using IronWebScraper in your project today with a free trial.
- Create a new Console App project named "ShoppingSiteSample"
- Add a class named "ShoppingScraper" that inherits from WebScraper
- Create models for the data: Category and Product
- Override Init() to set the start URL and Parse() to handle the scraping
- Run the scraper to extract categories and products into JSONL files
How do I analyze the HTML structure of a shopping site?
Pick a shopping site and analyze how its content is structured. Understanding the HTML structure is essential for successful web scraping. Before writing any code, inspect the target site with your browser's developer tools.
As the screenshot shows, the left sidebar contains links to the site's product categories. The first step is to study the site's HTML and plan the scraping approach.
Why does understanding the HTML structure matter?
The site's fashion category has subcategories (Men, Women, Kids). Understanding this hierarchy lets you design suitable data models and scraping logic. When you work with advanced web scraping features, correct HTML analysis becomes even more important. The menu markup looks like this:
<li class="menu-item" data-id="">
<a href="https://domain.com/fashion-by-/" class="main-category">
<i class="cat-icon osh-font-fashion"></i>
<span class="nav-subTxt">FASHION </span>
<i class="osh-font-light-arrow-left"></i><i class="osh-font-light-arrow-right"></i>
</a>
<div class="navLayerWrapper" style="width: 633px; display: none;">
<div class="submenu">
<div class="column">
<div class="categories">
<a class="category" href="https://domain.com/fashion-by-/?sort=newest&dir=desc&viewType=gridView3">New Arrivals !</a>
</div>
<div class="categories">
<a class="category" href="https://domain.com/men-fashion/">Men</a>
<a class="subcategory" href="https://domain.com/mens-shoes/">Shoes</a>
<a class="subcategory" href="https://domain.com/mens-clothing/">Clothing</a>
<a class="subcategory" href="https://domain.com/mens-accessories/">Accessories</a>
</div>
<div class="categories">
<a class="category" href="https://domain.com/women-fashion/">Women</a>
<a class="subcategory" href="https://domain.com/womens-shoes/">Shoes</a>
<a class="subcategory" href="https://domain.com/womens-clothing/">Clothing</a>
<a class="subcategory" href="https://domain.com/womens-accessories/">Accessories</a>
</div>
<div class="categories">
<a class="category" href="https://domain.com/girls-boys-fashion/">Kids</a>
<a class="subcategory" href="https://domain.com/boys-fashion/">Boys</a>
<a class="subcategory" href="https://domain.com/girls/">Girls</a>
</div>
<div class="categories">
<a class="category" href="https://domain.com/maternity-clothes/">Maternity Clothes</a>
</div>
</div>
<div class="column">
<div class="categories">
<span class="category defaultCursor">Men Best Sellers</span>
<a class="subcategory" href="https://domain.com/mens-casual-shoes/">Casual Shoes</a>
<a class="subcategory" href="https://domain.com/mens-sneakers/">Sneakers</a>
<a class="subcategory" href="https://domain.com/mens-t-shirts/">T-shirts</a>
<a class="subcategory" href="https://domain.com/mens-polos/">Polos</a>
</div>
<div class="categories">
<span class="category defaultCursor">Women Best Sellers</span>
<a class="subcategory" href="https://domain.com/womens-sandals/">Sandals</a>
<a class="subcategory" href="https://domain.com/womens-sneakers/">Sneakers</a>
<a class="subcategory" href="https://domain.com/women-dresses/">Dresses</a>
<a class="subcategory" href="https://domain.com/women-tops/">Tops</a>
</div>
<div class="categories">
<a class="category" href="https://domain.com/womens-curvy-clothing/">Women's Curvy Clothing</a>
</div>
<div class="categories">
<a class="category" href="https://domain.com/fashion-bundles/v/">Fashion Bundles</a>
</div>
<div class="categories">
<a class="category" href="https://domain.com/hijab-fashion/">Hijab Fashion</a>
</div>
</div>
<div class="column">
<div class="categories">
<a class="category" href="https://domain.com/brands/fashion-by-/">SEE ALL BRANDS</a>
<a class="subcategory" href="https://domain.com/adidas/">Adidas</a>
<a class="subcategory" href="https://domain.com/converse/">Converse</a>
<a class="subcategory" href="https://domain.com/ravin/">Ravin</a>
<a class="subcategory" href="https://domain.com/dejavu/">Dejavu</a>
<a class="subcategory" href="https://domain.com/agu/">Agu</a>
<a class="subcategory" href="https://domain.com/activ/">Activ</a>
<a class="subcategory" href="https://domain.com/oxford--bellini--tie-house--milano/">Tie House</a>
<a class="subcategory" href="https://domain.com/shoe-room/">Shoe Room</a>
<a class="subcategory" href="https://domain.com/town-team/">Town Team</a>
</div>
</div>
</div>
</div>
</li>
How do I set up the web scraping project?
Set up the project following best practices for C# web scraping.
- Create a new console application, or add a new folder for the sample, named "ShoppingSiteSample"
- Add a new class named "ShoppingScraper"
- Start by scraping the site's categories and their subcategories
- Install IronWebScraper via the NuGet Package Manager or the Package Manager Console:
Install-Package IronWebScraper
Which data model should I use for categories?
Create a category model that mirrors the hierarchical structure found in the menu:
public class Category
{
/// <summary>
/// Gets or sets the name.
/// </summary>
/// <value>
/// The name.
/// </value>
public string Name { get; set; }
/// <summary>
/// Gets or sets the URL.
/// </summary>
/// <value>
/// The URL.
/// </value>
public string URL { get; set; }
/// <summary>
/// Gets or sets the subcategories.
/// </summary>
/// <value>
/// The subcategories.
/// </value>
public List<Category> SubCategories { get; set; }
// Additional properties for enhanced data collection
public int ProductCount { get; set; }
public DateTime LastScraped { get; set; }
public string CategoryType { get; set; }
}
Public Class Category
''' <summary>
''' Gets or sets the name.
''' </summary>
''' <value>
''' The name.
''' </value>
Public Property Name As String
''' <summary>
''' Gets or sets the URL.
''' </summary>
''' <value>
''' The URL.
''' </value>
Public Property URL As String
''' <summary>
''' Gets or sets the subcategories.
''' </summary>
''' <value>
''' The subcategories.
''' </value>
Public Property SubCategories As List(Of Category)
' Additional properties for enhanced data collection
Public Property ProductCount As Integer
Public Property LastScraped As DateTime
Public Property CategoryType As String
End Class
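To see how this model captures the menu hierarchy, here is a minimal sketch that builds the "Men" category from the sample menu by hand and serializes it as one JSON line. The class is a trimmed copy of the model so the sketch compiles on its own, CategoryDemo is a hypothetical helper name, and the use of System.Text.Json is an assumption for illustration; it is not necessarily how the Scrape method writes its output.

```csharp
using System;
using System.Collections.Generic;
using System.Text.Json;

// Trimmed copy of the Category model, kept here so the sketch is self-contained.
public class Category
{
    public string Name { get; set; } = "";
    public string URL { get; set; } = "";
    public List<Category> SubCategories { get; set; } = new List<Category>();
}

// Hypothetical helper, for illustration only.
public static class CategoryDemo
{
    // Builds the "Men" category from the sample menu with its three subcategories.
    public static Category BuildSample()
    {
        return new Category
        {
            Name = "Men",
            URL = "https://domain.com/men-fashion/",
            SubCategories =
            {
                new Category { Name = "Shoes", URL = "https://domain.com/mens-shoes/" },
                new Category { Name = "Clothing", URL = "https://domain.com/mens-clothing/" },
                new Category { Name = "Accessories", URL = "https://domain.com/mens-accessories/" }
            }
        };
    }

    // JSONL stores one JSON object per line, so the object is serialized without indentation.
    public static string ToJsonLine(Category category)
    {
        return JsonSerializer.Serialize(category);
    }
}
```

Calling CategoryDemo.ToJsonLine(CategoryDemo.BuildSample()) returns a single line in which the three subcategories are nested inside the parent object, which is the shape the recursive SubCategories property is meant to hold.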
How do I build the scraper's core logic?
Create the scraper logic, and remember to apply your license key before running the scraper:
public class ShoppingScraper : WebScraper
{
/// <summary>
/// Initialize the web scraper, setting the start URLs and allowed/banned domains or URL patterns.
/// </summary>
public override void Init()
{
// Apply your license key - get one from https://ironsoftware.com/csharp/webscraper/licensing/
License.LicenseKey = "LicenseKey";
this.LoggingLevel = WebScraper.LogLevel.All;
this.WorkingDirectory = AppSetting.GetAppRoot() + @"\ShoppingSiteSample\Output\";
// Queue the start URL; the response is handed to Parse
this.Request("www.webSite.com", Parse);
}
/// <summary>
/// Parses the HTML document of the response to scrape the necessary data.
/// </summary>
/// <param name="response">The HTTP Response object to parse.</param>
public override void Parse(Response response)
{
var categoryList = new List<Category>();
// Iterate through each link in the menu and extract the category data.
foreach (var Links in response.Css("#menuFixed > ul > li > a"))
{
var cat = new Category
{
URL = Links.Attributes["href"],
Name = Links.InnerText,
LastScraped = DateTime.Now
};
categoryList.Add(cat);
}
// Save the scraped data into a JSONL file.
Scrape(categoryList, "Shopping.jsonl");
}
}
Imports System
Imports System.Collections.Generic
Public Class ShoppingScraper
Inherits WebScraper
''' <summary>
''' Initialize the web scraper, setting the start URLs and allowed/banned domains or URL patterns.
''' </summary>
Public Overrides Sub Init()
' Apply your license key - get one from https://ironsoftware.com/csharp/webscraper/licensing/
License.LicenseKey = "LicenseKey"
Me.LoggingLevel = WebScraper.LogLevel.All
Me.WorkingDirectory = AppSetting.GetAppRoot() & "\ShoppingSiteSample\Output\"
' Queue the start URL; the response is handed to Parse
Me.Request("www.webSite.com", AddressOf Parse)
End Sub
''' <summary>
''' Parses the HTML document of the response to scrape the necessary data.
''' </summary>
''' <param name="response">The HTTP Response object to parse.</param>
Public Overrides Sub Parse(response As Response)
Dim categoryList As New List(Of Category)()
' Iterate through each link in the menu and extract the category data.
For Each Links In response.Css("#menuFixed > ul > li > a")
Dim cat As New Category With {
.URL = Links.Attributes("href"),
.Name = Links.InnerText,
.LastScraped = DateTime.Now
}
categoryList.Add(cat)
Next
' Save the scraped data into a JSONL file.
Scrape(categoryList, "Shopping.jsonl")
End Sub
End Class
Which elements are targeted in the menu?
Extracting the menu links requires precise CSS selectors. In the code above, the selector "#menuFixed > ul > li > a" matches the anchor that is a direct child of each top-level menu item. The API reference describes the available selection methods in detail.
How do I scrape main categories and subcategories?
Update the code to scrape the main categories and all of their sub-links, so that the full navigation structure is captured:
public override void Parse(Response response)
{
// List of Category Links (Root)
var categoryList = new List<Category>();
// Traverse each 'li' under the fixed menu
foreach (var li in response.Css("#menuFixed > ul > li"))
{
// List of Main Links
foreach (var Links in li.Css("a"))
{
var cat = new Category
{
URL = Links.Attributes["href"],
Name = Links.InnerText,
SubCategories = new List<Category>(),
LastScraped = DateTime.Now
};
// List of Subcategories Links
foreach (var subCategory in li.Css("a[class=subcategory]"))
{
var subcat = new Category
{
URL = subCategory.Attributes["href"],
Name = subCategory.InnerText,
CategoryType = "Subcategory"
};
// Check if subcategory link already exists
if (cat.SubCategories.Find(c => c.Name == subcat.Name && c.URL == subcat.URL) == null)
{
// Add sublinks
cat.SubCategories.Add(subcat);
}
}
// Note: this stores the number of subcategories, not a count of products
cat.ProductCount = cat.SubCategories.Count;
// Add Main Category to the list
categoryList.Add(cat);
}
}
// Save the scraped data into a JSONL file.
Scrape(categoryList, "Shopping.jsonl");
}
Option Strict On
Public Overrides Sub Parse(response As Response)
' List of Category Links (Root)
Dim categoryList As New List(Of Category)()
' Traverse each 'li' under the fixed menu
For Each li In response.Css("#menuFixed > ul > li")
' List of Main Links
For Each Links In li.Css("a")
Dim cat As New Category With {
.URL = Links.Attributes("href"),
.Name = Links.InnerText,
.SubCategories = New List(Of Category)(),
.LastScraped = DateTime.Now
}
' List of Subcategories Links
For Each subCategory In li.Css("a[class=subcategory]")
Dim subcat As New Category With {
.URL = subCategory.Attributes("href"),
.Name = subCategory.InnerText,
.CategoryType = "Subcategory"
}
' Check if subcategory link already exists
If cat.SubCategories.Find(Function(c) c.Name = subcat.Name AndAlso c.URL = subcat.URL) Is Nothing Then
' Add sublinks
cat.SubCategories.Add(subcat)
End If
Next
' Note: this stores the number of subcategories, not a count of products
cat.ProductCount = cat.SubCategories.Count
' Add Main Category to the list
categoryList.Add(cat)
Next
Next
' Save the scraped data into a JSONL file.
Scrape(categoryList, "Shopping.jsonl")
End Sub
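The duplicate check in the loop above calls List.Find for every subcategory, which rescans the list each time. As a possible alternative, the sketch below tracks already-seen (name, URL) pairs in a HashSet. It works on plain tuples rather than IronWebScraper nodes, and LinkDeduplicator is a hypothetical helper name introduced only for this example.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper, for illustration only.
public static class LinkDeduplicator
{
    // Returns the links in their original order, keeping only the first
    // occurrence of each (name, url) pair.
    public static List<(string Name, string Url)> Distinct(IEnumerable<(string Name, string Url)> links)
    {
        var seen = new HashSet<(string, string)>();
        var result = new List<(string Name, string Url)>();
        foreach (var link in links)
        {
            // HashSet.Add returns false when the pair is already present.
            if (seen.Add(link))
            {
                result.Add(link);
            }
        }
        return result;
    }
}
```

Because the key combines name and URL, the "Shoes" links under Men and Women in the sample menu are both kept: they share a name but point to different URLs.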
How do I extract product information from category pages?
With links to every category available, start scraping the products within each category. When processing product pages, thread safety becomes important for performance and correctness. Navigate to any category page and examine its content.
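To illustrate the thread-safety point, the sketch below simulates several page handlers adding product names to one shared collection in parallel. It assumes that response handlers may run concurrently, which this guide does not confirm for IronWebScraper; ConcurrentCollector is a hypothetical helper, and the pages are plain string arrays rather than real responses.

```csharp
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Threading.Tasks;

// Hypothetical helper, for illustration only.
public static class ConcurrentCollector
{
    // Simulates several page handlers running in parallel and recording
    // product names into one shared, thread-safe collection.
    public static int CollectAll(IEnumerable<string[]> pages)
    {
        var collected = new ConcurrentBag<string>();
        Parallel.ForEach(pages, page =>
        {
            foreach (var name in page)
            {
                // ConcurrentBag.Add is safe to call from multiple threads without a lock.
                collected.Add(name);
            }
        });
        return collected.Count;
    }
}
```

A plain List<Product> shared across handlers would need a lock around every Add; a concurrent collection avoids that bookkeeping.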
What does the product HTML structure look like?
Examine the HTML structure to understand how each product is organized:
<section class="products">
<div class="sku -gallery -validate-size " data-sku="AG249FA0T2PSGNAFAMZ" ft-product-sizes="41,42,43,44,45" ft-product-color="Multicolour">
<a class="link" href="http://www.WebSite.com/agu-bundle-of-2-sneakers-black-navy-blue-653884.html">
<div class="image-wrapper default-state">
<img class="lazy image -loaded" alt="Bundle Of 2 Sneakers - Black & Navy Blue" data-image-vertical="1" width="210" height="262" src="https://static.WebSite.com/p/agu-6208-488356-1-catalog_grid_3.jpg" data-sku="AG249FA0T2PSGNAFAMZ" data-src="https://static.WebSite.com/p/agu-6208-488356-1-catalog_grid_3.jpg" data-placeholder="placeholder_m_1.jpg">
<noscript><img src="https://static.WebSite.com/p/agu-6208-488356-1-catalog_grid_3.jpg" width="210" height="262" class="image" /></noscript>
</div>
<h2 class="title">
<span class="brand ">Agu </span>
<span class="name" dir="ltr">Bundle Of 2 Sneakers - Black & Navy Blue</span>
</h2>
<div class="price-container clearfix">
<span class="price-box">
<span class="price">
<span data-currency-iso="EGP">EGP</span>
<span dir="ltr" data-price="299">299</span>
</span>
<span class="price -old -no-special"></span>
</span>
</div>
<div class="rating-stars">
<div class="stars-container">
<div class="stars" style="width: 62%"></div>
</div>
<div class="total-ratings">(30)</div>
</div>
<span class="shop-first-logo-container">
<img src="http://www.WebSite.com/images/local/logos/shop_first/ShoppingSite/logo_normal.png" data-src="http://www.WebSite.com/images/local/logos/shop_first/ShoppingSite/logo_normal.png" class="lazy shop-first-logo-img -mbxs -loaded">
</span>
<span class="osh-icon -ShoppingSite-local shop_local--logo -block -mbs -mts"></span>
<div class="list -sizes" data-selected-sku="">
<span class="js-link sku-size" data-href="http://www.WebSite.com/agu-bundle-of-2-sneakers-black-navy-blue-653884.html?size=41">41</span>
<span class="js-link sku-size" data-href="http://www.WebSite.com/agu-bundle-of-2-sneakers-black-navy-blue-653884.html?size=42">42</span>
<span class="js-link sku-size" data-href="http://www.WebSite.com/agu-bundle-of-2-sneakers-black-navy-blue-653884.html?size=43">43</span>
<span class="js-link sku-size" data-href="http://www.WebSite.com/agu-bundle-of-2-sneakers-black-navy-blue-653884.html?size=44">44</span>
<span class="js-link sku-size" data-href="http://www.WebSite.com/agu-bundle-of-2-sneakers-black-navy-blue-653884.html?size=45">45</span>
</div>
</a>
</div>
<div class="sku -gallery -validate-size " data-sku="LE047FA01SRK4NAFAMZ" ft-product-sizes="110,115,120,125,130,135" ft-product-color="Black">
<a class="link" href="http://www.WebSite.com/leather-shop-genuine-leather-belt-black-712030.html">
<div class="image-wrapper default-state">
<img class="lazy image -loaded" alt="Genuine Leather Belt - Black" data-image-vertical="1" width="210" height="262" src="https://static.WebSite.com/p/leather-shop-1831-030217-1-catalog_grid_3.jpg" data-sku="LE047FA01SRK4NAFAMZ" data-src="https://static.WebSite.com/p/leather-shop-1831-030217-1-catalog_grid_3.jpg" data-placeholder="placeholder_m_1.jpg">
<noscript><img src="https://static.WebSite.com/p/leather-shop-1831-030217-1-catalog_grid_3.jpg" width="210" height="262" class="image" /></noscript>
</div>
<h2 class="title"><span class="brand ">Leather Shop </span> <span class="name" dir="ltr">Genuine Leather Belt - Black</span></h2>
<div class="price-container clearfix">
<span class="sale-flag-percent">-29%</span>
<span class="price-box">
<span class="price"><span data-currency-iso="EGP">EGP</span> <span dir="ltr" data-price="96">96</span> </span>
<span class="price -old"><span data-currency-iso="EGP">EGP</span> <span dir="ltr" data-price="135">135</span> </span>
</span>
</div>
<div class="rating-stars">
<div class="stars-container">
<div class="stars" style="width: 100%"></div>
</div>
<div class="total-ratings">(1)</div>
</div>
<span class="osh-icon -ShoppingSite-local shop_local--logo -block -mbs -mts"></span>
<div class="list -sizes" data-selected-sku="">
<span class="js-link sku-size" data-href="http://www.WebSite.com/leather-shop-genuine-leather-belt-black-712030.html?size=110">110</span>
<span class="js-link sku-size" data-href="http://www.WebSite.com/leather-shop-genuine-leather-belt-black-712030.html?size=115">115</span>
<span class="js-link sku-size" data-href="http://www.WebSite.com/leather-shop-genuine-leather-belt-black-712030.html?size=120">120</span>
<span class="js-link sku-size" data-href="http://www.WebSite.com/leather-shop-genuine-leather-belt-black-712030.html?size=125">125</span>
<span class="js-link sku-size" data-href="http://www.WebSite.com/leather-shop-genuine-leather-belt-black-712030.html?size=130">130</span>
<span class="js-link sku-size" data-href="http://www.WebSite.com/leather-shop-genuine-leather-belt-black-712030.html?size=135">135</span>
</div>
</a>
</div>
</section>
Which product model should I create?
Create a product model for this content. When scraping shopping sites, capture all the relevant product details:
public class Product
{
/// <summary>
/// Gets or sets the name.
/// </summary>
/// <value>
/// The name.
/// </value>
public string Name { get; set; }
/// <summary>
/// Gets or sets the price.
/// </summary>
/// <value>
/// The price.
/// </value>
public string Price { get; set; }
/// <summary>
/// Gets or sets the image.
/// </summary>
/// <value>
/// The image.
/// </value>
public string Image { get; set; }
// Additional properties for comprehensive data collection
public string Brand { get; set; }
public string OldPrice { get; set; }
public string Discount { get; set; }
public float Rating { get; set; }
public int ReviewCount { get; set; }
public List<string> AvailableSizes { get; set; }
public string ProductUrl { get; set; }
public string SKU { get; set; }
public DateTime ScrapedDate { get; set; }
}
Public Class Product
''' <summary>
''' Gets or sets the name.
''' </summary>
''' <value>
''' The name.
''' </value>
Public Property Name As String
''' <summary>
''' Gets or sets the price.
''' </summary>
''' <value>
''' The price.
''' </value>
Public Property Price As String
''' <summary>
''' Gets or sets the image.
''' </summary>
''' <value>
''' The image.
''' </value>
Public Property Image As String
' Additional properties for comprehensive data collection
Public Property Brand As String
Public Property OldPrice As String
Public Property Discount As String
Public Property Rating As Single
Public Property ReviewCount As Integer
Public Property AvailableSizes As List(Of String)
Public Property ProductUrl As String
Public Property SKU As String
Public Property ScrapedDate As DateTime
End Class
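The model above stores `Price` and `OldPrice` as raw strings, which keeps the scraper simple but leaves the values unsuitable for sorting or comparison. The following is a minimal sketch of one way to normalize such a string afterwards. `PriceParser` and `TryParsePrice` are hypothetical names introduced for illustration, not part of IronWebScraper, and the sketch assumes prices use a dot as the decimal separator and a comma (if any) as the thousands separator.

```csharp
using System.Globalization;
using System.Linq;

// Hypothetical helper (not part of IronWebScraper).
// Assumption: "." is the decimal separator and "," only groups thousands.
public static class PriceParser
{
    public static bool TryParsePrice(string text, out decimal price)
    {
        price = 0m;
        if (string.IsNullOrWhiteSpace(text))
        {
            return false;
        }

        // Keep ASCII digits and dots only; this drops currency codes,
        // thousands separators, and whitespace.
        var cleaned = new string(
            text.Where(c => (c >= '0' && c <= '9') || c == '.').ToArray());

        // Remove stray dots left over from abbreviations such as "Rs.".
        cleaned = cleaned.Trim('.');

        return decimal.TryParse(
            cleaned,
            NumberStyles.AllowDecimalPoint,
            CultureInfo.InvariantCulture,
            out price);
    }
}
```

A site that formats prices as `1.299,00` would need the separators swapped before parsing, so check the target site's format first.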
How do I add product scraping functionality?
To scrape category pages, add a new parse method that extracts each product, tolerates missing optional fields, and handles errors per product:
// Requires: using System; using System.Collections.Generic; using System.Linq;
public void ParseCategory(Response response)
{
// List of Products
var productList = new List<Product>();
// Iterate through product links in the product section
foreach (var Links in response.Css("section.products > div > a"))
{
try
{
var product = new Product
{
Name = Links.Css("h2.title > span.name").First().InnerText,
Brand = Links.Css("h2.title > span.brand").FirstOrDefault()?.InnerText ?? "Unknown",
Price = Links.Css("div.price-container > span.price-box > span.price > span[data-price]").First().InnerText,
Image = Links.Css("div.image-wrapper.default-state > img").First().Attributes["src"],
ProductUrl = Links.Attributes["href"],
SKU = Links.ParentNode.Attributes["data-sku"],
ScrapedDate = DateTime.Now
};
// Extract old price if available
var oldPriceElement = Links.Css("span.price.-old > span[data-price]").FirstOrDefault();
if (oldPriceElement != null)
{
product.OldPrice = oldPriceElement.InnerText;
}
// Extract discount percentage
var discountElement = Links.Css("span.sale-flag-percent").FirstOrDefault();
if (discountElement != null)
{
product.Discount = discountElement.InnerText;
}
// Extract rating information
var ratingWidth = Links.Css("div.stars").FirstOrDefault()?.Attributes["style"];
if (!string.IsNullOrEmpty(ratingWidth))
{
var width = System.Text.RegularExpressions.Regex.Match(ratingWidth, @"(\d+)%").Groups[1].Value;
if (int.TryParse(width, out int ratingPercent))
{
product.Rating = ratingPercent / 20.0f; // Convert percentage to 5-star scale
}
}
// Extract review count
var reviewText = Links.Css("div.total-ratings").FirstOrDefault()?.InnerText;
if (!string.IsNullOrEmpty(reviewText))
{
var reviewCount = System.Text.RegularExpressions.Regex.Match(reviewText, @"\d+").Value;
if (int.TryParse(reviewCount, out int count))
{
product.ReviewCount = count;
}
}
// Extract available sizes
product.AvailableSizes = Links.Css("div.list.-sizes > span.sku-size")
.Select(s => s.InnerText)
.ToList();
productList.Add(product);
}
catch (Exception ex)
{
// Log error and continue with next product
Console.WriteLine($"Error parsing product: {ex.Message}");
}
}
// Save the scraped product data into a JSONL file.
Scrape(productList, "Products.jsonl");
// Handle pagination only when a next-page link with a usable href exists
var nextPageLink = response.Css("a.pagination-next").FirstOrDefault();
var nextPageUrl = nextPageLink?.Attributes["href"];
if (!string.IsNullOrEmpty(nextPageUrl))
{
this.Request(nextPageUrl, ParseCategory);
}
}
Public Sub ParseCategory(response As Response)
' List of Products
Dim productList As New List(Of Product)()
' Iterate through product links in the product section
For Each Links In response.Css("section.products > div > a")
Try
Dim product As New Product With {
.Name = Links.Css("h2.title > span.name").First().InnerText,
.Brand = If(Links.Css("h2.title > span.brand").FirstOrDefault()?.InnerText, "Unknown"),
.Price = Links.Css("div.price-container > span.price-box > span.price > span[data-price]").First().InnerText,
.Image = Links.Css("div.image-wrapper.default-state > img").First().Attributes("src"),
.ProductUrl = Links.Attributes("href"),
.SKU = Links.ParentNode.Attributes("data-sku"),
.ScrapedDate = DateTime.Now
}
' Extract old price if available
Dim oldPriceElement = Links.Css("span.price.-old > span[data-price]").FirstOrDefault()
If oldPriceElement IsNot Nothing Then
product.OldPrice = oldPriceElement.InnerText
End If
' Extract discount percentage
Dim discountElement = Links.Css("span.sale-flag-percent").FirstOrDefault()
If discountElement IsNot Nothing Then
product.Discount = discountElement.InnerText
End If
' Extract rating information
Dim ratingWidth = Links.Css("div.stars").FirstOrDefault()?.Attributes("style")
If Not String.IsNullOrEmpty(ratingWidth) Then
Dim width = System.Text.RegularExpressions.Regex.Match(ratingWidth, "(\d+)%").Groups(1).Value
Dim ratingPercent As Integer
If Integer.TryParse(width, ratingPercent) Then
product.Rating = ratingPercent / 20.0F ' Convert percentage to 5-star scale
End If
End If
' Extract review count
Dim reviewText = Links.Css("div.total-ratings").FirstOrDefault()?.InnerText
If Not String.IsNullOrEmpty(reviewText) Then
Dim reviewCount = System.Text.RegularExpressions.Regex.Match(reviewText, "\d+").Value
Dim count As Integer
If Integer.TryParse(reviewCount, count) Then
product.ReviewCount = count
End If
End If
' Extract available sizes
product.AvailableSizes = Links.Css("div.list.-sizes > span.sku-size") _
.Select(Function(s) s.InnerText) _
.ToList()
productList.Add(product)
Catch ex As Exception
' Log error and continue with next product
Console.WriteLine($"Error parsing product: {ex.Message}")
End Try
Next
' Save the scraped product data into a JSONL file.
Scrape(productList, "Products.jsonl")
' Handle pagination only when a next-page link with a usable href exists
Dim nextPageLink = response.Css("a.pagination-next").FirstOrDefault()
Dim nextPageUrl = nextPageLink?.Attributes("href")
If Not String.IsNullOrEmpty(nextPageUrl) Then
Me.Request(nextPageUrl, AddressOf ParseCategory)
End If
End Sub
This approach captures the main product fields (name, brand, prices, discount, rating, review count, sizes, URL, and SKU) and handles errors gracefully: a product that fails to parse is logged and skipped instead of aborting the whole page. For more advanced scenarios, explore the advanced web scraping features available in IronWebScraper.
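Two steps in the category parser are easy to get wrong: turning the star widget's `width: NN%` style into a 0 to 5 rating, and following a next-page `href` that may be relative. The sketch below isolates both as plain functions that can be tested without a network call. `ProductParsingHelpers`, `RatingFromStyle`, and `ResolveUrl` are hypothetical names for illustration and are not part of IronWebScraper; whether the library already resolves relative links is not confirmed here, so treat `ResolveUrl` as an optional safeguard.

```csharp
using System;
using System.Globalization;
using System.Text.RegularExpressions;

// Hypothetical helpers (not part of IronWebScraper).
public static class ProductParsingHelpers
{
    // Matches the first percentage in a style attribute, e.g. "width: 80%".
    private static readonly Regex PercentPattern =
        new Regex(@"(\d+(?:\.\d+)?)\s*%");

    // Converts a style such as "width: 80%" into a rating on a 5-star scale.
    // Returns null when the style contains no percentage.
    public static float? RatingFromStyle(string style)
    {
        if (string.IsNullOrWhiteSpace(style))
        {
            return null;
        }

        var match = PercentPattern.Match(style);
        if (!match.Success)
        {
            return null;
        }

        if (!float.TryParse(match.Groups[1].Value, NumberStyles.Float,
                            CultureInfo.InvariantCulture, out var percent))
        {
            return null;
        }

        return percent / 20.0f; // 100% corresponds to 5 stars
    }

    // Resolves a possibly relative href against the URL of the current page.
    // Returns null when the href is empty or cannot be resolved.
    public static string ResolveUrl(string pageUrl, string href)
    {
        if (string.IsNullOrWhiteSpace(href))
        {
            return null;
        }

        var baseUri = new Uri(pageUrl, UriKind.Absolute);
        return Uri.TryCreate(baseUri, href, out var absolute)
            ? absolute.AbsoluteUri
            : null;
    }
}
```

Returning `null` rather than `0` for a missing rating keeps "not rated" distinguishable from "rated zero stars"; the `Product.Rating` property would need to be a nullable `float?` to preserve that distinction.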
Frequently Asked Questions
How do I extract product data from shopping websites in C#?
IronWebScraper extracts product data from shopping websites using CSS selectors. Create a class that inherits from WebScraper, override the Parse method, and use response.Css() to target specific HTML elements such as product names, prices, and images. The extracted data can be saved in several formats, including JSON and JSONL files.
What are the basic steps to create a shopping site scraper?
To create a shopping site scraper with IronWebScraper: 1) Create a Console App project, 2) Add a class that inherits from WebScraper, 3) Create data models for categories and products, 4) Override the Init() method to set your starting URL, 5) Override the Parse() method to extract data using CSS selectors, and 6) Run the scraper to save the data in the format of your choice.
How can I handle hierarchical category structures when scraping e-commerce sites?
With IronWebScraper you can handle hierarchical structures by creating data models that mirror the parent-child relationships (such as Fashion > Men > Shoes). Navigate the nested HTML elements with CSS selectors and build your category tree programmatically, which is particularly useful when working with IronWebScraper's advanced features.
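As a rough illustration of a model that mirrors those parent-child relationships, the sketch below defines a small tree node. `CategoryNode`, `AddChild`, and `GetPath` are hypothetical names chosen for this example and are independent of IronWebScraper; the URLs are placeholders.

```csharp
using System.Collections.Generic;

// Hypothetical tree model for nested categories (not an IronWebScraper type).
public class CategoryNode
{
    public string Name { get; set; }
    public string Url { get; set; }
    public CategoryNode Parent { get; private set; }
    public List<CategoryNode> Children { get; } = new List<CategoryNode>();

    // Adds a sub-category and links it back to this node.
    public CategoryNode AddChild(string name, string url)
    {
        var child = new CategoryNode { Name = name, Url = url, Parent = this };
        Children.Add(child);
        return child;
    }

    // Builds a breadcrumb such as "Fashion > Men > Shoes" by walking up
    // to the root and emitting the names from root to leaf.
    public string GetPath()
    {
        var parts = new Stack<string>();
        for (var node = this; node != null; node = node.Parent)
        {
            parts.Push(node.Name);
        }
        return string.Join(" > ", parts);
    }
}
```

In a parse method, each top-level menu element would become a root node and each nested link a call to `AddChild`, for example `root.AddChild("Men", menUrl).AddChild("Shoes", shoesUrl)`. If the tree is serialized to JSON, the `Parent` back-reference would typically need to be excluded to avoid a reference cycle.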
What is the best way to analyze a shopping site's HTML structure before scraping?
Before using IronWebScraper to scrape a shopping site, use the browser's developer tools to inspect the HTML structure. Look for consistent patterns in CSS classes and element hierarchies. This analysis helps you identify the right CSS selectors to use in your Parse() method so that it accurately targets product information, categories, and other data elements.
Can I extract product listings and category navigation from the same page?
Yes, IronWebScraper lets you extract several types of data from the same page. In your Parse() method, use different CSS selectors to target category links (such as '.category-item') and product listings (such as '.product-item'), then save them to separate output files or data structures.
How do I save scraped product data to a file?
IronWebScraper provides a built-in Scrape() method that saves the extracted data automatically. Pass your data object and the file name, as in Scrape(item, "products.jsonl"). The library supports several output formats, including JSON, JSONL, and CSV, which makes it easy to export the extracted e-commerce data for further processing.