Scrape a Shopping Website in C#

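Learn how to scrape product categories and items from a shopping website with C# and the WebScraper framework, extracting structured data from HTML elements into custom models. This guide walks you through building an e-commerce scraper with IronWebScraper.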

Quick start: scrape shopping website data with C#

  1. Install IronWebScraper (https://www.nuget.org/packages/IronWebScraper) with the NuGet Package Manager

    PM > Install-Package IronWebScraper
  2. Copy and run this code.

    using System.Linq;
    using IronWebScraper;
    
    // Run the scraper (top-level statements must come before type declarations)
    var scraper = new QuickShoppingScraper();
    scraper.Start();
    
    public class QuickShoppingScraper : WebScraper
    {
        public override void Init()
        {
            // Apply your license key
            License.LicenseKey = "YOUR-LICENSE-KEY";
    
            // Set the starting URL
            this.Request("https://shopping-site.com", Parse);
        }
    
        public override void Parse(Response response)
        {
            // Extract product data from each product element on the page
            foreach (var product in response.Css(".product-item"))
            {
                var item = new
                {
                    Name = product.Css(".product-name").First().InnerText,
                    Price = product.Css(".price").First().InnerText,
                    Image = product.Css("img").First().Attributes["src"]
                };
    
                // Append the item to the output file
                Scrape(item, "products.jsonl");
            }
        }
    }
  3. Deploy to your production environment to test

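  1. Create a new console application project named "ShoppingSiteSample"
  2. Add a class named "ShoppingScraper" that inherits from WebScraper
  3. Create models for the Category and Product data
  4. Override Init() to set the start URL and Parse() to do the scraping
  5. Run the scraper to extract categories and products into JSONL files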

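How do I analyze the HTML structure of a shopping website?

Choose a shopping website and analyze how its content is structured. Understanding the HTML structure is the key to successful web scraping. Before writing any code, take time to inspect the target site with your browser's developer tools.

Jumia e-commerce homepage with a Ramadan sale banner and navigation menu

As the image shows, the left sidebar contains links to the site's product categories. The first step is to investigate the site's HTML and plan the scraping approach. This analysis phase is essential to an effective scraping strategy.

E-commerce site navigation menu showing product categories, subcategories, and brand sections

Why does understanding the HTML structure matter?

The fashion category has subcategories (Men, Women, Kids). Understanding this hierarchy helps you design suitable data models and scraping logic. Correct HTML analysis matters even more when you use advanced web scraping features.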

<li class="menu-item" data-id="">
    <a href="https://domain.com/fashion-by-/" class="main-category">
        <i class="cat-icon osh-font-fashion"></i>
        <span class="nav-subTxt">FASHION </span>
        <i class="osh-font-light-arrow-left"></i><i class="osh-font-light-arrow-right"></i>
    </a>
    <div class="navLayerWrapper" style="width: 633px; display: none;">
        <div class="submenu">
            <div class="column">
                <div class="categories">
                    <a class="category" href="https://domain.com/fashion-by-/?sort=newest&dir=desc&viewType=gridView3">New Arrivals !</a>
                </div>
                <div class="categories">
                    <a class="category" href="https://domain.com/men-fashion/">Men</a>
                    <a class="subcategory" href="https://domain.com/mens-shoes/">Shoes</a>
                    <a class="subcategory" href="https://domain.com/mens-clothing/">Clothing</a>
                    <a class="subcategory" href="https://domain.com/mens-accessories/">Accessories</a>
                </div>
                <div class="categories">
                    <a class="category" href="https://domain.com/women-fashion/">Women</a>
                    <a class="subcategory" href="https://domain.com/womens-shoes/">Shoes</a>
                    <a class="subcategory" href="https://domain.com/womens-clothing/">Clothing</a>
                    <a class="subcategory" href="https://domain.com/womens-accessories/">Accessories</a>
                </div>
                <div class="categories">
                    <a class="category" href="https://domain.com/girls-boys-fashion/">Kids</a>
                    <a class="subcategory" href="https://domain.com/boys-fashion/">Boys</a>
                    <a class="subcategory" href="https://domain.com/girls/">Girls</a>
                </div>
                <div class="categories">
                    <a class="category" href="https://domain.com/maternity-clothes/">Maternity Clothes</a>
                </div>
            </div>
            <div class="column">
                <div class="categories">
                    <span class="category defaultCursor">Men Best Sellers</span>
                    <a class="subcategory" href="https://domain.com/mens-casual-shoes/">Casual Shoes</a>
                    <a class="subcategory" href="https://domain.com/mens-sneakers/">Sneakers</a>
                    <a class="subcategory" href="https://domain.com/mens-t-shirts/">T-shirts</a>
                    <a class="subcategory" href="https://domain.com/mens-polos/">Polos</a>
                </div>
                <div class="categories">
                    <span class="category defaultCursor">Women Best Sellers</span>
                    <a class="subcategory" href="https://domain.com/womens-sandals/">Sandals</a>
                    <a class="subcategory" href="https://domain.com/womens-sneakers/">Sneakers</a>
                    <a class="subcategory" href="https://domain.com/women-dresses/">Dresses</a>
                    <a class="subcategory" href="https://domain.com/women-tops/">Tops</a>
                </div>
                <div class="categories">
                    <a class="category" href="https://domain.com/womens-curvy-clothing/">Women's Curvy Clothing</a>
                </div>
                <div class="categories">
                    <a class="category" href="https://domain.com/fashion-bundles/v/">Fashion Bundles</a>
                </div>
                <div class="categories">
                    <a class="category" href="https://domain.com/hijab-fashion/">Hijab Fashion</a>
                </div>
            </div>
            <div class="column">
                <div class="categories">
                    <a class="category" href="https://domain.com/brands/fashion-by-/">SEE ALL BRANDS</a>
                    <a class="subcategory" href="https://domain.com/adidas/">Adidas</a>
                    <a class="subcategory" href="https://domain.com/converse/">Converse</a>
                    <a class="subcategory" href="https://domain.com/ravin/">Ravin</a>
                    <a class="subcategory" href="https://domain.com/dejavu/">Dejavu</a>
                    <a class="subcategory" href="https://domain.com/agu/">Agu</a>
                    <a class="subcategory" href="https://domain.com/activ/">Activ</a>
                    <a class="subcategory" href="https://domain.com/oxford--bellini--tie-house--milano/">Tie House</a>
                    <a class="subcategory" href="https://domain.com/shoe-room/">Shoe Room</a>
                    <a class="subcategory" href="https://domain.com/town-team/">Town Team</a>
                </div>
            </div>
        </div>
    </div>
</li>

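How do I set up the web scraping project?

Set up a project that follows C# web scraping best practices.

  1. Create a new console application, or add a new folder for the sample named "ShoppingSiteSample"
  2. Add a new class named "ShoppingScraper"
  3. Start by scraping the site's categories and their subcategories
  4. Install IronWebScraper through the NuGet Package Manager or the Package Manager Console:

    PM > Install-Package IronWebScraper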

What data model should I use for categories?

Create a Category model that correctly represents the hierarchy you found:

using System;
using System.Collections.Generic;

public class Category
{
    /// <summary>
    /// Gets or sets the name.
    /// </summary>
    /// <value>
    /// The name.
    /// </value>
    public string Name { get; set; }

    /// <summary>
    /// Gets or sets the URL.
    /// </summary>
    /// <value>
    /// The URL.
    /// </value>
    public string URL { get; set; }

    /// <summary>
    /// Gets or sets the subcategories.
    /// </summary>
    /// <value>
    /// The subcategories.
    /// </value>
    public List<Category> SubCategories { get; set; }

    // Additional properties for enhanced data collection
    public int ProductCount { get; set; }
    public DateTime LastScraped { get; set; }
    public string CategoryType { get; set; }
}
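
Once the hierarchy is captured in a model like this, a later step usually needs every category URL as a flat list, for example to request each category page in turn. The following is a minimal sketch of that idea, not part of IronWebScraper: `CategoryNode` is a trimmed, hypothetical stand-in for the Category model so the example compiles on its own, and `CategoryTree.Flatten` is an illustrative helper name. The sample values are taken from the menu HTML shown earlier.

```csharp
using System;
using System.Collections.Generic;

// Build a tiny tree by hand; in the scraper this would come from Parse().
var fashion = new CategoryNode
{
    Name = "FASHION",
    URL = "https://domain.com/fashion-by-/",
    SubCategories = new List<CategoryNode>
    {
        new CategoryNode { Name = "Shoes", URL = "https://domain.com/mens-shoes/" },
        new CategoryNode { Name = "Clothing", URL = "https://domain.com/mens-clothing/" }
    }
};

// Print the parent URL first, then each subcategory URL.
foreach (var url in CategoryTree.Flatten(fashion))
{
    Console.WriteLine(url);
}

// Hypothetical, trimmed stand-in for the Category model.
public class CategoryNode
{
    public string Name { get; set; }
    public string URL { get; set; }
    public List<CategoryNode> SubCategories { get; set; } = new List<CategoryNode>();
}

public static class CategoryTree
{
    // Depth-first walk: yields the node's own URL, then the URLs of its descendants.
    public static IEnumerable<string> Flatten(CategoryNode node)
    {
        if (!string.IsNullOrEmpty(node.URL))
        {
            yield return node.URL;
        }

        foreach (var child in node.SubCategories ?? new List<CategoryNode>())
        {
            foreach (var url in Flatten(child))
            {
                yield return url;
            }
        }
    }
}
```

Because the model is recursive (a Category holds a list of Category), the same walk works however deep the menu nesting turns out to be.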

How do I build the basic scraper logic?

Build the scraper logic, and remember to apply your license key before running the scraper:

public class ShoppingScraper : WebScraper
{
    /// <summary>
    /// Initialize the web scraper, setting the start URLs and allowed/banned domains or URL patterns.
    /// </summary>
    public override void Init()
    {
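        // Apply your license key - get one from https://ironsoftware.com/csharp/webscraper/licensing/
        License.LicenseKey = "LicenseKey";
        this.LoggingLevel = WebScraper.LogLevel.All;
        // AppSetting.GetAppRoot() is not defined in this article; it is assumed to be a project helper that returns the application root folder
        this.WorkingDirectory = AppSetting.GetAppRoot() + @"\ShoppingSiteSample\Output\";

        // Queue the start page (placeholder address; replace it with the target site's absolute URL)
        this.Request("https://www.webSite.com", Parse);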
    }

    /// <summary>
    /// Parses the HTML document of the response to scrape the necessary data.
    /// </summary>
    /// <param name="response">The HTTP Response object to parse.</param>
    public override void Parse(Response response)
    {
        var categoryList = new List<Category>();

        // Iterate through each link in the menu and extract the category data.
        foreach (var Links in response.Css("#menuFixed > ul > li > a"))
        {
            var cat = new Category
            {
                URL = Links.Attributes["href"],
                Name = Links.InnerText,
                LastScraped = DateTime.Now
            };
            categoryList.Add(cat);
        }

        // Save the scraped data into a JSONL file.
        Scrape(categoryList, "Shopping.jsonl");
    }
}

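Which elements in the menu are we targeting?

Scraping links from the menu requires precise CSS selectors. The API reference provides details on the available selector methods:

JSON file in Notepad showing the e-commerce category structure with nested subcategories and URLs

How do I scrape main categories and subcategories together?

Update the code to scrape the main categories and all of their sub-links. This approach captures the complete navigation structure: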

public override void Parse(Response response)
{
    // List of Category Links (Root)
    var categoryList = new List<Category>();

    // Traverse each 'li' under the fixed menu
    foreach (var li in response.Css("#menuFixed > ul > li"))
    {
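        // Main category link only (class "main-category" in the menu HTML);
        // a bare "a" selector would also match every category and subcategory link inside the 'li'
        foreach (var Links in li.Css("a.main-category"))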
        {
            var cat = new Category
            {
                URL = Links.Attributes["href"],
                Name = Links.InnerText,
                SubCategories = new List<Category>(),
                LastScraped = DateTime.Now
            };

            // List of Subcategories Links
            foreach (var subCategory in li.Css("a[class=subcategory]"))
            {
                var subcat = new Category
                {
                    URL = subCategory.Attributes["href"],
                    Name = subCategory.InnerText,
                    CategoryType = "Subcategory"
                };

                // Check if subcategory link already exists
                if (cat.SubCategories.Find(c => c.Name == subcat.Name && c.URL == subcat.URL) == null)
                {
                    // Add sublinks
                    cat.SubCategories.Add(subcat);
                }
            }

            // Placeholder: stores the number of subcategories until real product counts are scraped
            cat.ProductCount = cat.SubCategories.Count;

            // Add Main Category to the list
            categoryList.Add(cat);
        }
    }

    // Save the scraped data into a JSONL file.
    Scrape(categoryList, "Shopping.jsonl");
}
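
The duplicate check above scans the list with Find for every new link, which is fine for a menu of this size. For larger menus, one alternative is to track the (name, URL) pairs already seen in a HashSet. The sketch below shows that swapped-in technique on plain tuples so it runs without IronWebScraper; `LinkDeduplicator` is a hypothetical helper name, and the sample links come from the menu HTML shown earlier.

```csharp
using System;
using System.Collections.Generic;

var links = new List<(string Name, string Url)>
{
    ("Shoes", "https://domain.com/mens-shoes/"),
    ("Sneakers", "https://domain.com/mens-sneakers/"),
    ("Shoes", "https://domain.com/mens-shoes/"),    // exact duplicate, dropped
    ("Shoes", "https://domain.com/womens-shoes/")   // same name, different URL, kept
};

foreach (var link in LinkDeduplicator.Distinct(links))
{
    Console.WriteLine($"{link.Name} -> {link.Url}");
}

public static class LinkDeduplicator
{
    // Returns the links in their original order, keeping only the first
    // occurrence of each (Name, Url) pair.
    public static List<(string Name, string Url)> Distinct(IEnumerable<(string Name, string Url)> links)
    {
        var seen = new HashSet<(string, string)>();
        var result = new List<(string Name, string Url)>();

        foreach (var link in links)
        {
            // HashSet.Add returns false when the pair is already present.
            if (seen.Add((link.Name, link.Url)))
            {
                result.Add(link);
            }
        }

        return result;
    }
}
```

As in the original check, a link counts as a duplicate only when both its name and its URL match, so "Shoes" under Men and "Shoes" under Women are both kept.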

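How do I extract product information from category pages?

With links to all of the site's categories in hand, you can start scraping the products in each category. When processing product pages, thread safety matters for getting the best performance. Navigate to any category and inspect its content:

E-commerce product listing page showing shoes and accessories with prices, ratings, and filter controls

What does the product HTML structure look like?

Inspect the HTML structure to understand how the products are organized: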

<section class="products">
    <div class="sku -gallery -validate-size " data-sku="AG249FA0T2PSGNAFAMZ" ft-product-sizes="41,42,43,44,45" ft-product-color="Multicolour">
        <a class="link" href="http://www.WebSite.com/agu-bundle-of-2-sneakers-black-navy-blue-653884.html">
            <div class="image-wrapper default-state">
                <img class="lazy image -loaded" alt="Bundle Of 2 Sneakers - Black & Navy Blue" data-image-vertical="1" width="210" height="262" src="https://static.WebSite.com/p/agu-6208-488356-1-catalog_grid_3.jpg" data-sku="AG249FA0T2PSGNAFAMZ" data-src="https://static.WebSite.com/p/agu-6208-488356-1-catalog_grid_3.jpg" data-placeholder="placeholder_m_1.jpg">
                <noscript><img src="https://static.WebSite.com/p/agu-6208-488356-1-catalog_grid_3.jpg" width="210" height="262" class="image" /></noscript>
            </div>
            <h2 class="title">
                <span class="brand ">Agu&nbsp;</span>
                <span class="name" dir="ltr">Bundle Of 2 Sneakers - Black & Navy Blue</span>
            </h2>
            <div class="price-container clearfix">
                <span class="price-box">
                    <span class="price">
                        <span data-currency-iso="EGP">EGP</span>
                        <span dir="ltr" data-price="299">299</span>
                    </span>
                    <span class="price -old  -no-special"></span>
                </span>
            </div>
            <div class="rating-stars">
                <div class="stars-container">
                    <div class="stars" style="width: 62%"></div>
                </div>
                <div class="total-ratings">(30)</div>
            </div>
            <span class="shop-first-logo-container">
                <img src="http://www.WebSite.com/images/local/logos/shop_first/ShoppingSite/logo_normal.png" data-src="http://www.WebSite.com/images/local/logos/shop_first/ShoppingSite/logo_normal.png" class="lazy shop-first-logo-img -mbxs -loaded">
            </span>
            <span class="osh-icon -ShoppingSite-local shop_local--logo -block -mbs -mts"></span>
            <div class="list -sizes" data-selected-sku="">
                <span class="js-link sku-size" data-href="http://www.WebSite.com/agu-bundle-of-2-sneakers-black-navy-blue-653884.html?size=41">41</span>
                <span class="js-link sku-size" data-href="http://www.WebSite.com/agu-bundle-of-2-sneakers-black-navy-blue-653884.html?size=42">42</span>
                <span class="js-link sku-size" data-href="http://www.WebSite.com/agu-bundle-of-2-sneakers-black-navy-blue-653884.html?size=43">43</span>
                <span class="js-link sku-size" data-href="http://www.WebSite.com/agu-bundle-of-2-sneakers-black-navy-blue-653884.html?size=44">44</span>
                <span class="js-link sku-size" data-href="http://www.WebSite.com/agu-bundle-of-2-sneakers-black-navy-blue-653884.html?size=45">45</span>
            </div>
        </a>
    </div>
    <div class="sku -gallery -validate-size " data-sku="LE047FA01SRK4NAFAMZ" ft-product-sizes="110,115,120,125,130,135" ft-product-color="Black">
        <a class="link" href="http://www.WebSite.com/leather-shop-genuine-leather-belt-black-712030.html">
            <div class="image-wrapper default-state">
                <img class="lazy image -loaded" alt="Genuine Leather Belt - Black" data-image-vertical="1" width="210" height="262" src="https://static.WebSite.com/p/leather-shop-1831-030217-1-catalog_grid_3.jpg" data-sku="LE047FA01SRK4NAFAMZ" data-src="https://static.WebSite.com/p/leather-shop-1831-030217-1-catalog_grid_3.jpg" data-placeholder="placeholder_m_1.jpg">
                <noscript><img src="https://static.WebSite.com/p/leather-shop-1831-030217-1-catalog_grid_3.jpg" width="210" height="262" class="image" /></noscript>
            </div>
            <h2 class="title"><span class="brand ">Leather Shop&nbsp;</span> <span class="name" dir="ltr">Genuine Leather Belt - Black</span></h2>
            <div class="price-container clearfix">
                <span class="sale-flag-percent">-29%</span>
                <span class="price-box">
                    <span class="price"><span data-currency-iso="EGP">EGP</span> <span dir="ltr" data-price="96">96</span> </span>
                    <span class="price -old"><span data-currency-iso="EGP">EGP</span> <span dir="ltr" data-price="135">135</span> </span>
                </span>
            </div>
            <div class="rating-stars">
                <div class="stars-container">
                    <div class="stars" style="width: 100%"></div>
                </div>
                <div class="total-ratings">(1)</div>
            </div>
            <span class="osh-icon -ShoppingSite-local shop_local--logo -block -mbs -mts"></span>
            <div class="list -sizes" data-selected-sku="">
                <span class="js-link sku-size" data-href="http://www.WebSite.com/leather-shop-genuine-leather-belt-black-712030.html?size=110">110</span>
                <span class="js-link sku-size"  data-href="http://www.WebSite.com/leather-shop-genuine-leather-belt-black-712030.html?size=115">115</span>
                <span class="js-link sku-size"  data-href="http://www.WebSite.com/leather-shop-genuine-leather-belt-black-712030.html?size=120">120</span>
                <span class="js-link sku-size"  data-href="http://www.WebSite.com/leather-shop-genuine-leather-belt-black-712030.html?size=125">125</span>
                <span class="js-link sku-size"  data-href="http://www.WebSite.com/leather-shop-genuine-leather-belt-black-712030.html?size=130">130</span>
                <span class="js-link sku-size"  data-href="http://www.WebSite.com/leather-shop-genuine-leather-belt-black-712030.html?size=135">135</span>
            </div>
        </a>
    </div>
</section>

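What product model should I create?

Build a Product model for this content. When scraping a shopping site, capture all of the relevant product details: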

using System;
using System.Collections.Generic;

public class Product
{
    /// <summary>
    /// Gets or sets the name.
    /// </summary>
    /// <value>
    /// The name.
    /// </value>
    public string Name { get; set; }

    /// <summary>
    /// Gets or sets the price.
    /// </summary>
    /// <value>
    /// The price.
    /// </value>
    public string Price { get; set; }

    /// <summary>
    /// Gets or sets the image.
    /// </summary>
    /// <value>
    /// The image.
    /// </value>
    public string Image { get; set; }

    // Additional properties for comprehensive data collection
    public string Brand { get; set; }
    public string OldPrice { get; set; }
    public string Discount { get; set; }
    public float Rating { get; set; }
    public int ReviewCount { get; set; }
    public List<string> AvailableSizes { get; set; }
    public string ProductUrl { get; set; }
    public string SKU { get; set; }
    public DateTime ScrapedDate { get; set; }
}
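The model above keeps Price and OldPrice as strings, which mirrors the raw text on the page. If the values need to be sorted or compared later, they have to be converted to numbers. The following is a minimal sketch of one way to do that. PriceParser and TryParsePrice are hypothetical names, not part of IronWebScraper, and the sketch assumes the scraped text uses '.' as the decimal separator and ',' as the thousands separator.

```csharp
using System;
using System.Globalization;

public static class PriceParser
{
    // Hypothetical helper (not part of IronWebScraper): converts scraped
    // price text such as "135" or "1,299.50" into a decimal.
    public static bool TryParsePrice(string raw, out decimal price)
    {
        price = 0m;
        if (string.IsNullOrWhiteSpace(raw))
        {
            return false;
        }

        // Assumption: '.' is the decimal separator and ',' groups thousands.
        // Text with a currency code (for example "EGP 135") is rejected.
        return decimal.TryParse(
            raw.Trim(),
            NumberStyles.Number,
            CultureInfo.InvariantCulture,
            out price);
    }
}
```

A caller might keep the original string in Product.Price and store the parsed value in a separate numeric field only when TryParsePrice returns true, so that unexpected formats are never silently turned into zero.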

How Do I Add Product Scraping?

To scrape the category pages, add a new scrape method that includes error handling and data validation:

public void ParseCategory(Response response)
{
    // List of Products
    var productList = new List<Product>();

    // Iterate through product links in the product section
    foreach (var Links in response.Css("section.products > div > a"))
    {
        try
        {
            var product = new Product
            {
                Name = Links.Css("h2.title > span.name").First().InnerText,
                Brand = Links.Css("h2.title > span.brand").FirstOrDefault()?.InnerText ?? "Unknown",
                Price = Links.Css("div.price-container > span.price-box > span.price > span[data-price]").First().InnerText,
                Image = Links.Css("div.image-wrapper.default-state > img").First().Attributes["src"],
                ProductUrl = Links.Attributes["href"],
                SKU = Links.ParentNode.Attributes["data-sku"],
                ScrapedDate = DateTime.Now
            };

            // Extract old price if available
            var oldPriceElement = Links.Css("span.price.-old > span[data-price]").FirstOrDefault();
            if (oldPriceElement != null)
            {
                product.OldPrice = oldPriceElement.InnerText;
            }

            // Extract discount percentage
            var discountElement = Links.Css("span.sale-flag-percent").FirstOrDefault();
            if (discountElement != null)
            {
                product.Discount = discountElement.InnerText;
            }

            // Extract rating information
            var ratingWidth = Links.Css("div.stars").FirstOrDefault()?.Attributes["style"];
            if (!string.IsNullOrEmpty(ratingWidth))
            {
                var width = System.Text.RegularExpressions.Regex.Match(ratingWidth, @"(\d+)%").Groups[1].Value;
                if (int.TryParse(width, out int ratingPercent))
                {
                    product.Rating = ratingPercent / 20.0f; // Convert percentage to 5-star scale
                }
            }

            // Extract review count
            var reviewText = Links.Css("div.total-ratings").FirstOrDefault()?.InnerText;
            if (!string.IsNullOrEmpty(reviewText))
            {
                var reviewCount = System.Text.RegularExpressions.Regex.Match(reviewText, @"\d+").Value;
                if (int.TryParse(reviewCount, out int count))
                {
                    product.ReviewCount = count;
                }
            }

            // Extract available sizes
            product.AvailableSizes = Links.Css("div.list.-sizes > span.sku-size")
                .Select(s => s.InnerText)
                .ToList();

            productList.Add(product);
        }
        catch (Exception ex)
        {
            // Log error and continue with next product
            Console.WriteLine($"Error parsing product: {ex.Message}");
        }
    }

    // Save the scraped product data into a JSONL file.
    Scrape(productList, "Products.jsonl");

    // Handle pagination if needed
    var nextPageLink = response.Css("a.pagination-next").FirstOrDefault();
    if (nextPageLink != null)
    {
        var nextPageUrl = nextPageLink.Attributes["href"];
        this.Request(nextPageUrl, ParseCategory);
    }
}
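The rating and review-count logic in ParseCategory can also be pulled out into small pure functions, which makes it easier to check without a live page. The sketch below is one possible shape. RatingParser, ToStars, and ToReviewCount are hypothetical names, not part of IronWebScraper. It assumes the star width is written as "width: N%" as in the sample HTML, and it also accepts a fractional width such as "87.5%", for which a pattern of `(\d+)%` would capture only the "5".

```csharp
using System;
using System.Globalization;
using System.Text.RegularExpressions;

public static class RatingParser
{
    // Matches the percentage in a style value such as "width: 100%".
    private static readonly Regex WidthPercent =
        new Regex(@"width:\s*(\d+(?:\.\d+)?)%", RegexOptions.IgnoreCase);

    // Matches the first run of digits, e.g. the "1" in "(1)".
    private static readonly Regex FirstInteger = new Regex(@"\d+");

    // Converts a star-bar width percentage into a 0-5 star rating.
    // Returns null when the style carries no recognizable width.
    public static float? ToStars(string style)
    {
        if (string.IsNullOrEmpty(style))
        {
            return null;
        }

        var match = WidthPercent.Match(style);
        if (!match.Success ||
            !float.TryParse(match.Groups[1].Value, NumberStyles.Float,
                            CultureInfo.InvariantCulture, out var percent))
        {
            return null;
        }

        // Clamp to 0-100 so malformed markup cannot yield more than 5 stars.
        percent = Math.Max(0f, Math.Min(100f, percent));
        return percent / 20f;
    }

    // Extracts the review count from text such as "(1)"; 0 when absent.
    public static int ToReviewCount(string text)
    {
        if (string.IsNullOrEmpty(text))
        {
            return 0;
        }

        var match = FirstInteger.Match(text);
        return match.Success && int.TryParse(match.Value, out var count) ? count : 0;
    }
}
```

Returning a nullable rating keeps "no rating shown" distinct from "rated zero stars"; a caller that prefers the original behaviour could use `RatingParser.ToStars(style) ?? 0f`.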

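This approach to scraping a shopping site captures all relevant product information while handling errors gracefully. For more advanced scenarios, explore the advanced web scraping features available in IronWebScraper.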

Frequently Asked Questions

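How do I extract product data from a shopping website in C#?

IronWebScraper extracts product data from shopping websites through CSS selectors. Create a class that inherits from WebScraper, override the Parse method, and use response.Css() to target specific HTML elements such as product names, prices, and images. The extracted data can be saved in formats including JSON and JSONL files.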

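What are the basic steps for creating a shopping website scraper?

To create a shopping website scraper with IronWebScraper: 1) create a console application project; 2) add a class that inherits from WebScraper; 3) create data models for categories and products; 4) override the Init() method to set the starting URL; 5) override the Parse() method to extract data with CSS selectors; 6) run the scraper to save the data in your preferred format.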

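How do I handle hierarchical category structures when scraping e-commerce sites?

IronWebScraper lets you handle hierarchical structures by creating data models that reflect the parent-child relationships (for example, Fashion > Men > Shoes). You can use CSS selectors to walk nested HTML elements and build the category tree programmatically, which is especially useful when working with IronWebScraper's advanced features.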

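What is the best way to analyze a shopping website's HTML structure?

Before scraping a shopping website with IronWebScraper, inspect its HTML structure with your browser's developer tools. Look for consistent patterns in the CSS classes and the element hierarchy. This analysis helps you choose the right CSS selectors for your Parse() method so that it accurately targets product information, categories, and other data elements.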

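Can I extract both product listings and category navigation from the same page?

Yes. IronWebScraper lets you extract several types of data from a single page. In your Parse() method, you can use different CSS selectors to target category links (such as ".category-item") and product listings (such as ".product-item"), then save them to separate output files or data structures.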

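How do I save scraped product data to a file?

IronWebScraper provides a built-in Scrape() method that saves the extracted data automatically. Pass the data object and a file name, as in Scrape(item, "products.jsonl"). The library supports multiple output formats, including JSON, JSONL, and CSV, so scraped e-commerce data can be exported for further processing.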
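For the further-processing step, a JSONL file can be read back one line at a time. The sketch below uses System.Text.Json from the .NET base library rather than any IronWebScraper API. ProductRecord and JsonlReader are hypothetical names, and the sketch assumes each non-empty line is a single JSON object with Name and Price properties; the actual layout written by Scrape() should be checked against a real output file first, since a line holding a different shape (an array, for instance) would make Deserialize throw a JsonException.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text.Json;

// Hypothetical record type: only the fields this sketch reads back.
public sealed class ProductRecord
{
    public string Name { get; set; }
    public string Price { get; set; }
}

public static class JsonlReader
{
    // Reads JSON Lines input: one JSON object per line, blank lines skipped.
    public static List<ProductRecord> ReadAll(TextReader reader)
    {
        var records = new List<ProductRecord>();
        string line;
        while ((line = reader.ReadLine()) != null)
        {
            if (string.IsNullOrWhiteSpace(line))
            {
                continue;
            }

            // Assumption: the JSON property names match Name and Price exactly.
            var record = JsonSerializer.Deserialize<ProductRecord>(line);
            if (record != null)
            {
                records.Add(record);
            }
        }

        return records;
    }
}
```

To read from disk, a caller could wrap the file in a reader, for example `using (var reader = new StreamReader("Products.jsonl")) { var items = JsonlReader.ReadAll(reader); }`.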

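Curtis Chau
Technical Writer

Curtis Chau holds a bachelor's degree in computer science from Carleton University and specializes in front-end development, with expertise in Node.js, TypeScript, JavaScript, and React. He is passionate about building intuitive, visually appealing user interfaces, and enjoys working with modern frameworks and creating well-structured, visually appealing manuals.

Beyond development, Curtis has a strong interest in the Internet of Things (IoT) and explores new ways to integrate hardware and software. In his spare time, he enjoys gaming and building Discord bots, combining his love of technology with creativity.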
