使用 C# 和 IronWebScraper 抓取在线电影网站

This article was translated from English: Does it need improvement?
Translated
View the article in English

IronWebscraper 通过解析 HTML 元素从网站中提取电影数据,创建用于存储结构化数据的类型化对象,并利用元数据在页面间导航,从而构建全面的电影信息数据集。 这款 C# Web Scraper 库可简化非结构化网页内容向有序、可分析数据的转换过程。

快速入门:使用 C# 抓取电影信息

  1. 通过 NuGet 包管理器安装 IronWebScraper
  2. 创建一个继承自 `` 的类
  3. 重写 `` 以设置许可证并请求目标 URL
  4. 重写 `` 以使用 CSS 选择器提取电影数据
  5. 使用 `` 方法将数据保存为 JSON 格式
  1. 使用 NuGet 包管理器安装 https://www.nuget.org/packages/IronWebScraper

    PM > Install-Package IronWebScraper
  2. 复制并运行这段代码。

    using IronWebScraper;
    using System;
    
    public class QuickstartMovieScraper : WebScraper
    {
        public override void Init()
        {
            // Set your license key
            License.LicenseKey = "YOUR-LICENSE-KEY";
    
            // Configure scraper settings
            this.LoggingLevel = LogLevel.All;
            this.WorkingDirectory = @"C:\MovieData\Output\";
    
            // Start scraping from the homepage
            this.Request("https://example-movie-site.com", Parse);
        }
    
        public override void Parse(Response response)
        {
            // Extract movie titles using CSS selectors
            foreach (var movieDiv in response.Css(".movie-item"))
            {
                var title = movieDiv.Css("h2")[0].TextContentClean;
                var url = movieDiv.Css("a")[0].Attributes["href"];
    
                // Save the scraped data
                Scrape(new { Title = title, Url = url }, "movies.json");
            }
        }
    }
    
    // Run the scraper
    var scraper = new QuickstartMovieScraper();
    scraper.Start();
  3. 部署到您的生产环境中进行测试

    通过免费试用立即在您的项目中开始使用IronWebScraper

    arrow pointer

如何创建电影抓取类?

请以一个真实的网站示例开头。 我们将使用《C# 网页抓取教程》中介绍的技术,对一个电影网站进行抓取。

添加一个新类,并将其命名为 ``:

Visual Studio

创建一个专用的抓取类有助于整理代码并提高代码的可复用性。 这种方法遵循面向对象原则,便于您日后轻松扩展功能。

目标网站的结构是怎样的?

检查网站结构以便进行抓取。 理解网站结构对于有效的网页抓取至关重要。 与我们关于"从在线电影网站抓取数据"的指南类似,请先分析 HTML 结构:

电影流媒体网站界面,显示电影海报网格,配有导航标签和画质指示器

哪些 HTML 元素包含视频数据?

这是网站首页HTML代码的一部分。通过分析HTML结构,有助于确定应使用的正确CSS选择器:

<div id="movie-featured" class="movies-list movies-list-full tab-pane in fade active">
    <div data-movie-id="20746" class="ml-item">
        <a href="https://website.com/film/king-arthur-legend-of-the-sword-20746/">
            <span class="mli-quality">CAM</span>
            <img data-original="https://img.gocdn.online/2017/05/16/poster/2116d6719c710eabe83b377463230fbe-king-arthur-legend-of-the-sword.jpg" 
                 class="lazy thumb mli-thumb" alt="King Arthur: Legend of the Sword"
                 src="https://img.gocdn.online/2017/05/16/poster/2116d6719c710eabe83b377463230fbe-king-arthur-legend-of-the-sword.jpg" 
                 style="display: inline-block;">
            <span class="mli-info"><h2>King Arthur: Legend of the Sword</h2></span>
        </a>
    </div>
    <div data-movie-id="20724" class="ml-item">
        <a href="https://website.com/film/snatched-20724/">
            <span class="mli-quality">CAM</span>
            <img data-original="https://img.gocdn.online/2017/05/16/poster/5ef66403dc331009bdb5aa37cfe819ba-snatched.jpg" 
                 class="lazy thumb mli-thumb" alt="Snatched" 
                 src="https://img.gocdn.online/2017/05/16/poster/5ef66403dc331009bdb5aa37cfe819ba-snatched.jpg" 
                 style="display: inline-block;">
            <span class="mli-info"><h2>Snatched</h2></span>
        </a>
    </div>
</div>
<div id="movie-featured" class="movies-list movies-list-full tab-pane in fade active">
    <div data-movie-id="20746" class="ml-item">
        <a href="https://website.com/film/king-arthur-legend-of-the-sword-20746/">
            <span class="mli-quality">CAM</span>
            <img data-original="https://img.gocdn.online/2017/05/16/poster/2116d6719c710eabe83b377463230fbe-king-arthur-legend-of-the-sword.jpg" 
                 class="lazy thumb mli-thumb" alt="King Arthur: Legend of the Sword"
                 src="https://img.gocdn.online/2017/05/16/poster/2116d6719c710eabe83b377463230fbe-king-arthur-legend-of-the-sword.jpg" 
                 style="display: inline-block;">
            <span class="mli-info"><h2>King Arthur: Legend of the Sword</h2></span>
        </a>
    </div>
    <div data-movie-id="20724" class="ml-item">
        <a href="https://website.com/film/snatched-20724/">
            <span class="mli-quality">CAM</span>
            <img data-original="https://img.gocdn.online/2017/05/16/poster/5ef66403dc331009bdb5aa37cfe819ba-snatched.jpg" 
                 class="lazy thumb mli-thumb" alt="Snatched" 
                 src="https://img.gocdn.online/2017/05/16/poster/5ef66403dc331009bdb5aa37cfe819ba-snatched.jpg" 
                 style="display: inline-block;">
            <span class="mli-info"><h2>Snatched</h2></span>
        </a>
    </div>
</div>
HTML

我们拥有电影 ID、标题以及详细页面的链接。 每个视频都包含在 标签中,该标签具有 类,并包含用于标识的唯一 `` 属性。

如何实现基础的电影信息抓取?

开始抓取此数据集。 在运行任何抓取工具之前,请确保已按如下所示正确配置许可证密钥:

public class MovieScraper : WebScraper
{
    public override void Init()
    {
        // Initialize scraper settings
        License.LicenseKey = "LicenseKey";
        this.LoggingLevel = WebScraper.LogLevel.All;
        this.WorkingDirectory = AppSetting.GetAppRoot() + @"\MovieSample\Output\";

        // Request homepage content for scraping
        this.Request("www.website.com", Parse);
    }

    public override void Parse(Response response)
    {
        // Iterate over each movie div within the featured movie section
        foreach (var div in response.Css("#movie-featured > div"))
        {
            if (div.Attributes["class"] != "clearfix")
            {
                var movieId = Convert.ToInt32(div.GetAttribute("data-movie-id"));
                var link = div.Css("a")[0];
                var movieTitle = link.TextContentClean;

                // Scrape and store movie data as key-value pairs
                Scrape(new ScrapedData() 
                { 
                    { "MovieId", movieId },
                    { "MovieTitle", movieTitle }
                }, "Movie.Jsonl");
            }
        }           
    }
}
public class MovieScraper : WebScraper
{
    public override void Init()
    {
        // Initialize scraper settings
        License.LicenseKey = "LicenseKey";
        this.LoggingLevel = WebScraper.LogLevel.All;
        this.WorkingDirectory = AppSetting.GetAppRoot() + @"\MovieSample\Output\";

        // Request homepage content for scraping
        this.Request("www.website.com", Parse);
    }

    public override void Parse(Response response)
    {
        // Iterate over each movie div within the featured movie section
        foreach (var div in response.Css("#movie-featured > div"))
        {
            if (div.Attributes["class"] != "clearfix")
            {
                var movieId = Convert.ToInt32(div.GetAttribute("data-movie-id"));
                var link = div.Css("a")[0];
                var movieTitle = link.TextContentClean;

                // Scrape and store movie data as key-value pairs
                Scrape(new ScrapedData() 
                { 
                    { "MovieId", movieId },
                    { "MovieTitle", movieTitle }
                }, "Movie.Jsonl");
            }
        }           
    }
}
Public Class MovieScraper
	Inherits WebScraper

	Public Overrides Sub Init()
		' Initialize scraper settings
		License.LicenseKey = "LicenseKey"
		Me.LoggingLevel = WebScraper.LogLevel.All
		Me.WorkingDirectory = AppSetting.GetAppRoot() & "\MovieSample\Output\"

		' Request homepage content for scraping
		Me.Request("www.website.com", AddressOf Parse)
	End Sub

	Public Overrides Sub Parse(ByVal response As Response)
		' Iterate over each movie div within the featured movie section
		For Each div In response.Css("#movie-featured > div")
			If div.Attributes("class") <> "clearfix" Then
				Dim movieId = Convert.ToInt32(div.GetAttribute("data-movie-id"))
				Dim link = div.Css("a")(0)
				Dim movieTitle = link.TextContentClean

				' Scrape and store movie data as key-value pairs
				Scrape(New ScrapedData() From {
					{ "MovieId", movieId },
					{ "MovieTitle", movieTitle }
				},
				"Movie.Jsonl")
			End If
		Next div
	End Sub
End Class
$vbLabelText   $csharpLabel

"工作目录"属性有何用途?

此代码有何更新?

Working Directory 属性用于设置所有抓取数据及相关文件的主工作目录。 这确保所有输出文件都集中存储在一个位置,从而更便于管理大规模的抓取项目。 如果该目录不存在,系统将自动创建。

何时应使用 CSS 选择器,何时应使用属性?

其他注意事项:

当需要根据元素的结构位置或类名定位时,CSS 选择器是理想的选择;而直接访问属性则更适合提取 ID 或自定义数据属性等特定值。 在本例中,我们使用 CSS 选择器 (#movie-featured &gt; div) 来遍历 DOM 结构,并利用属性 (``) 提取特定值。

如何为抓取的数据创建类型化对象?

构建类型化对象,以格式化对象的形式存储抓取的数据。 使用强类型对象可实现更佳的代码组织、IntelliSense 支持以及编译时类型检查。

实现一个 `` 类,用于存储格式化数据:

public class Movie
{
    public int Id { get; set; }
    public string Title { get; set; }
    public string URL { get; set; }
}
public class Movie
{
    public int Id { get; set; }
    public string Title { get; set; }
    public string URL { get; set; }
}
Public Class Movie
    Public Property Id As Integer
    Public Property Title As String
    Public Property URL As String
End Class
$vbLabelText   $csharpLabel

使用类型化对象如何改善数据组织?

请更新代码,使用类型化的 类,而非通用的 字典:

public class MovieScraper : WebScraper
{
    public override void Init()
    {
        // Initialize scraper settings
        License.LicenseKey = "LicenseKey";
        this.LoggingLevel = WebScraper.LogLevel.All;
        this.WorkingDirectory = AppSetting.GetAppRoot() + @"\MovieSample\Output\";

        // Request homepage content for scraping
        this.Request("https://website.com/", Parse);
    }

    public override void Parse(Response response)
    {
        // Iterate over each movie div within the featured movie section
        foreach (var div in response.Css("#movie-featured > div"))
        {
            if (div.Attributes["class"] != "clearfix")
            {
                var movie = new Movie
                {
                    Id = Convert.ToInt32(div.GetAttribute("data-movie-id"))
                };

                var link = div.Css("a")[0];
                movie.Title = link.TextContentClean;
                movie.URL = link.Attributes["href"];

                // Scrape and store movie object
                Scrape(movie, "Movie.Jsonl");
            }
        }
    }
}
public class MovieScraper : WebScraper
{
    public override void Init()
    {
        // Initialize scraper settings
        License.LicenseKey = "LicenseKey";
        this.LoggingLevel = WebScraper.LogLevel.All;
        this.WorkingDirectory = AppSetting.GetAppRoot() + @"\MovieSample\Output\";

        // Request homepage content for scraping
        this.Request("https://website.com/", Parse);
    }

    public override void Parse(Response response)
    {
        // Iterate over each movie div within the featured movie section
        foreach (var div in response.Css("#movie-featured > div"))
        {
            if (div.Attributes["class"] != "clearfix")
            {
                var movie = new Movie
                {
                    Id = Convert.ToInt32(div.GetAttribute("data-movie-id"))
                };

                var link = div.Css("a")[0];
                movie.Title = link.TextContentClean;
                movie.URL = link.Attributes["href"];

                // Scrape and store movie object
                Scrape(movie, "Movie.Jsonl");
            }
        }
    }
}
Public Class MovieScraper
	Inherits WebScraper

	Public Overrides Sub Init()
		' Initialize scraper settings
		License.LicenseKey = "LicenseKey"
		Me.LoggingLevel = WebScraper.LogLevel.All
		Me.WorkingDirectory = AppSetting.GetAppRoot() & "\MovieSample\Output\"

		' Request homepage content for scraping
		Me.Request("https://website.com/", AddressOf Parse)
	End Sub

	Public Overrides Sub Parse(ByVal response As Response)
		' Iterate over each movie div within the featured movie section
		For Each div In response.Css("#movie-featured > div")
			If div.Attributes("class") <> "clearfix" Then
				Dim movie As New Movie With {.Id = Convert.ToInt32(div.GetAttribute("data-movie-id"))}

				Dim link = div.Css("a")(0)
				movie.Title = link.TextContentClean
				movie.URL = link.Attributes("href")

				' Scrape and store movie object
				Scrape(movie, "Movie.Jsonl")
			End If
		Next div
	End Sub
End Class
$vbLabelText   $csharpLabel

Scrape 方法对类型化对象采用何种格式?

有什么新内容?

  1. 我们实现了 `` 类来存储抓取的数据,从而提供了类型安全并改善了代码组织结构。
  2. 我们将电影对象传递给 `` 方法,该方法能识别我们的格式,并按如下所示以规定的方式进行保存:

记事本窗口显示的 JSON 电影数据库,其中包含结构化的电影数据,包括标题、URL 和元数据字段

输出内容会自动序列化为 JSON 格式,便于导入数据库或其他应用程序。

如何抓取详细的电影页面?

开始抓取更详细的页面。 多页抓取是常见需求,而 IronWebscraper 通过其请求链机制使其变得简单易行。

我还能从详情页提取哪些其他数据?

电影页面如下所示,包含每部电影的丰富元数据:

《银河护卫队2》电影信息页面,包含海报及影片详情(包括演员阵容和评分)

<div class="mvi-content">
    <div class="thumb mvic-thumb" 
         style="background-image: url(https://img.gocdn.online/2017/04/28/poster/5a08e94ba02118f22dc30f298c603210-guardians-of-the-galaxy-vol-2.jpg);"></div>
    <div class="mvic-desc">
        <h3>Guardians of the Galaxy Vol. 2</h3>        
        <div class="desc">
            Set to the backdrop of Awesome Mixtape #2, Marvel's Guardians of the Galaxy Vol. 2 continues the team's adventures as they travel throughout the cosmos to help Peter Quill learn more about his true parentage.
        </div>
        <div class="mvic-info">
            <div class="mvici-left">
                <p>
                    <strong>Genre: </strong>
                    <a href="https://Domain/genre/action/" title="Action">Action</a>,
                    <a href="https://Domain/genre/adventure/" title="Adventure">Adventure</a>,
                    <a href="https://Domain/genre/sci-fi/" title="Sci-Fi">Sci-Fi</a>
                </p>
                <p>
                    <strong>Actor: </strong>
                    <a target="_blank" href="https://Domain/actor/chris-pratt" title="Chris Pratt">Chris Pratt</a>,
                    <a target="_blank" href="https://Domain/actor/-zoe-saldana" title="Zoe Saldana">Zoe Saldana</a>,
                    <a target="_blank" href="https://Domain/actor/-dave-bautista-" title="Dave Bautista">Dave Bautista</a>
                </p>
                <p>
                    <strong>Director: </strong>
                    <a href="#" title="James Gunn">James Gunn</a>
                </p>
                <p>
                    <strong>Country: </strong>
                    <a href="https://Domain/country/us" title="United States">United States</a>
                </p>
            </div>
            <div class="mvici-right">
                <p><strong>Duration:</strong> 136 min</p>
                <p><strong>Quality:</strong> <span class="quality">CAM</span></p>
                <p><strong>Release:</strong> 2017</p>
                <p><strong>IMDb:</strong> 8.3</p>
            </div>
            <div class="clearfix"></div>
        </div>
        <div class="clearfix"></div>
    </div>
    <div class="clearfix"></div>
</div>
<div class="mvi-content">
    <div class="thumb mvic-thumb" 
         style="background-image: url(https://img.gocdn.online/2017/04/28/poster/5a08e94ba02118f22dc30f298c603210-guardians-of-the-galaxy-vol-2.jpg);"></div>
    <div class="mvic-desc">
        <h3>Guardians of the Galaxy Vol. 2</h3>        
        <div class="desc">
            Set to the backdrop of Awesome Mixtape #2, Marvel's Guardians of the Galaxy Vol. 2 continues the team's adventures as they travel throughout the cosmos to help Peter Quill learn more about his true parentage.
        </div>
        <div class="mvic-info">
            <div class="mvici-left">
                <p>
                    <strong>Genre: </strong>
                    <a href="https://Domain/genre/action/" title="Action">Action</a>,
                    <a href="https://Domain/genre/adventure/" title="Adventure">Adventure</a>,
                    <a href="https://Domain/genre/sci-fi/" title="Sci-Fi">Sci-Fi</a>
                </p>
                <p>
                    <strong>Actor: </strong>
                    <a target="_blank" href="https://Domain/actor/chris-pratt" title="Chris Pratt">Chris Pratt</a>,
                    <a target="_blank" href="https://Domain/actor/-zoe-saldana" title="Zoe Saldana">Zoe Saldana</a>,
                    <a target="_blank" href="https://Domain/actor/-dave-bautista-" title="Dave Bautista">Dave Bautista</a>
                </p>
                <p>
                    <strong>Director: </strong>
                    <a href="#" title="James Gunn">James Gunn</a>
                </p>
                <p>
                    <strong>Country: </strong>
                    <a href="https://Domain/country/us" title="United States">United States</a>
                </p>
            </div>
            <div class="mvici-right">
                <p><strong>Duration:</strong> 136 min</p>
                <p><strong>Quality:</strong> <span class="quality">CAM</span></p>
                <p><strong>Release:</strong> 2017</p>
                <p><strong>IMDb:</strong> 8.3</p>
            </div>
            <div class="clearfix"></div>
        </div>
        <div class="clearfix"></div>
    </div>
    <div class="clearfix"></div>
</div>
HTML

如何扩展我的 Movie 类以添加更多属性?

类添加新属性 (, ,, ) 但仅使用。 使用 `` 处理类型和演员信息,可优雅地处理多个值:

using System.Co/llections.Generic;

public class Movie
{
    public int Id { get; set; }
    public string Title { get; set; }
    public string URL { get; set; }
    public string Description { get; set; }
    public List<string> Genre { get; set; }
    public List<string> Actor { get; set; }
}
using System.Co/llections.Generic;

public class Movie
{
    public int Id { get; set; }
    public string Title { get; set; }
    public string URL { get; set; }
    public string Description { get; set; }
    public List<string> Genre { get; set; }
    public List<string> Actor { get; set; }
}
Imports System.Collections.Generic

Public Class Movie
    Public Property Id As Integer
    Public Property Title As String
    Public Property URL As String
    Public Property Description As String
    Public Property Genre As List(Of String)
    Public Property Actor As List(Of String)
End Class
$vbLabelText   $csharpLabel

在抓取过程中如何在页面间导航?

导航至详情页进行抓取。 IronWebscraper 会自动处理线程安全问题,从而支持并行处理多个页面。

为何针对不同页面类型使用多种解析函数?

IronWebScraper 支持添加多个抓取函数,以处理不同格式的网页。 这种关注点的分离使您的代码更易于维护,并能妥善处理不同的页面结构。 每个解析函数可专注于从特定类型的页面中提取数据。

元数据如何帮助在解析函数之间传递对象?

MetaData 功能对于在请求之间保持状态至关重要。 如需了解更高级的WEBSCRAPER功能,请查阅我们的详细指南:

public class MovieScraper : WebScraper
{
    public override void Init()
    {
        // Initialize scraper settings
        License.LicenseKey = "LicenseKey";
        this.LoggingLevel = WebScraper.LogLevel.All;
        this.WorkingDirectory = AppSetting.GetAppRoot() + @"\MovieSample\Output\";

        // Request homepage content for scraping
        this.Request("https://domain/", Parse);
    }

    public override void Parse(Response response)
    {
        // Iterate over each movie div within the featured movie section
        foreach (var div in response.Css("#movie-featured > div"))
        {
            if (div.Attributes["class"] != "clearfix")
            {
                var movie = new Movie
                {
                    Id = Convert.ToInt32(div.GetAttribute("data-movie-id"))
                };

                var link = div.Css("a")[0];
                movie.Title = link.TextContentClean;
                movie.URL = link.Attributes["href"];

                // Request detailed page
                this.Request(movie.URL, ParseDetails, new MetaData() { { "movie", movie } });
            }
        }           
    }

    public void ParseDetails(Response response)
    {
        // Retrieve movie object from metadata
        var movie = response.MetaData.Get<Movie>("movie");
        var div = response.Css("div.mvic-desc")[0];

        // Extract description
        movie.Description = div.Css("div.desc")[0].TextContentClean;

        // Extract genres
        movie.Genre = new List<string>(); // Initialize genre list
        foreach(var genre in div.Css("div > p > a"))
        {
            movie.Genre.Add(genre.TextContentClean);
        }

        // Extract actors
        movie.Actor = new List<string>(); // Initialize actor list
        foreach (var actor in div.Css("div > p:nth-child(2) > a"))
        {
            movie.Actor.Add(actor.TextContentClean);
        }

        // Scrape and store detailed movie data
        Scrape(movie, "Movie.Jsonl");
    }
}
public class MovieScraper : WebScraper
{
    public override void Init()
    {
        // Initialize scraper settings
        License.LicenseKey = "LicenseKey";
        this.LoggingLevel = WebScraper.LogLevel.All;
        this.WorkingDirectory = AppSetting.GetAppRoot() + @"\MovieSample\Output\";

        // Request homepage content for scraping
        this.Request("https://domain/", Parse);
    }

    public override void Parse(Response response)
    {
        // Iterate over each movie div within the featured movie section
        foreach (var div in response.Css("#movie-featured > div"))
        {
            if (div.Attributes["class"] != "clearfix")
            {
                var movie = new Movie
                {
                    Id = Convert.ToInt32(div.GetAttribute("data-movie-id"))
                };

                var link = div.Css("a")[0];
                movie.Title = link.TextContentClean;
                movie.URL = link.Attributes["href"];

                // Request detailed page
                this.Request(movie.URL, ParseDetails, new MetaData() { { "movie", movie } });
            }
        }           
    }

    public void ParseDetails(Response response)
    {
        // Retrieve movie object from metadata
        var movie = response.MetaData.Get<Movie>("movie");
        var div = response.Css("div.mvic-desc")[0];

        // Extract description
        movie.Description = div.Css("div.desc")[0].TextContentClean;

        // Extract genres
        movie.Genre = new List<string>(); // Initialize genre list
        foreach(var genre in div.Css("div > p > a"))
        {
            movie.Genre.Add(genre.TextContentClean);
        }

        // Extract actors
        movie.Actor = new List<string>(); // Initialize actor list
        foreach (var actor in div.Css("div > p:nth-child(2) > a"))
        {
            movie.Actor.Add(actor.TextContentClean);
        }

        // Scrape and store detailed movie data
        Scrape(movie, "Movie.Jsonl");
    }
}
Public Class MovieScraper
	Inherits WebScraper

	Public Overrides Sub Init()
		' Initialize scraper settings
		License.LicenseKey = "LicenseKey"
		Me.LoggingLevel = WebScraper.LogLevel.All
		Me.WorkingDirectory = AppSetting.GetAppRoot() & "\MovieSample\Output\"

		' Request homepage content for scraping
		Me.Request("https://domain/", AddressOf Parse)
	End Sub

	Public Overrides Sub Parse(ByVal response As Response)
		' Iterate over each movie div within the featured movie section
		For Each div In response.Css("#movie-featured > div")
			If div.Attributes("class") <> "clearfix" Then
				Dim movie As New Movie With {.Id = Convert.ToInt32(div.GetAttribute("data-movie-id"))}

				Dim link = div.Css("a")(0)
				movie.Title = link.TextContentClean
				movie.URL = link.Attributes("href")

				' Request detailed page
				Me.Request(movie.URL, AddressOf ParseDetails, New MetaData() From {
					{ "movie", movie }
				})
			End If
		Next div
	End Sub

	Public Sub ParseDetails(ByVal response As Response)
		' Retrieve movie object from metadata
		Dim movie = response.MetaData.Get(Of Movie)("movie")
		Dim div = response.Css("div.mvic-desc")(0)

		' Extract description
		movie.Description = div.Css("div.desc")(0).TextContentClean

		' Extract genres
		movie.Genre = New List(Of String)() ' Initialize genre list
		For Each genre In div.Css("div > p > a")
			movie.Genre.Add(genre.TextContentClean)
		Next genre

		' Extract actors
		movie.Actor = New List(Of String)() ' Initialize actor list
		For Each actor In div.Css("div > p:nth-child(2) > a")
			movie.Actor.Add(actor.TextContentClean)
		Next actor

		' Scrape and store detailed movie data
		Scrape(movie, "Movie.Jsonl")
	End Sub
End Class
$vbLabelText   $csharpLabel

这种多页抓取方法有哪些关键特点?

有什么新内容?

  1. 添加抓取功能(例如:``),用于抓取详情页面,类似于从购物网站抓取数据时采用的技术。
  2. 将生成文件的 `` 函数移至新函数中,确保仅在收集完所有详细信息后才保存数据。
  3. 使用 IronWebscraper 的功能 (``),将电影对象传递给新的抓取函数,从而在跨请求过程中保持对象状态。
  4. 抓取页面并将电影对象数据保存为包含完整信息的文件。

记事本窗口显示的 JSON 电影数据库,包含电影标题、简介、URL 及类型元数据

有关可用方法和属性的更多信息,请参阅 API 参考文档。 IronWebscraper 提供了一个从网站中提取结构化数据的强大框架,使其成为数据收集和分析项目中不可或缺的工具。

常见问题解答

如何使用 C# 从 HTML 中提取电影标题?

IronWebScraper 提供了从 HTML 中提取电影标题的 CSS 选择器方法。使用 response.Css() 方法和适当的选择器(如".movie-item h2")来锁定标题元素,然后访问 TextContentClean 属性来获取干净的文本值。

在多个电影页面之间导航的最佳方式是什么?

IronWebScraper 通过 Request() 方法处理页面导航。您可以使用 CSS 选择器提取分页链接,然后使用每个 URL 调用 Request() 从多个页面中抓取数据,自动构建全面的电影数据集。

如何以结构化格式保存刮擦的电影数据?

使用 IronWebScraper 的 Scrape() 方法以 JSON 格式保存数据。创建包含标题、URL 和评分等电影属性的匿名对象或类型类,然后将它们连同文件名一起传递给 Scrape(),以自动序列化和保存数据。

我应该使用哪些 CSS 选择器来提取电影信息?

IronWebScraper 支持标准 CSS 选择器。对于电影网站,可使用".movie-item "等选择器来表示容器,"h2 "表示标题,"a[href]"表示链接,以及特定的类名表示评级或流派。Css() 方法会返回可以遍历的集合。

如何处理刮擦数据中的 "CAM "等电影质量指标?

通过 IronWebScraper,您可以针对其特定的 HTML 元素提取和处理质量指标。使用 CSS 选择器定位质量徽章或文本,然后将它们作为属性包含在您的刮擦数据对象中,以获得全面的电影信息。

我能否为我的电影搜索操作设置日志记录?

是的,IronWebScraper 包含内置日志功能。在 Init() 方法中将 LoggingLevel 属性设置为 LogLevel.All,以跟踪所有刮擦活动、错误和进度,这有助于调试和监控电影数据提取。

配置刮擦数据工作目录的正确方法是什么?

IronWebScraper 可让你在 Init() 方法中设置一个 WorkingDirectory 属性。指定一个类似于 "C:\MovieData\Output\"的路径,用于保存刮擦的电影数据文件。这样可以集中管理输出,让你的数据井井有条。

如何正确继承 WebScraper 类?

创建一个继承自 IronWebScraper 的 WebScraper 基类的新类。覆盖用于配置的 Init() 方法和用于数据提取逻辑的 Parse() 方法。这种面向对象的方法使你的电影刮刀可重复使用并易于维护。

Curtis Chau
技术作家

Curtis Chau 拥有卡尔顿大学的计算机科学学士学位,专注于前端开发,精通 Node.js、TypeScript、JavaScript 和 React。他热衷于打造直观且美观的用户界面,喜欢使用现代框架并创建结构良好、视觉吸引力强的手册。

除了开发之外,Curtis 对物联网 (IoT) 有浓厚的兴趣,探索将硬件和软件集成的新方法。在空闲时间,他喜欢玩游戏和构建 Discord 机器人,将他对技术的热爱与创造力相结合。

准备开始了吗?
Nuget 下载 137,906 | 版本: 2026.6 just released
Still Scrolling Icon

还在滚动吗?

想快速获得证据? PM > Install-Package IronWebScraper
运行示例 观看您的目标网站成为结构化数据。