抓取在线电影网站

Darrius Serrant

2023年一月19日

更新 2024年十月20日

Translated

View the article in English

让我们从一个真实世界的网站开始另一个示例。我们将选择抓取一个电影网站。

让我们添加一个新类并将其命名为“MovieScraper”：

现在，让我们来看看我们要搜索的网站：

这是我们在网站上看到的主页 HTML 的一部分：

<div id="movie-featured" class="movies-list movies-list-full tab-pane in fade active">
    <div data-movie-id="20746" class="ml-item">
        <a href="https://website.com/film/king-arthur-legend-of-the-sword-20746/">
            <span class="mli-quality">CAM</span>
            <img data-original="https://img.gocdn.online/2017/05/16/poster/2116d6719c710eabe83b377463230fbe-king-arthur-legend-of-the-sword.jpg" 
                 class="lazy thumb mli-thumb" alt="King Arthur: Legend of the Sword"
                  src="https://img.gocdn.online/2017/05/16/poster/2116d6719c710eabe83b377463230fbe-king-arthur-legend-of-the-sword.jpg" 
                 style="display: inline-block;">
            <span class="mli-info"><h2>King Arthur: Legend of the Sword</h2></span>
        </a>
    </div>
    <div data-movie-id="20724" class="ml-item">
        <a href="https://website.com/film/snatched-20724/" >
            <span class="mli-quality">CAM</span>
            <img data-original="https://img.gocdn.online/2017/05/16/poster/5ef66403dc331009bdb5aa37cfe819ba-snatched.jpg" 
                 class="lazy thumb mli-thumb" alt="Snatched" 
                 src="https://img.gocdn.online/2017/05/16/poster/5ef66403dc331009bdb5aa37cfe819ba-snatched.jpg" 
                 style="display: inline-block;">
            <span class="mli-info"><h2>Snatched</h2></span>
        </a>
    </div>
</div>

<div id="movie-featured" class="movies-list movies-list-full tab-pane in fade active">
    <div data-movie-id="20746" class="ml-item">
        <a href="https://website.com/film/king-arthur-legend-of-the-sword-20746/">
            <span class="mli-quality">CAM</span>
            <img data-original="https://img.gocdn.online/2017/05/16/poster/2116d6719c710eabe83b377463230fbe-king-arthur-legend-of-the-sword.jpg" 
                 class="lazy thumb mli-thumb" alt="King Arthur: Legend of the Sword"
                  src="https://img.gocdn.online/2017/05/16/poster/2116d6719c710eabe83b377463230fbe-king-arthur-legend-of-the-sword.jpg" 
                 style="display: inline-block;">
            <span class="mli-info"><h2>King Arthur: Legend of the Sword</h2></span>
        </a>
    </div>
    <div data-movie-id="20724" class="ml-item">
        <a href="https://website.com/film/snatched-20724/" >
            <span class="mli-quality">CAM</span>
            <img data-original="https://img.gocdn.online/2017/05/16/poster/5ef66403dc331009bdb5aa37cfe819ba-snatched.jpg" 
                 class="lazy thumb mli-thumb" alt="Snatched" 
                 src="https://img.gocdn.online/2017/05/16/poster/5ef66403dc331009bdb5aa37cfe819ba-snatched.jpg" 
                 style="display: inline-block;">
            <span class="mli-info"><h2>Snatched</h2></span>
        </a>
    </div>
</div>

HTML

正如我们所见，我们有电影ID、标题和详细页面链接。

让我们开始抓取这组数据：

public class MovieScraper : WebScraper
{
    public override void Init()
    {
        License.LicenseKey = "LicenseKey";
        this.LoggingLevel = WebScraper.LogLevel.All;
        this.WorkingDirectory = AppSetting.GetAppRoot() + @"\MovieSample\Output\";
        this.Request("www.website.com", Parse);
    }
    public override void Parse(Response response)
    {
        foreach (var Divs in response.Css("#movie-featured > div"))
        {
            if (Divs.Attributes ["class"] != "clearfix")
            {
                var MovieId = Divs.GetAttribute("data-movie-id");
                var link = Divs.Css("a")[0];
                var MovieTitle = link.TextContentClean;
                Scrape(new ScrapedData() { { "MovieId", MovieId }, { "MovieTitle", MovieTitle } }, "Movie.Jsonl");
            }
        }           
    }
}

public class MovieScraper : WebScraper
{
    public override void Init()
    {
        License.LicenseKey = "LicenseKey";
        this.LoggingLevel = WebScraper.LogLevel.All;
        this.WorkingDirectory = AppSetting.GetAppRoot() + @"\MovieSample\Output\";
        this.Request("www.website.com", Parse);
    }
    public override void Parse(Response response)
    {
        foreach (var Divs in response.Css("#movie-featured > div"))
        {
            if (Divs.Attributes ["class"] != "clearfix")
            {
                var MovieId = Divs.GetAttribute("data-movie-id");
                var link = Divs.Css("a")[0];
                var MovieTitle = link.TextContentClean;
                Scrape(new ScrapedData() { { "MovieId", MovieId }, { "MovieTitle", MovieTitle } }, "Movie.Jsonl");
            }
        }           
    }
}

Public Class MovieScraper
	Inherits WebScraper

	Public Overrides Sub Init()
		License.LicenseKey = "LicenseKey"
		Me.LoggingLevel = WebScraper.LogLevel.All
		Me.WorkingDirectory = AppSetting.GetAppRoot() & "\MovieSample\Output\"
		Me.Request("www.website.com", AddressOf Parse)
	End Sub
	Public Overrides Sub Parse(ByVal response As Response)
		For Each Divs In response.Css("#movie-featured > div")
			If Divs.Attributes ("class") <> "clearfix" Then
				Dim MovieId = Divs.GetAttribute("data-movie-id")
				Dim link = Divs.Css("a")(0)
				Dim MovieTitle = link.TextContentClean
				Scrape(New ScrapedData() From {
					{ "MovieId", MovieId },
					{ "MovieTitle", MovieTitle }
				},
				"Movie.Jsonl")
			End If
		Next Divs
	End Sub
End Class

$vbLabelText $csharpLabel

此代码有什么新功能？

工作目录属性用于设置所有抓取数据及其相关文件的主工作目录。

让我们做更多。

如果我们需要构建类型化对象来保存格式化对象中的抓取数据，该怎么办？

让我们实现一个电影类，用于存放我们的格式化数据：

public class Movie
{
    public int Id { get; set; }
    public string Title { get; set; }
    public string URL { get; set; }

}

public class Movie
{
    public int Id { get; set; }
    public string Title { get; set; }
    public string URL { get; set; }

}

Public Class Movie
	Public Property Id() As Integer
	Public Property Title() As String
	Public Property URL() As String

End Class

$vbLabelText $csharpLabel

现在，我们将更新代码：

public class MovieScraper : WebScraper
{
    public override void Init()
    {
        License.LicenseKey = "LicenseKey";
        this.LoggingLevel = WebScraper.LogLevel.All;
        this.WorkingDirectory = AppSetting.GetAppRoot() + @"\MovieSample\Output\";
        this.Request("https://website.com/", Parse);
    }
    public override void Parse(Response response)
    {
        foreach (var Divs in response.Css("#movie-featured > div"))
        {
            if (Divs.Attributes ["class"] != "clearfix")
            {
                var movie = new Movie();
                movie.Id = Convert.ToInt32( Divs.GetAttribute("data-movie-id"));
                var link = Divs.Css("a")[0];
                movie.Title = link.TextContentClean;
                movie.URL = link.Attributes ["href"];
                Scrape(movie, "Movie.Jsonl");
            }
        }
    }
}

public class MovieScraper : WebScraper
{
    public override void Init()
    {
        License.LicenseKey = "LicenseKey";
        this.LoggingLevel = WebScraper.LogLevel.All;
        this.WorkingDirectory = AppSetting.GetAppRoot() + @"\MovieSample\Output\";
        this.Request("https://website.com/", Parse);
    }
    public override void Parse(Response response)
    {
        foreach (var Divs in response.Css("#movie-featured > div"))
        {
            if (Divs.Attributes ["class"] != "clearfix")
            {
                var movie = new Movie();
                movie.Id = Convert.ToInt32( Divs.GetAttribute("data-movie-id"));
                var link = Divs.Css("a")[0];
                movie.Title = link.TextContentClean;
                movie.URL = link.Attributes ["href"];
                Scrape(movie, "Movie.Jsonl");
            }
        }
    }
}

Public Class MovieScraper
	Inherits WebScraper

	Public Overrides Sub Init()
		License.LicenseKey = "LicenseKey"
		Me.LoggingLevel = WebScraper.LogLevel.All
		Me.WorkingDirectory = AppSetting.GetAppRoot() & "\MovieSample\Output\"
		Me.Request("https://website.com/", AddressOf Parse)
	End Sub
	Public Overrides Sub Parse(ByVal response As Response)
		For Each Divs In response.Css("#movie-featured > div")
			If Divs.Attributes ("class") <> "clearfix" Then
				Dim movie As New Movie()
				movie.Id = Convert.ToInt32(Divs.GetAttribute("data-movie-id"))
				Dim link = Divs.Css("a")(0)
				movie.Title = link.TextContentClean
				movie.URL = link.Attributes ("href")
				Scrape(movie, "Movie.Jsonl")
			End If
		Next Divs
	End Sub
End Class

$vbLabelText $csharpLabel

有什么新功能？

我们实现电影类来存储我们抓取的数据。
我们将电影对象传递给Scrape方法，它理解我们的格式并按照我们在这里看到的定义格式进行保存：
让我们开始抓取一个更详细的页面。

电影页面如下：

```html

Guardians of the Galaxy Vol. 2

Set to the backdrop of Awesome Mixtape #2, Marvel's Guardians of the Galaxy Vol. 2 continues the team's adventures as they travel throughout the cosmos to help Peter Quill learn more about his true parentage.

Genre: Action, Adventure, Sci-Fi

Actor: Chris Pratt, Zoe Saldana, Dave Bautista

Director: James Gunn

Country: United States

Duration: 136 min

Quality: CAM

Release: 2017

IMDb: 8.3

``` 我们可以通过新属性（描述、类型、演员、导演、国家、时长、IMDB评分）扩展我们的电影类，但我们将仅使用（描述、类型、演员）用于我们的示例。 ```cs public class Movie { public int Id { get; set; } public string Title { get; set; } public string URL { get; set; } public string Description { get; set; } public ListGenre { get; set; } public ListActor { get; set; } } ``` 现在我们将导航到详细页面进行抓取。 IronWebScraper 允许您添加更多的抓取功能，以抓取不同类型的页面格式。如我们所见： ```cs public class MovieScraper : WebScraper { public override void Init() { License.LicenseKey = "LicenseKey"; this.LoggingLevel = WebScraper.LogLevel.All; this.WorkingDirectory = AppSetting.GetAppRoot() + @"\MovieSample\Output\"; this.Request("https://domain/", Parse); } public override void Parse(Response response) { foreach (var Divs in response.Css("#movie-featured > div")) { if (Divs.Attributes ["class"] != "clearfix") { var movie = new Movie(); movie.Id = Convert.ToInt32( Divs.GetAttribute("data-movie-id")); var link = Divs.Css("a")[0]; movie.Title = link.TextContentClean; movie.URL = link.Attributes ["href"]; this.Request(movie.URL, ParseDetails, new MetaData() { { "movie", movie } });// to scrap Detailed Page } } } public void ParseDetails(Response response) { var movie = response.MetaData.Get("movie"); var Div = response.Css("div.mvic-desc")[0]; movie.Description = Div.Css("div.desc")[0].TextContentClean; foreach(var Genre in Div.Css("div > p > a")) { movie.Genre.Add(Genre.TextContentClean); } foreach (var Actor in Div.Css("div > p:nth-child(2) > a")) { movie.Actor.Add(Actor.TextContentClean); } Scrape(movie, "Movie.Jsonl"); } } ``` *有什么新功能？* 1. 我们可以添加抓取功能（ParseDetails）来抓取详细页面 2. 我们将生成文件的 Scrape 功能移至新功能。 3. 我们使用了IronWebscraper功能（MetaData）将我们的电影对象传递给新的抓取功能 4. 我们爬取了页面并将我们的电影对象数据保存到了一个文件中。