抓取在线电影网站

Darrius Serrant
Darrius Serrant
2023年一月19日
更新 2024年十月20日
分享:
This article was translated from English: Does it need improvement?
Translated
View the article in English

让我们从一个真实世界的网站开始另一个示例。我们将选择抓取一个电影网站。

让我们添加一个新类并将其命名为“MovieScraper”:

MovieScrapaerAddClass related to 抓取在线电影网站

现在,让我们来看看我们要搜索的网站:

123movies related to 抓取在线电影网站

这是我们在网站上看到的主页 HTML 的一部分:

<div id="movie-featured" class="movies-list movies-list-full tab-pane in fade active">
    <div data-movie-id="20746" class="ml-item">
        <a href="https://website.com/film/king-arthur-legend-of-the-sword-20746/">
            <span class="mli-quality">CAM</span>
            <img data-original="https://img.gocdn.online/2017/05/16/poster/2116d6719c710eabe83b377463230fbe-king-arthur-legend-of-the-sword.jpg" 
                 class="lazy thumb mli-thumb" alt="King Arthur: Legend of the Sword"
                  src="https://img.gocdn.online/2017/05/16/poster/2116d6719c710eabe83b377463230fbe-king-arthur-legend-of-the-sword.jpg" 
                 style="display: inline-block;">
            <span class="mli-info"><h2>King Arthur: Legend of the Sword</h2></span>
        </a>
    </div>
    <div data-movie-id="20724" class="ml-item">
        <a href="https://website.com/film/snatched-20724/" >
            <span class="mli-quality">CAM</span>
            <img data-original="https://img.gocdn.online/2017/05/16/poster/5ef66403dc331009bdb5aa37cfe819ba-snatched.jpg" 
                 class="lazy thumb mli-thumb" alt="Snatched" 
                 src="https://img.gocdn.online/2017/05/16/poster/5ef66403dc331009bdb5aa37cfe819ba-snatched.jpg" 
                 style="display: inline-block;">
            <span class="mli-info"><h2>Snatched</h2></span>
        </a>
    </div>
</div>
<div id="movie-featured" class="movies-list movies-list-full tab-pane in fade active">
    <div data-movie-id="20746" class="ml-item">
        <a href="https://website.com/film/king-arthur-legend-of-the-sword-20746/">
            <span class="mli-quality">CAM</span>
            <img data-original="https://img.gocdn.online/2017/05/16/poster/2116d6719c710eabe83b377463230fbe-king-arthur-legend-of-the-sword.jpg" 
                 class="lazy thumb mli-thumb" alt="King Arthur: Legend of the Sword"
                  src="https://img.gocdn.online/2017/05/16/poster/2116d6719c710eabe83b377463230fbe-king-arthur-legend-of-the-sword.jpg" 
                 style="display: inline-block;">
            <span class="mli-info"><h2>King Arthur: Legend of the Sword</h2></span>
        </a>
    </div>
    <div data-movie-id="20724" class="ml-item">
        <a href="https://website.com/film/snatched-20724/" >
            <span class="mli-quality">CAM</span>
            <img data-original="https://img.gocdn.online/2017/05/16/poster/5ef66403dc331009bdb5aa37cfe819ba-snatched.jpg" 
                 class="lazy thumb mli-thumb" alt="Snatched" 
                 src="https://img.gocdn.online/2017/05/16/poster/5ef66403dc331009bdb5aa37cfe819ba-snatched.jpg" 
                 style="display: inline-block;">
            <span class="mli-info"><h2>Snatched</h2></span>
        </a>
    </div>
</div>
HTML

正如我们所见,我们有电影ID、标题和详细页面链接。

让我们开始抓取这组数据:

public class MovieScraper : WebScraper
{
    public override void Init()
    {
        License.LicenseKey = "LicenseKey";
        this.LoggingLevel = WebScraper.LogLevel.All;
        this.WorkingDirectory = AppSetting.GetAppRoot() + @"\MovieSample\Output\";
        this.Request("www.website.com", Parse);
    }
    public override void Parse(Response response)
    {
        foreach (var Divs in response.Css("#movie-featured > div"))
        {
            if (Divs.Attributes ["class"] != "clearfix")
            {
                var MovieId = Divs.GetAttribute("data-movie-id");
                var link = Divs.Css("a")[0];
                var MovieTitle = link.TextContentClean;
                Scrape(new ScrapedData() { { "MovieId", MovieId }, { "MovieTitle", MovieTitle } }, "Movie.Jsonl");
            }
        }           
    }
}
public class MovieScraper : WebScraper
{
    public override void Init()
    {
        License.LicenseKey = "LicenseKey";
        this.LoggingLevel = WebScraper.LogLevel.All;
        this.WorkingDirectory = AppSetting.GetAppRoot() + @"\MovieSample\Output\";
        this.Request("www.website.com", Parse);
    }
    public override void Parse(Response response)
    {
        foreach (var Divs in response.Css("#movie-featured > div"))
        {
            if (Divs.Attributes ["class"] != "clearfix")
            {
                var MovieId = Divs.GetAttribute("data-movie-id");
                var link = Divs.Css("a")[0];
                var MovieTitle = link.TextContentClean;
                Scrape(new ScrapedData() { { "MovieId", MovieId }, { "MovieTitle", MovieTitle } }, "Movie.Jsonl");
            }
        }           
    }
}
Public Class MovieScraper
	Inherits WebScraper

	Public Overrides Sub Init()
		License.LicenseKey = "LicenseKey"
		Me.LoggingLevel = WebScraper.LogLevel.All
		Me.WorkingDirectory = AppSetting.GetAppRoot() & "\MovieSample\Output\"
		Me.Request("www.website.com", AddressOf Parse)
	End Sub
	Public Overrides Sub Parse(ByVal response As Response)
		For Each Divs In response.Css("#movie-featured > div")
			If Divs.Attributes ("class") <> "clearfix" Then
				Dim MovieId = Divs.GetAttribute("data-movie-id")
				Dim link = Divs.Css("a")(0)
				Dim MovieTitle = link.TextContentClean
				Scrape(New ScrapedData() From {
					{ "MovieId", MovieId },
					{ "MovieTitle", MovieTitle }
				},
				"Movie.Jsonl")
			End If
		Next Divs
	End Sub
End Class
$vbLabelText   $csharpLabel

此代码有什么新功能?

工作目录属性用于设置所有抓取数据及其相关文件的主工作目录。

让我们做更多。

如果我们需要构建类型化对象来保存格式化对象中的抓取数据,该怎么办?

让我们实现一个电影类,用于存放我们的格式化数据:

public class Movie
{
    public int Id { get; set; }
    public string Title { get; set; }
    public string URL { get; set; }

}
public class Movie
{
    public int Id { get; set; }
    public string Title { get; set; }
    public string URL { get; set; }

}
Public Class Movie
	Public Property Id() As Integer
	Public Property Title() As String
	Public Property URL() As String

End Class
$vbLabelText   $csharpLabel

现在,我们将更新代码:

public class MovieScraper : WebScraper
{
    public override void Init()
    {
        License.LicenseKey = "LicenseKey";
        this.LoggingLevel = WebScraper.LogLevel.All;
        this.WorkingDirectory = AppSetting.GetAppRoot() + @"\MovieSample\Output\";
        this.Request("https://website.com/", Parse);
    }
    public override void Parse(Response response)
    {
        foreach (var Divs in response.Css("#movie-featured > div"))
        {
            if (Divs.Attributes ["class"] != "clearfix")
            {
                var movie = new Movie();
                movie.Id = Convert.ToInt32( Divs.GetAttribute("data-movie-id"));
                var link = Divs.Css("a")[0];
                movie.Title = link.TextContentClean;
                movie.URL = link.Attributes ["href"];
                Scrape(movie, "Movie.Jsonl");
            }
        }
    }
}
public class MovieScraper : WebScraper
{
    public override void Init()
    {
        License.LicenseKey = "LicenseKey";
        this.LoggingLevel = WebScraper.LogLevel.All;
        this.WorkingDirectory = AppSetting.GetAppRoot() + @"\MovieSample\Output\";
        this.Request("https://website.com/", Parse);
    }
    public override void Parse(Response response)
    {
        foreach (var Divs in response.Css("#movie-featured > div"))
        {
            if (Divs.Attributes ["class"] != "clearfix")
            {
                var movie = new Movie();
                movie.Id = Convert.ToInt32( Divs.GetAttribute("data-movie-id"));
                var link = Divs.Css("a")[0];
                movie.Title = link.TextContentClean;
                movie.URL = link.Attributes ["href"];
                Scrape(movie, "Movie.Jsonl");
            }
        }
    }
}
Public Class MovieScraper
	Inherits WebScraper

	Public Overrides Sub Init()
		License.LicenseKey = "LicenseKey"
		Me.LoggingLevel = WebScraper.LogLevel.All
		Me.WorkingDirectory = AppSetting.GetAppRoot() & "\MovieSample\Output\"
		Me.Request("https://website.com/", AddressOf Parse)
	End Sub
	Public Overrides Sub Parse(ByVal response As Response)
		For Each Divs In response.Css("#movie-featured > div")
			If Divs.Attributes ("class") <> "clearfix" Then
				Dim movie As New Movie()
				movie.Id = Convert.ToInt32(Divs.GetAttribute("data-movie-id"))
				Dim link = Divs.Css("a")(0)
				movie.Title = link.TextContentClean
				movie.URL = link.Attributes ("href")
				Scrape(movie, "Movie.Jsonl")
			End If
		Next Divs
	End Sub
End Class
$vbLabelText   $csharpLabel

有什么新功能?

  1. 我们实现电影类来存储我们抓取的数据。

  2. 我们将电影对象传递给Scrape方法,它理解我们的格式并按照我们在这里看到的定义格式进行保存:

    MovieResultMovieClass related to 抓取在线电影网站

    让我们开始抓取一个更详细的页面。

电影页面如下:

MovieDetailsSample related to 抓取在线电影网站

```html

Guardians of the Galaxy Vol. 2

Set to the backdrop of Awesome Mixtape #2, Marvel's Guardians of the Galaxy Vol. 2 continues the team's adventures as they travel throughout the cosmos to help Peter Quill learn more about his true parentage.

Duration: 136 min

Quality: CAM

Release: 2017

IMDb: 8.3

``` 我们可以通过新属性(描述、类型、演员、导演、国家、时长、IMDB评分)扩展我们的电影类,但我们将仅使用(描述、类型、演员)用于我们的示例。 ```cs public class Movie { public int Id { get; set; } public string Title { get; set; } public string URL { get; set; } public string Description { get; set; } public ListGenre { get; set; } public ListActor { get; set; } } ``` 现在我们将导航到详细页面进行抓取。 IronWebScraper 允许您添加更多的抓取功能,以抓取不同类型的页面格式。 如我们所见: ```cs public class MovieScraper : WebScraper { public override void Init() { License.LicenseKey = "LicenseKey"; this.LoggingLevel = WebScraper.LogLevel.All; this.WorkingDirectory = AppSetting.GetAppRoot() + @"\MovieSample\Output\"; this.Request("https://domain/", Parse); } public override void Parse(Response response) { foreach (var Divs in response.Css("#movie-featured > div")) { if (Divs.Attributes ["class"] != "clearfix") { var movie = new Movie(); movie.Id = Convert.ToInt32( Divs.GetAttribute("data-movie-id")); var link = Divs.Css("a")[0]; movie.Title = link.TextContentClean; movie.URL = link.Attributes ["href"]; this.Request(movie.URL, ParseDetails, new MetaData() { { "movie", movie } });// to scrap Detailed Page } } } public void ParseDetails(Response response) { var movie = response.MetaData.Get("movie"); var Div = response.Css("div.mvic-desc")[0]; movie.Description = Div.Css("div.desc")[0].TextContentClean; foreach(var Genre in Div.Css("div > p > a")) { movie.Genre.Add(Genre.TextContentClean); } foreach (var Actor in Div.Css("div > p:nth-child(2) > a")) { movie.Actor.Add(Actor.TextContentClean); } Scrape(movie, "Movie.Jsonl"); } } ``` *有什么新功能?* 1. 我们可以添加抓取功能(ParseDetails)来抓取详细页面 2. 我们将生成文件的 Scrape 功能移至新功能。 3. 我们使用了IronWebscraper功能(MetaData)将我们的电影对象传递给新的抓取功能 4. 我们爬取了页面并将我们的电影对象数据保存到了一个文件中。

MovieResultMovieClass1 related to 抓取在线电影网站

Darrius Serrant
全栈软件工程师(WebOps)

达瑞乌斯·塞兰特拥有迈阿密大学计算机科学学士学位,目前在Iron Software担任全栈WebOps营销工程师。从小对编码的热爱使他认为计算机既神秘又易接近,成为创意和解决问题的完美媒介。

在Iron Software,达瑞乌斯乐于创造新事物并简化复杂概念,使其更易于理解。作为我们在职开发者之一,他还自愿教授学生,将他的专业知识传授给下一代。

对达瑞乌斯而言,他的工作之所以令人满足,是因为它具有价值并产生了真正的影响。