Scraping an Online Movie Website

Darrius Serrant
Updated: June 10, 2025

Let's work through another example, this time with a real-world website: we will scrape a movie site.

Let's add a new class and name it "MovieScraper":

Now let's look at the site we are going to scrape:

This is part of the home page HTML we see on the site:

```html
<div id="movie-featured" class="movies-list movies-list-full tab-pane in fade active">
    <div data-movie-id="20746" class="ml-item">
        <a href="https://website.com/film/king-arthur-legend-of-the-sword-20746/">
            <span class="mli-quality">CAM</span>
            <img data-original="https://img.gocdn.online/2017/05/16/poster/2116d6719c710eabe83b377463230fbe-king-arthur-legend-of-the-sword.jpg"
                 class="lazy thumb mli-thumb" alt="King Arthur: Legend of the Sword"
                 src="https://img.gocdn.online/2017/05/16/poster/2116d6719c710eabe83b377463230fbe-king-arthur-legend-of-the-sword.jpg"
                 style="display: inline-block;">
            <span class="mli-info"><h2>King Arthur: Legend of the Sword</h2></span>
        </a>
    </div>
    <div data-movie-id="20724" class="ml-item">
        <a href="https://website.com/film/snatched-20724/">
            <span class="mli-quality">CAM</span>
            <img data-original="https://img.gocdn.online/2017/05/16/poster/5ef66403dc331009bdb5aa37cfe819ba-snatched.jpg"
                 class="lazy thumb mli-thumb" alt="Snatched"
                 src="https://img.gocdn.online/2017/05/16/poster/5ef66403dc331009bdb5aa37cfe819ba-snatched.jpg"
                 style="display: inline-block;">
            <span class="mli-info"><h2>Snatched</h2></span>
        </a>
    </div>
</div>
```

As we can see, each entry has a movie ID, a title, and a link to a detail page.

Let's start scraping this set of data:

```csharp
using System;
using IronWebScraper;

public class MovieScraper : WebScraper
{
    public override void Init()
    {
        // Initialize scraper settings
        License.LicenseKey = "LicenseKey";
        this.LoggingLevel = WebScraper.LogLevel.All;
        // AppSetting.GetAppRoot() is a project helper that is not shown in this article
        this.WorkingDirectory = AppSetting.GetAppRoot() + @"\MovieSample\Output\";

        // Request homepage content for scraping (the URL must include its scheme)
        this.Request("https://website.com/", Parse);
    }

    public override void Parse(Response response)
    {
        // Iterate over each movie div within the featured movie section
        foreach (var div in response.Css("#movie-featured > div"))
        {
            // Skip layout-only divs
            if (div.Attributes["class"] != "clearfix")
            {
                var movieId = Convert.ToInt32(div.GetAttribute("data-movie-id"));
                var link = div.Css("a")[0];
                var movieTitle = link.TextContentClean;

                // Scrape and store movie data as key-value pairs
                Scrape(new ScrapedData() { { "MovieId", movieId }, { "MovieTitle", movieTitle } }, "Movie.Jsonl");
            }
        }
    }
}
```

The same class in VB.NET:

```vb
Public Class MovieScraper
    Inherits WebScraper

    Public Overrides Sub Init()
        ' Initialize scraper settings
        License.LicenseKey = "LicenseKey"
        Me.LoggingLevel = WebScraper.LogLevel.All
        Me.WorkingDirectory = AppSetting.GetAppRoot() & "\MovieSample\Output\"

        ' Request homepage content for scraping (the URL must include its scheme)
        Me.Request("https://website.com/", AddressOf Parse)
    End Sub

    Public Overrides Sub Parse(ByVal response As Response)
        ' Iterate over each movie div within the featured movie section
        For Each div In response.Css("#movie-featured > div")
            If div.Attributes("class") <> "clearfix" Then
                Dim movieId = Convert.ToInt32(div.GetAttribute("data-movie-id"))
                Dim link = div.Css("a")(0)
                Dim movieTitle = link.TextContentClean

                ' Scrape and store movie data as key-value pairs
                Scrape(New ScrapedData() From { { "MovieId", movieId }, { "MovieTitle", movieTitle } }, "Movie.Jsonl")
            End If
        Next div
    End Sub
End Class
```

What's new in this code?

The WorkingDirectory property sets the main working directory for all scraped data and its related files.

Let's take this a step further. What if we need to build typed objects that hold the scraped data in a structured form?
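Before answering that, a note on the `WorkingDirectory` line in `Init`: `AppSetting.GetAppRoot()` is not defined anywhere in this article, so it is presumably a helper from the sample project rather than part of IronWebScraper. A minimal sketch of what such a helper might look like, assuming it only needs to return the application's base directory (the class body below is hypothetical):

```csharp
using System;
using System.IO;

// Hypothetical stand-in for the AppSetting helper used in Init();
// the name and behavior are assumptions, not part of IronWebScraper.
public static class AppSetting
{
    // Returns the directory the application is running from.
    public static string GetAppRoot() => AppContext.BaseDirectory;
}

public static class Program
{
    public static void Main()
    {
        // Path.Combine avoids hand-written separators such as @"\MovieSample\Output\".
        string workingDirectory = Path.Combine(AppSetting.GetAppRoot(), "MovieSample", "Output");

        // Create the folder if it does not exist yet, then show the resolved path.
        Directory.CreateDirectory(workingDirectory);
        Console.WriteLine(workingDirectory);
    }
}
```

Building the path with `Path.Combine` also keeps the code portable to systems that do not use backslashes as separators.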
Let's implement a Movie class that will hold our formatted data:

```csharp
public class Movie
{
    public int Id { get; set; }
    public string Title { get; set; }
    public string URL { get; set; }
}
```

Now we will update our code:

```csharp
using System;
using IronWebScraper;

public class MovieScraper : WebScraper
{
    public override void Init()
    {
        // Initialize scraper settings
        License.LicenseKey = "LicenseKey";
        this.LoggingLevel = WebScraper.LogLevel.All;
        this.WorkingDirectory = AppSetting.GetAppRoot() + @"\MovieSample\Output\";

        // Request homepage content for scraping
        this.Request("https://website.com/", Parse);
    }

    public override void Parse(Response response)
    {
        // Iterate over each movie div within the featured movie section
        foreach (var div in response.Css("#movie-featured > div"))
        {
            if (div.Attributes["class"] != "clearfix")
            {
                var movie = new Movie
                {
                    Id = Convert.ToInt32(div.GetAttribute("data-movie-id"))
                };

                var link = div.Css("a")[0];
                movie.Title = link.TextContentClean;
                movie.URL = link.Attributes["href"];

                // Scrape and store movie object
                Scrape(movie, "Movie.Jsonl");
            }
        }
    }
}
```
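One caveat about the C# `Parse` method above: `Convert.ToInt32` throws a `FormatException` when `data-movie-id` holds text that is not a valid integer. A minimal sketch of a more forgiving approach using `int.TryParse`; `MovieIdParser` and `TryParseMovieId` are hypothetical names introduced here for illustration, not part of IronWebScraper:

```csharp
using System;
using System.Globalization;

public static class MovieIdParser
{
    // Hypothetical helper: returns true and the parsed ID when the raw
    // attribute value is a valid integer, otherwise returns false.
    public static bool TryParseMovieId(string rawValue, out int movieId)
    {
        return int.TryParse(rawValue, NumberStyles.Integer, CultureInfo.InvariantCulture, out movieId);
    }
}

public static class Program
{
    public static void Main()
    {
        // Sample attribute values: one valid ID and three that cannot be parsed.
        foreach (var raw in new[] { "20746", "", "abc", null })
        {
            if (MovieIdParser.TryParseMovieId(raw, out int id))
                Console.WriteLine($"parsed {id}");
            else
                Console.WriteLine($"skipped '{raw}'");
        }
    }
}
```

Inside `Parse`, a div whose ID fails to parse could then be skipped with `continue` instead of raising an exception.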
The same scraper in VB.NET:

```vb
Public Class MovieScraper
    Inherits WebScraper

    Public Overrides Sub Init()
        ' Initialize scraper settings
        License.LicenseKey = "LicenseKey"
        Me.LoggingLevel = WebScraper.LogLevel.All
        Me.WorkingDirectory = AppSetting.GetAppRoot() & "\MovieSample\Output\"

        ' Request homepage content for scraping
        Me.Request("https://website.com/", AddressOf Parse)
    End Sub

    Public Overrides Sub Parse(ByVal response As Response)
        ' Iterate over each movie div within the featured movie section
        For Each div In response.Css("#movie-featured > div")
            If div.Attributes("class") <> "clearfix" Then
                Dim movie As New Movie With {.Id = Convert.ToInt32(div.GetAttribute("data-movie-id"))}
                Dim link = div.Css("a")(0)
                movie.Title = link.TextContentClean
                movie.URL = link.Attributes("href")

                ' Scrape and store movie object
                Scrape(movie, "Movie.Jsonl")
            End If
        Next div
    End Sub
End Class
```

What's new?

- We implemented a Movie class to hold our scraped data.
- We pass movie objects to the Scrape method, which understands our format and saves it in a defined way, as shown in the image below:

Let's move on to scraping a more detailed page.

The movie page looks like this:

```html
<div class="mvi-content">
    <div class="thumb mvic-thumb" style="background-image: url(https://img.gocdn.online/2017/04/28/poster/5a08e94ba02118f22dc30f298c603210-guardians-of-the-galaxy-vol-2.jpg);"></div>
    <div class="mvic-desc">
        <h3>Guardians of the Galaxy Vol. 2</h3>
        <div class="desc">
            Set to the backdrop of Awesome Mixtape #2, Marvel's Guardians of the Galaxy Vol. 2 continues the team's adventures as they travel throughout the cosmos to help Peter Quill learn more about his true parentage.
        </div>
        <div class="mvic-info">
            <div class="mvici-left">
                <p>
                    <strong>Genre: </strong>
                    <a href="https://Domain/genre/action/" title="Action">Action</a>,
                    <a href="https://Domain/genre/adventure/" title="Adventure">Adventure</a>,
                    <a href="https://Domain/genre/sci-fi/" title="Sci-Fi">Sci-Fi</a>
                </p>
                <p>
                    <strong>Actor: </strong>
                    <a target="_blank" href="https://Domain/actor/chris-pratt" title="Chris Pratt">Chris Pratt</a>,
                    <a target="_blank" href="https://Domain/actor/-zoe-saldana" title="Zoe Saldana">Zoe Saldana</a>,
                    <a target="_blank" href="https://Domain/actor/-dave-bautista-" title="Dave Bautista">Dave Bautista</a>
                </p>
                <p>
                    <strong>Director: </strong>
                    <a href="#" title="James Gunn">James Gunn</a>
                </p>
                <p>
                    <strong>Country: </strong>
                    <a href="https://Domain/country/us" title="United States">United States</a>
                </p>
            </div>
            <div class="mvici-right">
                <p><strong>Duration:</strong> 136 min</p>
                <p><strong>Quality:</strong> <span class="quality">CAM</span></p>
                <p><strong>Release:</strong> 2017</p>
                <p><strong>IMDb:</strong> 8.3</p>
            </div>
            <div class="clearfix"></div>
        </div>
        <div class="clearfix"></div>
    </div>
    <div class="clearfix"></div>
</div>
```

We could extend our Movie class with new properties (Description, Genre, Actor, Director, Country, Duration, IMDb score), but for this sample we will use only Description, Genre, and Actor.

```csharp
using System.Collections.Generic;

public class Movie
{
    public int Id { get; set; }
    public string Title { get; set; }
    public string URL { get; set; }
    public string Description { get; set; }
    public List<string> Genre { get; set; }
    public List<string> Actor { get; set; }
}
```

The same class in VB.NET:

```vb
Imports System.Collections.Generic

Public Class Movie
    Public Property Id() As Integer
    Public Property Title() As String
    Public Property URL() As String
    Public Property Description() As String
    Public Property Genre() As List(Of String)
    Public Property Actor() As List(Of String)
End Class
```

Now we will navigate to the detail page and scrape it.

IronWebScraper lets you add more scrape functions so that you can handle different page formats, as we can see here:

```csharp
using System;
using System.Collections.Generic;
using IronWebScraper;

public class MovieScraper : WebScraper
{
    public override void Init()
    {
        // Initialize scraper settings
        License.LicenseKey = "LicenseKey";
        this.LoggingLevel = WebScraper.LogLevel.All;
        this.WorkingDirectory = AppSetting.GetAppRoot() + @"\MovieSample\Output\";

        // Request homepage content for scraping
        this.Request("https://domain/", Parse);
    }

    public override void Parse(Response response)
    {
        // Iterate over each movie div within the featured movie section
        foreach (var div in response.Css("#movie-featured > div"))
        {
            if (div.Attributes["class"] != "clearfix")
            {
                var movie = new Movie
                {
                    Id = Convert.ToInt32(div.GetAttribute("data-movie-id"))
                };

                var link = div.Css("a")[0];
                movie.Title = link.TextContentClean;
                movie.URL = link.Attributes["href"];

                // Request the detail page and pass the movie along as metadata
                this.Request(movie.URL, ParseDetails, new MetaData() { { "movie", movie } });
            }
        }
    }

    public void ParseDetails(Response response)
    {
        // Retrieve movie object from metadata
        var movie = response.MetaData.Get<Movie>("movie");
        var div = response.Css("div.mvic-desc")[0];

        // Extract description
        movie.Description = div.Css("div.desc")[0].TextContentClean;

        // Extract genres: the first <p> holds the genre links
        // (a bare "div > p > a" would also match actors, director, and country)
        movie.Genre = new List<string>();
        foreach (var genre in div.Css("div > p:nth-child(1) > a"))
        {
            movie.Genre.Add(genre.TextContentClean);
        }

        // Extract actors: the second <p> holds the actor links
        movie.Actor = new List<string>();
        foreach (var actor in div.Css("div > p:nth-child(2) > a"))
        {
            movie.Actor.Add(actor.TextContentClean);
        }

        // Scrape and store detailed movie data
        Scrape(movie, "Movie.Jsonl");
    }
}
```

The same scraper in VB.NET:

```vb
Public Class MovieScraper
    Inherits WebScraper

    Public Overrides Sub Init()
        ' Initialize scraper settings
        License.LicenseKey = "LicenseKey"
        Me.LoggingLevel = WebScraper.LogLevel.All
        Me.WorkingDirectory = AppSetting.GetAppRoot() & "\MovieSample\Output\"

        ' Request homepage content for scraping
        Me.Request("https://domain/", AddressOf Parse)
    End Sub

    Public Overrides Sub Parse(ByVal response As Response)
        ' Iterate over each movie div within the featured movie section
        For Each div In response.Css("#movie-featured > div")
            If div.Attributes("class") <> "clearfix" Then
                Dim movie As New Movie With {.Id = Convert.ToInt32(div.GetAttribute("data-movie-id"))}
                Dim link = div.Css("a")(0)
                movie.Title = link.TextContentClean
                movie.URL = link.Attributes("href")

                ' Request the detail page and pass the movie along as metadata
                Me.Request(movie.URL, AddressOf ParseDetails, New MetaData() From { { "movie", movie } })
            End If
        Next div
    End Sub

    Public Sub ParseDetails(ByVal response As Response)
        ' Retrieve movie object from metadata
        Dim movie = response.MetaData.Get(Of Movie)("movie")
        Dim div = response.Css("div.mvic-desc")(0)

        ' Extract description
        movie.Description = div.Css("div.desc")(0).TextContentClean

        ' Extract genres: the first <p> holds the genre links
        movie.Genre = New List(Of String)()
        For Each genre In div.Css("div > p:nth-child(1) > a")
            movie.Genre.Add(genre.TextContentClean)
        Next genre

        ' Extract actors: the second <p> holds the actor links
        movie.Actor = New List(Of String)()
        For Each actor In div.Css("div > p:nth-child(2) > a")
            movie.Actor.Add(actor.TextContentClean)
        Next actor

        ' Scrape and store detailed movie data
        Scrape(movie, "Movie.Jsonl")
    End Sub
End Class
```

What's new?

- We can add scrape functions (for example, ParseDetails) to scrape detail pages.
- We moved the Scrape call, which generates our output file, into the new function.
- We used the IronWebScraper MetaData feature to pass our movie object to the new scrape function.
- We scraped the page and saved our movie object data to a file.

Frequently Asked Questions

How can I scrape data from an online movie website?
You can use IronWebScraper to scrape data from an online movie website. Start by creating a 'MovieScraper' class, configure the scraper settings, and request the home page content for extraction.

What does the 'Movie' class do in web scraping?
In IronWebScraper, the 'Movie' class structures the scraped data and stores it as objects with properties such as ID, title, URL, description, genre, and actor, which keeps data handling organized.

How do you navigate to and extract detailed movie information?
IronWebScraper lets you implement a 'ParseDetails' function that visits the detailed movie pages and extracts additional information such as description, genre, and actors.

What role does the 'MetaData' feature play in web scraping?
The 'MetaData' feature of IronWebScraper passes data between scrape functions, for example handing the movie object to the 'ParseDetails' function for further processing.

How are different page formats handled during scraping?
With IronWebScraper, you can create multiple scrape functions, one per page format, and extract large amounts of data effectively.

How do you extract movie IDs and titles with IronWebScraper?
Iterate over each movie div within the featured movie section, then read the data attribute for the movie ID and the text content for the title.

Why do the logging settings in the scraper matter?
The 'LoggingLevel' property of IronWebScraper sets how detailed the log output is, which makes it easier to monitor and troubleshoot the scraping process.

What is the working directory for in a web scraping project?
The working directory in IronWebScraper stores all scraped data and related files, centralizing data management.

Can IronWebScraper be used to automate data extraction tasks?
Yes. IronWebScraper is designed to automate data extraction by letting users create classes and methods that systematically scrape and store data from web pages.

Darrius Serrant
Full Stack Software Engineer (WebOps)

Darrius Serrant holds a Bachelor's degree in Computer Science from the University of Miami and works as a Full Stack WebOps Marketing Engineer at Iron Software. Drawn to coding from a young age, he saw computing as both mysterious and accessible, making it the perfect medium for creativity and problem-solving. At Iron Software, Darrius enjoys creating new things and simplifying complex concepts to make them more understandable. As one of our resident developers, he has also volunteered to teach students, sharing his expertise with the next generation. For Darrius, his work is fulfilling because it is valued and has a real impact.