如何用 C 语言从网站抓取数据#
IronWebscraper 是一个用于网络抓取、网络数据提取和网络内容解析的 .NET 库。 这是一个易于使用的库,可以添加到 Microsoft Visual Studio 项目中,用于开发和生产。
IronWebscraper具有许多独特的功能和能力,例如控制允许和禁止的页面、对象、媒体等。它还允许管理多个身份、网络缓存以及我们将在本教程中介绍的许多其他功能。
开始使用IronWebscraper
立即在您的项目中开始使用IronWebScraper,并享受免费试用。
目标受众
本教程面向具有基础或高级编程技能的软件开发者,旨在帮助他们构建并实现具有高级抓取功能的解决方案。(网站搜刮、网站数据收集和提取、网站内容解析、网络收获).
所需技能
使用Microsoft编程语言(如C#或VB.NET)进行编程的基本原理。
基本了解网络技术(HTML、JavaScript、JQuery、CSS 等。)以及它们的工作原理
- 基本的DOM、XPath、HTML和CSS选择器知识。
工具
Microsoft Visual Studio 2010 或以上版本
- 浏览器的网页开发者扩展,例如Chrome的Web检查器或Firefox的Firebug。
为什么要进行网页抓取?
(原因与概念)
如果您想构建一个具有以下功能的产品或解决方案:
提取网站数据
比较多个网站的内容、价格、功能等。
扫描和缓存网站内容
如果您有上述一个或多个原因,那么IronWebscraper是一个非常适合您需求的库。
如何安装IronWebScraper?
在您创建一个新项目之后(见附录 A)您可以使用 NuGet 自动插入 IronWebScraper 库,或手动安装 DLL,将其添加到项目中。
使用 NuGet 安装
要使用 NuGet 将 IronWebScraper 库添加到我们的项目中,我们可以使用可视界面来操作。(NuGet 软件包管理器)或使用软件包管理器控制台的命令。
使用 NuGet 软件包管理器
使用 NuGet 包控制台
手动安装
点击IronWebScraper或直接使用URL访问其页面。https://ironsoftware.com/csharp/webscraper/
点击下载DLL。
提取已下载的压缩文件
在 Visual Studio 中右键单击项目 -> 添加 -> 引用 -> 浏览
转到解压缩文件夹 ->
netstandard2.0
-> 并选择所有.dll
文件- 完成了!
HelloScraper - 我们的第一个IronWebScraper示例
像往常一样,我们将开始实施Hello Scraper应用程序,以使用IronWebScraper迈出我们的第一步。
我们创建了一个名为“IronWebScraperSample”的新控制台应用程序。
创建 IronWebScraper 示例的步骤
public class HelloScraper : WebScraper
{
/// <summary>
/// Override this method initialize your web-scraper.
/// Important tasks will be to Request at least one start url... and set allowed/banned domain or url patterns.
/// </summary>
public override void Init()
{
License.LicenseKey = "LicenseKey"; // Write License Key
this.LoggingLevel = WebScraper.LogLevel.All; // All Events Are Logged
this.Request("https://blog.scrapinghub.com", Parse);
}
/// <summary>
/// Override this method to create the default Response handler for your web scraper.
/// If you have multiple page types, you can add additional similar methods.
/// </summary>
/// <param name="response">The http Response object to parse</param>
public override void Parse(Response response)
{
// set working directory for the project
this.WorkingDirectory = AppSetting.GetAppRoot()+ @"\HelloScraperSample\Output\";
// Loop on all Links
foreach (var title_link in response.Css("h2.entry-title a"))
{
// Read Link Text
string strTitle = title_link.TextContentClean;
// Save Result to File
Scrape(new ScrapedData() { { "Title", strTitle } }, "HelloScraper.json");
}
// Loop On All Links
if (response.CssExists("div.prev-post > a [href]"))
{
// Get Link URL
var next_page = response.Css("div.prev-post > a [href]")[0].Attributes ["href"];
// Scrape Next URL
this.Request(next_page, Parse);
}
}
}
public class HelloScraper : WebScraper
{
/// <summary>
/// Override this method initialize your web-scraper.
/// Important tasks will be to Request at least one start url... and set allowed/banned domain or url patterns.
/// </summary>
public override void Init()
{
License.LicenseKey = "LicenseKey"; // Write License Key
this.LoggingLevel = WebScraper.LogLevel.All; // All Events Are Logged
this.Request("https://blog.scrapinghub.com", Parse);
}
/// <summary>
/// Override this method to create the default Response handler for your web scraper.
/// If you have multiple page types, you can add additional similar methods.
/// </summary>
/// <param name="response">The http Response object to parse</param>
public override void Parse(Response response)
{
// set working directory for the project
this.WorkingDirectory = AppSetting.GetAppRoot()+ @"\HelloScraperSample\Output\";
// Loop on all Links
foreach (var title_link in response.Css("h2.entry-title a"))
{
// Read Link Text
string strTitle = title_link.TextContentClean;
// Save Result to File
Scrape(new ScrapedData() { { "Title", strTitle } }, "HelloScraper.json");
}
// Loop On All Links
if (response.CssExists("div.prev-post > a [href]"))
{
// Get Link URL
var next_page = response.Css("div.prev-post > a [href]")[0].Attributes ["href"];
// Scrape Next URL
this.Request(next_page, Parse);
}
}
}
IRON VB CONVERTER ERROR developers@ironsoftware.com
- 现在开始 Scrape 将以下代码片段添加到主页
static void Main(string [] args)
{
// Create Object From Hello Scrape class
HelloScraperSample.HelloScraper scrape = new HelloScraperSample.HelloScraper();
// Start Scraping
scrape.Start();
}
static void Main(string [] args)
{
// Create Object From Hello Scrape class
HelloScraperSample.HelloScraper scrape = new HelloScraperSample.HelloScraper();
// Start Scraping
scrape.Start();
}
IRON VB CONVERTER ERROR developers@ironsoftware.com
代码概览
Scrape.Start()=> 触发以下的抓取逻辑:
调用 Init()首先初始化变量、抓取属性和行为属性的方法。
如我们所见,它将起始页设置为请求。("https://blog.scrapinghub.com",解析)和解析(回应)定义为用于解析响应的过程。
Webscraper 并行管理:http 和线程……保持您的所有代码易于调试和同步。
- 解析方法在初始化之后开始。()来解析页面。
您可以使用(Css 选择器、Js DOM、XPath)查找元素
选中的元素会被转换为 ScrapedData 类,您也可以将它们转换为任何自定义类(如产品、员工、新闻等)。
对象以 Json 格式保存在("bin/Scrape/")目录下的文件中。也可以将文件路径设为参数,稍后我们将在其他示例中看到。
IronWebScraper 库功能和选项
您可以在手动安装方法下载的zip文件中找到更新的文档。(IronWebScraper Documentation.chm 文件)
或者您可以查看库最后更新的在线文档。https://ironsoftware.com/csharp/webscraper/object-reference/
要在您的项目中开始使用IronWebscraper,您必须继承自(IronWebScraper.WebScraper)扩展您的类库并向其添加抓取功能的类。
您还必须实现{启动(), 解析(回应)}方法。
namespace IronWebScraperEngine
{
public class NewsScraper : IronWebScraper.WebScraper
{
public override void Init()
{
throw new NotImplementedException();
}
public override void Parse(Response response)
{
throw new NotImplementedException();
}
}
}
namespace IronWebScraperEngine
{
public class NewsScraper : IronWebScraper.WebScraper
{
public override void Init()
{
throw new NotImplementedException();
}
public override void Parse(Response response)
{
throw new NotImplementedException();
}
}
}
Namespace IronWebScraperEngine
Public Class NewsScraper
Inherits IronWebScraper.WebScraper
Public Overrides Sub Init()
Throw New NotImplementedException()
End Sub
Public Overrides Sub Parse(ByVal response As Response)
Throw New NotImplementedException()
End Sub
End Class
End Namespace
特性 (函数 | 类型 | 说明 |
---|---|---|
启动 () | 方法 | 用于设置刮板 |
解析(响应) | 方法 | 用于执行刮板将使用的逻辑以及处理方式。 下表包含 IronWebScraper 库提供的方法和属性列表 注意:可针对不同的页面行为或结构实施多种方法 |
| 收藏品 | 用于禁止/允许/ url 和/或域 Ex: BannedUrls.Add ("*.zip"、"*.exe"、"*.gz"、"*.pdf"); 请注意:
|
ObeyRobotsDotTxt | 布尔型 | 用于启用或禁用 read and follow robots.txt 指令。 |
public override bool ObeyRobotsDotTxtForHost (string Host) | 方法 | 用于启用或禁用某些域的读取和跟踪 robots.txt 指令或不读取和跟踪 robots.txt 指令 |
刮削 | 方法 | |
ScrapeUnique | 方法 | |
节流阀模式 | 枚举 | |
启用网络缓存 () | 方法 | |
启用 WebCache(TimeSpan cacheDuration) | 方法 | |
最大连接数限制 | 内部 | |
每主机费率限制 | 时间跨度 | |
OpenConnectionLimitPerHost | 内部 | |
ObeyRobotsDotTxt | 布尔型 | |
节流阀模式 | 枚举 | 枚举选项:
|
SetSiteSpecificCrawlRateLimit(字符串 hostName,时间跨度 crawlRate) | 方法 | |
身份 | 收藏品 | 用于获取网络资源的 HttpIdentity () 列表。 每个个人信息都可能有不同的代理 IP 地址、用户代理、http 头信息、持久 cookies、用户名和密码。 最佳做法是在 WebScraper.Init 方法中创建个人信息,并将其添加到 WebScraper.Identities 列表中。 |
工作目录 | 字符串 | 设置工作目录,用于将所有与刮擦相关的数据存储到磁盘中。 |
真实世界样本与实践
抓取在线电影网站
让我们从一个真实世界的网站开始另一个示例。我们将选择抓取一个电影网站。
让我们添加一个新类并将其命名为“MovieScraper”:
现在,让我们来看看我们要搜索的网站:
这是我们在网站上看到的主页 HTML 的一部分:
<div id="movie-featured" class="movies-list movies-list-full tab-pane in fade active">
<div data-movie-id="20746" class="ml-item">
<a href="https://website.com/film/king-arthur-legend-of-the-sword-20746/">
<span class="mli-quality">CAM</span>
<img data-original="https://img.gocdn.online/2017/05/16/poster/2116d6719c710eabe83b377463230fbe-king-arthur-legend-of-the-sword.jpg"
class="lazy thumb mli-thumb" alt="King Arthur: Legend of the Sword"
src="https://img.gocdn.online/2017/05/16/poster/2116d6719c710eabe83b377463230fbe-king-arthur-legend-of-the-sword.jpg"
style="display: inline-block;">
<span class="mli-info"><h2>King Arthur: Legend of the Sword</h2></span>
</a>
</div>
<div data-movie-id="20724" class="ml-item">
<a href="https://website.com/film/snatched-20724/" >
<span class="mli-quality">CAM</span>
<img data-original="https://img.gocdn.online/2017/05/16/poster/5ef66403dc331009bdb5aa37cfe819ba-snatched.jpg"
class="lazy thumb mli-thumb" alt="Snatched"
src="https://img.gocdn.online/2017/05/16/poster/5ef66403dc331009bdb5aa37cfe819ba-snatched.jpg"
style="display: inline-block;">
<span class="mli-info"><h2>Snatched</h2></span>
</a>
</div>
</div>
<div id="movie-featured" class="movies-list movies-list-full tab-pane in fade active">
<div data-movie-id="20746" class="ml-item">
<a href="https://website.com/film/king-arthur-legend-of-the-sword-20746/">
<span class="mli-quality">CAM</span>
<img data-original="https://img.gocdn.online/2017/05/16/poster/2116d6719c710eabe83b377463230fbe-king-arthur-legend-of-the-sword.jpg"
class="lazy thumb mli-thumb" alt="King Arthur: Legend of the Sword"
src="https://img.gocdn.online/2017/05/16/poster/2116d6719c710eabe83b377463230fbe-king-arthur-legend-of-the-sword.jpg"
style="display: inline-block;">
<span class="mli-info"><h2>King Arthur: Legend of the Sword</h2></span>
</a>
</div>
<div data-movie-id="20724" class="ml-item">
<a href="https://website.com/film/snatched-20724/" >
<span class="mli-quality">CAM</span>
<img data-original="https://img.gocdn.online/2017/05/16/poster/5ef66403dc331009bdb5aa37cfe819ba-snatched.jpg"
class="lazy thumb mli-thumb" alt="Snatched"
src="https://img.gocdn.online/2017/05/16/poster/5ef66403dc331009bdb5aa37cfe819ba-snatched.jpg"
style="display: inline-block;">
<span class="mli-info"><h2>Snatched</h2></span>
</a>
</div>
</div>
正如我们所见,我们有电影ID、标题和详细页面链接。
让我们开始抓取这组数据:
public class MovieScraper : WebScraper
{
public override void Init()
{
License.LicenseKey = "LicenseKey";
this.LoggingLevel = WebScraper.LogLevel.All;
this.WorkingDirectory = AppSetting.GetAppRoot() + @"\MovieSample\Output\";
this.Request("www.website.com", Parse);
}
public override void Parse(Response response)
{
foreach (var Divs in response.Css("#movie-featured > div"))
{
if (Divs.Attributes ["class"] != "clearfix")
{
var MovieId = Divs.GetAttribute("data-movie-id");
var link = Divs.Css("a")[0];
var MovieTitle = link.TextContentClean;
Scrape(new ScrapedData() { { "MovieId", MovieId }, { "MovieTitle", MovieTitle } }, "Movie.Jsonl");
}
}
}
}
public class MovieScraper : WebScraper
{
public override void Init()
{
License.LicenseKey = "LicenseKey";
this.LoggingLevel = WebScraper.LogLevel.All;
this.WorkingDirectory = AppSetting.GetAppRoot() + @"\MovieSample\Output\";
this.Request("www.website.com", Parse);
}
public override void Parse(Response response)
{
foreach (var Divs in response.Css("#movie-featured > div"))
{
if (Divs.Attributes ["class"] != "clearfix")
{
var MovieId = Divs.GetAttribute("data-movie-id");
var link = Divs.Css("a")[0];
var MovieTitle = link.TextContentClean;
Scrape(new ScrapedData() { { "MovieId", MovieId }, { "MovieTitle", MovieTitle } }, "Movie.Jsonl");
}
}
}
}
Public Class MovieScraper
Inherits WebScraper
Public Overrides Sub Init()
License.LicenseKey = "LicenseKey"
Me.LoggingLevel = WebScraper.LogLevel.All
Me.WorkingDirectory = AppSetting.GetAppRoot() & "\MovieSample\Output\"
Me.Request("www.website.com", AddressOf Parse)
End Sub
Public Overrides Sub Parse(ByVal response As Response)
For Each Divs In response.Css("#movie-featured > div")
If Divs.Attributes ("class") <> "clearfix" Then
Dim MovieId = Divs.GetAttribute("data-movie-id")
Dim link = Divs.Css("a")(0)
Dim MovieTitle = link.TextContentClean
Scrape(New ScrapedData() From {
{ "MovieId", MovieId },
{ "MovieTitle", MovieTitle }
},
"Movie.Jsonl")
End If
Next Divs
End Sub
End Class
这段代码有什么新内容?
工作目录属性用于设置所有抓取数据及其相关文件的主工作目录。
让我们做得更多。
如果我们需要构建类型化对象来保存格式化对象中的抓取数据,该怎么办?
让我们实现一个电影类,用于存放我们的格式化数据:
public class Movie
{
public int Id { get; set; }
public string Title { get; set; }
public string URL { get; set; }
}
public class Movie
{
public int Id { get; set; }
public string Title { get; set; }
public string URL { get; set; }
}
Public Class Movie
Public Property Id() As Integer
Public Property Title() As String
Public Property URL() As String
End Class
现在,我们将更新代码:
public class MovieScraper : WebScraper
{
public override void Init()
{
License.LicenseKey = "LicenseKey";
this.LoggingLevel = WebScraper.LogLevel.All;
this.WorkingDirectory = AppSetting.GetAppRoot() + @"\MovieSample\Output\";
this.Request("https://website.com/", Parse);
}
public override void Parse(Response response)
{
foreach (var Divs in response.Css("#movie-featured > div"))
{
if (Divs.Attributes ["class"] != "clearfix")
{
var movie = new Movie();
movie.Id = Convert.ToInt32( Divs.GetAttribute("data-movie-id"));
var link = Divs.Css("a")[0];
movie.Title = link.TextContentClean;
movie.URL = link.Attributes ["href"];
Scrape(movie, "Movie.Jsonl");
}
}
}
}
public class MovieScraper : WebScraper
{
public override void Init()
{
License.LicenseKey = "LicenseKey";
this.LoggingLevel = WebScraper.LogLevel.All;
this.WorkingDirectory = AppSetting.GetAppRoot() + @"\MovieSample\Output\";
this.Request("https://website.com/", Parse);
}
public override void Parse(Response response)
{
foreach (var Divs in response.Css("#movie-featured > div"))
{
if (Divs.Attributes ["class"] != "clearfix")
{
var movie = new Movie();
movie.Id = Convert.ToInt32( Divs.GetAttribute("data-movie-id"));
var link = Divs.Css("a")[0];
movie.Title = link.TextContentClean;
movie.URL = link.Attributes ["href"];
Scrape(movie, "Movie.Jsonl");
}
}
}
}
Public Class MovieScraper
Inherits WebScraper
Public Overrides Sub Init()
License.LicenseKey = "LicenseKey"
Me.LoggingLevel = WebScraper.LogLevel.All
Me.WorkingDirectory = AppSetting.GetAppRoot() & "\MovieSample\Output\"
Me.Request("https://website.com/", AddressOf Parse)
End Sub
Public Overrides Sub Parse(ByVal response As Response)
For Each Divs In response.Css("#movie-featured > div")
If Divs.Attributes ("class") <> "clearfix" Then
Dim movie As New Movie()
movie.Id = Convert.ToInt32(Divs.GetAttribute("data-movie-id"))
Dim link = Divs.Css("a")(0)
movie.Title = link.TextContentClean
movie.URL = link.Attributes ("href")
Scrape(movie, "Movie.Jsonl")
End If
Next Divs
End Sub
End Class
有什么新鲜事?
电影页面如下:
<div class="mvi-content">
<div class="thumb mvic-thumb"
style="background-image: url(https://img.gocdn.online/2017/04/28/poster/5a08e94ba02118f22dc30f298c603210-guardians-of-the-galaxy-vol-2.jpg);"></div>
<div class="mvic-desc">
<h3>Guardians of the Galaxy Vol. 2</h3>
<div class="desc">
Set to the backdrop of Awesome Mixtape #2, Marvel's Guardians of the Galaxy Vol. 2 continues the team's adventures as they travel throughout the cosmos to help Peter Quill learn more about his true parentage.
</div>
<div class="mvic-info">
<div class="mvici-left">
<p>
<strong>Genre: </strong>
<a href="https://Domain/genre/action/" title="Action">Action</a>,
<a href="https://Domain/genre/adventure/" title="Adventure">Adventure</a>,
<a href="https://Domain/genre/sci-fi/" title="Sci-Fi">Sci-Fi</a>
</p>
<p>
<strong>Actor: </strong>
<a target="_blank" href="https://Domain/actor/chris-pratt" title="Chris Pratt">Chris Pratt</a>,
<a target="_blank" href="https://Domain/actor/-zoe-saldana" title="Zoe Saldana">Zoe Saldana</a>,
<a target="_blank" href="https://Domain/actor/-dave-bautista-" title="Dave Bautista">Dave Bautista</a>
</p>
<p>
<strong>Director: </strong>
<a href="#" title="James Gunn">James Gunn</a>
</p>
<p>
<strong>Country: </strong>
<a href="https://Domain/country/us" title="United States">United States</a>
</p>
</div>
<div class="mvici-right">
<p><strong>Duration:</strong> 136 min</p>
<p><strong>Quality:</strong> <span class="quality">CAM</span></p>
<p><strong>Release:</strong> 2017</p>
<p><strong>IMDb:</strong> 8.3</p>
</div>
<div class="clearfix"></div>
</div>
<div class="clearfix"></div>
</div>
<div class="clearfix"></div>
</div>
<div class="mvi-content">
<div class="thumb mvic-thumb"
style="background-image: url(https://img.gocdn.online/2017/04/28/poster/5a08e94ba02118f22dc30f298c603210-guardians-of-the-galaxy-vol-2.jpg);"></div>
<div class="mvic-desc">
<h3>Guardians of the Galaxy Vol. 2</h3>
<div class="desc">
Set to the backdrop of Awesome Mixtape #2, Marvel's Guardians of the Galaxy Vol. 2 continues the team's adventures as they travel throughout the cosmos to help Peter Quill learn more about his true parentage.
</div>
<div class="mvic-info">
<div class="mvici-left">
<p>
<strong>Genre: </strong>
<a href="https://Domain/genre/action/" title="Action">Action</a>,
<a href="https://Domain/genre/adventure/" title="Adventure">Adventure</a>,
<a href="https://Domain/genre/sci-fi/" title="Sci-Fi">Sci-Fi</a>
</p>
<p>
<strong>Actor: </strong>
<a target="_blank" href="https://Domain/actor/chris-pratt" title="Chris Pratt">Chris Pratt</a>,
<a target="_blank" href="https://Domain/actor/-zoe-saldana" title="Zoe Saldana">Zoe Saldana</a>,
<a target="_blank" href="https://Domain/actor/-dave-bautista-" title="Dave Bautista">Dave Bautista</a>
</p>
<p>
<strong>Director: </strong>
<a href="#" title="James Gunn">James Gunn</a>
</p>
<p>
<strong>Country: </strong>
<a href="https://Domain/country/us" title="United States">United States</a>
</p>
</div>
<div class="mvici-right">
<p><strong>Duration:</strong> 136 min</p>
<p><strong>Quality:</strong> <span class="quality">CAM</span></p>
<p><strong>Release:</strong> 2017</p>
<p><strong>IMDb:</strong> 8.3</p>
</div>
<div class="clearfix"></div>
</div>
<div class="clearfix"></div>
</div>
<div class="clearfix"></div>
</div>
我们可以用新的属性来扩展我们的影片类(说明, 类型, 演员, 导演, 国家, 片长, IMDB评分)但我们将使用(描述、类型、演员)仅适用于我们的样本。
public class Movie
{
public int Id { get; set; }
public string Title { get; set; }
public string URL { get; set; }
public string Description { get; set; }
public List<string> Genre { get; set; }
public List<string> Actor { get; set; }
}
public class Movie
{
public int Id { get; set; }
public string Title { get; set; }
public string URL { get; set; }
public string Description { get; set; }
public List<string> Genre { get; set; }
public List<string> Actor { get; set; }
}
Public Class Movie
Public Property Id() As Integer
Public Property Title() As String
Public Property URL() As String
Public Property Description() As String
Public Property Genre() As List(Of String)
Public Property Actor() As List(Of String)
End Class
现在我们将导航到详细页面进行抓取。
IronWebScraper 允许您添加更多的抓取功能,以抓取不同类型的页面格式。
如我们所见:
public class MovieScraper : WebScraper
{
public override void Init()
{
License.LicenseKey = "LicenseKey";
this.LoggingLevel = WebScraper.LogLevel.All;
this.WorkingDirectory = AppSetting.GetAppRoot() + @"\MovieSample\Output\";
this.Request("https://domain/", Parse);
}
public override void Parse(Response response)
{
foreach (var Divs in response.Css("#movie-featured > div"))
{
if (Divs.Attributes ["class"] != "clearfix")
{
var movie = new Movie();
movie.Id = Convert.ToInt32( Divs.GetAttribute("data-movie-id"));
var link = Divs.Css("a")[0];
movie.Title = link.TextContentClean;
movie.URL = link.Attributes ["href"];
this.Request(movie.URL, ParseDetails, new MetaData() { { "movie", movie } });// to scrap Detailed Page
}
}
}
public void ParseDetails(Response response)
{
var movie = response.MetaData.Get<Movie>("movie");
var Div = response.Css("div.mvic-desc")[0];
movie.Description = Div.Css("div.desc")[0].TextContentClean;
foreach(var Genre in Div.Css("div > p > a"))
{
movie.Genre.Add(Genre.TextContentClean);
}
foreach (var Actor in Div.Css("div > p:nth-child(2) > a"))
{
movie.Actor.Add(Actor.TextContentClean);
}
Scrape(movie, "Movie.Jsonl");
}
}
public class MovieScraper : WebScraper
{
public override void Init()
{
License.LicenseKey = "LicenseKey";
this.LoggingLevel = WebScraper.LogLevel.All;
this.WorkingDirectory = AppSetting.GetAppRoot() + @"\MovieSample\Output\";
this.Request("https://domain/", Parse);
}
public override void Parse(Response response)
{
foreach (var Divs in response.Css("#movie-featured > div"))
{
if (Divs.Attributes ["class"] != "clearfix")
{
var movie = new Movie();
movie.Id = Convert.ToInt32( Divs.GetAttribute("data-movie-id"));
var link = Divs.Css("a")[0];
movie.Title = link.TextContentClean;
movie.URL = link.Attributes ["href"];
this.Request(movie.URL, ParseDetails, new MetaData() { { "movie", movie } });// to scrap Detailed Page
}
}
}
public void ParseDetails(Response response)
{
var movie = response.MetaData.Get<Movie>("movie");
var Div = response.Css("div.mvic-desc")[0];
movie.Description = Div.Css("div.desc")[0].TextContentClean;
foreach(var Genre in Div.Css("div > p > a"))
{
movie.Genre.Add(Genre.TextContentClean);
}
foreach (var Actor in Div.Css("div > p:nth-child(2) > a"))
{
movie.Actor.Add(Actor.TextContentClean);
}
Scrape(movie, "Movie.Jsonl");
}
}
Public Class MovieScraper
Inherits WebScraper
Public Overrides Sub Init()
License.LicenseKey = "LicenseKey"
Me.LoggingLevel = WebScraper.LogLevel.All
Me.WorkingDirectory = AppSetting.GetAppRoot() & "\MovieSample\Output\"
Me.Request("https://domain/", AddressOf Parse)
End Sub
Public Overrides Sub Parse(ByVal response As Response)
For Each Divs In response.Css("#movie-featured > div")
If Divs.Attributes ("class") <> "clearfix" Then
Dim movie As New Movie()
movie.Id = Convert.ToInt32(Divs.GetAttribute("data-movie-id"))
Dim link = Divs.Css("a")(0)
movie.Title = link.TextContentClean
movie.URL = link.Attributes ("href")
Me.Request(movie.URL, AddressOf ParseDetails, New MetaData() From {
{ "movie", movie }
}) ' to scrap Detailed Page
End If
Next Divs
End Sub
Public Sub ParseDetails(ByVal response As Response)
Dim movie = response.MetaData.Get(Of Movie)("movie")
Dim Div = response.Css("div.mvic-desc")(0)
movie.Description = Div.Css("div.desc")(0).TextContentClean
For Each Genre In Div.Css("div > p > a")
movie.Genre.Add(Genre.TextContentClean)
Next Genre
For Each Actor In Div.Css("div > p:nth-child(2) > a")
movie.Actor.Add(Actor.TextContentClean)
Next Actor
Scrape(movie, "Movie.Jsonl")
End Sub
End Class
有什么新鲜事?
我们可以添加抓取功能(解析细节)抓取详细页面
我们将生成文件的 Scrape 功能移至新功能。
我们使用了IronWebScraper功能(元数据)将我们的电影对象传递给新的抓取函数
- 我们爬取了页面并将我们的电影对象数据保存到了一个文件中。
从购物网站抓取内容
我们选择一个购物网站来抓取其内容。
如您从图片中可以看到,我们有一个左侧栏,其中包含了网站产品类别的链接。
因此,我们的第一步是调查网站的HTML,并计划我们希望如何抓取它。
时尚网站类别有子类别(男士、女士、儿童)
<li class="menu-item" data-id="">
<a href="https://domain.com/fashion-by-/" class="main-category">
<i class="cat-icon osh-font-fashion"></i> <span class="nav-subTxt">FASHION </span> <i class="osh-font-light-arrow-left"></i><i class="osh-font-light-arrow-right"></i>
</a> <div class="navLayerWrapper" style="width: 633px; display: none;"><div class="submenu"><div class="column"><div class="categories"><a class="category" href="https://domain.com/fashion-by-/?sort=newest&dir=desc&viewType=gridView3">New Arrivals !</a> </div><div class="categories"><a class="category" href="https://domain.com/men-fashion/">Men</a> <a class="subcategory" href="https://domain.com/mens-shoes/">Shoes</a> <a class="subcategory" href="https://domain.com/mens-clothing/">Clothing</a> <a class="subcategory" href="https://domain.com/mens-accessories/">Accessories</a> </div><div class="categories"><a class="category" href="https://domain.com/women-fashion/">Women</a> <a class="subcategory" href="https://domain.com/womens-shoes/">Shoes</a> <a class="subcategory" href="https://domain.com/womens-clothing/">Clothing</a> <a class="subcategory" href="https://domain.com/womens-accessories/">Accessories</a> </div><div class="categories"><a class="category" href="https://domain.com/girls-boys-fashion/">Kids</a> <a class="subcategory" href="https://domain.com/boys-fashion/">Boys</a> <a class="subcategory" href="https://domain.com/girls/">Girls</a> </div><div class="categories"><a class="category" href="https://domain.com/maternity-clothes/">Maternity Clothes</a> </div></div><div class="column"><div class="categories"> <span class="category defaultCursor">Men Best Sellers</span> <a class="subcategory" href="https://domain.com/mens-casual-shoes/">Casual Shoes</a> <a class="subcategory" href="https://domain.com/mens-sneakers/">Sneakers</a> <a class="subcategory" href="https://domain.com/mens-t-shirts/">T-shirts</a> <a class="subcategory" href="https://domain.com/mens-polos/">Polos</a> </div><div class="categories"> <span class="category defaultCursor">Women Best Sellers</span> <a class="subcategory" href="https://domain.com/womens-sandals/">Sandals</a> <a class="subcategory" href="https://domain.com/womens-sneakers/">Sneakers</a> <a class="subcategory" href="https://domain.com/women-dresses/">Dresses</a> <a class="subcategory" href="https://domain.com/women-tops/">Tops</a> </div><div class="categories"><a class="category" href="https://domain.com/womens-curvy-clothing/">Women's Curvy Clothing</a> </div><div class="categories"><a class="category" href="https://domain.com/fashion-bundles/v/">Fashion Bundles</a> </div><div class="categories"><a class="category" href="https://domain.com/hijab-fashion/">Hijab Fashion</a> </div></div><div class="column"><div class="categories"><a class="category" href="https://domain.com/brands/fashion-by-/">SEE ALL BRANDS</a> <a class="subcategory" href="https://domain.com/adidas/">Adidas</a> <a class="subcategory" href="https://domain.com/converse/">Converse</a> <a class="subcategory" href="https://domain.com/ravin/">Ravin</a> <a class="subcategory" href="https://domain.com/dejavu/">Dejavu</a> <a class="subcategory" href="https://domain.com/agu/">Agu</a> <a class="subcategory" href="https://domain.com/activ/">Activ</a> <a class="subcategory" href="https://domain.com/oxford--bellini--tie-house--milano/">Tie House</a> <a class="subcategory" href="https://domain.com/shoe-room/">Shoe Room</a> <a class="subcategory" href="https://domain.com/town-team/">Town Team</a> </div></div></div></div>
</li>
<li class="menu-item" data-id="">
<a href="https://domain.com/fashion-by-/" class="main-category">
<i class="cat-icon osh-font-fashion"></i> <span class="nav-subTxt">FASHION </span> <i class="osh-font-light-arrow-left"></i><i class="osh-font-light-arrow-right"></i>
</a> <div class="navLayerWrapper" style="width: 633px; display: none;"><div class="submenu"><div class="column"><div class="categories"><a class="category" href="https://domain.com/fashion-by-/?sort=newest&dir=desc&viewType=gridView3">New Arrivals !</a> </div><div class="categories"><a class="category" href="https://domain.com/men-fashion/">Men</a> <a class="subcategory" href="https://domain.com/mens-shoes/">Shoes</a> <a class="subcategory" href="https://domain.com/mens-clothing/">Clothing</a> <a class="subcategory" href="https://domain.com/mens-accessories/">Accessories</a> </div><div class="categories"><a class="category" href="https://domain.com/women-fashion/">Women</a> <a class="subcategory" href="https://domain.com/womens-shoes/">Shoes</a> <a class="subcategory" href="https://domain.com/womens-clothing/">Clothing</a> <a class="subcategory" href="https://domain.com/womens-accessories/">Accessories</a> </div><div class="categories"><a class="category" href="https://domain.com/girls-boys-fashion/">Kids</a> <a class="subcategory" href="https://domain.com/boys-fashion/">Boys</a> <a class="subcategory" href="https://domain.com/girls/">Girls</a> </div><div class="categories"><a class="category" href="https://domain.com/maternity-clothes/">Maternity Clothes</a> </div></div><div class="column"><div class="categories"> <span class="category defaultCursor">Men Best Sellers</span> <a class="subcategory" href="https://domain.com/mens-casual-shoes/">Casual Shoes</a> <a class="subcategory" href="https://domain.com/mens-sneakers/">Sneakers</a> <a class="subcategory" href="https://domain.com/mens-t-shirts/">T-shirts</a> <a class="subcategory" href="https://domain.com/mens-polos/">Polos</a> </div><div class="categories"> <span class="category defaultCursor">Women Best Sellers</span> <a class="subcategory" href="https://domain.com/womens-sandals/">Sandals</a> <a class="subcategory" href="https://domain.com/womens-sneakers/">Sneakers</a> <a class="subcategory" href="https://domain.com/women-dresses/">Dresses</a> <a class="subcategory" href="https://domain.com/women-tops/">Tops</a> </div><div class="categories"><a class="category" href="https://domain.com/womens-curvy-clothing/">Women's Curvy Clothing</a> </div><div class="categories"><a class="category" href="https://domain.com/fashion-bundles/v/">Fashion Bundles</a> </div><div class="categories"><a class="category" href="https://domain.com/hijab-fashion/">Hijab Fashion</a> </div></div><div class="column"><div class="categories"><a class="category" href="https://domain.com/brands/fashion-by-/">SEE ALL BRANDS</a> <a class="subcategory" href="https://domain.com/adidas/">Adidas</a> <a class="subcategory" href="https://domain.com/converse/">Converse</a> <a class="subcategory" href="https://domain.com/ravin/">Ravin</a> <a class="subcategory" href="https://domain.com/dejavu/">Dejavu</a> <a class="subcategory" href="https://domain.com/agu/">Agu</a> <a class="subcategory" href="https://domain.com/activ/">Activ</a> <a class="subcategory" href="https://domain.com/oxford--bellini--tie-house--milano/">Tie House</a> <a class="subcategory" href="https://domain.com/shoe-room/">Shoe Room</a> <a class="subcategory" href="https://domain.com/town-team/">Town Team</a> </div></div></div></div>
</li>
让我们建立一个项目
创建一个新的控制台应用程序或为我们的新示例添加一个名为“ShoppingSiteSample”的新文件夹。
添加名为“ShoppingScraper”的新类
首先将抓取网站的类别及其子类别。
让我们创建一个类别模型:
public class Category
{
/// <summary>
/// Gets or sets the name.
/// </summary>
/// <value>
/// The name.
/// </value>
public string Name { get; set; }
/// <summary>
/// Gets or sets the URL.
/// </summary>
/// <value>
/// The URL.
/// </value>
public string URL { get; set; }
/// <summary>
/// Gets or sets the sub categories.
/// </summary>
/// <value>
/// The sub categories.
/// </value>
public List<Category> SubCategories { get; set; }
}
public class Category
{
/// <summary>
/// Gets or sets the name.
/// </summary>
/// <value>
/// The name.
/// </value>
public string Name { get; set; }
/// <summary>
/// Gets or sets the URL.
/// </summary>
/// <value>
/// The URL.
/// </value>
public string URL { get; set; }
/// <summary>
/// Gets or sets the sub categories.
/// </summary>
/// <value>
/// The sub categories.
/// </value>
public List<Category> SubCategories { get; set; }
}
Public Class Category
''' <summary>
''' Gets or sets the name.
''' </summary>
''' <value>
''' The name.
''' </value>
Public Property Name() As String
''' <summary>
''' Gets or sets the URL.
''' </summary>
''' <value>
''' The URL.
''' </value>
Public Property URL() As String
''' <summary>
''' Gets or sets the sub categories.
''' </summary>
''' <value>
''' The sub categories.
''' </value>
Public Property SubCategories() As List(Of Category)
End Class
- 现在,让我们构建刮擦逻辑
public class ShoppingScraper : WebScraper
{
/// <summary>
/// Override this method initialize your web-scraper.
/// Important tasks will be to Request at least one start url... and set allowed/banned domain or url patterns.
/// </summary>
public override void Init()
{
License.LicenseKey = "LicenseKey";
this.LoggingLevel = WebScraper.LogLevel.All;
this.WorkingDirectory = AppSetting.GetAppRoot() + @"\ShoppingSiteSample\Output\";
this.Request("www.webSite.com", Parse);
}
/// <summary>
/// Override this method to create the default Response handler for your web scraper.
/// If you have multiple page types, you can add additional similar methods.
/// </summary>
/// <param name="response">The http Response object to parse</param>
public override void Parse(Response response)
{
var categoryList = new List<Category>();
foreach (var Links in response.Css("#menuFixed > ul > li > a "))
{
var cat = new Category();
cat.URL = Links.Attributes ["href"];
cat.Name = Links.InnerText;
categoryList.Add(cat);
}
Scrape(categoryList, "Shopping.Jsonl");
}
}
public class ShoppingScraper : WebScraper
{
/// <summary>
/// Override this method initialize your web-scraper.
/// Important tasks will be to Request at least one start url... and set allowed/banned domain or url patterns.
/// </summary>
public override void Init()
{
License.LicenseKey = "LicenseKey";
this.LoggingLevel = WebScraper.LogLevel.All;
this.WorkingDirectory = AppSetting.GetAppRoot() + @"\ShoppingSiteSample\Output\";
this.Request("www.webSite.com", Parse);
}
/// <summary>
/// Override this method to create the default Response handler for your web scraper.
/// If you have multiple page types, you can add additional similar methods.
/// </summary>
/// <param name="response">The http Response object to parse</param>
public override void Parse(Response response)
{
var categoryList = new List<Category>();
foreach (var Links in response.Css("#menuFixed > ul > li > a "))
{
var cat = new Category();
cat.URL = Links.Attributes ["href"];
cat.Name = Links.InnerText;
categoryList.Add(cat);
}
Scrape(categoryList, "Shopping.Jsonl");
}
}
IRON VB CONVERTER ERROR developers@ironsoftware.com
从菜单中抓取链接
让我们更新代码,以抓取主分类及其所有子链接
public override void Parse(Response response)
{
// List of Categories Links (Root)
var categoryList = new List<Category>();
foreach (var li in response.Css("#menuFixed > ul > li"))
{
// List Of Main Links
foreach (var Links in li.Css("a"))
{
var cat = new Category();
cat.URL = Links.Attributes ["href"];
cat.Name = Links.InnerText;
cat.SubCategories = new List<Category>();
// List of Sub Catgories Links
foreach (var subCategory in li.Css("a [class=subcategory]"))
{
var subcat = new Category();
subcat.URL = Links.Attributes ["href"];
subcat.Name = Links.InnerText;
// Check If Link Exist Before
if (cat.SubCategories.Find(c=>c.Name== subcat.Name && c.URL == subcat.URL) == null)
{
// Add Sublinks
cat.SubCategories.Add(subcat);
}
}
// Add Categories
categoryList.Add(cat);
}
}
Scrape(categoryList, "Shopping.Jsonl");
}
public override void Parse(Response response)
{
// List of Categories Links (Root)
var categoryList = new List<Category>();
foreach (var li in response.Css("#menuFixed > ul > li"))
{
// List Of Main Links
foreach (var Links in li.Css("a"))
{
var cat = new Category();
cat.URL = Links.Attributes ["href"];
cat.Name = Links.InnerText;
cat.SubCategories = new List<Category>();
// List of Sub Catgories Links
foreach (var subCategory in li.Css("a [class=subcategory]"))
{
var subcat = new Category();
subcat.URL = Links.Attributes ["href"];
subcat.Name = Links.InnerText;
// Check If Link Exist Before
if (cat.SubCategories.Find(c=>c.Name== subcat.Name && c.URL == subcat.URL) == null)
{
// Add Sublinks
cat.SubCategories.Add(subcat);
}
}
// Add Categories
categoryList.Add(cat);
}
}
Scrape(categoryList, "Shopping.Jsonl");
}
Public Overrides Sub Parse(ByVal response As Response)
' List of Categories Links (Root)
Dim categoryList = New List(Of Category)()
For Each li In response.Css("#menuFixed > ul > li")
' List Of Main Links
For Each Links In li.Css("a")
Dim cat = New Category()
cat.URL = Links.Attributes ("href")
cat.Name = Links.InnerText
cat.SubCategories = New List(Of Category)()
' List of Sub Catgories Links
For Each subCategory In li.Css("a [class=subcategory]")
Dim subcat = New Category()
subcat.URL = Links.Attributes ("href")
subcat.Name = Links.InnerText
' Check If Link Exist Before
If cat.SubCategories.Find(Function(c) c.Name= subcat.Name AndAlso c.URL = subcat.URL) Is Nothing Then
' Add Sublinks
cat.SubCategories.Add(subcat)
End If
Next subCategory
' Add Categories
categoryList.Add(cat)
Next Links
Next li
Scrape(categoryList, "Shopping.Jsonl")
End Sub
现在我们有了指向所有网站类别的链接,让我们开始抓取每个类别中的产品。
让我们导航到任何类别并检查内容。
让我们看看它的代码
<section class="products">
<div class="sku -gallery -validate-size " data-sku="AG249FA0T2PSGNAFAMZ" ft-product-sizes="41,42,43,44,45" ft-product-color="Multicolour">
<a class="link" href="http://www.WebSite.com/agu-bundle-of-2-sneakers-black-navy-blue-653884.html">
<div class="image-wrapper default-state">
<img class="lazy image -loaded" alt="Bundle Of 2 Sneakers - Black &amp; Navy Blue" data-image-vertical="1" width="210" height="262" src="https://static.WebSite.com/p/agu-6208-488356-1-catalog_grid_3.jpg" data-sku="AG249FA0T2PSGNAFAMZ" data-src="https://static.WebSite.com/p/agu-6208-488356-1-catalog_grid_3.jpg" data-placeholder="placeholder_m_1.jpg"><noscript><img src="https://static.WebSite.com/p/agu-6208-488356-1-catalog_grid_3.jpg" width="210" height="262" class="image" /></noscript>
</div> <h2 class="title">
<span class="brand ">Agu </span>
<span class="name" dir="ltr">Bundle Of 2 Sneakers - Black & Navy Blue</span>
</h2><div class="price-container clearfix">
<span class="price-box">
<span class="price">
<span data-currency-iso="EGP">EGP</span>
<span dir="ltr" data-price="299">299</span>
</span> <span class="price -old -no-special"></span>
</span>
</div><div class="rating-stars"><div class="stars-container"><div class="stars" style="width: 62%"></div></div> <div class="total-ratings">(30)</div> </div> <span class="shop-first-logo-container"><img src="http://www.WebSite.com/images/local/logos/shop_first/ShoppingSite/logo_normal.png" data-src="http://www.WebSite.com/images/local/logos/shop_first/ShoppingSite/logo_normal.png" class="lazy shop-first-logo-img -mbxs -loaded"> </span>
<span class="osh-icon -ShoppingSite-local shop_local--logo -block -mbs -mts"></span>
<div class="list -sizes" data-selected-sku="">
<span class="js-link sku-size" data-href="http://www.WebSite.com/agu-bundle-of-2-sneakers-black-navy-blue-653884.html?size=41">41</span> <span class="js-link sku-size" data-href="http://www.WebSite.com/agu-bundle-of-2-sneakers-black-navy-blue-653884.html?size=42">42</span>
<span class="js-link sku-size" data-href="http://www.WebSite.com/agu-bundle-of-2-sneakers-black-navy-blue-653884.html?size=43">43</span> <span class="js-link sku-size" data-href="http://www.WebSite.com/agu-bundle-of-2-sneakers-black-navy-blue-653884.html?size=44">44</span>
<span class="js-link sku-size" data-href="http://www.WebSite.com/agu-bundle-of-2-sneakers-black-navy-blue-653884.html?size=45">45</span>
</div>
</a>
</div>
<div class="sku -gallery -validate-size " data-sku="LE047FA01SRK4NAFAMZ" ft-product-sizes="110,115,120,125,130,135" ft-product-color="Black">
<a class="link" href="http://www.WebSite.com/leather-shop-genuine-leather-belt-black-712030.html">
<div class="image-wrapper default-state"><img class="lazy image -loaded" alt="Genuine Leather Belt - Black" data-image-vertical="1" width="210" height="262" src="https://static.WebSite.com/p/leather-shop-1831-030217-1-catalog_grid_3.jpg" data-sku="LE047FA01SRK4NAFAMZ" data-src="https://static.WebSite.com/p/leather-shop-1831-030217-1-catalog_grid_3.jpg" data-placeholder="placeholder_m_1.jpg"><noscript><img src="https://static.WebSite.com/p/leather-shop-1831-030217-1-catalog_grid_3.jpg" width="210" height="262" class="image" /></noscript></div>
<h2 class="title"><span class="brand ">Leather Shop </span> <span class="name" dir="ltr">Genuine Leather Belt - Black</span></h2><div class="price-container clearfix">
<span class="sale-flag-percent">-29%</span> <span class="price-box"> <span class="price"><span data-currency-iso="EGP">EGP</span> <span dir="ltr" data-price="96">96</span> </span> <span class="price -old "><span data-currency-iso="EGP">EGP</span> <span dir="ltr" data-price="135">135</span> </span> </span>
</div><div class="rating-stars"><div class="stars-container"><div class="stars" style="width: 100%"></div></div> <div class="total-ratings">(1)</div> </div>
<span class="osh-icon -ShoppingSite-local shop_local--logo -block -mbs -mts"></span> <div class="list -sizes" data-selected-sku="">
<span class="js-link sku-size" data-href="http://www.WebSite.com/leather-shop-genuine-leather-belt-black-712030.html?size=110">110</span> <span class="js-link sku-size" data-href="http://www.WebSite.com/leather-shop-genuine-leather-belt-black-712030.html?size=115">115</span>
<span class="js-link sku-size" data-href="http://www.WebSite.com/leather-shop-genuine-leather-belt-black-712030.html?size=120">120</span> <span class="js-link sku-size" data-href="http://www.WebSite.com/leather-shop-genuine-leather-belt-black-712030.html?size=125">125</span> <span class="js-link sku-size" data-href="http://www.WebSite.com/leather-shop-genuine-leather-belt-black-712030.html?size=130">130</span>
<span class="js-link sku-size" data-href="http://www.WebSite.com/leather-shop-genuine-leather-belt-black-712030.html?size=135">135</span>
</div>
</a>
</div>
</section>
<section class="products">
<div class="sku -gallery -validate-size " data-sku="AG249FA0T2PSGNAFAMZ" ft-product-sizes="41,42,43,44,45" ft-product-color="Multicolour">
<a class="link" href="http://www.WebSite.com/agu-bundle-of-2-sneakers-black-navy-blue-653884.html">
<div class="image-wrapper default-state">
<img class="lazy image -loaded" alt="Bundle Of 2 Sneakers - Black &amp; Navy Blue" data-image-vertical="1" width="210" height="262" src="https://static.WebSite.com/p/agu-6208-488356-1-catalog_grid_3.jpg" data-sku="AG249FA0T2PSGNAFAMZ" data-src="https://static.WebSite.com/p/agu-6208-488356-1-catalog_grid_3.jpg" data-placeholder="placeholder_m_1.jpg"><noscript><img src="https://static.WebSite.com/p/agu-6208-488356-1-catalog_grid_3.jpg" width="210" height="262" class="image" /></noscript>
</div> <h2 class="title">
<span class="brand ">Agu </span>
<span class="name" dir="ltr">Bundle Of 2 Sneakers - Black & Navy Blue</span>
</h2><div class="price-container clearfix">
<span class="price-box">
<span class="price">
<span data-currency-iso="EGP">EGP</span>
<span dir="ltr" data-price="299">299</span>
</span> <span class="price -old -no-special"></span>
</span>
</div><div class="rating-stars"><div class="stars-container"><div class="stars" style="width: 62%"></div></div> <div class="total-ratings">(30)</div> </div> <span class="shop-first-logo-container"><img src="http://www.WebSite.com/images/local/logos/shop_first/ShoppingSite/logo_normal.png" data-src="http://www.WebSite.com/images/local/logos/shop_first/ShoppingSite/logo_normal.png" class="lazy shop-first-logo-img -mbxs -loaded"> </span>
<span class="osh-icon -ShoppingSite-local shop_local--logo -block -mbs -mts"></span>
<div class="list -sizes" data-selected-sku="">
<span class="js-link sku-size" data-href="http://www.WebSite.com/agu-bundle-of-2-sneakers-black-navy-blue-653884.html?size=41">41</span> <span class="js-link sku-size" data-href="http://www.WebSite.com/agu-bundle-of-2-sneakers-black-navy-blue-653884.html?size=42">42</span>
<span class="js-link sku-size" data-href="http://www.WebSite.com/agu-bundle-of-2-sneakers-black-navy-blue-653884.html?size=43">43</span> <span class="js-link sku-size" data-href="http://www.WebSite.com/agu-bundle-of-2-sneakers-black-navy-blue-653884.html?size=44">44</span>
<span class="js-link sku-size" data-href="http://www.WebSite.com/agu-bundle-of-2-sneakers-black-navy-blue-653884.html?size=45">45</span>
</div>
</a>
</div>
<div class="sku -gallery -validate-size " data-sku="LE047FA01SRK4NAFAMZ" ft-product-sizes="110,115,120,125,130,135" ft-product-color="Black">
<a class="link" href="http://www.WebSite.com/leather-shop-genuine-leather-belt-black-712030.html">
<div class="image-wrapper default-state"><img class="lazy image -loaded" alt="Genuine Leather Belt - Black" data-image-vertical="1" width="210" height="262" src="https://static.WebSite.com/p/leather-shop-1831-030217-1-catalog_grid_3.jpg" data-sku="LE047FA01SRK4NAFAMZ" data-src="https://static.WebSite.com/p/leather-shop-1831-030217-1-catalog_grid_3.jpg" data-placeholder="placeholder_m_1.jpg"><noscript><img src="https://static.WebSite.com/p/leather-shop-1831-030217-1-catalog_grid_3.jpg" width="210" height="262" class="image" /></noscript></div>
<h2 class="title"><span class="brand ">Leather Shop </span> <span class="name" dir="ltr">Genuine Leather Belt - Black</span></h2><div class="price-container clearfix">
<span class="sale-flag-percent">-29%</span> <span class="price-box"> <span class="price"><span data-currency-iso="EGP">EGP</span> <span dir="ltr" data-price="96">96</span> </span> <span class="price -old "><span data-currency-iso="EGP">EGP</span> <span dir="ltr" data-price="135">135</span> </span> </span>
</div><div class="rating-stars"><div class="stars-container"><div class="stars" style="width: 100%"></div></div> <div class="total-ratings">(1)</div> </div>
<span class="osh-icon -ShoppingSite-local shop_local--logo -block -mbs -mts"></span> <div class="list -sizes" data-selected-sku="">
<span class="js-link sku-size" data-href="http://www.WebSite.com/leather-shop-genuine-leather-belt-black-712030.html?size=110">110</span> <span class="js-link sku-size" data-href="http://www.WebSite.com/leather-shop-genuine-leather-belt-black-712030.html?size=115">115</span>
<span class="js-link sku-size" data-href="http://www.WebSite.com/leather-shop-genuine-leather-belt-black-712030.html?size=120">120</span> <span class="js-link sku-size" data-href="http://www.WebSite.com/leather-shop-genuine-leather-belt-black-712030.html?size=125">125</span> <span class="js-link sku-size" data-href="http://www.WebSite.com/leather-shop-genuine-leather-belt-black-712030.html?size=130">130</span>
<span class="js-link sku-size" data-href="http://www.WebSite.com/leather-shop-genuine-leather-belt-black-712030.html?size=135">135</span>
</div>
</a>
</div>
</section>
让我们为这些内容建立产品模型。
public class Product
{
/// <summary>
/// Gets or sets the name.
/// </summary>
/// <value>
/// The name.
/// </value>
public string Name { get; set; }
/// <summary>
/// Gets or sets the price.
/// </summary>
/// <value>
/// The price.
/// </value>
public string Price { get; set; }
/// <summary>
/// Gets or sets the image.
/// </summary>
/// <value>
/// The image.
/// </value>
public string Image { get; set; }
}
public class Product
{
/// <summary>
/// Gets or sets the name.
/// </summary>
/// <value>
/// The name.
/// </value>
public string Name { get; set; }
/// <summary>
/// Gets or sets the price.
/// </summary>
/// <value>
/// The price.
/// </value>
public string Price { get; set; }
/// <summary>
/// Gets or sets the image.
/// </summary>
/// <value>
/// The image.
/// </value>
public string Image { get; set; }
}
Public Class Product
''' <summary>
''' Gets or sets the name.
''' </summary>
''' <value>
''' The name.
''' </value>
Public Property Name() As String
''' <summary>
''' Gets or sets the price.
''' </summary>
''' <value>
''' The price.
''' </value>
Public Property Price() As String
''' <summary>
''' Gets or sets the image.
''' </summary>
''' <value>
''' The image.
''' </value>
Public Property Image() As String
End Class
为了抓取分类页面,我们添加了一种新的抓取方法:
public void ParseCatgory(Response response)
{
// List of Products Links (Root)
var productList = new List<Product>();
foreach (var Links in response.Css("body > main > section.osh-content > section.products > div > a"))
{
var product = new Product();
product.Name = Links.InnerText;
product.Image = Links.Css("div.image-wrapper.default-state > img")[0].Attributes ["src"];
productList.Add(product);
}
Scrape(productList, "Products.Jsonl");
}
public void ParseCatgory(Response response)
{
// List of Products Links (Root)
var productList = new List<Product>();
foreach (var Links in response.Css("body > main > section.osh-content > section.products > div > a"))
{
var product = new Product();
product.Name = Links.InnerText;
product.Image = Links.Css("div.image-wrapper.default-state > img")[0].Attributes ["src"];
productList.Add(product);
}
Scrape(productList, "Products.Jsonl");
}
Public Sub ParseCatgory(ByVal response As Response)
' List of Products Links (Root)
Dim productList = New List(Of Product)()
For Each Links In response.Css("body > main > section.osh-content > section.products > div > a")
Dim product As New Product()
product.Name = Links.InnerText
product.Image = Links.Css("div.image-wrapper.default-state > img")(0).Attributes ("src")
productList.Add(product)
Next Links
Scrape(productList, "Products.Jsonl")
End Sub
高级网页抓取功能
HttpIdentity 功能:
一些网站系统要求用户登录后才能查看内容; 在这种情况下,我们可以使用HttpIdentity:-
HttpIdentity id = new HttpIdentity();
id.NetworkUsername = "username";
id.NetworkPassword = "pwd";
Identities.Add(id);
HttpIdentity id = new HttpIdentity();
id.NetworkUsername = "username";
id.NetworkPassword = "pwd";
Identities.Add(id);
Dim id As New HttpIdentity()
id.NetworkUsername = "username"
id.NetworkPassword = "pwd"
Identities.Add(id)
IronWebScraper 最令人印象深刻、功能最强大的功能之一,就是可以使用数千种独特的网络技术。(用户凭据和/或浏览器引擎)使用多重登录会话欺骗或搜索网站。
public override void Init()
{
License.LicenseKey = " LicenseKey ";
this.LoggingLevel = WebScraper.LogLevel.All;
this.WorkingDirectory = AppSetting.GetAppRoot() + @"\ShoppingSiteSample\Output\";
var proxies = "IP-Proxy1: 8080,IP-Proxy2: 8081".Split(',');
foreach (var UA in IronWebScraper.CommonUserAgents.ChromeDesktopUserAgents)
{
foreach (var proxy in proxies)
{
Identities.Add(new HttpIdentity()
{
UserAgent = UA,
UseCookies = true,
Proxy = proxy
});
}
}
this.Request("http://www.Website.com", Parse);
}
public override void Init()
{
License.LicenseKey = " LicenseKey ";
this.LoggingLevel = WebScraper.LogLevel.All;
this.WorkingDirectory = AppSetting.GetAppRoot() + @"\ShoppingSiteSample\Output\";
var proxies = "IP-Proxy1: 8080,IP-Proxy2: 8081".Split(',');
foreach (var UA in IronWebScraper.CommonUserAgents.ChromeDesktopUserAgents)
{
foreach (var proxy in proxies)
{
Identities.Add(new HttpIdentity()
{
UserAgent = UA,
UseCookies = true,
Proxy = proxy
});
}
}
this.Request("http://www.Website.com", Parse);
}
Public Overrides Sub Init()
License.LicenseKey = " LicenseKey "
Me.LoggingLevel = WebScraper.LogLevel.All
Me.WorkingDirectory = AppSetting.GetAppRoot() & "\ShoppingSiteSample\Output\"
Dim proxies = "IP-Proxy1: 8080,IP-Proxy2: 8081".Split(","c)
For Each UA In IronWebScraper.CommonUserAgents.ChromeDesktopUserAgents
For Each proxy In proxies
Identities.Add(New HttpIdentity() With {
.UserAgent = UA,
.UseCookies = True,
.Proxy = proxy
})
Next proxy
Next UA
Me.Request("http://www.Website.com", Parse)
End Sub
您拥有多个属性以赋予您不同的行为,因此可以防止网站屏蔽您。
一些这些属性:-
- NetworkDomain:用于用户认证的网络域。 支持 Windows、NTLM、Keroberos、Linux、BSD 和 Mac OS X 网络。 必须与此配合使用(网络用户名和网络密码)
- NetworkUsername:用于用户认证的网络/HTTP用户名。 支持Http、Windows网络、NTLM、Kerberos、Linux网络、BSD网络和Mac OS。
- NetworkPassword:用于用户身份验证的网络/HTTP密码。 支持Http、Windows网络、NTLM、Keroberos、Linux网络、BSD网络和Mac OS。
- 代理:设置代理配置
- UserAgent:设置浏览器引擎(chrome桌面、chrome手机、chrome平板电脑、IE和火狐等。)
- HttpRequestHeaders:用于将与此标识一起使用的自定义头部值,它接受字典对象。(字典 <字符串, string>)
UseCookies:启用/禁用使用 cookies
IronWebScraper 使用随机身份运行抓取程序。 如果我们需要指定使用特定身份来解析页面,我们可以这样做。
public override void Init()
{
License.LicenseKey = " LicenseKey ";
this.LoggingLevel = WebScraper.LogLevel.All;
this.WorkingDirectory = AppSetting.GetAppRoot() + @"\ShoppingSiteSample\Output\";
HttpIdentity identity = new HttpIdentity();
identity.NetworkUsername = "username";
identity.NetworkPassword = "pwd";
Identities.Add(id);
this.Request("http://www.Website.com", Parse, identity);
}
public override void Init()
{
License.LicenseKey = " LicenseKey ";
this.LoggingLevel = WebScraper.LogLevel.All;
this.WorkingDirectory = AppSetting.GetAppRoot() + @"\ShoppingSiteSample\Output\";
HttpIdentity identity = new HttpIdentity();
identity.NetworkUsername = "username";
identity.NetworkPassword = "pwd";
Identities.Add(id);
this.Request("http://www.Website.com", Parse, identity);
}
Public Overrides Sub Init()
License.LicenseKey = " LicenseKey "
Me.LoggingLevel = WebScraper.LogLevel.All
Me.WorkingDirectory = AppSetting.GetAppRoot() & "\ShoppingSiteSample\Output\"
Dim identity As New HttpIdentity()
identity.NetworkUsername = "username"
identity.NetworkPassword = "pwd"
Identities.Add(id)
Me.Request("http://www.Website.com", Parse, identity)
End Sub
启用网页缓存功能:
此功能用于缓存请求的页面。 它经常在开发和测试阶段使用; 使开发者能够缓存所需页面以在更新代码后重用。 这使您在重新启动网络抓取工具后能够在缓存的页面上执行代码,而不需要每次都连接到实时网站。(动作重放).
您可以在Init中使用它()方法
启用网页缓存();
或
启用网页缓存(有效期);
它会将缓存数据保存到工作目录下的WebCache文件夹中。
public override void Init()
{
License.LicenseKey = " LicenseKey ";
this.LoggingLevel = WebScraper.LogLevel.All;
this.WorkingDirectory = AppSetting.GetAppRoot() + @"\ShoppingSiteSample\Output\";
EnableWebCache(new TimeSpan(1,30,30));
this.Request("http://www.WebSite.com", Parse);
}
public override void Init()
{
License.LicenseKey = " LicenseKey ";
this.LoggingLevel = WebScraper.LogLevel.All;
this.WorkingDirectory = AppSetting.GetAppRoot() + @"\ShoppingSiteSample\Output\";
EnableWebCache(new TimeSpan(1,30,30));
this.Request("http://www.WebSite.com", Parse);
}
Public Overrides Sub Init()
License.LicenseKey = " LicenseKey "
Me.LoggingLevel = WebScraper.LogLevel.All
Me.WorkingDirectory = AppSetting.GetAppRoot() & "\ShoppingSiteSample\Output\"
EnableWebCache(New TimeSpan(1,30,30))
Me.Request("http://www.WebSite.com", Parse)
End Sub
IronWebScraper 还具有一些功能,可通过使用 "开始 "设置引擎启动进程名称,使引擎在重启代码后继续进行刮擦。(CrawlID)
static void Main(string [] args)
{
// Create Object From Scraper class
EngineScraper scrape = new EngineScraper();
// Start Scraping
scrape.Start("enginestate");
}
static void Main(string [] args)
{
// Create Object From Scraper class
EngineScraper scrape = new EngineScraper();
// Start Scraping
scrape.Start("enginestate");
}
Shared Sub Main(ByVal args() As String)
' Create Object From Scraper class
Dim scrape As New EngineScraper()
' Start Scraping
scrape.Start("enginestate")
End Sub
执行请求和响应将保存在工作目录内的 SavedState 文件夹中。
限流
我们可以控制每个域的最小和最大连接数量以及连接速度。
public override void Init()
{
License.LicenseKey = "LicenseKey";
this.LoggingLevel = WebScraper.LogLevel.All;
this.WorkingDirectory = AppSetting.GetAppRoot() + @"\ShoppingSiteSample\Output\";
// Gets or sets the total number of allowed open HTTP requests (threads)
this.MaxHttpConnectionLimit = 80;
// Gets or sets minimum polite delay (pause)between request to a given domain or IP address.
this.RateLimitPerHost = TimeSpan.FromMilliseconds(50);
// Gets or sets the allowed number of concurrent HTTP requests (threads) per hostname
// or IP address. This helps protect hosts against too many requests.
this.OpenConnectionLimitPerHost = 25;
this.ObeyRobotsDotTxt = false;
// Makes the WebSraper intelligently throttle requests not only by hostname, but
// also by host servers' IP addresses. This is polite in-case multiple scraped domains
// are hosted on the same machine.
this.ThrottleMode = Throttle.ByDomainHostName;
this.Request("https://www.Website.com", Parse);
}
public override void Init()
{
License.LicenseKey = "LicenseKey";
this.LoggingLevel = WebScraper.LogLevel.All;
this.WorkingDirectory = AppSetting.GetAppRoot() + @"\ShoppingSiteSample\Output\";
// Gets or sets the total number of allowed open HTTP requests (threads)
this.MaxHttpConnectionLimit = 80;
// Gets or sets minimum polite delay (pause)between request to a given domain or IP address.
this.RateLimitPerHost = TimeSpan.FromMilliseconds(50);
// Gets or sets the allowed number of concurrent HTTP requests (threads) per hostname
// or IP address. This helps protect hosts against too many requests.
this.OpenConnectionLimitPerHost = 25;
this.ObeyRobotsDotTxt = false;
// Makes the WebSraper intelligently throttle requests not only by hostname, but
// also by host servers' IP addresses. This is polite in-case multiple scraped domains
// are hosted on the same machine.
this.ThrottleMode = Throttle.ByDomainHostName;
this.Request("https://www.Website.com", Parse);
}
Public Overrides Sub Init()
License.LicenseKey = "LicenseKey"
Me.LoggingLevel = WebScraper.LogLevel.All
Me.WorkingDirectory = AppSetting.GetAppRoot() & "\ShoppingSiteSample\Output\"
' Gets or sets the total number of allowed open HTTP requests (threads)
Me.MaxHttpConnectionLimit = 80
' Gets or sets minimum polite delay (pause)between request to a given domain or IP address.
Me.RateLimitPerHost = TimeSpan.FromMilliseconds(50)
' Gets or sets the allowed number of concurrent HTTP requests (threads) per hostname
' or IP address. This helps protect hosts against too many requests.
Me.OpenConnectionLimitPerHost = 25
Me.ObeyRobotsDotTxt = False
' Makes the WebSraper intelligently throttle requests not only by hostname, but
' also by host servers' IP addresses. This is polite in-case multiple scraped domains
' are hosted on the same machine.
Me.ThrottleMode = Throttle.ByDomainHostName
Me.Request("https://www.Website.com", Parse)
End Sub
限流属性
- MaxHttpConnectionLimit
允许打开的 HTTP 请求总数 (线程) - RateLimitPerHost
最小礼貌延迟或暂停 (毫秒)请求给定域名或IP地址之间 - OpenConnectionLimitPerHost
允许的并发 HTTP 请求数 (线程) - ThrottleMode
使WebSraper智能地限制请求,不仅按主机名,还按主机服务器的IP地址。 这是礼貌的做法,以防多个抓取的域名托管在同一台机器上。
附录
如何创建Windows窗体应用程序?
我们应该使用Visual Studio 2013或更高版本来进行这项工作。
按照以下步骤创建新的Windows Forms项目:
位置:在硬盘上选择一个位置
如何创建网页表单应用程序?
您应该使用Visual Studio 2013或更高版本。
按照以下步骤创建一个新的Asp.NET Web表单项目。
打开 Visual Studio
文件 -> 新建 -> 项目
- 从模板选择编程语言(Visual C# 或 VB) -> 网络 -> ASP.NET 网络应用程序(.NET框架).
项目名称:IronScraperSample
位置:从您的硬盘中选择一个位置
从您的 ASP.NET 模板
- 现在,您的基本 ASP.NET 网络表单项目已经创建
点击这里下载完整的教程示例项目代码项目。