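How to Scrape Website Data with C#
IronWebScraper is a .NET library for web scraping, web data extraction, and web content parsing. It is an easy-to-use library that can be added to Microsoft Visual Studio projects for use in development and production.
IronWebScraper has many distinctive features, such as controlling which pages, objects, and media are allowed or banned. It also supports managing multiple identities, web caching, and many other features covered in this tutorial.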
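Target audience
This tutorial is aimed at software developers with basic or advanced programming skills who want to build and implement solutions with advanced scraping capabilities (website scraping, website data gathering and extraction, website content parsing, web harvesting).

Required skills
- Basic fundamentals of programming in a Microsoft programming language such as C# or VB.NET
- Basic understanding of web technologies (HTML, JavaScript, jQuery, CSS, etc.) and how they work
- Basic knowledge of DOM, XPath, HTML, and CSS selectors
Tools
- Microsoft Visual Studio 2010 or later
- Web developer extensions for browsers, such as Web Inspector for Chrome or Firebug for Firefox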
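Why scrape?
(Reasons and concepts)
If you want to build a product or solution that can:
- Extract website data
- Compare contents, prices, features, etc. across multiple websites
- Scan and cache website content
If one or more of these apply to you, IronWebScraper is a good fit for your needs.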
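How to install IronWebScraper?
After you create a new project (see Appendix A), you can add the IronWebScraper library to it either by inserting the library automatically with NuGet or by installing the DLL manually.
Install using NuGet
To add the IronWebScraper library to a project through NuGet, you can use the visual interface (NuGet Package Manager) or run a command in the Package Manager Console.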
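Using the NuGet Package Manager
Using the NuGet Package Manager Console
In the Package Manager Console, run the command Install-Package IronWebScraper.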
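Install manually
1. Go to the IronWebScraper page, either by clicking its link or directly via the URL https://ironsoftware.com/csharp/webscraper/
2. Click Download DLL.
3. Extract the downloaded compressed file.
4. In Visual Studio, right-click the project -> Add -> Reference -> Browse.
5. Go to the extracted folder -> netstandard2.0 -> and select all the .dll files.
6. Done!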
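HelloScraper - our first IronWebScraper sample
As usual, we will start by implementing a Hello Scraper app to take our first steps with IronWebScraper.
We have created a new console application named "IronWebScraperSample".
Steps to create an IronWebScraper sample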
using IronWebScraper;

public class HelloScraper : WebScraper
{
    /// <summary>
    /// Override this method to initialize your web scraper.
    /// Important tasks are to request at least one start URL and to set allowed/banned domain or URL patterns.
    /// </summary>
    public override void Init()
    {
        License.LicenseKey = "LicenseKey"; // Write your license key
        this.LoggingLevel = WebScraper.LogLevel.All; // Log all events
        this.Request("https://blog.scrapinghub.com", Parse);
    }

    /// <summary>
    /// Override this method to create the default Response handler for your web scraper.
    /// If you have multiple page types, you can add additional similar methods.
    /// </summary>
    /// <param name="response">The HTTP Response object to parse</param>
    public override void Parse(Response response)
    {
        // Set the working directory for the project
        // (AppSetting.GetAppRoot() is assumed to be a helper defined in the sample project)
        this.WorkingDirectory = AppSetting.GetAppRoot() + @"\HelloScraperSample\Output\";
        // Loop over all title links
        foreach (var title_link in response.Css("h2.entry-title a"))
        {
            // Read the link text
            string strTitle = title_link.TextContentClean;
            // Save the result to a file
            Scrape(new ScrapedData() { { "Title", strTitle } }, "HelloScraper.json");
        }
        // Follow the link to the previous posts page, if there is one
        if (response.CssExists("div.prev-post > a[href]"))
        {
            // Get the link URL
            var next_page = response.Css("div.prev-post > a[href]")[0].Attributes["href"];
            // Scrape the next URL with the same handler
            this.Request(next_page, Parse);
        }
    }
}
Now, to start scraping, add the following code snippet to the Main method:
static void Main(string[] args)
{
    // Create an object from the HelloScraper class
    HelloScraperSample.HelloScraper scrape = new HelloScraperSample.HelloScraper();
    // Start scraping
    scrape.Start();
}
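Code overview
scrape.Start() triggers the following scrape logic:
- Init() is called first to initialize variables, scrape properties, and behavior attributes.
- As we can see, it sets the start page with Request("https://blog.scrapinghub.com", Parse) and designates Parse(Response response) as the procedure used to parse the response.
- WebScraper manages parallelism, HTTP, and threads for you, keeping all of your code easy to debug and synchronous.
- After Init(), the Parse method starts parsing the page.
- You can find elements using CSS selectors, the JavaScript DOM, or XPath.
- Selected elements are cast to the ScrapedData class; you can also cast them to any custom class (such as Product, Employee, News, etc.).
- Objects are saved in JSON format in a file under the "bin/Scrape/" directory. You can also pass the file path as a parameter, as we will see in later examples.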
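IronWebScraper library features and options
You can find the updated documentation inside the zip file downloaded with the manual installation method (the IronWebScraper Documentation.chm file).
Alternatively, you can check the online documentation for the latest library updates at https://ironsoftware.com/csharp/webscraper/object-reference/
To start using IronWebScraper in your project, you must inherit from the IronWebScraper.WebScraper class, which extends your class with scraping functionality.
You must also implement the Init() and Parse(Response response) methods.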
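using System;
using IronWebScraper;

namespace IronWebScraperEngine
{
    public class NewsScraper : IronWebScraper.WebScraper
    {
        // Set up the scraper here (start URLs, allowed/banned patterns, and so on)
        public override void Init()
        {
            throw new NotImplementedException();
        }

        // Default handler that processes each response
        public override void Parse(Response response)
        {
            throw new NotImplementedException();
        }
    }
}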
Namespace IronWebScraperEngine
    Public Class NewsScraper
        Inherits IronWebScraper.WebScraper
        Public Overrides Sub Init()
            Throw New NotImplementedException()
        End Sub
        Public Overrides Sub Parse(ByVal response As Response)
            Throw New NotImplementedException()
        End Sub
    End Class
End Namespace
The following table lists the methods and properties that the IronWebScraper library provides.

| Properties \ functions | Type | Description |
|---|---|---|
| Init() | Method | Used to set up the scraper. |
| Parse(Response response) | Method | Used to implement the logic the scraper uses and how it processes each response. Note: you can implement multiple methods for different page behaviors or structures. |
| BannedUrls (and the related allow/ban collections) | Collections | Used to ban or allow URLs and/or domains. Example: BannedUrls.Add("*.zip", "*.exe", "*.gz", "*.pdf"); |
| ObeyRobotsDotTxt | Boolean | Enables or disables reading and following robots.txt directives. |
| public override bool ObeyRobotsDotTxtForHost(string Host) | Method | Enables or disables reading and following robots.txt directives for a specific domain. |
| Scrape | Method | |
| ScrapeUnique | Method | |
| ThrottleMode | Enumeration | |
| EnableWebCache() | Method | |
| EnableWebCache(TimeSpan cacheDuration) | Method | |
| MaxHttpConnectionLimit | Int | |
| RateLimitPerHost | TimeSpan | |
| OpenConnectionLimitPerHost | Int | |
| SetSiteSpecificCrawlRateLimit(string hostName, TimeSpan crawlRate) | Method | |
| Identities | Collections | A list of HttpIdentity() objects used to fetch web resources. Each identity may have a different proxy IP address, user agent, HTTP headers, persistent cookies, username, and password. Best practice is to create identities in your WebScraper.Init method and add them to the WebScraper.Identities list. |
| WorkingDirectory | string | Sets the working directory where all scrape-related data is stored on disk. |
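As a rough illustration of how several of the options in the table might be combined, the sketch below configures a scraper inside Init(). The member names come from the table above; the class name PoliteScraper, the numeric values, the cache duration, the host example.com, and the output file name are placeholders chosen for illustration, and the exact member types should be checked against the object reference before relying on them.

```csharp
using System;
using IronWebScraper;

// Hypothetical scraper used only to illustrate the options listed in the table.
public class PoliteScraper : WebScraper
{
    public override void Init()
    {
        License.LicenseKey = "LicenseKey"; // replace with your license key

        // Skip binary downloads, mirroring the BannedUrls example from the table.
        this.BannedUrls.Add("*.zip", "*.exe", "*.gz", "*.pdf");

        // Read and follow robots.txt directives.
        this.ObeyRobotsDotTxt = true;

        // Illustrative throttle values (assumptions, not recommendations).
        this.MaxHttpConnectionLimit = 10;
        this.OpenConnectionLimitPerHost = 2;
        this.RateLimitPerHost = TimeSpan.FromMilliseconds(500);

        // Cache fetched pages for an assumed duration of one hour.
        this.EnableWebCache(TimeSpan.FromHours(1));

        this.Request("https://example.com", Parse); // placeholder start URL
    }

    public override void Parse(Response response)
    {
        // Save the text of every link on the page.
        foreach (var link in response.Css("a"))
        {
            Scrape(new ScrapedData() { { "Text", link.TextContentClean } }, "Links.jsonl");
        }
    }
}
```

Keeping all of these settings in Init() means they are applied once, before the first request is sent.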
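## Real-world examples and practice
Scraping an online movie website
Let's start another example with a real-world website. We'll choose to scrape a movie website.
Let's add a new class and name it "MovieScraper":
Now, let's take a look at the site we are going to scrape:
This is part of the home page HTML as we see it on the website: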
<div id="movie-featured" class="movies-list movies-list-full tab-pane in fade active">
    <div data-movie-id="20746" class="ml-item">
        <a href="https://website.com/film/king-arthur-legend-of-the-sword-20746/">
            <span class="mli-quality">CAM</span>
            <img data-original="https://img.gocdn.online/2017/05/16/poster/2116d6719c710eabe83b377463230fbe-king-arthur-legend-of-the-sword.jpg"
                 class="lazy thumb mli-thumb" alt="King Arthur: Legend of the Sword"
                 src="https://img.gocdn.online/2017/05/16/poster/2116d6719c710eabe83b377463230fbe-king-arthur-legend-of-the-sword.jpg"
                 style="display: inline-block;">
            <span class="mli-info"><h2>King Arthur: Legend of the Sword</h2></span>
        </a>
    </div>
    <div data-movie-id="20724" class="ml-item">
        <a href="https://website.com/film/snatched-20724/">
            <span class="mli-quality">CAM</span>
            <img data-original="https://img.gocdn.online/2017/05/16/poster/5ef66403dc331009bdb5aa37cfe819ba-snatched.jpg"
                 class="lazy thumb mli-thumb" alt="Snatched"
                 src="https://img.gocdn.online/2017/05/16/poster/5ef66403dc331009bdb5aa37cfe819ba-snatched.jpg"
                 style="display: inline-block;">
            <span class="mli-info"><h2>Snatched</h2></span>
        </a>
    </div>
</div>
As we can see, we have a movie ID, a title, and a link to the detail page.
Let's start scraping this set of data:
using IronWebScraper;

public class MovieScraper : WebScraper
{
    public override void Init()
    {
        License.LicenseKey = "LicenseKey";
        this.LoggingLevel = WebScraper.LogLevel.All;
        this.WorkingDirectory = AppSetting.GetAppRoot() + @"\MovieSample\Output\";
        // Request an absolute URL, including the scheme
        this.Request("https://website.com/", Parse);
    }

    public override void Parse(Response response)
    {
        // Each direct child div of #movie-featured is one movie item
        foreach (var Divs in response.Css("#movie-featured > div"))
        {
            // Skip layout-only "clearfix" divs
            if (Divs.Attributes["class"] != "clearfix")
            {
                var MovieId = Divs.GetAttribute("data-movie-id");
                var link = Divs.Css("a")[0];
                var MovieTitle = link.TextContentClean;
                Scrape(new ScrapedData() { { "MovieId", MovieId }, { "MovieTitle", MovieTitle } }, "Movie.Jsonl");
            }
        }
    }
}
Public Class MovieScraper
    Inherits WebScraper
    Public Overrides Sub Init()
        License.LicenseKey = "LicenseKey"
        Me.LoggingLevel = WebScraper.LogLevel.All
        Me.WorkingDirectory = AppSetting.GetAppRoot() & "\MovieSample\Output\"
        ' Request an absolute URL, including the scheme
        Me.Request("https://website.com/", AddressOf Parse)
    End Sub
    Public Overrides Sub Parse(ByVal response As Response)
        For Each Divs In response.Css("#movie-featured > div")
            If Divs.Attributes("class") <> "clearfix" Then
                Dim MovieId = Divs.GetAttribute("data-movie-id")
                Dim link = Divs.Css("a")(0)
                Dim MovieTitle = link.TextContentClean
                Scrape(New ScrapedData() From {
                    { "MovieId", MovieId },
                    { "MovieTitle", MovieTitle }
                },
                "Movie.Jsonl")
            End If
        Next Divs
    End Sub
End Class
What's new in this code?
The WorkingDirectory property sets the main working directory for all scraped data and its related files.
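The samples build the working directory by concatenating strings with Windows-style backslashes. A minimal sketch of a more portable way to build such a path, using only the .NET base library, might look like the following. OutputPaths and GetOutputDirectory are hypothetical names introduced here, not part of IronWebScraper, and the caller is assumed to supply whatever root folder the project uses.

```csharp
using System;
using System.IO;

// Hypothetical helper (not part of IronWebScraper) for building an output folder path.
public static class OutputPaths
{
    // Returns <root>/<sampleName>/Output/ using the platform's separator,
    // creating the folder if it does not exist yet.
    public static string GetOutputDirectory(string root, string sampleName)
    {
        string dir = Path.Combine(root, sampleName, "Output");
        Directory.CreateDirectory(dir); // does nothing if the folder already exists
        // Keep a trailing separator, matching the paths used in the samples.
        return dir + Path.DirectorySeparatorChar;
    }
}
```

Inside Init(), this could be used as `this.WorkingDirectory = OutputPaths.GetOutputDirectory(AppContext.BaseDirectory, "MovieSample");`, where AppContext.BaseDirectory is an assumed stand-in for the sample's own root folder helper.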
Let's do more.
What if we need to build typed objects that hold the scraped data in a structured form?
Let's implement a Movie class to hold our formatted data:
public class Movie
{
    public int Id { get; set; }
    public string Title { get; set; }
    public string URL { get; set; }
}
Public Class Movie
    Public Property Id() As Integer
    Public Property Title() As String
    Public Property URL() As String
End Class
Now we will update our code:
using System;
using IronWebScraper;

public class MovieScraper : WebScraper
{
    public override void Init()
    {
        License.LicenseKey = "LicenseKey";
        this.LoggingLevel = WebScraper.LogLevel.All;
        this.WorkingDirectory = AppSetting.GetAppRoot() + @"\MovieSample\Output\";
        this.Request("https://website.com/", Parse);
    }

    public override void Parse(Response response)
    {
        // Each direct child div of #movie-featured is one movie item
        foreach (var Divs in response.Css("#movie-featured > div"))
        {
            // Skip layout-only "clearfix" divs
            if (Divs.Attributes["class"] != "clearfix")
            {
                // Fill a typed Movie object instead of a ScrapedData dictionary
                var movie = new Movie();
                movie.Id = Convert.ToInt32(Divs.GetAttribute("data-movie-id"));
                var link = Divs.Css("a")[0];
                movie.Title = link.TextContentClean;
                movie.URL = link.Attributes["href"];
                Scrape(movie, "Movie.Jsonl");
            }
        }
    }
}
Public Class MovieScraper
    Inherits WebScraper
    Public Overrides Sub Init()
        License.LicenseKey = "LicenseKey"
        Me.LoggingLevel = WebScraper.LogLevel.All
        Me.WorkingDirectory = AppSetting.GetAppRoot() & "\MovieSample\Output\"
        Me.Request("https://website.com/", AddressOf Parse)
    End Sub
    Public Overrides Sub Parse(ByVal response As Response)
        For Each Divs In response.Css("#movie-featured > div")
            If Divs.Attributes("class") <> "clearfix" Then
                Dim movie As New Movie()
                movie.Id = Convert.ToInt32(Divs.GetAttribute("data-movie-id"))
                Dim link = Divs.Css("a")(0)
                movie.Title = link.TextContentClean
                movie.URL = link.Attributes("href")
                Scrape(movie, "Movie.Jsonl")
            End If
        Next Divs
    End Sub
End Class
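What's new?
The scraper now fills a typed Movie object, including the detail page URL, and passes it directly to Scrape instead of building a ScrapedData dictionary.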
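One thing to watch in the typed version: Convert.ToInt32 throws a FormatException when the data-movie-id attribute is empty or not numeric, so a single malformed item could abort the whole Parse call. A defensive variant, sketched here with only the .NET base library, parses the attribute with int.TryParse and lets the caller skip bad items. MovieIdParser and TryParseMovieId are hypothetical names introduced for this example.

```csharp
using System.Globalization;

// Hypothetical helper (not part of IronWebScraper) for reading numeric IDs safely.
public static class MovieIdParser
{
    // Returns true and sets id when raw is a valid integer such as "20746";
    // returns false (with id set to 0) for null, empty, or non-numeric input.
    public static bool TryParseMovieId(string raw, out int id)
    {
        return int.TryParse(raw, NumberStyles.Integer, CultureInfo.InvariantCulture, out id);
    }
}
```

In Parse, the assignment to movie.Id could then be guarded with `if (!MovieIdParser.TryParseMovieId(Divs.GetAttribute("data-movie-id"), out int id)) continue;` so that items without a usable ID are skipped rather than stopping the loop.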
The movie page looks like this:
Guardians of the Galaxy Vol. 2
Genre: Action, Adventure, Sci-Fi
Actor: Chris Pratt, Zoe Saldana, Dave Bautista
Director: James Gunn
Country: United States
Duration: 136 min
Quality: CAM
Release: 2017
IMDb: 8.3
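* **MaxHttpConnectionLimit**
  Total number of allowed open HTTP requests (threads).
* **RateLimitPerHost**
  Minimum polite delay or pause (in milliseconds) between requests to a given domain or IP address.
* **OpenConnectionLimitPerHost**
  Allowed number of concurrent HTTP requests (threads).
* **ThrottleMode**
  Makes the WebScraper intelligently throttle requests not only by hostname, but also by the host server's IP address. This is polite in case multiple scraped domains are hosted on the same machine.

## Appendix

### How to create a Windows Forms application?

We should use Visual Studio 2013 or later for this.

Follow these steps to create a new Windows Forms project:

1. Open Visual Studio
2. File -> New -> Project
3. From the templates, choose the programming language (Visual C# or VB) -> Windows -> Windows Forms Application

**Project name**: IronScraperSample
**Location**: choose a location on your hard disk

### How to create a Web Forms application?

You should use Visual Studio 2013 or later.

Follow these steps to create a new ASP.NET Web Forms project:

1. Open Visual Studio
2. File -> New -> Project
3. From the templates, choose the programming language (Visual C# or VB) -> Web -> ASP.NET Web Application (.NET Framework)

**Project name**: IronScraperSample
**Location**: choose a location on your hard disk

4. Select from your ASP.NET templates
5. Your basic ASP.NET Web Forms project is now created

[Click here](/downloads/assets/tutorials/webscraping-in-c-sharp/IronWebScraperSample.zip) to download the full tutorial sample project code.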