Webscraping in C

更新日:2026年4月21日

Translated

View the article in English

IronWebScraperとは何ですか？

IronWebScraperは、C#および.NETプログラミングプラットフォーム用のクラスライブラリおよびフレームワークで、開発者がプログラムでウェブサイトを読み取り、そのコンテンツを抽出できるようにします。これは、ウェブサイトや既存のイントラネットをリバースエンジニアリングして、データベースやJSONデータに変換するのに理想的です。また、インターネットから大量のドキュメントをダウンロードするのにも便利です。

多くの点で、IronWebScraperはPythonのScrapyライブラリに似ていますが、C#の利点、特にコードを実行中にステップを踏んでデバッグできる利点を活かしています。

インストール

最初のステップはIronWebScraperをインストールすることです。これはNuGetから、または当社のウェブサイトからDLLをダウンロードすることで行えます。

必要なすべてのクラスはIronWebScraper名前空間にあります。

Install-Package IronWebScraper

Iron Webscraperの使用法

IronWebScraperの使用方法を学ぶには、例を見るのが最善です。この基本的な例は、ウェブサイトブログからタイトルをスクレイプするクラスを作成します。

using IronWebScraper;

namespace WebScrapingProject
{
    class MainClass
    {
        public static void Main(string [] args)
        {
            var scraper = new BlogScraper();
            scraper.Start();
        }
    }

    class BlogScraper : WebScraper
    {
        // Initialize scraper settings and make the first request
        public override void Init()
        {
            // Set logging level to show all log messages
            this.LoggingLevel = WebScraper.LogLevel.All;

            // Request the initial page to start scraping
            this.Request("https://ironpdf.com/blog/", Parse);
        }

        // Method to handle parsing of the page response
        public override void Parse(Response response)
        {
            // Loop through each blog post title link found by CSS selector
            foreach (var title_link in response.Css("h2.entry-title a"))
            {
                // Clean and extract the title text
                string strTitle = title_link.TextContentClean;

                // Store the extracted title for later use
                Scrape(new ScrapedData() { { "Title", strTitle } });
            }

            // Check if there is a link to the previous post page and if exists, follow it
            if (response.CssExists("div.prev-post > a[href]"))
            {
                // Get the URL for the next page
                var next_page = response.Css("div.prev-post > a[href]")[0].Attributes["href"];

                // Request the next page to continue scraping
                this.Request(next_page, Parse);
            }
        }
    }
}

using IronWebScraper;

namespace WebScrapingProject
{
    class MainClass
    {
        public static void Main(string [] args)
        {
            var scraper = new BlogScraper();
            scraper.Start();
        }
    }

    class BlogScraper : WebScraper
    {
        // Initialize scraper settings and make the first request
        public override void Init()
        {
            // Set logging level to show all log messages
            this.LoggingLevel = WebScraper.LogLevel.All;

            // Request the initial page to start scraping
            this.Request("https://ironpdf.com/blog/", Parse);
        }

        // Method to handle parsing of the page response
        public override void Parse(Response response)
        {
            // Loop through each blog post title link found by CSS selector
            foreach (var title_link in response.Css("h2.entry-title a"))
            {
                // Clean and extract the title text
                string strTitle = title_link.TextContentClean;

                // Store the extracted title for later use
                Scrape(new ScrapedData() { { "Title", strTitle } });
            }

            // Check if there is a link to the previous post page and if exists, follow it
            if (response.CssExists("div.prev-post > a[href]"))
            {
                // Get the URL for the next page
                var next_page = response.Css("div.prev-post > a[href]")[0].Attributes["href"];

                // Request the next page to continue scraping
                this.Request(next_page, Parse);
            }
        }
    }
}

Imports IronWebScraper

Namespace WebScrapingProject
	Friend Class MainClass
		Public Shared Sub Main(ByVal args() As String)
			Dim scraper = New BlogScraper()
			scraper.Start()
		End Sub
	End Class

	Friend Class BlogScraper
		Inherits WebScraper

		' Initialize scraper settings and make the first request
		Public Overrides Sub Init()
			' Set logging level to show all log messages
			Me.LoggingLevel = WebScraper.LogLevel.All

			' Request the initial page to start scraping
			Me.Request("https://ironpdf.com/blog/", AddressOf Parse)
		End Sub

		' Method to handle parsing of the page response
		Public Overrides Sub Parse(ByVal response As Response)
			' Loop through each blog post title link found by CSS selector
			For Each title_link In response.Css("h2.entry-title a")
				' Clean and extract the title text
				Dim strTitle As String = title_link.TextContentClean

				' Store the extracted title for later use
				Scrape(New ScrapedData() From {
					{ "Title", strTitle }
				})
			Next title_link

			' Check if there is a link to the previous post page and if exists, follow it
			If response.CssExists("div.prev-post > a[href]") Then
				' Get the URL for the next page
				Dim next_page = response.Css("div.prev-post > a[href]")(0).Attributes("href")

				' Request the next page to continue scraping
				Me.Request(next_page, AddressOf Parse)
			End If
		End Sub
	End Class
End Namespace

$vbLabelText $csharpLabel

特定のウェブサイトをスクレイプするには、そのウェブサイトを読み取るための独自のクラスを作成する必要があります。このクラスはWeb Scraperを拡張します。このクラスにはいくつかのメソッドを追加する予定です。その中には Init があり、ここで初期設定を行い最初のリクエストを開始します。これにより連鎖反応が引き起こされ、ウェブサイト全体がスクレイピングされることになります。

また、少なくとも1つの Parse メソッドを追加する必要があります。 Parseメソッドは、インターネットからダウンロードされたウェブページを読み取り、jQueryのようなCSSセレクタを使用してコンテンツを選択し、使用するための関連するテキストや画像を抽出します。

Parse メソッド内では、クローラーに追跡を継続させたいハイパーリンクと、無視させるハイパーリンクを指定することも可能です。

スクレイプメソッドを使用して、データを抽出し、便利なJSONスタイルのファイル形式にダンプして後で使用することができます。

今後のステップ

IronWebScraperについて詳しく知るには、APIリファレンスドキュメントを読み、その後、当社ドキュメントのチュートリアルセクション内の例を確認することをお勧めします。

次におすすめする例はC# "blog" ウェブスクレイピング例で、WordPressブログなどのブログからテキストコンテンツを抽出する方法を学びます。これは、サイト移行において非常に役立つかもしれません。

そこから、進んだウェブスクレイピングのチュートリアルの他の例を見ていくことができます。そこでは、多種多様なページを持つウェブサイト、eコマースサイト、インターネットからデータをスクレイプする際の複数のプロキシ、ID、ログインの使用方法などの概念を見ていくことができます。

カーティス・チャウ

今すぐエンジニアリングチームとチャット

テクニカルライター

Curtis Chauは、カールトン大学でコンピュータサイエンスの学士号を取得し、Node.js、TypeScript、JavaScript、およびReactに精通したフロントエンド開発を専門としています。直感的で美しいユーザーインターフェースを作成することに情熱を持ち、Curtisは現代のフレームワークを用いた開発や、構造の良い視覚的に魅力的なマニュアルの作成を楽しんでいます。

開発以外にも、CurtisはIoT（Internet of Things）への強い関心を持ち、ハードウェアとソフトウェアの統合方法を模索しています。余暇には、ゲームをしたりDiscordボットを作成したりして、技術に対する愛情と創造性を組み合わせています。

準備はできましたか？

Nuget ダウンロード 137,906 | バージョン: 2026.6 just released