How to Scrape Data from Websites in C#
IronWebscraper is a .NET library for web scraping, web data extraction, and web content parsing. It is an easy-to-use library that can be added to Microsoft Visual Studio projects for use in development and production.
IronWebscraper offers many distinctive features, such as control over allowed and prohibited pages, objects, and media. It also supports managing multiple identities, web caching, and many other features that this tutorial covers.
Get started with IronWebscraper
Start using IronWebScraper in your project today with a free trial.
Target audience
This tutorial is aimed at software developers with basic or advanced programming skills who want to build and implement advanced scraping solutions (website scraping, collecting and extracting website data, parsing website content, web harvesting).

Required skills
- Basic programming fundamentals, with knowledge of one of the Microsoft programming languages such as C# or VB.NET
- A basic understanding of web technologies (HTML, JavaScript, jQuery, CSS, etc.) and how they work
- Basic knowledge of the DOM, XPath, HTML, and CSS selectors
Tools
- Microsoft Visual Studio 2010 or later
- Web developer extensions for browsers, such as Web Inspector for Chrome or Firebug for Firefox
Why scrape?
(Reasons and concepts)
If you want to build a product or solution that can:
- Extract data from websites
- Compare content, prices, features, etc. across several websites
- Scan and cache website content
If one or more of these reasons apply to you, IronWebscraper is a good fit for your needs.
How to install IronWebScraper?
After creating a new project (see Appendix A), you can add the IronWebScraper library to it either automatically with NuGet or by installing the DLL manually.
Installing with NuGet
To add the IronWebScraper library to the project with NuGet, use either the visual interface (NuGet Package Manager) or a command in the Package Manager Console.
Using the NuGet Package Manager
- Right-click the project name -> select Manage NuGet Packages
- On the Browse tab -> search for IronWebScraper -> Install
- Click OK
- And we are done
Using the NuGet Package Manager Console
- From Tools -> NuGet Package Manager -> Package Manager Console
- Select the class library project as the default project
- Run the command -> Install-Package IronWebScraper
Manual installation
- Go to ironsoftware.com
- Click IronWebScraper, or visit its page directly at https://ironsoftware.com/csharp/webscraper/
- Click Download DLL
- Extract the downloaded compressed file
- In Visual Studio, right-click the project -> Add -> Reference -> Browse
- Go to the extracted folder -> netstandard2.0 -> and select all the .dll files
- And it's done!
HelloScraper - Our first IronWebScraper sample
As usual, we will start by implementing the Hello Scraper app to take our first step with IronWebScraper.
We have created a new console application named "IronWebScraperSample".
Steps to create an IronWebScraper sample
- Create a folder and name it "HelloScraperSample"
- Then add a new class and name it "HelloScraper"
- Add this code snippet to HelloScraper
public class HelloScraper : WebScraper
{
/// <summary>
/// Override this method initialize your web-scraper.
/// Important tasks will be to Request at least one start url... and set allowed/banned domain or url patterns.
/// </summary>
public override void Init()
{
License.LicenseKey = "LicenseKey"; // Write License Key
this.LoggingLevel = WebScraper.LogLevel.All; // All Events Are Logged
this.Request("https://blog.scrapinghub.com", Parse);
}
/// <summary>
/// Override this method to create the default Response handler for your web scraper.
/// If you have multiple page types, you can add additional similar methods.
/// </summary>
/// <param name="response">The http Response object to parse</param>
public override void Parse(Response response)
{
// set working directory for the project
this.WorkingDirectory = AppSetting.GetAppRoot()+ @"\HelloScraperSample\Output\";
// Loop on all Links
foreach (var title_link in response.Css("h2.entry-title a"))
{
// Read Link Text
string strTitle = title_link.TextContentClean;
// Save Result to File
Scrape(new ScrapedData() { { "Title", strTitle } }, "HelloScraper.json");
}
// Follow the link to the previous post, if the page has one
if (response.CssExists("div.prev-post > a[href]"))
{
// Get Link URL
var next_page = response.Css("div.prev-post > a[href]")[0].Attributes["href"];
// Scrape Next URL
this.Request(next_page, Parse);
}
}
}
- Now, to start scraping, add this code snippet to Main
static void Main(string [] args)
{
// Create Object From Hello Scrape class
HelloScraperSample.HelloScraper scrape = new HelloScraperSample.HelloScraper();
// Start Scraping
scrape.Start();
}
Code overview
Scrape.Start() triggers the scraping logic as follows:
- It first calls the Init() method to initialize variables, scraping properties, and behavior attributes.
- As we can see, Init() sets the start page with Request("https://blog.scrapinghub.com", Parse), and Parse(Response response) is defined as the handler used to parse the response.
- WebScraper manages HTTP requests and threads in parallel, keeping all of your code easy to debug and synchronous.
- The Parse method runs after Init() to parse the page.
- You can find elements using CSS selectors, the JavaScript DOM, or XPath.
- The selected elements are stored in a ScrapedData object; you can also map them to any custom class (Product, Employee, News, etc.).
- The objects are saved to a JSON-formatted file in the "bin/Scrape/" directory. Alternatively, you can pass the file path as a parameter, as later examples show.
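To check what was scraped, the output file can be read back with plain .NET code. The following is a minimal sketch, not part of IronWebScraper: it assumes that the file holds one JSON object per line with a "Title" property (the key used in the sample above) and that System.Text.Json is available (.NET Core 3.0 or later, or the NuGet package). The class name ReadScrapedTitles and the default path are hypothetical.

```csharp
using System;
using System.IO;
using System.Text.Json;

// Hypothetical helper: prints every "Title" found in a scraped output file.
public static class ReadScrapedTitles
{
    public static void Main(string[] args)
    {
        // Assumed location; pass the real path as the first argument.
        string path = args.Length > 0 ? args[0] : "HelloScraper.json";
        if (!File.Exists(path))
        {
            Console.WriteLine("No output file found at " + path);
            return;
        }

        foreach (string line in File.ReadLines(path))
        {
            if (string.IsNullOrWhiteSpace(line)) continue;
            try
            {
                // Assumption: each non-empty line is a standalone JSON object.
                using (JsonDocument doc = JsonDocument.Parse(line))
                {
                    JsonElement title;
                    if (doc.RootElement.ValueKind == JsonValueKind.Object
                        && doc.RootElement.TryGetProperty("Title", out title)
                        && title.ValueKind == JsonValueKind.String)
                    {
                        Console.WriteLine(title.GetString());
                    }
                }
            }
            catch (JsonException)
            {
                // The line is not a standalone JSON value; skip it.
            }
        }
    }
}
```

If the file turns out to use a different layout (for example a single JSON array), the per-line parsing above will simply skip the lines it cannot read, so check the real output before relying on it.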
IronWebScraper library functions and options
You can find the up-to-date documentation inside the zip file downloaded with the manual installation method (the IronWebScraper Documentation.chm file).
Alternatively, you can consult the online documentation for the latest version of the library at https://ironsoftware.com/csharp/webscraper/object-reference/
To start using IronWebscraper in your project, your class must inherit from IronWebScraper.WebScraper, which extends your class library and adds scraping functionality to it.
You must also implement the methods Init() and Parse(Response response).
namespace IronWebScraperEngine
{
public class NewsScraper : IronWebScraper.WebScraper
{
public override void Init()
{
throw new NotImplementedException();
}
public override void Parse(Response response)
{
throw new NotImplementedException();
}
}
}
Namespace IronWebScraperEngine
Public Class NewsScraper
Inherits IronWebScraper.WebScraper
Public Overrides Sub Init()
Throw New NotImplementedException()
End Sub
Public Overrides Sub Parse(ByVal response As Response)
Throw New NotImplementedException()
End Sub
End Class
End Namespace
The following table lists the methods and properties that the IronWebScraper library provides.
Properties \ functions | Type | Description |
---|---|---|
Init () | Method | Used to set up the scraper. |
Parse (Response response) | Method | Used to implement the logic that the scraper applies and how it processes each response. NOTE: You can implement multiple methods for different page behaviors or structures. |
| Collections | Used to ban or allow URLs and/or domains. Ex: BannedUrls.Add ("*.zip", "*.exe", "*.gz", "*.pdf"); |
ObeyRobotsDotTxt | Boolean | Used to enable or disable reading and following the directives in robots.txt. |
public override bool ObeyRobotsDotTxtForHost (string Host) | Method | Used to enable or disable reading and following the directives in robots.txt for a certain domain. |
Scrape | Method | |
ScrapeUnique | Method | |
ThrottleMode | Enumeration | |
EnableWebCache () | Method | |
EnableWebCache (TimeSpan cacheDuration) | Method | |
MaxHttpConnectionLimit | Int | |
RateLimitPerHost | TimeSpan | |
OpenConnectionLimitPerHost | Int | |
SetSiteSpecificCrawlRateLimit (string hostName, TimeSpan crawlRate) | Method | |
Identities | Collections | A list of HttpIdentity () objects used to fetch web resources. Each identity may have a different proxy IP address, user agent, HTTP headers, persistent cookies, username, and password. Best practice is to create identities in your WebScraper.Init method and add them to the WebScraper.Identities list. |
WorkingDirectory | string | Sets the working directory where all scrape-related data is stored on disk. |
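As a rough sketch of how several of these options might be combined inside Init(): the member names below are taken from the table, but the values (a one-second rate limit, ten connections, a one-hour cache), the output folder, the start URL, and the class name PoliteScraper are illustrative assumptions. Exact signatures should be checked against the object reference.

```csharp
using System;
using IronWebScraper;

// Hypothetical scraper used only to illustrate the options listed in the table.
public class PoliteScraper : WebScraper
{
    public override void Init()
    {
        License.LicenseKey = "LicenseKey"; // replace with your own key
        this.WorkingDirectory = @"C:\Scrapes\PoliteScraper\"; // assumed output folder

        // Read and follow each site's robots.txt directives.
        this.ObeyRobotsDotTxt = true;

        // Skip binary downloads (pattern list taken from the table's example).
        this.BannedUrls.Add("*.zip", "*.exe", "*.gz", "*.pdf");

        // Throttling settings; the numbers are examples, not recommendations.
        this.RateLimitPerHost = TimeSpan.FromSeconds(1);
        this.MaxHttpConnectionLimit = 10;
        this.OpenConnectionLimitPerHost = 2;

        // Cache fetched pages for one hour to avoid repeated requests.
        this.EnableWebCache(TimeSpan.FromHours(1));

        this.Request("https://example.com/", Parse); // placeholder start URL
    }

    public override void Parse(Response response)
    {
        // Page-specific extraction logic goes here.
    }
}
```

Keeping all of this configuration in Init() means every request issued later, including those made from Parse, runs under the same limits.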
## Real-World Samples and Practice
Scraping an online movie website
Let's start with another example from a real-world website. We will scrape a movie website.
Let's add a new class and call it "MovieScraper":
Now let's take a look at the site we are going to scrape:
This is part of the home page HTML that we see on the website:
<div id="movie-featured" class="movies-list movies-list-full tab-pane in fade active">
<div data-movie-id="20746" class="ml-item">
<a href="https://website.com/film/king-arthur-legend-of-the-sword-20746/">
<span class="mli-quality">CAM</span>
<img data-original="https://img.gocdn.online/2017/05/16/poster/2116d6719c710eabe83b377463230fbe-king-arthur-legend-of-the-sword.jpg"
class="lazy thumb mli-thumb" alt="King Arthur: Legend of the Sword"
src="https://img.gocdn.online/2017/05/16/poster/2116d6719c710eabe83b377463230fbe-king-arthur-legend-of-the-sword.jpg"
style="display: inline-block;">
<span class="mli-info"><h2>King Arthur: Legend of the Sword</h2></span>
</a>
</div>
<div data-movie-id="20724" class="ml-item">
<a href="https://website.com/film/snatched-20724/" >
<span class="mli-quality">CAM</span>
<img data-original="https://img.gocdn.online/2017/05/16/poster/5ef66403dc331009bdb5aa37cfe819ba-snatched.jpg"
class="lazy thumb mli-thumb" alt="Snatched"
src="https://img.gocdn.online/2017/05/16/poster/5ef66403dc331009bdb5aa37cfe819ba-snatched.jpg"
style="display: inline-block;">
<span class="mli-info"><h2>Snatched</h2></span>
</a>
</div>
</div>
As we can see, we have the movie ID, the title, and the link to the detail page.
Let's start scraping this data set:
public class MovieScraper : WebScraper
{
public override void Init()
{
License.LicenseKey = "LicenseKey";
this.LoggingLevel = WebScraper.LogLevel.All;
this.WorkingDirectory = AppSetting.GetAppRoot() + @"\MovieSample\Output\";
this.Request("https://website.com/", Parse);
}
public override void Parse(Response response)
{
foreach (var Divs in response.Css("#movie-featured > div"))
{
if (Divs.Attributes ["class"] != "clearfix")
{
var MovieId = Divs.GetAttribute("data-movie-id");
var link = Divs.Css("a")[0];
var MovieTitle = link.TextContentClean;
Scrape(new ScrapedData() { { "MovieId", MovieId }, { "MovieTitle", MovieTitle } }, "Movie.Jsonl");
}
}
}
}
Public Class MovieScraper
Inherits WebScraper
Public Overrides Sub Init()
License.LicenseKey = "LicenseKey"
Me.LoggingLevel = WebScraper.LogLevel.All
Me.WorkingDirectory = AppSetting.GetAppRoot() & "\MovieSample\Output\"
Me.Request("https://website.com/", AddressOf Parse)
End Sub
Public Overrides Sub Parse(ByVal response As Response)
For Each Divs In response.Css("#movie-featured > div")
If Divs.Attributes ("class") <> "clearfix" Then
Dim MovieId = Divs.GetAttribute("data-movie-id")
Dim link = Divs.Css("a")(0)
Dim MovieTitle = link.TextContentClean
Scrape(New ScrapedData() From {
{ "MovieId", MovieId },
{ "MovieTitle", MovieTitle }
},
"Movie.Jsonl")
End If
Next Divs
End Sub
End Class
What's new in this code?
The WorkingDirectory property sets the main working directory for all scraped data and its related files.
Let's do more.
What if we need to build typed objects that hold the scraped data in a structured form?
Let's implement a Movie class to hold our formatted data:
public class Movie
{
public int Id { get; set; }
public string Title { get; set; }
public string URL { get; set; }
}
Public Class Movie
Public Property Id() As Integer
Public Property Title() As String
Public Property URL() As String
End Class
Now we will update our code:
public class MovieScraper : WebScraper
{
public override void Init()
{
License.LicenseKey = "LicenseKey";
this.LoggingLevel = WebScraper.LogLevel.All;
this.WorkingDirectory = AppSetting.GetAppRoot() + @"\MovieSample\Output\";
this.Request("https://website.com/", Parse);
}
public override void Parse(Response response)
{
foreach (var Divs in response.Css("#movie-featured > div"))
{
if (Divs.Attributes ["class"] != "clearfix")
{
var movie = new Movie();
movie.Id = Convert.ToInt32( Divs.GetAttribute("data-movie-id"));
var link = Divs.Css("a")[0];
movie.Title = link.TextContentClean;
movie.URL = link.Attributes ["href"];
Scrape(movie, "Movie.Jsonl");
}
}
}
}
Public Class MovieScraper
Inherits WebScraper
Public Overrides Sub Init()
License.LicenseKey = "LicenseKey"
Me.LoggingLevel = WebScraper.LogLevel.All
Me.WorkingDirectory = AppSetting.GetAppRoot() & "\MovieSample\Output\"
Me.Request("https://website.com/", AddressOf Parse)
End Sub
Public Overrides Sub Parse(ByVal response As Response)
For Each Divs In response.Css("#movie-featured > div")
If Divs.Attributes ("class") <> "clearfix" Then
Dim movie As New Movie()
movie.Id = Convert.ToInt32(Divs.GetAttribute("data-movie-id"))
Dim link = Divs.Css("a")(0)
movie.Title = link.TextContentClean
movie.URL = link.Attributes ("href")
Scrape(movie, "Movie.Jsonl")
End If
Next Divs
End Sub
End Class
What's new?
- We implemented the Movie class to hold our scraped data
- We pass Movie objects to the Scrape method, which understands our format and saves it in a defined format.
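To get a feel for what one saved record might look like, the sketch below serializes a Movie object to a single JSON line using System.Text.Json. This is an approximation under stated assumptions: it does not call IronWebScraper, the exact layout that Scrape writes is not confirmed here, and the one-object-per-line format is only suggested by the .Jsonl file name. The class MovieJsonDemo is hypothetical, and the sample values come from the HTML shown earlier.

```csharp
using System;
using System.Text.Json;

// Same shape as the tutorial's Movie class.
public class Movie
{
    public int Id { get; set; }
    public string Title { get; set; }
    public string URL { get; set; }
}

// Hypothetical demo: prints one Movie as a single JSON line.
public static class MovieJsonDemo
{
    public static void Main()
    {
        var movie = new Movie
        {
            Id = 20746,
            Title = "King Arthur: Legend of the Sword",
            URL = "https://website.com/film/king-arthur-legend-of-the-sword-20746/"
        };

        // Property names become JSON keys; Id is written as a number.
        string line = JsonSerializer.Serialize(movie);
        Console.WriteLine(line);
    }
}
```

The point of the typed class is that Id stays an integer in the output instead of a string, which makes the file easier to consume downstream.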
Let's start scraping a more detailed page.
The movie page looks like this:
<div class="mvi-content">
<div class="thumb mvic-thumb"
style="background-image: url(https://img.gocdn.online/2017/04/28/poster/5a08e94ba02118f22dc30f298c603210-guardians-of-the-galaxy-vol-2.jpg);"></div>
<div class="mvic-desc">
<h3>Guardians of the Galaxy Vol. 2</h3>
<div class="desc">
Set to the backdrop of Awesome Mixtape #2, Marvel's Guardians of the Galaxy Vol. 2 continues the team's adventures as they travel throughout the cosmos to help Peter Quill learn more about his true parentage.
</div>
<div class="mvic-info">
<div class="mvici-left">
<p>
<strong>Genre: </strong>
<a href="https://Domain/genre/action/" title="Action">Action</a>,
<a href="https://Domain/genre/adventure/" title="Adventure">Adventure</a>,
<a href="https://Domain/genre/sci-fi/" title="Sci-Fi">Sci-Fi</a>
</p>
<p>
<strong>Actor: </strong>
<a target="_blank" href="https://Domain/actor/chris-pratt" title="Chris Pratt">Chris Pratt</a>,
<a target="_blank" href="https://Domain/actor/-zoe-saldana" title="Zoe Saldana">Zoe Saldana</a>,
<a target="_blank" href="https://Domain/actor/-dave-bautista-" title="Dave Bautista">Dave Bautista</a>
</p>
<p>
<strong>Director: </strong>
<a href="#" title="James Gunn">James Gunn</a>
</p>
<p>
<strong>Country: </strong>
<a href="https://Domain/country/us" title="United States">United States</a>
</p>
</div>
<div class="mvici-right">
<p><strong>Duration:</strong> 136 min</p>
<p><strong>Quality:</strong> <span class="quality">CAM</span></p>
<p><strong>Release:</strong> 2017</p>
<p><strong>IMDb:</strong> 8.3</p>
</div>
<div class="clearfix"></div>
</div>
<div class="clearfix"></div>
</div>
<div class="clearfix"></div>
</div>
We can extend our Movie class with new properties (Description, Genre, Actor, Director, Country, Duration, IMDb Score), but for our sample we will use only Description, Genre, and Actor.
public class Movie
{
public int Id { get; set; }
public string Title { get; set; }
public string URL { get; set; }
public string Description { get; set; }
public List<string> Genre { get; set; }
public List<string> Actor { get; set; }
}
Public Class Movie
Public Property Id() As Integer
Public Property Title() As String
Public Property URL() As String
Public Property Description() As String
Public Property Genre() As List(Of String)
Public Property Actor() As List(Of String)
End Class
Now we will navigate to the detail page to scrape it.
IronWebScraper lets you add more scrape functions to handle different types of page formats,
as we can see here:
public class MovieScraper : WebScraper
{
public override void Init()
{
License.LicenseKey = "LicenseKey";
this.LoggingLevel = WebScraper.LogLevel.All;
this.WorkingDirectory = AppSetting.GetAppRoot() + @"\MovieSample\Output\";
this.Request("https://domain/", Parse);
}
public override void Parse(Response response)
{
foreach (var Divs in response.Css("#movie-featured > div"))
{
if (Divs.Attributes ["class"] != "clearfix")
{
var movie = new Movie();
movie.Id = Convert.ToInt32( Divs.GetAttribute("data-movie-id"));
var link = Divs.Css("a")[0];
movie.Title = link.TextContentClean;
movie.URL = link.Attributes ["href"];
this.Request(movie.URL, ParseDetails, new MetaData() { { "movie", movie } });// to scrap Detailed Page
}
}
}
public void ParseDetails(Response response)
{
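var movie = response.MetaData.Get<Movie>("movie");
// Initialize the lists before adding to them; they are null by default
movie.Genre = new List<string>();
movie.Actor = new List<string>();
var Div = response.Css("div.mvic-desc")[0];
movie.Description = Div.Css("div.desc")[0].TextContentClean;
// The first paragraph holds the genre links
foreach(var Genre in Div.Css("div > p:nth-child(1) > a"))
{
movie.Genre.Add(Genre.TextContentClean);
}
// The second paragraph holds the actor links
foreach (var Actor in Div.Css("div > p:nth-child(2) > a"))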
{
movie.Actor.Add(Actor.TextContentClean);
}
Scrape(movie, "Movie.Jsonl");
}
}
Public Class MovieScraper
Inherits WebScraper
Public Overrides Sub Init()
License.LicenseKey = "LicenseKey"
Me.LoggingLevel = WebScraper.LogLevel.All
Me.WorkingDirectory = AppSetting.GetAppRoot() & "\MovieSample\Output\"
Me.Request("https://domain/", AddressOf Parse)
End Sub
Public Overrides Sub Parse(ByVal response As Response)
For Each Divs In response.Css("#movie-featured > div")
If Divs.Attributes ("class") <> "clearfix" Then
Dim movie As New Movie()
movie.Id = Convert.ToInt32(Divs.GetAttribute("data-movie-id"))
Dim link = Divs.Css("a")(0)
movie.Title = link.TextContentClean
movie.URL = link.Attributes ("href")
Me.Request(movie.URL, AddressOf ParseDetails, New MetaData() From {
{ "movie", movie }
}) ' to scrap Detailed Page
End If
Next Divs
End Sub
Public Sub ParseDetails(ByVal response As Response)
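Dim movie = response.MetaData.Get(Of Movie)("movie")
' Initialize the lists before adding to them; they are Nothing by default
movie.Genre = New List(Of String)()
movie.Actor = New List(Of String)()
Dim Div = response.Css("div.mvic-desc")(0)
movie.Description = Div.Css("div.desc")(0).TextContentClean
' The first paragraph holds the genre links
For Each Genre In Div.Css("div > p:nth-child(1) > a")
movie.Genre.Add(Genre.TextContentClean)
Next Genre
' The second paragraph holds the actor links
For Each Actor In Div.Css("div > p:nth-child(2) > a")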
movie.Actor.Add(Actor.TextContentClean)
Next Actor
Scrape(movie, "Movie.Jsonl")
End Sub
End Class
What's new?
- We can add scrape functions (ParseDetails) to extract data from detail pages
- We moved the Scrape call that generates our file into the new function
- We used the IronWebScraper MetaData feature to pass our movie object to the new scrape function
- We scraped the page and saved our movie object data to a file
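Two details are worth guarding against when extending the Movie model this way: List<string> properties are null until something assigns them, and Convert.ToInt32 throws if the attribute value is not numeric. The sketch below is independent of IronWebScraper and shows one possible way to handle both; the constructor and the helper TryReadId are illustrative additions, not part of the tutorial's code.

```csharp
using System;
using System.Collections.Generic;

public class Movie
{
    public int Id { get; set; }
    public string Title { get; set; }
    public List<string> Genre { get; set; }
    public List<string> Actor { get; set; }

    public Movie()
    {
        // Create the lists up front so Add() never hits a null reference.
        Genre = new List<string>();
        Actor = new List<string>();
    }
}

public static class MovieParsingDemo
{
    // Hypothetical helper: returns false instead of throwing when the
    // attribute value is missing or not a valid integer.
    public static bool TryReadId(string rawId, out int id)
    {
        return int.TryParse(rawId, out id);
    }

    public static void Main()
    {
        var movie = new Movie();

        int id;
        if (TryReadId("20746", out id))
        {
            movie.Id = id;
        }

        movie.Genre.Add("Action");
        Console.WriteLine(movie.Id + " " + movie.Genre.Count); // prints "20746 1"
    }
}
```

Initializing the lists in the constructor keeps the scrape functions free of setup code, at the cost of allocating two lists even for movies whose detail page is never visited.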
Scraping content from a shopping website
We will select a shopping site and scrape its content.
As the image shows, a left-hand bar contains links to the site's product categories.
So our first step is to inspect the site's HTML and plan how we want to scrape it.
The site's fashion category has subcategories (Men, Women, Kids):
<li class="menu-item" data-id="">
<a href="https://domain.com/fashion-by-/" class="main-category">
<i class="cat-icon osh-font-fashion"></i> <span class="nav-subTxt">FASHION </span> <i class="osh-font-light-arrow-left"></i><i class="osh-font-light-arrow-right"></i>
</a> <div class="navLayerWrapper" style="width: 633px; display: none;"><div class="submenu"><div class="column"><div class="categories"><a class="category" href="https://domain.com/fashion-by-/?sort=newest&dir=desc&viewType=gridView3">New Arrivals !</a> </div><div class="categories"><a class="category" href="https://domain.com/men-fashion/">Men</a> <a class="subcategory" href="https://domain.com/mens-shoes/">Shoes</a> <a class="subcategory" href="https://domain.com/mens-clothing/">Clothing</a> <a class="subcategory" href="https://domain.com/mens-accessories/">Accessories</a> </div><div class="categories"><a class="category" href="https://domain.com/women-fashion/">Women</a> <a class="subcategory" href="https://domain.com/womens-shoes/">Shoes</a> <a class="subcategory" href="https://domain.com/womens-clothing/">Clothing</a> <a class="subcategory" href="https://domain.com/womens-accessories/">Accessories</a> </div><div class="categories"><a class="category" href="https://domain.com/girls-boys-fashion/">Kids</a> <a class="subcategory" href="https://domain.com/boys-fashion/">Boys</a> <a class="subcategory" href="https://domain.com/girls/">Girls</a> </div><div class="categories"><a class="category" href="https://domain.com/maternity-clothes/">Maternity Clothes</a> </div></div><div class="column"><div class="categories"> <span class="category defaultCursor">Men Best Sellers</span> <a class="subcategory" href="https://domain.com/mens-casual-shoes/">Casual Shoes</a> <a class="subcategory" href="https://domain.com/mens-sneakers/">Sneakers</a> <a class="subcategory" href="https://domain.com/mens-t-shirts/">T-shirts</a> <a class="subcategory" href="https://domain.com/mens-polos/">Polos</a> </div><div class="categories"> <span class="category defaultCursor">Women Best Sellers</span> <a class="subcategory" href="https://domain.com/womens-sandals/">Sandals</a> <a class="subcategory" href="https://domain.com/womens-sneakers/">Sneakers</a> <a class="subcategory" 
href="https://domain.com/women-dresses/">Dresses</a> <a class="subcategory" href="https://domain.com/women-tops/">Tops</a> </div><div class="categories"><a class="category" href="https://domain.com/womens-curvy-clothing/">Women's Curvy Clothing</a> </div><div class="categories"><a class="category" href="https://domain.com/fashion-bundles/v/">Fashion Bundles</a> </div><div class="categories"><a class="category" href="https://domain.com/hijab-fashion/">Hijab Fashion</a> </div></div><div class="column"><div class="categories"><a class="category" href="https://domain.com/brands/fashion-by-/">SEE ALL BRANDS</a> <a class="subcategory" href="https://domain.com/adidas/">Adidas</a> <a class="subcategory" href="https://domain.com/converse/">Converse</a> <a class="subcategory" href="https://domain.com/ravin/">Ravin</a> <a class="subcategory" href="https://domain.com/dejavu/">Dejavu</a> <a class="subcategory" href="https://domain.com/agu/">Agu</a> <a class="subcategory" href="https://domain.com/activ/">Activ</a> <a class="subcategory" href="https://domain.com/oxford--bellini--tie-house--milano/">Tie House</a> <a class="subcategory" href="https://domain.com/shoe-room/">Shoe Room</a> <a class="subcategory" href="https://domain.com/town-team/">Town Team</a> </div></div></div></div>
</li>
Let's create a project
Create a new console application, or add a new folder for this sample, named "ShoppingSiteSample".
Add a new class named "ShoppingScraper".
The first step is to scrape the site's categories and their subcategories.
Let's create a Category model:
public class Category
{
/// <summary>
/// Gets or sets the name.
/// </summary>
/// <value>
/// The name.
/// </value>
public string Name { get; set; }
/// <summary>
/// Gets or sets the URL.
/// </summary>
/// <value>
/// The URL.
/// </value>
public string URL { get; set; }
/// <summary>
/// Gets or sets the sub categories.
/// </summary>
/// <value>
/// The sub categories.
/// </value>
public List<Category> SubCategories { get; set; }
}
Public Class Category
''' <summary>
''' Gets or sets the name.
''' </summary>
''' <value>
''' The name.
''' </value>
Public Property Name() As String
''' <summary>
''' Gets or sets the URL.
''' </summary>
''' <value>
''' The URL.
''' </value>
Public Property URL() As String
''' <summary>
''' Gets or sets the sub categories.
''' </summary>
''' <value>
''' The sub categories.
''' </value>
Public Property SubCategories() As List(Of Category)
End Class
Now let's build our scraping logic:
public class ShoppingScraper : WebScraper
{
/// <summary>
/// Override this method initialize your web-scraper.
/// Important tasks will be to Request at least one start url... and set allowed/banned domain or url patterns.
/// </summary>
public override void Init()
{
License.LicenseKey = "LicenseKey";
this.LoggingLevel = WebScraper.LogLevel.All;
this.WorkingDirectory = AppSetting.GetAppRoot() + @"\ShoppingSiteSample\Output\";
this.Request("http://www.webSite.com", Parse);
}
/// <summary>
/// Override this method to create the default Response handler for your web scraper.
/// If you have multiple page types, you can add additional similar methods.
/// </summary>
/// <param name="response">The http Response object to parse</param>
public override void Parse(Response response)
{
var categoryList = new List<Category>();
foreach (var Links in response.Css("#menuFixed > ul > li > a "))
{
var cat = new Category();
cat.URL = Links.Attributes ["href"];
cat.Name = Links.InnerText;
categoryList.Add(cat);
}
Scrape(categoryList, "Shopping.Jsonl");
}
}
Extracting links from the menu
Let's update our code to scrape the main categories and all of their sub-links.
public override void Parse(Response response)
{
// List of Categories Links (Root)
var categoryList = new List<Category>();
foreach (var li in response.Css("#menuFixed > ul > li"))
{
// List Of Main Links
foreach (var Links in li.Css("a"))
{
var cat = new Category();
cat.URL = Links.Attributes ["href"];
cat.Name = Links.InnerText;
cat.SubCategories = new List<Category>();
// List of Sub Categories Links
foreach (var subCategory in li.Css("a[class=subcategory]"))
{
var subcat = new Category();
// Read the values from the sub-category node, not from the parent link
subcat.URL = subCategory.Attributes ["href"];
subcat.Name = subCategory.InnerText;
// Check If Link Exist Before
if (cat.SubCategories.Find(c=>c.Name== subcat.Name && c.URL == subcat.URL) == null)
{
// Add Sublinks
cat.SubCategories.Add(subcat);
}
}
// Add Categories
categoryList.Add(cat);
}
}
Scrape(categoryList, "Shopping.Jsonl");
}
Public Overrides Sub Parse(ByVal response As Response)
' List of Categories Links (Root)
Dim categoryList = New List(Of Category)()
For Each li In response.Css("#menuFixed > ul > li")
' List Of Main Links
For Each Links In li.Css("a")
Dim cat = New Category()
cat.URL = Links.Attributes ("href")
cat.Name = Links.InnerText
cat.SubCategories = New List(Of Category)()
' List of Sub Categories Links
For Each subCategory In li.Css("a[class=subcategory]")
Dim subcat = New Category()
' Read the values from the sub-category node, not from the parent link
subcat.URL = subCategory.Attributes ("href")
subcat.Name = subCategory.InnerText
' Check If Link Exist Before
If cat.SubCategories.Find(Function(c) c.Name= subcat.Name AndAlso c.URL = subcat.URL) Is Nothing Then
' Add Sublinks
cat.SubCategories.Add(subcat)
End If
Next subCategory
' Add Categories
categoryList.Add(cat)
Next Links
Next li
Scrape(categoryList, "Shopping.Jsonl")
End Sub
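The duplicate check in Parse scans the SubCategories list once for every link it considers. As a minimal sketch of an alternative, assuming a link is identified by its name and URL together, a HashSet can remember what has already been added. MenuLink and CategoryDeduplicator are hypothetical names used only for illustration; they are not IronWebScraper types.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-in for a scraped menu link (not an IronWebScraper type).
// Records compare by value, so two links with the same Name and Url are equal.
public record MenuLink(string Name, string Url);

public static class CategoryDeduplicator
{
    // Returns the links in their original order, keeping only the first
    // occurrence of each (Name, Url) pair.
    public static List<MenuLink> Distinct(IEnumerable<MenuLink> links)
    {
        var seen = new HashSet<MenuLink>();
        var result = new List<MenuLink>();
        foreach (var link in links)
        {
            // HashSet.Add returns false when an equal link is already present.
            if (seen.Add(link))
            {
                result.Add(link);
            }
        }
        return result;
    }
}
```

Two links that share a name but point to different URLs (such as the "Shoes" entries under Men and Women in the menu markup) are both kept, because the URL is part of the key.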
Now that we have links to all of the site's categories, let's start scraping the products within each category.
Let's navigate to any category and inspect its content.
Here is its markup:
<section class="products">
<div class="sku -gallery -validate-size " data-sku="AG249FA0T2PSGNAFAMZ" ft-product-sizes="41,42,43,44,45" ft-product-color="Multicolour">
<a class="link" href="http://www.WebSite.com/agu-bundle-of-2-sneakers-black-navy-blue-653884.html">
<div class="image-wrapper default-state">
<img class="lazy image -loaded" alt="Bundle Of 2 Sneakers - Black &amp; Navy Blue" data-image-vertical="1" width="210" height="262" src="https://static.WebSite.com/p/agu-6208-488356-1-catalog_grid_3.jpg" data-sku="AG249FA0T2PSGNAFAMZ" data-src="https://static.WebSite.com/p/agu-6208-488356-1-catalog_grid_3.jpg" data-placeholder="placeholder_m_1.jpg"><noscript><img src="https://static.WebSite.com/p/agu-6208-488356-1-catalog_grid_3.jpg" width="210" height="262" class="image" /></noscript>
</div> <h2 class="title">
<span class="brand ">Agu </span>
<span class="name" dir="ltr">Bundle Of 2 Sneakers - Black & Navy Blue</span>
</h2><div class="price-container clearfix">
<span class="price-box">
<span class="price">
<span data-currency-iso="EGP">EGP</span>
<span dir="ltr" data-price="299">299</span>
</span> <span class="price -old -no-special"></span>
</span>
</div><div class="rating-stars"><div class="stars-container"><div class="stars" style="width: 62%"></div></div> <div class="total-ratings">(30)</div> </div> <span class="shop-first-logo-container"><img src="http://www.WebSite.com/images/local/logos/shop_first/ShoppingSite/logo_normal.png" data-src="http://www.WebSite.com/images/local/logos/shop_first/ShoppingSite/logo_normal.png" class="lazy shop-first-logo-img -mbxs -loaded"> </span>
<span class="osh-icon -ShoppingSite-local shop_local--logo -block -mbs -mts"></span>
<div class="list -sizes" data-selected-sku="">
<span class="js-link sku-size" data-href="http://www.WebSite.com/agu-bundle-of-2-sneakers-black-navy-blue-653884.html?size=41">41</span> <span class="js-link sku-size" data-href="http://www.WebSite.com/agu-bundle-of-2-sneakers-black-navy-blue-653884.html?size=42">42</span>
<span class="js-link sku-size" data-href="http://www.WebSite.com/agu-bundle-of-2-sneakers-black-navy-blue-653884.html?size=43">43</span> <span class="js-link sku-size" data-href="http://www.WebSite.com/agu-bundle-of-2-sneakers-black-navy-blue-653884.html?size=44">44</span>
<span class="js-link sku-size" data-href="http://www.WebSite.com/agu-bundle-of-2-sneakers-black-navy-blue-653884.html?size=45">45</span>
</div>
</a>
</div>
<div class="sku -gallery -validate-size " data-sku="LE047FA01SRK4NAFAMZ" ft-product-sizes="110,115,120,125,130,135" ft-product-color="Black">
<a class="link" href="http://www.WebSite.com/leather-shop-genuine-leather-belt-black-712030.html">
<div class="image-wrapper default-state"><img class="lazy image -loaded" alt="Genuine Leather Belt - Black" data-image-vertical="1" width="210" height="262" src="https://static.WebSite.com/p/leather-shop-1831-030217-1-catalog_grid_3.jpg" data-sku="LE047FA01SRK4NAFAMZ" data-src="https://static.WebSite.com/p/leather-shop-1831-030217-1-catalog_grid_3.jpg" data-placeholder="placeholder_m_1.jpg"><noscript><img src="https://static.WebSite.com/p/leather-shop-1831-030217-1-catalog_grid_3.jpg" width="210" height="262" class="image" /></noscript></div>
<h2 class="title"><span class="brand ">Leather Shop </span> <span class="name" dir="ltr">Genuine Leather Belt - Black</span></h2><div class="price-container clearfix">
<span class="sale-flag-percent">-29%</span> <span class="price-box"> <span class="price"><span data-currency-iso="EGP">EGP</span> <span dir="ltr" data-price="96">96</span> </span> <span class="price -old "><span data-currency-iso="EGP">EGP</span> <span dir="ltr" data-price="135">135</span> </span> </span>
</div><div class="rating-stars"><div class="stars-container"><div class="stars" style="width: 100%"></div></div> <div class="total-ratings">(1)</div> </div>
<span class="osh-icon -ShoppingSite-local shop_local--logo -block -mbs -mts"></span> <div class="list -sizes" data-selected-sku="">
<span class="js-link sku-size" data-href="http://www.WebSite.com/leather-shop-genuine-leather-belt-black-712030.html?size=110">110</span> <span class="js-link sku-size" data-href="http://www.WebSite.com/leather-shop-genuine-leather-belt-black-712030.html?size=115">115</span>
<span class="js-link sku-size" data-href="http://www.WebSite.com/leather-shop-genuine-leather-belt-black-712030.html?size=120">120</span> <span class="js-link sku-size" data-href="http://www.WebSite.com/leather-shop-genuine-leather-belt-black-712030.html?size=125">125</span> <span class="js-link sku-size" data-href="http://www.WebSite.com/leather-shop-genuine-leather-belt-black-712030.html?size=130">130</span>
<span class="js-link sku-size" data-href="http://www.WebSite.com/leather-shop-genuine-leather-belt-black-712030.html?size=135">135</span>
</div>
</a>
</div>
</section>
Let's build our product model for this content.
public class Product
{
/// <summary>
/// Gets or sets the name.
/// </summary>
/// <value>
/// The name.
/// </value>
public string Name { get; set; }
/// <summary>
/// Gets or sets the price.
/// </summary>
/// <value>
/// The price.
/// </value>
public string Price { get; set; }
/// <summary>
/// Gets or sets the image.
/// </summary>
/// <value>
/// The image.
/// </value>
public string Image { get; set; }
}
Public Class Product
''' <summary>
''' Gets or sets the name.
''' </summary>
''' <value>
''' The name.
''' </value>
Public Property Name() As String
''' <summary>
''' Gets or sets the price.
''' </summary>
''' <value>
''' The price.
''' </value>
Public Property Price() As String
''' <summary>
''' Gets or sets the image.
''' </summary>
''' <value>
''' The image.
''' </value>
Public Property Image() As String
End Class
To scrape category pages, we add a new scraping method:
public void ParseCategory(Response response)
{
// List of Products Links (Root)
var productList = new List<Product>();
foreach (var Links in response.Css("body > main > section.osh-content > section.products > div > a"))
{
var product = new Product();
product.Name = Links.InnerText;
product.Image = Links.Css("div.image-wrapper.default-state > img")[0].Attributes ["src"];
productList.Add(product);
}
Scrape(productList, "Products.Jsonl");
}
Public Sub ParseCategory(ByVal response As Response)
' List of Products Links (Root)
Dim productList = New List(Of Product)()
For Each Links In response.Css("body > main > section.osh-content > section.products > div > a")
Dim product As New Product()
product.Name = Links.InnerText
product.Image = Links.Css("div.image-wrapper.default-state > img")(0).Attributes ("src")
productList.Add(product)
Next Links
Scrape(productList, "Products.Jsonl")
End Sub
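The Product model stores Price as a string, and the sample markup exposes the amount in a data-price attribute (for example data-price="299"). The following is a minimal sketch of turning such a raw value into a decimal, assuming prices use a period as the decimal separator. PriceParser is a hypothetical helper and is not part of IronWebScraper.

```csharp
using System.Globalization;

// Hypothetical helper for normalizing scraped price text.
public static class PriceParser
{
    // Tries to convert raw price text such as "299" or " 96 " to a decimal.
    // Returns false for null, empty, or non-numeric input.
    public static bool TryParse(string raw, out decimal price)
    {
        // InvariantCulture keeps the result independent of the machine's locale.
        return decimal.TryParse(
            raw?.Trim(),
            NumberStyles.Number,
            CultureInfo.InvariantCulture,
            out price);
    }
}
```

Parsing with an explicit culture avoids surprises when the scraper runs on a machine whose regional settings use a comma as the decimal separator.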
Advanced web scraping features
HttpIdentity feature:
Some websites require the user to log in before their content can be viewed; in that case we can use an HttpIdentity:
HttpIdentity id = new HttpIdentity();
id.NetworkUsername = "username";
id.NetworkPassword = "pwd";
Identities.Add(id);
Dim id As New HttpIdentity()
id.NetworkUsername = "username"
id.NetworkPassword = "pwd"
Identities.Add(id)
One of the most powerful features of IronWebScraper is the ability to use thousands of unique identities (user credentials and/or browser engines) to spoof or scrape websites using multiple login sessions.
public override void Init()
{
License.LicenseKey = " LicenseKey ";
this.LoggingLevel = WebScraper.LogLevel.All;
this.WorkingDirectory = AppSetting.GetAppRoot() + @"\ShoppingSiteSample\Output\";
var proxies = "IP-Proxy1: 8080,IP-Proxy2: 8081".Split(',');
foreach (var UA in IronWebScraper.CommonUserAgents.ChromeDesktopUserAgents)
{
foreach (var proxy in proxies)
{
Identities.Add(new HttpIdentity()
{
UserAgent = UA,
UseCookies = true,
Proxy = proxy
});
}
}
this.Request("http://www.Website.com", Parse);
}
Public Overrides Sub Init()
License.LicenseKey = " LicenseKey "
Me.LoggingLevel = WebScraper.LogLevel.All
Me.WorkingDirectory = AppSetting.GetAppRoot() & "\ShoppingSiteSample\Output\"
Dim proxies = "IP-Proxy1: 8080,IP-Proxy2: 8081".Split(","c)
For Each UA In IronWebScraper.CommonUserAgents.ChromeDesktopUserAgents
For Each proxy In proxies
Identities.Add(New HttpIdentity() With {
.UserAgent = UA,
.UseCookies = True,
.Proxy = proxy
})
Next proxy
Next UA
Me.Request("http://www.Website.com", Parse)
End Sub
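The nested loops in Init register one identity for every combination of user agent and proxy, so the size of the identity pool is the product of the two list lengths. The following is a minimal sketch of that pairing using plain data. IdentitySpec and IdentityPool are hypothetical stand-ins, not IronWebScraper types, and the values passed in are placeholders.

```csharp
using System.Collections.Generic;
using System.Linq;

// Hypothetical plain-data description of one identity (not an IronWebScraper type).
public record IdentitySpec(string UserAgent, string Proxy);

public static class IdentityPool
{
    // Pairs every user agent with every proxy, in user-agent-major order.
    public static List<IdentitySpec> Build(
        IEnumerable<string> userAgents,
        IEnumerable<string> proxies)
    {
        // Materialize the proxies once so the sequence is not re-enumerated
        // for each user agent.
        var proxyList = proxies.ToList();
        return userAgents
            .SelectMany(ua => proxyList.Select(proxy => new IdentitySpec(ua, proxy)))
            .ToList();
    }
}
```

For example, three user agents and two proxies yield six identities, which is worth keeping in mind when the user-agent list is long.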
An identity has several properties that give it different behaviors, which helps prevent websites from blocking you.
Some of these properties:
- NetworkDomain: the network domain used for user authentication. Supports Windows networks, NTLM, Kerberos, Linux, BSD, and Mac OS X. Must be used together with NetworkUsername and NetworkPassword.
- NetworkUsername: the network/HTTP username used for user authentication. Supports HTTP, Windows networks, NTLM, Kerberos, Linux networks, BSD networks, and Mac OS.
- NetworkPassword: the network/HTTP password used for user authentication. Supports HTTP, Windows networks, NTLM, Kerberos, Linux networks, BSD networks, and Mac OS.
- Proxy: sets the proxy settings.
- UserAgent: sets the browser engine (Chrome desktop, Chrome mobile, Chrome tablet, IE, Firefox, etc.).
- HttpRequestHeaders: custom header values to be used with this identity; accepts a dictionary object (Dictionary<string, string>).
- UseCookies: enables or disables the use of cookies.
IronWebScraper runs the scraper using random identities. If we need a specific identity to be used when parsing a page, we can specify it.
public override void Init()
{
License.LicenseKey = " LicenseKey ";
this.LoggingLevel = WebScraper.LogLevel.All;
this.WorkingDirectory = AppSetting.GetAppRoot() + @"\ShoppingSiteSample\Output\";
HttpIdentity identity = new HttpIdentity();
identity.NetworkUsername = "username";
identity.NetworkPassword = "pwd";
Identities.Add(identity);
this.Request("http://www.Website.com", Parse, identity);
}
Public Overrides Sub Init()
License.LicenseKey = " LicenseKey "
Me.LoggingLevel = WebScraper.LogLevel.All
Me.WorkingDirectory = AppSetting.GetAppRoot() & "\ShoppingSiteSample\Output\"
Dim identity As New HttpIdentity()
identity.NetworkUsername = "username"
identity.NetworkPassword = "pwd"
Identities.Add(identity)
Me.Request("http://www.Website.com", Parse, identity)
End Sub
Enabling the web cache feature:
This feature caches requested pages. It is typically used during development and testing: it lets developers cache the pages they need and reuse them after updating their code. You can then run your code against the cached pages after restarting your web scraper, without connecting to the live website each time (action replay).
You can use it in the Init() method:
EnableWebCache();
or
EnableWebCache(TimeSpan expiry);
It saves the cached data to the WebCache folder under the working directory.
public override void Init()
{
License.LicenseKey = " LicenseKey ";
this.LoggingLevel = WebScraper.LogLevel.All;
this.WorkingDirectory = AppSetting.GetAppRoot() + @"\ShoppingSiteSample\Output\";
EnableWebCache(new TimeSpan(1,30,30));
this.Request("http://www.WebSite.com", Parse);
}
Public Overrides Sub Init()
License.LicenseKey = " LicenseKey "
Me.LoggingLevel = WebScraper.LogLevel.All
Me.WorkingDirectory = AppSetting.GetAppRoot() & "\ShoppingSiteSample\Output\"
EnableWebCache(New TimeSpan(1,30,30))
Me.Request("http://www.WebSite.com", Parse)
End Sub
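The three-argument TimeSpan constructor used in this example takes hours, minutes, and seconds, so new TimeSpan(1, 30, 30) means one hour, thirty minutes, and thirty seconds. The sketch below shows that value alongside a simple age-based expiry check. CacheExpiry.IsExpired is a hypothetical helper that illustrates the general idea; it is an assumption and not necessarily how IronWebScraper decides that a cached page is stale.

```csharp
using System;

public static class CacheExpiry
{
    // The expiry from the example: 1 hour, 30 minutes, 30 seconds.
    public static readonly TimeSpan SampleExpiry = new TimeSpan(1, 30, 30);

    // Hypothetical check: an entry is expired once its age exceeds the expiry.
    public static bool IsExpired(DateTime cachedAtUtc, DateTime nowUtc, TimeSpan expiry)
    {
        return nowUtc - cachedAtUtc > expiry;
    }
}
```

With this expiry, a page cached one hour ago would still be served from the cache, while a page cached two hours ago would be requested again.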
IronWebScraper can also let your engine continue scraping after the code is restarted, by giving the engine's start process a name using Start(CrawlID).
static void Main(string [] args)
{
// Create Object From Scraper class
EngineScraper scrape = new EngineScraper();
// Start Scraping
scrape.Start("enginestate");
}
Shared Sub Main(ByVal args() As String)
' Create Object From Scraper class
Dim scrape As New EngineScraper()
' Start Scraping
scrape.Start("enginestate")
End Sub
The execution request and response are saved in the SavedState folder inside the working directory.
Throttling
We can control the minimum and maximum number of connections and the connection speed per domain.
public override void Init()
{
License.LicenseKey = "LicenseKey";
this.LoggingLevel = WebScraper.LogLevel.All;
this.WorkingDirectory = AppSetting.GetAppRoot() + @"\ShoppingSiteSample\Output\";
// Gets or sets the total number of allowed open HTTP requests (threads)
this.MaxHttpConnectionLimit = 80;
// Gets or sets the minimum polite delay (pause) between requests to a given domain or IP address.
this.RateLimitPerHost = TimeSpan.FromMilliseconds(50);
// Gets or sets the allowed number of concurrent HTTP requests (threads) per hostname
// or IP address. This helps protect hosts against too many requests.
this.OpenConnectionLimitPerHost = 25;
this.ObeyRobotsDotTxt = false;
// Makes the WebScraper intelligently throttle requests not only by hostname, but
// also by host servers' IP addresses. This is polite in-case multiple scraped domains
// are hosted on the same machine.
this.ThrottleMode = Throttle.ByDomainHostName;
this.Request("https://www.Website.com", Parse);
}
Public Overrides Sub Init()
License.LicenseKey = "LicenseKey"
Me.LoggingLevel = WebScraper.LogLevel.All
Me.WorkingDirectory = AppSetting.GetAppRoot() & "\ShoppingSiteSample\Output\"
' Gets or sets the total number of allowed open HTTP requests (threads)
Me.MaxHttpConnectionLimit = 80
' Gets or sets the minimum polite delay (pause) between requests to a given domain or IP address.
Me.RateLimitPerHost = TimeSpan.FromMilliseconds(50)
' Gets or sets the allowed number of concurrent HTTP requests (threads) per hostname
' or IP address. This helps protect hosts against too many requests.
Me.OpenConnectionLimitPerHost = 25
Me.ObeyRobotsDotTxt = False
' Makes the WebScraper intelligently throttle requests not only by hostname, but
' also by host servers' IP addresses. This is polite in-case multiple scraped domains
' are hosted on the same machine.
Me.ThrottleMode = Throttle.ByDomainHostName
Me.Request("https://www.Website.com", Parse)
End Sub
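One rough way to reason about these settings: if the per-host delay is enforced between consecutive requests, then a 50 ms RateLimitPerHost caps each host at about 20 requests per second, however many connections are open. The following is a minimal sketch of that arithmetic under that assumption. ThrottleMath is a hypothetical helper and is not part of IronWebScraper.

```csharp
using System;

public static class ThrottleMath
{
    // Upper bound on requests per second to one host, assuming the minimum
    // delay is applied between consecutive requests to that host.
    public static double MaxRequestsPerSecond(TimeSpan minDelayPerHost)
    {
        if (minDelayPerHost <= TimeSpan.Zero)
        {
            throw new ArgumentOutOfRangeException(
                nameof(minDelayPerHost), "Delay must be positive.");
        }
        // Work in ticks (100 ns units) to avoid rounding in TotalSeconds.
        return (double)TimeSpan.TicksPerSecond / minDelayPerHost.Ticks;
    }
}
```

Under the same assumption, raising the delay to 500 ms would lower the ceiling to about 2 requests per second per host.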
Throttling properties
- MaxHttpConnectionLimit: total number of allowed open HTTP requests (threads).
- RateLimitPerHost: minimum polite delay or pause between requests to a given domain or IP address (a TimeSpan; 50 milliseconds in the example above).
- OpenConnectionLimitPerHost: allowed number of concurrent HTTP requests (threads) per hostname or IP address.
- ThrottleMode: makes the WebScraper intelligently throttle requests not only by hostname but also by the host servers' IP addresses. This is useful when several scraped domains are hosted on the same machine.
Appendix
How to create a Windows Forms application?
To do this, use Visual Studio 2013 or later.
Follow these steps to create a new Windows Forms project:
- Open Visual Studio
- File -> New -> Project
- From Templates, choose the programming language (Visual C# or VB) -> Windows -> Windows Forms Application
- Project name: IronScraperSample
- Location: choose a location on your hard drive
How to create a Web Forms application?
To do this, use Visual Studio 2013 or later.
Follow these steps to create a new ASP.NET Web Forms project:
- Open Visual Studio
- File -> New -> Project
- From Templates, choose the programming language (Visual C# or VB) -> Web -> ASP.NET Web Application (.NET Framework)
- Project name: IronScraperSample
- Location: choose a location on your hard drive
- From the ASP.NET templates
- Your basic ASP.NET Web Forms project is now created
Click here to download the complete project code for the sample tutorial.