An Introduction to Web Crawling using Ruby and Nokogiri

Web Crawling

A web crawler is an automated program or script that browses the World Wide Web in a methodical, automated manner. The key motivation for designing web crawlers has been to retrieve web pages and add their representations to a local repository. Other, less frequently used names for a web crawler are:

  • bots
  • ants
  • automatic indexers
  • worms

Difference between web-crawling and web-scraping

Web scraping is the process of processing a web document and extracting information out of it.

Web scraping focuses on two things:

  1. Examining what the webpage expects from the user and what it shows the user.
  2. Processing the data being sent or received by the browser

Web crawling is the process of iteratively finding and fetching web links, starting from a list of seed URLs. Strictly speaking, to do web crawling we must do some degree of web scraping (to extract the URLs).

Next we discuss the theory, techniques, and programming needed to write web scrapers.

Use of web inspector in crawling

All of the major browsers have a web inspector built in or available to them. It highlights selected elements. Data seekers will get even more utility out of the network panel, which provides a way to directly examine the data and logic underneath the page the browser displays.

The network panel is used to examine the source of dynamically loaded data requests, such as those made by JavaScript or Flash.

Nokogiri (Rubygem)

RubyGems is a package manager for the Ruby programming language that provides a standard format for distributing Ruby programs and libraries.

Nokogiri is an HTML, XML, SAX, and Reader parser. Among Nokogiri’s many features is the ability to search documents via XPath or CSS3 selectors.

Features of Nokogiri

  • XPath support for document searching
  • CSS3 selector support for document searching
  • XML/HTML builder

Parsing HTML with Nokogiri

Step 1: Require the rubygems and nokogiri packages

Web crawling scripts in this style begin with these two lines. (On Ruby 1.9 and later, rubygems is loaded automatically, so the first line is optional.)

require "rubygems"

require "nokogiri"

Step 2: Open the page with the open-uri package

If the webpage is stored as a file on your hard drive, pass it in like so:

page = Nokogiri::HTML(open("index.html"))

If the webpage is live on a remote site, like http://en.wikipedia.org/, then include the open-uri module, which is part of the standard Ruby distribution but must be explicitly required:

require "open-uri"

page = Nokogiri::HTML(open("http://en.wikipedia.org/"))

Open-uri encapsulates all the work of making an HTTP request into the open method, making the operation as simple as opening a file on our own hard drive.

Step 3: Selecting elements

Nokogiri’s css method allows you to target individual HTML elements, or groups of them, using CSS selectors.
Eg.
Eg.
page = Nokogiri::HTML(open("index.html"))
puts page.class              # => Nokogiri::HTML::Document
puts page.css("title").text  # => the title of the web page

The css method does not return the text of the target element. It returns a collection: more specifically, a Nokogiri::XML::NodeSet of Nokogiri::XML::Element objects. These Element objects have a variety of methods, e.g.:

text
name
attr
value

Because css returns a NodeSet, select a single element (here with first) before reading its attributes:

page.css("csspath").first[:attrname]   # => attrvalue
page.css("a.link").first[:href]        # => "http://anyvalue.com"

Parsing XML using Nokogiri

There is only a small difference between XML parsing and HTML parsing: for XML, use the XML class.

page = Nokogiri::XML(open("file.xml"))   # a local file or a URL

This returns a Nokogiri::XML::Document. Use its xpath method, like the css method above, to target elements:

page.xpath("copy the XPath from the browser inspector")

All of the other object methods described above can also be used when parsing XML.