Data Scraping and More With Ruby, Nokogiri, Sinatra and Heroku
In this article, you will learn the basics of scraping and parsing data from
websites with Ruby and
Nokogiri. You will then take this information and build
a sample application, first as a command line tool and then as a full
Sinatra web app. Finally, you will deploy your
new application on the Heroku hosting platform.
There is a screencast that
accompanies this article. If you are interested in the process behind many of
the thoughts and code in this article, watch the screencast. If you just want
the facts, read the article.
Data scraping is the process of
extracting data from output that was originally intended for humans. A web page
is an example of output originally intended for humans in contrast to an
API intended for use by other programs.
Nokogiri is a Ruby gem that extracts data from web pages
using CSS selectors. Additionally,
it provides methods to help parse (make sense of) the results. The use of CSS
selectors allows you to easily target the data you wish to extract from a page.
<!DOCTYPE HTML>
<html lang="en-US">
<head>
  <meta charset="UTF-8">
  <title></title>
</head>
<body>
  <div id="price"> $32.11
  </div>
  <div id="time"> in 6 hours
  </div>
  <div id="stock"> in stock
  </div>
</body>
</html>
In this page, there are three bits of interesting information: price, time, and inventory status. The CSS selectors for these are the ones we would use if we were trying to style their divs: ‘#price’, ‘#time’, and ‘#stock’.
Let’s look at a sample Nokogiri script that will extract this information:
# nokogiri is our scraping/parsing library
# you will need to install it with "gem install nokogiri"
require 'nokogiri'

# open-uri is part of the standard library and allows you to
# download a webpage
require 'open-uri'

# I am hosting interesting.html on a local server. This is the URL.
url = "http://localhost:4567/interesting.html"

# Here we load the URL into Nokogiri for parsing, downloading the page in
# the process. URI.open is used because Kernel#open no longer accepts URLs
# as of Ruby 3.0.
data = Nokogiri::HTML(URI.open(url))

# We can now target data in the page using css selectors. The at_css method
# returns the first element that matches the selector.
puts data.at_css("#price").text.strip

# The text method returns the text from inside the element.
puts data.at_css("#time").text.strip

# The strip method is a standard ruby method for strings and removes
# extraneous whitespace from the output
puts data.at_css("#stock").text.strip
The above example is very simple, but hopefully gets the point across about how you can target content for extraction using CSS selectors.
Now, for a real world example:
The 9:30 Club is a music venue in DC. Let’s figure out how to scrape concert information from the concerts listing page. We will try to target the name of the headlining band, the date they are playing, the time of the show, the price for a ticket and whether or not the show is sold out.
Using the same technique as the previous example, we find the name of the band located in ‘.event’, the date of the show located in ‘.date’, the time of the show located in ‘.doors’ and the price of the show located in ‘.price’. If a show is sold out, the div with class ‘.price’ is missing. This is good. However, our current method, at_css, is only going to return information for the first concert.
What we need is a way to target all concerts. Nokogiri provides such a method with .css(‘selector’). This method returns a Nokogiri::XML::NodeSet, an enumerable object that holds all elements matching the provided selector. In the 9:30 Club markup, each concert has its own div with class ‘.concert_listing’. We can now use css(‘.concert_listing’) combined with an each iterator to extract information about “each” concert.
require 'nokogiri'
require 'open-uri'

url = "http://www.930.com/concerts/#/930/"
# URI.open: Kernel#open no longer accepts URLs as of Ruby 3.0
data = Nokogiri::HTML(URI.open(url))

# Here is where we use the new method to create an object that holds all the
# concert listings. Think of it as an array that we can loop through. It's
# not an array, but it does respond very similarly.
concerts = data.css('.concert_listing')

concerts.each do |concert|
  # name of the show
  puts concert.at_css('.event').text

  # date of the show
  puts concert.at_css('.date').text

  # time of the show
  puts concert.at_css('.doors').text

  # show price or sold out
  # Remember, when a show is sold out, there is no div with the selector .price
  # What we are doing here is setting price to the result of that selector. We
  # then test whether it is nil or not, which lets us know if the show is
  # SOLD OUT.
  price = concert.at_css('.price')
  if !price.nil?
    puts price.text
  else
    puts "SOLD OUT"
  end

  # blank line to make results prettier
  puts ""
end
truncated output from above
hunter@i7:code_samples ruby concerts_930.rb
Balkan Beat Box
Let’s take it a step further and turn our app into a full Sinatra application. The main thing we are going to do is separate our “business logic” and “display logic”. Business logic will stay in our application as part of a “route” and display logic will move into a view. This paradigm isn’t entirely accurate, but it does represent the gist of what we are after. If we weren’t moving so fast, it might be nice (read: necessary) to abstract a layer with a custom class to separate the two.
# The primary requirement of a Sinatra application is the sinatra gem.
# If you haven't already, install the gem with 'gem install sinatra'
require 'sinatra'
require 'nokogiri'
require 'open-uri'

# sinatra allows us to respond to route requests with code. Here we are
# responding to requests for the root document - the naked domain.
get '/' do
  # the first two lines are lifted from our previous script
  # (URI.open: Kernel#open no longer accepts URLs as of Ruby 3.0)
  url = "http://www.930.com/concerts/#/930/"
  data = Nokogiri::HTML(URI.open(url))

  # this line has only been adjusted slightly with the inclusion of an at
  # sign (@) before concerts. This creates an instance variable that can be
  # referenced in our display logic (view).
  @concerts = data.css('.concert_listing')

  # this tells sinatra to render the Embedded Ruby template /views/shows.erb
  erb :shows
end
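The custom class mentioned above is not part of this app, but it could be sketched roughly like this. Concert, from_node, text_of, sold_out? and price_label are all hypothetical names of my choosing; the only assumption about the input is that it responds to at_css the way a Nokogiri element does.

```ruby
# Hypothetical Concert class: one place that knows the selectors, so the
# route and the view only deal with plain Ruby values.
Concert = Struct.new(:event, :date, :doors, :price, keyword_init: true) do
  # Build a Concert from one '.concert_listing' node (anything that
  # responds to at_css, such as a Nokogiri element).
  def self.from_node(node)
    new(
      event: text_of(node, '.event'),
      date:  text_of(node, '.date'),
      doors: text_of(node, '.doors'),
      price: text_of(node, '.price')
    )
  end

  # Stripped text of the first match, or nil when the selector is absent.
  def self.text_of(node, selector)
    node.at_css(selector)&.text&.strip
  end

  # The listing omits the price element when a show is sold out.
  def sold_out?
    price.nil?
  end

  # What the view would print in the price column.
  def price_label
    sold_out? ? 'SOLD OUT' : price
  end
end
```

With something like this in place, the route could build its list with data.css('.concert_listing').map { |node| Concert.from_node(node) } and the view would no longer need to know any selectors.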
For the view, I added a bit of HTML and linked to a hosted Bootstrap stylesheet. The rest of the code should look very familiar. The only new thing here should be the introduction of ERB syntax, which allows us to evaluate Ruby in our HTML document.
The two basic tags are <%= %> and <% %>. The difference between the two is that the first one, with the equal sign, renders the return value into the HTML, while the second only evaluates a statement (a loop or a condition, for example) without rendering anything.
<!DOCTYPE HTML>
<html lang="en-US">
<head>
  <meta charset="UTF-8">
  <title>9:30 Show</title>
  <link rel="stylesheet" href="http://current.bootstrapcdn.com/bootstrap-v204/css/bootstrap-combined.min.css">
</head>
<body>
  <div class="span8">
    <!-- This table layout is pulled directly from twitter bootstrap -->
    <table class="table table-striped">
      <thead>
        <tr>
          <th>Date</th>
          <th>Event</th>
          <th>Time</th>
          <th>Price</th>
        </tr>
      </thead>
      <tbody>
        <% @concerts.each do |concert| %>
          <tr>
            <% price = concert.at_css('.price') %>
            <% if !price.nil? %>
              <td><%= concert.at_css('.date').text %></td>
              <td>
                <!-- This next line targeting the :href is new. -->
                <!-- The first part should seem familiar. We are targeting -->
                <!-- the first anchor link inside an element with the class -->
                <!-- '.buy'. The next bit [:href], tells Nokogiri to extract -->
                <!-- the href value from the anchor link. Our ERB tag then -->
                <!-- outputs that value to the string. See if you can figure -->
                <!-- out why we are extracting this link by reviewing 930.html -->
                <a href="<%= concert.at_css('.buy a')[:href] %>"><%= concert.at_css('.event').text %></a>
              </td>
              <td><%= concert.at_css('.doors').text %></td>
              <td><%= price.text %></td>
            <% else %>
              <td><%= concert.at_css('.date').text %></td>
              <td><del><%= concert.at_css('.event').text %></del></td>
              <td><%= concert.at_css('.doors').text %></td>
              <td> SOLD OUT </td>
            <% end %>
          </tr>
        <% end %>
      </tbody>
    </table>
  </div>
</body>
</html>
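The two ERB tags can also be tried outside of Sinatra with the erb library that ships with Ruby. This is only a sketch: the concerts array is hypothetical data standing in for the scraped listings, and the trim_mode option is my addition to keep the output tidy.

```ruby
require 'erb'

# Hypothetical data standing in for the scraped concerts.
concerts = [
  { event: 'Band A', price: '$20' },
  { event: 'Band B', price: nil }
]

# <% %> runs Ruby (the loop and the condition) without printing anything;
# <%= %> prints the value of the expression into the result.
template = <<~'TEMPLATE'
  <% concerts.each do |concert| %>
  <% if concert[:price] %>
  <p><%= concert[:event] %>: <%= concert[:price] %></p>
  <% else %>
  <p><%= concert[:event] %>: SOLD OUT</p>
  <% end %>
  <% end %>
TEMPLATE

# trim_mode '<>' omits the newline for lines that start with <% and end
# with %>, so the control-flow lines leave no blank lines behind.
html = ERB.new(template, trim_mode: '<>').result(binding)
puts html
```

Sinatra does the equivalent work for us when we call erb :shows, using the instance variables set in the route instead of local variables.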
Let’s take this a step further and deploy to Heroku. All we need are three new files and a bit of version control. We actually only need to create two of the three files. Bundler will take care of the third.
# The Gemfile tells bundler which gems our app is using.

# Where the gems are from
source 'https://rubygems.org'

# Which gems are needed
# You might note the omission of open-uri, this is because it is part of the
# Ruby standard library. The remaining two are simply copied from the
# require statements in app.rb
gem 'sinatra'
gem 'nokogiri'
# config.ru
# tell Heroku what to load
require './app'

# tell Heroku what to do
run Sinatra::Application
After creating the two files, run ‘bundle’ in the application folder. You will need to have Bundler installed.
shell output from ‘bundle’ command
Fetching gem metadata from http://rubygems.org/.......
Using nokogiri (1.5.4)
Using rack (1.4.1)
Using rack-protection (1.2.0)
Using tilt (1.3.3)
Using sinatra (1.3.2)
Using bundler (1.1.4)
Your bundle is complete! Use `bundle show [gemname]` to see...
The bundle command creates our third file, Gemfile.lock. Next we will put our application under version control with git. You will need to have git installed. If the folder is not yet a git repository, run ‘git init’ first.
shell commands for git
git add .
git commit -m "init commit"
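And now we are ready to create our instance on Heroku and deploy. Creating the app (‘heroku create’) also adds the ‘heroku’ git remote that the push below targets.
You need to have the Heroku Toolbelt installed and configured.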
shell commands for heroku
git push heroku master
Our site is live ( http://nine30.heroku.com ). If some of the steps seem “glossed over”, watch the screencast. I do everything in real-time, solving many of the problems above for the first time. Still have something to say? Shoot me 140 characters @TheHunter.