Build a Web Scraper in Ruby on Rails with Nokogiri and HTTParty

I thought it would be fun to do a little write-up on how you can build a web scraper with Ruby/Rails. I was also recently thinking it might be a fun web app project to build a crypto/blockchain job board, and then I thought: why not both?

So now, let’s see if we can build a web scraper tool that we can use to build a job board application for blockchain jobs.

We'll use a handful of Ruby gems for this, in particular nokogiri, httparty, and byebug and/or pry.

Getting Started

Note: this post assumes you already have Ruby and Rails installed and have a basic understanding of how to interact with your terminal.

First things first, I'm going to create a new Rails app called 'gigchain'. I'm also going to set up the database with PostgreSQL to make transitioning to a production application on Heroku easier later on. If you're following along, run the following in your terminal from your own development directory (or wherever you keep your projects):

rails new gigchain --database=postgresql 

The additional --database=postgresql flag is probably pretty self-explanatory, but basically we're just setting our app up with Postgres.

Next, in our new Rails app's Gemfile we need to add nokogiri and httparty:

Gemfile
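The additions look something like this (version numbers omitted; byebug is usually already in the default development/test group of a new Rails app):

gem 'nokogiri'
gem 'httparty'

group :development, :test do
  gem 'byebug' # typically already present in a new Rails app
  gem 'pry'
end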

Then we can run bundle install in the terminal to install our new gems:

bundle install 

Building Our Scraper

Now that we have that all set up, let's start building out our scraper. For a simple proof of concept, I'm going to see if we can start by scraping indeed.com for blockchain job listings.

I'm going to create a file called scraper.rb and add it at the top level of my app for now; later we'll run it directly from the terminal.

So now let’s start building out our scraper. The first thing I’m going to do is include our new gems (along with byebug and json):
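At the top of scraper.rb, that's just a handful of require statements:

require 'nokogiri'
require 'httparty'
require 'byebug'
require 'json'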

We’re going to build out our scraper method for now as something that we can run from our terminal. So next, I want to supply the user (in this case me) with the ability to customize the job search terms they want to use. Looking at our target indeed.com, I can see that their url format looks like this when you search for something:

https://www.indeed.com/q-YOUR-SEARCH-TERMS-jobs.html

So we want to collect the user's input so we can plug it into that url format. We can use Ruby's gets.chomp to capture the user's input from the terminal. We'll also want to make sure we replace any blank spaces the user introduces with hyphens, since this will be inserted into a url:
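Something like this should do it (the prompt wording is just an example):

puts "What kind of jobs do you want to search for?"
search_term = gets.chomp
# Swap spaces for hyphens so the term fits Indeed's url format
input = search_term.split(" ").join("-")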

Now, with the user's input provided, we can build and run our actual web scraper method. Let's create a new method in our scraper.rb file called 'scrape'. It will take a single argument (input), which will be the user-supplied search term. We can add a few simple things inside this method to get started.

First, we’ll use HTTParty to make a get request to our target url:
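Inside the new scrape method, that looks something like this (using the url pattern we saw above):

def scrape(input)
  unparsed_page = HTTParty.get("https://www.indeed.com/q-#{input}-jobs.html")
end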

Notice the #{input} being interpolated into the url. That's the search term we collected from the user and cleaned up a bit earlier.

Then when we run our function, that input will get passed in and we’ll insert it into the url we are passing into HTTParty.get()

This will return a raw unparsed version of the entire HTML page to us. Then using Nokogiri we can create a parsed version of the page:
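Still inside the method, we hand the raw HTML over to Nokogiri (.body gives us the response body as a plain string):

parsed_page = Nokogiri::HTML(unparsed_page.body)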

Then we can insert a byebug to interact with our method and data from our terminal. If you haven’t used byebug before, this basically lets us pause our code at a given point when we run it, and then lets us interact with the available data.

So at this point, this is what my scraper.rb file and new scrape method will look like:
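Here's a rough sketch of the whole file at this stage (prompt wording and variable names may differ from yours):

# scraper.rb
require 'nokogiri'
require 'httparty'
require 'byebug'
require 'json'

puts "What kind of jobs do you want to search for?"
search_term = gets.chomp
input = search_term.split(" ").join("-")

def scrape(input)
  unparsed_page = HTTParty.get("https://www.indeed.com/q-#{input}-jobs.html")
  parsed_page = Nokogiri::HTML(unparsed_page.body)
  byebug
end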

To recap: so far we're prompting the user for some input, storing it in a variable, and cleaning it up a bit so it fits the url format.

Then we're defining the scrape method that will handle the user input. Inside it, we create an unparsed_page variable that contains the raw value we get from HTTParty's get method, and then a parsed_page variable by passing that raw HTML into Nokogiri::HTML to organize our data.

Let's run a quick test to see where we're at so far. In your scraper.rb file, below the current code, add scrape(input) to run the method with the supplied user input.

In your terminal you can now run this by navigating to the rails app we created and entering ‘ruby scraper.rb’

You’ll be prompted for some input (if you’re following along you can use ‘Blockchain’ for your keyword/search-term, but you could also do anything else you want). Press enter, our scrape function will run, and then it will hit our byebug and pause.

Now, because our byebug is placed after our parsed and unparsed page variables, we have access to that data in our terminal, and we can use Nokogiri to interact with it.

Getting Our Data

Let’s start as an example with how we can collect all of the job titles on the page.

Nokogiri allows us to interact with elements by targeting whatever we want: the class, id, data attribute, etc. So if we inspect the actual webpage we're trying to scrape, we can start to look at potential ways to target our data.

In the page source we can see a recurring 'jobtitle' class that each job title appears to have. Looking into the page source further, I can also see that there's a data attribute being used for the group of titles, which is data-tn-element="jobTitle".

So using Nokogiri's built-in tools we can now call .css on our parsed_page variable and either target the html element we want along with its class, like this:

parsed_page.css('a.jobtitle')

Or target by the data attribute, like this:

parsed_page.css('a[data-tn-element=jobTitle]')

After testing each of these, the second option seems to work well so I’m going to go with using the data attribute here.

jobs = parsed_page.css('a[data-tn-element=jobTitle]')

What we've done so far is collect all of the elements with this attribute and group them into an array. So what we can do next in our terminal is iterate through each item and call .text on it to print out each job title.
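For example, right from the byebug prompt:

jobs.each do |job|
  puts job.text.strip # print just the text of each title link
end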

Each job title from the page prints out in our terminal.

Look at that, we’ve already put together enough code that we can scrape a page for job titles. Pretty cool!

However we will need to actually collect all of the relevant data on each job post, not just the title, if we’re going to build our own job board.

Additionally, we’ll have to figure out how to account for pagination if we want to collect more than just the first page of results.

In order to collect more than just the title, I'm going to start by going back to our parsed_page variable. Instead of targeting just the title's data attribute, we can start by targeting all of the divs in the page that act as containers for each job. If we look back through the page source we can see that each container div has the same '.row' class.

So what we can do then is create a new jobs variable and do this:

jobs = parsed_page.css('div.row')

This will give us an array of 15 items (each page has 10 organic job listings + 5 sponsored listings).

This way, we've created an array that segments each job into its own section, and now we just have to figure out how to iterate over that and collect the information we want for each job.

To keep things simple for now, for each job listing I’m simply going to aim to collect the title, company, and the url where you can apply to the job.

So what we can do now in our scrape function is add some code so that we can iterate through our new jobs variable. Additionally, I’m going to move the byebug inside of our iterator so we can take a look at our data for each item of our array. Adding these changes will result in our scrape function looking something like this now:
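Roughly, it looks like this (I'm switching to Indeed's query-parameter url here; more on that in the note below):

def scrape(input)
  url = "https://www.indeed.com/jobs?q=#{input}"
  unparsed_page = HTTParty.get(url)
  parsed_page = Nokogiri::HTML(unparsed_page.body)
  jobs = parsed_page.css('div.row')

  jobs.each do |job|
    byebug # pause on each job row so we can inspect it
  end
end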

[Note that I changed the url format here to Indeed's query-parameter style (/jobs?q=...) rather than the /q-...-jobs.html format we started with. This will come in handy when we get into pagination.]

So now let's figure out how to capture our data. We already know how to get the title, because we did that above. For collecting the job's url, we can actually look at what we did to pull out that job title in Nokogiri and collect the href from the anchor tag we targeted. It's a little strange, but let's take a look at this by re-targeting that anchor tag; however, this time we won't attach .text to the tail end of it:
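From the byebug prompt, that looks like this:

job.css('a[data-tn-element=jobTitle]')
# => a Nokogiri NodeSet wrapping the matching anchor tag, rather than a plain string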

It’s a little more complicated than getting the text value, but if you play around with that you’ll find that you can grab the href value by doing this:

job.css('a[data-tn-element=jobTitle]')[0].attributes['href'].value

[Note that we are using 'job' here because we are inside of the .each iterator in our scrape method]

Then finally, for the company name, if we look back in the page source we can see in each row the company name is wrapped in a span with the class ‘company’, so we can simply target that.

So now let's create a small hash inside our loop to collect all this data, and then drop our byebug under that. Our scrape function will look something like this now:
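Something along these lines (the hash keys are my own naming, and note the href may come back as a relative path):

def scrape(input)
  url = "https://www.indeed.com/jobs?q=#{input}"
  unparsed_page = HTTParty.get(url)
  parsed_page = Nokogiri::HTML(unparsed_page.body)
  jobs = parsed_page.css('div.row')

  jobs.each do |job|
    title_link = job.css('a[data-tn-element=jobTitle]')[0]
    next if title_link.nil? # skip any rows that aren't actual job listings

    job_data = {
      title: title_link.text.strip,
      company: job.css('span.company').text.strip,
      url: title_link.attributes['href'].value
    }
    byebug
  end
end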

Now, with our byebug dropped below the job_data hash we just added, we can rerun this in our terminal and check what the values we capture are as we iterate through the loop to check if it’s working.

After a quick test, it looks like everything is working correctly – woohoo!

Now that we've figured out how to scrape the data we want from the page, we can focus on the next piece of the puzzle: pagination.

Pagination

Okay, so now that we have our scraper built out, we need to figure out pagination. By default, our target (Indeed.com) shows 10 listings (+ 5 sponsored listings) on each page. Looking at my search for blockchain job listings, I can see that Indeed shows the total number of results at the top of each page.

If we look at how Indeed handles pagination, we can see the url adjust to the following on page 2:

https://www.indeed.com/jobs?q=Blockchain&start=10

So all that’s happening is an additional query parameter (‘start’) gets added to the url with a marker of where we are in the results, incrementing by 10.

Based off that, I think there’s a simple, somewhat hacky way we can account for pagination, at least as a short term solution.

Here are the steps to what I’m thinking:

  • Use HTTParty to make an initial request to our target url.
  • Search through the parsed_page to grab the number of job listing results; this will become our max value.
  • Using a while loop, we can increment a counter by 10 on each loop and run the code we’ve already written above until we reach our max.
  • We’ll pass the value of the counter into the url on each loop to set the current page we want to crawl.

If we look back at the target webpage and look through the source, we can see that the number of results for our search is wrapped in a div and has a unique id of ‘#searchCount’.

So all we have to do then is run our initial get request with HTTParty and then target that value from our parsed_page variable. I’m going to do something a little hacky but it should work for my purposes:

max = parsed_page.css('div#searchCount').text.split(" of ")[1].to_i

A couple of things here: the thing we're targeting is a string that will read something like this on each page: "Jobs 11 to 20 of 711". So what I'm doing is targeting that text, and then using .split(" of ") to break the string into an array. This will create an array that looks like this:

["Jobs 11 to 20", "711"]

Then by grabbing only the second item in our array and converting it to an integer (to_i) we now have the max value we need. Again, super hacky, but for now it should work fine.

Next, in our scrape function, we can add a start variable that begins at 0 and wrap our code in a while loop with the condition start < max. I'm also going to create an empty array so I can dump all of our results into it and look at it once we're done looping through everything. Finally, I need to change around when and how we're using HTTParty and Nokogiri: we now make an initial request just to get the number of results, then create a similar parsed page variable inside of our loop for each page. Here's what my method looks like after accounting for these changes:
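A rough sketch of the full method after those changes (the job_listings array name is my own):

def scrape(input)
  # Initial request: only used to find out how many total results there are
  unparsed_page = HTTParty.get("https://www.indeed.com/jobs?q=#{input}")
  parsed_page = Nokogiri::HTML(unparsed_page.body)
  max = parsed_page.css('div#searchCount').text.split(" of ")[1].to_i

  job_listings = []
  start = 0

  while start < max
    # The start parameter tells Indeed where in the results to begin
    pagination_url = "https://www.indeed.com/jobs?q=#{input}&start=#{start}"
    unparsed_page = HTTParty.get(pagination_url)
    parsed_page = Nokogiri::HTML(unparsed_page.body)
    jobs = parsed_page.css('div.row')

    jobs.each do |job|
      title_link = job.css('a[data-tn-element=jobTitle]')[0]
      next if title_link.nil?

      job_listings << {
        title: title_link.text.strip,
        company: job.css('span.company').text.strip,
        url: title_link.attributes['href'].value
      }
    end

    start += 10
  end

  byebug # pause here to look over the collected job_listings
  job_listings
end

scrape(input)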