Build a Web Scraper in Rails
In this post, I'll walk you through building a web scraper in Ruby on Rails. I'm assuming an intermediate skill level with Rails.
you can a completed version of this project here
github.com/topher6345/rails-jobscrape
This application can be used to scrape job postings.
Requirements
- ruby-2.1.1
- rails 4.1.1
- local instance of postgresql
Create new rails project
rails new jobscraper -d postgresql
Install gems
bundle install
Create Database
postgres -D /usr/local/pgsql/data
rake db:create
Create 'Job' Resource
rails g scaffold job title:string location:string link:text haveapplied:boolean company:string interested:boolean referred:string
Use scaffold generator to get .json API for free
rake db:migrate
Add Active Admin
add these lines to your Gemfile
ruby
gem 'devise'
gem 'activeadmin', github: 'gregbell/active_admin'
and run
bundle install
Install ActiveAdmin
rails g active_admin:install
Register Jobs with ActiveAdmin
rails generate active_admin:resource job
Customize ActiveAdmin Jobs View
# app/admin/job.rb
ActiveAdmin.register Job do
permit_params :title, :location, :haveapplied, :interested, :referred
index do
selectable_column
id_column
column :title do |s|
a href: admin_job_path(s) do
s.title
end
end
column :location
column :link do |s|
a href: s.link do
s.link
end
end
column :haveapplied
column :interested
column :referred
column :created_at
column :updated_at
actions
end
end
Add Rake Task
rails generate task jobs fetch prune clean
# lib/tasks/jobs.rake
namespace :jobs do
desc "Fill database with Job listings"
task fetch: :environment do
require 'nokogiri'
require 'open-uri'
# clean database to avoid duplicates
Job.all.each do |job|
job.destroy!
end
# write your nokogiri scripts here or
#
# require 'lib/tasks/sites/santacruzjobs.rb'
#
# them from other files.
# Throw away old jobs
Job.destroy_all(['created_at < ?', 7.days.ago])
end
desc "Delete Jobs that are older than 7 days"
task prune: :environment do
Job.destroy_all(['created_at < ?', 7.days.ago])
end
desc "Delete all jobs."
task clean: :environment do
Job.all.each do |job|
job.destroy!
end
end
end
If you run rake -T
you can see these tasks are registered with rake.
rake jobs:clean # Delete all jobs
rake jobs:fetch # Fill database with Job listings
rake jobs:prune # Delete Jobs that are older than 7 days
Write custom nokogiri scripts to populate ActiveRecord attributes.