Web Scraping

There’s a wealth of data available on the internet which can be used for data augmentation or to create entirely new datasets. In this course you’ll learn how to use R to selectively scrape content from websites.

During this course we’ll scrape data from a number of sites. The material is split into two parts.

Part 1

  • Motivating Example
  • Review of the Tidyverse
  • Introduction to Scraping
    • The components of an HTML document
    • Building a simple HTML document
    • DOM
    • CSS (summary of CSS)
    • XPath (summary of XPath)
    • Developer Tools
    • Important files: robots.txt and sitemap.xml
    • Ethics
  • HTTP
  • Manual Scraping
  • Scraping a Static Website using rvest
    • Retrieving page content
    • Navigation
    • Extracting text
    • Extracting attributes
    • Working with tables
    • Storing data as CSV or JSON
    • Case Study
  • Assisted Assignment

Part 2

  • Case Study
  • Interacting with APIs
    • Using XHR to find an API
    • Building wrappers around APIs
  • Dynamic Websites and JavaScript
  • Driving a Browser using RSelenium
    • Why is RSelenium needed?
    • Navigation
    • Interacting with elements
    • Combining RSelenium with rvest
    • Useful JavaScript tools
    • Going headless
    • Case Study
  • Building Robust Scrapers
    • Handling errors using tryCatch()
    • Functional tools from purrr: mapping, walking, insistently() and slowly()
  • Deploying a Scraper in the Cloud
    • Launching and connecting to an EC2 instance
    • Headless browsers
    • Automation with cron

Training Philosophy

Our training emphasises practical skills. Although you'll learn concepts and theory, you'll also see how everything is applied in the real world. We will work through examples and exercises based on real datasets.

Requirements

All you'll need is a computer with a browser and a decent internet connection. We'll be using an online development environment. This means that you can focus on learning and not on solving technical problems.

Of course, we are happy to help you get your local environment set up too! You can start by following these instructions.

Package

The training package includes access to
  • our online development environment and
  • detailed course material (slides and scripts).
