Web Scraping with R


1. Introduction

Exegetic Analytics is a Data Science consultancy specialising in data acquisition and augmentation, data preparation, predictive analytics and machine learning. Our consultants are based in Durban and Cape Town and we engage with clients all over the world. Our products and services are used in a range of industries, including Aerospace, Education, Finance, Food, Politics, Security and Transport.

Exegetic Analytics also offers training, with experienced and knowledgeable facilitators. Our courses focus on practical applications, working through examples and exercises based on real-world datasets.

All of our training packages include access to:

  • our online development environment and
  • detailed course material, which participants can continue to access after the training has concluded.

For more information about what we do, you can refer to our website.

These are some of the companies that have benefitted from our training:

Take a look at our full list of courses to see what other training we have on offer.

Contact Us

If this proposal is of interest to you, or you would like to hear more about what we do, you can get in touch at training@exegetic.biz or +27 (0)73 805 7439.

2. Course Description

There’s a wealth of data available on the internet which can be used for data augmentation or to create entirely new datasets.

Details

Duration: 2 days

Who should attend? The course is aimed at students, academics and professionals who need to harvest data from the internet.

Objectives: In this course you’ll learn how to use R to selectively scrape content from websites.

During this course we’ll scrape data from a number of sites including:

Outcomes: Participants will be able to isolate the relevant portions of a website and write scripts to automatically extract the required information. Furthermore, they’ll know how to apply these techniques to both static and dynamic websites.

Requirements: Participants are assumed to have prior exposure to R and the {dplyr}, {purrr} and {stringr} packages. Some familiarity with HTML and CSS will be an advantage but is not mandatory.
Setup

3. Course Outline

Day 1

  • Motivating Example
  • Review of the Tidyverse
  • Website screenshots
  • Navigating the Internet with URLs
    • Anatomy of a URL
    • Building URLs with {urltools}
    • Encoding and decoding parameters
  • HTTP
  • Deconstructing a Website
    • Structure of an HTML document
    • DOM
    • CSS (summary of CSS)
    • XPath (summary of XPath)
    • Developer Tools
    • Important files: robots.txt and sitemap.xml
    • Ethics
  • Manual Scraping
  • Scraping a Static Website using rvest
    • Retrieving page content
    • Navigation
    • Extracting text
    • Extracting attributes
    • Working with tables
    • Storing data as CSV or JSON
    • Case Study
  • Assisted Assignment

Day 2

  • Case Study
  • Sessions
    • Moving around with jump_to()
    • Checking session history
    • Filling forms
  • Dynamic Websites and JavaScript
  • Driving a Browser using RSelenium
    • Why is RSelenium needed?
    • Navigation
    • Interacting with elements
    • Combining RSelenium with rvest
    • Useful JavaScript tools
    • Going headless
    • Case Study
  • Building Robust Scrapers
    • Handling errors using tryCatch()
    • Functional tools from purrr: mapping, walking, insistently() and slowly()
  • Deploying a Scraper in the Cloud
    • Launching and connecting to an EC2 instance
    • Headless browsers
    • Automation with cron

Training Philosophy

Our training emphasises practical skills. Although you'll be learning concepts and theory, you'll also see how everything is applied in the real world. We will work through examples and exercises based on real datasets.

Requirements

All you'll need is a computer with a browser and a decent internet connection. We'll be using an online development environment. This means that you can focus on learning and not on solving technical problems.

Of course, we are happy to help you get your local environment set up too! You can start by following these instructions.

Package

The training package includes access to:

  • our online development environment and
  • detailed course material (slides and scripts).
