Introduction to Spark

Apache Spark is a fast, general-purpose system for cluster computing on large datasets.

This workshop is an introduction to working with Spark using R. After the course you will:

  • be able to load structured or unstructured data into Spark;
  • be able to apply various transformations to the data; and
  • understand the way that data are distributed across a Spark cluster.

Course Description

  • Connecting to Spark
  • RDDs
    • Unstructured Data
    • Reading data from file
    • Data distribution
    • Operations
      • Viewing
      • Simple statistics
      • Filtering
      • Mapping
    • Persistence
  • Key-Value RDDs
    • Creating
    • Transforming
    • Summarising
  • Structured Data
    • DataFrames
    • Reading data from file
    • Accessing rows and columns
    • Merging and Aggregation
  • SQL
  • Partitions
    • How are data partitioned?
    • Repartitioning
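
As a small taste of the first module, connecting to Spark from R might look like the sketch below. This assumes the sparklyr package and a local Spark installation; the exact setup used in the course may differ.

```r
# Load sparklyr and open a connection to a local Spark instance.
library(sparklyr)
sc <- spark_connect(master = "local")

# Copy a small built-in dataset into Spark and peek at the first rows.
cars_tbl <- copy_to(sc, mtcars)
head(cars_tbl)

# Close the connection when you're done.
spark_disconnect(sc)
```

In the course itself you won't need to install anything: the online development environment comes pre-configured.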

Book now!

Training Philosophy

Our training emphasises practical skills: although you'll be learning concepts and theory, you'll also see how everything is applied in the real world. We'll work through examples and exercises based on real datasets.


All you'll need is a computer with a browser and a decent internet connection. We'll be using an online development environment. This means that you can focus on learning and not on solving technical problems.

Of course, we are happy to help you get your local environment set up too! You can start by following these instructions.


The training package includes access to
  • our online development environment and
  • detailed course material (slides and scripts).
