Introduction to Spark

Contact us to book Introduction to Spark training.

Apache Spark is a fast, general-purpose system for cluster computing on large datasets.

This workshop is an introduction to working with Spark using R. After the course you will:

  • be able to load structured or unstructured data into Spark;
  • be able to apply various transformations to the data; and
  • understand the way that data are distributed across a Spark cluster.

Course Description

  • Connecting to Spark
  • RDDs
    • Unstructured Data
    • Reading data from file
    • Data distribution
    • Operations
      • Viewing
      • Simple statistics
      • Filtering
      • Mapping
    • Persistence
  • Key-Value RDDs
    • Creating
    • Transforming
    • Summarising
  • Structured Data
    • DataFrames
    • Reading data from file
    • Accessing rows and columns
    • Merging and Aggregation
  • SQL
  • Partitions
    • How is data partitioned?
    • Repartitioning
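
The outline above corresponds to a short interactive session. The snippet below is a minimal sketch of that style of workflow, assuming a local Spark installation and the sparklyr and dplyr packages; the `mtcars` data and the table name `cars` are illustrative choices, not part of the course material.

```r
library(sparklyr)
library(dplyr)

# Connecting to Spark: here a local, single-machine "cluster".
sc <- spark_connect(master = "local")

# Structured data: copy an R data frame into Spark as a DataFrame.
cars <- copy_to(sc, mtcars, "cars", overwrite = TRUE)

# Transformations are lazy: filtering, grouping and summarising build a
# plan that Spark only executes when you ask for the results.
cars %>%
  filter(cyl > 4) %>%
  group_by(cyl) %>%
  summarise(mean_mpg = mean(mpg, na.rm = TRUE))

# SQL: the same table is queryable through Spark SQL via DBI.
library(DBI)
dbGetQuery(sc, "SELECT cyl, COUNT(*) AS n FROM cars GROUP BY cyl")

# Partitions: inspect and change how the data are split across the cluster.
sdf_num_partitions(cars)
cars4 <- sdf_repartition(cars, partitions = 4)

spark_disconnect(sc)
```

Note that the dplyr pipeline is translated to Spark SQL and run on the cluster, so the same code scales from `mtcars` to datasets far too large for local memory.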

Training Philosophy

Our training emphasises practical skills: although you'll learn concepts and theory, you'll also see how everything is applied in the real world. We will work through examples and exercises based on real datasets.

Requirements

All you'll need is a computer with a browser and a decent internet connection. We'll be using an online development environment. This means that you can focus on learning and not on solving technical problems.

Of course, we are happy to help you get your local environment set up too! You can start by following these instructions.

Package

The training package includes access to
  • our online development environment and
  • detailed course material (slides and scripts).

Contact us to book Introduction to Spark training.

Remote Training Available

We can provide remote training online, tailored to your specific needs.
