Spark: Big Data from Python



1. Introduction

Exegetic Analytics is a Data Science consultancy specialising in data acquisition and augmentation, data preparation, predictive analytics and machine learning. Our products and services are used by a wide range of industries, including Aerospace, Education, Finance, Food and Transport. Our consultants are based in Durban and Cape Town, and we engage with clients all over the world.

Exegetic Analytics also offers training, with experienced and knowledgeable facilitators. Our courses focus on practical applications, working through examples and exercises based on real-world datasets.

All of our training packages include access to:

  • our online development environment and
  • detailed course material, to which participants retain access even after the training has concluded.

For more information about what we do, you can refer to our website.

These are some of the companies who have benefited from our training:

Take a look at our full list of courses to see what other training we have on offer.

Contact Us

If this proposal is of interest to you, or you would like to hear more about what we do, you can get in touch on +27 73 805 7439.

2. Course Description

Apache Spark is a fast, general-purpose system for cluster computing on large datasets.

This course uses Python to interact with Spark.

The first part of the course is an introduction to working with Spark. It will enable you to

  • load structured or unstructured data into Spark;
  • understand the way that data are distributed across a Spark cluster;
  • apply transformations and actions to the data.

In the second part of the course you’ll learn how to build Machine Learning models on large datasets using Spark. You’ll be able to

  • create classification and regression models;
  • use pipelines to streamline your workflow; and
  • combine pipelines with cross-validation and grid-search to optimise model parameters.

All material will be available as Jupyter Notebooks.


Duration: 2 days
Requirements: a working knowledge of Python will be helpful.

Return to our list of courses.

3. Course Outline

Day 1

  • Connecting to Spark
  • RDDs
    • Unstructured Data
    • Reading data from file
    • Data distribution
    • Transformations
      • Filtering
      • Mapping
    • Actions
    • Persistence
  • Key-Value RDDs
    • Creating
    • Transforming
    • Summarising
  • Structured Data
    • DataFrames
    • Reading data from file
    • Accessing rows and columns
    • Merging and Aggregation
  • Spark SQL
  • Partitions
    • How is data partitioned?
    • Repartitioning

Day 2

  • Machine Learning and Big Data
  • Working with data
    • Categorical data
      • Indexing
      • One-hot encoding
      • Dense versus Sparse
    • Data preparation
      • Column manipulation
      • Bucketing
      • Assembling columns
    • Text data
      • Punctuation and numbers
      • Tokens
      • Stop words
      • Hashing
      • TF-IDF
  • Classification
    • Decision Tree
    • Logistic Regression
  • Regression
    • Linear regression
    • Penalised regression
  • Pipelines
  • Cross-validation
  • Grid search

Book now!

Training Philosophy

Our training emphasises practical skills. So, although you'll be learning concepts and theory, you'll see how everything is applied in the real world. We will work through examples and exercises based on real datasets.


All you'll need is a computer with a browser and a decent internet connection. We'll be using an online development environment. This means that you can focus on learning and not on solving technical problems.

Of course, we are happy to help you get your local environment set up too! You can start by following these instructions.


The training package includes access to:
  • our online development environment and
  • detailed course material (slides and scripts).

Return to our list of courses.