Machine Learning at Scale with PySpark

Apache Spark is a fast, general-purpose engine for cluster computing. Learn how to use Python and Spark to build machine learning models on large datasets with ease.

In this workshop you’ll learn how to ingest structured data into Spark and build classification and regression models. You’ll also discover how to use pipelines to streamline your workflow, and how to combine them with cross-validation and grid search to easily optimise model parameters.

All material will be available as Jupyter Notebooks.


  • Machine Learning and Big Data
  • Connecting to Spark
  • Loading data
    • DataFrame
    • Spark SQL
  • Working with data
    • Categorical data
      • Indexing
      • One-hot encoding
      • Dense versus Sparse
    • Data preparation
      • Column manipulation
      • Bucketing
      • Assembling columns
    • Text data
      • Punctuation and numbers
      • Tokens
      • Stop words
      • Hashing
      • TF-IDF
  • Classification
    • Decision Tree
    • Logistic Regression
  • Regression
    • Linear regression
    • Penalised regression
  • Pipelines
  • Cross-validation
  • Grid search
  • Ensemble models


This course builds on Spark fundamentals, so you should first complete the Introduction to Spark course.

Book now!

Training Philosophy

Our training emphasises practical skills. So, although you'll be learning concepts and theory, you'll see how everything is applied in the real world. We will work through examples and exercises based on real datasets.


All you'll need is a computer with a browser and a decent internet connection. We'll be using an online development environment. This means that you can focus on learning and not on solving technical problems.

Of course, we are happy to help you get your local environment set up too! You can start by following these instructions.


The training package includes access to
  • our online development environment and
  • detailed course material (slides and scripts).

Return to our list of courses.