Big Data Series - Machine Learning with Apache Spark Training

Course#: BSBD3000

About this Course

Course Type: Big Data
Course Code: BSBD3000
Duration: 1 Day

To stay competitive, organizations are adopting new approaches to data processing and analysis. For example, data scientists are turning to Apache Spark to process massive amounts of data using Spark’s distributed compute capability and its built-in machine learning library.

Why Attend this Course?

This intensive training course provides an overview of data science algorithms and covers the theoretical and technical aspects of using the Apache Spark platform for machine learning. The course is supplemented by a variety of hands-on labs that help attendees reinforce the material covered.

What Makes this Course Stand Apart?

New course for 2017, latest updates.

What Will You Learn?

Upon completion of this course, you will be familiar with:

  • Applied Data Science and Business Analytics
  • Machine Learning Algorithms, Techniques and Common Analytical Methods
  • Apache Spark Introduction
  • Spark’s MLlib Machine Learning Library


Audience

Data Scientists, Business Analysts, Software Developers, IT Architects


Prerequisites

Participants should have a general knowledge of statistics and programming.

Course Outline

Chapter 1. Data Science Algorithms and Analytical Methods

  • Supervised vs Unsupervised Machine Learning
  • Supervised Machine Learning Algorithms
  • Unsupervised Machine Learning Algorithms
  • Choose the Right Algorithm
  • Life-cycles of Machine Learning Development
  • Classifying with k-Nearest Neighbors (SL)
  • k-Nearest Neighbors Algorithm
  • The Error Rate
  • Decision Trees (SL)
  • Decision Tree Terminology
  • Decision Trees in Pictures
  • Decision Tree Classification in Context of Information Theory
  • Information Entropy Defined
  • The Shannon Entropy Formula
  • The Simplified Decision Tree Algorithm
  • Using Decision Trees
  • Random Forests
  • Naive Bayes Classifier (SL)
  • Naive Bayesian Probabilistic Model in a Nutshell
  • Bayes Formula
  • Classification of Documents with Naive Bayes
  • Unsupervised Learning Type: Clustering
  • K-Means Clustering (UL)
  • K-Means Clustering in a Nutshell
  • Regression Analysis
  • Simple Linear Regression Model
  • Linear vs Non-Linear Regression
  • Linear Regression Illustration
  • Major Underlying Assumptions for Regression Analysis
  • Least-Squares Method (LSM)
  • Locally Weighted Linear Regression
  • Regression Models in Excel
  • Multiple Regression Analysis
  • Logistic Regression
  • Regression vs Classification
  • Time-Series Analysis
  • Decomposing Time-Series
  • Monte-Carlo Simulation (Method)
  • Who Uses Monte-Carlo Simulation?
  • Monte-Carlo Simulation in a Nutshell
  • Monte-Carlo Simulation Example
  • Summary

Chapter 2. Introduction to Apache Spark

  • What is Spark
  • A Short History of Spark
  • Where to Get Spark?
  • The Spark Platform
  • Spark Logo
  • Common Spark Use Cases
  • Languages Supported by Spark
  • Running Spark on a Cluster
  • The Driver Process
  • Spark Applications
  • Spark Shell
  • The spark-submit Tool
  • The spark-submit Tool Configuration
  • The Executor and Worker Processes
  • The Spark Application Architecture
  • Interfaces with Data Storage Systems
  • Limitations of Hadoop’s MapReduce
  • Spark vs MapReduce
  • Spark as an Alternative to Apache Tez
  • The Resilient Distributed Dataset (RDD)
  • Spark Streaming (Micro-batching)
  • Spark SQL
  • Example of Spark SQL
  • Spark Machine Learning Library
  • GraphX
  • Spark vs R
  • Summary

Chapter 3. The Spark Shell

  • The Spark Shell
  • The Spark Shell UI
  • Spark Shell Options
  • Getting Help
  • The Spark Context (sc) and SQL Context (sqlContext)
  • The Shell Spark Context
  • Loading Files
  • Saving Files
  • Basic Spark ETL Operations
  • Summary

Chapter 4. The Spark Machine Learning Library

  • What is MLlib?
  • Supported Languages
  • MLlib Packages
  • Dense and Sparse Vectors
  • Labeled Point
  • Python Example of Using the LabeledPoint Class
  • LIBSVM Format
  • An Example of a LIBSVM File
  • Loading LIBSVM Files
  • Local Matrices
  • Example of Creating Matrices in MLlib
  • Distributed Matrices
  • Example of Using a Distributed Matrix
  • Classification and Regression Algorithms
  • Clustering
  • Summary

Chapter 5. Text Mining

  • What is Text Mining?
  • The Common Text Mining Tasks
  • What is Natural Language Processing (NLP)?
  • Some of the NLP Use Cases
  • Machine Learning in Text Mining and NLP
  • Machine Learning in NLP
  • TF-IDF
  • The Feature Hashing Trick
  • Stemming
  • Example of Stemming
  • Stop Words
  • Popular Text Mining and NLP Libraries and Packages
  • Summary

What Next? How do I arrange a group course or book a public place?

We are here to help, so please contact our live chat team.

Call +44 (0)345 467 9557 to speak to your account manager or a consultant, or email us if you prefer.

