Course Skill Level:

Foundational to Intermediate

Course Duration:

3 days

  • Course Delivery Format:

    Live, instructor-led.

  • Course Category:

    Big Data & Data Science

  • Course Code:


Who should attend & recommended skills

  • Developers, data analysts, and business analysts.
  • Even if you haven’t done any Python programming, you can follow along: Python is an easy language to pick up quickly.
  • We will provide Python resources.
  • Jupyter notebooks: basic (1-2 years’ experience); preferred, not required.
  • A reasonably modern laptop with an unrestricted connection to the Internet; required. Laptops with overly restrictive VPNs or firewalls may not work properly.
  • Chrome browser; required.

About this course

We are living in an era of ‘big data’, and the ability to process and analyze it is vital for enterprises. Apache Spark is a popular platform for big data analysis. This course introduces students to Spark and is taught in Python using the Jupyter environment.

Skills acquired & topics covered

  • Spark ecosystem
  • Spark Shell
  • Spark data structures (RDD / DataFrame / Dataset)
  • Spark SQL
  • Modern data formats and Spark
  • Spark API
  • Spark, Hadoop, and Hive
  • Spark ML overview
  • GraphX
  • Spark Streaming

Course breakdown / modules

  • Big Data, Hadoop, Spark
  • Spark concepts and architecture
  • Spark components overview
  • Labs: Installing and running Spark

  • Spark shell
  • Spark web UIs
  • Analyzing dataset – part 1
  • Labs: Spark shell exploration

  • Partitions
  • Distributed execution
  • Operations: transformations and actions
  • Labs: Unstructured data analytics using RDDs

  • Caching overview
  • Various caching mechanisms available in Spark
  • In memory file systems
  • Caching use cases and best practices
  • Labs: Benchmark of caching performance

  • DataFrames intro
  • Loading structured data (JSON, CSV) using DataFrames
  • Using schema
  • Specifying schema for DataFrames
  • Labs: DataFrames, Datasets, Schema

  • Spark SQL concepts and overview
  • Defining tables and importing datasets
  • Querying data using SQL
  • Handling various storage formats: JSON / Parquet / ORC
  • Labs: Querying structured data using SQL; evaluating data formats

  • Hadoop primer: HDFS / YARN
  • Hadoop + Spark architecture
  • Running Spark on Hadoop YARN
  • Processing HDFS files using Spark
  • Spark Hive
  • Spark API
  • Deploying a Spark application

  • Machine Learning primer
  • Machine Learning in Spark: MLlib / ML
  • Spark ML overview (newer Spark 2 version)
  • Algorithms overview: Clustering, Classification, Recommendations
  • Labs: Writing ML applications in Spark

  • GraphX library overview
  • GraphX APIs
  • Creating a graph and navigating it
  • Shortest distance
  • Pregel API
  • Labs: Processing graph data using Spark

  • Streaming concepts
  • Evaluating Streaming platforms
  • Spark streaming library overview
  • Streaming operations
  • Sliding window operations
  • Structured Streaming
  • Continuous streaming
  • Spark Kafka streaming
  • Labs: Writing Spark Streaming applications

  • These are group workshops
  • Attendees will work on solving real-world data analysis problems using Spark