Introduction to Apache Spark

  • Course Code:
  • Course Dates: Contact us to schedule.
  • Course Category: Big Data & Data Science
  • Audience: Developers, data analysts, and business analysts

Overview

We are living in an era of ‘big data’, and the ability to process and analyze that data is vital for enterprises. Spark is a popular platform for analyzing big data. This course introduces students to Apache Spark. The class is taught in Python, using the Jupyter environment.

What you will learn:

  • Spark ecosystem
  • Spark Shell
  • Spark Data structures (RDD / Dataframe / Dataset)
  • Spark SQL
  • Modern data formats and Spark
  • Spark API
  • Spark & Hadoop & Hive
  • Spark ML overview
  • GraphX
  • Spark Streaming

Audience:

Developers, data analysts, and business analysts

Skill level

Introductory to Intermediate

Prerequisites

  • Basic knowledge of Python and Jupyter notebooks is preferred but not mandatory.
    If you haven’t programmed in Python before, it is an easy language to pick up quickly, and we will provide Python resources.

Lab environment

  • A cloud-based lab environment will be provided to students; there is no need to install anything on your laptop.

Spark Introduction

  • Big Data, Hadoop, Spark
  • Spark concepts and architecture
  • Spark components overview
  • Labs: Installing and running Spark

First Look at Spark

  • Spark shell
  • Spark web UIs
  • Analyzing dataset – part 1
  • Labs: Spark shell exploration

Spark Data structures

  • Partitions
  • Distributed execution
  • Operations: transformations and actions
  • Labs: Unstructured data analytics using RDDs

Caching

  • Caching overview
  • Various caching mechanisms available in Spark
  • In memory file systems
  • Caching use cases and best practices
  • Labs: Benchmark of caching performance

Dataframes / Datasets

  • Dataframes Intro
  • Loading structured data (JSON, CSV) using Dataframes
  • Using schema
  • Specifying schema for Dataframes
  • Labs: Dataframes, Datasets, Schema

Spark SQL

  • Spark SQL concepts and overview
  • Defining tables and importing datasets
  • Querying data using SQL
  • Handling various storage formats: JSON / Parquet / ORC
  • Labs: Querying structured data using SQL; evaluating data formats

Spark and Hadoop

  • Hadoop Primer: HDFS / YARN
  • Hadoop + Spark architecture
  • Running Spark on Hadoop YARN
  • Processing HDFS files using Spark
  • Spark & Hive

Spark API

  • Overview of Spark APIs in Scala / Python
  • Life cycle of a Spark application
  • Spark APIs
  • Deploying Spark applications on YARN
  • Labs: Developing and deploying a Spark application

Spark ML Overview

  • Machine Learning primer
  • Machine Learning in Spark: MLlib / ML
  • Spark ML overview (the newer API in Spark 2)
  • Algorithms overview: Clustering, Classification, Recommendation
  • Labs: Writing ML applications in Spark

GraphX

  • GraphX library overview
  • GraphX APIs
  • Creating a graph and navigating it
  • Shortest distance
  • Pregel API
  • Labs: Processing graph data using Spark

Spark Streaming

  • Streaming concepts
  • Evaluating Streaming platforms
  • Spark Streaming library overview
  • Streaming operations
  • Sliding window operations
  • Structured Streaming
  • Continuous streaming
  • Spark & Kafka streaming
  • Labs: Writing Spark Streaming applications

Workshops (Time permitting)

  • These are group workshops
  • Attendees will work on solving real-world data analysis problems using Spark