We are living in an era of ‘big data’, and the ability to analyze and process that data is vital for enterprises. Apache Spark is a popular platform for big data analytics. This course introduces students to Apache Spark; the class is taught in Python using the Jupyter notebook environment.
- Spark ecosystem
- Spark Shell
- Spark data structures (RDD / DataFrame / Dataset)
- Spark SQL
- Modern data formats and Spark
- Spark API
- Spark & Hadoop & Hive
- Spark ML overview
- GraphX
- Spark Streaming
Audience: developers, data analysts, and business analysts
Skill level: introductory to intermediate
- Basic knowledge of the Python language and Jupyter notebooks is preferred but not mandatory.
Even if you haven’t done any Python programming, Python is an easy language to pick up quickly; we will provide Python learning resources.
- A cloud-based lab environment will be provided to students; there is no need to install anything on your laptop.
- Big Data, Hadoop, and Spark
- Spark concepts and architecture
- Spark components overview
- Labs: Installing and running Spark
- Spark shell
- Spark web UIs
- Analyzing a dataset – part 1
- Labs: Spark shell exploration (a short PySpark sketch follows below)
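As a preview of the shell labs, here is a minimal PySpark sketch of the kind of exploration done in this module. The file path is a placeholder, and the SparkSession is created explicitly so the snippet also runs as a script (the pyspark shell and the lab notebooks normally provide `spark` already).

```python
from pyspark.sql import SparkSession

# In the pyspark shell / lab notebooks, `spark` is usually pre-created;
# creating it explicitly makes this sketch runnable as a standalone script too.
spark = SparkSession.builder.appName("shell-exploration").getOrCreate()

# Hypothetical sample file -- replace with any text file from the labs.
lines = spark.read.text("data/sample.txt")

lines.printSchema()            # a single string column named "value"
print(lines.count())           # number of lines in the file
lines.show(5, truncate=False)  # peek at the first few lines

spark.stop()
```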
- Partitions
- Distributed execution
- Operations: transformations and actions
- Labs: Unstructured data analytics using RDDs (see the word-count sketch below)
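A minimal RDD sketch of the transformation/action distinction covered in this module, assuming a placeholder text file: `flatMap`, `map`, and `reduceByKey` are lazy transformations, while `take` is the action that actually triggers execution.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("rdd-wordcount").getOrCreate()
sc = spark.sparkContext

# Transformations (lazy) -- nothing runs until an action is called.
words = (sc.textFile("data/sample.txt")
           .flatMap(lambda line: line.split())
           .map(lambda word: (word, 1))
           .reduceByKey(lambda a, b: a + b))

# Action -- triggers the distributed computation.
for word, count in words.take(10):
    print(word, count)

print("partitions:", words.getNumPartitions())
spark.stop()
```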
- Caching overview
- Various caching mechanisms available in Spark
- In-memory file systems
- Caching use cases and best practices
- Labs: Benchmarking caching performance (a timing sketch follows below)
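A small sketch of the caching benchmark idea, assuming a hypothetical CSV file: the DataFrame is counted once cold, then cached and counted again, with rough wall-clock timings printed for comparison. `StorageLevel` shows how an alternative persistence level can be requested.

```python
import time
from pyspark.sql import SparkSession
from pyspark import StorageLevel

spark = SparkSession.builder.appName("cache-benchmark").getOrCreate()

# Hypothetical input file for the benchmark.
df = spark.read.csv("data/transactions.csv", header=True, inferSchema=True)

def timed_count(frame, label):
    start = time.time()
    n = frame.count()
    print(f"{label}: {n} rows in {time.time() - start:.2f}s")

timed_count(df, "cold (no cache)")

df.cache()  # DataFrames default to MEMORY_AND_DISK
timed_count(df, "first count (populates the cache)")
timed_count(df, "second count (served from the cache)")

df.unpersist()
# Other storage levels can be chosen explicitly, e.g. memory only:
df.persist(StorageLevel.MEMORY_ONLY)

spark.stop()
```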
- DataFrames intro
- Loading structured data (JSON, CSV) using DataFrames
- Using schemas
- Specifying a schema for DataFrames
- Labs: DataFrames, Datasets, and schemas (a schema sketch follows below)
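A short DataFrame sketch contrasting schema inference with an explicit schema, as discussed above; the people.json / people.csv files and their columns are assumptions for illustration.

```python
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

spark = SparkSession.builder.appName("dataframes-schema").getOrCreate()

# Schema inference: convenient, but costs an extra pass / sampling of the data.
people_json = spark.read.json("data/people.json")
people_json.printSchema()

# Explicit schema: faster and safer for production pipelines.
schema = StructType([
    StructField("name", StringType(), True),
    StructField("age", IntegerType(), True),
])
people_csv = spark.read.csv("data/people.csv", header=True, schema=schema)
people_csv.select("name").where(people_csv.age > 21).show()

spark.stop()
```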
- Spark SQL concepts and overview
- Defining tables and importing datasets
- Querying data using SQL
- Handling various storage formats: JSON / Parquet / ORC
- Labs: Querying structured data using SQL; evaluating data formats (see the sketch below)
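A minimal Spark SQL sketch reusing the hypothetical people.json file: the DataFrame is registered as a temporary view, queried with SQL, and the result is written back out in several formats so they can be compared in the lab.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("spark-sql").getOrCreate()

people = spark.read.json("data/people.json")
people.createOrReplaceTempView("people")

adults = spark.sql("""
    SELECT name, age
    FROM people
    WHERE age >= 18
    ORDER BY age DESC
""")
adults.show()

# Write the same result in different formats to compare size and query speed.
adults.write.mode("overwrite").parquet("out/adults.parquet")
adults.write.mode("overwrite").orc("out/adults.orc")
adults.write.mode("overwrite").json("out/adults.json")

spark.stop()
```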
- Hadoop primer: HDFS / YARN
- Hadoop + Spark architecture
- Running Spark on Hadoop YARN
- Processing HDFS files using Spark
- Spark & Hive (a short HDFS and Hive access sketch follows below)
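A hedged sketch of what Spark-on-Hadoop access looks like from code: the HDFS path and Hive table name are placeholders, `enableHiveSupport()` assumes a reachable Hive metastore, and the YARN master is normally supplied by spark-submit rather than hard-coded.

```python
from pyspark.sql import SparkSession

# Typically submitted with: spark-submit --master yarn --deploy-mode cluster app.py
spark = (SparkSession.builder
         .appName("spark-on-hadoop")
         .enableHiveSupport()   # requires a configured Hive metastore
         .getOrCreate())

# Reading a file directly from HDFS (placeholder path).
logs = spark.read.text("hdfs:///data/weblogs/access.log")
print(logs.count())

# Querying an existing Hive table (placeholder database/table name).
spark.sql("SELECT COUNT(*) FROM sales.transactions").show()

spark.stop()
```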
- Overview of Spark APIs in Scala / Python
- Life cycle of a Spark application
- Spark APIs
- Deploying Spark applications on YARN
- Labs: Developing and deploying a Spark application (an application skeleton follows below)
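A skeleton of the kind of standalone application developed and deployed in this module's lab; the script name, input path, column name, and spark-submit options are illustrative only.

```python
# app.py -- a minimal standalone Spark application (names and paths are illustrative).
from pyspark.sql import SparkSession

def main():
    spark = SparkSession.builder.appName("my-spark-app").getOrCreate()
    try:
        df = spark.read.csv("data/input.csv", header=True, inferSchema=True)
        df.groupBy("category").count().show()
    finally:
        spark.stop()  # ends the application's life cycle cleanly

if __name__ == "__main__":
    main()

# Deployed on YARN with something like:
#   spark-submit --master yarn --deploy-mode cluster app.py
```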
- Machine Learning primer
- Machine learning in Spark: MLlib / ML
- Spark ML overview (the newer, DataFrame-based API in Spark 2)
- Algorithms overview: clustering, classification, recommendations
- Labs: Writing ML applications in Spark (a pipeline sketch follows below)
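A small spark.ml sketch (the DataFrame-based API referred to above) that clusters a hypothetical CSV of two numeric columns with KMeans; the file path and column names are assumptions.

```python
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.clustering import KMeans

spark = SparkSession.builder.appName("ml-kmeans").getOrCreate()

# Hypothetical dataset with two numeric feature columns, "x" and "y".
data = spark.read.csv("data/points.csv", header=True, inferSchema=True)

assembler = VectorAssembler(inputCols=["x", "y"], outputCol="features")
kmeans = KMeans(k=3, seed=42, featuresCol="features")

model = Pipeline(stages=[assembler, kmeans]).fit(data)
model.transform(data).select("x", "y", "prediction").show(10)

spark.stop()
```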
- GraphX library overview
- GraphX APIs
- Creating a graph and navigating it
- Shortest distance
- Pregel API
- Labs: Processing graph data using Spark (see the sketch below)
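GraphX itself exposes a Scala/JVM API; from Python, graph processing is commonly done through the separate GraphFrames package, which wraps GraphX algorithms. The sketch below assumes that package is installed and uses made-up vertex and edge data.

```python
from pyspark.sql import SparkSession
from graphframes import GraphFrame  # separate package, not bundled with Spark

spark = SparkSession.builder.appName("graph-demo").getOrCreate()

# Toy graph: vertices need an "id" column, edges need "src" and "dst".
vertices = spark.createDataFrame(
    [("a", "Alice"), ("b", "Bob"), ("c", "Carol")], ["id", "name"])
edges = spark.createDataFrame(
    [("a", "b", "follows"), ("b", "c", "follows")], ["src", "dst", "relationship"])

g = GraphFrame(vertices, edges)
g.inDegrees.show()

# Shortest distances from every vertex to the landmark vertex "a"
# (GraphFrames runs this on top of GraphX).
g.shortestPaths(landmarks=["a"]).show()

spark.stop()
```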
- Streaming concepts
- Evaluating Streaming platforms
- Spark streaming library overview
- Streaming operations
- Sliding window operations
- Structured Streaming
- Continuous streaming
- Spark & Kafka streaming
- Labs: Writing Spark Streaming applications (a Structured Streaming sketch follows below)
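A Structured Streaming sketch combining the Kafka source with a sliding-window aggregation from this module; the broker address and topic are placeholders, and the spark-sql-kafka connector package has to be supplied to spark-submit.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, window

spark = SparkSession.builder.appName("streaming-demo").getOrCreate()

# Kafka source -- broker and topic are placeholders; requires the
# spark-sql-kafka connector package on the classpath.
events = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "localhost:9092")
          .option("subscribe", "events")
          .load())

# Count events per key over a 10-minute window sliding every 5 minutes.
counts = (events
          .select(col("key").cast("string").alias("key"), col("timestamp"))
          .groupBy(window(col("timestamp"), "10 minutes", "5 minutes"), col("key"))
          .count())

query = (counts.writeStream
         .outputMode("complete")
         .format("console")
         .option("truncate", "false")
         .start())

query.awaitTermination()
```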
- These are group workshops
- Attendees will work on solving real-world data analysis problems using Spark