  • Course Skill Level:

    Foundational

  • Course Duration:

    5 days

  • Course Delivery Format:

    Live, instructor-led.

  • Course Category:

    Big Data & Data Science

  • Course Code:

    HAHISPL21E09

Who should attend & recommended skills

  • Business analysts, Software developers, Managers
  • SQL: Basic (1-2 years’ experience)
  • Software Design: Basic (1-2 years’ experience)
  • Python: Basic (1-2 years’ experience)

About this course

Hadoop is a mature Big Data environment, and Hive is the de facto standard for its SQL interface. Today, computation in Hadoop is usually done with Spark, which offers an optimized engine covering batch processing, real-time streaming, and machine learning.
This course covers Hadoop 3, Hive 3, and Spark 3.

Skills acquired & topics covered

  • Why Hadoop?
  • The Hadoop platform
  • Hive Basics
  • New in Hive 3
  • HBase
  • Sqoop
  • The big picture
  • Spark Introduction
  • First Look at Spark
  • Spark Data structures
  • Caching
  • DataFrames and Datasets
  • Spark SQL
  • Spark and Hadoop
  • Spark API
  • Spark ML Overview
  • GraphX
  • Spark Streaming

Course breakdown / modules

Why Hadoop?

  • The motivation for Hadoop
  • Use cases and case studies about Hadoop

The Hadoop platform

  • MapReduce, HDFS, YARN
  • New in Hadoop 3
  • Erasure Coding vs 3x replication
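The storage trade-off behind this comparison comes down to simple arithmetic. A small sketch, assuming the RS(6,3) Reed-Solomon policy commonly cited as the Hadoop 3 default (the helper names here are invented for illustration):

```python
# Storage overhead: HDFS 3x replication vs. erasure coding.
# With 3x replication, every block is stored three times: 200% extra storage.
# With RS(6,3), 6 data blocks carry 3 parity blocks: only 50% extra storage,
# while still tolerating the loss of any 3 blocks in the group.

def replication_overhead(copies: int) -> float:
    """Extra storage as a fraction of the raw data size."""
    return float(copies - 1)

def erasure_overhead(data_blocks: int, parity_blocks: int) -> float:
    """Extra storage as a fraction of the raw data size."""
    return parity_blocks / data_blocks

print(replication_overhead(3))   # 2.0 -> 200% extra storage
print(erasure_overhead(6, 3))    # 0.5 ->  50% extra storage
```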

Hive Basics

  • Defining Hive Tables
  • SQL Queries over Structured Data
  • Filtering / Search
  • Aggregations / Ordering
  • Partitions
  • Joins
  • Text Analytics (Semi-Structured Data)
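To preview the kinds of queries this module covers, here is an illustrative sketch using SQLite as a stand-in for Hive: the basic SELECT / WHERE / GROUP BY / JOIN / ORDER BY syntax shown is the same in HiveQL, though Hive runs these queries as distributed jobs over data in HDFS. Table and column names are invented for the example.

```python
import sqlite3

# In-memory SQLite tables standing in for Hive tables.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE orders (id INTEGER, customer TEXT, amount REAL)")
cur.execute("CREATE TABLE customers (name TEXT, city TEXT)")
cur.executemany("INSERT INTO orders VALUES (?, ?, ?)",
                [(1, "alice", 20.0), (2, "bob", 50.0), (3, "alice", 15.0)])
cur.executemany("INSERT INTO customers VALUES (?, ?)",
                [("alice", "Boston"), ("bob", "Austin")])

# Filtering / search
big = cur.execute(
    "SELECT id FROM orders WHERE amount > 18 ORDER BY id").fetchall()
print(big)      # [(1,), (2,)]

# Aggregation with ordering
totals = cur.execute(
    """SELECT customer, SUM(amount) AS total
       FROM orders GROUP BY customer ORDER BY total DESC""").fetchall()
print(totals)   # [('bob', 50.0), ('alice', 35.0)]

# Join
joined = cur.execute(
    """SELECT o.id, c.city FROM orders o
       JOIN customers c ON o.customer = c.name ORDER BY o.id""").fetchall()
print(joined)   # [(1, 'Boston'), (2, 'Austin'), (3, 'Boston')]
```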

New in Hive 3

  • ACID tables
  • Hive Query Language (HQL)
  • How to write a good query
  • How to troubleshoot queries

HBase

  • Basics
  • HBase tables – design and use
  • Phoenix driver for HBase tables

Sqoop

  • The Sqoop tool
  • Architecture
  • Usage

The big picture

  • How Hadoop fits into your architecture
  • Hive vs HBase with Phoenix vs Excel

Spark Introduction

  • Big Data, Hadoop, and Spark
  • Spark concepts and architecture
  • Spark components overview
  • Labs: Installing and running Spark

First Look at Spark

  • Spark shell
  • Spark web UIs
  • Analyzing a dataset – part 1
  • Labs: Spark shell exploration

Spark Data Structures

  • Partitions
  • Distributed execution
  • Operations: transformations and actions
  • Labs: Unstructured data analytics using RDDs
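The transformation/action distinction can be sketched without Spark at all: transformations lazily describe a pipeline, and only an action forces evaluation. A minimal word-count analogue, with the corresponding PySpark RDD calls noted in comments:

```python
# A Spark-free sketch of the RDD model. Generators are lazy, like RDD
# transformations; nothing runs until the final loop (the "action") consumes
# the pipeline.
lines = ["spark makes big data simple", "big data needs spark"]

words = (w for line in lines for w in line.split())   # like rdd.flatMap(...)
pairs = ((w, 1) for w in words)                       # like .map(lambda w: (w, 1))

counts = {}                                           # like .reduceByKey(add)
for word, n in pairs:                                 # iterating is the "action"
    counts[word] = counts.get(word, 0) + n

print(counts["spark"], counts["big"], counts["data"])   # 2 2 2
```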

Caching

  • Caching overview
  • Caching mechanisms available in Spark
  • In-memory file systems
  • Caching use cases and best practices
  • Labs: Benchmarking caching performance
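The payoff of Spark's `cache()`/`persist()` is that later actions reuse a materialized dataset instead of re-running the whole lineage. The same idea in miniature, counting how often an expensive step executes with and without a stored result (a single-machine analogue, not Spark code):

```python
# Count recomputations of an "expensive" transformation.
calls = {"n": 0}

def expensive_transform(x: int) -> int:
    calls["n"] += 1          # track how often this step actually runs
    return x * x

data = range(5)

# No caching: each pass over the pipeline recomputes every element,
# like re-running an uncached RDD lineage per action.
for _ in range(2):
    result = [expensive_transform(x) for x in data]
uncached_calls = calls["n"]
print(uncached_calls)        # 10 (5 elements x 2 passes)

# "Cached": materialize once, then later passes reuse the stored values,
# like rdd.cache() followed by several actions.
calls["n"] = 0
cached = [expensive_transform(x) for x in data]
for _ in range(2):
    result = list(cached)
cached_calls = calls["n"]
print(cached_calls)          # 5
```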

DataFrames and Datasets

  • DataFrames intro
  • Loading structured data (JSON, CSV) using DataFrames
  • Using schemas
  • Specifying a schema for DataFrames
  • Labs: DataFrames, Datasets, schemas
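Why specify a schema? Declaring column types up front avoids inference and lets malformed records be set aside. A toy stdlib analogue of schema-checked JSON loading (the schema and records are invented for the example; Spark's own reader tracks bad rows in a `_corrupt_record` column):

```python
import json

# Declared schema: column name -> expected Python type.
schema = {"name": str, "age": int}

raw = '[{"name": "alice", "age": 34}, {"name": "bob", "age": "oops"}]'

rows, corrupt = [], []
for rec in json.loads(raw):
    # Keep the record only if every column exists with the declared type.
    if all(isinstance(rec.get(col), typ) for col, typ in schema.items()):
        rows.append(rec)
    else:
        corrupt.append(rec)   # set aside malformed records

print(len(rows), len(corrupt))   # 1 1
```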

Spark SQL

  • Spark SQL concepts and overview
  • Defining tables and importing datasets
  • Querying data using SQL
  • Handling various storage formats: JSON / Parquet / ORC
  • Labs: Querying structured data using SQL; evaluating data formats

Spark and Hadoop

  • Hadoop + Spark architecture
  • Running Spark on Hadoop YARN
  • Processing HDFS files using Spark
  • Spark and Hive

Spark API

  • Overview of the Spark APIs in Scala / Python
  • Life cycle of a Spark application
  • Spark APIs
  • Deploying Spark applications on YARN
  • Labs: Developing and deploying a Spark application

Spark ML Overview

  • Machine Learning primer
  • Machine Learning in Spark: MLlib / ML
  • Spark ML overview (the newer, Spark 2 API)
  • Algorithms overview: Clustering, Classification, Recommendations
  • Labs: Writing ML applications in Spark
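To give a feel for what a clustering algorithm like MLlib's KMeans does, here is a deliberately tiny single-machine 1-D version of its assign/recompute loop. Illustrative only: Spark runs this iteration distributed over feature vectors, and this sketch assumes both clusters stay non-empty for the chosen inputs.

```python
# Minimal 1-D k-means with k=2: repeatedly assign each point to its nearest
# centroid, then recompute each centroid as its cluster mean.
def kmeans_1d(points, c1, c2, iters=10):
    for _ in range(iters):
        a = [p for p in points if abs(p - c1) <= abs(p - c2)]
        b = [p for p in points if abs(p - c1) > abs(p - c2)]
        c1 = sum(a) / len(a)   # assumes cluster a is non-empty
        c2 = sum(b) / len(b)   # assumes cluster b is non-empty
    return sorted([c1, c2])

# Two obvious clusters around 2 and 11:
print(kmeans_1d([1.0, 2.0, 3.0, 10.0, 11.0, 12.0], c1=0.0, c2=5.0))
# [2.0, 11.0]
```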

GraphX

  • GraphX library overview
  • GraphX APIs
  • Creating a graph and navigating it
  • Shortest distance
  • Pregel API
  • Labs: Processing graph data using Spark
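For an unweighted graph, the hop distances that GraphX computes with iterative Pregel-style message passing are the same ones a plain breadth-first search produces. A stand-alone sketch with an invented edge list:

```python
from collections import deque

# Directed edge list and adjacency map.
edges = [("a", "b"), ("b", "c"), ("a", "d"), ("d", "c"), ("c", "e")]
adj = {}
for src, dst in edges:
    adj.setdefault(src, []).append(dst)

def shortest_distance(start: str) -> dict:
    """Hop count from start to every reachable vertex (BFS)."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nbr in adj.get(node, []):
            if nbr not in dist:            # first visit = shortest hop count
                dist[nbr] = dist[node] + 1
                queue.append(nbr)
    return dist

print(shortest_distance("a"))   # {'a': 0, 'b': 1, 'd': 1, 'c': 2, 'e': 3}
```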

Spark Streaming

  • Streaming concepts
  • Evaluating streaming platforms
  • Spark Streaming library overview
  • Streaming operations
  • Sliding window operations
  • Structured Streaming
  • Continuous streaming
  • Spark Kafka streaming
  • Labs: Writing Spark Streaming applications
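The sliding-window operations above take a window length and a slide interval and recompute an aggregate over the most recent micro-batches. A stand-alone sketch of that idea (not Spark code; the helper name and batch data are invented):

```python
from collections import deque

def windowed_sums(batches, window=3, slide=1):
    """Sum over the last `window` micro-batches, emitted every `slide` batches."""
    buf, out = deque(maxlen=window), []   # deque drops batches older than the window
    for i, batch in enumerate(batches):
        buf.append(sum(batch))            # per-batch aggregate
        if (i + 1) % slide == 0:
            out.append(sum(buf))          # aggregate over the current window
    return out

# Each inner list is one micro-batch of event counts.
print(windowed_sums([[1, 2], [3], [4, 5], [6]]))   # [3, 6, 15, 18]
```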

Workshops

  • These are group workshops
  • Attendees work on solving real-world data analysis problems using Spark