
  • Course Skill Level:

    Foundational

  • Course Duration:

    2 days

  • Course Delivery Format:

    Live, instructor-led.

  • Course Category:

    AI / Machine Learning

  • Course Code:

    MLSP2XL21E09

Who should attend & recommended skills:

Developers with basic Python experience

  • This course is designed for developers who want to apply machine learning algorithms in Spark to generate useful insights from their data.
  • Skill-level: Foundation-level Machine Learning with Spark skills for intermediate-skilled team members. This is not a basic class.
  • Python: Basic experience (1-2 years) required

About this course

The purpose of machine learning is to build systems that learn from data. Understanding trends and patterns in complex data is critical to success, and it is one of the key strategies for unlocking growth in today's challenging marketplace. With the meteoric rise of machine learning, developers are keen to find out how they can make their Spark applications smarter.
This course shows you how to transform data into actionable knowledge. It commences by defining machine learning primitives with the MLlib and H2O libraries. You will learn how to use binary classification to detect the Higgs boson particle in the huge volume of data produced by the CERN particle collider, and to classify daily health activities using ensemble methods for multi-class classification. Next, you will solve a typical regression problem involving flight delay predictions and write sophisticated Spark pipelines. You will analyze Twitter data with the help of the doc2vec algorithm and K-means clustering. Finally, you will build different pattern mining models using MLlib, perform complex manipulations of DataFrames using Spark and Spark SQL, and deploy your application in a Spark streaming environment.
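To give a flavor of the binary classification task described above, here is a minimal, stdlib-only Python sketch of logistic regression trained with gradient descent. It is not part of the course materials: the course uses Spark MLlib and H2O to run the same idea at cluster scale, and the toy data and function names below are purely illustrative.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train(points, labels, lr=0.5, epochs=500):
    """Fit a 1-D logistic regression (weight w, bias b) with plain SGD."""
    w, b = 0.0, 0.0
    for _ in range(epochs):
        for x, y in zip(points, labels):
            p = sigmoid(w * x + b)
            # gradient of the log-loss for one example: (p - y) * x, (p - y)
            w -= lr * (p - y) * x
            b -= lr * (p - y)
    return w, b

# Hypothetical toy data: class 0 clusters near 0, class 1 near 3
xs = [0.1, 0.4, 0.6, 2.8, 3.1, 3.4]
ys = [0, 0, 0, 1, 1, 1]
w, b = train(xs, ys)

def predict(x):
    return 1 if sigmoid(w * x + b) >= 0.5 else 0

print([predict(x) for x in xs])
```

In the course itself the equivalent step is a distributed MLlib estimator fitted on a DataFrame of labeled features rather than a hand-rolled loop.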

Skills acquired & topics covered

Working in a hands-on learning environment, led by our Machine Learning with Spark expert instructor, students will learn about and explore:
  • Processing and analyzing big data in a distributed and scalable way
  • Writing sophisticated Spark pipelines that incorporate elaborate feature extraction
  • Building and using regression models to predict flight delays
  • Using Spark streams to cluster tweets online
  • Running the PageRank algorithm to compute user influence
  • Performing complex manipulation of DataFrames using Spark
  • Defining Spark pipelines to compose individual data transformations
  • Utilizing generated models for off-line/on-line prediction
  • Transferring the learning from an ensemble to a simpler Neural Network
  • Understanding basic graph properties and important graph operations
  • Using GraphFrames, an extension of DataFrames to graphs, to study graphs using an elegant query language
  • Using K-means algorithm to cluster movie reviews dataset
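As a taste of one of the topics above, the PageRank computation used to rank user influence can be sketched in a few lines of plain Python. This is an illustrative, stdlib-only version with a hypothetical three-user follower graph; in the course the same algorithm runs on GraphX/GraphFrames over distributed data.

```python
def pagerank(links, damping=0.85, iters=50):
    """Iterative PageRank over a dict mapping node -> list of out-links."""
    nodes = set(links)
    for outs in links.values():
        nodes.update(outs)
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iters):
        new = {v: (1 - damping) / n for v in nodes}
        for src, outs in links.items():
            if outs:
                share = damping * rank[src] / len(outs)
                for dst in outs:
                    new[dst] += share
            else:
                # dangling node: spread its rank evenly over all nodes
                for v in nodes:
                    new[v] += damping * rank[src] / n
        rank = new
    return rank

# Hypothetical graph: edges point from follower to followed user.
# "alice" is followed by both others, so she comes out most influential.
graph = {"alice": ["bob"], "bob": ["alice"], "carol": ["alice"]}
ranks = pagerank(graph)
print(max(ranks, key=ranks.get))
```

The ranks form a probability distribution (they sum to 1), which is a handy sanity check when debugging a distributed implementation.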

Course breakdown / modules

  • Data science
  • The sexiest role of the 21st century – data scientist?
  • Introducing H2O.ai
  • What’s the difference between H2O and Spark’s MLlib?
  • Data munging
  • Data science – an iterative process

  • Type I versus type II error
  • Spark start and data load

  • Data
  • Modeling goal

  • NLP – a brief primer
  • The dataset
  • Feature extraction
  • Featurization – feature hashing
  • Let’s do some (model) training!
  • Super learner

  • Motivation of word vectors
  • Word2vec explained
  • Doc2vec explained
  • Applying word2vec and exploring our data with vectors
  • Creating document vectors
  • Supervised learning task

  • Frequent pattern mining
  • Pattern mining with Spark MLlib
  • Deploying a pattern mining application

  • Basic graph theory
  • GraphX distributed graph processing engine
  • Graph algorithms and applications

  • Motivation
  • Preparation of the environment
  • Data load
  • Exploration – data analysis