
Apache Spark 2.x Machine Learning Cookbook

  • Course Code: Data Science - Apache Spark 2.x Machine Learning Cookbook
  • Course Dates: Contact us to schedule.
  • Course Category: AI / Machine Learning
  • Duration: 3 Days
  • Audience: This course is geared for those who want to simplify machine learning model implementations with Spark.

Course Snapshot 

  • Duration: 3 days 
  • Skill-level: Foundation-level Apache Spark 2.x machine learning skills for intermediate-skilled team members. This is not a basic class.
  • Targeted Audience: This course is geared for those who want to simplify machine learning model implementations with Spark.
  • Hands-on Learning: This course is approximately 50% hands-on lab and 50% lecture, combining engaging lecture, demos, group activities, and discussions with machine-based student labs and exercises. Student machines are required.
  • Delivery Format: This course is available for onsite private classroom presentation. 
  • Customizable: This course may be tailored to target your specific training skills objectives, tools of choice and learning goals. 

Machine learning aims to extract knowledge from data, relying on fundamental concepts in computer science, statistics, probability, and optimization. Learning algorithms enable a wide range of applications, from everyday tasks such as product recommendations and spam filtering to cutting-edge applications such as self-driving cars and personalized medicine. You will gain hands-on experience applying these principles using Apache Spark, a resilient cluster computing system well suited for large-scale machine learning tasks.

This book begins with a quick overview of setting up the IDEs needed to run the code examples covered in its chapters. It also highlights key issues developers face when working with machine learning algorithms on the Spark platform. We then explore the various Spark APIs and implement ML algorithms by developing classification systems, recommendation engines, text analytics, clustering, and learning systems. Toward the final chapters, we’ll focus on building high-end applications and explain various unsupervised methodologies and the challenges to tackle when implementing big data ML systems.

Working in a hands-on learning environment, led by our Data Science with Python and Jupyter expert instructor, students will learn to:

  • Solve the day-to-day problems of data science with Spark
  • Work through the cookbook's exciting and intuitive numerical recipes
  • Optimize their work by acquiring, cleaning, analyzing, predicting, and visualizing data

Topics Covered: This is a high-level list of topics covered in this course. Please see the detailed agenda below.

  • Get to know how Scala and Spark go hand-in-hand for developers when developing ML systems with Spark 
  • Build a recommendation engine that scales with Spark 
  • Find out how to build unsupervised clustering systems to classify data in Spark 
  • Build machine learning systems with the Decision Tree and Ensemble models in Spark 
  • Deal with the curse of high-dimensionality in big data using Spark 
  • Implement text analytics for search engines in Spark
  • Implement streaming machine learning systems using Spark

Audience & Pre-Requisites 

This course is designed for beginners who want to simplify machine learning model implementations with Spark.

Pre-Requisites: Students should be familiar with:

  • Basics of Python  
  • Knowledge of Python is assumed. 

Course Agenda / Topics 

  1. Practical Machine Learning with Spark Using Scala 
  • Introduction 
  • Downloading and installing the JDK 
  • Downloading and installing IntelliJ 
  • Downloading and installing Spark 
  • Configuring IntelliJ to work with Spark and run Spark ML sample codes 
  • Running a sample ML code from Spark 
  • Identifying data sources for practical machine learning 
  • Running your first program using Apache Spark 2.0 with the IntelliJ IDE 
  • How to add graphics to your Spark program 
  1. Just Enough Linear Algebra for Machine Learning with Spark 
  • Introduction 
  • Package imports and initial setup for vectors and matrices 
  • Creating DenseVector and setup with Spark 2.0 
  • Creating SparseVector and setup with Spark 
  • Creating dense matrix and setup with Spark 2.0 
  • Using sparse local matrices with Spark 2.0 
  • Performing vector arithmetic using Spark 2.0 
  • Performing matrix arithmetic using Spark 2.0 
  • Exploring RowMatrix in Spark 2.0 
  • Exploring Distributed IndexedRowMatrix in Spark 2.0 
  • Exploring distributed CoordinateMatrix in Spark 2.0 
  • Exploring distributed BlockMatrix in Spark 2.0 
  1. Spark’s Three Data Musketeers for Machine Learning – Perfect Together 
  • Introduction 
  • Creating RDDs with Spark 2.0 using internal data sources 
  • Creating RDDs with Spark 2.0 using external data sources 
  • Transforming RDDs with Spark 2.0 using the filter() API 
  • Transforming RDDs with the super useful flatMap() API 
  • Transforming RDDs with set operation APIs 
  • RDD transformation/aggregation with groupBy() and reduceByKey() 
  • Transforming RDDs with the zip() API 
  • Join transformation with paired key-value RDDs 
  • Reduce and grouping transformation with paired key-value RDDs 
  • Creating DataFrames from Scala data structures 
  • Operating on DataFrames programmatically without SQL 
  • Loading DataFrames and setup from an external source 
  • Using DataFrames with standard SQL language – SparkSQL 
  • Working with the Dataset API using a Scala Sequence 
  • Creating and using Datasets from RDDs and back again 
  • Working with JSON using the Dataset API and SQL together 
  • Functional programming with the Dataset API using domain objects 
  1. Common Recipes for Implementing a Robust Machine Learning System 
  • Introduction 
  • Spark’s basic statistical API to help you build your own algorithms 
  • ML pipelines for real-life machine learning applications 
  • Normalizing data with Spark 
  • Splitting data for training and testing 
  • Common operations with the new Dataset API 
  • Creating and using RDD versus DataFrame versus Dataset from a text file in Spark 2.0 
  • LabeledPoint data structure for Spark ML 
  • Getting access to Spark cluster in Spark 2.0 
  • Getting access to Spark cluster pre-Spark 2.0 
  • Getting access to SparkContext vis-a-vis SparkSession object in Spark 2.0 
  • New model export and PMML markup in Spark 2.0 
  • Regression model evaluation using Spark 2.0 
  • Binary classification model evaluation using Spark 2.0 
  • Multiclass classification model evaluation using Spark 2.0 
  • Multilabel classification model evaluation using Spark 2.0 
  • Using the Scala Breeze library to do graphics in Spark 2.0 
  1. Practical Machine Learning with Regression and Classification in Spark 2.0 – Part I 
  • Introduction 
  • Fitting a linear regression line to data the old fashioned way 
  • Generalized linear regression in Spark 2.0 
  • Linear regression API with Lasso and L-BFGS in Spark 2.0 
  • Linear regression API with Lasso and ‘auto’ optimization selection in Spark 2.0 
  • Linear regression API with ridge regression and ‘auto’ optimization selection in Spark 2.0 
  • Isotonic regression in Apache Spark 2.0 
  • Multilayer perceptron classifier in Apache Spark 2.0 
  • One-vs-Rest classifier (One-vs-All) in Apache Spark 2.0 
  • Survival regression – parametric AFT model in Apache Spark 2.0 
  1. Practical Machine Learning with Regression and Classification in Spark 2.0 – Part II 
  • Introduction 
  • Linear regression with SGD optimization in Spark 2.0 
  • Logistic regression with SGD optimization in Spark 2.0 
  • Ridge regression with SGD optimization in Spark 2.0 
  • Lasso regression with SGD optimization in Spark 2.0 
  • Logistic regression with L-BFGS optimization in Spark 2.0 
  • Support Vector Machine (SVM) with Spark 2.0 
  • Naive Bayes machine learning with Spark 2.0 MLlib 
  • Exploring ML pipelines and DataFrames using logistic regression in Spark 2.0 
  1. Recommendation Engine that Scales with Spark 
  • Introduction 
  • Setting up the required data for a scalable recommendation engine in Spark 2.0 
  • Exploring the movies data details for the recommendation system in Spark 2.0 
  • Exploring the ratings data details for the recommendation system in Spark 2.0 
  • Building a scalable recommendation engine using collaborative filtering in Spark 2.0 
  1. Unsupervised Clustering with Apache Spark 2.0 
  • Introduction 
  • Building a KMeans classifying system in Spark 2.0 
  • Bisecting KMeans, the new kid on the block in Spark 2.0 
  • Using Gaussian Mixture and Expectation Maximization (EM) in Spark to classify data 
  • Classifying the vertices of a graph using Power Iteration Clustering (PIC) in Spark 2.0 
  • Latent Dirichlet Allocation (LDA) to classify documents and text into topics 
  • Streaming KMeans to classify data in near real-time 
  1. Optimization – Going Down the Hill with Gradient Descent 
  • Introduction 
  • Optimizing a quadratic cost function and finding the minima using just math to gain insight 
  • Coding a quadratic cost function optimization using Gradient Descent (GD) from scratch 
  • Coding Gradient Descent optimization to solve Linear Regression from scratch 
  • Normal equations as an alternative for solving Linear Regression in Spark 2.0 
  1. Building Machine Learning Systems with Decision Tree and Ensemble Models 
  • Introduction 
  • Getting and preparing real-world medical data for exploring Decision Trees and Ensemble models in Spark 2.0 
  • Building a classification system with Decision Trees in Spark 2.0 
  • Solving Regression problems with Decision Trees in Spark 2.0 
  • Building a classification system with Random Forest Trees in Spark 2.0 
  • Solving regression problems with Random Forest Trees in Spark 2.0 
  • Building a classification system with Gradient Boosted Trees (GBT) in Spark 2.0 
  • Solving regression problems with Gradient Boosted Trees (GBT) in Spark 2.0 
  1. Curse of High-Dimensionality in Big Data 
  • Introduction 
  • Two methods of ingesting and preparing a CSV file for processing in Spark 
  • Singular Value Decomposition (SVD) to reduce high-dimensionality in Spark 
  • Principal Component Analysis (PCA) to pick the most effective latent factor for machine learning in Spark 
  1. Implementing Text Analytics with Spark 2.0 ML Library 
  • Introduction 
  • Doing term frequency with Spark – everything that counts 
  • Displaying similar words with Spark using Word2Vec 
  • Downloading a complete dump of Wikipedia for a real-life Spark ML project 
  • Using Latent Semantic Analysis for text analytics with Spark 2.0 
  • Topic modeling with Latent Dirichlet allocation in Spark 2.0 
  1. Spark Streaming and Machine Learning Library 
  • Introduction 
  • Structured streaming for near real-time machine learning 
  • Streaming DataFrames for real-time machine learning 
  • Streaming Datasets for real-time machine learning 
  • Streaming data and debugging with queueStream 
  • Downloading and understanding the famous Iris data for unsupervised classification 
  • Streaming KMeans for a real-time on-line classifier 
  • Downloading wine quality data for streaming regression 
  • Streaming linear regression for a real-time regression 
  • Downloading Pima Diabetes data for supervised classification 
  • Streaming logistic regression for an on-line classifier 
