Integrating Hadoop

  • Course Code: Data Science - Integrating Hadoop
  • Course Dates: Contact us to schedule.
  • Course Category: Big Data & Data Science
  • Duration: 3 Days
  • Audience: This course is geared for those who want a simple approach to harnessing data.

Course Snapshot 

  • Duration: 3 days 
  • Skill-level: Foundation-level Hadoop skills for intermediate-skilled team members. This is not a basic class. 
  • Targeted Audience: This course is geared for those who want a simple approach to harnessing data. 
  • Hands-on Learning: This course is approximately 50% hands-on lab and 50% lecture, combining engaging lecture, demos, group activities and discussions with machine-based student labs and exercises. Student machines are required. 
  • Delivery Format: This course is available for onsite private classroom presentation. 
  • Customizable: This course may be tailored to target your specific training skills objectives, tools of choice and learning goals. 

Today, valuable data is scattered across numerous databases and multiple companies. The challenge is bringing that data together. Integrating Hadoop shows how Hadoop is used to collect and load data on physical devices and in the cloud. The course begins with an introduction to Hadoop and the types of data suited to it. Next, it focuses on assembling the integration team and gives an overview of Hadoop workloads in the organization. You will also identify data sources for Hadoop, such as NoSQL databases and legacy/relational databases, distinguish between ETL and ELT, and learn how to load data into Hadoop and unload data from it. You will practice managing big data using methods such as upserts and HBase, and discover the advantages of real-time computing and the basic structure of a streaming data architecture. Finally, you will work with the master data of an organization and learn the top 10 mistakes people make when integrating Hadoop data and how to avoid them. 

Working in a hands-on learning environment, led by our Hadoop expert instructor, students will learn to: 

  • Organize a successful Hadoop rollout 
  • Load, unload, and manage data in Hadoop 
  • Integrate Hadoop with the existing information infrastructure 

Topics Covered: This is a high-level list of topics covered in this course. Please see the detailed Agenda below 

  • Study the different roles and responsibilities of the integration team 
  • Move data from one place to another with ETL and ELT 
  • Load data into Hadoop using the original method, batch loading 
  • Find out how and where to use the real-time computing framework Spark 
  • Discover the Apache Kafka project and its role in streaming data processing 
  • Avoid common mistakes of integrating Hadoop data 

Audience & Pre-Requisites 

This course is designed for developers who want a simple approach to harnessing data. 

Pre-Requisites: Students should be familiar with: 

  • Basics of Python (knowledge of Python is assumed) 

Course Agenda / Topics 

  1. Hadoop in Support of an Information Strategy 
  • Introducing Hadoop 
  • Hadoop Distributions 
  2. Preparing for Integration 
  • Assembling the Integration Team 
  • Overview of Workloads for Hadoop in the Organization 
  • Identifying Data Sources for Hadoop 
  • Data Profiling 
  • Analyzing and Profiling Source Systems and Data 
  3. ETL versus ELT 
  • Continued Need for More Speed 
  • Preference with Hadoop 
  • Is ETL Dead? 
  4. Loading Data into Hadoop 
  • Advantages of Data Integration Tools 
  • Methods of Data Loading 
  • Path to Production 
  • How-To with Talend Big Data 
  5. Managing Big Data 
  • Big Data ELT 
  • Importance of Data Quality in Hadoop 
  • Stewardship of Big Data 
  6. Unloading/Distributing Data from Hadoop 
  • Hadoop Extracts 
  • Hadoop and SOA 
  7. Apache Spark Cluster Computing with Hadoop 
  • Advantages of Real-Time Computing 
  • How and Where to Use Spark 
  8. Streaming Data 
  • Streaming Data 
  • Streaming Data Technology Distinctions 
  9. Master Data Management and Big Data 
  • Hadoop and Master Data Management 
  • Integrating with Master Data 
  • Data Virtualization 
  • MDM and Hadoop Disconnects 
  10. Top 10 Mistakes Integrating Hadoop Data 
  • 1. Integrating Data Without a Business Purpose 
  • 2. Integrating Data into Hadoop for an Enterprise Data Repository 
  • 3. Overemphasis on Data Integration Performance to the Detriment of Query Performance for Data Usage 
  • 4. Not Refining Data to the Point of Usefulness 
  • 5. Improper Node Specification 
  • 6. Over-Reliance on Open Source Hadoop 
  • 7. ETL instead of ELT 
  • 8. Using MapReduce to Load Hadoop 
  • 9. Using Spark through Hive to Load Hadoop 
  • 10. Ignoring the Quality of the Data Being Loaded 
  11. Case Studies and Trends 
  • Case Studies in Big Data Integration 
  • Trends in Hadoop and Summary of Ideas 