

  • Course Skill Level:

  • Course Duration:

    3 days

  • Course Delivery Format:

    Live, instructor-led

  • Course Category:

    Big Data & Data Science

  • Course Code:

Who should attend & recommended skills

  • This course is designed for developers who want a simple approach to harnessing their data with Hadoop.
  • Skill level: foundation-level Hadoop skills for intermediate team members. This is not a basic class.
  • Python: basic (1-2 years' experience)

About this course

Today, valuable data is scattered across numerous databases within and across companies. The challenge is bringing that data together. Integrating Hadoop shows how Hadoop is used to collect and load data, both on physical devices and in the cloud. The course begins with an introduction to Hadoop and the types of data suited to it. Next, it focuses on assembling the integration team and gives an overview of Hadoop workloads in the organization. You will identify data sources for Hadoop, such as NoSQL databases and legacy/relational databases, distinguish between ETL and ELT, and learn how to load and unload data in Hadoop. You will also practice managing big data with techniques such as upserts and HBase, and discover the advantages of real-time computing and the basic structure of a streaming data architecture. Finally, you will work with an organization's master data and learn the top 10 mistakes people make when integrating Hadoop data, and how to avoid them.
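The upsert pattern mentioned above can be sketched in plain Python. This is a simplified, HBase-free illustration of the semantics only (rows addressed by a row key, with repeated writes merged rather than duplicated); the table and column names are hypothetical, not taken from the course.

```python
# HBase-style upsert sketch: a table is a dict of row-key -> row (dict of columns).
# Writing an existing key merges the new column values into the stored row.

def upsert(table: dict, row_key: str, columns: dict) -> None:
    """Insert the row if the key is new; otherwise merge the new columns in."""
    table.setdefault(row_key, {}).update(columns)

users_table = {}
upsert(users_table, "user#42", {"name": "Ada", "city": "London"})
upsert(users_table, "user#42", {"city": "Paris"})   # update in place, no duplicate row
upsert(users_table, "user#7", {"name": "Alan"})     # plain insert

print(users_table["user#42"])   # {'name': 'Ada', 'city': 'Paris'}
print(len(users_table))         # 2
```

In HBase itself the same effect comes from issuing a `Put` against an existing row key; the sketch only mirrors that merge-on-write behaviour.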

Skills acquired & topics covered

Working in a hands-on learning environment, led by our Hadoop expert instructor, students will learn about and explore:
  • Organizing a successful Hadoop rollout
  • Loading, unloading, and managing data in Hadoop
  • Integrating Hadoop with the existing information infrastructure
  • The different roles and responsibilities of the integration team
  • Moving data from one place to another with ETL and ELT
  • Loading data into Hadoop using the original method, batch loading
  • How and where to use the real-time computing framework Spark
  • Apache Kafka and its role in streaming data processing
  • Avoiding common mistakes of integrating Hadoop data
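The ETL-versus-ELT distinction in the list above can be shown with a toy sketch: plain Python lists stand in for a source system and a Hadoop target, and all names here are illustrative rather than course material.

```python
# ETL vs ELT, contrasted with plain Python collections.

def transform(record: dict) -> dict:
    # Example transformation: normalize the name field.
    return {**record, "name": record["name"].strip().title()}

def etl(source: list, target: list) -> None:
    """ETL: transform records *before* they land in the target."""
    target.extend(transform(r) for r in source)

def elt(source: list, raw_zone: list) -> list:
    """ELT: load raw records first, then transform inside the target.
    With Hadoop, this lets cheap storage and parallel compute do the
    transformation work, and keeps the untouched raw copy around."""
    raw_zone.extend(source)                   # load as-is
    return [transform(r) for r in raw_zone]   # transform later, in the target

source = [{"id": 1, "name": "  ada lovelace "}]
warehouse, raw = [], []
etl(source, warehouse)
curated = elt(source, raw)
print(warehouse[0]["name"])   # Ada Lovelace
print(raw[0]["name"])         # '  ada lovelace '  (raw copy preserved)
```

The practical difference the course draws out is visible even here: after ELT the raw zone still holds the original record, so transformations can be redone; after ETL only the transformed version exists in the target.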

Course breakdown / modules

  • Introducing Hadoop
  • Hadoop Distributions

  • Assembling the Integration Team
  • Overview of Workloads for Hadoop in the Organization
  • Identifying Data Sources for Hadoop
  • Data Profiling
  • Analyzing and Profiling Source Systems and Data

  • Continued Need for More Speed
  • Preference with Hadoop
  • Is ETL Dead?

  • Advantages of Data Integration Tools
  • Methods of Data Loading
  • Path to Production
  • How-To with Talend Big Data

  • Big Data ELT
  • Importance of Data Quality in Hadoop
  • Stewardship of Big Data

  • Hadoop Extracts
  • Hadoop and SOA

  • Advantages of Real-Time Computing
  • How and Where to Use Spark

  • Streaming Data Technology Distinctions

  • Hadoop and Master Data Management
  • Integrating with Master Data
  • Data Virtualization
  • MDM and Hadoop Disconnects

  • 1. Integrating Data Without a Business Purpose
  • 2. Integrating Data into Hadoop for an Enterprise Data Repository
  • 3. Overemphasis on Data Integration Performance to the Detriment of Query Performance for Data Usage
  • 4. Not Refining Data to the Point of Usefulness
  • 5. Improper Node Specification
  • 6. Over-Reliance on Open Source Hadoop
  • 7. ETL instead of ELT
  • 8. Using MapReduce to Load Hadoop
  • 9. Using Spark through Hive to Load Hadoop
  • 10. Ignoring the Quality of the Data Being Loaded

  • Case Studies in Big Data Integration
  • Trends in Hadoop and Summary of Ideas