Let us help you find the training program you are looking for.

If you can't find what you are looking for, contact us and we'll help you find it. We have over 800 training programs to choose from.


  • Course Skill Level:

    Foundational to Intermediate

  • Course Duration:

    5 days

  • Course Delivery Format:

    Live, instructor-led

  • Course Category:

    Big Data & Data Science

  • Course Code:

    BIGDATL21E09

Who should attend & recommended skills:

Those with basic IT and traditional database skills

  • This course is geared toward those who want to use an architecture that takes advantage of clustered hardware along with new tools designed specifically to capture and analyze web-scale data.
  • Skill level: Foundation-level Big Data skills for intermediate-level team members. This is not a basic class.
  • IT skills: Basic to Intermediate (1-5 years’ experience)
  • Traditional databases: Basic (1-2 years’ experience) helpful
  • Prior experience with large-scale data analysis and NoSQL tools is not necessary

About this course

Web-scale applications like social networks, real-time analytics, and e-commerce sites handle data whose volume and velocity exceed the limits of traditional database systems. These applications require architectures built around clusters of machines to store and process data at any size or speed. Fortunately, scale and simplicity are not mutually exclusive.

This course teaches you to build big data systems using an architecture designed specifically to capture and analyze web-scale data. It presents the Lambda Architecture, a scalable, easy-to-understand approach that can be built and run by a small team. You’ll explore the theory of big data systems and how to implement them in practice. In addition to discovering a general framework for processing big data, you’ll learn specific technologies like Hadoop, Storm, and NoSQL databases.
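
The heart of the Lambda Architecture discussed here fits in a few lines: batch views are precomputed from an immutable master dataset, a speed layer keeps small realtime views for recent data, and every query merges the two. The sketch below is only an illustration of that merge (the class, field names, and in-memory maps are ours, not course material):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/**
 * Minimal illustration of the Lambda Architecture's query-time merge:
 * a query combines a precomputed batch view with a small realtime view.
 * The maps stand in for the batch-layer and speed-layer view databases.
 */
public class PageviewQuery {

    // Batch view: recomputed periodically from the immutable master dataset.
    private final Map<String, Long> batchView = new ConcurrentHashMap<>();

    // Realtime view: incremented by the speed layer for data the batch
    // layer has not processed yet, and expired after each batch run.
    private final Map<String, Long> realtimeView = new ConcurrentHashMap<>();

    /** Total pageviews for a URL = batch view + realtime view. */
    public long pageviews(String url) {
        return batchView.getOrDefault(url, 0L) + realtimeView.getOrDefault(url, 0L);
    }

    public static void main(String[] args) {
        PageviewQuery query = new PageviewQuery();
        query.batchView.put("example.com/home", 1_000_000L);     // from the batch layer
        query.realtimeView.put("example.com/home", 42L);          // from the speed layer
        System.out.println(query.pageviews("example.com/home"));  // 1000042
    }
}
```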

Skills acquired & topics covered

Working in a hands-on learning environment, led by our Big Data expert instructor, participants will learn about and explore:

  • Using an architecture that takes advantage of clustered hardware along with new tools designed specifically to capture and analyze web-scale data
  • A scalable, easy-to-understand approach to big data systems that can be built and run by a small team
  • The theory of big data systems and how to implement them in practice
  • How to deploy and operate these systems once they are built
  • Introduction to big data systems
  • Real-time processing of web-scale data
  • Tools like Hadoop, Cassandra, and Storm
  • Extensions to traditional database skills

Course breakdown / modules

  • The properties of data
  • The fact-based model for representing data
  • Graph schemas
  • A complete data model for SuperWebAnalytics.com
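
As a taste of the fact-based model covered in this module, the small class below shows data represented as raw, immutable, timestamped facts rather than mutable records. It is an illustrative sketch; the actual SuperWebAnalytics.com schema built in class uses a graph schema serialized with Apache Thrift.

```java
import java.time.Instant;

/**
 * Illustration of the fact-based model: data is stored as raw,
 * immutable, timestamped facts instead of mutable records.
 * A change in a user's location is a new fact, never an update.
 */
public final class LocationFact {
    private final long userId;
    private final String location;
    private final Instant timestamp;

    public LocationFact(long userId, String location, Instant timestamp) {
        this.userId = userId;
        this.location = location;
        this.timestamp = timestamp;
    }

    public long userId() { return userId; }
    public String location() { return location; }
    public Instant timestamp() { return timestamp; }

    public static void main(String[] args) {
        // Two facts about the same user: the newest one wins at query time,
        // but both remain in the master dataset forever.
        LocationFact f1 = new LocationFact(42, "Chicago", Instant.parse("2020-01-01T00:00:00Z"));
        LocationFact f2 = new LocationFact(42, "Berlin",  Instant.parse("2021-06-01T00:00:00Z"));
        System.out.println(f1.location() + " -> " + f2.location());
    }
}
```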

  • Why a serialization framework?
  • Apache Thrift
  • Limitations of serialization frameworks

  • Storage requirements for the master dataset
  • Choosing a storage solution for the batch layer
  • How distributed filesystems work
  • Storing a master dataset with a distributed filesystem
  • Vertical partitioning
  • Low-level nature of distributed filesystems
  • Storing the SuperWebAnalytics.com master dataset on a distributed filesystem

  • Using the Hadoop Distributed File System
  • Data storage in the batch layer with Pail
  • Storing the master dataset for SuperWebAnalytics.com
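
To give a feel for the module above, here is a minimal round trip through the Hadoop FileSystem API, which Pail builds on for batch-layer storage. The path and record format are made-up examples, not the course's lab data:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

/** Writes one record to HDFS and reads it back via the FileSystem API. */
public class HdfsRoundTrip {
    public static void main(String[] args) throws Exception {
        // Picks up fs.defaultFS from core-site.xml on the classpath.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        Path path = new Path("/tmp/master-dataset/pageviews/part-00000");

        // Write a single (illustrative) pageview record.
        try (FSDataOutputStream out = fs.create(path, true)) {
            out.write("alice\thttp://example.com/home\t2021-06-01T12:00:00Z\n"
                    .getBytes(StandardCharsets.UTF_8));
        }

        // Read it back.
        try (FSDataInputStream in = fs.open(path);
             BufferedReader reader = new BufferedReader(
                     new InputStreamReader(in, StandardCharsets.UTF_8))) {
            System.out.println(reader.readLine());
        }
    }
}
```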

  • Motivating examples
  • Computing on the batch layer
  • Recomputation algorithms vs. incremental algorithms
  • Scalability in the batch layer
  • MapReduce: a paradigm for Big Data computing
  • Low-level nature of MapReduce
  • Pipe diagrams: a higher-level way of thinking about batch computation
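
MapReduce is the batch-computation paradigm introduced in this module. The self-contained job below follows the classic pattern, counting pageviews per URL: the mapper emits (url, 1) pairs and the reducer sums them. The tab-separated record layout is an assumption for illustration only:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

/** Counts pageviews per URL: map emits (url, 1), reduce sums the counts. */
public class PageviewCount {

    public static class PageviewMapper
            extends Mapper<LongWritable, Text, Text, LongWritable> {
        private static final LongWritable ONE = new LongWritable(1);
        private final Text url = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws java.io.IOException, InterruptedException {
            // Assumed record layout: user <TAB> url <TAB> timestamp
            String[] fields = value.toString().split("\t");
            if (fields.length >= 2) {
                url.set(fields[1]);
                context.write(url, ONE);
            }
        }
    }

    public static class SumReducer
            extends Reducer<Text, LongWritable, Text, LongWritable> {
        @Override
        protected void reduce(Text key, Iterable<LongWritable> values, Context context)
                throws java.io.IOException, InterruptedException {
            long sum = 0;
            for (LongWritable v : values) {
                sum += v.get();
            }
            context.write(key, new LongWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "pageview-count");
        job.setJarByClass(PageviewCount.class);
        job.setMapperClass(PageviewMapper.class);
        job.setCombinerClass(SumReducer.class);
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(LongWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```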

  • An illustrative example
  • Common pitfalls of data-processing tools
  • An introduction to JCascalog
  • Composition

  • Design of the SuperWebAnalytics.com batch layer
  • Workflow overview
  • Ingesting new data
  • URL normalization
  • User-identifier normalization
  • Deduplicate pageviews
  • Computing batch views

  • Starting point
  • Preparing the workflow
  • Ingesting new data
  • URL normalization
  • User-identifier normalization
  • Deduplicate pageviews
  • Computing batch views
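
URL normalization appears in both batch-layer modules above: equivalent URLs (mixed-case hosts, default ports, trailing slashes) must map to one canonical form before pageviews can be deduplicated and counted. The helper below is a simplified sketch of that idea; the exact rules applied in class may differ:

```java
import java.net.URI;
import java.net.URISyntaxException;

/** Simplified URL normalization: lowercase scheme/host, drop default ports,
 *  fragments, and trailing slashes so equivalent URLs compare equal. */
public final class UrlNormalizer {

    public static String normalize(String raw) {
        try {
            URI uri = new URI(raw.trim());
            String scheme = uri.getScheme() == null ? "http" : uri.getScheme().toLowerCase();
            String host = uri.getHost() == null ? "" : uri.getHost().toLowerCase();

            // Drop default ports (80 for http, 443 for https).
            int port = uri.getPort();
            if ((port == 80 && scheme.equals("http")) || (port == 443 && scheme.equals("https"))) {
                port = -1;
            }

            // Strip a trailing slash from the path; the fragment is dropped entirely.
            String path = uri.getPath() == null ? "" : uri.getPath();
            if (path.endsWith("/")) {
                path = path.substring(0, path.length() - 1);
            }

            String portPart = port == -1 ? "" : ":" + port;
            String queryPart = uri.getQuery() == null ? "" : "?" + uri.getQuery();
            return scheme + "://" + host + portPart + path + queryPart;
        } catch (URISyntaxException e) {
            return raw; // leave unparseable URLs untouched
        }
    }

    public static void main(String[] args) {
        System.out.println(normalize("HTTP://Example.COM:80/Home/"));  // http://example.com/Home
        System.out.println(normalize("https://example.com/home#top")); // https://example.com/home
    }
}
```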

  • Performance metrics for the serving layer
  • The serving layer solution to the normalization/denormalization problem
  • Requirements for a serving layer database
  • Designing a serving layer for SuperWebAnalytics.com
  • Contrasting with a fully incremental solution

  • Basics of ElephantDB
  • Building the serving layer for SuperWebAnalytics.com

  • Computing realtime views
  • Storing realtime views
  • Challenges of incremental computation
  • Asynchronous versus synchronous updates
  • Expiring realtime views
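
This module is about keeping realtime views small and incremental, then expiring them once the batch layer catches up. The in-memory sketch below shows that shape with hourly buckets and an expire step; the course's speed layer uses Cassandra and Storm rather than a HashMap:

```java
import java.util.Iterator;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/**
 * Toy realtime view: pageview counts per (url, hour bucket).
 * The speed layer only needs to cover data the batch layer has not
 * absorbed yet, so old buckets are expired after each batch run.
 */
public class RealtimePageviews {
    private final Map<String, Long> counts = new ConcurrentHashMap<>();

    private static String key(String url, long hourBucket) {
        return url + "|" + hourBucket;
    }

    /** Incremental update: called for every incoming pageview. */
    public void recordPageview(String url, long epochMillis) {
        long hourBucket = epochMillis / 3_600_000L;
        counts.merge(key(url, hourBucket), 1L, Long::sum);
    }

    public long pageviews(String url, long hourBucket) {
        return counts.getOrDefault(key(url, hourBucket), 0L);
    }

    /** Expire every bucket at or before the hour the batch layer now covers. */
    public void expireThrough(long batchCoveredHour) {
        Iterator<Map.Entry<String, Long>> it = counts.entrySet().iterator();
        while (it.hasNext()) {
            String k = it.next().getKey();
            long bucket = Long.parseLong(k.substring(k.lastIndexOf('|') + 1));
            if (bucket <= batchCoveredHour) {
                it.remove();
            }
        }
    }

    public static void main(String[] args) {
        RealtimePageviews view = new RealtimePageviews();
        long now = System.currentTimeMillis();
        view.recordPageview("example.com/home", now);
        view.recordPageview("example.com/home", now);
        System.out.println(view.pageviews("example.com/home", now / 3_600_000L)); // 2
    }
}
```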

  • Cassandra’s data model
  • Using Cassandra
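
The module above covers Cassandra's data model (partition keys, clustering columns, counters) and its role as a speed-layer store. The sketch below uses the DataStax Java driver and a counter table to show an incremental realtime view; the keyspace, table, and driver choice are our assumptions, not the course's lab setup:

```java
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.Row;
import com.datastax.driver.core.Session;

/** Incrementing and reading a counter-backed realtime view in Cassandra. */
public class CassandraSpeedLayer {
    public static void main(String[] args) {
        try (Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
             Session session = cluster.connect()) {

            session.execute("CREATE KEYSPACE IF NOT EXISTS analytics "
                    + "WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1}");

            // Partition by url, cluster by hour bucket; the counter column
            // is incremented as pageviews stream in.
            session.execute("CREATE TABLE IF NOT EXISTS analytics.pageviews_by_hour ("
                    + "url text, hour_bucket bigint, views counter, "
                    + "PRIMARY KEY (url, hour_bucket))");

            session.execute("UPDATE analytics.pageviews_by_hour SET views = views + 1 "
                    + "WHERE url = 'example.com/home' AND hour_bucket = 451234");

            Row row = session.execute("SELECT views FROM analytics.pageviews_by_hour "
                    + "WHERE url = 'example.com/home' AND hour_bucket = 451234").one();
            System.out.println("views = " + row.getLong("views"));
        }
    }
}
```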

  • Queuing
  • Stream processing
  • Higher-level, one-at-a-time stream processing
  • SuperWebAnalytics.com speed layer

  • Defining topologies with Apache Storm
  • Apache Storm clusters and deployment
  • Guaranteeing message processing
  • Implementing the SuperWebAnalytics.com uniques-over-time speed layer
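
To give a flavor of topology definition with Apache Storm, here is a small topology run in local mode using the built-in TestWordSpout and a basic counting bolt. It is a generic sketch and does not reproduce the SuperWebAnalytics.com uniques-over-time speed layer built in class:

```java
import org.apache.storm.Config;
import org.apache.storm.LocalCluster;
import org.apache.storm.testing.TestWordSpout;
import org.apache.storm.topology.BasicOutputCollector;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.TopologyBuilder;
import org.apache.storm.topology.base.BaseBasicBolt;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Tuple;
import org.apache.storm.tuple.Values;

import java.util.HashMap;
import java.util.Map;

/** A minimal Storm topology: spout -> counting bolt, run in local mode. */
public class WordCountTopology {

    /** Counts words per task; fieldsGrouping sends each word to one task. */
    public static class CountBolt extends BaseBasicBolt {
        private final Map<String, Long> counts = new HashMap<>();

        @Override
        public void execute(Tuple tuple, BasicOutputCollector collector) {
            String word = tuple.getString(0);
            long count = counts.merge(word, 1L, Long::sum);
            collector.emit(new Values(word, count));
        }

        @Override
        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            declarer.declare(new Fields("word", "count"));
        }
    }

    public static void main(String[] args) throws Exception {
        TopologyBuilder builder = new TopologyBuilder();
        builder.setSpout("words", new TestWordSpout(), 2);
        builder.setBolt("count", new CountBolt(), 4)
               .fieldsGrouping("words", new Fields("word"));

        LocalCluster cluster = new LocalCluster();
        cluster.submitTopology("word-count", new Config(), builder.createTopology());
        Thread.sleep(10_000);  // let the topology run briefly
        cluster.shutdown();
    }
}
```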

  • Achieving exactly-once semantics
  • Core concepts of micro-batch stream processing
  • Extending pipe diagrams for micro-batch processing
  • Finishing the speed layer for SuperWebAnalytics.com
  • Another look at the bounce-rate-analysis example

  • Using Trident
  • Finishing the SuperWebAnalytics.com speed layer
  • Fully fault-tolerant, in-memory, micro-batch processing
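
Trident, Storm's micro-batch API, is how the course achieves exactly-once speed-layer state. The sketch below follows the standard stream -> groupBy -> persistentAggregate pattern, using Trident's testing spout and in-memory state rather than the course's production setup; class and field names are illustrative:

```java
import org.apache.storm.trident.TridentTopology;
import org.apache.storm.trident.operation.BaseFunction;
import org.apache.storm.trident.operation.TridentCollector;
import org.apache.storm.trident.operation.builtin.Count;
import org.apache.storm.trident.testing.FixedBatchSpout;
import org.apache.storm.trident.testing.MemoryMapState;
import org.apache.storm.trident.tuple.TridentTuple;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Values;

/** Micro-batch word count with Trident: exactly-once counts kept in state. */
public class TridentCounts {

    /** Splits a sentence tuple into one tuple per word. */
    public static class Split extends BaseFunction {
        @Override
        public void execute(TridentTuple tuple, TridentCollector collector) {
            for (String word : tuple.getString(0).split(" ")) {
                collector.emit(new Values(word));
            }
        }
    }

    public static org.apache.storm.generated.StormTopology buildTopology() {
        FixedBatchSpout spout = new FixedBatchSpout(new Fields("sentence"), 3,
                new Values("the cow jumped over the moon"),
                new Values("the man went to the store"));
        spout.setCycle(true);

        TridentTopology topology = new TridentTopology();
        topology.newStream("sentences", spout)
                .each(new Fields("sentence"), new Split(), new Fields("word"))
                .groupBy(new Fields("word"))
                // persistentAggregate keeps exactly-once counts in the state backend
                .persistentAggregate(new MemoryMapState.Factory(), new Count(), new Fields("count"));
        return topology.build();
    }
}
```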

  • Defining data systems
  • Batch and serving layers
  • Speed layer
  • Query layer