  • Course Skill Level:

    Foundational

  • Course Duration:

    4 days

  • Course Delivery Format:

    Live, instructor-led.

  • Course Category:

    Big Data & Data Science

  • Course Code:

    TIKA00L21E09

Who should attend & recommended skills

  • Developers familiar with Java who want to learn how to pull usable information from otherwise inaccessible sources, including internet media and file archives.
  • Skill level: Foundation-level Tika skills for team members with intermediate skills. This is not a basic class.
  • Java: Basic (1-2 years’ experience)
  • Tika: No previous knowledge required
  • Text mining: No previous knowledge required

About this course

This course is a guide to content mining with Apache Tika. You’ll learn how to pull usable information from otherwise inaccessible sources, including internet media and file archives. Rich in examples, it teaches you to build and extend applications, drawing on real-world experience with search engines, digital asset management, and scientific data processing. In addition to architectural overviews, it includes detailed lessons on features such as metadata extraction, automatic language detection, and custom parser development.

Skills acquired & topics covered

  • Real-world examples and case studies from domains ranging from search engines to digital asset management and scientific data processing
  • Pulling usable information from otherwise inaccessible sources, including internet media and file archives
  • Extracting content from MS Word, PDF, HTML, and ZIP files
  • Integrating with search engines, content management systems (CMS), and other data sources
  • Learning through experimentation
  • Many examples

Course breakdown / modules

  • Understanding digital documents
  • everything
  • What is Apache Tika?

  • Working with Tika source code
  • The Tika application
  • Tika as an embedded library

  • Measuring information overload
  • I’m feeling lucky — searching the information landscape
  • Beyond lucky: machine learning

  • Internet media types
  • Media types in Tika
  • File format diagnostics
  • Tika, the type inspector

  • Full-text extraction
  • The Parser interface
  • Document input stream
  • Structured XHTML output
  • Context-sensitive parsing

  • The standards of metadata
  • Metadata quality
  • Metadata in Tika
  • Practical uses of metadata

  • The most translated document in the world
  • Sounds Greek to me — theory of language detection
  • Language detection in Tika

  • Types of content
  • How Tika extracts content

  • Tika in search engines
  • Managing and mining information
  • Buzzword compliance

  • Load-bearing walls
  • The steel frame
  • The finishing touches

  • Adding type information
  • Custom type detection
  • Customized parsing

  • NASA’s Planetary Data System
  • NASA’s Earth Science Enterprise

  • Introducing Apache Jackrabbit
  • The text extraction pool
  • Content-aware WebDAV

  • The NCI Early Detection Research Network
  • Integrating Tika

  • The Public Terabyte Dataset Project
  • The Bixo web crawler