  • Course Code: Data Analysis / BI - Tika
  • Course Dates: Contact us to schedule.
  • Course Category: Big Data & Data Science
  • Duration: 4 Days
  • Audience: This course is geared for those who want to learn how to pull usable information from otherwise inaccessible sources, including internet media and file archives.

Course Snapshot 

  • Duration: 4 days 
  • Skill-level: Foundation-level Tika skills for intermediate-level team members. This is not a basic class.
  • Targeted Audience: This course is geared for those who want to learn how to pull usable information from otherwise inaccessible sources, including internet media and file archives.
  • Hands-on Learning: This course is approximately 50% hands-on lab and 50% lecture, combining engaging lecture, demos, group activities and discussions with machine-based student labs and exercises. Student machines are required.
  • Delivery Format: This course is available for onsite private classroom presentation, remote instructor-led delivery, or CBT/WBT (by request).
  • Customizable: This course may be tailored to target your specific training skills objectives, tools of choice and learning goals. 

This course is a comprehensive guide to content mining using Apache Tika. You’ll learn how to pull usable information from otherwise inaccessible sources, including internet media and file archives. This example-rich course teaches you to build and extend applications based on real-world experience with search engines, digital asset management, and scientific data processing. In addition to architectural overviews, you’ll find detailed lessons on features like metadata extraction, automatic language detection, and custom parser development.

Working in a hands-on learning environment, led by our Tika expert instructor, students will learn about and explore:

  • Examples and case studies drawn from domains ranging from search engines to digital asset management and scientific data processing.
  • How to pull usable information from otherwise inaccessible sources, including internet media and file archives.

Topics Covered: This is a high-level list of topics covered in this course. Please see the detailed agenda below.

  • Extract content from MS Word, PDF, HTML, and ZIP files
  • Integrate with search engines, CMS, and other data sources 
  • Learn through experimentation 
  • Many examples 

Audience & Pre-Requisites 

This course is designed for developers familiar with Java.

Pre-Requisites:

  • A working knowledge of Java is assumed.
  • No previous knowledge of Tika or text mining techniques is required.

Course Agenda / Topics 

  1. The case for the digital Babel fish
      • Understanding digital documents
      • everything
      • What is Apache Tika?
  2. Getting started with Tika
      • Working with Tika source code
      • The Tika application
      • Tika as an embedded library
  3. The information landscape
      • Measuring information overload
      • I’m feeling lucky: searching the information landscape
      • Beyond lucky: machine learning
  4. Document type detection
      • Internet media types
      • Media types in Tika
      • File format diagnostics
      • Tika, the type inspector
  5. Content extraction
      • Full-text extraction
      • The Parser interface
      • Document input stream
      • Structured XHTML output
      • Context-sensitive parsing
  6. Understanding metadata
      • The standards of metadata
      • Metadata quality
      • Metadata in Tika
      • Practical uses of metadata
  7. Language detection
      • The most translated document in the world
      • Sounds Greek to me: theory of language detection
      • Language detection in Tika
  8. What’s in a file?
      • Types of content
      • How Tika extracts content
  9. The big picture
      • Tika in search engines
      • Managing and mining information
      • Buzzword compliance
  10. Tika and the Lucene search stack
      • Load-bearing walls
      • The steel frame
      • The finishing touches
  11. Extending Tika
      • Adding type information
      • Custom type detection
      • Customized parsing
  12. Powering NASA science data systems
      • NASA’s Planetary Data System
      • NASA’s Earth Science Enterprise
  13. Content management with Apache Jackrabbit
      • Introducing Apache Jackrabbit
      • The text extraction pool
      • Content-aware WebDAV
  14. Curating cancer research data with Tika
      • The NCI Early Detection Research Network
      • Integrating Tika
  15. The classic search engine example
      • The Public Terabyte Dataset Project
      • The Bixo web crawler