Tika

Course Skill Level:

Foundational

Course Duration:

4 days

Course Delivery Format:

Live, instructor-led.

Course Category:

Big Data & Data Science

Course Code:

TIKA00L21E09

Who should attend & recommended skills

Developers familiar with Java who want to learn how to pull usable information from otherwise inaccessible sources, including internet media and file archives.
Skill level: Foundation-level Tika skills for intermediate-skilled team members. This is not a basic class.
Java: Basic (1-2 years’ experience)
Tika: No previous knowledge required
Text mining: No previous knowledge required

About this course

This course is a practical guide to content mining with Apache Tika. You’ll learn how to pull usable information from otherwise inaccessible sources, including internet media and file archives. Rich in examples, it teaches you to build and extend applications, drawing on real-world experience with search engines, digital asset management, and scientific data processing. In addition to architectural overviews, it offers detailed lessons on features such as metadata extraction, automatic language detection, and custom parser development.

Skills acquired & topics covered

Examples and case studies drawn from real-world domains, ranging from search engines to digital asset management and scientific data processing
How to pull usable information from otherwise inaccessible sources, including internet media and file archives
Cracking MS Word, PDF, HTML, and ZIP
Integrating with search engines, CMS, and other data sources
Learning through experimentation
Many examples

Course breakdown / modules

The case for the digital Babel fish

Understanding digital documents
everything
What is Apache Tika?

Getting started with Tika

Working with Tika source code
The Tika application
Tika as an embedded library

The information landscape

Measuring information overload
I’m feeling lucky: searching the information landscape
Beyond lucky: machine learning

Document type detection

Internet media types
Media types in Tika
File format diagnostics
Tika, the type inspector

Content extraction

Full-text extraction
The Parser interface
Document input stream
Structured XHTML output
Context-sensitive parsing

Understanding metadata

The standards of metadata
Metadata quality
Metadata in Tika
Practical uses of metadata

Language detection

The most translated document in the world
Sounds Greek to me: theory of language detection
Language detection in Tika

What's in a file?

Types of content
How Tika extracts content

The big picture

Tika in search engines
Managing and mining information
Buzzword compliance

Tika and the Lucene search stack

Load-bearing walls
The steel frame
The finishing touches

Extending Tika

Adding type information
Custom type detection
Customized parsing

Powering NASA science data systems

NASA’s Planetary Data System
NASA’s Earth Science Enterprise

Content management with Apache Jackrabbit

Introducing Apache Jackrabbit
The text extraction pool
Content-aware WebDAV

Curating cancer research data with Tika

The NCI Early Detection Research Network
Integrating Tika

The classic search engine example

The Public Terabyte Dataset Project
The Bixo web crawler