  • Course Skill Level: Foundational
  • Course Duration: 2 days
  • Course Delivery Format: Live, instructor-led
  • Course Category: Big Data & Data Science
  • Course Code: TAMTEXL21E09

Who should attend & recommended skills:

Those with basic IT, Microsoft Azure, and machine learning experience

  • Those who want to explore how to automatically organize text using approaches such as full-text search, proper name recognition, clustering, tagging, information extraction, and summarization.
  • Skill level: Foundation-level Taming Text skills for team members with intermediate skills. This is not a basic class.
  • IT Skills: Basic to Intermediate (1-5 years’ experience)
  • Microsoft Azure: Basic to Intermediate (1-5 years’ experience)
  • Machine Learning: Basic to Intermediate (1-5 years’ experience)

About this course

There is so much text in our lives that we are practically drowning in it. Fortunately, there are innovative tools and techniques for managing unstructured information that can throw the smart developer a much-needed lifeline. You’ll find them in this course.
Taming Text is a practical, example-driven guide to working with text in real applications. This course introduces you to useful techniques like full-text search, proper name recognition, clustering, tagging, information extraction, and summarization. You’ll explore real use cases as you systematically absorb the foundations upon which they are built.
Written in a clear and concise style, this course avoids jargon, explaining the subject in terms you can understand without a background in statistics or natural language processing. Examples are in Java, but the concepts can be applied in any language.

Skills acquired & topics covered

  • A hands-on, example-driven guide to working with unstructured text in real-world applications.
  • How to automatically organize text using approaches such as full-text search, proper name recognition, clustering, tagging, information extraction, and summarization.
  • Examples illustrating each of these topics, as well as the foundations upon which they are built
  • When to use text-taming techniques
  • Important open-source libraries like Solr and Mahout
  • How to build text-processing applications

Course breakdown / modules

  • Why taming text is important
  • Preview: A fact-based question answering system
  • Understanding text is hard
  • Text, tamed
  • Text and the intelligent app: search and beyond

  • Foundations of language
  • Common tools for text processing
  • Preprocessing and extracting content from common file formats

  • Search and faceting example: Amazon.com
  • Introduction to search concepts
  • Introducing the Apache Solr search server
  • Indexing content with Apache Solr
  • Searching content with Apache Solr
  • Understanding search performance factors
  • Improving search performance
  • Search alternatives

  • Approaches to fuzzy string matching
  • Finding fuzzy string matches
  • Building fuzzy string matching applications
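
To give a flavor of the fuzzy string matching module, here is a minimal Java sketch of one common approach, Levenshtein edit distance, which counts the single-character insertions, deletions, and substitutions needed to turn one string into another. The class name `EditDistanceSketch` and the sample words are illustrative assumptions, not course material, and the course may present this or other techniques differently.

```java
// Illustrative sketch only: EditDistanceSketch is a hypothetical name,
// not a class from the course materials or from any library.
final class EditDistanceSketch {

    private EditDistanceSketch() {}

    // Returns the Levenshtein distance between a and b using two rolling rows
    // of the dynamic-programming table.
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];

        // Distance from the empty prefix of a to each prefix of b.
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }

        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // deleting i characters of a to reach the empty string
            for (int j = 1; j <= b.length(); j++) {
                int cost = (a.charAt(i - 1) == b.charAt(j - 1)) ? 0 : 1;
                int insertion = curr[j - 1] + 1;
                int deletion = prev[j] + 1;
                int substitution = prev[j - 1] + cost;
                curr[j] = Math.min(Math.min(insertion, deletion), substitution);
            }
            // Swap rows so that prev holds the row just computed.
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    public static void main(String[] args) {
        System.out.println(levenshtein("kitten", "sitting")); // prints 3
        System.out.println(levenshtein("taming", "timing"));  // prints 1
    }
}
```

A smaller distance means a closer match, so a simple "did you mean" feature could rank candidate terms by this score. Production systems typically pair such a measure with an index so that only a few candidates are compared.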

  • Approaches to named-entity recognition
  • Basic entity identification with OpenNLP
  • In-depth entity identification with OpenNLP
  • Performance of OpenNLP
  • Customizing OpenNLP entity identification for a new domain

  • Google News document clustering
  • Clustering foundations
  • Setting up a simple clustering application
  • Clustering search results using Carrot2
  • Clustering document collections with Apache Mahout
  • Topic modeling using Apache Mahout
  • Examining clustering performance

  • Introduction to classification and categorization
  • The classification process
  • Building document categorizers using Apache Lucene
  • Training a naive Bayes classifier using Apache Mahout
  • Categorizing documents with OpenNLP
  • Building a tag recommender using Apache Solr

  • Basics of a question answering system
  • Installing and running the QA code
  • A sample question answering architecture
  • Understanding questions and producing answers
  • Steps to improve the system

  • Semantics, discourse, and pragmatics: exploring higher levels of NLP
  • Document and collection summarization
  • Relationship extraction
  • Identifying important content and people
  • Detecting emotions via sentiment analysis
  • Cross-language information retrieval