With Search, developers and data engineers can run more relevant and responsive queries on the data in Hadoop and integrate with external tools to build custom real-time applications.
Introduction to Cloudera Search Training
© Cloudera, Inc. All rights reserved.
Tom Wheeler, Sr. Curriculum Developer
Course Objectives
After successfully completing this course, you will be able to:
• Understand the architecture of Cloudera Search
• Describe several use cases for Cloudera Search
• Develop schemas and queries for your data
• Choose the most appropriate indexing method for a particular situation
• Perform batch indexing of data stored in HDFS and HBase
• Perform indexing of streaming data in near-real-time with Flume
• Index content in multiple languages and file formats
• Process and transform incoming data with Morphlines
• Understand the factors that affect the performance of Cloudera Search
• Create a user interface for your index using Hue
• Integrate Cloudera Search with external applications
• Improve the Search experience using features such as faceting, highlighting, and spelling correction
Target Audience, Course Prerequisites, and Required Skills
This is a three-day technical course
• Intended for software developers, data engineers, and similar roles
There are no specific prerequisite courses
Students should have the following qualifications
• A basic understanding of Hadoop
• Experience with a general-purpose programming language
• Ability to perform basic end-user tasks using the Linux command line
No prior experience with Cloudera Search or Apache Solr is necessary
• Nor is experience with tools such as Apache Flume or Apache HBase
Learning Path: Developers & Data Engineers
Developer Training
• Learn to code and write MapReduce programs for production
• Master advanced API topics required for real-world data analysis
Spark Training
• Combine batch and stream processing with interactive analytics
• Optimize applications for speed, ease of use, and sophistication
Intro to Data Science
• Implement recommenders and data experiments
• Draw actionable insights from analysis of disparate data
Big Data Applications
• Build converged applications using multiple processing engines
• Develop enterprise solutions using components across the EDH
HBase Training
• Design schemas to minimize latency on massive data sets
• Scale hundreds of thousands of operations per second
Search Training
• Bring scalable, flexible indexing to Hadoop with Apache Solr
• Integrate powerful, real-time queries with external applications
Aaron T. Myers, Software Engineer
Course Outline (1)
Overview of Cloudera Search
Performing Basic Queries
• Hands-On Exercise: Writing and Executing Basic Search Queries
• Bonus Exercise: Issuing Queries Directly to Solr
Writing More Powerful Queries
• Hands-On Exercise: Using Functions in Queries
• Bonus Exercise: Using Filter Queries
• Bonus Exercise: Field Faceting
Preparing to Index Documents
• Hands-On Exercise: Performing Pre-Indexing Tasks
• Bonus Exercise: Extracting Multiple Values from a Field
Course Outline (2)
Batch Indexing HDFS Data with MapReduce
• Hands-On Exercise: Using MapReduce to Index Data in HDFS
• Bonus Exercise: Troubleshooting Data Problems
Near-Real-Time Indexing with Flume
• Hands-On Exercise: Using Flume to Index Changes to a Collection
• Bonus Exercise: Indexing Streaming Data in Near-Real-Time
Indexing HBase Data with Lily
• Hands-On Exercise: Indexing Data in HBase Tables
Understanding Language and File Type Support
• Hands-On Exercise: Testing the Analyzer Chain with the Admin UI
• Bonus Exercise: Extracting Information from Binary Files
Course Outline (3)
Improving Search Quality and Performance
• Hands-On Exercise: Improving Search Quality
• Bonus Exercise: Using Spellchecking in Queries
Building User Interfaces for Search
• Hands-On Exercise: Building a User Interface with Hue
Considerations for Deployment
Presentation: Excerpt from Course
I will now show you some of what's in the course.
Primarily based on the "Overview of Cloudera Search" chapter
• What is Cloudera Search?
• Helpful Features
• Use Cases
Overview of Cloudera Search
• What is Cloudera Search?
• Helpful Features
• Case Studies
• Essential Points
The Need for Cloudera Search
There is significant growth in unstructured and semi-structured data
• Log files
• Product reviews
• Customer surveys
• News releases and articles
• Email and social media messages
• Research reports and other documents
We need scalability, speed, and flexibility to keep up with this growth
• Relational databases can't handle this volume or variety of data
Decreasing storage costs make it possible to store everything
• But finding relevant data is increasingly a problem
Cloudera Search Is an Important Part of an Enterprise Data Hub
Interactive full-text search capability for data in your Hadoop cluster
Makes the data accessible to non-technical audiences
• A few people can write code for Spark or MapReduce
• Many more people can write SQL queries
• Nearly everyone can use a search engine
Cloudera Search Integrates Apache Solr with CDH
Apache Solr provides a high-performance search service
• Solr is a mature platform with widespread deployment
• Standard Solr APIs and Web UI are available in Cloudera Search
Integration with CDH increases scalability and reliability
• The indexing and query processes can be distributed across nodes
Cloudera Search is 100% open source
• Released under the Apache Software License
Relationship Between Cloudera Search and Apache Solr
Apache Solr is the foundation of Cloudera Search
• Proven technology that powers much of the internet
• Active open source community
Cloudera Search adds many additional capabilities
• Integration with HDFS, MapReduce, HBase, and Flume
• Support for file formats widely used with Hadoop
• Dynamic Web-based dashboard and search interface with Hue
• Fine-grained access control through integration with Apache Sentry
How Does Cloudera Search Compare to a Relational Database?
As with a database, Cloudera Search is primarily a backend tool
• End users usually interact with it through user interfaces you create
• APIs are available for application development in multiple languages
Databases are often used to analyze data
• Search is typically used to discover data
Databases are designed to join tables based on a key
• Search is intended for queries on denormalized (flat) data sets
Databases are optimized to find and sort by specific values
• Search can match based on specific values, term variants, or ranges
• Search results are usually sorted by relevance
Scoring Manipulation
One way you can improve precision is by manipulating document scores
• Users don't always know how to write good queries
This is also used to balance the needs of the business and the user
• In the end, it is important that the user is satisfied
• Data scientists can be helpful in developing scoring algorithms
• Function queries are often used to manipulate scores
Many factors might be used to influence the scores
• Such as geography, popularity, timeliness, or profit margin
Broad File Format Support
Cloudera Search is ideal for semi-structured and free-form text data
• This includes a variety of document types such as log files, email messages, reports, spreadsheets, presentations, and multimedia
Support for indexing data from many common formats, including
• Microsoft Office (Word, Excel, and PowerPoint)
• Portable Document Format (PDF)
• HTML and XML
• UNIX mailbox format (mbox)
• Plain text and Rich Text Format (RTF)
• Hadoop file formats like SequenceFiles and Avro
Can also extract and index metadata from many image and audio formats
Multilingual Support
You can index and query content in more than 30 languages
"More Like This"
Aids in focusing results when searching on words with multiple meanings
The Apple Macintosh Book
by Cary Lu (1984)
A wealth of information about the Macintosh family of computers... more like this
Wild Apple and Fruit Trees of Central Asia
by Jules Janick and Calvin Ross Sperling (2003)
The definitive source of information about Malus species found in... more like this
The Year the Big Apple Went Bust
by Fred Ferretti (1976)
Chronicles the 1975 fiscal crisis that nearly forced New York City... more like this
Apple of My Eye
by Patrick Redmond (2003)
When Susan and Ronnie first meet, the attraction is instant... more like this
They Were Strangers: A Family History
by Slovie Solomon Apple (1995)
Determined to survive at any cost, Clara endures untold hardships... more like this
Showing results 1-5 out of 7,523 for term: apple
Term Highlighting
Highlighting helps you quickly identify matches in surrounding text
How to Traverse the Space-Time Continuum
by Doc Brown (1955)
...after hitting my head on the bathroom sink while attempting to hang a clock,
I conceived of a flux capacitor, which contains three Geissler-style gas discharge
tubes sealed with mercury vapor or reactive alkali metal such as sodium...
Customizing Your DeLorean DMC-12
by Doc Brown and Marty McFly (1985)
...the stainless steel body of the DeLorean DMC-12 provides a direct and influential
effect on the "flux dispersal" of the overall system, and by installing a flux capacitor
providing 1.21 gigawatts (roughly equivalent to the power produced by 15 jet...
Relativity: the Special and General Theory
by A. Einstein (1916)
...under these conditions, the u-curves and v-curves are straight lines in the
sense of Euclidean geometry, and they are perpendicular to each other when
the flux capacitor exceeds ~ 1200 gigawatts of electrical power...
Showing results 1 - 3 out of 18 for phrase: "flux capacitor"
Spellchecking Suggestions
Users often enter search terms incorrectly
• Unless they notice, they may conclude that no relevant data exists
• The spellchecking feature in Cloudera Search can suggest an alternative
No results found for phrase: "comptuer porgramming"
Did you mean to search for "computer programming" instead?
Geospatial Search
Cloudera Search can use location data to filter and sort results
• Proximity is calculated based on longitude and latitude of each point
1. Forest Park Station (0.1 kilometers)
2. Skinker Station (0.2 kilometers)
3. Central West End Station (0.3 kilometers)
4. Delmar Station (0.3 kilometers)
5. Big Bend Station (0.9 kilometers)
Showing all 5 results for Metrolink stations within 1 kilometer of Forest Park
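To make the proximity idea concrete, here is a minimal Python sketch that filters and sorts points by great-circle distance using the haversine formula. It only illustrates the calculation; it is not how Solr implements spatial search, and the place names and coordinates are invented placeholders.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two lat/lon points."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = radians(lat2 - lat1)
    dlambda = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlambda / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))


def nearby(origin, places, max_km):
    """Keep places within max_km of origin, sorted nearest first."""
    hits = [
        (haversine_km(*origin, lat, lon), name)
        for name, (lat, lon) in places.items()
    ]
    return sorted((dist, name) for dist, name in hits if dist <= max_km)


# Hypothetical coordinates, chosen only to exercise the functions.
places = {"A": (0.0, 0.001), "B": (0.0, 0.005), "C": (0.0, 1.0)}
results = nearby((0.0, 0.0), places, max_km=1.0)
```

Here results holds (distance, name) pairs for A and B only, because C lies roughly 111 kilometers from the origin.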
Faceted Search
Facets categorize results by field values or ranges
• Makes it easy to "drill down" into a subset of results
This feature is found on many popular Web sites
• Travel sites might facet on location and price
• Music sites might facet by genre, format, and year
Faceting makes it easy for users to narrow searches
• They can see how many items match a given facet
• Then, they can filter by that facet
This is key for analytics in Cloudera Search
[Example facet panel, music site]
Genre: Jazz (selected; remove)
Release Year: 2010 - Now (397), 2000 - 2009 (974), 1990 - 1999 (721)
Format: Vinyl (selected; remove)
[Example facet panel, travel site]
Neighborhood: Downtown (97), Midtown (62), + Show more...
Price Range: Economy (872), Moderate (519), Luxury (361)
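The two steps described for faceting (count how many items match each facet value, then filter by a chosen value) can be sketched in plain Python as follows. The album records are invented sample data, and this shows only the concept, not Solr's faceting implementation or API.

```python
from collections import Counter

# Hypothetical catalog records for a music site.
albums = [
    {"title": "Album A", "genre": "Jazz", "format": "Vinyl"},
    {"title": "Album B", "genre": "Jazz", "format": "CD"},
    {"title": "Album C", "genre": "Rock", "format": "Vinyl"},
    {"title": "Album D", "genre": "Jazz", "format": "Vinyl"},
]


def facet_counts(docs, field):
    """Count how many documents carry each value of the given field."""
    return Counter(doc[field] for doc in docs)


def filter_by(docs, field, value):
    """Drill down: keep only documents whose field equals value."""
    return [doc for doc in docs if doc[field] == value]


genre_counts = facet_counts(albums, "genre")
jazz_albums = filter_by(albums, "genre", "Jazz")
jazz_format_counts = facet_counts(jazz_albums, "format")
```

After the user picks Jazz, the format counts are recomputed over the narrowed result set, which is why facet numbers change as filters are applied.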
Hue: Search Dashboards
Hue has drag-and-drop support for building dashboards based on Search
[Screenshot: Hue "Search Employees" dashboard with Department and Year Hired facets plus Location, Education Level, and Salary widgets]
Department: Operations (590), Sales (540), Facilities Management (272), Customer Support (227), IT (222), Engineering (218), Show more...
Year Hired: 2014 (914), 2013 (892), 2012 (703), 2011 (489), 2010 (401), Before 2010 (376)
Use Case #1: Online Document Archive
Information silos impede cross-team collaboration and knowledge sharing
HDFS can act as a central repository for archiving all types of data
• Search allows employees to find this information quickly and easily
[Screenshot: document archive search interface]
Find: "fuel pump" AND fail (display 10 results per page, sorted by Author)
249 matches found
File Type: PDF (132), Microsoft Word (68), Microsoft Excel (27), E-Mail Message (19), Audio File (3)
Department: Legal Compliance (117), Engineering (86), Manufacturing (46)
Recall Notice: CX1-2112 Fuel Pump May Cause Fire
By Arnold Anderson, Chief Engineer (April 29, 2014)
The CX1-2112 fuel pump uses a neoprene gasket that has been shown to fail during normal use, causing dangerous...
Pending Class Action Regarding Faulty Fuel Pumps
From Winston Prescott, Esquire (November 11, 2014)
My firm represents 318 victims, injured during fires caused by the failure of the CX1-2112 fuel pump manufactured by...
Use Case #2: Threat Detection in Near-Real-Time
Looking at yesterday's log files allows us to react to history
• Yet emerging threats require us to react to what's happening right now
Search can help you identify important patterns in incoming data
[Screenshot: search of the Firewall Logs data set for 172.16.36.* in the IP Address field, displaying the last hour]
4,323,951 records matched (time range: 11:37:21 - 12:37:21)
Packet Rejected: Yes (4,292,172), No (61,779)
Service Port: HTTP (594,370), HTTPS (605,352), SSH (475,634), SMTP (2,645,595)
Top Five Origins by Source IP Address: New York, Ukraine, Texas, Illinois, California
Use Case #3: Market Segmentation/Identification
Survey and feedback information is valuable
• But extracting insight can be a slow and expensive process
Search makes it easy to interactively explore new opportunities
[Screenshot: search of the 2014 Survey data set for term "phone OR tablet" in the Next Purchase field]
17,138 matches with filters (Annual Income: >$500,000, Region: Southwest, Education: College Graduate)
Age Range: Under 35 (1,798), 35-50 (6,389), Over 50 (8,991)
Gender: Female (10,085), Male (7,093)
Marital Status: Married (12,347)
Primary Residence: 1. Beverly Hills, CA; 2. Malibu, CA; 3. Los Altos Hills, CA; 4. Scottsdale, AZ; 5. Park City, UT
Recent Leisure Activities (chart, 50% to 90%): Yachting, Shopping, Polo, Opera, Croquet
Monthly Expenses, by Category (chart, $8,000 to $10,000)
Documents, Fields, Queries, and Terms
It is helpful to understand the meaning of some commonly-used words in Solr
A query typically specifies terms of interest, such as "equity" or "David"
It may match one or more documents
• Each document contains one or more fields, such as "title" or "name"
The notion of "document" is flexible
• Think of a document as being similar to a record in a database table
• A single file may contain multiple documents
[Example: a report as a document]
Title: Equity Market Analysis
Date: March 14, 2015
Author: J.P. Moneybags
Summary: This report explains how to...
Body: Given the recent increase in...
[Example: table rows as documents]
name   address       city
Alice  12 Ames St.   Austin
Bruce  27 Bend Rd.   Baltimore
Carol  35 Clay Ct.   Cleveland
David  41 Deer Dr.   Dallas
Ellie  59 Elan Ln.   El Paso
Indexing Data Is a Prerequisite to Searching It
You must index data prior to querying that data with Cloudera Search
Creating and populating an index requires specialized skills
• Somewhat similar to designing database tables
• Frequently involves data extraction and transformation
Running basic queries on that data requires relatively little skill
• "Power users" who master the syntax can create very powerful queries
[Diagram: Acquire Data -> Transform Data -> Index Data -> Query Data -> Display Results]
What Is an Index?
Indexes are data structures optimized for quick lookups
• Much like a book's index helps you quickly locate information
The indexing process uses a schema to define the documents' fields
• This includes each field's name and data type
Cloudera Search includes the Morphlines library
• Can extract, transform, and load data into Solr
Data
a,Alice,Manager
b,Bruce,Engineer,$5000
c,Carol,Manager,$7500
d,David,Analyst,$5000
Schema
id: string
name: string
title: string
bonus: int
Index
name: Alice: (a), Bruce: (b), Carol: (c), David: (d)
title: Analyst: (d), Engineer: (b), Manager: (a, c)
bonus: 5000: (b, d), 7500: (c)
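The slide's example can be reproduced with a small inverted index in Python. This is a conceptual sketch only: the real index is built by Solr (Lucene), and the build_index helper and dictionary-based schema below are assumptions introduced for illustration, using the sample rows and field types shown on the slide.

```python
from collections import defaultdict

# Field names and types from the slide's schema (string -> str, int -> int).
SCHEMA = {"id": str, "name": str, "title": str, "bonus": int}

ROWS = [
    "a,Alice,Manager",
    "b,Bruce,Engineer,$5000",
    "c,Carol,Manager,$7500",
    "d,David,Analyst,$5000",
]


def build_index(rows, schema):
    """Map each field's values to the list of document ids containing them."""
    fields = list(schema)[1:]  # every field except the id itself
    index = {field: defaultdict(list) for field in fields}
    for row in rows:
        doc_id, *values = row.split(",")
        # zip stops early for rows with missing trailing fields (Alice has no bonus).
        for field, raw in zip(fields, values):
            term = schema[field](raw.lstrip("$"))
            index[field][term].append(doc_id)
    return index


index = build_index(ROWS, SCHEMA)
managers = index["title"]["Manager"]
bonus_5000 = index["bonus"][5000]
```

Looking up a value is now a single dictionary access instead of a scan over every row, which is the property that makes indexed search fast.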
Three Indexing Methods in Cloudera Search
Near-Real-Time indexing with Flume
• Data is indexed immediately as it enters the cluster
Batch mode indexing with MapReduce
• Used to index static data that already resides in HDFS
HBase indexing with Lily
• Allows you to index records stored in HBase tables
Batch Indexing of Data in HDFS with MapReduce
Use batch indexing to index static data already stored in HDFS
Cloudera Search provides a reusable job (MapReduceIndexerTool)
• Reads input data previously stored in HDFS
• Processes this data using Morphlines
• Creates the index and stores it in HDFS
[Diagram: input data is added to HDFS; the MapReduce indexing job reads it, processes it with Morphlines, then creates the index and stores it in HDFS]
Near-Real-Time Indexing with Flume
Use near-real-time indexing for streaming or continuously-generated data
• Flume reads incoming data from a specific source
• This data is processed using Morphlines
• The index is created in HDFS and updated as new records arrive
The processed data can optionally be written as files in HDFS
[Diagram: Flume reads events from a source; the Morphline Solr Sink processes them with Morphlines and creates or updates the index in HDFS; data files are optionally created in HDFS]
Indexing Data in HBase with Lily
Use Lily to index data stored in HBase tables
• HBase is a non-relational (NoSQL) distributed database built on HDFS
• HBase can scale to handle billions of records with millions of columns
Both batch and near-real-time modes of operation are supported
[Diagram: the Lily NRT Indexer Tool is triggered by updates to HBase cells, reads input from the cells, and updates the index using Morphlines; the HBase Batch Indexer Tool is invoked on demand or through a scheduler, reads input from the cells, and creates the index using Morphlines; the index is stored in HDFS]
Morphlines Overview
Morphlines is a framework for processing streams of data
• It is part of the Kite Software Development Kit (SDK)
• Offers many helpful features for indexing data with Search
• It is a plain Java library that can be used even outside of Hadoop
Especially useful for Extract, Transform, and Load (ETL) processing
• Processing commands are defined in a configuration file
• These commands are executed in sequence, much like a UNIX pipeline
• Morphlines ships with dozens of reusable commands
[Diagram: Morphlines Processing Pipeline: Incoming Record -> Read CSV -> Generate UUID -> Convert Timestamp -> Outgoing Record]
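The "commands executed in sequence" idea can be mimicked in a few lines of Python. This is an analogy only: real Morphlines commands are declared in a configuration file and run by the Java library, and the function names, record layout, and timestamp format below are assumptions chosen to mirror the three steps in the diagram.

```python
import uuid
from datetime import datetime, timezone


def read_csv(record):
    """Split a raw 'name,timestamp' line into named fields."""
    name, timestamp = record["line"].split(",")
    return {"name": name, "timestamp": timestamp}


def generate_uuid(record):
    """Attach a unique document id."""
    return {**record, "id": str(uuid.uuid4())}


def convert_timestamp(record):
    """Rewrite the timestamp in ISO 8601 form, treating the input as UTC."""
    parsed = datetime.strptime(record["timestamp"], "%Y-%m-%d %H:%M:%S")
    iso = parsed.replace(tzinfo=timezone.utc).isoformat()
    return {**record, "timestamp": iso}


PIPELINE = [read_csv, generate_uuid, convert_timestamp]


def run_pipeline(record, commands=PIPELINE):
    """Pass the record through each command in order, like a UNIX pipeline."""
    for command in commands:
        record = command(record)
    return record


result = run_pipeline({"line": "Alice,2015-03-14 09:30:00"})
```

Each command receives the previous command's output, so steps can be added, removed, or reordered without touching the others.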
Essential Points
Cloudera Search provides full-text interactive search for data in Hadoop
• Apache Solr is a mature, high-performance search platform
• CDH components provide reliability and scalability
Search offers an additional option for accessing data
• Ideal for free-form or semi-structured data in many formats
• Does not require users to have experience with Java or SQL
Data must be indexed before it can be searched
• Cloudera Search offers several methods for indexing data at scale
• You can extract, transform, and load data using Morphlines
Thank You for Attending!
• Submit questions in the Q&A panel
• Follow Cloudera University on Twitter @ClouderaU
• Learn more about Cloudera Search Training: http://university.cloudera.com/search-training
• Follow the Developer Learning Path: http://university.cloudera.com/developers
• Get Developer Certification: http://university.cloudera.com/certification
• Join the Cloudera Community: http://community.cloudera.com