Solr + Hadoop: Interactive Search for Hadoop
1. Solr + Hadoop: Interactive Search for Hadoop
Gregory Chanan (gchanan AT cloudera.com)
OC Big Data Meetup 07/16/14
2. Agenda
• Big Data and Search – setting the stage
• Cloudera Search Architecture
• Component Deep Dive
• Security
• Conclusion
3. Agenda
• Big Data and Search – setting the stage
• Cloudera Search Architecture
• Component Deep Dive
• Security
• Conclusion
4. Why Search?
• Hadoop for everyone
• Typical case:
• Ingest data to storage engine (HDFS, HBase, etc)
• Process data (MapReduce, Hive, Impala)
• Experts know MapReduce
• Savvy people know SQL
• Everyone knows Search!
5. Why Search?
An Integrated Part of the Hadoop System
• One pool of data
• One security framework
• One set of system resources
• One management interface
6. Benefits of Search
• Improved Big Data ROI
• An interactive experience without technical knowledge
• Faster time to insight
• Exploratory analysis, esp. unstructured data
• Broad range of indexing options to accommodate needs
• Cost efficiency
• Single scalable platform; no incremental investment
• No need for separate systems, storage
7. What is Cloudera Search?
• Full-text, interactive search with faceted navigation
• Apache Solr integrated with CDH
• Established, mature search with vibrant community
• In production environments for years
• Open Source
• 100% Apache, 100% Solr
• Standard Solr APIs
• Batch, near real-time, and on-demand indexing
• Available for CDH4 and CDH5
8. Agenda
• Big Data and Search – setting the stage
• Cloudera Search Architecture
• Component Deep Dive
• Security
• Conclusion
9. Apache Hadoop
• Apache HDFS
• Distributed file system
• High reliability
• High throughput
• Apache MapReduce
• Parallel, distributed programming model
• Allows processing of large datasets
• Fault tolerant
10. Apache Lucene
• Full text search library
• Indexing
• Querying
• Traditional inverted index
• Batch and Incremental indexing
• The current release uses Lucene 4.4
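The inverted index the slide mentions can be sketched in a few lines of Python. This is a toy model of the idea only; real Lucene adds text analysis, scoring, and a compressed on-disk format.

```python
from collections import defaultdict

def build_inverted_index(docs):
    # Map each term to the set of doc ids containing it -- the core
    # structure behind a Lucene index, stripped of analysis, scoring,
    # and storage details.
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

docs = {1: "big data search", 2: "interactive search for hadoop"}
index = build_inverted_index(docs)
print(sorted(index["search"]))  # [1, 2]
```

Querying is then a set lookup per term, which is what makes full-text search interactive even over large corpora.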
11. Apache Solr
• Search service built using Lucene
• Ships with Lucene (same TLP at Apache)
• Provides XML/HTTP/JSON/Python/Ruby/… APIs
• Indexing
• Query
• Administrative interface
• Also rich web admin GUI via HTTP
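As a sketch of the HTTP API in practice, the snippet below builds a Solr `select` request URL with standard parameters; the host, collection, and field names are placeholders for illustration.

```python
from urllib.parse import urlencode

# Hypothetical collection and field names, for illustration only.
base = "http://localhost:8983/solr/collection1/select"
params = {
    "q": "body:search",    # main query, Lucene query syntax
    "fq": "type:article",  # filter query, cached independently
    "wt": "json",          # response writer: JSON instead of XML
    "rows": 10,            # page size
}
url = base + "?" + urlencode(params)
print(url)
```

Any HTTP client (or the language bindings the slide lists) can issue this request; the response is a JSON document with matching results and facet counts.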
12. Apache SolrCloud
• Provides distributed Search capability
• Part of Solr (not a separate library/codebase)
• Shards – provide scalability
• partition index for size
• replicate for query performance
• Uses ZooKeeper for coordination
• No split-brain issues
• Simplifies operations
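The shard partitioning above relies on deterministic document routing. A simplified stand-in: real SolrCloud's compositeId router uses MurmurHash3 over a 32-bit ring; crc32 is used here only to keep the sketch deterministic and dependency-free.

```python
import zlib

def route_to_shard(doc_id: str, num_shards: int) -> int:
    # Hash the unique key and map it onto one of num_shards
    # partitions. (Stand-in for SolrCloud's MurmurHash-based
    # compositeId router.)
    return zlib.crc32(doc_id.encode("utf-8")) % num_shards

# Every node computes the same mapping, so an update arriving at any
# node can be forwarded to the correct shard leader; ZooKeeper only
# needs to track which node currently leads each shard.
shard = route_to_shard("doc-42", 4)
assert shard == route_to_shard("doc-42", 4)  # deterministic
print(shard)
```

Because the mapping is a pure function of the document key, there is no central dispatcher to become a bottleneck or a split-brain hazard.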
13. SolrCloud Architecture
• Updates automatically sent to the correct shard
• Replicas handle queries, forward updates to the leader
• Leader indexes the document for the shard, and forwards the update to any replicas
15. Distributed Search on Hadoop
[Diagram: Flume, the Hue UI, custom UIs, and custom apps send queries to a SolrCloud of Solr servers; index data flows in from MapReduce, HDFS, and HBase inside the Hadoop cluster, with ZooKeeper coordinating.]
16. Agenda
• Big Data and Search – setting the stage
• Cloudera Search Architecture
• Component Deep Dive
• Indexing
• ETL - morphlines
• Querying
• Security
• Conclusion
17. Indexing
• Near Real Time (NRT)
• Flume
• HBase Indexer
• Batch
• MapReduceIndexerTool
• HBaseBatchIndexer
18. Near Real Time Indexing with Flume
Solr and Flume
• Data ingest at scale
• Flexible extraction and mapping
• Indexing at data ingest
[Diagram: log files and other sources feed Flume Agents, whose indexers write both to Solr and to HDFS.]
19. Apache Flume - MorphlineSolrSink
• A Flume Source…
• Receives/gathers events
• A Flume Channel…
• Carries the event – MemoryChannel or reliable FileChannel
• A Flume Sink…
• Sends the events on to the next location
• Flume MorphlineSolrSink
• Integrates Cloudera Morphlines library
• ETL, more on that in a bit
• Does batching
• Results sent to Solr for indexing
20. Indexing
• Near Real Time (NRT)
• Flume
• HBase Indexer
• Batch
• MapReduceIndexerTool
• HBaseBatchIndexer
21. Near Real Time Indexing of Apache HBase
[Diagram: interactive load writes to HBase on HDFS; HBase replication feeds the HBase Indexer(s), which index into a bank of Solr servers for search.]
• Planet-sized tabular data (big data) + immediate access & updates (data management) = fast & flexible information discovery
22. Lily HBase Indexer
• Collaboration between NGData & Cloudera
• NGData are creators of the Lily data management platform
• Lily HBase Indexer
• Service which acts as a HBase replication listener
• HBase replication features, such as filtering, supported
• Replication updates trigger indexing of updates (rows)
• Integrates Cloudera Morphlines library for ETL of rows
• AL2 licensed on github https://github.com/ngdata
23. Indexing
• Near Real Time (NRT)
• Flume
• HBase Indexer
• Batch
• MapReduceIndexerTool
• HBaseBatchIndexer
25. MapReduce Indexer
MapReduce Job with two parts
1) Scan HDFS for files to be indexed
• Much like Unix “find” – see HADOOP-8989
• Output is NLineInputFormat’ed file
2) Mapper/Reducer indexing step
• Mapper extracts content via Cloudera Morphlines
• Reducer indexes documents via embedded Solr server
• Originally based on SOLR-1301
• Many modifications to enable linear scalability
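The two-phase flow can be sketched end to end in Python; the file paths and contents below are made up, dicts stand in for HDFS and the embedded Solr server, and crc32 stands in for document routing.

```python
import zlib
from collections import defaultdict

# Phase 1: scan for input files (like Unix "find" over HDFS).
files = {
    "logs/2014/07/a.log": "disk full on node7",
    "logs/2014/07/b.log": "login ok for alice",
}
work_list = sorted(files)  # the NLineInputFormat'ed work list

# Phase 2: mappers extract terms (the morphline ETL step); reducers
# build one inverted index per shard with an embedded indexer --
# plain dicts here instead of an embedded Solr server.
NUM_SHARDS = 2
shards = [defaultdict(set) for _ in range(NUM_SHARDS)]
for path in work_list:
    target = zlib.crc32(path.encode()) % NUM_SHARDS
    for term in files[path].split():
        shards[target][term].add(path)

print(sum(len(ix) for ix in shards))  # total (shard, term) entries
```

Because each reducer builds its shard index independently, the job scales out linearly with the number of reducer slots, which is the property the golive mode on the next slide exploits.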
26. MapReduce Indexer “golive”
• Cloudera created this to bridge the gap between NRT
(low latency, expensive) and Batch (high latency,
cheap at scale) indexing
• Results of MR indexing operation are immediately
merged into a live SolrCloud serving cluster
• No downtime for users
• No NRT expense
• Linear scale out to the size of your MR cluster
27. Indexing
• Near Real Time (NRT)
• Flume
• HBase Indexer
• Batch
• MapReduceIndexerTool
• HBaseBatchIndexer
28. HBase + MapReduce
• Run MapReduce job over HBase tables
• Same architecture as running over HDFS
• Similar to HBase’s CopyTable
• Support for go-live
29. Agenda
• Big Data and Search – setting the stage
• Cloudera Search Architecture
• Component Deep Dive
• Indexing
• ETL - morphlines
• Querying
• Security
• Conclusion
30. Cloudera Morphlines
• Open Source framework for simple ETL
• Simplify ETL
• Built-in commands and library support (Avro format, Hadoop
SequenceFiles, grok for syslog messages)
• Configuration over coding
• Standardize ETL
• Ships as part of Kite SDK, formerly Cloudera
Developer Kit (CDK)
• It’s a Java library
• AL2 licensed on github https://github.com/kite-sdk
31. Cloudera Morphlines Architecture
[Diagram: anything you want to index (logs, tweets, social media, HTML, images, PDF, text...) flows through the Morphline Library, embedded in Flume, the MR Indexer, the HBase Indexer, or your own application, and into SolrCloud.]
Morphlines can be embedded in any application…
32. Extraction and Mapping
• Modeled after Unix pipelines (records instead of lines)
• Simple and flexible data transformation
• Reusable across multiple index workloads
• Over time, extend and re-use across platform workloads
[Diagram: a syslog event enters a Flume Agent's Solr sink, where the Morphline Library runs readLine, grok, and loadSolr commands in sequence, turning the event into records and finally a Solr document.]
33. Morphline Example – syslog with grok
morphlines : [
{
id : morphline1
importCommands : ["com.cloudera.**", "org.apache.solr.**"]
commands : [
{ readLine {} }
{
grok {
dictionaryFiles : [/tmp/grok-dictionaries]
expressions : {
message : """<%{POSINT:syslog_pri}>%{SYSLOGTIMESTAMP:syslog_timestamp}
%{SYSLOGHOST:syslog_hostname} %{DATA:syslog_program}(?:\[%{POSINT:syslog_pid}\])?:
%{GREEDYDATA:syslog_message}"""
}
}
}
{ loadSolr {} }
]
}
]
Example Input
<164>Feb 4 10:46:14 syslog sshd[607]: listening on 0.0.0.0 port 22
Output Record
syslog_pri:164
syslog_timestamp:Feb 4 10:46:14
syslog_hostname:syslog
syslog_program:sshd
syslog_pid:607
syslog_message:listening on 0.0.0.0 port 22
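A rough Python equivalent of the grok step makes the extraction concrete: each named group below mirrors one grok capture from the morphline config, and the input is the example line above. The regex is a simplification of the real grok dictionary patterns.

```python
import re

# Each named group corresponds to one grok capture in the
# morphline config; patterns are simplified for the sketch.
SYSLOG_RE = re.compile(
    r"<(?P<syslog_pri>\d+)>"
    r"(?P<syslog_timestamp>\w{3}\s+\d+ \d{2}:\d{2}:\d{2}) "
    r"(?P<syslog_hostname>\S+) "
    r"(?P<syslog_program>[\w./-]+)(?:\[(?P<syslog_pid>\d+)\])?: "
    r"(?P<syslog_message>.*)$"
)

line = "<164>Feb 4 10:46:14 syslog sshd[607]: listening on 0.0.0.0 port 22"
record = SYSLOG_RE.match(line).groupdict()
print(record["syslog_program"], record["syslog_pid"])  # sshd 607
```

The resulting record carries exactly the fields shown in the output record above, ready to be loaded into Solr by the loadSolr command.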
34. Current Command Library
• Integrate with and load into Apache Solr
• Flexible log file analysis
• Single-line record, multi-line records, CSV files
• Regex based pattern matching and extraction
• Integration with Avro
• Integration with Apache Hadoop Sequence Files
• Integration with SolrCell and all Apache Tika parsers
• Auto-detection of MIME types from binary data using
Apache Tika
35. Current Command Library (cont)
• Scripting support for dynamic java code
• Operations on fields for assignment and comparison
• Operations on fields with list and set semantics
• if-then-else conditionals
• A small rules engine (tryRules)
• String and timestamp conversions
• slf4j logging
• Yammer metrics and counters
• Decompression and unpacking of arbitrarily nested
container file formats
• Etc…
36. Agenda
• Big Data and Search – setting the stage
• Cloudera Search Architecture
• Component Deep Dive
• Indexing
• ETL - morphlines
• Querying
• Security
• Conclusion
38. Simple, Customizable Search Interface
Hue
• Simple UI
• Navigated, faceted drill down
• Customizable display
• Full text search, standard Solr API and query language
39. Agenda
• Big Data and Search – setting the stage
• Cloudera Search Architecture
• Component Deep Dive
• Security
• Conclusion
40. Security
• Upstream Solr doesn’t deal with security
• Cloudera Search supports kerberos authentication
• Similar to Oozie / WebHDFS
• Collection-Level Authorization via Apache Sentry
• Document-Level Authorization via Apache Sentry
(new in CDH5.1)
41. Agenda
• Big Data and Search – setting the stage
• Cloudera Search Architecture
• Component Deep Dive
• Indexing
• ETL - morphlines
• Querying
• Security
• Collection-Level Authorization
• Document-Level Authorization
• Conclusion
42. Collection-Level Authorization
• Sentry supports role-based granting of
privileges
• each role can be granted QUERY, UPDATE, and/or
administrative privileges on an index (collection)
• Privileges stored in a “policy file” on HDFS
43. Policy File
[groups]
# Assigns each Hadoop group to its set of roles
dev_ops = engineer_role, ops_role
[roles]
# Assigns each role to its set of privileges
engineer_role = collection=source_code->action=Query, collection=source_code->action=Update
ops_role = collection=hbase_logs->action=Query
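The lookup Sentry performs against this policy file can be sketched as two table lookups. The dicts below are transcribed from the policy file above; a real deployment asks Sentry itself rather than reading the file directly.

```python
# Hadoop groups map to roles; roles map to (collection, action)
# privileges -- transcribed from the policy file above.
GROUPS = {"dev_ops": ["engineer_role", "ops_role"]}
ROLES = {
    "engineer_role": {("source_code", "Query"), ("source_code", "Update")},
    "ops_role": {("hbase_logs", "Query")},
}

def is_authorized(user_groups, collection, action):
    # Authorized if any role reachable from the user's groups grants
    # the requested action on the collection.
    return any(
        (collection, action) in ROLES.get(role, set())
        for group in user_groups
        for role in GROUPS.get(group, [])
    )

print(is_authorized(["dev_ops"], "hbase_logs", "Query"))   # True
print(is_authorized(["dev_ops"], "hbase_logs", "Update"))  # False
```

Note that privileges attach to roles, not users: regrouping a user in Hadoop changes their access without touching the [roles] section.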
44. Integrating Sentry and Solr
• Solr Request Handlers:
• Specified per collection in solrconfig.xml
• A request to http://localhost:8983/solr/collection1/select is dispatched to an instance of solr.SearchHandler
45. Sentry Request Handlers
• Sentry ships with its own version of solrconfig.xml
with secure handlers, called solrconfig.xml.secure
• Use a SearchComponent to implement the checking
• Update Requests handled in a similar way
46. Agenda
• Big Data and Search – setting the stage
• Cloudera Search Architecture
• Component Deep Dive
• Indexing
• ETL - morphlines
• Querying
• Security
• Collection-Level Authorization
• Document-Level Authorization
• Conclusion
47. Document-level authorization Motivation
• Index-level authorization is useful when access control requirements for documents are homogeneous
• Security requirements may require restricting access to a subset of documents
48. Document-level authorization Motivation
• Consider “Confidential” and “Secret” documents. How do you store them with only index-level authorization?
• Any workaround pushes complexity to the application. Doc-level authorization is designed to solve this problem
49. Document-level authorization model
• Instead of storing in HDFS Policy File:
[groups]
# Assigns each Hadoop group to its set of roles
dev_ops = engineer_role, ops_role
[roles]
# Assigns each role to its set of privileges
engineer_role = collection=source_code->action=Query, collection=source_code->action=Update
ops_role = collection=hbase_logs->action=Query
• Store authorization tokens in each document
• Many more documents than collections; doesn’t scale to
store document-level info in Policy File
• Can use Solr’s built-in filtering capabilities to restrict access
50. Document-level authorization model
• A configurable token field stores the authorization tokens
• The authorization tokens are Sentry roles, i.e. “ops_role”
[roles]
ops_role = collection = hbase_logs->action=Query
• Represents the roles that are allowed to view the
document. To view a document, the querying user must
belong to at least one role whose token is stored in the
token field
• Can modify document permissions without restarting
Solr
• Can modify role memberships without reindexing
51. Document-level authorization impl
• Intercepts the request via a SearchComponent
• SearchComponent adds an “fq”, or FilterQuery
• Filters out all documents that don’t have “role1” or “role2” in authField
• Multiple “fq”s work as an intersection, so a malicious user can’t bypass the filter by injecting his own fq
• Filters are cached, so the construction cost is paid only once
• Note: does not supersede index-level authorization
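The appended filter can be sketched as a string-building helper; the query string is illustrative (the real SearchComponent constructs the filter internally), and the field name comes from the sentryAuthField config on the next slide.

```python
def sentry_filter_query(user_roles, auth_field="sentry_auth"):
    # Build the kind of "fq" the Sentry SearchComponent appends:
    # only documents whose token field contains at least one of the
    # querying user's roles can match the query.
    return "{}:({})".format(auth_field, " OR ".join(user_roles))

fq = sentry_filter_query(["role1", "role2"])
print(fq)  # sentry_auth:(role1 OR role2)
```

Since an added fq can only narrow the result set, a user appending their own fq parameters can never widen what this filter allows through.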
52. Document-level authorization config
• Configuration via solrconfig.xml.secure (per
collection):
<!-- Set to true to enable document-level authorization -->
<bool name="enabled">false</bool>
<!-- Field where the auth tokens are stored in the document -->
<str name="sentryAuthField">sentry_auth</str>
<!-- Auth token defined to allow any role to access the document.
Uncomment to enable. -->
<!--<str name="allRolesToken">*</str>-->
• For backwards compatibility, not enabled by default
• No tokens = no access. To allow all users to access a document, use the allRolesToken. Useful for getting started
53. Conclusion
• Cloudera Search
• Free Download
• Extensive documentation
• Send your questions and feedback to search-user@cloudera.org
• Take the Search online training
• Cloudera Manager Standard (i.e. the free version)
• Simple management of Search
• Free Download
• QuickStart VM also available!