Presented by Mark Miller, Software Developer, Cloudera
Apache Lucene/Solr committer Mark Miller talks about how Solr has been integrated into the Hadoop ecosystem to provide full-text search at "Big Data" scale. This talk gives an overview of how Cloudera has tackled integrating Solr into the Hadoop ecosystem and highlights some of the design decisions and future plans. Learn how Solr is getting 'cozy' with Hadoop, which contributions are going to which project, and how you can take advantage of these integrations to use Solr efficiently at "Big Data" scale. Learn how you can run Solr directly on HDFS, build indexes with MapReduce, load Solr via Flume in 'Near Realtime', and much more.
The First Class Integration of Solr with Hadoop
2. THE FIRST CLASS INTEGRATION OF SOLR WITH HADOOP
Mark Miller (Cloudera)
3. WHO AM I?
• Cloudera employee, Lucene/Solr committer, Lucene PMC member, Apache member
• First job out of college was in the newspaper archiving business.
• First full-time employee at LucidWorks, a startup around Lucene/Solr.
• Spent a couple of years as “Core” engineering manager, reporting to the VP of engineering.
4.
• Very fast and feature-rich ‘core’ search engine library.
• Compact and powerful, Lucene is an extremely popular full-text search library.
• Provides low-level APIs for analyzing, indexing, and searching text, along with a myriad of related features.
• Just the core: either you write the ‘glue’ or use a higher-level search engine built with Lucene.
5. • Solr (pronounced "solar") is an open source enterprise search platform from the Apache Lucene project. Its major features include full-text search, hit highlighting, faceted search, dynamic clustering, database integration, and rich document (e.g., Word, PDF) handling. Providing distributed search and index replication, Solr is highly scalable. Solr is the most popular enterprise search engine.
- Wikipedia
6. SEARCH ON HADOOP HISTORY
• Katta
• Blur
• SolBase
• HBASE-3529
• SOLR-1301
• SOLR-1045
• Ad-Hoc
8. THE PLAN: STRENGTHEN THE FAMILY BONDS
• No need to build something radically new: we have the pieces we need.
• Focus on integration points.
• Create high-quality, first-class integrations and contribute the work to the projects involved.
• Focus on integration and quality first, then performance and scale.
10. SOLR INTEGRATION
• Read and write directly to HDFS
• First-class custom Directory support in Solr
• Support Solr replication on HDFS
• Other improvements around usability and configuration
11. READ AND WRITE DIRECTLY TO HDFS
• Historically, Lucene did not support append-only filesystems.
• “Flexible Indexing” brought in support for append-only filesystems.
• Since 4.2, Lucene supports append-only filesystems by default.
12. LUCENE DIRECTORY ABSTRACTION
• It’s how Lucene interacts with index files.
• Solr uses the Lucene library and offers a pluggable DirectoryFactory.

A sketch of the abstraction (simplified; Lucene’s actual methods also declare IOExceptions):

public abstract class Directory {
  public abstract String[] listAll();
  public abstract IndexOutput createOutput(String name, IOContext context);
  public abstract IndexInput openInput(String name, IOContext context);
  public abstract void deleteFile(String name);
  public abstract Lock makeLock(String name);
  public abstract void clearLock(String name);
  // …
}
13. PUTTING THE INDEX IN HDFS
• Solr relies on the filesystem cache to operate at full speed.
• HDFS is not known for its random access speed.
• Apache Blur had already solved this with an HdfsDirectory that works on top of a BlockDirectory.
• The “block cache” caches the hot blocks of the index off-heap (in direct byte arrays) and takes the place of the filesystem cache (see the configuration sketch below).
• We contributed back optional ‘write’ caching.
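As a minimal sketch, the block cache is tuned through HdfsDirectoryFactory parameters in solrconfig.xml; the values here are illustrative placeholders, not recommendations:

<directoryFactory name="DirectoryFactory" class="solr.HdfsDirectoryFactory">
  <!-- enable the off-heap block cache in place of the OS filesystem cache -->
  <bool name="solr.hdfs.blockcache.enabled">true</bool>
  <!-- number of off-heap memory slabs to allocate -->
  <int name="solr.hdfs.blockcache.slab.count">1</int>
  <!-- also cache blocks as they are written (the contributed 'write' caching) -->
  <bool name="solr.hdfs.blockcache.write.enabled">true</bool>
</directoryFactory>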
14. PUTTING THE TRANSACTION LOG IN HDFS
• HdfsUpdateLog added; extends UpdateLog.
• Triggered by setting the UpdateLog dataDir to something that starts with hdfs:/; no additional configuration is necessary (a sketch follows below).
• Same extensive testing as is used on UpdateLog.
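A sketch of the corresponding solrconfig.xml fragment; host, port, and path are placeholders:

<updateHandler class="solr.DirectUpdateHandler2">
  <updateLog>
    <!-- an hdfs:/ dataDir is what triggers HdfsUpdateLog -->
    <str name="dir">hdfs://host:port/solr/tlog</str>
  </updateLog>
</updateHandler>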
15. RUNNING SOLR ON HDFS
• Set the DirectoryFactory to HdfsDirectoryFactory and set the dataDir to a location in HDFS (see the solrconfig.xml sketch below).
• Set the LockType to ‘hdfs’.
• Use an UpdateLog dataDir location that begins with ‘hdfs:/’.
• Or do it all from the command line:

java -Dsolr.directoryFactory=HdfsDirectoryFactory \
     -Dsolr.lockType=solr.HdfsLockFactory \
     -Dsolr.updatelog=hdfs://host:port/path \
     -jar start.jar
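The equivalent solrconfig.xml wiring, as a minimal sketch (the solr.hdfs.home path is a placeholder):

<directoryFactory name="DirectoryFactory" class="solr.HdfsDirectoryFactory">
  <!-- root HDFS location for this Solr instance's data -->
  <str name="solr.hdfs.home">hdfs://host:port/solr</str>
</directoryFactory>
<indexConfig>
  <lockType>hdfs</lockType>
</indexConfig>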
16. SOLR REPLICATION ON HDFS
• While Solr has exposed a pluggable DirectoryFactory for a long time now, it was really quite limited.
• Most glaring: only a local-filesystem-based Directory would work with replication.
• There were also other, more minor areas that relied on a local filesystem Directory implementation.
17. FUTURE SOLR REPLICATION ON HDFS
• Take advantage of the “distributed filesystem” and allow for something similar to HBase regions.
• If a node goes down, the data is still available in HDFS; allow that index to be automatically served by a node that is still up, if it has the capacity.
[Diagram: three Solr nodes sharing a common HDFS layer]
18.
• Leader reads and writes index files to HDFS.
• Replicas only read from HDFS; writes go to /dev/null.
[Diagram: a leader and two replicas sharing one index in HDFS]
19. MAP REDUCE INDEX BUILDING
• Scalable index creation via MapReduce.
• Many initial ‘homegrown’ implementations sent documents from the reducers to SolrCloud over HTTP.
• To really scale, you want the reducers to create the indexes in HDFS and then load them up with Solr.
• The ideal implementation will allow using as many reducers as are available in your Hadoop cluster, and then merge the indexes down to the correct number of ‘shards’.
20. MR INDEX BUILDING
[Diagram: several mappers parse input; arbitrary reducing steps of indexing and merging; end-reducers each write out a final index]
21. SOLRCLOUD AWARE
• Can ‘inspect’ ZooKeeper to learn about the Solr cluster:
• Which URLs to GoLive to.
• The schema to use when building indexes.
• How to match hash -> shard assignments of the Solr cluster.
22. GOLIVE
• After building your indexes with MapReduce, how do you deploy them to your Solr cluster?
• We want it to be easy, so we built the GoLive option.
• GoLive allows you to easily and atomically merge the indexes you have created into a live, running Solr cluster.
• Paired with the ZooKeeper-aware ability, this allows you to simply point your MapReduce job at your Solr cluster; it will automatically discover how many shards to build and which locations in HDFS to deliver the final indexes to. (A command-line sketch follows below.)
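A rough sketch of a complete build-and-deploy run with the MapReduceIndexerTool this work shipped as; the jar name, paths, hosts, and collection name are placeholders:

hadoop jar solr-map-reduce-job.jar org.apache.solr.hadoop.MapReduceIndexerTool \
  --morphline-file morphline.conf \
  --output-dir hdfs://host:port/outdir \
  --zk-host zkhost:2181/solr \
  --collection collection1 \
  --go-live \
  hdfs://host:port/indir

With --go-live, the job inspects ZooKeeper via --zk-host to discover the shard count and merge the finished indexes into the running cluster.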
23. FLUME SOLR SINK
• Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data. It has a simple and flexible architecture based on streaming data flows. It is robust and fault tolerant with tunable reliability mechanisms and many failover and recovery mechanisms. It uses a simple extensible data model that allows for online analytic application.
25. SOLRCLOUD AWARE
• Can ‘inspect’ ZooKeeper to learn about the Solr cluster:
• Which URLs to send data to.
• The schema for the collection being indexed to.
26. HBASE INTEGRATION
• Collaboration between NGData & Cloudera.
• NGData are the creators of the Lily data management platform.
• Lily HBase Indexer:
• A service which acts as an HBase replication listener.
• HBase replication features, such as filtering, are supported.
• Replication updates trigger indexing of updates (rows).
• Integrates the Morphlines library for ETL of rows.
• AL2 licensed on GitHub: https://github.com/ngdata (an indexer definition sketch follows below).
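For flavor, an indexer definition for the Lily HBase Indexer looks roughly like this; the table name and morphline path are placeholders:

<indexer table="record"
         mapper="com.ngdata.hbaseindexer.morphline.MorphlineResultToSolrMapper">
  <!-- the morphline that turns HBase rows into Solr documents -->
  <param name="morphlineFile" value="/etc/hbase-solr/conf/morphlines.conf"/>
</indexer>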
28. MORPHLINES
• A morphline is a configuration file that allows you to define ETL transformation pipelines.
• Extract content from input files, transform content, load content (e.g. into Solr).
• Uses Tika to extract content from a large variety of input documents.
• Part of the CDK (Cloudera Development Kit).
29.
[Diagram: syslog -> Flume agent with Solr sink -> morphline commands readLine, grok, loadSolr -> Solr]
• Open source framework for simple ETL.
• Ships as part of the Cloudera Development Kit (CDK).
• It’s a Java library.
• AL2 licensed on GitHub: https://github.com/cloudera/cdk
• Similar to Unix pipelines.
• Configuration over coding.
• Supports common Hadoop formats: Avro, sequence files, text, etc.
(A sketch of the Flume sink wiring follows below.)
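A sketch of the Flume side of this pipeline, assuming an agent named agent1 and a channel named memoryChannel (both placeholders):

agent1.sinks.solrSink.type = org.apache.flume.sink.solr.morphline.MorphlineSolrSink
agent1.sinks.solrSink.channel = memoryChannel
# the morphline config file and the id of the pipeline to run within it
agent1.sinks.solrSink.morphlineFile = /etc/flume-ng/conf/morphline.conf
agent1.sinks.solrSink.morphlineId = morphline1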
30.
• Integrate with and load into Apache Solr.
• Flexible log file analysis.
• Single-line records, multi-line records, CSV files.
• Regex-based pattern matching and extraction.
• Integration with Avro.
• Integration with Apache Hadoop sequence files.
• Integration with SolrCell and all Apache Tika parsers.
• Auto-detection of MIME types from binary data using Apache Tika.
31.
• Scripting support for dynamic Java code.
• Operations on fields for assignment and comparison.
• Operations on fields with list and set semantics.
• if-then-else conditionals.
• A small rules engine (tryRules).
• String and timestamp conversions.
• slf4j logging.
• Yammer metrics and counters.
• Decompression and unpacking of arbitrarily nested container file formats.
• Etc…
32. MORPHLINES EXAMPLE CONFIG
Example input:
<164>Feb 4 10:46:14 syslog sshd[607]: listening on 0.0.0.0 port 22.
Output record:
syslog_pri:164
syslog_timestamp:Feb 4 10:46:14
syslog_hostname:syslog
syslog_program:sshd
syslog_pid:607
syslog_message:listening on 0.0.0.0 port 22.

morphlines : [
  {
    id : morphline1
    importCommands : ["com.cloudera.**", "org.apache.solr.**"]
    commands : [
      { readLine {} }
      {
        grok {
          dictionaryFiles : [/tmp/grok-dictionaries]
          expressions : {
            message : """<%{POSINT:syslog_pri}>%{SYSLOGTIMESTAMP:syslog_timestamp} %{SYSLOGHOST:syslog_hostname} %{DATA:syslog_program}(?:\[%{POSINT:syslog_pid}\])?: %{GREEDYDATA:syslog_message}"""
          }
        }
      }
      { loadSolr {} }
    ]
  }
]