Solr + Hadoop = Big Data Search

Mark Miller


Who Am I?
Cloudera employee, Lucene/Solr committer, Lucene PMC member, Apache member.

First job out of college was in the newspaper archiving business.
First full-time employee at LucidWorks - a startup built around Lucene/Solr.
Spent a couple of years as “Core” engineering manager, reporting to the VP of engineering.

Lucene
• Very fast and feature-rich ‘core’ search engine library.
• Compact and powerful, Lucene is an extremely popular full-text search library.
• Provides low-level APIs for analyzing, indexing, and searching text, along with a myriad of related features.
• Just the core - either you write the ‘glue’ or use a higher-level search engine built with Lucene.

Solr (pronounced "solar") is an open source enterprise
search platform from the Apache Lucene project. Its
major features include full-text search, hit
highlighting, faceted search, dynamic clustering,
database integration, and rich document (e.g., Word,
PDF) handling. Providing distributed search and index
replication, Solr is highly scalable. Solr is the most
popular enterprise search engine.
- Wikipedia

Search on Hadoop History
• Katta
• Blur
• SolBase
• HBASE-3529
• SOLR-1301
• SOLR-1045
• Ad-Hoc

Strengthen the Family Bonds

• No need to build something radically new - we have the pieces we need.
• Focus on integration points.
• Create high quality, first class integrations and contribute the work to the projects involved.
• Focus on integration and quality first - then performance and scale.

Solr Integration
• Read and write directly to HDFS
• First class custom Directory support in Solr
• Support Solr replication on HDFS
• Other improvements around usability and configuration

Read and Write directly to HDFS
• Lucene did not historically support append-only filesystems.
• “Flexible Indexing” brought support for append-only filesystems.
• Lucene has supported append-only filesystems by default since 4.2.

Lucene Directory Abstraction
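It’s how Lucene interacts with index files.
Solr uses the Lucene library and offers a DirectoryFactory for plugging in Directory implementations.

abstract class Directory {
  // Abridged: modifiers and checked exceptions omitted.
  String[] listAll();                                       // names of the files in the index
  IndexOutput createOutput(String file, IOContext context); // create a new file for writing
  IndexInput openInput(String file, IOContext context);     // open an existing file for reading
  void deleteFile(String file);
  Lock makeLock(String file);
  void clearLock(String file);
  …
}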
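To make the value of the abstraction concrete, here is a minimal sketch in plain Java. `SimpleDirectory`, `InMemoryDirectory`, and `DirectoryDemo` are hypothetical names invented for this illustration and are not Lucene's real API; the point is only that code written against the interface does not care whether the files live on local disk, in HDFS, or in memory.

```java
import java.util.Arrays;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical, heavily simplified stand-in for a Directory-style abstraction.
interface SimpleDirectory {
    String[] listAll();
    void writeFile(String name, byte[] data);
    byte[] readFile(String name);
    void deleteFile(String name);
}

// One possible backend: every "file" is a byte array in a sorted in-memory map.
class InMemoryDirectory implements SimpleDirectory {
    private final Map<String, byte[]> files = new TreeMap<>();

    public String[] listAll() {
        return files.keySet().toArray(new String[0]);
    }

    public void writeFile(String name, byte[] data) {
        files.put(name, data.clone());
    }

    public byte[] readFile(String name) {
        byte[] data = files.get(name);
        if (data == null) {
            throw new IllegalArgumentException("no such file: " + name);
        }
        return data.clone();
    }

    public void deleteFile(String name) {
        files.remove(name);
    }
}

class DirectoryDemo {
    // Caller code depends only on the interface, so any backend can be swapped in.
    static String describe(SimpleDirectory dir) {
        return Arrays.toString(dir.listAll());
    }
}
```

An HDFS-backed implementation would plug into `describe` the same way, which is the role HdfsDirectory plays for Lucene's real Directory class.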

Putting the Index in HDFS
• Solr relies on the filesystem cache to operate at full speed.
• HDFS is not known for its random access speed.
• Apache Blur has already solved this with an HdfsDirectory that works on top of a BlockDirectory.
• The “block cache” caches the hot blocks of the index off heap (direct byte array) and takes the place of the filesystem cache.
• We contributed back optional ‘write’ caching.
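The block cache idea can be sketched roughly as follows. This is not the Blur or Solr implementation; `BlockCacheSketch` and `BlockLoader` are hypothetical names, and the least-recently-used eviction policy is an assumption made for the example. It only shows fixed-size blocks held in off-heap (direct) buffers so that repeated reads of hot blocks avoid going back to slow storage.

```java
import java.nio.ByteBuffer;
import java.util.LinkedHashMap;
import java.util.Map;

class BlockCacheSketch {
    // Hypothetical loader that fetches one block from slow storage such as HDFS.
    interface BlockLoader {
        byte[] load(long blockId);
    }

    private final int blockSize;
    private final BlockLoader loader;
    private final Map<Long, ByteBuffer> cache;
    int misses = 0; // number of times the slow storage had to be consulted

    BlockCacheSketch(int blockSize, final int maxBlocks, BlockLoader loader) {
        this.blockSize = blockSize;
        this.loader = loader;
        // accessOrder=true keeps entries ordered from least to most recently used.
        this.cache = new LinkedHashMap<Long, ByteBuffer>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<Long, ByteBuffer> eldest) {
                return size() > maxBlocks; // evict the least recently used block
            }
        };
    }

    // Read one byte at an absolute file position, going through the cache.
    byte readByte(long position) {
        long blockId = position / blockSize;
        ByteBuffer block = cache.get(blockId);
        if (block == null) {
            misses++;
            byte[] data = loader.load(blockId);
            block = ByteBuffer.allocateDirect(blockSize); // off-heap memory
            block.put(data, 0, Math.min(data.length, blockSize));
            cache.put(blockId, block);
        }
        return block.get((int) (position % blockSize)); // absolute read, no position change
    }
}
```

With a block size of 4 and room for 2 blocks, reading positions 0, 5, and 1 costs two loads; position 1 is served from the cached block 0.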

Putting the TransactionLog in HDFS
• HdfsUpdateLog added - extends UpdateLog
• Triggered by setting the UpdateLog dataDir to something that starts with hdfs:/ - no additional configuration necessary.
• Same extensive testing as used on UpdateLog
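The prefix-based trigger can be pictured with a small sketch. `UpdateLogChooser` and `chooseUpdateLog` are hypothetical names for this illustration, not Solr code; the method just returns the name of the implementation that the dataDir prefix implies.

```java
class UpdateLogChooser {
    // Returns the name of the update log implementation implied by dataDir.
    // Assumption for illustration: any dataDir starting with "hdfs:/" selects the HDFS variant.
    static String chooseUpdateLog(String dataDir) {
        if (dataDir != null && dataDir.startsWith("hdfs:/")) {
            return "HdfsUpdateLog";
        }
        return "UpdateLog";
    }
}
```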

Running Solr on HDFS
• Set DirectoryFactory to HdfsDirectoryFactory and set the dataDir to a location in HDFS.
• Set LockType to ‘hdfs’.
• Use an UpdateLog dataDir location that begins with ‘hdfs:/’.
• Or pass the settings on the command line:
  java -Dsolr.directoryFactory=HdfsDirectoryFactory -Dsolr.lockType=solr.HdfsLockFactory -Dsolr.updatelog=hdfs://host:port/path -jar start.jar

Solr Replication on HDFS
• While Solr has exposed a pluggable DirectoryFactory for a long time now, it was really quite limited.
• Most glaring, only a local filesystem based Directory would work with replication.
• There were also other, more minor areas that relied on a local filesystem Directory implementation.

Future Solr Replication on HDFS
• Take advantage of the “distributed filesystem” and allow for something similar to HBase regions.
• If a node goes down, the data is still available in HDFS - allow for that index to be automatically served by a node that is still up if it has the capacity.

[Diagram: three Solr nodes sharing HDFS]
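A rough sketch of that failover idea, under stated assumptions: `FailoverSketch`, `Node`, and `reassign` are hypothetical names, capacity is modeled as a simple count of indexes per node, and placement is first-fit. Because the index data stays in HDFS, only ownership of the index moves.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class FailoverSketch {
    // Hypothetical model: a node serves some indexes and can serve at most `capacity` of them.
    static class Node {
        final String name;
        final int capacity;
        final List<String> indexes = new ArrayList<>();
        boolean up = true;

        Node(String name, int capacity) {
            this.name = name;
            this.capacity = capacity;
        }

        boolean hasRoom() {
            return up && indexes.size() < capacity;
        }
    }

    // Hand every index of the failed node to a live node with spare capacity.
    // Returns index -> new node name; indexes that cannot be placed stay with the failed node.
    static Map<String, String> reassign(Node failed, List<Node> cluster) {
        failed.up = false;
        Map<String, String> moved = new LinkedHashMap<>();
        for (String index : new ArrayList<>(failed.indexes)) {
            for (Node candidate : cluster) {
                if (candidate != failed && candidate.hasRoom()) {
                    candidate.indexes.add(index); // data stays in HDFS; only ownership moves
                    failed.indexes.remove(index);
                    moved.put(index, candidate.name);
                    break;
                }
            }
        }
        return moved;
    }
}
```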

MR Index Building
• Scalable index creation via map-reduce
• Many initial ‘homegrown’ implementations sent documents from the reducer to SolrCloud over HTTP.
• To really scale, you want the reducers to create the indexes in HDFS and then load them up with Solr.
• The ideal implementation will allow using as many reducers as are available in your Hadoop cluster, and then merge the indexes down to the correct number of ‘shards’.

MR Index Building
[Diagram: several Mappers, each parsing input into indexable documents, feed arbitrary reducing steps of indexing and merging. One End-Reducer per shard (shard 1, shard 2) indexes the documents and produces that shard’s index.]

SolrCloud Aware
• Can ‘inspect’ ZooKeeper to learn about the Solr cluster.
• What URLs to GoLive to.
• The schema to use when building indexes.
• Match hash -> shard assignments of a Solr cluster.
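Matching hash -> shard assignments means the index builder must place each document in the same shard the live cluster would route it to. A minimal sketch, assuming the 32-bit hash space is split into equal contiguous ranges, one per shard: `HashShardSketch` and its methods are hypothetical, and `String.hashCode()` is only a stand-in for whatever hash function the cluster actually uses.

```java
class HashShardSketch {
    // Split the 32-bit hash space into numShards contiguous ranges and
    // return the index of the range that contains `hash`.
    static int shardForHash(int hash, int numShards) {
        long offset = (long) hash - Integer.MIN_VALUE; // shift to 0 .. 2^32 - 1
        long rangeSize = (1L << 32) / numShards;
        // The last range absorbs any remainder left by the integer division.
        return (int) Math.min(offset / rangeSize, numShards - 1);
    }

    // Stand-in hash for illustration only; a real cluster defines its own hash function.
    static int shardForId(String docId, int numShards) {
        return shardForHash(docId.hashCode(), numShards);
    }
}
```

As long as the batch job and the cluster agree on the hash function and the ranges, an index built offline for shard N contains exactly the documents the cluster expects to find there.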

GoLive
• After building your indexes with map-reduce, how do you deploy them to your Solr cluster?
• We want it to be easy - so we built the GoLive option.
• GoLive allows you to easily merge the indexes you have created atomically into a live running Solr cluster.
• Paired with the ZooKeeper Aware ability, this allows you to simply point your map-reduce job to your Solr cluster and it will automatically discover how many shards to build and what locations to deliver the final indexes to in HDFS.

Flume Solr Sync
Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data. It has a simple and flexible architecture based on streaming data flows. It is robust and fault tolerant with tunable reliability mechanisms and many failover and recovery mechanisms. It uses a simple extensible data model that allows for online analytic application.
- Apache Flume Website

Flume Solr Sync
[Diagram: logs and other sources flow through Flume Agents into Solr and HDFS]

SolrCloud Aware
• Can ‘inspect’ ZooKeeper to learn about the Solr cluster.
• What URLs to send data to.
• The schema for the collection being indexed to.

HBase Integration
• Collaboration between NGData & Cloudera
• NGData are the creators of the Lily data management platform
• Lily HBase Indexer
• Service which acts as an HBase replication listener
• HBase replication features, such as filtering, supported
• Replication updates trigger indexing of updates (rows)
• Integrates the Morphlines library for ETL of rows
• AL2 licensed on GitHub: https://github.com/ngdata

HBase Integration
[Diagram: HBase (on HDFS, serving an interactive load) triggers the Indexer(s) on updates; the Indexer(s) feed five Solr servers]

Morphlines
• A morphline is a configuration file that allows you to define ETL transformation pipelines.
• Extract content from input files, transform content, load content (e.g. to Solr).
• Uses Tika to extract content from a large variety of input documents.
• Part of the CDK (Cloudera Development Kit).

Morphlines
[Diagram: syslog events enter a Flume Agent; its Solr Sink runs the morphline commands readLine, grok, and loadSolr, and the result is loaded into Solr]

• Open source framework for simple ETL
• Ships as part of the Cloudera Development Kit (CDK)
• It’s a Java library
• AL2 licensed on GitHub: https://github.com/cloudera/cdk
• Similar to Unix pipelines
• Configuration over coding
• Supports common Hadoop formats:
  Avro
  Sequence file
  Text
  Etc…
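The "similar to Unix pipelines" idea can be sketched in a few lines of Java. This is not the Morphlines API: `MorphlineStylePipeline`, `Command`, and the three example commands are hypothetical, and a plain list stands in for Solr. Each command receives a record, transforms it, and hands it to the next command in the chain.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

class MorphlineStylePipeline {
    // Hypothetical command: takes a record (a bag of named fields) and returns the transformed record.
    interface Command {
        Map<String, String> process(Map<String, String> record);
    }

    // Run the record through every command in order, like stages of a Unix pipe.
    static Map<String, String> run(List<Command> commands, Map<String, String> record) {
        for (Command command : commands) {
            record = command.process(record);
        }
        return record;
    }

    // Three illustrative stages: clean up the line, extract a field, load into a sink.
    static List<Command> exampleCommands(List<Map<String, String>> sink) {
        List<Command> commands = new ArrayList<>();
        commands.add(r -> { r.put("message", r.get("message").trim()); return r; });
        commands.add(r -> {
            r.put("level", r.get("message").startsWith("ERROR") ? "error" : "info");
            return r;
        });
        commands.add(r -> { sink.add(r); return r; }); // the list stands in for Solr
        return commands;
    }
}
```

In a real morphline the chain is declared in the configuration file rather than in code, which is the "configuration over coding" point.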

Morphlines
• Integrate with and load into Apache Solr
• Flexible log file analysis
• Single-line records, multi-line records, CSV files
• Regex based pattern matching and extraction
• Integration with Avro
• Integration with Apache Hadoop Sequence Files
• Integration with SolrCell and all Apache Tika parsers
• Auto-detection of MIME types from binary data using Apache Tika

Morphlines
• Scripting support for dynamic Java code
• Operations on fields for assignment and comparison
• Operations on fields with list and set semantics
• if-then-else conditionals
• A small rules engine (tryRules)
• String and timestamp conversions
• slf4j logging
• Yammer metrics and counters
• Decompression and unpacking of arbitrarily nested container file formats
• Etc…

Morphlines Example Conﬁg
morphlines : [
  {
    id : morphline1
    importCommands : ["com.cloudera.**", "org.apache.solr.**"]
    commands : [
      { readLine {} }
      {
        grok {
          dictionaryFiles : [/tmp/grok-dictionaries]
          expressions : {
            message : """<%{POSINT:syslog_pri}>%{SYSLOGTIMESTAMP:syslog_timestamp} %{SYSLOGHOST:syslog_hostname} %{DATA:syslog_program}(?:\[%{POSINT:syslog_pid}\])?: %{GREEDYDATA:syslog_message}"""
          }
        }
      }
      { loadSolr {} }
    ]
  }
]

Example Input
<164>Feb 4 10:46:14 syslog sshd[607]: listening on 0.0.0.0 port 22.

Output Record
syslog_pri:164
syslog_timestamp:Feb 4 10:46:14
syslog_hostname:syslog
syslog_program:sshd
syslog_pid:607
syslog_message:listening on 0.0.0.0 port 22.
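To see what the grok step does to the sample syslog line, here is a sketch that uses a plain `java.util.regex` pattern in place of the grok dictionary. The pattern is a hand-written approximation of the named grok patterns (POSINT, SYSLOGTIMESTAMP, and so on), not their actual definitions, and `SyslogGrokSketch` is a hypothetical class. Java group names cannot contain underscores, so the `syslog_` prefix is added when the record is built.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SyslogGrokSketch {
    // Hand-written approximation of the grok expression in the morphline config.
    private static final Pattern SYSLOG = Pattern.compile(
        "^<(?<pri>\\d+)>"
        + "(?<timestamp>[A-Z][a-z]{2}\\s+\\d{1,2} \\d{2}:\\d{2}:\\d{2}) "
        + "(?<hostname>\\S+) "
        + "(?<program>[^\\[:\\s]+)"
        + "(?:\\[(?<pid>\\d+)\\])?: "   // the pid in brackets is optional
        + "(?<message>.*)$");

    private static final String[] FIELDS =
        {"pri", "timestamp", "hostname", "program", "pid", "message"};

    // Turn one syslog line into a record of syslog_* fields, in pattern order.
    static Map<String, String> parse(String line) {
        Matcher m = SYSLOG.matcher(line);
        if (!m.matches()) {
            throw new IllegalArgumentException("not a syslog line: " + line);
        }
        Map<String, String> record = new LinkedHashMap<>();
        for (String field : FIELDS) {
            String value = m.group(field);
            if (value != null) {          // skip the optional pid when absent
                record.put("syslog_" + field, value);
            }
        }
        return record;
    }
}
```

Calling `SyslogGrokSketch.parse` on the sample line yields the same six fields as the output record on the slide.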

Hue Integration
Hue
• Simple UI
• Navigated, faceted drill down
• Customizable display
• Full text search, standard Solr API and query language

Cloudera Search

https://ccp.cloudera.com/display/SUPPORT/Downloads

Or Google “cloudera search download”

Mark Miller, Cloudera

@heismark
