Solr + Hadoop = Big Data Search
Mark Miller

Who Am I?
Cloudera employee, Lucene/Solr committer, Lucene PMC member, Apache member.

First job out of college was in the newspaper archiving business.
First full-time employee at LucidWorks, a startup built around Lucene/Solr.
Spent a couple of years as “Core” engineering manager, reporting to the VP of engineering.

Lucene
• Very fast and feature-rich ‘core’ search engine library.
• Compact and powerful, Lucene is an extremely popular full-text search library.
• Provides low-level APIs for analyzing, indexing, and searching text, along with a myriad of related features.
• Just the core: either you write the ‘glue’ or you use a higher-level search engine built with Lucene.

Solr (pronounced "solar") is an open source enterprise search platform from the Apache Lucene project. Its major features include full-text search, hit highlighting, faceted search, dynamic clustering, database integration, and rich document (e.g., Word, PDF) handling. Providing distributed search and index replication, Solr is highly scalable. Solr is the most popular enterprise search engine.
- Wikipedia

Search on Hadoop History
• Katta
• Blur
• SolBase
• HBASE-3529
• SOLR-1301
• SOLR-1045
• Ad-Hoc

Family Tree

...

Strengthen the Family Bonds
• No need to build something radically new: we have the pieces we need.
• Focus on integration points.
• Create high-quality, first-class integrations and contribute the work to the projects involved.
• Focus on integration and quality first, then performance and scale.

SolrCloud

Solr Integration
• Read and write directly to HDFS
• First-class custom Directory support in Solr
• Support Solr replication on HDFS
• Other improvements around usability and configuration

Read and Write directly to HDFS
• Lucene did not historically support append-only filesystems.
• “Flexible Indexing” brought support for append-only filesystems.
• Lucene has supported append-only filesystems by default since 4.2.

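Lucene Directory Abstraction
It’s how Lucene interacts with index files.
Solr uses the Lucene library and offers a DirectoryFactory.

// Key methods of Lucene's Directory class (Lucene 4.x signatures)
public abstract class Directory {
  public abstract String[] listAll() throws IOException;
  public abstract IndexOutput createOutput(String name, IOContext context) throws IOException;
  public abstract IndexInput openInput(String name, IOContext context) throws IOException;
  public abstract void deleteFile(String name) throws IOException;
  public abstract Lock makeLock(String name);
  public abstract void clearLock(String name) throws IOException;
  // …
}
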
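To make the idea of a pluggable Directory concrete, here is a minimal sketch in Python rather than Lucene's Java. The class and method names (`Directory`, `InMemoryDirectory`, `list_all`, and so on) are hypothetical analogues of the methods listed on the slide, not Lucene's API; the point is only that the index code talks to an interface, so the storage behind it (local disk, memory, HDFS) can be swapped.

```python
import io
from abc import ABC, abstractmethod


class Directory(ABC):
    """Hypothetical Python analogue of the Directory methods on the slide."""

    @abstractmethod
    def list_all(self): ...

    @abstractmethod
    def create_output(self, name): ...

    @abstractmethod
    def open_input(self, name): ...

    @abstractmethod
    def delete_file(self, name): ...


class _Output:
    """Write buffer that publishes its bytes to the directory on close()."""

    def __init__(self, files, name):
        self._files = files
        self._name = name
        self._chunks = []

    def write(self, data):
        self._chunks.append(bytes(data))

    def close(self):
        self._files[self._name] = b"".join(self._chunks)


class InMemoryDirectory(Directory):
    """Keeps 'index files' in a dict; an HDFS-backed class would expose the same methods."""

    def __init__(self):
        self._files = {}

    def list_all(self):
        return sorted(self._files)

    def create_output(self, name):
        return _Output(self._files, name)

    def open_input(self, name):
        return io.BytesIO(self._files[name])

    def delete_file(self, name):
        del self._files[name]


directory = InMemoryDirectory()
out = directory.create_output("segments_1")
out.write(b"postings")
out.close()
data = directory.open_input("segments_1").read()
```

Code written against `Directory` never needs to know which implementation it was handed, which is the property that lets Solr place an index in HDFS.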
Putting the Index in HDFS
• Solr relies on the filesystem cache to operate at full speed.
• HDFS is not known for its random access speed.
• Apache Blur has already solved this with an HdfsDirectory that works on top of a BlockDirectory.
• The “block cache” caches the hot blocks of the index off heap (direct byte array) and takes the place of the filesystem cache.
• We contributed back optional ‘write’ caching.

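The block cache idea can be sketched as a small LRU cache of fixed-size blocks. This is an illustration under stated assumptions, not the Blur or Solr implementation: the names `BlockCache` and `slow_read` are hypothetical, the cache here lives on the Python heap rather than off heap, and the eviction policy (least recently used) is an assumption chosen for simplicity.

```python
from collections import OrderedDict


class BlockCache:
    """LRU cache of fixed-size blocks, keyed by (file name, block index)."""

    def __init__(self, read_block, block_size=8192, max_blocks=1024):
        self._read_block = read_block   # slow fetch: (name, index, size) -> bytes
        self._block_size = block_size
        self._max_blocks = max_blocks
        self._blocks = OrderedDict()
        self.misses = 0

    def _block(self, name, index):
        key = (name, index)
        if key in self._blocks:
            self._blocks.move_to_end(key)       # mark as most recently used
            return self._blocks[key]
        self.misses += 1
        data = self._read_block(name, index, self._block_size)
        self._blocks[key] = data
        if len(self._blocks) > self._max_blocks:
            self._blocks.popitem(last=False)    # evict least recently used
        return data

    def read(self, name, offset, length):
        """Random-access read assembled from cached blocks."""
        out = bytearray()
        end = offset + length
        while offset < end:
            index, start = divmod(offset, self._block_size)
            chunk = self._block(name, index)[start:start + (end - offset)]
            if not chunk:                       # past the end of the file
                break
            out += chunk
            offset += len(chunk)
        return bytes(out)


# Stand-in for a slow remote file: one 100-byte "index file".
FILES = {"_0.tim": bytes(range(100))}


def slow_read(name, index, size):
    return FILES[name][index * size:(index + 1) * size]


cache = BlockCache(slow_read, block_size=16, max_blocks=4)
first = cache.read("_0.tim", 10, 20)    # touches blocks 0 and 1
second = cache.read("_0.tim", 12, 4)    # served from block 0, already cached
```

Only the first read goes to the slow store; repeated reads of hot blocks are served from memory, which is the role the operating system's filesystem cache plays for a local index.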
Putting the TransactionLog in HDFS
• HdfsUpdateLog added; it extends UpdateLog.
• Triggered by setting the UpdateLog dataDir to something that starts with hdfs:/ (no additional configuration necessary).
• Same extensive testing as used on UpdateLog.

Running Solr on HDFS
• Set DirectoryFactory to HdfsDirectoryFactory and set the dataDir to a location in HDFS.
• Set LockType to ‘hdfs’.
• Use an UpdateLog dataDir location that begins with ‘hdfs:/’.
• Or, from the command line:
java -Dsolr.directoryFactory=HdfsDirectoryFactory \
     -Dsolr.lock.type=hdfs \
     -Dsolr.data.dir=hdfs://host:port/path \
     -Dsolr.updatelog=hdfs://host:port/path -jar start.jar

Solr Replication on HDFS
• While Solr has exposed a pluggable DirectoryFactory for a long time now, it was really quite limited.
• Most glaringly, only a local-filesystem-based Directory would work with replication.
• There were also other, more minor areas that relied on a local filesystem Directory implementation.

Future Solr Replication on HDFS
• Take advantage of the “distributed filesystem” and allow for something similar to HBase regions.
• If a node goes down, the data is still available in HDFS; allow that index to be automatically served by a node that is still up, if it has the capacity.
[Diagram: three Solr nodes on top of HDFS]

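As a rough illustration of the failover idea above, the following sketch reassigns the shards of a dead node to live nodes that still have spare capacity. Everything here is hypothetical: the function `reassign`, the notion of a per-node shard `capacity`, and the least-loaded-first choice are assumptions made for the example, not a description of how Solr behaves.

```python
def reassign(assignments, live_nodes, capacity):
    """Return a new shard -> node map in which no shard is served by a dead node."""
    load = {node: 0 for node in live_nodes}
    for node in assignments.values():
        if node in load:
            load[node] += 1

    result = {}
    for shard, node in sorted(assignments.items()):
        if node in load:
            result[shard] = node            # owner is still up
            continue
        candidates = [n for n in sorted(load) if load[n] < capacity]
        if candidates:
            target = min(candidates, key=lambda n: load[n])
            load[target] += 1
            result[shard] = target          # index files are already in HDFS
        else:
            result[shard] = None            # no spare capacity anywhere
    return result


before = {"shard1": "node-a", "shard2": "node-b", "shard3": "node-c"}
after = reassign(before, live_nodes={"node-a", "node-c"}, capacity=2)
```

No index data has to be copied in this model, because every node can open the same files in HDFS; only the assignment changes.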
MR Index Building
• Scalable index creation via map-reduce.
• Many initial ‘homegrown’ implementations sent documents from the reducers to SolrCloud over HTTP.
• To really scale, you want the reducers to create the indexes in HDFS and then load them up with Solr.
• The ideal implementation will allow using as many reducers as are available in your Hadoop cluster, and then merge the indexes down to the correct number of ‘shards’.

MR Index Building
[Diagram: three Mappers (parse input into indexable document) → arbitrary reducing steps of indexing and merging → End-Reducer (shard 1) and End-Reducer (shard 2), each indexing documents → Index shard 1 and Index shard 2]

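The "many reducers, then merge down to the shard count" approach can be sketched in plain Python. This is a toy model under stated assumptions: lists stand in for indexes, `doc_hash` uses MD5 as a stand-in for whatever hash the cluster's router uses, and the `fanout` parameter is a name invented for this example.

```python
import hashlib
from collections import defaultdict


def doc_hash(doc_id):
    """Stable hash of a document id (MD5 is a stand-in, not Solr's router hash)."""
    digest = hashlib.md5(doc_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big")


def build_shards(docs, num_shards, fanout):
    """Index with num_shards * fanout reducers, then merge down to num_shards."""
    num_reducers = num_shards * fanout

    # Map and shuffle: route each parsed document to one reducer.
    reducer_indexes = defaultdict(list)
    for doc_id, text in docs:
        reducer_indexes[doc_hash(doc_id) % num_reducers].append((doc_id, text))

    # Merge: reducer r feeds shard r % num_shards. Because num_shards divides
    # num_reducers, each document lands in shard doc_hash(id) % num_shards.
    shards = defaultdict(list)
    for reducer, index in sorted(reducer_indexes.items()):
        shards[reducer % num_shards].extend(index)
    return dict(shards)


docs = [("doc-%d" % i, "body of document %d" % i) for i in range(50)]
shards = build_shards(docs, num_shards=2, fanout=4)
```

Choosing the reducer count as a multiple of the shard count keeps the final placement identical to hashing straight into shards, so the extra reducers add parallelism without changing which shard owns a document.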
SolrCloud Aware
• Can ‘inspect’ ZooKeeper to learn about the Solr cluster.
• Which URLs to GoLive to.
• The schema to use when building indexes.
• Match the hash -> shard assignments of a Solr cluster.

GoLive
• After building your indexes with map-reduce, how do you deploy them to your Solr cluster?
• We want it to be easy, so we built the GoLive option.
• GoLive allows you to easily merge the indexes you have created atomically into a live running Solr cluster.
• Paired with the ZooKeeper Aware ability, this allows you to simply point your map-reduce job to your Solr cluster and it will automatically discover how many shards to build and what locations to deliver the final indexes to in HDFS.

Flume Solr Sync
Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data. It has a simple and flexible architecture based on streaming data flows. It is robust and fault tolerant with tunable reliability mechanisms and many failover and recovery mechanisms. It uses a simple extensible data model that allows for online analytic application.
- Apache Flume Website

Flume Solr Sync
[Diagram: Logs, Other → Flume Agent, Flume Agent → Solr, HDFS]

SolrCloud Aware
• Can ‘inspect’ ZooKeeper to learn about the Solr cluster.
• Which URLs to send data to.
• The schema for the collection being indexed to.

HBase Integration
• Collaboration between NGData & Cloudera
• NGData are the creators of the Lily data management platform
• Lily HBase Indexer
• Service which acts as an HBase replication listener
• HBase replication features, such as filtering, supported
• Replication updates trigger indexing of updates (rows)
• Integrates Morphlines library for ETL of rows
• AL2 licensed on GitHub: https://github.com/ngdata

HBase Integration
[Diagram: interactive load → HBase (on HDFS); HBase triggers on updates → Indexer(s) → five Solr servers]

Morphlines
• A morphline is a configuration file that allows you to define ETL transformation pipelines.
• Extract content from input files, transform content, load content (e.g., to Solr).
• Uses Tika to extract content from a large variety of input documents.
• Part of the CDK (Cloudera Development Kit).

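The extract, transform, load chain that a morphline describes can be pictured as a list of commands applied to a record in order. The sketch below is a conceptual model in Python, not the Morphlines Java API: `read_line`, `add_length`, `make_loader`, and `run_pipeline` are hypothetical names, and a plain list stands in for Solr.

```python
def read_line(record):
    """Extract: turn raw bytes into a 'message' field."""
    return {"message": record["raw"].decode("utf-8").rstrip("\n")}


def add_length(record):
    """Transform: derive a new field from an existing one."""
    return dict(record, length=len(record["message"]))


def make_loader(sink):
    """Load: append the record to a list standing in for Solr."""
    def load(record):
        sink.append(record)
        return record
    return load


def run_pipeline(commands, record):
    """Pass the record through each command in order; None drops the record."""
    for command in commands:
        record = command(record)
        if record is None:
            return None
    return record


indexed = []
pipeline = [read_line, add_length, make_loader(indexed)]
run_pipeline(pipeline, {"raw": b"listening on port 22\n"})
```

In a real morphline the list of commands lives in a configuration file rather than in code, which is what "configuration over coding" refers to.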
Morphlines
[Diagram: syslog → Flume Agent with Solr Sink (Command: readLine → Command: grok → Command: loadSolr) → Solr]
• Open source framework for simple ETL
• Ships as part of the Cloudera Development Kit (CDK)
• It’s a Java library
• AL2 licensed on GitHub: https://github.com/cloudera/cdk
• Similar to Unix pipelines
• Configuration over coding
• Supports common Hadoop formats: Avro, sequence files, text, etc.

Morphlines
• Integrate with and load into Apache Solr
• Flexible log file analysis
• Single-line records, multi-line records, CSV files
• Regex-based pattern matching and extraction
• Integration with Avro
• Integration with Apache Hadoop Sequence Files
• Integration with SolrCell and all Apache Tika parsers
• Auto-detection of MIME types from binary data using Apache Tika

Morphlines
• Scripting support for dynamic Java code
• Operations on fields for assignment and comparison
• Operations on fields with list and set semantics
• if-then-else conditionals
• A small rules engine (tryRules)
• String and timestamp conversions
• slf4j logging
• Yammer metrics and counters
• Decompression and unpacking of arbitrarily nested container file formats
• Etc.

Morphlines Example Config
morphlines : [
  {
    id : morphline1
    importCommands : ["com.cloudera.**", "org.apache.solr.**"]
    commands : [
      { readLine {} }
      {
        grok {
          dictionaryFiles : [/tmp/grok-dictionaries]
          expressions : {
            message : """<%{POSINT:syslog_pri}>%{SYSLOGTIMESTAMP:syslog_timestamp} %{SYSLOGHOST:syslog_hostname} %{DATA:syslog_program}(?:\[%{POSINT:syslog_pid}\])?: %{GREEDYDATA:syslog_message}"""
          }
        }
      }
      { loadSolr {} }
    ]
  }
]

Example Input
<164>Feb  4 10:46:14 syslog sshd[607]: listening on 0.0.0.0 port 22.

Output Record
syslog_pri:164
syslog_timestamp:Feb  4 10:46:14
syslog_hostname:syslog
syslog_program:sshd
syslog_pid:607
syslog_message:listening on 0.0.0.0 port 22.

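To see what the grok expression in the example config does to the sample syslog line, here is a rough Python equivalent using named groups. The sub-patterns are simplified stand-ins written for this example; they are not the definitions from the grok dictionaries, and the helper name `parse_syslog` is hypothetical.

```python
import re

# Simplified stand-ins for POSINT, SYSLOGTIMESTAMP, SYSLOGHOST, DATA and GREEDYDATA.
SYSLOG_LINE = re.compile(
    r"<(?P<syslog_pri>[1-9][0-9]*)>"
    r"(?P<syslog_timestamp>[A-Z][a-z]{2} +\d{1,2} \d{2}:\d{2}:\d{2}) "
    r"(?P<syslog_hostname>[\w.-]+) "
    r"(?P<syslog_program>.*?)"
    r"(?:\[(?P<syslog_pid>[1-9][0-9]*)\])?: "
    r"(?P<syslog_message>.*)"
)


def parse_syslog(line):
    """Return the extracted fields as a dict, or None if the line does not match."""
    match = SYSLOG_LINE.match(line)
    return match.groupdict() if match else None


record = parse_syslog(
    "<164>Feb  4 10:46:14 syslog sshd[607]: listening on 0.0.0.0 port 22."
)
```

Each `%{PATTERN:name}` in grok corresponds to one named group here, and the optional `[pid]` part is why the brackets must be escaped in the expression.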
Hue Integration
Hue
• Simple UI
• Navigated, faceted drill down
• Customizable display
• Full text search, standard Solr API and query language

Cloudera Search
https://ccp.cloudera.com/display/SUPPORT/Downloads
Or Google “cloudera search download”

Mark Miller, Cloudera

@heismark

 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking MenDelhi Call girls
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Scriptwesley chun
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...apidays
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024The Digital Insurer
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CVKhem
 

Último (20)

Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreter
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptx
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CV
 

Solr + Hadoop = Big Data Search

  • 1. Solr + Hadoop = Big Data Search. Mark Miller
  • 2. Who Am I? Cloudera employee, Lucene/Solr committer, Lucene PMC member, Apache member. First job out of college was in the newspaper archiving business. First full-time employee at LucidWorks, a startup around Lucene/Solr. Spent a couple of years as “Core” engineering manager, reporting to the VP of engineering.
  • 3. Very fast and feature-rich ‘core’ search engine library. Compact and powerful, Lucene is an extremely popular full-text search library. Provides low-level APIs for analyzing, indexing, and searching text, along with a myriad of related features. Just the core: either you write the ‘glue’ or use a higher-level search engine built with Lucene.
  • 4. Solr (pronounced "solar") is an open source enterprise search platform from the Apache Lucene project. Its major features include full-text search, hit highlighting, faceted search, dynamic clustering, database integration, and rich document (e.g., Word, PDF) handling. Providing distributed search and index replication, Solr is highly scalable. Solr is the most popular enterprise search engine. (Wikipedia)
  • 5. Search on Hadoop History • Katta • Blur • SolBase • HBASE-3529 • SOLR-1301 • SOLR-1045 • Ad-Hoc
  • 7. Strengthen the Family Bonds • No need to build something radically new - we have the pieces we need. • Focus on integration points. • Create high quality, first class integrations and contribute the work to the projects involved. • Focus on integration and quality first - then performance and scale.
  • 9. Solr Integration • Read and Write directly to HDFS • First Class Custom Directory Support in Solr • Support Solr Replication on HDFS • Other improvements around usability and configuration
  • 10. Read and Write directly to HDFS • Lucene did not historically support append-only filesystems • “Flexible Indexing” brought support for append-only filesystems • Lucene has supported append-only filesystems by default since 4.2
  • 11. Lucene Directory Abstraction • It’s how Lucene interacts with index files. • Solr uses the Lucene library and offers DirectoryFactory. • class Directory { listAll(); createOutput(file, context); openInput(file, context); deleteFile(file); makeLock(file); clearLock(file); … }
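To make the idea of a pluggable storage abstraction concrete, here is a minimal sketch. It is not Lucene's actual `Directory` API: `SimpleDirectory` and `InMemoryDirectory` are hypothetical names, the method set is reduced to four of the operations listed on the slide, and locking and IO contexts are left out. The point is only that index code talks to the interface, so a local, in-memory, or HDFS-backed implementation can be swapped in.

```java
import java.io.ByteArrayOutputStream;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical, much-reduced stand-in for a Directory-style storage abstraction. */
interface SimpleDirectory {
    String[] listAll();
    ByteArrayOutputStream createOutput(String file);
    byte[] openInput(String file);
    void deleteFile(String file);
}

/** In-memory implementation; an HDFS-backed one would implement the same methods. */
final class InMemoryDirectory implements SimpleDirectory {
    private final Map<String, ByteArrayOutputStream> files = new TreeMap<>();

    @Override
    public String[] listAll() {
        return files.keySet().toArray(new String[0]);
    }

    @Override
    public ByteArrayOutputStream createOutput(String file) {
        // Creating an output replaces any existing file of the same name.
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        files.put(file, out);
        return out;
    }

    @Override
    public byte[] openInput(String file) {
        ByteArrayOutputStream out = files.get(file);
        if (out == null) {
            throw new IllegalArgumentException("no such file: " + file);
        }
        return out.toByteArray();
    }

    @Override
    public void deleteFile(String file) {
        files.remove(file);
    }
}
```

A caller that only depends on `SimpleDirectory` does not need to know where the bytes live, which is the property the DirectoryFactory plug-in point relies on.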
  • 12. Putting the Index in HDFS • Solr relies on the filesystem cache to operate at full speed. • HDFS is not known for its random access speed. • Apache Blur has already solved this with an HdfsDirectory that works on top of a BlockDirectory. • The “block cache” caches the hot blocks of the index off heap (direct byte array) and takes the place of the filesystem cache. • We contributed back optional ‘write’ caching.
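The block cache idea can be sketched as a small LRU cache of fixed-size blocks sitting in front of a slow store. This is an illustration only, not the Blur or Solr implementation: `BlockCacheSketch` and `BlockReader` are hypothetical names, and the blocks here live on the Java heap, whereas the slide describes an off-heap cache.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Illustrative LRU cache of file blocks, keyed by "file@blockIndex". */
final class BlockCacheSketch {
    /** Hypothetical hook for the slow backing store (for example, a remote filesystem read). */
    interface BlockReader {
        byte[] read(String file, long blockIndex);
    }

    private final int maxBlocks;
    private final BlockReader slowStore;
    private final LinkedHashMap<String, byte[]> blocks;
    private int misses = 0;

    BlockCacheSketch(int maxBlocks, BlockReader slowStore) {
        this.maxBlocks = maxBlocks;
        this.slowStore = slowStore;
        // accessOrder=true keeps entries ordered from least to most recently used.
        this.blocks = new LinkedHashMap<String, byte[]>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, byte[]> eldest) {
                return size() > BlockCacheSketch.this.maxBlocks;
            }
        };
    }

    /** Returns the block from the cache, loading it from the slow store on a miss. */
    byte[] getBlock(String file, long blockIndex) {
        String key = file + "@" + blockIndex;
        byte[] block = blocks.get(key);
        if (block == null) {
            misses++;
            block = slowStore.read(file, blockIndex);
            blocks.put(key, block);
        }
        return block;
    }

    int misses() {
        return misses;
    }
}
```

Hot blocks stay resident and repeated reads avoid the slow store, which is the role the operating system's filesystem cache plays for a local index.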
  • 13. Putting the TransactionLog in HDFS • HdfsUpdateLog added; it extends UpdateLog. • Triggered by setting the UpdateLog dataDir to something that starts with hdfs:/ - no additional configuration necessary. • Same extensive testing as used on UpdateLog.
  • 14. Running Solr on HDFS • Set DirectoryFactory to HdfsDirectoryFactory and set the dataDir to a location in HDFS. • Set LockType to ‘hdfs’. • Use an UpdateLog dataDir location that begins with ‘hdfs:/’. • Or: java -Dsolr.directoryFactory=HdfsDirectoryFactory -Dsolr.lockType=solr.HdfsLockFactory -Dsolr.updatelog=hdfs://host:port/path -jar start.jar
  • 15. Solr Replication on HDFS • While Solr has exposed a pluggable DirectoryFactory for a long time now, it was really quite limited. • Most glaring, only a local-filesystem-based Directory would work with replication. • There were also other, more minor areas that relied on a local filesystem Directory implementation.
  • 16. Future Solr Replication on HDFS • Take advantage of the “distributed filesystem” and allow for something similar to HBase regions. • If a node goes down, the data is still available in HDFS; allow that index to be automatically served by a node that is still up, if it has the capacity. (Diagram: three Solr nodes on top of HDFS.)
  • 17. MR Index Building • Scalable index creation via map-reduce • Many initial ‘homegrown’ implementations sent documents from the reducer to SolrCloud over HTTP • To really scale, you want the reducers to create the indexes in HDFS and then load them up with Solr • The ideal implementation will allow using as many reducers as are available in your Hadoop cluster, and then merge the indexes down to the correct number of ‘shards’
  • 18. MR Index Building (diagram) • Mappers: parse input into indexable documents • Arbitrary reducing steps of indexing and merging • End-Reducer (shard 1): index documents, producing index shard 1 • End-Reducer (shard 2): index documents, producing index shard 2
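The "use many reducers, then merge down to the shard count" step can be sketched as follows. This is a toy model under stated assumptions: a partial index is represented as a plain list of document ids, merging is list concatenation, the reducer count is a multiple of the shard count, and consecutive reducers belong to the same shard. `ShardMergeSketch` and `mergeDown` are hypothetical names, not part of Solr or Hadoop.

```java
import java.util.ArrayList;
import java.util.List;

final class ShardMergeSketch {
    /**
     * Merges the partial indexes produced by many reducers into numShards final shards.
     * Each shard is built from an equally sized, consecutive group of partial indexes.
     */
    static <T> List<List<T>> mergeDown(List<List<T>> partials, int numShards) {
        if (numShards <= 0 || partials.size() % numShards != 0) {
            throw new IllegalArgumentException(
                "reducer count must be a positive multiple of the shard count");
        }
        int fanIn = partials.size() / numShards;
        List<List<T>> shards = new ArrayList<>();
        for (int s = 0; s < numShards; s++) {
            List<T> merged = new ArrayList<>();
            for (List<T> partial : partials.subList(s * fanIn, (s + 1) * fanIn)) {
                merged.addAll(partial); // stand-in for a real index merge
            }
            shards.add(merged);
        }
        return shards;
    }
}
```

With four reducers and two shards, reducers 0 and 1 feed shard 0 and reducers 2 and 3 feed shard 1, so indexing parallelism is decoupled from the number of shards the cluster serves.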
  • 19. SolrCloud Aware • Can ‘inspect’ ZooKeeper to learn about the Solr cluster. • What URLs to GoLive to. • The schema to use when building indexes. • Match hash -> shard assignments of a Solr cluster.
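Matching hash -> shard assignments means the offline job must route each document to the same shard the live cluster would. A minimal sketch of range-based routing is shown below, assuming the 32-bit hash space is split into equally sized contiguous ranges, one per shard. The names are hypothetical, and `String.hashCode()` is only a stand-in: a real job would have to use the same hash function and the actual ranges recorded in the cluster state.

```java
final class HashRangeSketch {
    /** Maps a 32-bit hash to one of numShards contiguous, equally sized ranges. */
    static int shardForHash(int hash, int numShards) {
        if (numShards <= 0) {
            throw new IllegalArgumentException("numShards must be positive");
        }
        long offset = (long) hash - Integer.MIN_VALUE; // shift into 0 .. 2^32 - 1
        long rangeSize = (1L << 32) / numShards;
        // The last shard absorbs the remainder when 2^32 is not evenly divisible.
        return (int) Math.min(offset / rangeSize, numShards - 1);
    }

    /** Routes a document id; the hash used here is an assumption for illustration. */
    static int shardForId(String docId, int numShards) {
        return shardForHash(docId.hashCode(), numShards);
    }
}
```

Because routing is a pure function of the id and the shard count, an index built offline for a given shard holds exactly the documents that shard would have received through normal indexing, provided both sides agree on the hash and ranges.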
  • 20. GoLive • After building your indexes with map-reduce, how do you deploy them to your Solr cluster? • We want it to be easy, so we built the GoLive option. • GoLive allows you to easily merge the indexes you have created atomically into a live running Solr cluster. • Paired with the ZooKeeper Aware ability, this allows you to simply point your map-reduce job to your Solr cluster and it will automatically discover how many shards to build and what locations to deliver the final indexes to in HDFS.
  • 21. Flume Solr Sync • Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data. It has a simple and flexible architecture based on streaming data flows. It is robust and fault tolerant with tunable reliability mechanisms and many failover and recovery mechanisms. It uses a simple extensible data model that allows for online analytic application. (Apache Flume website)
  • 23. SolrCloud Aware • Can ‘inspect’ ZooKeeper to learn about the Solr cluster. • What URLs to send data to. • The schema for the collection being indexed to.
  • 24. HBase Integration • Collaboration between NGData & Cloudera • NGData are creators of the Lily data management platform • Lily HBase Indexer Service, which acts as an HBase replication listener • HBase replication features, such as filtering, supported • Replication updates trigger indexing of updates (rows) • Integrates Morphlines library for ETL of rows • AL2 licensed on GitHub: https://github.com/ngdata
  • 25. HBase Integration (diagram; labels: interactive load, HBase, triggers on updates, Indexer(s), five Solr servers, HDFS)
  • 26. Morphlines • A morphline is a configuration file that allows you to define ETL transformation pipelines • Extract content from input files, transform content, load content (e.g. to Solr) • Uses Tika to extract content from a large variety of input documents • Part of the CDK (Cloudera Development Kit)
  • 27. Morphlines • Open source framework for simple ETL • Ships as part of the Cloudera Development Kit (CDK) • It’s a Java library • AL2 licensed on GitHub: https://github.com/cloudera/cdk • Similar to Unix pipelines • Configuration over coding • Supports common Hadoop formats: Avro, sequence file, text, etc. (Diagram; labels: syslog, Flume Agent, Solr Sink, Command: readLine, Command: grok, Command: loadSolr, Solr)
  • 28. Morphlines • Integrate with and load into Apache Solr • Flexible log file analysis • Single-line records, multi-line records, CSV files • Regex-based pattern matching and extraction • Integration with Avro • Integration with Apache Hadoop Sequence Files • Integration with SolrCell and all Apache Tika parsers • Auto-detection of MIME types from binary data using Apache Tika
  • 29. Morphlines • Scripting support for dynamic Java code • Operations on fields for assignment and comparison • Operations on fields with list and set semantics • if-then-else conditionals • A small rules engine (tryRules) • String and timestamp conversions • slf4j logging • Yammer metrics and counters • Decompression and unpacking of arbitrarily nested container file formats • Etc…
  • 30. Morphlines Example Config
      morphlines : [
        {
          id : morphline1
          importCommands : ["com.cloudera.**", "org.apache.solr.**"]
          commands : [
            { readLine {} }
            {
              grok {
                dictionaryFiles : [/tmp/grok-dictionaries]
                expressions : {
                  message : """<%{POSINT:syslog_pri}>%{SYSLOGTIMESTAMP:syslog_timestamp} %{SYSLOGHOST:syslog_hostname} %{DATA:syslog_program}(?:\[%{POSINT:syslog_pid}\])?: %{GREEDYDATA:syslog_message}"""
                }
              }
            }
            { loadSolr {} }
          ]
        }
      ]
    Example Input
      <164>Feb  4 10:46:14 syslog sshd[607]: listening on 0.0.0.0 port 22.
    Output Record
      syslog_pri:164
      syslog_timestamp:Feb  4 10:46:14
      syslog_hostname:syslog
      syslog_program:sshd
      syslog_pid:607
      syslog_message:listening on 0.0.0.0 port 22.
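The readLine -> grok -> loadSolr chain in the example config can be approximated in plain Java to show what each step contributes. This is a sketch, not the Morphlines API: `SyslogPipelineSketch` is a hypothetical class, the grok dictionary patterns are replaced by a hand-written regular expression that is only an approximation of them, and the "load" step appends records to an in-memory list instead of sending them to Solr. Java named groups cannot contain underscores, so the `syslog_` prefix is added when the record is built.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class SyslogPipelineSketch {
    // Hand-written approximation of the grok expression from the example config.
    private static final Pattern SYSLOG = Pattern.compile(
        "^<(?<pri>\\d+)>(?<timestamp>[A-Z][a-z]{2} +\\d{1,2} \\d{2}:\\d{2}:\\d{2}) "
        + "(?<hostname>\\S+) (?<program>[^\\[:]+)(?:\\[(?<pid>\\d+)\\])?: (?<message>.*)$");

    private static final String[] GROUPS =
        {"pri", "timestamp", "hostname", "program", "pid", "message"};

    /** "grok" step: turns one line into a record, or returns null if it does not match. */
    static Map<String, String> grok(String line) {
        Matcher m = SYSLOG.matcher(line);
        if (!m.matches()) {
            return null;
        }
        Map<String, String> record = new LinkedHashMap<>();
        for (String group : GROUPS) {
            String value = m.group(group);
            if (value != null) { // the pid part is optional
                record.put("syslog_" + group, value);
            }
        }
        return record;
    }

    /** Runs the three steps over the input and returns the records that were "loaded". */
    static List<Map<String, String>> run(String input) {
        List<Map<String, String>> loaded = new ArrayList<>();
        for (String line : input.split("\\R")) {      // "readLine" step
            Map<String, String> record = grok(line);  // "grok" step
            if (record != null) {
                loaded.add(record);                   // stand-in for "loadSolr"
            }
        }
        return loaded;
    }
}
```

Calling `SyslogPipelineSketch.run` on the example input line yields a single record whose keys mirror the fields of the output record on the slide. Lines that do not match the pattern are dropped here, which is a simplification chosen for the sketch.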
  • 31. Hue Integration • Simple UI • Navigated, faceted drill down • Customizable display • Full text search, standard Solr API and query language