3. Who am I?
• Principal at OpenSource Connections
  - a Solr/Lucene search consultancy
• Member of the Apache Software Foundation
• SOLR-284 UpdateRichDocuments (July 2007)
• Fascinated by the art of software development
7. Not an intro to cloud computing
• See "Indexing Big Data on Amazon AWS" by Scott Stults @ 1:15 Thursday
• See "How is the Government Spending Your Money? How GCE is Using Lucene and the GCE Big Data Cloud" by Seshu Simhadi @ 2:55 Thursday
8. Not an intro to SolrCloud!
• See "How SolrCloud Changes the User Experience in a Sharded Environment" by Erick Erickson @ 2:55 Today
• See "Solr 4: The SolrCloud Architecture" by Mark Miller @ 10:45 Tomorrow
9. My assumptions for Client X
• Big Data is any data set that is primarily at rest due to the difficulty of working with it.
• Limited selection of tools available.
• Aggressive timeline.
• All the data must be searched per query.
• On the Solr 3.x line.
15. Make it easy to change sharding
public void run(Map options, List<SolrInputDocument> docs)
    throws InstantiationException, IllegalAccessException, ClassNotFoundException {
  // Load the sharding strategy by name so it can be swapped via configuration
  IndexStrategy indexStrategy = (IndexStrategy) Class.forName(
      "com.o19s.solr.ModShardIndexStrategy").newInstance();
  indexStrategy.configure(options);
  for (SolrInputDocument doc : docs) {
    indexStrategy.addDocument(doc);
  }
}
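For context, here is a minimal, runnable sketch of the plumbing that run() method relies on. IndexStrategy and ModShardIndexStrategy are named on the slide, but the internals below (a comma-separated "shards" option and id-hash-mod routing) are assumptions, and a plain String id stands in for SolrInputDocument to keep the sketch free of Solr dependencies.

```java
import java.util.List;
import java.util.Map;

// Hypothetical sketch: the interface the reflective Class.forName() call expects.
interface IndexStrategy {
    void configure(Map<String, String> options);
    void addDocument(String docId);
}

class ModShardIndexStrategy implements IndexStrategy {
    private List<String> shardUrls;

    @Override
    public void configure(Map<String, String> options) {
        // e.g. options.get("shards") -> "http://s1/solr,http://s2/solr"
        shardUrls = List.of(options.get("shards").split(","));
    }

    // Pick a shard by hashing the document id mod the shard count,
    // so the same id always lands on the same shard.
    String shardFor(String docId) {
        int bucket = Math.abs(docId.hashCode() % shardUrls.size());
        return shardUrls.get(bucket);
    }

    @Override
    public void addDocument(String docId) {
        // A real strategy would POST the document to the chosen shard URL.
        System.out.println(docId + " -> " + shardFor(docId));
    }
}

public class ShardDemo {
    public static void main(String[] args) {
        IndexStrategy strategy = new ModShardIndexStrategy();
        strategy.configure(Map.of("shards", "http://s1/solr,http://s2/solr"));
        strategy.addDocument("DOC45242");
    }
}
```

Because the strategy is chosen by class name, switching from mod sharding to, say, time-based sharding is a configuration change, not a code change.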
16. Separate JVM from Solr Cores
• Step 1: Fire up empty Solrs on all the servers (nohup &).
• Step 2: Verify they started cleanly.
• Step 3: Create cores (curl http://search1.o19s.com:8983/solr/admin?action=create&name=run2).
• Step 4: Create an "aggregator" core, passing in the URLs of the cores (&property.shards=).
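The four steps might look like this on the command line. Treat it as an untested sketch: the hostnames and core names follow the slide's examples, and the exact admin URL depends on how the core admin handler is mapped in your solr.xml.

```
# Step 1: start an empty Solr on each server, detached from the shell
nohup java -jar start.jar > solr.log 2>&1 &

# Step 2: verify it came up cleanly
curl 'http://search1.o19s.com:8983/solr/admin/ping'

# Step 3: create a core for this run
curl 'http://search1.o19s.com:8983/solr/admin/cores?action=CREATE&name=run2&instanceDir=run2'

# Step 4: create the "aggregator" core, passing shard URLs as a core property
curl 'http://search1.o19s.com:8983/solr/admin/cores?action=CREATE&name=run2_agg&instanceDir=agg&property.shards=search1.o19s.com:8983/solr/run2,search2.o19s.com:8983/solr/run2'
```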
20. Don't Move Files
• scp across machines is slow and error prone.
• An NFS share is a single point of failure.
• A clustered file system like GFS (Global File System) can have "fencing" issues.
• HDFS shines here.
• ZooKeeper shines here.
27. Using Solr as a key/value store
• Thousands of queries per second without real time get:
  http://localhost:8983/solr/run2_enrichment/select?q=id:DOC45242&fl=entities,html
• ??? with real time get?
  http://localhost:8983/solr/run2_enrichment/get?id=DOC45242&fl=entities,html
28. Using Solr as key/value store
[Diagram: an ingest pipeline reads content files, pulls metadata from a Solr key/value cache, and feeds the Solr index shards.]
30. Push schema definition to the application
• Not "schema less"
• Just a different owner of the schema!
• The schema may have a common set of fields like id, type, timestamp, version.
• Nothing required.
q=intensity_i:[70 TO 0]&fq=TYPE:streetlamp_monitor
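The `intensity_i` query above hints at how the application can own the schema: via dynamic-field naming conventions, where a suffix such as `*_i` or `*_s` tells Solr how to index the value. The helper below is an illustrative sketch, not from the deck, assuming the stock Solr dynamic-field suffixes.

```java
import java.util.Map;

// Sketch of application-owned schema conventions, assuming Solr's stock
// dynamic-field suffixes (*_i for int, *_s for string, *_b for boolean).
public class DynamicFields {
    private static final Map<Class<?>, String> SUFFIXES = Map.of(
            Integer.class, "_i",
            String.class, "_s",
            Boolean.class, "_b");

    // Build a field name whose suffix tells Solr how to index the value,
    // so the application, not schema.xml, decides each concrete field.
    public static String fieldName(String base, Class<?> type) {
        String suffix = SUFFIXES.get(type);
        if (suffix == null) {
            throw new IllegalArgumentException("no dynamic field suffix for " + type);
        }
        return base + suffix;
    }

    public static void main(String[] args) {
        System.out.println(fieldName("intensity", Integer.class)); // intensity_i
    }
}
```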
31. Don't do expensive things in Solr
• Tika content extraction, aka Solr Cell
• UpdateRequestProcessorChain
33. Avro!
• Supports serialization of data readable from multiple languages
• It's smart XML
• Handles forward and backward versions of an object
• Compact and fast to read.
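Avro schemas are plain JSON. A minimal sketch (the record and field names are illustrative, not from the deck) of how versioning works: a field declared with a "default" can be filled in when reading data written by an older schema, and readers on an older schema simply ignore writer fields they don't know about.

```
{
  "type": "record",
  "name": "EnrichedDoc",
  "namespace": "com.example",
  "fields": [
    {"name": "id",   "type": "string"},
    {"name": "html", "type": "string"},
    {"name": "entities",
     "type": {"type": "array", "items": "string"},
     "default": []}
  ]
}
```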
Search was the original big data problem. Then Google search came along, and search wandered in the wilderness of internal enterprise search and ecommerce search. Now search is back, with a new, cooler name: "Big Data". Search interfaces are the dominant metaphor for working with big data sets by non-data-scientists.
SOLR-284, back in July 2007, was a first cut at a content extraction library before Tika came along.
And I love agile development processes. I think of agile as business -> requirements -> development -> testing -> systems administration.
\n
And I don't mean this as a shot against Hadoop, but with the right hardware you can get a lot done in Bash, with a bit of Java or Perl sprinkled in. There is a lot of value in getting started today on building large scaled-out ingestors.
Notice our property style? It made it easy to read properties in both Bash and Java!
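A sketch of that shared property style (the file name and keys here are made up): plain KEY=value lines with no spaces around the "=" parse with java.util.Properties on the Java side and can be `source`d directly by a Bash script.

```java
import java.io.StringReader;
import java.util.Properties;

public class SharedProps {
    public static void main(String[] args) throws Exception {
        // Contents of a hypothetical run.properties file. No spaces around
        // '=', so Bash can simply:  source run.properties; echo $SHARD_COUNT
        String config = "SHARD_COUNT=4\nSOLR_URL=http://search1.o19s.com:8983/solr\n";

        Properties props = new Properties();
        props.load(new StringReader(config));   // the Java side of the bargain
        System.out.println(props.getProperty("SHARD_COUNT")); // 4
    }
}
```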
Try sharding at different sizes using mod. Try sharding by month, week, or hour, depending on your volume of data.
We had huge leftover "enterprise" boxes with ginormous amounts of RAM and CPU.
The -verbose:gc and -XX:+PrintGCDetails flags let you grep for the frequency of partial versus full garbage collections. We rolled back from 3.4 to 3.1 based on this data on one project.
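Concretely, the flags go on the Solr startup line. This is a sketch: the gc.log path and the start.jar launcher are assumptions for a Solr 3.x-style example install.

```
java -verbose:gc -XX:+PrintGCDetails -Xloggc:gc.log -jar start.jar

# then count full collections over time, e.g.:
grep -c 'Full GC' gc.log
```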
Again, horse-racing two slaves can help. You can also pass in the connection information via the jconsole command line, which makes it easier to monitor a set of Solrs.
I love working with CSV and Solr. The CSV writer type is great for moving data between Solrs. (Don't forget to store everything!)
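A sketch of that move, using the CSV response writer on the way out and the CSV update handler on the way in. Hostnames and core names are assumed, and the export only contains what you stored, hence "store everything."

```
# Export stored fields as CSV from the source Solr
curl 'http://search1.o19s.com:8983/solr/run2/select?q=*:*&rows=1000000&wt=csv' > dump.csv

# Load the dump into the target Solr through the CSV update handler
curl 'http://search2.o19s.com:8983/solr/run2/update/csv?commit=true' \
     --data-binary @dump.csv -H 'Content-Type: text/csv'
```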
You have many fewer Solrs than you do indexer processes.
Dollar Tree makes crap; stores are always empty or missing items. You don't want your indexing to be like that. The Space Shuttle cost 500 million dollars every time it launched. You don't want your indexing process to be like launching the Space Shuttle.
Runs every hour. Looks at log files to determine if a Solr cluster is misbehaving.
HAL 9000 misbehaved. Runs every hour. Looks at log files to determine if a Solr cluster is misbehaving. Especially important if you are on a cloud platform: providers implement their servers on the cheapest commodity hardware.
Kaa, the snake from The Jungle Book, hypnotizing Mowgli. danah boyd, among others, has said that Big Data sometimes throws out thousands of years