Real-time notification systems are often limited to basic filtering and pattern matching against incoming records. Allowing users to query incoming documents with Solr's full range of capabilities is far more powerful. In our environment we needed to support tens of thousands of such query subscriptions, which meant distributing the query processing in the cloud. By creating in-memory Lucene indices from our Solr configuration, we were able to parallelize our queries across our cluster, and by wrapping the processing in a Storm topology we gained a flexible way to scale and manage our infrastructure. This presentation describes our experiences creating this distributed, real-time inverted search notification framework.
4. Who are we?
Booz Allen Hamilton
– Large consulting firm supporting many industries
• Healthcare, Finance, Energy, Defense
– Strategic Innovation Group
• Focus on innovative solutions that can be applied across industries
• Major focus on data science, big data, & information retrieval
• Multiple clients utilizing Solr for implementing search capabilities
10. Client Applications & Architecture
Typical client applications allow users to:
• Query document index using Lucene syntax
• Filter and facet results
• Save queries for future use
[Architecture diagram: Ingest → SolrCloud ↔ Web App]
6. Problem Statement
How do we instantly notify users of new documents that match their saved queries?
Constraints:
• Process documents in real-time, notify as soon as possible
• Scale with the number of saved queries (starting with tens of thousands)
• Result set of notifications must match saved queries
• Must not impact performance of the web application
• Data arrives at varying speeds and varying sizes
7. Possible Solutions
• Second Solr instance to handle background execution of saved queries
• Fork ingest to primary and secondary Solr instances, execute all the saved queries against the secondary instance

lotsOfQueries.size() = 1 x 10^9 // Milliard?
for (Query q : lotsOfQueries) {
    q // *A* OR *B* OR …
}
// … This will take forever

Pros
• Easy to set up, simple
• Works for a consistent, small data flow
Cons
• Query bound
8. Possible Solutions
• Distribute queries amongst multiple machines
• Execute queries against a shared Solr (or SolrCloud) instance

lotsOfQueries.size() = 2.5 x 10^8
for (Query q : lotsOfQueries) {
    q // *A* OR *B* OR …
}

lotsOfQueries.size() = 2.5 x 10^8
for (Query q : lotsOfQueries) {
    q // *C* OR *D* OR …
}

lotsOfQueries.size() = 2.5 x 10^8
for (Query q : lotsOfQueries) {
    q // *E* OR *F* OR …
}

lotsOfQueries.size() = 2.5 x 10^8
for (Query q : lotsOfQueries) {
    q // *G* OR *H* OR …
}

Pros
• Scalable, only bound by the processing of the Solr instance
Cons
• Who is maintaining this code???
• Synchronization issues; index cannot be updated during query execution
9. Possible Solutions
One way to deal with the synchronization issues is to do away with a shared Solr
instance, giving each VM its own instance, then distribute the data or queries evenly
across the VMs.
lotsOfQueries.size() = 5 x 10^8
for (Query q : lotsOfQueries) {
    q // *A* OR *B* OR …
}

lotsOfQueries.size() = 5 x 10^8
for (Query q : lotsOfQueries) {
    q // *C* OR *D* OR …
}

Pros
• Scalable, processing power only bound by number of VMs
• Can handle variable data flow; query processing would not need to be synchronized
Cons
• Difficult to maintain
10. Possible Solutions
Is there a way we can set up this system so that it’s:
• easy to maintain,
• easy to scale, and
• easy to synchronize?
11. Candidate Solution
• Integrate Solr and/or Lucene with a stream processing framework
• Process data in real-time, leverage proven framework for distributed stream processing
[Architecture diagram: Ingest → SolrCloud + Storm; Storm → Notifications; Web App ↔ SolrCloud]
12. Storm - Overview
• Storm is an open source stream processing framework.
• It's a scalable platform that lets you distribute processes across a cluster quickly and easily.
• You can add more resources to your cluster and easily utilize those resources in your processing.
13. Storm - Components
• Nimbus – the control node for the cluster; distributes jobs through the cluster
• Supervisor – one on each machine in the cluster; controls the allocation of worker assignments on its machine
• Worker – JVM process for running topology components
[Diagram: Nimbus over three Supervisors, each managing four Workers]
14. Storm – Core Concepts
• Topology – defines a running process, which includes all of the processes to be run, the connections between those processes, and their configuration
• Stream – the flow of data through a topology; it is an unbounded collection of tuples that is passed from process to process
• Storm has 2 types of processing units:
  – Spout – the start of a stream; it can be thought of as the source of the data; that data can be read in however the spout wants: from a database, from a message queue, etc.
  – Bolt – the primary processing unit for a topology; it accepts any number of streams, does whatever processing you've set it to do, and outputs any number of streams based on how you configure it
15. Storm – Core Concepts (continued)
• Stream Groupings – define how topology processing units (spouts and bolts) are connected to each other; some common groupings are:
  – All Grouping – stream is sent to all bolts
  – Shuffle Grouping – stream is evenly distributed across bolts
  – Fields Grouping – sends tuples that match on the designated "field" to the same bolt
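As a rough sketch of the routing contract (not Storm's actual implementation), the three groupings can be modeled as task-selection functions; the class name, hash scheme, and task counts here are illustrative:

```java
import java.util.ArrayList;
import java.util.List;

public class GroupingSketch {
    // Fields grouping: hash the grouping field's value to pick a task index,
    // so tuples with the same field value always go to the same bolt task.
    static int fieldsGrouping(Object fieldValue, int numTasks) {
        return Math.abs(fieldValue.hashCode() % numTasks);
    }

    // Shuffle grouping: spread tuples evenly across tasks (round-robin here).
    static int shuffleGrouping(int tupleSeq, int numTasks) {
        return tupleSeq % numTasks;
    }

    // All grouping: every task receives a copy of the tuple.
    static List<Integer> allGrouping(int numTasks) {
        List<Integer> tasks = new ArrayList<>();
        for (int i = 0; i < numTasks; i++) tasks.add(i);
        return tasks;
    }
}
```

The useful property for this system is that an All Grouping lets every Executor Bolt see every document, while queries stay partitioned per bolt.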
16. How to Utilize Storm
How can we use this framework to solve our problem?
Let Storm distribute out the data and queries between processing nodes…

…but we would still need to manage a Solr instance on each VM, and we would even need to ensure synchronization between query processing bolts running on the same VM.
17. How to Utilize Storm
What if instead of having a Solr installation on each machine we ran
Solr in memory inside each of the processing bolts?
• Use Storm spout to distribute new documents
• Use Storm bolt to execute queries against EmbeddedSolrServer with RAMDirectory
  – Incoming documents added to index
  – Queries executed
  – Documents removed from index
• Use Storm bolt to process query results
[Diagram: a Bolt containing an EmbeddedSolrServer backed by a RAMDirectory]
18. Advantages
This has several advantages:
• It removes the need to maintain a Solr instance on each VM.
• It's easier to scale and more flexible; it doesn't matter which Supervisor the bolts get sent to, all the processing is self-contained.
• It removes the need to synchronize processing between bolts.
• Documents are volatile; we run the existing queries over new data, inverting the usual search model.
19. Execution Topology
[Topology diagram: Data Spouts and the Query Spout feed the Executor Bolts via an All Grouping; Executor Bolts feed the Notification Bolt via a Shuffle Grouping]
• Data Spout – receives incoming data files and sends them to every Executor Bolt
• Query Spout – coordinates updates to queries
• Executor Bolt – loads and executes queries
• Notification Bolt – generates notifications based on results
20. Executor Bolt
1. Queries are loaded into memory
2. Incoming documents are added to the Lucene index
3. Documents are processed when one of the following conditions is met:
   a) The number of documents has exceeded the max batch size
   b) The time since the last execution is longer than the max interval time
4. Matching queries and document UIDs are emitted
5. All documents are removed from the index
[Diagram: documents (2) flow into the bolt's index alongside the query list (1); batch execution (3) emits matches via emit() (4)]
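The trigger in step 3 is a simple either/or check. A minimal sketch, using the 1k-document / 60-second values from our trial runs (the class and method names are hypothetical, not from the actual bolt):

```java
public class BatchTrigger {
    private final int maxBatchSize;      // e.g. 1000 documents
    private final long maxIntervalMs;    // e.g. 60 seconds

    public BatchTrigger(int maxBatchSize, long maxIntervalMs) {
        this.maxBatchSize = maxBatchSize;
        this.maxIntervalMs = maxIntervalMs;
    }

    // Execute the queries when the batch is full OR the interval has elapsed.
    public boolean shouldExecute(int pendingDocs, long lastRunMs, long nowMs) {
        return pendingDocs >= maxBatchSize || (nowMs - lastRunMs) >= maxIntervalMs;
    }
}
```

The time-based condition matters for slow streams: without it, a trickle of documents would sit unmatched until the batch filled.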
21. Solr In-Memory Processing Bolt Issues
• Attempted to run Solr with an in-memory index inside a Storm bolt
• Solr 4.5 requires:
  – http-client 4.2.3
  – http-core 4.2.2
• Storm 0.8.2 & 0.9.0 require:
  – http-client 4.1.1
  – http-core 4.1
• Could exclude the libraries from the super jar and rely on storm/lib, but Solr expects SystemDefaultHttpClient from 4.2.3
• Could build Storm with newer versions of the libraries, but not guaranteed to work
22. Lucene In-Memory Processing Bolt
1. Initialization
   – Parse common Solr schema
   – Replace Solr classes
2. Add Documents
   – Convert SolrInputDocument to Lucene Document
   – Add to index
Advantages:
• Fast, lightweight
• No dependency conflicts
• RAMDirectory backed
• Easy Solr to Lucene Document conversion
• Solr Schema based
[Diagram: a Bolt containing a Lucene index backed by a RAMDirectory]
23. Lucene In-Memory Processing Bolt
• Read/parse/update the Solr schema file using StAX
• Create an IndexSchema from the new Solr schema data
public void addDocument(SolrInputDocument doc) throws Exception {
    if (doc != null) {
        Document luceneDoc = solrDocumentConverter.convert(doc);
        indexWriter.addDocument(luceneDoc);
        indexWriter.commit();
    }
}
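To illustrate the rest of the cycle (execute the saved queries over the batch, emit matches, clear the index), here is a toy stand-in: a plain term-to-UID map plays the role of the Lucene index and single-term strings stand in for parsed queries. The real bolt searches the RAMDirectory-backed index instead; all names here are hypothetical.

```java
import java.util.*;

public class ExecuteCycleSketch {
    // Toy "index": term -> set of document UIDs in the current batch.
    private final Map<String, Set<String>> index = new HashMap<>();

    public void addDocument(String uid, String... terms) {
        for (String t : terms) {
            index.computeIfAbsent(t, k -> new HashSet<>()).add(uid);
        }
    }

    // Run every saved query over the batch, collect (query, docUID) matches,
    // then remove all documents from the index (step 5 on the previous slide).
    public List<String[]> executeAndClear(Collection<String> savedQueries) {
        List<String[]> matches = new ArrayList<>();
        for (String q : savedQueries) {
            for (String uid : index.getOrDefault(q, Collections.emptySet())) {
                matches.add(new String[] {q, uid});  // would be emit() in the bolt
            }
        }
        index.clear();  // documents are volatile; only the queries persist
        return matches;
    }
}
```

The key design point survives the simplification: the index only ever holds the current batch, so it stays small and never needs to be synchronized with other bolts.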
24. Prototype Solution
• Infrastructure:
  – 8 node cluster on Amazon EC2
  – Each VM has 2 cores and 8G of memory
• Data:
  – 92,000 news article summaries
  – Average file size: ~1k
• Queries:
  – Generated 1 million sample queries
  – Randomly selected terms from document set
  – Stored in MariaDB (username, query string)
  – Query Executor Bolt configured to use any subset of these queries
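The deck doesn't show the generation code; a minimal sketch of how sample queries could be built by randomly selecting terms from the document set (class name and parameters are hypothetical):

```java
import java.util.*;

public class QueryGenerator {
    // Build sample queries by randomly picking terms from the document set;
    // a seeded Random keeps a test run reproducible.
    public static List<String> generate(List<String> terms, int numQueries,
                                        int termsPerQuery, long seed) {
        Random rnd = new Random(seed);
        List<String> queries = new ArrayList<>();
        for (int i = 0; i < numQueries; i++) {
            StringBuilder q = new StringBuilder();
            for (int j = 0; j < termsPerQuery; j++) {
                if (j > 0) q.append(" OR ");
                q.append(terms.get(rnd.nextInt(terms.size())));
            }
            queries.add(q.toString());
        }
        return queries;
    }
}
```

Drawing terms from the actual document set guarantees the generated queries produce matches, which is what makes the trial runs exercise the full emit path.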
25. Prototype Solution – Monitoring Performance
• Metrics provided by Storm UI:
  – Emitted: number of tuples emitted
  – Transferred: number of tuples transferred (emitted * # follow-on bolts)
  – Acked: number of tuples acknowledged
  – Execute Latency: timestamp when the execute function ends minus timestamp when execute is passed the tuple
  – Process Latency: timestamp when ack is called minus timestamp when execute is passed the tuple
  – Capacity: % of the time in the last 10 minutes the bolt spent executing tuples
• Many metrics are sampled and don't always indicate problems
• A good measurement is comparing the number of tuples transferred from the spout to the number of tuples acknowledged in the bolt
  – If the transferred number is getting increasingly higher than the number of acknowledged tuples, then the topology is not keeping up with the rate of data
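The transferred-vs-acked comparison boils down to watching the gap between the two counters grow. A sketch of that check (a hypothetical helper, not a Storm API):

```java
public class BacklogCheck {
    // Compare tuples transferred from the spout to tuples acked by the bolt;
    // a gap that grows between samples means the topology is falling behind.
    public static boolean fallingBehind(long transferred, long acked,
                                        long previousGap) {
        long gap = transferred - acked;
        return gap > previousGap;
    }
}
```

A single large gap is not alarming on its own (tuples may simply be in flight); it is the trend across samples that signals backup.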
26. Trial Runs – First Attempt
• 8 workers, 1 Spout, 8 Query Executor Bolts, 8 Result Bolts
• Article spout emitting as fast as possible
• Query execution at 1k docs or 60 seconds elapsed time
• Increased number of queries on each trial: 10k, 50k, 100k, 200k, 300k, 400k, 500k
[Cluster diagram: Article Spout on Node 1; each of Nodes 1–8 runs 4 worker slots with a Query Bolt and a Result Bolt]
Results:
• Articles emitted too fast for bolts to keep up
• If data continued to stream at this rate, the topology would back up and drop tuples
27. Trial Runs – Second Attempt
• 8 workers, 1 Spout, 8 Query Executor Bolts, 8 Result Bolts
• Article spout now places articles on a queue in a background thread every 100ms
• Everything else the same…
[Cluster diagram: same layout as the first attempt; Article Spout on Node 1, each of Nodes 1–8 running 4 worker slots with a Query Bolt and a Result Bolt]
Results:
• Topology performing much better, keeping up with data flow for query sizes of 10k, 50k, 100k, 200k
• Slows down around 300k queries, approx 37.5k queries/bolt
28. Trial Runs – Third Attempt
• Each node has 4 worker slots, so let's scale up
• 16 workers, 1 spout, 16 Query Executor Bolts, 8 Result Bolts
• Everything else the same…
[Cluster diagram: Article Spout on Node 1; each of Nodes 1–8 runs 4 worker slots with 2 Query Bolts and a Result Bolt]
Results:
• 300k queries now keeping up no problem
• 400k doing ok…
• 500k backing up a bit
29. Trial Runs – Fourth Attempt
• Next logical step: 32 workers, 1 spout, 32 Query Executor Bolts
• Didn't result in the anticipated performance gain; 500k still too much
• Hypothesizing that 2-core VMs might not be enough to get full performance from 4 worker slots
[Cluster diagram: Article Spout on Node 1; each of Nodes 1–8 runs 4 worker slots with 4 Query Bolts and a Result Bolt]
30. Trial Runs – Conclusions
• Most important factor affecting performance is the relationship between data rate and number of queries
• Ideal Storm configuration is dependent on the hardware executing the topology
• Optimal configuration resulted in 250 queries per second per bolt, 4k queries per second across the topology
• High level of performance from a relatively small cluster
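As a sanity check on the arithmetic: 250 queries/second/bolt across the 16 Query Executor Bolts of the third-attempt configuration gives the topology-wide 4k queries/second figure.

```java
public class ThroughputCheck {
    public static void main(String[] args) {
        int perBolt = 250;       // queries per second per Executor Bolt
        int executorBolts = 16;  // third-attempt configuration
        int topologyRate = perBolt * executorBolts;
        System.out.println(topologyRate + " queries/second");  // 4000 queries/second
    }
}
```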
31. Conclusions
• Low barrier to entry working with Storm
• Easy conversion of Solr indices to Lucene indices
• Simple integration between Lucene and Storm; Solr more complicated
• Configuration is key; tune the topology to your needs
• Overall strategy appears to scale well for our use case, limited only by hardware
32. Future Considerations
• Adjust the batch size on the query executor bolt
• Combine duplicate queries (between users) if your system has many duplicates
• Investigate additional optimizations during Solr to Lucene conversion
• Run topology with more complex queries (fielded, filtered, etc.)
• Investigate handling of bolt failure
• If the ratio of incoming data to queries were reversed, consider switching the groupings between the spouts and executor bolts