Timothy Potter
independent consultant search / big data projects
soon to be joining engineering team @LucidWorks
co-author Solr In Action
previously big data architect Dachis Group
my storm story
re-designed a complex batch-oriented indexing
pipeline based on Hadoop (Oozie, Pig, Hive, Sqoop)
to real-time storm topology
walk through how to develop a storm topology
common integration points with Solr
(near real-time indexing, percolator, real-time get)
listen to click events from URL shortener
to determine trending US government sites
stream of click events: ->
beyond word count
tackle real challenges you’ll encounter when
developing a storm topology
and what about ... unit testing, dependency injection,
measure runtime behavior of your components, separation of
concerns, reducing boilerplate, hiding complexity ...
open source distributed computation system
scalability, fault-tolerance, guaranteed message
processing (optional)
storm primitives
• tuple: ordered list of values
• stream: unbounded sequence of tuples
• spout: emit a stream of tuples (source)
• bolt: performs some operation on each tuple
• topology: DAG of spouts and bolts
solution requirements
• receive click events from stream
• count frequency of pages in a time window
• rank top N sites per time window
• extract title, body text, image for each link
• persist rankings and metadata for visualization
trending snapshot (sept 12, 2013)
[Diagram: topology overview. Components are connected by hash groupings; some bolts are provided in the storm-starter project; results go to a data store]
stream grouping
• shuffle: random distribution of tuples to all instances of a bolt
• field(s): group tuples by one or more fields in common
• global: reduce down to one
• all: replicate stream to all instances of a bolt
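The four groupings above can be sketched as simple task-selection functions. This is an illustrative stand-in, not Storm's actual implementation: the class name `GroupingSketch` and its methods are hypothetical, and the hash-modulo scheme is an assumption about how a fields grouping can be realized.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Random;

// Illustrative sketch only; not Storm's internal code.
class GroupingSketch {
    private static final Random RANDOM = new Random();

    // shuffle: pick any task at random
    static int shuffleGrouping(int numTasks) {
        return RANDOM.nextInt(numTasks);
    }

    // fields: equal grouping-field values always map to the same task
    static int fieldsGrouping(List<Object> groupingValues, int numTasks) {
        return Math.floorMod(groupingValues.hashCode(), numTasks);
    }

    // global: the whole stream is reduced down to one task
    static int globalGrouping() {
        return 0;
    }

    // all: every task receives a copy of the tuple
    static int[] allGrouping(int numTasks) {
        int[] tasks = new int[numTasks];
        for (int i = 0; i < numTasks; i++) {
            tasks[i] = i;
        }
        return tasks;
    }

    public static void main(String[] args) {
        int first = fieldsGrouping(Arrays.<Object>asList("2BktiW"), 3);
        int second = fieldsGrouping(Arrays.<Object>asList("2BktiW"), 3);
        System.out.println(first == second); // same key, same task
    }
}
```

The fields grouping is what lets a bolt keep per-key state (counts, caches) locally, because a given key never moves between instances while the task count stays fixed.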
useful storm concepts
• bolts can receive input from many spouts
• tuples in a stream can be grouped together
• streams can be split and joined
• bolts can inject new tuples into the stream
• components can be distributed across a cluster at a
configurable parallelism level
• optionally, storm keeps track of each tuple emitted by a spout
(ack or fail)
• Spring framework – dependency injection, configuration, unit
testing, mature, etc.
• Groovy – keeps your code tidy and elegant
• Mockito – ignore stuff your test doesn’t care about
• Netty – fast & powerful NIO networking library
• Coda Hale metrics – get visibility into how your bolts and
spouts are performing (at a very low-level)
easy! just produce a stream of tuples ...
and ... avoid blocking when waiting for more data, ease off throttle if topology
is not processing fast enough, deal with failed tuples, choose if it should use
message Ids for each tuple emitted, data model / schema, etc ...
[Diagram: Spring container (1 per topology per JVM) with data sources such as JDBC and WebService; callouts: "Hide complexity of implementing Storm contract", "focuses on"]
streaming data provider
class OneUsaGovStreamingDataProvider implements StreamingDataProvider, MessageHandler {
    // Spring dependency injection
    MessageStream messageStream
    // note: the queue and objectMapper fields are not shown on the slide

    void open(Map stormConf) { messageStream.receive(this) }

    boolean next(NamedValues nv) {
        // non-blocking call to get the next message
        String msg = queue.poll()
        if (msg) {
            // use the Jackson JSON parser to create an object from the raw incoming data
            OneUsaGovRequest req = objectMapper.readValue(msg, OneUsaGovRequest)
            if (req != null && req.globalBitlyHash != null) {
                nv.set(OneUsaGovTopology.GLOBAL_BITLY_HASH, req.globalBitlyHash)
                nv.set(OneUsaGovTopology.JSON_PAYLOAD, req)
                return true
            }
        }
        return false
    }

    void handleMessage(String msg) { queue.offer(msg) }
}
jackson json to java
@JsonIgnoreProperties(ignoreUnknown = true)
class OneUsaGovRequest implements Serializable {
    String userAgent;
    String countryCode;
    int knownUser;
    String globalBitlyHash;
    String encodingUserBitlyHash;
    String encodingUserLogin;
}
Spring converts JSON to a Java object for you:
<bean id="restTemplate"
<property name="messageConverters">
<bean id="messageConverter"
spout data provider spring-managed bean
<bean id="oneUsaGovStreamingDataProvider"
<property name="messageStream">
<bean class="com.bigdatajumpstart.netty.HttpClient">
<constructor-arg index="0" value="${streamUrl}"/>
Note: when building the StormTopology to submit to Storm, you do:
new SpringSpout("oneUsaGovStreamingDataProvider", spoutFields), 1)
spout data provider unit test
class OneUsaGovStreamingDataProviderTest extends StreamingDataProviderTestBase {
    void testDataProvider() {
        // mock json to simulate data from feed (remaining fields are truncated on the slide)
        String jsonStr = '''{
            "a": "user-agent", "c": "US",
            "nk": 0, "tz": "America/Los_Angeles",
            "gr": "OR", "g": "2BktiW",
            "h": "12Me4B2", "l": "usairforce",
            "al": "en-us", "hh": "",
            "r": ""
        }'''
        OneUsaGovStreamingDataProvider dataProvider = new OneUsaGovStreamingDataProvider()
        // use Mockito to satisfy dependencies not needed for this test
        dataProvider.setMessageStream(mock(MessageStream)) // Config setup in base class
        NamedValues record = new NamedValues(OneUsaGovTopology.spoutFields)
        // asserts to verify data provider works correctly (not shown on the slide)
    }
}
rolling count bolt
• counts frequency of links in a sliding time window
• emits topN in current window every M seconds
• uses tick tuple trick provided by Storm to emit every
M seconds (configurable)
• provided with the storm-starter project
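A minimal sketch of the sliding-window idea described above, assuming a slot-based design in which each tick emits the window totals and then recycles the oldest slot. The class `RollingCountSketch` and its method names are hypothetical; this is not the storm-starter code.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

// Illustrative sketch only; names and design are assumptions.
class RollingCountSketch {
    private final int numSlots;
    private final Map<String, long[]> counts = new HashMap<>();
    private int head = 0;

    RollingCountSketch(int numSlots) {
        this.numSlots = numSlots;
    }

    // Count one occurrence of the key in the current slot.
    void increment(String key) {
        counts.computeIfAbsent(key, k -> new long[numSlots])[head]++;
    }

    // Called on each tick: sum all slots, then advance and clear the oldest slot.
    Map<String, Long> totalsThenAdvance() {
        Map<String, Long> totals = new HashMap<>();
        for (Map.Entry<String, long[]> e : counts.entrySet()) {
            long sum = 0;
            for (long c : e.getValue()) {
                sum += c;
            }
            totals.put(e.getKey(), sum);
        }
        head = (head + 1) % numSlots;
        Iterator<Map.Entry<String, long[]>> it = counts.entrySet().iterator();
        while (it.hasNext()) {
            long[] slots = it.next().getValue();
            slots[head] = 0;
            boolean allZero = true;
            for (long c : slots) {
                if (c != 0) {
                    allZero = false;
                    break;
                }
            }
            if (allZero) {
                it.remove(); // key has left the window entirely
            }
        }
        return totals;
    }

    // Rank keys by count, highest first, and keep the first n.
    static List<String> topN(Map<String, Long> totals, int n) {
        List<Map.Entry<String, Long>> entries = new ArrayList<>(totals.entrySet());
        entries.sort((x, y) -> Long.compare(y.getValue(), x.getValue()));
        List<String> result = new ArrayList<>();
        for (int i = 0; i < Math.min(n, entries.size()); i++) {
            result.add(entries.get(i).getKey());
        }
        return result;
    }

    public static void main(String[] args) {
        RollingCountSketch window = new RollingCountSketch(2);
        window.increment("2BktiW");
        System.out.println(topN(window.totalsThenAdvance(), 1));
    }
}
```

In a topology, `totalsThenAdvance()` would be driven by the tick tuple, so the window length is the number of slots times the tick interval.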
• calls out to API
• caches results locally in the bolt instance
• relies on field grouping (incoming tuples)
• outputs data to be indexed in Solr
• benefits from parallelism to enrich more links
concurrently (watch those rate limits)
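The local caching mentioned above could look roughly like the following: a small LRU cache keyed by link hash, so repeated links skip the rate-limited API call. `LinkMetadataCache` and its `loader` parameter are hypothetical names, and the sketch assumes each bolt instance is called from a single thread, so it does no locking.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

// Illustrative sketch only; not thread-safe by design.
class LinkMetadataCache<V> {
    private final Map<String, V> cache;

    LinkMetadataCache(final int maxEntries) {
        // accessOrder = true turns the map into a least-recently-used ordering
        this.cache = new LinkedHashMap<String, V>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, V> eldest) {
                return size() > maxEntries;
            }
        };
    }

    // Return the cached value, calling the loader (e.g. the web service) only on a miss.
    V get(String hash, Function<String, V> loader) {
        V value = cache.get(hash);
        if (value == null) {
            value = loader.apply(hash);
            cache.put(hash, value);
        }
        return value;
    }

    int entryCount() {
        return cache.size();
    }

    public static void main(String[] args) {
        LinkMetadataCache<String> cache = new LinkMetadataCache<>(100);
        System.out.println(cache.get("2BktiW", h -> "metadata for " + h));
    }
}
```

The cache only pays off because of the fields grouping: every tuple for a given hash reaches the same bolt instance, so one instance's cache sees all the repeats for its keys.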
enrich link metadata bolt service
class EmbedlyService {
    RestTemplate restTemplate
    String apiKey
    private Timer apiTimer = MetricsSupport.timer(EmbedlyService, "apiCall")

    Embedly getLinkMetadata(String link) {
        String urlEncoded = URLEncoder.encode(link, "UTF-8")
        // the API base URL is missing from the slide text
        URI uri = new URI("${apiKey}&url=${urlEncoded}")
        Embedly embedly = null
        // simple closure to time our requests to the Web service
        MetricsSupport.withTimer(apiTimer, {
            embedly = restTemplate.getForObject(uri, Embedly)
        })
        return embedly
    }
}
integrate coda hale metrics
• capture runtime behavior of the components in your topology
• Coda Hale metrics
• output metrics every N minutes
• report metrics to JMX, ganglia, graphite, etc
-- Meters ----------------------------------------------------------------------
count = 97
mean rate = 0.81 events/second
1-minute rate = 0.89 events/second
5-minute rate = 1.62 events/second
15-minute rate = 1.86 events/second
count = 60
mean rate = 0.50 events/second
1-minute rate = 0.41 events/second
5-minute rate = 0.16 events/second
15-minute rate = 0.06 events/second
-- Timers ----------------------------------------------------------------------
count = 60
mean rate = 0.50 calls/second
1-minute rate = 0.40 calls/second
5-minute rate = 0.16 calls/second
15-minute rate = 0.06 calls/second
min = 138.70 milliseconds
max = 7642.92 milliseconds
mean = 1148.29 milliseconds
stddev = 1281.40 milliseconds
median = 652.83 milliseconds
75% <= 1620.96 milliseconds
storm cluster concepts
• nimbus: master node (~job tracker in Hadoop)
• zookeeper: cluster management / coordination
• supervisor: one per node in the cluster to manage worker processes
• worker: one or more per supervisor (JVM process)
• executor: thread in worker
• task: work performed by a spout or bolt
[Diagram: Node 1 runs a Supervisor (1 per node) that manages Worker 1 (port 6701), a JVM process, through ... N workers; the cluster spans ... M nodes]
Each component (spout or bolt) is distributed across a cluster of workers based on a configurable parallelism level
StormTopology build(StreamingApp app) throws Exception {
TopologyBuilder builder = new TopologyBuilder()
new SpringSpout("oneUsaGovStreamingDataProvider", spoutFields), 1)
new SpringBolt("enrichLinkAction", enrichedLinkFields), 3)
.fieldsGrouping("", globalBitlyHashGrouping)
parallelism hint to the framework (can be rebalanced)
solr integration points
• real-time get
• near real-time indexing (NRT)
• percolate (match incoming docs to pre-existing queries)
real-time get
use Solr for fast lookups by document ID
class SolrClient {
    SolrServer solrServer

    SolrDocument get(String docId, String... fields) {
        SolrQuery q = new SolrQuery()
        // send the request to the "get" request handler
        q.setRequestHandler("/get")
        q.set("id", docId)
        QueryRequest req = new QueryRequest(q)
        req.setResponseParser(new BinaryResponseParser())
        QueryResponse rsp = req.process(solrServer)
        return (SolrDocument) rsp.getResponse().get("doc")
    }
}
near real-time indexing
• If possible, use CloudSolrServer to route documents directly
to the correct shard leaders (SOLR-4816)
• Use <openSearcher>false</openSearcher> for auto “hard” commits
• Use auto soft commits as needed
• Use parallelism of Storm bolt to distribute indexing work to N
• match incoming documents to pre-configured
queries (inverted search)
– example: Is this tweet related to campaign Y for brand X?
• use storm’s distributed computation support to
evaluate M pre-configured queries per doc
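To make the inverted-search idea concrete, here is a toy sketch in which each stored query is just a set of terms that must all appear in the incoming document. It is a deliberately simplified stand-in for a real query engine; `PercolatorSketch`, `register`, and `percolate` are hypothetical names, and the whitespace/punctuation tokenization is an assumption.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Illustrative sketch only; real percolation would evaluate full search queries.
class PercolatorSketch {
    // query id -> terms that must all be present in a matching document
    private final Map<String, Set<String>> queries = new LinkedHashMap<>();

    void register(String queryId, String... requiredTerms) {
        Set<String> terms = new HashSet<>();
        for (String term : requiredTerms) {
            terms.add(term.toLowerCase(Locale.ROOT));
        }
        queries.put(queryId, terms);
    }

    // Return the ids of all stored queries that the document satisfies.
    List<String> percolate(String docText) {
        String[] tokens = docText.toLowerCase(Locale.ROOT).split("\\W+");
        Set<String> docTerms = new HashSet<>(Arrays.asList(tokens));
        List<String> matches = new ArrayList<>();
        for (Map.Entry<String, Set<String>> query : queries.entrySet()) {
            if (docTerms.containsAll(query.getValue())) {
                matches.add(query.getKey());
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        PercolatorSketch percolator = new PercolatorSketch();
        percolator.register("brandX-campaignY", "brandx", "campaigny");
        System.out.println(percolator.percolate("Loving the new CampaignY from BrandX!"));
    }
}
```

Each bolt instance would hold only a slice of the M queries, so the per-document cost is spread across the cluster rather than paid in one loop.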
two possible approaches
• Lucene-only solution using MemoryIndex
– See presentation by Charlie Hull and Alan Woodward
• EmbeddedSolrServer
– Full solrconfig.xml / schema.xml
– RAMDirectory
– Relies on Storm to scale up documents / second
– Easy solution for up to a few thousand queries
[Diagram: PercolatorBolt 1 ... PercolatorBolt N (could be 100’s of these), fed through a stream grouping; queries stored in a database; ZeroMQ pub/sub to push query changes to percolator]
tick tuples
• send a special kind of tuple to a bolt every N seconds
if (TupleHelpers.isTickTuple(input)) {
    // do special work
}
used in percolator to delete accumulated documents every minute or so ...
• Storm Wiki
• Overview: Krishna Gade
• Trending Topics: Michael Knoll
• Understanding Parallelism: Michael Knoll
get the code:
Q & A
Manning coupon codes for conference related books:

Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Mark Simos

Integrate Solr with real-time stream processing applications

  • 3. whoami: independent consultant, search / big data projects; soon to be joining engineering team @LucidWorks; co-author of Solr In Action; previously big data architect at Dachis Group
  • 4. my storm story: re-designed a complex batch-oriented indexing pipeline based on Hadoop (Oozie, Pig, Hive, Sqoop) to a real-time storm topology
  • 5. agenda
      • walk through how to develop a storm topology
      • common integration points with Solr (near real-time indexing, percolator, real-time get)
  • 6. example: listen to click events from a URL shortener to determine trending US government sites, consumed as a stream of click events
  • 7. beyond word count
      • tackle real challenges you’ll encounter when developing a storm topology
      • and what about ... unit testing, dependency injection, measuring runtime behavior of your components, separation of concerns, reducing boilerplate, hiding complexity ...
  • 8. storm
      • open source distributed computation system
      • scalability, fault-tolerance, guaranteed message processing (optional)
  • 9. storm primitives
      • tuple: ordered list of values
      • stream: unbounded sequence of tuples
      • spout: emits a stream of tuples (source)
      • bolt: performs some operation on each tuple
      • topology: DAG of spouts and bolts
  • 10. solution requirements
      • receive click events from stream
      • count frequency of pages in a time window
      • rank top N sites per time window
      • extract title, body text, image for each link
      • persist rankings and metadata for visualization
  • 14. stream grouping
      • shuffle: random distribution of tuples to all instances of a bolt
      • field(s): group tuples by one or more fields in common
      • global: reduce down to one
      • all: replicate stream to all instances of a bolt
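To make the field(s) grouping concrete: the idea is that tuples with equal values in the grouping fields always reach the same bolt instance. A minimal sketch, assuming a simple hash-modulo assignment; the class and method names are hypothetical and this is not Storm's own code:

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical illustration of fields grouping; not Storm's implementation. */
final class FieldsGroupingSketch {

    /** Picks a task index in [0, numTasks) from the values of the grouping fields. */
    static int chooseTask(List<Object> groupingValues, int numTasks) {
        if (numTasks <= 0) {
            throw new IllegalArgumentException("numTasks must be positive");
        }
        // floorMod keeps the result non-negative even when hashCode() is negative.
        return Math.floorMod(groupingValues.hashCode(), numTasks);
    }

    public static void main(String[] args) {
        int tasks = 3;
        // The same hash value appears twice and is routed to the same task both times.
        for (String hash : Arrays.asList("2BktiW", "12Me4B2", "2BktiW")) {
            int task = chooseTask(Arrays.<Object>asList(hash), tasks);
            System.out.println(hash + " -> task " + task);
        }
    }
}
```

Because the assignment depends only on the field values, any per-key state a bolt keeps (counts, caches) stays on one instance.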
  • 15. useful storm concepts
      • bolts can receive input from many spouts
      • tuples in a stream can be grouped together
      • streams can be split and joined
      • bolts can inject new tuples into the stream
      • components can be distributed across a cluster at a configurable parallelism level
      • optionally, storm keeps track of each tuple emitted by a spout (ack or fail)
  • 16. tools
      • Spring framework – dependency injection, configuration, unit testing, mature, etc.
      • Groovy – keeps your code tidy and elegant
      • Mockito – ignore stuff your test doesn’t care about
      • Netty – fast & powerful NIO networking library
      • Coda Hale metrics – get visibility into how your bolts and spouts are performing (at a very low level)
  • 17. spout
      • easy! just produce a stream of tuples ...
      • and ... avoid blocking when waiting for more data, ease off the throttle if the topology is not processing fast enough, deal with failed tuples, choose whether to use message IDs for each tuple emitted, data model / schema, etc ...
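The "avoid blocking" and "ease off throttle" points can be sketched with a bounded queue between the thread receiving data and the loop that emits tuples. This is an illustrative sketch only; the class name and the drop-when-full policy are assumptions, not taken from the deck:

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

/** Hypothetical buffer between a network listener thread and a spout's polling loop. */
final class NonBlockingMessageBuffer {
    private final BlockingQueue<String> queue;

    NonBlockingMessageBuffer(int capacity) {
        this.queue = new ArrayBlockingQueue<>(capacity);
    }

    /** Called by the listener thread; returns false (message not accepted) when the buffer is full. */
    boolean handleMessage(String msg) {
        return queue.offer(msg); // offer() never blocks
    }

    /** Called from the emitting loop; returns null immediately when nothing is waiting. */
    String next() {
        return queue.poll(); // poll() never blocks
    }

    public static void main(String[] args) {
        NonBlockingMessageBuffer buffer = new NonBlockingMessageBuffer(2);
        System.out.println(buffer.handleMessage("click-1"));
        System.out.println(buffer.handleMessage("click-2"));
        System.out.println(buffer.handleMessage("click-3")); // buffer is full here
        System.out.println(buffer.next());
    }
}
```

The bounded capacity is what gives the producer a signal that the topology is falling behind; what to do with a rejected message (drop, retry, slow the source) is a design decision left open here.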
  • 18. [diagram] SpringSpout delegates to a StreamingDataProvider (POJO) and SpringBolt delegates to a StreamingDataAction (POJO); both are wired by Spring dependency injection (JDBC, WebService, ...) inside a Spring container (1 per topology per JVM). Hides the complexity of implementing the Storm contract; the developer focuses on business logic.
  • 19. streaming data provider
      class OneUsaGovStreamingDataProvider implements StreamingDataProvider, MessageHandler {

        MessageStream messageStream   // set by Spring dependency injection
        ...

        void open(Map stormConf) { messageStream.receive(this) }

        boolean next(NamedValues nv) {
          String msg = queue.poll()   // non-blocking call to get the next message
          if (msg) {
            // use the Jackson JSON parser to create an object from the raw incoming data
            OneUsaGovRequest req = objectMapper.readValue(msg, OneUsaGovRequest)
            if (req != null && req.globalBitlyHash != null) {
              nv.set(OneUsaGovTopology.GLOBAL_BITLY_HASH, req.globalBitlyHash)
              nv.set(OneUsaGovTopology.JSON_PAYLOAD, req)
              return true
            }
          }
          return false
        }

        void handleMessage(String msg) { queue.offer(msg) }
      }
  • 20. jackson json to java
      @JsonIgnoreProperties(ignoreUnknown = true)
      class OneUsaGovRequest implements Serializable {
        @JsonProperty("a") String userAgent;
        @JsonProperty("c") String countryCode;
        @JsonProperty("nk") int knownUser;
        @JsonProperty("g") String globalBitlyHash;
        @JsonProperty("h") String encodingUserBitlyHash;
        @JsonProperty("l") String encodingUserLogin;
        ...
      }

      Spring converts json to a java object for you:

      <bean id="restTemplate" class="org.springframework.web.client.RestTemplate">
        <property name="messageConverters">
          <list>
            <bean id="messageConverter" class="...json.MappingJackson2HttpMessageConverter">
            </bean>
          </list>
        </property>
      </bean>
  • 21. spout data provider spring-managed bean
      <bean id="oneUsaGovStreamingDataProvider"
            class="com.bigdatajumpstart.storm.OneUsaGovStreamingDataProvider">
        <property name="messageStream">
          <bean class="com.bigdatajumpstart.netty.HttpClient">
            <constructor-arg index="0" value="${streamUrl}"/>
          </bean>
        </property>
      </bean>

      Note: when building the StormTopology to submit to Storm, you do:

      builder.setSpout("", new SpringSpout("oneUsaGovStreamingDataProvider", spoutFields), 1)
  • 22. spout data provider unit test
      class OneUsaGovStreamingDataProviderTest extends StreamingDataProviderTestBase {
        @Test
        void testDataProvider() {
          // mock json to simulate data from the feed
          String jsonStr = '''{
            "a": "user-agent", "c": "US", "nk": 0, "tz": "America/Los_Angeles",
            "gr": "OR", "g": "2BktiW", "h": "12Me4B2", "l": "usairforce",
            "al": "en-us", "hh": "", "r": "", ...
          }'''
          OneUsaGovStreamingDataProvider dataProvider = new OneUsaGovStreamingDataProvider()
          // use Mockito to satisfy dependencies not needed for this test
          dataProvider.setMessageStream(mock(MessageStream))
          // Config setup in base class
          dataProvider.handleMessage(jsonStr)
          NamedValues record = new NamedValues(OneUsaGovTopology.spoutFields)
          // asserts to verify the data provider works correctly
          assertTrue ...
        }
      }
  • 23. rolling count bolt
      • counts frequency of links in a sliding time window
      • emits topN in current window every M seconds
      • uses tick tuple trick provided by Storm to emit every M seconds (configurable)
      • provided with the storm-starter project
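The sliding-window counting idea can be sketched in plain Java with one counter slot per tick interval. This is a simplified illustration, not the storm-starter code; the class name, the slot layout, and the tie-breaking rule in `topN` are all assumptions:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical slot-based sliding window counter; one slot per tick interval. */
final class SlidingWindowCounterSketch {
    private final int numSlots;
    private final Map<String, long[]> slotsByKey = new HashMap<>();
    private int head = 0; // slot currently receiving increments

    SlidingWindowCounterSketch(int numSlots) {
        if (numSlots < 1) {
            throw new IllegalArgumentException("numSlots must be at least 1");
        }
        this.numSlots = numSlots;
    }

    /** Counts one occurrence of the key in the current slot. */
    void increment(String key) {
        slotsByKey.computeIfAbsent(key, k -> new long[numSlots])[head]++;
    }

    /** Call on every tick: returns totals over the whole window, then slides the window by one slot. */
    Map<String, Long> countsThenAdvance() {
        Map<String, Long> totals = new HashMap<>();
        for (Map.Entry<String, long[]> e : slotsByKey.entrySet()) {
            long sum = 0;
            for (long c : e.getValue()) {
                sum += c;
            }
            totals.put(e.getKey(), sum);
        }
        head = (head + 1) % numSlots;
        // The slot about to be reused holds the oldest counts, so wipe it.
        slotsByKey.values().forEach(slots -> slots[head] = 0);
        // Drop keys that no longer have any counts inside the window.
        slotsByKey.values().removeIf(slots -> {
            for (long c : slots) {
                if (c != 0) {
                    return false;
                }
            }
            return true;
        });
        return totals;
    }

    /** Returns up to n keys ordered by descending count (ties broken by key). */
    static List<String> topN(Map<String, Long> totals, int n) {
        List<Map.Entry<String, Long>> entries = new ArrayList<>(totals.entrySet());
        entries.sort((x, y) -> {
            int byCount = Long.compare(y.getValue(), x.getValue());
            return byCount != 0 ? byCount : x.getKey().compareTo(y.getKey());
        });
        List<String> top = new ArrayList<>();
        for (int i = 0; i < Math.min(n, entries.size()); i++) {
            top.add(entries.get(i).getKey());
        }
        return top;
    }
}
```

In a topology, `increment` would be called for each incoming link tuple and `countsThenAdvance` on each tick tuple, so the window length is the number of slots times the tick interval.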
  • 24. enrich link metadata bolt
      • calls out to API
      • caches results locally in the bolt instance
      • relies on field grouping (incoming tuples)
      • outputs data to be indexed in Solr
      • benefits from parallelism to enrich more links concurrently (watch those rate limits)
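Local caching inside a bolt instance pairs naturally with field grouping: since the same link hash always reaches the same instance, a small per-instance cache avoids repeated API calls. A minimal sketch of such a cache, assuming least-recently-used eviction and access from a single thread; the class name and the loader function are hypothetical:

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

/** Hypothetical per-bolt-instance LRU cache; not thread-safe by design. */
final class LinkMetadataCacheSketch<V> {
    private final Map<String, V> cache;
    private final Function<String, V> loader; // stands in for the remote API call
    private int loads = 0;

    LinkMetadataCacheSketch(final int maxEntries, Function<String, V> loader) {
        this.loader = loader;
        // accessOrder=true makes iteration order least-recently-used first.
        this.cache = new LinkedHashMap<String, V>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, V> eldest) {
                return size() > maxEntries;
            }
        };
    }

    /** Returns cached metadata, invoking the loader only on a cache miss. */
    V get(String link) {
        V cached = cache.get(link);
        if (cached == null) {
            cached = loader.apply(link);
            loads++;
            cache.put(link, cached);
        }
        return cached;
    }

    /** Number of times the loader was invoked (useful as a cache-miss metric). */
    int loadCount() {
        return loads;
    }
}
```

With shuffle grouping instead of field grouping, the same link could land on any instance and the hit rate of a cache like this would drop accordingly.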
  • 25. service
      class EmbedlyService {

        @Autowired RestTemplate restTemplate
        String apiKey

        // integrate coda hale metrics
        private Timer apiTimer = MetricsSupport.timer(EmbedlyService, "apiCall")

        Embedly getLinkMetadata(String link) {
          String urlEncoded = URLEncoder.encode(link,"UTF-8")
          URI uri = new URI("${apiKey}&url=${urlEncoded}")
          Embedly embedly = null
          // simple closure to time our requests to the Web service
          MetricsSupport.withTimer(apiTimer, {
            embedly = restTemplate.getForObject(uri, Embedly)
          })
          return embedly
        }
      }
  • 26. metrics
      • capture runtime behavior of the components in your topology
      • Coda Hale metrics
      • output metrics every N minutes
      • report metrics to JMX, ganglia, graphite, etc
  • 27. -- Meters ----------------------------------------------------------------------
      EnrichLinkBoltLogic.solrQueries
                   count = 97
               mean rate = 0.81 events/second
           1-minute rate = 0.89 events/second
           5-minute rate = 1.62 events/second
          15-minute rate = 1.86 events/second
      SolrBoltLogic.linksIndexed
                   count = 60
               mean rate = 0.50 events/second
           1-minute rate = 0.41 events/second
           5-minute rate = 0.16 events/second
          15-minute rate = 0.06 events/second
      -- Timers ----------------------------------------------------------------------
      EmbedlyService.apiCall
                   count = 60
               mean rate = 0.50 calls/second
           1-minute rate = 0.40 calls/second
           5-minute rate = 0.16 calls/second
          15-minute rate = 0.06 calls/second
                     min = 138.70 milliseconds
                     max = 7642.92 milliseconds
                    mean = 1148.29 milliseconds
                  stddev = 1281.40 milliseconds
                  median = 652.83 milliseconds
                    75% <= 1620.96 milliseconds
      ...
  • 28. storm cluster concepts
      • nimbus: master node (~job tracker in Hadoop)
      • zookeeper: cluster management / coordination
      • supervisor: one per node in the cluster to manage worker processes
      • worker: one or more per supervisor (JVM process)
      • executor: thread in worker
      • task: work performed by a spout or bolt
  • 29. [diagram] Nimbus distributes the topology JAR; Zookeeper coordinates the cluster; each of M nodes runs a Supervisor (1 per node) managing N workers (JVM processes, e.g. Worker 1 on port 6701), each running executors (threads). Each component (spout or bolt) is distributed across a cluster of workers based on a configurable parallelism.
  • 30. @Override
      StormTopology build(StreamingApp app) throws Exception {
        ...
        TopologyBuilder builder = new TopologyBuilder()
        builder.setSpout("", new SpringSpout("oneUsaGovStreamingDataProvider", spoutFields), 1)
        // last argument: parallelism hint to the framework (can be rebalanced)
        builder.setBolt("enrich-link-bolt", new SpringBolt("enrichLinkAction", enrichedLinkFields), 3)
               .fieldsGrouping("", globalBitlyHashGrouping)
        ...
  • 31. solr integration points
      • real-time get
      • near real-time indexing (NRT)
      • percolate (match incoming docs to pre-existing queries)
  • 32. real-time get: use Solr for fast lookups by document ID
      class SolrClient {
        @Autowired SolrServer solrServer

        SolrDocument get(String docId, String... fields) {
          SolrQuery q = new SolrQuery()
          q.setRequestHandler("/get")   // send the request to the "get" request handler
          q.set("id", docId)
          q.setFields(fields)
          QueryRequest req = new QueryRequest(q)
          req.setResponseParser(new BinaryResponseParser())
          QueryResponse rsp = req.process(solrServer)
          return (SolrDocument)rsp.getResponse().get("doc")
        }
      }
  • 33. near real-time indexing
      • If possible, use CloudSolrServer to route documents directly to the correct shard leaders (SOLR-4816)
      • Use <openSearcher>false</openSearcher> for auto “hard” commits
      • Use auto soft commits as needed
      • Use parallelism of Storm bolt to distribute indexing work to N nodes
  • 34. percolate
      • match incoming documents to pre-configured queries (inverted search)
          – example: Is this tweet related to campaign Y for brand X?
      • use storm’s distributed computation support to evaluate M pre-configured queries per doc
  • 35. two possible approaches
      • Lucene-only solution using MemoryIndex
          – See presentation by Charlie Hull and Alan Woodward
      • EmbeddedSolrServer
          – Full solrconfig.xml / schema.xml
          – RAMDirectory
          – Relies on Storm to scale up documents / second
          – Easy solution for up to a few thousand queries
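The inverted-search idea behind percolation can be shown without Lucene or Solr: store the queries, then test each incoming document against all of them. The sketch below is a deliberately simplified stand-in (each query is just a set of required terms); it is neither of the two approaches named above, and all names in it are hypothetical:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

/** Hypothetical percolator: queries are registered up front, documents are matched as they arrive. */
final class PercolatorSketch {
    // query id -> terms that must all appear in a document for the query to match
    private final Map<String, Set<String>> queries = new LinkedHashMap<>();

    void register(String queryId, String... requiredTerms) {
        Set<String> terms = new HashSet<>();
        for (String term : requiredTerms) {
            terms.add(term.toLowerCase(Locale.ROOT));
        }
        queries.put(queryId, terms);
    }

    /** Returns the ids of all registered queries that the document satisfies. */
    List<String> percolate(String docText) {
        String[] tokens = docText.toLowerCase(Locale.ROOT).split("\\W+");
        Set<String> docTerms = new HashSet<>(Arrays.asList(tokens));
        List<String> matches = new ArrayList<>();
        for (Map.Entry<String, Set<String>> query : queries.entrySet()) {
            if (docTerms.containsAll(query.getValue())) {
                matches.add(query.getKey());
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        PercolatorSketch percolator = new PercolatorSketch();
        percolator.register("brandX-campaignY", "brandx", "sale");
        percolator.register("storm-news", "storm");
        System.out.println(percolator.percolate("BrandX summer SALE starts now"));
    }
}
```

The cost per document grows with the number of registered queries, which is why spreading documents across many percolator instances matters once the query set is large.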
  • 36. [diagram] Twitter Spout feeds PercolatorBolt 1 ... PercolatorBolt N (could be 100’s of these) via random stream grouping; each bolt holds an Embedded SolrServer; pre-configured queries are stored in a database; ZeroMQ pub/sub pushes query changes to the percolators.
  • 37. tick tuples
      • send a special kind of tuple to a bolt every N seconds
          if (TupleHelpers.isTickTuple(input)) {
            // do special work
          }
      • used in percolator to delete accumulated documents every minute or so ...
  • 38. references
      • Storm Wiki
      • Overview: Krishna Gade
      • Trending Topics: Michael Knoll (trending-topics-in-storm/)
      • Understanding Parallelism: Michael Knoll (parallelism-of-a-storm-topology/)
  • 39. get the code: Q & A Manning coupon codes for conference related books: