SlideShare uma empresa Scribd logo
1 de 31
1©MapR Technologies - Confidential
Real-time and Long-time with
Storm and Hadoop
2©MapR Technologies - Confidential
Real-time and Long-time with
Storm and Hadoop MapR
3©MapR Technologies - Confidential
 Contact:
– tdunning@maprtech.com
– @ted_dunning
 Slides and such (available late tonight):
– http://info.mapr.com/ted-utahjug
 Hash tags: #mapr #storm
4©MapR Technologies - Confidential
The Challenge
 Hadoop is great of processing vats of data
– But sucks for real-time (by design!)
 Storm is great for real-time processing
– But lacks any way to deal with batch processing
 It sounds like there isn’t a solution
– Neither fashionable solution handles everything
5©MapR Technologies - Confidential
This is not a problem.
It’s an opportunity!
6©MapR Technologies - Confidential
What is Map-Reduce?
 Map-reduce programs are defined (mostly) by
– A map function that does independent record transformations (and
deletions and replications)
– Reduce functions that do aggregation
 Map-reduce programs run in framework that
– Schedules and re-runs tasks
– Splits the input
– Moves map outputs to reduce inputs
– Receives the results
6
7©MapR Technologies - Confidential
Inside Map-Reduce
7
Input Map CombineShuffle
and sort
Reduce Output
Reduce
"The time has come," the Walrus said,
"To talk of many things:
Of shoes—and ships—and sealing-wax
the, 1
time, 1
has, 1
come, 1
…
come, [3,2,1]
has, [1,5,2]
the, [1,2,1]
time,
[10,1,3]
…
come, 6
has, 8
the, 4
time, 14
…
8©MapR Technologies - Confidential
Not Just Text
 Counting words is easy
 Many other problems work as well
– Sessionize user logs
– Very large scale joins
– Large scale matrix recommendations
– Computing the quadrillionth digit of π
 Map-reduce is inherently batch oriented
9©MapR Technologies - Confidential
Inside Map-Reduce
9
Input Map Shuffle
and sort
Reduce Output
road1, polyline(p1, p2, p3, …)
lake1, polygon(p4, p5, p7, p9, …)
road2, polyline(p6, p7, p9, …)
tile0918-1412, road1
tile1082-8143, road1
tile0014-3284, lake1
tile1082-8143, lake1
…
tile0918-1412, [road1]
tile1082-8143, [road1, lake1]
tile0014-3284, [lake1]
…
tile1082-8143, img#1
tile0014-3284, img#2
tile1082-8143, img#3
…
10©MapR Technologies - Confidential
What is Storm?
 A Storm program is called a topology
– Spouts inject data into a topology
– Bolts process data
 The units of data are called tuples
 All processing is flow-through
 Bolts can buffer or persist
 Output tuples can be anchored
 If a tuple and all descendants are not ack’ed in time, bolt has died
 Bolts that fail are restarted and un-acked tuples are replayed
11©MapR Technologies - Confidential
t
now
Hadoop is Not Very Real-time
Unprocessed
Data
Fully
processed
Latest full
period
Hadoop job
takes this
long for this
data
12©MapR Technologies - Confidential
t
now
Hadoop works
great back here
Storm
works
here
Real-time and Long-time together
Blended
view
Blended
view
Blended
View
13©MapR Technologies - Confidential
One Alternative
Search
Engine
NoSql
de Jour
Consumer
Real-time Long-time
?
14©MapR Technologies - Confidential
Problems
 Simply dumping into noSql engine doesn’t quite work
 Insert rate is limited
 No load isolation
– Big retrospective jobs kill real-time
 Low scan performance
– Hbase pretty good, but not stellar
 Difficult to set boundaries
– where does real-time end and long-time begin?
15©MapR Technologies - Confidential
Data
Sources
Catcher
Cluster
Rough Design – Data Flow
Catcher
Cluster
Query Event
Spout
Logger
Bolt
Counter
Bolt
Raw
Logs
Logger
Bolt
Semi
Agg
Hadoop
Aggregator
Snap
Long
agg
ProtoSpout
Counter
Bolt
Logger
Bolt
Data
Sources
16©MapR Technologies - Confidential
Closer Look – Catcher Protocol
Data
Sources
Catcher
Cluster
Catcher
Cluster
Data
Sources
The data sources and catchers
communicate with a very simple
protocol.
Hello() => list of catchers
Log(topic,message) =>
(OK|FAIL, redirect-to-catcher)
17©MapR Technologies - Confidential
Closer Look – Catcher Queues
Catcher
Cluster
Catcher
Cluster
The catchers forward log requests
to the correct catcher and return
that host in the reply to allow the
client to avoid the extra hop.
Each topic file is appended by
exactly one catcher.
Topic files are kept in shared file
storage.
Topic
File
Topic
File
18©MapR Technologies - Confidential
Closer Look – ProtoSpout
The ProtoSpout tails the topic files,
parses log records into tuples and
injects them into the Storm
topology.
Last fully acked position stored in
shared file system.
Topic
File
Topic
File
ProtoSpout
19©MapR Technologies - Confidential
Closer Look – Counter Bolt
 Critical design goals:
– fast ack for all tuples
– fast restart of counter
 Ack happens when tuple hits the replay log (10’s of milliseconds)
 Restart involves replaying semi-agg’s + replay log (very fast)
 Replay log only lasts until next semi-aggregate goes out
Counter
Bolt
Replay
Log
Semi-
aggregated
records
Incoming
records
Real-time Long-time
20©MapR Technologies - Confidential
A Frozen Moment in Time
 Snapshot defines the dividing
line
 All data in the snap is long-
time, all after is real-time
 Semi-agg strategy allows clean
query
Semi
Agg
Hadoop
Aggregator
Snap
Long
agg
21©MapR Technologies - Confidential
Guarantees
 Counter output volume is small-ish
– the greater of k tuples per 100K inputs or k tuple/s
– 1 tuple/s/label/bolt for this exercise
 Persistence layer must provide guarantees
– distributed against node failure
– must have either readable flush or closed-append
 HDFS is distributed, but provides no guarantees and strange
semantics
 MapRfs is distributed, provides all necessary guarantees
22©MapR Technologies - Confidential
Presentation Layer
 Presentation must
– read recent output of Logger bolt
– read relevant output of Hadoop jobs
– combine semi-aggregated records
 User will see
– counts that increment within 0-2 s of events
– seamless and accurate meld of short and long-term data
23©MapR Technologies - Confidential
Example 2 – AB testing in real-time
 I have 15 versions of my landing page
 Each visitor is assigned to a version
– Which version?
 A conversion or sale or whatever can happen
– How long to wait?
 Some versions of the landing page are horrible
– Don’t want to give them traffic
29©MapR Technologies - Confidential
Bayesian Bandit
 Compute distributions based on data
 Sample p1 and p2 from these distributions
 Put a coin in bandit 1 if p1 > p2
 Else, put the coin in bandit 2
30©MapR Technologies - Confidential
And it works!
11000 100 200 300 400 500 600 700 800 900 1000
0.12
0
0.01
0.02
0.03
0.04
0.05
0.06
0.07
0.08
0.09
0.1
0.11
n
regret
ε- greedy, ε = 0.05
Bayesian Bandit with Gamma- Normal
31©MapR Technologies - Confidential
Video Demo
32©MapR Technologies - Confidential
The Code
 Select an alternative
 Select and learn
 But we already know how to count!
n = dim(k)[1]
p0 = rep(0, length.out=n)
for (i in 1:n) {
p0[i] = rbeta(1, k[i,2]+1, k[i,1]+1)
}
return (which(p0 == max(p0)))
for (z in 1:steps) {
i = select(k)
j = test(i)
k[i,j] = k[i,j]+1
}
return (k)
33©MapR Technologies - Confidential
The Basic Idea
 We can encode a distribution by sampling
 Sampling allows unification of exploration and exploitation
 Can be extended to more general response models
34©MapR Technologies - Confidential
The Bigger Basic Idea
 Online algorithms generally have relatively small state (like
counting)
 Online algorithms generally have a simple update (like counting)
 If we can do this with counting, we can do it with all kinds of
algorithms
37©MapR Technologies - Confidential
 Contact:
– tdunning@maprtech.com
– @ted_dunning
 Slides and such (available late tonight):
– Send email !!
 Hash tags: #mapr #bigdata
 We are hiring!
38©MapR Technologies - Confidential
Thank You

Mais conteúdo relacionado

Mais procurados

Foveated Ray Tracing for VR on Multiple GPUs
Foveated Ray Tracing for VR on Multiple GPUsFoveated Ray Tracing for VR on Multiple GPUs
Foveated Ray Tracing for VR on Multiple GPUsTakahiro Harada
 
Intro to Machine Learning for GPUs
Intro to Machine Learning for GPUsIntro to Machine Learning for GPUs
Intro to Machine Learning for GPUsSri Ambati
 
C-SAW: A Framework for Graph Sampling and Random Walk on GPUs
C-SAW: A Framework for Graph Sampling and Random Walk on GPUsC-SAW: A Framework for Graph Sampling and Random Walk on GPUs
C-SAW: A Framework for Graph Sampling and Random Walk on GPUsPandey_G
 
PFN Summer Internship 2019 / Kenshin Abe: Extension of Chainer-Chemistry for ...
PFN Summer Internship 2019 / Kenshin Abe: Extension of Chainer-Chemistry for ...PFN Summer Internship 2019 / Kenshin Abe: Extension of Chainer-Chemistry for ...
PFN Summer Internship 2019 / Kenshin Abe: Extension of Chainer-Chemistry for ...Preferred Networks
 
Parallel Implementation of K Means Clustering on CUDA
Parallel Implementation of K Means Clustering on CUDAParallel Implementation of K Means Clustering on CUDA
Parallel Implementation of K Means Clustering on CUDAprithan
 
Moving to g1 gc by Kirk Pepperdine.
Moving to g1 gc by Kirk Pepperdine.Moving to g1 gc by Kirk Pepperdine.
Moving to g1 gc by Kirk Pepperdine.J On The Beach
 
Optimization of Continuous Queries in Federated Database and Stream Processin...
Optimization of Continuous Queries in Federated Database and Stream Processin...Optimization of Continuous Queries in Federated Database and Stream Processin...
Optimization of Continuous Queries in Federated Database and Stream Processin...Zbigniew Jerzak
 
Adaptive Replication for Elastic Data Stream Processing
Adaptive Replication for Elastic Data Stream ProcessingAdaptive Replication for Elastic Data Stream Processing
Adaptive Replication for Elastic Data Stream ProcessingZbigniew Jerzak
 
유연하고 확장성 있는 빅데이터 처리
유연하고 확장성 있는 빅데이터 처리유연하고 확장성 있는 빅데이터 처리
유연하고 확장성 있는 빅데이터 처리NAVER D2
 
Using GPUs for Collision detection, Recent Advances in Real-Time Collision an...
Using GPUs for Collision detection, Recent Advances in Real-Time Collision an...Using GPUs for Collision detection, Recent Advances in Real-Time Collision an...
Using GPUs for Collision detection, Recent Advances in Real-Time Collision an...Takahiro Harada
 
Accelerate Reed-Solomon coding for Fault-Tolerance in RAID-like system
Accelerate Reed-Solomon coding for Fault-Tolerance in RAID-like systemAccelerate Reed-Solomon coding for Fault-Tolerance in RAID-like system
Accelerate Reed-Solomon coding for Fault-Tolerance in RAID-like systemShuai Yuan
 
HFSP: the Hadoop Fair Sojourn Protocol
HFSP: the Hadoop Fair Sojourn ProtocolHFSP: the Hadoop Fair Sojourn Protocol
HFSP: the Hadoop Fair Sojourn ProtocolMatteo Dell'Amico
 
[232]mist 고성능 iot 스트림 처리 시스템
[232]mist 고성능 iot 스트림 처리 시스템[232]mist 고성능 iot 스트림 처리 시스템
[232]mist 고성능 iot 스트림 처리 시스템NAVER D2
 
HTCC poster for CERN Openlab opendays 2015
HTCC poster for CERN Openlab opendays 2015HTCC poster for CERN Openlab opendays 2015
HTCC poster for CERN Openlab opendays 2015Karel Ha
 
[241]large scale search with polysemous codes
[241]large scale search with polysemous codes[241]large scale search with polysemous codes
[241]large scale search with polysemous codesNAVER D2
 
20181025_pgconfeu_lt_gstorefdw
20181025_pgconfeu_lt_gstorefdw20181025_pgconfeu_lt_gstorefdw
20181025_pgconfeu_lt_gstorefdwKohei KaiGai
 
Optimizing the graphics pipeline with compute
Optimizing the graphics pipeline with computeOptimizing the graphics pipeline with compute
Optimizing the graphics pipeline with computeWuBinbo
 
Storm Users Group Real Time Hadoop
Storm Users Group Real Time HadoopStorm Users Group Real Time Hadoop
Storm Users Group Real Time HadoopMapR Technologies
 

Mais procurados (20)

Foveated Ray Tracing for VR on Multiple GPUs
Foveated Ray Tracing for VR on Multiple GPUsFoveated Ray Tracing for VR on Multiple GPUs
Foveated Ray Tracing for VR on Multiple GPUs
 
Intro to Machine Learning for GPUs
Intro to Machine Learning for GPUsIntro to Machine Learning for GPUs
Intro to Machine Learning for GPUs
 
C-SAW: A Framework for Graph Sampling and Random Walk on GPUs
C-SAW: A Framework for Graph Sampling and Random Walk on GPUsC-SAW: A Framework for Graph Sampling and Random Walk on GPUs
C-SAW: A Framework for Graph Sampling and Random Walk on GPUs
 
PFN Summer Internship 2019 / Kenshin Abe: Extension of Chainer-Chemistry for ...
PFN Summer Internship 2019 / Kenshin Abe: Extension of Chainer-Chemistry for ...PFN Summer Internship 2019 / Kenshin Abe: Extension of Chainer-Chemistry for ...
PFN Summer Internship 2019 / Kenshin Abe: Extension of Chainer-Chemistry for ...
 
Parallel Implementation of K Means Clustering on CUDA
Parallel Implementation of K Means Clustering on CUDAParallel Implementation of K Means Clustering on CUDA
Parallel Implementation of K Means Clustering on CUDA
 
Moving to g1 gc by Kirk Pepperdine.
Moving to g1 gc by Kirk Pepperdine.Moving to g1 gc by Kirk Pepperdine.
Moving to g1 gc by Kirk Pepperdine.
 
Optimization of Continuous Queries in Federated Database and Stream Processin...
Optimization of Continuous Queries in Federated Database and Stream Processin...Optimization of Continuous Queries in Federated Database and Stream Processin...
Optimization of Continuous Queries in Federated Database and Stream Processin...
 
Adaptive Replication for Elastic Data Stream Processing
Adaptive Replication for Elastic Data Stream ProcessingAdaptive Replication for Elastic Data Stream Processing
Adaptive Replication for Elastic Data Stream Processing
 
유연하고 확장성 있는 빅데이터 처리
유연하고 확장성 있는 빅데이터 처리유연하고 확장성 있는 빅데이터 처리
유연하고 확장성 있는 빅데이터 처리
 
Using GPUs for Collision detection, Recent Advances in Real-Time Collision an...
Using GPUs for Collision detection, Recent Advances in Real-Time Collision an...Using GPUs for Collision detection, Recent Advances in Real-Time Collision an...
Using GPUs for Collision detection, Recent Advances in Real-Time Collision an...
 
Accelerate Reed-Solomon coding for Fault-Tolerance in RAID-like system
Accelerate Reed-Solomon coding for Fault-Tolerance in RAID-like systemAccelerate Reed-Solomon coding for Fault-Tolerance in RAID-like system
Accelerate Reed-Solomon coding for Fault-Tolerance in RAID-like system
 
HFSP: the Hadoop Fair Sojourn Protocol
HFSP: the Hadoop Fair Sojourn ProtocolHFSP: the Hadoop Fair Sojourn Protocol
HFSP: the Hadoop Fair Sojourn Protocol
 
[232]mist 고성능 iot 스트림 처리 시스템
[232]mist 고성능 iot 스트림 처리 시스템[232]mist 고성능 iot 스트림 처리 시스템
[232]mist 고성능 iot 스트림 처리 시스템
 
Apache Nemo
Apache NemoApache Nemo
Apache Nemo
 
Ping to Pong
Ping to PongPing to Pong
Ping to Pong
 
HTCC poster for CERN Openlab opendays 2015
HTCC poster for CERN Openlab opendays 2015HTCC poster for CERN Openlab opendays 2015
HTCC poster for CERN Openlab opendays 2015
 
[241]large scale search with polysemous codes
[241]large scale search with polysemous codes[241]large scale search with polysemous codes
[241]large scale search with polysemous codes
 
20181025_pgconfeu_lt_gstorefdw
20181025_pgconfeu_lt_gstorefdw20181025_pgconfeu_lt_gstorefdw
20181025_pgconfeu_lt_gstorefdw
 
Optimizing the graphics pipeline with compute
Optimizing the graphics pipeline with computeOptimizing the graphics pipeline with compute
Optimizing the graphics pipeline with compute
 
Storm Users Group Real Time Hadoop
Storm Users Group Real Time HadoopStorm Users Group Real Time Hadoop
Storm Users Group Real Time Hadoop
 

Destaque

International Space Station
International Space StationInternational Space Station
International Space Stationwouterjanssens
 
Интеграция и внедрение Cisco UCS. Практические аспекты
Интеграция и внедрение Cisco UCS. Практические аспекты Интеграция и внедрение Cisco UCS. Практические аспекты
Интеграция и внедрение Cisco UCS. Практические аспекты Cisco Russia
 
News from Mahout
News from MahoutNews from Mahout
News from MahoutTed Dunning
 
Strata new-york-2012
Strata new-york-2012Strata new-york-2012
Strata new-york-2012Ted Dunning
 
Devoxx Real-time Learning
Devoxx Real-time LearningDevoxx Real-time Learning
Devoxx Real-time LearningTed Dunning
 

Destaque (8)

Iss
IssIss
Iss
 
International Space Station
International Space StationInternational Space Station
International Space Station
 
Интеграция и внедрение Cisco UCS. Практические аспекты
Интеграция и внедрение Cisco UCS. Практические аспекты Интеграция и внедрение Cisco UCS. Практические аспекты
Интеграция и внедрение Cisco UCS. Практические аспекты
 
Het Iss
Het IssHet Iss
Het Iss
 
News from Mahout
News from MahoutNews from Mahout
News from Mahout
 
Iss
IssIss
Iss
 
Strata new-york-2012
Strata new-york-2012Strata new-york-2012
Strata new-york-2012
 
Devoxx Real-time Learning
Devoxx Real-time LearningDevoxx Real-time Learning
Devoxx Real-time Learning
 

Semelhante a Real-time and long-time together

Real-time and Long-time Together
Real-time and Long-time TogetherReal-time and Long-time Together
Real-time and Long-time TogetherMapR Technologies
 
Buzz words-dunning-real-time-learning
Buzz words-dunning-real-time-learningBuzz words-dunning-real-time-learning
Buzz words-dunning-real-time-learningTed Dunning
 
Hadoop Summit Amsterdam 2014: Capacity Planning In Multi-tenant Hadoop Deploy...
Hadoop Summit Amsterdam 2014: Capacity Planning In Multi-tenant Hadoop Deploy...Hadoop Summit Amsterdam 2014: Capacity Planning In Multi-tenant Hadoop Deploy...
Hadoop Summit Amsterdam 2014: Capacity Planning In Multi-tenant Hadoop Deploy...Sumeet Singh
 
Numascale Product IBM
Numascale Product IBMNumascale Product IBM
Numascale Product IBMIBM Danmark
 
Samantha Wang [InfluxData] | Best Practices on How to Transform Your Data Usi...
Samantha Wang [InfluxData] | Best Practices on How to Transform Your Data Usi...Samantha Wang [InfluxData] | Best Practices on How to Transform Your Data Usi...
Samantha Wang [InfluxData] | Best Practices on How to Transform Your Data Usi...InfluxData
 
Graphlab dunning-clustering
Graphlab dunning-clusteringGraphlab dunning-clustering
Graphlab dunning-clusteringTed Dunning
 
The Sierra Supercomputer: Science and Technology on a Mission
The Sierra Supercomputer: Science and Technology on a MissionThe Sierra Supercomputer: Science and Technology on a Mission
The Sierra Supercomputer: Science and Technology on a Missioninside-BigData.com
 
CMU Lecture on Hadoop Performance
CMU Lecture on Hadoop PerformanceCMU Lecture on Hadoop Performance
CMU Lecture on Hadoop PerformanceMapR Technologies
 
OPTIMIZING THE TICK STACK
OPTIMIZING THE TICK STACKOPTIMIZING THE TICK STACK
OPTIMIZING THE TICK STACKInfluxData
 
Data Science Connect, July 22nd 2014 @IBM Innovation Center Zurich
Data Science Connect, July 22nd 2014 @IBM Innovation Center ZurichData Science Connect, July 22nd 2014 @IBM Innovation Center Zurich
Data Science Connect, July 22nd 2014 @IBM Innovation Center ZurichRomeo Kienzler
 
Big Data Day LA 2016/ Big Data Track - Portable Stream and Batch Processing w...
Big Data Day LA 2016/ Big Data Track - Portable Stream and Batch Processing w...Big Data Day LA 2016/ Big Data Track - Portable Stream and Batch Processing w...
Big Data Day LA 2016/ Big Data Track - Portable Stream and Batch Processing w...Data Con LA
 
The 2nd half. Scaling to the next^2
The 2nd half. Scaling to the next^2The 2nd half. Scaling to the next^2
The 2nd half. Scaling to the next^2Haggai Philip Zagury
 
New Directions for Mahout
New Directions for MahoutNew Directions for Mahout
New Directions for MahoutTed Dunning
 
Droidcon2013 triangles gangolells_imagination
Droidcon2013 triangles gangolells_imaginationDroidcon2013 triangles gangolells_imagination
Droidcon2013 triangles gangolells_imaginationDroidcon Berlin
 

Semelhante a Real-time and long-time together (20)

Real-time and Long-time Together
Real-time and Long-time TogetherReal-time and Long-time Together
Real-time and Long-time Together
 
Buzz words-dunning-real-time-learning
Buzz words-dunning-real-time-learningBuzz words-dunning-real-time-learning
Buzz words-dunning-real-time-learning
 
London hug
London hugLondon hug
London hug
 
GoTo Amsterdam 2013 Skinned
GoTo Amsterdam 2013 SkinnedGoTo Amsterdam 2013 Skinned
GoTo Amsterdam 2013 Skinned
 
News From Mahout
News From MahoutNews From Mahout
News From Mahout
 
Hadoop Summit Amsterdam 2014: Capacity Planning In Multi-tenant Hadoop Deploy...
Hadoop Summit Amsterdam 2014: Capacity Planning In Multi-tenant Hadoop Deploy...Hadoop Summit Amsterdam 2014: Capacity Planning In Multi-tenant Hadoop Deploy...
Hadoop Summit Amsterdam 2014: Capacity Planning In Multi-tenant Hadoop Deploy...
 
Numascale Product IBM
Numascale Product IBMNumascale Product IBM
Numascale Product IBM
 
Samantha Wang [InfluxData] | Best Practices on How to Transform Your Data Usi...
Samantha Wang [InfluxData] | Best Practices on How to Transform Your Data Usi...Samantha Wang [InfluxData] | Best Practices on How to Transform Your Data Usi...
Samantha Wang [InfluxData] | Best Practices on How to Transform Your Data Usi...
 
New directions for mahout
New directions for mahoutNew directions for mahout
New directions for mahout
 
Graphlab dunning-clustering
Graphlab dunning-clusteringGraphlab dunning-clustering
Graphlab dunning-clustering
 
The Sierra Supercomputer: Science and Technology on a Mission
The Sierra Supercomputer: Science and Technology on a MissionThe Sierra Supercomputer: Science and Technology on a Mission
The Sierra Supercomputer: Science and Technology on a Mission
 
CMU Lecture on Hadoop Performance
CMU Lecture on Hadoop PerformanceCMU Lecture on Hadoop Performance
CMU Lecture on Hadoop Performance
 
Devoxx Real-Time Learning
Devoxx Real-Time LearningDevoxx Real-Time Learning
Devoxx Real-Time Learning
 
Skyline queries
Skyline queriesSkyline queries
Skyline queries
 
OPTIMIZING THE TICK STACK
OPTIMIZING THE TICK STACKOPTIMIZING THE TICK STACK
OPTIMIZING THE TICK STACK
 
Data Science Connect, July 22nd 2014 @IBM Innovation Center Zurich
Data Science Connect, July 22nd 2014 @IBM Innovation Center ZurichData Science Connect, July 22nd 2014 @IBM Innovation Center Zurich
Data Science Connect, July 22nd 2014 @IBM Innovation Center Zurich
 
Big Data Day LA 2016/ Big Data Track - Portable Stream and Batch Processing w...
Big Data Day LA 2016/ Big Data Track - Portable Stream and Batch Processing w...Big Data Day LA 2016/ Big Data Track - Portable Stream and Batch Processing w...
Big Data Day LA 2016/ Big Data Track - Portable Stream and Batch Processing w...
 
The 2nd half. Scaling to the next^2
The 2nd half. Scaling to the next^2The 2nd half. Scaling to the next^2
The 2nd half. Scaling to the next^2
 
New Directions for Mahout
New Directions for MahoutNew Directions for Mahout
New Directions for Mahout
 
Droidcon2013 triangles gangolells_imagination
Droidcon2013 triangles gangolells_imaginationDroidcon2013 triangles gangolells_imagination
Droidcon2013 triangles gangolells_imagination
 

Mais de Ted Dunning

Dunning - SIGMOD - Data Economy.pptx
Dunning - SIGMOD - Data Economy.pptxDunning - SIGMOD - Data Economy.pptx
Dunning - SIGMOD - Data Economy.pptxTed Dunning
 
How to Get Going with Kubernetes
How to Get Going with KubernetesHow to Get Going with Kubernetes
How to Get Going with KubernetesTed Dunning
 
Progress for big data in Kubernetes
Progress for big data in KubernetesProgress for big data in Kubernetes
Progress for big data in KubernetesTed Dunning
 
Anomaly Detection: How to find what you didn’t know to look for
Anomaly Detection: How to find what you didn’t know to look forAnomaly Detection: How to find what you didn’t know to look for
Anomaly Detection: How to find what you didn’t know to look forTed Dunning
 
Streaming Architecture including Rendezvous for Machine Learning
Streaming Architecture including Rendezvous for Machine LearningStreaming Architecture including Rendezvous for Machine Learning
Streaming Architecture including Rendezvous for Machine LearningTed Dunning
 
Machine Learning Logistics
Machine Learning LogisticsMachine Learning Logistics
Machine Learning LogisticsTed Dunning
 
Tensor Abuse - how to reuse machine learning frameworks
Tensor Abuse - how to reuse machine learning frameworksTensor Abuse - how to reuse machine learning frameworks
Tensor Abuse - how to reuse machine learning frameworksTed Dunning
 
Machine Learning logistics
Machine Learning logisticsMachine Learning logistics
Machine Learning logisticsTed Dunning
 
Finding Changes in Real Data
Finding Changes in Real DataFinding Changes in Real Data
Finding Changes in Real DataTed Dunning
 
Where is Data Going? - RMDC Keynote
Where is Data Going? - RMDC KeynoteWhere is Data Going? - RMDC Keynote
Where is Data Going? - RMDC KeynoteTed Dunning
 
Real time-hadoop
Real time-hadoopReal time-hadoop
Real time-hadoopTed Dunning
 
Cheap learning-dunning-9-18-2015
Cheap learning-dunning-9-18-2015Cheap learning-dunning-9-18-2015
Cheap learning-dunning-9-18-2015Ted Dunning
 
Sharing Sensitive Data Securely
Sharing Sensitive Data SecurelySharing Sensitive Data Securely
Sharing Sensitive Data SecurelyTed Dunning
 
Real-time Puppies and Ponies - Evolving Indicator Recommendations in Real-time
Real-time Puppies and Ponies - Evolving Indicator Recommendations in Real-timeReal-time Puppies and Ponies - Evolving Indicator Recommendations in Real-time
Real-time Puppies and Ponies - Evolving Indicator Recommendations in Real-timeTed Dunning
 
How the Internet of Things is Turning the Internet Upside Down
How the Internet of Things is Turning the Internet Upside DownHow the Internet of Things is Turning the Internet Upside Down
How the Internet of Things is Turning the Internet Upside DownTed Dunning
 
Apache Kylin - OLAP Cubes for SQL on Hadoop
Apache Kylin - OLAP Cubes for SQL on HadoopApache Kylin - OLAP Cubes for SQL on Hadoop
Apache Kylin - OLAP Cubes for SQL on HadoopTed Dunning
 
Dunning time-series-2015
Dunning time-series-2015Dunning time-series-2015
Dunning time-series-2015Ted Dunning
 
Doing-the-impossible
Doing-the-impossibleDoing-the-impossible
Doing-the-impossibleTed Dunning
 
Anomaly Detection - New York Machine Learning
Anomaly Detection - New York Machine LearningAnomaly Detection - New York Machine Learning
Anomaly Detection - New York Machine LearningTed Dunning
 

Mais de Ted Dunning (20)

Dunning - SIGMOD - Data Economy.pptx
Dunning - SIGMOD - Data Economy.pptxDunning - SIGMOD - Data Economy.pptx
Dunning - SIGMOD - Data Economy.pptx
 
How to Get Going with Kubernetes
How to Get Going with KubernetesHow to Get Going with Kubernetes
How to Get Going with Kubernetes
 
Progress for big data in Kubernetes
Progress for big data in KubernetesProgress for big data in Kubernetes
Progress for big data in Kubernetes
 
Anomaly Detection: How to find what you didn’t know to look for
Anomaly Detection: How to find what you didn’t know to look forAnomaly Detection: How to find what you didn’t know to look for
Anomaly Detection: How to find what you didn’t know to look for
 
Streaming Architecture including Rendezvous for Machine Learning
Streaming Architecture including Rendezvous for Machine LearningStreaming Architecture including Rendezvous for Machine Learning
Streaming Architecture including Rendezvous for Machine Learning
 
Machine Learning Logistics
Machine Learning LogisticsMachine Learning Logistics
Machine Learning Logistics
 
Tensor Abuse - how to reuse machine learning frameworks
Tensor Abuse - how to reuse machine learning frameworksTensor Abuse - how to reuse machine learning frameworks
Tensor Abuse - how to reuse machine learning frameworks
 
Machine Learning logistics
Machine Learning logisticsMachine Learning logistics
Machine Learning logistics
 
T digest-update
T digest-updateT digest-update
T digest-update
 
Finding Changes in Real Data
Finding Changes in Real DataFinding Changes in Real Data
Finding Changes in Real Data
 
Where is Data Going? - RMDC Keynote
Where is Data Going? - RMDC KeynoteWhere is Data Going? - RMDC Keynote
Where is Data Going? - RMDC Keynote
 
Real time-hadoop
Real time-hadoopReal time-hadoop
Real time-hadoop
 
Cheap learning-dunning-9-18-2015
Cheap learning-dunning-9-18-2015Cheap learning-dunning-9-18-2015
Cheap learning-dunning-9-18-2015
 
Sharing Sensitive Data Securely
Sharing Sensitive Data SecurelySharing Sensitive Data Securely
Sharing Sensitive Data Securely
 
Real-time Puppies and Ponies - Evolving Indicator Recommendations in Real-time
Real-time Puppies and Ponies - Evolving Indicator Recommendations in Real-timeReal-time Puppies and Ponies - Evolving Indicator Recommendations in Real-time
Real-time Puppies and Ponies - Evolving Indicator Recommendations in Real-time
 
How the Internet of Things is Turning the Internet Upside Down
How the Internet of Things is Turning the Internet Upside DownHow the Internet of Things is Turning the Internet Upside Down
How the Internet of Things is Turning the Internet Upside Down
 
Apache Kylin - OLAP Cubes for SQL on Hadoop
Apache Kylin - OLAP Cubes for SQL on HadoopApache Kylin - OLAP Cubes for SQL on Hadoop
Apache Kylin - OLAP Cubes for SQL on Hadoop
 
Dunning time-series-2015
Dunning time-series-2015Dunning time-series-2015
Dunning time-series-2015
 
Doing-the-impossible
Doing-the-impossibleDoing-the-impossible
Doing-the-impossible
 
Anomaly Detection - New York Machine Learning
Anomaly Detection - New York Machine LearningAnomaly Detection - New York Machine Learning
Anomaly Detection - New York Machine Learning
 

Último

Ransomware_Q4_2023. The report. [EN].pdf
Ransomware_Q4_2023. The report. [EN].pdfRansomware_Q4_2023. The report. [EN].pdf
Ransomware_Q4_2023. The report. [EN].pdfOverkill Security
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoffsammart93
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...apidays
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonAnna Loughnan Colquhoun
 
Navi Mumbai Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Navi Mumbai Call Girls 🥰 8617370543 Service Offer VIP Hot ModelNavi Mumbai Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Navi Mumbai Call Girls 🥰 8617370543 Service Offer VIP Hot ModelDeepika Singh
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Drew Madelung
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobeapidays
 
GenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdfGenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdflior mazor
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Scriptwesley chun
 
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...apidays
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfsudhanshuwaghmare1
 
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWEREMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWERMadyBayot
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century educationjfdjdjcjdnsjd
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...Martijn de Jong
 
Manulife - Insurer Transformation Award 2024
Manulife - Insurer Transformation Award 2024Manulife - Insurer Transformation Award 2024
Manulife - Insurer Transformation Award 2024The Digital Insurer
 
AXA XL - Insurer Innovation Award Americas 2024
AXA XL - Insurer Innovation Award Americas 2024AXA XL - Insurer Innovation Award Americas 2024
AXA XL - Insurer Innovation Award Americas 2024The Digital Insurer
 
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ..."I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...Zilliz
 
MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MIND CTI
 
Apidays Singapore 2024 - Modernizing Securities Finance by Madhu Subbu
Apidays Singapore 2024 - Modernizing Securities Finance by Madhu SubbuApidays Singapore 2024 - Modernizing Securities Finance by Madhu Subbu
Apidays Singapore 2024 - Modernizing Securities Finance by Madhu Subbuapidays
 

Último (20)

Ransomware_Q4_2023. The report. [EN].pdf
Ransomware_Q4_2023. The report. [EN].pdfRansomware_Q4_2023. The report. [EN].pdf
Ransomware_Q4_2023. The report. [EN].pdf
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
Navi Mumbai Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Navi Mumbai Call Girls 🥰 8617370543 Service Offer VIP Hot ModelNavi Mumbai Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Navi Mumbai Call Girls 🥰 8617370543 Service Offer VIP Hot Model
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
 
GenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdfGenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdf
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWEREMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century education
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
Manulife - Insurer Transformation Award 2024
Manulife - Insurer Transformation Award 2024Manulife - Insurer Transformation Award 2024
Manulife - Insurer Transformation Award 2024
 
AXA XL - Insurer Innovation Award Americas 2024
AXA XL - Insurer Innovation Award Americas 2024AXA XL - Insurer Innovation Award Americas 2024
AXA XL - Insurer Innovation Award Americas 2024
 
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ..."I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
 
MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024
 
Apidays Singapore 2024 - Modernizing Securities Finance by Madhu Subbu
Apidays Singapore 2024 - Modernizing Securities Finance by Madhu SubbuApidays Singapore 2024 - Modernizing Securities Finance by Madhu Subbu
Apidays Singapore 2024 - Modernizing Securities Finance by Madhu Subbu
 

Real-time and long-time together

  • 1. 1©MapR Technologies - Confidential Real-time and Long-time with Storm and Hadoop
  • 2. 2©MapR Technologies - Confidential Real-time and Long-time with Storm and Hadoop MapR
  • 3. 3©MapR Technologies - Confidential  Contact: – tdunning@maprtech.com – @ted_dunning  Slides and such (available late tonight): – http://info.mapr.com/ted-utahjug  Hash tags: #mapr #storm
  • 4. 4©MapR Technologies - Confidential The Challenge  Hadoop is great of processing vats of data – But sucks for real-time (by design!)  Storm is great for real-time processing – But lacks any way to deal with batch processing  It sounds like there isn’t a solution – Neither fashionable solution handles everything
  • 5. 5©MapR Technologies - Confidential This is not a problem. It’s an opportunity!
  • 6. 6©MapR Technologies - Confidential What is Map-Reduce?  Map-reduce programs are defined (mostly) by – A map function that does independent record transformations (and deletions and replications) – Reduce functions that do aggregation  Map-reduce programs run in framework that – Schedules and re-runs tasks – Splits the input – Moves map outputs to reduce inputs – Receives the results 6
  • 7. 7©MapR Technologies - Confidential Inside Map-Reduce 7 Input Map CombineShuffle and sort Reduce Output Reduce "The time has come," the Walrus said, "To talk of many things: Of shoes—and ships—and sealing-wax the, 1 time, 1 has, 1 come, 1 … come, [3,2,1] has, [1,5,2] the, [1,2,1] time, [10,1,3] … come, 6 has, 8 the, 4 time, 14 …
  • 8. 8©MapR Technologies - Confidential Not Just Text  Counting words is easy  Many other problems work as well – Sessionize user logs – Very large scale joins – Large scale matrix recommendations – Computing the quadrillionth digit of π  Map-reduce is inherently batch oriented
  • 9. 9©MapR Technologies - Confidential Inside Map-Reduce 9 Input Map Shuffle and sort Reduce Output road1, polyline(p1, p2, p3, …) lake1, polygon(p4, p5, p7, p9, …) road2, polyline(p6, p7, p9, …) tile0918-1412, road1 tile1082-8143, road1 tile0014-3284, lake1 tile1082-8143, lake1 … tile0918-1412, [road1] tile1082-8143, [road1, lake1] tile0014-3284, [lake1] … tile1082-8143, img#1 tile0014-3284, img#2 tile1082-8143, img#3 …
  • 10. 10©MapR Technologies - Confidential What is Storm?  A Storm program is called a topology – Spouts inject data into a topology – Bolts process data  The units of data are called tuples  All processing is flow-through  Bolts can buffer or persist  Output tuples can be anchored  If a tuple and all descendants are not ack’ed in time, bolt has died  Bolts that fail are restarted and un-acked tuples are replayed
  • 11. 11©MapR Technologies - Confidential t now Hadoop is Not Very Real-time Unprocessed Data Fully processed Latest full period Hadoop job takes this long for this data
  • 12. 12©MapR Technologies - Confidential t now Hadoop works great back here Storm works here Real-time and Long-time together Blended view Blended view Blended View
  • 13. 13©MapR Technologies - Confidential One Alternative Search Engine NoSql de Jour Consumer Real-time Long-time ?
  • 14. 14©MapR Technologies - Confidential Problems  Simply dumping into noSql engine doesn’t quite work  Insert rate is limited  No load isolation – Big retrospective jobs kill real-time  Low scan performance – Hbase pretty good, but not stellar  Difficult to set boundaries – where does real-time end and long-time begin?
  • 15. 15©MapR Technologies - Confidential Data Sources Catcher Cluster Rough Design – Data Flow Catcher Cluster Query Event Spout Logger Bolt Counter Bolt Raw Logs Logger Bolt Semi Agg Hadoop Aggregator Snap Long agg ProtoSpout Counter Bolt Logger Bolt Data Sources
  • 16. 16©MapR Technologies - Confidential Closer Look – Catcher Protocol Data Sources Catcher Cluster Catcher Cluster Data Sources The data sources and catchers communicate with a very simple protocol. Hello() => list of catchers Log(topic,message) => (OK|FAIL, redirect-to-catcher)
  • 17. 17©MapR Technologies - Confidential Closer Look – Catcher Queues Catcher Cluster Catcher Cluster The catchers forward log requests to the correct catcher and return that host in the reply to allow the client to avoid the extra hop. Each topic file is appended by exactly one catcher. Topic files are kept in shared file storage. Topic File Topic File
  • 18. 18©MapR Technologies - Confidential Closer Look – ProtoSpout The ProtoSpout tails the topic files, parses log records into tuples and injects them into the Storm topology. Last fully acked position stored in shared file system. Topic File Topic File ProtoSpout
  • 19. 19©MapR Technologies - Confidential Closer Look – Counter Bolt  Critical design goals: – fast ack for all tuples – fast restart of counter  Ack happens when tuple hits the replay log (10’s of milliseconds)  Restart involves replaying semi-agg’s + replay log (very fast)  Replay log only lasts until next semi-aggregate goes out Counter Bolt Replay Log Semi- aggregated records Incoming records Real-time Long-time
  • 20. 20©MapR Technologies - Confidential A Frozen Moment in Time  Snapshot defines the dividing line  All data in the snap is long- time, all after is real-time  Semi-agg strategy allows clean query Semi Agg Hadoop Aggregator Snap Long agg
  • 21. 21©MapR Technologies - Confidential Guarantees  Counter output volume is small-ish – the greater of k tuples per 100K inputs or k tuple/s – 1 tuple/s/label/bolt for this exercise  Persistence layer must provide guarantees – distributed against node failure – must have either readable flush or closed-append  HDFS is distributed, but provides no guarantees and strange semantics  MapRfs is distributed, provides all necessary guarantees
  • 22. 22©MapR Technologies - Confidential Presentation Layer  Presentation must – read recent output of Logger bolt – read relevant output of Hadoop jobs – combine semi-aggregated records  User will see – counts that increment within 0-2 s of events – seamless and accurate meld of short and long-term data
  • 23. 23©MapR Technologies - Confidential Example 2 – AB testing in real-time  I have 15 versions of my landing page  Each visitor is assigned to a version – Which version?  A conversion or sale or whatever can happen – How long to wait?  Some versions of the landing page are horrible – Don’t want to give them traffic
  • 24. 29©MapR Technologies - Confidential Bayesian Bandit  Compute distributions based on data  Sample p1 and p2 from these distributions  Put a coin in bandit 1 if p1 > p2  Else, put the coin in bandit 2
  • 25. 30©MapR Technologies - Confidential And it works! 11000 100 200 300 400 500 600 700 800 900 1000 0.12 0 0.01 0.02 0.03 0.04 0.05 0.06 0.07 0.08 0.09 0.1 0.11 n regret ε- greedy, ε = 0.05 Bayesian Bandit with Gamma- Normal
  • 26. 31©MapR Technologies - Confidential Video Demo
  • 27. 32©MapR Technologies - Confidential The Code  Select an alternative  Select and learn  But we already know how to count! n = dim(k)[1] p0 = rep(0, length.out=n) for (i in 1:n) { p0[i] = rbeta(1, k[i,2]+1, k[i,1]+1) } return (which(p0 == max(p0))) for (z in 1:steps) { i = select(k) j = test(i) k[i,j] = k[i,j]+1 } return (k)
  • 28. 33©MapR Technologies - Confidential The Basic Idea  We can encode a distribution by sampling  Sampling allows unification of exploration and exploitation  Can be extended to more general response models
  • 29. 34©MapR Technologies - Confidential The Bigger Basic Idea  Online algorithms generally have relatively small state (like counting)  Online algorithms generally have a simple update (like counting)  If we can do this with counting, we can do it with all kinds of algorithms
  • 30. 37©MapR Technologies - Confidential  Contact: – tdunning@maprtech.com – @ted_dunning  Slides and such (available late tonight): – Send email !!  Hash tags: #mapr #bigdata  We are hiring!
  • 31. 38©MapR Technologies - Confidential Thank You