SlideShare uma empresa Scribd logo
1 de 117
Baixar para ler offline
1
ProcessingProcessing
“BIG-DATA”“BIG-DATA”
InIn Real TimeReal Time
Yanai Franchi , TikalYanai Franchi , Tikal
2
3
Vacation to BarcelonaVacation to Barcelona
4
After a Long Travel DayAfter a Long Travel Day
5
Going to a Salsa Club
6
Best Salsa Club
NOW
● Good Music
● Crowded –
Now!
7
Same Problem in “gogobot”
8
9
gogobot checkin
Heat Map Service
Lets' Develop
“Gogobot Checkins Heat-Map”
10
Key Notes
● Collector Service - Collects checkins as text addresses
– We need to use GeoLocation ServiceWe need to use GeoLocation Service
● Upon elapsed interval, the last locations list will be
displayed as Heat-Map in GUI.
● Web Scale service – 10Ks checkins/seconds all over the
world (imaginary, but lets do it for the exercise).
● Accuracy – Sample data, NOT critical data.
– Proportionately representative
– Data volume is large enough tois large enough to compensate for data loss.compensate for data loss.
11
Heat-Map Context
Text-Address
Checkins Heat-Map
Service
Gogobot System
Gogobot
Micro Service
Gogobot
Micro Service
Gogobot
Micro Service
Geo Location
Service
Get-GeoCode(Address)
Heat-Map
Last Interval Locations
12
Database
Persist Checkin
Intervals
Processing
Checkins
Read
Text Address
Check-in #1
Check-in #2
Check-in #3
Check-in #4
Check-in #5
Check-in #6
Check-in #7
Check-in #8
Check-in #9
...
Simulate Checkins with a File
Plan A
GET Geo
Location
Geo Location
Service
13
Tons of Addresses
Arriving Every Second
14
Architect - First Reaction...
15
Second Reaction...
16
Developer
First
Reaction
17
Second
Reaction
18
Problems ?
● Tedious: Spend time confi guring where to send
messages, deploying workers, and deploying
intermediate queues.
● Brittle: There's little fault-tolerance.
● Painful to scale: Partition of running worker/s is
complicated.
19
What We Want ?
● Horizontal scalability
● Fault-tolerance
● No intermediate message brokers!
● Higher level abstraction than message
passing
● “Just works”
● Guaranteed data processing (not in this
case)
20
Apache Storm
✔Horizontal scalability
✔Fault-tolerance
✔No intermediate message brokers!
✔Higher level abstraction than message
passing
✔“Just works”
✔Guaranteed data processing
21
Anatomy of Storm
22
What is Storm ?
● CEP - Open source and distributed realtime
computation system.
– Makes it easy toMakes it easy to reliably process unboundedreliably process unbounded streamsstreams ofof
tuplestuples
– Doing for realtime processing what Hadoop did for batchDoing for realtime processing what Hadoop did for batch
processing.processing.
● Fast - 1M Tuples/sec per node.
– It is scalable,fault-tolerant, guarantees your data will beIt is scalable,fault-tolerant, guarantees your data will be
processed, and is easy to set up and operate.processed, and is easy to set up and operate.
23
Streams
Tuple Tuple Tuple Tuple Tuple Tuple
Unbounded sequence of tuples
24
Spouts
Tuple
Tuple
Sources of Streams
Tuple
Tuple
25
Bolts
Tuple
TupleTuple
Processes input streams and produces
new streams
Tuple
TupleTupleTuple
Tuple TupleTuple
26
Storm Topology
Network of spouts and bolts
Tuple
TupleTuple
TupleTuple TupleTuple
Tuple TupleTupleTuple
Tuple
Tuple
Tuple
Tuple TupleTupleTuple
27
Guarantee for Processing
● Storm guarantees the full processing of a tuple by
tracking its state
● In case of failure, Storm can re-process it.
● Source tuples with full “acked” trees are removed
from the system
28
Tasks (Bolt/Spout Instance)
Spouts and bolts execute as
many tasks across the cluster
29
Stream Grouping
When a tuple is emitted, which task
(instance) does it go to?
30
Stream Grouping
● Shuffl e grouping: pick a random task
● Fields grouping: consistent hashing on a subset of
tuple fi elds
● All grouping: send to all tasks
● Global grouping: pick task with lowest id
31
Tasks , Executors , Workers
Task Task Task
Worker Process
Sput /
Bolt
Sput /
Bolt
Sput /
Bolt
=
Executor Thread
JVM
Executor Thread
32
Bolt B Bolt B
Worker Process
Executor
Spout A
Executor
Node
Supervisor
Bolt C Bolt C
Executor
Bolt B Bolt B
Worker Process
Executor
Spout A
Executor
Node
Supervisor
Bolt C Bolt C
Executor
33
Nimbus
Supervisor Supervisor
Supervisor Supervisor
Supervisor Supervisor
Upload/Rebalance
Heat-Map Topology
Zoo Keeper
Nodes
Storm Architecture
Master Node
(similar to Hadoop JobTracker)
NOT critical
for running topology
34
Nimbus
Supervisor Supervisor
Supervisor Supervisor
Supervisor Supervisor
Upload/Rebalance
Heat-Map Topology
Zoo Keeper
Storm Architecture
Used For Cluster Coordination
A few
nodes
35
Nimbus
Supervisor Supervisor
Supervisor Supervisor
Supervisor Supervisor
Upload/Rebalance
Heat-Map Topology
Zoo Keeper
Storm Architecture
Run Worker Processes
36
Assembling Heatmap Topology
37
HeatMap Input/Output Tuples
● Input Tuples: Timestamp and Text Address :
– (9:00:07 PM , “287 Hudson St New York NY 10013”)(9:00:07 PM , “287 Hudson St New York NY 10013”)
● Output Tuple: Time interval, and a list of points for
it:
– (9:00:00 PM to 9:00:15 PM,(9:00:00 PM to 9:00:15 PM,
ListList((((40.719,-73.98740.719,-73.987),(40.726,-74.001),(),(40.726,-74.001),(40.719,-73.98740.719,-73.987))))
38
Checkins
Spout
Geocode
Lookup
Bolt
Heatmap
Builder
Bolt
Persistor
Bolt
(9:01 PM @ 287 Hudson st)
(9:01 PM , (40.736, -74,354)))
Heat Map
Storm
Topology
(9:00 PM – 9:15 PM , List((40.73, -74,34),
(51.36, -83,33),(69.73, -34,24))
Upon
Elapsed Interval
39
Checkins Spout
public class CheckinsSpout extends BaseRichSpout {
private List<String> sampleLocations;
private int nextEmitIndex;
private SpoutOutputCollector outputCollector;
@Override
public void open(Map map, TopologyContext topologyContext,
SpoutOutputCollector spoutOutputCollector) {
this.outputCollector = spoutOutputCollector;
this.nextEmitIndex = 0;
sampleLocations = IOUtils.readLines(
ClassLoader.getSystemResourceAsStream("sanple-locations.txt"));
}
@Override
public void nextTuple() {
String address = checkins.get(nextEmitIndex);
String checkin = new Date().getTime()+"@ADDRESS:"+address;
outputCollector.emit(new Values(checkin));
nextEmitIndex = (nextEmitIndex + 1) % sampleLocations.size();
}
@Override
public void declareOutputFields(OutputFieldsDeclarer declarer) {
declarer.declare(new Fields("str"));
}
We hold state
No need for thread safety
Declare
output fields
Been called
iteratively by Storm
40
Geocode Lookup Bolt
public class GeocodeLookupBolt extends BaseBasicBolt {
private LocatorService locatorService;
@Override
public void prepare(Map stormConf, TopologyContext context) {
locatorService = new GoogleLocatorService();
}
@Override
public void execute(Tuple tuple, BasicOutputCollector outputCollector) {
String str = tuple.getStringByField("str");
String[] parts = str.split("@");
Long time = Long.valueOf(parts[0]);
String address = parts[1];
LocationDTO locationDTO = locatorService.getLocation(address);
if(checkinDTO!=null)
outputCollector.emit(new Values(time,locationDTO) );
}
@Override
public void declareOutputFields(OutputFieldsDeclarer fieldsDeclarer) {
fieldsDeclarer.declare(new Fields("time", "location"));
}
}
Get Geocode,
Create DTO
41
Tick Tuple – Repeating Mantra
42
Two Streams to Heat-Map Builder
On tick tuple, we fl ush our Heat-Map
Checkin 1 Checkin 4 Checkin 5 Checkin 6
HeatMap-
Builder Bolt
43
Tick Tuple in Action
public class HeatMapBuilderBolt extends BaseBasicBolt {
private Map<String, List<LocationDTO>> heatmaps;
@Override
public Map<String, Object> getComponentConfiguration() {
Config conf = new Config();
conf.put(Config.TOPOLOGY_TICK_TUPLE_FREQ_SECS, 60 );
return conf;
}
@Override
public void execute(Tuple tuple, BasicOutputCollector outputCollector) {
if (isTickTuple(tuple)) {
// Emit accumulated intervals
} else {
// Add check-in info to the current interval in the Map
}
}
private boolean isTickTuple(Tuple tuple) {
return tuple.getSourceComponent().equals(Constants.SYSTEM_COMPONENT_ID)
&& tuple.getSourceStreamId().equals(Constants.SYSTEM_TICK_STREAM_ID);
}
Tick interval
Hold latest intervals
44
Persister Bolt
public class PersistorBolt extends BaseBasicBolt {
private Jedis jedis;
@Override
public void execute(Tuple tuple, BasicOutputCollector outputCollector) {
Long timeInterval = tuple.getLongByField("time-interval");
String city = tuple.getStringByField("city");
String locationsList = objectMapper.writeValueAsString
( tuple.getValueByField("locationsList"));
String dbKey = "checkins-" + timeInterval+"@"+city;
jedis.setex(dbKey, 3600*24 ,locationsList);
jedis.publish("location-key", dbKey);
}
}
Publish in
Redis channel
for debugging
Persist in Redis
for 24h
45
Shuffle Grouping
Shuffle Grouping
Check-in #1
Check-in #2
Check-in #3
Check-in #4
Check-in #5
Check-in #6
Check-in #7
Check-in #8
Check-in #9
...
Sample Checkins File
Read
Text Addresses
Transforming the Tuples
Checkins
Spout
Geocode
Lookup
Bolt
Heatmap
Builder
Bolt
Database
Persistor
Bolt
Get Geo
Location
Geo Location
Service
Field Grouping(city)
Group by city
46
Heat Map Topology
public class LocalTopologyRunner {
public static void main(String[] args) {
TopologyBuilder builder = buildTopolgy();
StormSubmitter.submitTopology(
"local-heatmap", new Config(), builder.createTopology());
}
private static TopologyBuilder buildTopolgy() {
topologyBuilder builder = new TopologyBuilder();
builder.setSpout("checkins", new CheckinsSpout());
builder.setBolt("geocode-lookup", new GeocodeLookupBolt() )
.shuffleGrouping("checkins");
builder.setBolt("heatmap-builder", new HeatMapBuilderBolt() )
.fieldsGrouping("geocode-lookup", new Fields("city"));
builder.setBolt("persistor", new PersistorBolt() )
.shuffleGrouping("heatmap-builder");
return builder;
}
}
47
Its NOT Scaled
48
49
Scaling the Topology
public class LocalTopologyRunner {
conf.setNumWorkers(20);
public static void main(String[] args) {
TopologyBuilder builder = buildTopolgy();
Config conf = new Config();
conf.setNumWorkers(2);
StormSubmitter.submitTopology(
"local-heatmap", conf, builder.createTopology());
}
private static TopologyBuilder buildTopolgy() {
topologyBuilder builder = new TopologyBuilder();
builder.setSpout("checkins", new CheckinsSpout(), 4 );
builder.setBolt("geocode-lookup", new GeocodeLookupBolt() , 8 )
.shuffleGrouping("checkins").setNumTasks(64);
builder.setBolt("heatmap-builder", new HeatMapBuilderBolt() , 4)
.fieldsGrouping("geocode-lookup", new Fields("city"));
builder.setBolt("persistor", new PersistorBolt() , 2 )
.shuffleGrouping("heatmap-builder").setNumTasks(4);
return builder;
Parallelism hint
Increase Tasks
For Future
Set no. of workers
50
Demo
51
Database
Storm Heat-Map
Topology
Persist Checkin
Intervals
GET Geo
Location
Check-in #1
Check-in #2
Check-in #3
Check-in #4
Check-in #5
Check-in #6
Check-in #7
Check-in #8
Check-in #9
...
Read
Text Address
Sample Checkins File
Recap – Plan A
Geo Location
Service
52
We have
something working
53
Add Kafka Messaging
54
Plan B -
Kafka Spout&Bolt to HeatMap
Geocode
Lookup
Bolt
Heatmap
Builder
Bolt
Kafka
Checkins
Spout
Database
Persistor
Bolt
Geo Location
Service
Read
Text Addresses
Checkin
Kafka Topic
Publish
Checkins
Locations
Topic
Kafka
Locations
Bolt
55
56
They all are Good
But not for all use-cases
57
Kafka
A little introduction
58
59
Pub-Sub Messaging System
60
61
62
63
64
Stateless Broker &
Doesn't Fear the File System
65
66
67
68
Topics
● Logical collections of partitions (the physical fi les).
● A broker contains some of the partitions for a topic
69
A partition is Consumed by
Exactly One Group's Consumer
70
Distributed &
Fault-Tolerant
71
Broker 1 Broker 3Broker 2
Zoo Keeper
Consumer 1 Consumer 2
Producer 1 Producer 2
72
Broker 1 Broker 4Broker 3Broker 2
Zoo Keeper
Consumer 1 Consumer 2
Producer 1 Producer 2
73
Broker 1 Broker 4Broker 3Broker 2
Zoo Keeper
Consumer 1 Consumer 2
Producer 1 Producer 2
74
Broker 1 Broker 4Broker 3Broker 2
Zoo Keeper
Consumer 1 Consumer 2
Producer 1 Producer 2
75
Broker 1 Broker 4Broker 3Broker 2
Zoo Keeper
Consumer 1 Consumer 2
Producer 1 Producer 2
76
Broker 1 Broker 4Broker 3Broker 2
Zoo Keeper
Consumer 1 Consumer 2
Producer 1 Producer 2
77
Broker 1 Broker 4Broker 3Broker 2
Zoo Keeper
Consumer 1 Consumer 2
Producer 1 Producer 2
78
Broker 1 Broker 3Broker 2
Zoo Keeper
Consumer 1 Consumer 2
Producer 1 Producer 2
79
Broker 1 Broker 3Broker 2
Zoo Keeper
Consumer 1 Consumer 2
Producer 1 Producer 2
80
Broker 1 Broker 3Broker 2
Zoo Keeper
Consumer 1 Consumer 2
Producer 1 Producer 2
81
Broker 1 Broker 3Broker 2
Zoo Keeper
Consumer 1
Producer 1 Producer 2
82
Broker 1 Broker 3Broker 2
Zoo Keeper
Consumer 1
Producer 1 Producer 2
83
Broker 1 Broker 3Broker 2
Zoo Keeper
Consumer 1
Producer 1 Producer 2
84
Performance Benchmark
1 Broker
1 Producer
1 Consumer
85
86
87
Add Kafka to our Topology
public class LocalTopologyRunner {
...
private static TopologyBuilder buildTopolgy() {
...
builder.setSpout("checkins", new KafkaSpout(kafkaConfig));
...
builder.setBolt("kafkaProducer", new KafkaOutputBolt
( "localhost:9092",
"kafka.serializer.StringEncoder",
"locations-topic"))
.shuffleGrouping("persistor");
return builder;
}
}
Kafka Bolt
Kafka Spout
88
Checkin HTTP
Reactor
Publish
Checkins
Plan C – Add Reactor
Database
Checkin
Kafka Topic
Consume Checkins
Storm Heat-Map
Topology
Locations
Kafka Topic
Publish
Interval Key
Persist Checkin
Intervals
Geo Location
ServiceGET Geo
Location
Index Interval
Locations
Search
Server
Index
Text-Address
89
Why Reactor ?
90
C10K
Problem
91
2008:
Thread Per Request/Response
92
...events
trigger
handlers
Application
registers
handlers
Reactor Pattern Paradigm
93
Reactor Pattern – Key Points
● Single thread / single event loop
● EVERYTHING runs on it
● You MUST NOT block the event loop
● Many Implementations (partial list):
– Node.js (JavaScrip), EventMachine (Ruby), TwistedNode.js (JavaScrip), EventMachine (Ruby), Twisted
(Python)... and Vert.X(Python)... and Vert.X
94
Reactor Pattern Problems
● Some work is naturally blocking:
– Intensive data crunchingIntensive data crunching
– 3rd-party blocking API’s (e.g. JDBC)3rd-party blocking API’s (e.g. JDBC)
● Pure reactor (e.g. Node.js) is not a good fi t for this
kind of work!
95
96
● Vertciles are Execution unit of Vert.x
● Single threaded
● Verticles communicate by message passing
Verticles
For Blocking IO
Run in Thread-PoolRun in
Event Loop
97
Vert.X Architecture
Event Bus
Vert.X Architecture
98
Vert.X Goodies
● Growing Module
● Repository
● web server
● Persistors (Mongo,
JDBC, ...)
● Work queue
● Authentication
● Manager
● Session manager
● Socket.IO
● TCP/SSL servers/clients
● HTTP/HTTPS servers/
clients
● WebSockets support
● SockJS support
● Timers
● Buffers
● Streams and Pumps
● Routing
● Asynchronous File I/O
99
Node.JS vs Vert.X
100
Node.js vs Vert.X
● Node.js
– JavaScript OnlyJavaScript Only
– Inherently SingleInherently Single
ThreadedThreaded
– No help much with IPCNo help much with IPC
– All code MUST be inAll code MUST be in
Event loopEvent loop
● Vert.X
– Polyglot (JavaScript,Polyglot (JavaScript,
Java, Ruby, Python...)Java, Ruby, Python...)
– Leverages JVM multi-Leverages JVM multi-
threadingthreading
– Nervous Event BusNervous Event Bus
– Blocking work can beBlocking work can be
done off the event loopdone off the event loop
101
Node.js vs Vert.X Benchmark
AMD Phenom II X6 (6 core), 8GB
RAM, Ubuntu 11.04
http://vertxproject.wordpress.com/2012/05/09/vert-x-vs-node-js-simple-http-benchmarks/
102
Event Bus
HTTP
Server
Verticle
Kafka
module
Kafka Topic
Storm
Topology
HeatMap Reactor Architecture
Vert.X Instance
Automatically sends
EventBus Msg → KafkaTopic
Vert.X Instance
103
Heat-Map Server – Only 6 LOC !
var vertx = require('vertx');
var container = require('vertx/container');
var console = require('vertx/console');
var config = container.config;
vertx.createHttpServer().requestHandler(function(request) {
request.dataHandler(function(buffer) {
vertx.eventBus.send(config.address, {payload:buffer.toString()});
});
request.response.end();
}).listen(config.httpServerPort, config.httpServerHost);
console.log("HTTP CheckinsReactor started on port "+config.httpServerPort);
Send checkin
to Vert.X EventBus
104
Database
Checkin HTTP
Reactor Checkin
Kafka Topic
Consume Checkins
Storm Heat-Map
Topology
Hotzones
Kafka Topic
Publish
Interval Key
Persist Checkin
Intervals
Web App
Geo Location
ServiceGET Geo
Location
Get Interval Locations
Consume Intervals Keys
Push via WebSocket
Publish
Checkins
Search
Server
Index
Index Interval
Locations
Search
Checkin HTTP
Firehose
105
Demo
106
Lambda
Architecture
107
Until Now...
108
Doesn't Answer Many Answers...
● What are the most popular Salsa club in last month?
● How many unique visitors this year , per Salsa club?
● Show histogram of “bouncing” checkins for the last
year?
109
Batch
Processing
110
Sing in Concert ?
111
Complementary Views
Batch Views Real Time
Views
Just a few
hours of data
Time
Now
112
Lambda Architecture
New Data
01101001101...
real-time
view
real-time
view
real-time
view
Speed Layer
master dataset
batch
view
batch
view
batch
view
Serving Layer
Batch Layer
Query
How Many ?
113
Lambda Advantages
● Recover from Human mistakes
● No need for random writes on Batch DB
● Processed with high precision, and involve
algorithms without losing short-term information
114
Summary
115
When You go out to Salsa Club
● Good Music
● Crowded
116
More Conclusions..
● Storm – Great for real-time BigData processing.
Complementary for Hadoop batch jobs.
● Kafka – Great messaging for logs/events data, been
served as a good “source” for Storm spout
● Vert.X – Worth trial and check as an alternative for
reactor.
● Lambda Architecture – Bring Real-Time and Batch-
Processing concert for Big Data.
117
Thanks

Mais conteúdo relacionado

Mais procurados

Taking Your Database Beyond the Border of a Single Kubernetes Cluster
Taking Your Database Beyond the Border of a Single Kubernetes ClusterTaking Your Database Beyond the Border of a Single Kubernetes Cluster
Taking Your Database Beyond the Border of a Single Kubernetes ClusterChristopher Bradford
 
Super COMPUTING Journal
Super COMPUTING JournalSuper COMPUTING Journal
Super COMPUTING JournalPandey_G
 
On heap cache vs off-heap cache
On heap cache vs off-heap cacheOn heap cache vs off-heap cache
On heap cache vs off-heap cachergrebski
 
Smallworld Data Check-Out to Microstation
Smallworld Data Check-Out to MicrostationSmallworld Data Check-Out to Microstation
Smallworld Data Check-Out to MicrostationSafe Software
 
Demonstrating a Pre-Exascale, Cost-Effective Multi-Cloud Environment for Scie...
Demonstrating a Pre-Exascale, Cost-Effective Multi-Cloud Environment for Scie...Demonstrating a Pre-Exascale, Cost-Effective Multi-Cloud Environment for Scie...
Demonstrating a Pre-Exascale, Cost-Effective Multi-Cloud Environment for Scie...Igor Sfiligoi
 
Alternative cryptocurrencies
Alternative cryptocurrencies Alternative cryptocurrencies
Alternative cryptocurrencies vpnmentor
 
Smallworld to GreGG - FME Server Automation
Smallworld to GreGG - FME Server AutomationSmallworld to GreGG - FME Server Automation
Smallworld to GreGG - FME Server AutomationSafe Software
 
SkyhookDM - Towards an Arrow-Native Storage System
SkyhookDM - Towards an Arrow-Native Storage SystemSkyhookDM - Towards an Arrow-Native Storage System
SkyhookDM - Towards an Arrow-Native Storage SystemJayjeetChakraborty
 
Computer Performance Microscopy with SHIM
Computer Performance Microscopy with SHIMComputer Performance Microscopy with SHIM
Computer Performance Microscopy with SHIMhiyangxi
 
Running a GPU burst for Multi-Messenger Astrophysics with IceCube across all ...
Running a GPU burst for Multi-Messenger Astrophysics with IceCube across all ...Running a GPU burst for Multi-Messenger Astrophysics with IceCube across all ...
Running a GPU burst for Multi-Messenger Astrophysics with IceCube across all ...Igor Sfiligoi
 
NRP Engagement webinar - Running a 51k GPU multi-cloud burst for MMA with Ic...
 NRP Engagement webinar - Running a 51k GPU multi-cloud burst for MMA with Ic... NRP Engagement webinar - Running a 51k GPU multi-cloud burst for MMA with Ic...
NRP Engagement webinar - Running a 51k GPU multi-cloud burst for MMA with Ic...Igor Sfiligoi
 
Burst data retrieval after 50k GPU Cloud run
Burst data retrieval after 50k GPU Cloud runBurst data retrieval after 50k GPU Cloud run
Burst data retrieval after 50k GPU Cloud runIgor Sfiligoi
 
Pig TPC-H Benchmark and Performance Tuning
Pig TPC-H Benchmark and Performance TuningPig TPC-H Benchmark and Performance Tuning
Pig TPC-H Benchmark and Performance TuningJie Li
 
Data-intensive IceCube Cloud Burst
Data-intensive IceCube Cloud BurstData-intensive IceCube Cloud Burst
Data-intensive IceCube Cloud BurstIgor Sfiligoi
 
하이퍼커넥트 데이터 팀이 데이터 증가에 대처해온 기록
하이퍼커넥트 데이터 팀이 데이터 증가에 대처해온 기록하이퍼커넥트 데이터 팀이 데이터 증가에 대처해온 기록
하이퍼커넥트 데이터 팀이 데이터 증가에 대처해온 기록Jaehyeuk Oh
 
Adaptive Linear Solvers and Eigensolvers
Adaptive Linear Solvers and EigensolversAdaptive Linear Solvers and Eigensolvers
Adaptive Linear Solvers and Eigensolversinside-BigData.com
 
Pharo Status Fosdem 2015
Pharo Status Fosdem 2015Pharo Status Fosdem 2015
Pharo Status Fosdem 2015Marcus Denker
 

Mais procurados (19)

Fun421 stephens
Fun421 stephensFun421 stephens
Fun421 stephens
 
Taking Your Database Beyond the Border of a Single Kubernetes Cluster
Taking Your Database Beyond the Border of a Single Kubernetes ClusterTaking Your Database Beyond the Border of a Single Kubernetes Cluster
Taking Your Database Beyond the Border of a Single Kubernetes Cluster
 
Super COMPUTING Journal
Super COMPUTING JournalSuper COMPUTING Journal
Super COMPUTING Journal
 
On heap cache vs off-heap cache
On heap cache vs off-heap cacheOn heap cache vs off-heap cache
On heap cache vs off-heap cache
 
Smallworld Data Check-Out to Microstation
Smallworld Data Check-Out to MicrostationSmallworld Data Check-Out to Microstation
Smallworld Data Check-Out to Microstation
 
Demonstrating a Pre-Exascale, Cost-Effective Multi-Cloud Environment for Scie...
Demonstrating a Pre-Exascale, Cost-Effective Multi-Cloud Environment for Scie...Demonstrating a Pre-Exascale, Cost-Effective Multi-Cloud Environment for Scie...
Demonstrating a Pre-Exascale, Cost-Effective Multi-Cloud Environment for Scie...
 
Alternative cryptocurrencies
Alternative cryptocurrencies Alternative cryptocurrencies
Alternative cryptocurrencies
 
Smallworld to GreGG - FME Server Automation
Smallworld to GreGG - FME Server AutomationSmallworld to GreGG - FME Server Automation
Smallworld to GreGG - FME Server Automation
 
SkyhookDM - Towards an Arrow-Native Storage System
SkyhookDM - Towards an Arrow-Native Storage SystemSkyhookDM - Towards an Arrow-Native Storage System
SkyhookDM - Towards an Arrow-Native Storage System
 
Computer Performance Microscopy with SHIM
Computer Performance Microscopy with SHIMComputer Performance Microscopy with SHIM
Computer Performance Microscopy with SHIM
 
Running a GPU burst for Multi-Messenger Astrophysics with IceCube across all ...
Running a GPU burst for Multi-Messenger Astrophysics with IceCube across all ...Running a GPU burst for Multi-Messenger Astrophysics with IceCube across all ...
Running a GPU burst for Multi-Messenger Astrophysics with IceCube across all ...
 
NRP Engagement webinar - Running a 51k GPU multi-cloud burst for MMA with Ic...
 NRP Engagement webinar - Running a 51k GPU multi-cloud burst for MMA with Ic... NRP Engagement webinar - Running a 51k GPU multi-cloud burst for MMA with Ic...
NRP Engagement webinar - Running a 51k GPU multi-cloud burst for MMA with Ic...
 
Burst data retrieval after 50k GPU Cloud run
Burst data retrieval after 50k GPU Cloud runBurst data retrieval after 50k GPU Cloud run
Burst data retrieval after 50k GPU Cloud run
 
Pig TPC-H Benchmark and Performance Tuning
Pig TPC-H Benchmark and Performance TuningPig TPC-H Benchmark and Performance Tuning
Pig TPC-H Benchmark and Performance Tuning
 
Progress_190315
Progress_190315Progress_190315
Progress_190315
 
Data-intensive IceCube Cloud Burst
Data-intensive IceCube Cloud BurstData-intensive IceCube Cloud Burst
Data-intensive IceCube Cloud Burst
 
하이퍼커넥트 데이터 팀이 데이터 증가에 대처해온 기록
하이퍼커넥트 데이터 팀이 데이터 증가에 대처해온 기록하이퍼커넥트 데이터 팀이 데이터 증가에 대처해온 기록
하이퍼커넥트 데이터 팀이 데이터 증가에 대처해온 기록
 
Adaptive Linear Solvers and Eigensolvers
Adaptive Linear Solvers and EigensolversAdaptive Linear Solvers and Eigensolvers
Adaptive Linear Solvers and Eigensolvers
 
Pharo Status Fosdem 2015
Pharo Status Fosdem 2015Pharo Status Fosdem 2015
Pharo Status Fosdem 2015
 

Semelhante a Heatmap

Processing Big Data in Real-Time - Yanai Franchi, Tikal
Processing Big Data in Real-Time - Yanai Franchi, TikalProcessing Big Data in Real-Time - Yanai Franchi, Tikal
Processing Big Data in Real-Time - Yanai Franchi, TikalCodemotion Tel Aviv
 
Processing Big Data in Realtime
Processing Big Data in RealtimeProcessing Big Data in Realtime
Processing Big Data in RealtimeTikal Knowledge
 
Beyond the DSL - Unlocking the power of Kafka Streams with the Processor API
Beyond the DSL - Unlocking the power of Kafka Streams with the Processor APIBeyond the DSL - Unlocking the power of Kafka Streams with the Processor API
Beyond the DSL - Unlocking the power of Kafka Streams with the Processor APIconfluent
 
Faster Workflows, Faster
Faster Workflows, FasterFaster Workflows, Faster
Faster Workflows, FasterKen Krugler
 
Data Analytics and Simulation in Parallel with MATLAB*
Data Analytics and Simulation in Parallel with MATLAB*Data Analytics and Simulation in Parallel with MATLAB*
Data Analytics and Simulation in Parallel with MATLAB*Intel® Software
 
introduction to data processing using Hadoop and Pig
introduction to data processing using Hadoop and Pigintroduction to data processing using Hadoop and Pig
introduction to data processing using Hadoop and PigRicardo Varela
 
Big Data Day LA 2016/ Big Data Track - Portable Stream and Batch Processing w...
Big Data Day LA 2016/ Big Data Track - Portable Stream and Batch Processing w...Big Data Day LA 2016/ Big Data Track - Portable Stream and Batch Processing w...
Big Data Day LA 2016/ Big Data Track - Portable Stream and Batch Processing w...Data Con LA
 
Exploring Parallel Merging In GPU Based Systems Using CUDA C.
Exploring Parallel Merging In GPU Based Systems Using CUDA C.Exploring Parallel Merging In GPU Based Systems Using CUDA C.
Exploring Parallel Merging In GPU Based Systems Using CUDA C.Rakib Hossain
 
Introduction to Big Data
Introduction to Big DataIntroduction to Big Data
Introduction to Big DataAlbert Bifet
 
Real-Time Big Data with Storm, Kafka and GigaSpaces
Real-Time Big Data with Storm, Kafka and GigaSpacesReal-Time Big Data with Storm, Kafka and GigaSpaces
Real-Time Big Data with Storm, Kafka and GigaSpacesOleksii Diagiliev
 
Programmable Exascale Supercomputer
Programmable Exascale SupercomputerProgrammable Exascale Supercomputer
Programmable Exascale SupercomputerSagar Dolas
 
What’s new in 9.6, by PostgreSQL contributor
What’s new in 9.6, by PostgreSQL contributorWhat’s new in 9.6, by PostgreSQL contributor
What’s new in 9.6, by PostgreSQL contributorMasahiko Sawada
 
Real-time and long-time together
Real-time and long-time togetherReal-time and long-time together
Real-time and long-time togetherTed Dunning
 
Distributed real time stream processing- why and how
Distributed real time stream processing- why and howDistributed real time stream processing- why and how
Distributed real time stream processing- why and howPetr Zapletal
 
Reproducible Computational Pipelines with Docker and Nextflow
Reproducible Computational Pipelines with Docker and NextflowReproducible Computational Pipelines with Docker and Nextflow
Reproducible Computational Pipelines with Docker and Nextflowinside-BigData.com
 
SaltConf14 - Eric johnson, Google - Orchestrating Google Compute Engine with ...
SaltConf14 - Eric johnson, Google - Orchestrating Google Compute Engine with ...SaltConf14 - Eric johnson, Google - Orchestrating Google Compute Engine with ...
SaltConf14 - Eric johnson, Google - Orchestrating Google Compute Engine with ...SaltStack
 
Real-Time Streaming with Apache Spark Streaming and Apache Storm
Real-Time Streaming with Apache Spark Streaming and Apache StormReal-Time Streaming with Apache Spark Streaming and Apache Storm
Real-Time Streaming with Apache Spark Streaming and Apache StormDavorin Vukelic
 
QNIBTerminal: Understand your datacenter by overlaying multiple information l...
QNIBTerminal: Understand your datacenter by overlaying multiple information l...QNIBTerminal: Understand your datacenter by overlaying multiple information l...
QNIBTerminal: Understand your datacenter by overlaying multiple information l...QNIB Solutions
 
Automating materials science workflows with pymatgen, FireWorks, and atomate
Automating materials science workflows with pymatgen, FireWorks, and atomateAutomating materials science workflows with pymatgen, FireWorks, and atomate
Automating materials science workflows with pymatgen, FireWorks, and atomateAnubhav Jain
 
Scientific Applications of The Data Distribution Service
Scientific Applications of The Data Distribution ServiceScientific Applications of The Data Distribution Service
Scientific Applications of The Data Distribution ServiceAngelo Corsaro
 

Semelhante a Heatmap (20)

Processing Big Data in Real-Time - Yanai Franchi, Tikal
Processing Big Data in Real-Time - Yanai Franchi, TikalProcessing Big Data in Real-Time - Yanai Franchi, Tikal
Processing Big Data in Real-Time - Yanai Franchi, Tikal
 
Processing Big Data in Realtime
Processing Big Data in RealtimeProcessing Big Data in Realtime
Processing Big Data in Realtime
 
Beyond the DSL - Unlocking the power of Kafka Streams with the Processor API
Beyond the DSL - Unlocking the power of Kafka Streams with the Processor APIBeyond the DSL - Unlocking the power of Kafka Streams with the Processor API
Beyond the DSL - Unlocking the power of Kafka Streams with the Processor API
 
Faster Workflows, Faster
Faster Workflows, FasterFaster Workflows, Faster
Faster Workflows, Faster
 
Data Analytics and Simulation in Parallel with MATLAB*
Data Analytics and Simulation in Parallel with MATLAB*Data Analytics and Simulation in Parallel with MATLAB*
Data Analytics and Simulation in Parallel with MATLAB*
 
introduction to data processing using Hadoop and Pig
introduction to data processing using Hadoop and Pigintroduction to data processing using Hadoop and Pig
introduction to data processing using Hadoop and Pig
 
Big Data Day LA 2016/ Big Data Track - Portable Stream and Batch Processing w...
Big Data Day LA 2016/ Big Data Track - Portable Stream and Batch Processing w...Big Data Day LA 2016/ Big Data Track - Portable Stream and Batch Processing w...
Big Data Day LA 2016/ Big Data Track - Portable Stream and Batch Processing w...
 
Exploring Parallel Merging In GPU Based Systems Using CUDA C.
Exploring Parallel Merging In GPU Based Systems Using CUDA C.Exploring Parallel Merging In GPU Based Systems Using CUDA C.
Exploring Parallel Merging In GPU Based Systems Using CUDA C.
 
Introduction to Big Data
Introduction to Big DataIntroduction to Big Data
Introduction to Big Data
 
Real-Time Big Data with Storm, Kafka and GigaSpaces
Real-Time Big Data with Storm, Kafka and GigaSpacesReal-Time Big Data with Storm, Kafka and GigaSpaces
Real-Time Big Data with Storm, Kafka and GigaSpaces
 
Programmable Exascale Supercomputer
Programmable Exascale SupercomputerProgrammable Exascale Supercomputer
Programmable Exascale Supercomputer
 
What’s new in 9.6, by PostgreSQL contributor
What’s new in 9.6, by PostgreSQL contributorWhat’s new in 9.6, by PostgreSQL contributor
What’s new in 9.6, by PostgreSQL contributor
 
Real-time and long-time together
Real-time and long-time togetherReal-time and long-time together
Real-time and long-time together
 
Distributed real time stream processing- why and how
Distributed real time stream processing- why and howDistributed real time stream processing- why and how
Distributed real time stream processing- why and how
 
Reproducible Computational Pipelines with Docker and Nextflow
Reproducible Computational Pipelines with Docker and NextflowReproducible Computational Pipelines with Docker and Nextflow
Reproducible Computational Pipelines with Docker and Nextflow
 
SaltConf14 - Eric johnson, Google - Orchestrating Google Compute Engine with ...
SaltConf14 - Eric johnson, Google - Orchestrating Google Compute Engine with ...SaltConf14 - Eric johnson, Google - Orchestrating Google Compute Engine with ...
SaltConf14 - Eric johnson, Google - Orchestrating Google Compute Engine with ...
 
Real-Time Streaming with Apache Spark Streaming and Apache Storm
Real-Time Streaming with Apache Spark Streaming and Apache StormReal-Time Streaming with Apache Spark Streaming and Apache Storm
Real-Time Streaming with Apache Spark Streaming and Apache Storm
 
QNIBTerminal: Understand your datacenter by overlaying multiple information l...
QNIBTerminal: Understand your datacenter by overlaying multiple information l...QNIBTerminal: Understand your datacenter by overlaying multiple information l...
QNIBTerminal: Understand your datacenter by overlaying multiple information l...
 
Automating materials science workflows with pymatgen, FireWorks, and atomate
Automating materials science workflows with pymatgen, FireWorks, and atomateAutomating materials science workflows with pymatgen, FireWorks, and atomate
Automating materials science workflows with pymatgen, FireWorks, and atomate
 
Scientific Applications of The Data Distribution Service
Scientific Applications of The Data Distribution ServiceScientific Applications of The Data Distribution Service
Scientific Applications of The Data Distribution Service
 

Mais de Tikal Knowledge

Clojure - LISP on the JVM
Clojure - LISP on the JVM Clojure - LISP on the JVM
Clojure - LISP on the JVM Tikal Knowledge
 
Tabtale story: Building a publishing and monitoring mobile games architecture...
Tabtale story: Building a publishing and monitoring mobile games architecture...Tabtale story: Building a publishing and monitoring mobile games architecture...
Tabtale story: Building a publishing and monitoring mobile games architecture...Tikal Knowledge
 
Docking your services_with_docker
Docking your services_with_dockerDocking your services_with_docker
Docking your services_with_dockerTikal Knowledge
 
Writing a Fullstack Application with Javascript - Remote media player
Writing a Fullstack Application with Javascript - Remote media playerWriting a Fullstack Application with Javascript - Remote media player
Writing a Fullstack Application with Javascript - Remote media playerTikal Knowledge
 
.Net OSS Ci & CD with Jenkins - JUC ISRAEL 2013
.Net OSS Ci & CD with Jenkins - JUC ISRAEL 2013 .Net OSS Ci & CD with Jenkins - JUC ISRAEL 2013
.Net OSS Ci & CD with Jenkins - JUC ISRAEL 2013 Tikal Knowledge
 
Tikal's Backbone_js introduction workshop
Tikal's Backbone_js introduction workshopTikal's Backbone_js introduction workshop
Tikal's Backbone_js introduction workshopTikal Knowledge
 
Cloud computing - an insight into "how does it really work ?"
Cloud computing - an insight into "how does it really work ?" Cloud computing - an insight into "how does it really work ?"
Cloud computing - an insight into "how does it really work ?" Tikal Knowledge
 
Introduction to Cloud Computing
Introduction to Cloud ComputingIntroduction to Cloud Computing
Introduction to Cloud ComputingTikal Knowledge
 
Tikal Fuse Day Access Layer Implementation (C#) Based On Mongo Db
Tikal Fuse Day   Access Layer Implementation (C#) Based On Mongo DbTikal Fuse Day   Access Layer Implementation (C#) Based On Mongo Db
Tikal Fuse Day Access Layer Implementation (C#) Based On Mongo DbTikal Knowledge
 
Ship early ship often with Django
Ship early ship often with DjangoShip early ship often with Django
Ship early ship often with DjangoTikal Knowledge
 
Introduction To Cloud Computing
Introduction To Cloud ComputingIntroduction To Cloud Computing
Introduction To Cloud ComputingTikal Knowledge
 
Building Components In Flex3
Building Components In Flex3Building Components In Flex3
Building Components In Flex3Tikal Knowledge
 
JBUG 11 - Django-The Web Framework For Perfectionists With Deadlines
JBUG 11 - Django-The Web Framework For Perfectionists With DeadlinesJBUG 11 - Django-The Web Framework For Perfectionists With Deadlines
JBUG 11 - Django-The Web Framework For Perfectionists With DeadlinesTikal Knowledge
 

Mais de Tikal Knowledge (20)

Clojure - LISP on the JVM
Clojure - LISP on the JVM Clojure - LISP on the JVM
Clojure - LISP on the JVM
 
Clojure presentation
Clojure presentationClojure presentation
Clojure presentation
 
Tabtale story: Building a publishing and monitoring mobile games architecture...
Tabtale story: Building a publishing and monitoring mobile games architecture...Tabtale story: Building a publishing and monitoring mobile games architecture...
Tabtale story: Building a publishing and monitoring mobile games architecture...
 
Docking your services_with_docker
Docking your services_with_dockerDocking your services_with_docker
Docking your services_with_docker
 
Who moved my_box
Who moved my_boxWho moved my_box
Who moved my_box
 
Writing a Fullstack Application with Javascript - Remote media player
Writing a Fullstack Application with Javascript - Remote media playerWriting a Fullstack Application with Javascript - Remote media player
Writing a Fullstack Application with Javascript - Remote media player
 
.Net OSS Ci & CD with Jenkins - JUC ISRAEL 2013
.Net OSS Ci & CD with Jenkins - JUC ISRAEL 2013 .Net OSS Ci & CD with Jenkins - JUC ISRAEL 2013
.Net OSS Ci & CD with Jenkins - JUC ISRAEL 2013
 
Tikal's Backbone_js introduction workshop
Tikal's Backbone_js introduction workshopTikal's Backbone_js introduction workshop
Tikal's Backbone_js introduction workshop
 
TCE Automation
TCE AutomationTCE Automation
TCE Automation
 
Tce automation-d4
Tce automation-d4Tce automation-d4
Tce automation-d4
 
Cloud computing - an insight into "how does it really work ?"
Cloud computing - an insight into "how does it really work ?" Cloud computing - an insight into "how does it really work ?"
Cloud computing - an insight into "how does it really work ?"
 
Introduction to Cloud Computing
Introduction to Cloud ComputingIntroduction to Cloud Computing
Introduction to Cloud Computing
 
Tikal Fuse Day Access Layer Implementation (C#) Based On Mongo Db
Tikal Fuse Day   Access Layer Implementation (C#) Based On Mongo DbTikal Fuse Day   Access Layer Implementation (C#) Based On Mongo Db
Tikal Fuse Day Access Layer Implementation (C#) Based On Mongo Db
 
Ship early ship often with Django
Ship early ship often with DjangoShip early ship often with Django
Ship early ship often with Django
 
Google App Engine
Google App EngineGoogle App Engine
Google App Engine
 
AWS Case Study
AWS Case StudyAWS Case Study
AWS Case Study
 
Introduction To Cloud Computing
Introduction To Cloud ComputingIntroduction To Cloud Computing
Introduction To Cloud Computing
 
Building Components In Flex3
Building Components In Flex3Building Components In Flex3
Building Components In Flex3
 
Osgi Democamp
Osgi DemocampOsgi Democamp
Osgi Democamp
 
JBUG 11 - Django-The Web Framework For Perfectionists With Deadlines
JBUG 11 - Django-The Web Framework For Perfectionists With DeadlinesJBUG 11 - Django-The Web Framework For Perfectionists With Deadlines
JBUG 11 - Django-The Web Framework For Perfectionists With Deadlines
 

Heatmap

  • 1. 1 ProcessingProcessing “BIG-DATA”“BIG-DATA” InIn Real TimeReal Time Yanai Franchi , TikalYanai Franchi , Tikal
  • 2. 2
  • 4. 4 After a Long Travel DayAfter a Long Travel Day
  • 5. 5 Going to a Salsa Club
  • 6. 6 Best Salsa Club NOW ● Good Music ● Crowded – Now!
  • 7. 7 Same Problem in “gogobot”
  • 8. 8
  • 9. 9 gogobot checkin Heat Map Service Lets' Develop “Gogobot Checkins Heat-Map”
  • 10. 10 Key Notes ● Collector Service - Collects checkins as text addresses – We need to use GeoLocation ServiceWe need to use GeoLocation Service ● Upon elapsed interval, the last locations list will be displayed as Heat-Map in GUI. ● Web Scale service – 10Ks checkins/seconds all over the world (imaginary, but lets do it for the exercise). ● Accuracy – Sample data, NOT critical data. – Proportionately representative – Data volume is large enough tois large enough to compensate for data loss.compensate for data loss.
  • 11. 11 Heat-Map Context Text-Address Checkins Heat-Map Service Gogobot System Gogobot Micro Service Gogobot Micro Service Gogobot Micro Service Geo Location Service Get-GeoCode(Address) Heat-Map Last Interval Locations
  • 12. 12 Database Persist Checkin Intervals Processing Checkins Read Text Address Check-in #1 Check-in #2 Check-in #3 Check-in #4 Check-in #5 Check-in #6 Check-in #7 Check-in #8 Check-in #9 ... Simulate Checkins with a File Plan A GET Geo Location Geo Location Service
  • 14. 14 Architect - First Reaction...
  • 18. 18 Problems ? ● Tedious: Spend time confi guring where to send messages, deploying workers, and deploying intermediate queues. ● Brittle: There's little fault-tolerance. ● Painful to scale: Partition of running worker/s is complicated.
  • 19. 19 What We Want ? ● Horizontal scalability ● Fault-tolerance ● No intermediate message brokers! ● Higher level abstraction than message passing ● “Just works” ● Guaranteed data processing (not in this case)
  • 20. 20 Apache Storm ✔Horizontal scalability ✔Fault-tolerance ✔No intermediate message brokers! ✔Higher level abstraction than message passing ✔“Just works” ✔Guaranteed data processing
  • 22. 22 What is Storm ? ● CEP - Open source and distributed realtime computation system. – Makes it easy toMakes it easy to reliably process unboundedreliably process unbounded streamsstreams ofof tuplestuples – Doing for realtime processing what Hadoop did for batchDoing for realtime processing what Hadoop did for batch processing.processing. ● Fast - 1M Tuples/sec per node. – It is scalable,fault-tolerant, guarantees your data will beIt is scalable,fault-tolerant, guarantees your data will be processed, and is easy to set up and operate.processed, and is easy to set up and operate.
  • 23. 23 Streams Tuple Tuple Tuple Tuple Tuple Tuple Unbounded sequence of tuples
  • 25. 25 Bolts Tuple TupleTuple Processes input streams and produces new streams Tuple TupleTupleTuple Tuple TupleTuple
  • 26. 26 Storm Topology Network of spouts and bolts Tuple TupleTuple TupleTuple TupleTuple Tuple TupleTupleTuple Tuple Tuple Tuple Tuple TupleTupleTuple
  • 27. 27 Guarantee for Processing ● Storm guarantees the full processing of a tuple by tracking its state ● In case of failure, Storm can re-process it. ● Source tuples with full “acked” trees are removed from the system
  • 28. 28 Tasks (Bolt/Spout Instance) Spouts and bolts execute as many tasks across the cluster
  • 29. 29 Stream Grouping When a tuple is emitted, which task (instance) does it go to?
  • 30. 30 Stream Grouping ● Shuffl e grouping: pick a random task ● Fields grouping: consistent hashing on a subset of tuple fi elds ● All grouping: send to all tasks ● Global grouping: pick task with lowest id
  • 31. 31 Tasks , Executors , Workers Task Task Task Worker Process Sput / Bolt Sput / Bolt Sput / Bolt = Executor Thread JVM Executor Thread
  • 32. 32 Bolt B Bolt B Worker Process Executor Spout A Executor Node Supervisor Bolt C Bolt C Executor Bolt B Bolt B Worker Process Executor Spout A Executor Node Supervisor Bolt C Bolt C Executor
  • 33. 33 Nimbus Supervisor Supervisor Supervisor Supervisor Supervisor Supervisor Upload/Rebalance Heat-Map Topology Zoo Keeper Nodes Storm Architecture Master Node (similar to Hadoop JobTracker) NOT critical for running topology
  • 34. 34 Nimbus Supervisor Supervisor Supervisor Supervisor Supervisor Supervisor Upload/Rebalance Heat-Map Topology Zoo Keeper Storm Architecture Used For Cluster Coordination A few nodes
  • 35. 35 Nimbus Supervisor Supervisor Supervisor Supervisor Supervisor Supervisor Upload/Rebalance Heat-Map Topology Zoo Keeper Storm Architecture Run Worker Processes
  • 37. 37 HeatMap Input/Output Tuples ● Input Tuples: Timestamp and Text Address : – (9:00:07 PM , “287 Hudson St New York NY 10013”)(9:00:07 PM , “287 Hudson St New York NY 10013”) ● Output Tuple: Time interval, and a list of points for it: – (9:00:00 PM to 9:00:15 PM,(9:00:00 PM to 9:00:15 PM, ListList((((40.719,-73.98740.719,-73.987),(40.726,-74.001),(),(40.726,-74.001),(40.719,-73.98740.719,-73.987))))
  • 38. 38 Checkins Spout Geocode Lookup Bolt Heatmap Builder Bolt Persistor Bolt (9:01 PM @ 287 Hudson st) (9:01 PM , (40.736, -74,354))) Heat Map Storm Topology (9:00 PM – 9:15 PM , List((40.73, -74,34), (51.36, -83,33),(69.73, -34,24)) Upon Elapsed Interval
  • 39. 39 Checkins Spout public class CheckinsSpout extends BaseRichSpout { private List<String> sampleLocations; private int nextEmitIndex; private SpoutOutputCollector outputCollector; @Override public void open(Map map, TopologyContext topologyContext, SpoutOutputCollector spoutOutputCollector) { this.outputCollector = spoutOutputCollector; this.nextEmitIndex = 0; sampleLocations = IOUtils.readLines( ClassLoader.getSystemResourceAsStream("sanple-locations.txt")); } @Override public void nextTuple() { String address = checkins.get(nextEmitIndex); String checkin = new Date().getTime()+"@ADDRESS:"+address; outputCollector.emit(new Values(checkin)); nextEmitIndex = (nextEmitIndex + 1) % sampleLocations.size(); } @Override public void declareOutputFields(OutputFieldsDeclarer declarer) { declarer.declare(new Fields("str")); } We hold state No need for thread safety Declare output fields Been called iteratively by Storm
  • 40. 40 Geocode Lookup Bolt public class GeocodeLookupBolt extends BaseBasicBolt { private LocatorService locatorService; @Override public void prepare(Map stormConf, TopologyContext context) { locatorService = new GoogleLocatorService(); } @Override public void execute(Tuple tuple, BasicOutputCollector outputCollector) { String str = tuple.getStringByField("str"); String[] parts = str.split("@"); Long time = Long.valueOf(parts[0]); String address = parts[1]; LocationDTO locationDTO = locatorService.getLocation(address); if(checkinDTO!=null) outputCollector.emit(new Values(time,locationDTO) ); } @Override public void declareOutputFields(OutputFieldsDeclarer fieldsDeclarer) { fieldsDeclarer.declare(new Fields("time", "location")); } } Get Geocode, Create DTO
  • 41. 41 Tick Tuple – Repeating Mantra
  • 42. 42 Two Streams to Heat-Map Builder On tick tuple, we fl ush our Heat-Map Checkin 1 Checkin 4 Checkin 5 Checkin 6 HeatMap- Builder Bolt
  • 43. 43 Tick Tuple in Action public class HeatMapBuilderBolt extends BaseBasicBolt { private Map<String, List<LocationDTO>> heatmaps; @Override public Map<String, Object> getComponentConfiguration() { Config conf = new Config(); conf.put(Config.TOPOLOGY_TICK_TUPLE_FREQ_SECS, 60 ); return conf; } @Override public void execute(Tuple tuple, BasicOutputCollector outputCollector) { if (isTickTuple(tuple)) { // Emit accumulated intervals } else { // Add check-in info to the current interval in the Map } } private boolean isTickTuple(Tuple tuple) { return tuple.getSourceComponent().equals(Constants.SYSTEM_COMPONENT_ID) && tuple.getSourceStreamId().equals(Constants.SYSTEM_TICK_STREAM_ID); } Tick interval Hold latest intervals
  • 44. 44 Persister Bolt public class PersistorBolt extends BaseBasicBolt { private Jedis jedis; @Override public void execute(Tuple tuple, BasicOutputCollector outputCollector) { Long timeInterval = tuple.getLongByField("time-interval"); String city = tuple.getStringByField("city"); String locationsList = objectMapper.writeValueAsString ( tuple.getValueByField("locationsList")); String dbKey = "checkins-" + timeInterval+"@"+city; jedis.setex(dbKey, 3600*24 ,locationsList); jedis.publish("location-key", dbKey); } } Publish in Redis channel for debugging Persist in Redis for 24h
  • 45. 45 Shuffle Grouping Shuffle Grouping Check-in #1 Check-in #2 Check-in #3 Check-in #4 Check-in #5 Check-in #6 Check-in #7 Check-in #8 Check-in #9 ... Sample Checkins File Read Text Addresses Transforming the Tuples Checkins Spout Geocode Lookup Bolt Heatmap Builder Bolt Database Persistor Bolt Get Geo Location Geo Location Service Field Grouping(city) Group by city
  • 46. 46 Heat Map Topology public class LocalTopologyRunner { public static void main(String[] args) { TopologyBuilder builder = buildTopolgy(); StormSubmitter.submitTopology( "local-heatmap", new Config(), builder.createTopology()); } private static TopologyBuilder buildTopolgy() { topologyBuilder builder = new TopologyBuilder(); builder.setSpout("checkins", new CheckinsSpout()); builder.setBolt("geocode-lookup", new GeocodeLookupBolt() ) .shuffleGrouping("checkins"); builder.setBolt("heatmap-builder", new HeatMapBuilderBolt() ) .fieldsGrouping("geocode-lookup", new Fields("city")); builder.setBolt("persistor", new PersistorBolt() ) .shuffleGrouping("heatmap-builder"); return builder; } }
  • 48. 48
  • 49. 49 Scaling the Topology public class LocalTopologyRunner { conf.setNumWorkers(20); public static void main(String[] args) { TopologyBuilder builder = buildTopolgy(); Config conf = new Config(); conf.setNumWorkers(2); StormSubmitter.submitTopology( "local-heatmap", conf, builder.createTopology()); } private static TopologyBuilder buildTopolgy() { topologyBuilder builder = new TopologyBuilder(); builder.setSpout("checkins", new CheckinsSpout(), 4 ); builder.setBolt("geocode-lookup", new GeocodeLookupBolt() , 8 ) .shuffleGrouping("checkins").setNumTasks(64); builder.setBolt("heatmap-builder", new HeatMapBuilderBolt() , 4) .fieldsGrouping("geocode-lookup", new Fields("city")); builder.setBolt("persistor", new PersistorBolt() , 2 ) .shuffleGrouping("heatmap-builder").setNumTasks(4); return builder; Parallelism hint Increase Tasks For Future Set no. of workers
  • 51. 51 Database Storm Heat-Map Topology Persist Checkin Intervals GET Geo Location Check-in #1 Check-in #2 Check-in #3 Check-in #4 Check-in #5 Check-in #6 Check-in #7 Check-in #8 Check-in #9 ... Read Text Address Sample Checkins File Recap – Plan A Geo Location Service
  • 54. 54 Plan B - Kafka Spout&Bolt to HeatMap Geocode Lookup Bolt Heatmap Builder Bolt Kafka Checkins Spout Database Persistor Bolt Geo Location Service Read Text Addresses Checkin Kafka Topic Publish Checkins Locations Topic Kafka Locations Bolt
  • 55. 55
  • 56. 56 They all are Good But not for all use-cases
  • 58. 58
  • 60. 60
  • 61. 61
  • 62. 62
  • 63. 63
  • 64. 64 Stateless Broker & Doesn't Fear the File System
  • 65. 65
  • 66. 66
  • 67. 67
  • 68. 68 Topics ● Logical collections of partitions (the physical fi les). ● A broker contains some of the partitions for a topic
  • 69. 69 A partition is Consumed by Exactly One Group's Consumer
  • 71. 71 Broker 1 Broker 3Broker 2 Zoo Keeper Consumer 1 Consumer 2 Producer 1 Producer 2
  • 72. 72 Broker 1 Broker 4Broker 3Broker 2 Zoo Keeper Consumer 1 Consumer 2 Producer 1 Producer 2
  • 73. 73 Broker 1 Broker 4Broker 3Broker 2 Zoo Keeper Consumer 1 Consumer 2 Producer 1 Producer 2
  • 74. 74 Broker 1 Broker 4Broker 3Broker 2 Zoo Keeper Consumer 1 Consumer 2 Producer 1 Producer 2
  • 75. 75 Broker 1 Broker 4Broker 3Broker 2 Zoo Keeper Consumer 1 Consumer 2 Producer 1 Producer 2
  • 76. 76 Broker 1 Broker 4Broker 3Broker 2 Zoo Keeper Consumer 1 Consumer 2 Producer 1 Producer 2
  • 77. 77 Broker 1 Broker 4Broker 3Broker 2 Zoo Keeper Consumer 1 Consumer 2 Producer 1 Producer 2
  • 78. 78 Broker 1 Broker 3Broker 2 Zoo Keeper Consumer 1 Consumer 2 Producer 1 Producer 2
  • 79. 79 Broker 1 Broker 3Broker 2 Zoo Keeper Consumer 1 Consumer 2 Producer 1 Producer 2
  • 80. 80 Broker 1 Broker 3Broker 2 Zoo Keeper Consumer 1 Consumer 2 Producer 1 Producer 2
  • 81. 81 Broker 1 Broker 3Broker 2 Zoo Keeper Consumer 1 Producer 1 Producer 2
  • 82. 82 Broker 1 Broker 3Broker 2 Zoo Keeper Consumer 1 Producer 1 Producer 2
  • 83. 83 Broker 1 Broker 3Broker 2 Zoo Keeper Consumer 1 Producer 1 Producer 2
  • 84. 84 Performance Benchmark 1 Broker 1 Producer 1 Consumer
  • 85. 85
  • 86. 86
  • 87. 87 Add Kafka to our Topology public class LocalTopologyRunner { ... private static TopologyBuilder buildTopolgy() { ... builder.setSpout("checkins", new KafkaSpout(kafkaConfig)); ... builder.setBolt("kafkaProducer", new KafkaOutputBolt ( "localhost:9092", "kafka.serializer.StringEncoder", "locations-topic")) .shuffleGrouping("persistor"); return builder; } } Kafka Bolt Kafka Spout
  • 88. 88 Checkin HTTP Reactor Publish Checkins Plan C – Add Reactor Database Checkin Kafka Topic Consume Checkins Storm Heat-Map Topology Locations Kafka Topic Publish Interval Key Persist Checkin Intervals Geo Location ServiceGET Geo Location Index Interval Locations Search Server Index Text-Address
  • 93. 93 Reactor Pattern – Key Points ● Single thread / single event loop ● EVERYTHING runs on it ● You MUST NOT block the event loop ● Many Implementations (partial list): – Node.js (JavaScrip), EventMachine (Ruby), TwistedNode.js (JavaScrip), EventMachine (Ruby), Twisted (Python)... and Vert.X(Python)... and Vert.X
  • 94. 94 Reactor Pattern Problems ● Some work is naturally blocking: – Intensive data crunchingIntensive data crunching – 3rd-party blocking API’s (e.g. JDBC)3rd-party blocking API’s (e.g. JDBC) ● Pure reactor (e.g. Node.js) is not a good fi t for this kind of work!
  • 95. 95
  • 96. 96 ● Vertciles are Execution unit of Vert.x ● Single threaded ● Verticles communicate by message passing Verticles For Blocking IO Run in Thread-PoolRun in Event Loop
  • 98. 98 Vert.X Goodies ● Growing Module ● Repository ● web server ● Persistors (Mongo, JDBC, ...) ● Work queue ● Authentication ● Manager ● Session manager ● Socket.IO ● TCP/SSL servers/clients ● HTTP/HTTPS servers/ clients ● WebSockets support ● SockJS support ● Timers ● Buffers ● Streams and Pumps ● Routing ● Asynchronous File I/O
  • 100. 100 Node.js vs Vert.X ● Node.js – JavaScript OnlyJavaScript Only – Inherently SingleInherently Single ThreadedThreaded – No help much with IPCNo help much with IPC – All code MUST be inAll code MUST be in Event loopEvent loop ● Vert.X – Polyglot (JavaScript,Polyglot (JavaScript, Java, Ruby, Python...)Java, Ruby, Python...) – Leverages JVM multi-Leverages JVM multi- threadingthreading – Nervous Event BusNervous Event Bus – Blocking work can beBlocking work can be done off the event loopdone off the event loop
  • 101. 101 Node.js vs Vert.X Benchmark AMD Phenom II X6 (6 core), 8GB RAM, Ubuntu 11.04 http://vertxproject.wordpress.com/2012/05/09/vert-x-vs-node-js-simple-http-benchmarks/
  • 102. 102 Event Bus HTTP Server Verticle Kafka module Kafka Topic Storm Topology HeatMap Reactor Architecture Vert.X Instance Automatically sends EventBus Msg → KafkaTopic Vert.X Instance
  • 103. 103 Heat-Map Server – Only 6 LOC ! var vertx = require('vertx'); var container = require('vertx/container'); var console = require('vertx/console'); var config = container.config; vertx.createHttpServer().requestHandler(function(request) { request.dataHandler(function(buffer) { vertx.eventBus.send(config.address, {payload:buffer.toString()}); }); request.response.end(); }).listen(config.httpServerPort, config.httpServerHost); console.log("HTTP CheckinsReactor started on port "+config.httpServerPort); Send checkin to Vert.X EventBus
  • 104. 104 Database Checkin HTTP Reactor Checkin Kafka Topic Consume Checkins Storm Heat-Map Topology Hotzones Kafka Topic Publish Interval Key Persist Checkin Intervals Web App Geo Location ServiceGET Geo Location Get Interval Locations Consume Intervals Keys Push via WebSocket Publish Checkins Search Server Index Index Interval Locations Search Checkin HTTP Firehose
  • 108. 108 Doesn't Answer Many Answers... ● What are the most popular Salsa club in last month? ● How many unique visitors this year , per Salsa club? ● Show histogram of “bouncing” checkins for the last year?
  • 111. 111 Complementary Views Batch Views Real Time Views Just a few hours of data Time Now
  • 112. 112 Lambda Architecture New Data 01101001101... real-time view real-time view real-time view Speed Layer master dataset batch view batch view batch view Serving Layer Batch Layer Query How Many ?
  • 113. 113 Lambda Advantages ● Recover from Human mistakes ● No need for random writes on Batch DB ● Processed with high precision, and involve algorithms without losing short-term information
  • 115. 115 When You go out to Salsa Club ● Good Music ● Crowded
  • 116. 116 More Conclusions.. ● Storm – Great for real-time BigData processing. Complementary for Hadoop batch jobs. ● Kafka – Great messaging for logs/events data, been served as a good “source” for Storm spout ● Vert.X – Worth trial and check as an alternative for reactor. ● Lambda Architecture – Bring Real-Time and Batch- Processing concert for Big Data.