4. Samza
for
stream
processing
4
Input
Stream
Task
1
Task
2
Task
3
Output
Stream
Changelog
Stream
Local
state
store
Checkpoint
Container
5. How
Samza
scales
within
node
5
0 Task1
1 …
n
Part
1
0 Task2
1 …
n
Part
2
Container1
0 Task3
1 …
n
Part
3
0 Task4
1 …
n
Part
4
0 Task1
1 …
n
Part
1
Container1
0 Task2
1 …
n
Part
2
Container2
0 Task3
1 …
n
Part
3
Container3
0 Task4
1 …
n
Part
4
Container4
7. Test
setup
7
0
broker
KaOa
Clusters
1
…
N
Container
• Use
KaOa
Producer
to
replay
messages
• 10
million
unique
keys
• Each
message
around
100
bytes
Container
Container
Test
System
• Test
System
config
• 24
cores
• 1gbps
nic
• 1.65TB
SSD
broker
broker
8. Performance
metrics
• How
many
messages
does
Samza
container
process
per
sec?
– Process-‐envelopes
• How
long
does
Samza
container
process
one
message?
– ProcessMs
– Due
to
SAMZA-‐738,
it
is
not
useful
in
our
test
• Important
metrics
– Message
behind
high
watermark
8
9. Test
scenarios
• Message
Passing:
reads
message
from
source
topic,
writes
to
desnaon
topic
• Key
Count:
reads
message
,
calculates
the
count
of
the
message
key
and
stores
back
to
the
store
a.
with
in-‐memory
store
b.
with
RocksDB
store
c.
with
RocksDB
store
&
changelog
9
10. Case
1:
Message
passing
10
• 1.2
million
messages/sec
per
node
• Network
NIC
is
saturated
11. Case
2:
Key
counng
with
in-‐memory
store
11
• 1
million
messages/sec
per
node
• Window
task
runs
every
90s
to
clean
up
the
state.
Otherwise
messages
will
fill
up
the
heap
and
trigger
frequent
full
gc
• CPU
ulizaon
of
the
box
is
around
80%
12. Case
3:
Key
counng
with
RocksDB
store
12
• 443k
messages/sec
per
node
• The
test
is
performed
without
tuning
on
RocksDB
• With
SAMZA-‐449,
it
should
give
us
more
hint
on
how
long
messages
are
processed
in
RocksDB
in
the
future
• CPU
ulizaon
is
around
84%
13. Case
4:
Key
counng
with
RocksDB
store
&
changelog
13
• 300k
messages/sec
per
node
• We
specify
linger.ms
to
1
to
avoid
frequent
send
• CPU
ulizaon
is
around
89%
15. Summary
• Isolated
performance
test
environment
• Replay-‐ability
of
messages
with
KaOa
producer
• Foundaon
of
capacity
model
for
Samza
in
the
future
– PublicaMon:
A
Memory
Capacity
Model
for
High
Performing
Data-‐
filtering
Applica<ons
in
Samza
Framework,
IEEE
Interna<onal
Conference
on
Big
Data
2015
–
Data
Quality
Issues
Workshop
15
Message
Passing
Key
CounMng
w
in-‐
memory
Key
CounMng
w
RocksDB
Key
CounMng
w
RocksDB
&
Changelog
Messages
per
sec
per
node
1.2
millions
1
millions
443k
300k