Realtime Statistics based on Apache Storm and RocketMQ
- Xin Wang
Dec. 16, 2017, Shenzhen, Apache RocketMQ Meetup
Xin Wang
• Apache Storm Committer & PMC member
• Five years of distributed systems experience
• Love open source & community
• Focus on distributed technologies, especially stream processing
• https://github.com/vesense
Streaming and batch come from different worlds
and use different ways to solve different problems.
- Xin
Which one is better for me?
• Simple API
• Fault-tolerant/Stable
• Scalable
• Performance (high throughput & low latency)
• Guarantees: at-least-once/exactly-once
• Mature
• Ecosystem
• Operation and Maintenance
• Support
• Code
Storm 2.0
• Port Clojure to Java
• Unified Stream API
• Storm-SQL Improvements
• Metrics V2
• Threading model Redesign
• Lambda Expression Support - bolt: `tuple -> System.out.println(tuple)`
• Apache Beam Runner
• Worker Classloader Isolation
• Dynamic Topology Updates
• ......
Best Practices
• Worker heavy GCs:
• worker restarts, heartbeat timeouts, bad performance -> watch your heap memory usage. Do you use local
caches? Are your JVM options reasonable, e.g. -XX:CMSInitiatingOccupancyFraction?
• Topology design vs performance
• bad performance -> put lightweight logic into the same bolt/operator
• Too many executors/tasks
• high cluster CPU load, bad performance -> tune the number of threads:
for CPU-intensive tasks: task parallelism <= vcore,
for IO-intensive tasks: vcore <= task parallelism <= N*vcore.
Warning: threads waiting in sun.nio.ch.EPollArrayWrapper.epollWait are reported as RUNNABLE
CPU: user CPU or sys CPU? Load: runnable tasks or IO tasks?
Amdahl's law: non-parallelizable + parallelizable
• Data hot point / data skew
• some nodes have bad performance -> choose the right hash key, two-phase aggregation, or use micro-batch
• Big objects serialization:
• bad performance -> reduce the size of objects, and enable Kryo registration (from 55 ms to 11 ms after Kryo registration)
• Too many logs:
• bad performance -> never write unnecessary logs
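The thread-sizing rule of thumb and Amdahl's law above can be sketched in plain Java. This is only an illustration: the class and method names are hypothetical, `N` is a tuning factor you pick yourself, and "vcore" is assumed to be what `Runtime.availableProcessors()` reports.

```java
/** Sketch of the parallelism bounds and Amdahl's law; all names are hypothetical. */
final class ParallelismHints {

    /** CPU-intensive tasks: task parallelism <= vcore. */
    static int maxCpuBoundParallelism(int vcores) {
        return Math.max(1, vcores);
    }

    /** IO-intensive tasks: vcore <= task parallelism <= N * vcore (N is chosen by you). */
    static int maxIoBoundParallelism(int vcores, int n) {
        return Math.max(1, vcores) * Math.max(1, n);
    }

    /**
     * Amdahl's law: speedup = 1 / ((1 - p) + p / workers),
     * where p is the parallelizable fraction of the work.
     */
    static double amdahlSpeedup(double parallelFraction, int workers) {
        if (parallelFraction < 0.0 || parallelFraction > 1.0 || workers < 1) {
            throw new IllegalArgumentException("need 0 <= p <= 1 and workers >= 1");
        }
        return 1.0 / ((1.0 - parallelFraction) + parallelFraction / workers);
    }

    public static void main(String[] args) {
        // Assumption: the JVM's visible processors approximate the node's vcores.
        int vcores = Runtime.getRuntime().availableProcessors();
        System.out.println("CPU-bound upper bound: " + maxCpuBoundParallelism(vcores));
        System.out.println("IO-bound upper bound (N=4): " + maxIoBoundParallelism(vcores, 4));
        System.out.println("Speedup, p=0.9, 8 workers: " + amdahlSpeedup(0.9, 8));
    }
}
```

The Amdahl term shows why adding executors stops helping once the non-parallelizable part dominates.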
Data Hot Point / Data Skew
Q:
partition = hash(key) % N
A:
• Choose the right hash key
• MapReduce from history?
• key == null?
• Two-phase aggregation
• Use micro-batch / local-reduce
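Two-phase aggregation with salted keys might look roughly like the sketch below. It is plain Java rather than Storm code; the class name, the `#` separator, and the round-robin salt are assumptions made for illustration only.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Sketch of two-phase counting with salted keys; all names are hypothetical. */
final class TwoPhaseCount {
    // Hypothetical separator between the original key and its salt.
    static final String SALT_SEPARATOR = "#";

    /** partition = hash(key) % N; floorMod keeps negative hash codes in range. */
    static int partition(String key, int numPartitions) {
        return Math.floorMod(key.hashCode(), numPartitions);
    }

    /** Phase 1: count per salted key so one hot key spreads over saltCount sub-keys. */
    static Map<String, Long> phaseOne(List<String> keys, int saltCount) {
        if (saltCount < 1) {
            throw new IllegalArgumentException("saltCount must be >= 1");
        }
        Map<String, Long> partial = new HashMap<>();
        int i = 0;
        for (String key : keys) {
            // Round-robin salt; a random salt would work as well.
            String salted = key + SALT_SEPARATOR + (i++ % saltCount);
            partial.merge(salted, 1L, Long::sum);
        }
        return partial;
    }

    /** Phase 2: strip the salt and merge the partial counts per original key. */
    static Map<String, Long> phaseTwo(Map<String, Long> partial) {
        Map<String, Long> total = new HashMap<>();
        for (Map.Entry<String, Long> e : partial.entrySet()) {
            String salted = e.getKey();
            String key = salted.substring(0, salted.lastIndexOf(SALT_SEPARATOR));
            total.merge(key, e.getValue(), Long::sum);
        }
        return total;
    }
}
```

In a topology, phase one would typically be grouped by the salted key and phase two by the original key, so the final merge sees only a few partial counts per hot key.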
[Figure: diagram of S, P, and G nodes showing keys k1 and k2, salted keys k1+salt1 and k1+salt2, micro-batched k1 and k2, with labels num(k1), num(P), num(S)]
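The micro-batch / local-reduce idea can be sketched as a small buffer that pre-aggregates counts before sending them downstream. This is a plain-Java illustration with hypothetical names; the `emitted` list merely stands in for emitting to the next operator, and a real bolt would likely also flush on a timer.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Sketch of local pre-aggregation in micro-batches; all names are hypothetical. */
final class MicroBatchCounter {
    private final int batchSize;
    private final Map<String, Long> buffer = new HashMap<>();
    // Stand-in for the downstream operator: each entry is one emitted batch.
    private final List<Map<String, Long>> emitted = new ArrayList<>();
    private int pending = 0;

    MicroBatchCounter(int batchSize) {
        if (batchSize < 1) {
            throw new IllegalArgumentException("batchSize must be >= 1");
        }
        this.batchSize = batchSize;
    }

    /** Count locally; emit one partial result per key once the batch is full. */
    void add(String key) {
        buffer.merge(key, 1L, Long::sum);
        if (++pending >= batchSize) {
            flush();
        }
    }

    /** Emit the buffered partial counts and start a new batch. */
    void flush() {
        if (buffer.isEmpty()) {
            return;
        }
        emitted.add(new HashMap<>(buffer));
        buffer.clear();
        pending = 0;
    }

    List<Map<String, Long>> emitted() {
        return emitted;
    }
}
```

A hot key then reaches the downstream aggregator as at most one record per batch instead of one record per input tuple.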
RocketMQ-Streaming Integration
RocketMQ-Storm: https://github.com/apache/storm/tree/master/external/storm-rocketmq
• RocketMQSpout - Currently only RocketMQ push mode is supported; pull mode is planned. The default deserializer is
StringScheme; you can override it by setting `RocketMQConfig.SCHEME`.
• RocketMQBolt - Sends asynchronously by default; you can change this by invoking `withAsync(boolean async)`
• RocketMQState - For users using Storm Trident API
• TopicSelector - Selecting a topic based on the input Storm tuple
• TupleToMessageMapper - Maps a Storm tuple to a RocketMQ message. You can implement the
MessageBodySerializer interface to serialize the message body. The default implementation of MessageBodySerializer
is `body.toString().getBytes(StandardCharsets.UTF_8)`
• MessageRetryManager - Retry policy for failed messages
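The default body serialization quoted above, plus one custom variant, can be sketched without the connector on the classpath. The `BodySerializer` interface here is a hypothetical local stand-in, not the connector's `MessageBodySerializer`; check the storm-rocketmq sources for the real signature.

```java
import java.nio.charset.StandardCharsets;

/** Sketch of body serialization; BodySerializer is a hypothetical stand-in interface. */
final class BodySerializerSketch {

    /** Assumed shape: turn a tuple value into the message body bytes. */
    interface BodySerializer {
        byte[] serialize(Object body);
    }

    /** Mirrors the default described on the slide: toString() encoded as UTF-8. */
    static final BodySerializer DEFAULT =
            body -> body.toString().getBytes(StandardCharsets.UTF_8);

    /** Example of a custom policy: map null to an empty body instead of failing. */
    static final BodySerializer NULL_SAFE =
            body -> body == null ? new byte[0] : DEFAULT.serialize(body);

    public static void main(String[] args) {
        byte[] bytes = DEFAULT.serialize("hello");
        System.out.println(new String(bytes, StandardCharsets.UTF_8));
    }
}
```

A custom serializer is worth writing when `toString()` is not a stable wire format for your value types, for example when consumers expect JSON or a binary encoding.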
RocketMQ-Spark: https://github.com/apache/rocketmq-externals/tree/master/rocketmq-spark
RocketMQ-Flink: Coming soon
RocketMQ-Avro: Coming soon
OpenMessaging-Streaming