Realtime Statistics based on Apache Storm and RocketMQ

  1. Realtime Statistics based on Apache Storm and RocketMQ - Xin Wang Dec.16, 2017, Shenzhen, Apache RocketMQ Meetup
  2. Xin Wang • Apache Storm Committer & PMC member • Five years of distributed systems experience • Loves open source & community • Focuses on distributed technologies, especially stream processing • https://github.com/vesense
  3. Streaming and batch come from different worlds and use different ways to solve different problems. - Xin
  4. Index Part-1: Streaming Ecosystem Part-2: Stateful Statistics based on Storm & RocketMQ Part-3: Best Practices
  5. 01 Streaming Ecosystem
  6. The Streaming Ecosystem [Diagram components: Collector, Messaging, Streaming-Connector, Streaming, SQL, CEP, ML, Stream API, Runtime, Streaming-State, Storage, APP, Schema-Registry, Streaming-Manager; Deploy: Local, Cluster, Cloud] • Messaging: apache/kafka, rocketmq, pulsar • Streaming: apache/storm, flink, spark-streaming, kafka-streams • Schema-Registry: hortonworks/registry, confluentinc/schema-registry • Streaming-Manager: hortonworks/streamline
  7. Which one is better for me? • Simple API • Fault-tolerant/Stable • Scalable • Performance (high throughput & low latency) • Guarantees: at-least-once / exactly-once • Mature • Ecosystem • Operation and Maintenance • Support • Code
  8. Storm 2.0 • Port Clojure to Java • Unified Stream API • Storm-SQL improvements • Metrics V2 • Threading model redesign • Lambda expression support (bolt: `tuple -> System.out.println(tuple)`) • Apache Beam Runner • Worker classloader isolation • Dynamic topology updates • ...
  9. 02 Stateful Statistics based on Storm & RocketMQ
  10. Realtime Architecture [Diagram components: Apache RocketMQ (Source Topic, Retry Topic, Sink Topic), Apache Storm, Apache HBase, MySQL]
  11. Stateful Statistics Challenges [Diagram: events expected in the order 1 2 3 over time may instead arrive with loss, duplication, out of order, or in mutually exclusive (mutex) states] Q: Complex. Is there open source middleware that combines a state machine with streaming? A: • loss -> compensating: s1 -> s2 on condition when e3 • duplicating -> idempotent: exists(key) • out-of-order -> compensating + idempotent • mutex -> +/-: s1++ && s2--
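
The remedies on this slide can be illustrated with a small, self-contained sketch. It is not the speaker's implementation: the class `StateCounter`, the per-key sequence number, and the event-id set are all assumptions made for illustration. Duplicates are rejected by an `exists(key)`-style check, stale out-of-order events are ignored, a skipped event is compensated by jumping straight to the newer state, and each transition decrements the old state's count while incrementing the new one.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Hypothetical sketch: idempotent, order-tolerant counting of mutually exclusive states. */
class StateCounter {
    // Event ids already applied; stands in for the slide's exists(key) check.
    private final Set<String> seenEventIds = new HashSet<>();
    // Highest sequence number applied so far, per business key (assumed to exist on events).
    private final Map<String, Integer> latestSeq = new HashMap<>();
    // Current state per business key.
    private final Map<String, String> currentState = new HashMap<>();
    // Number of keys currently in each state.
    private final Map<String, Long> counts = new HashMap<>();

    /** Applies one event; returns false if it was a duplicate or older than the applied state. */
    boolean apply(String eventId, String key, int seq, String newState) {
        if (!seenEventIds.add(eventId)) {
            return false; // duplicate delivery: already counted
        }
        Integer last = latestSeq.get(key);
        if (last != null && seq <= last) {
            return false; // late, out-of-order event: a newer state is already applied
        }
        // A gap in seq (lost event) is compensated by moving directly to the newer state.
        String old = currentState.put(key, newState);
        latestSeq.put(key, seq);
        if (old != null) {
            counts.merge(old, -1L, Long::sum); // leave the old state
        }
        counts.merge(newState, 1L, Long::sum); // enter the new state
        return true;
    }

    long count(String state) {
        return counts.getOrDefault(state, 0L);
    }

    public static void main(String[] args) {
        StateCounter counter = new StateCounter();
        counter.apply("e1", "order-1", 1, "s1");
        counter.apply("e1", "order-1", 1, "s1"); // duplicate, ignored
        counter.apply("e3", "order-1", 3, "s3"); // e2 not yet seen, jump to s3
        counter.apply("e2", "order-1", 2, "s2"); // arrives late, ignored
        System.out.println("s1=" + counter.count("s1") + " s3=" + counter.count("s3"));
    }
}
```

In a real topology the in-memory maps would have to live in a durable state store; keeping them local here only keeps the sketch short.
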
  12. Stateful Event Counting: Alien • What is Alien? A stateful event counting middleware. • Supports event loss, duplication, out-of-order, and mutex handling • Supports time/event Window API • Supports integration with streaming systems • Supports dimension changes • Supports sync/async snapshot storage • Supports user-defined State and Snapshot Serializer • Supports state REST interfaces • ... Example: `Alien alien = Alien.createAlien().withState(new LocalMemoryState()).withWindow(new TimeWindow(2) { @Override public void accept(Map<String, View> views) { View view = views.get("report"); List<Row> rows = view.getRows(); for (Row r : rows) { println(r.getDimensions() + "->" + r.getMetrics()); } } }); alien.putEvent(new Event("name", "key"));`
  13. 03 Best Practices
  14. Best Practices • Heavy worker GCs: worker restarts, heartbeat timeouts, bad performance -> watch heap memory usage. Are you using local caches? Are the JVM options reasonable, e.g. -XX:CMSInitiatingOccupancyFraction? • Topology design vs. performance: bad performance -> put lightweight logic into the same bolt/operator • Too many executors/tasks: high cluster CPU load, bad performance -> tune the number of threads. For CPU-intensive tasks: task parallelism <= vcore; for IO-intensive tasks: vcore <= task parallelism <= N*vcore. Warning: threads in sun.nio.ch.EPollArrayWrapper.epollWait are reported as RUNNABLE. CPU: user CPU or sys CPU? Load: runnable tasks or IO tasks? Amdahl's law: non-parallelizable + parallelizable • Data hot point / data skew: some nodes have bad performance -> choose the right hash key, use two-phase aggregation, or use micro-batch • Big object serialization: bad performance -> reduce object size and enable Kryo registration (from 55 ms to 11 ms after Kryo registration) • Too many logs: bad performance -> avoid unnecessary logging
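
The thread-tuning rule of thumb and the Amdahl's law reminder can be put into numbers with a short sketch. The class and method names are hypothetical, and `ioFactor` stands in for the slide's unspecified N; the speedup formula is the standard form of Amdahl's law, 1 / ((1 - p) + p / n), where p is the parallelizable fraction and n the number of workers.

```java
/** Hypothetical helper illustrating the slide's parallelism bounds and Amdahl's law. */
class ParallelismHints {

    /** Amdahl's law: speedup with n workers when a fraction p of the work is parallelizable. */
    static double amdahlSpeedup(double parallelFraction, int workers) {
        return 1.0 / ((1.0 - parallelFraction) + parallelFraction / workers);
    }

    /**
     * Upper bound on task parallelism per the rule of thumb:
     * CPU-intensive -> at most vcores; IO-intensive -> at most ioFactor * vcores.
     * ioFactor corresponds to the slide's N and must be chosen by measurement.
     */
    static int maxParallelism(int vcores, boolean ioIntensive, int ioFactor) {
        return ioIntensive ? vcores * ioFactor : vcores;
    }

    public static void main(String[] args) {
        int vcores = Runtime.getRuntime().availableProcessors();
        System.out.println("CPU-bound limit: " + maxParallelism(vcores, false, 4));
        System.out.println("IO-bound limit (N=4): " + maxParallelism(vcores, true, 4));
        System.out.println("Speedup, 90% parallel, 16 workers: " + amdahlSpeedup(0.9, 16));
    }
}
```

The point of the formula is that adding executors stops helping once the non-parallelizable part dominates: with 90% parallel work the speedup can never reach 10, however many tasks are configured.
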
  15. Data Hot Point / Data Skew Q: partition = hash(key) % N A: • Choose the right hash key • mapreduce from history? • key == null? • Two-phase aggregation • Use micro-batch / local-reduce [Diagram: S and P nodes with keys k1 and k2; two-phase aggregation salts the hot key (k1+salt1, k1+salt2) ahead of a G step; micro-batch; labels num(k1), num(P), num(S)]
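
Two-phase aggregation with salted keys can be sketched in plain Java as follows. This is an illustration under assumptions, not code from the talk: the class name, the `#` separator, and the random salt choice are all hypothetical. Phase 1 spreads a hot key over several salted sub-keys so that `hash(key) % N` no longer sends every record to one partition; phase 2 strips the salt and merges the partial counts.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ThreadLocalRandom;

/** Hypothetical sketch of salted two-phase counting to mitigate a hot key. */
class TwoPhaseAggregation {
    // Assumed separator between the original key and its salt.
    static final String SALT_SEPARATOR = "#";

    /** Phase 1: append a random salt in [0, salts) and count per salted key. */
    static Map<String, Long> partialCounts(List<String> keys, int salts) {
        Map<String, Long> partial = new HashMap<>();
        for (String key : keys) {
            int salt = ThreadLocalRandom.current().nextInt(salts);
            partial.merge(key + SALT_SEPARATOR + salt, 1L, Long::sum);
        }
        return partial;
    }

    /** Phase 2: strip the salt and merge the partial counts per original key. */
    static Map<String, Long> finalCounts(Map<String, Long> partial) {
        Map<String, Long> total = new HashMap<>();
        for (Map.Entry<String, Long> entry : partial.entrySet()) {
            String salted = entry.getKey();
            // The salt is always appended last, so cut at the last separator.
            String key = salted.substring(0, salted.lastIndexOf(SALT_SEPARATOR));
            total.merge(key, entry.getValue(), Long::sum);
        }
        return total;
    }

    public static void main(String[] args) {
        List<String> keys = java.util.Arrays.asList("k1", "k1", "k1", "k1", "k2");
        Map<String, Long> partial = partialCounts(keys, 2);
        System.out.println("partial: " + partial);
        System.out.println("final:   " + finalCounts(partial));
    }
}
```

In a streaming topology the first phase would typically be grouped on the salted key and the second on the original key; the partial map also shows why micro-batching helps, since each batch forwards one count per key rather than one record per event.
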
  16. RocketMQ-Streaming Integration • RocketMQ-Storm: https://github.com/apache/storm/tree/master/external/storm-rocketmq • RocketMQSpout: currently only RocketMQ push mode is supported; pull mode is planned. The default deserializer is StringScheme; override it by setting `RocketMQConfig.SCHEME`. • RocketMQBolt: sends asynchronously by default; change this by invoking `withAsync(boolean async)`. • RocketMQState: for users of the Storm Trident API • TopicSelector: selects a topic based on the input Storm tuple • TupleToMessageMapper: maps a Storm tuple to a RocketMQ message. Implement the MessageBodySerializer interface to serialize the message body; the default implementation of MessageBodySerializer is `body.toString().getBytes(StandardCharsets.UTF_8)`. • MessageRetryManager: retry policy for failed messages • RocketMQ-Spark: https://github.com/apache/rocketmq-externals/tree/master/rocketmq-spark • RocketMQ-Flink: coming soon • RocketMQ-Avro: coming soon • OpenMessaging-Streaming
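
The default body serialization quoted on this slide, and what a custom replacement might look like, can be sketched without the Storm dependency. The `BodySerializer` interface below is a local stand-in defined only for this example; it is not the storm-rocketmq `MessageBodySerializer` interface, whose exact signature should be checked in the module's source before implementing it.

```java
import java.nio.charset.StandardCharsets;

/** Local stand-in for illustration only; not the storm-rocketmq interface. */
interface BodySerializer {
    byte[] serialize(Object body);
}

/** Hypothetical serializers mirroring and extending the default behaviour on the slide. */
class SerializerSketch {

    /** Same expression as the default quoted on the slide. */
    static final BodySerializer TO_STRING_UTF8 =
            body -> body.toString().getBytes(StandardCharsets.UTF_8);

    /** Assumed variant: maps a null body to an empty payload instead of throwing. */
    static final BodySerializer NULL_SAFE =
            body -> body == null ? new byte[0] : TO_STRING_UTF8.serialize(body);

    public static void main(String[] args) {
        byte[] bytes = NULL_SAFE.serialize("hello");
        System.out.println(new String(bytes, StandardCharsets.UTF_8));
    }
}
```

A toString-based default is convenient for string payloads, but structured objects usually call for an explicit format, which is the reason the mapper lets the serializer be swapped.
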
  17. Thanks everyone!