Spark Streaming + Kafka 0.10

Spark Streaming has supported Kafka since its inception, but a lot has changed since then, on both the Spark and Kafka sides, to make this integration more fault-tolerant and reliable. Apache Kafka 0.10 (actually since 0.9) introduced the new Consumer API, built on top of a new group coordination protocol provided by Kafka itself.
So a new Spark Streaming integration comes to the playground, with a design similar to the 0.8 Direct DStream approach. However, there are notable differences in usage, and many exciting new features. In this talk, we will cover the main differences between this new integration and the previous one (for Kafka 0.8), and why Direct DStreams have replaced Receivers for good. We will also see how to achieve different semantics (at-least-once, at-most-once, exactly-once) with code examples.
Finally, we will briefly introduce how we use this integration at Billy Mobile to ingest and process the continuous stream of events from our AdNetwork.


  1. 1. Spark Streaming + Kafka @joanvr
  2. 2. About me: Joan Viladrosa (@joanvr). Degree in Computer Science: advanced programming techniques & system interfaces and integration. Co-founder of Educabits: big data solutions using the AWS cloud. Big Data Developer at Trovit: Hadoop and MapReduce framework. Big Data Architect & Tech Lead at Billy Mobile: designing a new architecture with cutting-edge Hadoop-ecosystem technologies: Kafka, Storm, Hive, HBase, Spark, Kylin, Druid.
  3. 3. What is Kafka? - Publish-Subscribe Messaging System
  4. 4. What is Kafka? - Publish-Subscribe Messaging System - Fast - Scalable - Durable - Fault-tolerant What makes it great?
  5. 5. What is Kafka? (diagram: many Producers publish into Kafka, many Consumers read from it)
  6. 6. What is Kafka? (diagram: Apache Storm, Apache Spark, My Java App and a Logger produce into Kafka; Apache Storm, Apache Spark, My Java App and a Monitoring Tool consume from it)
  7. 7. Kafka terminology - Topic: A feed of messages - Producer: Processes that publish messages to a topic - Consumer: Processes that subscribe to topics and process the feed of published messages - Broker: Each server of a Kafka cluster that holds, receives and sends the actual data
  8. 8. Kafka Topic Partitions (diagram: a topic split into Partition 0, 1 and 2; each partition is an ordered, append-only log where new writes go at the end and the oldest messages sit at the beginning)
  9. 9. Kafka Topic Partitions (diagram: a Producer writes at the end of Partition 0 while Consumer A reads at offset 6 and Consumer B reads at offset 12; each consumer tracks its own position)
  10. 10. Kafka Topic Partitions (diagram: partitions P0..P8 spread across Broker 1, Broker 2 and Broker 3, with Consumers & Producers on top)
  11. 11. Kafka Topic Partitions (same diagram: more partitions across brokers means More Storage)
  12. 12. Kafka Topic Partitions (same diagram: more partitions across brokers means More Parallelism)
  13. 13. Apache Kafka Semantics - In short: consumer delivery semantics are up to you, not Kafka.
  14. 14. Apache Kafka Semantics - In short: consumer delivery semantics are up to you, not Kafka. - Kafka doesn’t store the state of the consumers* - It just sends you what you ask for (topic, partition, offset, length) - You have to take care of your state
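A minimal sketch of what "your state is up to you" means, using the plain Kafka consumer API (not Spark). The topic name and the loadOffset / saveOffset / process helpers are hypothetical placeholders for whatever offset store and processing logic you choose:

    import java.util.{Arrays, Properties}
    import org.apache.kafka.clients.consumer.KafkaConsumer
    import org.apache.kafka.common.TopicPartition
    import scala.collection.JavaConverters._

    val props = new Properties()
    props.put("bootstrap.servers", "broker01:9092")
    props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
    props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
    props.put("enable.auto.commit", "false")       // we keep the state ourselves

    val consumer = new KafkaConsumer[String, String](props)
    val tp = new TopicPartition("sometopic", 0)    // hypothetical topic / partition
    consumer.assign(Arrays.asList(tp))
    consumer.seek(tp, loadOffset(tp))              // loadOffset: your own store (ZK, DB, file...)

    val records = consumer.poll(1000L)             // Kafka just returns what you asked for
    for (r <- records.asScala) {
      process(r.value)                             // process: your own logic (hypothetical)
      saveOffset(tp, r.offset + 1)                 // persisting progress is also up to you
    }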
  15. 15. Apache Kafka Timeline: 0.7 Apache Incubator project (nov-2012); 0.8 New Producer (nov-2013); 0.9 New Consumer + Security (nov-2015); 0.10 Kafka Streams (may-2016)
  16. 16. What is Spark Streaming? - Process streams of data - Micro-batching approach
  17. 17. What is Spark Streaming? - Process streams of data - Micro-batching approach - Same API as Spark - Same integrations as Spark - Same guarantees & semantics as Spark What makes it great?
  18. 18. What is Spark Streaming?
  19. 19. What is Spark Streaming?
  20. 20. How does it work? - Discretized Streams (DStream)
  21. 21. How does it work? - Discretized Streams (DStream)
  22. 22. - Discretized Streams (DStream) How does it work?
  23. 23. How does it work?
  24. 24. How does it work?
  25. 25. How does it work?
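As a minimal illustration of the micro-batching / DStream model (not from the slides; a standard word-count sketch over a socket source), each batch interval produces an RDD that is processed with the usual Spark API:

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val conf = new SparkConf().setAppName("dstream-example")
    val ssc  = new StreamingContext(conf, Seconds(5))   // one micro-batch every 5 seconds

    // each batch of lines becomes an RDD; the same transformations as in core Spark apply
    val lines  = ssc.socketTextStream("localhost", 9999)
    val counts = lines.flatMap(_.split(" "))
                      .map(word => (word, 1))
                      .reduceByKey(_ + _)
    counts.print()

    ssc.start()             // start receiving and processing
    ssc.awaitTermination()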
  26. 26. Spark Streaming Semantics: Side effects As in Spark: - No guarantee of exactly-once semantics for output actions - Any side-effecting output operation may be repeated - Because of node failure, process failure, etc. - So, be careful when outputting to external sources
  27. 27. Spark Streaming Kafka Integration
  28. 28. Spark Streaming Kafka Integration Timeline: 1.1 (sep-2014); 1.2 (dec-2014) Fault-tolerant WAL + Python API; 1.3 (mar-2015) Direct Streams + Python API; 1.4 (jun-2015) Improved Streaming UI; 1.5 (sep-2015) Metadata in UI (offsets) + Graduated Direct Receivers; 1.6 (jan-2016); 2.0 (jul-2016) Native Kafka 0.10 (experimental); 2.1 (dec-2016)
  29. 29. Kafka Receiver (<= Spark 1.1) (diagram: a Receiver in the Executor continuously receives data using the High Level API and updates offsets in ZooKeeper; the Driver launches jobs on the received data)
  30. 30. Kafka Receiver with WAL (Spark 1.2) (diagram: as before, but the Receiver also writes the received data to a WAL on HDFS; offsets are still updated in ZooKeeper and the Driver launches jobs on the data)
  31. 31. Kafka Receiver with WAL (Spark 1.2) (diagram: in the Application Driver, the Streaming Context checkpoints the computation and the Spark Context runs jobs; in the Executor, the Receiver takes the input stream, writes block data to both memory and the log, and writes block metadata to the log)
  32. 32. Kafka Receiver with WAL (Spark 1.2) (diagram, failure recovery: the restarted Driver restarts the computation from the checkpoint info and relaunches jobs; the restarted Executor recovers block metadata and block data from the log; the restarted Receiver resends unacked data)
  33. 33. Kafka Receiver with WAL (Spark 1.2) (same diagram as slide 30)
  34. 34. New Direct Kafka Integration w/o Receivers and WALs (Spark 1.3) (diagram: only the Driver and the Executors, no Receiver and no WAL)
  35. 35. New Direct Kafka Integration (Spark 1.3), step 1: the Driver queries the latest offsets and decides the offset ranges for the batch
  36. 36. New Direct Kafka Integration (Spark 1.3), step 2: the Driver launches jobs using those offset ranges
  37. 37. New Direct Kafka Integration (Spark 1.3): example offset ranges for a batch: topic1, p1, (2000, 2100); topic1, p2, (2010, 2110); topic1, p3, (2002, 2102)
  38. 38. New Direct Kafka Integration (Spark 1.3), step 3: the Executors read the data for their offset ranges in the jobs, using the Simple API
  39. 39. New Direct Kafka Integration (Spark 1.3) (diagram build: steps 1-3 with the example offset ranges)
  40. 40. New Direct Kafka Integration (Spark 1.3) (diagram build: steps 1-3 with the example offset ranges)
  41. 41. New Direct Kafka Integration w/o Receivers and WALs (Spark 1.3), full picture: 1. Query latest offsets and decide offset ranges for the batch; 2. Launch jobs using the offset ranges; 3. Read data using the offset ranges in the jobs, using the Simple API
  42. 42. Direct Kafka API benefits - No WALs or Receivers - Allows end-to-end exactly-once pipelines (as long as updates to downstream systems are idempotent or transactional) - More fault-tolerant - More efficient - Easier to use
  43. 43. How to use Direct Kafka API, Maven dependency:

    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-streaming-kafka_2.10</artifactId>
      <version>1.3.0</version> <!-- until 1.6.3 -->
    </dependency>
  44. 44. How to use Direct Kafka API, Scala example code:

    val ssc = new StreamingContext(...)
    // hostname:port for Kafka brokers, NOT ZooKeeper
    val kafkaParams = Map(
      "metadata.broker.list" -> "broker01:9092,broker02:9092"
    )
    val topics = Set("sometopic", "anothertopic")
    val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, topics)
  45. 45. How to use Direct Kafka API, Scala example code:

    stream.foreachRDD { rdd =>
      val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
      rdd.foreachPartition { iter =>
        val osr: OffsetRange = offsetRanges(TaskContext.get.partitionId)
        // get any needed data from the offset range
        val topic            = osr.topic
        val kafkaPartitionId = osr.partition
        val begin            = osr.fromOffset
        val end              = osr.untilOffset
      }
    }
  46. 46. How to use Direct Kafka API, Scala example code:

    val offsetRanges = Array(
      OffsetRange("sometopic", 0, 110, 220),
      OffsetRange("sometopic", 1, 100, 313),
      OffsetRange("anothertopic", 0, 456, 789)
    )
    val rdd = KafkaUtils.createRDD[String, String, StringDecoder, StringDecoder](
      sc, kafkaParams, offsetRanges)
  47. 47. Spark Streaming UI improvements (Spark 1.4)
  48. 48. Kafka Metadata (offsets) in UI (Spark 1.5)
  49. 49. What about Spark 2.0+ and new Kafka Integration? This is why we are here, right?
  50. 50. What about Spark 2.0+ and new Kafka Integration? This is why we are here, right?
      spark-streaming-kafka-0-8 vs spark-streaming-kafka-0-10:
      - Broker Version: 0.8.2.1 or higher vs 0.10.0 or higher
      - API Stability: Stable vs Experimental
      - Language Support: Scala, Java, Python vs Scala, Java
      - Receiver DStream: Yes vs No
      - Direct DStream: Yes vs Yes
      - SSL / TLS Support: No vs Yes
      - Offset Commit API: No vs Yes
      - Dynamic Topic Subscription: No vs Yes
  51. 51. How to use Old Kafka Integration on Spark 2.0+, Maven dependency:

    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-streaming-kafka-0-8_2.11</artifactId>
      <version>2.0.0</version> <!-- until 2.1.0 right now -->
    </dependency>
  52. 52. How to use New Kafka Integration on Spark 2.0+, Maven dependency:

    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-streaming-kafka-0-10_2.11</artifactId>
      <version>2.0.0</version> <!-- until 2.1.0 right now -->
    </dependency>
  53. 53. What’s really New with this New Kafka Integration? - New Consumer API - instead of Simple API - Location Strategies - Consumer Strategies - SSL / TLS - No Python API :(
  54. 54. Location Strategies - The new consumer API will pre-fetch messages into buffers - So it is better to keep cached consumers on the executors - And to schedule partitions on the hosts that already have the appropriate cached consumers
  55. 55. Location Strategies - PreferConsistent Distribute partitions evenly across available executors - PreferBrokers If your executors are on the same hosts as your Kafka brokers - PreferFixed Specify an explicit mapping of partitions to hosts
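A sketch of how the three strategies are expressed in code (the host names and the partition-to-host mapping are made up for illustration):

    import org.apache.kafka.common.TopicPartition
    import org.apache.spark.streaming.kafka010.LocationStrategies

    // Default choice, works in most cases: spread partitions evenly over the executors
    val consistent = LocationStrategies.PreferConsistent

    // Only if your executors run on the same machines as the Kafka brokers
    val brokers = LocationStrategies.PreferBrokers

    // Explicit partition -> host mapping (hypothetical hosts);
    // unmapped partitions fall back to the consistent behaviour
    val fixed = LocationStrategies.PreferFixed(Map(
      new TopicPartition("topicA", 0) -> "host-exec-01",
      new TopicPartition("topicA", 1) -> "host-exec-02"
    ))

Whichever value you pick is then passed as the second argument to KafkaUtils.createDirectStream, as in the full example a few slides later.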
  56. 56. Consumer Strategies - New consumer API has a number of different ways to specify topics, some of which require considerable post-object-instantiation setup. - ConsumerStrategies provides an abstraction that allows Spark to obtain properly configured consumers even after restart from checkpoint.
  57. 57. Consumer Strategies - Subscribe: subscribe to a fixed collection of topics - SubscribePattern: use a regex to specify topics of interest - Assign: specify a fixed collection of partitions - Overloaded constructors to specify the starting offset for a particular partition - ConsumerStrategy is a public class that you can extend
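A sketch of the three built-in strategies (topic names, the regex and the starting offsets are illustrative; kafkaParams is the map from the full example on slide 60):

    import java.util.regex.Pattern
    import org.apache.kafka.common.TopicPartition
    import org.apache.spark.streaming.kafka010.ConsumerStrategies

    // Fixed collection of topics
    val subscribe = ConsumerStrategies.Subscribe[String, String](
      Array("topicA", "topicB"), kafkaParams)

    // All topics matching a regex, picking up new matching topics as they appear
    val byPattern = ConsumerStrategies.SubscribePattern[String, String](
      Pattern.compile("events-.*"), kafkaParams)

    // Fixed collection of partitions, with an explicit starting offset for one of them
    val assign = ConsumerStrategies.Assign[String, String](
      Array(new TopicPartition("topicA", 0), new TopicPartition("topicA", 1)),
      kafkaParams,
      Map(new TopicPartition("topicA", 0) -> 1234L))  // resume exactly where we left off

The chosen strategy is passed as the third argument to KafkaUtils.createDirectStream.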
  58. 58. SSL/TLS encryption - New consumer API supports SSL - Only applies to communication between Spark and Kafka brokers - Still responsible for separately securing Spark inter-node communication
  59. 59. SSL/TLS encryption:

    val kafkaParams = Map[String, Object](
      // ... the usual params; make sure the port in bootstrap.servers is the TLS one ...
      "security.protocol"       -> "SSL",
      "ssl.truststore.location" -> "/some-directory/kafka.client.truststore.jks",
      "ssl.truststore.password" -> "test1234",
      "ssl.keystore.location"   -> "/some-directory/kafka.client.keystore.jks",
      "ssl.keystore.password"   -> "test1234",
      "ssl.key.password"        -> "test1234"
    )
  60. 60. How to use New Kafka Integration on Spark 2.0+, Scala example code:

    val kafkaParams = Map[String, Object](
      "bootstrap.servers"  -> "broker01:9092,broker02:9092",
      "key.deserializer"   -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id"           -> "stream_group_id",
      "auto.offset.reset"  -> "latest",
      "enable.auto.commit" -> (false: java.lang.Boolean)
    )
    val topics = Array("topicA", "topicB")
    val stream = KafkaUtils.createDirectStream[String, String](
      streamingContext,
      PreferConsistent,
      Subscribe[String, String](topics, kafkaParams)
    )
    stream.map(record => (record.key, record.value))
  61. 61. How to use New Kafka Integration on Spark 2.0+, Scala example code:

    stream.foreachRDD { rdd =>
      val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
      rdd.foreachPartition { iter =>
        val osr: OffsetRange = offsetRanges(TaskContext.get.partitionId)
        // get any needed data from the offset range
        val topic            = osr.topic
        val kafkaPartitionId = osr.partition
        val begin            = osr.fromOffset
        val end              = osr.untilOffset
      }
    }
  62. 62. How to use New Kafka Integration on Spark 2.0+, Scala example code:

    stream.foreachRDD { rdd =>
      val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
      // DO YOUR STUFF with DATA
      stream.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges)
    }
  63. 63. Kafka + Spark Semantics Three use cases: - At most once - At least once - Exactly once
  64. 64. Kafka + Spark Semantics At most once - We don't want duplicates - Not worth the hassle of ensuring that messages don't get lost - Example: Sending statistics over UDP 1. Set spark.task.maxFailures to 1 2. Make sure spark.speculation is false (the default) 3. Set Kafka param auto.offset.reset to "latest" ("largest" in the old consumer) 4. Set Kafka param enable.auto.commit to true
  65. 65. Kafka + Spark Semantics At most once - This will mean you lose messages on restart - At least they shouldn’t get replayed. - Test this carefully if it’s actually important to you that a message never gets repeated, because it’s not a common use case.
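A minimal sketch of the at-most-once configuration just described (broker list and group id are placeholder values; the usual key/value deserializer params are omitted):

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .setAppName("at-most-once-job")
      .set("spark.task.maxFailures", "1")   // fail fast instead of retrying (and re-reading) a batch
      .set("spark.speculation", "false")    // the default, made explicit here

    val kafkaParams = Map[String, Object](
      "bootstrap.servers"  -> "broker01:9092,broker02:9092",
      "group.id"           -> "at-most-once-group",
      // on restart, skip anything published while we were down
      "auto.offset.reset"  -> "latest",
      // offsets get committed automatically, even if processing later fails, so messages may be lost
      "enable.auto.commit" -> (true: java.lang.Boolean)
    )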
  66. 66. Kafka + Spark Semantics At least once - We don't want to lose any record - We don't care about duplicates - Example: Sending internal alerts on relatively rare occurrences in the stream 1. Set spark.task.maxFailures > 1000 2. Set Kafka param auto.offset.reset to "earliest" ("smallest" in the old consumer) 3. Set Kafka param enable.auto.commit to false
  67. 67. Kafka + Spark Semantics At least once - Don't be silly! Do NOT replay your whole log on every restart… - Manually commit the offsets when you are 100% sure records are processed - If this is "too hard", you'd better have a relatively short retention log - Or be REALLY ok with duplicates, for example because you are outputting to an external system that handles duplicates for you (HBase)
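Putting the at-least-once recipe together (saveToExternalSystem is a hypothetical output action; stream and the remaining params are as in the earlier example on slide 60): offsets are committed only after the output succeeds, so a failure replays the uncommitted batch instead of losing it:

    import org.apache.spark.streaming.kafka010.{CanCommitOffsets, HasOffsetRanges}

    val kafkaParams = Map[String, Object](
      // ... the usual params ...
      "auto.offset.reset"  -> "earliest",   // "smallest" in the old consumer
      "enable.auto.commit" -> (false: java.lang.Boolean)
    )

    stream.foreachRDD { rdd =>
      val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
      saveToExternalSystem(rdd)   // hypothetical; may run more than once on failure (duplicates are OK here)
      // commit only once the output action has succeeded
      stream.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges)
    }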
  68. 68. Kafka + Spark Semantics Exactly once - We don’t want to lose any record - We don’t want duplicates either - Example: Storing stream in data warehouse 1. We need some kind of idempotent writes, or whole-or-nothing writes 2. Only store offsets EXACTLY after writing data 3. Same parameters as at least once
  69. 69. Kafka + Spark Semantics Exactly once - Probably the hardest to get right - Still a small chance of failure if your app fails just between writing data and committing offsets… (but REALLY small) - If your writing is not "atomic" or has some "steps", you should checkpoint each step and, when recovering from a failure, finish the "current micro-batch" from the failed step
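A sketch of the exactly-once pattern with a transactional sink, storing data and offsets atomically per partition-batch (withTransaction, tx.insert, tx.storeOffsets and transform are hypothetical placeholders for your own sink and logic):

    import org.apache.spark.TaskContext
    import org.apache.spark.streaming.kafka010.HasOffsetRanges

    stream.foreachRDD { rdd =>
      val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
      rdd.foreachPartition { records =>
        val osr = offsetRanges(TaskContext.get.partitionId)
        // one transaction per partition-batch: the data and its offset range succeed or fail together
        withTransaction { tx =>
          records.foreach(r => tx.insert(transform(r.value)))
          tx.storeOffsets(osr.topic, osr.partition, osr.untilOffset)
        }
      }
    }

On restart, read the stored offsets back and start the stream (or a KafkaUtils.createRDD) exactly from there.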
  70. 70. It’s DEMO time!
  71. 71. Spark Streaming + Kafka at Billy Mobile: a story of love and fury
  72. 72. Some Billy Insights (we rock it!): 10B records monthly, 20TB weekly retention log, 5K events/second, x4 growth/year
  73. 73. Our use cases: ETL to Data Warehouse - Input events from Kafka - Enrich the events with some external data sources - Finally store it to Hive - We do NOT want duplicates - We do NOT want to lose events
  74. 74. Our use cases: ETL to Data Warehouse - Input events from Kafka - Enrich the events with some external data sources - Finally store it to Hive - We do NOT want duplicates - We do NOT want to lose events exactly once!
  75. 75. Our use cases: ETL to Data Warehouse - Hive is not transactional - No idempotent writes either - Writing files to HDFS is "atomic" (whole or nothing) - A 1:1 relation from each partition-batch to a file in HDFS - Store the current state of the batch in ZK - Store the offsets of the last finished batch in ZK
  76. 76. Our use cases: ETL to Data Warehouse On failure: - If an executor fails, just keep going (reschedule task) > spark.task.maxFailures = 1000 - If the driver fails (or restarts): - Load offsets and state of the "current batch" if it exists, and "finish" it (KafkaUtils.createRDD) - Continue the stream from the last saved offsets (see the sketch below)
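A sketch of that driver-restart recovery, assuming sc, ssc and kafkaParams are in scope as in the earlier examples, and that loadUnfinishedBatch, writeToHive, markBatchFinished and loadLastCommittedOffsets are hypothetical helpers around the ZooKeeper state:

    import scala.collection.JavaConverters._
    import org.apache.spark.streaming.kafka010._
    import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
    import org.apache.spark.streaming.kafka010.ConsumerStrategies.Assign

    // 1. If a partition-batch was in flight when the driver died, finish it as a plain RDD
    loadUnfinishedBatch() match {                 // Option[Array[OffsetRange]] read from ZK
      case Some(offsetRanges) =>
        val rdd = KafkaUtils.createRDD[String, String](
          sc, kafkaParams.asJava, offsetRanges, PreferConsistent)
        writeToHive(rdd)                          // same output step as the streaming job
        markBatchFinished(offsetRanges)           // store the offsets of the finished batch in ZK
      case None =>                                // nothing was in flight
    }

    // 2. Resume the stream exactly after the last finished batch
    val fromOffsets = loadLastCommittedOffsets()  // Map[TopicPartition, Long] read from ZK
    val stream = KafkaUtils.createDirectStream[String, String](
      ssc, PreferConsistent,
      Assign[String, String](fromOffsets.keys.toList, kafkaParams, fromOffsets))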
  77. 77. Our use cases: Anomalies Detection - Input events from Kafka - Periodically load batch-computed model - Detect when an offer stops converting (or too much) - We do not care about losing some events (on restart) - We always need to process the “real-time” stream
  78. 78. Our use cases: Anomalies Detection - Input events from Kafka - Periodically load batch-computed model - Detect when an offer stops converting (or too much) - We do not care about losing some events (on restart) - We always need to process the “real-time” stream At most once!
  79. 79. Our use cases: Anomalies Detection - It’s useless to detect anomalies on a lagged stream! - Actually it could be very bad - Always restart stream on latest offsets - Restart with “fresh” state
  80. 80. Our use cases: Store it to Entity Cache - Input events from Kafka - Almost no processing - Store it to HBase (has idempotent writes) - We do not care about duplicates - We can’t lose a single event
  81. 81. Our use cases: Store it to Entity Cache - Input events from Kafka - Almost no processing - Store it to HBase (has idempotent writes) - We do not care about duplicates - We can’t lose a single event at least once!
  82. 82. Our use cases: Store it to Entity Cache - Since HBase has idempotent writes, we can write events multiple times without hassle - But, we do NOT start with earliest offsets… - That would be 7 days of redundant writes…!!! - We store offsets of last finished batch - But obviously we might re-write some events on restart or failure
  83. 83. Lessons Learned - Do NOT use checkpointing! - Not recoverable across upgrades - Track offsets yourself - ZK, HDFS, DB… - Do NOT use enable.auto.commit - It only makes sense in at most once - Memory might be an issue - You do not want to waste it... - Adjust batchDuration - Adjust maxRatePerPartition
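One way to track offsets yourself in ZooKeeper, as a sketch using Apache Curator (the znode layout and the group/topic naming are illustrative, not the actual Billy Mobile code):

    import org.apache.curator.framework.CuratorFrameworkFactory
    import org.apache.curator.retry.ExponentialBackoffRetry
    import org.apache.spark.streaming.kafka010.OffsetRange

    val zk = CuratorFrameworkFactory.newClient("zk01:2181", new ExponentialBackoffRetry(1000, 3))
    zk.start()

    // one znode per topic/partition, payload = next offset to read
    def saveOffsets(group: String, ranges: Array[OffsetRange]): Unit =
      ranges.foreach { osr =>
        val path = s"/offsets/$group/${osr.topic}/${osr.partition}"
        val data = osr.untilOffset.toString.getBytes("UTF-8")
        if (zk.checkExists().forPath(path) == null)
          zk.create().creatingParentsIfNeeded().forPath(path, data)
        else
          zk.setData().forPath(path, data)
      }

    def loadOffset(group: String, topic: String, partition: Int): Option[Long] = {
      val path = s"/offsets/$group/$topic/$partition"
      if (zk.checkExists().forPath(path) == null) None
      else Some(new String(zk.getData.forPath(path), "UTF-8").toLong)
    }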
  84. 84. Future Research - Dynamic Allocation spark.dynamicAllocation.enabled vs spark.streaming.dynamicAllocation.enabled https://issues.apache.org/jira/browse/SPARK-12133 But no reference in docs... - Monitoring - Graceful shutdown - Structured Streaming
  85. 85. Thank you very much! and have a beer! Questions?
  86. 86. WE ARE HIRING! @joanvr joan.viladrosa@billymob.com
