Flink Forward San Francisco 2019: Real-time Processing with Flink for Machine Learning at Netflix - Elliot Chow

Real-time Processing with Flink for Machine Learning at Netflix
Machine learning plays a critical role in providing a great Netflix member experience. It is used to drive many parts of the site including video recommendations, search results ranking, and selection of artwork images. Providing high-fidelity, near real-time data is increasingly important for these machine learning pipelines, especially as multi-armed bandit and reinforcement learning techniques, in addition to more "traditional" supervised learning, become more prevalent. With access to this data, models are able to converge more quickly, features can be updated more frequently, and analysis can be done in a more timely manner.

In this talk, we will focus on the practical details of leveraging Flink to process trillions of events per day, work with the time dimension, and manage large and frequently-changing state. We will discuss different processing schemes and dataflows, scalability and resiliency challenges we tackled, operational considerations, and instrumentation we added for monitoring job health in production.


  1. Real-time Processing with Flink for Machine Learning at Netflix - Elliot Chow
  2. Agenda: Recommendations @ Netflix; Data For Machine Learning; Processing with Flink; State/Join; Event-Time & Watermarks; Checkpointing; Monitoring and Understanding The Job
  3. Recommendations
  4. Recommendations
  5. Scale: 139 million+ members; 190+ countries; 450 billion+ unique events/day; 700+ Kafka topics
  6. Impressions
  7. Member Activity: Log-in, Click, Play, Search, ...
  8. Recommendations Data: Context; Features - inputs to recommendation algorithms; ...
  9. Sessionization
  10. Join with Recommendations Data
  11. Output Data Format
  12. Processing with Flink
  13. Historically... Spark + Spark Streaming. Some Challenges: Processing-time; Checkpointing performance and compatibility
  14. Switching to Flink: Event-time Processing; Incremental Checkpointing; Custom Serializers; Internal Netflix Support
  15. High-level Data Flow
  16. Challenges and Considerations
  17. Challenges and Considerations: Many microservices involved; Different join keys; Different expiration policies; Scale
  18. Join / Window Implementation
  19. Attempt I
  20. Attempt I: class Event // ... class State // ... class Output // ... def insert(input: Event, state: State): State = // ... def emit(time: Timestamp): (State, List[Output]) = // ...
  21. Attempt I: class Event // ... class State // ... class Output // ... def insert(input: Event, state: State): State = // ... def emit(time: Timestamp): (State, List[Output]) = // ... Store State in ValueState for each member; Call insert in processElement; Call emit in onTimer; Use custom Protobuf TypeSerializer
  22. Attempt I - Issues: State object is too large; Out-of-memory, even with rate-limiting outliers; Serialization/deserialization of entire state for inserting events is too costly
  23. Attempt I - Issues: State object is too large; Out-of-memory, even with rate-limiting outliers; Serialization/deserialization of entire state for inserting events is too costly; All windows get triggered simultaneously; Bursty resource usage
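The Attempt I slides describe a single KeyedProcessFunction that keeps the whole per-member State in a ValueState, calls insert in processElement and emit in onTimer. Below is a minimal sketch of that shape; the Event/State/Output fields, the windowEnd helper, and the insert/emit bodies are placeholders for illustration, not the Netflix implementation (which also used a custom Protobuf TypeSerializer).

```scala
import org.apache.flink.api.common.state.{ValueState, ValueStateDescriptor}
import org.apache.flink.configuration.Configuration
import org.apache.flink.streaming.api.functions.KeyedProcessFunction
import org.apache.flink.util.Collector

// Placeholder types standing in for the Event / State / Output classes on the slide.
case class Event(memberId: Long, timestamp: Long)
case class State(events: List[Event] = Nil)
case class Output(memberId: Long, events: List[Event])

class AttemptOne extends KeyedProcessFunction[Long, Event, Output] {

  private val windowMillis = 10 * 60 * 1000L // illustrative window size

  private var state: ValueState[State] = _

  override def open(parameters: Configuration): Unit =
    state = getRuntimeContext.getState(
      new ValueStateDescriptor[State]("member-state", classOf[State]))

  override def processElement(
      event: Event,
      ctx: KeyedProcessFunction[Long, Event, Output]#Context,
      out: Collector[Output]): Unit = {
    // The whole State object is deserialized and re-serialized for every event,
    // which is exactly the cost called out on the "Issues" slides.
    val current = Option(state.value()).getOrElse(State())
    state.update(insert(event, current))
    ctx.timerService().registerEventTimeTimer(windowEnd(event.timestamp))
  }

  override def onTimer(
      timestamp: Long,
      ctx: KeyedProcessFunction[Long, Event, Output]#OnTimerContext,
      out: Collector[Output]): Unit = {
    val (remaining, outputs) = emit(timestamp, Option(state.value()).getOrElse(State()))
    state.update(remaining)
    outputs.foreach(o => out.collect(o))
  }

  // Toy stand-ins for the insert/emit functions sketched on the slide.
  private def insert(event: Event, s: State): State = State(event :: s.events)

  private def emit(time: Long, s: State): (State, List[Output]) =
    if (s.events.isEmpty) (s, Nil)
    else (State(), List(Output(s.events.head.memberId, s.events)))

  private def windowEnd(ts: Long): Long = (ts / windowMillis + 1) * windowMillis
}
```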
  24. Attempt II: Use Flink's windowing API; Sliding Windows
  25. Attempt II - Issues: Many copies of each event
  26. Attempt II - Issues: Difficult to manage expiration for different events
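Attempt II leaned on Flink's built-in sliding windows instead. A rough sketch of that shape follows; the Event/Output types, window sizes, and counting logic are illustrative only. Because each element is assigned to every sliding window it overlaps (ten of them with these example sizes), the window state ends up holding many copies of each event, which is the issue the slides call out.

```scala
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.scala.function.ProcessWindowFunction
import org.apache.flink.streaming.api.windowing.assigners.SlidingEventTimeWindows
import org.apache.flink.streaming.api.windowing.time.Time
import org.apache.flink.streaming.api.windowing.windows.TimeWindow
import org.apache.flink.util.Collector

case class Event(memberId: Long, timestamp: Long)
case class Output(memberId: Long, eventCount: Int)

// Toy window function: counts events per member per window.
class CountPerWindow extends ProcessWindowFunction[Event, Output, Long, TimeWindow] {
  override def process(key: Long, context: Context,
                       elements: Iterable[Event], out: Collector[Output]): Unit =
    out.collect(Output(key, elements.size))
}

object AttemptTwo {
  def sessionize(events: DataStream[Event]): DataStream[Output] =
    events
      .keyBy(_.memberId)
      .window(SlidingEventTimeWindows.of(Time.minutes(10), Time.minutes(1))) // illustrative sizes
      .process(new CountPerWindow)
}
```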
  27. Attempt III: Custom ProcessFunction; Manual window management; Break down state into many state objects; Use MapState, ListState, and ValueState where appropriate; Use a combination of event-time and processing-time timers
  28. Attempt III: Maintain frequently-accessed metadata in ValueState - Minimum/maximum timestamps; Existing timers; Number of events and bytes (rate-limiting)
  29. Attempt III: Optimize for writes (RocksDB backend) - Only read metadata during inserts; Insert (append) events to ListState; Deduplicate events at read time, write back deduplicated events
  30. Attempt III: Randomly offset the windows
      Member | Window Start | Window End
      1      | __:00        | __:09
      1      | __:10        | __:19
      2      | __:01        | __:10
      2      | __:11        | __:20
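Slides 27-30 outline the write-optimized design: small metadata in ValueState, raw events appended to ListState, deduplication deferred to read time, and window ends randomly offset per member. The sketch below illustrates that pattern under assumed types and an assumed window size; it is not the production code.

```scala
import org.apache.flink.api.common.state.{ListState, ListStateDescriptor, ValueState, ValueStateDescriptor}
import org.apache.flink.configuration.Configuration
import org.apache.flink.streaming.api.functions.KeyedProcessFunction
import org.apache.flink.util.Collector

import scala.collection.JavaConverters._
import scala.util.Random

case class Event(memberId: Long, eventId: String, timestamp: Long)
case class Meta(minTs: Long, maxTs: Long, count: Long, pendingTimer: Long)
case class Output(memberId: Long, events: Seq[Event])

class WriteOptimizedJoin extends KeyedProcessFunction[Long, Event, Output] {

  private val windowMillis = 10 * 60 * 1000L // illustrative window size

  private var meta: ValueState[Meta] = _
  private var buffer: ListState[Event] = _

  override def open(parameters: Configuration): Unit = {
    meta = getRuntimeContext.getState(new ValueStateDescriptor[Meta]("meta", classOf[Meta]))
    buffer = getRuntimeContext.getListState(new ListStateDescriptor[Event]("events", classOf[Event]))
  }

  override def processElement(
      e: Event,
      ctx: KeyedProcessFunction[Long, Event, Output]#Context,
      out: Collector[Output]): Unit = {
    // Hot path reads only the small metadata object; events are appended blindly.
    buffer.add(e)
    val m = Option(meta.value()).getOrElse(Meta(e.timestamp, e.timestamp, 0L, 0L))
    var updated = m.copy(
      minTs = math.min(m.minTs, e.timestamp),
      maxTs = math.max(m.maxTs, e.timestamp),
      count = m.count + 1)
    if (updated.pendingTimer == 0L) {
      // Randomly offset the window end so that not all members fire at the same instant.
      val fireAt = windowEnd(e.timestamp) + Random.nextInt(windowMillis.toInt)
      ctx.timerService().registerEventTimeTimer(fireAt)
      updated = updated.copy(pendingTimer = fireAt)
    }
    meta.update(updated)
  }

  override def onTimer(
      timestamp: Long,
      ctx: KeyedProcessFunction[Long, Event, Output]#OnTimerContext,
      out: Collector[Output]): Unit = {
    // Deduplicate only at read time and write the compacted list back.
    val deduped = Option(buffer.get()).map(_.asScala.toSeq).getOrElse(Seq.empty).distinct
    buffer.update(deduped.asJava)
    out.collect(Output(ctx.getCurrentKey, deduped))
    Option(meta.value()).foreach(m => meta.update(m.copy(pendingTimer = 0L)))
  }

  private def windowEnd(ts: Long): Long = (ts / windowMillis + 1) * windowMillis
}
```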
  31. Event-Time & Watermarks
  32. Event-Time & Watermarks - Watermarking Crash Course: Event-time: time associated with the actual event; Watermark: a time marker stating that all data prior to this time has been seen; Event-time triggers fire based on the watermark
  33. Event-Time & Watermarks - Watermarking Crash Course. Example: BoundedOutOfOrdernessTimestampExtractor where outOfOrderness = 10 minutes
      Event-Time:     10:00 10:08 10:05 10:06 10:15
      Max Event-Time: 10:00 10:08 10:08 10:08 10:15
      Watermark:      09:50 09:58 09:58 09:58 10:05
  34. Event-Time & Watermarks - Watermarking Crash Course: Watermark is maintained per partition; The watermark of an operator is computed as the minimum watermark of its inputs
      Partition 1: 09:50 09:58 09:58 09:58 10:05
      Partition 2: 09:53 09:57 09:58 10:03 10:08
      Operator:    09:50 09:57 09:58 09:58 10:05
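The two tables above follow from two rules: a BoundedOutOfOrdernessTimestampExtractor emits a watermark equal to the maximum event time seen so far minus the configured bound, and an operator's watermark is the minimum of its input partitions' watermarks. A tiny sketch of that arithmetic (the 10-minute bound is the value from the example):

```scala
// Watermark arithmetic behind the crash-course tables above.
object WatermarkMath {
  val outOfOrdernessMs: Long = 10 * 60 * 1000 // 10 minutes, as in the example

  // Per partition: the watermark trails the maximum event time seen so far by the bound.
  def partitionWatermark(maxEventTimeMs: Long): Long = maxEventTimeMs - outOfOrdernessMs

  // Per operator: the watermark is the minimum across all input partitions.
  def operatorWatermark(partitionWatermarksMs: Seq[Long]): Long = partitionWatermarksMs.min
}
```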
  35. A Couple of Quick Observations: 1. Event-time timestamps must be correct; 2. If the watermark of any partition stops progressing, time will stop
  36. Why Has Time Stopped?
  37. Why Has Time Stopped? System is unhealthy: Delays in input data sources; Backpressure; Underprovisioned cluster; Even a single bad TM can drag down the entire job
  38. Why Has Time Stopped?
  39. Why Has Time Stopped? System appears healthy - somewhere, there is not enough data
  40. Why Has Time Stopped? System appears healthy - somewhere, there is not enough data: Scheduled jobs
  41. Why Has Time Stopped? System appears healthy - somewhere, there is not enough data: Scheduled jobs; Region Failover
  42. Why Has Time Stopped? System appears healthy - somewhere, there is not enough data: Scheduled jobs; Region Failover; Kafka Skip-Partitions Feature
  43. Why Has Time Stopped? System appears healthy - somewhere, there is not enough data: Scheduled jobs; Region Failover; Kafka Skip-Partitions Feature; Topic is overprovisioned (# partitions : events/second > 1)
  44. Why Has Time Stopped?
  45. (Slightly) Custom Watermark Assigner, based on BoundedOutOfOrdernessTimestampExtractor: 1. Detect inactivity; 2. Force time forward when inactive; 3. Record metrics per partition per source
  46. Possible Improvements: More sophisticated inactivity detection; More flexible forced-time-progression; Detect inactivity at the source
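One possible shape for the "(slightly) custom" assigner from slide 45, written against the pre-1.11 AssignerWithPeriodicWatermarks interface that BoundedOutOfOrdernessTimestampExtractor is built on. The idle-timeout threshold and the wall-clock-based fallback are assumptions to illustrate the idea, and the per-partition metric recording is omitted; this is not the actual Netflix code.

```scala
import org.apache.flink.streaming.api.functions.AssignerWithPeriodicWatermarks
import org.apache.flink.streaming.api.watermark.Watermark

// Sketch: behaves like BoundedOutOfOrdernessTimestampExtractor while events flow,
// but forces the watermark forward (based on wall-clock time) when the partition
// has been inactive for longer than `idleTimeoutMs`.
class IdleAwareWatermarkAssigner[T](
    extractTs: T => Long, // assumed timestamp accessor
    maxOutOfOrdernessMs: Long,
    idleTimeoutMs: Long)
  extends AssignerWithPeriodicWatermarks[T] {

  private var maxTimestamp: Long = Long.MinValue + maxOutOfOrdernessMs
  private var lastEventWallClock: Long = System.currentTimeMillis()

  override def extractTimestamp(element: T, previousElementTimestamp: Long): Long = {
    val ts = extractTs(element)
    maxTimestamp = math.max(maxTimestamp, ts)
    lastEventWallClock = System.currentTimeMillis()
    ts
  }

  override def getCurrentWatermark(): Watermark = {
    val now = System.currentTimeMillis()
    if (now - lastEventWallClock > idleTimeoutMs)
      new Watermark(now - maxOutOfOrdernessMs) // inactivity detected: advance anyway
    else
      new Watermark(maxTimestamp - maxOutOfOrdernessMs)
  }
}
```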
  47. Checkpointing Large State
  48. Checkpointing Large State: One unresponsive TM can cause slowness or even failure of the entire checkpoint
  49. Checkpointing Large State: Resource intensive (2x-3x CPU/Network)
  50. Checkpointing Large State: Reduce interval and add min-pause between checkpoints; Increases duplicates when restoring job; Large catch-up after restore
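The checkpoint interval and min-pause mentioned on slide 50 are ordinary job settings. A hedged example of where those knobs live; the state backend URI and the concrete values are placeholders, and the real trade-off against duplicates and catch-up time after restore is the point of the slide:

```scala
import org.apache.flink.contrib.streaming.state.RocksDBStateBackend
import org.apache.flink.streaming.api.scala._

object CheckpointSettingsSketch {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment

    // RocksDB state backend with incremental checkpoints (slide 14); the URI is a placeholder.
    env.setStateBackend(new RocksDBStateBackend("s3://some-bucket/checkpoints", true))

    // Illustrative values only: how often to checkpoint, and the minimum pause
    // between checkpoints so the job has headroom to catch up.
    env.enableCheckpointing(15 * 60 * 1000L)
    env.getCheckpointConfig.setMinPauseBetweenCheckpoints(5 * 60 * 1000L)
  }
}
```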
  51. An Observation About The State: Large portion of total state is recommendations data; Only ID and timestamp are needed for the join
  52. Move Some State Out Of Flink: Keep only ID and timestamp in Flink; Move data to an external store; Fetching becomes an order of magnitude slower (network call vs. local disk)
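Slide 52's idea of keeping only the ID and timestamp in Flink and fetching the full recommendations payload from an external store at emit time could look roughly like this. The RecommendationsStore trait, the record types, and the window length are assumptions for illustration; the talk does not name the store or its API.

```scala
import org.apache.flink.api.common.state.{ListState, ListStateDescriptor}
import org.apache.flink.configuration.Configuration
import org.apache.flink.streaming.api.functions.KeyedProcessFunction
import org.apache.flink.util.Collector

import scala.collection.JavaConverters._

// Hypothetical client for the external store; not part of the talk.
trait RecommendationsStore extends Serializable {
  def fetch(ids: Seq[String]): Map[String, Array[Byte]]
}

case class RecRef(memberId: Long, id: String, timestamp: Long) // only what the join needs
case class Joined(memberId: Long, payloads: Seq[Array[Byte]])

class SlimStateJoin(store: RecommendationsStore, windowMillis: Long)
    extends KeyedProcessFunction[Long, RecRef, Joined] {

  private var refs: ListState[RecRef] = _

  override def open(parameters: Configuration): Unit =
    refs = getRuntimeContext.getListState(
      new ListStateDescriptor[RecRef]("rec-refs", classOf[RecRef]))

  override def processElement(
      r: RecRef,
      ctx: KeyedProcessFunction[Long, RecRef, Joined]#Context,
      out: Collector[Joined]): Unit = {
    refs.add(r) // Flink state stays small: ID + timestamp only
    ctx.timerService().registerEventTimeTimer(r.timestamp + windowMillis)
  }

  override def onTimer(
      timestamp: Long,
      ctx: KeyedProcessFunction[Long, RecRef, Joined]#OnTimerContext,
      out: Collector[Joined]): Unit = {
    val ids = Option(refs.get()).map(_.asScala.map(_.id).toSeq).getOrElse(Seq.empty)
    // Network fetch at emit time: an order of magnitude slower than local RocksDB reads,
    // but the Flink state (and therefore the checkpoint) is much smaller.
    val payloads = store.fetch(ids)
    out.collect(Joined(ctx.getCurrentKey, ids.flatMap(payloads.get)))
  }
}
```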
  53. Possible Improvements: Checkpoint to/restore from persistent EBS; Incremental savepoint; Clean restart after checkpoint
  54. Monitoring and Understanding The Job
  55. Monitoring and Understanding The Job - Flink Metrics: numberOfFailedCheckpoints, lastCheckpointDuration; inputQueueLength, outputQueueLength; currentLowWatermark; fullRestarts, downtime; ...
  56. Monitoring and Understanding The Job - Instance-/Container-Level Metrics: CPU, Network, Disk, Memory, GC, ...; Check for unbalanced processing
  57. Monitoring and Understanding The Job - Time and Watermarks: Event timestamps of inputs, relative to wall-clock time & watermark; Watermark relative to wall-clock time; At different operators; Break down by task
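Beyond the built-in metrics listed on slide 55, the per-operator time and watermark visibility described on slide 57 can be implemented with custom gauges. A minimal, assumed example of wiring such gauges into an operator; the metric names and the Event type are placeholders:

```scala
import org.apache.flink.configuration.Configuration
import org.apache.flink.metrics.Gauge
import org.apache.flink.streaming.api.functions.KeyedProcessFunction
import org.apache.flink.util.Collector

case class Event(memberId: Long, timestamp: Long)

// Pass-through operator that exposes how far event time and the watermark
// lag behind wall-clock time for this subtask.
class TimeLagInstrumentation extends KeyedProcessFunction[Long, Event, Event] {

  @volatile private var lastEventTimestamp: Long = 0L
  @volatile private var lastWatermark: Long = 0L

  override def open(parameters: Configuration): Unit = {
    val group = getRuntimeContext.getMetricGroup
    group.gauge[Long, Gauge[Long]]("eventTimeLagMs", new Gauge[Long] {
      override def getValue(): Long = System.currentTimeMillis() - lastEventTimestamp
    })
    group.gauge[Long, Gauge[Long]]("watermarkLagMs", new Gauge[Long] {
      override def getValue(): Long = System.currentTimeMillis() - lastWatermark
    })
  }

  override def processElement(
      e: Event,
      ctx: KeyedProcessFunction[Long, Event, Event]#Context,
      out: Collector[Event]): Unit = {
    lastEventTimestamp = e.timestamp
    lastWatermark = ctx.timerService().currentWatermark()
    out.collect(e)
  }
}
```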
  58. Monitoring and Understanding The Job - Performance: Issues often only appear at scale; Time all parts of the application; Look at CPU flamegraphs; Replay from earliest offset (Kafka)
  59. Monitoring and Understanding The Job - State: Difficult to get insights about the entire state at a point in time; Take a savepoint; Manually schedule a timer for every key to collect metrics
  60. Wrap-Up: Job has been running well in production, especially after moving to 1.7; Continue to work on robustness, failure recovery, and operational ease; Trade off some consistency for higher availability; Auto-scaling
  61. Thanks! Questions?
