My use case is to monitor and improve overall search data quality, to detect unusual patterns in users' search behavior, and to report on-site intent back to the respective business stakeholders. To achieve this, I explored various big data processing engines capable of processing huge volumes of data with complex business logic in real time, and eventually settled on Flink stream processing. This talk showcases how I used Flink to accomplish that goal.
1. Real Time DQMM on Flink
Jaydeep
Staff Engineer in Search Team
Apache Oozie Committer
June 2019
2. Table of Contents
• What is Real Time Aggregation?
• Use Case
• What do we deal with?
• System Requirements
• Spark vs Flink
• Flink Cluster setup
• Flink on Yarn
• Architecture
• 100% Data Completeness
• Open Items
3. What is Real Time Aggregation?
• What is real time?
• What is the processing delay today?
• What is the real-time offering?
• Why do we need it?
4. Use Case
• Bug detection in Response log
• Bot detection
• Best seller item tracking
• Item Catalogue health
• Item out of stock (especially on event days)
• Top query monitoring
• Category performance
5. What do we deal with?
~4 billion logs per day
~8 million records per minute
~800 GB of data per day
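For intuition, the daily figures above can be converted into per-second and per-record numbers. This is only a back-of-the-envelope check (decimal GB assumed); note that the quoted ~8 million records per minute is roughly 3x the daily average, which suggests it is a peak-traffic figure:

```python
# Back-of-the-envelope rates derived from the volumes on this slide.
logs_per_day = 4_000_000_000        # ~4 billion logs per day
bytes_per_day = 800 * 10**9         # ~800 GB per day (decimal GB assumed)

avg_logs_per_sec = logs_per_day / 86_400   # 86,400 seconds per day
avg_logs_per_min = logs_per_day / 1_440    # 1,440 minutes per day
avg_bytes_per_log = bytes_per_day / logs_per_day

print(f"average logs/second : {avg_logs_per_sec:,.0f}")   # ~46,296
print(f"average logs/minute : {avg_logs_per_min:,.0f}")   # ~2.8 million
print(f"average bytes/log   : {avg_bytes_per_log:.0f}")   # ~200
```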
6. System Requirements
• Support for real-time processing
• Support for tracking events
• Easy recovery from failure
• Exactly-once processing
• Backpressure handling
• Support for event-based, time-based, and dynamic windows
• Highly available
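As an illustrative sketch (hostnames and paths are placeholders, and exact keys vary by Flink version), several of these requirements map directly onto entries in Flink's `flink-conf.yaml`:

```yaml
# flink-conf.yaml (illustrative fragment; hosts/paths are placeholders)

# Recovery / exactly-once: periodic checkpoints into durable storage
state.backend: rocksdb
state.checkpoints.dir: hdfs:///flink/checkpoints

# Automatic restart on failure
restart-strategy: fixed-delay
restart-strategy.fixed-delay.attempts: 3
restart-strategy.fixed-delay.delay: 10 s

# High availability via ZooKeeper
high-availability: zookeeper
high-availability.zookeeper.quorum: zk1:2181,zk2:2181,zk3:2181
high-availability.storageDir: hdfs:///flink/ha
```

Exactly-once mode itself is typically enabled in the job code, e.g. `env.enableCheckpointing(interval, CheckpointingMode.EXACTLY_ONCE)`.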
7. Spark vs Flink
Criteria                     | Spark        | Flink
-----------------------------|--------------|------------------
Data Processing              | Mini-batch   | Stream processing
Data Shuffling               | Polling      | Trigger
Window Function              | Time-based   | Time/Event/Custom
Memory Management            | Configurable | Auto-managed
Recovery                     | DAG level    | State level
Re-utilization and Iteration | By stage     | By event
11. 100% Data Completeness
Event Arrival Time  | Actual Event Time   | Clicks
--------------------|---------------------|-------
2019-06-01 10:01:00 | 2019-06-01 10:01:00 | 3
2019-06-01 10:02:00 | 2019-06-01 10:02:00 | 1
2019-06-01 10:04:00 | 2019-06-01 10:03:00 | 4
2019-06-01 10:06:00 | 2019-06-01 10:04:00 | 5
2019-06-01 10:08:00 | 2019-06-01 10:04:00 | 1

Processed Time      | Event Time Window   | Clicks
--------------------|---------------------|-------
2019-06-01 10:05:00 | 2019-06-01 10:05:00 | 8
2019-06-01 10:10:00 | 2019-06-01 10:10:00 | 6
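The two tables illustrate the gap between processing time and event time: the two events that actually happened at 10:04 arrived late (10:06 and 10:08), so windowing by arrival time counts them in the 10:10 window, while windowing by event time attributes all five events to the 10:05 window. A small self-contained simulation of this arithmetic (plain Python, not Flink code):

```python
from datetime import datetime
from collections import defaultdict

# (arrival time, actual event time, clicks) -- rows from the first table
events = [
    ("10:01", "10:01", 3),
    ("10:02", "10:02", 1),
    ("10:04", "10:03", 4),
    ("10:06", "10:04", 5),
    ("10:08", "10:04", 1),
]

def window_end(ts, minutes=5):
    """Label of the tumbling window containing ts (window end, inclusive)."""
    t = datetime.strptime(ts, "%H:%M")
    epoch_min = t.hour * 60 + t.minute
    end = ((epoch_min - 1) // minutes + 1) * minutes
    return f"{end // 60:02d}:{end % 60:02d}"

def aggregate(events, key_index):
    """Sum clicks per window, keyed by arrival (0) or event time (1)."""
    sums = defaultdict(int)
    for row in events:
        sums[window_end(row[key_index])] += row[2]
    return dict(sums)

by_processing_time = aggregate(events, 0)
by_event_time = aggregate(events, 1)

print(by_processing_time)  # {'10:05': 8, '10:10': 6} -> matches the table
print(by_event_time)       # {'10:05': 14}            -> all clicks belong here
```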
12. 100% Data Completeness
• Event-time data processing
• Handling delayed events
• Preventing false anomaly detection
• Probability-based model for data completeness
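A minimal sketch of the event-time mechanics these bullets describe (plain Python mirroring Flink's bounded-out-of-orderness watermarking, not the Flink API; the window size and lateness bound are assumptions): the watermark lags the maximum seen event time by a fixed bound, a window fires only once the watermark passes its end, and anything arriving after that is flagged as late rather than silently producing a false anomaly.

```python
from collections import defaultdict

WINDOW = 5        # tumbling window size, in minutes (assumed)
OUT_OF_ORDER = 2  # watermark lag behind max event time (assumed bound)

def run(stream):
    """stream: iterable of (event_time_minute, clicks), in arrival order."""
    windows = defaultdict(int)   # open windows: end minute -> click sum
    fired, late = {}, []
    max_event_time = float("-inf")

    for event_time, clicks in stream:
        end = ((event_time - 1) // WINDOW + 1) * WINDOW  # window end label
        if end in fired:
            late.append((event_time, clicks))   # arrived after its window fired
        else:
            windows[end] += clicks
        max_event_time = max(max_event_time, event_time)
        watermark = max_event_time - OUT_OF_ORDER
        for w in [w for w in windows if w <= watermark]:
            fired[w] = windows.pop(w)           # watermark passed: fire window
    return fired, windows, late

# Minutes since 10:00; the event at minute 4 arrives after minute 9's event.
stream = [(1, 3), (2, 1), (3, 4), (9, 2), (4, 5)]
fired, open_windows, late = run(stream)
print(fired)  # {5: 8}   -> window ending at minute 5 fired with 8 clicks
print(late)   # [(4, 5)] -> caught as late, not mis-counted as an anomaly
```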
13. Open Items
• Real-time model training
• Handling seasonality while detecting anomalies