Incremental transformation of transactional data models to analytical data models in near real time
Transactional systems are designed with data models that maximize write throughput across multiple parallel business flows. They evolve iteratively with the business and must react quickly to a changing business landscape to minimize time to market.
Analytical systems, on the other hand, require data models to maximize query throughput over broad, deep and large data volumes.
The need for a platform that transforms the transactional data model into an analytical data model is well established in the industry. This is currently achieved through two different paradigms: stream processing at lower latencies and batch processing at higher latencies.
We have solved the same problem through a third paradigm: incremental processing, for intermediate latencies (5 minutes to 1 hour).
We considered and dropped implementations of the streaming paradigm either because of a lack of completeness guarantees or the absence of complex join capabilities across a large number of entities.
Our incremental processing platform transforms transactional data models to analytical data models. It provides expressibility for complex joins across multiple entities (live with 30) through a Transformation Definition Language (TDL). These complex joins are evaluated incrementally as transactional data changes, periodically updating the analytical data model. For near-real-time use cases, this is done every 5-10 minutes.
Changes to transactional data models are handled through version support in the TDL. These changes are absorbed with a pause and resume of the transformations.
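As a rough illustration of what such a definition could look like, here is a hedged sketch in Python; the structure, field names, and cadence below are assumptions for exposition, not the platform's actual TDL syntax:

```python
# Hypothetical sketch of a TDL-style transformation definition for the
# Pick_To_Dispatch example used later in the deck. Every field name here
# is an illustrative assumption, not the platform's real syntax.
pick_to_dispatch_v2 = {
    "target": "Pick_To_Dispatch",
    "version": 2,                  # bumped when a source model changes;
                                   # absorbed via pause/resume of the job
    "key": ["order", "shipment"],  # composite key of the analytical row
    "sources": ["Picklists", "Shipments", "Vans"],
    "joins": [
        {"left": "Picklists", "right": "Shipments",
         "on": "order", "type": "left_outer"},
        {"left": "Shipments", "right": "Vans",
         "on": "shipment", "type": "inner"},
    ],
    "window": "10 minutes",        # near-real-time cadence (5-10 min)
}
```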
The Flipkart Fulfillment Services Group serves over a million shipments a day at its peak. Customer delight through reliable and fast delivery of orders is our primary goal. To succeed in this endeavour, our ground operations depend on live and accurate visibility into the journey of all shipments pan-India. Overall data volumes range in the tens of terabytes, with a change frequency of over 25k QPS at peak. All our transactional systems combined generate mutations with volumes close to 200 GB every second. Our incremental platform is built to handle this scale.
With this platform, we have achieved analytics at low latencies with high completeness without compromising on business agility.
In this talk, I will cover the specifics of our evaluations and our learnings from the journey of building the platform.
7. Grocery fulfillment journey
● Multiple systems participate in the journey
● Time to act drives latency requirements
● 100% accuracy expected
● Minimal effort for business monitoring
8. Characteristics of transactional data models
● Normalized
● Complex, directed, deterministic relationships
● Longer life cycles
● Tumbling time windows
9. Normalization helps parallelism and fast writes
[Entity diagram: Order (1) fans out into order_items; Picklists (2) link order_items to a picker; Shipments (3) link order_items to a shipment; Vans (4) link a shipment to a van.]
10. Denormalization helps fast reads

Picklists:
  time  order  picker
  t1    o1     p1

Shipments:
  order  shipment
  o1     s1

Vans:
  shipment  van
  s1        v1

Pick_To_Dispatch (composite key: order+shipment):
  key    order  picker  pick_time  shipment  van  location
  o1+s1  o1     p1      t1         s1        v1   <lat, long>
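A minimal Python sketch of the denormalizing join above, assuming in-memory lists of rows and assuming the location column comes from the van record; the real platform evaluates the equivalent joins through its TDL over persistent stores:

```python
# Minimal sketch of the denormalizing join from the tables above.
# Assumes in-memory rows; location is assumed to ride with the van record.
picklists = [{"time": "t1", "order": "o1", "picker": "p1"}]
shipments = [{"order": "o1", "shipment": "s1"}]
vans      = [{"shipment": "s1", "van": "v1", "location": "<lat, long>"}]

ship_by_order = {s["order"]: s for s in shipments}
van_by_ship   = {v["shipment"]: v for v in vans}

pick_to_dispatch = []
for p in picklists:
    s = ship_by_order.get(p["order"])                  # left outer join on order
    v = van_by_ship.get(s["shipment"]) if s else None  # lookup van via shipment
    pick_to_dispatch.append({
        "key": p["order"] + "+" + (s["shipment"] if s else "null"),
        "order": p["order"], "picker": p["picker"], "pick_time": p["time"],
        "shipment": s and s["shipment"],
        "van": v and v["van"],
        "location": v and v["location"],
    })
# One wide row per pick: reads need no runtime joins, at the cost of writes.
```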
15. At 10:05

Picklists:
  time   order  picker
  10:00  o1     p1
  10:00  o2     p2

Shipments: (no rows yet)
  order  shipment

Vans: (no rows yet)
  shipment  van

Run at 10:05 over window start 9:50, end 10:00.

Pick_To_Dispatch:
  key      order  picker  pick_time  shipment  van  location
  o1+null  o1     p1      10:00      -         -    -
  o2+null  o2     p2      10:00      -         -    -
16. At 10:10

Picklists:
  time   order  picker
  10:00  o1     p1
  10:00  o2     p2
  10:10  o3     p2
  10:10  o4     p3

Shipments:
  order  shipment
  o1     s1
  o1     s2

Vans: (no rows yet)
  shipment  van

Previous window: start 9:50, end 10:00. Current window: start 10:00, end 10:10.
18. At 10:10

Picklists:
  time   order  picker
  10:00  o1     p1
  10:00  o2     p2
  10:10  o3     p2
  10:10  o4     p3

Shipments:
  order  shipment
  o1     s1
  o1     s2

Vans: (no rows yet)
  shipment  van

Current window: start 10:00, end 10:10.
Joins: Picklists LEFT OUTER JOIN Shipments (on order); Shipments INNER JOIN Vans (on shipment).
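A hedged Python sketch of one such incremental run, assuming every mutation row carries its mutation timestamp and the analytical model is an upsertable map keyed by the composite key; run_window and the store shapes are illustrative, not the platform's API:

```python
# Hedged sketch of one incremental run, e.g. at 10:10 over the mutation
# window (10:00, 10:10]. All names are illustrative; rows are assumed to
# carry a "time" field with their last mutation timestamp.
def run_window(picklists, shipments, vans, start, end, target):
    in_window = lambda t: start < t <= end

    pick_by_order, ship_by_order = {}, {}
    for p in picklists:
        pick_by_order.setdefault(p["order"], []).append(p)
    for s in shipments:
        ship_by_order.setdefault(s["order"], []).append(s)
    van_by_ship = {v["shipment"]: v for v in vans}

    # Re-derive every order touched by a mutation on either side of the join.
    affected = {p["order"] for p in picklists if in_window(p["time"])}
    affected |= {s["order"] for s in shipments if in_window(s["time"])}

    for order in affected:
        target.pop(order + "+null", None)  # retract a stale shipment-less row
        for p in pick_by_order.get(order, []):       # no picklist yet: no row
            for s in ship_by_order.get(order) or [None]:  # left outer join
                # Vans lookup kept optional here for brevity; the deck
                # models Shipments-Vans as an inner join.
                v = van_by_ship.get(s["shipment"]) if s else None
                key = order + "+" + (s["shipment"] if s else "null")
                target[key] = {                      # upsert by composite key
                    "order": order, "picker": p["picker"],
                    "pick_time": p["time"],
                    "shipment": s and s["shipment"],
                    "van": v and v["van"],
                    "location": v and v["location"],
                }
```

Run at 10:10 over (10:00, 10:10], and assuming the two shipment mutations landed inside that window, this retracts the o1+null row from the 10:05 run, emits o1+s1 and o1+s2, and adds o3+null and o4+null, matching the tables above.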
33. Batch - High accuracy, high latency, low cost
● Bulk writes leverage cheaper disks
● Range queries take longer for scans
● Compute can be shared
● Replays are simpler but take as long
34. Stream - Lower accuracy, low latency, low cost
● Record-level updates need fast writes
● Range scans slow down processing
● Compute is consistently engaged
● Replay is complex (Kappa) or infeasible (Lambda)
35. Incremental - High accuracy, mid latency, mid cost
● Replication needs fast writes
● Joins need fast scans
● Compute is shared
● Replays addressed by design
36. Data processing implementation trade-offs

               Accuracy   Latency   Cost
  Stream       Lower      Low       Low
  Batch        High       High      Low
  Incremental  High       Medium    Medium
37. Applications of Incremental Processing
Positive Indicators
● Time to act is 30 minutes or higher
● Accuracy is crucial
● Incremental visibility is acceptable
● Multiple systems come together with complex join criteria
Negative Indicators
● Low infrastructure cost is a constraint
● Independent systems
38. Thick, Medium and Thin Slices - Choose yours
[Figure: the three paradigms pictured as slices of data per run, with BATCH the thick slice, INCREMENTAL the medium slice, and STREAM the thin slice.]
40. THANK YOU!
41. Example - Out-of-order mutations

Picklists:
  time_d  order  picker
  10:00   o1     p1
  10:00   o2     p2
  10:10   o3     p2
  10:10   o4     p3

Shipments:
  order  shipment
  o1     s1
  o6     s6

Vans: (no rows yet)
  shipment  van

Joins: Picklists LEFT OUTER JOIN Shipments; Shipments INNER JOIN Vans.
The shipment mutation (o6, s6) arrives before any picklist row for o6 exists, so the left outer join has no row to attach it to yet (the "?" on the original slide).
42. Example - Out-of-order mutations

Picklists:
  time_d  order  picker
  10:00   o1     p1
  10:00   o2     p2
  10:10   o3     p2
  10:10   o4     p3
  10:20   o6     p6

Shipments:
  order  shipment
  o1     s1
  o6     s6

Vans: (no rows yet)
  shipment  van

Joins: Picklists LEFT OUTER JOIN Shipments.
When the picklist row for o6 lands at 10:20, the left outer join in that window's run finds the earlier shipment mutation (o6, s6) already in place, and the out-of-order change flows into Pick_To_Dispatch.
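Tying the two slides together, here is a hedged walkthrough reusing the run_window sketch from the slide 18 section; the shipment-side timestamps are assumptions added so mutations on either side can be windowed:

```python
# Walkthrough of the out-of-order example above, reusing the run_window
# sketch shown earlier. Shipment timestamps are assumed for illustration.
picklists = [
    {"time": "10:00", "order": "o1", "picker": "p1"},
    {"time": "10:00", "order": "o2", "picker": "p2"},
    {"time": "10:10", "order": "o3", "picker": "p2"},
    {"time": "10:10", "order": "o4", "picker": "p3"},
]
shipments = [
    {"time": "10:10", "order": "o1", "shipment": "s1"},
    {"time": "10:10", "order": "o6", "shipment": "s6"},  # no picklist row yet
]
vans, target = [], {}

run_window(picklists, shipments, vans, "10:00", "10:10", target)
assert "o6+s6" not in target   # out-of-order shipment waits for its picklist

picklists.append({"time": "10:20", "order": "o6", "picker": "p6"})
run_window(picklists, shipments, vans, "10:10", "10:20", target)
assert "o6+s6" in target       # late picklist re-joins the earlier shipment
```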