At WalmartLabs, millions of product updates and new products are ingested every day. In our quest to provide a seamless shopping experience for our customers, we developed a near-real-time indexing data pipeline. The pipeline is a key component in keeping the dynamically changing product catalog up to date, along with other signals such as store and online availability and offers.
Our indexing component, which is based on the Spark Streaming receiver approach, consumes events from multiple Kafka topics such as Product Change, Store Availability, and Offer Change, and merges the transformed product attributes with the historical signals computed by the relevance data pipeline and stored in Cassandra. This data is further processed by another streaming component, which partitions documents into a Kafka topic per shard so that they can be indexed into Apache Solr for Product Search. Deployment of this pipeline is automated end to end.
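The per-shard routing step can be pictured roughly as follows. This is only a sketch under assumed names: the Document type, the solr-shard-N topic naming, the broker address, and the shard count are illustrative, not the production implementation.

import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

// hypothetical transformed-document payload
case class Document(id: String, json: String)

val numShards = 16                                   // illustrative Solr shard count
val props = new Properties()
props.put("bootstrap.servers", "kafka1:9092")
props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
val producer = new KafkaProducer[String, String](props)

// route each document to the Kafka topic of the Solr shard its id hashes to
def routeToShardTopic(doc: Document): Unit = {
  val shard = (doc.id.hashCode & Integer.MAX_VALUE) % numShards
  producer.send(new ProducerRecord[String, String](s"solr-shard-$shard", doc.id, doc.json))
}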
4. Use Case: Near Real Time Indexing
• Improve Customer experience
• Update Product Information
  o Index new Products
  o Product Attribute change
  o Product Offer (Online availability) events
• 86 million Product Change events/day
• 1 product -> 5000 stores
• Store Availability Change Events ~ 20K events/sec
5. Motivation For Spark
• Offline/Full Indexing – integration with the Spark Batch Job
• To maintain the same code base/logic to ease debugging
• Potentially leverage the same technology stack for Batch and Streaming
6. Challenges
• Merge real time data with historic signals data updated at different frequencies
• Update the latest value of an attribute from multiple pipeline updates
• Dynamic configuration update in the Streaming component
• Manage Start/Stop of Spark Streaming components
8. Lambda Architecture Processing Overview
• Historic data computed by batch pipeline stored in Cassandra
• Automatic management of latest version of data fields (see the merge sketch below)
• Merge real time data with historic signals to compute complete dataset
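A minimal sketch, under assumed types, of a "latest version wins" merge of real-time updates into the historic field set. The FieldValue/ProductFields types and the timestamp-based resolution are assumptions for illustration, not the actual implementation.

// hypothetical per-field value with its update timestamp
case class FieldValue(value: String, updatedAt: Long)
type ProductFields = Map[String, FieldValue]

// keep the newer value for every field present in either map
def mergeLatest(current: ProductFields, update: ProductFields): ProductFields =
  (current.keySet ++ update.keySet).map { field =>
    val merged = (current.get(field), update.get(field)) match {
      case (Some(a), Some(b)) => if (b.updatedAt >= a.updatedAt) b else a
      case (a, b)             => b.orElse(a).get
    }
    field -> merged
  }.toMap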
11. Reprocessing?
Event Ordering?
Synchronization of Configuration Update?
Start/Stop Streaming Component?
Orchestration with Full Index Update?
Implementation
12. Streaming Component Interaction
• Spark Streaming Receiver Approach (see the sketch below)
• Multiple Kafka Streams processing
• Store offsets in Zookeeper
• Kafka Partitions by ID
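A minimal sketch of the receiver-based approach named above, assuming the spark-streaming-kafka (Kafka 0.8) integration: one receiver per topic with consumer offsets tracked in Zookeeper by the consumer group, and the streams unioned for downstream processing. The topic names, Zookeeper quorum, batch interval, and group id are illustrative.

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

val conf = new SparkConf().setAppName("nrt-indexing")
val ssc  = new StreamingContext(conf, Seconds(10))

val zkQuorum = "zk1:2181,zk2:2181"   // offsets are stored in Zookeeper by the high-level consumer
val groupId  = "nrt-indexer"
val topics   = Seq("product-change", "store-availability", "offer-change")

// one receiver stream per Kafka topic (2 consumer threads each), unioned into a single DStream
val streams = topics.map(t => KafkaUtils.createStream(ssc, zkQuorum, groupId, Map(t -> 2)))
val events  = ssc.union(streams)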
13. Monitoring
• Extended Spark Metrics API
• Register Custom Accumulators/Gauges for key metrics (see the sketch below)
• Kafka Consumer Lag with Custom Scripts
• Grafana Dashboard for Visualization
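One way to picture the custom accumulator/gauge idea, building on the ssc and events values from the previous sketch: count indexed documents with a Spark accumulator and expose the count through a Dropwizard gauge that a Graphite reporter could publish for Grafana. The metric names are assumptions, not the presenters' actual metrics.

import com.codahale.metrics.{Gauge, MetricRegistry}

val indexedDocs = ssc.sparkContext.accumulator(0L, "indexedDocs")

val metricRegistry = new MetricRegistry()
metricRegistry.register(MetricRegistry.name("nrt", "indexedDocs"), new Gauge[Long] {
  override def getValue: Long = indexedDocs.value   // expose the accumulator's current value
})

// update the counter once per micro-batch on the driver
events.foreachRDD { rdd => indexedDocs += rdd.count() }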
14. Tuning
• Scheduling delay = 0
• Partition RDDs effectively – in multiples of the number of Spark workers
• Coalesce over repartition
• spark.streaming.backpressure.enabled
• spark.shuffle.consolidateFiles (both settings are sketched below)
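Illustrative settings for the tuning points above; the values are assumptions, not the presenters' actual configuration.

import org.apache.spark.SparkConf

val tunedConf = new SparkConf()
  .setAppName("nrt-indexing")
  .set("spark.streaming.backpressure.enabled", "true")  // throttle receivers when batches start to lag
  .set("spark.shuffle.consolidateFiles", "true")        // fewer shuffle files (hash shuffle manager, older Spark releases)

// prefer coalesce over repartition to shrink partition counts without a full shuffle,
// e.g. down to a small multiple of the total worker cores:
// someRdd.coalesce(numWorkers * coresPerWorker)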
16. Lessons
Querying Cassandra
• Worst: filter on the Spark side
  sc.cassandraTable().filter(partitionKey in keys)
• Bad: filter on the C* side in a single operation
  sc.cassandraTable().where(keys in productIds)
  Similar to an "IN" query clause:
  SELECT * FROM my_keyspace.users WHERE id IN (1, 2, 3, 4)
• Best: filter on the C* side in a distributed and concurrent fashion (see the sketch below)
  kafkaRDD.joinWithCassandraTable()
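A minimal sketch of the "best" pattern, assuming the DataStax spark-cassandra-connector: key the incoming product ids by the table's partition key and join directly with the Cassandra table, so the lookups run distributed and concurrent on the executors instead of issuing one large IN query. The keyspace, table, and use of the product id as the partition key are assumptions.

import com.datastax.spark.connector._
import org.apache.spark.rdd.RDD

def lookupSignals(productIds: RDD[String]) =
  productIds
    .map(Tuple1(_))                                    // Tuple1 maps onto the single partition key column
    .joinWithCassandraTable("my_keyspace", "signals")  // per-partition, concurrent point reads on the C* side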
17. A little more about the IN clause
Multiple Requests: "IN" Clause Failure Scenario
(Image source: https://lostechies.com/ryansvihla/2014/09/22/cassandra-query-patterns-not-using-the-in-query-for-multiple-partitions/)