In this presentation, we talk about the state-of-the-art infrastructure we have established at Walmart Labs for the Search product using Spark Streaming and DataFrames. First, we have successfully used multiple micro-batch Spark Streaming pipelines to update and process information such as product availability and pick-up-today eligibility, along with product catalog information, in our search index at up to 10,000 Kafka events per second in near real time. Previously, all product catalog changes in the index had a 24-hour delay; with Spark Streaming, these changes are visible in near real time. This addition has provided a great boost to the business by giving end customers instant access to features such as product availability, store pickup, etc.
Second, we have built a scalable anomaly detection framework purely using Spark DataFrames that is being used by our data pipelines to detect abnormalities in search data. Anomaly detection is an important problem not only in the search domain but also in many other domains such as performance monitoring and fraud detection. While building it, we realized that Spark DataFrames not only process information faster but are also more flexible to work with. One can write Hive-like queries, Pig-like code, UDFs, UDAFs, Python-like code, etc. all in the same place very easily, and can build DataFrame templates that multiple teams can use and reuse effectively. We believe that, if implemented correctly, Spark DataFrames can potentially replace Hive/Pig in the big data space and have the potential to become a unified data language.
We conclude that Spark Streaming and DataFrames are the key to processing extremely large streams of data in real time with ease of use.
Learnings Using Spark Streaming and DataFrames for Walmart Search: Spark Summit East talk by Nirmal Sharma and Yan Zheng
1. LEARNINGS USING SPARK STREAMING & DATAFRAMES FOR
WALMART SEARCH
Yan Zheng, Director, Search - Walmart Labs
Nirmal Sharma, Principal Engineer, Search - Walmart Labs
2. Search growth in the last couple of years
• Product catalog is growing exponentially
• Product updates by merchants happen in almost real time
• Price updates happen in almost real time
• Inventory updates happen in almost real time
• Data used for relevance signals increased 10 times
• Number of functionalities/business use cases to support increased substantially
• Need for real-time analytics increased, to analyze data quickly and make faster business decisions
6. • The issue with the old architecture was that the index update was happening once a day, and that was holding us back in terms of user experience and business
8. How we did it
• Started capturing all catalog updates in Kafka, then using Spark Streaming to process these real-time events and make them available to our indexes (processing up to 10,000 events per sec).
• Microservices using Spark Streaming to directly update price, inventory, etc. details in the indexes (close to 8,000 events per sec).
• Spark Streaming and a custom-built Elasticsearch data loader to load data directly into ES for real-time analytics (processing 20,000-25,000 events per sec).
• Custom Hadoop and Spark jobs to process user data faster for all our data science signals (50 TB - 100 TB of data per batch) and to make data available faster to analysts and business people.
• Started updating all our BI reports within a couple of hours, which used to take days.
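The talk does not show the streaming code itself. As a rough illustration of the micro-batch idea, the sketch below uses plain Scala collections as a stand-in for a Spark Streaming batch, so it runs without a cluster. It collapses each micro-batch to the newest update per item and field, so the index would receive one write per key rather than one per event. The `ItemUpdate` shape, the field names, and the collapse rule are all assumptions for illustration, not the pipeline described in the slides.

```scala
object MicroBatchSketch {
  // Hypothetical event shape; the real Kafka payload is not shown in the talk.
  final case class ItemUpdate(itemId: String, field: String, value: String, timestampMs: Long)

  // Within one micro-batch, keep only the newest update per (item, field).
  // The result is sorted by key so the output order is deterministic.
  def collapseBatch(batch: Seq[ItemUpdate]): Seq[ItemUpdate] =
    batch
      .groupBy(u => (u.itemId, u.field))
      .values
      .map(_.maxBy(_.timestampMs))
      .toList
      .sortBy(u => (u.itemId, u.field))

  // Split an event sequence into fixed-size micro-batches and collapse each one.
  def process(events: Seq[ItemUpdate], batchSize: Int): Seq[Seq[ItemUpdate]] =
    events.grouped(batchSize).map(collapseBatch).toList
}
```

In a real job, the per-batch function would typically run inside the streaming framework's batch callback, and the collapsed rows would be sent to the index writer.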
9. Further deep dive into the technologies used
• Technology stack used for our data pipelines
– Spark – Streaming, DataFrames
– Hadoop – Hive, MapReduce
– Cassandra – for lookup (fast read/write)
– Kafka – event processing
– Elasticsearch – for logging and analytics
– Solr – indexing walmart.com
10. How we used Spark DataFrames to build scalable and flexible pipelines
11. • DATA PIPELINES:
1. Batch processing (analytics, data science)
2. Real-time processing (index update, microservices)
13. • So the issue is that there is no single uniform language to build data pipelines
• There is no easy way to reuse code, or to template it so that others can reuse it for similar work
14. DataFrames have the potential to become the next unified language for data engineering
15. Here are some examples to explain
• This is the current code for K-Means clustering, written in Python, Hive, and Java (for UDFs)
• https://gecgithub01.walmart.com/LabsSearch/DOD-BE/tree/master/src/main/scripts/query_categorization_daily
• The current Python code is more than 1,000 lines, covering data preparation, then data transformation to calculate the feature vector, then model training, and finally data post-processing to validate and store the data
16. This is the new code for the same K-Means clustering using Spark DataFrames
• The new code is only 60-70 lines
• The whole code is a single file
• https://gecgithub01.walmart.com/LabsSearch/polaris-data-gen/blob/master/application/polaris-analytics/kmeans-query-tier/src/main/scala/QueryClustering.scala
• An additional advantage is that the whole code is parallelized and takes just 1/5th of the time taken by the original code
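The linked QueryClustering.scala is not reproduced in the slides, so the following is only a minimal sketch of how data preparation, feature-vector assembly, training, and assignment can fit in one short file with Spark ML. The object name, column names, sample rows, seed, and the `tier` output column are hypothetical. It assumes Spark SQL and MLlib are on the classpath.

```scala
import org.apache.spark.ml.clustering.KMeans
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.sql.{DataFrame, SparkSession}

object QueryClusteringSketch {
  // Hypothetical per-query features: (query, searches, clicks).
  val sampleRows: Seq[(String, Double, Double)] = Seq(
    ("tv", 10000.0, 4000.0),
    ("laptop", 9500.0, 3800.0),
    ("iphone case", 11000.0, 4200.0),
    ("left handed widget", 10.0, 2.0),
    ("obscure part 123", 12.0, 1.0),
    ("rare gizmo", 8.0, 3.0)
  )

  def cluster(spark: SparkSession, k: Int): DataFrame = {
    import spark.implicits._

    // 1. Data preparation: in a real job this would be a table read plus filters.
    val queries = sampleRows.toDF("query", "searches", "clicks")

    // 2. Feature vector: pack the numeric columns into one vector column.
    val features = new VectorAssembler()
      .setInputCols(Array("searches", "clicks"))
      .setOutputCol("features")
      .transform(queries)

    // 3. Model training.
    val model = new KMeans()
      .setK(k)
      .setSeed(42L)
      .setFeaturesCol("features")
      .setPredictionCol("tier")
      .fit(features)

    // 4. Assignment: attach the cluster id to every query.
    model.transform(features).select("query", "tier")
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("query-clustering-sketch")
      .master("local[*]")
      .getOrCreate()
    try cluster(spark, k = 2).show(truncate = false)
    finally spark.stop()
  }
}
```

Because every step is a DataFrame transformation, Spark distributes the work across executors without extra code, which is consistent with the parallelism benefit the slide mentions.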
17. Another example is our scalable anomaly framework
• For any data pipeline, data quality is key
– All data is correlated in one way or another because almost all data feeds our search
– Upfront data checks at the source are as important as final data checks at the target
– Cause and effect analysis
– Easy to use
– Pluggable into all kinds of data pipelines
– Easy to enhance
– Lightweight and easy to install
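The slides do not state which rule the framework uses to decide that a metric is abnormal. One common and simple choice, shown here purely as an assumption, is to compare the latest value of a check metric against the mean and standard deviation of a trailing window of earlier runs. The function name and the default threshold of 3 are hypothetical.

```scala
object AnomalyCheckSketch {
  // Flags `latest` when it lies more than `threshold` sample standard
  // deviations from the mean of `history`. With fewer than two historical
  // points there is nothing to compare against, so nothing is flagged.
  def isAnomalous(history: Seq[Double], latest: Double, threshold: Double = 3.0): Boolean =
    if (history.size < 2) false
    else {
      val mean = history.sum / history.size
      val variance = history.map(x => (x - mean) * (x - mean)).sum / (history.size - 1)
      val std = math.sqrt(variance)
      // A perfectly flat history makes any change an anomaly.
      if (std == 0.0) latest != mean
      else math.abs(latest - mean) / std > threshold
    }
}
```

For example, a null-count metric that has hovered around 100 for ten runs would be flagged if it suddenly jumped to 150, but not if it moved to 101.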
18. This is how we can run ETLs using templates and process them using DataFrames
• {
    "check_desc": "primary_Category_path_null_count",
    "check_id": 1,
    "check_metric": ["count"],
    "backfill": [0],
    "fact_info": [{
      "table_name": "item_attribute_table",
      "table_type": [],
      "key_names": [],
      "check_metric_field": "",
      "database_name": "polaris",
      "filters": ["primary_category_path is null"],
      "run_datetime": "LATEST_DATE_MINUS_10",
      "datetime_partition_field": "data_date",
      "datetime_partition_index": 0,
      "other_partition_fields": ""
    }]
  }
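How the framework turns such a template into a query is not shown in the slides. As one possible reading, the sketch below models a subset of the `fact_info` fields as a case class and builds a SQL string that yields (key, value) rows for one partition date. The case class, the `buildQuery` function, the generated SQL shape, and the idea that `run_datetime` has already been resolved to a concrete date string are all assumptions.

```scala
object CheckTemplateSketch {
  // Mirrors a subset of the "fact_info" fields in the template above.
  final case class FactInfo(
    databaseName: String,
    tableName: String,
    filters: Seq[String],
    keyNames: Seq[String],
    checkMetricField: String,
    datetimePartitionField: String
  )

  // Builds a SELECT that emits (key, value) rows for one resolved run date.
  // With no key columns every row falls under a single key; with no metric
  // field each row contributes a constant, which is enough for a count.
  def buildQuery(fact: FactInfo, runDate: String): String = {
    val keyExpr =
      if (fact.keyNames.isEmpty) "'all'"
      else fact.keyNames.mkString("concat_ws('|', ", ", ", ")")
    val valueExpr =
      if (fact.checkMetricField.isEmpty) "'1'"
      else s"cast(${fact.checkMetricField} as string)"
    val predicates =
      s"${fact.datetimePartitionField} = '$runDate'" +: fact.filters.map(f => s"($f)")
    s"SELECT $keyExpr AS key, $valueExpr AS value " +
      s"FROM ${fact.databaseName}.${fact.tableName} " +
      s"WHERE ${predicates.mkString(" AND ")}"
  }
}
```

A string like this could then be passed to Spark SQL, with the resulting rows handed to the metric aggregation step. Keeping the template declarative means a new check needs only a new JSON entry, not new code.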
19. Code snippet…
• import scala.math.{max, min}

  // resultSet is a pair RDD of (key, value-as-String); metric is the "check_metric" list.
  // init is defined elsewhere in the original code (presumably a (sum, count) zero value).
  val aggsResult = metric match {
    // No metric requested: pass the rows through unchanged.
    case s: Vector[String] if s.isEmpty => resultSet
    // .map(row => (check_id.toString, row._1, row._2)).toDF("check_id", "key", "aggValue")
    case _ =>
      val caseResult = metric(0) match {
        case "average" =>
          resultSet.mapValues(_.toDouble)
            // Accumulate (sum, count) per key, then divide.
            .aggregateByKey(init)((a, b) => (a._1 + b, a._2 + 1), (x, y) => (x._1 + y._1, x._2 + y._2))
            .mapValues(x => x._1 / x._2)
        case "sum" =>
          resultSet.mapValues(_.toDouble).reduceByKey((a, b) => a + b)
        case "count" | "dups" =>
          // Row count per key.
          resultSet.aggregateByKey(0.0)((a, b) => a + 1.0, (a, b) => a + b)
        case "max" =>
          resultSet.mapValues(_.toDouble).reduceByKey((a, b) => max(a, b))
        case "min" =>
          resultSet.mapValues(_.toDouble).reduceByKey((a, b) => min(a, b))
        case "countdistinct" =>
          // Same row count as "count"; presumably the rows are de-duplicated upstream.
          resultSet.aggregateByKey(0.0)((a, b) => a + 1.0, (a, b) => a + b)
      }
      caseResult.mapValues(_.toString)
      // .map(row => (check_id.toString, row._1, row._2)).toDF("check_id", "key", "aggValue")
  }
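To try the metric dispatch above without a Spark cluster, the same per-key logic can be sketched over plain Scala collections. This is a local stand-in rather than the framework's code: the object and function names are hypothetical, and unlike the snippet above, `countdistinct` here explicitly counts distinct values per key.

```scala
object MetricAggregationSketch {
  // Aggregates (key, value) rows per key for a named metric, using the
  // metric names from the snippet above.
  def aggregate(metric: String, rows: Seq[(String, String)]): Map[String, Double] = {
    val byKey: Map[String, Seq[String]] =
      rows.groupBy(_._1).map { case (k, kvs) => k -> kvs.map(_._2) }

    byKey.map { case (key, values) =>
      val result = metric match {
        case "average"        => values.map(_.toDouble).sum / values.size
        case "sum"            => values.map(_.toDouble).sum
        case "count" | "dups" => values.size.toDouble
        case "max"            => values.map(_.toDouble).max
        case "min"            => values.map(_.toDouble).min
        // Assumption: distinct values are counted here, which differs from
        // the row count used in the snippet above.
        case "countdistinct"  => values.distinct.size.toDouble
        case other            => throw new IllegalArgumentException(s"Unknown metric: $other")
      }
      key -> result
    }
  }
}
```

The explicit failure on an unknown metric name is a design choice of this sketch; it surfaces a typo in a check template immediately instead of producing an empty result.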