At WalmartLabs, millions of product updates and new products are ingested every day. In our quest to provide a seamless shopping experience for our customers, we developed a near-real-time indexing data pipeline. The pipeline is a key component in keeping the dynamically changing product catalog up to date, along with other signals such as store and online availability and offers.
Our indexing component, which is based on the Spark Streaming receiver approach, consumes events from multiple Kafka topics such as Product Change, Store Availability, and Offer Change, and merges the transformed product attributes with the historical signals computed by the relevance data pipeline and stored in Cassandra. This data is further processed by another streaming component, which partitions documents into a Kafka topic per shard so that they can be indexed into Apache Solr for Product Search. Deployment of this pipeline is automated end to end.
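The per-shard routing step can be pictured roughly as follows. This is only a sketch under assumed names: the Document type, the solr-shard-N topic naming, the broker address, and the shard count are illustrative, not the production implementation.

import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

// hypothetical transformed-document payload
case class Document(id: String, json: String)

val numShards = 16                                   // illustrative Solr shard count
val props = new Properties()
props.put("bootstrap.servers", "kafka1:9092")
props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
val producer = new KafkaProducer[String, String](props)

// route each document to the Kafka topic of the Solr shard its id hashes to
def routeToShardTopic(doc: Document): Unit = {
  val shard = (doc.id.hashCode & Integer.MAX_VALUE) % numShards
  producer.send(new ProducerRecord[String, String](s"solr-shard-$shard", doc.id, doc.json))
}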
4. Use Case: Near Real Time Indexing
• Improve Customer experience
• Update Product Information
  o Index new Products
  o Product Attribute change
  o Product Offer (Online availability) events
• 86 million Product Change events/day
• 1 product -> 5000 stores
• Store Availability Change Events ~ 20K events/sec
5. Motivation For Spark
• Offline/Full Indexing – integration with the Spark Batch Job
• To maintain the same code base/logic to ease debugging
• Potentially leverage the same technology stack for Batch and Streaming
6. Challenges
• Merge real time data with historic signals data updated at different frequencies
• Update the latest value of an attribute from multiple pipeline updates
• Dynamic configuration update in the Streaming component
• Manage Start/Stop of Spark Streaming components
8. Lambda Architecture Processing Overview
• Historic data computed by batch pipeline stored in Cassandra
• Automatic management of latest version of data fields (see the merge sketch below)
• Merge real time data with historic signals to compute complete dataset
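A minimal sketch, under assumed types, of a "latest version wins" merge of real-time updates into the historic field set. The FieldValue/ProductFields types and the timestamp-based resolution are assumptions for illustration, not the actual implementation.

// hypothetical per-field value with its update timestamp
case class FieldValue(value: String, updatedAt: Long)
type ProductFields = Map[String, FieldValue]

// keep the newer value for every field present in either map
def mergeLatest(current: ProductFields, update: ProductFields): ProductFields =
  (current.keySet ++ update.keySet).map { field =>
    val merged = (current.get(field), update.get(field)) match {
      case (Some(a), Some(b)) => if (b.updatedAt >= a.updatedAt) b else a
      case (a, b)             => b.orElse(a).get
    }
    field -> merged
  }.toMap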
11. Reprocessing?
Event Ordering?
Synchronization of Configuration Update?
Start/Stop Streaming Component?
Orchestration with Full Index Update?
Implementation
12. Streaming Component Interaction
• Spark Streaming Receiver Approach (see the sketch below)
• Multiple Kafka Streams processing
• Store offsets in Zookeeper
• Kafka Partitions by ID
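A minimal sketch of the receiver-based approach named above, assuming the spark-streaming-kafka (Kafka 0.8) integration: one receiver per topic with consumer offsets tracked in Zookeeper by the consumer group, and the streams unioned for downstream processing. The topic names, Zookeeper quorum, batch interval, and group id are illustrative.

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

val conf = new SparkConf().setAppName("nrt-indexing")
val ssc  = new StreamingContext(conf, Seconds(10))

val zkQuorum = "zk1:2181,zk2:2181"   // offsets are stored in Zookeeper by the high-level consumer
val groupId  = "nrt-indexer"
val topics   = Seq("product-change", "store-availability", "offer-change")

// one receiver stream per Kafka topic (2 consumer threads each), unioned into a single DStream
val streams = topics.map(t => KafkaUtils.createStream(ssc, zkQuorum, groupId, Map(t -> 2)))
val events  = ssc.union(streams)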
13. Monitoring
• Extended Spark Metrics API
• Register Custom Accumulators/Gauges for key metrics (see the sketch below)
• Kafka Consumer Lag with Custom Scripts
• Grafana Dashboard for Visualization
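One way to picture the custom accumulator/gauge idea, building on the ssc and events values from the previous sketch: count indexed documents with a Spark accumulator and expose the count through a Dropwizard gauge that a Graphite reporter could publish for Grafana. The metric names are assumptions, not the presenters' actual metrics.

import com.codahale.metrics.{Gauge, MetricRegistry}

val indexedDocs = ssc.sparkContext.accumulator(0L, "indexedDocs")

val metricRegistry = new MetricRegistry()
metricRegistry.register(MetricRegistry.name("nrt", "indexedDocs"), new Gauge[Long] {
  override def getValue: Long = indexedDocs.value   // expose the accumulator's current value
})

// update the counter once per micro-batch on the driver
events.foreachRDD { rdd => indexedDocs += rdd.count() }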
14. Tuning
• Scheduling delay = 0
• Partition RDDs effectively – in multiples of the number of Spark workers
• Coalesce over repartition
• spark.streaming.backpressure.enabled
• spark.shuffle.consolidateFiles (both settings are sketched below)
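Illustrative settings for the tuning points above; the values are assumptions, not the presenters' actual configuration.

import org.apache.spark.SparkConf

val tunedConf = new SparkConf()
  .setAppName("nrt-indexing")
  .set("spark.streaming.backpressure.enabled", "true")  // throttle receivers when batches start to lag
  .set("spark.shuffle.consolidateFiles", "true")        // fewer shuffle files (hash shuffle manager, older Spark releases)

// prefer coalesce over repartition to shrink partition counts without a full shuffle,
// e.g. down to a small multiple of the total worker cores:
// someRdd.coalesce(numWorkers * coresPerWorker)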
16. Lessons
Querying Cassandra
• Worst: filter on the Spark side
  sc.cassandraTable().filter(partitionKey in keys)
• Bad: filter on the C* side in a single operation
  sc.cassandraTable().where(keys in productIds)
  Similar to an "IN" query clause:
  SELECT * FROM my_keyspace.users WHERE id IN (1, 2, 3, 4)
• Best: filter on the C* side in a distributed and concurrent fashion (see the sketch below)
  kafkaRDD.joinWithCassandraTable()
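A minimal sketch of the "best" pattern, assuming the DataStax spark-cassandra-connector: key the incoming product ids by the table's partition key and join directly with the Cassandra table, so the lookups run distributed and concurrent on the executors instead of issuing one large IN query. The keyspace, table, and use of the product id as the partition key are assumptions.

import com.datastax.spark.connector._
import org.apache.spark.rdd.RDD

def lookupSignals(productIds: RDD[String]) =
  productIds
    .map(Tuple1(_))                                    // Tuple1 maps onto the single partition key column
    .joinWithCassandraTable("my_keyspace", "signals")  // per-partition, concurrent point reads on the C* side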
17. A little more about the IN clause
Multiple Requests: "IN" Clause Failure Scenario
(Image source: https://lostechies.com/ryansvihla/2014/09/22/cassandra-query-patterns-not-using-the-in-query-for-multiple-partitions/)