This document summarizes Patrick Pletscher's presentation on training large-scale ad ranking models in Apache Spark. It discusses using Spark to implement logistic regression for click-through rate prediction on billions of daily ad impressions at Yahoo. Key points include joining impression and click data, implementing an incremental learning architecture in Spark, using feature hashing and online learning algorithms such as follow-the-regularized-leader (FTRL) for model training, and lessons learned around Spark configuration, accumulators, and RDDs vs. DataFrames.
1. Training Large-scale Ad Ranking Models in Spark
PRESENTED BY Patrick Pletscher October 19, 2015
2. About Us
Michal Aharon Oren Somekh Yaacov Fernandess Yair Koren
Amit Kagian Shahar Golan Raz Nissim Patrick Pletscher
Amir Ingber
Haifa
Collaborator
4. Ad Ranking Overview
• Advertisers run several campaigns, each with several ads
• Each ad has a bid set by the advertiser; different ad price types
- pay per view
- pay per click
- various conversion price types
• Auction for each impression on a Gemini Native enabled property
- auction between all eligible ads (filtered by targeting/budget)
- ad with the highest expected revenue wins
• Need to know the (personalized!) probability of a click
- we mostly get paid for clicks / conversions!
(Diagram: two ads competing for a user. A $1 bid with a 5% click probability yields 5c expected revenue; a $2 bid with 1% yields 2c.)
5. Click-Through Rate (CTR) Prediction
• Given a user and context, predict probability of a click for an ad.
• Probably the most "profitable" machine learning problem in industry
- simple binary problem; but want probabilities, not just the label
- very skewed label distribution: clicks << skips
- tons of data (every impression generates a training example)
- limitations at serving: need to predict quickly
• Basic setting quite well-studied; scale makes it challenging
- Google (McMahan et al. 2013)
- Facebook (He et al. 2014)
- Yahoo (Aharon et al. 2013)
- others (Chapelle et al. 2014)
• Some more involved research topics
- Exploration/Exploitation tradeoff
- Learning from logged feedback
6. Overview - CTR Prediction for Gemini Native Ads
• Collaborative Filtering approach (Aharon et al. 2013)
- Current production system
- Implemented in Hadoop MapReduce
- Used in Gemini Native ad ranking
• Large-scale Logistic Regression
- A research prototype
- Implemented in Spark
- The combination of Spark & Scala allows us to iterate quickly
- Takes several concepts from the CF approach
8. Apache Spark
• "Apache Spark is a fast and general engine for large-scale data processing"
• Similar to Hadoop
• Advantages over Hadoop MapReduce
- Option to cache data in memory, great for iterative computations
- A lot of syntactic sugar
‣ filter, reduceByKey, distinct, sortByKey, join
‣ in general Spark/Scala code is very concise
- Spark Shell, great for interactive/ETL* workflows
- DataFrames interesting for data scientists coming from R / Python
• Includes modules for
- machine learning
- streaming
- graph computations
- SQL / DataFrames
*ETL: Extract, transform, load
9. Spark at Yahoo
• Spark 1.5.1, the latest version of Spark
• Runs on top of Hadoop YARN 2.6
- integrates nicely with existing Hadoop tools and infrastructure at Yahoo
- data is generally stored in HDFS
• Clusters are centrally managed
• Large Hadoop deployment at Yahoo
- A few different clusters
- Each has at least a few thousand nodes
(Diagram: Spark, MapReduce, and Hive running on YARN for resource management, on top of HDFS for storage.)
10. Dataset for CTR Prediction
• Billions of ad impressions daily
- Need for Streaming / Batched Streaming
- Each impression has a unique id
• Need click information for every impression for learning
- Join impressions with a click stream every x minutes
- Need to wait for the click; introduces some delay
(Timeline: impression and click streams arrive in 15-minute windows (18:30, 18:45, 19:00, 19:15); each window's impressions are joined with the click stream into labeled events. In Spark: union & reduceByKey.)
11. Example - Joining Impression & Click RDDs
val keyAndImpressions = impressions
  .map(e => (e.joinKey, ("i", e)))
val keyAndClicks = clicks
  .map(e => (e.joinKey, ("c", e)))

keyAndImpressions.union(keyAndClicks)
  .reduceByKey(smartCombine)
  .flatMap { case (k, (t, event)) => t match {
    case "ci" => Some(LabeledEvent(event, clicked = 1))
    case "i"  => Some(LabeledEvent(event, clicked = 0))
    case "c"  => None
  }}

def smartCombine(event1: (String, Event), event2: (String, Event)): (String, Event) = {
  (event1._1, event2._1) match {
    case ("c", "c") => event1              // de-dupe
    case ("i", "i") => event1              // de-dupe
    case ("c", "i") => ("ci", event2._2)   // combine click and impression
    case ("i", "c") => ("ci", event1._2)   // combine click and impression
    case ("ci", _)  => event1              // de-dupe
    case (_, "ci")  => event2              // de-dupe
  }
}
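The combine logic can be exercised without a cluster. A minimal sketch, assuming a stripped-down hypothetical Event case class (the real event carries the full impression payload):

```scala
// Hypothetical minimal event; just enough to demonstrate the combine semantics.
case class Event(joinKey: String, payload: String)

def smartCombine(e1: (String, Event), e2: (String, Event)): (String, Event) =
  (e1._1, e2._1) match {
    case ("c", "c") => e1                // duplicate clicks: keep one
    case ("i", "i") => e1                // duplicate impressions: keep one
    case ("c", "i") => ("ci", e2._2)     // keep the impression payload
    case ("i", "c") => ("ci", e1._2)
    case ("ci", _)  => e1                // already combined: keep it
    case (_, "ci")  => e2
  }

val imp   = ("i", Event("id42", "impression data"))
val click = ("c", Event("id42", "click data"))

// Arrival order does not matter: either way the result is a clicked impression
// that still carries the impression payload.
assert(smartCombine(imp, click)._1 == "ci")
assert(smartCombine(click, imp)._1 == "ci")
assert(smartCombine(imp, click)._2.payload == "impression data")
```

Because smartCombine is commutative and idempotent in this sense, it is safe to use as the function passed to reduceByKey, which may apply it in any order.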
13. Large-scale Logistic Regression
• Industry standard for CTR prediction (McMahan et al. 2013, He et al. 2014)
• Models the probability of a click as a logistic function of a linear score
- feature vector
‣ high-dimensional vector but sparse (few non-zero values)
‣ model expressivity controlled by the features
‣ a lot of hand-tuning and playing around
- model parameters
‣ need to be learned
‣ generally rather non-sparse
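The model on the slide is the standard logistic function over a weighted feature vector; a minimal sketch of the prediction step, with the sparse feature vector represented as (index, value) pairs:

```scala
// Logistic regression CTR model: p(click | x) = 1 / (1 + exp(-w . x)).
def sigmoid(z: Double): Double = 1.0 / (1.0 + math.exp(-z))

// Sparse dot product: only the few non-zero features contribute to the score.
def predictCtr(weights: Array[Double], features: Seq[(Int, Double)]): Double =
  sigmoid(features.map { case (i, v) => weights(i) * v }.sum)

val w = Array(0.0, 1.2, -0.4)
predictCtr(w, Seq((1, 1.0)))            // sigmoid(1.2), roughly 0.77
predictCtr(w, Seq((1, 1.0), (2, 1.0)))  // sigmoid(0.8), roughly 0.69
```

The weight array is dense (as the slide notes, the learned parameters are generally non-sparse), while each example touches only a handful of indices.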
14. Features for Logistic Regression
• Basic features
- age, gender
- browser, device
• Feature crosses
- E.g. age x gender x state (30 year old male from Boston)
- mostly indicator features
- Examples:
‣ gender^age m^30
‣ gender^device m^Windows_NT
‣ gender^section m^5417810
‣ gender^state m^2347579
‣ age^device 30^Windows_NT
• Feature hashing to get a vector of fixed length
- hash all the index tuples, e.g. (gender^age, m^30), to get a numeric index
- will introduce collisions! Choose dimensionality large enough
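The hashing step can be sketched in a few lines. MurmurHash3 and the 2^22 dimensionality here are illustrative choices, not necessarily what the production system uses:

```scala
import scala.util.hashing.MurmurHash3

val numBits = 22
val dim = 1 << numBits  // 4,194,304 weights

// Map an index tuple such as (gender^age, m^30) to a fixed numeric index.
def hashFeature(name: String, value: String): Int = {
  val h = MurmurHash3.stringHash(name + "^" + value)
  h & (dim - 1)  // keep the lower bits: a non-negative index in [0, dim)
}

val idx = hashFeature("gender^age", "m^30")
// Deterministic, so training and serving agree on the index; distinct
// tuples may still collide, which is why dim must be chosen large enough.
```

Masking to the lower bits requires a power-of-two dimensionality; with an arbitrary dim one would take a non-negative modulo instead.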
15. Parameter Estimation
• Basic Problem: Regularized Maximum Likelihood
(objective: a log-likelihood term that fits the training data, plus a regularization term that prevents overfitting)
- Often: L1 regularization instead of L2
‣ promotes sparsity in the weight vector
‣ more efficient predictions in serving (also requires less memory!)
- Batch vs. streaming
‣ in our case: batched streaming, every x minutes perform an incremental model update
• Follow-the-regularized-leader (McMahan et al. 2013)
- sequential online algorithm: only use a data point once
- similar to stochastic gradient descent
- per-coordinate learning rates
- encourages sparseness
- FTRL stores weight and accumulated gradient per coordinate
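A single-machine sketch of the FTRL-Proximal update from McMahan et al. 2013, specialized to indicator features; the hyperparameter values are illustrative, not the presenters' settings:

```scala
// FTRL-Proximal: per coordinate, store accumulated adjusted gradients (z)
// and accumulated squared gradients (n); weights are derived lazily from them.
class Ftrl(dim: Int, alpha: Double = 0.1, beta: Double = 1.0,
           l1: Double = 0.0, l2: Double = 0.0) {
  val z = Array.fill(dim)(0.0)
  val n = Array.fill(dim)(0.0)

  // L1 clamps rarely-useful coordinates to exactly zero (sparse serving model).
  def weight(i: Int): Double =
    if (math.abs(z(i)) <= l1) 0.0
    else -(z(i) - math.signum(z(i)) * l1) / ((beta + math.sqrt(n(i))) / alpha + l2)

  def predict(idx: Seq[Int]): Double = {
    val dot = idx.map(weight).sum
    1.0 / (1.0 + math.exp(-dot))
  }

  // One online update; each example is used exactly once.
  def update(idx: Seq[Int], label: Double): Unit = {
    val p = predict(idx)
    for (i <- idx) {
      val g = p - label  // logistic-loss gradient for an indicator feature
      val sigma = (math.sqrt(n(i) + g * g) - math.sqrt(n(i))) / alpha
      z(i) += g - sigma * weight(i)
      n(i) += g * g
    }
  }
}
```

The per-coordinate learning rate falls out of n: frequently updated coordinates have large accumulated squared gradients and therefore take smaller steps.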
16. Basic Parallelized FTRL in Spark
def train(examples: RDD[LearningExample]): Unit = {
  val delta = examples
    .repartition(numWorkers)
    .mapPartitions(xs => updatePartition(xs, weights, counts))
    .treeReduce { case (a, b) => (a._1 + b._1, a._2 + b._2) }

  weights += delta._1 / numWorkers.toDouble
  counts  += delta._2 / numWorkers.toDouble
}

def updatePartition(examples: Iterator[LearningExample],
                    weights: DenseVector[Double],
                    counts: DenseVector[Double]):
    Iterator[(DenseVector[Double], DenseVector[Double])] = {
  // standard FTRL code for examples
  // hack: actually a single result, but mapPartitions expects an iterator!
  Iterator((deltaWeights, deltaCounts))
}
17. Summary: LR with Spark
• Efficient: Can learn on all the data
- before: somewhat aggressive subsampling of the skips
• Possible to do feature pre-processing
- in Hadoop MapReduce much harder: only one pass over data
- drop infrequent features, TF-IDF, …
• Spark-shell as a life-saver
- helps to debug problems as one can inspect intermediate results at scale
- have yet to try Zeppelin notebooks
• Easy to unit test complex workflows
19. Upgrade!
• Spark has a pretty regular 3-month release schedule
• Always run with the latest version
- Lots of bugs get fixed
- Difficult to keep up with new functionality (see DataFrame vs. RDD)
• Speed improvements over the past year
20. Configurations
• Our solution
- config directory containing
‣ Logging: log4j.properties
‣ Spark itself: spark-defaults.conf
‣ our code: application.conf
- two versions of configs: local & cluster
- in YARN: specify them using the --files argument & the SPARK_CONF_DIR variable
• Use Typesafe's config library for all application-related configs
- provide sensible defaults for everything
- override using application.conf
• Do not hard-code any configurations in code
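A sketch of how the defaults/overrides split looks with the Typesafe config library; the keys below are hypothetical, shown only to illustrate the pattern:

```hocon
# application.conf: environment-specific overrides (hypothetical keys).
# Defaults for every setting live in reference.conf, bundled with the jar;
# ConfigFactory.load() merges the two, with application.conf winning.
training {
  hash-bits = 22
  ftrl {
    alpha = 0.1
    l1    = 1.0
  }
}
```

In code, `ConfigFactory.load().getInt("training.hash-bits")` then reads the merged value, so nothing needs to be hard-coded.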
21. Accumulators
• Use accumulators for ensuring correctness!
• Example:
- parse data, ignore event if there is a problem with the data
- use an accumulator to count these failed lines
class Parser(failedLinesAccumulator: Accumulator[Int]) extends Serializable {
  def parse(s: String): Option[Event] = {
    try {
      // parsing logic goes here
      Some(...)
    }
    catch {
      case e: Exception =>
        failedLinesAccumulator += 1
        None
    }
  }
}

val accumulator = sc.accumulator(0, "failed lines")
val parser = new Parser(accumulator)
val events = sc.textFile("hdfs:///myfile")
  .flatMap(s => parser.parse(s))
22. RDD vs. DataFrame in Spark
• Initially Spark advocated the Resilient Distributed Dataset (RDD) as the data set abstraction
- type-safe
- usually stores some Scala case class
- code relatively easy to understand
• Recently Spark is pushing towards using DataFrame
- similar to R and Python's Pandas data frames
- some advantages
‣ less rigid types: can append columns
‣ speed
- disadvantage: code readability suffers for non-basic types
‣ user-defined types
‣ user-defined functions
• Have not fully migrated to it yet
23. Every Day I'm Shuffling…
• Careful with operations which send a lot of data over the network
- reduceByKey
- repartition / shuffle
• Careful with sending too much data to the driver
- collect
- reduce
• found mapPartitions & treeReduce useful in some cases (see FTRL example)
• play with Spark configurations: frameSize, maxResultSize, timeouts, …
(Diagram: a lineage of textFile, flatMap, map, reduceByKey, with the shuffle happening at reduceByKey.)
24. Machine Learning in Spark
• Relatively basic
- some algorithms don't scale so well
- not customizable enough for experts:
‣ optimizers that assume a regularizer
‣ built our own DSL for feature extraction & combination
‣ a lot of the APIs are not exposed, i.e. private to Spark
- will hopefully get there eventually
• Nice: new Transformer / Estimator / Pipeline approach
- Inspired by scikit-learn; makes it easy to combine different algorithms
- Requires DataFrame
- Example (from Spark docs):
val tokenizer = new Tokenizer()
.setInputCol("text")
.setOutputCol("words")
val hashingTF = new HashingTF()
.setNumFeatures(1000)
.setInputCol(tokenizer.getOutputCol)
.setOutputCol("features")
val lr = new LogisticRegression()
.setMaxIter(10)
.setRegParam(0.01)
val pipeline = new Pipeline()
.setStages(Array(tokenizer, hashingTF, lr))
val model = pipeline.fit(training)