1. Scaling Machine Learning and
Statistics for Web Applications
Recommendations, Search, Advertising
Deepak Agarwal
dagarwal@linkedin.com
KDD, Aug 12th, 2015 @ Sydney, AUS
2. Our vision
Create economic opportunity for every
member of the global workforce
Our mission
Connect the world’s professionals to
make them more productive and
successful
Our core value: Members first!
6. Algorithmic Match-Making
via Machine Learning and Data Mining
• Jobs to apply
• Feed
• Articles to read
• Connections to nurture, keep in touch with
• Search
• PYMK (People You May Know)
7. Scale Algorithmic Match-Making
• Approach: continuously learn from historical data
– Machine learning/statistical models
– Run experiments (test new procedures and collect
data)
• Scalable and robust infrastructure
– Large scale batch computations
– Large scale near-line computations
– High throughput, low latency online computations
– Fast Retrieval Engines (Search)
9. Recommendations: Delivery Mechanisms
• Pull model: when the user visits, serve the most relevant items
– Desktop, mobile web, mobile app, iPad
• Push model: the user is not visiting, but we reach out with information (email, notifications)
– Higher relevance bar: right message, right user, right time, right frequency, right channel
Done through ML and optimization
10. Rest of the Talk
• Data and Problem Formulation
• Machine Learning Process
– Illustrated with Feed Application
• Lessons
11. WHAT INFORMATION DO WE HAVE ABOUT USERS AND ITEMS?
MATCH-MAKING: Know your items, your users and
their intent
12. User Characteristics
• Profile information (professional profile of record)
– Title, seniority, skills, education, endorsements, presentations, …
• Behavioral
– Activities, search, …
• Edge features (ego-centric network)
– Connection strength, content affinities, …
13. User Intent
• What are you here for?
– Hire, get hired, stay informed, grow network, nurture connections, sell, market, …
• Explicit (e.g., visiting the jobs homepage, issuing a search query)
• Implicit (needs to be inferred, e.g., from activities)
15. How to Scale Recommendations?
• Formulate objectives to optimize
• Optimize via ML models
– incorporate both implicit and explicit signals about user and intent
• Automate
16. Match-making: connecting long-term objectives to proxies that can be optimized by machines/algorithms
• Long-term objectives (return visits, advertising revenue, sign-ups, job applications, …)
• Formulate objectives and proxies (CTR, revenue/visit, multiple objectives, …)
• Large-scale optimization via ML, UI changes, …
• Engage, experiment, learn, deploy, innovate
17. Automation
Optimize proxies with a short feedback loop via machine learning
Input signals:
• Whom? User profile, user intent
• What? Item filtering and understanding
• Context
• Interaction data
Machine learning:
• SCORE items: P(click), P(share), similarity, …
• RANK items: sort by score, multi-objective optimization, business rules
19. The Feed: Heterogeneity of “types”
Network Updates (Nurture, keep in
touch)
• Job change
• Job anniversaries
• Connections
• Change Profile Picture
• …
Content with Explicit follow
• Articles by Influencers
• Shares by members in your
network
• Content in Channels followed
• Content by companies followed
• …
Recommendations & Ads
• Articles, PYMK, Endorsements
• Sponsored updates (Ads)
• Jobs
• …
20. The Feed: How to build a relevant/optimized feed?
• Independent vs Dependent
– Feed has dependent observations within and across types
• CTR decays when showing same type multiple times
• CTR decays when showing multiple things from the same connection
• …
• What is the metric we optimize for the Feed?
– Connections
– Revenue
– Clicks
– Likes, shares, comments
– Job applications
– …
21. Fundamental Problem: Response Prediction
Predict the probability that a user will respond to an item in a given context.
This provides a statistical framework for incorporating downstream utilities and other constraints when ranking items for users.
22. Response Prediction
• Click prediction
– Estimate P(click | user, item, context)
– Use it to calculate E[utility] = P(click) × utility (sketched below)
• Logistic regression for click prediction; learning to rank (LTR) for search-like problems
– Scalable
– Well understood
• Challenges
– Integrating feature data from multiple sources
– Scaling training on large data with many features
– Flexible and rapid experimentation
– Real time scoring
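To make the E[utility] ranking concrete, here is a minimal sketch (toy weights, features, and utility values; not the production system):

```python
import numpy as np

def p_click(w, x):
    """Click probability from a logistic regression model with weights w."""
    return 1.0 / (1.0 + np.exp(-np.dot(w, x)))

def expected_utility(w, x, utility):
    """E[utility] = P(click | user, item, context) * utility of a click."""
    return p_click(w, x) * utility

# Rank candidates by expected utility; features/utilities are toy values.
w = np.array([0.4, -0.2, 0.1])
candidates = [
    ("job_posting", np.array([1.0, 0.5, 2.0]), 5.0),   # (id, features, utility)
    ("news_article", np.array([0.2, 1.5, 0.3]), 1.0),
]
ranked = sorted(candidates, key=lambda c: expected_utility(w, c[1], c[2]), reverse=True)
print([item_id for item_id, _, _ in ranked])
```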
23. Response Prediction Models
• Three main sources of features
– User (e.g. industry, title)
– Item (e.g. keywords, LDA topics)
– Context (e.g. time, page)
• Interactions are important
– E.g., Industry-specific ads may only
appeal to people in that industry
• Features only get you so far
– Hard to beat Σclicks / Σviews for
items/users with a lot of data
– “Warmstart” terms incorporate
per-item/per-user information
Notation:
• $x_i$: features of user $i$; $y_j$: features of item $j$; $z_k$: features of context $k$
• $\alpha, \beta, \gamma, \ldots$: global coefficients; $\omega_j, \alpha_j, \ldots$: coefficients indexed by item (warm start); $A, B, \ldots$: interaction coefficient matrices
Model (cold start with interactions, plus warm start):
$$\log\frac{p}{1-p} \;=\; \underbrace{\omega + \alpha^\top x_i + \beta^\top y_j + \gamma^\top z_k + x_i^\top A\,y_j + z_k^\top B\,y_j}_{\text{cold start, with interaction terms } A,\,B} \;+\; \underbrace{\omega_j + \alpha_j^\top x_i}_{\text{warm start}}$$
Equivalently, defining the projections $h_j = A\,y_j$ and $t_k = B^\top z_k$ (precomputable per item and per context):
$$\log\frac{p}{1-p} \;=\; \omega + \alpha^\top x_i + \beta^\top y_j + \gamma^\top z_k + h_j^\top x_i + t_k^\top y_j + \omega_j + \alpha_j^\top x_i$$
since $x_i^\top (A\,y_j) = h_j^\top x_i$ and $z_k^\top B\,y_j = (B^\top z_k)^\top y_j = t_k^\top y_j$.
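A minimal scoring sketch of this model (parameter names and shapes are my assumptions, not the production implementation):

```python
import numpy as np

def logit(x_i, y_j, z_k, params):
    """Cold-start terms (global coefficients plus A/B interactions) and
    warm-start terms (per-item w_j, a_j) from the formula above."""
    cold = (params["w"]
            + params["a"] @ x_i           # user features
            + params["b"] @ y_j           # item features
            + params["g"] @ z_k           # context features
            + x_i @ params["A"] @ y_j     # user-item interaction
            + z_k @ params["B"] @ y_j)    # context-item interaction
    warm = params["w_j"] + params["a_j"] @ x_i   # per-item warm-start correction
    return cold + warm

def p_response(x_i, y_j, z_k, params):
    return 1.0 / (1.0 + np.exp(-logit(x_i, y_j, z_k, params)))
```

Note that $h_j = A\,y_j$ and $t_k = B^\top z_k$ can be precomputed per item and context, so online scoring reduces to a handful of dot products (which the runtime-scoring slide later exploits).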
24. Model Fitting at Scale
• Most fitting methods for LR are iterative
– Multiple passes through the data
– Poor fit with map/reduce
– We use Apache Spark
• We use Data Partitioning
– Split data into blocks
– Train separately
– Merge the parameters
• Alternating Direction Method of
Multipliers (ADMM)
– Proven to converge to the global optimum (for convex problems such as this one)
• Small tweaks can improve performance
– Optimized the starting values
– Learning rate decay
Notation:
• $\Theta$: global parameter estimate; $\Theta_r$: parameter estimate for partition $r$; $d_r$: data in partition $r$
• $L_r(\Theta_r; d_r)$: likelihood (loss) for partition $r$; $P(\Theta)$: regularization penalty; $u_r$: per-partition (scaled) dual values
Problem:
$$\min \;\sum_{r=1}^{R} L_r(\Theta_r; d_r) + P(\Theta) \qquad \text{subject to } \Theta_r - \Theta = 0 \text{ for } r = 1, \ldots, R$$
ADMM updates (per-partition step, then global step, then dual step):
$$\Theta_r^{(t+1)} = \arg\min_{\Theta_r}\; L_r(\Theta_r; d_r) + \frac{\rho}{2}\,\big\|\Theta_r - \Theta^{(t)} + u_r^{(t)}\big\|^2$$
$$\Theta^{(t+1)} = \arg\min_{\Theta}\; P(\Theta) + \frac{R\rho}{2}\,\big\|\Theta - \bar{\Theta}^{(t+1)} - \bar{u}^{(t)}\big\|^2$$
$$u_r^{(t+1)} = u_r^{(t)} + \Theta_r^{(t+1)} - \Theta^{(t+1)}$$
where $\bar{\Theta}^{(t+1)}$ and $\bar{u}^{(t)}$ are averages over the $R$ partitions.
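A single-machine sketch of these updates for L2-regularized logistic regression, using SciPy's L-BFGS as the inner solver (illustrative only; in production the per-partition step would run as parallel tasks, e.g. on Spark):

```python
import numpy as np
from scipy.optimize import minimize

def local_loss(theta, X, y):
    """Negative log-likelihood of logistic regression on one partition (y in {-1, +1})."""
    return np.sum(np.logaddexp(0.0, -y * (X @ theta)))

def admm_logreg(partitions, dim, rho=1.0, lam=0.1, iters=20):
    """Consensus ADMM: train per-partition models, merge, repeat.
    partitions: list of (X_r, y_r); lam: L2 penalty P(theta) = lam/2 * ||theta||^2."""
    R = len(partitions)
    theta = np.zeros(dim)               # global estimate
    theta_r = np.zeros((R, dim))        # per-partition estimates
    u_r = np.zeros((R, dim))            # per-partition scaled duals
    for _ in range(iters):
        # Per-partition step (each r could run in parallel, e.g. as a Spark task).
        for r, (X, y) in enumerate(partitions):
            obj = lambda t: local_loss(t, X, y) + 0.5 * rho * np.sum((t - theta + u_r[r]) ** 2)
            theta_r[r] = minimize(obj, theta_r[r], method="L-BFGS-B").x
        # Global step: closed form for the L2 penalty.
        avg = (theta_r + u_r).mean(axis=0)
        theta = (R * rho) * avg / (lam + R * rho)
        # Dual step.
        u_r += theta_r - theta
    return theta

# Toy usage: 4 partitions of synthetic data.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5)); w_true = rng.normal(size=5)
y = np.where(X @ w_true > 0, 1, -1)
parts = [(X[i::4], y[i::4]) for i in range(4)]
print(admm_logreg(parts, dim=5))
```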
25. Flexible Configuration
• Feature engineering
– Every problem is different
– Lots of trial and error
– Faster, easier feature engineering
translates to gains sooner
• JSON-based config language
– Sources: import features from outside
– Transformers: apply functions to
feature vectors
– Assembler: packages feature terms
for fitting/scoring
• Rapid development
– No code for most changes
– Offline and Online in sync
[Diagram: User, Context, and Item sources feed Subset and Interaction transformers into the Assembler; the same (request, user, item) pipeline is used for both training and scoring. An illustrative config sketch follows.]
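The talk does not show the actual config schema, so the following is a hypothetical illustration of the sources/transformers/assembler structure (every field name here is invented):

```python
# Hypothetical JSON-style feature config, expressed as a Python dict.
feature_config = {
    "sources": [                       # import features from outside
        {"name": "userFeatures", "path": "/features/user"},
        {"name": "itemFeatures", "path": "/features/item"},
        {"name": "contextFeatures", "path": "/features/context"},
    ],
    "transformers": [                  # apply functions to feature vectors
        {"name": "userSubset", "input": "userFeatures", "keep": ["title", "industry"]},
        {"name": "userItemInteraction", "inputs": ["userSubset", "itemFeatures"],
         "op": "crossProduct"},
    ],
    "assembler": {                     # package feature terms for fitting/scoring
        "terms": ["userFeatures", "itemFeatures", "contextFeatures",
                  "userItemInteraction"],
    },
}
```

Because one declarative config drives both offline training and online scoring, most feature changes need no code and the two environments cannot drift apart.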
26. Runtime scoring optimizations
• Real time performance
– About 10µs per inference (1500 items = 15ms)
– Reacts to changing features immediately
• “Better wrong than late”
– If a feature isn’t immediately available, back off to prior value
• Asynchronous computation
– Actions that block or take time run in background threads
• Lazy evaluation
– Sources & transformers do not create feature vectors for all items
– Feature vectors are constructed/transformed only when needed
• Partial results cache (sketched after the plot below)
– Logistic regression scoring is a series of dot products
– Scalars are small; the cache can be huge
– Hardware-like implementation to minimize locking and heap pressure
[Plot: time per request (ms) over a 60-minute window]
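A toy sketch of the partial-results cache idea (hypothetical class; the real implementation is described only as hardware-like and lock-minimizing): the item-only part of each dot product is computed once and reused across requests, so online work is just the user/context-dependent terms.

```python
import numpy as np

class PartialScoreCache:
    """Cache the item-only dot product b'y_j once; add per-request terms online."""
    def __init__(self, b, items):
        # items: {item_id: item feature vector y_j}
        self._cache = {item_id: float(b @ y_j) for item_id, y_j in items.items()}

    def score(self, item_id, user_term, context_term):
        return self._cache[item_id] + user_term + context_term

# Item partials computed once, reused for every request that scores these items.
b = np.array([0.3, -0.1])
items = {"item1": np.array([1.0, 2.0]), "item2": np.array([0.5, 0.5])}
cache = PartialScoreCache(b, items)
print(cache.score("item1", user_term=0.4, context_term=-0.05))
```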
27. Beyond Response Prediction
• Explore/exploit
• Impression discounting
– Don’t show same/similar stuff many times to the user
• Diversification
• Multi-objective optimization
28. Explore/Exploit: How to Score a New Item?
• Predict the click rate for new item
• Cold-start problem
– No data to estimate the warm start for a newly added item
• Solution: Controlled and economical experiments
– Explore (experiment): Collect data by promoting new item
to a small random sample of users
– Exploit: Update warm start based on collected data
– Automate explore/exploit
• Only experiment when we can get bang for the buck (i.e., when the potential gain is high)
29. Explore/Exploit Problem
Simplified setting: Two items
[Plot: probability density of CTR; Item A's density is sharply peaked, Item B's is diffuse]
We know the CTR of Item A (say, shown 1 million times).
We are uncertain about the CTR of Item B (shown only 100 times).
If we only make a single decision, give 100% of page views to Item A.
If we make multiple decisions in the future, explore Item B, since its CTR can potentially be higher.
Let the CTR of Item A be $q$ (known) and the CTR of Item B be $p$, with probability density function $f(p)$. Then
$$\text{Potential of Item B} \;=\; \int_{q}^{1} (p - q)\, f(p)\, dp$$
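A numerical sketch of this quantity (the integral is from the slide; modeling Item B's CTR with a Beta posterior is my added assumption):

```python
from scipy import integrate
from scipy.stats import beta

def potential(q, clicks, views, prior=(1.0, 1.0)):
    """Potential of Item B over Item A (known CTR q): integral over p > q of
    (p - q) f(p) dp, where f is a Beta posterior over Item B's CTR."""
    f = beta(prior[0] + clicks, prior[1] + views - clicks)
    value, _ = integrate.quad(lambda p: (p - q) * f.pdf(p), q, 1.0)
    return value

# Item A's CTR is known to be 0.05; Item B has 3 clicks in 100 views.
print(potential(q=0.05, clicks=3, views=100))
```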
31. Impression Discounting
• Reduce the chance of
showing the same item to
the same user repeatedly
• Decay the score of an item
based on #times that the
user saw the item before
• Using real-time feedback
• Discounting by user
segments and item types
[Plot: impression discounting curves, global (over all types) and for a few individual item types]
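A minimal sketch, assuming an exponential decay form (the slide specifies only that the score is decayed in the number of prior impressions, with curves fit per user segment and item type):

```python
import math

def discounted_score(score, impressions, decay=0.3):
    """Decay an item's score in the number of times the user already saw it.
    The decay rate here is illustrative; in practice it would be fit from data."""
    return score * math.exp(-decay * impressions)

# The same item loses rank as the user keeps seeing it without acting.
for n in range(4):
    print(n, round(discounted_score(1.0, n), 3))
```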
32. Diversification
• Users’ experience deteriorates when exposed to the same kind of items multiple times on the same page
• Discounting actor repetitions

Group discussions             CTR drop
2 adjacent discussions        21%
3 adjacent discussions        48%
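One simple way to act on these numbers, as a hedged sketch (greedy re-ranking with a per-type penalty; the penalty form and value are illustrative, not LinkedIn's):

```python
def diversify(ranked, penalty=0.2):
    """Greedily re-rank, penalizing each item by how many items of its type
    are already placed; this discourages adjacent repeats of the same type."""
    placed, result = {}, []
    remaining = list(ranked)            # [(item_id, item_type, score), ...]
    while remaining:
        best = max(remaining, key=lambda it: it[2] - penalty * placed.get(it[1], 0))
        result.append(best)
        remaining.remove(best)
        placed[best[1]] = placed.get(best[1], 0) + 1
    return result

feed = [("d1", "discussion", 0.9), ("d2", "discussion", 0.85), ("j1", "job", 0.8)]
print([item_id for item_id, _, _ in diversify(feed)])   # d1, j1, d2
```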
33. Multi-Objective Optimization
• E.g., maximize advertising revenue s.t. CTR ≥ (1 − ε) × maximum achievable CTR (formalized below)
– Solve via the dual after making the problem strongly convex; the dual also yields a serving scheme for new users
• Obtain Pareto-optimal solutions (the efficient frontier)
[Plot: efficient frontier in the (CTR, revenue) plane; ε marks the allowed CTR loss separating the feasible region from the impossible one]
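One plausible way to write this program formally (the notation below is mine, not from the talk): let $x_{ui}$ be the probability of serving item $i$ to user $u$, with expected revenue $r_{ui}$ and expected click rate $c_{ui}$:

$$\max_{x \,\ge\, 0}\; \sum_{u,i} x_{ui}\, r_{ui} \qquad \text{s.t.}\quad \sum_{u,i} x_{ui}\, c_{ui} \;\ge\; (1-\varepsilon)\,\max_{x'} \sum_{u,i} x'_{ui}\, c_{ui}, \qquad \sum_i x_{ui} = 1 \;\;\forall u$$

Adding a small strongly convex term (e.g., $\tfrac{\gamma}{2}\|x\|^2$) makes the dual well behaved; solving in the dual then gives per-item prices from which a serving scheme can be computed even for users not seen at training time.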
34. Putting it all together
• Federated Model
– Tier 1: Local models for update types
– Tier 2: Calibrate local models and add more
features to do holistic personalization
• Re-rank by applying diversification, impression
discounting, multi-objective optimization
• Separate teams for Tier 1 and Tier 2
35. Scaling Model Building
Pipeline: tracking data → pre-processing → feature generation → model training (extremely computationally intensive) → offline model evaluation → online experiment (only if offline results are good)
▪ Research and Development: flexible, easy-to-use software environment for creating models (offline compatible with online)
▪ Maintenance: models in production trained continuously and automatically; proper monitoring and testing for reliable workflows; config-based model deployment; A/B testing platform
▪ Stack of Models: from simple baselines to more sophisticated models
▪ Feature Management: features easy to discover and share across applications
36. Going beyond ML, Statistics
• Dogfooding our own products
– Look at the employee experience
– Debug model scores if something does not look right
• Focus-group studies, user surveys
• Product strategy and intuition
– E.g., remove/add certain content types
– Add constraints like freshness bounds, etc.
• Presentation
– Test different UIs, presentation templates, fonts, etc.
37. Summary
• Formulating objectives is important (not easy)
• Machine Learning is slow and fragile
– Model training is not the only bottleneck
• Data pre-processing, feature management, near-line
computations and online scoring all important
• Need an A/B testing platform and fast retrieval systems
• Need an end-to-end framework that can make
it easy for modelers to test rapidly
– Data Miners should be closely involved