With content now viewed seamlessly across multiple screens, this shift in consumer behavior has collided with the way advertising is sold, in separate TV and online silos, creating an opportunity to make advertising more effective using data and machine learning. This talk explores technological developments at VideoAmp that bring together data from disparate mediums and create cross-screen audience models: ML methods for cross-screen bid optimization, graph-based audience models covering 150 million users across over a billion unique device IDs, and behavioral insights gleaned from observing such a large variety of data.
Big Data Day LA 2016 / Data Science Track - Enabling Cross-Screen Advertising with Machine Learning and Spark - Debajyoti (Deb) Ray, CDO, VideoAmp
5. Bridging the Gap
VideoAmp’s goal is to enable advertisers and content creators to
transact seamlessly across all media types.
• Frequency capping for target consumers.
• TV media extension to desktop / mobile campaigns.
• Competitive conquesting.
6. Consumer Graph
How Big is the Graph?
[Diagram: an example consumer graph linking one consumer's devices - an in-app phone ID (IDFA), Safari on a phone (UID 1), Firefox at home (UID 2), Chrome at home (UID 3), and Firefox at work (UID 4) - connected via location and login signals.]
• 1.5B+ unique cookie IDs, Device IDs.
• 150M+ nodes.
• Behavioral data from each ID (several TBs / day).
7. Video Ads : from Request to Delivery
Figure 1
Step 2: The publisher Yahoo! passes the information to the ad exchange, say, Google DoubleClick AdX, including the URL where the ad slot is located, the vertical of the web page content such as sports, and the user cookie ID.
Step 3: The ad exchange AdX composes a bid request and sends the bid requests to several DSPs. Let’s assume the DSP iPinYou is one of them.
Step 4: When the iPinYou DSP server receives the bid request from the ad exchange AdX, it passes the information …
[Diagram: real-time bidding flow across the Ad Server, the Ad Exchange, and the DSP's Bid Listener, Decision Engine, and User Data Storage:
1. User visits the page.
2. Publisher calls the ad exchange.
3. Bid request sent to the DSP.
4. User ID and IP passed to the bid listener.
5. User ID looked up.
6. User data returned.
7. Bid price computed by the decision engine.
8. Bid CPM and ad tag returned to the exchange.
9. Auction winner's ad tag and second-price CPM sent back.
10. Page calls the winner's ad tag.
11. Ad server serves the ad.
12. Browser displays the ad.
The decision engine has ~20 ms to calculate; the whole process takes ~100 ms.]
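The exchange side of steps 7 through 9 can be sketched in a few lines. This is a minimal illustration of a second-price auction, not VideoAmp's actual system; the DSP names and CPM values are invented.

```python
# Toy sketch of the exchange resolving bids from several DSPs:
# the highest bidder wins but pays the second-highest CPM.

def run_second_price_auction(bids):
    """bids: list of {"dsp": name, "cpm": bid price}. Returns the
    winning DSP and the clearing (second) price."""
    if len(bids) < 2:
        raise ValueError("need at least two bids")
    ranked = sorted(bids, key=lambda b: b["cpm"], reverse=True)
    winner, runner_up = ranked[0], ranked[1]
    return {"winner": winner["dsp"], "clearing_cpm": runner_up["cpm"]}

bids = [
    {"dsp": "dsp_a", "cpm": 4.10},
    {"dsp": "dsp_b", "cpm": 5.25},
    {"dsp": "dsp_c", "cpm": 3.80},
]
result = run_second_price_auction(bids)
# dsp_b wins, but pays dsp_a's 4.10 CPM
```

The second-price rule is what makes step 9 return a "2nd price CPM" rather than the winner's own bid.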
8. The Right Tool : Apache Spark
Apache Spark is a distributed computing framework
that came out of AMPLab at UC Berkeley.
Key innovation is the Resilient Distributed Dataset (RDD):
a logical collection of data partitioned across machines.
[Figure 2: Spark runtime. The user’s driver program launches multiple workers, which read data blocks from a distributed file system and can persist computed RDD partitions in memory.]
The runtime can schedule tasks based on data locality to improve performance. Second, RDDs degrade gracefully when there is not enough memory to store them, as long as they are only being used in scan-based operations. Partitions that do not fit in RAM can be stored on disk and will provide similar performance to current data-parallel systems.
2.4 Applications Not Suitable for RDDs
As discussed in the Introduction, RDDs are best suited for batch applications that apply the same operation to all elements of a dataset. In these cases, RDDs can efficiently remember each transformation as one step in a lineage graph and can recover lost partitions without having to log large amounts of data.
Spark offers APIs in Scala and Python.
In our stack, Spark runs on Hadoop; data is stored in HDFS / Parquet on a distributed file system.
In some applications involving iterative calls, Spark is up to 100x faster than MapReduce.
9. Spark: Graph Frames
GraphFrames is a graph processing library (similar to GraphX)
- Scala, Python, Java APIs.
- Query on graphs (like Spark SQL):
> g.vertices.filter("age > 25")
> g.inDegrees.filter("inDegree > 2")
- Supports all algorithms in GraphX, and also:
Breadth-first search (BFS) - shortest path between 2 vertices.
(Strongly) connected components
Label propagation algorithm
10. VideoAmp Flint
We open-sourced Flint: creating push-button Spark clusters
for Machine Learning and Data Science in the cloud.
Designed for rapid deployment while providing native access to
data in a pre-existing HDFS / Hive cluster.
- Flint: a Spark Cluster Launcher (on AWS)
- Self-contained Spark Docker images.
- Jupyter Docker image preloaded with Python, R, Scala kernels.
Users can expand or contract the cluster on the fly.
12. Data from Devices
Data from TVs (ACR): TV ID generates TV program viewership. 10M Smart TVs / STBs; data in 15-minute chunks.
Mobile devices: Device ID generates sites, video content, segments. 50K QPS over 300M device IDs.
Desktop: Cookie ID generates sites, video content, segments. 100K QPS over 1B cookie IDs.
13. Sparse Representation
For each class of consumption data, create a dictionary with an enumeration of all content (e.g. TMS IDs) or types.
e.g. demographic segments:
Income = [ <30K, 30K to 60K, 60K to 90K, 90K to 120K, 120K+ ]
e.g. TV programs watched:
TV_Programs = ["Walking Dead", "Game of Thrones", …, "Silicon Valley"]
Then the user data is sparse:
Income (User ABC123) = [0,0,0,1,0]
TV_Programs (User ABC123) = [0,1,…,1]
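The dictionary-plus-sparse-vector scheme above can be sketched directly. The bucket and program lists mirror the slide's examples; the observed user data is invented for illustration.

```python
# Each class of consumption data gets a dictionary (a fixed enumeration);
# a user's data is then a 0/1 vector over that enumeration.

INCOME_BUCKETS = ["<30K", "30K-60K", "60K-90K", "90K-120K", "120K+"]
TV_PROGRAMS = ["Walking Dead", "Game of Thrones", "Silicon Valley"]

def one_hot(dictionary, observed):
    """Return a 0/1 vector over the dictionary's enumeration."""
    observed = set(observed)
    return [1 if item in observed else 0 for item in dictionary]

income_vec = one_hot(INCOME_BUCKETS, ["90K-120K"])
programs_vec = one_hot(TV_PROGRAMS, ["Game of Thrones", "Silicon Valley"])
# income_vec   -> [0, 0, 0, 1, 0]
# programs_vec -> [0, 1, 1]
```

In production these vectors would be stored in a sparse format (indices of the nonzero entries) rather than as dense lists, since the dictionaries run to millions of entries.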
15. Connected Components
Subgraphs in the graph s.t.
there is a path between any
two vertices.
Start with a node s and do BFS; this gives one component of the graph.
At each stage, pick an unexplored node n and do BFS; this finds another component.
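The repeated-BFS procedure above is straightforward to sketch on a toy adjacency list; the device IDs and edges here are invented, and a real graph of this scale would use a distributed algorithm (e.g. GraphFrames' connected components) rather than a single-machine loop.

```python
from collections import deque

def connected_components(graph):
    """Find connected components: BFS from each not-yet-explored node,
    each BFS collecting exactly one component."""
    seen, components = set(), []
    for start in graph:
        if start in seen:
            continue
        queue, component = deque([start]), set()
        seen.add(start)
        while queue:
            node = queue.popleft()
            component.add(node)
            for neighbor in graph[node]:
                if neighbor not in seen:
                    seen.add(neighbor)
                    queue.append(neighbor)
        components.append(component)
    return components

graph = {  # undirected edges between device / cookie IDs
    "idfa_1": ["cookie_1"],
    "cookie_1": ["idfa_1", "cookie_2"],
    "cookie_2": ["cookie_1"],
    "cookie_3": ["cookie_4"],
    "cookie_4": ["cookie_3"],
}
comps = connected_components(graph)
# two components: {idfa_1, cookie_1, cookie_2} and {cookie_3, cookie_4}
```

Each component is then a candidate "consumer": a set of device and cookie IDs believed to belong to one person or household.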
16. Clustering
Example with only location (lat / long attributes).
We utilize location, IP address, types (segments), and
behaviors (websites visited, TV programs viewed).
This is clustering in a very high-dimensional space with
sparse vectors.
17. Graph Inference
Find all Users similar to User A.
Fill in Missing Attributes. What is User B’s income level?
Which users will like Brain Dead (new show)?
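One way to fill in a missing attribute such as User B's income is nearest-neighbor inference: find the most similar users and take a majority vote. This is a hedged toy sketch, not the talk's actual method; the similarity measure (cosine over the 0/1 viewing vectors described earlier) and all user data are illustrative assumptions.

```python
# Toy attribute inference: impute a user's missing income from the
# incomes of their most similar neighbors (cosine similarity on
# sparse 0/1 behavior vectors).

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = (sum(a * a for a in u) * sum(b * b for b in v)) ** 0.5
    return dot / norm if norm else 0.0

def infer_attribute(target_vec, neighbors, k=2):
    """Majority vote of the attribute over the k most similar users."""
    ranked = sorted(neighbors, key=lambda n: cosine(target_vec, n["vec"]),
                    reverse=True)[:k]
    votes = [n["income"] for n in ranked]
    return max(set(votes), key=votes.count)

user_b = [1, 1, 0, 1]  # viewing vector; income unknown
neighbors = [
    {"vec": [1, 1, 0, 1], "income": "60K-90K"},
    {"vec": [1, 1, 1, 1], "income": "60K-90K"},
    {"vec": [0, 0, 1, 0], "income": "<30K"},
]
guess = infer_attribute(user_b, neighbors)
# the two closest viewers share an income bucket, so that wins the vote
```

The same neighbor lookup answers "find all users similar to User A" and, run over viewing vectors of a new show, "which users will like it".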
18. Validation
Ground Truth from Login Data
e.g. Login to LinkedIn from Mobile, Tablet, Desktop
at Work, Laptop at Home.
Validation data is used for hold-out cross-validation
to learn the machine-learning parameters,
e.g. the edge distance threshold.
19. Precision / Recall
High precision -> the devices assigned to a consumer
actually belong to that consumer.
High recall -> all devices belonging to the consumer
are correctly assigned.
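Concretely, for one consumer the two metrics above reduce to set overlap between the model's assigned devices and the login ground truth. The device IDs below are invented for illustration.

```python
# Precision / recall for one consumer's device assignment.

def precision_recall(assigned, truth):
    """assigned, truth: sets of device IDs for one consumer."""
    true_positives = len(assigned & truth)
    precision = true_positives / len(assigned) if assigned else 0.0
    recall = true_positives / len(truth) if truth else 0.0
    return precision, recall

assigned = {"idfa_1", "cookie_1", "cookie_9"}   # model output
truth = {"idfa_1", "cookie_1", "cookie_2"}      # login ground truth
p, r = precision_recall(assigned, truth)
# 2 of the 3 assigned devices are correct (precision 2/3);
# 2 of the 3 true devices were found (recall 2/3)
```

Averaging these per-consumer scores over the hold-out set gives the validation numbers used to tune parameters such as the edge distance threshold.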
20. TV Viewership Classification
Data from TVs (ACR)
TV ID generates: TV program viewership
Dictionary is enumeration of ~10M Users
Sparse vector of Video Content (0 / 1 if they saw it)
Learning embedding: (TV programs, users) -> lookalike programs.
How do we learn embeddings?
Learn an underlying manifold,
like word2vec, where a document is the set of users viewing the content.
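The data-preparation step for the word2vec analogy above can be sketched as follows: each program becomes a "document" whose tokens are the user IDs that watched it. The viewership data is invented, and the Jaccard overlap shown at the end is only a cheap stand-in baseline for program similarity; the talk's approach feeds these documents to an embedding learner instead.

```python
# Build program "documents" (program -> set of viewers) from
# user-level viewership, as input for a word2vec-style embedding.

viewership = {  # user -> programs watched (toy data)
    "u1": {"Walking Dead", "Game of Thrones"},
    "u2": {"Game of Thrones", "Silicon Valley"},
    "u3": {"Game of Thrones", "Walking Dead"},
}

def program_documents(viewership):
    """Invert user -> programs into program -> set of viewers."""
    docs = {}
    for user, programs in viewership.items():
        for program in programs:
            docs.setdefault(program, set()).add(user)
    return docs

def jaccard(a, b):
    """Viewer-set overlap: a simple baseline for 'lookalike' programs."""
    return len(a & b) / len(a | b)

docs = program_documents(viewership)
sim = jaccard(docs["Walking Dead"], docs["Game of Thrones"])
# 2 shared viewers out of 3 distinct viewers -> similarity 2/3
```

A learned embedding improves on this baseline by placing programs with similar, not just overlapping, audiences near each other on the underlying manifold.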