With content now viewed seamlessly across multiple screens, this shift in consumer behavior has collided with the way advertising is sold, in separate TV and online silos, creating an opportunity to make advertising more effective using data and machine learning. This talk explores technological developments at VideoAmp that bring together data from disparate mediums and create cross-screen audience models: ML methods for cross-screen bid optimization, graph-based audience models covering 150 million users across over a billion unique device IDs, and behavioral insights gleaned from observing such a large variety of data.
Big Data Day LA 2016 / Data Science Track - Enabling Cross-Screen Advertising with Machine Learning and Spark - Debajyoti (Deb) Ray, CDO, VideoAmp
5. Bridging the Gap
VideoAmp’s goal is to enable advertisers and content creators to
transact seamlessly across all media types.
• Frequency capping for target consumers.
• TV media extension to desktop / mobile campaigns.
• Competitive conquesting.
6. Consumer Graph
How Big is the Graph?
[Diagram: an example consumer graph linking one consumer's devices - an in-app phone ID (IDFA), Safari on a phone (UID 1), Firefox at home (UID 2), Chrome at home (UID 3), and Firefox at work (UID 4) - connected via location and login signals.]
• 1.5B+ unique cookie IDs, Device IDs.
• 150M+ nodes.
• Behavioral data from each ID (several TBs / day).
7. Video Ads : from Request to Delivery
Figure 1
Step 2: The publisher Yahoo! passes the information to the ad exchange, say, Google DoubleClick AdX, including the URL where the ad slot is located, the vertical of the web page content such as sports, and the user cookie ID.
Step 3: The ad exchange AdX composes a bid request and sends the bid requests to several DSPs. Let’s assume the DSP iPinYou is one of them.
Step 4: When the iPinYou DSP server receives the bid request from the ad exchange AdX, it passes the information …
[Diagram: real-time bidding flow across the Ad Server, the Ad Exchange, and the DSP's Bid Listener, Decision Engine, and User Data Storage:
1. User visits the page.
2. Publisher calls the ad exchange.
3. Bid request sent to the DSP.
4. User ID and IP passed to the bid listener.
5. User ID looked up.
6. User data returned.
7. Bid price computed by the decision engine.
8. Bid CPM and ad tag returned to the exchange.
9. Auction winner's ad tag and second-price CPM sent back.
10. Page calls the winner's ad tag.
11. Ad server serves the ad.
12. Browser displays the ad.
The decision engine has ~20 ms to calculate; the whole process takes ~100 ms.]
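The exchange side of steps 7 through 9 can be sketched in a few lines. This is a minimal illustration of a second-price auction, not VideoAmp's actual system; the DSP names and CPM values are invented.

```python
# Toy sketch of the exchange resolving bids from several DSPs:
# the highest bidder wins but pays the second-highest CPM.

def run_second_price_auction(bids):
    """bids: list of {"dsp": name, "cpm": bid price}. Returns the
    winning DSP and the clearing (second) price."""
    if len(bids) < 2:
        raise ValueError("need at least two bids")
    ranked = sorted(bids, key=lambda b: b["cpm"], reverse=True)
    winner, runner_up = ranked[0], ranked[1]
    return {"winner": winner["dsp"], "clearing_cpm": runner_up["cpm"]}

bids = [
    {"dsp": "dsp_a", "cpm": 4.10},
    {"dsp": "dsp_b", "cpm": 5.25},
    {"dsp": "dsp_c", "cpm": 3.80},
]
result = run_second_price_auction(bids)
# dsp_b wins, but pays dsp_a's 4.10 CPM
```

The second-price rule is what makes step 9 return a "2nd price CPM" rather than the winner's own bid.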
8. The Right Tool : Apache Spark
Apache Spark is a distributed computing framework
that came out of AMPLab at UC Berkeley.
Key innovation is the Resilient Distributed Dataset (RDD):
a logical collection of data partitioned across machines.
[Figure 2: Spark runtime. The user’s driver program launches multiple workers, which read data blocks from a distributed file system and can persist computed RDD partitions in memory.]
The runtime can schedule tasks based on data locality to improve performance. Second, RDDs degrade gracefully when there is not enough memory to store them, as long as they are only being used in scan-based operations. Partitions that do not fit in RAM can be stored on disk and will provide similar performance to current data-parallel systems.
2.4 Applications Not Suitable for RDDs
As discussed in the Introduction, RDDs are best suited for batch applications that apply the same operation to all elements of a dataset. In these cases, RDDs can efficiently remember each transformation as one step in a lineage graph and can recover lost partitions without having to log large amounts of data.
Spark offers APIs in Scala and Python.
In our stack, Spark runs on Hadoop; data is stored in HDFS / Parquet on a distributed file system.
In some applications involving iterative calls, Spark is up to 100x faster than MapReduce.
9. Spark: Graph Frames
GraphFrames is a graph processing library (similar to GraphX)
- Scala, Python, Java APIs.
- Query on graphs (like Spark SQL):
> g.vertices.filter("age > 25")
> g.inDegrees.filter("inDegree > 2")
- Supports all algorithms in GraphX, and also:
Breadth-first search (BFS) - shortest path between 2 vertices.
(Strongly) connected components
Label propagation algorithm
10. VideoAmp Flint
We open-sourced Flint: creating push-button Spark clusters
for Machine Learning and Data Science in the cloud.
Designed for rapid deployment while providing native access to
data in a pre-existing HDFS / Hive cluster.
- Flint: a Spark Cluster Launcher (on AWS)
- Self-contained Spark Docker images.
- Jupyter Docker image preloaded with Python, R, Scala kernels.
Users can expand or contract the cluster on the fly.
12. Data from Devices
Data from TVs (ACR): TV ID generates TV program viewership. 10M Smart TVs / STBs; data in 15-minute chunks.
Mobile devices: Device ID generates sites, video content, segments. 50K QPS over 300M device IDs.
Desktop: Cookie ID generates sites, video content, segments. 100K QPS over 1B cookie IDs.
13. Sparse Representation
For each class of consumption data, create a dictionary with an enumeration of all content (e.g. TMS IDs) or types.
e.g. demographic segments:
Income = [ <30K, 30K to 60K, 60K to 90K, 90K to 120K, 120K+ ]
e.g. TV programs watched:
TV_Programs = ["Walking Dead", "Game of Thrones", …, "Silicon Valley"]
Then the user data is sparse:
Income (User ABC123) = [0,0,0,1,0]
TV_Programs (User ABC123) = [0,1,…,1]
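The dictionary-plus-sparse-vector scheme above can be sketched directly. The bucket and program lists mirror the slide's examples; the observed user data is invented for illustration.

```python
# Each class of consumption data gets a dictionary (a fixed enumeration);
# a user's data is then a 0/1 vector over that enumeration.

INCOME_BUCKETS = ["<30K", "30K-60K", "60K-90K", "90K-120K", "120K+"]
TV_PROGRAMS = ["Walking Dead", "Game of Thrones", "Silicon Valley"]

def one_hot(dictionary, observed):
    """Return a 0/1 vector over the dictionary's enumeration."""
    observed = set(observed)
    return [1 if item in observed else 0 for item in dictionary]

income_vec = one_hot(INCOME_BUCKETS, ["90K-120K"])
programs_vec = one_hot(TV_PROGRAMS, ["Game of Thrones", "Silicon Valley"])
# income_vec   -> [0, 0, 0, 1, 0]
# programs_vec -> [0, 1, 1]
```

In production these vectors would be stored in a sparse format (indices of the nonzero entries) rather than as dense lists, since the dictionaries run to millions of entries.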
15. Connected Components
Subgraphs in the graph s.t.
there is a path between any
two vertices.
Start with a node s and do BFS; this gives one component of the graph.
At each stage, pick an unexplored node n and do BFS; this finds another component.
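The repeated-BFS procedure above is straightforward to sketch on a toy adjacency list; the device IDs and edges here are invented, and a real graph of this scale would use a distributed algorithm (e.g. GraphFrames' connected components) rather than a single-machine loop.

```python
from collections import deque

def connected_components(graph):
    """Find connected components: BFS from each not-yet-explored node,
    each BFS collecting exactly one component."""
    seen, components = set(), []
    for start in graph:
        if start in seen:
            continue
        queue, component = deque([start]), set()
        seen.add(start)
        while queue:
            node = queue.popleft()
            component.add(node)
            for neighbor in graph[node]:
                if neighbor not in seen:
                    seen.add(neighbor)
                    queue.append(neighbor)
        components.append(component)
    return components

graph = {  # undirected edges between device / cookie IDs
    "idfa_1": ["cookie_1"],
    "cookie_1": ["idfa_1", "cookie_2"],
    "cookie_2": ["cookie_1"],
    "cookie_3": ["cookie_4"],
    "cookie_4": ["cookie_3"],
}
comps = connected_components(graph)
# two components: {idfa_1, cookie_1, cookie_2} and {cookie_3, cookie_4}
```

Each component is then a candidate "consumer": a set of device and cookie IDs believed to belong to one person or household.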
16. Clustering
Example with only location (lat / long attributes).
We utilize location, IP address, types (segments), and
behaviors (websites visited, TV programs viewed).
This is clustering in a very high-dimensional space with
sparse vectors.
17. Graph Inference
Find all Users similar to User A.
Fill in Missing Attributes. What is User B’s income level?
Which users will like Brain Dead (new show)?
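One way to fill in a missing attribute such as User B's income is nearest-neighbor inference: find the most similar users and take a majority vote. This is a hedged toy sketch, not the talk's actual method; the similarity measure (cosine over the 0/1 viewing vectors described earlier) and all user data are illustrative assumptions.

```python
# Toy attribute inference: impute a user's missing income from the
# incomes of their most similar neighbors (cosine similarity on
# sparse 0/1 behavior vectors).

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = (sum(a * a for a in u) * sum(b * b for b in v)) ** 0.5
    return dot / norm if norm else 0.0

def infer_attribute(target_vec, neighbors, k=2):
    """Majority vote of the attribute over the k most similar users."""
    ranked = sorted(neighbors, key=lambda n: cosine(target_vec, n["vec"]),
                    reverse=True)[:k]
    votes = [n["income"] for n in ranked]
    return max(set(votes), key=votes.count)

user_b = [1, 1, 0, 1]  # viewing vector; income unknown
neighbors = [
    {"vec": [1, 1, 0, 1], "income": "60K-90K"},
    {"vec": [1, 1, 1, 1], "income": "60K-90K"},
    {"vec": [0, 0, 1, 0], "income": "<30K"},
]
guess = infer_attribute(user_b, neighbors)
# the two closest viewers share an income bucket, so that wins the vote
```

The same neighbor lookup answers "find all users similar to User A" and, run over viewing vectors of a new show, "which users will like it".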
18. Validation
Ground Truth from Login Data
e.g. Login to LinkedIn from Mobile, Tablet, Desktop
at Work, Laptop at Home.
Validation data is used for hold-out cross-validation
to learn the machine-learning parameters,
e.g. the edge distance threshold.
19. Precision / Recall
High precision -> the devices assigned to a consumer
actually belong to that consumer.
High recall -> all devices belonging to the consumer
are correctly assigned.
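Concretely, for one consumer the two metrics above reduce to set overlap between the model's assigned devices and the login ground truth. The device IDs below are invented for illustration.

```python
# Precision / recall for one consumer's device assignment.

def precision_recall(assigned, truth):
    """assigned, truth: sets of device IDs for one consumer."""
    true_positives = len(assigned & truth)
    precision = true_positives / len(assigned) if assigned else 0.0
    recall = true_positives / len(truth) if truth else 0.0
    return precision, recall

assigned = {"idfa_1", "cookie_1", "cookie_9"}   # model output
truth = {"idfa_1", "cookie_1", "cookie_2"}      # login ground truth
p, r = precision_recall(assigned, truth)
# 2 of the 3 assigned devices are correct (precision 2/3);
# 2 of the 3 true devices were found (recall 2/3)
```

Averaging these per-consumer scores over the hold-out set gives the validation numbers used to tune parameters such as the edge distance threshold.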
20. TV Viewership Classification
Data from TVs (ACR)
TV ID generates: TV program viewership
Dictionary is enumeration of ~10M Users
Sparse vector of Video Content (0 / 1 if they saw it)
Learning embedding: (TV programs, users) -> lookalike programs.
How do we learn embeddings?
Learn an underlying manifold,
like word2vec, where a document is the set of users viewing the content.
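The data-preparation step for the word2vec analogy above can be sketched as follows: each program becomes a "document" whose tokens are the user IDs that watched it. The viewership data is invented, and the Jaccard overlap shown at the end is only a cheap stand-in baseline for program similarity; the talk's approach feeds these documents to an embedding learner instead.

```python
# Build program "documents" (program -> set of viewers) from
# user-level viewership, as input for a word2vec-style embedding.

viewership = {  # user -> programs watched (toy data)
    "u1": {"Walking Dead", "Game of Thrones"},
    "u2": {"Game of Thrones", "Silicon Valley"},
    "u3": {"Game of Thrones", "Walking Dead"},
}

def program_documents(viewership):
    """Invert user -> programs into program -> set of viewers."""
    docs = {}
    for user, programs in viewership.items():
        for program in programs:
            docs.setdefault(program, set()).add(user)
    return docs

def jaccard(a, b):
    """Viewer-set overlap: a simple baseline for 'lookalike' programs."""
    return len(a & b) / len(a | b)

docs = program_documents(viewership)
sim = jaccard(docs["Walking Dead"], docs["Game of Thrones"])
# 2 shared viewers out of 3 distinct viewers -> similarity 2/3
```

A learned embedding improves on this baseline by placing programs with similar, not just overlapping, audiences near each other on the underlying manifold.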