This talk will cover the tools we used, the hurdles we faced, and the workarounds we developed, with help from Databricks support, in our effort to build a custom machine learning model and use it to predict TV ratings for different networks and demographics.
The Apache Spark machine learning and DataFrame APIs make it incredibly easy to produce a machine learning pipeline for an archetypal supervised learning problem. In our applications at Cadent, we face a challenge of high-dimensional labels and relatively low-dimensional features; at first pass such a problem is all but intractable, but thanks to a large number of historical records and the tools available in Apache Spark, we were able to construct a multi-stage model capable of forecasting with sufficient accuracy to drive the business application.
Over the course of our work we came across many tools that made our lives easier, and others that forced workarounds. In this talk we review our custom multi-stage methodology, the challenges we faced, and the key steps that made our project successful.
3. Motivation
• Business Model
– Two-sided business
– Upfront sales sell impressions
– Fulfilled with scatter purchases based on subscribers
– Impressions = ratings * subscribers
• Relevant Scales
– Weather-like View
• Shows
• Twitter trends
• Spectacle Events
– Climate-like View
• Seasonality
• Subscriber trends
• Daypart Variation
4. Theoretical Approach
Features
• unique combinations of targetable characteristics
• Network, Age, Gender, Category, Season, etc.
Rating Vectors (labels)
• 96 positive real values, one rating per quarter hour of day
(Diagram: each feature combination maps to a daily curve of rating versus quarter hour of day.)
5. Daily Patterns: Mean & Variance
Values shown in a log-like coordinate system:
• value 0 = rating 0
• value 3 = rating 10^(-5)
• value 5 = rating 10^(-3)
(Plots: mean and variance of the daily rating patterns.)
6. Label Dimensionality Reduction
Features
• unique combinations of targetable characteristics
• Network, Age, Gender, Category, Season, etc.
Rating Vectors
• 96 positive real values, one per quarter hour of day
Coefficients of Principal Components
• J real values
(Diagram: PCA maps each 96-dimensional rating vector to J principal-component coefficients.)
8. Why Reduce Label Dimension
• The correlations between values captured by reducing to principal components add more value than the variance lost in the “climate-like” view
• The Apache Spark ML API doesn’t support n-dimensional regression, so each label vector is fit with J single-output regressions instead of n; this is computationally efficient for J << n (see the PCA sketch below)
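As a concrete illustration, here is a minimal sketch of reducing the label vectors with Spark ML’s PCA. The column names (ratingVector, labelPCA), the DataFrame trainingDF, and the choice J = 8 are assumptions for illustration:

```scala
import org.apache.spark.ml.feature.PCA

// Learn a J-dimensional linear subspace of the 96-dimensional rating
// vectors; each label becomes J principal-component coefficients.
val labelPca = new PCA()
  .setInputCol("ratingVector") // hypothetical column of 96-element ml Vectors
  .setOutputCol("labelPCA")    // J-element coefficient vectors
  .setK(8)                     // J: assumed value, tuned to variance retained

val pcaModel = labelPca.fit(trainingDF)   // trainingDF: hypothetical DataFrame
val reducedDF = pcaModel.transform(trainingDF)

// pcaModel.explainedVariance reports the variance captured per component,
// which is useful when choosing J.
```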
9. Coordinate Systems Matter
• Regression works well when…
– Euclidean distance fits well with the human sense of “sameness”
– The labels being predicted are well conditioned
• A big part of our methodology is understanding the mathematical spaces our data lives in and using ‘change of coordinate’ techniques
Example: points in the time-of-day domain are defined on the unit circle using 2D (x, y) coordinates, with unknowns imputed at (0, 0).
(Diagram: the 0:00–23:59 clock face mapped onto the unit circle.)
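A minimal sketch of this cyclical encoding, assuming a quarter-hour index column named quarterHour (0–95) and a DataFrame df:

```scala
import org.apache.spark.sql.functions.{col, udf}

// Map a quarter-hour index onto the unit circle so 23:45 and 0:00 are
// close in Euclidean distance; unknowns land at the origin, equidistant
// from every time of day.
val clockXY = udf { (quarterHour: Option[Int]) =>
  quarterHour match {
    case Some(qh) =>
      val theta = 2.0 * math.Pi * qh / 96.0
      (math.cos(theta), math.sin(theta))
    case None => (0.0, 0.0) // unknown time of day imputed at the origin
  }
}

val encodedDF = df.withColumn("clockXY", clockXY(col("quarterHour")))
```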
10. Custom Log-Like Coordinates
This coordinate system is used to eliminate bias in error metrics: in the raw rating domain, errors on large ratings swamp those on small ratings.
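Below is a sketch of a log-like transform consistent with the values quoted on the “Daily Patterns” slide (value 3 = rating 10^(-5), value 5 = rating 10^(-3)); the form value = 8 + log10(rating), with zero ratings pinned to value 0, is an inference from those examples rather than the exact production transform:

```scala
// Assumed form: value = 8 + log10(rating) for positive ratings, clamped
// at 0, with rating 0 mapped to value 0 so zeros stay representable.
def toLogLike(rating: Double): Double =
  if (rating <= 0.0) 0.0
  else math.max(0.0, 8.0 + math.log10(rating))

// Inverse transform back to the rating domain.
def fromLogLike(value: Double): Double =
  if (value <= 0.0) 0.0
  else math.pow(10.0, value - 8.0)
```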
11. Predictor-Corrector Method
• Predictor-Corrector is a form of ensemble
– Build a naïve model and an estimator of that model’s bias function, then pipeline them together to create a predictor-corrector model (PCM)
Naïve model: X → yHat
Bias model: X → eHat, trained on errors e = yHat − y
Corrected forecast: yPred = yHat − eHat
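A minimal sketch of the idea using two Spark ML regressors with scalar labels for clarity; the GBTRegressor choice, the DataFrames train and test, and all column names are assumptions:

```scala
import org.apache.spark.ml.regression.GBTRegressor
import org.apache.spark.sql.functions.col

// Stage 1: naive model, X -> yHat.
val naiveModel = new GBTRegressor()
  .setFeaturesCol("features").setLabelCol("y").setPredictionCol("yHat")
  .fit(train)

// Stage 2: corrector, X -> eHat, trained on the naive model's errors.
val withErrors = naiveModel.transform(train)
  .withColumn("e", col("yHat") - col("y"))
val correctorModel = new GBTRegressor()
  .setFeaturesCol("features").setLabelCol("e").setPredictionCol("eHat")
  .fit(withErrors)

// Corrected forecast: yPred = yHat - eHat.
val predictions = correctorModel.transform(naiveModel.transform(test))
  .withColumn("yPred", col("yHat") - col("eHat"))
```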
12. Implemented Workflow
• Naïve Estimator Model (NEM) → domain-space forecast
• Correction Estimator Model (CEM) → log-space local-coordinate forecast
Both NEM and CEM are regressors in reduced-dimensional vector spaces, created using PCA linear subspace reductions to find efficient coordinate systems.
13. Principal Component Analysis (PCA): reducing the dimensionality of the problem
(Diagram: a pipeline of J single-label regression models.)
• Features: unique combinations of targetable characteristics (Network, Age, Gender, Category, Season, etc.)
• PCA Transform: map each 96-dimensional rating vector to coefficient vectors Component1 … ComponentJ
• Vector Disassemble: split the coefficient vector into J scalar label columns
• Train GBT Regressor 1 … Train GBT Regressor J: one single-label regressor per component
• Vector Assembler: recombine the J predictions into a coefficient vector
• PCA pseudo-inverse: map predicted coefficients back to 96-dimensional rating vectors
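A minimal sketch of this pattern, continuing from the PCA sketch above (reducedDF, labelPCA, and features are hypothetical names; J = 8 is an assumed value):

```scala
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.ml.linalg.Vector
import org.apache.spark.ml.regression.GBTRegressor
import org.apache.spark.sql.functions.{col, lit, udf}

val J = 8 // assumed number of principal components

// Vector disassemble: split the J-element coefficient vector into scalars.
val element = udf((v: Vector, i: Int) => v(i))
val disassembled = (0 until J).foldLeft(reducedDF) { (df, i) =>
  df.withColumn(s"coef_$i", element(col("labelPCA"), lit(i)))
}

// One single-label GBT regressor per principal component.
val models = (0 until J).map { i =>
  new GBTRegressor()
    .setFeaturesCol("features")
    .setLabelCol(s"coef_$i")
    .setPredictionCol(s"predCoef_$i")
    .fit(disassembled)
}

// Apply all J models, then reassemble the predictions into one vector.
// pcaModel.pc holds the components matrix for the pseudo-inverse step
// back to 96-dimensional rating vectors.
val scored = models.foldLeft(disassembled)((df, m) => m.transform(df))
val assembled = new VectorAssembler()
  .setInputCols((0 until J).map(i => s"predCoef_$i").toArray)
  .setOutputCol("predLabelPCA")
  .transform(scored)
```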
17. Exploring Model Performance
(Diagram: evaluation workflow.)
• Big Data results: features X plus actual and predicted rating vectors (y, pred y)
• Unpivot Vector: explode each vector by quarter hour (qh) into small data artifacts
• Small data artifacts feed visualizations & performance statistics, with per-quarter-hour columns RatingforQuarterHour, PredRatingforQuarterHour, and ErrorforQuarterHour
*Like Estimators, Evaluators in Spark ML are 1 dimensional
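Because Evaluators score a single scalar label/prediction column pair, they can be applied directly to the unpivoted rows. A minimal sketch, assuming a DataFrame unpivotedDF with the per-quarter-hour columns above:

```scala
import org.apache.spark.ml.evaluation.RegressionEvaluator

// Evaluators are one-dimensional, so the vector labels must first be
// unpivoted to one row per quarter hour.
val rmse = new RegressionEvaluator()
  .setLabelCol("RatingforQuarterHour")
  .setPredictionCol("PredRatingforQuarterHour")
  .setMetricName("rmse")
  .evaluate(unpivotedDF)
```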
18. Evaluation of Rating Feed Vectors
Ensure the performance quality of our predictive models.
Steps:
• UDF Composition
• Data Wrangling
• Machine Learning Evaluation
19. Evaluation of Rating Feed Vectors
UDF Composition:
• Perform element-wise calculations on Vectors
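For example, a minimal sketch of an element-wise error UDF over two ml Vectors (the column names are assumptions):

```scala
import org.apache.spark.ml.linalg.{Vector, Vectors}
import org.apache.spark.sql.functions.{col, udf}

// Element-wise difference of two 96-element vectors:
// the per-quarter-hour prediction error.
val elementwiseError = udf { (actual: Vector, predicted: Vector) =>
  Vectors.dense(actual.toArray.zip(predicted.toArray).map {
    case (a, p) => p - a
  })
}

val withError = df.withColumn(
  "ErrorVector",
  elementwiseError(col("ratingVector"), col("predRatingVector")))
```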
20. Evaluation of Rating Feed Vectors
UDF Composition:
• Zip and flatten relevant vectors
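And a sketch of zipping two vectors with their quarter-hour index and flattening to one row per quarter hour (column names again assumed):

```scala
import org.apache.spark.ml.linalg.Vector
import org.apache.spark.sql.functions.{col, explode, udf}

// Zip actual and predicted vectors with the quarter-hour index, then
// explode: one (qh, rating, predicted rating) row per quarter hour.
val zipVectors = udf { (actual: Vector, predicted: Vector) =>
  actual.toArray.zip(predicted.toArray).zipWithIndex.map {
    case ((a, p), qh) => (qh, a, p)
  }
}

val unpivoted = df
  .withColumn("zipped",
    explode(zipVectors(col("ratingVector"), col("predRatingVector"))))
  .select(
    col("zipped._1").as("qh"),
    col("zipped._2").as("RatingforQuarterHour"),
    col("zipped._3").as("PredRatingforQuarterHour"))
```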
24. Future Work
• Program-schedule-based short-term refinements
– While our sales teams work with the “climate-like” ratings forecasts generated months in advance, operations buys media with weeks of lead time
• Rentrak Integrations & Sensor Fusion
– Nielsen ratings are panel-driven and Rentrak is census-based, but both are fundamentally observations of the same underlying phenomenon
25. Contributors
• Michael Zargham
– Director, Data Science @ Cadent
– PhD in Optimization and Decision Theory from the University of Pennsylvania
– Founder of Cadent Data Science Team
– Architect of Information and Decision systems
• Stefan Panayotov
– Sr. Data Engineer @ Cadent Technology
– PhD in Computer Science from the Bulgarian Academy of Sciences
– Implemented the Big Data platform to support the data science and business intelligence teams at Cadent
– Built ETL & ELT processes and worked on creating ML model pipelines for predicting ratings
• Joshua Jodesty
– Jr. Data Engineer @ Cadent Technology
– Award-winning Learning Analytics researcher
– B.S. in Information Science & Technology from Temple University
26. Broader Data Team @ Cadent
• Stephanie Mitchko-Beal, CTO/COO – Driver of Cadent’s Data Driven Transition
• Dr. Joe Matarese – Chief Technologist, General Manager, Silicon Valley Office
– Former VP & GM of ARRIS On Demand, SVP of Advanced Technology at C-COR, and CTO of nCUBE
– Experience in high performance computing applied to big data problems in seismology and geophysical inverse theory
• Dr. David Sisson – VP Strategic Technology
– Research in computational neuroscience and signal processing, data platform architect at Cadent Network
• Chris Frazier – VP Business Intelligence
• Mark Sun – VP Software Development
– MS, Computer Science; BS, Nuclear Engineering and leader of Cadent DAI platform development team
• Dr. Yun Huang – Data Engineer & Director, Software Development
• Matthew Plourde – Sr. Analytics Engineer & Lead Machine Learning Developer
• The team has state-of-the-art skills
– Over a dozen engineers with Apache Spark Big Data platform development experience
– 8 Engineers and analysts with Machine Learning experience
– Expertise in a wide array of languages, including Python, R, SQL, Java, Scala, and C#
– Across our ranks the data team has 6 PhDs, including degrees from top universities like Penn, MIT, and Caltech
27. Special Thanks to Databricks
Databricks’ Spark platform provided:
• the stability and scalability necessary for work of this sophistication
• accessibility through a quality support staff
• a cost that a mid-sized business can afford
28. Thank You.
Contact Us
Mike: mzargham@cadent.tv
Stefan: spanayotov@cadent.tv
Josh: jjodesty@cadent.tv
Interested in our team?
http://cadent.tv/careers/