This talk will cover the tools we used, the hurdles we faced, and the workarounds we developed, with help from Databricks support, in our effort to build a custom machine learning model and use it to predict TV ratings for different networks and demographics.
The Apache Spark machine learning and DataFrame APIs make it incredibly easy to produce a machine learning pipeline for an archetypal supervised learning problem. In our applications at Cadent, we face a challenge of high-dimensional labels and relatively low-dimensional features; at first pass such a problem is all but intractable, but thanks to a large number of historical records and the tools available in Apache Spark, we were able to construct a multi-stage model capable of forecasting with sufficient accuracy to drive the business application.
Over the course of our work we came across many tools that made our lives easier, and others that forced workarounds. In this talk we review our custom multi-stage methodology, the challenges we faced, and the key steps that made our project successful.
3. Motivation
• Business Model
– Two-sided business
– Upfront sales sell impressions
– Fulfilled with scatter purchases based on subscribers
– Impressions = ratings * subscribers
• Relevant Scales
– Weather-like View
• Shows
• Twitter trends
• Spectacle Events
– Climate-like View
• Seasonality
• Subscriber trends
• Daypart Variation
4. Theoretical Approach
Features
• unique combinations of targetable characteristics
• Network, Age, Gender, Category, Season, etc.
Rating Vectors (labels)
• 96 positive real values, one rating per quarter hour of day
(Diagram: each feature combination maps to a daily curve of rating versus quarter hour of day.)
5. Daily Patterns: Mean & Variance
Values shown in a log-like coordinate system:
• value 0 = rating 0
• value 3 = rating 10^(-5)
• value 5 = rating 10^(-3)
(Plots: mean and variance of the daily rating patterns.)
6. Label Dimensionality Reduction
Features
• unique combinations of targetable characteristics
• Network, Age, Gender, Category, Season, etc.
Rating Vectors
• 96 positive real values, one per quarter hour of day
Coefficients of Principal Components
• J real values
(Diagram: PCA maps each 96-dimensional rating vector to J principal-component coefficients.)
8. Why Reduce Label Dimension
• The correlations between values captured by reducing to principal components add more value than the variance lost in the “climate-like” view
• The Apache Spark ML API doesn’t support n-dimensional regression, so each label vector is fit with J single-output regressions instead of n; this is computationally efficient for J << n (see the PCA sketch below)
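As a concrete illustration, here is a minimal sketch of reducing the label vectors with Spark ML’s PCA. The column names (ratingVector, labelPCA), the DataFrame trainingDF, and the choice J = 8 are assumptions for illustration:

```scala
import org.apache.spark.ml.feature.PCA

// Learn a J-dimensional linear subspace of the 96-dimensional rating
// vectors; each label becomes J principal-component coefficients.
val labelPca = new PCA()
  .setInputCol("ratingVector") // hypothetical column of 96-element ml Vectors
  .setOutputCol("labelPCA")    // J-element coefficient vectors
  .setK(8)                     // J: assumed value, tuned to variance retained

val pcaModel = labelPca.fit(trainingDF)   // trainingDF: hypothetical DataFrame
val reducedDF = pcaModel.transform(trainingDF)

// pcaModel.explainedVariance reports the variance captured per component,
// which is useful when choosing J.
```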
9. Coordinate Systems Matter
• Regression works well when…
– Euclidean distance fits well with the human sense of “sameness”
– The labels being predicted are well conditioned
• A big part of our methodology is understanding the mathematical spaces our data lives in and using ‘change of coordinate’ techniques
Example: points in the time-of-day domain are defined on the unit circle using 2D (x, y) coordinates, with unknowns imputed at (0, 0).
(Diagram: the 0:00–23:59 clock face mapped onto the unit circle.)
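A minimal sketch of this cyclical encoding, assuming a quarter-hour index column named quarterHour (0–95) and a DataFrame df:

```scala
import org.apache.spark.sql.functions.{col, udf}

// Map a quarter-hour index onto the unit circle so 23:45 and 0:00 are
// close in Euclidean distance; unknowns land at the origin, equidistant
// from every time of day.
val clockXY = udf { (quarterHour: Option[Int]) =>
  quarterHour match {
    case Some(qh) =>
      val theta = 2.0 * math.Pi * qh / 96.0
      (math.cos(theta), math.sin(theta))
    case None => (0.0, 0.0) // unknown time of day imputed at the origin
  }
}

val encodedDF = df.withColumn("clockXY", clockXY(col("quarterHour")))
```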
10. Custom Log-Like Coordinates
This coordinate system is used to eliminate bias in error metrics: in the raw rating domain, errors on large ratings swamp those on small ratings.
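Below is a sketch of a log-like transform consistent with the values quoted on the “Daily Patterns” slide (value 3 = rating 10^(-5), value 5 = rating 10^(-3)); the form value = 8 + log10(rating), with zero ratings pinned to value 0, is an inference from those examples rather than the exact production transform:

```scala
// Assumed form: value = 8 + log10(rating) for positive ratings, clamped
// at 0, with rating 0 mapped to value 0 so zeros stay representable.
def toLogLike(rating: Double): Double =
  if (rating <= 0.0) 0.0
  else math.max(0.0, 8.0 + math.log10(rating))

// Inverse transform back to the rating domain.
def fromLogLike(value: Double): Double =
  if (value <= 0.0) 0.0
  else math.pow(10.0, value - 8.0)
```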
11. Predictor-Corrector Method
• Predictor-Corrector is a form of ensemble
– Build a naïve model and an estimator of that model’s bias function, then pipeline them together to create a predictor-corrector model (PCM)
Naïve model: X → yHat
Bias model: X → eHat, trained on errors e = yHat − y
Corrected forecast: yPred = yHat − eHat
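A minimal sketch of the idea using two Spark ML regressors with scalar labels for clarity; the GBTRegressor choice, the DataFrames train and test, and all column names are assumptions:

```scala
import org.apache.spark.ml.regression.GBTRegressor
import org.apache.spark.sql.functions.col

// Stage 1: naive model, X -> yHat.
val naiveModel = new GBTRegressor()
  .setFeaturesCol("features").setLabelCol("y").setPredictionCol("yHat")
  .fit(train)

// Stage 2: corrector, X -> eHat, trained on the naive model's errors.
val withErrors = naiveModel.transform(train)
  .withColumn("e", col("yHat") - col("y"))
val correctorModel = new GBTRegressor()
  .setFeaturesCol("features").setLabelCol("e").setPredictionCol("eHat")
  .fit(withErrors)

// Corrected forecast: yPred = yHat - eHat.
val predictions = correctorModel.transform(naiveModel.transform(test))
  .withColumn("yPred", col("yHat") - col("eHat"))
```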
12. Implemented Workflow
• Naïve Estimator Model (NEM) → domain-space forecast
• Correction Estimator Model (CEM) → log-space local-coordinate forecast
Both NEM and CEM are regressors in reduced-dimensional vector spaces, created using PCA linear subspace reductions to find efficient coordinate systems.
13. Principal Component Analysis (PCA): reducing the dimensionality of the problem
(Diagram: a pipeline of J single-label regression models.)
• Features: unique combinations of targetable characteristics (Network, Age, Gender, Category, Season, etc.)
• PCA Transform: map each 96-dimensional rating vector to coefficient vectors Component1 … ComponentJ
• Vector Disassemble: split the coefficient vector into J scalar label columns
• Train GBT Regressor 1 … Train GBT Regressor J: one single-label regressor per component
• Vector Assembler: recombine the J predictions into a coefficient vector
• PCA pseudo-inverse: map predicted coefficients back to 96-dimensional rating vectors
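A minimal sketch of this pattern, continuing from the PCA sketch above (reducedDF, labelPCA, and features are hypothetical names; J = 8 is an assumed value):

```scala
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.ml.linalg.Vector
import org.apache.spark.ml.regression.GBTRegressor
import org.apache.spark.sql.functions.{col, lit, udf}

val J = 8 // assumed number of principal components

// Vector disassemble: split the J-element coefficient vector into scalars.
val element = udf((v: Vector, i: Int) => v(i))
val disassembled = (0 until J).foldLeft(reducedDF) { (df, i) =>
  df.withColumn(s"coef_$i", element(col("labelPCA"), lit(i)))
}

// One single-label GBT regressor per principal component.
val models = (0 until J).map { i =>
  new GBTRegressor()
    .setFeaturesCol("features")
    .setLabelCol(s"coef_$i")
    .setPredictionCol(s"predCoef_$i")
    .fit(disassembled)
}

// Apply all J models, then reassemble the predictions into one vector.
// pcaModel.pc holds the components matrix for the pseudo-inverse step
// back to 96-dimensional rating vectors.
val scored = models.foldLeft(disassembled)((df, m) => m.transform(df))
val assembled = new VectorAssembler()
  .setInputCols((0 until J).map(i => s"predCoef_$i").toArray)
  .setOutputCol("predLabelPCA")
  .transform(scored)
```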
17. Exploring Model Performance
(Diagram: evaluation workflow.)
• Big Data results: features X plus actual and predicted rating vectors (y, pred y)
• Unpivot Vector: explode each vector by quarter hour (qh) into small data artifacts
• Small data artifacts feed visualizations & performance statistics, with per-quarter-hour columns RatingforQuarterHour, PredRatingforQuarterHour, and ErrorforQuarterHour
*Like Estimators, Evaluators in Spark ML are 1 dimensional
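Because Evaluators score a single scalar label/prediction column pair, they can be applied directly to the unpivoted rows. A minimal sketch, assuming a DataFrame unpivotedDF with the per-quarter-hour columns above:

```scala
import org.apache.spark.ml.evaluation.RegressionEvaluator

// Evaluators are one-dimensional, so the vector labels must first be
// unpivoted to one row per quarter hour.
val rmse = new RegressionEvaluator()
  .setLabelCol("RatingforQuarterHour")
  .setPredictionCol("PredRatingforQuarterHour")
  .setMetricName("rmse")
  .evaluate(unpivotedDF)
```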
18. Evaluation of Rating Feed Vectors
Ensure the performance quality of our predictive models.
Steps:
• UDF Composition
• Data Wrangling
• Machine Learning Evaluation
19. Evaluation of Rating Feed Vectors
UDF Composition:
• Perform element-wise calculations on Vectors
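For example, a minimal sketch of an element-wise error UDF over two ml Vectors (the column names are assumptions):

```scala
import org.apache.spark.ml.linalg.{Vector, Vectors}
import org.apache.spark.sql.functions.{col, udf}

// Element-wise difference of two 96-element vectors:
// the per-quarter-hour prediction error.
val elementwiseError = udf { (actual: Vector, predicted: Vector) =>
  Vectors.dense(actual.toArray.zip(predicted.toArray).map {
    case (a, p) => p - a
  })
}

val withError = df.withColumn(
  "ErrorVector",
  elementwiseError(col("ratingVector"), col("predRatingVector")))
```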
20. Evaluation of Rating Feed Vectors
UDF Composition:
• Zip and flatten relevant vectors
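And a sketch of zipping two vectors with their quarter-hour index and flattening to one row per quarter hour (column names again assumed):

```scala
import org.apache.spark.ml.linalg.Vector
import org.apache.spark.sql.functions.{col, explode, udf}

// Zip actual and predicted vectors with the quarter-hour index, then
// explode: one (qh, rating, predicted rating) row per quarter hour.
val zipVectors = udf { (actual: Vector, predicted: Vector) =>
  actual.toArray.zip(predicted.toArray).zipWithIndex.map {
    case ((a, p), qh) => (qh, a, p)
  }
}

val unpivoted = df
  .withColumn("zipped",
    explode(zipVectors(col("ratingVector"), col("predRatingVector"))))
  .select(
    col("zipped._1").as("qh"),
    col("zipped._2").as("RatingforQuarterHour"),
    col("zipped._3").as("PredRatingforQuarterHour"))
```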
24. Future Work
• Program-schedule-based short-term refinements
– While our sales teams work with the “climate-like” ratings forecasts generated months in advance, operations buys media with weeks of lead time
• Rentrak Integrations & Sensor Fusion
– Nielsen ratings are panel-driven and Rentrak is census-based, but both are fundamentally observations of the same underlying phenomenon
25. Contributors
• Michael Zargham
– Director, Data Science @ Cadent
– PhD in Optimization and Decision Theory from the University of Pennsylvania
– Founder of Cadent Data Science Team
– Architect of Information and Decision systems
• Stefan Panayotov
– Sr. Data Engineer @ Cadent Technology
– PhD in Computer Science from the Bulgarian Academy of Sciences
– Implemented the Big Data platform to support the data science and business intelligence teams at Cadent
– Built ETL & ELT processes and worked on creating ML model pipelines for predicting ratings
• Joshua Jodesty
– Jr. Data Engineer @ Cadent Technology
– Award-winning Learning Analytics researcher
– B.S. in Information Science & Technology from Temple University
26. Broader Data Team @ Cadent
• Stephanie Mitchko-Beal, CTO/COO – Driver of Cadent’s Data Driven Transition
• Dr. Joe Matarese – Chief Technologist, General Manager, Silicon Valley Office
– Former VP & GM of ARRIS On Demand, SVP of Advanced Technology at C-COR, and CTO of nCUBE
– Experience in high performance computing applied to big data problems in seismology and geophysical inverse theory
• Dr. David Sisson – VP Strategic Technology
– Research in computational neuroscience and signal processing, data platform architect at Cadent Network
• Chris Frazier – VP Business Intelligence
• Mark Sun – VP Software Development
– MS, Computer Science; BS, Nuclear Engineering and leader of Cadent DAI platform development team
• Dr. Yun Huang – Data Engineer & Director, Software Development
• Matthew Plourde – Sr. Analytics Engineer & Lead Machine Learning Developer
• The team has state-of-the-art skills
– Over a dozen engineers with Apache Spark Big Data platform development experience
– 8 Engineers and analysts with Machine Learning experience
– Expertise in a wide array of languages, including Python, R, SQL, Java, Scala, and C#
– Across our ranks the data team has 6 PhDs, including degrees from top universities like Penn, MIT, and Caltech
27. Special Thanks to Databricks
Databricks’ Spark platform provided:
• the stability and scalability necessary for work of this sophistication
• accessibility through a quality support staff
• a cost that a mid-sized business can afford
28. Thank You.
Contact Us
Mike: mzargham@cadent.tv
Stefan: spanayotov@cadent.tv
Josh: jjodesty@cadent.tv
Interested in our team?
http://cadent.tv/careers/