Our team at Comcast is challenged with operationalizing predictive ML models to improve customer experience. Our goal is to eliminate bottlenecks in the process from model inception to deployment and monitoring.
Traditionally, CI/CD manages code and infrastructure artifacts such as container definitions. We want to extend it to support granular traceability, enabling tracking of ML models from use case to feature/attribute selection, development of versioned datasets, model training code, model evaluation artifacts, model prediction deployment containers, and the sinks to which predictions/outcomes are persisted. Our framework stack lets us track models from use case to deployment, manage and evaluate multiple models simultaneously in live-but-dark mode, and continuously monitor models in production against real-world outcomes using configurable policies.
The technologies/components which drive this vision are:
1. FeatureStore – Enables data scientists to reuse versioned features and review feature metrics by model. Self-service capabilities allow all teams to onboard their event data into the feature store.
2. ModelRepository – Manages metadata about models, including pre-processing parameters (e.g., scaling parameters for features), the mapping to the features needed to execute the model, model discovery mechanisms, etc.
3. Spark on Alluxio – Alluxio provides a universal data plane on top of various under-stores (e.g., S3, HDFS, RDBMS). Apache Spark, with its Data Sources API, provides a unified query language that data scientists use to consume features and create versioned training/validation/test datasets, which are integrated into the full model pipeline using Ground-Context, discussed next (a sketch of this dataset-creation step follows the list).
4. Ground-Context – This open-source, vendor-neutral data context service enables full traceability across use cases, models, features, model-to-feature mappings, versioned datasets, the model training codebase, model deployment containers, and prediction/outcome sinks. It connects the Feature Store, the container repository, and Git to tie data, code, and run-time artifacts into the CI/CD pipeline.
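A minimal sketch (PySpark) of the dataset-creation step from item 3, under assumed feature paths and column names: versioned features are read from the history store through Spark's Data Sources API and the resulting training dataset is persisted under its own version.

```python
# Minimal sketch; all paths and column names are illustrative assumptions.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("build-training-dataset").getOrCreate()

# Read two versioned features from the history feature store.
speed_tests = spark.read.parquet("s3://feature-store/history/speed_test/v2/")
call_counts = spark.read.parquet("s3://feature-store/history/calls_30d/v1/")

# Join on the account key; the outcome column used as a label is assumed
# to be present in the speed_test feature for illustration only.
training = (speed_tests
            .join(call_counts, on="account_number", how="left")
            .withColumn("label", F.col("truck_roll_needed").cast("int")))

# Persist under an explicit version so the dataset is traceable end to end.
training.write.mode("overwrite").parquet("s3://datasets/slow_speed_model/training/v5/")
```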
2. INTRODUCTION AND BACKGROUND
CUSTOMER EXPERIENCE TEAM
27 MILLION CUSTOMERS (HIGH SPEED DATA, VIDEO, VOICE, HOME SECURITY, MOBILE)
INGESTING ABOUT 2 BILLION EVENTS / MONTH
HIGH-VOLUME OF MACHINE-GENERATED EVENTS
DATA SCIENCE PIPELINE GREW FROM A FEW DOZEN TO 150+ DATA SOURCES / FEEDS IN ABOUT A YEAR
Comcast collects, stores, and uses all data in accordance with our privacy disclosures to users and applicable laws.
3. COMCAST APPLIED AI
[Application areas: Media & Video Analytics; Machine Learning & Data Science; Content Discovery; Speech & NLP; Video; High Speed Internet; Home Security / Automation; Customer Service; Universal Parks; Media Properties]
4. BUSINESS PROBLEM
INCREASE POSITIVE CUSTOMER EXPERIENCES
RESOLVE POTENTIAL ISSUES CORRECTLY, QUICKLY, AND EVEN BETTER PROACTIVELY
PREDICT AND DIAGNOSE SERVICE TROUBLE ACROSS MULTIPLE KNOWLEDGE DOMAINS
REDUCE COSTS THROUGH EARLIER RESOLUTION AND BY REDUCING AVOIDABLE TECHNICIAN VISITS
6. XFINITY VIRTUAL ASSISTANT
[Screenshots 1-4: My Account main screen → XFINITY Assistant → type a question → disambiguate]
7. VIRTUAL ASSISTANT – STEP BY STEP
[Architecture diagram: devices, applications, and platforms are instrumented to provide telemetry; natural language input and feedback are processed by NLP into customer intents and domain models; predictive AI/ML produces root cause predictions; a decision engine (choose best / explore) combines context with an action catalog (schedule truck roll, self-heal, notifications, agent contact) to drive interactive (conversational) and proactive (automatic) actions]
8. TECHNICAL PROBLEM
MULTIPLE PROGRAMMING AND DATA SCIENCE ENVIRONMENTS
WIDESPREAD AND DISCORDANT DATA SOURCES
THE “DATA PLANE” PROBLEM: COMBINING DATA AT REST AND DATA IN MOTION
CONSISTENT FEATURE ENGINEERING
ML VERSIONING: DATA, CODE, FEATURES, MODELS
9. EXAMPLE NEAR REAL TIME PREDICTION USE CASE
CUSTOMER RUNS A “SPEED TEST”
EVENT TRIGGERS A PREDICTION FLOW
ENRICH WITH NETWORK HEALTH AND OTHER INDICATORS
EXECUTE ML MODEL
PREDICT WHETHER IT IS A WIFI, MODEM, OR NETWORK ISSUE
[Flow diagram: a “slow speed?” event is detected; data is gathered and enriched via network diagnostic services and additional context services; the prediction runs against the ML model; the system acts / notifies and engages the customer]
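A minimal sketch of the event-triggered flow above, assuming hypothetical service clients, feature names, and a pre-loaded model; it only illustrates the detect → enrich → predict → act shape, not the production implementation.

```python
def handle_speed_test_event(event, diagnostics, context_svc, model):
    """Detect -> enrich -> predict -> act for a single speed-test event."""
    account = event["account_number"]

    # Enrich the raw speed-test result with network health and other indicators.
    features = {
        "download_mbps": event["download_mbps"],
        "upload_mbps": event["upload_mbps"],
        **diagnostics.network_health(account),      # assumed client call
        **context_svc.account_context(account),     # assumed client call
    }

    # Execute the ML model: classify the likely root cause.
    label = model.predict(features)                  # e.g. "wifi", "modem", "network"

    # Act / notify: hand the result to whatever engages the customer.
    return {"account_number": account, "root_cause": label}
```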
10. SPACE CORRELATION EXAMPLE
ML ALGORITHM NEEDS TO LEARN THAT THERE IS NO NEED TO SEND 3 REPAIR TRUCKS
• LOGS FROM WHICH TRAINING DATASETS ARE SOURCED SHOW CORRELATION BETWEEN UNSUCCESSFUL TRUCK DISPATCHES AND CONCENTRATED CABLE FAILURES
• GEO-LOCATION IS AVAILABLE IN THE CUSTOMER CONTEXT
• ALGORITHM CAN CLUSTER CUSTOMERS BASED ON GEO-LOCATION
[Diagram: customers along a cable; green = works, yellow = has problems; a concentration of yellow points indicates a likely cable failure]
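A minimal sketch of the geo-clustering idea, using scikit-learn's DBSCAN with a haversine metric; the input shape, radius, and threshold are assumptions, not the algorithm the talk actually used.

```python
import numpy as np
from sklearn.cluster import DBSCAN

EARTH_RADIUS_M = 6_371_000

def cluster_problem_accounts(problem_points, radius_m=500, min_accounts=3):
    """Group accounts reporting problems into spatial clusters.

    problem_points: list of (lat_deg, lon_deg) for accounts flagged 'yellow'.
    Returns one cluster label per point; -1 means noise (an isolated problem).
    A non-negative label shared by several accounts suggests a single upstream
    cable failure rather than several independent truck rolls.
    """
    coords = np.radians(np.asarray(problem_points))
    eps = radius_m / EARTH_RADIUS_M  # haversine distances are in radians
    return DBSCAN(eps=eps, min_samples=min_accounts, metric="haversine").fit_predict(coords)
```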
11. CHALLENGE – STANDARDIZATION OF FEATURES
TWO MAIN CHALLENGES
• FEATURE ASSEMBLY (ENRICHMENT) DURING PREDICTION TIME
• DISCOVERING CORRELATIONS WHEN WE HAVE 25 MILLION CUSTOMERS EACH USING 10 PRODUCTS
WE NEED A STANDARDIZATION OF FEATURES, ACTIONS AND REWARDS
FEATURE STORE – CURATED DATA STORE TO DRIVE MODEL TRAINING AND MODEL PREDICTION
12. ML PIPELINE – ROLES & WORKFLOW
Roles: Business User, Data Scientist, ML Operations
• Inception – Define use case
• Exploration – Explore features; create and publish new features
• Model Development – Create & validate models (iterate)
• Candidate Model Selection – Model selection; model review
• Model Operationalization – Define online feature assembly; define the pipeline to collect outcomes; model deployment and monitoring
• Model Evaluation – Evaluate live model performance
• Go Live Phase – Go live with selected models
• Monitor Live Models – Collect new data & retrain; iterate
14. WHY METADATA DRIVEN?
INSPIRED BY GROUND CONTEXT
• Berkeley’s RISE Lab
• Application Context – parameters, callbacks, “meaty” metadata
• Behavior Context – data sets and code
• Change Context – version history
• Track any change end-to-end -> the entire pipeline is versioned
• Metadata drives what code is run and how
15. AN OVERVIEW OF SPARK FLOWS
[Flow diagram: the raw data stream feeds a versioned feature creation pipeline and a historical raw store; feature creation (on disk or in memory, on demand or continuous) populates the historical feature store and the online feature store; a model consumes these features for prediction, which drives customer experience elements and downstream analysis & business value]
16. FEATURE STORE
TWO TYPES OF FEATURE STORES:
• Online Feature Store – Current values by key (key/value store)
• History Feature Store – Append features as they are collected (e.g., Hadoop File System, AWS S3)
ONLINE FEATURE STORE
• Used in the prediction phase for enrichment
• Needs to support fast ingest and query as it stores current data for a given account or account & device combination
HISTORY FEATURE STORE
• Used to build a history of features
• Data scientists use this store to create their training datasets
MAINTAIN (VERSIONED) RAW DATA SEPARATELY
[Diagram: the feature creation pipeline overwrites the online feature store (prediction phase) and appends to the history feature store (model training phase)]
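A minimal sketch, under assumed interfaces, of the two write paths in the diagram above: overwrite-by-key for the online store and append for the history store. Redis and PySpark are illustrative choices only, not the stores the talk prescribes.

```python
import json
import redis
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("feature-writer").getOrCreate()
online = redis.Redis(host="online-feature-store", port=6379)

def write_online(feature_name, account_number, value):
    """Overwrite the current value for this key (prediction-time lookups)."""
    online.set(f"{feature_name}:{account_number}", json.dumps(value))

def write_history(feature_df, feature_name, version):
    """Append a new batch of feature values to the history store (training)."""
    path = f"s3://feature-store/history/{feature_name}/v{version}/"
    feature_df.write.mode("append").parquet(path)
```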
17. USING THE ONLINE FEATURE STORE
1. Model execution trigger – the payload only contains the model name & account number
2. Feature assembly – model metadata informs which features are needed for a model
3. Pull the required features from the online feature store by account number
4. Pass the full set of assembled features for model execution
5. Prediction
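A minimal sketch of steps 1–5 above, assuming hypothetical metadata and store interfaces (`get_model_metadata`, `online_store.get`, `load_model` are placeholders, not real APIs).

```python
import json

def run_prediction(payload, model_metadata_svc, online_store, load_model):
    # 1. The trigger payload carries only the model name and account number.
    model_name = payload["model_name"]
    account_number = payload["account_number"]

    # 2. Model metadata lists the features this model needs.
    metadata = model_metadata_svc.get_model_metadata(model_name)

    # 3. Pull each required feature from the online feature store by account number.
    features = {
        name: json.loads(online_store.get(f"{name}:{account_number}"))
        for name in metadata["features"]
    }

    # 4-5. Pass the assembled feature set to the model and return the prediction.
    model = load_model(model_name, metadata["version"])
    return model.predict(features)
```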
18. FEATURE CREATION PIPELINE
TWO TYPES:
• Continuous aggregations on streaming data
• On demand features
AGGREGATION FEATURE EXAMPLES
• Number of customer calls in the past 30 days. Key = account number
• Number of signal errors > 2000 in a 24-hour tumbling window. Key = account number + device id
ON DEMAND FEATURE EXAMPLE
• Diagnostic telemetry information for each device for a given customer
• Expensive to collect; only requested on demand
• Model metadata specifies the TTL for such a feature
[Diagram: the continuous stream feeds the aggregation pipeline, and on demand feature requests arrive through an external REST API into the on demand pipeline; both pipelines use feature and model metadata, and a feature writer / feature assembly step persists values to the online feature store and the history feature store]
19. FEATURE METADATA
KEY: NAMESPACE, NAME & VERSION – how do I identify a feature?
ONLINE FEATURE STORE KEY DEFINITIONS: JSON path & script references (code & method) in GitHub – how do I identify a specific instance of a feature?
HISTORICAL FEATURE STORE KEY DEFINITIONS: JSON path to extract identifiers, connection parameters to the history store, script & JSON path to extract partitions – how do I write to the history store(s)?
UPDATE TIMESTAMP EXTRACTORS: combination of JSON paths and script references to extract the timestamp from feature payloads – what is the update timestamp for each feature value? Event vs. ingestion time
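A hypothetical feature metadata record matching the fields above; every value is an illustrative assumption, not actual metadata from the talk.

```python
speed_test_feature_metadata = {
    # How do I identify a feature?
    "namespace": "customer_experience",
    "name": "speed_test",
    "version": 2,
    # How do I identify a specific instance of a feature (online store key)?
    "online_key": {
        "json_path": "$.header.customer_id",
        "script_ref": {"repo": "example-org/feature-scripts", "method": "online_key"},
    },
    # How do I write to the history store(s)?
    "history_key": {
        "identifier_json_path": "$.header.customer_id",
        "connection": {"store": "s3", "bucket": "feature-store-history"},
        "partition_json_path": "$.header.event_date",
    },
    # What is the update timestamp for each value (event vs. ingestion time)?
    "update_ts_extractor": {"json_path": "$.header.event_ts", "fallback": "ingestion_time"},
}
```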
20. EXAMPLE FEATURE VALUE
HEADER: TIMESTAMP, INTERNAL CUSTOMER IDENTIFIER
PAYLOAD: JSON PAYLOAD (EX. SPEED TEST DATA)
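A purely illustrative feature value in the header/payload shape described above; field names and numbers are assumptions.

```python
example_feature_value = {
    "header": {
        "event_ts": "2019-04-23T18:25:43Z",
        "customer_id": "ACCT-0000001",   # internal customer identifier
    },
    "payload": {                          # e.g., speed test data
        "download_mbps": 42.7,
        "upload_mbps": 5.9,
        "latency_ms": 31,
        "device_id": "MODEM-12345",
    },
}
```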
21. INGEST FEATURE VALUE
HEADER: TIMESTAMP, INTERNAL CUSTOMER IDENTIFIER
PAYLOAD: JSON PAYLOAD (EX. SPEED TEST DATA)
[Diagram: the feature ingestion pipeline applies feature metadata and scripts from the scripts repository, then writes the value to the online feature store and the history feature store]
22. MODEL METADATA
KEY: USE CASE, NAME & VERSION – how do I identify a model?
PER-FEATURE DEFINITION: pre-feature-engineering hooks, attribute-level feature engineering hooks, post-feature-engineering hooks, TTL – consistent feature engineering (scripts)
ENVIRONMENT PARAMETERS – define environment parameters for model execution
MODEL DEPLOYMENT DEFINITIONS – how is the model deployed? Autoscaling definitions
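A hypothetical model metadata record for the fields above; all names, scripts, and limits are assumptions made for illustration.

```python
slow_speed_model_metadata = {
    # How do I identify a model?
    "use_case": "proactive_network_repair",
    "name": "slow_speed_root_cause",
    "version": 5,
    # Per-feature definitions: engineering hooks and TTLs for prediction time.
    "features": {
        "speed_test": {
            "pre_hooks": ["drop_nulls"],
            "attribute_hooks": {"download_mbps": "scale_0_1"},
            "post_hooks": ["assemble_vector"],
            "ttl_seconds": 900,
        },
        "device_telemetry": {"pre_hooks": [], "attribute_hooks": {}, "post_hooks": [], "ttl_seconds": 300},
    },
    # Environment parameters for model execution.
    "environment": {"runtime": "python3", "requires_gpu": False},
    # How is the model deployed? Autoscaling definitions.
    "deployment": {"type": "docker_rest", "min_replicas": 2, "max_replicas": 10},
}
```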
23. CONSISTENT DATA
PLACE DATA ON THE SAME PLANE
• S3 (or form a data plane via Alluxio)
• Storage parameters driven by metadata
• Consistent persistence and reads
• Metadata-driven operators
• Historical store holds raw data and engineered features
VERSION THE DATA
• Feature creation keeps metadata paths
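A minimal sketch of metadata-driven, versioned storage paths: one helper derives the location from metadata so every writer and reader lands on the same place. The layout and bucket name are assumptions.

```python
def versioned_path(metadata, kind):
    """kind is 'raw' or 'features'; both live under the same data-plane root."""
    return ("s3://customer-experience-data-plane/"
            f"{kind}/{metadata['namespace']}/{metadata['name']}/v{metadata['version']}/")

# Writers and readers both derive the location from metadata, never hard-code it.
def write_versioned(df, metadata, kind):
    df.write.mode("append").parquet(versioned_path(metadata, kind))

def read_versioned(spark, metadata, kind):
    return spark.read.parquet(versioned_path(metadata, kind))
```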
24. CONSISTENT FEATURE ENGINEERING – MODEL METADATA
FEATURE ENGINEERING MUST BE CONSISTENT ACROSS:
• Training
• Prediction phase
METADATA DRIVEN
• Uses configured scripts, just like feature metadata
• Defines the features used
• Defines a TTL per feature (prediction phase)
SCRIPTS DEFINED IN MODEL METADATA ARE WRITTEN BY THE DATA SCIENTIST
• Used for creating training/testing/validation datasets from raw features; applied to a record of data used for training or during prediction
• Also used at prediction time to perform real-time feature engineering
[Diagram: model metadata references feature engineering scripts in the scripts repository; the same scripts are applied to a record from the history feature store for training and to a record assembled from the online feature store for prediction]
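A minimal sketch of the consistency idea above: one feature-engineering function, of the kind referenced from model metadata, is reused for both the training dataset and a single prediction-time record. All names and fields are assumptions.

```python
def engineer_speed_features(record):
    """Runs on one record, whether it came from the history store (training)
    or was assembled from the online store (prediction)."""
    return {
        "download_ratio": record["download_mbps"] / record["provisioned_mbps"],
        "recent_calls": record.get("calls_30d", 0),
    }

# Training: apply the script to every historical record.
def build_training_rows(history_records):
    return [engineer_speed_features(r) for r in history_records]

# Prediction: apply the very same script to the assembled online record.
def build_prediction_row(online_record):
    return engineer_speed_features(online_record)
```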
25. CONSISTENT FEATURE ENGINEERING
SQL AS A UNIFYING LANGUAGE
• Replace as many operations as possible with their SQL equivalent
• No need to translate code
• No need for a DSL
SPARK AS A UNIFYING LANGUAGE
• Many tools for deeper feature engineering
• Redeploy the same code through streaming / web app
• Fewer frameworks
APPLICABLE AT EVERY PHASE
• Post-ingest
• In-flight or at-rest
• Pre-model
• Standards to fit both stream and batch
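A minimal sketch of SQL as the unifying step: the same SQL text runs over a batch DataFrame for training and can be applied to a streaming DataFrame registered the same way, so nothing is translated between phases. Table and column names are assumptions.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sql-unified-fe").getOrCreate()

FEATURE_SQL = """
    SELECT account_number,
           AVG(download_mbps)                         AS avg_download_mbps,
           SUM(CASE WHEN latency_ms > 100 THEN 1 END) AS high_latency_events
    FROM speed_tests
    GROUP BY account_number
"""

# Batch (training): register historical data and run the SQL.
spark.read.parquet("s3://feature-store/history/speed_test/v2/") \
     .createOrReplaceTempView("speed_tests")
batch_features = spark.sql(FEATURE_SQL)

# Streaming (prediction): register a streaming DataFrame under the same view
# name and run the identical SQL text; the stream source is elided here.
```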
26. MODEL DEPLOYMENT
MODEL AS CODE
• H2O AI POJO
• Spark ML models
• Simple Python scripts – regression models
• Specialized Python scripts – math libraries that need specialized hardware like GPU support
ONE MODEL, MULTIPLE DEPLOYMENT MODES
• Deploy as Docker containers with REST endpoints – easy to test, and used directly if the request has all the features available
• Deploy as map operators within a streaming framework
• Deploy as Lambda/SageMaker Spark functions in AWS
• SparkLauncher
• Databricks Jobs API
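A minimal sketch of the "Docker container with a REST endpoint" option. The model and feature-assembly logic are stubbed out with hypothetical values; only the serving pattern itself is illustrated.

```python
from flask import Flask, jsonify, request

app = Flask(__name__)

def assemble_features(model_name, account_number):
    """Stub: in the real pipeline this pulls the required features from the
    online feature store using model metadata (see slide 17)."""
    return {"download_mbps": 42.7, "calls_30d": 1}

def predict_root_cause(features):
    """Stub standing in for the real model (POJO, Spark ML, Python script)."""
    return "wifi" if features["download_mbps"] > 20 else "network"

@app.route("/predict", methods=["POST"])
def predict():
    payload = request.get_json()
    # Use features supplied in the request, or assemble them from the online store.
    features = payload.get("features") or assemble_features(
        payload["model_name"], payload["account_number"])
    return jsonify({"prediction": predict_root_cause(features)})

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8080)
```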
27. PREDICTION PHASE
ASSEMBLE FEATURES FOR A GIVEN MODEL
[Flow diagram: the requesting application sends a payload of model name + account number to the Feature Store API; feature assembly pulls the model's features from the online feature store using model/feature metadata; if not all features are current, the feature creation pipeline refreshes them in the online and history feature stores before assembly is retried; once all features are current, model execution runs, and results plus customer context are written to the prediction/outcome store and to an append store (e.g., S3, HDFS, Redshift) for use by data scientists for model training; the requesting application listens for the outcome]
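A minimal sketch of the "are all features current?" decision in the flow above, assuming hypothetical TTLs from model metadata, an online-store entry shape, and an on-demand refresh hook; timestamps are epoch seconds.

```python
import time

def assemble_current_features(account_number, model_metadata, online_store, request_on_demand):
    features, stale = {}, []
    now = time.time()
    for name, spec in model_metadata["features"].items():
        entry = online_store.get(f"{name}:{account_number}")  # assumed: {'value': ..., 'updated_ts': ...}
        if entry is None or now - entry["updated_ts"] > spec["ttl_seconds"]:
            stale.append(name)          # missing or too old per the model's TTL
        else:
            features[name] = entry["value"]
    if stale:
        # Not all features are current: ask the feature creation pipeline to
        # refresh them on demand before executing the model.
        request_on_demand(account_number, stale)
        return None
    return features
```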
28. FEATURES OF THE ML PIPELINE
AWS AGNOSTIC
• Integrates with the AWS cloud but is not dependent on it
• The framework should be able to work in a non-AWS distributed environment with configuration (not code) changes
TRACEABILITY, REPEATABILITY & AUDITABILITY
• Models can be traced back to business use cases
• Full traceability from raw data to feature engineering to predictions
• “Everything versioned” enables repeatability
CI/CD SUPPORT
• Code, metadata (hyperparameters) and data (training/validation data) are versioned; deployable artifacts integrate with the CI/CD pipeline
29. NEXT STEPS AND FUTURE WORK
UI PORTAL FOR
• MODEL / FEATURE AND METADATA MANAGEMENT
• CONTAINERIZATION SUPPORT FOR THE MODEL EXECUTION PHASE
• WORKBENCH FOR DATA SCIENTISTS
• CONTINUOUS MODEL MONITORING
KNOWLEDGE SHARING
• Promote reusability: users search for features by model
• Search features by their importance in models
• Real-time model evaluation by comparing predictions with outcomes
• Determining first-class tools
AUTOMATING THE RETRAINING PROCESS
SUPPORT FOR MULTIPLE/PLUGGABLE FEATURE STORES (SLA DRIVEN)
30. SUMMARY
• Metadata Driven – Feature/model definition, versioning, feature assembly, model deployment, and model monitoring are all metadata driven
• Automation – Orchestrated deployment for new features and models
• Rapid Onboarding – Portal for model and feature management as well as model deployment
• Data Consistency – The feature store enforces a consistent data pipeline, ensuring that the data used for training is functionally identical to the data used for predictions
• Monitoring and Metrics – Ability to execute and monitor multiple models in production to enable real-time, metrics-driven model selection
• Iterative/Consistent Model Development – Multiple versions of the models can be developed iteratively while consuming from a consistent dataset (feature store), enabling A/B and multivariate testing