Reducing Cost of Production ML:
Feature Engineering Case Study
Dr. Venkata Pingali
Scribble Data
Outline
© Scribble Data 2018
● Production ML is Complex
● Feature Engineering Overview
● Detailed Cost Drivers
● Indicative Quantitative Improvement
● Approaches for Each Cost Driver
Takeaway: A disciplined approach will deliver 10x improvement
Production ML - Complex and Expensive
Source: https://eng.uber.com/michelangelo/
Distribution of Challenges
Source: https://papers.nips.cc/paper/5656-hidden-technical-debt-in-machine-learning-systems.pdf
“Only a small fraction of real-world ML systems is composed of the ML code, as shown by the
small black box in the middle. The required surrounding infrastructure is vast and complex.”
Paper from Google - NeurIPS 2015
Drivers: Speed, Correctness, Evolution, Scale
Feature Engineering
Feature Engineering - Nature
● Features are variables generated from data
○ Continuous process (Batch + Near Realtime + Realtime)
● Large in number (100s to 1000s) & evolving
● Frequently executed
Retail Customer data (X GB)
Customer | SKU | Name
17826162 | 0293192 | Thai Dragon Fruit
Features (~X/1000)
Customer | Premium | Imported
17826162 | 15% of txns | 5% of spend
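A minimal sketch of how per-customer features like the ones in the example could be derived from raw transactions. The record fields (`is_premium`, `is_imported`, `amount`) and the function name are hypothetical; the slides do not show the actual schema or code.

```python
from collections import defaultdict

def customer_features(transactions):
    """Aggregate raw transactions into per-customer share features.

    Each transaction is a dict with hypothetical keys:
    customer, amount, is_premium, is_imported.
    """
    totals = defaultdict(lambda: {"txns": 0, "premium_txns": 0,
                                  "spend": 0.0, "imported_spend": 0.0})
    for txn in transactions:
        t = totals[txn["customer"]]
        t["txns"] += 1
        t["spend"] += txn["amount"]
        if txn["is_premium"]:
            t["premium_txns"] += 1
        if txn["is_imported"]:
            t["imported_spend"] += txn["amount"]

    features = {}
    for customer, t in totals.items():
        features[customer] = {
            # Share of transactions that were premium items.
            "premium_txn_share": t["premium_txns"] / t["txns"],
            # Share of spend that went to imported items.
            "imported_spend_share": (t["imported_spend"] / t["spend"]
                                     if t["spend"] else 0.0),
        }
    return features

# Toy data, invented for illustration only.
sample = [
    {"customer": "17826162", "amount": 50.0, "is_premium": True, "is_imported": True},
    {"customer": "17826162", "amount": 150.0, "is_premium": False, "is_imported": False},
]
print(customer_features(sample))
```

The output is far smaller than the input because many transaction rows collapse into one row per customer.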
Multi-Phase Feature Engineering
[Figure: Historical, Near RT, and RT phases; multi-scale features (1 min, 1 day, 1 year); models 1 … N]
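One way to picture multi-scale features is the same aggregate computed over several time windows. The sketch below is illustrative only: it computes all three windows in one pass over an in-memory list, whereas a production system would more likely serve the long windows from batch history and the short ones from a streaming path. The function and feature names are assumptions.

```python
from datetime import datetime, timedelta

# Window sizes taken from the slide; the names are illustrative.
WINDOWS = {
    "1min": timedelta(minutes=1),
    "1day": timedelta(days=1),
    "1year": timedelta(days=365),
}

def multi_scale_spend(events, now):
    """Sum spend per window for events given as (timestamp, amount) pairs."""
    result = {}
    for name, width in WINDOWS.items():
        start = now - width
        result[f"spend_{name}"] = sum(
            amount for ts, amount in events if start < ts <= now
        )
    return result

now = datetime(2018, 6, 1, 12, 0, 0)
events = [
    (now - timedelta(seconds=30), 10.0),
    (now - timedelta(hours=5), 20.0),
    (now - timedelta(days=100), 40.0),
]
features = multi_scale_spend(events, now)
print(features)
```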
Emerging Area of Work
● Conceptual
○ Functional Data Engineering (Beauchemin, Lyft)
○ Positive/Negative Data Engineering (Lowin, Prefect)
○ Design discussions: Michelangelo (Uber), Zipline (Airbnb), TFX (Google)
● Implementations
○ Data Science platforms
○ Inhouse Systems
○ ML Engineering Platforms
Feature Engineering Cost Drivers
Cost Driver | Explanation
Correctness & Reconciliation | Confidence in output
Evolution of data/algorithms/features | Safely scaling
Development | Best use of time
Operations incl. Backfilling | Controlled execution
All are forms of end-to-end (in)discipline & known
Usecase - Customer Segmentation
Metric | Gain (2015 ⇒ 2018)
Data Volume | 10x+
Feature Output Size | 10x
Dev Time | 2x
Server Cost | 10x
Bonus: Auditability, Idempotency, Documentation, Notifications
Data from Tier I Retail (2018) and F&B (2015) chains with 1500 stores
Gaining Confidence in Output
● End-to-end auditability
○ All data and all runs
○ Metadata standardization
● Discovery and reuse
○ Pipelines, modules
○ Automatic doc & ownership tracking
● Early warning systems
○ Input/output quality checks
○ Note critical decisions
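The input quality checks and run-level audit trail listed above could look roughly like the following. This is a sketch under assumptions, not the platform's actual mechanism: the function names, the metadata fields, and the choice of a SHA-256 fingerprint over the serialized rows are all illustrative.

```python
import hashlib
import json

def check_rows(rows, required, min_rows=1):
    """Return a list of human-readable problems found in the input rows."""
    problems = []
    if len(rows) < min_rows:
        problems.append(f"expected at least {min_rows} rows, got {len(rows)}")
    for i, row in enumerate(rows):
        missing = [col for col in required if row.get(col) is None]
        if missing:
            problems.append(f"row {i}: missing {missing}")
    return problems

def audit_record(run_id, rows, problems):
    """Build a metadata record that ties a run to a fingerprint of its input."""
    payload = json.dumps(rows, sort_keys=True).encode("utf-8")
    return {
        "run_id": run_id,
        "input_rows": len(rows),
        "input_sha256": hashlib.sha256(payload).hexdigest(),
        "problems": problems,
        "status": "ok" if not problems else "warning",
    }

# Toy input with one deliberately broken row.
rows = [
    {"customer": "17826162", "amount": 12.5},
    {"customer": None, "amount": 3.0},
]
problems = check_rows(rows, required=["customer", "amount"])
record = audit_record("run-001", rows, problems)
print(json.dumps(record, indent=2))
```

Storing the same standardized record for every run is what makes later reconciliation possible: two runs with the same input fingerprint should produce the same features.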
Scaling Deployments
● Isolation: Multi-tenant namespaces
○ File system, tables, S3
○ Runs
● Time: Versioned namespaces
○ Storage locations
○ Metadata
● Version tracking of all data/code
○ Link data and code
Prepare for pipelines, models, versions, & dataset proliferation
Functional Data Engineering - Maxime Beauchemin
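A small sketch of tenant-isolated, versioned namespaces and of linking data to code. The directory layout, the `lineage_id` helper, and the example values are assumptions made for illustration; the slides do not specify a scheme. The deterministic path reflects the functional data engineering idea that a rerun should overwrite its own partition.

```python
import hashlib
from pathlib import PurePosixPath

def namespace_path(root, tenant, pipeline, version, run_date):
    """Build a storage location isolated by tenant and versioned by pipeline.

    Re-running the same (tenant, pipeline, version, run_date) yields the same
    path, so a rerun overwrites its own partition instead of appending.
    """
    return PurePosixPath(root) / tenant / pipeline / f"v{version}" / f"date={run_date}"

def lineage_id(code_version, input_fingerprints):
    """Derive one identifier that links an output to its code and input data."""
    h = hashlib.sha256()
    h.update(code_version.encode("utf-8"))
    for fp in sorted(input_fingerprints):
        h.update(b"\0")  # separator so adjacent values cannot run together
        h.update(fp.encode("utf-8"))
    return h.hexdigest()[:12]

# Hypothetical tenant, pipeline, and version identifiers.
path = namespace_path("/data", "retail-a", "segmentation", 3, "2018-06-01")
print(path)
print(lineage_id("git:abc123", ["pos-2018-06-01:deadbeef"]))
```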
Increasing Development Speed
● Feature DSL
○ Reduced code and errors
● Reusable assets
○ GUI, preprocessing
○ Standard transforms
● Standardized development
○ Unit testing
○ Automated documentation
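To illustrate why a feature DSL can reduce code and errors, here is a toy declarative spec and a small evaluator. This is not Scribble Data's DSL; the spec keys (`agg`, `column`, `where`) and the supported aggregations are invented for the sketch.

```python
# Hypothetical declarative spec: each feature names an aggregation,
# a source column, and an optional row filter.
SPEC = {
    "txn_count": {"agg": "count", "column": "amount"},
    "total_spend": {"agg": "sum", "column": "amount"},
    "premium_spend": {"agg": "sum", "column": "amount",
                      "where": {"tier": "premium"}},
}

AGGREGATIONS = {
    "count": len,
    "sum": sum,
    "max": lambda values: max(values, default=0),
}

def evaluate(spec, rows):
    """Compute every feature in the spec over a list of row dicts."""
    out = {}
    for name, rule in spec.items():
        where = rule.get("where", {})
        values = [
            row[rule["column"]]
            for row in rows
            if all(row.get(k) == v for k, v in where.items())
        ]
        out[name] = AGGREGATIONS[rule["agg"]](values)
    return out

rows = [
    {"amount": 10.0, "tier": "premium"},
    {"amount": 5.0, "tier": "regular"},
]
print(evaluate(SPEC, rows))
```

Adding a feature means adding one entry to the spec, which also gives documentation and unit tests a single place to look.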
Control Execution
● Execution management
○ Parameterization, over-rides
○ Managed data access
● Automated deployment
○ Rollout best practices
● Minimum service integration
○ Performance, background tasks, scheduled execution
● Notifications
○ With callouts
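Parameterization with overrides can be sketched as layered dictionaries where later layers win. The parameter names, the two override layers, and the rule that unknown keys are rejected are assumptions for illustration only.

```python
from collections import ChainMap

# Hypothetical pipeline defaults; none of these names come from the slides.
DEFAULTS = {"run_date": "2018-06-01", "sample_fraction": 1.0, "notify": True}

def resolve_params(defaults, *overrides):
    """Merge parameter layers; later layers win over earlier ones."""
    unknown = {k for layer in overrides for k in layer} - set(defaults)
    if unknown:
        raise KeyError(f"unknown parameters: {sorted(unknown)}")
    # ChainMap looks up keys left to right, so put the last override first.
    return dict(ChainMap(*reversed(overrides), defaults))

env_overrides = {"sample_fraction": 0.1}
cli_overrides = {"run_date": "2018-06-02", "sample_fraction": 0.5}
params = resolve_params(DEFAULTS, env_overrides, cli_overrides)
print(params)
```

Rejecting unknown keys is one simple way to keep execution controlled: a mistyped override fails loudly instead of being silently ignored.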
Key Takeaways
● Production ML is expensive
○ Cost drivers: Speed, Correctness, Evolution, Scaling
● Feature engineering is an important component
○ 10x productivity improvement possible
● Platforms are required to improve speed & reduce risk
○ All major companies have platforms
THANK YOU FOR YOUR TIME
DENVER: Littleton
BANGALORE: Indiranagar | HSR
pingali@scribbledata.io
Feature Engineering
Feature Engineering - Key Activity
Source: https://www.forbes.com/sites/gilpress/2016/03/23/data-preparation-most-time-consuming-least-enjoyable-data-science-task-survey-says/
Modeling = Statistics over Features
Typical DE:DS = 2-3:1
Metric | 2015* | 2018 | Improvement (2015>2018)
Usecase | Customer Segmentation | Customer Segmentation |
Organization | Tier I National F&B Chain | Tier I National Retail Chain |
Raw Data | 100GB PoS Txns | 1.1TB PoS Txns | 11x
Output | 500K x 100 Features | 1M x 500 Features | 10x
Compute Engine | Hive+Pandas | Pandas+Itertoolz |
Dev Timeframe | 3 Months | 1.5 Months | 2x
Compute | 6 Node Cluster x 3 hours | 1 Server x 1.5 hours | ~10x
Bonus: Auditability, Idempotency, Documentation
2018 Implementation strongly informed by 2016
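The improvement figures in the comparison can be checked with simple arithmetic. The sketch assumes decimal units (1 TB = 1000 GB) and treats one cluster node as comparable to one server; neither assumption is stated in the slides.

```python
# Figures copied from the comparison table; units are assumed decimal
# (1 TB = 1000 GB) and one cluster node is assumed comparable to one server.
raw_2015_gb, raw_2018_gb = 100, 1100
cells_2015 = 500_000 * 100      # rows x features
cells_2018 = 1_000_000 * 500
dev_2015_months, dev_2018_months = 3, 1.5
node_hours_2015 = 6 * 3         # 6-node cluster for 3 hours
node_hours_2018 = 1 * 1.5       # 1 server for 1.5 hours

ratios = {
    "raw_data": raw_2018_gb / raw_2015_gb,
    "output_cells": cells_2018 / cells_2015,
    "dev_time": dev_2015_months / dev_2018_months,
    "compute": node_hours_2015 / node_hours_2018,
}
for name, value in ratios.items():
    print(f"{name}: {value:g}x")
```

Under these assumptions the compute ratio works out to 12x in node-hours, which is consistent with the table's rounded "~10x".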
Typical Feature Engineering Cycle
Change From Past: In Production, Everyday
[Figure: cycle diagram with stages Catalog Dataset, Preprocess Data, Compute Features, Test Algorithm, Deploy & Operate, and Refine Usecase]
Production Feature Engineering Challenges
● Growing volume & complexity, changing schema
● Data loss, inconsistencies
● Growing number, type, & complexity
● 100s-1000s, entity common, growing
● Continuously tuned, growing number
● Manage versions, outputs, debug errors
[Figure: components of a real-world ML system. ML Code is a small box in the middle, surrounded by Configuration, Data Collection, Feature Extraction, Data Verification, Machine Resource Management, Analysis Tools, Process Management Tools, Serving Infrastructure, and Monitoring.]

 

Reducing Cost of Production ML: Feature Engineering Case Study

  • 1. Reducing Cost of Production ML: Feature Engineering Case Study. Dr. Venkata Pingali, Scribble Data
  • 2. Outline © Scribble Data 2018
      ● Production ML is Complex
      ● Feature Engineering Overview
      ● Detailed Cost Drivers
      ● Indicative Quantitative Improvement
      ● Approaches for Each Cost Driver
      Takeaway: A disciplined approach will deliver 10x improvement
  • 3. Production ML - Complex and Expensive. Source: https://eng.uber.com/michelangelo/
  • 4. Distribution of Challenges
      “Only a small fraction of real-world ML systems is composed of the ML code, as shown by the small black box in the middle. The required surrounding infrastructure is vast and complex.”
      Paper from Google - NeurIPS 2015: https://papers.nips.cc/paper/5656-hidden-technical-debt-in-machine-learning-systems.pdf
      Drivers: Speed, Correctness, Evolution, Scale
  • 6. Feature Engineering - Nature
      ● Features are variables generated from data
          ○ Continuous process (Batch + Near Realtime + Realtime)
      ● Large in number (100s to 1000s) & evolving
      ● Frequently executed
      Retail Customer (X GB) ⇒ Features (~X/1000)
          Customer   SKU       Name
          17826162   0293192   Thai Dragon Fruit

          Customer   Premium       Imported
          17826162   15% of txns   5% of spend
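As a rough illustration of the example on this slide, the sketch below turns raw transaction rows into the two per-customer features shown (share of premium transactions, share of spend on imported items). The row layout, the `premium` and `imported` flags, and the function name are assumptions made for illustration; the slide only gives the feature names and values.

```python
from collections import defaultdict

def customer_features(transactions):
    """Aggregate raw transaction rows into per-customer share features."""
    totals = defaultdict(
        lambda: {"txns": 0, "spend": 0.0, "premium_txns": 0, "imported_spend": 0.0}
    )
    for t in transactions:
        agg = totals[t["customer"]]
        agg["txns"] += 1
        agg["spend"] += t["amount"]
        if t["premium"]:
            agg["premium_txns"] += 1
        if t["imported"]:
            agg["imported_spend"] += t["amount"]
    return {
        customer: {
            # Fraction of this customer's transactions flagged as premium.
            "premium_txn_share": agg["premium_txns"] / agg["txns"],
            # Fraction of this customer's spend that went to imported items.
            "imported_spend_share": (
                agg["imported_spend"] / agg["spend"] if agg["spend"] else 0.0
            ),
        }
        for customer, agg in totals.items()
    }

# Hypothetical rows; only the first customer/SKU pair comes from the slide.
rows = [
    {"customer": "17826162", "sku": "0293192", "amount": 5.0, "premium": True, "imported": True},
    {"customer": "17826162", "sku": "0000001", "amount": 45.0, "premium": False, "imported": False},
    {"customer": "17826162", "sku": "0000002", "amount": 30.0, "premium": False, "imported": False},
    {"customer": "17826162", "sku": "0000003", "amount": 20.0, "premium": False, "imported": False},
]
features = customer_features(rows)
```

The output holds one small record per customer regardless of how many transactions went in, which is the kind of size reduction the slide's X GB to ~X/1000 figure points at.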
  • 7. Multi-Phase Feature Engineering (figure): Historical, Near RT, RT; Multi-scale Features (1 min, 1 day, 1 year); Models 1 … N
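One way to read "multi-scale features" is the same aggregate computed over trailing windows of different widths. The sketch below assumes events arrive as `(timestamp, amount)` pairs and uses a 365-day year; the window names and the function are hypothetical, not taken from the deck.

```python
from datetime import datetime, timedelta

# Window widths named on the slide; treating a year as 365 days is an assumption.
WINDOWS = {
    "1min": timedelta(minutes=1),
    "1day": timedelta(days=1),
    "1year": timedelta(days=365),
}

def multi_scale_spend(events, now):
    """Sum event amounts inside each trailing window ending at `now`."""
    out = {}
    for name, width in WINDOWS.items():
        start = now - width
        out[f"spend_{name}"] = sum(
            amount for ts, amount in events if start < ts <= now
        )
    return out

now = datetime(2018, 6, 1, 12, 0, 0)
events = [
    (now - timedelta(seconds=30), 10.0),  # inside all three windows
    (now - timedelta(hours=5), 20.0),     # inside 1day and 1year
    (now - timedelta(days=40), 40.0),     # inside 1year only
    (now - timedelta(days=400), 80.0),    # outside every window
]
scales = multi_scale_spend(events, now)
```

In a real deployment the long windows would more plausibly come from a historical batch job and the short ones from a near-realtime or realtime path; this single function only shows that the feature definition can stay the same across phases.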
  • 8. Emerging Area of Work
      ● Conceptual
          ○ Functional Data Engineering (Beauchemin, Lyft)
          ○ Positive/Negative Data Engineering (Lowin, Prefect)
          ○ Design discussions: Michelangelo (Uber), Zipline (Airbnb), TFX (Google)
      ● Implementations
          ○ Data Science platforms
          ○ Inhouse Systems
          ○ ML Engineering Platforms
  • 9. Feature Engineering Cost Drivers
      Cost Driver                              Explanation
      Correctness & Reconciliation             Confidence in output
      Evolution of data/algorithms/features    Safely scaling
      Development                              Best use of time
      Operations incl. Backfilling             Controlled execution
      All are forms of end-to-end (in)discipline & known
  • 10. Usecase - Customer Segmentation
                            Gain (2015 ⇒ 2018)
      Data Volume           10x+
      Feature Output Size   10x
      Dev Time              2x
      Server Cost           10x
      Bonus                 Auditability, Idempotency, Documentation, Notifications
      Data from Tier I Retail (2018) & F&B (2015) chains w/ 1500 Stores
  • 11. Gaining Confidence in Output
      ● End-to-end auditability
          ○ All data and all runs
          ○ Metadata standardization
      ● Discovery and reuse
          ○ Pipelines, modules
          ○ Automatic doc & ownership tracking
      ● Early warning systems
          ○ Input/output quality checks
          ○ Note critical decisions
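The points about input/output quality checks and standardized run metadata could be sketched roughly as follows. The null-rate threshold, the field names in the audit record, and both function names are assumptions for illustration, not the platform's actual schema.

```python
import hashlib
import json

def check_quality(rows, required, max_null_rate=0.1):
    """Return a list of readable problems; an empty list means the batch passed."""
    if not rows:
        return ["no rows"]
    problems = []
    for col in required:
        nulls = sum(1 for r in rows if r.get(col) is None)
        rate = nulls / len(rows)
        if rate > max_null_rate:
            problems.append(f"{col}: null rate {rate:.0%} exceeds {max_null_rate:.0%}")
    return problems

def audit_record(run_id, inputs, outputs, problems):
    """Hypothetical standardized metadata for one run: what went in, what came out."""
    def digest(rows):
        # A content hash lets a later audit confirm which data a run actually saw.
        return hashlib.sha256(json.dumps(rows, sort_keys=True).encode()).hexdigest()
    return {
        "run_id": run_id,
        "input_rows": len(inputs),
        "input_sha256": digest(inputs),
        "output_rows": len(outputs),
        "output_sha256": digest(outputs),
        "problems": problems,
        "status": "ok" if not problems else "flagged",
    }

rows = [{"customer": "a", "amount": 1.0}, {"customer": "b", "amount": None}]
problems = check_quality(rows, ["customer", "amount"])
record = audit_record("run-001", rows, [], problems)
```

Writing a record like this for every run, including failed ones, is one way to cover "all data and all runs" and to give an early warning before bad features reach a model.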
  • 12. Scaling Deployments
      ● Isolation: Multi-tenant namespaces
          ○ File system, tables, S3
          ○ Runs
      ● Time: Versioned namespaces
          ○ Storage locations
          ○ Metadata
      ● Version tracking of all data/code
          ○ Link data and code
      Prepare for pipelines, models, versions, & dataset proliferation
      Reference: Functional Data Engineering - Maxime Beauchemin
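A minimal sketch of tenant-isolated, versioned namespaces is shown below. The path layout (`root/tenant/pipeline/code_version/data_version`) and all names are assumed conventions chosen for illustration; the slide does not specify a layout.

```python
from dataclasses import dataclass
from pathlib import PurePosixPath

@dataclass(frozen=True)
class Namespace:
    tenant: str        # isolation: each tenant gets its own subtree
    pipeline: str
    code_version: str  # assumed to be a tag or commit id
    data_version: str  # assumed to be the input snapshot date

    def path(self, root="/data"):
        """Storage location that is unique per tenant, code version, and data version."""
        return (
            PurePosixPath(root)
            / self.tenant
            / self.pipeline
            / self.code_version
            / self.data_version
        )

    def metadata(self):
        """Record that links the output location to the code and data that produced it."""
        return {
            "tenant": self.tenant,
            "pipeline": self.pipeline,
            "code_version": self.code_version,
            "data_version": self.data_version,
            "output": str(self.path()),
        }

ns = Namespace("retail", "segmentation", "v1.2.0", "2018-06-01")
other = Namespace("fnb", "segmentation", "v1.2.0", "2018-06-01")
```

Because every output path encodes both versions, a rerun with new code or new data lands in a new location instead of overwriting an earlier result, which keeps old outputs available for comparison.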
  • 13. Increasing Development Speed
      ● Feature DSL
          ○ Reduced code and errors
      ● Reusable assets
          ○ GUI, preprocessing
          ○ Standard transforms
      ● Standardized development
          ○ Unit testing
          ○ Automated documentation
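The deck does not show its feature DSL, so the following is only a toy sketch of the idea: features are declared as data and a single shared engine evaluates them. The spec format, the aggregate names, and `compute` are all hypothetical.

```python
# Standard transforms shared by every feature definition.
AGGREGATES = {
    "sum": sum,
    "count": len,
    "mean": lambda xs: sum(xs) / len(xs),
    "max": max,
}

# Each feature is declared as data: (output name, source column, aggregate).
SPEC = [
    ("total_spend", "amount", "sum"),
    ("txn_count", "amount", "count"),
    ("avg_basket", "amount", "mean"),
]

def compute(spec, rows, key):
    """Group rows by `key` and evaluate every declared feature per group."""
    groups = {}
    for r in rows:
        groups.setdefault(r[key], []).append(r)
    return {
        entity: {
            name: AGGREGATES[agg]([m[col] for m in members])
            for name, col, agg in spec
        }
        for entity, members in groups.items()
    }

rows = [
    {"customer": "a", "amount": 10.0},
    {"customer": "a", "amount": 30.0},
    {"customer": "b", "amount": 5.0},
]
result = compute(SPEC, rows, key="customer")
```

Adding a feature here means adding one line to `SPEC` rather than writing new aggregation code, and the declarations themselves can be unit tested and turned into documentation.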
  • 14. Control Execution
      ● Execution management
          ○ Parameterization, over-rides
          ○ Managed data access
      ● Automated deployment
          ○ Rollout best practices
      ● Minimum service integration
          ○ Performance, background tasks, scheduled execution
      ● Notifications
          ○ With callouts
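Parameterization with overrides, combined with reruns that do not duplicate work, might look like the sketch below. The default parameters, the in-memory `store`, and the function names are assumptions; a real system would persist outputs rather than keep them in a dict.

```python
# Hypothetical defaults for a feature pipeline run.
DEFAULTS = {"run_date": "2018-06-01", "lookback_days": 30, "notify": True}

def resolve_params(defaults, overrides):
    """Merge overrides into defaults, rejecting names the pipeline does not know."""
    unknown = set(overrides) - set(defaults)
    if unknown:
        raise KeyError(f"unknown parameters: {sorted(unknown)}")
    return {**defaults, **overrides}

def run(params, store, compute):
    """Rerun-safe execution: identical parameters always map to the same output slot."""
    key = tuple(sorted(params.items()))
    if key not in store:
        store[key] = compute(params)
    return store[key]

calls = []

def fake_compute(p):
    # Stand-in for the real feature computation; records that it was invoked.
    calls.append(p["run_date"])
    return p["lookback_days"] * 2

store = {}
params = resolve_params(DEFAULTS, {"lookback_days": 7})
first = run(params, store, fake_compute)
second = run(params, store, fake_compute)  # reuses the stored output
```

Backfilling then reduces to calling `run` once per historical `run_date` override; dates that were already computed are skipped.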
  • 15. Key Takeaways
      ● Production ML is expensive
          ○ Cost drivers: Speed, Correctness, Evolution, Scaling
      ● Feature engineering is an important component
          ○ 10x productivity improvement possible
      ● Platforms are required to improve speed & reduce risk
          ○ All major companies have platforms
  • 16. Thank you for your time. Denver (Littleton) | Bangalore (Indiranagar, HSR). pingali@scribbledata.io
  • 18. Feature Engineering - Key Activity
      Source: https://www.forbes.com/sites/gilpress/2016/03/23/data-preparation-most-time-consuming-least-enjoyable-data-science-task-survey-says/
      Modeling = Statistics over Features
      Typical DE:DS = 2-3:1
  • 19.
                        2015*                        2018                           Improvement (2015>2018)
      Usecase           Customer Segmentation        Customer Segmentation
      Organization      Tier I National F&B Chain    Tier I National Retail Chain
      Raw Data          100GB PoS Txns               1.1TB PoS Txns                 11x
      Output            500K x 100 Features          1M x 500 Features              10x
      Compute Engine    Hive+Pandas                  Pandas+Itertoolz
      Dev Timeframe     3 Months                     1.5 Months                     2x
      Compute           6 Node Cluster x 3 hours     1 Server x 1.5 hour            ~10x
      Bonus: Auditability, Idempotency, Documentation
      2018 Implementation strongly informed by 2016
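The improvement column can be checked with simple ratios. The sketch below assumes 1 TB = 1000 GB and compares compute in node-hours, treating the single 2018 server as comparable to one 2015 cluster node; both are assumptions, since the slide gives no hardware details.

```python
def ratio(larger, smaller):
    """Improvement factor between two measurements of the same quantity."""
    return larger / smaller

# Raw data: 1.1 TB vs 100 GB, assuming 1 TB = 1000 GB.
data_gain = ratio(1100, 100)

# Output size: rows x features.
output_gain = ratio(1_000_000 * 500, 500_000 * 100)

# Development time in months (smaller is better, so old / new).
dev_gain = ratio(3, 1.5)

# Compute in node-hours: 6 nodes x 3 h vs 1 server x 1.5 h (old / new).
compute_gain = ratio(6 * 3, 1 * 1.5)
```

Under these assumptions the ratios come out to 11x, 10x, 2x, and 12x, the last of which the slide reports as roughly 10x.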
  • 20. Typical Feature Engineering Cycle
      Change From Past: In Production, Everyday
      Cycle (figure): Catalog, Compute Features, Test Algorithm, Preprocess Data, Refine Usecase, Deploy & Operate, Dataset
  • 21. Production Feature Engineering Challenges: Growing Volume & Complexity, Changing Schema
  • 22. Production Feature Engineering Challenges: Data loss, inconsistencies
  • 23. Production Feature Engineering Challenges: Growing number, type, & complexity
  • 24. Production Feature Engineering Challenges: 100s-1000s Entity common, growing
  • 25. Production Feature Engineering Challenges: Continuously Tuned, Growing number
  • 26. Production Feature Engineering Challenges: Manage versions, outputs, debug errors
  • 27. Feature Engineering
      Only a small fraction of real-world ML systems is composed of the ML code, as shown by the small navy blue box in the middle. The required surrounding infrastructure is vast and complex.
      Figure labels: ML Code, Feature Extraction, Data Collection, Analysis Tools, Data Verification, Process Management Tools, Machine Resource Management, Serving Infrastructure, Configuration, Monitoring