Reducing Cost of Production ML: Feature Engineering Case Study
Venkata Pingali
Early talk given at the Fifth Elephant Winter Edition in Mumbai in January 2019.
1.
Reducing Cost of Production ML: Feature Engineering Case Study
Dr. Venkata Pingali, Scribble Data
2.
Outline
© Scribble Data 2018
● Production ML is Complex
● Feature Engineering Overview
● Detailed Cost Drivers
● Indicative Quantitative Improvement
● Approaches for Each Cost Driver
Takeaway: A disciplined approach will deliver 10x improvement
3.
Production ML - Complex and Expensive
Source: https://eng.uber.com/michelangelo/
4.
Distribution of Challenges
Source: https://papers.nips.cc/paper/5656-hidden-technical-debt-in-machine-learning-systems.pdf
“Only a small fraction of real-world ML systems is composed of the ML code, as shown by the small black box in the middle. The required surrounding infrastructure is vast and complex.”
Paper from Google - NeurIPS 2015
Drivers: Speed, Correctness, Evolution, Scale
5.
Feature Engineering
6.
Feature Engineering - Nature
● Features are variables generated from data
  ○ Continuous process (Batch + Near Realtime + Realtime)
● Large in number (‘00s to ‘000s) & evolving
● Frequently executed
Retail Customer (X GB)
Customer | SKU | Name
17826162 | 0293192 | Thai Dragon Fruit
Features (~X/1000)
Customer | Premium | Imported
17826162 | 15% of txns | 5% of spend
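The reduction from raw transactions to a small per-customer feature row can be sketched as follows. This is a minimal illustration, not the speaker's implementation: the `is_premium` and `is_imported` flags, the amounts, and every row except the identifiers taken from the slide are assumptions made up for the example.

```python
from collections import defaultdict

# Hypothetical raw transactions:
# (customer_id, sku, name, amount, is_premium, is_imported).
# Only the first row's identifiers come from the slide; the rest is invented.
TRANSACTIONS = [
    ("17826162", "0293192", "Thai Dragon Fruit", 12.0, True, True),
    ("17826162", "0000001", "Milk", 3.0, False, False),
    ("17826162", "0000002", "Bread", 5.0, False, False),
    ("17826162", "0000003", "Rice", 20.0, False, False),
]


def customer_features(transactions):
    """Collapse many transaction rows into one feature row per customer."""
    txn_count = defaultdict(int)
    premium_txns = defaultdict(int)
    total_spend = defaultdict(float)
    imported_spend = defaultdict(float)
    for customer, _sku, _name, amount, is_premium, is_imported in transactions:
        txn_count[customer] += 1
        total_spend[customer] += amount
        if is_premium:
            premium_txns[customer] += 1
        if is_imported:
            imported_spend[customer] += amount
    return {
        customer: {
            # Share of this customer's transactions that were premium items.
            "premium_txn_share": premium_txns[customer] / txn_count[customer],
            # Share of this customer's spend that went to imported items.
            "imported_spend_share": imported_spend[customer] / total_spend[customer],
        }
        for customer in txn_count
    }


features = customer_features(TRANSACTIONS)
print(features["17826162"])
```

The output has one row per customer regardless of how many transactions went in, which is why the feature table is so much smaller than the raw data.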
7.
Multi-Phase Feature Engineering
[Figure: Historical, Near RT, and RT phases; Multi-scale Features (1 min, 1 day, 1 year); Models 1 … N]
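One way to read "multi-scale features" is the same aggregate computed over trailing windows of different lengths. The sketch below assumes that reading; the event list, the choice of a sum as the aggregate, and the 365-day approximation of one year are all assumptions for illustration.

```python
from datetime import datetime, timedelta

# Window sizes taken from the slide; 365 days is an assumed stand-in for "1 year".
WINDOWS = {
    "1_min": timedelta(minutes=1),
    "1_day": timedelta(days=1),
    "1_year": timedelta(days=365),
}


def multi_scale_spend(events, now):
    """Sum event amounts that fall inside each trailing window ending at `now`."""
    result = {}
    for name, width in WINDOWS.items():
        start = now - width
        result[name] = sum(amount for ts, amount in events if start < ts <= now)
    return result


# Hypothetical (timestamp, amount) events for a single customer.
now = datetime(2019, 1, 15, 12, 0, 0)
events = [
    (now - timedelta(seconds=30), 10.0),
    (now - timedelta(hours=5), 20.0),
    (now - timedelta(days=100), 40.0),
    (now - timedelta(days=400), 80.0),
]
scales = multi_scale_spend(events, now)
print(scales)
```

In a real deployment the short windows would typically be served by a realtime path and the long ones by batch jobs; here a single function covers all three purely to show the idea.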
8.
Emerging Area of Work
● Conceptual
  ○ Functional Data Engg (Beauchemin, Lyft)
  ○ Positive/Negative Data Engg (Lowin, Prefect)
  ○ Design discussions: Michelangelo (Uber), Zipline (AirBnB), TFX (Google)
● Implementations
  ○ Data Science platforms
  ○ Inhouse Systems
  ○ ML Engineering Platforms
9.
Feature Engineering Cost Drivers
Cost Driver | Explanation
Correctness & Reconciliation | Confidence in output
Evolution of data/algorithms/features | Safely scaling
Development | Best use of time
Operations incl Backfilling | Controlled execution
All are forms of end-to-end (in)discipline & known
10.
Usecase - Customer Segmentation
Measure | Gain (2015 ⇒ 2018)
Data Volume | 10x+
Feature Output Size | 10x
Dev Time | 2x
Server Cost | 10x
Bonus | Auditability, Idempotency, Documentation, Notifications
Data from Tier I Retail (2018) & F&B (2015) chains w/ 1500 Stores
11.
Gaining Confidence in Output
● End-to-end auditability
  ○ All data and all runs
  ○ Metadata standardization
● Discovery and reuse
  ○ Pipelines, modules
  ○ Automatic doc & ownership tracking
● Early warning systems
  ○ Input/output quality checks
  ○ Note critical decisions
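An input quality check combined with a standardized run record might look like the sketch below. The field names, the `run_id` format, and the use of SHA-256 fingerprints are assumptions chosen for illustration; the talk does not specify a metadata schema.

```python
import hashlib
import json


def check_rows(rows, required_fields):
    """Return a list of human-readable problems found in the rows."""
    problems = []
    if not rows:
        problems.append("no rows")
    for i, row in enumerate(rows):
        for field in required_fields:
            if row.get(field) is None:
                problems.append(f"row {i}: missing {field}")
    return problems


def audit_record(run_id, input_rows, output_rows, problems):
    """Build a metadata record that ties a run to fingerprints of its data."""

    def fingerprint(rows):
        payload = json.dumps(rows, sort_keys=True).encode("utf-8")
        return hashlib.sha256(payload).hexdigest()

    return {
        "run_id": run_id,
        "input_rows": len(input_rows),
        "input_sha256": fingerprint(input_rows),
        "output_rows": len(output_rows),
        "output_sha256": fingerprint(output_rows),
        "problems": problems,
        "status": "ok" if not problems else "warning",
    }


# Hypothetical input with one bad row.
inputs = [
    {"customer": "17826162", "amount": 12.0},
    {"customer": None, "amount": 3.0},
]
problems = check_rows(inputs, required_fields=["customer", "amount"])
outputs = [row for row in inputs if row["customer"] is not None]
record = audit_record("run-0001", inputs, outputs, problems)
print(json.dumps(record, indent=2))
```

Storing the record alongside the output means a later reader can tell which input produced which output and whether any warning was raised at the time.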
12.
Scaling Deployments
● Isolation: Multi-tenant namespaces
  ○ File system, tables, S3
  ○ Runs
● Time: Versioned namespaces
  ○ Storage locations
  ○ Metadata
● Version tracking of all data/code
  ○ Link data and code
Prepare for pipelines, models, versions, & dataset proliferation
Functional Data Engineering - Maxime Beauchemin
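A versioned, per-tenant namespace can be as simple as a deterministic path layout plus a small metadata record linking the output to the code version. The layout below (`root/tenant/pipeline/vVERSION/date`) and all names in it are hypothetical; it is one possible scheme, not the one used in the talk.

```python
from pathlib import PurePosixPath


def namespace_path(root, tenant, pipeline, code_version, run_date):
    """Build an isolated, versioned storage location for one pipeline run."""
    return PurePosixPath(root) / tenant / pipeline / f"v{code_version}" / run_date


def run_metadata(tenant, pipeline, code_version, run_date, root="/data"):
    """Record which code version produced which output location."""
    location = namespace_path(root, tenant, pipeline, code_version, run_date)
    return {
        "tenant": tenant,
        "pipeline": pipeline,
        "code_version": code_version,
        "run_date": run_date,
        "output": str(location),
    }


# Hypothetical tenant and pipeline names.
meta = run_metadata("acme", "segmentation", "1.2.0", "2018-12-01")
print(meta["output"])
```

Because the tenant and the version are both part of the path, two tenants or two code versions never write to the same location, and rerunning an old version does not overwrite newer output.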
13.
Increasing Development Speed
● Feature DSL
  ○ Reduced code and errors
● Reusable assets
  ○ GUI, preprocessing
  ○ Standard transforms
● Standardized development
  ○ Unit testing
  ○ Automated documentation
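The slides do not show the feature DSL itself, so the following is only a guess at the general idea: features declared as data and evaluated by one shared interpreter, so that adding a feature means adding a spec rather than writing new code. The spec keys (`name`, `column`, `agg`, `where`) are invented for this sketch.

```python
# Hypothetical declarative specs: each feature names a column, an aggregate,
# and an optional row filter. This illustrates the idea, not the talk's DSL.
FEATURE_SPECS = [
    {"name": "total_spend", "column": "amount", "agg": "sum"},
    {"name": "txn_count", "column": "amount", "agg": "count"},
    {"name": "max_premium_spend", "column": "amount", "agg": "max",
     "where": {"premium": True}},
]

AGGREGATES = {"sum": sum, "count": len, "max": max}


def compute_features(rows, specs):
    """Evaluate every spec against the rows; None when no row matches."""
    features = {}
    for spec in specs:
        where = spec.get("where", {})
        values = [
            row[spec["column"]]
            for row in rows
            if all(row.get(key) == value for key, value in where.items())
        ]
        features[spec["name"]] = AGGREGATES[spec["agg"]](values) if values else None
    return features


# Hypothetical transactions for one customer.
rows = [
    {"amount": 12.0, "premium": True},
    {"amount": 3.0, "premium": False},
    {"amount": 5.0, "premium": False},
]
dsl_features = compute_features(rows, FEATURE_SPECS)
print(dsl_features)
```

Keeping the specs as plain data also makes them easy to unit test and to document automatically, which lines up with the "standardized development" bullets above.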
14.
Control Execution
● Execution management
  ○ Parameterization, over-rides
  ○ Managed data access
● Automated deployment
  ○ Rollout best practices
● Minimum service integration
  ○ Performance, background tasks, scheduled execution
● Notifications
  ○ With callouts
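Parameterization with over-rides and a notification hook can be sketched in a few lines. The parameter names, the default values, and the `notify` callback are assumptions for illustration; a production system would route notifications to email or chat rather than a list.

```python
# Hypothetical default parameters for a feature pipeline.
DEFAULT_PARAMS = {"window_days": 30, "min_txns": 5, "output_format": "csv"}


def resolve_params(defaults, overrides):
    """Merge overrides onto defaults, rejecting names that are not known."""
    unknown = set(overrides) - set(defaults)
    if unknown:
        raise KeyError(f"unknown parameters: {sorted(unknown)}")
    return {**defaults, **overrides}


def execute(task, overrides=None, notify=print):
    """Run `task` with resolved parameters and report the outcome via `notify`."""
    params = resolve_params(DEFAULT_PARAMS, overrides or {})
    try:
        result = task(params)
    except Exception as exc:
        notify(f"FAILED with {params}: {exc}")
        raise
    notify(f"OK with {params}")
    return result


# Collect notifications in a list instead of printing them.
messages = []
result = execute(
    lambda params: params["window_days"] * 2,
    overrides={"window_days": 7},
    notify=messages.append,
)
print(result, messages)
```

Rejecting unknown override names catches typos at launch time instead of letting a run silently fall back to a default.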
15.
Key Takeaways
● Production ML is expensive
  ○ Cost drivers: Speed, Correctness, Evolution, Scaling
● Feature engineering important component
  ○ 10x productivity improvement possible
● Platforms required to improve speed & reduce risk
  ○ All major companies have platforms
16.
THANK YOU FOR YOUR TIME
DENVER: Littleton
BANGALORE: Indiranagar | HSR
pingali@scribbledata.io
17.
Feature Engineering
18.
Feature Engineering - Key Activity
Source: https://www.forbes.com/sites/gilpress/2016/03/23/data-preparation-most-time-consuming-least-enjoyable-data-science-task-survey-says/
Modeling = Statistics over Features
Typical DE:DS = 2-3:1
19.
Attribute | 2015* | 2018 | Improvement (2015>2018)
Usecase | Customer Segmentation | Customer Segmentation
Organization | Tier I National F&B Chain | Tier I National Retail Chain
Raw Data | 100GB PoS Txns | 1.1TB PoS Txns | 11x
Output | 500K x 100 Features | 1M x 500 Features | 10x
Compute Engine | Hive+Pandas | Pandas+Itertoolz
Dev Timeframe | 3 Months | 1.5 Months | 2x
Compute | 6 Node Cluster x 3 hours | 1 Server x 1.5 hour | ~10x
Bonus: Auditability, Idempotency, Documentation
2018 Implementation strongly informed by 2016
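The improvement factors on this slide can be rechecked from the raw figures. The sketch below assumes decimal units (1.1 TB = 1100 GB), measures output size as rows times features, and measures compute as machines times hours, treating a cluster node and the single server as comparable, which the slide does not state.

```python
# Raw data: 100 GB in 2015 versus 1.1 TB in 2018 (1 TB taken as 1000 GB).
raw_gain = 1100 / 100

# Output size measured as rows x features.
output_gain = (1_000_000 * 500) / (500_000 * 100)

# Development time in months.
dev_gain = 3 / 1.5

# Compute measured as machines x hours; assumes a node and the server are comparable.
compute_gain = (6 * 3) / (1 * 1.5)

print(raw_gain, output_gain, dev_gain, compute_gain)
```

Under these assumptions the compute ratio comes out at 12x in machine-hours, consistent with the slide's rounded "~10x".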
20.
Typical Feature Engineering Cycle
Change From Past: In Production, Everyday
[Cycle diagram: Catalog Dataset, Preprocess Data, Compute Features, Test Algorithm, Refine Usecase, Deploy & Operate]
21.
Production Feature Engineering Challenges
Growing Volume & Complexity, Changing Schema
22.
Production Feature Engineering Challenges
Data loss, inconsistencies
23.
Production Feature Engineering Challenges
Growing number, type, & complexity
24.
Production Feature Engineering Challenges
100s-1000s Entity common, growing
25.
Production Feature Engineering Challenges
Continuously Tuned, Growing number
26.
Production Feature Engineering Challenges
Manage versions, outputs, debug errors
27.
Feature Engineering
Only a small fraction of real-world ML systems is composed of the ML code, as shown by the small navy blue box in the middle. The required surrounding infrastructure is vast and complex.
[Figure: ML Code, Feature Extraction, Data Collection, Analysis Tools, Data Verification, Process Management Tools, Machine Resource Management, Servicing Infrastructure, Configuration, Monitoring]