Reducing Cost of Production ML: Feature Engineering Case Study
Venkata Pingali
Early talk given at the Fifth Elephant Winter Edition in Mumbai, January 2019.
1.
Reducing Cost of Production ML: Feature Engineering Case Study
Dr. Venkata Pingali, Scribble Data
2.
Outline
● Production ML is Complex
● Feature Engineering Overview
● Detailed Cost Drivers
● Indicative Quantitative Improvement
● Approaches for Each Cost Driver
Takeaway: A disciplined approach will deliver 10x improvement
© Scribble Data 2018
3.
Production ML - Complex and Expensive
Source: https://eng.uber.com/michelangelo/
4.
Distribution of Challenges
Source: https://papers.nips.cc/paper/5656-hidden-technical-debt-in-machine-learning-systems.pdf
“Only a small fraction of real-world ML systems is composed of the ML code, as shown by the small black box in the middle. The required surrounding infrastructure is vast and complex.”
Paper from Google - NeurIPS 2015
Drivers: Speed, Correctness, Evolution, Scale
5.
Feature Engineering
6.
Feature Engineering - Nature
● Features are variables generated from data
○ Continuous process (Batch + Near Realtime + Realtime)
● Large in number (‘00s to ‘000s) & evolving
● Frequently executed
[Figure: raw data reduced to features]
Retail Customer (X GB): Customer 17826162, SKU 0293192, Name Thai Dragon Fruit
Features (~X/1000): Customer 17826162, Premium Imported: 15% of txns, 5% of spend
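The reduction from raw transactions to per-customer features can be sketched as below. This is a minimal illustration, not the speaker's implementation: the function name `category_share_features`, the row layout, the "Staples" category, and every row other than the slide's customer and SKU are made up, with amounts (in cents) chosen so the result matches the slide's 15% of transactions and 5% of spend.

```python
from collections import defaultdict


def category_share_features(rows, category):
    """Return {customer: (share_of_txns, share_of_spend)} for one category.

    rows are (customer, sku, category, amount_cents) tuples (hypothetical layout).
    """
    # Per customer: [txn count, category txn count, spend, category spend]
    totals = defaultdict(lambda: [0, 0, 0, 0])
    for customer, _sku, row_category, amount_cents in rows:
        t = totals[customer]
        t[0] += 1
        t[2] += amount_cents
        if row_category == category:
            t[1] += 1
            t[3] += amount_cents
    return {
        customer: (cat_txns / txns, cat_spend / spend if spend else 0.0)
        for customer, (txns, cat_txns, spend, cat_spend) in totals.items()
    }


# Made-up rows: 20 transactions, 3 of them in "Premium Imported".
rows = (
    [("17826162", "0293192", "Premium Imported", 500)] * 3
    + [("17826162", "0000001", "Staples", 1700)] * 16
    + [("17826162", "0000002", "Staples", 1300)]
)
features = category_share_features(rows, "Premium Imported")
print(features["17826162"])
```

Twenty raw rows collapse into one small record per customer, which is the size reduction the slide points at.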
7.
Multi-Phase Feature Engineering
[Figure labels: Historical, Near RT, RT; Models 1 …. N]
Multi-scale Features (1 min, 1 day, 1 year)
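One way to read "multi-scale" is the same aggregate computed over trailing windows of different widths. The sketch below counts events per window; the window names, the `multi_scale_counts` helper, and the sample timestamps are assumptions made for illustration, and a year is approximated as 365 days.

```python
from datetime import datetime, timedelta

# Hypothetical window definitions matching the scales named on the slide.
WINDOWS = {
    "1min": timedelta(minutes=1),
    "1day": timedelta(days=1),
    "1year": timedelta(days=365),  # approximation, ignores leap years
}


def multi_scale_counts(event_times, now):
    """Count events falling in each trailing window (now - width, now]."""
    return {
        name: sum(1 for t in event_times if now - width < t <= now)
        for name, width in WINDOWS.items()
    }


now = datetime(2019, 1, 15, 12, 0, 0)
events = [
    now - timedelta(seconds=30),
    now - timedelta(hours=5),
    now - timedelta(days=40),
    now - timedelta(days=400),  # older than every window
]
counts = multi_scale_counts(events, now)
print(counts)
```

In a real pipeline the short windows would typically come from a streaming path and the long ones from batch history; here a single in-memory list stands in for both.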
8.
Emerging Area of Work
● Conceptual
○ Functional Data Engineering (Beauchemin, Lyft)
○ Positive/Negative Data Engineering (Lowin, Prefect)
○ Design discussions: Michelangelo (Uber), Zipline (AirBnB), TFX (Google)
● Implementations
○ Data Science platforms
○ Inhouse Systems
○ ML Engineering Platforms
9.
Feature Engineering Cost Drivers
Cost Driver | Explanation
Correctness & Reconciliation | Confidence in output
Evolution of data/algorithms/features | Safely scaling
Development | Best use of time
Operations incl Backfilling | Controlled execution
All are forms of end-to-end (in)discipline & known
10.
Usecase - Customer Segmentation
Metric | Gain (2015 ⇒ 2018)
Data Volume | 10x+
Feature Output Size | 10x
Dev Time | 2x
Server Cost | 10x
Bonus | Auditability, Idempotency, Documentation, Notifications
Data from Tier I Retail (2018) & F&B (2015) chains w/ 1500 Stores
11.
Gaining Confidence in Output
● End-to-end auditability
○ All data and all runs
○ Metadata standardization
● Discovery and reuse
○ Pipelines, modules
○ Automatic doc & ownership tracking
● Early warning systems
○ Input/output quality checks
○ Note critical decisions
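The "input/output quality checks" item could look something like the following. This is a sketch under stated assumptions: the function names, the row format (a list of dicts), and the rule that feature values are shares between 0 and 1 are all hypothetical and not taken from the talk.

```python
def check_input(rows, required):
    """Return human-readable problems found in a list of dict rows."""
    problems = []
    if not rows:
        problems.append("input is empty")
    for i, row in enumerate(rows):
        # A column counts as missing when it is absent, None, or an empty string.
        missing = [c for c in required if row.get(c) in (None, "")]
        if missing:
            problems.append(f"row {i}: missing {', '.join(missing)}")
    return problems


def check_output(features, expected_entities):
    """Flag entities without features and share-type values outside [0, 1]."""
    problems = []
    absent = sorted(set(expected_entities) - set(features))
    if absent:
        problems.append(f"no features for: {', '.join(absent)}")
    for entity, values in features.items():
        for name, value in values.items():
            if not 0.0 <= value <= 1.0:
                problems.append(f"{entity}.{name} out of range: {value}")
    return problems


input_problems = check_input(
    [{"customer": "c1", "amount": 10}, {"customer": "", "amount": 5}],
    required=["customer", "amount"],
)
output_problems = check_output({"c1": {"share": 1.5}}, expected_entities=["c1", "c2"])
print(input_problems, output_problems)
```

Returning a list of problems rather than raising on the first one lets a pipeline log everything it found and decide separately whether to halt or just notify.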
12.
Scaling Deployments
● Isolation: Multi-tenant namespaces
○ File system, tables, S3
○ Runs
● Time: Versioned namespaces
○ Storage locations
○ Metadata
● Version tracking of all data/code
○ Link data and code
Prepare for pipelines, models, versions, & dataset proliferation
Functional Data Engineering - Maxime Beauchemin
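A minimal sketch of tenant-scoped, versioned namespaces with a manifest that links data to code is shown below. The path layout, the function names, and the manifest fields are assumptions chosen for illustration; the talk does not specify a concrete scheme.

```python
import hashlib
from pathlib import PurePosixPath


def run_namespace(root, tenant, pipeline, version, run_id):
    """Build an isolated storage location: root/tenant/pipeline/vN/run_id."""
    return PurePosixPath(root) / tenant / pipeline / f"v{version}" / run_id


def run_manifest(namespace, code_version, input_bytes):
    """Record which code version produced output from which input data."""
    return {
        "namespace": str(namespace),
        "code_version": code_version,  # e.g. a git commit id (hypothetical)
        "input_sha256": hashlib.sha256(input_bytes).hexdigest(),
    }


namespace = run_namespace("/data", "acme", "segmentation", 2, "2019-01-15")
manifest = run_manifest(namespace, "abc1234", b"customer,sku,amount\n")
print(manifest)
```

Because tenant, pipeline version, and run each get their own path segment, two tenants or two versions never write to the same location, and the manifest is enough to trace any output back to its inputs and code.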
13.
Increasing Development Speed
● Feature DSL
○ Reduced code and errors
● Reusable assets
○ GUI, preprocessing
○ Standard transforms
● Standardized development
○ Unit testing
○ Automated documentation
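To make the "Feature DSL" idea concrete, here is a toy declarative spec and a small interpreter for it. The spec format, the `AGGREGATES` table, and `compute_features` are invented for this sketch and should not be read as the DSL used in the talk.

```python
# Named aggregates the spec may refer to (hypothetical vocabulary).
AGGREGATES = {
    "sum": sum,
    "count": len,
    "max": max,
    "mean": lambda values: sum(values) / len(values),
}

# Each feature is declared as data rather than written as code.
SPEC = [
    {"name": "total_spend", "column": "amount", "agg": "sum"},
    {"name": "txn_count", "column": "amount", "agg": "count"},
    {"name": "largest_txn", "column": "amount", "agg": "max"},
]


def compute_features(rows, key, spec):
    """Group dict rows by `key` and evaluate every declared feature per group."""
    grouped = {}
    for row in rows:
        grouped.setdefault(row[key], []).append(row)
    return {
        entity: {
            f["name"]: AGGREGATES[f["agg"]]([r[f["column"]] for r in group])
            for f in spec
        }
        for entity, group in grouped.items()
    }


rows = [
    {"customer": "a", "amount": 10},
    {"customer": "a", "amount": 30},
    {"customer": "b", "amount": 5},
]
result = compute_features(rows, "customer", SPEC)
print(result)
```

Adding a feature means adding one line to the spec, which is where the "reduced code and errors" benefit would come from, and the spec itself can double as documentation.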
14.
Control Execution
● Execution management
○ Parameterization, over-rides
○ Managed data access
● Automated deployment
○ Rollout best practices
● Minimum service integration
○ Performance, background tasks, scheduled execution
● Notifications
○ With callouts
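Parameterization with over-rides, combined with the idempotency listed among the gains elsewhere in the deck, might be sketched as follows. The default parameter names, `resolve_params`, `run_once`, and the in-memory `store` are all hypothetical stand-ins; a real system would persist results rather than keep them in a dict.

```python
DEFAULTS = {"window_days": 30, "min_txns": 1}  # hypothetical pipeline parameters


def resolve_params(defaults, overrides):
    """Merge per-run overrides onto defaults, rejecting unknown names."""
    unknown = set(overrides) - set(defaults)
    if unknown:
        raise KeyError(f"unknown parameters: {sorted(unknown)}")
    return {**defaults, **overrides}


def run_once(store, run_date, params, compute):
    """Idempotent execution: the same date and params never recompute."""
    key = (run_date, tuple(sorted(params.items())))
    if key not in store:
        store[key] = compute(run_date, params)
    return store[key]


calls = []


def fake_compute(run_date, params):
    calls.append(run_date)  # track how often real work happens
    return f"features for {run_date} over {params['window_days']} days"


store = {}
params = resolve_params(DEFAULTS, {"window_days": 7})
first = run_once(store, "2019-01-15", params, fake_compute)
second = run_once(store, "2019-01-15", params, fake_compute)  # served from store
print(first, len(calls))
```

Keying runs on date plus resolved parameters means a backfill can be re-issued safely: repeated requests return the stored result instead of doing the work twice.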
15.
Key Takeaways
● Production ML is expensive
○ Cost drivers: Speed, Correctness, Evolution, Scaling
● Feature engineering is an important component
○ 10x productivity improvement possible
● Platforms required to improve speed & reduce risk
○ All major companies have platforms
16.
THANK YOU FOR YOUR TIME
DENVER: Littleton
BANGALORE: Indiranagar | HSR
pingali@scribbledata.io
17.
Feature Engineering
18.
Feature Engineering - Key Activity
Source: https://www.forbes.com/sites/gilpress/2016/03/23/data-preparation-most-time-consuming-least-enjoyable-data-science-task-survey-says/
Modeling = Statistics over Features
Typical DE:DS = 2-3:1
19.
Metric | 2015* | 2018 | Improvement (2015>2018)
Usecase | Customer Segmentation | Customer Segmentation |
Organization | Tier I National F&B Chain | Tier I National Retail Chain |
Raw Data | 100GB PoS Txns | 1.1TB PoS Txns | 11x
Output | 500K x 100 Features | 1M x 500 Features | 10x
Compute Engine | Hive+Pandas | Pandas+Itertoolz |
Dev Timeframe | 3 Months | 1.5 Months | 2x
Compute | 6 Node Cluster x 3 hours | 1 Server x 1.5 hour | ~ 10x
Bonus | | Auditability, Idempotency, Documentation |
2018 Implementation strongly informed by 2016
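The improvement factors in the comparison can be reproduced from the raw figures. The arithmetic below assumes 1 TB = 1000 GB and treats one cluster node and one server as equal units of compute, which the slide does not state; the variable names are illustrative.

```python
# Raw data: 100 GB (2015) vs 1.1 TB (2018), assuming 1 TB = 1000 GB.
raw_gain = 1100 / 100

# Output size as rows x features: 500K x 100 vs 1M x 500.
output_gain = (1_000_000 * 500) / (500_000 * 100)

# Development time: 3 months vs 1.5 months.
dev_gain = 3 / 1.5

# Compute as machine-hours: 6 nodes x 3 h vs 1 server x 1.5 h
# (assumes a node and a server are comparable units).
compute_gain = (6 * 3) / (1 * 1.5)

print(raw_gain, output_gain, dev_gain, compute_gain)
```

The machine-hour ratio works out to 12, which the slide rounds to "~ 10x"; the other factors match the table as stated.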
20.
Typical Feature Engineering Cycle
Change From Past: In Production, Everyday
[Figure: cycle diagram; labels: Catalog, Compute Features, Test Algorithm, Preprocess Data, Refine Usecase, Deploy & Operate, Dataset]
21.
Production Feature Engineering Challenges
Growing Volume & Complexity, Changing Schema
[Figure: feature engineering cycle diagram]
22.
Production Feature Engineering Challenges
Data loss, inconsistencies
[Figure: feature engineering cycle diagram]
23.
Production Feature Engineering Challenges
Growing number, type, & complexity
[Figure: feature engineering cycle diagram]
24.
Production Feature Engineering Challenges
100s-1000s Entity common, growing
[Figure: feature engineering cycle diagram]
25.
Production Feature Engineering Challenges
Continuously Tuned, Growing number
[Figure: feature engineering cycle diagram]
26.
Production Feature Engineering Challenges
Manage versions, outputs, debug errors
[Figure: feature engineering cycle diagram]
27.
Feature Engineering
Only a small fraction of real-world ML systems is composed of the ML code, as shown by the small navy blue box in the middle. The required surrounding infrastructure is vast and complex.
[Figure labels: ML Code, Feature Extraction, Data Collection, Analysis Tools, Data Verification, Process Management Tools, Machine Resource Management, Servicing Infrastructure, Configuration, Monitoring]