When we Spark and when we don’t:
ML Pipeline Development at Stitch Fix
Talk Flow
● What is Stitch Fix?
● Infrastructure and Tech Stack
● Thoughts on Good Practices for Developing ML Pipelines
● Case Study: Inventory Recommendation Models
● Tooling & Abstractions at Stitch Fix
Stitch Fix
● Share your style, size, and price preferences with your personal stylist.
● Get 5 hand-selected pieces of clothing delivered to your door.
● Try your fix on in the comfort of your home.
● Leave feedback and pay for only the items you keep.
● Return the other items in the envelope provided.
There’s an algorithm for that...
● Styling Algorithms
● Client/Stylist Matching
● Demand Modeling
● Human Computation
● Pick Path Optimization
● New Style Development
● Inventory Allocation
● State Machines
● Warehouse Assignment
● Batch Picking
● Replenishment
* Find out more at http://algorithms-tour.stitchfix.com/
Our Infrastructure and Tech Stack
[Architecture diagram. Layers: Data Acquisition, Data Processing, Data Storage, Data Management, Job Execution, Workflow Management. Labeled components: Camera (state snapshots), Flotilla, Bumblebee (metadata manager), Metastore, Uhura, AWS S3, and AWS ECS clusters, with Prod and Dev/Research environments.]
Some facts
● 1000s of jobs / day
○ Model training, featurization, test analysis, reporting, analytics, ad hoc research
● Production jobs run on
○ Spark: mostly Spark SQL and PySpark
○ Flotilla: Python or R in Docker containers on ECS
● ML pipelines typically consist of several jobs spanning the stack of technologies
● Data scientists own pipelines and implementations end-to-end
Good Practices for Developing ML Pipelines
Pipelines should be designed to support constant iteration
○ Individual pipelines/algorithms/implementations change quickly
○ Tooling and infrastructure should be relatively stable
At scale, failure should be expected
○ Be robust to failure
■ Checkpointing
■ Isolation
■ Automated Retries
■ Alerting
○ Make it easy to debug and diagnose
○ We train 100s of models / day and expect some number of them to fail.
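One way to picture automated retries plus alerting is the minimal sketch below. It is not Stitch Fix's implementation: `run_with_retries`, `make_flaky_job`, and the `alert` callback are hypothetical names invented for illustration, and the backoff policy is an assumption.

```python
import time


def run_with_retries(job, max_attempts=3, base_delay=1.0, alert=print):
    """Run job(); retry on failure, then alert and re-raise on the last attempt."""
    for attempt in range(1, max_attempts + 1):
        try:
            return job()
        except Exception as exc:
            if attempt == max_attempts:
                # Out of retries: surface the failure instead of hiding it.
                alert(f"job failed after {attempt} attempts: {exc!r}")
                raise
            # Exponential backoff between attempts (assumed policy).
            time.sleep(base_delay * 2 ** (attempt - 1))


def make_flaky_job(failures_before_success):
    """Hypothetical job that fails a fixed number of times, then succeeds."""
    state = {"calls": 0}

    def job():
        state["calls"] += 1
        if state["calls"] <= failures_before_success:
            raise RuntimeError("transient failure")
        return state["calls"]

    return job
```

A job that fails twice succeeds on the third attempt; a job that keeps failing triggers exactly one alert and then raises, so the scheduler still sees the failure.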
Pipelines and jobs should be idempotent.
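A minimal sketch of what idempotence can look like for a batch job, assuming the output location is derived only from the job's inputs. `run_idempotent`, its arguments, and the JSON file layout are hypothetical and chosen only for illustration.

```python
import json
import os
import tempfile


def run_idempotent(output_dir, job_name, run_date, compute):
    """Write compute()'s result to a path determined only by the inputs.

    Re-running with the same (job_name, run_date) leaves a single output
    behind instead of appending or duplicating.
    """
    os.makedirs(output_dir, exist_ok=True)
    final_path = os.path.join(output_dir, f"{job_name}_{run_date}.json")
    if os.path.exists(final_path):
        # Checkpoint hit: an earlier run already finished this unit of work.
        return final_path, False
    result = compute()
    fd, tmp_path = tempfile.mkstemp(dir=output_dir, suffix=".tmp")
    with os.fdopen(fd, "w") as handle:
        json.dump(result, handle)
    # os.replace is atomic on the same filesystem, so readers never see a
    # partially written file.
    os.replace(tmp_path, final_path)
    return final_path, True
```

The second element of the return value reports whether the work actually ran, which makes a retry after a partial pipeline failure cheap for the steps that already completed.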
Make pragmatic choices with respect to technology.
Case Study: Inventory Recommendation Models
[Diagram: seven parallel copies of one pipeline, Extract Training Data -> Train Model -> Upload Model. Label: Algo_V1_1.]
[Diagram, shown twice in the deck. Nodes: User Item Rating Data; Ingest; Extract “wide” Client Training Data; Extract “wide” Item Training Data; Model A, B, C, and D Training Data; Train Model A, B, C, and D; Upload Model A, B, C, and D.]
client_features: {
  "expanded_colors": {
    "in": ["client_colors"],
    "fn": "dummy_expand"
  },
  "X_Y_ratio": {
    "in": ["X", "Y"],
    "fn": "compute_scaled_ratio"
  }
  …
},
item_features: {
  "expanded_print": {
    "in": ["colors"],
    "fn": "dummy_expand"
  }
},
interaction_features: {
}
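The feature definitions above map an output feature name to input columns ("in") and a function name ("fn"). A minimal sketch of how such a definition could be resolved follows. The bodies of `dummy_expand` and `compute_scaled_ratio` are guesses based only on their names, and `FUNCTIONS` and `compute_features` are hypothetical helpers, not the deck's tooling.

```python
FEATURE_DEFS = {
    "expanded_colors": {"in": ["client_colors"], "fn": "dummy_expand"},
    "X_Y_ratio": {"in": ["X", "Y"], "fn": "compute_scaled_ratio"},
}


def dummy_expand(name, values):
    """Assumed behavior: one indicator column per distinct value."""
    return {f"{name}_{value}": 1 for value in values}


def compute_scaled_ratio(name, x, y):
    """Assumed behavior: a plain ratio; the real scaling is not given."""
    return {name: x / y}


# Registry that resolves the "fn" strings in the definitions to callables.
FUNCTIONS = {
    "dummy_expand": dummy_expand,
    "compute_scaled_ratio": compute_scaled_ratio,
}


def compute_features(row, feature_defs, functions=FUNCTIONS):
    """Apply every feature definition to one input row."""
    out = {}
    for name, spec in feature_defs.items():
        fn = functions[spec["fn"]]
        args = [row[column] for column in spec["in"]]
        out.update(fn(name, *args))
    return out
```

Keeping the definitions declarative means a new feature is a config change plus, at most, one new registry entry.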
Extract Jobs generated from resolution of Model + Feature Definitions
{
  "deptA": {
    "computed_features": ["example_feature"],
    "formula": ["s ~ 1 + f_a + shiny_material_flag + x_y_ratio"]
  },
  "deptB": {
    "computed_features": ["example_feature"],
    "formula": ["s ~ 1 + f_a + x_y_ratio + client_color_a + expanded_print_x"]
  }
}
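As a rough illustration of resolving model definitions into extract jobs, the sketch below parses each formula into a target and a list of required columns. The job-spec shape and the helper names `parse_formula` and `extract_jobs` are assumptions; the deck does not show the actual resolver.

```python
MODEL_DEFS = {
    "deptA": {
        "computed_features": ["example_feature"],
        "formula": ["s ~ 1 + f_a + shiny_material_flag + x_y_ratio"],
    },
    "deptB": {
        "computed_features": ["example_feature"],
        "formula": [
            "s ~ 1 + f_a + x_y_ratio + client_color_a + expanded_print_x"
        ],
    },
}


def parse_formula(formula):
    """Split 'y ~ 1 + a + b' into ('y', ['a', 'b']); '1' is the intercept."""
    target, rhs = formula.split("~", 1)
    terms = [term.strip() for term in rhs.split("+")]
    return target.strip(), [term for term in terms if term and term != "1"]


def extract_jobs(model_defs):
    """Build one hypothetical extract-job spec per model formula."""
    jobs = []
    for model_name, spec in model_defs.items():
        for formula in spec["formula"]:
            target, columns = parse_formula(formula)
            jobs.append({
                "model": model_name,
                "target": target,
                "columns": columns,
                "computed_features": list(spec["computed_features"]),
            })
    return jobs
```

Each resulting spec names exactly the columns one model needs, which is what lets the extract step produce a separate training data set per model.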
Some Observations
1. Spark is utilized heavily for feature engineering.
2. Model fitting occurs in containerized Python and R environments.
3. Individual jobs communicate via data dependencies.
4. Our inventory recommendation algorithms are specified with a high degree of tooling.
5. Pipelines leave behind multiple artifacts for analysis, debugging, and checkpointing (extract, train, load).
6. Individual models are isolated from one another and can fail without impacting the rest of the group.
7. Data is contextual, e.g. item type, business line.
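The isolation and artifact points can be sketched in a few lines, assuming each model's pipeline is an ordered list of named steps. `run_pipeline` and `run_all` are hypothetical names; in the deck the isolation comes from separate jobs and containers, not from a try/except in one process.

```python
def run_pipeline(steps):
    """Run the steps in order, keeping each step's output as an artifact."""
    artifacts = {}
    data = None
    for step_name, step in steps:
        data = step(data)
        artifacts[step_name] = data
    return artifacts


def run_all(pipelines):
    """Run every model's pipeline; one failure does not stop the others."""
    results, failures = {}, {}
    for model_name, steps in pipelines.items():
        try:
            results[model_name] = run_pipeline(steps)
        except Exception as exc:
            # Record the failure for alerting and move on to the next model.
            failures[model_name] = repr(exc)
    return results, failures
```

The per-step artifacts are what make a failed model debuggable after the fact: the extract output is still available even when training raised.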
Platform Tooling is Important!
Desirable Properties of Infrastructure & Tooling
● Isolation should be guaranteed by the infrastructure
● It should be obvious what running jobs and services are doing, when, and why
● Access to data should be easy, consistent, and self-service
● Guide rails should enforce, or strongly encourage, idempotent patterns
● Scaling, logging, and security should be baked into infrastructure and tooling
Access to Data
● All data is managed and tracked by the Metastore
○ Hive metastore abstracted by Bumblebee
○ Location, Schema, Format
● Data access is a first-class citizen in Python and R
○ Typically accessed as dataframes
○ df = load_dataframe(namespace, table)
○ store_dataframe(df, namespace, table)
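To make the shape of this interface concrete, here is a toy stand-in that uses only the standard library. It is not Bumblebee or the Hive metastore: the `Metastore` class, the CSV storage format, and the use of a list of dicts in place of a dataframe are all assumptions for illustration.

```python
import csv
import os


class Metastore:
    """Toy stand-in: tracks location, schema, and format per table."""

    def __init__(self, root):
        self.root = root
        self.tables = {}

    def store_dataframe(self, rows, namespace, table):
        """Write rows (a list of dicts) and register the table."""
        schema = list(rows[0].keys())
        location = os.path.join(self.root, namespace, f"{table}.csv")
        os.makedirs(os.path.dirname(location), exist_ok=True)
        with open(location, "w", newline="") as handle:
            writer = csv.DictWriter(handle, fieldnames=schema)
            writer.writeheader()
            writer.writerows(rows)
        self.tables[(namespace, table)] = {
            "location": location,
            "schema": schema,
            "format": "csv",
        }

    def load_dataframe(self, namespace, table):
        """Look up the table's location and read it back as rows."""
        entry = self.tables[(namespace, table)]
        with open(entry["location"], newline="") as handle:
            return list(csv.DictReader(handle))
```

Callers only name a namespace and a table; where the data lives and how it is encoded stays behind the metastore lookup, which is the property the slide describes.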
The cloud. Embrace elasticity.
Containerized Batch Jobs
● Containerized job execution has many benefits
○ Strong isolation
○ High degree of control over resources and environment
● But, needs abstraction over job definition and management
○ So we developed Flotilla
○ And open sourced it!
https://stitchfix.github.io/flotilla-os/
Questions?
Get in touch:
jmagnusson@stitchfix.com
@jeffmagnusson
http://www.linkedin.com/in/jmagnuss

When We Spark and When We Don’t: Developing Data and ML Pipelines

  • 1. When we Spark and when we don’t: ML Pipeline Development at Stitch Fix
  • 2. Talk Flow ● What is Stitch Fix? ● Infrastructure and Tech Stack ● Thoughts on Good Practices for Developing ML Pipelines ● Case Study: Inventory Recommendation Models ● Tooling & Abstractions at Stitch Fix
  • 3. Share your style, size and price preferences with your personal stylist. Get 5 hand-selected pieces of clothing delivered to your door. Try your fix on in the comfort of your home Leave feedback and pay for only the items you keep Return the other items in the envelope provided Stitch Fix
  • 4. There’s an algorithm for that... Styling Algorithms Client/Stylist Matching Demand Modeling Human Computation Pick Path Optimization New Style Development Inventory Allocation State Machines Warehouse Assignment Batch Picking Replenishment * Find out more at http://algorithms-tour.stitchfix.com/
  • 6. Camera: State Snapshots Flotilla AWS ECS Cluster Bumblebee: Metadata Manager AWS:S3 Prod Dev/Research Metastore AWS ECS Cluster AWS ECS Cluster Data Acquisition Data Processing Data Storage Data Management Uhura Job Execution Workflow Management
  • 7. Some facts ● 1000s of jobs / day ○ Model training, featurization, test analysis, reporting, analytics, ad hoc research ● Production jobs run on ○ Spark: mostly Spark SQL and PySpark ○ Flotilla: Python or R in Docker containers on ECS ● ML pipelines typically consist of several jobs spanning the stack of technologies ● Data scientists own pipelines and implementations end-to-end
  • 8. Good Practices for Developing ML Pipelines
  • 9. Pipelines should be designed to support constant iteration ○ Individual pipelines/algorithms/implementations change quickly ○ Tooling and infrastructure should be relatively stable
  • 10. At scale, failure should be expected ○ Be robust to failure ■ Checkpointing ■ Isolation ■ Automated Retries ■ Alerting ○ Make it easy to debug and diagnose ○ We train 100s of models / day, and expect some # to fail.
  • 11. Pipelines and jobs should be idempotent.
  • 12. Make pragmatic choices with respect to technology.
  • 14. Algo_V1_1: Extract Training Data, Train Model, Upload Model (the same three-step pipeline, shown seven times on the slide)
  • 15. User Item Rating Data Extract “wide” Client Training Data Train Model A Upload Model A Extract “wide” Item Training Data Model D Training Data Model C Training Data Ingest Train Model C Upload Model C Train Model D Upload Model D Model B Training Data Train Model B Upload Model B Model A Training Data
  • 16. Extract “wide” Client Training Data User Item Rating Data Train Model A Upload Model A Extract “wide” Item Training Data Model D Training Data Model C Training Data Model A Training Data Ingest Train Model C Upload Model C Train Model D Upload Model D Model B Training Data Train Model B Upload Model B
  • 17. client_features: { "expanded_colors": { "in": [ "client_colors" ], "fn": "dummy_expand" }, "X_Y_ratio": { "in": [ X, Y ], "fn": "compute_scaled_ratio" } … }, item_features: { "expanded_print": { "in": [ colors ], "fn": "dummy_expand" } }, interaction_features: { } Extract Jobs generated from resolution of Model + Feature Definitions { "deptA": { "computed_features": [ "example_feature" ], "formula": [ "s ~ 1 + f_a + shiny_material_flag + x_y_ratio" ] }, "deptB": { "computed_features": [ "example_feature" ], "formula": [ "s ~ 1 + f_a + x_y_ratio + client_color_a + expanded_print_x" ] } }
  • 18. Some Observations 1. Spark is utilized heavily for feature engineering. 2. Model fitting occurs in containerized Python and R environments. 3. Individual jobs communicate via data dependencies. 4. Our inventory recommendation algorithms are specified with a high degree of tooling. 5. Pipelines leave behind multiple artifacts for analysis, debugging, and checkpointing (extract, train, load). 6. Individual models are isolated from one another (and can fail without impacting the rest of the group). 7. Data is contextual: e.g. item type; business line
  • 19. Platform Tooling is Important!
  • 20. Desirable Properties of Infrastructure & Tooling ● Isolation should be guaranteed by the infrastructure ● It should be obvious what running jobs and services are doing, when, and why ● Access to data should be easy, consistent, and self-service ● Guide rails should enforce, or strongly encourage, idempotent patterns ● Scaling, logging, and security should be baked into infrastructure and tooling
  • 21. Access to Data ● All data is managed and tracked by the Metastore ○ Hive metastore abstracted by Bumblebee ○ Location, Schema, Format ● Data access for Python and R is a 1st class citizen ○ Typically accessed as dataframes ○ df = load_dataframe(namespace, table) ○ store_dataframe(df, namespace, table)
  • 23. Containerized Batch Jobs ● Containerized job execution has many benefits ○ Strong isolation ○ High degree of control over resources and environment ● But, needs abstraction over job definition and management ○ So we developed Flotilla ○ And open sourced it! https://stitchfix.github.io/flotilla-os/
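
The slides on expected failure (slide 10) and idempotent jobs (slide 11) can be illustrated with a small sketch. It is not Stitch Fix code: `STORE`, `store_partition`, `run_with_retries`, and `make_flaky_extract` are hypothetical names, and an in-memory dict stands in for a real table store. The idea shown is that a job which overwrites its output partition, rather than appending to it, can be retried automatically without leaving duplicate rows behind.

```python
import time

# Hypothetical in-memory stand-in for a table store,
# keyed by (namespace, table, partition).
STORE = {}


def store_partition(namespace, table, partition, rows):
    """Overwrite the whole partition so that reruns leave the same state."""
    STORE[(namespace, table, partition)] = list(rows)


def run_with_retries(job, max_attempts=3, base_delay=0.0):
    """Call job() until it succeeds; re-raise the error after max_attempts."""
    for attempt in range(1, max_attempts + 1):
        try:
            return job()
        except Exception:
            if attempt == max_attempts:
                raise
            # Exponential backoff between attempts (zero delay by default).
            time.sleep(base_delay * 2 ** (attempt - 1))


def make_flaky_extract(run_date, fail_times):
    """Build an extract job that writes its output, then fails fail_times times."""
    state = {"calls": 0}

    def job():
        state["calls"] += 1
        # The write happens before the simulated failure, so every retry
        # repeats it. Overwriting keeps the partition at exactly one row.
        store_partition("demo", "training_data", run_date,
                        [("client_1", "item_9", 1)])
        if state["calls"] <= fail_times:
            raise RuntimeError("simulated transient failure")
        return state["calls"]

    return job


attempts = run_with_retries(make_flaky_extract("2018-06-01", fail_times=2))
rows = STORE[("demo", "training_data", "2018-06-01")]
```

Had `store_partition` appended instead of overwriting, the three attempts would have left three copies of the row. Keying the output by run date is one common way to make reruns safe; the slides do not say which mechanism Stitch Fix uses.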
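
Slide 17 describes extract jobs generated by resolving model definitions against feature definitions. The following is a minimal sketch of what such a resolution step might look like, under several assumptions: the bodies of `dummy_expand` and `compute_scaled_ratio` are guesses (the slide gives only their names), `compute_scaled_ratio` is reduced to a plain ratio, and `resolve_extract` together with the sample row and definitions are hypothetical.

```python
def dummy_expand(name, value):
    """Assumed behavior: emit an indicator column for the observed category."""
    return {f"{name}_{value}": 1}


def compute_scaled_ratio(name, x, y):
    """Assumed behavior: a plain ratio; the real scaling is not given."""
    return {name: x / y if y else 0.0}


# Registry mapping the "fn" strings in a feature definition to callables.
FEATURE_FNS = {
    "dummy_expand": dummy_expand,
    "compute_scaled_ratio": compute_scaled_ratio,
}

# Feature definitions in the slide's shape: input columns plus a function name.
feature_defs = {
    "expanded_colors": {"in": ["client_colors"], "fn": "dummy_expand"},
    "x_y_ratio": {"in": ["x", "y"], "fn": "compute_scaled_ratio"},
}

# Model definitions in the slide's shape (contents here are illustrative).
model_defs = {
    "deptA": {
        "computed_features": ["x_y_ratio"],
        "formula": ["s ~ 1 + f_a + x_y_ratio"],
    },
    "deptB": {
        "computed_features": ["x_y_ratio", "expanded_colors"],
        "formula": ["s ~ 1 + f_a + x_y_ratio + expanded_colors_blue"],
    },
}


def resolve_extract(model_name, row):
    """Compute only the features the named model asks for, for a single row."""
    out = dict(row)
    for feature in model_defs[model_name]["computed_features"]:
        spec = feature_defs[feature]
        args = [row[column] for column in spec["in"]]
        out.update(FEATURE_FNS[spec["fn"]](feature, *args))
    return out


row = {"f_a": 0.5, "x": 3.0, "y": 4.0, "client_colors": "blue"}
dept_a = resolve_extract("deptA", row)
dept_b = resolve_extract("deptB", row)
```

Keeping features and models as declarative definitions means each model's extract contains only the columns it needs, which fits the talk's point that models are isolated from one another and change often.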