The data platform at Stitch Fix runs thousands of jobs a day to feed data products that provide algorithmic capabilities powering nearly all aspects of the business, from merchandising to operations to styling recommendations. Many of these jobs are distributed across Spark clusters, while many others are scheduled as isolated single-node tasks in containers running Python, R, or Scala. Pipelines often comprise a mix of task types and containers.
This talk covers how we develop, schedule, and maintain these pipelines at Stitch Fix. We discuss guidelines for deciding which portions of a pipeline run on which platform (e.g., what is worth running distributed across Spark clusters versus in stand-alone containers) and how we get them to play well together. We also give an overview of the tools and abstractions developed at Stitch Fix to support the process from development to deployment to monitoring in production.
When We Spark and When We Don’t: Developing Data and ML Pipelines
1. When We Spark and When We Don’t: ML Pipeline Development at Stitch Fix
2. Talk Flow
● What is Stitch Fix?
● Infrastructure and Tech Stack
● Thoughts on Good Practices for Developing ML Pipelines
● Case Study: Inventory Recommendation Models
● Tooling & Abstractions at Stitch Fix
3. Stitch Fix
● Share your style, size, and price preferences with your personal stylist.
● Get 5 hand-selected pieces of clothing delivered to your door.
● Try your fix on in the comfort of your home.
● Leave feedback and pay for only the items you keep.
● Return the other items in the envelope provided.
4. There’s an algorithm for that...
● Styling Algorithms
● Client/Stylist Matching
● Demand Modeling
● Human Computation
● Pick Path Optimization
● New Style Development
● Inventory Allocation
● State Machines
● Warehouse Assignment
● Batch Picking
● Replenishment
* Find out more at http://algorithms-tour.stitchfix.com/
7. Some facts
● 1000s of jobs / day
○ Model training, featurization, test analysis, reporting, analytics, ad hoc research
● Production jobs run on
○ Spark: mostly Spark SQL and pySpark
○ Flotilla: Python or R in Docker containers on ECS
● ML pipelines typically consist of several jobs spanning the stack of technologies
● Data scientists own pipelines and implementations end-to-end
9. Pipelines should be designed to support constant iteration
○ Individual pipelines/algorithms/implementations change quickly
○ Tooling and infrastructure should be relatively stable
10. At scale, failure should be expected
○ Be robust to failure
■ Checkpointing
■ Isolation
■ Automated Retries
■ Alerting
○ Make it easy to debug and diagnose
○ We train 100s of models / day and expect some number of them to fail.
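The checkpoint/retry pattern above can be sketched in a few lines of Python. This is a hypothetical illustration, not Stitch Fix's actual tooling: `run_with_retries`, the file-based checkpoint marker, and the retry counts are all assumptions made for the example.

```python
import json
import time
from pathlib import Path

def run_with_retries(step_name, step_fn, checkpoint_dir, max_retries=3, backoff_s=0.0):
    """Run one pipeline step, skipping it if a checkpoint already exists
    and retrying on failure. Hypothetical sketch, not production tooling."""
    marker = Path(checkpoint_dir) / f"{step_name}.done"
    if marker.exists():  # checkpointing: don't redo completed work on a rerun
        return json.loads(marker.read_text())["result"]
    last_err = None
    for attempt in range(1, max_retries + 1):
        try:
            result = step_fn()
            marker.write_text(json.dumps({"result": result, "attempt": attempt}))
            return result
        except Exception as err:  # automated retries with linear backoff
            last_err = err
            time.sleep(backoff_s * attempt)
    # alerting would hook in here; re-raise so the scheduler can flag the job
    raise RuntimeError(f"{step_name} failed after {max_retries} attempts") from last_err
```

Because the checkpoint persists across runs, a pipeline restart skips completed steps and only re-executes the work that failed.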
14. [Diagram: Algo_V1_1 — many copies of the same three-step pipeline running in parallel, each one Extract Training Data → Train Model → Upload Model]
15. [Diagram: pipeline DAG — User Item Rating Data → Ingest → Extract “wide” Client Training Data and Extract “wide” Item Training Data → Model A/B/C/D Training Data → Train Model A/B/C/D → Upload Model A/B/C/D]
16. [Diagram: pipeline DAG — User Item Rating Data → Ingest → Extract “wide” Client Training Data and Extract “wide” Item Training Data → per-model training data sets → Train Model A/B/C/D → Upload Model A/B/C/D]
18. Some Observations
1. Spark is utilized heavily for feature engineering.
2. Model fitting occurs in containerized Python and R environments.
3. Individual jobs communicate via data dependencies.
4. Our inventory recommendation algorithms are specified with a high degree of tooling.
5. Pipelines leave behind multiple artifacts for analysis, debugging, and checkpointing. (extract, train, load)
6. Individual models are isolated from one another (and can fail without impacting the rest of the group).
7. Data is contextual: e.g. item type; business line
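Observations 3 and 6 — jobs coordinated only through data dependencies, with per-model isolation — can be sketched as a toy dependency-ordered runner. The task names and graph structure below are hypothetical; this is not the scheduler Stitch Fix actually uses.

```python
from graphlib import TopologicalSorter

def run_pipeline(tasks, deps):
    """Run tasks in dependency order. A failing model task is recorded and
    its downstream tasks are skipped, instead of aborting the whole pipeline."""
    results, failed = {}, set()
    for name in TopologicalSorter(deps).static_order():
        if any(d in failed for d in deps.get(name, ())):
            failed.add(name)  # upstream data is missing; skip, don't crash
            continue
        try:
            results[name] = tasks[name]()
        except Exception:
            failed.add(name)  # isolation: the other models keep running
    return results, failed
```

With per-model Train/Upload branches hanging off a shared extract step, one bad model takes out only its own branch: the extract and every other model still complete.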
20. Desirable Properties of Infrastructure & Tooling
● Isolation should be guaranteed by the infrastructure
● It should be obvious what running jobs and services are doing, when, and why
● Access to data should be easy, consistent, and self-service
● Guardrails should enforce, or strongly encourage, idempotent patterns
● Scaling, logging, and security should be baked into infrastructure and tooling
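One common way to encourage idempotent output is the write-to-temp-then-rename pattern: a rerun overwrites the final file cleanly instead of appending or leaving a half-written result behind. A minimal sketch under that assumption (not Stitch Fix's actual tooling):

```python
import os
import tempfile

def write_partition_atomically(data: bytes, dest_path: str):
    """Idempotent output: rerunning the job produces the same final file.
    Readers never observe a partially written result."""
    dest_dir = os.path.dirname(dest_path) or "."
    fd, tmp_path = tempfile.mkstemp(dir=dest_dir)  # temp file on same filesystem
    with os.fdopen(fd, "wb") as f:
        f.write(data)
    os.replace(tmp_path, dest_path)  # atomic rename on POSIX: all-or-nothing
```

The same idea scales up to object stores and table partitions: stage output under a temporary key, then swap it into its final location in one step.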
21. Access to Data
● All data is managed and tracked by the Metastore
○ Hive metastore abstracted by Bumblebee
○ Location, Schema, Format
● Data access for Python and R is a first-class citizen
○ Typically accessed as dataframes
○ df = load_dataframe(namespace, table)
○ store_dataframe(df, namespace, table)
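As a toy stand-in for the interface above — the real implementation resolves location, schema, and format through the metastore and returns dataframes — here is a hypothetical CSV-backed version using lists of dicts. The `WAREHOUSE` path and file layout are assumptions made for the example.

```python
import csv
import tempfile
from pathlib import Path

# Stand-in for metastore-managed storage; hypothetical location.
WAREHOUSE = Path(tempfile.gettempdir()) / "warehouse_demo"

def _table_path(namespace, table):
    # The real metastore resolves location, schema, and format;
    # this toy version just maps (namespace, table) to a CSV file.
    return WAREHOUSE / namespace / f"{table}.csv"

def store_dataframe(rows, namespace, table):
    path = _table_path(namespace, table)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=list(rows[0]))
        writer.writeheader()
        writer.writerows(rows)

def load_dataframe(namespace, table):
    with _table_path(namespace, table).open(newline="") as f:
        return list(csv.DictReader(f))
```

The point of the interface is that callers name data logically, by namespace and table, and never hard-code storage paths or formats.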
23. Containerized Batch Jobs
● Containerized job execution has many benefits
○ Strong isolation
○ High degree of control over resources and environment
● But, needs abstraction over job definition and management
○ So we developed Flotilla
○ And open sourced it!
https://stitchfix.github.io/flotilla-os/
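To give a feel for what a containerized job definition involves (image, command, resources, environment), here is a hypothetical payload builder. The field names are illustrative only; the actual job-definition schema is documented at the Flotilla site linked above.

```python
def make_job_definition(alias, image, command, memory_mb=4096, env=None):
    """Build a containerized-job definition payload.
    Field names are illustrative, not Flotilla's actual schema."""
    if not image or ":" not in image:
        # untagged images make runs irreproducible; require an explicit tag
        raise ValueError("image must include a tag, e.g. myorg/trainer:1.2")
    return {
        "alias": alias,
        "image": image,
        "command": command,
        "memory": memory_mb,
        "env": [{"name": k, "value": v} for k, v in (env or {}).items()],
    }
```

The abstraction this illustrates: data scientists declare what to run and with what resources, while the execution layer handles scheduling, isolation, and logging.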