At Stitch Fix we have a lot of Data Scientists: around eighty at last count. One reason I think we have so many is that we do things differently. Data Scientists have access to whatever resources they need (within reason) to get their work done, because they are responsible for it end to end: they collaborate with their business partners on objectives, then prototype, iterate, productionize, monitor, and debug everything and anything required to produce the desired output. They're full data-stack data scientists!
The teams in the organization do a variety of different tasks:
- Clothing recommendations for clients.
- Clothes reordering recommendations.
- Time series analysis & forecasting of inventory, client segments, etc.
- Warehouse worker path routing.
- NLP.
… and more!
They're also quite prolific at what they do -- we are approaching 4,500 job definitions at last count. So one might wonder: how have we enabled them to get their jobs done without getting in each other's way?
This is where the Data Platform team comes into play. With the goal of lowering the cognitive overhead and engineering effort required of each Data Scientist, the Data Platform team provides abstractions and infrastructure to support their work. The relationship is a collaborative partnership: Data Scientists are free to make their own decisions and choose the way they work, so the onus falls on the Data Platform team to convince them to use its tools. The easiest way to do that is to design the tools well.
In regard to scaling Data Science, the Data Platform team has helped establish patterns and infrastructure that alleviate contention on:

- Access to data.
- Access to compute resources:
  - Ad-hoc compute (think prototype, iterate, workspace).
  - Production compute (think where things are executed once they're needed regularly).

For the talk (and this post) I focus only on how we reduced contention on access to data and access to ad-hoc compute to enable Data Science to scale at Stitch Fix. With that, I invite you to take a look through the slides.
34. What is S3?
● Amazon's Simple Storage Service.
● Infinite* storage.
● Looks like a file system*:
  ○ URIs: my.bucket/path/to/files/file.txt
● Can read, write, delete, BUT NOT append (or overwrite).
● Lots of companies rely on it -- famously Dropbox.
* For all intents and purposes.
35. S3 @ Stitch Fix
● Writing data: hard to saturate.
● Reading data: hard to saturate.
● Writing & reading interference: haven't experienced any.
● Space: "infinite".
● Tooling: lots of options.
● Data Scientists' main datastore since very early on.
● S3 essentially removes any real worries with respect to data contention!
39. What is the Hive Metastore?
● Hadoop service that stores:
  ○ Schema
  ○ Partition information, e.g. date
  ○ Data location for a partition

For example, the sold_items table:

  Partition    Location
  20161001     s3://bucket/sold_items/20161001
  ...          ...
  20161031     s3://bucket/sold_items/20161031
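To make the Metastore's role concrete, here is a minimal sketch in Python (names and values are illustrative; this is not our internal API): it maps a table's partitions to the S3 locations that back them, so query engines never hardcode paths.

```python
# Toy stand-in for the Hive Metastore: a table's schema plus a
# partition -> S3 location mapping. All names are illustrative.
metastore = {
    "sold_items": {
        "schema": {"item_id": "bigint", "price": "decimal(10,2)"},
        "partitions": {
            "20161001": "s3://bucket/sold_items/20161001",
            "20161031": "s3://bucket/sold_items/20161031",
        },
    }
}

def location_for(table: str, partition: str) -> str:
    """Resolve where a partition's data lives, as a query engine would."""
    return metastore[table]["partitions"][partition]

print(location_for("sold_items", "20161031"))
# s3://bucket/sold_items/20161031
```

Because tools only ever ask the Metastore "where does this partition live?", the storage layout underneath can change without breaking anyone.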
40. Hive Metastore @ Stitch Fix
Brought in to:
● Bring centralized order to data being stored on S3.
● Provide metadata to build more tooling on top of.
● Enable use of existing open source solutions.
41. S3 + Hive Metastore
● Our central source of truth!
● Never have to worry about space.
● Trading immediate speed for consistent read & write performance:
  ○ "Contention free."
● Decouples the data storage layer from data manipulation:
  ○ Very amenable to supporting a lot of different data sets and tools.
47. Replacing a file on S3
● S3 is eventually consistent*.
● These bugs are hard to track down.
● Need everyone to be able to trust the data.
* For existing files.
48. Avoiding Eventual Consistency
● Use the Hive Metastore to easily control a partition's source of truth.
● Principles:
  ○ Never delete.
  ○ Always write to a new place each time a partition changes.
● What do we mean by "new place"?
  ○ Use an inner directory, called a Batch ID.
50. Batch ID Pattern

The sold_items table:

  Date        Location
  20161001    s3://bucket/sold_items/20161001/20161002002334/
  ...         ...
  20161031    s3://bucket/sold_items/20161031/20161101002256/
52. Batch ID Pattern
● Overwriting a partition is just a matter of updating the location.
● To the user this is a hidden inner directory.

  Date        Location
  20161001    s3://bucket/sold_items/20161001/20161002002334/
  ...         ...
  20161031    s3://bucket/sold_items/20161031/20161101002256/
              → s3://bucket/sold_items/20161031/20161102234252/
53. Batch ID Pattern Benefits
● Avoids the eventual consistency issue.
● Jobs finish on the data they started on.
● Full partition history:
  ○ Can roll back:
    ■ Data Scientists are less afraid of mistakes.
  ○ Can create audit trails more easily:
    ■ What data changed and when.
  ○ Can anchor downstream consumers to a particular Batch ID.
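The mechanics of the pattern can be sketched in a few lines of Python (illustrative only; our real write APIs live inside the tooling described below). Every write goes to a fresh Batch ID directory, the partition's current location is just a pointer that gets flipped, and old locations stick around for rollback:

```python
from datetime import datetime, timezone

# partition date -> every batch location ever written (newest last)
history: dict[str, list[str]] = {}

def write_partition(table: str, date: str) -> str:
    """Write a partition under a brand-new Batch ID, then flip the pointer."""
    batch_id = datetime.now(timezone.utc).strftime("%Y%m%d%H%M%S")
    location = f"s3://bucket/{table}/{date}/{batch_id}/"
    # ... write the data files under `location` (elided) ...
    history.setdefault(date, []).append(location)  # never delete old batches
    return location

def current_location(date: str) -> str:
    """What the Metastore would report as the partition's location."""
    return history[date][-1]

def rollback(date: str) -> str:
    """Undo the latest write by pointing back at the previous batch."""
    history[date].pop()
    return history[date][-1]
```

Because the pointer flip is the only mutation, readers either see the old batch or the new one in its entirety, never a half-written mix.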
59. 1. Enforcing Batch IDs
● Problem:
  ○ How do you enforce remembering to add a Batch ID into your S3 path?
● Solution:
  ○ By building APIs:
    ■ For all tooling!
63. 1. Enforcing Batch IDs: APIs for DS

  Tool       Reading from S3+HM       Writing to S3+HM
  Python     Internal API             Internal API
  R          Internal API             Internal API
  Spark      Standard API             Internal API
  PySpark    Standard API             Internal API
  Presto     Standard API             N/A
  Redshift   Load via Internal API    N/A
66. 2. File Format
● Problem:
  ○ What format do you use to work with all the tools?
● Possible solutions:
  ○ Parquet
  ○ Some simple format {JSON, delimited file} + gzip
  ○ Avro, Thrift, Protocol Buffers
● Philosophy: minimize operational burden:
  ○ Chose `\0`, i.e. null-delimited, gzipped files.
    ■ Easy to write an API for this, for all tools.
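A sketch of why this format is so cheap to support everywhere: writing and reading it needs nothing beyond gzip and string splitting, which every tool has. This is illustrative code, not our internal API, and I'm assuming NUL-separated fields with one record per line:

```python
import gzip

def write_rows(path: str, rows: list[list[str]]) -> None:
    # Fields joined by the NUL byte, one record per line, gzip-compressed.
    with gzip.open(path, "wt", encoding="utf-8") as f:
        for row in rows:
            f.write("\x00".join(row) + "\n")

def read_rows(path: str) -> list[list[str]]:
    with gzip.open(path, "rt", encoding="utf-8") as f:
        return [line.rstrip("\n").split("\x00") for line in f]
```

The NUL byte virtually never appears in real field values, which sidesteps the quoting and escaping headaches of comma- or tab-delimited files.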
68. 3. Schemas for all Tools
● Problem:
  ○ Can't necessarily have a single schema for all tools:
    ■ E.g. different type definitions.
● Solution:
  ○ Define parallel schemas that have specific types redefined in the Hive Metastore:
    ■ E.g. can redefine the decimal type to be double for Presto*.
    ■ This parallel schema would be named prod_presto.
  ○ Still points to the same underlying data.
* It didn't use to have functioning decimal support.
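The parallel-schema trick is mechanical enough to sketch: derive a per-tool schema from the canonical one by rewriting the problematic types, while the data location stays shared. This is illustrative Python with a made-up rewrite rule matching the Presto example above:

```python
def presto_schema(schema: dict[str, str]) -> dict[str, str]:
    """Derive a prod_presto-style variant: decimals become doubles."""
    return {
        col: "double" if typ.startswith("decimal") else typ
        for col, typ in schema.items()
    }

canonical = {"item_id": "bigint", "price": "decimal(10,2)", "day": "string"}
print(presto_schema(canonical))
# {'item_id': 'bigint', 'price': 'double', 'day': 'string'}
```

Since the derived schema is generated rather than hand-maintained, it can't drift from the canonical one.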
70. 4. Schema Evolution
● Problem:
  ○ How do you handle schema evolution with 80+ Data Scientists?
    ■ E.g. add a new column, delete an old column.
● Solution:
  ○ Append new columns to the end of schemas.
  ○ Rename deleted columns as deprecated -- breaks code, but not data.
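These two rules keep every old data file readable, because column order never changes and nothing is physically dropped. A toy sketch (illustrative, not our tooling), treating a schema as an ordered list of (column, type) pairs since the data files are positional:

```python
# Position matters: data files carry no column names, only field order.
schema = [("item_id", "bigint"), ("price", "double"), ("fax", "string")]

def add_column(schema: list, name: str, typ: str) -> None:
    """New columns only ever go on the end, so old files still parse."""
    schema.append((name, typ))

def deprecate_column(schema: list, name: str) -> None:
    """Rename instead of dropping: breaks code that uses the old name,
    but existing data files still line up column-for-column."""
    for i, (col, typ) in enumerate(schema):
        if col == name:
            schema[i] = (f"deprecated_{col}", typ)

add_column(schema, "color", "string")
deprecate_column(schema, "fax")
```

A reader hitting `deprecated_fax` gets a loud, greppable signal that the column is gone, instead of silently misaligned data.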
72. 5. Redshift
● Wait, what? Redshift?
  ○ Predates our use of Spark & Presto.
  ○ Redshift was brought in to help with joining data:
    ■ Previously DS had to load data & perform joins in R/Python.
  ○ Data Scientists loved Redshift too much:
    ■ It became a huge source of contention.
    ■ We have been migrating "production" off of it.
74. 5. Redshift
● Need:
  ○ Still want to use Redshift for ad-hoc analysis.
● Problem:
  ○ How do we keep data on S3 in sync with Redshift?
● Solution:
  ○ An API that abstracts syncing data with Redshift:
    ■ Keeps schemas in sync.
    ■ Uses the standard data warehouse staged-table insertion pattern.
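The staged-table insertion pattern itself is standard warehouse fare: load the fresh partition into a staging table, then swap it in inside a single transaction so readers never observe a half-synced partition. A sketch using SQLite purely as a stand-in for Redshift (the real sync API also handles loading from S3 and schema changes):

```python
import sqlite3

def sync_partition(conn: sqlite3.Connection, day: str, rows: list) -> None:
    """Atomically replace one day's partition of sold_items."""
    with conn:  # one transaction: the delete + insert commit together
        conn.execute("CREATE TEMP TABLE staging (day TEXT, amount REAL)")
        conn.executemany("INSERT INTO staging VALUES (?, ?)", rows)
        conn.execute("DELETE FROM sold_items WHERE day = ?", (day,))
        conn.execute("INSERT INTO sold_items SELECT * FROM staging")
        conn.execute("DROP TABLE staging")

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sold_items (day TEXT, amount REAL)")
sync_partition(conn, "20161031", [("20161031", 49.5), ("20161031", 88.0)])
# Re-syncing the same day replaces, rather than duplicates, its rows.
sync_partition(conn, "20161031", [("20161031", 49.5)])
```

Delete-then-insert inside one transaction is what makes repeated syncs idempotent, which matters when the upstream S3 partition is rewritten under a new Batch ID.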
76. 6. Spark
● What does our integration with Spark look like?
  ○ Running on Amazon EMR using Netflix's Genie:
    ■ Prod & dev clusters.
  ○ S3 is still the source of truth:
    ■ We have a custom write API:
      ● Enforces Batch IDs.
      ● Scala-based library making use of EMRFS.
      ● Also exposed in Python for PySpark use.
  ○ Heavy users of Spark SQL.
  ○ It's the main production workhorse.
80. Data Scientist's Ad-hoc Workflow
The faster this iteration cycle, the faster Data Scientists can work; scaling this part of the workflow is the focus of what follows.
81-84. Ad hoc Infra: Options

  Workstation             Env. Mgmt.    Contention Points
  Laptop                  Low           Memory & CPU
  Shared Instances        Medium        Isolation
  Individual Instances    High          Time & Money

Slide 84 revisits the last row with Env. Mgmt. at Low rather than High; the next slide explains how we get there.
85. Why Docker?
● Control of environment:
  ○ Data Scientists don't need to worry about the environment.
● Isolation:
  ○ Can host many Docker containers on a single machine.
● Better host management:
  ○ Allows central control of machine types.
86. Ad-Hoc Docker Image
● Has:
  ○ Our internal API libraries.
  ○ JupyterHub notebooks:
    ■ PySpark, IPython, R, JavaScript, Toree.
  ○ Python libs:
    ■ scikit-learn, numpy, scipy, pandas, etc.
  ○ RStudio.
  ○ R libs:
    ■ dplyr, magrittr, ggplot2, lme4, boot, etc.
● Mounts the user's NFS directory.
● The user has terminal access to the file system via Jupyter for git, pip, etc.
91. Flotilla Deployment
● Amazon ECS for cluster management.
● EC2 instances:
  ○ Custom AMI based on the ECS-optimized image.
● Runs in a single Auto Scaling Group.
● S3-backed, self-hosted Artifactory as the Docker repository.
● Docker + Amazon ECS unlocks access to lots of CPU & memory for DS!
95. Docker Problems So Far
● Docker tightly integrates with the Linux kernel.
  ○ Hypothesis:
    ■ Anything that makes uninterruptible calls to the kernel can:
      ● Break the ECS agent, because the container doesn't respond.
      ● Break isolation between containers.
    ■ E.g. mounting NFS.
● Docker Hub:
  ○ Weren't happy with performance.
  ○ Switched to Artifactory.
97. In Summary - Reducing Contention
● S3 + Hive Metastore is Stitch Fix's very scalable data warehouse.
● Internally built APIs make S3 + Hive Metastore easier to use for Data Scientists.
● Docker is used to provide a consistent environment for Data Scientists.
● Docker + ECS enables a self-service ad-hoc platform for Data Scientists.