Scaling Data Science
At Stitch Fix
Stefan Krawczyk
@stefkrawczyk
linkedin.com/in/skrawczyk
January 2017
How many
Data Scientists do you have?
At Stitch Fix we have ~80
Two Data Scientist facts:
1. Ability to spin up their own
resources*.
2. End to end,
they’re responsible.
But what do they do?
What is Stitch Fix?
~4500 Job Definitions
Lots of Compute &
Data Movement!
So how did we get to our scale?
Reducing Contention
Contention is correlated with unhappy Data Scientists & burning infrastructure.
Contention on:
● Access to Data
● Access to Compute Resources
  ○ Ad-hoc
  ○ Production
Focus of this talk:
Fellow Collaborators
Horizontal team focused on Data Scientist enablement:
jeff, akshay, jacob, tarek, kurt, derek, patrick, thomas, steven, liz, alex
Data Access:
Unhappy DS &
Burning Infrastructure
Data Access: ☹ DS & Infrastructure
● Can’t write fast enough
● Can’t read fast enough
● These two interact
● Not enough space
● Limited by tools
So how does Stitch Fix
mitigate these problems?
Data Access:
S3 & Hive Metastore
What is S3?
● Amazon’s Simple Storage Service.
● Infinite* storage.
● Looks like a file system*:
○ URIs: my.bucket/path/to/files/file.txt
● Can read, write, delete, BUT NOT append (or overwrite).
● Lots of companies rely on it -- famously Dropbox.
* For all intents and purposes
S3 @ Stitch Fix
Writing Data                   | Hard to saturate
Reading Data                   | Hard to saturate
Writing & Reading Interference | Haven’t experienced
Space                          | “Infinite”
Tooling                        | Lots of options
● Data Scientists’ main datastore since very early on.
● S3 essentially removes any real worries with respect to data contention!
S3 is not a complete solution!
What is the Hive Metastore?
● Hadoop service, that stores:
○ Schema
○ Partition information, e.g. date
○ Data location for a partition
The sold_items table in the Hive Metastore:

Partition | Location
20161001  | s3://bucket/sold_items/20161001
...       | ...
20161031  | s3://bucket/sold_items/20161031
Hive Metastore @ Stitch Fix
Brought in to:
● Bring centralized order to data being stored on S3
● Provide metadata to build more tooling on top of
● Enable use of existing open source solutions
● Our central source of truth!
● Never have to worry about space.
● Trading away some immediate speed, you get consistent read & write performance.
○ “Contention Free”
● Decoupled data storage layer from data manipulation.
○ Very amenable to supporting a lot of different data sets and tools.
S3 + Hive Metastore
Our Current Picture
Caveat: Eventual Consistency
● Replacing data in a partition, e.g. replacing a file on S3 (figure: file A replaced by file B)
● S3 is eventually consistent*
● These bugs are hard to track down
● Need everyone to be able to trust the data
* for existing files
Avoiding Eventual Consistency
● Use the Hive Metastore to easily control each partition’s source of truth
● Principles:
  ○ Never delete
  ○ Always write to a new place each time a partition changes
● What do we mean by “new place”?
  ○ Use an inner directory → called a Batch ID
Batch ID Pattern

sold_items:
Date     | Location
20161001 | s3://bucket/sold_items/20161001/20161002002334/
...      | ...
20161031 | s3://bucket/sold_items/20161031/20161101002256/
         | → s3://bucket/sold_items/20161031/20161102234252/ (after an overwrite)

● Overwriting a partition is just a matter of updating the location
● To the user this is a hidden inner directory
Batch ID Pattern Benefits
● Avoids the eventual consistency issue
● Jobs finish on the data they started on
● Full partition history:
  ○ Can roll back
    ■ Data Scientists are less afraid of mistakes
  ○ Can create audit trails more easily
    ■ What data changed and when
  ○ Can anchor downstream consumers to a particular Batch ID
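An illustrative sketch (not the actual Stitch Fix API) of why the pattern makes rollback trivial: if every write mints a fresh Batch ID location and the old locations are kept, rolling back is just re-pointing at the previous one. The `history` dict stands in for the Hive Metastore’s partition records; bucket/table names are made up.

```python
from datetime import datetime, timezone

# partition -> list of S3 locations, newest last (stand-in for the metastore)
history = {}

def write_partition(table, partition):
    # Mint a fresh Batch ID so new data never lands on the old location.
    batch_id = datetime.now(timezone.utc).strftime("%Y%m%d%H%M%S%f")
    location = f"s3://bucket/{table}/{partition}/{batch_id}/"
    history.setdefault(partition, []).append(location)
    return location

def rollback(partition):
    # Drop the newest batch; the previous location becomes current again.
    history[partition].pop()
    return history[partition][-1]
```

Because old batches are never deleted, the same list also serves as an audit trail of what changed and when.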
Data Access:
Tooling Integration
1. Enforcing Batch IDs
2. File Formats
3. Schemas for all Tools
4. Schema Evolution
5. Redshift
6. Spark
1. Enforcing Batch IDs
● Problem:
  ○ How do you enforce remembering to add a Batch ID into your S3 path?
● Solution:
  ○ Build APIs
    ■ For all tooling!

1. Enforcing Batch IDs via an API
Python:
    store_dataframe(df, dest_db, dest_table, partitions=['2016'])
    df = load_dataframe(src_db, src_table, partitions=['2016'])

R:
    sf_writer(data = result,
              namespace = dest_db,
              resource = dest_table,
              partitions = c(as.integer(opt$ETL_DATE)))

    df <- sf_reader(namespace = src_db,
                    resource = src_table,
                    partitions = c(as.integer(opt$ETL_DATE)))
1. Enforcing Batch IDs: APIs for DS
Tool     | Reading from S3+HM    | Writing to S3+HM
Python   | Internal API          | Internal API
R        | Internal API          | Internal API
Spark    | Standard API          | Internal API
PySpark  | Standard API          | Internal API
Presto   | Standard API          | N/A
Redshift | Load via Internal API | N/A
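A hypothetical sketch of what such an internal write/read API pair could do. The dict stands in for the Hive Metastore, the bucket name is made up, and the actual S3 write is elided; the point is that the Batch ID is minted inside the API, so users can never forget it.

```python
from datetime import datetime, timezone

# (db, table, partition) -> current S3 location; stand-in for the Hive Metastore
_metastore = {}

def store_dataframe(df, dest_db, dest_table, partitions):
    batch_id = datetime.now(timezone.utc).strftime("%Y%m%d%H%M%S%f")
    for p in partitions:
        location = f"s3://bucket/{dest_db}/{dest_table}/{p}/{batch_id}/"
        # ... write df's rows for partition p to `location` on S3 ...
        _metastore[(dest_db, dest_table, p)] = location  # flip source of truth last

def load_dataframe(src_db, src_table, partitions):
    # Readers resolve the current location at load time; jobs already running
    # keep reading the Batch ID they started with.
    return [_metastore[(src_db, src_table, p)] for p in partitions]
```

Updating the metastore entry only after the write completes is what keeps readers from ever seeing a half-written partition.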
2. File Format
● Problem:
  ○ What format do you use to work with all the tools?
● Possible solutions:
  ○ Parquet
  ○ Some simple format {JSON, delimited file} + gzip
  ○ Avro, Thrift, Protocol Buffers
● Philosophy: minimize operational burden:
  ○ Chose the simple format: null-delimited ('\0'), gzipped files
    ■ Easy to write an API for this, for all tools.
3. Schemas for all Tools
● Problem:
  ○ Can’t necessarily have a single schema for all tools
    ■ E.g. different type definitions
● Solution:
  ○ Define parallel schemas that have specific types redefined in the Hive Metastore
    ■ E.g. redefine the decimal type to be double for Presto*
    ■ This parallel schema would be named prod_presto
  ○ Still points to the same underlying data
* Presto didn’t use to have functioning decimal support
4. Schema Evolution
● Problem:
  ○ How do you handle schema evolution with 80+ Data Scientists?
    ■ E.g. add a new column, delete an old column
● Solution:
  ○ Append new columns to the end of schemas.
  ○ Rename deleted columns as deprecated -- breaks code, but not data.
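The two rules above can be sketched as simple schema transformations (function and column names are made up). The key invariant: existing columns never move or disappear, so old data files still line up with the schema.

```python
def add_column(schema, name, col_type):
    # New columns go on the end, never in the middle, so existing
    # positional data files keep lining up with the schema.
    return schema + [(name, col_type)]

def deprecate_column(schema, name):
    # "Deleting" is a rename: code referencing the old name breaks
    # loudly, but the data itself remains readable.
    return [(f"deprecated_{n}" if n == name else n, t) for n, t in schema]
```
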
5. Redshift
● Wait, what? Redshift?
  ○ Predates our use of Spark & Presto
  ○ Redshift was brought in to help with joining data
    ■ Previously DS had to load data & perform joins in R/Python
  ○ Data Scientists loved Redshift too much:
    ■ It became a huge source of contention
    ■ We have been migrating “production” off of it
5. Redshift
● Need:
  ○ Still want to use Redshift for ad-hoc analysis
● Problem:
  ○ How do we keep data on S3 in sync with Redshift?
● Solution:
  ○ An API that abstracts syncing data with Redshift
    ■ Keeps schemas in sync
    ■ Uses the standard data warehouse staged-table insertion pattern
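A hedged sketch of the staged-table insertion pattern: load into a staging table, then swap the partition’s rows inside one transaction. The table names, S3 path, IAM role, and COPY options below are placeholders, not the actual internal API or its SQL.

```python
def staged_sync_sql(table, partition, location, iam_role):
    # Build the statement sequence: stage, then delete + insert atomically.
    staging = f"{table}_staging"
    return f"""
BEGIN;
CREATE TEMP TABLE {staging} (LIKE {table});
COPY {staging} FROM '{location}' IAM_ROLE '{iam_role}' GZIP DELIMITER '\\000';
DELETE FROM {table} WHERE date = '{partition}';
INSERT INTO {table} SELECT * FROM {staging};
COMMIT;
""".strip()
```

Because the delete and insert happen in one transaction, ad-hoc readers see either the old partition or the new one, never an empty gap.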
6. Spark
● What does our integration with Spark look like?
  ○ Running on Amazon EMR using Netflix’s Genie
    ■ Prod & dev clusters
  ○ S3 is still the source of truth
  ○ We have a custom write API:
    ■ Enforces Batch IDs
    ■ Scala-based library making use of EMRFS
    ■ Also exposed in Python for PySpark use
  ○ Heavy users of Spark SQL
  ○ Spark is the main production workhorse
Ad-hoc
Compute Access:
Using Docker
Data Scientist’s Ad-hoc Workflow
● The faster this iteration cycle, the faster Data Scientists can work
● Scaling this part of the cycle is the focus here
Ad-hoc Infra: Options

Workstation          | Env. Mgmt.             | Contention Points
Laptop               | Low                    | Memory & CPU
Shared Instances     | Medium                 | Isolation
Individual Instances | High → Low (w/ Docker) | Time & Money
Why Docker?
● Control of environment
  ○ Data Scientists don’t need to worry about the environment
● Isolation
  ○ Can host many Docker containers on a single machine
● Better host management
  ○ Allows central control of machine types
Ad-Hoc Docker Image
● Has:
  ○ Our internal API libraries
  ○ JupyterHub notebooks:
    ■ PySpark, IPython, R, JavaScript, Toree
  ○ Python libs:
    ■ scikit-learn, numpy, scipy, pandas, etc.
  ○ RStudio
  ○ R libs:
    ■ dplyr, magrittr, ggplot2, lme4, boot, etc.
● Mounts user NFS
● User has terminal access to the file system via Jupyter for git, pip, etc.
Self Service Ad-hoc Infra: Flotilla
Jupyter Hub on Flotilla
RStudio on Flotilla
Browser Based Terminal on Flotilla
Flotilla Deployment
● Amazon ECS for cluster management.
● EC2 Instances:
○ Custom AMI based on ECS optimized docker image.
● Runs in a single Auto Scale Group.
● S3 backed self-hosted Artifactory as docker repository.
● Docker + Amazon ECS unlocks access to lots of CPU & Memory for DS!
Docker Problems So Far
● Docker tightly integrates with the Linux kernel.
  ○ Hypothesis:
    ■ Anything that makes uninterruptible calls to the kernel can:
      ● Break the ECS agent, because the container doesn’t respond
      ● Break isolation between containers
    ■ E.g. mounting NFS
● Docker Hub:
  ○ Weren’t happy with its performance
  ○ Switched to Artifactory
In Summary - Reducing Contention
● S3 + Hive Metastore is Stitch Fix’s very scalable data warehouse.
● Internally built APIs make S3 + Hive Metastore easier for Data Scientists to use.
● Docker provides a consistent environment for Data Scientists.
● Docker + ECS enables a self-service ad-hoc platform for Data Scientists.
Fin; Thanks! Questions?
@stefkrawczyk
Try out Stitch Fix → stitchfix.com/referral/8406746

Injustice - Developers Among Us (SciFiDevCon 2024)
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men
 
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersEnhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
 
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
 
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
 
Maximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxMaximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptx
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptx
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR Systems
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen Frames
 
Key Features Of Token Development (1).pptx
Key  Features Of Token  Development (1).pptxKey  Features Of Token  Development (1).pptx
Key Features Of Token Development (1).pptx
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
 
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
 
How to Remove Document Management Hurdles with X-Docs?
How to Remove Document Management Hurdles with X-Docs?How to Remove Document Management Hurdles with X-Docs?
How to Remove Document Management Hurdles with X-Docs?
 

Scaling Data Science at Stitch Fix

  • 1. Scaling Data Science At Stitch Fix Stefan Krawczyk @stefkrawczyk linkedin.com/in/skrawczyk January 2017
  • 3. At Stitch Fix we have ~80
  • 4. Two Data Scientist facts: 1. Ability to spin up their own resources*. 2. End to end, they’re responsible.
  • 5. But what do they do?
  • 15. Lots of Compute & Data Movement!
  • 16. So how did we get to our scale?
  • 18. Contention is Correlated with Unhappy Data Scientists & Burning Infrastructure
  • 19. Contention on: ● Access to Data ● Access to Compute Res.
  • 20. Contention on: ● Access to Data ● Access to Compute Res. ○ Ad-hoc ○ Production
  • 21. Contention on: ● Access to Data ● Access to Compute Res. ○ Ad-hoc ○ Production Focus of this talk:
  • 22. Fellow Collaborators jeff akshay jacob tarek kurt derek patrick thomas Horizontal team focused on Data Scientist Enablement steven liz alex
  • 23. Data Access: Unhappy DS & Burning Infrastructure
  • 24. Data Access: ☹ DS & Infrastructure
  • 25. Data Access: ☹ DS & Infrastructure
  • 26. Data Access: ☹ DS & Infrastructure Can’t write fast enough
  • 27. Data Access: ☹ DS & Infrastructure Can’t write fast enough Can’t read fast enough
  • 28. Data Access: ☹ DS & Infrastructure Can’t write fast enough Can’t read fast enough These two interact
  • 29. Data Access: ☹ DS & Infrastructure Can’t write fast enough Can’t read fast enough These two interact Not enough space
  • 30. Data Access: ☹ DS & Infrastructure Can’t write fast enough Can’t read fast enough These two interact Not enough space Limited by tools
  • 31. So how does Stitch Fix mitigate these problems?
  • 32. Data Access: S3 & Hive Metastore
  • 34. What is S3? ● Amazon’s Simple Storage Service. ● Infinite* storage. ● Looks like a file system*: ○ URIs: my.bucket/path/to/files/file.txt ● Can read, write, delete, BUT NOT append (or overwrite). ● Lots of companies rely on it -- famously Dropbox. (* For all intents and purposes)
  • 35. S3 @ Stitch Fix. Writing Data: Hard to Saturate; Reading Data: Hard to Saturate; Writing & Reading Interference: Haven’t Experienced; Space: “Infinite”; Tooling: Lots of Options. ● Data Scientists’ main datastore since very early on. ● S3 essentially removes any real worries with respect to data contention!
  • 36. S3 is not a complete solution!
  • 37. What is the Hive Metastore?
  • 38. What is the Hive Metastore? ● Hadoop service that stores: ○ Schema ○ Partition information, e.g. date ○ Data location for a partition
  • 39. What is the Hive Metastore? ● Hadoop service that stores: ○ Schema ○ Partition information, e.g. date ○ Data location for a partition. Example, table sold_items: 20161001 → s3://bucket/sold_items/20161001 ... 20161031 → s3://bucket/sold_items/20161031
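As a rough sketch of the partition-to-location mapping the metastore provides (the table and bucket names here are illustrative, not Stitch Fix’s actual layout):

```python
# Minimal in-memory model of what the Hive Metastore tracks for a
# table: for each partition value (here a date), the S3 location that
# holds that partition's files. Names are illustrative only.
metastore = {
    "sold_items": {
        "20161001": "s3://bucket/sold_items/20161001",
        "20161031": "s3://bucket/sold_items/20161031",
    }
}

def partition_location(table, partition):
    """Resolve a (table, partition) pair to its current S3 location."""
    return metastore[table][partition]
```

Every tool that can ask this one question can find the data, which is what makes the metastore a useful central source of truth.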
  • 40. Hive Metastore @ Stitch Fix. Brought in to: ● Bring centralized order to data being stored on S3 ● Provide metadata to build more tooling on top of ● Enable use of existing open source solutions
  • 41. S3 + Hive Metastore ● Our central source of truth! ● Never have to worry about space. ● Trading immediate speed for consistent read & write performance. ○ “Contention Free” ● Decoupled data storage layer from data manipulation. ○ Very amenable to supporting a lot of different data sets and tools.
  • 44. ● Replacing data in a partition Caveat: Eventual Consistency
  • 45. ● Replacing data in a partition Caveat: Eventual Consistency
  • 46. Replacing a file on S3
  • 47. Replacing a file on S3 ● S3 is eventually consistent* ● These bugs are hard to track down ● Need everyone to be able to trust the data. (* for existing files)
  • 48. Avoiding Eventual Consistency ● Use Hive Metastore to easily control partition source of truth ● Principles: ○ Never delete ○ Always write to a new place each time a partition changes ● What do we mean by “new place”? ○ Use an inner directory → called Batch ID
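The inner-directory idea can be sketched in a few lines; the helper name and exact path layout below are assumptions for illustration:

```python
from datetime import datetime

def batch_id_location(bucket, table, partition, now=None):
    """Mint a fresh write location for a partition: a new inner
    'batch ID' directory (a UTC timestamp) on every write, so files
    already sitting on S3 are never overwritten or deleted."""
    now = now or datetime.utcnow()
    batch_id = now.strftime("%Y%m%d%H%M%S")
    return f"s3://{bucket}/{table}/{partition}/{batch_id}/"

# A write at 2016-11-01 00:22:56 UTC lands in a brand-new directory.
example = batch_id_location("bucket", "sold_items", "20161031",
                            now=datetime(2016, 11, 1, 0, 22, 56))
```

Because each write targets a path that has never existed, reads of existing keys never race with in-flight replacements.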
  • 50. Batch ID Pattern. Table sold_items: 20161001 → s3://bucket/sold_items/20161001/20161002002334/ ... 20161031 → s3://bucket/sold_items/20161031/20161101002256/
  • 51. Batch ID Pattern ● Overwriting a partition is just a matter of updating the location: 20161031 moves from s3://bucket/sold_items/20161031/20161101002256/ to s3://bucket/sold_items/20161031/20161102234252/
  • 52. Batch ID Pattern ● Overwriting a partition is just a matter of updating the location ● To the user this is a hidden inner directory
  • 53. Batch ID Pattern Benefits ● Avoids eventual consistency issue ● Jobs finish on the data they started on ● Full partition history: ○ Can rollback ■ Data Scientists are less afraid of mistakes ○ Can create audit trails more easily ■ What data changed and when ○ Can anchor downstream consumers to a particular batch ID
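A toy model of why keeping history makes rollback cheap: since old batch directories are never deleted, restoring a partition is just repointing it at a previous location. The publish/current/rollback names are hypothetical, not the real internal API:

```python
# (table, partition) -> list of batch locations, newest last.
history = {}

def publish(table, partition, location):
    """Record a new batch as the partition's current location."""
    history.setdefault((table, partition), []).append(location)

def current(table, partition):
    return history[(table, partition)][-1]

def rollback(table, partition):
    """Point the partition back at the previous batch; the data for
    it still exists on S3 because nothing is ever deleted."""
    locations = history[(table, partition)]
    if len(locations) > 1:
        locations.pop()
    return locations[-1]

publish("sold_items", "20161031",
        "s3://bucket/sold_items/20161031/20161101002256/")
publish("sold_items", "20161031",
        "s3://bucket/sold_items/20161031/20161102234252/")
```

The same history list doubles as an audit trail of what changed and when.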
  • 57. Data Access: Tooling Integration 1. Enforcing Batch IDs 2. File Formats 3. Schemas for all Tools 4. Schema Evolution 5. Redshift 6. Spark
  • 58. 1. Enforcing Batch IDs ● Problem: ○ How do you enforce remembering to add a Batch ID into your S3 path?
  • 59. 1. Enforcing Batch IDs ● Problem: ○ How do you enforce remembering to add a Batch ID into your S3 path? ● Solution: ○ By building APIs ■ For all tooling!
  • 60. 1. Enforcing Batch IDs via an API
  • 61. 1. Enforcing Batch IDs via an API
  • 62. 1. Enforcing Batch IDs: APIs for DS. Python: store_dataframe(df, dest_db, dest_table, partitions=['2016']) df = load_dataframe(src_db, src_table, partitions=['2016']) R: sf_writer(data = result, namespace = dest_db, resource = dest_table, partitions = c(as.integer(opt$ETL_DATE))) df <- sf_reader(namespace = src_db, resource = src_table, partitions = c(as.integer(opt$ETL_DATE)))
  • 63. 1. Enforcing Batch IDs: APIs for DS. Python: internal API for reading & writing; R: internal API for reading & writing; Spark: standard API for reading, internal API for writing; PySpark: standard API for reading, internal API for writing; Presto: standard API for reading, no writing; Redshift: load via internal API, no writing.
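One way such a store/load pair might be shaped, as a sketch only: the caller never sees S3 paths or batch IDs. The dicts below stand in for S3 and the Hive Metastore, and all internals are invented for illustration, not Stitch Fix’s actual implementation:

```python
import ast

_s3 = {}         # path -> bytes; append-only, like S3 without deletes
_metastore = {}  # (db, table, partition) -> current batch location
_counter = 0     # stand-in for a timestamp-based batch ID

def store_dataframe(rows, db, table, partition):
    """Write data under a fresh batch ID, then publish the pointer."""
    global _counter
    _counter += 1
    loc = f"s3://{db}/{table}/{partition}/{_counter:014d}/part-0"
    _s3[loc] = repr(rows).encode()            # write the data first...
    _metastore[(db, table, partition)] = loc  # ...publish pointer last

def load_dataframe(db, table, partition):
    """Read whatever batch the metastore currently points at."""
    loc = _metastore[(db, table, partition)]
    return ast.literal_eval(_s3[loc].decode())

store_dataframe([("sku-1", 2)], "prod", "sold_items", "20161031")
loaded = load_dataframe("prod", "sold_items", "20161031")
```

Publishing the pointer only after the write completes is what makes the batch-ID discipline enforceable: it lives in the API, not in each Data Scientist’s memory.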
  • 64. 2. File Format ● Problem: ○ What format do you use to work with all the tools?
  • 65. 2. File Format ● Problem: ○ What format do you use to work with all the tools? ● Possible solutions: ○ Parquet ○ Some simple format {JSON, Delimited File} + gzip ○ Avro, Thrift, Protobuffers
  • 66. 2. File Format ● Problem: ○ What format do you use to work with all the tools? ● Possible solutions: ○ Parquet ○ Some simple format {JSON, Delimited File} + gzip ○ Avro, Thrift, Protobuffers ● Philosophy: minimize operational burden: ○ Choose `0`, i.e. null delimited, gzipped files ■ Easy to write an API for this, for all tools.
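A minimal reader/writer for this kind of format might look like the following; write_rows/read_rows are illustrative names, but the format itself (one record per line, fields separated by a null byte, whole file gzipped) is what the slide describes:

```python
import gzip
import os
import tempfile

DELIM = "\x00"  # the null byte separating fields within a record

def write_rows(path, rows):
    """One record per line, null-delimited fields, gzipped."""
    with gzip.open(path, "wt", encoding="utf-8") as f:
        for row in rows:
            f.write(DELIM.join(str(v) for v in row) + "\n")

def read_rows(path):
    """Inverse of write_rows; every field comes back as a string."""
    with gzip.open(path, "rt", encoding="utf-8") as f:
        return [line.rstrip("\n").split(DELIM) for line in f]

path = os.path.join(tempfile.mkdtemp(), "part-0.gz")
write_rows(path, [("sku-1", 2), ("sku-2", 5)])
rows_back = read_rows(path)
```

The appeal is exactly that nothing here is library-specific: any tool in any language can parse this with a few lines, which is what keeps the operational burden low.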
  • 67. 3. Schemas for all Tools ● Problem: ○ Can’t necessarily have a single schema for all tools ■ E.g. ● Different type definitions.
  • 68. 3. Schemas for all Tools ● Problem: ○ Can’t necessarily have a single schema for all tools ■ E.g. ● Different type definitions. ● Solution: ○ Define parallel schemas that have specific types redefined in Hive Metastore ■ E.g. ● Can redefine decimal type to be double for Presto*. ● This parallel schema would be named prod_presto. ○ Still points to same underlying data. (* It didn’t use to have functioning decimal support)
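The parallel-schema trick amounts to a mechanical type rewrite. A sketch, assuming the only override needed is the slide’s decimal-to-double example (the function and variable names are hypothetical):

```python
# Types the target engine can't handle, mapped to substitutes.
PRESTO_TYPE_OVERRIDES = {"decimal": "double"}

def presto_schema(columns):
    """Derive the parallel schema from the primary one.

    columns: list of (name, hive_type) pairs. Only the types in the
    override table change; names, order, and the underlying data
    location stay identical."""
    return [(name, PRESTO_TYPE_OVERRIDES.get(t, t)) for name, t in columns]

prod = [("item_id", "bigint"), ("price", "decimal"), ("sold_date", "string")]
prod_presto = presto_schema(prod)
```

Because the rewrite is deterministic, the parallel schema can be regenerated whenever the primary schema changes, so the two never drift apart.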
  • 69. 4. Schema Evolution ● Problem: ○ How do you handle schema evolution with 80+ Data Scientists? ■ E.g. ● Add a new column ● Delete an old column
  • 70. 4. Schema Evolution ● Problem: ○ How do you handle schema evolution with 80+ Data Scientists? ■ E.g. ● Add a new column ● Delete an old column ● Solution: ○ Append columns to end of schemas. ○ Rename columns as deprecated -- breaks code, but not data.
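These two rules can be expressed as a tiny schema-evolution helper; evolve is a hypothetical name, and real metastore DDL would of course differ:

```python
def evolve(schema, add=None, deprecate=None):
    """Append-only schema evolution.

    New columns go at the end, so files written under the old schema
    still parse positionally. 'Deleted' columns are renamed with a
    deprecated_ prefix, which breaks code that references them by
    name but leaves the stored data untouched."""
    deprecate = set(deprecate or [])
    schema = [("deprecated_" + n if n in deprecate else n, t)
              for n, t in schema]
    return schema + list(add or [])

v1 = [("item_id", "bigint"), ("old_score", "double")]
v2 = evolve(v1, add=[("new_score", "double")], deprecate=["old_score"])
```

Breaking code rather than data is the safer failure mode: a broken job fails loudly and gets fixed, while silently corrupted data does not.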
  • 71. 5. Redshift ● Wait, what? Redshift?
  • 72. 5. Redshift ● Wait, what? Redshift? ○ Predates use of Spark & Presto ○ Redshift was brought in to help joining data ■ Previously DS had to load data & perform joins in R/Python ○ Data Scientists loved Redshift too much: ■ It became a huge source of contention ■ Have been migrating “production” off of it
  • 73. 5. Redshift ● Need: ○ Still want to use Redshift for ad-hoc analysis ● Problem: ○ How do we keep data on S3 in sync with Redshift?
  • 74. 5. Redshift ● Need: ○ Still want to use Redshift for ad-hoc analysis ● Problem: ○ How do we keep data on S3 in sync with Redshift? ● Solution: ○ API that abstracts syncing data with Redshift ■ Keeps schemas in sync ■ Uses standard data warehouse staged table insertion pattern
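The staged table insertion pattern the API wraps is standard warehouse practice: COPY into a staging table, then swap contents inside one transaction so readers never see a half-loaded table. A sketch that only generates the SQL (the role ARN is a placeholder and the table names are illustrative):

```python
def staged_sync_sql(table, s3_path):
    """Emit the staged-insertion statements for one table sync."""
    staging = f"{table}_staging"
    return "\n".join([
        # 1. Stage the new data without touching the live table.
        f"CREATE TEMP TABLE {staging} (LIKE {table});",
        f"COPY {staging} FROM '{s3_path}' IAM_ROLE '<role-arn>' GZIP;",
        # 2. Swap contents atomically; readers see old or new, never both.
        "BEGIN;",
        f"DELETE FROM {table};",
        f"INSERT INTO {table} SELECT * FROM {staging};",
        "COMMIT;",
    ])

sql = staged_sync_sql(
    "sold_items",
    "s3://bucket/sold_items/20161031/20161101002256/")
```

The slow COPY happens entirely outside the transaction, so the live table is locked only for the quick delete-and-insert swap.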
  • 75. 6. Spark ● What does our integration with Spark look like?
  • 76. 6. Spark ● What does our integration with Spark look like? ○ Running on Amazon EMR using Netflix's Genie ■ Prod & Dev clusters ○ S3 still source of truth ■ Have custom write API: ● Enforces Batch IDs ● Scala based library making use of EMRFS ● Also exposed in Python for PySpark use ○ Heavy users of Spark SQL ○ It’s the main production workhorse
  • 79. Data Scientist’s Ad-hoc workflow The faster this iteration cycle, the faster Data Scientists can work
  • 80. Data Scientist’s Ad-hoc workflow (scaling this part). The faster this iteration cycle, the faster Data Scientists can work
  • 81. Ad hoc Infra: Options. Laptop: Workstation Env. Mgmt. Low; Contention Points: Memory & CPU
  • 82. Ad hoc Infra: Options. Shared Instances: Workstation Env. Mgmt. Medium; Contention Points: Isolation
  • 83. Ad hoc Infra: Options. Individual Instances: Workstation Env. Mgmt. High; Contention Points: Time & Money
  • 84. Ad hoc Infra: Options. Individual Instances: Workstation Env. Mgmt. Low; Contention Points: Time & Money
  • 85. Why Docker? ● Control of environment ○ Data Scientists don’t need to worry about env. ● Isolation ○ Can host many docker containers on a single machine. ● Better host management ○ Allowing central control of machine types.
  • 86. Ad-Hoc Docker Image ● Has: ○ Our internal API libraries ○ Jupyter Hub Notebooks: ■ PySpark, IPython, R, Javascript, Toree ○ Python libs: ■ scikit, numpy, scipy, pandas, etc. ○ RStudio ○ R libs: ■ dplyr, magrittr, ggplot2, lme4, boot, etc. ● Mounts User NFS ● User has terminal access to file system via Jupyter for git, pip, etc.
  • 87. Self Service Ad-hoc Infra: Flotilla
  • 88. Jupyter Hub on Flotilla
  • 90. Browser Based Terminal on Flotilla
  • 91. Flotilla Deployment ● Amazon ECS for cluster management. ● EC2 Instances: ○ Custom AMI based on ECS optimized docker image. ● Runs in a single Auto Scale Group. ● S3 backed self-hosted Artifactory as docker repository. ● Docker + Amazon ECS unlocks access to lots of CPU & Memory for DS!
  • 95. Docker Problems So Far ● Docker tightly integrates with the Linux Kernel. ○ Hypothesis: ■ Anything that makes uninterruptible calls to the kernel can: ● Break the ECS agent because the container doesn’t respond. ● Break isolation between containers. ■ E.g. Mounting NFS ● Docker Hub: ○ Weren’t happy with performance ○ Switched to Artifactory
  • 97. In Summary - Reducing Contention ● S3 + Hive Metastore is Stitch Fix’s very scalable data warehouse. ● Internally built APIs make S3 + Hive Metastore easier to use for Data Scientists. ● Docker is used to provide a consistent environment for Data Scientists to use. ● Docker + ECS enables a self-service ad-hoc platform for Data Scientists.
  • 98. Fin; Thanks! Questions? @stefkrawczyk Try out Stitch Fix → stitchfix.com/referral/8406746