SlideShare uma empresa Scribd logo
1 de 19
Baixar para ler offline
Scaling Data Science
Capabilities with Spark at
Stitch Fix
Derek Bennett
Overview
• About us
• Our usage of Spark
• Details of our Spark Architecture
• Challenges and Ongoing Work
About Stitch Fix
• Personalized online retail company based in
San Francisco, CA USA
• Stylists create shipments (“fixes”) based on
algorithmic recommendations and requests
• Customer pays for what they keep and returns
the rest, along with feedback on checkout
• “Empowering people to find what they love”
About The Algorithms Team
• Over 80 people dedicated to analytics and data
science capabilities
– Three vertically focused teams of data scientists
– Data platform team providing infrastructure, services
and tools to all vertical teams
• Data scientists own their pipelines
– Platform team has a self service charter
– “Engineers shouldn’t write ETL”
Our Usage of Spark
How we Leverage Spark
• Heavy use of Spark SQL along with
DataFrame APIs
• Hive Metastore and Spark HiveContext
• Mostly PySpark with some Scala
• Some initial use of Spark ML
What’s different about us
• Vast majority of data is in structured tables
• Instead of a small number of large jobs, large
number of diverse and varying-size jobs
• Scalability is by capability not amount of data
• Many people used to SQL, so Spark SQL is an
easy way to transition in
More Interesting Use Cases
• Analyzing website visitors
– Several queries mixed with use of Python API
– Reuse of existing python code in spark job
• Client “states” : chains of multiple queries
• Examples using 3rd party NLP libraries
• Analyzing client segments with loop of multiple
similar jobs
• Machine Learning : feature selection example
Our Spark Architecture
Data Architecture
Architecture Notes
• Amazon S3
– Read, Write, Delete (no append/update)
– Data ingested from databases and log events
• Hive Metastore
– Schema and partitions
– Data location of most recent data
• Spark primarily works with data on S3
• Additional query engines: Presto and Redshift
Architecture Principles
• S3 as the source of truth
• Batch Id pattern
– Hidden inner directory - timestamp based
– Overwrite : write new batch and update location
• Hive metastore manages schemas and latest
data location
• No data permanently kept in HDFS
Spark Setup in Detail
Spark Setup in Detail
• Genie service and its benefits:
– Execution service isolating EMR details
– Administration and supporting different versions of
Spark and multiple clusters
• EMR and our clusters: dev / prod / ad-hoc
• Clusters are transient - no data in HDFS
• Jupyter IPython notebooks and related tools
• EMR File System (more efficient S3 interaction)
Spark Builds and Deployments
• Supported versions: 1.6.x, 2.0.x, 2.1.x
• Maintaining our own fork of Spark
• Snapshot builds for testing
• Additional “applications” and “commands” in
genie to experiment with different settings
Additional Libraries and Utilities
• SFS3 : Output and Validation
– Writing while enforcing batch id paradigm
– Utilize Amazon EMRFS for efficient read/write
• Hive utilities
– Allow table creation or adding partitions
– Table analyzing
• Tracer : library interface to time series database
• Diagnosis script to find root cause of failures
Challenges and
Ongoing Work
Challenges
• Porting monolithic queries and using the API
– Education that Spark is more than just SQL!
• Education on use of Spark and various settings
– Find useful defaults and override when needed
– Provide reference for starting tuning
• Challenging errors related to joins, metastore,
memory usage, and contention
• Contention in the cluster; scheduling; priorities
Ongoing Projects
• Users & YARN queue settings
• Centralized history server
• Data reader/writer project
• More extensive use of different file formats
• Logging integration
• Future use of streaming

Mais conteúdo relacionado

Mais de Databricks

Stage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationStage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI Integration
Databricks
 
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchSimplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Databricks
 
Raven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesRaven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction Queries
Databricks
 
Processing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkProcessing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache Spark
Databricks
 
Jeeves Grows Up: An AI Chatbot for Performance and Quality
Jeeves Grows Up: An AI Chatbot for Performance and QualityJeeves Grows Up: An AI Chatbot for Performance and Quality
Jeeves Grows Up: An AI Chatbot for Performance and Quality
Databricks
 
Intuitive & Scalable Hyperparameter Tuning with Apache Spark + Fugue
Intuitive & Scalable Hyperparameter Tuning with Apache Spark + FugueIntuitive & Scalable Hyperparameter Tuning with Apache Spark + Fugue
Intuitive & Scalable Hyperparameter Tuning with Apache Spark + Fugue
Databricks
 
Infrastructure Agnostic Machine Learning Workload Deployment
Infrastructure Agnostic Machine Learning Workload DeploymentInfrastructure Agnostic Machine Learning Workload Deployment
Infrastructure Agnostic Machine Learning Workload Deployment
Databricks
 

Mais de Databricks (20)

Why APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringWhy APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML Monitoring
 
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixThe Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
 
Stage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationStage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI Integration
 
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchSimplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorch
 
Scaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesScaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on Kubernetes
 
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesScaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
 
Sawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsSawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature Aggregations
 
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkRedis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
 
Re-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkRe-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and Spark
 
Raven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesRaven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction Queries
 
Processing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkProcessing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache Spark
 
Massive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeMassive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta Lake
 
Machine Learning CI/CD for Email Attack Detection
Machine Learning CI/CD for Email Attack DetectionMachine Learning CI/CD for Email Attack Detection
Machine Learning CI/CD for Email Attack Detection
 
Jeeves Grows Up: An AI Chatbot for Performance and Quality
Jeeves Grows Up: An AI Chatbot for Performance and QualityJeeves Grows Up: An AI Chatbot for Performance and Quality
Jeeves Grows Up: An AI Chatbot for Performance and Quality
 
Intuitive & Scalable Hyperparameter Tuning with Apache Spark + Fugue
Intuitive & Scalable Hyperparameter Tuning with Apache Spark + FugueIntuitive & Scalable Hyperparameter Tuning with Apache Spark + Fugue
Intuitive & Scalable Hyperparameter Tuning with Apache Spark + Fugue
 
Infrastructure Agnostic Machine Learning Workload Deployment
Infrastructure Agnostic Machine Learning Workload DeploymentInfrastructure Agnostic Machine Learning Workload Deployment
Infrastructure Agnostic Machine Learning Workload Deployment
 
Improving Apache Spark for Dynamic Allocation and Spot Instances
Improving Apache Spark for Dynamic Allocation and Spot InstancesImproving Apache Spark for Dynamic Allocation and Spot Instances
Improving Apache Spark for Dynamic Allocation and Spot Instances
 
Importance of ML Reproducibility & Applications with MLfLow
Importance of ML Reproducibility & Applications with MLfLowImportance of ML Reproducibility & Applications with MLfLow
Importance of ML Reproducibility & Applications with MLfLow
 
Hyperspace for Delta Lake
Hyperspace for Delta LakeHyperspace for Delta Lake
Hyperspace for Delta Lake
 
How We Optimize Spark SQL Jobs With parallel and sync IO
How We Optimize Spark SQL Jobs With parallel and sync IOHow We Optimize Spark SQL Jobs With parallel and sync IO
How We Optimize Spark SQL Jobs With parallel and sync IO
 

Último

Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
amitlee9823
 
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
amitlee9823
 
Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...
amitlee9823
 
Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
amitlee9823
 
FESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdfFESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdf
MarinCaroMartnezBerg
 

Último (20)

Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% SecureCall me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
 
Midocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFxMidocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFx
 
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 nightCheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
 
Predicting Loan Approval: A Data Science Project
Predicting Loan Approval: A Data Science ProjectPredicting Loan Approval: A Data Science Project
Predicting Loan Approval: A Data Science Project
 
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
 
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
 
Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...
 
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
 
Anomaly detection and data imputation within time series
Anomaly detection and data imputation within time seriesAnomaly detection and data imputation within time series
Anomaly detection and data imputation within time series
 
Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
 
BigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptxBigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptx
 
Week-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interactionWeek-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interaction
 
Halmar dropshipping via API with DroFx
Halmar  dropshipping  via API with DroFxHalmar  dropshipping  via API with DroFx
Halmar dropshipping via API with DroFx
 
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
 
CebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptxCebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptx
 
Carero dropshipping via API with DroFx.pptx
Carero dropshipping via API with DroFx.pptxCarero dropshipping via API with DroFx.pptx
Carero dropshipping via API with DroFx.pptx
 
Capstone Project on IBM Data Analytics Program
Capstone Project on IBM Data Analytics ProgramCapstone Project on IBM Data Analytics Program
Capstone Project on IBM Data Analytics Program
 
(NEHA) Call Girls Katra Call Now 8617697112 Katra Escorts 24x7
(NEHA) Call Girls Katra Call Now 8617697112 Katra Escorts 24x7(NEHA) Call Girls Katra Call Now 8617697112 Katra Escorts 24x7
(NEHA) Call Girls Katra Call Now 8617697112 Katra Escorts 24x7
 
FESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdfFESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdf
 
VidaXL dropshipping via API with DroFx.pptx
VidaXL dropshipping via API with DroFx.pptxVidaXL dropshipping via API with DroFx.pptx
VidaXL dropshipping via API with DroFx.pptx
 

Scaling Data Science Capabilities with Apache Spark at Stitch Fix with Derek Bennett

  • 1. Scaling Data Science Capabilities with Spark at Stitch Fix Derek Bennett
  • 2. Overview • About us • Our usage of Spark • Details of our Spark Architecture • Challenges and Ongoing Work
  • 3. About Stitch Fix • Personalized online retail company based in San Francisco, CA USA • Stylists create shipments (“fixes”) based on algorithmic recommendations and requests • Customer pays for what they keep and returns the rest, along with feedback on checkout • “Empowering people to find what they love”
  • 4. About The Algorithms Team • Over 80 people dedicated to analytics and data science capabilities – Three vertically focused teams of data scientists – Data platform team providing infrastructure, services and tools to all vertical teams • Data scientists own their pipelines – Platform team has a self service charter – “Engineers shouldn’t write ETL”
  • 5. Our Usage of Spark
  • 6. How we Leverage Spark • Heavy use of Spark SQL along with DataFrame APIs • Hive Metastore and Spark HiveContext • Mostly PySpark with some Scala • Some initial use of Spark ML
  • 7. What’s different about us • Vast majority of data is in structured tables • Instead of a small number of large jobs, large number of diverse and varying-size jobs • Scalability is by capability not amount of data • Many people used to SQL, so Spark SQL is an easy way to transition in
  • 8. More Interesting Use Cases • Analyzing website visitors – Several queries mixed with use of Python API – Reuse of existing python code in spark job • Client “states” : chains of multiple queries • Examples using 3rd party NLP libraries • Analyzing client segments with loop of multiple similar jobs • Machine Learning : feature selection example
  • 11. Architecture Notes • Amazon S3 – Read, Write, Delete (no append/update) – Data ingested from databases and log events • Hive Metastore – Schema and partitions – Data location of most recent data • Spark primarily works with data on S3 • Additional query engines: Presto and Redshift
  • 12. Architecture Principles • S3 as the source of truth • Batch Id pattern – Hidden inner directory - timestamp based – Overwrite : write new batch and update location • Hive metastore manages schemas and latest data location • No data permanently kept in HDFS
  • 13. Spark Setup in Detail
  • 14. Spark Setup in Detail • Genie service and its benefits: – Execution service isolating EMR details – Administration and supporting different versions of Spark and multiple clusters • EMR and our clusters: dev / prod / ad-hoc • Clusters are transient - no data in HDFS • Jupyter IPython notebooks and related tools • EMR File System (more efficient S3 interaction)
  • 15. Spark Builds and Deployments • Supported versions: 1.6.x, 2.0.x, 2.1.x • Maintaining our own fork of Spark • Snapshot builds for testing • Additional “applications” and “commands” in genie to experiment with different settings
  • 16. Additional Libraries and Utilities • SFS3 : Output and Validation – Writing while enforcing batch id paradigm – Utilize Amazon EMRFS for efficient read/write • Hive utilities – Allow table creation or adding partitions – Table analyzing • Tracer : library interface to time series database • Diagnosis script to find root cause of failures
  • 18. Challenges • Porting monolithic queries and using the API – Education that Spark is more than just SQL! • Education on use of Spark and various settings – Find useful defaults and override when needed – Provide reference for starting tuning • Challenging errors related to joins, metastore, memory usage, and contention • Contention in the cluster; scheduling; priorities
  • 19. Ongoing Projects • Users & YARN queue settings • Centralized history server • Data reader/writer project • More extensive use of different file formats • Logging integration • Future use of streaming