SlideShare uma empresa Scribd logo
Cloud-Native Model Training
on Distributed Data
Shawn Sun, Cloud Native Tech Lead @ Alluxio - shawn.sun@alluxio.com
ChanChan Mao, Developer Advocate @ Alluxio - chanchan.mao@alluxio.com
1
Cloud Native Tech Lead
@ Alluxio
Shawn Sun
Developer Advocate
@ Alluxio
ChanChan Mao
Tightly-Coupled Hadoop
& HDFS
On-Prem HDFS
Single Region &
Single Cloud
10yr
Ago
The Evolution of the Modern Data Stack
Tightly-Coupled Hadoop
& HDFS
Compute-Storage
Separation
On-Prem HDFS
Cloud Data Lake
Single Region &
Single Cloud
Multi-Region/
Hybrid/Multi-Cloud
10yr
Ago
Today
More Elastic, Cheaper, More Scalable
The Evolution of the Modern Data Stack
Compute-Storage
Separation
Cloud Data Lake
Multi-Region/
Hybrid/Multi-Cloud
Today
Data is Remote from Compute; Locality is Missing
I/O Challenges
The Evolution of the Modern Data Stack
● GET/PUT operation costs
add up quickly
● Cross-region data transfer
(egress) fees
● GPU cycles are wasted
waiting for data
● Job failures
● Amazon S3 errors:
503 Slow Down
503 Service Unavailable
I/O Challenges
● Analytics SQL: High query
latency because of
retrieving remote data
● Model Training: Training is
slow because of loading
remote data in each epoch
(LISTing lots of small files is
particularly slow)
Performance Cost Reliability
10%
of your data is hot data
10%
of your data is hot data
Data Caching Layer
between compute & storage
Add a
Source: Alluxio
Reduce Latency
I/O
Compute I/O
Compute Compute
I/O
(first time retrieving
remote data)
Compute
I/O Compute
Without
Cache
With
Cache
Total job run time is reduced
I/O
Compute Compute
Compute I/O
Increase GPU Utilization
I/O
(data loading)
Training I/O
Training Training
I/O
(first time loading
remote data)
Training I/O
Training Training
I/O Training
Training
Without
Cache
With
Cache
GPU is idle idle
I/O
idle
GPU is idle GPU is busy most of the time
GPU utilization is greatly increased
Reduce Cloud Storage Cost
Compute
Compute
AWS S3
us-east-1
Without Cache With Cache
AWS S3
us-west-1
AWS S3
us-east-1
Frequently Retrieving Data =
High GET/PUT Operations Costs & Data Transfer
Costs
Fast Access with
Hot Data Cached
AWS S3
us-west-1
Only Retrieve Data When Necessary =
Lower S3 Costs
… …
… …
Data Cache
Improve Reliability
Prevent
Network
Congestion
Relieve
Overloaded
Storage
Prevent Job Failures like “503 Service Unavailable” …
DATA CACHING LAYER
Observations So Far …
● The evolution of modern data stack poses
challenges for data locality
● You should care about I/O in data lake
because it greatly impacts the
performance, cost & reliability of your
data platform
● Having a data caching layer between
compute and storage can solve the I/O
challenges
● You can use cache for both analytics and
AI workloads
COMPUTE
STORAGE
ALLUXIO 14
Accessing Data and
Models In the Cloud
14
Hybrid/Multi-Cloud ML Platforms
Online ML platform
Serving cluster
Models
Training Data
Models
1
2
3
Offline training platform
Training cluster
DC/Cloud A DC/Cloud B
15
Separation of compute and storage
1. Read data directly from cloud storage
2. Copy data from cloud to local before training
3. Local cache layer for data reuse
4. Distributed cache system
Existing Solutions
16
Option 1: Read From Cloud Storage
● Easy to set up
● Performance are not ideal
■ Model access: Models are repeatedly pulled from cloud storage
■ Data access: Reading data can take more time than actual training
82% of the time
spent by
DataLoader
17
Option 2: Copy Data To Local Before Training
● Data is now local
■ Faster access + less cost
● Management is hard
■ Must manually delete training data after use
● Local storage space is limited
■ Dataset is huge - limited benefits
18
Option 3: Local Cache for Data Reuse
Examples: S3FS built-in local cache, Alluxio Fuse SDK
● Reused data is local
■ Faster access + less cost
● Cache layer provider helps data management
■ No manual deletion/supervision
● Cache space is limited
■ Dataset is huge - limited benefits
19
Option 4: Distributed Cache System
Clients
Worker
Worker
Worker
…
● Training data and trained models can
be kept in cache - distributed.
● Typically with data management
functionalities.
20
Challenges
1. Performance
● Pulling data from cloud storage is hurting training/serving.
2. Cost
● Repeatedly requesting data from cloud storage is costly.
3. Reliability
● Availability is the key for every service in cloud.
4. Usability
● Manual data management is unfavorable.
21
ALLUXIO 22
Alluxio as an example
22
Clients Worker
Worker
…
Masters
Worker
● Use consistent hashing to cache both data
and metadata on workers.
● Worker nodes have plenty space for cache.
Training data and models only need to be
pulled once from cloud storage. Cost --
● No more single point of failure. Reliability ++
● No more performance bottleneck on masters.
Performance ++
● Data management system.
Consistent Hashing for caching
23
By the numbers
● High Scalability
■ One worker supports 30 - 50 million files
■ Scale linearly - easy to support 10 billions of files
● High Availability
■ 99.99% uptime
■ No single point of failure
● High Performance
■ Faster data loading
● Cloud-native K8s Operator and CSI-FUSE for data access management
24
25
Cloud Storage
Alluxio Operator
Training
Framework
Training
Framework
Cloud VMs
Alluxio Operator
Kubernetes
Alluxio Cluster
Training
Framework
Manages the life cycle of Alluxio clusters and datasets
26
Alluxio Cluster CRD
Alluxio Operator follows the Kubernetes Operator pattern
1.Create
AlluxioCluster,
Dataset CRs
2.Inform CR
User K8s Api
Server
Alluxio
Operator
Alluxio Cluster
Dataset
3.Manage k8s
resources
4.Reconcile
● Zero-downtime
Upgrade
● High-availability
● Auto-scaling
Alluxio FUSE
● Expose the Alluxio file system as a local file system.
● Can access the cloud storage just as accessing local storage.
○ cat, ls
○ f = open(“a.txt”, “r”)
● Very low impact for end users
27
Alluxio CSI on K8s x Alluxio FUSE for Data Access
● FUSE: Turn remote dataset in cloud
into local folder for training
● CSI: Launch Alluxio FUSE pod only
when dataset is needed
Alluxio Fuse pod
Fuse
Container
Host Machine
Application pod
Application
Container
Persistent
volume +
claim
mount
mount
28
ALLUXIO 29
Data Access
Management for
PyTorch
29
Under Storage
Integration with PyTorch Training (Alluxio)
Training Node
Get Task Info
Alluxio Client
PyTorch
Get Cluster Info
Send Result
Cache Cluster
Service Registry
Cache Worker
Cache Worker
Execute Task
Cache Worker
Cache Client
Find Worker(s)
Affinity Block
Location
Policy Client-side load
balance
1
2
3
4
5
Cache miss -
Under storage task
30
Data Loading Performance
ImageNet (subset)
31
Yelp review
32
Training Directly from Storage (S3-FUSE)
- > 80% of total time is spent in DataLoader
- Result in Low GPU Utilization Rate (<20%)
GPU Utilization Improvement
Training with Alluxio-FUSE
- Reduced DataLoader Rate from 82% to 1% (82X)
- Increase GPU Utilization Rate from 17% to 93% (5X)
GPU Utilization Improvement
ALLUXIO 34
How to enable Python
Applications
34
Use Alluxio - Ray Integration as an example
35
Ray Dataloader
fsspec - Alluxio
impl
Alluxio Python
client
Ray
etcd
Alluxio Worker
REST API server
Alluxio Worker
REST API server
PyArrow Dataset
loading
Registration
Get worker
addresses
Alluxio+Ray Benchmark – Small Files
● Dataset
○ 130GB imagenet dataset
● Process Settings
○ 4 train workers
○ 9 process reading
● Active Object Store Memory
○ 400-500 MiB
36
Alluxio+Ray Benchmark – Large Parquet files
● Dataset
○ 200MiB files, adds up to
60GiB
● Process Settings
○ 28 train workers
○ 28 process reading
● Active Object Store Memory
○ 20-30 GiB
37
Cost Saving – Egress/Data Transfer Fees
38
Cost Saving – API Calls/S3 Operations (List, Get)
List/Get API calls only access Alluxio
39
Any Questions? Scan the QR code for a
Linktree including great
learning resources,
exciting meetups & a
community of data & AI
infra experts!
40
Thank you!
41
Up Next:
AI/ML Infra Meetup Thur May 9 @ Uber Sunnyvale
https://lu.ma/AIMLinfra
Speak at an Alluxio event:
https://forms.gle/iJX9GTMaAVQdzKc28

Mais conteúdo relacionado

Semelhante a Alluxio Monthly Webinar | Cloud-Native Model Training on Distributed Data

Unified Data API for Distributed Cloud Analytics and AI
Unified Data API for Distributed Cloud Analytics and AIUnified Data API for Distributed Cloud Analytics and AI
Unified Data API for Distributed Cloud Analytics and AI
Alluxio, Inc.
 

Semelhante a Alluxio Monthly Webinar | Cloud-Native Model Training on Distributed Data (20)

How the Development Bank of Singapore solves on-prem compute capacity challen...
How the Development Bank of Singapore solves on-prem compute capacity challen...How the Development Bank of Singapore solves on-prem compute capacity challen...
How the Development Bank of Singapore solves on-prem compute capacity challen...
 
Accelerate Analytics and ML in the Hybrid Cloud Era
Accelerate Analytics and ML in the Hybrid Cloud EraAccelerate Analytics and ML in the Hybrid Cloud Era
Accelerate Analytics and ML in the Hybrid Cloud Era
 
Alluxio Monthly Webinar | Why NFS/NAS on Object Storage May Not Solve Your AI...
Alluxio Monthly Webinar | Why NFS/NAS on Object Storage May Not Solve Your AI...Alluxio Monthly Webinar | Why NFS/NAS on Object Storage May Not Solve Your AI...
Alluxio Monthly Webinar | Why NFS/NAS on Object Storage May Not Solve Your AI...
 
Data Orchestration for the Hybrid Cloud Era
Data Orchestration for the Hybrid Cloud EraData Orchestration for the Hybrid Cloud Era
Data Orchestration for the Hybrid Cloud Era
 
Accelerate Analytics and ML in the Hybrid Cloud Era
Accelerate Analytics and ML in the Hybrid Cloud EraAccelerate Analytics and ML in the Hybrid Cloud Era
Accelerate Analytics and ML in the Hybrid Cloud Era
 
Solving enterprise challenges through scale out storage &amp; big compute final
Solving enterprise challenges through scale out storage &amp; big compute finalSolving enterprise challenges through scale out storage &amp; big compute final
Solving enterprise challenges through scale out storage &amp; big compute final
 
Achieving Separation of Compute and Storage in a Cloud World
Achieving Separation of Compute and Storage in a Cloud WorldAchieving Separation of Compute and Storage in a Cloud World
Achieving Separation of Compute and Storage in a Cloud World
 
Accelerating workloads and bursting data with Google Dataproc & Alluxio
Accelerating workloads and bursting data with Google Dataproc & AlluxioAccelerating workloads and bursting data with Google Dataproc & Alluxio
Accelerating workloads and bursting data with Google Dataproc & Alluxio
 
Alluxio+Presto: An Architecture for Fast SQL in the Cloud
Alluxio+Presto: An Architecture for Fast SQL in the CloudAlluxio+Presto: An Architecture for Fast SQL in the Cloud
Alluxio+Presto: An Architecture for Fast SQL in the Cloud
 
Best Practice in Accelerating Data Applications with Spark+Alluxio
Best Practice in Accelerating Data Applications with Spark+AlluxioBest Practice in Accelerating Data Applications with Spark+Alluxio
Best Practice in Accelerating Data Applications with Spark+Alluxio
 
Simplified Data Preparation for Machine Learning in Hybrid and Multi Clouds
Simplified Data Preparation for Machine Learning in Hybrid and Multi CloudsSimplified Data Preparation for Machine Learning in Hybrid and Multi Clouds
Simplified Data Preparation for Machine Learning in Hybrid and Multi Clouds
 
Unified Data API for Distributed Cloud Analytics and AI
Unified Data API for Distributed Cloud Analytics and AIUnified Data API for Distributed Cloud Analytics and AI
Unified Data API for Distributed Cloud Analytics and AI
 
Accelerate Analytics and ML in the Hybrid Cloud Era
Accelerate Analytics and ML in the Hybrid Cloud EraAccelerate Analytics and ML in the Hybrid Cloud Era
Accelerate Analytics and ML in the Hybrid Cloud Era
 
Unified Big Data Analytics: Any Stack, Any Cloud
Unified Big Data Analytics: Any Stack, Any CloudUnified Big Data Analytics: Any Stack, Any Cloud
Unified Big Data Analytics: Any Stack, Any Cloud
 
How @twitterhadoop chose google cloud
How @twitterhadoop chose google cloudHow @twitterhadoop chose google cloud
How @twitterhadoop chose google cloud
 
How @TwitterHadoop Chose Google Cloud, Joep Rottinghuis, Lohit VijayaRenu
How @TwitterHadoop Chose Google Cloud, Joep Rottinghuis, Lohit VijayaRenuHow @TwitterHadoop Chose Google Cloud, Joep Rottinghuis, Lohit VijayaRenu
How @TwitterHadoop Chose Google Cloud, Joep Rottinghuis, Lohit VijayaRenu
 
Enabling big data & AI workloads on the object store at DBS
Enabling big data & AI workloads on the object store at DBS Enabling big data & AI workloads on the object store at DBS
Enabling big data & AI workloads on the object store at DBS
 
Best Practices for Using Alluxio with Spark
Best Practices for Using Alluxio with SparkBest Practices for Using Alluxio with Spark
Best Practices for Using Alluxio with Spark
 
Architecting a Heterogeneous Data Platform Across Clusters, Regions, and Clouds
Architecting a Heterogeneous Data Platform Across Clusters, Regions, and CloudsArchitecting a Heterogeneous Data Platform Across Clusters, Regions, and Clouds
Architecting a Heterogeneous Data Platform Across Clusters, Regions, and Clouds
 
Alluxio Monthly Webinar | Why a Multi-Cloud Strategy Matters for Your AI Plat...
Alluxio Monthly Webinar | Why a Multi-Cloud Strategy Matters for Your AI Plat...Alluxio Monthly Webinar | Why a Multi-Cloud Strategy Matters for Your AI Plat...
Alluxio Monthly Webinar | Why a Multi-Cloud Strategy Matters for Your AI Plat...
 

Mais de Alluxio, Inc.

AI/ML Infra Meetup | Improve Speed and GPU Utilization for Model Training & S...
AI/ML Infra Meetup | Improve Speed and GPU Utilization for Model Training & S...AI/ML Infra Meetup | Improve Speed and GPU Utilization for Model Training & S...
AI/ML Infra Meetup | Improve Speed and GPU Utilization for Model Training & S...
Alluxio, Inc.
 

Mais de Alluxio, Inc. (20)

AI/ML Infra Meetup | ML explainability in Michelangelo
AI/ML Infra Meetup | ML explainability in MichelangeloAI/ML Infra Meetup | ML explainability in Michelangelo
AI/ML Infra Meetup | ML explainability in Michelangelo
 
AI/ML Infra Meetup | Reducing Prefill for LLM Serving in RAG
AI/ML Infra Meetup | Reducing Prefill for LLM Serving in RAGAI/ML Infra Meetup | Reducing Prefill for LLM Serving in RAG
AI/ML Infra Meetup | Reducing Prefill for LLM Serving in RAG
 
AI/ML Infra Meetup | Perspective on Deep Learning Framework
AI/ML Infra Meetup | Perspective on Deep Learning FrameworkAI/ML Infra Meetup | Perspective on Deep Learning Framework
AI/ML Infra Meetup | Perspective on Deep Learning Framework
 
AI/ML Infra Meetup | Improve Speed and GPU Utilization for Model Training & S...
AI/ML Infra Meetup | Improve Speed and GPU Utilization for Model Training & S...AI/ML Infra Meetup | Improve Speed and GPU Utilization for Model Training & S...
AI/ML Infra Meetup | Improve Speed and GPU Utilization for Model Training & S...
 
Optimizing Data Access for Analytics And AI with Alluxio
Optimizing Data Access for Analytics And AI with AlluxioOptimizing Data Access for Analytics And AI with Alluxio
Optimizing Data Access for Analytics And AI with Alluxio
 
Speed Up Presto at Uber with Alluxio Caching
Speed Up Presto at Uber with Alluxio CachingSpeed Up Presto at Uber with Alluxio Caching
Speed Up Presto at Uber with Alluxio Caching
 
Correctly Loading Incremental Data at Scale
Correctly Loading Incremental Data at ScaleCorrectly Loading Incremental Data at Scale
Correctly Loading Incremental Data at Scale
 
Alluxio Monthly Webinar | Five Disruptive Trends that Every Data & AI Leader...
Alluxio Monthly Webinar | Five Disruptive Trends that Every  Data & AI Leader...Alluxio Monthly Webinar | Five Disruptive Trends that Every  Data & AI Leader...
Alluxio Monthly Webinar | Five Disruptive Trends that Every Data & AI Leader...
 
Data Infra Meetup | FIFO Queues are All You Need for Cache Eviction
Data Infra Meetup | FIFO Queues are All You Need for Cache EvictionData Infra Meetup | FIFO Queues are All You Need for Cache Eviction
Data Infra Meetup | FIFO Queues are All You Need for Cache Eviction
 
Data Infra Meetup | Accelerate Your Trino/Presto Queries - Gain the Alluxio Edge
Data Infra Meetup | Accelerate Your Trino/Presto Queries - Gain the Alluxio EdgeData Infra Meetup | Accelerate Your Trino/Presto Queries - Gain the Alluxio Edge
Data Infra Meetup | Accelerate Your Trino/Presto Queries - Gain the Alluxio Edge
 
Data Infra Meetup | ByteDance's Native Parquet Reader
Data Infra Meetup | ByteDance's Native Parquet ReaderData Infra Meetup | ByteDance's Native Parquet Reader
Data Infra Meetup | ByteDance's Native Parquet Reader
 
Data Infra Meetup | Uber's Data Storage Evolution
Data Infra Meetup | Uber's Data Storage EvolutionData Infra Meetup | Uber's Data Storage Evolution
Data Infra Meetup | Uber's Data Storage Evolution
 
AI Infra Day | Accelerate Your Model Training and Serving with Distributed Ca...
AI Infra Day | Accelerate Your Model Training and Serving with Distributed Ca...AI Infra Day | Accelerate Your Model Training and Serving with Distributed Ca...
AI Infra Day | Accelerate Your Model Training and Serving with Distributed Ca...
 
AI Infra Day | The AI Infra in the Generative AI Era
AI Infra Day | The AI Infra in the Generative AI EraAI Infra Day | The AI Infra in the Generative AI Era
AI Infra Day | The AI Infra in the Generative AI Era
 
AI Infra Day | Hands-on Lab: CV Model Training with PyTorch & Alluxio on Kube...
AI Infra Day | Hands-on Lab: CV Model Training with PyTorch & Alluxio on Kube...AI Infra Day | Hands-on Lab: CV Model Training with PyTorch & Alluxio on Kube...
AI Infra Day | Hands-on Lab: CV Model Training with PyTorch & Alluxio on Kube...
 
AI Infra Day | The Generative AI Market And Intel AI Strategy and Product Up...
AI Infra Day | The Generative AI Market  And Intel AI Strategy and Product Up...AI Infra Day | The Generative AI Market  And Intel AI Strategy and Product Up...
AI Infra Day | The Generative AI Market And Intel AI Strategy and Product Up...
 
AI Infra Day | Composable PyTorch Distributed with PT2 @ Meta
AI Infra Day | Composable PyTorch Distributed with PT2 @ MetaAI Infra Day | Composable PyTorch Distributed with PT2 @ Meta
AI Infra Day | Composable PyTorch Distributed with PT2 @ Meta
 
AI Infra Day | Model Lifecycle Management Quality Assurance at Uber Scale
AI Infra Day | Model Lifecycle Management Quality Assurance at Uber ScaleAI Infra Day | Model Lifecycle Management Quality Assurance at Uber Scale
AI Infra Day | Model Lifecycle Management Quality Assurance at Uber Scale
 
Alluxio Monthly Webinar | Efficient Data Loading for Model Training on AWS
Alluxio Monthly Webinar | Efficient Data Loading for Model Training on AWSAlluxio Monthly Webinar | Efficient Data Loading for Model Training on AWS
Alluxio Monthly Webinar | Efficient Data Loading for Model Training on AWS
 
Alluxio + Eckerson Webinar | Simplifying and Accelerating Data Access for AI/...
Alluxio + Eckerson Webinar | Simplifying and Accelerating Data Access for AI/...Alluxio + Eckerson Webinar | Simplifying and Accelerating Data Access for AI/...
Alluxio + Eckerson Webinar | Simplifying and Accelerating Data Access for AI/...
 

Último

Último (20)

Designing for Privacy in Amazon Web Services
Designing for Privacy in Amazon Web ServicesDesigning for Privacy in Amazon Web Services
Designing for Privacy in Amazon Web Services
 
How to install and activate eGrabber JobGrabber
How to install and activate eGrabber JobGrabberHow to install and activate eGrabber JobGrabber
How to install and activate eGrabber JobGrabber
 
5 Reasons Driving Warehouse Management Systems Demand
5 Reasons Driving Warehouse Management Systems Demand5 Reasons Driving Warehouse Management Systems Demand
5 Reasons Driving Warehouse Management Systems Demand
 
Implementing KPIs and Right Metrics for Agile Delivery Teams.pdf
Implementing KPIs and Right Metrics for Agile Delivery Teams.pdfImplementing KPIs and Right Metrics for Agile Delivery Teams.pdf
Implementing KPIs and Right Metrics for Agile Delivery Teams.pdf
 
KLARNA - Language Models and Knowledge Graphs: A Systems Approach
KLARNA -  Language Models and Knowledge Graphs: A Systems ApproachKLARNA -  Language Models and Knowledge Graphs: A Systems Approach
KLARNA - Language Models and Knowledge Graphs: A Systems Approach
 
Workforce Efficiency with Employee Time Tracking Software.pdf
Workforce Efficiency with Employee Time Tracking Software.pdfWorkforce Efficiency with Employee Time Tracking Software.pdf
Workforce Efficiency with Employee Time Tracking Software.pdf
 
Agnieszka Andrzejewska - BIM School Course in Kraków
Agnieszka Andrzejewska - BIM School Course in KrakówAgnieszka Andrzejewska - BIM School Course in Kraków
Agnieszka Andrzejewska - BIM School Course in Kraków
 
A Guideline to Gorgias to to Re:amaze Data Migration
A Guideline to Gorgias to to Re:amaze Data MigrationA Guideline to Gorgias to to Re:amaze Data Migration
A Guideline to Gorgias to to Re:amaze Data Migration
 
Entropy, Software Quality, and Innovation (presented at Princeton Plasma Phys...
Entropy, Software Quality, and Innovation (presented at Princeton Plasma Phys...Entropy, Software Quality, and Innovation (presented at Princeton Plasma Phys...
Entropy, Software Quality, and Innovation (presented at Princeton Plasma Phys...
 
how-to-download-files-safely-from-the-internet.pdf
how-to-download-files-safely-from-the-internet.pdfhow-to-download-files-safely-from-the-internet.pdf
how-to-download-files-safely-from-the-internet.pdf
 
Studiovity film pre-production and screenwriting software
Studiovity film pre-production and screenwriting softwareStudiovity film pre-production and screenwriting software
Studiovity film pre-production and screenwriting software
 
APVP,apvp apvp High quality supplier safe spot transport, 98% purity
APVP,apvp apvp High quality supplier safe spot transport, 98% purityAPVP,apvp apvp High quality supplier safe spot transport, 98% purity
APVP,apvp apvp High quality supplier safe spot transport, 98% purity
 
Top Mobile App Development Companies 2024
Top Mobile App Development Companies 2024Top Mobile App Development Companies 2024
Top Mobile App Development Companies 2024
 
Facemoji Keyboard released its 2023 State of Emoji report, outlining the most...
Facemoji Keyboard released its 2023 State of Emoji report, outlining the most...Facemoji Keyboard released its 2023 State of Emoji report, outlining the most...
Facemoji Keyboard released its 2023 State of Emoji report, outlining the most...
 
WSO2Con2024 - WSO2's IAM Vision: Identity-Led Digital Transformation
WSO2Con2024 - WSO2's IAM Vision: Identity-Led Digital TransformationWSO2Con2024 - WSO2's IAM Vision: Identity-Led Digital Transformation
WSO2Con2024 - WSO2's IAM Vision: Identity-Led Digital Transformation
 
The Impact of PLM Software on Fashion Production
The Impact of PLM Software on Fashion ProductionThe Impact of PLM Software on Fashion Production
The Impact of PLM Software on Fashion Production
 
A Guideline to Zendesk to Re:amaze Data Migration
A Guideline to Zendesk to Re:amaze Data MigrationA Guideline to Zendesk to Re:amaze Data Migration
A Guideline to Zendesk to Re:amaze Data Migration
 
StrimziCon 2024 - Transition to Apache Kafka on Kubernetes with Strimzi
StrimziCon 2024 - Transition to Apache Kafka on Kubernetes with StrimziStrimziCon 2024 - Transition to Apache Kafka on Kubernetes with Strimzi
StrimziCon 2024 - Transition to Apache Kafka on Kubernetes with Strimzi
 
INGKA DIGITAL: Linked Metadata by Design
INGKA DIGITAL: Linked Metadata by DesignINGKA DIGITAL: Linked Metadata by Design
INGKA DIGITAL: Linked Metadata by Design
 
OpenChain @ LF Japan Executive Briefing - May 2024
OpenChain @ LF Japan Executive Briefing - May 2024OpenChain @ LF Japan Executive Briefing - May 2024
OpenChain @ LF Japan Executive Briefing - May 2024
 

Alluxio Monthly Webinar | Cloud-Native Model Training on Distributed Data

  • 1. Cloud-Native Model Training on Distributed Data Shawn Sun, Cloud Native Tech Lead @ Alluxio - shawn.sun@alluxio.com ChanChan Mao, Developer Advocate @ Alluxio - chanchan.mao@alluxio.com 1
  • 2. Cloud Native Tech Lead @ Alluxio Shawn Sun Developer Advocate @ Alluxio ChanChan Mao
  • 3. Tightly-Coupled Hadoop & HDFS On-Prem HDFS Single Region & Single Cloud 10yr Ago The Evolution of the Modern Data Stack
  • 4. Tightly-Coupled Hadoop & HDFS Compute-Storage Separation On-Prem HDFS Cloud Data Lake Single Region & Single Cloud Multi-Region/ Hybrid/Multi-Cloud 10yr Ago Today More Elastic, Cheaper, More Scalable The Evolution of the Modern Data Stack
  • 5. Compute-Storage Separation Cloud Data Lake Multi-Region/ Hybrid/Multi-Cloud Today Data is Remote from Compute; Locality is Missing I/O Challenges The Evolution of the Modern Data Stack
  • 6. ● GET/PUT operation costs add up quickly ● Cross-region data transfer (egress) fees ● GPU cycles are wasted waiting for data ● Job failures ● Amazon S3 errors: 503 Slow Down 503 Service Unavailable I/O Challenges ● Analytics SQL: High query latency because of retrieving remote data ● Model Training: Training is slow because of loading remote data in each epoch (LISTing lots of small files is particularly slow) Performance Cost Reliability
  • 7. 10% of your data is hot data
  • 8. 10% of your data is hot data Data Caching Layer between compute & storage Add a Source: Alluxio
  • 9. Reduce Latency I/O Compute I/O Compute Compute I/O (first time retrieving remote data) Compute I/O Compute Without Cache With Cache Total job run time is reduced I/O Compute Compute Compute I/O
  • 10. Increase GPU Utilization I/O (data loading) Training I/O Training Training I/O (first time loading remote data) Training I/O Training Training I/O Training Training Without Cache With Cache GPU is idle idle I/O idle GPU is idle GPU is busy most of the time GPU utilization is greatly increased
  • 11. Reduce Cloud Storage Cost Compute Compute AWS S3 us-east-1 Without Cache With Cache AWS S3 us-west-1 AWS S3 us-east-1 Frequently Retrieving Data = High GET/PUT Operations Costs & Data Transfer Costs Fast Access with Hot Data Cached AWS S3 us-west-1 Only Retrieve Data When Necessary = Lower S3 Costs … … … … Data Cache
  • 13. DATA CACHING LAYER Observations So Far … ● The evolution of modern data stack poses challenges for data locality ● You should care about I/O in data lake because it greatly impacts the performance, cost & reliability of your data platform ● Having a data caching layer between compute and storage can solve the I/O challenges ● You can use cache for both analytics and AI workloads COMPUTE STORAGE
  • 14. ALLUXIO 14 Accessing Data and Models In the Cloud 14
  • 15. Hybrid/Multi-Cloud ML Platforms Online ML platform Serving cluster Models Training Data Models 1 2 3 Offline training platform Training cluster DC/Cloud A DC/Cloud B 15 Separation of compute and storage
  • 16. 1. Read data directly from cloud storage 2. Copy data from cloud to local before training 3. Local cache layer for data reuse 4. Distributed cache system Existing Solutions 16
  • 17. Option 1: Read From Cloud Storage ● Easy to set up ● Performance are not ideal ■ Model access: Models are repeatedly pulled from cloud storage ■ Data access: Reading data can take more time than actual training 82% of the time spent by DataLoader 17
  • 18. Option 2: Copy Data To Local Before Training ● Data is now local ■ Faster access + less cost ● Management is hard ■ Must manually delete training data after use ● Local storage space is limited ■ Dataset is huge - limited benefits 18
  • 19. Option 3: Local Cache for Data Reuse Examples: S3FS built-in local cache, Alluxio Fuse SDK ● Reused data is local ■ Faster access + less cost ● Cache layer provider helps data management ■ No manual deletion/supervision ● Cache space is limited ■ Dataset is huge - limited benefits 19
  • 20. Option 4: Distributed Cache System Clients Worker Worker Worker … ● Training data and trained models can be kept in cache - distributed. ● Typically with data management functionalities. 20
  • 21. Challenges 1. Performance ● Pulling data from cloud storage is hurting training/serving. 2. Cost ● Repeatedly requesting data from cloud storage is costly. 3. Reliability ● Availability is the key for every service in cloud. 4. Usability ● Manual data management is unfavorable. 21
  • 22. ALLUXIO 22 Alluxio as an example 22
  • 23. Clients Worker Worker … Masters Worker ● Use consistent hashing to cache both data and metadata on workers. ● Worker nodes have plenty space for cache. Training data and models only need to be pulled once from cloud storage. Cost -- ● No more single point of failure. Reliability ++ ● No more performance bottleneck on masters. Performance ++ ● Data management system. Consistent Hashing for caching 23
  • 24. By the numbers ● High Scalability ■ One worker supports 30 - 50 million files ■ Scale linearly - easy to support 10 billions of files ● High Availability ■ 99.99% uptime ■ No single point of failure ● High Performance ■ Faster data loading ● Cloud-native K8s Operator and CSI-FUSE for data access management 24
  • 25. 25 Cloud Storage Alluxio Operator Training Framework Training Framework Cloud VMs Alluxio Operator Kubernetes Alluxio Cluster Training Framework Manages the life cycle of Alluxio clusters and datasets
  • 26. 26 Alluxio Cluster CRD Alluxio Operator follows the Kubernetes Operator pattern 1.Create AlluxioCluster, Dataset CRs 2.Inform CR User K8s Api Server Alluxio Operator Alluxio Cluster Dataset 3.Manage k8s resources 4.Reconcile ● Zero-downtime Upgrade ● High-availability ● Auto-scaling
  • 27. Alluxio FUSE ● Expose the Alluxio file system as a local file system. ● Can access the cloud storage just as accessing local storage. ○ cat, ls ○ f = open(“a.txt”, “r”) ● Very low impact for end users 27
  • 28. Alluxio CSI on K8s x Alluxio FUSE for Data Access ● FUSE: Turn remote dataset in cloud into local folder for training ● CSI: Launch Alluxio FUSE pod only when dataset is needed Alluxio Fuse pod Fuse Container Host Machine Application pod Application Container Persistent volume + claim mount mount 28
  • 30. Under Storage Integration with PyTorch Training (Alluxio) Training Node Get Task Info Alluxio Client PyTorch Get Cluster Info Send Result Cache Cluster Service Registry Cache Worker Cache Worker Execute Task Cache Worker Cache Client Find Worker(s) Affinity Block Location Policy Client-side load balance 1 2 3 4 5 Cache miss - Under storage task 30
  • 31. Data Loading Performance ImageNet (subset) 31 Yelp review
  • 32. 32 Training Directly from Storage (S3-FUSE) - > 80% of total time is spent in DataLoader - Result in Low GPU Utilization Rate (<20%) GPU Utilization Improvement
  • 33. Training with Alluxio-FUSE - Reduced DataLoader Rate from 82% to 1% (82X) - Increase GPU Utilization Rate from 17% to 93% (5X) GPU Utilization Improvement
  • 34. ALLUXIO 34 How to enable Python Applications 34
  • 35. Use Alluxio - Ray Integration as an example 35 Ray Dataloader fsspec - Alluxio impl Alluxio Python client Ray etcd Alluxio Worker REST API server Alluxio Worker REST API server PyArrow Dataset loading Registration Get worker addresses
  • 36. Alluxio+Ray Benchmark – Small Files ● Dataset ○ 130GB imagenet dataset ● Process Settings ○ 4 train workers ○ 9 process reading ● Active Object Store Memory ○ 400-500 MiB 36
  • 37. Alluxio+Ray Benchmark – Large Parquet files ● Dataset ○ 200MiB files, adds up to 60GiB ● Process Settings ○ 28 train workers ○ 28 process reading ● Active Object Store Memory ○ 20-30 GiB 37
  • 38. Cost Saving – Egress/Data Transfer Fees 38
  • 39. Cost Saving – API Calls/S3 Operations (List, Get) List/Get API calls only access Alluxio 39
  • 40. Any Questions? Scan the QR code for a Linktree including great learning resources, exciting meetups & a community of data & AI infra experts! 40
  • 41. Thank you! 41 Up Next: AI/ML Infra Meetup Thur May 9 @ Uber Sunnyvale https://lu.ma/AIMLinfra Speak at an Alluxio event: https://forms.gle/iJX9GTMaAVQdzKc28