Unified Data Analytics and AI
Any Stack Any Cloud
Bin Fan (binfan@alluxio.com), Founding Engineer, VP of Open Source @ Alluxio
About Me
Bin Fan (https://www.linkedin.com/in/bin-fan/)
● Founding Engineer, VP Open Source @ Alluxio
● Alluxio PMC Co-Chair, Presto TSC/committer
● Email: binfan@alluxio.com
● PhD in CS @ Carnegie Mellon University
● Originally a research project (Tachyon) at the UC Berkeley AMPLab, led by then-PhD student
Haoyuan Li (Alluxio founder and CEO)
● Backed by top VCs (e.g., Andreessen Horowitz) with $70M raised in total; Series C ($50M)
announced in 2021
● Deployed in production at large scale at Facebook, Uber, Microsoft, Tencent, TikTok, etc.
● More than 1,200 contributors on GitHub. In 2021, more than 40% of commits on GitHub were
contributed by community users
● Ranked the 9th most critical Java-based open-source project on GitHub by Google/OpenSSF [1]
Alluxio Overview
[1] Google Comes Up With A Metric For Gauging Critical Open-Source Projects
Alluxio (Tachyon) back in 2015
Screenshot of Tachyon talk at AMPLab back in 2015
What is Tachyon Stack Release Growth
Alluxio (Tachyon) in 2015
[Diagram: Spark Task 1 and Spark Task 2; Tachyon (in-memory) holding an RDD; HDFS / Amazon S3 (disk) holding blocks 1-4]
Topology
● On-prem Hadoop → Cloud-native, Multi- or Hybrid-cloud,
Multi-datacenter
Computation
● MR/Spark → Spark, Presto, Hive, TensorFlow, PyTorch, ...
● More mature frameworks (less frequent OOM, etc.)
Data access pattern
● Sequential read (e.g., scanning) of unstructured files → Ad-hoc
reads into structured/columnar data
● Hundreds to thousands of big files → millions of small files
Whatʼs Different Today
Data Storage
● On-prem & colocated HDFS → S3 and other object stores
(possibly across regions like us-east & us-west),
with legacy on-prem HDFS still in service
Resource/Job Orchestration
● YARN → K8s
○ Lost focus on data locality
The Evolution from Hadoop to Cloud-native Era
Unprecedented Complexity of Data Platforms
Data Trend → Complex Platform
● New compute and storage tech created every 3-8 years → On-premise, cloud, hybrid,
multi-cloud environments all have different environment properties
● More data generated every day, and stored in data silos → Data copies, synchronization costs
● More people and teams need to access and leverage these data → Multiple APIs necessitate
integration and application rewrites
Inefficient Manual Copy Across Data Centers, Regions, Clouds
[Diagram: Region A and Region B (Amazon EMR, Cloud Dataproc, Kubernetes Engine, Compute Engine, Hive) and private data centers (Datacenter 1, Datacenter 2), linked by error-prone and network-intensive data copies]
Strong Market Demand For Simplification
● EFFICIENT ACCESS & DATA MANAGEMENT: Acceleration & auto-tiering of remote data sources
● ENVIRONMENT AGNOSTICITY: Agility across regions for private, hybrid or multi-cloud
● UNIFICATION OF DATA LAKES: Serve analytics & AI from multiple data locations
Analytics & AI
in the Hybrid & Multi-Cloud Era
Available:
No-copy data access across silos
agnostic to compute engine
Foundation of a heterogeneous data
platform across geos
Multi-Cloud Ready Analytics & AI Platform
[Diagram: Region A, Region B, GKE, HMS, Datacenter 1, Datacenter 2]
Solution
INTERNET
PUBLIC CLOUD PROVIDERS
GENERAL
E-COMMERCE
OTHERS
TECHNOLOGY FINANCIAL SERVICES
TELCO & MEDIA
Companies Using Alluxio
https://aws.amazon.com/big-data/datalakes-and-analytics/modern-data-architecture/
Modern Data Architecture on AWS
Examples to eliminate data copies
Case Studies
Expedia: Unify Data Lakes Across Multiple Geographic Regions in the Cloud
Federate Data Lakes w/o Replication & Serve Various Compute Engines
Problems Encountered
● Data silos for different brands ingesting data across multiple regions in AWS
● Central analytics query across data silos suffered from poor UX and long time to insight
● Manual replication resulted in operational inefficiency and expensive network egress
Alluxio’s Solution
● Unify data silos without the need to copy or move data
Results Achieved
● Enhanced UX with consistent & high performance analytics, reducing time to insights
● 50% reduced cost per query
Data Replication for Cross-region Data Access
[Diagram: Brand A, Brand B, Brand C and the main data lake in US-WEST-1, US-EAST-1, US-EAST-2 and US-WEST-2, linked by data replication; Hive on top]
[Diagram: Data Lakes A, B, C and D, the Main Data Lake and two Replicated Data Lakes across US-WEST-2 and US-EAST-1, connected by CircusTrain; Hive on top]
Alluxio for Cross-region Data Access
[Diagram: Brand A, Brand B, Brand C and the main data lake in US-WEST-1, US-EAST-1, US-EAST-2 and US-WEST-2, linked by MOUNT; Hive on top]
[Diagram: Data Lakes A, B, C and D and the Main Data Lake across US-WEST-2 and US-EAST-1; Hive on top]
Object Redirection with Waggle Dance
[Diagram: SQL query, Hive, main data lake; regions us-west-1, us-west-2, us-east-1]
Conversion
● If local S3, s3://
● If cross-region S3, alluxio://
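The conversion rule on the Waggle Dance slide (s3:// for local S3, alluxio:// for cross-region S3) could be sketched roughly as below. This is an illustration only, not Expedia's or Waggle Dance's actual code: the bucket names, the bucket-to-region map, the Alluxio authority, and the assumption that each remote bucket is mounted under `/<bucket>` are all hypothetical.

```python
from urllib.parse import urlparse

# Hypothetical values for illustration only; none of these come from the talk.
LOCAL_REGION = "us-east-1"
BUCKET_REGIONS = {
    "main-data-lake": "us-east-1",
    "brand-a-lake": "us-west-2",
}
ALLUXIO_AUTHORITY = "alluxio-master:19998"


def convert_location(s3_uri: str, local_region: str = LOCAL_REGION) -> str:
    """Keep s3:// for same-region buckets; rewrite cross-region ones to alluxio://."""
    parsed = urlparse(s3_uri)
    if parsed.scheme != "s3":
        raise ValueError(f"expected an s3:// URI, got {s3_uri!r}")
    bucket, key = parsed.netloc, parsed.path.lstrip("/")
    # Raises KeyError for buckets missing from the (hypothetical) region map.
    if BUCKET_REGIONS[bucket] == local_region:
        return s3_uri  # local S3: read directly
    # Assumption: each remote bucket is mounted under /<bucket> in Alluxio.
    return f"alluxio://{ALLUXIO_AUTHORITY}/{bucket}/{key}"
```

With these assumed values, a table location in `main-data-lake` is returned unchanged, while one in `brand-a-lake` is redirected through Alluxio so that repeated cross-region reads can be served from cache.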
Enable a Hybrid Data Lake
Architecture Overview
ARCHITECTURE
[Diagram: Alluxio Clients, Alluxio Master with Standby Master (Consensus), Alluxio Workers (RAM / SSD / HDD), connected by a Control Path and a Data Path; S3 region us-east-1 and S3 region us-west-1 reached over the WAN]
DATA LOCALITY WITH SCALE-OUT WORKERS
Local performance for remote data with intelligent multi-tiering
● Tiers: RAM, SSD, HDD
● Read & Write Buffering
● Transparent to App
● Policies for pinning, promotion/demotion, TTL
[Diagram: Big Data ETL and Big Data Query on AWS EC2; AWS S3]
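To make the tiering policies on this slide concrete (pinning, promotion/demotion, TTL), here is a toy sketch of a multi-tier block cache. It is not Alluxio's worker implementation: the class name, the count-based capacities, and the LRU demotion rule are assumptions chosen for brevity.

```python
import time
from collections import OrderedDict


class TieredCache:
    """Toy multi-tier block cache: tiers[0] is fastest (e.g. RAM), last is slowest."""

    def __init__(self, capacities, ttl_seconds=None, clock=time.monotonic):
        # capacities: max number of blocks per tier, fastest tier first.
        self.capacities = list(capacities)
        self.tiers = [OrderedDict() for _ in self.capacities]  # id -> (data, stored_at)
        self.pinned = set()
        self.ttl = ttl_seconds
        self.clock = clock

    def pin(self, block_id):
        """Pinned blocks are never demoted or evicted."""
        self.pinned.add(block_id)

    def put(self, block_id, data):
        self._remove(block_id)
        self._insert(0, block_id, (data, self.clock()))

    def get(self, block_id):
        for level, tier in enumerate(self.tiers):
            if block_id not in tier:
                continue
            data, stored_at = tier[block_id]
            if self.ttl is not None and self.clock() - stored_at > self.ttl:
                del tier[block_id]  # TTL expired: drop the cached copy
                return None
            if level > 0:
                del tier[block_id]
                self._insert(0, block_id, (data, stored_at))  # promotion
            else:
                tier.move_to_end(block_id)  # refresh LRU position
            return data
        return None  # miss: the caller would fetch from the remote store

    def tier_of(self, block_id):
        for level, tier in enumerate(self.tiers):
            if block_id in tier:
                return level
        return None

    def _remove(self, block_id):
        for tier in self.tiers:
            tier.pop(block_id, None)

    def _insert(self, level, block_id, entry):
        if level >= len(self.tiers):
            return  # fell off the slowest tier: evicted from the cache
        tier = self.tiers[level]
        tier[block_id] = entry
        while len(tier) > self.capacities[level]:
            # Least recently used unpinned block is demoted one tier down.
            victim = next((b for b in tier if b not in self.pinned), None)
            if victim is None:
                break  # everything is pinned; tolerate being over capacity
            self._insert(level + 1, victim, tier.pop(victim))  # demotion
```

A real worker would size tiers in bytes and move data between storage media; the sketch only shows how the three policy knobs interact.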
Synchronization of changes across clusters
Alluxio Master
Policies for pinning,
promotion/demotion,TTL
Metadata Synchronization
AWS S3
AWS EC2
Big Data ETL
Big Data Query
RAM SSD
METADATA LOCALITY WITH SCALABLE MASTERS
RocksDB
Spark
Alluxio
S3
Co-locate Alluxio Workers with compute for
optimal I/O performance
Remote cluster
Same cluster
Spark
Alluxio
S3
Deploy Alluxio as standalone cluster
between compute and Storage
Remote cluster
Same data center / region
Presto
Long-running Instances Ephemeral Elastic
DEPLOYMENT APPROACHES
UNIFIED NAMESPACE
With Replication & Live Data Migration Capabilities
• Single Alluxio path backed by multiple S3 regions
• Example policy: Migrate data older than 7 days from S3 region us west 1 to S3 region us east 1
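The example policy above (migrate data older than 7 days from one S3 region to another) amounts to a filter over file metadata. The sketch below only plans the moves and does not call any Alluxio API; `FileInfo`, `plan_migration`, and the location labels are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class FileInfo:
    path: str
    location: str            # label of the under-store currently holding the file
    last_modified: datetime


def plan_migration(files, source, target, max_age=timedelta(days=7), now=None):
    """Return (path, source, target) moves for files in `source` older than `max_age`."""
    now = now or datetime.now(timezone.utc)
    cutoff = now - max_age
    return [
        (f.path, source, target)
        for f in files
        if f.location == source and f.last_modified < cutoff
    ]
```

Because the Alluxio path stays the same before and after such a move, applications reading through the unified namespace would not need to change their paths.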
[Diagram: alluxio://host:port/ exposes Data (Reports, Sales) and Users (Alice, Bob) in one namespace, backed by s3://bucket/ in S3 region us east 1 and s3://bucket in S3 region us west 1]
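One way to picture a unified namespace is as a mount table that maps a single logical path tree onto several backing stores. The following is a minimal sketch of that idea, not Alluxio's implementation; the `MountTable` class and the bucket names in the usage note are hypothetical.

```python
class MountTable:
    """Maps paths in one logical namespace to URIs in the backing stores."""

    def __init__(self):
        self._mounts = {}  # namespace prefix -> under-store URI prefix

    def mount(self, namespace_path, ufs_uri):
        self._mounts[namespace_path.rstrip("/") or "/"] = ufs_uri.rstrip("/")

    def resolve(self, path):
        # Longest matching mount point wins, so nested mounts override parents.
        for prefix in sorted(self._mounts, key=len, reverse=True):
            if prefix == "/" or path == prefix or path.startswith(prefix + "/"):
                suffix = path if prefix == "/" else path[len(prefix):]
                return self._mounts[prefix] + "/" + suffix.lstrip("/")
        raise KeyError(f"no mount point covers {path!r}")
```

For example, after `mount("/Data", "s3://bucket-west")` and `mount("/Users", "s3://bucket-east")`, resolving `/Data/Reports/q1.csv` yields `s3://bucket-west/Reports/q1.csv`, while the application only ever sees the logical path.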
Training & Data Pre-processing
ML/DL
I/O Challenges in ML/DL
● Training data often consists of a massive amount of small files (billions of 100KB photos)
● Size of training data keeps growing & can exceed individual server capacity
● Training jobs are highly concurrent and require high I/O to keep GPUs utilized
Whatʼs Different
Using Alluxio for DL
Alluxio
Server
Alluxio
Server ...
Training Instances
POSIX POSIX POSIX
- Only fetch data on cache miss
- No need to copy data before use
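The two points above describe read-through caching: data is pulled from the remote store only when a read misses the cache, so no bulk copy is needed before training starts. A minimal single-process sketch of that pattern follows; `ReadThroughCache` and its `fetch_remote` callback are hypothetical names, and a real deployment would expose this through a distributed cache and a POSIX mount rather than a Python dict.

```python
class ReadThroughCache:
    """Serve reads from a local cache and fall back to the remote store on a miss."""

    def __init__(self, fetch_remote):
        self._fetch_remote = fetch_remote  # callable: path -> bytes
        self._cache = {}
        self.hits = 0
        self.misses = 0

    def read(self, path):
        if path in self._cache:
            self.hits += 1
            return self._cache[path]
        self.misses += 1
        data = self._fetch_remote(path)  # remote store is touched only on a miss
        self._cache[path] = data
        return data
```

In a multi-epoch training loop, the first epoch would pay the remote fetch cost and later epochs would read the same files from cache.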
Distributed Caching
Using Alluxio for DL
Distributed Caching
● Consistent performance
● Direct access to data
● Low latency and high throughput
● High GPU utilization rate
MOMO (NASDAQ: MOMO)
runs thousands of Alluxio nodes across multiple Alluxio clusters,
managing more than 100 TB of data for search and training:
● Support multiple storage & compute frameworks.
● Accelerate compute & training tasks
● Reduce the metadata and data overhead
Model Training using PyTorch + Alluxio + Ceph
● 2 billion small files
● Reduce metadata & data interactions with Ceph to improve performance
https://www.alluxio.io/resources/videos/ml-and-query-acceleration-at-momo-with-alluxio-chinese/
Large Scale Deep Learning
TOPOLOGY: ON-PREMISES
Alluxio’s Solution
Q&A
Social Media: Twitter.com/alluxio, Linkedin.com/alluxio
Website: www.alluxio.io
Slack: https://alluxio.io/slack

Unified Data API for Distributed Cloud Analytics and AI
