Alluxio Day x APAC Modern Data Stack
September 22, 2022
For more on Alluxio Day: https://www.alluxio.io/alluxio-day/
For more Alluxio events: https://alluxio.io/events/
Speaker: Bin Fan (Founding Member & VP of Open Source, Alluxio)
Alluxio (www.alluxio.io) is an open-source virtual distributed file system that provides a unified data access layer for hybrid and multi-cloud deployments. It enables distributed compute engines such as Spark and Presto, and machine learning frameworks such as TensorFlow, to transparently access different persistent storage systems (including HDFS, S3, and Azure) while actively leveraging an in-memory cache to accelerate data access. Originally developed at UC Berkeley AMPLab as the research project “Tachyon”, Alluxio has more than 1200 contributors and is used by over 100 companies worldwide, with the largest production deployment exceeding 1000 nodes.
This presentation focuses on how Alluxio helps the big data analytics stack become cloud-native. Cloud object storage systems, which are increasingly popular, provide more cost-effective and scalable storage but also have different semantics and performance implications compared to HDFS. Applications like Spark or Presto do not benefit from node-level locality or cross-job caching when retrieving data from cloud object storage. Deploying Alluxio in front of cloud storage addresses these problems because data is retrieved and cached in Alluxio instead of being fetched repeatedly from the underlying cloud or object storage.
Unified Data API for Distributed Cloud Analytics and AI
1. Unified Data Analytics and AI
Any Stack Any Cloud
Bin Fan (binfan@alluxio.com), Founding Engineer, VP of Open Source @ Alluxio
2. About Me
Bin Fan (https://www.linkedin.com/in/bin-fan/)
● Founding Engineer, VP Open Source @ Alluxio
● Alluxio PMC Co-Chair, Presto TSC/committer
● Email: binfan@alluxio.com
● PhD in CS @ Carnegie Mellon University
3. Alluxio Overview
● Originally a research project (Tachyon) at UC Berkeley AMPLab, led by then-PhD student Haoyuan Li (Alluxio founder and CEO)
● Backed by top VCs (e.g., Andreessen Horowitz) with $70M raised in total; Series C ($50M) announced in 2021
● Deployed in production at large scale at Facebook, Uber, Microsoft, Tencent, TikTok, etc.
● More than 1200 contributors on GitHub. In 2021, more than 40% of commits on GitHub were contributed by community users
● The 9th most critical Java-based open-source project on GitHub, as ranked by Google/OpenSSF [1]
[1] Google Comes Up With A Metric For Gauging Critical Open-Source Projects
4. Alluxio (Tachyon) back in 2015
Screenshot of Tachyon talk at AMPLab back in 2015
What is Tachyon Stack Release Growth
6. What's Different Today
Topology
● On-prem Hadoop → Cloud-native, multi- or hybrid-cloud, multi-datacenter
Computation
● MR/Spark → Spark, Presto, Hive, TensorFlow, PyTorch, …
● More mature frameworks (less frequent OOM, etc.)
Data access pattern
● Sequential reads (e.g., scanning) on unstructured files → Ad-hoc reads into structured/columnar data
● Hundreds to thousands of big files → millions of small files
7. The Evolution from Hadoop to Cloud-native Era
Data Storage
● On-prem & colocated HDFS → S3 and other object stores (possibly across regions like us-east & us-west), with legacy on-prem HDFS still in service
Resource/Job Orchestration
● YARN → K8s
○ Lost focus on data locality
8. Unprecedented Complexity of Data Platforms
Data Trend → Complex Platform
● New compute and storage tech created every 3-8 years → On-premise, cloud, hybrid, multi-cloud environments all have different environment properties
● More data generated every day, and stored in data silos → Data copies, synchronization costs
● More people and teams need to access and leverage these data → Multiple APIs necessitate integration and application rewrites
9. Inefficient Manual Copy Across Data Centers, Regions, Clouds
[Diagram labels: Region A, Region B, Private Data Centers (Datacenter 1, Datacenter 2), Amazon EMR, Cloud Dataproc, Kubernetes Engine, Compute Engine, Hive. Caption: error-prone and network-intensive data copies]
10. Strong Market Demand For Simplification
● EFFICIENT ACCESS & DATA MANAGEMENT: Acceleration & auto-tiering of remote data sources
● ENVIRONMENT AGNOSTICITY: Agility across regions for private, hybrid or multi-cloud
● UNIFICATION OF DATA LAKES: Serve analytics & AI from multiple data locations
12. Multi-Cloud Ready Analytics & AI Platform
Solution
● No-copy data access across silos, agnostic to compute engine
● Foundation of a heterogeneous data platform across geos
[Diagram labels: Region A, Region B, GKE, Datacenter 1, Datacenter 2, HMS]
16. Expedia: Unify Data Lakes Across Multiple Geographic Regions in the Cloud
Problems Encountered
● Data silos for different brands ingesting data across multiple regions in AWS
● Central analytics query across data silos suffered from poor UX and long time to insight
● Manual replication resulted in operational inefficiency and expensive network egress
Alluxio’s Solution
● Federate Data Lakes w/o Replication & Serve Various Compute Engines
Results Achieved
● Enhanced UX with consistent & high performance analytics, reducing time to insights
● 50% Reduced cost per query
● Unify data silos without the need to copy or move data
17. Data Replication for Cross-region Data Access
[Diagram labels: Brand A, Brand B, Brand C, Main Data Lake; US-WEST-1, US-EAST-1, US-EAST-2, US-WEST-2; Hive; Data Replication]
18. [Diagram labels: Data Lake A, Data Lake B, Data Lake C, Data Lake D, Main Data Lake, two Replicated Data Lakes; CircusTrain; Hive; US-WEST-2, US-EAST-1]
19. Alluxio for Cross-region Data Access
[Diagram labels: Brand A, Brand B, Brand C, Main Data Lake; US-WEST-1, US-EAST-1, US-EAST-2, US-WEST-2; Hive; Mount]
20. [Diagram labels: Data Lake A, Data Lake B, Data Lake C, Data Lake D, Main Data Lake; Hive; US-WEST-2, US-EAST-1]
21. Object Redirection with Waggle Dance
SQL query conversion:
● If local S3, s3://
● If cross-region S3, alluxio://
[Diagram labels: Hive; Main Data Lake; us-west-1, us-west-2, us-east-1]
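The conversion rule on this slide (local S3 stays s3://, cross-region S3 becomes alluxio://) can be sketched roughly as below. This is only an illustration of the rule, not Waggle Dance's or Expedia's actual code: the bucket-to-region map, the Alluxio authority, and the assumption that each bucket is mounted under /<bucket> in Alluxio are all hypothetical.

```python
from urllib.parse import urlparse

# Hypothetical values for illustration only; none of these come from the talk.
BUCKET_REGION = {"brand-a-lake": "us-west-1", "main-lake": "us-east-1"}
ALLUXIO_AUTHORITY = "alluxio-master:19998"


def redirect(location: str, local_region: str) -> str:
    """Rewrite a table location according to where the bucket lives.

    Local S3 locations are returned unchanged; cross-region S3 locations
    are rewritten to an alluxio:// URI (assuming the bucket is mounted
    at /<bucket> in the Alluxio namespace).
    """
    parsed = urlparse(location)
    if parsed.scheme != "s3":
        return location  # not an S3 location, leave untouched
    bucket = parsed.netloc
    region = BUCKET_REGION.get(bucket)
    if region is None or region == local_region:
        return location  # unknown or local bucket: read S3 directly
    return f"alluxio://{ALLUXIO_AUTHORITY}/{bucket}{parsed.path}"


print(redirect("s3://main-lake/sales/part-0.parquet", "us-west-1"))
```

A query engine in us-east-1 would keep reading s3://main-lake/... directly, while an engine in us-west-1 would be pointed at the Alluxio path for the same table.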
24. DATA LOCALITY WITH SCALE-OUT WORKERS
Local performance for remote data with intelligent multi-tiering
● Tiers: RAM, SSD, HDD
● Read & Write Buffering
● Transparent to App
● Policies for pinning, promotion/demotion, TTL
[Diagram labels: Big Data ETL, Big Data Query, AWS EC2, AWS S3]
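To make the pinning and promotion/demotion policies on this slide concrete, here is a toy two-tier cache. It is a simplified sketch, not Alluxio's implementation: the class and method names are hypothetical, the lower tier is unbounded, eviction is plain LRU, and TTL is left out.

```python
from collections import OrderedDict


class TieredCache:
    """Toy two-tier cache (e.g. RAM over SSD) with pinning.

    Reads promote a block to the top tier; when the top tier overflows,
    the least recently used unpinned block is demoted to the lower tier.
    """

    def __init__(self, top_capacity: int):
        self.top_capacity = top_capacity
        self.top = OrderedDict()  # ordered from least to most recently used
        self.lower = {}           # unbounded in this sketch
        self.pinned = set()

    def pin(self, key):
        self.pinned.add(key)

    def put(self, key, value):
        self.lower.pop(key, None)
        self.top[key] = value
        self.top.move_to_end(key)
        self._demote_if_needed()

    def get(self, key):
        if key in self.top:
            self.top.move_to_end(key)
            return self.top[key]
        if key in self.lower:
            value = self.lower.pop(key)
            self.put(key, value)  # promotion to the top tier
            return value
        return None

    def tier_of(self, key):
        if key in self.top:
            return "top"
        return "lower" if key in self.lower else None

    def _demote_if_needed(self):
        while len(self.top) > self.top_capacity:
            victim = next((k for k in self.top if k not in self.pinned), None)
            if victim is None:
                break  # everything is pinned; tolerate overflow in this sketch
            self.lower[victim] = self.top.pop(victim)
```

In this sketch a pinned block is never demoted, which mirrors the idea of keeping hot data on the fastest tier regardless of access recency.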
25. METADATA LOCALITY WITH SCALABLE MASTERS
Synchronization of changes across clusters
● Policies for pinning, promotion/demotion, TTL
● Metadata Synchronization
[Diagram labels: Alluxio Master, RAM, SSD, RocksDB; Big Data ETL, Big Data Query, AWS EC2, AWS S3]
26. DEPLOYMENT APPROACHES
● Co-locate Alluxio Workers with compute for optimal I/O performance
[Diagram labels: Spark, Alluxio, S3; Same cluster, Remote cluster]
● Deploy Alluxio as standalone cluster between compute and storage
[Diagram labels: Spark, Presto, Alluxio, S3; Same data center / region, Remote cluster]
Long-running Instances / Ephemeral / Elastic
27. UNIFIED NAMESPACE
With Replication & Live Data Migration Capabilities
• Single Alluxio path backed by multiple S3 regions
• Example policy: Migrate data older than 7 days from S3 region us west 1 to S3 region us east 1
[Diagram labels: Alluxio at alluxio://host:port/ with directories Data (Reports, Sales) and Users (Alice, Bob); S3 region us east 1 at s3://bucket/; S3 region us west 1 at s3://bucket]
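The example policy on this slide (migrate data older than 7 days from one S3 region to another) amounts to a filter over file metadata. The sketch below shows only that selection step; the FileInfo type and select_for_migration function are hypothetical names, and using last-modified time as the age is an assumption, since the slide does not say which timestamp the policy uses.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass
class FileInfo:
    path: str                # path in the unified namespace
    region: str              # S3 region currently backing the file
    last_modified: datetime  # assumed to define the file's "age"


def select_for_migration(files, source_region, max_age, now):
    """Return files in source_region whose age exceeds max_age."""
    cutoff = now - max_age
    return [
        f for f in files
        if f.region == source_region and f.last_modified < cutoff
    ]


now = datetime(2022, 9, 22, tzinfo=timezone.utc)
files = [
    FileInfo("/Data/Sales/a", "us-west-1", now - timedelta(days=10)),
    FileInfo("/Data/Sales/b", "us-west-1", now - timedelta(days=2)),
    FileInfo("/Data/Reports/c", "us-east-1", now - timedelta(days=30)),
]
to_move = select_for_migration(files, "us-west-1", timedelta(days=7), now)
print([f.path for f in to_move])
```

Because users address files through a single Alluxio path, moving the selected files between regions would not change the paths that applications see.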
29. I/O Challenges in ML/DL
What's Different
● Training data often consists of a massive amount of small files (billions of 100KB photos)
● Size of training data keeps growing & can exceed individual server capacity
● Training jobs are highly concurrent and require high I/O to keep GPUs utilized
30. Using Alluxio for DL
Distributed Caching
- Only fetch data on cache miss
- No need to copy data before use
[Diagram labels: Training Instances, each with a POSIX interface; Alluxio Servers]
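The "fetch only on cache miss" behavior can be illustrated with a minimal read-through cache. This is a single-process sketch with hypothetical names, not Alluxio's distributed cache: the fetch callable stands in for a read from the remote store.

```python
class ReadThroughCache:
    """Serve reads from a local cache and fetch from the remote store
    only the first time a path is requested."""

    def __init__(self, fetch):
        self._fetch = fetch  # callable: path -> bytes (remote read)
        self._cache = {}
        self.misses = 0

    def read(self, path: str) -> bytes:
        if path not in self._cache:
            self.misses += 1
            self._cache[path] = self._fetch(path)
        return self._cache[path]


remote_reads = []


def fake_remote_fetch(path: str) -> bytes:
    # Stand-in for a remote object store read; records each call.
    remote_reads.append(path)
    return f"contents of {path}".encode()


cache = ReadThroughCache(fake_remote_fetch)
for epoch in range(3):  # e.g. three training epochs over the same file
    data = cache.read("/train/img_0001.jpg")
```

Only the first epoch touches the remote store; later epochs are served from the cache, and no separate copy step is needed before training starts.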
32. Large Scale Deep Learning
MOMO (NASDAQ: MOMO)
TOPOLOGY: ON-PREMISES
Runs thousands of Alluxio nodes across multiple Alluxio clusters, managing more than 100 TB of data for search and training:
● Support multiple storage & compute frameworks
● Accelerate compute & training tasks
● Reduce the metadata and data overhead
Alluxio’s Solution
Model Training using PyTorch + Alluxio + Ceph
● 2 billion small files
● Reduce metadata & data interactions with Ceph to improve performance
https://www.alluxio.io/resources/videos/ml-and-query-acceleration-at-momo-with-alluxio-chinese/