The 7 Things I Know About Cyber Security After 25 Years | April 2024
Alluxio Use Cases at Strata+Hadoop World Beijing 2016
1. Memory-Speed Virtual Distributed Storage: Latest Features and Use Cases
August 2016 @ Strata+Hadoop World Beijing
Bin Fan, Yupeng Fu
1
2. About Us
• Bin Fan
• Software Engineer @ Alluxio, Inc.
• Alluxio PMC
• Worked at Google, MSR
• CMU, CUHK, USTC
• Wechat @ apc999_fb
• Yupeng Fu
• Software Engineer @ Alluxio, Inc.
• Alluxio PMC
• Worked at Palantir, Google
• UCSD, Tsinghua
• Wechat @richbird9
2
3. • Team consists of Alluxio creators and top committers
• Invested by
• Committed to Alluxio Open Source
• http://www.alluxio.com
Alluxio Inc.
3
4. Outline
• What is Alluxio
• Example use cases
1. Off-heap memory to alleviate resource pressure
2. Fast data sharing between jobs
3. Accelerate access to remote storage
4. Unified namespace
• Summary
4
6. OPEN SOURCE ALLUXIO: History
• Started in the AMPLab at UC Berkeley
– Summer of 2012
• Open Sourced as Tachyon (renamed to Alluxio)
– Apache 2.0 License
– Latest Release: version 1.2.0 (July, 2016)
6
8. OPEN SOURCE ALLUXIO
The fastest growing
open-source project
in the big data
ecosystem.
Currently over 300
contributors from
over 100
organizations.
8
9. Outline
• What is Alluxio
• Example use cases
1. Off-heap memory to alleviate resource pressure
2. Fast data sharing between jobs
3. Accelerate access to remote storage
4. Unified namespace
• Summary
9
12. Spark Job1
Spark mem
Spark Job2
Spark mem
HDFS / Amazon S3
block 1
block 3
block 2
block 4
HDFS
disk
block 1
block 3
block 2
block 4
Alluxio
In-Memory
block 1
block 3
block 4
Data not duplicated at memory-level
Consolidating Memory
storage engine &
execution engine
same process
12
13. Data Resilience during Crashes
Spark Task
Spark Memory
block manager
block 1
block 3
HDFS / Amazon S3
block 1
block 3
block 2
block 4
Process crash requires network and/or disk I/O to re-read the
data
storage engine &
execution engine
same process
13
14. Data Resilience during Crashes
Crash
Spark Memory
block manager
block 1
block 3
HDFS / Amazon S3
block 1
block 3
block 2
block 4
Process crash requires network and/or disk I/O to re-read the
data
storage engine &
execution engine
same process
14
15. HDFS / Amazon S3
Data Resilience during Crashes
block 1
block 3
block 2
block 4
Crash
storage engine &
execution engine
same process
Process crash requires network and/or disk I/O to re-read the
data
15
16. Spark Task
Spark Memory
block manager
HDFS
disk
block 1
block 3
block 2
block 4
Alluxio
In-Memory
block 1
block 3
block 4
Process crash only needs memory I/O to re-read the
data
Data Resilience during Crashes
storage engine &
execution engine
same process
16
17. Crash
Process crash only needs memory I/O to re-read the data
HDFS
disk
block 1
block 3
block 2
block 4
Alluxio
In-Memory
block 1
block 3
block 4
Data Resilience during Crashes
storage engine &
execution engine
same process
17
18. CASE STUDY
Thanks to Alluxio, we now have the raw
data immediately available at every
iteration and we can skip the costs of
loading in terms of time waiting,
network traffic, and RDBMS activity.
- Henry Powell, Barclays
RESULTS
• Barclays workflow iteration time decreased
from hours to seconds
• Alluxio enabled workflows that were
impossible before
• By keeping data only in memory, the I/O cost of
loading and storing in Alluxio is now on the
order of seconds
Relational Database
Share Data Across Jobs at Memory
Speed
• Reducing time from hours to
seconds
• Solving issues of Teradata as
under storage
18
19. Barclays Previous Architecture
Spark
Teradata
Slow ETL,
30+ minutes
Each context restart
requires ETL process
again
h"ps://dzone.com/ar1cles/Accelerate-‐In-‐Memory-‐Processing-‐with-‐Spark-‐from-‐Hours-‐to-‐Seconds-‐With-‐Tachyon
19
20. Barclays Architecture with Alluxio
Spark
Teradata
Alluxio
ETL only once
Data still available in Alluxio
memory after Job crash
20
21. CASE 2:
Fast Data sharing between apps
Application 1
Alluxio
Application2
HDFS
21
22. Spark Job
Spark
Memory
block 1
block 3
Hadoop MR Job
YARN
HDFS / Amazon S3
block 1
block 3
block 2
block 4
storage engine &
execution engine
same process
Data Sharing Between Frameworks
Inter-process sharing slowed down by network and/or disk I/O
22
23. Data Sharing Between Frameworks
Spark Job
Spark Memory
Hadoop MR Job
YARN
HDFS / Amazon S3
block 1
block 3
block 2
block 4
HDFS
disk
block 1
block 3
block 2
block 4
Alluxio
In-Memory
block 1
block 3
block 4
storage engine &
execution engine
same process
Inter-process sharing can happen at memory
speed
23
24. CASE STUDY
We’ve been running Alluxio in
production for over 9 months, resulting
in 15x speedup on average, and 300x
speedup at peak service times.
- Xueyan Li, Qunar
RESULTS
• Multiple computing frameworks can share data
at memory speed with Alluxio
• Improved the performance of their system with
15x – 300x speedups
• Alluxio provides various user-friendly APIs,
which simplifies the transition to Alluxio.
Sharing Data Across Different
Compute Frameworks
• 6 billion logs (4.5 TB) daily
• Mix of Memory + HDD
24
30. Visualizing the Stack with Alluxio
FAST
104 - 105 MB/s
Application
Remote Storage
MODERATE
103 - 104 MB/s
SLOW
102 - 103 MB/s
Alluxio MEM
Often
Only When Necessary
Alluxio SSD/HDD
Limited
30
34. Pluggable Tier Management Policies
Evict stale data to
slower tier
Promote hot data
to faster tier
34
35. CASE STUDY
Baidu File System
The performance was amazing. With
Spark SQL alone, it took 100-150 seconds
to finish a query; using Alluxio, where
data may hit local or remote Alluxio
nodes, it took 10-15 seconds.
- Shaoshan Liu, Baidu
RESULTS
• Data queries are now 30x faster with Alluxio
• 1000-worker Alluxio cluster run stably,
providing over 50TB of RAM space
• By using Alluxio, batch queries usually lasting
over 15 minutes were transformed into an
interactive query taking less than 30 seconds
Accelerate Access to Remote
Storage
• 1000+ nodes deployment
• 2+ petabytes of storage
• Mix of memory + HDD
• 1000-worker cluster
35
37. Common Scenarios
• At large organizations, data spans many storage
systems
• Legacy storage systems
• Application logic needs to integrate with different
storage systems
37
38. Unified Name Space
An abstraction for applications to access different storage
systems through the same interface
• Benefits
– Application written once with Allluxio API
– Transparently add new storage systems
– Seamlessly integrate with existing systems
38
39. Transparent Naming
Operations over persisted Alluxio objects mapped
transparently to underlying storage
alluxio://Data/Sales <-> hdfs://Data/Sales
Alluxio
Storage System (HDFS, S3, …)
alluxio://host:port/
Data
Users
Reports
Sales
Alice
Bob
hdfs://host:port/
Data
Users
Reports
Sales
Alice
Bob
39
mount
40. Multiple Storage Systems
• Unified namespace for multiple data sources
alluxio://Data/Sales -> s3://bucket/Sales
alluxio://Users/Alice -> hdfs://Users/Alice
• API for on-the-fly mounting / unmounting
Alluxio
Storage System A
alluxio://host:port/
Data
Users
Alice
Bob
hdfs://host:port/
Users
Alice
Bob
Storage System B
s3://host/bucket
Reports
Sales
Reports
Sales
40
mount
mount
42. CASE STUDY
RESULTS
• Alluxio’s unified namespace enables different
applications and frameworks to easily interact
with their data from different storage systems
• Global data centers across North America and
Asia
• Tiered storage feature manages various
storage resources including memory, SSD and
disk
Transparently Manage Data
Across Different Storage Systems
• Storage medium: MEM
• Clusters in different regions
An Internet Company
Machine Learning Pipelines
42
42
43. Outline
• What is Alluxio
• Example use cases
1. Off-heap memory to alleviate resource pressure
2. Fast data sharing between jobs
3. Accelerate access to remote storage
4. Unified namespace
• Summary
43
44. Takeaway: When to use Alluxio
• Two or more jobs access the same dataset
• Job(s) may not always succeed
• Spark with memory/GC pressure
• Jobs/Apps are pipelined
• Resulting data does not need to be immediately
persisted
• Interact with remote data
• Data stored across different storage systems
44