Alluxio Day x APAC Modern Data Stack
September 22, 2022
For more on Alluxio Day: https://www.alluxio.io/alluxio-day/
For more Alluxio events: https://alluxio.io/events/
Speaker: Luo Li (Director of Data Infra, Shopee & Alluxio PMC)
Shopee is the leading e-commerce platform in Southeast Asia. In this presentation, Luo Li from Shopee shares the Data Infra team's recent work on storage acceleration with Presto and on storage servitization. He details how Shopee uses Alluxio to accelerate Presto queries and to provide standardized methods of accessing data through Alluxio-Fuse and Alluxio-S3.
12. Storage Acceleration: Update Policy
• Presto calls the load/free API
• Cache Manager loads a new partition according to the path pattern
• A scheduled task cleans up expired partitions
[Diagram: HDFS audit log -> new partition event -> path match against the path pattern -> match: load to Alluxio; no match: do not load to Alluxio]
Path pattern: /a/b/date={}
New partition events, e.g.:
/a/b/date=2022-09-21 (will match)
/a/c/date=2022-09-21 (will not match)
Existing partitions: .../date=2022-09-18, .../date=2022-09-19, .../date=2022-09-20
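The update policy above can be sketched as follows. This is a minimal illustration, not Shopee's Cache Manager: it assumes that `{}` in a path pattern stands for a single value containing no `/`, and that partitions expire after a fixed number of days. The function names and the `keep_days` parameter are hypothetical.

```python
import re
from datetime import date, timedelta

def pattern_to_regex(pattern):
    """Turn a pattern such as /a/b/date={} into a compiled regex.

    Assumption: each {} matches one value that contains no '/'.
    """
    literal_parts = [re.escape(part) for part in pattern.split("{}")]
    return re.compile("[^/]+".join(literal_parts))

def should_load(path, patterns):
    """Return True if a new partition path matches any configured pattern."""
    return any(pattern_to_regex(p).fullmatch(path) for p in patterns)

def expired_partitions(paths, today, keep_days):
    """Return partition paths whose date= value is older than keep_days.

    The fixed retention window is an assumed policy, not taken from the slides.
    """
    cutoff = today - timedelta(days=keep_days)
    expired = []
    for path in paths:
        found = re.search(r"date=(\d{4}-\d{2}-\d{2})$", path)
        if found and date.fromisoformat(found.group(1)) < cutoff:
            expired.append(path)
    return expired

patterns = ["/a/b/date={}"]
print(should_load("/a/b/date=2022-09-21", patterns))
print(should_load("/a/c/date=2022-09-21", patterns))
```

In a sketch like this, the scheduled cleanup task would call `expired_partitions` on the cached partition list and free each returned path.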
13. Storage Acceleration: HMS Tag
[Diagram: Presto, HMS, Alluxio, HDFS; branches labeled "On Alluxio" and "No tag"]
• Tag key: cache; value: ${DC}/Alluxio/ebj@${Alluxio_nameservice}
• If the partition exists, set the property in the partition properties
• Otherwise, set the property in the table properties
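The tagging rule above can be illustrated with a small sketch. It uses plain dictionaries as a stand-in for Hive Metastore metadata and makes no real HMS calls. The dictionary layout, the function name `set_cache_tag`, and the sample tag value are assumptions for illustration only.

```python
CACHE_KEY = "cache"

def set_cache_tag(table, partition_name, tag_value):
    """Set the cache tag on the partition if it exists, else on the table.

    `table` is a plain dict standing in for HMS metadata (hypothetical layout):
    {"parameters": {...}, "partitions": {name: {"parameters": {...}}}}
    Returns "partition" or "table" to show where the tag was written.
    """
    partition = table.get("partitions", {}).get(partition_name)
    if partition is not None:
        partition.setdefault("parameters", {})[CACHE_KEY] = tag_value
        return "partition"
    table.setdefault("parameters", {})[CACHE_KEY] = tag_value
    return "table"

# Placeholder value; the real tag is built from ${DC} and ${Alluxio_nameservice}.
tag = "dc1/Alluxio/ebj@ns1"
table = {"parameters": {}, "partitions": {"date=2022-09-21": {"parameters": {}}}}
print(set_cache_tag(table, "date=2022-09-21", tag))
print(set_cache_tag(table, "date=2022-09-22", tag))
```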
15. Storage Acceleration: Consistency
● Subscribe to HDFS audit log events
1. Include rename, delete, create, append, truncate, concat, and similar events
2. Use Flink filtering to reduce the number of messages
● A scheduled task checks and repairs consistency
[Diagram: HDFS audit log, Flink, a new Kafka topic, Cache Manager]
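The filtering step can be sketched in plain Python as a stand-in for the Flink job. The sample log line is made up, and filtering by cached path prefix is an assumption about how the message volume might be reduced; only the command names come from the slide.

```python
import re

# Commands named on the slide as relevant to cache consistency.
MUTATING_CMDS = {"rename", "delete", "create", "append", "truncate", "concat"}
FIELD = re.compile(r"(\w+)=(\S+)")

def parse_audit_line(line):
    """Extract key=value fields from one audit log line into a dict."""
    return dict(FIELD.findall(line))

def keep_event(line, cached_prefixes):
    """Keep only mutating events on paths under a cached prefix (assumed filter)."""
    fields = parse_audit_line(line)
    if fields.get("cmd") not in MUTATING_CMDS:
        return False
    paths = [fields.get("src", ""), fields.get("dst", "")]
    return any(p.startswith(prefix) for p in paths for prefix in cached_prefixes)

# Made-up sample lines in the general shape of an HDFS audit log entry.
rename_line = (
    "2022-09-21 10:00:00 INFO FSNamesystem.audit: allowed=true\tugi=etl (auth:SIMPLE)"
    "\tip=/10.0.0.1\tcmd=rename\tsrc=/a/b/date=2022-09-21/tmp"
    "\tdst=/a/b/date=2022-09-21/part-0\tperm=null"
)
read_line = rename_line.replace("cmd=rename", "cmd=getfileinfo")

print(keep_event(rename_line, ["/a/b/"]))
print(keep_event(read_line, ["/a/b/"]))
```

Events that pass a filter like this would be published to the new Kafka topic for the Cache Manager to consume.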
17. Storage Acceleration: Community Contribution
• 8 merged, 1 WIP, 1 fixed by Alluxio.
TYPE | PR | STATUS
core | Fix master down when master change to leader | merged
Hadoop 2.10 | Fix HdfsVersion miss hadoop 2.10 config | merged
Hadoop 2.10 | Fix integration/yarn/pom.xml enforcer-plugin miss hadoop 2.10.x config | merged
Hadoop 2.10 | Fix common.go miss hadoop 2.10 configuration | merged
Command Line | Improve shell command support ebj nameservice | merged
Command Line | Fix for Alluxio.logs.dir | fixed by Alluxio
Command Line | Modify the meaning of variables more clearly | merged
Web Page | Fix isMounted should not invoke ufs, if not /metrics page very slowly | merged
Web Page | Fix FormatUtils.getSizeFromBytes method should supports EB | merged
NameServices | Fix unescape the ufs url of Alluxio fsadmin report metrics result | WIP
18. Private &
Confidential 18
Storage Acceleration and Servitization at Shopee
1 Storage Situation
2 Storage Acceleration
3 Storage Servitization
19. Storage Servitization: Status
▪ Most data is stored in HDFS
▪ Various development languages are used
▪ HDFS has insufficient support for non-Java clients
▪ Many applications need to access data as a service, not as a hard disk
20. Storage Servitization: Solutions
Fuse for HDFS
▪ Alluxio Fuse service on physical machines
▪ Alluxio Fuse service on a Kubernetes cluster
S3 for HDFS
▪ Use the S3 API to access HDFS through the Alluxio proxy service
21. Storage Servitization: S3
Amazon S3 concepts: buckets, objects, keys, regions
▪ Bucket: A bucket is a container for objects stored in Amazon S3
▪ Object: Objects are the fundamental entities stored in Amazon S3
▪ Key: An object key (or key name) is the unique identifier for an object within a bucket
▪ Region: You choose the region in which the buckets you create are stored
22. Storage Servitization: S3 for HDFS
Access HDFS data via Alluxio using the S3 protocol
▪ Alluxio can mount HDFS data
▪ Alluxio provides a Proxy service
▪ The Proxy is compatible with the basic operations of the S3 API
▪ S3 SDKs support many development languages
23. Storage Servitization: Alluxio Proxy for S3 mapping
▪ The first-level directory is the bucket
▪ Subdirectories and the file path form the key
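The mapping above can be illustrated with a short sketch that splits an Alluxio path into a bucket and a key and builds a path-style object URL. The helper names are hypothetical, and the endpoint host is a placeholder; the port and `/api/v1/s3` prefix are assumed here and should be checked against the actual proxy deployment.

```python
from urllib.parse import quote

def alluxio_path_to_s3(path):
    """Split an Alluxio path into (bucket, key).

    The first-level directory becomes the bucket; the rest becomes the key.
    """
    parts = [p for p in path.split("/") if p]
    if len(parts) < 2:
        raise ValueError("need a first-level directory and a file path: %r" % path)
    return parts[0], "/".join(parts[1:])

def object_url(endpoint, path):
    """Build a path-style URL for the object behind an Alluxio path."""
    bucket, key = alluxio_path_to_s3(path)
    return "%s/%s/%s" % (endpoint.rstrip("/"), quote(bucket), quote(key))

# Placeholder endpoint (assumption); replace with the real proxy address.
ENDPOINT = "http://alluxio-proxy.example.com:39999/api/v1/s3"

print(alluxio_path_to_s3("/a/b/date=2022-09-21/part-0"))
print(object_url(ENDPOINT, "/warehouse/orders/part-0"))
```

With a mapping like this, any S3 SDK pointed at the proxy endpoint can address HDFS files by bucket and key.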
26. Storage Servitization: Community Contribution
▪ 12 merged.
TYPE | PR | STATUS
proxy | Fix wrong format of s3 bucket creationDate | merged
proxy | Support parse authorization headers for s3 proxy | merged
proxy | Add s3 rest service audit log | merged
proxy | Add header parameter 'Authorization' for postBucket method | merged
fuse | Fix wrong method call to get username and wrong parameter assignment | merged
fuse | Load jnr-runtime dependencies at initialization | merged
fuse | Support overwrite for rename | merged
csi | Replace invalid env with args in nodeserver | merged
core | Avoid checking file permissions in getFileInfo method | merged
doc | Fix bug case of S3 REST API | merged
doc | Fix wrong file name in k8s doc | merged
doc | Fix ambiguous description for impersonation in CN doc | merged
28. Storage Servitization: Fuse
WHAT IS IT
▪ Filesystem in Userspace
High-Level Architecture
▪ Kernel
▪ User-level daemon
29. Storage Servitization: Alluxio Fuse
Requirements
▪ libfuse
Implementation
▪ JNR-Fuse
▪ JNI-Fuse
Deployment
▪ Standalone Fuse
▪ Fuse on Workers
Limitations
▪ Random writes are not supported
30. Storage Servitization: Alluxio CSI
WHAT IS IT
▪ Standard storage interface for containers
Fuse deployment mode
▪ On nodeserver pod
▪ On separate pod
31. Storage Servitization: k8s sidecar for Alluxio
WHAT IS IT
▪ A Fuse sidecar container in a Pod mounts the Alluxio directory
Features
▪ Each Pod is configured independently, which gives high flexibility
▪ Each Pod runs its own Fuse container, so Pods do not affect each other
▪ Each Fuse process occupies a container, so the solution consumes more resources
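The sidecar layout can be sketched as a Pod manifest built in Python. This is an illustration under assumptions, not the manifest used at Shopee: the image names, container names, and mount path are placeholders, and the Fuse container's command is omitted because it depends on the image. Sharing the mount through an `emptyDir` volume with mount propagation is one common Kubernetes pattern for exposing a Fuse mount from one container to another.

```python
import json

def sidecar_pod(app_image, fuse_image, mount_path="/mnt/alluxio"):
    """Build a Pod manifest with an application container and a Fuse sidecar.

    All names and images are placeholders (assumptions for illustration).
    """
    shared = {"name": "alluxio-fuse-mount", "emptyDir": {}}
    fuse = {
        "name": "alluxio-fuse",
        "image": fuse_image,
        # Bidirectional propagation requires a privileged container.
        "securityContext": {"privileged": True},
        "volumeMounts": [{
            "name": shared["name"],
            "mountPath": mount_path,
            "mountPropagation": "Bidirectional",
        }],
    }
    app = {
        "name": "app",
        "image": app_image,
        # The application container only receives the mount made by the sidecar.
        "volumeMounts": [{
            "name": shared["name"],
            "mountPath": mount_path,
            "mountPropagation": "HostToContainer",
        }],
    }
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": "app-with-alluxio-fuse"},
        "spec": {"containers": [app, fuse], "volumes": [shared]},
    }

print(json.dumps(sidecar_pod("example/app:latest", "example/alluxio-fuse:latest"), indent=2))
```

Because every Pod carries its own Fuse container, the resource cost noted above grows with the number of Pods.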
32. Storage Servitization: Summary
Criterion | Fuse on physical machine | K8s-csi (Fuse on nodeserver pod) | K8s-csi (Fuse on separate pod) | K8s-sidecar
maintenance cost | high | low | higher | higher
resource usage | low | lower | high | high
independence | high | low | high | high
stability | high | low | high | high