Most users know HDFS as the reliable store of record for big data analytics. HDFS is also used to store transient and operational data when working with cloud object stores, such as Microsoft Azure or Amazon S3, and with on-premises object stores, such as Western Digital’s ActiveScale. In these settings, applications often manage data spread across multiple storage systems or clusters, which requires complex workflows to synchronize data between filesystems, whether for business continuity planning (BCP) or to support hybrid cloud architectures that meet business goals for durability, performance, and coordination.
To reduce this complexity, HDFS-9806 adds a PROVIDED storage tier that mounts external storage systems in the HDFS NameNode. Building on this functionality, remote namespaces can now be synchronized with HDFS, enabling asynchronous writes to the remote storage and synchronous, transparent reads of remotely stored file data by local applications. In this talk, which covers the work in progress under HDFS-12090, we present how the Hadoop admin can manage storage tiering between clusters, and how HDFS handles it internally through the snapshotting mechanism and by asynchronously satisfying the storage policy.
Speakers
Chris Douglas, Microsoft, Principal Research Software Engineer
Thomas Demoor, Western Digital, Object Storage Architect
2. • Tiered Storage [issues.apache.org]
– HDFS-9806
– HDFS-12090
Microsoft – Western Digital – Apache Community
Virajith Jalaparti
Chris Douglas
…
Ewan Higgs
Kasper Janssens
Thomas Demoor
…
3. • Hadoop Compatible FS [1]: s3a://, wasb://, adl://, …
• Direct IO between Hadoop apps and Object Store
• Disaggregated compute & storage
• HDFS NameNode functions taken up by Object Store
Hadoop already plays nicely with Object Stores
[Diagram: APP inside the HADOOP CLUSTER reads/writes directly against the REMOTE STORE]
[1]: https://s.apache.org/Hadoop3FSspec
• Pain points:
– Not really a FileSystem: rename, append, directories, ...
• Even with correct semantics, performance still differs markedly from HDFS
• HDFS features unavailable (e.g., hedged reads, snapshots, etc.)
– No locality
• Higher latency than attached storage
• Higher variance in both latency and throughput
– No HDFS integration
• Policies for users, permissions, quota, security, …
• Storage Plugins (e.g. Ranger, Sentry)
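To make the direct-IO model concrete, here is a minimal read sketch against the Hadoop-compatible FileSystem API; the bucket and key names are hypothetical, and s3a credentials/config are assumed to be set up elsewhere:

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class DirectObjectStoreRead {
  public static void main(String[] args) throws Exception {
    // Bind a FileSystem client to the store; no HDFS cluster is involved.
    FileSystem s3 = FileSystem.get(URI.create("s3a://my-bucket/"), new Configuration());
    // Every read goes over the network to the store: no locality, higher variance.
    try (FSDataInputStream in = s3.open(new Path("s3a://my-bucket/logs/part-00000"))) {
      byte[] buf = new byte[8192];
      int n = in.read(buf);
      System.out.println("read " + n + " bytes directly from the object store");
    }
  }
}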
4. • External Storage Tier for HDFS
– HDFS Storage Policy: DISK, SSD, RAM, ARCHIVE, PROVIDED
• Share namespace, not only data!
– Keep 1-to-1 mapping: HDFS file ↔ external object
• No change to existing HDFS workflows
– Hadoop Apps interact with HDFS as before (fully transparent)
– Data Tiering happens async in background
– Native support for all HDFS features / admin tools
• Data Tiering controlled by admin
– On directory / file level
– Through Storage Policy (e.g. <HDD, HDD, HDD> → <PROVIDED>; see the sketch below)
• HDFS NameNode scalability not a bottleneck
– HDFS manages the working set/compatibility
– Object store manages larger data lake, ingest, etc.
Goal: let HDFS play nicely with Object Stores
[Diagram: APP reads/writes HDFS inside the HADOOP CLUSTER; HDFS writes back asynchronously to the REMOTE STORE and loads data from it on demand]
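As an illustration of the admin-facing control, a minimal Java sketch that assigns the policy programmatically; the PROVIDED policy name follows this proposal, and the path is hypothetical:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SetProvidedPolicy {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    // Cold data: no local replicas; reads fall through to the remote store.
    fs.setStoragePolicy(new Path("/data/cold"), "PROVIDED");
    // Equivalent to: hdfs storagepolicies -setStoragePolicy -policy PROVIDED -path /data/cold
  }
}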
5. “Mount” remote storage in HDFS
• Use HDFS to manage remote storage
– HDFS blocks correspond to a fixed range of bytes in the remote store (sketched below)
– AliasMap (DWS17: youtu.be/kpNDZNp-Nlw)
– HDFS coordinates reads/writes to remote store
– Mount remote store as a PROVIDED tier in HDFS
– Set StoragePolicy to move data into HDFS
[Diagram: the remote namespace (/ with d, e, f) is mounted at mount point c in the HDFS namespace (/ with a, b, c); the Alias Map records HDFS block -> remote location; the APP reads/writes HDFS, which writes through to the REMOTE STORE and loads from it on demand]
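Conceptually the Alias Map is just an index from block ID to a byte range in a remote object; the sketch below models only that idea (class and field names are illustrative, not the actual HDFS-9806 types, which store similar file-region records):

import java.net.URI;
import java.util.HashMap;
import java.util.Map;

// Each PROVIDED block aliases a fixed byte range of one remote object.
class AliasEntry {
  final URI remoteObject; // e.g. s3a://bucket/key
  final long offset;      // start of the range backing this block
  final long length;      // number of bytes in the block
  AliasEntry(URI remoteObject, long offset, long length) {
    this.remoteObject = remoteObject;
    this.offset = offset;
    this.length = length;
  }
}

class AliasMap {
  private final Map<Long, AliasEntry> byBlockId = new HashMap<>();
  void put(long blockId, AliasEntry entry) { byBlockId.put(blockId, entry); }
  // DataNodes resolve a PROVIDED block to (object, offset, length) on read.
  AliasEntry resolve(long blockId) { return byBlockId.get(blockId); }
}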
6. PROVIDED storage on the READ path
[Diagram: remote files /foo/bar, /foo/baz, /foo/bazt, /foo/bazz; labels: IaaS, (De)Hydration, Delegation]
7. PROVIDED storage on the READ path
[Diagram: the same remote files now mounted under /foo in HDFS (bar, baz, bazt, bazz); setrep=2 illustrates (de)hydration of local replicas; labels: IaaS, (De)Hydration, Delegation]
8. PROVIDED storage on the READ path
[Diagram: a /cloud mount accessed through (Router-Based) Federation, delegating to the remote namespace; labels: IaaS, (De)Hydration, Delegation]
9. Apache Hadoop 3.1.0
• Generate FSImage from a FileSystem
– Start a NameNode serving remote data
– Serve from (a subset of) DataNodes in the cluster
• Backported and deployed in production at Microsoft
• Static: namespace changes are not reflected in the HDFS NameNode
• Prototype code [2] with the PROVIDED abstraction
– Read-through caching of blocks (demand paging)
– Scheduled, metered prefetch for recurring pipelines with SLOs
– Write-through to remote (participant in the HDFS write pipeline)
– Wire FSImage to a running NameNode
• Per-application NameNodes, with isolation
• Bidirectional synchronization out of scope
[2]: https://github.com/Microsoft-CISL/hadoop/tree/tieredStore-sig16
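For orientation, a sketch of the configuration knobs behind the 3.1.0 feature, shown programmatically; the key names follow the HDFS-9806 documentation but should be verified against your release, and the alias-map file location is hypothetical:

import org.apache.hadoop.conf.Configuration;

public class ProvidedSiteConf {
  public static Configuration providedSite() {
    Configuration conf = new Configuration();
    // Enable the PROVIDED storage tier in the NameNode.
    conf.setBoolean("dfs.namenode.provided.enabled", true);
    // AliasMap implementation mapping PROVIDED blocks to remote byte ranges.
    conf.set("dfs.provided.aliasmap.class",
        "org.apache.hadoop.hdfs.server.common.blockaliasmap.impl.TextFileRegionAliasMap");
    // Where the text alias map (produced alongside the generated FSImage) lives.
    conf.set("dfs.provided.aliasmap.text.read.file", "file:///opt/provided/blocks.csv");
    return conf;
  }
}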
10. Running Apache Hadoop in the cloud
• HDInsight/Elastic MapReduce (EMR)/etc.
• Disaggregation introduces not only latency, but also variance
• “Lift and shift” workloads
– Rely on HDFS plugins
– May need to use attached storage to meet SLOs
– Would otherwise require spending more on capacity in the remote store
[Chart: stddev/mean of storage performance]
11. • HDFS can be used as a cache for Object Storage
• Similar to $my_favorite_caching_FS (CFS)?
– These are all caching systems that dispatch between storage systems horizontally
– We want to tier the storage systems vertically
• Support HDFS itself, not just the Hadoop ecosystem around FileSystem
Notes on Caching
[Diagram: left, Compute nodes access both HDFS and the cloud store through a caching FS ($CFS) layer; right, Compute nodes use HDFS directly, and HDFS tiers to the cloud store]
12. External Storage for HDFS
[Diagram: an Object Store namespace (/ with bucket1 and carlhadoop) backing an HDFS cluster namespace (/ with reports containing fileA, fileB, dir, and sales); the NameNode maps the files, DataNodes 1..N expose PROVIDED (P) volumes, and a Hadoop Client accesses fileA, fileB, dir through HDFS]
13. • “DropBox for Hadoop”
– Hadoop cluster has complete namespace but only “data in working set” is stored locally
– Dynamically page in missing data from object store on read
– Asynchronously write back data to object store
• Storage Policies + Replication count offer rich placement options
– E.g.: hot data: <SSD, PROVIDED> / cold data: <PROVIDED>
• Dedicated object storage system more efficient ($$$)
– Similar goal as ARCHIVE storage policy
– Object storage features (erasure coding, multi-geo replication, …)
• Data sharing with non-Hadoop apps
– File-object mapping means objects can be accessed in remote store with REST API / SDKs
Use case: External Storage for HDFS
15. WD ActiveScale Object Storage
• Western Digital moving up the stack (Data Center Systems)
• Scale-out object storage system for Private & Public Cloud
• Key features:
– Compatible with Amazon S3 API
– Strong consistency (not eventual!)
– Erasure coding for efficient storage
• Scale:
– Petabytes per rack
– Billions of objects per rack
– Linear scalability in # of racks
• More info at http://www.hgst.com/products/systems
16. • AS AN Administrator
• I CAN configure HDFS with an object storage backend
hdfs storagepolicies -setStoragePolicy -policy PROVIDED -path /var/log
hdfs syncservice -create -backupOnly -name activescale /var/log s3a://hadoop-logs/
• SO THAT when a user copies files to HDFS they are asynchronously copied to the synchronization endpoint
Demo time
17. Another example
• AS AN Administrator
• I CAN set the Storage Policy to be PROVIDED_ONLY
hdfs storagepolicies -setStoragePolicy -policy PROVIDED_ONLY -path /var/log
• SO THAT data is no longer on the DataNodes but is transparently read through from the synchronization endpoint on access.
18. • Preserve file-object mapping
– AliasMap (last year’s talk – HDFS-9806): synchronize namespaces
– Datanodes collaborate to move blocks which together form an object in the destination system
• Minimize impact on frontend traffic / efficient data transfer
– Obvious: read all blocks into a single Datanode to reconstruct the file before transferring
– Efficient: transfer block by block directly out of the cluster using
• S3: multipart upload
• WASB: append blobs
• HDFS: tmpdir + concat
• Flexible deployment: could run in NameNode OR as External service
– In the NameNode: easy to deploy, but adds resource pressure
– External service: more difficult to deploy for some sites, but reduces resource pressure
– Ongoing community discussion; start with external, include internal option as required
Requirements
19. • MountManager manages all the local mount points
– Mount point can be configured to sync with external store
• Periodically create a diff by comparing snapshots of the mountpoint
– NEW SyncService (in/out NameNode)
• Generate a “phased plan” for ordering the operations in the diff
– Multiple ordered phases
• RENAMES_TO_TEMP, DELETES, RENAMES_TO_FINAL, CREATE_DIRS, CREATE_FILES
• e.g. dir creation before file creation
– Parallel operations within a phase
• Leverage multiple datanodes and connections to external store
• e.g. Upload multiple new files in parallel
• Execute plan and track work
– Namespace (metadata) operations originate from SyncService
– Data operations originate from DataNodes
– Tracking: admin can query mountpoint for progress
Deep Dive: Synchronization
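A minimal sketch of this execution model, with hypothetical op types: phases run strictly in order, while the ops inside one phase fan out in parallel:

import java.util.ArrayList;
import java.util.EnumMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

enum Phase { RENAMES_TO_TEMP, DELETES, RENAMES_TO_FINAL, CREATE_DIRS, CREATE_FILES }

class PhasedPlan {
  private final Map<Phase, List<Runnable>> ops = new EnumMap<>(Phase.class);

  void add(Phase phase, Runnable op) {
    ops.computeIfAbsent(phase, p -> new ArrayList<>()).add(op);
  }

  // Phases execute sequentially; ops within one phase run in parallel,
  // e.g. many file uploads at once across DataNodes.
  void execute(ExecutorService pool) throws InterruptedException {
    for (Phase phase : Phase.values()) {
      List<Callable<Object>> batch = new ArrayList<>();
      for (Runnable op : ops.getOrDefault(phase, List.of())) {
        batch.add(Executors.callable(op));
      }
      pool.invokeAll(batch); // blocks until the whole phase completes
    }
  }
}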
20. • Snapshot diff:
– Reflects a 100% accurate point-in-time state of HDFS in the external store
– Snapshot ensures data remains referenceable: blocks are retained until synced
– Does not track create + delete between consecutive snapshots (cf. file B in the figure)
• EditLog post-processing:
– To parallelize:
• Read batches from the log and track lineage between overlapping operations
– HDFS operations might have altered reality: no point-in-time view
• Data is not part of the log: syncing from it would require postponing block garbage collection
Tracking changes: Snapshot diff vs. EditLog
[Figure: timeline from snapshot ss-5 to ss-6; file A persists across both, while file B is created and deleted in between]
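The diff itself comes from the standard HDFS snapshot API; a minimal sketch, assuming a hypothetical snapshottable mount point (e.g. enabled via hdfs dfsadmin -allowSnapshot) and SyncService-created snapshot names:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hdfs.DistributedFileSystem;
import org.apache.hadoop.hdfs.protocol.SnapshotDiffReport;

public class MountDiff {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Path mount = new Path("/basic-test"); // hypothetical snapshottable mount point
    DistributedFileSystem dfs = (DistributedFileSystem) mount.getFileSystem(conf);
    // The SyncService would take one snapshot per sync interval, e.g. ss-5, ss-6, ...
    SnapshotDiffReport diff = dfs.getSnapshotDiffReport(mount, "ss-5", "ss-6");
    for (SnapshotDiffReport.DiffReportEntry entry : diff.getDiffList()) {
      System.out.println(entry); // M / + / - / R entries as in the examples below
    }
  }
}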
21. Example Diff – Simple Case
Simple case – new dirs; new files
Commands
#given /basic-test
mkdir -p /basic-test/a/b/c/d
touch /basic-test/a/b/c/d/f1.bin
touch /basic-test/f1.bin
SnapshotDiffReport
M d .
+ d ./a
+ f ./f1.bin
PhasedPlan
+ d ./a/b/c/d/
+ f ./a/b/c/d/f1.bin
+ f ./f1.bin
22. Example Diff – Harder Case
Harder case – cycle
Commands
#given /swap-test/a.bin
#given /swap-test/b.bin
mv /swap-test/a.bin /swap-test/tmp
mv /swap-test/b.bin /swap-test/a.bin
mv /swap-test/tmp /swap-test/b.bin
SnapshotDiffReport
M d .
R f ./a.bin -> ./b.bin
R f ./b.bin -> ./a.bin
PhasedPlan
R f ./a.bin -> ./tmp/b.bin
R f ./b.bin -> ./tmp/a.bin
R f ./tmp/b.bin -> ./b.bin
R f ./tmp/a.bin -> ./a.bin
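Why the temp phase resolves cycles: each rename is split into a move to a unique staging name plus a move to the final name, so the order within each rename phase no longer matters. A self-contained sketch (the staging scheme is hypothetical):

import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class RenameStaging {
  // Split possibly-cyclic renames (src -> dst) into two safe phases.
  public static void main(String[] args) {
    Map<String, String> renames = new LinkedHashMap<>();
    renames.put("./a.bin", "./b.bin"); // a.bin and b.bin swap places: a cycle
    renames.put("./b.bin", "./a.bin");

    List<String> toTemp = new ArrayList<>(), toFinal = new ArrayList<>();
    int i = 0;
    for (Map.Entry<String, String> r : renames.entrySet()) {
      String tmp = "./tmp/" + (i++);             // unique staging name
      toTemp.add(r.getKey() + " -> " + tmp);     // phase RENAMES_TO_TEMP
      toFinal.add(tmp + " -> " + r.getValue());  // phase RENAMES_TO_FINAL
    }
    toTemp.forEach(System.out::println);
    toFinal.forEach(System.out::println);
  }
}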
23. • Tiered Storage HDFS-12090 [issues.apache.org]
– Design documentation
– List of subtasks, lots of linked tickets – take one!
– Discussion of scope, implementation, and feedback
• Bert Verslyppe, Hendrik Depauw, Íñigo Goiri, Rakesh Radhakrishnan, Uma
Gangumalla, Daryn Sharp, Steve Loughran, Sanjay Radia, Anu Engineer,
Jitendra Pandey, Andrew Wang, Zhe Zhang, Allen Wittenauer, and many
others …
Thanks to the community for feedback & help!
25. • Applications write to HDFS
– First to DISK, then the SyncService asynchronously copies to the synchronization endpoint
– When files have been copied, the extraneous disk replicas can be removed
Deep Dive: MultiPart Upload
[Diagram: a Client writes a File (Block1, Block2, Block3) to Datanodes; the SyncService coordinates InitMultipart, per-block PutPart from the Datanodes, and Complete against the External Store]
26. • Common concept in Object Storage
– Supported by S3, WASB
• Usage in Hadoop
– S3A uses it – see Steve Loughran’s talk
– New to HDFS – HDFS-13186
• Three phases
– UploadHandle initMultipart(Path filePath)
– PartHandle putPart(Path filePath, InputStream inputStream, int partNumber, UploadHandle uploadId, long lengthInBytes)
– void complete(Path filePath, List<Pair<Integer, PartHandle>> handles, UploadHandle multipartUploadId)
• Benefits:
– Object/File Isolation – you only see the results when it’s done
– Can be written in parallel across multiple nodes
MultipartUploader
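Putting the three phases together, a usage sketch against the interface above; the marker types stand in for the API proposed in HDFS-13186 (method names as on this slide), the uploader instance, input streams, and part lengths are assumed to come from elsewhere, and part numbers start at 1:

import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;
import org.apache.commons.lang3.tuple.Pair;
import org.apache.hadoop.fs.Path;

interface UploadHandle {}
interface PartHandle {}
interface MultipartUploader {
  UploadHandle initMultipart(Path filePath) throws Exception;
  PartHandle putPart(Path filePath, InputStream in, int partNumber,
                     UploadHandle uploadId, long lengthInBytes) throws Exception;
  void complete(Path filePath, List<Pair<Integer, PartHandle>> handles,
                UploadHandle uploadId) throws Exception;
}

class MultipartSketch {
  // Sequential for clarity; in the SyncService each putPart can run on a
  // different DataNode in parallel, one part per block.
  static void upload(MultipartUploader uploader, Path dest,
                     List<InputStream> parts, List<Long> lengths) throws Exception {
    UploadHandle id = uploader.initMultipart(dest);                // phase 1: initialize
    List<Pair<Integer, PartHandle>> handles = new ArrayList<>();
    for (int i = 0; i < parts.size(); i++) {                       // phase 2: upload parts
      PartHandle h = uploader.putPart(dest, parts.get(i), i + 1, id, lengths.get(i));
      handles.add(Pair.of(i + 1, h));
    }
    uploader.complete(dest, handles, id);  // phase 3: the object becomes visible only now
  }
}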