Most users know HDFS as the reliable store of record for big data analytics. HDFS is also used to store transient and operational data when working with cloud object stores, such as Microsoft Azure or Amazon S3, and with on-premises object stores, such as Western Digital’s ActiveScale. In these settings, applications often manage data spread across multiple storage systems or clusters, which requires complex workflows to synchronize data between filesystems, whether for business continuity planning (BCP) or to support hybrid cloud architectures that meet business goals for durability, performance, and coordination.
To reduce this complexity, HDFS-9806 adds a PROVIDED storage tier that mounts external storage systems in the HDFS NameNode. Building on this functionality, remote namespaces can now be synchronized with HDFS, enabling asynchronous writes to the remote storage and synchronous, transparent reads of remotely stored file data by local applications. In this talk, which covers the work in progress under HDFS-12090, we present how the Hadoop admin can manage storage tiering between clusters, and how HDFS handles it internally through the snapshotting mechanism and by asynchronously satisfying the storage policy.
Speakers
Chris Douglas, Microsoft, Principal Research Software Engineer
Thomas Demoor, Western Digital, Object Storage Architect
2. • Tiered Storage [issues.apache.org]
– HDFS-9806
– HDFS-12090
Microsoft – Western Digital – Apache Community
Virajith Jalaparti
Chris Douglas
…
Ewan Higgs
Kasper Janssens
Thomas Demoor
…
3. • Hadoop Compatible FS [1]: s3a://, wasb://, adl://, …
• Direct IO between Hadoop apps and Object Store
• Disaggregated compute & storage
• HDFS NameNode functions taken up by Object Store
Hadoop already plays nicely with Object Stores
[Diagram: APP inside the HADOOP CLUSTER reads/writes directly against the REMOTE STORE]
[1]: https://s.apache.org/Hadoop3FSspec
• Pain points:
– Not really a FileSystem: rename, append, directories, ...
• Even with correct semantics, performance still differs markedly from HDFS
• HDFS features unavailable (e.g., hedged reads, snapshots, etc.)
– No locality
• Higher latency than attached storage
• Higher variance in both latency and throughput
– No HDFS integration
• Policies for users, permissions, quota, security, …
• Storage Plugins (e.g. Ranger, Sentry)
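To make the direct-IO model concrete, here is a minimal read sketch against the Hadoop-compatible FileSystem API; the bucket and key names are hypothetical, and s3a credentials/config are assumed to be set up elsewhere:

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class DirectObjectStoreRead {
  public static void main(String[] args) throws Exception {
    // Bind a FileSystem client to the store; no HDFS cluster is involved.
    FileSystem s3 = FileSystem.get(URI.create("s3a://my-bucket/"), new Configuration());
    // Every read goes over the network to the store: no locality, higher variance.
    try (FSDataInputStream in = s3.open(new Path("s3a://my-bucket/logs/part-00000"))) {
      byte[] buf = new byte[8192];
      int n = in.read(buf);
      System.out.println("read " + n + " bytes directly from the object store");
    }
  }
}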
4. • External Storage Tier for HDFS
– HDFS Storage Policy: DISK, SSD, RAM, ARCHIVE, PROVIDED
• Share namespace, not only data!
– Keep 1-to-1 mapping: HDFS file ↔ external object
• No change to existing HDFS workflows
– Hadoop Apps interact with HDFS as before (fully transparent)
– Data Tiering happens async in background
– Native support for all HDFS features / admin tools
• Data Tiering controlled by admin
– On directory / file level
– Through Storage Policy (e.g. <HDD, HDD, HDD> → <PROVIDED>; see the sketch below)
• HDFS NameNode scalability not a bottleneck
– HDFS manages the working set/compatibility
– Object store manages larger data lake, ingest, etc.
Goal: let HDFS play nicely with Object Stores
[Diagram: APP reads/writes HDFS inside the HADOOP CLUSTER; HDFS writes back asynchronously to the REMOTE STORE and loads data from it on demand]
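As an illustration of the admin-facing control, a minimal Java sketch that assigns the policy programmatically; the PROVIDED policy name follows this proposal, and the path is hypothetical:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SetProvidedPolicy {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    // Cold data: no local replicas; reads fall through to the remote store.
    fs.setStoragePolicy(new Path("/data/cold"), "PROVIDED");
    // Equivalent to: hdfs storagepolicies -setStoragePolicy -policy PROVIDED -path /data/cold
  }
}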
5. “Mount” remote storage in HDFS
• Use HDFS to manage remote storage
– HDFS blocks correspond to a fixed range of bytes in the remote store (sketched below)
– AliasMap (DWS17: youtu.be/kpNDZNp-Nlw)
– HDFS coordinates reads/writes to remote store
– Mount remote store as a PROVIDED tier in HDFS
– Set StoragePolicy to move data into HDFS
[Diagram: the remote namespace (/ with d, e, f) is mounted at mount point c in the HDFS namespace (/ with a, b, c); the Alias Map records HDFS block -> remote location; the APP reads/writes HDFS, which writes through to the REMOTE STORE and loads from it on demand]
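Conceptually the Alias Map is just an index from block ID to a byte range in a remote object; the sketch below models only that idea (class and field names are illustrative, not the actual HDFS-9806 types, which store similar file-region records):

import java.net.URI;
import java.util.HashMap;
import java.util.Map;

// Each PROVIDED block aliases a fixed byte range of one remote object.
class AliasEntry {
  final URI remoteObject; // e.g. s3a://bucket/key
  final long offset;      // start of the range backing this block
  final long length;      // number of bytes in the block
  AliasEntry(URI remoteObject, long offset, long length) {
    this.remoteObject = remoteObject;
    this.offset = offset;
    this.length = length;
  }
}

class AliasMap {
  private final Map<Long, AliasEntry> byBlockId = new HashMap<>();
  void put(long blockId, AliasEntry entry) { byBlockId.put(blockId, entry); }
  // DataNodes resolve a PROVIDED block to (object, offset, length) on read.
  AliasEntry resolve(long blockId) { return byBlockId.get(blockId); }
}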
6. PROVIDED storage on the READ path
[Diagram: remote files /foo/bar, /foo/baz, /foo/bazt, /foo/bazz; labels: IaaS, (De)Hydration, Delegation]
7. PROVIDED storage on the READ path
[Diagram: the same remote files now mounted under /foo in HDFS (bar, baz, bazt, bazz); setrep=2 illustrates (de)hydration of local replicas; labels: IaaS, (De)Hydration, Delegation]
8. PROVIDED storage on the READ path
[Diagram: a /cloud mount accessed through (Router-Based) Federation, delegating to the remote namespace; labels: IaaS, (De)Hydration, Delegation]
9. Apache Hadoop 3.1.0
• Generate FSImage from a FileSystem
– Start a NameNode serving remote data
– Serve from (a subset of) DataNodes in the cluster
• Backported and deployed in production at Microsoft
• Static: namespace changes are not reflected in the HDFS NameNode
• Prototype code [2] with the PROVIDED abstraction
– Read-through caching of blocks (demand paging)
– Scheduled, metered prefetch for recurring pipelines with SLOs
– Write-through to remote (participant in the HDFS write pipeline)
– Wire FSImage to a running NameNode
• Per-application NameNodes, with isolation
• Bidirectional synchronization out of scope
[2]: https://github.com/Microsoft-CISL/hadoop/tree/tieredStore-sig16
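For orientation, a sketch of the configuration knobs behind the 3.1.0 feature, shown programmatically; the key names follow the HDFS-9806 documentation but should be verified against your release, and the alias-map file location is hypothetical:

import org.apache.hadoop.conf.Configuration;

public class ProvidedSiteConf {
  public static Configuration providedSite() {
    Configuration conf = new Configuration();
    // Enable the PROVIDED storage tier in the NameNode.
    conf.setBoolean("dfs.namenode.provided.enabled", true);
    // AliasMap implementation mapping PROVIDED blocks to remote byte ranges.
    conf.set("dfs.provided.aliasmap.class",
        "org.apache.hadoop.hdfs.server.common.blockaliasmap.impl.TextFileRegionAliasMap");
    // Where the text alias map (produced alongside the generated FSImage) lives.
    conf.set("dfs.provided.aliasmap.text.read.file", "file:///opt/provided/blocks.csv");
    return conf;
  }
}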
10. Running Apache Hadoop in the cloud
• HDInsight/Elastic MapReduce (EMR)/etc.
• Disaggregation introduces not only latency, but also variance
• “Lift and shift” workloads
– Rely on HDFS plugins
– May need to use attached storage to meet SLOs
– Would otherwise require spending more on capacity in the remote store
[Chart: stddev/mean of storage performance]
11. • HDFS can be used as a cache for Object Storage
• Similar to $my_favorite_caching_FS (CFS)?
– These are all caching systems that dispatch between storage systems horizontally
– We want to tier the storage systems vertically
• Support HDFS itself, not just the Hadoop ecosystem around FileSystem
Notes on Caching
[Diagram: left, Compute nodes access both HDFS and the cloud store through a caching FS ($CFS) layer; right, Compute nodes use HDFS directly, and HDFS tiers to the cloud store]
12. External Storage for HDFS
[Diagram: an Object Store namespace (/ with bucket1 and carlhadoop) backing an HDFS cluster namespace (/ with reports containing fileA, fileB, dir, and sales); the NameNode maps the files, DataNodes 1..N expose PROVIDED (P) volumes, and a Hadoop Client accesses fileA, fileB, dir through HDFS]
13. • “DropBox for Hadoop”
– Hadoop cluster has complete namespace but only “data in working set” is stored locally
– Dynamically page in missing data from object store on read
– Asynchronously write back data to object store
• Storage Policies + Replication count offer rich placement options
– E.g.: hot data: <SSD, PROVIDED> / cold data: <PROVIDED>
• Dedicated object storage system more efficient ($$$)
– Similar goal as ARCHIVE storage policy
– Object storage features (erasure coding, multi-geo replication, …)
• Data sharing with non-Hadoop apps
– File-object mapping means objects can be accessed in remote store with REST API / SDKs
Use case: External Storage for HDFS
15. WD ActiveScale Object Storage
• Western Digital moving up the stack (Data Center Systems)
• Scale-out object storage system for Private & Public Cloud
• Key features:
– Compatible with Amazon S3 API
– Strong consistency (not eventual!)
– Erasure coding for efficient storage
• Scale:
– Petabytes per rack
– Billions of objects per rack
– Linear scalability in # of racks
• More info at http://www.hgst.com/products/systems
16. • AS AN Administrator
• I CAN configure HDFS with an object storage backend
hdfs storagepolicies -setStoragePolicy -policy PROVIDED -path /var/log
hdfs syncservice -create -backupOnly -name activescale /var/log s3a://hadoop-logs/
• SO THAT when a user copies files to HDFS they are asynchronously copied to the synchronization endpoint
Demo time
17. Another example
• AS AN Administrator
• I CAN set the Storage Policy to be PROVIDED_ONLY
hdfs storagepolicies -setStoragePolicy -policy PROVIDED_ONLY -path /var/log
• SO THAT data is no longer on the DataNodes but is transparently read through from the synchronization endpoint on access.
18. • Preserve file-object mapping
– AliasMap (last year’s talk – HDFS-9806): synchronize namespaces
– Datanodes collaborate to move blocks which together form an object in the destination system
• Minimize impact on frontend traffic / efficient data transfer
– Obvious: read all blocks into a single Datanode to reconstruct the file before transferring
– Efficient: transfer block by block directly out of the cluster using
• S3: multipart upload
• WASB: append blobs
• HDFS: tmpdir + concat
• Flexible deployment: could run in NameNode OR as External service
– In the NameNode: easy to deploy, but adds resource pressure
– External service: more difficult to deploy for some sites, but reduces resource pressure
– Ongoing community discussion; start with external, include internal option as required
Requirements
19. • MountManager manages all the local mount points
– Mount point can be configured to sync with external store
• Periodically create a diff by comparing snapshots of the mountpoint
– NEW SyncService (in/out NameNode)
• Generate a “phased plan” for ordering the operations in the diff
– Multiple ordered phases
• RENAMES_TO_TEMP, DELETES, RENAMES_TO_FINAL, CREATE_DIRS, CREATE_FILES
• e.g. dir creation before file creation
– Parallel operations within a phase
• Leverage multiple datanodes and connections to external store
• e.g. Upload multiple new files in parallel
• Execute plan and track work
– Namespace (metadata) operations originate from SyncService
– Data operations originate from DataNodes
– Tracking: admin can query mountpoint for progress
Deep Dive: Synchronization
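A minimal sketch of this execution model, with hypothetical op types: phases run strictly in order, while the ops inside one phase fan out in parallel:

import java.util.ArrayList;
import java.util.EnumMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

enum Phase { RENAMES_TO_TEMP, DELETES, RENAMES_TO_FINAL, CREATE_DIRS, CREATE_FILES }

class PhasedPlan {
  private final Map<Phase, List<Runnable>> ops = new EnumMap<>(Phase.class);

  void add(Phase phase, Runnable op) {
    ops.computeIfAbsent(phase, p -> new ArrayList<>()).add(op);
  }

  // Phases execute sequentially; ops within one phase run in parallel,
  // e.g. many file uploads at once across DataNodes.
  void execute(ExecutorService pool) throws InterruptedException {
    for (Phase phase : Phase.values()) {
      List<Callable<Object>> batch = new ArrayList<>();
      for (Runnable op : ops.getOrDefault(phase, List.of())) {
        batch.add(Executors.callable(op));
      }
      pool.invokeAll(batch); // blocks until the whole phase completes
    }
  }
}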
20. • Snapshot diff:
– Reflects a 100% accurate point-in-time state of HDFS in the external store
– Snapshot ensures data remains referenceable: blocks are retained until synced
– Does not track create + delete between consecutive snapshots (cf. file B in the figure)
• EditLog post-processing:
– To parallelize:
• Read batches from the log and track lineage between overlapping operations
– HDFS operations might have altered reality: no point-in-time view
• Data is not part of the log: syncing from it would require postponing block garbage collection
Tracking changes: Snapshot diff vs. EditLog
[Figure: timeline from snapshot ss-5 to ss-6; file A persists across both, while file B is created and deleted in between]
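The diff itself comes from the standard HDFS snapshot API; a minimal sketch, assuming a hypothetical snapshottable mount point (e.g. enabled via hdfs dfsadmin -allowSnapshot) and SyncService-created snapshot names:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hdfs.DistributedFileSystem;
import org.apache.hadoop.hdfs.protocol.SnapshotDiffReport;

public class MountDiff {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Path mount = new Path("/basic-test"); // hypothetical snapshottable mount point
    DistributedFileSystem dfs = (DistributedFileSystem) mount.getFileSystem(conf);
    // The SyncService would take one snapshot per sync interval, e.g. ss-5, ss-6, ...
    SnapshotDiffReport diff = dfs.getSnapshotDiffReport(mount, "ss-5", "ss-6");
    for (SnapshotDiffReport.DiffReportEntry entry : diff.getDiffList()) {
      System.out.println(entry); // M / + / - / R entries as in the examples below
    }
  }
}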
21. Example Diff – Simple Case
Simple case – new dirs; new files
Commands
#given /basic-test
mkdir -p /basic-test/a/b/c/d
touch /basic-test/a/b/c/d/f1.bin
touch /basic-test/f1.bin
SnapshotDiffReport
M d .
+ d ./a
+ f ./f1.bin
PhasedPlan
+ d ./a/b/c/d/
+ f ./a/b/c/d/f1.bin
+ f ./f1.bin
22. Example Diff – Harder Case
Harder case – cycle
Commands
#given /swap-test/a.bin
#given /swap-test/b.bin
mv /swap-test/a.bin /swap-test/tmp
mv /swap-test/b.bin /swap-test/a.bin
mv /swap-test/tmp /swap-test/b.bin
SnapshotDiffReport
M d .
R f ./a.bin -> ./b.bin
R f ./b.bin -> ./a.bin
PhasedPlan
R f ./a.bin -> ./tmp/b.bin
R f ./b.bin -> ./tmp/a.bin
R f ./tmp/b.bin -> ./b.bin
R f ./tmp/a.bin -> ./a.bin
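Why the temp phase resolves cycles: each rename is split into a move to a unique staging name plus a move to the final name, so the order within each rename phase no longer matters. A self-contained sketch (the staging scheme is hypothetical):

import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class RenameStaging {
  // Split possibly-cyclic renames (src -> dst) into two safe phases.
  public static void main(String[] args) {
    Map<String, String> renames = new LinkedHashMap<>();
    renames.put("./a.bin", "./b.bin"); // a.bin and b.bin swap places: a cycle
    renames.put("./b.bin", "./a.bin");

    List<String> toTemp = new ArrayList<>(), toFinal = new ArrayList<>();
    int i = 0;
    for (Map.Entry<String, String> r : renames.entrySet()) {
      String tmp = "./tmp/" + (i++);             // unique staging name
      toTemp.add(r.getKey() + " -> " + tmp);     // phase RENAMES_TO_TEMP
      toFinal.add(tmp + " -> " + r.getValue());  // phase RENAMES_TO_FINAL
    }
    toTemp.forEach(System.out::println);
    toFinal.forEach(System.out::println);
  }
}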
23. • Tiered Storage HDFS-12090 [issues.apache.org]
– Design documentation
– List of subtasks, lots of linked tickets – take one!
– Discussion of scope, implementation, and feedback
• Bert Verslyppe, Hendrik Depauw, Íñigo Goiri, Rakesh Radhakrishnan, Uma
Gangumalla, Daryn Sharp, Steve Loughran, Sanjay Radia, Anu Engineer,
Jitendra Pandey, Andrew Wang, Zhe Zhang, Allen Wittenauer, and many
others …
Thanks to the community for feedback & help!
25. • Applications write to HDFS
– First to DISK, then the SyncService asynchronously copies to the synchronization endpoint
– When files have been copied, the extraneous disk replicas can be removed
Deep Dive: MultiPart Upload
[Diagram: a Client writes a File (Block1, Block2, Block3) to Datanodes; the SyncService coordinates InitMultipart, per-block PutPart from the Datanodes, and Complete against the External Store]
26. • Common concept in Object Storage
– Supported by S3, WASB
• Usage in Hadoop
– S3A uses it – see Steve Loughran’s talk
– New to HDFS – HDFS-13186
• Three phases
– UploadHandle initMultipart(Path filePath)
– PartHandle putPart(Path filePath, InputStream inputStream, int partNumber, UploadHandle uploadId, long lengthInBytes)
– void complete(Path filePath, List<Pair<Integer, PartHandle>> handles, UploadHandle multipartUploadId)
• Benefits:
– Object/File Isolation – you only see the results when it’s done
– Can be written in parallel across multiple nodes
MultipartUploader
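Putting the three phases together, a usage sketch against the interface above; the marker types stand in for the API proposed in HDFS-13186 (method names as on this slide), the uploader instance, input streams, and part lengths are assumed to come from elsewhere, and part numbers start at 1:

import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;
import org.apache.commons.lang3.tuple.Pair;
import org.apache.hadoop.fs.Path;

interface UploadHandle {}
interface PartHandle {}
interface MultipartUploader {
  UploadHandle initMultipart(Path filePath) throws Exception;
  PartHandle putPart(Path filePath, InputStream in, int partNumber,
                     UploadHandle uploadId, long lengthInBytes) throws Exception;
  void complete(Path filePath, List<Pair<Integer, PartHandle>> handles,
                UploadHandle uploadId) throws Exception;
}

class MultipartSketch {
  // Sequential for clarity; in the SyncService each putPart can run on a
  // different DataNode in parallel, one part per block.
  static void upload(MultipartUploader uploader, Path dest,
                     List<InputStream> parts, List<Long> lengths) throws Exception {
    UploadHandle id = uploader.initMultipart(dest);                // phase 1: initialize
    List<Pair<Integer, PartHandle>> handles = new ArrayList<>();
    for (int i = 0; i < parts.size(); i++) {                       // phase 2: upload parts
      PartHandle h = uploader.putPart(dest, parts.get(i), i + 1, id, lengths.get(i));
      handles.add(Pair.of(i + 1, h));
    }
    uploader.complete(dest, handles, id);  // phase 3: the object becomes visible only now
  }
}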