Strata + Hadoop World 2012: HDFS: Now and Future

HDFS:
Now and Future
Todd Lipcon (todd@cloudera.com)
Sanjay Radia (sanjay@hortonworks.com)

Outline
Part 1 – Todd Lipcon (Cloudera)
• Namenode HA
• HDFS Performance improvements
• Taking advantage of next-gen hardware
• Storage Efficiency (RAID and compression)
Part 2 - Sanjay Radia (Hortonworks)
• Federation and Generalized storage service
– Leverage it for further innovation
• Snapshots
• Other
– WebHDFS
– Wire compatibility

2 O'Reilly Strata & Hadoop World

HDFS HA in Hadoop 2.0.0
• Initial implementation last year
– Introduced Standby NameNode and manual hot
failover (see Hadoop World 2011 presentation)
• Handled planned maintenance (e.g., upgrades) but not
unplanned outages
– Required a highly-available NFS filer to store
NameNode metadata
• Complicated and expensive to set up


HDFS HA Phase 2
• Automatic failover
– Uses Apache ZooKeeper to automatically detect
NameNode failures and trigger a failover
– Ops may invoke manual failover for planned
maintenance windows
• Removed dependency on NFS storage
– HDFS HA is entirely self-contained
– No special hardware or software required
– No SPOF anywhere in the system

Automatic Failover
• Each NameNode has a new process called
ZooKeeperFailoverController (ZKFC)
– Maintains a session to ZooKeeper
– Periodically runs a health-check against its local NameNode to verify
that it is running properly
• Triggers failover if the health check fails or the ZK session expires
• Operators may still issue manual failover commands for planned
maintenance
• Failover time: 30-40 seconds unplanned; 0-3 seconds planned.
• Handles all types of faults: machine, software, network, etc.
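
The failover rule above (health check fails or ZK session expires) can be sketched as a toy model. This is an illustrative sketch only, not the ZKFC implementation: the class, field, and function names are hypothetical, and real ZKFCs coordinate through a lock in ZooKeeper rather than a shared Python list.

```python
from dataclasses import dataclass


@dataclass
class FailoverController:
    """Toy stand-in for a ZKFC; all names here are hypothetical."""
    name: str
    healthy: bool = True        # result of the last local NameNode health check
    session_alive: bool = True  # whether the ZooKeeper session is still valid

    def eligible(self) -> bool:
        # A controller may keep its NameNode active only while both hold.
        return self.healthy and self.session_alive


def elect_active(controllers, current_active=None):
    """Return the name of the controller whose NameNode should be active.

    The current active is kept while it stays eligible; otherwise the
    first eligible controller takes over. Returns None if none qualifies.
    """
    by_name = {c.name: c for c in controllers}
    if current_active in by_name and by_name[current_active].eligible():
        return current_active
    for c in controllers:
        if c.eligible():
            return c.name
    return None


nn1 = FailoverController("nn1")
nn2 = FailoverController("nn2")

first = elect_active([nn1, nn2])              # nn1 becomes active
nn1.healthy = False                           # health check on nn1 fails
after_fault = elect_active([nn1, nn2], first) # failover to nn2
nn1.healthy = True                            # nn1 recovers, nn2 stays active
after_recovery = elect_active([nn1, nn2], after_fault)
```

Note that in this sketch a recovered node does not take the active role back on its own, which mirrors the idea that failover is triggered only by a fault or an operator command.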


Removed NFS/filer dependency
• Shared storage on NFS practical for some
organizations, but difficult for others
– Complex configuration, custom fencing scripts
– Filer itself must be highly available
– Expensive to buy, expensive to support
– Buggy NFS clients in Linux
• Introduced new system for reliable edit log
storage: QuorumJournalManager

QuorumJournalManager
• Run 3 or 5 JournalNodes, collocated on existing hardware
investment
• Each edit must be committed to a majority of the nodes (i.e.,
a quorum)
– A minority of nodes may crash or be slow without affecting
system availability
– Run N nodes to tolerate (N-1)/2 failures (same as ZooKeeper)
• Built into HDFS
– Designed for existing Hadoop ops teams to understand
– Hadoop Metrics support, full Kerberos support, etc.
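
The quorum arithmetic above can be checked with a few lines. This is a minimal sketch of the majority rule only; the function names are hypothetical and nothing here reflects the actual QuorumJournalManager code.

```python
def quorum_size(n: int) -> int:
    """Smallest number of JournalNodes that forms a majority of n."""
    return n // 2 + 1


def tolerated_failures(n: int) -> int:
    """Nodes that may fail while a majority remains: (N-1)/2, rounded down."""
    return (n - 1) // 2


def is_committed(acks: int, n: int) -> bool:
    """An edit counts as committed once a majority has acknowledged it."""
    return acks >= quorum_size(n)


for n in (3, 5):
    print(n, quorum_size(n), tolerated_failures(n))
# prints:
# 3 2 1
# 5 3 2
```

An even node count adds no fault tolerance (4 nodes still tolerate only 1 failure), which is one reason deployments of 3 or 5 nodes are the usual choice.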


HDFS HA Architecture
(with Automatic Failover and QuorumJournalManager)
[Diagram: HDFS HA architecture]
• ZooKeeper ensemble (3 ZK nodes): receives heartbeats from each FailoverController
• One FailoverController per NameNode (Active and Standby): sends commands to its NN and monitors the health of the NN, OS, and HW
• Active NN and Standby NN share NN state through a quorum of JournalNodes (3 JNs)
• DataNodes send block reports to both Active and Standby
• DN fencing: only obey commands from active

HA Improvements Summary
• Automatic failover
– Avoid both planned and unplanned downtime
• Non-NFS Shared Storage
– No need to buy or configure a filer
• Result: HA with no external dependencies
• Available now in HDFS trunk and CDH4.1
• Come to our 5pm talk in this room for more
details on these HA improvements!

HDFS Performance Update: 2.x vs 1.x
• Significant speedups from SSE4.2 hardware checksum
calculation (2.5-3x less CPU on read path)
• Rewritten read path for fewer memory copies
• Short-circuit past datanodes for 2-3x faster random
read (HBase workloads)
• I/O scheduling improvements: push down hints to
Linux using posix_fadvise()
• Covered in my presentation from Hadoop World 2011
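
To illustrate what a posix_fadvise() hint looks like, here is a small sketch using Python's `os.posix_fadvise`, which wraps the same system call. It is not the DataNode's code: the function name, chunk size, and the choice of hints (sequential access, then dropping pages already read) are assumptions made for the example. The call is only available on some Unix platforms, so the sketch skips the hints where it is missing.

```python
import os
import tempfile

HAVE_FADVISE = hasattr(os, "posix_fadvise")


def read_sequential(path, chunk_size=1 << 20):
    """Read a file front to back while hinting the access pattern to the kernel.

    Returns the number of bytes read.
    """
    total = 0
    fd = os.open(path, os.O_RDONLY)
    try:
        if HAVE_FADVISE:
            # Length 0 means "to end of file": expect sequential reads.
            os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_SEQUENTIAL)
        while True:
            chunk = os.read(fd, chunk_size)
            if not chunk:
                break
            if HAVE_FADVISE:
                # Pages just read will not be needed again; let the
                # kernel drop them from the page cache.
                os.posix_fadvise(fd, total, len(chunk), os.POSIX_FADV_DONTNEED)
            total += len(chunk)
    finally:
        os.close(fd)
    return total


# Demo on a throwaway file of 3 MiB plus 5 bytes.
payload_size = 3 * (1 << 20) + 5
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"x" * payload_size)
    demo_path = tmp.name
bytes_read = read_sequential(demo_path)
os.unlink(demo_path)
```

The hints are advisory: the kernel is free to ignore them, so the code must behave identically with or without them.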


HDFS Performance: Recent Work
• Completed
– Zero-copy read for libhdfs (2-3x improvement for C++
clients like Impala reading cached data)
– Expose mapping of blocks to disks: 2x improvement by
avoiding contention on slower drives (HDFS-3672)
• In progress
– Using native checksum computation on write path
– Avoiding copies and allocation on write path

HDFS Performance Benchmarks
[Chart, as of June 2012: read and write throughput (MB/sec, axis 0 to 1000) for raw ext4, HDFS, and HDFS with disk awareness]


Dual quad-core, 12x2T 7200RPM drives, measured max disk throughput at
900MB/sec.
Write throughput is CPU bound; improvements in progress bring it to max disk
throughput as well
Easily saturates SATA3 bus bandwidth on common hardware

Hardware Trends
• Denser storage
– 36T per node already common
– Millions of blocks per DN
• New need to invest in scaling DataNode memory usage
• More RAM
– 64GB common today. 256GB soon inexpensive
– Customers want to explicitly pin recently ingested data in RAM
(especially with efficient query engines like Impala)
• Solid state storage (SSD, FusionIO, etc)
– HDFS should transparently or explicitly migrate hot random-access
data to/from flash
– Hierarchical storage management

HDFS Storage Efficiency
• Many customers are expanding their clusters simply to add storage
– How can we better utilize the disks they already have?
• RAID (Reed-Solomon coding)
– Store blocks at low replication, keep parity blocks to allow
reconstruction if they are lost
– Effective replication: 1.5x with same durability, less locality
• Transparent compression
– Automatically detect infrequently used files, transparently
recompress with Snappy, GZip, bz2, or LZMA
– Cloudera workload traces indicate 10% of files accessed 90% of the
time!
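
The "effective replication" figure for RAID can be worked out as raw bytes stored per byte of user data. The sketch below is an assumption-laden illustration: the slide does not give the stripe layout, so the (10 data, 5 parity) stripe used to reach 1.5x is hypothetical, as is the function name.

```python
from fractions import Fraction


def effective_replication(data_blocks, parity_blocks, replicas=1):
    """Raw blocks stored per block of user data.

    Every data and parity block in a stripe is stored `replicas` times.
    """
    stored = replicas * (data_blocks + parity_blocks)
    return Fraction(stored, data_blocks)


plain = effective_replication(1, 0, replicas=3)  # ordinary 3x replication
coded = effective_replication(10, 5)             # hypothetical 10+5 stripe

print(float(plain), float(coded))  # prints: 3.0 1.5
print(float(1 - coded / plain))    # prints: 0.5 (half the raw storage)
```

With Reed-Solomon coding, a stripe can be rebuilt after losing up to as many blocks as it has parity blocks, which is where the durability comes from. The locality cost follows from each data block living on a single node.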


Outline
Part 1 – Todd Lipcon (Cloudera)
• Namenode HA
• HDFS Performance improvements
• Taking advantage of next-gen hardware
• Storage Efficiency (RAID and compression)
Part 2 - Sanjay Radia (Hortonworks)
• Federation and Generalized storage service
– Leverage it for further innovation
• Snapshots
• Other
– WebHDFS
– Wire compatibility
– HA in Hadoop 1!


Federation: Generalized Block Storage
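[Diagram: federation layers]
• Namespace layer: Namenodes NN-1 … NN-k … NN-n, serving namespaces NS1 … NS k … and a foreign NS n
• Block Storage layer: one block pool per namespace (Pool 1 … Pool k … Pool n)
• Common Storage: Datanodes DN 1, DN 2 … DN m, shared by all block pools
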
• Block Storage as generic storage service
– Set of blocks for a Namespace Volume is called a Block Pool
– DNs store blocks for all the Namespace Volumes – no partitioning
• Multiple independent Namenodes and Namespace Volumes in a cluster
– Namespace Volume = Namespace + Block Pool

HDFS’ Generic Storage Service
Opportunities for Innovation
• Federation - Distributed (Partitioned) Namespace
– Simple and Robust due to independent masters
– Scalability, Isolation, Availability
• New Services – Independent Block Pools
– New FS - Partial namespace in memory
– MR Tmp storage directly on block storage
– Shadow file system – caches HDFS, NFS, S3
• Future: move Block Management into DataNodes
– Simplifies namespace/application implementation
– Distributed namenode becomes significantly simpler
[Diagram: Alternate NN Implementation, HBase, HDFS Namespace, and MR tmp, all on top of a common Storage Service]


Managing Namespaces
• Federation has multiple namespaces
• Don’t you need a single global namespace?
– Some tenants want private namespace
– Do you create a single DB or Single Table?
– Many volumes, share what you want
– Global? Key is to share the data and the names used to access the data
• Client-side mount table can implement global or private namespaces
– Shared mount-table => “global” shared view
– Personalized mount-table => per-application view
• Share the data that matter by mounting it
• Client-side implementation of mount tables
– xInclude from shared place – global view
– No single point of failure
– No hotspot for root and top level directories
[Diagram: client-side mount-table at / with mount points data, project, home, and tmp, backed by namespaces NS1, NS2, NS3, and NS4]

Next Steps… first class support for volumes
• NameServer - Container for namespaces
– Lots of small namespace volumes
• Chosen per user, tenant, data feed
• Management policies (quota, …)
• Mount tables for unified namespace
– Centrally managed – (xInclude, ZK, ..)
• Keep only WorkingSet of namespace in memory
– Break away from old NN’s full namespace in memory
– Faster startup, Billions of names, Hundreds of volumes
• Number of NameServers =
– Sum of (Namespace working set)
– Sum of (Namespace throughput)
– Move namespace for balancing
[Diagram: NameServers as Containers of Namespaces, above a Storage Layer of Datanodes]

Snapshots
• Take snapshot of any directory
– Multiple snapshots allowed
• Snapshot metadata info stored in NameNode
– Datanodes have no knowledge
– Blocks are shared
• All regular commands/apis can be used against
snapshots
– Cp /foo/bar/.snapshot/x/y /a/b/z
• New CLIs to create and delete snapshots
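
The idea that snapshots are metadata-only and share blocks with the live files can be shown with a toy namespace. This is a sketch under loud assumptions: the class and method names are hypothetical, and the data structures are chosen for brevity, not taken from the HDFS-2802 design. Only the `.snapshot` path convention comes from the slide.

```python
class ToyNamespace:
    """Toy Namenode metadata: paths map to tuples of block IDs."""

    def __init__(self):
        self.files = {}      # live path -> tuple of block IDs
        self.snapshots = {}  # (directory, name) -> {path: block IDs}

    def write(self, path, block_ids):
        self.files[path] = tuple(block_ids)

    def create_snapshot(self, directory, name):
        # Copy metadata only: the snapshot refers to the same block IDs,
        # so no block data is duplicated and Datanodes are not involved.
        directory = directory.rstrip("/")
        prefix = directory + "/"
        self.snapshots[(directory, name)] = {
            p: b for p, b in self.files.items() if p.startswith(prefix)
        }

    def delete_snapshot(self, directory, name):
        del self.snapshots[(directory.rstrip("/"), name)]

    def blocks(self, path):
        """Look up block IDs for a live path or a .snapshot path."""
        if "/.snapshot/" in path:
            directory, rest = path.split("/.snapshot/", 1)
            name, _, relative = rest.partition("/")
            return self.snapshots[(directory, name)][directory + "/" + relative]
        return self.files[path]


ns = ToyNamespace()
ns.write("/foo/bar/y", [101, 102])
ns.create_snapshot("/foo/bar", "x")
ns.write("/foo/bar/y", [103])             # live file changes afterwards

print(ns.blocks("/foo/bar/.snapshot/x/y"))  # prints: (101, 102)
print(ns.blocks("/foo/bar/y"))              # prints: (103,)
```

Reading through the `.snapshot` path uses the same lookup as a live path, which is why regular commands and APIs can work against snapshots unchanged.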

Snapshots - Status
• HDFS-2802 (feature branch)
– Initial design and prototype – March 2012
– Development active
• Updated design document and test plan posted
– Review meeting – 1st week November
• 15+ patches
– Expected completion – early December!


Enterprise Use Cases
• Storage fault-tolerance – built into HDFS Architecture 
– Over seven 9s of data reliability
• High Availability 
• Standard Interfaces 
– WebHDFS (REST), FUSE, and NFS access
• HTTPFS – (WebHDFS as farm of proxy servers)
• libWebhdfs – pure C library for HDFS

• Wire protocol compatibility 
– Protocol buffers
• Rolling upgrades
– Rolling upgrades for dot-releases 
• Snapshots - Under active development
• Disaster Recovery
– DistCp does parallel and incremental copies across clusters
• Future - Enhance using journal interface & Snapshots
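
Because WebHDFS, listed under Standard Interfaces above, is a REST interface, any HTTP client can address a file by URL. The sketch below only builds request URLs of the form `/webhdfs/v1/<path>?op=...`; it does not contact a cluster. The hostname and port are placeholders, and the helper name is hypothetical.

```python
from urllib.parse import quote, urlencode


def webhdfs_url(host, port, path, op, **params):
    """Build a WebHDFS request URL for an absolute HDFS path.

    `op` is the WebHDFS operation name (for example OPEN or LISTSTATUS);
    extra keyword arguments become additional query parameters.
    """
    if not path.startswith("/"):
        raise ValueError("path must be absolute")
    query = urlencode({"op": op, **params})
    return f"http://{host}:{port}/webhdfs/v1{quote(path)}?{query}"


# Placeholder host and port, chosen for the example only.
print(webhdfs_url("namenode.example.com", 50070, "/user/todd/data.txt", "OPEN"))
# prints: http://namenode.example.com:50070/webhdfs/v1/user/todd/data.txt?op=OPEN
```

The same URL shape should work whether the endpoint is a Namenode or an HTTPFS proxy, which is what makes the proxy-farm deployment transparent to clients.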

Summary
• HA for Namenode
– Hot failover, shared storage not required (QJM)
• Performance improvements
• Utilize today’s and tomorrow’s hardware to full potential
• Federation and Generalized storage layer
– Opportunities for innovation
• Partial namespace in memory, shadow/caching file system, MR tmp, etc.
• Wire compatibility, WebHdfs, …
• Snapshots - Development well in progress

Questions?

