ZFS is a filesystem, volume manager, and RAID controller combined. It uses a copy-on-write design and checksums all data for integrity. ZFS offers speed, simple management, self-healing, and built-in features such as snapshots and NFS/CIFS/iSCSI sharing. It achieves these feats through its telescoped architecture of ZPL, DMU, and SPA layers, which together handle I/O, transactions, block allocation, and integrity protection.
1. ZFS Nuts and Bolts
Eric Sproul
OmniTI Computer Consulting
2. Quick Overview
• More than just another filesystem: it’s a filesystem,
a volume manager, and a RAID controller all in one
• Production debut in Solaris 10 6/06
• 1 ZB = 1 billion TB
• 128-bit
• 2^64 snapshots, 2^48 files/directory,
2^64 bytes/filesystem, 2^78 bytes/pool,
2^64 devices/pool, 2^64 pools/system
3. Old & Busted
Traditional storage stack:
filesystem(upper): filename to object (inode)
filesystem(lower): object to volume LBA
volume manager: volume LBA to array LBA
RAID controller: array LBA to disk LBA
• Strict separation between layers
• Each layer often comes from separate vendors
• Complex, difficult to administer, hard to predict
performance of a particular combination
4. New Hotness
• Telescoped stack:
ZPL: filename to object
DMU: object to DVA
SPA: DVA to disk LBA
• Terms:
• ZPL: ZFS POSIX layer (standard syscall interface)
• DMU: Data Management Unit (transactional object store)
• DVA: Data Virtual Address (vdev + offset)
• SPA: Storage Pool Allocator (block allocation, data
transformation)
5. New Hotness
• No more separate tools to manage filesystems vs.
volumes vs. RAID arrays
• 2 commands: zpool(1M) and zfs(1M) (an RFE exists to combine these; see the examples below)
• Pooled storage means never getting stuck with too
much or too little space in your filesystems
• Can expose block devices as well; “zvol” blocks
map directly to DVAs
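For instance (a minimal sketch; pool, dataset, and device names are illustrative), creating a mirrored pool, a filesystem, and a 10 GB zvol each takes one command:
# zpool create tank mirror c0t0d0 c0t1d0
# zfs create tank/myfs
# zfs create -V 10G tank/myvol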
6. ZFS Advantages
• Fast
• copy-on-write, pipelined I/O, dynamic striping,
variable block size, intelligent resilvering
• Simple management
• End-to-end data integrity, self-healing
• Checksum everything, all the time
• Built-in goodies
• block transforms
• snapshots
• NFS, CIFS, iSCSI sharing
• Platform-neutral on-disk format
7. Getting Down to Brass Tacks
How does ZFS achieve these feats?
8. ZFS I/O Life Cycle
Writes
1. Translated to object transactions by the ZPL:
“Make these 5 changes to these 2 objects.”
2. Transactions bundled in DMU into transaction
groups (TXGs) that flush when full (1/8 of system
memory) or at regular intervals (30 seconds)
3. Blocks making up a TXG are transformed (if
necessary), scheduled and then issued to physical
media in the SPA
9. ZFS I/O Life Cycle
Synchronous Writes
• ZFS maintains a per-filesystem log called the ZFS
Intent Log (ZIL). Each transaction gets a log
sequence number.
• When a synchronous operation, such as fsync(), is
issued, the ZIL commits blocks up to the current
sequence number. This is a blocking operation.
• The ZIL commits all necessary operations and
flushes any write caches that may be enabled,
ensuring that all bits have been committed to stable
storage.
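As a usage note, on builds that support separate intent-log devices, the ZIL can be placed on fast stable storage to shorten synchronous-write latency (device name illustrative):
# zpool add tank log c4t0d0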
10. ZFS I/O Life Cycle
Reads
• ZFS makes heavy use of caching and prefetching
• If requested blocks are not cached, issue a
prioritized I/O that “cuts the line” ahead of pending
writes
• Writes are intelligently throttled to maintain
acceptable read performance
• ARC (Adaptive Replacement Cache) tracks recently
and frequently used blocks in main memory
• L2ARC uses durable storage to extend the ARC (see the command below)
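Adding an L2ARC device is a one-liner; it becomes a "cache" vdev on the pool (assuming c4t1d0 is a fast SSD; illustrative):
# zpool add tank cache c4t1d0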
11. Speed Is Life
• Copy-on-write design means random writes can
be made sequential
• Pipelined I/O extracts maximum parallelism with
out-of-order issue, sorting and aggregation
• Dynamic striping across all underlying devices
eliminates hot-spots
• Variable block size = no wasted space or effort
• Intelligent resilvering copies only live data, can do
partial rebuild for transient outages
15. Copy-On-Write
Atomically update the uberblock to point at the updated blocks.
The uberblock is special in that it does get overwritten, but four
copies are stored as part of the vdev label and are updated in
transactional pairs, so integrity on disk is maintained.
16-26. Pipelined I/O
Reorders writes to be as sequential as possible.
[Animated diagram: interleaved writes arrive from App #1 and App #2.
If left in original order, we waste a lot of time waiting for head and
platter positioning: move head, spin wait, move head, move head, move
head. Pipelining lets us examine the writes as a group and optimize
their order, reducing this to two head moves.]
27. Dynamic Striping
• Load distribution across top-level vdevs
• Factors determining block allocation
include:
• Capacity
• Latency & bandwidth
• Device health
28. Dynamic Striping
# zpool create tank mirror c1t0d0 c1t1d0 mirror c2t0d0 c2t1d0
Writes striped across both mirrors. Reads occur wherever data was written.
# zpool add tank mirror c3t0d0 c3t1d0
New data striped across three mirrors. No migration of existing data;
copy-on-write reallocates data over time, gradually spreading it across
all three mirrors.
* RFE for “on-demand” resilvering to explicitly re-balance
29. Variable Block Size
• No single value works well with all types of files
• Large blocks increase bandwidth but reduce metadata and can lead to
wasted space
• Small blocks save space for smaller files, but increase I/O operations on
larger ones
• Record-based files such as those used by databases have a fixed block
size that must be matched by the filesystem to avoid extra overhead
(blocks too small) or read-modify-write (blocks too large)
30. Variable Block Size
• The DMU operates on units of a fixed record size;
default is 128KB
• Files that are less than the record size are written as
a single filesystem block (FSB) of variable size in
multiples of disk sectors (512B)
• Files larger than the record size are stored
in multiple FSBs, each equal to the record size
• DMU records are assembled into transaction groups
and committed atomically
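For the database case from the previous slide, the recordsize property can be matched to the application's page size; a sketch assuming an 8 KB database page and an illustrative dataset name:
# zfs set recordsize=8k tank/db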
31. Variable Block Size
• FSBs are the basic unit of ZFS datasets; a
checksum is maintained for each FSB
• Handled by the SPA, which can optionally transform
them (compression, ditto blocks today; encryption,
de-dupe in the future)
• Compression improves I/O performance, as fewer
operations are needed on the underlying disk
32. Intelligent Resilver
• a.k.a. rebuild, resync, reconstruct
• Traditional resilvering is basically a whole-disk copy
in the mirror case; RAID-5 does XOR of the other
disks to rebuild
• No priority given to more important blocks
(top of the tree)
• If you’ve copied 99% of the blocks, but the last
1% contains the top few blocks in the tree,
another failure ruins everything
33. Intelligent Resilver
• The ZFS way is metadata-driven
• Live blocks only: just walk the block tree;
unallocated blocks are ignored
• Top-down: Start with the most important blocks.
Every block copied increases the amount of
discoverable data.
• Transactional pruning: If the failure is transient,
repair by identifying the missed TXGs. Resilver
time is only slightly longer than the outage time.
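A resilver is triggered by replacing or reattaching a device and can be watched via zpool status (device names illustrative):
# zpool replace tank c1t0d0 c1t2d0
# zpool status tank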
34. Keep It Simple
• Unified management model: pools and datasets
• Datasets are just a group of tagged bits with
certain attributes: filesystems, volumes, snapshots,
clones
• Properties can be set while the dataset is active
• Hierarchical arrangement: children inherit
properties of parent
• Datasets become administration points: give
every user or application their own filesystem
35. Keep It Simple
• Datasets only occupy as much space as they need
• Compression, quotas and reservations are built-in
properties
• Pools may be grown dynamically without service
interruption
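A sketch of the per-user model described above, with illustrative names and sizes:
# zfs create tank/home
# zfs create tank/home/alice
# zfs set quota=10G tank/home/alice
# zfs set reservation=1G tank/home/alice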
36. Data Integrity
• Not enough to be fast and simple; must be
safe too
• Silent corruption is our mortal enemy
• Defects can occur anywhere: disks, firmware, cables, kernel drivers
• Main memory has ECC; why shouldn’t storage have something similar?
• Other types of corruption are also killers:
• Power outages, accidental overwrites, using a disk as swap
37-39. Data Integrity
Traditional method: disk block checksum, stored alongside the data.
[Diagram: data block with its checksum stored next to it on disk]
• Only detects problems after data is successfully written (“bit rot”)
• Won’t catch silent corruption caused by issues in the I/O path
between disk and host
40. Data Integrity
The ZFS Way
[Diagram: block pointers carrying the checksums of the blocks they reference]
• Store data checksum in the parent block pointer
• Isolates faults between checksum and data
• Forms a hash tree, enabling validation of the entire pool
• 256-bit checksums
• fletcher2 (default, simple and fast) or SHA-256 (slower, more secure)
• Can be validated at any time with ‘zpool scrub’ (see example below)
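Kicking off the validation mentioned above, then checking its progress:
# zpool scrub tank
# zpool status tank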
46. Data Integrity
[Diagram: an application reads through ZFS from a two-way mirror; the
checksum exposes the bad copy, and the good copy both satisfies the
read and repairs the damage]
Self-healing mirror!
47. Goodie Bag
• Block Transforms
• Snapshots & Clones
• Sharing (NFS, CIFS, iSCSI)
• Platform-neutral on-disk format
48. Block Transforms
• Handled at SPA layer, transparent to upper layers
• Available today:
• Compression
• zfs set compression=on tank/myfs
• LZJB (default) or GZIP
• Multi-threaded as of snv_79
• Duplication, a.k.a. “ditto blocks”
• zfs set copies=N tank/myfs
• In addition to mirroring/RAID-Z: One logical block = up to 3
physical blocks
• Metadata always has 2+ copies, even without ditto blocks
• Copies stored on different devices, or different places on same
device
• Future: de-duplication, encryption
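Building on the properties above (dataset name illustrative): where GZIP support is available, a compression level can be selected, and ditto blocks are enabled per dataset:
# zfs set compression=gzip-9 tank/myfs
# zfs set copies=2 tank/myfs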
49. Snapshots & Clones
• zfs snapshot tank/myfs@thursday
• Based on block birth time, stored in block pointer
• Nearly instantaneous (<1 sec) on idle system
• Communicates structure, since it is based on
object changes, not just a block delta
• Occupies negligible space initially, and only grows
as large as the block changeset
• Clone is just a read/write snapshot
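A typical round trip, using the snapshot from the slide plus an illustrative clone name; the rollback assumes @thursday is the most recent snapshot:
# zfs snapshot tank/myfs@thursday
# zfs clone tank/myfs@thursday tank/myfs-test
# zfs rollback tank/myfs@thursday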
50. Sharing
• NFSv4
• zfs set sharenfs=on tank/myfs
• Automatically updates /etc/dfs/sharetab
• CIFS
• zfs set sharesmb=on tank/myfs
• Additional properties control the share name and workgroup
• Supports full NT ACLs and user mapping, not just POSIX uid
• iSCSI
• zfs set shareiscsi=on tank/myvol
• Makes sharing block devices as easy as sharing filesystems
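The CIFS share name mentioned above is set through the same property (share name illustrative):
# zfs set sharesmb=name=myshare tank/myfs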
51. On-Disk Format
• Platform-neutral, adaptive endianness
• Writes always use native endianness, recorded in a bit in the block
pointer
• Reads byteswap if necessary, based on comparison of host endianness to
value of block pointer bit
• Migrate between x86 and SPARC
• No worries about device paths, fstab entries, or mountpoints; it all just works
• ‘zpool export’ on old host, move disks, ‘zpool import’ on new host
• Also migrate between Solaris and non-Sun implementations, such as
Mac OS X and FreeBSD
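The migration sequence from this slide, as commands:
# zpool export tank
(move disks to the new host)
# zpool import tank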