ZFS is a file system developed by Sun Microsystems that provides advanced storage capabilities such as data integrity checking, snapshots and cloning. Some key features of ZFS include using copy-on-write storage, end-to-end checksumming of data to prevent silent data corruption, transactional semantics for consistency, and pooled storage that allows for thin provisioning and easy management of storage resources. ZFS aims to eliminate many of the issues with traditional file systems through its novel approach to data storage and management.
2. What is ZFS?
Developed by: Sun Microsystems
Introduced: November 2005 (OpenSolaris)
• ZFS (Zettabyte File System) was a file system created by Sun Microsystems, which Oracle
later acquired along with the rest of the company.
• Oracle initially championed Btrfs, until it acquired ZFS through the Sun purchase.
• Oracle still funds Btrfs development; its feature set is meant to be comparable to ZFS, but it
remains years behind, in part because of slow progress toward a stable release.
• ZFS is an object based filesystem and is very differently organized from most regular file
systems. ZFS provides transactional consistency and is always on-disk consistent due to
copy-on-write semantics and strong checksums which are stored at a different location than
the data blocks.
3. Trouble With Existing Filesystems
• No defense against silent data corruption
•Any defect in disk, controller, cable, driver, or firmware can corrupt data silently; like
running a server without ECC memory
• Difficult to manage
•Disk labels, partitions, volumes, provisioning, grow/shrink, hand-editing /etc/vfstab...
•Lots of limits: filesystem/volume size, file size, number of files, files per directory, number
of snapshots, ...
•Not portable between x86 and SPARC
• Performance could be much better
•Linear-time create, fat locks, fixed block size, naïve prefetch, slow random writes, dirty
region logging
4. ZFS Objective
• End the suffering
• Design an integrated system from scratch
• Throw away 20 years of obsolete assumptions
6. Evolution of Disks and Volumes
• Initially, we had simple disks
• Disks were abstracted into volumes to meet requirements; each file system sat on top of a
volume manager
• An industry grew around hardware and software volume management
[Figure: three pairs of 1GB disks combined by a volume manager into a concatenated 2GB
volume (lower/upper), a striped 2GB volume (even/odd), or a mirrored 1GB volume
(left/right).]
7. ZFS Design Principles
• Start with a new design around today's requirements
• Pooled storage
– Eliminate the notion of volumes
– Do for storage what virtual memory did for RAM
• End-to-end data (and metadata) integrity
– Historically considered too expensive.
– Now, data is too valuable not to protect
• Transactional operation
– Maintain consistent on-disk format
– Reorder transactions for performance gains – coalesced I/O is a big performance win
8. FS/Volume Model vs. ZFS

Traditional Volumes           ZFS Pooled Storage
1:1 FS to volume              No partitions / volumes
Grow / shrink by hand         Grow / shrink FS automatically
Limited bandwidth             All bandwidth always available
Storage fragmented            All storage in pool is shared

[Figure: each traditional FS sits on its own volume manager; multiple ZFS file systems
draw from a single shared storage pool.]
9. ZFS in a nutshell

ZFS Data Integrity Model
Everything is copy-on-write
• Never overwrite live data
• On-disk state always valid – no “windows of vulnerability”
• No need for fsck(1M)
Everything is transactional
• Related changes succeed or fail as a whole
• No need for journaling
Everything is checksummed
• No silent data corruption
• No panics due to silently corrupted metadata

Features
• Transparent compression: Yes
• Transparent encryption: Yes
• Data deduplication: Yes

Limits
• Max. file size: 2^64 bytes (16 exabytes)
• Max. number of files: 2^48
• Max. filename length: 255 bytes
• Max. volume size: 2^64 bytes (16 exabytes)
10. ZFS pool fundamentals
• ZFS data lives in pools. A system can have multiple pools
• ZFS pools can have different storage properties: one or more disks arranged simple,
mirrored, or as RAID (several styles), optionally with separate cache or “intent log” devices
• A ZFS pool is composed of multiple virtual devices (vdevs) that are based on either physical
devices (e.g., a disk) or groups of logically linked disks (e.g., a mirror or RAID group)
• Each pool can have multiple ZFS file systems, which may be nested; each can have separate
properties (such as quotas, compression, record size) and ownership, and can be separately
snapshotted, cloned, etc.
• zpool command manages pools, zfs command manages FS
11. FS / Volume Model vs. ZFS

FS / Volume I/O Stack:
• FS to Volume
– Block device interface
– Write blocks, no transaction boundary
– Loss of power = loss of consistency
– Workaround: journaling – slow & complex
• Volume to Disk
– Block device interface
– Write each block to each disk immediately to sync mirrors
– Loss of power = resync
– Synchronous & slow

ZFS I/O Stack:
• ZFS to Data Management Unit (DMU)
– Object-based transactions
– “Change these objects”
– All or nothing
• DMU to Storage Pool
– Transaction group commit
– All or nothing
– Always consistent on disk
– Journal not needed
• Storage Pool to Disk
– Schedule, aggregate, and issue I/O at will
– Runs at platter speed
– No resync if power lost
13. ZFS Data Integrity Model
Everything is copy-on-write
Never overwrite live data
On-disk state always valid – no fsck
Everything is transactional
Related changes succeed or fail as a whole
No need for journaling
Everything is checksummed
No silent corruptions
No panics from bad metadata
Enhanced data protection
Mirrored pools, RAID-Z, disk scrubbing
14. Copy-On-Write
•While copy-on-write is used by ZFS as a means to achieve always consistent on-disk
structures, it also enables some useful side effects.
•ZFS does not perform any immediate in-place correction when it detects checksum errors
on objects. It simply takes advantage of the copy-on-write (COW) mechanism and waits for
the next transaction group commit to write new objects to disk.
•This technique provides for better performance while relying on the frequency of
transaction group commits.
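The COW update path described above can be sketched as a toy model (illustrative Python only, not ZFS's actual on-disk structures): blocks are never overwritten, and the only in-place update is the atomic switch of the uberblock pointer at transaction group commit.

```python
# Toy model of copy-on-write updates (illustration only, not ZFS's real format).
# Live blocks are never overwritten: a write allocates a new block, and the
# "uberblock" (root pointer) is switched atomically at transaction group commit.

class Pool:
    def __init__(self):
        self.blocks = {}          # block id -> data (append-only)
        self.next_id = 0
        self.uberblock = None     # id of the current root block

    def alloc(self, data):
        bid = self.next_id
        self.next_id += 1
        self.blocks[bid] = data   # never overwrites an existing block
        return bid

    def txg_commit(self, new_root):
        # The only in-place update: atomically repoint the uberblock.
        self.uberblock = new_root

pool = Pool()
old_root = pool.alloc({"file.txt": "v1"})
pool.txg_commit(old_root)

# "Modify" the file: allocate a new copy of the tree, then commit.
new_root = pool.alloc({"file.txt": "v2"})
pool.txg_commit(new_root)

# The old version still exists on disk until its blocks are freed,
# which is why on-disk state is always valid at every instant.
assert pool.blocks[old_root]["file.txt"] == "v1"
assert pool.blocks[pool.uberblock]["file.txt"] == "v2"
```

A crash before `txg_commit` leaves the old uberblock (and hence the old, fully consistent tree) in place, which is the "no windows of vulnerability" property.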
15. Copy-on-Write and Transactional
[Figure: four stages of a COW update]
1. Initial block tree, rooted at the uber-block
2. Writes go to copies of the changed data blocks; the original data is untouched
3. The indirect blocks are copied-on-write in turn, so new pointers exist alongside the
original pointers
4. Finally the uber-block is rewritten to reference the new tree
16. End-to-End Checksums
ZFS structure:
•Uberblock
•Tree with block pointers
•Data only in leaves
Checksums are stored separately from the data they protect, so the entire I/O path, up to
the uber-block, is self-validating.
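The idea of keeping each checksum apart from the block it protects can be sketched as a tiny Merkle-style tree (an illustrative toy, not ZFS's block-pointer format), here using SHA-256 from Python's standard library:

```python
# Toy Merkle-style checksum tree: the parent stores the checksums of its
# children, so a checksum is kept apart from the data it protects
# (analogous to ZFS block pointers). Illustration only.
import hashlib

def checksum(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()

# Leaves hold data; the "uberblock" holds the checksums of the leaves.
leaves = [b"data block 0", b"data block 1"]
uberblock = [checksum(b) for b in leaves]

def verify(leaves, uberblock):
    # Self-validating path: recompute each leaf checksum and compare it
    # against the copy stored in the parent.
    return all(checksum(b) == c for b, c in zip(leaves, uberblock))

assert verify(leaves, uberblock)

# Silent corruption in a leaf is caught, because the checksum stored
# in the parent no longer matches the recomputed one.
leaves[1] = b"data block 1 (bit flip)"
assert not verify(leaves, uberblock)
```

If the checksum were stored next to the data (as with a simple disk-sector CRC), a misdirected or phantom write could corrupt both together; separating them is what makes the validation end-to-end.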
17. Self-Healing Data
ZFS can detect bad data using checksums and “heal”
the data using its mirrored copy.
[Figure: three stages of a self-healing read]
1. The application reads; the ZFS mirror detects bad data via its checksum
2. ZFS gets the good data from the other mirror copy and returns it to the application
3. ZFS “heals” the bad copy
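The three stages above can be sketched as a toy two-way mirror (illustrative Python, not ZFS internals): a read verifies each copy against the checksum stored in the parent, serves the good copy, and rewrites the bad one.

```python
# Toy self-healing read on a two-way mirror (illustration only).
import hashlib

def checksum(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()

good = b"important data"
expected = checksum(good)            # stored in the parent block pointer

mirror = [good, b"bit-rotted data"]  # one side silently corrupted

def self_healing_read(mirror, expected):
    for copy in mirror:
        if checksum(copy) == expected:
            # Heal any copies that fail verification before returning.
            for j in range(len(mirror)):
                if checksum(mirror[j]) != expected:
                    mirror[j] = copy
            return copy
    raise IOError("all copies failed checksum verification")

data = self_healing_read(mirror, expected)
assert data == good
assert mirror[0] == mirror[1] == good  # the bad copy was repaired
```

A plain mirror without end-to-end checksums cannot do this: when the two sides disagree, it has no way to tell which copy is the good one.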
18. Silent Data Corruption
• A study at CERN showed alarming results
- Of 8.7TB scanned, about 1 in 1500 files was corrupted
• ZFS offers provable end-to-end data integrity
- Checksum and data are stored in isolation from each other
• Ditto blocks (redundant copies of data) are just another property:
# zfs set copies=2 doubled_data_fs
19. RAID-Z Protection
ZFS provides better than RAID-5 availability
•Copy-on-write approach solves historical problems
•Striping uses dynamic widths
•Each logical block is its own stripe
•All writes are full-stripe writes
•Eliminates read-modify-write (So it's fast!)
•Eliminates RAID-5 “write hole”
•No need for NVRAM
20. RAID-Z
• Dynamic stripe width
• Variable block size: 512 bytes – 128K
• Each logical block is its own stripe
• Single, double, or triple parity
• All writes are full-stripe writes
• Eliminates read-modify-write (it's fast)
• Eliminates the RAID-5 write hole (no need for NVRAM)
• Detects and corrects silent data corruption
• Checksum-driven combinatorial reconstruction
• No special hardware – ZFS loves cheap disks
[Figure: on-disk layout across five disks (A–E), addressed by LBA, with parity (P) and
data (D) blocks laid out in variable-width stripes.]
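Why full-stripe writes with parity can rebuild a lost disk can be sketched with single (XOR) parity, as in RAID-Z1. This is a deliberately simplified toy: real RAID-Z adds variable stripe widths and checksum-driven combinatorial reconstruction on top of this idea.

```python
# Toy single-parity full-stripe write and reconstruction (XOR, as in RAID-Z1).
# Illustration only; real RAID-Z layouts are considerably more involved.

def full_stripe_write(data_blocks):
    """Write a whole stripe at once: data columns plus an XOR parity column.
    No read-modify-write is needed because the stripe arrives complete."""
    parity = bytes(len(data_blocks[0]))
    for block in data_blocks:
        parity = bytes(a ^ b for a, b in zip(parity, block))
    return data_blocks + [parity]

def reconstruct(stripe, lost):
    """Rebuild one lost column by XOR-ing the surviving columns."""
    out = bytes(len(stripe[0]))
    for i, col in enumerate(stripe):
        if i != lost:
            out = bytes(a ^ b for a, b in zip(out, col))
    return out

stripe = full_stripe_write([b"AAAA", b"BBBB", b"CCCC"])
assert reconstruct(stripe, lost=1) == b"BBBB"    # recover a failed data disk
assert reconstruct(stripe, lost=3) == stripe[3]  # or the parity column itself
```

Because every logical block is written as its own complete stripe, data and parity always land together in one atomic (copy-on-write) operation, which is what closes the RAID-5 write hole without NVRAM.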
21. ZFS Intent Log (ZIL)
• Filesystems buffer write requests and sync them to storage periodically to improve
performance
• Power loss can therefore corrupt filesystems and/or lose data; in ZFS, the corruption
problem is solved by transaction group (TXG) commits
• Some applications require synchronous semantics: data must be flushed to stable storage
by the time the system call returns
– Open the file with O_DSYNC, or flush buffers with fsync(3c)
• The ZIL provides these synchronous semantics for ZFS via a replayable log written to disk
• ZIL traffic is high-IOPS, small, and mostly writes; it can be directed to a separate device
(short-stroked disk, SSD, flash) for a dramatic performance improvement at thousands of
writes/sec
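The ZIL's role can be sketched as a toy intent log (illustrative Python, not ZFS's real log format): a synchronous write is logged durably and acknowledged immediately, the main pool state is updated later at TXG commit, and after a crash the log is replayed.

```python
# Toy intent log: synchronous writes are logged durably and acknowledged,
# then applied in bulk at the next transaction group commit. After a crash,
# surviving log records are replayed. (Illustration only.)

class IntentLog:
    def __init__(self):
        self.records = []   # stands in for a fast, stable log device

    def log_sync_write(self, path, data):
        self.records.append((path, data))   # durable before returning
        # ...here the synchronous syscall would return to the application

class Filesystem:
    def __init__(self):
        self.files = {}     # main pool state, updated only at TXG commit

    def txg_commit(self, zil):
        for path, data in zil.records:
            self.files[path] = data
        zil.records.clear()  # committed records are no longer needed

    def replay(self, zil):
        # After power loss before a commit: reapply the logged writes.
        self.txg_commit(zil)

zil = IntentLog()
fs = Filesystem()
zil.log_sync_write("/db/journal", b"record 1")

# Crash before the TXG commit: main state is stale, but the log survives.
assert "/db/journal" not in fs.files
fs.replay(zil)
assert fs.files["/db/journal"] == b"record 1"
```

Note that, as the later notes in this document stress, the log exists only to honor synchronous guarantees; on-disk consistency comes from COW TXG commits, not from the ZIL.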
22. ZFS Snapshots
• Provide a read-only, point-in-time copy of a file system
• Copy-on-write makes them essentially “free”
• Very space efficient – only changes are tracked/stored
• And instantaneous – creating one just doesn't delete the old copy
[Figure: the snapshot uber-block and the new uber-block both reference the current data.]
23. ZFS Snapshots
Simple to create and rollback with snapshots
# zfs list -r tank
NAME USED AVAIL REFER MOUNTPOINT
tank 20.0G 46.4G 24.5K /tank
tank/home 20.0G 46.4G 28.5K /export/home
tank/home/ahrens 24.5K 10.0G 24.5K /export/home/ahrens
tank/home/billm 24.5K 46.4G 24.5K /export/home/billm
tank/home/bonwick 24.5K 66.4G 24.5K /export/home/bonwick
# zfs snapshot tank/home/billm@s1
# zfs list -r tank/home/billm
NAME USED AVAIL REFER MOUNTPOINT
tank/home/billm 24.5K 46.4G 24.5K /export/home/billm
tank/home/billm@s1 0 - 24.5K -
# cat /export/home/billm/.zfs/snapshot/s1/foo.c
# zfs rollback tank/home/billm@s1
# zfs destroy tank/home/billm@s1
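Why snapshot and rollback are instantaneous follows directly from copy-on-write, and can be sketched as a toy model (illustrative Python, not ZFS internals): a snapshot is just a saved root pointer into the append-only block store, and rollback restores it.

```python
# Toy snapshot/rollback on top of copy-on-write (illustration only).
# A snapshot is just a saved root pointer; because live data is never
# overwritten, taking one is instantaneous and stores no extra data.

blocks = {}                 # append-only block store
def alloc(data):
    bid = len(blocks)
    blocks[bid] = data
    return bid

root = alloc({"foo.c": "int main(){}"})   # current filesystem root
snapshot_s1 = root                        # "zfs snapshot ...@s1": save the pointer

root = alloc({"foo.c": "int main(){return 1;}"})  # modify the file via COW

# "zfs rollback ...@s1": restore the saved root pointer.
root = snapshot_s1
assert blocks[root]["foo.c"] == "int main(){}"
```

This is also why a fresh snapshot shows USED of 0 in the `zfs list` output above: until the live file system diverges, the snapshot references exactly the same blocks.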
24. ZFS Clones
A clone is a writable copy of a snapshot
Created instantly, unlimited number
Perfect for “read-mostly” file systems – source directories, application binaries
and configuration, etc.
# zfs list -r tank/home/billm
NAME USED AVAIL REFER MOUNTPOINT
tank/home/billm 24.5K 46.4G 24.5K /export/home/billm
tank/home/billm@s1 0 - 24.5K -
# zfs clone tank/home/billm@s1 tank/newbillm
# zfs list -r tank/home/billm tank/newbillm
NAME USED AVAIL REFER MOUNTPOINT
tank/home/billm 24.5K 46.4G 24.5K /export/home/billm
tank/home/billm@s1 0 - 24.5K -
tank/newbillm 0 46.4G 24.5K /tank/newbillm
25. ZFS Data Migration
•Host-neutral format on-disk
•Move data from SPARC to x86 transparently
•Data is always written in the host's native byte order; reads byte-swap if needed
•ZFS pools may be moved from host to host
•Also handy for external USB disks
•ZFS handles device ids & paths, mount points, etc.
Export pool from original host
source# zpool export tank
Import pool on new host (“zpool import” without operands lists importable pools)
destination# zpool import tank
26. ZFS Cheatsheet
http://www.datadisk.co.uk/html_docs/sun/sun_zfs_cs.htm

Create a raidz pool (partition drives to match; in this case "s0" is the same size on each):
•zpool create -f p01 raidz c7t0d0s0 c7t1d0s0 c8t0d0s0
•zpool status

Create file systems:
•zpool list / zpool status
•zfs create p01/CDIMAGES
•zfs list / df -k

Rename a pool:
•zpool export rpool
•zpool import rpool oldrpool

Change mount point & mount:
•zfs set mountpoint=/oldrpool/export oldrpool/export
•zfs mount oldrpool/export

See all the mount points in a zfs pool:
•zfs list

See pools on drives that haven't been imported:
•zpool import

Create a swap area in a zfs pool and activate it:
•zfs create -V 5gb tank/vol
•swap -a /dev/zvol/dsk/tank/vol
•swap -l

Clone drive partition tables:
•prtvtoc /dev/rdsk/c0t0d0s2 | fmthard -s - /dev/rdsk/c0t1d0s2

Mirror the root partition after initial install:
•zpool list / zpool status
•Assuming c5t0d0s0 is root, repartition c5t1d0s0 to match. (Make sure you delete "s2",
the full-drive partition, or you'll get an overlap error.)
•zpool attach rpool c5t0d0s0 c5t1d0s0
27. ZFS Command Summary
•Create a ZFS storage pool # zpool create mpool mirror c1t0d0 c2t0d0
•Add capacity to a ZFS storage pool # zpool add mpool mirror c5t0d0 c6t0d0
•Add hot spares to a ZFS storage pool # zpool add mypool spare c6t0d0 c7t0d0
•Replace a device in a storage pool # zpool replace mpool c6t0d0 [c7t0d0]
•Display storage pool capacity # zpool list
•Display storage pool status # zpool status
•Scrub a pool # zpool scrub mpool
•Remove a pool # zpool destroy mpool
•Create a ZFS file system # zfs create mpool/devel
•Create a child ZFS file system # zfs create mpool/devel/data
•Remove a file system # zfs destroy mpool/devel
•Take a snapshot of a file system # zfs snapshot mpool/devel/data@today
•Roll back to a file system snapshot # zfs rollback -r mpool/devel/data@today
•Create a writable clone from a snapshot # zfs clone mpool/devel/data@today mpool/clones/devdata
•Remove a snapshot # zfs destroy mpool/devel/data@today
•Enable compression on a file system # zfs set compression=on mpool/clones/devdata
•Disable compression on a file system # zfs inherit compression mpool/clones/devdata
•Set a quota on a file system # zfs set quota=60G mpool/devel/data
•Set a reservation on a new file system # zfs create -o reserv=20G mpool/devel/admin
•Share a file system over NFS # zfs set sharenfs=on mpool/devel/data
•Create a ZFS volume # zfs create -V 2GB mpool/vol
•Remove a ZFS volume # zfs destroy mpool/vol
The "write hole" effect can occur if a power failure happens during a write. It affects all array types, including but not limited to RAID5, RAID6, and RAID1. When it happens, it is impossible to determine which of the data or parity blocks were written to the disks and which were not, so the parity no longer matches the rest of the data in the stripe. Worse, you cannot determine with confidence which part is incorrect - the parity or one of the data blocks. http://www.raid-recovery-guide.com/raid5-write-hole.aspx
Short stroking aims to minimize performance-eating head-repositioning delays by reducing the number of tracks used per hard drive. In a simple example, a terabyte hard drive (1,000 GB) may be based on three platters of 333 GB each. If we were to use only 10% of the storage medium, starting with the outer sectors of the drive (which provide the best performance), the drive would have to deal with significantly fewer head movements. The trade-off is significantly reduced capacity: in this example, the terabyte drive would be limited to 33 GB per platter, for a total of only 100 GB. But access times should be noticeably shorter and I/O performance much improved, as the drive can operate with a minimum of physical activity.

ZFS uses an intent log to provide synchronous write guarantees to applications. When an application issues a synchronous write, ZFS records the transaction in the intent log (ZIL) and the write call returns. When there is sufficient data to write to disk, ZFS performs a TXG commit and writes all the data at once. The ZIL is not used to maintain consistency of on-disk structures; it exists only to provide synchronous guarantees.
http://mognet.no-ip.info/wordpress/2012/02/zfs-the-best-file-system-for-raid/ The L2ARC works as a read cache layer between main memory and the disk storage pool (ARC <-> L2ARC <-> pool). It holds non-dirty ZFS data and is currently intended to improve the performance of random-read or streaming-read workloads (see the l2arc_noprefetch option). ZIL (ZFS Intent Log) devices can be added to a ZFS pool to speed up the write path of any level of ZFS RAID: intent-log records go to a very fast SSD, increasing the write throughput of the system, and when the physical spindles have a moment, that data is flushed to the spinning media and the process starts over. We have observed significant performance increases by adding ZIL drives to our ZFS configuration. One thing to keep in mind is that the ZIL device should be mirrored to protect the speed of the ZFS system: if it is not mirrored and the drive being used for the ZIL fails, the system will revert to writing the data directly to the disks, severely hampering performance.