ZFS is a file system developed by Sun Microsystems that provides advanced storage capabilities such as data integrity checking, snapshots and cloning. Some key features of ZFS include using copy-on-write storage, end-to-end checksumming of data to prevent silent data corruption, transactional semantics for consistency, and pooled storage that allows for thin provisioning and easy management of storage resources. ZFS aims to eliminate many of the issues with traditional file systems through its novel approach to data storage and management.
2. What is ZFS?
Developed by: Sun Microsystems
Introduced: November 2005 (OpenSolaris)
• ZFS (Zettabyte File System) was a file system created by Sun Microsystems, which Oracle
later acquired along with the rest of the company.
• Oracle initially championed Btrfs, until it acquired ZFS through the Sun purchase.
• Oracle still funds Btrfs development; its feature set is meant to be comparable to ZFS, but it
remains years behind, in part because of slow progress toward a stable release.
• ZFS is an object based filesystem and is very differently organized from most regular file
systems. ZFS provides transactional consistency and is always on-disk consistent due to
copy-on-write semantics and strong checksums which are stored at a different location than
the data blocks.
3. Trouble With Existing Filesystems
• No defense against silent data corruption
•Any defect in disk, controller, cable, driver, or firmware can corrupt data silently; like
running a server without ECC memory
• Difficult to manage
•Disk labels, partitions, volumes, provisioning, grow/shrink, hand-editing /etc/vfstab...
•Lots of limits: filesystem/volume size, file size, number of files, files per directory, number
of snapshots, ...
•Not portable between x86 and SPARC
• Performance could be much better
•Linear-time create, fat locks, fixed block size, naïve prefetch, slow random writes, dirty
region logging
4. ZFS Objective
• End the suffering
• Design an integrated system from scratch
• Throw away 20 years of obsolete assumptions
6. Evolution of Disks and Volumes
• Initially, we had simple disks
• Disks were abstracted into volumes to meet requirements; each file system sat on top of a
volume manager
• An industry grew around hardware and software volume management
[Figure: three pairs of 1GB disks combined by a volume manager into a concatenated 2GB
volume (lower/upper), a striped 2GB volume (even/odd), or a mirrored 1GB volume
(left/right).]
7. ZFS Design Principles
• Start with a new design around today's requirements
• Pooled storage
– Eliminate the notion of volumes
– Do for storage what virtual memory did for RAM
• End-to-end data (and metadata) integrity
– Historically considered too expensive.
– Now, data is too valuable not to protect
• Transactional operation
– Maintain consistent on-disk format
– Reorder transactions for performance gains – coalesced I/O is a big performance win
8. FS/Volume Model vs. ZFS

Traditional Volumes           ZFS Pooled Storage
1:1 FS to volume              No partitions / volumes
Grow / shrink by hand         Grow / shrink FS automatically
Limited bandwidth             All bandwidth always available
Storage fragmented            All storage in pool is shared

[Figure: each traditional FS sits on its own volume manager; multiple ZFS file systems
draw from a single shared storage pool.]
9. ZFS in a nutshell

ZFS Data Integrity Model
Everything is copy-on-write
• Never overwrite live data
• On-disk state always valid – no “windows of vulnerability”
• No need for fsck(1M)
Everything is transactional
• Related changes succeed or fail as a whole
• No need for journaling
Everything is checksummed
• No silent data corruption
• No panics due to silently corrupted metadata

Features
• Transparent compression: Yes
• Transparent encryption: Yes
• Data deduplication: Yes

Limits
• Max. file size: 2^64 bytes (16 exabytes)
• Max. number of files: 2^48
• Max. filename length: 255 bytes
• Max. volume size: 2^64 bytes (16 exabytes)
10. ZFS pool fundamentals
• ZFS data lives in pools. A system can have multiple pools
• ZFS pools can have different storage properties: one or more disks arranged simple,
mirrored, or as RAID (several styles), optionally with separate cache or “intent log” devices
• A ZFS pool is composed of multiple virtual devices (vdevs) that are based on either physical
devices (e.g., a disk) or groups of logically linked disks (e.g., a mirror or RAID group)
• Each pool can have multiple ZFS file systems, which may be nested; each can have separate
properties (such as quotas, compression, record size) and ownership, and can be separately
snapshotted, cloned, etc.
• zpool command manages pools, zfs command manages FS
11. FS / Volume Model vs. ZFS

FS / Volume I/O Stack:
• FS to Volume
– Block device interface
– Write blocks, no transaction boundary
– Loss of power = loss of consistency
– Workaround: journaling – slow & complex
• Volume to Disk
– Block device interface
– Write each block to each disk immediately to sync mirrors
– Loss of power = resync
– Synchronous & slow

ZFS I/O Stack:
• ZFS to Data Management Unit (DMU)
– Object-based transactions
– “Change these objects”
– All or nothing
• DMU to Storage Pool
– Transaction group commit
– All or nothing
– Always consistent on disk
– Journal not needed
• Storage Pool to Disk
– Schedule, aggregate, and issue I/O at will
– Runs at platter speed
– No resync if power lost
13. ZFS Data Integrity Model
Everything is copy-on-write
Never overwrite live data
On-disk state always valid – no fsck
Everything is transactional
Related changes succeed or fail as a whole
No need for journaling
Everything is checksummed
No silent corruptions
No panics from bad metadata
Enhanced data protection
Mirrored pools, RAID-Z, disk scrubbing
14. Copy-On-Write
•While copy-on-write is used by ZFS as a means to achieve always consistent on-disk
structures, it also enables some useful side effects.
•ZFS does not perform any immediate in-place correction when it detects checksum errors
on objects. It simply takes advantage of the copy-on-write (COW) mechanism and waits for
the next transaction group commit to write new objects to disk.
•This technique provides for better performance while relying on the frequency of
transaction group commits.
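The COW update path described above can be sketched as a toy model (illustrative Python only, not ZFS's actual on-disk structures): blocks are never overwritten, and the only in-place update is the atomic switch of the uberblock pointer at transaction group commit.

```python
# Toy model of copy-on-write updates (illustration only, not ZFS's real format).
# Live blocks are never overwritten: a write allocates a new block, and the
# "uberblock" (root pointer) is switched atomically at transaction group commit.

class Pool:
    def __init__(self):
        self.blocks = {}          # block id -> data (append-only)
        self.next_id = 0
        self.uberblock = None     # id of the current root block

    def alloc(self, data):
        bid = self.next_id
        self.next_id += 1
        self.blocks[bid] = data   # never overwrites an existing block
        return bid

    def txg_commit(self, new_root):
        # The only in-place update: atomically repoint the uberblock.
        self.uberblock = new_root

pool = Pool()
old_root = pool.alloc({"file.txt": "v1"})
pool.txg_commit(old_root)

# "Modify" the file: allocate a new copy of the tree, then commit.
new_root = pool.alloc({"file.txt": "v2"})
pool.txg_commit(new_root)

# The old version still exists on disk until its blocks are freed,
# which is why on-disk state is always valid at every instant.
assert pool.blocks[old_root]["file.txt"] == "v1"
assert pool.blocks[pool.uberblock]["file.txt"] == "v2"
```

A crash before `txg_commit` leaves the old uberblock (and hence the old, fully consistent tree) in place, which is the "no windows of vulnerability" property.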
15. Copy-on-Write and Transactional
[Figure: four stages of a COW update]
1. Initial block tree, rooted at the uber-block
2. Writes go to copies of the changed data blocks; the original data is untouched
3. The indirect blocks are copied-on-write in turn, so new pointers exist alongside the
original pointers
4. Finally the uber-block is rewritten to reference the new tree
16. End-to-End Checksums
ZFS structure:
•Uberblock
•Tree with block pointers
•Data only in leaves
Checksums are stored separately from the data they protect, so the entire I/O path, up to
the uber-block, is self-validating.
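The idea of keeping each checksum apart from the block it protects can be sketched as a tiny Merkle-style tree (an illustrative toy, not ZFS's block-pointer format), here using SHA-256 from Python's standard library:

```python
# Toy Merkle-style checksum tree: the parent stores the checksums of its
# children, so a checksum is kept apart from the data it protects
# (analogous to ZFS block pointers). Illustration only.
import hashlib

def checksum(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()

# Leaves hold data; the "uberblock" holds the checksums of the leaves.
leaves = [b"data block 0", b"data block 1"]
uberblock = [checksum(b) for b in leaves]

def verify(leaves, uberblock):
    # Self-validating path: recompute each leaf checksum and compare it
    # against the copy stored in the parent.
    return all(checksum(b) == c for b, c in zip(leaves, uberblock))

assert verify(leaves, uberblock)

# Silent corruption in a leaf is caught, because the checksum stored
# in the parent no longer matches the recomputed one.
leaves[1] = b"data block 1 (bit flip)"
assert not verify(leaves, uberblock)
```

If the checksum were stored next to the data (as with a simple disk-sector CRC), a misdirected or phantom write could corrupt both together; separating them is what makes the validation end-to-end.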
17. Self-Healing Data
ZFS can detect bad data using checksums and “heal”
the data using its mirrored copy.
[Figure: three stages of a self-healing read]
1. The application reads; the ZFS mirror detects bad data via its checksum
2. ZFS gets the good data from the other mirror copy and returns it to the application
3. ZFS “heals” the bad copy
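The three stages above can be sketched as a toy two-way mirror (illustrative Python, not ZFS internals): a read verifies each copy against the checksum stored in the parent, serves the good copy, and rewrites the bad one.

```python
# Toy self-healing read on a two-way mirror (illustration only).
import hashlib

def checksum(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()

good = b"important data"
expected = checksum(good)            # stored in the parent block pointer

mirror = [good, b"bit-rotted data"]  # one side silently corrupted

def self_healing_read(mirror, expected):
    for copy in mirror:
        if checksum(copy) == expected:
            # Heal any copies that fail verification before returning.
            for j in range(len(mirror)):
                if checksum(mirror[j]) != expected:
                    mirror[j] = copy
            return copy
    raise IOError("all copies failed checksum verification")

data = self_healing_read(mirror, expected)
assert data == good
assert mirror[0] == mirror[1] == good  # the bad copy was repaired
```

A plain mirror without end-to-end checksums cannot do this: when the two sides disagree, it has no way to tell which copy is the good one.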
18. Silent Data Corruption
• A study at CERN showed alarming results
- Of 8.7TB scanned, about 1 in 1500 files was corrupted
• ZFS offers provable end-to-end data integrity
- Checksum and data are stored in isolation from each other
• Ditto blocks (redundant copies of data) are just another property:
# zfs set copies=2 doubled_data_fs
19. RAID-Z Protection
ZFS provides better than RAID-5 availability
•Copy-on-write approach solves historical problems
•Striping uses dynamic widths
•Each logical block is its own stripe
•All writes are full-stripe writes
•Eliminates read-modify-write (So it's fast!)
•Eliminates RAID-5 “write hole”
•No need for NVRAM
20. RAID-Z
• Dynamic stripe width
• Variable block size: 512 bytes – 128K
• Each logical block is its own stripe
• Single, double, or triple parity
• All writes are full-stripe writes
• Eliminates read-modify-write (it's fast)
• Eliminates the RAID-5 write hole (no need for NVRAM)
• Detects and corrects silent data corruption
• Checksum-driven combinatorial reconstruction
• No special hardware – ZFS loves cheap disks
[Figure: on-disk layout across five disks (A–E), addressed by LBA, with parity (P) and
data (D) blocks laid out in variable-width stripes.]
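Why full-stripe writes with parity can rebuild a lost disk can be sketched with single (XOR) parity, as in RAID-Z1. This is a deliberately simplified toy: real RAID-Z adds variable stripe widths and checksum-driven combinatorial reconstruction on top of this idea.

```python
# Toy single-parity full-stripe write and reconstruction (XOR, as in RAID-Z1).
# Illustration only; real RAID-Z layouts are considerably more involved.

def full_stripe_write(data_blocks):
    """Write a whole stripe at once: data columns plus an XOR parity column.
    No read-modify-write is needed because the stripe arrives complete."""
    parity = bytes(len(data_blocks[0]))
    for block in data_blocks:
        parity = bytes(a ^ b for a, b in zip(parity, block))
    return data_blocks + [parity]

def reconstruct(stripe, lost):
    """Rebuild one lost column by XOR-ing the surviving columns."""
    out = bytes(len(stripe[0]))
    for i, col in enumerate(stripe):
        if i != lost:
            out = bytes(a ^ b for a, b in zip(out, col))
    return out

stripe = full_stripe_write([b"AAAA", b"BBBB", b"CCCC"])
assert reconstruct(stripe, lost=1) == b"BBBB"    # recover a failed data disk
assert reconstruct(stripe, lost=3) == stripe[3]  # or the parity column itself
```

Because every logical block is written as its own complete stripe, data and parity always land together in one atomic (copy-on-write) operation, which is what closes the RAID-5 write hole without NVRAM.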
21. ZFS Intent Log (ZIL)
• Filesystems buffer write requests and sync them to storage periodically to improve
performance
• Power loss can therefore corrupt filesystems and/or lose data; in ZFS, the corruption
problem is solved by transaction group (TXG) commits
• Some applications require synchronous semantics: data must be flushed to stable storage
by the time the system call returns
– Open the file with O_DSYNC, or flush buffers with fsync(3c)
• The ZIL provides these synchronous semantics for ZFS via a replayable log written to disk
• ZIL traffic is high-IOPS, small, and mostly writes; it can be directed to a separate device
(short-stroked disk, SSD, flash) for a dramatic performance improvement at thousands of
writes/sec
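The ZIL's role can be sketched as a toy intent log (illustrative Python, not ZFS's real log format): a synchronous write is logged durably and acknowledged immediately, the main pool state is updated later at TXG commit, and after a crash the log is replayed.

```python
# Toy intent log: synchronous writes are logged durably and acknowledged,
# then applied in bulk at the next transaction group commit. After a crash,
# surviving log records are replayed. (Illustration only.)

class IntentLog:
    def __init__(self):
        self.records = []   # stands in for a fast, stable log device

    def log_sync_write(self, path, data):
        self.records.append((path, data))   # durable before returning
        # ...here the synchronous syscall would return to the application

class Filesystem:
    def __init__(self):
        self.files = {}     # main pool state, updated only at TXG commit

    def txg_commit(self, zil):
        for path, data in zil.records:
            self.files[path] = data
        zil.records.clear()  # committed records are no longer needed

    def replay(self, zil):
        # After power loss before a commit: reapply the logged writes.
        self.txg_commit(zil)

zil = IntentLog()
fs = Filesystem()
zil.log_sync_write("/db/journal", b"record 1")

# Crash before the TXG commit: main state is stale, but the log survives.
assert "/db/journal" not in fs.files
fs.replay(zil)
assert fs.files["/db/journal"] == b"record 1"
```

Note that, as the later notes in this document stress, the log exists only to honor synchronous guarantees; on-disk consistency comes from COW TXG commits, not from the ZIL.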
22. ZFS Snapshots
• Provide a read-only, point-in-time copy of a file system
• Copy-on-write makes them essentially “free”
• Very space efficient – only changes are tracked/stored
• And instantaneous – creating one just doesn't delete the old copy
[Figure: the snapshot uber-block and the new uber-block both reference the current data.]
23. ZFS Snapshots
Simple to create and rollback with snapshots
# zfs list -r tank
NAME USED AVAIL REFER MOUNTPOINT
tank 20.0G 46.4G 24.5K /tank
tank/home 20.0G 46.4G 28.5K /export/home
tank/home/ahrens 24.5K 10.0G 24.5K /export/home/ahrens
tank/home/billm 24.5K 46.4G 24.5K /export/home/billm
tank/home/bonwick 24.5K 66.4G 24.5K /export/home/bonwick
# zfs snapshot tank/home/billm@s1
# zfs list -r tank/home/billm
NAME USED AVAIL REFER MOUNTPOINT
tank/home/billm 24.5K 46.4G 24.5K /export/home/billm
tank/home/billm@s1 0 - 24.5K -
# cat /export/home/billm/.zfs/snapshot/s1/foo.c
# zfs rollback tank/home/billm@s1
# zfs destroy tank/home/billm@s1
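Why snapshot and rollback are instantaneous follows directly from copy-on-write, and can be sketched as a toy model (illustrative Python, not ZFS internals): a snapshot is just a saved root pointer into the append-only block store, and rollback restores it.

```python
# Toy snapshot/rollback on top of copy-on-write (illustration only).
# A snapshot is just a saved root pointer; because live data is never
# overwritten, taking one is instantaneous and stores no extra data.

blocks = {}                 # append-only block store
def alloc(data):
    bid = len(blocks)
    blocks[bid] = data
    return bid

root = alloc({"foo.c": "int main(){}"})   # current filesystem root
snapshot_s1 = root                        # "zfs snapshot ...@s1": save the pointer

root = alloc({"foo.c": "int main(){return 1;}"})  # modify the file via COW

# "zfs rollback ...@s1": restore the saved root pointer.
root = snapshot_s1
assert blocks[root]["foo.c"] == "int main(){}"
```

This is also why a fresh snapshot shows USED of 0 in the `zfs list` output above: until the live file system diverges, the snapshot references exactly the same blocks.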
24. ZFS Clones
A clone is a writable copy of a snapshot
Created instantly, unlimited number
Perfect for “read-mostly” file systems – source directories, application binaries
and configuration, etc.
# zfs list -r tank/home/billm
NAME USED AVAIL REFER MOUNTPOINT
tank/home/billm 24.5K 46.4G 24.5K /export/home/billm
tank/home/billm@s1 0 - 24.5K -
# zfs clone tank/home/billm@s1 tank/newbillm
# zfs list -r tank/home/billm tank/newbillm
NAME USED AVAIL REFER MOUNTPOINT
tank/home/billm 24.5K 46.4G 24.5K /export/home/billm
tank/home/billm@s1 0 - 24.5K -
tank/newbillm 0 46.4G 24.5K /tank/newbillm
25. ZFS Data Migration
•Host-neutral format on-disk
•Move data from SPARC to x86 transparently
•Data is always written in the host's native byte order; reads byte-swap if needed
•ZFS pools may be moved from host to host
•Also handy for external USB disks
•ZFS handles device ids & paths, mount points, etc.
Export pool from original host
source# zpool export tank
Import pool on new host (“zpool import” without operands lists importable pools)
destination# zpool import tank
26. ZFS Cheatsheet
http://www.datadisk.co.uk/html_docs/sun/sun_zfs_cs.htm

Create a raidz pool (partition drives to match; in this case "s0" is the same size on each):
•zpool create -f p01 raidz c7t0d0s0 c7t1d0s0 c8t0d0s0
•zpool status

Create file systems:
•zpool list / zpool status
•zfs create p01/CDIMAGES
•zfs list / df -k

Rename a pool:
•zpool export rpool
•zpool import rpool oldrpool

Change mount point & mount:
•zfs set mountpoint=/oldrpool/export oldrpool/export
•zfs mount oldrpool/export

See all the mount points in a zfs pool:
•zfs list

See pools on drives that haven't been imported:
•zpool import

Create a swap area in a zfs pool and activate it:
•zfs create -V 5gb tank/vol
•swap -a /dev/zvol/dsk/tank/vol
•swap -l

Clone drive partition tables:
•prtvtoc /dev/rdsk/c0t0d0s2 | fmthard -s - /dev/rdsk/c0t1d0s2

Mirror the root partition after initial install:
•zpool list / zpool status
•Assuming c5t0d0s0 is root, repartition c5t1d0s0 to match. (Make sure you delete "s2",
the full-drive partition, or you'll get an overlap error.)
•zpool attach rpool c5t0d0s0 c5t1d0s0
27. ZFS Command Summary
•Create a ZFS storage pool # zpool create mpool mirror c1t0d0 c2t0d0
•Add capacity to a ZFS storage pool # zpool add mpool mirror c5t0d0 c6t0d0
•Add hot spares to a ZFS storage pool # zpool add mypool spare c6t0d0 c7t0d0
•Replace a device in a storage pool # zpool replace mpool c6t0d0 [c7t0d0]
•Display storage pool capacity # zpool list
•Display storage pool status # zpool status
•Scrub a pool # zpool scrub mpool
•Remove a pool # zpool destroy mpool
•Create a ZFS file system # zfs create mpool/devel
•Create a child ZFS file system # zfs create mpool/devel/data
•Remove a file system # zfs destroy mpool/devel
•Take a snapshot of a file system # zfs snapshot mpool/devel/data@today
•Roll back to a file system snapshot # zfs rollback -r mpool/devel/data@today
•Create a writable clone from a snapshot # zfs clone mpool/devel/data@today mpool/clones/devdata
•Remove a snapshot # zfs destroy mpool/devel/data@today
•Enable compression on a file system # zfs set compression=on mpool/clones/devdata
•Disable compression on a file system # zfs inherit compression mpool/clones/devdata
•Set a quota on a file system # zfs set quota=60G mpool/devel/data
•Set a reservation on a new file system # zfs create -o reserv=20G mpool/devel/admin
•Share a file system over NFS # zfs set sharenfs=on mpool/devel/data
•Create a ZFS volume # zfs create -V 2GB mpool/vol
•Remove a ZFS volume # zfs destroy mpool/vol
The "write hole" effect can occur if a power failure happens during a write. It affects all array types, including but not limited to RAID5, RAID6, and RAID1. When it happens, it is impossible to determine which of the data or parity blocks were written to the disks and which were not, so the parity no longer matches the rest of the data in the stripe. Worse, you cannot determine with confidence which part is incorrect - the parity or one of the data blocks. http://www.raid-recovery-guide.com/raid5-write-hole.aspx
Short stroking aims to minimize performance-eating head-repositioning delays by reducing the number of tracks used per hard drive. In a simple example, a terabyte hard drive (1,000 GB) may be based on three platters of 333 GB each. If we were to use only 10% of the storage medium, starting with the outer sectors of the drive (which provide the best performance), the drive would have to deal with significantly fewer head movements. The trade-off is significantly reduced capacity: in this example, the terabyte drive would be limited to 33 GB per platter, for a total of only 100 GB. But access times should be noticeably shorter and I/O performance much improved, as the drive can operate with a minimum of physical activity.

ZFS uses an intent log to provide synchronous write guarantees to applications. When an application issues a synchronous write, ZFS records the transaction in the intent log (ZIL) and the write call returns. When there is sufficient data to write to disk, ZFS performs a TXG commit and writes all the data at once. The ZIL is not used to maintain consistency of on-disk structures; it exists only to provide synchronous guarantees.
http://mognet.no-ip.info/wordpress/2012/02/zfs-the-best-file-system-for-raid/ The L2ARC works as a read cache layer between main memory and the disk storage pool (ARC <-> L2ARC <-> pool). It holds non-dirty ZFS data and is currently intended to improve the performance of random-read or streaming-read workloads (see the l2arc_noprefetch option). ZIL (ZFS Intent Log) devices can be added to a ZFS pool to speed up the write path of any level of ZFS RAID: intent-log records go to a very fast SSD, increasing the write throughput of the system, and when the physical spindles have a moment, that data is flushed to the spinning media and the process starts over. We have observed significant performance increases by adding ZIL drives to our ZFS configuration. One thing to keep in mind is that the ZIL device should be mirrored to protect the speed of the ZFS system: if it is not mirrored and the drive being used for the ZIL fails, the system will revert to writing the data directly to the disks, severely hampering performance.