2. About Me
• CTO of Tuangru, a data center management software company
• 22+ years of experience in technology
• Education in science (Physics) and business (MBA)
5. Typical Enterprise Storage Needs
• Data services for operations such as Windows network file sharing and email server back ends for a single enterprise
• Multi-tenant cloud for service providers
• Archival grade for backup and long term storage
• Specialized, e.g. low-latency applications such as high-frequency trading and low-latency databases
6. Components and Technologies
• Direct attach, e.g. locally attached disks, NVMe, NVDIMM, SAS JBODs, etc.
• SAN, e.g. Fibre Channel, iSCSI
• NAS, e.g. CIFS & NFS
• Object, e.g. S3 or Swift Object Storage
• Archival, e.g. BlackPearl from SpectraLogic and Everspan from Sony
7. Storage Tiers
• Tier 0: High performance, e.g. very busy OLTP databases $$$$
• Tier 1: General purpose, e.g. Web server $$$
• Tier 2: Low performance, e.g. backup site or backup target $$
• Tier 3: Cheap and deep, e.g. object store $
• Deep Archive: Write once read never (e.g. Archival tape libraries) $
8. Typical Concerns for a Storage Admin
• Cost
• Security (isolation of traffic and data)
• Performance (peak load, average load, percentiles, etc.)
• QoS (dealing with noisy neighbors)
• Scale management (more applications, more clients, more data, etc.)
• Growth management (scale up vs. scale out)
• Data integrity (silent corruption & device failure)
• Service availability (backup and business continuity)
• Programmability (prescriptive applications)
11. How does ZFS fit in?
• Brief history of ZFS
• Introduction to ZFS concepts
• Using ZFS in production
12. FreeBSD and ZFS
FreeBSD is used as the base system for NetApp, EMC Isilon, Dell Compellent, SpectraLogic, iXsystems TrueNAS and FreeNAS, and many more. However, not all of these use ZFS.
ZFS is the base storage filesystem for SpectraLogic, Oracle, FreeNAS, TrueNAS, Delphix, Nexenta, Netgear, OS Nexus, Datto, Joyent Cloud and many more. However, not all of these use FreeBSD.
13. Short History
• 2005 – Released as part of OpenSolaris under the CDDL license.
• 2007 – Integrated into FreeBSD as part of 7.0-RELEASE.
• 2010 – Oracle closed source development; open development continued in the illumos fork and later coalesced into the OpenZFS project.
• Open-ZFS.org is a vibrant, productive and open community that supports ZFS on Solaris variants (mainly illumos), FreeBSD, Linux and OS X.
14. ZFS Basics
ZFS is a copy on write (COW) file system that is designed to keep large amounts of data for an
indefinite period of time.
Its limits are designed not to be reached in practice.
Its design tolerates:
• Normal hardware failure scenarios, e.g. drive failure
• Data corruption, detected and repaired using checksums, parity information and data copies. This includes normal corruption due to disk failure as well as silent corruption/bit rot
16. Types of VDEVs
• Disk: An entire disk or a partition
• File: A file of at least 128 MB in size. This is typically for testing or experimentation
• Mirror: AKA RAID 1
• RAIDZ(1,2,3): roughly equivalent to RAID levels 5, 6, and a theoretical triple-parity 7
• Spare: Special pseudo-device. This is for hot spares to be used with “zpool replace”
• Cache: AKA L2ARC, used for read caching
• Log: A dedicated ZIL (ZFS Intent Log) device, used to capture synchronous writes before they are flushed to disk
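As a rough sketch (pool and device names here are hypothetical), a pool combining several of these vdev types could be assembled like this:

  # hypothetical names: striped pair of mirrors for data,
  # plus dedicated log, cache and spare devices
  zpool create tank mirror ada0 ada1 mirror ada2 ada3
  zpool add tank log ada4
  zpool add tank cache ada5
  zpool add tank spare ada6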
17. Datasets
• ZFS datasets are the basic building blocks for data management in ZFS
• Datasets are thin provisioned and share the pool
• Each dataset has system properties like mount point, compression, case sensitivity, read-only and many more
• Datasets can have user properties to further annotate them
• Datasets can be nested
• Dataset administration can be delegated
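A minimal sketch of day-to-day dataset management (the dataset names and the user property are hypothetical):

  # hypothetical names: create nested datasets and set system properties
  zfs create tank/projects
  zfs create tank/projects/alpha
  zfs set compression=lz4 tank/projects
  zfs set readonly=on tank/projects/alpha
  # user properties are namespaced with a colon
  zfs set com.example:owner=alice tank/projects/alpha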
18. ZFS Volumes
Volumes are a special type of dataset. They allow the storage admin to export a portion
of the pool as a block device that can be formatted to another file system, like UFS, EXT4
or NTFS.
Volumes work well for exporting block devices via iSCSI and can serve as a disk
backend for a VM.
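For example (names and size hypothetical), a 20 GB volume can be created and formatted with another file system; on FreeBSD the block device appears under /dev/zvol:

  # hypothetical names: create a volume and put UFS on it
  zfs create -V 20G tank/vmdisk0
  newfs /dev/zvol/tank/vmdisk0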
19. Snapshots
ZFS allows for nearly instantaneous read-only snapshots. Snapshots do not initially use any space in
the pool but will start to use space as the original diverges from the snapshot.
Snapshots can be used to:
• Restore a dataset or a single file
• Clone a dataset
Snapshots are not recursive by default. Be careful with nested datasets.
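A short sketch of the snapshot workflow (dataset and snapshot names hypothetical):

  # hypothetical names: take, roll back to, and clone a snapshot
  zfs snapshot tank/projects@before-upgrade
  zfs rollback tank/projects@before-upgrade
  zfs clone tank/projects@before-upgrade tank/projects-test
  # -r snapshots nested datasets recursively
  zfs snapshot -r tank/projects@nightly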
20. Replication
• Snapshots are the basis of replication
• A storage administrator can use zfs send to serialize a snapshot and write it to a file, or send it to another pool or system via SSH
• The zfs send command can also produce incremental streams
• The zfs receive command turns the stream from a send operation back into a dataset
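A sketch of a full send followed by an incremental one (host, dataset and snapshot names hypothetical):

  # hypothetical names: full replication, then only the changes
  zfs snapshot tank/data@mon
  zfs send tank/data@mon | ssh backuphost zfs receive backup/data
  zfs snapshot tank/data@tue
  # -i sends only the delta between @mon and @tue; -F keeps the target in sync
  zfs send -i @mon tank/data@tue | ssh backuphost zfs receive -F backup/data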
21. More Cool Things About ZFS
• Every zpool keeps a history of the commands that affected it and when the action was done. This can be accessed with the zpool history command
• ZFS has a robust quota system
• ZFS is NFS aware and sharing for datasets can be controlled with the sharenfs
property.
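For illustration (pool, dataset names and values hypothetical):

  # hypothetical names and values
  zpool history tank
  zfs set quota=50G tank/projects/alpha
  zfs set sharenfs=on tank/projects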
22. Preparing for Production Deployment
• Map out your performance versus data protection strategy
• Decide if you need to do any acceleration with ZIL and L2ARC
• Consider day 2 operations like pool expansion and hardware failure
• Look at your data and consider if compression & de-duplication will be of any use
• Look at application-specific optimizations, for example for databases like PostgreSQL and MySQL
• Measure twice, cut once! Remember that some ZFS settings and components are immutable and some operations are not reversible.
23. DOs and DON’Ts
DOs:
• Use ECC RAM (lots of it!)
• Use reliable IT-mode HBAs and storage controllers
• Monitor ARC & L2ARC cache hit rates
• Consider using a ZIL and L2ARC, especially with network file systems
• Disable atime unless absolutely needed, especially for SSDs
• Prefer 4K-native enterprise drives & SSDs
• Be very careful with de-duplication
• Use the right ashift value for your drives
• Scrub your pools periodically
• Look at SMART stats for drives
• Use GPT partitioning
• Turn on compression where needed
DON’Ts:
• Desktop RAM
• IR-mode RAID controllers
• Desktop-grade drives
• Filling up your pool
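A few of the DOs translate directly into commands. A sketch for FreeBSD (pool and device names hypothetical; smartctl comes from sysutils/smartmontools):

  # hypothetical names: scrub, check health, inspect SMART data
  zpool scrub tank
  zpool status tank
  smartctl -a /dev/ada0
  # prefer 4K alignment for newly created pools (FreeBSD sysctl)
  sysctl vfs.zfs.min_auto_ashift=12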
24. Example Applications and Tools for ZFS
iocage, a jail manager (FreeBSD)
chyves, a bhyve virtual machine manager (FreeBSD)
LXD, OS Container hypervisor (Ubuntu Server)
Docker, with ZFS as a storage backend (various Linux distros)
FreeNAS, a NAS implementation on top of FreeBSD
25. Emerging Trends and Final Thoughts
• Flash is winning the online storage game
• NVMe is the future on the hardware side
• Distributed, programmable and object storage technologies are the future
• There is room for ZFS, as it can offer the base layer or be part of a solution
• Open-source innovation is driving the future of storage
There is a lot of information and misinformation out there when it comes to storage and ZFS.
These slides are based on my personal experience designing, building and selling storage solutions.
I hope you find the information useful.
Data services are typically provided by traditional DAS, NAS and SAN technologies.
Multi-tenant cloud is unique because of the mixture of applications and organizations.
Archival storage is needed for business continuity and regulatory requirements. The scope of requirements is typically set by the regulation or the business, e.g. these records need to be kept for 7 years, those for 100 years, and so on.
Specialized applications have low-latency or high-throughput requirements, e.g. high-frequency trading. Vendor examples here are Fusion-io and IBM FlashSystem.
Tier 2 is getting squeezed out by Tier 1 and Tier 3 technologies, especially as prices drop for flash, disk and compute.
Vendor stability and talent risk are typically looked at as well. This is important when evaluating a solution beyond its technical merits.
For FreeBSD, the announcement email was sent to the freebsd-current mailing list by Pawel Jakub Dawidek on April 6, 2007. The work was supported by the FreeBSD Foundation, wheel.pl and Sentex.net.
https://lists.freebsd.org/pipermail/freebsd-current/2007-April/070544.html
Copy on write means that the original block in a write operation is never overwritten. Writes are redirected to a new empty block, and once the write completes the pointers are updated. The ramification of this is that your file system is always consistent. The caveat with copy on write is that fragmentation will increase, though in practice this is manageable.
Capacity: source: https://en.wikipedia.org/wiki/ZFS
ZFS is a 128-bit file system, so it can address 1.84 × 10^19 times more data than 64-bit systems such as Btrfs. The maximum limits of ZFS are designed to be so large that they should never be encountered in practice. For instance, fully populating a single zpool with 2^128 bits of data would require on the order of 10^24 3 TB hard disk drives.
Some theoretical limits in ZFS are:
• 2^48: number of entries in any individual directory
• 16 exbibytes (2^64 bytes): maximum size of a single file
• 16 exbibytes: maximum size of any attribute
• 256 quadrillion zebibytes (2^128 bytes): maximum size of any zpool
• 2^56: number of attributes of a file (actually constrained to 2^48 for the number of files in a directory)
• 2^64: number of devices in any zpool
• 2^64: number of zpools in a system
• 2^64: number of file systems in a zpool
Depending on the application, L2ARC and ZIL are optional and may not be needed.
ZIL and L2ARC provide a separation between performance and the underlying hardware.
Note the separation between data layout and how the data is stored
Source: https://www.freebsd.org/doc/handbook/zfs-term.html
A pool is made up of one or more vdevs, which themselves can be a single disk or a group of disks, in the case of a RAID transform. When multiple vdevs are used, ZFS spreads data across the vdevs to increase performance and maximize usable space.
Disk - The most basic type of vdev is a standard block device. This can be an entire disk (such as /dev/ada0 or /dev/da0) or a partition (/dev/ada0p3). On FreeBSD, there is no performance penalty for using a partition rather than the entire disk. This differs from recommendations made by the Solaris documentation.
File - In addition to disks, ZFS pools can be backed by regular files; this is especially useful for testing and experimentation. Use the full path to the file as the device path in zpool create. All vdevs must be at least 128 MB in size.
Mirror - When creating a mirror, specify the mirror keyword followed by the list of member devices for the mirror. A mirror consists of two or more devices; all data will be written to all member devices. A mirror vdev will only hold as much data as its smallest member. A mirror vdev can withstand the failure of all but one of its members without losing any data.
Note: A regular single disk vdev can be upgraded to a mirror vdev at any time with zpool attach.
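For instance (device names hypothetical), attaching a second disk turns a single-disk vdev into a mirror:

  # hypothetical names: mirror existing ada0 onto new ada1
  zpool attach tank ada0 ada1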
RAID-Z - ZFS implements RAID-Z, a variation on standard RAID-5 that offers better distribution of parity and eliminates the “RAID-5 write hole” in which the data and parity information become inconsistent after an unexpected restart. ZFS supports three levels of RAID-Z which provide varying levels of redundancy in exchange for decreasing levels of usable storage. The types are named RAID-Z1 through RAID-Z3 based on the number of parity devices in the array and the number of disks which can fail while the pool remains operational.
In a RAID-Z1 configuration with four disks, each 1 TB, usable storage is 3 TB and the pool will still be able to operate in degraded mode with one faulted disk. If an additional disk goes offline before the faulted disk is replaced and resilvered, all data in the pool can be lost.
In a RAID-Z3 configuration with eight disks of 1 TB, the volume will provide 5 TB of usable space and still be able to operate with three faulted disks. Sun™ recommends no more than nine disks in a single vdev. If the configuration has more disks, it is recommended to divide them into separate vdevs and the pool data will be striped across them.
A configuration of two RAID-Z2 vdevs consisting of 8 disks each would create something similar to a RAID-60 array. A RAID-Z group's storage capacity is approximately the size of the smallest disk multiplied by the number of non-parity disks. Four 1 TB disks in RAID-Z1 has an effective size of approximately 3 TB, and an array of eight 1 TB disks in RAID-Z3 will yield 5 TB of usable space.
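A sketch of that RAID-60-like layout, assuming sixteen hypothetical da(4) devices:

  # hypothetical names: two RAID-Z2 vdevs of eight disks each
  zpool create tank \
      raidz2 da0 da1 da2 da3 da4 da5 da6 da7 \
      raidz2 da8 da9 da10 da11 da12 da13 da14 da15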
Spare - ZFS has a special pseudo-vdev type for keeping track of available hot spares. Note that installed hot spares are not deployed automatically; they must manually be configured to replace the failed device using zpool replace.
Log - ZFS Log Devices, also known as ZFS Intent Log (ZIL) move the intent log from the regular pool devices to a dedicated device, typically an SSD. Having a dedicated log device can significantly improve the performance of applications with a high volume of synchronous writes, especially databases. Log devices can be mirrored, but RAID-Z is not supported. If multiple log devices are used, writes will be load balanced across them.
Cache - Adding a cache vdev to a pool will add the storage of the cache to the L2ARC. Cache devices cannot be mirrored. Since a cache device only stores additional copies of existing data, there is no risk of data loss.
Coolest use is boot environments (IMHO)
This allows you to determine if the target system is an online replica or a backup target.
This can all be automated via cron and there are tools that make this easier.
Note that the sharesmb property has no effect on FreeBSD.
For optimizations, please note that things should work out of the box, but some applications can benefit from extra tuning; for example, if the application already compresses its data there may be no need to ask ZFS to compress the dataset.
Immutable: for example, vdev settings like ashift (visible with zdb).
One-way operations: for example, upgrading a pool to a new version of ZFS.
These are mostly for production.
Normally you want to add more VDEVs or delete unneeded files when the pool reaches around 80% capacity.
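For example (pool and device names hypothetical), check the CAP column and grow the pool by adding a vdev:

  # hypothetical names: check capacity, then expand the pool
  zpool list tank
  zpool add tank mirror da10 da11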
There are many more, like zrep, zfstools and zfsstats; all are applications that make it easy to get stats from ZFS and to manage replication. Your mileage will vary depending on the tool and what you are trying to do.
My recommendation is to not boil the ocean. Stick to basics and add tools when absolutely needed to automate things you understand.