One of the main design principles of ZFS is merging the management of physical volumes with individual filesystems. Instead of relying on an underlying volume manager, ZFS manages disks directly and aggregates them into pools from which individual filesystems are allocated. Storage servers using ZFS typically configure two pools: one pool onto which the system’s root filesystem is installed, and a second for the data to be managed by that system.
At Joyent we’ve taken a different approach and discarded the root pool in favor of a single system-wide pool. Not only does this approach free up an additional two drives to be used for main storage, it also provides us flexibility in upgrading system software, higher customer multitenancy, and ease of deploying new machines. In this talk, I’ll describe our overall architecture, talk about challenges we faced in constructing such an architecture, and characterize our experiences having deployed this model in production over the last 18 months.
Running without a ZFS system pool
1. Running ZFS without
a system pool
Bill Pijewski
Software Engineer, Joyent
@pijewski
Tuesday, October 2, 2012
2. Agenda
• Why ZFS is important to Joyent
• Evolution of USB and PXE boot architectures
• Running with no system pool
3. ZFS at Joyent
• We run a production cloud with many servers in
datacenters worldwide
• Two kinds of zones (covered in detail in other talks):
• Zones: sparse zones share libraries with the
platform
• VMs: fully virtualized GNU/Linux, Windows,
FreeBSD, etc. machines
• Use small number of NFS machines to provide
additional storage capacity in each datacenter
4. ZFS for Zones and VMs
• Zones are allocated two ZFS datasets
• One dataset for data in that zone
• Another for core files -- to prevent cores from
exceeding quota
• VMs have a ZFS volume into which the VM image is
installed, plus one or more additional volumes
presented to guest as disks
• Guest filesystems are installed into volumes
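The per-zone and per-VM layout above can be sketched as the `zfs create` commands that would produce it. This is only an illustration: the pool name `zones`, the `cores/` parent dataset, the `-diskN` volume naming, and all sizes are assumptions, not details taken from the slides. The functions only build argument lists and do not run anything.

```python
def zone_dataset_cmds(pool, zone, data_quota, core_quota):
    """Build `zfs create` commands for a zone's two datasets.

    The core-file dataset gets its own quota, so a core dump cannot
    eat into the quota of the zone's data dataset.
    Dataset names here are hypothetical.
    """
    return [
        ["zfs", "create", "-o", f"quota={data_quota}", f"{pool}/{zone}"],
        # -p creates the (assumed) parent dataset pool/cores if missing.
        ["zfs", "create", "-p", "-o", f"quota={core_quota}",
         f"{pool}/cores/{zone}"],
    ]


def vm_volume_cmds(pool, vm, disk_sizes):
    """Build `zfs create -V` commands: one ZFS volume per guest disk."""
    return [
        ["zfs", "create", "-V", size, f"{pool}/{vm}-disk{i}"]
        for i, size in enumerate(disk_sizes)
    ]


if __name__ == "__main__":
    for cmd in zone_dataset_cmds("zones", "web01", "20G", "5G"):
        print(" ".join(cmd))
    for cmd in vm_volume_cmds("zones", "vm01", ["10G", "50G"]):
        print(" ".join(cmd))
```

Keeping the commands as lists makes them easy to hand to `subprocess.run` on a host that actually has ZFS.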
5. ZFS in different contexts
• For Joyent, two main contexts: SmartOS and SDC
• SmartOS: community distribution, illumos +
lightweight virtualization tools
• SmartDataCenter (SDC): SmartOS + full cloud
management and orchestration stack
6. Important ZFS features
• As with all ZFS users, we rely on end-to-end data
integrity
• Copy-on-write architecture: snapshots, clones
• Compression
• Space management tools: quotas and reservations
• Replication to move customers around between
different machines
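Replication between machines is typically built from a snapshot plus `zfs send` piped into `zfs receive` on the target. A minimal sketch of that pipeline follows; the dataset, snapshot, and host names are hypothetical, and the slides do not say how Joyent's own tooling drives this.

```python
import shlex


def replication_plan(dataset, snap_name, target_host, target_dataset):
    """Return (snapshot_cmd, pipeline) for a full send of one dataset.

    snapshot_cmd is an argv list; pipeline is a shell string that
    streams the snapshot to `zfs receive` on the target over ssh.
    """
    snapshot = f"{dataset}@{snap_name}"
    snapshot_cmd = ["zfs", "snapshot", snapshot]
    send = ["zfs", "send", snapshot]
    recv = ["ssh", target_host, "zfs", "receive", target_dataset]
    pipeline = f"{shlex.join(send)} | {shlex.join(recv)}"
    return snapshot_cmd, pipeline


if __name__ == "__main__":
    snap_cmd, pipe = replication_plan(
        "zones/web01", "migrate", "cn11", "zones/web01")
    print(shlex.join(snap_cmd))
    print(pipe)
```

Because snapshots are copy-on-write, the source zone can keep running while the stream is sent.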
7. Delegated administration!
• In our next SDC release, we enable delegated
administration
• Allows customers to:
• Take snapshots outside of Joyent’s API
• Create child datasets
• Snapshot and clone datasets
• Replicate or migrate data between instances
• Open work: basic limits on delegated activity to
avoid DoS
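One ZFS mechanism for granting this kind of control is `zfs allow`, which delegates named permissions on a dataset to a user. The slides do not say whether SDC uses `zfs allow` or delegates a whole dataset to the zone, so treat the sketch below as an illustration of the idea only; the user and dataset names are hypothetical.

```python
# Permissions matching the customer abilities listed above.
# `clone` also needs `create` and `mount` to be usable.
DEFAULT_PERMS = ("snapshot", "clone", "create", "mount", "send")


def delegate_cmd(user, dataset, permissions=DEFAULT_PERMS):
    """Build a `zfs allow` command granting permissions to one user."""
    if not permissions:
        raise ValueError("at least one permission is required")
    return ["zfs", "allow", "-u", user, ",".join(permissions), dataset]


if __name__ == "__main__":
    print(" ".join(delegate_cmd("alice", "zones/web01/data")))
```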
8. ZFS Performance
• SSDs for ZIL
• ARC
• We hold back some portion of a server’s total
memory, knowing that a good portion of this
memory will be consumed by the ARC
• Committing memory achieves greater I/O
performance
• ZFS I/O throttle for QoS controls
• For more information, check out Brendan Gregg’s
excellent talk next door
9. Read-only system pool
• At Fishworks, we decided to have a read-only
system pool
• Necessary for OS install as well as analytics data
• Simplified some things:
• No unnecessary customizations from customers
• Discouraged hot patching
• But it had disadvantages:
• Upgrade, rollback, and factory reset were tricky
10. SmartOS USB Boot
• Instead of installing OS to root disks, SmartOS boots
from a USB key
• Entire kernel and userland fit in about 200 MB
(compressed)
• Other software can be installed from pkgsrc
• Single ZFS pool for all zones
11. USB Boot Advantages
• All disks are available for zone/VM storage, thereby
increasing both performance and capacity
• Encourages users to provision a zone for each
application rather than using the global zone
• Discourages customization and one-off patching
• Fast to get up and running
• Easy to “bring your OS with you”
12. SmartDataCenter (SDC) Architecture
• Two kinds of servers: head nodes and compute
nodes
• Head nodes run management, provisioning,
monitoring, and boot services
• Compute nodes contain customer zones
• Head nodes are similar to SmartOS installs
• Each compute node PXE boots its platform from the
head node
• Both head nodes and compute nodes have a single
ZFS pool
13. SDC Diagram
DC 0          DC 1          DC 2
Headnode      Headnode      Headnode
PXE           PXE           PXE
CN 0          CN 10         CN 20
CN 1          CN 11         CN 21
CN 2          CN 12         CN 22
...           ...           ...
14. PXE Boot Advantages
• Ben Rockwood, 10/1/2012:
“Apparently other people spend time installing
software. I think that's stupid.”
• As with SmartOS, operators are encouraged to put
applications in zones instead of the global zone
• Upgrade = rollback = reboot, nothing more
• Newer platforms can be staged and machines
rebooted later
• Any machine which hits a known fixed problem will
automatically boot onto fresh platform
15. Storage pools!
• Most OSes assume the existence of a “system” pool
-- a pool onto which the OS, applications, and
configuration information are installed
• Joyent moving away from single-vdev pools backed
by hardware RAID
• Embracing hybrid storage pool (HSP) using an SSD
for the ZFS intent log (ZIL)
• Everything else worked on RAID-Z pools except for
saving a crash dump
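A hybrid storage pool of this shape, RAID-Z data disks plus a separate SSD log device, can be created with a single `zpool create`. The sketch below only builds that command line; the pool name and device names are hypothetical, and the actual layouts Joyent deploys are not given in the slides.

```python
VALID_LAYOUTS = ("mirror", "raidz", "raidz2", "raidz3")


def hybrid_pool_cmd(pool, data_disks, log_device, layout="raidz"):
    """Build a `zpool create` command for one data vdev plus an SSD log.

    The `log` keyword places the ZFS intent log (ZIL) on its own device.
    """
    if layout not in VALID_LAYOUTS:
        raise ValueError(f"unsupported layout: {layout}")
    if len(data_disks) < 2:
        raise ValueError("need at least two data disks")
    return ["zpool", "create", pool, layout, *data_disks, "log", log_device]


if __name__ == "__main__":
    cmd = hybrid_pool_cmd(
        "zones", ["c0t0d0", "c0t1d0", "c0t2d0"], "c0t3d0")
    print(" ".join(cmd))
```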
16. RAID-Z Crash Dump
• Problem: have only one RAID-Z or mirrored pool but
cannot save crash dump on said pool
• Implement crash dumps on RAID-Z (majority of
work) and pools with multiple vdevs
• Not necessary to save parity bits for crash dump
data:
• Crash dump is immediately saved upon reboot
• Needs to be reliable, simple, and (hopefully) fast
17. Why no parity bits?
• Since DVAs on the dump device are preallocated,
use those 128K blocks for each write
• Most calls into dump entry point are not block
aligned
• Rather than write variable size, use original 128K
• I first calculated parity bits, but my test machine
took three hours to save a crash dump
• So no parity is calculated -- on a pool with n vdevs,
each write could otherwise require n-1 (synchronous) reads
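To see why parity makes small dump writes expensive, recall that single-parity RAID-Z computes parity as the XOR of the data columns in a stripe. A toy model of that calculation follows; it is not the ZFS dump code, and the column contents are made up. The point is that parity depends on every column, so a write that supplies only one column needs the others read back first.

```python
from functools import reduce


def xor_parity(columns):
    """XOR equal-length byte strings together, byte by byte."""
    return bytes(
        reduce(lambda a, b: a ^ b, group) for group in zip(*columns)
    )


def reconstruct(surviving_columns, parity):
    """Recover one missing data column from the rest plus parity."""
    return xor_parity([*surviving_columns, parity])


if __name__ == "__main__":
    stripe = [b"\x01\x02", b"\x10\x20", b"\xff\x00"]  # three data columns
    parity = xor_parity(stripe)                       # needs ALL columns
    lost = reconstruct([stripe[0], stripe[2]], parity)
    print(parity.hex(), lost == stripe[1])
```

Skipping parity for dump data trades that redundancy for speed, which is acceptable because the dump is saved immediately on reboot.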
18. Other system components
• Swap device (thankfully) supports RAID-Z pools
• /var, /opt have their own datasets
• /etc not persistent
• /root also not persistent, again incentivizing people
to configure applications in zones rather than using
the GZ
19. Summary
• The single ZFS pool has simplified Joyent’s
deployment
• Delegated administration has given customers more
power
• ZFS has been and will continue to be a crucial
component of our architecture for many years
23. ZFS 101
• ZFS is a copy-on-write filesystem from Sun originally
shipped with Solaris 10
• Many innovative features: data compression,
snapshot/rollback, ZFS send/receive, SSD
integration
• Enterprise-grade reliability and data integrity
• Two main components relevant here:
• ZFS pools
• ZFS datasets
24. ZFS Pools
• Aggregate disks into a single storage pool from
which “datasets” are allocated
• No parted/LVM needed
• Mix both spinning disks and SSDs:
• L2ARC: extends filesystem buffer cache
• ZIL: absorbs synchronous write activity
25. ZFS Datasets
• Datasets are a tree of blocks within the storage pool,
presented as either:
• A filesystem (file interface)
• A volume (block interface)
• Datasets can be flexibly resized, and volumes can
even be thinly provisioned
• Administrative controls on datasets
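Thin provisioning and resizing map onto two small pieces of the `zfs` command line: `zfs create -s -V` creates a sparse volume with no space reservation, and setting `volsize` resizes an existing volume. A sketch that builds those commands, with hypothetical volume names and sizes:

```python
def volume_cmd(name, size, thin=False):
    """Build a `zfs create -V` command; -s makes the volume sparse."""
    cmd = ["zfs", "create"]
    if thin:
        cmd.append("-s")  # no reservation: space is allocated on write
    return cmd + ["-V", size, name]


def resize_volume_cmd(name, new_size):
    """Build the command that changes a volume's logical size."""
    return ["zfs", "set", f"volsize={new_size}", name]


if __name__ == "__main__":
    print(" ".join(volume_cmd("zones/vm01-disk0", "100G", thin=True)))
    print(" ".join(resize_volume_cmd("zones/vm01-disk0", "200G")))
```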
26. Zones and VMs
• A zone is a lightweight software-virtualized container
• Uses the system’s OS platform
• Allocated its own ZFS filesystem (more in a sec)
• A VM is a hardware-virtualized container for GNU/
Linux, Windows, BSD, etc.
• Uses its own ZFS volume
• VM’s filesystem installed into ZFS volume
• Both kinds of machine have resource controls for CPU,
memory, and disk I/O
27. Advantages of ZFS
• Snapshots: zone/VM backup and recovery
• Space management: reservations and quotas flexibly
allocate space between zones
• Delegated administration: each tenant can
administer their own dataset:
• Set compression level and other properties
• Take snapshots of application data
• Generate send streams for replication/backup
28. Advantages of ZFS (2)
• Data integrity: verifies data of VM guest filesystems
(ext4, XFS, NTFS, etc.)
• Multiple storage configurations available: mirrored,
RAID-Z2, and others
• System fully supported on any storage
configurations, can even take a crash dump to a
RAID-Z pool