ZFS Nuts and Bolts
          Eric Sproul
  OmniTI Computer Consulting
Quick Overview
•   More than just another filesystem: it’s a filesystem,
    a volume manager, and a RAID controller all in one

•   Production debut in Solaris 10 6/06

•   1 ZB = 1 billion TB

•   128-bit

•   2^64 snapshots, 2^48 files/directory,
    2^64 bytes/filesystem, 2^78 bytes/pool,
    2^64 devices/pool, 2^64 pools/system
Old & Busted
Traditional storage stack:
  filesystem(upper): filename to object (inode)
  filesystem(lower): object to volume LBA
  volume manager: volume LBA to array LBA
  RAID controller: array LBA to disk LBA

• Strict separation between layers
• Each layer often comes from separate vendors
• Complex, difficult to administer, hard to predict
 performance of a particular combination
New Hotness
•   Telescoped stack:
        ZPL: filename to object
        DMU: object to DVA
        SPA: DVA to disk LBA
•   Terms:

    •   ZPL: ZFS POSIX layer (standard syscall interface)

    •   DMU: Data Management Unit (transactional object store)

    •   DVA: Data Virtual Address (vdev + offset)

    •   SPA: Storage Pool Allocator (block allocation, data
        transformation)
New Hotness

•   No more separate tools to manage filesystems vs.
    volumes vs. RAID arrays
    •   2 commands: zpool(1M), zfs(1M) (RFE exists to combine these)

•   Pooled storage means never getting stuck with too
    much or too little space in your filesystems

•   Can expose block devices as well; “zvol” blocks
    map directly to DVAs
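A minimal sketch of that workflow, using hypothetical pool, disk, and dataset
names: one pool, one filesystem, and one 10 GB zvol, all from the same two
commands.

     # zpool create tank mirror c1t0d0 c1t1d0
     # zfs create tank/home
     # zfs create -V 10G tank/vol1
     # zfs list

The zvol appears as a block device under /dev/zvol/dsk/, ready to be used raw
or shared over iSCSI.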
ZFS Advantages
•   Fast
    •   copy-on-write, pipelined I/O, dynamic striping,
        variable block size, intelligent resilvering


•   Simple management

•   End-to-end data integrity, self-healing
    •   Checksum everything, all the time

•   Built-in goodies
    •   block transforms

    •   snapshots

    •   NFS, CIFS, iSCSI sharing

    •   Platform-neutral on-disk format
Getting Down to Brass Tacks



 How does ZFS achieve these feats?
ZFS I/O Life Cycle
                       Writes
1. Translated to object transactions by the ZPL:
   “Make these 5 changes to these 2 objects.”
2. Transactions bundled in DMU into transaction
   groups (TXGs) that flush when full (1/8 of system
   memory) or at regular intervals (30 seconds)
3. Blocks making up a TXG are transformed (if
   necessary), scheduled and then issued to physical
   media in the SPA
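One way to watch those TXG flushes land on disk (pool name hypothetical): run
an interval-based iostat against the pool and look for the periodic bursts of
write activity at each flush.

     # zpool iostat tank 5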
ZFS I/O Life Cycle
                 Synchronous Writes
•   ZFS maintains a per-filesystem log called the ZFS
    Intent Log (ZIL). Each transaction gets a log
    sequence number.
•   When a synchronous command, such as fsync(), is
    issued, the ZIL commits blocks up to the current
    sequence number. This is a blocking operation.
•   The ZIL commits all necessary operations and
    flushes any write caches that may be enabled,
    ensuring that all bits have been committed to stable
    storage.
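Not on the slide, but closely related: the ZIL can be directed to a dedicated
low-latency log device (a "slog") so synchronous commits avoid the main pool
disks. A hypothetical example, with a made-up device name:

     # zpool add tank log c5t0d0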
ZFS I/O Life Cycle
                          Reads
•   ZFS makes heavy use of caching and prefetching
•   If requested blocks are not cached, issue a
    prioritized I/O that “cuts the line” ahead of pending
    writes
•   Writes are intelligently throttled to maintain
    acceptable read performance
•   ARC (Adaptive Replacement Cache) tracks recently
    and frequently used blocks in main memory
•   L2 ARC uses durable storage to extend the ARC
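A hypothetical example of adding an L2ARC device (typically an SSD) and then
checking per-vdev activity; the device and pool names are illustrative:

     # zpool add tank cache c6t0d0
     # zpool iostat -v tank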
Speed Is Life
•   Copy-on-write design means random writes can
    be made sequential

•   Pipelined I/O extracts maximum parallelism with
    out-of-order issue, sorting and aggregation

•   Dynamic striping across all underlying devices
    eliminates hot-spots

•   Variable block size = no wasted space or effort

•   Intelligent resilvering copies only live data, can do
    partial rebuild for transient outages
Copy-On-Write




   Initial block tree
Copy-On-Write




New blocks represent changes
 Never modifies existing data
Copy-On-Write




 Indirect blocks also change
Copy-On-Write




Atomically update uberblock to point at updated blocks
          The uberblock is special in that it does get overwritten, but 4
          copies are stored as part of the vdev label and are updated in
          transactional pairs. Therefore, integrity on disk is maintained.
Pipelined I/O
       Reorders writes to be as sequential as possible

[Diagram: interleaved blocks written by App #1 and App #2, laid out on the
platter in arrival order]

If left in original order, we waste a lot of time waiting for head and platter
positioning: move head, spin wait, move head, move head, move head.
Pipelined I/O
       Reorders writes to be as sequential as possible

[Diagram: the same blocks, issued in an optimized order]

Pipelining lets us examine writes as a group and optimize the order: just two
head moves instead of five seeks plus a rotational wait.
Dynamic Striping
• Load distribution across top-level vdevs
• Factors determining block allocation
  include:
  •   Capacity

  •   Latency & bandwidth

  •   Device health
Dynamic Striping

Two-mirror pool: writes are striped across both mirrors; reads occur wherever
the data was written.

     # zpool create tank mirror c1t0d0 c1t1d0 mirror c2t0d0 c2t1d0

After adding a third mirror: new data is striped across all three mirrors, but
existing data is not migrated. Copy-on-write reallocates data over time,
gradually spreading it across all three mirrors.
(An RFE exists for "on-demand" resilvering to explicitly re-balance.)

     # zpool add tank mirror c3t0d0 c3t1d0
Variable Block Size
•   No single value works well with all types of files
    •   Large blocks increase bandwidth but reduce metadata and can lead to
        wasted space

    •   Small blocks save space for smaller files, but increase I/O operations on
        larger ones

    •   Record-based files such as those used by databases have a fixed block
        size that must be matched by the filesystem to avoid extra overhead
        (blocks too small) or read-modify-write (blocks too large)
Variable Block Size
•   The DMU operates on units of a fixed record size;
    default is 128KB

•   Files that are less than the record size are written as
    a single filesystem block (FSB) of variable size in
    multiples of disk sectors (512B)

•   Files that are larger than the record size are stored
    in multiple FSBs equal to record size

•   DMU records are assembled into transaction groups
    and committed atomically
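For a database with a fixed page size, the record size can be matched per
dataset. A sketch with hypothetical names; note that recordsize only affects
files written after the property is set:

     # zfs create tank/db
     # zfs set recordsize=8K tank/db
     # zfs get recordsize tank/db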
Variable Block Size

•   FSBs are the basic unit of ZFS datasets and the
    unit at which checksums are maintained

•   Handled by the SPA, which can optionally transform
    them (compression, ditto blocks today; encryption,
    de-dupe in the future)

•   Compression improves I/O performance, as fewer
    operations are needed on the underlying disk
Intelligent Resilver
•   a.k.a. rebuild, resync, reconstruct

•   Traditional resilvering is basically a whole-disk copy
    in the mirror case; RAID-5 does XOR of the other
    disks to rebuild

    •   No priority given to more important blocks
        (top of the tree)

    •   If you’ve copied 99% of the blocks, but the last
        1% contains the top few blocks in the tree,
        another failure ruins everything
Intelligent Resilver
•   The ZFS way is metadata-driven

•   Live blocks only: just walk the block tree;
    unallocated blocks are ignored

•   Top-down: Start with the most important blocks.
    Every block copied increases the amount of
    discoverable data.

•   Transactional pruning: If the failure is transient,
    repair by identifying the missed TXGs. Resilver
    time is only slightly longer than the outage time.
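What this looks like administratively, with hypothetical device names: replace
a failed disk and watch the resilver walk the block tree in the status output.

     # zpool replace tank c2t0d0 c2t3d0
     # zpool status tank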
Keep It Simple
•   Unified management model: pools and datasets

•   Datasets are just a group of tagged bits with
    certain attributes: filesystems, volumes, snapshots,
    clones

•   Properties can be set while the dataset is active

•   Hierarchical arrangement: children inherit
    properties of parent

•   Datasets become administration points: give
    every user or application their own filesystem
Keep It Simple

•   Datasets only occupy as much space as they need

•   Compression, quotas and reservations are built-in
    properties

•   Pools may be grown dynamically without service
    interruption
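A small sketch of the hierarchy and built-in properties, using hypothetical
dataset names: the child inherits compression from its parent, the quota caps
its space, and the reservation guarantees it.

     # zfs create tank/home
     # zfs set compression=on tank/home
     # zfs create tank/home/alice
     # zfs set quota=10G tank/home/alice
     # zfs set reservation=1G tank/home/alice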
Data Integrity
• Not enough to be fast and simple; must be
  safe too
• Silent corruption is our mortal enemy
  •   Defects can occur anywhere: disks, firmware, cables, kernel drivers

  •   Main memory has ECC; why shouldn’t storage have something similar?


• Other types of corruption are also killers:
  •   Power outages, accidental overwrites, using a data disk as swap
Data Integrity
                     Traditional Method:
                    Disk Block Checksum

[Diagram: checksum stored alongside the data within the same disk block]

Only detects problems after data is successfully written ("bit rot")

  Won't catch silent corruption caused by issues in the I/O path
                      between disk and host
Data Integrity
                        The ZFS Way

[Diagram: a tree of block pointers; each pointer stores the checksum of the
block it points to, down to the data blocks at the leaves]

•   Store data checksum in parent block pointer

•   Isolates faults between checksum and data

•   Forms a hash tree, enabling validation of the entire pool

•   256-bit checksums

•   fletcher2 (default, simple and fast) or SHA-256 (slower, more secure)

•   Can be validated at any time with ‘zpool scrub’
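Kicking off an on-demand validation of every checksum in the pool (pool name
hypothetical); status reports progress and any errors found:

     # zpool scrub tank
     # zpool status -v tank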
Data Integrity

[Diagram sequence: an application reads through ZFS from a two-way mirror; one
side returns a block that fails its checksum, so ZFS reads the good copy from
the other side, returns that to the application, and rewrites the damaged copy.]

  Self-healing mirror!
Goodie Bag

• Block Transforms
• Snapshots & Clones
• Sharing (NFS, CIFS, iSCSI)
• Platform-neutral on-disk format
Block Transforms
•   Handled at SPA layer, transparent to upper layers
•   Available today:
    • Compression
        •   zfs set compression=on tank/myfs
        •   LZJB (default) or GZIP
        •   Multi-threaded as of snv_79

    •   Duplication, a.k.a. “ditto blocks”
        •   zfs set copies=N tank/myfs
        •   In addition to mirroring/RAID-Z: One logical block = up to 3
            physical blocks
        •   Metadata always has 2+ copies, even without ditto blocks
        •   Copies stored on different devices, or different places on same
            device

•   Future: de-duplication, encryption
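A few hypothetical examples of these transforms in practice; gzip trades CPU
for a better ratio than LZJB, and the compressratio property reports the
achieved savings:

     # zfs set compression=gzip tank/archive
     # zfs get compressratio tank/archive
     # zfs set copies=2 tank/important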
Snapshots & Clones
•   zfs snapshot tank/myfs@thursday

•   Based on block birth time, stored in block pointer

•   Nearly instantaneous (<1 sec) on idle system

•   Communicates structure, since it is based on
    object changes, not just a block delta

•   Occupies negligible space initially, and only grows
    as large as the block changeset

•   Clone is just a read/write snapshot
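A short sketch with hypothetical dataset names: take a snapshot, list it, clone
it into a writable filesystem, and roll the original back.

     # zfs snapshot tank/myfs@thursday
     # zfs list -t snapshot
     # zfs clone tank/myfs@thursday tank/myfs-thu
     # zfs rollback tank/myfs@thursday

The clone shares blocks with its origin snapshot, so the snapshot cannot be
destroyed while the clone exists (unless the clone is promoted with
‘zfs promote’).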
Sharing
•   NFSv4
    •   zfs set sharenfs=on tank/myfs
    •   Automatically updates /etc/dfs/sharetab


•   CIFS
    •   zfs set sharesmb=on tank/myfs
    •   Additional properties control the share name and workgroup
    •   Supports full NT ACLs and user mapping, not just POSIX uid


•   iSCSI
    •   zfs set shareiscsi=on tank/myvol
    •   Makes sharing block devices as easy as sharing filesystems
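A hypothetical end-to-end example for the iSCSI case: create a zvol, share it,
and list the resulting target with the Solaris iSCSI target admin tool.

     # zfs create -V 20G tank/myvol
     # zfs set shareiscsi=on tank/myvol
     # iscsitadm list target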
On-Disk Format
• Platform-neutral, adaptive endianness
  •   Writes always use native endianness, recorded in a bit in the block
      pointer

  •   Reads byteswap if necessary, based on comparison of host endianness to
      value of block pointer bit


• Migrate between x86 and SPARC
  •   No worries about device paths, fstab, mountpoints, it all just works

  •   ‘zpool export’ on old host, move disks, ‘zpool import’ on new host

  •   Also migrate between Solaris and non-Sun implementations, such as
      MacOS X and FreeBSD
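The migration itself is two commands, shown here with a hypothetical pool;
running ‘zpool import’ with no arguments first lists the pools visible on the
new host.

     (old host)  # zpool export tank
     (new host)  # zpool import
     (new host)  # zpool import tank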
Fin
Further reading:
ZFS Community:

http://opensolaris.org/os/community/zfs

ZFS Administration Guide:

http://docs.sun.com/app/docs/doc/819-5461

Jeff Bonwick’s blog:

http://blogs.sun.com/bonwick/en_US/category/ZFS

ZFS-related blog entries:

http://blogs.sun.com/main/tags/zfs
