Ceph: scaling storage for the cloud and beyond

Sage Weil
Inktank

2012 Storage Developer Conference. © Inktank. All Rights Reserved.

outline
●  why you should care
●  what is it, what it does
●  distributed object storage
●  ceph fs
●  who we are, why we do this

why should you care about another storage system?

requirements
●  diverse storage needs
   –  object storage
   –  block devices (for VMs) with snapshots, cloning
   –  shared file system with POSIX, coherent caches
   –  structured data... files, block devices, or objects?
●  scale
   –  terabytes, petabytes, exabytes
   –  heterogeneous hardware
   –  reliability and fault tolerance

time
●  ease of administration
●  no manual data migration, load balancing
●  painless scaling
   –  expansion and contraction
   –  seamless migration

cost
●  linear function of size or performance
●  incremental expansion
   –  no fork-lift upgrades
●  no vendor lock-in
   –  choice of hardware
   –  choice of software
●  open

what is ceph?

unified storage system
●  objects
   –  native
   –  RESTful
●  block
   –  thin provisioning, snapshots, cloning
●  file
   –  strong consistency, snapshots

[architecture diagram: APPs use LIBRADOS or RADOSGW; HOST/VMs use RBD; CLIENTs use CEPH FS; all sit on RADOS]

●  LIBRADOS: a library allowing apps to directly access RADOS, with support for C, C++, Java, Python, Ruby, and PHP
●  RADOSGW: a bucket-based REST gateway, compatible with S3 and Swift
●  RBD: a reliable and fully-distributed block device, with a Linux kernel client and QEMU/KVM driver
●  CEPH FS: a POSIX-compliant distributed file system, with a Linux kernel client and support for FUSE
●  RADOS: a reliable, autonomous, distributed object store comprised of self-healing, self-managing, intelligent storage nodes

open source
●  LGPLv2
   –  copyleft
   –  ok to link to proprietary code
●  no copyright assignment
   –  no dual licensing
   –  no “enterprise-only” feature set
●  active community
●  commercial support

distributed storage system
●  data center scale
   –  10s to 10,000s of machines
   –  terabytes to exabytes
●  fault tolerant
   –  no single point of failure
   –  commodity hardware
●  self-managing, self-healing

ceph object model
●  pools
   –  1s to 100s
   –  independent namespaces or object collections
   –  replication level, placement policy
●  objects
   –  bazillions
   –  blob of data (bytes to gigabytes)
   –  attributes (e.g., “version=12”; bytes to kilobytes)
   –  key/value bundle (bytes to gigabytes)

why start with objects?
●  more useful than (disk) blocks
   –  names in a single flat namespace
   –  variable size
   –  simple API with rich semantics
●  more scalable than files
   –  no hard-to-distribute hierarchy
   –  update semantics do not span objects
   –  workload is trivially parallel

[diagram: one HUMAN, one COMPUTER, many DISKs]

[diagram: several HUMANs sharing one COMPUTER and its many DISKs]

[diagram: a crowd of HUMANs all funneled through a single (COMPUTER) to the DISKs; “(actually more like this…)”]

[diagram: HUMANs spread across many COMPUTERs, each with its own DISK]

[diagram: a row of OSDs, each running on a file system (btrfs, xfs, or ext4) atop a DISK, with monitors (M) alongside]

Monitors:
•  Maintain cluster membership and state
•  Provide consensus for distributed decision-making via Paxos
•  Small, odd number
•  These do not serve stored objects to clients

Object Storage Daemons (OSDs):
•  At least three in a cluster
•  One per disk or RAID group
•  Serve stored objects to clients
•  Intelligently peer to perform replication tasks

[diagram: a HUMAN and the monitor cluster (M, M, M)]

data distribution
●  all objects are replicated N times
●  objects are automatically placed, balanced, migrated in a dynamic cluster
●  must consider physical infrastructure
   –  ceph-osds on hosts in racks in rows in data centers
●  three approaches
   –  pick a spot; remember where you put it
   –  pick a spot; write down where you put it
   –  calculate where to put it, where to find it

CRUSH
•  Pseudo-random placement algorithm
•  Fast calculation, no lookup
•  Repeatable, deterministic
•  Ensures even distribution
•  Stable mapping
   •  Limited data migration
•  Rule-based configuration
   •  specifiable replication
   •  infrastructure topology aware
   •  allows weighting

[diagram: objects are hashed into placement groups via hash(object name) % num pg; CRUSH(pg, cluster state, policy) then maps each placement group to a set of OSDs]

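A toy sketch of this two-step mapping in Python follows. It is not the real CRUSH algorithm, just its shape: the first step is a stable per-object hash, and the second depends only on the pg id and the cluster map, so any client holding the map computes the same answer with no lookup table. The OSD names and pool parameters here are invented.

    import hashlib

    NUM_PG = 64                                            # placement groups in the pool (toy)
    OSDS = ['osd.0', 'osd.1', 'osd.2', 'osd.3', 'osd.4']   # stand-in cluster map
    REPLICAS = 3

    def object_to_pg(name):
        # hash(object name) % num pg
        return int.from_bytes(hashlib.md5(name.encode()).digest()[:4], 'big') % NUM_PG

    def pg_to_osds(pg):
        # stand-in for CRUSH(pg, cluster state, policy): deterministic and
        # repeatable, computable by anyone holding the cluster map
        ranked = sorted(OSDS, key=lambda o: hashlib.md5(f'{pg}:{o}'.encode()).digest())
        return ranked[:REPLICAS]

    pg = object_to_pg('myobject')
    print(pg, pg_to_osds(pg))   # every client computes the same placement
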
RADOS
●  monitors publish the osd map that describes cluster state
   –  ceph-osd node status (up/down, weight, IP)
   –  CRUSH function specifying desired data distribution
●  object storage daemons (OSDs)
   –  safely replicate and store objects
   –  migrate data as the cluster changes over time
   –  coordinate based on a shared view of reality – gossip!
●  decentralized, distributed approach allows
   –  massive scales (10,000s of servers or more)
   –  the illusion of a single copy with consistent behavior

[diagram: a CLIENT wondering where in the cluster an object lives]

[architecture diagram repeated, with LIBRADOS highlighted]

[diagram: an APP links LIBRADOS and speaks the native protocol directly to the monitors and OSDs]

LIBRADOS
•  Provides direct access to RADOS for applications
•  C, C++, Python, PHP, Java
•  No HTTP overhead

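As a minimal sketch of what direct access looks like from Python, assuming a running cluster, a readable /etc/ceph/ceph.conf, and the python-rados bindings (the pool name 'data' and the object contents are invented):

    import rados

    # connect using the usual config file and keyring (paths are assumptions)
    cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
    cluster.connect()

    ioctx = cluster.open_ioctx('data')              # 'data' is a hypothetical pool

    ioctx.write_full('greeting', b'hello world')    # blob of data
    ioctx.set_xattr('greeting', 'version', b'12')   # attribute
    print(ioctx.read('greeting'))                   # b'hello world'

    ioctx.close()
    cluster.shutdown()
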
atomic transactions
●  client operations are sent to the OSD cluster
   –  operate on a single object
   –  can contain a sequence of operations, e.g.
      ●  truncate object
      ●  write new object data
      ●  set attribute
●  atomicity
   –  all operations commit or do not commit atomically
●  conditional
   –  'guard' operations can control whether the operation is performed
      ●  verify xattr has a specific value
      ●  assert object is a specific version
   –  allows atomic compare-and-swap etc.

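A sketch of such a compound operation via python-rados, assuming a release new enough to expose WriteOp methods like truncate() and write_full(); everything composed on the op is applied by the OSD as a single transaction (cluster setup and names as in the earlier librados sketch):

    import rados

    cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
    cluster.connect()
    ioctx = cluster.open_ioctx('data')   # hypothetical pool

    with rados.WriteOpCtx() as op:
        op.truncate(0)                               # truncate object
        op.write_full(b'new object data')            # write new object data
        ioctx.set_omap(op, ('version',), (b'12',))   # update a key/value pair
        # guard ops live in the C API (rados_write_op_cmpxattr,
        # rados_write_op_assert_version); a failed guard aborts the whole op
        ioctx.operate_write_op(op, 'greeting')       # all of the above commit, or none

    ioctx.close()
    cluster.shutdown()
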
key/value storage
●  store key/value pairs in an object
   –  independent from object attrs or byte data payload
●  based on google's leveldb
   –  efficient random and range insert/query/removal
   –  based on BigTable SSTable design
●  exposed via key/value API
   –  insert, update, remove
   –  individual keys or ranges of keys
●  avoid read/modify/write cycle for updating complex objects
   –  e.g., file system directory objects

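In python-rados this key/value API surfaces as the omap calls; a sketch (object and key names are invented):

    import rados

    cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
    cluster.connect()
    ioctx = cluster.open_ioctx('data')   # hypothetical pool

    # insert/update a few keys in one object's key/value bundle
    with rados.WriteOpCtx() as op:
        ioctx.set_omap(op, ('alice', 'bob'), (b'inode 100', b'inode 101'))
        ioctx.operate_write_op(op, 'dir.home')

    # range query: iterate keys without touching the object's byte payload
    with rados.ReadOpCtx() as op:
        it, ret = ioctx.get_omap_vals(op, '', '', 100)  # start_after, prefix, max
        ioctx.operate_read_op(op, 'dir.home')
        for key, val in it:
            print(key, val)

    ioctx.close()
    cluster.shutdown()
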
watch/notify
●  establish stateful 'watch' on an object
   –  client interest persistently registered with the object
   –  client keeps session to OSD open
●  send 'notify' messages to all watchers
   –  notify message (and payload) is distributed to all watchers
   –  variable timeout
   –  notification on completion
      ●  all watchers got and acknowledged the notify
●  use any object as a communication/synchronization channel
   –  locking, distributed coordination (ala ZooKeeper), etc.

[sequence diagram: CLIENTs #1, #2, and #3 each establish a watch on an object at the OSD (watch → ack/commit); CLIENT #1 sends notify; the OSD delivers the notify to every watcher; each watcher acks; the OSD then reports complete to CLIENT #1]

watch/notify example
●  radosgw cache consistency
   –  radosgw instances watch a single object (.rgw/notify)
   –  locally cache bucket metadata
   –  on bucket metadata changes (removal, ACL changes)
      ●  write change to relevant bucket object
      ●  send notify with bucket name to other radosgw instances
   –  on receipt of notify
      ●  invalidate relevant portion of cache

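Recent python-rados releases expose this pattern directly; a minimal sketch of the same invalidation idea (the object name and payload are invented, and the callback signature shown is an assumption that may vary by release):

    import time
    import rados

    cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
    cluster.connect()
    ioctx = cluster.open_ioctx('data')       # hypothetical pool
    ioctx.write_full('sync.channel', b'')    # any object can be the channel

    def on_notify(notify_id, notifier_id, watch_id, data):
        # e.g. invalidate the cache entry named in the payload
        print('invalidate:', data)

    watch = ioctx.watch('sync.channel', on_notify)   # stateful watch
    ioctx.notify('sync.channel', 'bucket-foo')       # fan out to all watchers
    time.sleep(1)
    watch.close()
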
rados classes
●  dynamically loaded .so
   –  /var/lib/rados-classes/*
   –  implement new object “methods” using existing methods
   –  part of I/O pipeline
   –  simple internal API
●  reads
   –  can call existing native or class methods
   –  do whatever processing is appropriate
   –  return data
●  writes
   –  can call existing native or class methods
   –  do whatever processing is appropriate
   –  generate a resulting transaction to be applied atomically

class examples
●  grep
   –  read an object, filter out individual records, and return those
●  sha1
   –  read object, generate fingerprint, return that
●  images
   –  rotate, resize, crop image stored in object
   –  remove red-eye
●  crypto
   –  encrypt/decrypt object data with provided key

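From the client side, a class method is invoked by name via python-rados; a sketch assuming a hypothetical 'sha1' class with a 'fingerprint' method has been installed under /var/lib/rados-classes on the OSDs:

    import rados

    cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
    cluster.connect()
    ioctx = cluster.open_ioctx('data')   # hypothetical pool

    ioctx.write_full('blob', b'some large payload')

    # the method runs on the OSD, next to the data; only the digest comes back
    ret, digest = ioctx.execute('blob', 'sha1', 'fingerprint', b'')
    print(digest)

    ioctx.close()
    cluster.shutdown()
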
[architecture diagram repeated, with RADOSGW highlighted]

[architecture diagram repeated, with RBD highlighted]

[diagram: a lone COMPUTER with its own DISK next to a large cluster of COMPUTERs with DISKs]

[diagram: VMs backed by a cluster of COMPUTERs with DISKs]

RADOS Block Device:
•  Storage of virtual disks in RADOS
•  Decouples VMs and containers
   •  Live migration!
•  Images are striped across the cluster
•  Snapshots!
•  Support in
   •  Qemu/KVM
   •  OpenStack, CloudStack
   •  Mainline Linux kernel
•  Image cloning
   •  Copy-on-write “snapshot” of existing image

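A sketch with the python rbd bindings (pool and image names invented; assumes a modern cluster where format-2 images with the layering feature are the default, since cloning requires it):

    import rados
    import rbd

    cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
    cluster.connect()
    ioctx = cluster.open_ioctx('rbd')            # conventional pool for images

    r = rbd.RBD()
    r.create(ioctx, 'vm-disk', 10 * 1024**3)     # 10 GiB, thin-provisioned

    with rbd.Image(ioctx, 'vm-disk') as image:
        image.write(b'\x00' * 512, 0)            # only written extents use space
        image.create_snap('golden')              # point-in-time snapshot
        image.protect_snap('golden')             # required before cloning

    # copy-on-write clone of the golden snapshot
    r.clone(ioctx, 'vm-disk', 'golden', ioctx, 'vm-clone')

    ioctx.close()
    cluster.shutdown()
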
[diagram: a VM whose VIRTUALIZATION CONTAINER links LIBRBD and LIBRADOS to reach the cluster]

[diagram: CONTAINERs and a VM on separate hosts, each using LIBRBD/LIBRADOS against the same cluster]

[diagram: a HOST using KRBD (kernel module) over LIBRADOS to reach the cluster]

[architecture diagram repeated, with CEPH FS highlighted]

[diagram: a CLIENT sends metadata operations to the metadata servers and file data (01/10) directly to the OSDs; monitors (M) shown alongside]

Metadata Server:
•  Manages metadata for a POSIX-compliant shared filesystem
   •  Directory hierarchy
   •  File metadata (owner, timestamps, mode, etc.)
•  Stores metadata in RADOS
•  Does not serve file data to clients
•  Only required for shared filesystem

legacy metadata storage
●  a scaling disaster
   –  name → inode → block list → data
   –  no inode table locality
   –  fragmentation
      ●  inode table
      ●  directory
●  many seeks
●  difficult to partition

[diagram: a conventional tree (/etc, /home, /usr, /var, vmlinuz, …) whose inodes live in a separate, fragmented inode table]

ceph fs metadata storage
●  block lists unnecessary
●  inode table mostly useless
   –  APIs are path-based, not inode-based
   –  no random table access, sloppy caching
●  embed inodes inside directories
   –  good locality, prefetching
   –  leverage key/value object

[diagram: the same tree, with each directory stored as an object (inodes 1, 100, 102 in the example) that embeds the inodes of its entries]

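The “leverage key/value object” point maps directly onto the omap API shown earlier. A toy model of the idea (not the real MDS on-disk encoding; pool, object, and field names are invented): one object per directory, with entry names as keys and serialized inodes as values.

    import json
    import rados

    cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
    cluster.connect()
    ioctx = cluster.open_ioctx('metadata')   # hypothetical pool

    # key = entry name, value = embedded inode (toy encoding)
    entries = {
        'hosts':  json.dumps({'ino': 101, 'mode': 0o644, 'size': 220}),
        'passwd': json.dumps({'ino': 102, 'mode': 0o644, 'size': 1024}),
    }
    with rados.WriteOpCtx() as op:
        ioctx.set_omap(op, tuple(entries),
                       tuple(v.encode() for v in entries.values()))
        ioctx.operate_write_op(op, 'dir.100')   # the /etc directory object

    # readdir plus a stat of every entry is now a single object read, in name order

    ioctx.close()
    cluster.shutdown()
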
one tree

[diagram: one directory tree and three metadata servers; how should the tree be partitioned among them? (animated over several slides)]

DYNAMIC SUBTREE PARTITIONING


2012 Storage Developer Conference. © Inktank. All Rights Reserved.
dynamic subtree partitioning
  ●    scalable
         –     arbitrarily partition metadata
  ●    adaptive
         –     move work from busy to idle servers (toy sketch below)
         –     replicate hot metadata
  ●    efficient
         –     hierarchical partition preserves locality
  ●    dynamic
         –     daemons can join/leave
         –     take over for failed nodes

2012 Storage Developer Conference. © Inktank. All Rights Reserved.
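
As a rough illustration of the balancing idea, here is a toy sketch
(not the actual ceph-mds balancer; the loads, paths, and rank count
are invented). Each subtree carries a measured request load, and the
busiest rank hands its hottest subtree to the idlest rank:

    # Toy subtree rebalance: move the hottest subtree off the busiest
    # metadata server. Purely illustrative numbers and names.
    subtree_load = {"/home": 900, "/usr": 120, "/var/log": 600, "/etc": 40}
    assignment = {"/home": 0, "/usr": 0, "/var/log": 0, "/etc": 0}
    NUM_MDS = 3

    def rebalance():
        load = [0] * NUM_MDS
        for subtree, rank in assignment.items():
            load[rank] += subtree_load[subtree]
        busiest = max(range(NUM_MDS), key=lambda r: load[r])
        idlest = min(range(NUM_MDS), key=lambda r: load[r])
        if busiest == idlest:
            return
        mine = [s for s, r in assignment.items() if r == busiest]
        hottest = max(mine, key=lambda s: subtree_load[s])
        assignment[hottest] = idlest   # migrate the whole subtree

    rebalance()
    print(assignment)   # e.g. /home moved off the overloaded rank
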
controlling metadata io
  ●   view ceph-mds as cache
        –    reduce reads
               ●   dir+inode prefetching
        –    reduce writes
               ●   consolidate multiple writes
  ●   large journal or log (see the sketch below)
        –    stripe over objects
        –    two tiers
               ●   journal for short term
               ●   per-directory for long term
        –    fast failure recovery

       [diagram: updates stream into a journal striped over objects,
       then settle into per-directory objects]

2012 Storage Developer Conference. © Inktank. All Rights Reserved.
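
A minimal sketch of the two-tier write path (illustrative only; the
real journal encoding, striping, and trim policy differ). Updates
append cheaply to the short-term journal; when the journal is trimmed,
repeated updates to the same directory consolidate into a single
long-term per-directory write:

    # Toy two-tier metadata writes: journal now, one write per dirty
    # directory later. Invented structures, not Ceph code.
    journal = []        # short term: append-only, striped over objects
    directories = {}    # long term: one key/value object per directory

    def update(dirname, name, inode):
        journal.append((dirname, name, inode))   # fast sequential append

    def trim_journal():
        dirty = {}
        for dirname, name, inode in journal:
            dirty.setdefault(dirname, {})[name] = inode   # consolidate
        for dirname, entries in dirty.items():
            directories.setdefault(dirname, {}).update(entries)  # one write
        journal.clear()

    update("/home", "a", 101)
    update("/home", "b", 102)
    update("/home", "a", 103)    # supersedes the first update pre-flush
    trim_journal()               # /home written once, not three times
    print(directories["/home"])  # {'a': 103, 'b': 102}
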
what is journaled
 ●    lots of state
       –   journaling is expensive up-front, cheap to recover
       –   non-journaled state is cheap, but complex (and somewhat
           expensive) to recover
 ●    yes
       –   client sessions
       –   actual fs metadata modifications
 ●    no
       –   cache provenance
       –   open files
 ●    lazy flush
       –   client modifications may not be durable until fsync() or
           visible to another client (see the example below)

2012 Storage Developer Conference. © Inktank. All Rights Reserved.
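
In practice the lazy-flush rule means an application still calls
fsync() when it needs durability, exactly as on a local POSIX file
system (the path below is hypothetical and assumes a CephFS mount):

    # Without fsync(), the write may sit in the client cache and in
    # not-yet-journaled MDS state; fsync() forces the commit.
    import os

    fd = os.open("/mnt/foo/data", os.O_WRONLY | os.O_CREAT, 0o644)
    os.write(fd, b"important record\n")
    os.fsync(fd)    # only now is the modification guaranteed durable
    os.close(fd)
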
client protocol
  ●    highly stateful
         –     consistent, fine-grained caching
  ●    seamless hand-off between ceph-mds
       daemons
         –     when client traverses hierarchy
         –     when metadata is migrated between servers
  ●    direct access to OSDs for file I/O




2012 Storage Developer Conference. © Inktank. All Rights Reserved.
an example
  ●    mount -t ceph 1.2.3.4:/ /mnt
         –     3 ceph-mon RT
         –     2 ceph-mds RT (1 ceph-mds to -osd RT)
  ●    cd /mnt/foo/bar
         –     2 ceph-mds RT (2 ceph-mds to -osd RT)
  ●    ls -al
         –     open
         –     readdir
                 ●    1 ceph-mds RT (1 ceph-mds to -osd RT)
         –     stat each file
         –     close
  ●    cp * /tmp
         –     N ceph-osd RT

       [diagram: the client exchanging round trips (RT) with the
       ceph-mon, ceph-mds, and ceph-osd daemons]

2012 Storage Developer Conference. © Inktank. All Rights Reserved.
recursive accounting
 ●    ceph-mds tracks recursive directory stats
       –   file sizes
       –   file and directory counts
       –   modification time
 ●    virtual xattrs present the full stats (see the example below)
 ●    efficient

           $ ls -alSh | head
           total 0
           drwxr-xr-x 1 root       root      9.7T 2011-02-04 15:51 .
           drwxr-xr-x 1 root       root      9.7T 2010-12-16 15:06 ..
           drwxr-xr-x 1 pomceph    pg4194980 9.6T 2011-02-24 08:25 pomceph
           drwxr-xr-x 1 mcg_test1  pg2419992  23G 2011-02-02 08:57 mcg_test1
           drwx--x--- 1 luko       adm        19G 2011-01-21 12:17 luko
           drwx--x--- 1 eest       adm        14G 2011-02-04 16:29 eest
           drwxr-xr-x 1 mcg_test2  pg2419992 3.0G 2011-02-02 09:34 mcg_test2
           drwx--x--- 1 fuzyceph   adm       1.5G 2011-01-18 10:46 fuzyceph
           drwxr-xr-x 1 dallasceph pg275     596M 2011-01-14 10:06 dallasceph



2012 Storage Developer Conference. © Inktank. All Rights Reserved.
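
The virtual xattrs can be read with ordinary xattr calls: CephFS
exposes the recursive stats as ceph.dir.* attributes on every
directory, so one metadata operation replaces a full tree crawl
(mount point hypothetical; os.getxattr is Linux-only):

    # Read recursive size and file count for a directory in one op.
    import os

    path = "/mnt/pomceph"
    rbytes = int(os.getxattr(path, "ceph.dir.rbytes"))   # recursive bytes
    rfiles = int(os.getxattr(path, "ceph.dir.rfiles"))   # recursive files
    print(f"{path}: {rbytes} bytes across {rfiles} files")
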
snapshots
 ●    volume or subvolume snapshots unusable at petabyte scale
       –   instead, snapshot arbitrary subdirectories
 ●    simple interface
       –   hidden '.snap' directory
       –   no special tools



           $ mkdir foo/.snap/one                        # create snapshot
           $ ls foo/.snap
           one
           $ ls foo/bar/.snap
           _one_1099511627776                           # parent's snap name is mangled
           $ rm foo/myfile
           $ ls -F foo
           bar/
           $ ls -F foo/.snap/one
           myfile bar/
           $ rmdir foo/.snap/one                        # remove snapshot


2012 Storage Developer Conference. © Inktank. All Rights Reserved.
multiple client implementations
  ●    Linux kernel client
         –    mount -t ceph 1.2.3.4:/ /mnt
         –    export (NFS), Samba (CIFS)
  ●    ceph-fuse
  ●    libcephfs.so
         –    your app (see the sketch below)
         –    Samba (CIFS)
         –    Ganesha (NFS)
         –    Hadoop (map/reduce)

       [diagram: Ganesha (NFS) and Samba (SMB/CIFS) layered on
       libcephfs; Hadoop and your app linking libcephfs directly;
       ceph-fuse bridging the kernel's FUSE layer to the cluster]

2012 Storage Developer Conference. © Inktank. All Rights Reserved.
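
For "your app", Ceph ships Python bindings over libcephfs; a hedged
sketch follows (module and method names as found in the python-cephfs
package, but treat the exact signatures as assumptions and check your
release):

    # Assumed python-cephfs API: LibCephFS, conf_read_file, mount,
    # stat, unmount, shutdown. Verify against your installed version.
    import cephfs

    fs = cephfs.LibCephFS()
    fs.conf_read_file()  # search the default ceph.conf locations
    fs.mount()           # attach to the cluster; no kernel mount needed
    print(fs.stat("/"))  # metadata via ceph-mds, file I/O direct to OSDs
    fs.unmount()
    fs.shutdown()
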
       [the component diagram again, now graded for maturity:]

  ●    LIBRADOS (AWESOME) – a library allowing apps to directly access
       RADOS, with support for C, C++, Java, Python, Ruby, and PHP
  ●    RADOSGW (AWESOME) – a bucket-based REST gateway, compatible with
       S3 and Swift
  ●    RBD (AWESOME) – a reliable and fully-distributed block device,
       with a Linux kernel client and a QEMU/KVM driver
  ●    CEPH FS (NEARLY AWESOME) – a POSIX-compliant distributed file
       system, with a Linux kernel client and support for FUSE
  ●    RADOS (AWESOME) – a reliable, autonomous, distributed object
       store comprised of self-healing, self-managing, intelligent
       storage nodes

2012 Storage Developer Conference. © Inktank. All Rights Reserved.
why we do this
  ●    limited options for scalable open source
       storage
  ●    proprietary solutions
         –     expensive
         –     don't scale (well or out)
         –     marry hardware and software


  ●    industry ready for change



2012 Storage Developer Conference. © Inktank. All Rights Reserved.
who we are
  ●    Ceph created at UC Santa Cruz (2004-2007)
  ●    developed by DreamHost (2008-2011)
  ●    supported by Inktank (2012)
         –     Los Angeles, Sunnyvale, San Francisco, remote
  ●    growing user and developer community
         –     Linux distros, users, cloud stacks, SIs, OEMs




2012 Storage Developer Conference. © Inktank. All Rights Reserved.
thanks




       sage weil
       sage@inktank.com
       @liewegas                                                     http://github.com/ceph
                                                                     http://ceph.com/


2012 Storage Developer Conference. © Inktank. All Rights Reserved.
