the ceph distributed storage system


sage weil
strata – march 1, 2012
hello
●   why you should care
●   what it is, what it does
●   how it works, how you can use it
    ●   architecture
    ●   objects
    ●   recovery
    ●   file system
    ●   hadoop integration
●   distributed computation
    ●   object classes
●   who we are, why we do this
why should you care about another storage system?

     requirements, time, money
requirements
●   diverse storage needs
    ●   object storage
    ●   block devices (for VMs) with snapshots, cloning
    ●   shared file system with POSIX, coherent caches
    ●   structured data... files, block devices, or objects?
●   scale
    ●   terabytes, petabytes, exabytes
    ●   heterogeneous hardware
    ●   reliability and fault tolerance
time
●   ease of administration
●   no manual data migration, load balancing
●   painless scaling
    ●   expansion and contraction
    ●   seamless migration
money
●   low cost per gigabyte
●   no vendor lock-in

●   software solution
    ●   run on commodity hardware
●   open source
what is ceph?
unified storage system
●   objects
    ●   small or large
    ●   multi-protocol
●   block devices
    ●   snapshots, cloning
●   files
    ●   cache coherent
    ●   snapshots
    ●   usage accounting

[diagram: Netflix, VM, and Hadoop workloads sitting on radosgw, RBD, and Ceph DFS, all built on RADOS]
open source
●   LGPLv2
    ●   copyleft
    ●   free to link to proprietary code
●   no copyright assignment
    ●   no dual licensing
    ●   no “enterprise-only” feature set
●   active community
●   commercial support
distributed storage system
●   data center (not geo) scale
    ●   10s to 10,000s of machines
    ●   terabytes to exabytes
●   fault tolerant
    ●   no SPoF
    ●   commodity hardware
        –   ethernet, SATA/SAS, HDD/SSD
        –   RAID, SAN probably a waste of time, power, and money
architecture
●   monitors (ceph-mon)
    ●   1s-10s, paxos
    ●   lightweight process
    ●   authentication, cluster membership,
        critical cluster state
●   object storage daemons (ceph-osd)
    ●   1s-10,000s
    ●   smart, coordinate with peers
●   clients (librados, librbd)
    ●   zillions
    ●   authenticate with monitors, talk directly
        to ceph-osds
●   metadata servers (ceph-mds)
    ●   1s-10s
    ●   build POSIX file system on top of objects
rados object storage model
●   pools
    ●   1s to 100s
    ●   independent namespaces or object collections
    ●   replication level, placement policy
●   objects
    ●   trillions
    ●   blob of data (bytes to gigabytes)
    ●   attributes (e.g., “version=12”; bytes to kilobytes)
    ●   key/value bundle (bytes to gigabytes)
rados object API
●   librados.so
    ●   C, C++, Python, Java, shell
●   read/write (extent), truncate, remove; get/set/remove xattr or key
    ●   like a file or .db file
●   efficient copy-on-write clone
●   atomic compound operations/transactions
    ●   read + getxattr, write + setxattr
    ●   compare xattr value, if match write + setxattr
●   classes
    ●   load new code into cluster to implement new methods
    ●   calc sha1, grep/filter, generate thumbnail
    ●   encrypt, increment, rotate image
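
a minimal sketch of what this looks like through the Python binding; the pool name “data”, object name, and xattr below are made up for illustration, and error handling is omitted:

        import rados

        cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
        cluster.connect()
        ioctx = cluster.open_ioctx('data')             # one pool per I/O context

        ioctx.write_full('greeting', b'hello rados')   # blob of data
        ioctx.set_xattr('greeting', 'version', b'12')  # small attribute
        print(ioctx.read('greeting'))                  # read it back
        print(ioctx.get_xattr('greeting', 'version'))

        ioctx.close()
        cluster.shutdown()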
object storage
●   client/server, host/device paradigm doesn't scale
    ●   idle servers are wasted servers
    ●   if storage devices don't coordinate, clients must
●   ceph-osds are intelligent storage daemons
    ●   coordinate with peers
    ●   sensible, cluster-aware protocols
●   flexible deployment
    ●   one per disk
    ●   one per host
    ●   one per RAID volume
●   sit on local file system
    ●   btrfs, xfs, ext4, etc.
data distribution
●   all objects are replicated N times
●   objects are automatically placed, balanced, migrated
    in a dynamic cluster
●   must consider physical infrastructure
    ●   ceph-osds on hosts in racks in rows in data centers

●   three approaches
    ●   pick a spot; remember where you put it
    ●   pick a spot; write down where you put it
    ●   calculate where to put it, where to find it
CRUSH
●   pseudo-random placement algorithm
    ●   uniform, weighted distribution
    ●   fast calculation, no lookup
●   placement rules
    ●   in terms of physical infrastructure
        –   “3 replicas, same row, different racks”
●   predictable, bounded migration on changes
    ●   N → N + 1 ceph-osds means a bit over 1/Nth of
        data moves
object placement
●   an object is hashed into a placement group (PG) within its pool:
        hash(object name) % num_pg = pg
●   CRUSH maps each PG onto a list of ceph-osds:
        CRUSH(pg, cluster state, rule) = [A, B]
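
a toy sketch of that two-step mapping, with ordinary hashes standing in for the real CRUSH function, just to show that placement is a pure computation and needs no lookup table:

        import hashlib

        # toy stand-in: real clients use rjenkins hashing and CRUSH, not md5
        def object_to_pg(name, num_pg):
            h = int(hashlib.md5(name.encode()).hexdigest(), 16)
            return h % num_pg              # hash(object name) % num_pg = pg

        def pg_to_osds(pg, osds, replicas=2):
            # every client ranks ceph-osds the same way from the same
            # cluster state, so everyone computes the same answer
            rank = lambda osd: hashlib.md5(f"{pg}:{osd}".encode()).hexdigest()
            return sorted(osds, key=rank)[:replicas]

        pg = object_to_pg("greeting", num_pg=256)
        print(pg_to_osds(pg, osds=[0, 1, 2, 3, 4]))    # e.g. [A, B]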
replication
●   all data replicated N times
●   ceph-osd cluster handles replication
    ●   client writes to first replica; that primary ceph-osd forwards the write to the other replicas
    ●   reduce client bandwidth
    ●   “only once” semantics
    ●   cluster maintains strict consistency
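
a conceptual sketch of primary-copy replication using a made-up ToyOSD class (not the real OSD wire protocol): the client sends one write to the primary, which fans it out to its peers before acknowledging:

        class ToyOSD:
            def __init__(self, name):
                self.name, self.store = name, {}

            def write(self, obj, data, peers=()):
                self.store[obj] = data        # persist locally
                for peer in peers:            # primary forwards to replicas
                    peer.write(obj, data)
                return "ack"                  # client is answered exactly once

        primary, replica = ToyOSD("osd.0"), ToyOSD("osd.1")
        primary.write("greeting", b"hello", peers=[replica])
        assert primary.store == replica.store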
recovery
●   dynamic cluster
    ●   nodes are added, removed
    ●   nodes reboot, fail, recover
●   “recovery” is the norm
    ●   “map” records cluster state at point in time
        –   ceph-osd node status (up/down, weight, IP)
        –   CRUSH function specifying desired data distribution
    ●   ceph-osds cooperatively migrate data to achieve that
●   any map update potentially triggers data migration
    ●   ceph-osds monitor peers for failure
    ●   new nodes register with monitor
    ●   administrator adjusts weights, marks out old hardware, etc.
rbd – rados block device
●   replicated, reliable, high-performance virtual disk
    ●   striped over objects across entire cluster
    ●   thinly provisioned, snapshots
    ●   image cloning (real soon now)
●   well integrated
    ●   Linux kernel driver (/dev/rbd0)
    ●   qemu/KVM + librbd
    ●   libvirt, OpenStack
●   sever link between virtual machine and host
    ●   fail-over, live migration

[diagram: a KVM guest using librbd over librados, next to a KVM/Xen guest using ext4 on the rbd kernel driver]
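
a minimal sketch of image creation and I/O through the Python rbd binding; the pool name “rbd” and image name are assumptions, and error handling is omitted:

        import rados, rbd

        cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
        cluster.connect()
        ioctx = cluster.open_ioctx('rbd')

        rbd.RBD().create(ioctx, 'vm-disk-1', 10 * 1024**3)  # 10 GB, thinly provisioned
        with rbd.Image(ioctx, 'vm-disk-1') as image:
            image.write(b'boot sector goes here', 0)         # striped over RADOS objects

        ioctx.close()
        cluster.shutdown()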
ceph distributed file system
●   shared cluster-coherent file system
●   separate metadata and data paths
    ●   avoid “server” bottleneck inherent in NFS etc
●   ceph-mds cluster
    ●   manages file system hierarchy
    ●   redistributes load based on workload
    ●   ultimately stores everything in objects
●   highly stateful client sessions
    ●   lots of caching, prefetching, locks and leases
an example
●   mount -t ceph 1.2.3.4:/ /mnt
    ●   3 ceph-mon RT
    ●   2 ceph-mds RT (1 ceph-mds to -osd RT)
●   cd /mnt/foo/bar
    ●   2 ceph-mds RT (2 ceph-mds to -osd RT)
●   ls -al
    ●   open
    ●   readdir
         –   1 ceph-mds RT (1 ceph-mds to -osd RT)
    ●   stat each file
    ●   close
●   cp * /tmp
    ●   N ceph-osd RT

[diagram: the client exchanging round trips (RT) with ceph-mon, ceph-mds, and ceph-osd daemons]
dynamic subtree partitioning

[diagram: the directory hierarchy, from the root down, partitioned across a cluster of ceph-mds daemons]

●   efficient
    ●   hierarchical partition preserves locality
●   dynamic
    ●   daemons can join/leave
    ●   take over for failed nodes
●   scalable
    ●   arbitrarily partition metadata
●   adaptive
    ●   move work from busy to idle servers
    ●   replicate hot metadata
recursive accounting
●   ceph-mds tracks recursive directory stats
    ●   file sizes
    ●   file and directory counts
    ●   modification time
●   virtual xattrs present full stats
●   efficient

        $ ls -alSh | head
        total 0
        drwxr-xr-x 1 root            root      9.7T 2011-02-04 15:51 .
        drwxr-xr-x 1 root            root      9.7T 2010-12-16 15:06 ..
        drwxr-xr-x 1 pomceph         pg4194980 9.6T 2011-02-24 08:25 pomceph
        drwxr-xr-x 1 mcg_test1       pg2419992  23G 2011-02-02 08:57 mcg_test1
        drwx--x--- 1 luko            adm        19G 2011-01-21 12:17 luko
        drwx--x--- 1 eest            adm        14G 2011-02-04 16:29 eest
        drwxr-xr-x 1 mcg_test2       pg2419992 3.0G 2011-02-02 09:34 mcg_test2
        drwx--x--- 1 fuzyceph        adm       1.5G 2011-01-18 10:46 fuzyceph
        drwxr-xr-x 1 dallasceph      pg275     596M 2011-01-14 10:06 dallasceph
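
a sketch of reading those recursive stats from Python via the virtual xattrs; the ceph.dir.* attribute names and the mount path are assumptions of this sketch:

        import os

        path = '/mnt/foo'
        for attr in ('ceph.dir.rbytes', 'ceph.dir.rfiles', 'ceph.dir.rsubdirs'):
            print(attr, os.getxattr(path, attr).decode())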
snapshots
●   volume or subvolume snapshots are unusable at petabyte scale
    ●   snapshot arbitrary subdirectories
●   simple interface
    ●   hidden '.snap' directory
    ●   no special tools

        $ mkdir foo/.snap/one      # create snapshot
        $ ls foo/.snap
        one
        $ ls foo/bar/.snap
        _one_1099511627776         # parent's snap name is mangled
        $ rm foo/myfile
        $ ls -F foo
        bar/
        $ ls -F foo/.snap/one
        myfile bar/
        $ rmdir foo/.snap/one      # remove snapshot
multiple protocols, implementations
●   Linux kernel client
    ●   mount -t ceph 1.2.3.4:/ /mnt
    ●   export (NFS), Samba (CIFS)
●   ceph-fuse
●   libcephfs.so
    ●   your app
    ●   Samba (CIFS)
    ●   Ganesha (NFS)
    ●   Hadoop (map/reduce)

[diagram: NFS served by Ganesha and SMB/CIFS by Samba, plus Hadoop and custom apps, all linking libcephfs; ceph-fuse and the ceph kernel client provide mounts]
hadoop
●   seamless integration
    ●   Java libcephfs wrapper
    ●   Hadoop CephFileSystem
    ●   drop-in replacement for HDFS
●   locality
    ●   exposes data layout
    ●   reads from local replica
    ●   first write does not go to local node
●   can interact “normally” with Hadoop data
    ●   kernel mount
    ●   ceph-fuse
    ●   NFS/CIFS
●   can colocate Hadoop with “normal” storage
    ●   avoid staging/destaging
distributed computation models
●   object classes
    ●   tightly couple computation with data
    ●   carefully sandboxed
    ●   part of I/O pipeline
    ●   atomic transactions
    ●   rich data abstraction
         –   blob of bytes (file)
         –   xattrs
         –   key/value bundle
●   map/reduce
    ●   colocation of computation and data is optimization only
    ●   more loosely sandboxed
    ●   orchestrated data flow between files, nodes
    ●   job scheduling
    ●   limited storage abstraction
structured data
●   data types
    ●   record streams
    ●   key/value maps
    ●   queues
    ●   images
    ●   matrices
●   operations
    ●   filter
    ●   fingerprint
    ●   mutations
    ●   rotate/resize
    ●   substitute
    ●   translate
    ●   avoid read/modify/write
size vs (intra-object) smarts

[chart: object size vs. intra-object smarts — S3 and HDFS offer large but simple objects; hbase, cassandra, redis, and riak offer small, smart records; a RADOS object combines large size with rich per-object operations]
best tool for the job
●   key/value stores
    ●   cassandra, riak, redis
●   object store
    ●   RADOS (ceph)
●   map/reduce “filesystems”
    ●   GFS, HDFS
●   POSIX filesystems
    ●   ceph, lustre, gluster
●   hbase/bigtable
    ●   tablets, logs
●   map/reduce
    ●   data flows
●   percolator
    ●   triggers, transactions
can I deploy it already?
●   rados object store is stable
    ●   librados
    ●   radosgw (RESTful APIs)
    ●   rbd rados block device
    ●   commercial support
●   file system is almost ready
    ●   feature complete
    ●   suitable for testing, PoC, benchmarking
    ●   needs testing, deliberate qa effort for production
why we do this
●   limited options for scalable open source storage
    ●   orangefs, lustre
    ●   glusterfs
    ●   HDFS
●   proprietary solutions
    ●   marry hardware and software
    ●   expensive
    ●   don't scale (well or out)
●   industry needs to change
who we are
●   created at UC Santa Cruz (2007)
●   supported by DreamHost (2008-2011)
●   spun off as new company (2012)
    ●   downtown Los Angeles, downtown San Francisco
●   growing user and developer community
    ●   Silicon Valley, Asia, Europe
    ●   Debian, SuSE, Canonical, RedHat
    ●   cloud computing stacks
●   we are hiring
    ●   C/C++/Python developers
    ●   sysadmins, testing engineers


                                       http://ceph.com/
librados, radosgw
●   librados
    ●   direct parallel access to cluster
    ●   rich API
●   radosgw
    ●   RESTful object storage
        –   S3, Swift APIs
    ●   proxy HTTP to rados
    ●   ACL-based security for the big bad internet

[diagram: HTTP clients reach radosgw instances through haproxy; radosgw and custom apps both use librados to talk to the cluster]
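
a sketch of using the S3-compatible API from Python with boto; the endpoint, port, and credentials are placeholders:

        import boto
        import boto.s3.connection

        conn = boto.connect_s3(
            aws_access_key_id='ACCESS_KEY',
            aws_secret_access_key='SECRET_KEY',
            host='radosgw.example.com', port=80, is_secure=False,
            calling_format=boto.s3.connection.OrdinaryCallingFormat(),
        )
        bucket = conn.create_bucket('my-bucket')
        bucket.new_key('hello.txt').set_contents_from_string('stored in RADOS')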
why we like btrfs
●   pervasive checksumming
●   snapshots, copy-on-write
●   efficient metadata (xattrs)
●   inline data for small files
●   transparent compression
●   integrated volume management
    ●   software RAID, mirroring, error recovery
    ●   SSD-aware
●   online fsck
●   active development community
