Storage Developer Conference - 09/19/2012
Ceph Community
Sage Weil's slides from SDC in Sep 2012.
71 slides
1.
Ceph: scaling storage for the cloud and beyond
Sage Weil, Inktank
2012 Storage Developer Conference. © Inktank. All Rights Reserved.
2.
outline
● why you should care
● what is it, what it does
● distributed object storage
● ceph fs
● who we are, why we do this
3.
why should you care about another storage system?
4.
requirements
● diverse storage needs
  – object storage
  – block devices (for VMs) with snapshots, cloning
  – shared file system with POSIX, coherent caches
  – structured data... files, block devices, or objects?
● scale
  – terabytes, petabytes, exabytes
  – heterogeneous hardware
  – reliability and fault tolerance
5.
time
● ease of administration
● no manual data migration, load balancing
● painless scaling
  – expansion and contraction
  – seamless migration
6.
cost
● linear function of size or performance
● incremental expansion
  – no fork-lift upgrades
● no vendor lock-in
  – choice of hardware
  – choice of software
● open
7.
what is ceph?
8.
unified storage system
● objects
  – native
  – RESTful
● block
  – thin provisioning, snapshots, cloning
● file
  – strong consistency, snapshots
9.
[architecture diagram: APP / HOST/VM / CLIENT on top of:]
● LIBRADOS – a library allowing apps to directly access RADOS, with support for C, C++, Java, Python, Ruby, and PHP
● RADOSGW – a bucket-based REST gateway, compatible with S3 and Swift
● RBD – a reliable and fully-distributed block device, with a Linux kernel client and a QEMU/KVM driver
● CEPH FS – a POSIX-compliant distributed file system, with a Linux kernel client and support for FUSE
● RADOS – a reliable, autonomous, distributed object store comprised of self-healing, self-managing, intelligent storage nodes
10.
open source
● LGPLv2
  – copyleft
  – ok to link to proprietary code
● no copyright assignment
  – no dual licensing
  – no “enterprise-only” feature set
● active community
● commercial support
11.
distributed storage system
● data center scale
  – 10s to 10,000s of machines
  – terabytes to exabytes
● fault tolerant
  – no single point of failure
  – commodity hardware
● self-managing, self-healing
12.
ceph object model
● pools
  – 1s to 100s
  – independent namespaces or object collections
  – replication level, placement policy
● objects
  – bazillions
  – blob of data (bytes to gigabytes)
  – attributes (e.g., “version=12”; bytes to kilobytes)
  – key/value bundle (bytes to gigabytes)
13.
why start with objects?
● more useful than (disk) blocks
  – names in a single flat namespace
  – variable size
  – simple API with rich semantics
● more scalable than files
  – no hard-to-distribute hierarchy
  – update semantics do not span objects
  – workload is trivially parallel
14.
[diagram: one HUMAN, one COMPUTER, many DISKs]
15.
[diagram: several HUMANs sharing one COMPUTER and its DISKs]
16.
[diagram: many HUMANs crowded around a single (COMPUTER) and its DISKs — “actually more like this…”]
17.
[diagram: many COMPUTER+DISK pairs serving a few HUMANs]
18.
[diagram: OSDs stacked on local file systems (btrfs, xfs, ext4) on DISKs, alongside monitors (M)]
19.
Monitors (M):
• Maintain cluster membership and state
• Provide consensus for distributed decision-making via Paxos
• Small, odd number
• These do not serve stored objects to clients
Object Storage Daemons (OSDs):
• At least three in a cluster
• One per disk or RAID group
• Serve stored objects to clients
• Intelligently peer to perform replication tasks
20.
[diagram: a HUMAN administering the cluster via the three monitors (M)]
21.
data distribution
● all objects are replicated N times
● objects are automatically placed, balanced, migrated in a dynamic cluster
● must consider physical infrastructure
  – ceph-osds on hosts in racks in rows in data centers
● three approaches
  – pick a spot; remember where you put it
  – pick a spot; write down where you put it
  – calculate where to put it, where to find it
22.
CRUSH
• Pseudo-random placement algorithm
• Fast calculation, no lookup
• Repeatable, deterministic
• Ensures even distribution
• Stable mapping
  • Limited data migration
• Rule-based configuration
  • specifiable replication
  • infrastructure topology aware
  • allows weighting
23.
[diagram: objects mapped to placement groups via hash(object name) % num pg, then placement groups mapped to OSDs via CRUSH(pg, cluster state, policy)]
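The two-step mapping sketched above — hash an object name into a placement group, then compute the PG's OSDs with a deterministic function — can be illustrated in Python. The rendezvous-hashing stand-in below is not the real CRUSH algorithm (and all names are illustrative), but it shows the same "calculate, don't look up" and "stable mapping" properties:

```python
import hashlib

def stable_hash(s: str) -> int:
    # deterministic across processes (Python's built-in hash() is salted)
    return int.from_bytes(hashlib.md5(s.encode()).digest()[:8], "big")

NUM_PG = 64

def object_to_pg(name: str) -> int:
    # step 1: hash(object name) % num pg
    return stable_hash(name) % NUM_PG

def pg_to_osds(pg: int, osds: list, replicas: int = 3) -> list:
    # step 2 stand-in: rendezvous (highest-random-weight) hashing.
    # Repeatable, no lookup table, and stable: removing an OSD only
    # remaps the placement groups that lived on it.
    ranked = sorted(osds, key=lambda o: stable_hash(f"{pg}:{o}"), reverse=True)
    return ranked[:replicas]

osds = [f"osd.{i}" for i in range(8)]
pg = object_to_pg("myobject")
before = pg_to_osds(pg, osds)

survivors = [o for o in osds if o != "osd.7"]   # take one OSD out
after = pg_to_osds(pg, survivors)               # surviving replicas stay put
```

The key property: replicas that survive the cluster change keep their placement, so only the data on the lost OSD migrates.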
24.
[diagram: the same objects → placement groups → OSDs mapping, without annotations]
25.
RADOS
● monitors publish osd map that describes cluster state
  – ceph-osd node status (up/down, weight, IP)
  – CRUSH function specifying desired data distribution
● object storage daemons (OSDs)
  – safely replicate and store objects
  – migrate data as the cluster changes over time
  – coordinate based on shared view of reality – gossip!
● decentralized, distributed approach allows
  – massive scales (10,000s of servers or more)
  – the illusion of a single copy with consistent behavior
26.
[diagram: CLIENT asking “??” — where is my data?]
27.
[diagram only]
28.
[diagram only]
29.
[diagram: CLIENT asking “??”]
30.
[architecture diagram, repeated with LIBRADOS highlighted]
31.
[diagram: APP talking natively through LIBRADOS to the monitors and OSDs]
32.
LIBRADOS
• Provides direct access to RADOS for applications
• C, C++, Python, PHP, Java
• No HTTP overhead
33.
atomic transactions
● client operations sent to the OSD cluster
  – operate on a single object
  – can contain a sequence of operations, e.g.
    ● truncate object
    ● write new object data
    ● set attribute
● atomicity
  – all operations commit or do not commit atomically
● conditional
  – 'guard' operations can control whether operation is performed
    ● verify xattr has specific value
    ● assert object is a specific version
  – allows atomic compare-and-swap etc.
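The guarded-transaction idea can be modeled in a few lines. This is a toy in-memory sketch (the names are illustrative, not the librados API): guards are checked first, and the whole operation sequence either applies or does not:

```python
class Object:
    def __init__(self):
        self.data = b""
        self.xattrs = {}
        self.version = 0

class GuardFailed(Exception):
    pass

def apply_transaction(obj, guards, ops):
    # guards first: if any fails, no operation is applied
    for guard in guards:
        if not guard(obj):
            raise GuardFailed
    for op in ops:                  # then all ops commit together
        op(obj)
    obj.version += 1

obj = Object()
apply_transaction(
    obj,
    guards=[lambda o: o.version == 0],   # assert object is a specific version
    ops=[
        lambda o: setattr(o, "data", b"hello"),
        lambda o: o.xattrs.update({"owner": "sage"}),
    ],
)

# a second writer with a stale version guard loses the compare-and-swap
try:
    apply_transaction(obj, guards=[lambda o: o.version == 0],
                      ops=[lambda o: setattr(o, "data", b"stale")])
except GuardFailed:
    pass
```

The stale writer's data never lands: the version guard turns the write into an atomic compare-and-swap.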
34.
key/value storage
● store key/value pairs in an object
  – independent from object attrs or byte data payload
● based on google's leveldb
  – efficient random and range insert/query/removal
  – based on BigTable SSTable design
● exposed via key/value API
  – insert, update, remove
  – individual keys or ranges of keys
● avoid read/modify/write cycle for updating complex objects
  – e.g., file system directory objects
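A sketch of this per-object key/value bundle (function names here are illustrative, not the actual librados calls): individual keys or ranges can be touched without rewriting the object's byte payload:

```python
# In-memory model of an object: byte data, xattrs, and a key/value bundle,
# each independent of the others.
class Obj:
    def __init__(self):
        self.data = b""
        self.xattrs = {}
        self.kv = {}

def kv_set(obj, pairs):
    # insert/update individual keys without touching obj.data
    obj.kv.update(pairs)

def kv_remove_range(obj, start, end):
    # remove a contiguous range of keys [start, end)
    for k in [k for k in obj.kv if start <= k < end]:
        del obj.kv[k]

def kv_get_range(obj, start, end):
    # range query over sorted keys
    return {k: v for k, v in sorted(obj.kv.items()) if start <= k < end}

# A file system directory object: one entry updated in place, with no
# read/modify/write cycle over the whole listing.
d = Obj()
kv_set(d, {"bar": "inode:101", "baz": "inode:102", "foo": "inode:100"})
kv_set(d, {"foo": "inode:103"})
```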
35.
watch/notify
● establish stateful 'watch' on an object
  – client interest persistently registered with object
  – client keeps session to OSD open
● send 'notify' messages to all watchers
  – notify message (and payload) is distributed to all watchers
  – variable timeout
  – notification on completion
    ● all watchers got and acknowledged the notify
● use any object as a communication/synchronization channel
  – locking, distributed coordination (ala ZooKeeper), etc.
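The watch/notify pattern can be modeled as a tiny single-process toy (not the real OSD session machinery): watchers register persistent interest, and a notify fans the payload out to all of them and completes once every watcher acks:

```python
class WatchedObject:
    def __init__(self):
        self.watchers = {}      # watch handle -> callback (stateful session)
        self._next_handle = 0

    def watch(self, callback):
        # persistently register a client's interest in this object
        self._next_handle += 1
        self.watchers[self._next_handle] = callback
        return self._next_handle

    def unwatch(self, handle):
        self.watchers.pop(handle, None)

    def notify(self, payload):
        # deliver the payload to every watcher; the notify is 'complete'
        # once all watchers have acknowledged it
        acks = []
        for handle, callback in self.watchers.items():
            callback(payload)
            acks.append(handle)
        return acks

obj = WatchedObject()
seen = []
obj.watch(lambda msg: seen.append(("w1", msg)))
obj.watch(lambda msg: seen.append(("w2", msg)))
acks = obj.notify("invalidate bucket 'photos'")
```

This is exactly the shape of the radosgw cache-invalidation example on the next slides: the object acts as a broadcast channel.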
36.
[sequence diagram: CLIENTs #1–#3 each watch an object on an OSD (ack/commit); a notify is fanned out to all watchers, each acks, and the notifier receives complete]
37.
watch/notify example
● radosgw cache consistency
  – radosgw instances watch a single object (.rgw/notify)
  – locally cache bucket metadata
  – on bucket metadata changes (removal, ACL changes)
    ● write change to relevant bucket object
    ● send notify with bucket name to other radosgw instances
  – on receipt of notify
    ● invalidate relevant portion of cache
38.
rados classes
● dynamically loaded .so
  – /var/lib/rados-classes/*
  – implement new object “methods” using existing methods
  – part of I/O pipeline
  – simple internal API
● reads
  – can call existing native or class methods
  – do whatever processing is appropriate
  – return data
● writes
  – can call existing native or class methods
  – do whatever processing is appropriate
  – generate a resulting transaction to be applied atomically
39.
class examples
● grep
  – read an object, filter out individual records, and return those
● sha1
  – read object, generate fingerprint, return that
● images
  – rotate, resize, crop image stored in object
  – remove red-eye
● crypto
  – encrypt/decrypt object data with provided key
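The "object method" idea can be sketched with the grep and sha1 examples above — plain Python functions standing in for the dynamically loaded .so classes, running where the data lives so only results cross the wire:

```python
import hashlib

CLASSES = {}     # method name -> handler, standing in for loaded .so files

def register(name):
    def deco(fn):
        CLASSES[name] = fn
        return fn
    return deco

@register("grep")
def grep(obj_data, pattern):
    # filter records server-side; only matching lines are returned
    return b"\n".join(line for line in obj_data.splitlines() if pattern in line)

@register("sha1")
def sha1(obj_data, _arg=b""):
    # fingerprint the object in place instead of shipping its bytes
    return hashlib.sha1(obj_data).hexdigest().encode()

def call_method(obj_data, name, arg=b""):
    # the OSD-side dispatch: run a registered class method against an object
    return CLASSES[name](obj_data, arg)

data = b"error: disk full\ninfo: ok\nerror: timeout\n"
matches = call_method(data, "grep", b"error")
```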
40.
[architecture diagram, repeated with RADOSGW highlighted]
41.
[architecture diagram, repeated with RBD highlighted]
42.
[diagram: a grid of COMPUTERs, each with its own DISK]
43.
[diagram: the same grid, now hosting VMs on some of the computers]
44.
RADOS Block Device:
• Storage of virtual disks in RADOS
• Decouples VMs and containers
  • Live migration!
• Images are striped across the cluster
• Snapshots!
• Support in
  • Qemu/KVM
  • OpenStack, CloudStack
  • Mainline Linux kernel
• Image cloning
  • Copy-on-write “snapshot” of existing image
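Striping an image across fixed-size objects can be sketched as follows. This is a toy model — the 4 MiB object size is a common default, but the `image.index` key naming is purely illustrative (real RBD object names differ):

```python
OBJECT_SIZE = 4 * 1024 * 1024     # 4 MiB chunks (a common default)

def image_write(objects, image, offset, data):
    # split a virtual-disk write across the fixed-size objects backing it
    while data:
        idx = offset // OBJECT_SIZE
        off = offset % OBJECT_SIZE
        chunk, data = data[:OBJECT_SIZE - off], data[OBJECT_SIZE - off:]
        key = f"{image}.{idx:016x}"                  # illustrative naming
        buf = bytearray(objects.get(key, b""))
        if off > len(buf):
            buf.extend(b"\0" * (off - len(buf)))     # sparse fill
        buf[off:off + len(chunk)] = chunk
        objects[key] = bytes(buf)
        offset += len(chunk)

store = {}
# a 4-byte write straddling the boundary between object 0 and object 1
image_write(store, "vm-disk", OBJECT_SIZE - 2, b"abcd")
```

Because each chunk is an independent RADOS object, writes to one image spread across many OSDs, which is what makes striped images fast and live migration cheap (any host can attach to the same objects).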
45.
[diagram: VM on a virtualization container using LIBRBD and LIBRADOS to reach the monitors and OSDs]
46.
[diagram: live migration — a VM moving between two containers, each with its own LIBRBD/LIBRADOS stack]
47.
[diagram: HOST using KRBD (kernel module) and LIBRADOS to reach the cluster]
48.
[architecture diagram, repeated with CEPH FS highlighted]
49.
[diagram: CLIENT sending metadata operations to the metadata servers and data I/O directly to the OSDs]
50.
[diagram only]
51.
Metadata Server
• Manages metadata for a POSIX-compliant shared filesystem
  • Directory hierarchy
  • File metadata (owner, timestamps, mode, etc.)
• Stores metadata in RADOS
• Does not serve file data to clients
• Only required for shared filesystem
52.
legacy metadata storage
● a scaling disaster
  – name → inode → block list → data
  – no inode table locality
  – fragmentation
    ● inode table
    ● directory
  – many seeks
  – difficult to partition
[diagram: a directory tree (etc, home, usr, var, vmlinuz, hosts, mtab, passwd, bin, include, lib, …) scattered across an inode table]
53.
ceph fs metadata storage
● block lists unnecessary
● inode table mostly useless
  – APIs are path-based, not inode-based
  – no random table access, sloppy caching
● embed inodes inside directories
  – good locality, prefetching
  – leverage key/value object
[diagram: the same tree with inodes (1, 100, 102, …) embedded in their parent directories]
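Embedding inodes inside directory objects can be illustrated with a toy path-based lookup. The structures below are hypothetical stand-ins, not CephFS's on-disk format:

```python
# Each directory is one object whose key/value bundle embeds the inodes
# of its entries -- one read fetches a whole directory plus its metadata.
dirs = {
    "/":    {"etc":  {"ino": 1,   "mode": "drwxr-xr-x"},
             "home": {"ino": 2,   "mode": "drwxr-xr-x"}},
    "/etc": {"passwd": {"ino": 100, "mode": "-rw-r--r--"},
             "hosts":  {"ino": 101, "mode": "-rw-r--r--"}},
}

def lookup(path):
    # path-based traversal: an entry's inode lives inside its parent
    # directory object, so there is no global inode table to consult
    parent, _, name = path.rpartition("/")
    return dirs[parent or "/"][name]

def readdir_with_stats(path):
    # one object read yields the listing *and* every entry's metadata --
    # the locality/prefetching win the slide describes
    return dirs[path]

st = lookup("/etc/passwd")
```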
54.
one tree
three metadata servers ??
55.
[diagram only]
56.
[diagram only]
57.
[diagram only]
58.
[diagram only]
59.
DYNAMIC SUBTREE PARTITIONING
60.
dynamic subtree partitioning
● scalable
  – arbitrarily partition metadata
● efficient
  – hierarchical partition preserves locality
● adaptive
  – move work from busy to idle servers
  – replicate hot metadata
● dynamic
  – daemons can join/leave
  – take over for failed nodes
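The adaptive piece — moving work from busy to idle servers — can be sketched minimally (the load numbers and mds names below are made up for illustration):

```python
# Toy subtree partition: each path is a subtree assigned to one MDS
# daemon, with a per-subtree metadata-load counter.
assignment = {"/": "mds0", "/home": "mds0", "/usr": "mds1"}
load = {"/": 10, "/home": 900, "/usr": 50}

def server_load(mds):
    # total load of every subtree this daemon currently serves
    return sum(l for path, l in load.items() if assignment[path] == mds)

def rebalance(servers):
    # move the hottest subtree from the busiest daemon to the idlest one
    busiest = max(servers, key=server_load)
    idlest = min(servers, key=server_load)
    hot = max((p for p in assignment if assignment[p] == busiest),
              key=lambda p: load[p])
    assignment[hot] = idlest

rebalance(["mds0", "mds1"])   # /home migrates from mds0 to mds1
```

Because whole subtrees move, the hierarchical partition keeps locality; only the assignment changes, not the metadata layout in RADOS.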
61.
controlling metadata io
● view ceph-mds as cache
  – reduce reads
    ● dir+inode prefetching
  – reduce writes
    ● consolidate multiple writes
● large journal or log
  – stripe over objects
  – two tiers
    ● journal for short term
    ● per-directory for long term
  – fast failure recovery
[diagram: journal and directory objects]
what is journaled
● lots of state
  – journaling is expensive up-front, cheap to recover
  – non-journaled state is cheap, but complex (and somewhat expensive) to recover
● yes
  – client sessions
  – actual fs metadata modifications
● no
  – cache provenance
  – open files
● lazy flush
  – client modifications may not be durable until fsync() or visible by another client
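The journaling trade-off above can be shown with a minimal write-ahead sketch (plain Python, invented names): journaled updates cost an extra write up front but are replayed trivially after a crash, which is exactly the "expensive up-front, cheap to recover" point.

```python
# Minimal write-ahead journaling sketch (not Ceph's on-disk format).
journal = []   # in the real system this is striped over RADOS objects
metadata = {}  # the in-memory metadata cache

def update(path, attrs):
    journal.append(("update", path, attrs))  # write-ahead: journal first
    metadata[path] = attrs                   # then apply to the cache

def recover():
    """Rebuild metadata state by replaying the journal in order."""
    rebuilt = {}
    for op, path, attrs in journal:
        if op == "update":
            rebuilt[path] = attrs
    return rebuilt

update("/foo", {"mode": 0o755})
update("/foo/bar", {"mode": 0o644})
assert recover() == metadata  # replay reproduces the pre-crash state
```

State left out of the journal (cache provenance, open files) avoids that up-front write, but recovery then has to reconstruct it by other means, e.g. by re-establishing it with clients.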
client protocol
● highly stateful
  – consistent, fine-grained caching
● seamless hand-off between ceph-mds daemons
  – when client traverses hierarchy
  – when metadata is migrated between servers
● direct access to OSDs for file I/O
an example
● mount -t ceph 1.2.3.4:/ /mnt
  – 3 ceph-mon RT
  – 2 ceph-mds RT (1 ceph-mds to -osd RT)
● cd /mnt/foo/bar
  – 2 ceph-mds RT (2 ceph-mds to -osd RT)
● ls -al
  – open
  – readdir
    ● 1 ceph-mds RT (1 ceph-mds to -osd RT)
  – stat each file
  – close
● cp * /tmp
  – N ceph-osd RT
(figure: client exchanging round trips with ceph-mon, ceph-mds, and ceph-osd)
recursive accounting
● ceph-mds tracks recursive directory stats
  – file sizes
  – file and directory counts
  – modification time
● virtual xattrs present full stats
● efficient

$ ls -alSh | head
total 0
drwxr-xr-x 1 root       root      9.7T 2011-02-04 15:51 .
drwxr-xr-x 1 root       root      9.7T 2010-12-16 15:06 ..
drwxr-xr-x 1 pomceph    pg4194980 9.6T 2011-02-24 08:25 pomceph
drwxr-xr-x 1 mcg_test1  pg2419992  23G 2011-02-02 08:57 mcg_test1
drwx--x--- 1 luko       adm        19G 2011-01-21 12:17 luko
drwx--x--- 1 eest       adm        14G 2011-02-04 16:29 eest
drwxr-xr-x 1 mcg_test2  pg2419992 3.0G 2011-02-02 09:34 mcg_test2
drwx--x--- 1 fuzyceph   adm       1.5G 2011-01-18 10:46 fuzyceph
drwxr-xr-x 1 dallasceph pg275     596M 2011-01-14 10:06 dallasceph
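What the listing above shows — directory sizes that reflect everything beneath them — can be modeled with a small recursive-stats sketch (plain Python, invented names; the real filesystem exposes these via virtual xattrs such as ceph.dir.rbytes):

```python
# Toy model of recursive directory stats: each directory's size and file
# count cover its entire subtree, so listing a directory shows subtree
# totals without walking it.
def rstats(tree):
    """tree: {name: size_in_bytes | nested dict}. Return (rbytes, rfiles)."""
    rbytes = rfiles = 0
    for child in tree.values():
        if isinstance(child, dict):       # subdirectory: recurse
            b, f = rstats(child)
            rbytes += b
            rfiles += f
        else:                             # regular file: count it
            rbytes += child
            rfiles += 1
    return rbytes, rfiles

home = {"luko": {"a.img": 19_000}, "eest": {"b.img": 14_000}}
print(rstats(home))  # (33000, 2)
```

In the real system the totals are maintained incrementally and propagated lazily up the tree as files change, rather than recomputed on demand as this sketch does.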
snapshots
● volume or subvolume snapshots unusable at petabyte scale
  – snapshot arbitrary subdirectories
● simple interface
  – hidden '.snap' directory
  – no special tools

$ mkdir foo/.snap/one             # create snapshot
$ ls foo/.snap
one
$ ls foo/bar/.snap
_one_1099511627776                # parent's snap name is mangled
$ rm foo/myfile
$ ls -F foo
bar/
$ ls -F foo/.snap/one
myfile  bar/
$ rmdir foo/.snap/one             # remove snapshot
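The transcript above can be paraphrased as a toy model (plain Python; the real implementation is copy-on-write, not a deep copy): creating an entry under the hidden '.snap' directory freezes the subtree, and later changes to the live tree leave the snapshot untouched.

```python
# Toy model of per-subdirectory snapshots via a hidden ".snap" entry.
import copy

foo = {"myfile": b"data", "bar": {}, ".snap": {}}

# mkdir foo/.snap/one -> freeze the current subtree (minus .snap itself)
foo[".snap"]["one"] = copy.deepcopy(
    {name: node for name, node in foo.items() if name != ".snap"}
)

del foo["myfile"]  # rm foo/myfile: mutate the live tree

print(sorted(foo[".snap"]["one"]))  # ['bar', 'myfile'] - snapshot intact
```

The point of the design is that the snapshot boundary is an arbitrary subdirectory, so users snapshot just the tree they care about instead of a whole petabyte-scale volume.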
multiple client implementations
● Linux kernel client
  – mount -t ceph 1.2.3.4:/ /mnt
  – export (NFS), Samba (CIFS)
● ceph-fuse
● libcephfs.so
  – your app
  – Samba (CIFS)
  – Ganesha (NFS)
  – Hadoop (map/reduce)
(figure: NFS served by Ganesha and SMB/CIFS by Samba atop libcephfs; Hadoop and your app atop libcephfs; ceph-fuse; kernel client)
68.
APP
APP APP APP HOST/VM HOST/VM CLIENT CLIENT RADOSGW RADOSGW RBD RBD CEPH FS CEPH FS LIBRADOS LIBRADOS A bucket-based A bucket-based A reliable and fully- A reliable and fully- A POSIX-compliant A POSIX-compliant A library allowing A library allowing REST gateway, REST gateway, distributed block distributed block distributed file distributed file apps to directly apps to directly compatible with S3 compatible with S3 device, with aaLinux device, with Linux system, with aa system, with access RADOS, access RADOS, and Swift and Swift kernel client and aa kernel client and Linux kernel client Linux kernel client with support for with support for QEMU/KVM driver QEMU/KVM driver and support for and support for C, C++, Java, C, C++, Java, FUSE FUSE Python, Ruby, Python, Ruby, and PHP and PHP AWESOME AWESOME NEARLY AWESOME AWESOME RADOS RADOS AWESOME A reliable, autonomous, distributed object store comprised of self-healing, self-managing, A reliable, autonomous, distributed object store comprised of self-healing, self-managing, intelligent storage nodes intelligent storage nodes 2012 Storage Developer Conference. © Inktank. All Rights Reserved.
why we do this
● limited options for scalable open source storage
● proprietary solutions
  – expensive
  – don't scale (well or out)
  – marry hardware and software
● industry ready for change
who we are
● Ceph created at UC Santa Cruz (2004-2007)
● developed by DreamHost (2008-2011)
● supported by Inktank (2012)
  – Los Angeles, Sunnyvale, San Francisco, remote
● growing user and developer community
  – Linux distros, users, cloud stacks, SIs, OEMs
thanks
sage weil
sage@inktank.com
@liewegas
http://github.com/ceph
http://ceph.com/