Scaling Ceph at CERN
Dan van der Ster (daniel.vanderster@cern.ch)
Data and Storage Service Group | CERN IT Department
CERN’s Mission and Tools
●  CERN studies the fundamental laws of nature
○  Why do particles have mass?
○  What is our universe made of?
○  Why is there no antimatter left?
○  What was matter like right after the “Big Bang”?
○  …
●  The Large Hadron Collider (LHC)
○  Built in a 27km long tunnel, ~200m underground
○  Dipole magnets operated at -271°C (1.9K)
○  Particles do ~11’000 turns/sec, 600 million collisions/sec
○  …
●  Detectors
○  Four main experiments, each the size of a cathedral
○  DAQ systems processing petabytes per second
Big Data at CERN
Physics Data on CASTOR/EOS
●  LHC experiments produce ~10GB/s, i.e. ~25PB/year
User Data on OpenAFS & DFS
●  Home directories for 30k users
●  Physics analysis development
●  Project spaces for applications
Service Data on AFS/NFS
●  Databases, admin applications
Tape archival with CASTOR/TSM
●  RAW physics outputs
●  Desktop/Server backups
Service   Size     Files
OpenAFS   290TB    2.3B
CASTOR    89.0PB   325M
EOS       20.1PB   160M
IT Evolution at CERN
Cloudifying CERN’s IT infrastructure ...
●  Centrally-managed and uniform hardware
○  No more service-specific storage boxes
●  OpenStack VMs for most services
○  Building for 100k nodes (mostly for batch processing)
●  Attractive desktop storage services
○  Huge demand for a local Dropbox, Google Drive …
●  Remote data centre in Budapest
○  More rack space and power, plus disaster recovery
… brings new storage requirements
●  Block storage for OpenStack VMs
○  Images and volumes
●  Backend storage for existing and new services
○  AFS, NFS, OwnCloud, Data Preservation, ...
●  Regional storage
○  Use of our new data centre in Hungary
●  Failure tolerance, data checksumming, easy to operate, security, ...
Ceph at CERN
12 racks of disk server quads
Wiebalck / van der Ster -- Building an organic block storage service at CERN with Ceph
Our 3PB Ceph Cluster
47 disk servers / 1128 OSDs:
  Dual Intel Xeon E5-2650 (32 threads incl. HT)
  Dual 10Gig-E NICs (only one connected)
  24x 3TB Hitachi disks (Eco drive, ~5900 RPM)
  3x 2TB Hitachi system disks (triple mirror)
  64GB RAM

5 monitors:
  Dual Intel Xeon L5640 (24 threads incl. HT)
  Dual 1Gig-E NICs (only one connected)
  2x 2TB Hitachi system disks (RAID-1 mirror)
  1x 240GB OCZ Deneva 2 SSD for /var/lib/ceph/mon
  48GB RAM
# df -h /mnt/ceph
Filesystem   Size  Used  Avail  Use%  Mounted on
xxx:6789:/   3.1P  173T   2.9P    6%  /mnt/ceph
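The mount above is a CephFS kernel-client mount; a minimal sketch of how such a mount is created (monitor address, credentials and paths are placeholders, not taken from the slides):
  # mount CephFS via the kernel client, then check capacity
  mount -t ceph xxx:6789:/ /mnt/ceph -o name=admin,secretfile=/etc/ceph/admin.secret
  df -h /mnt/ceph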
Use-Cases Being Evaluated
1.  Images and Volumes for OpenStack
2.  S3 Storage for Data Preservation / Public
Dissemination
3.  Physics data storage for archival and/or
analysis
#1 is moving into production. #2 and #3 are more
exploratory at the moment.
OpenStack Volumes & Images
•  Glance: using RBD for ~3 months now.
•  Only issue was to increase ulimit -n above 1024 (10k is good; see the sketch below).
•  Cinder: testing with close colleagues.
•  126 Cinder Volumes attached today – 56TB used
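•  A minimal sketch of the ulimit fix mentioned above (file path and service user are assumptions, adjust for your deployment):
  # /etc/security/limits.d/91-glance.conf -- assumed path, any limits.d file works
  glance  soft  nofile  10240
  glance  hard  nofile  10240
  # verify from the service account
  su -s /bin/bash glance -c 'ulimit -n'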
[Charts: growing # of volumes/images; usual traffic is ~50-100MB/s with current usage (~idle)]
RBD for OpenStack Volumes
•  Before general availability, we need to test and
enable qemu iops/bps throttling
•  Otherwise VMs with many IOs can disrupt other
users.
•  One ongoing issue is that a few clients are
getting an (infrequent) segfault of qemu during
a VM reboot.
•  Happens on VMs with many attached RBDs.
•  Difficult to get a complete (16GB) core dump.
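•  One way to get the qemu iops/bps throttling mentioned above is front-end QoS specs attached to a Cinder volume type; a sketch (names and limits are illustrative, not our production values):
  # create front-end (qemu/libvirt) throttling limits and attach them to a volume type
  cinder qos-create rbd-throttled consumer=front-end total_iops_sec=500 total_bytes_sec=104857600
  cinder type-create throttled-rbd
  cinder qos-associate <qos-spec-id> <volume-type-id>   # ids come from the two commands above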
CASTOR & XRootD/EOS
•  Exploring RADOS backend for these two HEP-developed
file systems
•  Gateway model, similar to S3 via RADOSGW
•  CASTOR needs raw throughput performance (to feed
many tape drives at 250MBps each).
•  Striped RWs across many OSDs are important.
•  XRootD/EOS may benefit from the highly scalable
namespace to store O(billion) objects
•  Bonus: XRootD also offers http/webdav with X509/kerberos,
possibly even fuse mountable.
•  Developments are in early stages.
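•  As a rough illustration of the gateway model (pool and object names are made up): a gateway would simply store and fetch file data as plain RADOS objects:
  rados -p castor-data put file.0001 /data/file.0001
  rados -p castor-data stat file.0001
  rados -p castor-data get file.0001 /tmp/file.0001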
Operations & Lessons Learned
Configuration and Deployment
•  Dumpling 0.67.7
•  Fully Puppet-ized
•  Automated server deployment,
automated OSD replacement
•  Very few custom ceph.conf options (see the box below)
•  Experimenting with the filestore wbthrottle
•  we find that disabling it
completely gives better IOps
performance
•  But don’t do this!!!
mon osd down out interval = 900

osd pool default size = 3
osd pool default min size = 1
osd pool default pg num = 1024
osd pool default pgp num = 1024
osd pool default flag hashpspool = true

osd max backfills = 1
osd recovery max active = 1
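The backfill/recovery limits above can also be adjusted at runtime without restarting daemons; a standard injectargs sketch (not a CERN-specific recipe):
  # throttle recovery/backfill cluster-wide at runtime
  ceph tell osd.* injectargs '--osd-max-backfills 1 --osd-recovery-max-active 1'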
Cluster Activity
General Comments…
•  In these ~7 months of running the cluster, there have been very
few problems
•  No outages
•  No data losses/corruptions
•  No unfixable performance issues
•  Behaves well during stress tests
•  But now we’re starting to get real/varied/creative users, and this
brings up many interesting issues...
•  “No amount of stress testing can prepare you for real users”
- Unknown
•  (point being, don’t take the next slides to be too negative – I’m
just trying to give helpful advice ;)
Latency & Slow Requests
•  Best latency we can achieve is 20-40ms
•  Slow SATA disks, no SSDs: hard to justify SSDs in a multi-PB cluster,
but could in a smaller limited use-case cluster (e.g. for Cinder-only)
•  Latency can increase dramatically with heavy usage
•  Don’t mix latency-bound and throughput-bound users on the same
OSDs
•  Local processes scanning the disks can hurt performance
•  Add /var/lib/ceph to the updatedb PRUNEPATHS
•  If you have slow disks like us, you need to understand your disk IO
scheduler – e.g. deadline prefers reads over writes: writes are given a
5 second deadline vs. 500ms for reads!
•  Scrubbing!
•  Kernel tuning: vm.* sysctl, dirty page flushing, memory
reclaiming…
•  “Something is flushing the buffers, blocking the OSD processes”
•  Slow requests: monitor them, eliminate them.
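•  Two of the host-level tweaks above as a sketch (device name and values are examples, not recommendations):
  # keep updatedb/mlocate away from the OSD filestores
  echo 'PRUNEPATHS="/var/lib/ceph"' >> /etc/updatedb.conf   # or extend the existing PRUNEPATHS line
  # deadline gives writes a 5000ms deadline vs 500ms for reads; adjust if that hurts your workload
  cat /sys/block/sda/queue/iosched/write_expire
  echo 1500 > /sys/block/sda/queue/iosched/write_expire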
Life with 250 million objects
•  Recently, a user decided to write 250 million 1kB objects
•  Not so unreasonable: 250M * 4MB = 1PB, so this simulates the cluster
being full of RBD images, at least in terms of # objects
•  It worked – no big problems from holding this many objects.
•  Tested single OSD failure: ~7 hours to backfill, including a
double-backfill glitch that we’re trying to understand.
•  But now we want to cleanup, and it is not trivial to remove 250M
objects!
•  rados rmpool generated quite a load when we rm’d a 3 million object
pool (some OSDs were temporarily marked down).
•  Probably due to a mistake in our wbthrottle tuning
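•  If rmpool is too disruptive, one crude alternative (pool name is an example) is to drain the pool object-by-object to spread the load, then drop the empty pool:
  rados -p testpool ls | xargs -I{} rados -p testpool rm {}   # very slow for 250M objects, by design
  ceph osd pool delete testpool testpool --yes-i-really-really-mean-it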
Other backfilling issues
•  During a backfilling event (draining a whole server),
we started observing repeated monitor elections
•  Caused by the mons’ LevelDBs being so active that the
local SATA disks couldn’t keep up.
•  When a mon falls behind, it calls an election
•  Could be due to LevelDB compaction…
•  We moved /var/lib/ceph/mon to SSDs – no more
elections during backfilling
•  Avoid double backfilling when taking an OSD out of
service:
•  Start with ceph osd crush rm <osd id>
•  If you mark the OSD out first, then crush rm it, you will
compute a new CRUSH map twice, i.e. backfill twice.
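•  A sketch of the full removal sequence with crush rm first (osd.123 is a placeholder):
  ceph osd crush rm osd.123          # data moves exactly once
  # wait for backfilling to finish, then retire the daemon
  service ceph stop osd.123          # init-script invocation varies by distro/release
  ceph auth del osd.123
  ceph osd rm 123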
Fun with CRUSH
•  CRUSH is simple yet powerful, so it is tempting to
play with the cluster layout
•  But once you have non-zero amounts of data, significant
CRUSH changes will lead to massive data movements,
which create extra disk load and may disrupt users.
•  Early CRUSH planning is crucial!
•  A network switch is a failure domain, so we should
configure CRUSH to replicate across switches,
right?
•  But (assuming we don’t have a private cluster network)
that would send all replication traffic via the switch uplinks
– bottleneck!
•  Unclear tradeoff between uptime and performance.
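•  Whatever layout you choose, test CRUSH changes offline before injecting them; the usual edit cycle looks roughly like this:
  ceph osd getcrushmap -o crushmap.bin
  crushtool -d crushmap.bin -o crushmap.txt
  # ... edit crushmap.txt (e.g. add switch buckets and a rule replicating across them) ...
  crushtool -c crushmap.txt -o crushmap.new
  crushtool --test -i crushmap.new --num-rep 3 --show-statistics
  ceph osd setcrushmap -i crushmap.new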
CRUSH & Data distribution
•  CRUSH may give your cluster
an uneven data distribution
•  An OSD’s used space will
scale with the number of PGs
assigned to it
•  After you have designed your
cluster, created your pools,
started adding data, check the
PG and volume distributions
•  reweight-by-utilization
is useful to iron out an uneven
PG distribution
•  The hashpspool flag is also
important if you have many
active pools
[Chart: number of OSDs having N PGs (for pool = volumes); x-axis: N PGs (0-30), y-axis: number of OSDs (0-160)]
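•  A sketch of correcting an uneven distribution (the 110% threshold is just an example):
  # reweight OSDs whose utilisation is more than 110% of the cluster average
  ceph osd reweight-by-utilization 110
  # inspect the resulting weights
  ceph osd tree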
RBD Reliability with 3 Replicas
•  RBD devices are chunked across thousands of objects:
•  A full 1TB volume is composed of 250,000 4MB objects
•  If any single object is lost, the whole RBD can be considered to be corrupted
(obviously, it depends which blocks are lost!)
•  If you lose an entire PG, you can consider all RBDs to be lost / corrupted.
•  Our incorrect & irrational fears:
•  Any simultaneous triple disk failure in the cluster would lead to objects being
lost – and somehow all RBDs would be corrupted.
•  As we add OSDs to the cluster, the data gets spread wider, and the chances of
RBD data loss increase.
•  But this is wrong!!
•  The only triple disk failures that can lead to data loss are those combinations
actively used by PGs – so having e.g. 4096 PGs for RBDs means that only
4096 combinations out of the 10^9 possible combinations matter.
•  Roughly, P(loss) ≈ N_PGs * (P_diskfailure)^3 / 3!
•  We use 4 replicas for the RBD volumes, but this is probably overkill.
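•  Worked example (all numbers assumed purely for illustration): with 4096 PGs and a ~10^-4 chance that any given disk is down within a recovery window, the rough loss probability is 4096 * (10^-4)^3 / 3! ≈ 7*10^-10 per window.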
Trust your clients
•  There is no server-side per-client throttling
•  A few nasty clients can overwhelm an OSD, leading to slow requests
for everyone.
•  When you have a high load / slow requests, it is not always
trivial to identify and blacklist/firewall the misbehaving client
•  Could use some help in the monitoring: per-client perf stats?
•  One of our creative users found a way to make the mons generate 5×40 MBps of outbound network traffic
•  Could saturate the mon network, lead to disruptions
•  RADOS is not for end-users. A cephx keyring is for trusted
persons only, not for Joe Random User.
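•  Where a client does need a keyring, it can at least be scoped tightly; a sketch (client and pool names are examples):
  # read-only mon access, rwx on a single pool only
  ceph auth get-or-create client.cinder mon 'allow r' osd 'allow rwx pool=volumes' -o /etc/ceph/ceph.client.cinder.keyring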
Fat fingers
•  A healthy cluster is always vulnerable to human errors
•  We’ve thus far avoided any big mistakes
•  Used PG splitting to grow a pool from 8 to 2048 PGs
•  Leads to unresponsive OSDs that get marked down → degraded objects.
•  Safer & now-enforced to grow in 2x or 4x steps
•  ulimits, ulimits, ulimits
•  With a large number of OSDs (say, more than 500), you will hit num
file and num process limits everywhere:
•  Glance, qemu, radosgw, ceph/rados CLI, …
•  If you use XFS, don’t put your OSD journal as a file on the disk
•  Use a separate partition, the first partition!
•  We still need to reinstall our whole cluster to re-partition the OSDs
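•  A sketch of growing a pool's PG count in the enforced 2x steps (pool name and numbers are examples; pgp_num must follow pg_num):
  ceph osd pool set volumes pg_num 16
  ceph osd pool set volumes pgp_num 16
  # wait for HEALTH_OK, then continue: 32, 64, ... up to the target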
Scale up and out
•  Scale up: we are demonstrating the viability of a
3PB cluster with O(1000) OSDs.
•  What about 10,000 or 100,000 OSDs?
•  What about 10,000 or 100,000 clients?
•  Many Ceph instances is always an option, but not ideal
•  Scale out: our growing data centre in Budapest
brings many options:
•  Replicate over the WAN (though, 30ms RTT)
•  Tiering / Caching pools (new feature, need to get
experience…)
•  Data locality – direct IOs to nearby replica or caching pool
Summary
•  CERN IT infrastructure is undergoing a private
cloud revolution, and Ceph is providing the
underlying storage.
•  Our CASTOR and XRootD physics data use-cases may exploit RADOS for improved performance/scalability.
•  In seven months with a 3PB cluster, we’ve not
had any disasters. Actually it’s working quite
well.
•  Presented some lessons learned, I hope they
prove useful in your Ceph explorations.