Mastering Ceph Operations:
Upmap and the Mgr Balancer
Dan van der Ster | CERN IT | Ceph Day Berlin | 12 November 2018
Already >250PB physics data
Ceph @ CERN Update
CERN Ceph Clusters                        Size    Version
OpenStack Cinder/Glance Production        5.5PB   luminous
Satellite data centre (1000km away)       1.6PB   luminous
Hyperconverged KVM+Ceph                   16TB    luminous
CephFS (HPC+Manila) Production            0.8PB   luminous
Client Scale Testing                      0.4PB   luminous
Hyperconverged HPC+Ceph                   0.4PB   luminous
CASTOR/XRootD Production                  4.4PB   luminous
CERN Tape Archive                         0.8TB   luminous
S3+SWIFT Production (4+2 EC)              2.3PB   luminous
Stable growth in RBD, S3, CephFS
CephFS Scale Testing
• Two activities testing performance:
• HPC storage with CephFS, BoF at SuperComputing in Dallas right now! (Pablo Llopis, CERN IT)
• Scale testing ceph-csi with k8s (10,000 CephFS clients!)
RBD Tuning
Rackspace / CERN Openlab Collaboration
Performance assessment tools
• ceph-osd benchmarking suite
• rbd top for identifying active clients
Studied performance impact of:
• various HDD configurations
• flash for block.db, WAL, dm-*
• hyperconverged configurations
Target real use-cases at CERN
• database applications
• monitoring and data analysis
Upmap and the Mgr Balancer
Background: 2-Step Placement
1. RANDOM: Map an object to a PG uniformly at random.
2. CRUSH: Map each PG to a set of OSDs using CRUSH.
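Both steps can be inspected for any object with ceph osd map (pool and object names here are made up, and the output line is only illustrative of the format):

ceph osd map cephfs_data myobject
# osdmap e1234 pool 'cephfs_data' (1) object 'myobject' -> pg 1.a4dc23f5 (1.3f5) -> up ([12,48,73], p12) acting ([12,48,73], p12)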
Do we really need PGs?
• Why don’t we map objects to OSDs directly with CRUSH?
• If we did that, all OSDs would be coupled (peered) with all others.
• Any 3x failure would lead to data loss.
• Consider a 1000-OSD cluster:
• ~1000^3 possible combinations, but only #PGs of them are relevant for data loss.
• …
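A rough worked example (the PG count is an assumption for illustration): with 1000 OSDs and 3 replicas there are C(1000,3) ≈ 1.7 × 10^8 distinct OSD triples. If objects were scattered over all of them, any simultaneous 3-OSD failure would lose some data. With 65,536 PGs, only ~65,536 of those triples actually hold data, so a random triple failure hits a data-bearing combination with probability of only about 65536 / 1.7×10^8 ≈ 0.04%.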
Do we really need CRUSH?
• Why don’t we just distribute the PG mappings directly?
• CRUSH provides a language for describing data placement rules according to your infrastructure.
• The “failure-domain” part of CRUSH is always perfect: e.g. it will never put two copies on the same host (unless you tell it to…)
• The uniformity part of CRUSH is imperfect: uneven OSD utilizations are a fact of life. Perfection requires an impractical number of PGs.
Do we really need CRUSH?
• Why don’t we just distribute the PG mappings directly?
• We can! Now in luminous with upmap!
2.1-Step Placement
RANDOM → CRUSH → UPMAP: upmap adds per-PG exception entries on top of the CRUSH result.
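Those exceptions live in the OSDMap and can be set by hand (the PG and OSD ids below are hypothetical). If CRUSH maps pg 1.7 to [4,21,63] but we want the third copy on osd.36 instead of osd.63, the first command pins it there and the second removes the exception again:

ceph osd pg-upmap-items 1.7 63 36
ceph osd rm-pg-upmap-items 1.7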
Data Imbalance
• User question:
• ceph df says my cluster is 50% full overall but the cephfs_data pool is 73% full. What’s wrong?
• A pool is full once its first OSD is full.
• ceph df reports the used space on the most full OSD.
• This difference shows that the OSDs are imbalanced.
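A quick way to see the spread: ceph osd df lists per-OSD utilization and ends with MIN/MAX VAR and STDDEV summary figures, which make the imbalance obvious next to the overall numbers from ceph df:

ceph osd df
ceph df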
Data Balancing
• Luminous added a data balancer.
• The general idea is to slowly move PGs from the most full to the least full OSDs.
• Two ways to accomplish this, two modes:
• crush-compat: internally tweak the OSD weights, e.g. to make underfull disks look bigger
• Compatible with legacy clients.
• upmap: move PGs precisely where we want them (without breaking failure-domain rules)
• Compatible with luminous+ clients.
Turning on the balancer (luminous)
• My advice: Do whatever you can to use the upmap balancer:
ceph osd set-require-min-compat-client luminous
ceph mgr module ls
ceph mgr module enable balancer
ceph config-key set mgr/balancer/begin_time 0830
ceph config-key set mgr/balancer/end_time 1800
ceph config-key set mgr/balancer/max_misplaced 0.005
ceph config-key set mgr/balancer/upmap_max_iterations 2
ceph balancer mode upmap
ceph balancer on
Luminous limitation: if the balancer config doesn’t take, restart the active ceph-mgr.
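To check that the module picked up the settings and whether it still sees work to do (both commands exist in luminous; eval prints a unitless score where lower is better):

ceph balancer status
ceph balancer eval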
Success!
• It really works.
• On our largest clusters, we have recovered hundreds of terabytes of capacity using the upmap balancer.
But wait, there’s more…
Ceph Operations at Scale
• Cluster is nearly full (need to add capacity)
• Cluster has old tunables (need to set to optimal)
• Cluster has legacy osd reweights (need to reweight all to 1.0)
• Cluster data placement changes (need to change crush ruleset)
• Operations like those above often involve:
• large amounts of data movement
• lasting several days or weeks
• unpredictable impact on users (and the cluster)
• and no easy rollback!
[Cartoon: “WE ARE HERE” … a leap of faith … “WE WANT TO BE HERE”]
A brief interlude…
• “remapped” PGs are fully replicated (normally “clean”), but CRUSH wants to move them to a new set of OSDs
• 4815 active+remapped+backfill_wait
• “norebalance” is a cluster flag that tells Ceph *not* to make progress on any remapped PGs
• ceph osd set norebalance
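In practice that looks like (all existing commands):

ceph osd set norebalance      # pause movement of misplaced (remapped) data
ceph pg ls remapped           # list the PGs CRUSH currently wants to move
ceph osd unset norebalance    # resume when ready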
Adding capacity
• Step 1: set the norebalance flag
• Step 2: ceph-volume lvm create…
• Step 3: 4518 active+remapped+backfill_wait
• Step 4: ???
• Step 5: HEALTH_OK
What if we could “upmap” those remapped PGs back to where the data is now?
Adding capacity with upmap
• But you may be wondering:
• We’ve added new disks to the cluster, but they have zero PGs, so what’s the point?
• The good news is that the upmap balancer will automatically notice the new OSDs are underfull, and will gradually move PGs to them.
• In fact, the balancer simply *removes* the upmap entries we created to keep PGs off of the new OSDs.
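Concretely, the “???” step is: for every remapped PG, add a pg-upmap-items entry so that “up” matches “acting” again, and the cluster returns to HEALTH_OK with zero backfill. A minimal sketch of the idea, not the external tool we actually use in production: it assumes replicated pools with no degraded PGs (up and acting the same length), and the JSON layout of ceph pg ls varies by release, so review the generated commands before running them:

ceph pg ls remapped -f json |
  jq -r '(.pg_stats? // .)[]
         | select(.up != .acting)
         | "ceph osd pg-upmap-items \(.pgid) " +
           ([.up, .acting] | transpose
            | map(select(.[0] != .[1]) | "\(.[0]) \(.[1])")
            | join(" "))' > upmap-remapped.sh

# inspect the generated commands, then apply and clear the flag:
sh upmap-remapped.sh
ceph osd unset norebalance

With up == acting everywhere nothing backfills, and the balancer later removes these entries at its own pace to fill the new OSDs.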
Changing tunables
• Step 1: set the norebalance flag
• Step 2: ceph osd crush tunables optimal
• Step 3: 4815162/20935486 objects misplaced
• Step 4: ???
• Step 5: HEALTH_OK
Such an intensive backfilling would not be transparent to the users…
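So instead, the “???” in Step 4 is again the upmap trick. A hedged sequence (upmap-remapped.sh is the hypothetical script generated by the sketch in the previous section):

ceph osd set norebalance
ceph osd crush tunables optimal
# millions of objects go misplaced; pin every remapped PG back where its data is:
sh upmap-remapped.sh
ceph osd unset norebalance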
And remember, the balancer will slowly move PGs around to where they need to be.
Other use-cases: Legacy OSD Weights
• Remove legacy OSD reweights:
• ceph osd reweight-by-utilization is a legacy feature for balancing OSDs with a [0,1] reweight factor.
• When we set the reweights back to 1.0, many PGs will become active+remapped
• We can upmap them back to where they are.
• Bonus: since those reweights helped balance the OSDs, this acts as a shortcut to find the right set of upmaps to balance a cluster.
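A hedged sketch of the reset (the OSD id is illustrative; in practice loop over every OSD whose reweight in ceph osd df is below 1.0):

ceph osd set norebalance
ceph osd reweight 12 1.0     # repeat for each reweighted OSD
sh upmap-remapped.sh         # pin the resulting remapped PGs in place (sketch above)
ceph osd unset norebalance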
Other use-cases: Placement Changes
• We had a pool with the first 2 replicas in Room A and the 3rd replica in Room B.
• For capacity reasons, we needed to move all three replicas into Room A.
Other use-cases: Placement Changes
• What did we do? (example commands after this list)
1. Create the new crush ruleset
2. Set the norebalance flag
3. Set the pool’s crush rule to be the new one…
• This puts *every* PG in the “remapped” state.
4. Use upmap to map those PGs back to where they are (3rd replica in the wrong room)
5. Gradually remove those upmap entries to slowly move all PGs fully into Room A.
6. HEALTH_OK
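For illustration, the rule switch with real ceph commands (the rule, bucket, and pool names are hypothetical, and the rule shown is a plain single-room replicated rule rather than our exact ruleset):

ceph osd crush rule create-replicated room-a-only roomA host
ceph osd set norebalance
ceph osd pool set mypool crush_rule room-a-only
# every PG is now remapped; pin them where they are:
sh upmap-remapped.sh
ceph osd unset norebalance
# then drop the exceptions a few at a time, e.g.:
ceph osd rm-pg-upmap-items 1.7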
We moved several hundred TBs of data without users noticing, and could pause with HEALTH_OK at any time.
No leap of faith required
[Cartoon: “WE ARE HERE” → UPMAP → “WE WANT TO BE HERE”, no leap of faith needed]
What’s next for “upmap remapped”
• We find this capability to be super useful!
• Wrote external tools to manage the upmap entries.
• It can be tricky!
• After some iteration with upstream we could share…
• Possibly contribute as a core feature?
• Maybe you can think of other use-cases?
Thanks!
