Life at 700us
Nick Fisk
Who Am I?
• Nick Fisk
• Ceph user since 2012
• Author of Mastering Ceph
• Technical manager at SysGroup
• Managed Service Provider
• Use Ceph for providing tier-2 services to customers (backups, standby replicas) - Veeam
• Ceph RBD to ESXi via NFS
What is Latency?
• What the user feels when they click the button
• Buffered IO is probably not affected, though
• A traditional 10G iSCSI storage array will service a 4KB IO in around 300us
• Local SAS SSD: ~20us
• NVMe: ~2us
• Software-defined storage will always have higher latency due to replication across nodes and a fatter software stack
• Latency heavily affects single-threaded operations that can’t run in parallel
• E.g. SQL transaction logs
• Or, in the case of Ceph, PG contention
PG Contention
• PGs serialise the distributed workload in Ceph
• Each operation takes a lock on its PG, which can lead to contention
• Multiple requests to a single object will hit the same PG
• Or, if you are unlucky, two hot objects may share the same PG
• Latency defines how fast a PG can process an operation; the 2nd operation has to wait
• If you dump slow ops from the OSD admin socket and see a lot of delay in “Waiting For PG”, you are likely hitting PG contention (example below)
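A minimal way to check this, assuming a running OSD with the default admin socket (osd.0 is just an example ID):

  # Recent slow/completed ops with their per-event timings
  ceph daemon osd.0 dump_historic_ops
  # Ops currently in flight; long gaps before the "reached_pg" event
  # suggest time spent queued behind the PG lock
  ceph daemon osd.0 dump_ops_in_flight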
The theory behind minimising latency
• Ceph is software
• Each step of the Ceph “software” will run through faster with faster CPUs (GHz)
• Generally, CPUs with more cores = lower GHz
• High CPU GHz = $$$ ?
• Try to avoid dual-socket systems; they add latency and can introduce complications on high disk count boxes (thread counts, thread pinning, interrupts)
• Every write has to go to the journal, so make the journal as fast as reasonably possible
• Bluestore – only small IOs
• Blessing or a curse?
• 10G networking is a must
• So…..fewer, faster cores + NVMe journal = Ceph Latency Nirvana
• Let’s come up with a hardware design that takes this into account…
Bluestore – deferred writes
• For spinning disks:
• IO < 64K: write to WAL, ACK, async commit to disk later
• IO > 64K: sync commit to disk
• This is great from a double-write perspective; the WAL doesn’t need to be stupidly fast or have massive write endurance
• But an NVMe will service a 128KB write a lot faster than a 7.2k disk
• You may need to tune the cutover for your use case (see the sketch below)
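A minimal sketch of moving that cutover, assuming it is governed by bluestore_prefer_deferred_size_hdd (value in bytes; the exact default and option name depend on your Ceph release, so benchmark before changing anything):

  [osd]
  # Hypothetical tuning: defer writes up to 128KB to the fast WAL device
  # instead of the 64KB cutover discussed above
  bluestore_prefer_deferred_size_hdd = 131072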
Ceph CPU Frequency Scaling
CPU MHz | 4KB Write IOPS | Min Latency (us) | Avg Latency (us)
1600    | 797            | 886              | 1250
2000    | 815            | 746              | 1222
2400    | 1161           | 630              | 857
2800    | 1227           | 549              | 812
3300    | 1320           | 482              | 755
4300    | 1548           | 437              | 644
• Ever wondered how Ceph performs at different clock speeds?
• Using the manual CPU governor on an unlocked desktop CPU, ran fio at QD=1 on an RBD at different clock speeds (commands sketched below)
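Roughly how such a sweep can be reproduced, assuming the userspace governor is available and an RBD image named “test” in pool “rbd” (names are illustrative):

  # Pin the CPU to a fixed frequency (repeat per test point)
  cpupower frequency-set -g userspace
  cpupower frequency-set -f 2400MHz
  # Single-threaded 4KB writes at queue depth 1 against an RBD image
  fio --name=qd1 --ioengine=rbd --pool=rbd --rbdname=test \
      --rw=write --bs=4k --iodepth=1 --numjobs=1 \
      --time_based --runtime=60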
Networking Latency
• Sample ping test with a 4KB payload over 1G and 10G networks (sketched below)
• 25Gb networking is interesting for potentially reducing latency further
• Even so, networking latency makes up a large part of the overall latency due to Ceph replication between nodes
• Client -> Primary OSD -> Replica OSDs
• If using an NFS/iSCSI gateway/proxy, an extra network hop is added again
• RDMA will be the game changer!!
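One way to reproduce the test, using standard Linux iputils (host name illustrative):

  # 4KB ICMP payload, 1000 samples; run once over the 1G path, once over 10G
  ping -c 1000 -s 4096 osd-node1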
The Hardware
• 1U server
• Xeon E3, 4 cores x 3.5GHz (3.9GHz Turbo)
• 10GBase-T onboard
• 8x SAS onboard
• 8x SATA onboard
• 64GB RAM
• 12 x 8TB He8s (not pictured)
• Intel P3700 400GB for journal + OS
• 96TB node = ~£5k (Brexit!!)
• 160W idle
• 180W average Ceph load
• 220W with disks + CPU maxed out
How much CPU does Ceph require?
• Please don’t take this as a “HW requirements” guide
• Use it to make informed decisions, instead of “1 core per OSD”
• If latency is important, work out the total required GHz and find the CPU with the highest GHz per core that meets that total. I.e. 3.5GHz x 4 cores = 14GHz (worked example after the chart)
[Chart: MHz per Ceph IO, log scale 0.1-1000, for IO sizes 4KB-4MB; series: MHz per IO and MHz per MB/s]
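As a worked version of that sizing rule, with illustrative numbers (the MHz-per-IO figure is hypothetical and must be read off the chart for your own hardware and IO size):

  # Assume ~3 MHz of CPU per 4KB IO and a target of 4000 IOPS per node
  echo $(( 4000 * 3 / 1000 )) GHz   # -> 12 GHz; a 4x3.5GHz CPU (14 GHz) suffices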
Initial Results
• I was wrong!!!! - 4KB average latency of 2.4ms
write: io=115268KB, bw=1670.1KB/s, iops=417, runt= 68986msec
slat (usec): min=2, max=414, avg= 4.41, stdev= 3.81
clat (usec): min=966, max=27116, avg=2386.84, stdev=571.57
lat (usec): min=970, max=27120, avg=2391.25, stdev=571.69
clat percentiles (usec):
| 1.00th=[ 1480], 5.00th=[ 1688], 10.00th=[ 1912], 20.00th=[ 2128],
| 30.00th=[ 2192], 40.00th=[ 2288], 50.00th=[ 2352], 60.00th=[ 2448],
| 70.00th=[ 2576], 80.00th=[ 2704], 90.00th=[ 2832], 95.00th=[ 2960],
| 99.00th=[ 3312], 99.50th=[ 3536], 99.90th=[ 6112], 99.95th=[ 9536],
| 99.99th=[22400]
But Hang On, what’s this?
Real Current Frequency 900.47 MHz [100.11 x 8.99] (Max of below)
Core [core-id] :Actual Freq (Mult.) C0% Halt(C1)% C3 % C6 % Temp VCore
Core 1 [0]: 900.38 (8.99x) 10.4 44.2 3.47 49.7 27 0.7406
Core 2 [1]: 900.16 (8.99x) 8.46 66.7 1.18 29.9 27 0.7404
Core 3 [2]: 900.47 (8.99x) 10.5 73.8 1 22.5 27 0.7404
Core 4 [3]: 900.12 (8.99x) 8.03 58.6 1 38.3 27 0.7404
• Cores are spending a lot of their time in C6 and below
• And only running at 900MHz
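The readout above is from a CPU monitoring tool; the same information is available on any Linux box, e.g.:

  # Per-core C-state names and time spent in each (standard sysfs paths)
  grep . /sys/devices/system/cpu/cpu0/cpuidle/state*/name
  grep . /sys/devices/system/cpu/cpu0/cpuidle/state*/time
  # Or watch live frequency and C-state residency with turbostat
  turbostat --interval 5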
Intel Cstate Wake Up Latency (us)
• POLL: 0
• C1-SKL: 2
• C1E-SKL: 10
• C3-SKL: 70
• C6-SKL: 85
• C7s-SKL: 124
• C8-SKL: 200
From the previous slide, a large proportion of threads could be waiting for up to 200us for the CPU to wake up before being serviced!!!
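The usual remedy, and presumably what produced the improvement on the next slide, is to keep cores out of the deep C-states; two common approaches (verify the details against your distro):

  # Runtime: hold /dev/cpu_dma_latency at a low value, e.g. via tuned
  tuned-adm profile latency-performance
  # Boot-time: cap the deepest C-state via kernel parameters
  # GRUB_CMDLINE_LINUX="... intel_idle.max_cstate=1 processor.max_cstate=1"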
4KB Seq Write – Replica x3
• That’s more like it
write: io=105900KB, bw=5715.7KB/s, iops=1428, runt= 18528msec
slat (usec): min=2, max=106, avg= 3.50, stdev= 1.31
clat (usec): min=491, max=32099, avg=694.16, stdev=491.91
lat (usec): min=494, max=32102, avg=697.66, stdev=492.04
clat percentiles (usec):
| 1.00th=[ 540], 5.00th=[ 572], 10.00th=[ 588], 20.00th=[ 604],
| 30.00th=[ 620], 40.00th=[ 636], 50.00th=[ 652], 60.00th=[ 668],
| 70.00th=[ 692], 80.00th=[ 716], 90.00th=[ 764], 95.00th=[ 820],
| 99.00th=[ 1448], 99.50th=[ 2320], 99.90th=[ 7584], 99.95th=[11712],
| 99.99th=[24448]
Questions?