Nick Fisk - low latency Ceph

Who Am I
• Nick Fisk
• Ceph user since 2012
• Author of Mastering Ceph
• Technical manager at SysGroup
• Managed Service Provider
• Use Ceph for providing tier-2 services to customers (Backups, standby
replicas) - Veeam
• Ceph RBD to ESXi via NFS

What is Latency?
• What the user feels when he clicks the button
• Buffer IO probably not affected though
• Traditional 10G iSCSI storage array will service a 4KB IO in around 300us.
• Local SAS SSD ~20us
• NVME ~2us
• Software defined storage will always have higher latency due to replication
across nodes and a fatter software stack.
• Latency heavily affects single threaded operations that can’t run in parallel.
• Eg. SQL transaction logs
• Or in the case of Ceph PG Contention

PG Contention
• PG serialises distributed workload in Ceph
• Each operation takes a lock on that PG, can lead to contention
• Multiple requests to a single Object will be hitting same PG
• Or if you are unlucky 2 hot objects may share the same PG
• Latency defines how fast a PG can process that operation, 2nd
operation has to wait
• If you dump slow ops from the OSD admin socket and see a lot of
delay “Waiting For PG”, you are likely hitting PG Contention

The theory behind minimising latency
• Ceph is software
• Each step of the Ceph “software” will run through faster with faster CPU’s (Ghz)
• Generally, CPU’s with more cores = lower Ghz
• High CPU Ghz = $$$ ?
• Try and avoid dual socket systems, adds latency and can introduce complications
on high disk count boxes (thread counts, thread pinning, interrupts)
• Every write has to go to journal, so make journal as fast as reasonably possible
• Bluestore – Only small IO’s
• Blessing or a curse?
• 10G networking is a must
• So…..less faster cores + NVME journal = Ceph Latency Nirvana
• Lets come up with a hardware design that takes this into account…

Bluestore – deferred writes
• For spinning disks
• IO<64K write to WAL, ACK, async commit later to disk
• IO>64K sync commit to disk
• This is great from a double write perspective, WAL doesn’t need to be
stupidly fast or have massive write endurance
• But a NVME will service a 128kb write a lot faster than a 7.2k disk
• May need to tune cutover for your use case

Ceph CPU Frequency Scaling
CPU
Mhz 4Kb Write IO
Min Latency
(us)
Avg Latency
(us)
1600 797 886 1250
2000 815 746 1222
2400 1161 630 857
2800 1227 549 812
3300 1320 482 755
4300 1548 437 644
• Ever wondered how Ceph performs at different clockspeeds?
• Using manual CPU governor on unlocked desktop CPU, ran fio QD=1 on a RBD
at different clock speeds

Networking Latency
• Sample ping test with 4KB payload over 1G and 10G networks
• 25Gb networking is interesting in potentially further reducing latency
• Even still, networking latency makes up a large part of the overall latency
due to Ceph replication between nodes.
• Client -> Primary OSD -> Replica OSD’s
• If using a NFS/iSCSI gateway/proxy, extra network hop added again
• RDMA will be the game changer!!

The Hardware
• 1U server
• Xeon E3 4x3.5Ghz (3.9Ghz Turbo)
• 10GB-T Onboard
• 8 SAS onboard
• 8 SATA onboard
• 64GB Ram
• 12x8TB He8’s (Not pictured)
• Intel P3700 400GB for Journal + OS
• 96TB node = ~£5k (Brexit!!)
• 160W Idle
• 180W average Ceph Load
• 220W Disks + CPU maxed out

How much CPU does Ceph require?
• Please don’t take this as a “HW
requirements” guide
• Use it to make informed
decisions, instead of 1 core per
OSD
• If latency is important, work out
total required Ghz and find CPU
with highest Ghz per core that
meets that total. Ie 3.5Ghz*4 =
14
0.1
1
10
100
1000
4KB 8KB 16KB 32KB 64KB 1MB 4MB
Mhz per Ceph IO
Mhz Per IO Mhz Per MB/s

Initial Results
• I was wrong!!!! - 4kb Average latency 2.4ms
write: io=115268KB, bw=1670.1KB/s, iops=417, runt= 68986msec
slat (usec): min=2, max=414, avg= 4.41, stdev= 3.81
clat (usec): min=966, max=27116, avg=2386.84, stdev=571.57
lat (usec): min=970, max=27120, avg=2391.25, stdev=571.69
clat percentiles (usec):
| 1.00th=[ 1480], 5.00th=[ 1688], 10.00th=[ 1912], 20.00th=[ 2128],
| 30.00th=[ 2192], 40.00th=[ 2288], 50.00th=[ 2352], 60.00th=[ 2448],
| 70.00th=[ 2576], 80.00th=[ 2704], 90.00th=[ 2832], 95.00th=[ 2960],
| 99.00th=[ 3312], 99.50th=[ 3536], 99.90th=[ 6112], 99.95th=[ 9536],
| 99.99th=[22400]

But Hang On, what’s this?
Real Current Frequency 900.47 MHz [100.11 x 8.99] (Max of below)
Core [core-id] :Actual Freq (Mult.) C0% Halt(C1)% C3 % C6 % Temp VCore
Core 1 [0]: 900.38 (8.99x) 10.4 44.2 3.47 49.7 27 0.7406
Core 2 [1]: 900.16 (8.99x) 8.46 66.7 1.18 29.9 27 0.7404
Core 3 [2]: 900.47 (8.99x) 10.5 73.8 1 22.5 27 0.7404
Core 4 [3]: 900.12 (8.99x) 8.03 58.6 1 38.3 27 0.7404
• Core’s are spending a lot of their time in C6 and below
• And only running at 900Mhz

Intel Cstate Wake Up Latency (us)
• POLL
• 0
• C1-SKL
• 2
• C1E-SKL
• 10
• C3-SKL
• 70
• C6-SKL
• 85
• C7s-SKL
• 124
• C8-SKL
• 200
From the previous slide, a large proportion of threads could be waiting for up to 200us for the CPU to wakeup to be serviced!!!

4kb Seq Write – Replica x3
• That’s more like it
write: io=105900KB, bw=5715.7KB/s, iops=1428, runt= 18528msec
slat (usec): min=2, max=106, avg= 3.50, stdev= 1.31
clat (usec): min=491, max=32099, avg=694.16, stdev=491.91
lat (usec): min=494, max=32102, avg=697.66, stdev=492.04
clat percentiles (usec):
| 1.00th=[ 540], 5.00th=[ 572], 10.00th=[ 588], 20.00th=[ 604],
| 30.00th=[ 620], 40.00th=[ 636], 50.00th=[ 652], 60.00th=[ 668],
| 70.00th=[ 692], 80.00th=[ 716], 90.00th=[ 764], 95.00th=[ 820],
| 99.00th=[ 1448], 99.50th=[ 2320], 99.90th=[ 7584], 99.95th=[11712],
| 99.99th=[24448]

Nick Fisk - low latency Ceph

Recomendados

Recomendados

Mais conteúdo relacionado

Mais procurados

Mais procurados (20)

Semelhante a Nick Fisk - low latency Ceph

Semelhante a Nick Fisk - low latency Ceph (20)

Mais de ShapeBlue

Mais de ShapeBlue (20)

Último

Último (20)

Nick Fisk - low latency Ceph