2. • Haomai Wang, active Ceph contributor
• Maintainer of multiple components
• CTO of XSKY, a China-based storage startup
• haomaiwang@gmail.com/haomai@xsky.com
Who Am I
4. • Hammer v0.94.x (LTS) – March '15
• Infernalis v9.2.x – November '15
• Jewel v10.2.x (LTS) – April '16
• Kraken v11.2.x – December '16
• Luminous v12.2.x (LTS) – September '17 (delayed)
Releases
6. • BlueStore = Block + NewStore
• Key/value database (RocksDB) for metadata
• All data written directly to raw block device(s)
• Fast on both HDDs (~2x) and SSDs (~1.5x)
– Similar to FileStore on NVMe, where the device is not the bottleneck
• Full data checksums (crc32c, xxhash, etc.)
• Inline compression (zlib, snappy, zstd)
• Stable and default
RADOS - BlueStore
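The checksum and compression behavior above can be illustrated with a small sketch. These are Python stand-ins, not BlueStore code: `zlib.crc32` substitutes for BlueStore's default crc32c (a different polynomial), and the 0.875 required-ratio threshold is an assumed default.

```python
import zlib

# Illustration only: BlueStore checksums every block of data it writes;
# zlib.crc32 stands in for crc32c here to show the per-block mechanics.
def checksum_blocks(data: bytes, block_size: int = 4096) -> list:
    return [zlib.crc32(data[off:off + block_size])
            for off in range(0, len(data), block_size)]

# BlueStore keeps the compressed copy only when it saves enough space
# (an assumed required-ratio of 0.875 is used for this sketch).
def maybe_compress(data: bytes, required_ratio: float = 0.875) -> bytes:
    compressed = zlib.compress(data)
    return compressed if len(compressed) <= len(data) * required_ratio else data

payload = b"ceph " * 4096            # highly compressible test payload
stored = maybe_compress(payload)     # much smaller than the original
checksums = checksum_blocks(stored)  # one checksum per 4 KB block stored
print(len(payload), len(stored), len(checksums))
```

Incompressible or tiny writes fall back to storing the original bytes, so compression never costs space.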
7. • requires BlueStore to perform reasonably
• significant improvement in efficiency over 3x replication
• 2+2 → 2x overhead, 4+2 → 1.5x overhead
• small writes slower than replication
– early testing showed 4+2 is about half as fast as 3x replication
• large writes faster than replication
– less IO to device
• implementation still does the “simple” thing
– all writes update a full stripe
RADOS – RBD Over Erasure Code
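The overhead figures above follow directly from the k+m profile; a quick sketch of the arithmetic:

```python
# Storage overhead of a k+m erasure-code profile: raw bytes stored per
# logical byte is (k + m) / k, versus 3.0 for 3x replication.
def ec_overhead(k: int, m: int) -> float:
    return (k + m) / k

print(ec_overhead(2, 2))  # 2.0 -> same raw cost as 2x replication
print(ec_overhead(4, 2))  # 1.5 -> half the raw cost of 3x replication

# Small writes lose because the current implementation rewrites a full
# stripe (touching all k + m devices); large writes win because only
# (k + m) / k times the data is written instead of 3x.
```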
8. • ceph-mgr
– new management daemon to supplement ceph-mon (monitor)
– easier integration point for python management logic
– integrated metrics
• make ceph-mon scalable again
– offload pg stats from mon to mgr
– push to 10K OSDs (planned “big bang 3” @ CERN)
• new REST API
– pecan
– based on previous Calamari API
• built-in web dashboard
CEPH-MGR
9. AsyncMessenger
• AsyncMessenger
– Core Library included by all components
– Kernel TCP/IP driver
– Epoll/Kqueue event driven
– Maintain connection lifecycle and session
– replaces aging SimpleMessenger
– fixed size thread pool (vs 2 threads per socket)
– scales better to larger clusters
– healthier relationship with tcmalloc
– now the default!
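The epoll/kqueue-driven, fixed-thread-pool design can be sketched with Python's selectors module (an illustration of the event-loop idea, not Ceph code): one loop multiplexes many connections instead of SimpleMessenger's two threads per socket.

```python
import selectors
import socket

# One event loop multiplexing many connections, in the spirit of an
# AsyncMessenger worker driven by epoll (Linux) or kqueue (BSD/macOS).
sel = selectors.DefaultSelector()

def serve_once(sock):
    # Trivial stand-in for message dispatch: echo the payload uppercased.
    data = sock.recv(4096)
    if data:
        sock.sendall(data.upper())

# Two in-process connections handled by a single loop, no per-socket threads.
pairs = [socket.socketpair() for _ in range(2)]
for server_side, _ in pairs:
    server_side.setblocking(False)
    sel.register(server_side, selectors.EVENT_READ, serve_once)

for _, client_side in pairs:
    client_side.sendall(b"ping")

handled = 0
while handled < 2:
    for key, _ in sel.select(timeout=1):
        key.data(key.fileobj)   # invoke the registered handler
        handled += 1

replies = [client.recv(4096) for _, client in pairs]
print(replies)  # [b'PING', b'PING']
```

A fixed pool of such loops scales with core count rather than with connection count, which is why it behaves better on large clusters.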
11. • RDMA backend
– Inherits NetworkStack to implement RDMAStack
– Uses user-space verbs directly
– TCP as the control path
– Exchanges messages using RDMA SEND
– Uses a shared receive queue (SRQ)
– Multiple connection QPs in a many-to-many topology
– Built into Ceph master
– All features fully available in Ceph master
• Support:
– RHEL/CentOS
– InfiniBand and Ethernet
– RoCE v2 for cross-subnet traffic
– Front-end TCP and back-end RDMA
RDMA Support
12. Plugin               Default  Hardware Requirement  Performance  Network Compatibility  OSD Storage Engine  OSD Disk Requirement
Posix (Kernel)           YES      None                  Middle       TCP/IP compatible      None                None
DPDK + Userspace TCP/IP  NO       DPDK-supported NIC    High         TCP/IP compatible      BlueStore           Must be NVMe SSD
RDMA                     NO       RDMA-supported NIC    High         RDMA-supported         None                None
Messenger Plugins
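Choosing among these plugins is a ceph.conf setting; a hedged sketch follows (the ms_type values are Luminous's; the RDMA device option name is an assumption to verify against your release's documentation):

```ini
[global]
ms_type = async+posix          # default: kernel TCP/IP backend
# ms_type = async+dpdk         # userspace TCP/IP, needs a DPDK-capable NIC
# ms_type = async+rdma         # RDMA backend, needs an RDMA-capable network
# ms_async_rdma_device_name = mlx5_0   # RDMA device to bind (assumed name)
```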
17. RGW MISC
• NFS gateway
– NFSv4 and v3
– full object access (not general purpose!)
• dynamic bucket index sharding
– automatic (finally!)
• inline compression
• encryption
– follows S3 encryption APIs
• S3 and Swift API odds and ends
[Diagram: NFS-Client → NFS-Server (nfs-ganesha, NFSv4) → librgw-file → rados api → RADOS; Apps → S3 API / Swift API → RadosGW → rados api (RadosHandler) → RADOS]
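The "follows S3 encryption APIs" bullet above refers to conventions such as SSE-C (customer-provided keys). A minimal sketch of the request headers a client would send, using the standard S3 SSE-C header names; RGW encrypts server-side without storing the key:

```python
import base64
import hashlib
import secrets

# SSE-C convention: the client supplies a 256-bit key plus a base64 MD5
# digest of it with every request; header names are from the S3 SSE-C spec.
key = secrets.token_bytes(32)  # 256-bit customer-provided key

headers = {
    "x-amz-server-side-encryption-customer-algorithm": "AES256",
    "x-amz-server-side-encryption-customer-key": base64.b64encode(key).decode(),
    "x-amz-server-side-encryption-customer-key-MD5":
        base64.b64encode(hashlib.md5(key).digest()).decode(),
}
print(sorted(headers))
```

Any S3 SDK can send these (for example, boto3 accepts the raw key via its SSE-C parameters and encodes it for you).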