BLUESTORE: A NEW, FASTER STORAGE BACKEND
FOR CEPH
SAGE WEIL
2016.06.21
2
OUTLINE
● Ceph background and context
  – FileStore, and why POSIX failed us
  – NewStore – a hybrid approach
● BlueStore – a new Ceph OSD backend
  – Metadata
  – Data
● Performance
● Status and availability
● Summary
MOTIVATION
4
CEPH
● Object, block, and file storage in a single cluster
● All components scale horizontally
● No single point of failure
● Hardware agnostic, commodity hardware
● Self-manage whenever possible
● Open source (LGPL)
● “A Scalable, High-Performance Distributed File System”
● “performance, reliability, and scalability”
5
CEPH COMPONENTS
RGW – a web services gateway for object storage, compatible with S3 and Swift (OBJECT)
RBD – a reliable, fully-distributed block device with cloud platform integration (BLOCK)
CEPHFS – a distributed file system with POSIX semantics and scale-out metadata management (FILE)
LIBRADOS – a library allowing apps to directly access RADOS (C, C++, Java, Python, Ruby, PHP)
RADOS – a software-based, reliable, autonomous, distributed object store comprised of self-healing, self-managing, intelligent storage nodes and lightweight monitors
6
OBJECT STORAGE DAEMONS (OSDS)
[Diagram: a cluster of OSDs, each running on a local file system (xfs, btrfs, or ext4) over a disk, alongside monitor daemons (M)]
7
OBJECT STORAGE DAEMONS (OSDS)
[Diagram: the same cluster of OSDs, each with a FileStore backend on its local file system (xfs, btrfs, or ext4) over a disk, alongside monitor daemons (M)]
8
OBJECTSTORE AND DATA MODEL
● ObjectStore
  – abstract interface for storing local data
  – EBOFS, FileStore
● EBOFS
  – a user-space extent-based object file system
  – deprecated in favor of FileStore on btrfs in 2009
● Object – “file”
  – data (file-like byte stream)
  – attributes (small key/value)
  – omap (unbounded key/value)
● Collection – “directory”
  – placement group shard (slice of the RADOS pool)
  – sharded by 32-bit hash value
● All writes are transactions
  – Atomic + Consistent + Durable
  – Isolation provided by OSD
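The data model above maps onto a handful of containers. Below is a minimal, self-contained C++ sketch of an object (byte data, small attrs, unbounded omap) inside a hash-sharded collection; the type and field names are illustrative stand-ins, not the actual Ceph ObjectStore API.

  // Illustrative sketch of the ObjectStore data model described above.
  // Names (SimpleObject, SimpleCollection) are hypothetical, not Ceph's API.
  #include <cstdint>
  #include <iostream>
  #include <map>
  #include <string>
  #include <vector>

  struct SimpleObject {
    std::vector<char> data;                       // file-like byte stream
    std::map<std::string, std::string> attrs;     // small key/value attributes
    std::map<std::string, std::string> omap;      // unbounded key/value pairs
  };

  struct SimpleCollection {                       // a placement group shard
    uint32_t hash_prefix;                         // objects sharded by 32-bit hash
    uint32_t bits;                                // number of significant prefix bits
    std::map<std::string, SimpleObject> objects;  // ordered by name
  };

  int main() {
    SimpleCollection c{0x3d3e0000, 19, {}};
    SimpleObject o;
    o.data.assign(4096, '\0');                    // write some bytes
    o.attrs["_"] = "object_info";                 // update an attribute
    o.omap["0000000005.00000000000000000006"] = "log entry";  // omap insert
    c.objects["foo"] = o;
    std::cout << "objects in collection: " << c.objects.size() << "\n";
  }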
9
FILESTORE
● FileStore
  – PG = collection = directory
  – object = file
● Leveldb
  – large xattr spillover
  – object omap (key/value) data
● Originally just for development...
  – later, only supported backend (on XFS)
● /var/lib/ceph/osd/ceph-123/
  – current/
    ● meta/
      – osdmap123
      – osdmap124
    ● 0.1_head/
      – object1
      – object12
    ● 0.7_head/
      – object3
      – object5
    ● 0.a_head/
      – object4
      – object6
    ● db/
      – <leveldb files>
10
● OSD carefully manages
consistency of its data
● All writes are transactions
– we need A+C+D; OSD provides I
● Most are simple
– write some bytes to object (file)
– update object attribute (file
xattr)
– append to update log (leveldb
insert)
...but others are arbitrarily
large/complex
[
  {
    "op_name": "write",
    "collection": "0.6_head",
    "oid": "#0:73d87003:::benchmark_data_gnit_10346_object23:head#",
    "length": 4194304,
    "offset": 0,
    "bufferlist length": 4194304
  },
  {
    "op_name": "setattrs",
    "collection": "0.6_head",
    "oid": "#0:73d87003:::benchmark_data_gnit_10346_object23:head#",
    "attr_lens": {
      "_": 269,
      "snapset": 31
    }
  },
  {
    "op_name": "omap_setkeys",
    "collection": "0.6_head",
    "oid": "#0:60000000::::head#",
    "attr_lens": {
      "0000000005.00000000000000000006": 178,
      "_info": 847
    }
  }
]
POSIX FAILS: TRANSACTIONS
11
POSIX FAILS: TRANSACTIONS
● Btrfs transaction hooks
/* trans start and trans end are dangerous, and only for
* use by applications that know how to avoid the
* resulting deadlocks
*/
#define BTRFS_IOC_TRANS_START _IO(BTRFS_IOCTL_MAGIC, 6)
#define BTRFS_IOC_TRANS_END _IO(BTRFS_IOCTL_MAGIC, 7)
● Writeback ordering
#define BTRFS_MOUNT_FLUSHONCOMMIT (1 << 7)
● What if we hit an error? ceph-osd process dies?
#define BTRFS_MOUNT_WEDGEONTRANSABORT (1 << …)
– There is no rollback...
12
POSIX FAILS: TRANSACTIONS
● Write-ahead journal
– serialize and journal every ObjectStore::Transaction
– then write it to the file system
● Btrfs parallel journaling
– periodic sync takes a snapshot, then trim old journal entries
– on OSD restart: rollback and replay journal against last snapshot
● XFS/ext4 write-ahead journaling
– periodic sync, then trim old journal entries
– on restart, replay entire journal
– lots of ugly hackery to deal with events that aren't idempotent
● e.g., renames, collection delete + create, …
● full data journal → we double write everything → ~halve disk throughput
13
POSIX FAILS: ENUMERATION
● Ceph objects are distributed by a 32-bit hash
● Enumeration is in hash order
  – scrubbing
  – “backfill” (data rebalancing, recovery)
  – enumeration via librados client API
● POSIX readdir is not well-ordered
● Need O(1) “split” for a given shard/range
● Build directory tree by hash-value prefix (see the sketch after the listing below)
  – split any directory when size > ~100 files
  – merge when size < ~50 files
  – read entire directory, sort in-memory
…
DIR_A/
DIR_A/A03224D3_qwer
DIR_A/A247233E_zxcv
…
DIR_B/
DIR_B/DIR_8/
DIR_B/DIR_8/B823032D_foo
DIR_B/DIR_8/B8474342_bar
DIR_B/DIR_9/
DIR_B/DIR_9/B924273B_baz
DIR_B/DIR_A/
DIR_B/DIR_A/BA4328D2_asdf
…
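For contrast with the directory scheme above, a sketch of what an ordered key/value layout provides for free: keys sorted by the 32-bit hash enumerate in hash order, and a shard “split” is just a key-range boundary. The container and key encoding are assumptions for illustration, not Ceph code.

  // Hash-ordered enumeration and O(1) "split" via key ranges (illustrative).
  #include <cstdint>
  #include <cstdio>
  #include <iterator>
  #include <map>
  #include <string>

  int main() {
    // key = 32-bit hash, value = object name; std::map keeps keys sorted.
    std::map<uint32_t, std::string> objects = {
      {0xA03224D3, "qwer"}, {0xA247233E, "zxcv"}, {0xB823032D, "foo"},
      {0xB8474342, "bar"},  {0xB924273B, "baz"},  {0xBA4328D2, "asdf"},
    };

    // Enumerate in hash order: no per-directory read + in-memory sort needed.
    for (const auto& [hash, name] : objects)
      std::printf("%08X %s\n", hash, name.c_str());

    // "Split" the shard covering hashes with prefix 0xB8: two lookups give an
    // iterator range; no files are moved.
    auto lo = objects.lower_bound(0xB8000000);
    auto hi = objects.lower_bound(0xB9000000);
    std::printf("objects in 0xB8 sub-shard: %ld\n",
                static_cast<long>(std::distance(lo, hi)));
  }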
NEWSTORE
15
NEW OBJECTSTORE GOALS
● More natural transaction atomicity
● Avoid double writes
● Efficient object enumeration
● Efficient clone operation
● Efficient splice (“move these bytes from object X to object Y”)
● Efficient IO pattern for HDDs, SSDs, NVMe
● Minimal locking, maximum parallelism (between PGs)
● Full data and metadata checksums
● Inline compression
16
NEWSTORE – WE MANAGE NAMESPACE
● POSIX has the wrong metadata model for us
● Ordered key/value is perfect match
– well-defined object name sort order
– efficient enumeration and random lookup
● NewStore = rocksdb + object files
  – /var/lib/ceph/osd/ceph-123/
    ● db/
      – <rocksdb, leveldb, whatever>
    ● blobs.1/
      – 0
      – 1
      – ...
    ● blobs.2/
      – 100000
      – 100001
      – ...
[Diagram: OSDs running NewStore on HDD and SSD devices, each pairing RocksDB with object files]
17
● RocksDB has a write-ahead log
“journal”
● XFS/ext4(/btrfs) have their own
journal (tree-log)
● Journal-on-journal has high
overhead
– each journal manages half of
overall consistency, but incurs
the same overhead
● write(2) + fsync(2) to new blobs.2/10302
1 write + flush to block device
1 write + flush to XFS/ext4 journal
● write(2) + fsync(2) on RocksDB log
1 write + flush to block device
1 write + flush to XFS/ext4 journal
NEWSTORE FAIL: CONSISTENCY OVERHEAD
18
● We can't overwrite a POSIX file as part of an atomic transaction
– (we must preserve old data until the transaction commits)
● Writing overwrite data to a new file means many files for each object
● Write-ahead logging
– put overwrite data in “WAL” records in RocksDB
– commit atomically with transaction
– then overwrite original file data
– ...but then we're back to a double-write for overwrites
● Performance sucks again
● Overwrites dominate RBD block workloads
NEWSTORE FAIL: ATOMICITY NEEDS WAL
BLUESTORE
20
● BlueStore = Block + NewStore
– consume raw block device(s)
– key/value database (RocksDB) for metadata
– data written directly to block device
– pluggable block Allocator (policy)
● We must share the block device with RocksDB
– implement our own rocksdb::Env
– implement tiny “file system” BlueFS
– make BlueStore and BlueFS share device(s)
BLUESTORE
[Diagram: BlueStore implements the ObjectStore interface; data goes straight to the block device(s) through the Allocator, while metadata goes to RocksDB via BlueRocksEnv on BlueFS, which shares the same block device(s)]
21
ROCKSDB: BLUEROCKSENV + BLUEFS
● class BlueRocksEnv : public rocksdb::EnvWrapper
– passes file IO operations to BlueFS
● BlueFS is a super-simple “file system”
– all metadata loaded in RAM on start/mount
– no need to store block free list
– coarse allocation unit (1 MB blocks)
– all metadata is written to a journal
– journal rewritten/compacted when it gets large
[Diagram: BlueFS on-disk layout: superblock, journal (and more journal), and data extents; journal records such as “file 10”, “file 11”, “file 12”, “file 13”, “rm file 12”, ...]
● Map “directories” to different block devices
  – db.wal/ – on NVRAM, NVMe, SSD
  – db/ – level0 and hot SSTs on SSD
  – db.slow/ – cold SSTs on HDD
● BlueStore periodically balances free space
22
ROCKSDB: JOURNAL RECYCLING
● rocksdb LogReader only understands two modes
– read until end of file (need accurate file size)
– read all valid records, then ignore zeros at end (need zeroed tail)
● writing to “fresh” log “files” means >1 IO for a log append
● modified upstream rocksdb to re-use previous log files
– now resembles “normal” journaling behavior over a circular buffer
● works with vanilla RocksDB on files and on BlueFS
23
MULTI-DEVICE SUPPORT
● Single device
  – HDD or SSD
    ● rocksdb
    ● object data
● Two devices
  – 128MB of SSD or NVRAM
    ● rocksdb WAL
  – big device
    ● everything else
● Two devices
  – a few GB of SSD
    ● rocksdb WAL
    ● rocksdb (warm data)
  – big device
    ● rocksdb (cold data)
    ● object data
● Three devices
  – 128MB NVRAM
    ● rocksdb WAL
  – a few GB SSD
    ● rocksdb (warm data)
  – big device
    ● rocksdb (cold data)
    ● object data
METADATA
25
BLUESTORE METADATA
● Partition namespace for different metadata
– S* – “superblock” metadata for the entire store
– B* – block allocation metadata (free block bitmap)
– T* – stats (bytes used, compressed, etc.)
– C* – collection name → cnode_t
– O* – object name → onode_t or bnode_t
– L* – write-ahead log entries, promises of future IO
– M* – omap (user key/value data, stored in objects)
26
● Collection metadata
– Interval of object namespace
shard pool hash name bits
C<NOSHARD,12,3d3e0000> “12.e3d3” = <19>
shard pool hash name snap gen
O<NOSHARD,12,3d3d880e,foo,NOSNAP,NOGEN> = …
O<NOSHARD,12,3d3d9223,bar,NOSNAP,NOGEN> = …
O<NOSHARD,12,3d3e02c2,baz,NOSNAP,NOGEN> = …
O<NOSHARD,12,3d3e125d,zip,NOSNAP,NOGEN> = …
O<NOSHARD,12,3d3e1d41,dee,NOSNAP,NOGEN> = …
O<NOSHARD,12,3d3e3832,dah,NOSNAP,NOGEN> = …
struct spg_t {
  uint64_t pool;
  uint32_t hash;
  shard_id_t shard;
};
struct bluestore_cnode_t {
  uint32_t bits;
};
● Nice properties
  – Ordered enumeration of objects
  – We can “split” collections by adjusting metadata only (see the key-range sketch below)
CNODE
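A sketch of why split is metadata-only: the cnode's (hash, bits) pair bounds a contiguous range of the ordered object-key space, so halving a collection just writes two cnodes with one more significant bit while the object keys stay put. The key arithmetic below is simplified and illustrative; the real keys also carry shard, pool, name, snap, and gen.

  // Illustrative: a collection's (hash, bits) pair bounds a contiguous range
  // of object keys, so splitting a collection only rewrites cnode metadata.
  #include <cstdint>
  #include <cstdio>

  struct cnode {          // simplified stand-in for bluestore_cnode_t plus its key
    uint32_t hash;        // hash prefix of the collection
    uint32_t bits;        // number of significant bits
  };

  // [start, end) of the 32-bit hash space covered by a collection.
  static void range(const cnode& c, uint64_t* start, uint64_t* end) {
    uint64_t span = 1ULL << (32 - c.bits);
    *start = c.hash & ~(span - 1);
    *end = *start + span;
  }

  int main() {
    cnode parent{0x3d3e0000, 19};
    uint64_t s, e;
    range(parent, &s, &e);
    std::printf("parent covers [%08llx, %08llx)\n",
                (unsigned long long)s, (unsigned long long)e);

    // Splitting: two children with bits+1; object keys are untouched because
    // enumeration is already ordered by hash.
    cnode child_a{(uint32_t)s, parent.bits + 1};
    cnode child_b{(uint32_t)(s + ((e - s) >> 1)), parent.bits + 1};
    range(child_a, &s, &e);
    std::printf("child A covers [%08llx, %08llx)\n",
                (unsigned long long)s, (unsigned long long)e);
    range(child_b, &s, &e);
    std::printf("child B covers [%08llx, %08llx)\n",
                (unsigned long long)s, (unsigned long long)e);
  }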
27
● Per object metadata
– Lives directly in key/value pair
– Serializes to 100s of bytes
● Size in bytes
● Inline attributes (user attr data)
● Data pointers (user byte data)
– lextent_t → (blob, offset, length)
– blob → (disk extents, csums, ...)
● Omap prefix/ID (user k/v data)
struct bluestore_onode_t {
  uint64_t size;
  map<string,bufferptr> attrs;
  map<uint64_t,bluestore_lextent_t> extent_map;
  uint64_t omap_head;
};
struct bluestore_blob_t {
  vector<bluestore_pextent_t> extents;
  uint32_t compressed_length;
  bluestore_extent_ref_map_t ref_map;
  uint8_t csum_type, csum_order;
  bufferptr csum_data;
};
struct bluestore_pextent_t {
  uint64_t offset;
  uint64_t length;
};
ONODE
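A sketch of how these structures resolve a read: the onode's extent_map finds the covering lextent, the lextent names a blob, and the blob's pextents give the physical location. The types are simplified stand-ins for the structs above (for example, blobs here live only in the onode's blob_map); this is illustrative, not the real Ceph code.

  // Illustrative logical-offset -> disk-offset lookup over the onode/blob
  // structures sketched above (simplified types).
  #include <cstdint>
  #include <cstdio>
  #include <map>
  #include <vector>

  struct pextent { uint64_t offset, length; };       // physical disk extent
  struct blob    { std::vector<pextent> extents; };  // plus csums, ref_map, ...
  struct lextent { int64_t blob_id; uint64_t blob_off, length; };

  struct onode {
    uint64_t size = 0;
    std::map<uint64_t, lextent> extent_map;          // logical offset -> lextent
    std::map<int64_t, blob> blob_map;                // blobs local to this onode
  };

  // Map a logical offset to a disk offset, or return false if unallocated.
  static bool lookup(const onode& o, uint64_t loff, uint64_t* disk_off) {
    auto it = o.extent_map.upper_bound(loff);
    if (it == o.extent_map.begin()) return false;
    --it;                                            // last extent starting <= loff
    uint64_t delta = loff - it->first;
    if (delta >= it->second.length) return false;    // hole between extents
    const blob& b = o.blob_map.at(it->second.blob_id);
    uint64_t boff = it->second.blob_off + delta;     // offset within the blob
    for (const pextent& pe : b.extents) {            // walk physical extents
      if (boff < pe.length) { *disk_off = pe.offset + boff; return true; }
      boff -= pe.length;
    }
    return false;
  }

  int main() {
    onode o;
    o.size = 8192;
    o.blob_map[1] = blob{{{1048576, 4096}, {2097152, 4096}}};
    o.extent_map[0] = lextent{1, 0, 8192};           // 8 KB logical, two pextents
    uint64_t d;
    if (lookup(o, 6000, &d))
      std::printf("logical 6000 -> disk %llu\n", (unsigned long long)d);
  }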
28
● Blob metadata
– Usually blobs stored in the onode
– Sometimes we share blocks between objects (usually clones/snaps)
– We need to reference count those extents
– We still want to split collections and repartition extent metadata by hash
shard pool hash name snap gen
O<NOSHARD,12,3d3d9223,bar,NOSNAP,NOGEN> = onode
O<NOSHARD,12,3d3e02c2> = bnode
O<NOSHARD,12,3d3e02c2,baz,NOSNAP,NOGEN> = onode
O<NOSHARD,12,3d3e125d> = bnode
O<NOSHARD,12,3d3e125d,zip,NOSNAP,NOGEN> = onode
O<NOSHARD,12,3d3e1d41,dee,NOSNAP,NOGEN> = onode
O<NOSHARD,12,3d3e3832,dah,NOSNAP,NOGEN> = onode
● onode value includes, and bnode value is
map<int64_t,bluestore_blob_t> blob_map;
● lextent blob ids
– > 0 → blob in onode
– < 0 → blob in bnode
BNODE
29
● We scrub... periodically
– window before we detect error
– we may read bad data
– we may not be sure which copy
is bad
● We want to validate checksum
on every read
● Must store more metadata in the
blobs
– 32-bit csum metadata for 4MB
object and 4KB blocks = 4KB
– larger csum blocks
● csum_order > 12
– smaller csums
● crc32c_8 or 16
● IO hints
– seq read + write → big chunks
– compression → big chunks
● Per-pool policy
CHECKSUMS
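The sizing trade-off above is straightforward arithmetic: checksum metadata = (object size / checksum block size) x checksum width. A small sketch under that assumption, reproducing the 4 KB figure for a 4 MB object with 4 KB blocks and showing how a larger csum_order or a narrower checksum shrinks it:

  // Checksum metadata cost = (object_size / csum_block) * csum_width.
  // csum_order is log2 of the checksum block size, as described above.
  #include <cstdint>
  #include <cstdio>

  static uint64_t csum_bytes(uint64_t object_size, unsigned csum_order,
                             unsigned csum_width_bytes) {
    uint64_t block = 1ULL << csum_order;
    return (object_size / block) * csum_width_bytes;
  }

  int main() {
    const uint64_t obj = 4ULL << 20;                        // 4 MB object
    std::printf("crc32c,    4 KB blocks: %llu bytes\n",     // 32-bit csum, order 12
                (unsigned long long)csum_bytes(obj, 12, 4));    // -> 4096
    std::printf("crc32c,  128 KB blocks: %llu bytes\n",     // larger csum_order
                (unsigned long long)csum_bytes(obj, 17, 4));    // -> 128
    std::printf("crc32c_8,  4 KB blocks: %llu bytes\n",     // smaller 8-bit csum
                (unsigned long long)csum_bytes(obj, 12, 1));    // -> 1024
  }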
30
● 3x replication is expensive
– Any scale-out cluster is expensive
● Lots of stored data is (highly) compressible
● Need largish extents to get compression benefit (64 KB, 128 KB)
– may need to support small (over)writes
– overwrites occlude/obscure compressed blobs
– compacted (rewritten) when >N layers deep
INLINE COMPRESSION
[Diagram: object layout from start of object to end of object, with regions marked allocated, written, and written (compressed), and an uncompressed blob overlapping part of a compressed one after an overwrite]
DATA PATH
32
Terms
● Sequencer
– An independent, totally ordered
queue of transactions
– One per PG
● TransContext
– State describing an executing
transaction
DATA PATH BASICS
Two ways to write
● New allocation
– Any write larger than
min_alloc_size goes to a new,
unused extent on disk
– Once that IO completes, we
commit the transaction
● WAL (write-ahead-logged)
– Commit temporary promise to
(over)write data with transaction
● includes data!
– Do async overwrite
– Then clean up the temporary k/v pair (see the sketch below)
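A sketch of the decision just described, assuming a single min_alloc_size threshold: large writes go to newly allocated space and the transaction commits once the AIO completes, while small overwrites are committed as a WAL record (data included) and applied asynchronously. This is schematic; the real path also deals with alignment, compression, and blob reuse.

  // Schematic write-path split: new allocation vs. write-ahead-logged overwrite.
  #include <cstdint>
  #include <cstdio>
  #include <string>
  #include <vector>

  struct WalRecord { std::string object; uint64_t offset; std::vector<char> data; };

  struct TransContext {
    bool uses_wal = false;
    std::vector<WalRecord> wal;        // promises of future IO, kept in the k/v store
  };

  static const uint64_t min_alloc_size = 64 * 1024;   // assumed threshold

  void queue_write(TransContext& txc, const std::string& oid,
                   uint64_t offset, const std::vector<char>& data) {
    if (data.size() >= min_alloc_size) {
      // New allocation: write to unused space now; commit k/v after the AIO completes.
      std::printf("%s: %zu bytes -> new extent, commit after aio\n",
                  oid.c_str(), data.size());
    } else {
      // WAL: commit the intent (with data) in the same k/v transaction,
      // overwrite in place later, then delete the temporary k/v record.
      txc.uses_wal = true;
      txc.wal.push_back({oid, offset, data});
      std::printf("%s: %zu bytes -> wal record, apply async\n",
                  oid.c_str(), data.size());
    }
  }

  int main() {
    TransContext txc;
    queue_write(txc, "rbd_data.1", 0, std::vector<char>(4 * 1024 * 1024));
    queue_write(txc, "rbd_data.1", 8192, std::vector<char>(4096));
  }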
33
TRANSCONTEXT STATE MACHINE
[State machine diagram: a TransContext moves through PREPARE → AIO_WAIT (initiate some AIO) → KV_QUEUED → KV_COMMITTING → FINISH, with WAL transactions continuing through WAL_QUEUED → WAL_AIO_WAIT → WAL_CLEANUP before FINISH; the Sequencer queue holds TransContexts in order, waiting for the next TransContext(s) to be ready and for the next commit batch]
34
● OnodeSpace per collection
– in-memory ghobject_t → Onode map of decoded onodes
● BufferSpace for in-memory blobs
– may contain cached on-disk data
● Both buffers and onodes have lifecycles linked to a Cache
– LRUCache – trivial LRU
– TwoQCache – implements 2Q cache replacement algorithm (default)
● Cache is sharded for parallelism
– Collection → shard mapping matches OSD's op_wq
– same CPU context that processes client requests will touch the LRU/2Q lists
– IO completion execution not yet sharded – TODO?
CACHING
35
● FreelistManager
– persist list of free extents to key/value
store
– prepare incremental updates for allocate
or release
● Initial implementation
– extent-based
<offset> = <length>
– kept in-memory copy
– enforces an ordering on commits; freelist
updates had to pass through single
thread/lock
del 1600=100000
put 1700=0fff00
– small initial memory footprint, very
expensive when fragmented
● New bitmap-based approach
<offset> = <region bitmap>
– where region is N blocks
● 128 blocks = 8 bytes
– use k/v merge operator to XOR
allocation or release
merge 10=0000000011
merge 20=1110000000
– RocksDB log-structured-merge
tree coalesces keys during
compaction
– no in-memory state
BLOCK FREE LIST
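A sketch of the bitmap approach: each key covers a fixed region of blocks, the value is a bitmap, and both allocation and release simply XOR bits in, which is why they can be expressed as merge operands and folded together during LSM compaction with no in-memory copy. The region size and bit convention below are assumptions for illustration.

  // Bitmap freelist via XOR "merge" operands, as sketched above (illustrative).
  #include <cstdint>
  #include <cstdio>
  #include <map>

  static const unsigned blocks_per_region = 64;   // assumed: 64 blocks -> 8 bytes/key

  // Apply one merge operand: flip the bits for the touched blocks.
  // The same operation records both an allocation and a release.
  void merge(std::map<uint64_t, uint64_t>& kv, uint64_t region, uint64_t bits) {
    kv[region] ^= bits;                           // each bit toggles one block's state
  }

  int main() {
    std::map<uint64_t, uint64_t> kv;              // stand-in for the k/v store

    uint64_t alloc = 0b11ULL << 4;                // allocate blocks 4-5 in region 10
    merge(kv, 10, alloc);
    std::printf("region 10 after alloc:   %016llx\n", (unsigned long long)kv[10]);

    merge(kv, 10, 0b1ULL << 4);                   // release block 4
    std::printf("region 10 after release: %016llx\n", (unsigned long long)kv[10]);

    // Because XOR is associative, a long run of merge operands can be folded
    // into one value during LSM compaction, keeping no freelist state in RAM.
    return 0;
  }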
36
● Allocator
– abstract interface to allocate blocks
● StupidAllocator
– extent-based
– bin free extents by size (powers of
2)
– choose sufficiently large extent
closest to hint
– highly variable memory usage
● btree of free extents
– implemented, works
– based on ancient ebofs policy
● BitmapAllocator
– hierarchy of indexes
● L1: 2 bits = 2^6 blocks
● L2: 2 bits = 2^12 blocks
● ...
00 = all free, 11 = all used,
01 = mix
– fixed memory consumption
● ~35 MB RAM per TB
BLOCK ALLOCATOR
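A sketch of the 2-bit summary idea behind the BitmapAllocator: each level-1 entry marks a fixed group of blocks as all free (00), all used (11), or mixed (01), so a search can skip fully used groups without scanning per-block bits. The group size, two-level depth, and encoding here are assumptions for illustration.

  // Two-level bitmap with 2-bit summaries (00 free, 11 used, 01 mixed):
  // an illustrative sketch of the hierarchy described above.
  #include <algorithm>
  #include <cstdint>
  #include <cstdio>
  #include <vector>

  struct BitmapIndex {
    static constexpr unsigned group = 64;         // blocks summarized per L1 entry
    std::vector<bool> used;                       // L0: one bit per block
    std::vector<uint8_t> l1;                      // L1: 0=all free, 3=all used, 1=mixed

    explicit BitmapIndex(size_t blocks)
        : used(blocks, false), l1((blocks + group - 1) / group, 0) {}

    void set(size_t b, bool v) {
      used[b] = v;
      size_t g = b / group, lo = g * group, hi = std::min(used.size(), lo + group);
      bool any = false, all = true;
      for (size_t i = lo; i < hi; ++i) { any |= used[i]; all &= used[i]; }
      l1[g] = all ? 3 : (any ? 1 : 0);            // recompute the 2-bit summary
    }

    // Find a free block, skipping fully-used groups via the summary level.
    long find_free() const {
      for (size_t g = 0; g < l1.size(); ++g) {
        if (l1[g] == 3) continue;                 // all used: skip 64 blocks at once
        for (size_t i = g * group; i < std::min(used.size(), (g + 1) * group); ++i)
          if (!used[i]) return (long)i;
      }
      return -1;
    }
  };

  int main() {
    BitmapIndex idx(256);
    for (size_t i = 0; i < 64; ++i) idx.set(i, true);        // fill the first group
    std::printf("first free block: %ld\n", idx.find_free()); // -> 64
  }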
37
● Let's support them natively!
● 256 MB zones / bands
– must be written sequentially,
but not all at once
– libzbc supports ZAC and ZBC
HDDs
– host-managed or host-aware
● SMRAllocator
– write pointer per zone
– used + free counters per zone
– Bonus: almost no memory!
● IO ordering
– must ensure allocated writes
reach disk in order
● Cleaning
– store k/v hints
zone offset → object hash
– pick emptiest closed zone, scan
hints, move objects that are still
there
– opportunistically rewrite objects
we read if the zone is flagged
for cleaning soon
SMR HDD
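A sketch of the write-pointer-per-zone idea: allocation only ever advances a zone's write pointer, per-zone used counters track live data, and the cleaner later picks the emptiest closed zone. The 256 MB zone size matches the slide; the selection policy and structure names are assumptions for illustration.

  // Write-pointer-per-zone allocation sketch for SMR-style zones (illustrative).
  #include <cstdint>
  #include <cstdio>
  #include <vector>

  struct Zone {
    uint64_t write_pointer = 0;   // next sequential offset within the zone
    uint64_t used = 0;            // live bytes (decremented on object deletion)
  };

  struct SmrAllocator {
    static const uint64_t zone_size = 256ULL << 20;   // 256 MB zones
    std::vector<Zone> zones;

    explicit SmrAllocator(size_t n) : zones(n) {}

    // Allocate len bytes; they must be written sequentially at the returned offset.
    bool allocate(uint64_t len, uint64_t* disk_off) {
      for (size_t z = 0; z < zones.size(); ++z) {
        if (zones[z].write_pointer + len <= zone_size) {
          *disk_off = z * zone_size + zones[z].write_pointer;
          zones[z].write_pointer += len;
          zones[z].used += len;
          return true;
        }
      }
      return false;                                   // no zone has room: clean first
    }

    // Cleaning candidate: the emptiest fully-written ("closed") zone.
    long cleaning_candidate() const {
      long best = -1;
      for (size_t z = 0; z < zones.size(); ++z)
        if (zones[z].write_pointer == zone_size &&
            (best < 0 || zones[z].used < zones[best].used))
          best = (long)z;
      return best;
    }
  };

  int main() {
    SmrAllocator alloc(4);
    uint64_t off;
    if (alloc.allocate(4 << 20, &off))
      std::printf("allocated 4 MB at offset %llu\n", (unsigned long long)off);
  }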
PERFORMANCE
39
HDD: SEQUENTIAL WRITE
[Chart: “Ceph 10.1.0 Bluestore vs Filestore Sequential Writes”; throughput (MB/s) by IO size; series: FS HDD, BS HDD]
40
HDD: RANDOM WRITE
[Charts: “Ceph 10.1.0 Bluestore vs Filestore Random Writes”; throughput (MB/s) and IOPS by IO size; series: FS HDD, BS HDD]
41
HDD: SEQUENTIAL READ
[Chart: “Ceph 10.1.0 Bluestore vs Filestore Sequential Reads”; throughput (MB/s) by IO size; series: FS HDD, BS HDD]
42
HDD: RANDOM READ
[Charts: “Ceph 10.1.0 Bluestore vs Filestore Random Reads”; throughput (MB/s) and IOPS by IO size; series: FS HDD, BS HDD]
43
SSD AND NVME?
● NVMe journal
– random writes ~2x faster
– some testing anomalies (problem with test rig kernel?)
● SSD only
– similar to HDD result
– small write benefit is more pronounced
● NVMe only
– more testing anomalies on test rig.. WIP
STATUS
45
STATUS
● Done
– fully functional IO path with checksums and compression
– fsck
– bitmap-based allocator and freelist
● Current efforts
– optimize metadata encoding efficiency
– performance tuning
– ZetaScale key/value db as RocksDB alternative
– bounds on compressed blob occlusion
● Soon
– per-pool properties that map to compression, checksum, IO hints
– more performance optimization
– native SMR HDD support
– SPDK (kernel bypass for NVMe devices)
46
AVAILABILITY
● Experimental backend in Jewel v10.2.z (just released)
– enable experimental unrecoverable data corrupting features = bluestore rocksdb
– ceph-disk --bluestore DEV
● no multi-device magic provisioning just yet
– predates checksums and compression
● Current master
– new disk format
– checksums
– compression
● The goal...
– stable in Kraken (Fall '16)
– default in Luminous (Spring '17)
47
SUMMARY
● Ceph is great
● POSIX was poor choice for storing objects
● RocksDB rocks and was easy to embed
● Our new BlueStore backend is awesome
● Full data checksums and inline compression!
THANK YOU!
Sage Weil
CEPH PRINCIPAL ARCHITECT
sage@redhat.com
@liewegas