Bookie Storage
Matteo Merli
BookKeeper
▪ Provides distributed logs (ledgers)
▪ BookKeeper client + Bookies
▪ Client API can be summarized as:
› createLedger() → ledgerId
› ledger.addEntry(data) → entryId
› ledger.readEntry(ledgerId, entryId)
› deleteLedger(ledgerId)
▪ BK Client library implements all the “logic”
› Consistency, metadata in ZK, fencing, recovery, replication
▪ Bookie servers are responsible for storing the data
Bookie external interface
▪ Simple primitives
› addEntry(ledgerId, entryId, payload) → OK
› readEntry(ledgerId, entryId) → payload
› getLastEntry(ledgerId) → entryId
▪ Is that all??
› Fence flag on readEntry() → no more writes allowed to a ledger
› Deletion → background garbage collection
› Auto replication → it's a different logical component that uses the BK client API
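The primitives above, including the fence flag, can be sketched with a toy in-memory model. This is not BookKeeper code: the class name, the `fence` keyword argument, and the use of -1 for an empty ledger are assumptions made for illustration.

```python
class FencedError(Exception):
    """Raised when a write hits a ledger that has been fenced."""


class ToyBookie:
    """In-memory stand-in for the three bookie primitives (illustrative only)."""

    def __init__(self):
        self.entries = {}      # (ledger_id, entry_id) -> payload
        self.last_entry = {}   # ledger_id -> highest entry_id stored so far
        self.fenced = set()    # ledger ids that no longer accept writes

    def add_entry(self, ledger_id, entry_id, payload):
        if ledger_id in self.fenced:
            raise FencedError(f"ledger {ledger_id} is fenced")
        self.entries[(ledger_id, entry_id)] = payload
        previous = self.last_entry.get(ledger_id, -1)
        self.last_entry[ledger_id] = max(previous, entry_id)
        return "OK"

    def read_entry(self, ledger_id, entry_id, fence=False):
        # A read carrying the fence flag blocks all later writes to the ledger.
        if fence:
            self.fenced.add(ledger_id)
        return self.entries[(ledger_id, entry_id)]

    def get_last_entry(self, ledger_id):
        # -1 means "no entries yet" in this sketch.
        return self.last_entry.get(ledger_id, -1)
```

Once any reader has set the fence flag, `add_entry` on that ledger raises, which is how a recovering client can be sure no further writes land on the bookie.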
Interleaved storage
▪ Default bookie storage
▪ Uses a journal on a separate device
› Every entry is synced to the journal
▪ Entries are also written to "entryLog" files as they come in
› Writes on the entryLog are periodically flushed in background
› Entries are appended to the current entry log file
› When entryLog reaches 2GB, a new one is created
› Entries for multiple ledgers are interleaved in the same entry log
▪ Need to maintain an index (ledgerId, entryId) → (entryLogId, offset)
› Default implementation uses a file for each ledger to store the data locations
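A minimal sketch of the interleaved layout and its index, under simplifying assumptions: entry logs are in-memory byte arrays rather than files, and the index also records the entry length because this toy writes no length prefix. All names are hypothetical.

```python
class ToyEntryLogger:
    """Appends entries from all ledgers to one current entry log (illustrative)."""

    def __init__(self, max_log_size=2 * 1024**3):
        self.max_log_size = max_log_size
        self.logs = {0: bytearray()}   # entry_log_id -> contents
        self.current = 0               # id of the entry log being appended to
        self.index = {}                # (ledger_id, entry_id) -> (entry_log_id, offset, length)

    def add_entry(self, ledger_id, entry_id, payload):
        log = self.logs[self.current]
        # Roll over to a new entry log once the current one would exceed the limit.
        if log and len(log) + len(payload) > self.max_log_size:
            self.current += 1
            log = self.logs[self.current] = bytearray()
        offset = len(log)
        log.extend(payload)
        self.index[(ledger_id, entry_id)] = (self.current, offset, len(payload))

    def read_entry(self, ledger_id, entry_id):
        log_id, offset, length = self.index[(ledger_id, entry_id)]
        return bytes(self.logs[log_id][offset:offset + length])
```

Entries of different ledgers end up next to each other in the same log, so a sequential read of one ledger has to jump between offsets.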
Bookie Garbage Collection
▪ Runs periodically in background
▪ Get the list of ledgers stored locally
▪ Get list of ledgers from ZK
▪ Any ledger that is not in ZK is marked for deletion
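The scan-and-compare step amounts to a set difference. A sketch, with a hypothetical function name and plain Python collections standing in for the local ledger list and the ZK listing:

```python
def ledgers_to_delete(local_ledgers, zk_ledgers):
    """Return ledger ids stored locally that no longer exist in the metadata store."""
    return sorted(set(local_ledgers) - set(zk_ledgers))
```

Ledgers that exist in ZK but not locally are simply ignored here, since this bookie has nothing to delete for them.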
When are entry logs deleted?
▪ Need to keep track of usage of each entry log
▪ EntryLog metadata: an in-memory map for each entryLog
› (ledgerId → size)
▪ Whenever a ledger is deleted, each entry log updates its usage
▪ Metadata is appended to each entryLog, to avoid having to scan the log when the bookie restarts (since 4.4.0)
▪ If the entryLog usage is 0% → delete it
▪ If usage falls below x% → compaction
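The usage bookkeeping can be sketched as follows. The function names and the string results are assumptions for illustration; `ledger_sizes` plays the role of the (ledgerId → size) map for a single entry log.

```python
def entry_log_usage(ledger_sizes, total_size):
    """Fraction of the entry log still occupied by live ledgers."""
    return sum(ledger_sizes.values()) / total_size


def on_ledger_deleted(ledger_sizes, ledger_id):
    """Drop a deleted ledger from this entry log's metadata map."""
    ledger_sizes.pop(ledger_id, None)


def next_action(usage, compaction_threshold):
    """Decide what to do with an entry log given its current usage."""
    if usage == 0:
        return "delete"
    if usage < compaction_threshold:
        return "compact"
    return "keep"
```

For example, an entry log of 1000 bytes holding ledgers of 600, 300 and 100 bytes drops to 40% usage once the 600-byte ledger is deleted, which would trigger compaction under a 50% threshold.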
Entry log compaction
▪ There are 2 kinds of compaction, which differ in threshold:
› Minor (every 1 hour, usage < 50%)
› Major (every 1 day, usage < 80%)
▪ Scan the entryLog file and append all valid entries to the current (newer) entryLog file
▪ Update the indexes to point to the new location
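A toy version of the copy-and-reindex step, under stated assumptions: entry logs are in-memory byte arrays, the index maps (ledgerId, entryId) to (entryLogId, offset, length), and the sketch walks the index rather than scanning the log file itself. The function and parameter names are hypothetical.

```python
def compact(old_log_id, logs, index, live_ledgers, current_log_id):
    """Copy still-valid entries from an old entry log into the current one."""
    old = logs[old_log_id]
    current = logs[current_log_id]
    for key, (log_id, offset, length) in list(index.items()):
        if log_id != old_log_id:
            continue                      # entry lives in some other log
        ledger_id, _entry_id = key
        if ledger_id in live_ledgers:
            new_offset = len(current)
            current.extend(old[offset:offset + length])
            index[key] = (current_log_id, new_offset, length)   # repoint the index
        else:
            del index[key]                # ledger was deleted: drop the entry
    del logs[old_log_id]                  # the old entry log can now be removed
```

Note how the surviving old entries land after whatever was already in the current log, which is the mixing of old and new data that the later slides call out.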
Changes already done
▪ Interleaved writes in the entryLog make for poor read performance
› Typically you want to read many entries sequentially
› In SortedLedgerStorage (since BK-4.3) and in DbLedgerStorage (scheduled for BK-4.5), there’s the concept of a write cache:
• Defer writing to the entryLog and sort by ledgerId/entryId so that entries are stored sequentially
› On the same note, using a read-ahead cache will amortize IO ops
▪ Use RocksDB to maintain indexes
› In DbLedgerStorage we load all the offsets into RocksDB. This helps when storing many ledgers (tested with a few million) in a single bookie.
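The write-cache idea can be sketched in a few lines. This is an illustrative model only, not the SortedLedgerStorage or DbLedgerStorage implementation; the class and method names are made up.

```python
class ToyWriteCache:
    """Buffers incoming entries and hands them back sorted at flush time."""

    def __init__(self):
        self.pending = {}   # (ledger_id, entry_id) -> payload

    def put(self, ledger_id, entry_id, payload):
        self.pending[(ledger_id, entry_id)] = payload

    def flush(self):
        """Return entries ordered by (ledger_id, entry_id) and empty the cache."""
        ordered = sorted(self.pending.items(), key=lambda item: item[0])
        self.pending = {}
        return ordered
```

Writing the flushed list to the entry log in this order keeps each ledger's entries contiguous on disk, however interleaved the arrivals were.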
Improvement areas / 1
▪ JVM GC still has an impact on latencies
› Several improvements have already been made
› GC cannot be avoided; going to 0 allocations per entry written is not practical
› The only option is to make pauses as infrequent as possible
› Single-bookie throughput is limited by GC rather than by hardware:
• Above a certain rate, the latency spikes from pauses would make it miss the SLA
› Batching several logical entries into a single BK entry helps a lot, but it’s not always practical
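One way to batch several logical entries into a single BK entry is length-prefixed framing. The sketch below assumes a 4-byte big-endian length before each message; that framing is an illustrative choice, not something the slides specify.

```python
import struct


def pack_batch(messages):
    """Concatenate messages, each prefixed by its length as 4 big-endian bytes."""
    return b"".join(struct.pack(">I", len(m)) + m for m in messages)


def unpack_batch(entry):
    """Split a batched entry back into the original messages."""
    messages, pos = [], 0
    while pos < len(entry):
        (length,) = struct.unpack_from(">I", entry, pos)
        pos += 4
        messages.append(entry[pos:pos + length])
        pos += length
    return messages
```

The bookie then sees one entry (and allocates for one entry) per batch, at the cost of the application having to wait until a batch fills or a timer fires.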
Improvement areas / 2
▪ Getting long runs of sequential entries, and so benefiting from the read-ahead cache, depends on flushInterval:
› Frequent flushes make for fewer contiguous entries
› A longer interval means more long-lived Java objects and longer pauses
▪ Similarly, if writes are spread across many ledgers with a very low per-ledger rate, there will be few sequential entries
▪ Bookie compaction
› During compaction, older entries are re-appended and mixed with new entries
› Long-lived entries get compacted over and over again
› Need to keep EntryLogMetadata in memory (when storing 20TB that can be quite significant)
Considerations on Bookie storage
▪ Original BK implementation dates back to 2009
▪ Bookie storage really resembles an LSM DB
› Journal → Write Ahead Log
› Entry Log → SSTs
› Compaction
› Write cache → MemTable
› Read cache → LRU Block cache
▪ Why not directly store all the data in RocksDB?
▪ Can we get the same performance as current Bookie?
▪ That would replace a large portion of the Bookie code
▪ At that point, why not have the Bookie server in C++?
Bookie-CPP
▪ What is it?
› Proof of concept to validate performance assumptions
› Compatible with the regular BK Java client
› Async C++ server that writes into RocksDB
› So far, only addEntry() is implemented
▪ What is it not?
› No plan to write a BK client in C++
▪ Why?
› Fully utilize IO capacity (vertical scalability)
› Better compaction, no GC pauses, block-level compression, etc.
RocksDB tuning
▪ We can make RocksDB look like Bookie
▪ Goals: high throughput and low latency for writes
› Use a background thread to implement group commit on top of RocksDB
› To ensure writes are not stalled by compaction, use a large MemTable (write-cache) size: 4x 1GB
› Use a big SST size: 1GB
› Big block size: 256K (helps for HDDs)
› Compaction read-ahead buffer: 8MB
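The group-commit idea from the list above can be sketched with a background thread that drains a queue and performs one simulated sync per batch. This is a generic illustration in Python, not the bookie-cpp code: the class, the `max_batch` parameter, and the in-memory `committed` list standing in for a synced write are all assumptions.

```python
import queue
import threading
from concurrent.futures import Future


class GroupCommitter:
    """Batches pending writes so that one sync covers many entries."""

    def __init__(self, max_batch=1000):
        self.max_batch = max_batch
        self.pending = queue.Queue()
        self.committed = []    # stands in for the durable store
        self.sync_count = 0    # number of simulated syncs performed
        self.thread = threading.Thread(target=self._run, daemon=True)
        self.thread.start()

    def submit(self, entry):
        """Queue an entry; the returned future completes once it is 'durable'."""
        done = Future()
        self.pending.put((entry, done))
        return done

    def close(self):
        self.pending.put(None)   # sentinel: stop after what was queued before it
        self.thread.join()

    def _run(self):
        while True:
            item = self.pending.get()            # block until there is work
            if item is None:
                return
            batch, stop = [item], False
            while len(batch) < self.max_batch:   # grab whatever else is waiting
                try:
                    nxt = self.pending.get_nowait()
                except queue.Empty:
                    break
                if nxt is None:
                    stop = True
                    break
                batch.append(nxt)
            self.committed.extend(entry for entry, _ in batch)
            self.sync_count += 1                 # one sync for the whole batch
            for _, done in batch:
                done.set_result(True)
            if stop:
                return
```

Under load the queue is rarely empty, so each sync amortizes over many entries; when idle, a single entry is committed immediately without waiting for a batch to fill.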
How to implement deletion
▪ Bookie GC will still do the same scan & compare
▪ Typically, in LSM DBs a delete operation consists of writing a tombstone marker
› Data is deleted when the tombstones are pushed to the last level and the SST is compacted
▪ RocksDB provides additional options to delete data:
› DeleteFilesInRange() → immediately deletes SSTs that only contain keys in that range
• e.g.: DeleteFilesInRange( [ledgerId, ledgerId+1) )
› Compaction filter → hook into RocksDB compaction to decide which data needs to be kept when compacting. Can use the map of active ledgers to do it. Compaction can also be forced by calling CompactRange()
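Both deletion options rely on keys that sort by ledger first. A sketch of one possible key layout, assuming a 16-byte big-endian (ledgerId, entryId) key; the slides do not state the actual encoding, and the function names here are hypothetical. The code models only the key range and the keep/drop decision, not the RocksDB calls themselves.

```python
import struct


def entry_key(ledger_id, entry_id):
    """16-byte big-endian key, so byte order matches (ledger_id, entry_id) order."""
    return struct.pack(">QQ", ledger_id, entry_id)


def ledger_range(ledger_id):
    """Half-open key range [begin, end) covering every entry of one ledger."""
    return entry_key(ledger_id, 0), entry_key(ledger_id + 1, 0)


def keep_during_compaction(key, active_ledgers):
    """Compaction-filter style decision: keep only entries of active ledgers."""
    ledger_id, _entry_id = struct.unpack(">QQ", key)
    return ledger_id in active_ledgers
```

With this layout, the range for ledger 7 contains every key of ledger 7 and nothing from ledgers 6 or 8, which is the property a range-based file delete needs.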
Preliminary tests / 1
▪ 1 client - 1 bookie
▪ Bookie journal: SSD + RAID BBU
▪ Bookie ledgers: HDDs
▪ Writing 60K 1KB entries/s over multiple ledgers
▪ C++ perf tool that simulates a BK client and measures latency
› Only sends addEntry requests / no actual ledger metadata in ZK
› Using a C++ client removes JVM GC measurement noise on the client side
▪ Measure 99th-percentile write latency over different time intervals
› 1min, 10sec, 1sec
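Measuring the 99th percentile over different intervals can be sketched as below. The nearest-rank percentile definition and the sample format are assumptions; the actual perf tool may compute it differently.

```python
from collections import defaultdict


def p99_per_window(samples, window_seconds):
    """samples: iterable of (timestamp_seconds, latency_ms).

    Returns {window_start_seconds: 99th-percentile latency} using nearest rank.
    """
    buckets = defaultdict(list)
    for ts, latency in samples:
        start = int(ts // window_seconds) * window_seconds
        buckets[start].append(latency)
    result = {}
    for start, values in sorted(buckets.items()):
        values.sort()
        rank = (99 * len(values) + 99) // 100   # ceil(0.99 * n) in integer math
        result[start] = values[rank - 1]
    return result
```

Short windows matter because a brief spike can dominate a 1-second p99 yet vanish in a longer window: two 100 ms outliers among 100 samples in one second show up at 1-second granularity, but are hidden once they are 2 samples out of 200 in a 10-second window.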
Preliminary tests / 2
Preliminary tests / 3
Preliminary tests / 4
Conclusions
▪ Preliminary results look promising
▪ Work in Progress, code at github.com/merlimat/bookie-cpp
▪ Feedback welcome
▪ Hopefully there’s interest in this area
▪ It would be great to include it in the main BK repository at some point
Bookie storage - Apache BookKeeper Meetup - 2015-06-28

  • 1. Bookie Storage (Matteo Merli)
  • 2. BookKeeper
    ▪ Provides distributed logs (ledgers)
    ▪ BookKeeper client + Bookies
    ▪ Client API can be summarized as:
      › createLedger() → ledgerId
      › ledger.addEntry(data) → entryId
      › ledger.readEntry(ledgerId, entryId)
      › deleteLedger(ledgerId)
    ▪ BK Client library implements all the “logic”
      › Consistency, metadata in ZK, fencing, recovery, replication
    ▪ Bookie servers are in charge of storing the data
  • 3. Bookie external interface
    ▪ Simple primitives
      › addEntry(ledgerId, entryId, payload) → OK
      › readEntry(ledgerId, entryId) → payload
      › getLastEntry(ledgerId) → entryId
    ▪ Is that all??
      › Fence flag on readEntry() → no more writes allowed to a ledger
      › Deletion → background garbage collection
      › Auto replication → it's a different logical component that uses the BK client API
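To make the three primitives and the fence flag concrete, here is a minimal in-memory sketch. It is not the BookKeeper implementation: the class name `InMemoryBookie`, the exception `LedgerFencedError`, and the `fence` keyword argument are hypothetical names chosen for illustration, and nothing is persisted.

```python
class LedgerFencedError(Exception):
    """Raised when a write targets a ledger that has been fenced."""


class InMemoryBookie:
    """Toy stand-in for a bookie (hypothetical): entries live in a dict."""

    def __init__(self):
        self._entries = {}    # (ledger_id, entry_id) -> payload
        self._last = {}       # ledger_id -> highest entry_id stored so far
        self._fenced = set()  # ledger ids that no longer accept writes

    def add_entry(self, ledger_id, entry_id, payload):
        if ledger_id in self._fenced:
            raise LedgerFencedError(f"ledger {ledger_id} is fenced")
        self._entries[(ledger_id, entry_id)] = payload
        self._last[ledger_id] = max(entry_id, self._last.get(ledger_id, -1))
        return "OK"

    def read_entry(self, ledger_id, entry_id, fence=False):
        # The fence flag travels with a read: once set, later writes are rejected.
        if fence:
            self._fenced.add(ledger_id)
        return self._entries[(ledger_id, entry_id)]

    def get_last_entry(self, ledger_id):
        return self._last[ledger_id]


bookie = InMemoryBookie()
bookie.add_entry(7, 0, b"hello")
bookie.read_entry(7, 0, fence=True)  # from here on, add_entry(7, ...) raises
```

Deletion and auto replication are left out on purpose, since the slide describes them as background or external components rather than request primitives.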
  • 4. Interleaved storage
    ▪ Default bookie storage
    ▪ Use journal on a separate device
      › Every entry is synced on the journal
    ▪ Entries are also written to "entryLog" files as they come in
      › Writes on the entryLog are periodically flushed in background
      › Entries are appended to the current entry log file
      › When entryLog reaches 2GB, a new one is created
      › Entries for multiple ledgers are interleaved in the same entry log
    ▪ Need to maintain an index (ledgerId, entryId) → (entryLogId, offset)
      › Default implementation uses a file for each ledger to store the data locations
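The append-and-index scheme can be sketched as follows. This is an illustration under assumptions, not the bookie's on-disk format: `io.BytesIO` buffers stand in for entry log files, the 4-byte length prefix is an invented record layout, the journal is omitted, and the rollover rule (start a new log when the next record would not fit) is one possible reading of "reaches 2GB".

```python
import io
import struct

MAX_LOG_SIZE = 2 * 1024**3  # 2GB rollover size mentioned on the slide


class InterleavedStorage:
    """Hypothetical sketch: shared append-only logs plus a location index."""

    def __init__(self, max_log_size=MAX_LOG_SIZE):
        self.max_log_size = max_log_size
        self.logs = [io.BytesIO()]  # in-memory stand-ins for entry log files
        self.index = {}             # (ledger_id, entry_id) -> (entry_log_id, offset)

    def add_entry(self, ledger_id, entry_id, payload):
        record = struct.pack(">I", len(payload)) + payload
        log = self.logs[-1]
        # Roll over to a new entry log when the current one is full.
        if log.tell() > 0 and log.tell() + len(record) > self.max_log_size:
            log = io.BytesIO()
            self.logs.append(log)
        offset = log.tell()
        log.write(record)
        self.index[(ledger_id, entry_id)] = (len(self.logs) - 1, offset)

    def read_entry(self, ledger_id, entry_id):
        log_id, offset = self.index[(ledger_id, entry_id)]
        buf = self.logs[log_id].getvalue()
        (size,) = struct.unpack_from(">I", buf, offset)
        return buf[offset + 4 : offset + 4 + size]
```

Because entries from different ledgers land in the same log in arrival order, reading one ledger sequentially means following index entries that may jump between offsets and between log files.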
  • 5. Bookie Garbage Collection
    ▪ Runs periodically in background
    ▪ Get the list of ledgers stored locally
    ▪ Get list of ledgers from ZK
    ▪ Whatever ledger is not in ZK is marked for deletion
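The scan-and-compare step amounts to a set difference. A minimal sketch, assuming both ledger lists have already been fetched (the function name `ledgers_to_delete` is hypothetical, and no ZooKeeper call is made here):

```python
def ledgers_to_delete(local_ledgers, zk_ledgers):
    """Return ledgers stored locally that no longer exist in the metadata store."""
    return sorted(set(local_ledgers) - set(zk_ledgers))


# Ledgers 1 and 5 are on disk but gone from ZK, so they are marked for deletion.
doomed = ledgers_to_delete(local_ledgers=[1, 2, 3, 5], zk_ledgers=[2, 3, 4])
```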
  • 6. When are entry logs deleted?
    ▪ Need to keep track of usage of each entry log
    ▪ EntryLog metadata, in-memory map for each entryLog
      › (ledgerId → size)
    ▪ Whenever a ledger is deleted, each entry log will update the usage
    ▪ Metadata is appended to each entryLog, to avoid having to scan the log when bookie restarts (since 4.4.0)
    ▪ If the entryLog usage is 0% → delete it
    ▪ If usage falls below x % → compaction
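The usage bookkeeping can be sketched like this. The class `EntryLogMeta` and the function `decide` are hypothetical names; the only facts taken from the slide are the per-log map of ledgerId → size, deletion at 0% usage, and compaction below a threshold.

```python
class EntryLogMeta:
    """Per-entry-log usage metadata (illustrative sketch)."""

    def __init__(self, entry_log_id):
        self.entry_log_id = entry_log_id
        self.total_size = 0     # bytes ever written to this entry log
        self.ledger_sizes = {}  # ledger_id -> bytes still belonging to live ledgers

    def add(self, ledger_id, size):
        self.total_size += size
        self.ledger_sizes[ledger_id] = self.ledger_sizes.get(ledger_id, 0) + size

    def remove_ledger(self, ledger_id):
        # Called when garbage collection deletes a ledger.
        self.ledger_sizes.pop(ledger_id, None)

    def usage(self):
        if self.total_size == 0:
            return 0.0
        return sum(self.ledger_sizes.values()) / self.total_size


def decide(meta, threshold):
    """Return 'delete', 'compact' or 'keep' for one entry log."""
    usage = meta.usage()
    if usage == 0.0:
        return "delete"
    if usage < threshold:
        return "compact"
    return "keep"
```

For example, a log holding 600 bytes of ledger 1 and 400 bytes of ledger 2 drops to 40% usage once ledger 1 is deleted, which makes it a compaction candidate at a 50% threshold.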
  • 7. Entry log compaction
    ▪ There are 2 kinds of compaction, which differ in threshold:
      › Minor (every 1 hour, usage < 50%)
      › Major (every 1 day, usage < 80%)
    ▪ Scan the entryLog file and append all valid entries into the current (newer) entryLog file
    ▪ Update the indexes to point to the new location
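A simplified sketch of one compaction pass follows. Entry logs are modelled as Python lists of `(ledger_id, entry_id, payload)` tuples and an "offset" is just a list position, both of which are assumptions made for brevity; the function names are hypothetical. Only the 50% and 80% thresholds come from the slide.

```python
MINOR_THRESHOLD = 0.5  # minor compaction: entry logs with usage < 50%
MAJOR_THRESHOLD = 0.8  # major compaction: entry logs with usage < 80%


def eligible(usage, major=False):
    """True if an entry log with this usage should be compacted in this pass."""
    return usage < (MAJOR_THRESHOLD if major else MINOR_THRESHOLD)


def compact(logs, index, log_id, current_log_id, active_ledgers):
    """Re-append live entries of logs[log_id] to the current log, then drop the old log."""
    for ledger_id, entry_id, payload in logs[log_id]:
        if ledger_id not in active_ledgers:
            index.pop((ledger_id, entry_id), None)  # entry of a deleted ledger
            continue
        offset = len(logs[current_log_id])
        logs[current_log_id].append((ledger_id, entry_id, payload))
        index[(ledger_id, entry_id)] = (current_log_id, offset)  # repoint the index
    del logs[log_id]
```

Note that the surviving entries end up behind whatever was already in the current log, so compacted data gets mixed with fresh writes.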
  • 8. Changes already done
    ▪ Interleaving writes in the entryLog makes for poor read performance
      › Typically you want to read many entries sequentially
      › In SortedLedgerStorage (since BK-4.3) and in DbLedgerStorage (scheduled for BK-4.5), there’s the concept of a write-cache:
        • Defer the writing to entryLog and sort by ledgerId/entryId to have entries stored sequentially.
      › On the same note, using a read-ahead cache will amortize IO ops
    ▪ Use RocksDB to maintain indexes
      › In DbLedgerStorage we load all the offsets into RocksDB. Helps when storing many ledgers (tested with a few million) in a single bookie.
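The write-cache idea (buffer first, sort at flush time) can be shown in a few lines. This is a sketch only: `WriteCache` is a hypothetical class, not the SortedLedgerStorage or DbLedgerStorage code, and the flush simply returns the ordered entries instead of writing an entry log.

```python
class WriteCache:
    """Buffers entries and hands them back sorted by (ledger_id, entry_id)."""

    def __init__(self):
        self._pending = {}  # (ledger_id, entry_id) -> payload

    def put(self, ledger_id, entry_id, payload):
        self._pending[(ledger_id, entry_id)] = payload

    def get(self, ledger_id, entry_id):
        # Reads of not-yet-flushed entries are served from the cache.
        return self._pending.get((ledger_id, entry_id))

    def flush(self):
        """Return buffered entries in ledger/entry order and empty the cache."""
        ordered = sorted(self._pending.items())  # keys are unique tuples
        self._pending = {}
        return [(ledger, entry, data) for (ledger, entry), data in ordered]


cache = WriteCache()
for ledger, entry in [(2, 0), (1, 0), (2, 1), (1, 1)]:  # interleaved arrival order
    cache.put(ledger, entry, b"x")
batch = cache.flush()  # ledger 1's entries first, then ledger 2's
```

After the flush, each ledger's entries sit next to each other in the entry log, which is what lets a read-ahead cache pay off.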
  • 9. Improvement areas / 1
    ▪ JVM GC still has an impact on latencies
      › Several improvements already done
      › GC cannot be avoided; going to 0 allocations per entry written is not practical
      › The only option is to make pauses as infrequent as possible
      › Single bookie throughput is limited by GC rather than hardware:
      › Above a certain rate, the latency spikes from pauses would make it miss the SLA
      › Batching more logical entries into a single BK entry helps a lot, but it’s not always practical
  • 10. Improvement areas / 2
    ▪ Having long runs of sequential entries, and so taking advantage of the read-ahead cache, depends on flushInterval:
      › Frequent flushes will make for fewer contiguous entries
      › A longer interval means more long-lived Java objects and longer pauses.
    ▪ Similarly, if writes are spread across many ledgers, with a very low per-ledger rate, there will be few sequential entries
    ▪ Bookie compaction
      › During compaction, older entries are re-appended and mixed with new entries.
      › Long-lived entries will get compacted over and over again.
      › Need to keep EntryLogMetadata in memory (when storing 20TB that can be quite significant)
  • 11. Considerations on Bookie storage
    ▪ Original BK implementation dates back to 2009
    ▪ Bookie storage really resembles an LSM DB
      › Journal → Write Ahead Log
      › Entry Log → SSTs
      › Compaction
      › Write cache → MemTable
      › Read cache → LRU Block cache
    ▪ Why not directly store all the data in RocksDB?
    ▪ Can we get the same performance as the current Bookie?
    ▪ That would replace a large portion of the Bookie code
    ▪ At that point, why not have the Bookie server in C++?
  • 12. Bookie-CPP
    ▪ What is it?
      › Proof of Concept to validate the performance assumption
      › Compatible with the regular BK Java client
      › Async C++ server that writes into RocksDB
      › So far, only addEntry() implemented
    ▪ What it is not
      › No plan to write a BK client in C++
    ▪ Why?
      › Fully utilize IO capacity (vertical scalability)
      › Better compaction, no GC pauses, block-level compression, etc.
  • 13. RocksDB tuning
    ▪ We can make RocksDB look like Bookie
    ▪ Goals: high throughput and low latency for writes
      › Use a background thread to implement group-commit on top of RocksDB
      › To ensure writes are not stalled by compaction, use a large MemTable (write-cache) size: 4x 1GB
      › Use big SST size: 1GB
      › Big block-size: 256K (helps for HDDs)
      › Compaction read-ahead buffer: 8MB
  • 14. How to implement deletion
    ▪ Bookie GC will still do the same scan & compare
    ▪ Typically, in LSM DBs a delete operation consists of writing a tombstone marker
      › Data is deleted when the tombstones are pushed to the last level and the SST is compacted
    ▪ RocksDB provides additional options to delete data:
      › DeleteFilesInRange() → immediately delete SSTs that only contain keys in that range
        • e.g. DeleteFilesInRange( [ledgerId, ledgerId+1) )
      › Compaction filter → hook into RocksDB compaction to decide which data needs to be kept when compacting. Can use the map of active ledgers to do it. Compaction can also be forced by calling CompactRange()
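The range `[ledgerId, ledgerId+1)` only covers exactly one ledger if keys sort by ledger first and entry second. The slides do not state the key layout, so the sketch below assumes one: two big-endian unsigned 64-bit integers, whose byte order matches numeric order. It is plain Python and does not call RocksDB; `keep_entry` merely mimics the decision a compaction filter would make, and all function names are hypothetical.

```python
import struct


def entry_key(ledger_id, entry_id):
    """Assumed key layout: big-endian (ledger_id, entry_id), so bytes sort numerically."""
    return struct.pack(">QQ", ledger_id, entry_id)


def ledger_range(ledger_id):
    """Half-open key range [begin, end) covering every entry of one ledger."""
    return entry_key(ledger_id, 0), entry_key(ledger_id + 1, 0)


def keep_entry(key, active_ledgers):
    """Compaction-filter style predicate: keep only entries of live ledgers."""
    ledger_id, _entry_id = struct.unpack(">QQ", key)
    return ledger_id in active_ledgers


keys = sorted(entry_key(l, e) for l in (4, 5, 6) for e in (0, 1, 2))
begin, end = ledger_range(5)
in_range = [k for k in keys if begin <= k < end]      # the three entries of ledger 5
survivors = [k for k in keys if keep_entry(k, {4, 6})]  # ledger 5 filtered out
```

With a little-endian or variable-length encoding the byte-wise ordering would no longer match ledger order, and a range delete could hit the wrong ledgers.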
  • 15. Preliminary tests / 1
    ▪ 1 client - 1 bookie
    ▪ Bookie journal: SSD + RAID BBU
    ▪ Bookie ledgers: HDDs
    ▪ Writing 60K 1KB entries/s over multiple ledgers
    ▪ C++ perf tool that simulates a BK client and measures latency
      › Only sends addEntry requests / no actual ledger metadata in ZK
      › Using a C++ client removes JVM GC measurement noise on the client side
    ▪ Measure 99pct write latency over different time intervals
      › 1min, 10sec, 1sec
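Why measure the 99th percentile over several interval lengths? A short burst of slow writes can vanish inside a long window yet dominate a short one. The sketch below illustrates that with synthetic numbers that are not results from the talk; it uses the nearest-rank percentile definition, and the function names and millisecond timestamps are assumptions.

```python
from collections import defaultdict


def percentile(values, pct):
    """Nearest-rank percentile for an integer pct in 1..100."""
    ordered = sorted(values)
    rank = max(1, -(-pct * len(ordered) // 100))  # ceil(pct * n / 100)
    return ordered[rank - 1]


def p99_by_window(samples, window_ms):
    """samples: iterable of (timestamp_ms, latency_ms); returns {window_index: p99}."""
    buckets = defaultdict(list)
    for ts_ms, latency in samples:
        buckets[ts_ms // window_ms].append(latency)
    return {w: percentile(v, 99) for w, v in sorted(buckets.items())}


# Synthetic data: one write every 10 ms for 2 s, with two slow writes at 1.5 s.
samples = [(i * 10, 50.0 if i in (150, 151) else 1.0) for i in range(200)]
per_second = p99_by_window(samples, 1000)  # the spike shows up in the second window
overall = p99_by_window(samples, 2000)     # the spike disappears from the p99
```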
  • 16. Preliminary tests / 2 [figure only]
  • 17. Preliminary tests / 3 [figure only]
  • 18. Preliminary tests / 4 [figure only]
  • 19. Conclusions
    ▪ Preliminary results look promising
    ▪ Work in progress, code at github.com/merlimat/bookie-cpp
    ▪ Feedback welcome
    ▪ Hopefully there’s interest in this area
    ▪ It would be great to include it in the main BK repository at some point