SlideShare uma empresa Scribd logo
1 de 33
©2017 LinkedIn Corporation. All Rights Reserved.
Kafka at half the price
Dong Lin
Streams Infrastructure
©2017 LinkedIn Corporation. All Rights Reserved. 2
Agenda
▪ Motivation
– Why switch from RAID-10 to JBOD?
– Tradeoff between cost and fault-tolerance
▪ Design
– How to run Kafka with disk failure
– How to move replicas between disks
▪ Alternatives
▪ Evaluation
▪ Changes in operational procedures
▪ Future work
▪ Reference
©2017 LinkedIn Corporation. All Rights Reserved. 3
RAID-10 setup with RF=2
producer
Broker 1 Broker 2
A
B
C
A
B
C
A
B
C
A
B
C
©2017 LinkedIn Corporation. All Rights Reserved. 4
RAID-10 setup with RF=2
producer
Broker 1 Broker 2
A
B
C
A
B
C
A
B
C
A
B
C
- Tolerate only one broker failure
©2017 LinkedIn Corporation. All Rights Reserved. 5
RAID-10 setup with RF=3
producer
Broker 1 Broker 3
A
B
C
A
B
C
A
B
C
A
B
C
Broker 2
A
B
C
A
B
C
- Tolerate up to two broker failures
- 50% more storage cost
©2017 LinkedIn Corporation. All Rights Reserved. 6
JBOD setup with RF=2
producer
Broker 1
A
B
C
Broker 2
A
B
C
- Tolerate only one broker failure
- 50% less storage cost
©2017 LinkedIn Corporation. All Rights Reserved. 7
JBOD setup with RF=3
producer
Broker 1
A
B
C
Broker 3
A
B
C
Broker 2
A
B
C
- Tolerate up to two broker failures
- 25% less storage cost
©2017 LinkedIn Corporation. All Rights Reserved. 8
RAID vs. JBOD
Setup Replication Storage cost
Broker failure
tolerance
Disk failure
tolerance
RAID-10
2 (baseline) 4X 1 (too small) 3
3 6X (50% up) 2 5
©2017 LinkedIn Corporation. All Rights Reserved. 9
RAID vs. JBOD
Setup Replication Storage cost
Broker failure
tolerance
Disk failure
tolerance
RAID-10
2 (baseline) 4X 1 (too small) 3
3 6X (50% up) 2 5
JBOD
2 2X (50% down) 1 (too small) 1 (too small)
3 (future) 3X (25% down) 2 (100% up) 2 (33% down)
©2017 LinkedIn Corporation. All Rights Reserved. 10
RAID vs. JBOD
Setup Replication Storage cost
Broker failure
tolerance
Disk failure
tolerance
RAID-10
2 (baseline) 4X 1 (too small) 3
3 6X (50% up) 2 5
JBOD
2 2X (50% down) 1 (too small) 1 (too small)
3 (future) 3X (25% down) 2 (100% up) 2 (33% down)
4 4X 3 (300% up) 3
©2017 LinkedIn Corporation. All Rights Reserved. 11
Agenda
▪ Motivation
– Why switch from RAID-10 to JBOD?
– Tradeoff between cost and fault-tolerance
▪ Design
– How to run Kafka with disk failure
– How to move replicas between disks
▪ Alternatives
▪ Evaluation
▪ Changes in operational procedures
▪ Future work
▪ Reference
©2017 LinkedIn Corporation. All Rights Reserved. 12
Problem 1: All replicas become offline if any log directory fails
Broker
Disk A
IOException
when accessing disk B
Disk B
Disk C
Broker
Disk A
Disk B
Disk C
©2017 LinkedIn Corporation. All Rights Reserved. 13
Solution: Only replicas on the failed disk become offline
Broker
Disk A
IOException
when accessing disk B
Disk B
Disk C
Broker
Disk A
Disk B
Disk C
©2017 LinkedIn Corporation. All Rights Reserved. 14
Problem 2: Controller does not recognize disk failure
Zookeeper
Controller
Broker 1
Partition 1
Partition 2
STEP 2:
- Broker -> is alive?
- Broker -> partition list
STEP 1: I am online
X No further
leader election
STEP 3:
Become leader for
partitions 1 and 2
STEP 4:
partition 2 is offline
©2017 LinkedIn Corporation. All Rights Reserved. 15
Solution: Broker notifies and provides partition list to controller
Zookeeper
Controller
Broker 1
Partition 1
Partition 2
STEP 2: Broker 1 has new disk failureSTEP 1: Notify disk failure
X
STEP 3:
Become leader for
partitions 1 and 2
STEP 4:
partition 2 is offline
STEP 5: Elect
another broker as
leader for partition 2
©2017 LinkedIn Corporation. All Rights Reserved. 16
Problem 3: Broker always creates log for partition if not exist
Zookeeper
Controller
STEP 3:
Become follower for partition 2
Create partition 2 if non-existent
Broker 1
Partition 1
Partition 2
X
STEP 2:
- Broker -> is alive?
- Broker -> partition list
STEP 1: I am online
©2017 LinkedIn Corporation. All Rights Reserved. 17
Problem 3: Broker always creates log for partition if not exist
Zookeeper
Controller
STEP 3:
Become follower for partition 2
Create partition 2 if non-existent
STEP 4:
Created partition 2
(problematic)
Broker 1
Partition 1
Partition 2
Partition 2
X
STEP 2:
- Broker -> is alive?
- Broker -> partition list
STEP 1: I am online
©2017 LinkedIn Corporation. All Rights Reserved. 18
Problem 3: Broker always creates log for partition if not exist
Zookeeper
Controller
STEP 3:
Become follower for partition 2
Create partition 2 if non-existent
Broker 1
Partition 1
Partition 2
Partition 2
X
STEP 2:
- Broker -> is alive?
- Broker -> partition list
Good disk may
become overloaded STEP 1: I am online
STEP 4:
Created partition 2
(problematic)
©2017 LinkedIn Corporation. All Rights Reserved. 19
Solution: Controller specifies whether to create log for partition
Zookeeper
Controller
STEP 3:
Become follower for partition 2
This is NOT a new partition
STEP 4:
Partition 2 is not available
and there is offline log dir
Broker 1
Partition 1
Partition 2
X
STEP 2:
- Broker -> is alive?
- Broker -> partition list
- Broker -> is new partition?
STEP 5:
Exclude broker 1 from
leader election
for partition 2
STEP 1: I am online
©2017 LinkedIn Corporation. All Rights Reserved. 20
Problem 4: No mechanism to move replicas between disks
Broker 1
P1 P2 P3
P5P4 P6
P7
Disk 1 Disk 2
©2017 LinkedIn Corporation. All Rights Reserved. 21
Example workflow to move replicas between disks
Broker
Client
STEP 1: DescribeDirRequest
STEP 2: DescribeDirResponse
Partition list and size
STEP 3: ChangeDirRequest
Disk 1 Disk 2
STEP 4: create p1.move
STEP 5: ChangeDirResponse
(Inprogress)
STEP 6: copy data from
p1.log to p1.move
STEP 7: delete p1.log and
rename p1.move to p1.log
STEP 8: Verify new assignment
via DescribeDirRequest
©2017 LinkedIn Corporation. All Rights Reserved. 22
Agenda
▪ Motivation
– Why switch from RAID-10 to JBOD?
– Tradeoff between cost and fault-tolerance
▪ Design
– How to run Kafka with disk failure
– How to move replicas between disks
▪ Alternatives
▪ Evaluation
▪ Changes in operational procedures
▪ Future work
▪ Reference
©2017 LinkedIn Corporation. All Rights Reserved. 23
Alternatives
▪ RAID-0 doesn’t provide disk fault tolerance
– Assume each broker has 10 disks and RF = 2
– RAID-0 has 100X higher probability of unavailability due to disk failure than JBOD
▪ RAID-5 and RAID-6 have poor performance
▪ Hardware RAID is expensive
▪ One broker per disk
©2017 LinkedIn Corporation. All Rights Reserved. 24
one-broker-per-machine vs. one-broker-per-disk
Physical Machine
Disk 1 Disk 2 Disk 3
Broker 1
Physical Machine
Disk 1 Disk 2 Disk 3
Broker 1 Broker 2 Broker 3
V.S.
One-broker-per-machine One-broker-per-disk
©2017 LinkedIn Corporation. All Rights Reserved. 25
one-broker-per-machine vs. one-broker-per-disk
▪ Both solutions use JBOD as disk configuration
▪ Main drawbacks of one-broker-per-disk (assume 10 disk per machine)
– 100X threads and 100X sockets per machine
– 10X control plane traffic from the controller to brokers (e.g. MetadataRequest)
– 10X broker instances and configuration files to manage
– 10X time to bounce a cluster if we bounce one broker at a time
– 10X load on external service (e.g. a service used to query per-topic ACL)
– Less efficient quota enforcement
– Less efficient rebalance across disks on the same machine
– Lower throughput
©2017 LinkedIn Corporation. All Rights Reserved. 26
Experimental setup
▪ Brokers deployed on 15 machines with 10 disks per machine
IO threads Network threads Replica-fetcher threads
One-broker-per-machine 160 120 140
One-broker-per-disk 16 12 14
▪ Producers deployed on 15 machines
acks threads sync retries retry backoff message size batch size request timeout
all 50 true MAX_INT 60 sec 100 KB 1 MB MAX_INT
▪ Topic configuration
partition replication factor min-insync-replicas
512 3 3
©2017 LinkedIn Corporation. All Rights Reserved. 27
One-broker-per-machine throughput
Average throughput is 2.3 GBps
©2017 LinkedIn Corporation. All Rights Reserved. 28
One-broker-per-disk throughput
Average throughput is 2 GBps
©2017 LinkedIn Corporation. All Rights Reserved. 29
Agenda
▪ Motivation
– Why switch from RAID-10 to JBOD?
– Tradeoff between cost and fault-tolerance
▪ Design
– How to run Kafka with disk failure
– How to move replicas between disks
▪ Alternatives
▪ Evaluation
▪ Changes in operational procedures
▪ Future work
▪ Reference
©2017 LinkedIn Corporation. All Rights Reserved. 30
Changes in operational procedure
▪ Adjust replication factor and min.insync.replicas
▪ Configure num.replica.move.threads for broker
▪ Monitor disk failure via the OfflineLogDirectoriesCount metric
©2017 LinkedIn Corporation. All Rights Reserved. 31
Future work
▪ Use more intelligent solution to select log directory for new replica
▪ Automatic load balancing across log directories on the same broker
– Reduced operational overhead
▪ Distribute segments of a given replica across multiple log directories
– Less overhead for rebalance between disks
– Higher partition size limit
▪ Handle partial disk failure, e.g. disk with degraded performance.
©2017 LinkedIn Corporation. All Rights Reserved. 32
References
▪ KIP-112: Handle disk failure for JBOD (link)
▪ KIP-113: Support replicas movement between log directories (link)
©2017 LinkedIn Corporation. All Rights Reserved. 33

Mais conteúdo relacionado

Mais procurados

The Rise of ZStandard: Apache Spark/Parquet/ORC/Avro
The Rise of ZStandard: Apache Spark/Parquet/ORC/AvroThe Rise of ZStandard: Apache Spark/Parquet/ORC/Avro
The Rise of ZStandard: Apache Spark/Parquet/ORC/Avro
Databricks
 
Redis: Swiss Army Knife @HackerRank: Kamal Joshi
Redis: Swiss Army Knife @HackerRank: Kamal JoshiRedis: Swiss Army Knife @HackerRank: Kamal Joshi
Redis: Swiss Army Knife @HackerRank: Kamal Joshi
Redis Labs
 

Mais procurados (20)

Kafka at Peak Performance
Kafka at Peak PerformanceKafka at Peak Performance
Kafka at Peak Performance
 
Impacts of Sharding, Partitioning, Encoding, and Sorting on Distributed Query...
Impacts of Sharding, Partitioning, Encoding, and Sorting on Distributed Query...Impacts of Sharding, Partitioning, Encoding, and Sorting on Distributed Query...
Impacts of Sharding, Partitioning, Encoding, and Sorting on Distributed Query...
 
Incremental data processing with Hudi & Spark + dbt.pdf
Incremental data processing with Hudi & Spark + dbt.pdfIncremental data processing with Hudi & Spark + dbt.pdf
Incremental data processing with Hudi & Spark + dbt.pdf
 
The Rise of ZStandard: Apache Spark/Parquet/ORC/Avro
The Rise of ZStandard: Apache Spark/Parquet/ORC/AvroThe Rise of ZStandard: Apache Spark/Parquet/ORC/Avro
The Rise of ZStandard: Apache Spark/Parquet/ORC/Avro
 
Apache Airflow
Apache AirflowApache Airflow
Apache Airflow
 
Redis: Swiss Army Knife @HackerRank: Kamal Joshi
Redis: Swiss Army Knife @HackerRank: Kamal JoshiRedis: Swiss Army Knife @HackerRank: Kamal Joshi
Redis: Swiss Army Knife @HackerRank: Kamal Joshi
 
[135] 오픈소스 데이터베이스, 은행 서비스에 첫발을 내밀다.
[135] 오픈소스 데이터베이스, 은행 서비스에 첫발을 내밀다.[135] 오픈소스 데이터베이스, 은행 서비스에 첫발을 내밀다.
[135] 오픈소스 데이터베이스, 은행 서비스에 첫발을 내밀다.
 
Grafana Mimir and VictoriaMetrics_ Performance Tests.pptx
Grafana Mimir and VictoriaMetrics_ Performance Tests.pptxGrafana Mimir and VictoriaMetrics_ Performance Tests.pptx
Grafana Mimir and VictoriaMetrics_ Performance Tests.pptx
 
Manual wifislax
Manual wifislaxManual wifislax
Manual wifislax
 
Dynamically Scaling Data Streams across Multiple Kafka Clusters with Zero Fli...
Dynamically Scaling Data Streams across Multiple Kafka Clusters with Zero Fli...Dynamically Scaling Data Streams across Multiple Kafka Clusters with Zero Fli...
Dynamically Scaling Data Streams across Multiple Kafka Clusters with Zero Fli...
 
Under the Hood of a Shard-per-Core Database Architecture
Under the Hood of a Shard-per-Core Database ArchitectureUnder the Hood of a Shard-per-Core Database Architecture
Under the Hood of a Shard-per-Core Database Architecture
 
Building Reliable Lakehouses with Apache Flink and Delta Lake
Building Reliable Lakehouses with Apache Flink and Delta LakeBuilding Reliable Lakehouses with Apache Flink and Delta Lake
Building Reliable Lakehouses with Apache Flink and Delta Lake
 
RocksDB compaction
RocksDB compactionRocksDB compaction
RocksDB compaction
 
High Availability With DRBD & Heartbeat
High Availability With DRBD & HeartbeatHigh Availability With DRBD & Heartbeat
High Availability With DRBD & Heartbeat
 
Real-time Data Ingestion from Kafka to ClickHouse with Deterministic Re-tries...
Real-time Data Ingestion from Kafka to ClickHouse with Deterministic Re-tries...Real-time Data Ingestion from Kafka to ClickHouse with Deterministic Re-tries...
Real-time Data Ingestion from Kafka to ClickHouse with Deterministic Re-tries...
 
Evening out the uneven: dealing with skew in Flink
Evening out the uneven: dealing with skew in FlinkEvening out the uneven: dealing with skew in Flink
Evening out the uneven: dealing with skew in Flink
 
Building an analytics workflow using Apache Airflow
Building an analytics workflow using Apache AirflowBuilding an analytics workflow using Apache Airflow
Building an analytics workflow using Apache Airflow
 
Rocks db state store in structured streaming
Rocks db state store in structured streamingRocks db state store in structured streaming
Rocks db state store in structured streaming
 
Why we love pgpool-II and why we hate it!
Why we love pgpool-II and why we hate it!Why we love pgpool-II and why we hate it!
Why we love pgpool-II and why we hate it!
 
Load Balancing MySQL with HAProxy - Slides
Load Balancing MySQL with HAProxy - SlidesLoad Balancing MySQL with HAProxy - Slides
Load Balancing MySQL with HAProxy - Slides
 

Semelhante a Kafka at half the price with JBOD setup

Loadays managing my sql with percona toolkit
Loadays managing my sql with percona toolkitLoadays managing my sql with percona toolkit
Loadays managing my sql with percona toolkit
Frederic Descamps
 
Db As Behaving Badly... Worst Practices For Database Administrators Rod Colledge
Db As Behaving Badly... Worst Practices For Database Administrators Rod ColledgeDb As Behaving Badly... Worst Practices For Database Administrators Rod Colledge
Db As Behaving Badly... Worst Practices For Database Administrators Rod Colledge
sqlserver.co.il
 

Semelhante a Kafka at half the price with JBOD setup (20)

Migrate your EOL MySQL servers to HA Complaint GR Cluster / InnoDB Cluster Wi...
Migrate your EOL MySQL servers to HA Complaint GR Cluster / InnoDB Cluster Wi...Migrate your EOL MySQL servers to HA Complaint GR Cluster / InnoDB Cluster Wi...
Migrate your EOL MySQL servers to HA Complaint GR Cluster / InnoDB Cluster Wi...
 
1049: Best and Worst Practices for Deploying IBM Connections - IBM Connect 2016
1049: Best and Worst Practices for Deploying IBM Connections - IBM Connect 20161049: Best and Worst Practices for Deploying IBM Connections - IBM Connect 2016
1049: Best and Worst Practices for Deploying IBM Connections - IBM Connect 2016
 
XPDS14 - Scaling Xen's Aggregate Storage Performance - Felipe Franciosi, Citrix
XPDS14 - Scaling Xen's Aggregate Storage Performance - Felipe Franciosi, CitrixXPDS14 - Scaling Xen's Aggregate Storage Performance - Felipe Franciosi, Citrix
XPDS14 - Scaling Xen's Aggregate Storage Performance - Felipe Franciosi, Citrix
 
MySQL Transformation Case Study: 80% Cost Savings & Uninterrupted Availabilit...
MySQL Transformation Case Study: 80% Cost Savings & Uninterrupted Availabilit...MySQL Transformation Case Study: 80% Cost Savings & Uninterrupted Availabilit...
MySQL Transformation Case Study: 80% Cost Savings & Uninterrupted Availabilit...
 
XPDDS17: To Grant or Not to Grant? - João Martins, Oracle
XPDDS17: To Grant or Not to Grant? - João Martins, Oracle XPDDS17: To Grant or Not to Grant? - João Martins, Oracle
XPDDS17: To Grant or Not to Grant? - João Martins, Oracle
 
Reusing your existing software on Android
Reusing your existing software on AndroidReusing your existing software on Android
Reusing your existing software on Android
 
Criteo Labs Infrastructure Tech Talk Meetup Nov. 7
Criteo Labs Infrastructure Tech Talk Meetup Nov. 7Criteo Labs Infrastructure Tech Talk Meetup Nov. 7
Criteo Labs Infrastructure Tech Talk Meetup Nov. 7
 
Hadoop Performance at LinkedIn
Hadoop Performance at LinkedInHadoop Performance at LinkedIn
Hadoop Performance at LinkedIn
 
Fusion-IO - Building a High Performance and Reliable VSAN Environment
Fusion-IO - Building a High Performance and Reliable VSAN EnvironmentFusion-IO - Building a High Performance and Reliable VSAN Environment
Fusion-IO - Building a High Performance and Reliable VSAN Environment
 
Galera Cluster 3.0 Features
Galera Cluster 3.0 FeaturesGalera Cluster 3.0 Features
Galera Cluster 3.0 Features
 
OpenStack Days Krakow
OpenStack Days KrakowOpenStack Days Krakow
OpenStack Days Krakow
 
Containerize Legacy .NET Framework Web Apps for Cloud Migration
Containerize Legacy .NET Framework Web Apps for Cloud Migration Containerize Legacy .NET Framework Web Apps for Cloud Migration
Containerize Legacy .NET Framework Web Apps for Cloud Migration
 
Loadays managing my sql with percona toolkit
Loadays managing my sql with percona toolkitLoadays managing my sql with percona toolkit
Loadays managing my sql with percona toolkit
 
GlusterFS w/ Tiered XFS
GlusterFS w/ Tiered XFS  GlusterFS w/ Tiered XFS
GlusterFS w/ Tiered XFS
 
Scylla on Kubernetes: Introducing the Scylla Operator
Scylla on Kubernetes: Introducing the Scylla OperatorScylla on Kubernetes: Introducing the Scylla Operator
Scylla on Kubernetes: Introducing the Scylla Operator
 
Db As Behaving Badly... Worst Practices For Database Administrators Rod Colledge
Db As Behaving Badly... Worst Practices For Database Administrators Rod ColledgeDb As Behaving Badly... Worst Practices For Database Administrators Rod Colledge
Db As Behaving Badly... Worst Practices For Database Administrators Rod Colledge
 
Retour d'expérience d'un environnement base de données multitenant
Retour d'expérience d'un environnement base de données multitenantRetour d'expérience d'un environnement base de données multitenant
Retour d'expérience d'un environnement base de données multitenant
 
Percona 服务器与 XtraDB 存储引擎
Percona 服务器与 XtraDB 存储引擎Percona 服务器与 XtraDB 存储引擎
Percona 服务器与 XtraDB 存储引擎
 
Redis Developers Day 2014 - Redis Labs Talks
Redis Developers Day 2014 - Redis Labs TalksRedis Developers Day 2014 - Redis Labs Talks
Redis Developers Day 2014 - Redis Labs Talks
 
Open Source Data Deduplication
Open Source Data DeduplicationOpen Source Data Deduplication
Open Source Data Deduplication
 

Mais de Dong Lin

Mais de Dong Lin (6)

FeatHub_DataFun_2023.pptx
FeatHub_DataFun_2023.pptxFeatHub_DataFun_2023.pptx
FeatHub_DataFun_2023.pptx
 
FeatHub_GAIDC_2022.pptx
FeatHub_GAIDC_2022.pptxFeatHub_GAIDC_2022.pptx
FeatHub_GAIDC_2022.pptx
 
FeatHub_FFA_2022
FeatHub_FFA_2022FeatHub_FFA_2022
FeatHub_FFA_2022
 
基于 Flink 和 AI Flow 的实时推荐系统
基于 Flink 和 AI Flow 的实时推荐系统基于 Flink 和 AI Flow 的实时推荐系统
基于 Flink 和 AI Flow 的实时推荐系统
 
为实时机器学习设计的算法接口与迭代引擎_FFA_2021
为实时机器学习设计的算法接口与迭代引擎_FFA_2021为实时机器学习设计的算法接口与迭代引擎_FFA_2021
为实时机器学习设计的算法接口与迭代引擎_FFA_2021
 
An introduction to Apache Kafka and Kafka ecosystem at LinkedIn
An introduction to Apache Kafka and Kafka ecosystem at LinkedInAn introduction to Apache Kafka and Kafka ecosystem at LinkedIn
An introduction to Apache Kafka and Kafka ecosystem at LinkedIn
 

Último

XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
ssuser89054b
 
VIP Call Girls Ankleshwar 7001035870 Whatsapp Number, 24/07 Booking
VIP Call Girls Ankleshwar 7001035870 Whatsapp Number, 24/07 BookingVIP Call Girls Ankleshwar 7001035870 Whatsapp Number, 24/07 Booking
VIP Call Girls Ankleshwar 7001035870 Whatsapp Number, 24/07 Booking
dharasingh5698
 
Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...
Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...
Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...
Christo Ananth
 
AKTU Computer Networks notes --- Unit 3.pdf
AKTU Computer Networks notes ---  Unit 3.pdfAKTU Computer Networks notes ---  Unit 3.pdf
AKTU Computer Networks notes --- Unit 3.pdf
ankushspencer015
 

Último (20)

Unleashing the Power of the SORA AI lastest leap
Unleashing the Power of the SORA AI lastest leapUnleashing the Power of the SORA AI lastest leap
Unleashing the Power of the SORA AI lastest leap
 
Java Programming :Event Handling(Types of Events)
Java Programming :Event Handling(Types of Events)Java Programming :Event Handling(Types of Events)
Java Programming :Event Handling(Types of Events)
 
Call for Papers - International Journal of Intelligent Systems and Applicatio...
Call for Papers - International Journal of Intelligent Systems and Applicatio...Call for Papers - International Journal of Intelligent Systems and Applicatio...
Call for Papers - International Journal of Intelligent Systems and Applicatio...
 
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
 
PVC VS. FIBERGLASS (FRP) GRAVITY SEWER - UNI BELL
PVC VS. FIBERGLASS (FRP) GRAVITY SEWER - UNI BELLPVC VS. FIBERGLASS (FRP) GRAVITY SEWER - UNI BELL
PVC VS. FIBERGLASS (FRP) GRAVITY SEWER - UNI BELL
 
UNIT - IV - Air Compressors and its Performance
UNIT - IV - Air Compressors and its PerformanceUNIT - IV - Air Compressors and its Performance
UNIT - IV - Air Compressors and its Performance
 
The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...
The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...
The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...
 
Top Rated Pune Call Girls Budhwar Peth ⟟ 6297143586 ⟟ Call Me For Genuine Se...
Top Rated  Pune Call Girls Budhwar Peth ⟟ 6297143586 ⟟ Call Me For Genuine Se...Top Rated  Pune Call Girls Budhwar Peth ⟟ 6297143586 ⟟ Call Me For Genuine Se...
Top Rated Pune Call Girls Budhwar Peth ⟟ 6297143586 ⟟ Call Me For Genuine Se...
 
(INDIRA) Call Girl Meerut Call Now 8617697112 Meerut Escorts 24x7
(INDIRA) Call Girl Meerut Call Now 8617697112 Meerut Escorts 24x7(INDIRA) Call Girl Meerut Call Now 8617697112 Meerut Escorts 24x7
(INDIRA) Call Girl Meerut Call Now 8617697112 Meerut Escorts 24x7
 
Water Industry Process Automation & Control Monthly - April 2024
Water Industry Process Automation & Control Monthly - April 2024Water Industry Process Automation & Control Monthly - April 2024
Water Industry Process Automation & Control Monthly - April 2024
 
KubeKraft presentation @CloudNativeHooghly
KubeKraft presentation @CloudNativeHooghlyKubeKraft presentation @CloudNativeHooghly
KubeKraft presentation @CloudNativeHooghly
 
ONLINE FOOD ORDER SYSTEM PROJECT REPORT.pdf
ONLINE FOOD ORDER SYSTEM PROJECT REPORT.pdfONLINE FOOD ORDER SYSTEM PROJECT REPORT.pdf
ONLINE FOOD ORDER SYSTEM PROJECT REPORT.pdf
 
VIP Call Girls Ankleshwar 7001035870 Whatsapp Number, 24/07 Booking
VIP Call Girls Ankleshwar 7001035870 Whatsapp Number, 24/07 BookingVIP Call Girls Ankleshwar 7001035870 Whatsapp Number, 24/07 Booking
VIP Call Girls Ankleshwar 7001035870 Whatsapp Number, 24/07 Booking
 
NFPA 5000 2024 standard .
NFPA 5000 2024 standard                                  .NFPA 5000 2024 standard                                  .
NFPA 5000 2024 standard .
 
(INDIRA) Call Girl Aurangabad Call Now 8617697112 Aurangabad Escorts 24x7
(INDIRA) Call Girl Aurangabad Call Now 8617697112 Aurangabad Escorts 24x7(INDIRA) Call Girl Aurangabad Call Now 8617697112 Aurangabad Escorts 24x7
(INDIRA) Call Girl Aurangabad Call Now 8617697112 Aurangabad Escorts 24x7
 
Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...
Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...
Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...
 
Booking open Available Pune Call Girls Koregaon Park 6297143586 Call Hot Ind...
Booking open Available Pune Call Girls Koregaon Park  6297143586 Call Hot Ind...Booking open Available Pune Call Girls Koregaon Park  6297143586 Call Hot Ind...
Booking open Available Pune Call Girls Koregaon Park 6297143586 Call Hot Ind...
 
Online banking management system project.pdf
Online banking management system project.pdfOnline banking management system project.pdf
Online banking management system project.pdf
 
AKTU Computer Networks notes --- Unit 3.pdf
AKTU Computer Networks notes ---  Unit 3.pdfAKTU Computer Networks notes ---  Unit 3.pdf
AKTU Computer Networks notes --- Unit 3.pdf
 
Thermal Engineering Unit - I & II . ppt
Thermal Engineering  Unit - I & II . pptThermal Engineering  Unit - I & II . ppt
Thermal Engineering Unit - I & II . ppt
 

Kafka at half the price with JBOD setup

  • 1. ©2017 LinkedIn Corporation. All Rights Reserved. Kafka at half the price Dong Lin Streams Infrastructure
  • 2. ©2017 LinkedIn Corporation. All Rights Reserved. 2 Agenda ▪ Motivation – Why switch from RAID-10 to JBOD? – Tradeoff between cost and fault-tolerance ▪ Design – How to run Kafka with disk failure – How to move replicas between disks ▪ Alternatives ▪ Evaluation ▪ Changes in operational procedures ▪ Future work ▪ Reference
  • 3. ©2017 LinkedIn Corporation. All Rights Reserved. 3 RAID-10 setup with RF=2 producer Broker 1 Broker 2 A B C A B C A B C A B C
  • 4. ©2017 LinkedIn Corporation. All Rights Reserved. 4 RAID-10 setup with RF=2 producer Broker 1 Broker 2 A B C A B C A B C A B C - Tolerate only one broker failure
  • 5. ©2017 LinkedIn Corporation. All Rights Reserved. 5 RAID-10 setup with RF=3 producer Broker 1 Broker 3 A B C A B C A B C A B C Broker 2 A B C A B C - Tolerate up to two broker failures - 50% more storage cost
  • 6. ©2017 LinkedIn Corporation. All Rights Reserved. 6 JBOD setup with RF=2 producer Broker 1 A B C Broker 2 A B C - Tolerate only one broker failure - 50% less storage cost
  • 7. ©2017 LinkedIn Corporation. All Rights Reserved. 7 JBOD setup with RF=3 producer Broker 1 A B C Broker 3 A B C Broker 2 A B C - Tolerate up to two broker failures - 25% less storage cost
  • 8. ©2017 LinkedIn Corporation. All Rights Reserved. 8 RAID vs. JBOD Setup Replication Storage cost Broker failure tolerance Disk failure tolerance RAID-10 2 (baseline) 4X 1 (too small) 3 3 6X (50% up) 2 5
  • 9. ©2017 LinkedIn Corporation. All Rights Reserved. 9 RAID vs. JBOD Setup Replication Storage cost Broker failure tolerance Disk failure tolerance RAID-10 2 (baseline) 4X 1 (too small) 3 3 6X (50% up) 2 5 JBOD 2 2X (50% down) 1 (too small) 1 (too small) 3 (future) 3X (25% down) 2 (100% up) 2 (33% down)
  • 10. ©2017 LinkedIn Corporation. All Rights Reserved. 10 RAID vs. JBOD Setup Replication Storage cost Broker failure tolerance Disk failure tolerance RAID-10 2 (baseline) 4X 1 (too small) 3 3 6X (50% up) 2 5 JBOD 2 2X (50% down) 1 (too small) 1 (too small) 3 (future) 3X (25% down) 2 (100% up) 2 (33% down) 4 4X 3 (300% up) 3
  • 11. ©2017 LinkedIn Corporation. All Rights Reserved. 11 Agenda ▪ Motivation – Why switch from RAID-10 to JBOD? – Tradeoff between cost and fault-tolerance ▪ Design – How to run Kafka with disk failure – How to move replicas between disks ▪ Alternatives ▪ Evaluation ▪ Changes in operational procedures ▪ Future work ▪ Reference
  • 12. ©2017 LinkedIn Corporation. All Rights Reserved. 12 Problem 1: All replicas become offline if any log directory fails Broker Disk A IOException when accessing disk B Disk B Disk C Broker Disk A Disk B Disk C
  • 13. ©2017 LinkedIn Corporation. All Rights Reserved. 13 Solution: Only replicas on the failed disk become offline Broker Disk A IOException when accessing disk B Disk B Disk C Broker Disk A Disk B Disk C
  • 14. ©2017 LinkedIn Corporation. All Rights Reserved. 14 Problem 2: Controller does not recognize disk failure Zookeeper Controller Broker 1 Partition 1 Partition 2 STEP 2: - Broker -> is alive? - Broker -> partition list STEP 1: I am online X No further leader election STEP 3: Become leader for partitions 1 and 2 STEP 4: partition 2 is offline
  • 15. ©2017 LinkedIn Corporation. All Rights Reserved. 15 Solution: Broker notifies and provides partition list to controller Zookeeper Controller Broker 1 Partition 1 Partition 2 STEP 2: Broker 1 has new disk failureSTEP 1: Notify disk failure X STEP 3: Become leader for partitions 1 and 2 STEP 4: partition 2 is offline STEP 5: Elect another broker as leader for partition 2
  • 16. ©2017 LinkedIn Corporation. All Rights Reserved. 16 Problem 3: Broker always creates log for partition if not exist Zookeeper Controller STEP 3: Become follower for partition 2 Create partition 2 if non-existent Broker 1 Partition 1 Partition 2 X STEP 2: - Broker -> is alive? - Broker -> partition list STEP 1: I am online
  • 17. ©2017 LinkedIn Corporation. All Rights Reserved. 17 Problem 3: Broker always creates log for partition if not exist Zookeeper Controller STEP 3: Become follower for partition 2 Create partition 2 if non-existent STEP 4: Created partition 2 (problematic) Broker 1 Partition 1 Partition 2 Partition 2 X STEP 2: - Broker -> is alive? - Broker -> partition list STEP 1: I am online
  • 18. ©2017 LinkedIn Corporation. All Rights Reserved. 18 Problem 3: Broker always creates log for partition if not exist Zookeeper Controller STEP 3: Become follower for partition 2 Create partition 2 if non-existent Broker 1 Partition 1 Partition 2 Partition 2 X STEP 2: - Broker -> is alive? - Broker -> partition list Good disk may become overloaded STEP 1: I am online STEP 4: Created partition 2 (problematic)
  • 19. ©2017 LinkedIn Corporation. All Rights Reserved. 19 Solution: Controller specifies whether to create log for partition Zookeeper Controller STEP 3: Become follower for partition 2 This is NOT a new partition STEP 4: Partition 2 is not available and there is offline log dir Broker 1 Partition 1 Partition 2 X STEP 2: - Broker -> is alive? - Broker -> partition list - Broker -> is new partition? STEP 5: Exclude broker 1 from leader election for partition 2 STEP 1: I am online
  • 20. ©2017 LinkedIn Corporation. All Rights Reserved. 20 Problem 4: No mechanism to move replicas between disks Broker 1 P1 P2 P3 P5P4 P6 P7 Disk 1 Disk 2
  • 21. ©2017 LinkedIn Corporation. All Rights Reserved. 21 Example workflow to move replicas between disks Broker Client STEP 1: DescribeDirRequest STEP 2: DescribeDirResponse Partition list and size STEP 3: ChangeDirRequest Disk 1 Disk 2 STEP 4: create p1.move STEP 5: ChangeDirResponse (Inprogress) STEP 6: copy data from p1.log to p1.move STEP 7: delete p1.log and rename p1.move to p1.log STEP 8: Verify new assignment via DescribeDirRequest
  • 22. ©2017 LinkedIn Corporation. All Rights Reserved. 22 Agenda ▪ Motivation – Why switch from RAID-10 to JBOD? – Tradeoff between cost and fault-tolerance ▪ Design – How to run Kafka with disk failure – How to move replicas between disks ▪ Alternatives ▪ Evaluation ▪ Changes in operational procedures ▪ Future work ▪ Reference
  • 23. ©2017 LinkedIn Corporation. All Rights Reserved. 23 Alternatives ▪ RAID-0 doesn’t provide disk fault tolerance – Assume each broker has 10 disks and RF = 2 – RAID-0 has 100X higher probability of unavailability due to disk failure than JBOD ▪ RAID-5 and RAID-6 have poor performance ▪ Hardware RAID is expensive ▪ One broker per disk
  • 24. ©2017 LinkedIn Corporation. All Rights Reserved. 24 one-broker-per-machine vs. one-broker-per-disk Physical Machine Disk 1 Disk 2 Disk 3 Broker 1 Physical Machine Disk 1 Disk 2 Disk 3 Broker 1 Broker 2 Broker 3 V.S. One-broker-per-machine One-broker-per-disk
  • 25. ©2017 LinkedIn Corporation. All Rights Reserved. 25 one-broker-per-machine vs. one-broker-per-disk ▪ Both solutions use JBOD as disk configuration ▪ Main drawbacks of one-broker-per-disk (assume 10 disk per machine) – 100X threads and 100X sockets per machine – 10X control plane traffic from the controller to brokers (e.g. MetadataRequest) – 10X broker instances and configuration files to manage – 10X time to bounce a cluster if we bounce one broker at a time – 10X load on external service (e.g. a service used to query per-topic ACL) – Less efficient quota enforcement – Less efficient rebalance across disks on the same machine – Lower throughput
  • 26. ©2017 LinkedIn Corporation. All Rights Reserved. 26 Experimental setup ▪ Brokers deployed on 15 machines with 10 disks per machine IO threads Network threads Replica-fetcher threads One-broker-per-machine 160 120 140 One-broker-per-disk 16 12 14 ▪ Producers deployed on 15 machines acks threads sync retries retry backoff message size batch size request timeout all 50 true MAX_INT 60 sec 100 KB 1 MB MAX_INT ▪ Topic configuration partition replication factor min-insync-replicas 512 3 3
  • 27. ©2017 LinkedIn Corporation. All Rights Reserved. 27 One-broker-per-machine throughput Average throughput is 2.3 GBps
  • 28. ©2017 LinkedIn Corporation. All Rights Reserved. 28 One-broker-per-disk throughput Average throughput is 2 GBps
  • 29. ©2017 LinkedIn Corporation. All Rights Reserved. 29 Agenda ▪ Motivation – Why switch from RAID-10 to JBOD? – Tradeoff between cost and fault-tolerance ▪ Design – How to run Kafka with disk failure – How to move replicas between disks ▪ Alternatives ▪ Evaluation ▪ Changes in operational procedures ▪ Future work ▪ Reference
  • 30. ©2017 LinkedIn Corporation. All Rights Reserved. 30 Changes in operational procedure ▪ Adjust replication factor and min.insync.replicas ▪ Configure num.replica.move.threads for broker ▪ Monitor disk failure via the OfflineLogDirectoriesCount metric
  • 31. ©2017 LinkedIn Corporation. All Rights Reserved. 31 Future work ▪ Use more intelligent solution to select log directory for new replica ▪ Automatic load balancing across log directories on the same broker – Reduced operational overhead ▪ Distribute segments of a given replica across multiple log directories – Less overhead for rebalance between disks – Higher partition size limit ▪ Handle partial disk failure, e.g. disk with degraded performance.
  • 32. ©2017 LinkedIn Corporation. All Rights Reserved. 32 References ▪ KIP-112: Handle disk failure for JBOD (link) ▪ KIP-113: Support replicas movement between log directories (link)
  • 33. ©2017 LinkedIn Corporation. All Rights Reserved. 33