Mais conteúdo relacionado Semelhante a Kafka at half the price with JBOD setup (20) Kafka at half the price with JBOD setup2. ©2017 LinkedIn Corporation. All Rights Reserved. 2
Agenda
▪ Motivation
– Why switch from RAID-10 to JBOD?
– Tradeoff between cost and fault-tolerance
▪ Design
– How to run Kafka with disk failure
– How to move replicas between disks
▪ Alternatives
▪ Evaluation
▪ Changes in operational procedures
▪ Future work
▪ Reference
4. ©2017 LinkedIn Corporation. All Rights Reserved. 4
RAID-10 setup with RF=2
producer
Broker 1 Broker 2
A
B
C
A
B
C
A
B
C
A
B
C
- Tolerate only one broker failure
5. ©2017 LinkedIn Corporation. All Rights Reserved. 5
RAID-10 setup with RF=3
producer
Broker 1 Broker 3
A
B
C
A
B
C
A
B
C
A
B
C
Broker 2
A
B
C
A
B
C
- Tolerate up to two broker failures
- 50% more storage cost
6. ©2017 LinkedIn Corporation. All Rights Reserved. 6
JBOD setup with RF=2
producer
Broker 1
A
B
C
Broker 2
A
B
C
- Tolerate only one broker failure
- 50% less storage cost
7. ©2017 LinkedIn Corporation. All Rights Reserved. 7
JBOD setup with RF=3
producer
Broker 1
A
B
C
Broker 3
A
B
C
Broker 2
A
B
C
- Tolerate up to two broker failures
- 25% less storage cost
8. ©2017 LinkedIn Corporation. All Rights Reserved. 8
RAID vs. JBOD
Setup Replication Storage cost
Broker failure
tolerance
Disk failure
tolerance
RAID-10
2 (baseline) 4X 1 (too small) 3
3 6X (50% up) 2 5
9. ©2017 LinkedIn Corporation. All Rights Reserved. 9
RAID vs. JBOD
Setup Replication Storage cost
Broker failure
tolerance
Disk failure
tolerance
RAID-10
2 (baseline) 4X 1 (too small) 3
3 6X (50% up) 2 5
JBOD
2 2X (50% down) 1 (too small) 1 (too small)
3 (future) 3X (25% down) 2 (100% up) 2 (33% down)
10. ©2017 LinkedIn Corporation. All Rights Reserved. 10
RAID vs. JBOD
Setup Replication Storage cost
Broker failure
tolerance
Disk failure
tolerance
RAID-10
2 (baseline) 4X 1 (too small) 3
3 6X (50% up) 2 5
JBOD
2 2X (50% down) 1 (too small) 1 (too small)
3 (future) 3X (25% down) 2 (100% up) 2 (33% down)
4 4X 3 (300% up) 3
11. ©2017 LinkedIn Corporation. All Rights Reserved. 11
Agenda
▪ Motivation
– Why switch from RAID-10 to JBOD?
– Tradeoff between cost and fault-tolerance
▪ Design
– How to run Kafka with disk failure
– How to move replicas between disks
▪ Alternatives
▪ Evaluation
▪ Changes in operational procedures
▪ Future work
▪ Reference
12. ©2017 LinkedIn Corporation. All Rights Reserved. 12
Problem 1: All replicas become offline if any log directory fails
Broker
Disk A
IOException
when accessing disk B
Disk B
Disk C
Broker
Disk A
Disk B
Disk C
13. ©2017 LinkedIn Corporation. All Rights Reserved. 13
Solution: Only replicas on the failed disk become offline
Broker
Disk A
IOException
when accessing disk B
Disk B
Disk C
Broker
Disk A
Disk B
Disk C
14. ©2017 LinkedIn Corporation. All Rights Reserved. 14
Problem 2: Controller does not recognize disk failure
Zookeeper
Controller
Broker 1
Partition 1
Partition 2
STEP 2:
- Broker -> is alive?
- Broker -> partition list
STEP 1: I am online
X No further
leader election
STEP 3:
Become leader for
partitions 1 and 2
STEP 4:
partition 2 is offline
15. ©2017 LinkedIn Corporation. All Rights Reserved. 15
Solution: Broker notifies and provides partition list to controller
Zookeeper
Controller
Broker 1
Partition 1
Partition 2
STEP 2: Broker 1 has new disk failureSTEP 1: Notify disk failure
X
STEP 3:
Become leader for
partitions 1 and 2
STEP 4:
partition 2 is offline
STEP 5: Elect
another broker as
leader for partition 2
16. ©2017 LinkedIn Corporation. All Rights Reserved. 16
Problem 3: Broker always creates log for partition if not exist
Zookeeper
Controller
STEP 3:
Become follower for partition 2
Create partition 2 if non-existent
Broker 1
Partition 1
Partition 2
X
STEP 2:
- Broker -> is alive?
- Broker -> partition list
STEP 1: I am online
17. ©2017 LinkedIn Corporation. All Rights Reserved. 17
Problem 3: Broker always creates log for partition if not exist
Zookeeper
Controller
STEP 3:
Become follower for partition 2
Create partition 2 if non-existent
STEP 4:
Created partition 2
(problematic)
Broker 1
Partition 1
Partition 2
Partition 2
X
STEP 2:
- Broker -> is alive?
- Broker -> partition list
STEP 1: I am online
18. ©2017 LinkedIn Corporation. All Rights Reserved. 18
Problem 3: Broker always creates log for partition if not exist
Zookeeper
Controller
STEP 3:
Become follower for partition 2
Create partition 2 if non-existent
Broker 1
Partition 1
Partition 2
Partition 2
X
STEP 2:
- Broker -> is alive?
- Broker -> partition list
Good disk may
become overloaded STEP 1: I am online
STEP 4:
Created partition 2
(problematic)
19. ©2017 LinkedIn Corporation. All Rights Reserved. 19
Solution: Controller specifies whether to create log for partition
Zookeeper
Controller
STEP 3:
Become follower for partition 2
This is NOT a new partition
STEP 4:
Partition 2 is not available
and there is offline log dir
Broker 1
Partition 1
Partition 2
X
STEP 2:
- Broker -> is alive?
- Broker -> partition list
- Broker -> is new partition?
STEP 5:
Exclude broker 1 from
leader election
for partition 2
STEP 1: I am online
20. ©2017 LinkedIn Corporation. All Rights Reserved. 20
Problem 4: No mechanism to move replicas between disks
Broker 1
P1 P2 P3
P5P4 P6
P7
Disk 1 Disk 2
21. ©2017 LinkedIn Corporation. All Rights Reserved. 21
Example workflow to move replicas between disks
Broker
Client
STEP 1: DescribeDirRequest
STEP 2: DescribeDirResponse
Partition list and size
STEP 3: ChangeDirRequest
Disk 1 Disk 2
STEP 4: create p1.move
STEP 5: ChangeDirResponse
(Inprogress)
STEP 6: copy data from
p1.log to p1.move
STEP 7: delete p1.log and
rename p1.move to p1.log
STEP 8: Verify new assignment
via DescribeDirRequest
22. ©2017 LinkedIn Corporation. All Rights Reserved. 22
Agenda
▪ Motivation
– Why switch from RAID-10 to JBOD?
– Tradeoff between cost and fault-tolerance
▪ Design
– How to run Kafka with disk failure
– How to move replicas between disks
▪ Alternatives
▪ Evaluation
▪ Changes in operational procedures
▪ Future work
▪ Reference
23. ©2017 LinkedIn Corporation. All Rights Reserved. 23
Alternatives
▪ RAID-0 doesn’t provide disk fault tolerance
– Assume each broker has 10 disks and RF = 2
– RAID-0 has 100X higher probability of unavailability due to disk failure than JBOD
▪ RAID-5 and RAID-6 have poor performance
▪ Hardware RAID is expensive
▪ One broker per disk
24. ©2017 LinkedIn Corporation. All Rights Reserved. 24
one-broker-per-machine vs. one-broker-per-disk
Physical Machine
Disk 1 Disk 2 Disk 3
Broker 1
Physical Machine
Disk 1 Disk 2 Disk 3
Broker 1 Broker 2 Broker 3
V.S.
One-broker-per-machine One-broker-per-disk
25. ©2017 LinkedIn Corporation. All Rights Reserved. 25
one-broker-per-machine vs. one-broker-per-disk
▪ Both solutions use JBOD as disk configuration
▪ Main drawbacks of one-broker-per-disk (assume 10 disk per machine)
– 100X threads and 100X sockets per machine
– 10X control plane traffic from the controller to brokers (e.g. MetadataRequest)
– 10X broker instances and configuration files to manage
– 10X time to bounce a cluster if we bounce one broker at a time
– 10X load on external service (e.g. a service used to query per-topic ACL)
– Less efficient quota enforcement
– Less efficient rebalance across disks on the same machine
– Lower throughput
26. ©2017 LinkedIn Corporation. All Rights Reserved. 26
Experimental setup
▪ Brokers deployed on 15 machines with 10 disks per machine
IO threads Network threads Replica-fetcher threads
One-broker-per-machine 160 120 140
One-broker-per-disk 16 12 14
▪ Producers deployed on 15 machines
acks threads sync retries retry backoff message size batch size request timeout
all 50 true MAX_INT 60 sec 100 KB 1 MB MAX_INT
▪ Topic configuration
partition replication factor min-insync-replicas
512 3 3
29. ©2017 LinkedIn Corporation. All Rights Reserved. 29
Agenda
▪ Motivation
– Why switch from RAID-10 to JBOD?
– Tradeoff between cost and fault-tolerance
▪ Design
– How to run Kafka with disk failure
– How to move replicas between disks
▪ Alternatives
▪ Evaluation
▪ Changes in operational procedures
▪ Future work
▪ Reference
30. ©2017 LinkedIn Corporation. All Rights Reserved. 30
Changes in operational procedure
▪ Adjust replication factor and min.insync.replicas
▪ Configure num.replica.move.threads for broker
▪ Monitor disk failure via the OfflineLogDirectoriesCount metric
31. ©2017 LinkedIn Corporation. All Rights Reserved. 31
Future work
▪ Use more intelligent solution to select log directory for new replica
▪ Automatic load balancing across log directories on the same broker
– Reduced operational overhead
▪ Distribute segments of a given replica across multiple log directories
– Less overhead for rebalance between disks
– Higher partition size limit
▪ Handle partial disk failure, e.g. disk with degraded performance.
32. ©2017 LinkedIn Corporation. All Rights Reserved. 32
References
▪ KIP-112: Handle disk failure for JBOD (link)
▪ KIP-113: Support replicas movement between log directories (link)