2. Speaker introduction
• Yuto Kawamura
• Senior Software Engineer at LINE
• Leading a project to redesign the microservices architecture w/ Kafka
• Apache Kafka Contributor
• Speaker at Kafka Summit SF 2017
• Also at Kafka Meetup #3
3. Outline
• Kafka at LINE as of today (2018.04)
• Challenges of multitenancy
• Engineering for achieving multitenancy
5. We have more clusters
• Added more clusters since last year to support:
  • Different DCs
  • Security-sensitive data w/ SASL+TLS
• They are separated by "purpose" but not by "user"; this is our multitenancy strategy
• Fewer clusters allow us to concentrate our engineering resources on maximizing their performance
• They're conceptually the "Data Hub" too
6. One cluster has many users
• Topics: 100 ~ 400+ per cluster
• Users: a few ~ tens per cluster
• Messages: 150 billion messages / day in the largest cluster
  • 3+ million / sec at peak
• No message is supposed to be lost, because every usage is somehow related to a service
8. For doing multitenancy, we have to ensure:
• A certain level of isolation among client workloads
• The cluster is proof against abusive clients
• We can track which client is sending a particular request
• We have to be confident about what we do, to say "don't worry" to people saying "we want a dedicated cluster only for us!"
10. Request Quota
• It's more important to manage the number of requests than the incoming/outgoing byte rate
• Kafka is amazingly strong at handling large data if it is well-batched
  • => For consumers, responses are naturally batched
  • => The main danger is producers configured with linger.ms=0
• Starting from 0.11.0.0, by KIP-124 we can configure a request rate quota 2

2 https://cwiki.apache.org/confluence/display/KAFKA/KIP-124+-+Request+rate+quotas
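KIP-124 quotas throttle rather than reject: the broker delays responses until the client's measured rate comes back under its quota. A simplified sketch of that relationship (illustrative names and a simplified formula, not Kafka's actual ClientQuotaManager code):

```java
public class QuotaSketch {
    // Simplified: the broker delays a violating client long enough that,
    // averaged over the quota window, its rate drops back to the quota.
    static long throttleTimeMs(double observedRate, double quota, long windowMs) {
        if (observedRate <= quota) {
            return 0; // within quota, no delay
        }
        // Delay proportional to how far the client overshot the quota.
        return (long) ((observedRate - quota) / quota * windowMs);
    }

    public static void main(String[] args) {
        // A client using 80% of request handler time against a 40% quota,
        // measured over a 10-second window, is delayed 10 seconds.
        System.out.println(throttleTimeMs(80.0, 40.0, 10_000));
    }
}
```

The same shape applies to `request_percentage` and byte-rate quotas alike: abusive clients slow themselves down instead of degrading the broker for everyone.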
11. Request Quota
• Manage the master copy of cluster config in YAML inside an Ansible repository
• Apply it all at once during cluster provisioning by the kafka_config Ansible module (developed internally)
• Can tell the latest config on a cluster w/o querying the cluster, and can keep change history in git
---
kafka_cluster_configs:
  - entity_type: clients
    configs:
      request_percentage: 40
      producer_byte_rate: 1073741824
  - entity_type: clients
    entity_name: foobar-producer
    configs:
      request_percentage: 200
12. Slowlog
• Log requests which took longer than a certain threshold to process
• Kafka has "request logging", but it produces far too many lines
• Inspired by HBase's slowlog
# RequestChannel.scala#updateRequestMetrics
+ slowLogThresholdMap.get(metricNames.head).filter(_ >= 0).filter { v =>
+ val targetTime = requestId match {
+ case ApiKeys.FETCH.id => totalTime - apiRemoteTime
+ case _ => totalTime
+ }
+
+ targetTime >= v
+ }.foreach { _ =>
+ requestLogger.warn("Slow response:%s from connection %s;totalTime:%d...
+ .format(requestDesc(true), connectionId, totalTime, requestQueueTime...
+ }
[2016-12-26 16:04:20,135] WARN Slow response:Name: FetchRequest;
Version: 2 ... ;totalTime:1817;localTime: ...
14. The disk-read-by-delayed-consumer problem
• Detection: 50x ~ 100x slower 99th percentile Produce response time
• Disk reads of a certain amount
• Network threads' utilization was very high
15. Suspecting sendfile is taking long...
• Because: 1. disk reads were occurring at that time, 2. network threads' utilization was high
$ stap -e '(script counting sendfile(2) duration histogram)'
value |---------------------------------------- count
0 | 0
1 | 71
2 |@@@ 6171
16 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 29472
32 |@@@ 3418
2048 | 0
...
8192 | 3
• Normal: 2 ~ 32us
• Outliers: 8ms ~
• (About SystemTap, see my previous presentation 3)

3 https://www.slideshare.net/kawamuray/kafka-meetup-jp-3-engineering-apache-kafka-at-line
16. Kafka broker's thread model
• Network threads (controlled by num.network.threads) handle reads/writes of requests/responses
• Network threads hold established connections exclusively, for event-driven IO
• Request handler threads (controlled by num.io.threads) handle request processing and IO against the block device, except sendfile(2) for Fetch requests
17. When a Fetch request for data that isn't present in the page cache occurs...
18. Problem definition
• Network threads contain potentially-blocking ops while they are supposed to work as an event loop
• And we have no way to know whether an upcoming sendfile(2) will block awaiting a disk read or not
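To see why this is so damaging, here is a minimal standalone sketch (illustrative, not Kafka code): one single-threaded "network thread" serves two connections, and a simulated disk stall in the first response delays the unrelated second one by the full stall time.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class EventLoopStall {
    // Returns how long connection B's tiny response waited, in milliseconds.
    static long measureStallMs() throws Exception {
        // A single-threaded executor stands in for one network thread's event loop.
        ExecutorService networkThread = Executors.newSingleThreadExecutor();
        long start = System.nanoTime();

        // Connection A's response: sendfile(2) stalls awaiting a disk read
        // (simulated here with a sleep).
        networkThread.submit(() -> {
            try { Thread.sleep(200); } catch (InterruptedException ignored) { }
        });

        // Connection B's response: tiny and fully in memory -- yet it must wait
        // for A, because both connections are pinned to the same thread.
        Future<Long> b = networkThread.submit(() -> (System.nanoTime() - start) / 1_000_000);

        long latencyMs = b.get();
        networkThread.shutdown();
        return latencyMs;
    }

    public static void main(String[] args) throws Exception {
        System.out.println("connection B delayed " + measureStallMs() + " ms by connection A's stall");
    }
}
```

This is exactly why one lagging consumer could inflate Produce latency for every client sharing the network thread.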
19. It was one of the worst issues we had, because it:
• Completely breaks resource isolation among all clients, including producers
• Occurs naturally when one of the consumers slows down
  • We have to communicate with users every time to ask for a fix
• Occurs 100% of the time when a broker restores log data from the leader
20. Solution candidates
• A: Separate network threads among clients
  • => Possible, but a lot of changes required
  • => Not essential, because network threads should be purely computation intensive
• B: Balance connections among network threads
  • => Possible, but again a lot of changes
  • => Still, at the first moment other connections would get affected
• C: Make sure that data is ready in memory before the response is passed to the network thread
21. To make sure sendfile(2) never blocks in network threads...
• The target data must be available in the page cache
22. How?
NAME
       sendfile - transfer data between file descriptors
SYNOPSIS
       #include <sys/sendfile.h>
       ssize_t sendfile(int out_fd, int in_fd, off_t *offset, size_t count);
• sendfile(2) on Linux doesn't accept flags for controlling its behavior
• Interestingly, FreeBSD has such flags, by contribution from nginx and Netflix 1

1 https://www.nginx.com/blog/nginx-and-netflix-contribute-new-sendfile2-to-freebsd/
23. So we have to:
1. Pre-read data not available in the page cache from disk,
2. and confirm the pages' existence before passing the response to network threads
24. sendfile(2) to the dest /dev/null
• Calling channel.transferTo("/dev/null") (== sendfile(/dev/null)) in the request handler thread might populate the page cache?
• Tested it out, and figured out there's no noticeable performance impact
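The trick can be reproduced outside the broker with plain NIO (a standalone sketch under the same assumption as the patch, i.e. that sendfile(2) to /dev/null walks the source pages through the page cache):

```java
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

public class DevNullWarmup {
    // Pushes the whole file through sendfile(2) into /dev/null,
    // returning the number of bytes transferred.
    static long warmup(Path file) throws Exception {
        try (FileChannel in = FileChannel.open(file, StandardOpenOption.READ);
             FileChannel devnull = FileChannel.open(Paths.get("/dev/null"),
                                                    StandardOpenOption.WRITE)) {
            long size = in.size();
            long transferred = 0;
            // transferTo may move fewer bytes than requested, so loop until done.
            while (transferred < size) {
                transferred += in.transferTo(transferred, size - transferred, devnull);
            }
            return transferred;
        }
    }

    public static void main(String[] args) throws Exception {
        Path segment = Files.createTempFile("segment", ".log");
        Files.write(segment, new byte[1 << 20]); // 1 MiB dummy "log segment"
        System.out.println(warmup(segment));     // 1048576
        Files.delete(segment);
    }
}
```

After the call returns, a subsequent sendfile(2) of the same range to a real socket finds the pages resident and does not block on disk.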
25. How could it be that harmless?
• The Linux kernel internally uses splice to implement sendfile(2)
• splice asks the struct file_operations to handle the splice
• /dev/null's struct file_operations (null_fops) just iterates the list of page pointers, not each byte
  • => Iteration count is SIZE / PAGE_SIZE (4KB)
# ./drivers/char/mem.c
static int pipe_to_null(struct pipe_inode_info *info, struct pipe_buffer *buf,
                        struct splice_desc *sd)
{
        return sd->len;
}

static ssize_t splice_write_null(struct pipe_inode_info *pipe, struct file *out,
                                 loff_t *ppos, size_t len, unsigned int flags)
{
        return splice_from_pipe(pipe, out, ppos, len, flags, pipe_to_null);
}
26. Patching broker to call sendfile(/dev/null) in request handler threads
# FileRecords.java
@SuppressWarnings("UnnecessaryFullyQualifiedName")
private static final java.nio.file.Path DEVNULL_PATH = new File("/dev/null").toPath();

public void prepareForRead() throws IOException {
    long size = Math.min(channel.size(), end) - start;
    try (FileChannel devnullChannel = FileChannel.open(DEVNULL_PATH,
            java.nio.file.StandardOpenOption.WRITE)) {
        channel.transferTo(start, size, devnullChannel);
    }
}
• Still not fully portable, because it assumes the underlying kernel's implementation detail (so we haven't contributed it upstream...)
27. ...and more, to minimize the impact of the increased syscalls...
# Log.scala#read
@@ -585,6 +586,17 @@ class Log(@volatile var dir: File,
if(fetchInfo == null) {
entry = segments.higherEntry(entry.getKey)
} else {
+ // For last entries we assume that it is hot enough to still have all data in page cache.
+ // Most of fetch requests are fetching from the tail of the log, so this optimization
+ // should save call of readahead() + mmap() + mincore() * N significantly.
+ if (!isLastEntry && fetchInfo.records.isInstanceOf[FileRecords]) {
+ try {
+ info("Prepare Read for " + fetchInfo.records.asInstanceOf[FileRecords].file().getPath)
+ fetchInfo.records.asInstanceOf[FileRecords].prepareForRead()
+ } catch {
+ case e: Throwable => warn("failed to prepare cache for read", e)
+ }
+ }
return fetchInfo
}
• Perform cache warmup only if the segment being read is NOT the latest
• => saves unnecessary syscalls for 99% of Fetch requests
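The skip condition itself boils down to a tiny predicate (illustrative names, not the broker's actual types): warm up only when the fetch reads a non-active segment, since the tail of the log is almost certainly still cached.

```java
public class WarmupDecision {
    // Warm up only for reads from a segment other than the active one,
    // i.e. a historical read that may have fallen out of the page cache.
    static boolean shouldWarmup(long fetchedSegmentBaseOffset, long activeSegmentBaseOffset) {
        return fetchedSegmentBaseOffset != activeSegmentBaseOffset;
    }

    public static void main(String[] args) {
        // Tail read from the active segment (base offset 3000): skip the extra syscall.
        System.out.println(shouldWarmup(3000, 3000)); // false
        // Lagging consumer reading an old segment (base offset 1000): warm it up.
        System.out.println(shouldWarmup(1000, 3000)); // true
    }
}
```

Because the vast majority of Fetch requests read the tail, the predicate is false almost all the time, which is where the "99% of Fetch requests" saving comes from.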
29. Conclusion
• Having fewer clusters enables us to concentrate on reliability engineering and essential troubleshooting/fixes
• Preventive engineering enables us to keep operating Kafka clusters at the highest reliability, even under high and inexplicable load
  • We've had some failures in the development cluster, but never in the production cluster
• The important thing in operating on-premise multitenancy: it's not necessary to prevent 100% of failures, but never let the same hole be punched again