Monitoring Kafka w/ Prometheus
Yuto Kawamura (kawamuray)
About me
● Software Engineer @ LINE corp
○ Develop & operate Apache HBase clusters
○ Design and implement data flow between services with ♥ to Apache Kafka
● Recent works
○ Playing with Apache Kafka and Kafka Streams
■ https://issues.apache.org/jira/browse/KAFKA-3471?jql=project%20%3D%20KAFKA%20AND%20assignee%20in%20(kawamuray)%20OR%20reporter%20in%20(kawamuray)
● Past works
○ Docker + Checkpoint Restore @ CoreOS meetup http://www.slideshare.net/kawamuray/coreos-meetup
○ Aggregating application logs with Norikra for real-time error notification @ Norikra Meetup #1 http://www.slideshare.net/kawamuray/norikra-meetup
○ Student @ Google Summer of Code 2013, 2014
● https://github.com/kawamuray
How are we (our team) using Prometheus?
● To monitor most of our middleware and the clients in our Java applications
○ Kafka clusters
○ HBase clusters
○ Kafka clients - producer and consumer
○ Stream Processing jobs
Overall Architecture
[Diagram: Grafana (dashboard, direct query); Prometheus (Federation); multiple Prometheus instances; HBase clusters; Kafka cluster; YARN Application; Pushgateway]
Why Prometheus?
● In-house monitoring tool wasn't enough for large-scale, high-resolution metrics collection
● Good data model
○ Genuine metric identifier + attributes as labels
■ http_requests_total{code="200",handler="prometheus",instance="localhost:9090",job="prometheus",method="get"}
● Scalable by nature
● Simple philosophy
○ Metrics exposure interface: GET /metrics => Text Protocol
○ Monolithic server
● Flexible but easy PromQL
○ Derive aggregated metrics by composing existing metrics
○ E.g., sum of RX bits per second across the entire cluster
■ sum(rate(node_network_receive_bytes{cluster="cluster-A",device="eth0"}[30s]) * 8)
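To make the arithmetic in that query concrete, here is a minimal Java sketch of what rate(...) * 8 followed by sum() computes. It assumes only two counter samples per series inside the window. The class and method names are hypothetical, and real Prometheus additionally extrapolates towards the window boundaries, which this sketch ignores.

```java
// Hypothetical illustration of rate(counter[window]) * 8; not Prometheus code.
class RateSketch {
    /**
     * Per-second increase of a byte counter between two samples, converted to bits.
     * A drop in the counter is treated as a reset to zero, so the later value
     * itself is taken as the increase.
     */
    static double bitsPerSecond(double firstBytes, double firstTimeSec,
                                double lastBytes, double lastTimeSec) {
        double elapsed = lastTimeSec - firstTimeSec;
        if (elapsed <= 0) {
            throw new IllegalArgumentException("samples must be in increasing time order");
        }
        double increase = lastBytes >= firstBytes ? lastBytes - firstBytes : lastBytes;
        return increase / elapsed * 8;
    }

    /** Counterpart of the outer sum(): add up the per-series rates. */
    static double sum(double... rates) {
        double total = 0;
        for (double r : rates) {
            total += r;
        }
        return total;
    }

    public static void main(String[] args) {
        // Two made-up hosts in the same cluster, sampled 30 seconds apart.
        double hostA = bitsPerSecond(1_000, 0, 4_000, 30);
        double hostB = bitsPerSecond(500, 0, 2_000, 30);
        System.out.println(sum(hostA, hostB));
    }
}
```

The multiplication by 8 happens per series before the sum, in the same order as in the PromQL expression.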
Deployment
● Launch
○ Official Docker image: https://hub.docker.com/r/prom/prometheus/
○ Ansible for generating prometheus.yml dynamically from the inventory, and for container management
● Machine spec
○ 2.40GHz * 24 CPUs
○ 192GB RAM
○ 6 SSDs
○ One SSD per Prometheus instance
○ Overkill? => Obviously. We reused existing idle servers. You certainly don't need this crazy spec just to use it.
Kafka monitoring w/ Prometheus overview
[Diagram components: Kafka broker; Kafka client in Java application; YARN ResourceManager; stream processing jobs on YARN; Prometheus server; Pushgateway; JMX exporter; Prometheus Java library + servlet; JSON exporter; Kafka consumer group exporter]
Monitoring Kafka brokers - jmx_exporter
● https://github.com/prometheus/jmx_exporter
● Run as a standalone process (not as a -javaagent)
○ Only to avoid a cumbersome rolling restart
○ Maybe switch to the javaagent at the next rolling restart :p
● With a very complicated config.yml
○ https://gist.github.com/kawamuray/25136a9ab22b1cb992e435e0ea67eb06
● Colocate one instance per broker on the same host
Monitoring Kafka producer on Java application - prometheus_simpleclient
● https://github.com/prometheus/client_java
● Official Java client library
prometheus_simpleclient - Basic usage
private static final Counter queueOutCounter =
    Counter.build()
        .namespace("kafka_streams")        // Namespace (= application prefix?)
        .name("process_count")             // Metric name
        .help("Process calls count")       // Metric description
        .labelNames("processor", "topic")  // Declare labels
        .register();                       // Register to CollectorRegistry.defaultRegistry (default, global registry)
...
queueOutCounter.labels("Processor-A", "topic-T").inc();    // Increment counter with labels
queueOutCounter.labels("Processor-B", "topic-P").inc(2.0);
=> kafka_streams_process_count{processor="Processor-A",topic="topic-T"} 1.0
   kafka_streams_process_count{processor="Processor-B",topic="topic-P"} 2.0
Exposing Java application metrics
● Through servlet
○ io.prometheus.client.exporter.MetricsServlet from simpleclient_servlet
● Add an entry to web.xml, or register it with an embedded Jetty:
Server server = new Server(METRICS_PORT);
ServletContextHandler context = new ServletContextHandler();
context.setContextPath("/");
server.setHandler(context);
context.addServlet(new ServletHolder(new MetricsServlet()), "/metrics");
server.start();
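What the servlet returns is plain text, one sample per line. As a rough illustration of that "GET /metrics => text protocol" contract, the following sketch serves one hand-rendered sample with the JDK's built-in com.sun.net.httpserver instead of Jetty and MetricsServlet. This is a swapped-in technique for illustration only: the class name, the loopback address and the rendering helper are assumptions, and a real application should keep using the client library.

```java
import com.sun.net.httpserver.HttpServer;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical JDK-only stand-in for MetricsServlet; not the client library's code.
class TextFormatSketch {
    /** Renders one sample as a line of the text exposition format. */
    static String renderSample(String name, Map<String, String> labels, double value) {
        String labelPart = labels.entrySet().stream()
            .map(e -> e.getKey() + "=\"" + escape(e.getValue()) + "\"")
            .collect(Collectors.joining(","));
        return name + (labelPart.isEmpty() ? "" : "{" + labelPart + "}") + " " + value;
    }

    /** Label values escape backslash, double quote and newline. */
    static String escape(String v) {
        return v.replace("\\", "\\\\").replace("\"", "\\\"").replace("\n", "\\n");
    }

    public static void main(String[] args) throws IOException {
        Map<String, String> labels = new LinkedHashMap<>();
        labels.put("processor", "Processor-A");
        labels.put("topic", "topic-T");
        String body = renderSample("kafka_streams_process_count", labels, 1.0) + "\n";

        // Port 0 asks the OS for any free port; a real exporter would use a fixed one.
        HttpServer server = HttpServer.create(new InetSocketAddress("127.0.0.1", 0), 0);
        server.createContext("/metrics", exchange -> {
            byte[] bytes = body.getBytes(StandardCharsets.UTF_8);
            exchange.getResponseHeaders().add("Content-Type", "text/plain; version=0.0.4");
            exchange.sendResponseHeaders(200, bytes.length);
            try (OutputStream out = exchange.getResponseBody()) {
                out.write(bytes);
            }
        });
        server.start();
        try {
            // Scrape ourselves once, the way a Prometheus server would.
            URL url = new URL("http://127.0.0.1:" + server.getAddress().getPort() + "/metrics");
            try (InputStream in = url.openStream()) {
                System.out.print(new String(in.readAllBytes(), StandardCharsets.UTF_8));
            }
        } finally {
            server.stop(0);
        }
    }
}
```

The sketch scrapes itself once and shuts down so that it terminates; a long-running exporter would simply leave the server running.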
Monitoring Kafka producer on Java application - prometheus_simpleclient
● Primitive types:
○ Counter, Gauge, Histogram, Summary
● Kafka's MetricsReporter interface hands over KafkaMetric instances
● How to expose their values?
● => Implement a proxy metric type which extends SimpleCollector
public class PrometheusMetricsReporter implements MetricsReporter {
    ...
    private void registerMetric(KafkaMetric kafkaMetric) {
        ...
        KafkaMetricProxy.build()
            .namespace("kafka")
            .name(fqn)
            .help("Help: " + metricName.description())
            .labelNames(labelNames)
            .register();
        ...
    }
    ...
}
public class KafkaMetricProxy extends SimpleCollector<KafkaMetricProxy.Child> {
    public static class Builder extends SimpleCollector.Builder<Builder, KafkaMetricProxy> {
        @Override
        public KafkaMetricProxy create() {
            return new KafkaMetricProxy(this);
        }
    }

    KafkaMetricProxy(Builder b) {
        super(b);
    }
    ...
    @Override
    public List<MetricFamilySamples> collect() {
        List<MetricFamilySamples.Sample> samples = new ArrayList<>();
        for (Map.Entry<List<String>, Child> entry : children.entrySet()) {
            List<String> labels = entry.getKey();
            Child child = entry.getValue();
            samples.add(new Sample(fullname, labelNames, labels, child.getValue()));
        }
        return Collections.singletonList(new MetricFamilySamples(fullname, Type.GAUGE, help, samples));
    }
}
Monitoring YARN jobs - json_exporter
● https://github.com/kawamuray/prometheus-json-exporter
○ Can export values from JSON by specifying them as JSONPath expressions
● http://<rm http address:port>/ws/v1/cluster/apps
○ https://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/ResourceManagerRest.html#Cluster_Applications_API
○ https://gist.github.com/kawamuray/c07b03de82bf6ddbdae6508e27d3fb4d
json_exporter
Config:
- name: yarn_application
  type: object
  path: $.apps.app[*]?(@.state == "RUNNING")
  labels:
    application: $.id
    phase: beta
  values:
    alive: 1
    elapsed_time: $.elapsedTime
    allocated_mb: $.allocatedMB
...
ResourceManager response:
{"apps":{"app":[
  {
    "id": "application_1326815542473_0001",
    "state": "RUNNING",
    "elapsedTime": 25196,
    "allocatedMB": 1024,
    ...
  },
  ...
]}}
Exported metrics:
yarn_application_alive{application="application_1326815542473_0001",phase="beta"} 1
yarn_application_elapsed_time{application="application_1326815542473_0001",phase="beta"} 25196
yarn_application_allocated_mb{application="application_1326815542473_0001",phase="beta"} 1024
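The mapping from the JSON response to those metric lines can be sketched in a few lines of Java. This is not the exporter's code (json_exporter is a separate tool); it assumes the response has already been parsed into maps, and the class and helper names are hypothetical. It only mirrors the three steps the config describes: filter on state, turn id into a label, and turn the listed fields into values.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical re-implementation of the config's effect; not json_exporter itself.
class YarnAppMetricsSketch {
    /** Emits alive / elapsed_time / allocated_mb samples for every RUNNING app. */
    static List<String> export(List<Map<String, Object>> apps) {
        List<String> lines = new ArrayList<>();
        for (Map<String, Object> app : apps) {
            // Counterpart of the JSONPath filter ?(@.state == "RUNNING")
            if (!"RUNNING".equals(app.get("state"))) {
                continue;
            }
            // "application" comes from $.id, "phase" is the static value from the config
            String labels = "{application=\"" + app.get("id") + "\",phase=\"beta\"}";
            lines.add("yarn_application_alive" + labels + " 1");
            lines.add("yarn_application_elapsed_time" + labels + " " + app.get("elapsedTime"));
            lines.add("yarn_application_allocated_mb" + labels + " " + app.get("allocatedMB"));
        }
        return lines;
    }

    /** Builds one already-parsed entry of $.apps.app for the demo. */
    static Map<String, Object> app(String id, String state, long elapsedTime, long allocatedMb) {
        Map<String, Object> m = new LinkedHashMap<>();
        m.put("id", id);
        m.put("state", state);
        m.put("elapsedTime", elapsedTime);
        m.put("allocatedMB", allocatedMb);
        return m;
    }

    public static void main(String[] args) {
        List<Map<String, Object>> apps = Arrays.asList(
            app("application_1326815542473_0001", "RUNNING", 25196, 1024),
            app("application_1326815542473_0002", "FINISHED", 99999, 0)); // filtered out
        export(apps).forEach(System.out::println);
    }
}
```

The constant alive value of 1 is what makes it possible to notice a missing application: the series simply disappears when the job stops running.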
Important configurations
● -storage.local.retention (default: 15 days)
○ TTL for collected values
● -storage.local.memory-chunks (default: 1M)
○ Practically controls the memory allocation of the Prometheus instance
○ Too low a value can cause ingestion throttling (metric loss)
● -storage.local.max-chunks-to-persist (default: 512K)
○ Too low a value can likewise cause ingestion throttling
○ https://prometheus.io/docs/operating/storage/#persistence-pressure-and-rushed-mode
○ > Equally important, especially if writing to a spinning disk, is raising the value for the storage.local.max-chunks-to-persist flag. As a rule of thumb, keep it around 50% of the storage.local.memory-chunks value.
● -query.staleness-delta (default: 5 min)
○ Resolution for detecting lost metrics
○ Can lead to weird behavior in the Prometheus web UI
Query tips - label_replace function
● It's quite common that two metrics have different label sets
○ E.g., server-side metrics and client-side metrics
● Say we have a metric like:
○ kafka_log_logendoffset{cluster="cluster-A",instance="HOST:PORT",job="kafka",partition="1234",topic="topic-A"}
● Introduce a new label from an existing label
○ label_replace(..., "host", "$1", "instance", "^([^:]+):.*")
○ => kafka_log_logendoffset{...,instance="HOST:PORT",host="HOST"}
● Rewrite an existing label with a new value
○ label_replace(..., "instance", "$1", "instance", "^([^:]+):.*")
○ => kafka_log_logendoffset{...,instance="HOST"}
● Even possible to rewrite metric name… :D
○ label_replace(kafka_log_logendoffset, "__name__", "foobar", "__name__", ".*")
○ => foobar{...}
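The behaviour of label_replace on a single series can be approximated with java.util.regex. The sketch below is an assumption-laden illustration, not Prometheus code: the class and method names are made up, and it relies on Java's regex syntax being close enough to Prometheus's RE2 syntax for these patterns. It mirrors three properties of the function: the regex must match the whole source value, a non-matching series is returned unchanged, and an empty result removes the destination label.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical approximation of PromQL's label_replace for one series.
class LabelReplaceSketch {
    static Map<String, String> labelReplace(Map<String, String> labels, String dstLabel,
                                            String replacement, String srcLabel, String regex) {
        Map<String, String> result = new LinkedHashMap<>(labels);
        // A missing source label is treated like an empty value.
        String source = labels.getOrDefault(srcLabel, "");
        // The regex is anchored so that it has to match the entire value.
        Matcher m = Pattern.compile("^(?:" + regex + ")$").matcher(source);
        if (!m.matches()) {
            return result; // no match: series is passed through untouched
        }
        // Expand $1, $2, ... in the replacement against the match.
        StringBuffer expanded = new StringBuffer();
        m.appendReplacement(expanded, replacement);
        if (expanded.length() == 0) {
            result.remove(dstLabel);
        } else {
            result.put(dstLabel, expanded.toString());
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, String> series = new LinkedHashMap<>();
        series.put("__name__", "kafka_log_logendoffset");
        series.put("instance", "HOST:PORT");

        // New label derived from an existing one
        System.out.println(labelReplace(series, "host", "$1", "instance", "^([^:]+):.*"));
        // Existing label overwritten
        System.out.println(labelReplace(series, "instance", "$1", "instance", "^([^:]+):.*"));
        // Metric name rewritten through the __name__ label
        System.out.println(labelReplace(series, "__name__", "foobar", "__name__", ".*"));
    }
}
```

Treating the metric name as just another label (__name__) is what makes the last trick on the slide possible.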
Points to improve
● Service discovery
○ It's too cumbersome to configure the server list and exporter list statically
○ Pushgateway?
■ > The Prometheus Pushgateway exists to allow ephemeral and batch jobs to expose their metrics to Prometheus - https://github.com/prometheus/pushgateway/blob/master/README.md#prometheus-pushgateway-
○ file_sd_config? https://prometheus.io/docs/operating/configuration/#<file_sd_config>
■ > It reads a set of files containing a list of zero or more <target_group>s. Changes to all defined files are detected via disk watches and applied immediately.
● Local time support :(
○ They don't like time zones other than UTC, which makes sense though: https://prometheus.io/docs/introduction/faq/#can-i-change-the-timezone?-why-is-everything-in-utc?
○ https://github.com/prometheus/prometheus/issues/500#issuecomment-167560093
○ It still might be possible to introduce a toggle in the view
Conclusion
● Data model is very intuitive
● PromQL is very powerful and relatively easy
○ Helps you find the important metrics among hundreds of metrics
● A few pitfalls need to be avoided by tuning the configuration
○ memory-chunks, query.staleness-delta…
● Building an exporter is reasonably easy
○ Official client libraries support lots of languages…
○ /metrics is the only interface
Questions?
End of Presentation
Metrics naming
● APPLICATIONPREFIX_METRICNAME
○ https://prometheus.io/docs/practices/naming/#metric-names
○ kafka_producer_request_rate
○ http_request_duration
● Fully utilize labels
○ x: kafka_network_request_duration_milliseconds_{max,min,mean}
○ o: kafka_network_request_duration_milliseconds{aggregation="max|min|mean"}
○ Compare all of min/max/mean in a single graph: kafka_network_request_duration_milliseconds{instance="HOSTA"}
○ Much more flexible than using static names
Alerting
● Not using Alertmanager
● In-house monitoring tool has alerting capability
○ Has a user directory of alert recipients
○ Has a familiar expression syntax for configuring alerts
○ Tool unification is important and should be respected as far as possible
● Then?
○ Built a tool to mirror metrics from Prometheus to the in-house monitoring tool
○ Set up alerts on the in-house monitoring tool
/api/v1/query?query=sum(kafka_stream_process_calls_rate{client_id=~"CLIENT_ID.*"}) by (instance)
{
  "status": "success",
  "data": {
    "resultType": "vector",
    "result": [
      {
        "metric": {
          "instance": "HOST_A:PORT"
        },
        "value": [
          1465819064.067,
          "82317.10280584119"
        ]
      },
      {
        "metric": {
          "instance": "HOST_B:PORT"
        },
        "value": [
          1465819064.067,
          "81379.73499610288"
        ]
      }
    ]
  }
}
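The mirroring tool itself is not shown in the slides. As a rough idea of its first step, the sketch below builds the instant-query URL for an expression like the one above; the PromQL has to be URL-encoded because it contains braces, quotes and spaces. The class name and the host prometheus.example:9090 are hypothetical.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

// Hypothetical helper; the actual mirroring tool from the talk is not public here.
class InstantQueryUrlSketch {
    /** Builds a GET URL for the /api/v1/query endpoint with the expression encoded. */
    static String queryUrl(String baseUrl, String promql) {
        try {
            // URLEncoder uses form encoding: spaces become '+', other reserved
            // characters become %XX escapes.
            return baseUrl + "/api/v1/query?query=" + URLEncoder.encode(promql, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 is always available", e);
        }
    }

    public static void main(String[] args) {
        String promql =
            "sum(kafka_stream_process_calls_rate{client_id=~\"CLIENT_ID.*\"}) by (instance)";
        System.out.println(queryUrl("http://prometheus.example:9090", promql));
    }
}
```

Each element of the returned "result" array carries the label set and a [timestamp, value] pair, with the value encoded as a string, so a mirroring tool would parse it to a number before forwarding it.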
public class KafkaMetricProxy extends SimpleCollector<KafkaMetricProxy.Child> {
    ...
    public static class Child {
        private KafkaMetric kafkaMetric;

        public void setKafkaMetric(KafkaMetric kafkaMetric) {
            this.kafkaMetric = kafkaMetric;
        }

        double getValue() {
            return kafkaMetric == null ? 0 : kafkaMetric.value();
        }
    }

    @Override
    protected Child newChild() {
        return new Child();
    }
    ...
}
Monitoring Kafka consumer offset - kafka_consumer_group_exporter
● https://github.com/kawamuray/prometheus-kafka-consumer-group-exporter
● Exports some metrics about Kafka consumer groups by executing the kafka-consumer-groups.sh command (bundled with Kafka)
● Specific exporter for a specific use
● It's better to be familiar with your favorite exporter framework
○ Raw use of the official prometheus package: https://github.com/prometheus/client_golang/tree/master/prometheus
○ Mine: https://github.com/kawamuray/prometheus-exporter-harness
Query tips - Product set (intersection)
● A binary operation between two metrics returns only the intersection: series whose label sets match on both sides
● metric_A{cluster="A or B"}
● metric_B{cluster="A or B",instance="a or b or c"}
● metric_A / metric_B
● => {} (empty, because the label sets never match)
● metric_A / sum(metric_B) by (cluster)
● => {cluster="A or B"}
● x: metric_A{cluster="A"} - sum(metric_B{cluster="A"}) by (cluster)
● o: metric_A{cluster="A"} - sum(metric_B) by (cluster) => Same result, because only cluster="A" exists on the left-hand side
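The matching rule behind these examples can be mimicked with plain maps. The sketch below is a simplified model with hypothetical names, assuming metric_A carries only a cluster label and metric_B carries cluster and instance; it shows why aggregating metric_B by cluster makes the two sides line up, and why clusters present on only one side drop out of the result.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical model of "metric_A / sum(metric_B) by (cluster)"; not Prometheus code.
class VectorMatchSketch {
    /** One metric_B series: labels cluster and instance plus the current value. */
    static final class Sample {
        final String cluster;
        final String instance;
        final double value;

        Sample(String cluster, String instance, double value) {
            this.cluster = cluster;
            this.instance = instance;
            this.value = value;
        }
    }

    /** sum(...) by (cluster): drops the instance label and adds up values per cluster. */
    static Map<String, Double> sumByCluster(List<Sample> samples) {
        Map<String, Double> sums = new TreeMap<>();
        for (Sample s : samples) {
            sums.merge(s.cluster, s.value, Double::sum);
        }
        return sums;
    }

    /** One-to-one matching on cluster: only clusters present on both sides survive. */
    static Map<String, Double> divide(Map<String, Double> left, Map<String, Double> right) {
        Map<String, Double> out = new TreeMap<>();
        for (Map.Entry<String, Double> e : left.entrySet()) {
            Double r = right.get(e.getKey());
            if (r != null) {
                out.put(e.getKey(), e.getValue() / r);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, Double> metricA = new TreeMap<>();
        metricA.put("A", 10.0); // only cluster A on the left-hand side

        List<Sample> metricB = Arrays.asList(
            new Sample("A", "a", 2.0),
            new Sample("A", "b", 3.0),
            new Sample("B", "c", 4.0)); // cluster B has no partner and is dropped

        System.out.println(divide(metricA, sumByCluster(metricB)));
    }
}
```

Because unmatched clusters are discarded by the matching step, filtering the right-hand side with the same selector as the left-hand side is redundant, which is the point of the last two bullets.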

Mais conteúdo relacionado

Mais procurados

Kafka Streams at Scale (Deepak Goyal, Walmart Labs) Kafka Summit London 2019
Kafka Streams at Scale (Deepak Goyal, Walmart Labs) Kafka Summit London 2019Kafka Streams at Scale (Deepak Goyal, Walmart Labs) Kafka Summit London 2019
Kafka Streams at Scale (Deepak Goyal, Walmart Labs) Kafka Summit London 2019
confluent
 

Mais procurados (20)

3GPP 5G Control Plane Service Based Architecture
3GPP 5G Control Plane Service Based Architecture3GPP 5G Control Plane Service Based Architecture
3GPP 5G Control Plane Service Based Architecture
 
[Main Session] 카프카, 데이터 플랫폼의 최강자
[Main Session] 카프카, 데이터 플랫폼의 최강자[Main Session] 카프카, 데이터 플랫폼의 최강자
[Main Session] 카프카, 데이터 플랫폼의 최강자
 
Connecting Your System to Globus (APS Workshop)
Connecting Your System to Globus (APS Workshop)Connecting Your System to Globus (APS Workshop)
Connecting Your System to Globus (APS Workshop)
 
Software architecture for high traffic website
Software architecture for high traffic websiteSoftware architecture for high traffic website
Software architecture for high traffic website
 
IBM Aspera for high-speed data migration to your AWS Cloud - DEM02-S - New Yo...
IBM Aspera for high-speed data migration to your AWS Cloud - DEM02-S - New Yo...IBM Aspera for high-speed data migration to your AWS Cloud - DEM02-S - New Yo...
IBM Aspera for high-speed data migration to your AWS Cloud - DEM02-S - New Yo...
 
Kafka Streams at Scale (Deepak Goyal, Walmart Labs) Kafka Summit London 2019
Kafka Streams at Scale (Deepak Goyal, Walmart Labs) Kafka Summit London 2019Kafka Streams at Scale (Deepak Goyal, Walmart Labs) Kafka Summit London 2019
Kafka Streams at Scale (Deepak Goyal, Walmart Labs) Kafka Summit London 2019
 
Apache kafka performance(latency)_benchmark_v0.3
Apache kafka performance(latency)_benchmark_v0.3Apache kafka performance(latency)_benchmark_v0.3
Apache kafka performance(latency)_benchmark_v0.3
 
The basics of fluentd
The basics of fluentdThe basics of fluentd
The basics of fluentd
 
카프카, 산전수전 노하우
카프카, 산전수전 노하우카프카, 산전수전 노하우
카프카, 산전수전 노하우
 
Massive MIMO Systems for 5G and Beyond Networks.pptx
Massive MIMO Systems for 5G and Beyond Networks.pptxMassive MIMO Systems for 5G and Beyond Networks.pptx
Massive MIMO Systems for 5G and Beyond Networks.pptx
 
Python/Django를 이용한 쇼핑몰 구축(2018 4월 Django Girls Seoul)
Python/Django를 이용한 쇼핑몰 구축(2018 4월 Django Girls Seoul)Python/Django를 이용한 쇼핑몰 구축(2018 4월 Django Girls Seoul)
Python/Django를 이용한 쇼핑몰 구축(2018 4월 Django Girls Seoul)
 
Prometheus Overview
Prometheus OverviewPrometheus Overview
Prometheus Overview
 
When NOT to use Apache Kafka?
When NOT to use Apache Kafka?When NOT to use Apache Kafka?
When NOT to use Apache Kafka?
 
Quick Summary of LTE Voice Summit 2015 #LTEVoice
Quick Summary of LTE Voice Summit 2015 #LTEVoiceQuick Summary of LTE Voice Summit 2015 #LTEVoice
Quick Summary of LTE Voice Summit 2015 #LTEVoice
 
Cloud Monitoring with Prometheus
Cloud Monitoring with PrometheusCloud Monitoring with Prometheus
Cloud Monitoring with Prometheus
 
Universal metrics with Apache Beam
Universal metrics with Apache BeamUniversal metrics with Apache Beam
Universal metrics with Apache Beam
 
Beginners: TCO of a Mobile Network
Beginners: TCO of a Mobile NetworkBeginners: TCO of a Mobile Network
Beginners: TCO of a Mobile Network
 
VictoriaMetrics: Welcome to the Virtual Meet Up March 2023
VictoriaMetrics: Welcome to the Virtual Meet Up March 2023VictoriaMetrics: Welcome to the Virtual Meet Up March 2023
VictoriaMetrics: Welcome to the Virtual Meet Up March 2023
 
Celery with python
Celery with pythonCelery with python
Celery with python
 
Driving Mobile VAS Adoption and Creating a Sustainable VAS Proposition in New...
Driving Mobile VAS Adoption and Creating a Sustainable VAS Proposition in New...Driving Mobile VAS Adoption and Creating a Sustainable VAS Proposition in New...
Driving Mobile VAS Adoption and Creating a Sustainable VAS Proposition in New...
 

Destaque

Apache Hadoop YARN: best practices
Apache Hadoop YARN: best practicesApache Hadoop YARN: best practices
Apache Hadoop YARN: best practices
DataWorks Summit
 

Destaque (11)

Prometheus casual talk1
Prometheus casual talk1Prometheus casual talk1
Prometheus casual talk1
 
promgen - prometheus managemnet tool / simpleclient_java hacks @ Prometheus c...
promgen - prometheus managemnet tool / simpleclient_java hacks @ Prometheus c...promgen - prometheus managemnet tool / simpleclient_java hacks @ Prometheus c...
promgen - prometheus managemnet tool / simpleclient_java hacks @ Prometheus c...
 
Prometheus on AWS
Prometheus on AWSPrometheus on AWS
Prometheus on AWS
 
Apache Hadoop YARN: best practices
Apache Hadoop YARN: best practicesApache Hadoop YARN: best practices
Apache Hadoop YARN: best practices
 
Application security as crucial to the modern distributed trust model
Application security as crucial to   the modern distributed trust modelApplication security as crucial to   the modern distributed trust model
Application security as crucial to the modern distributed trust model
 
FRONTIERS IN CRYPTOGRAPHY
FRONTIERS IN CRYPTOGRAPHYFRONTIERS IN CRYPTOGRAPHY
FRONTIERS IN CRYPTOGRAPHY
 
Drawing the Line Correctly: Enough Security, Everywhere
Drawing the Line Correctly:   Enough Security, EverywhereDrawing the Line Correctly:   Enough Security, Everywhere
Drawing the Line Correctly: Enough Security, Everywhere
 
ゲーム開発を加速させる クライアントセキュリティ
ゲーム開発を加速させる クライアントセキュリティゲーム開発を加速させる クライアントセキュリティ
ゲーム開発を加速させる クライアントセキュリティ
 
Implementing Trusted Endpoints in the Mobile World
Implementing Trusted Endpoints in the Mobile WorldImplementing Trusted Endpoints in the Mobile World
Implementing Trusted Endpoints in the Mobile World
 
FIDO認証で「あんしんをもっと便利に」
FIDO認証で「あんしんをもっと便利に」FIDO認証で「あんしんをもっと便利に」
FIDO認証で「あんしんをもっと便利に」
 
“Your Security, More Simple.” by utilizing FIDO Authentication
“Your Security, More Simple.” by utilizing FIDO Authentication“Your Security, More Simple.” by utilizing FIDO Authentication
“Your Security, More Simple.” by utilizing FIDO Authentication
 

Semelhante a Monitoring Kafka w/ Prometheus

Full Stack Monitoring with Prometheus and Grafana
Full Stack Monitoring with Prometheus and GrafanaFull Stack Monitoring with Prometheus and Grafana
Full Stack Monitoring with Prometheus and Grafana
Jazz Yao-Tsung Wang
 

Semelhante a Monitoring Kafka w/ Prometheus (20)

Sprint 17
Sprint 17Sprint 17
Sprint 17
 
PostgreSQL Monitoring using modern software stacks
PostgreSQL Monitoring using modern software stacksPostgreSQL Monitoring using modern software stacks
PostgreSQL Monitoring using modern software stacks
 
如何透過 Go-kit 快速搭建微服務架構應用程式實戰
如何透過 Go-kit 快速搭建微服務架構應用程式實戰如何透過 Go-kit 快速搭建微服務架構應用程式實戰
如何透過 Go-kit 快速搭建微服務架構應用程式實戰
 
Integrating ChatGPT with Apache Airflow
Integrating ChatGPT with Apache AirflowIntegrating ChatGPT with Apache Airflow
Integrating ChatGPT with Apache Airflow
 
Introducing Playwright's New Test Runner
Introducing Playwright's New Test RunnerIntroducing Playwright's New Test Runner
Introducing Playwright's New Test Runner
 
Php 5.6 From the Inside Out
Php 5.6 From the Inside OutPhp 5.6 From the Inside Out
Php 5.6 From the Inside Out
 
Toolbox of a Ruby Team
Toolbox of a Ruby TeamToolbox of a Ruby Team
Toolbox of a Ruby Team
 
React starter-kitでとっとと始めるisomorphic開発
React starter-kitでとっとと始めるisomorphic開発React starter-kitでとっとと始めるisomorphic開発
React starter-kitでとっとと始めるisomorphic開発
 
BaseX user-group-talk XML Prague 2013
BaseX user-group-talk XML Prague 2013BaseX user-group-talk XML Prague 2013
BaseX user-group-talk XML Prague 2013
 
Deploying Prometheus stacks with Juju
Deploying Prometheus stacks with JujuDeploying Prometheus stacks with Juju
Deploying Prometheus stacks with Juju
 
GopherCon IL 2020 - Web Application Profiling 101
GopherCon IL 2020 - Web Application Profiling 101GopherCon IL 2020 - Web Application Profiling 101
GopherCon IL 2020 - Web Application Profiling 101
 
Monitoring using Prometheus and Grafana
Monitoring using Prometheus and GrafanaMonitoring using Prometheus and Grafana
Monitoring using Prometheus and Grafana
 
Full Stack Monitoring with Prometheus and Grafana
Full Stack Monitoring with Prometheus and GrafanaFull Stack Monitoring with Prometheus and Grafana
Full Stack Monitoring with Prometheus and Grafana
 
202107 - Orion introduction - COSCUP
202107 - Orion introduction - COSCUP202107 - Orion introduction - COSCUP
202107 - Orion introduction - COSCUP
 
Infrastructure & System Monitoring using Prometheus
Infrastructure & System Monitoring using PrometheusInfrastructure & System Monitoring using Prometheus
Infrastructure & System Monitoring using Prometheus
 
PaaSTA: Autoscaling at Yelp
PaaSTA: Autoscaling at YelpPaaSTA: Autoscaling at Yelp
PaaSTA: Autoscaling at Yelp
 
openATTIC using grafana and prometheus
openATTIC using  grafana and prometheusopenATTIC using  grafana and prometheus
openATTIC using grafana and prometheus
 
Capistrano deploy Magento project in an efficient way
Capistrano deploy Magento project in an efficient wayCapistrano deploy Magento project in an efficient way
Capistrano deploy Magento project in an efficient way
 
.NET @ apache.org
 .NET @ apache.org .NET @ apache.org
.NET @ apache.org
 
Load testing in Zonky with Gatling
Load testing in Zonky with GatlingLoad testing in Zonky with Gatling
Load testing in Zonky with Gatling
 

Mais de kawamuray (7)

Kafka Multi-Tenancy - 160 Billion Daily Messages on One Shared Cluster at LINE
Kafka Multi-Tenancy - 160 Billion Daily Messages on One Shared Cluster at LINEKafka Multi-Tenancy - 160 Billion Daily Messages on One Shared Cluster at LINE
Kafka Multi-Tenancy - 160 Billion Daily Messages on One Shared Cluster at LINE
 
Multitenancy: Kafka clusters for everyone at LINE
Multitenancy: Kafka clusters for everyone at LINEMultitenancy: Kafka clusters for everyone at LINE
Multitenancy: Kafka clusters for everyone at LINE
 
LINE's messaging service architecture underlying more than 200 million monthl...
LINE's messaging service architecture underlying more than 200 million monthl...LINE's messaging service architecture underlying more than 200 million monthl...
LINE's messaging service architecture underlying more than 200 million monthl...
 
Kafka meetup JP #3 - Engineering Apache Kafka at LINE
Kafka meetup JP #3 - Engineering Apache Kafka at LINEKafka meetup JP #3 - Engineering Apache Kafka at LINE
Kafka meetup JP #3 - Engineering Apache Kafka at LINE
 
Docker + Checkpoint/Restore
Docker + Checkpoint/RestoreDocker + Checkpoint/Restore
Docker + Checkpoint/Restore
 
LXC on Ganeti
LXC on GanetiLXC on Ganeti
LXC on Ganeti
 
Norikraでアプリログを集計してリアルタイムエラー通知 # Norikra meetup
Norikraでアプリログを集計してリアルタイムエラー通知 # Norikra meetupNorikraでアプリログを集計してリアルタイムエラー通知 # Norikra meetup
Norikraでアプリログを集計してリアルタイムエラー通知 # Norikra meetup
 

Último

Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
Joaquim Jorge
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
Enterprise Knowledge
 

Último (20)

Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path Mount
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CV
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 

Monitoring Kafka w/ Prometheus

  • 2. About me ● Software Engineer @ LINE corp ○ Develop & operate Apache HBase clusters ○ Design and implement data flow between services with ♥ to Apache Kafka ● Recent works ○ Playing with Apache Kafka and Kafka Streams ■ https://issues.apache.org/jira/browse/KAFKA-3471?jql=project%20%3D%20KAFKA% 20AND%20assignee%20in%20(kawamuray)%20OR%20reporter%20in%20 (kawamuray) ● Past works ○ Docker + Checkpoint Restore @ CoreOS meetup http://www.slideshare. net/kawamuray/coreos-meetup ○ Norikraでアプリログを集計してリアルタイムエラー通知 @ Norikra Meetup #1 http://www. slideshare.net/kawamuray/norikra-meetup ○ Student @ Google Summer of Code 2013, 2014 ● https://github.com/kawamuray
  • 3. How are we(our team) using Prometheus? ● To monitor most of our middleware, clients on Java applications ○ Kafka clusters ○ HBase clusters ○ Kafka clients - producer and consumer ○ Stream Processing jobs
  • 5. Why Prometheus? ● Inhouse monitoring tool wasn’t enough for large-scale + high resolution metrics collection ● Good data model ○ Genuine metric identifier + attributes as labels ■ http_requests_total{code="200",handler="prometheus",instance="localhost:9090",job=" prometheus",method="get"} ● Scalable by nature ● Simple philosophy ○ Metrics exposure interface: GET /metrics => Text Protocol ○ Monolithic server ● Flexible but easy PromQL ○ Derive aggregated metrics by composing existing metrics ○ E.g, Sum of TX bps / second of entire cluster ■ sum(rate(node_network_receive_bytes{cluster="cluster-A",device="eth0"}[30s]) * 8)
  • 6. Deployment ● Launch ○ Official Docker image: https://hub.docker.com/r/prom/prometheus/ ○ Ansible for dynamic prometheus.yml generation based on inventory and container management ● Machine spec ○ 2.40GHz * 24 CPUs ○ 192GB RAM ○ 6 SSDs ○ Single SSD / Single Prometheus instance ○ Overkill? => Obviously. Reused existing unused servers. You must don’t need this crazy spec just to use it.
  • 7. Kafka monitoring w/ Prometheus overview Kafka broker Kafka client in Java application YARN ResourceManager Stream Processing jobs on YARN Prometheus Server Pushgate way Jmx exporter Prometh eus Java library + Servlet JSON exporter Kafka consumer group exporter
  • 8. Monitoring Kafka brokers - jmx_exporter ● https://github.com/prometheus/jmx_exporter ● Run as standalone process(no -javaagent) ○ Just in order to avoid cumbersome rolling restart ○ Maybe turn into use javaagent on next opportunity of rolling restart :p ● With very complicated config.yml ○ https://gist.github.com/kawamuray/25136a9ab22b1cb992e435e0ea67eb06 ● Colocate one instance per broker on the same host
  • 9. Monitoring Kafka producer on Java application - prometheus_simpleclient ● https://github.com/prometheus/client_java ● Official Java client library
  • 10. prometheus_simpleclient - Basic usage private static final Counter queueOutCounter = Counter.build() .namespace("kafka_streams") // Namespace(= Application prefix?) .name("process_count") // Metric name .help("Process calls count") // Metric description .labelNames("processor", "topic") // Declare labels .register(); // Register to CollectorRegistry.defaultRegistry (default, global registry) ... queueOutCounter.labels("Processor-A", "topic-T").inc(); // Increment counter with labels queueOutCounter.labels("Processor-B", "topic-P").inc(2.0); => kafka_streams_process_count{processor="Processor-A",topic="topic-T"} 1.0 kafka_streams_process_count{processor="Processor-B",topic="topic-P"} 2.0
  • 11. Exposing Java application metrics ● Through servlet ○ io.prometheus.client.exporter.MetricsServlet from simpleclient_servlet ● Add an entry to web.xml or embedded jetty .. Server server = new Server(METRICS_PORT); ServletContextHandler context = new ServletContextHandler(); context.setContextPath("/"); server.setHandler(context); context.addServlet(new ServletHolder(new MetricsServlet()), "/metrics"); server.start();
  • 12. Monitoring Kafka producer on Java application - prometheus_simpleclient ● Primitive types: ○ Counter, Gauge, Histogram, Summary ● Kafka’s MetricsRerpoter interface gives KafkaMetrics instance ● How to expose the value? ● => Implement proxy metric type which implements SimpleCollector public class PrometheusMetricsReporter implements MetricsReporter { ... private void registerMetric(KafkaMetric kafkaMetric) { ... KafkaMetricProxy.build() .namespace(“kafka”) .name(fqn) .help("Help: " + metricName.description()) .labelNames(labelNames) .register(); ... } ... }
  • 13. public class KafkaMetricProxy extends SimpleCollector<KafkaMetricProxy.Child> { public static class Builder extends SimpleCollector.Builder<Builder, KafkaMetricProxy> { @Override public KafkaMetricProxy create() { return new KafkaMetricProxy(this); } } KafkaMetricProxy(Builder b) { super(b); } ... @Override public List<MetricFamilySamples> collect() { List<MetricFamilySamples.Sample> samples = new ArrayList<>(); for (Map.Entry<List<String>, Child> entry : children.entrySet()) { List<String> labels = entry.getKey(); Child child = entry.getValue(); samples.add(new Sample(fullname, labelNames, labels, child.getValue())); } return Collections.singletonList(new MetricFamilySamples(fullname, Type.GAUGE, help, samples)); } }
  • 14. Monitoring YARN jobs - json_exporter ● https://github.com/kawamuray/prometheus-json-exporter ○ Can export values from JSON by specifying each value as a JSONPath ● http://<rm http address:port>/ws/v1/cluster/apps ○ https://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/ResourceManagerRest.html#Cluster_Applications_API ○ https://gist.github.com/kawamuray/c07b03de82bf6ddbdae6508e27d3fb4d
  • 15. json_exporter - name: yarn_application type: object path: $.apps.app[*]?(@.state == "RUNNING") labels: application: $.id phase: beta values: alive: 1 elapsed_time: $.elapsedTime allocated_mb: $.allocatedMB ... + {"apps":{"app":[ { "id": "application_1326815542473_0001", "state": "RUNNING", "elapsedTime": 25196, "allocatedMB": 1024, ... }, ... ]}} => yarn_application_alive{application="application_1326815542473_0001",phase="beta"} 1 yarn_application_elapsed_time{application="application_1326815542473_0001",phase="beta"} 25196 yarn_application_allocated_mb{application="application_1326815542473_0001",phase="beta"} 1024
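The mapping this config describes can be emulated in a few lines. This is a rough sketch of the semantics only, not json_exporter's code: the class `YarnAppMetricsSketch` is hypothetical, the JSON is assumed to be already parsed into maps, and the JSONPath filter is replaced by a plain state check.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of the object-to-metrics mapping in the config above.
class YarnAppMetricsSketch {
    /** Emits three samples per RUNNING application, labelled with its id. */
    static List<String> toSamples(List<Map<String, Object>> apps) {
        List<String> lines = new ArrayList<>();
        for (Map<String, Object> app : apps) {
            // Stands in for the filter ?(@.state == "RUNNING").
            if (!"RUNNING".equals(app.get("state"))) {
                continue;
            }
            // labels: application comes from $.id, phase is the constant "beta".
            String labels = "{application=\"" + app.get("id") + "\",phase=\"beta\"}";
            // values: alive is the constant 1, the others are read from the object.
            lines.add("yarn_application_alive" + labels + " 1");
            lines.add("yarn_application_elapsed_time" + labels + " " + app.get("elapsedTime"));
            lines.add("yarn_application_allocated_mb" + labels + " " + app.get("allocatedMB"));
        }
        return lines;
    }
}
```

Applications in any other state produce no samples at all, which is why the constant `alive` value works as a liveness signal.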
  • 16. Important configurations ● -storage.local.retention (default: 15 days) ○ TTL for collected values ● -storage.local.memory-chunks (default: 1M) ○ Practically controls the memory allocation of a Prometheus instance ○ Too low a value can cause ingestion throttling (metric loss) ● -storage.local.max-chunks-to-persist (default: 512K) ○ Too low a value can cause ingestion throttling likewise ○ https://prometheus.io/docs/operating/storage/#persistence-pressure-and-rushed-mode ○ > Equally important, especially if writing to a spinning disk, is raising the value for the storage.local.max-chunks-to-persist flag. As a rule of thumb, keep it around 50% of the storage.local.memory-chunks value. ● -query.staleness-delta (default: 5 min) ○ Resolution for detecting lost metrics ○ Could lead to weird behavior on the Prometheus Web UI
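The 50% rule of thumb quoted above is simple arithmetic. As a small sketch (the helper class `StorageFlagsSketch` is hypothetical, and the right memory-chunks value still depends on the RAM available to the instance):

```java
// Hypothetical helper: derives max-chunks-to-persist from memory-chunks.
class StorageFlagsSketch {
    /** Returns the two storage flags, keeping the second at 50% of the first. */
    static String[] flags(long memoryChunks) {
        long maxChunksToPersist = memoryChunks / 2;
        return new String[] {
            "-storage.local.memory-chunks=" + memoryChunks,
            "-storage.local.max-chunks-to-persist=" + maxChunksToPersist
        };
    }
}
```

With the default of 1M (1048576) memory chunks this gives 524288, which matches the 512K default listed above.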
  • 17. Query tips - label_replace function ● It’s quite common for two metrics to have different label sets ○ E.g., server-side metrics and client-side metrics ● Say we have a metric like: ○ kafka_log_logendoffset{cluster="cluster-A",instance="HOST:PORT",job="kafka",partition="1234",topic="topic-A"} ● Introduce a new label from an existing label ○ label_replace(..., "host", "$1", "instance", "^([^:]+):.*") ○ => kafka_log_logendoffset{...,instance="HOST:PORT",host="HOST"} ● Rewrite an existing label with a new value ○ label_replace(..., "instance", "$1", "instance", "^([^:]+):.*") ○ => kafka_log_logendoffset{...,instance="HOST"} ● It’s even possible to rewrite the metric name… :D ○ label_replace(kafka_log_logendoffset, "__name__", "foobar", "__name__", ".*") ○ => foobar{...}
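To make the semantics concrete, here is a rough Java emulation of what label_replace does to a single label set. It is a sketch, not Prometheus code: the class `LabelReplaceSketch` is hypothetical, and it assumes the regex must match the whole source value and that a non-matching series is returned unchanged.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical emulation of label_replace on one label set.
class LabelReplaceSketch {
    static Map<String, String> labelReplace(Map<String, String> labels, String dst,
                                            String replacement, String src, String regex) {
        Map<String, String> out = new LinkedHashMap<>(labels);
        // Anchor the pattern so it has to cover the entire source value.
        Matcher m = Pattern.compile("\\A(?:" + regex + ")\\z")
                .matcher(labels.getOrDefault(src, ""));
        if (!m.matches()) {
            return out; // no match: label set stays as it was
        }
        // "$1" in the replacement expands to the first capture group.
        String value = m.replaceFirst(replacement);
        if (value.isEmpty()) {
            out.remove(dst);
        } else {
            out.put(dst, value);
        }
        return out;
    }
}
```

Passing `"host"` as the destination adds a label next to `instance`, while passing `"instance"` overwrites it, mirroring the two cases on the slide.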
  • 18. Points to improve ● Service discovery ○ It’s too cumbersome to configure the server list and exporter list statically ○ Pushgateway? ■ > The Prometheus Pushgateway exists to allow ephemeral and batch jobs to expose their metrics to Prometheus - https://github.com/prometheus/pushgateway/blob/master/README.md#prometheus-pushgateway- ○ file_sd_config? https://prometheus.io/docs/operating/configuration/#<file_sd_config> ■ > It reads a set of files containing a list of zero or more <target_group>s. Changes to all defined files are detected via disk watches and applied immediately. ● Local time support :( ○ They don’t like time zones other than UTC, which makes sense though: https://prometheus.io/docs/introduction/faq/#can-i-change-the-timezone?-why-is-everything-in-utc? ○ https://github.com/prometheus/prometheus/issues/500#issuecomment-167560093 ○ Still, it might be possible to introduce a toggle in the view
  • 19. Conclusion ● Data model is very intuitive ● PromQL is very powerful and relatively easy ○ Helps you pick out the important metrics from hundreds of metrics ● A few pitfalls need to be avoided by tuning configurations ○ memory-chunks, query.staleness-delta… ● Building an exporter is reasonably easy ○ Official client libraries exist for lots of languages… ○ /metrics is the only interface
  • 22. Metrics naming ● APPLICATIONPREFIX_METRICNAME ○ https://prometheus.io/docs/practices/naming/#metric-names ○ kafka_producer_request_rate ○ http_request_duration ● Fully utilize labels ○ x: kafka_network_request_duration_milliseconds_{max,min,mean} ○ o: kafka_network_request_duration_milliseconds{aggregation="max|min|mean"} ○ Compare all of min/max/mean in a single graph: kafka_network_request_duration_milliseconds{instance="HOSTA"} ○ Much more flexible than using static names
  • 23. Alerting ● Not using Alertmanager ● Inhouse monitoring tool has alerting capability ○ Has a user directory of alerting targets ○ Has a familiar expression syntax for configuring alerts ○ Tool unification is important and should be respected as much as possible ● Then? ○ Built a tool to mirror metrics from Prometheus to the inhouse monitoring tool ○ Set up alerts on the inhouse monitoring tool /api/v1/query?query=sum(kafka_stream_process_calls_rate{client_id=~"CLIENT_ID.*"}) by (instance) { "status": "success", "data": { "resultType": "vector", "result": [ { "metric": { "instance": "HOST_A:PORT" }, "value": [ 1465819064.067, "82317.10280584119" ] }, { "metric": { "instance": "HOST_B:PORT" }, "value": [ 1465819064.067, "81379.73499610288" ] } ] } }
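A mirroring tool like this first has to build the instant-query URL, and the PromQL expression contains characters such as `{`, `"` and `~` that must be percent-encoded. The following is a minimal sketch of that step only, not the tool described above: the class `QueryUrlSketch` and the base URL in the usage note are hypothetical, and `URLEncoder.encode(String, Charset)` needs Java 10 or later.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: builds the URL for GET /api/v1/query.
class QueryUrlSketch {
    /** Appends the encoded PromQL expression as the "query" parameter. */
    static String instantQueryUrl(String baseUrl, String promql) {
        // URLEncoder applies form encoding, so spaces become '+'.
        return baseUrl + "/api/v1/query?query="
                + URLEncoder.encode(promql, StandardCharsets.UTF_8);
    }
}
```

For example, `instantQueryUrl("http://prometheus.example:9090", "sum(kafka_stream_process_calls_rate{client_id=~\"CLIENT_ID.*\"}) by (instance)")` yields a URL that can be fetched with any HTTP client. The response then has the JSON shape shown on the slide, with one `result` entry per instance.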
  • 24. public class KafkaMetricProxy extends SimpleCollector<KafkaMetricProxy.Child> { ... public static class Child { private KafkaMetric kafkaMetric; public void setKafkaMetric(KafkaMetric kafkaMetric) { this.kafkaMetric = kafkaMetric; } double getValue() { return kafkaMetric == null ? 0 : kafkaMetric.value(); } } @Override protected Child newChild() { return new Child(); } ... }
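The point of this Child class is that the value is read lazily, at scrape time, rather than copied when the metric is registered. The same pattern can be shown with the standard library alone. This is a sketch only: `LazyGaugeSketch` is a hypothetical name, and a `DoubleSupplier` stands in for the `KafkaMetric` reference.

```java
import java.util.function.DoubleSupplier;

// Hypothetical stand-in for KafkaMetricProxy.Child using only the JDK.
class LazyGaugeSketch {
    // Reference to the live value source; volatile because it may be set
    // by one thread and read by the scraping thread.
    private volatile DoubleSupplier source;

    void setSource(DoubleSupplier source) {
        this.source = source;
    }

    /** Reads the current value on every call; 0 until a source is attached. */
    double getValue() {
        DoubleSupplier s = source;
        return s == null ? 0 : s.getAsDouble();
    }
}
```

Because nothing is cached, every scrape reflects whatever the underlying metric reports at that moment.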
  • 25. Monitoring Kafka consumer offset - kafka_consumer_group_exporter ● https://github.com/kawamuray/prometheus-kafka-consumer-group-exporter ● Exports some metrics about Kafka consumer groups by executing the kafka-consumer-groups.sh command (bundled with Kafka) ● Specific exporter for a specific use ● It’s better to be familiar with your favorite exporter framework ○ Raw use of the official prometheus package: https://github.com/prometheus/client_golang/tree/master/prometheus ○ Mine: https://github.com/kawamuray/prometheus-exporter-harness
  • 26. Query tips - Product set ● A calculation over two or more metrics results in a product set ● metric_A{cluster="A or B"} ● metric_B{cluster="A or B",instance="a or b or c"} ● metric_A / metric_B ● => {} ● metric_A / sum(metric_B) by (cluster) ● => {cluster="A or B"} ● x: metric_A{cluster="A"} - sum(metric_B{cluster="A"}) by (cluster) ● o: metric_A{cluster="A"} - sum(metric_B) by (cluster) => Same result!
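Why `metric_A / metric_B` comes back empty can be illustrated with a small emulation of one-to-one vector matching. This is a simplified sketch, not the PromQL engine: the class `VectorMatchSketch` is hypothetical, a vector is modelled as a map from label set to value, and every series is assumed to carry the grouping label.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical emulation of one-to-one matching and sum(...) by (label).
class VectorMatchSketch {
    /** Divides only where both sides carry an identical label set. */
    static Map<Map<String, String>, Double> divide(Map<Map<String, String>, Double> lhs,
                                                   Map<Map<String, String>, Double> rhs) {
        Map<Map<String, String>, Double> out = new LinkedHashMap<>();
        lhs.forEach((labels, value) -> {
            Double divisor = rhs.get(labels);
            if (divisor != null) {
                out.put(labels, value / divisor);
            }
        });
        return out;
    }

    /** Sums values per distinct value of one label and drops all other labels. */
    static Map<Map<String, String>, Double> sumBy(Map<Map<String, String>, Double> vector,
                                                  String label) {
        Map<Map<String, String>, Double> out = new LinkedHashMap<>();
        vector.forEach((labels, value) ->
                out.merge(Map.of(label, labels.get(label)), value, Double::sum));
        return out;
    }
}
```

Dividing directly finds no pair of identical label sets, because one side has `instance` and the other does not. Aggregating the right-hand side by `cluster` first leaves both sides with the same labels, so one result per cluster comes out.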