SlideShare a Scribd company logo
1 of 40
Brendan Gregg, Senior Performance Architect
Designing
Tracing
Tools
Wielding Superpowers
I'm currently developing more tracing tools (bcc/BPF)
Tool Design
• For tool developers
• For everyone else: what you can ask for
– Tool templates
– GUI visualizations
• The following is applicable to all tracers
– sysdig, bcc/BPF, DTrace, SystemTap, LTTng, …
Methodologies
Methodology-driven Design
• Methodologies provide ideas for purposeful tools
• Find/draw a functional diagram, apply methods
See: http://www.brendangregg.com/methodology.html
Operating Systems
Methodology Examples
Eg, at the syscall layer (well known & documented):
• Workload Characterization
– exec() or open() per-event trace (execsnoop, opensnoop)
– connect()/accept() per-event trace (tcpconnect, tcpaccept)
– read()/write() size histogram (one-liners)
• Latency Analysis
– read()/write() latency histogram (biolatency, …)
• USE Method
– network utilization by thread (not done yet)
– syscall errors (fserrors, soerrors)
CLI Tracing Tools
CLI Templates
1. Per event output
– *snoop, *slower 0, …
2. Filtered event output
– *slower
3. Interval summary
– *stat, *top
4. Count summary
– *count
5. Histogram summary
– *dist, *latency
6. Heatmap summary
– spectrogram.lua, subsecoffset.lua, …
Template 1: Per Event Output
Examples: *snoop, *slower 0, …
# opensnoop
PID COMM FD ERR PATH
10085 sshd 3 0 /lib/x86_64-linux-gnu/libkeyutils.so.1
10085 sshd 3 0 /lib/x86_64-linux-gnu/libresolv.so.2
10085 sshd 3 0 /lib/x86_64-linux-gnu/libgpg-error.so.0
10085 sshd 3 0 /dev/urandom
10085 sshd -1 2 /lib/x86_64-linux-gnu/.libcrypto.so.1.0.0.hmac
10085 sshd -1 2 /proc/sys/crypto/fips_enabled
10085 sshd 3 0 /proc/filesystems
10085 sshd 3 0 /dev/null
10085 sshd 3 0 /proc/10085/fd
10085 sshd 3 0 /usr/lib/ssl/openssl.cnf
10085 sshd 3 0 /etc/gai.conf
10085 sshd 3 0 /etc/nsswitch.conf
10085 sshd 3 0 /etc/ld.so.cache
10085 sshd 3 0 /lib/x86_64-linux-gnu/libnss_compat.so.2
10085 sshd 3 0 /etc/ld.so.cache
10085 sshd 3 0 /lib/x86_64-linux-gnu/libnss_nis.so.2
[…]
Template 2: Filtered Event Output
Examples: *slower
Tools like this can also do all event output:
# sysdig -c fileslower 1
TIME PROCESS TYPE LAT(ms) FILE
2014-04-13 20:40:43.973 cksum read 2 /mnt/partial.0.0
2014-04-13 20:40:44.187 cksum read 1 /mnt/partial.0.0
2014-04-13 20:40:44.689 cksum read 2 /mnt/partial.0.0
2014-04-13 20:40:45.005 cksum read 2 /mnt/partial.0.0
2014-04-13 20:40:45.193 cksum read 1 /mnt/partial.0.0
[…]
# sysdig -c fileslower 0
TIME PROCESS TYPE LAT(ms) FILE
2014-04-13 20:59:04.414 ls read 0 /lib/x86_64-linux-gnu/librt.so.1
2014-04-13 20:59:04.414 ls read 0 /lib/x86_64-linux-gnu/libacl.so.1
2014-04-13 20:59:04.414 ls read 0 /lib/x86_64-linux-gnu/libc.so.6
2014-04-13 20:59:04.414 ls read 0 /lib/x86_64-linux-gnu/libdl.so.2
2014-04-13 20:59:04.414 ls read 0 /lib/x86_64-linux-
gnu/libattr.so.1
2014-04-13 20:59:04.415 ls read 0 /proc/filesystems
2014-04-13 20:59:04.415 ls read 0 /proc/filesystems
[...]
Template 3: Interval Summary
Examples: *stat, *top
# dcstat
TIME REFS/s SLOW/s MISS/s HIT%
08:11:47: 2059 141 97 95.29
08:11:48: 79974 151 106 99.87
08:11:49: 192874 146 102 99.95
08:11:50: 2051 144 100 95.12
08:11:51: 73373 17239 17194 76.57
08:11:52: 54685 25431 25387 53.58
08:11:53: 18127 8182 8137 55.12
08:11:54: 22517 10345 10301 54.25
08:11:55: 7524 2881 2836 62.31
08:11:56: 2067 141 97 95.31
08:11:57: 2115 145 101 95.22
[…]
Template 4: Count Summary
Examples: *count
# funccount 'vfs_*'
Tracing... Ctrl-C to end.
^C
ADDR FUNC COUNT
ffffffff811efe81 vfs_create 1
ffffffff811f24a1 vfs_rename 1
ffffffff81215191 vfs_fsync_range 2
ffffffff81231df1 vfs_lock_file 30
ffffffff811e8dd1 vfs_fstatat 152
ffffffff811e8d71 vfs_fstat 154
ffffffff811e4381 vfs_write 166
ffffffff811e8c71 vfs_getattr_nosec 262
ffffffff811e8d41 vfs_getattr 262
ffffffff811e3221 vfs_open 264
ffffffff811e4251 vfs_read 470
Detaching...
Template 5: Histogram Summary
Examples: *dist, *latency
# biolatency
Tracing block device I/O... Hit Ctrl-C to end.
^C
usecs : count distribution
4 -> 7 : 0 | |
8 -> 15 : 0 | |
16 -> 31 : 0 | |
32 -> 63 : 0 | |
64 -> 127 : 1 | |
128 -> 255 : 12 |******** |
256 -> 511 : 15 |********** |
512 -> 1023 : 43 |******************************* |
1024 -> 2047 : 52 |**************************************|
2048 -> 4095 : 47 |********************************** |
4096 -> 8191 : 52 |**************************************|
8192 -> 16383 : 36 |************************** |
16384 -> 32767 : 15 |********** |
32768 -> 65535 : 2 |* |
65536 -> 131071 : 2 |* |
Template 6: Heatmap Summary
Example: subsecoffset.lua (aka "spectrogram")
Valuable
Know what already exists, and what doesn't
Low Overhead (or documented)
• Understand tracing internals
– For example, sysdig's design has ~20x lower overhead than strace
(it still has overhead: test and measure to see if it's acceptable)
– Tracing overhead is usually relative to event rate
• Design for low overhead, and document expectations
sysdig
1. enable
Kernel
syscalls
sysdig
driver
ring
buffer
lua
program
2. async
read
3. output
Documentation
• Good tools have 3 docs:
1. Code comments
2. Man page
3. Examples file
• Man page
– troff, docbook, …
• Examples file:
.TH Title heading
.SH Section heading
.IP Indented paragraph
.TP Indented paragraph with label
.B Bold
- -
common man macros (see groff_man(7))
Demonstrations of biosnoop, the Linux eBPF/bcc version.
biosnoop traces block device I/O (disk I/O), and prints a line of output
per I/O. Example:
# ./biosnoop
TIME(s) COMM PID DISK T SECTOR BYTES LAT(ms)
0.000004001 supervise 1950 xvda1 W 13092560 4096 0.74
[...]
Concise, intuitive, self-explanatory
• Useful startup message
– What I'm tracing, when there's output, when I'll end
• Vigorous tooling is concise
– No wasted text; leave less useful output for non-default options
• Unix philosophy: do one thing and do it well
# ./iolatency
Tracing block I/O. Output every 1 seconds. Ctrl-C to end.
>=(ms) .. <(ms) : I/O |Distribution |
0 -> 1 : 4381 |######################################|
1 -> 2 : 9 |# |
2 -> 4 : 5 |# |
4 -> 8 : 0 | |
8 -> 16 : 1 |# |
[…]
POSIX-style Arguments
# ./biolatency -h
usage: biolatency [-h] [-T] [-Q] [-m] [-D] [interval] [count]
Summarize block device I/O latency as a histogram
positional arguments:
interval output interval, in seconds
count number of outputs
optional arguments:
-h, --help show this help message and exit
-T, --timestamp include timestamp on output
-Q, --queued include OS queued time in I/O time
-m, --milliseconds millisecond histogram
-D, --disks print a histogram per disk device
examples:
./biolatency # summarize block I/O latency as a histogram
./biolatency 1 10 # print 1 second summaries, 10 times
./biolatency -mT 1 # 1s summaries, milliseconds, and timestamps
./biolatency -Q # include OS queued time in I/O time
./biolatency -D # show each disk device separately
Option Alternate Expectation
-a --all all events
-c CMD --cmd … run this command
-d SECONDS --duration … duration of tool execution
-h --help help
-i FILE --input … input file
-i SECONDS --interval … summary interval
-n name --name … this process name only
-o FILE --output … output file
-p PID --pid … this process ID only
-P --by-process per-process ID breakdown
-P PORT --port … this TCP port only
-t or -T --[no]timestamp include or exclude timestamps
-v --verbose verbose output
-x --extended, --errors extended output, or only failures
[interval [count]] - summary interval, and # of outputs
POSIX-style Arguments
Testing, Testing, Testing
• If you can't write the workload, you can't write the tool
– Be it 10 lines of C, some shell, or dd
– dd if=/dev/urandom of=/dev/null bs=1k count=23
• Known workload testing: create 23 events
• Testing can be time consuming
– I spend more time testing a tool than writing it
– Automatic tool testing is a difficult problem
Example: gethostlatency
# gethostlatency
TIME PID COMM LATms HOST
06:10:24 28011 wget 90.00 www.iovisor.org
06:10:28 28127 wget 0.00 www.iovisor.org
06:10:41 28404 wget 9.00 www.netflix.com
06:10:48 28544 curl 35.00 www.netflix.com.au
06:11:10 29054 curl 31.00 www.plumgrid.com
06:11:16 29195 curl 3.00 www.facebook.com
06:11:25 29404 curl 72.00 foo
06:11:28 29475 curl 1.00 foo
Example: ext4slower
# ext4slower 1
Tracing ext4 operations slower than 1 ms
TIME COMM PID T BYTES OFF_KB LAT(ms) FILENAME
06:49:17 bash 3616 R 128 0 7.75 cksum
06:49:17 cksum 3616 R 39552 0 1.34 [
06:49:17 cksum 3616 R 96 0 5.36 2to3-2.7
06:49:17 cksum 3616 R 96 0 14.94 2to3-3.4
06:49:17 cksum 3616 R 10320 0 6.82 411toppm
06:49:17 cksum 3616 R 65536 0 4.01 a2p
06:49:17 cksum 3616 R 55400 0 8.77 ab
06:49:17 cksum 3616 R 36792 0 16.34 aclocal-1.14
06:49:17 cksum 3616 R 15008 0 19.31 acpi_listen
06:49:17 cksum 3616 R 6123 0 17.23 add-apt-
repository
06:49:17 cksum 3616 R 6280 0 18.40 addpart
06:49:17 cksum 3616 R 27696 0 2.16 addr2line
06:49:17 cksum 3616 R 58080 0 10.11 ag
06:49:17 cksum 3616 R 906 0 6.30 ec2-meta-data
[…]
Example: biolatency
# biolatency -m 1 5
Tracing block device I/O... Hit Ctrl-C to end.
msecs : count distribution
0 -> 1 : 36 |**************************************|
2 -> 3 : 1 |* |
4 -> 7 : 3 |*** |
8 -> 15 : 17 |***************** |
16 -> 31 : 33 |********************************** |
32 -> 63 : 7 |******* |
64 -> 127 : 6 |****** |
[…]
GUI Tracing Tools
GUI Visualizations
1. Event logs
2. Tables
3. Line graphs
4. Histograms
5. Heatmaps (spectrographs)
6. Waterfall charts
7. Directed graphs
8. Flame graphs
9. Flame charts
Visualization 1: Event Logs
https://commons.wikimedia.org/wiki/File:Wireshark_screenshot.png
Visualization 2: Tables
Visualization 3: Line Graphs
http://www.paradyn.org/html/screen-shots.html
Visualization 4: Histograms
Or a density plot
Or as a frequency trail (can cascade)
Visualization 5: Heat Maps
eg, Oracle ZFS Storage Appliance Analytics (DTrace-based)
Visualization 5: Spectrograms
Visualization 6: Waterfall Charts
Visualization 7: Directed Graphs
Visualization 8: Flame Graphs
Commonly used with CPU profilers. Also useful for tracers: off-CPU time, ...
file read
from disk
directory read
from disk
pipe write
path read from disk
fstat from disk
Visualization 9: Flame Charts
Desirable Attributes
• Valuable
– Methodologies provide ideas for purposeful metrics
• Documented
– Tool tips, wikis
• Tested
• Real Time
• Dashboards
– To support methodologies
Thank You!
http://www.brendangregg.com
http://slideshare.net/brendangregg
bgregg@netflix.com
@brendangregg
References & Links:
– http://www.brendangregg.com/heatmaps.html
– http://www.brendangregg.com/flamegraphs.html
– http://www.slideshare.net/brendangregg/monitorama-2015-netflix-instance-analysis

More Related Content

What's hot

USENIX ATC 2017 Performance Superpowers with Enhanced BPF
USENIX ATC 2017 Performance Superpowers with Enhanced BPFUSENIX ATC 2017 Performance Superpowers with Enhanced BPF
USENIX ATC 2017 Performance Superpowers with Enhanced BPF
Brendan Gregg
 
Virtualization which isn't: LXC (Linux Containers)
Virtualization which isn't: LXC (Linux Containers)Virtualization which isn't: LXC (Linux Containers)
Virtualization which isn't: LXC (Linux Containers)
Dobrica Pavlinušić
 

What's hot (20)

Linux Server Deep Dives (DrupalCon Amsterdam)
Linux Server Deep Dives (DrupalCon Amsterdam)Linux Server Deep Dives (DrupalCon Amsterdam)
Linux Server Deep Dives (DrupalCon Amsterdam)
 
Open ZFS Keynote (public)
Open ZFS Keynote (public)Open ZFS Keynote (public)
Open ZFS Keynote (public)
 
Using cgroups in docker container
Using cgroups in docker containerUsing cgroups in docker container
Using cgroups in docker container
 
The New Systems Performance
The New Systems PerformanceThe New Systems Performance
The New Systems Performance
 
Realizing Linux Containers (LXC)
Realizing Linux Containers (LXC)Realizing Linux Containers (LXC)
Realizing Linux Containers (LXC)
 
USENIX ATC 2017 Performance Superpowers with Enhanced BPF
USENIX ATC 2017 Performance Superpowers with Enhanced BPFUSENIX ATC 2017 Performance Superpowers with Enhanced BPF
USENIX ATC 2017 Performance Superpowers with Enhanced BPF
 
Lxc- Introduction
Lxc- IntroductionLxc- Introduction
Lxc- Introduction
 
LISA2010 visualizations
LISA2010 visualizationsLISA2010 visualizations
LISA2010 visualizations
 
Get Lower Latency and Higher Throughput for Java Applications
Get Lower Latency and Higher Throughput for Java ApplicationsGet Lower Latency and Higher Throughput for Java Applications
Get Lower Latency and Higher Throughput for Java Applications
 
Virtualization which isn't: LXC (Linux Containers)
Virtualization which isn't: LXC (Linux Containers)Virtualization which isn't: LXC (Linux Containers)
Virtualization which isn't: LXC (Linux Containers)
 
Introduction to eBPF and XDP
Introduction to eBPF and XDPIntroduction to eBPF and XDP
Introduction to eBPF and XDP
 
Anatomy of a Container: Namespaces, cgroups & Some Filesystem Magic - LinuxCon
Anatomy of a Container: Namespaces, cgroups & Some Filesystem Magic - LinuxConAnatomy of a Container: Namespaces, cgroups & Some Filesystem Magic - LinuxCon
Anatomy of a Container: Namespaces, cgroups & Some Filesystem Magic - LinuxCon
 
FOSDEM2015: Live migration for containers is around the corner
FOSDEM2015: Live migration for containers is around the cornerFOSDEM2015: Live migration for containers is around the corner
FOSDEM2015: Live migration for containers is around the corner
 
SystemV vs systemd
SystemV vs systemdSystemV vs systemd
SystemV vs systemd
 
DCSF 19 eBPF Superpowers
DCSF 19 eBPF SuperpowersDCSF 19 eBPF Superpowers
DCSF 19 eBPF Superpowers
 
Talk 160920 @ Cat System Workshop
Talk 160920 @ Cat System WorkshopTalk 160920 @ Cat System Workshop
Talk 160920 @ Cat System Workshop
 
Portable TeX Documents (PTD): PackagingCon 2021
Portable TeX Documents (PTD): PackagingCon 2021Portable TeX Documents (PTD): PackagingCon 2021
Portable TeX Documents (PTD): PackagingCon 2021
 
Infrastructure coders logstash
Infrastructure coders logstashInfrastructure coders logstash
Infrastructure coders logstash
 
Using eBPF to Measure the k8s Cluster Health
Using eBPF to Measure the k8s Cluster HealthUsing eBPF to Measure the k8s Cluster Health
Using eBPF to Measure the k8s Cluster Health
 
Linux Containers From Scratch
Linux Containers From ScratchLinux Containers From Scratch
Linux Containers From Scratch
 

Viewers also liked

Viewers also liked (13)

The Dark Art of Container Monitoring - Spanish
The Dark Art of Container Monitoring - SpanishThe Dark Art of Container Monitoring - Spanish
The Dark Art of Container Monitoring - Spanish
 
Intro to sysdig in 15 minutes
Intro to sysdig in 15 minutesIntro to sysdig in 15 minutes
Intro to sysdig in 15 minutes
 
WTF my container just spawned a shell!
WTF my container just spawned a shell!WTF my container just spawned a shell!
WTF my container just spawned a shell!
 
Extending Sysdig with Chisel
Extending Sysdig with ChiselExtending Sysdig with Chisel
Extending Sysdig with Chisel
 
Building Trustworthy Containers
Building Trustworthy ContainersBuilding Trustworthy Containers
Building Trustworthy Containers
 
Lions, Tigers and Deers: What building zoos can teach us about securing micro...
Lions, Tigers and Deers: What building zoos can teach us about securing micro...Lions, Tigers and Deers: What building zoos can teach us about securing micro...
Lions, Tigers and Deers: What building zoos can teach us about securing micro...
 
Behavioural activity monitoring on CoreOS with Sysdig Falco
Behavioural activity monitoring on CoreOS with Sysdig FalcoBehavioural activity monitoring on CoreOS with Sysdig Falco
Behavioural activity monitoring on CoreOS with Sysdig Falco
 
Blazing Performance with Flame Graphs
Blazing Performance with Flame GraphsBlazing Performance with Flame Graphs
Blazing Performance with Flame Graphs
 
Find the Hacker
Find the HackerFind the Hacker
Find the Hacker
 
How to Secure Containers
How to Secure ContainersHow to Secure Containers
How to Secure Containers
 
Troubleshooting Kubernetes
Troubleshooting KubernetesTroubleshooting Kubernetes
Troubleshooting Kubernetes
 
You're monitoring Kubernetes Wrong
You're monitoring Kubernetes WrongYou're monitoring Kubernetes Wrong
You're monitoring Kubernetes Wrong
 
How to Monitor Microservices
How to Monitor MicroservicesHow to Monitor Microservices
How to Monitor Microservices
 

Similar to Designing Tracing Tools

Velocity 2017 Performance analysis superpowers with Linux eBPF
Velocity 2017 Performance analysis superpowers with Linux eBPFVelocity 2017 Performance analysis superpowers with Linux eBPF
Velocity 2017 Performance analysis superpowers with Linux eBPF
Brendan Gregg
 
101 3.2 process text streams using filters
101 3.2 process text streams using filters101 3.2 process text streams using filters
101 3.2 process text streams using filters
Acácio Oliveira
 
Servers and Processes: Behavior and Analysis
Servers and Processes: Behavior and AnalysisServers and Processes: Behavior and Analysis
Servers and Processes: Behavior and Analysis
dreamwidth
 
Kernel Recipes 2017 - Performance analysis Superpowers with Linux BPF - Brend...
Kernel Recipes 2017 - Performance analysis Superpowers with Linux BPF - Brend...Kernel Recipes 2017 - Performance analysis Superpowers with Linux BPF - Brend...
Kernel Recipes 2017 - Performance analysis Superpowers with Linux BPF - Brend...
Anne Nicolas
 
Kernel Recipes 2017: Performance Analysis with BPF
Kernel Recipes 2017: Performance Analysis with BPFKernel Recipes 2017: Performance Analysis with BPF
Kernel Recipes 2017: Performance Analysis with BPF
Brendan Gregg
 
Linux or unix interview questions
Linux or unix interview questionsLinux or unix interview questions
Linux or unix interview questions
Teja Bheemanapally
 

Similar to Designing Tracing Tools (20)

Designing Tracing Tools
Designing Tracing ToolsDesigning Tracing Tools
Designing Tracing Tools
 
BPF Tools 2017
BPF Tools 2017BPF Tools 2017
BPF Tools 2017
 
bcc/BPF tools - Strategy, current tools, future challenges
bcc/BPF tools - Strategy, current tools, future challengesbcc/BPF tools - Strategy, current tools, future challenges
bcc/BPF tools - Strategy, current tools, future challenges
 
OSSNA 2017 Performance Analysis Superpowers with Linux BPF
OSSNA 2017 Performance Analysis Superpowers with Linux BPFOSSNA 2017 Performance Analysis Superpowers with Linux BPF
OSSNA 2017 Performance Analysis Superpowers with Linux BPF
 
Linux Capabilities - eng - v2.1.5, compact
Linux Capabilities - eng - v2.1.5, compactLinux Capabilities - eng - v2.1.5, compact
Linux Capabilities - eng - v2.1.5, compact
 
Velocity 2017 Performance analysis superpowers with Linux eBPF
Velocity 2017 Performance analysis superpowers with Linux eBPFVelocity 2017 Performance analysis superpowers with Linux eBPF
Velocity 2017 Performance analysis superpowers with Linux eBPF
 
SOFA Tutorial
SOFA TutorialSOFA Tutorial
SOFA Tutorial
 
Kafka Summit SF 2017 - One Day, One Data Hub, 100 Billion Messages: Kafka at ...
Kafka Summit SF 2017 - One Day, One Data Hub, 100 Billion Messages: Kafka at ...Kafka Summit SF 2017 - One Day, One Data Hub, 100 Billion Messages: Kafka at ...
Kafka Summit SF 2017 - One Day, One Data Hub, 100 Billion Messages: Kafka at ...
 
101 3.2 process text streams using filters
101 3.2 process text streams using filters101 3.2 process text streams using filters
101 3.2 process text streams using filters
 
Linux Systems Performance 2016
Linux Systems Performance 2016Linux Systems Performance 2016
Linux Systems Performance 2016
 
Linux Common Command
Linux Common CommandLinux Common Command
Linux Common Command
 
Container Performance Analysis
Container Performance AnalysisContainer Performance Analysis
Container Performance Analysis
 
Servers and Processes: Behavior and Analysis
Servers and Processes: Behavior and AnalysisServers and Processes: Behavior and Analysis
Servers and Processes: Behavior and Analysis
 
Kernel Recipes 2017 - Performance analysis Superpowers with Linux BPF - Brend...
Kernel Recipes 2017 - Performance analysis Superpowers with Linux BPF - Brend...Kernel Recipes 2017 - Performance analysis Superpowers with Linux BPF - Brend...
Kernel Recipes 2017 - Performance analysis Superpowers with Linux BPF - Brend...
 
Kernel Recipes 2017: Performance Analysis with BPF
Kernel Recipes 2017: Performance Analysis with BPFKernel Recipes 2017: Performance Analysis with BPF
Kernel Recipes 2017: Performance Analysis with BPF
 
Container Performance Analysis Brendan Gregg, Netflix
Container Performance Analysis Brendan Gregg, NetflixContainer Performance Analysis Brendan Gregg, Netflix
Container Performance Analysis Brendan Gregg, Netflix
 
Open Source Systems Performance
Open Source Systems PerformanceOpen Source Systems Performance
Open Source Systems Performance
 
Linux monitoring and Troubleshooting for DBA's
Linux monitoring and Troubleshooting for DBA'sLinux monitoring and Troubleshooting for DBA's
Linux monitoring and Troubleshooting for DBA's
 
Debug generic process
Debug generic processDebug generic process
Debug generic process
 
Linux or unix interview questions
Linux or unix interview questionsLinux or unix interview questions
Linux or unix interview questions
 

More from Sysdig

More from Sysdig (8)

Wordpress y Docker, de desarrollo a produccion
Wordpress y Docker, de desarrollo a produccionWordpress y Docker, de desarrollo a produccion
Wordpress y Docker, de desarrollo a produccion
 
What Prometheus means for monitoring vendors
What Prometheus means for monitoring vendorsWhat Prometheus means for monitoring vendors
What Prometheus means for monitoring vendors
 
15 kubernetes failure points you should watch
15 kubernetes failure points you should watch15 kubernetes failure points you should watch
15 kubernetes failure points you should watch
 
Docker Runtime Security
Docker Runtime SecurityDocker Runtime Security
Docker Runtime Security
 
CI / CD / CS - Continuous Security in Kubernetes
CI / CD / CS - Continuous Security in KubernetesCI / CD / CS - Continuous Security in Kubernetes
CI / CD / CS - Continuous Security in Kubernetes
 
Continuous Security
Continuous SecurityContinuous Security
Continuous Security
 
The top 5 Kubernetes metrics to monitor
The top 5 Kubernetes metrics to monitorThe top 5 Kubernetes metrics to monitor
The top 5 Kubernetes metrics to monitor
 
The top 5 Kubernetes metrics to monitor
The top 5 Kubernetes metrics to monitorThe top 5 Kubernetes metrics to monitor
The top 5 Kubernetes metrics to monitor
 

Recently uploaded

TECUNIQUE: Success Stories: IT Service provider
TECUNIQUE: Success Stories: IT Service providerTECUNIQUE: Success Stories: IT Service provider
TECUNIQUE: Success Stories: IT Service provider
mohitmore19
 
introduction-to-automotive Andoid os-csimmonds-ndctechtown-2021.pdf
introduction-to-automotive Andoid os-csimmonds-ndctechtown-2021.pdfintroduction-to-automotive Andoid os-csimmonds-ndctechtown-2021.pdf
introduction-to-automotive Andoid os-csimmonds-ndctechtown-2021.pdf
VishalKumarJha10
 

Recently uploaded (20)

TECUNIQUE: Success Stories: IT Service provider
TECUNIQUE: Success Stories: IT Service providerTECUNIQUE: Success Stories: IT Service provider
TECUNIQUE: Success Stories: IT Service provider
 
10 Trends Likely to Shape Enterprise Technology in 2024
10 Trends Likely to Shape Enterprise Technology in 202410 Trends Likely to Shape Enterprise Technology in 2024
10 Trends Likely to Shape Enterprise Technology in 2024
 
%in tembisa+277-882-255-28 abortion pills for sale in tembisa
%in tembisa+277-882-255-28 abortion pills for sale in tembisa%in tembisa+277-882-255-28 abortion pills for sale in tembisa
%in tembisa+277-882-255-28 abortion pills for sale in tembisa
 
How To Troubleshoot Collaboration Apps for the Modern Connected Worker
How To Troubleshoot Collaboration Apps for the Modern Connected WorkerHow To Troubleshoot Collaboration Apps for the Modern Connected Worker
How To Troubleshoot Collaboration Apps for the Modern Connected Worker
 
Define the academic and professional writing..pdf
Define the academic and professional writing..pdfDefine the academic and professional writing..pdf
Define the academic and professional writing..pdf
 
%in Stilfontein+277-882-255-28 abortion pills for sale in Stilfontein
%in Stilfontein+277-882-255-28 abortion pills for sale in Stilfontein%in Stilfontein+277-882-255-28 abortion pills for sale in Stilfontein
%in Stilfontein+277-882-255-28 abortion pills for sale in Stilfontein
 
Right Money Management App For Your Financial Goals
Right Money Management App For Your Financial GoalsRight Money Management App For Your Financial Goals
Right Money Management App For Your Financial Goals
 
Optimizing AI for immediate response in Smart CCTV
Optimizing AI for immediate response in Smart CCTVOptimizing AI for immediate response in Smart CCTV
Optimizing AI for immediate response in Smart CCTV
 
Crypto Cloud Review - How To Earn Up To $500 Per DAY Of Bitcoin 100% On AutoP...
Crypto Cloud Review - How To Earn Up To $500 Per DAY Of Bitcoin 100% On AutoP...Crypto Cloud Review - How To Earn Up To $500 Per DAY Of Bitcoin 100% On AutoP...
Crypto Cloud Review - How To Earn Up To $500 Per DAY Of Bitcoin 100% On AutoP...
 
Unveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
Unveiling the Tech Salsa of LAMs with Janus in Real-Time ApplicationsUnveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
Unveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
 
Sector 18, Noida Call girls :8448380779 Model Escorts | 100% verified
Sector 18, Noida Call girls :8448380779 Model Escorts | 100% verifiedSector 18, Noida Call girls :8448380779 Model Escorts | 100% verified
Sector 18, Noida Call girls :8448380779 Model Escorts | 100% verified
 
The Ultimate Test Automation Guide_ Best Practices and Tips.pdf
The Ultimate Test Automation Guide_ Best Practices and Tips.pdfThe Ultimate Test Automation Guide_ Best Practices and Tips.pdf
The Ultimate Test Automation Guide_ Best Practices and Tips.pdf
 
introduction-to-automotive Andoid os-csimmonds-ndctechtown-2021.pdf
introduction-to-automotive Andoid os-csimmonds-ndctechtown-2021.pdfintroduction-to-automotive Andoid os-csimmonds-ndctechtown-2021.pdf
introduction-to-automotive Andoid os-csimmonds-ndctechtown-2021.pdf
 
OpenChain - The Ramifications of ISO/IEC 5230 and ISO/IEC 18974 for Legal Pro...
OpenChain - The Ramifications of ISO/IEC 5230 and ISO/IEC 18974 for Legal Pro...OpenChain - The Ramifications of ISO/IEC 5230 and ISO/IEC 18974 for Legal Pro...
OpenChain - The Ramifications of ISO/IEC 5230 and ISO/IEC 18974 for Legal Pro...
 
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️
 
The Top App Development Trends Shaping the Industry in 2024-25 .pdf
The Top App Development Trends Shaping the Industry in 2024-25 .pdfThe Top App Development Trends Shaping the Industry in 2024-25 .pdf
The Top App Development Trends Shaping the Industry in 2024-25 .pdf
 
LEVEL 5 - SESSION 1 2023 (1).pptx - PDF 123456
LEVEL 5   - SESSION 1 2023 (1).pptx - PDF 123456LEVEL 5   - SESSION 1 2023 (1).pptx - PDF 123456
LEVEL 5 - SESSION 1 2023 (1).pptx - PDF 123456
 
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️
 
MarTech Trend 2024 Book : Marketing Technology Trends (2024 Edition) How Data...
MarTech Trend 2024 Book : Marketing Technology Trends (2024 Edition) How Data...MarTech Trend 2024 Book : Marketing Technology Trends (2024 Edition) How Data...
MarTech Trend 2024 Book : Marketing Technology Trends (2024 Edition) How Data...
 
Azure_Native_Qumulo_High_Performance_Compute_Benchmarks.pdf
Azure_Native_Qumulo_High_Performance_Compute_Benchmarks.pdfAzure_Native_Qumulo_High_Performance_Compute_Benchmarks.pdf
Azure_Native_Qumulo_High_Performance_Compute_Benchmarks.pdf
 

Designing Tracing Tools

  • 1. Brendan Gregg, Senior Performance Architect Designing Tracing Tools
  • 3. I'm currently developing more tracing tools (bcc/BPF)
  • 4. Tool Design • For tool developers • For everyone else: what you can ask for – Tool templates – GUI visualizations • The following is applicable to all tracers – sysdig, bcc/BPF, DTrace, SystemTap, LTTng, …
  • 6. Methodology-driven Design • Methodologies provide ideas for purposeful tools • Find/draw a functional diagram, apply methods See: http://www.brendangregg.com/methodology.html Operating Systems
  • 7. Methodology Examples Eg, at the syscall layer (well known & documented): • Workload Characterization – exec() or open() per-event trace (execsnoop, opensnoop) – connect()/accept() per-event trace (tcpconnect, tcpaccept) – read()/write() size histogram (one-liners) • Latency Analysis – read()/write() latency histogram (biolatency, …) • USE Method – network utilization by thread (not done yet) – syscall errors (fserrors, soerrors)
  • 9. CLI Templates 1. Per event output – *snoop, *slower 0, … 2. Filtered event output – *slower 3. Interval summary – *stat, *top 4. Count summary – *count 5. Histogram summary – *dist, *latency 6. Heatmap summary – spectrogram.lua, subsecoffset.lua, …
  • 10. Template 1: Per Event Output Examples: *snoop, *slower 0, … # opensnoop PID COMM FD ERR PATH 10085 sshd 3 0 /lib/x86_64-linux-gnu/libkeyutils.so.1 10085 sshd 3 0 /lib/x86_64-linux-gnu/libresolv.so.2 10085 sshd 3 0 /lib/x86_64-linux-gnu/libgpg-error.so.0 10085 sshd 3 0 /dev/urandom 10085 sshd -1 2 /lib/x86_64-linux-gnu/.libcrypto.so.1.0.0.hmac 10085 sshd -1 2 /proc/sys/crypto/fips_enabled 10085 sshd 3 0 /proc/filesystems 10085 sshd 3 0 /dev/null 10085 sshd 3 0 /proc/10085/fd 10085 sshd 3 0 /usr/lib/ssl/openssl.cnf 10085 sshd 3 0 /etc/gai.conf 10085 sshd 3 0 /etc/nsswitch.conf 10085 sshd 3 0 /etc/ld.so.cache 10085 sshd 3 0 /lib/x86_64-linux-gnu/libnss_compat.so.2 10085 sshd 3 0 /etc/ld.so.cache 10085 sshd 3 0 /lib/x86_64-linux-gnu/libnss_nis.so.2 […]
  • 11. Template 2: Filtered Event Output Examples: *slower Tools like this can also do all event output: # sysdig -c fileslower 1 TIME PROCESS TYPE LAT(ms) FILE 2014-04-13 20:40:43.973 cksum read 2 /mnt/partial.0.0 2014-04-13 20:40:44.187 cksum read 1 /mnt/partial.0.0 2014-04-13 20:40:44.689 cksum read 2 /mnt/partial.0.0 2014-04-13 20:40:45.005 cksum read 2 /mnt/partial.0.0 2014-04-13 20:40:45.193 cksum read 1 /mnt/partial.0.0 […] # sysdig -c fileslower 0 TIME PROCESS TYPE LAT(ms) FILE 2014-04-13 20:59:04.414 ls read 0 /lib/x86_64-linux-gnu/librt.so.1 2014-04-13 20:59:04.414 ls read 0 /lib/x86_64-linux-gnu/libacl.so.1 2014-04-13 20:59:04.414 ls read 0 /lib/x86_64-linux-gnu/libc.so.6 2014-04-13 20:59:04.414 ls read 0 /lib/x86_64-linux-gnu/libdl.so.2 2014-04-13 20:59:04.414 ls read 0 /lib/x86_64-linux- gnu/libattr.so.1 2014-04-13 20:59:04.415 ls read 0 /proc/filesystems 2014-04-13 20:59:04.415 ls read 0 /proc/filesystems [...]
  • 12. Template 3: Interval Summary Examples: *stat, *top # dcstat TIME REFS/s SLOW/s MISS/s HIT% 08:11:47: 2059 141 97 95.29 08:11:48: 79974 151 106 99.87 08:11:49: 192874 146 102 99.95 08:11:50: 2051 144 100 95.12 08:11:51: 73373 17239 17194 76.57 08:11:52: 54685 25431 25387 53.58 08:11:53: 18127 8182 8137 55.12 08:11:54: 22517 10345 10301 54.25 08:11:55: 7524 2881 2836 62.31 08:11:56: 2067 141 97 95.31 08:11:57: 2115 145 101 95.22 […]
  • 13. Template 4: Count Summary Examples: *count # funccount 'vfs_*' Tracing... Ctrl-C to end. ^C ADDR FUNC COUNT ffffffff811efe81 vfs_create 1 ffffffff811f24a1 vfs_rename 1 ffffffff81215191 vfs_fsync_range 2 ffffffff81231df1 vfs_lock_file 30 ffffffff811e8dd1 vfs_fstatat 152 ffffffff811e8d71 vfs_fstat 154 ffffffff811e4381 vfs_write 166 ffffffff811e8c71 vfs_getattr_nosec 262 ffffffff811e8d41 vfs_getattr 262 ffffffff811e3221 vfs_open 264 ffffffff811e4251 vfs_read 470 Detaching...
  • 14. Template 5: Histogram Summary Examples: *dist, *latency # biolatency Tracing block device I/O... Hit Ctrl-C to end. ^C usecs : count distribution 4 -> 7 : 0 | | 8 -> 15 : 0 | | 16 -> 31 : 0 | | 32 -> 63 : 0 | | 64 -> 127 : 1 | | 128 -> 255 : 12 |******** | 256 -> 511 : 15 |********** | 512 -> 1023 : 43 |******************************* | 1024 -> 2047 : 52 |**************************************| 2048 -> 4095 : 47 |********************************** | 4096 -> 8191 : 52 |**************************************| 8192 -> 16383 : 36 |************************** | 16384 -> 32767 : 15 |********** | 32768 -> 65535 : 2 |* | 65536 -> 131071 : 2 |* |
  • 15. Template 6: Heatmap Summary Example: subsecoffset.lua (aka "spectrogram")
  • 16.
  • 17. Valuable Know what already exists, and what doesn't
  • 18. Low Overhead (or documented) • Understand tracing internals – For example, sysdig's design has ~20x lower overhead than strace (it still has overhead: test and measure to see if it's acceptable) – Tracing overhead is usually relative to event rate • Design for low overhead, and document expectations sysdig 1. enable Kernel syscalls sysdig driver ring buffer lua program 2. async read 3. output
  • 19. Documentation • Good tools have 3 docs: 1. Code comments 2. Man page 3. Examples file • Man page – troff, docbook, … • Examples file: .TH Title heading .SH Section heading .IP Indented paragraph .TP Indented paragraph with label .B Bold - - common man macros (see groff_man(7)) Demonstrations of biosnoop, the Linux eBPF/bcc version. biosnoop traces block device I/O (disk I/O), and prints a line of output per I/O. Example: # ./biosnoop TIME(s) COMM PID DISK T SECTOR BYTES LAT(ms) 0.000004001 supervise 1950 xvda1 W 13092560 4096 0.74 [...]
  • 20. Concise, intuitive, self-explanatory • Useful startup message – What I'm tracing, when there's output, when I'll end • Vigorous tooling is concise – No wasted text; leave less useful output for non-default options • Unix philosophy: do one thing and do it well # ./iolatency Tracing block I/O. Output every 1 seconds. Ctrl-C to end. >=(ms) .. <(ms) : I/O |Distribution | 0 -> 1 : 4381 |######################################| 1 -> 2 : 9 |# | 2 -> 4 : 5 |# | 4 -> 8 : 0 | | 8 -> 16 : 1 |# | […]
  • 21. POSIX-style Arguments # ./biolatency -h usage: biolatency [-h] [-T] [-Q] [-m] [-D] [interval] [count] Summarize block device I/O latency as a histogram positional arguments: interval output interval, in seconds count number of outputs optional arguments: -h, --help show this help message and exit -T, --timestamp include timestamp on output -Q, --queued include OS queued time in I/O time -m, --milliseconds millisecond histogram -D, --disks print a histogram per disk device examples: ./biolatency # summarize block I/O latency as a histogram ./biolatency 1 10 # print 1 second summaries, 10 times ./biolatency -mT 1 # 1s summaries, milliseconds, and timestamps ./biolatency -Q # include OS queued time in I/O time ./biolatency -D # show each disk device separately
  • 22. Option Alternate Expectation -a --all all events -c CMD --cmd … run this command -d SECONDS --duration … duration of tool execution -h --help help -i FILE --input … input file -i SECONDS --interval … summary interval -n name --name … this process name only -o FILE --output … output file -p PID --pid … this process ID only -P --by-process per-process ID breakdown -P PORT --port … this TCP port only -t or -T --[no]timestamp include or exclude timestamps -v --verbose verbose output -x --extended, --errors extended output, or only failures [interval [count]] - summary interval, and # of outputs POSIX-style Arguments
  • 23. Testing, Testing, Testing • If you can't write the workload, you can't write the tool – Be it 10 lines of C, some shell, or dd – dd if=/dev/urandom of=/dev/null bs=1k count=23 • Known workload testing: create 23 events • Testing can be time consuming – I spend more time testing a tool than writing it – Automatic tool testing is a difficult problem
  • 24. Example: gethostlatency # gethostlatency TIME PID COMM LATms HOST 06:10:24 28011 wget 90.00 www.iovisor.org 06:10:28 28127 wget 0.00 www.iovisor.org 06:10:41 28404 wget 9.00 www.netflix.com 06:10:48 28544 curl 35.00 www.netflix.com.au 06:11:10 29054 curl 31.00 www.plumgrid.com 06:11:16 29195 curl 3.00 www.facebook.com 06:11:25 29404 curl 72.00 foo 06:11:28 29475 curl 1.00 foo
  • 25. Example: ext4slower # ext4slower 1 Tracing ext4 operations slower than 1 ms TIME COMM PID T BYTES OFF_KB LAT(ms) FILENAME 06:49:17 bash 3616 R 128 0 7.75 cksum 06:49:17 cksum 3616 R 39552 0 1.34 [ 06:49:17 cksum 3616 R 96 0 5.36 2to3-2.7 06:49:17 cksum 3616 R 96 0 14.94 2to3-3.4 06:49:17 cksum 3616 R 10320 0 6.82 411toppm 06:49:17 cksum 3616 R 65536 0 4.01 a2p 06:49:17 cksum 3616 R 55400 0 8.77 ab 06:49:17 cksum 3616 R 36792 0 16.34 aclocal-1.14 06:49:17 cksum 3616 R 15008 0 19.31 acpi_listen 06:49:17 cksum 3616 R 6123 0 17.23 add-apt- repository 06:49:17 cksum 3616 R 6280 0 18.40 addpart 06:49:17 cksum 3616 R 27696 0 2.16 addr2line 06:49:17 cksum 3616 R 58080 0 10.11 ag 06:49:17 cksum 3616 R 906 0 6.30 ec2-meta-data […]
  • 26. Example: biolatency # biolatency -m 1 5 Tracing block device I/O... Hit Ctrl-C to end. msecs : count distribution 0 -> 1 : 36 |**************************************| 2 -> 3 : 1 |* | 4 -> 7 : 3 |*** | 8 -> 15 : 17 |***************** | 16 -> 31 : 33 |********************************** | 32 -> 63 : 7 |******* | 64 -> 127 : 6 |****** | […]
  • 28. GUI Visualizations 1. Event logs 2. Tables 3. Line graphs 4. Histograms 5. Heatmaps (spectrographs) 6. Waterfall charts 7. Directed graphs 8. Flame graphs 9. Flame charts
  • 29. Visualization 1: Event Logs https://commons.wikimedia.org/wiki/File:Wireshark_screenshot.png
  • 31. Visualization 3: Line Graphs http://www.paradyn.org/html/screen-shots.html
  • 32. Visualization 4: Histograms Or a density plot Or as a frequency trail (can cascade)
  • 33. Visualization 5: Heat Maps eg, Oracle ZFS Storage Appliance Analytics (DTrace-based)
  • 37. Visualization 8: Flame Graphs Commonly used with CPU profilers. Also useful for tracers: off-CPU time, ... file read from disk directory read from disk pipe write path read from disk fstat from disk
  • 39. Desirable Attributes • Valuable – Methodologies provide ideas for purposeful metrics • Documented – Tool tips, wikis • Tested • Real Time • Dashboards – To support methodologies
  • 40. Thank You! http://www.brendangregg.com http://slideshare.net/brendangregg bgregg@netflix.com @brendangregg References & Links: – http://www.brendangregg.com/heatmaps.html – http://www.brendangregg.com/flamegraphs.html – http://www.slideshare.net/brendangregg/monitorama-2015-netflix-instance-analysis

Editor's Notes

  1. demo BPF iosnoop
  2. needs a better name. Linux FrogFilter.
  3. I don't want a floor wax and a desert topping! netstat doesn't have a tcpdump option (it's bad enough as it is). single tools: - simplify testing - free up argument options
  4. If you try to automate, then your tool output may be polluted with other system events (unless you can filter to a PID, but then, you aren't testing system-wide). Ensuring a tool doesn't over count from another workload is also difficult: what other workloads should you run during the test? undefined.
  5. now do demos
  6. <hi thanks for joining…> I’ll start with a really quick introduction and then I’d love to learn a little bit more about you and your environment, and why you’re interested in Sysdig Cloud. But to first set the stage… Sysdig Cloud. There are a million other monitoring tools out there – why does the world need another? Well with Sysdig Cloud, we set out to create the first and only comprehensive, container-native monitoring solution