2. Netflix, Inc.
• World's leading internet television network
• ~40 million subscribers in 40+ countries
• Over a billion hours streamed per month
• Approximately 33% of all US Internet traffic at night
• Recent Notables
– Increased originals catalog
4. Why Tune the AMI?
• @ Netflix: 10’s of 1000’s of instances running globally
– “Rising Tide Lifts All Ships”
• Large variability in production workloads
– OLTP (majority of REST-based services)
– Batch/Pre-Compute (think movie recommendations…)
– Cassandra
– EVCache (memcached tier)
• Cloud environments have inherent performance variability
– Improve resilience to such variability
• Deployment model affords ease of customization
5. Baking Performance Into the Base
• Aminator – Open Source AMI bakery
• Broad propagation of standard performance tunings
– Apache, Tomcat configurations
• Focused application of workload-specific configurations
– Primarily kernel and OS optimizations: CPU Scheduling, Memory Management, Network, IO
6. Linux Kernel Tuning - Benefits
• Effectively drive key instance resource dimensions
• Improved efficiency at scale saves big $
• Tuning process drives identification of ideal instance type
• Readily available advanced Linux tools (e.g. perf, systemtap) provide deep insight into the kernel and the application:
– Top Down Analysis: Review of application interaction with system resources
– Bottom Up Analysis: System resource usage of the application
7. Kernel Tuning Trade-Offs
• Kernel subsystems are inter-dependent
– Tuning in one area may improve efficiency at the expense of another
• 80/20 Rule:
– 80%: Improvement gained by application refactoring and tuning
– 20%: OS tuning, infrastructure improvement, etc.
• Tuning tailors the system for a specific workload
– Other workloads may perform worse
Tuning objective is to align system resources to application requirements in order to improve overall system performance.
9. Metrics of Interest
• Performance analysis focus
Resource | Characteristic
CPU | utilization, saturation, process priority, affinity, NUMA
Memory | physical/virtual memory usage, swapping, page cache
Network IO | network stack congestion, latency, throughput
Block IO | block layer and device latency, throughput, file system
Scalability | concurrency, parallelism, shared resources, lock contention
10. Basic Tools
Tool | Description
vmstat, dstat | Report system-wide CPU utilization, saturation, memory and swap usage. Overview of kernel events such as syscalls, context switches, interrupts, etc.
mpstat | Reports per-CPU utilization, hard/soft interrupts, virtualization overhead (%steal, %guest)
top, atop, htop, nmon | Report per process/thread state, scheduling priorities, CPU usage, etc. atop is similar to top but keeps historical data for trend analysis. htop and nmon provide similar stats with a graphical view.
iostat | IO latency/throughput at the driver and the block layer. Device utilization
sar | Keeps historical data about CPU, memory, network, IO usage
uptime | Reports CPU saturation: threads waiting for CPU
11. Basic Tools, cont.
Tool | Description
free | Free memory and swap. Counts page cache memory as free
/proc/meminfo | Memory, swap and file system statistics. Kernel memory usage, statistics for conservative memory allocation policy, HugeTLB, etc.
pidstat | Per process/thread CPU usage, context switches, memory, swap, IO usage
ps, pstree | Per process/thread CPU and memory usage
/proc, /sys file systems | /proc: stats about processes, threads, scheduling, kernel stacks, memory, etc. /sys: device-specific stats: disk, NIC, etc.
netstat, iptraf | TCP/IP statistics, routing, errors, network connectivity, and NIC stats. iptraf shows real-time TCP/IP network traffic
nicstat, ping, ifconfig | NIC stats, network connectivity, netmask, subnet, etc.
12. Advanced Tools
Tool | Description
blktrace | Profiles the Linux block layer and reports events such as merge, plug, split, remap, etc. Reports PID, block number, IO size, timestamp, etc.
slabtop | Kernel memory usage and statistics for the various kernel caches in use
pmap | Dumps all memory segments in the process address space: heap, stack, mmap
pstack, jstack | Dump the application's user-level stack trace. jstack output includes Java methods, tid, pid, and thread states
iotop | Per process/thread IO statistics. Reports application time spent blocking on IO
/proc/net/softnet_stat | Per-CPU backlog queue throttling (netdev_max_backlog) stats
/proc/interrupts, /proc/softirqs | /proc/interrupts tells which CPU is processing device interrupts. /proc/softirqs provides information about softirq processing for the network stack and the Linux block layer
tcpdump, wireshark | Network sniffers. Capture network traffic (libpcap format) for post analysis. Wireshark can be used to analyze tcpdump and ethereal traces
ethtool | NIC low-level statistics: NIC speed, full duplex, transmit descriptors, ring buffer
13. Advanced Tools, cont.
Tool | Description
perf | Application and kernel profiling and tracing tool. Reports top kernel and application CPU-bound routines and stack traces. Captures hardware events (CPU cache, TLB misses, etc.) and software (kernel, application) static and dynamic events to perform low-level profiling and tracing.
systemtap | Application and kernel profiling and tracing tool. Allows inserting trace points in the kernel and application dynamically to capture low-level profiling data for performance analysis and debugging. Scripting language similar to C and Perl.
latencytop | Kernel blocking events due to locks, IO, condition variables. Dumps kernel stack
strace | Reports information about system calls generated by the application: type of system call, arguments, return value, errno, and elapsed time.
numastat | NUMA-related latency stats on the HVM platform
15. Use Case: CFS scheduler tuning
• Goal:
– Improve batch and compute-intensive processing:
• Increase time slice and/or process priority in order to reduce context switches
• The longer a process runs on the CPU, the better its use of CPU caches
• Tunables:
– Change scheduling policy of workload: # chrt -a -b -p 0 <PID>
OR
– Set CFS tunables to improve time slice at a system-wide level
• sched_latency_ns: default 6ms * (1 + log2(ncpus)). Ex: 4 CPU cores = 18ms. Set it higher
• sched_min_granularity_ns: default 0.75ms * (1 + log2(ncpus)). Ex: 4 CPU cores = 2.25ms. Set it higher
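The default-value formulas above can be checked with a few lines of Python. This is only a sketch of the slide's arithmetic: the function name `cfs_defaults_ns` is hypothetical, and rounding log2 down for CPU counts that are not powers of two is an assumption, not something the slide states.

```python
def cfs_defaults_ns(ncpus):
    """Return (sched_latency_ns, sched_min_granularity_ns) using the slide's formula.

    Assumption: log2(ncpus) is rounded down for non-power-of-two CPU counts.
    """
    if ncpus < 1:
        raise ValueError("ncpus must be >= 1")
    factor = 1 + (ncpus.bit_length() - 1)  # 1 + floor(log2(ncpus))
    base_latency_ns = 6_000_000            # 6 ms
    base_granularity_ns = 750_000          # 0.75 ms
    return base_latency_ns * factor, base_granularity_ns * factor


latency_ns, granularity_ns = cfs_defaults_ns(4)
print(latency_ns, granularity_ns)  # 18000000 2250000
```

For a 4-core instance this reproduces the 18ms and 2.25ms figures, which gives a baseline to compare against before setting the tunables higher.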
17. Use Case: Page Cache Tuning
• Goal:
– Increase application write throughput
– Reduce IO flooding by writing consistently rather than in bulk
• Tunables:
– dirty_ratio=60
– dirty_background_ratio=5
– dirty_expire_centisecs=30000
– swappiness=0
• Page cache hit/miss ratio:
– systemtap (ioblock.request, vfs.read probes)
– The fincore command can be used to find which pages of a file are in the page cache
18. Use Case: Linux Block Layer Tuning
• Goal:
– Queue more data to SSD device to achieve higher throughput
– Better sequential read IO throughput by fetching more data
– Distribute IO processing across multiple CPUs
• Tunables:
– /sys/block/<dev>/queue/nr_requests=256
– /sys/block/<dev>/queue/read_ahead_kb=256
– /sys/block/<dev>/queue/scheduler=noop
– /sys/block/<dev>/queue/rq_affinity=2
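Before changing the IO scheduler it helps to confirm which one is active. On kernels of this era, reading /sys/block/<dev>/queue/scheduler returns the available schedulers with the active one in square brackets. The sketch below parses that text; the helper name `active_io_scheduler` and the sample string are illustrative assumptions.

```python
def active_io_scheduler(scheduler_file_text):
    """Return the scheduler shown in [brackets], or None if none is marked."""
    for name in scheduler_file_text.split():
        if name.startswith("[") and name.endswith("]"):
            return name[1:-1]
    return None


# Sample content of /sys/block/<dev>/queue/scheduler (illustrative, not captured from a host)
sample = "noop deadline [cfq]"
print(active_io_scheduler(sample))  # cfq
```

On a real instance the text would come from reading the sysfs file for the device being tuned.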
19. Use Case: Memory Allocation Tuning
• Goal:
– Avoid running out of memory while running a production load
– Do not allow memory over-commit that may result in OOM
• Tunable
– overcommit_memory=2
– overcommit_ratio=80
20. Use Case: Network Stack Tuning
• Goal:
– Increase Network Stack Throughput
– Larger TCP receive and Congestion window
– Scale network stack processing across multiple CPUs
• Tunable
– tcp_slow_start_after_idle=0
– tcp_fin_timeout=10
– tcp_early_retrans=1
– netdev_max_backlog=5000
– txqueuelen=5000
– rmem_max, wmem_max = 16777216 or higher
– tcp_wmem, tcp_rmem = 8388608,1258291,16777216 or higher
– rps_sock_flow_entries=32768
– /sys/class/net/eth?/queues/rx-0/rps_flow_cnt=32768
– /sys/class/net/eth?/queues/rx-0/rps_cpus=0xf
22. Future tuning activity
• M3 class instances support both HVM and PV. Easy validation of performance gain with HVM versus PV
• Study Cassandra workload on SSD-based systems
• Tune Linux Block Layer and compare performance of different IO schedulers: noop, CFQ, deadline
• Test file system performance (XFS, EXT4, BTRFS) on various workloads running on SSD instances
• Test network performance with new TCP/IP and network stack features: TCP early retransmit, TCP Proportional Rate Reduction, and RFS/RPS
• Capture low-level performance metrics using perf, systemtap, and JVM profiling tools
23. Please give us your feedback on this presentation (CPN302)
25. Profiling and Tracing Benefits
• Fine-grained measurements and low-level statistics help with difficult-to-solve performance issues
• Isolate hot spots, resource usage and contention in application and kernel
• Gain comprehensive insight into application and kernel behavior
26. SystemTap and Perf Benefits
• Inserts trace points into the running application and kernel
without adding any debug code
• Lower overhead, processing done in the kernel space. No
stopping/starting the application
• Help build custom tools to fill out observability gaps
• Analyze throughput and latency across application and all
kernel subsystems
• Unified view of user (application) and kernel events
27. SystemTap and Perf Benefits (cont.)
SystemTap and Perf can track all sorts of events at system-wide, process and thread levels:
• Time spent in system calls and kernel functions; arguments passed, return values, errno
• Dump application and kernel stack trace at any point in the execution path
• Time spent in various process states: blocking for IO, lock, resource, and waiting for CPU
• Top CPU-bound user and kernel functions
• Low-level TCP stats. Not possible with standard tools
• Low-level IO and network activity. Page cache hit/miss rates
• Monitor page faults, memory allocation, memory leaks
• Aggregate results when a large amount of data needs to be collected and analyzed
29. SystemTap and Perf Events
Perf and SystemTap capture events generated from various sources:
• Hardware Events (perf only): If running on a bare-metal system, perf can access hardware events generated by the PMU (Performance Monitoring Unit). Examples: CPU cache/TLB loads, references and misses, IPC (CPU stall cycles), branches, etc.
• Software Events: Events such as page-faults, cpu-clock, context switches, etc.
• Static Trace Events: Trace points coded into the entry and exit of kernel functions. Examples: syscalls, net, sched, irq, etc.
• Dynamic Trace Events: Dynamic trace points that can be inserted on the fly (hot patching) into application and kernel functions via the breakpoint engine (kprobe). No debug builds of the kernel or application, pauses, etc. are required.
42. SystemTap
• SystemTap supports a scripting language similar to C and Perl and follows an event-action model:
– Event: Trace or probe point of interest
• Example: system calls, kernel functions, profiling events, etc.
– Action: What to do when the event of interest occurs
• Example: Print app-name, PID whenever the write() syscall is invoked
• The idea behind SystemTap is to name an event (probe) and provide a handler to perform an action in the event context
– A probe point is like a breakpoint, but instead of stopping the kernel/application at the breakpoint, SystemTap causes a branch (jump) to the probe handler routine to perform the action.
• A script can have multiple probes and associated handlers. Data is accumulated in a buffer and then dumped to standard output.
43. SystemTap – Runs as a Kernel Module
• When a systemtap script is executed, it is converted into a .c file and compiled as a Linux kernel module (.ko)
• The module is loaded into the kernel and probes are inserted by hot patching the running kernel and application
• The module is unloaded when Ctrl-C is pressed or exit() is called from a probe handler in the module.
• Systemtap scripts use the file extension .stp and contain probes and handlers written in the format:
probe event { statements }
• When run as a script, the first line should name the interpreter:
#!/usr/bin/env stap
• Or run from the command line:
# stap script.stp
44. SystemTap: Events
SystemTap trace points can be placed at various locations
in kernel:
– syscall: system call entry and return
• Example: syscall.read, syscall.read.return
– vfs: VFS functions entry and return
– kernel.function: Kernel function entry and return
• Example: kernel.function("do_fork"), kernel.function("do_fork").return
– module.function: Kernel module entry and return
• Other events:
– begin: event fires at the start of script
– end: event fires when the script exits
– timer: event fires periodically.
45. SystemTap: Functions
Commonly used functions:
• tid(): The ID of the current thread
• uid(): The ID of the current user
• cpu(): The current CPU number
• gettimeofday_s(): The number of seconds since the UNIX epoch (January 1, 1970)
• probefunc(): Name of the probed function
• pid(): The ID of the current process
• execname(): Executable name
• thread_indent(): Provides indentation to nicely format printing of function call entry and return
• target(): The PID specified on the command line
• print_backtrace(): Prints the complete stack trace
• print_regs(): Prints CPU registers
• kernel_string(): Useful to print char-type fields in data structures
47. CFS Scheduler Tuning
• CFS scheduler:
– Provides fair share of CPU resources to all running tasks
– Tasks are assigned weights (priority) to control the time a task can run on the CPU.
• Involuntary context switch: a task has consumed its time slice or is preempted by a higher-priority task
• Voluntary context switch: a task relinquishes the CPU when it blocks on a resource: IO (disk, net), locks, etc.
• CFS supports various scheduling policies: FIFO, BATCH, IDLE, OTHER (default), RR
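Voluntary and involuntary context switches can be read per process from /proc/<pid>/status, which exposes the `voluntary_ctxt_switches` and `nonvoluntary_ctxt_switches` counters. A minimal sketch of pulling them out follows; the function name and the sample text are illustrative assumptions.

```python
def context_switches(status_text):
    """Extract the two context-switch counters from /proc/<pid>/status text."""
    wanted = ("voluntary_ctxt_switches", "nonvoluntary_ctxt_switches")
    counts = {}
    for line in status_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key in wanted:
            counts[key] = int(value.strip())
    return counts


# Illustrative excerpt, not captured from a real process
sample = (
    "Name:\tjava\n"
    "voluntary_ctxt_switches:\t1200\n"
    "nonvoluntary_ctxt_switches:\t300\n"
)
print(context_switches(sample))
```

A high share of non-voluntary switches for a compute-bound task is the situation the scheduler tuning on the following slides tries to reduce.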
48. CFS Tunable – Compute Intensive Workload
• The performance goal of a batch workload is to complete the given task in the shortest time possible. The SCHED_BATCH policy is more appropriate for batch processing workloads.
• A task running with the SCHED_BATCH policy gets a bigger time slice and thus is not involuntarily context switched as frequently. That allows compute tasks to run longer and make better use of CPU caches.
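A process can also request SCHED_BATCH for itself rather than being changed externally. The sketch below uses Python's standard `os.sched_setscheduler` call, which is only available on Linux; the wrapper name `try_set_batch_policy` is hypothetical, and returning False on any failure is a design choice made here for illustration.

```python
import os


def try_set_batch_policy(pid=0):
    """Try to move `pid` (0 = the calling process) to SCHED_BATCH.

    Returns True if the policy is SCHED_BATCH afterwards, False otherwise.
    """
    if not hasattr(os, "sched_setscheduler") or not hasattr(os, "SCHED_BATCH"):
        return False  # platform does not expose these calls
    try:
        # SCHED_BATCH requires a static priority of 0
        os.sched_setscheduler(pid, os.SCHED_BATCH, os.sched_param(0))
        return os.sched_getscheduler(pid) == os.SCHED_BATCH
    except OSError:
        return False  # e.g. insufficient privileges or a restricted sandbox


print(try_set_batch_policy())
```

Calling it early in a batch job's startup has roughly the same intent as applying the batch policy to the running PID from the command line.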
49. CFS Tunable – Compute Intensive Workload
CFS tunables can also be set to reduce context switching activity:
• sched_latency_ns: Period in which each runnable task should run once. A larger value offers a bigger CPU slice, which may improve compute performance. Interactive application performance may suffer.
Default: 6ms * (1 + log2(ncpus)). Example: 4 CPU cores = 18ms (default). Change it to 36ms
• sched_min_granularity_ns: Minimum amount of CPU time each task should get. A larger value helps compute workloads.
Default: 0.75ms * (1 + log2(ncpus)). Example: 4 CPU cores = 2.25ms (default). Change it to 5ms
Internal testing at Netflix shows a 2-5% performance improvement for compute-intensive tasks when running the workload with the SCHED_BATCH policy as compared to SCHED_OTHER.
50. Avoid OOM Killer
To overcome memory and swap shortages, the Linux kernel may kill processes to free memory. This mechanism is called the Out-Of-Memory (OOM) Killer.
Tunable | Discussion
Heuristic overcommit: overcommit_memory=0 (default) | Allows overcommitting a reasonable amount of memory, as determined by free memory, swap and other heuristics. No reservation of memory and swap. Thus memory and swap may run out before the application uses all of its memory. This may result in application failure due to OOM.
Always overcommit: overcommit_memory=1 | Allows wild overcommit. Any size of memory allocation (malloc) will succeed. As with the heuristic mode, memory and swap may run out and trigger the OOM killer.
Strict overcommit: overcommit_memory=2, overcommit_ratio=80 | Prevents overcommit. It does not count free memory or swap when making decisions about the commit limit. When the application calls malloc(1GB), the kernel reserves or deducts 1GB from free memory and swap. This guarantees that memory committed to the application will be available if needed, and prevents OOM because no overcommit is allowed.
51. Avoid OOM Killer (continue..)
• When strict overcommit is enforced, the total memory that can be allocated system-wide is restricted to:
Overcommit Limit = Physical Memory x overcommit_ratio + swap
where: overcommit_ratio=50% (default). Tune overcommit_ratio = 80%
• A new program may fail to allocate memory even when the system is reporting plenty of free memory and swap. This is due to memory and swap already reserved on behalf of running processes.
• This feature does not affect memory used by the file system page cache. Page cache memory is always counted as free.
• Use /proc/meminfo statistics to monitor memory that has already been committed:
CommitLimit: Total amount of memory that can be allocated system-wide
Committed_AS: Memory already committed on behalf of applications
MemoryAvailable: CommitLimit - Committed_AS
Any attempt to allocate memory over “MemoryAvailable” will fail when strict overcommit is used.
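The commit-limit arithmetic above is easy to reproduce. The sketch below applies the slide's formula and derives the remaining headroom from /proc/meminfo-style text. The function names and sample numbers are illustrative assumptions, and the formula is the slide's simplified version (it ignores any memory set aside for huge pages).

```python
def commit_limit_kb(mem_total_kb, swap_total_kb, overcommit_ratio):
    """Slide formula: physical memory x overcommit_ratio + swap (all in kB)."""
    return mem_total_kb * overcommit_ratio // 100 + swap_total_kb


def memory_available_kb(meminfo_text):
    """Return CommitLimit - Committed_AS from /proc/meminfo-style text."""
    fields = {}
    for line in meminfo_text.splitlines():
        key, sep, rest = line.partition(":")
        parts = rest.split()
        if sep and parts:
            fields[key.strip()] = int(parts[0])
    return fields["CommitLimit"] - fields["Committed_AS"]


# Illustrative numbers: 8,000,000 kB RAM, 2,000,000 kB swap, ratio 80
print(commit_limit_kb(8_000_000, 2_000_000, 80))  # 8400000

sample = "CommitLimit:     8400000 kB\nCommitted_AS:    3100000 kB\n"
print(memory_available_kb(sample))  # 5300000
```

Tracking the second number over time shows how close a host is to the point where new allocations start failing under strict overcommit.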
52. Tuning for Higher Throughput
Tunable | Discussion
dirty_ratio | Throttles writes when dirty pages in the file system cache reach 40%. For write-intensive workloads, increase it to 60-80%
dirty_background_ratio | Wakes up pdflush when dirty pages reach 10% of total memory. Reducing the value (to 5%) lets pdflush wake up earlier, which may keep dirty page growth in check
dirty_expire_centisecs | Data can stay dirty in the page cache for 30 secs. Increase it to 60-300 seconds on large-memory systems to prevent heavy IO to storage due to the short deadline. The drawback is that an unexpected outage may result in loss of data not yet committed.
swappiness | Controls Linux periodic swapping activity. A large value favors growing the page cache by stealing the application's inactive pages. Setting the value to zero disables periodic swapping. A large value may improve application write throughput. A value of zero is recommended for latency-sensitive applications
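To get a feel for what the dirty_ratio and dirty_background_ratio percentages mean in bytes, the sketch below converts them for a given memory size. It is an approximation under a stated assumption: the percentages are applied to the memory size passed in, whereas the kernel's own accounting may use a different base. The function name is hypothetical.

```python
def dirty_thresholds_bytes(memory_bytes, dirty_ratio, dirty_background_ratio):
    """Return (background_flush_start, write_throttle_start) in bytes.

    Assumption: both ratios are taken as a percentage of `memory_bytes`.
    """
    background = memory_bytes * dirty_background_ratio // 100
    throttle = memory_bytes * dirty_ratio // 100
    return background, throttle


# Illustrative: 10 GB of memory with dirty_ratio=60, dirty_background_ratio=5
background, throttle = dirty_thresholds_bytes(10_000_000_000, 60, 5)
print(background, throttle)  # 500000000 6000000000
```

The gap between the two values is the amount of dirty data that can build up between background flushing starting and writers being throttled.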
53. Linux Block Layer – IO Tuning
• sysfs (/sys) is used to set device-specific attributes (tunables): /sys/block/<dev>/queue/..
• nr_requests: Limits the number of IO requests queued per device (default 128). To improve IO throughput, consider doubling this value for RAID (multiple disks) devices or SSD.
• scheduler: VM instances use the Xen virtualization layer and thus have no knowledge of the underlying geometry of disks. The noop IO scheduler is recommended, considering it is FIFO and has the least overhead.
• read_ahead_kb: Improves sequential IO performance. A larger value means fetching more data into the page cache to improve application IO throughput.
54. Block Layer: IO Affinity
• The Linux IO affinity feature distributes IO processing work across multiple CPUs.
• When the application blocks on IO, the kernel records the CPU and dispatches the IO. When the IO is marked completed by the storage driver, the block layer performs IO completion processing on the same CPU that originally issued the IO.
• This feature is very helpful when dealing with high IOPS rates, such as on SSD systems, given that IO completion processing will be distributed across multiple CPUs.
Tunable | Discussion
rq_affinity = 1 (default) | Block layer will migrate IO completion to the CPU group that originally submitted the request
rq_affinity = 2 | Forces the IO completion onto the CPU that originally issued the IO, bypassing the “group” logic. This option maximizes distribution of IO completion
55. RPS/RFS - Network Performance and Scalability
• RPS (Receive Packet Steering) and RFS (Receive Flow Steering) can help a system scale better by distributing network stack processing across multiple CPUs.
• Without this feature, network stack processing is restricted to the same CPU that serviced the NIC interrupt, which may induce latencies and lower network throughput.
• The NIC driver calls netif_rx() to enqueue the packet for processing. The RPS function get_rps_cpu() selects the appropriate queue that should process the request and thus distributes the work across multiple CPUs.
• RPS makes its decision by a hash lookup that uses a CPU bitmask to decide which CPU should process the packet.
• RFS steers the processing to the CPU where the application thread that eventually consumes the data is running. It uses the hash as an index into the network flow lookup table that maps flows to CPUs. This improves CPU cache locality.
56. RPS/RFS - Network Performance and Scalability (cont.)
Tunable | Discussion
net.core.rps_sock_flow_entries=32768 | Global flow table mapping each flow to its desired CPU. Each table value is a CPU index that is updated during socket calls.
/sys/class/net/eth?/queues/rx-0/rps_flow_cnt=32768 | Number of entries in the per-queue flow table. The right value is determined by the number of active connections. 32768 is a good start for a moderately loaded server. For a single-queue device (as in the case of AWS instances), the two tunables should be set to the same value. net.core.rps_sock_flow_entries must be set for this to work.
/sys/class/net/eth?/queues/rx-0/rps_cpus=0xf | Set as a bitmask of CPUs. Disabled when set to zero (packets are then processed on the interrupted CPU). Set to all CPUs, or to the CPUs that are part of the same NUMA node (large server). A value of 0xf causes CPUs 0, 1, 2, 3 to do network stack processing
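The rps_cpus bitmask can be built from a list of CPU numbers rather than worked out by hand. A minimal sketch follows, assuming a machine with at most 32 CPUs so that a single hex word is enough; the helper name `rps_cpu_mask` is hypothetical.

```python
def rps_cpu_mask(cpus):
    """Return the hex bitmask string (without 0x) selecting the given CPUs.

    Assumption: CPU numbers are below 32, so one hex word is sufficient.
    """
    mask = 0
    for cpu in cpus:
        if not 0 <= cpu < 32:
            raise ValueError("CPU number out of range for this sketch")
        mask |= 1 << cpu  # set the bit for this CPU
    return format(mask, "x")


print(rps_cpu_mask([0, 1, 2, 3]))  # f
print(rps_cpu_mask(range(8)))      # ff
```

The first call reproduces the 0xf value from the table; an empty list yields 0, which corresponds to RPS being disabled.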
57. Network Stack Tuning
Packet Transmit Path:
• Network stack converts application payload
written in socket buffer into TCP segments (or
UDP datagrams), calculates the best route
and then writes the packet into NIC driver
queue.
• QoS is provided by inserting various queue disciplines (FIFO, RED, CBQ..). Queue size is set to txqueuelen
• The NIC driver processes packets one by one by writing (DMA) to NIC transmit descriptors. In the case of Xen, the packet is written into the Xen shared IO ring (Xen split device driver model)
58. Network Stack Tuning
Packet Receive Path:
• Device writes (DMA) packet into kernel memory
and raises interrupt.
• In case of Xen, packet is written into IO shared
ring and notification is sent via event channel
• NIC driver interrupt handler copies the packet into
input queue (per-cpu queue). Queue is
maintained by network stack and its size is set to
netdev_max_backlog.
• Packets are processed on the same CPU that
received the interrupt. If RPS/RFS feature is
enabled then network stack processing is
distributed across multiple CPUs
• The packet is eventually written to the socket buffer. The application wakes up and processes the payload
59. TCP Congestion and Receiver Advertise Window
TCP tuning requires an understanding of some critical parameters
Parameters | Discussion
receiver window size (rwnd), sender window size (swnd), congestion window (cwnd) | cwnd controls the number of packets a sender can send without needing an acknowledgment. TCP cwnd starts with 10 segments (slow start) and increases exponentially until it reaches the receiver advertised window size (rwnd). Thus TCP cwnd will continue to grow if rwnd and swnd are set to a large value. However, setting rwnd and swnd too large may result in packet loss due to congestion, which may cut cwnd in half or reset it to the TCP slow start value, resulting in lower throughput. Proportional Rate Reduction (PRR) and Early Retransmit (ER) features (kernel 3.2) help recover from packet losses quickly by retransmitting early and pacing out retransmissions across received ACKs during TCP fast recovery
Bandwidth-delay product (BDP) | rwnd and swnd should be set larger than the BDP. Otherwise, TCP throughput will be limited. BDP = Link Bandwidth * RTT. Example: 1000 Mbit/s * 0.001 sec / 8 = 125 KB
Socket buffer size: tcp_wmem, tcp_rmem, rmem_max, wmem_max | Limits the amount of data an application can send to or receive from the network stack. To improve application throughput, the socket buffer size should be set large enough to utilize the TCP window fully
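The bandwidth-delay product is simple enough to compute directly when sizing socket buffers. The sketch below uses integer arithmetic with decimal units, so the 1 Gbit/s link with 1 ms RTT comes to 125,000 bytes. The function name and the second scenario (10 Gbit/s, 80 ms) are illustrative assumptions.

```python
def bdp_bytes(bandwidth_bits_per_sec, rtt_microseconds):
    """Bandwidth-delay product in bytes: bandwidth x RTT / 8."""
    return bandwidth_bits_per_sec * rtt_microseconds // (8 * 1_000_000)


max_socket_buffer = 16_777_216  # rmem_max/wmem_max value used in this deck

lan_bdp = bdp_bytes(1_000_000_000, 1_000)      # 1 Gbit/s, 1 ms RTT
wan_bdp = bdp_bytes(10_000_000_000, 80_000)    # hypothetical 10 Gbit/s, 80 ms RTT

print(lan_bdp, max_socket_buffer >= lan_bdp)   # 125000 True
print(wan_bdp, max_socket_buffer >= wan_bdp)   # 100000000 False
```

The second case shows why the buffer limits are quoted as "or higher": a long, fast path can need far more than 16 MB in flight.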
60. Network Stack Tunables: Higher Throughput
Tunable | Value | Discussion
tcp_slow_start_after_idle (default: 1 means enable) | 0 (disable) | Prevents the TCP slow start value (10 segments) from being used as the new window for connections sitting idle for 3 seconds. Better throughput due to continued use of the receiver advertised window instead of slow start.
tcp_fin_timeout (default: 60 sec) | 10 sec | This tunable limits the number of connections in TCP TIME_WAIT state to avoid running out of available ports. Recommended for sites with a high socket churn rate and server applications initiating connection close. TIME_WAIT timeout = 2 * tcp_fin_timeout
tcp_early_retrans (default=0 means disable) | 1 (enable) | Allows fast retransmit to trigger after 2 duplicate ACKs (instead of 3 or more) for the same segment are received. Allows the connection to recover quickly from packet loss or network congestion. http://research.google.com/pubs/pub37486.html
netdev_max_backlog (default: 1000 packets) | 5000 | Packets received by the NIC driver are queued into a per-CPU input queue for network stack processing. Packets will be dropped if the input queue is full, causing TCP retransmits
61. Network Stack Tunables: Higher Throughput
Tunable | Value | Discussion
txqueuelen (default: 1000) | 5000 | Controls the amount of data that can be queued by the network stack for NIC driver processing. For latency-sensitive applications, consider reducing the value (less buffering) so that TCP congestion avoidance kicks in early in case of packet loss.
rmem_max, wmem_max | 16777216 or higher | Maximum receive and send socket buffer size for all protocols. Set the same as tcp_wmem and tcp_rmem. It sets the maximum TCP receive window size. The larger the receive buffer, the more data can be sent before requiring acknowledgement. Caution: larger buffers may cause memory pressure
tcp_wmem, tcp_rmem | 8388608, 1258291, 16777216 or higher | Control socket receive and send buffer size. Triplet: Min: minimum socket buffer size during memory pressure (default: 4096). Default: socket buffer size (receive buffer: 87380 | send buffer: 16384). Max: maximum socket buffer size (auto-tuned)
Editor's Notes
First tackle your primary application bottlenecks before system tuning, as the benefits of that optimization are much greater. In the RDBMS world, a common example is attempting to tune the OS to overcome inefficiencies such as poor index support for heavy queries. Tuning the OS should be a priority when the system is throttling the performance of the application.
/proc/net/softnet_stat output format: per-CPU stats. If there are 8 CPUs, you should see 8 lines of stats. Each line contains:
Column 1: Total number of packets queued by the NIC driver in the per-CPU input queue, not including packets that were processed using netpoll (NAPI)
Column 2: Number of packets that were dropped because netdev_max_backlog was exceeded
Column 3: Number of times ksoftirqd ran out of netdev_budget or time slice with work remaining
Column 4: Number of times two CPUs contended for the device queue lock
..
Last Column: CPUs participating in network stack processing (softirq). This should show non-zero values when RFS/RPS is enabled
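The per-CPU lines in this file are hexadecimal, which makes them awkward to read by eye. The sketch below decodes the first three columns described above from sample text. The function name, the field labels and the sample values are illustrative assumptions, and treating the line index as the CPU number assumes one line per CPU in order.

```python
def parse_softnet_stat(text):
    """Decode the first three hex columns of each softnet_stat line."""
    rows = []
    for cpu, line in enumerate(text.splitlines()):
        cols = [int(field, 16) for field in line.split()]
        if len(cols) < 3:
            continue  # skip blank or truncated lines
        rows.append({
            "cpu": cpu,               # assumption: line order matches CPU order
            "processed": cols[0],     # packets queued to the input queue
            "dropped": cols[1],       # drops due to netdev_max_backlog
            "time_squeeze": cols[2],  # budget or time slice exhausted
        })
    return rows


# Illustrative two-CPU sample, not captured from a real host
sample = (
    "0000a3f1 00000000 00000002 00000000\n"
    "00001b2c 00000005 00000000 00000000\n"
)
for row in parse_softnet_stat(sample):
    print(row)
```

A non-zero `dropped` count on any CPU is the signal that netdev_max_backlog is worth raising.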
Hardware events are not available when running under Xen. To capture hardware events, perf needs access to PMU part of the CPU.
/usr/bin/pidstat -C <APP-NAME> -w -t -T ALL 5
fincore is available as fcoretools package. There is also fincore() system call that can be used in a C program to capture similar information.
Sample script to capture page cache hit and miss rates using ioblock.request, vfs.read kernel probes:
====
#!/usr/bin/env stap
global total_bytes, disk_bytes, counter

# Bytes returned by VFS reads on real devices
probe vfs.read.return {
  if (bytes_read > 0 && devname != "N/A") {
    total_bytes += bytes_read
  }
}

# Bytes requested from block devices (counted as disk reads)
probe ioblock.request {
  if (rw == 0 && size > 0 && devname != "N/A") {
    disk_bytes += size
  }
}

# Print VFS hits and misses every 5 seconds, plus the hit rate in %
probe timer.s(5) {
  if (counter % 15 == 0) {
    printf("\n%18s %18s %10s %10s\n", "Cache Reads (KB)", "Disk Reads (KB)", "Miss Rate", "Hit Rate")
  }
  cache_bytes = total_bytes - disk_bytes
  if (cache_bytes < 0)
    cache_bytes = 0
  counter++
  all_bytes = cache_bytes + disk_bytes
  # Skip the division when no reads were seen in this interval
  if (all_bytes > 0) {
    hitrate = 10000 * cache_bytes / all_bytes
    missrate = 10000 * disk_bytes / all_bytes
  }
  printf("%18d %18d %6d.%02d%% %6d.%02d%%\n",
         cache_bytes/1024, disk_bytes/1024,
         missrate/100, missrate%100, hitrate/100, hitrate%100)
  total_bytes = 0
  disk_bytes = 0
}
Default for nr_requests is 128
Check with AWS on PV vs HVM support going forward
CFS was introduced in 2.6.23; the prior default scheduler was O(1)
If the table entry in the flow lookup table does not contain a valid CPU, then packets are steered using RPS only.
TCP has two primary ways of recovering from losses:
1- Fast Retransmit: TCP retransmits the missing segment after receiving a certain number of duplicate ACKs (dupacks). Fast Recovery is implemented as:
RFC 3517: During fast recovery the sender sets cwnd and ssthresh to half of the data outstanding in the network. It fast retransmits the first unacknowledged segment and transmits further segments if allowed by cwnd. The RFC's goal is to recover TCP's self-clock by relying on returning dupacks during recovery to clock more data into the network. Depending on the losses, the RFC approach may be either too conservative or too aggressive.
LINUX RATE HALVING: Linux uses rate halving in recovery. When cwnd is reduced, Linux sends data in response to alternate ACKs during recovery. The Linux approach is conservative during the fast recovery phase. First, it cuts cwnd in half even for a single loss event. Also, in the presence of heavy losses, Linux transmits at most one packet per ACK during the rest of the recovery period. As a result, recovery is prolonged or the connection enters an RTO. Rate halving also assumes every ACK represents one data packet delivered, so lost ACKs will cause Linux to retransmit less than half the congestion window.
PRR: Paces out retransmissions across received ACKs during TCP fast recovery.
Fast recovery ends when all data that was outstanding before entering recovery is cumulatively acknowledged or when the timeout occurs.
2- When the sender does not receive enough duplicate ACKs, TCP uses a slower method, where it waits for the duration of the retransmission timeout (RTO) before counting a segment as lost.
Early Retransmit (ER) lowers the duplicate acknowledgement threshold for short transfers. The combination of PRR and ER reduces the TCP latency of connections experiencing losses by 3-10%, depending on the response size. The same work also recommends delaying early retransmission for a short while, which is effective in mitigating spurious retransmissions.