Steps to identify ONTAP latency related issues
1. Steps to identify & understand NetApp ONTAP FILER’s latency related issues
The following snapshot captures one node of a FAS8020 two-node cluster.
[FAS8020 | 6 cores, 24 GB RAM & 4 GB separate NVRAM per node | Running: ONTAP 9.1P13]
::> set diag
Warning: These diagnostic commands are for use by NetApp personnel only.
Do you want to continue? {y|n}: y
::*> node run -node node-01 sysstat -M 1
WAFL_Ex (marked 1) represents parallel WAFL processing, while Kahuna (marked 2) represents serial WAFL processing. These two logical domains are mutually exclusive: either Kahuna can be active on 1 CPU, or WAFL_Ex can be active on 1+ CPUs, but Kahuna and WAFL_Ex cannot both be active at the same time.
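For a per-CPU (rather than per-domain) view of the same node, the lowercase -m variant of sysstat can also be sampled; a minimal sketch, useful to confirm that no single core is pegged while the others sit idle (output columns vary slightly by release):
::*> node run -node node-01 sysstat -m 1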
A bottleneck on a physical CPU core is not possible without first hitting either a domain bottleneck or an average CPU bottleneck. In the above snapshot, neither any individual domain nor the cumulative value (93%) hits the 100% mark.
In that example, average CPU utilization is 97% across the 6 cores. The busiest domains are WAFL exempt (1) at approximately 200%, networking exempt (3) at approximately 230%, RAID exempt (4) at approximately 40%, and exempt {general parallel processing} (5) at approximately 40%.
In our example, WAFL was active for 93% of the sampled interval, with 2% spent in serial processing and 91% in parallel processing. Because WAFL serial processing is quite low, more work could likely still be completed by parallelized WAFL, so being 93% active is absolutely normal.
2. However, if you look at the average overall CPU, it is almost at 100%, and that can translate into work beginning to queue for CPU; in other words, latency will start to increase. This represents the so-called ‘NODE PERFORMANCE (cpu)’ capability. However, this is not yet the root cause.
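To check whether the node really is sustained near 100% rather than just spiking within a one-second sample, the cluster-shell periodic view can be left running for a minute or so; a minimal sketch (the default output includes CPU and latency summary columns, which vary slightly by ONTAP release):
::*> statistics show-periodic -node node-01 -interval 5 -iterations 12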
NetApp highly recommends looking at the service-level latency (LUN/volume/protocol) instead of just focusing on the CPU bottleneck of the physical cores available to the FILER. No doubt, if the average CPU across the cores is consistently clocking around 100%, it will inevitably affect everything that goes IN and OUT of the FILER.
However, it is important to break down the latency-related issues/complaints at the service level first. Identify which LUN or volume is affected; then check the physical container (aggregate fill percentage). Ideally, aggregate utilization should never cross 80% (this is only true for spinning disks); for SSD it doesn’t matter (except maybe in specific cases), and for Flash Pool it depends on how much I/O is serviced from the cache.
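As a rough sketch of that first pass (exact parameter and counter names vary between ONTAP releases, so treat these as illustrative starting points), the busiest volumes and the aggregate fill level can be listed from the cluster shell:
::*> statistics volume show -interval 5 -iterations 1 -max 10 -sort-key total_ops
::*> storage aggregate show -fields percent-used, availsize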
Because WAFL does optimized striping, it needs free space to write those chained blocks; a lack of free space can result in delayed CPs, which eventually leads to fragmentation, and reads get affected. (That is another topic in its own right and needs a separate explanation; anyway, let’s stick to our current topic.)
In general, if you can show and prove that the NetApp FILER does not report any of the above bottlenecks (7-Mode/ONTAP), then the issue lies on either the ‘NETWORK/PIPE’ or the ‘SERVER/VMware’ side.
Game changer: Compared to 7-Mode, ONTAP makes this job super easy. With QoS-level commands, one can quickly identify which component is the bottleneck.
Following on from the case we are discussing, let’s explore the QoS-level command to identify the actual culprits. According to NetApp, the ‘Data’ column (CPU) should ideally not go above 2 ms to be considered normal and healthy. In this example it is in sub-milliseconds; hence, CPU is not the root cause here.
3. The following ‘QoS’ command is available from the command line on all ONTAP filers. This one is cool, because it breaks latency down into the sub-component layers that make up the FILER, and therefore enables the administrator to pin-point issues related to storage.
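A minimal sketch of that per-component view, assuming the standard qos statistics output (Workload, Latency, Network, Cluster, Data, Disk, QoS and NVRAM columns as of ONTAP 9.x), either cluster-wide or per volume:
::*> qos statistics latency show -iterations 5
::*> qos statistics volume latency show -iterations 5
Here the ‘Latency’ column is the total as seen by the workload, ‘Data’ is the CPU/WAFL data-processing share discussed above, and ‘Disk’ is where the 3 ms in this example shows up.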
As per storage industry standards, for ‘DISK’ latency I wouldn’t be bothered by anything under 5 ms for sensitive applications, or anything under 10 ms for the rest. However, as the objective of this KB is to logically demonstrate the steps to troubleshoot latency-related issues, we are tasked with finding the source of the ‘3 ms’ latency showing up in the ‘DISK’ column in the previous snapshot.
Next, I would look at the aggregate(s) on the node to determine what is causing this so-called ‘3 ms’ disk latency.
node-01> df -Ah
Aggregate      total    used     avail    capacity
aggr_01_t1     61TB     54TB     6951GB   89%
Please note: this aggregate is a Flash Pool (SAS & SSD), and hence even at 89% it has not caused any significant latency, but ideally it should be kept below 85% considering the number of SAS disks in the RAID group. However, in our example, I would become concerned if it gets too close to 95%, because that is when writes and CPs will be affected.
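To confirm whether that disk latency actually comes from busy SAS spindles in the Flash Pool, per-disk utilization can be sampled from the nodeshell; a minimal sketch using statit, which needs a begin/end pair and reports per-disk busy time in its ut% column:
node-01> statit -b
(let the workload run for 30-60 seconds)
node-01> statit -e
Alternatively, ‘sysstat -x 1’ reports the utilization of the busiest disk in its ‘Disk util’ column.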
ashwinwriter@gmail.com
June, 2019