1. Components of Latency in an IT Infrastructure
Note: Applicable to on-premises (ground) components only, no cloud here …
A simple rule of thumb:
If a particular system or application is running slower than you expect, or slower than it has historically, it might be a performance issue.
However, if a particular system or application is not working at all, it is likely not a performance-related issue.
Latency (response time in milliseconds) is the time it takes for a storage controller [volume/LUN] to respond to I/O requests from client applications.
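To make the definition concrete, here is a minimal Python sketch (my own illustration, not an ONTAP tool) that measures the client-observed latency of a single small write I/O in milliseconds; on a real host you would point the path at a file on the volume/LUN/NFS mount under investigation:

```python
import os
import tempfile
import time

# Minimal sketch: time one small write as seen by the client application.
# This captures the whole end-to-end path (host + network + storage).
def write_latency_ms(path: str, payload: bytes = b"x" * 4096) -> float:
    start = time.perf_counter()
    fd = os.open(path, os.O_WRONLY | os.O_CREAT, 0o600)
    try:
        os.write(fd, payload)   # issue the write
        os.fsync(fd)            # force it to the storage, not just the page cache
    finally:
        os.close(fd)
    return (time.perf_counter() - start) * 1000.0

with tempfile.TemporaryDirectory() as d:
    ms = write_latency_ms(os.path.join(d, "probe.dat"))
    print(f"observed write latency: {ms:.2f} ms")
```

Note that a probe like this only tells you the total; attributing that total to host, network or storage is exactly what the rest of this article is about.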
Cause: The high latency could originate on the storage itself, or on one or more components end-to-end, i.e. anywhere along the path Host <> Network <> Storage, such as:
1) Application, such as SQL/Exchange/Oracle/SAP etc. [incorrect configuration/settings]
2) a) Physical host [incorrect settings and/or outdated firmware/driver on the NIC or HBA, and/or insufficient CPU/memory]
   b) Virtual machine [insufficient CPU/memory/NIC capabilities]
   c) ESX host [contention for resources such as disk/CPU/memory/NIC 1G/10G]
3) Network components that attach the host to storage [cables/switches/routers etc.] [settings: flow control]
4) Storage components [hardware: CPU/memory/disk [volume - CIFS/NFS datastore]/LUN/internal components of the storage controller] [software: filesystem/software/OS bug]
2. As a storage administrator, it is my responsibility to identify any bottlenecks originating from the storage end. Only once I have collected the stats, and only when those stats suggest no bottleneck on the storage side, can I confidently pass the ticket to the Application/Host Infrastructure team to do the analysis on their part.
Similarly, once they have the stats to prove that the host/application is not a bottleneck, then it has to be something in between, most likely the network, and the Network team can then do the analysis on their part to identify any latency issues. Honestly, it has to be one of those 3 areas, or all of them collectively, that is responsible for application latency.
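That hand-off workflow can be sketched as a toy routing rule in Python, assuming you have already attributed a latency estimate to each segment (the segment names, team names and numbers here are made up for illustration):

```python
# Illustrative sketch of the triage logic above: whichever segment
# contributes the most latency owns the ticket next. All values invented.
SEGMENT_TEAM = {
    "host": "Application/Host Infrastructure team",
    "network": "Network team",
    "storage": "Storage team",
}

def route_ticket(segment_latency_ms: dict[str, float]) -> str:
    # Hand the ticket to the team owning the largest latency contributor.
    worst = max(segment_latency_ms, key=segment_latency_ms.get)
    return SEGMENT_TEAM[worst]

print(route_ticket({"host": 0.4, "network": 0.7, "storage": 5.2}))
# -> Storage team
```

In practice the attribution itself is the hard part; the storage-side half of it is what the commands below are for.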
In general: if all the applications are experiencing slowness irrespective of the storage tiering [DISK/FLASH/SSD], then it is worth checking with the network team to find out if there is any kind of disruption or issue reported before starting any investigation on the storage side [unless the whole storage is down ;)]
As a NetApp storage administrator for Clustered ONTAP: my job is made easier with the introduction of Clustered Data ONTAP, now simply called 'ONTAP'. I recommend the ONTAP 9.x major release versions here, as they bring major improvements over the previous 8.3.x versions.
The 3 most important commands to cover your back when latency-related tickets are passed to the storage team are:
::> qos statistics volume performance show
::> qos statistics volume characteristics show
::> qos statistics volume latency show
All 3 commands are stepping stones towards identifying the root cause; however, 'volume latency' is possibly the most useful, as it breaks down the latency contribution of the individual clustered Data ONTAP components, making your job convenient and easier compared to 7-Mode filers.
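To illustrate what that breakdown buys you, here is a small Python sketch that picks out the dominant internal component from a per-workload latency table. The column layout and values below are an assumption invented for this example, not the exact output of 'qos statistics volume latency show'; check the header row on your ONTAP version before relying on field positions.

```python
# Hypothetical per-workload latency breakdown (invented layout and numbers,
# loosely modelled on a per-component latency report).
sample = """\
Workload   Latency  Network  Cluster  Data    Disk
vol_db     5.10ms   0.40ms   0.10ms   0.60ms  4.00ms
vol_web    0.90ms   0.50ms   0.05ms   0.25ms  0.10ms
"""

def dominant_component(row: str) -> tuple[str, str]:
    # Return (workload, name of the component contributing the most latency).
    fields = row.split()
    workload = fields[0]
    components = ["Network", "Cluster", "Data", "Disk"]
    values = [float(v.rstrip("ms")) for v in fields[2:6]]
    return workload, components[values.index(max(values))]

for line in sample.splitlines()[1:]:
    print(dominant_component(line))
# vol_db is dominated by Disk, vol_web by Network
```

The point of the breakdown is exactly this: disk-bound latency and network-bound latency call for completely different fixes, and the per-component view tells them apart at a glance.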
Please note: those 3 commands give you real-time stats; however, if you are asked to investigate an issue that occurred in the past, then you must use the OCUM [OPM] Performance Manager tool to view the historical stats (data). Honestly, OCUM/OPM is sufficient to view, collect and identify performance-related issues on NetApp storage, and it's free.
However, there is another heavy-duty tool called 'OCI', which is a licensed product and can do the same thing across heterogeneous storage & components; if you don't have the budget for it, you can live without it.
3. In the next article, I will demonstrate how to use the 3 commands along with the OCUM/OPM tool, show you how to interpret the latency across the different ONTAP components and its relation to each other, and explain how the application I/O size can play a critical role with respect to your networking environment, such as 1500 or 9000 MTU, and how to mitigate it from the storage side.
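As a taste of the MTU point, here is a back-of-the-envelope sketch of how many Ethernet frames a single large application I/O turns into at each MTU. It assumes 40 bytes of IPv4 + TCP header per frame; the figures are illustrative only.

```python
import math

# A single large I/O is split into many Ethernet frames; jumbo frames
# (MTU 9000) cut the per-I/O frame count, and hence the per-frame
# processing overhead, by roughly 6x. Assumes 40 B of IPv4 + TCP headers.
def frames_per_io(io_size_bytes: int, mtu: int, ip_tcp_overhead: int = 40) -> int:
    payload_per_frame = mtu - ip_tcp_overhead
    return math.ceil(io_size_bytes / payload_per_frame)

io = 64 * 1024  # a 64 KiB application I/O
print(frames_per_io(io, 1500))  # 45 frames at standard MTU
print(frames_per_io(io, 9000))  # 8 frames with jumbo frames
```

This is why large-I/O workloads are far more sensitive to an MTU mismatch along the path than small-I/O ones.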
ashwinwriter@gmail.com
April, 2019