Steps to identify & understand NetApp
ONTAP FILER’s Latency related issues
The following snapshot was captured on one node of a FAS8020 two-node cluster.
[FAS8020 | Per node: 6 cores, 24GB RAM & 4GB separate NVRAM | Running: ONTAP 9.1P13]
::> set diag
Warning: These diagnostic commands are for use by NetApp personnel only.
Do you want to continue? {y|n}: y
::*> node run -node node-01 sysstat -M 1
WAFL_ex (marked 1) represents parallel WAFL processing, while Kahuna (marked 2) represents
serial WAFL processing. These two logical domains are mutually exclusive: either Kahuna
can be active on one CPU, or WAFL_ex can be active on one or more CPUs, but Kahuna and WAFL_ex
cannot be active at the same time.
A physical CPU core cannot become a bottleneck unless either a domain bottleneck or an
average CPU bottleneck is reached. In the above ‘snapshot’, neither an individual domain nor the
cumulative WAFL figure (93%) hits the 100% mark.
In that example, average CPU utilization is 97% across the 6 cores. The busiest domains are WAFL
exempt (1) at approximately 200%, networking exempt (3) at approximately 230%, RAID exempt (4) at
approximately 40%, and exempt {general parallel processing} (5) at approximately 40%.
In our example, WAFL was active for 93% of the sample x 10 interval, with 2% spent in serial
processing and 91% in parallel processing. Because WAFL serial processing is quite low, it is likely
that more work could still be completed by parallelized WAFL, so being 93% active is perfectly normal.
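The arithmetic above can be sketched in a few lines of Python. This is only an illustration, not a NetApp tool: the function names and the 95% "busy" threshold are assumptions made for the example, and the input figures are the ones quoted in the text (2% Kahuna, 91% WAFL_ex, 97% average CPU).

```python
from typing import List


def wafl_activity(kahuna_pct: float, wafl_ex_pct: float) -> float:
    """Kahuna and WAFL_ex never run at the same time, so their shares add up."""
    return kahuna_pct + wafl_ex_pct


def cpu_findings(avg_cpu_pct: float, kahuna_pct: float, wafl_ex_pct: float,
                 busy_threshold_pct: float = 95.0) -> List[str]:
    """Return plain-text findings; the 95% threshold is an assumed example value."""
    findings = []
    # Kahuna is serial, so it can use at most one CPU (100%).
    if kahuna_pct >= busy_threshold_pct:
        findings.append("Kahuna (serial WAFL) is close to its one-CPU limit")
    # Combined WAFL activity cannot exceed 100% of the interval.
    if wafl_activity(kahuna_pct, wafl_ex_pct) >= busy_threshold_pct:
        findings.append("WAFL is active for nearly the whole interval")
    # Average utilization across all cores.
    if avg_cpu_pct >= busy_threshold_pct:
        findings.append("average CPU is near 100%; work may begin to queue")
    return findings


if __name__ == "__main__":
    print(wafl_activity(2, 91))
    for finding in cpu_findings(avg_cpu_pct=97, kahuna_pct=2, wafl_ex_pct=91):
        print(finding)
```

With the figures from the text, only the average-CPU check fires, which matches the reasoning in the surrounding paragraphs: the WAFL domains have headroom, but the node as a whole is close to saturation.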
However, if you look at the overall average CPU, it is almost reaching 100%, which means work
will begin to queue for CPU; in other words, latency will start to increase. This represents the
so-called ‘NODE PERFORMANCE (cpu)’ capability. However, this is not yet the root cause.
NetApp highly recommends looking at service-level latency (LUN/volume/protocol) instead of
focusing only on the CPU bottleneck of the physical cores available to the FILER. No doubt, if
average CPU utilization is consistently around 100%, it will inevitably affect everything that goes
in and out of the FILER.
However, it is important to break down latency-related issues and complaints at the service level
first. Identify which LUN or volume is affected, then check the physical container (aggregate
utilization percentage). Ideally, aggregate utilization should never cross 80% (this holds only for
spinning disks); for SSD it does not matter (except perhaps in specific cases), and for Flash Pool it
depends on how much I/O is serviced from the cache.
Because WAFL does optimized striping, it needs free space to write those chained blocks. A lack of
free space can delay CPs (consistency points), which eventually leads to fragmentation and affects
reads. (This is a topic in its own right and needs a separate explanation; let’s stick to our
current topic.)
In general, if you can show that the NetApp filer (7-mode/ONTAP) reports none of the above
bottlenecks, then the issue lies on either the ‘NETWORK/PIPE’ side or the ‘SERVER/VMware’ side.
Game changer: compared to 7-mode, ONTAP makes this job much easier. With QoS-level
commands, one can quickly identify which components are the bottlenecks.
Continuing with the case we are discussing, let’s use the QoS-level command to identify the actual
culprits. According to NetApp, the Data column (CPU) should ideally not go above 2ms to be considered
normal and healthy. In this example it is sub-millisecond; hence, CPU is not the root cause here.
The following ‘QoS’ command is available from the command line on all ONTAP filers. It is
particularly useful because it breaks the storage stack down into the sub-components that make up
the FILER, and therefore enables the administrator to pinpoint storage-related issues.
As a storage industry rule of thumb, I would not be bothered by ‘DISK’ latency under 5ms for
sensitive applications, or under 10ms for everything else. However, since the objective of this KB
is to demonstrate the logical steps for troubleshooting latency-related issues, we are tasked with
finding the source of the ‘3ms’ latency showing up in the ‘DISK’ column in the previous snapshot.
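As a rough illustration, the guideline values quoted so far (2ms for the Data/CPU column, 5ms or 10ms for the DISK column) can be applied to a per-component latency breakdown as follows. This is a sketch under stated assumptions: the dictionary keys and function names are hypothetical, and the sample values other than the 3ms disk figure are made up to stand in for the "sub-millisecond" readings mentioned above.

```python
from typing import Dict, List

# Guideline values taken from the text above.
DATA_CPU_LIMIT_MS = 2.0        # Data (CPU) column should not go above 2ms
DISK_LIMIT_SENSITIVE_MS = 5.0  # disk latency under 5ms is fine for sensitive apps
DISK_LIMIT_GENERAL_MS = 10.0   # disk latency under 10ms is fine for the rest


def over_guideline(latency_ms: Dict[str, float], sensitive: bool = True) -> List[str]:
    """Return the components whose latency is outside the guideline values."""
    disk_limit = DISK_LIMIT_SENSITIVE_MS if sensitive else DISK_LIMIT_GENERAL_MS
    flagged = []
    if latency_ms.get("data", 0.0) > DATA_CPU_LIMIT_MS:
        flagged.append("data")
    if latency_ms.get("disk", 0.0) >= disk_limit:
        flagged.append("disk")
    return flagged


def largest_contributor(latency_ms: Dict[str, float]) -> str:
    """Return the component that adds the most latency, healthy or not."""
    return max(latency_ms, key=latency_ms.get)


if __name__ == "__main__":
    # Hypothetical breakdown: only the 3ms disk value comes from the text.
    sample = {"network": 0.1, "data": 0.4, "disk": 3.0}
    print(over_guideline(sample))
    print(largest_contributor(sample))
```

Nothing is flagged for this sample, yet the disk component is still the largest contributor, which is why the walkthrough continues with the aggregates.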
Next, I would look at the aggregate(s) on the node to determine what is causing this so-called
‘3ms’ disk latency.
node-01> df -Ah
Aggregate        total     used    avail  capacity
aggr_01_t1        61TB     54TB   6951GB       89%
Please note: this aggregate is a Flash Pool (SAS & SSD), and hence even at 89%
it has not caused any significant latency. Ideally it should be kept below 85%,
considering the number of SAS disks in the RAID group. However, in our example, I
would become concerned only if it gets too close to 95%, because that is when writes
and CPs will be affected.
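For repeated checks, the df -Ah row shown above can be parsed and compared against the 85% and 95% figures from the note. This is a minimal sketch, not a supported tool: the function names and status labels are hypothetical, the binary unit conversion (1TB = 1024GB) is an assumption, and the thresholds are simply the values mentioned for this particular Flash Pool aggregate.

```python
import re
from typing import Dict

# Assumed binary units; only the ratios matter for the utilization check.
_UNIT_TO_GB = {"KB": 1 / 1024**2, "MB": 1 / 1024, "GB": 1.0, "TB": 1024.0}


def to_gb(size: str) -> float:
    """Convert a size such as '61TB' or '6951GB' to gigabytes."""
    match = re.fullmatch(r"(\d+(?:\.\d+)?)(KB|MB|GB|TB)", size)
    if match is None:
        raise ValueError(f"unrecognised size: {size!r}")
    return float(match.group(1)) * _UNIT_TO_GB[match.group(2)]


def parse_df_row(row: str) -> Dict[str, object]:
    """Split one data row of 'df -Ah' output into named fields."""
    name, total, used, avail, capacity = row.split()
    return {
        "aggregate": name,
        "total_gb": to_gb(total),
        "used_gb": to_gb(used),
        "avail_gb": to_gb(avail),
        "capacity_pct": int(capacity.rstrip("%")),
    }


def assess(capacity_pct: float, ideal_pct: float = 85.0,
           concern_pct: float = 95.0) -> str:
    """Classify utilization using the 85% / 95% figures from the note above."""
    if capacity_pct >= concern_pct:
        return "concern"
    if capacity_pct >= ideal_pct:
        return "above ideal"
    return "ok"


if __name__ == "__main__":
    row = parse_df_row("aggr_01_t1 61TB 54TB 6951GB 89%")
    print(row["aggregate"], row["capacity_pct"], assess(row["capacity_pct"]))
```

For other aggregate types the defaults would need to change; for example, the text earlier suggests 80% as the ceiling for spinning disks.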
ashwinwriter@gmail.com
June, 2019
