SlideShare uma empresa Scribd logo
1 de 35
Baixar para ler offline
Avoiding Chaos:  Methodology for Managing Performance  in a Shared Storage Area Network Environment Brett Allison July 25-29, 2005 New Orleans, LA P10
Trademarks ,[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object]
Table of Contents ,[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object]
Chaos and How to Avoid It? ,[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object]
What Are the Major Benefits of SAN and Shared Storage? Perf Availability Reduce Cost
What is a SAN? ISL’s Core Switch - A Core Switch - B Servers Edge Switch - A Edge Switch - B Storage Servers What can be measured? Links Links Storage Switch - A Storage Switch - B
What is Shared on the Enterprise Storage Server? Front End Central Back End Legend Rank1 Rank9 D = Data  P = Parity S = Spare Host Adapters ESCON FICON SCSI CPUs Cache NVS Cluster SSA SSA SSA Adapters SSA Raid 5 Ranks D D D S D P D D Eight Pack 1 Eight Pack 2 D D P D S D D D Loop A Loop B Disks
How is Data Shared on the Disks? S Eight Pack 1 Eight Pack 2 Loop A Disks (Rank1) 1 2 3 4 P 5 6 1 2 3 P 4 5 6 1 2 P 3 4 5 6 1 P 2 3 4 5 6 P 1 2 3 4 5 6 1 2 3 4 P 6 5 Legend Volume 1 – Staging Server Test DB Volume 2 – Production DB Volume 3 – TSM Disk Pool Volume 4 – Data Warehouse Load Volume 5 – Production DB Log Files Volume 6 – Production DB Index
What Role Does Performance Management Play in Shared Storage? Performance Management Planning Predictive Reactive Proactive
Assessment and Design Considerations Shared Workload Variance RT Sensitivity Low/Small High/Large Bandwidth Budget Dedicated
A Reactive Methodology – Online Focus Host resource issue? Fix it ID hot Host disks ID hot Host disks Host Storage server SAN config, SAN perf data Storage Srvr perf data Fix it N Y
Identify Host Disks with High I/O Response Time – Example of AIX Server with SDD installed ------------------------------------------------------------------------ Detailed Physical Volume Stats  (512 byte blocks) ------------------------------------------------------------------------ VOLUME: /dev/ hdisk23   description: IBM FC 2105800 reads: 1659 (0 errs) read sizes (blks):  avg  8.0 min  8 max  8 sdev  0.0 read times (msec): avg  30.25 min  13.335 max  36.228 sdev  6.082 read sequences:  1659 read seq. lengths: avg  8.0 min  8 max  8 sdev  0.0 Gather Response Time Data ‘filemon’ (See Appendix C) Gather LUN ->hdisk information (‘lsvp –a’ See Appendix D) Hostname VG  vpath  hdisk  Location LUN SN  S Connection  Size -------- --  -------  --------- -------- -------- - ----------  ---- server1  vg1 vpath96  hdisk23   2Y-08-02  71012345  Y   R1-B4-H1-ZA 8.0  Format the data (Script - See Appendix H) 4 5.88 469 4 30.25 1659 hdisk23 12345 71012345 server1 9:00 12/17/05 AVG WRITE SIZE (KB) WRITE TIMES (ms) # WRITES AVG READ SIZE (KB) READ TIMES (ms) # READS HDISK ESS LUN SERVER NAME TIME DATE
Drilling Down 29.80 99th 15.91 95th 13.72 90th 10.32 85th Read Time (ms) Percentile Does the I/O Response time warrant further investigation? ,[object Object],74742 4.79 458 31.56 2299 rank40 71F12345 12170900 72660 11.22 1734 27.65 1924 rank40 71A12345 12170900 52950 5.88 469 30.25 1659 rank40 71012345 12170900 Total RT (ms) Avg Write RT (ms) Writes Avg Read RT (ms) Reads Array LUN Time Filemon sample time was at 9:00 AM.  What was happening on ESS 12345 and Array rank40 at that time? If yes, then correlate the normalized data with ESS Arrays ,[object Object],[object Object]
Did Contention Exist on the Storage Server for the Time Periods When the Attached Server had Contention? Array rank40 had a large spike in activity causing disk utilization to rise to 68% on average for the period starting at 8:45 AM and ending at 9:00 AM Gather ESS Physical Array Data – Appendix E Spike in Utilization
What Caused the Spike in Disk Utilization on Array rank40? Gather LUN level data – Appendix F Spike in C2D  During the 8:45 – 9:00 AM interval there was a significant spike in Cache 2 Disk Track transfers to LUN 73912345.  The owner of the LUN was server2  and from working with the SA we find that this LUN is TSM storage pool
Fixing the Problem? Identify Hot Array Legend :  ArrayH = Hot Array;  ArrayT = Target Array; IOR = I/O Rate Migrate LUNs to Target Quantify LUN I/O Rate on Array ArrayH:  LUN IOR = (R+W – CH)/Interval Quantify Array I/O Rate Delta ArrayH:IOR - Threshold IOR = Delta IOR Identify Target Array IOR Threshold  < (Delta IOR + ArrayT:IOR)
ESS Analysis Considerations ,[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object]
ESS Analysis Gotchas Variance Time Stamps Expectations Availability of Data Lack of Config. Info. Measure-ability
Getting Proactive/Predictive ,[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object]
 
Appendix A - Best Practices for Performance in a Shared ESS Environment Isolate source and backup volumes on separate disk groups General Utilize at least 4 paths for heavy workloads AIX SDD/HBA Use small LUN size (8-16) for more granular tuning General Isolate/dedicate high bandwidth workloads (Data Warehouse) General Understand AIX – LV Intra Policy of Max and how it effects placement – Spreads LV partitions across all LUNs in VG AIX LV Understand implications of Filesystem striping FS Striping If write activity is heavy (Logs) segregate at array level from other workloads  Database(s) Disk Group/Adapter isolation for Flash copy source and target Flash Copy Avoid placing LUNs on heavily utilized disk groups General Spread I/O evenly across adapters and disk groups General Description Technology
Appendix B:  Resources ,[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object]
Appendix C - Measure End-to-End Host Disk I/O Response Time The iostat package for Linux is only valid with a 2.4 & 2.6 kernel See Appendix B for links to more information Avg. Disk sec/Read Physical Disk perfmon NT/Wintel svctm (ms) iostat –d 2 5 *iostat Linux iostat –xcn 2 5 sar –d  filemon -o /tmp/filemon.log -O all  Command/Object iostat sar filemon Native Tool svc_t (ms) Solaris avserv (ms) HP-UX read time (ms) write time (ms) AIX Metric(s) OS
Appendix D:  Getting LUN Serial Numbers for ESS Devices Note :  ESS Utilities for AIX/HP-UX/Solaris are available at:  http://www-1.ibm.com/servers/storage/support/disk/2105/downloading.html Host config. -  http://www.redbooks.ibm.com/abstracts/tips0553.html Device Name LUN SN lsvpcfg SDD Linux SDD ESS Util Tool Device Name Serial Datapath query device Wintel VG, hostname, Connection, hdisk LUN SN lsvp –a AIX, HP-UX, Solaris Other Metrics Key Command OS
Appendix E:  DB2 Query for Array Performance Data Note :  This information is relevant only if you have the TotalStorage Expert installed and access to the DB2 command line on the TSE server. SELECT DISTINCT A.*, B.M_CARD_NUM, B.M_LOOP_ID, B.M_GRP_NUM FROM DB2ADMIN.VPCRK A, DB2ADMIN.VPCFG B WHERE ( ( A.PC_DATE_B >= '%STARTDATE' AND A.PC_DATE_E <= '%ENDDATE' AND A.PC_TIME_B >= '%STARTTIME' AND A.PC_TIME_E <= '%ENDTIME' AND A.M_MACH_SN = '%ESSID' AND A.M_MACH_SN = B.M_MACH_SN AND A.M_ARRAY_ID = B.M_ARRAY_ID AND A.P_TASK = B.P_TASK ) ) ORDER BY A.M_ARRAY_ID, A.PC_DATE_B, A.PC_DATE_E with ur;
Appendix F:  DB2 Query for LUN Performance Data Note :  This query requires sql access to the TotalStorage Expert for ESS SELECT DISTINCT  A.M_VOL_ADDR, B.* FROM  VPVOL A, VPCCH B WHERE ( A.M_MACH_SN = '%ESSID' AND A.M_MACH_SN = B.M_MACH_SN AND A.M_LSS_LA = B.M_LSS_LA AND  A.M_VOL_NUM = B.M_VOL_NUM AND B.PC_DATE_B >= '%STARTDATE' AND  B.PC_DATE_E <= '%ENDDATE' AND  B.PC_TIME_B >= '%STARTTIME' AND  B.PC_TIME_E <= '%ENDTIME' ) ;
Appendix G:  Reactive Methodology High Level Workflow ,[object Object],[object Object],[object Object],[object Object],[object Object],[object Object]
Appendix H:  Format ‘lsvp –a’ and ‘filemon’ (Logic) ,[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object]
Appendix I – Sample Wintel Datapath Query Output DEV#:  0  DEVICE NAME: Disk0 Part0  TYPE: 2105F20  POLICY: RESERVE SERIAL: 02612345 ============================================================================ Path#  Adapter/Hard Disk  State  Mode  Select  Errors 0  Scsi Port5 Bus0/Disk0 Part0  OPEN  NORMAL  3212602  1 1  Scsi Port5 Bus0/Disk0 Part0  OPEN  NORMAL  865  1 Note:  The SERIAL number indicates the LUN information.  The first 3 digits are the LUN number and the last 5 are the ESS serial number.
Appendix J:  Array Level Information – VPCRK -  Gotchas ,[object Object],[object Object],[object Object],[object Object]
Appendix K:  ESS Components No statistics - CPU Calculated Response Time, Disk  Disk Drive Logical statistics (Cache/Tracks/etc) LUN Level   KB Read/sec, KB Written/sec, I/O Rates, Sequential PCT, Read PCT Arrays   No TSE statistics.  It is possible to roll up from Array level or use CLI to get stats SSA Adapters Backend  No statistics - I/O planar Throughput/RT Available via CLI but not feasible for continuous measurement FC HBA Adapter Front-End Percent of delays caused by limitations in NVS NVS PCT cache hits/Cache Hold Time Cache  Cluster Level Metrics Sub-component Component
Appendix L:  A Process for New LUN Allocations with Performance Input Allocation  Request Identify healthy target arrays Identify arrays  with free space Assign LUNs on  target arrays
Appendix M:  ESS Array HealthCheck and Drill Down
Appendix N:  Glossary ,[object Object],[object Object],[object Object]
Biography Brett Allison has been doing distributed systems performance related work since 1997 including J2EE application analysis, UNIX/NT, and Storage technologies.  His current role is performance analyst for the IGS Managed Storage Services offering.  MSS currently manages over 1 Petabyte of data.  He has developed a number of internally used performance analysis tools used by ITS/IGS.  He has spoken at a previous Storage Symposium and is the author of several White Papers on performance

Mais conteúdo relacionado

Mais procurados

AIXpert - AIX Security expert
AIXpert - AIX Security expertAIXpert - AIX Security expert
AIXpert - AIX Security expertdlfrench
 
EMC Dteata domain advanced command troubleshoot
EMC Dteata domain advanced command troubleshootEMC Dteata domain advanced command troubleshoot
EMC Dteata domain advanced command troubleshootsolarisyougood
 
Masters stretched svc-cluster-2012-04-13 v2
Masters stretched svc-cluster-2012-04-13 v2Masters stretched svc-cluster-2012-04-13 v2
Masters stretched svc-cluster-2012-04-13 v2solarisyougood
 
Optimize solution for oracle db technical presentation
Optimize solution for oracle db   technical presentationOptimize solution for oracle db   technical presentation
Optimize solution for oracle db technical presentationxKinAnx
 
Impact of Intel Optane Technology on HPC
Impact of Intel Optane Technology on HPCImpact of Intel Optane Technology on HPC
Impact of Intel Optane Technology on HPCMemVerge
 
Open world exadata_top_10_lessons_learned
Open world exadata_top_10_lessons_learnedOpen world exadata_top_10_lessons_learned
Open world exadata_top_10_lessons_learnedchet justice
 
Spectrum Scale Best Practices by Olaf Weiser
Spectrum Scale Best Practices by Olaf WeiserSpectrum Scale Best Practices by Olaf Weiser
Spectrum Scale Best Practices by Olaf WeiserSandeep Patil
 
Accel - EMC - Data Domain Series
Accel - EMC - Data Domain SeriesAccel - EMC - Data Domain Series
Accel - EMC - Data Domain Seriesaccelfb
 
Ibm spectrum scale fundamentals workshop for americas part 4 Replication, Str...
Ibm spectrum scale fundamentals workshop for americas part 4 Replication, Str...Ibm spectrum scale fundamentals workshop for americas part 4 Replication, Str...
Ibm spectrum scale fundamentals workshop for americas part 4 Replication, Str...xKinAnx
 
Ibm aix technical deep dive workshop advanced administration and problem dete...
Ibm aix technical deep dive workshop advanced administration and problem dete...Ibm aix technical deep dive workshop advanced administration and problem dete...
Ibm aix technical deep dive workshop advanced administration and problem dete...solarisyougood
 
EMC Data domain advanced features and functions
EMC Data domain advanced features and functionsEMC Data domain advanced features and functions
EMC Data domain advanced features and functionssolarisyougood
 
Accelerate and Scale Big Data Analytics with Disaggregated Compute and Storage
Accelerate and Scale Big Data Analytics with Disaggregated Compute and StorageAccelerate and Scale Big Data Analytics with Disaggregated Compute and Storage
Accelerate and Scale Big Data Analytics with Disaggregated Compute and StorageAlluxio, Inc.
 
Ibm spectrum scale fundamentals workshop for americas part 4 spectrum scale_r...
Ibm spectrum scale fundamentals workshop for americas part 4 spectrum scale_r...Ibm spectrum scale fundamentals workshop for americas part 4 spectrum scale_r...
Ibm spectrum scale fundamentals workshop for americas part 4 spectrum scale_r...xKinAnx
 
[db tech showcase Tokyo 2018] #dbts2018 #B17 『オラクル パフォーマンス チューニング - 神話、伝説と解決策』
[db tech showcase Tokyo 2018] #dbts2018 #B17 『オラクル パフォーマンス チューニング - 神話、伝説と解決策』[db tech showcase Tokyo 2018] #dbts2018 #B17 『オラクル パフォーマンス チューニング - 神話、伝説と解決策』
[db tech showcase Tokyo 2018] #dbts2018 #B17 『オラクル パフォーマンス チューニング - 神話、伝説と解決策』Insight Technology, Inc.
 

Mais procurados (20)

CSL_Cochin_c
CSL_Cochin_cCSL_Cochin_c
CSL_Cochin_c
 
AIXpert - AIX Security expert
AIXpert - AIX Security expertAIXpert - AIX Security expert
AIXpert - AIX Security expert
 
EMC Dteata domain advanced command troubleshoot
EMC Dteata domain advanced command troubleshootEMC Dteata domain advanced command troubleshoot
EMC Dteata domain advanced command troubleshoot
 
Masters stretched svc-cluster-2012-04-13 v2
Masters stretched svc-cluster-2012-04-13 v2Masters stretched svc-cluster-2012-04-13 v2
Masters stretched svc-cluster-2012-04-13 v2
 
Optimize solution for oracle db technical presentation
Optimize solution for oracle db   technical presentationOptimize solution for oracle db   technical presentation
Optimize solution for oracle db technical presentation
 
HP 3PAR SSMC 2.1
HP 3PAR SSMC 2.1HP 3PAR SSMC 2.1
HP 3PAR SSMC 2.1
 
Impact of Intel Optane Technology on HPC
Impact of Intel Optane Technology on HPCImpact of Intel Optane Technology on HPC
Impact of Intel Optane Technology on HPC
 
Open world exadata_top_10_lessons_learned
Open world exadata_top_10_lessons_learnedOpen world exadata_top_10_lessons_learned
Open world exadata_top_10_lessons_learned
 
Spectrum Scale Best Practices by Olaf Weiser
Spectrum Scale Best Practices by Olaf WeiserSpectrum Scale Best Practices by Olaf Weiser
Spectrum Scale Best Practices by Olaf Weiser
 
3 Par
3 Par3 Par
3 Par
 
Accel - EMC - Data Domain Series
Accel - EMC - Data Domain SeriesAccel - EMC - Data Domain Series
Accel - EMC - Data Domain Series
 
Ibm spectrum scale fundamentals workshop for americas part 4 Replication, Str...
Ibm spectrum scale fundamentals workshop for americas part 4 Replication, Str...Ibm spectrum scale fundamentals workshop for americas part 4 Replication, Str...
Ibm spectrum scale fundamentals workshop for americas part 4 Replication, Str...
 
Ibm aix technical deep dive workshop advanced administration and problem dete...
Ibm aix technical deep dive workshop advanced administration and problem dete...Ibm aix technical deep dive workshop advanced administration and problem dete...
Ibm aix technical deep dive workshop advanced administration and problem dete...
 
Storage Event: HP 3PAR live erleben
Storage Event: HP 3PAR live erlebenStorage Event: HP 3PAR live erleben
Storage Event: HP 3PAR live erleben
 
EMC Data domain advanced features and functions
EMC Data domain advanced features and functionsEMC Data domain advanced features and functions
EMC Data domain advanced features and functions
 
Accelerate and Scale Big Data Analytics with Disaggregated Compute and Storage
Accelerate and Scale Big Data Analytics with Disaggregated Compute and StorageAccelerate and Scale Big Data Analytics with Disaggregated Compute and Storage
Accelerate and Scale Big Data Analytics with Disaggregated Compute and Storage
 
IBM XIV Gen3 Storage System
IBM XIV Gen3 Storage SystemIBM XIV Gen3 Storage System
IBM XIV Gen3 Storage System
 
Ibm spectrum scale fundamentals workshop for americas part 4 spectrum scale_r...
Ibm spectrum scale fundamentals workshop for americas part 4 spectrum scale_r...Ibm spectrum scale fundamentals workshop for americas part 4 spectrum scale_r...
Ibm spectrum scale fundamentals workshop for americas part 4 spectrum scale_r...
 
HP 3PAR Express Writes
HP 3PAR Express WritesHP 3PAR Express Writes
HP 3PAR Express Writes
 
[db tech showcase Tokyo 2018] #dbts2018 #B17 『オラクル パフォーマンス チューニング - 神話、伝説と解決策』
[db tech showcase Tokyo 2018] #dbts2018 #B17 『オラクル パフォーマンス チューニング - 神話、伝説と解決策』[db tech showcase Tokyo 2018] #dbts2018 #B17 『オラクル パフォーマンス チューニング - 神話、伝説と解決策』
[db tech showcase Tokyo 2018] #dbts2018 #B17 『オラクル パフォーマンス チューニング - 神話、伝説と解決策』
 

Destaque

Ibm tivoli storage manager bare machine recovery for aix with sysback - red...
Ibm tivoli storage manager   bare machine recovery for aix with sysback - red...Ibm tivoli storage manager   bare machine recovery for aix with sysback - red...
Ibm tivoli storage manager bare machine recovery for aix with sysback - red...Banking at Ho Chi Minh city
 
Visual studio 2008 overview
Visual studio 2008 overviewVisual studio 2008 overview
Visual studio 2008 overviewsagaroceanic11
 
Ibm tivoli storage manager in a clustered environment sg246679
Ibm tivoli storage manager in a clustered environment sg246679Ibm tivoli storage manager in a clustered environment sg246679
Ibm tivoli storage manager in a clustered environment sg246679Banking at Ho Chi Minh city
 
Overview of v cloud case studies
Overview of v cloud case studiesOverview of v cloud case studies
Overview of v cloud case studiessolarisyougood
 
Sparc t4 systems customer presentation
Sparc t4 systems customer presentationSparc t4 systems customer presentation
Sparc t4 systems customer presentationsolarisyougood
 
Aix admin course provider Navi Mumbai | AIX Admin Course Training Navi Mumbai...
Aix admin course provider Navi Mumbai | AIX Admin Course Training Navi Mumbai...Aix admin course provider Navi Mumbai | AIX Admin Course Training Navi Mumbai...
Aix admin course provider Navi Mumbai | AIX Admin Course Training Navi Mumbai...VibrantGroup
 
Proof of concept guide for ibm tivoli storage manager version 5.3 sg246762
Proof of concept guide for ibm tivoli storage manager version 5.3 sg246762Proof of concept guide for ibm tivoli storage manager version 5.3 sg246762
Proof of concept guide for ibm tivoli storage manager version 5.3 sg246762Banking at Ho Chi Minh city
 
2.ibm flex system manager overview
2.ibm flex system manager overview2.ibm flex system manager overview
2.ibm flex system manager overviewsolarisyougood
 
RHT Upgrading to vSphere 5
RHT Upgrading to vSphere 5RHT Upgrading to vSphere 5
RHT Upgrading to vSphere 5virtualsouthwest
 
HP-UX Dynamic Root Disk vs Solaris Live Upgrade vs AIX Multibos by Dusan Balj...
HP-UX Dynamic Root Disk vs Solaris Live Upgrade vs AIX Multibos by Dusan Balj...HP-UX Dynamic Root Disk vs Solaris Live Upgrade vs AIX Multibos by Dusan Balj...
HP-UX Dynamic Root Disk vs Solaris Live Upgrade vs AIX Multibos by Dusan Balj...Circling Cycle
 
Leveraging Open Source to Manage SAN Performance
Leveraging Open Source to Manage SAN PerformanceLeveraging Open Source to Manage SAN Performance
Leveraging Open Source to Manage SAN Performancebrettallison
 
CTI Group- Blue power technology storwize technical training for customer - p...
CTI Group- Blue power technology storwize technical training for customer - p...CTI Group- Blue power technology storwize technical training for customer - p...
CTI Group- Blue power technology storwize technical training for customer - p...Tri Susilo
 
V mware v center orchestrator 5.5 knowledge transfer kit
V mware v center orchestrator 5.5 knowledge transfer kitV mware v center orchestrator 5.5 knowledge transfer kit
V mware v center orchestrator 5.5 knowledge transfer kitsolarisyougood
 
Customer overview oracle solaris cluster, enterprise edition
Customer overview oracle solaris cluster, enterprise editionCustomer overview oracle solaris cluster, enterprise edition
Customer overview oracle solaris cluster, enterprise editionsolarisyougood
 
Virtual san hardware guidance &amp; best practices
Virtual san hardware guidance &amp; best practicesVirtual san hardware guidance &amp; best practices
Virtual san hardware guidance &amp; best practicessolarisyougood
 

Destaque (19)

Ibm tivoli storage manager bare machine recovery for aix with sysback - red...
Ibm tivoli storage manager   bare machine recovery for aix with sysback - red...Ibm tivoli storage manager   bare machine recovery for aix with sysback - red...
Ibm tivoli storage manager bare machine recovery for aix with sysback - red...
 
AIX 5L Differences Guide Version 5.3 Edition
AIX 5L Differences Guide Version 5.3 EditionAIX 5L Differences Guide Version 5.3 Edition
AIX 5L Differences Guide Version 5.3 Edition
 
Visual studio 2008 overview
Visual studio 2008 overviewVisual studio 2008 overview
Visual studio 2008 overview
 
Ibm tivoli storage manager in a clustered environment sg246679
Ibm tivoli storage manager in a clustered environment sg246679Ibm tivoli storage manager in a clustered environment sg246679
Ibm tivoli storage manager in a clustered environment sg246679
 
Overview of v cloud case studies
Overview of v cloud case studiesOverview of v cloud case studies
Overview of v cloud case studies
 
Sparc t4 systems customer presentation
Sparc t4 systems customer presentationSparc t4 systems customer presentation
Sparc t4 systems customer presentation
 
IBMRedbook
IBMRedbookIBMRedbook
IBMRedbook
 
Aix admin course provider Navi Mumbai | AIX Admin Course Training Navi Mumbai...
Aix admin course provider Navi Mumbai | AIX Admin Course Training Navi Mumbai...Aix admin course provider Navi Mumbai | AIX Admin Course Training Navi Mumbai...
Aix admin course provider Navi Mumbai | AIX Admin Course Training Navi Mumbai...
 
Proof of concept guide for ibm tivoli storage manager version 5.3 sg246762
Proof of concept guide for ibm tivoli storage manager version 5.3 sg246762Proof of concept guide for ibm tivoli storage manager version 5.3 sg246762
Proof of concept guide for ibm tivoli storage manager version 5.3 sg246762
 
Accelerate Return on Data
Accelerate Return on DataAccelerate Return on Data
Accelerate Return on Data
 
RHT Design for Security
RHT Design for SecurityRHT Design for Security
RHT Design for Security
 
2.ibm flex system manager overview
2.ibm flex system manager overview2.ibm flex system manager overview
2.ibm flex system manager overview
 
RHT Upgrading to vSphere 5
RHT Upgrading to vSphere 5RHT Upgrading to vSphere 5
RHT Upgrading to vSphere 5
 
HP-UX Dynamic Root Disk vs Solaris Live Upgrade vs AIX Multibos by Dusan Balj...
HP-UX Dynamic Root Disk vs Solaris Live Upgrade vs AIX Multibos by Dusan Balj...HP-UX Dynamic Root Disk vs Solaris Live Upgrade vs AIX Multibos by Dusan Balj...
HP-UX Dynamic Root Disk vs Solaris Live Upgrade vs AIX Multibos by Dusan Balj...
 
Leveraging Open Source to Manage SAN Performance
Leveraging Open Source to Manage SAN PerformanceLeveraging Open Source to Manage SAN Performance
Leveraging Open Source to Manage SAN Performance
 
CTI Group- Blue power technology storwize technical training for customer - p...
CTI Group- Blue power technology storwize technical training for customer - p...CTI Group- Blue power technology storwize technical training for customer - p...
CTI Group- Blue power technology storwize technical training for customer - p...
 
V mware v center orchestrator 5.5 knowledge transfer kit
V mware v center orchestrator 5.5 knowledge transfer kitV mware v center orchestrator 5.5 knowledge transfer kit
V mware v center orchestrator 5.5 knowledge transfer kit
 
Customer overview oracle solaris cluster, enterprise edition
Customer overview oracle solaris cluster, enterprise editionCustomer overview oracle solaris cluster, enterprise edition
Customer overview oracle solaris cluster, enterprise edition
 
Virtual san hardware guidance &amp; best practices
Virtual san hardware guidance &amp; best practicesVirtual san hardware guidance &amp; best practices
Virtual san hardware guidance &amp; best practices
 

Semelhante a Avoiding Chaos: Methodology for Managing Performance in a Shared Storage Area Network Environment

Storage, San And Business Continuity Overview
Storage, San And Business Continuity OverviewStorage, San And Business Continuity Overview
Storage, San And Business Continuity OverviewAlan McSweeney
 
Building a high-performance data lake analytics engine at Alibaba Cloud with ...
Building a high-performance data lake analytics engine at Alibaba Cloud with ...Building a high-performance data lake analytics engine at Alibaba Cloud with ...
Building a high-performance data lake analytics engine at Alibaba Cloud with ...Alluxio, Inc.
 
Storage Consumption and Chargeback
Storage Consumption and ChargebackStorage Consumption and Chargeback
Storage Consumption and Chargebackbrettallison
 
Oracle R12 EBS Performance Tuning
Oracle R12 EBS Performance TuningOracle R12 EBS Performance Tuning
Oracle R12 EBS Performance TuningScott Jenner
 
Oracle RAC Presentation at Oracle Open World
Oracle RAC Presentation at Oracle Open WorldOracle RAC Presentation at Oracle Open World
Oracle RAC Presentation at Oracle Open WorldPaul Marden
 
Storage essentials (by Merlin Ran)
Storage essentials (by Merlin Ran)Storage essentials (by Merlin Ran)
Storage essentials (by Merlin Ran)gigix1980
 
Ms Tech Ed Best Practices For Exchange Server Cluster Deployments June 2003
Ms Tech Ed   Best Practices For Exchange Server Cluster Deployments June 2003Ms Tech Ed   Best Practices For Exchange Server Cluster Deployments June 2003
Ms Tech Ed Best Practices For Exchange Server Cluster Deployments June 2003Armando Leon
 
Champion Fas Deduplication
Champion Fas DeduplicationChampion Fas Deduplication
Champion Fas DeduplicationMichael Hudak
 
Webinar: Untethering Compute from Storage
Webinar: Untethering Compute from StorageWebinar: Untethering Compute from Storage
Webinar: Untethering Compute from StorageAvere Systems
 
Putting Wings on the Elephant
Putting Wings on the ElephantPutting Wings on the Elephant
Putting Wings on the ElephantDataWorks Summit
 
Resume_CQ_Edward
Resume_CQ_EdwardResume_CQ_Edward
Resume_CQ_Edwardcaiqi wang
 
20150704 benchmark and user experience in sahara weiting
20150704 benchmark and user experience in sahara weiting20150704 benchmark and user experience in sahara weiting
20150704 benchmark and user experience in sahara weitingWei Ting Chen
 
Performance and predictability
Performance and predictabilityPerformance and predictability
Performance and predictabilityRichardWarburton
 
IMCSummit 2015 - Day 1 Developer Track - Evolution of non-volatile memory exp...
IMCSummit 2015 - Day 1 Developer Track - Evolution of non-volatile memory exp...IMCSummit 2015 - Day 1 Developer Track - Evolution of non-volatile memory exp...
IMCSummit 2015 - Day 1 Developer Track - Evolution of non-volatile memory exp...In-Memory Computing Summit
 
Ceph Day Seoul - AFCeph: SKT Scale Out Storage Ceph
Ceph Day Seoul - AFCeph: SKT Scale Out Storage Ceph Ceph Day Seoul - AFCeph: SKT Scale Out Storage Ceph
Ceph Day Seoul - AFCeph: SKT Scale Out Storage Ceph Ceph Community
 
What we unlearned_and_learned_by_moving_from_m9000_to_ssc_ukoug2014
What we unlearned_and_learned_by_moving_from_m9000_to_ssc_ukoug2014What we unlearned_and_learned_by_moving_from_m9000_to_ssc_ukoug2014
What we unlearned_and_learned_by_moving_from_m9000_to_ssc_ukoug2014Philippe Fierens
 

Semelhante a Avoiding Chaos: Methodology for Managing Performance in a Shared Storage Area Network Environment (20)

Storage, San And Business Continuity Overview
Storage, San And Business Continuity OverviewStorage, San And Business Continuity Overview
Storage, San And Business Continuity Overview
 
Clustering and High Availability
Clustering and High Availability Clustering and High Availability
Clustering and High Availability
 
Building a high-performance data lake analytics engine at Alibaba Cloud with ...
Building a high-performance data lake analytics engine at Alibaba Cloud with ...Building a high-performance data lake analytics engine at Alibaba Cloud with ...
Building a high-performance data lake analytics engine at Alibaba Cloud with ...
 
Storage Consumption and Chargeback
Storage Consumption and ChargebackStorage Consumption and Chargeback
Storage Consumption and Chargeback
 
Dpdk applications
Dpdk applicationsDpdk applications
Dpdk applications
 
Oracle R12 EBS Performance Tuning
Oracle R12 EBS Performance TuningOracle R12 EBS Performance Tuning
Oracle R12 EBS Performance Tuning
 
Oracle RAC Presentation at Oracle Open World
Oracle RAC Presentation at Oracle Open WorldOracle RAC Presentation at Oracle Open World
Oracle RAC Presentation at Oracle Open World
 
Storage essentials (by Merlin Ran)
Storage essentials (by Merlin Ran)Storage essentials (by Merlin Ran)
Storage essentials (by Merlin Ran)
 
Ms Tech Ed Best Practices For Exchange Server Cluster Deployments June 2003
Ms Tech Ed   Best Practices For Exchange Server Cluster Deployments June 2003Ms Tech Ed   Best Practices For Exchange Server Cluster Deployments June 2003
Ms Tech Ed Best Practices For Exchange Server Cluster Deployments June 2003
 
Champion Fas Deduplication
Champion Fas DeduplicationChampion Fas Deduplication
Champion Fas Deduplication
 
Webinar: Untethering Compute from Storage
Webinar: Untethering Compute from StorageWebinar: Untethering Compute from Storage
Webinar: Untethering Compute from Storage
 
Putting Wings on the Elephant
Putting Wings on the ElephantPutting Wings on the Elephant
Putting Wings on the Elephant
 
RAC - Test
RAC - TestRAC - Test
RAC - Test
 
Resume_CQ_Edward
Resume_CQ_EdwardResume_CQ_Edward
Resume_CQ_Edward
 
20150704 benchmark and user experience in sahara weiting
20150704 benchmark and user experience in sahara weiting20150704 benchmark and user experience in sahara weiting
20150704 benchmark and user experience in sahara weiting
 
Performance and predictability
Performance and predictabilityPerformance and predictability
Performance and predictability
 
MYSQL
MYSQLMYSQL
MYSQL
 
IMCSummit 2015 - Day 1 Developer Track - Evolution of non-volatile memory exp...
IMCSummit 2015 - Day 1 Developer Track - Evolution of non-volatile memory exp...IMCSummit 2015 - Day 1 Developer Track - Evolution of non-volatile memory exp...
IMCSummit 2015 - Day 1 Developer Track - Evolution of non-volatile memory exp...
 
Ceph Day Seoul - AFCeph: SKT Scale Out Storage Ceph
Ceph Day Seoul - AFCeph: SKT Scale Out Storage Ceph Ceph Day Seoul - AFCeph: SKT Scale Out Storage Ceph
Ceph Day Seoul - AFCeph: SKT Scale Out Storage Ceph
 
What we unlearned_and_learned_by_moving_from_m9000_to_ssc_ukoug2014
What we unlearned_and_learned_by_moving_from_m9000_to_ssc_ukoug2014What we unlearned_and_learned_by_moving_from_m9000_to_ssc_ukoug2014
What we unlearned_and_learned_by_moving_from_m9000_to_ssc_ukoug2014
 

Avoiding Chaos: Methodology for Managing Performance in a Shared Storage Area Network Environment

  • 1. Avoiding Chaos: Methodology for Managing Performance in a Shared Storage Area Network Environment Brett Allison July 25-29, 2005 New Orleans, LA P10
  • 2.
  • 3.
  • 4.
  • 5. What Are the Major Benefits of SAN and Shared Storage? Perf Availability Reduce Cost
  • 6. What is a SAN? ISL’s Core Switch - A Core Switch - B Servers Edge Switch - A Edge Switch - B Storage Servers What can be measured? Links Links Storage Switch - A Storage Switch - B
  • 7. What is Shared on the Enterprise Storage Server? Front End Central Back End Legend Rank1 Rank9 D = Data P = Parity S = Spare Host Adapters ESCON FICON SCSI CPUs Cache NVS Cluster SSA SSA SSA Adapters SSA Raid 5 Ranks D D D S D P D D Eight Pack 1 Eight Pack 2 D D P D S D D D Loop A Loop B Disks
  • 8. How is Data Shared on the Disks? S Eight Pack 1 Eight Pack 2 Loop A Disks (Rank1) 1 2 3 4 P 5 6 1 2 3 P 4 5 6 1 2 P 3 4 5 6 1 P 2 3 4 5 6 P 1 2 3 4 5 6 1 2 3 4 P 6 5 Legend Volume 1 – Staging Server Test DB Volume 2 – Production DB Volume 3 – TSM Disk Pool Volume 4 – Data Warehouse Load Volume 5 – Production DB Log Files Volume 6 – Production DB Index
  • 9. What Role Does Performance Management Play in Shared Storage? Performance Management Planning Predictive Reactive Proactive
  • 10. Assessment and Design Considerations Shared Workload Variance RT Sensitivity Low/Small High/Large Bandwidth Budget Dedicated
  • 11. A Reactive Methodology – Online Focus Host resource issue? Fix it ID hot Host disks ID hot Host disks Host Storage server SAN config, SAN perf data Storage Srvr perf data Fix it N Y
  • 12. Identify Host Disks with High I/O Response Time – Example of AIX Server with SDD installed ------------------------------------------------------------------------ Detailed Physical Volume Stats (512 byte blocks) ------------------------------------------------------------------------ VOLUME: /dev/ hdisk23 description: IBM FC 2105800 reads: 1659 (0 errs) read sizes (blks): avg 8.0 min 8 max 8 sdev 0.0 read times (msec): avg 30.25 min 13.335 max 36.228 sdev 6.082 read sequences: 1659 read seq. lengths: avg 8.0 min 8 max 8 sdev 0.0 Gather Response Time Data ‘filemon’ (See Appendix C) Gather LUN ->hdisk information (‘lsvp –a’ See Appendix D) Hostname VG vpath hdisk Location LUN SN S Connection Size -------- -- ------- --------- -------- -------- - ---------- ---- server1 vg1 vpath96 hdisk23 2Y-08-02 71012345 Y R1-B4-H1-ZA 8.0 Format the data (Script - See Appendix H) 4 5.88 469 4 30.25 1659 hdisk23 12345 71012345 server1 9:00 12/17/05 AVG WRITE SIZE (KB) WRITE TIMES (ms) # WRITES AVG READ SIZE (KB) READ TIMES (ms) # READS HDISK ESS LUN SERVER NAME TIME DATE
  • 13.
  • 14. Did Contention Exist on the Storage Server for the Time Periods When the Attached Server had Contention? Array rank40 had a large spike in activity causing disk utilization to rise to 68% on average for the period starting at 8:45 AM and ending at 9:00 AM Gather ESS Physical Array Data – Appendix E Spike in Utilization
  • 15. What Caused the Spike in Disk Utilization on Array rank40? Gather LUN level data – Appendix F Spike in C2D During the 8:45 – 9:00 AM interval there was a significant spike in Cache 2 Disk Track transfers to LUN 73912345. The owner of the LUN was server2 and from working with the SA we find that this LUN is TSM storage pool
  • 16. Fixing the Problem? Identify Hot Array Legend : ArrayH = Hot Array; ArrayT = Target Array; IOR = I/O Rate Migrate LUNs to Target Quantify LUN I/O Rate on Array ArrayH: LUN IOR = (R+W – CH)/Interval Quantify Array I/O Rate Delta ArrayH:IOR - Threshold IOR = Delta IOR Identify Target Array IOR Threshold < (Delta IOR + ArrayT:IOR)
  • 17.
  • 18. ESS Analysis Gotchas Variance Time Stamps Expectations Availability of Data Lack of Config. Info. Measure-ability
  • 19.
  • 20.  
  • 21. Appendix A - Best Practices for Performance in a Shared ESS Environment Isolate source and backup volumes on separate disk groups General Utilize at least 4 paths for heavy workloads AIX SDD/HBA Use small LUN size (8-16) for more granular tuning General Isolate/dedicate high bandwidth workloads (Data Warehouse) General Understand AIX – LV Intra Policy of Max and how it effects placement – Spreads LV partitions across all LUNs in VG AIX LV Understand implications of Filesystem striping FS Striping If write activity is heavy (Logs) segregate at array level from other workloads Database(s) Disk Group/Adapter isolation for Flash copy source and target Flash Copy Avoid placing LUNs on heavily utilized disk groups General Spread I/O evenly across adapters and disk groups General Description Technology
  • 22.
  • 23. Appendix C - Measure End-to-End Host Disk I/O Response Time The iostat package for Linux is only valid with a 2.4 & 2.6 kernel See Appendix B for links to more information Avg. Disk sec/Read Physical Disk perfmon NT/Wintel svctm (ms) iostat –d 2 5 *iostat Linux iostat –xcn 2 5 sar –d filemon -o /tmp/filemon.log -O all Command/Object iostat sar filemon Native Tool svc_t (ms) Solaris avserv (ms) HP-UX read time (ms) write time (ms) AIX Metric(s) OS
  • 24. Appendix D: Getting LUN Serial Numbers for ESS Devices Note : ESS Utilities for AIX/HP-UX/Solaris are available at: http://www-1.ibm.com/servers/storage/support/disk/2105/downloading.html Host config. - http://www.redbooks.ibm.com/abstracts/tips0553.html Device Name LUN SN lsvpcfg SDD Linux SDD ESS Util Tool Device Name Serial Datapath query device Wintel VG, hostname, Connection, hdisk LUN SN lsvp –a AIX, HP-UX, Solaris Other Metrics Key Command OS
  • 25. Appendix E: DB2 Query for Array Performance Data Note : This information is relevant only if you have the TotalStorage Expert installed and access to the DB2 command line on the TSE server. SELECT DISTINCT A.*, B.M_CARD_NUM, B.M_LOOP_ID, B.M_GRP_NUM FROM DB2ADMIN.VPCRK A, DB2ADMIN.VPCFG B WHERE ( ( A.PC_DATE_B >= '%STARTDATE' AND A.PC_DATE_E <= '%ENDDATE' AND A.PC_TIME_B >= '%STARTTIME' AND A.PC_TIME_E <= '%ENDTIME' AND A.M_MACH_SN = '%ESSID' AND A.M_MACH_SN = B.M_MACH_SN AND A.M_ARRAY_ID = B.M_ARRAY_ID AND A.P_TASK = B.P_TASK ) ) ORDER BY A.M_ARRAY_ID, A.PC_DATE_B, A.PC_DATE_E with ur;
  • 26. Appendix F: DB2 Query for LUN Performance Data Note : This query requires sql access to the TotalStorage Expert for ESS SELECT DISTINCT A.M_VOL_ADDR, B.* FROM VPVOL A, VPCCH B WHERE ( A.M_MACH_SN = '%ESSID' AND A.M_MACH_SN = B.M_MACH_SN AND A.M_LSS_LA = B.M_LSS_LA AND A.M_VOL_NUM = B.M_VOL_NUM AND B.PC_DATE_B >= '%STARTDATE' AND B.PC_DATE_E <= '%ENDDATE' AND B.PC_TIME_B >= '%STARTTIME' AND B.PC_TIME_E <= '%ENDTIME' ) ;
  • 27.
  • 28.
  • 29. Appendix I – Sample Wintel Datapath Query Output DEV#: 0 DEVICE NAME: Disk0 Part0 TYPE: 2105F20 POLICY: RESERVE SERIAL: 02612345 ============================================================================ Path# Adapter/Hard Disk State Mode Select Errors 0 Scsi Port5 Bus0/Disk0 Part0 OPEN NORMAL 3212602 1 1 Scsi Port5 Bus0/Disk0 Part0 OPEN NORMAL 865 1 Note: The SERIAL number indicates the LUN information. The first 3 digits are the LUN number and the last 5 are the ESS serial number.
  • 30.
  • 31. Appendix K: ESS Components No statistics - CPU Calculated Response Time, Disk Disk Drive Logical statistics (Cache/Tracks/etc) LUN Level   KB Read/sec, KB Written/sec, I/O Rates, Sequential PCT, Read PCT Arrays   No TSE statistics. It is possible to roll up from Array level or use CLI to get stats SSA Adapters Backend No statistics - I/O planar Throughput/RT Available via CLI but not feasible for continuous measurement FC HBA Adapter Front-End Percent of delays caused by limitations in NVS NVS PCT cache hits/Cache Hold Time Cache Cluster Level Metrics Sub-component Component
  • 32. Appendix L: A Process for New LUN Allocations with Performance Input Allocation Request Identify healthy target arrays Identify arrays with free space Assign LUNs on target arrays
  • 33. Appendix M: ESS Array HealthCheck and Drill Down
  • 34.
  • 35. Biography Brett Allison has been doing distributed systems performance related work since 1997 including J2EE application analysis, UNIX/NT, and Storage technologies. His current role is performance analyst for the IGS Managed Storage Services offering. MSS currently manages over 1 Petabyte of data. He has developed a number of internally used performance analysis tools used by ITS/IGS. He has spoken at a previous Storage Symposium and is the author of several White Papers on performance

Notas do Editor

  1. Scope - The primary focus of this presentation is on the methodology we use for managing performance in a very large shared Storage Area Network environment with a Primary focus on Distributed Systems and IBM Enterprise Storage Server. The focus on this presentation is methodology and NOT measurement. There are numerous excellent presentations already out there on measurement. However, there are several references in the back of the presentation to measurement tools.
  2. Shared storage cannot be left to chance! Performance is too important! There are many similarities between shared storage that is not managed for performance and chaos. Confused customers (why is I/O response time unpredictable and unexpectedly high?), Workloads that don’t play nice are mixed with workloads that do play nice. Without proper planning runaway staging and development server applications can clobber shared resources and impact product users.
  3. Cost savings can be achieved through physical consolidation of storage by: Reduction of staff overhead/redundancy Reuse/deepening intellectual capital Standardization of storage architecture Process optimization Decreased provisioning time Increased capacity usage/reduction in excess capacity Reduced tape/backup resources Improved Scalability/Availability Out of band storage reduces network load
  4. “ Shared storage” typically refers to the storage shared on a SAN. This includes the s torage Area Network Switches and other Fabric components (ISL’s, routers, etc) Link information includes Tput, packets/sec, errors
  5. Storage is shared on the Storage Server. This includes but is not limited to: Front-End Adapters (HBAs), I/O planar/System bus, CPUs, Cache, NVS, Backend Adapters, Physical Disk Drives. Server HBAs may be shared on partitioned servers but this is out of scope of this presentation DS6000 and DS4000 do not support ESCON DS4000 and Open systems do not support FICON DS4000 to DS8000 support SCSI protocol on FC, but not real SCSI DS4000 to DS8000 do not support SSA, rather they use switched FC..
  6. Each Volume consists of some number of 32 Kbytes stripes. The number of stripes associated with each volume is dictated by the volume size. The volume size is user configurable. You can easily see how activities from different parts of the same server or other servers can impact performance.
  7. Planning – Assessment and Design. Reactive – Identification of resources that are over utilized and load reduction. The majority of the presentation will focus on this slide? Proactive – Recommendations to manage resources to levels where contention does not impact customer Predictive – Identification of trends in resource consumption caused by organic growth and business trends and recommendation of preventative steps to mitigate impact of increased consumption
  8. Determine storage performance requirements: Average/Max I/O bandwidth per GB Workload type (reads vs writes; random vs sequential) Workload variance I/O response time or throughput requirements (SLA’s) Design for customer requirements Determine sharing level (shared, storage server, network, etc.) for business or workload requirements Follow Best Practices – See Appendix A During the planning phase it is important to understand the customer’s workload intensity and characteristics. Use these to design a storage solution that meets requirements. Not all workloads are well suited for shared storage. Be extremely careful when defining SLAs around performance in shared storage. There are some excellent planning guides and material available for Disk. They are beyond the scope of this presentation but can be provided upon request.
  9. Confirm that the issue is NOT with server resources Verify that host CPU utilization, Paging I/O, and local HBA saturation are not source of performance issues Identify any host disks with high I/O response time See Appendix C for ESS Map the host disk to Storage Server device name See Appendix D for ESS Gather Storage Server performance and configuration data Appendices E &amp; F Sample Queries for ESS data Gather SAN fabric configuration and exception data If port saturation then contact SAN design team Analyze storage server configuration and performance data If ESS issue exists then recommend corrective actions Why do we use I/O response time? On most systems with virtualized storage with multiple paths the disk utilization numbers are misleading. You might have a device that shows 100% busy but it could have excellent response time. The device is not actually 100% busy because it is not really a device but a path to a logical storage unit located on multiple devices. The most telling type of I/O metric is the I/O response time. LUN Serial Numbers can be used to correlate the Storage Server performance data with the server physical device information. Unfortunately the LSS and Rank information provided by the ESS utilities do NOT match the information stored in the ESS Expert.
  10. The AIX filemon tool is a trace based facility and should only be ran for a couple of minutes at a time. The other UNIX flavors provide I/O response time data that can be gathered continuously at reasonable intervals as they are not trace based (See Appendix C for other flavors). The read size is always in 512 byte blocks chunks. So in this case there were 620 reads. The avg read size was 8 Blocks (512 Byte blocks) or 4096 Bytes (4 KB) chunks. These are random I/Os as the number of read sequences is the same as the number of I/Os. The minimal information that you need to pull from this is: Time When filemon started, Volume, Reads, Avg Read Time, Writes, Avg Write Time. I would filter out any records that have 1 I/O or less. For the LUN &gt;hdisk, the ESS Utilities provide the ‘lsvp –a’ command (See appendix D). Minimally you will want to pull the Hostname and the hdisk information on a daily basis if you have access to the servers or install an agent that ftp’s the information to somewhere where you can load it in a configuration database. After the data has been formatted, sort by the highest average response time. It is helpful to create a pivot table and average the I/O response times for each of the LUNs and create a sorted list of the LUNs with the highest response time.
  11. After identifying the LUNs with the highest response time, it is helpful to look at the response times of those LUNs with the highest resonse times. The Analysis is the last step in the identification of the LUN’s with high I/O response time. I like to summarize the read response time data by percentiles using the normalized filemon data (output of previous step). If the workload is primarily reads (typical of online) I like to focus on the read response time. I also like to summarize by ESS if there are multiple ESS(s) using Excel pivot tables. Once I have summarized by ESS I like to determine if there is an actual I/O response time issue. If the I/O response times are greater than reasonable then there is likely contention in the I/O subsystem. This is where you have to drill down to the next level. The analysis assumes a representative sample of the I/Os. Garbage in = Garbage out!
  12. I like to look at the array disk utilizations over time as well as the throughput and I/O rates. The spike on rank40 only lasted 15 minutes. Since there is only 1 spike it is not a good candidate for migrating data. It is not always bad to have high disk utilization, as it indicates you are getting more use from your hardware, however, as utilization increases, so does queue time. As queue time increases response time increases. At some point the response time may increase to a point where OLTP clients are negatively/noticeably impacted.
  13. For the sake of argument, lets assume that based on our prior I like to summarize the Arrays during the time period looking at a number of metrics for each array including the configuration and creating a Rank Score and sorting in descending order. The best way to
  14. Verify that problem is recurring based on multiple data points – never tune to just one data point Identify which LUNs and associated servers are driving I/O to the over-utilized resource Determine a reasonable target reduction in I/O Identify 1 or more LUNs to migrate to a lesser utilized Cluster, Adapter, or Array
  15. Time stamps for the ESS data reported in the ESS Expert reflect the time clocks of the ESS cluster. The ESS cluster clock is set manually by the CE and are not synched with an external time server. They might be hours or even days off. Its important to understand the offset before beginning your analysis. Now that we have confirmed that there is some type of I/O subsystem degradation we should confirm that the issue is not at the ESS. This slide asks the key questions that can be answered with ESS level data. I have provided queries in appendix D &amp; E to pull the raw performance data. I like to look at the array utilization, i/o rates and throughput and use pivot tables to summarize at the ESS, Cluster, and adapter level. Depending on the ESS configuration, including model, the cache, the disk drive RPMs, and the adapter bandwidth can vary. Rules of thumb should be derived for your environment using empirical data and correlation of I/O response times with ESS data.
  16. Time stamps are not likely close and intervals are different between different tools and samples. Time stamp of ESS data is based on time stamp of ESS server which is set by CE and is not synchronized with external reference. The time stamps might be days off. Make sure you understand the relative offset! The interval is also a gotcha as it is unrealistic to gather data any more frequently than 15 minutes. This may cause a problem when attempting to view server data that is collected at trace level recording all I/Os for a fixed period. Direct correlation is difficult if not impossible at times. High workload variance may create problem with correlation (Don’t tune for 1 data point!). If you have workloads that change frequently causing spikes in I/O you may not catch them in the 15 minute interval that ESS expert typically collects for. Adjust “reasonable” response times expectations based on I/O size and customer requirements. LUN level data does not contain reliable server information. You have to go to the server to get reliable server names. This is because the server names in the Storage Expert are taken from the server names entered by an operator in the Storwatch specialist. They may be incorrect. This may require installing an agent on the server to push the data to a central repository. Consider this a must! The data you need might not be available (See Appendix K) for some of the metrics that are available. Some of the components were not built to measure so they are essentially un-manageable! The TotalStorage Expert in particular has availbility issues particularly when collecting data from more than 1 ESS. This might cause data to be missing! Expectations – What is the problem and service level definition for the problem resolution!
  17. Develop rules of thumb for measureable resources in your environment based on reasonable assumptions Collect data for each ESS/Storage Server daily Summarize data and apply rules of thumb (be conservative) Examine prime shift and 24 hour period at a minimum Save summarized AND exception data Review data on a daily basis at first, and weekly later Create health check of environment that can be given to customers Educate customers about activities that negatively impact themselves and others (DB loads, Backups, etc) and set policies to perform offline (this gets very dicey!) Develop a process for reducing load Include performance reviews as part of the new LUN allocation process (Appendix L) Create a capacity report that considers performance Identify customer requirements and develop a process for evaluating their impact on shared environment Avoid placing highly sequential, high variance workloads in a highly sensitive shared environment Identify components that are trending towards contention There are a number of reports that can be used and exceptions that can be created in the TSE, however, for large environments this may not be very usable. The current IBM product for managing the performance of multiple storage devices is called MDM. In the fall IBM Total Productivity Center will roll out performance support for ESS, DS8000 and other devices. This may be a better alternative to a Roll your own approach.
  18. Generally speaking the I/O response time is the amount of time it takes from the point where the I/O request hits the device driver until the I/O is returned from the device driver
  19. Tools are imperfect and there is no way to clearly trace the time it takes at each of the components on the SAN I/O path. In some cases there may be no clear correlation between high I/O response time and shared SAN component over-utilizations.
  20. For IBM’ers I have a sample script that I can make available. For external customers I would advise you to contact your local IBM AIX field reps to see if they have anything or roll your own script.
  21. Contact TSE support for the Q_IO_SEQ and Q_CL_NVS_FULL_PRCT reporting issues. The patch fixes a problem with the VSXPCalculator.class
  22. This is an example of a report we use call the Rank Report or Array level report. It provides a sorted view of the hottest arrays on any selected ESS(s) and provides a drill down to the array level exceptions if available. In addition. This is a quick way to see if array level contention exists.