Scope - This presentation focuses on the methodology we use for managing performance in a very large shared Storage Area Network environment, with a primary focus on Distributed Systems and the IBM Enterprise Storage Server. The focus is methodology and NOT measurement. Numerous excellent presentations on measurement already exist. However, there are several references at the back of the presentation to measurement tools.
Avoiding Chaos: Methodology for Managing Performance in a Shared Storage Area Network Environment
1. Avoiding Chaos: Methodology for Managing Performance in a Shared Storage Area Network Environment Brett Allison July 25-29, 2005 New Orleans, LA P10
2.
3.
4.
5. What Are the Major Benefits of SAN and Shared Storage? Perf Availability Reduce Cost
6. What is a SAN? [Diagram labels: Servers; Edge Switch - A, Edge Switch - B; Core Switch - A, Core Switch - B; ISLs; Storage Switch - A, Storage Switch - B; Storage Servers; Links] What can be measured?
7. What is Shared on the Enterprise Storage Server? [Diagram labels: Front End – Host Adapters (ESCON, FICON, SCSI); Central – Cluster (CPUs, Cache, NVS); Back End – SSA Adapters, SSA Loop A and Loop B, RAID 5 Ranks (Rank1, Rank9), Eight Pack 1 and Eight Pack 2 Disks] Legend: D = Data, P = Parity, S = Spare
8. How is Data Shared on the Disks? [Diagram: Rank1 on Loop A, made up of Eight Pack 1 and Eight Pack 2 with one spare disk (S); stripes from Volumes 1-6 and parity (P) are spread across the disks] Legend: Volume 1 – Staging Server Test DB; Volume 2 – Production DB; Volume 3 – TSM Disk Pool; Volume 4 – Data Warehouse Load; Volume 5 – Production DB Log Files; Volume 6 – Production DB Index
9. What Role Does Performance Management Play in Shared Storage? Performance Management Planning Predictive Reactive Proactive
11. A Reactive Methodology – Online Focus Host resource issue? Fix it ID hot Host disks ID hot Host disks Host Storage server SAN config, SAN perf data Storage Srvr perf data Fix it N Y
12. Identify Host Disks with High I/O Response Time – Example of AIX Server with SDD installed

Gather Response Time Data: ‘filemon’ (See Appendix C)

    ------------------------------------------------------------------------
    Detailed Physical Volume Stats (512 byte blocks)
    ------------------------------------------------------------------------
    VOLUME: /dev/hdisk23  description: IBM FC 2105800
    reads:                1659 (0 errs)
    read sizes (blks):    avg 8.0    min 8       max 8       sdev 0.0
    read times (msec):    avg 30.25  min 13.335  max 36.228  sdev 6.082
    read sequences:       1659
    read seq. lengths:    avg 8.0    min 8       max 8       sdev 0.0

Gather LUN -> hdisk information: ‘lsvp -a’ (See Appendix D)

    Hostname VG   vpath    hdisk    Location  LUN SN    S  Connection   Size
    -------- --   -------  -------  --------  --------  -  -----------  ----
    server1  vg1  vpath96  hdisk23  2Y-08-02  71012345  Y  R1-B4-H1-ZA  8.0

Format the data (Script - See Appendix H)

    DATE      TIME  SERVER NAME  LUN       ESS    HDISK    # READS  READ TIMES (ms)  AVG READ SIZE (KB)  # WRITES  WRITE TIMES (ms)  AVG WRITE SIZE (KB)
    12/17/05  9:00  server1      71012345  12345  hdisk23  1659     30.25            4                   469       5.88              4
13.
14. Did Contention Exist on the Storage Server for the Time Periods When the Attached Server had Contention? Array rank40 had a large spike in activity causing disk utilization to rise to 68% on average for the period starting at 8:45 AM and ending at 9:00 AM Gather ESS Physical Array Data – Appendix E Spike in Utilization
15. What Caused the Spike in Disk Utilization on Array rank40? Gather LUN level data – Appendix F. [Chart annotation: Spike in C2D] During the 8:45 – 9:00 AM interval there was a significant spike in Cache 2 Disk Track transfers to LUN 73912345. The owner of the LUN was server2, and from working with the SA we found that this LUN is a TSM storage pool.
18. ESS Analysis Gotchas Variance Time Stamps Expectations Availability of Data Lack of Config. Info. Measure-ability
19.
20.
21. Appendix A - Best Practices for Performance in a Shared ESS Environment

    Technology    Description
    General       Spread I/O evenly across adapters and disk groups
    General       Avoid placing LUNs on heavily utilized disk groups
    Flash Copy    Disk Group/Adapter isolation for Flash Copy source and target
    Database(s)   If write activity is heavy (Logs), segregate at array level from other workloads
    FS Striping   Understand implications of Filesystem striping
    AIX LV        Understand AIX – LV Intra Policy of Max and how it affects placement – Spreads LV partitions across all LUNs in VG
    General       Isolate/dedicate high bandwidth workloads (Data Warehouse)
    General       Use small LUN size (8-16) for more granular tuning
    AIX SDD/HBA   Utilize at least 4 paths for heavy workloads
    General       Isolate source and backup volumes on separate disk groups
22.
23. Appendix C - Measure End-to-End Host Disk I/O Response Time

    OS          Native Tool   Command/Object                        Metric(s)
    AIX         filemon       filemon -o /tmp/filemon.log -O all    read time (ms), write time (ms)
    HP-UX       sar           sar -d                                avserv (ms)
    Solaris     iostat        iostat -xcn 2 5                       svc_t (ms)
    Linux       *iostat       iostat -d 2 5                         svctm (ms)
    NT/Wintel   perfmon       Physical Disk                         Avg. Disk sec/Read

* The iostat package for Linux is only valid with a 2.4 & 2.6 kernel. See Appendix B for links to more information.
24. Appendix D: Getting LUN Serial Numbers for ESS Devices

    OS                    Tool       Command                 Key      Other Metrics
    AIX, HP-UX, Solaris   ESS Util   lsvp -a                 LUN SN   VG, hostname, Connection, hdisk
    Wintel                SDD        datapath query device   Serial   Device Name
    Linux                 SDD        lsvpcfg                 LUN SN   Device Name

Note: ESS Utilities for AIX/HP-UX/Solaris are available at: http://www-1.ibm.com/servers/storage/support/disk/2105/downloading.html Host config. - http://www.redbooks.ibm.com/abstracts/tips0553.html
25. Appendix E: DB2 Query for Array Performance Data
Note: This information is relevant only if you have the TotalStorage Expert installed and access to the DB2 command line on the TSE server.

    SELECT DISTINCT A.*, B.M_CARD_NUM, B.M_LOOP_ID, B.M_GRP_NUM
    FROM DB2ADMIN.VPCRK A, DB2ADMIN.VPCFG B
    WHERE ( ( A.PC_DATE_B >= '%STARTDATE'
          AND A.PC_DATE_E <= '%ENDDATE'
          AND A.PC_TIME_B >= '%STARTTIME'
          AND A.PC_TIME_E <= '%ENDTIME'
          AND A.M_MACH_SN = '%ESSID'
          AND A.M_MACH_SN = B.M_MACH_SN
          AND A.M_ARRAY_ID = B.M_ARRAY_ID
          AND A.P_TASK = B.P_TASK ) )
    ORDER BY A.M_ARRAY_ID, A.PC_DATE_B, A.PC_DATE_E
    WITH UR;

26. Appendix F: DB2 Query for LUN Performance Data
Note: This query requires SQL access to the TotalStorage Expert for ESS.

    SELECT DISTINCT A.M_VOL_ADDR, B.*
    FROM VPVOL A, VPCCH B
    WHERE ( A.M_MACH_SN = '%ESSID'
        AND A.M_MACH_SN = B.M_MACH_SN
        AND A.M_LSS_LA = B.M_LSS_LA
        AND A.M_VOL_NUM = B.M_VOL_NUM
        AND B.PC_DATE_B >= '%STARTDATE'
        AND B.PC_DATE_E <= '%ENDDATE'
        AND B.PC_TIME_B >= '%STARTTIME'
        AND B.PC_TIME_E <= '%ENDTIME' ) ;
27.
28.
29. Appendix I – Sample Wintel Datapath Query Output

    DEV#: 0  DEVICE NAME: Disk0 Part0  TYPE: 2105F20  POLICY: RESERVE
    SERIAL: 02612345
    ============================================================================
    Path#  Adapter/Hard Disk             State  Mode    Select   Errors
        0  Scsi Port5 Bus0/Disk0 Part0   OPEN   NORMAL  3212602       1
        1  Scsi Port5 Bus0/Disk0 Part0   OPEN   NORMAL      865       1

Note: The SERIAL number indicates the LUN information. The first 3 digits are the LUN number and the last 5 are the ESS serial number.
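The note above describes how to split the SERIAL field, and that split is what lets you join host data to ESS data. A minimal sketch, assuming the serial is always the 8-character form shown in the sample output (the function name is hypothetical and not part of SDD):

```python
def split_sdd_serial(serial):
    """Split an 8-character SDD SERIAL into (lun_number, ess_serial).

    Per the note above: the first 3 characters are the LUN number and
    the last 5 are the ESS serial number.
    """
    serial = serial.strip()
    if len(serial) != 8:
        raise ValueError(f"expected 8 characters, got {serial!r}")
    return serial[:3], serial[3:]


lun_number, ess_serial = split_sdd_serial("02612345")
print(lun_number, ess_serial)  # prints: 026 12345
```

The same split applied to the LUN SN from the AIX example (71012345) gives LUN 710 on ESS 12345, which matches the ESS column in the formatted data.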
30.
31. Appendix K: ESS Components No statistics - CPU Calculated Response Time, Disk Disk Drive Logical statistics (Cache/Tracks/etc) LUN Level KB Read/sec, KB Written/sec, I/O Rates, Sequential PCT, Read PCT Arrays No TSE statistics. It is possible to roll up from Array level or use CLI to get stats SSA Adapters Backend No statistics - I/O planar Throughput/RT Available via CLI but not feasible for continuous measurement FC HBA Adapter Front-End Percent of delays caused by limitations in NVS NVS PCT cache hits/Cache Hold Time Cache Cluster Level Metrics Sub-component Component
32. Appendix L: A Process for New LUN Allocations with Performance Input Allocation Request Identify healthy target arrays Identify arrays with free space Assign LUNs on target arrays
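The allocation flow above can be sketched roughly as a filter over array summaries. This is only an illustration: the field names, the sample numbers for rank12 and rank07, and the 50% utilization threshold are all hypothetical placeholders, and you would substitute the rules of thumb derived for your own environment.

```python
def candidate_arrays(arrays, needed_gb, max_util_pct):
    """Return arrays that are healthy and have room, least utilized first.

    "Healthy" here just means average disk utilization at or below
    max_util_pct, which is an assumed stand-in for a real health check.
    """
    ok = [
        a for a in arrays
        if a["util_pct"] <= max_util_pct and a["free_gb"] >= needed_gb
    ]
    return sorted(ok, key=lambda a: a["util_pct"])


# Hypothetical array summaries (rank40 at 68% echoes the earlier example).
arrays = [
    {"name": "rank40", "util_pct": 68.0, "free_gb": 120.0},
    {"name": "rank12", "util_pct": 22.0, "free_gb": 64.0},
    {"name": "rank07", "util_pct": 35.0, "free_gb": 4.0},
]

targets = candidate_arrays(arrays, needed_gb=16, max_util_pct=50.0)
print([a["name"] for a in targets])  # prints: ['rank12']
```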
33. Appendix M: ESS Array HealthCheck and Drill Down
34.
35. Biography Brett Allison has been doing distributed systems performance-related work since 1997, including J2EE application analysis, UNIX/NT, and Storage technologies. His current role is performance analyst for the IGS Managed Storage Services offering. MSS currently manages over 1 Petabyte of data. He has developed a number of performance analysis tools used internally by ITS/IGS. He has spoken at a previous Storage Symposium and is the author of several White Papers on performance.
Editor's Notes
Shared storage cannot be left to chance! Performance is too important! Shared storage that is not managed for performance has a lot in common with chaos: confused customers (why is I/O response time unpredictable and unexpectedly high?) and workloads that don’t play nice mixed in with workloads that do. Without proper planning, runaway staging and development server applications can clobber shared resources and impact production users.
Cost savings can be achieved through physical consolidation of storage by:
  Reduction of staff overhead/redundancy
  Reuse/deepening intellectual capital
  Standardization of storage architecture
  Process optimization
  Decreased provisioning time
  Increased capacity usage/reduction in excess capacity
  Reduced tape/backup resources
  Improved Scalability/Availability
  Out of band storage reduces network load
“Shared storage” typically refers to the storage shared on a SAN. This includes the Storage Area Network switches and other fabric components (ISLs, routers, etc.). Link information includes throughput, packets/sec, and errors.
Storage is shared on the Storage Server. This includes but is not limited to: Front-End Adapters (HBAs), I/O planar/System bus, CPUs, Cache, NVS, Backend Adapters, and Physical Disk Drives. Server HBAs may be shared on partitioned servers, but this is out of scope of this presentation. DS6000 and DS4000 do not support ESCON. DS4000 and Open Systems do not support FICON. DS4000 to DS8000 support the SCSI protocol on FC, but not real SCSI. DS4000 to DS8000 do not support SSA; rather, they use switched FC.
Each Volume consists of some number of 32 KB stripes. The number of stripes associated with each volume is dictated by the volume size, which is user configurable. Because stripes from different volumes sit on the same physical disks, you can easily see how activity from different parts of the same server, or from other servers, can impact performance.
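As a quick worked example of the stripe arithmetic, here is a small sketch. It assumes the volume size is expressed in binary gigabytes (1 GB = 1024 x 1024 KB), which the notes do not state; the function name is just for illustration.

```python
STRIPE_KB = 32  # stripe size given in the note above


def stripes_for_volume(volume_gb):
    """Number of 32 KB stripes in a volume, assuming binary GB."""
    volume_kb = volume_gb * 1024 * 1024
    return int(volume_kb // STRIPE_KB)


# An 8 GB LUN, like the 8.0 size shown in the lsvp example.
print(stripes_for_volume(8))  # prints: 262144
```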
Planning – Assessment and design. Reactive – Identification of resources that are over-utilized, and load reduction. The majority of the presentation focuses on this area. Proactive – Recommendations to manage resources to levels where contention does not impact the customer. Predictive – Identification of trends in resource consumption caused by organic growth and business trends, and recommendation of preventative steps to mitigate the impact of increased consumption.
Determine storage performance requirements:
  Average/Max I/O bandwidth per GB
  Workload type (reads vs writes; random vs sequential)
  Workload variance
  I/O response time or throughput requirements (SLAs)
Design for customer requirements
  Determine sharing level (shared, storage server, network, etc.) for business or workload requirements
  Follow Best Practices – See Appendix A
During the planning phase it is important to understand the customer’s workload intensity and characteristics. Use these to design a storage solution that meets requirements. Not all workloads are well suited for shared storage. Be extremely careful when defining SLAs around performance in shared storage. There are some excellent planning guides and material available for Disk. They are beyond the scope of this presentation but can be provided upon request.
Confirm that the issue is NOT with server resources
  Verify that host CPU utilization, paging I/O, and local HBA saturation are not the source of performance issues
Identify any host disks with high I/O response time
  See Appendix C for ESS
Map the host disk to Storage Server device name
  See Appendix D for ESS
Gather Storage Server performance and configuration data
  Appendices E & F: Sample queries for ESS data
Gather SAN fabric configuration and exception data
  If port saturation then contact SAN design team
Analyze storage server configuration and performance data
  If ESS issue exists then recommend corrective actions
Why do we use I/O response time? On most systems with virtualized storage and multiple paths, the disk utilization numbers are misleading. You might have a device that shows 100% busy but has excellent response time. The device is not actually 100% busy because it is not really a device but a path to a logical storage unit located on multiple devices. The most telling type of I/O metric is the I/O response time. LUN Serial Numbers can be used to correlate the Storage Server performance data with the server physical device information. Unfortunately the LSS and Rank information provided by the ESS utilities do NOT match the information stored in the ESS Expert.
The AIX filemon tool is a trace-based facility and should only be run for a couple of minutes at a time. The other UNIX flavors provide I/O response time data that can be gathered continuously at reasonable intervals, as they are not trace based (See Appendix C for other flavors). The read size is always reported in 512-byte blocks. In this case there were 1659 reads. The average read size was 8 blocks (512-byte blocks), or 4096 bytes (4 KB). These are random I/Os, as the number of read sequences is the same as the number of I/Os. The minimal information that you need to pull from this is: time when filemon started, volume, reads, avg read time, writes, avg write time. I would filter out any records that have 1 I/O or less. For the LUN -> hdisk mapping, the ESS Utilities provide the ‘lsvp -a’ command (See Appendix D). Minimally you will want to pull the hostname and the hdisk information on a daily basis if you have access to the servers, or install an agent that FTPs the information to somewhere you can load it into a configuration database. After the data has been formatted, sort by the highest average response time. It is helpful to create a pivot table, average the I/O response times for each of the LUNs, and create a sorted list of the LUNs with the highest response time.
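For readers rolling their own formatter, the extraction and filtering described above might look something like the sketch below. It is not the Appendix H script. It assumes the filemon stanza layout matches the excerpt on the slide, handles reads only, and uses a second stanza (hdisk7) that is made up purely to show the "1 I/O or less" filter.

```python
import re

# First stanza is from the slide; the hdisk7 stanza is invented for illustration.
SAMPLE = """\
VOLUME: /dev/hdisk23  description: IBM FC 2105800
reads:                1659 (0 errs)
  read sizes (blks):  avg 8.0 min 8 max 8 sdev 0.0
  read times (msec):  avg 30.25 min 13.335 max 36.228 sdev 6.082
  read sequences:     1659
VOLUME: /dev/hdisk7  description: IBM FC 2105800
reads:                1 (0 errs)
  read sizes (blks):  avg 8.0 min 8 max 8 sdev 0.0
  read times (msec):  avg 4.10 min 4.100 max 4.100 sdev 0.000
"""


def parse_filemon(text):
    """Pull volume, read count, avg read size (KB) and avg read time (ms)."""
    records, current = [], None
    for raw in text.splitlines():
        line = raw.strip()
        m = re.match(r"VOLUME:\s*(\S+)", line)
        if m:
            current = {"volume": m.group(1), "reads": 0,
                       "avg_read_kb": None, "avg_read_ms": None}
            records.append(current)
            continue
        if current is None:
            continue  # skip header lines before the first VOLUME stanza
        m = re.match(r"reads:\s+(\d+)", line)
        if m:
            current["reads"] = int(m.group(1))
            continue
        m = re.match(r"read sizes \(blks\):\s+avg\s+([\d.]+)", line)
        if m:
            # filemon reports sizes in 512-byte blocks; convert to KB
            current["avg_read_kb"] = float(m.group(1)) * 512 / 1024
            continue
        m = re.match(r"read times \(msec\):\s+avg\s+([\d.]+)", line)
        if m:
            current["avg_read_ms"] = float(m.group(1))
    return records


records = parse_filemon(SAMPLE)
# Drop volumes with 1 I/O or less, then sort by highest avg response time.
hot = sorted((r for r in records if r["reads"] > 1),
             key=lambda r: r["avg_read_ms"] or 0.0, reverse=True)
for r in hot:
    print(r["volume"], r["reads"], r["avg_read_ms"], r["avg_read_kb"])
```

A real formatter would also capture the write statistics and the filemon start time, and would join each hdisk to its LUN serial number from the ‘lsvp -a’ output.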
After identifying the LUNs with the highest response time, it is helpful to look at how the response times of those LUNs are distributed. This analysis is the last step in the identification of the LUNs with high I/O response time. I like to summarize the read response time data by percentiles using the normalized filemon data (output of previous step). If the workload is primarily reads (typical of online) I like to focus on the read response time. I also like to summarize by ESS, if there are multiple ESSs, using Excel pivot tables. Once I have summarized by ESS I like to determine if there is an actual I/O response time issue. If the I/O response times are greater than reasonable then there is likely contention in the I/O subsystem. This is where you have to drill down to the next level. The analysis assumes a representative sample of the I/Os. Garbage in = Garbage out!
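The notes describe doing this summary with Excel pivot tables. For anyone scripting it instead, a percentile summary per ESS could be sketched as follows. The nearest-rank percentile definition is my choice, not something the notes specify, and all sample values and the second ESS serial are hypothetical.

```python
import math
from collections import defaultdict


def percentile(values, pct):
    """Nearest-rank percentile: smallest sample with at least pct% at or below it."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# (ESS serial, avg read response time in ms) pairs; hypothetical values.
samples = [
    ("12345", 4.1), ("12345", 5.0), ("12345", 6.2), ("12345", 7.9),
    ("12345", 8.3), ("12345", 9.0), ("12345", 12.5), ("12345", 18.0),
    ("12345", 30.25), ("12345", 36.2),
    ("23456", 3.0), ("23456", 3.5), ("23456", 4.0), ("23456", 5.5),
]

by_ess = defaultdict(list)
for ess, read_ms in samples:
    by_ess[ess].append(read_ms)

summary = {
    ess: {p: percentile(vals, p) for p in (50, 90, 95)}
    for ess, vals in by_ess.items()
}
for ess, pcts in sorted(summary.items()):
    print(ess, pcts)
```

Comparing the 90th or 95th percentile against what you consider reasonable for the I/O size is one way to decide whether to drill down further.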
I like to look at the array disk utilizations over time as well as the throughput and I/O rates. The spike on rank40 only lasted 15 minutes. Since there is only 1 spike, it is not a good candidate for migrating data. High disk utilization is not always bad, as it indicates you are getting more use from your hardware. However, as utilization increases, so does queue time, and as queue time increases, response time increases. At some point the response time may increase to a point where OLTP clients are negatively and noticeably impacted.
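To get a feel for how quickly queue time grows with utilization, here is a rough illustration using the textbook single-server (M/M/1) approximation R = S / (1 - U). This model is my own substitution for illustration; the presentation does not use it, real arrays with cache and multiple disks behave differently, and the 6 ms service time is a hypothetical figure.

```python
def mm1_response_ms(service_ms, utilization):
    """Approximate response time for a single-server queue: R = S / (1 - U)."""
    if not 0 <= utilization < 1:
        raise ValueError("utilization must be in [0, 1)")
    return service_ms / (1.0 - utilization)


SERVICE_MS = 6.0  # hypothetical per-I/O disk service time

for util in (0.30, 0.50, 0.68, 0.90):
    est = mm1_response_ms(SERVICE_MS, util)
    print(f"{util:.0%} busy -> roughly {est:.1f} ms")
```

The point is the shape of the curve rather than the numbers: response time climbs slowly at low utilization and steeply as the resource approaches saturation.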
For the sake of argument, let's assume that based on our prior I like to summarize the arrays for the time period by looking at a number of metrics for each array, including the configuration, creating a Rank Score, and sorting in descending order. The best way to
Verify that the problem is recurring based on multiple data points – never tune to just one data point
Identify which LUNs and associated servers are driving I/O to the over-utilized resource
Determine a reasonable target reduction in I/O
Identify 1 or more LUNs to migrate to a lesser utilized Cluster, Adapter, or Array
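The last two steps can be sketched as a simple largest-first selection. This is an illustration only: the LUN names and I/O rates are hypothetical, and in practice you would also weigh who owns each LUN and where it can go.

```python
def pick_luns_to_migrate(lun_iops, target_reduction):
    """Pick LUNs, largest I/O rate first, until the target reduction is met.

    lun_iops maps LUN id -> average I/O rate on the over-utilized array.
    Returns (chosen LUN ids, total I/O rate removed).
    """
    chosen, removed = [], 0.0
    ranked = sorted(lun_iops.items(), key=lambda kv: kv[1], reverse=True)
    for lun, iops in ranked:
        if removed >= target_reduction:
            break
        chosen.append(lun)
        removed += iops
    return chosen, removed


# Hypothetical per-LUN I/O rates on the hot array.
lun_iops = {"739": 400.0, "710": 150.0, "026": 50.0}

chosen, removed = pick_luns_to_migrate(lun_iops, target_reduction=500.0)
print(chosen, removed)  # prints: ['739', '710'] 550.0
```

Largest-first keeps the number of migrations small but can overshoot the target, so check that the destination array can absorb the load being moved.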
Time stamps for the ESS data reported in the ESS Expert reflect the clocks of the ESS cluster. The ESS cluster clock is set manually by the CE and is not synchronized with an external time server. It might be hours or even days off. It is important to understand the offset before beginning your analysis. Now that we have confirmed that there is some type of I/O subsystem degradation, we should confirm that the issue is not at the ESS. This slide asks the key questions that can be answered with ESS level data. I have provided queries in Appendices E & F to pull the raw performance data. I like to look at the array utilization, I/O rates, and throughput, and use pivot tables to summarize at the ESS, Cluster, and adapter level. Depending on the ESS configuration, including model, the cache, the disk drive RPMs, and the adapter bandwidth can vary. Rules of thumb should be derived for your environment using empirical data and correlation of I/O response times with ESS data.
Time Stamps – Time stamps are not likely to be close, and intervals differ between tools and samples. The time stamp of ESS data is based on the clock of the ESS server, which is set by the CE and is not synchronized with an external reference. The time stamps might be days off. Make sure you understand the relative offset! The interval is also a gotcha, as it is unrealistic to gather data any more frequently than every 15 minutes. This may cause a problem when attempting to view server data that is collected at trace level, recording all I/Os for a fixed period. Direct correlation is difficult if not impossible at times.
Variance – High workload variance may create problems with correlation (Don’t tune for 1 data point!). If you have workloads that change frequently, causing spikes in I/O, you may not catch them in the 15 minute interval that ESS Expert typically collects for. Adjust “reasonable” response time expectations based on I/O size and customer requirements.
Lack of Config. Info. – LUN level data does not contain reliable server information. You have to go to the server to get reliable server names. This is because the server names in the Storage Expert are taken from the server names entered by an operator in the StorWatch Specialist. They may be incorrect. This may require installing an agent on the server to push the data to a central repository. Consider this a must!
Measure-ability and Availability of Data – The data you need might not be available (See Appendix K for the metrics that are available). Some of the components were not built to be measured, so they are essentially un-manageable! The TotalStorage Expert in particular has availability issues, particularly when collecting data from more than 1 ESS. This might cause data to be missing!
Expectations – What is the problem, and what is the service level definition for the problem resolution?
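Once you know the clock offset, lining up ESS intervals with host samples is mostly date arithmetic. A small sketch follows, assuming you have measured the offset yourself (the 2 hour 7 minute value below is hypothetical) and that ESS data is collected in 15 minute intervals as described above.

```python
from datetime import datetime, timedelta


def to_host_time(ess_time, ess_clock_offset):
    """Convert an ESS timestamp to host time.

    ess_clock_offset = (ESS clock) - (host clock), measured by hand.
    """
    return ess_time - ess_clock_offset


def interval_start(ts, minutes=15):
    """Floor a timestamp to the start of its collection interval."""
    return ts - timedelta(minutes=ts.minute % minutes,
                          seconds=ts.second,
                          microseconds=ts.microsecond)


# Hypothetical: the ESS clock runs 2 hours 7 minutes ahead of the host.
offset = timedelta(hours=2, minutes=7)

ess_interval = datetime(2005, 12, 17, 10, 52)        # as recorded by the ESS
host_sample = datetime(2005, 12, 17, 8, 52, 30)      # filemon run on the host

print(to_host_time(ess_interval, offset))  # prints: 2005-12-17 08:45:00
print(interval_start(host_sample))         # prints: 2005-12-17 08:45:00
```

Corrected ESS intervals will not always land on neat quarter-hour boundaries in host time, so in practice you match each host sample to the ESS interval that contains it rather than expecting exact equality.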
Develop rules of thumb for measurable resources in your environment based on reasonable assumptions
Collect data for each ESS/Storage Server daily
Summarize data and apply rules of thumb (be conservative)
Examine prime shift and 24 hour period at a minimum
Save summarized AND exception data
Review data on a daily basis at first, and weekly later
Create a health check of the environment that can be given to customers
Educate customers about activities that negatively impact themselves and others (DB loads, backups, etc.) and set policies to perform them offline (this gets very dicey!)
Develop a process for reducing load
Include performance reviews as part of the new LUN allocation process (Appendix L)
Create a capacity report that considers performance
Identify customer requirements and develop a process for evaluating their impact on the shared environment
Avoid placing highly sequential, high variance workloads in a highly sensitive shared environment
Identify components that are trending towards contention
There are a number of reports that can be used and exceptions that can be created in the TSE; however, for large environments this may not be very usable. The current IBM product for managing the performance of multiple storage devices is called MDM. In the fall IBM Total Productivity Center will roll out performance support for ESS, DS8000 and other devices. This may be a better alternative to a roll-your-own approach.
Generally speaking, the I/O response time is the amount of time it takes from the point where the I/O request hits the device driver until the I/O is returned from the device driver.
Tools are imperfect and there is no way to clearly trace the time spent at each of the components on the SAN I/O path. In some cases there may be no clear correlation between high I/O response time and over-utilization of a shared SAN component.
For IBM’ers I have a sample script that I can make available. For external customers I would advise you to contact your local IBM AIX field reps to see if they have anything or roll your own script.
Contact TSE support for the Q_IO_SEQ and Q_CL_NVS_FULL_PRCT reporting issues. The patch fixes a problem with the VSXPCalculator.class
This is an example of a report we use, called the Rank Report or Array level report. It provides a sorted view of the hottest arrays on any selected ESS(s) and provides a drill down to the array level exceptions if available. This is a quick way to see if array level contention exists.