
PT203: SOS! NUTANIX TROUBLESHOOTING

Resolve issues faster, and learn how to partner with Nutanix support teams. Attendees at this technical session will get acquainted with critical support tools, along with tried and true best practices for successfully managing Nutanix environments. Listen in as Nutanix engineers share how to get the most out of Pulse HD, System Alerts, Nutanix Cluster Check (NCC), and show you live how to troubleshoot a real production issue.

Published in: Technology

  1. @nutanix #nextconf #PT203
  2. Forward-Looking Statement Disclaimer
     This presentation and the accompanying oral commentary may include express and implied forward-looking statements, including but not limited to statements concerning our business plans and objectives, product features and technology that are under development or in process and capabilities of such product features and technology, our plans to introduce product features in future releases, the implementation of our products on additional hardware platforms, strategic partnerships that are in process, product performance, competitive position, industry environment, and potential market opportunities. These forward-looking statements are not historical facts, and instead are based on our current expectations, estimates, opinions and beliefs. The accuracy of such forward-looking statements depends upon future events, and involves risks, uncertainties and other factors beyond our control that may cause these statements to be inaccurate and cause our actual results, performance or achievements to differ materially and adversely from those anticipated or implied by such statements, including, among others: failure to develop, or unexpected difficulties or delays in developing, new product features or technology on a timely or cost-effective basis; delays in or lack of customer or market acceptance of our new product features or technology; the failure of our software to interoperate on different hardware platforms; failure to form, or delays in the formation of, new strategic partnerships and the possibility that we may not receive anticipated results from forming such strategic partnerships; the introduction, or acceleration of adoption of, competing solutions, including public cloud infrastructure; a shift in industry or competitive dynamics or customer demand; and other risks detailed in our Annual Report on Form 10-K for the fiscal year ended July 31, 2017, filed with the Securities and Exchange Commission. These forward-looking statements speak only as of the date of this presentation and, except as required by law, we assume no obligation to update forward-looking statements to reflect actual results or subsequent events or circumstances.
     Any future product or roadmap information is intended to outline general product directions, and is not a commitment, promise or legal obligation for Nutanix to deliver any material, code, or functionality. This information should not be used when making a purchasing decision. Further, note that Nutanix has made no determination as to whether separate fees will be charged for any future product enhancements or functionality which may ultimately be made available. Nutanix may, in its own discretion, choose to charge separate fees for the delivery of any product enhancements or functionality which are ultimately made available.
     Certain information contained in this presentation and the accompanying oral commentary may relate to or be based on studies, publications, surveys and other data obtained from third-party sources and our own internal estimates and research. While we believe these third-party studies, publications, surveys and other data are reliable as of the date of this presentation, they have not been independently verified, and we make no representation as to the adequacy, fairness, accuracy, or completeness of any information obtained from third-party sources.
     Trademark Disclaimer
     © 2017 Nutanix, Inc. All rights reserved. Nutanix, the Enterprise Cloud Platform, the Nutanix logo and any other Nutanix products and features mentioned herein are registered trademarks or trademarks of Nutanix, Inc. in the United States and other countries. All other brand names and logos mentioned herein are for identification purposes only and are the property of their respective holder(s); Nutanix may not be associated with, or sponsored or endorsed by, such holder(s).
  3. Agenda
     I. MANAGING NUTANIX ENVIRONMENTS
     • Cluster Monitoring
     • NCC overview
     • Prism Analysis (and Prism Central)
     II. TROUBLESHOOTING NUTANIX ENVIRONMENTS
     • General Troubleshooting
     • Troubleshooting Scenarios
     • Engaging support best practices
     • Additional Resources
     III. Q/A
  4. Monitoring: SNMP, Email, Prism Alerts, Syslog, Pulse
  5. Prism Alerts & Pulse HD Phone Home Alerts Hourly Cluster Reports Automatic Case generation Deep Analytics And Inventory Pulse Cluster Health Prism Alerts
  6. Auto-case Generation Example
  7. Auto-case Generation
     For up-to-date information, see KB 1959 on the portal: http://portal.nutanix.com/kb/1959
     For customers running on our partners' hardware platforms, we generate software-based alerts that trigger automatic support cases.
     THESE ALERTS WILL AUTO GENERATE SUPPORT CASES:
     • Stargate process is down for more than 3 hours (StargateTemporarilyDown)
     • Curator scan fails (CuratorScanFailure)
     • Running out of space on the cluster
     • Running out of space on CVMs
     • Hardware Clock Failure (HardwareClockFailure)
     • Faulty RAM module (RAMFault)
     • Power Supply failure (PowerSupplyDown)
  8. Working with Prism Alerts
     • Integrated RCA
     • Resolution Steps
     • Modify Priority
     • Neglected environment? Clear all alerts and see what bubbles up.
  9. Working with Prism Central Alerts Dashboard
  10. NCC Health Checks
     • CLI: ncc health_checks run_all
     • Prism (AOS 5.x)
  11. NCC Checks: Check Statuses
  12. Prism Analysis: Change Time Range, Choose Charts, Create Charts, Alerts and Events
  13. Entity & Metric Charts
  14. Prism Central Analysis
  15. Troubleshooting Nutanix Environments: A Framework
     • Problem Isolation
     • Fixes and Mitigations
     • Root Cause Analysis
     • Product Improvement
  16. Troubleshooting by Layers
     • APPLICATION: SQL, VDI, Oracle RAC, etc.
     • CVM: Stargate, Curator, Cassandra, etc.
     • HYPERVISOR: AHV, ESXi, Hyper-V, XenServer
     • HARDWARE: NVMe, SSD, HDD, Memory, NIC, Processor, etc.
     • NETWORK: OVS, vSwitch, Physical Switch, etc.
  17. Troubleshooting: Problem Isolation
     • Rapidly reduce the scope of the failure domain to reach a resolution faster.
     • Any recent changes in the environment?
     IMPACT
     • Is storage available?
     • Are there performance issues?
     • Can you reach Prism?
     USE BUILT-IN REPORTING
     • Prism Alerts
     • Cluster Health
     • NCC
     • Cluster logs
     • User Reports
  18. Troubleshooting: Problem Isolation - Cluster Status
     Helpful additional commands:
     • cluster status | grep -v UP → shows a condensed version
     • genesis status → shows only local services/processes
  19. Troubleshooting: Problem Isolation - allssh, hostssh, NCC, Logging
     • allssh
     • NCC
     LOGGING
     • /home/nutanix/data/logs and sysstats
     • INFO, WARN, ERROR, FATAL
     • allssh "ls -ltr data/logs/*.FATAL"
     • If FATALs are actively occurring while you are experiencing issues, they may be related.
     • hostssh "vmware -vl" instead of allssh 'ssh -l root 192.168.5.1 "vmware -vl"'
     • If you are seeing an error, check the Nutanix Knowledge Base!
  20. Problem Isolation - Data Resiliency Status
     • ncli cluster get-domain-fault-tolerance-status type=node
  21. Typical Troubleshooting Scenarios
     UPGRADE IS NOT PROGRESSING
     • Logging: genesis.out, host_upgrade.out, firmware_upgrade.out
     • upgrade_status
     • host_upgrade_status
     • firmware_upgrade_status
     STORAGE UNAVAILABLE
     • Do all CVMs have connectivity to each other and to the hypervisor?
     • Recent Stargate FATALs?
     • Cassandra status?
     REPLICATION, SNAPSHOTTING, AND METRO RELATED ISSUES
     • Logging: Cerebro logs
     NCC / HEALTH CHECKS FAILING
     • Running NCC should indicate the nature of the issue and point to a KB article describing how to resolve it.
  22. Scenario - Host Offline
  23. Root Cause Analysis - Log Collection
  24. Best Practices for Engaging Support
     • Update your break/fix contact via My Nutanix Portal
     • Upgrade to the latest NCC and start a health_check
     • Give a clear problem description
     • What steps have you already taken?
     • Keep components on the recommended version levels
     • Press the Escalate button in the portal for immediate attention
     • Provide feedback after case closure. Surveys matter!
     COMPATIBILITY MATRIX:
  25. Additional Resources
     • The Nutanix Bible - Architecture details
     • portal.nutanix.com - Nutanix Support Portal, KBs, Documentation, Software, etc.
     • portal.nutanix.com/kb/4530 - Additional troubleshooting details for Acropolis File Services
     IF YOU LIKED THIS SESSION, YOU MAY ALSO LIKE:
     • Nutanix Architecture Deep Dive and the Deep Dive Super Session
     • Getting the Network Right (The First Time)
     • Fail Fast and Never Again
     • AHV - Virtualization You Always Wanted
  26. @nutanix #nextconf #PT203
     Thank You
     Bassem Rezkalla | bassem.rezkalla@nutanix.com
     Guido Hagemann | guido.hagemann@nutanix.com | @guido.hagemann
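
Sketch for slide 7 (auto-case alerts). The named alerts that open support cases automatically could be kept in a small lookup when triaging an alert feed. This is only an illustration: the five alert identifiers are copied from the slide, while the names `AUTO_CASE_ALERTS` and `opens_auto_case` are hypothetical. The two capacity conditions have no identifier on the slide and are left out, and KB 1959 remains the authoritative, current list.

```python
# Alert identifiers copied from slide 7. The list may be outdated; KB 1959
# on the Nutanix portal is the authoritative source.
AUTO_CASE_ALERTS = frozenset({
    "StargateTemporarilyDown",
    "CuratorScanFailure",
    "HardwareClockFailure",
    "RAMFault",
    "PowerSupplyDown",
})


def opens_auto_case(alert_name: str) -> bool:
    """Return True if slide 7 lists this alert as auto-generating a case.

    Hypothetical helper: it only checks membership in the set above and
    ignores surrounding whitespace in the given name.
    """
    return alert_name.strip() in AUTO_CASE_ALERTS


if __name__ == "__main__":
    for name in ("RAMFault", "SomeOtherAlert"):
        print(name, opens_auto_case(name))
```

An alert that is not in the set may still deserve a manually opened case; the lookup only mirrors what the slide states.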

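
Sketch for slide 19 (FATAL logs). The slide suggests listing `*.FATAL` files and asking whether they are actively occurring. A minimal Python sketch of that check for a single log directory follows, assuming "actively occurring" can be approximated by file modification time. The function name `recent_fatal_logs` and the one-hour default window are assumptions, not Nutanix tooling; only the log path comes from the slide.

```python
import time
from pathlib import Path


def recent_fatal_logs(log_dir, max_age_seconds=3600, now=None):
    """Return *.FATAL files in log_dir modified within max_age_seconds.

    Results are ordered oldest first, like `ls -ltr`. The window length is
    an assumption for illustration; pick one that matches the incident.
    """
    now = time.time() if now is None else now
    candidates = []
    for path in Path(log_dir).glob("*.FATAL"):
        mtime = path.stat().st_mtime
        if now - mtime <= max_age_seconds:
            candidates.append((mtime, path))
    # Sort by modification time only, oldest first.
    candidates.sort(key=lambda item: item[0])
    return [path for _, path in candidates]


if __name__ == "__main__":
    # Path taken from the slide; it exists only on a CVM.
    for path in recent_fatal_logs("/home/nutanix/data/logs"):
        print(path)
```

This covers one node only; the slide's `allssh` wrapper is what runs the equivalent listing across all CVMs.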