"It can always get worse!" – Lessons Learned in over 20 years working with Oracle MAA

First presented during the DOAG 2022 Conference and Exhibition, this presentation discusses and reviews the most significant lessons learned in over 20 years of working with Oracle Maximum Availability Architecture. It explains why documentation is good, but automated checks are better, and why standardization can help increase the availability of nearly all systems, including database systems.

  1. "It can always get worse!" – Lessons Learned in over 20 years working with Oracle MAA
     Markus Michalewicz, Vice President of Product Management, Database HA, Scalability, DR, MAA, and ZDM
     slideshare.net/markusmichalewicz, @KnownAsMarkus, markusmichalewicz, markus.michalewicz@oracle.com
  2. Recollection of an Early Incident (20+ years ago): a dialog on a critical escalation call (translated from German)
     • Customer / Partner (P): When we pull the network cable from the RAC, the cluster goes down – what HA is that?
     • Oracle: What network cable, please?
     • P: The network cable on the server, of course.
     • Oracle: The public network or the network for the private interconnect connecting the servers?
     • P: What interconnect?
     • An awkward moment of silence…
     • Customer: I think we are done here.
     Copyright © 2022, Oracle and/or its affiliates
  3. Lessons Learned: Agenda
     1. It can always get worse than one thought.
     2. Documentation is good, checks are better.
     3. Standardization and the cloud help.
     4. There is no magic key for availability.
  4. Oracle Maximum Availability Architecture (MAA) [overview diagram]
     • Reference architectures: Bronze, Silver, Gold, Platinum
     • Deployment choices: Generic Systems, Engineered Systems, BaseDB, ExaDB/ExaCC, Autonomous DB
     • HA features, configurations and operational practices, in four areas: Scale out & Lifecycle, Data protection, Continuous availability, Active replication
     • Features shown: Flashback, RMAN + ZDLRA, Application Continuity, Edition-based Redefinition, Active Data Guard, RAC, Sharding, FPP, GoldenGate, Online Redefinition, Zero Downtime Migration (ZDM)
     • Customer insights and expert recommendations
     • Topology: production site and replicated site, connected by replication, 24/7
  5. Lessons Learned: Agenda (repeated): 1. It can always get worse than one thought. 2. Documentation is good, checks are better. 3. Standardization and the cloud help. 4. There is no magic key for availability.
  6. Downtime Protection is Important: practical relevance
     • Financial risk
        - Business interruption means revenue loss
        - Unplanned recovery costs
        - Reputational / brand damage can reduce market value
     • Customer risk
        - Customers who have a bad experience may not return
        - Widely publicized outages make it harder to attract new customers
     • Regulatory risk
        - Regulated businesses may face penalties for unplanned interruptions
        - May also be subject to additional ongoing scrutiny
  7. Impact of downtime
     • $350K: average cost of downtime per hour
     • $10M: average cost of unplanned data center outage or disaster
     • 87 hours: average amount of downtime per year
     • 91%: percentage of companies that have experienced an unplanned data center outage in the last 24 months
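To put these averages in perspective, they can be combined in a back-of-the-envelope calculation. The sketch below assumes that the hourly cost and the yearly downtime figure describe the same system, which the slide does not state, and it uses a 365-day year. The function names are illustrative only.

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours, ignoring leap years


def annual_downtime_cost(cost_per_hour: float, downtime_hours: float) -> float:
    """Return the yearly cost of downtime in the currency of cost_per_hour."""
    return cost_per_hour * downtime_hours


def availability_percent(downtime_hours: float) -> float:
    """Return availability as a percentage of a full year."""
    return 100.0 * (1.0 - downtime_hours / HOURS_PER_YEAR)


if __name__ == "__main__":
    # Figures taken from the slide: $350K per hour, 87 hours per year.
    cost = annual_downtime_cost(350_000, 87)
    print(f"Annual downtime cost: ${cost:,.0f}")
    print(f"Availability: {availability_percent(87):.2f}%")
```

With the slide's figures this works out to $30,450,000 per year and roughly 99.01% availability, that is, only about "two nines".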
  8. Data Failures Happen
     With increasing data volume and complex IO subsystems, data failures are inevitable. Storage systems have known issues:
     • Schroeder and Gibson, Disk Failures in the Real World: What Does an MTTF of 1,000,000 Hours Mean to You?, 2007
     • Krioukov, et al, Parity Lost and Parity Regained, 2008
     • Bairavasundaram, et al, An Analysis of Data Corruption in the Storage Stack, 2008
     • Jiang, et al, Are Disks the Dominant Contributor for Storage Failures?, 2008
     • Zheng, et al, Understanding the Robustness of SSDs under Power Fault, 2013
     • InfoWorld Tech Watch, Test your SSDs or risk massive data loss, researchers warn, 2013
     More insidious than outright failures are latent data corruptions.
  9. Real Life Business Breakdowns
     “There are some things you have to experience to understand.” --Anonymous
     • California DMV loses two backup systems due to outage (Source: CBS Sacramento)
        - Simultaneous hard drive failures in both primary and backup systems
        - Impacted operations at 100 field offices
        - Outage shut down operations for several days
     • Ransomware leads to cancellation of 2800 patient procedures (Source: Financial Times)
        - Attack occurred “before the necessary work on the weakest parts of the system had been completed”
        - Halted operations at three Goole NHS Foundation Trust hospitals for five days
     • 5-hour Delta Airlines outage cost $150M (Source: CNN Business)
        - Power outage at operations center resulted in 2000+ flight cancellations
        - Critical systems failed to switch over to backups
        - Many affected customers were given refunds + vouchers for future travel
     • 4-hour data center shutdown takes 2% off Wells Fargo share price (Source: thestreet.com)
        - Restoration process interrupted transactions, resulting in missing deposits
        - 5500 Wells Fargo branches had to temporarily offer extended hours
        - CTO departed a month later
  10. Oracle Maximum Availability Architecture (MAA) [overview diagram, repeated from slide 4]
  11. MAA reference architectures: availability service levels
      • Bronze (dev, test, prod): single instance DB, restartable, backup/restore
      • Silver (prod/departmental): Bronze + database HA with RAC, Application Continuity
      • Gold (business critical): Silver + DB replication with Active Data Guard
      • Platinum (mission critical): Gold + GoldenGate, Edition-Based Redefinition
      All tiers exist on-premises and in the cloud. However, Platinum currently must be configured manually, while Bronze through Gold are mostly covered by cloud tooling automation, depending on the desired RTO (for example, FSFO and multiple standby databases still must be configured manually).
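Because each tier builds on the one below it, the reference architectures can be modeled as a small inheritance chain. The following is a minimal sketch of that idea; the `Tier` class and the `capabilities` helper are hypothetical names used for illustration and are not part of any Oracle tooling.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass(frozen=True)
class Tier:
    name: str
    use_case: str
    adds: Tuple[str, ...]             # capabilities this tier adds
    builds_on: Optional[str] = None   # name of the tier it extends, if any


TIERS: Dict[str, Tier] = {
    "Bronze": Tier("Bronze", "Dev, test, prod",
                   ("Single instance DB", "Restartable", "Backup/restore")),
    "Silver": Tier("Silver", "Prod/departmental",
                   ("Database HA with RAC", "Application Continuity"), "Bronze"),
    "Gold": Tier("Gold", "Business critical",
                 ("DB replication with Active Data Guard",), "Silver"),
    "Platinum": Tier("Platinum", "Mission critical",
                     ("GoldenGate", "Edition-Based Redefinition"), "Gold"),
}


def capabilities(tier_name: str) -> List[str]:
    """Return all capabilities of a tier, including those it inherits."""
    tier = TIERS[tier_name]
    inherited = capabilities(tier.builds_on) if tier.builds_on else []
    return inherited + list(tier.adds)


if __name__ == "__main__":
    for name in TIERS:
        print(f"{name}: {', '.join(capabilities(name))}")
```

Listing the cumulative capabilities per tier makes it easy to see what moving up a level adds, which is the comparison the slide invites.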
  12. Lessons Learned: Agenda for this talk (repeated): 1. It can always get worse than one thought. 2. Documentation is good, checks are better. 3. Standardization and the cloud help. 4. There is no magic key for availability.
  13. Documentation is Good
      • https://www.oracle.com/database/technologies/high-availability/oracle-database-maa-best-practices.html
      • https://www.oracle.com/database/technologies/high-availability/oracle-applications-maa.html
      • 119 pages
  14. Checks are Better: use pre-checks and regular checks
      • ORAchk/EXAchk
        https://docs.oracle.com/en/engineered-systems/health-diagnostics/exachk/oexug/oracle-orachk-and-exachk-common-features-tasks.html
        Use:
        - Actively on the command line
        - In daemon mode (scheduled)
        - With profiles, e.g. for: asm, clusterware, goldengate, maa, …
      • MAA Score Card
        https://docs.oracle.com/en/engineered-systems/health-diagnostics/exachk/oexug/understanding-and-managing-reports-and-output.html
      • ACchk
        https://docs.oracle.com/en/engineered-systems/health-diagnostics/exachk/oexug/deploying-application-continuity.html
        - Use Oracle ORAchk to confirm system readiness for implementing Application Continuity
        - Provides a textual coverage report / an ACchk Scorecard
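Running a profile check on a schedule could be wrapped in a small script. The sketch below is an illustration under assumptions: it assumes `orachk` is installed and on the `PATH`, and that profiles are selected with the `-profile` option as described in the ORAchk user's guide linked above. The helper names are hypothetical, and by default the script only prints the command instead of executing it.

```python
import shutil
import subprocess
from typing import List

# Profiles named on the slide; ORAchk offers more (the slide ends with "…").
SLIDE_PROFILES = ("asm", "clusterware", "goldengate", "maa")


def build_orachk_command(profile: str, binary: str = "orachk") -> List[str]:
    """Build the argument list for a profile-based ORAchk run."""
    if profile not in SLIDE_PROFILES:
        raise ValueError(f"unknown profile for this sketch: {profile!r}")
    return [binary, "-profile", profile]


def run_profile_check(profile: str, dry_run: bool = True) -> int:
    """Print the command (dry run) or execute it and return its exit code."""
    cmd = build_orachk_command(profile)
    if dry_run:
        print("Would run:", " ".join(cmd))
        return 0
    if shutil.which(cmd[0]) is None:
        raise FileNotFoundError(f"{cmd[0]} not found on PATH")
    return subprocess.run(cmd, check=False).returncode


if __name__ == "__main__":
    run_profile_check("maa")  # dry run: prints the command only
```

Keeping the dry run as the default makes the script safe to try on a machine without ORAchk. For real scheduling, the tool's own daemon mode mentioned on the slide is the more natural choice.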
  15. Added Proof is Best: troubleshooting and diagnostics tools improving availability
      • Attention Log
        - Available with Oracle DB 21c
        - Contains only important events requiring customer attention
        - Includes defined set of messages and attributes
        - All messages include these attributes: Type, Urgency, Scope, Target User, Cause and Action, additional debug information
        - Location: $ORACLE_BASE/diag/rdbms/database_name/instance_id/log/
      • Autonomous Health Framework (AHF)
        AHF preserves availability of your database system during both software (DB, GI, OS) and hardware (CPU, network, memory, storage) issues by:
        - Providing early warnings for potential availability issues
        - Identifying underlying cause(s) and recommended actions for a quick resolution
        - Gathering relevant and complete diagnostics for efficient triage by Oracle Support Services
      • Trace File Analyzer (TFA)
        - Enables diagnostic data collection (across cluster nodes) and consolidates data in one place.
        - Monitors logs for significant problems that can impact your service.
        - Automatically collects relevant diagnostics when it detects any potential problems.
        - Can identify relevant information in log files and trims log files to just the parts that are necessary to resolve an issue.
        - Oracle Trace File Analyzer hides the complexity by providing a single interface and syntax for them all.
  16. Lessons Learned: Agenda for this talk (repeated): 1. It can always get worse than one thought. 2. Documentation is good, checks are better. 3. Standardization and the cloud help. 4. There is no magic key for availability.
  17. Define, Use, Improve, Re-Use: standardization improves availability – some examples
      • On generic systems, to scale deployments safely, use standardized components AND
        - gold image-based deployments, OR
        - container-based deployments, OR
        - VM-based deployments
      • Oracle Engineered Systems, to ensure better availability:
        - Use standardized components
        - Come pre-configured with operational best practices
      • Cloud environments, to ensure stable operations, provide:
        - Standardized components
        - Pre-configuration
        - User guidance
      Oracle Cloud is based on MAA!
  18. Containers and Orchestration Solutions are Supported: just not explicitly considered as part of MAA
      • Oracle-provided container images include EE, SE2, Single Instance, Sharding, and RAC. Available at:
        - https://container-registry.oracle.com
        - https://github.com/oracle/docker-images/tree/main/OracleDatabase
      • These images are supported for production use.
        - RAC supported only on-premises. Use Oracle’s managed cloud services for RAC support in the cloud (Autonomous Database, ExaCS, DBCS, …)
        - RAC image support assumes the underlying OS, hardware, etc. are also supported for Oracle RAC.
      • Orchestration solutions are supported if underlying support requirements are met. It is assumed that:
        - The underlying OS, hardware, etc. are supported
        - The solution used understands those requirements
      • Specifically:
        - Oracle Database Docker images can be deployed in Kubernetes using Helm charts. Charts describe the application structure so that Helm can install and configure the pieces of the application.
        - OpenStack is supported depending on certain configurations and subject to the above guideline.
        - Consider RAC requirements for network and storage.
  19. MAA Solutions: On-Premises to Cloud [diagram]
      • On-Premises: MAA Reference Architectures and Best Practices
      • On-Premises Exadata and Recovery Appliance: MAA integrated Engineered Systems (configuration best practices, EXAchk, lowest brownouts, data protection, etc.)
      • BaseDB/ExaDB/ExaCS and Autonomous Database: adding MAA configuration and life cycle operations, shifting administrative burden to Oracle with MAA SLAs
  20. MAA and Chaos Engineering: breaking things to ensure your peace of mind
      "Chaos Engineering is the discipline of experimenting on a system in order to build confidence in the system's capability to withstand turbulent conditions in production." --Wikipedia
      In the digital age, this includes but is not limited to:
      • Network, server & storage failures
      • Human errors & data corruption
      • Power failures or site failure (e.g. Godzilla attack or hurricane)
      • Application, database & server software updates
      • Data reorganization or changes
      • Application changes and optimizations
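The idea of breaking things on purpose can be illustrated with a toy simulation. This is a sketch only, and it simulates rather than injects faults: it models a cluster as a set of node names, assumes the service needs a minimum number of surviving nodes (the `quorum` parameter, a simplification introduced here), fails random nodes, and records which failure combinations take the service down. All names are hypothetical; real chaos experiments act on actual systems.

```python
import random
from typing import List, Set


def service_available(nodes: Set[str], failed: Set[str], quorum: int) -> bool:
    """The service is up while at least `quorum` nodes survive."""
    return len(nodes - failed) >= quorum


def run_experiment(nodes: Set[str], failures_per_round: int, quorum: int,
                   rounds: int, seed: int = 0) -> List[Set[str]]:
    """Fail random nodes each round; return the failure sets that caused an outage."""
    rng = random.Random(seed)   # seeded so experiments are repeatable
    ordered = sorted(nodes)     # stable order for reproducible sampling
    outages: List[Set[str]] = []
    for _ in range(rounds):
        failed = set(rng.sample(ordered, failures_per_round))
        if not service_available(nodes, failed, quorum):
            outages.append(failed)
    return outages


if __name__ == "__main__":
    cluster = {"node1", "node2", "node3"}
    single = run_experiment(cluster, failures_per_round=1, quorum=2, rounds=100)
    double = run_experiment(cluster, failures_per_round=2, quorum=2, rounds=100)
    print(f"Outages with one failed node: {len(single)}")
    print(f"Outages with two failed nodes: {len(double)}")
```

In this model a three-node cluster that needs two survivors tolerates any single failure but never a double failure. That is the kind of boundary such an experiment is meant to find before production does.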
  21. Lessons Learned: Agenda for this talk (repeated): 1. It can always get worse than one thought. 2. Documentation is good, checks are better. 3. Standardization and the cloud help. 4. There is no magic key for availability.
  22. Some Mistakenly Believe that the Cloud Ensures Availability: think again!
      • April 29, 2011: Summary of the Amazon EC2 and Amazon RDS Service Disruption in the US East Region
        https://aws.amazon.com/message/65648/
        "The issues affecting EC2 customers last week primarily involved a subset of the Amazon Elastic Block Store (“EBS”) volumes in a single Availability Zone within the US East Region that became unable to service read and write operations."
      • September 21, 2015
        https://www.techrepublic.com/article/aws-outage-how-netflix-weathered-the-storm-by-preparing-for-the-worst/
        "Some of the internet’s biggest sites and apps were intermittently unavailable after more than 20 services on the AWS platform began failing. Helping [Netflix] to weather the service disruption was its practice of what it calls “chaos engineering”."
      • Wed 15 Dec 2021
        https://www.theguardian.com/technology/2021/dec/15/amazon-down-web-services-outage-netflix-slack-ring-doordash-latest
        1) "… briefly faced internet connectivity problems in two regions on the US West Coast on Wednesday, marking the second time in less than two weeks that the service was disturbed."
        2) "That outage lasted for several hours, and resulted in Netflix, Disney+, Robinhood and a slew of other services being inaccessible. Last week’s outage impacted the US-East-1 Region."
      • July 19, 2022
        https://www.bleepingcomputer.com/news/security/uk-heat-wave-causes-google-and-oracle-cloud-outages/
        "An ongoing heatwave in the United Kingdom has led to Google Cloud and Oracle Cloud outages after cooling systems failed at the companies' data centers."
  23. MAA Reference Architectures: availability service levels (Bronze, Silver, Gold, Platinum; repeated from slide 11)
  24. The Solution: Reference Architectures – Referring to MAA Blueprints
      https://docs.oracle.com/solutions
  25. Conclusion
  26. Oracle MAA is for Everybody!
      • For Oracle (database) customers wanting to improve their system availability to reduce costs caused by downtime.
      • For non-Oracle customers to get an idea of what failure scenarios need to be covered and how Oracle can help.
      • For application developers to understand which failure scenarios should be tackled by the application as needed.
  27. MAA will continue to Help
      • Provide the best HA, disaster recovery, and data protection solutions for Oracle Database – all active versions
      • Continue to enhance validated Maximum Availability Architecture (MAA) solutions
  28. Thank you
      Markus Michalewicz, Markus.Michalewicz@oracle.com, slideshare.net/markusmichalewicz, @KnownAsMarkus, markusmichalewicz
