7. Eliminate single points of failure
All components, services, resources, and compute instances should be deployed as multiple
instances to prevent a single point of failure from affecting availability. This includes
authentication mechanisms. Design the application to be configurable to use multiple instances,
and to automatically detect failures and redirect requests to non-failed instances where the
platform does not do this automatically.
8. Separate workloads that have different service levels
If a service is composed of critical and less-critical workloads, manage them differently and specify
the service features and number of instances to meet their availability requirements.
9. Understand and minimize dependencies
Minimize the number of different services used where possible, and ensure you understand all of
the feature and service dependencies that exist in the system. This includes the nature of these
dependencies, and the impact of failure or reduced performance in each one on the overall
application. Microsoft guarantees at least 99.9 percent availability for most services, but this
means that every additional service an application relies on potentially reduces the overall
availability SLA of your system by 0.1 percent.
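The arithmetic behind that estimate can be sketched as follows. This is a minimal sketch that assumes every dependency is required and that failures are independent; the function name `composite_availability` and the three-service example are illustrative, not part of any Azure API.

```python
from math import prod


def composite_availability(slas):
    """Availability of a system that needs every listed dependency to be up.

    Assumes the dependencies fail independently of each other.
    """
    return prod(slas)


# Hypothetical application that relies on three services at 99.9 percent each.
combined = composite_availability([0.999, 0.999, 0.999])
print(f"{combined:.4%}")  # prints 99.7003%
```

Each additional 99.9 percent dependency multiplies the total by 0.999, which is where the approximate 0.1 percent reduction per service comes from.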
10. Make tasks and messages idempotent (safely repeatable)
Design tasks and messages to be idempotent where possible, so that duplicated requests will not
cause problems. For example, a service can act as a consumer
that handles messages sent as requests by other parts of the system that act as producers. If the
consumer fails after processing the message, but before acknowledging that it has been
processed, a producer might submit a repeat request which could be handled by another instance
of the consumer. For this reason, consumers and the operations they carry out should be
idempotent so that repeating a previously executed operation does not render the results invalid.
This may mean detecting duplicated messages, or ensuring consistency by using an optimistic
approach to handling conflicts.
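Duplicate detection can be sketched as below. This is a minimal illustration, assuming every message carries a unique ID; the class `IdempotentConsumer` and the in-memory set are hypothetical stand-ins for a real consumer and a durable store of processed IDs.

```python
class IdempotentConsumer:
    """Applies each message at most once, keyed by its message ID."""

    def __init__(self):
        # Assumption: a real system would keep this in durable storage so that
        # another consumer instance can also see which messages were handled.
        self.processed_ids = set()
        self.balance = 0

    def handle(self, message_id, amount):
        # A redelivered message is recognised by its ID and skipped.
        if message_id in self.processed_ids:
            return False
        self.balance += amount
        self.processed_ids.add(message_id)
        return True


consumer = IdempotentConsumer()
consumer.handle("msg-1", 100)
consumer.handle("msg-1", 100)  # duplicate delivery, ignored
print(consumer.balance)  # prints 100
```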
11. Use a message broker to raise the availability of critical transactions
Many scenarios for initiating tasks or accessing remote services use messaging to pass
instructions between the application and the target service. For best performance, the application
should be able to send the message and then return to process more requests, without needing
to wait for a reply. To guarantee delivery of messages, the messaging system should provide high
availability. Azure Service Bus message queues implement at-least-once semantics. This means that
each message posted to a queue will not be lost, although duplicate copies may be delivered
under certain circumstances. If message processing is idempotent (see the previous item),
repeated delivery should not be a problem.
12. Plan for graceful degradation
Design applications to degrade gracefully when reaching resource limits, and take appropriate
action to minimize the impact for the user. In
some cases, the load on the application may exceed the capacity of one or more parts, causing
reduced availability and failed connections. Scaling can help to alleviate this, but it may reach a
limit imposed by other factors, such as resource availability or cost. Design the application so that,
in this situation, it can automatically degrade gracefully. For example, in an ecommerce system, if
the order-processing subsystem is under strain (or has even failed completely), it can be
temporarily disabled while allowing other functionality (such as browsing the product catalog) to
continue. It might be appropriate to postpone requests to a failing subsystem, for example still
enabling customers to submit orders but saving them for later processing, when the orders
subsystem is available again.
13. Handle sudden bursts of events
Most applications need to handle varying workloads over time, such as peaks first thing in the
morning in a business application or when a new product is released in an ecommerce site.
Auto-scaling can help to handle the load, but it may take some time for additional instances to come
online and handle requests. Prevent sudden and unexpected bursts of activity from overwhelming
the application: design it to queue requests to the services it uses and degrade gracefully when
queues are near to full capacity. Ensure that there is sufficient performance and capacity available
under non-burst conditions to drain the queues and handle outstanding requests. For more
information, see the Queue-Based Load Leveling Pattern.
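The queueing behaviour described above can be sketched with a bounded in-process queue. This is only an illustration under stated assumptions: the class `LoadLeveler`, the `high_water` threshold, and the three return values are hypothetical names, and a real deployment would normally use a durable queue service rather than `queue.Queue`.

```python
import queue


class LoadLeveler:
    """Buffers requests in a bounded queue and signals when it is nearly full."""

    def __init__(self, capacity, high_water=0.8):
        self.requests = queue.Queue(maxsize=capacity)
        self.high_water = high_water  # fraction of capacity that triggers degradation

    def submit(self, request):
        """Return 'accepted', 'degraded' (accepted but near capacity) or 'rejected'."""
        try:
            self.requests.put_nowait(request)
        except queue.Full:
            return "rejected"
        if self.requests.qsize() >= self.high_water * self.requests.maxsize:
            return "degraded"
        return "accepted"

    def drain(self, handler, max_items):
        """Process up to max_items queued requests at the backend's own pace."""
        handled = 0
        while handled < max_items:
            try:
                request = self.requests.get_nowait()
            except queue.Empty:
                break
            handler(request)
            handled += 1
        return handled
```

A caller that receives 'degraded' could, for example, switch off non-essential features until the backlog has drained.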
14.
15. Deploy each service as multiple instances
Microsoft makes availability guarantees for services that you create and deploy, but these
guarantees are only valid if you deploy at least two instances of each role in the service. This
enables one instance to be unavailable while the other remains active. This is especially important if
you need to deploy updates to a live system without interrupting clients' activities; instances can
be taken down and upgraded individually while the others continue online.
16. Host the application in multiple datacenters
Although extremely unlikely, it is possible for an entire datacenter to go offline through an event
such as a natural disaster or Internet failure. Vital business applications should be hosted in more
than one datacenter to provide maximum availability. This can also reduce latency for local users,
and provide additional opportunities for flexibility when updating applications.
17. Make deployment and maintenance tasks automated and testable
Distributed applications consist of multiple parts that must work together. Deployment should
therefore be automated, using tested and proven mechanisms such as scripts and deployment
applications. These can update and validate configuration, and automate the deployment process.
Automated techniques should also be used to perform updates of all or parts of applications. It is
vital to test all of these processes fully to ensure that errors do not cause additional downtime. All
deployment tools must have suitable security restrictions to protect the deployed application;
define and enforce deployment policies carefully and minimize the need for human intervention.
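18. Provide a staging environment and a mechanism for swapping it with production
Consider using the staging and production features of the platform where these are available. For
example, using Azure Cloud Services staging and production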
environments allows applications to be switched from one to another instantly through a virtual IP
address swap (VIP Swap). However, if you prefer to stage on-premises, or deploy different versions
of the application concurrently and gradually migrate users, you may not be able to use a VIP
Swap operation.
19. Understand which configuration changes require a restart, and handle them
Apply configuration changes without recycling the instance when possible. In many cases, the
configuration settings for an Azure application or
service can be changed without requiring the role to be restarted. Roles expose events that can be
handled to detect configuration changes and apply them to components within the application.
However, some changes to the core platform settings do require a role to be restarted. When
building components and services, maximize availability and minimize downtime by designing
them to accept changes to configuration settings without requiring the application as a whole to
be restarted.
20. Use upgrade domains to update with no downtime
Azure compute units such as web and worker roles are allocated to upgrade domains. Upgrade
domains group role instances together so that, when a rolling update takes place, each role in the
upgrade domain is stopped, updated, and restarted in turn. This minimizes the impact on
application availability. You can specify how many upgrade domains should be created for a
service when the service is deployed.
21. (Important enough to repeat over and over) Use availability sets
Placing two or more virtual machines in the same availability set guarantees that these virtual
machines will not be deployed to the same fault domain. To maximize availability, you should
create multiple instances of each critical virtual machine used by your system and place these
instances in the same availability set. If you are running multiple virtual machines that serve
different purposes, create a separate availability set for each type of virtual machine, and add the
instances of each type to its own availability set. For example, if you have created separate virtual machines to act
as a web server and a reporting server, create an availability set for the web server and another
availability set for the reporting server. Add instances of the web server virtual machine to the
web server availability set, and add instances of the reporting server virtual machine to the
reporting server availability set.
22.
23. Replicate data to a remote region
Data in Azure Storage is automatically replicated within a datacenter. For even higher availability,
use Read-access geo-redundant storage (RA-GRS), which replicates your data to a secondary
region and provides read-only access to the data in the secondary location. The data is durable
even in the case of a complete regional outage or a disaster.
24. Replicate databases to a remote region
Azure SQL Database and Cosmos DB both support geo-replication, which enables you to
configure secondary database replicas in other regions. Secondary databases are available for
querying and for failover in the case of a data center outage or the inability to connect to the
primary database. For more information, see Failover groups and active geo-replication (SQL
Database) and How to distribute data globally with Azure Cosmos DB?
25. Use optimistic concurrency and eventual consistency (where you can)
Use optimistic concurrency and eventual consistency where possible. Transactions that block
access to resources through locking (pessimistic
concurrency) can cause poor performance and considerably reduce availability. These problems
can become especially acute in distributed systems. In many cases, careful design and techniques
such as partitioning can minimize the chances of conflicting updates occurring. Where data is
replicated, or is read from a separately updated store, the data will only be eventually consistent.
However, the advantages of this approach usually far outweigh the availability cost of using
transactions to ensure immediate consistency.
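One common way to implement optimistic concurrency is a version check on write, sketched below. This is a minimal in-memory illustration; `VersionedStore`, `VersionConflict`, and `increment` are hypothetical names, and real data stores usually expose the same idea through ETags or row versions.

```python
class VersionConflict(Exception):
    """Raised when the stored version no longer matches the version that was read."""


class VersionedStore:
    def __init__(self):
        self._items = {}  # key -> (version, value)

    def read(self, key):
        return self._items.get(key, (0, None))

    def write(self, key, value, expected_version):
        current_version, _ = self._items.get(key, (0, None))
        # No lock is held between read and write; the conflict is detected here.
        if current_version != expected_version:
            raise VersionConflict(f"{key}: expected v{expected_version}, found v{current_version}")
        self._items[key] = (current_version + 1, value)
        return current_version + 1


def increment(store, key, max_attempts=3):
    """Read-modify-write that re-reads and retries when another writer got in first."""
    for _ in range(max_attempts):
        version, value = store.read(key)
        try:
            return store.write(key, (value or 0) + 1, version)
        except VersionConflict:
            continue
    raise VersionConflict(f"{key}: gave up after {max_attempts} attempts")
```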
26. Are you backing up with restore in mind?
Use periodic backup and point-in-time restore, and ensure the process meets the Recovery Point
Objective (RPO). Regularly and automatically back up data
that is not preserved elsewhere, and verify you can reliably restore both the data and the
application itself should a failure occur. Data replication is not a backup feature because errors
and inconsistencies introduced through failure, error, or malicious operations will be replicated
across all stores. The backup process must be secure to protect the data in transit and in storage.
Databases or parts of a data store can usually be recovered to a previous point in time by using
transaction logs. Microsoft Azure provides a backup facility for data stored in Azure SQL Database.
The data is exported to a backup package on Azure blob storage, and can be downloaded to a
secure on-premises location for storage.
29. Set timeouts strategically
Services and resources may become unavailable, causing requests to fail. Ensure that the timeouts
you apply are appropriate for each service or resource as well as the client that is accessing them.
(In some cases, it may be appropriate to allow a longer timeout for a particular instance of a client,
depending on the context and other actions that the client is performing.) Very short timeouts
may cause excessive retry operations for services and resources that have considerable latency.
Very long timeouts can cause blocking if a large number of requests are queued, waiting for a
service or resource to respond.
30. Retry strategically, too
Design a retry strategy for access to all services and resources where they do not inherently
support automatic connection retry. Use a strategy that includes an increasing delay between
retries as the number of failures increases, to prevent overloading of the resource and to allow it
to gracefully recover and handle queued requests. Continual retries with very short delays are
likely to exacerbate the problem.
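A retry strategy with an increasing delay can be sketched as follows. This is a minimal sketch, assuming that transient faults surface as `ConnectionError`; the function name `retry_with_backoff`, the doubling schedule, and the default values are illustrative choices rather than recommended settings.

```python
import time


def retry_with_backoff(operation, max_attempts=5, base_delay=0.5, max_delay=30.0,
                       sleep=time.sleep):
    """Call operation(), doubling the wait after each failed attempt."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except ConnectionError:
            if attempt == max_attempts:
                raise  # out of attempts: surface the failure to the caller
            # 0.5 s, 1 s, 2 s, ... capped at max_delay
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(delay)
```

The `sleep` parameter exists only so that the wait can be replaced in tests; many implementations also add random jitter to the delay so that clients do not retry in lockstep.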
31. Knowing when to give up matters
Stop sending requests to avoid cascading failures when remote services are unavailable. There
may be situations in which transient or other faults,
ranging in severity from a partial loss of connectivity to the complete failure of a service, take
much longer than expected to return to normal. Additionally, if a service is very busy, failure in
one part of the system may lead to cascading failures, and result in many operations becoming
blocked while holding onto critical system resources such as memory, threads, and database
connections. Instead of continually retrying an operation that is unlikely to succeed, the
application should quickly accept that the operation has failed, and gracefully handle this failure.
You can use the circuit breaker pattern to reject requests for specific operations for defined
periods. For more information, see the Circuit Breaker Pattern.
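The idea of rejecting requests for a defined period can be sketched as below. This is a simplified, single-threaded illustration, not a full implementation of the pattern; the names `CircuitBreaker` and `CircuitOpenError`, the failure threshold, and the reset timeout are all assumptions made for the example.

```python
import time


class CircuitOpenError(Exception):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, operation):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                # Fail fast instead of tying up threads and connections.
                raise CircuitOpenError("circuit is open; request rejected")
            # Timeout elapsed: let one trial call through (half-open state).
        try:
            result = operation()
        except Exception:
            self.failures += 1
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # open, or re-open after a failed trial
            raise
        # Success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```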
32. If it fails, route elsewhere
Compose or fall back to multiple components to mitigate the impact of a specific service being
offline or unavailable. Design applications to take
advantage of multiple instances without affecting operation and existing connections where
possible. Use multiple instances and distribute requests between them, and detect and avoid
sending requests to failed instances, in order to maximize availability.
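Detecting failed instances and routing around them can be sketched as follows. This is a minimal sketch, assuming that an unavailable instance raises `ConnectionError`; the function `call_with_failover` and the shared `unhealthy` set are hypothetical, and a platform load balancer would normally do this work where one is available.

```python
def call_with_failover(instances, request, unhealthy=None):
    """Try each (name, handler) pair in order, skipping instances known to be down.

    Returns (instance_name, response). Instances that fail are added to the
    unhealthy set so that later calls avoid them.
    """
    unhealthy = set() if unhealthy is None else unhealthy
    last_error = None
    for name, handler in instances:
        if name in unhealthy:
            continue
        try:
            return name, handler(request)
        except ConnectionError as exc:
            unhealthy.add(name)
            last_error = exc
    raise RuntimeError("no healthy instance available") from last_error
```

In practice the unhealthy set would need to be cleared again, for example after a successful health probe, so that recovered instances return to service.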
33. If it fails, go elsewhere (advanced)
Fall back to a different service or workflow where possible. For example, if writing to SQL
Database fails, temporarily store data in blob
storage. Provide a facility to replay the writes in blob storage to SQL Database when the service
becomes available. In some cases, a failed operation may have an alternative action that allows
the application to continue to work even when a component or service fails. If possible, detect
failures and redirect requests to other services that can offer a suitable alternative functionality, or
to backup or reduced-functionality instances that can maintain core operations while the primary
service is offline.
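The store-and-replay example above can be sketched as below. This is a minimal illustration: `FallbackWriter` is a hypothetical class, the `pending` list stands in for blob storage, `primary_write` stands in for the SQL Database write, and failures are assumed to surface as `ConnectionError`.

```python
class FallbackWriter:
    """Writes to the primary store, parking records elsewhere while it is down."""

    def __init__(self, primary_write):
        self.primary_write = primary_write  # callable that may raise ConnectionError
        self.pending = []                   # stand-in for a fallback store such as blobs

    def write(self, record):
        try:
            self.primary_write(record)
            return "primary"
        except ConnectionError:
            self.pending.append(record)
            return "fallback"

    def replay(self):
        """Replay parked records in order; a record is removed only after it succeeds."""
        replayed = 0
        while self.pending:
            self.primary_write(self.pending[0])
            self.pending.pop(0)
            replayed += 1
        return replayed
```

Because a replayed write may be attempted more than once if `replay` is interrupted, the writes themselves should be idempotent.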
34.
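35. Write down how to handle the failures most likely to occur
Provide rich instrumentation for likely failures and failure events to report the situation to
operations staff. For failures that are likely but have not yet occurred,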
provide sufficient data to enable operations staff to determine the cause, mitigate the situation,
and ensure that the system remains available. For failures that have already occurred, the
application should return an appropriate error message to the user but attempt to continue
running, albeit with reduced functionality. In all cases, the monitoring system should capture
comprehensive details to enable operations staff to effect a quick recovery, and if necessary, for
designers and developers to modify the system to prevent the situation from arising again.
36. Notice before it goes down
The health and performance of an application can degrade over time, without being noticeable
until it fails. Implement probes or check functions that are executed regularly from outside the
application. These checks can be as simple as measuring response time for the application as a
whole, for individual parts of the application, for individual services that the application uses, or
for individual components. Check functions can execute processes to ensure they produce valid
results, measure latency and check availability, and extract information from the system.
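A check function of this kind can be sketched as follows. This is a minimal sketch; the function name `run_probe`, the shape of the returned dictionary, and the latency threshold are assumptions for illustration, and the `check` callable stands in for whatever request the probe sends to the application.

```python
import time


def run_probe(check, max_latency_seconds, clock=time.perf_counter):
    """Run one health check and report validity, latency, and any error."""
    started = clock()
    try:
        valid = bool(check())  # check() should return a truthy value for a valid result
        error = None
    except Exception as exc:
        valid = False
        error = repr(exc)
    latency = clock() - started
    return {
        "healthy": valid and latency <= max_latency_seconds,
        "valid_result": valid,
        "latency_seconds": latency,
        "error": error,
    }
```

A scheduler outside the application would call `run_probe` at regular intervals and feed the results to the monitoring system.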
37. Will it really fail over when it counts?
Regularly test all failover and fallback systems to ensure they are available and operate as
expected. Changes to systems and operations may
affect failover and fallback functions, but the impact may not be detected until the main system
fails or becomes overloaded. Test these functions before they are required to compensate for a
live problem at runtime.
38. Everything rests on being able to trust the monitoring system
Automated failover and fallback systems, and manual visualization of system health and
performance by using dashboards, all depend on monitoring and instrumentation functioning
correctly. If these elements fail, miss critical information, or report inaccurate data, an operator
might not realize that the system is unhealthy or failing.
39. Losing an entire long-running workflow hurts
Track the progress of long-running workflows and retry on failure. Long-running workflows are
often composed of multiple steps. Ensure that
each step is independent and can be retried to minimize the chance that the entire workflow will
need to be rolled back, or that multiple compensating transactions need to be executed. Monitor
and manage the progress of long-running workflows by implementing a pattern such
as the Scheduler Agent Supervisor pattern.
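Independent, retryable steps with recorded progress can be sketched as below. This is a much simpler mechanism than the Scheduler Agent Supervisor pattern and is shown only to illustrate the idea; the function `run_workflow`, the `state` dictionary (which a real system would persist), and the use of `ConnectionError` for transient faults are assumptions.

```python
def run_workflow(steps, state, max_attempts=3):
    """Run (name, callable) steps in order, skipping steps already recorded as done.

    Each step is retried independently, so one failing step does not force the
    whole workflow to start again.
    """
    completed = state.setdefault("completed", [])
    for name, step in steps:
        if name in completed:
            continue  # finished in an earlier run; do not repeat it
        for attempt in range(1, max_attempts + 1):
            try:
                step()
            except ConnectionError:
                if attempt == max_attempts:
                    raise  # progress so far stays recorded in state
            else:
                completed.append(name)
                break
    return completed
```

If the workflow is restarted with the same `state`, it resumes from the first step that has not yet completed.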
40. Mechanisms and drills for wide-area disasters
Create an accepted, fully-tested plan for recovery from any type of failure that may affect system
availability. Choose a multi-site disaster recovery architecture for any mission-critical applications.
Identify a specific owner of the disaster recovery plan, including automation and testing. Ensure
the plan is well-documented, and automate the process as much as possible. Establish a backup
strategy for all reference and transactional data, and test the restoration of these backups
regularly. Train operations staff to execute the plan, and perform regular disaster simulations to
validate and improve the plan.