As more and more businesses move from enterprise IT solutions to web-scale cloud solutions to meet growing customer demand, they need to be innovative and find ways for their applications and infrastructure to scale rapidly while staying highly available.
High availability is a critical requirement for any online business; architecting around failures, expecting infrastructure to fail, and remaining available even when it does is the key to success. One such effort here at Netflix was the Active-Active implementation, which provided region resiliency. This presentation gives a brief overview of the Active-Active implementation and how it leveraged Cassandra’s architecture in the backend to achieve that goal. It covers our journey through Active-Active from Cassandra’s perspective, the data validation we did to prove the backend would work without impacting the customer experience, the various problems we ran into such as long repair times and gc_grace settings, the lessons we learned, and what we would do differently next time.
6. WHAT IS ACTIVE-ACTIVE
Also called dual active, it is a phrase used to describe a network of independent processing nodes where each node has access to a replicated database, giving every node access to and use of a single application. In an active-active system all requests are load balanced across all available processing capacity; where a failure occurs on a node, another node in the network takes its place.
7. DOES AN INSTANCE FAIL?
• It can; plan for it
• Bad code / configuration pushes
• Latent issues
• Hardware failure
• Test with Chaos Monkey
8. DOES A ZONE FAIL?
• Rarely, but it has happened before
• Routing issues
• DC-specific issues
• App-specific issues within a zone
• Test with Chaos Gorilla
9. DOES A REGION FAIL?
• Full region – unlikely, very rare
• Individual Services can fail region-wide
• Most likely a region-wide configuration issue
• Test with Chaos Kong
10. EVERYTHING FAILS… EVENTUALLY
• Keep your services running by embracing isolation and
redundancy
• Construct a highly agile and highly available service from ephemeral, assumed-broken components
11. ISOLATION
• Changes in one region should not affect others
• Regional outage should not affect others
• Network partitioning between regions should not affect
functionality / operations
12. REDUNDANCY
• Make more than one (of pretty much everything)
• Specifically, distribute services across Availability
Zones and regions
13. HISTORY: X-MAS EVE 2012
• Netflix multi-hour outage
• US-East1 regional Elastic Load Balancing issue
• “...data was deleted by a maintenance process
that was inadvertently run against the
production ELB state data”
23. UPDATE KEYSPACE
Update keyspace <keyspace> with placement_strategy =
'NetworkTopologyStrategy'
and strategy_options = {us-east : 3, us-west-2 : 3};
(us-east : 3 is the existing region and replication factor; us-west-2 : 3 is the new region and replication factor)
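For reference, a sketch of the same change in CQL, followed by the nodetool step typically run on each node in the new region to stream in the existing data (the keyspace name and source data center are placeholders, not from the deck):

ALTER KEYSPACE my_keyspace
  WITH replication = {'class': 'NetworkTopologyStrategy',
                      'us-east': 3,      -- existing region keeps RF 3
                      'us-west-2': 3};   -- new region added at RF 3

# on each node in the new region, stream existing data from the original data center
nodetool rebuild us-east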
27. BENCHMARKING GLOBAL CASSANDRA
WRITE-INTENSIVE TEST OF CROSS-REGION REPLICATION CAPACITY
16 X HI1.4XLARGE SSD NODES PER ZONE = 96 TOTAL
192 TB OF SSD IN SIX LOCATIONS, UP AND RUNNING CASSANDRA IN 20 MINUTES
[Diagram: Cassandra replicas across Zones A, B, and C in both the US-West-2 (Oregon) and US-East-1 (Virginia) regions, with test load applied in each region, a validation load, and interzone plus interregional replication traffic]
• 1 million writes at CL.ONE (wait for one replica to ack)
• 1 million reads after 500 ms at CL.ONE with no data loss
• Interregional traffic up to 9 Gbit/s at 83 ms
• 18 TB of backups from S3
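The deck does not show the test harness itself; as a rough sketch of the same idea using the stock cassandra-stress tool (hostnames are placeholders, and Netflix used its own benchmarking setup):

# drive writes at CL.ONE against the us-west-2 cluster
cassandra-stress write n=1000000 cl=ONE -node uswest2-cass-node1
# shortly afterwards, read the same keys back from us-east-1 to confirm replication
cassandra-stress read n=1000000 cl=ONE -node useast1-cass-node1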
39. TIME TO REPAIR DEPENDS ON
• Number of regions
• Number of replicas
• Data size
• Amount of entropy
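Repair time depends on all of these; the repair itself is run node by node. A minimal sketch (keyspace name is a placeholder), using the primary-range option so each token range is repaired only once across the cluster:

# run on every node in turn
nodetool repair -pr my_keyspace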
40. ADJUST GC_GRACE AFTER EXTENSION
• Column Family Setting
• Defined in seconds
• Default 10 days
• Tweak gc_grace settings to accommodate the time taken to repair (see the sketch below)
• BEWARE of deleted columns: if repair does not finish within gc_grace, tombstones can be purged before they propagate and deleted data can reappear
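A minimal sketch of raising the setting, assuming a CQL table (the names and the 20-day value are illustrative, not from the deck):

-- give repair more headroom than the default 10 days (864000 s)
ALTER TABLE my_keyspace.my_table WITH gc_grace_seconds = 1728000;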
43. CONSISTENCY LEVEL
• Check the client for the consistency level setting
• In a multi-region cluster, QUORUM is not the same as LOCAL_QUORUM
• Recommended consistency levels: LOCAL_ONE (CASSANDRA-6202) for reads and LOCAL_QUORUM for writes (illustrated below)
• For region resiliency, avoid ALL or QUORUM calls
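Driver APIs vary and the consistency level is normally set per request in the client; as a minimal illustration, the session default can be set in cqlsh before issuing statements:

CONSISTENCY LOCAL_QUORUM;
-- or, for read paths that can tolerate a single local replica:
CONSISTENCY LOCAL_ONE;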