Running Cassandra on Amazon EC2

Reminder
Next meetup Wednesday 8th December
Jake Luciani will be giving a talk on
"Lucandra" (a Cassandra backend for
Lucene open source search software)

Quick intro to Cassandra
• Decentralized
• Fault-tolerant
• Tunable consistency
• Elasticity

This talk
Why consider EC2?
What are the challenges of running
Cassandra on EC2?
Is it a good idea?

Cassandra design decisions
Cassandra is designed to run on many
commodity servers
It is designed to deal with unreliable
hardware and networks

Why consider EC2?
On demand instances
“frees you from the costs and complexities
of planning, purchasing, and maintaining
hardware and transforms what are
commonly large fixed costs into much
smaller variable costs”
http://aws.amazon.com/ec2/pricing/

Why consider EC2?
Multiple “Availability Zones” in multiple
regions (US East, US West, Ireland and
Singapore)
http://aws.amazon.com/ec2/

Writing to Cassandra
1. Write appended to the local commit log
on the target machine
2. Memtable updated
3. Memtable flushed to disk as data
files (SSTable plus SSTable Index)
4. Eventually data files are compacted
http://wiki.apache.org/cassandra/ArchitectureOverview#Write_path
(Slide annotation: "IO" marked at three points on the write path)
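
The four write steps above can be sketched as a toy, in-memory model. This is an illustration only, not Cassandra's implementation: the class name TinyNode, the flush_threshold parameter and the use of plain lists and dicts in place of files on disk are all assumptions made for the example.

```python
class TinyNode:
    """Toy model of the write path on one node (hypothetical, in-memory only)."""

    def __init__(self, flush_threshold=3):
        self.commit_log = []   # step 1: append-only log (a sequential disk write on a real node)
        self.memtable = {}     # step 2: in-memory key -> value
        self.sstables = []     # step 3: immutable sorted tables, oldest first
        self.flush_threshold = flush_threshold  # assumed knob, for illustration

    def write(self, key, value):
        self.commit_log.append((key, value))   # 1. log the write first
        self.memtable[key] = value             # 2. update the memtable
        if len(self.memtable) >= self.flush_threshold:
            self.flush()

    def flush(self):
        # 3. write the memtable out as a new sorted, immutable table
        if self.memtable:
            self.sstables.append(dict(sorted(self.memtable.items())))
            self.memtable = {}
            self.commit_log.clear()  # simplification: logged writes are now in a table

    def compact(self):
        # 4. merge all tables into one; later tables overwrite earlier ones
        merged = {}
        for table in self.sstables:
            merged.update(table)
        self.sstables = [dict(sorted(merged.items()))] if merged else []
```

Each of steps 1, 3 and 4 would touch the disk on a real node, which is why I/O performance matters so much for the rest of this talk.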

Reading from Cassandra
1. Read from any node
2. Partitioner identifies the replica nodes
for the key
3. Wait for R responses
4. Wait for N – R responses in the
background and perform read repair
http://wiki.apache.org/cassandra/ArchitectureOverview#Read_path
(Slide annotation: "IO" marked at two points on the read path)
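
A minimal sketch of steps 3 and 4, assuming each replica is a plain dict mapping a key to a (timestamp, value) pair and that "the first R responses" are simply the first R replicas in the list. The function name read and this data layout are hypothetical; the partitioner step is skipped by passing the N replicas in directly.

```python
def read(key, replicas, r):
    """Answer from the first r replicas, then repair all N replicas.

    replicas: list of dicts, each mapping key -> (timestamp, value).
    """
    answers = [replica.get(key) for replica in replicas]

    # Step 3: the caller only waits for R responses; newest timestamp wins.
    first_r = [a for a in answers[:r] if a is not None]
    result = max(first_r, key=lambda a: a[0], default=None)

    # Step 4: the remaining N - R responses arrive "in the background";
    # any replica holding stale or missing data is brought up to date.
    known = [a for a in answers if a is not None]
    if known:
        newest = max(known, key=lambda a: a[0])
        for replica in replicas:
            if replica.get(key) != newest:
                replica[key] = newest  # read repair

    return None if result is None else result[1]
```

With a small R the caller can receive a stale value, but the repair means a later read sees the newer one. A larger R trades latency for fresher answers.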

Reading from Cassandra
Reads from multiple SSTables
The application use-case will affect
performance and what the bottleneck is
(totally random reads being worst case)
(Slide annotation: "IO")
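
To illustrate why a single read can cost several disk accesses, here is a small sketch that looks a key up across every SSTable and counts the tables it had to probe. The function name read_key and the dict-based tables are assumptions for the example, and it ignores any optimization a real node might use to skip files.

```python
def read_key(key, memtable, sstables):
    """Return (newest value or None, number of SSTables probed).

    memtable and each SSTable map key -> (timestamp, value).
    """
    candidates = []
    if key in memtable:
        candidates.append(memtable[key])  # in memory, no disk access

    probes = 0
    for table in sstables:
        probes += 1  # on a real node each probe may mean a disk seek
        if key in table:
            candidates.append(table[key])

    if not candidates:
        return None, probes
    newest = max(candidates, key=lambda c: c[0])  # highest timestamp wins
    return newest[1], probes
```

In this model, a key whose versions are spread over three tables needs three probes, and after compaction into one table it needs one. Totally random reads get no help from locality, hence the worst case.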

The challenges
Getting good enough I/O performance
Not a huge number of resources on the
Internet (new and shiny)
Some minor setup and monitoring
challenges (documentation is available)

EC2 I/O performance
Storage is ephemeral (instance store) or
EBS; Amazon rates each instance type's
I/O performance as low, moderate or high
“other resources like the network and the disk
subsystem are shared among instances…
when a resource is under-utilized you will
often be able to consume a higher share of
that resource”
http://aws.amazon.com/ec2/instance-types/

EBS or ephemeral?
Jonathan Ellis recently on the mailing list:
“we recommend using raid0 ephemeral disks
on EC2 with L or XL instance sizes for better
i/o performance.”
http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/Cold-boot-performance-problems-tp5615829p5615889.html
http://www.coreyhulen.org/?p=326

EBS or ephemeral?
Amazon suggest EBS is better:
“Amazon EBS is particularly suited for
applications that require a database, file
system, or access to raw block level storage”
http://aws.amazon.com/ebs/

“The latency and throughput of Amazon EBS
volumes is designed to be significantly better
than the Amazon EC2 instance stores in nearly
all cases. You can also attach multiple volumes
to an instance and stripe across the volumes.
This is one way to improve I/O rates,
especially if your application performs a lot of
random access across your data set.”
http://aws.amazon.com/ebs/
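
Striping (RAID 0) spreads consecutive blocks across volumes so that several volumes can serve requests at once. A minimal sketch of the address mapping, assuming a simple round-robin layout; the function name locate and the blocks_per_stripe_unit parameter are hypothetical and do not describe how mdadm or EBS lay data out internally.

```python
def locate(logical_block, num_volumes, blocks_per_stripe_unit=1):
    """Map a logical block number to (volume index, block on that volume)."""
    stripe_unit = logical_block // blocks_per_stripe_unit  # which chunk overall
    within = logical_block % blocks_per_stripe_unit        # offset inside the chunk
    volume = stripe_unit % num_volumes                     # chunks go round-robin
    unit_on_volume = stripe_unit // num_volumes            # chunk index on that volume
    return volume, unit_on_volume * blocks_per_stripe_unit + within
```

With four volumes and one block per stripe unit, logical blocks 0 to 3 land on volumes 0 to 3 and block 4 wraps back to volume 0, so a run of random reads tends to be shared out between the volumes.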

EC2 I/O benchmark
Throughput measured using dd
Seek measured using seeker.c
Software RAID uses mdadm
http://www.linuxinsight.com/how_fast_is_your_disk.html
http://en.wikipedia.org/wiki/Mdadm
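
The two measurements can be roughly approximated in Python. This is a sketch only and not the method used for the benchmark in this talk: dd and seeker.c work differently, the function names and parameters here are assumptions, and on a small file the operating system's page cache will dominate the numbers.

```python
import os
import random
import time


def sequential_read_mb_per_s(path, block_size=1024 * 1024):
    """Read the whole file front to back and report MB/s (throughput)."""
    size = os.path.getsize(path)
    start = time.perf_counter()
    with open(path, "rb", buffering=0) as f:
        while f.read(block_size):
            pass
    elapsed = max(time.perf_counter() - start, 1e-9)  # avoid division by zero
    return size / (1024 * 1024) / elapsed


def random_seeks_per_s(path, seeks=1000, block_size=512):
    """Read small blocks at random offsets and report seeks per second."""
    size = os.path.getsize(path)
    limit = max(1, size - block_size)
    start = time.perf_counter()
    with open(path, "rb", buffering=0) as f:
        for _ in range(seeks):
            f.seek(random.randrange(limit))
            f.read(block_size)
    elapsed = max(time.perf_counter() - start, 1e-9)
    return seeks / elapsed
```

For meaningful results the target would need to be much larger than RAM, or a raw device, and the runs repeated, given the warning on the next slide that EC2 performance is not consistent.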

Which is better?
EBS has better throughput; ephemeral is
better for random seeks
Generic benchmarks aren’t great –
it depends on your use case
Warning: EC2 performance is not
consistent

EC2 Cassandra benchmark
Read and write TPS
Benchmarks carried out by Corey Hulen
http://www.coreyhulen.org/?p=326

Which is better?
Corey suggests:
“Raid 0 EBS drives are the way to go”
“We didn’t notice a difference above the
normal EC2 fluctuations when testing
for 2 vs 4 drives”

Conclusions
Cassandra will run acceptably on EC2, but
real hardware is better
It will depend on your use case –
particularly the types of read that you do
Real hardware may work out cheaper

Conclusions
Ephemeral I/O seems to be better than
EBS, although EBS has other advantages
(doesn’t disappear if you stop the node)
Again, it will depend on use case

Conclusions
Large nodes are the best bet
Small nodes have poor I/O
Extra large nodes are probably not
worth it (better to have more nodes)
http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/Nodes-dropping-out-of-cluster-due-to-GC-tp5128481p5131568.html

Questions?
Please leave feedback on meetup.com
Follow @cassandralondon on Twitter
