1. On Rails with Apache Cassandra
Austin on Rails
April 27th 2010
Stu Hood (@stuhood) – Technical Lead, Rackspace
2. My, what a large/volatile dataset you have!
● Large
  ● Larger than one node can handle
● Volatile
  ● More than 25% (ish) writes
  ● (but still larger than available memory)
● Expensive
  ● More than you can afford with a commercial solution
3. My, what a large/volatile dataset you have!
● For example:
  ● Event/log data
  ● Output of batch processing or log analytics jobs
  ● Social network relationships/updates
● In general:
  ● Large volume of high-fanout data
4. Conversely...
● If your pattern easily fits one RDBMS machine:
  ● Don't use Cassandra
  ● Possibly consider MongoDB, CouchDB, Neo4j, Redis, etc.
    – For schema freedom and flexibility
5. Case Study: Digg
1. Vertical partitioning and master/slave trees
2. Developed sharding solution
  ● IDDB
  ● Awkward replication, fragile scaling
3. Began populating Cassandra in parallel
  ● Initial dataset for 'green badges'
    – 3 TB
    – 76 billion kv pairs
  ● Most applications being ported to Cassandra
7. Standing on the shoulders of: Amazon Dynamo
● No node in the cluster is special
  ● No special roles
  ● No scaling bottlenecks
  ● No single point of failure
● Techniques
  ● Gossip
  ● Eventual consistency
8. Standing on the shoulders of: Google Bigtable
● “Column family” data model
● Range queries for rows:
  ● Scan rows in order
● Memtable/SSTable structure
  ● Always writes sequentially to disk
  ● Bloom filters to minimize random reads
  ● Trounces B-Trees for big data
    – Linear insert performance
    – Log growth for reads
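The Bloom-filter bullet is worth making concrete: Cassandra keeps one filter per SSTable so a read can skip files that definitely do not contain the requested key. A minimal Ruby sketch of the idea (toy sizing and hashing, not Cassandra's implementation):

```ruby
require 'digest/md5'

class BloomFilter
  def initialize(bits = 1024, hashes = 3)
    @bits, @hashes = bits, hashes
    @bitmap = Array.new(bits, false)
  end

  # Derive k positions from slices of an MD5 digest of the key.
  def positions(key)
    digest = Digest::MD5.hexdigest(key)
    (0...@hashes).map { |i| digest[i * 8, 8].to_i(16) % @bits }
  end

  def add(key)
    positions(key).each { |p| @bitmap[p] = true }
  end

  # false => key is definitely absent (skip this SSTable);
  # true  => key *may* be present (go read the file).
  def maybe_include?(key)
    positions(key).all? { |p| @bitmap[p] }
  end
end

filter = BloomFilter.new
filter.add("user19")
filter.maybe_include?("user19")  # => true
```

A "no" answer is always correct, so the only cost of the filter is an occasional false positive that triggers an unnecessary disk read.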
9. Enter Cassandra
● Hybrid of ancestors
  ● Adopts listed features
● And adds:
  ● Pluggable partitioning
  ● Multi-datacenter support
    – Pluggable locality awareness
  ● Datamodel improvements
10. Enter Cassandra
● Project status
  ● Open sourced by Facebook in 2008 (Facebook no longer active in the project)
  ● Apache License, Version 2.0
  ● Graduated to Apache TLP February 2010
  ● Major releases: 0.3 through 0.6.1 (0.7 this summer)
  ● cassandra.apache.org
● Known deployments at:
  ● Cloudkick, Digg, Mahalo, SimpleGeo, Twitter, Rackspace, Reddit
11. The Datamodel
● Cluster
● Nodes have Tokens:
  ● OrderPreservingPartitioner: actual keys
  ● RandomPartitioner: MD5s of keys
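The trade-off between the two partitioners can be sketched in a few lines of Ruby (illustrative tokens, not Cassandra's actual ring arithmetic):

```ruby
require 'digest/md5'

# OrderPreservingPartitioner: the token *is* the key, so rows stay in key
# order on the ring — range scans work, but popular key ranges can
# hot-spot individual nodes.
order_preserving_token = ->(key) { key }

# RandomPartitioner: the token is MD5(key), so load spreads evenly across
# the ring, but key ordering is lost.
random_token = ->(key) { Digest::MD5.hexdigest(key) }

keys = %w[apple banana cherry]

keys.sort_by(&order_preserving_token)  # => ["apple", "banana", "cherry"]
keys.sort_by(&random_token)            # balanced, but effectively shuffled
```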
15. The Datamodel
Cluster > Keyspace > Column Family > Row > “Column”
● Not like an RDBMS column: an attribute of the row; each row can contain millions of different columns
● Name → Value: bytes → bytes, plus a version timestamp
17. StatusApp Example
<ColumnFamily Name=”Users”>
● Unique id as key: name->value pairs contain user attributes
{key: “rails_user”, row: {“fullname”: “Damon Clinkscales”, “joindate”: “back_in_the_day”, … }}
18. StatusApp Example
<ColumnFamily Name=”Timelines”>
● User id and timeline name as key: row contains list of updates from that timeline
{key: “user19:personal”, row: {<timeuuid1>: “status19”, <timeuuid2>: “status21”, … }}
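Why TimeUUID column names? Cassandra sorts columns within a row by name, so a timeline row comes back in chronological order for free. A plain-Ruby sketch of that property, with [timestamp, sequence] arrays standing in for TimeUUIDs:

```ruby
# A row is conceptually a map sorted by column name. Inserting out of
# order doesn't matter: reads always see columns in name order.
row = {}
row[[1272300000, 1]] = "status19"
row[[1272300060, 2]] = "status21"
row[[1272299940, 0]] = "status17"

# Reading back in column-name order yields the updates chronologically:
ordered = row.sort.map { |_uuid, status| status }
# => ["status17", "status19", "status21"]
```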
19. Raw Client API
● Thrift RPC framework
  ● Generates client bindings for (almost) any language
1. Get the most recent status in a timeline:
  ● get_slice(keyspace, key, [column_family, column_name], predicate, consistency_level)
  ● get_slice(“statusapp”, “userid19:personal”, [“Timelines”], {start: ””, count: 1}, QUORUM)
    > <timeuuid1>: “status19”
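To make the slice call above concrete, here is a toy model of what a slice read does over a row's sorted columns. The helper name and hash-based predicate are illustrative, not the actual Thrift structs:

```ruby
# Columns live sorted by name; a slice with {start: "", count: 1} returns
# the first `count` columns at or after `start`.
def get_slice(columns, start, count)
  columns.sort
         .drop_while { |name, _| !start.empty? && name < start }
         .first(count)
         .to_h
end

columns = { "a-timeuuid1" => "status19", "b-timeuuid2" => "status21" }

get_slice(columns, "", 1)  # => { "a-timeuuid1" => "status19" }
```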
20. But...
● Don't use the raw Thrift API!
  ● You won't enjoy it
● Use high-level client APIs
  ● Many options for each language
21. Consistency Levels?
● Eventual consistency
  ● Sync to Washington, async to Hong Kong
● Client API tunables
  ● Synchronously write to W replicas
  ● Confirm R replicas match at read time
  ● ...of N total replicas
● Allows for almost-strong consistency
  ● When W + R > N
22. Write Example
Replication Factor == N == 3: 3 copies
24. Write Example
cl.ONE:
W == 1
Block for success on 1 replica
25. Write Example
cl.QUORUM:
W == N/2+1
Block for success on a majority
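The levels above all trade off against the single rule from slide 21: a read is guaranteed to overlap the latest write when W + R > N, and cl.QUORUM picks a majority on both sides. The arithmetic, as a sketch:

```ruby
# Majority of n replicas, via integer division (the cl.QUORUM size).
def quorum(n)
  n / 2 + 1
end

n = 3
w = quorum(n)        # => 2: block for a majority of writes
r = quorum(n)        # => 2: compare a majority of replicas at read time
w + r > n            # => true: every quorum read overlaps every quorum write

# cl.ONE on both sides gives the lowest latency but no overlap guarantee:
1 + 1 > n            # => false: a ONE read may miss a recent ONE write
```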
26. Caveat consumptor
● No secondary indexes:
  ● Typically implemented in client libraries
● No transactions
  ● But atomic increment/decrement coming real soon now
● Absolutely no joins
  ● You don't really want 'em anyway
29. Cassandra Ruby Support: RDF.rb
● Repository implementation for RDF.rb
● Stores triple of (subject, predicate, object) as (rowkey, name, subname)
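One way to picture that mapping, with plain hashes standing in for the Cassandra-backed repository (the helper here is illustrative, not the RDF.rb API):

```ruby
# Each RDF subject becomes a row key; predicates become column names,
# and objects hang off them — so all triples about one subject live in
# one row and can be fetched with a single row read.
store = Hash.new { |h, k| h[k] = {} }

def insert_triple(store, subject, predicate, object)
  (store[subject][predicate] ||= []) << object
end

insert_triple(store, "http://example.org/stu", "foaf:name", "Stu Hood")
insert_triple(store, "http://example.org/stu", "foaf:knows", "http://example.org/damon")

store["http://example.org/stu"]["foaf:name"]  # => ["Stu Hood"]
```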
30. Silver linings: Ops
● Dead drive?
  ● Swap the drive, restart, run 'repair'
  ● Streams missing data from other replicas
● Dead node?
  ● Start a new node with the same IP and token, run 'repair'
31. Silver linings: Ops
● Need N new nodes?
  ● Start more nodes with the same config file
  ● New nodes request load information from the cluster and join with a token that balances the cluster
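The balancing step can be pictured with a little token arithmetic: RandomPartitioner tokens live on a ring of size 2**127, so n evenly spaced tokens split it into equal arcs (illustrative math, not the node's actual bootstrap logic):

```ruby
# Size of the RandomPartitioner token space (MD5-based).
RING = 2**127

# Evenly spaced tokens: each of the n nodes owns a 1/n arc of the ring.
def balanced_tokens(node_count)
  (0...node_count).map { |i| i * RING / node_count }
end

balanced_tokens(4)
# => [0, 2**125, 2**126, 3 * 2**125]
```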
32. Silver linings: Ops
● Adding a datacenter?
  ● Configure “dc/rack/ip” describing node location
  ● Add new nodes as before