Cassandra is a powerful NoSQL database that is easy to use and manage. It provides high availability, automatic replication across multiple data centers, and tunable consistency.
Cassandra is a good choice for use cases that require high write throughput, including Internet of Things and sensor data, analytics, logging and clickstream, and many others.
5. Consistency in Cassandra
• ACID - Atomicity Consistency Isolation Durability
• BASE - Basically Available Soft-state Eventual consistency
• Isolation on the row level
• Atomic batches starting with Cassandra 1.2
• Consistency level for READs and WRITEs set for every request
• Tunable consistency
• Log: CL_WRITE = ANY or ONE
• Strong: CL_READ + CL_WRITE > REPLICATION_FACTOR
• Recommended default: LOCAL_QUORUM
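The strong-consistency rule above is plain arithmetic over replica counts. A minimal sketch, assuming a single data center; the class and method names are hypothetical and are not part of any Cassandra API:

```java
// Sketch only: ConsistencyMath is a hypothetical helper, not a Cassandra class.
final class ConsistencyMath {

    /** Number of replicas that form a quorum for the given replication factor. */
    static int quorum(int replicationFactor) {
        return replicationFactor / 2 + 1;
    }

    /** True when CL_READ + CL_WRITE > REPLICATION_FACTOR, counted in replicas. */
    static boolean isStrong(int readReplicas, int writeReplicas, int replicationFactor) {
        return readReplicas + writeReplicas > replicationFactor;
    }

    public static void main(String[] args) {
        int rf = 3;
        int q = quorum(rf);
        System.out.println("QUORUM read + QUORUM write strong? " + isStrong(q, q, rf));
        System.out.println("ONE read + ONE write strong? " + isStrong(1, 1, rf));
        System.out.println("ONE read + ALL write strong? " + isStrong(1, rf, rf));
    }
}
```

With a replication factor of 3, a quorum is 2 replicas, so quorum reads combined with quorum writes always overlap in at least one replica, while ONE combined with ONE does not guarantee an overlap.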
6. Consistency in Cassandra - continued
Level: Description
ANY: A write must be written to at least one node. If all replica nodes for the given row key are down, the write can still succeed once a hinted handoff has been written. Note that if all replica nodes are down at write time, an ANY write will not be readable until the replica nodes for that row key have recovered.
ONE: A write must be written to the commit log and memory table of at least one replica node.
QUORUM: A write must be written to the commit log and memory table on a quorum of replica nodes.
LOCAL_QUORUM: A write must be written to the commit log and memory table on a quorum of replica nodes in the same data center as the coordinator node. Avoids the latency of inter-data center communication.
EACH_QUORUM: A write must be written to the commit log and memory table on a quorum of replica nodes in each data center.
ALL: A write must be written to the commit log and memory table on all replica nodes in the cluster for that row key.
10. Astyanax client
• Based on Hector
• High level, simple object oriented interface to Cassandra.
• Fail-over behavior on the client side.
• Connection pool abstraction (round robin connection pool)
• Monitoring to get event notification from the connection pool.
• Complete encapsulation of the underlying Thrift API.
• Automatic retry of downed hosts.
• Automatic discovery of additional hosts in the cluster.
• Suspension of hosts for a short period of time after timeouts.
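The round-robin pool and the short suspension after timeouts listed above can be illustrated without the client library. The sketch below is not Astyanax code and does not use its API; the class RoundRobinHosts and its methods are hypothetical, and time is passed in explicitly to keep the example deterministic:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch only: a hypothetical host selector, not the Astyanax connection pool.
final class RoundRobinHosts {
    private final List<String> hosts;
    private final Map<String, Long> suspendedUntil = new HashMap<>();
    private final long suspendMillis;
    private int next = 0;

    RoundRobinHosts(List<String> hosts, long suspendMillis) {
        this.hosts = new ArrayList<>(hosts);
        this.suspendMillis = suspendMillis;
    }

    /** Returns the next host that is not suspended at nowMillis, or null if none is available. */
    synchronized String nextHost(long nowMillis) {
        for (int i = 0; i < hosts.size(); i++) {
            String host = hosts.get(next);
            next = (next + 1) % hosts.size();
            Long until = suspendedUntil.get(host);
            if (until == null || until <= nowMillis) {
                return host;
            }
        }
        return null;
    }

    /** Suspends a host for suspendMillis after a timeout was observed at nowMillis. */
    synchronized void reportTimeout(String host, long nowMillis) {
        suspendedUntil.put(host, nowMillis + suspendMillis);
    }
}
```

A caller would ask for `nextHost(System.currentTimeMillis())` before each request and call `reportTimeout` when a request times out; the suspended host rejoins the rotation once the suspension period has elapsed.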
12. Data Modeling in Cassandra
• Column Families are NOT tables!
• Map<RowKey, SortedMap<ColumnKey, ColumnValue>>
• Values can be, and often are, stored in column names
• The number of columns can differ from row to row
• There can be up to 2 billion columns in one row!
• Use UUIDs
• Separate read-heavy from write-heavy data
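The Map<RowKey, SortedMap<ColumnKey, ColumnValue>> view above can be written down directly with Java collections. This is an in-memory illustration of the shape of the data only; ColumnFamilySketch and the sample keys are hypothetical, and string keys stand in for the byte buffers Cassandra actually stores:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.SortedMap;
import java.util.TreeMap;

// Sketch only: an in-memory model of a column family, not a Cassandra client.
final class ColumnFamilySketch {
    // row key -> (column name -> column value), with columns kept sorted by name
    private final Map<String, SortedMap<String, String>> rows = new HashMap<>();

    void put(String rowKey, String columnName, String value) {
        rows.computeIfAbsent(rowKey, k -> new TreeMap<>()).put(columnName, value);
    }

    /** Returns the columns of a row, or an empty map when the row does not exist. */
    SortedMap<String, String> row(String rowKey) {
        return rows.getOrDefault(rowKey, new TreeMap<>());
    }

    /** Sample data: the timestamp is stored in the column name, the reading in the value. */
    static ColumnFamilySketch demo() {
        ColumnFamilySketch cf = new ColumnFamilySketch();
        cf.put("sensor-1", "2012-01-01T00:05", "21.7");
        cf.put("sensor-1", "2012-01-01T00:00", "21.4");
        cf.put("sensor-2", "2012-01-01T00:00", "19.8");
        return cf;
    }
}
```

In the sample data the two rows hold a different number of columns, and the columns of "sensor-1" come back ordered by name regardless of insertion order.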
13. Data Modeling in Cassandra - continued
• Client joins
• Denormalize data
• Wide rows
• Materialized views
• Model around queries
• Row key is “shard” key
14. Modeling nested entities and documents
Motivation
• Parent-child decomposition lacks performance in Cassandra.
• No JOIN operator in CQL!
• The only solution is to store a tree-like structure with nested “children”
• Cassandra doesn’t have built-in support for a document object
Solution
• Column Families are NOT tables
• Domain object fields are traversed along with the nested entities
• Collection and Map fields (at any nesting depth) are unwrapped
into plain key-value pairs (each mapped to a Cassandra column name and value)
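The unwrapping step described above can be sketched as a recursive walk. This is an assumption about how such a mapper might work, not the implementation behind these slides: DocumentFlattener, the "." path separator, and the use of list indexes as path segments are all hypothetical choices.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Sketch only: flattens nested maps and lists into column name -> value pairs.
final class DocumentFlattener {

    static Map<String, String> flatten(Map<String, ?> document) {
        Map<String, String> columns = new LinkedHashMap<>();
        walk("", document, columns);
        return columns;
    }

    private static void walk(String prefix, Object node, Map<String, String> out) {
        if (node instanceof Map) {
            // Nested entity: descend with the field name appended to the path
            for (Map.Entry<?, ?> e : ((Map<?, ?>) node).entrySet()) {
                walk(join(prefix, String.valueOf(e.getKey())), e.getValue(), out);
            }
        } else if (node instanceof List) {
            // Collection: descend with the element index appended to the path
            List<?> list = (List<?>) node;
            for (int i = 0; i < list.size(); i++) {
                walk(join(prefix, String.valueOf(i)), list.get(i), out);
            }
        } else {
            // Leaf: the accumulated path becomes the column name
            out.put(prefix, String.valueOf(node));
        }
    }

    private static String join(String prefix, String part) {
        return prefix.isEmpty() ? part : prefix + "." + part;
    }

    /** Hypothetical sample document with one nested entity and one collection. */
    static Map<String, Object> sample() {
        Map<String, Object> address = new LinkedHashMap<>();
        address.put("city", "Springfield");
        address.put("zip", "12345");
        Map<String, Object> person = new LinkedHashMap<>();
        person.put("name", "Tom");
        person.put("address", address);
        person.put("phones", Arrays.asList("555-0100", "555-0101"));
        return person;
    }
}
```

Flattening the sample yields the columns name, address.city, address.zip, phones.0 and phones.1, all of which can live in a single row keyed by the parent entity. Empty maps and lists produce no columns in this sketch.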
17. Range queries in Cassandra
Motivation
• No CQL equivalent for the SQL clause:
WHERE “field_name” >= value1 AND “field_name” <= value2
• For indexed fields the only possible query is
WHERE “field_name” [<,>,<=,>=,=] “value”, and “field_name” can be
specified in a CQL query only once
Solution
• Any Cassandra column name is a byte buffer ~ byte[] columnName
• Column names (unlike values)
can be filtered by a specified range,
i.e. if two border values
• byte[] lowMargin,
• byte[] highMargin
are defined, it is possible to select columns with columnName
WHERE columnName >= lowMargin AND columnName <= highMargin
• As about 2 billion columns can be persisted for the same key,
it is possible to search quickly among lists of size < 2 * 10^9
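The column-name range described above behaves like a sub-map lookup on a sorted map keyed by byte arrays. A minimal in-memory sketch, assuming unsigned lexicographic byte ordering; ColumnSliceSketch and its sample data are hypothetical and no Cassandra client is involved:

```java
import java.nio.charset.StandardCharsets;
import java.util.Comparator;
import java.util.NavigableMap;
import java.util.TreeMap;

// Sketch only: models one row as a sorted map of byte[] column names.
final class ColumnSliceSketch {

    // Unsigned lexicographic order over raw bytes (assumed comparator for this sketch)
    static final Comparator<byte[]> BYTES = (a, b) -> {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int cmp = Integer.compare(a[i] & 0xFF, b[i] & 0xFF);
            if (cmp != 0) {
                return cmp;
            }
        }
        return Integer.compare(a.length, b.length);
    };

    static byte[] name(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    /** Columns with lowMargin <= columnName <= highMargin; requires lowMargin <= highMargin. */
    static NavigableMap<byte[], String> slice(NavigableMap<byte[], String> row,
                                              byte[] lowMargin, byte[] highMargin) {
        return row.subMap(lowMargin, true, highMargin, true);
    }

    /** Hypothetical row whose column names are ISO dates. */
    static NavigableMap<byte[], String> demoRow() {
        NavigableMap<byte[], String> row = new TreeMap<>(BYTES);
        row.put(name("2012-01-05"), "a");
        row.put(name("2012-02-10"), "b");
        row.put(name("2012-03-15"), "c");
        row.put(name("2012-04-20"), "d");
        return row;
    }
}
```

Calling `slice(demoRow(), name("2012-02-01"), name("2012-03-31"))` selects the two columns named 2012-02-10 and 2012-03-15. Because the columns are kept sorted, the lookup does not need to scan the whole row.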
18. Composite Column Families
Motivation
• Raw untyped column names are not convenient in processing.
• If 2 or more components of a column name are serialized
into the same byte buffer, it is hard to build a quick search on a single part.
For instance, let’s introduce a column name consisting of two components:
• person_name: String
• time_stamp: Date
How can we build a column range that returns all previously persisted
combinations with person_name = “Tom” and time_stamp >= “1999-01-01” and
time_stamp <= “2012-01-01”?
Solution
Cassandra has a built-in CompositeType comparator, which can be defined for
a number of components and sorts columns by component 0 first, then 1, …
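The component-by-component ordering can be imitated in memory to show why it answers the “Tom” question. The sketch below only models the sort order with a Java comparator; it does not use CompositeType or its serialized format, and CompositeColumnSketch, Name and the sample data are hypothetical:

```java
import java.time.LocalDate;
import java.util.Comparator;
import java.util.NavigableMap;
import java.util.TreeMap;

// Sketch only: models composite column names (person_name, time_stamp) in memory.
final class CompositeColumnSketch {

    static final class Name {
        final String personName;   // component 0
        final LocalDate timeStamp; // component 1

        Name(String personName, LocalDate timeStamp) {
            this.personName = personName;
            this.timeStamp = timeStamp;
        }
    }

    // Sort by component 0 first, then by component 1
    static final Comparator<Name> ORDER =
            Comparator.comparing((Name n) -> n.personName)
                      .thenComparing(n -> n.timeStamp);

    /** All columns of one person with from <= time_stamp <= to. */
    static NavigableMap<Name, String> range(NavigableMap<Name, String> row,
                                            String person, LocalDate from, LocalDate to) {
        return row.subMap(new Name(person, from), true, new Name(person, to), true);
    }

    /** Hypothetical row with columns for two people. */
    static NavigableMap<Name, String> demoRow() {
        NavigableMap<Name, String> row = new TreeMap<>(ORDER);
        row.put(new Name("Tom", LocalDate.of(1998, 6, 1)), "v1");
        row.put(new Name("Tom", LocalDate.of(2000, 3, 15)), "v2");
        row.put(new Name("Tom", LocalDate.of(2011, 12, 31)), "v3");
        row.put(new Name("Tom", LocalDate.of(2013, 1, 1)), "v4");
        row.put(new Name("Ann", LocalDate.of(2005, 5, 5)), "v5");
        return row;
    }
}
```

Because all “Tom” columns are adjacent and sorted by date, `range(demoRow(), "Tom", LocalDate.of(1999, 1, 1), LocalDate.of(2012, 1, 1))` is a single contiguous slice containing v2 and v3.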
19. Composite Column Families - mapping
import java.util.UUID;

// @Id, @Component and @Value belong to the mapping layer; their imports are not shown in the source.
public class ReferenceCategoryValue {
    @Id
    private String category;     // maps to the row key
    @Component(ordinal = 0)      // the following three fields are serialized
    private UUID id;             // into a composite column name
    @Component(ordinal = 1)
    private String description;
    @Component(ordinal = 2)
    private String code;
    @Value
    private String value;        // the value saved for the column
}