Cassandra is a highly scalable, distributed database designed to handle large amounts of structured data across multiple nodes without single points of failure. It uses a peer-to-peer model where data is distributed across nodes in the cluster. Cassandra has a flexible data model based on keyspaces, tables and columns and is optimized for writes without compromising read performance.
2. Cassandra is a distributed database from Apache.
It is highly scalable and designed to manage very large amounts of
structured data.
It provides high availability with no single point of failure.
It is a column-oriented (wide-column) database.
3. Cassandra | RDBMS
Handles unstructured as well as structured data. | Handles structured data.
Flexible schema. | Fixed schema.
Relationships are represented using collections. | Relationships are represented using foreign keys, joins, etc.
Does not support joins. | Supports joins.
4. Cassandra is designed to handle big data workloads across multiple
nodes without any single point of failure.
Cassandra uses a peer-to-peer distributed architecture in which all
nodes play the same role.
Data is distributed among all the nodes in a cluster.
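One way to picture how data is spread across peer nodes is a token ring. The following is a minimal sketch, not Cassandra's implementation: the `Ring` class, the `token_for` helper, and the node names are hypothetical, and MD5 stands in for the Murmur3 hash used by Cassandra's default partitioner. Each key is hashed to a token and stored on the next nodes clockwise, which is roughly how simple replica placement behaves.

```python
import bisect
import hashlib


def token_for(key: str) -> int:
    # Stand-in hash; Cassandra's default partitioner uses Murmur3, not MD5.
    return int.from_bytes(hashlib.md5(key.encode("utf-8")).digest()[:8], "big")


class Ring:
    """Toy token ring: each node owns one token; keys go to the next token clockwise."""

    def __init__(self, nodes):
        self._ring = sorted((token_for(name), name) for name in nodes)
        self._tokens = [token for token, _ in self._ring]

    def replicas(self, key: str, replication_factor: int = 1):
        # Find the first node whose token is >= the key's token, wrapping around.
        start = bisect.bisect_left(self._tokens, token_for(key)) % len(self._ring)
        count = min(replication_factor, len(self._ring))
        return [self._ring[(start + i) % len(self._ring)][1] for i in range(count)]


ring = Ring(["node-a", "node-b", "node-c", "node-d"])
owners = ring.replicas("user:42", replication_factor=3)
```

Because every node can compute the same ring, any node can work out which peers hold a key, so no central coordinator is needed.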
Advantages and Applicable Area
Open Source
Peer to peer
High availability and performance
5. The components of the Cassandra data model are keyspaces,
tables, and columns.
Keyspace: the outermost container for data in Cassandra.
◦ There is no default keyspace.
◦ Replication is specified at the keyspace level.
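Since replication is set per keyspace, the setting appears in the CREATE KEYSPACE statement itself. The sketch below builds such a statement as a string; the helper name `create_keyspace_cql` and the keyspace name `demo_ks` are hypothetical, and `SimpleStrategy` is assumed here because it is the simplest single-data-center option.

```python
def create_keyspace_cql(name: str, replication_factor: int) -> str:
    """Build a CREATE KEYSPACE statement using SimpleStrategy (single data center)."""
    # Rough safety check only; this is not a full CQL identifier validator.
    if not name.isidentifier():
        raise ValueError(f"unsafe keyspace name: {name!r}")
    if replication_factor < 1:
        raise ValueError("replication_factor must be at least 1")
    return (
        f"CREATE KEYSPACE IF NOT EXISTS {name} "
        "WITH replication = "
        f"{{'class': 'SimpleStrategy', 'replication_factor': {replication_factor}}};"
    )


statement = create_keyspace_cql("demo_ks", 3)
```

Because there is no default keyspace, a session would then either run `USE demo_ks;` or qualify table names as `demo_ks.table_name`.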
6. CQL offers only basic built-in aggregates (COUNT, MIN, MAX, SUM, AVG, added in
Cassandra 2.2); earlier versions had none.
CQL supports GROUP BY only on primary key columns (since Cassandra 3.10) and does not support HAVING.
CQL does not support joins.
CQL does not support OR queries.
CQL does not support general wildcard queries (LIKE works only on SASI-indexed columns).
CQL does not support UNION or INTERSECT queries.
Non-primary-key columns cannot be filtered without creating an index, unless ALLOW FILTERING is used.
Greater-than (>) and less-than (<) comparisons are generally supported only on
clustering columns.
CQL is not well suited to analytics because of these limitations.
7. Gossip is the internal communication protocol that nodes in a cluster use to talk to each other.
It runs every second on every node and exchanges state messages with up to three other nodes in the
cluster.
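The effect of gossiping with up to three peers per round can be sketched with a small simulation. This is a simplified model, not Cassandra's actual message exchange: the `gossip_round` function, the node names, and the version counters are hypothetical, and each loop iteration stands in for one second.

```python
import random


def gossip_round(states, rng, fanout=3):
    """One round: every node syncs its view with up to `fanout` random peers."""
    nodes = list(states)
    for node in nodes:
        # Each node bumps its own heartbeat version before gossiping.
        states[node][node] = states[node].get(node, 0) + 1
        others = [n for n in nodes if n != node]
        for peer in rng.sample(others, k=min(fanout, len(others))):
            # Both sides keep the highest version seen for every known node.
            merged = dict(states[node])
            for name, version in states[peer].items():
                merged[name] = max(merged.get(name, 0), version)
            states[node] = dict(merged)
            states[peer] = dict(merged)


rng = random.Random(7)  # fixed seed so the run is repeatable
# Each node starts out knowing only about itself.
states = {f"node-{i}": {f"node-{i}": 0} for i in range(8)}
rounds = 0
# Stop once every node knows about every other node (with a safety cap).
while any(len(view) < len(states) for view in states.values()) and rounds < 100:
    gossip_round(states, rng)
    rounds += 1
```

Even though each node only contacts a few peers per round, knowledge of the whole cluster spreads within a handful of rounds, which is why gossip scales without a central registry.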
8. A snitch's job is to determine which data centers and racks Cassandra should read data from and
write data to.
Types of Snitches:
SimpleSnitch
GossipingPropertyFileSnitch
PropertyFileSnitch
Ec2Snitch
Ec2MultiRegionSnitch
RackInferringSnitch
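Two of the approaches in this list can be contrasted with a short sketch. It is illustrative only: the function names, IP addresses, and topology mapping are hypothetical. The first function mimics the RackInferringSnitch convention of reading the data center from the second octet of a node's IP address and the rack from the third; the second mimics an explicit lookup in the style of PropertyFileSnitch.

```python
def rack_inferring(ip: str):
    """Infer (data center, rack) from the 2nd and 3rd octets of an IPv4 address."""
    octets = ip.split(".")
    if len(octets) != 4:
        raise ValueError(f"not an IPv4 address: {ip!r}")
    return octets[1], octets[2]


def property_file(ip: str, topology: dict, default=("DC1", "RAC1")):
    """Look up (data center, rack) in an explicit mapping, like a topology file."""
    return topology.get(ip, default)


# Hypothetical addresses and names, for illustration only.
topology = {"10.20.30.40": ("DC1", "RAC1"), "10.21.30.41": ("DC2", "RAC1")}

inferred = rack_inferring("10.20.30.40")
looked_up = property_file("10.21.30.41", topology)
```

Inference needs no configuration but only works when the IP addressing scheme follows the layout of the data centers; an explicit mapping works with any addressing but has to be maintained by hand.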
9. Compaction is a maintenance process in Cassandra in which SSTables are merged and reorganized to
optimize the data structures on disk.
It is needed because every memtable flush writes a new immutable SSTable, so data for one row can end up spread over several files.
There are two types of compaction in Cassandra.
◦ Minor compaction: starts automatically when a new SSTable is created. Cassandra
condenses similarly sized SSTables into one.
◦ Major compaction: triggered manually with nodetool (nodetool compact). It compacts all SSTables of a table
(column family) into one.
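The merge step at the heart of compaction can be sketched as follows. This is a toy model under stated assumptions: each SSTable is a dictionary from key to a (timestamp, value) pair, `TOMBSTONE` and `compact` are hypothetical names, and deleted keys are dropped immediately, whereas Cassandra keeps tombstones for a grace period first.

```python
TOMBSTONE = None  # marker for a deleted value in this toy model


def compact(sstables):
    """Merge SSTables into one, keeping only the newest cell per key."""
    merged = {}
    for table in sstables:
        for key, (timestamp, value) in table.items():
            if key not in merged or timestamp > merged[key][0]:
                merged[key] = (timestamp, value)
    # Drop deleted keys and return rows in sorted key order, like an SSTable.
    return {k: merged[k] for k in sorted(merged) if merged[k][1] is not TOMBSTONE}


older = {"alice": (1, "v1"), "bob": (1, "v1")}
newer = {"alice": (2, "v2"), "bob": (3, TOMBSTONE), "carol": (2, "v1")}
result = compact([older, newer])
```

After the merge, a read for any key touches one file instead of two, and the space held by overwritten and deleted values is reclaimed.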
10. A Bloom filter is a probabilistic data structure that lets Cassandra determine one of two
possible states:
- The data definitely does not exist in the given file, or
- The data probably exists in the given file.
Cassandra uses it to check whether the requested row (partition key) may exist in an SSTable before doing any disk I/O.
To change the Bloom filter setting on a table (column family):
◦ ALTER TABLE addamsFamily WITH bloom_filter_fp_chance = 0.01;
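The "definitely not" versus "probably yes" behaviour can be seen in a minimal Bloom filter sketch. It is not Cassandra's implementation: the `BloomFilter` class, the bit-array size, the number of hashes, and the use of salted SHA-256 are all assumptions chosen for readability.

```python
import hashlib


class BloomFilter:
    def __init__(self, size_bits: int = 1024, num_hashes: int = 3):
        self.size_bits = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8 + 1)

    def _positions(self, key: str):
        # Derive several bit positions by salting one hash function.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{key}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.size_bits

    def add(self, key: str) -> None:
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, key: str) -> bool:
        # False means "definitely absent"; True means "probably present".
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(key))


bf = BloomFilter()
for key in ("gomez", "morticia", "wednesday"):
    bf.add(key)
```

A key that was added always reports `True`; a key that was never added usually reports `False` but can occasionally report `True`. Lowering `bloom_filter_fp_chance` corresponds to spending more memory to make such false positives rarer.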