Basic Introduction to Cassandra with Architecture and Strategies
Covers the big data challenge and what a NoSQL database is.
The Big Data Challenge
The Cassandra Solution
The CAP Theorem
The Architecture of Cassandra
The Data Partition and Replication
8. Objective
• Schema Free
• Easy Replication
• Simple API
• Eventually consistent
• Can handle huge amounts of data
NoSQL Database
• Simplicity of design
• Horizontal scaling
• Finer control over availability
9. Relational Database vs. NoSQL
Relational Database
• Supports powerful query language
• It has a fixed schema
• Follows ACID (Atomicity, Consistency, Isolation, and Durability)
• Supports transactions
NoSQL Database
• Supports very simple query language
• No Fixed Schema
• Typically only “eventually consistent”
• Typically does not support multi-row ACID transactions
10. Other NoSQL Database
• Apache HBase - HBase is an open-source, non-relational, distributed database modeled after Google’s Bigtable and written in Java. It is developed as part of the Apache Hadoop project and runs on top of HDFS, providing Bigtable-like capabilities for Hadoop.
• MongoDB - MongoDB is a cross-platform, document-oriented database system that avoids the traditional table-based relational database structure in favor of JSON-like documents with dynamic schemas, making the integration of data in certain types of applications easier and faster.
11. What is Apache Cassandra?
• Apache Cassandra™ is a free, distributed, high-performance, extremely scalable, fault-tolerant (i.e. no single point of failure) post-relational database solution.
• Cassandra can serve both as a real-time data store (the “system of record”) for online/transactional applications and as a read-intensive database for business intelligence systems.
12. Features of Cassandra
• Elastic scalability
• Always on architecture
• Fast linear-scale performance
• Flexible data storage
• Easy data distribution
• Transaction support (atomicity, isolation, and durability at the row level)
• Fast writes
13. History of Cassandra
• Cassandra was developed at Facebook for inbox search.
• It was open-sourced by Facebook in July 2008.
• Cassandra was accepted into Apache Incubator in March 2009.
• It became an Apache top-level project in February 2010.
15. CAP Theorem
• A distributed system can provide only two of the following three guarantees at once:
• Availability
• Consistency
• Partition tolerance
• Also known as Brewer's theorem
16. Cassandra: AP
• Cassandra prioritizes availability and partition tolerance
• Strong consistency is not guaranteed by default
• Consistency can be traded off against latency
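The latency/consistency tradeoff can be made concrete with a little arithmetic. The sketch below is not from the slides; it assumes the commonly cited rule that a read is guaranteed to see the latest write when the number of replicas read (R) plus the number of replicas written (W) exceeds the replication factor (RF). The function names are hypothetical.

```python
def quorum(replication_factor: int) -> int:
    """Smallest number of replicas that forms a majority."""
    return replication_factor // 2 + 1


def reads_see_latest_write(read_replicas: int, write_replicas: int,
                           replication_factor: int) -> bool:
    """True when every read set must overlap every write set (R + W > RF)."""
    return read_replicas + write_replicas > replication_factor


if __name__ == "__main__":
    rf = 3
    q = quorum(rf)
    # Quorum reads and writes: slower, but the read and write sets overlap.
    print("quorum size:", q)
    print("quorum/quorum overlaps:", reads_see_latest_write(q, q, rf))
    # One replica for reads and writes: fastest, but a read may hit a stale replica.
    print("one/one overlaps:", reads_see_latest_write(1, 1, rf))
```

Waiting for fewer replicas lowers latency; waiting for more raises the chance that a read reflects the most recent write.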
17. Other Approaches: CP
• E.g. HBase
• Implements row locking for consistency
• HBase has a master/slave design, so the master is a single point of failure
• Gives up availability (the “A” in CAP)
19. Architecture Overview
• Cassandra was designed with the understanding that
system/hardware failures can and do occur
• Peer-to-peer, distributed system
• All nodes are the same
• Data partitioned among all nodes in the cluster
• Custom data replication to ensure fault tolerance
• Read/Write-anywhere design
20. Architecture Overview
• Nodes communicate with each other through the gossip protocol, which exchanges information across the cluster every second
• A commit log is used on each node to capture write activity, which assures data durability
• Data is also written to an in-memory structure (memtable) and then to disk once the memory structure is full (an SSTable)
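The write path described above (commit log, then memtable, then SSTable) can be sketched as a toy in-memory model. This is only an illustration under simplifying assumptions: the class name `ToyNode`, the `memtable_limit` parameter and the use of Python lists in place of real disk files are all hypothetical and are not Cassandra's actual implementation.

```python
class ToyNode:
    """Toy model of one node's write path: commit log -> memtable -> SSTable."""

    def __init__(self, memtable_limit: int = 3):
        self.commit_log = []   # append-only record of every write (durability)
        self.memtable = {}     # in-memory structure holding recent writes
        self.sstables = []     # flushed, immutable tables (stand-in for disk files)
        self.memtable_limit = memtable_limit

    def write(self, key, value):
        self.commit_log.append((key, value))   # logged first, for crash recovery
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def flush(self):
        # The full memtable is written out sorted by key, then emptied.
        self.sstables.append(tuple(sorted(self.memtable.items())))
        self.memtable = {}

    def read(self, key):
        if key in self.memtable:
            return self.memtable[key]
        for table in reversed(self.sstables):  # newest flushed table first
            for stored_key, value in table:
                if stored_key == key:
                    return value
        return None


if __name__ == "__main__":
    node = ToyNode(memtable_limit=2)
    node.write("a", 1)
    node.write("b", 2)   # memtable is now full, so it is flushed
    node.write("a", 3)   # newer value lives in the fresh memtable
    print(node.read("a"), node.read("b"), len(node.sstables))
```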
21. Architecture Overview
• The schema used in Cassandra is modeled after Google Bigtable. It is a row-oriented, column structure
• A keyspace is akin to a database in the RDBMS world
• A column family is similar to an RDBMS table but is more flexible/dynamic
• A row in a column family is indexed by its key. Other columns may be indexed as well
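One way to picture the keyspace / column family / row / column hierarchy is as nested dictionaries. The sketch below is only an analogy, not how Cassandra stores data; the `users` column family, the sample rows and the `get_column` helper are invented for illustration.

```python
# keyspace -> column family -> row key -> {column name: value}
keyspace = {
    "users": {  # a column family, roughly comparable to a table
        "alice": {"email": "alice@example.com", "city": "Lisbon"},
        "bob": {"email": "bob@example.com"},  # rows need not share the same columns
    }
}


def get_column(keyspace, column_family, row_key, column, default=None):
    """Look a value up by row key first, then by column name."""
    return keyspace.get(column_family, {}).get(row_key, {}).get(column, default)


if __name__ == "__main__":
    print(get_column(keyspace, "users", "alice", "city"))
    print(get_column(keyspace, "users", "bob", "city"))
```

The second lookup finds no value because the row for `bob` simply has no `city` column, which is the "more flexible/dynamic" property mentioned above.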
22. Components of Cassandra
• Node − The place where data is stored.
• Data center − A collection of related nodes.
• Cluster − A component that contains one or more data centers.
• Commit log − The crash-recovery mechanism in Cassandra. Every write operation is written to the commit log.
• Mem-table − A memory-resident data structure. After the commit log, the data is written to the mem-table. Sometimes a single column family will have multiple mem-tables.
• SSTable − A disk file to which data is flushed from the mem-table when its contents reach a threshold value.
• Bloom filter − A quick, probabilistic algorithm for testing whether an element is a member of a set; it may report false positives but never false negatives. It is a special kind of cache. Bloom filters are checked on every read query.
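To show what a Bloom filter does, here is a minimal sketch. It is a generic textbook version, not Cassandra's implementation; the bit-array size, the number of hashes and the use of salted SHA-256 digests are arbitrary assumptions chosen for readability.

```python
import hashlib


class BloomFilter:
    """Minimal Bloom filter: may give false positives, never false negatives."""

    def __init__(self, size_bits: int = 1024, num_hashes: int = 3):
        self.size_bits = size_bits
        self.num_hashes = num_hashes
        self.bits = [False] * size_bits

    def _positions(self, item: str):
        # Derive several bit positions by salting the item with a seed.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode("utf-8")).hexdigest()
            yield int(digest, 16) % self.size_bits

    def add(self, item: str) -> None:
        for position in self._positions(item):
            self.bits[position] = True

    def might_contain(self, item: str) -> bool:
        # False means "definitely absent"; True means "possibly present".
        return all(self.bits[position] for position in self._positions(item))


if __name__ == "__main__":
    row_keys = BloomFilter()
    row_keys.add("row-1")
    print(row_keys.might_contain("row-1"))
```

A "definitely absent" answer lets a read skip a disk file entirely, which is why the filter behaves like a special kind of cache.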
25. Partition Process
• Data is transparently partitioned across the nodes
• Data sent to a node is hashed and routed to a partition based on the hash value
• The data partitioning strategy is controlled via the partitioner option in the cassandra.yaml file
• Once a cluster is initialized with a partitioner option, it cannot be changed without reloading all of the data in the cluster
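The hash-then-route step can be sketched with a small token ring. This is a simplified illustration: deriving each node's token by hashing its name is an assumption made here to keep the example short, and the `TokenRing` class and node names are hypothetical.

```python
import bisect
import hashlib


def token_for(value: str) -> int:
    """Hash a string to an integer token (MD5, as used by random partitioning)."""
    return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)


class TokenRing:
    """Each node owns the token range ending at its own token."""

    def __init__(self, node_names):
        # Assumption for this sketch: a node's token is the hash of its name.
        self._ring = sorted((token_for(name), name) for name in node_names)
        self._tokens = [token for token, _ in self._ring]

    def node_for(self, row_key: str) -> str:
        # First node whose token is >= the key's token, wrapping around the ring.
        index = bisect.bisect_left(self._tokens, token_for(row_key))
        return self._ring[index % len(self._ring)][1]


if __name__ == "__main__":
    ring = TokenRing(["node-a", "node-b", "node-c"])
    for key in ("user:1", "user:2", "user:3"):
        print(key, "->", ring.node_for(key))
```

Because the mapping depends only on the hash function and the node tokens, any node can compute which node owns a given row key.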
26. Partitioning Strategies
• Random Partitioning
  • This is the default and recommended strategy.
  • Partitions data as evenly as possible across all nodes using an MD5 hash of every column family row key
• Ordered Partitioning
  • Stores column family row keys in sorted order across all nodes in the cluster
  • Sequential writes can cause hot spots
  • More administrative overhead to load balance the cluster
  • Uneven load balancing for multiple column families
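The hot-spot problem with ordered partitioning can be seen in a small experiment. The sketch below is a deliberate simplification: it assumes a hypothetical 4-node cluster, integer row keys in the range 0..999, and a plain modulo in place of a real token ring.

```python
import hashlib
from collections import Counter

NODES = 4          # hypothetical cluster size
KEY_SPACE = 1000   # hypothetical range of sequential row keys: 0..999


def random_partition(row_key: int) -> int:
    """Hash-based placement: MD5 of the key, reduced to a node index."""
    digest = hashlib.md5(str(row_key).encode("utf-8")).hexdigest()
    return int(digest, 16) % NODES


def ordered_partition(row_key: int) -> int:
    """Order-preserving placement: each node owns one contiguous key range."""
    return row_key * NODES // KEY_SPACE


def load_per_node(partition, row_keys) -> Counter:
    """Count how many of the given keys land on each node."""
    return Counter(partition(key) for key in row_keys)


if __name__ == "__main__":
    sequential_writes = range(100)   # e.g. 100 consecutive ids
    print("ordered:", dict(load_per_node(ordered_partition, sequential_writes)))
    print("random: ", dict(load_per_node(random_partition, sequential_writes)))
```

With ordered placement all 100 sequential keys fall into the first node's range, while the hashed placement spreads them over several nodes.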
27. Replication
• To ensure fault tolerance and no single point of failure, you can replicate one or more copies of every row across nodes in the cluster
• Replication is controlled by two keyspace parameters: replication factor and replication strategy
• Replication factor controls how many copies of a row are stored in the cluster
• Replication strategy controls how the replicas are placed
28. Replication Strategies
• Simple Strategy
  • Places the original row on a node determined by the partitioner. Additional replica rows are placed on the next nodes clockwise in the ring.
• NetworkTopology Strategy
  • Allows replication between different racks in a data center and/or between multiple data centers
  • The original row is placed according to the partitioner. Additional replica rows in the same data center are then placed by walking the ring clockwise until a node in a different rack from the previous replica is found. If there is no such node, additional replicas are placed in the same rack.
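The two placement rules above can be sketched as follows. This is a rough model for a single data center only; the function names, the list-based ring and the rack mapping are hypothetical, and node names are assumed to be unique.

```python
def simple_strategy(ring, start_index, replication_factor):
    """Original on the partitioner's node, then the next nodes clockwise."""
    count = min(replication_factor, len(ring))
    return [ring[(start_index + step) % len(ring)] for step in range(count)]


def rack_aware_strategy(ring, racks, start_index, replication_factor):
    """Walk clockwise, preferring a node in a different rack than the previous replica."""
    n = len(ring)
    position = start_index % n
    replicas = [ring[position]]
    while len(replicas) < min(replication_factor, n):
        previous_rack = racks[replicas[-1]]
        # Every other ring position, in clockwise order from the last replica.
        walk = [(position + step) % n for step in range(1, n)]
        unused = [i for i in walk if ring[i] not in replicas]
        other_rack = [i for i in unused if racks[ring[i]] != previous_rack]
        # Fall back to the same rack when no other rack has a free node.
        position = other_rack[0] if other_rack else unused[0]
        replicas.append(ring[position])
    return replicas


if __name__ == "__main__":
    ring = ["n1", "n2", "n3", "n4"]   # nodes in clockwise token order
    racks = {"n1": "r1", "n2": "r1", "n3": "r2", "n4": "r2"}
    print(simple_strategy(ring, 0, 3))
    print(rack_aware_strategy(ring, racks, 0, 3))
```

In the example, the simple strategy picks neighbours regardless of rack, while the rack-aware walk alternates racks so that losing one rack does not remove every copy of a row.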
Editor's Notes
Apache Cassandra is a highly scalable, high-performance distributed database designed to handle large amounts of data across many commodity servers, providing high availability with no single point of failure. It is a type of NoSQL database. Let us first understand what a NoSQL database does.
NoSQL Database
A NoSQL database (sometimes called Not Only SQL) is a database that provides a mechanism to store and retrieve data other than the tabular relations used in relational databases. These databases are schema-free, support easy replication, have simple APIs, are eventually consistent, and can handle huge amounts of data.
The primary objectives of a NoSQL database are
• simplicity of design,
• horizontal scaling, and
• finer control over availability.
NoSQL databases use different data structures compared to relational databases, which makes some operations faster in NoSQL. The suitability of a given NoSQL database depends on the problem it must solve.