This document provides an overview of Apache Cassandra, including:
- Its history originating from Facebook's need to solve an inbox search problem.
- Its key features like high availability, linear scalability, fault tolerance and tunable consistency.
- Its architecture based on consistent hashing and a ring topology for data distribution.
- Its data model, which uses keyspaces, column families, rows, and columns differently from a relational database.
- Examples of using the Cassandra CLI to create a schema, insert data, and perform queries.
2. Learning Targets
Big Data introduction
Understand the driving forces behind NoSQL development
Map known RDBMS concepts to corresponding NoSQL paradigms
Get an overview of the Apache Cassandra™ architecture
Get an overview of the Cassandra™ data model
Get first experience with Cassandra™ packaging and CLI
3. Agenda
• Big Data
• NoSQL. Main Technologies
• NoSQL. Products
• Apache Cassandra™ Features
• Apache Cassandra™ Architecture
• Apache Cassandra™ Data Modeling
• Apache Cassandra™ CLI
5. Origin
• April 1998: John R. Mashey (SGI), USENIX talk “Big Data and the Next Wave of Infrastress”
• Big Data refers to huge data volumes, continuously increasing data sources, velocity of data generation, data analysis and related technology solutions
10. Big Data Driving Forces
• Continued growth of Internet usage, social networks, and smartphones
• The falling costs of the technology for information creation, capturing and storage
• Migration from analog TV to digital TV
• Growth of machine-to-machine communication
11. Main Producer
• Machine-generated data is a key factor behind expansion
• Growth from 11% of the digital universe in 2005 to more than 40% in 2020
  – Machine logs
  – RFID readers
  – Sensor networks
  – Vehicle GPS traces
  – Retail transactions
13. Storage Capacities
• I/O for HDDs is time consuming
• Reading a full 1 TB drive at a transfer speed of 300 MB/s (SATA) takes about 1 hour
• SSDs are about 5 times faster on average
• SSDs are more expensive
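The "about 1 hour" figure can be checked with simple arithmetic. The sketch below assumes decimal units (1 TB = 10^12 bytes, 1 MB = 10^6 bytes) and a purely sequential read with no seeks; the function name is illustrative only.

```python
def full_read_seconds(capacity_bytes: float, throughput_bytes_per_s: float) -> float:
    """Time to stream a whole drive sequentially, ignoring seeks and overhead."""
    return capacity_bytes / throughput_bytes_per_s


TB = 10**12  # decimal terabyte (assumption: vendor-style units)
MB = 10**6   # decimal megabyte

hdd_seconds = full_read_seconds(1 * TB, 300 * MB)
hdd_minutes = hdd_seconds / 60
# Apply the slide's "about 5 times faster" figure for SSDs.
ssd_minutes = hdd_minutes / 5

print(f"HDD: {hdd_minutes:.1f} min, SSD: {ssd_minutes:.1f} min")
```

Under these assumptions the HDD read comes out slightly under an hour, which matches the slide's rounded value.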
14. Random Seeks
• Seek time is improving more slowly than transfer rate
• Random seeks are expensive
• Inherent to most RDBMS
15. Structure
• Data is becoming increasingly semi-structured and unstructured
• Unstructured data is data without a schema
• Semi-structured data
  – does not conform to relational database structures
  – is self-describing, containing tags or structure-related markers
16. Limitations of RDBMS
• Up-front schema declaration is needed
• Referential integrity is necessary
• Mainly use B-tree indexes
• Non-linear scaling
• Are built around OLTP and OLAP approaches
• Many solutions are really expensive
23. BASE
• BASE: Basically Available, Soft state, Eventual consistency
R: number of nodes that are read from
W: number of nodes that are written to
N: total number of nodes in the cluster
R + W = 2N: ACID compliant
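A minimal sketch of how R, W and N relate. It assumes N is read as the number of nodes holding a copy of the data, and it uses the commonly cited quorum rule R + W > N (not stated on the slide) as the point where a read is guaranteed to touch at least one node that saw the latest write. The slide's R + W = 2N is the extreme case in which every node takes part in every read and every write. Function names are illustrative.

```python
def read_sees_latest_write(r: int, w: int, n: int) -> bool:
    """True when any R nodes read must overlap any W nodes written."""
    if not (1 <= r <= n and 1 <= w <= n):
        raise ValueError("r and w must be between 1 and n")
    return r + w > n


def all_nodes_on_every_operation(r: int, w: int, n: int) -> bool:
    """The slide's case: R + W = 2N, which forces R = N and W = N."""
    return r + w == 2 * n


n = 3
for r, w in [(1, 1), (2, 2), (3, 3)]:
    print(r, w, read_sees_latest_write(r, w, n), all_nodes_on_every_operation(r, w, n))
```

Lower values of R and W trade consistency for latency and availability, which is the tuning space that BASE systems expose.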
26. Setting Context
• “The Google File System”, October 2003
• “MapReduce: Simplified Data Processing on Large Clusters”, December 2004
• “Bigtable: A Distributed Storage System for Structured Data”, November 2006
• “The Chubby Lock Service for Loosely-Coupled Distributed Systems”, November 2006
27. MapReduce
• Created by Google
• Parallel processing model
• Data locality
• Allows distributed processing on large data sets in a cluster
• Derives its ideas from functional programming
• Works with semi-structured data
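The functional-programming roots can be illustrated with a single-process word count. This is only a sketch of the map, shuffle and reduce phases in plain Python, not Google's or Hadoop's implementation. The sample documents and function names are made up for illustration.

```python
from collections import defaultdict
from functools import reduce


def map_phase(document):
    """Emit one (word, 1) pair per word; each document can be mapped independently."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Group all emitted values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Fold the values for one key into a single result."""
    return key, reduce(lambda a, b: a + b, values)


documents = ["big data big clusters", "data locality matters"]
mapped = [pair for doc in documents for pair in map_phase(doc)]
counts = dict(reduce_phase(key, values) for key, values in shuffle(mapped).items())
print(counts)
```

In a real cluster the map calls would run on the nodes that already hold each document, which is the data-locality point above.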
29. Amazon Dynamo
• “Dynamo: Amazon’s Highly Available Key/value Store”, October 2007
• Introduces the notion of eventual consistency
• There could be small intervals of inconsistency between replicated nodes
• Eventual consistency does not mean inconsistency
31. Apache Hadoop
• 2004: Initial versions of Hadoop Distributed Filesystem and Map-Reduce implemented
• January 2006: Doug Cutting joins Yahoo!
• February 2006: Apache Hadoop project officially started
32. NoSQL Features
• Advocates horizontal scalability over vertical scalability
• Promises linear scalability
• Uses new advanced technologies for parallel processing
• Often uses a custom file system implementation or advanced storage techniques
• Optionally schema-free
• No concept of locking, or locking is a choice by design
34. Ordered Column-Oriented Stores
• Store data sets (Column Families) as sections of columns
  – Set of key(column)/value pairs
  – Sorted by row-key (primary key)
• Units of data are sorted and ordered on the basis of the row-key
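Keeping rows ordered by row-key is what makes range scans cheap. The sketch below is a toy in-memory model, not how any particular store lays out data on disk. The class name, row keys and column names are hypothetical.

```python
import bisect


class OrderedColumnFamily:
    """Toy column family: rows kept in row-key order, columns as key/value pairs."""

    def __init__(self):
        self._keys = []  # row keys, always sorted
        self._rows = {}  # row key -> {column name: value}

    def put(self, row_key, column, value):
        if row_key not in self._rows:
            bisect.insort(self._keys, row_key)  # keep the key list sorted
            self._rows[row_key] = {}
        self._rows[row_key][column] = value

    def scan(self, start, stop):
        """Yield (row_key, columns) for start <= row_key < stop, in key order."""
        lo = bisect.bisect_left(self._keys, start)
        hi = bisect.bisect_left(self._keys, stop)
        for key in self._keys[lo:hi]:
            yield key, self._rows[key]


cf = OrderedColumnFamily()
for key in ["user:003", "video:001", "user:001", "user:002"]:
    cf.put(key, "name", key.upper())

# ';' is the character right after ':', so this range covers every "user:" key.
user_keys = [key for key, _ in cf.scan("user:", "user;")]
print(user_keys)
```

Because the keys are sorted, the scan locates its start with a binary search and then reads neighbours sequentially instead of seeking per row.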
36. Key/Value Stores
• Idea
  – HashMap: fast O(1) access
• The key of a key/value pair is a unique value in the set and can be easily looked up to access the data
• Eventual consistency
38. Document Databases
• Keep documents as loosely structured sets of key/value pairs, typically JSON (JavaScript Object Notation)
• Treat a document as a whole and avoid splitting it into its constituent name/value pairs
• Allow indexing of documents not only by their primary identifier but also by their properties
41. Graph Databases
• Use graph structures with nodes, edges, and properties to represent and store data
• Are based on graph theory
• Are faster for associative data sets
• Do not require expensive join operations
• Best suited for graph-like queries
44. History
• Originated at Facebook in 2007 to solve the company’s inbox search problem
• July 2008: open source Google Code project
• March 2009: Apache Incubator project
• February 2010: top-level Apache project
• November 2013: version 2.0.3 released
45. Cassandra Features (Part I)
• High availability
• Linear and elastic scalability
• Distributed and decentralized
• Peer-to-Peer
• No single point of failure
46. Cassandra Features (Part II)
• Fault tolerance and built-in failure detection
• Tunable consistency
• Supports a basic subset of SQL via CQL
• Command-line access to the store
• Basic security support
47. Cassandra Features (Part III)
• Thrift interface and an internal Java API
• Clients for multiple languages and frameworks: Java, Python, Grails, PHP, .NET, Ruby, Scala
• Support of JMX interfaces
• Built-in benchmarking
• Hadoop and MapReduce integration
55. Simple Strategy
• For single data center clusters
• First replica on a node determined by a partitioner
• Additional replicas are placed on the next nodes clockwise in the ring
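The placement rule above can be sketched as a walk around a sorted token ring. This is a simplified model, not Cassandra's implementation: it assumes the partitioner has already turned the row key into an integer token, and it uses a tiny 0 to 99 token space with made-up node names.

```python
import bisect


class Ring:
    """Toy token ring; each node owns the range ending at its own token."""

    def __init__(self, node_tokens):
        # node_tokens: {node name: token}; sort by token to form the ring.
        self._ring = sorted((token, node) for node, token in node_tokens.items())
        self._tokens = [token for token, _ in self._ring]

    def replicas(self, key_token, replication_factor):
        """First replica: first node with token >= key_token (wrapping around).
        Additional replicas: the next nodes clockwise."""
        size = len(self._ring)
        start = bisect.bisect_left(self._tokens, key_token) % size
        count = min(replication_factor, size)
        return [self._ring[(start + i) % size][1] for i in range(count)]


ring = Ring({"node-a": 0, "node-b": 25, "node-c": 50, "node-d": 75})
print(ring.replicas(30, 3))  # starts at node-c, then continues clockwise
print(ring.replicas(80, 3))  # past the last token, so it wraps to node-a
```

The walk ignores racks and data centers entirely, which is why this strategy is suggested only for single data center clusters.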
60. Network Topology
• Data center: a grouping of nodes configured together for replication purposes
• Rack: a similar physical grouping of nodes
• Snitch maps IPs to racks and data centers
  – All nodes in a cluster must use the same snitch configuration
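What a snitch provides can be sketched as a lookup from IP address to (data center, rack). The sketch below parses lines of the form `ip=DC:RACK`, loosely modeled on a property-file style topology. The addresses, names and the `default` entry are illustrative assumptions, and this is not Cassandra's own snitch code.

```python
def parse_topology(text):
    """Parse 'address=DC:RACK' lines into {address: (data_center, rack)}."""
    topology = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        address, location = line.split("=", 1)
        data_center, rack = location.split(":", 1)
        topology[address.strip()] = (data_center.strip(), rack.strip())
    return topology


def locate(topology, address):
    """Return (data_center, rack) for a node, falling back to 'default' if present."""
    return topology.get(address, topology.get("default"))


SAMPLE = """
# hypothetical cluster layout
10.0.1.10=DC1:RAC1
10.0.1.11=DC1:RAC2
10.0.2.10=DC2:RAC1
default=DC1:RAC1
"""

topology = parse_topology(SAMPLE)
print(locate(topology, "10.0.2.10"))
print(locate(topology, "192.168.0.99"))
```

If two nodes disagreed on this mapping they would disagree on where replicas belong, which is why every node has to use the same configuration.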
61. Cassandra Client API
• Cassandra CLI, Thrift based
• CQL3, native protocol
• Cqlsh with Python dependency
• Drivers for multiple languages
• Java: CQL3 via DataStax 1.0 driver
62. DataStax Java Driver
• Works only with CQL3
• Layered architecture
• Relies on Netty for non-blocking I/O, giving a fully asynchronous architecture
• Connection pooling, node discovery
• Automatic failover, load balancing
• Prepared statements are supported
67. Cassandra vs. RDBMS (Part I)
• No referential integrity
• Doesn’t support joins
• Limited SQL support
• Denormalization
68. Cassandra vs. RDBMS (Part II)
• Storing of collections in a field is possible
• Row size is a design issue
• Comparators for column families
• Ordering is a design issue
71. Column Families
• Serve as a container for an ordered collection of columns/rows
• Are not equal to RDBMS tables
• Column families have to be defined; the individual columns need not be
• Entries in column families are grouped by row key
• All data for a single row must fit on a single machine in the cluster
73. Static Column Families
• Use a relatively static set of column names
• Are more similar to a relational database table
• Have metadata definition for individual columns
74. Dynamic Column Families
• Allow result sets to be pre-computed and stored in a single row for efficient data retrieval
• Define the type information for column names and values (comparators and validators)
• Actual column names and values are set by the application when a column is inserted
75. Column
• Row keys and column names can be any kind of byte array
• Useful data can be stored in the key itself, not only in the value
• Up to 2 billion columns per (physical) row
77. Composite Columns
• Are used under the hood to store clustered rows
• All the logical rows with the same partition key get stored as a single, physical wide row
• Can be created and queried using CQL 3
• Support range queries
• Replace Super Columns
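The mapping from logical rows to one physical wide row can be sketched with composite column names of the form (clustering value, column name). This is an illustrative in-memory model rather than Cassandra's storage format. The `year` column, its values and the function names are hypothetical.

```python
from collections import defaultdict


def to_wide_rows(logical_rows, partition_key, clustering_key):
    """Fold logical rows into {partition key: {(clustering value, column): value}}."""
    wide = defaultdict(dict)
    for row in logical_rows:
        pk, ck = row[partition_key], row[clustering_key]
        for name, value in row.items():
            if name not in (partition_key, clustering_key):
                wide[pk][(ck, name)] = value  # composite column name
    # Columns inside a physical row are kept sorted by their composite name.
    return {pk: dict(sorted(columns.items())) for pk, columns in wide.items()}


def slice_columns(wide_row, start, stop):
    """Range query over the clustering part: start <= clustering value < stop."""
    return {name: value for name, value in wide_row.items() if start <= name[0] < stop}


cars = [
    {"make": "bmw", "model": "640i", "year": 2012},
    {"make": "bmw", "model": "320d", "year": 2010},
    {"make": "toyota", "model": "corolla", "year": 2011},
]
wide = to_wide_rows(cars, partition_key="make", clustering_key="model")
print(wide["bmw"])
print(slice_columns(wide["bmw"], "3", "4"))
```

Both `bmw` logical rows end up in a single physical row, and the sorted composite names are what make the range query over `model` possible.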
78. Skinny Rows
• Are like traditional RDBMS rows
• Each row contains similar sets of column names
• But all columns are optional
79. Wide Rows
• Have lots (potentially millions) of columns
• Typically contain automatically generated names (like UUIDs or timestamps)
• Are used to store lists of things
• All the logical rows with the same partition key get stored as a single, physical row
81. Download and Install
• Cassandra requires at least the Java 1.7 JDK (http://www.oracle.com/technetwork/java/javase/downloads/index.html)
• Download from http://cassandra.apache.org/download/
• Extract into a directory of your choice
• Customize cassandra.yaml in the /conf directory
• Start with bin/cassandra -f
82. Create Schema
cassandra-cli -host localhost -port 9160
create keyspace TestsDataStore;
show keyspaces;
use TestsDataStore;
create column family Cars with comparator = UTF8Type;
update column family Cars with column_metadata =
[
{column_name: make, validation_class: UTF8Type},
{column_name: model, validation_class: UTF8Type}
];
83. Populate With Data
assume Cars keys as utf8;
set Cars['Cabrio']['make'] = 'bmw';
set Cars['Cabrio']['model'] = '640i';
set Cars['Corolla']['make'] = 'toyota';
set Cars['Corolla']['model'] = 'le';
set Cars['fit']['make'] = 'honda';
set Cars['fit']['model'] = 'fit sport';
set Cars['focus']['make'] = 'ford';
set Cars['focus']['model'] = 'sel';
84. Data Manipulation
get Cars['Cabrio'];
get Cars['Cabrio']['make'];
update column family Cars with comparator = UTF8Type
and column_metadata = [
{column_name: make, validation_class: UTF8Type, index_type: KEYS},
{column_name: model, validation_class: UTF8Type}];
del Cars['Cabrio']['make'];
drop column family Cars;
drop keyspace TestsDataStore;
show keyspaces;
85. Agile Development with Cassandra
• Facilitates agile development by providing a schema-free data model and a query-first paradigm
• Makes TDD easier by providing built-in test tools
• Is built around multiple design patterns, facilitating a Clean Code approach
• Decentralized nature makes distributed work easier (including geographical distribution)
86. Use Cases
• Large deployments
• Lots of writes, statistics, and analysis
• Geographical distribution
• Very large data volumes
• High reliability requirements for data storage