Deep Dive into
Apache Cassandra
Big Data Madison - February 2015
@brenttheisen
Who I Am, Where I Work
• Brent Theisen, Principal Engineer at
Womply.
• Womply uses data to grow, protect and
simplify small businesses.
• Obtain transaction, weather and third
party site data relevant to merchants.
• Provide brick and mortar merchants with
products and analytics based on this
data.
• Using Cassandra since Aug 2013.
Cassandra History
• Originally developed at Facebook around 2008.
• Name comes from Greek mythological prophet
cursed with accurate predictions no one believed.
‣ Move to NoSQL inevitable but few realize it.
‣ Cassandra was a cursed Oracle.
• Modeled after the Amazon Dynamo and Google BigTable papers.
• Open sourced on Google Code in 2008.
• Top level Apache project in 2010.
• Today it's used by hundreds of companies: Netflix, Twitter, eBay, Call of Duty.
Why Use Cassandra?
Cassandra might be the right tool for the job if you need to:
• Store terabytes of data.
• Perform a high number of writes/second.
• Perform key based queries in “real time”.
• Run massive parallel processing jobs.
• Replicate data across multiple data centers in full duplex.
• Scale horizontally in a predictable way.
What Can It Run On?
• Written in Java. Targets Oracle Java 1.7.
• Can run on Windows, Linux, OS X and just about any UNIXish OS.
• Client drivers available for many languages:
‣ Java, C++, Python, Node.js, Ruby, C#, etc.
‣ Not all languages have clients that support all new features.
• Support for MPP frameworks:
‣ DataStax Spark Connector.
‣ Hadoop input/output format.
Installing Cassandra
• Distributions of Cassandra:
‣ Binaries: DataStax Community (DSC) and Enterprise (DSE).
‣ Apache source distribution.
‣ Try to stick to whatever the latest stable version is.
‣ Current most stable is 2.0.12, latest is 2.1.3.
• Support available for Docker, BOSH, Chef, Puppet, etc.
• Development environments should be provisioned as close to
production as possible. Docker/Vagrant are your friends.
Configuring Cassandra
• Almost all config options are set in cassandra.yaml.
• Some of the more important ones:
‣ seeds: List of IPs that new nodes should use as Gossip contact points when
joining the cluster.
‣ endpoint_snitch: Java class that informs Cassandra about network
topology so requests can be routed efficiently between nodes. Some options:
RackInferringSnitch, PropertyFileSnitch, Ec2Snitch.
‣ initial_token: In a cluster with a single token range per node, specifies the
starting point of the range for the node.
‣ num_tokens: In a virtual node cluster, specifies the number of tokens
randomly assigned to the node.
Types of Key Rings
• Node per range requires
manually recalculating
initial_token for all nodes
when adding/removing nodes.
• Vnodes save you from this.
• If you have a mix of hardware/
instance types, you can set
num_tokens accordingly.
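The manual recalculation for a node-per-range ring can be sketched roughly as below. This is only an illustration: it assumes the Murmur3Partitioner token range (-2^63 to 2^63-1) and evenly spaced tokens, and the function name initial_tokens is hypothetical.

```python
def initial_tokens(node_count):
    """Evenly spaced initial_token values for a single-token-per-node ring."""
    ring_size = 2 ** 64      # number of tokens in the assumed Murmur3 range
    min_token = -(2 ** 63)   # lowest token in that range
    return [min_token + i * ring_size // node_count for i in range(node_count)]

three = initial_tokens(3)
four = initial_tokens(4)
print(three)
print(four)
```

Going from three to four nodes changes every token except the first, which is why the whole ring has to be recalculated without vnodes.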
Cassandra CLI Tools
• cqlsh: CQL client for running queries against a node.
• nodetool: Provides a number of subcommands useful for
administration/monitoring. Really just a JMX client.
Output from one of the most basic commands, nodetool status:
brent@cassandra1:~$ nodetool status
Datacenter: datacenter1
=======================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
-- Address Load Tokens Owns (effective) Host ID Rack
UN 172.17.0.33 76.09 KB 256 68.5% 9daf06ab-85d6-42b2-8190-49edd3d987a4 rack1
UN 172.17.0.36 76.14 KB 256 65.0% 18f008ff-0057-46c1-a71f-cc204a018808 rack1
UN 172.17.0.9 62.43 KB 256 66.5% 50c90d36-9c0a-43f0-8dd3-d66c24505ecb rack1
Cassandra Data Model
• Keyspaces (aka “schemas”) contain tables.
‣ Specify a per data center replication factor.
• Data is stored in tables (aka “column families”).
• Tables must have a primary key:
‣ Natural keys preferred but random UUIDs can also be used.
‣ Can use several columns to form a compound key.
‣ How you structure your primary key determines partitioning.
• That is all that's required; non-primary key columns are optional.
Creating a Keyspace
CREATE KEYSPACE my_keyspace WITH REPLICATION = {
'class': 'NetworkTopologyStrategy',
'datacenter1': 3,
'datacenter2': 2
};
• Specifies a name for the keyspace and its replication settings.
• Two types of replication strategies:
‣ SimpleStrategy: Single data center only. Not for production.
‣ NetworkTopologyStrategy: Allows for multi-datacenter clusters.
• Replication factor specifies how many nodes a key should be stored on.
Creating a Table
USE my_keyspace;
CREATE TABLE users (
username text PRIMARY KEY,
email text,
first_name text,
last_name text
);
• USE just changes the
current schema we are
working in.
• CQL CREATE TABLE
syntax pretty similar to
SQL.
• The username PK
defines how the data is
partitioned and indexed.
Scalar Data Types
• All the usual suspects: boolean, text, int, bigint (long), float, double, blob
• counter: Counter column (64-bit signed value). Only increment and decrement.
• varint: Variable precision integer.
• decimal: Variable precision decimal.
• inet: An IPv4 or IPv6 address.
• timestamp: Date and time, stored as milliseconds since the epoch in UTC. A timezone offset given in the literal gets converted to UTC.
• timeuuid: Type 1 UUID: timestamp (100 ns units) + clock sequence for duplicate prevention + MAC address. Can use now().
• uuid: Type 1 or type 4 UUID. Have to generate your own.
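To make the difference between the two UUID flavors concrete, here is a small sketch using Python's standard uuid module on the client side. The helper name timeuuid_to_datetime is hypothetical; it relies on the fact that version 1 UUID timestamps count 100-nanosecond intervals since 1582-10-15.

```python
import uuid
from datetime import datetime, timedelta, timezone

# Version 1 UUID timestamps count 100-nanosecond intervals since this date.
GREGORIAN_EPOCH = datetime(1582, 10, 15, tzinfo=timezone.utc)

def timeuuid_to_datetime(value):
    """Recover the embedded creation time from a version 1 UUID."""
    if value.version != 1:
        raise ValueError("only version 1 UUIDs carry a timestamp")
    return GREGORIAN_EPOCH + timedelta(microseconds=value.time // 10)

time_based = uuid.uuid1()    # comparable to a CQL timeuuid value
random_based = uuid.uuid4()  # comparable to a random CQL uuid value
created_at = timeuuid_to_datetime(time_based)
print(time_based, created_at.isoformat())
```

A type 4 UUID carries no time component, so it can serve as a random key but cannot be ordered by creation time.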
Collection Data Types
• list: Stores elements in whatever order you’d like.
‣ Control order of values by adding and deleting values by index.
‣ Individual values can also be removed.
‣ Must specify a sub type in your CREATE or ALTER TABLE:
ALTER TABLE users ADD login_times list<timestamp>;
• set: Stores unique elements in natural sort order.
‣ Other than uniqueness and sort order, similar to list.
ALTER TABLE users ADD login_times set<timestamp>;
• map: Stores key/value pairs.
‣ Set and delete key/value pairs by key.
‣ Must specify types for both the key and value in your CREATE or ALTER TABLE:
ALTER TABLE users ADD login_times map<timestamp, text>;
User Defined Types
• New in Cassandra 2.1.
• Can be used to store denormalized data that might have otherwise been stored
in another table.
CREATE TYPE address (
street text,
city text,
zip_code int,
phones set<text>
);
ALTER TABLE users ADD addresses map<text, frozen <address>>;
Serialized as a BLOB so all components
of the type must be passed on write.
Inserting Data
CONSISTENCY ONE;
INSERT INTO users
(username, email, first_name, last_name)
VALUES
('alice', 'alice@example.com', 'Alice', 'Smith');
How many nodes should receive
the write before query returns?
Consistency Levels
Specifies how many replicas need to successfully process a read or write
for the query to succeed. Some of the more prevalent consistency levels:
• ONE, TWO, THREE: That many replicas must reply successfully.
• LOCAL_ONE: Must succeed on at least one replica in the local DC.
• QUORUM: A quorum of replicas must reply: floor(replication factor / 2) + 1.
• LOCAL_QUORUM: Same as QUORUM but only counts replicas in the local data center.
• ALL: Query fails if any one of the replicas does not reply.
• ANY: Only applies to writes. Guarantees that the write succeeds even if all
replicas are down. Coordinator node may persist locally and replay when replica
available (AKA “hinted handoff”).
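The quorum arithmetic can be worked through with a short sketch. The function names are hypothetical, and the overlap check is the commonly cited rule of thumb (read replicas + write replicas > replication factor) rather than anything taken from the slides.

```python
def quorum(replication_factor):
    """Replica acknowledgements needed for QUORUM: floor(RF / 2) + 1."""
    return replication_factor // 2 + 1

def overlaps(read_replicas, write_replicas, replication_factor):
    """True when the read set and write set must share at least one replica."""
    return read_replicas + write_replicas > replication_factor

for rf in (1, 2, 3, 5):
    print(rf, quorum(rf))
```

With a replication factor of 3, QUORUM reads plus QUORUM writes overlap (2 + 2 > 3), while ONE plus ONE does not (1 + 1 <= 3).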
Insert/Update, Same Difference
UPDATE users
SET
email = 'bob@example.com',
first_name = 'Bob'
WHERE username = 'bob';
In Cassandra, inserts and updates are the same thing so they
are often referred to as writes or “upserts”.
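A toy model of upsert semantics, assuming a table is just a mapping from primary key to columns (the upsert helper is hypothetical, not a driver call):

```python
def upsert(table, key, columns):
    """Create the row if it is missing, otherwise overwrite the given columns."""
    table.setdefault(key, {}).update(columns)

users = {}
# Behaves like the UPDATE above even though no row for 'bob' exists yet.
upsert(users, "bob", {"email": "bob@example.com", "first_name": "Bob"})
# A later write to the same key only touches the columns it names.
upsert(users, "bob", {"email": "bob@example.org"})
print(users)
```

Neither call checks whether the row already exists, which is the point: there is no read before the write.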
Anatomy of a Write
• Client sends query to a node which will act as a coordinator for the request.
• Coordinator determines which replicas in the ring own the key and sends the details of
the write to all of them.
• Client’s query blocks until the
coordinator has gotten enough
successful responses back to satisfy
the consistency level.
• If a replica is down it will miss the
write but Cassandra has
mechanisms to ensure those
replicas become consistent: hinted
hand off, read repair and nodetool
repair.
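How a coordinator might pick replicas can be sketched as a clockwise walk around the ring. This is a simplified model: it uses one token per node, MD5 as a stand-in hash instead of the configured partitioner, and ignores racks and data centers. All names below are hypothetical.

```python
import bisect
import hashlib

def token_for(key):
    """Stand-in hash; Cassandra itself uses its configured partitioner."""
    return int.from_bytes(hashlib.md5(key.encode("utf-8")).digest()[:8], "big")

def replicas_for(key, ring, replication_factor):
    """Walk the ring clockwise from the key's token, collecting distinct nodes."""
    tokens = [token for token, _ in ring]
    start = bisect.bisect_left(tokens, token_for(key))
    owners = []
    for step in range(len(ring)):
        node = ring[(start + step) % len(ring)][1]
        if node not in owners:
            owners.append(node)
        if len(owners) == replication_factor:
            break
    return owners

# Hypothetical three-node ring, one token per node, sorted by token.
ring = sorted((token_for(name), name) for name in ("node-a", "node-b", "node-c"))
print(replicas_for("alice", ring, 2))
```

The same key always maps to the same replicas, so any node can act as coordinator and reach the same owners.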
How Replicas Persist Data
• Replica receives a write
request from a coordinator over
the inter-node messaging layer.
• Writes an entry in its commit
log.
• Adds an entry in the Memtable.
• Once a Memtable has exceeded a threshold it is flushed to disk in a file
called an SSTable.
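The steps above can be modeled with a toy replica. This is a sketch only: the flush threshold is counted in rows (a hypothetical simplification of the real memory-based threshold), and commit log cleanup after a flush is left out.

```python
class Replica:
    """Toy write path: commit log append, memtable update, flush to an SSTable."""

    def __init__(self, flush_threshold=2):
        self.flush_threshold = flush_threshold  # hypothetical threshold, in rows
        self.commit_log = []  # append-only record of every write
        self.memtable = {}    # in-memory rows: row key -> {column: value}
        self.sstables = []    # each flush adds one immutable, sorted table

    def write(self, row_key, columns):
        self.commit_log.append((row_key, dict(columns)))
        self.memtable.setdefault(row_key, {}).update(columns)
        if len(self.memtable) >= self.flush_threshold:
            self.flush()

    def flush(self):
        # Sort by row key, then column name, and freeze the result as tuples.
        table = tuple(
            (row_key, tuple(sorted(cols.items())))
            for row_key, cols in sorted(self.memtable.items())
        )
        self.sstables.append(table)
        self.memtable = {}

replica = Replica()
replica.write("bob", {"email": "bob@example.com", "first_name": "Bob"})
replica.write("alice", {"email": "alice@example.com", "first_name": "Alice",
                        "last_name": "Smith"})
print(replica.sstables[0])
```

The second write pushes the memtable to the threshold, so both rows end up in one SSTable sorted by row key.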
SSTables
• SSTables = Sorted String Tables
• Only contain data for a single Cassandra table.
• Sorted by row key and column key.
• Immutable, once written they never change.
• By default, contents are compressed using Snappy.
• A replica will often have several SSTable files for a given CQL
table.
How Our Data Looks in an SSTable File
alice   email = alice@example.com
        first_name = Alice
        last_name = Smith
bob     email = bob@example.com
        first_name = Bob
• Partition key identifies
the row.
• Column key is the name
of the column.
• All sorted on row key
and column key.
• We didn’t specify a
value for Bob’s
last_name so it simply
isn’t there.
Querying for Users by Key
cqlsh:my_keyspace> SELECT * FROM users WHERE username = 'alice';
username | email | first_name | last_name
----------+-------------------+------------+-----------
alice | alice@example.com | Alice | Smith
(1 rows)
Anatomy of a Read
• Replica gets read request.
• Bloom filter identifies which
SSTables might contain data for
the row key.
• For each SSTable file:
‣ Gets SSTable offset for row key
from either partition key cache
or partition index.
‣ Reads relevant portion of
SSTable file sequentially.
• Sends merged results to
coordinator.
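The Bloom filter step can be illustrated with a minimal filter built on hashlib. The sizes and hash scheme here are arbitrary choices for the sketch, not Cassandra's, and the class and function names are hypothetical.

```python
import hashlib

class BloomFilter:
    """Small Bloom filter: may report false positives, never false negatives."""

    def __init__(self, size_bits=1024, hash_count=3):
        self.size_bits = size_bits
        self.hash_count = hash_count
        self.bits = 0  # a Python int used as a bit array

    def _positions(self, key):
        for seed in range(self.hash_count):
            digest = hashlib.sha256(f"{seed}:{key}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.size_bits

    def add(self, key):
        for position in self._positions(key):
            self.bits |= 1 << position

    def might_contain(self, key):
        return all((self.bits >> position) & 1 for position in self._positions(key))

def sstables_to_read(row_key, sstables):
    """Keep only the SSTables whose filter says the row key might be present."""
    return [name for name, bloom in sstables if bloom.might_contain(row_key)]

first, second = BloomFilter(), BloomFilter()
first.add("alice")
second.add("bob")
print(sstables_to_read("alice", [("sstable-1", first), ("sstable-2", second)]))
```

A "no" from the filter is definitive, so those SSTables are skipped without touching disk; a "maybe" still requires the index lookup and read.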
Querying for Users by Email
Can’t query against a non-primary key column because only the primary
key is indexed:
SELECT * FROM users WHERE email = 'alice@example.com';
code=2200 [Invalid query] message="No indexed columns
present in by-columns clause with Equal operator"
Secondary Indexes
cqlsh:my_keyspace> CREATE INDEX ON users (email);
cqlsh:my_keyspace> SELECT * FROM users WHERE email = 'alice@example.com';
username | email | first_name | last_name
----------+-------------------+------------+-----------
alice | alice@example.com | Alice | Smith
(1 rows)
Deleting Data
cqlsh:my_keyspace> SELECT COUNT(*) FROM users WHERE username = 'bob';
count
-------
1
cqlsh:my_keyspace> DELETE FROM users WHERE username = 'bob';
cqlsh:my_keyspace> SELECT COUNT(*) FROM users WHERE username = 'bob';
count
-------
0
Tombstones
• Deletes don’t actually delete data immediately.
• They write a “tombstone” to the replicas that marks those
columns as having been deleted.
• When performing a read, tombstones will be found via the
normal read process and a null value returned to the client.
• The data actually gets “deleted” when SSTable files with a
tombstone get compacted with other SSTable files containing
data for that column.
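A rough sketch of how tombstones shadow older data on read and disappear at compaction. Cells are modeled as (write timestamp, value) pairs. The sketch drops tombstones as soon as tables are merged, which is a simplification: Cassandra keeps them until the table's gc_grace_seconds has passed. Names are hypothetical.

```python
TOMBSTONE = object()  # marker standing in for a delete

def read_cell(sstables, row_key, column):
    """Return the newest value for a cell, or None if it is deleted or missing."""
    newest = None
    for table in sstables:
        cell = table.get((row_key, column))
        if cell is not None and (newest is None or cell[0] > newest[0]):
            newest = cell
    if newest is None or newest[1] is TOMBSTONE:
        return None
    return newest[1]

def compact(sstables):
    """Merge SSTables, keep the newest cell per key, and drop tombstoned cells."""
    merged = {}
    for table in sstables:
        for key, cell in table.items():
            if key not in merged or cell[0] > merged[key][0]:
                merged[key] = cell
    return {key: cell for key, cell in merged.items() if cell[1] is not TOMBSTONE}

# The older table holds the data; the newer one holds only the delete marker.
older = {("bob", "email"): (1, "bob@example.com"),
         ("alice", "email"): (1, "alice@example.com")}
newer = {("bob", "email"): (2, TOMBSTONE)}
print(read_cell([older, newer], "bob", "email"))
print(compact([older, newer]))
```

Until compaction runs, Bob's email still sits in the older table on disk; it is only hidden by the newer tombstone.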
Compaction
• Compaction is the process of merging SSTables into one and deleting the old ones.
• Two types of compaction: major and minor.
• Major compactions:
‣ Triggered by running nodetool compact.
‣ Compacts all SSTable files on a node into one big SSTable file.
‣ Should be avoided as it makes it unlikely minor compactions will occur.
• Minor compactions:
‣ Happen automatically in the background.
‣ How minor compactions work depends on the compaction strategy and its settings
specified in CREATE TABLE.
‣ Size Tiered: SSTables get compacted in tiers based on their size.
‣ Date Tiered: SSTables get compacted in tiers based on time window they cover.
‣ Leveled: Ensures data for a row key is not overlapped in SSTables. Good for reads.
An Example Use Case
• We are running an e-commerce site that sells things.
• Need to be able to record interaction with our site for future
analysis.
• Browser side code will send us page view, hover, click and
other events.
• Our job is to model the Cassandra tables and queries that the
server side component uses to persist events.
An Event Table
CREATE TABLE events (
username text,
time timestamp,
type text,
params map<text, text>,
PRIMARY KEY ((username), time)
);
• The username column acts as
the “partition key”.
• The time column acts as the
“clustering key”. Can perform range
queries against it.
• The params column is a map
that will contain event specific
properties.
Event Table Write Examples
Alice goes to the homepage:
INSERT INTO events (username, time, type, params) VALUES
('alice', '2015-02-24 08:05:01-0600', 'page_load',
{ 'url': 'http://example.com/' } );
Alice hovers some product links:
INSERT INTO events (username, time, type, params) VALUES
('alice', '2015-02-24 08:05:03-0600', 'hover', { 'product_id': 'regular-widget' } );
INSERT INTO events (username, time, type, params) VALUES
('alice', '2015-02-24 08:05:15-0600', 'hover', { 'product_id': 'mega-widget' } );
Alice goes to the Super Mega Widget product page:
INSERT INTO events (username, time, type, params) VALUES
('alice', '2015-02-24 08:05:24-0600', 'page_load', {
'url': 'http://example.com/super-mega-widget',
'product_id': 'super-mega-widget'
}
);
Querying Events
• Count all the events for Alice:
cqlsh:my_keyspace> SELECT COUNT(*) FROM events WHERE username = 'alice';
count
-------
4
• Find all events for Alice within a time range:
cqlsh:my_keyspace> SELECT * FROM events WHERE username = 'alice' AND
time >= '2015-02-24 08:05:03-0600' AND
time <= '2015-02-24 08:05:21-0600';
username | time | params | type
----------+--------------------------+----------------------------------+-------
alice | 2015-02-24 14:05:03+0000 | {'product_id': 'regular-widget'} | hover
alice | 2015-02-24 14:05:15+0000 | {'product_id': 'mega-widget'} | hover
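Why the range query is cheap can be seen in a small model of a partition: rows for one username are kept sorted by time, so a range is a contiguous slice. This is an illustrative sketch using bisect; the helper names are hypothetical and the data mirrors the four inserts for Alice.

```python
import bisect
from datetime import datetime, timedelta, timezone

CST = timezone(timedelta(hours=-6))  # matches the -0600 offset in the examples

def at(hour, minute, second):
    return datetime(2015, 2, 24, hour, minute, second, tzinfo=CST)

# One partition per username; rows inside are kept sorted by the time column.
partitions = {
    "alice": sorted([
        (at(8, 5, 3), "hover"),
        (at(8, 5, 15), "hover"),
        (at(8, 5, 1), "page_load"),
        (at(8, 5, 24), "page_load"),
    ])
}

def events_between(username, start, end):
    """Inclusive range slice over one partition's sorted clustering keys."""
    rows = partitions.get(username, [])
    times = [time for time, _ in rows]
    low = bisect.bisect_left(times, start)
    high = bisect.bisect_right(times, end)
    return rows[low:high]

for time, event_type in events_between("alice", at(8, 5, 3), at(8, 5, 21)):
    print(time.isoformat(), event_type)
```

The slice returns the two hover events, the same rows as the cqlsh output.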
What CQL Does Not Do
• Joins
‣ Doing a join on tables partitioned across many nodes is too expensive.
‣ Instead, you should attempt to denormalize your data model.
• Subqueries
‣ Cassandra sticks to one thing: storing and retrieving key partitioned data at scale.
‣ Things you might use a subquery for are usually solved by denormalizing and/or
having an app specific data layer do the heavy lifting.
• Group By
‣ Aggregate datasets are usually pre-computed and stored in their own tables.
‣ Probably want to use Spark Streaming/Storm/etc to get sliding windows.
“Real Time” Aggregation
• Let's say we need to graph all product page views in “real time”.
• Each data point could be a five minute aggregate, or whatever.
• What might the table look like that Spark Streaming/Storm/etc persist to?
CREATE TABLE event_time_series (
date text,
time timestamp,
product_id_page_views map<text, int>,
PRIMARY KEY ((date), time)
);
• The date partition key ensures data points get
distributed across the cluster.
• The time clustering key allows time based range
queries within a day.
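How a streaming job might derive the keys for this table can be sketched as below. It assumes five minute windows aligned to the hour, UTC dates for the partition key, and a window size that divides 60; the function names are hypothetical, and a plain dict stands in for the table.

```python
from datetime import datetime, timezone

def bucket_for(event_time, window_minutes=5):
    """Return (date partition key, window start) for an event timestamp."""
    utc_time = event_time.astimezone(timezone.utc)
    window_start = utc_time.replace(
        minute=utc_time.minute - utc_time.minute % window_minutes,
        second=0,
        microsecond=0,
    )
    return utc_time.strftime("%Y-%m-%d"), window_start

def record_page_view(series, event_time, product_id):
    """Increment the page view count for a product in its time window."""
    key = bucket_for(event_time)
    counts = series.setdefault(key, {})
    counts[product_id] = counts.get(product_id, 0) + 1

series = {}
view_time = datetime(2015, 2, 24, 14, 7, 31, tzinfo=timezone.utc)
record_page_view(series, view_time, "mega-widget")
record_page_view(series, view_time, "mega-widget")
print(series)
```

Both views at 14:07:31 land in the 14:05 window of the 2015-02-24 partition.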
Analytics Options
• Hadoop
‣ Painful to set up, best to use DataStax Enterprise.
‣ Map reduce, Hive and Pig are all supported to varying degrees.
‣ All but deprecated within much of the Cassandra community.
• Apache Spark
‣ Does not require DSE.
‣ Use the DataStax Spark Cassandra Connector.
‣ Highly recommend Spark for doing analytics with Cassandra.
Analytics Best Practices
• Ensure good data locality by running Hadoop/Spark on the same
nodes that run Cassandra.
• Keep data centers doing OLAP separate from those doing OLTP.
• Spend some extra time ensuring Hadoop/Spark use system
resources effectively without trampling each other.
‣ A resource manager like YARN or Mesos could help.
Administrivia
• Each node should run nodetool repair -pr on a regular basis to ensure
decent consistency.
• Use NTP to ensure clocks on all nodes are accurate.
• Data directory should always be on local (preferably SSD) storage, never a
SAN.
• Cassandra can do JBOD or you can RAID 0. RAID levels above 0 are unneeded.
• Compaction requires there to be extra available storage capacity (50% [worst
case] for tiered compaction, 10% for leveled compaction).
Predictable Linear Scaling
• Read and write ops/sec should scale linearly. Two nodes = 2x throughput,
four nodes = 4x, etc.
More Info on Cassandra
• DataStax Developer
http://www.datastax.com/dev
• Planet Cassandra
http://planetcassandra.org/
• Beware of old documentation, a lot has changed.
‣ Stick to CQLv3 in particular.
‣ Avoid the Thrift API.

CHEAP Call Girls in Rabindra Nagar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
 
➥🔝 7737669865 🔝▻ Sambalpur Call-girls in Women Seeking Men 🔝Sambalpur🔝 Esc...
➥🔝 7737669865 🔝▻ Sambalpur Call-girls in Women Seeking Men  🔝Sambalpur🔝   Esc...➥🔝 7737669865 🔝▻ Sambalpur Call-girls in Women Seeking Men  🔝Sambalpur🔝   Esc...
➥🔝 7737669865 🔝▻ Sambalpur Call-girls in Women Seeking Men 🔝Sambalpur🔝 Esc...
 
Thane Call Girls 7091864438 Call Girls in Thane Escort service book now -
Thane Call Girls 7091864438 Call Girls in Thane Escort service book now -Thane Call Girls 7091864438 Call Girls in Thane Escort service book now -
Thane Call Girls 7091864438 Call Girls in Thane Escort service book now -
 
Just Call Vip call girls Mysore Escorts ☎️9352988975 Two shot with one girl (...
Just Call Vip call girls Mysore Escorts ☎️9352988975 Two shot with one girl (...Just Call Vip call girls Mysore Escorts ☎️9352988975 Two shot with one girl (...
Just Call Vip call girls Mysore Escorts ☎️9352988975 Two shot with one girl (...
 
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
 

Deep Dive into Cassandra

  • 1. Deep Dive into Apache Cassandra Big Data Madison - February 2015 @brenttheisen
  • 2. Who I Am, Where I Work • Brent Theisen, Principal Engineer at Womply. • Womply uses data to grow, protect and simplify small businesses. • Obtain transaction, weather and third party site data relevant to merchants. • Provide brick and mortar merchants with products and analytics based on this data. • Using Cassandra since Aug 2013.
  • 3. Cassandra History • Originally developed at Facebook around 2008. • Name comes from the Greek mythological prophet cursed with accurate predictions no one believed. ‣ Move to NoSQL inevitable but few realize it. ‣ Cassandra was a cursed Oracle. • Modeled after the Amazon Dynamo and Google BigTable papers. • Open sourced on Google Code in 2008. • Top level Apache project in 2010. • Today it's used by 100s of companies: Netflix, Twitter, eBay, Call of Duty.
  • 4. Why Use Cassandra? Cassandra might be the right tool for the job if you need to: • Store terabytes of data. • Perform a high number of writes/second. • Perform key based queries in “real time”. • Run massive parallel processing jobs. • Replicate data across multiple data centers in full duplex. • Scale horizontally in a predictable way.
  • 5. What Can It Run On? • Written in Java. Targets Oracle Java 1.7. • Can run on Windows, Linux, OS X and just about any UNIXish OS. • Client drivers available for many languages: ‣ Java, C++, Python, Node.js, Ruby, C#, etc. ‣ Not all languages have clients that support all new features. • Support for MPP frameworks: ‣ DataStax Spark Connector. ‣ Hadoop input/output format.
  • 6. Installing Cassandra • Distributions of Cassandra: ‣ Binaries: DataStax Community (DSC) and Enterprise (DSE). ‣ Apache source distribution. ‣ Try to stick to whatever the latest stable version is. ‣ Current most stable is 2.0.12, latest is 2.1.3. • Support available for Docker, BOSH, Chef, Puppet, etc. • Development environments should be provisioned as close to production as possible. Docker/Vagrant are your friends.
  • 7. Configuring Cassandra • Almost all config options are set in cassandra.yaml. • Some of the more important ones: ‣ seeds: List of IPs that new nodes should use as Gossip contact points when joining the cluster. ‣ endpoint_snitch: Java class that informs Cassandra about network topology so it can route requests efficiently. Some options: RackInferringSnitch, PropertyFileSnitch, Ec2Snitch. ‣ initial_token: In a cluster with a single token range per node, specifies the starting point of the range for the node. ‣ num_tokens: In a virtual node (vnode) cluster, specifies the number of tokens randomly assigned to the node.
  • 8. Types of Key Rings • A single token range per node requires manually recalculating initial_token for all nodes when adding/removing nodes. • Vnodes save you from this. • If you have a mix of hardware/instance types, you can set num_tokens proportionally so that more powerful nodes own more of the ring.
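To make the vnode idea on the slide above concrete, here is a minimal sketch of a token ring in which each node receives `num_tokens` random tokens and a key belongs to the node holding the first token at or after the key's token. It is a toy under stated assumptions: the node names, the MD5-based `token_for` hash, and the 64-bit ring size are illustrative stand-ins, not Cassandra's actual partitioner or token range.

```python
import bisect
import hashlib
import random

RING_SIZE = 2 ** 64  # assumed ring size for this sketch only


def build_ring(num_tokens_by_node, seed=0):
    """Assign each node num_tokens random tokens; return a sorted (token, node) list."""
    rng = random.Random(seed)  # seeded so the sketch is repeatable
    return sorted(
        (rng.randrange(RING_SIZE), node)
        for node, num_tokens in num_tokens_by_node.items()
        for _ in range(num_tokens)
    )


def token_for(key):
    """Hash a partition key onto the ring (MD5 is a stand-in hash, an assumption)."""
    digest = hashlib.md5(key.encode()).digest()
    return int.from_bytes(digest[:8], "big")


def owner(ring, key):
    """Return the node holding the first token >= the key's token, wrapping around."""
    tokens = [token for token, _ in ring]
    index = bisect.bisect_left(tokens, token_for(key)) % len(ring)
    return ring[index][1]


# Hypothetical two-node cluster where "big" has twice the tokens of "small".
ring = build_ring({"big": 512, "small": 256})
counts = {"big": 0, "small": 0}
for i in range(3000):
    counts[owner(ring, f"user-{i}")] += 1
```

With twice as many tokens, the hypothetical `big` node should end up owning roughly twice as many keys, which is the effect of tuning num_tokens per node.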
  • 9. Cassandra CLI Tools
    • cqlsh: CQL client for running queries against a node.
    • nodetool: Provides a number of subcommands useful for administration/monitoring. Really just a JMX client.
    Output from one of the most basic commands, nodetool status:
      brent@cassandra1:~$ nodetool status
      Datacenter: datacenter1
      =======================
      Status=Up/Down
      |/ State=Normal/Leaving/Joining/Moving
      --  Address      Load      Tokens  Owns (effective)  Host ID                               Rack
      UN  172.17.0.33  76.09 KB  256     68.5%             9daf06ab-85d6-42b2-8190-49edd3d987a4  rack1
      UN  172.17.0.36  76.14 KB  256     65.0%             18f008ff-0057-46c1-a71f-cc204a018808  rack1
      UN  172.17.0.9   62.43 KB  256     66.5%             50c90d36-9c0a-43f0-8dd3-d66c24505ecb  rack1
  • 10. Cassandra Data Model • Keyspaces (aka “schemas”) contain tables. ‣ Specify a per data center replication factor. • Data is stored in tables (aka “column families”). • Tables must have a primary key: ‣ Natural keys preferred but random UUIDs can also be used. ‣ Can use several columns to form a compound key. ‣ How you structure your primary key determines partitioning. • That is all that's required; non-primary key columns are optional.
  • 11. Creating a Keyspace
      CREATE KEYSPACE my_keyspace WITH REPLICATION = {
        'class': 'NetworkTopologyStrategy',
        'datacenter1': 3,
        'datacenter2': 2
      };
    • Specifies a name for the keyspace and its replication settings.
    • Two types of replication strategies:
      ‣ SimpleStrategy: Single data center only. Not for production.
      ‣ NetworkTopologyStrategy: Allows for multi-datacenter clusters.
    • Replication factor specifies how many nodes a key should be stored on.
  • 12. Creating a Table
      USE my_keyspace;
      CREATE TABLE users (
        username text PRIMARY KEY,
        email text,
        first_name text,
        last_name text
      );
    • USE just changes the current schema we are working in.
    • CQL CREATE TABLE syntax pretty similar to SQL.
    • The username PK defines how the data is partitioned and indexed.
  • 13. Scalar Data Types • All the usual suspects: boolean, text, int, bigint (long), float, double, blob • counter: Counter column (64-bit signed value). Only increment and decrement. • varint: Variable precision integer. • decimal: Variable precision decimal. • inet: An IPv4 or IPv6 address. • timestamp: Date and time including timezone. Time gets converted to UTC. • timeuuid: Type 1 UUID: timestamp in 100-nanosecond units + clock sequence (duplicate prevention) + MAC address. Can use NOW(). • uuid: Type 1 or type 4 UUID. Have to generate your own.
  • 14. Collection Data Types
    • list: Stores elements in whatever order you’d like.
      ‣ Control order of values by adding and deleting values by index.
      ‣ Individual values can also be removed.
      ‣ Must specify a sub type in your CREATE or ALTER TABLE:
          ALTER TABLE users ADD login_times list<timestamp>;
    • set: Stores unique elements in natural sort order.
      ‣ Other than uniqueness and sort order, similar to list.
          ALTER TABLE users ADD login_times set<timestamp>;
    • map: Stores key/value pairs.
      ‣ Set and delete key/value pairs by key.
      ‣ Must specify types for both the key and value in your CREATE or ALTER TABLE:
          ALTER TABLE users ADD login_times map<timestamp, text>;
  • 15. User Defined Types
    • New in Cassandra 2.1.
    • Can be used to store denormalized data that might have otherwise been stored in another table.
      CREATE TYPE address (
        street text,
        city text,
        zip_code int,
        phones set<text>
      );
      ALTER TABLE users ADD addresses map<text, frozen<address>>;
    • Serialized as a BLOB so all components of the type must be passed on write.
  • 16. Inserting Data
      CONSISTENCY ONE;
      INSERT INTO users (username, email, first_name, last_name)
      VALUES ('alice', 'alice@example.com', 'Alice', 'Smith');
    The CONSISTENCY setting answers the question: how many nodes should receive the write before the query returns?
  • 17. Consistency Levels Specifies how many nodes need to successfully process a read or write for the query to succeed. Some of the more prevalent consistency levels: • ONE, TWO, THREE: Exact number of nodes that need to reply successfully. • LOCAL_ONE: Must succeed on at least one node in local DC. • QUORUM: A quorum of nodes must reply: floor(replication factor / 2) + 1. • LOCAL_QUORUM: Same as quorum but only nodes in local data center. • ALL: Query fails if any one of the replicas does not reply. • ANY: Only applies to writes. Guarantees that the write succeeds even if all replicas are down. Coordinator node may persist locally and replay when replica available (AKA “hinted handoff”).
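The QUORUM formula on the slide above uses integer division, which is easy to get wrong for even replication factors. A minimal sketch of the arithmetic follows; the function names are illustrative, and the `reads_see_latest_write` check (write replicas plus read replicas exceeding the replication factor) is a commonly cited rule of thumb added here, not something stated on the slide.

```python
def quorum(replication_factor: int) -> int:
    """Replicas that must respond at QUORUM: floor(RF / 2) + 1."""
    return replication_factor // 2 + 1


def reads_see_latest_write(write_replicas: int, read_replicas: int,
                           replication_factor: int) -> bool:
    """True when every read set must overlap every write set (W + R > RF)."""
    return write_replicas + read_replicas > replication_factor


# With a replication factor of 3, QUORUM needs 2 replicas.
rf = 3
q = quorum(rf)
```

For example, writing and reading at QUORUM with a replication factor of 3 gives 2 + 2 > 3, so the read set always includes at least one replica that accepted the write, whereas ONE/ONE (1 + 1) does not.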
  • 18. Insert/Update, Same Difference
      UPDATE users SET email = 'bob@example.com', first_name = 'Bob'
      WHERE username = 'bob';
    In Cassandra, inserts and updates are the same thing so they are often referred to as writes or “upserts”.
  • 19. Anatomy of a Write • Client sends query to a node which will act as a coordinator for the request. • Coordinator determines which replicas in the ring own the key and sends the details of the write to all of them. • Client’s query blocks until the coordinator has gotten enough successful responses back to satisfy the consistency level. • If a replica is down it will miss the write, but Cassandra has mechanisms to ensure those replicas become consistent: hinted handoff, read repair and nodetool repair.
  • 20. How Replicas Persist Data • Replica receives a write request from a coordinator over Cassandra's internode messaging. • Writes an entry in its commit log. • Adds an entry in the Memtable. • Once a Memtable has exceeded a threshold it is flushed to disk in a file called an SSTable.
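The sequence on the slide above (commit log, then Memtable, then flush to an SSTable) can be sketched as a toy in-memory model. This is an illustration only: the `ToyReplica` class is hypothetical, the flush threshold here counts rows whereas a real node flushes based on memory usage, and the "SSTable" is just an immutable sorted tuple rather than a file.

```python
class ToyReplica:
    """Toy model of a replica's write path; not Cassandra's implementation."""

    def __init__(self, flush_threshold=3):
        self.commit_log = []   # append-only log used for crash recovery
        self.memtable = {}     # in-memory: row key -> {column: value}
        self.sstables = []     # immutable, sorted snapshots standing in for files
        self.flush_threshold = flush_threshold  # assumed: number of rows

    def write(self, row_key, columns):
        # 1. Record the write in the commit log first.
        self.commit_log.append((row_key, dict(columns)))
        # 2. Apply it to the memtable (an upsert: new columns merge into the row).
        self.memtable.setdefault(row_key, {}).update(columns)
        # 3. Flush once the memtable exceeds the threshold.
        if len(self.memtable) >= self.flush_threshold:
            self.flush()

    def flush(self):
        # Freeze the memtable as a tuple sorted by row key, then column name.
        sstable = tuple(
            (row_key, tuple(sorted(cols.items())))
            for row_key, cols in sorted(self.memtable.items())
        )
        self.sstables.append(sstable)
        self.memtable = {}
        self.commit_log.clear()  # flushed data no longer needs the log


replica = ToyReplica(flush_threshold=2)
replica.write("bob", {"email": "bob@example.com", "first_name": "Bob"})
replica.write("alice", {"email": "alice@example.com"})
```

After the second write the toy replica has flushed: the memtable is empty and the single SSTable lists `alice` before `bob`, even though `bob` was written first.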
  • 21. SSTables • SSTables = Sorted String Tables • Only contain data for a single Cassandra table. • Sorted by row key and column key. • Immutable, once written they never change. • By default, contents are compressed (LZ4 as of Cassandra 2.0; Snappy in earlier releases). • A replica will often have several SSTable files for a given CQL table.
  • 22. How Our Data Looks in an SSTable File
      alice
        email      = alice@example.com
        first_name = Alice
        last_name  = Smith
      bob
        email      = bob@example.com
        first_name = Bob
    • Partition key identifies the row.
    • Column key is the name of the column.
    • All sorted on row key and column key.
    • We didn’t specify a value for Bob’s last_name so it simply isn’t there.
  • 23. Querying for Users by Key
      cqlsh:my_keyspace> SELECT * FROM users WHERE username = 'alice';

       username | email             | first_name | last_name
      ----------+-------------------+------------+-----------
          alice | alice@example.com |      Alice |     Smith

      (1 rows)
  • 24. Anatomy of a Read • Replica gets read request. • Each SSTable's Bloom filter indicates whether that SSTable might contain data for the row key. • For each SSTable file that might: ‣ Gets SSTable offset for row key from either partition key cache or partition index. ‣ Reads relevant portion of SSTable file sequentially. • Sends merged results to coordinator.
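The read steps on the slide above can be sketched as follows: a Bloom filter per SSTable rules out files that definitely lack the row, and the surviving results are merged. The classes are hypothetical toys, the SHA-256-based Bloom filter is not the hashing Cassandra uses, and the merge rule (newest write timestamp wins per column) is an assumption added for the example.

```python
import hashlib


class ToyBloomFilter:
    """Small Bloom filter: may report false positives, never false negatives."""

    def __init__(self, size=64, hashes=3):
        self.size = size
        self.hashes = hashes
        self.bits = [False] * size

    def _positions(self, key):
        # Derive several bit positions from salted SHA-256 digests.
        for i in range(self.hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos] = True

    def might_contain(self, key):
        return all(self.bits[pos] for pos in self._positions(key))


class ToySSTable:
    def __init__(self, rows):
        # rows: row key -> {column: (write_timestamp, value)}
        self.rows = dict(sorted(rows.items()))
        self.bloom = ToyBloomFilter()
        for key in self.rows:
            self.bloom.add(key)


def read_row(sstables, row_key):
    """Merge one row across SSTables, keeping the newest value per column."""
    merged = {}
    for sstable in sstables:
        if not sstable.bloom.might_contain(row_key):
            continue  # definitely not in this SSTable, skip the "disk read"
        for column, (ts, value) in sstable.rows.get(row_key, {}).items():
            if column not in merged or ts > merged[column][0]:
                merged[column] = (ts, value)
    return {column: value for column, (ts, value) in merged.items()}


old = ToySSTable({"bob": {"email": (1, "bob@old.example.com")}})
new = ToySSTable({"bob": {"email": (2, "bob@example.com"), "first_name": (2, "Bob")}})
row = read_row([old, new], "bob")
```

Because SSTables are immutable, an updated column lives in a newer file alongside the stale value in an older one, which is why the replica has to merge before answering the coordinator.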
  • 25. Querying for Users by Email
    Can’t query against a non-primary key column because only the primary key is indexed:
      SELECT * FROM users WHERE email = 'alice@example.com';
      code=2200 [Invalid query] message="No indexed columns present in by-columns clause with Equal operator"
  • 26. Secondary Indexes
      cqlsh:my_keyspace> CREATE INDEX ON users (email);
      cqlsh:my_keyspace> SELECT * FROM users WHERE email = 'alice@example.com';

       username | email             | first_name | last_name
      ----------+-------------------+------------+-----------
          alice | alice@example.com |      Alice |     Smith

      (1 rows)
  • 27. Deleting Data
      cqlsh:my_keyspace> SELECT COUNT(*) FROM users WHERE username = 'bob';

       count
      -------
           1

      cqlsh:my_keyspace> DELETE FROM users WHERE username = 'bob';
      cqlsh:my_keyspace> SELECT COUNT(*) FROM users WHERE username = 'bob';

       count
      -------
           0
  • 28. Tombstones • Deletes don’t actually delete data immediately. • They write a “tombstone” to the replicas that marks those columns as having been deleted. • When performing a read, tombstones will be found via the normal read process and a null value returned to the client. • The data actually gets “deleted” when SSTable files with a tombstone get compacted with other SSTable files containing data for that column.
  • 29. Compaction • Compaction is the process of merging SSTables into one and deleting the old ones. • Two types of compaction: major and minor. • Major compactions: ‣ Triggered by running nodetool compact. ‣ Compacts all SSTable files on a node into one big SSTable file. ‣ Should be avoided as it makes it unlikely minor compactions will occur. • Minor compactions: ‣ Happen automatically in the background. ‣ How minor compactions work depends on the compaction strategy and its settings specified in CREATE TABLE. ‣ Size Tiered: SSTables get compacted in tiers based on their size. ‣ Date Tiered: SSTables get compacted in tiers based on time window they cover. ‣ Leveled: Ensures data for a row key is not overlapped in SSTables. Good for reads.
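The interaction between tombstones and compaction described on the two slides above can be sketched as a merge over toy SSTables. This is a simplification under stated assumptions: each SSTable is a plain dict, the newest timestamp wins per column, and tombstones are dropped as soon as they are compacted, whereas a real cluster keeps them for a grace period (the `gc_grace_seconds` table option) before purging.

```python
TOMBSTONE = object()  # sentinel marking a deleted column in this toy model


def compact(sstables):
    """Merge toy SSTables (row key -> {column: (timestamp, value)}) into one.

    Keeps the newest cell per column, then drops cells whose newest
    version is a tombstone, and rows left with no live cells.
    """
    merged = {}
    for sstable in sstables:
        for row_key, columns in sstable.items():
            row = merged.setdefault(row_key, {})
            for column, (ts, value) in columns.items():
                if column not in row or ts > row[column][0]:
                    row[column] = (ts, value)

    compacted = {}
    for row_key in sorted(merged):  # output stays sorted by row key
        live = {
            column: cell
            for column, cell in sorted(merged[row_key].items())
            if cell[1] is not TOMBSTONE
        }
        if live:
            compacted[row_key] = live
    return compacted


older = {
    "alice": {"email": (1, "alice@example.com")},
    "bob": {"email": (1, "bob@example.com"), "first_name": (1, "Bob")},
}
newer = {
    "bob": {"email": (2, TOMBSTONE), "first_name": (2, TOMBSTONE)},
}
result = compact([older, newer])
```

In this run the delete of `bob` only becomes a physical removal once the SSTable holding his data and the SSTable holding the tombstones are compacted together.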
  • 30. An Example Use Case • We are running an e-commerce site that sells things. • Need to be able to record interaction with our site for future analysis. • Browser side code will send us page view, hover, click and other events. • Our job is to model the Cassandra tables and queries that the server side component uses to persist events.
  • 31. An Event Table
      CREATE TABLE events (
        username text,
        time timestamp,
        type text,
        params map<text, text>,
        PRIMARY KEY ((username), time)
      );
    • The username column acts as the “partition key”.
    • The time column acts as the “clustering key”. Can perform range queries against it.
    • The params column is a map that will contain event specific properties.
  • 32. Event Table Write Examples
    Alice goes to the homepage:
      INSERT INTO events (username, time, type, params)
      VALUES ('alice', '2015-02-24 08:05:01-0600', 'page_load', { 'url': 'http://example.com/' } );
    Alice hovers some product links:
      INSERT INTO events (username, time, type, params)
      VALUES ('alice', '2015-02-24 08:05:03-0600', 'hover', { 'product_id': 'regular-widget' } );
      INSERT INTO events (username, time, type, params)
      VALUES ('alice', '2015-02-24 08:05:15-0600', 'hover', { 'product_id': 'mega-widget' } );
    Alice goes to the Super Mega Widget product page:
      INSERT INTO events (username, time, type, params)
      VALUES ('alice', '2015-02-24 08:05:24-0600', 'page_load', { 'url': 'http://example.com/super-mega-widget', 'product_id': 'super-mega-widget' } );
  • 33. Querying Events
    • Count all the events for Alice:
      cqlsh:my_keyspace> SELECT COUNT(*) FROM events WHERE username = 'alice';

       count
      -------
           4

    • Find all events for Alice within a time range:
      cqlsh:my_keyspace> SELECT * FROM events WHERE username = 'alice' AND
                         time >= '2015-02-24 08:05:03-0600' AND
                         time <= '2015-02-24 08:05:21-0600';

       username | time                     | params                           | type
      ----------+--------------------------+----------------------------------+-------
          alice | 2015-02-24 14:05:03+0000 | {'product_id': 'regular-widget'} | hover
          alice | 2015-02-24 14:05:15+0000 | {'product_id': 'mega-widget'}    | hover
  • 34. What CQL Does Not Do • Joins ‣ Doing a join on tables partitioned across many nodes is too expensive. ‣ Instead, you should attempt to denormalize your data model. • Subqueries ‣ Cassandra sticks to one thing: storing and retrieving key partitioned data at scale. ‣ Things you might use a subquery for are usually solved by denormalizing and/or having an app specific data layer do the heavy lifting. • Group By ‣ Aggregate datasets are usually pre-computed and stored in their own tables. ‣ Probably want to use Spark Streaming/Storm/etc to get sliding windows.
  • 35. “Real Time” Aggregation
    • Let's say we need to graph all product page views in “real time”.
    • Each data point could be a five minute aggregate, or whatever.
    • What might the table look like that Spark Streaming/Storm/etc persist to?
      CREATE TABLE event_time_series (
        date text,
        time timestamp,
        product_id_page_views map<text, int>,
        PRIMARY KEY ((date), time)
      );
    • The date partition key ensures data points get distributed across the cluster.
    • The time clustering key allows time based range queries within a day.
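As a rough illustration of how a streaming job might derive the `date` partition key and the five-minute `time` value for the table above, here is a small sketch. The `bucket` helper, the `YYYY-MM-DD` date format, and the choice to bucket in UTC are assumptions made for the example; the slide does not specify them. The window length is assumed to divide 60 evenly.

```python
from datetime import datetime, timezone


def bucket(event_time, window_minutes=5):
    """Return (date partition key, window start) for a timezone-aware event time."""
    utc = event_time.astimezone(timezone.utc)
    date = utc.strftime("%Y-%m-%d")  # assumed text format for the date column
    minute = utc.minute - utc.minute % window_minutes  # round down to the window
    window_start = utc.replace(minute=minute, second=0, microsecond=0)
    return date, window_start


# Alice's product page load from the earlier example, expressed in UTC.
page_load = datetime(2015, 2, 24, 14, 5, 24, tzinfo=timezone.utc)
date_key, window_start = bucket(page_load)
```

Every event in the same five-minute window maps to the same (`date`, `time`) pair, so the aggregating job can increment one entry in `product_id_page_views` per product, and a day's worth of points lives in a single partition.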
  • 36. Analytics Options • Hadoop ‣ Painful to set up, best to use DataStax Enterprise. ‣ Map reduce, Hive and Pig are all supported to varying degrees. ‣ All but deprecated within much of the Cassandra community. • Apache Spark ‣ Does not require DSE. ‣ Use the DataStax Spark Cassandra Connector. ‣ Highly recommend Spark for doing analytics with Cassandra.
  • 37. Analytics Best Practices • Ensure good data locality by running Hadoop/Spark on the same nodes that run Cassandra. • Keep data centers doing OLAP separate from those doing OLTP. • Spend some extra time ensuring Hadoop/Spark use system resources effectively without trampling each other. ‣ A resource manager like YARN or Mesos could help.
  • 38. Administrivia • Each node should run nodetool repair -pr on a regular basis to ensure decent consistency. • Use NTP to ensure clocks on all nodes are accurate. • Data directory should always be on local (preferably SSD) storage, never a SAN. • Cassandra can do JBOD or you can RAID 0. RAID levels above 0 are unneeded. • Compaction requires there to be extra available storage capacity (50% [worst case] for tiered compaction, 10% for leveled compaction). • Read and write ops/sec should scale linearly. Two nodes = 2x throughput, four nodes = 4x, etc.
  • 40. More Info on Cassandra
    • DataStax Developer: http://www.datastax.com/dev
    • Planet Cassandra: http://planetcassandra.org/
    • Beware of old documentation, a lot has changed.
      ‣ Stick to CQLv3 in particular.
      ‣ Avoid the Thrift API.