Cassandra & Python
Adam Hutson, Data Architect
@adamhutson
Who am I and What do we do?
•Adam Hutson
•Data Architect of DataScale -> www.datascale.io
•DataStax MVP for Apache Cassandra
•DataScale provides hosted data platforms as a service
•Offering Cassandra & Spark, with more to come
•Currently hosted in Amazon or Azure
Fun Fact
•DataScale was purchased by DataStax last week
It was publicly made official today … Surprise!
Cassandra Overview
What is Big Data?
Small Data - flat file, script-based
Medium Data - single server; typical RDBMS; ACID
Big Data - multiple servers; replication = lag; sharding = management headache;
expensive machines; no longer ACID; lots of people to maintain
Cassandra Overview
Distributed, database management system
Peer-to-peer design (no master, no slaves)
Can run on commodity machines
No single point of failure
Has linear scalability
Is a cluster of equal machines, arranged in a ring and divided into ranges of hash values
Chooses Availability & Partition Tolerance over Consistency; is eventually consistent
Data is replicated automatically
Cassandra Origins
Based on Amazon’s Dynamo and Google’s BigTable
Created at Facebook for the Inbox search system in 2007
Facebook open-sourced on Google code in 2008
Became an Apache Incubator project in 2009
By 2010, it graduated to a top-level project at Apache
Apache Cassandra can be run completely royalty-free
DataStax offers a licensed/corporate version with additional tools/integrations
Why Cassandra over traditional RDBMS?
RDBMS Pros:
Single Machine
ACID guarantees
Scales vertically (bigger machine)
RDBMS Cons:
Growing past Single Machine
Scaling horizontally = Replication lag
Sharding = Complicated codebase
Failover = on-call headache
Why Cassandra over traditional RDBMS?
Cassandra Pros:
Scales horizontally
Shard-less
Failure indifferent
CAP: AP with tunable C
Cassandra Cons:
Eventually Consistent
ACID
Atomicity = All or nothing transactions
Consistency = Guarantees committed transaction state
Isolation = Transactions are independent
Durability = Committed data is never lost
A nice warm, fuzzy feeling that transactions are perfect.
Cassandra follows the A, I, & D in ACID, but not C; is Eventually Consistent
In NoSQL, we make trade-offs to serve the greatest need
CAP Theorem
Consistency = all nodes see the same data at the same time
Availability = every request gets response about whether it succeeded or failed
Partition Tolerance = the system continues to operate despite partition failures
You have to choose 2: CA, CP, or AP
Cassandra chooses AP; To be highly available in a network partition
Architecture Terms
Nodes: where data is stored; typically a logical machine
Data Center: collection of nodes; assigned replication; logical workload grouping;
should not span physical location
Cluster: one or more data center(s); can span physical location
Commit Log: first stop for any write; provides durability
Table: collection of ordered columns fetched by row
SSTable: sorted string table; immutable file; append only; sequentially stored
Architecture Components
Gossip: peer-to-peer communication protocol; discovers/shares data & locations
Partitioner: determines how to distribute data to replicas
Replication Factor: determines how many replicas to maintain
Replica Placement Strategy: determines which node(s) will contain replicas
Snitch: defines groups of nodes into data centers & racks that replication uses
What is a Node?
Just a small part of a big system
Represents a single machine (Server/VM/Container)
Has a JVM running the Cassandra Java process
Can run anywhere (RaspberryPi/laptop/cloud/on-premise)
Responsible for writing & reading its data
Typically 3,000-5,000 tps per core & 1-3 TB of data
Cluster
A cluster is a bunch of nodes that together are responsible for all the data.
Each node is responsible for a different range of the data, aka the token ranges.
A cluster can hold token values from -2^63 to 2^63-1.
The ring starts with the smallest number and circles clockwise to the largest.
We are using 0-99 to represent tokens, because -2^63 to 2^63-1 is too hard to visualize.
Node 1 is responsible for tokens 76-0
Node 2 is responsible for tokens 1-25
Node 3 is responsible for tokens 26-50
Node 4 is responsible for tokens 51-75
Replication
Replication is when we store replicas of data on multiple nodes to ensure reliability
and fault tolerance. The total number of replicas is the replication factor.
The cluster just shown had the most basic replication factor of 1 (RF=1).
Each node was responsible for only its own data.
What happens if a node is lost/corrupt/offline? → We need replicas.
Change the replication factor to 2 (RF=2)
Each node will be responsible for its own data and its neighbor's data.
We are using 0-99 to represent tokens, because -2^63 to 2^63-1 is too hard to visualize.
Replication Factor = 2
Node 1 is responsible for tokens 76-0 & 1-25
Node 2 is responsible for tokens 1-25 & 26-50
Node 3 is responsible for tokens 26-50 & 51-75
Node 4 is responsible for tokens 51-75 & 76-0
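The four-node ring above can be sketched as a toy lookup. The node names and the 0-99 token space come from the slides' simplified illustration; the functions here are invented for demonstration and are nothing like the driver's real token map.

```python
ORDER = ["Node 1", "Node 2", "Node 3", "Node 4"]  # clockwise around the ring

def primary_for(token):
    """Which node's range contains this token? (0-99 ring from the slides)"""
    if 1 <= token <= 25:
        return "Node 2"
    if 26 <= token <= 50:
        return "Node 3"
    if 51 <= token <= 75:
        return "Node 4"
    return "Node 1"  # 76-0 wraps around the ring

def replicas_for(token, rf=2):
    # Per the slides, with RF=2 each node also stores its clockwise
    # neighbor's range, so a token is held by its primary node plus
    # the rf-1 nodes counter-clockwise from it.
    i = ORDER.index(primary_for(token))
    return [ORDER[(i - k) % len(ORDER)] for k in range(rf)]

print(primary_for(67))   # Node 4 owns tokens 51-75
print(replicas_for(87))  # Node 1 owns 76-0; Node 4 also holds a replica
```

With RF=2, any single node can fail and every token range is still served by its surviving replica.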
Replication with Node Failure
Sooner or later, we’re going to lose a node. With replication, we’re covered.
Node 1 goes offline for maintenance, or someone turned off that rack.
I need a row with a token value of 87, but offline Node 1 was responsible for that range?
Node 4 saves the day because it has a replica of Node 1's range.
Consistency
We want our data to be consistent.
We have to be aware that being Available and Partition Tolerant is more
important to us than being strictly consistent.
Consistency is tunable though. We can choose to have certain queries have
stronger consistency than others.
The client can specify a Consistency Level for each read or write query issued.
Consistency Level
ONE: only one replica has to acknowledge the read or write
QUORUM: a majority (more than half) of the replicas have to acknowledge the read or write
ALL: all of the replicas have to acknowledge the read or write
With Multiple Data Centers:
LOCAL_ONE: only one replica in the local data center
LOCAL_QUORUM: a majority of the replicas in the local data center
Consistency & Replication working together
Scenario: 4 nodes with a replication factor of 3
Desire high write volume/speed, low read frequency:
A write at consistency level ONE is acknowledged by 1 node; replication syncs the other 2.
A read at consistency level QUORUM reads from 2 nodes.
Desire high read volume/speed, low write frequency:
A write at consistency level QUORUM is acknowledged by 2 nodes; replication syncs the other 1.
A read at consistency level ONE reads from 1 node.
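A useful rule of thumb behind scenarios like these: a read is guaranteed to see the latest write only when the replicas written plus the replicas read exceed the replication factor. A small sketch of that arithmetic (the function names are invented for illustration):

```python
# Overlap rule: if (replicas acknowledging a write) + (replicas consulted by
# a read) > RF, at least one replica in the read set has the latest write.

def quorum(rf):
    """QUORUM is a majority: more than half the replicas."""
    return rf // 2 + 1

def strongly_consistent(write_acks, read_acks, rf):
    return write_acks + read_acks > rf

RF = 3
print(strongly_consistent(1, quorum(RF), RF))           # ONE + QUORUM: False
print(strongly_consistent(quorum(RF), quorum(RF), RF))  # QUORUM + QUORUM: True
```

So the fast-write scenario (write ONE, read QUORUM) trades strict consistency for speed, relying on replication to catch the other replicas up.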
Peer-to-peer
Cassandra is not Client/Server or Master/Slave or Read-Write/Read-Only.
No routing data to shards; no holding leader elections, no split brains.
In Cassandra, every node is a peer to every other node.
Every instance of Cassandra runs the same Java process.
Each node is independent and completely replaceable.
Gossip
One node tells its neighbor nodes
Its neighbor nodes tell their neighbor nodes
Their neighbor nodes tell more neighbor nodes
Soon, every node knows about every other node’s business.
“Node X is down”
“Node Y is joining”
“There’s a new data center”
“The east coast data center is gone”
Hinted Handoff
Failed writes will happen. When a write happens, the coordinator node tries to
send the new data to all the replicas.
If one of the replica nodes is offline, then the other replica nodes are going to
remember what data the down node was supposed to receive, aka keep hints.
When that node appears online again, then the other nodes that kept hints are
going to handoff those hints to the newly online node.
Hints are kept for a tunable period, defaults to 3 hours.
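That 3-hour default is controlled in cassandra.yaml; the setting below shows it with its default value in milliseconds.

```yaml
# cassandra.yaml — how long to keep hints for a down node (default: 3 hours)
max_hint_window_in_ms: 10800000
```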
Write Path
Nothing to do with which node the data will be written to, and everything to do with
what happens internally in the node to get the data persisted.
When the data enters the Cassandra java process, the data is written to 2 places:
1. First appended to the Commit Log file.
2. Second to the MemTable.
The Commit Log is immediately durable and is persisted to disk.
The MemTable is a representation of the data in RAM.
Afterwards, an acknowledgement of the write goes back to the client.
Write Path (cont’d)
Once the MemTable fills up, it flushes its data to disk as an SSTable.
If the node crashes before the data is flushed, the Commit Log is replayed to
re-populate the MemTable.
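The write path can be sketched as a toy model. ToyNode, its flush threshold, and the data layout are all invented for illustration; real Cassandra is far more involved.

```python
# Toy write path: append to a commit log first (durability), then update an
# in-memory table; flush the memtable to an "SSTable" once it fills up.

class ToyNode:
    def __init__(self, flush_at=3):
        self.commit_log = []     # append-only; would be persisted to disk
        self.memtable = {}       # recent writes, held in RAM
        self.sstables = []       # immutable, sorted "files"
        self.flush_at = flush_at

    def write(self, key, value):
        self.commit_log.append((key, value))   # 1. durable append
        self.memtable[key] = value             # 2. in-memory update
        if len(self.memtable) >= self.flush_at:
            self.flush()
        return "ack"                           # 3. acknowledge the client

    def flush(self):
        # Write the memtable out as a sorted, immutable SSTable
        self.sstables.append(sorted(self.memtable.items()))
        self.memtable.clear()
        self.commit_log.clear()  # flushed data no longer needs replay

node = ToyNode()
for k, v in [("a", 1), ("b", 2), ("c", 3)]:
    node.write(k, v)
print(node.sstables)  # [[('a', 1), ('b', 2), ('c', 3)]]
```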
Read Path
The idea is that a read looks in the MemTable first, then in the SSTables.
The MemTable has the most recent partitions of data.
The SSTable files are sequential, but can get really, really big.
The Partition Index file keeps track of partitions and the offsets of their
locations in the SSTable, but it too can get large.
The Summary Index file keeps track of offsets in the Partition Index.
Read Path (cont’d)
We can go faster by using the Key Cache, an in-memory index of partitions and
their offsets in the SSTable. It skips the Partition & Summary Index files, but
only works on previously requested keys.
But which SSTable/Partition/Summary file should be looked at? Bloom Filters, a
probabilistic data structure, keep track by saying that the key you're looking
for is "definitely not there" or "maybe it's there". The false positive
"maybes" are rare, but tunable.
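The "definitely not there / maybe there" behavior can be shown with a minimal Bloom filter sketch. The sizes and hashing below are invented for illustration and are not Cassandra's actual implementation.

```python
# Minimal Bloom filter: k hash functions set k bits per key. A lookup where
# any bit is unset is a guaranteed miss; all-bits-set means "maybe there".
import hashlib

class ToyBloom:
    def __init__(self, m=64, k=3):
        self.m, self.k = m, k
        self.bits = 0  # bit array packed into one int

    def _positions(self, key):
        for i in range(self.k):
            h = hashlib.sha256(f"{i}:{key}".encode()).hexdigest()
            yield int(h, 16) % self.m

    def add(self, key):
        for p in self._positions(key):
            self.bits |= 1 << p

    def might_contain(self, key):
        return all(self.bits >> p & 1 for p in self._positions(key))

bf = ToyBloom()
bf.add("partition-123")
print(bf.might_contain("partition-123"))  # True — "maybe it's there"
# A key that was never added almost always reports False ("definitely not
# there"); the rare false positive rate depends on m and k, i.e. is tunable.
```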
Deleting Data
Data is never deleted in place in Cassandra. When a column value in a row, or an
entire row, is requested to be deleted, Cassandra actually writes additional
data that marks the column or row with a timestamp saying it's deleted. This is
called a tombstone.
Whenever a read occurs, it will skip over the tombstoned data and not return it.
Skipping over tombstones still incurs I/O though. There's even a metric that
will tell you the avg. tombstones being read. So we'll need to remove them from
the SSTables at some point.
Compaction
Compaction is the act of merging SSTables together. But why → Housekeeping.
SSTables are immutable, so we can never update the file to change a column’s
value.
If your client writes the same data 3 times, there will be 3 entries in potentially 3
different SSTables. (This assumes the MemTable flushed the data in-between
writes).
Reads have to read all 3 SSTables and compare the write timestamps to get the
correct value.
Compaction
What happens is that Cassandra will compact the SSTables by reading in those 3
SSTables and writing out a new SSTable with only the single entry. The 3 older
SSTables will get deleted at that point.
Compaction is when tombstones are purged too. Tombstones are kept around
long enough so that we don’t get “phantom deletes” though (tunable period).
Compaction will keep your future seeks on disk low.
There's a whole algorithm for when Compaction runs, but it's automatically set
up by default.
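The merge can be sketched as follows. The data layout, timestamps, and function names are invented for illustration; this is not Cassandra's actual compaction algorithm.

```python
# Toy compaction: merge several SSTables into one, keeping only the newest
# write per key (by timestamp) and purging tombstones old enough to be safe.

TOMBSTONE = object()  # sentinel marking a deleted value

def compact(sstables, purge_tombstones_before=0):
    """sstables: list of {key: (timestamp, value)} dicts."""
    merged = {}
    for table in sstables:
        for key, (ts, value) in table.items():
            if key not in merged or ts > merged[key][0]:
                merged[key] = (ts, value)   # newer write wins
    # Drop tombstones old enough that no "phantom delete" can resurface
    return {k: tv for k, tv in merged.items()
            if not (tv[1] is TOMBSTONE and tv[0] < purge_tombstones_before)}

t1 = {"user:1": (100, "alice")}
t2 = {"user:1": (200, "alicia")}   # same key written again later
t3 = {"user:2": (150, TOMBSTONE)}  # deleted row
print(compact([t1, t2, t3], purge_tombstones_before=180))
# {'user:1': (200, 'alicia')} — three entries became one; tombstone purged
```

Reads against the single compacted table no longer have to consult three files and compare timestamps.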
Repairs
Repairs are needed because, over time, distributed data naturally gets out of
sync across all its locations. Repairs just make sure that all your nodes are consistent.
Repairs happen at two times.
1. After each read, there is a (tunable) chance that a repair will occur. When a
client requests to read a particular key, a background process will gather all
the data from all the replicas, and update all the replicas to be consistent.
2. At scheduled times that are manually controlled by an admin.
Failure Recovery
Sometimes nodes go down due to maintenance, or a real catastrophe.
Cassandra will keep track of down nodes with gossip. Hints are automatically
held for a (tunable) period, so when/if the node comes back online, the other
nodes will tell it what it missed.
If the node doesn’t come back online, you have to create a new node to replace it.
Assign it the same tokens as the lost node, and the other nodes will stream the
necessary data to it from replicas.
Scaling
Scaling is when you add more capacity to the cluster. Typically, this is when you
add more nodes.
You create one or more new nodes and add them to the cluster.
A new node will join the ring and take responsibility for a part of the token ranges.
While it’s joining, other nodes will stream data for the token ranges it will own.
Once it’s fully joined, the node will start participating in normal operations.
Python with Cassandra
Python - Getting Started
Install the python driver via pip:
pip install cassandra-driver
In a .py file, create a cluster & session object and connect to it:
from cassandra.cluster import Cluster
cluster = Cluster()
session = cluster.connect()
The cluster object represents a Cassandra cluster; with no arguments it connects to your localhost.
The session object will manage all connection pooling.
Only create a session object once in your codebase and reuse it throughout.
If not, that repeated initial connection establishment will eventually become a bottleneck.
Specifying a Cluster
Localhost is great for sandboxing, but soon a real cluster with real IPs will be needed.
cluster = Cluster(['54.211.95.95', '52.90.150.156', '52.87.198.119'])
This is the set of IPs from a demo cluster I’ve created.
Authorization
Every cluster should have at least password authentication enabled; in the driver, that's the PlainTextAuthProvider.
import cassandra.auth
my_auth_provider = cassandra.auth.PlainTextAuthProvider(
username='adamhutson',
password='P@$$w0rD'
)
There are other methods of authentication, but this is the most common.
Add the above snippet before you create your Cluster object.
Then pass the my_auth_provider object to the Cluster’s auth_provider option key.
cluster = Cluster(auth_provider=my_auth_provider)
Keyspace Selection
There are 3 ways to specify a keyspace to use:
1. session = cluster.connect('my_keyspace')
2. session.set_keyspace('my_keyspace')
3. session.execute('select * from my_keyspace.my_table')
It doesn't matter which way you choose to go, just be consistent with your selection.
Personally, I use choice #2, as I can run it before I interact with the database.
It keeps me from pinning myself to a single keyspace at session creation time.
It also keeps me from having to type out the keyspace name every time I write a DML statement.
Simple Statement
The first thing most will want to do is select some data out of Cassandra. Let's
retrieve some rows from a sample time-series table.
session.set_keyspace('training')
rows = session.execute("SELECT source_id, date, event_time, event_value FROM time_series WHERE source_id = '123' and date = '2016-09-01'")
for row in rows:
    print(row.source_id, row.date, row.event_time, row.event_value)
Prepared/Bound Statement
Every time we run that Simple Statement from above, Cassandra has to parse the query.
What if you're going to run the same select repeatedly?
That parse time will become a bottleneck.
session.set_keyspace('training')
prepared_stmt = session.prepare("SELECT source_id, date, event_time, event_value FROM time_series WHERE source_id = ? and date = ?")
bound_stmt = prepared_stmt.bind(['123', '2016-09-01'])
rows = session.execute(bound_stmt)
for row in rows:
    print(row.source_id, row.date, row.event_time, row.event_value)
Batch Statement
A batch groups operations so they are applied atomically. You can also specify a BatchType.
from cassandra import ConsistencyLevel
from cassandra.query import BatchStatement
insert_user = session.prepare("INSERT INTO users (name, age) VALUES (?, ?)")
batch = BatchStatement(consistency_level=ConsistencyLevel.QUORUM)
for (name, age) in users_to_insert:
    batch.add(insert_user, (name, age))
session.execute(batch)
Be careful. This can potentially be inserting to token ranges all over the cluster. Best practice is to use
batches for inserting multiples into the same partition.
Consistency Level
Consistency Level can be specified at the query level.
Just need to import the necessary library and set it.
This setting will remain with the session until you destroy the object or set it to a different CL.
from cassandra import ConsistencyLevel
session.default_consistency_level = ConsistencyLevel.ONE
There are a bunch of session level options that you can specify.
Most of the same options are also available at the Statement level.
Shutdown
This is so simple, but so important. Finish every script with the following:
cluster.shutdown()
If you don’t do this at the end of your python file, you will leak connections on the server side.
I’ve done it, and it was completely embarrassing. Learn from my mistakes. Don’t forget it.
Thank You!
Questions?
Adam Hutson
adam@datascale.io
adam.hutson@datastax.com
@AdamHutson
@DataScaleInc
@DataStax

Cassandra Summit: Data Modeling A Scheduling AppAdam Hutson
 
Dimensional data modeling
Dimensional data modelingDimensional data modeling
Dimensional data modelingAdam Hutson
 
Alternatives to Relational Databases
Alternatives to Relational DatabasesAlternatives to Relational Databases
Alternatives to Relational DatabasesAdam Hutson
 
SQL Server 2008 Development for Programmers
SQL Server 2008 Development for ProgrammersSQL Server 2008 Development for Programmers
SQL Server 2008 Development for ProgrammersAdam Hutson
 

Mais de Adam Hutson (6)

Montreal User Group - Cloning Cassandra
Montreal User Group - Cloning CassandraMontreal User Group - Cloning Cassandra
Montreal User Group - Cloning Cassandra
 
Cassandra Summit: C* Keys - Partitioning, Clustering, & Crossfit
Cassandra Summit: C* Keys - Partitioning, Clustering, & CrossfitCassandra Summit: C* Keys - Partitioning, Clustering, & Crossfit
Cassandra Summit: C* Keys - Partitioning, Clustering, & Crossfit
 
Cassandra Summit: Data Modeling A Scheduling App
Cassandra Summit: Data Modeling A Scheduling AppCassandra Summit: Data Modeling A Scheduling App
Cassandra Summit: Data Modeling A Scheduling App
 
Dimensional data modeling
Dimensional data modelingDimensional data modeling
Dimensional data modeling
 
Alternatives to Relational Databases
Alternatives to Relational DatabasesAlternatives to Relational Databases
Alternatives to Relational Databases
 
SQL Server 2008 Development for Programmers
SQL Server 2008 Development for ProgrammersSQL Server 2008 Development for Programmers
SQL Server 2008 Development for Programmers
 

Último

MANUFACTURING PROCESS-II UNIT-1 THEORY OF METAL CUTTING
MANUFACTURING PROCESS-II UNIT-1 THEORY OF METAL CUTTINGMANUFACTURING PROCESS-II UNIT-1 THEORY OF METAL CUTTING
MANUFACTURING PROCESS-II UNIT-1 THEORY OF METAL CUTTINGSIVASHANKAR N
 
High Profile Call Girls Nagpur Isha Call 7001035870 Meet With Nagpur Escorts
High Profile Call Girls Nagpur Isha Call 7001035870 Meet With Nagpur EscortsHigh Profile Call Girls Nagpur Isha Call 7001035870 Meet With Nagpur Escorts
High Profile Call Girls Nagpur Isha Call 7001035870 Meet With Nagpur Escortsranjana rawat
 
KubeKraft presentation @CloudNativeHooghly
KubeKraft presentation @CloudNativeHooghlyKubeKraft presentation @CloudNativeHooghly
KubeKraft presentation @CloudNativeHooghlysanyuktamishra911
 
(ANJALI) Dange Chowk Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(ANJALI) Dange Chowk Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...(ANJALI) Dange Chowk Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(ANJALI) Dange Chowk Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...ranjana rawat
 
AKTU Computer Networks notes --- Unit 3.pdf
AKTU Computer Networks notes ---  Unit 3.pdfAKTU Computer Networks notes ---  Unit 3.pdf
AKTU Computer Networks notes --- Unit 3.pdfankushspencer015
 
High Profile Call Girls Nagpur Meera Call 7001035870 Meet With Nagpur Escorts
High Profile Call Girls Nagpur Meera Call 7001035870 Meet With Nagpur EscortsHigh Profile Call Girls Nagpur Meera Call 7001035870 Meet With Nagpur Escorts
High Profile Call Girls Nagpur Meera Call 7001035870 Meet With Nagpur EscortsCall Girls in Nagpur High Profile
 
CCS335 _ Neural Networks and Deep Learning Laboratory_Lab Complete Record
CCS335 _ Neural Networks and Deep Learning Laboratory_Lab Complete RecordCCS335 _ Neural Networks and Deep Learning Laboratory_Lab Complete Record
CCS335 _ Neural Networks and Deep Learning Laboratory_Lab Complete RecordAsst.prof M.Gokilavani
 
Introduction and different types of Ethernet.pptx
Introduction and different types of Ethernet.pptxIntroduction and different types of Ethernet.pptx
Introduction and different types of Ethernet.pptxupamatechverse
 
Top Rated Pune Call Girls Budhwar Peth ⟟ 6297143586 ⟟ Call Me For Genuine Se...
Top Rated  Pune Call Girls Budhwar Peth ⟟ 6297143586 ⟟ Call Me For Genuine Se...Top Rated  Pune Call Girls Budhwar Peth ⟟ 6297143586 ⟟ Call Me For Genuine Se...
Top Rated Pune Call Girls Budhwar Peth ⟟ 6297143586 ⟟ Call Me For Genuine Se...Call Girls in Nagpur High Profile
 
Call Girls Service Nashik Vaishnavi 7001305949 Independent Escort Service Nashik
Call Girls Service Nashik Vaishnavi 7001305949 Independent Escort Service NashikCall Girls Service Nashik Vaishnavi 7001305949 Independent Escort Service Nashik
Call Girls Service Nashik Vaishnavi 7001305949 Independent Escort Service NashikCall Girls in Nagpur High Profile
 
(PRIYA) Rajgurunagar Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(PRIYA) Rajgurunagar Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...(PRIYA) Rajgurunagar Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(PRIYA) Rajgurunagar Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...ranjana rawat
 
(SHREYA) Chakan Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Esc...
(SHREYA) Chakan Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Esc...(SHREYA) Chakan Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Esc...
(SHREYA) Chakan Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Esc...ranjana rawat
 
Extrusion Processes and Their Limitations
Extrusion Processes and Their LimitationsExtrusion Processes and Their Limitations
Extrusion Processes and Their Limitations120cr0395
 
Call Girls Service Nagpur Tanvi Call 7001035870 Meet With Nagpur Escorts
Call Girls Service Nagpur Tanvi Call 7001035870 Meet With Nagpur EscortsCall Girls Service Nagpur Tanvi Call 7001035870 Meet With Nagpur Escorts
Call Girls Service Nagpur Tanvi Call 7001035870 Meet With Nagpur EscortsCall Girls in Nagpur High Profile
 
The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...
The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...
The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...ranjana rawat
 
UNIT-V FMM.HYDRAULIC TURBINE - Construction and working
UNIT-V FMM.HYDRAULIC TURBINE - Construction and workingUNIT-V FMM.HYDRAULIC TURBINE - Construction and working
UNIT-V FMM.HYDRAULIC TURBINE - Construction and workingrknatarajan
 

Último (20)

MANUFACTURING PROCESS-II UNIT-1 THEORY OF METAL CUTTING
MANUFACTURING PROCESS-II UNIT-1 THEORY OF METAL CUTTINGMANUFACTURING PROCESS-II UNIT-1 THEORY OF METAL CUTTING
MANUFACTURING PROCESS-II UNIT-1 THEORY OF METAL CUTTING
 
High Profile Call Girls Nagpur Isha Call 7001035870 Meet With Nagpur Escorts
High Profile Call Girls Nagpur Isha Call 7001035870 Meet With Nagpur EscortsHigh Profile Call Girls Nagpur Isha Call 7001035870 Meet With Nagpur Escorts
High Profile Call Girls Nagpur Isha Call 7001035870 Meet With Nagpur Escorts
 
KubeKraft presentation @CloudNativeHooghly
KubeKraft presentation @CloudNativeHooghlyKubeKraft presentation @CloudNativeHooghly
KubeKraft presentation @CloudNativeHooghly
 
(ANJALI) Dange Chowk Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(ANJALI) Dange Chowk Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...(ANJALI) Dange Chowk Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(ANJALI) Dange Chowk Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
 
AKTU Computer Networks notes --- Unit 3.pdf
AKTU Computer Networks notes ---  Unit 3.pdfAKTU Computer Networks notes ---  Unit 3.pdf
AKTU Computer Networks notes --- Unit 3.pdf
 
High Profile Call Girls Nagpur Meera Call 7001035870 Meet With Nagpur Escorts
High Profile Call Girls Nagpur Meera Call 7001035870 Meet With Nagpur EscortsHigh Profile Call Girls Nagpur Meera Call 7001035870 Meet With Nagpur Escorts
High Profile Call Girls Nagpur Meera Call 7001035870 Meet With Nagpur Escorts
 
CCS335 _ Neural Networks and Deep Learning Laboratory_Lab Complete Record
CCS335 _ Neural Networks and Deep Learning Laboratory_Lab Complete RecordCCS335 _ Neural Networks and Deep Learning Laboratory_Lab Complete Record
CCS335 _ Neural Networks and Deep Learning Laboratory_Lab Complete Record
 
Water Industry Process Automation & Control Monthly - April 2024
Water Industry Process Automation & Control Monthly - April 2024Water Industry Process Automation & Control Monthly - April 2024
Water Industry Process Automation & Control Monthly - April 2024
 
Introduction and different types of Ethernet.pptx
Introduction and different types of Ethernet.pptxIntroduction and different types of Ethernet.pptx
Introduction and different types of Ethernet.pptx
 
Top Rated Pune Call Girls Budhwar Peth ⟟ 6297143586 ⟟ Call Me For Genuine Se...
Top Rated  Pune Call Girls Budhwar Peth ⟟ 6297143586 ⟟ Call Me For Genuine Se...Top Rated  Pune Call Girls Budhwar Peth ⟟ 6297143586 ⟟ Call Me For Genuine Se...
Top Rated Pune Call Girls Budhwar Peth ⟟ 6297143586 ⟟ Call Me For Genuine Se...
 
(INDIRA) Call Girl Aurangabad Call Now 8617697112 Aurangabad Escorts 24x7
(INDIRA) Call Girl Aurangabad Call Now 8617697112 Aurangabad Escorts 24x7(INDIRA) Call Girl Aurangabad Call Now 8617697112 Aurangabad Escorts 24x7
(INDIRA) Call Girl Aurangabad Call Now 8617697112 Aurangabad Escorts 24x7
 
Roadmap to Membership of RICS - Pathways and Routes
Roadmap to Membership of RICS - Pathways and RoutesRoadmap to Membership of RICS - Pathways and Routes
Roadmap to Membership of RICS - Pathways and Routes
 
Call Girls Service Nashik Vaishnavi 7001305949 Independent Escort Service Nashik
Call Girls Service Nashik Vaishnavi 7001305949 Independent Escort Service NashikCall Girls Service Nashik Vaishnavi 7001305949 Independent Escort Service Nashik
Call Girls Service Nashik Vaishnavi 7001305949 Independent Escort Service Nashik
 
(PRIYA) Rajgurunagar Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(PRIYA) Rajgurunagar Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...(PRIYA) Rajgurunagar Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(PRIYA) Rajgurunagar Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
 
(SHREYA) Chakan Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Esc...
(SHREYA) Chakan Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Esc...(SHREYA) Chakan Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Esc...
(SHREYA) Chakan Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Esc...
 
Extrusion Processes and Their Limitations
Extrusion Processes and Their LimitationsExtrusion Processes and Their Limitations
Extrusion Processes and Their Limitations
 
Call Girls Service Nagpur Tanvi Call 7001035870 Meet With Nagpur Escorts
Call Girls Service Nagpur Tanvi Call 7001035870 Meet With Nagpur EscortsCall Girls Service Nagpur Tanvi Call 7001035870 Meet With Nagpur Escorts
Call Girls Service Nagpur Tanvi Call 7001035870 Meet With Nagpur Escorts
 
The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...
The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...
The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...
 
UNIT-V FMM.HYDRAULIC TURBINE - Construction and working
UNIT-V FMM.HYDRAULIC TURBINE - Construction and workingUNIT-V FMM.HYDRAULIC TURBINE - Construction and working
UNIT-V FMM.HYDRAULIC TURBINE - Construction and working
 
DJARUM4D - SLOT GACOR ONLINE | SLOT DEMO ONLINE
DJARUM4D - SLOT GACOR ONLINE | SLOT DEMO ONLINEDJARUM4D - SLOT GACOR ONLINE | SLOT DEMO ONLINE
DJARUM4D - SLOT GACOR ONLINE | SLOT DEMO ONLINE
 

Cassandra & Python - Springfield MO User Group

  • 1. Cassandra & Python Adam Hutson, Data Architect @adamhutson
  • 2. Who am I and What do we do? •Adam Hutson •Data Architect of DataScale -> www.datascale.io •DataStax MVP for Apache Cassandra •DataScale provides hosted data platforms as a service •Offering Cassandra & Spark, with more to come •Currently hosted in Amazon or Azure
  • 3. Fun Fact •DataScale was purchased by DataStax last week •It was publicly made official today … Surprise!
  • 5. What is Big Data? Small Data - flat file, script-based Medium Data - single server; typical RDBMS; ACID Big Data - multiple servers; replication = lag; sharding = management headache; expensive machines; no longer ACID; lots of people to maintain
  • 6. Cassandra Overview Distributed database management system Peer-to-peer design (no master, no slaves) Can run on commodity machines No single point of failure Has linear scalability Is a cluster/ring of equal machines divided into a ring of hash values Chooses Availability & Partition Tolerance over Consistency; Is eventually consistent Data is replicated automatically
  • 7. Cassandra Origins Based on Amazon’s Dynamo and Google’s BigTable Created at Facebook for the Inbox search system in 2007 Facebook open-sourced on Google code in 2008 Became an Apache Incubator project in 2009 By 2010, it graduated to a top-level project at Apache Apache Cassandra can be run completely royalty-free DataStax offers a licensed/corporate version with additional tools/integrations
  • 8. Why Cassandra over traditional RDBMS? RDBMS Pros: Single Machine ACID guarantees Scales vertical (bigger machine) RDBMS Cons: Growing past Single Machine Scale horizontal = Replication lag Sharding = Complicated codebase Failover = on-call headache
  • 9. Why Cassandra over traditional RDBMS? Cassandra Pros: Scale horizontal Shard-less Failure indifferent CAP: AP with tunable C Cassandra Cons: Eventually Consistent
  • 10. ACID Atomicity = All or nothing transactions Consistency = Guarantees committed transaction state Isolation = Transactions are independent Durability = Committed data is never lost A nice warm fuzzy that transactions are perfect. Cassandra follows the A, I, & D in ACID, but not C; it is Eventually Consistent In NoSQL, we make trade-offs to serve the greatest need
  • 11. CAP Theorem Consistency = all nodes see the same data at the same time Availability = every request gets response about whether it succeeded or failed Partition Tolerance = the system continues to operate despite partition failures You have to choose 2: CA, CP, or AP Cassandra chooses AP; To be highly available in a network partition
  • 13. Architecture Terms Nodes: where data is stored; typically a logical machine Data Center: collection of nodes; assigned replication; logical workload grouping; should not span physical location Cluster: one or more data center(s); can span physical location Commit Log: first stop for any write; provides durability Table: collection of ordered columns fetched by row; SSTable: sorted string table; immutable file; append only; sequentially stored
  • 14. Architecture Components Gossip: peer-to-peer communication protocol; discovers/shares data & locations Partitioner: determines how to distribute data to replicas Replication Factor: determines how many replicas to maintain Replica Placement Strategy: determines which node(s) will contain replicas Snitch: defines groups of nodes into data centers & racks that replication uses
  • 15. What is a Node? Just a small part of a big system Represents a single machine (Server/VM/Container) Has a JVM running the Cassandra Java process Can run anywhere (RaspberryPi/laptop/cloud/on-premise) Responsible for writing & reading its data Typically 3-5,000 tps/core & 1-3 TB of data
  • 16. Cluster A cluster is a bunch of nodes that together are responsible for all the data. Each node is responsible for a different range of the data, aka the token ranges. A cluster can hold token values from -2^63 to 2^63-1. The ring starts with the smallest number and circles clockwise to the largest.
  • 17. We are using 0-99 to represent tokens, because -2^63 to 2^63-1 is too hard to visualize Node 1 is responsible for tokens 76-0 Node 2 is responsible for tokens 1-25 Node 3 is responsible for tokens 26-50 Node 4 is responsible for tokens 51-75
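The 0-99 ring above can be sketched in plain Python. This is only an illustration of walking the ring clockwise to find a token's owner, not the driver's actual partitioner:

```python
# Toy token ring: 4 nodes with tokens 0-99, matching the example above.
# Each node owns the range that *ends* at its token, wrapping around.
NODE_TOKENS = {"Node 1": 0, "Node 2": 25, "Node 3": 50, "Node 4": 75}

def owner_of(token, node_tokens=NODE_TOKENS):
    """Walk clockwise: the first node whose token >= the key's token owns it."""
    for node, node_token in sorted(node_tokens.items(), key=lambda kv: kv[1]):
        if token <= node_token:
            return node
    # Wrapped past the largest token: back around to the smallest (Node 1).
    return min(node_tokens, key=node_tokens.get)

print(owner_of(67))  # Node 4 (range 51-75)
print(owner_of(80))  # Node 1 (range 76-0, wrapping around the ring)
```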
  • 18. Replication Replication is when we store replicas of data on multiple nodes to ensure reliability and fault tolerance. The total number of replicas is the replication factor. The cluster just shown had the most basic replication factor of 1 (RF=1). Each node was responsible for only its own data. What happens if a node is lost/corrupt/offline? → We need replicas. Change the replication factor to 2 (RF=2) Each node will be responsible for its own data and its neighbor's data.
  • 19. We are using 0-99 to represent tokens, because -2^63 to 2^63-1 is too hard to visualize Replication Factor = 2 Node 1 is responsible for tokens 76-0 & 1-25 Node 2 is responsible for tokens 1-25 & 26-50 Node 3 is responsible for tokens 26-50 & 51-75 Node 4 is responsible for tokens 51-75 & 76-0
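With RF=2, the same toy ring can show replica placement. Note this sketch follows the layout on this slide, where each range is also held by the *previous* node on the ring; it is an illustration only, not the driver's or server's placement code:

```python
# Toy replica placement matching the RF=2 layout above.
RING = [("Node 1", 0), ("Node 2", 25), ("Node 3", 50), ("Node 4", 75)]

def replicas_for(token, rf=2, ring=RING):
    """Primary owner plus the previous rf-1 nodes on the ring hold replicas."""
    ordered = sorted(ring, key=lambda kv: kv[1])
    # Primary owner: first node whose token >= the key's token (wrapping).
    idx = next((i for i, (_, t) in enumerate(ordered) if token <= t), 0)
    return [ordered[(idx - i) % len(ordered)][0] for i in range(rf)]

print(replicas_for(90))  # ['Node 1', 'Node 4']: Node 4 replicates Node 1's range
print(replicas_for(67))  # ['Node 4', 'Node 3']: Node 3 replicates Node 4's range
```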
  • 20. Replication with Node Failure Sooner or later, we’re going to lose a node. With replication, we’re covered. Node 1 goes offline for maintenance, or someone turned off that rack. I need the row with token value 90, but Node 1 was responsible for that range? Node 4 saves the day because it has a replica of Node 1’s range.
  • 21. Consistency We want our data to be consistent. We have to be aware that being Available and having Partition Tolerance is more important than being strictly consistent. Consistency is tunable though. We can choose to have certain queries have stronger consistency than others. The client can specify a Consistency Level for each read or write query issued.
  • 22. Consistency Level ONE: only one replica has to acknowledge the read or write QUORUM: 51% of the replicas have to acknowledge the read or write ALL: all of the replicas have to acknowledge the read or write With Multiple Data Centers: LOCAL_ONE: only one replica in the local data center LOCAL_QUORUM: 51% of the replicas in the local data center
  • 23. Consistency & Replication working together Scenario: 4 Nodes with Replication Factor of 3 Desire High Write Volume/Speed, Low Read frequency Write at Consistency Level ONE will write to 1 node; replication will sync the other 2 Read at Consistency Level QUORUM will read from 2 Desire High Read Volume/Speed, Low Write frequency Write at Consistency Level QUORUM will write to 2 nodes; replication will sync the other 1 Read at Consistency Level ONE will read from 1 node
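The interplay above reduces to a rule of thumb: a read is guaranteed to see the latest write when the write replica count plus the read replica count exceeds the replication factor (the two replica sets must overlap). A small sanity-check function, purely illustrative:

```python
def is_strongly_consistent(write_replicas, read_replicas, rf):
    """True when read and write replica sets must overlap in >= 1 node."""
    return write_replicas + read_replicas > rf

rf = 3
quorum = rf // 2 + 1                # QUORUM = 2 of 3 replicas

# Write ONE + read QUORUM: 1 + 2 = 3, not > 3 -> a read may see stale data
# until replication catches up (the high-write-volume scenario above).
print(is_strongly_consistent(1, quorum, rf))       # False

# Write QUORUM + read QUORUM: 2 + 2 = 4 > 3 -> reads always overlap a
# replica that acknowledged the write.
print(is_strongly_consistent(quorum, quorum, rf))  # True
```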
  • 24. Peer-to-peer Cassandra is not Client/Server or Master/Slave or Read-Write/Read-Only. No routing data to shards; no holding leader elections, no split brains. In Cassandra, every node is a peer to every other node. Every instance of Cassandra is running the same java process. Each node is independent and completely replaceable.
  • 25. Gossip One node tells its neighbor nodes Its neighbor nodes tell their neighbor nodes Their neighbor nodes tell more neighbor nodes Soon, every node knows about every other node’s business. “Node X is down” “Node Y is joining” “There’s a new data center” “The east coast data center is gone”
  • 26. Hinted Handoff Failed writes will happen. When a write happens, the coordinator tries to write the new data to all the replicas. If one of the replica nodes is offline, then the other replica nodes are going to remember what data the down node was supposed to receive, aka keep hints. When that node appears online again, then the other nodes that kept hints are going to hand off those hints to the newly online node. Hints are kept for a tunable period, defaults to 3 hours.
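The hint-keeping behavior can be mimicked with a toy model. Node names and structures here are made up for illustration; the real mechanism lives inside the server:

```python
from collections import defaultdict

hints = defaultdict(list)                      # hints held for down nodes
online = {"node-a": True, "node-b": False}     # node-b is offline
data = defaultdict(dict)                       # each node's stored rows

def write(key, value, replicas=("node-a", "node-b")):
    """Write to every online replica; keep a hint for any offline one."""
    for node in replicas:
        if online[node]:
            data[node][key] = value
        else:
            hints[node].append((key, value))   # remember what it missed

def bring_online(node):
    """When the node returns, hand off the stored hints."""
    online[node] = True
    for key, value in hints.pop(node, []):
        data[node][key] = value

write("user:1", "alice")                       # node-b misses this write
bring_online("node-b")                         # ...then catches up via hints
print(data["node-b"])                          # {'user:1': 'alice'}
```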
  • 27. Write Path The write path has nothing to do with which node the data will be written to, and everything to do with what happens internally in the node to get the data persisted. When the data enters the Cassandra java process, the data is written to 2 places: 1. First appended to the Commit Log file. 2. Second to the MemTable. The Commit Log is immediately durable and is persisted to disk. The MemTable is a representation of the data in RAM. Afterwards, the write acknowledgment goes back to the client.
  • 28. Write Path (cont’d) Once the MemTable fills up, then it flushes its data to disk as an SSTable. If the node crashes before the data is flushed, then the Commit Log is replayed to re-populate the MemTable.
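The write path above can be sketched with a toy node. This is a deliberately simplified model (in-memory lists standing in for files), nothing like the real storage engine:

```python
class ToyNode:
    """Toy write path: commit log first, then memtable,
    flushing the memtable to an immutable 'SSTable' when it fills up."""
    def __init__(self, memtable_limit=2):
        self.commit_log = []           # durable, append-only (really a file)
        self.memtable = {}             # in-RAM representation of recent writes
        self.sstables = []             # immutable, sorted 'files'
        self.memtable_limit = memtable_limit

    def write(self, key, value):
        self.commit_log.append((key, value))   # 1. append to the commit log
        self.memtable[key] = value             # 2. update the memtable
        if len(self.memtable) >= self.memtable_limit:
            self.flush()
        return "ack"                           # 3. acknowledge the client

    def flush(self):
        """Write the memtable out as a sorted, immutable SSTable."""
        self.sstables.append(dict(sorted(self.memtable.items())))
        self.memtable = {}

node = ToyNode()
node.write("a", 1)
node.write("b", 2)        # memtable full -> flushed to an SSTable
node.write("c", 3)
print(node.sstables)      # [{'a': 1, 'b': 2}]
print(node.memtable)      # {'c': 3}
```

If the toy node "crashed" here, replaying `commit_log` would rebuild the memtable, which is exactly the recovery role the commit log plays above.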
  • 29. Read Path The idea is that it looks in the MemTable first, then in the SSTable. The MemTable has the most recent partitions of data. The SSTable file is sequential, but can get really, really big. The Partition Index file keeps track of partitions and the offsets of their locations in the SSTable, but it too can get large. The Summary Index file keeps track of offsets in the Partition Index.
  • 30. Read Path (cont’d) We can go faster by using Key Cache, which is an in-memory index of partitions and their offsets in the SSTable. Skips the Partition & Summary Index files. Only works on previously requested keys. But which SSTable/Partition/Summary file should be looked at? Bloom Filters are a probabilistic data structure. They keep track by saying that the key you’re looking for is “definitely not there” or “maybe it’s there”. The false positive “maybes” are rare, but tunable.
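A minimal Bloom filter sketch shows where the "definitely not" / "maybe" answers come from. Real implementations use proper hash families and sizing math; this toy just derives bit positions from SHA-256:

```python
import hashlib

class ToyBloomFilter:
    """Tiny Bloom filter: no false negatives, rare false positives."""
    def __init__(self, size=64, hashes=3):
        self.size, self.hashes, self.bits = size, hashes, 0

    def _positions(self, key):
        # Derive `hashes` bit positions from salted SHA-256 digests.
        for i in range(self.hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, key):
        for pos in self._positions(key):
            self.bits |= 1 << pos

    def might_contain(self, key):
        # All bits set -> "maybe it's there"; any bit clear -> "definitely not".
        return all(self.bits & (1 << pos) for pos in self._positions(key))

bf = ToyBloomFilter()
bf.add("partition-123")
print(bf.might_contain("partition-123"))  # True: added keys are never missed
```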
  • 31. Deleting Data Data is never deleted in Cassandra. When a column value in a row, or an entire row, is requested to be deleted, Cassandra actually writes additional data that marks the column or row with a timestamp that says it’s deleted. This is called a tombstone. Whenever a read occurs, it will skip over the tombstoned data and not return it. Skipping over tombstones still incurs I/O though. There’s even a metric that will tell you the avg. tombstones being read. So we’ll need to remove them from the SSTable at some point.
  • 32. Compaction Compaction is the act of merging SSTables together. But why → Housekeeping. SSTables are immutable, so we can never update the file to change a column’s value. If your client writes the same data 3 times, there will be 3 entries in potentially 3 different SSTables. (This assumes the MemTable flushed the data in-between writes). Reads have to read all 3 SSTables and compare the write timestamps to get the correct value.
  • 33. Compaction What happens is that Cassandra will compact the SSTables by reading in those 3 SSTables and writing out a new SSTable with only the single entry. The 3 older SSTables will get deleted at that point. Compaction is when tombstones are purged too. Tombstones are kept around long enough so that we don’t get “phantom deletes” though (tunable period). Compaction will keep your future seeks on disk low. There’s a whole algorithm on when Compaction runs, but it’s automatically setup by default.
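The merge described above (last write wins by timestamp, tombstones purged) can be sketched as a plain function. A toy illustration of the idea only; the real compaction strategies are far more involved:

```python
TOMBSTONE = object()   # marker standing in for a delete

def compact(sstables):
    """Merge SSTables: keep only the newest write per key (by timestamp),
    then purge keys whose surviving entry is a tombstone."""
    merged = {}
    for table in sstables:                             # any order works:
        for key, (timestamp, value) in table.items():  # timestamps decide
            if key not in merged or timestamp > merged[key][0]:
                merged[key] = (timestamp, value)
    return {k: tv for k, tv in merged.items() if tv[1] is not TOMBSTONE}

sstables = [
    {"user:1": (100, "alice"), "user:2": (100, "bob")},
    {"user:1": (200, "alicia")},          # same row written again later
    {"user:2": (300, TOMBSTONE)},         # row deleted -> tombstone
]
print(compact(sstables))   # {'user:1': (200, 'alicia')}
```

The three input tables collapse to one entry, which is exactly why compaction keeps future disk seeks low.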
  • 34. Repairs Repairs are needed because, over time, distributed data naturally gets out of sync across all the locations. Repairs just make sure that all your nodes are consistent. Repairs happen at two times. 1. After each read, there is a (tunable) chance that a repair will occur. When a client requests to read a particular key, a background process will gather all the data from all the replicas, and update all the replicas to be consistent. 2. At scheduled times that are manually controlled by an admin.
  • 35. Failure Recovery Sometimes nodes go down due to maintenance, or a real catastrophe. Cassandra will keep track of down nodes with gossip. Hints are automatically held for a (tunable) period. So when/if the node comes back online, the other nodes will tell it what it missed. If the node doesn’t come back online, you have to create a new node to replace it. Assign it the same tokens as the lost node, and the other nodes will stream the necessary data to it from replicas.
  • 36. Scaling Scaling is when you add more capacity to the cluster. Typically, this is when you add more nodes. You create a new node(s) and add it to the cluster. A new node will join the ring and take responsibility for a part of the token ranges. While it’s joining, other nodes will stream data for the token ranges it will own. Once it’s fully joined, the node will start participating in normal operations.
  • 38. Python - Getting Started Install the python driver via pip: pip install cassandra-driver In a .py file, create a cluster & session object and connect to it:
from cassandra.cluster import Cluster
cluster = Cluster()
session = cluster.connect()
The cluster object represents a cassandra cluster on your localhost. The session object will manage all connection pooling. Only create a session object once in your codebase and reuse it throughout. If not, that repeated initial connection establishment will eventually become a bottleneck.
  • 39. Specifying a Cluster Localhost is great for sandboxing, but soon a real cluster with real IPs will be needed. cluster = Cluster(['54.211.95.95', '52.90.150.156', '52.87.198.119']) This is the set of IPs from a demo cluster I’ve created.
  • 40. Authentication Every cluster should have at least PlainTextAuthProvider enabled.
import cassandra.auth
my_auth_provider = cassandra.auth.PlainTextAuthProvider(
    username='adamhutson',
    password='P@$$w0rD'
)
There are other methods of authentication, but this is the most common. Add the above snippet before you create your Cluster object. Then pass the my_auth_provider object to the Cluster’s auth_provider option key.
cluster = Cluster(auth_provider=my_auth_provider)
  • 41. Keyspace Selection There are 3 ways to specify a keyspace to use: 1. session = cluster.connect('my_keyspace') 2. session.set_keyspace('my_keyspace') 3. session.execute('select * from my_keyspace.my_table') It doesn’t matter which way you chose to go, just be consistent with your selection. Personally, I use choice #2, as I can run it before I interact with the database. It keeps me from pinning myself to a single keyspace at session creation time. It also keeps me from having to type out the keyspace name every time I write a DML statement.
  • 42. Simple Statement The first thing most will want to do is select some data out of Cassandra. Let’s retrieve some time-series data.
session.set_keyspace('training')
rows = session.execute("SELECT source_id, date, event_time, event_value FROM time_series WHERE source_id = '123' and date = '2016-09-01'")
for row in rows:
    print(row.source_id, row.date, row.event_time, row.event_value)
  • 43. Prepared/Bound Statement Every time we run that Simple Statement from above, Cassandra has to compile the query. What if you’re going to run the same select repeatedly? That compile time will become a bottleneck.
session.set_keyspace('training')
prepared_stmt = session.prepare('SELECT source_id, date, event_time, event_value FROM time_series WHERE source_id = ? and date = ?')
bound_stmt = prepared_stmt.bind(['123', '2016-09-01'])
rows = session.execute(bound_stmt)
for row in rows:
    print(row.source_id, row.date, row.event_time, row.event_value)
  • 44. Batch Statement A batch is a set of operations that are applied atomically. Specify a BatchType.
from cassandra import ConsistencyLevel
from cassandra.query import BatchStatement
insert_user = session.prepare("INSERT INTO users (name, age) VALUES (?, ?)")
batch = BatchStatement(consistency_level=ConsistencyLevel.QUORUM)
for (name, age) in users_to_insert:
    batch.add(insert_user, (name, age))
session.execute(batch)
Be careful. This can potentially be inserting to token ranges all over the cluster. Best practice is to use Batches for inserting multiples into the same partition.
  • 45. Consistency Level Consistency Level can be specified at the query level. Just import the necessary library and set it. This setting will remain with the session until you destroy the object or set it to a different CL.
from cassandra import ConsistencyLevel
session.default_consistency_level = ConsistencyLevel.ONE
There are a bunch of session level options that you can specify. Most of the same options are available at the Statement level.
  • 46. Shutdown This is so simple, but so important. Finish every script with the following: cluster.shutdown() If you don’t do this at the end of your python file, you will leak connections on the server side. I’ve done it, and it was completely embarrassing. Learn from my mistakes. Don’t forget it.