Cassandra hands on

Niall Milton, CTO, DigBigData
Examples courtesy of Patrick Callaghan, DataStax

Introduction
—  We will be walking through Cassandra use cases
from Patrick Callaghan on GitHub.
—  https://github.com/PatrickCallaghan/
—  Patrick sends his apologies, but due to the Aer Lingus
strike on Friday he couldn't get a flight back to the
UK
—  This presentation will cover the important points
from each sample application

Agenda
—  Transactions Example
—  Paging Example
—  Analytics Example
—  Risk Sensitivity Example

Scenario
—  We want to add products, each with a quantity to
an order
—  Orders come in concurrently from random buyers
—  Products that have sold out will return “OUT OF
STOCK”
—  We want to use lightweight transactions to
guarantee that we do not allow orders to complete
when no stock is available

Lightweight Transactions
—  Guarantee a serial isolation level for the affected
partition, comparable to the isolation in ACID
—  Uses the Paxos consensus algorithm to achieve this in a
distributed system. See:
—  http://research.microsoft.com/en-us/um/people/lamport/pubs/paxos-simple.pdf
—  Every node is still equal, no master or locks
—  Allows for conditional inserts & updates
—  The cost of linearizable consistency is higher latency,
so LWTs are not suitable for high-volume writes where
low latency is required

Retrieve & Run the Code
1.  git clone https://github.com/PatrickCallaghan/datastax-transaction-demo.git
2.  mvn clean compile exec:java -Dexec.mainClass="com.datastax.demo.SchemaSetup"
3.  mvn clean compile exec:java -Dexec.mainClass="com.datastax.transactions.Main" -Dload=true -DcontactPoints=127.0.0.1 -DnoOfThreads=10

Schema
1.  create keyspace if not exists
datastax_transactions_demo WITH replication =
{'class': 'SimpleStrategy',
'replication_factor': '1' };
2.  create table if not exists products(productId
text, capacityleft int, orderIds set<text>,
PRIMARY KEY (productId));
3.  create table if not exists
buyers_orders(buyerId text, orderId text,
productId text, PRIMARY KEY(buyerId, orderId));

Model
public class Order {

private String orderId;
private String productId;
private String buyerId;

…
}

Method
—  Find current product quantity at CL.SERIAL
—  This allows us to execute a Paxos query without
proposing an update, i.e. read the current value
SELECT capacityLeft FROM products WHERE productId = '1234';
e.g. capacityLeft = 5

Method Contd.
—  Do a conditional update using IF operator to make
sure product quantity has not changed since last
quantity check
—  Note the use of the set collection type here.
—  This statement will only succeed if the IF condition is
met
UPDATE products SET orderIds = orderIds + {'3'}, capacityleft = 4
WHERE productId = '1234' IF capacityleft = 5;

Method Contd.
—  If the last query succeeds, simply insert the order.
INSERT INTO buyers_orders (buyerId, orderId, productId)
VALUES ('1', '3', '1234');
—  This guarantees that no order will be placed where
there is insufficient quantity to fulfill it.
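
The three steps above (read the quantity, update only if it is unchanged, then record the order) can be sketched without a cluster. This is a minimal in-memory stand-in, not the demo's code: `ConcurrentHashMap.replace(key, expected, updated)` plays the role of the `IF capacityleft = ...` condition, the class and method names are hypothetical, and retrying after a failed condition is an assumption about how a client might react.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// In-memory stand-in for the products table: productId -> capacityleft.
// Hypothetical names; the real demo issues CQL through the Java driver.
class StockSketch {
    private final ConcurrentMap<String, Integer> capacityLeft = new ConcurrentHashMap<>();

    void addProduct(String productId, int capacity) {
        capacityLeft.put(productId, capacity);
    }

    // Returns true if one unit was reserved, false if the product is out of stock.
    boolean reserveOne(String productId) {
        while (true) {
            // Step 1: read the current quantity (the SELECT at CL.SERIAL).
            Integer current = capacityLeft.get(productId);
            if (current == null || current <= 0) {
                return false; // "OUT OF STOCK"
            }
            // Step 2: conditional update, analogous to
            // UPDATE ... SET capacityleft = current - 1 ... IF capacityleft = current
            if (capacityLeft.replace(productId, current, current - 1)) {
                return true; // Step 3 (inserting the order) would follow here.
            }
            // Condition failed: another buyer changed the quantity, so re-read and retry.
        }
    }

    int remaining(String productId) {
        return capacityLeft.getOrDefault(productId, 0);
    }

    public static void main(String[] args) {
        StockSketch stock = new StockSketch();
        stock.addProduct("1234", 5);
        int accepted = 0;
        for (int i = 0; i < 8; i++) {
            if (stock.reserveOne("1234")) {
                accepted++;
            }
        }
        System.out.println("accepted=" + accepted + ", remaining=" + stock.remaining("1234"));
    }
}
```

The point of the sketch is that the decrement only happens when the value read in step 1 is still the stored value, so two concurrent buyers can never both take the last unit.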

Comments
—  Using LWT incurs a cost of higher latency because
a quorum of replicas must take part in several Paxos
round trips before a value is committed / returned.
—  CL.SERIAL does not propose a new value but is
used to read the possibly uncommitted Paxos
state
—  The IF operator can also be used as IF NOT EXISTS,
which is useful for user creation, for example

Scenario
—  We have 1000s of products in our product
catalogue
—  We want to browse these using a simple select
—  We don’t want to retrieve all at once!

Cursors
—  We are often dealing with wide rows in Cassandra
—  Reading entire rows or multiple rows at once could
lead to OOM errors
—  Traditionally this meant using range queries to
retrieve content
—  Cassandra 2.0 (with the Java driver) introduces cursors,
i.e. automatic paging of results
—  Makes row-based queries more efficient (no need to
use the token() function)
—  This simplifies client code

Retrieve & Run the Code
1.  git clone …paging-demo.git
2.  mvn clean compile exec:java -Dexec.mainClass="com.datastax.demo.SchemaSetup"
3.  mvn clean compile exec:java -Dexec.mainClass="com.datastax.paging.Main"

Schema
create table if not exists
products(productId text, capacityleft int,
orderIds set<text>, PRIMARY KEY
(productId));
—  N.B. With the default partitioner, products will be
ordered by their Murmur3 hash value. Previously we
would have needed the token() function to retrieve
them in that order

Model
public class Product {

private String productId;
private int capacityLeft;
private Set<String> orderIds;

…
}

Method
1.  Create a simple select query for the products
table.
2.  Set the fetch size parameter
3.  Execute the statement
Statement stmt = new SimpleStatement("SELECT * FROM products");
stmt.setFetchSize(100);
ResultSet resultSet = this.session.execute(stmt);

Method Contd.
1.  Get an iterator for the result set
2.  Use a while loop to iterate over the result set
Iterator<Row> iterator = resultSet.iterator();
while (iterator.hasNext()){
Row row = iterator.next();
// do stuff with the row
}
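
To make the effect of the fetch size concrete, here is a minimal sketch of an iterator that pulls rows one page at a time. It is an in-memory illustration only: the class `PagedRows` is hypothetical, a `List` stands in for the table, and a plain offset stands in for the driver's paging state, which in the real driver is opaque to the client.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

// Hypothetical sketch of a paged result: rows are pulled from the "server"
// one page at a time, so only fetchSize rows are held in memory at once.
class PagedRows implements Iterator<String> {
    private final List<String> table;   // stands in for the products table
    private final int fetchSize;
    private List<String> page = new ArrayList<>();
    private int posInPage = 0;
    private int nextOffset = 0;         // paging state kept on the client side
    private int pagesFetched = 0;

    PagedRows(List<String> table, int fetchSize) {
        if (fetchSize <= 0) {
            throw new IllegalArgumentException("fetchSize must be positive");
        }
        this.table = table;
        this.fetchSize = fetchSize;
    }

    @Override
    public boolean hasNext() {
        if (posInPage < page.size()) {
            return true;                // rows left in the current page
        }
        if (nextOffset >= table.size()) {
            return false;               // no more pages
        }
        fetchNextPage();                // transparently load the next page
        return true;
    }

    @Override
    public String next() {
        if (!hasNext()) {
            throw new NoSuchElementException();
        }
        return page.get(posInPage++);
    }

    private void fetchNextPage() {
        int end = Math.min(nextOffset + fetchSize, table.size());
        page = new ArrayList<>(table.subList(nextOffset, end));
        nextOffset = end;
        posInPage = 0;
        pagesFetched++;
    }

    int pagesFetched() {
        return pagesFetched;
    }
}
```

The calling code is the same `while (iterator.hasNext())` loop as on the slide; the page boundaries never show up in it.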

Comments
—  Very easy to transparently iterate in a memory
efficient way over a large result set
—  Cursor state is maintained by driver.
—  Allows for failover between different page
responses, i.e. the state is not lost if a page fails to
load from a node in the replica set, the page will be
requested from another node
—  See: http://www.datastax.com/dev/blog/client-side-improvements-in-cassandra-2-0

Scenario
—  Don't have Hadoop but want to run some Hive-style
analytics on our large dataset
—  Example: Get the top 10 financial transactions
ordered by monetary value for each user
—  May want to add more complex filtering later
(where value > 1000) or even do mathematical
groupings, percentiles, means, min, max

Cassandra for Analytics
—  Useful for many scenarios when no other analytics
solution is available
—  Using cursors, queries are bounded & memory efficient
depending on the operation
—  Can be applied anywhere we can do iterative or recursive
processing, SUM, AVG, MIN, MAX etc.
—  NB: The example code also includes a
CQLSSTableWriter which is fast & convenient if we want
to manually create SSTables of large datasets rather
than send millions of insert queries to Cassandra

Retrieve & Run the Code
1.  git clone …analytics-example.git
2.  export MAVEN_OPTS=-Xmx512M (up the memory)
3.  mvn clean compile exec:java -Dexec.mainClass="com.datastax.bulkloader.Main"
4.  mvn clean compile exec:java -Dexec.mainClass="com.datastax.analytics.TopTransactionsByAmountForUserRunner"

Schema
create table IF NOT EXISTS transactions (
accid text,
txtnid uuid,
txtntime timestamp,
amount double,
type text,
reason text,
PRIMARY KEY(accid, txtntime)
);

Model
public class Transaction {
private String txtnId;
private String acountId;
private double amount;
private Date txtnDate;
private String reason;
private String type;
…
}

Method
—  Pass a blocking queue into the DAO method which cursors the
data, allows us to pop items off as they are added
—  NB: Could also use a callback here to update the queue
public void getAllProducts(BlockingQueue<Transaction> processorQueue) {
    Statement stmt = new SimpleStatement("SELECT * FROM transactions");
    stmt.setFetchSize(2500);
    ResultSet resultSet = this.session.execute(stmt);
    …
}

Method Contd.
1.  Get an iterator for the result set
2.  Use a while loop to iterate over the result set, add each row
into the queue
Iterator<Row> iterator = resultSet.iterator();
while (iterator.hasNext()) {
    Row row = iterator.next();
    Transaction transaction = createTransactionFromRow(row); // convenience method
    processorQueue.offer(transaction);
}
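
The hand-off through the queue can be sketched with the JDK alone. This is a minimal illustration under stated assumptions, not the demo's code: `QueueHandOffSketch`, `Item` and the `END` marker are hypothetical, a list of amounts stands in for the result set, and the producer uses `put` (which blocks while a bounded queue is full) where the slide uses `offer` (which returns false instead of waiting).

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

class QueueHandOffSketch {
    // Minimal stand-in for the Transaction model class.
    static final class Item {
        final double amount;
        Item(double amount) { this.amount = amount; }
    }

    // Marker object telling the consumer that no more items will arrive
    // (an assumption of this sketch, not taken from the demo).
    private static final Item END = new Item(0);

    static double sumViaQueue(List<Double> amounts, int queueCapacity) {
        BlockingQueue<Item> queue = new ArrayBlockingQueue<>(queueCapacity);

        // Producer: stands in for the DAO loop that walks the result set.
        Thread producer = new Thread(() -> {
            try {
                for (double amount : amounts) {
                    queue.put(new Item(amount)); // blocks while the queue is full
                }
                queue.put(END);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        producer.start();

        // Consumer: takes items off the queue as they are added.
        double total = 0;
        try {
            for (Item item = queue.take(); item != END; item = queue.take()) {
                total += item.amount;
            }
            producer.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while draining the queue", e);
        }
        return total;
    }

    public static void main(String[] args) {
        System.out.println(sumViaQueue(Arrays.asList(10.0, 20.5, 5.0), 2));
    }
}
```

Because the queue is bounded, a slow consumer throttles the producer, which keeps memory use flat regardless of how many rows the query returns.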

Method Contd.
1.  Use Java Collections & Transaction comparator to
track Top results
private Set<Transaction> orderedSet = new
BoundedTreeSet<Transaction>(10, new
TransactionAmountComparator());
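
`BoundedTreeSet` is not a JDK class, so the following is a sketch of how such a collection could work, under the assumption that it keeps only the first N elements in comparator order. The name `BoundedTreeSetSketch` is hypothetical and the demo's own implementation may differ.

```java
import java.util.Comparator;
import java.util.TreeSet;

// Hypothetical stand-in for the BoundedTreeSet on the slide:
// a TreeSet that keeps only the first `limit` elements in comparator order.
class BoundedTreeSetSketch<E> extends TreeSet<E> {
    private static final long serialVersionUID = 1L;
    private final int limit;

    BoundedTreeSetSketch(int limit, Comparator<? super E> comparator) {
        super(comparator);
        this.limit = limit;
    }

    @Override
    public boolean add(E element) {
        boolean added = super.add(element);
        if (size() > limit) {
            // Drop the element that sorts last, which may be the one just added.
            E evicted = pollLast();
            return added && evicted != element;
        }
        return added;
    }

    public static void main(String[] args) {
        // Largest amounts first, so the smallest amount is the one evicted.
        BoundedTreeSetSketch<Double> top =
                new BoundedTreeSetSketch<>(3, Comparator.<Double>reverseOrder());
        for (double amount : new double[] {5, 1, 9, 7, 3}) {
            top.add(amount);
        }
        System.out.println(top);
    }
}
```

One caveat of any `TreeSet`-based approach: elements that compare as equal are treated as duplicates, so a comparator on amount alone would collapse two transactions with the same amount unless it breaks ties on another field.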

Comments
—  Entirely possible, but probably not to be thought of as a
complete replacement for dedicated analytics solutions
—  Issues are token distribution across replicas and mixed write
and read patterns
—  Running analytics or MapReduce operations can be read
heavy (as well as memory and I/O intensive)
—  Transaction logging tends to be write heavy
—  Cassandra can handle it, but in practice it is better to split
workloads except for smaller cases, where latency doesn’t
matter or where the cluster is not generally under significant
load
—  Consider DSE Hadoop, Spark, Storm as alternatives

Scenario
—  In financial risk systems, positions have sensitivity to
certain variables
—  Positions are hierarchical: each is associated with a trader
at a desk, which is part of an asset type in a certain
location.
—  E.g. Frankfurt/FX/desk10/trader7/position23
—  Sensitivity values are inserted for each position. We
need to aggregate them for each level in the hierarchy
—  Sensitivities are recorded as deltas, so the sum of all
deltas over time is the current sensitivity.

Scenario Contd.
—  E.g. Aggregations for:
—  Frankfurt/FX/desk10/trader7
—  Frankfurt/FX/desk10
—  Frankfurt/FX
—  As new positions are entered the risk sensitivities will
change and will need to be aggregated for each level
for the new value to be available

Queries
select * from risk_sensitivities_hierarchy
where hier_path = 'Paris/FX';
select * from risk_sensitivities_hierarchy
where hier_path = 'Paris/FX/desk4' and
sub_hier_path = 'trader3';
select * from risk_sensitivities_hierarchy
where hier_path = 'Paris/FX/desk4' and
sub_hier_path = 'trader3' and
risk_sens_name = 'irDelta';

Retrieve & Run the Code
1.  git clone …analytics-example.git
2.  export MAVEN_OPTS=-Xmx512M (up the memory)
3.  mvn clean compile exec:java -Dexec.mainClass="com.datastax.bulkloader.Main"
4.  mvn clean compile exec:java -Dexec.mainClass="com.heb.finance.analytics.Main" -DstopSize=1000000

Schema
create table if not exists risk_sensitivities_hierarchy (
hier_path text,
sub_hier_path text,
risk_sens_name text,
value double,
PRIMARY KEY (hier_path, sub_hier_path,
risk_sens_name)
) WITH compaction={'class': 'LeveledCompactionStrategy'};
NB: Notice the use of LCS (LeveledCompactionStrategy), as we also want
the table to be efficient for reads

Model
public class RiskSensitivity {
public final String name;
public final String path;
public final String position;
public final BigDecimal value;
…
}

Method
—  Write a service that periodically writes new
sensitivities to Cassandra.
insert into risk_sensitivities_hierarchy
(hier_path, sub_hier_path, risk_sens_name,
value) VALUES (?, ?, ?, ?)

Method Contd.
—  In our aggregator do the following periodically
—  Select data for hierarchies we wish to aggregate
select * from risk_sensitivities_hierarchy where
hier_path = 'Frankfurt/FX/desk10/trader4';
—  Will get all positions related to this hierarchy
—  Add the values (represented as deltas) to each other to get
the new sensitivity
—  E.g. S1 = -3, S2 = 2, S3 = -1 gives a new sensitivity of -3 + 2 - 1 = -2
—  Write it back for ‘Frankfurt/FX/desk10/trader4’

Comments
—  Simple way to maintain an up-to-date risk sensitivity
on an ongoing basis, based on previous data
—  Will mean (N Hierarchies) * (N variables) queries
are executed periodically (keep an eye on this)
—  Cursors, blocking queue and bounded collections
help us achieve the same result without reading
entire rows
—  Has other applications such as roll ups for stream
data provided you have a reasonably low cardinality
in terms of number of (time resolution) * variables.

—  Thanks Patrick Callaghan for the hard work coding
the examples!
— Questions?
