Top five questions to ask when choosing a big data solution
1. Five factors to consider when choosing a big data solution!
Jonathan Ellis
CTO, DataStax
Project Chair, Apache Cassandra
2. How do I model my application?
©2012 DataStax
3. Popular options
• Key/value
• Tabular
• Document
• Graph?
4. Schema is your friend
{
"id": "e451dd42-ece3-11e1-a0a3-34159e154f4c",
"name": "jbellis",
"state": "TX",
"birthdate": "1/1/1976",
"email_addresses": ["jbellis@gmail", "jbellis@datastax.com"]
}
5. SQL can be your friend too
CREATE TABLE users (
id uuid PRIMARY KEY,
name text,
state text,
birth_date date
);
CREATE INDEX ON users(state);
SELECT * FROM users
WHERE state='Texas' AND birth_date > '1950-01-01';
6. Collections
CREATE TABLE users (
id uuid PRIMARY KEY,
name text,
state text,
birth_date date
);
CREATE TABLE users_addresses (
user_id uuid REFERENCES users,
email text
);
SELECT *
FROM users NATURAL JOIN users_addresses;
7. Collections
[Slide repeats the users / users_addresses schema and NATURAL JOIN query from the previous slide, crossed out with an X]
8. Collections
CREATE TABLE users (
id uuid PRIMARY KEY,
name text,
state text,
birth_date date,
email_addresses set<text>
);
UPDATE users
SET email_addresses = email_addresses + {'jbellis@gmail.com',
'jbellis@datastax.com'};
9. Joins don’t scale
• No joins
• No subqueries
• No aggregation functions* or GROUP BY
• ORDER BY?
10. SELECT * FROM tweets
WHERE user_id IN (SELECT follower FROM followers
WHERE user_id = 'driftx')
[Diagram: followers ? tweets]
11. Clustering in Cassandra
CREATE TABLE timeline (
user_id uuid,
tweet_id timeuuid,
tweet_author uuid,
tweet_body text,
PRIMARY KEY (user_id, tweet_id)
);
user_id tweet_id tweet_author tweet_body
jbellis 3290f9da.. rbranson lorem
jbellis 3895411a.. tjake ipsum
... ... ...
driftx 3290f9da.. rbranson lorem
driftx 71b46a84.. yzhang dolor
... ... ...
yukim 3290f9da.. rbranson lorem
yukim e451dd42.. tjake amet
... ... ...
12. Clustering in Cassandra
SELECT * FROM timeline
WHERE user_id = 'driftx';
user_id tweet_id tweet_author tweet_body
driftx 3290f9da.. rbranson lorem
driftx 71b46a84.. yzhang dolor
... ... ...
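The clustering slides can be read as replacing the follower subquery with a per-user timeline that is written ahead of time. The sketch below is a toy in-memory model of that idea, not Cassandra code: `Timeline`, `post_tweet`, and the `followers` dict are hypothetical names, integer tweet ids stand in for timeuuids, and populating the table by copying each tweet to every follower is an assumption about how the table gets filled.

```python
import bisect
from collections import defaultdict


class Timeline:
    """Toy model of a table with PRIMARY KEY (user_id, tweet_id)."""

    def __init__(self):
        # Partition key (user_id) -> rows kept sorted by clustering key (tweet_id).
        self.partitions = defaultdict(list)

    def insert(self, user_id, tweet_id, author, body):
        # Keep each partition ordered by tweet_id as rows arrive.
        bisect.insort(self.partitions[user_id], (tweet_id, author, body))

    def select(self, user_id):
        # In the spirit of: SELECT * FROM timeline WHERE user_id = ?
        # One partition lookup, no join and no subquery.
        return list(self.partitions.get(user_id, []))


def post_tweet(timeline, followers, author, tweet_id, body):
    """Copy the tweet into the partition of every follower of `author`."""
    for follower in followers.get(author, ()):
        timeline.insert(follower, tweet_id, author, body)


# Hypothetical follower graph: author -> users who follow that author.
followers = {"rbranson": ["jbellis", "driftx"], "yzhang": ["driftx"]}
tl = Timeline()
post_tweet(tl, followers, "yzhang", 2, "dolor")
post_tweet(tl, followers, "rbranson", 1, "lorem")
print(tl.select("driftx"))
```

The read touches a single partition whose rows are already in tweet order; the cost is moved to write time, where each tweet is stored once per follower.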
17. UPDATE users
SET email_addresses = email_addresses + {...}
WHERE user_id = 'jbellis';
19. C* storage engine very briefly
write( k1 , c1:v1 )
Memory: Memtable
Hard drive: Commit log
20. write( k1 , c1:v1 )
Memory: Memtable
k1 c1:v1
Hard drive: Commit log
k1 c1:v1
21. write( k1 , c2:v2 )
Memory: Memtable
k1 c1:v1 c2:v2
Hard drive: Commit log
k1 c1:v1
k1 c2:v2
22. write( k2 , c1:v1 c2:v2 )
Memory: Memtable
k1 c1:v1 c2:v2
k2 c1:v1 c2:v2
Hard drive: Commit log
k1 c1:v1
k1 c2:v2
k2 c1:v1 c2:v2
23. write( k1 , c1:v4 c3:v3 )
Memory: Memtable
k1 c1:v4 c2:v2 c3:v3
k2 c1:v1 c2:v2
Hard drive: Commit log
k1 c1:v1
k1 c2:v2
k2 c1:v1 c2:v2
k1 c1:v4 c3:v3
24. flush, index, cleanup
Hard drive: SSTable
k1 c1:v4 c2:v2 c3:v3
k2 c1:v1 c2:v2
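The write path walked through on the storage engine slides can be replayed with a small model. This is a minimal sketch under stated assumptions, not Cassandra's implementation: `ToyStorageEngine` is a hypothetical class, Python lists and dicts stand in for the on-disk commit log and SSTable files, and the index step is left out.

```python
class ToyStorageEngine:
    """Toy model of the commit log / memtable / SSTable write path."""

    def __init__(self):
        self.commit_log = []  # append-only list, stands in for the log on disk
        self.memtable = {}    # row key -> {column: value}, held in memory
        self.sstables = []    # flushed, immutable tables, stand in for files

    def write(self, key, columns):
        # Append the mutation to the commit log first, for durability.
        self.commit_log.append((key, dict(columns)))
        # Then merge it into the memtable; newer values replace older ones.
        self.memtable.setdefault(key, {}).update(columns)

    def flush(self):
        # Write the memtable out as one table sorted by row key,
        # then clean up the memtable and the commit log entries it covered.
        sstable = tuple((k, dict(self.memtable[k])) for k in sorted(self.memtable))
        self.sstables.append(sstable)
        self.memtable = {}
        self.commit_log = []
        return sstable


# Replay the four writes from the slides, then flush.
engine = ToyStorageEngine()
engine.write("k1", {"c1": "v1"})
engine.write("k1", {"c2": "v2"})
engine.write("k2", {"c1": "v1", "c2": "v2"})
engine.write("k1", {"c1": "v4", "c3": "v3"})
log_entries_before_flush = len(engine.commit_log)
flushed = engine.flush()
print(flushed)
```

Every write is an append plus an in-memory update, so nothing on disk is modified in place; the commit log holds one entry per write, while the flushed table holds one merged row per key.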
26. [Chart: reads/s and writes/s, Cassandra 0.6 vs. Cassandra 1.0; y-axis from 0 to 35000]
29. Availability
• “High availability implies that a single fault will not bring
down your system. Not ‘we’ll recover quickly.’”
-- Ben Coverston: DataStax
• “The biggest problem with failover is that you're almost
never using it until it really hurts. It's like backups that
you never test.”
-- Rick Branson: Instagram
34. Scaling antipatterns
• Metadata servers
• Router bottlenecks
• Overloading existing nodes when adding capacity
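One common way to avoid these antipatterns is consistent hashing, where every client computes a key's owner locally from a token ring. The slides do not show this, so treat the following as an illustrative sketch only: `Ring`, `token`, the MD5-based hash, the node names, and the 64 virtual nodes per machine are all assumptions chosen for the example.

```python
import bisect
import hashlib


def token(value):
    # Hash a string to an integer position on the ring.
    return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)


class Ring:
    """Toy consistent-hash ring with virtual nodes."""

    def __init__(self, vnodes=64):
        self.vnodes = vnodes
        self.tokens = []  # sorted ring positions
        self.owners = {}  # ring position -> node name

    def add_node(self, node):
        # Each node claims several positions, so a new node takes
        # small slices from many existing nodes instead of one big slice.
        for i in range(self.vnodes):
            t = token(f"{node}#{i}")
            self.owners[t] = node
            bisect.insort(self.tokens, t)

    def owner(self, key):
        # First ring position at or after the key's token, wrapping around.
        # Computed locally: no metadata server or router is consulted.
        i = bisect.bisect_left(self.tokens, token(key)) % len(self.tokens)
        return self.owners[self.tokens[i]]


ring = Ring()
for name in ("node1", "node2", "node3"):
    ring.add_node(name)

keys = [f"user{i}" for i in range(1000)]
before = {k: ring.owner(k) for k in keys}
ring.add_node("node4")
moved = [k for k in keys if ring.owner(k) != before[k]]
print(f"{len(moved)} of {len(keys)} keys moved")
```

Keys change owner only when the new node takes them over, so adding capacity does not reshuffle data among the existing nodes.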
38. Data model: Realtime
LiveStocks
stock last
GOOG $95.52
AAPL $186.10
AMZN $112.98
Portfolios
user stock shares
jbellis GOOG 80
jbellis LNKD 20
yukim AMZN 100
StockHist
stock date price
GOOG 2011-01-01 $8.23
GOOG 2011-01-02 $6.14
GOOG 2011-01-03 $7.78
39. Data model: Analytics
HistLoss
portfolio worst_date loss
Portfolio1 2011-07-23 -$34.81
Portfolio2 2011-03-11 -$11432.24
Portfolio3 2011-05-21 -$1476.93
40. Data model: Analytics
10dayreturns
stock rdate return
GOOG 2011-07-25 $8.23
GOOG 2011-07-24 $6.14
GOOG 2011-07-23 $7.78
AAPL 2011-07-25 $15.32
AAPL 2011-07-24 $12.68
INSERT OVERWRITE TABLE 10dayreturns
SELECT a.stock,
b.date as rdate,
b.price - a.price
FROM StockHist a
JOIN StockHist b
ON (a.stock = b.stock
AND date_add(a.date, 10) = b.date);
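The self-join in the 10dayreturns query pairs each price with the price of the same stock ten days later. A minimal Python sketch of the same logic follows; `ten_day_returns` is a hypothetical helper, and the sample prices are made up for illustration rather than taken from the slides.

```python
from datetime import date, timedelta


def ten_day_returns(stock_hist, days=10):
    """stock_hist: iterable of (stock, date, price) rows.

    Returns sorted (stock, rdate, return) rows, where return is the price
    on rdate minus the price `days` days earlier for the same stock.
    """
    prices = {(stock, d): price for stock, d, price in stock_hist}
    out = []
    for (stock, d), start_price in prices.items():
        rdate = d + timedelta(days=days)
        end_price = prices.get((stock, rdate))
        # Like an inner join: rows without a partner 10 days later are dropped.
        if end_price is not None:
            out.append((stock, rdate, end_price - start_price))
    return sorted(out)


# Made-up sample rows, integer prices to keep the arithmetic exact.
sample = [
    ("GOOG", date(2011, 1, 1), 100),
    ("GOOG", date(2011, 1, 2), 101),
    ("GOOG", date(2011, 1, 11), 108),
]
returns = ten_day_returns(sample)
print(returns)
```

Only the 2011-01-01 row has a partner exactly ten days later, so the sample produces a single output row dated 2011-01-11.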
41. Data model: Analytics
portfolio_returns
portfolio rdate preturn
Portfolio1 2011-07-25 $118.21
Portfolio1 2011-07-24 $60.78
Portfolio1 2011-07-23 -$34.81
Portfolio2 2011-07-25 $2143.92
Portfolio3 2011-07-24 -$10.19
INSERT OVERWRITE TABLE portfolio_returns
SELECT portfolio,
rdate,
SUM(b.return)
FROM portfolios a JOIN 10dayreturns b
ON (a.stock = b.stock)
GROUP BY portfolio, rdate;
42. Data model: Analytics
HistLoss
portfolio worst_date loss
Portfolio1 2011-07-23 -$34.81
Portfolio2 2011-03-11 -$11432.24
Portfolio3 2011-05-21 -$1476.93
INSERT OVERWRITE TABLE HistLoss
SELECT a.portfolio, rdate, minp
FROM (
SELECT portfolio, min(preturn) as minp
FROM portfolio_returns
GROUP BY portfolio
) a
JOIN portfolio_returns b
ON (a.portfolio = b.portfolio and a.minp = b.preturn);
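The HistLoss query finds each portfolio's minimum return and joins back to recover the date it occurred on. The sketch below does the same in one pass; `hist_loss` is a hypothetical helper, the input is the Portfolio1 rows shown on the portfolio_returns slide, and unlike the join it keeps only the first date when two dates tie for the minimum.

```python
def hist_loss(portfolio_returns):
    """portfolio_returns: iterable of (portfolio, rdate, preturn) rows.

    Returns one (portfolio, worst_date, loss) row per portfolio,
    sorted by portfolio name.
    """
    worst = {}
    for portfolio, rdate, preturn in portfolio_returns:
        # Keep the row with the smallest return seen so far.
        if portfolio not in worst or preturn < worst[portfolio][1]:
            worst[portfolio] = (rdate, preturn)
    return [(p, d, r) for p, (d, r) in sorted(worst.items())]


# Portfolio1 rows from the portfolio_returns slide.
rows = [
    ("Portfolio1", "2011-07-25", 118.21),
    ("Portfolio1", "2011-07-24", 60.78),
    ("Portfolio1", "2011-07-23", -34.81),
]
losses = hist_loss(rows)
print(losses)
```

For these rows the result matches the Portfolio1 line of the HistLoss table: worst date 2011-07-23 with a loss of -$34.81.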
45. Questions?