13. In-Memory Databases...
• Use memory instead of disk
• Do not (need to) save data on disk
• Put the whole dataset in memory
Well, sometimes...
(c) Ankur Goyal
19. In-Memory Databases
• Are durable to disk (and respect ACID)
• Can spill on disk or pin data in-memory (and take advantage of it)
• Tradeoffs are suited to systems with lots of memory
• Tend to be distributed systems
• Have a different set of bottlenecks
26. Why?
• Memory is getting cheaper (about 40% every year)
• Cache is the new RAM (RAM is the new disk, disk is the new tape, etc.)
• In-memory databases leverage SSD (no random writes)
• NVRAM is coming (and could be cheaper than SSD)
In-memory databases are tuned to modern hardware and modern workloads
32. In-Memory Storage Motivation
• Insanely fast random reads & writes
• Atomic writes as granular as a byte
• Working space is precious (RAM)
• Very different for rowstores and columnstores
36. In-Memory Rowstore
• Rowstores have lots of random reads/writes
• Datasets are usually small (< 10 TB)
Solution: keep the whole dataset in memory
• Use memory-optimized data structures (skip list)
39. What is a Skip List
• Invented in 1989 by William Pugh
• Expected O(log(n)) lookup, insert, delete
• No pages
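The structure behind these bullets can be sketched in a few lines of Python (an illustrative toy, not MemSQL's implementation; `MAX_LEVEL` and `P` are conventional choices): each key lives in a tower of forward pointers, and a search descends from the sparsest level, giving expected O(log(n)) operations with no page structure at all.

```python
import random

class Node:
    __slots__ = ("key", "forward")
    def __init__(self, key, level):
        self.key = key
        self.forward = [None] * level   # forward[i] = next node at level i

class SkipList:
    MAX_LEVEL = 16
    P = 0.5   # chance a node's tower grows one level taller

    def __init__(self):
        self.head = Node(None, self.MAX_LEVEL)
        self.level = 1

    def _random_level(self):
        lvl = 1
        while lvl < self.MAX_LEVEL and random.random() < self.P:
            lvl += 1
        return lvl

    def search(self, key):
        node = self.head
        for i in range(self.level - 1, -1, -1):   # descend from sparsest level
            while node.forward[i] is not None and node.forward[i].key < key:
                node = node.forward[i]
        node = node.forward[0]
        return node is not None and node.key == key

    def insert(self, key):
        update = [self.head] * self.MAX_LEVEL     # rightmost node seen per level
        node = self.head
        for i in range(self.level - 1, -1, -1):
            while node.forward[i] is not None and node.forward[i].key < key:
                node = node.forward[i]
            update[i] = node
        lvl = self._random_level()
        self.level = max(self.level, lvl)
        new = Node(key, lvl)
        for i in range(lvl):                      # splice into each level's list
            new.forward[i] = update[i].forward[i]
            update[i].forward[i] = new
```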
55. Concurrency Control
• No pages => No latches
• Skip list in MemSQL is lock-free
• Every node is a lock-free linked list
• Row locks are implemented with futexes (4 bytes)
• Read-committed and snapshot isolation
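"Lock-free" here means threads make progress through atomic compare-and-swap retry loops rather than blocking. A minimal Python sketch of that pattern (a Treiber stack standing in for the per-node lock-free lists; `AtomicRef` simulates hardware CAS with a lock, since pure Python has no CAS primitive):

```python
import threading

class AtomicRef:
    """Tiny CAS cell. The lock only stands in for the atomicity that
    a hardware compare-and-swap instruction provides."""
    def __init__(self, value=None):
        self._value = value
        self._lock = threading.Lock()
    def get(self):
        return self._value
    def compare_and_swap(self, expected, new):
        with self._lock:
            if self._value is expected:
                self._value = new
                return True
            return False

class LockFreeStack:
    """Treiber-style stack: read the head, build the new state,
    CAS it in; on a lost race, simply retry."""
    def __init__(self):
        self.head = AtomicRef(None)
    def push(self, value):
        while True:
            old = self.head.get()
            node = (value, old)              # node = (payload, next)
            if self.head.compare_and_swap(old, node):
                return
    def pop(self):
        while True:
            old = self.head.get()
            if old is None:
                return None
            value, rest = old
            if self.head.compare_and_swap(old, rest):
                return value
```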
69. Columnstore LSM
• Log-Structured Merge of sorted runs
• Tunable tradeoffs for read/write amplification
• Enables fast writes to a sorted columnstore
• Smallest sorted run is a skip list
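A toy of the idea in Python (illustrative only; the threshold and the merge policy are invented for the sketch): new keys land in a small in-memory sorted run, full runs are spilled, and runs are periodically merged, which is exactly where the read/write-amplification knob lives.

```python
import bisect
import heapq

def merge_runs(runs):
    """Merge sorted runs into one larger sorted run."""
    return list(heapq.merge(*runs))

class TinyLSM:
    """Toy log-structured merge. The smallest run is an in-memory
    sorted structure (a skip list in the talk; a plain list here)."""
    def __init__(self, memlimit=4):
        self.memrun = []   # smallest sorted run, in memory
        self.runs = []     # spilled sorted runs
        self.memlimit = memlimit

    def insert(self, key):
        bisect.insort(self.memrun, key)
        if len(self.memrun) >= self.memlimit:
            self.runs.append(self.memrun)   # spill the full run
            self.memrun = []
            if len(self.runs) > 2:          # merge policy: cap the run count
                self.runs = [merge_runs(self.runs)]

    def scan(self):
        """A sorted read merges every run, including the in-memory one."""
        return merge_runs(self.runs + [self.memrun])
```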
75. Durability in an In-Memory System?
• Memory is not a reliable medium (yet)
• There is always a hierarchy
• E.g. EBS -> S3 -> Glacier
• To operate at in-memory speed, all disk I/O must be sequential
80. Durability in the Rowstore
• Indexes are not materialized on disk
• Reconstruct indexes on the fly during recovery
• Only need to log PK data
• Take full database snapshots periodically
• Tunable to be sync/async
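A sketch of the recovery side of this scheme in Python (the `(op, pk, row)` log format and the `city` index are invented for illustration): only primary-key data is in the log, and every secondary index is rebuilt in memory during replay rather than being materialized on disk.

```python
def recover_rowstore(log_records):
    """Replay the PK-only log into the table, then rebuild secondary
    indexes on the fly; they were never written to disk."""
    table = {}   # pk -> row
    for op, pk, row in log_records:
        if op == "upsert":
            table[pk] = row
        elif op == "delete":
            table.pop(pk, None)
    # Rebuild an (invented) secondary index from scratch.
    index_by_city = {}
    for pk, row in table.items():
        index_by_city.setdefault(row["city"], set()).add(pk)
    return table, index_by_city
```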
86. Durability in the Columnstore
• Metadata uses ordinary rowstore mechanism
• Segments are huge (several KB or even MB)
• Read/written sequentially
• Columnstore segments synchronously written to disk
• Memory-speed writes go to sidecar rowstore
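A Python sketch of that write path (sizes and names invented; real segments are far larger than a few rows): inserts land in a rowstore sidecar at memory speed, a full sidecar is sorted and written once as an immutable segment in a single sequential write, and scans merge segments with the sidecar.

```python
class ColumnstoreTable:
    """Toy columnstore write path with a rowstore sidecar."""
    def __init__(self, segment_size=4):
        self.segments = []   # immutable sorted segments, written once
        self.sidecar = []    # memory-speed write buffer
        self.segment_size = segment_size

    def insert(self, row):
        self.sidecar.append(row)
        if len(self.sidecar) >= self.segment_size:
            # One sequential "disk" write of a sorted, immutable segment.
            self.segments.append(tuple(sorted(self.sidecar)))
            self.sidecar = []

    def scan(self):
        """Readers see flushed segments plus the not-yet-flushed sidecar."""
        rows = [r for seg in self.segments for r in seg]
        return rows + self.sidecar
```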
90. Crash Recovery
• Replay latest snapshot, and then every log file since
• No partially written state on disk, so no undos
• Columnstore just replays metadata
• Replication == Continuous replay over the network
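The replay loop can be sketched in Python (the snapshot/log formats are hypothetical): start from the latest snapshot, apply each later log file in order, and because nothing on disk is ever partially written there is no undo pass. Replication is this same loop fed from the network instead of local files.

```python
def replay(snapshot, log_files):
    """Recovery = latest snapshot + every log file since, in order.
    Disk state is never partially written, so there is no undo pass."""
    db = dict(snapshot)
    for log in log_files:
        for op, key, value in log:
            if op == "put":
                db[key] = value
            elif op == "del":
                db.pop(key, None)
    return db
```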
92.
import time

class Row(object):
    def __init__(self, a):
        self.a = a

t = [Row(x) for x in range(1000000)]

class State(object):
    def __init__(self):
        self.agg_sum = 0

def loop(state, row):
    state.agg_sum += row.a + 1

def query():
    state = State()
    for r in t:
        loop(state, r)
    return state

if __name__ == '__main__':
    start = time.time()
    state = query()
    end = time.time()
    print("Answer: %d, Time (s): %g" % (state.agg_sum, (end - start)))
93.
#include <cstdint>
#include <cstdio>
#include <ctime>
#include <vector>

struct Row
{
    Row(int a_arg) : a(a_arg) { }
    int a;
};

struct State
{
    State() : agg_sum(0) { }
    int64_t agg_sum;
};

inline void loop(State& state, const Row& row)
{
    state.agg_sum += row.a + 1;
}

inline State query(std::vector<Row>& rows)
{
    State s;
    for (Row& r : rows)
    {
        loop(s, r);
    }
    return s;
}

int main(void)
{
    std::vector<Row> rows;
    for (int i = 0; i < 1000000; i++)
    {
        rows.emplace_back(i);
    }

    clock_t start = clock();
    State state = query(rows);
    clock_t end = clock();
    printf("Answer: %lld, Time (s): %g\n",
           (long long)state.agg_sum, (end - start) * 1.0 / CLOCKS_PER_SEC);
}
96. Comparison
$ python test.py
Answer: 500000500000, Time (s): 0.251049
$ time g++ test.cpp -o test-cpp -std=c++0x
real 0m0.176s
user 0m0.150s
sys 0m0.023s
$ ./test-cpp
Answer: 500000500000, Time (s): 0.006745
37x difference in execution
1.37x even with compilation time
100. Code Generation
• Expression execution
• Inline scans
• Need a powerful plan cache
• OLTP vs. data exploration
101. Plancache Example (1)
SELECT * FROM users WHERE id = 5
SELECT * FROM users WHERE id = 8
=>
SELECT * FROM users WHERE id = @
102. Plancache Example (2)
SELECT * FROM users WHERE id IN (1,2,3,4,5) OR a IN (3,5,7)
SELECT * FROM users WHERE id IN (20) OR a IN (1,2,3,4)
=>
SELECT * FROM users WHERE id IN (@) OR a IN (@)
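A crude Python approximation of this normalization (real parameterization works on the parse tree, not on regexes): numeric literals become `@` and IN-lists of any length collapse, so both query shapes above map to the same cached plan.

```python
import re

def parameterize(sql):
    """Collapse literals so textually different queries share one
    compiled plan. (Regex stand-in for parse-tree normalization.)"""
    # Any IN-list, regardless of length, becomes IN (@).
    sql = re.sub(r"IN\s*\([^)]*\)", "IN (@)", sql, flags=re.IGNORECASE)
    # Remaining numeric literals become @.
    sql = re.sub(r"\b\d+\b", "@", sql)
    return sql
```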
104. Drill Down Example
SELECT region, SUM(price) FROM sales GROUP BY region

=> SELECT rep, SUM(price) FROM sales
   WHERE region="northeast" GROUP BY rep;
   => SELECT rep, SUM(price) FROM sales
      WHERE region=^ GROUP BY rep;

=> SELECT product, SUM(price) FROM sales
   WHERE region="northwest" GROUP BY product;
   => SELECT product, SUM(price) FROM sales
      WHERE region=^ GROUP BY product;

No plancache match!
111. Code Generation is Hard
• Old compilers adage: Pick 2 of 3
• Fast execution time
• Fast compile time
• Fast development time
• E.g. Assembly, C++, Python
• JIT compilers turned this on its head
128. Abstractions
• Distributed Query Plan created on aggregator
• Layers of primitive operations glued together
• Full SQL on leaves
• REMOTE tables
• RESULT tables
132. Primitives (SQL)
• Queries over physical indexes
• Hook into global transactional state
• Full SQL on a single partition
• Access to rowstores and columnstores
133. Primitives (SQL)
Example query the aggregator can send to the leaf:
SELECT
t.a, t.b, SUM(t.price)
FROM
t -- This will scan a physical table on the leaf
WHERE
t.c = 1000 -- This will use a local index
GROUP BY
t.a, t.b -- This will produce 1 row per group
136. Primitives (Remote Tables)
• Address data across leaves
• SQL interface + custom shard key
• Parallel execution primitives
• Reshuffling
• Merging on group keys
• Merging data from joins (e.g. left joins)
137. Primitives (Remote Tables)
SELECT
t.a, SUM(s_net.c)
FROM
-- The row in s where s_net.b = t.a may not
-- be on the same node as the local t. REMOTE(s)
-- addresses the table across the cluster.
t, REMOTE(s) AS s_net
WHERE
t.a = s_net.b
GROUP BY
t.a
138. Primitives (Remote Tables)
SELECT
t.a, SUM(s_net.c)
FROM
-- This is a reshuffle operation. It relies on t
-- being sharded on (t.a) and type(t.a) == type(s.b).
-- It will only pull rows in s.b that match the
-- shard key's local values of (t.a).
t, REMOTE(s) WITH (shard_key=(s.b)) AS s_net
WHERE
t.a = s_net.b
GROUP BY
t.a
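The reshuffle itself can be pictured in Python (names invented; a real system hashes the shard key, here plain `key % n` keeps the sketch deterministic): every row of s is routed to the node that owns its shard-key value, after which the join on t.a = s_net.b is purely node-local.

```python
def reshuffle(rows, key_fn, n_nodes):
    """Route each row to the node owning its shard-key value, so a
    join on that key becomes local to every node."""
    nodes = [[] for _ in range(n_nodes)]
    for row in rows:
        nodes[key_fn(row) % n_nodes].append(row)
    return nodes
```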
143. Primitives (Result Tables)
• Shared, cached results of SQL queries
• Shares scans/computations across readers
• Supports streaming semantics
• Technically an optimization
• Similar to an RDD in Spark
144. Primitives (Result Tables)
CREATE RESULT TABLE
t_reshuffled AS
SELECT
t.a, t.b, SUM(t.price)
FROM
t
GROUP BY
t.a, t.b
SHARD BY
t.a, t.b
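One way to picture a result table in Python (a memoized computation, loosely like the RDD comparison above; all names are invented): the first reader pays for the underlying scan, and every later reader shares the cached rows.

```python
class ResultTable:
    """Shared, cached result of a query."""
    def __init__(self, compute):
        self._compute = compute   # thunk that runs the underlying query
        self._rows = None
    def read(self):
        if self._rows is None:          # first reader pays for the scan
            self._rows = self._compute()
        return self._rows               # later readers share the result
```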
154. Horizontals and Verticals
• Real-time data processing is everywhere
• Top use-cases:
Real-Time Analytics and Large-Scale Applications
• Top verticals:
Financial Services, Webscale, Telco, Federal, Media
157. Real-time Analytics
• High volumes of data, processed in real-time
• Fast updates in the rowstore
• INSERT ... ON DUPLICATE KEY UPDATE
• E.g. 2M update transactions/sec on 10 nodes
• Fast appends, even one row at a time, in the columnstore
• E.g. 1 GB/s on 16 EC2 nodes
162. Real-time Analytics
• Converging with mainline analytics
• No compromises, e.g. limited SQL, limited windows
• Real-time means fast reads as well
• Subsecond queries for dashboards
• Millisecond queries for applications
167. Large-Scale Applications
• Large-scale operational analytics and applications
• Hundreds of nodes for perf and HA
• True "production" workloads
• Existing OLTP databases lack scalability and SQL perf
• Existing OLAP databases lack operational features
173. Take-Aways
• In-memory Database != All-memory Database
• In-memory Databases are databases built to modern tradeoffs
• Old problems with new solutions
• Real-time analytics and Large-scale applications == New projects
• We are hiring and ❤ Waterloo.
• Come visit us in SF: email ankur@memsql.com