13. In-Memory Databases...
• Use memory instead of disk
• Do not (need to) save data on disk
• Put the whole dataset in memory
Well, sometimes...
(c) Ankur Goyal
19. In-Memory Databases
• Are durable to disk (and respect ACID)
• Can spill on disk or pin data in-memory (and take advantage of it)
• Tradeoffs are suited to systems with lots of memory
• Tend to be distributed systems
• Have a different set of bottlenecks
26. Why?
• Memory is getting cheaper (about 40% every year)
• Cache is the new RAM (RAM is the new disk, disk is the new tape, etc.)
• In-memory databases leverage SSD (no random writes)
• NVRAM is coming (and could be cheaper than SSD)
In-memory databases are tuned to modern hardware and modern workloads
32. In-Memory Storage Motivation
• Insanely fast random reads & writes
• Atomic writes as granular as a byte
• Working space is precious (RAM)
• Very different for rowstores and columnstores
36. In-Memory Rowstore
• Rowstores have lots of random reads/writes
• Datasets are usually small (< 10 TB)
Solution: keep the whole dataset in memory
• Use memory-optimized data structures (skip list)
39. What is a Skip List
• Invented in 1989 by William Pugh
• Expected O(log(n)) lookup, insert, delete
• No pages
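The structure behind these bullets can be sketched in a few lines of Python (an illustrative toy, not MemSQL's implementation; `MAX_LEVEL` and `P` are conventional choices): each key lives in a tower of forward pointers, and a search descends from the sparsest level, giving expected O(log(n)) operations with no page structure at all.

```python
import random

class Node:
    __slots__ = ("key", "forward")
    def __init__(self, key, level):
        self.key = key
        self.forward = [None] * level   # forward[i] = next node at level i

class SkipList:
    MAX_LEVEL = 16
    P = 0.5   # chance a node's tower grows one level taller

    def __init__(self):
        self.head = Node(None, self.MAX_LEVEL)
        self.level = 1

    def _random_level(self):
        lvl = 1
        while lvl < self.MAX_LEVEL and random.random() < self.P:
            lvl += 1
        return lvl

    def search(self, key):
        node = self.head
        for i in range(self.level - 1, -1, -1):   # descend from sparsest level
            while node.forward[i] is not None and node.forward[i].key < key:
                node = node.forward[i]
        node = node.forward[0]
        return node is not None and node.key == key

    def insert(self, key):
        update = [self.head] * self.MAX_LEVEL     # rightmost node seen per level
        node = self.head
        for i in range(self.level - 1, -1, -1):
            while node.forward[i] is not None and node.forward[i].key < key:
                node = node.forward[i]
            update[i] = node
        lvl = self._random_level()
        self.level = max(self.level, lvl)
        new = Node(key, lvl)
        for i in range(lvl):                      # splice into each level's list
            new.forward[i] = update[i].forward[i]
            update[i].forward[i] = new
```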
55. Concurrency Control
• No pages => No latches
• Skip list in MemSQL is lock-free
• Every node is a lock-free linked list
• Row locks are implemented with futexes (4 bytes)
• Read-committed and snapshot isolation
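"Lock-free" here means threads make progress through atomic compare-and-swap retry loops rather than blocking. A minimal Python sketch of that pattern (a Treiber stack standing in for the per-node lock-free lists; `AtomicRef` simulates hardware CAS with a lock, since pure Python has no CAS primitive):

```python
import threading

class AtomicRef:
    """Tiny CAS cell. The lock only stands in for the atomicity that
    a hardware compare-and-swap instruction provides."""
    def __init__(self, value=None):
        self._value = value
        self._lock = threading.Lock()
    def get(self):
        return self._value
    def compare_and_swap(self, expected, new):
        with self._lock:
            if self._value is expected:
                self._value = new
                return True
            return False

class LockFreeStack:
    """Treiber-style stack: read the head, build the new state,
    CAS it in; on a lost race, simply retry."""
    def __init__(self):
        self.head = AtomicRef(None)
    def push(self, value):
        while True:
            old = self.head.get()
            node = (value, old)              # node = (payload, next)
            if self.head.compare_and_swap(old, node):
                return
    def pop(self):
        while True:
            old = self.head.get()
            if old is None:
                return None
            value, rest = old
            if self.head.compare_and_swap(old, rest):
                return value
```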
69. Columnstore LSM
• Log-Structured Merge of sorted runs
• Tunable tradeoffs for read/write amplification
• Enables fast writes to a sorted columnstore
• Smallest sorted run is a skip list
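A toy of the idea in Python (illustrative only; the threshold and the merge policy are invented for the sketch): new keys land in a small in-memory sorted run, full runs are spilled, and runs are periodically merged, which is exactly where the read/write-amplification knob lives.

```python
import bisect
import heapq

def merge_runs(runs):
    """Merge sorted runs into one larger sorted run."""
    return list(heapq.merge(*runs))

class TinyLSM:
    """Toy log-structured merge. The smallest run is an in-memory
    sorted structure (a skip list in the talk; a plain list here)."""
    def __init__(self, memlimit=4):
        self.memrun = []   # smallest sorted run, in memory
        self.runs = []     # spilled sorted runs
        self.memlimit = memlimit

    def insert(self, key):
        bisect.insort(self.memrun, key)
        if len(self.memrun) >= self.memlimit:
            self.runs.append(self.memrun)   # spill the full run
            self.memrun = []
            if len(self.runs) > 2:          # merge policy: cap the run count
                self.runs = [merge_runs(self.runs)]

    def scan(self):
        """A sorted read merges every run, including the in-memory one."""
        return merge_runs(self.runs + [self.memrun])
```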
75. Durability in an In-Memory System?
• Memory is not a reliable medium (yet)
• There is always a hierarchy
• E.g. EBS -> S3 -> Glacier
• To operate at in-memory speed, all disk I/O must be sequential
80. Durability in the Rowstore
• Indexes are not materialized on disk
• Reconstruct indexes on the fly during recovery
• Only need to log PK data
• Take full database snapshots periodically
• Tunable to be sync/async
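A sketch of the recovery side of this scheme in Python (the `(op, pk, row)` log format and the `city` index are invented for illustration): only primary-key data is in the log, and every secondary index is rebuilt in memory during replay rather than being materialized on disk.

```python
def recover_rowstore(log_records):
    """Replay the PK-only log into the table, then rebuild secondary
    indexes on the fly; they were never written to disk."""
    table = {}   # pk -> row
    for op, pk, row in log_records:
        if op == "upsert":
            table[pk] = row
        elif op == "delete":
            table.pop(pk, None)
    # Rebuild an (invented) secondary index from scratch.
    index_by_city = {}
    for pk, row in table.items():
        index_by_city.setdefault(row["city"], set()).add(pk)
    return table, index_by_city
```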
86. Durability in the Columnstore
• Metadata uses ordinary rowstore mechanism
• Segments are huge (several KB or even MB)
• Read/written sequentially
• Columnstore segments synchronously written to disk
• Memory-speed writes go to sidecar rowstore
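A Python sketch of that write path (sizes and names invented; real segments are far larger than a few rows): inserts land in a rowstore sidecar at memory speed, a full sidecar is sorted and written once as an immutable segment in a single sequential write, and scans merge segments with the sidecar.

```python
class ColumnstoreTable:
    """Toy columnstore write path with a rowstore sidecar."""
    def __init__(self, segment_size=4):
        self.segments = []   # immutable sorted segments, written once
        self.sidecar = []    # memory-speed write buffer
        self.segment_size = segment_size

    def insert(self, row):
        self.sidecar.append(row)
        if len(self.sidecar) >= self.segment_size:
            # One sequential "disk" write of a sorted, immutable segment.
            self.segments.append(tuple(sorted(self.sidecar)))
            self.sidecar = []

    def scan(self):
        """Readers see flushed segments plus the not-yet-flushed sidecar."""
        rows = [r for seg in self.segments for r in seg]
        return rows + self.sidecar
```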
90. Crash Recovery
• Replay latest snapshot, and then every log file since
• No partially written state on disk, so no undos
• Columnstore just replays metadata
• Replication == Continuous replay over the network
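The replay loop can be sketched in Python (the snapshot/log formats are hypothetical): start from the latest snapshot, apply each later log file in order, and because nothing on disk is ever partially written there is no undo pass. Replication is this same loop fed from the network instead of local files.

```python
def replay(snapshot, log_files):
    """Recovery = latest snapshot + every log file since, in order.
    Disk state is never partially written, so there is no undo pass."""
    db = dict(snapshot)
    for log in log_files:
        for op, key, value in log:
            if op == "put":
                db[key] = value
            elif op == "del":
                db.pop(key, None)
    return db
```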
92.
import time

class Row(object):
    def __init__(self, a):
        self.a = a

t = [Row(x) for x in range(1000000)]

class State(object):
    def __init__(self):
        self.agg_sum = 0

def loop(state, row):
    state.agg_sum += row.a + 1

def query():
    state = State()
    for r in t:
        loop(state, r)
    return state

if __name__ == '__main__':
    start = time.time()
    state = query()
    end = time.time()
    print("Answer: %d, Time (s): %g" % (state.agg_sum, (end - start)))
93.
#include <cstdint>
#include <cstdio>
#include <ctime>
#include <vector>

struct Row
{
    Row(int a_arg) : a(a_arg) { }
    int a;
};

struct State
{
    State() : agg_sum(0) { }
    int64_t agg_sum;
};

inline void loop(State& state, const Row& row)
{
    state.agg_sum += row.a + 1;
}

inline State query(std::vector<Row>& rows)
{
    State s;
    for (Row& r : rows)
    {
        loop(s, r);
    }
    return s;
}

int main(void)
{
    std::vector<Row> rows;
    for (int i = 0; i < 1000000; i++)
    {
        rows.emplace_back(i);
    }

    clock_t start = clock();
    State state = query(rows);
    clock_t end = clock();
    printf("Answer: %lld, Time (s): %g\n",
           (long long)state.agg_sum, (end - start) * 1.0 / CLOCKS_PER_SEC);
}
96. Comparison
$ python test.py
Answer: 500000500000, Time (s): 0.251049
$ time g++ test.cpp -o test-cpp -std=c++0x
real 0m0.176s
user 0m0.150s
sys 0m0.023s
$ ./test-cpp
Answer: 500000500000, Time (s): 0.006745
37x difference in execution
1.37x even with compilation time
100. Code Generation
• Expression execution
• Inline scans
• Need a powerful plan cache
• OLTP vs. data exploration
101. Plancache Example (1)
SELECT * FROM users WHERE id = 5
SELECT * FROM users WHERE id = 8
=>
SELECT * FROM users WHERE id = @
102. Plancache Example (2)
SELECT * FROM users WHERE id IN (1,2,3,4,5) OR a IN (3,5,7)
SELECT * FROM users WHERE id IN (20) OR a IN (1,2,3,4)
=>
SELECT * FROM users WHERE id IN (@) OR a IN (@)
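A crude Python approximation of this normalization (real parameterization works on the parse tree, not on regexes): numeric literals become `@` and IN-lists of any length collapse, so both query shapes above map to the same cached plan.

```python
import re

def parameterize(sql):
    """Collapse literals so textually different queries share one
    compiled plan. (Regex stand-in for parse-tree normalization.)"""
    # Any IN-list, regardless of length, becomes IN (@).
    sql = re.sub(r"IN\s*\([^)]*\)", "IN (@)", sql, flags=re.IGNORECASE)
    # Remaining numeric literals become @.
    sql = re.sub(r"\b\d+\b", "@", sql)
    return sql
```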
104. Drill Down Example
SELECT region, SUM(price) FROM sales GROUP BY region

=> SELECT rep, SUM(price) FROM sales
   WHERE region="northeast" GROUP BY rep;
   => SELECT rep, SUM(price) FROM sales
      WHERE region=^ GROUP BY rep;

=> SELECT product, SUM(price) FROM sales
   WHERE region="northwest" GROUP BY product;
   => SELECT product, SUM(price) FROM sales
      WHERE region=^ GROUP BY product;

No plancache match!
111. Code Generation is Hard
• Old compilers adage: Pick 2 of 3
• Fast execution time
• Fast compile time
• Fast development time
• E.g. Assembly, C++, Python
• JIT compilers turned this on its head
128. Abstractions
• Distributed Query Plan created on aggregator
• Layers of primitive operations glued together
• Full SQL on leaves
• REMOTE tables
• RESULT tables
132. Primitives (SQL)
• Queries over physical indexes
• Hook into global transactional state
• Full SQL on a single partition
• Access to rowstores and columnstores
133. Primitives (SQL)
Example query the aggregator can send to the leaf:
SELECT
t.a, t.b, SUM(t.price)
FROM
t -- This will scan a physical table on the leaf
WHERE
t.c = 1000 -- This will use a local index
GROUP BY
t.a, t.b -- This will produce 1 row per group
136. Primitives (Remote Tables)
• Address data across leaves
• SQL interface + custom shard key
• Parallel execution primitives
• Reshuffling
• Merging on group keys
• Merging data from joins (e.g. left joins)
137. Primitives (Remote Tables)
SELECT
t.a, SUM(s_net.c)
FROM
-- The row in s where s_net.b = t.a may not
-- be on the same node as the local t. REMOTE(s)
-- addresses the table across the cluster.
t, REMOTE(s) AS s_net
WHERE
t.a = s_net.b
GROUP BY
t.a
138. Primitives (Remote Tables)
SELECT
t.a, SUM(s_net.c)
FROM
-- This is a reshuffle operation. It relies on t
-- being sharded on (t.a) and type(t.a) == type(s.b).
-- It will only pull rows in s.b that match the
-- shard key's local values of (t.a).
t, REMOTE(s) WITH (shard_key=(s.b)) AS s_net
WHERE
t.a = s_net.b
GROUP BY
t.a
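The reshuffle itself can be pictured in Python (names invented; a real system hashes the shard key, here plain `key % n` keeps the sketch deterministic): every row of s is routed to the node that owns its shard-key value, after which the join on t.a = s_net.b is purely node-local.

```python
def reshuffle(rows, key_fn, n_nodes):
    """Route each row to the node owning its shard-key value, so a
    join on that key becomes local to every node."""
    nodes = [[] for _ in range(n_nodes)]
    for row in rows:
        nodes[key_fn(row) % n_nodes].append(row)
    return nodes
```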
143. Primitives (Result Tables)
• Shared, cached results of SQL queries
• Shares scans/computations across readers
• Supports streaming semantics
• Technically an optimization
• Similar to an RDD in Spark
144. Primitives (Result Tables)
CREATE RESULT TABLE
t_reshuffled AS
SELECT
t.a, t.b, SUM(t.price)
FROM
t
GROUP BY
t.a, t.b
SHARD BY
t.a, t.b
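One way to picture a result table in Python (a memoized computation, loosely like the RDD comparison above; all names are invented): the first reader pays for the underlying scan, and every later reader shares the cached rows.

```python
class ResultTable:
    """Shared, cached result of a query."""
    def __init__(self, compute):
        self._compute = compute   # thunk that runs the underlying query
        self._rows = None
    def read(self):
        if self._rows is None:          # first reader pays for the scan
            self._rows = self._compute()
        return self._rows               # later readers share the result
```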
154. Horizontals and Verticals
• Real-time data processing is everywhere
• Top use-cases:
Real-Time Analytics and Large-Scale Applications
• Top verticals:
Financial Services, Webscale, Telco, Federal, Media
157. Real-time Analytics
• High volumes of data, processed in real-time
• Fast updates in the rowstore
• INSERT ... ON DUPLICATE KEY UPDATE
• E.g. 2M update transactions/sec on 10 nodes
• Fast appends, even one row at a time, in the columnstore
• E.g. 1 GB/s on 16 EC2 nodes
162. Real-time Analytics
• Converging with mainline analytics
• No compromises, e.g. limited SQL, limited windows
• Real-time means fast reads as well
• Subsecond queries for dashboards
• Millisecond queries for applications
167. Large-Scale Applications
• Large-scale operational analytics and applications
• Hundreds of nodes for perf and HA
• True "production" workloads
• Existing OLTP databases lack scalability and SQL perf
• Existing OLAP databases lack operational features
173. Take-Aways
• In-memory Database != All-memory Database
• In-memory Databases are databases built to modern tradeoffs
• Old problems with new solutions
• Real-time analytics and Large-scale applications == New projects
• We are hiring and ❤ Waterloo.
• Come visit us in SF: email ankur@memsql.com