3. Introduction
Dremel is an query system
For analysis of read-only nested data
Use case
Interactive Trends
Tools Detection
Web Spam Network
Dashboards Optimization
3
4. DocId: 10
Links
Features
Forward: 20
Name
Language
Code: 'en-us'
Country: 'us'
Multi-level structure Url: 'http://A'
Name
SQL-like query language
Url: 'http://B'
Fast: Trillion-row tables in second level
Big scale: thousands of CPUs and
petabytes of data
Widely used by google: in production
since 2006
SELECT A, COUNT(B) FROM T
GROUP BY A
T = {/gfs/1, /gfs/2, …, /gfs/100000}
4
5. Contribution
This paper presented two major
technique in Dremel:
Nested columnar storage
◦ store / split into columns
◦ Assembly to record
Query
◦ Language
◦ Execution
5
7. Why nested?
Flexible
◦ Data in web and scientific is often non-
relational
Reduce cost
◦ Normalizing and recombining nested data
is often prohibited in TB, PB scale of data.
7
8. Why column?
DocId: 10
Links
column-oriented
Forward: 20
Name A
* *
...
Language
Code: 'en-us'
Country: 'us' B E
Url: 'http://A' *
C D
r1
Name
Url: 'http://B'
r1
r1 r1
r2 r2
r2 Read less, r2
cheaper
Record-oriented decompression
Challenge: preserve structure, reconstruct from a subset of fields
8
9. DocId: 10
Nested data model
Links
Forward: 20 message Document {
Forward: 40
Forward: 60 required int64 DocId; [1,1]
Name optional group Links {
Language
Code: 'en-us'
repeated int64 Backward; [0,*]
Country: 'us' repeated int64 Forward;
Language }
Code: 'en'
Url: 'http://A' repeated group Name {
Name repeated group Language {
Url: 'http://B'
Name required string Code;
Language optional string Country; [0,1]
Code: 'en-gb'
Country: 'gb'
}
optional string Url;
DocId: 20 }
Links }
Backward: 10
Backward: 30
Forward: 80
Name
http://code.google.com/apis/protocolbuffers
Url: 'http://C' 9
10. Column-striped representation
DocId Name.Url Links.Forward Links.Backward
value r d value r d value r d value r d
10 0 0 http://A 0 2 20 0 2 NULL 0 1
20 0 0 http://B 1 2 40 1 2 10 0 2
DocId: 10
NULL 1 1 60 1 2 30 1 2
Links http://C 0 2 80 0 2
Forward: 20
Forward: 40
Forward: 60
Name
Language Name.Language.Code Name.Language.Country
Code: 'en-us'
Country: 'us' value r d value r d
Language en-us 0 2 us 0 3
Code: 'en'
Url: 'http://A' en 2 2 NULL 2 2
Name
Url: 'http://B' NULL 1 1 NULL 1 1
Name
Language en-gb 1 2 gb 1 3
Code: 'en-gb' NULL 0 1 NULL 0 1
Country: 'gb' 10
11. Column-oriented problem
When query sub-tree, there is no way for
you to know the whole structure(even the
position of yourself) A
* *
B ... E
To solve this problem, *
this paper presents C D
repetition & definition
r1
r1
r1
r2 r2
Where am I?
r2
11
12. Repetition and definition levels
r
DocId: 10
Links 1
Forward: 20
Forward: 40
Name.Language.Code Forward: 60
value r d Name
First time, repeat=0 Language
en-us 0 2 Code: 'en-us'
Language repeat, level = 2 Country: 'us'
en 2 2 Language
Code: 'en'
NULL 1 1 Url: 'http://A'
en-gb 1 2 Name
Url: 'http://B'
NULL 0 1 Name
Language
Code: 'en-gb'
None repeat, level = 0 Country: 'gb'
DocId: 20
2 r
Links
Backward: 10
Backward: 30
r: At what repeated field in the field's path Forward: 80
the value has repeated Name
Url: 'http://C'12
13. Repetition and definition levels
r
DocId: 10
Links 1
Forward: 20
Forward: 40
Name.Language.Country Forward: 60
value r d Name
Language
us 0 3 Code: 'en-us'
Country: 'us'
NULL 2 2 Language
Code: 'en'
NULL 1 1 Url: 'http://A'
gb 1 3 Name
Url: 'http://B'
NULL 0 1 Name
Language
Code: 'en-gb'
Country: 'gb'
DocId: 20
2 r
Links
Backward: 10
Backward: 30
d: How many fields in paths that could be Forward: 80
Name
undefined (opt. or rep.) are actually present Url: 'http://C'13
14. Column-oriented is all right,
but what if I still need record-
oriented query?
(e.g., MapReduce)
14
15. Record assembly FSM
Transitions labeled with
repetition levels
DocId
0
0 1
1 Links.Backward Links.Forward
0
0,1,2
Name.Language.Code Name.Language.Country
2
Name.Ur 0,1
1
l
0
For record-oriented data processing (e.g., MapReduce)
15
16. Reading two fields
DocId: 10 s 1
Name
Language
DocId Country: 'us'
Language
0 Name
Name
1,2 Name.Language.Country Language
0 Country: 'gb'
DocId: 20 s2
Name
Structure of parent fields is preserved.
Useful for queries like /Name[3]/Language[1]/Country
16
18. Query language
Based on SQL
Performs
◦ Projection
◦ Selection
◦ Nested subqueries
◦ inner and intra-record aggregation
◦ Top-k
◦ Joins
◦ User-defined functions
18
19. DocId: 10
Links
Forward: 20
Forward: 40
Example usage Forward: 60
Name
Language
Code: 'en-us'
Country: 'us'
SELECT DocId AS Id, Language
Code: 'en'
COUNT(Name.Language.Code) WITHIN Name AS Cnt, Url: 'http://A'
Name
Name.Url + ',' + Name.Language.Code AS Str Url: 'http://B'
Name
FROM t Language
WHERE REGEXP(Name.Url, '^http') AND DocId < 20; Code: 'en-gb'
Country: 'gb'
Output table Output schema
Id: 10 t1 message QueryResult {
Name required int64 Id;
Cnt: 2 repeated group Name {
Language optional uint64 Cnt;
Str: 'http://A,en-us' repeated group Language {
Str: 'http://A,en' optional string Str;
Name }
Cnt: 0 }
}
19
20. Query execution
client
• Parallelizes scheduling
root server and aggregation
• Fault tolerance
intermediate
... • query dispatcher can
servers
dispatch servers
leaf servers ...
(with local ...
storage)
storage layer (e.g., GFS) 20
21. Example: count()
SELECT A, COUNT(B) FROM T GROUP SELECT A, SUM(c)
0 BY A FROM (R11 UNION ALL R110)
T = {/gfs/1, /gfs/2, …, /gfs/100000} GROUP BY A
R11 R12
SELECT A, COUNT(B) AS c SELECT A, COUNT(B) AS c
1 FROM T11 GROUP BY A FROM T12 GROUP BY A
T11 = {/gfs/1, …, /gfs/10000} T12 = {/gfs/10001, …, /gfs/20000}
...
SELECT A, COUNT(B) AS c
3 FROM T31 GROUP BY A ...
T31 = {/gfs/1}
Data access ops 21
23. Experiment Data
• 1 PB of real data
(uncompressed, non-replicated)
• 100K-800K tablets per table
• Experiments run during business hours
Table Number of Size (unrepl., Number Data Repl.
name records compressed) of fields center factor
T1 85 billion 87 TB 270 A 3×
T2 24 billion 13 TB 530 A 3×
T3 4 billion 70 TB 1200 A 3×
T4 1+ trillion 105 TB 50 B 3×
T5 1+ trillion 20 TB 30 B 2×
23
24. Column v.s. Record"cold" time on local disk,
time (sec) averaged over 30 runs
(e) parse as
from records C++ objects
10x speedup
using columnar objects
storage (d) read +
decompress
records
(c) parse as
from columns
columns
C++ objects
(b) assemble
2-4x overhead of records
using records (a) read +
decompress
number of fields
Table partition: 375 MB (compressed), 300K rows, 125 columns 24
25. MR and Dremel execution
Avg # of terms in txtField in 85 billion record table T1
execution time (sec) on 3000 nodes
Sawzall program ran on MR:
num_recs: table sum of int;
num_words: table sum of int;
emit num_recs <- 1;
emit num_words <-
count_words(input.txtField);
87 TB 0.5 TB 0.5 TB
Q1: SELECT SUM(count_words(txtField)) / COUNT(*)
FROM T1
MR overheads: launch jobs, schedule 0.5M tasks, assemble record
25
26. Impact of serving tree depth
execution time (sec)
(returns 100s of records) (returns 1M records)
Q2: SELECT country, SUM(item.amount) FROM T2
GROUP BY country
Q3: SELECT domain, SUM(item.amount) FROM T2
WHERE domain CONTAINS ’.net’
GROUP BY domain
40 billion nested items26
27. Scalability
execution time (sec)
number of
leaf servers
Q5 on a trillion-row table T4:
SELECT TOP(aid, 20), COUNT(*) FROM T4
27
29. Observation
Monthly query workload
of one 3000-node
percentage of queries Dremel instance
execution
time (sec)
Most queries complete under 10 sec 29
30. Conclusion
Dremel is an query system
For analysis of read-only nested data
Main feature is fast(interactive response
time), column-oriented, SQL-like Query
language
Introduced two major method:
◦ Nested columnar storage – in order to solve
partial query problem.
◦ Query processing – parallel processing &
distributing, decompressing queries
30
31. Comments
Google is awesome!
Pros
◦ Nested storage gives us more flexibility.
◦ repetition & definition is an novel idea
and it can solve the locality problem
easily.
◦ Distributed serving tree is awesome,
faster than MR.
31
32. Comments
Cons
◦ Read-only may not fit every requirement
◦ Dremel is not a database, so you’ll need
to convert your real data into dremel
when analyzing
Converting may cost lost of time and space
Google doesn’t care this problem, they have
GFS and many servers.
32