Google Dremel

Google Dremel
maruyama097
丸山不二夫

Dremelが利用されている領域
 クロールされたWebドキュメントの解析
 Androidマーケットで、インスロールされたアプリの
追跡
 Google製品のクラッシュ報告
 Google BooksのOCRの結果.
 スパムの解析
 Google Mapのmapタイルのデバッグ
 BigtableインスタンスのTabletマイグレーションの
管理
 Googleの分散ビルドシステム上での実行テスト結果
 数十万のディスクのDisk I/O 統計
 Googleデータセンターで実行されているジョブの、

Dremelのアイデア
 第一。分散検索エンジンで利用されている、木構造の
アーキテクチャー。Web検索と同様に、検索は、木
構造をたどって行われ、それを書き換える。検索結果
は、下位の木構造から得られる答えを集約することで
行われる。
 第二。Dremelは、SQL-likeな検索言語をサポートす
る。それは、HiveやPigとはことなり、Map-Reduce
には変換されずに、ネーティブに検索を実行する。
 第三。Dremelは、カラム単位で分割されたデータ構
造を利用する。カラム型のストレージは、読み込む
データを少なくして、圧縮が簡単に出来るので、CPU
のコストを削減する。カラム型のデータ・ストアは、
リレーショナルなデータの解析には採用されてきたが、

この論文が明らかにすること
 ネストしたデータの為の、カラム型ストレージの
フォーマット。ネストしたデータのカラム型への変
換、逆に、カラム型からもとのデータの復元のアルゴ
リズムを与える。
 Dremelの検索言語とその実行の概要。カラム型のネ
スト・データ上で効率的に実行され、ネストされた
データの再構成を必要としない。
 Web検索システムの実行木が、データベース処理に
も利用出来ることを示す。集約検索が効率的に行われ
ることを示して、そのメリットを明らかにする。
 1000~4000ノードで稼働する、一兆個のレコード、
数テラバイトのデータセット上での実験の報告。

Dremelの背景
 インタラクティブな検索処理が、広い範囲のデータ管理システ
ム環境に、適合することを、まず示そう。
 シナリオ










Map Reduce ->billions of records
Dremel SQL
DEFINE TABLE t AS /path/to/data/*
SELECT TOP(signal1, 100), COUNT(*) FROM t
 Dashboard、Catalog

検索処理とデータ管理ツールの相互運用性
共通ストレージ層 GFS
高パフォーマンスストレージ層
共有データフォーマット

message Document {
required int64 DocId;
optional group Links {
repeated int64 Backward;
repeated int64 Forward; }
repeated group Name {
repeated group Language {
required string Code;
optional string Country; }
optional string Url; }}

Protocol Buffer

Document
DocID

Links?
Backward*

Name*
Forward*

サンプルのスキーマ

Language*

Code

Url?

Country?

Document

DocID

Links?

Backward*

Forward*

Names*

Language*

Code

Url?

Country?

Document

DocID

Repetition Level

Names* 1

Links?

Backward* 1 Forward* 1 Language* 2 Url?

Code

Country?

Document

DocID ０

Links? 1

Definition Level

Names* 1

Backward* 2 Forward* 2 Language* 2 Url? 2

Code 2

Country? 3

Document

Repetetion Level
Definition Level

DocID ０

Links? 1

Names* 1 1

Backward* 1 Forward* 1 Language* 2 Url? 2
2
2
2

Code 2

Country? 3

DocId: 10
Links
Forward: 20
Forward: 40
Forward: 60
Name
Language
Code: 'en-us‘
Country: 'us'
Language
Code: 'en'
Url: 'http://A'
Name
Url: 'http://B'
Name
Language
Code: 'en-gb'
Country: 'gb'

DocId: 20
Links
Backward: 10
Backward: 30
Forward: 80
Name
Url: 'http://C‘

DocId: 10
0,0
Links
Forward: 20
Forward: 40
Forward: 60
Name
Language
Code: 'en-us‘
Country: 'us'
Language
Code: 'en'
Url: 'http://A'
Name
Url: 'http://B'
Name
Language
Code: 'en-gb'
Country: 'gb'

DocId: 20
Links
Backward: 10
Backward: 30
Forward: 80
Name
Url: 'http://C‘

DocId: 10
０,0
Links
Backward:NULL 0,1
Forward: 20
Forward: 40
Forward: 60
Name
Language
Code: 'en-us‘
Country: 'us'
Language
Code: 'en'
Url: 'http://A'
Name
Url: 'http://B'
Name
Language
Code: 'en-gb'
Country: 'gb'

DocId: 20
Links
Backward: 10
Backward: 30
Forward: 80
Name
Url: 'http://C‘

DocId: 10
０,0
Links
Backward:NULL 0,1
Forward: 20
0,2
Forward: 40
Forward: 60
Name
Language
Code: 'en-us‘
Country: 'us'
Language
Code: 'en'
Url: 'http://A'
Name
Url: 'http://B'
Name
Language
Code: 'en-gb'
Country: 'gb'

DocId: 20
Links
Backward: 10
Backward: 30
Forward: 80
Name
Url: 'http://C‘

DocId: 10
０,0
Links
Backward:NULL 0,1
Forward: 20
0,2
Forward: 40
1,2
Forward: 60
Name
Language
Code: 'en-us‘
Country: 'us'
Language
Code: 'en'
Url: 'http://A'
Name
Url: 'http://B'
Name
Language
Code: 'en-gb'
Country: 'gb'

DocId: 20
Links
Backward: 10
Backward: 30
Forward: 80
Name
Url: 'http://C‘

DocId: 10
０,0
Links
Backward:NULL
Forward: 20
Forward: 40
Forward: 60
Name
Language
Code: 'en-us‘
Country: 'us'
Language
Code: 'en'
Url: 'http://A'
Name
Url: 'http://B'
Name
Language
Code: 'en-gb'
Country: 'gb'

0,1
0,2
1,2
1,2

DocId: 20
Links
Backward: 10
Backward: 30
Forward: 80
Name
Url: 'http://C‘

DocId: 10
０,0
Links
Backward:NULL
Forward: 20
Forward: 40
Forward: 60
Name
Language
Code: 'en-us‘
Country: 'us'
Language
Code: 'en'
Url: 'http://A'
Name
Url: 'http://B'
Name
Language
Code: 'en-gb'
Country: 'gb'

0,1
0,2
1,2
1,2
0,2

DocId: 20
Links
Backward: 10
Backward: 30
Forward: 80
Name
Url: 'http://C‘

DocId: 10
０,0
Links
Backward:NULL
Forward: 20
Forward: 40
Forward: 60
Name
Language
Code: 'en-us‘
Country: 'us’
Language
Code: 'en'
Url: 'http://A'
Name
Url: 'http://B'
Name
Language
Code: 'en-gb'
Country: 'gb'

0,1
0,2
1,2
1,2
0,2
0,3

DocId: 20
Links
Backward: 10
Backward: 30
Forward: 80
Name
Url: 'http://C‘

DocId: 10
０,0
Links
Backward:NULL
Forward: 20
Forward: 40
Forward: 60
Name
Language
Code: 'en-us‘
Country: 'us’
Language
Code: 'en’
Url: 'http://A'
Name
Url: 'http://B'
Name
Language
Code: 'en-gb'
Country: 'gb'

0,1
0,2
1,2
1,2
0,2
0,3
2,2

DocId: 20
Links
Backward: 10
Backward: 30
Forward: 80
Name
Url: 'http://C‘

DocId: 10
０,0
Links
Backward:NULL
Forward: 20
Forward: 40
Forward: 60
Name
Language
Code: 'en-us‘
Country: 'us’
Language
Code: 'en’
Country:NULL
Url: 'http://A'
Name
Url: 'http://B'
Name
Language
Code: 'en-gb'

0,1
0,2
1,2
1,2
0,2
0,3
2,2
2,2

DocId: 20
Links
Backward: 10
Backward: 30
Forward: 80
Name
Url: 'http://C‘

DocId: 10
０,0
Links
Backward:NULL
Forward: 20
Forward: 40
Forward: 60
Name
Language
Code: 'en-us‘
Country: 'us’
Language
Code: 'en’
Country:NULL
Url: 'http://A’
Name
Url: 'http://B'
Name
Language
Code: 'en-gb'

0,1
0,2
1,2
1,2
0,2
0,3
2,2
2,2
0,2

DocId: 20
Links
Backward: 10
Backward: 30
Forward: 80
Name
Url: 'http://C‘

DocId: 10
０,0
Links
Backward:NULL
Forward: 20
Forward: 40
Forward: 60
Name
Language
Code: 'en-us‘
Country: 'us’
Language
Code: 'en’
Country:NULL
Url: 'http://A’
Name
Language
Url: 'http://B'

0,1
0,2
1,2
1,2
0,2
0,3
2,2
2,2
0,2

Name
Language
Code: 'en-gb'
Country: 'gb’
DocId: 20
Links
Backward: 10
Backward: 30
Forward: 80
Name
Url: 'http://C‘

DocId: 10
０,0
Links
Backward:NULL
Forward: 20
Forward: 40
Forward: 60
Name
Language
Code: 'en-us‘
Country: 'us’
Language
Code: 'en’
Country:NULL
Url: 'http://A’
Name
Language
Code: NULL
Url: 'http://B'

0,1
0,2
1,2
1,2
0,2
0,3
2,2
2,2
0,2
1,1

Name
Language
Code: 'en-gb'
Country: 'gb’
DocId: 20
Links
Backward: 10
Backward: 30
Forward: 80
Name
Url: 'http://C‘

DocId: 10
０,0
Links
Backward:NULL
Forward: 20
Forward: 40
Forward: 60
Name
Language
Code: 'en-us‘
Country: 'us’
Language
Code: 'en’
Country:NULL
Url: 'http://A’
Name
Language
Code: NULL
Country:NULL
Url: 'http://B'

0,1
0,2
1,2
1,2
0,2
0,3
2,2
2,2
0,2
1,1
1,1

Name
Language
Code: 'en-gb'
Country: 'gb’
DocId: 20
Links
Backward: 10
Backward: 30
Forward: 80
Name
Url: 'http://C‘

DocId: 10
０,0
Links
Backward:NULL
Forward: 20
Forward: 40
Forward: 60
Name
Language
Code: 'en-us‘
Country: 'us’
Language
Code: 'en’
Country:NULL
Url: 'http://A’
Name
Language
Code: NULL
Country:NULL
Url: 'http://B’

0,1
0,2
1,2
1,2
0,2
0,3
2,2
2,2
0,2
1,1
1,1
1,2

Name
Language
Code: 'en-gb'
Country: 'gb’
DocId: 20
Links
Backward: 10
Backward: 30
Forward: 80
Name
Url: 'http://C‘

DocId: 10
０,0
Links
Backward:NULL
Forward: 20
Forward: 40
Forward: 60
Name
Language
Code: 'en-us‘
Country: 'us’
Language
Code: 'en’
Country:NULL
Url: 'http://A’
Name
Language
Code: NULL
Country:NULL
Url: 'http://B’

0,1
0,2
1,2
1,2
0,2
0,3
2,2
2,2
0,2
1,1
1,1
1,2

Name
Language
Code: 'en-gb’ 1,2
Country: 'gb’
DocId: 20
Links
Backward: 10
Backward: 30
Forward: 80
Name
Url: 'http://C‘

DocId: 10
０,0
Links
Backward:NULL
Forward: 20
Forward: 40
Forward: 60
Name
Language
Code: 'en-us‘
Country: 'us’
Language
Code: 'en’
Country:NULL
Url: 'http://A’
Name
Language
Code: NULL
Country:NULL
Url: 'http://B’

0,1
0,2
1,2
1,2
0,2
0,3
2,2
2,2
0,2
1,1
1,1
1,2

Name
Language
Code: 'en-gb’
Country: 'gb’
DocId: 20
Links
Backward: 10
Backward: 30
Forward: 80
Name
Url: 'http://C‘

DocId: 10
０,0
Links
Backward:NULL
Forward: 20
Forward: 40
Forward: 60
Name
Language
Code: 'en-us‘
Country: 'us’
Language
Code: 'en’
Country:NULL
Url: 'http://A’
Name
Language
Code: NULL
Country:NULL
Url: 'http://B’

0,1
0,2
1,2
1,2
0,2
0,3
2,2
2,2
0,2
1,1
1,1
1,2

Name
Language
Code: 'en-gb’ 1,2
Country: 'gb’ 1,3
DocId: 20
Links
Backward: 10
Backward: 30
Forward: 80
Name
Url: 'http://C‘

DocId: 10
０,0
Links
Backward:NULL
Forward: 20
Forward: 40
Forward: 60
Name
Language
Code: 'en-us‘
Country: 'us’
Language
Code: 'en’
Country:NULL
Url: 'http://A’
Name
Language
Code: NULL
Country:NULL
Url: 'http://B’

0,1
0,2
1,2
1,2
0,2
0,3
2,2
2,2
0,2
1,1
1,1
1,2

Name
Language
Code: 'en-gb’ 1,2
Country: 'gb’ 1,3
URI:NULL
1,1
DocId: 20
Links
Backward: 10
Backward: 30
Forward: 80
Name
Url: 'http://C‘

DocId: 10
０,0
Links
Backward:NULL
Forward: 20
Forward: 40
Forward: 60
Name
Language
Code: 'en-us‘
Country: 'us’
Language
Code: 'en’
Country:NULL
Url: 'http://A’
Name
Language
Code: NULL
Country:NULL
Url: 'http://B’

0,1
0,2
1,2
1,2
0,2
0,3
2,2
2,2
0,2
1,1
1,1
1,2

Name
Language
Code: 'en-gb’ 1,2
Country: 'gb’ 1,3
URI:NULL
1,1
DocId: 20
Links
Backward: 10
Backward: 30
Forward: 80
Name
Url: 'http://C‘

0,0

Name
Language
Code: 'en-gb’ 1,2
Country: 'gb’ 1,3
URI:NULL
1,1
DocId: 20
Links
Backward: 10
Backward: 30
Forward: 80
Name
Url: 'http://C‘

0,0

Name
Language
Code: 'en-gb’ 1,2
Country: 'gb’ 1,3
URI:NULL
1,1
DocId: 20
Links
Backward: 10
Backward: 30
Forward: 80
Name
Url: 'http://C‘

0,0
0,2

Name
Language
Code: 'en-gb’ 1,2
Country: 'gb’ 1,3
URI:NULL
1,1
DocId: 20
Links
Backward: 10
Backward: 30
Forward: 80
Name
Url: 'http://C‘

0,0
0,2
1,2

Name
Language
Code: 'en-gb’ 1,2
Country: 'gb’ 1,3
URI:NULL
1,1
DocId: 20
Links
Backward: 10
Backward: 30
Forward: 80
Name
Url: 'http://C‘

0,0
0,2
1,2
0,2

Name
Language
Code: 'en-gb’ 1,2
Country: 'gb’ 1,3
URI:NULL
1,1
DocId: 20
Links
Backward: 10
Backward: 30
Forward: 80
Name
Language
Url: 'http://C‘

0,0
0,2
1,2
0,2

Name
Language
Code: 'en-gb’ 1,2
Country: 'gb’ 1,3
URI:NULL
1,1
DocId: 20
Links
Backward: 10
Backward: 30
Forward: 80
Name
Language
Code: NULL
Url: 'http://C‘

0,0
0,2
1,2
0,2
0,1

Name
Language
Code: 'en-gb’ 1,2
Country: 'gb’ 1,3
URI:NULL
1,1
DocId: 20
0,0
Links
Backward: 10 0,2
Backward: 30 1,2
Forward: 80
0,2
Name
Language
Code: NULL
0,1
Country: NULL 0,1
Url: 'http://C‘

Name
Language
Code: 'en-gb’ 1,2
Country: 'gb’ 1,3
URI:NULL
1,1
DocId: 20
0,0
Links
Backward: 10 0,2
Backward: 30 1,2
Forward: 80
0,2
Name
Language
Code: NULL
0,1
Country: NULL 0,1
Url: 'http://C‘ 0,2

Level 0

Document

DocID

Backward*

Links?

Forward*

Names*

Language*

Code

Url?

Country?

Document

DocID

Level 1

Names* 1

Links?

Backward* 1 Forward* 1 Language*

Code

Url?

Country?

Document

DocID

Level 0
Level 1

Names* 1

Links?

Backward* 1 Forward* 1 Language*

Code

Url?

Country?

Document

DocID

Backward*

Level 2

Links?

Forward*

Names*

Language* 2 Url?

Code

Country?

Level 0
Level 1
Level 2

Document

DocID

Names* 1

Links?

Backward* 1 Forward* 1 Language* 2 Url?

Code

Country?

Document

DocID

Backward*

Links?

Forward*

Names*

Language*

Code

Url?

Country?

Level 0
Level 1
Level 2

Document

DocID

Backward*

Links?

Forward*

Names*

Language*

Code

Url?

Country?

Finite State Machine
CONSTRUCTION ALGORITHM

Finite State Machine
CONSTRUCTION ALGORITHM
 このアルゴリズムは、入力として、スキーマ中に現れ
る順番でレコード中に展開されているフィールドをと
る。
 このアルゴリズムは、「共通繰り返しレベル」という
コンセプトを用いている。それは、共通のもっとも近
い先祖の繰り返しレベルのことである。例えば、
Links.BackwardとLinks.Forwardの「共通繰り返し
レベル」は、１になる。
 第二のコンセプトは、「バリア」である。それは、現
在のフィールドの、順番では次のフィールドのことで
ある。
 直感的には、バリアにあたって、以前にみたフィール
ドにジャンプが必要となるまで、それぞれのフィール

1. procedure ConstructFSM(Field[] fields):
2. フィールズ上のそれぞれのフィールドに対し
て
3. maxLevelを、フィールドの最大繰り返し
レベルに設定する。
4. barrierを、フィールドの次のフィールド
か、
そうでなければ、FSMの最終状態に設定
す
る。

 ステップ１(Lines 6-10)では、「共通繰り返
しレベル」を後ろ向きに走査する。これらは、
増大しないことが保証されている。我々が出
会う、それぞれの繰り返しレベルで、列の
もっとも左のフィールドを選ぶ。
FieldReaderで繰り返しレベルが返された時、
我々が遷移すべきなのは、このフィールドで
ある。

6. フィールドの前の全てのpreFieldについて、
次の処理を繰り返す。ここで、
7. preFieldの繰り返しレベルは、barrierLevel
より大きいものを選ぶ。
8. backLevelを、preFieldとフィールドの
共通繰り返しレベルとする
9. 次のように遷移を設定する。
(field, backLevel) -> preField
10.繰り返し終わり。

 ステップ2では、ギャップを埋める (Lines
11-14)。
このギャップは、全ての繰り返しレベルが、
Line8で計算された共通繰り返しレベルに現
れていないことから生まれる。

11. [barrierLevel+1..maxLevel]の全ての
レベルの
12.フィールドからの遷移を欠いているもの
について、次の処理を繰り返す。
13.
level-1の遷移の終端をコピーする
14.繰り返し、終わり。

 ステップ 3(Lines 15-17)では、
barrierLevel以下の全てのレベルについて、
barrier フィールドへのジャンプをセットす
る。
もしも、FieldReaderが、このようなレベル
を生み出すなら、ネストされたレコードの構
成を続ける必要がある。barrierに、戻る必要
はない。

15.[0..barrierLevel]の全てのレベルについ
て
16.
遷移を次のようにセットする
(field, level) -> barrier
17.繰り返し終わり

 プロシージャDissectRecord は、RecordDecoderの
インスタンスに渡される。RecordDecoderは、バイ
ナリーにエンコードされたレコードの解析に利用され
る。
 FieldWritersは、入力スキーマの階層構造と同形の木
階層構造を構成している。
 ルートのFieldWriterは、それぞれの新しいレコード
毎に、繰り返しレベルを０にセットして、このアル
ゴリズムに渡される。
 DissectRecordプロシージャの最初の仕事は、現在の
繰り返しレベルを維持することである。

 現在の定義レベルは、現在のwriterのツリー上の位置
によって、フィールド・パスのオプショナルとリピー
ト・フィールドの数の和として、一意に決定される。
 このアルゴリズムのWhileループ (Line 5)は、与えら
れたレコードに含まれる、全てのアトミックとレコー
ド値を持つフィールド上を繰り返し処理する。

 seenFieldsの集合は、レコード中で、そのフィール
ドが処理されたか否かを追跡する。それは、どの
フィールドが、もっとも最近繰り返されたかを決定す
るのに利用される。
 子供の繰り返しレベルchRepetitionLevelは、もっと
も最近繰り返されたレベルか、あるいは、その親のレ
ベルのデフォールト値がセットされる (Lines 913)。
 このプロシージャは、ネストしたレコードに対して、
リカーシブに呼び出される (Line 18)。

 Section 4.2で、 FieldWriters がどのようにレベル
を蓄積して、低レベルのwriterに、必要に応じてそれ
らを伝播していくのかを見てきた。これらは、次のよ
うに行われる。
 それぞれのnon-leaf writerは、(繰り返しレベル、定
義レベル)の列を保持している。それぞれのwriterは、
それに関連した「バージョン」数を持っている。単純
にいえば、writerのバージョンはレベルが加えられる
時には必ず一つづつ増加される。子供にとっては、
最後に同期した親のバージョンを知っていれば十分で
ある。
 もし子供のwriterが（non-null）の値を獲得していれ
ば、それは新しいレベルを取得して親と状態を同期す

 入力データは、数千のフィールド、数百万のレコード
を持つことがあり得るので、全てのレベルをメモリー
に置くというのはありそうもない。いくつかのレベル
は、ディスク上のファイルに置かれるだろう。
 空の（サブ）レコードの損失のないエンコードの為に、
(図2のName.Languageのような)non-atomic
フィールドは、自分自身のカラム・ストライプを持つ
必要があるかもしれない。それは、レベルを持つだけ
で、non-NULLな値を持つことはない。

1. procedure DissectRecord(
RecordDecoder decoder,
FieldWriter writer, int 繰り返しレベル):
2. 現在の繰り返しレベルと定義レベルをwriter
に追加する。
3. seenFields = {} // 整数の空集合
4. decoderがフィールドの値を持つ間、繰り返
す
5. FieldWriter chWriter =
decoderによって読まれるフィールドの
子writer

6.
7.

int 子の繰り返しレベル = 繰り返しレベル
if seenFieldsが子writerのIDを含んで
いれば、
子の繰り返しレベル = 子writerの深さ
8. else
9.
子writerのIDをseenFieldsに追加する
10. end if

11. if 子Writerがatomic fieldに対応すれば
12.
子の繰り返しレベルの子writerを使って、
現在のフィールドの値を書き出す
13. else
14.
DissectRecord(ネストしたレコードの
為の
新しいRecordDecoder,
子writer, 子の繰り返しレベル）
15. end if
16.end while

 In their on-the-wire representation, records are
laid out as pairs of a field identifier followed by
a field value. Nested records can be thought of
as having an ‘opening tag’ and a ‘closing
tag’, similar to XML (actual binary encoding
may differ, see [21] for details).
 In the following, writing opening tags is
referred to as ‘starting’ the record, and writing
closing tags is called ’ending’ it.

 AssembleRecord procedure takes as input a set
of FieldReaders and (implicitly) the FSM with
state transitions between the readers.
 Variable reader holds the current FieldReader
in the main routine (Line 4). Variable
lastReader holds the last reader whose value
we appended to the record and is available to
all three procedures shown in Figure 17. The
main while-loop is at Line 5.

 We fetch the next value from the current
reader. If the value is not NULL, which is
determined by looking at its definition level, we
synchronize the record being assembled to the
record structure of the current reader in the
method MoveToLevel, and append the field
value to the record.
 Otherwise, we merely adjust the record
structure without appending any value—which
needs to be done if empty records are present.

 On Line 12, we use a ‘full definition level’.
Recall that the definition level factors out
required fields (only repeated and optional
fields are counted). Full definition level takes
all fields into account.

 Procedure MoveToLevel transitions the record
from the state of the lastReader to that of the
nextReader (see Line 22).
 For example, suppose the lastReader
corresponds to Links.Backward in Figure 2 and
nextReader is Name.Language.Code. The
method ends the nested record Links and
starts new records Name and Language, in that
order. Procedure ReturnsToLevel (Line 30) is a
counterpart of MoveToLevel that only ends
current records without starting any new ones.

SELECT PROJECT AGGREGATE
EVALUATION ALGORITHM

 The algorithm addresses a general case when a
query may reference repeated fields; a simpler
optimized version is used for flat-relational
queries, i.e., those referencing only required
and optional fields. The algorithm has two
implicit inputs: a set of FieldReaders, one for
each field appearing in the query, and a set of
scalar expressions, including aggregate
expressions, present in the query. The
repetition level of a scalar expression (used in
Line 8) is determined as the maximum
repetition level of the fields used in that
expression.

 In essence, the algorithm advances the readers
in lockstep to the next set of values, and, if the
selection conditions are met, emits the
projected values. Selection and projection are
controlled by two variables, fetchLevel and
selectLevel. During execution, only readers
whose next repetition level is no less than
fetchLevel are advanced (see Fetch method at
Line 19). In a similar vein, only expressions
whose current repetition level is no less than
selectLevel are emitted (Lines 7-10).

 The algorithm ensures that expressions at a
higher-level of nesting, i.e., those having a
smaller repetition level, get evaluated and
emitted only once for each deeper nested
expression.

サンプルの格納状態
Document
DocID

10 0 0
20 0 0

Links?

Backward*

NULL 0 1
10
02
30
12

Name*

Forward*

20
40
60
80

0
1
1
0

2
2
2
2

Language*

Url?

http://A 0 2
Code Country? http://B 1 2
NULL
11
http://C 0 2
en-us 0 2 us
03
en
2 2 NULL 2 2
NULL 1 1 NULL 1 1
en-gb 1 2 gb
13
NULL 0 1 NULL 0 1

Dremelの検索言語
SELECT DocId AS Id,
COUNT(Name.Language.Code) WITHIN Name AS Cnt,
Name.Url + ',' + Name.Language.Code AS Str
FROM t
WHERE REGEXP(Name.Url, '^http') AND DocId < 20;

出力結果
Id: 10
Name
Cnt: 2
Language
Str: 'http://A,en-us'
Str: 'http://A,en'
Name
Cnt: 0

出力のスキーマ
message QueryResult {
required int64 Id;
repeated group Name {
optional uint64 Cnt;
repeated group Language {
optional string Str; }}}

検索の木構造
 SELECT A, COUNT(B) FROM T GROUP BY A

 SELECT A, SUM(c) FROM (R11 UNION ALL ...R1n)
GROUP BY A
R1i = SELECT A, COUNT(B) AS c FROM T1i
GROUP BY A
T1iは、レベル1のサーバーiで処理される、テーブルTの
Tabletの分割

実験で使われたデータセット

カラム型とレコード型の
パフォーマンスの比較

フィールドの数

MapReduceとDremel

numRecs: table sum of int;
numWords: table sum of int;
emit numRecs <- 1;
emit numWords <- CountWords(input.txtField);

Q1: SELECT SUM(CountWords(txtField))
/ COUNT(*) FROM T1

3000 nodes, 85 billion records

木構造の深さとパフォーマンス

Q2: SELECT country, SUM(item.amount) FROM T2
GROUP BY country
Q3: SELECT domain, SUM(item.amount) FROM T2
WHERE domain CONTAINS ’.net’
GROUP BY domain

テーブルT2 24billionレコード 13TB 530フィールドでの検索

システムのノード数とスケーリング

Q5: SELECT TOP(aid, 20), COUNT(*) FROM T4
WHERE bid = {value1} AND cid = {value2}

T4の一兆個のレコード

リーフサーバーの数

毎月の処理の処理時間の分布

結論
 スキャンベースの検索は、一兆個以上のディスク上の
データセットに対して、インタラクティブなスピード
で実行出来る。
 数千ノードを含むシステムで、カラムとサーバーの数
に関して、ほとんどリニアーなスケーラビリティーを
達成出来る。
 DBMSとまったく同様に、MapReduceは、カラム型
のストレージの恩恵を受ける。
 レコードの収集とパージングは、コストがかかる。ソ
フトウェアの層は（検索実行の層を超えて）、直接カ
ラム指向のデータを利用出来るようになる必要があ
る。

結論
 MapReduceと検索処理は、一方の層の出力が、他方
の層の入力を与えるというように、相互補完的なやり
方で、利用出来る。
 マルチ・ユーザーの環境では、大規模システムは、よ
り良いユーザー・エキスペアレンスの質を提供しなが
ら、規模の経済のメリットを享受出来る。
 もしも、正確さよりも、スピードの方が受け入れやす
いなら、検索は、もっと早くすませることが出来る
し、それにもかかわらず、もっと多くのデータを見る
ことが出来る。
 Webスケールの大量のデータセットは、高速にス
キャン出来る。厳しい時間の制限の中で、最後の数
パーセントのデータを得るのは難しい。

COLUMN STRIPING
Algorithm
The algorithm for decomposing a
record into columns

DissectRecord
1. procedure DissectRecord(
RecordDecoder decoder,
FieldWriter writer,
int repetitionLevel):
 RecordDecoder, which is used to traverse
binary-encoded records.
 FieldWriters form a tree hierarchy isomorphic
to that of the input schema.

Primary Job
 The root FieldWriter is passed to the algorithm
for each new record, with repetitionLevel set to
0.
 The primary job of the DissectRecord
procedure is to maintain the current
repetitionLevel.
 The current definitionLevel is uniquely
determined by the tree position of the current
writer, as the sum of the number of optional
and repeated fields in the field’s path.

Google Dremel

Recomendados

Recomendados

Mais conteúdo relacionado

Destaque

Destaque (6)

Semelhante a Google Dremel

Semelhante a Google Dremel (20)

Mais de maruyama097

Mais de maruyama097 (20)

Último

Último (9)

Google Dremel