We examined the performance of Presto, Spark SQL, and Hive on Tez by measuring the execution time of common query patterns on data sets ranging from tens of thousands to billions of records.
We ran a benchmark on mainstream big data SQL engines: Presto, Spark SQL, and Hive on Tez.
We focused on performance over medium-sized data (from tens of GB to 1 TB), which is the typical case in most services.
7. 7
Test SQL

Q1 (Scan Query):
SELECT pageURL, pageRank FROM rankings WHERE pageRank > X

Q2 (Aggregation Query):
SELECT SUBSTR(sourceIP, 1, X), SUM(adRevenue) FROM uservisits
GROUP BY SUBSTR(sourceIP, 1, X)

Q3 (Join Query):
SELECT sourceIP, totalRevenue, avgPageRank
FROM
  (SELECT sourceIP,
          AVG(pageRank) AS avgPageRank,
          SUM(adRevenue) AS totalRevenue
   FROM Rankings AS R, UserVisits AS UV
   WHERE R.pageURL = UV.destURL
     AND UV.visitDate BETWEEN Date('1980-01-01') AND Date('X')
   GROUP BY UV.sourceIP)
ORDER BY totalRevenue DESC LIMIT 100

Q4 (Sort Query):
SELECT destURL, adRevenue, visitDate FROM UserVisits
ORDER BY adRevenue DESC LIMIT 100

Q5 (Insert):
INSERT OVERWRITE TABLE UserVisitsWithHighRevenue
SELECT sourceIP, destURL, adRevenue, visitDate FROM UserVisits
WHERE adRevenue > X ORDER BY visitDate
8. 8
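Performance Overview
Data Case | Mean | Presto (sec) | Spark SQL (sec) | Hive on Tez (sec) | Description
Small-Medium | Geometric | 2.2 | 17.2 | 6.7 | D1~D3, Q1~Q4
Small-Medium | Arithmetic | 5.2 | 32.1 | 15.8 | D1~D3, Q1~Q4
Medium-Large | Geometric | 11.3 | 34.5 | 6.1 | D3~D5, excluding Q3 and Q5
Medium-Large | Arithmetic | 22.8 | 64.1 | 69.0 | D3~D5, excluding Q3 and Q5
Large | Geometric | 36.5 | 79.1 | 11.8 | D5, excluding Q3 and Q5
Large | Arithmetic | 46.1 | 128.4 | 129.3 | D5, excluding Q3 and Q5
Total | Geometric | 4.4 | 26.3 | 22.4 | D1~D5, Q1~Q5*
Total | Arithmetic | 14.5 | 49.5 | 437.7 | D1~D5, Q1~Q5*
*Number of successful queries: Presto = 17, Spark SQL = 21, Hive on Tez = 25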
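The two kinds of mean in the table can be checked against the per-query timings reported on the data-set slides. The following is a minimal sketch, assuming each mean is taken over every successful query in the stated range; that assumption, and the helper names `arithmetic_mean` and `geometric_mean`, are illustrative and not stated in the slides. It uses the Presto timings for D1~D3.

```python
import math

# Presto timings (sec) on D1~D3, copied from the per-data-set slides.
presto_small_medium = [
    3.2, 2.4, 1.0, 0.2,   # D1
    0.3, 3.1, 29.2, 0.8,  # D2
    3.5, 7.2, 6.5,        # D3 (only three values are reported)
]


def arithmetic_mean(values):
    """Plain average; dominated by the slowest queries."""
    return sum(values) / len(values)


def geometric_mean(values):
    """exp of the mean of logs; every value must be strictly positive."""
    return math.exp(sum(math.log(v) for v in values) / len(values))


arith = arithmetic_mean(presto_small_medium)
geo = geometric_mean(presto_small_medium)
print(f"arithmetic: {arith:.1f} sec, geometric: {geo:.1f} sec")
```

Under that assumption the sketch gives 5.2 sec and 2.2 sec, which agrees with the Presto Small-Medium row. The geometric mean is far less sensitive to a single slow query (29.2 sec here), which is why the two rows differ so much. It is undefined for a value of 0.0, so the Hive on Tez rows would need the unrounded timings.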
9. 9
Performance Overview
Figure: How many times faster than Hive on Tez (arithmetic mean). X axis: data size; Y axis: speed-up factor.
Engine | Small-Medium | Medium-Large | Large | Total
Presto | 3.0x | 3.0x | 2.8x | 30.2x
Spark SQL | 0.5x | 1.1x | 1.0x | 8.8x
Hive on Tez | 1.0x | 1.0x | 1.0x | 1.0x
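The factors in this chart appear to be the Hive on Tez mean divided by each engine's mean for the same data case. The sketch below works under that assumption and uses the arithmetic means from the Performance Overview table; the function name `speedup_vs_baseline` is hypothetical.

```python
# Arithmetic means (sec) from the Performance Overview table.
arithmetic_means = {
    "Small-Medium": {"Presto": 5.2, "Spark SQL": 32.1, "Hive on Tez": 15.8},
    "Medium-Large": {"Presto": 22.8, "Spark SQL": 64.1, "Hive on Tez": 69.0},
    "Large": {"Presto": 46.1, "Spark SQL": 128.4, "Hive on Tez": 129.3},
    "Total": {"Presto": 14.5, "Spark SQL": 49.5, "Hive on Tez": 437.7},
}


def speedup_vs_baseline(means, baseline="Hive on Tez"):
    """Return baseline_time / engine_time per data case (values > 1 mean faster)."""
    result = {}
    for case, times in means.items():
        base = times[baseline]
        result[case] = {engine: base / t for engine, t in times.items()}
    return result


speedups = speedup_vs_baseline(arithmetic_means)
for case, factors in speedups.items():
    print(case, {engine: f"{x:.1f}x" for engine, x in factors.items()})
```

A factor below 1.0 means the engine was slower than Hive on Tez for that data case. The large Total factors reflect the fact that the Total means cover a different number of successful queries per engine.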
10. 10
Performance Overview
Figure: How many times faster than Hive on Tez (geometric mean). X axis: data size; Y axis: speed-up factor.
Engine | Small-Medium | Medium-Large | Large | Total
Presto | 3.0x | 0.5x | 0.3x | 5.1x
Spark SQL | 0.4x | 0.2x | 0.1x | 0.9x
Hive on Tez | 1.0x | 1.0x | 1.0x | 1.0x
11. 11
Query on different data size
Figure: Query time on data set "D1" (15K records). Y axis: execution time (sec).
Engine | Q1 | Q2 | Q3 | Q4 | Q5
Spark SQL | 5.9 | 15.3 | 15.9 | 6.5 | 16.6
Presto | 3.2 | 2.4 | 1.0 | 0.2 | -
Hive on Tez | 2.0 | 8.7 | 7.4 | 6.0 | 9.8
(-: no value reported)
12. 12
Query on different data size
Figure: Query time on data set "D2" (6.1M records). Y axis: execution time (sec).
Engine | Q1 | Q2 | Q3 | Q4 | Q5
Spark SQL | 7.2 | 28.0 | 50.9 | 11.3 | 35.1
Presto | 0.3 | 3.1 | 29.2 | 0.8 | -
Hive on Tez | 1.9 | 17.3 | 18.1 | 11.6 | 44.5
(-: no value reported)
13. 13
Query on different data size
Figure: Query time on data set "D3" (61M records). Y axis: execution time (sec).
Engine | Q1 | Q2 | Q3 | Q4 | Q5
Spark SQL | 7.3 | 33.5 | 189.9 | 13.5 | 80.6
Presto | 3.5 | 7.2 | - | 6.5 | -
Hive on Tez | 0.0 | 29.0 | 72.4 | 15.6 | 372.2
(-: no value reported)
14. 14
Query on different data size
Figure: Query time on data set "D4" (610M records). Y axis: execution time (sec).
Engine | Q1 | Q2 | Q3 | Q4 | Q5
Spark SQL | 10.9 | 75.9 | - | 51.0 | -
Presto | 0.6 | 28.6 | - | 20.8 | -
Hive on Tez | 0.0 | 129.7 | 539.4 | 58.8 | 2828.4
(-: no value reported)
15. 15
Query on different data size
Figure: Query time on data set "D5" (1.2B records). Y axis: execution time (sec).
Engine | Q1 | Q2 | Q3 | Q4 | Q5
Spark SQL | 16.5 | 246.8 | - | 121.8 | -
Presto | 12.4 | 57.6 | - | 68.4 | -
Hive on Tez | 0.0 | 248.8 | 932.8 | 139.1 | 5449.6
(-: no value reported)
16. 16
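Performance over Memory & Nodes
Figure: Execution time (sec) with 2GB and 4GB of memory, and the resulting reduction ratio.
Engine | 2GB (sec) | 4GB (sec) | Reduction ratio
Presto | 5.9 | 4.8 | 18%
Spark SQL | 38.8 | 19.3 | 50%
Hive on Tez (HoT) | 12.7 | 12.5 | 2%
*For Spark SQL, both the number of nodes and the memory were doubled.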
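The reduction ratio is presumably the fraction of the 2GB execution time saved by moving to 4GB. Here is a short sketch under that assumption; `reduction_ratio` is a hypothetical helper name, and the timings are the rounded values shown on the slide.

```python
# (2GB time, 4GB time) in seconds, as shown on the slide.
timings = {
    "Presto": (5.9, 4.8),
    "Spark SQL": (38.8, 19.3),  # nodes were doubled as well as memory
    "Hive on Tez": (12.7, 12.5),
}


def reduction_ratio(before, after):
    """Fraction of the original time that was saved: (before - after) / before."""
    return (before - after) / before


for engine, (t_2gb, t_4gb) in timings.items():
    print(f"{engine}: {reduction_ratio(t_2gb, t_4gb):.1%}")
```

From the rounded timings this gives values close to the reported 18%, 50% and 2%. Small differences, such as the Presto figure, would be expected if the slide's percentages were derived from unrounded measurements.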