This document discusses speeding up OLAP cube building in Apache Kylin using Spark. Cubing with MapReduce can be slow due to serialization overhead and repeated job submissions. Spark allows caching data in memory across cuboid layers in one job, significantly reducing build times compared to MapReduce as shown in a benchmark on a 160 million row dataset. Spark simplifies Kylin development and brings capabilities for real-time OLAP and cloud integration.
2. About Us
Shaofeng Shi (史少锋)
• Apache Kylin PMC
• Sr. Architect, Kyligence Inc.
Luke Han (韩卿)
• VP of Apache Kylin
• Co-founder & CEO, Kyligence Inc.
Kyligence: formed by the team that created Apache Kylin
http://kyligence.io
3. Agenda
• What is Apache Kylin
• Challenges with MapReduce
• Speed up Cubing with Spark
• Q&A
4. About Apache Kylin
✓ Leading open source OLAP on Hadoop
✓ Fast-growing open source community
✓ Adopted by 200+ organizations worldwide
✓ First Apache Top-Level Project to originate in China
✓ InfoWorld Bossie Award:
– Best Open Source Big Data Tool (2015)
– Best Open Source Big Data Tool (2016)
Kylin / ˈkiːˈlɪn / 麒麟
-- n. (in Chinese art) a mythical animal of composite form
Apache Kylin
Extreme OLAP Engine for Big Data
http://kylin.apache.org
11. Star-Schema Benchmark
Running SSB at 10, 20 and 40 million-row scales,
Kylin’s response time stays stable
Read more at: https://github.com/kyligence/ssb-kylin
Kylin vs. Hive response time as data grows: O(1) vs. O(N)
14. Build Cube with MapReduce
• Calculate cuboids layer by layer: N dimensions (base cuboid), then N-1, N-2, …, 1, 0
• Reuse the previous layer’s result
• HDFS is used for data sharing between layers
• N rounds of MapReduce are needed in total
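The layer-by-layer idea above can be sketched in plain Python. This is only an illustration, not Kylin's implementation: the function names, the sample rows and the rule for picking a parent cuboid are all assumptions made for the example, and a single SUM measure is used.

```python
from collections import defaultdict
from itertools import combinations


def roll_up(parent_dims, parent_data, child_dims):
    """Aggregate a parent cuboid into a child cuboid by summing the measure."""
    keep = [parent_dims.index(d) for d in child_dims]
    child = defaultdict(int)
    for key, total in parent_data.items():
        child[tuple(key[i] for i in keep)] += total
    return dict(child)


def build_by_layer(dims, rows):
    """Return {cuboid dimensions: {dimension values: SUM(measure)}} for all cuboids."""
    base = defaultdict(int)
    for *key, measure in rows:
        base[tuple(key)] += measure
    layer = {tuple(dims): dict(base)}  # layer with all N dimensions (base cuboid)
    cube = dict(layer)
    for size in range(len(dims) - 1, -1, -1):  # N-1, N-2, ..., 1, 0 dimensions
        next_layer = {}
        for child_dims in combinations(dims, size):
            # Any cuboid of the previous layer that contains the child can be its parent.
            parent_dims = next(p for p in layer if set(child_dims) <= set(p))
            next_layer[child_dims] = roll_up(parent_dims, layer[parent_dims], child_dims)
        cube.update(next_layer)
        layer = next_layer  # in the MapReduce flow this hand-over goes through HDFS
    return cube


# Hypothetical sample rows: (year, carrier, origin, measure)
sample_rows = [("2015", "AA", "SFO", 10), ("2015", "AA", "LAX", 5), ("2016", "UA", "SFO", 7)]
cube = build_by_layer(("year", "carrier", "origin"), sample_rows)
print(len(cube))        # number of cuboids built
print(cube[("year",)])  # measure summed per year
```

Each pass of the outer loop reads only the previous layer, which is why the MapReduce flow has to write every layer out before the next round can start.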
15. Challenges with MapReduce
• Slow data sharing
✓ Serialization, replication, deserialization…
• Repeated job submission
✓ Dependent JARs/files are submitted repeatedly
✓ Jobs are re-queued when the cluster is busy
• Lack of streaming support
✓ Cannot support low-latency data processing
• Limited storage support
✓ Cannot ingest data from non-HDFS sources
17. Speed up Cubing with Spark
• Abstract each layer of cuboids as an RDD
• Cache the parent RDD to generate the child RDD
• Export an RDD once its child has been generated
• The existing measure aggregators can be reused as-is through the Spark Java API
[Figure: RDD-1, RDD-2, RDD-3, RDD-4, RDD-5]
19. Speed up Cubing with Spark
One job finishes the aggregation of all layers!
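The "cache parent, generate child, export parent" cycle can be sketched without a Spark cluster. The plain-Python functions below are stand-ins, not Spark or Kylin code: `reduce_by_key` and `emit_children` only imitate a reduceByKey and a flatMap step, and the None-marker key layout, the function names and the sample rows are assumptions made for the example. In a real Spark job the parent would be an RDD kept in memory with `persist()` and released with `unpersist()`.

```python
from collections import defaultdict


def reduce_by_key(pairs):
    """Plain-Python stand-in for a reduceByKey step that SUMs the measure."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)


def emit_children(record, n_dims):
    """Stand-in for a flatMap step: one parent row yields one row per child cuboid.

    A key is a tuple of length n_dims; None marks a dimension already aggregated
    away (assumes real dimension values are never None). Only dimensions to the
    right of the last None are dropped, so every child cuboid is derived from
    exactly one parent and no value is counted twice.
    """
    key, value = record
    dropped = [i for i, v in enumerate(key) if v is None]
    start = dropped[-1] + 1 if dropped else 0
    for i in range(start, n_dims):
        yield key[:i] + (None,) + key[i + 1:], value


def build_in_one_job(rows, n_dims, export):
    """Walk all layers in one process, holding only parent and child in memory."""
    parent = reduce_by_key((tuple(r[:n_dims]), r[n_dims]) for r in rows)  # base layer
    for level in range(n_dims + 1):
        child = reduce_by_key(
            pair for record in parent.items() for pair in emit_children(record, n_dims)
        )
        export(level, parent)  # the parent can be written out once its child exists
        parent = child         # the child becomes the in-memory parent of the next layer


# Hypothetical sample rows: (year, carrier, origin, measure)
sample_rows = [("2015", "AA", "SFO", 10), ("2015", "AA", "LAX", 5), ("2016", "UA", "SFO", 7)]
exported = {}
build_in_one_job(sample_rows, 3, exported.__setitem__)
print(exported[3])  # the 0-dimension layer: the grand total
```

Because the hand-over between layers stays in memory inside one process, there is no per-layer job submission and no write-then-read round trip through HDFS between layers.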
20. Performance Benchmark
• Environment
✓ 4-node Hadoop cluster; each node has 28 GB RAM and 12 cores
✓ CDH 5.8, Apache Kylin 2.0
• Spark
✓ Spark 1.6.3 on YARN
✓ 6 executors, each with 4 cores and 5 GB memory
• Test Data
✓ Airline data from the US DoT, 160 million rows in total
✓ Cube: 10 dimensions, 5 measures (SUM)
• Test Scenarios
✓ Build the cube at different scales: 3 million, 50 million and 160 million rows
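A little arithmetic puts these numbers in context. The sketch below assumes a full cube with no cuboid pruning, which the slides do not state; the variable names are illustrative only.

```python
from math import comb

dimensions = 10

# Assumption: a full cube with no pruning, so every subset of dimensions is a cuboid.
cuboids_per_layer = [comb(dimensions, k) for k in range(dimensions, -1, -1)]
total_cuboids = sum(cuboids_per_layer)

# Spark allocation versus cluster capacity, taken from the figures listed above.
executors, cores_per_executor, gb_per_executor = 6, 4, 5
nodes, cores_per_node, gb_per_node = 4, 12, 28

spark_cores = executors * cores_per_executor
spark_memory_gb = executors * gb_per_executor
cluster_cores = nodes * cores_per_node
cluster_memory_gb = nodes * gb_per_node

print(cuboids_per_layer)  # cuboids in the layers with 10, 9, ..., 0 dimensions
print(total_cuboids)      # equals 2 ** dimensions under the no-pruning assumption
print(f"{spark_cores}/{cluster_cores} cores, {spark_memory_gb}/{cluster_memory_gb} GB")
```

Under that assumption the middle layers hold the most cuboids, so they are where the cached parent layer is likely to be largest.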
22. Spark Cubing vs. MR In-Mem Cubing
MR in-mem is slow when the dataset is big and randomly distributed
When the data is sharded, MR in-mem can be fast
Spark is fast and stable regardless of data distribution
23. Benefits with Spark
• Spark speeds up cubing by 1x (twice as fast)
✓ Half of the build time is saved
• Spark simplifies Kylin’s development
✓ More functionality, less code
• Spark brings Kylin into a new era
✓ Real-time OLAP, ad hoc query, cloud integration
24. Thank You.
Join the Apache Kylin community
dev@kylin.apache.org
Visit the Kylin & Kyligence home pages
http://kylin.apache.org
http://kyligence.io