The rise of the mobile Internet, social media, and smart devices has triggered an information explosion, producing huge volumes of unstructured and semi-structured data. The data arrives in many formats and at high velocity, posing unprecedented challenges to enterprise information architectures. Faced with diverse data structures and diverse analysis tools, what architecture should we adopt to integrate them, manage the data lifecycle effectively, and extract value from the data? In this architecture, the Hadoop ecosystem will undoubtedly play the role of the foundational data platform, realizing the enterprise Data Lake.
Slide 2
Agenda
• Current state and challenges of the EDW
• What is an Enterprise Data Lake (EDL)?
• What characteristics does an EDL need?
• The evolving Hadoop ecosystem
• How the Hadoop ecosystem realizes an Enterprise Data Lake
• Takeaways
Slide 3
Becoming a Data-Driven Enterprise
[Figure: two architectures, compared by relative size and complexity]
• Today — Bring Data to Compute: process-centric businesses use mainly structured data, internal data only, and only "important" data; many separate compute systems each pull their own copy of the data.
• Future — Bring Compute to Data: information-centric businesses use all data — multi-structured, internal and external data of all types — with compute brought to one shared pool of data.
Slide 4
Current State and Challenges of the EDW
Sources: ERP, CRM, RDBMS, machine files, images, video, logs, clickstreams, external data sources
Systems: EDWs, data marts, search servers, document stores, storage, data archives
1. Complex architecture
  • Many special-purpose systems
  • Moving data around
  • No complete views
2. Visibility
  • Leaving data behind
  • Risk and compliance
  • High cost of storage
3. Time to data
  • Up-front modeling
  • Slow transforms
  • Transforms lose data
4. Cost of analytics
  • Existing systems strained
  • No agility
  • BI backlog
Slide 5
What Is an Enterprise Data Lake?
• "A data lake is a large storage repository that holds data until it is needed."
• It puts an end to data silos.
• Hadoop allows holding data in its original native format rather than forcing integration of large volumes of data up front.
https://en.wikipedia.org/wiki/Data_lake
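Holding data in its native format implies "schema on read": structure is imposed only when the data is consumed, not when it lands. A minimal Python sketch of the idea — the field names and records here are hypothetical, not from any slide:

```python
import json

# Raw events land in the lake exactly as produced -- no up-front modeling.
raw_events = [
    '{"user": "u1", "action": "click", "ts": 1}',
    '{"user": "u2", "action": "view", "ts": 2, "referrer": "ad"}',  # extra field is fine
]

def read_with_schema(lines, fields):
    """Impose a schema only at read time; missing fields become None."""
    for line in lines:
        record = json.loads(line)
        yield {f: record.get(f) for f in fields}

# Two consumers can read the same raw data with different schemas.
clicks = list(read_with_schema(raw_events, ["user", "action"]))
with_ref = list(read_with_schema(raw_events, ["user", "referrer"]))
print(clicks[0])    # {'user': 'u1', 'action': 'click'}
print(with_ref[1])  # {'user': 'u2', 'referrer': 'ad'}
```

Because no schema is baked in at write time, a new analysis can project a different set of fields later without reloading the data.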
Slide 6
Data Lake Is a Reference Architecture
Sources: ERP, CRM, RDBMS, machine files, images, video, logs, clickstreams, external data sources
Consumers: EDWs, data marts, storage, search servers, documents, archives
The Big Data Platform at the center provides:
1. Active archive
  • Full-fidelity original data
  • Indefinite time, any source
  • Lowest-cost storage
2. Data management and transformations
  • One source of data for all analytics
  • Persisted state of transformed data
  • Significantly faster and cheaper
3. Self-service exploratory BI
  • Simple search + BI tools
  • "Schema on read" agility
  • Reduces the backlog of BI user requests
4. Multi-workload analytic platform
  • Bring applications to data
  • Combine different workloads on common data (e.g. SQL + Search)
  • True BI agility
Slide 9
Hadoop & Spark
[Diagram: layered stack, bottom to top]
• HDFS, HBase (data store)
• YARN (resource management)
• Engines on YARN: MapReduce2, Spark on YARN (Spark Streaming, GraphX, MLlib, Spark SQL), Hive, Pig, Impala, Search
Legend: core Hadoop · supported Spark component · add-on not yet supported
10. 10
Cloudera is a member of, and aligned with, the broader Spark
community
Spark:
• Will replace MapReduce as the general purpose Hadoop
framework
– Broad community and vendor adoption
– Hadoop ecosystem integration (native 3rd party)
• Goes beyond data science/machine learning
– Cloudera working on Spark Core, Streaming, Security,
YARN, and MLlib
• Does not replace special purpose frameworks
– One size does not fit all for SQL, Search, Graph, Stream
Cloudera’s Position on Spark
Slide 11
Spark Engineering in Cloudera
• Cloudera embraced Spark in early 2014
• One Platform Initiative (Mike's blog)
• Engineering with Intel to broaden the Spark ecosystem:
  – Hive-on-Spark
  – Pig-on-Spark
  – Spark-over-YARN
  – Spark Streaming reliability
  – General Spark optimization
  – Security, compliance & audit
Slide 12
Hive on Spark
• Technology
  – Hive: the "standard" SQL tool in Hadoop
  – Spark: next-generation distributed processing framework
  – Hive + Spark:
    • Performance
    • Minimal feature gap
• Industry
  – Many customers have invested heavily in Hive
  – They want to leverage the Spark engine
Slide 13
Design Principles
• No or limited impact on Hive's existing code path
• Maximize code reuse
• Minimize feature customization
• Low future maintenance cost
Slide 15
Current Status
• All Hive functionality is implemented
• First round of optimization is complete
  – Map join, SMB join
  – Split generation and grouping
  – CBO, vectorization
• More optimization and benchmarking to come
• Beta in CDH
  – https://issues.apache.org/jira/browse/HIVE-7292
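Once the beta is available, switching Hive onto the Spark engine is a per-session setting. A sketch — exact availability of the property depends on the Hive/CDH version in use:

```sql
-- Run subsequent Hive queries on Spark instead of MapReduce
SET hive.execution.engine=spark;
SELECT count(*) FROM store_sales;
```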
Slide 16
How the Hadoop Ecosystem Realizes the EDL
• Data Lake building blocks
• Common pain points when building a Data Lake
  – Heterogeneous data sources
  – Tables that are too wide
  – Slow queries
• Preparatory steps before adopting a Data Lake
  – How many columns can the current relational database handle?
  – How will candidate solutions be benchmarked later? How can a sample dataset be generated?
Slide 18
Components of the Data Lake
• Processing tools
  • Hive: SQL-like query and analysis
  • Pig: transformations for big data
  • Sqoop: extracts data from external sources and loads it into Hadoop
  • MR/Spark: general-purpose cluster computing frameworks
  • Spark Streaming: on-the-fly ETL
• NoSQL
  • HBase
  • Search
• Log streaming
  • Kafka/Scribe
  • Flume
  • Fluentd
• Languages
  • Python/Java/R/Scala
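The log-streaming tools above (Kafka/Scribe, Flume, Fluentd) deliver raw log lines that are typically parsed into structured records on their way into the lake. A plain-Python sketch of that on-the-fly ETL step — the clickstream line format here is hypothetical:

```python
import re

# Hypothetical clickstream line format: "<epoch> <user> <url>"
LINE_RE = re.compile(r"(?P<ts>\d+)\s+(?P<user>\S+)\s+(?P<url>\S+)")

def parse_stream(lines):
    """Turn raw log lines into structured records, dropping malformed ones."""
    for line in lines:
        m = LINE_RE.match(line)
        if m:
            record = m.groupdict()
            record["ts"] = int(record["ts"])
            yield record

raw = [
    "1714000000 u1 /home",
    "garbage line",             # malformed lines are skipped
    "1714000005 u2 /products",
]
records = list(parse_stream(raw))
print(len(records))         # 2
print(records[1]["url"])    # /products
```

In a real pipeline the same transform would run inside a Spark Streaming job or a Flume/Fluentd parser plugin rather than a local loop.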
Slide 41
TPC-DS Defines the Table Schemas Used for Benchmarking

[master:21000] show databases;
Query: show databases
+------------------+
| name             |
+------------------+
| _impala_builtins |
| default          |
| tpcds            |
| tpcds_parquet    |
| tpcds_rcfile     |
+------------------+
Fetched 5 row(s) in 0.03s

[master:21000] use tpcds;
Query: use tpcds
[master:21000] show tables;
Query: show tables
+------------------------+
| name                   |
+------------------------+
| customer               |
| customer_address       |
| customer_demographics  |
| date_dim               |
| household_demographics |
| inventory              |
| item                   |
| promotion              |
| store                  |
| store_sales            |
| time_dim               |
+------------------------+
Fetched 11 row(s) in 0.01s
Slide 42
TPC-DS Also Defines the SQL Queries Used for Benchmarking

-- start query 1 in stream 0 using template query27.tpl
select
  i_item_id,
  s_state,
  -- grouping(s_state) g_state,
  avg(ss_quantity) agg1,
  avg(ss_list_price) agg2,
  avg(ss_coupon_amt) agg3,
  avg(ss_sales_price) agg4
from
  store_sales, customer_demographics, date_dim, store, item
where
  ss_sold_date_sk = d_date_sk
  and ss_item_sk = i_item_sk
  and ss_store_sk = s_store_sk
  and ss_cdemo_sk = cd_demo_sk
  and cd_gender = 'F'
  and cd_marital_status = 'W'
  and cd_education_status = 'Primary'
  and d_year = 1998
  and s_state in ('WI', 'CA', 'TX', 'FL', 'WA', 'TN')
  and ss_sold_date_sk between 2450815 and 2451  -- partition key filter
group by
  -- rollup (i_item_id, s_state)
  i_item_id,
  s_state
order by
  i_item_id,
  s_state
limit 100;
-- end query 1 in stream 0 using template quer
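The commented-out `rollup (i_item_id, s_state)` in the template would add subtotal rows at each grouping level on top of the per-(item, state) averages. A small Python sketch of that rollup-average idea, using hypothetical `store_sales` rows:

```python
from collections import defaultdict

# Hypothetical store_sales rows: (i_item_id, s_state, ss_quantity)
rows = [
    ("item1", "CA", 10),
    ("item1", "CA", 20),
    ("item1", "TX", 30),
    ("item2", "CA", 40),
]

def rollup_avg(rows):
    """avg(ss_quantity) for (item, state), (item,), and the grand total ()."""
    sums = defaultdict(lambda: [0, 0])  # grouping key -> [sum, count]
    for item, state, qty in rows:
        for key in [(item, state), (item,), ()]:  # the three rollup grouping sets
            sums[key][0] += qty
            sums[key][1] += 1
    return {key: s / c for key, (s, c) in sums.items()}

avgs = rollup_avg(rows)
print(avgs[("item1", "CA")])  # 15.0  -- per (item, state)
print(avgs[("item1",)])       # 20.0  -- per item subtotal
print(avgs[()])               # 25.0  -- grand total
```

An SQL engine computes the same grouping sets in one pass; the nested loop here just makes the set structure explicit.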
Slide 43
TPC-DS Also Provides a Tool That Generates a Specified Amount of Data
Sample table: 3.8 GB, text format, uncompressed, 30 million rows

[master:21000] show table stats customer;
Query: show table stats customer
+-------+--------+--------+--------------+--------+-------------------+
| #Rows | #Files | Size   | Bytes Cached | Format | Incremental stats |
+-------+--------+--------+--------------+--------+-------------------+
| -1    | 1      | 3.81GB | NOT CACHED   | TEXT   | false             |
+-------+--------+--------+--------------+--------+-------------------+
Fetched 1 row(s) in 0.00s

[master:21000] select count(*) from customer;
Query: select count(*) from customer
+----------+
| count(*) |
+----------+
| 30000000 |
+----------+
Fetched 1 row(s) in 0.77s

Using the same dataset and the same queries makes it much easier to compare different technologies during an evaluation.
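With a fixed dataset and a fixed query, comparing engines reduces to timing the same work on each and checking that the answers agree. A minimal Python harness sketch — the two "engines" here are stand-in functions, not real query engines:

```python
import time

# Shared dataset: same rows fed to every candidate, as TPC-DS does.
data = list(range(1_000_000))

def engine_a(rows):
    """Candidate A: built-in aggregation."""
    return sum(rows)

def engine_b(rows):
    """Candidate B: explicit loop, same query semantics."""
    total = 0
    for r in rows:
        total += r
    return total

def benchmark(name, fn, rows):
    """Time one engine on the shared dataset and report elapsed seconds."""
    start = time.perf_counter()
    result = fn(rows)
    elapsed = time.perf_counter() - start
    print(f"{name}: {elapsed:.4f}s")
    return result

# Same data, same query: results must match; only the timings differ.
a = benchmark("engine_a", engine_a, data)
b = benchmark("engine_b", engine_b, data)
assert a == b
```

The equality check matters as much as the timing: a faster engine that returns different results has failed the evaluation.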