2. About Me
• Tech Lead and Engineering Manager at Databricks
• Apache Spark Committer and PMC Member
• Previously, IBM Master Inventor
• Spark, Database Replication, Information Integration
• Ph.D. from the University of Florida
• GitHub: gatorsmile
7. Process data continuously and incrementally as new data arrives, in a
cost-efficient way, without having to choose between batch and streaming
So says the project manager,
the data engineer
8. Process data continuously and incrementally as new data arrives, in a
cost-efficient way, without having to choose between batch and streaming
So says the project manager.
The data engineer's first architecture sketch:
[Diagram labels: Data Lake; CSV, JSON, TXT…; Kinesis; AI & Reporting]
20. Process data continuously and incrementally as new data arrives, in a
cost-efficient way, without having to choose between batch and streaming
So says the project manager.
The data engineer's first architecture sketch:
[Diagram labels: Data Lake; CSV, JSON, TXT…; Kinesis; AI & Reporting; ?]
So what exactly was wrong with the original design?
Why choose the complicated (read: painful) Lambda architecture?!
24. Delta On Disk
my_table/
  _delta_log/             Transaction Log
    00000.json            Table Versions
    00001.json
  date=2019-01-01/        (Optional) Partition Directories
    file-1.parquet        Data Files
25. Action Types
• Change Metadata – name, schema, partitioning, etc.
• Add File – adds a file (with optional statistics)
• Remove File – removes a file
Table = result of a set of actions
Result: Current Metadata, List of Files, List of Txns, Version
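As a rough illustration of "Table = result of a set of actions", the following Python sketch replays ordered commits to obtain the current metadata, file list, and version. The dictionary shapes (`metaData`, `add`, `remove`) and the `replay` function are simplifications assumed for this example, not the exact log schema, and transaction identifiers (the "List of Txns") are left out.

```python
from typing import Any, Dict, List, Optional, Set


def replay(commits: List[List[Dict[str, Any]]]) -> Dict[str, Any]:
    """Fold ordered commits (each a list of actions) into a table snapshot."""
    metadata: Optional[Dict[str, Any]] = None
    files: Set[str] = set()
    version = -1  # no commits applied yet

    for version, actions in enumerate(commits):
        for action in actions:
            if "metaData" in action:      # Change Metadata
                metadata = action["metaData"]
            elif "add" in action:         # Add File
                files.add(action["add"]["path"])
            elif "remove" in action:      # Remove File
                files.discard(action["remove"]["path"])

    return {"metadata": metadata, "files": sorted(files), "version": version}


# Hypothetical three-commit history for the example table.
commits = [
    [{"metaData": {"name": "my_table", "partitionColumns": ["date"]}},
     {"add": {"path": "date=2019-01-01/file-1.parquet"}}],
    [{"add": {"path": "date=2019-01-01/file-2.parquet"}}],
    [{"remove": {"path": "date=2019-01-01/file-1.parquet"}}],
]
snapshot = replay(commits)
print(snapshot["version"], snapshot["files"])
```

Replaying only a prefix of `commits` gives the table as it looked at an earlier version, which is why the log doubles as a version history.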
26. Implementing Atomicity
Changes to the table are stored as ordered, atomic
units called commits
000000.json
  Add 1.parquet
  Add 2.parquet
000001.json
  Remove 1.parquet
  Remove 2.parquet
  Add 3.parquet
…
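One way to make a commit all-or-nothing on a local filesystem is to write the actions to a temporary file and then publish it under the next version name with an operation that fails if that name is already taken. The sketch below assumes a POSIX-like filesystem where `os.link` is atomic and refuses to overwrite; `commit_path` and `try_commit` are hypothetical helper names, and cloud object stores generally need a different mechanism.

```python
import json
import os
import tempfile
from typing import Any, Dict, List


def commit_path(log_dir: str, version: int) -> str:
    # Zero-padded names keep commits ordered when listed lexicographically.
    return os.path.join(log_dir, f"{version:06d}.json")


def try_commit(log_dir: str, version: int, actions: List[Dict[str, Any]]) -> bool:
    """Publish all actions for `version` as one file; False if the version exists."""
    final = commit_path(log_dir, version)
    fd, tmp = tempfile.mkstemp(dir=log_dir, suffix=".tmp")
    try:
        # Write the whole commit first; readers only look at *.json files.
        with os.fdopen(fd, "w", encoding="utf-8") as f:
            f.write("\n".join(json.dumps(a) for a in actions))
        try:
            os.link(tmp, final)  # fails if `final` already exists
        except FileExistsError:
            return False
        return True
    finally:
        os.remove(tmp)  # the hard link (if created) keeps the data


log_dir = tempfile.mkdtemp()
first = try_commit(log_dir, 0, [{"add": {"path": "1.parquet"}},
                                {"add": {"path": "2.parquet"}}])
second = try_commit(log_dir, 0, [{"add": {"path": "other.parquet"}}])
print(first, second, sorted(os.listdir(log_dir)))
```

Either both `Add` actions of a commit become visible together or none do, and a second writer targeting the same version is told it lost.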
27. Optimistic Concurrency Control
1. Record start version
2. Record reads/writes
3. Attempt commit, check for conflicts among transactions
4. If someone else wins, check if anything you read has changed.
5. Try again.
[Diagram: commits 000000.json, 000001.json, 000002.json; User 1 and User 2 each perform Read: Schema and Write: Append]
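The five steps can be sketched as a small retry loop. This is an in-memory toy under stated assumptions: `InMemoryLog`, `ConflictError`, and `commit_with_retry` are hypothetical names, and a conflict is simplified to "another commit wrote something this transaction read".

```python
from typing import Dict, List, Set


class ConflictError(Exception):
    """Raised when something the transaction read was changed by another commit."""


class InMemoryLog:
    """Hypothetical stand-in for the transaction log: commit i lives at index i."""

    def __init__(self) -> None:
        self.commits: List[Dict[str, Set[str]]] = []

    def latest_version(self) -> int:
        return len(self.commits) - 1

    def try_write(self, version: int, commit: Dict[str, Set[str]]) -> bool:
        # Succeeds only if `version` is the next free slot.
        if version != len(self.commits):
            return False
        self.commits.append(commit)
        return True


def commit_with_retry(log: InMemoryLog, start_version: int,
                      reads: Set[str], writes: Set[str]) -> int:
    """Return the version at which the commit landed."""
    checked = start_version  # 1. record start version
    # 2. `reads` and `writes` are the recorded read and write sets.
    while True:
        attempt = log.latest_version() + 1
        # 4. Inspect commits that others made since we last looked.
        for other in log.commits[checked + 1:attempt]:
            if other["writes"] & reads:
                raise ConflictError("something we read has changed")
        checked = attempt - 1
        # 3. Attempt the commit at the next version.
        if log.try_write(attempt, {"reads": reads, "writes": writes}):
            return attempt
        # 5. Someone else took that version first; try again.


log = InMemoryLog()
log.try_write(0, {"reads": set(), "writes": {"schema"}})  # 000000.json
# Both users start from version 0, read the schema, and append a file.
v1 = commit_with_retry(log, 0, reads={"schema"}, writes={"file-a.parquet"})
v2 = commit_with_retry(log, 0, reads={"schema"}, writes={"file-b.parquet"})
print(v1, v2)
```

Both appends succeed because neither changed the schema the other read; had an intervening commit rewritten the schema, the later writer would get a `ConflictError` instead of silently committing.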
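41. 3) Failed writes can be rolled back, and data can be modified (update/delete/merge)
Can update/delete/merge be offered with standard SQL syntax?
We are working on it! Spark 3.0 is coming!
Supporting this on Spark 2.4 would require Delta to add its own SQL parser
[Diagram labels: Data Lake; CSV, JSON, TXT…; Kinesis; AI & Reporting]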
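To show how a delete can sit on top of immutable data files, here is a minimal sketch assuming a copy-on-write approach: files containing matching rows are rewritten, and the change is expressed as Remove and Add actions for a single commit. `plan_delete`, the in-memory `files` mapping, and the `rewritten-` prefix are hypothetical and only stand in for real Parquet files.

```python
from typing import Any, Callable, Dict, List, Tuple

Row = Dict[str, Any]


def plan_delete(
    files: Dict[str, List[Row]],
    predicate: Callable[[Row], bool],
) -> Tuple[List[Dict[str, Any]], Dict[str, List[Row]]]:
    """Return (actions, new_files) for deleting rows that match `predicate`.

    Files without matching rows are left untouched.
    """
    actions: List[Dict[str, Any]] = []
    new_files: Dict[str, List[Row]] = {}
    for path, rows in sorted(files.items()):
        kept = [r for r in rows if not predicate(r)]
        if len(kept) == len(rows):
            continue  # nothing to delete in this file
        actions.append({"remove": {"path": path}})
        if kept:
            new_path = f"rewritten-{path}"  # hypothetical naming scheme
            new_files[new_path] = kept
            actions.append({"add": {"path": new_path}})
    return actions, new_files


files = {
    "1.parquet": [{"id": 1}, {"id": 2}],
    "2.parquet": [{"id": 3}],
}
actions, new_files = plan_delete(files, lambda r: r["id"] == 2)
print(actions)
print(new_files)
```

Because the new files only become part of the table when the actions are committed, a job that fails before the commit leaves the log unchanged, which is what makes the failed write effectively rolled back.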
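42. 4) Replay historical data without taking online workloads offline
Stream the backfilled historical data through the same pipeline
Thanks to ACID support, you can delete the affected results, change the business logic,
and reprocess the historical data as a batch job while the stream keeps processing the newest data.
[Diagram labels: Data Lake; CSV, JSON, TXT…; Kinesis; AI & Reporting]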
43. 5) Handle late-arriving data without delaying downstream processing
Stream any late-arriving data added to the table as it gets added
Thanks to ACID support, late-arriving data can also be handled with MERGE/UPSERT
[Diagram labels: Data Lake; CSV, JSON, TXT…; Kinesis; AI & Reporting]
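The MERGE/UPSERT idea for late-arriving data can be illustrated with plain Python: rows whose key already exists are updated, and rows with a new key are inserted. This is only a sketch of the semantics on in-memory rows; `merge_upsert` and the `event_id` and `clicks` columns are hypothetical and not part of any Delta API.

```python
from typing import Any, Dict, List

Row = Dict[str, Any]


def merge_upsert(target: List[Row], updates: List[Row], key: str) -> List[Row]:
    """Upsert `updates` into `target`, matching rows on `key`."""
    by_key = {row[key]: row for row in target}
    for row in updates:
        # Matched: overwrite the existing columns. Not matched: insert the row.
        by_key[row[key]] = {**by_key.get(row[key], {}), **row}
    return [by_key[k] for k in sorted(by_key)]


table = [
    {"event_id": 1, "date": "2019-01-01", "clicks": 3},
    {"event_id": 2, "date": "2019-01-01", "clicks": 5},
]
# A correction for event 2 and a brand-new event 3 both arrive late.
late = [
    {"event_id": 2, "date": "2019-01-01", "clicks": 6},
    {"event_id": 3, "date": "2019-01-01", "clicks": 1},
]
table = merge_upsert(table, late, key="event_id")
print(table)
```

Downstream readers do not have to wait for stragglers: they process what is in the table now and pick up the merged rows when the late data is committed.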
45. Process data continuously and incrementally as new data arrives, in a
cost-efficient way, without having to choose between batch and streaming
So says the project manager.
The data engineer's first architecture sketch: the Delta architecture
[Diagram labels: Data Lake; CSV, JSON, TXT…; Kinesis; AI & Reporting]
69. Delta Lake Roadmap
Releases   Features
0.2.0      • Cloud storage support
           • Improved concurrency
0.3.0      • Scala/Java APIs for DML commands
           • Scala/Java APIs for querying commit history
           • Scala/Java APIs for vacuuming old files
0.4.0      • Python APIs for DML and utility operations
           • In-place conversion of Parquet to Delta Lake tables
Q4         • Hive support for reading Delta tables
           • SQL DML support with Spark 3.0
           • And more
70. Delta Lake Community
2+ Exabytes of Delta Read/Writes per month
3700+ Orgs using Delta
[Chart: y-axis 0 to 20,000; x-axis March to September]
74. Unified data analytics platform for accelerating innovation across
data science, data engineering, and business analytics
Original creators of popular data and machine learning open source projects
Global company with 5,000 customers and 450+ partners