Building a Small Hybrid DB System with SMACK
Lessons Learned (and Pitfalls Hit)
Chih-Hsuan Hsu (Joe)
PilotTV / Data Engineer

About Me
● Chih-Hsuan Hsu (Joe)
● PilotTV Data Engineer
● Interested in SMACK/ELK architecture
● Technical book translator
○ Spark學習手冊 (Spark Learning Handbook)
○ Cassandra技術手冊 (Cassandra Technical Handbook)
● LinkedIn: www.linkedin.com/in/joechh
● Mail: joechh731126@gmail.com

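The Story of Reworking an Existing System
● The load on the RDB grows day by day
● Single point of failure
● We wanted to turn Batch Mode into a Streaming-based Data Flow
● To keep the first phase simple, only Spark, Kafka, and Cassandra from SMACK were adopted
○ Because we were not yet familiar with Mesos and Akka… QQ
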
System Before / (Expected) After the Rework
[Architecture diagram; labels: ETL (before), New ETL with Kafka Producer (after)]

New ETL
● Implemented in Java
● The ETL output is a stream of JSON-formatted data
● The JSON stream is sent to the Kafka Cluster through the Kafka Producer API
Problem: a Kafka Producer created with default settings has very low throughput

Experimenting with Kafka Producer Parameters (0.8.2)
Parameter                     Default  Available options
producer.type                 sync     sync, async
compression.codec             none     none, gzip, snappy
batch.num.messages            200      unlimited
request.required.acks         0        -1, 0, 1
queue.buffering.max.messages  10000    unlimited
https://kafka.apache.org/082/documentation.html

Kafka Producer: producer.type
● Setting producer.type to async enables batched transmission
● Batch mode gives better throughput, but data can be lost if the client crashes unexpectedly
[Chart: throughput 6.3X]

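Kafka Producer: batch.num.messages
● The number of messages sent in one batch when using async mode
● A batch is sent when either of the following happens
○ The number of buffered messages reaches batch.num.messages
○ The wait time exceeds queue.buffering.max.ms
[Chart: throughput 2.55X and 2.53X]
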
Kafka Producer: queue.buffering.max.messages
● The number of messages allowed to be buffered in the queue
[Chart: throughput 1.05X]

Kafka Producer: compression.codec
● Supports compressing the output stream
[Chart: throughput 2.14X and 3.02X]

Kafka Producer: request.required.acks
● 0: no receipt confirmation (ack) from the Kafka Cluster is required
● 1: ack from the Replica Leader only
● -1: ack from all Replicas
[Chart: throughput 6.3X and 2.94X]

Kafka Producer: request.required.acks (cont.)
● request.required.acks = 1 can still lose data
● Syncing with the leader node does not mean fault tolerance
[Diagram, three frames: Replica Leader, Replica Follower 1, and Replica Follower 2 each hold messages 1 to 4. Frame 1, "producer sent": messages 5 and 6 arrive at the leader. Frame 2, "ack return": the ack for 5 and 6 comes back. Frame 3: 5 and 6 are shown on a single replica only.]

Kafka Producer: request.required.acks (cont.)
● request.required.acks = -1
● For fault tolerance: replication.factor >= 2 && min.insync.replicas >= 2
[Diagram, three frames: Replica Leader, Replica Follower 1, and Replica Follower 2 each hold messages 1 to 4. Frame 1, "producer sent": messages 5 and 6 arrive at the leader. Frame 2, "all replicas sync, ack return": 5 and 6 are copied to both followers before the ack comes back. Frame 3: 5 and 6 are present on all three replicas.]

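The difference between acks = 1 and acks = -1 can be sketched with a toy model. This is not Kafka code: the class `AckSketch`, its method names, and the fixed leader-plus-two-followers layout are hypothetical, and replication is reduced to copying into a list. It only illustrates why an acknowledged message can still disappear when the leader alone has it.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model (not Kafka code): one leader log and two follower logs.
final class AckSketch {
    final List<Integer> leader = new ArrayList<>();
    final List<List<Integer>> followers = new ArrayList<>();

    AckSketch() {
        followers.add(new ArrayList<>());
        followers.add(new ArrayList<>());
    }

    // acks = 1: the ack returns as soon as the leader has appended the message;
    // the followers have not copied it yet.
    boolean sendAcksLeaderOnly(int msg) {
        leader.add(msg);
        return true;
    }

    // acks = -1: the ack returns only after every replica holds the message.
    boolean sendAcksAll(int msg) {
        leader.add(msg);
        for (List<Integer> follower : followers) {
            follower.add(msg);
        }
        return true;
    }

    // The leader crashes; follower 1 takes over with whatever it has replicated.
    List<Integer> failOverToFollower() {
        return new ArrayList<>(followers.get(0));
    }

    public static void main(String[] args) {
        AckSketch acksOne = new AckSketch();
        acksOne.sendAcksLeaderOnly(5);
        acksOne.sendAcksLeaderOnly(6);
        System.out.println("acks=1 after failover:  " + acksOne.failOverToFollower());

        AckSketch acksAll = new AckSketch();
        acksAll.sendAcksAll(5);
        acksAll.sendAcksAll(6);
        System.out.println("acks=-1 after failover: " + acksAll.failOverToFollower());
    }
}
```

In the first case both sends were acknowledged, yet the new leader has neither message; in the second case the new leader has both. The price of the second mode is the extra wait for the followers, which matches the lower throughput measured for it.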
Java Lambda Streaming
● During implementation we found that parallelStream() also helps raise throughput!
[Chart: throughput 3.47X]

Also, about the Kafka Producer object itself...
● It is thread safe, so it can be shared by all threads

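A minimal sketch of the two points above, a parallel stream whose worker threads all share one thread-safe sender. To keep it runnable without the Kafka client library, `SharedSender` is a hypothetical stand-in for the producer (it just collects messages in a concurrent queue), and `toJson` is a made-up ETL step.

```java
import java.util.List;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

// Sketch only: SharedSender stands in for a thread-safe Kafka Producer.
final class ParallelSendSketch {
    // Hypothetical stand-in; a real producer would send to the Kafka Cluster.
    static final class SharedSender {
        private final ConcurrentLinkedQueue<String> sent = new ConcurrentLinkedQueue<>();

        void send(String json) {
            sent.add(json);
        }

        int sentCount() {
            return sent.size();
        }
    }

    // Hypothetical ETL step that turns a record id into a JSON string.
    static String toJson(int id) {
        return "{\"id\":" + id + "}";
    }

    // One sender instance is shared by all worker threads of the parallel stream.
    static int sendAll(List<Integer> ids, SharedSender sender) {
        ids.parallelStream()
           .map(ParallelSendSketch::toJson)
           .forEach(sender::send);
        return sender.sentCount();
    }

    public static void main(String[] args) {
        List<Integer> ids = IntStream.range(0, 1000).boxed().collect(Collectors.toList());
        System.out.println(sendAll(ids, new SharedSender()) + " messages sent");
    }
}
```

Note that `forEach` on a parallel stream does not preserve the order of the input, so this pattern only fits when message order across records does not matter.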
Kafka Producer: Conclusions from the Experiments
Parameter                     Final value  Available options
producer.type                 async        sync, async
compression.codec             snappy       none, gzip, snappy
batch.num.messages            1000         unlimited
request.required.acks         0            -1, 0, 1
queue.buffering.max.messages  20000        unlimited
● You still have to test against your own requirements and your tolerance for data loss
● Latency vs. Throughput

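As a sketch, the final values from the table can be collected into a `java.util.Properties` object, which is the usual way to hand settings to a Java producer. The class and method names are hypothetical, and connection settings such as the broker list and serializers are deliberately left out, so this is not a complete producer configuration.

```java
import java.util.Properties;

// Sketch: the final values from the table as producer properties.
final class ProducerSettingsSketch {
    static Properties finalSettings() {
        Properties props = new Properties();
        props.setProperty("producer.type", "async");
        props.setProperty("compression.codec", "snappy");
        props.setProperty("batch.num.messages", "1000");
        props.setProperty("request.required.acks", "0");
        props.setProperty("queue.buffering.max.messages", "20000");
        return props;
    }

    public static void main(String[] args) {
        // Print each setting as key=value.
        finalSettings().forEach((key, value) -> System.out.println(key + "=" + value));
    }
}
```

These values favor throughput over safety (async with acks = 0), so they only fit workloads that can tolerate some data loss.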
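Spark Streaming
● Implemented in Scala
● Performs a client-side join over multiple streams from Kafka
● Writes (upserts) the join result into SQL Server to keep the existing architecture working
[Diagram: New ETL with Kafka Producer, then streaming join, then upsert]
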
Spark Streaming at the Beginning
Problem: tragically low processing throughput, arrrgh!! (Join at 500 msgs/sec!!)

DB Lock Resource
● After inspecting the DB, we found that Spark's upserts were using up all the lock resources...

Solution
● Create a plain base table (without any index)
● Replace upsert with insert, then run a second aggregation through a stored procedure
[Diagram: append, then aggregation]

Throughput Improvement (Join 500 -> 13000 msgs/sec)

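Another Pitfall to Watch for with Spark and an RDB
● SQL Server allows at most 32,767 concurrent connections
● Don't ask me how I know...
● Whichever connection pool you use, remember to calculate the total number of connections
● Total Connection = connection pool size * spark executor number
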
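The formula above is simple enough to check before deploying. A minimal sketch follows; the class name and the pool and executor sizes in `main` are hypothetical, and only the 32,767 limit comes from the slide.

```java
// Sketch: checking the total connection count against the SQL Server limit.
final class ConnectionBudgetSketch {
    // Maximum concurrent connections stated on the slide.
    static final int SQL_SERVER_MAX_CONNECTIONS = 32_767;

    // Total Connection = connection pool size * spark executor number
    static long totalConnections(int poolSize, int executorCount) {
        return (long) poolSize * executorCount;
    }

    static boolean fitsWithinLimit(int poolSize, int executorCount) {
        return totalConnections(poolSize, executorCount) <= SQL_SERVER_MAX_CONNECTIONS;
    }

    public static void main(String[] args) {
        // Hypothetical sizes, for illustration only.
        System.out.println(totalConnections(50, 40));
        System.out.println(fitsWithinLimit(50, 40));
        System.out.println(fitsWithinLimit(500, 100));
    }
}
```

Other applications connected to the same SQL Server instance also count toward the limit, so in practice the budget available to Spark is smaller than the maximum.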
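When Mapping with the Stateful mapWithState() API...
● When conditions allow, set a timeout so that idle key-value pairs are removed, which lowers the memory used by the state table
● The timeout is counted from the last time the key received data (that is, was last touched by the mapping function)
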
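The idea behind the timeout can be sketched in plain Java. This is not the Spark API (in Spark the timeout is set on the `StateSpec` passed to `mapWithState()`); `IdleStateSketch` and its methods are hypothetical and only show how per-key state with an idle timer keeps the table from growing without bound.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of idle-timeout eviction for keyed state (plain Java, not the Spark API).
final class IdleStateSketch {
    private static final class Entry {
        long sum;
        long lastUpdatedMs;
    }

    private final Map<String, Entry> state = new HashMap<>();
    private final long timeoutMs;

    IdleStateSketch(long timeoutMs) {
        this.timeoutMs = timeoutMs;
    }

    // Adds a value to the running sum of a key and refreshes its idle timer.
    long update(String key, long value, long nowMs) {
        Entry entry = state.computeIfAbsent(key, k -> new Entry());
        entry.sum += value;
        entry.lastUpdatedMs = nowMs;
        return entry.sum;
    }

    // Drops every key that has been idle for at least the timeout.
    void evictIdle(long nowMs) {
        state.values().removeIf(entry -> nowMs - entry.lastUpdatedMs >= timeoutMs);
    }

    int size() {
        return state.size();
    }

    public static void main(String[] args) {
        IdleStateSketch sketch = new IdleStateSketch(1_000);
        sketch.update("a", 1, 0);
        sketch.update("b", 1, 900);
        sketch.evictIdle(1_000); // "a" has been idle for 1000 ms and is dropped
        System.out.println(sketch.size());
    }
}
```

A key that keeps receiving data never expires, so the timeout only helps when keys naturally go quiet, such as finished sessions.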
Some Handy spark-submit Configs
Ref: https://spark.apache.org/docs/2.0.0-preview/configuration.html
● supervise
● spark.streaming.backpressure.enabled
● spark.streaming.backpressure.initialRate
● spark.streaming.kafka.maxRatePerPartition
● spark.executor.extraJavaOptions
○ -XX:+UseConcMarkSweepGC
● spark.cleaner.referenceTracking.cleanCheckpoint

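As a rough illustration, the options listed above could be assembled into a spark-submit argument list like this. Every value is a placeholder chosen for the example, not a recommendation from the talk; the class name is hypothetical, and the application jar and master URL are omitted.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Sketch: assembling the listed options into a spark-submit argument list.
final class SparkSubmitArgsSketch {
    static List<String> buildArgs() {
        // All values below are placeholders for illustration, not recommendations.
        Map<String, String> conf = new LinkedHashMap<>();
        conf.put("spark.streaming.backpressure.enabled", "true");
        conf.put("spark.streaming.backpressure.initialRate", "1000");
        conf.put("spark.streaming.kafka.maxRatePerPartition", "1000");
        conf.put("spark.executor.extraJavaOptions", "-XX:+UseConcMarkSweepGC");
        conf.put("spark.cleaner.referenceTracking.cleanCheckpoint", "true");

        List<String> args = new ArrayList<>();
        args.add("spark-submit");
        args.add("--supervise");
        for (Map.Entry<String, String> entry : conf.entrySet()) {
            args.add("--conf");
            args.add(entry.getKey() + "=" + entry.getValue());
        }
        return args;
    }

    public static void main(String[] args) {
        System.out.println(String.join(" ", buildArgs()));
    }
}
```

The rate limits are the ones worth measuring rather than guessing: backpressure adjusts the ingestion rate at runtime, while maxRatePerPartition caps how many records per second are read from each Kafka partition.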
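NoSQL: Cassandra
● Query-first schema design philosophy
● Before designing tables, list as many usage scenarios as possible
Step 1. Draw the ER diagram
Step 2. Consider the query scenarios
Step 3. Create tables that satisfy the queries
Ref: https://www.datastax.com/
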
However... Cassandra Is Out of This Project!
User: Joe... we want to build a dashboard. It needs ad-hoc queries that can perform arbitrary operations on arbitrary fields, so we can't discuss the possible queries with you~~
Joe: [image]

Migrate the NoSQL Solution to the ELK Stack!
[Architecture diagram; label: New ETL with Kafka Producer]

Logstash Ingestion from Kafka
Problem: bulk loading into ES has very low throughput (indexing 8000 docs/min)

Studying the Logstash Startup Flags
https://www.elastic.co/guide/en/logstash/2.4/command-line-flags.html

First Step Improvement
● Since resources were still sufficient, we tried increasing the number of workers and the batch size
● workers -> 20; batch -> 500
[Chart: throughput 7.5X]

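When Logstash Needs to Handle Multiple Kafka Topics...
● Version >= 5.0: the topics option can subscribe to multiple topics at once
● Version < 5.0: each topic has to be declared one by one in the config file
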
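ES-Side Tuning for Bulk Loading
● Some trade-off options during bulk loading
○ Don't care about the latency of the newest data -> refresh less often (raise index.refresh_interval)
○ Don't care about fault tolerance or query speed -> set the number of replicas to 0
○ Don't care about the IO used by segment merging (the faster the better) -> don't throttle merge IO
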
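The first two trade-offs map onto two index settings, `refresh_interval` and `number_of_replicas`. A minimal sketch that builds the JSON body one might send to the index's `_settings` endpoint follows; the class name is hypothetical, and the "after" values in `main` are assumptions to be replaced by whatever the index used before the bulk load. Merge throttling is left out because its setting name depends on the ES version.

```java
// Sketch: index settings bodies for a bulk-load window and for restoring afterwards.
final class BulkLoadSettingsSketch {
    // Builds the JSON body for an index settings update.
    static String settingsBody(String refreshInterval, int replicas) {
        return "{\"index\":{"
             + "\"refresh_interval\":\"" + refreshInterval + "\","
             + "\"number_of_replicas\":" + replicas
             + "}}";
    }

    public static void main(String[] args) {
        // During bulk loading: refresh disabled ("-1"), no replicas.
        System.out.println(settingsBody("-1", 0));
        // Afterwards: hypothetical normal values; use whatever the index had before.
        System.out.println(settingsBody("1s", 1));
    }
}
```

Keeping the "restore" body next to the "bulk" body makes it harder to forget to undo the temporary settings once loading is finished.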
Bulk Load Improvement
● Final result: 400,000 docs/min, 50X throughput

Just When Bulk Loading Was Going Great...
● Too many open files!
One pitfall after another...

Changing ES max_file_descriptors
● First check the current max_file_descriptors, then adjust the setting

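ES Search Query Tuning
● Optimize (force merge) cold indices, even down to a single segment
● Use two kinds of cache to improve query performance
○ Filter cache: caches filter results so that other future queries can reuse them
○ Shard cache: caches the entire query result, so an identical query is answered directly next time
● Don't forget to remove the temporary settings made for bulk-load mode
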
Filter Query vs. Normal Query
Ref: Elasticsearch in Action

Translate Normal Query to Filter Query

Adding and Modifying ES Fields
● Adding a brand-new field is easy (flexible schema)
● Changing a field's type is a hassle!! It requires a reindex...

Kibana... No Pitfalls Hit Yet (Just a Matter of Time?)
● Since Kibana 4.2 there is the Sense tool, which is great for writing DSL queries!!
● $./bin/kibana plugin --install elastic/sense

Summary
● Versions of the components discussed
○ Kafka: 0.8.2
○ Spark: 2.0.2
○ Cassandra: 3.10
○ Elasticsearch: 2.4.5
○ Logstash: 2.4.1
○ Kibana: 4.6.4
