1
Spark Drives Big Data Analytics Application
Spark-Based Data Analytics
James Chen
Etu CTO
June 16, 2015
2
• Spark Brief
• What Cloudera is doing on Spark
• Spark Use Cases
• Cloudera’s Position on Spark
• Etu and Cloudera
Agenda
3
Key Advances by MapReduce:
• Data Locality: Automatic split computation and launch of mappers
appropriately
• Fault-Tolerance: Write out of intermediate results and restartable mappers
meant ability to run on commodity hardware
• Linear Scalability: Combination of locality + programming model that forces
developers to write generally scalable solutions to problems
A Brief Review of MapReduce
[Diagram: twelve Map tasks feeding four Reduce tasks]
4
MapReduce: Good
The Good:
• Built in fault tolerance
• Optimized IO path
• Scalable
• Developer focuses on Map/Reduce, not infrastructure
• Simple? API
5
MapReduce: Bad
The Bad:
• Optimized for disk IO
– Doesn’t leverage memory
– Iterative algorithms go through the disk IO path again and again
• Primitive API
– Developers have to build on a very simple abstraction
– Key/Value in/out
– Even basic things like join require extensive code
• Result is often many files that need to be combined appropriately
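To make the point about joins concrete, here is a minimal sketch in plain Python (not Hadoop code) of a reduce-side join written only in terms of key/value map, shuffle, and reduce steps. The `users` and `orders` records and every function name are hypothetical, invented for illustration.

```python
from collections import defaultdict

# Hypothetical inputs: two datasets that share a user id.
users = [(1, "alice"), (2, "bob")]
orders = [(1, "book"), (1, "pen"), (2, "lamp")]

def map_phase(users, orders):
    """Emit (key, tagged value) pairs; the tag records the source dataset."""
    for user_id, name in users:
        yield user_id, ("U", name)
    for user_id, item in orders:
        yield user_id, ("O", item)

def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Pair every user record with every order record for the same key."""
    names = [v for tag, v in values if tag == "U"]
    items = [v for tag, v in values if tag == "O"]
    return [(key, name, item) for name in names for item in items]

def join(users, orders):
    groups = shuffle(map_phase(users, orders))
    result = []
    for key in sorted(groups):
        result.extend(reduce_phase(key, groups[key]))
    return result

joined = join(users, orders)
```

Even in this toy form the join needs tagging, grouping, and a cross product per key, which is the kind of boilerplate a higher-level API can hide behind a single join call.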
6
Spark is a general purpose computational framework
with more flexibility than MapReduce
Key properties:
• Leverages distributed memory
• Full Directed Graph expressions for data parallel computations
• Improved developer experience
Yet retains:
Linear scalability, Fault-tolerance, and Data Locality based computations
Reference:
https://www.cs.berkeley.edu/~matei/papers/2012/nsdi_spark.pdf
What is Spark?
7
Spark: Easy and Fast Big Data
Easy to Develop
– Highly productive language support
– Clean and expressive APIs
– Interactive shell
– Out-of-the-box functionality
2-5× less code
Fast to Run
– General execution graphs
– In-memory storage
Up to 10× faster on disk, 100× in memory
8
Spark
Easy: Example – Word Count
Hadoop MapReduce
public static class WordCountMapClass extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, IntWritable> {
  private final static IntWritable one = new IntWritable(1);
  private Text word = new Text();
  // Emit (word, 1) for every token in the input line.
  public void map(LongWritable key, Text value,
                  OutputCollector<Text, IntWritable> output,
                  Reporter reporter) throws IOException {
    String line = value.toString();
    StringTokenizer itr = new StringTokenizer(line);
    while (itr.hasMoreTokens()) {
      word.set(itr.nextToken());
      output.collect(word, one);
    }
  }
}
public static class WordCountReduce extends MapReduceBase
    implements Reducer<Text, IntWritable, Text, IntWritable> {
  // Sum the counts collected for each word.
  public void reduce(Text key, Iterator<IntWritable> values,
                     OutputCollector<Text, IntWritable> output,
                     Reporter reporter) throws IOException {
    int sum = 0;
    while (values.hasNext()) {
      sum += values.next().get();
    }
    output.collect(key, new IntWritable(sum));
  }
}
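// Spark version of the same word count.
// master and appName are supplied by the caller; sparkHome and jars are optional arguments.
val spark = new SparkContext(master, appName)
val file = spark.textFile("hdfs://...")
val counts = file.flatMap(line => line.split(" "))
                 .map(word => (word, 1))
                 .reduceByKey(_ + _)
counts.saveAsTextFile("hdfs://...")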
9
Hadoop Integration
• Works with Hadoop Data
• Runs With YARN
Libraries
• MLlib
• Spark Streaming
• GraphX (alpha)
Out-of-the-Box Functionality
Language support:
• Improved Python support
• SparkR
• Java 8
• Schema support in Spark’s
APIs
10
from math import exp
import numpy

# readPoint, D and iterations are assumed to be defined elsewhere.
data = spark.textFile(...).map(readPoint).cache()
w = numpy.random.rand(D)
for i in range(iterations):
    gradient = (data
        .map(lambda p: (1 / (1 + exp(-p.y * w.dot(p.x))) - 1) * p.y * p.x)
        .reduce(lambda x, y: x + y))
    w -= gradient
print("Final w: %s" % w)
Example: Logistic Regression
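The same gradient loop can be tried without a cluster. The sketch below is plain Python with no Spark dependency; the four training points and the helper names are hypothetical. It keeps the map step (one gradient per point) and the reduce step (element-wise sum) separate, so the structure mirrors the slide.

```python
from math import exp

# Hypothetical, linearly separable training set: (features, label in {-1, +1}).
points = [
    ((1.0, 2.0), 1),
    ((2.0, 1.5), 1),
    ((-1.0, -1.5), -1),
    ((-2.0, -1.0), -1),
]

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def point_gradient(w, x, y):
    """Gradient of the logistic loss log(1 + exp(-y * w.x)) for one point."""
    scale = (1.0 / (1.0 + exp(-y * dot(w, x))) - 1.0) * y
    return [scale * xi for xi in x]

def train(points, iterations=5):
    w = [0.0] * len(points[0][0])
    for _ in range(iterations):
        # "map" step: one gradient per point; "reduce" step: element-wise sum.
        grads = [point_gradient(w, x, y) for x, y in points]
        gradient = [sum(column) for column in zip(*grads)]
        w = [wi - gi for wi, gi in zip(w, gradient)]
    return w

w = train(points)
predictions = [1 if dot(w, x) > 0 else -1 for x, _ in points]
```

Every iteration reads the same `points` again, which is why keeping the dataset in memory (the `cache()` call on the slide) matters for iterative algorithms.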
11
• Hadoop cluster with 100 nodes
contains 10+TB of RAM today and
will double next year
• 1 GB RAM ~ $10-$20
• Trends:
• ½ price every 18 months
• 2x bandwidth every 3 years
Memory Management Leads to Greater Performance
[Diagram: server with 64-128 GB RAM, 16 cores, 50 GB per sec]
Memory can be an enabler for high-performance big data applications
12
In-memory Caching
• Data Partitions read from
RAM instead of disk
Operator Graphs
• Scheduling Optimizations
• Fault Tolerance
Fast: Using RAM, Operator Graphs
[Diagram: operator graph over RDDs A to F, connected by map, join, filter, groupBy and take operators; the legend distinguishes RDDs from cached partitions]
13
Expressiveness of Programming Model
[Diagram: three patterns built from chained Map and Reduce steps]
• Efficient group-by aggregations and other analytics
• Pipelined MapReduce jobs
• Iterative jobs (machine learning)
14
Logistic Regression Performance (Data Fits in Memory)
[Chart: running time (s), 0 to 4000, versus number of iterations (1, 5, 10, 20, 30) for Hadoop and Spark]
• Hadoop: 110 s / iteration
• Spark: first iteration 80 s, further iterations 1 s
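As a rough worked example, the per-iteration figures quoted on the chart (110 s per iteration for Hadoop; 80 s for the first Spark iteration and 1 s for each later one) can be turned into total running times. This is an extrapolation that assumes constant per-iteration cost; the totals are not measurements from the deck, and the function names are made up for this sketch.

```python
HADOOP_SECONDS_PER_ITERATION = 110
SPARK_FIRST_ITERATION_SECONDS = 80
SPARK_LATER_ITERATION_SECONDS = 1

def hadoop_total(iterations):
    """Every iteration pays the full cost of re-reading the data."""
    return HADOOP_SECONDS_PER_ITERATION * iterations

def spark_total(iterations):
    """The first iteration loads the data; later ones reuse it from memory."""
    if iterations == 0:
        return 0
    return (SPARK_FIRST_ITERATION_SECONDS
            + SPARK_LATER_ITERATION_SECONDS * (iterations - 1))

# (Hadoop seconds, Spark seconds) for the iteration counts on the chart axis.
totals = {n: (hadoop_total(n), spark_total(n)) for n in (1, 5, 10, 20, 30)}
```

Under these assumptions, 30 iterations come to 3300 s on Hadoop and 109 s on Spark, which is consistent with the chart's vertical axis topping out at 4000 s.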
15
• Spark Brief
• What Cloudera is doing on Spark
• Spark Use Cases
• Cloudera’s Position on Spark
• Etu and Cloudera
Agenda
16
Spark Engineering in Cloudera
• Cloudera embraced Spark in early 2014
• Engineering with Intel to broaden Spark ecosystem
– Hive-on-Spark
– Pig-on-Spark
– Spark-over-YARN
– Spark Streaming Reliability
– General Spark Optimization
17
Hive on Spark
• Technology
– Hive: “standard” SQL tool in Hadoop
– Spark: next-gen distributed processing framework
– Hive + Spark
• Performance
• Minimum feature gap
• Industry
– A lot of customers heavily invest in Hive
– Want to leverage the Spark engine
18
Design Principles
• No or limited impact on Hive’s existing code path
• Maximize code reuse
• Minimum feature customization
• Low future maintenance cost
19
Class Hierarchy
TaskCompiler generates Task; Task is described by Work
• TaskCompiler: MapRedCompiler, TezCompiler, SparkCompiler
• Task: MapRedTask, TezTask, SparkTask
• Work: MapRedWork, TezWork, SparkWork
20
Work – Metadata for Task
• MapReduceWork contains one MapWork and a possible ReduceWork
• SparkWork contains a graph of MapWorks and ReduceWorks
Query: select name, sum(value) as v from dec group by name order by v;
[Diagram comparing the two plans]
• MR Job 1: MapWork1, ReduceWork1
• MR Job 2: MapWork2, ReduceWork2
• Spark Job: MapWork1, ReduceWork1, ReduceWork2
21
Data Processing via Spark
• Treat Table as HadoopRDD (input RDD)
• Apply the function that wraps MR’s map-side processing
• Shuffle map output using Spark’s transformations (groupByKey,
sortByKey, etc)
• Apply the function that wraps MR’s reduce-side processing
22
Spark Plan
• MapInput – encapsulate a table
• MapTran – map-side processing
• ShuffleTran – shuffling
• ReduceTran – reduce-side processing
Query: Select name, sum(value) as v from dec group by name order by v;
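The four plan elements can be imitated in plain Python to show how the example query flows through them. This is only a sketch of the idea, not Hive or Spark code: the rows of `dec` are hypothetical, and the two shuffle helpers merely stand in for groupByKey-style and sortByKey-style transformations.

```python
from collections import defaultdict

# Hypothetical rows of table `dec`: (name, value).
dec = [("a", 3), ("b", 1), ("a", 4), ("c", 2), ("b", 5)]

def map_tran(rows):
    """Map-side processing: emit (group key, value) pairs."""
    return [(name, value) for name, value in rows]

def shuffle_group(pairs):
    """Stand-in for a groupByKey-style shuffle."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return list(groups.items())

def reduce_sum(groups):
    """Reduce-side processing for `sum(value) as v ... group by name`."""
    return [(name, sum(values)) for name, values in groups]

def shuffle_sort(pairs):
    """Stand-in for a sortByKey-style shuffle, ordering rows by v."""
    return sorted(pairs, key=lambda row: row[1])

# MapInput -> MapTran -> ShuffleTran -> ReduceTran -> ShuffleTran (order by)
result = shuffle_sort(reduce_sum(shuffle_group(map_tran(dec))))
```

The whole query runs as one chain of steps here, whereas a MapReduce plan would split the group-by and the order-by into two separate jobs.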
23
Current Status
• All functionality in Hive is implemented
• First round of optimization is completed
– Map join, SMB
– Split generation and grouping
– CBO, vectorization
• More optimization and benchmarking coming
• Beta in CDH
– http://archive-primary.cloudera.com/cloudera-labs/hive-on-spark/
– http://www.cloudera.com/content/cloudera/en/documentation/hive-spark/latest/PDF/hive-spark-get-started.pdf
24
• Spark Brief
• What Cloudera is doing on Spark
• Spark Use Cases
• Cloudera’s Position on Spark
Agenda
25
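Sample Use Cases
User | Use Case | Spark’s Value
Conviva | Optimize end users’ online video experience through real-time analysis of traffic patterns and finer-grained traffic control | Rapid prototyping; business logic shared between offline and online computation; open-source machine learning algorithms
Yahoo! | Speed up the model training cycle for ad targeting, with 3× faster feature extraction; content recommendation using collaborative filtering | Lower data pipeline latency; iterative machine learning; efficient P2P broadcast
Anonymous (Large Tech Company) | Near-real-time log aggregation and analysis for monitoring and alerting | Low-latency, high-frequency “mini” batch jobs that process the latest data
Technicolor | Real-time analytics for (telecom) customers; stream processing and real-time query capability | Simple deployment that needs only Spark and Spark Streaming; ad hoc queries on live data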
26
Large Tech Company – Spark is used for new machine learning
investigations for search personalization
Financial Services – Process millions of stock positions and future
scenarios in 4hrs with Spark (compared with 1 week using
MapReduce)
University – Genomics research using Spark pipelines
Video – Spark and Spark Streaming for video streaming and analysis
Hospital – Spark for predictive modeling of disease conditions
Cloudera Use Cases in Verticals
27
• Run ETL on Spark using Pig
– To achieve very tight SLAs
– Accenture Smart Water Application
• Spark analytics over HBase
– Patients’ physiological data, experiment and user data
– Serving researchers
• Traffic analysis using MLlib/Clustering at Dylan
• Annotated variants analysis on Spark
– Using the Spark/Java framework at Duke
• Sepsis detection with Spark Streaming
Cloudera Use Cases with Different Components
28
• A car shopping website where people
from all across the nation come to read
reviews, compare prices, and in general
get help in all matters car related.
• The goal was to build a near real-time
dashboard that would provide both
unique visitor and page view counts per
make and make/model that could be
engineered in a couple of weeks.
• In the past, these updates have been
restricted to hourly granularities with an
additional hour delay.
• Furthermore, as this data was not
available in an easy-to-use dashboard,
manual processing was needed to
visualize the data.
Near real-time dashboard by
Edmunds.com
29
Prototype Architecture
30
Page View Per Minute
31
Unique Visitor Per Minute
32
Total UV by Maker/Model
33
Case Study in Etu Insight
l Problem domain:
− Analyze user behavior from web site interaction log
− Analyze users behavior from existing offline data
− Make data aggregation on the data grouping by time
and users
l Approach:
− ETL process from the web log to Hive structure data
− Import existing database data
− Define and implement the aggregation function in Spark
(with Scala)
− Output the calculation result to HBase
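The slide names Spark with Scala as the implementation; the sketch below uses plain Python only to show the shape of an aggregation grouped by time and user. The log records, the hourly bucket size, and the function names are all assumptions made for illustration.

```python
from collections import Counter
from datetime import datetime

# Hypothetical web log records: (ISO timestamp, user id, page).
log = [
    ("2015-06-16T09:05:00", "u1", "/home"),
    ("2015-06-16T09:40:00", "u1", "/cart"),
    ("2015-06-16T09:59:00", "u2", "/home"),
    ("2015-06-16T10:01:00", "u1", "/home"),
]

def hour_bucket(timestamp):
    """Truncate an ISO timestamp to the start of its hour."""
    return datetime.fromisoformat(timestamp).strftime("%Y-%m-%d %H:00")

def page_views_by_hour_and_user(records):
    """Count records per (hour, user) key."""
    return Counter((hour_bucket(ts), user) for ts, user, _page in records)

views = page_views_by_hour_and_user(log)
```

Each `(hour, user)` key with its count is the kind of row that could then be written to a key/value store such as HBase.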
34
Architecture & Flow
Web log User Data
Hive
(Structured Data)
SPARK
HBase
35
Etu Insight Dashboard
36
Advanced Analytics with Spark
• Written by Cloudera data science team
– First-ever book bridging ML with the Hadoop ecosystem
– Focuses on use cases and examples rather than being a manual
– Targeted at data scientists solving real-world analysis problems
– Generally available in May 2015
37
Analyzing Big Data
• Building a model to detect credit card fraud using thousands
of features and billions of transactions
• Intelligently recommend millions of products to millions of
users
• Estimate financial risk through simulations of portfolios
including millions of instruments
• Easily manipulate data from thousands of human genomes to
detect genetic associations with disease
38
• Spark Brief
• What Cloudera is doing on Spark
• Spark Use Cases
• Cloudera’s Position on Spark
• Etu and Cloudera
Agenda
39
Spark is a fully integrated and supported part of Cloudera’s
enterprise data hub
• First vendor to ship and support Spark
– Invested early to make it a cohesive part of the platform
– Complemented by Intel’s early investment
– Developed and supported in collaboration with Databricks to
ensure success
• Only vendor with Spark committers on staff
• Several Spark use cases in production
• Well-trained support staff and external Training Courses
Cloudera’s Investment in Spark
40
Hadoop in the Spark World
[Diagram: platform stack with YARN; HDFS, HBase; Spark; Spark Streaming; GraphX; MLlib; SparkSQL; Hive; Pig; Impala; MapReduce2; Search]
Legend: Core Hadoop; Supported Spark components; Unsupported add-ons
41
Focusing on Open Standards, not just Open Source
Open Standards are just as important as Open Source.
Why does it matter?
• Diverse engineering is more sustainable.
• Broad support ensures vendor portability.
• Project utility depends on ecosystem compatibility, which depends on standards.
Cloudera leads in defining the de facto open standards adopted by the market.
Vendor Support
Component (Founder) | Cloudera | Pivotal | MapR | Amazon | IBM | Hortonworks
Spark (UC Berkeley) | ✔ | ✔ | ✔ | ✔ | ✔ | ✔
Impala (Cloudera) | ✔ | ✖ | ✔ | ✔ | ✖ | ✖
Hue (Cloudera) | ✔ | ✖ | ✔ | ✔ | ✖ | ✔
Sentry (Cloudera) | ✔ | ✔ | ✔ | ✖ | ✔ | ✖
Flume (Cloudera) | ✔ | ✔ | ✔ | ✖ | ✔ | ✔
Parquet (Cloudera/Twitter) | ✔ | ✔ | ✔ | ✔ | ✔ | ✖
Sqoop (Cloudera) | ✔ | ✔ | ✔ | ✔ | ✔ | ✔
Falcon | ✖ | ✖ | ✖ | ✖ | ✖ | ✔
Knox | ✖ | ✖ | ✖ | ✖ | ✖ | ✔
Tez | ✖ | ✖ | ✔ | ✖ | ✖ | ✔
Ranger | ✖ | ✖ | ✖ | ✖ | ✖ | ✔
42
Cloudera is a member of, and aligned with, the broader Spark
community
Spark:
• Will replace MapReduce as the general purpose Hadoop framework
– Broad community and vendor adoption
– Hadoop ecosystem integration (native & 3rd party)
• Goes beyond data science/machine learning
– Cloudera working on Spark Core, Streaming, Security, YARN, and MLlib
• Does not replace special purpose frameworks
– One size does not fit all for SQL, Search, Graph, Stream
Cloudera’s Position on Spark
43
• Spark Brief
• What Cloudera is doing on Spark
• Spark Use Cases
• Cloudera’s Position on Spark
• Etu and Cloudera
Agenda
44
Cloudera Partner with Etu
45
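Etu’s Positioning and Value in Enterprise Hadoop
[Diagram: tasks placed on hard/easy axes: talent recruitment; team building; program development; data mining; architecture design; deployment and tuning; operations and management; application; platform; seizing the market; core value; resource allocation]
Standardization and automation
Reduce the complexity of deploying and operating the Hadoop platform
• Less effort: on-site installation and tuning, project technical services
• Less time: consulting and training to help teams get started quickly
• Peace of mind: local technical support that lowers adoption risk
• Insight: years of shared experience to clear the key hurdles
Etu Manager
Etu Professional Service
Etu Consulting
Etu Training
Etu Services
Etu Support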
46
Etu Support
Etu Professional
Service
Etu Consulting
Cloudera
Support
Etu Manager Etu Services
Etu Big Data Software Platform and Services
Cloudera
Manager
Etu
Manager
Cloudera Manager inside
Etu Training
47
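Mainstream x86 commodity servers
Performance optimization
Full-cluster management
Automated bare-metal deployment
A fully automated, high-performance, easy-to-manage big data processing platform
The only local Hadoop professional services
Etu Manager Makes Hadoop Easier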
48
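Etu Services
• Etu Manager feature module updates
• Technical consultation on HDFS / MapReduce / HBase / Pig / Hive / Impala / Spark (by email)
• Upgrade and update packages aligned with CDH
• Customer issue management
• Hadoop cluster planning and design ● Hadoop software architecture and data model design
• Hadoop system installation and setup (on-site) ● Hadoop data processing and application software development
• Hadoop cluster maintenance, inspection and tuning (on-site) ● Hadoop data migration services
Etu Professional Services (billed per person-day)
• Cluster planning and network architecture design / consulting
• Application architecture design / consulting
Etu Consulting (billed per person-day)
• Standard courses: Hadoop learning roadmap – hands-on big data technology training tailored to different job roles
• Corporate group classes
Etu Training (billed per attendee)
Etu Technical Support 8x5 (billed annually)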
49
Booth 4 : Etu Data Lake
Booth 5 : Cloudera
Learn More
50
Appendix
Concepts
51
• Driver & Workers
• RDD – Resilient Distributed Dataset
• Transformations
• Actions
• Caching
Spark Concepts - Overview
52
Drivers and Workers
[Diagram: a Driver sends Tasks to three Workers and receives Results; each Worker holds Data in RAM]
53
• Read-only partitioned collection of records
• Created through:
– Transformation of data in storage
– Transformation of RDDs
• Contains lineage to compute from storage
• Lazy materialization
• Users control persistence and partitioning
RDD – Resilient Distributed Dataset
54
Operations
Transformations
• Map
• Filter
• Sample
• Join
Actions
• Reduce
• Count
• First, Take
• SaveAs
55
• Transformations create new RDD from an
existing one
• Actions run computation on RDD and return a
value
• Transformations are lazy
• Actions materialize RDDs by computing
transformations
• RDDs can be cached to avoid re-computing
Operations
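The split between lazy transformations and eager actions can be illustrated with a toy class in plain Python. `LazyDataset` is a hypothetical stand-in written for this sketch, not the Spark RDD API: transformations only record a step, and an action replays the recorded steps over the source data.

```python
class LazyDataset:
    """Toy stand-in for an RDD: records transformations, runs them on an action."""

    def __init__(self, source, steps=()):
        self._source = source        # underlying data
        self._steps = tuple(steps)   # pending transformations, in order

    # Transformations: return a new dataset, compute nothing.
    def map(self, func):
        return LazyDataset(self._source, self._steps + (("map", func),))

    def filter(self, pred):
        return LazyDataset(self._source, self._steps + (("filter", pred),))

    def _compute(self):
        for record in self._source:
            keep = True
            for kind, func in self._steps:
                if kind == "map":
                    record = func(record)
                elif not func(record):
                    keep = False
                    break
            if keep:
                yield record

    # Actions: these trigger the computation and return a value.
    def collect(self):
        return list(self._compute())

    def count(self):
        return sum(1 for _ in self._compute())


calls = []

def double(x):
    calls.append(x)  # record that the function really ran
    return x * 2

doubled = LazyDataset([1, 2, 3, 4]).map(double).filter(lambda x: x > 4)
calls_before_action = len(calls)   # still zero: nothing has run yet
result = doubled.collect()         # the action runs the whole pipeline
```

Because nothing is cached in this toy, calling a second action would run `double` over the source again, which is the re-computation that caching avoids.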
56
• RDDs contain lineage
• Lineage – source location and list of transformations
• Lost partitions can be re-computed from source data
Fault-Tolerance
msgs = (textFile.filter(lambda s: s.startswith("ERROR"))
                .map(lambda s: s.split("\t")[2]))
[Diagram: HDFS File → filter (func = startswith(...)) → Filtered RDD → map (func = split(...)) → Mapped RDD]
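Recovery from lineage can be sketched in plain Python using the same filter and map steps. The partitioned log lines below are hypothetical, and `compute_partition` is an illustrative helper rather than a Spark function; the point is that a lost partition is rebuilt from its source data alone.

```python
# Hypothetical source file, already split into partitions of log lines.
source_partitions = [
    ["ERROR\tdb\ttimeout", "INFO\tweb\tok"],
    ["ERROR\tweb\t500", "ERROR\tdb\tdeadlock"],
]

# Lineage: the ordered list of transformations applied to the source.
lineage = [
    ("filter", lambda s: s.startswith("ERROR")),
    ("map", lambda s: s.split("\t")[2]),
]

def compute_partition(index):
    """Rebuild one partition from source data by replaying the lineage."""
    records = source_partitions[index]
    for kind, func in lineage:
        if kind == "filter":
            records = [r for r in records if func(r)]
        else:
            records = [func(r) for r in records]
    return records

# Materialize every partition, then simulate losing one of them.
msgs = {i: compute_partition(i) for i in range(len(source_partitions))}
del msgs[1]
msgs[1] = compute_partition(1)  # only the lost partition is recomputed
```

No replica of the derived data is needed: the source location plus the list of transformations is enough to restore the missing piece.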
57
• persist() and cache() mark an RDD to be cached
• The RDD is cached after the first action
• Fault-tolerant – lost partitions will be re-computed
• If there is not enough memory, some partitions will not be cached
• Future actions are performed on the cached partitions, so they are much faster
Use caching for iterative algorithms
Caching
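The effect of marking a dataset for caching can be shown with a small plain-Python toy. `CachedDataset` and its `computations` counter are hypothetical names for this sketch, not Spark API: `cache()` only sets a flag, the first action fills the cache, and later actions reuse it.

```python
class CachedDataset:
    """Toy dataset that keeps its computed records once cache() is requested."""

    def __init__(self, source, func):
        self._source = source
        self._func = func
        self._use_cache = False
        self._cached = None
        self.computations = 0  # how many times the full pipeline ran

    def cache(self):
        # Only marks the dataset; nothing is computed here.
        self._use_cache = True
        return self

    def _records(self):
        if self._cached is not None:
            return self._cached
        self.computations += 1
        records = [self._func(x) for x in self._source]
        if self._use_cache:
            self._cached = records
        return records

    # Actions
    def sum(self):
        return sum(self._records())

    def count(self):
        return len(self._records())


uncached = CachedDataset(range(5), lambda x: x * x)
uncached.sum()
uncached.count()

cached = CachedDataset(range(5), lambda x: x * x).cache()
computations_after_cache_call = cached.computations
total = cached.sum()
size = cached.count()
```

Without the flag each action recomputes the squares; with it the second action is served from memory, which is the saving an iterative algorithm collects on every pass.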
58
• MEMORY_ONLY
• MEMORY_AND_DISK
• MEMORY_ONLY_SER
• MEMORY_AND_DISK_SER
• DISK_ONLY
• MEMORY_ONLY_2, MEMORY_AND_DISK_2…
Caching – Storage Levels
59
• map
• filter
• groupBy
• sort
• union
• join
• leftOuterJoin
• rightOuterJoin
Easy: Expressive API
• reduce
• count
• fold
• reduceByKey
• groupByKey
• cogroup
• cross
• zip
• sample
• take
• first
• partitionBy
• mapWith
• pipe
• save ...

Volodymyr Lyubinets "Introduction to big data processing with Apache Spark"
 





Track A-2: Spark-Based Data Analytics

  • 1. Spark Drives Big Data Analytics Application (Spark-Based Data Analytics)
      James Chen, Etu CTO
      June 16, 2015
  • 2. Agenda
      • Spark Brief
      • What Cloudera is doing on Spark
      • Spark Use Cases
      • Cloudera’s Position on Spark
      • Etu and Cloudera
  • 3. A Brief Review of MapReduce
      Key advances by MapReduce:
      • Data locality: input splits are computed automatically and mappers are launched close to the data
      • Fault tolerance: intermediate results are written out and mappers are restartable, which made it possible to run on commodity hardware
      • Linear scalability: locality combined with a programming model that forces developers to write generally scalable solutions to problems
      [Figure: a row of Map tasks feeding a smaller row of Reduce tasks]
  • 4. MapReduce: The Good
      • Built-in fault tolerance
      • Optimized IO path
      • Scalable
      • Developer focuses on Map/Reduce, not infrastructure
      • Simple(?) API
  • 5. MapReduce: The Bad
      • Optimized for disk IO
          – Doesn’t leverage memory
          – Iterative algorithms go through the disk IO path again and again
      • Primitive API
          – Developers have to build on a very simple abstraction
          – Key/value in, key/value out
          – Even basic operations such as joins require extensive code
      • Results are often spread over many files that need to be combined appropriately
  • 6. What is Spark?
      Spark is a general-purpose computational framework with more flexibility than MapReduce.
      Key properties:
      • Leverages distributed memory
      • Full directed-graph expressions for data-parallel computations
      • Improved developer experience
      Yet it retains linear scalability, fault tolerance, and data-locality-based computation.
      Reference: https://www.cs.berkeley.edu/~matei/papers/2012/nsdi_spark.pdf
  • 7. Spark: Easy and Fast Big Data
      Easy to develop (2-5× less code):
          – Highly productive language support
          – Clean and expressive APIs
          – Interactive shell
          – Out-of-the-box functionality
      Fast to run (up to 10× faster on disk, 100× in memory):
          – General execution graphs
          – In-memory storage
  • 8. Spark Easy: Example – Word Count
      Hadoop MapReduce:
          public static class WordCountMapClass extends MapReduceBase
              implements Mapper<LongWritable, Text, Text, IntWritable> {
            private final static IntWritable one = new IntWritable(1);
            private Text word = new Text();

            // Emit (word, 1) for every token in the input line.
            public void map(LongWritable key, Text value,
                            OutputCollector<Text, IntWritable> output,
                            Reporter reporter) throws IOException {
              String line = value.toString();
              StringTokenizer itr = new StringTokenizer(line);
              while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                output.collect(word, one);
              }
            }
          }

          public static class WordCountReduce extends MapReduceBase
              implements Reducer<Text, IntWritable, Text, IntWritable> {
            // Sum the counts collected for each word.
            public void reduce(Text key, Iterator<IntWritable> values,
                               OutputCollector<Text, IntWritable> output,
                               Reporter reporter) throws IOException {
              int sum = 0;
              while (values.hasNext()) {
                sum += values.next().get();
              }
              output.collect(key, new IntWritable(sum));
            }
          }
      Spark:
          // Bracketed arguments are optional.
          val spark = new SparkContext(master, appName, [sparkHome], [jars])
          val file = spark.textFile("hdfs://...")
          val counts = file.flatMap(line => line.split(" "))
                           .map(word => (word, 1))
                           .reduceByKey(_ + _)
          counts.saveAsTextFile("hdfs://...")
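The data flow of the Spark word count (flatMap, then map to pairs, then reduceByKey) can be tried on a local list without a cluster. The following is a minimal plain-Python sketch of that flow, not Spark code; the function name `word_count` and the sample lines are illustrative assumptions.

```python
from collections import defaultdict
from itertools import chain


def word_count(lines):
    """Mimic flatMap -> map -> reduceByKey on an in-memory list of lines."""
    # flatMap: split every line into words and flatten the result
    words = chain.from_iterable(line.split(" ") for line in lines)
    # map: turn each word into a (word, 1) pair
    pairs = ((word, 1) for word in words)
    # reduceByKey(_ + _): sum the values that share a key
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)


if __name__ == "__main__":
    sample = ["to be or", "not to be"]  # hypothetical input
    print(word_count(sample))
```

In Spark the same three steps run per partition, with a shuffle before the per-key sum; the sketch collapses all of that into one process.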
  • 9. Out-of-the-Box Functionality
      Hadoop integration:
      • Works with Hadoop data
      • Runs with YARN
      Libraries:
      • MLlib
      • Spark Streaming
      • GraphX (alpha)
      Language support:
      • Improved Python support
      • SparkR
      • Java 8
      • Schema support in Spark’s APIs
  • 10. Example: Logistic Regression
          # Load the points once and keep them in memory across iterations.
          data = spark.textFile(...).map(readPoint).cache()
          w = numpy.random.rand(D)
          for i in range(iterations):
              # Gradient of the logistic loss, summed over all points.
              gradient = data.map(
                  lambda p: (1 / (1 + exp(-p.y * w.dot(p.x))) - 1) * p.y * p.x
              ).reduce(lambda x, y: x + y)
              w -= gradient
          print("Final w: %s" % w)
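The loop on this slide is plain gradient descent on the logistic loss, where each point contributes (1 / (1 + exp(-y * w·x)) - 1) * y * x to the gradient. Below is a small single-machine sketch of the same update in standard-library Python, assuming labels of +1 and -1 and a step size of 1; the names `dot` and `train` and the sample points are hypothetical, and no Spark or NumPy call is used.

```python
import math
import random


def dot(a, b):
    """Inner product of two equal-length lists."""
    return sum(ai * bi for ai, bi in zip(a, b))


def train(points, dims, iterations=50, seed=0):
    """Gradient descent for logistic regression with labels in {-1, +1}.

    points is a list of (x, y) pairs, where x is a list of floats.
    """
    rng = random.Random(seed)
    w = [rng.random() for _ in range(dims)]
    for _ in range(iterations):
        gradient = [0.0] * dims
        # "map" step: per-point gradient; "reduce" step: sum them up
        for x, y in points:
            scale = (1.0 / (1.0 + math.exp(-y * dot(w, x))) - 1.0) * y
            for j in range(dims):
                gradient[j] += scale * x[j]
        w = [wj - gj for wj, gj in zip(w, gradient)]
    return w


if __name__ == "__main__":
    # Hypothetical, linearly separable toy data
    toy = [([1.0, 2.0], 1), ([2.0, 1.5], 1), ([-1.0, -1.5], -1), ([-2.0, -1.0], -1)]
    print("Final w: %s" % train(toy, dims=2))
```

The inner loop over `points` is the part Spark distributes; because the points are reread on every iteration, keeping them cached in memory is what makes later iterations cheap.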
  • 11. Memory Management Leads to Greater Performance
      • A Hadoop cluster with 100 nodes contains 10+ TB of RAM today, and that will double next year
      • 1 GB of RAM costs roughly $10-$20
      • Trends:
          – Price halves every 18 months
          – Bandwidth doubles every 3 years
      [Figure: a node with 64-128 GB RAM and 16 cores, labeled 50 GB per sec]
      Memory can be an enabler for high-performance big data applications.
  • 12. Fast: Using RAM, Operator Graphs
      In-memory caching:
      • Data partitions are read from RAM instead of disk
      Operator graphs:
      • Scheduling optimizations
      • Fault tolerance
      [Figure: operator graph of RDDs A to F connected by map, join, filter, groupBy, and take; the legend marks RDDs and cached partitions]
  • 13. Expressiveness of Programming Model
      [Figure: three job shapes: pipelined MapReduce jobs; iterative jobs (machine learning); efficient group-by aggregations and other analytics]
  • 14. Logistic Regression Performance (Data Fits in Memory)
      [Figure: running time (s), 0 to 4000, versus number of iterations (1, 5, 10, 20, 30), for Hadoop and Spark]
      • Hadoop: 110 s per iteration
      • Spark: 80 s for the first iteration, 1 s for further iterations
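As a rough worked example, the per-iteration figures quoted on this slide can be turned into total running times. The linear cost model below is an assumption that simply extrapolates those figures; the function names are hypothetical.

```python
HADOOP_SECONDS_PER_ITERATION = 110   # from the slide
SPARK_FIRST_ITERATION_SECONDS = 80   # from the slide (loads and caches the data)
SPARK_LATER_ITERATION_SECONDS = 1    # from the slide (reads cached data)


def hadoop_total(iterations):
    """Every iteration rereads the input, so cost grows linearly."""
    return HADOOP_SECONDS_PER_ITERATION * iterations


def spark_total(iterations):
    """Only the first iteration pays the load cost."""
    if iterations == 0:
        return 0
    return SPARK_FIRST_ITERATION_SECONDS + SPARK_LATER_ITERATION_SECONDS * (iterations - 1)


if __name__ == "__main__":
    for n in (1, 5, 10, 20, 30):
        print(n, hadoop_total(n), spark_total(n))
```

Under this model, 30 iterations take 3300 s on Hadoop and 109 s on Spark, which matches the top of the chart's 0 to 4000 s axis.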
  • 15. Agenda
      • Spark Brief
      • What Cloudera is doing on Spark
      • Spark Use Cases
      • Cloudera’s Position on Spark
      • Etu and Cloudera
  • 16. Spark Engineering in Cloudera
      • Cloudera embraced Spark in early 2014
      • Engineering with Intel to broaden the Spark ecosystem
          – Hive-on-Spark
          – Pig-on-Spark
          – Spark-over-YARN
          – Spark Streaming reliability
          – General Spark optimization
  • 17. Hive on Spark
      • Technology
          – Hive: the “standard” SQL tool in Hadoop
          – Spark: next-gen distributed processing framework
          – Hive + Spark: performance with a minimum feature gap
      • Industry
          – Many customers have invested heavily in Hive
          – They want to leverage the Spark engine
  • 18. Design Principles
      • No or limited impact on Hive’s existing code path
      • Maximize code reuse
      • Minimum feature customization
      • Low future maintenance cost
  • 19. Class Hierarchy
      [Figure: three parallel class hierarchies; a TaskCompiler generates a Task, and a Task is described by a Work]
      • TaskCompiler: MapRedCompiler, TezCompiler, SparkCompiler
      • Task: MapRedTask, TezTask, SparkTask
      • Work: MapRedWork, TezWork, SparkWork
  • 20. Work – Metadata for Task
      • MapReduceWork contains one MapWork and an optional ReduceWork
      • SparkWork contains a graph of MapWorks and ReduceWorks
      Query: select name, sum(value) as v from dec group by name order by v;
      [Figure: with MapReduce the query runs as two jobs (MR Job 1: MapWork1, ReduceWork1; MR Job 2: MapWork2, ReduceWork2); with Spark it runs as one job (MapWork1, ReduceWork1, ReduceWork2)]
  • 21. Data Processing via Spark
      • Treat the table as a HadoopRDD (input RDD)
      • Apply the function that wraps MR’s map-side processing
      • Shuffle the map output using Spark’s transformations (groupByKey, sortByKey, etc.)
      • Apply the function that wraps MR’s reduce-side processing
  • 22. Spark Plan
      • MapInput – encapsulates a table
      • MapTran – map-side processing
      • ShuffleTran – shuffling
      • ReduceTran – reduce-side processing
      Query: select name, sum(value) as v from dec group by name order by v;
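To make the map-side, shuffle, and reduce-side stages concrete, here is a plain-Python sketch that runs the example query over in-memory rows. It is an illustration only and not Hive or Spark code: the function names `map_side`, `shuffle_group`, `reduce_side`, and `shuffle_sort` are hypothetical stand-ins for the roles the slides give to MapTran, ShuffleTran, and ReduceTran, and the sample rows are made up.

```python
from collections import defaultdict


def map_side(rows):
    """Map-side processing: project each row to a (name, value) pair."""
    for row in rows:
        yield row["name"], row["value"]


def shuffle_group(pairs):
    """First shuffle (groupByKey-like): bring equal keys together."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups.items()


def reduce_side(groups):
    """Reduce-side processing: sum(value) per name."""
    for name, values in groups:
        yield name, sum(values)


def shuffle_sort(pairs):
    """Second shuffle (sortByKey-like): order by the aggregate v."""
    return sorted(pairs, key=lambda pair: pair[1])


def run_query(rows):
    """select name, sum(value) as v from dec group by name order by v"""
    return shuffle_sort(reduce_side(shuffle_group(map_side(rows))))


if __name__ == "__main__":
    dec = [  # hypothetical contents of table "dec"
        {"name": "a", "value": 5},
        {"name": "b", "value": 1},
        {"name": "a", "value": 2},
        {"name": "b", "value": 3},
    ]
    print(run_query(dec))
```

The two shuffles are why MapReduce needs two jobs for this query, while a graph-shaped plan can chain them in a single job.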
  • 23. Current Status
      • All functionality in Hive is implemented
      • First round of optimization is completed
          – Map join, SMB
          – Split generation and grouping
          – CBO, vectorization
      • More optimization and benchmarking coming
      • Beta in CDH
          – http://archive-primary.cloudera.com/cloudera-labs/hive-on-spark/
          – http://www.cloudera.com/content/cloudera/en/documentation/hive-spark/latest/PDF/hive-spark-get-started.pdf
  • 24. Agenda
      • Spark Brief
      • What Cloudera is doing on Spark
      • Spark Use Cases
      • Cloudera’s Position on Spark
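  • 25. Sample Use Cases
      • Conviva
          – Use case: optimize end users’ online video experience through real-time analysis of traffic patterns and finer-grained traffic control
          – Spark’s value: rapid prototyping; business logic shared between offline and online computation; open-source machine learning algorithms
      • Yahoo!
          – Use case: speed up the model training cycle for ad targeting, with feature extraction improved 3×; content recommendation using collaborative filtering
          – Spark’s value: lower data pipeline latency; iterative machine learning; efficient P2P broadcast
      • Anonymous (Large Tech Company)
          – Use case: near-real-time log aggregation and analysis for monitoring and alerting
          – Spark’s value: low-latency, high-frequency “mini” batch jobs that process the latest data
      • Technicolor
          – Use case: real-time analytics for (telecom) customers; stream processing and real-time query capability
          – Spark’s value: simple deployment that needs only Spark and Spark Streaming; ad hoc queries on live data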
  • 26. Cloudera Use Cases in Verticals
      • Large Tech Company – Spark is used for new machine learning investigations for search personalization
      • Financial Services – Process millions of stock positions and future scenarios in 4 hours with Spark (compared with 1 week using MapReduce)
      • University – Genomics research using Spark pipelines
      • Video – Spark and Spark Streaming for video streaming and analysis
      • Hospital – Spark for predictive modeling of disease conditions
  • 27. Cloudera Use Cases with Different Components
      • Run ETL on Spark using Pig
          – To achieve very tight SLAs
          – Accenture Smart Water Application
      • Spark analytics over HBase
          – Patients’ physiological data, experiment data, and user data
          – Serving researchers
      • Traffic analysis using MLlib clustering at Dylan
      • Annotated variants analysis on Spark
          – Using the Spark/Java framework at Duke
      • Sepsis detection with Spark Streaming
  • 28. Near Real-Time Dashboard by Edmunds.com
      • A car shopping website where people from all across the nation come to read reviews, compare prices, and in general get help in all matters car related.
      • The goal was to build a near real-time dashboard, engineered in a couple of weeks, that would provide both unique visitor and page view counts per make and per make/model.
      • In the past, these updates were restricted to hourly granularity with an additional hour of delay.
      • Furthermore, as this data was not available in an easy-to-use dashboard, manual processing was needed to visualize it.
  • 32. Total UV by Maker/Model
  • 33. Case Study in Etu Insight
      • Problem domain:
          – Analyze user behavior from the web site interaction log
          – Analyze user behavior from existing offline data
          – Aggregate the data, grouped by time and by user
      • Approach:
          – ETL process from the web log to Hive structured data
          – Import existing database data
          – Define and implement the aggregation function in Spark (with Scala)
          – Output the calculation result to HBase
  • 34. Architecture & Flow
      [Figure: web log and user data flow into Hive (structured data), then into Spark, then into HBase]
  • 36. Advanced Analytics with Spark
      • Written by the Cloudera data science team
          – First-ever book bridging ML with the Hadoop ecosystem
          – Focuses on use cases and examples rather than serving as a manual
          – Aimed at data scientists solving real-world analysis problems
          – Generally available in May 2015
  • 37. Analyzing Big Data
      • Build a model to detect credit card fraud using thousands of features and billions of transactions
      • Intelligently recommend millions of products to millions of users
      • Estimate financial risk through simulations of portfolios including millions of instruments
      • Easily manipulate data from thousands of human genomes to detect genetic associations with disease
  • 38. Agenda
      • Spark Brief
      • What Cloudera is doing on Spark
      • Spark Use Cases
      • Cloudera’s Position on Spark
      • Etu and Cloudera
  • 39. Cloudera’s Investment in Spark
      Spark is a fully integrated and supported part of Cloudera’s enterprise data hub.
      • First vendor to ship and support Spark
          – Invested early to make it a cohesive part of the platform
          – Complemented by Intel’s early investment
          – Developed and supported in collaboration with Databricks to ensure success
      • Only vendor with Spark committers on staff
      • Several Spark use cases in production
      • Well-trained support staff and external training courses
  • 40. Hadoop in the Spark World
      [Figure: stack diagram containing HDFS and HBase, YARN, MapReduce2, Spark, Impala, Search, Hive, Pig, Spark Streaming, GraphX, MLlib, and SparkSQL; the legend distinguishes core Hadoop, supported Spark components, and unsupported add-ons]
  • 41. Focusing on Open Standards, not just Open Source
      Open standards are just as important as open source. Why does it matter?
      • Diverse engineering is more sustainable.
      • Broad support ensures vendor portability.
      • Project utility depends on ecosystem compatibility, which depends on standards.
      Cloudera leads in defining the de facto open standards adopted by the market.
      Vendor support:
      Component (Founder)          Cloudera  Pivotal  MapR  Amazon  IBM  Hortonworks
      Spark (UC Berkeley)          ✔         ✔        ✔     ✔       ✔    ✔
      Impala (Cloudera)            ✔         ✖        ✔     ✔       ✖    ✖
      Hue (Cloudera)               ✔         ✖        ✔     ✔       ✖    ✔
      Sentry (Cloudera)            ✔         ✔        ✔     ✖       ✔    ✖
      Flume (Cloudera)             ✔         ✔        ✔     ✖       ✔    ✔
      Parquet (Cloudera/Twitter)   ✔         ✔        ✔     ✔       ✔    ✖
      Sqoop (Cloudera)             ✔         ✔        ✔     ✔       ✔    ✔
      Falcon                       ✖         ✖        ✖     ✖       ✖    ✔
      Knox                         ✖         ✖        ✖     ✖       ✖    ✔
      Tez                          ✖         ✖        ✔     ✖       ✖    ✔
      Ranger                       ✖         ✖        ✖     ✖       ✖    ✔
  • 42. Cloudera’s Position on Spark
      Cloudera is a member of, and aligned with, the broader Spark community.
      Spark:
      • Will replace MapReduce as the general-purpose Hadoop framework
          – Broad community and vendor adoption
          – Hadoop ecosystem integration (native and 3rd party)
      • Goes beyond data science and machine learning
          – Cloudera is working on Spark Core, Streaming, Security, YARN, and MLlib
      • Does not replace special-purpose frameworks
          – One size does not fit all for SQL, Search, Graph, Stream
  • 43. Agenda
      • Spark Brief
      • What Cloudera is doing on Spark
      • Spark Use Cases
      • Cloudera’s Position on Spark
      • Etu and Cloudera
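  • 45. Etu’s Positioning and Value in Enterprise Hadoop
      [Figure: enterprise Hadoop adoption tasks ranked from hard to easy, covering talent recruitment, team building, program development, data architecture, mining, design, deployment and tuning, operations and management, application platform, resource allocation, core value, and capturing the market; mapped to Etu Manager, Etu Professional Service, Etu Consulting, Etu Training, and Etu Support (Etu Services)]
      Standardization and automation reduce the complexity of deploying and operating a Hadoop platform:
      • Less effort: on-site installation and tuning, project technical services
      • Less time: consulting and training that help teams get up to speed quickly
      • Peace of mind: local technical support lowers adoption risk
      • Insight: years of shared experience that clear the key bottlenecks
  • 46. Etu Big Data Software Platform and Services
      [Figure: Etu Manager (Cloudera Manager inside) alongside Cloudera Manager and Cloudera Support; Etu Services comprise Etu Support, Etu Professional Service, Etu Consulting, and Etu Training]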
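  • 48. Etu Services
      Etu Support, 8×5 (billed per year):
      • Etu Manager feature module updates
      • Technical consultation by email for HDFS / MapReduce / HBase / Pig / Hive / Impala / Spark
      • Upgrade and update packages in step with CDH
      • Customer issue management
      Etu Professional Service (billed per person-day):
      • Hadoop cluster planning and design
      • Hadoop software architecture and data model design
      • Hadoop system installation and setup (on-site)
      • Hadoop data processing and application software development
      • Hadoop cluster maintenance, inspection, and tuning (on-site)
      • Hadoop data migration services
      Etu Consulting (billed per person-day):
      • Cluster planning and network architecture design and consulting
      • Application architecture design and consulting
      Etu Training (billed per attendee):
      • Standard courses: a Hadoop learning map with hands-on big data training tailored to different job roles
      • Private classes for enterprises
  • 49. Learn More
      • Booth 4: Etu Data Lake
      • Booth 5: Cloudera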
  • 51. Spark Concepts - Overview
      • Driver & Workers
      • RDD – Resilient Distributed Dataset
      • Transformations
      • Actions
      • Caching
  • 53. RDD – Resilient Distributed Dataset
      • Read-only partitioned collection of records
      • Created through:
          – Transformation of data in storage
          – Transformation of RDDs
      • Contains lineage to compute from storage
      • Lazy materialization
      • Users control persistence and partitioning
  • 54. Operations
      • Transformations: Map, Filter, Sample, Join
      • Actions: Reduce, Count, First, Take, SaveAs
  • 55. Operations
      • Transformations create a new RDD from an existing one
      • Actions run a computation on an RDD and return a value
      • Transformations are lazy
      • Actions materialize RDDs by computing transformations
      • RDDs can be cached to avoid re-computing
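The split between lazy transformations and eager actions can be illustrated with a toy class in plain Python. This is a sketch under simplifying assumptions (a single process, no partitions), not Spark's implementation; the class name `LazyDataset` and its methods are hypothetical.

```python
class LazyDataset:
    """Toy dataset: transformations are recorded, actions run them."""

    def __init__(self, source, ops=()):
        self._source = source    # zero-argument callable returning an iterable
        self._ops = tuple(ops)   # recorded (kind, function) pairs

    # Transformations: return a new dataset, compute nothing.
    def map(self, fn):
        return LazyDataset(self._source, self._ops + (("map", fn),))

    def filter(self, predicate):
        return LazyDataset(self._source, self._ops + (("filter", predicate),))

    def _compute(self):
        data = iter(self._source())
        for kind, fn in self._ops:
            data = map(fn, data) if kind == "map" else filter(fn, data)
        return data

    # Actions: trigger the recorded transformations and return a value.
    def collect(self):
        return list(self._compute())

    def count(self):
        return sum(1 for _ in self._compute())


if __name__ == "__main__":
    calls = []

    def traced_square(x):
        calls.append(x)  # record when the function actually runs
        return x * x

    ds = LazyDataset(lambda: range(5)).map(traced_square).filter(lambda v: v % 2 == 0)
    print(calls)         # still empty: nothing has run yet
    print(ds.collect())  # the action runs the whole chain
```

Each action here recomputes the chain from the source, which is the behavior that caching is meant to avoid.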
  • 56. Fault-Tolerance
      • RDDs contain lineage
      • Lineage: the source location and the list of transformations
      • Lost partitions can be re-computed from source data
          msgs = textFile.filter(lambda s: s.startswith("ERROR"))
                         .map(lambda s: s.split("\t")[2])
      [Figure: HDFS File → filter (func = startsWith(…)) → Filtered RDD → map (func = split(...)) → Mapped RDD]
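Lineage-based recovery can be sketched in a few lines of plain Python: keep the source partitions and the list of transformations, and rebuild only the partition that was lost. This is an illustration under simplifying assumptions, not Spark's mechanism; the names `compute_partition` and `recover`, and the sample log lines, are hypothetical.

```python
def compute_partition(source_partition, lineage):
    """Replay the recorded transformations over one source partition."""
    data = list(source_partition)
    for kind, fn in lineage:
        if kind == "filter":
            data = [record for record in data if fn(record)]
        elif kind == "map":
            data = [fn(record) for record in data]
        else:
            raise ValueError("unknown transformation: %s" % kind)
    return data


def recover(partitions, source, lineage, lost_index):
    """Rebuild a single lost partition from its source data and the lineage."""
    repaired = dict(partitions)
    repaired[lost_index] = compute_partition(source[lost_index], lineage)
    return repaired


# Lineage matching the slide: keep ERROR lines, then take the third tab-separated field.
LINEAGE = [
    ("filter", lambda s: s.startswith("ERROR")),
    ("map", lambda s: s.split("\t")[2]),
]

if __name__ == "__main__":
    source = [  # hypothetical source partitions
        ["ERROR\tdisk\tfull", "INFO\tjob\tdone"],
        ["ERROR\tnet\tdown"],
    ]
    partitions = {i: compute_partition(p, LINEAGE) for i, p in enumerate(source)}
    del partitions[1]  # simulate losing one partition
    print(recover(partitions, source, LINEAGE, lost_index=1))
```

Only the lost partition is recomputed; the surviving partition is reused as-is, which is why no replicated copy of the derived data is needed.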
  • 57. Caching
      • persist() and cache() only mark the data for caching
      • The RDD is cached after the first action
      • Fault-tolerant: lost partitions will be re-computed
      • If there is not enough memory, some partitions will not be cached
      • Future actions are performed on the cached partitions, so they are much faster
      Use caching for iterative algorithms.
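The "mark now, materialize on first action" behavior can be sketched with a toy class in plain Python. This is not Spark's implementation: it ignores partitions, memory limits, and storage levels, and the class name `CachingDataset` and its `computations` counter are hypothetical.

```python
class CachingDataset:
    """Toy dataset that caches its contents after the first action."""

    def __init__(self, compute):
        self._compute = compute      # zero-argument callable producing the records
        self._cache_enabled = False
        self._cached = None
        self.computations = 0        # how many times the data was recomputed

    def cache(self):
        # Only marks the dataset; nothing is computed here.
        self._cache_enabled = True
        return self

    def _materialize(self):
        if self._cached is not None:
            return self._cached
        self.computations += 1
        data = list(self._compute())
        if self._cache_enabled:
            self._cached = data
        return data

    # Actions
    def count(self):
        return len(self._materialize())

    def total(self):
        return sum(self._materialize())


if __name__ == "__main__":
    cached = CachingDataset(lambda: (x * x for x in range(4))).cache()
    print(cached.computations)                  # nothing computed yet
    print(cached.count(), cached.total())       # first action fills the cache
    print(cached.computations)                  # later actions reuse it
```

An iterative algorithm runs many actions over the same dataset, so it benefits the most: without the `cache()` call, every action above would recompute the records.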
  • 58. Caching – Storage Levels
      • MEMORY_ONLY
      • MEMORY_AND_DISK
      • MEMORY_ONLY_SER
      • MEMORY_AND_DISK_SER
      • DISK_ONLY
      • MEMORY_ONLY_2, MEMORY_AND_DISK_2…
  • 59. Easy: Expressive API
      • map, filter, groupBy, sort, union, join, leftOuterJoin, rightOuterJoin
      • reduce, count, fold, reduceByKey, groupByKey, cogroup, cross, zip
      • sample, take, first, partitionBy, mapWith, pipe, save
      • ...