SlideShare uma empresa Scribd logo
1 de 23
Baixar para ler offline
Prajod Vettiyattil 
Architect, Open source 
Wipro 
in.linkedin.com/in/prajod 
@prajods 
Apache Spark The Next Gen toolset for Big Data Processing 
Namitha M S 
Architect, Advanced Technologies 
Wipro 
in.linkedin.com/in/namithams 
Open Source India 
Nov 2014 
Bangalore
•Big Data 
•Hadoop stack and its limitations 
•Spark: An overview 
•Streaming, GraphX and MLlib 
•Performance characteristics of Spark 
Agenda
•Data too huge for normal systems 
•3 Vs: Volume, Variety, Velocity 
•Storage challenge 
•Analysis challenge 
•Query results take hours, days or months 
Big Data 
Data disks
The Big Data Analysis Triad 
Batch 
Interactive 
Streaming
The Hadoop stack 
•Distributed data processing 
•Fault tolerant 
•Process peta byte data sets 
•Ecosystem tools 
•Hive DB, Hbase 
•Pig 
•Storm 
•Hadoop 
•Map 
•Reduce 
•Shuffle, partition, sort 
•HDFS
Hadoop: Data flow 
Partition for target reducers 
Buffer in memory 
Map 
Input data files 
Sort each partition by key 
Merge all partitions and write to disk 
Potential spill to disk 
Merge round 1 
Merge round 2 
Merge round N 
http fetch from 
map node 
Reduce 
Merge sort 
… 
Output 
High disk I/O 
On Map nodes 
On Reduce nodes
•Batch mode 
•Only the batch layer in the Lambda pattern 
•No real time 
•No repetitive queries 
•Iterative algorithms 
•Interactive data querying 
•Poor support for distributed memory 
Limitations of Hadoop
Spark: An overview 
•“Over time, fewer projects will use MapReduce, and more will use Spark” 
•Doug Cutting, creator of Hadoop 
•New architecture: scale better and simplify 
•In memory processing for Big Data 
•Cached intermediate data sets 
•Multi-step DAG based execution 
•Resilient Distributed Data(RDD) sets 
•The core innovation in Spark
Spark Ecosystem tools 
Apache Spark 
Spark SQL 
Streaming 
MLlib 
GraphX 
Spark R 
Blink DB 
Shark 
Bagel
DAG Execution Engine 
Map 
Collect 
Filter 
Map 
Reduce 
Sort 
Collect 
DAG = Directed Acyclic Graph
•Resilient Distributed Data sets 
•Features 
•Read only 
•Fault tolerance without replication 
•Uses data lineage for recovery 
•Low network I/O 
•Partitions/Slices 
•parallel tasks 
RDD 
Disk 
Transform 1 
RDD 1 
Transform 2 
RDD 2 
Data partitions
Lambda architecture pattern 
•Used for Lambda architecture implementation 
•Batch layer 
•Speed layer 
•Serving layer 
Batch Layer 
Speed Layer 
Serving Layer 
Input 
Data consumers 
Query 
Query
Spark Streaming 
•For stream processing in Spark 
•Real time data 
•Like Twitter queries 
•Discretized streams(DStreams) 
•Micro batches 
•Sequence of RDDs
Discretized Streams 
Spark Streaming 
Spark 
Batches of x seconds 
Input 
Output
Why Spark Streaming 
•Near real time processing (0.5 – 2 sec latency) 
•Parallel recovery of lost nodes and stragglers 
•Implementation of Lambda architecture 
•Single engine for batch and stream 
•Not suited for low latency requirements 
•i.e., 100ms
Apache Storm vs Spark Streaming 
Feature 
Spark Streaming 
Storm 
Processing Model 
Micro-Batching 
Event Stream processing 
Message Delivery options 
Inherently fault tolerant, exactly once delivery 
At least once, at most once, exactly once 
Flexibility 
Coarse grained transformation 
Fine grained transformation 
Implemented in 
Scala 
Clojure 
Development Cost 
Common platform for both batch and stream 
Only stream – separate setup for batch 
Applicability 
Machine learning, Interactive analytics, near real time analytics 
Near real time analytics, Natural language processing
GraphX & MLlib 
• Data parallel Vs Graph Parallel processing 
• Wikipedia search vs Facebook connection search, Page 
rank 
• Spark MLlib implements high quality machine 
learning algorithms 
• Iterative Algorithm Paradigm 
• Leverage Spark’s in memory data sets 
( ) (t 1) t x  f x  
f(xt) f(xt+1) 
x(t) x(t+1)
Performance characteristics 
Performance of Spark 
•100x faster in memory 
•10x faster on disk 
Graph courtesy: spark.apache.org
Hadoop vs Spark 
Hadoop 
Spark 
Spark 
World Record 
100 TB * 
1 PB 
Data Size 
102.5 TB 
100 TB 
1000 TB 
Elapsed Time 
72 mins 
23 mins 
234 mins 
# Nodes 
2100 
206 
190 
# Cores 
50400 
6592 
6080 
# Reducers 
10,000 
29,000 
250,000 
Rate 
1.42 TB/min 
4.27 TB/min 
4.27 TB/min 
Rate/node 
0.67 GB/min 
20.7 GB/min 
22.5 GB/min 
Data courtesy: databricks.com
1 TB performance test: data per sec
1 TB performance test data rate vs RAM size
Apache Spark 
•New architecture 
•RDD, DAG 
•In memory processing 
•Map reduce and more 
•GraphX 
•MLlib 
•Spark streaming 
Summary 
Ecosystem tools 
•Spark R 
•Blink DB 
•Storm 
Spark performance 
•GBs per second 
•RAM to data size 
•Inflexion point
Questions 
Prajod Vettiyattil 
Architect, Open source 
Wipro 
@prajods 
in.linkedin.com/in/prajod 
Namitha M S 
Architect, Advanced Technologies 
Wipro 
in.linkedin.com/in/namithams

Mais conteúdo relacionado

Mais procurados

Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache Sparkdatamantra
 
Introduction to Apache Spark Developer Training
Introduction to Apache Spark Developer TrainingIntroduction to Apache Spark Developer Training
Introduction to Apache Spark Developer TrainingCloudera, Inc.
 
How Apache Spark fits into the Big Data landscape
How Apache Spark fits into the Big Data landscapeHow Apache Spark fits into the Big Data landscape
How Apache Spark fits into the Big Data landscapePaco Nathan
 
Intro to Apache Spark by Marco Vasquez
Intro to Apache Spark by Marco VasquezIntro to Apache Spark by Marco Vasquez
Intro to Apache Spark by Marco VasquezMapR Technologies
 
Fast Data Analytics with Spark and Python
Fast Data Analytics with Spark and PythonFast Data Analytics with Spark and Python
Fast Data Analytics with Spark and PythonBenjamin Bengfort
 
Introduction to Apache Spark
Introduction to Apache Spark Introduction to Apache Spark
Introduction to Apache Spark Hubert Fan Chiang
 
Introduction to spark
Introduction to sparkIntroduction to spark
Introduction to sparkHome
 
Introduction to Apache Spark Ecosystem
Introduction to Apache Spark EcosystemIntroduction to Apache Spark Ecosystem
Introduction to Apache Spark EcosystemBojan Babic
 
Processing Large Data with Apache Spark -- HasGeek
Processing Large Data with Apache Spark -- HasGeekProcessing Large Data with Apache Spark -- HasGeek
Processing Large Data with Apache Spark -- HasGeekVenkata Naga Ravi
 
Real time Analytics with Apache Kafka and Apache Spark
Real time Analytics with Apache Kafka and Apache SparkReal time Analytics with Apache Kafka and Apache Spark
Real time Analytics with Apache Kafka and Apache SparkRahul Jain
 
Apache Spark Tutorial
Apache Spark TutorialApache Spark Tutorial
Apache Spark TutorialAhmet Bulut
 
Strata NYC 2015: What's new in Spark Streaming
Strata NYC 2015: What's new in Spark StreamingStrata NYC 2015: What's new in Spark Streaming
Strata NYC 2015: What's new in Spark StreamingDatabricks
 
Apache spark - History and market overview
Apache spark - History and market overviewApache spark - History and market overview
Apache spark - History and market overviewMartin Zapletal
 
End-to-end Data Pipeline with Apache Spark
End-to-end Data Pipeline with Apache SparkEnd-to-end Data Pipeline with Apache Spark
End-to-end Data Pipeline with Apache SparkDatabricks
 
Let Spark Fly: Advantages and Use Cases for Spark on Hadoop
 Let Spark Fly: Advantages and Use Cases for Spark on Hadoop Let Spark Fly: Advantages and Use Cases for Spark on Hadoop
Let Spark Fly: Advantages and Use Cases for Spark on HadoopMapR Technologies
 
Extending the R API for Spark with sparklyr and Microsoft R Server with Ali Z...
Extending the R API for Spark with sparklyr and Microsoft R Server with Ali Z...Extending the R API for Spark with sparklyr and Microsoft R Server with Ali Z...
Extending the R API for Spark with sparklyr and Microsoft R Server with Ali Z...Databricks
 
Introduction to apache spark
Introduction to apache sparkIntroduction to apache spark
Introduction to apache sparkUserReport
 

Mais procurados (20)

Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache Spark
 
Apache Spark 101
Apache Spark 101Apache Spark 101
Apache Spark 101
 
Introduction to Apache Spark Developer Training
Introduction to Apache Spark Developer TrainingIntroduction to Apache Spark Developer Training
Introduction to Apache Spark Developer Training
 
How Apache Spark fits into the Big Data landscape
How Apache Spark fits into the Big Data landscapeHow Apache Spark fits into the Big Data landscape
How Apache Spark fits into the Big Data landscape
 
Intro to Apache Spark by Marco Vasquez
Intro to Apache Spark by Marco VasquezIntro to Apache Spark by Marco Vasquez
Intro to Apache Spark by Marco Vasquez
 
Fast Data Analytics with Spark and Python
Fast Data Analytics with Spark and PythonFast Data Analytics with Spark and Python
Fast Data Analytics with Spark and Python
 
Introduction to Apache Spark
Introduction to Apache Spark Introduction to Apache Spark
Introduction to Apache Spark
 
Introduction to spark
Introduction to sparkIntroduction to spark
Introduction to spark
 
Introduction to Apache Spark Ecosystem
Introduction to Apache Spark EcosystemIntroduction to Apache Spark Ecosystem
Introduction to Apache Spark Ecosystem
 
Spark architecture
Spark architectureSpark architecture
Spark architecture
 
Processing Large Data with Apache Spark -- HasGeek
Processing Large Data with Apache Spark -- HasGeekProcessing Large Data with Apache Spark -- HasGeek
Processing Large Data with Apache Spark -- HasGeek
 
Intro to Apache Spark
Intro to Apache SparkIntro to Apache Spark
Intro to Apache Spark
 
Real time Analytics with Apache Kafka and Apache Spark
Real time Analytics with Apache Kafka and Apache SparkReal time Analytics with Apache Kafka and Apache Spark
Real time Analytics with Apache Kafka and Apache Spark
 
Apache Spark Tutorial
Apache Spark TutorialApache Spark Tutorial
Apache Spark Tutorial
 
Strata NYC 2015: What's new in Spark Streaming
Strata NYC 2015: What's new in Spark StreamingStrata NYC 2015: What's new in Spark Streaming
Strata NYC 2015: What's new in Spark Streaming
 
Apache spark - History and market overview
Apache spark - History and market overviewApache spark - History and market overview
Apache spark - History and market overview
 
End-to-end Data Pipeline with Apache Spark
End-to-end Data Pipeline with Apache SparkEnd-to-end Data Pipeline with Apache Spark
End-to-end Data Pipeline with Apache Spark
 
Let Spark Fly: Advantages and Use Cases for Spark on Hadoop
 Let Spark Fly: Advantages and Use Cases for Spark on Hadoop Let Spark Fly: Advantages and Use Cases for Spark on Hadoop
Let Spark Fly: Advantages and Use Cases for Spark on Hadoop
 
Extending the R API for Spark with sparklyr and Microsoft R Server with Ali Z...
Extending the R API for Spark with sparklyr and Microsoft R Server with Ali Z...Extending the R API for Spark with sparklyr and Microsoft R Server with Ali Z...
Extending the R API for Spark with sparklyr and Microsoft R Server with Ali Z...
 
Introduction to apache spark
Introduction to apache sparkIntroduction to apache spark
Introduction to apache spark
 

Destaque

Apache Cassandra and Python for Analyzing Streaming Big Data
Apache Cassandra and Python for Analyzing Streaming Big Data Apache Cassandra and Python for Analyzing Streaming Big Data
Apache Cassandra and Python for Analyzing Streaming Big Data prajods
 
Apache Camel: The Swiss Army Knife of Open Source Integration
Apache Camel: The Swiss Army Knife of Open Source IntegrationApache Camel: The Swiss Army Knife of Open Source Integration
Apache Camel: The Swiss Army Knife of Open Source Integrationprajods
 
SparkR + Zeppelin
SparkR + ZeppelinSparkR + Zeppelin
SparkR + Zeppelinfelixcss
 
Event Driven Architecture with Apache Camel
Event Driven Architecture with Apache CamelEvent Driven Architecture with Apache Camel
Event Driven Architecture with Apache Camelprajods
 
Tokyo webmining発表資料 20111127
Tokyo webmining発表資料 20111127Tokyo webmining発表資料 20111127
Tokyo webmining発表資料 20111127kan_yukiko
 
テキストマイニングで発掘!? 売上とユーザーレビューの相関分析
テキストマイニングで発掘!? 売上とユーザーレビューの相関分析テキストマイニングで発掘!? 売上とユーザーレビューの相関分析
テキストマイニングで発掘!? 売上とユーザーレビューの相関分析Shintaro Takemura
 
Interactive Data Science From Scratch with Apache Zeppelin and Apache Spark
Interactive Data Science From Scratch with Apache Zeppelin and Apache SparkInteractive Data Science From Scratch with Apache Zeppelin and Apache Spark
Interactive Data Science From Scratch with Apache Zeppelin and Apache Sparkfelixcss
 
Spark Streamingによるリアルタイムユーザ属性推定
Spark Streamingによるリアルタイムユーザ属性推定Spark Streamingによるリアルタイムユーザ属性推定
Spark Streamingによるリアルタイムユーザ属性推定Yoshiyasu SAEKI
 
IoT時代におけるストリームデータ処理と急成長の Apache Flink
IoT時代におけるストリームデータ処理と急成長の Apache FlinkIoT時代におけるストリームデータ処理と急成長の Apache Flink
IoT時代におけるストリームデータ処理と急成長の Apache FlinkTakanori Suzuki
 
Apache Storm vs. Spark Streaming - two stream processing platforms compared
Apache Storm vs. Spark Streaming - two stream processing platforms comparedApache Storm vs. Spark Streaming - two stream processing platforms compared
Apache Storm vs. Spark Streaming - two stream processing platforms comparedGuido Schmutz
 
Spark architecture
Spark architectureSpark architecture
Spark architecturedatamantra
 
Spark Streamingを使ってみた ~Twitterリアルタイムトレンドランキング~
Spark Streamingを使ってみた ~Twitterリアルタイムトレンドランキング~Spark Streamingを使ってみた ~Twitterリアルタイムトレンドランキング~
Spark Streamingを使ってみた ~Twitterリアルタイムトレンドランキング~sugiyama koki
 
Sparkly Notebook: Interactive Analysis and Visualization with Spark
Sparkly Notebook: Interactive Analysis and Visualization with SparkSparkly Notebook: Interactive Analysis and Visualization with Spark
Sparkly Notebook: Interactive Analysis and Visualization with Sparkfelixcss
 
fluent-plugin-norikra #fluentdcasual
fluent-plugin-norikra #fluentdcasualfluent-plugin-norikra #fluentdcasual
fluent-plugin-norikra #fluentdcasualSATOSHI TAGOMORI
 
Why your Spark job is failing
Why your Spark job is failingWhy your Spark job is failing
Why your Spark job is failingSandy Ryza
 
ストリーム処理を支えるキューイングシステムの選び方
ストリーム処理を支えるキューイングシステムの選び方ストリーム処理を支えるキューイングシステムの選び方
ストリーム処理を支えるキューイングシステムの選び方Yoshiyasu SAEKI
 
Data Science lifecycle with Apache Zeppelin and Spark by Moonsoo Lee
Data Science lifecycle with Apache Zeppelin and Spark by Moonsoo LeeData Science lifecycle with Apache Zeppelin and Spark by Moonsoo Lee
Data Science lifecycle with Apache Zeppelin and Spark by Moonsoo LeeSpark Summit
 
Recommender Systems with Apache Spark's ALS Function
Recommender Systems with Apache Spark's ALS FunctionRecommender Systems with Apache Spark's ALS Function
Recommender Systems with Apache Spark's ALS FunctionWill Johnson
 

Destaque (20)

Apache Cassandra and Python for Analyzing Streaming Big Data
Apache Cassandra and Python for Analyzing Streaming Big Data Apache Cassandra and Python for Analyzing Streaming Big Data
Apache Cassandra and Python for Analyzing Streaming Big Data
 
Apache Camel: The Swiss Army Knife of Open Source Integration
Apache Camel: The Swiss Army Knife of Open Source IntegrationApache Camel: The Swiss Army Knife of Open Source Integration
Apache Camel: The Swiss Army Knife of Open Source Integration
 
SparkR + Zeppelin
SparkR + ZeppelinSparkR + Zeppelin
SparkR + Zeppelin
 
Event Driven Architecture with Apache Camel
Event Driven Architecture with Apache CamelEvent Driven Architecture with Apache Camel
Event Driven Architecture with Apache Camel
 
Tokyo webmining発表資料 20111127
Tokyo webmining発表資料 20111127Tokyo webmining発表資料 20111127
Tokyo webmining発表資料 20111127
 
テキストマイニングで発掘!? 売上とユーザーレビューの相関分析
テキストマイニングで発掘!? 売上とユーザーレビューの相関分析テキストマイニングで発掘!? 売上とユーザーレビューの相関分析
テキストマイニングで発掘!? 売上とユーザーレビューの相関分析
 
Interactive Data Science From Scratch with Apache Zeppelin and Apache Spark
Interactive Data Science From Scratch with Apache Zeppelin and Apache SparkInteractive Data Science From Scratch with Apache Zeppelin and Apache Spark
Interactive Data Science From Scratch with Apache Zeppelin and Apache Spark
 
Spark Streamingによるリアルタイムユーザ属性推定
Spark Streamingによるリアルタイムユーザ属性推定Spark Streamingによるリアルタイムユーザ属性推定
Spark Streamingによるリアルタイムユーザ属性推定
 
IoT時代におけるストリームデータ処理と急成長の Apache Flink
IoT時代におけるストリームデータ処理と急成長の Apache FlinkIoT時代におけるストリームデータ処理と急成長の Apache Flink
IoT時代におけるストリームデータ処理と急成長の Apache Flink
 
Apache Storm vs. Spark Streaming - two stream processing platforms compared
Apache Storm vs. Spark Streaming - two stream processing platforms comparedApache Storm vs. Spark Streaming - two stream processing platforms compared
Apache Storm vs. Spark Streaming - two stream processing platforms compared
 
Spark architecture
Spark architectureSpark architecture
Spark architecture
 
Spark Streamingを使ってみた ~Twitterリアルタイムトレンドランキング~
Spark Streamingを使ってみた ~Twitterリアルタイムトレンドランキング~Spark Streamingを使ってみた ~Twitterリアルタイムトレンドランキング~
Spark Streamingを使ってみた ~Twitterリアルタイムトレンドランキング~
 
Sparkly Notebook: Interactive Analysis and Visualization with Spark
Sparkly Notebook: Interactive Analysis and Visualization with SparkSparkly Notebook: Interactive Analysis and Visualization with Spark
Sparkly Notebook: Interactive Analysis and Visualization with Spark
 
fluent-plugin-norikra #fluentdcasual
fluent-plugin-norikra #fluentdcasualfluent-plugin-norikra #fluentdcasual
fluent-plugin-norikra #fluentdcasual
 
Why your Spark job is failing
Why your Spark job is failingWhy your Spark job is failing
Why your Spark job is failing
 
ストリームデータ分散処理基盤Storm
ストリームデータ分散処理基盤Stormストリームデータ分散処理基盤Storm
ストリームデータ分散処理基盤Storm
 
ストリーム処理を支えるキューイングシステムの選び方
ストリーム処理を支えるキューイングシステムの選び方ストリーム処理を支えるキューイングシステムの選び方
ストリーム処理を支えるキューイングシステムの選び方
 
Apache Spark の紹介(前半:Sparkのキホン)
Apache Spark の紹介(前半:Sparkのキホン)Apache Spark の紹介(前半:Sparkのキホン)
Apache Spark の紹介(前半:Sparkのキホン)
 
Data Science lifecycle with Apache Zeppelin and Spark by Moonsoo Lee
Data Science lifecycle with Apache Zeppelin and Spark by Moonsoo LeeData Science lifecycle with Apache Zeppelin and Spark by Moonsoo Lee
Data Science lifecycle with Apache Zeppelin and Spark by Moonsoo Lee
 
Recommender Systems with Apache Spark's ALS Function
Recommender Systems with Apache Spark's ALS FunctionRecommender Systems with Apache Spark's ALS Function
Recommender Systems with Apache Spark's ALS Function
 

Semelhante a Apache Spark: The Next Gen toolset for Big Data Processing

Global Big Data Conference Sept 2014 AWS Kinesis Spark Streaming Approximatio...
Global Big Data Conference Sept 2014 AWS Kinesis Spark Streaming Approximatio...Global Big Data Conference Sept 2014 AWS Kinesis Spark Streaming Approximatio...
Global Big Data Conference Sept 2014 AWS Kinesis Spark Streaming Approximatio...Chris Fregly
 
Taboola Road To Scale With Apache Spark
Taboola Road To Scale With Apache SparkTaboola Road To Scale With Apache Spark
Taboola Road To Scale With Apache Sparktsliwowicz
 
Hadoop Ecosystem
Hadoop EcosystemHadoop Ecosystem
Hadoop EcosystemLior Sidi
 
Paris Data Geek - Spark Streaming
Paris Data Geek - Spark Streaming Paris Data Geek - Spark Streaming
Paris Data Geek - Spark Streaming Djamel Zouaoui
 
(DAT203) Building Graph Databases on AWS
(DAT203) Building Graph Databases on AWS(DAT203) Building Graph Databases on AWS
(DAT203) Building Graph Databases on AWSAmazon Web Services
 
The Nitty Gritty of Advanced Analytics Using Apache Spark in Python
The Nitty Gritty of Advanced Analytics Using Apache Spark in PythonThe Nitty Gritty of Advanced Analytics Using Apache Spark in Python
The Nitty Gritty of Advanced Analytics Using Apache Spark in PythonMiklos Christine
 
11. From Hadoop to Spark 1:2
11. From Hadoop to Spark 1:211. From Hadoop to Spark 1:2
11. From Hadoop to Spark 1:2Fabio Fumarola
 
Jump Start on Apache Spark 2.2 with Databricks
Jump Start on Apache Spark 2.2 with DatabricksJump Start on Apache Spark 2.2 with Databricks
Jump Start on Apache Spark 2.2 with DatabricksAnyscale
 
Real-Time Analytics with Apache Cassandra and Apache Spark
Real-Time Analytics with Apache Cassandra and Apache SparkReal-Time Analytics with Apache Cassandra and Apache Spark
Real-Time Analytics with Apache Cassandra and Apache SparkGuido Schmutz
 
Big Data tools in practice
Big Data tools in practiceBig Data tools in practice
Big Data tools in practiceDarko Marjanovic
 
Big Data visualization with Apache Spark and Zeppelin
Big Data visualization with Apache Spark and ZeppelinBig Data visualization with Apache Spark and Zeppelin
Big Data visualization with Apache Spark and Zeppelinprajods
 
East Bay Java User Group Oct 2014 Spark Streaming Kinesis Machine Learning
 East Bay Java User Group Oct 2014 Spark Streaming Kinesis Machine Learning East Bay Java User Group Oct 2014 Spark Streaming Kinesis Machine Learning
East Bay Java User Group Oct 2014 Spark Streaming Kinesis Machine LearningChris Fregly
 
Big Data Processing with Apache Spark 2014
Big Data Processing with Apache Spark 2014Big Data Processing with Apache Spark 2014
Big Data Processing with Apache Spark 2014mahchiev
 
Large scale computing with mapreduce
Large scale computing with mapreduceLarge scale computing with mapreduce
Large scale computing with mapreducehansen3032
 
Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3
Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3
Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3Databricks
 
Taboola's experience with Apache Spark (presentation @ Reversim 2014)
Taboola's experience with Apache Spark (presentation @ Reversim 2014)Taboola's experience with Apache Spark (presentation @ Reversim 2014)
Taboola's experience with Apache Spark (presentation @ Reversim 2014)tsliwowicz
 
Building a unified data pipeline in Apache Spark
Building a unified data pipeline in Apache SparkBuilding a unified data pipeline in Apache Spark
Building a unified data pipeline in Apache SparkDataWorks Summit
 

Semelhante a Apache Spark: The Next Gen toolset for Big Data Processing (20)

Global Big Data Conference Sept 2014 AWS Kinesis Spark Streaming Approximatio...
Global Big Data Conference Sept 2014 AWS Kinesis Spark Streaming Approximatio...Global Big Data Conference Sept 2014 AWS Kinesis Spark Streaming Approximatio...
Global Big Data Conference Sept 2014 AWS Kinesis Spark Streaming Approximatio...
 
Apache Spark on HDinsight Training
Apache Spark on HDinsight TrainingApache Spark on HDinsight Training
Apache Spark on HDinsight Training
 
Taboola Road To Scale With Apache Spark
Taboola Road To Scale With Apache SparkTaboola Road To Scale With Apache Spark
Taboola Road To Scale With Apache Spark
 
Hadoop Ecosystem
Hadoop EcosystemHadoop Ecosystem
Hadoop Ecosystem
 
Paris Data Geek - Spark Streaming
Paris Data Geek - Spark Streaming Paris Data Geek - Spark Streaming
Paris Data Geek - Spark Streaming
 
Glint with Apache Spark
Glint with Apache SparkGlint with Apache Spark
Glint with Apache Spark
 
(DAT203) Building Graph Databases on AWS
(DAT203) Building Graph Databases on AWS(DAT203) Building Graph Databases on AWS
(DAT203) Building Graph Databases on AWS
 
The Nitty Gritty of Advanced Analytics Using Apache Spark in Python
The Nitty Gritty of Advanced Analytics Using Apache Spark in PythonThe Nitty Gritty of Advanced Analytics Using Apache Spark in Python
The Nitty Gritty of Advanced Analytics Using Apache Spark in Python
 
11. From Hadoop to Spark 1:2
11. From Hadoop to Spark 1:211. From Hadoop to Spark 1:2
11. From Hadoop to Spark 1:2
 
Jump Start on Apache Spark 2.2 with Databricks
Jump Start on Apache Spark 2.2 with DatabricksJump Start on Apache Spark 2.2 with Databricks
Jump Start on Apache Spark 2.2 with Databricks
 
Real-Time Analytics with Apache Cassandra and Apache Spark,
Real-Time Analytics with Apache Cassandra and Apache Spark,Real-Time Analytics with Apache Cassandra and Apache Spark,
Real-Time Analytics with Apache Cassandra and Apache Spark,
 
Real-Time Analytics with Apache Cassandra and Apache Spark
Real-Time Analytics with Apache Cassandra and Apache SparkReal-Time Analytics with Apache Cassandra and Apache Spark
Real-Time Analytics with Apache Cassandra and Apache Spark
 
Big Data tools in practice
Big Data tools in practiceBig Data tools in practice
Big Data tools in practice
 
Big Data visualization with Apache Spark and Zeppelin
Big Data visualization with Apache Spark and ZeppelinBig Data visualization with Apache Spark and Zeppelin
Big Data visualization with Apache Spark and Zeppelin
 
East Bay Java User Group Oct 2014 Spark Streaming Kinesis Machine Learning
 East Bay Java User Group Oct 2014 Spark Streaming Kinesis Machine Learning East Bay Java User Group Oct 2014 Spark Streaming Kinesis Machine Learning
East Bay Java User Group Oct 2014 Spark Streaming Kinesis Machine Learning
 
Big Data Processing with Apache Spark 2014
Big Data Processing with Apache Spark 2014Big Data Processing with Apache Spark 2014
Big Data Processing with Apache Spark 2014
 
Large scale computing with mapreduce
Large scale computing with mapreduceLarge scale computing with mapreduce
Large scale computing with mapreduce
 
Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3
Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3
Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3
 
Taboola's experience with Apache Spark (presentation @ Reversim 2014)
Taboola's experience with Apache Spark (presentation @ Reversim 2014)Taboola's experience with Apache Spark (presentation @ Reversim 2014)
Taboola's experience with Apache Spark (presentation @ Reversim 2014)
 
Building a unified data pipeline in Apache Spark
Building a unified data pipeline in Apache SparkBuilding a unified data pipeline in Apache Spark
Building a unified data pipeline in Apache Spark
 

Último

VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130Suhani Kapoor
 
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Callshivangimorya083
 
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...Delhi Call girls
 
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Callshivangimorya083
 
100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptx100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptxAnupama Kate
 
Brighton SEO | April 2024 | Data Storytelling
Brighton SEO | April 2024 | Data StorytellingBrighton SEO | April 2024 | Data Storytelling
Brighton SEO | April 2024 | Data StorytellingNeil Barnes
 
BPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptx
BPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptxBPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptx
BPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptxMohammedJunaid861692
 
04242024_CCC TUG_Joins and Relationships
04242024_CCC TUG_Joins and Relationships04242024_CCC TUG_Joins and Relationships
04242024_CCC TUG_Joins and Relationshipsccctableauusergroup
 
Generative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and MilvusGenerative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and MilvusTimothy Spann
 
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdfMarket Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdfRachmat Ramadhan H
 
VIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service Amravati
VIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service AmravatiVIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service Amravati
VIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service AmravatiSuhani Kapoor
 
Log Analysis using OSSEC sasoasasasas.pptx
Log Analysis using OSSEC sasoasasasas.pptxLog Analysis using OSSEC sasoasasasas.pptx
Log Analysis using OSSEC sasoasasasas.pptxJohnnyPlasten
 
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al BarshaAl Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al BarshaAroojKhan71
 
Midocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFxMidocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFxolyaivanovalion
 
VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...
VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...
VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...Suhani Kapoor
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...apidays
 
Introduction-to-Machine-Learning (1).pptx
Introduction-to-Machine-Learning (1).pptxIntroduction-to-Machine-Learning (1).pptx
Introduction-to-Machine-Learning (1).pptxfirstjob4
 
Invezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signalsInvezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signalsInvezz1
 

Último (20)

VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
 
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
 
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
 
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
 
100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptx100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptx
 
Brighton SEO | April 2024 | Data Storytelling
Brighton SEO | April 2024 | Data StorytellingBrighton SEO | April 2024 | Data Storytelling
Brighton SEO | April 2024 | Data Storytelling
 
BPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptx
BPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptxBPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptx
BPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptx
 
04242024_CCC TUG_Joins and Relationships
04242024_CCC TUG_Joins and Relationships04242024_CCC TUG_Joins and Relationships
04242024_CCC TUG_Joins and Relationships
 
Generative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and MilvusGenerative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and Milvus
 
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdfMarket Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
 
Delhi 99530 vip 56974 Genuine Escort Service Call Girls in Kishangarh
Delhi 99530 vip 56974 Genuine Escort Service Call Girls in  KishangarhDelhi 99530 vip 56974 Genuine Escort Service Call Girls in  Kishangarh
Delhi 99530 vip 56974 Genuine Escort Service Call Girls in Kishangarh
 
VIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service Amravati
VIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service AmravatiVIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service Amravati
VIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service Amravati
 
Log Analysis using OSSEC sasoasasasas.pptx
Log Analysis using OSSEC sasoasasasas.pptxLog Analysis using OSSEC sasoasasasas.pptx
Log Analysis using OSSEC sasoasasasas.pptx
 
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al BarshaAl Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
 
Midocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFxMidocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFx
 
VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...
VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...
VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
꧁❤ Aerocity Call Girls Service Aerocity Delhi ❤꧂ 9999965857 ☎️ Hard And Sexy ...
꧁❤ Aerocity Call Girls Service Aerocity Delhi ❤꧂ 9999965857 ☎️ Hard And Sexy ...꧁❤ Aerocity Call Girls Service Aerocity Delhi ❤꧂ 9999965857 ☎️ Hard And Sexy ...
꧁❤ Aerocity Call Girls Service Aerocity Delhi ❤꧂ 9999965857 ☎️ Hard And Sexy ...
 
Introduction-to-Machine-Learning (1).pptx
Introduction-to-Machine-Learning (1).pptxIntroduction-to-Machine-Learning (1).pptx
Introduction-to-Machine-Learning (1).pptx
 
Invezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signalsInvezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signals
 

Apache Spark: The Next Gen toolset for Big Data Processing

  • 1. Prajod Vettiyattil Architect, Open source Wipro in.linkedin.com/in/prajod @prajods Apache Spark The Next Gen toolset for Big Data Processing Namitha M S Architect, Advanced Technologies Wipro in.linkedin.com/in/namithams Open Source India Nov 2014 Bangalore
  • 2. •Big Data •Hadoop stack and its limitations •Spark: An overview •Streaming, GraphX and MLlib •Performance characteristics of Spark Agenda
  • 3. •Data too huge for normal systems •3 Vs: Volume, Variety, Velocity •Storage challenge •Analysis challenge •Query results take hours, days or months Big Data Data disks
  • 4. The Big Data Analysis Triad Batch Interactive Streaming
  • 5. The Hadoop stack •Distributed data processing •Fault tolerant •Process peta byte data sets •Ecosystem tools •Hive DB, Hbase •Pig •Storm •Hadoop •Map •Reduce •Shuffle, partition, sort •HDFS
  • 6. Hadoop: Data flow Partition for target reducers Buffer in memory Map Input data files Sort each partition by key Merge all partitions and write to disk Potential spill to disk Merge round 1 Merge round 2 Merge round N http fetch from map node Reduce Merge sort … Output High disk I/O On Map nodes On Reduce nodes
  • 7. •Batch mode •Only the batch layer in the Lambda pattern •No real time •No repetitive queries •Iterative algorithms •Interactive data querying •Poor support for distributed memory Limitations of Hadoop
  • 8. Spark: An overview •“Over time, fewer projects will use MapReduce, and more will use Spark” •Doug Cutting, creator of Hadoop •New architecture: scale better and simplify •In memory processing for Big Data •Cached intermediate data sets •Multi-step DAG based execution •Resilient Distributed Data(RDD) sets •The core innovation in Spark
  • 9. Spark Ecosystem tools Apache Spark Spark SQL Streaming MLlib GraphX Spark R Blink DB Shark Bagel
  • 10. DAG Execution Engine Map Collect Filter Map Reduce Sort Collect DAG = Directed Acyclic Graph
  • 11. •Resilient Distributed Data sets •Features •Read only •Fault tolerance without replication •Uses data lineage for recovery •Low network I/O •Partitions/Slices •parallel tasks RDD Disk Transform 1 RDD 1 Transform 2 RDD 2 Data partitions
  • 12. Lambda architecture pattern •Used for Lambda architecture implementation •Batch layer •Speed layer •Serving layer Batch Layer Speed Layer Serving Layer Input Data consumers Query Query
  • 13. Spark Streaming •For stream processing in Spark •Real time data •Like Twitter queries •Discretized streams(DStreams) •Micro batches •Sequence of RDDs
  • 14. Discretized Streams Spark Streaming Spark Batches of x seconds Input Output
  • 15. Why Spark Streaming •Near real time processing (0.5 – 2 sec latency) •Parallel recovery of lost nodes and stragglers •Implementation of Lambda architecture •Single engine for batch and stream •Not suited for low latency requirements •i.e., 100ms
  • 16. Apache Storm vs Spark Streaming Feature Spark Streaming Storm Processing Model Micro-Batching Event Stream processing Message Delivery options Inherently fault tolerant, exactly once delivery At least once, at most once, exactly once Flexibility Coarse grained transformation Fine grained transformation Implemented in Scala Clojure Development Cost Common platform for both batch and stream Only stream – separate setup for batch Applicability Machine learning, Interactive analytics, near real time analytics Near real time analytics, Natural language processing
  • 17. GraphX & MLlib • Data parallel Vs Graph Parallel processing • Wikipedia search vs Facebook connection search, Page rank • Spark MLlib implements high quality machine learning algorithms • Iterative Algorithm Paradigm • Leverage Spark’s in memory data sets ( ) (t 1) t x  f x  f(xt) f(xt+1) x(t) x(t+1)
  • 18. Performance characteristics Performance of Spark •100x faster in memory •10x faster on disk Graph courtesy: spark.apache.org
  • 19. Hadoop vs Spark Hadoop Spark Spark World Record 100 TB * 1 PB Data Size 102.5 TB 100 TB 1000 TB Elapsed Time 72 mins 23 mins 234 mins # Nodes 2100 206 190 # Cores 50400 6592 6080 # Reducers 10,000 29,000 250,000 Rate 1.42 TB/min 4.27 TB/min 4.27 TB/min Rate/node 0.67 GB/min 20.7 GB/min 22.5 GB/min Data courtesy: databricks.com
  • 20. 1 TB performance test: data per sec
  • 21. 1 TB performance test data rate vs RAM size
  • 22. Apache Spark •New architecture •RDD, DAG •In memory processing •Map reduce and more •GraphX •MLlib •Spark streaming Summary Ecosystem tools •Spark R •Blink DB •Storm Spark performance •GBs per second •RAM to data size •Inflexion point
  • 23. Questions Prajod Vettiyattil Architect, Open source Wipro @prajods in.linkedin.com/in/prajod Namitha M S Architect, Advanced Technologies Wipro in.linkedin.com/in/namithams