Spark Computing Model
wangxing
MapReduce
Spark
“Apache Spark is a fast and general-purpose cluster
computing system. It provides high-level APIs in Java, Scala, Python and R,
and an optimized engine that supports general execution graphs. It also
supports a rich set of higher-level tools including Spark SQL for SQL and
structured data processing, MLlib for machine learning, GraphX for graph
processing, and Spark Streaming.”
Unified Platform
Resilient Distributed Dataset
RDD is an immutable and partitioned collection:
● Resilient:
○ automatically recover from node failures
● Distributed:
○ partitioned across the nodes of the cluster so that it can be operated on in
parallel
● Dataset:
○ RDDs are created from a file in the Hadoop file system, or from an
existing Scala collection
Create RDD
“There are two ways to create RDDs: parallelizing an existing collection in your
driver program, or referencing a dataset in an external storage system, such as
a shared filesystem, HDFS, HBase, or any data source offering a Hadoop
InputFormat.”
1. val mydata = sc.parallelize(Array(1, 2, 3, 4, 5))
2. val mydata = sc.makeRDD(1 to 100, 2)
3. val mydata = sc.textFile("hdfs://sa-onlinehdm1-cnc0.hlg:8020/tmp/1.txt")
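Once created, an RDD knows how it was partitioned. A quick spark-shell check (continuing example 2 above):

val mydata = sc.makeRDD(1 to 100, 2)
println(mydata.partitions.length)  // 2
// glom() turns each partition into an array, so we can count elements per partition
println(mydata.glom().collect().map(_.length).mkString(","))  // 50,50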
Operate RDD
“Two types of operations: transformations, which create a new dataset (RDD)
from an existing one, and actions, which return a value to the driver program
after running a computation on the dataset.”
“All transformations in Spark are lazy, in that they do not compute their results
right away. Instead, they just remember the transformations applied to some
base dataset. The transformations are only computed when an action requires
a result to be returned to the driver program.”
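A minimal spark-shell sketch of this laziness (names are illustrative):

val nums = sc.parallelize(1 to 5)
// Nothing is computed here; the map is only recorded in the lineage.
val doubled = nums.map { x => println("computing " + x); x * 2 }
// collect() is an action, so the map finally runs (the printlns appear in local mode).
doubled.collect()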
Other Control
● persist / cache / unpersist:
○ stores partitions of the RDD in memory and/or on disk
○ can be stored using a different storage level
○ allows future actions to be much faster, a key tool for iterative
algorithms
● checkpoint:
○ saves the RDD to disk and forgets the lineage of the RDD
completely. This allows long lineages to be truncated.
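A sketch of both controls (the checkpoint directory is a hypothetical path):

import org.apache.spark.storage.StorageLevel

val rdd = sc.parallelize(1 to 1000000).map(_ * 2)
rdd.persist(StorageLevel.MEMORY_AND_DISK)  // keep partitions in memory, spill to disk
rdd.count()  // the first action materializes the cache
rdd.count()  // now served from the cache
rdd.unpersist()

sc.setCheckpointDir("hdfs://namenode:8020/tmp/checkpoints")  // hypothetical path
rdd.checkpoint()  // written out by the next action, truncating the lineage
rdd.count()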
Example: WordCount
val lines = sc.textFile("hdfs://sa-onlinehdm1-cnc0.hlg01:8020/tmp/1.txt")
val counts = lines.flatMap(line => line.split(" "))
  .map(word => (word, 1))
  .reduceByKey((a, b) => a + b)
counts.collect().foreach(println)
counts.saveAsTextFile("hdfs://sa-onlinehdm1-cnc0.hlg01:8020/tmp/counts")
Example: WordCount
scala> counts.toDebugString
(2) ShuffledRDD[8] at reduceByKey at <console>:23 []
+-(2) MapPartitionsRDD[7] at map at <console>:23 []
| MapPartitionsRDD[6] at flatMap at <console>:23 []
| MapPartitionsRDD[5] at textFile at <console>:21 []
| hdfs://xxx HadoopRDD[4] at textFile at <console>:21 []
Dependency
● Narrow dependencies:
○ allow for pipelined execution on one cluster node
○ easy fault recovery
● Wide dependencies:
○ require data from all parent partitions to be available and to be
shuffled across the nodes
○ a single failed node might cause a complete re-execution
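The distinction is visible in a lineage dump. A spark-shell sketch:

val pairs = sc.parallelize(1 to 100).map(x => (x % 10, x))  // map: narrow dependency
val summed = pairs.reduceByKey(_ + _)  // reduceByKey: wide dependency, needs a shuffle
println(summed.toDebugString)  // the indented +-(n) line marks the shuffle boundary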
DAG
● Directed:
○ Only in a single direction
● Acyclic:
○ No looping
Shuffle
● redistributes data among partitions
● partitions keys into buckets
● writes intermediate files to disk, which are fetched by the next stage of tasks
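A sketch of bucketing by key, with partition contents shown via glom():

import org.apache.spark.HashPartitioner

val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3), ("c", 4)))
// Hash keys into 2 buckets; records sharing a key land in the same partition.
val bucketed = pairs.partitionBy(new HashPartitioner(2))
bucketed.glom().collect().foreach(part => println(part.mkString(" ")))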
Stage
● A stage is a set of independent tasks of a Spark job;
● DAG of tasks is split up into stages at the boundaries where shuffle occurs;
● DAGScheduler runs the stages in topological order;
● Each Stage can either be a shuffle map stage, in which case its tasks'
results are input for another stage, or a result stage, in which case its
tasks directly compute the action that initiated a job.
Job Schedule
Spark Deploy
● Each application gets its own executor processes, isolating applications
from each other;
● The driver program must listen for and accept incoming connections from
its executors on the worker nodes;
● Because the driver schedules tasks on the cluster, it should be run close to
the worker nodes, preferably on the same local area network.
Tips
Use groupByKey/collect carefully
Use mapPartitions if initialization is heavy
Use Kryo serialization instead of Java serialization
Repartition if filter causes data skew
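Two of these tips as sketches (HeavyParser is hypothetical; lines is the RDD from the WordCount example):

// mapPartitions: pay the initialization cost once per partition, not once per record
val parsed = lines.mapPartitions { iter =>
  val parser = new HeavyParser()  // hypothetical, expensive to construct
  iter.map(line => parser.parse(line))
}

// Kryo instead of Java serialization; set before the SparkContext is created
import org.apache.spark.SparkConf
val conf = new SparkConf().set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")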
Spark Streaming
● Receives live input data streams and divides the data into batches of X
seconds;
● Treats each batch of data as RDDs and processes them using RDD
operations;
● The processed results of the RDD operations are returned in batches;
● Batch sizes as low as ½ sec, latency of about 1 sec.
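A sketch of the classic streaming word count (host and port are placeholders; 1-second batches):

import org.apache.spark.streaming.{Seconds, StreamingContext}

val ssc = new StreamingContext(sc, Seconds(1))  // divide the stream into 1-second batches
val lines = ssc.socketTextStream("localhost", 9999)  // placeholder source
val counts = lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)
counts.print()  // each batch is processed with RDD operations and printed
ssc.start()
ssc.awaitTermination()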
Spark Streaming
Window-based computations allow you to apply transformations over a sliding window
of data. Every time the window slides over a source DStream, the source RDDs that fall
within the window are combined and operated upon to produce the RDDs of the
windowed DStream.
● window length - The duration of the window.
● sliding interval - The interval at which the window operation is performed.
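For example, a 30-second window recomputed every 10 seconds, continuing the streaming sketch above (both durations must be multiples of the batch interval):

val windowedCounts = lines.flatMap(_.split(" ")).map((_, 1))
  .reduceByKeyAndWindow((a: Int, b: Int) => a + b, Seconds(30), Seconds(10))
windowedCounts.print()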
Spark SQL
● Query data via SQL on DataFrame, a distributed collection of data organized into
named columns;
● A DataFrame can be created from an existing RDD, from a Hive table, or from data
sources;
● DataFrame can be operated on as normal RDDs and can also be registered as a
temporary table. Registering as a table allows you to run SQL queries over its data.
Example: Run SQL on a text file.
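A sketch of that example in the Spark 1.x API used elsewhere in this deck (the path and the one-record-per-line "name,age" layout are assumptions):

val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.implicits._

case class Person(name: String, age: Int)
val people = sc.textFile("hdfs://namenode:8020/tmp/people.txt")  // hypothetical path
  .map(_.split(","))
  .map(p => Person(p(0), p(1).trim.toInt))
  .toDF()
people.registerTempTable("people")  // now SQL can run over its data
val teenagers = sqlContext.sql("SELECT name FROM people WHERE age BETWEEN 13 AND 19")
teenagers.collect().foreach(println)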
Thanks
Editor's Notes

1. Every round of computation has a Map phase and a Reduce phase, so a computation has to be expressed as a series of MapReduce rounds. Reduce output must be written to disk, and the shuffle in between involves heavy network transfer. MapReduce therefore has high latency and is a poor fit for iterative computation. Running SQL requires setting up Hive; running machine-learning algorithms requires Mahout.
2. Spark slices the data and computes on it in memory, running upwards of 100x faster than MapReduce. Spark lets developers write programs quickly in Java, Scala or Python. It is highly general: a single unified stack covers SQL queries, stream processing, machine learning and graph computation.
3. RDDs can depend on one another. Automatically rebuild on failure. Persistence for reuse (RAM and/or disk).
4. If each partition of an RDD is used by at most one partition of a single child RDD, the dependency is called a narrow dependency; if partitions of multiple child RDDs depend on it, it is a wide dependency.
5. Specifically, to run on a cluster, the SparkContext can connect to several types of cluster managers, which allocate resources across applications. Once connected, Spark acquires executors on nodes in the cluster, which are processes that run computations and store data for your application. Next, it sends your application code to the executors. Finally, SparkContext sends tasks to the executors to run.