Apache Spark
Fundamentals
By
Zahra Eskandari
Course Agenda
• Spark Fundamentals
• 1 session
• Spark SQL
• 1.5 sessions
• Spark Streaming
• 0.5 sessions
• Spark MLlib & GraphX
• 1 session
Apache Spark Fundamentals
Agenda
• Apache Hadoop
• Spark overview
• Spark execution model
• Basic Spark programming Model
• Install Spark
• Scala
• First Example
• Second Example
• Running Spark Examples
• Your Turn!
Apache Hadoop
An overview of Apache Hadoop
• Open source software framework for distributed data processing
• Comes with Hadoop Distributed File System (HDFS)
• Map/Reduce for data processing
• YARN (Yet Another Resource Negotiator)
Map Reduce
• A generic data processing model
• Done through a map step and a reduce step
• In the map step, each data item is processed independently
• In the reduce step, the map step’s results are combined
Word Count
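The word-count figure from the original slide is not reproduced here. As a minimal sketch of the map/reduce idea, here is the same computation on a plain local Scala collection (no Hadoop involved; the input strings are made up for illustration):

```scala
// Map/reduce conceptually, on a plain Scala collection (illustrative only)
val lines = List("to be or not to be", "to do or not to do")

// Map step: each item is processed independently into (word, 1) pairs
val mapped = lines.flatMap(_.split(" ")).map(word => (word, 1))

// Reduce step: the map step's results are combined per key
val counts = mapped.groupBy(_._1).map { case (word, pairs) => (word, pairs.size) }

counts.foreach(println)  // e.g. (to,4), (be,2), (or,2), (not,2), (do,2)
```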
MapReduce limitations
• MapReduce use cases showed two major limitations:
• difficulty of programming directly in MR
• performance bottlenecks, or batch processing not fitting the use cases
• subsequent steps have to read intermediate results from disk
• In short, MR doesn’t compose well for large applications
Therefore, people built specialized systems as workarounds…
Spark
• Unlike the various specialized systems, Spark’s goal was to generalize
MapReduce to support new apps within same engine
• Two reasonably small additions are enough to express the previous
models:
• fast data sharing
• general DAGs
• This allows for an approach which is more efficient for the engine,
and much simpler for the end users
Spark Overview
Apache Spark Overview
• A general framework for distributed computing that offers high
performance for both batch and interactive processing
• an in-memory big data platform that performs especially well with
iterative algorithms
• Originally developed at UC Berkeley starting in 2009; moved to an
Apache project in 2013
• 10-100x speedup over Hadoop with some algorithms, especially
iterative ones as found in machine learning
Advantages of Spark
• Runs Programs up to 100x faster than Hadoop MapReduce
• Does the processing in the main memory of the worker nodes
• Prevents unnecessary I/O operations with the disks
• Ability to chain the tasks at an application programming level
• Minimizes the number of writes to the disks
• Uses Directed Acyclic Graph (DAG) data processing engine
• MapReduce is just one set of supported constructs
The petabyte sort record
DAG
• Stands for Directed Acyclic Graph
• For every Spark job, a DAG of tasks is created
Spark APIs
• APIs for
• Java
• Python
• Scala
• R
• Spark itself is written in Scala
• Percent of Spark programmers who used each language, in 2016:
• 88% Scala, 44% Java, 22% Python
• I think if the survey were done today, we would see the order as Scala,
Python, and Java
Spark Libraries
• Spark SQL
• For working with structured data. Allows you to seamlessly mix SQL queries
with Spark programs
• Spark Streaming
• Allows you to build scalable fault-tolerant streaming applications
• MLlib
• Implements common machine learning algorithms
• GraphX
• For graphs and graph-parallel computation
Spark Architecture
Spark
• Runs
• Locally
• distributed across a cluster
• Requires a cluster manager
• Yarn
• Spark Standalone
• Mesos
• Using Spark
• Interactive shell
• Data exploration
• Ad-hoc analysis
• Submit an application
• Spark is often used alongside Hadoop’s data storage module, HDFS
• can also integrate equally well with other popular data storage subsystems such
as HBase, Cassandra,…
What is Spark Used For?
• Stream processing
• log files
• sensor data
• financial transactions
• Machine learning
• store data in memory and rapidly run repeated queries
• Interactive analytics
• business analysts and data scientists increasingly want to explore their data by asking
a question, viewing the result, and then either altering the initial question slightly or
drilling deeper into results. This interactive query process requires systems such as
Spark that are able to respond and adapt quickly
• Data integration
• Extract, transform, and load (ETL) processes
Who Uses Spark?
• Well-known companies such as IBM and Huawei have invested significant sums in the technology
• a growing number of startups are building businesses that depend in whole or in part upon Spark
• Spark’s creators founded Databricks
• which provides a hosted end-to-end data platform powered by Spark
• The major Hadoop vendors, including MapR, Cloudera and Hortonworks, have all moved to
support Spark alongside their existing products, and each is working to add value for their
customers.
• IBM, Huawei and others have all made significant investments in Apache Spark, integrating it into
their own products and contributing enhancements and extensions back to the Apache project
• Web-based companies like Chinese search engine Baidu, e-commerce operation Alibaba Taobao,
and social networking company Tencent all run Spark-based operations at scale, with Tencent’s
800 million active users reportedly generating over 700 TB of data per day for processing on a
cluster of more than 8,000 compute nodes.
3 reasons to choose Spark
• Simplicity
• All capabilities are accessible via a set of rich APIs
• designed for interacting quickly and easily with data at scale
• well documented
• Speed
• designed for speed, operating both in memory and on disk
• Support
• supports a range of programming languages
• includes native support for tight integration with a number of leading storage solutions in the
Hadoop ecosystem and beyond
• large, active, and international community
• A growing set of commercial providers
• including Databricks, IBM, and all of the main Hadoop vendors deliver comprehensive support for
Spark-based solutions
Spark Execution Model
Spark execution model
• Application
• Driver
• Executor
• Job
• Stage
Spark execution model
Spark execution model
• At runtime, a Spark application maps to a single driver process and a set of executor
processes distributed across the hosts in a cluster
• The driver process manages the job flow and schedules tasks and is available the entire
time the application is running.
• Typically, this driver process is the same as the client process used to initiate the job
• In interactive mode, the shell itself is the driver process
• The executors are responsible for executing work, in the form of tasks, as well as for
storing any data that you cache.
• Invoking an action inside a Spark application triggers the launch of a job to fulfill it
• Spark examines the dataset on which that action depends and formulates an execution
plan.
• The execution plan assembles the dataset transformations into stages. A stage is a
collection of tasks that run the same code, each on a different subset of the data.
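As a quick illustration (assuming the spark-shell, where sc is predefined, and a placeholder file path), the built-in toDebugString method prints an RDD’s lineage; the indented groups in its output correspond to the stages introduced by shuffles such as reduceByKey:

```scala
// Inspect how an action's execution plan is split into stages
val words  = sc.textFile("file.txt").flatMap(_.split(" "))
val counts = words.map(word => (word, 1)).reduceByKey(_ + _)

// Print the lineage of transformations; indentation marks stage boundaries
println(counts.toDebugString)
```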
Cluster Managers
• Yarn
• Spark Standalone
• Mesos
Basic Programming Model
Basic Programming Model
• Spark’s basic data model is called a Resilient Distributed Dataset
(RDD)
• It is designed to support in-memory data storage, distributed across a
cluster
• fault-tolerant
• tracking the lineage of transformations applied to data
• Efficient
• parallelization of processing across multiple nodes in the cluster
• minimization of data replication between those nodes.
• Immutable
• Partitioned
RDDs
• Two basic types of operations on RDDs
• Transformations
• Transform an RDD into another RDD, such as mapping, filtering, and more
• Actions:
• Process an RDD into a result, such as count, collect, save, …
• The original RDD remains unchanged throughout
• The chain of transformations from RDD1 to RDDn is logged
• and can be repeated in the event of data loss or the failure of a cluster node (see the sketch below)
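A minimal spark-shell sketch of the distinction (the numbers are arbitrary): transformations return new RDDs and leave the original untouched, while actions actually produce a result.

```scala
val numbers = sc.parallelize(List(1, 2, 3, 4, 5))

val doubled = numbers.map(_ * 2)     // transformation: RDD -> new RDD
val big     = doubled.filter(_ > 4)  // transformation: RDD -> new RDD

println(big.count())                     // action: 3
println(numbers.collect().mkString(","))  // original unchanged: 1,2,3,4,5
```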
RDDs
• Data does not have to fit in a single machine
• Data is separated into partitions
• RDDs remain in memory
• greatly increasing the performance of the cluster, particularly in use cases
with a requirement for iterative queries or processes
RDDs
Transformations
• Transformations are lazily processed, only upon an action
• Transformations create a new RDD from an existing one
• Transformations might trigger an RDD repartitioning, called a shuffle
• Intermediate results can be manually cached in memory/on disk
• Spill to disk can be handled automatically
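A small demonstration of laziness, assuming the spark-shell in local mode: the println inside map runs only when an action forces evaluation, not when the transformation is declared.

```scala
val rdd = sc.parallelize(1 to 3)

val traced = rdd.map { x =>
  println(s"processing $x")   // not printed yet: map is lazy
  x * 10
}

traced.count()   // only now do the "processing ..." lines appear
```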
Spark transformations (tables of common transformations, e.g. map, filter, flatMap, groupByKey, reduceByKey, join, …)
Spark actions (tables of common actions, e.g. count, collect, reduce, take, first, saveAsTextFile, …)
RDD and cache
• Spark can persist (or cache) a dataset in memory across operations
• Each node stores in memory any slices of it that it computes and
reuses them in other actions on that dataset – often making future
actions more than 10x faster
• The cache is fault-tolerant: if any partition of an RDD is lost, it will
automatically be recomputed using the transformations that
originally created it
• You can mark an RDD to be persisted using the persist() or cache()
methods on it
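A caching sketch for the spark-shell (“file.txt” is a placeholder): cache() is shorthand for persist() with the memory-only storage level.

```scala
import org.apache.spark.storage.StorageLevel

val errors = sc.textFile("file.txt").filter(_.contains("ERROR")).cache()

errors.count()  // first action: computes the RDD and caches its partitions
errors.count()  // subsequent actions reuse the in-memory copy

// persist() also accepts other levels, e.g. spilling to disk when memory is short:
// errors.persist(StorageLevel.MEMORY_AND_DISK)
```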
Creating RDDs
#Turn a Scala collection into an RDD
• sc.parallelize( List( 1, 2, 3, 4 ) )
#Load Text File from FS or HDFS
• sc.textFile( "file.txt" )
• sc.textFile( "directory/*.txt" )
• sc.textFile( "hdfs://namenode:9000/path/file.txt" )
#Use Existing Hadoop InputFormat
• sc.hadoopFile( keyClass, valueClass, inputFmt, conf )
Install Spark
Installing Spark
• Step 1: Installing Java
• We need Java 8
• Step 2: Install Scala
• We will use Spark Scala Shell
• Step 3: Install Spark
• Step 4: within the “spark” directory, run:
• ./bin/spark-shell
• Follow instructions here to install Spark on Ubuntu
For Cluster installations
• Each machine will need Spark in the same folder, and key-based passwordless SSH access from the master for the user running Spark
• Slave machines will need to be listed in the slaves file
• See spark/conf/
Scala
Scala
• Stands for SCAlable Language
• Released in 2004
• JVM-based language that can call, and be called, by Java
• Strongly statically typed, yet feels dynamically typed
• Type inference saves us from having to write explicit types most of the time
• Blends the object-oriented and functional paradigms
• Functions can be passed as arguments and returned from other functions
• Functions as filters
• They can be stored in variables
• This allows flexible program flow control structures
• Functions can be applied to all members of a collection, which leads to very compact code
• Notice: the return value of a function is always the value of its last expression (see the sketch below)
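The code figure for these points is not reproduced; here is a minimal plain-Scala sketch (the names are made up) of functions stored in variables, passed as arguments, applied across a collection, and returning the value of the last expression:

```scala
val isLong: String => Boolean = word => word.length > 4  // function stored in a variable

def keep(words: List[String], p: String => Boolean): List[String] =
  words.filter(p)                                        // function passed as an argument

val words = List("spark", "is", "fast")
println(keep(words, isLong))        // List(spark)
println(words.map(_.toUpperCase))   // applied to every member: List(SPARK, IS, FAST)

def square(x: Int): Int = {
  val y = x * x
  y                                 // the value of the last expression is returned
}
```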
Wordcount in Java
WordCount In Scala
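The original Java and Scala word-count slides were images. A minimal spark-shell version of the Scala one might look like this (“input.txt” is a placeholder path):

```scala
val counts = sc.textFile("input.txt")
  .flatMap(_.split(" "))    // lines -> words
  .map(word => (word, 1))   // words -> (word, 1) pairs
  .reduceByKey(_ + _)       // sum the counts per word

counts.collect().foreach(println)
```

The equivalent classic Hadoop MapReduce program in Java runs to dozens of lines, which is presumably the contrast the two slides were drawing.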
Scala Sampler: Syntax and Features
• No semicolons
• unless multiple statements per line
• No need to specify types in all cases
• types follow variable and parameter names after a colon
• Almost everything is an expression that returns a value of a type
• No checked exceptions
• A pure OO language
• all values are objects, all operations are methods
Scala Notation
• ‘_’ is the default value or wild card
• ‘=>’ is used to separate a match expression from the block to be evaluated
• The anonymous function ‘(x,y) => x+y’ can be replaced by ‘_+_’
• ‘v => v.Method’ can be replaced by ‘_.Method’
• lsts.filter(v => v.length > 2) is the same as lsts.filter(_.length > 2)
• Iteration with for:
• for (i <- 0 until 10) { // ‘until’ excludes 10; ‘to’ would include it
• println(s"Item: $i")
• }
• "->" is the tuple delimiter
• 2 -> 3 is equal to (2, 3)
• 2 -> (3 -> 4) == (2,(3,4))
• 2 -> 3 -> 4 == ((2,3),4)
• Notice: single-statement functions do not need curly braces { }
• Arrays are indexed with ( ), not [ ]. [ ] is used for type bounds (like Java's < >). A runnable demo follows below
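A runnable recap of this notation, for a plain Scala REPL:

```scala
val lsts = List("ab", "abc", "abcd")
println(lsts.filter(_.length > 2))   // same as lsts.filter(v => v.length > 2)

val add = (_: Int) + (_: Int)        // the '_+_' style of anonymous function
println(add(2, 3))                   // 5

for (i <- 0 until 3)                 // 'until' excludes 3; 'to' would include it
  println(s"Item: $i")

println(2 -> 3)                            // (2,3): '->' builds a tuple
println((2 -> (3 -> 4)) == ((2, (3, 4))))  // true
println((2 -> 3 -> 4) == (((2, 3), 4)))    // true
```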
First Example
Spark Context
Spark context
• The first thing a Spark program does is create a SparkContext object,
which tells Spark how to access a cluster
• In the shell for either Scala or Python, this is the sc variable, which is
created automatically
• Other programs must use a constructor to instantiate a new
SparkContext
• Then in turn SparkContext gets used to create other variables
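A sketch of that setup for a standalone application (in the shell this step is unnecessary, since sc already exists; the app name and master URL below are placeholders):

```scala
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("MyApp")     // name shown in the cluster UI
  .setMaster("local[*]")   // or e.g. yarn, spark://host:7077, mesos://...

val sc = new SparkContext(conf)
// ... use sc to create RDDs ...
sc.stop()
```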
Create RDD + Some Basic Actions
Some Basic Transformations and Actions
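The original slides here were shell screenshots; a reconstruction of the kind of session they likely showed:

```scala
val nums = sc.parallelize(List(1, 2, 3, 4))

nums.count()     // 4
nums.first()     // 1
nums.take(2)     // Array(1, 2)
nums.collect()   // Array(1, 2, 3, 4)

// Basic transformations, forced by an action:
nums.map(_ + 1).collect()          // Array(2, 3, 4, 5)
nums.filter(_ % 2 == 0).collect()  // Array(2, 4)
```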
Second Example
Read a text file into an RDD
• each line becomes an item in the RDD
• to split out the words, use the map function to transform each item into its words
• but then we end up with an array of arrays of words, one array per line
• Solution? flatMap (see the snippet below)
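In code, the difference looks like this (“lines.txt” is a placeholder input):

```scala
val lines = sc.textFile("lines.txt")

val nested = lines.map(_.split(" "))      // RDD[Array[String]]: one array per line
val words  = lines.flatMap(_.split(" "))  // RDD[String]: one flat stream of words
```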
Filter lines
wordCount
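The filter and word-count slides were screenshots; a step-by-step sketch of what they likely walked through (paths are placeholders):

```scala
val lines    = sc.textFile("input.txt")
val nonEmpty = lines.filter(_.nonEmpty)    // filter lines
val words    = nonEmpty.flatMap(_.split(" "))
val pairs    = words.map(word => (word, 1))
val counts   = pairs.reduceByKey(_ + _)    // shuffle: combine counts per word

counts.take(10).foreach(println)           // e.g. (spark,3)
// counts.saveAsTextFile("counts_out")     // optionally write the result out
```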
Running Spark Examples
Running Spark Examples
• Let’s run one of the bundled Spark examples!
./bin/run-example SparkPi 30 local
Your Turn!
Your Turn!
• Find the number of occurrences of each character in frequent words:
• count how many times each character appears in all the words that
appear more than 3 times in a text file
Where to get more information?
• Apache Spark has a very good online community and good
documentation
• Spark Programming Guide on Apache Spark Official web site
Finished
Your Turn Solution
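The solution slide was an image; one possible solution sketch, under the interpretation that every occurrence of a frequent word contributes its characters (“input.txt” is a placeholder):

```scala
val words = sc.textFile("input.txt").flatMap(_.split(" "))

// Words that appear more than 3 times, with their counts
val frequent = words.map(word => (word, 1)).reduceByKey(_ + _).filter(_._2 > 3)

// Each character of a frequent word appears once per occurrence of that word
val charCounts = frequent
  .flatMap { case (word, n) => word.map(c => (c, n)) }
  .reduceByKey(_ + _)

charCounts.collect().foreach(println)
```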
Editor's Notes
1. Simplicity: Spark's capabilities are accessible via a set of rich APIs, all designed specifically for interacting quickly and easily with data at scale. These APIs are well documented, and structured in a way that makes it straightforward for data scientists and application developers to quickly put Spark to work. Speed: Spark is designed for speed, operating both in memory and on disk. In 2014, Spark was used to win the Daytona GraySort benchmarking challenge, processing 100 terabytes of data stored on solid-state drives in just 23 minutes; the previous winner used Hadoop and a different cluster configuration, and took 72 minutes. This win was the result of processing a static data set; Spark's performance can be even greater when supporting interactive queries of data stored in memory, with claims that Spark can be 100 times faster than Hadoop's MapReduce in these situations. Support: Spark supports a range of programming languages, including Java, Python, R, and Scala. Although often closely associated with Hadoop's underlying storage system, HDFS, Spark includes native support for tight integration with a number of leading storage solutions in the Hadoop ecosystem and beyond. Additionally, the Apache Spark community is large, active, and international. A growing set of commercial providers including Databricks, IBM, and all of the main Hadoop vendors deliver comprehensive support for Spark-based solutions.