Spark real world use cases
and optimizations
Spark
•Spark is a cluster computing engine.
•Provides high-level API in Scala, Java, Python and R.
•Provides high level tools:
•Spark SQL.
•MLlib.
•GraphX.
•Spark Streaming.
RDD
•The basic abstraction in Spark is the RDD.
•Stands for: Resilient Distributed Dataset.
•It is a collection of items whose source may be, for example:
•Hadoop (HDFS).
•JDBC.
•ElasticSearch.
•And more…
D is for Partitioned
• Partition is a sub-collection of data that should fit into memory
• Partition + transformation = Task
• This is the distributed part of the RDD
• Partitions are recomputed in case of failure - Resilient
(Figure: the lines of a text file split into partitions, e.g. up to line 100, up to line 200, up to line 300.)
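A minimal sketch of the partitioning idea above, assuming a local Spark session. The object name PartitionSketch and the 300 generated "lines" are illustrative only and do not come from the talk.

```scala
import org.apache.spark.sql.SparkSession

object PartitionSketch {
  // Returns the number of elements held by each partition, in partition order.
  def partitionSizes(): Array[Int] = {
    val spark = SparkSession.builder()
      .master("local[2]")
      .appName("partition-sketch")
      .getOrCreate()
    try {
      val sc = spark.sparkContext
      // 300 generated "lines", explicitly split into 3 partitions.
      val lines = sc.parallelize(1 to 300, numSlices = 3).map(i => s"Line $i")
      // One task per partition: each task counts the elements of its own partition.
      lines.mapPartitions(it => Iterator(it.size)).collect()
    } finally {
      spark.stop()
    }
  }
}
```

Each partition is processed by its own task, which is why the per-partition counts come back as separate values.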
D is for dependency
• RDD can depend on other RDDs.
• RDD is lazy.
• For example rdd.map(String::toUpperCase)
• Creates a new RDD which depends on the original one.
• Contains only meta-data (i.e., the computing function).
• Only on a specific command (known as an action, such as collect) will the
flow be computed.
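The laziness described above can be observed with a small sketch, assuming a local Spark session. The object name LazySketch and the accumulator name are hypothetical; the accumulator is only there to show when the map function actually runs.

```scala
import org.apache.spark.sql.SparkSession

object LazySketch {
  // Returns (number of map calls seen before the action, the collected result).
  def run(): (Long, List[String]) = {
    val spark = SparkSession.builder()
      .master("local[2]")
      .appName("lazy-sketch")
      .getOrCreate()
    try {
      val sc = spark.sparkContext
      val calls = sc.longAccumulator("map calls")
      // Defining the transformation only records the dependency and the function.
      val upper = sc.parallelize(Seq("foo", "bar")).map { s =>
        calls.add(1)
        s.toUpperCase
      }
      val callsBeforeAction = calls.sum // nothing has been computed yet
      // collect is an action: only now is the map function executed.
      val result = upper.collect().toList
      (callsBeforeAction, result)
    } finally {
      spark.stop()
    }
  }
}
```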
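The famous word count example (Scala)
sc.textFile("src/main/resources/books.txt")
  .flatMap(line => line.split(" "))   // split each line into words
  .map(w => (w, 1))                   // pair each word with a count of 1
  .reduceByKey((c1, c2) => c1 + c2)   // sum the counts per word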
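The famous word count example (Java)
sc.textFile("src/main/resources/book.txt")
  .flatMap(s -> Arrays.asList(s.split(" ")).iterator())  // split each line into words
  .mapToPair(word -> new Tuple2<>(word, 1))              // pair each word with a count of 1
  .reduceByKey((a, b) -> a + b)                          // sum the counts per word
  .collectAsMap();                                       // action: bring the result to the driver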
How does it work?
•Driver:
•Executes the main program
•Creates the RDDs
•Collects the results
•Executors:
•Executes the RDD operations
•Participate in the shuffle
Image taken from https://spark.apache.org/docs/latest/cluster-overview.html
Spark Terminology
•Application – the main program that runs on the driver,
creates the RDDs and collects the results.
•Job – a sequence of transformations on an RDD until an action occurs.
•Stage – a sequence of transformations on an RDD until a shuffle
occurs.
•Task – a sequence of transformations on a single partition until a
shuffle occurs.
RDD from Collection
•You can create an RDD from a collection:
sc.parallelize(list)
•Takes a sequence from the driver and distributes it across the
nodes.
•Note: the distribution is lazy, so be careful with mutable
collections!
•Important: if it's a range collection, use the range method, as it
does not create the collection on the driver.
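A short sketch of both ways of creating an RDD from in-memory data, assuming a local Spark session. The object name CollectionSketch and the sample values are illustrative only.

```scala
import org.apache.spark.sql.SparkSession

object CollectionSketch {
  // Returns (element count of the list-backed RDD, element count of the range RDD).
  def run(): (Long, Long) = {
    val spark = SparkSession.builder()
      .master("local[2]")
      .appName("collection-sketch")
      .getOrCreate()
    try {
      val sc = spark.sparkContext
      // The list lives on the driver and is distributed when the RDD is evaluated.
      val fromList = sc.parallelize(List("a", "b", "c"))
      // sc.range describes the numbers 0 until 1000000 (end exclusive)
      // without building the whole collection on the driver first.
      val fromRange = sc.range(0L, 1000000L)
      (fromList.count(), fromRange.count())
    } finally {
      spark.stop()
    }
  }
}
```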
RDD from file
•Spark supports reading files, directories and compressed files.
•The following methods are available out of the box:
•textFile – retrieving an RDD[String] (lines).
•wholeTextFiles – retrieving an RDD[(String, String)] with
filename and content.
•sequenceFile – Hadoop sequence files RDD[(K,V)].
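To make the difference between textFile and wholeTextFiles concrete, here is a sketch that writes two small files to a temporary directory and reads them back. The object name FileSketch, the file names and their contents are made up for illustration.

```scala
import java.nio.charset.StandardCharsets
import java.nio.file.Files
import org.apache.spark.sql.SparkSession

object FileSketch {
  // Returns (number of lines seen by textFile, number of files seen by wholeTextFiles).
  def run(): (Long, Long) = {
    // Hypothetical input: two small text files in a temporary directory.
    val dir = Files.createTempDirectory("rdd-from-file")
    Files.write(dir.resolve("a.txt"), "one\ntwo\n".getBytes(StandardCharsets.UTF_8))
    Files.write(dir.resolve("b.txt"), "three\n".getBytes(StandardCharsets.UTF_8))

    val spark = SparkSession.builder()
      .master("local[2]")
      .appName("file-sketch")
      .getOrCreate()
    try {
      val sc = spark.sparkContext
      val path = dir.toUri.toString
      // textFile on a directory: one RDD element per line, across all files.
      val lines = sc.textFile(path)
      // wholeTextFiles: one (file name, full content) pair per file.
      val files = sc.wholeTextFiles(path)
      (lines.count(), files.count())
    } finally {
      spark.stop()
    }
  }
}
```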
RDD Actions
•Return values by evaluating the RDD (not lazy):
•collect() – returns a list containing all the elements of the
RDD. This is the main method that evaluates the RDD.
•count() – returns the number of the elements in the RDD.
•first() – returns the first element of the RDD.
•foreach(f) – performs the function on each element of the
RDD.
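The four actions listed above, in one small sketch that assumes a local Spark session. The object name ActionSketch and the sample numbers are illustrative only.

```scala
import org.apache.spark.sql.SparkSession

object ActionSketch {
  // Returns (all elements, number of elements, first element).
  def run(): (List[Int], Long, Int) = {
    val spark = SparkSession.builder()
      .master("local[2]")
      .appName("action-sketch")
      .getOrCreate()
    try {
      val sc = spark.sparkContext
      val rdd = sc.parallelize(Seq(3, 1, 2), numSlices = 1)
      val all = rdd.collect().toList // brings every element to the driver
      val n = rdd.count()            // number of elements
      val head = rdd.first()         // first element
      // foreach runs on the executors; in local mode the output appears on stdout.
      rdd.foreach(x => println(x))
      (all, n, head)
    } finally {
      spark.stop()
    }
  }
}
```

Note that collect ships the whole RDD to the driver, so it only suits results that fit in driver memory.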
RDD Transformations
•Return a pointer to a new RDD with the transformation meta-data
•map(func) - Return a new distributed dataset formed by passing
each element of the source through a function func.
•filter(func) - Return a new dataset formed by selecting those
elements of the source on which func returns true.
•flatMap(func) - Similar to map, but each input item can be mapped
to 0 or more output items (so func should return a Seq rather than a
single item).
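A small sketch of map, filter and flatMap chained together, assuming a local Spark session. The object name TransformationSketch and the sample lines are illustrative only.

```scala
import org.apache.spark.sql.SparkSession

object TransformationSketch {
  // Returns (length of every line, words of the non-empty lines).
  def run(): (List[Int], List[String]) = {
    val spark = SparkSession.builder()
      .master("local[2]")
      .appName("transformation-sketch")
      .getOrCreate()
    try {
      val sc = spark.sparkContext
      val lines = sc.parallelize(Seq("a b", "", "c"))
      val lengths = lines.map(_.length)            // exactly one output per input
      val nonEmpty = lines.filter(_.nonEmpty)      // keeps elements where func is true
      val words = nonEmpty.flatMap(_.split(" "))   // zero or more outputs per input
      // Nothing above has run yet; collect triggers the computation.
      (lengths.collect().toList, words.collect().toList)
    } finally {
      spark.stop()
    }
  }
}
```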
Shuffle
•Shuffle operations repartition the data across the network.
•Can be very expensive operations in Spark.
•You must be aware where and why shuffle happens.
•Order is not guaranteed inside a partition.
•Popular operations that cause shuffle are: groupBy*,
reduceBy*, sort*, aggregateBy* and join/intersect operations on
multiple RDDs.
Everyday I’m shuffling
Image taken from https://www.datastax.com/wp-content/uploads/2015/05/SparkShuffle.png
Transformations that shuffle
*Taken from the official Apache Spark documentation
•distinct([numTasks]) - Return a new dataset that contains the
distinct elements of the source dataset.
•groupByKey([numTasks]) - When called on a dataset of (K,
V) pairs, returns a dataset of (K, Iterable<V>) pairs.
•reduceByKey(func, [numTasks]) - When called on a
dataset of (K, V) pairs, returns a dataset of (K, V) pairs where
the values for each key are aggregated using the given reduce
function func, which must be of type (V,V) => V.
Transformations that shuffle
*Taken from the official Apache Spark documentation
•join(otherDataset, [numTasks]) - When called on datasets of
type (K, V) and (K, W), returns a dataset of (K, (V, W)) pairs
with all pairs of elements for each key. Outer joins are
supported through leftOuterJoin, rightOuterJoin, and
fullOuterJoin.
•sort, sortByKey…
•More at http://spark.apache.org/docs/latest/programming-guide.html
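The shuffling transformations listed above can be tried on a tiny pair RDD. This is a sketch assuming a local Spark session; the object name ShuffleSketch and the sample keys and values are made up.

```scala
import org.apache.spark.sql.SparkSession

object ShuffleSketch {
  // Returns (distinct keys, values grouped per key, values summed per key, join result).
  def run(): (List[String], Map[String, List[Int]], Map[String, Int], List[(String, (Int, String))]) = {
    val spark = SparkSession.builder()
      .master("local[2]")
      .appName("shuffle-sketch")
      .getOrCreate()
    try {
      val sc = spark.sparkContext
      val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))
      val names = sc.parallelize(Seq(("a", "x"), ("b", "y")))

      // Results are sorted on the driver because order after a shuffle is not guaranteed.
      val distinctKeys = pairs.keys.distinct().collect().sorted.toList
      val grouped = pairs.groupByKey().mapValues(_.toList.sorted).collectAsMap().toMap
      val reduced = pairs.reduceByKey(_ + _).collectAsMap().toMap
      val joined = pairs.join(names).collect().sorted.toList
      (distinctKeys, grouped, reduced, joined)
    } finally {
      spark.stop()
    }
  }
}
```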
Use case #1: Bucketing
• Every day we aggregate lots (millions/billions) of user actions
• We want all the actions of a single user to be saved in a file
of their own
• Sounds simple..
• But should we use groupByKey?
• Or a reduceByKey?
groupByKey vs reduceByKey
Demo
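The demo itself is not part of this transcript. As a generic comparison only, the sketch below counts actions per user both ways, assuming a local Spark session; the object name GroupVsReduceSketch, the user ids and the action names are hypothetical.

```scala
import org.apache.spark.sql.SparkSession

object GroupVsReduceSketch {
  // Returns the per-user action count computed via groupByKey and via reduceByKey.
  def run(): (Map[String, Int], Map[String, Int]) = {
    val spark = SparkSession.builder()
      .master("local[2]")
      .appName("group-vs-reduce-sketch")
      .getOrCreate()
    try {
      val sc = spark.sparkContext
      // Hypothetical (userId, action) records.
      val actions = sc.parallelize(Seq(("u1", "click"), ("u2", "view"), ("u1", "view")))

      // groupByKey ships every value across the network, then counts per user.
      val viaGroup = actions.groupByKey().mapValues(_.size).collectAsMap().toMap

      // reduceByKey combines values inside each partition before the shuffle,
      // so only partial counts per key have to travel.
      val viaReduce = actions.mapValues(_ => 1).reduceByKey(_ + _).collectAsMap().toMap

      (viaGroup, viaReduce)
    } finally {
      spark.stop()
    }
  }
}
```

Both give the same counts; the difference is in how much data is shuffled. When every individual value is needed per key, as in the bucketing use case, there is nothing to pre-combine, so the choice deserves more thought.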
Spark SQL
•Spark module for structured data
•Spark SQL provides a unified engine (Catalyst) with 3 APIs:
•SQL. (off topic today)
•Dataframes (untyped).
•Datasets (typed) – Only Scala and Java for Spark 2.x
Use Case #2: Top results
•Same user activities as before
•Now we want to take the most active users
•Sounds simple :
1. Group by user Id
2. Count the records in each group
3. Sort by count
4. Take the X top users
• But sort is expensive..
Demo
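The demo is not included in this transcript. One possible way to avoid a full sort, shown here purely as an assumption about what could be done, is the RDD top method, which keeps only the largest n elements instead of ordering the whole dataset. The object name TopUsersSketch and the user ids are hypothetical.

```scala
import org.apache.spark.sql.SparkSession

object TopUsersSketch {
  // Returns the n most active users with their action counts, most active first.
  def run(n: Int): List[(String, Long)] = {
    val spark = SparkSession.builder()
      .master("local[2]")
      .appName("top-users-sketch")
      .getOrCreate()
    try {
      val sc = spark.sparkContext
      // Hypothetical input: one user id per recorded action.
      val actions = sc.parallelize(Seq("u1", "u2", "u1", "u3", "u2", "u1"))
      // Steps 1 and 2: count the actions per user.
      val perUser = actions.map(user => (user, 1L)).reduceByKey(_ + _)
      // Steps 3 and 4 in one go: take the n largest by count without sorting everything.
      perUser.top(n)(Ordering.by[(String, Long), Long](_._2)).toList
    } finally {
      spark.stop()
    }
  }
}
```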
DataFrames and Datasets
• Originally Spark provided only DataFrames.
• A DataFrame is conceptually a table with typed columns.
• However, it is not typed at compile time.
• Starting with Spark 1.6, Datasets were introduced.
• A Dataset also represents a table with columns, but each row is typed at
compile time.
DataFrames and Datasets
val flightsDF: DataFrame =
  spark.read.option("inferSchema", true).csv("…flights.csv")
VS
// Needs import spark.implicits._ and a Flight case class matching the columns
val flightsDS: Dataset[Flight] =
  spark.read.option("inferSchema", true)
    .csv("…flights.csv").as[Flight]
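To show what "typed at compile time" means in practice, here is a sketch that builds the data in memory instead of reading the CSV. The Flight case class, its fields and the sample rows are hypothetical; the real Flight class is not shown in the slides.

```scala
import org.apache.spark.sql.{DataFrame, Dataset, SparkSession}

// Hypothetical record type, defined at top level so Spark can derive an encoder for it.
case class Flight(origin: String, dest: String, delay: Int)

object DatasetSketch {
  // Returns the number of delayed flights found via the DataFrame and via the Dataset.
  def run(): (Long, Long) = {
    val spark = SparkSession.builder()
      .master("local[2]")
      .appName("dataset-sketch")
      .getOrCreate()
    try {
      import spark.implicits._
      val ds: Dataset[Flight] = Seq(Flight("TLV", "JFK", 10), Flight("JFK", "TLV", -5)).toDS()
      val df: DataFrame = ds.toDF()

      // Untyped: the column is referenced by name, so a typo fails only at run time.
      val lateViaDf = df.filter($"delay" > 0).count()

      // Typed: the field is checked by the compiler, so a typo fails at compile time.
      val lateViaDs = ds.filter((f: Flight) => f.delay > 0).count()

      (lateViaDf, lateViaDs)
    } finally {
      spark.stop()
    }
  }
}
```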
Use Case #3: Analytics report
•Our input is a CSV with flight records of different airports
•Each record is a flight from one airport to another, including departure and arrival info
•We want to know how many flights arrived at and departed from
every airport
•All we need to do is group by the airports and then count..
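The talk's own solution is not in this transcript. As one possible approach, and only as an assumption, each flight can be turned into two records (a departure for the origin and an arrival for the destination) so that a single reduceByKey, and therefore a single shuffle, produces both counts. The object name AirportReportSketch and the airport codes are made up.

```scala
import org.apache.spark.sql.SparkSession

object AirportReportSketch {
  // Returns airport -> (number of departures, number of arrivals).
  def run(): Map[String, (Long, Long)] = {
    val spark = SparkSession.builder()
      .master("local[2]")
      .appName("airport-report-sketch")
      .getOrCreate()
    try {
      val sc = spark.sparkContext
      // Hypothetical flights as (origin, destination) pairs.
      val flights = sc.parallelize(Seq(("TLV", "JFK"), ("TLV", "LHR"), ("JFK", "TLV")))

      flights
        // Emit one departure for the origin and one arrival for the destination.
        .flatMap { case (origin, dest) => Seq((origin, (1L, 0L)), (dest, (0L, 1L))) }
        // Sum departures and arrivals per airport in a single shuffle.
        .reduceByKey { case ((d1, a1), (d2, a2)) => (d1 + d2, a1 + a2) }
        .collectAsMap()
        .toMap
    } finally {
      spark.stop()
    }
  }
}
```

The alternative of grouping twice (once by origin, once by destination) and joining the results would need more than one shuffle.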
To summarize
•We learned what Spark is
•Took a taste of RDD and Spark SQL
•Try to avoid shuffle - Shuffle is expensive
•Pick the aggregation method according to the use case
•Caching may help
Questions?

Spark real world use cases and optimizations

  • 1. Spark real world use cases and optimizations
  • 2. Spark •Spark is a cluster computing engine. •Provides high-level API in Scala, Java, Python and R. •Provides high level tools: •Spark SQL. •MLib. •GraphX. •Spark Streaming.
  • 3. RDD •The basic abstraction in Spark is the RDD. •Stands for: Resilient Distributed Dataset. •It is a collection of items which their source may for example: •Hadoop (HDFS). •JDBC. •ElasticSearch. •And more…
  • 4. D is for Partitioned • Partition is a sub-collection of data that should fit into memory • Partition + transformation = Task • This is the distributed part of the RDD • Partitions are recomputed in case of failure - Resilient Foo bar .. Line 2 Hello … … Line 100.. Line #... … … Line 200.. Line #... … … Line 300.. Line #... …
  • 5. D is for dependency • RDD can depend on other RDDs. • RDD is lazy. • For example rdd.map(String::toUppercase) • Creates a new RDD which depends on the original one. • Contains only meta-data (i.e., the computing function). • Only on a specific command (knows as actions, like collect) the flow will be computed.
  • 6. The famous word count example sc.textFile("src/main/resources/books.txt") .flatMap(line => line.split(" ")) .map(w => (w,1)) .reduceByKey((c1, c2) => c1 + c2)
  • 7. The famous word count example
      sc.textFile("src/main/resources/book.txt")
        .flatMap(s -> Arrays.asList(s.split(" ")).iterator())
        .mapToPair(word -> new Tuple2<>(word, 1))
        .reduceByKey((a, b) -> a + b)
        .collectAsMap();
  • 8. How does it work?
      • Driver:
          • Executes the main program
          • Creates the RDDs
          • Collects the results
      • Executors:
          • Execute the RDD operations
          • Participate in the shuffle
      Image taken from https://spark.apache.org/docs/latest/cluster-overview.html
  • 9. Spark Terminology
      • Application – the main program that runs on the driver, creates the RDDs and collects the results.
      • Job – a sequence of transformations on an RDD until an action occurs.
      • Stage – a sequence of transformations on an RDD until a shuffle occurs.
      • Task – a sequence of transformations on a single partition until a shuffle occurs.
  • 10. RDD from Collection
      • You can create an RDD from a collection: sc.parallelize(list)
      • Takes a sequence from the driver and distributes it across the nodes.
      • Note: the distribution is lazy, so be careful with mutable collections!
      • Important: for a range of numbers, use the range method, as it does not create the collection on the driver.
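The warning about mutable collections on slide 10 can be modelled in plain Java. This is a sketch, not Spark code: it assumes only that the source collection is read when the data is eventually evaluated rather than when the dataset is declared. `lazyView` and `lazySnapshot` are hypothetical helpers invented for this illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Supplier;

class LazySource {
    // Reads the caller's list only when get() is called, so any
    // mutation made in the meantime leaks into the result.
    static Supplier<List<Integer>> lazyView(List<Integer> source) {
        return () -> new ArrayList<>(source);
    }

    // Copies the list up front, so later mutations are not seen.
    static Supplier<List<Integer>> lazySnapshot(List<Integer> source) {
        List<Integer> copy = new ArrayList<>(source);
        return () -> new ArrayList<>(copy);
    }

    public static void main(String[] args) {
        List<Integer> data = new ArrayList<>(Arrays.asList(1, 2, 3));
        Supplier<List<Integer>> view = lazyView(data);
        Supplier<List<Integer>> snapshot = lazySnapshot(data);

        data.add(4); // mutation after "declaring" the dataset

        System.out.println("view:     " + view.get());
        System.out.println("snapshot: " + snapshot.get());
    }
}
```

Passing a copy of the collection, as `lazySnapshot` does, is the defensive choice whenever the original list may still change.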
  • 11. RDD from file
      • Spark supports reading files, directories and compressed files.
      • The following methods are available out of the box:
          • textFile – returns an RDD[String] (lines).
          • wholeTextFiles – returns an RDD[(String, String)] with filename and content.
          • sequenceFile – returns an RDD[(K, V)] from Hadoop sequence files.
  • 12. RDD Actions
      • Return values by evaluating the RDD (not lazy):
          • collect() – returns a list containing all the elements of the RDD. This is the main method that evaluates the RDD.
          • count() – returns the number of elements in the RDD.
          • first() – returns the first element of the RDD.
          • foreach(f) – applies the function to each element of the RDD.
  • 13. RDD Transformations
      • Return a pointer to a new RDD that holds the transformation meta-data:
          • map(func) - Return a new distributed dataset formed by passing each element of the source through a function func.
          • filter(func) - Return a new dataset formed by selecting those elements of the source on which func returns true.
          • flatMap(func) - Similar to map, but each input item can be mapped to 0 or more output items (so func should return a Seq rather than a single item).
  • 14. Shuffle
      • Shuffle operations repartition the data across the network.
      • They can be very expensive operations in Spark.
      • You must be aware of where and why a shuffle happens.
      • Order is not guaranteed inside a partition.
      • Popular operations that cause a shuffle are: groupBy*, reduceBy*, sort*, aggregateBy* and join/intersect operations on multiple RDDs.
  • 15. Everyday I’m shuffling Image taken from https://www.datastax.com/wp-content/uploads/2015/05/SparkShuffle.png
  • 16. Transformations that shuffle
      *Taken from the official Apache Spark documentation
      • distinct([numTasks]) - Return a new dataset that contains the distinct elements of the source dataset.
      • groupByKey([numTasks]) - When called on a dataset of (K, V) pairs, returns a dataset of (K, Iterable<V>) pairs.
      • reduceByKey(func, [numTasks]) - When called on a dataset of (K, V) pairs, returns a dataset of (K, V) pairs where the values for each key are aggregated using the given reduce function func, which must be of type (V,V) => V.
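The practical difference between groupByKey and reduceByKey is how many records cross the network: reduceByKey merges values locally on each partition before the shuffle, while groupByKey ships every record. The plain-Java model below (no Spark involved) counts the records that would be shuffled in each style. Partitions are modelled as lists of keys, and `CombineDemo` with its method names is hypothetical.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class CombineDemo {
    // groupByKey style: every (key, 1) record crosses the shuffle.
    static int recordsShuffledWithoutCombine(List<List<String>> partitions) {
        int shuffled = 0;
        for (List<String> partition : partitions) {
            shuffled += partition.size();
        }
        return shuffled;
    }

    // reduceByKey style: each partition sums its own keys first,
    // so it ships one record per distinct key.
    static int recordsShuffledWithCombine(List<List<String>> partitions) {
        int shuffled = 0;
        for (List<String> partition : partitions) {
            shuffled += localCounts(partition).size();
        }
        return shuffled;
    }

    // Per-partition partial counts (the local combine step).
    static Map<String, Integer> localCounts(List<String> partition) {
        Map<String, Integer> counts = new HashMap<>();
        for (String key : partition) {
            counts.merge(key, 1, Integer::sum);
        }
        return counts;
    }

    // Merging the partial counts gives the same totals either way.
    static Map<String, Integer> totalCounts(List<List<String>> partitions) {
        Map<String, Integer> total = new HashMap<>();
        for (List<String> partition : partitions) {
            localCounts(partition).forEach((k, v) -> total.merge(k, v, Integer::sum));
        }
        return total;
    }

    public static void main(String[] args) {
        List<List<String>> partitions = Arrays.asList(
                Arrays.asList("a", "a", "b"),
                Arrays.asList("a", "b", "b", "b"));
        System.out.println("without combine: " + recordsShuffledWithoutCombine(partitions));
        System.out.println("with combine:    " + recordsShuffledWithCombine(partitions));
        System.out.println("totals:          " + totalCounts(partitions));
    }
}
```

The saving grows with the number of repeated keys per partition. When every value has to be kept, as in collecting all records of a key, there is little to combine and the gap shrinks.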
  • 17. Transformations that shuffle
      *Taken from the official Apache Spark documentation
      • join(otherDataset, [numTasks]) - When called on datasets of type (K, V) and (K, W), returns a dataset of (K, (V, W)) pairs with all pairs of elements for each key. Outer joins are supported through leftOuterJoin, rightOuterJoin, and fullOuterJoin.
      • sort, sortByKey…
      • More at http://spark.apache.org/docs/latest/programming-guide.html
  • 18. Use case #1: Bucketing
      • Every day we aggregate lots (millions/billions) of user actions.
      • We want all the actions of a single user to be saved in a file of their own.
      • Sounds simple...
      • But should we use groupByKey?
      • Or reduceByKey?
  • 20. Demo
  • 21. Spark SQL
      • Spark module for structured data.
      • Spark SQL provides a unified engine (Catalyst) with 3 APIs:
          • SQL (out of scope today)
          • DataFrames (untyped)
          • Datasets (typed) – only Scala and Java for Spark 2.x
  • 22. Use Case #2: Top results
      • Same user activities as before.
      • Now we want the most active users.
      • Sounds simple:
          1. Group by user ID
          2. Count the records in each group
          3. Sort by count
          4. Take the top X users
      • But sort is expensive...
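One common way to avoid a full sort when only the top X entries are needed is a bounded min-heap: keep at most X candidates and evict the smallest whenever the heap overflows. The plain-Java sketch below shows the idea on per-user counts that are assumed to be already computed. It is not necessarily what the demo on the next slide does, and `TopK` and its method are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;

class TopK {
    // Returns the k user IDs with the highest counts, most active first.
    static List<String> topK(Map<String, Integer> counts, int k) {
        Comparator<String> byCount = Comparator.comparingInt(counts::get);

        // Min-heap: the least active of the current candidates is on top.
        PriorityQueue<String> heap = new PriorityQueue<>(byCount);
        for (String user : counts.keySet()) {
            heap.offer(user);
            if (heap.size() > k) {
                heap.poll(); // evict the smallest so the heap never exceeds k
            }
        }

        // Only the k survivors are sorted, not the whole input.
        List<String> result = new ArrayList<>(heap);
        result.sort(byCount.reversed());
        return result;
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = new HashMap<>();
        counts.put("alice", 5);
        counts.put("bob", 2);
        counts.put("carol", 9);
        counts.put("dave", 1);
        System.out.println(topK(counts, 2));
    }
}
```

With n users this costs roughly O(n log k) instead of O(n log n), and the same step can be applied per partition before merging the small per-partition results.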
  • 23. Demo
  • 24. DataFrames and Datasets
      • Originally Spark provided only DataFrames.
      • A DataFrame is conceptually a table with typed columns.
      • However, it is not typed at compile time.
      • Datasets were introduced in Spark 1.6.
      • A Dataset also represents a table with columns, but each row is typed at compile time.
  • 25. DataFrames and Datasets
      val flightsDF: DataFrame =
        spark.read.option("inferSchema", true).csv("…flights.csv")
      VS
      val flightsDS: Dataset[Flight] =
        spark.read.option("inferSchema", true).csv("…flights.csv").as[Flight]
  • 26. Use Case #3: Analytics report
      • Our input is a CSV with flight records of different airports.
      • Each record describes a flight from one airport to another, including departure and arrival info.
      • We want to know how many flights arrived at and departed from every airport.
      • All we need to do is to group by the airports and then count...
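Grouping once by origin and once by destination means two separate aggregations. One possible alternative is to let each flight contribute to both of its airports in a single pass, which in Spark terms corresponds to emitting two keyed records per flight and aggregating once. The plain-Java sketch below models that idea. The `Flight` fields `origin` and `destination` are assumed names (the slides do not show the real schema), and this is not claimed to be the approach taken in the talk.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class AirportReport {
    // Minimal flight record; the field names are assumptions.
    static final class Flight {
        final String origin;
        final String destination;

        Flight(String origin, String destination) {
            this.origin = origin;
            this.destination = destination;
        }
    }

    // Returns airport -> {departures, arrivals}, computed in one pass.
    static Map<String, int[]> countPerAirport(List<Flight> flights) {
        Map<String, int[]> counts = new TreeMap<>();
        for (Flight f : flights) {
            counts.computeIfAbsent(f.origin, a -> new int[2])[0]++;      // departure
            counts.computeIfAbsent(f.destination, a -> new int[2])[1]++; // arrival
        }
        return counts;
    }

    public static void main(String[] args) {
        List<Flight> flights = Arrays.asList(
                new Flight("TLV", "JFK"),
                new Flight("JFK", "LHR"),
                new Flight("TLV", "LHR"));
        countPerAirport(flights).forEach((airport, c) ->
                System.out.println(airport + " departures=" + c[0] + " arrivals=" + c[1]));
    }
}
```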
  • 27. To summarize
      • We learned what Spark is.
      • We got a taste of RDDs and Spark SQL.
      • Try to avoid shuffles: a shuffle is expensive.
      • Pick the aggregation method according to the use case.
      • Caching may help.