SlideShare uma empresa Scribd logo
1 de 24
Notes Sharing
Richard Kuo, Professional-Technical Architect,
Domain 2.0 Architecture & Planning
Agenda
• Big Data
• Overview of Spark
• Main Concepts
– RDD
– Transformations
– Programming Model
• Observation
01/06/15 Creative Common, BY, SA, NC 2
What is Apache Spark?
• Fast and general cluster computing system,
interoperable with Hadoop.
• Improves efficiency through:
– In-memory computing primitives
– General computational graph
• Improves usability through:
– Rich APIs in Scala, Java, Python
– Interactive shell
01/06/15 Creative Common, BY, SA, NC 3
Big Data: Hadoop Ecosystem
01/06/15 Creative Common, BY, SA, NC 4
Distributed Computing
01/06/15 Creative Common, BY, SA, NC 5
Comparison with Hadoop
Hadoop Spark
Map Reduce Framework Generalized Computation
Usually data is on disk (HDFS) On disk or in memory
Not ideal for iterative works Data can be cached in memory, great for
iterative works
Batch process Real time streaming or batch
Up to 10x faster when data is in disk
Up to 100x faster when data is in memory
2-5x time less code to write
Support Scala, Java and Python
Code re-use across modules
Interactive shell for ad-hoc exploratory
Library support: GraphX, Machine
Learning, SQL, R, Streaming, …
01/06/15 Creative Common, BY, SA, NC 6
01/06/15 Creative Common, BY, SA, NC 7
Compare to Hadoop:
01/06/15 Creative Common, BY, SA, NC 8
System performance degrade gracefully with less RAM
69
58
41
30
12
0
20
40
60
80
100
Cache
disabled
25% 50% 75% Fully cached
Executiontime(s)
% of working set in cache
01/06/15 Creative Common, BY, SA, NC 9
Software Components
• Spark runs as a library in
your program (1 instance
per app)
• Runs tasks locally or on
cluster
– Mesos, YARN or standalone
mode
• Accesses storage systems
via Hadoop InputFormat API
– Can use HBase, HDFS, S3, …
Your application
SparkContext
Local
threads
Cluster
manager
Worker
Spark
executor
Worker
Spark
executor
HDFS or other storage
01/06/15 Creative Common, BY, SA, NC 10
Spark Architecture
• [Spark
Standalone
• |Mesos
• |Yarn]
Node
Client
01/06/15 Creative Common, BY, SA, NC 11
Key Concept: RDD’s
Resilient Distributed Datasets
• Collections of objects
spread across a cluster,
stored in RAM or on Disk
• Built through parallel
transformations
• Automatically rebuilt on
failure
Operations
• Transformations
(e.g. map, filter, groupBy)
• Actions
(e.g. count, collect, save)
01/06/15 Creative Common, BY, SA, NC 12
Write programs in terms of operations on distributed
datasets
Fault Recovery
RDDs track the series of transformations used to build
them (their lineage) to re-compute lost data, no data
replication across wire.
val lines = sc.textFile(...)
lines.filter(x => x.contains(“ERROR”)).count()
msgs = textFile.filter(lambda s: s.startsWith(“ERROR”))
.map(lambda s: s.split(“t”)[2])
01/06/15 Creative Common, BY, SA, NC 13
HDFS File Filtered RDD Mapped RDD
filter
(func = startsWith(…))
map
(func = split(...))
Language Support
Standalone Programs
•Python, Scala, & Java
Interactive Shells
• Python & Scala
Performance
• Java & Scala are faster due to
static typing
• …but Python is often fine
Python
lines = sc.textFile(...)
lines.filter(lambda s: “ERROR” in s).count()
Scala
val lines = sc.textFile(...)
lines.filter(x => x.contains(“ERROR”)).count()
Java
JavaRDD<String> lines = sc.textFile(...);
lines.filter(new Function<String, Boolean>() {
Boolean call(String s) {
return s.contains(“error”);
}
}).count();
01/06/15 Creative Common, BY, SA, NC 14
Interactive Shell
• The fastest way to
learn Spark
• Available in Python and
Scala
• Runs as an application
on an existing Spark
Cluster…
• Or can run locally
01/06/15 Creative Common, BY, SA, NC 15
DEMO
01/06/15 Creative Common, BY, SA, NC 16
Transformation
01/06/15 Creative Common, BY, SA, NC 17
Spark Streaming
01/06/15 Creative Common, BY, SA, NC 18
Spark Streaming
01/06/15 Creative Common, BY, SA, NC 19
Spark Streaming: Word Count
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.StreamingContext._
import org.apache.spark.storage.StorageLevel
object NetworkWordCount {
def main(args: Array[String]) {
if (args.length < 2) {
System.err.println("Usage: NetworkWordCount <hostname> <port>")
System.exit(1)
}
StreamingExamples.setStreamingLogLevels()
// Create the context with a 1 second batch size
val sparkConf = new SparkConf().setAppName("NetworkWordCount")
val ssc = new StreamingContext(sparkConf, Seconds(1))
val lines = ssc.socketTextStream(args(0), args(1).toInt, StorageLevel.MEMORY_AND_DISK_SER)
val words = lines.flatMap(_.split(" "))
val wordCounts = words.map(x => (x, 1)).reduceByKey(_ + _)
wordCounts.print()
ssc.start()
ssc.awaitTermination()
}
}
01/06/15 Creative Common, BY, SA, NC 20
Create Spark Context
Create, map, reduce
Output
Start
Analytics
01/06/15 Creative Common, BY, SA, NC 21
Conclusion
• Spark offers a rich API to make data analytics fast:
both less to write and fast to run.
• Achieves 100x speedups in real applications.
• Growing community.
01/06/15 Creative Common, BY, SA, NC 22
Observations:
• A lot of data, different kinds of data, generated
faster, need analyzed in real-time.
• All* products are data products.
• More complicate analytic algorithms applies to
commercial products and services.
• Not all data analysis requires the same accuracy.
• Expectation on service delivery increases.
01/06/15 Creative Common, BY, SA, NC 23
Reference:
• AMPLab at UC Berkeley
• Databrick
• UC BerkeleyX
– CS100.1x Introduction to Big Data with Apache Spark, starts 23 Feb 2015,
5 weeks
– CS190.1x Scalable Machine Learning, starts 14 Apr 2015, 5 weeks
• Spark Summit 2014 Training
• Resilient Distributed Datasets: A Fault-Tolerant
Abstraction for In-Memory Cluster Computing
• An Architecture for Fast and General Data Processing on
Large Clusters
• Richard’s Study Notes
– Self Study AMPCamp
– Hortonworks HDP 2.2 Study
01/06/15 Creative Common, BY, SA, NC 24

Mais conteúdo relacionado

Mais procurados

Spark & Spark Streaming Internals - Nov 15 (1)
Spark & Spark Streaming Internals - Nov 15 (1)Spark & Spark Streaming Internals - Nov 15 (1)
Spark & Spark Streaming Internals - Nov 15 (1)
Akhil Das
 
An Introduction to Spark
An Introduction to SparkAn Introduction to Spark
An Introduction to Spark
jlacefie
 

Mais procurados (20)

Apache Spark Introduction and Resilient Distributed Dataset basics and deep dive
Apache Spark Introduction and Resilient Distributed Dataset basics and deep diveApache Spark Introduction and Resilient Distributed Dataset basics and deep dive
Apache Spark Introduction and Resilient Distributed Dataset basics and deep dive
 
Introduction to spark
Introduction to sparkIntroduction to spark
Introduction to spark
 
BDM25 - Spark runtime internal
BDM25 - Spark runtime internalBDM25 - Spark runtime internal
BDM25 - Spark runtime internal
 
Apache Spark & Streaming
Apache Spark & StreamingApache Spark & Streaming
Apache Spark & Streaming
 
Apache Spark RDDs
Apache Spark RDDsApache Spark RDDs
Apache Spark RDDs
 
Spark & Spark Streaming Internals - Nov 15 (1)
Spark & Spark Streaming Internals - Nov 15 (1)Spark & Spark Streaming Internals - Nov 15 (1)
Spark & Spark Streaming Internals - Nov 15 (1)
 
Tuning and Debugging in Apache Spark
Tuning and Debugging in Apache SparkTuning and Debugging in Apache Spark
Tuning and Debugging in Apache Spark
 
Apache Spark RDD 101
Apache Spark RDD 101Apache Spark RDD 101
Apache Spark RDD 101
 
Apache Spark overview
Apache Spark overviewApache Spark overview
Apache Spark overview
 
Intro to Apache Spark
Intro to Apache SparkIntro to Apache Spark
Intro to Apache Spark
 
Intro to apache spark stand ford
Intro to apache spark stand fordIntro to apache spark stand ford
Intro to apache spark stand ford
 
Real Time Data Processing Using Spark Streaming
Real Time Data Processing Using Spark StreamingReal Time Data Processing Using Spark Streaming
Real Time Data Processing Using Spark Streaming
 
An Introduction to Spark
An Introduction to SparkAn Introduction to Spark
An Introduction to Spark
 
Internals
InternalsInternals
Internals
 
Advanced spark training advanced spark internals and tuning reynold xin
Advanced spark training advanced spark internals and tuning reynold xinAdvanced spark training advanced spark internals and tuning reynold xin
Advanced spark training advanced spark internals and tuning reynold xin
 
Advanced Spark Programming - Part 2 | Big Data Hadoop Spark Tutorial | CloudxLab
Advanced Spark Programming - Part 2 | Big Data Hadoop Spark Tutorial | CloudxLabAdvanced Spark Programming - Part 2 | Big Data Hadoop Spark Tutorial | CloudxLab
Advanced Spark Programming - Part 2 | Big Data Hadoop Spark Tutorial | CloudxLab
 
Apache Spark: What's under the hood
Apache Spark: What's under the hoodApache Spark: What's under the hood
Apache Spark: What's under the hood
 
Apache spark Intro
Apache spark IntroApache spark Intro
Apache spark Intro
 
Spark shuffle introduction
Spark shuffle introductionSpark shuffle introduction
Spark shuffle introduction
 
Tuning and Debugging in Apache Spark
Tuning and Debugging in Apache SparkTuning and Debugging in Apache Spark
Tuning and Debugging in Apache Spark
 

Destaque

Derrick Miles on Executive Book Summaries
Derrick Miles on Executive Book SummariesDerrick Miles on Executive Book Summaries
Derrick Miles on Executive Book Summaries
TheMilestoneBrand
 
Apache Spark & Scala
Apache Spark & ScalaApache Spark & Scala
Apache Spark & Scala
Edureka!
 

Destaque (14)

Spark crash course workshop at Hadoop Summit
Spark crash course workshop at Hadoop SummitSpark crash course workshop at Hadoop Summit
Spark crash course workshop at Hadoop Summit
 
Machine learning Mindmap
Machine learning MindmapMachine learning Mindmap
Machine learning Mindmap
 
Study Notes: Apache Spark
Study Notes: Apache SparkStudy Notes: Apache Spark
Study Notes: Apache Spark
 
Visual book summaries
Visual book summariesVisual book summaries
Visual book summaries
 
ProQuest Safari: essentials of computing and popular technology
ProQuest Safari: essentials of computing and popular technologyProQuest Safari: essentials of computing and popular technology
ProQuest Safari: essentials of computing and popular technology
 
SparkNotes
SparkNotesSparkNotes
SparkNotes
 
Derrick Miles on Executive Book Summaries
Derrick Miles on Executive Book SummariesDerrick Miles on Executive Book Summaries
Derrick Miles on Executive Book Summaries
 
Julius Caesar - Summary
Julius Caesar - SummaryJulius Caesar - Summary
Julius Caesar - Summary
 
Beneath RDD in Apache Spark by Jacek Laskowski
Beneath RDD in Apache Spark by Jacek LaskowskiBeneath RDD in Apache Spark by Jacek Laskowski
Beneath RDD in Apache Spark by Jacek Laskowski
 
Evolution of Big Data at Intel - Crawl, Walk and Run Approach
Evolution of Big Data at Intel - Crawl, Walk and Run ApproachEvolution of Big Data at Intel - Crawl, Walk and Run Approach
Evolution of Big Data at Intel - Crawl, Walk and Run Approach
 
Apache Spark & Scala
Apache Spark & ScalaApache Spark & Scala
Apache Spark & Scala
 
Great Executive Summaries
Great Executive SummariesGreat Executive Summaries
Great Executive Summaries
 
The Lean Startup - Visual Summary
The Lean Startup - Visual SummaryThe Lean Startup - Visual Summary
The Lean Startup - Visual Summary
 
Inside Apple
Inside AppleInside Apple
Inside Apple
 

Semelhante a Spark Study Notes

Apache spark sneha challa- google pittsburgh-aug 25th
Apache spark  sneha challa- google pittsburgh-aug 25thApache spark  sneha challa- google pittsburgh-aug 25th
Apache spark sneha challa- google pittsburgh-aug 25th
Sneha Challa
 
End-to-end Data Pipeline with Apache Spark
End-to-end Data Pipeline with Apache SparkEnd-to-end Data Pipeline with Apache Spark
End-to-end Data Pipeline with Apache Spark
Databricks
 

Semelhante a Spark Study Notes (20)

Apache spark sneha challa- google pittsburgh-aug 25th
Apache spark  sneha challa- google pittsburgh-aug 25thApache spark  sneha challa- google pittsburgh-aug 25th
Apache spark sneha challa- google pittsburgh-aug 25th
 
In Memory Analytics with Apache Spark
In Memory Analytics with Apache SparkIn Memory Analytics with Apache Spark
In Memory Analytics with Apache Spark
 
An introduction To Apache Spark
An introduction To Apache SparkAn introduction To Apache Spark
An introduction To Apache Spark
 
Paris Data Geek - Spark Streaming
Paris Data Geek - Spark Streaming Paris Data Geek - Spark Streaming
Paris Data Geek - Spark Streaming
 
20170126 big data processing
20170126 big data processing20170126 big data processing
20170126 big data processing
 
Big data vahidamiri-tabriz-13960226-datastack.ir
Big data vahidamiri-tabriz-13960226-datastack.irBig data vahidamiri-tabriz-13960226-datastack.ir
Big data vahidamiri-tabriz-13960226-datastack.ir
 
Spark core
Spark coreSpark core
Spark core
 
Spark SQL
Spark SQLSpark SQL
Spark SQL
 
Unified Big Data Processing with Apache Spark (QCON 2014)
Unified Big Data Processing with Apache Spark (QCON 2014)Unified Big Data Processing with Apache Spark (QCON 2014)
Unified Big Data Processing with Apache Spark (QCON 2014)
 
Vitalii Bondarenko HDinsight: spark. advanced in memory big-data analytics wi...
Vitalii Bondarenko HDinsight: spark. advanced in memory big-data analytics wi...Vitalii Bondarenko HDinsight: spark. advanced in memory big-data analytics wi...
Vitalii Bondarenko HDinsight: spark. advanced in memory big-data analytics wi...
 
SE2016 BigData Vitalii Bondarenko "HD insight spark. Advanced in-memory Big D...
SE2016 BigData Vitalii Bondarenko "HD insight spark. Advanced in-memory Big D...SE2016 BigData Vitalii Bondarenko "HD insight spark. Advanced in-memory Big D...
SE2016 BigData Vitalii Bondarenko "HD insight spark. Advanced in-memory Big D...
 
Unified Big Data Processing with Apache Spark
Unified Big Data Processing with Apache SparkUnified Big Data Processing with Apache Spark
Unified Big Data Processing with Apache Spark
 
Fast Data Analytics with Spark and Python
Fast Data Analytics with Spark and PythonFast Data Analytics with Spark and Python
Fast Data Analytics with Spark and Python
 
A look under the hood at Apache Spark's API and engine evolutions
A look under the hood at Apache Spark's API and engine evolutionsA look under the hood at Apache Spark's API and engine evolutions
A look under the hood at Apache Spark's API and engine evolutions
 
Apache Spark Overview @ ferret
Apache Spark Overview @ ferretApache Spark Overview @ ferret
Apache Spark Overview @ ferret
 
Reactive dashboard’s using apache spark
Reactive dashboard’s using apache sparkReactive dashboard’s using apache spark
Reactive dashboard’s using apache spark
 
Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache Spark
 
End-to-end Data Pipeline with Apache Spark
End-to-end Data Pipeline with Apache SparkEnd-to-end Data Pipeline with Apache Spark
End-to-end Data Pipeline with Apache Spark
 
5 Ways to Use Spark to Enrich your Cassandra Environment
5 Ways to Use Spark to Enrich your Cassandra Environment5 Ways to Use Spark to Enrich your Cassandra Environment
5 Ways to Use Spark to Enrich your Cassandra Environment
 
Lightening Fast Big Data Analytics using Apache Spark
Lightening Fast Big Data Analytics using Apache SparkLightening Fast Big Data Analytics using Apache Spark
Lightening Fast Big Data Analytics using Apache Spark
 

Mais de Richard Kuo

UML, OWL and REA based enterprise business model 20110201a
UML, OWL and REA based enterprise business model 20110201aUML, OWL and REA based enterprise business model 20110201a
UML, OWL and REA based enterprise business model 20110201a
Richard Kuo
 
Cloud computing reference architecture from nist and ibm
Cloud computing reference architecture from nist and ibmCloud computing reference architecture from nist and ibm
Cloud computing reference architecture from nist and ibm
Richard Kuo
 

Mais de Richard Kuo (15)

Machine Learning - Convolutional Neural Network
Machine Learning - Convolutional Neural NetworkMachine Learning - Convolutional Neural Network
Machine Learning - Convolutional Neural Network
 
View Orchestration from Model Driven Engineering Prospective
View Orchestration from Model Driven Engineering ProspectiveView Orchestration from Model Driven Engineering Prospective
View Orchestration from Model Driven Engineering Prospective
 
Telecom Infra Project study notes
Telecom Infra Project study notesTelecom Infra Project study notes
Telecom Infra Project study notes
 
5g, gpu and fpga
5g, gpu and fpga5g, gpu and fpga
5g, gpu and fpga
 
Learning
Learning Learning
Learning
 
Kubernetes20151017a
Kubernetes20151017aKubernetes20151017a
Kubernetes20151017a
 
IaaS with Chef
IaaS with ChefIaaS with Chef
IaaS with Chef
 
Ontology, Semantic Web and DBpedia
Ontology, Semantic Web and DBpediaOntology, Semantic Web and DBpedia
Ontology, Semantic Web and DBpedia
 
SDN and NFV
SDN and NFVSDN and NFV
SDN and NFV
 
Graph Database
Graph DatabaseGraph Database
Graph Database
 
UML, OWL and REA based enterprise business model 20110201a
UML, OWL and REA based enterprise business model 20110201aUML, OWL and REA based enterprise business model 20110201a
UML, OWL and REA based enterprise business model 20110201a
 
Open v switch20150410b
Open v switch20150410bOpen v switch20150410b
Open v switch20150410b
 
Docker and coreos20141020b
Docker and coreos20141020bDocker and coreos20141020b
Docker and coreos20141020b
 
Git studynotes
Git studynotesGit studynotes
Git studynotes
 
Cloud computing reference architecture from nist and ibm
Cloud computing reference architecture from nist and ibmCloud computing reference architecture from nist and ibm
Cloud computing reference architecture from nist and ibm
 

Último

Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
panagenda
 
Architecting Cloud Native Applications
Architecting Cloud Native ApplicationsArchitecting Cloud Native Applications
Architecting Cloud Native Applications
WSO2
 
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Victor Rentea
 

Último (20)

Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
 
Vector Search -An Introduction in Oracle Database 23ai.pptx
Vector Search -An Introduction in Oracle Database 23ai.pptxVector Search -An Introduction in Oracle Database 23ai.pptx
Vector Search -An Introduction in Oracle Database 23ai.pptx
 
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
 
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdfRising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
 
FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a Fresher
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of Terraform
 
Architecting Cloud Native Applications
Architecting Cloud Native ApplicationsArchitecting Cloud Native Applications
Architecting Cloud Native Applications
 
Mcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Mcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot ModelMcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Mcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot Model
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century education
 
[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdf[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdf
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...
 
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
 
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
 
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ..."I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
 
MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024
 
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
 
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
 
DBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor Presentation
 

Spark Study Notes

  • 1. Notes Sharing Richard Kuo, Professional-Technical Architect, Domain 2.0 Architecture & Planning
  • 2. Agenda • Big Data • Overview of Spark • Main Concepts – RDD – Transformations – Programming Model • Observation 01/06/15 Creative Common, BY, SA, NC 2
  • 3. What is Apache Spark? • Fast and general cluster computing system, interoperable with Hadoop. • Improves efficiency through: – In-memory computing primitives – General computational graph • Improves usability through: – Rich APIs in Scala, Java, Python – Interactive shell 01/06/15 Creative Common, BY, SA, NC 3
  • 4. Big Data: Hadoop Ecosystem 01/06/15 Creative Common, BY, SA, NC 4
  • 6. Comparison with Hadoop Hadoop Spark Map Reduce Framework Generalized Computation Usually data is on disk (HDFS) On disk or in memory Not ideal for iterative works Data can be cached in memory, great for iterative works Batch process Real time streaming or batch Up to 10x faster when data is in disk Up to 100x faster when data is in memory 2-5x time less code to write Support Scala, Java and Python Code re-use across modules Interactive shell for ad-hoc exploratory Library support: GraphX, Machine Learning, SQL, R, Streaming, … 01/06/15 Creative Common, BY, SA, NC 6
  • 8. Compare to Hadoop: 01/06/15 Creative Common, BY, SA, NC 8
  • 9. System performance degrade gracefully with less RAM 69 58 41 30 12 0 20 40 60 80 100 Cache disabled 25% 50% 75% Fully cached Executiontime(s) % of working set in cache 01/06/15 Creative Common, BY, SA, NC 9
  • 10. Software Components • Spark runs as a library in your program (1 instance per app) • Runs tasks locally or on cluster – Mesos, YARN or standalone mode • Accesses storage systems via Hadoop InputFormat API – Can use HBase, HDFS, S3, … Your application SparkContext Local threads Cluster manager Worker Spark executor Worker Spark executor HDFS or other storage 01/06/15 Creative Common, BY, SA, NC 10
  • 11. Spark Architecture • [Spark Standalone • |Mesos • |Yarn] Node Client 01/06/15 Creative Common, BY, SA, NC 11
  • 12. Key Concept: RDD’s Resilient Distributed Datasets • Collections of objects spread across a cluster, stored in RAM or on Disk • Built through parallel transformations • Automatically rebuilt on failure Operations • Transformations (e.g. map, filter, groupBy) • Actions (e.g. count, collect, save) 01/06/15 Creative Common, BY, SA, NC 12 Write programs in terms of operations on distributed datasets
  • 13. Fault Recovery RDDs track the series of transformations used to build them (their lineage) to re-compute lost data, no data replication across wire. val lines = sc.textFile(...) lines.filter(x => x.contains(“ERROR”)).count() msgs = textFile.filter(lambda s: s.startsWith(“ERROR”)) .map(lambda s: s.split(“t”)[2]) 01/06/15 Creative Common, BY, SA, NC 13 HDFS File Filtered RDD Mapped RDD filter (func = startsWith(…)) map (func = split(...))
  • 14. Language Support Standalone Programs •Python, Scala, & Java Interactive Shells • Python & Scala Performance • Java & Scala are faster due to static typing • …but Python is often fine Python lines = sc.textFile(...) lines.filter(lambda s: “ERROR” in s).count() Scala val lines = sc.textFile(...) lines.filter(x => x.contains(“ERROR”)).count() Java JavaRDD<String> lines = sc.textFile(...); lines.filter(new Function<String, Boolean>() { Boolean call(String s) { return s.contains(“error”); } }).count(); 01/06/15 Creative Common, BY, SA, NC 14
  • 15. Interactive Shell • The fastest way to learn Spark • Available in Python and Scala • Runs as an application on an existing Spark Cluster… • Or can run locally 01/06/15 Creative Common, BY, SA, NC 15
  • 18. Spark Streaming 01/06/15 Creative Common, BY, SA, NC 18
  • 19. Spark Streaming 01/06/15 Creative Common, BY, SA, NC 19
  • 20. Spark Streaming: Word Count import org.apache.spark.SparkConf import org.apache.spark.streaming.{Seconds, StreamingContext} import org.apache.spark.streaming.StreamingContext._ import org.apache.spark.storage.StorageLevel object NetworkWordCount { def main(args: Array[String]) { if (args.length < 2) { System.err.println("Usage: NetworkWordCount <hostname> <port>") System.exit(1) } StreamingExamples.setStreamingLogLevels() // Create the context with a 1 second batch size val sparkConf = new SparkConf().setAppName("NetworkWordCount") val ssc = new StreamingContext(sparkConf, Seconds(1)) val lines = ssc.socketTextStream(args(0), args(1).toInt, StorageLevel.MEMORY_AND_DISK_SER) val words = lines.flatMap(_.split(" ")) val wordCounts = words.map(x => (x, 1)).reduceByKey(_ + _) wordCounts.print() ssc.start() ssc.awaitTermination() } } 01/06/15 Creative Common, BY, SA, NC 20 Create Spark Context Create, map, reduce Output Start
  • 22. Conclusion • Spark offers a rich API to make data analytics fast: both less to write and fast to run. • Achieves 100x speedups in real applications. • Growing community. 01/06/15 Creative Common, BY, SA, NC 22
  • 23. Observations: • A lot of data, different kinds of data, generated faster, need analyzed in real-time. • All* products are data products. • More complicate analytic algorithms applies to commercial products and services. • Not all data analysis requires the same accuracy. • Expectation on service delivery increases. 01/06/15 Creative Common, BY, SA, NC 23
  • 24. Reference: • AMPLab at UC Berkeley • Databrick • UC BerkeleyX – CS100.1x Introduction to Big Data with Apache Spark, starts 23 Feb 2015, 5 weeks – CS190.1x Scalable Machine Learning, starts 14 Apr 2015, 5 weeks • Spark Summit 2014 Training • Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing • An Architecture for Fast and General Data Processing on Large Clusters • Richard’s Study Notes – Self Study AMPCamp – Hortonworks HDP 2.2 Study 01/06/15 Creative Common, BY, SA, NC 24

Notas do Editor

  1. MPI (Message Passing Interface)
  2. http://www.eecs.berkeley.edu/Pubs/TechRpts/2014/EECS-2014-12.pdf
  3. Gracefully
  4. The barrier to entry for working with the spark API is minimal
  5. (word, 1L) reduceByKey(_, _)
  6. from http://spark.apache.org/docs/latest/streaming-programming-guide.html /** * Usage: NetworkWordCount <hostname> <port> * To run this on your local machine, you need to first run a Netcat server * `$ nc -lk 9999` * and then run the example * `$ bin/run-example org.apache.spark.examples.streaming.NetworkWordCount localhost 9999` */