Introduction to Spark Developer Training
Diana Carroll | Senior Curriculum Developer
Agenda
• Cloudera's Learning Path for Developers
• Target Audience and Prerequisites
• Course Outline
• Short Presentation Based on Actual Course Material
• Question and Answer Session
Learning Path: Developers
Create Powerful New Data Processing Tools
• Learn to code and write MapReduce programs for production
• Master advanced API topics required for real-world data analysis
• Design schemas to minimize latency on massive data sets
• Scale hundreds of thousands of operations per second
• Implement recommenders and data experiments
• Draw actionable insights from analysis of disparate data
• Build converged applications using multiple processing engines
• Develop enterprise solutions using components across the EDH
• Combine batch and stream processing with interactive analytics
• Optimize applications for speed, ease of use, and sophistication
Courses along the path: Developer Training, Spark Training, HBase Training, Intro to Data Science, Big Data Applications
[Photo: Aaron T. Myers, Software Engineer]
Why Cloudera Training?
Aligned to Best Practices and the Pace of Change
1. Broadest Range of Courses – Developer, Admin, Analyst, HBase, Data Science
2. Most Experienced Instructors – More than 20,000 students trained since 2009
3. Leader in Certification – Over 8,000 accredited Cloudera professionals
4. Trusted Source for Training – 100,000+ people have attended online courses
5. State of the Art Curriculum – Courses updated as Hadoop evolves
6. Widest Geographic Coverage – Most classes offered: 50 cities worldwide plus online
7. Most Relevant Platform & Community – CDH deployed more than all other distributions combined
8. Depth of Training Material – Hands-on labs and VMs support live instruction
9. Ongoing Learning – Video tutorials and e-learning complement training
10. Commitment to Big Data Education – University partnerships to teach Hadoop in the classroom
Cloudera Developer Training for Apache Spark
About the Course

Target Audience
• Intended for people who write code, such as
  – Software Engineers
  – Data Engineers
  – ETL Developers

Course Prerequisites
• No prior knowledge of Spark, Hadoop or distributed programming concepts is required
• Requirements
  – Basic familiarity with Linux or Unix
    $ mkdir /data
    $ cd /data
    $ rm /home/johndoe/salesreport.txt
  – Intermediate-level programming skills in either Scala or Python
Example of Required Scala Skill Level
• Do you understand the following code? Could you write something similar?

object Maps {
  val colors = Map("red" -> 0xFF0000,
                   "turquoise" -> 0x00FFFF,
                   "black" -> 0x000000,
                   "orange" -> 0xFF8040,
                   "brown" -> 0x804000)

  def main(args: Array[String]) {
    for (name <- args) println(
      colors.get(name) match {
        case Some(code) =>
          name + " has code: " + code
        case None =>
          "Unknown color: " + name
      }
    )
  }
}
Example of Required Python Skill Level
• Do you understand the following code? Could you write something similar?

import sys

def parsePurchases(s):
    return s.split(',')

if __name__ == "__main__":
    if len(sys.argv) < 2:
        print "Usage: SumPrices <products>"
        exit(-1)
    prices = {'apple': 0.40, 'banana': 0.50, 'orange': 0.10}
    total = sum(prices[fruit]
                for fruit in parsePurchases(sys.argv[1]))
    print 'Total: $%.2f' % total

Practicing Scala or Python
• Getting started with Scala
  – www.scala-lang.org
• Getting started with Python
  – python.org
  – developers.google.com/edu/python
  – and many more
Course Outline
1. Introduction
2. What is Spark?
3. Spark Basics
4. Working with RDDs
5. The Hadoop Distributed File System
6. Running Spark on a Cluster
7. Parallel Programming with Spark
8. Caching and Persistence
9. Writing Spark Applications
10. Spark Streaming
11. Common Patterns in Spark Programming
12. Improving Spark Performance
13. Spark, Hadoop and the Enterprise Data Center
14. Conclusion

Course Excerpt
• Based on
  – Chapter 3: Spark Basics
  – Chapter 4: Working with RDDs
• Topics
  – What is Spark?
  – The components of a distributed data processing system
  – Intro to the Spark Shell
  – Resilient Distributed Datasets
  – RDD operations
  – Example: WordCount

What is Apache Spark?
• Apache Spark is a fast, general engine for large-scale data processing and analysis
  – Open source, developed at UC Berkeley
• Written in Scala
  – Functional programming language that runs in a JVM
• Key Concepts
  – Avoid the data bottleneck by distributing data when it is stored
  – Bring the processing to the data
  – Data stored in memory
Distributed Processing with the Spark Framework
• API: Spark
• Cluster Computing: Spark Standalone, YARN, Mesos
• Storage: HDFS (Hadoop Distributed File System)

What is Apache Spark?
• Spark Shell
  – Interactive REPL – for learning or data exploration
  – Python or Scala
• Spark Applications
  – For large scale data processing
  – Python, Java or Scala
Python Shell

$ pyspark
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 0.9.1
      /_/

Using Python version 2.6.6 (r266:84292, Jan 22 2014 09:42:36)
Spark context available as sc.
>>>

Scala Shell

$ spark-shell
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 0.9.1
      /_/

Using Scala version 2.10.3 (Java HotSpot(TM) 64-Bit Server VM, Java 1.7.0_51)
Created spark context..
Spark context available as sc.
scala>

Spark Context
• Every Spark application requires a Spark Context
  – The main entry point to the Spark API
• The Spark Shell provides a preconfigured Spark Context called sc

Python:
>>> sc.appName
u'PySparkShell'

Scala:
scala> sc.appName
res0: String = Spark shell
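Outside the shell there is no preconfigured sc, so a standalone application creates its own Spark Context (covered later in the course). A minimal sketch, not taken from the course material; the application name and local master are illustrative values:

from pyspark import SparkConf, SparkContext

# The shell does this for you; an application configures and creates the context itself
conf = SparkConf().setAppName("MyApp").setMaster("local")   # illustrative values
sc = SparkContext(conf=conf)

mydata = sc.textFile("purplecow.txt")
print mydata.count()

sc.stop()   # release the context when the application is done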
RDD (Resilient Distributed Dataset)
• RDD (Resilient Distributed Dataset)
  – Resilient – if data in memory is lost, it can be recreated
  – Distributed – stored in memory across the cluster
  – Dataset – initial data can come from a file or created programmatically
• RDDs are the fundamental unit of data in Spark
• Most of Spark programming is performing operations on RDDs

[Diagram: an RDD whose data elements are distributed across the cluster]
Example: A File-based RDD

File: purplecow.txt
  I've never seen a purple cow.
  I never hope to see one;
  But I can tell you, anyhow,
  I'd rather see than be one.

> mydata = sc.textFile("purplecow.txt")
> mydata.count()
4

RDD: mydata holds the four lines of the file as its elements.
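An RDD does not have to come from a file; as the definition above notes, the initial data can also be created programmatically. A small sketch (illustrative only, not from the course material):

> nums = sc.parallelize([1, 2, 3, 4])   # build an RDD from an in-memory collection
> nums.count()
4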
RDD Operations
• Two types of RDD operations
  – Actions – return values
    – count
    – take(n)
  – Transformations – define a new RDD based on the current one
    – filter
    – map
    – reduceByKey

[Diagram: an action returns a value from an RDD; a transformation produces a new RDD from a base RDD]
Example: map and filter Transformations

Base RDD:
  I've never seen a purple cow.
  I never hope to see one;
  But I can tell you, anyhow,
  I'd rather see than be one.

map(lambda line: line.upper())            (Python)
map(line => line.toUpperCase())           (Scala)

  I'VE NEVER SEEN A PURPLE COW.
  I NEVER HOPE TO SEE ONE;
  BUT I CAN TELL YOU, ANYHOW,
  I'D RATHER SEE THAN BE ONE.

filter(lambda line: line.startswith('I'))   (Python)
filter(line => line.startsWith('I'))        (Scala)

  I'VE NEVER SEEN A PURPLE COW.
  I NEVER HOPE TO SEE ONE;
  I'D RATHER SEE THAN BE ONE.
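In the Python shell, the same transformations can be chained directly on the file-based RDD from the earlier example. A small sketch (illustrative only):

> mydata = sc.textFile("purplecow.txt")
> shouting = mydata.map(lambda line: line.upper()) \
                   .filter(lambda line: line.startswith('I'))
> shouting.count()
3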
RDDs
• RDDs can hold any type of element
  – Primitive types: integers, characters, booleans, strings, etc.
  – Sequence types: lists, arrays, tuples, dicts, etc. (including nested)
  – Scala/Java Objects (if serializable)
  – Mixed types
• Some types of RDDs have additional functionality
  – Double RDDs – RDDs consisting of numeric data
  – Pair RDDs – RDDs consisting of key-value pairs

Pair RDDs
• Pair RDDs are a special form of RDD
  – Each element must be a key-value pair (a two-element tuple)
  – Keys and values can be any type
• Why?
  – Use with MapReduce algorithms
  – Many additional functions are available for common data processing needs
    – E.g. sorting, joining, grouping, counting, etc.

[Diagram: a Pair RDD with elements (key1,value1), (key2,value2), (key3,value3), 
]
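A brief sketch of Pair RDD functions in the Python shell (illustrative only; the data and the exact output ordering are invented for the example):

> pets = sc.parallelize([("cat", 1), ("dog", 1), ("cat", 2)])   # each element is a (key, value) tuple
> sorted_pets = pets.sortByKey()                                # sort elements by key
> totals = pets.reduceByKey(lambda v1, v2: v1 + v2)             # sum the values for each key
> totals.collect()   # e.g. [('cat', 3), ('dog', 1)]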
MapReduce
• MapReduce is a common programming model
  – Two phases
    – Map – process each element in a data set
    – Reduce – aggregate or consolidate the data
  – Easily applicable to distributed processing of large data sets
• Hadoop MapReduce is the major implementation
  – Limited
    – Each job has one Map phase and one Reduce phase
    – Job output is saved to files
• Spark implements MapReduce with much greater flexibility
  – Map and Reduce functions can be interspersed
  – Results are stored in memory
  – Operations can be chained easily
MapReduce Example: Word Count

Input Data:
  the cat sat on the mat
  the aardvark sat on the sofa

        ?

Result:
  aardvark 1
  cat 1
  mat 1
  on 2
  sat 2
  sofa 1
  the 4
Example: Word Count

> counts = sc.textFile(file)

  the cat sat on the mat
  the aardvark sat on the sofa

> counts = sc.textFile(file) \
      .flatMap(lambda line: line.split())

  the
  cat
  sat
  on
  the
  mat
  the
  aardvark
  sat
  


> counts = sc.textFile(file) \
      .flatMap(lambda line: line.split()) \
      .map(lambda word: (word,1))

  Key-Value Pairs:
  (the, 1)
  (cat, 1)
  (sat, 1)
  (on, 1)
  (the, 1)
  (mat, 1)
  (the, 1)
  (aardvark, 1)
  (sat, 1)
  


> counts = sc.textFile(file) \
      .flatMap(lambda line: line.split()) \
      .map(lambda word: (word,1)) \
      .reduceByKey(lambda v1,v2: v1+v2)

  (aardvark, 1)
  (cat, 1)
  (mat, 1)
  (on, 2)
  (sat, 2)
  (sofa, 1)
  (the, 4)
ReduceByKey
• The function passed to reduceByKey must be
  – Binary – combines two values
  – Commutative – x+y = y+x
  – Associative – (x+y)+z = x+(y+z)

> counts = sc.textFile(file) \
      .flatMap(lambda line: line.split()) \
      .map(lambda word: (word,1)) \
      .reduceByKey(lambda v1,v2: v1+v2)

Input pairs:
  (the,1) (cat,1) (sat,1) (on,1) (the,1) (mat,1)
  (the,1) (aardvark,1) (sat,1) (on,1) (the,1)

Running reduction for the key 'the':
  (the,1) + (the,1) → (the,2)
  (the,2) + (the,1) → (the,3)
  (the,3) + (the,1) → (the,4)

Result:
  (aardvark, 1)
  (cat, 1)
  (mat, 1)
  (on, 2)
  (sat, 2)
  (sofa, 1)
  (the, 4)
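These requirements matter because Spark may combine values in any grouping and order. For instance, a per-key average cannot be computed by averaging two values at a time, because that operation is not associative; reducing (sum, count) pairs and dividing afterwards does satisfy the requirements. A hypothetical sketch (not from the course material; the data is invented):

> sales = sc.parallelize([("storeA", 10.0), ("storeA", 20.0),
...                       ("storeA", 30.0), ("storeB", 5.0)])
> # Wrong: e.g. ((10 avg 20) avg 30) = 22.5, but the true storeA average is 20.0
> wrong = sales.reduceByKey(lambda v1, v2: (v1 + v2) / 2)
> # Right: (sum, count) pairs combine associatively and commutatively
> avgs = sales.map(lambda pair: (pair[0], (pair[1], 1))) \
              .reduceByKey(lambda a, b: (a[0] + b[0], a[1] + b[1])) \
              .map(lambda pair: (pair[0], pair[1][0] / pair[1][1]))
> avgs.collect()   # e.g. [('storeA', 20.0), ('storeB', 5.0)]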
Example: Word Count

> counts = sc.textFile(file) \
      .flatMap(lambda line: line.split()) \
      .map(lambda word: (word,1)) \
      .reduceByKey(lambda v1,v2: v1+v2)
> counts.saveAsTextFile(output)

Result RDD:
  (aardvark, 1)
  (cat, 1)
  (mat, 1)
  (on, 2)
  (sat, 2)
  (sofa, 1)
  (the, 4)

Saved as text output:
  (aardvark,1)
  (cat,1)
  (mat,1)
  (on,2)
  (sat,2)
  (sofa,1)
  (the,4)
Spark v. Hadoop MapReduce
• Spark takes the concepts of MapReduce to the next level
  – Higher level API = faster, easier development
  – Low latency = near real-time processing
  – In-memory data storage = up to 100x performance improvement

Word Count in Hadoop MapReduce (Java):

public class WordCount {
  public static void main(String[] args) throws Exception {
    Job job = new Job();
    job.setJarByClass(WordCount.class);
    job.setJobName("Word Count");
    FileInputFormat.setInputPaths(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    job.setMapperClass(WordMapper.class);
    job.setReducerClass(SumReducer.class);
    job.setMapOutputKeyClass(Text.class);
    job.setMapOutputValueClass(IntWritable.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    boolean success = job.waitForCompletion(true);
    System.exit(success ? 0 : 1);
  }
}

public class WordMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
  public void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    String line = value.toString();
    for (String word : line.split("\\W+")) {
      if (word.length() > 0)
        context.write(new Text(word), new IntWritable(1));
    }
  }
}

public class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
  public void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {
    int wordCount = 0;
    for (IntWritable value : values) {
      wordCount += value.get();
    }
    context.write(key, new IntWritable(wordCount));
  }
}

The same Word Count in Spark (Python):

> counts = sc.textFile(file) \
      .flatMap(lambda line: line.split()) \
      .map(lambda word: (word,1)) \
      .reduceByKey(lambda v1,v2: v1+v2)
> counts.saveAsTextFile(output)
[Chart: Logistic Regression performance comparison]
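The in-memory gains come largely from caching: an RDD kept in memory can be reused across repeated operations, which is what iterative algorithms such as logistic regression rely on (caching and persistence are covered later in the course). A minimal sketch using the Word Count RDD from above (illustrative only):

> counts.cache()     # keep the computed pairs in memory
> counts.count()     # the first action reads the file and computes the counts
7
> counts.take(3)     # subsequent actions reuse the cached data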
Thank you for attending!
• Submit questions in the Q&A panel
• Follow Cloudera University @ClouderaU
• Follow Diana on GitHub: https://github.com/dianacarroll
• Follow the Developer learning path: http://university.cloudera.com/developers
• Learn about the enterprise data hub: http://tinyurl.com/edh-webinar
• Join the Cloudera user community: http://community.cloudera.com/

Register now for Cloudera training at http://university.cloudera.com
• Use discount code Spark_10 to save 10% on new enrollments in Spark Developer Training classes delivered by Cloudera until October 3, 2014*
• Use discount code 15off2 to save 15% on enrollments in two or more training classes delivered by Cloudera until October 3, 2014*
* Excludes classes sold or delivered by Cloudera partners
  • 31.  Apache Spark is a fast, general engine for large-scale data processing and analysis –Open source, developed at UC Berkeley  Written in Scala –Functional programming language that runs in a JVM  Key Concepts –Avoid the data bottleneck by distributing data when it is stored –Bring the processing to the data –Data stored in memory What is Apache Spark?
  • 32. Distributed Processing with the Spark Framework API Spark
  • 33. Distributed Processing with the Spark Framework API Cluster Computing Spark • Spark Standalone • YARN • Mesos
  • 34. Distributed Processing with the Spark Framework API Cluster Computing Storage Spark • Spark Standalone • YARN • Mesos HDFS (Hadoop Distributed File System)
  • 35.  Spark Shell –Interactive REPL – for learning or data exploration –Python or Scala  Spark Applications –For large scale data processing –Python, Java or Scala What is Apache Spark? Python Shell: $ pyspark Welcome to [Spark ASCII-art banner] version 0.9.1 Using Python version 2.6.6 (r266:84292, Jan 22 2014 09:42:36) Spark context available as sc. >>> Scala Shell: $ spark-shell Welcome to [Spark ASCII-art banner] version 0.9.1 Using Scala version 2.10.3 (Java HotSpot(TM) 64-Bit Server VM, Java 1.7.0_51) Created Spark context. Spark context available as sc. scala>
  • 36.  Every Spark application requires a Spark Context –The main entry point to the Spark API  Spark Shell provides a preconfigured Spark Context called sc Spark Context >>> sc.appName u'PySparkShell' scala> sc.appName res0: String = Spark shell
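For reference, here is a minimal sketch of what the shell does for you behind the scenes. In a standalone application you create the context yourself; the master URL and application name below are illustrative, and the exact constructor arguments vary slightly between Spark versions:

from pyspark import SparkContext

sc = SparkContext("local", "ExampleApp")   # the shell pre-creates this as 'sc'
# ... perform RDD operations with sc ...
sc.stop()                                  # release the context when finished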
  • 37.  RDD (Resilient Distributed Dataset) –Resilient – if data in memory is lost, it can be recreated –Distributed – stored in memory across the cluster –Dataset – initial data can come from a file or be created programmatically  RDDs are the fundamental unit of data in Spark  Most of Spark programming is performing operations on RDDs
  • 38. I've never seen a purple cow. I never hope to see one; But I can tell you, anyhow, I'd rather see than be one. Example: A File-based RDD I've never seen a purple cow. I never hope to see one; But I can tell you, anyhow, I'd rather see than be one. File: purplecow.txt RDD: mydata > mydata = sc.textFile("purplecow.txt")
  • 39. I've never seen a purple cow. I never hope to see one; But I can tell you, anyhow, I'd rather see than be one. Example: A File-based RDD I've never seen a purple cow. I never hope to see one; But I can tell you, anyhow, I'd rather see than be one. File: purplecow.txt RDD: mydata > mydata = sc.textFile("purplecow.txt") > mydata.count() 4
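Assuming the same shell session, a couple of other simple actions you could try on this RDD (a sketch, not part of the original slides):

> mydata.first()        # returns the first line of the file
> mydata.take(2)        # returns the first two lines as a Python list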
  • 40.  Two types of RDD operations –Actions – return values –count –take(n) RDD Operations value RDD
  • 41.  Two types of RDD operations –Actions – return values –count –take(n) –Transformations – define new RDDs based on the current one –filter –map –reduceByKey RDD Operations value RDD New RDD Base RDD
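A small sketch of the distinction, assuming the purplecow.txt file used earlier: a transformation such as filter defines a new RDD, while an action such as count returns a value.

> mydata = sc.textFile("purplecow.txt")                  # base RDD
> cowlines = mydata.filter(lambda line: "cow" in line)   # transformation: new RDD
> cowlines.count()                                       # action: returns a number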
  • 42. I've never seen a purple cow. I never hope to see one; But I can tell you, anyhow, I'd rather see than be one. Example: map and filter Transformations
  • 43. I've never seen a purple cow. I never hope to see one; But I can tell you, anyhow, I'd rather see than be one. I'VE NEVER SEEN A PURPLE COW. I NEVER HOPE TO SEE ONE; BUT I CAN TELL YOU, ANYHOW, I'D RATHER SEE THAN BE ONE. Example: map and filter Transformations map(lambda line: line.upper()) map(line => line.toUpperCase())
  • 44. I've never seen a purple cow. I never hope to see one; But I can tell you, anyhow, I'd rather see than be one. I'VE NEVER SEEN A PURPLE COW. I NEVER HOPE TO SEE ONE; BUT I CAN TELL YOU, ANYHOW, I'D RATHER SEE THAN BE ONE. Example: map and filter Transformations I'VE NEVER SEEN A PURPLE COW. I NEVER HOPE TO SEE ONE; I'D RATHER SEE THAN BE ONE. filter(lambda line: line.startswith('I')) map(lambda line: line.upper()) map(line => line.toUpperCase()) filter(line => line.startsWith("I"))
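Wired together as runnable Python shell commands, a sketch based on the four-line poem above:

> mydata = sc.textFile("purplecow.txt")
> upper = mydata.map(lambda line: line.upper())
> istarts = upper.filter(lambda line: line.startswith('I'))
> istarts.count()     # 3 of the 4 lines start with 'I' after upper-casing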
  • 45.  RDDs can hold any type of element –Primitive types: integers, characters, booleans, strings, etc. –Sequence types: lists, arrays, tuples, dicts, etc. (including nested) –Scala/Java Objects (if serializable) –Mixed types RDDs
  • 46.  RDDs can hold any type of element –Primitive types: integers, characters, booleans, strings, etc. –Sequence types: lists, arrays, tuples, dicts, etc. (including nested) –Scala/Java Objects (if serializable) –Mixed types  Some types of RDDs have additional functionality –Double RDDs – RDDs consisting of numeric data –Pair RDDs – RDDs consisting of Key-Value pairs RDDs
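For example, RDDs of numeric data expose extra aggregation helpers. A hedged sketch using parallelize (the values are made up):

> nums = sc.parallelize([1.5, 2.5, 3.0, 4.0])
> nums.sum()
> nums.mean()
> nums.stats()        # count, mean, stdev, min and max in one pass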
  • 47.  Pair RDDs are a special form of RDD –Each element must be a key-value pair (a two-element tuple) –Keys and values can be any type Pair RDDs (key1,value1) (key2,value2) (key3,value3) …
  • 48.  Pair RDDs are a special form of RDD –Each element must be a key-value pair (a two-element tuple) –Keys and values can be any type  Why? –Use with Map-Reduce algorithms –Many additional functions are available for common data processing needs –E.g. sorting, joining, grouping, counting, etc. Pair RDDs (key1,value1) (key2,value2) (key3,value3) …
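A short sketch of the extra functions pair RDDs provide (the data here is illustrative):

> pets = sc.parallelize([("cat", 1), ("dog", 1), ("cat", 2)])
> pets.sortByKey().collect()                             # sorting by key
> pets.groupByKey().collect()                            # grouping values per key
> pets.reduceByKey(lambda v1, v2: v1 + v2).collect()     # summing values per key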
  • 49.  MapReduce is a common programming model –Two phases –Map – process each element in a data set –Reduce – aggregate or consolidate the data –Easily applicable to distributed processing of large data sets MapReduce
  • 50.  MapReduce is a common programming model –Two phases –Map – process each element in a data set –Reduce – aggregate or consolidate the data –Easily applicable to distributed processing of large data sets  Hadoop MapReduce is the major implementation –Limited –Each job has one Map phase and one Reduce phase –Job output saved to files MapReduce
  • 51.  MapReduce is a common programming model –Two phases –Map – process each element in a data set –Reduce – aggregate or consolidate the data –Easily applicable to distributed processing of large data sets  Hadoop MapReduce is the major implementation –Limited –Each job has one Map phase and one Reduce phase –Job output saved to files  Spark implements MapReduce with much greater flexibility –Map and Reduce functions can be interspersed –Results stored in memory –Operations can be chained easily MapReduce
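As an illustration of that flexibility, map and reduce steps can be chained freely in a single flow. A sketch (the input file name is hypothetical): the first map/reduce counts words, and a second map/reduce immediately re-keys the result to show how many words occur N times.

> counts = sc.textFile("mydata.txt") \
      .flatMap(lambda line: line.split()) \
      .map(lambda word: (word, 1)) \
      .reduceByKey(lambda v1, v2: v1 + v2)
> freq = counts.map(lambda pair: (pair[1], 1)) \
      .reduceByKey(lambda v1, v2: v1 + v2)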
  • 52. MapReduce Example: Word Count the cat sat on the mat the aardvark sat on the sofa Input Data Result aardvark 1 cat 1 mat 1 on 2 sat 2 sofa 1 the 4 ?
  • 53. Example: Word Count > counts = sc.textFile(file) the cat sat on the mat the aardvark sat on the sofa
  • 54. Example: Word Count > counts = sc.textFile(file) .flatMap(lambda line: line.split()) the cat sat on the mat the aardvark sat on the sofa the cat sat on the mat the aardvark sat 

  • 55. Example: Word Count > counts = sc.textFile(file) .flatMap(lambda line: line.split()) .map(lambda word: (word,1)) the cat sat on the mat the aardvark sat on the sofa (the, 1) (cat, 1) (sat, 1) (on, 1) (the, 1) (mat, 1) (the, 1) (aardvark, 1) (sat, 1) 
 the cat sat on the mat the aardvark sat 
Key-Value Pairs
  • 56. Example: Word Count > counts = sc.textFile(file) .flatMap(lambda line: line.split()) .map(lambda word: (word,1)) .reduceByKey(lambda v1,v2: v1+v2) (aardvark, 1) (cat, 1) (mat, 1) (on, 2) (sat, 2) (sofa, 1) (the, 4) the cat sat on the mat the aardvark sat on the sofa (the, 1) (cat, 1) (sat, 1) (on, 1) (the, 1) (mat, 1) (the, 1) (aardvark, 1) (sat, 1) 
 the cat sat on the mat the aardvark sat 

  • 59.  ReduceByKey functions must be –Binary – combines two values (for the same key) –Commutative – x+y = y+x –Associative – (x+y)+z = x+(y+z) ReduceByKey (the,1) (cat,1) (sat,1) (on,1) (the,1) (mat,1) (the,1) (aardvark,1) (sat,1) (on,1) (the,1) > counts = sc.textFile(file) .flatMap(lambda line: line.split()) .map(lambda word: (word,1)) .reduceByKey(lambda v1,v2: v1+v2) (the,2) (aardvark, 1) (cat, 1) (mat, 1) (on, 2) (sat, 2) (sofa, 1) (the, 4)
  • 60.  ReduceByKey functions must be –Binary – combines two values (for the same key) –Commutative – x+y = y+x –Associative – (x+y)+z = x+(y+z) ReduceByKey (the,1) (cat,1) (sat,1) (on,1) (the,1) (mat,1) (the,1) (aardvark,1) (sat,1) (on,1) (the,1) > counts = sc.textFile(file) .flatMap(lambda line: line.split()) .map(lambda word: (word,1)) .reduceByKey(lambda v1,v2: v1+v2) (the,2) (the,3) (aardvark, 1) (cat, 1) (mat, 1) (on, 2) (sat, 2) (sofa, 1) (the, 4)
  • 61.  ReduceByKey functions must be –Binary – combines two values (for the same key) –Commutative – x+y = y+x –Associative – (x+y)+z = x+(y+z) ReduceByKey (the,1) (cat,1) (sat,1) (on,1) (the,1) (mat,1) (the,1) (aardvark,1) (sat,1) (on,1) (the,1) > counts = sc.textFile(file) .flatMap(lambda line: line.split()) .map(lambda word: (word,1)) .reduceByKey(lambda v1,v2: v1+v2) (the,2) (the,3) (the,4) (aardvark, 1) (cat, 1) (mat, 1) (on, 2) (sat, 2) (sofa, 1) (the, 4)
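To see why the associativity requirement matters, consider averaging: combining two values with (a + b) / 2 gives different results depending on grouping order, so it cannot be passed to reduceByKey directly. A common workaround, sketched here with made-up data, is to reduce (sum, count) pairs and divide at the end:

> sales = sc.parallelize([("storeA", 10.0), ("storeA", 20.0), ("storeB", 5.0)])
> sums = sales.mapValues(lambda v: (v, 1)) \
      .reduceByKey(lambda a, b: (a[0] + b[0], a[1] + b[1]))   # associative and commutative
> avgs = sums.mapValues(lambda p: p[0] / p[1])
> avgs.collect()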
  • 62. Example: Word Count > counts = sc.textFile(file) .flatMap(lambda line: line.split()) .map(lambda word: (word,1)) .reduceByKey(lambda v1,v2: v1+v2) > counts.saveAsTextFile(output) (aardvark, 1) (cat, 1) (mat, 1) (on, 2) (sat, 2) (sofa, 1) (the, 4) (aardvark,1) (cat,1) (mat,1) (on,2) (sat,2) (sofa,1) (the,4)
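The same pipeline could also be packaged as a standalone PySpark application rather than typed into the shell. A hedged sketch (the input and output paths come from the command line; how the script is launched depends on the Spark version):

import sys
from pyspark import SparkContext

if __name__ == "__main__":
    sc = SparkContext(appName="WordCount")   # in the shell, sc already exists
    counts = sc.textFile(sys.argv[1]) \
               .flatMap(lambda line: line.split()) \
               .map(lambda word: (word, 1)) \
               .reduceByKey(lambda v1, v2: v1 + v2)
    counts.saveAsTextFile(sys.argv[2])
    sc.stop()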
  • 63.  Spark takes the concepts of MapReduce to the next level –Higher level API = faster, easier development Spark v. Hadoop MapReduce
  • 64.  Spark takes the concepts of MapReduce to the next level –Higher level API = faster, easier development Spark v. Hadoop MapReduce public class WordCount { public static void main(String[] args) throws Exception { Job job = new Job(); job.setJarByClass(WordCount.class); job.setJobName("Word Count"); FileInputFormat.setInputPaths(job, new Path(args[0])); FileOutputFormat.setOutputPath(job, new Path(args[1])); job.setMapperClass(WordMapper.class); job.setReducerClass(SumReducer.class); job.setMapOutputKeyClass(Text.class); job.setMapOutputValueClass(IntWritable.class); job.setOutputKeyClass(Text.class); job.setOutputValueClass(IntWritable.class); boolean success = job.waitForCompletion(true); System.exit(success ? 0 : 1); } } public class WordMapper extends Mapper<LongWritable, Text, Text, IntWritable> { public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException { String line = value.toString(); for (String word : line.split("\\W+")) { if (word.length() > 0) context.write(new Text(word), new IntWritable(1)); } } } public class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> { public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException { int wordCount = 0; for (IntWritable value : values) { wordCount += value.get(); } context.write(key, new IntWritable(wordCount)); } } > counts = sc.textFile(file) .flatMap(lambda line: line.split()) .map(lambda word: (word,1)) .reduceByKey(lambda v1,v2: v1+v2) > counts.saveAsTextFile(output)
  • 65.  Spark takes the concepts of MapReduce to the next level –Higher level API = faster, easier development –Low latency = near real-time processing Spark v. Hadoop MapReduce
  • 66.  Spark takes the concepts of MapReduce to the next level –Higher level API = faster, easier development –Low latency = near real-time processing –In-memory data storage = up to 100x performance improvement Spark v. Hadoop MapReduce Logistic Regression
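Much of that speedup comes from keeping an RDD resident in memory between passes, which is covered in the Caching and Persistence chapter. A minimal sketch (the file name is illustrative):

> data = sc.textFile("numbers.txt").map(lambda line: float(line))
> data.cache()      # keep the parsed data in memory after the first action
> data.sum()        # first action reads and parses the file
> data.mean()       # later actions reuse the cached data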
  • 68. Thank you for attending! • Submit questions in the Q&A panel • Follow Cloudera University @ClouderaU • Follow Diana on GitHub: https://github.com/dianacarroll • Follow the Developer learning path: http://university.cloudera.com/developers • Learn about the enterprise data hub: http://tinyurl.com/edh-webinar • Join the Cloudera user community: http://community.cloudera.com/ Register now for Cloudera training at http://university.cloudera.com Use discount code Spark_10 to save 10% on new enrollments in Spark Developer Training classes delivered by Cloudera until October 3, 2014* Use discount code 15off2 to save 15% on enrollments in two or more training classes delivered by Cloudera until October 3, 2014* * Excludes classes sold or delivered by Cloudera partners

Editor's Notes

  1. As I said, Python is another option. Take a look at this simple program, which takes a list of products purchased from the command line, and calculates the total cost of the purchase. Again, if this syntax doesn’t make sense to you, you will need to get more familiar with Python before you take the course. In the course, you need to be comfortable with defining functions, working with lists and arrays, parsing strings and so on.
  2. If you don’t yet have the programming skills to take this course, a good place to start learning Scala is at the official Scala site: scala-lang.org, which includes lots of documentation including overviews and a series of tutorials geared toward Java developers. The site also has pointers to other resources, such as a Coursera course and several good books.
  3. There’s an even richer set of resources for learning Python, including tutorials at python.org, as well as many other tutorial sites and online classes. One particularly useful resource for experienced programmers is Google’s Python class for developers. And of course, there are many Python books available from O’Reilly and other respected publishers. Note that Spark uses Python 2.6 or 2.7, so if you are new to Python, focus your learning on Python 2 instead of 3.
  4. Now let’s turn our attention to what you will actually learn in the class. [CLICK] After a brief introduction
  5. Now let’s turn our attention to what you will actually learn in the class. [CLICK] After a brief introduction, [CLICK] Chapter 2 is “What is Spark?” As I said, no experience with Spark or distributed processing is required, so we start at the beginning: what is Spark and why would you want to use it? What problems does it solve and what kind of use cases might you want to use it for? [CLICK] Then in Chapter 3 we move on to actually using Spark. We introduce the concept of Resilient Distributed Datasets, or RDDs, which is the core concept in Spark development, and briefly cover the principles of Functional Programming as used in Spark. In the hands-on exercises, you’ll learn how to start the Spark interactive shell and load data from a file into an RDD. [CLICK] In Chapter 4 we look more deeply at RDDs: how to perform operations to transform them and extract data from them. You will learn about MapReduce, a programming model for parallel processing of large data sets, and compare Spark’s MapReduce implementation with Hadoop’s. In the exercises, you will work with a set of Apache web server log files: loading them into an RDD, parsing and filtering the data, and aggregating, joining and reporting on the data. [CLICK] In Chapter 5, we introduce the Hadoop Distributed File System, or HDFS, which provides the distributed storage layer Spark uses to read and save data in a cluster. The course virtual machines include a running HDFS cluster, so in the exercises you will have a chance to import and export data using both the command line and a Spark application. [CLICK] Chapter 6 gives an overview of how a Spark application distributes processing on a cluster using a supported clustering platform, such as YARN, Mesos, or the Spark Standalone framework included with Spark. You will learn about different deployment options for a Spark application, and in the exercises you will start a Spark Standalone cluster on your virtual machine, start the Spark Shell on the cluster, and use the Spark Standalone web UI to explore the cluster. [CLICK] The next chapter goes deeper into clustered computing. We will cover how Spark partitions RDDs by storing data in memory on multiple nodes in the cluster, and how it distributes parallel tasks to process that data on the node where it is stored. In the exercises you will explore data partitioning, and use the Spark Application UI to better understand how Spark executes tasks in a cluster. [CLICK] In Chapter 8, we cover one of Spark’s unique features – the ability to cache distributed data locally, either in memory or on disk, for great improvements in performance. You will also learn about what makes RDDs “resilient”: how Spark uses “lineage” to recreate the data as needed if a node is lost. [CLICK] Chapter 9 teaches how to write and configure a Spark application from scratch. In the exercises, you will build a Spark application in either Scala or Python, configure different application properties, and submit the application to run on the cluster. [CLICK] Chapter 10 introduces one of the most exciting parts of the Spark ecosystem, Spark Streaming, which allows you to use Spark to process streaming data in near real-time, from sources such as application logs and social media feeds. In the exercises, you will write a Spark Streaming application to process data from a stream of web server logs. [CLICK] In the next chapter we discuss common patterns in Spark programming, with a particular focus on implementing iterative algorithms in Spark, which is one of Spark’s special strong points. We will explore page ranking as a common iterative task, and briefly introduce Spark’s machine learning and graph processing add-ons: MLlib and GraphX. In the labs, you will use Spark to implement an iterative calculation of k-means on location data. [CLICK] In Chapter 12, you will learn how to diagnose and fix common performance issues in Spark applications using techniques such as shared variables, serialization and data partitioning. You will practice using broadcast variables to avoid expensive join operations. [CLICK] Finally, in Chapter 13 you will learn how to use Spark in the context of a production data center. We will discuss how Spark complements existing Hadoop MapReduce applications, and explore how Spark applications work with other components of the Hadoop ecosystem such as Sqoop, Flume, HBase and Impala. In the final exercises before the course conclusion, [CLICK] you’ll practice extracting data from a relational database using Sqoop and using that data in Spark.