Apache Spark is a next-generation processing engine optimized for speed, ease of use, and advanced analytics well beyond batch. The Spark framework supports streaming data and complex, iterative algorithms, enabling applications to run up to 100x faster than traditional MapReduce programs. With Spark, developers can write sophisticated parallel applications for faster business decisions and better user outcomes, applied to a wide variety of architectures and industries.
In this webinar you will learn: what Apache Spark is and how it compares to Hadoop MapReduce; how to filter, map, reduce, and save Resilient Distributed Datasets (RDDs); who is best suited to attend the course and what prior knowledge you should have; and the benefits of building Spark applications as part of an enterprise data hub.
2. Agenda
• Cloudera's Learning Path for Developers
• Target Audience and Prerequisites
• Course Outline
• Short Presentation Based on Actual Course Material
• Question and Answer Session
3. Learning Path: Developers
Create Powerful New Data Processing Tools
• Developer Training: learn to code and write MapReduce programs for production; master advanced API topics required for real-world data analysis
• HBase Training: design schemas to minimize latency on massive data sets; scale hundreds of thousands of operations per second
• Intro to Data Science: implement recommenders and data experiments; draw actionable insights from analysis of disparate data
• Big Data Applications: build converged applications using multiple processing engines; develop enterprise solutions using components across the EDH
• Spark Training: combine batch and stream processing with interactive analytics; optimize applications for speed, ease of use, and sophistication
4. Why Cloudera Training?
Aligned to Best Practices and the Pace of Change
1. Broadest Range of Courses – Developer, Admin, Analyst, HBase, Data Science
2. Most Experienced Instructors – More than 20,000 students trained since 2009
3. Leader in Certification – Over 8,000 accredited Cloudera professionals
4. Trusted Source for Training – 100,000+ people have attended online courses
5. State of the Art Curriculum – Courses updated as Hadoop evolves
6. Widest Geographic Coverage – Most classes offered: 50 cities worldwide plus online
7. Most Relevant Platform & Community – CDH deployed more than all other distributions combined
8. Depth of Training Material – Hands-on labs and VMs support live instruction
9. Ongoing Learning – Video tutorials and e-learning complement training
10. Commitment to Big Data Education – University partnerships to teach Hadoop in the classroom
6. Target Audience
• Intended for people who write code, such as
– Software Engineers
– Data Engineers
– ETL Developers
9. Course Prerequisites
• No prior knowledge of Spark, Hadoop or distributed programming concepts is required
• Requirements
– Basic familiarity with Linux or Unix, e.g.
  $ mkdir /data
  $ cd /data
  $ rm /home/johndoe/salesreport.txt
– Intermediate-level programming skills in either Scala or Python
10. Example of Required Scala Skill Level
• Do you understand the following code? Could you write something similar?

object Maps {
  val colors = Map("red" -> 0xFF0000,
                   "turquoise" -> 0x00FFFF,
                   "black" -> 0x000000,
                   "orange" -> 0xFF8040,
                   "brown" -> 0x804000)

  def main(args: Array[String]) {
    for (name <- args) println(
      colors.get(name) match {
        case Some(code) =>
          name + " has code: " + code
        case None =>
          "Unknown color: " + name
      }
    )
  }
}
11. Example of Required Python Skill Level
• Do you understand the following code? Could you write something similar?

import sys

def parsePurchases(s):
    return s.split(',')

if __name__ == "__main__":
    if len(sys.argv) < 2:
        print "Usage: SumPrices <products>"
        exit(-1)
    prices = {'apple': 0.40, 'banana': 0.50, 'orange': 0.10}
    total = sum(prices[fruit]
                for fruit in parsePurchases(sys.argv[1]))
    print 'Total: $%.2f' % total
13. Practicing Scala or Python
• Getting started with Scala
– www.scala-lang.org
• Getting started with Python
– python.org
– developers.google.com/edu/python
– and many more
27. Course Outline
1. Introduction
2. What is Spark?
3. Spark Basics
4. Working with RDDs
5. The Hadoop Distributed File System
6. Running Spark on a Cluster
7. Parallel Programming with Spark
8. Caching and Persistence
9. Writing Spark Applications
10. Spark Streaming
11. Common Patterns in Spark Programming
12. Improving Spark Performance
13. Spark, Hadoop and the Enterprise Data Center
14. Conclusion
29. Course Excerpt
• Based on
– Chapter 3: Spark Basics
– Chapter 4: Working with RDDs
• Topics
– What is Spark?
– The components of a distributed data processing system
– Intro to the Spark Shell
– Resilient Distributed Datasets
– RDD operations
– Example: WordCount
31. What is Apache Spark?
• Apache Spark is a fast, general engine for large-scale data processing and analysis
– Open source, developed at UC Berkeley
• Written in Scala
– Functional programming language that runs in a JVM
• Key Concepts
– Avoid the data bottleneck by distributing data when it is stored
– Bring the processing to the data
– Data stored in memory
34. Distributed Processing with the Spark Framework
The framework is layered:
• API: Spark
• Cluster Computing: Spark Standalone, YARN, Mesos
• Storage: HDFS (Hadoop Distributed File System)
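The URI scheme you pass to Spark selects the storage layer. A minimal sketch (the paths and the namenode address are hypothetical; the HDFS line assumes a running cluster):

  # In the Spark shell, sc already exists:
  local_rdd = sc.textFile("file:/home/training/purplecow.txt")           # local filesystem
  hdfs_rdd = sc.textFile("hdfs://namenode:8020/user/training/weblogs/")  # HDFS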
35. What is Apache Spark?
• Spark Shell
– Interactive REPL, for learning or data exploration
– Python or Scala
• Spark Applications
– For large scale data processing
– Python, Java or Scala

Python Shell:
$ pyspark
Welcome to Spark version 0.9.1
Using Python version 2.6.6 (r266:84292, Jan 22 2014 09:42:36)
Spark context available as sc.
>>>

Scala Shell:
$ spark-shell
Welcome to Spark version 0.9.1
Using Scala version 2.10.3 (Java HotSpot(TM) 64-Bit Server VM, Java 1.7.0_51)
Created spark context.
Spark context available as sc.
scala>
36. Spark Context
• Every Spark application requires a Spark Context
– The main entry point to the Spark API
• Spark Shell provides a preconfigured Spark Context called sc

>>> sc.appName
u'PySparkShell'

scala> sc.appName
res0: String = Spark shell
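Outside the shell there is no preconfigured context, so an application creates its own. A minimal PySpark sketch (the application name "MyApp" is made up):

  from pyspark import SparkConf, SparkContext

  conf = SparkConf().setAppName("MyApp")  # hypothetical app name
  sc = SparkContext(conf=conf)            # the main entry point to the Spark API
  # ... create and operate on RDDs via sc ...
  sc.stop()                               # release the context when finished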
37. RDD (Resilient Distributed Dataset)
• RDD (Resilient Distributed Dataset)
– Resilient – if data in memory is lost, it can be recreated
– Distributed – stored in memory across the cluster
– Dataset – initial data can come from a file or be created programmatically
• RDDs are the fundamental unit of data in Spark
• Most of Spark programming consists of performing operations on RDDs
39. Example: A File-based RDD

File: purplecow.txt
I've never seen a purple cow.
I never hope to see one;
But I can tell you, anyhow,
I'd rather see than be one.

> mydata = sc.textFile("purplecow.txt")
> mydata.count()
4

The RDD mydata now holds the four lines of the file.
41. RDD Operations
• Two types of RDD operations
– Actions – return a value to the driver, e.g. count, take(n)
– Transformations – define a new RDD based on the current one, e.g. filter, map, reduceByKey
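The distinction matters because transformations are lazy: Spark does no work until an action requests a value. A small sketch in the Python shell, using the purplecow.txt file from slide 39:

  mydata = sc.textFile("purplecow.txt")               # transformation: nothing is read yet
  filtered = mydata.filter(lambda line: "I" in line)  # transformation: still nothing
  filtered.count()                                    # action: the file is read, a value is returned
  filtered.take(2)                                    # action: first 2 elements, as a Python list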
44. Example: map and filter Transformations

Original RDD:
I've never seen a purple cow.
I never hope to see one;
But I can tell you, anyhow,
I'd rather see than be one.

After map(lambda line: line.upper())  (Scala: map(line => line.toUpperCase())):
I'VE NEVER SEEN A PURPLE COW.
I NEVER HOPE TO SEE ONE;
BUT I CAN TELL YOU, ANYHOW,
I'D RATHER SEE THAN BE ONE.

After filter(lambda line: line.startswith('I'))  (Scala: filter(line => line.startsWith("I"))):
I'VE NEVER SEEN A PURPLE COW.
I NEVER HOPE TO SEE ONE;
I'D RATHER SEE THAN BE ONE.
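Because each transformation returns a new RDD, the same pipeline is usually written as a single chained expression. A sketch in the Python shell (collect() brings all results back to the driver, fine for a four-line file but to be used cautiously on large data):

  sc.textFile("purplecow.txt") \
    .map(lambda line: line.upper()) \
    .filter(lambda line: line.startswith('I')) \
    .collect()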
46. RDDs
• RDDs can hold any type of element
– Primitive types: integers, characters, booleans, strings, etc.
– Sequence types: lists, arrays, tuples, dicts, etc. (including nested)
– Scala/Java objects (if serializable)
– Mixed types
• Some types of RDDs have additional functionality
– Double RDDs – RDDs consisting of numeric data
– Pair RDDs – RDDs consisting of key-value pairs
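A quick sketch of the extra functionality on numeric (double) RDDs, using made-up numbers; these operations are available in the Python API:

  nums = sc.parallelize([1.5, 2.5, 3.0, 4.0])  # build an RDD from a local list
  nums.sum()    # 11.0
  nums.mean()   # 2.75
  nums.stats()  # count, mean, stdev, max, min in a single pass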
48. Pair RDDs
• Pair RDDs are a special form of RDD
– Each element must be a key-value pair (a two-element tuple)
– Keys and values can be any type
• Why?
– Use with MapReduce algorithms
– Many additional functions are available for common data processing needs, e.g. sorting, joining, grouping, counting

Example elements: (key1, value1), (key2, value2), (key3, value3), …
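A short sketch of the pair-RDD functions named above (sorting, joining, grouping, counting), with made-up data:

  sales = sc.parallelize([("apple", 2), ("banana", 1), ("apple", 3)])
  prices = sc.parallelize([("apple", 0.40), ("banana", 0.50)])

  sales.reduceByKey(lambda v1, v2: v1 + v2).sortByKey().collect()
  # [('apple', 5), ('banana', 1)]

  sales.join(prices).collect()  # join on key; values become tuples
  # [('apple', (2, 0.4)), ('apple', (3, 0.4)), ('banana', (1, 0.5))]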
51. MapReduce
• MapReduce is a common programming model
– Two phases: Map – process each element in a data set; Reduce – aggregate or consolidate the data
– Easily applicable to distributed processing of large data sets
• Hadoop MapReduce is the major implementation
– Limited: each job has one Map phase and one Reduce phase; job output is saved to files
• Spark implements MapReduce with much greater flexibility
– Map and Reduce functions can be interspersed
– Results are stored in memory
– Operations can be chained easily
52. MapReduce Example: Word Count

Input Data:
the cat sat on the mat
the aardvark sat on the sofa

Result:
aardvark 1
cat 1
mat 1
on 2
sat 2
sofa 1
the 4
56. Example: Word Count

> counts = sc.textFile(file)
             .flatMap(lambda line: line.split())
             .map(lambda word: (word, 1))
             .reduceByKey(lambda v1, v2: v1 + v2)

Input:
the cat sat on the mat
the aardvark sat on the sofa

After flatMap (individual words):
the, cat, sat, on, the, mat, the, aardvark, sat, …

After map (key-value pairs):
(the, 1), (cat, 1), (sat, 1), (on, 1), (the, 1), (mat, 1), (the, 1), (aardvark, 1), (sat, 1), …

After reduceByKey (aggregated counts):
(aardvark, 1), (cat, 1), (mat, 1), (on, 2), (sat, 2), (sofa, 1), (the, 4)
64. Spark v. Hadoop MapReduce
• Spark takes the concepts of MapReduce to the next level
– Higher level API = faster, easier development

Hadoop MapReduce word count (Java):

public class WordCount {
  public static void main(String[] args) throws Exception {
    Job job = new Job();
    job.setJarByClass(WordCount.class);
    job.setJobName("Word Count");
    FileInputFormat.setInputPaths(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    job.setMapperClass(WordMapper.class);
    job.setReducerClass(SumReducer.class);
    job.setMapOutputKeyClass(Text.class);
    job.setMapOutputValueClass(IntWritable.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    boolean success = job.waitForCompletion(true);
    System.exit(success ? 0 : 1);
  }
}

public class WordMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
  public void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    String line = value.toString();
    for (String word : line.split("\\W+")) {
      if (word.length() > 0)
        context.write(new Text(word), new IntWritable(1));
    }
  }
}

public class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
  public void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {
    int wordCount = 0;
    for (IntWritable value : values) {
      wordCount += value.get();
    }
    context.write(key, new IntWritable(wordCount));
  }
}

The same job in Spark (Python):

> counts = sc.textFile(file)
             .flatMap(lambda line: line.split())
             .map(lambda word: (word, 1))
             .reduceByKey(lambda v1, v2: v1 + v2)
> counts.saveAsTextFile(output)
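The same pipeline can also be packaged as a self-contained Spark application. A minimal sketch (the file name wordcount.py is hypothetical; Python 2 syntax to match the course environment):

  # wordcount.py
  import sys
  from pyspark import SparkContext

  if __name__ == "__main__":
      if len(sys.argv) < 3:
          print "Usage: wordcount.py <input> <output>"
          exit(-1)
      sc = SparkContext(appName="WordCount")
      counts = sc.textFile(sys.argv[1]) \
                 .flatMap(lambda line: line.split()) \
                 .map(lambda word: (word, 1)) \
                 .reduceByKey(lambda v1, v2: v1 + v2)
      counts.saveAsTextFile(sys.argv[2])
      sc.stop()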
66. Spark v. Hadoop MapReduce
• Spark takes the concepts of MapReduce to the next level
– Higher level API = faster, easier development
– Low latency = near real-time processing
– In-memory data storage = up to 100x performance improvement
(Chart: Logistic Regression performance, Spark vs. Hadoop MapReduce)
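The in-memory advantage comes from reuse: an RDD can be cached so that later actions skip recomputation. A small sketch (cache() only marks the RDD; it is materialized by the first action):

  mydata = sc.textFile("purplecow.txt")
  mydata.cache()   # keep this RDD in memory once computed
  mydata.count()   # first action: reads the file and populates the cache
  mydata.count()   # second action: served from memory, no re-read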
68. Thank you for attending!
• Submit questions in the Q&A panel
• Follow Cloudera University @ClouderaU
• Follow Diana on GitHub: https://github.com/dianacarroll
• Follow the Developer learning path: http://university.cloudera.com/developers
• Learn about the enterprise data hub: http://tinyurl.com/edh-webinar
• Join the Cloudera user community: http://community.cloudera.com/

Register now for Cloudera training at http://university.cloudera.com
Use discount code Spark_10 to save 10% on new enrollments in Spark Developer Training classes delivered by Cloudera until October 3, 2014*
Use discount code 15off2 to save 15% on enrollments in two or more training classes delivered by Cloudera until October 3, 2014*
* Excludes classes sold or delivered by Cloudera partners
Editor's Notes
As I said, Python is another option. Take a look at this simple program, which takes a list of products purchased from the command line and calculates the total cost of the purchase.
Again, if this syntax doesn't make sense to you, you will need to get more familiar with Python before you take the course. In the course, you need to be comfortable with defining functions, working with lists and arrays, parsing strings, and so on.
If you don't yet have the programming skills to take this course, a good place to start learning Scala is the official Scala site, scala-lang.org, which includes lots of documentation, including overviews and a series of tutorials geared toward Java developers. The site also has pointers to other resources, such as a Coursera course and several good books.
There's an even richer set of resources for learning Python, including tutorials at python.org, as well as many other tutorial sites and online classes. One particularly useful resource for experienced programmers is Google's Python class for developers. And of course, there are many Python books available from O'Reilly and other respected publishers.
Note that Spark uses Python 2.6 or 2.7, so if you are new to Python, focus your learning on Python 2 instead of 3.
Now let's turn our attention to what you will actually learn in the class.
[CLICK]
After a brief introduction, [CLICK]
Chapter 2 is "What is Spark?" As I said, no experience with Spark or distributed processing is required, so we start at the beginning: what is Spark and why would you want to use it? What problems does it solve and what kind of use cases might you want to use it for?
[CLICK]
Then in Chapter 3 we move on to actually using Spark. We introduce the concept of Resilient Distributed Datasets, or RDDs, which is the core concept in Spark development, and briefly cover the principles of functional programming as used in Spark. In the hands-on exercises, you'll learn how to start the Spark interactive shell and load data from a file into an RDD.
[CLICK]
In Chapter 4 we look more deeply at RDDs: how to perform operations to transform them and extract data from them. You will learn about MapReduce, a programming model for parallel processing of large data sets, and compare Spark's MapReduce implementation with Hadoop's. In the exercises, you will work with a set of Apache web server log files: loading them into an RDD, parsing and filtering the data, and aggregating, joining and reporting on the data.
[CLICK]
In Chapter 5, we introduce the Hadoop Distributed File System, or HDFS, which provides the distributed storage layer Spark uses to read and save data in a cluster. The course virtual machines include a running HDFS cluster, so in the exercises you will have a chance to import and export data using both the command line and a Spark application.
[CLICK]
Chapter 6 gives an overview of how a Spark application distributes processing on a cluster using a supported clustering platform, such as YARN, Mesos, or the Spark Standalone framework included with Spark. You will learn about different deployment options for a Spark application, and in the exercises you will start a Spark Standalone cluster on your virtual machine, start the Spark Shell on the cluster, and use the Spark Standalone web UI to explore the cluster.
[CLICK]
The next chapter goes deeper into clustered computing. We will cover how Spark partitions RDDs by storing data in memory on multiple nodes in the cluster, and how it distributes parallel tasks to process that data on the node where it is stored. In the exercises you will explore data partitioning, and use the Spark Application UI to better understand how Spark executes tasks in a cluster.
[CLICK]
In Chapter 8, we cover one of Spark's unique features: the ability to cache distributed data locally, either in memory or on disk, for great improvements in performance. You will also learn about what makes RDDs "resilient": how Spark uses "lineage" to recreate the data as needed in case a node is lost.
[CLICK]
Chapter 9 teaches how to write and configure a Spark application from scratch. In the exercises, you will build a Spark application in either Scala or Python, configure different application properties, and submit the application to run on the cluster.
[CLICK]
Chapter 10 introduces one of the most exciting parts of the Spark ecosystem, Spark Streaming, which allows you to use Spark to process streaming data in near real-time, from sources such as application logs and social media feeds. In the exercises, you will write a Spark Streaming application to process data from a stream of web server logs.
[CLICK]
In the next chapter we discuss common patterns in Spark programming, with a particular focus on implementing iterative algorithms in Spark, which is one of Spark's special strong points. We will explore page ranking as a common iterative task, and briefly introduce Spark's machine learning and graph add-ons, MLlib and GraphX. In the labs, you will use Spark to implement an iterative calculation of k-means on location data.
[CLICK]
In Chapter 12, you will learn how to diagnose and fix common performance issues in Spark applications using techniques such as shared variables, serialization and data partitioning. You will practice using broadcast variables to avoid expensive join operations.
[CLICK]
Finally, in Chapter 13 you will learn how to use Spark in the context of a production data center. We will discuss how Spark complements existing Hadoop MapReduce applications, and explore how Spark applications work with other components of the Hadoop ecosystem such as Sqoop, Flume, HBase and Impala. In the final exercises before the course conclusion, [CLICK] you'll practice extracting data from a relational database using Sqoop and using that data in Spark.