Spark mhug2

© Hortonworks Inc. 2011 – 2014. All Rights Reserved
Apache Spark

HDP 2.x: A Complete & Open Hadoop Distribution (Hortonworks Data Platform)
[Architecture diagram; recoverable labels:]
•  Governance & Integration: Data Workflow, Lifecycle & Governance (Falcon, Sqoop, Flume, NFS, WebHDFS)
•  Data Access (Batch, Interactive & Real-Time): Script (Pig), Search (Solr), SQL (Hive, HCatalog), NoSQL (HBase, Accumulo), Stream (Storm), In-Memory (Spark), Others (ISV Engines), Tez
•  Data Management: YARN: Data Operating System; HDFS (Hadoop Distributed File System) across nodes 1 to N
•  Security: Authentication, Authorization, Accounting, Data Protection; Storage: HDFS; Resources: YARN; Access: Hive, …; Pipeline: Falcon; Cluster: Knox
•  Operations: Provision, Manage & Monitor (Ambari, Zookeeper); Scheduling (Oozie)

What is Spark?
•  Spark is
–  an open-source software solution that performs rapid calculations on in-memory datasets
-  Open source [Apache hosted & licensed]
•  Free to download and use in production
•  Developed by a community of developers [most of whom work for Databricks]
-  In-memory datasets
•  The RDD (Resilient Distributed Dataset) is the basis for what Spark enables
•  Resilient – a lost dataset can be recreated on the fly from a known state
•  Immutable – already defined RDDs can be used as a basis to generate derivative RDDs but are never mutated
•  Distributed – the dataset is often partitioned across multiple nodes for increased scalability and parallelism

Why Spark? - It’s About Ramp-Up & Reality
•  Spark supports well-known languages such as
•  Scala*
•  Python
•  Java
•  With Spark Streaming, the same code can be used on
•  Data at rest
•  Data in motion
•  A large community is building around Spark

Spark is Expansive:
•  Fast and general processing engine for large scale data processing
•  Encourages reuse of libraries across several problem domains
•  Designed for iterative computations and interactive data mining
[Diagram label: Spark SQL]

Spark Vs MapReduce Vs Tez?
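MapReduce – intermediate data is written to disk in HDFS
Tez – intermediate data is written to local disk
Spark – intermediate data is kept in memory
[Diagram: input followed by repeated disk writes and reads between steps]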

RDD Primitives
- Resilient Distributed Datasets (RDD)
-  Immutable partitioned collection of objects
- Transformations (map, filter, groupBy, join)
-  Lazy operations to build RDD from RDD
- Actions (count, collect, save)
-  Return a result or write it to storage
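The split between lazy transformations and actions can be sketched in plain Python. This is a minimal illustration, not Spark's implementation: `ToyRDD`, `noisy_double`, and the sample data are hypothetical names invented here so the sketch runs without a Spark installation.

```python
class ToyRDD:
    """Hypothetical stand-in for an RDD: an immutable recipe, not Spark's class."""

    def __init__(self, source, ops=()):
        self._source = tuple(source)  # base data, never modified
        self._ops = tuple(ops)        # recorded transformations (the plan)

    def map(self, fn):
        # Transformation: returns a NEW ToyRDD; nothing is computed yet.
        return ToyRDD(self._source, self._ops + (("map", fn),))

    def filter(self, pred):
        # Transformation: also only extends the recorded plan.
        return ToyRDD(self._source, self._ops + (("filter", pred),))

    def _run(self):
        # Replay the recorded plan over the base data.
        data = iter(self._source)
        for kind, fn in self._ops:
            data = map(fn, data) if kind == "map" else filter(fn, data)
        return data

    def collect(self):
        # Action: triggers the computation and returns a list.
        return list(self._run())

    def count(self):
        # Action: triggers the computation and returns the element count.
        return sum(1 for _ in self._run())


calls = []  # records every element the map function actually touches


def noisy_double(x):
    calls.append(x)
    return x * 2


base = ToyRDD([1, 2, 3, 4, 5])
doubled = base.map(noisy_double)            # lazy: no calls yet
big = doubled.filter(lambda x: x > 4)       # lazy: still no calls
calls_before_action = len(calls)
result = big.collect()                      # action: the plan runs now
```

Here `calls_before_action` stays at zero until `collect()` runs, and `base` is unchanged by the derived datasets, which mirrors the "immutable" and "lazy" points above.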

Fault Recovery
RDDs track lineage information that can be used to
efficiently re-compute lost data
msgs = textFile.filter(lambda s: s.startswith("ERROR")) \
               .map(lambda s: s.split("\t")[2])

[Diagram: HDFS File → filter (func = startsWith(…)) → Filtered RDD → map (func = split(...)) → Mapped RDD]
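Lineage-based recovery can be illustrated in plain Python. The sketch below is an assumption-laden toy, not Spark's mechanism: `build_partition`, `source_partitions`, and the log lines are hypothetical. It shows that a lost partition of the result can be rebuilt by replaying the recorded filter and map steps over the matching source partition only.

```python
def build_partition(raw_lines, lineage):
    """Recompute one partition by replaying its lineage over the source lines."""
    data = list(raw_lines)
    for kind, fn in lineage:
        if kind == "filter":
            data = [x for x in data if fn(x)]
        elif kind == "map":
            data = [fn(x) for x in data]
    return data


# Hypothetical tab-separated log lines, split across two source partitions.
source_partitions = {
    0: ["ERROR\tdb\ttimeout", "INFO\tweb\tok"],
    1: ["ERROR\tweb\t500", "ERROR\tdb\tdeadlock"],
}

# The lineage: the same filter and map steps as in the slide's example.
lineage = [
    ("filter", lambda s: s.startswith("ERROR")),
    ("map", lambda s: s.split("\t")[2]),
]

msgs = {pid: build_partition(lines, lineage)
        for pid, lines in source_partitions.items()}

# Simulate losing partition 1, then rebuild it from its source and lineage.
lost = msgs.pop(1)
msgs[1] = build_partition(source_partitions[1], lineage)
```

Only partition 1 is recomputed; partition 0 is left untouched.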

Working with RDDs
[Diagram: RDD → Transformations → RDD → Action → Value]
textFile = sc.textFile("SomeFile.txt")
linesWithSpark = textFile.filter(lambda line: "Spark" in line)
linesWithSpark.count()
# 74
linesWithSpark.first()
# Apache Spark

RDD Graph
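[Diagram: textFile → map → map → reduceByKey → collect]
sc.textFile("/some-hdfs-data")
  .flatMap(line => line.split(" "))
  .map(word => (word, 1))
  .reduceByKey(_ + _, 3)
  .collect()
Type after each step:
textFile: RDD[String]
flatMap: RDD[String] (a plain map over split would instead yield a collection per line, as in the diagram's RDD[List[String]])
map: RDD[(String, Int)]
reduceByKey: RDD[(String, Int)]
collect: Array[(String, Int)]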
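The word-count chain above can be traced step by step in plain Python. This is a sketch of the data flow only: `reduce_by_key` is a hypothetical helper written for this example, the input lines are made up, and hashing keys into three buckets merely imitates the `3` partitions passed to `reduceByKey`.

```python
from itertools import chain

lines = ["to be or", "not to be"]  # hypothetical input lines

# flatMap: split every line and flatten the pieces into one sequence of words.
words = list(chain.from_iterable(line.split(" ") for line in lines))

# map: turn each word into a (word, 1) pair.
pairs = [(word, 1) for word in words]


def reduce_by_key(pairs, fn, num_partitions):
    """Combine values per key, with keys hashed into num_partitions buckets."""
    partitions = [dict() for _ in range(num_partitions)]
    for key, value in pairs:
        bucket = partitions[hash(key) % num_partitions]
        bucket[key] = fn(bucket[key], value) if key in bucket else value
    return partitions


# reduceByKey(_ + _, 3): sum the ones per word across three buckets.
counts_by_partition = reduce_by_key(pairs, lambda a, b: a + b, 3)

# collect: gather every bucket's results in one place (sorted for stable display).
collected = sorted(kv for part in counts_by_partition for kv in part.items())
```

Which bucket a word lands in depends on its hash, but each word always ends up in exactly one bucket, so the collected counts are the same either way.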

DAG Scheduler
[Diagram: the DAG scheduler divides the graph textFile → map → map → reduceByKey → collect into Stage 1 and Stage 2]

Task
•  Fundamental unit of execution in Spark
-  A. Fetch input from InputFormat or a shuffle
-  B. Execute the task
-  C. Materialize task output as shuffle or driver result
[Diagram: pipelined execution: Fetch input → Execute task → Write output]
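One way to picture the three task phases running as a pipeline is with Python generators. This is an interpretation of the slide rather than Spark's code: `fetch_input`, `execute`, `write_output`, and the `log` list are hypothetical names, and the sketch assumes "pipelined" means each record flows through all phases before the next record is fetched.

```python
log = []  # records the order in which phases touch each record


def fetch_input(records):
    # Phase A: yield input records one at a time.
    for r in records:
        log.append(("fetch", r))
        yield r


def execute(records):
    # Phase B: apply the task's function to each record as it arrives.
    for r in records:
        log.append(("execute", r))
        yield r.upper()


def write_output(records):
    # Phase C: materialize the results (here, into a list).
    out = []
    for r in records:
        log.append(("write", r))
        out.append(r)
    return out


output = write_output(execute(fetch_input(["a", "b"])))
```

The log shows record "a" being fetched, executed, and written before "b" is fetched, so no intermediate collection is built between the phases.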

Worker
[Diagram: a worker with Core 1, Core 2 and Core 3, each core running a sequence of tasks (Fetch input → Execute task → Write output)]

Spark on YARN
[Diagram labels: YARN RM, App Master, Monitoring UI]

Things You Can Do With RDDs
•  RDDs are objects and expose a rich set of methods:
Name       Description
filter     Return a new RDD containing only those elements that satisfy a predicate
collect    Return an array containing all the elements of this RDD
count      Return the number of elements in this RDD
first      Return the first element of this RDD
foreach    Apply a function to all elements of this RDD (does not return an RDD)
reduce     Reduce the elements of this RDD to a single value using the passed-in function
subtract   Return an RDD with the elements of this RDD that are not in the passed-in RDD
union      Return an RDD that is the union of the passed-in RDD and this one
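The semantics of a few of these methods can be checked against plain Python lists. The sketch below is only an analogy on hypothetical data, not the RDD API; it assumes union keeps duplicates and subtract drops every element that also appears in the other collection.

```python
from functools import reduce

this = [1, 2, 2, 3, 4]   # hypothetical contents of "this" RDD
other = [2, 4, 5]        # hypothetical contents of the passed-in RDD

# filter: keep only the elements that satisfy the predicate.
filtered = [x for x in this if x % 2 == 0]

# reduce: fold all elements into a single value with a two-argument function.
total = reduce(lambda a, b: a + b, this)

# subtract: elements of "this" that do not occur in "other".
other_set = set(other)
subtracted = [x for x in this if x not in other_set]

# union: all elements of both collections, duplicates included.
unioned = this + other
```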

More Things You Can Do With RDDs
•  More stuff you can do…
Name           Description
flatMap        Return a new RDD by first applying a function to all elements of this RDD, and then flattening the results
checkpoint     Mark this RDD for checkpointing (its state will be saved so it need not be recreated from scratch)
cache          Load the RDD into memory (what doesn’t fit will be calculated as needed)
countByValue   Return the count of each unique value in this RDD as a map of (value, count) pairs
distinct       Return a new RDD containing the distinct elements in this RDD
persist        Store the RDD in memory, on disk, or a hybrid of both, according to the passed-in value
sample         Return a sampled subset of this RDD
unpersist      Clear any record of the RDD from disk/memory
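As a rough analogy for flatMap, countByValue, and distinct, the same results can be produced with the Python standard library. The input lines are hypothetical and none of this is the RDD API; it only shows what each operation is expected to return.

```python
from collections import Counter
from itertools import chain

lines = ["a b", "b c", "a b"]  # hypothetical input

# flatMap: apply split to every line, then flatten the resulting lists.
flat = list(chain.from_iterable(line.split(" ") for line in lines))

# countByValue: a map of (value, count) pairs.
count_by_value = dict(Counter(flat))

# distinct: each value once (sorted here only to make the result stable).
distinct = sorted(set(flat))
```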

Things You Can Do With PairRDDs
•  PairRDDs are RDDs containing key-value pairs:
Name            Description
join            Return an RDD containing all pairs of elements with matching keys in this and other
groupByKey      Group the values for each key in the RDD into a single sequence
keys            Return an RDD containing the keys from each tuple stored in this RDD
countByKey      Return the number of elements for each key in the form of a Map
lookup          Return a list of the values stored in this RDD for the passed-in key
leftOuterJoin   Perform a left outer join of this and other
values          Return an RDD with the values of each tuple
subtractByKey   Return an RDD with the pairs from this whose keys are not in other
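The key-based operations in the table can be sketched over plain lists of tuples. The helper functions and the `users` / `orders` data below are hypothetical and written for this example; they imitate the described results, not Spark's distributed implementation.

```python
from collections import defaultdict


def group_by_key(pairs):
    """Group the values for each key into a single list."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return dict(grouped)


def join(left, right):
    """Inner join: one output pair per matching (left, right) combination."""
    right_by_key = group_by_key(right)
    return [(k, (lv, rv)) for k, lv in left for rv in right_by_key.get(k, [])]


def left_outer_join(left, right):
    """Like join, but left keys without a match are kept with None."""
    right_by_key = group_by_key(right)
    return [(k, (lv, rv)) for k, lv in left for rv in right_by_key.get(k, [None])]


def subtract_by_key(left, right):
    """Pairs from left whose keys do not appear in right."""
    right_keys = {k for k, _ in right}
    return [(k, v) for k, v in left if k not in right_keys]


users = [(1, "ann"), (2, "bob"), (3, "cy")]
orders = [(1, "book"), (1, "pen"), (3, "ink")]

joined = join(users, orders)
outer = left_outer_join(users, orders)
without_orders = subtract_by_key(users, orders)
grouped_orders = group_by_key(orders)
```

In this data, user 2 has no orders, so it is absent from `joined`, appears with `None` in `outer`, and is the only entry in `without_orders`.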

Spark MLlib – Algorithms Offered
•  Classification: logistic regression, linear SVM, naïve Bayes, least squares, classification tree
•  Regression: generalized linear models (GLMs), regression tree
•  Collaborative filtering: alternating least squares (ALS), non-negative matrix factorization (NMF)
•  Clustering: k-means
•  Decomposition: SVD, PCA
•  Optimization: stochastic gradient descent, L-BFGS
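To give a feel for one of the listed algorithms, here is a minimal single-machine k-means sketch on one-dimensional points. It is not MLlib's implementation or API: `kmeans_1d`, the sample points, and the fixed starting centers are assumptions chosen to keep the example small and deterministic.

```python
def kmeans_1d(points, centers, iterations=10):
    """Plain k-means on 1-D points, starting from the given centers."""
    centers = list(centers)
    for _ in range(iterations):
        # Assignment step: attach every point to its nearest center.
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda i: abs(p - centers[i]))
            clusters[nearest].append(p)
        # Update step: move each center to the mean of its cluster
        # (an empty cluster keeps its previous center).
        centers = [
            sum(cluster) / len(cluster) if cluster else centers[i]
            for i, cluster in enumerate(clusters)
        ]
    return centers


points = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]   # hypothetical data
final_centers = kmeans_1d(points, centers=[1.0, 12.0])
```

The assignment and update steps are exactly the kind of repeated pass over the same dataset that the deck describes as Spark's target workload of iterative computations.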
