SlideShare uma empresa Scribd logo
1 de 18
Baixar para ler offline
Page 1 © Hortonworks Inc. 2011 – 2014. All Rights Reserved
Apache Spark
Page 2 © Hortonworks Inc. 2011 – 2014. All Rights Reserved
HDP: A Complete & Open Hadoop Distribution
Provision,
Manage &
Monitor
Ambari
Zookeeper
Scheduling
Oozie
Data Workflow,
Lifecycle &
Governance
Falcon
Sqoop
Flume
NFS
WebHDFS
YARN: Data Operating System
DATA MANAGEMENT
SECURITY
BATCH, INTERACTIVE & REAL-TIME
DATA ACCESS
GOVERNANCE
& INTEGRATION
Authentication
Authorization
Accounting
Data Protection
Storage: HDFS
Resources: YARN
Access: Hive, …
Pipeline: Falcon
Cluster: Knox
OPERATIONS
Script
Pig
Search
Solr
SQL
Hive
HCatalog
NoSQL
HBase
Accumulo
Stream
Storm
Others
ISV
Engines
1 ° ° ° ° ° ° ° ° °
° ° ° ° ° ° ° ° ° °
° ° ° ° ° ° ° ° ° °
°
°
N
HDFS
(Hadoop Distributed File System)
In-Memory
Spark
Tez
Tez
HDP 2.x
Hortonworks Data Platform
Page 3 © Hortonworks Inc. 2011 – 2014. All Rights Reserved
What is Spark?
•  Spark is
–  an open-source Software solution that performs rapid calculations
on in-memory datasets
-  Open Source [Apache hosted & licensed]
•  Free to download and use in production
•  Developed by a community of developers [most of whom work for
DataBricks]
-  In-memory datasets
•  RDD (Resilient Distributed Data) is the basis for what Spark enables
•  Resilient – the models can be recreated on the fly from known state
•  Immutable – already defined RDDs can be used as a basis to
generate derivative RDDs but are never mutated
•  Distributed – the dataset is often partitioned across multiple nodes for
increased scalability and parallelism
Page 4 © Hortonworks Inc. 2011 – 2014. All Rights Reserved
Why Spark? - It’s About Ramp-Up & Reality
•  Spark supports using well known languages such as
•  Scala*
•  Python
•  Java
•  Using Spark Streaming Same code can be used on
•  Data at rest
•  Data in motion
•  Huge Community building around Spark
Page 5 © Hortonworks Inc. 2011 – 2014. All Rights Reserved
Spark is Expansive:
•  Fast and general processing engine for large scale data processing
•  Encourages reuse of libraries across several problem domains
•  Designed for iterative computations and interactive data mining
Spark
SQL
Page 6 © Hortonworks Inc. 2011 – 2014. All Rights Reserved
Spark Vs MapReduce Vs Tez?
MapReduce – On disk to HDFS
Tez – On Disk to Local Disk
Spark – In memory
Input
Disk Disk Disk
Write Read Write
Page 7 © Hortonworks Inc. 2011 – 2014. All Rights Reserved
RDD Primitives
- Resilient Distributed Datasets (RDD)
-  Immutable partitioned collection of objects
- Transformations (map, filter, groupby, join)
-  Lazy operations to build RDD from RDD
- Actions (count, collect, save)
-  Return a result or write it to storage
Page 8 © Hortonworks Inc. 2011 – 2014. All Rights Reserved
Fault Recovery
RDDs track lineage information that can be used to
efficiently re-compute lost data
msgs	
  =	
  textFile.filter(lambda	
  s:	
  s.startsWith(“ERROR”))	
  
	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  .map(lambda	
  s:	
  s.split(“t”)[2])	
  
HDFS File Filtered RDD Mapped RDD
filter	
  
(func	
  =	
  startsWith(…))	
  
map	
  
(func	
  =	
  split(...))	
  
Page 9 © Hortonworks Inc. 2011 – 2014. All Rights Reserved
RDD
RDD
RDD
RDD
Transformations
Action Value
linesWithSpark = textFile.filter(lambda line: "Spark” in line)!
linesWithSpark.count()!
74!
!
linesWithSpark.first()!
# Apache Spark!
textFile = sc.textFile(”SomeFile.txt”)!
Working with RDDs
Page 10 © Hortonworks Inc. 2011 – 2014. All Rights Reserved
RDD Graph
sc.textFile("/some-­‐hdfs-­‐data")	
  
	
  
	
  
	
  
map	
  map	
   reduceByKey	
   collect	
  textFile	
  
.flatMap(line=>line.split("	
  "))	
  
.map(word=>(word,	
  1)))	
  
.reduceByKey(_	
  +	
  _,	
  3)	
  
.collect()	
  
RDD[String]
RDD[List[String]]
RDD[(String, Int)]
Array[(String, Int)]
RDD[(String, Int)]
Page 11 © Hortonworks Inc. 2011 – 2014. All Rights Reserved
DAG Scheduler
map	
  map	
   reduceByKey	
   collect	
  textFile	
  
map	
  
Stage	
  2	
  Stage	
  1	
  
map	
   reduceByKey	
   collect	
  textFile	
  
Page 12 © Hortonworks Inc. 2011 – 2014. All Rights Reserved
Task
•  Fundamental unit of execution in Spark
-  A. Fetch input from InputFormat or a shuffle
-  B. Execute the task
-  C. Materialize task output as shuffle or driver result
Execute task
Fetch input
Write output
Pipelined
Execution
Page 13 © Hortonworks Inc. 2011 – 2014. All Rights Reserved
Worker
Execute task
Fetch input
Write output
Execute task
Fetch input
Write output
Execute task
Fetch input
Write output
Execute task
Fetch input
Write output
Execute task
Fetch input
Write output
Execute task
Fetch input
Write output
Execute task
Fetch input
Write output
Core
1
Core
2
Core
3
Page 14 © Hortonworks Inc. 2011 – 2014. All Rights Reserved
Spark on YARN
YARN RM
App Master
Monitoring UI
Page 15 © Hortonworks Inc. 2011 – 2014. All Rights Reserved
Things You Can Do With RDDs
•  RDDs are objects and expose a rich set of methods:
15
Name Description Name Description
filter Return a new RDD containing only those
elements that satisfy a predicate
collect Return an array containing all the elements of
this RDD
count Return the number of elements in this
RDD
first Return the first element of this RDD
foreach Applies a function to all elements of this
RDD (does not return an RDD)
reduce Reduces the contents of this RDD
subtract Return an RDD without duplicates of
elements found in passed-in RDD
union Return an RDD that is a union of the passed-in
RDD and this one
Page 16 © Hortonworks Inc. 2011 – 2014. All Rights Reserved
More Things You Can Do With RDDs
•  More stuff you can do…
16
Name Description Name Description
flatMap Return a new RDD by first applying a
function to all elements of this RDD, and
then flattening the results
checkpoint Mark this RDD for checkpointing (its state will
be saved so it need not be recreated from
scratch)
cache Load the RDD into memory (what
doesn’t fit will be calculated as needed)
countByValue Return the count of each unique value in this
RDD as a map of (value, count) pairs
distinct Return a new RDD containing the
distinct elements in this RDD
persist Store the RDD to either memory, Disk, or
hybrid according to passed in value
sample Return a sampled subset of this RDD unpersist Clear any record of the RDD from disk/memory
Page 17 © Hortonworks Inc. 2011 – 2014. All Rights Reserved
Things You Can Do With PairRDDs
•  PairRDDs are RDDs containing Key Value Pairs:
17
Name Description Name Description
join Return an RDD containing all pairs of
elements with matching keys in this
and other.
groupByKey Group the values for each key in the RDD into
a single sequence.
keys Return an RDD containing the keys
from each tuple stored in this RDD
countByKey Return the count of number of each element for
each key in the form of a Map
lookup Return a list of values stored in this
RDD using the passed in key
leftOuterJoin Perform Left outer join
values Return an RDD with the values of each
tuple.
subtractByKey Return an RDD with the pairs from this whose
keys are not in other.
Page 18 © Hortonworks Inc. 2011 – 2014. All Rights Reserved
Spark MLlib – Algorithms Offered
•  Classification: logistic regression, linear SVM,
– naïve Bayes, least squares, classification tree
•  Regression: generalized linear models (GLMs),
– regression tree
•  Collaborative filtering: alternating least squares (ALS),
– non-negative matrix factorization (NMF)
•  Clustering: k-means
•  Decomposition: SVD, PCA
•  Optimization: stochastic gradient descent, L-BFGS

Mais conteúdo relacionado

Mais procurados

Coexistence and Migration of Vendor HPC based infrastructure to Hadoop Ecosys...
Coexistence and Migration of Vendor HPC based infrastructure to Hadoop Ecosys...Coexistence and Migration of Vendor HPC based infrastructure to Hadoop Ecosys...
Coexistence and Migration of Vendor HPC based infrastructure to Hadoop Ecosys...DataWorks Summit
 
Big Data Developers Moscow Meetup 1 - sql on hadoop
Big Data Developers Moscow Meetup 1  - sql on hadoopBig Data Developers Moscow Meetup 1  - sql on hadoop
Big Data Developers Moscow Meetup 1 - sql on hadoopbddmoscow
 
Using Familiar BI Tools and Hadoop to Analyze Enterprise Networks
Using Familiar BI Tools and Hadoop to Analyze Enterprise NetworksUsing Familiar BI Tools and Hadoop to Analyze Enterprise Networks
Using Familiar BI Tools and Hadoop to Analyze Enterprise NetworksDataWorks Summit
 
Improving Python and Spark Performance and Interoperability with Apache Arrow...
Improving Python and Spark Performance and Interoperability with Apache Arrow...Improving Python and Spark Performance and Interoperability with Apache Arrow...
Improving Python and Spark Performance and Interoperability with Apache Arrow...Databricks
 
Scaling HDFS to Manage Billions of Files with Distributed Storage Schemes
Scaling HDFS to Manage Billions of Files with Distributed Storage SchemesScaling HDFS to Manage Billions of Files with Distributed Storage Schemes
Scaling HDFS to Manage Billions of Files with Distributed Storage SchemesDataWorks Summit
 
NextGen Apache Hadoop MapReduce
NextGen Apache Hadoop MapReduceNextGen Apache Hadoop MapReduce
NextGen Apache Hadoop MapReduceHortonworks
 
Get most out of Spark on YARN
Get most out of Spark on YARNGet most out of Spark on YARN
Get most out of Spark on YARNDataWorks Summit
 
Hadoop 2 - Going beyond MapReduce
Hadoop 2 - Going beyond MapReduceHadoop 2 - Going beyond MapReduce
Hadoop 2 - Going beyond MapReduceUwe Printz
 
Big data processing with apache spark
Big data processing with apache sparkBig data processing with apache spark
Big data processing with apache sparksarith divakar
 
Getting Spark ready for real-time, operational analytics
Getting Spark ready for real-time, operational analyticsGetting Spark ready for real-time, operational analytics
Getting Spark ready for real-time, operational analyticsairisData
 
Hadoop in the Cloud - The what, why and how from the experts
Hadoop in the Cloud - The what, why and how from the expertsHadoop in the Cloud - The what, why and how from the experts
Hadoop in the Cloud - The what, why and how from the expertsDataWorks Summit/Hadoop Summit
 
Combining Machine Learning frameworks with Apache Spark
Combining Machine Learning frameworks with Apache SparkCombining Machine Learning frameworks with Apache Spark
Combining Machine Learning frameworks with Apache SparkDataWorks Summit/Hadoop Summit
 
Apache Spark 2.3 boosts advanced analytics and deep learning with Python
Apache Spark 2.3 boosts advanced analytics and deep learning with PythonApache Spark 2.3 boosts advanced analytics and deep learning with Python
Apache Spark 2.3 boosts advanced analytics and deep learning with PythonDataWorks Summit
 
Intro to big data analytics using microsoft machine learning server with spark
Intro to big data analytics using microsoft machine learning server with sparkIntro to big data analytics using microsoft machine learning server with spark
Intro to big data analytics using microsoft machine learning server with sparkAlex Zeltov
 
Query Engines for Hive: MR, Spark, Tez with LLAP – Considerations!
Query Engines for Hive: MR, Spark, Tez with LLAP – Considerations!Query Engines for Hive: MR, Spark, Tez with LLAP – Considerations!
Query Engines for Hive: MR, Spark, Tez with LLAP – Considerations!Mich Talebzadeh (Ph.D.)
 
Hive, Impala, and Spark, Oh My: SQL-on-Hadoop in Cloudera 5.5
Hive, Impala, and Spark, Oh My: SQL-on-Hadoop in Cloudera 5.5Hive, Impala, and Spark, Oh My: SQL-on-Hadoop in Cloudera 5.5
Hive, Impala, and Spark, Oh My: SQL-on-Hadoop in Cloudera 5.5Cloudera, Inc.
 

Mais procurados (20)

Spark Uber Development Kit
Spark Uber Development KitSpark Uber Development Kit
Spark Uber Development Kit
 
Coexistence and Migration of Vendor HPC based infrastructure to Hadoop Ecosys...
Coexistence and Migration of Vendor HPC based infrastructure to Hadoop Ecosys...Coexistence and Migration of Vendor HPC based infrastructure to Hadoop Ecosys...
Coexistence and Migration of Vendor HPC based infrastructure to Hadoop Ecosys...
 
Running Spark in Production
Running Spark in ProductionRunning Spark in Production
Running Spark in Production
 
Big Data Developers Moscow Meetup 1 - sql on hadoop
Big Data Developers Moscow Meetup 1  - sql on hadoopBig Data Developers Moscow Meetup 1  - sql on hadoop
Big Data Developers Moscow Meetup 1 - sql on hadoop
 
Using Familiar BI Tools and Hadoop to Analyze Enterprise Networks
Using Familiar BI Tools and Hadoop to Analyze Enterprise NetworksUsing Familiar BI Tools and Hadoop to Analyze Enterprise Networks
Using Familiar BI Tools and Hadoop to Analyze Enterprise Networks
 
Improving Python and Spark Performance and Interoperability with Apache Arrow...
Improving Python and Spark Performance and Interoperability with Apache Arrow...Improving Python and Spark Performance and Interoperability with Apache Arrow...
Improving Python and Spark Performance and Interoperability with Apache Arrow...
 
Scaling HDFS to Manage Billions of Files with Distributed Storage Schemes
Scaling HDFS to Manage Billions of Files with Distributed Storage SchemesScaling HDFS to Manage Billions of Files with Distributed Storage Schemes
Scaling HDFS to Manage Billions of Files with Distributed Storage Schemes
 
NextGen Apache Hadoop MapReduce
NextGen Apache Hadoop MapReduceNextGen Apache Hadoop MapReduce
NextGen Apache Hadoop MapReduce
 
Get most out of Spark on YARN
Get most out of Spark on YARNGet most out of Spark on YARN
Get most out of Spark on YARN
 
Hadoop 2 - Going beyond MapReduce
Hadoop 2 - Going beyond MapReduceHadoop 2 - Going beyond MapReduce
Hadoop 2 - Going beyond MapReduce
 
Big data processing with apache spark
Big data processing with apache sparkBig data processing with apache spark
Big data processing with apache spark
 
Getting Spark ready for real-time, operational analytics
Getting Spark ready for real-time, operational analyticsGetting Spark ready for real-time, operational analytics
Getting Spark ready for real-time, operational analytics
 
Hadoop in the Cloud - The what, why and how from the experts
Hadoop in the Cloud - The what, why and how from the expertsHadoop in the Cloud - The what, why and how from the experts
Hadoop in the Cloud - The what, why and how from the experts
 
Combining Machine Learning frameworks with Apache Spark
Combining Machine Learning frameworks with Apache SparkCombining Machine Learning frameworks with Apache Spark
Combining Machine Learning frameworks with Apache Spark
 
Apache Spark 2.3 boosts advanced analytics and deep learning with Python
Apache Spark 2.3 boosts advanced analytics and deep learning with PythonApache Spark 2.3 boosts advanced analytics and deep learning with Python
Apache Spark 2.3 boosts advanced analytics and deep learning with Python
 
Polyalgebra
PolyalgebraPolyalgebra
Polyalgebra
 
Intro to big data analytics using microsoft machine learning server with spark
Intro to big data analytics using microsoft machine learning server with sparkIntro to big data analytics using microsoft machine learning server with spark
Intro to big data analytics using microsoft machine learning server with spark
 
Query Engines for Hive: MR, Spark, Tez with LLAP – Considerations!
Query Engines for Hive: MR, Spark, Tez with LLAP – Considerations!Query Engines for Hive: MR, Spark, Tez with LLAP – Considerations!
Query Engines for Hive: MR, Spark, Tez with LLAP – Considerations!
 
Time-oriented event search. A new level of scale
Time-oriented event search. A new level of scale Time-oriented event search. A new level of scale
Time-oriented event search. A new level of scale
 
Hive, Impala, and Spark, Oh My: SQL-on-Hadoop in Cloudera 5.5
Hive, Impala, and Spark, Oh My: SQL-on-Hadoop in Cloudera 5.5Hive, Impala, and Spark, Oh My: SQL-on-Hadoop in Cloudera 5.5
Hive, Impala, and Spark, Oh My: SQL-on-Hadoop in Cloudera 5.5
 

Destaque

In-Memory Databases, Trends and Technologies (2012)
In-Memory Databases, Trends and Technologies (2012)In-Memory Databases, Trends and Technologies (2012)
In-Memory Databases, Trends and Technologies (2012)Vilho Raatikka
 
Kelompok 1 struktur aljabar
Kelompok 1 struktur aljabarKelompok 1 struktur aljabar
Kelompok 1 struktur aljabarYuli Sinaga
 
Cassandra Day SV 2014: Spark, Shark, and Apache Cassandra
Cassandra Day SV 2014: Spark, Shark, and Apache CassandraCassandra Day SV 2014: Spark, Shark, and Apache Cassandra
Cassandra Day SV 2014: Spark, Shark, and Apache CassandraDataStax Academy
 
Performance of Spark vs MapReduce
Performance of Spark vs MapReducePerformance of Spark vs MapReduce
Performance of Spark vs MapReduceEdureka!
 
Slim Baltagi – Flink vs. Spark
Slim Baltagi – Flink vs. SparkSlim Baltagi – Flink vs. Spark
Slim Baltagi – Flink vs. SparkFlink Forward
 
Scaling up with hadoop and banyan at ITRIX-2015, College of Engineering, Guindy
Scaling up with hadoop and banyan at ITRIX-2015, College of Engineering, GuindyScaling up with hadoop and banyan at ITRIX-2015, College of Engineering, Guindy
Scaling up with hadoop and banyan at ITRIX-2015, College of Engineering, GuindyRohit Kulkarni
 
BBC: CI Problems and our Solutions by Simon Thulbourn
BBC: CI Problems and our Solutions by Simon ThulbournBBC: CI Problems and our Solutions by Simon Thulbourn
BBC: CI Problems and our Solutions by Simon ThulbournDocker, Inc.
 
Revamping Development and Testing Using Docker – Transforming Enterprise IT b...
Revamping Development and Testing Using Docker – Transforming Enterprise IT b...Revamping Development and Testing Using Docker – Transforming Enterprise IT b...
Revamping Development and Testing Using Docker – Transforming Enterprise IT b...Docker, Inc.
 
Hadoop or Spark: is it an either-or proposition? By Slim Baltagi
Hadoop or Spark: is it an either-or proposition? By Slim BaltagiHadoop or Spark: is it an either-or proposition? By Slim Baltagi
Hadoop or Spark: is it an either-or proposition? By Slim BaltagiSlim Baltagi
 
Big Data Analytics with Hadoop
Big Data Analytics with HadoopBig Data Analytics with Hadoop
Big Data Analytics with HadoopPhilippe Julio
 

Destaque (15)

In-Memory Databases, Trends and Technologies (2012)
In-Memory Databases, Trends and Technologies (2012)In-Memory Databases, Trends and Technologies (2012)
In-Memory Databases, Trends and Technologies (2012)
 
Kelompok 1 struktur aljabar
Kelompok 1 struktur aljabarKelompok 1 struktur aljabar
Kelompok 1 struktur aljabar
 
Cassandra Day SV 2014: Spark, Shark, and Apache Cassandra
Cassandra Day SV 2014: Spark, Shark, and Apache CassandraCassandra Day SV 2014: Spark, Shark, and Apache Cassandra
Cassandra Day SV 2014: Spark, Shark, and Apache Cassandra
 
Performance of Spark vs MapReduce
Performance of Spark vs MapReducePerformance of Spark vs MapReduce
Performance of Spark vs MapReduce
 
Slim Baltagi – Flink vs. Spark
Slim Baltagi – Flink vs. SparkSlim Baltagi – Flink vs. Spark
Slim Baltagi – Flink vs. Spark
 
Scaling up with hadoop and banyan at ITRIX-2015, College of Engineering, Guindy
Scaling up with hadoop and banyan at ITRIX-2015, College of Engineering, GuindyScaling up with hadoop and banyan at ITRIX-2015, College of Engineering, Guindy
Scaling up with hadoop and banyan at ITRIX-2015, College of Engineering, Guindy
 
Apache HAWQ Architecture
Apache HAWQ ArchitectureApache HAWQ Architecture
Apache HAWQ Architecture
 
MPP vs Hadoop
MPP vs HadoopMPP vs Hadoop
MPP vs Hadoop
 
BBC: CI Problems and our Solutions by Simon Thulbourn
BBC: CI Problems and our Solutions by Simon ThulbournBBC: CI Problems and our Solutions by Simon Thulbourn
BBC: CI Problems and our Solutions by Simon Thulbourn
 
Revamping Development and Testing Using Docker – Transforming Enterprise IT b...
Revamping Development and Testing Using Docker – Transforming Enterprise IT b...Revamping Development and Testing Using Docker – Transforming Enterprise IT b...
Revamping Development and Testing Using Docker – Transforming Enterprise IT b...
 
Hadoop or Spark: is it an either-or proposition? By Slim Baltagi
Hadoop or Spark: is it an either-or proposition? By Slim BaltagiHadoop or Spark: is it an either-or proposition? By Slim Baltagi
Hadoop or Spark: is it an either-or proposition? By Slim Baltagi
 
Big Data Architectural Patterns
Big Data Architectural PatternsBig Data Architectural Patterns
Big Data Architectural Patterns
 
Modern Data Architecture
Modern Data ArchitectureModern Data Architecture
Modern Data Architecture
 
Big Data Analytics with Hadoop
Big Data Analytics with HadoopBig Data Analytics with Hadoop
Big Data Analytics with Hadoop
 
Apache Spark Architecture
Apache Spark ArchitectureApache Spark Architecture
Apache Spark Architecture
 

Semelhante a Spark mhug2

Apache Spark Workshop at Hadoop Summit
Apache Spark Workshop at Hadoop SummitApache Spark Workshop at Hadoop Summit
Apache Spark Workshop at Hadoop SummitSaptak Sen
 
Spark crash course workshop at Hadoop Summit
Spark crash course workshop at Hadoop SummitSpark crash course workshop at Hadoop Summit
Spark crash course workshop at Hadoop SummitDataWorks Summit
 
Intro to Spark with Zeppelin
Intro to Spark with ZeppelinIntro to Spark with Zeppelin
Intro to Spark with ZeppelinHortonworks
 
Tez: Accelerating Data Pipelines - fifthel
Tez: Accelerating Data Pipelines - fifthelTez: Accelerating Data Pipelines - fifthel
Tez: Accelerating Data Pipelines - fifthelt3rmin4t0r
 
Apache Spark: Lightning Fast Cluster Computing
Apache Spark: Lightning Fast Cluster ComputingApache Spark: Lightning Fast Cluster Computing
Apache Spark: Lightning Fast Cluster ComputingAll Things Open
 
Boston Spark Meetup event Slides Update
Boston Spark Meetup event Slides UpdateBoston Spark Meetup event Slides Update
Boston Spark Meetup event Slides Updatevithakur
 
Intro to Big Data Analytics using Apache Spark and Apache Zeppelin
Intro to Big Data Analytics using Apache Spark and Apache ZeppelinIntro to Big Data Analytics using Apache Spark and Apache Zeppelin
Intro to Big Data Analytics using Apache Spark and Apache ZeppelinAlex Zeltov
 
Apache Spark Introduction and Resilient Distributed Dataset basics and deep dive
Apache Spark Introduction and Resilient Distributed Dataset basics and deep diveApache Spark Introduction and Resilient Distributed Dataset basics and deep dive
Apache Spark Introduction and Resilient Distributed Dataset basics and deep diveSachin Aggarwal
 
Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache SparkVincent Poncet
 
Apache Tez - A unifying Framework for Hadoop Data Processing
Apache Tez - A unifying Framework for Hadoop Data ProcessingApache Tez - A unifying Framework for Hadoop Data Processing
Apache Tez - A unifying Framework for Hadoop Data ProcessingDataWorks Summit
 
Big Data Day LA 2015 - What's new and next in Apache Tez by Bikas Saha of Hor...
Big Data Day LA 2015 - What's new and next in Apache Tez by Bikas Saha of Hor...Big Data Day LA 2015 - What's new and next in Apache Tez by Bikas Saha of Hor...
Big Data Day LA 2015 - What's new and next in Apache Tez by Bikas Saha of Hor...Data Con LA
 
Big Data Hoopla Simplified - TDWI Memphis 2014
Big Data Hoopla Simplified - TDWI Memphis 2014Big Data Hoopla Simplified - TDWI Memphis 2014
Big Data Hoopla Simplified - TDWI Memphis 2014Rajan Kanitkar
 
Reactive dashboard’s using apache spark
Reactive dashboard’s using apache sparkReactive dashboard’s using apache spark
Reactive dashboard’s using apache sparkRahul Kumar
 
Fast Data Analytics with Spark and Python
Fast Data Analytics with Spark and PythonFast Data Analytics with Spark and Python
Fast Data Analytics with Spark and PythonBenjamin Bengfort
 
Unit II Real Time Data Processing tools.pptx
Unit II Real Time Data Processing tools.pptxUnit II Real Time Data Processing tools.pptx
Unit II Real Time Data Processing tools.pptxRahul Borate
 
Quadrupling your elephants - RDF and the Hadoop ecosystem
Quadrupling your elephants - RDF and the Hadoop ecosystemQuadrupling your elephants - RDF and the Hadoop ecosystem
Quadrupling your elephants - RDF and the Hadoop ecosystemRob Vesse
 

Semelhante a Spark mhug2 (20)

Apache Spark Workshop at Hadoop Summit
Apache Spark Workshop at Hadoop SummitApache Spark Workshop at Hadoop Summit
Apache Spark Workshop at Hadoop Summit
 
Spark crash course workshop at Hadoop Summit
Spark crash course workshop at Hadoop SummitSpark crash course workshop at Hadoop Summit
Spark crash course workshop at Hadoop Summit
 
Intro to Spark with Zeppelin
Intro to Spark with ZeppelinIntro to Spark with Zeppelin
Intro to Spark with Zeppelin
 
Tez: Accelerating Data Pipelines - fifthel
Tez: Accelerating Data Pipelines - fifthelTez: Accelerating Data Pipelines - fifthel
Tez: Accelerating Data Pipelines - fifthel
 
Apache Spark: Lightning Fast Cluster Computing
Apache Spark: Lightning Fast Cluster ComputingApache Spark: Lightning Fast Cluster Computing
Apache Spark: Lightning Fast Cluster Computing
 
Boston Spark Meetup event Slides Update
Boston Spark Meetup event Slides UpdateBoston Spark Meetup event Slides Update
Boston Spark Meetup event Slides Update
 
Intro to Big Data Analytics using Apache Spark and Apache Zeppelin
Intro to Big Data Analytics using Apache Spark and Apache ZeppelinIntro to Big Data Analytics using Apache Spark and Apache Zeppelin
Intro to Big Data Analytics using Apache Spark and Apache Zeppelin
 
Apache Spark Introduction and Resilient Distributed Dataset basics and deep dive
Apache Spark Introduction and Resilient Distributed Dataset basics and deep diveApache Spark Introduction and Resilient Distributed Dataset basics and deep dive
Apache Spark Introduction and Resilient Distributed Dataset basics and deep dive
 
Data Science
Data ScienceData Science
Data Science
 
Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache Spark
 
Apache Tez - A unifying Framework for Hadoop Data Processing
Apache Tez - A unifying Framework for Hadoop Data ProcessingApache Tez - A unifying Framework for Hadoop Data Processing
Apache Tez - A unifying Framework for Hadoop Data Processing
 
Big Data Day LA 2015 - What's new and next in Apache Tez by Bikas Saha of Hor...
Big Data Day LA 2015 - What's new and next in Apache Tez by Bikas Saha of Hor...Big Data Day LA 2015 - What's new and next in Apache Tez by Bikas Saha of Hor...
Big Data Day LA 2015 - What's new and next in Apache Tez by Bikas Saha of Hor...
 
Big Data Hoopla Simplified - TDWI Memphis 2014
Big Data Hoopla Simplified - TDWI Memphis 2014Big Data Hoopla Simplified - TDWI Memphis 2014
Big Data Hoopla Simplified - TDWI Memphis 2014
 
Reactive dashboard’s using apache spark
Reactive dashboard’s using apache sparkReactive dashboard’s using apache spark
Reactive dashboard’s using apache spark
 
Fast Data Analytics with Spark and Python
Fast Data Analytics with Spark and PythonFast Data Analytics with Spark and Python
Fast Data Analytics with Spark and Python
 
Unit II Real Time Data Processing tools.pptx
Unit II Real Time Data Processing tools.pptxUnit II Real Time Data Processing tools.pptx
Unit II Real Time Data Processing tools.pptx
 
Hackathon bonn
Hackathon bonnHackathon bonn
Hackathon bonn
 
Quadrupling your elephants - RDF and the Hadoop ecosystem
Quadrupling your elephants - RDF and the Hadoop ecosystemQuadrupling your elephants - RDF and the Hadoop ecosystem
Quadrupling your elephants - RDF and the Hadoop ecosystem
 
Apache Spark & Hadoop
Apache Spark & HadoopApache Spark & Hadoop
Apache Spark & Hadoop
 
Apache Spark with Scala
Apache Spark with ScalaApache Spark with Scala
Apache Spark with Scala
 

Último

Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Igalia
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdfhans926745
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationMichael W. Hawkins
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024Rafal Los
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxKatpro Technologies
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Miguel Araújo
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slidevu2urc
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationSafe Software
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityPrincipled Technologies
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfEnterprise Knowledge
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024Results
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking MenDelhi Call girls
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024The Digital Insurer
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking MenDelhi Call girls
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slidespraypatel2
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptxHampshireHUG
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?Antenna Manufacturer Coco
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processorsdebabhi2
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountPuma Security, LLC
 

Último (20)

Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slides
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path Mount
 

Spark mhug2

  • 1. Page 1 © Hortonworks Inc. 2011 – 2014. All Rights Reserved Apache Spark
  • 2. Page 2 © Hortonworks Inc. 2011 – 2014. All Rights Reserved HDP: A Complete & Open Hadoop Distribution Provision, Manage & Monitor Ambari Zookeeper Scheduling Oozie Data Workflow, Lifecycle & Governance Falcon Sqoop Flume NFS WebHDFS YARN: Data Operating System DATA MANAGEMENT SECURITY BATCH, INTERACTIVE & REAL-TIME DATA ACCESS GOVERNANCE & INTEGRATION Authentication Authorization Accounting Data Protection Storage: HDFS Resources: YARN Access: Hive, … Pipeline: Falcon Cluster: Knox OPERATIONS Script Pig Search Solr SQL Hive HCatalog NoSQL HBase Accumulo Stream Storm Others ISV Engines 1 ° ° ° ° ° ° ° ° ° ° ° ° ° ° ° ° ° ° ° ° ° ° ° ° ° ° ° ° ° ° ° N HDFS (Hadoop Distributed File System) In-Memory Spark Tez Tez HDP 2.x Hortonworks Data Platform
  • 3. Page 3 © Hortonworks Inc. 2011 – 2014. All Rights Reserved What is Spark? •  Spark is –  an open-source Software solution that performs rapid calculations on in-memory datasets -  Open Source [Apache hosted & licensed] •  Free to download and use in production •  Developed by a community of developers [most of whom work for DataBricks] -  In-memory datasets •  RDD (Resilient Distributed Data) is the basis for what Spark enables •  Resilient – the models can be recreated on the fly from known state •  Immutable – already defined RDDs can be used as a basis to generate derivative RDDs but are never mutated •  Distributed – the dataset is often partitioned across multiple nodes for increased scalability and parallelism
  • 4. Page 4 © Hortonworks Inc. 2011 – 2014. All Rights Reserved Why Spark? - It’s About Ramp-Up & Reality •  Spark supports using well known languages such as •  Scala* •  Python •  Java •  Using Spark Streaming Same code can be used on •  Data at rest •  Data in motion •  Huge Community building around Spark
  • 5. Page 5 © Hortonworks Inc. 2011 – 2014. All Rights Reserved Spark is Expansive: •  Fast and general processing engine for large scale data processing •  Encourages reuse of libraries across several problem domains •  Designed for iterative computations and interactive data mining Spark SQL
  • 6. Page 6 © Hortonworks Inc. 2011 – 2014. All Rights Reserved Spark Vs MapReduce Vs Tez? MapReduce – On disk to HDFS Tez – On Disk to Local Disk Spark – In memory Input Disk Disk Disk Write Read Write
  • 7. Page 7 © Hortonworks Inc. 2011 – 2014. All Rights Reserved RDD Primitives - Resilient Distributed Datasets (RDD) -  Immutable partitioned collection of objects - Transformations (map, filter, groupby, join) -  Lazy operations to build RDD from RDD - Actions (count, collect, save) -  Return a result or write it to storage
  • 8. Page 8 © Hortonworks Inc. 2011 – 2014. All Rights Reserved Fault Recovery RDDs track lineage information that can be used to efficiently re-compute lost data msgs  =  textFile.filter(lambda  s:  s.startsWith(“ERROR”))                                .map(lambda  s:  s.split(“t”)[2])   HDFS File Filtered RDD Mapped RDD filter   (func  =  startsWith(…))   map   (func  =  split(...))  
  • 9. Page 9 © Hortonworks Inc. 2011 – 2014. All Rights Reserved RDD RDD RDD RDD Transformations Action Value linesWithSpark = textFile.filter(lambda line: "Spark” in line)! linesWithSpark.count()! 74! ! linesWithSpark.first()! # Apache Spark! textFile = sc.textFile(”SomeFile.txt”)! Working with RDDs
  • 10. Page 10 © Hortonworks Inc. 2011 – 2014. All Rights Reserved RDD Graph sc.textFile("/some-­‐hdfs-­‐data")         map  map   reduceByKey   collect  textFile   .flatMap(line=>line.split("  "))   .map(word=>(word,  1)))   .reduceByKey(_  +  _,  3)   .collect()   RDD[String] RDD[List[String]] RDD[(String, Int)] Array[(String, Int)] RDD[(String, Int)]
  • 11. Page 11 © Hortonworks Inc. 2011 – 2014. All Rights Reserved DAG Scheduler map  map   reduceByKey   collect  textFile   map   Stage  2  Stage  1   map   reduceByKey   collect  textFile  
  • 12. Page 12 © Hortonworks Inc. 2011 – 2014. All Rights Reserved Task •  Fundamental unit of execution in Spark -  A. Fetch input from InputFormat or a shuffle -  B. Execute the task -  C. Materialize task output as shuffle or driver result Execute task Fetch input Write output Pipelined Execution
  • 13. Page 13 © Hortonworks Inc. 2011 – 2014. All Rights Reserved Worker Execute task Fetch input Write output Execute task Fetch input Write output Execute task Fetch input Write output Execute task Fetch input Write output Execute task Fetch input Write output Execute task Fetch input Write output Execute task Fetch input Write output Core 1 Core 2 Core 3
  • 14. Page 14 © Hortonworks Inc. 2011 – 2014. All Rights Reserved Spark on YARN YARN RM App Master Monitoring UI
  • 15. Page 15 © Hortonworks Inc. 2011 – 2014. All Rights Reserved Things You Can Do With RDDs •  RDDs are objects and expose a rich set of methods: 15 Name Description Name Description filter Return a new RDD containing only those elements that satisfy a predicate collect Return an array containing all the elements of this RDD count Return the number of elements in this RDD first Return the first element of this RDD foreach Applies a function to all elements of this RDD (does not return an RDD) reduce Reduces the contents of this RDD subtract Return an RDD without duplicates of elements found in passed-in RDD union Return an RDD that is a union of the passed-in RDD and this one
  • 16. Page 16 © Hortonworks Inc. 2011 – 2014. All Rights Reserved More Things You Can Do With RDDs •  More stuff you can do… 16 Name Description Name Description flatMap Return a new RDD by first applying a function to all elements of this RDD, and then flattening the results checkpoint Mark this RDD for checkpointing (its state will be saved so it need not be recreated from scratch) cache Load the RDD into memory (what doesn’t fit will be calculated as needed) countByValue Return the count of each unique value in this RDD as a map of (value, count) pairs distinct Return a new RDD containing the distinct elements in this RDD persist Store the RDD to either memory, Disk, or hybrid according to passed in value sample Return a sampled subset of this RDD unpersist Clear any record of the RDD from disk/memory
  • 17. Page 17 © Hortonworks Inc. 2011 – 2014. All Rights Reserved Things You Can Do With PairRDDs •  PairRDDs are RDDs containing Key Value Pairs: 17 Name Description Name Description join Return an RDD containing all pairs of elements with matching keys in this and other. groupByKey Group the values for each key in the RDD into a single sequence. keys Return an RDD containing the keys from each tuple stored in this RDD countByKey Return the count of number of each element for each key in the form of a Map lookup Return a list of values stored in this RDD using the passed in key leftOuterJoin Perform Left outer join values Return an RDD with the values of each tuple. subtractByKey Return an RDD with the pairs from this whose keys are not in other.
  • 18. Page 18 © Hortonworks Inc. 2011 – 2014. All Rights Reserved Spark MLlib – Algorithms Offered •  Classification: logistic regression, linear SVM, – naïve Bayes, least squares, classification tree •  Regression: generalized linear models (GLMs), – regression tree •  Collaborative filtering: alternating least squares (ALS), – non-negative matrix factorization (NMF) •  Clustering: k-means •  Decomposition: SVD, PCA •  Optimization: stochastic gradient descent, L-BFGS