Spark mhug2
Joseph Niemiec
Apache Spark Intro MHUG Presentation
1.
Apache Spark
© Hortonworks Inc. 2011 – 2014. All Rights Reserved
2.
HDP: A Complete & Open Hadoop Distribution
[Diagram: Hortonworks Data Platform (HDP 2.x)]
• Governance & Integration – Data Workflow, Lifecycle & Governance: Falcon, Sqoop, Flume, NFS, WebHDFS
• Batch, Interactive & Real-Time Data Access – Script: Pig; SQL: Hive, HCatalog; NoSQL: HBase, Accumulo; Stream: Storm; Search: Solr; In-Memory: Spark; Others: ISV Engines; Tez
• Data Management – YARN: Data Operating System; HDFS (Hadoop Distributed File System) across nodes 1 … N
• Security – Authentication, Authorization, Accounting, Data Protection (Storage: HDFS; Resources: YARN; Access: Hive, …; Pipeline: Falcon; Cluster: Knox)
• Operations – Provision, Manage & Monitor: Ambari, Zookeeper; Scheduling: Oozie
3.
What is Spark?
• Spark is an open-source software solution that performs rapid calculations on in-memory datasets
- Open source [Apache hosted & licensed]
  • Free to download and use in production
  • Developed by a community of developers [most of whom work for Databricks]
- In-memory datasets
  • The RDD (Resilient Distributed Dataset) is the basis for what Spark enables
  • Resilient – a lost dataset can be recreated on the fly from a known state
  • Immutable – already defined RDDs can be used as a basis to generate derivative RDDs but are never mutated
  • Distributed – the dataset is often partitioned across multiple nodes for increased scalability and parallelism
4.
Why Spark? - It’s About Ramp-Up & Reality
• Spark supports well-known languages such as
  - Scala*
  - Python
  - Java
• With Spark Streaming, the same code can be used on
  - Data at rest
  - Data in motion
• A huge community is building around Spark
5.
Spark is Expansive:
• Fast and general processing engine for large-scale data processing
• Encourages reuse of libraries across several problem domains
• Designed for iterative computations and interactive data mining
[Diagram label: Spark SQL]
6.
Spark vs. MapReduce vs. Tez?
• MapReduce – on disk, to HDFS
• Tez – on disk, to local disk
• Spark – in memory
[Diagram: input flowing through successive disk writes and reads]
7.
RDD Primitives
- Resilient Distributed Datasets (RDD)
  - Immutable, partitioned collection of objects
- Transformations (map, filter, groupBy, join)
  - Lazy operations that build a new RDD from an existing RDD
- Actions (count, collect, save)
  - Return a result or write it to storage
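The split between lazy transformations and eager actions can be sketched without a cluster. The block below is a toy, single-machine model in plain Python, not Spark's implementation; `ToyRDD` and `traced_upper` are hypothetical names introduced only for illustration. It assumes that a transformation merely wraps the previous step and that work happens only when an action walks the chain.

```python
from typing import Callable, Iterable, List


class ToyRDD:
    """A toy, single-machine stand-in for an RDD (hypothetical, not Spark's API)."""

    def __init__(self, source: Callable[[], Iterable]):
        # `source` is a zero-argument function that yields the elements on demand.
        self._source = source

    # Transformations: return a new ToyRDD and do no work yet.
    def map(self, fn: Callable) -> "ToyRDD":
        return ToyRDD(lambda: (fn(x) for x in self._source()))

    def filter(self, pred: Callable) -> "ToyRDD":
        return ToyRDD(lambda: (x for x in self._source() if pred(x)))

    # Actions: walk the whole chain and return a plain Python value.
    def collect(self) -> List:
        return list(self._source())

    def count(self) -> int:
        return sum(1 for _ in self._source())


calls = []  # records every time the map function actually runs


def traced_upper(s: str) -> str:
    calls.append(s)
    return s.upper()


lines = ToyRDD(lambda: ["error a", "info b", "error c"])
errors = lines.filter(lambda s: s.startswith("error")).map(traced_upper)

calls_before_action = len(calls)  # nothing has been computed at this point
result = errors.collect()         # the action triggers the whole chain
```

Because `lines` is never modified, it can serve as the basis for any number of derived datasets, which mirrors the immutability point above.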
8.
Fault Recovery
RDDs track lineage information that can be used to efficiently recompute lost data.

    msgs = textFile.filter(lambda s: s.startswith("ERROR")) \
                   .map(lambda s: s.split("\t")[2])

[Diagram: HDFS File → filter (func = startswith(…)) → Filtered RDD → map (func = split(...)) → Mapped RDD]
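A minimal sketch of lineage-based recovery, again in plain Python rather than Spark itself. It assumes the lineage is an ordered list of filter and map steps and that each partition can be rebuilt independently from its source partition; `source_partitions`, `lineage`, and `compute_partition` are hypothetical names, and the sample log lines are made up for illustration.

```python
from typing import Callable, Dict, List, Tuple

# Source data split into partitions, standing in for blocks of an HDFS file.
source_partitions: Dict[int, List[str]] = {
    0: ["ERROR\tdisk\tfull", "INFO\tboot\tok"],
    1: ["ERROR\tnet\ttimeout", "WARN\tcpu\thot"],
}

# Lineage: the ordered steps that derive the final dataset from the source.
Step = Tuple[str, Callable]
lineage: List[Step] = [
    ("filter", lambda s: s.startswith("ERROR")),
    ("map", lambda s: s.split("\t")[2]),
]


def compute_partition(index: int) -> List[str]:
    """Replay the lineage over one source partition."""
    rows = source_partitions[index]
    for kind, fn in lineage:
        if kind == "filter":
            rows = [r for r in rows if fn(r)]
        elif kind == "map":
            rows = [fn(r) for r in rows]
    return rows


# Materialize every partition once, as if held in executor memory.
in_memory = {i: compute_partition(i) for i in source_partitions}

# Simulate losing the node that held partition 1.
del in_memory[1]

# Recovery recomputes only the missing partition; partition 0 is untouched.
in_memory[1] = compute_partition(1)
```

The point of the sketch is that nothing beyond the source data and the recipe needs to be replicated: the derived data can always be rebuilt from them.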
9.
Working with RDDs
[Diagram: RDD → RDD → RDD → RDD (Transformations) → Action → Value]

    textFile = sc.textFile("SomeFile.txt")

    linesWithSpark = textFile.filter(lambda line: "Spark" in line)

    linesWithSpark.count()
    74

    linesWithSpark.first()
    # Apache Spark
10.
RDD Graph
[Diagram: sc.textFile("/some-hdfs-data") → map → map → reduceByKey → collect, with types RDD[String] → RDD[List[String]] → RDD[(String, Int)] → RDD[(String, Int)] → Array[(String, Int)]]

    textFile
      .flatMap(line => line.split(" "))
      .map(word => (word, 1))
      .reduceByKey(_ + _, 3)
      .collect()
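The same word count can be traced step by step on a single machine. The sketch below uses only the Python standard library, so it shows what each stage of the chain produces rather than how Spark distributes it; the two input lines are made-up sample data.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

lines: List[str] = ["to be or", "not to be"]  # plays the role of the text file

# flatMap: split each line and flatten the pieces into one sequence of words
words: List[str] = [w for line in lines for w in line.split(" ")]

# map: pair every word with the count 1
pairs: List[Tuple[str, int]] = [(w, 1) for w in words]

# reduceByKey(_ + _): sum the counts that share a key
totals: Dict[str, int] = defaultdict(int)
for word, n in pairs:
    totals[word] += n

# collect: hand the result back as a local list of (word, count) pairs,
# sorted here only to make the output order deterministic
counts: List[Tuple[str, int]] = sorted(totals.items())
```

Note that the flattening step yields individual words, not one list per line, which is what lets the next step pair each word with a count.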
11.
DAG Scheduler
[Diagram: the chain textFile → map → map → reduceByKey → collect, shown once as a whole and once divided into Stage 1 and Stage 2]
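One way to picture how a chain of operations is cut into stages is to start a new stage wherever an operation needs data from every partition. The sketch below is an illustrative simplification in plain Python, not the scheduler's actual algorithm; `WIDE_OPS` and `split_into_stages` are hypothetical names, and the set of shuffle-forcing operations is an assumption that is far from exhaustive.

```python
from typing import List

# Operations that need data from every partition and so force a shuffle.
# This set is an illustrative assumption, not an exhaustive list.
WIDE_OPS = {"reduceByKey", "groupByKey", "join"}


def split_into_stages(ops: List[str]) -> List[List[str]]:
    """Group a linear chain of operations into stages, cutting before each wide op."""
    stages: List[List[str]] = [[]]
    for op in ops:
        if op in WIDE_OPS and stages[-1]:
            stages.append([])  # shuffle boundary: start a new stage
        stages[-1].append(op)
    return stages


chain = ["textFile", "map", "map", "reduceByKey", "collect"]
stages = split_into_stages(chain)
```

Under these assumptions the word-count chain falls into two stages, with the cut at `reduceByKey`; everything before the cut can run on each partition without talking to the others.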
12.
Task
• Fundamental unit of execution in Spark
  - A. Fetch input from InputFormat or a shuffle
  - B. Execute the task
  - C. Materialize task output as shuffle or driver result
[Diagram: Pipelined Execution, with Fetch input → Execute task → Write output]
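The three steps of a task, and the idea of pipelining them, can be sketched with Python generators. This is a single-process illustration under the assumption that records stream through one at a time; `fetch_input`, `execute`, `write_output`, and `run_task` are hypothetical helper names, not Spark functions.

```python
from typing import Callable, Iterable, Iterator, List


def fetch_input(partition: Iterable[str]) -> Iterator[str]:
    """A. Fetch: yield records one at a time from the task's input."""
    for record in partition:
        yield record


def execute(records: Iterator[str], fn: Callable[[str], str]) -> Iterator[str]:
    """B. Execute: apply the task's function to each record as it arrives."""
    for record in records:
        yield fn(record)


def write_output(records: Iterator[str], sink: List[str]) -> int:
    """C. Materialize: append each result to the sink and return the record count."""
    written = 0
    for record in records:
        sink.append(record)
        written += 1
    return written


def run_task(partition: Iterable[str], fn: Callable[[str], str], sink: List[str]) -> int:
    # The generators are chained, so each record flows through all three
    # steps before the next one is fetched; no intermediate list is built.
    return write_output(execute(fetch_input(partition), fn), sink)


output: List[str] = []
n = run_task(["a", "b", "c"], str.upper, output)
```

Chaining generators keeps memory use flat regardless of partition size, which is the practical benefit of pipelining within a task.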
13.
Worker
[Diagram: a worker running tasks on Core 1, Core 2 and Core 3; each task is a Fetch input → Execute task → Write output sequence, with several tasks running one after another on each core]
14.
Spark on YARN
[Diagram: YARN RM, App Master, Monitoring UI]
15.
Things You Can Do With RDDs
• RDDs are objects and expose a rich set of methods:

Name      Description
filter    Return a new RDD containing only those elements that satisfy a predicate
collect   Return an array containing all the elements of this RDD
count     Return the number of elements in this RDD
first     Return the first element of this RDD
foreach   Apply a function to all elements of this RDD (does not return an RDD)
reduce    Reduce the elements of this RDD to a single value using the passed-in function
subtract  Return an RDD containing the elements of this RDD that are not found in the passed-in RDD
union     Return an RDD that is the union of the passed-in RDD and this one
16.
More Things You Can Do With RDDs
• More stuff you can do…

Name          Description
flatMap       Return a new RDD by first applying a function to all elements of this RDD, and then flattening the results
checkpoint    Mark this RDD for checkpointing (its state will be saved so it need not be recreated from scratch)
cache         Load the RDD into memory (what doesn’t fit will be calculated as needed)
countByValue  Return the count of each unique value in this RDD as a map of (value, count) pairs
distinct      Return a new RDD containing the distinct elements in this RDD
persist       Store the RDD in memory, on disk, or both, according to the passed-in storage level
sample        Return a sampled subset of this RDD
unpersist     Clear any record of the RDD from disk/memory
17.
Things You Can Do With PairRDDs
• PairRDDs are RDDs containing key-value pairs:

Name           Description
join           Return an RDD containing all pairs of elements with matching keys in this and other
groupByKey     Group the values for each key in the RDD into a single sequence
keys           Return an RDD containing the keys from each tuple stored in this RDD
countByKey     Return the number of elements for each key as a Map
lookup         Return the list of values stored in this RDD for the passed-in key
leftOuterJoin  Perform a left outer join with the passed-in RDD
values         Return an RDD with the values of each tuple
subtractByKey  Return an RDD with the pairs from this whose keys are not in other
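The semantics of a few of these key-value operations are easier to see on small data. The sketch below reimplements them over plain Python lists of tuples, so it illustrates what the results look like rather than how Spark computes them; the snake_case function names and the `users` / `orders` data are hypothetical.

```python
from collections import Counter
from typing import Any, Dict, List, Tuple

Pair = Tuple[str, Any]


def join(left: List[Pair], right: List[Pair]) -> List[Pair]:
    """Inner join: one output pair for every matching (left, right) value pair."""
    return [(k, (v, w)) for k, v in left for k2, w in right if k == k2]


def left_outer_join(left: List[Pair], right: List[Pair]) -> List[Pair]:
    """Keep every left pair; the right side is None when the key has no match."""
    out: List[Pair] = []
    for k, v in left:
        matches = [w for k2, w in right if k2 == k]
        if matches:
            out.extend((k, (v, w)) for w in matches)
        else:
            out.append((k, (v, None)))
    return out


def subtract_by_key(left: List[Pair], right: List[Pair]) -> List[Pair]:
    """Keep the left pairs whose key does not appear on the right."""
    right_keys = {k for k, _ in right}
    return [(k, v) for k, v in left if k not in right_keys]


def count_by_key(pairs: List[Pair]) -> Dict[str, int]:
    """Count how many pairs share each key."""
    return dict(Counter(k for k, _ in pairs))


users = [("u1", "Ann"), ("u2", "Bob"), ("u3", "Cy")]
orders = [("u1", 10), ("u1", 25), ("u3", 7)]

joined = join(users, orders)
outer = left_outer_join(users, orders)
no_orders = subtract_by_key(users, orders)
order_counts = count_by_key(orders)
```

The difference between the two joins shows up for `u2`: the inner join drops that key, while the left outer join keeps it with `None` on the right side.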
18.
Spark MLlib – Algorithms Offered
• Classification: logistic regression, linear SVM, naïve Bayes, least squares, classification tree
• Regression: generalized linear models (GLMs), regression tree
• Collaborative filtering: alternating least squares (ALS), non-negative matrix factorization (NMF)
• Clustering: k-means
• Decomposition: SVD, PCA
• Optimization: stochastic gradient descent, L-BFGS