SlideShare uma empresa Scribd logo
1 de 22
www.edureka.co/r-for-analytics
www.edureka.co/apache-spark-scala-training
5 Things One Must know about Spark!
Slide 2Slide 2Slide 2 www.edureka.co/apache-spark-scala-training
Agenda
At the end of this webinar you will be able to know about:
 #1 : Low Latency
 #2 : Streaming Support
 #3 : Machine Learning and Graph
 #4 : Data Frame API introduction
 #5 : Spark integration with hadoop
Slide 3Slide 3Slide 3 www.edureka.co/apache-spark-scala-training
Spark Architecure
Machine Learning
Library
Graph
programming
Spark interface
For RDBMS lovers
Utility for
continues
ingestion of data
Slide 4Slide 4Slide 4 www.edureka.co/apache-spark-scala-training
Low Latency
Slide 5Slide 5Slide 5 www.edureka.co/apache-spark-scala-training
Spark tries to keep things in-memory of its distributed workers, allowing for significantly faster/lower-latency
computations, whereas MapReduce keeps shuffling things in and out of disk.
Sparks Cuts Down Read/Write I/O To Disk
Spark is good for data that fits in memory and off memory
Slide 6Slide 6Slide 6 www.edureka.co/apache-spark-scala-training
The previous world record was 72 minutes, set by Yahoo using a Hadoop MapReduce cluster of 2100 nodes
Using Spark on 206 EC2 nodes, spark completed the benchmark in 23 minutes.
Spark sorted the same data 3X faster using 10X fewer machines
All the sorting took place on disk (HDFS), without using Spark’s in-memory cache.
How Fast A System Can Sort 100 TB Of Data
Slide 7Slide 7Slide 7 www.edureka.co/apache-spark-scala-training
2014, 4.27 TB/min
100 TB in 1,406 seconds
207 Amazon EC2 i2.8xlarge nodes x
(32 vCores - 2.5Ghz Intel Xeon E5-2670
v2, 244GB memory, 8x800 GB SSD)
Reynold Xin, Parviz Deyhim, Xiangrui
Meng,
Ali Ghodsi, Matei Zaharia
Courtesy : sortbenchmark.org/
Sparks Benchmark
Slide 8Slide 8Slide 8 www.edureka.co/apache-spark-scala-training
Streaming Support
Slide 9Slide 9Slide 9 www.edureka.co/apache-spark-scala-training
Used for processing the real-time streaming data.
It uses the DStream : a series of RDDs, to process the real-time data
support streaming analytics reasonably well.
The Spark Streaming API closely matches that of the Spark Core
Event processing
Slide 10Slide 10Slide 10 www.edureka.co/apache-spark-scala-training
Machine Learning and graph
implementation with DAG
Slide 11Slide 11Slide 11 www.edureka.co/apache-spark-scala-training
MLlib,a machine
learning library
classification regression clustering collaborative filtering and so on
Some of these algorithms also work with streaming data, such as linear regression using ordinary least squares or k-means clustering
Machine Learning
Slide 12Slide 12Slide 12 www.edureka.co/apache-spark-scala-training
Cyclic Data Flows
• All jobs in spark comprise a series of operators and run on a set of data.
• All the operators in a job are used to construct a DAG (Directed Acyclic
Graph).
• The DAG is optimized by rearranging and combining operators where
possible.
Slide 13Slide 13Slide 13 www.edureka.co/apache-spark-scala-training
Component for graphs and graph-parallel computation
Extends the Spark RDD by introducing a new Graph abstraction
Graph Algorithms
PageRank Connected Components Triangle Counting
GraphX
Slide 14Slide 14Slide 14 www.edureka.co/apache-spark-scala-training
Support for Data Frames
Slide 15Slide 15Slide 15 www.edureka.co/apache-spark-scala-training
As spark continues to grow, it wants to enable wider audiences beyond “big data” engineers to leverage the
power of distributed processing.
Inspired by data frames in r and python (pandas)
Dataframes API is designed to make big data processing on tabular data easier
Dataframe is a distributed collection of data organized into named columns.
Provides operations to filter, group, or compute aggregates, and can be used with spark sql.
Can be constructed from structured data files, existing rdds, tables in hive, or external databases.
DataFrame
Slide 16Slide 16Slide 16 www.edureka.co/apache-spark-scala-training
Ability to scale from KBs to PBs
Support for a wide array of data formats and storage systems
State-of-the-art optimization and code generation through the spark SQL catalyst optimizer
Seamless integration with all big data tooling and infrastructure via spark
Apis for python, java, scala, and R (in development via sparkr)
DataFrame features
Slide 17Slide 17Slide 17 www.edureka.co/apache-spark-scala-training
Spark can use HDFS
Spark can use YARN
Slide 18Slide 18Slide 18 www.edureka.co/apache-spark-scala-training
Spark can leverage the resource negotiator of Hadoop framework i.e. YARN
Spark workloads can make use of Symphony scheduling policies and execute via YARN
Spark execution modes
Standalone Mesos HDFS
Spark Execution Platforms
Slide 19Slide 19Slide 19 www.edureka.co/apache-spark-scala-training
Spark Features/Modules In Demand
Source: Typesafe
Slide 20Slide 20Slide 20 www.edureka.co/apache-spark-scala-training
New Features In 2015
Data Frames 
• Similar API to data frames in R and Pandas
• Automatically optimised via Spark SQL
• Released in Spark 1.3
SparkR 
• Released in Spark 1.4
• Exposes DataFrames, RDD’s & ML library in R
Machine Learning Pipelines 
• High Level API
• Featurization
• Evaluation
• Model Tuning
External Data Sources 
• Platform API to plug Data-Sources into Spark
• Pushes logic into sources
Source: Databrix
Slide 21Slide 21Slide 21 www.edureka.co/apache-spark-scala-training
Spark overview
Questions
Slide 22

Mais conteúdo relacionado

Mais procurados

Big data Processing with Apache Spark & Scala
Big data Processing with Apache Spark & ScalaBig data Processing with Apache Spark & Scala
Big data Processing with Apache Spark & Scala
Edureka!
 

Mais procurados (20)

Hadoop : The Pile of Big Data
Hadoop : The Pile of Big DataHadoop : The Pile of Big Data
Hadoop : The Pile of Big Data
 
Apache spark
Apache spark Apache spark
Apache spark
 
5 things one must know about spark!
5 things one must know about spark!5 things one must know about spark!
5 things one must know about spark!
 
Apache Spark beyond Hadoop MapReduce
Apache Spark beyond Hadoop MapReduceApache Spark beyond Hadoop MapReduce
Apache Spark beyond Hadoop MapReduce
 
Hadoop for Data Warehousing professionals
Hadoop for Data Warehousing professionalsHadoop for Data Warehousing professionals
Hadoop for Data Warehousing professionals
 
Big data Processing with Apache Spark & Scala
Big data Processing with Apache Spark & ScalaBig data Processing with Apache Spark & Scala
Big data Processing with Apache Spark & Scala
 
5 reasons why spark is in demand!
5 reasons why spark is in demand!5 reasons why spark is in demand!
5 reasons why spark is in demand!
 
Performance of Spark vs MapReduce
Performance of Spark vs MapReducePerformance of Spark vs MapReduce
Performance of Spark vs MapReduce
 
Intro to Apache Spark by CTO of Twingo
Intro to Apache Spark by CTO of TwingoIntro to Apache Spark by CTO of Twingo
Intro to Apache Spark by CTO of Twingo
 
Spark for big data analytics
Spark for big data analyticsSpark for big data analytics
Spark for big data analytics
 
5 Reasons why Spark is in demand!
5 Reasons why Spark is in demand!5 Reasons why Spark is in demand!
5 Reasons why Spark is in demand!
 
Why Talend for Big Data?
Why Talend for Big Data?Why Talend for Big Data?
Why Talend for Big Data?
 
Hadoop Innovation Summit 2014
Hadoop Innovation Summit 2014Hadoop Innovation Summit 2014
Hadoop Innovation Summit 2014
 
What No One Tells You About Writing a Streaming App: Spark Summit East talk b...
What No One Tells You About Writing a Streaming App: Spark Summit East talk b...What No One Tells You About Writing a Streaming App: Spark Summit East talk b...
What No One Tells You About Writing a Streaming App: Spark Summit East talk b...
 
Agile data warehousing
Agile data warehousingAgile data warehousing
Agile data warehousing
 
SQL-on-Hadoop Tutorial
SQL-on-Hadoop TutorialSQL-on-Hadoop Tutorial
SQL-on-Hadoop Tutorial
 
High Performance Python on Apache Spark
High Performance Python on Apache SparkHigh Performance Python on Apache Spark
High Performance Python on Apache Spark
 
Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache Spark
 
DoneDeal - AWS Data Analytics Platform
DoneDeal - AWS Data Analytics PlatformDoneDeal - AWS Data Analytics Platform
DoneDeal - AWS Data Analytics Platform
 
What is Apache Spark | Apache Spark Tutorial For Beginners | Apache Spark Tra...
What is Apache Spark | Apache Spark Tutorial For Beginners | Apache Spark Tra...What is Apache Spark | Apache Spark Tutorial For Beginners | Apache Spark Tra...
What is Apache Spark | Apache Spark Tutorial For Beginners | Apache Spark Tra...
 

Destaque

Apache Spark & Scala
Apache Spark & ScalaApache Spark & Scala
Apache Spark & Scala
Edureka!
 
Data collection and instrumentation reported
Data collection and instrumentation reportedData collection and instrumentation reported
Data collection and instrumentation reported
Elaine Balen-Climacosa
 
Web Uygulama Güvenliği 101
Web Uygulama Güvenliği 101Web Uygulama Güvenliği 101
Web Uygulama Güvenliği 101
Mehmet Ince
 
Serialization and performance by Sergey Morenets
Serialization and performance by Sergey MorenetsSerialization and performance by Sergey Morenets
Serialization and performance by Sergey Morenets
Alex Tumanoff
 
Web Uygulamalarında Kaynak Kod Analizi - 1
Web Uygulamalarında Kaynak Kod Analizi - 1Web Uygulamalarında Kaynak Kod Analizi - 1
Web Uygulamalarında Kaynak Kod Analizi - 1
Mehmet Ince
 

Destaque (20)

Apache Spark & Scala
Apache Spark & ScalaApache Spark & Scala
Apache Spark & Scala
 
Design Patterns : Solution to Software Design Problems
Design Patterns : Solution to Software Design ProblemsDesign Patterns : Solution to Software Design Problems
Design Patterns : Solution to Software Design Problems
 
Some Wicked Problems of Software Design.
Some Wicked Problems of Software Design.Some Wicked Problems of Software Design.
Some Wicked Problems of Software Design.
 
Spark SQL with Scala Code Examples
Spark SQL with Scala Code ExamplesSpark SQL with Scala Code Examples
Spark SQL with Scala Code Examples
 
Python for Penetration testers
Python for Penetration testersPython for Penetration testers
Python for Penetration testers
 
Data collection and instrumentation reported
Data collection and instrumentation reportedData collection and instrumentation reported
Data collection and instrumentation reported
 
STACK OVERFLOW DATASET ANALYSIS
STACK OVERFLOW DATASET ANALYSISSTACK OVERFLOW DATASET ANALYSIS
STACK OVERFLOW DATASET ANALYSIS
 
2016 spark survey
2016 spark survey2016 spark survey
2016 spark survey
 
Web Uygulama Güvenliği 101
Web Uygulama Güvenliği 101Web Uygulama Güvenliği 101
Web Uygulama Güvenliği 101
 
Web Uygulamalarında Kayank Kod Analizi – II
Web Uygulamalarında Kayank Kod Analizi – IIWeb Uygulamalarında Kayank Kod Analizi – II
Web Uygulamalarında Kayank Kod Analizi – II
 
Serialization and performance by Sergey Morenets
Serialization and performance by Sergey MorenetsSerialization and performance by Sergey Morenets
Serialization and performance by Sergey Morenets
 
Web Uygulamalarında Kaynak Kod Analizi - 1
Web Uygulamalarında Kaynak Kod Analizi - 1Web Uygulamalarında Kaynak Kod Analizi - 1
Web Uygulamalarında Kaynak Kod Analizi - 1
 
Frustration-Reduced PySpark: Data engineering with DataFrames
Frustration-Reduced PySpark: Data engineering with DataFramesFrustration-Reduced PySpark: Data engineering with DataFrames
Frustration-Reduced PySpark: Data engineering with DataFrames
 
An introduction To Apache Spark
An introduction To Apache SparkAn introduction To Apache Spark
An introduction To Apache Spark
 
DevOps is Going to Replace SDLC! Learn Why?
DevOps is Going to Replace SDLC! Learn Why?DevOps is Going to Replace SDLC! Learn Why?
DevOps is Going to Replace SDLC! Learn Why?
 
Python for Big Data Analytics
Python for Big Data AnalyticsPython for Big Data Analytics
Python for Big Data Analytics
 
PySpark Cassandra - Amsterdam Spark Meetup
PySpark Cassandra - Amsterdam Spark MeetupPySpark Cassandra - Amsterdam Spark Meetup
PySpark Cassandra - Amsterdam Spark Meetup
 
Distributed ML in Apache Spark
Distributed ML in Apache SparkDistributed ML in Apache Spark
Distributed ML in Apache Spark
 
PySpark Best Practices
PySpark Best PracticesPySpark Best Practices
PySpark Best Practices
 
Interview questions on Apache spark [part 2]
Interview questions on Apache spark [part 2]Interview questions on Apache spark [part 2]
Interview questions on Apache spark [part 2]
 

Semelhante a 5 things one must know about spark!

What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...
What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...
What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...
Simplilearn
 

Semelhante a 5 things one must know about spark! (20)

Module01
 Module01 Module01
Module01
 
Apache Spark Tutorial | Spark Tutorial for Beginners | Apache Spark Training ...
Apache Spark Tutorial | Spark Tutorial for Beginners | Apache Spark Training ...Apache Spark Tutorial | Spark Tutorial for Beginners | Apache Spark Training ...
Apache Spark Tutorial | Spark Tutorial for Beginners | Apache Spark Training ...
 
Spark Hadoop Tutorial | Spark Hadoop Example on NBA | Apache Spark Training |...
Spark Hadoop Tutorial | Spark Hadoop Example on NBA | Apache Spark Training |...Spark Hadoop Tutorial | Spark Hadoop Example on NBA | Apache Spark Training |...
Spark Hadoop Tutorial | Spark Hadoop Example on NBA | Apache Spark Training |...
 
Apache Spark Introduction.pdf
Apache Spark Introduction.pdfApache Spark Introduction.pdf
Apache Spark Introduction.pdf
 
Apache spark
Apache sparkApache spark
Apache spark
 
Apache Spark PDF
Apache Spark PDFApache Spark PDF
Apache Spark PDF
 
What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...
What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...
What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...
 
Using pySpark with Google Colab & Spark 3.0 preview
Using pySpark with Google Colab & Spark 3.0 previewUsing pySpark with Google Colab & Spark 3.0 preview
Using pySpark with Google Colab & Spark 3.0 preview
 
spark interview questions & answers acadgild blogs
 spark interview questions & answers acadgild blogs spark interview questions & answers acadgild blogs
spark interview questions & answers acadgild blogs
 
Data Engineer's Lunch #82: Automating Apache Cassandra Operations with Apache...
Data Engineer's Lunch #82: Automating Apache Cassandra Operations with Apache...Data Engineer's Lunch #82: Automating Apache Cassandra Operations with Apache...
Data Engineer's Lunch #82: Automating Apache Cassandra Operations with Apache...
 
Processing Large Data with Apache Spark -- HasGeek
Processing Large Data with Apache Spark -- HasGeekProcessing Large Data with Apache Spark -- HasGeek
Processing Large Data with Apache Spark -- HasGeek
 
Apache Spark Overview
Apache Spark OverviewApache Spark Overview
Apache Spark Overview
 
Started with-apache-spark
Started with-apache-sparkStarted with-apache-spark
Started with-apache-spark
 
Introduction to spark
Introduction to sparkIntroduction to spark
Introduction to spark
 
Machine Learning with SparkR
Machine Learning with SparkRMachine Learning with SparkR
Machine Learning with SparkR
 
Spark architecture
Spark architectureSpark architecture
Spark architecture
 
Spark Summit EU 2015: Lessons from 300+ production users
Spark Summit EU 2015: Lessons from 300+ production usersSpark Summit EU 2015: Lessons from 300+ production users
Spark Summit EU 2015: Lessons from 300+ production users
 
Spark from the Surface
Spark from the SurfaceSpark from the Surface
Spark from the Surface
 
Spark Interview Questions and Answers | Apache Spark Interview Questions | Sp...
Spark Interview Questions and Answers | Apache Spark Interview Questions | Sp...Spark Interview Questions and Answers | Apache Spark Interview Questions | Sp...
Spark Interview Questions and Answers | Apache Spark Interview Questions | Sp...
 
Apache spark with java 8
Apache spark with java 8Apache spark with java 8
Apache spark with java 8
 

Mais de Edureka!

Mais de Edureka! (20)

What to learn during the 21 days Lockdown | Edureka
What to learn during the 21 days Lockdown | EdurekaWhat to learn during the 21 days Lockdown | Edureka
What to learn during the 21 days Lockdown | Edureka
 
Top 10 Dying Programming Languages in 2020 | Edureka
Top 10 Dying Programming Languages in 2020 | EdurekaTop 10 Dying Programming Languages in 2020 | Edureka
Top 10 Dying Programming Languages in 2020 | Edureka
 
Top 5 Trending Business Intelligence Tools | Edureka
Top 5 Trending Business Intelligence Tools | EdurekaTop 5 Trending Business Intelligence Tools | Edureka
Top 5 Trending Business Intelligence Tools | Edureka
 
Tableau Tutorial for Data Science | Edureka
Tableau Tutorial for Data Science | EdurekaTableau Tutorial for Data Science | Edureka
Tableau Tutorial for Data Science | Edureka
 
Python Programming Tutorial | Edureka
Python Programming Tutorial | EdurekaPython Programming Tutorial | Edureka
Python Programming Tutorial | Edureka
 
Top 5 PMP Certifications | Edureka
Top 5 PMP Certifications | EdurekaTop 5 PMP Certifications | Edureka
Top 5 PMP Certifications | Edureka
 
Top Maven Interview Questions in 2020 | Edureka
Top Maven Interview Questions in 2020 | EdurekaTop Maven Interview Questions in 2020 | Edureka
Top Maven Interview Questions in 2020 | Edureka
 
Linux Mint Tutorial | Edureka
Linux Mint Tutorial | EdurekaLinux Mint Tutorial | Edureka
Linux Mint Tutorial | Edureka
 
How to Deploy Java Web App in AWS| Edureka
How to Deploy Java Web App in AWS| EdurekaHow to Deploy Java Web App in AWS| Edureka
How to Deploy Java Web App in AWS| Edureka
 
Importance of Digital Marketing | Edureka
Importance of Digital Marketing | EdurekaImportance of Digital Marketing | Edureka
Importance of Digital Marketing | Edureka
 
RPA in 2020 | Edureka
RPA in 2020 | EdurekaRPA in 2020 | Edureka
RPA in 2020 | Edureka
 
Email Notifications in Jenkins | Edureka
Email Notifications in Jenkins | EdurekaEmail Notifications in Jenkins | Edureka
Email Notifications in Jenkins | Edureka
 
EA Algorithm in Machine Learning | Edureka
EA Algorithm in Machine Learning | EdurekaEA Algorithm in Machine Learning | Edureka
EA Algorithm in Machine Learning | Edureka
 
Cognitive AI Tutorial | Edureka
Cognitive AI Tutorial | EdurekaCognitive AI Tutorial | Edureka
Cognitive AI Tutorial | Edureka
 
AWS Cloud Practitioner Tutorial | Edureka
AWS Cloud Practitioner Tutorial | EdurekaAWS Cloud Practitioner Tutorial | Edureka
AWS Cloud Practitioner Tutorial | Edureka
 
Blue Prism Top Interview Questions | Edureka
Blue Prism Top Interview Questions | EdurekaBlue Prism Top Interview Questions | Edureka
Blue Prism Top Interview Questions | Edureka
 
Big Data on AWS Tutorial | Edureka
Big Data on AWS Tutorial | Edureka Big Data on AWS Tutorial | Edureka
Big Data on AWS Tutorial | Edureka
 
A star algorithm | A* Algorithm in Artificial Intelligence | Edureka
A star algorithm | A* Algorithm in Artificial Intelligence | EdurekaA star algorithm | A* Algorithm in Artificial Intelligence | Edureka
A star algorithm | A* Algorithm in Artificial Intelligence | Edureka
 
Kubernetes Installation on Ubuntu | Edureka
Kubernetes Installation on Ubuntu | EdurekaKubernetes Installation on Ubuntu | Edureka
Kubernetes Installation on Ubuntu | Edureka
 
Introduction to DevOps | Edureka
Introduction to DevOps | EdurekaIntroduction to DevOps | Edureka
Introduction to DevOps | Edureka
 

Último

+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
?#DUbAI#??##{{(☎️+971_581248768%)**%*]'#abortion pills for sale in dubai@
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
Joaquim Jorge
 

Último (20)

Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CV
 
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingRepurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a Fresher
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
Artificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyArtificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : Uncertainty
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century education
 
Manulife - Insurer Innovation Award 2024
Manulife - Insurer Innovation Award 2024Manulife - Insurer Innovation Award 2024
Manulife - Insurer Innovation Award 2024
 
Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...
Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...
Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...
 
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
 
Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of Terraform
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
 
HTML Injection Attacks: Impact and Mitigation Strategies
HTML Injection Attacks: Impact and Mitigation StrategiesHTML Injection Attacks: Impact and Mitigation Strategies
HTML Injection Attacks: Impact and Mitigation Strategies
 

5 things one must know about spark!

  • 2. Slide 2Slide 2Slide 2 www.edureka.co/apache-spark-scala-training Agenda At the end of this webinar you will be able to know about:  #1 : Low Latency  #2 : Streaming Support  #3 : Machine Learning and Graph  #4 : Data Frame API introduction  #5 : Spark integration with hadoop
  • 3. Slide 3Slide 3Slide 3 www.edureka.co/apache-spark-scala-training Spark Architecure Machine Learning Library Graph programming Spark interface For RDBMS lovers Utility for continues ingestion of data
  • 4. Slide 4Slide 4Slide 4 www.edureka.co/apache-spark-scala-training Low Latency
  • 5. Slide 5Slide 5Slide 5 www.edureka.co/apache-spark-scala-training Spark tries to keep things in-memory of its distributed workers, allowing for significantly faster/lower-latency computations, whereas MapReduce keeps shuffling things in and out of disk. Sparks Cuts Down Read/Write I/O To Disk Spark is good for data that fits in memory and off memory
  • 6. Slide 6Slide 6Slide 6 www.edureka.co/apache-spark-scala-training The previous world record was 72 minutes, set by Yahoo using a Hadoop MapReduce cluster of 2100 nodes Using Spark on 206 EC2 nodes, spark completed the benchmark in 23 minutes. Spark sorted the same data 3X faster using 10X fewer machines All the sorting took place on disk (HDFS), without using Spark’s in-memory cache. How Fast A System Can Sort 100 TB Of Data
  • 7. Slide 7Slide 7Slide 7 www.edureka.co/apache-spark-scala-training 2014, 4.27 TB/min 100 TB in 1,406 seconds 207 Amazon EC2 i2.8xlarge nodes x (32 vCores - 2.5Ghz Intel Xeon E5-2670 v2, 244GB memory, 8x800 GB SSD) Reynold Xin, Parviz Deyhim, Xiangrui Meng, Ali Ghodsi, Matei Zaharia Courtesy : sortbenchmark.org/ Sparks Benchmark
  • 8. Slide 8Slide 8Slide 8 www.edureka.co/apache-spark-scala-training Streaming Support
  • 9. Slide 9Slide 9Slide 9 www.edureka.co/apache-spark-scala-training Used for processing the real-time streaming data. It uses the DStream : a series of RDDs, to process the real-time data support streaming analytics reasonably well. The Spark Streaming API closely matches that of the Spark Core Event processing
  • 10. Slide 10Slide 10Slide 10 www.edureka.co/apache-spark-scala-training Machine Learning and graph implementation with DAG
  • 11. Slide 11Slide 11Slide 11 www.edureka.co/apache-spark-scala-training MLlib,a machine learning library classification regression clustering collaborative filtering and so on Some of these algorithms also work with streaming data, such as linear regression using ordinary least squares or k-means clustering Machine Learning
  • 12. Slide 12Slide 12Slide 12 www.edureka.co/apache-spark-scala-training Cyclic Data Flows • All jobs in spark comprise a series of operators and run on a set of data. • All the operators in a job are used to construct a DAG (Directed Acyclic Graph). • The DAG is optimized by rearranging and combining operators where possible.
  • 13. Slide 13Slide 13Slide 13 www.edureka.co/apache-spark-scala-training Component for graphs and graph-parallel computation Extends the Spark RDD by introducing a new Graph abstraction Graph Algorithms PageRank Connected Components Triangle Counting GraphX
  • 14. Slide 14Slide 14Slide 14 www.edureka.co/apache-spark-scala-training Support for Data Frames
  • 15. Slide 15Slide 15Slide 15 www.edureka.co/apache-spark-scala-training As spark continues to grow, it wants to enable wider audiences beyond “big data” engineers to leverage the power of distributed processing. Inspired by data frames in r and python (pandas) Dataframes API is designed to make big data processing on tabular data easier Dataframe is a distributed collection of data organized into named columns. Provides operations to filter, group, or compute aggregates, and can be used with spark sql. Can be constructed from structured data files, existing rdds, tables in hive, or external databases. DataFrame
  • 16. Slide 16Slide 16Slide 16 www.edureka.co/apache-spark-scala-training Ability to scale from KBs to PBs Support for a wide array of data formats and storage systems State-of-the-art optimization and code generation through the spark SQL catalyst optimizer Seamless integration with all big data tooling and infrastructure via spark Apis for python, java, scala, and R (in development via sparkr) DataFrame features
  • 17. Slide 17Slide 17Slide 17 www.edureka.co/apache-spark-scala-training Spark can use HDFS Spark can use YARN
  • 18. Slide 18Slide 18Slide 18 www.edureka.co/apache-spark-scala-training Spark can leverage the resource negotiator of Hadoop framework i.e. YARN Spark workloads can make use of Symphony scheduling policies and execute via YARN Spark execution modes Standalone Mesos HDFS Spark Execution Platforms
  • 19. Slide 19Slide 19Slide 19 www.edureka.co/apache-spark-scala-training Spark Features/Modules In Demand Source: Typesafe
  • 20. Slide 20Slide 20Slide 20 www.edureka.co/apache-spark-scala-training New Features In 2015 Data Frames  • Similar API to data frames in R and Pandas • Automatically optimised via Spark SQL • Released in Spark 1.3 SparkR  • Released in Spark 1.4 • Exposes DataFrames, RDD’s & ML library in R Machine Learning Pipelines  • High Level API • Featurization • Evaluation • Model Tuning External Data Sources  • Platform API to plug Data-Sources into Spark • Pushes logic into sources Source: Databrix
  • 21. Slide 21Slide 21Slide 21 www.edureka.co/apache-spark-scala-training Spark overview

Notas do Editor

  1. You can show hands on with data frames like Load data into Spark DataFrames Explore data with Spark SQL Here is a reference for hands on : https://www.mapr.com/blog/using-apache-spark-dataframes-processing-tabular-data#.VdxJofmqqko
  2. http://www.information-management.com/gallery/Big-Data-Hadoop-2015-Predictions-Forrester-10026357-1.html https://www.forrester.com/Predictions+2015+Hadoop+Will+Become+A+Cornerstone+Of+Your+Business+Technology+Agenda/fulltext/-/E-RES117705