Spark is a fast and general processing engine compatible with Hadoop data. It can run in Hadoop clusters through YARN or Spark's standalone mode, and it can process data in HDFS, HBase, Cassandra, Hive, and any Hadoop InputFormat. It is designed to perform both batch processing (similar to MapReduce) and new workloads like streaming, interactive queries, and machine learning.
What’s Spark
• Fast
• General (purpose) engine
• For large-scale data processing
• In memory processing
• Built at the AMPLab, University of California,
Berkeley as a research project
• Now a top-level Apache project
Why Spark
• Speed
• Ease of use
• Generality
• Runs everywhere
(Hadoop, Mesos, standalone, or in the cloud)
• Fault Tolerance
• Integration
• Deployment
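The "ease of use" and "in-memory processing" points come from Spark's core abstraction, the RDD, which keeps data in memory and exposes chained transformations (`map`, `filter`) and actions (`reduce`). Real Spark code would use `pyspark`; the following is only a plain-Python sketch of that dataflow style, with `MiniRDD` as a hypothetical stand-in for an RDD:

```python
from functools import reduce

class MiniRDD:
    """Toy, single-machine stand-in for Spark's RDD: holds its data in
    memory and supports chained transformations and a reduce action.
    Real Spark partitions the data and runs these steps across a cluster."""

    def __init__(self, data):
        self.data = list(data)

    def map(self, fn):
        # Transformation: apply fn to every element, yielding a new dataset
        return MiniRDD(fn(x) for x in self.data)

    def filter(self, pred):
        # Transformation: keep only elements satisfying the predicate
        return MiniRDD(x for x in self.data if pred(x))

    def reduce(self, fn):
        # Action: fold the dataset down to a single value
        return reduce(fn, self.data)

# Count the words in a small in-memory "dataset" of lines
lines = ["spark is fast", "spark is general"]
word_count = (MiniRDD(lines)
              .map(lambda line: len(line.split()))
              .reduce(lambda a, b: a + b))
print(word_count)  # 6
```

The same pipeline written against actual Spark would differ mainly in setup (creating a `SparkContext`), not in the shape of the code, which is a large part of Spark's ease-of-use claim.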