Apache Spark presentation showing how Spark works internally and how it handles distributed data.
A comparison with Apache Hadoop is made in order to show the advantages that Apache Spark offers.
3. Hadoop issues
- Difficult to install and maintain
- Slow due to replication and disk-based storage
- Requires integrating different tools (machine learning, stream processing)
- "Spending more time learning the data-processing tool than processing data"
7. Which one should I choose?
Standalone - simple setups / local testing / REPL
YARN / Mesos - run Spark alongside other applications / use the richer
resource-scheduling capabilities
YARN - ResourceManager / NodeManager
Mesos - Mesos master / Mesos agent
YARN will likely be preinstalled in many Hadoop distributions.
In all cases - it is best to run Spark on the same nodes as HDFS for fast access to
storage. You can install Mesos or the standalone cluster manager on the same nodes manually, or most
Hadoop distributions already install YARN and HDFS together.
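The choice above maps directly to the --master URL passed to spark-submit. A sketch, with host names, ports, and the application jar as placeholders:

```shell
# Standalone cluster manager
spark-submit --master spark://master-host:7077 app.jar
# YARN (cluster location is read from the Hadoop configuration)
spark-submit --master yarn app.jar
# Mesos
spark-submit --master mesos://mesos-master:5050 app.jar
# Local "cluster" with 4 threads, for testing / REPL work
spark-submit --master "local[4]" app.jar
```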
8. RDD - Resilient Distributed Dataset
[Figure: a cluster holding a collection of log lines (Error / warn / info messages with timestamps: msg1, msg2, msg3, msg5, msg8, msg9) split across an RDD of 4 partitions, distributed over 3 workers]
RDD / 4 partitions (use 2-4 partitions per CPU in your cluster)
Worker | Worker | Worker
9. RDD - Resilient Distributed Dataset
Parallelized Collections
Created by calling JavaSparkContext’s parallelize method on an existing
collection; the resulting distributed dataset (distData) can be operated on in parallel
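A plain-Python sketch (not the Spark API itself) of what parallelize conceptually does: split a driver-side collection into partitions that can then be processed independently. Spark's real slicing strategy differs (contiguous ranges, lazy distribution); round-robin is used here only for simplicity.

```python
# Sketch of parallelize(): split a driver-side collection
# into num_slices partitions (round-robin, for illustration).
def parallelize(data, num_slices):
    partitions = [[] for _ in range(num_slices)]
    for i, item in enumerate(data):
        partitions[i % num_slices].append(item)  # assign item to a partition
    return partitions

dist_data = parallelize(list(range(10)), 4)

# Each partition can now be operated on independently, e.g. summed "in parallel":
partial_sums = [sum(p) for p in dist_data]
total = sum(partial_sums)
```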
18. Lifecycle of a Spark program
1) Create some input RDD from external data
2) Lazily transform them (filter(), map())
3) Ask Spark to cache() any RDDs that need to be reused
4) Launch actions (count(), reduce()) to kick off parallel computation
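The four steps above can be sketched in plain Python using lazy generators, which mimic Spark's behavior: transformations only build a pipeline, and nothing runs until an action asks for a result. This is a concept sketch, not Spark code.

```python
# 1) Create some input "RDD" from external data
lines = ["Error ts msg1", "info ts msg2", "Error ts msg3"]

# 2) Lazily transform it: generators, like Spark transformations,
#    compute nothing until a result is demanded.
errors = (l for l in lines if l.startswith("Error"))   # like filter()
messages = (l.split()[-1] for l in errors)             # like map()

# 3) "cache" the result for reuse (generators are one-shot, like
#    recomputing an uncached RDD would be expensive)
cached = list(messages)

# 4) Actions kick off the actual computation
count = len(cached)  # like count()
```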
22. Spark SQL
DataFrames can be created from different data sources such as:
- Existing RDDs
- Structured data files
- JSON datasets
- Hive tables
- External databases
Entry points: SQLContext / HiveContext (supports HiveQL)
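A plain-Python sketch of the JSON-dataset case: each line is a JSON record, rows are built from it, and a SQL-style query becomes a filter plus a projection. (In Spark itself this would be SQLContext reading the JSON and running the SQL; the data here is invented for illustration.)

```python
import json

# A tiny "JSON dataset": one JSON record per line
json_lines = [
    '{"name": "Ana", "age": 34}',
    '{"name": "Bob", "age": 19}',
]
rows = [json.loads(l) for l in json_lines]  # schema is implicit in each record

# The query "SELECT name FROM people WHERE age > 21" becomes:
names = [r["name"] for r in rows if r["age"] > 21]
```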
23. Spark streaming
Streaming data: user activity on websites, monitoring data, server logs, and other
event data
Incoming data is split into micro-batches at a pre-defined interval
(N seconds), and each batch is treated as an RDD
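The micro-batch model can be sketched in plain Python: events arriving over time are grouped into fixed N-second intervals, and each group is then processed like an ordinary batch. The event data below is invented for illustration.

```python
# Group timestamped events into micro-batches of BATCH_INTERVAL seconds,
# mimicking how Spark Streaming turns a stream into a sequence of RDDs.
BATCH_INTERVAL = 2  # seconds

events = [  # (timestamp_in_seconds, message)
    (0.5, "login"), (1.2, "click"), (2.1, "click"), (3.9, "logout"),
]

batches = {}
for ts, msg in events:
    batch_id = int(ts // BATCH_INTERVAL)  # which interval this event falls in
    batches.setdefault(batch_id, []).append(msg)

# Each value in `batches` is one micro-batch, processed independently:
counts = {b: len(msgs) for b, msgs in batches.items()}
```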
24. Other Spark libraries
- MLlib (machine learning)
- Spark Streaming (Streaming)
- GraphX (distributed graph processing)
- Third party projects
(https://cwiki.apache.org/confluence/display/SPARK/Third+Party+Projects)
26. Security
Authentication via a shared secret
- YARN: set spark.authenticate to true / generation and distribution of the
shared secret are handled automatically
- Others: set spark.authenticate.secret on every node
Web UI - Java servlet filters (spark.ui.filters)
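These settings live in spark-defaults.conf. A sketch for a non-YARN deployment; the secret value and the filter class name are placeholders, not real values:

```properties
# Enable shared-secret authentication between Spark processes
spark.authenticate          true
# On non-YARN deployments the same secret must be set on every node
spark.authenticate.secret   <your-shared-secret>
# Protect the Web UI with a servlet filter (placeholder class name)
spark.ui.filters            com.example.MyAuthFilter
```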