Discover Spark and Spark Streaming

Maturin BADO

Spark RDD Spark Streaming

TechLabs by
Discovering Machine Learning, Redis, and Spark

Maturin BADO
@mccstanmbg
github.com/mccstan
SPARK

Outline
❏ Data processing today
❏ Spark, Hadoop, MapReduce
❏ Spark ecosystem
❏ Spark basics

Data processing today
Data intensive application
Definition :
“We call an application data-intensive if data is its primary challenge—the
quantity of data, the complexity of data, or the speed at which it is changing—as
opposed to compute-intensive, where CPU cycles are the bottleneck.”
Martin Kleppmann

Data processing today
Today's applications need to :
❏ Store data (databases)
❏ Cache results of expensive operations (caches)
❏ Search data (search indexes)
❏ Handle messages asynchronously (stream processing)
❏ Process accumulated data periodically (batch processing)

Spark, Hadoop, MapReduce
Spark : main differences from MapReduce
❏ Loads most of the dataset into memory
❏ Implements caching mechanisms that reduce reads from disk
❏ Is much faster than MapReduce, notably thanks to its job scheduling
❏ Does not implement any data distribution technology of its own, but
can run on top of Hadoop clusters (HDFS)

Spark basics : RDD
RDD : Resilient Distributed Dataset
❏ The primary Spark abstraction
❏ A fault-tolerant collection of elements
❏ Partitioned and immutable
❏ Two types of operations : transformations and actions
❏ Transformations are lazy : nothing is computed until an action is called
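
The laziness and immutability listed above can be illustrated without Spark. The following is a minimal pure-Python sketch, not the Spark API: the `LazyDataset` class and its `log` are hypothetical names introduced only for illustration, and everything runs on a single machine with no partitioning or fault tolerance.

```python
from typing import Callable, Iterable, Iterator, List


class LazyDataset:
    """Toy, single-machine stand-in for an RDD (hypothetical; not the Spark API)."""

    def __init__(self, source: Callable[[], Iterable], log: List[str]):
        self._source = source  # function that produces the elements on demand
        self._log = log        # records when work is actually performed

    def map(self, fn: Callable) -> "LazyDataset":
        # Transformation: returns a new dataset; nothing is computed yet.
        def produce() -> Iterator:
            for item in self._source():
                self._log.append(f"map({item})")
                yield fn(item)
        return LazyDataset(produce, self._log)

    def filter(self, pred: Callable) -> "LazyDataset":
        # Transformation: also lazy, and the parent dataset is left untouched.
        def produce() -> Iterator:
            return (item for item in self._source() if pred(item))
        return LazyDataset(produce, self._log)

    def collect(self) -> list:
        # Action: triggers evaluation of the whole chain of transformations.
        return list(self._source())


log: List[str] = []
numbers = LazyDataset(lambda: [1, 2, 3, 4], log)
evens_squared = numbers.map(lambda x: x * x).filter(lambda x: x % 2 == 0)

work_before_action = list(log)    # still empty: only the plan has been built
result = evens_squared.collect()  # the action runs map and filter
print(work_before_action, result, log)
```

Chaining `map` and `filter` only builds a description of the work; the log stays empty until `collect` is called, and `numbers` itself is never modified.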

Outline
❏ Why In-stream processing ?
❏ Runtime and Programming Model
❏ Spark Streaming : Overview
❏ Benefits of Discretized Stream Processing
❏ Processing flow
❏ Transform operations
❏ Window operations

Runtime and Programming Model
Native Streaming

Runtime and Programming Model
Micro-batch Streaming

Benefits of Discretized Stream Processing
Dynamic load balancing

Benefits of Discretized Stream Processing
Fast failure and straggler recovery

Benefits of Discretized Stream Processing
❏ Unification of batch, streaming and interactive analytics
❏ Advanced analytics like machine learning and interactive SQL
❏ Streaming + SQL and DataFrames
❏ Streaming + MLlib

Spark Streaming : DStreams
Discretized Streams (DStreams) :
❏ The basic Spark Streaming abstraction
❏ A continuous series of RDDs

Spark Streaming : Transformations
Transform Operations : Any operation applied on a DStream translates
to operations on the underlying RDDs
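
One way to picture this is to model a DStream as a plain list of batches, each batch standing in for one RDD. The sketch below is pure Python under that assumption; `transform_stream` and `split_words` are hypothetical helper names, not Spark functions.

```python
from typing import Callable, List

Batch = List[str]  # stands in for one RDD: the records of one batch interval


def transform_stream(batches: List[Batch], rdd_op: Callable[[Batch], Batch]) -> List[Batch]:
    """Apply the same per-batch operation to every batch of the stream."""
    return [rdd_op(batch) for batch in batches]


def split_words(batch: Batch) -> Batch:
    # Per-batch equivalent of a flatMap: each line becomes several words.
    return [word for line in batch for word in line.split()]


# Hypothetical input: three batches of text lines (the second one is empty).
stream = [["a b", "c"], [], ["d e f"]]
words_stream = transform_stream(stream, split_words)
print(words_stream)
```

The stream-level operation is nothing more than the batch-level operation repeated for each batch, which is why the batch and streaming code can look so similar.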

Spark Streaming : Transformations
Window Operations :

Spark Streaming : Time abstractions
Batch interval
Sliding interval
Window size

Spark Streaming : Time abstractions
Batch interval
Window size
Sliding interval
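
The three durations relate as follows: the batch interval sets how often a new batch is cut, while the window size and the sliding interval are both multiples of the batch interval. The pure-Python sketch below counts them in batches; the `windows` function and the sample data are hypothetical, and it only mimics the grouping, not Spark's scheduling.

```python
from typing import List, Sequence


def windows(batches: Sequence[List[str]], window_len: int, slide: int) -> List[List[str]]:
    """Group batches into windows.

    window_len and slide are counted in batches, i.e. the window size and the
    sliding interval expressed as multiples of the batch interval.
    """
    if window_len <= 0 or slide <= 0:
        raise ValueError("window_len and slide must be positive")
    result = []
    # A window is emitted every `slide` batches and covers the last `window_len` batches.
    for end in range(slide, len(batches) + 1, slide):
        start = max(0, end - window_len)
        merged = [record for batch in batches[start:end] for record in batch]
        result.append(merged)
    return result


# With a hypothetical 10 s batch interval, this is a 30 s window sliding every 20 s.
batches = [["a"], ["b"], ["c"], ["d"], ["e"], ["f"]]
out = windows(batches, window_len=3, slide=2)
print(out)
```

Because the window is longer than the slide, consecutive windows overlap: batch `b` and batch `d` each appear in two windows.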

Spark Streaming : Some examples
❏ Word count
    ❏ Stateless operation : counts words in every batch
❏ Basic error count
    ❏ Stateless operation, using a filter : contains("ERROR")
❏ Cumulative error count
    ❏ Stateful operation : errors since the beginning of the processing
❏ Windowed error count
    ❏ Stateful operation : errors within the sliding time window
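
The difference between the stateless and stateful variants can be sketched in pure Python, again treating the stream as a list of batches. The function names and log lines below are hypothetical and are not taken from the repositories that follow; the window here slides by one batch at a time.

```python
from typing import Iterable, List, Sequence


def count_errors(batch: Iterable[str]) -> int:
    # The filter from the basic error count: keep lines containing "ERROR".
    return sum(1 for line in batch if "ERROR" in line)


def per_batch_error_counts(batches: Sequence[List[str]]) -> List[int]:
    # Stateless: each batch is counted on its own.
    return [count_errors(batch) for batch in batches]


def cumulative_error_counts(batches: Sequence[List[str]]) -> List[int]:
    # Stateful: a running total is carried from one batch to the next.
    total = 0
    totals = []
    for batch in batches:
        total += count_errors(batch)
        totals.append(total)
    return totals


def windowed_error_counts(batches: Sequence[List[str]], window_len: int) -> List[int]:
    # Stateful: only the last `window_len` batches contribute to each result.
    counts = per_batch_error_counts(batches)
    return [sum(counts[max(0, i + 1 - window_len): i + 1]) for i in range(len(counts))]


log_batches = [
    ["INFO start", "ERROR disk"],
    ["ERROR net", "ERROR db"],
    ["INFO ok"],
    ["ERROR auth"],
]
print(per_batch_error_counts(log_batches))
print(cumulative_error_counts(log_batches))
print(windowed_error_counts(log_batches, window_len=2))
```

The stateless count forgets everything between batches, the cumulative count never forgets, and the windowed count forgets batches once they leave the window.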

The git repo
https://github.com/SoatGroup/spark-streaming-java-examples
https://github.com/SoatGroup/spark-streaming-python
