www.unicomlearning.com

Lightning Fast Big Data Analytics using
Apache Spark
Manish Gupta
Solutions Architect – Product Engineering and Development
30th Jan 2014 - Delhi

www.bigdatainnovation.org

Agenda Of The Talk:
Hadoop – A Quick Introduction
An Introduction To Spark & Shark
Spark – Architecture & Programming Model

Example & Demo
Spark Current Users & Roadmap

What is Hadoop?
It's open-source software for distributed storage of large datasets on commodity-class
hardware in a highly fault-tolerant, scalable, and flexible way (HDFS).
It also provides a programming model/framework for processing these large datasets
in a massively parallel, fault-tolerant, and data-location-aware fashion (MapReduce, MR).
[Diagram: Input -> Map tasks -> Reduce tasks -> Output]
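The Map and Reduce boxes in the diagram can be read as three steps: map each record to key/value pairs, shuffle the pairs so that equal keys meet, and reduce each key's values. Below is a minimal sketch of that data flow as a word count. Plain Scala collections stand in for the cluster here; this is not Hadoop API code, and the object and function names are made up for illustration.

```scala
// Sketch only: in-memory collections play the role of HDFS splits and tasks.
object MapReduceSketch {
  // Map phase: every input line is turned into (word, 1) pairs.
  def mapPhase(lines: Seq[String]): Seq[(String, Int)] =
    lines.flatMap(_.split("\\s+").filter(_.nonEmpty).map(word => (word, 1)))

  // Shuffle: bring together all values that share a key.
  def shuffle(pairs: Seq[(String, Int)]): Map[String, Seq[Int]] =
    pairs.groupBy(_._1).map { case (k, kvs) => (k, kvs.map(_._2)) }

  // Reduce phase: fold each key's values into one result.
  def reducePhase(grouped: Map[String, Seq[Int]]): Map[String, Int] =
    grouped.map { case (k, vs) => (k, vs.sum) }

  def wordCount(lines: Seq[String]): Map[String, Int] =
    reducePhase(shuffle(mapPhase(lines)))

  def main(args: Array[String]): Unit =
    println(wordCount(Seq("spark and hadoop", "spark and shark")))
}
```

In a real Hadoop job the map output and the reduce output would both be written to disk, which is the cost the next slide is about.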

Limitations of Map Reduce
[Diagram: iterative MapReduce jobs; every iteration (iter. 1, iter. 2, . . .) reads its Input from HDFS, runs Map and Reduce, and writes its Output back to HDFS]

• Slow due to replication, serialization, and disk IO
• Inefficient for:
 - Iterative algorithms (Machine Learning, Graphs & Network Analysis)
 - Interactive Data Mining (R, Excel, Adhoc Reporting, Searching)

Approach: Leverage Memory?
• Memory bus >> disk & SSDs
• Many datasets fit into memory
• 1 TB = 1 billion records @ 1 KB
• Memory capacity also follows Moore's Law

A single 8 GB stick of RAM is about $80 right now. In 2021, you'd be
able to buy a single stick of RAM that contains 64 GB for the same price.

Spark
“A big data analytics cluster-computing framework written in Scala.”
• Open source; originally developed in the AMPLab at UC Berkeley.
• Provides in-memory analytics, which is faster than Hadoop/Hive (up to 100x).
• Designed for running iterative algorithms & interactive analytics.
• Highly compatible with Hadoop's storage APIs.
 - Can run on your existing Hadoop cluster setup.
• Developers can write driver programs using multiple programming languages.

…

Spark
[Architecture diagram: the Spark Driver (Master) works through a Cluster Manager with several Spark Workers, each holding a Cache; the workers sit alongside the HDFS Datanodes, which hold the data Blocks]

Spark
[Diagram: Input -> HDFS read -> iter. 1 -> HDFS write -> HDFS read -> iter. 2 -> HDFS write -> . . .]

Spark
[Diagram: Input -> HDFS read -> iter. 1 -> iter. 2 -> . . .]

Not tied to the 2-stage MapReduce paradigm:
1. Extract a working set
2. Cache it
3. Query it repeatedly
[Chart: Logistic regression in Hadoop and Spark]
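The three steps above (extract, cache, query repeatedly) can be sketched without a cluster. In this rough illustration a `lazy val` plays the role of the cache and a counter shows that the "expensive" source is read only once. The data, the `loadFromStorage` helper, and all names are hypothetical stand-ins, not Spark API.

```scala
// Sketch only: a lazy val stands in for a cached RDD.
object WorkingSetSketch {
  var loads = 0 // counts how often the slow source is read

  // Hypothetical stand-in for reading and parsing input from HDFS.
  def loadFromStorage(): Vector[(String, Int)] = {
    loads += 1
    Vector(("en", 10), ("fr", 3), ("en", 5), ("de", 7))
  }

  // Steps 1 and 2: extract the working set and keep it in memory.
  // It is materialized on first use and reused afterwards.
  lazy val english: Vector[(String, Int)] =
    loadFromStorage().filter(_._1 == "en")

  // Step 3: repeated queries hit the in-memory copy, not the source.
  def totalHits: Int = english.map(_._2).sum
  def maxHits: Int = english.map(_._2).max
}
```

Calling `totalHits` and then `maxHits` leaves `loads` at 1; with the MapReduce flow from the earlier slide, each query would have gone back to HDFS.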

Spark
A simple analytical operation:

// 1. Select count(*) from pagecounts
val pagecount = spark.textFile("/wiki/pagecounts")
pagecount.count()

// 2. Select Col1, sum(Col4) from pagecounts where Col2 = 'en' group by Col1
val englishPages = pagecount.filter(_.split(" ")(1) == "en")
englishPages.cache()
englishPages.count()
val englishTuples = englishPages.map(line => line.split(" "))
val englishKeyValues = englishTuples.map(line => (line(0), line(3).toInt))
englishKeyValues.reduceByKey(_ + _, 1).collect

Shark
• HIVE on SPARK = SHARK
• A large-scale data warehouse system, just like Apache Hive.
• Highly compatible with Hive (HQL, metastore, serialization formats, and UDFs).
• Built on top of Spark (thus a faster execution engine).
• Can create in-memory materialized tables (cached tables).
• Cached tables use columnar storage instead of row storage.

Row Storage
1    ABC  4.1
2    XYZ  3.5
3    PPP  6.4

Column Storage
1    2    3
ABC  XYZ  PPP
4.1  3.5  6.4
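The two layouts in the table above can be written down directly. This is a small sketch of the idea only, not Shark's actual in-memory format; the field names `ids`, `names`, and `scores` are invented, since the slide gives no column names.

```scala
// Sketch only: the same three records in row and in column layout.
object StorageLayoutSketch {
  // Row storage: one element per record, all fields kept together.
  val rows: Vector[(Int, String, Double)] =
    Vector((1, "ABC", 4.1), (2, "XYZ", 3.5), (3, "PPP", 6.4))

  // Column storage: one vector per field, so a column's values sit together.
  final case class Columns(ids: Vector[Int], names: Vector[String], scores: Vector[Double])

  def toColumns(rs: Vector[(Int, String, Double)]): Columns =
    Columns(rs.map(_._1), rs.map(_._2), rs.map(_._3))

  // An aggregate over one column only touches that column's values.
  def sumScores(c: Columns): Double = c.scores.sum
}
```

A query such as a sum over the third column reads one vector in the column layout, whereas the row layout forces a pass over every full record.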

Shark
[HIVE architecture diagram: Client (CLI, JDBC) -> Driver (SQL Parser, Query Optimizer, Physical Plan, Execution) with Meta store -> Map Reduce -> HDFS]

Shark
[SHARK architecture diagram: Client (CLI, JDBC) -> Driver (SQL Parser, Query Optimizer, Physical Plan, Execution, Cache Mgr.) with Meta store -> Spark -> HDFS]

Spark Programming Model
The user (developer) writes the driver program:
sc = new SparkContext
rdd = sc.textFile("hdfs://…")
rdd.filter(…)
rdd.cache()
rdd.count()
rdd.map(…)

[Diagram: the driver's SparkContext talks to the Cluster Manager and to the Worker Nodes; each Worker Node runs an Executor with a Cache and Tasks, next to an HDFS Datanode]

Spark Programming Model
RDD (Resilient Distributed Dataset):
• Immutable data structure
• In-memory (explicitly)
• Fault tolerant
• Parallel data structure
• Controlled partitioning to optimize data placement
• Can be manipulated using a rich set of operators

RDD
Programming Interface: the programmer can perform 3 types of operations:

Transformations
• Create a new dataset from an existing one.
• Lazy in nature. They are executed only when some action is performed.
• Example:
 - map(func)
 - filter(func)
 - distinct()

Actions
• Return a value to the driver program or export data to a storage system after performing a computation.
• Example:
 - count()
 - reduce(func)
 - collect()
 - take()

Persistence
• For caching datasets in memory for future operations.
• Option to store on disk or RAM or mixed (Storage Level).
• Example:
 - persist()
 - cache()
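The difference between lazy transformations and actions can be seen with an ordinary Scala iterator, which is also lazy. This is only an analogy, not Spark code: unlike an RDD, an iterator can be consumed just once. The counter and the names are made up for the illustration.

```scala
// Sketch only: a Scala Iterator stands in for a chain of RDD transformations.
object LazySketch {
  var evaluated = 0 // how many elements the map function has processed

  // Returns (evaluations before the action, result of the action, evaluations after).
  def demo(): (Int, Int, Int) = {
    evaluated = 0
    val data = Vector(1, 2, 3, 4, 5)
    // "Transformations": building the pipeline runs nothing yet.
    val transformed = data.iterator.map { x => evaluated += 1; x * 2 }.filter(_ > 4)
    val beforeAction = evaluated
    // "Action": asking for a result forces the whole pipeline to run.
    val count = transformed.size
    (beforeAction, count, evaluated)
  }
}
```

`demo()` returns `(0, 3, 5)`: nothing is evaluated while the pipeline is defined, and all five elements are processed only when the count is requested.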

Spark
How Spark Works:
• RDD: parallel collection with partitions.
• The user application creates RDDs, transforms them, and runs actions.
• This results in a DAG (Directed Acyclic Graph) of operators.
• The DAG is compiled into stages.
• Each stage is executed as a series of Tasks (one Task for each partition).

Spark
Example:
sc.textFile("/wiki/pagecounts")     // RDD[String]
  .map(line => line.split("\t"))    // RDD[Array[String]]
  .map(r => (r(0), r(1).toInt))     // RDD[(String, Int)]
  .reduceByKey(_ + _, 3)            // RDD[(String, Int)]
  .collect()                        // Array[(String, Int)]

[Diagram: textFile -> map -> map -> reduceByKey -> collect]

Spark
Execution Plan:
[Diagram: textFile -> map -> map -> reduceByKey -> collect]

The logical plan above gets compiled by the DAG scheduler into a plan comprising stages.

Spark
Execution Plan:
[Diagram: Stage 1 = textFile -> map -> map; Stage 2 = reduceByKey -> collect]

Stages are sequences of RDDs that don't have a shuffle in between.

Spark
[Diagram: Stage 1 = textFile -> map -> map; Stage 2 = reduceByKey -> collect]

Stage 1
1. Read HDFS split
2. Apply both the maps
3. Start partial reduce
4. Write shuffle data

Stage 2
1. Read shuffle data
2. Final reduce
3. Send result to driver program
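The two stages listed above can be mimicked on in-memory data to make the shuffle boundary visible. This is a simplified sketch, not how Spark's scheduler is implemented: partitions are plain sequences, the shuffle is a hash bucketing step, and every name below is invented for the example.

```scala
// Sketch only: Seq[String] partitions stand in for HDFS splits.
object StageSketch {
  type Pair = (String, Int)

  // Stage 1, once per input partition: apply both maps, then reduce
  // partially inside the partition before anything is shuffled.
  def stage1(partition: Seq[String]): Map[String, Int] =
    partition
      .map(_.split("\t"))                              // first map
      .map(f => (f(0), f(1).toInt))                    // second map
      .groupBy(_._1)
      .map { case (k, kvs) => (k, kvs.map(_._2).sum) } // partial reduce

  // Shuffle: send each key to one of numReducers buckets by hash,
  // so all partial sums for a key land in the same bucket.
  def shuffle(partials: Seq[Map[String, Int]], numReducers: Int): Seq[Seq[Pair]] = {
    val all = partials.flatMap(_.toSeq)
    (0 until numReducers).map { r =>
      all.filter { case (k, _) => java.lang.Math.floorMod(k.hashCode, numReducers) == r }
    }
  }

  // Stage 2, once per bucket: final reduce over the partial sums.
  def stage2(bucket: Seq[Pair]): Seq[Pair] =
    bucket.groupBy(_._1).map { case (k, kvs) => (k, kvs.map(_._2).sum) }.toSeq

  // "collect": gather every reducer's output at the driver.
  def run(partitions: Seq[Seq[String]], numReducers: Int): Map[String, Int] =
    shuffle(partitions.map(stage1), numReducers).flatMap(stage2).toMap
}
```

For example, `StageSketch.run(Seq(Seq("a\t1", "b\t2", "a\t3"), Seq("b\t4", "c\t5")), 3)` sums `a` inside the first partition during stage 1, while `b` is only completed in stage 2 because its values sit in different partitions.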

Spark
Stage Execution:
[Diagram: Stage 1 split into several Tasks]

• Create a task for each partition in the new RDD
• Serialize the task
• Schedule and ship tasks to the slaves
And all this happens internally (you don't need to do anything).

Spark
Task Execution:
Task is the fundamental unit of execution in Spark.
[Diagram, along a time axis: Fetch Input (from HDFS / RDD) -> Execute Task -> Write Output (to HDFS / RDD / intermediate shuffle output)]

Spark
Spark Executor (Slaves)
[Diagram: Core 1, Core 2, and Core 3 each run a sequence of tasks over time; every task goes through Fetch Input -> Execute Task -> Write Output]

Spark
Summary of Components
• Task: the fundamental unit of execution in Spark
• Stage: set of Tasks that run in parallel
• DAG: logical graph of RDD operations
• RDD: parallel dataset with partitions

Example & Demo
Cluster Details:

• 6 m1.xlarge EC2 nodes
• 1 machine is the master node
• 5 worker node machines
• 64-bit, 4 vCPU
• 15 GB RAM

Example & Demo
Dataset:

• Wiki Page View Stats
• 20 GB of webpage view counts
• 3 days' worth of data
Record format: <date_time> <project_code> <page_title> <num_hits> <page_size>

Base RDD to all wiki pages:
val allPages = sc.textFile("/wiki/pagecounts")
allPages.take(10).foreach(println)
allPages.count()

Transformed RDD for all English pages (cached):
val englishPages = allPages.filter(_.split(" ")(1) == "en")
englishPages.cache()
englishPages.count() // first count computes the RDD and fills the cache
englishPages.count() // second count is answered from the cache

Example & Demo
Select date, sum(pageviews) from pagecounts group by date
englishPages.map(line => line.split(" ")).map(line => (line(0).substring(0, 8), line(3).toInt)).reduceByKey(_+_, 1).collect.foreach(println)

Select date, count(distinct pageURL) from pagecounts group by date
englishPages.map(line => line.split(" ")).map(line => (line(0).substring(0, 8), line(2))).distinct().countByKey().foreach(println)

Select distinct(datetime) from pagecounts order by datetime
englishPages.map(line => line.split(" ")).map(line => (line(0), 1)).distinct().sortByKey().collect().foreach(println)

Example & Demo
Dataset:
• Network datasets
• Directed and bi-directed graphs
• One small Facebook social network
 - 127 nodes (friends)
 - 1668 edges (friendships)
 - Bi-directed graph
• Google's internal site network
 - 15713 nodes (web pages)
 - 170845 edges (hyperlinks)
 - Directed graph

Example & Demo
Page Rank Calculation:
• Estimates the importance of a node.
• Each directed link from A -> B is a vote for B from A.
• The more links point to a page, the more important the page is.
• When a page with a higher PR points to something, its vote weighs more.

1. Start each page at a rank of 1
2. On each iteration, have page p contribute (rank of p) / (no. of neighbors of p) to its neighbors
3. Set each page's rank to 0.15 + 0.85 × contribs
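To make the three steps concrete, here is a small sketch that runs them on a made-up three-page graph using plain Scala collections rather than RDDs. The graph, the object name, and the function names are assumptions for illustration; every page in it has at least one inbound link, so no page drops out of the rank map.

```scala
// Sketch only: a tiny in-memory graph instead of a distributed edge list.
object PageRankSketch {
  // Hypothetical graph: page -> pages it links to.
  val links: Map[String, Seq[String]] =
    Map("A" -> Seq("B", "C"), "B" -> Seq("C"), "C" -> Seq("A"))

  // One iteration of steps 2 and 3.
  def step(ranks: Map[String, Double]): Map[String, Double] = {
    // Step 2: page p contributes rank(p) / (number of neighbors) to each neighbor.
    val contribs = links.toSeq.flatMap { case (page, neighbors) =>
      neighbors.map(n => (n, ranks(page) / neighbors.size))
    }
    // Step 3: new rank = 0.15 + 0.85 * (sum of received contributions).
    contribs.groupBy(_._1).map { case (page, cs) =>
      (page, 0.15 + 0.85 * cs.map(_._2).sum)
    }
  }

  def run(iterations: Int): Map[String, Double] = {
    // Step 1: start each page at a rank of 1.
    val initial = links.keys.map(p => (p, 1.0)).toMap
    (1 to iterations).foldLeft(initial)((r, _) => step(r))
  }
}
```

After one iteration, A receives 1.0 from C, B receives 0.5 from A, and C receives 0.5 + 1.0, so the ranks become A = 1.0, B = 0.575, and C = 1.425.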

Example & Demo
Scala Code:
val iters = 100
// Load the edge list; each line is split on a tab into (source, target)
val lines = sc.textFile("/dataset/google/edges.csv", 1)
val links = lines.map { s =>
  val parts = s.split("\t")
  (parts(0), parts(1))
}.distinct().groupByKey().cache()
// Start each page at a rank of 1
var ranks = links.mapValues(v => 1.0)
for (i <- 1 to iters) {
  // Each page contributes rank / (number of neighbors) to every neighbor
  val contribs = links.join(ranks).values.flatMap { case (urls, rank) =>
    val size = urls.size
    urls.map(url => (url, rank / size))
  }
  ranks = contribs.reduceByKey(_ + _).mapValues(0.15 + 0.85 * _)
}
// Sort by rank, highest first, and print the top 20 pages
val output = ranks.map(l => (l._2, l._1)).sortByKey(false).map(l => (l._2, l._1))
output.take(20).foreach(tup => println(tup._2 + " : " + tup._1))
2 seconds
38 seconds
Page Rank      Page URL
761.1985177    google
455.7028756    google/about.html
259.6052388    google/privacy.html
192.7257649    google/jobs/
144.0349154    google/support
134.1566312    google/terms_of_service.html
130.3546324    google/intl/en/about.html
123.4014613    google/imghp
120.0661165    google/accounts/Login
118.6884515    google/intl/en/options/
112.2309539    google/preferences
108.8375347    google/sitemap.html
106.9724799    google/press/
105.822426     google/language_tools
105.1554798    google/support/toolbar/
99.97741309    google/maps
97.90651416    google/advanced_search
90.7910291     google/intl/en/services/
90.70522689    google/intl/en/ads/
87.4353413     google/adsense/

Spark Current Users & Roadmap

Source: Apache - Powered By Spark

Roadmap

Conclusion
• Because of in-memory processing, computations are very fast. Developers can
write iterative algorithms without writing out a result set after each pass
through the data.
• Suitable for scenarios where sufficient memory is available in your cluster.
• It provides an integrated framework for advanced analytics like graph
processing, stream processing, machine learning, etc. This simplifies
integration.
• Its community is expanding and development is happening very aggressively.
• It's comparatively newer than Hadoop and has only a few users.

Thank You
Speaker name: MANISH GUPTA
Email ID: manish.gupta@globallogic.com


Organized by
UNICOM Trainings & Seminars Pvt. Ltd.
contact@unicomlearning.com
Backup Slides

Spark Internal Components
[Component diagram: Spark core, Operators, Scheduler, Block manager, Networking, Accumulators, Interpreter, Broadcast, Hadoop I/O, Mesos backend, Standalone backend]

In-Memory
But what if I run out of memory?
[Chart: iteration time (s) against % of working set in memory]

% of working set in memory    Iteration time (s)
Cache disabled                68.8
25%                           58.1
50%                           40.7
75%                           29.7
Fully cached                  11.5

Benchmarks
• AMPLab performed a quantitative and qualitative comparison of 4 systems:
 - HIVE, Impala, Redshift and Shark
• Done on the Common Crawl Corpus dataset
 - 81 TB in size
 - Consists of 3 tables: Page Rankings, User Visits, Documents
• Data was partitioned in such a way that each node had:
 - 25 GB of User Visits
 - 1 GB of Rankings
 - 30 GB of Web Crawl (documents)
Source: https://amplab.cs.berkeley.edu/benchmark/#

Benchmarks

Benchmarks
Hardware Configuration

Benchmarks

• Redshift outperforms for on-disk data.
• Shark and Impala outperform Hive by 3-4X.
• For larger result-sets, Shark outperforms Impala.

Benchmarks

• Redshift columnar storage outperforms every time.
• Shark in-memory is 2nd best in all cases.

Benchmarks
• Redshift bigger cluster has an advantage.
• Shark and Impala competing.

Benchmarks

• Impala & Redshift don't support UDFs.
• Shark outperforms Hive.

Roadmap

Spark

In Last 6 months of Year 2013

Mais conteúdo relacionado

Mais procurados

Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache Sparkdatamantra
 
Fast Data Analytics with Spark and Python
Fast Data Analytics with Spark and PythonFast Data Analytics with Spark and Python
Fast Data Analytics with Spark and PythonBenjamin Bengfort
 
Intro to Apache Spark by Marco Vasquez
Intro to Apache Spark by Marco VasquezIntro to Apache Spark by Marco Vasquez
Intro to Apache Spark by Marco VasquezMapR Technologies
 
Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache SparkRahul Jain
 
Performance Optimization Case Study: Shattering Hadoop's Sort Record with Spa...
Performance Optimization Case Study: Shattering Hadoop's Sort Record with Spa...Performance Optimization Case Study: Shattering Hadoop's Sort Record with Spa...
Performance Optimization Case Study: Shattering Hadoop's Sort Record with Spa...Databricks
 
Apache Spark Tutorial
Apache Spark TutorialApache Spark Tutorial
Apache Spark TutorialAhmet Bulut
 
Using SparkR to Scale Data Science Applications in Production. Lessons from t...
Using SparkR to Scale Data Science Applications in Production. Lessons from t...Using SparkR to Scale Data Science Applications in Production. Lessons from t...
Using SparkR to Scale Data Science Applications in Production. Lessons from t...Spark Summit
 
Introduction to Apache Spark Ecosystem
Introduction to Apache Spark EcosystemIntroduction to Apache Spark Ecosystem
Introduction to Apache Spark EcosystemBojan Babic
 
Apache® Spark™ 1.6 presented by Databricks co-founder Patrick Wendell
Apache® Spark™ 1.6 presented by Databricks co-founder Patrick WendellApache® Spark™ 1.6 presented by Databricks co-founder Patrick Wendell
Apache® Spark™ 1.6 presented by Databricks co-founder Patrick WendellDatabricks
 
Hadoop Spark Introduction-20150130
Hadoop Spark Introduction-20150130Hadoop Spark Introduction-20150130
Hadoop Spark Introduction-20150130Xuan-Chao Huang
 
Introduction to Apache Spark Developer Training
Introduction to Apache Spark Developer TrainingIntroduction to Apache Spark Developer Training
Introduction to Apache Spark Developer TrainingCloudera, Inc.
 
New Developments in Spark
New Developments in SparkNew Developments in Spark
New Developments in SparkDatabricks
 
Spark, Python and Parquet
Spark, Python and Parquet Spark, Python and Parquet
Spark, Python and Parquet odsc
 
Introduction to apache spark
Introduction to apache sparkIntroduction to apache spark
Introduction to apache sparkUserReport
 
Spark what's new what's coming
Spark what's new what's comingSpark what's new what's coming
Spark what's new what's comingDatabricks
 
Spark's Role in the Big Data Ecosystem (Spark Summit 2014)
Spark's Role in the Big Data Ecosystem (Spark Summit 2014)Spark's Role in the Big Data Ecosystem (Spark Summit 2014)
Spark's Role in the Big Data Ecosystem (Spark Summit 2014)Databricks
 
Introduction to Apache Spark and MLlib
Introduction to Apache Spark and MLlibIntroduction to Apache Spark and MLlib
Introduction to Apache Spark and MLlibpumaranikar
 
Strata NYC 2015 - Supercharging R with Apache Spark
Strata NYC 2015 - Supercharging R with Apache SparkStrata NYC 2015 - Supercharging R with Apache Spark
Strata NYC 2015 - Supercharging R with Apache SparkDatabricks
 

Mais procurados (20)

Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache Spark
 
Fast Data Analytics with Spark and Python
Fast Data Analytics with Spark and PythonFast Data Analytics with Spark and Python
Fast Data Analytics with Spark and Python
 
Intro to Apache Spark by Marco Vasquez
Intro to Apache Spark by Marco VasquezIntro to Apache Spark by Marco Vasquez
Intro to Apache Spark by Marco Vasquez
 
Apache Spark 101
Apache Spark 101Apache Spark 101
Apache Spark 101
 
Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache Spark
 
Performance Optimization Case Study: Shattering Hadoop's Sort Record with Spa...
Performance Optimization Case Study: Shattering Hadoop's Sort Record with Spa...Performance Optimization Case Study: Shattering Hadoop's Sort Record with Spa...
Performance Optimization Case Study: Shattering Hadoop's Sort Record with Spa...
 
Apache Spark Tutorial
Apache Spark TutorialApache Spark Tutorial
Apache Spark Tutorial
 
Using SparkR to Scale Data Science Applications in Production. Lessons from t...
Using SparkR to Scale Data Science Applications in Production. Lessons from t...Using SparkR to Scale Data Science Applications in Production. Lessons from t...
Using SparkR to Scale Data Science Applications in Production. Lessons from t...
 
Introduction to Apache Spark Ecosystem
Introduction to Apache Spark EcosystemIntroduction to Apache Spark Ecosystem
Introduction to Apache Spark Ecosystem
 
Apache® Spark™ 1.6 presented by Databricks co-founder Patrick Wendell
Apache® Spark™ 1.6 presented by Databricks co-founder Patrick WendellApache® Spark™ 1.6 presented by Databricks co-founder Patrick Wendell
Apache® Spark™ 1.6 presented by Databricks co-founder Patrick Wendell
 
Hadoop Spark Introduction-20150130
Hadoop Spark Introduction-20150130Hadoop Spark Introduction-20150130
Hadoop Spark Introduction-20150130
 
Apache Spark Briefing
Apache Spark BriefingApache Spark Briefing
Apache Spark Briefing
 
Introduction to Apache Spark Developer Training
Introduction to Apache Spark Developer TrainingIntroduction to Apache Spark Developer Training
Introduction to Apache Spark Developer Training
 
New Developments in Spark
New Developments in SparkNew Developments in Spark
New Developments in Spark
 
Spark, Python and Parquet
Spark, Python and Parquet Spark, Python and Parquet
Spark, Python and Parquet
 
Introduction to apache spark
Introduction to apache sparkIntroduction to apache spark
Introduction to apache spark
 
Spark what's new what's coming
Spark what's new what's comingSpark what's new what's coming
Spark what's new what's coming
 
Spark's Role in the Big Data Ecosystem (Spark Summit 2014)
Spark's Role in the Big Data Ecosystem (Spark Summit 2014)Spark's Role in the Big Data Ecosystem (Spark Summit 2014)
Spark's Role in the Big Data Ecosystem (Spark Summit 2014)
 
Introduction to Apache Spark and MLlib
Introduction to Apache Spark and MLlibIntroduction to Apache Spark and MLlib
Introduction to Apache Spark and MLlib
 
Strata NYC 2015 - Supercharging R with Apache Spark
Strata NYC 2015 - Supercharging R with Apache SparkStrata NYC 2015 - Supercharging R with Apache Spark
Strata NYC 2015 - Supercharging R with Apache Spark
 

Destaque

Simplifying Big Data Analytics with Apache Spark
Simplifying Big Data Analytics with Apache SparkSimplifying Big Data Analytics with Apache Spark
Simplifying Big Data Analytics with Apache SparkDatabricks
 
Apache Spark 2.0: Faster, Easier, and Smarter
Apache Spark 2.0: Faster, Easier, and SmarterApache Spark 2.0: Faster, Easier, and Smarter
Apache Spark 2.0: Faster, Easier, and SmarterDatabricks
 
Intro to Big Data Analytics using Apache Spark and Apache Zeppelin
Intro to Big Data Analytics using Apache Spark and Apache ZeppelinIntro to Big Data Analytics using Apache Spark and Apache Zeppelin
Intro to Big Data Analytics using Apache Spark and Apache ZeppelinAlex Zeltov
 
Real-Time Data Flows with Apache NiFi
Real-Time Data Flows with Apache NiFiReal-Time Data Flows with Apache NiFi
Real-Time Data Flows with Apache NiFiManish Gupta
 
Choosing an HDFS data storage format- Avro vs. Parquet and more - StampedeCon...
Choosing an HDFS data storage format- Avro vs. Parquet and more - StampedeCon...Choosing an HDFS data storage format- Avro vs. Parquet and more - StampedeCon...
Choosing an HDFS data storage format- Avro vs. Parquet and more - StampedeCon...StampedeCon
 
Big Data and Mobile Analytics - MMA SF Jan13
Big Data and Mobile Analytics - MMA SF Jan13Big Data and Mobile Analytics - MMA SF Jan13
Big Data and Mobile Analytics - MMA SF Jan13jenveese
 
Polymorphic publishing john barnes - what to build now
Polymorphic publishing   john barnes - what to build nowPolymorphic publishing   john barnes - what to build now
Polymorphic publishing john barnes - what to build nowJohn Barnes
 
Modeling the Smart and Connected City of the Future with Kafka and Spark
Modeling the Smart and Connected City of the Future with Kafka and SparkModeling the Smart and Connected City of the Future with Kafka and Spark
Modeling the Smart and Connected City of the Future with Kafka and SparkSingleStore
 
Building a Real-Time Data Pipeline with Spark, Kafka, and Python
Building a Real-Time Data Pipeline with Spark, Kafka, and PythonBuilding a Real-Time Data Pipeline with Spark, Kafka, and Python
Building a Real-Time Data Pipeline with Spark, Kafka, and PythonSingleStore
 
Mobile Data Analytics
Mobile Data AnalyticsMobile Data Analytics
Mobile Data AnalyticsRICHARD AMUOK
 
11. From Hadoop to Spark 1:2
11. From Hadoop to Spark 1:211. From Hadoop to Spark 1:2
11. From Hadoop to Spark 1:2Fabio Fumarola
 
Spark After Dark: Real time Advanced Analytics and Machine Learning with Spark
Spark After Dark:  Real time Advanced Analytics and Machine Learning with SparkSpark After Dark:  Real time Advanced Analytics and Machine Learning with Spark
Spark After Dark: Real time Advanced Analytics and Machine Learning with SparkChris Fregly
 
Apache spark with Machine learning
Apache spark with Machine learningApache spark with Machine learning
Apache spark with Machine learningdatamantra
 
Solving Low Latency Query Over Big Data with Spark SQL-(Julien Pierre, Micros...
Solving Low Latency Query Over Big Data with Spark SQL-(Julien Pierre, Micros...Solving Low Latency Query Over Big Data with Spark SQL-(Julien Pierre, Micros...
Solving Low Latency Query Over Big Data with Spark SQL-(Julien Pierre, Micros...Spark Summit
 
Enterprise Development Trends 2016 - Cloud, Container and Microservices Insig...
Enterprise Development Trends 2016 - Cloud, Container and Microservices Insig...Enterprise Development Trends 2016 - Cloud, Container and Microservices Insig...
Enterprise Development Trends 2016 - Cloud, Container and Microservices Insig...Lightbend
 
Exploring Advanced SQL Techniques Using Analytic Functions
Exploring Advanced SQL Techniques Using Analytic FunctionsExploring Advanced SQL Techniques Using Analytic Functions
Exploring Advanced SQL Techniques Using Analytic FunctionsZohar Elkayam
 

Destaque (20)

Simplifying Big Data Analytics with Apache Spark
Simplifying Big Data Analytics with Apache SparkSimplifying Big Data Analytics with Apache Spark
Simplifying Big Data Analytics with Apache Spark
 
Apache Spark Architecture
Apache Spark ArchitectureApache Spark Architecture
Apache Spark Architecture
 
Apache Spark 2.0: Faster, Easier, and Smarter
Apache Spark 2.0: Faster, Easier, and SmarterApache Spark 2.0: Faster, Easier, and Smarter
Apache Spark 2.0: Faster, Easier, and Smarter
 
Intro to Big Data Analytics using Apache Spark and Apache Zeppelin
Intro to Big Data Analytics using Apache Spark and Apache ZeppelinIntro to Big Data Analytics using Apache Spark and Apache Zeppelin
Intro to Big Data Analytics using Apache Spark and Apache Zeppelin
 
Real-Time Data Flows with Apache NiFi
Real-Time Data Flows with Apache NiFiReal-Time Data Flows with Apache NiFi
Real-Time Data Flows with Apache NiFi
 
Choosing an HDFS data storage format- Avro vs. Parquet and more - StampedeCon...
Choosing an HDFS data storage format- Avro vs. Parquet and more - StampedeCon...Choosing an HDFS data storage format- Avro vs. Parquet and more - StampedeCon...
Choosing an HDFS data storage format- Avro vs. Parquet and more - StampedeCon...
 
Big Data and Mobile Analytics - MMA SF Jan13
Big Data and Mobile Analytics - MMA SF Jan13Big Data and Mobile Analytics - MMA SF Jan13
Big Data and Mobile Analytics - MMA SF Jan13
 
Polymorphic publishing john barnes - what to build now
Polymorphic publishing   john barnes - what to build nowPolymorphic publishing   john barnes - what to build now
Polymorphic publishing john barnes - what to build now
 
Modeling the Smart and Connected City of the Future with Kafka and Spark
Modeling the Smart and Connected City of the Future with Kafka and SparkModeling the Smart and Connected City of the Future with Kafka and Spark
Modeling the Smart and Connected City of the Future with Kafka and Spark
 
Building a Real-Time Data Pipeline with Spark, Kafka, and Python
Building a Real-Time Data Pipeline with Spark, Kafka, and PythonBuilding a Real-Time Data Pipeline with Spark, Kafka, and Python
Building a Real-Time Data Pipeline with Spark, Kafka, and Python
 
Plan de carrera dentro de una empresa
Plan de carrera dentro de una empresaPlan de carrera dentro de una empresa
Plan de carrera dentro de una empresa
 
Mobile Data Analytics
Mobile Data AnalyticsMobile Data Analytics
Mobile Data Analytics
 
File Format Benchmark - Avro, JSON, ORC & Parquet
File Format Benchmark - Avro, JSON, ORC & ParquetFile Format Benchmark - Avro, JSON, ORC & Parquet
File Format Benchmark - Avro, JSON, ORC & Parquet
 
11. From Hadoop to Spark 1:2
11. From Hadoop to Spark 1:211. From Hadoop to Spark 1:2
11. From Hadoop to Spark 1:2
 
Hadoop 2.0 handout 5.0
Hadoop 2.0 handout 5.0Hadoop 2.0 handout 5.0
Hadoop 2.0 handout 5.0
 
Spark After Dark: Real time Advanced Analytics and Machine Learning with Spark
Spark After Dark:  Real time Advanced Analytics and Machine Learning with SparkSpark After Dark:  Real time Advanced Analytics and Machine Learning with Spark
Spark After Dark: Real time Advanced Analytics and Machine Learning with Spark
 
Apache spark with Machine learning
Apache spark with Machine learningApache spark with Machine learning
Apache spark with Machine learning
 
Solving Low Latency Query Over Big Data with Spark SQL-(Julien Pierre, Micros...
Solving Low Latency Query Over Big Data with Spark SQL-(Julien Pierre, Micros...Solving Low Latency Query Over Big Data with Spark SQL-(Julien Pierre, Micros...
Solving Low Latency Query Over Big Data with Spark SQL-(Julien Pierre, Micros...
 
Enterprise Development Trends 2016 - Cloud, Container and Microservices Insig...
Enterprise Development Trends 2016 - Cloud, Container and Microservices Insig...Enterprise Development Trends 2016 - Cloud, Container and Microservices Insig...
Enterprise Development Trends 2016 - Cloud, Container and Microservices Insig...
 
Exploring Advanced SQL Techniques Using Analytic Functions
Exploring Advanced SQL Techniques Using Analytic FunctionsExploring Advanced SQL Techniques Using Analytic Functions
Exploring Advanced SQL Techniques Using Analytic Functions
 

Lightning Fast Big Data Analytics using Apache Spark

Editor's Notes

  1. Solutions Architect at GlobalLogic. Has worked for the last 10 years on large databases, data warehouses, ETL, and data mining, and for the past 2-3 years on Big Data analytics, machine learning, and distributed systems.
     GlobalLogic is a company of 6,000+ people that offers full product life cycle services and is one of the fastest-growing R&D services firms. It provides advisory, professional services, engineering, and support services to 250+ customers globally.
     Will speak about an in-memory cluster computing framework that can really boost an existing Hadoop-based Big Data setup for analytics.
  2. Agenda:
     - Quickly touch upon Hadoop: what it does, HDFS, MapReduce, and some of its limitations.
     - Introduce Spark and one of the tools built on top of it, Shark (the SQL interface to Spark).
     - A little on Spark's architecture and its basic programming model.
     - Demo of Spark's and Shark's functionality.
     - A bit about the future of Spark, where it is heading, and some of its existing users and contributors.
  3. Hadoop brings the ability to cheaply process large amounts of data, regardless of its structure. "Large" here basically means 10-100 GB and above. It is the driving force behind the growth of the big data industry.
     It provides 2 basic components:
     - HDFS: large-scale storage system
     - MapReduce: distributed cluster computing framework
     A typical Hadoop setup comprises:
     - A cluster running a particular Hadoop distribution
     - Tools like Hive, Pig, and Mahout running on top of Hadoop (internally processing HDFS data using MapReduce jobs)
     - A set of tools for importing data into HDFS from, and exporting it to, external systems such as an RDBMS or server logs
  4. One of the reasons MapReduce is criticized is its restricted programming framework:
     - MapReduce tasks must be written as acyclic dataflow programs.
     - A stateless mapper is followed by a stateless reducer, both executed by a batch job scheduler.
     - Repeated querying of a dataset becomes difficult, so iterative algorithms are hard to write.
     - After each MapReduce iteration, data has to be persisted to disk before the next iteration can proceed.
  5. Terminology:
     - SparkContext: Represents the connection to a Spark cluster and provides the entry point for interacting with Spark. Through it we can interact with Spark and distribute our jobs.
     - Driver program: The process running the main() function of the application and creating the SparkContext.
     - Cluster manager: An external service for acquiring resources on the cluster (e.g. standalone manager, Mesos, YARN).
     - Worker node: Any node that can run application code in the cluster.
     - Executor: A process launched for an application on a worker node, that runs tasks and keeps data in memory or disk storage across them. Each application has its own executors.
     - Task: A unit of work that will be sent to one executor.
     - Job: A parallel computation consisting of multiple tasks that gets spawned in response to a Spark action (e.g. save, collect); you'll see this term used in the driver's logs.
     - Stage: Each job gets divided into smaller sets of tasks called stages that depend on each other (similar to the map and reduce stages in MapReduce); you'll see this term used in the driver's logs.
  6. Resilient Distributed Datasets (RDDs) are a distributed memory abstraction that lets programmers perform in-memory parallel computations on large clusters in a highly fault-tolerant manner. This is the main concept around which the whole Spark framework revolves.
     Currently there are 2 types of RDDs:
     - Parallelized collections: Created by calling the parallelize method on an existing Scala collection. The developer can specify the number of slices to cut the dataset into, ideally 2-3 slices per CPU.
     - Hadoop datasets: Distributed datasets created from any file stored on HDFS or another storage system supported by Hadoop (S3, HBase, etc.). These are created using SparkContext's textFile method. The default number of slices in this case is 1 slice per file block.
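The slicing idea in the note on parallelized collections can be sketched in plain Python. This is a toy model only: it does not use Spark, and `parallelize_local` and `map_slices` are hypothetical helper names chosen for illustration. It shows a local collection being cut into a chosen number of slices, with each slice then processed independently.

```python
from concurrent.futures import ThreadPoolExecutor


def parallelize_local(data, num_slices):
    """Cut a collection into num_slices contiguous, near-equal slices.

    Hypothetical helper: a toy stand-in for the slicing step, not Spark's API.
    """
    data = list(data)
    n = len(data)
    return [
        data[(i * n) // num_slices:((i + 1) * n) // num_slices]
        for i in range(num_slices)
    ]


def map_slices(slices, func):
    """Apply func to every element, one worker call per slice."""
    with ThreadPoolExecutor() as pool:
        return list(pool.map(lambda s: [func(x) for x in s], slices))


# Ten elements cut into four slices, then squared slice by slice.
slices = parallelize_local(range(10), 4)
squared = map_slices(slices, lambda x: x * x)
total = sum(sum(part) for part in squared)
```

Each slice is the unit of parallel work, which is why the number of slices relative to the available CPUs matters.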
  7. Transformations (such as map) take an RDD as input, pass each element through a function, and return a new, transformed RDD as output.
     - By default, each transformed RDD is recomputed each time you run an action on it, unless you specify that the RDD should be cached in memory. Spark will then try to keep the elements around on the cluster for faster access.
     - RDDs can be persisted on disk as well.
     - Caching is the key tool for iterative algorithms.
     - Using persist, one can specify the storage level for persisting an RDD. cache is just shorthand for the default storage level, which is MEMORY_ONLY.
     Storage levels:
     - MEMORY_ONLY: Store the RDD as deserialized Java objects in the JVM. If the RDD does not fit in memory, some partitions will not be cached and will be recomputed on the fly each time they're needed. This is the default level.
     - MEMORY_AND_DISK: Store the RDD as deserialized Java objects in the JVM. If the RDD does not fit in memory, store the partitions that don't fit on disk, and read them from there when they're needed.
     - MEMORY_ONLY_SER: Store the RDD as serialized Java objects (one byte array per partition). This is generally more space-efficient than deserialized objects, especially when using a fast serializer, but more CPU-intensive to read.
     - MEMORY_AND_DISK_SER: Similar to MEMORY_ONLY_SER, but spill partitions that don't fit in memory to disk instead of recomputing them on the fly each time they're needed.
     - DISK_ONLY: Store the RDD partitions only on disk.
     - MEMORY_ONLY_2, MEMORY_AND_DISK_2, etc.: Same as the levels above, but replicate each partition on two cluster nodes.
     Which storage level is best? A few things to consider:
     - Try to keep as much as possible in memory.
     - Try not to spill to disk unless your datasets are expensive to recompute.
     - Use the replicated levels only if you want fast fault recovery; every level is already fault tolerant through recomputation of lost partitions.
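The recompute-versus-cache behaviour described in the note above can be illustrated with a small plain-Python model. It is a sketch under stated assumptions, not Spark code: `ToyRDD` is a hypothetical class that mimics only two ideas, namely that a transformation does no work until an action runs, and that an uncached result is recomputed on every action.

```python
class ToyRDD:
    """Hypothetical toy model of a lazy dataset; not Spark's RDD class."""

    def __init__(self, compute):
        self._compute = compute    # function that produces the elements
        self._use_cache = False    # set by cache()
        self._cached = None        # filled by the first action when caching

    def map(self, func):
        # Transformation: returns a new ToyRDD and does no work yet.
        return ToyRDD(lambda: [func(x) for x in self.collect()])

    def cache(self):
        # Mark the dataset to be kept in memory after its first computation.
        self._use_cache = True
        return self

    def collect(self):
        # Action: triggers the computation, or reuses the cached result.
        if not self._use_cache:
            return self._compute()
        if self._cached is None:
            self._cached = self._compute()
        return list(self._cached)


calls = {"n": 0}


def expensive(x):
    calls["n"] += 1    # count how often the function really runs
    return x * 2


base = ToyRDD(lambda: [1, 2, 3])

uncached = base.map(expensive)
uncached.collect()
uncached.collect()
uncached_calls = calls["n"]      # function ran for both actions

calls["n"] = 0
cached = base.map(expensive).cache()
cached.collect()
result = cached.collect()
cached_calls = calls["n"]        # function ran for the first action only
```

An iterative algorithm runs many actions over the same dataset, so the difference between the two call counts grows with every iteration.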
  8. PageRank is an algorithm used by Google Search to rank websites in its search engine results. It was named after Larry Page, one of the founders of Google. PageRank is a way of measuring the importance of website pages.
     It works by counting the number and quality of links to a page to determine a rough estimate of how important the website is. The underlying assumption is that more important websites are likely to receive more links from other websites.
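The link-counting idea can be sketched as a short iterative computation in plain Python. This is a simplified textbook formulation, not the code from the demo: the damping factor of 0.85, the iteration count, and the four-page link graph are assumptions made for illustration, and every page is assumed to have at least one outgoing link.

```python
def pagerank(links, iterations=50, damping=0.85):
    """Simplified PageRank over a dict mapping page -> list of outgoing links.

    Assumes every page has at least one outgoing link (no dangling pages).
    """
    pages = list(links)
    n = len(pages)
    ranks = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Each page splits its current rank evenly among the pages it links to.
        contribs = {p: 0.0 for p in pages}
        for page, outlinks in links.items():
            share = ranks[page] / len(outlinks)
            for target in outlinks:
                contribs[target] += share
        # New rank: a small baseline plus the damped sum of contributions.
        ranks = {p: (1 - damping) / n + damping * contribs[p] for p in pages}
    return ranks


# Hypothetical four-page web: "c" is linked to by every other page.
links = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank(links)
most_important = max(ranks, key=ranks.get)
```

Every iteration reads the same link structure again, which is the access pattern that benefits from keeping the dataset in memory.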
  9. Spark Streaming:
     - For stream processing.
     - Continuously executes various parallel operations on an input stream of data.
     - The system receives continuous data and divides it into batches. Each batch is treated and processed as an RDD.
     GraphX:
     - Distributed graph system.
     - Designed to efficiently execute graph algorithms using Spark's parallel, in-memory computation framework.
     MLBase:
     - The goal of MLBase is to make distributed machine learning easy.
     BlinkDB:
     - Approximate query engine.
     - Allows a trade-off between accuracy and response time.
     - Highly interactive on very large datasets.
     - In the process of being deployed at Facebook.
     - AMPLab has demonstrated how complex queries on 17 TB of data (running on a 100-node cluster) can be completed in less than 2 seconds.
     - You specify queries with a time bound:
       SELECT avg(SessionTime) FROM tblSession WHERE UserGender = 'MALE' WITHIN 2 SECONDS
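The batching step described for Spark Streaming can be sketched in plain Python. This is a toy model, not the Spark Streaming API: `micro_batches`, the one-second batch width, and the sample events are assumptions made for illustration. It shows a sequence of timestamped events being divided into fixed-width batches, each of which is then processed as one small dataset.

```python
from collections import defaultdict


def micro_batches(events, batch_seconds):
    """Group (timestamp, value) events into fixed-width time batches.

    Hypothetical helper: returns the batches in time order, skipping
    intervals in which no event arrived.
    """
    batches = defaultdict(list)
    for timestamp, value in events:
        batches[int(timestamp // batch_seconds)].append(value)
    return [batches[key] for key in sorted(batches)]


# Hypothetical events: (arrival time in seconds, payload).
events = [(0.2, "a"), (0.9, "b"), (1.1, "a"), (2.5, "c"), (2.7, "a")]
batches = micro_batches(events, batch_seconds=1)

# Each batch is processed on its own, here with a simple count.
counts_per_batch = [len(batch) for batch in batches]
```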
  10. Components:
     - Interpreter: Actually the Scala command line (interpreter), modified for Spark.
     - Hadoop I/O: For reading from and writing to HDFS.
     - Standalone: Custom resource manager.
     - Operators: Map, join, group by, etc. on RDDs.
     - Networking: Replication, caching, graph.
     - Block Manager: Very simple key-value store that is used as a cache.
     - Broadcaster: Sending and receiving events, heartbeats, etc.
  11. Used by a majority of Fortune 50 companies.