What is Distributed Computing, Why we use Apache Spark


In this talk we introduce the notion of distributed computing, then we tackle the advantages of Spark.

The Spark core content is very tiny because the whole explanation has been done live using a Spark Notebook (https://github.com/andypetrella/spark-notebook/blob/geek/conf/notebooks/Geek.snb).

This talk has been given together by @xtordoir and myself at the University of Liège, Belgium.


  1. BigData, newborn technologies evolving fast. Why Apache Spark outruns Apache Hadoop
     Andy Petrella, Nextlab
     Xavier Tordoir, SilicoCloud
  2. Who are we?
     Andy (@Noootsab): @NextLab_be owner, @SparkNotebook creator, @Wajug co-driver, @Devoxx4Kids organizer. Maths & CS. Data lover: geo, open, massive. Fool.
     Xavier (@xtordoir): SilicoCloud -> Physics -> Data analysis -> genomics -> scalable systems -> ...
  3. So what...
     Part I
     ● What
       ○ distributed resources
       ○ data
       ○ managers
     ● Why:
       ○ fastest
       ○ smartest
       ○ biggest
     ● How:
       ○ Map Reduce
       ○ Limitations
       ○ Extensions
     PART II
     ● Spark
       ○ Model
       ○ Caching and lineage
       ○ Master and Workers
       ○ Core example
     ● Beyond Processing
       ○ Streaming
       ○ SQL
       ○ GraphX
       ○ MLlib
       ○ Example
     ● Use cases
       ○ Parallel batch processing of timeseries
       ○ ADAM
  4. Part I: The Distributed Age
  5. What is a distributed environment?
     Computations need three kinds of resources:
     ● CPU
     ● MEM
     ● Data storage
     However, it's hard to extend each of them at will on a single machine.
  6. What is a distributed environment?
     Lacking one of these will result in higher response times or reduced accuracy. Unfortunately, it doesn't matter how parallelized the algorithm is or how optimized the computations are.
     If the solution can't be inside, it must be outside.
  7. What is a distributed environment?
  8. Distributed File System
     You have 100 nodes in your cluster, but only 1 dataset. Will you replicate it on all nodes? Extended case: what if your dataset is 1 zettabyte (10⁹ TB)?
     The only solution:
     ● split the file across nodes
     ● adapt the algorithm to access local data subsets
  9. HDFS towards Tachyon
     Hadoop Distributed File System implements GoogleFS: store and read files split and replicated across nodes.
     1 ZB file = 8E12 x 128 MB files
     IOPS are expensive and require more CPU clocks than DRAM access.
     Hence... Tachyon: a memory-centric distributed file system.
  10. Management
     Nodes will fail, jobs cannot. We need resilience.
     Resources are generally fewer than the algorithms require. We need scheduling.
     The requirements fluctuate. We need elasticity.
  11. Mesos and Marathon
     Mesos: highly available cluster manager.
     Nodes: attach or remove them on the fly.
     Nodes offer resources; applications accept them.
     Node crash: the application restarts the assigned tasks.
     Marathon: meta application on Mesos.
     Application crash: automatically restarted on a different node.
  12. Why: for everybody and now?
     Fastest:
     1. Time to result
     2. Near real-time processing
  13. Why for everybody and now
     Runtime is smaller, dev lifecycle is shorter → no synchronization hell.
     It can even be really interactive → console or notebook tools.
  14. Why for everybody and now
     No bottlenecks → newly arriving data are readily available for processing.
     This opens the door for online models!
  15. Why for everybody and now
     Smartest: train more and more models; ensembling lots of them is no longer a problem.
     More complex modelling can be tackled if required.
  16. Why for everybody and now
     Reaching a higher level of accuracy is tricky and might require lots and lots of models. Running a model takes quite some time, especially if the data has to be read every single time.
     Example: the Netflix contest winner (AT&T Labs) ensembled 500 models to gain 10% accuracy. Although in 2009 it wasn't possible to use it in production, today this could change.
  17. Why for everybody and now
     Biggest: no need for sampling big datasets.
     …
     That's it!
  18. How!?
     Google papers stimulated the open software community, hence competitive tools now exist.
     In the area of computation in distributed environments, there are two disruptive papers:
     ● Google's MapReduce
     ● Berkeley's Spark
  19. How!?
     MapReduce (Google white paper, 2004): a programming model for distributed data-intensive computations.
     It helps deal with parallelization, fault tolerance, data distribution, and load balancing.
  20. How!?
     Functions:
     Map ≅ transform data to key-value pairs
     Reduce ≅ aggregate key-value pairs per key (e.g. sum, max, count)
     Mappers and Reducers are sent to the data location (nodes).
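To make the Map and Reduce roles concrete, here is a minimal word-count sketch using plain Scala collections (not the Hadoop API); the input lines are made up for illustration.

```scala
// Word count, the canonical MapReduce example, on plain Scala collections.
val lines = Seq("to be or not to be", "to do or not to do")

// Map: transform each record into (key, value) pairs, here (word, 1).
val pairs = lines.flatMap(_.split(" ")).map(word => (word, 1))

// Reduce: aggregate the values per key, here by summing the 1s.
val counts = pairs.groupBy(_._1).mapValues(_.map(_._2).sum)
// counts: Map(to -> 4, be -> 2, or -> 2, not -> 2, do -> 2)
```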
  21. How!?
     Map
     Reduce: apply a binary associative operator on all elements.
     Image from RxJava: https://github.com/ReactiveX/RxJava/wiki/Transforming-Observables
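A tiny illustration of reducing with a binary associative operator; associativity is what lets partial results computed on different nodes be combined in any grouping.

```scala
val xs = Seq(1, 2, 3, 4, 5)

// Addition is binary and associative, so the grouping of partial sums is free.
val total   = xs.reduce(_ + _)                    // 15
val biggest = xs.reduce((a, b) => math.max(a, b)) // 5
```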
  22. How!?
     The Hadoop implementation has some limitations.
     Mappers and Reducers ship functions to data while Java is not a functional language ⇒ composability is difficult and more IO/network operations are required.
     Iterative algorithms (e.g. stochastic gradient) have to read the data at each step (while the data has not changed, only the parameters).
  23. How!?
     MapReduce on steroids:
     I) Functional paradigm:
     - the process is built lazily based on simple concepts
     - Map and Reduce are two of them
     II) Cache data in memory. No more IO.
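A minimal sketch of those two ideas in Spark's Scala API, assuming an existing SparkContext `sc` and a hypothetical input path.

```scala
// Transformations only build a lazy recipe; nothing is executed yet.
val points = sc.textFile("hdfs:///data/points.csv")
  .map(_.split(",").map(_.toDouble))
  .cache()                          // ask Spark to keep the result in memory

val n1 = points.count()             // first action: reads HDFS and fills the cache
val n2 = points.count()             // second action: served from memory, no more IO
```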
  24. So what...
     Part I
     ● What
       ○ distributed resources
       ○ data
       ○ managers
     ● Why:
       ○ fastest
       ○ smartest
       ○ biggest
     ● How:
       ○ Map Reduce
       ○ Limitations
       ○ Extensions
     PART II
     ● Spark
       ○ Model
       ○ Caching and lineage
       ○ Master and Workers
       ○ Core example
     ● Beyond Processing
       ○ Streaming
       ○ SQL
       ○ GraphX
       ○ MLlib
       ○ Example (notebook)
     ● Use cases
       ○ Parallel batch processing of timeseries
       ○ ADAM
  25. Part II: Spark to the Rescue
  26. RDDs
     Think of an RDD[T] as an immutable, distributed collection of objects of type T.
     • Resilient => can be reconstructed in case of failure
     • Distributed => transformations are parallelizable operations
     • Dataset => data loaded and partitioned across cluster nodes (executors)
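A minimal RDD example, assuming an existing SparkContext `sc`; every transformation returns a new immutable RDD, and only the final action triggers work on the cluster.

```scala
val numbers = sc.parallelize(1 to 1000, 8)   // RDD[Int] spread over 8 partitions
val squares = numbers.map(n => n * n)        // immutable: a new RDD, `numbers` is untouched
val evens   = squares.filter(_ % 2 == 0)     // still lazy, nothing has run yet

println(evens.count())                       // action: triggers the job, prints 500
```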
  27. RDD[T]
     Data distribution hierarchy: RDD[T] -> Executors -> Partitions -> Elements.
     (Diagram: groups of elements such as [x1, x2], [x8, x5, x6], [x14, x13], ... form partitions spread over Executor 1 to Executor 4.)
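The hierarchy can be inspected directly; this sketch (again assuming `sc`) shows elements grouped into partitions, the units handed out to executors.

```scala
val rdd = sc.parallelize(Seq(1, 2, 3, 4, 5, 6, 7, 8), 4)

println(rdd.partitions.length)    // 4 partitions

// glom() turns each partition into an Array, so we can see the grouping.
rdd.glom()
   .collect()
   .foreach(p => println(p.mkString("[", ", ", "]")))
// e.g. [1, 2]  [3, 4]  [5, 6]  [7, 8]
```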
  28. Execution
     Execution is split into fundamental units: Tasks.
     Tasks running in parallel are grouped into Stages.
  29. Execution
     (Diagram: Core1, Core2 and Core3 each run a Task0/Task1/Task2 (read/process/write) in every one of Stage0, Stage1 and Stage2.)
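As an illustration of where stage boundaries come from, the hypothetical job below has two stages because `reduceByKey` needs a shuffle; each stage then runs one task per partition.

```scala
val counts = sc.textFile("hdfs:///data/corpus.txt", 3)  // hypothetical path, 3 partitions
  .flatMap(_.split("\\s+"))
  .map(word => (word, 1))
  .reduceByKey(_ + _)   // shuffle boundary: map-side stage / reduce-side stage

counts.count()          // running the action shows both stages in the Spark UI
```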
  30. Master and Workers
  31. Spark Streaming
     When you have big fat streams behaving as one single collection.
     (Diagram: over time t, a DStream[T] is a sequence of RDD[T]s.)
     DStreams: Discretized Streams (= sequence of RDDs)
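A minimal DStream sketch with the Spark Streaming API, assuming an existing SparkContext `sc`, 5-second micro-batches and a hypothetical text source on localhost:9999.

```scala
import org.apache.spark.streaming.{Seconds, StreamingContext}

val ssc = new StreamingContext(sc, Seconds(5))        // one RDD produced every 5 seconds
val lines = ssc.socketTextStream("localhost", 9999)   // DStream[String]

lines.flatMap(_.split(" "))
     .map(word => (word, 1))
     .reduceByKey(_ + _)
     .print()                                         // applied to every micro-batch RDD

ssc.start()
ssc.awaitTermination()
```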
  32. Spark SQL
     Mapping: RDD -> "table", element field -> "column"
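A sketch of that mapping with the Spark 1.3+ DataFrame API, assuming an existing SparkContext `sc`; the `Person` case class and its data are made up.

```scala
import org.apache.spark.sql.SQLContext

case class Person(name: String, age: Int)

val sqlContext = new SQLContext(sc)
import sqlContext.implicits._

// The RDD becomes a "table", each case-class field a "column".
val people = sc.parallelize(Seq(Person("Ada", 36), Person("Alan", 41))).toDF()
people.registerTempTable("people")

sqlContext.sql("SELECT name FROM people WHERE age > 40").show()
```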
  33. MLlib: Distributed ML
     Classification
     ● linear SVM, logistic regression, classification trees, naive Bayes models
     Regression
     ● SVM, regression trees, linear regression (regularized)
     Clustering & dimensionality reduction
     ● singular value decomposition, PCA, k-means clustering
     "The library to teach them all"
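For example, training one of the listed classifiers: a minimal logistic-regression sketch with MLlib, assuming `sc` and two made-up training points.

```scala
import org.apache.spark.mllib.classification.LogisticRegressionWithSGD
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint

val training = sc.parallelize(Seq(
  LabeledPoint(1.0, Vectors.dense(2.0, 1.5)),
  LabeledPoint(0.0, Vectors.dense(-1.0, -0.5))
))

val model = LogisticRegressionWithSGD.train(training, 100)  // 100 iterations
println(model.predict(Vectors.dense(1.0, 1.0)))             // predicted class: 0.0 or 1.0
```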
  34. GraphX
     Connecting the dots: graph processing at scale.
     > Take edges
     > Link nodes
     > Combine/send messages
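A tiny GraphX sketch, assuming `sc`; the vertices and edges are made up, and PageRank stands in here for the "combine/send messages" step.

```scala
import org.apache.spark.graphx.{Edge, Graph}

val vertices = sc.parallelize(Seq((1L, "alice"), (2L, "bob"), (3L, "carol")))
val edges    = sc.parallelize(Seq(Edge(1L, 2L, "follows"), Edge(2L, 3L, "follows")))

val graph = Graph(vertices, edges)          // link the nodes through the edges
println(graph.numEdges)                     // 2

val ranks = graph.pageRank(0.001).vertices  // iterative message passing between vertices
ranks.collect().foreach(println)
```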
  35. Use case examples
     - Parallel batch processing of time series
     - Bayesian networks in financial markets
     - IoT platform (Lambda architecture)
     - OpenStreetMap city topology classification
     - Markov chains in Land Use/Land Cover prediction
     - Genomics: ADAM
  36. Genomics
     Biological systems are very complex.
     One human sequence is 60Gb.
  37. ADAM
     Credits: AmpLab (UC Berkeley)
  38. Stratification using 1000Genomes
     http://www.1000genomes.org/
     ref: http://upload.wikimedia.org/wikipedia/en/e/eb/Genetic_Variation.jpg
  39. Machine Learning model
     Clustering: KMeans
     ref: http://en.wikipedia.org/wiki/K-means_clustering
  40. Machine Learning model
     MLlib, KMeans
     MLlib:
     ● machine learning algorithms
     ● data structures (e.g. Vector)
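A minimal KMeans sketch with MLlib, assuming `sc`; the points are made up, and the encoding of genotypes into MLlib Vectors used in the talk's notebook is not shown here.

```scala
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

val points = sc.parallelize(Seq(
  Vectors.dense(0.0, 0.1), Vectors.dense(0.2, 0.0),   // one blob near the origin
  Vectors.dense(9.0, 8.8), Vectors.dense(8.7, 9.2)    // another blob far away
))

val model = KMeans.train(points, 2, 20)   // k = 2 clusters, at most 20 iterations
points.collect().foreach { p =>
  println(s"$p -> cluster #${model.predict(p)}")
}
```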
  41. Mashup: prediction
     Sample [NA20332] is in cluster #0 for population Some(ASW)
     Sample [NA20334] is in cluster #2 for population Some(ASW)
     Sample [HG00120] is in cluster #2 for population Some(GBR)
     Sample [NA18560] is in cluster #1 for population Some(CHB)
  42. Mashup
             #0   #1   #2
     GBR      0    0   89
     ASW     54    0    7
     CHB      0   97    0
  43. Cluster
     40 m3.xlarge instances
     160 cores + 600 GB of memory
  44. Eggo project (public genomics data in ADAM format on S3)
     We…
     1000genomes in ADAM format on S3.
     Open Source GA4GH Interop services implementation.
     Machine learning on 1000genomes.
     Genomic data and distributed computing.
  45. The end (of the slides)
     Thanks for your attention!
     Xavier Tordoir (xavier@silicocloud.eu)
     Andy Petrella (andy.petrella@nextlab.be)
