Spark devoxx2014
  1. 1. Lightning Fast Big Data Analytics with Apache Spark. Andy Petrella (@noootsab), Big Data Hacker; Gerard Maas (@maasg), Data Processing Team Lead #devoxx #sparkvoxx @noootsab @maasg
  2. 2. Agenda What is Spark? Spark Foundation: The RDD Demo Ecosystem Examples Resources #devoxx #sparkvoxx @noootsab @maasg
  3. 3. Memory, Network, CPUs (and don’t forget to throw some disks in the mix) #devoxx #sparkvoxx @noootsab @maasg
  4. 4. What is Spark? Spark is a fast and general engine for large-scale distributed data processing.
     val file = spark.textFile("hdfs://...")
     val counts = file.flatMap(line => line.split(" "))
                      .map(word => (word, 1))
                      .reduceByKey(_ + _)
     counts.saveAsTextFile("hdfs://...")
     Fast / Functional / Growing Ecosystem #devoxx #sparkvoxx @noootsab @maasg
  5. 5. Spark: A Strong Open Source Project. 27/02 Apache top-level project, 30/05 Spark 1.0.0 released, 11/09 Spark 1.1.0 released. 42 contributors → 118 contributors → 176 contributors (#commits, src: github.com/apache/spark) #devoxx #sparkvoxx @noootsab @maasg
  6. 6. Compared to Map-Reduce
     public class WordCount {
       public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
         private final static IntWritable one = new IntWritable(1);
         private Text word = new Text();
         public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
           String line = value.toString();
           StringTokenizer tokenizer = new StringTokenizer(line);
           while (tokenizer.hasMoreTokens()) {
             word.set(tokenizer.nextToken());
             context.write(word, one);
           }
         }
       }
       public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
         public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
           int sum = 0;
           for (IntWritable val : values) {
             sum += val.get();
           }
           context.write(key, new IntWritable(sum));
         }
       }
       public static void main(String[] args) throws Exception {
         Configuration conf = new Configuration();
         Job job = new Job(conf, "wordcount");
         job.setOutputKeyClass(Text.class);
         job.setOutputValueClass(IntWritable.class);
         job.setMapperClass(Map.class);
         job.setReducerClass(Reduce.class);
         job.setInputFormatClass(TextInputFormat.class);
         job.setOutputFormatClass(TextOutputFormat.class);
         FileInputFormat.addInputPath(job, new Path(args[0]));
         FileOutputFormat.setOutputPath(job, new Path(args[1]));
         job.waitForCompletion(true);
       }
     }
     Spark:
     val file = spark.textFile("hdfs://...")
     val counts = file.flatMap(line => line.split(" "))
                      .map(word => (word, 1))
                      .reduceByKey(_ + _)
     counts.saveAsTextFile("hdfs://...")
     #devoxx #sparkvoxx @noootsab @maasg
  7. 7. The Big Idea... Express computations in terms of operations on a data set. Spark Core Concept: RDD => Resilient Distributed Dataset Think of an RDD as an immutable, distributed collection of objects • Resilient => Can be reconstructed in case of failure • Distributed => Transformations are parallelizable operations • Dataset => Data loaded and partitioned across cluster nodes (executors) RDDs are memory-intensive. Caching behavior is controllable. #devoxx #sparkvoxx @noootsab @maasg
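A minimal sketch of these properties, assuming a SparkContext sc and the placeholder HDFS path from the slides; the chosen storage level is just one of the available options:
     import org.apache.spark.storage.StorageLevel

     val lines = sc.textFile("hdfs://...")          // placeholder path, as on the slides
     val words = lines.flatMap(_.split(" "))        // transformation: lazily defines a new RDD

     // Caching is controllable: cache() keeps partitions in memory only,
     // persist(...) lets you pick another storage level.
     words.persist(StorageLevel.MEMORY_AND_DISK)

     words.count()                                  // first action: computes the RDD and fills the cache
     words.count()                                  // second action: reads the already cached partitions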
  8. 8. RDDs [diagram: RDD partitions spread over the executors of a Spark cluster, on top of HDFS] #devoxx #sparkvoxx @noootsab @maasg
  9. 9. RDDs .textFile("...") RDD Partitions #devoxx #sparkvoxx @noootsab @maasg
  10. 10. RDDs .textFile("...").flatMap(l => l.split(" ")) #devoxx #sparkvoxx @noootsab @maasg
  11. 11. RDDs .textFile("...").flatMap(l => l.split(" ")).map(w => (w,1)) [diagram: (word, 1) pairs in each partition] #devoxx #sparkvoxx @noootsab @maasg
  12. 12. RDDs .textFile("...").flatMap(l => l.split(" ")).map(w => (w,1)) .reduceByKey(_ + _) [diagram: counts per key within each partition] #devoxx #sparkvoxx @noootsab @maasg
  13. 13. RDDs .textFile("...").flatMap(l => l.split(" ")).map(w => (w,1)) .reduceByKey(_ + _) [diagram: counts being shuffled across partitions] #devoxx #sparkvoxx @noootsab @maasg
  14. 14. RDDs .textFile("...").flatMap(l => l.split(" ")).map(w => (w,1)) .reduceByKey(_ + _) [diagram: final word counts per partition] #devoxx #sparkvoxx @noootsab @maasg
  15. 15. The Spark Lingo .textFile("...").flatMap(l => l.split(" ")).map(w => (w,1)) .reduceByKey(_ + _) [diagram: the word-count pipeline annotated with the terms] Job, Cluster, Executor, RDD, Partition, Stage, Task #devoxx #sparkvoxx @noootsab @maasg
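To make the partition/task vocabulary concrete, a tiny sketch on made-up data (assuming a SparkContext sc; the partition count is illustrative):
     // Ask for 4 partitions: the job triggered by an action runs its tasks over them.
     val nums = sc.parallelize(1 to 1000, 4)
     println(nums.partitions.length)                        // -> 4 partitions

     // mapPartitions sees one partition (one task's slice of the data) at a time.
     val perPartition = nums.mapPartitions(it => Iterator(it.size))
     println(perPartition.collect().toList)                 // -> List(250, 250, 250, 250)

     // One action = one job, split by the scheduler into stages of parallel tasks.
     println(nums.map(_ * 2).reduce(_ + _))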
  16. 16. Spark: RDD Operations [diagram: input data (HDFS, text/sequence files) → SparkContext → RDD → … → RDD → output data (HDFS, text/sequence files, Cassandra)] #devoxx #sparkvoxx @noootsab @maasg
  17. 17. Transformations Inner Manipulations > map, flatMap, filter, distinct Cross RDD > union, subtract, intersection, join, cartesian Structural reorganization (Expensive) > groupBy, aggregate, sort Tuning > coalesce, repartition #devoxx #sparkvoxx @noootsab @maasg
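A small sketch chaining a few of these transformations on made-up in-memory data (assuming a SparkContext sc); nothing executes until an action is called:
     val nums  = sc.parallelize(1 to 10)
     val other = sc.parallelize(5 to 15)

     // Inner manipulations
     val even    = nums.filter(_ % 2 == 0)
     val doubled = nums.map(_ * 2)

     // Cross RDD
     val both   = nums.union(other).distinct()
     val common = nums.intersection(other)

     // Structural reorganization (shuffles data across the cluster, hence expensive)
     val byParity = nums.groupBy(n => n % 2)

     // Tuning the partitioning
     val fewer = both.coalesce(2)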
  18. 18. Actions Fetch Data > collect, take, first, takeSample Aggregate Results > reduce, count, countByKey Output > foreach, foreachPartition, save* #devoxx #sparkvoxx @noootsab @maasg
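And a sketch of the actions that actually trigger execution, again on made-up data (the output path is a placeholder in the style of the slides):
     import org.apache.spark.SparkContext._   // pair-RDD functions (implicit in the shell and in newer Spark versions)

     val words = sc.parallelize(Seq("spark", "devoxx", "spark", "rdd"))
     val pairs = words.map(w => (w, 1))

     // Fetch data back to the driver
     val all  = pairs.collect()
     val some = pairs.take(2)

     // Aggregate results
     val total  = pairs.map(_._2).reduce(_ + _)
     val perKey = pairs.countByKey()

     // Output
     pairs.foreach(println)                   // runs on the executors
     pairs.saveAsTextFile("hdfs://...")       // placeholder path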
  19. 19. RDD Lineage. Each RDD keeps track of its parent. This is the basis for DAG scheduling and fault recovery.
     val file = spark.textFile("hdfs://...")
     val wordsRDD = file.flatMap(line => line.split(" "))
                        .map(word => (word, 1))
                        .reduceByKey(_ + _)
     val scoreRDD = wordsRDD.map{ case (k, v) => (v, k) }
     [diagram: lineage chain from HadoopRDD through MappedRDD, FlatMappedRDD and ShuffledRDD up to wordsRDD and scoreRDD]
     rdd.toDebugString is your friend #devoxx #sparkvoxx @noootsab @maasg
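For instance, the lineage can be printed directly (same word-count pipeline, placeholder path; the exact output format differs between Spark versions):
     val file  = sc.textFile("hdfs://...")
     val words = file.flatMap(line => line.split(" "))
                     .map(word => (word, 1))
                     .reduceByKey(_ + _)

     // Prints the chain of parent RDDs, i.e. the DAG Spark uses for scheduling
     // and for recomputing lost partitions after a failure.
     println(words.toDebugString)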
  20. 20. Spark has Support for... Java, Scala, Python and R, through APIs, interactive shells and notebooks. The Spark Shell is the best way to start exploring Spark #devoxx #sparkvoxx @noootsab @maasg
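For example, a first interactive session could look like this (a sketch; the data path is a placeholder and sc is predefined by the shell):
     // $ ./bin/spark-shell   <- starts a Scala REPL with a ready-to-use SparkContext named sc
     val lines = sc.textFile("hdfs://...")       // placeholder path
     lines.take(3).foreach(println)              // peek at the data
     lines.filter(_.contains("spark")).count()   // iterate quickly on transformations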
  21. 21. Demo: Exploring and transforming data with the Spark Shell. Acknowledgments: Book data provided by Project Gutenberg (http://www.gutenberg.org/) through https://www.opensciencedatacloud.org/ Cluster computing resources provided by http://www.virdata.com #devoxx #sparkvoxx @noootsab @maasg
  22. 22. #devoxx #sparkvoxx @noootsab @maasg
  23. 23. Agenda What is Spark? Spark Foundation: The RDD Demo Ecosystem Examples Resources #devoxx #sparkvoxx @noootsab @maasg
  24. 24. Ecosystem. Now we know what Spark is! At least, we know its Core, let’s say its SDK. Thanks to its great and enthusiastic community, Spark Core has been used in an ever-growing number of fields, hence the ecosystem is evolving fast #devoxx #sparkvoxx @noootsab @maasg
  25. 25. Higher level primitives ... … or APIs … or the rise of the popolo. If Spark Core is the fold of distributed computing, then we’re going to look at the map, filter, countBy, groupBy, ... #devoxx #sparkvoxx @noootsab @maasg
  26. 26. Spark Streaming. When you have big fat streams behaving as one single collection. [diagram: over time t, a DStream[T] is a sequence of RDD[T] micro-batches] #devoxx #sparkvoxx @noootsab @maasg
  27. 27. Spark Streaming #devoxx #sparkvoxx @noootsab @maasg
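A sketch of the classic streaming word count, assuming a StreamingContext built from the existing sc, a 10-second batch interval and a hypothetical text source on localhost:9999:
     import org.apache.spark.streaming.{Seconds, StreamingContext}
     import org.apache.spark.streaming.StreamingContext._   // pair-DStream functions (implicit in newer versions)

     val ssc   = new StreamingContext(sc, Seconds(10))      // one RDD per 10-second batch
     val lines = ssc.socketTextStream("localhost", 9999)    // hypothetical source

     val counts = lines.flatMap(_.split(" "))
                       .map(word => (word, 1))
                       .reduceByKey(_ + _)                  // applied to each batch's RDD

     counts.print()
     ssc.start()
     ssc.awaitTermination()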
  28. 28. Spark SQL. From SQL to noSQL to SQL … to noSQL. Structured Query Language: we’re not really querying, we’re processing. SQL provides the mathematical (abstraction) structures to manipulate data, and we can optimize: Spark has Catalyst #devoxx #sparkvoxx @noootsab @maasg
  29. 29. Spark SQL #devoxx #sparkvoxx @noootsab @maasg
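A rough sketch in the SQLContext style of the Spark 1.x era the talk targets (the Person case class, the path and the "name,age" file format are all made up for illustration):
     import org.apache.spark.sql.SQLContext

     case class Person(name: String, age: Int)               // illustrative schema

     val sqlContext = new SQLContext(sc)
     import sqlContext.createSchemaRDD                       // implicit RDD -> SchemaRDD conversion (Spark 1.x)

     val people = sc.textFile("hdfs://...")                  // placeholder path, "name,age" lines assumed
                    .map(_.split(","))
                    .map(p => Person(p(0), p(1).trim.toInt))

     people.registerTempTable("people")                      // expose the RDD to SQL
     val adults = sqlContext.sql("SELECT name FROM people WHERE age >= 18")
     adults.collect().foreach(println)                       // Catalyst plans and optimizes the query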
  30. 30. MLlib “The library to teach them all” SciPy, scikit-learn, R, MatLab and co. → learn on one machine (sadly, often one core). SVM, lm, NaiveBayes, PCA, K-Means, ALS, SVD #devoxx #sparkvoxx @noootsab @maasg
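For example, a distributed K-Means run might look roughly like this (placeholder input path; k and the iteration count are arbitrary choices):
     import org.apache.spark.mllib.clustering.KMeans
     import org.apache.spark.mllib.linalg.Vectors

     // Parse a (placeholder) file of space-separated numbers into feature vectors.
     val data = sc.textFile("hdfs://...")
                  .map(line => Vectors.dense(line.split(" ").map(_.toDouble)))
                  .cache()

     // Train across the cluster instead of on one machine / one core.
     val model = KMeans.train(data, 3, 20)   // k = 3 clusters, 20 iterations

     model.clusterCenters.foreach(println)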
  31. 31. GraphX. Connecting the dots. Graph processing at scale. > Take edges > Add some nodes > Combine = Send messages (Pregel) #devoxx #sparkvoxx @noootsab @maasg
  32. 32. GraphX Connecting the dots Graph processing at scale. > Take edges > Link nodes > Combine/Send messages #devoxx #sparkvoxx @noootsab @maasg
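A toy sketch of that recipe (the vertices, edges and the choice of PageRank are illustrative, not from the slides):
     import org.apache.spark.graphx.{Edge, Graph}

     // Take edges ...
     val edges = sc.parallelize(Seq(
       Edge(1L, 2L, "follows"),
       Edge(2L, 3L, "follows"),
       Edge(3L, 1L, "follows")))

     // ... add some nodes ...
     val vertices = sc.parallelize(Seq(
       (1L, "andy"), (2L, "gerard"), (3L, "devoxx")))

     val graph = Graph(vertices, edges)

     // ... combine / send messages: Pregel-style algorithms such as PageRank.
     graph.pageRank(0.001).vertices.collect().foreach(println)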
  33. 33. ADAM. The new kid on the block in the Spark community (with the uncovered Thunder). A game-changing library for processing DNA, genotypes, variants and co. Comes with the right stack for processing … a huge legacy of vital data #devoxx #sparkvoxx @noootsab @maasg
  34. 34. Tooling (NoIDE). Besides the classical Eclipse, IntelliJ IDEA, NetBeans, Sublime Text and family! An IDE is not enough, because we are not only crafting software or services. Spark is for data analysis, and data scientists need > interactivity (exploration) > reproducibility (environment, data and logic) > shareability (results) #devoxx #sparkvoxx @noootsab @maasg
  35. 35. ISpark Spark-Shell backend for IPython (Worksheet for data analysts) #devoxx #sparkvoxx @noootsab @maasg
  36. 36. Zeppelin. Well-shaped notebook based on Kibana, offering Spark-dedicated features > Multiple languages (Scala, SQL, markdown, shell) > Dynamic forms (generating inputs) > Data visualization (and export). Check the website! #devoxx #sparkvoxx @noootsab @maasg
  37. 37. Spark Notebook. A Scala-Notebook fork, enhanced for Spark peculiarities. Full Scala, Akka and RxScala. Features include: > Multiple languages (Scala, SQL, markdown, JavaScript) > Data visualization > Spark work tracking. Try it: curl https://raw.githubusercontent.com/andypetrella/spark-notebook/spark/run.sh | bash -s dev #devoxx #sparkvoxx @noootsab @maasg
  38. 38. Databricks Cloud. The amazing product crafted by the company behind Spark! Cannot say more than: this product will be amazing. Fully collaborative, dashboard creation and publication. Register for a beta account (still eagerly waiting for mine…
