
Spark: Taming Big Data

Introduction to Spark and its many modules.


  1. Taming Big Data (Leonardo Gamas)
  2. Leonardo Gamas, Software Engineer @ JusBrasil (@leogamas)
  3. What is Spark? "Apache Spark™ is a fast and general engine for large-scale data processing."
  4. One engine to rule them all?
  5. Spark is Fast
  6. Spark is Integrated
  7. Spark is simple

     # Word count in a few lines of PySpark
     file = spark.textFile("hdfs://...")
     counts = (file.flatMap(lambda line: line.split())
                   .map(lambda word: (word, 1))
                   .reduceByKey(lambda a, b: a + b))
  8. Language support
     ● Scala
     ● Java
     ● Python
  9. Community
  10. Community
  11. Who is using Spark?
  12. RDD: Resilient Distributed Dataset
      "Fault-tolerant collection of elements that can be operated on in parallel"
  13. Dataset
      ● Transformations
        ○ RDD => RDD
        ○ Lazy
      ● Actions
        ○ RDD => Stuff
        ○ Not lazy
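      A minimal sketch of the lazy/eager split, assuming a live SparkContext
      named sc (as in spark-shell); the data is made up:

        // Transformations only build up a lineage graph; nothing runs yet.
        val nums = sc.parallelize(1 to 1000000)     // RDD[Int]
        val evens = nums.filter(_ % 2 == 0)         // lazy: RDD => RDD
        val squares = evens.map(n => n.toLong * n)  // lazy: RDD => RDD

        // The action triggers the actual computation and returns "stuff" (a Long).
        val total = squares.count()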
  14. RDD Transformations
  15. RDD Transformations
      ● map(func)
      ● filter(func)
      ● flatMap(func)
      ● mapPartitions(func)
      ● mapPartitionsWithIndex(func)
      ● sample(withReplacement, fraction, seed)
      ● union(otherDataset)
      ● intersection(otherDataset)
      ● distinct([numTasks])
      ● groupByKey([numTasks])
      ● reduceByKey(func, [numTasks])
      ● aggregateByKey(zeroValue)(seqOp, combOp, [numTasks])
      ● sortByKey([ascending], [numTasks])
      ● join(otherDataset, [numTasks])
      ● cogroup(otherDataset, [numTasks])
      ● cartesian(otherDataset)
      ● pipe(command, [envVars])
      ● coalesce(numPartitions)
      ● repartition(numPartitions)
  16. RDD Actions
  17. RDD Actions
      ● reduce(func)
      ● collect()
      ● count()
      ● first()
      ● take(n)
      ● takeSample(withReplacement, num, [seed])
      ● takeOrdered(n, [ordering])
      ● saveAsTextFile(path)
      ● saveAsSequenceFile(path)
      ● saveAsObjectFile(path)
      ● countByKey()
      ● foreach(func)
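      A short sketch combining a few of the transformations and actions listed
      above, again assuming a SparkContext sc; the data is made up:

        val a = sc.parallelize(Seq("spark", "hadoop", "spark"))
        val b = sc.parallelize(Seq("hadoop", "mesos"))

        // Transformations (lazy): union, map, reduceByKey
        val counted = a.union(b).map(w => (w, 1)).reduceByKey(_ + _)

        // Actions (eager): collect, count
        counted.collect().foreach(println) // e.g. (spark,2), (hadoop,2), (mesos,1)
        println(counted.count())           // 3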
  18. Distributed
  19. Distributed
  20. Resilient
  21. Resilient
      "RDDs track lineage information that can be used to efficiently recompute lost data."
  22. Resilient
  23. Resilient
  24. Resilient
  25. Resilient
  26. Resilient
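      That lineage is inspectable: toDebugString (a real method on RDD) prints
      the chain of parent RDDs Spark would replay to rebuild lost partitions.
      The pipeline below reuses the word count from slide 7:

        val counts = sc.textFile("hdfs://...")
          .flatMap(_.split(" "))
          .map(word => (word, 1))
          .reduceByKey(_ + _)

        // Prints something like: ShuffledRDD <- MapPartitionsRDD <- ... <- HadoopRDD
        println(counts.toDebugString)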
  27. RDDs are cacheable (no need to access disk twice)
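      A minimal caching sketch, assuming a SparkContext sc; the path is elided
      as elsewhere in the deck:

        val logs = sc.textFile("hdfs://...")
        val errors = logs.filter(_.contains("ERROR")).cache() // keep in memory

        errors.count() // first action: reads from disk, then caches
        errors.count() // second action: served from memory, no second disk scan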
  28. RDDs are immutable
  29. RDD Internals
      ● Partitions (Splits)
      ● Dependencies
      ● Action (how to retrieve data)
      ● Location hint (pref)
      ● Partitioner
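      A hypothetical custom RDD, just to show where those five internals live
      (signatures follow org.apache.spark.rdd.RDD in the 1.x API):

        import org.apache.spark.{Partition, SparkContext, TaskContext}
        import org.apache.spark.rdd.RDD

        class RangeRDD(sc: SparkContext, n: Int, slices: Int)
            extends RDD[Int](sc, Nil) { // Nil = dependencies (no parents here)

          // Partitions (splits): how the dataset is divided
          override protected def getPartitions: Array[Partition] =
            Array.tabulate[Partition](slices)(i => new Partition { override def index: Int = i })

          // Action: how to retrieve the data of a single partition
          override def compute(split: Partition, context: TaskContext): Iterator[Int] = {
            val size = math.ceil(n.toDouble / slices).toInt
            val start = split.index * size
            (start until math.min(start + size, n)).iterator
          }

          // Location hint (pref): none for generated data
          override protected def getPreferredLocations(split: Partition): Seq[String] = Nil

          // Partitioner: the inherited default is None (not a key-value RDD)
        }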
  30. Broadcast Variables
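      A minimal broadcast-variable sketch, assuming a SparkContext sc; it ships
      a read-only value to each executor once instead of once per task:

        val countryNames = sc.broadcast(Map("BR" -> "Brazil", "US" -> "United States"))

        val codes = sc.parallelize(Seq("BR", "US", "BR"))
        val named = codes.map(code => countryNames.value.getOrElse(code, "unknown"))
        named.collect() // Array(Brazil, United States, Brazil)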
  31. Deployment
      ● Mesos
      ● YARN
      ● Standalone
  32. Spark Projects
  33. Spark Projects
  34. Spark Projects: Spark Core
  35. Spark Projects: Spark Core, Spark SQL
  36. Spark Projects: Spark Core, Spark SQL, Spark Streaming
  37. Spark Projects: Spark Core, Spark SQL, Spark Streaming, Spark MLlib
  38. Spark Projects: Spark Core, Spark SQL, Spark Streaming, Spark MLlib, Spark GraphX
  39. Spark SQL

      // Define the schema with a case class
      case class Person(name: String, age: Int)

      // Map an RDD of text lines into Person records
      val people = sc.textFile("...")
        .map(_.split(","))
        .map(p => Person(p(0), p(1).trim.toInt))

      // Register as a table
      people.registerTempTable("people")

      // Query with SQL
      val teenagers = sqlContext.sql(
        "SELECT name FROM people WHERE age >= 13 AND age <= 19")
  40. Spark SQL - Hive

      val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)

      // Create table and load data
      sqlContext.sql("CREATE TABLE IF NOT EXISTS src (key INT, value STRING)")
      sqlContext.sql("LOAD DATA LOCAL INPATH '...' INTO TABLE src")

      // Queries are expressed in HiveQL
      sqlContext.sql("FROM src SELECT key, value").foreach(println)
  41. Spark Streaming
  42. Spark Streaming
  43. Spark Streaming

      val ssc = new StreamingContext(conf, Seconds(1))
      val lines = ssc.socketTextStream("localhost", 9999)

      val words = lines.flatMap(_.split(" "))
      val pairs = words.map(word => (word, 1))
      val wordCounts = pairs.reduceByKey(_ + _)

      wordCounts.print() // at least one output operation is required before start()

      ssc.start()
      ssc.awaitTermination()
  44. Spark Streaming
  45. Spark Streaming

      // Keep a running count per key across batches
      // (updateStateByKey requires checkpointing, e.g. ssc.checkpoint("..."))
      def updateFunction(newValues: Seq[Int],
                         runningCount: Option[Int]): Option[Int] = {
        val newCount = runningCount.getOrElse(0) + newValues.sum // add the new values to the previous count
        Some(newCount)
      }

      val runningCounts = pairs.updateStateByKey[Int](updateFunction _)
  46. Spark Streaming
  47. Spark MLlib
      MLlib is Apache Spark's scalable machine learning library.
  48. MLlib - Algorithms
      ● Linear Algebra
      ● Basic Statistics
      ● Classification and Regression
        ○ Linear models (SVM, Logistic Regression, Linear Regression)
        ○ Decision trees
        ○ Naive Bayes
  49. MLlib - Algorithms
      ● Collaborative filtering (ALS)
      ● Clustering (K-Means)
      ● Dimensionality Reduction (SVD and PCA)
      ● Feature extraction and transformation
      ● Optimization (SGD, L-BFGS)
  50. MLlib - K-Means

      from pyspark.mllib.clustering import KMeans

      # parsePoint (not shown on the slide) turns a line into a feature vector
      points = spark.textFile("hdfs://...").map(parsePoint)

      model = KMeans.train(points, k=10)
      cluster = model.predict(testPoint)  # testPoint: a feature vector
  51. MLlib - ALS Recommendation

      import org.apache.spark.mllib.recommendation.{ALS, Rating}

      val data = sc.textFile("...")
      val ratings = data.map(_.split(',') match {
        case Array(user, item, rate) =>
          Rating(user.toInt, item.toInt, rate.toDouble)
      })

      // Build the recommendation model using ALS
      val rank = 10
      val numIterations = 20
      val model = ALS.train(ratings, rank, numIterations, 0.01)

      val recommendations = model.recommendProducts(userId, 10) // top 10 for a given userId
  52. MLlib - Naive Bayes

      import org.apache.spark.mllib.classification.NaiveBayes
      import org.apache.spark.mllib.linalg.Vectors
      import org.apache.spark.mllib.regression.LabeledPoint

      val data = sc.textFile("data/mllib/sample_naive_bayes_data.txt")
      val training = data.map { line =>
        val parts = line.split(',')
        LabeledPoint(parts(0).toDouble,
          Vectors.dense(parts(1).split(' ').map(_.toDouble)))
      }

      val model = NaiveBayes.train(training, lambda = 1.0)
  53. GraphX
      GraphX is Apache Spark's API for graphs and graph-parallel computation.
  54. GraphX
  55. GraphX
  56. GraphX
  57. GraphX
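      A minimal property graph built with the GraphX API, assuming a
      SparkContext sc; the vertices and edges are made up:

        import org.apache.spark.graphx.{Edge, Graph}

        val vertices = sc.parallelize(Seq((1L, "alice"), (2L, "bob"), (3L, "carol")))
        val edges = sc.parallelize(Seq(
          Edge(1L, 2L, "follows"), Edge(2L, 3L, "follows"), Edge(3L, 1L, "follows")))

        val graph = Graph(vertices, edges)
        println(graph.numVertices + " vertices, " + graph.numEdges + " edges")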
  58. GraphX - Algorithms
      ● Connected Components
      ● Triangle Count
      ● Strongly Connected Components
      ● PageRank
  59. GraphX - PageRank

      // Run PageRank until convergence at the given tolerance
      val ranks = graph.pageRank(0.0001).vertices

      // Join the ranks with the usernames
      val users = sc.textFile("...").map { line =>
        val fields = line.split(",")
        (fields(0).toLong, fields(1)) // (id, username)
      }
      val ranksByUsername = users.join(ranks).map {
        case (id, (username, rank)) => (username, rank)
      }

      // Print the result
      println(ranksByUsername.collect().mkString("\n"))
  60. Questions?
