
Beyond Shuffling, Tips and Tricks for Scaling Apache Spark updated for Spark 1.6 and Datasets - Data Day Texas 2016

Beyond Shuffling - Tips & Tricks for scaling your Apache Spark programs. This talk walks through a number of common mistakes that can keep our Spark programs from scaling and examines the solutions, as well as general techniques useful for moving beyond a proof of concept to production. It covers topics like effective RDD re-use and considerations for working with key/value data, and finishes up with an introduction to one of Spark's newest features: Datasets.



  1. Beyond Shuffling: tips & tricks for scaling Apache Spark - Data Day Texas 2016
  2. Who am I?
     ● My name is Holden Karau
     ● Preferred pronouns are she/her
     ● I’m a Software Engineer at IBM
     ● Previously Alpine, Databricks, Google, Foursquare & Amazon
     ● Co-author of Learning Spark & Fast Data Processing with Spark
       ○ Co-author of a new book focused on Spark performance coming out next year*
     ● @holdenkarau
     ● SlideShare: http://www.slideshare.net/hkarau
     ● LinkedIn: https://www.linkedin.com/in/holdenkarau
     ● GitHub: https://github.com/holdenk
     ● Spark Videos: http://bit.ly/holdenSparkVideos
  3. What is going to be covered:
     ● What I think I might know about you
     ● RDD re-use (caching, persistence levels, and checkpointing)
     ● Working with key/value data
       ○ Why groupByKey is evil and what we can do about it
     ● Best practices for Spark accumulators*
     ● When Spark SQL can be amazing and wonderful
     ● A brief introduction to Datasets (new in Spark 1.6)
  4. Who I think you wonderful humans are?
     ● Nice* people
     ● Don’t mind pictures of cats
     ● Know some Apache Spark
     ● Want to scale your Apache Spark jobs
     ● Don’t overly mind a grab-bag of topics
     Photo by Lori Erickson
  5. Cat photo from http://galato901.deviantart.com/art/Cat-on-Work-Break-173043455
     Photo from Cocoa Dream
  6. RDD re-use - sadly not magic
     ● If we know we are going to re-use the RDD, what should we do?
       ○ If it fits nicely in memory: caching in memory
       ○ Persisting at another level
         ■ MEMORY_ONLY, MEMORY_ONLY_SER, MEMORY_AND_DISK, MEMORY_AND_DISK_SER
       ○ Checkpointing
     ● Noisy clusters
       ○ The replicated (_2) storage levels & checkpointing can help
     Photo by Richard Gillin
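The re-use options on the RDD re-use slide can be sketched roughly as follows. This is a minimal illustration rather than code from the talk: the object name `ReuseSketch`, the toy data, and the checkpoint directory are assumptions, and it runs against a local master only so that it is self-contained.

```scala
import java.nio.file.Files

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

// Hypothetical example object; not part of the original slides.
object ReuseSketch {
  /** Marks an RDD for re-use, then runs two actions over it. */
  def run(sc: SparkContext, checkpointDir: String): (Long, Long) = {
    // On a real cluster the checkpoint directory should live on a
    // reliable store such as HDFS; a local path is assumed here.
    sc.setCheckpointDir(checkpointDir)

    val parsed = sc.parallelize(1 to 1000, 4).map(x => (x % 10, x.toLong))

    // Serialized storage that can spill to disk when memory is short.
    // On noisy clusters a replicated level such as MEMORY_AND_DISK_SER_2
    // could be used instead.
    parsed.persist(StorageLevel.MEMORY_AND_DISK_SER)
    // checkpoint() has to be requested before the first action on the RDD.
    parsed.checkpoint()

    val total = parsed.values.reduce(_ + _)            // first action: computes and stores
    val distinctKeys = parsed.keys.distinct().count()  // second action: re-uses stored data
    parsed.unpersist()
    (total, distinctKeys)
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("reuse-sketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      val dir = Files.createTempDirectory("reuse-sketch").toString
      println(run(sc, dir))
    } finally {
      sc.stop()
    }
  }
}
```

Persisting before checkpointing means the checkpoint job can read the stored partitions instead of recomputing the lineage a second time.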
  7. Considerations for Key/Value Data
     ● What does the distribution of keys look like?
     ● What type of aggregations do we need to do?
     ● Do we want our data in any particular order?
     ● Are we joining with another RDD?
     ● What’s our partitioner?
       ○ If we don’t have an explicit one: what is the partition structure?
     Photo by eleda 1
  8. What is key skew and why do we care?
     ● Keys aren’t evenly distributed
       ○ Sales by zip code, or records by city, etc.
     ● groupByKey will explode (but it's pretty easy to break)
     ● We can have really unbalanced partitions
       ○ If we have enough key skew sortByKey could even fail
       ○ Stragglers (uneven sharding can make some tasks take much longer)
     Photo by Mitchell Joyce
  9. groupByKey - just how evil is it?
     ● Pretty evil
     ● Groups all of the records with the same key into a single record
       ○ Even if we immediately reduce it (e.g. sum it or similar)
       ○ This can be too big to fit in memory, then our job fails
     ● Unless we are in SQL then happy pandas
     Photo by geckoam
  10. So what does that look like?
      Input records:
        (94110, A, B) (94110, A, C) (10003, D, E) (94110, E, F)
        (94110, A, R) (10003, A, R) (94110, D, R) (94110, E, R)
        (94110, E, R) (67843, T, R) (94110, T, R) (94110, T, R)
      After grouping:
        (67843, T, R) (10003, A, R)
        (94110, [(A, B), (A, C), (E, F), (A, R), (D, R), (E, R), (E, R), (T, R), (T, R)])
  11. Let’s revisit wordcount with groupByKey
        val words = rdd.flatMap(_.split(" "))
        val wordPairs = words.map((_, 1))
        val grouped = wordPairs.groupByKey()
        grouped.mapValues(_.sum)
  12. And now back to the “normal” version
        val words = rdd.flatMap(_.split(" "))
        val wordPairs = words.map((_, 1))
        val wordCounts = wordPairs.reduceByKey(_ + _)
        wordCounts
  13. Let’s see what it looks like when we run the two
      Quick pastebin of the code for the two: http://pastebin.com/CKn0bsqp
        val rdd = sc.textFile("python/pyspark/*.py", 20) // Make sure we have many partitions
        // Evil group by key version
        val words = rdd.flatMap(_.split(" "))
        val wordPairs = words.map((_, 1))
        val grouped = wordPairs.groupByKey()
        val evilWordCounts = grouped.mapValues(_.sum)
        evilWordCounts.take(5)
        // Less evil version
        val wordCounts = wordPairs.reduceByKey(_ + _)
        wordCounts.take(5)
  14. GroupByKey
  15. reduceByKey
  16. So what did we do instead?
      ● reduceByKey
        ○ Works when the types are the same (e.g. in our summing version)
      ● aggregateByKey
        ○ Doesn’t require the types to be the same (e.g. computing stats model or similar)
      ● Allows Spark to pipeline the reduction & skip making the list
      ● We also got a map-side reduction (note the difference in shuffled read)
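The deck shows reduceByKey but not aggregateByKey. A minimal sketch of the case where the result type differs from the value type is a per-key (count, sum) pair, from which a mean could be derived. The object and method names below are assumptions made for illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

// Hypothetical example object; not part of the original slides.
object AggregateSketch {
  /** Per-key (count, sum), computed without building a list of values per key. */
  def countAndSum(pairs: RDD[(String, Double)]): RDD[(String, (Long, Double))] =
    pairs.aggregateByKey((0L, 0.0))(
      // Fold one value into the accumulator for its key (runs map-side).
      (acc, v) => (acc._1 + 1L, acc._2 + v),
      // Merge two accumulators for the same key after the shuffle.
      (a, b) => (a._1 + b._1, a._2 + b._2))

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("aggregate-sketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      val pairs = sc.parallelize(Seq(("a", 1.0), ("a", 3.0), ("b", 2.0)))
      countAndSum(pairs).collect().foreach(println)
    } finally {
      sc.stop()
    }
  }
}
```

Because the accumulator is a small fixed-size tuple, only one such tuple per key and partition crosses the shuffle, which is the same property that makes reduceByKey cheaper than groupByKey.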
  17. So why did we read in python/*.py?
      If we just read in the standard README.md file there aren’t enough duplicated keys for the reduceByKey & groupByKey difference to be really apparent.
      Which is why groupByKey can be safe sometimes.
  18. Can just the shuffle cause problems?
      ● Sorting by key can put all of the records in the same partition
      ● We can run into partition size limits (around 2GB)
      ● Or just get bad performance
      ● To handle data like the records below, we can add some “junk” to our key
        (94110, A, B) (94110, A, C) (10003, D, E) (94110, E, F)
        (94110, A, R) (10003, A, R) (94110, D, R) (94110, E, R)
        (94110, E, R) (67843, T, R) (94110, T, R) (94110, T, R)
      Photo by Todd Klassy
  19. Shuffle explosions :(
      Input records:
        (94110, A, B) (94110, A, C) (10003, D, E) (94110, E, F)
        (94110, A, R) (10003, A, R) (94110, D, R) (94110, E, R)
        (94110, E, R) (67843, T, R) (94110, T, R) (94110, T, R)
      After the shuffle:
        (94110, A, B) (94110, A, C) (94110, E, F) (94110, A, R) (94110, D, R)
        (94110, E, R) (94110, E, R) (94110, T, R) (94110, T, R)
        (67843, T, R) (10003, A, R) (10003, D, E)
  20. 100% less explosions
      Input records:
        (94110, A, B) (94110, A, C) (10003, D, E) (94110, E, F)
        (94110, A, R) (10003, A, R) (94110, D, R) (94110, E, R)
        (94110, E, R) (67843, T, R) (94110, T, R) (94110, T, R)
      After adding “junk” to the key:
        (94110_A, A, B) (94110_A, A, C) (94110_A, A, R) (94110_D, D, R)
        (94110_T, T, R) (10003_A, A, R) (10003_D, D, E) (67843_T, T, R)
        (94110_E, E, R) (94110_E, E, R) (94110_E, E, F) (94110_T, T, R)
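One way to read the “junk in the key” idea is to widen the sort key with a field from the record. The sketch below is an assumption about how that could look, not code from the talk: it uses a composite `(zip, firstField)` key in place of the `94110_A` string, and `SaltSketch` and `saltAndSort` are made-up names.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

// Hypothetical example object; not part of the original slides.
object SaltSketch {
  /**
   * Sorts by a composite key so that records sharing a hot zip code have
   * several distinct keys, which the range partitioner used by sortByKey
   * can place in different partitions. The output is still ordered by zip
   * code first.
   */
  def saltAndSort(records: RDD[(Int, String, String)]): RDD[((Int, String), String)] =
    records
      .map { case (zip, first, second) => ((zip, first), second) }
      .sortByKey()

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("salt-sketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      val records = sc.parallelize(Seq(
        (94110, "A", "B"), (94110, "A", "C"), (10003, "D", "E"), (94110, "E", "F"),
        (94110, "A", "R"), (10003, "A", "R"), (94110, "D", "R"), (94110, "E", "R"),
        (94110, "E", "R"), (67843, "T", "R"), (94110, "T", "R"), (94110, "T", "R")), 4)
      saltAndSort(records).collect().foreach(println)
    } finally {
      sc.stop()
    }
  }
}
```

This only helps when the added field actually varies within the hot key. If every record for a zip code shared the same first field, a different source of “junk” would be needed.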
  21. Everyone* needs reduce, let’s make it faster!
      ● reduce & aggregate have “tree” versions
      ● we already had free map-side reduction
      ● but now we can get even better!**
      **And we might be able to make even cooler versions
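The “tree” versions mentioned on this slide are `treeReduce` and `treeAggregate` on `RDD`. A small sketch of both, with made-up object and method names and toy data, might look like this:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

// Hypothetical example object; not part of the original slides.
object TreeSketch {
  /** Sum computed with intermediate combine stages instead of all at the driver. */
  def treeSum(nums: RDD[Long]): Long =
    nums.treeReduce(_ + _, depth = 2)

  /** (count, sum) computed with treeAggregate; the result type differs from the element type. */
  def treeCountAndSum(nums: RDD[Long]): (Long, Long) =
    nums.treeAggregate((0L, 0L))(
      (acc, x) => (acc._1 + 1L, acc._2 + x),   // fold within a partition
      (a, b) => (a._1 + b._1, a._2 + b._2),    // merge partial results
      depth = 2)

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("tree-sketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      val nums = sc.parallelize(1L to 100000L, 50)
      println(treeSum(nums))
      println(treeCountAndSum(nums))
    } finally {
      sc.stop()
    }
  }
}
```

With plain `reduce`, every partition's partial result is merged on the driver. The tree variants add intermediate merge rounds on the executors, which tends to matter most when there are many partitions or the partial results are large.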
  22. Spark accumulators
      ● Really “great” way for keeping track of failed records
      ● Double counting makes things really tricky
        ○ Jobs which worked “fine” don’t continue to work “fine” when minor changes happen
      ● Relative rules can save us* under certain conditions
      Photo by Found Animals Foundation
  23. Using an accumulator for validation:
        val (ok, bad) = (sc.accumulator(0), sc.accumulator(0))
        val records = input.map{ x =>
          if (isValid(x)) ok += 1 else bad += 1
          // Actual parse logic here
        }
        // An action (e.g. count, save, etc.)
        if (bad.value > 0.1 * ok.value) {
          throw new Exception("bad data - do not use results")
          // Optional cleanup
        }
        // Mark as safe
      P.S.: If you are interested in this, check out spark-validator (still early stages).
      Photo by Found Animals Foundation
  24. Using a library: simple historic validation
        val vc = new ValidationConf(jobHistoryPath, "1", true,
          List[ValidationRule](new AvgRule("acc", 0.001, Some(200))))
        val v = Validation(sc, vc)
        // Some job logic
        // Register an accumulator (optional)
        val acc = sc.accumulator(0)
        v.registerAccumulator(acc, "acc")
        // More Job logic goes here
        if (v.validate(jobId)) {
          // Success logic goes here
        } else sadness()
      Photo by Dvortygirl
  25. With a Spark internal counter...
        val vc = new ValidationConf(tempPath, "1", true,
          List[ValidationRule](
            new AbsoluteSparkCounterValidationRule("recordsRead", Some(30), Some(1000))))
        val sqlCtx = new SQLContext(sc)
        val v = Validation(sc, sqlCtx, vc)
        // Do work here....
        assert(v.validate(5) === true)
      Photo by Dvortygirl
  26. Where can Spark SQL benefit perf?
      ● Structured or semi-structured data
      ● OK with having less* complex operations available to us
      ● We may only need to operate on a subset of the data
        ○ The fastest data to process isn’t even read
      ● Remember that non-magic cat? It’s got some magic** now
        ○ In part from peeking inside of boxes
      ● non-JVM (aka Python & R) users: saved from double serialization cost! :)
      **Magic may cause stack overflow. Not valid in all states. Consult local magic bureau before attempting magic.
      Photo by Matti Mattila
  27. Why is Spark SQL good for those things?
      ● Space efficient columnar cached representation
      ● Able to push down operations to the data store
      ● Optimizer is able to look inside of our operations
        ○ Regular Spark can’t see inside our operations to spot the difference between (min(_, _)) and (append(_, _))
      Photo by Matti Mattila
  28. Spark SQL v. RDD serialization
      ● Testing the serialization alone, Databricks showed a close to 20x improvement in speed!
      ● Some research shows serialization cost is a substantial blocking component of Spark performance
        ○ Making Sense of Performance in Data Analytics Frameworks by Kay Ousterhout, Ryan Rasti, Sylvia Ratnasamy, Scott Shenker, Byung-Gon Chun
  29. Introducing Datasets
      ● New in Spark 1.6
      ● Provide a templated, compile-time strongly typed version of DataFrames
      ● Make it easier to intermix functional & relational code
        ○ Do you hate writing UDFs? So do I!
      ● Still an experimental component (API will change in future versions)
        ○ Although the next major version seems likely to be 2.0 anyway, so lots of things may change regardless
  30. Using Datasets to mix functional & relational style:
        val ds: Dataset[RawPanda] = ...
        val happiness = ds.toDF().filter($"happy" === true).as[RawPanda].
          select($"attributes"(0).as[Double]).
          reduce((x, y) => x + y)
  31. So what was that?
        ds.toDF().filter($"happy" === true).as[RawPanda].
          select($"attributes"(0).as[Double]).
          reduce((x, y) => x + y)
      ● ds.toDF(): convert a Dataset to a DataFrame to access more DataFrame functions
      ● .as[RawPanda]: convert the DataFrame back to a Dataset
      ● select(... .as[Double]): a typed query (specifies the return type)
      ● reduce(...): traditional functional reduction: arbitrary Scala code :)
  32. And functional style maps:
        /**
         * Functional map + Dataset, sums the positive attributes for the pandas
         */
        def funMap(ds: Dataset[RawPanda]): Dataset[Double] = {
          ds.map{rp => rp.attributes.filter(_ > 0).sum}
        }
  33. Photo by Christian Heilmann
  34. beware the implicit list conversion
      ● Iterator to Iterator transformations are super useful
        ○ They allow Spark to spill to disk if reading an entire partition is too much
        ○ Not to mention better pipelining when we put multiple transformations together
      ● Most of the default transformations are already set up for this
      ● But when we start working directly with the iterators
        ○ Sometimes to save setup time on expensive objects
        ○ e.g. mapPartitions, mapPartitionsWithIndex etc.
      ● implicit conversions can screw us up, and even cause OOMs :(
      Photo by Christian Heilmann
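The difference between an iterator-to-iterator function and one that quietly materializes the partition can be shown in plain Scala, without a cluster. In this sketch `ExpensiveParser` is a made-up stand-in for an object that is costly to construct, and both functions have the shape that `mapPartitions` expects.

```scala
// Hypothetical example object; not part of the original slides.
object IteratorSketch {
  /** Stand-in for something expensive to set up, built once per partition. */
  class ExpensiveParser {
    def parse(line: String): Int = line.trim.toInt
  }

  /** Iterator to iterator: elements are parsed lazily, one at a time. */
  def parseLazily(lines: Iterator[String]): Iterator[Int] = {
    val parser = new ExpensiveParser
    lines.map(parser.parse)
  }

  /** Converting to a List first pulls the whole partition into memory. */
  def parseEagerly(lines: Iterator[String]): Iterator[Int] = {
    val parser = new ExpensiveParser
    lines.toList.map(parser.parse).iterator
  }

  def main(args: Array[String]): Unit = {
    val lazySource = Iterator("1", "2", "3")
    val lazyOut = parseLazily(lazySource)
    // Nothing has been consumed yet, so the source still has elements.
    println(s"lazy source still has data: ${lazySource.hasNext}")
    println(lazyOut.sum)

    val eagerSource = Iterator("1", "2", "3")
    val eagerOut = parseEagerly(eagerSource)
    // The source was drained while building the intermediate List.
    println(s"eager source still has data: ${eagerSource.hasNext}")
    println(eagerOut.sum)
  }
}
```

Either function could be passed to Spark as `rdd.mapPartitions(IteratorSketch.parseLazily)`. Only the lazy one keeps memory use independent of partition size.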
  35. Preview: bringing codegen to Spark ML
      ● Based on Spark SQL’s code generation
        ○ First draft using quasiquotes
        ○ Switch to janino for Java compilation
      ● Initial draft for Gradient Boosted Trees
        ○ Based on DB’s work
        ○ First draft with QuasiQuotes
          ■ Moved to Java for speed
        ○ See SPARK-10387 for the details
      Photo by Jon
  36. What the generated code looks like:
        @Override
        public double call(Vector input) throws Exception {
          if (input.apply(1) <= 1.0) {
            return 0.1;
          } else {
            if (input.apply(0) <= 0.5) {
              return 0.0;
            } else {
              return 2.0;
            }
          }
        }
      [Diagram: decision tree with a root split (1, 1.0) leading to leaf 0.1 or to a second split (0, 0.5) with leaves 0.0 and 2.0]
      Photo by Glenn Simmons
  37. Additional Resources
      ● Programming guide (along with JavaDoc, PyDoc, ScalaDoc, etc.)
        ○ http://spark.apache.org/docs/latest/
      ● Kay Ousterhout’s work
        ○ http://www.eecs.berkeley.edu/~keo/
      ● Books
      ● Videos
      ● Spark Office Hours
        ○ Normally in the bay area - will do Google Hangouts ones soon
        ○ follow me on twitter for future ones - https://twitter.com/holdenkarau
      Photo by raider of gin
  38. Books:
      ● Learning Spark
      ● Fast Data Processing with Spark (Out of Date)
      ● Fast Data Processing with Spark (2nd edition)
      ● Advanced Analytics with Spark
  39. Books:
      ● Learning Spark
      ● Fast Data Processing with Spark (Out of Date)
      ● Fast Data Processing with Spark (2nd edition)
      ● Advanced Analytics with Spark
      ● Coming soon: Spark in Action
      ● Coming soon: High Performance Spark
  40. Books:
      ● Learning Spark
      ● Fast Data Processing with Spark (Out of Date)
      ● Fast Data Processing with Spark (2nd edition)
      ● Advanced Analytics with Spark
      ● Coming soon: Spark in Action
      ● Coming soon: High Performance Spark
      Learning Spark Signing @Oreilly (3rd floor) after the talk :D
  41. And the next book…..
      Still being written - signup to be notified when it is available:
      ● http://www.highperformancespark.com
      ● https://twitter.com/highperfspark
  42. Spark Videos
      ● Apache Spark Youtube Channel
      ● My Spark videos on YouTube -
        ○ http://bit.ly/holdenSparkVideos
      ● Spark Summit 2014 training
      ● Paco’s Introduction to Apache Spark
  43. k thnx bye!
      If you care about Spark testing and don’t hate surveys: http://bit.ly/holdenTestingSpark
      Will tweet results “eventually” @holdenkarau
      Cat wave photo by Quinn Dombrowski
