
A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and cons


The machine learning libraries in Apache Spark are an impressive piece of software engineering, and are maturing rapidly. What advantages does Spark.ml offer over scikit-learn?

At Data Science Retreat we've taken a real-world dataset and worked through the stages of building a predictive model -- exploration, data cleaning, feature engineering, and model fitting -- in several different frameworks. We'll show what it's like to work with native Spark.ml, and compare it to scikit-learn along several dimensions: ease of use, productivity, feature set, and performance.

In some ways Spark.ml is still rather immature, but it also conveys new superpowers to those who know how to use it.

Published in: Data & Analytics

A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and cons

  1. 1. A full Machine learning pipeline in Scikit-learn vs Scala-Spark: pros and cons Jose Quesada and David Anderson @quesada, @alpinegizmo, @datascienceret
  2. 2. Why this talk?
  3. 3. • How do you get from a single-machine workload to a fully distributed one? • Answer: Spark machine learning • Is there something I'm missing out on by staying with Python?
  4. 4. • Mentors are world-class: CTOs, library authors, inventors, founders of fast-growing companies, etc. • DSR accepts fewer than 5% of the applications • Strong focus on commercial awareness • 5 years of working experience on average • 30+ partner companies in Europe
  5. 5. DSR participants do a portfolio project
  6. 6. Why is DSR talking about Scala/Spark? They are behind Scala. IBM is behind this. They hired us to make training materials.
  7. 7. Source: Spark 2015 infographic
  8. 8. (Chart: mindshare in ‘data science badasses’ over time; subjective)
  9. 9. Scala: “Scala offers the easiest refactoring experience that I've ever had due to the type system.” -- Jacob, Coursera engineer
  10. 10. Spark • Basically distributed Scala • API: Scala, Java, Python, and R bindings • Libraries: SQL, streams, graph processing, machine learning • One of the most active open source projects
  11. 11. “Spark will inevitably become the de-facto Big Data framework for Machine Learning and Data Science.” Dean Wampler, Lightbend
  12. 12. All under one roof (big win): Spark Core, Spark SQL, Spark Streaming, Spark.ml (machine learning), GraphX (graphs). Source: Spark 2015 infographic
  13. 13. Spark Programming Model (Diagram: Input -> Driver / SparkContext -> Workers)
  14. 14. Data is partitioned; code is sent to the data (Diagram: Input -> Driver / SparkContext -> Workers, each holding its partition of the Data)
  15. 15. Example: word count. Data is immutable, and is partitioned across the cluster. (Diagram: the input text "hello world foo bar foo foo bar bye world" partitioned across workers)
  16. 16. Example: word count. We get things done by creating new, transformed copies of the data, in parallel. (Diagram: each word is mapped to a pair: (hello, 1) (world, 1) (foo, 1) (bar, 1) (foo, 1) (foo, 1) (bar, 1) (bye, 1) (world, 1))
  17. 17. Example: word count. Some operations require a shuffle to group data together. (Diagram: the (word, 1) pairs are shuffled and reduced to (hello, 1) (foo, 3) (bar, 2) (bye, 1) (world, 2))
  18. 18. Example: word count
      # These transformations are pipelined into the same Python executor.
      lines = sc.textFile(input)
      words = lines.flatMap(lambda x: x.split(" "))
      word_count = (words.map(lambda x: (x, 1))
                    .reduceByKey(lambda x, y: x + y))

      # Nothing happens until after this line, when this "action" forces
      # evaluation of the RDD.
      word_count.saveAsTextFile(output)
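For comparison, a minimal sketch of the same word count in native Scala Spark, assuming the same SparkContext (sc) and the same input/output paths as the PySpark version above:

      val lines = sc.textFile(input)
      val words = lines.flatMap(_.split(" "))
      val wordCount = words
        .map(word => (word, 1))
        .reduceByKey(_ + _)

      // As in the Python version, nothing runs until this action forces evaluation.
      wordCount.saveAsTextFile(output)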
  19. 19. RDD – Resilient Distributed Dataset • An immutable, partitioned collection of elements that can be operated on in parallel • Lazy • Fault-tolerant
  20. 20. PySpark RDD Execution Model Whenever you provide a lambda to operate on an RDD: • Each Spark worker forks a Python worker • data is serialized and piped to those Python workers
  21. 21. Impact of this execution model • Worker overhead (forking, serialization) • The cluster manager isn't aware of Python's memory needs • Very confusing error messages
  22. 22. Spark Dataframes (and Datasets) • Based on RDDs, but tabular; something like SQL tables • Not Pandas • Rescues Python from serialization overhead • df.filter(df.col("color") == "red") vs. rdd.filter(lambda x: x.color == "red") • processed entirely in the JVM • Python UDFs and maps still require serialization and piping to Python • can write (and register) Scala code, and then call it from Python
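A minimal sketch (Scala) of the contrast in the bullet above, assuming df is a DataFrame with a string column named "color":

      // DataFrame filter: the predicate is a Catalyst expression (a Column),
      // so it is planned, optimized, and executed entirely inside the JVM.
      val redDF = df.filter(df.col("color") === "red")

      // RDD filter: the predicate is an opaque closure applied row by row;
      // in PySpark this additionally means serializing rows to a Python worker.
      val redRDD = df.rdd.filter(row => row.getAs[String]("color") == "red")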
  23. 23. DataFrame execution: unified across languages (Diagram: Python DF, Java/Scala DF, and R DF all feed into one Logical Plan, which drives Execution) • API wrappers create a logical plan (a DAG) • Catalyst optimizes the plan; Tungsten compiles the plan into executable code
  24. 24. DataFrame performance
  25. 25. ML Workflow Data Ingestion Data Cleaning / Feature Engineering Model Training Testing and Validation Deployment
  26. 26. Machine learning with scikit-learn • Easy to use • Rich ecosystem • Limited to one machine (but see sparkit-learn package)
  27. 27. Machine learning with Hadoop (in short: NO) • Each iteration is a new M/R job • Each job must store data in HDFS – lots of overhead
  28. 28. How Spark killed Hadoop map/reduce • Far easier to program • More cost-effective since less hardware can perform the same tasks much faster • Can do real-time processing as well as batch processing • Can do ML, graphs
  29. 29. Machine learning with Spark • Spark was designed for ML workloads • Caching (reuse data) • Accumulators (keep state across iterations) • Functional, lazy, fault-tolerant • Many popular algorithms are supported out of the box • Simple to productionize models • MLlib is RDD-based (the past); spark.ml is DataFrame-based (the future)
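A minimal sketch (Scala) of the caching point above; the file path and the toy update rule are made up for illustration, and a SparkContext (sc) is assumed:

      // Parse the data once and keep it in memory for the iterations below.
      val points = sc.textFile("data/points.txt")   // hypothetical file, one number per line
        .map(_.toDouble)
        .cache()

      // Stand-in for an iterative ML loop: every pass reuses the cached data
      // instead of re-reading and re-parsing the text file.
      var estimate = 0.0
      for (_ <- 1 to 10) {
        estimate = 0.5 * (estimate + points.mean())
      }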
  30. 30. Spark is an Ecosystem of ML frameworks • Spark was designed by people who understood the needs of ML practitioners (unlike Hadoop) • MLlib • Spark.ml • System.ml (IBM) • Keystone.ml
  31. 31. Spark.ML – the basics • DataFrame: ML requires DFs holding vectors • Transformer: transforms one DF into another • Estimator: fit on a DF; produces a transformer • Pipeline: chain of transformers and estimators • Parameter: there is a unified API for specifying parameters • Evaluator: computes a metric to compare fitted models • CrossValidator: model selection via grid search
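A minimal sketch (Scala) of how these pieces fit together; the column names and the training DataFrame (with a string column "color" and a double "label") are made up for illustration:

      import org.apache.spark.ml.Pipeline
      import org.apache.spark.ml.classification.LogisticRegression
      import org.apache.spark.ml.feature.{StringIndexer, VectorAssembler}

      val indexer = new StringIndexer()        // Estimator: fit() learns the string-to-index mapping
        .setInputCol("color")
        .setOutputCol("color_index")

      val assembler = new VectorAssembler()    // Transformer: assembles the feature Vector
        .setInputCols(Array("color_index"))
        .setOutputCol("features")

      val lr = new LogisticRegression()        // Estimator, configured via the unified Parameter API
        .setMaxIter(10)
        .setRegParam(0.01)

      val pipeline = new Pipeline().setStages(Array(indexer, assembler, lr))

      val model = pipeline.fit(training)       // the fitted PipelineModel is itself a Transformer
      val predictions = model.transform(training)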
  32. 32. Machine Learning scaling challenges that Spark solves • Hyper-parameter tuning
  33. 33. Machine Learning scaling challenges that Spark solves • Hyper-parameter tuning • ETL/feature engineering
  34. 34. Machine Learning scaling challenges that Spark solves • Hyper-parameter tuning • ETL/feature engineering • Model
  35. 35. Q: Hardest scaling problem in data science? A: Adding people • Spark.ml has a clean architecture and APIs that should encourage code sharing and reuse • Good first step: can you refactor some ETL code as a Transformer? • Don't see much sharing of components happening yet • Entire libraries, yes; components, not so much • Perhaps because Spark has been evolving so quickly • E.g., a pull request implementing non-linear SVMs has been stuck for a year
  36. 36. Structured types in Spark
                        SQL       | DataFrames   | DataSets (Java/Scala only)
      Syntax errors:    runtime   | compile time | compile time
      Analysis errors:  runtime   | runtime      | compile time
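A minimal sketch (Scala) of the last row of the table; the Person case class and the people.json path are made up for illustration, and a SparkSession named spark is assumed:

      import spark.implicits._   // provides the Encoder for the case class

      case class Person(name: String, age: Long)

      // Dataset[Person]: field names and types are checked by the Scala compiler,
      // so a typo such as _.agee is a compile-time error.
      val people = spark.read.json("people.json").as[Person]
      val adults = people.filter(_.age > 21)

      // DataFrame: column names are plain strings; a typo like "agee" would only
      // surface at runtime as an AnalysisException.
      val adultsDF = people.toDF().filter("age > 21")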
  37. 37. User experience Spark.ml – Scikit-learn
  38. 38. Indexing categorical features
      • You are responsible for identifying and indexing categorical features
        val rfcd_indexer = new StringIndexer()
          .setInputCol("color")
          .setOutputCol("color_index")
          .fit(dataset)

        val seo_indexer = new StringIndexer()
          .setInputCol("status")
          .setOutputCol("status_index")
          .fit(dataset)
  39. 39. Assembling features
      • You must gather all of your features into one Vector, using a VectorAssembler
        val assembler = new VectorAssembler()
          .setInputCols(Array("color_index", "status_index", ...))
          .setOutputCol("features")
  40. 40. Spark.ml – Scikit-learn: Pipelines (good news!) • Spark ML and scikit-learn: same approach • Chain together Estimators and Transformers • Support non-linear pipelines (must be a DAG) • Unify parameter passing • Support for cross-validation and grid search • Can write your own custom pipeline stages Spark.ml just like scikit-learn
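A minimal sketch (Scala) of cross-validation with grid search over the pipeline sketched earlier (pipeline, lr, and training as above); the grid values are made up for illustration:

      import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
      import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}

      // Every parameter combination below is fitted as a distributed Spark job.
      val paramGrid = new ParamGridBuilder()
        .addGrid(lr.regParam, Array(0.01, 0.1, 1.0))
        .addGrid(lr.maxIter, Array(10, 50))
        .build()

      val cv = new CrossValidator()
        .setEstimator(pipeline)                              // the whole pipeline is the estimator
        .setEvaluator(new BinaryClassificationEvaluator())   // metric used to compare models
        .setEstimatorParamMaps(paramGrid)
        .setNumFolds(3)

      val cvModel = cv.fit(training)   // CrossValidatorModel wrapping the best model found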
  41. 41. Spark.ml – Scikit-learn: NLP tasks (thumbs up)
      Transformer         | Description                                        | scikit-learn
      Binarizer           | Threshold numerical feature to binary              | Binarizer
      Bucketizer          | Bucket numerical features into ranges              |
      ElementwiseProduct  | Scale each feature/column separately               |
      HashingTF           | Hash text/data to vector; scale by term frequency  | FeatureHasher
      IDF                 | Scale features by inverse document frequency       | TfidfTransformer
      Normalizer          | Scale each row to unit norm                        | Normalizer
      OneHotEncoder       | Encode k-category feature as binary features       | OneHotEncoder
      PolynomialExpansion | Create higher-order features                       | PolynomialFeatures
      RegexTokenizer      | Tokenize text using regular expressions            | (part of text methods)
      StandardScaler      | Scale features to 0 mean and/or unit variance      | StandardScaler
      StringIndexer       | Convert String feature to 0-based indices          | LabelEncoder
      Tokenizer           | Tokenize text on whitespace                        | (part of text methods)
      VectorAssembler     | Concatenate feature vectors                        | FeatureUnion
      VectorIndexer       | Identify categorical features, and index           |
      Word2Vec            | Learn vector representation of words               |
  42. 42. Graph stuff (GraphX, GraphFrames; not great) • Extremely easy to run monster algorithms in a cluster • GraphX has no Python API • GraphFrames are cool, and should provide access to the graph tools in Spark from Python • In practice, it didn’t work too well
  43. 43. Things we liked in Spark ML • Architecture encourages building reusable pieces • Type safety, plus types are driving optimizations • Model fitting returns an object that transforms the data • Uniform way of passing parameters • It's interesting to use the same platform for ETL and model fitting • Very easy to parallelize ETL and grid search, or work with huge models
  44. 44. Disappointments using Spark ML • Feature indexing and assembly can become tedious • Surprised by the maximum depth limit for trees: 30 • Data exploration and visualization aren't easy in Scala • Wish list: non-linear SVMs, deep learning (but see Deeplearning4j)
  45. 45. What is new for machine learning in Spark 2.0 • DataFrame-based Machine Learning API emerges as the primary ML API: With Spark 2.0, the spark.ml package, with its “pipeline” APIs, will emerge as the primary machine learning API. While the original spark.mllib package is preserved, future development will focus on the DataFrame-based API. • Machine learning pipeline persistence: Users can now save and load machine learning pipelines and models across all programming languages supported by Spark.
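A minimal sketch (Scala) of the new pipeline persistence, assuming a fitted PipelineModel named model (like the one from the earlier sketch); the save path is made up for illustration:

      import org.apache.spark.ml.PipelineModel

      // Save the fitted pipeline: all stages and their parameters.
      model.write.overwrite().save("/models/color-lr-pipeline")

      // Reload it later -- from Scala, Python, or R -- and apply it to new data.
      val sameModel = PipelineModel.load("/models/color-lr-pipeline")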
  46. 46. What is new for data structures in Spark 2.0 Unifying the API for Streams and static data: Infinite datasets (same interface as dataframes)
  47. 47. What have Spark and Scala ever given us?
  48. 48. … Other than distributed dataframes, distributed machine learning, easy distributed grid search, distributed SQL, distributed stream analysis, more performance than map/reduce, an easier programming model, and easier deployment … What have Spark and Scala ever given us?
  49. 49. Reminder: 25 videos explaining ML on Spark • For people who already know ML • http://datascienceretreat.com/videos/data-science-with-scala-and-spark
  50. 50. Thank you for your attention! @quesada, @datascienceret
