Scala and Spark are Ideal for Big Data - Data Science Pop-up Seattle

Scala and Spark are each great tools for data processing, and they work well together. They can process data via small, simple interactive queries as well as in very large, highly available, and scalable production systems. They provide an integrated framework for an ever-growing range of data processing capabilities. We examine the reasons for this and also look at a couple of simple data processing examples written in Scala. Presented by John Nestor, Sr Architect at 47 Degrees.

  1. Scala and Spark are Ideal for Big Data. John Nestor, Sr Architect, 47 Degrees. #datapopupseattle
  2. Data Science Pop-Up in Seattle (#datapopupseattle). Produced by Domino Data Lab, www.dominodatalab.com. Domino’s enterprise data science platform is used by leading analytical organizations to increase productivity, enable collaboration, and publish models into production faster.
  3. Scala and Spark are Ideal for Big Data. John Nestor, 47 Degrees. Seattle Unstructured Data Science Pop-Up, October 7, 2015. www.47deg.com
  4. Why Scala?
     • Strong typing
     • Concise, elegant syntax
     • Runs on the JVM (Java Virtual Machine)
     • Supports both object-oriented and functional programming
     • Scales from small, simple programs to large parallel distributed systems
     • Easy to extend cleanly with new libraries and DSLs
     • Ideal for parallel and distributed systems
  5. Scala: Strong Typing and Concise Syntax
     • Strong typing like Java
         • Compile-time checks
         • Better modularity via strongly typed interfaces
         • Easier maintenance: types make code easier to understand
     • Concise syntax like Python
         • Type inference: the compiler infers most types that had to be explicit in Java
         • Powerful syntax that avoids much of the boilerplate of Java code (see next slide)
     • Best of both worlds: the safety of strong typing with the conciseness of Python
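The type-inference point can be illustrated with a minimal sketch. It is not from the talk; the object name `TypeInferenceSketch` and the sample values are assumptions chosen only for illustration.

```scala
object TypeInferenceSketch {
  // No type annotations: the compiler infers Int, String, and List[Int].
  val count = 42
  val name = "Joe"
  val ages = List(30, 41, 27)

  // Parameter types are required; the return type is written out here
  // for readability, but the compiler could infer it as well.
  def average(xs: List[Int]): Double = xs.sum.toDouble / xs.size

  // Strong typing still applies. Uncommenting the next line is rejected
  // at compile time rather than failing at run time:
  // val bad: Int = "thirty"
}
```

The values read much like Python, yet each has a fixed static type that the compiler checks.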
  6. Scala Case Class
     • Java version
         class User {
             private String name;
             private int age;
             // Constructor stores both fields.
             public User(String name, int age) {
                 this.name = name; this.age = age;
             }
             // Accessor and mutator for the one mutable field.
             public int getAge() { return age; }
             public void setAge(int age) { this.age = age; }
         }
         User joe = new User("Joe", 30);
     • Scala version
         case class User(name: String, var age: Int)
         val joe = User("Joe", 30)
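Beyond the shorter declaration, a case class also gets structural equality, a readable `toString`, `copy`, and pattern matching from the compiler. The following is a minimal sketch of those features using the `User` class from the slide; the wrapper object `CaseClassSketch` and the `describe` helper are hypothetical names added for illustration.

```scala
object CaseClassSketch {
  case class User(name: String, var age: Int)

  val joe = User("Joe", 30)
  val sameJoe = User("Joe", 30)

  // Equality compares field values, not object identity.
  val structurallyEqual: Boolean = joe == sameJoe

  // copy builds a new instance with selected fields changed.
  val older: User = joe.copy(age = 31)

  // toString is generated from the class name and fields.
  val shown: String = joe.toString

  // Case classes can be taken apart with pattern matching.
  def describe(u: User): String = u match {
    case User(n, a) if a >= 18 => s"$n is an adult"
    case User(n, _)            => s"$n is a minor"
  }
}
```

Writing the equivalent `equals`, `hashCode`, and `toString` by hand in the Java version would add considerably more boilerplate.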
  7. Functional Scala
     • Anonymous functions
         (a: Int, b: Int) => a + b
     • Functions that take and return other functions
     • Rarely need variables or loops
     • Immutable collections: Seq[T], Map[K,V], …
         • Works well with concurrent or distributed systems
         • Natural for functional programming
     • Functional collection operations (a small sample)
         • map, flatMap, reduce, …
         • filter, groupBy, sortBy, take, drop, …
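A few of the operations listed above, sketched on an immutable `Seq`. This is an illustrative example rather than code from the talk; the `Person` class, the sample data, and the names `add`, `twice`, and `FunctionalSketch` are all assumptions.

```scala
object FunctionalSketch {
  // Hypothetical sample data, used only for this sketch.
  case class Person(name: String, age: Int)
  val people: Seq[Person] = Seq(
    Person("Ann", 34), Person("Bob", 19), Person("Cy", 52), Person("Di", 19))

  // The anonymous function from the slide, bound to a name.
  val add: (Int, Int) => Int = (a: Int, b: Int) => a + b

  // A function that takes a function and returns a new function.
  def twice(f: Int => Int): Int => Int = x => f(f(x))
  val addTwo: Int => Int = twice(_ + 1)

  // map + reduce: sum of all ages, with no variables or loops.
  val totalAge: Int = people.map(_.age).reduce(add)

  // filter + sortBy + map: names of people aged 21 or over, alphabetically.
  val adults: Seq[String] =
    people.filter(_.age >= 21).sortBy(_.name).map(_.name)

  // groupBy: names keyed by age.
  val namesByAge: Map[Int, Seq[String]] =
    people.groupBy(_.age).map { case (age, ps) => age -> ps.map(_.name) }
}
```

Each operation returns a new collection and leaves `people` unchanged, which is what makes this style safe to use from concurrent code.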
  8. Scala Availability and Support
     • Open Source
     • Typesafe provides support. Founded by Martin Odersky, who designed Scala.
     • IDEs: IntelliJ IDEA and Eclipse
     • Libraries: lots now and more every day
         • ScalaNLP - Epic (natural language processing)
     • Major Scala users: LinkedIn, Twitter, Goldman Sachs, Coursera, Angie's List, Whitepages
     • Major systems written in Scala: Spark, Kafka
  9. Typesafe Scala Components
     • Scala Compiler (includes REPL)
     • Scala Standard Libraries
     • SBT - Scala Build Tool
     • Play - scalable web applications
     • Scala.js - compiles Scala to JavaScript
     • Akka - for parallel and distributed computation
     • Spray - high-performance asynchronous TCP/HTTP library
     • Spark - Typesafe also supports Spark
     • Slick - for SQL database access
     • ConductR - Scala deployment/devops tool
     • Reactive Monitoring (Beta)
  10. Why Spark?
      • Supports not only batch but also (near) real-time processing
      • Fast - keeps data in memory as much as possible
          • Often 10X to 100X the speed of Hadoop
      • A clean, easy-to-use API
      • A richer set of functional operations than just map and reduce
      • A foundation for a wide set of integrated data applications
      • Can recover from failures - recompute or (optional) replication
      • Scalable for very large data sets and reduced processing time
  11. Spark RDDs
      • RDD[T] - resilient distributed dataset
          • typed (must be serializable)
          • immutable
          • ordered
          • can be processed in parallel
          • lazy evaluation - permits more global optimizations
      • Rich set of functional operations (a small sample)
          • map, flatMap, reduce, …
          • filter, groupBy, sortBy, take, drop, …
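The RDD properties above can be seen in a small word count. This is a sketch, not the demo from the talk. It assumes the `spark-core` library is on the classpath and runs Spark in local mode inside a single JVM; the object name `RddSketch` and the input lines are hypothetical.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object RddSketch {
  // Counts words in the given lines using an RDD (assumes spark-core is available).
  def run(lines: Seq[String]): Map[String, Int] = {
    // "local[*]" runs Spark in this JVM using all available cores.
    val conf = new SparkConf().setAppName("rdd-sketch").setMaster("local[*]")
    val sc = new SparkContext(conf)
    try {
      // parallelize turns a local collection into an RDD[String].
      val rdd = sc.parallelize(lines)

      // Transformations are lazy: these lines only describe the computation.
      val counts = rdd
        .flatMap(_.split(" "))
        .map(word => (word, 1))
        .reduceByKey(_ + _)

      // collect() is an action: it triggers execution and returns the results.
      counts.collect().toMap
    } finally {
      sc.stop()
    }
  }

  def main(args: Array[String]): Unit =
    println(run(Seq("spark and scala", "scala and akka")))
}
```

The operations mirror those on ordinary Scala collections, which is why the same functional style carries over from a local `Seq` to data distributed across a cluster.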
  12. Spark Components
      • Spark Core
          • Scalable multi-node cluster
          • Failure detection and recovery
          • RDDs and functional operations
      • MLlib - for machine learning
          • linear regression, SVMs, clustering, collaborative filtering, dimension reduction
          • more on the way!
      • GraphX - for graph computation
      • Streaming - for near real-time
      • DataFrames - for SQL and JSON
  13. Spark Availability and Support
      • Open Source - top-level Apache project
      • Over 750 contributors from over 200 organizations
      • Can process multiple petabytes on clusters of over 8000 nodes
      • Databricks - Matei Zaharia, who wrote the original Spark, is a founder and CTO
      • Packages (more every day)
          • Zeppelin - Scala notebooks
          • Cassandra, Kafka connectors
  14. Clusters and Scalability
      • Scala Akka clusters (process distribution, microservices)
          • message passing
          • remote Actors
      • Spark clusters (data distribution)
          • local
          • Standalone (optionally with ZooKeeper)
          • Apache Mesos
          • Hadoop YARN
      • All of the above can run on Amazon and Google clouds
  15. Why Scala for Spark?
      • Why not Python, R, or Java for Spark?
          • Spark is written in Scala
          • Scala source code is important Spark documentation
          • Spark is best extended in Scala
          • The primary API for Spark is Scala
          • The functional features of Scala and Spark are a natural fit and easiest to use in Scala
          • If you want to build scalable, high-performance production code based on Spark, R by itself is too specialized, Python is too slow, and Java is tedious to write and maintain
  16. Demo
  17. Seattle Resources
      • Seattle Meetups
          • Scala at the Sea Meetup: http://www.meetup.com/Seattle-Scala-User-Group/
          • Seattle Spark Meetup: http://www.meetup.com/Seattle-Spark-Meetup/
      • Seattle Training: Spark and Typesafe Scala Classes: http://www.47deg.com/events#training
      • UW Scala Professional Certificate Program: http://www.pce.uw.edu/certificates/scala-functional-reactive-programming.html
