At the beginning, we had RDDs: a distributed Scala collection! Then the DataFrame API showed up, bringing with it tons of data source implementations and a query optimizer. But... where are our types? Are we really referring to a column with a string and then casting it? In Scala? This meant more runtime errors and made the code harder to refactor. Then we got Datasets. We have types again! The downside is that lambdas kill some of the performance we can achieve with DataFrames, and runtime errors don't completely disappear.
The Frameless library tries to solve these problems so we can get all the performance, keep our types, and reduce runtime errors. This talk explains the pros and cons of the library, and also dives deeper into some of the implementation details that make it possible.
3. Raise your hand if...
● You use Spark in production
● You use Spark with Scala
4. Raise your hand if...
● You use Spark in production
● You use Spark with Scala
● You know what the typeclass pattern is
5. Raise your hand if...
● You use Spark in production
● You use Spark with Scala
● You know what the typeclass pattern is
● You know what generic programming or Shapeless is
6. Raise your hand if...
● You use Spark in production
● You use Spark with Scala
● You know what the typeclass pattern is
● You know what generic programming or Shapeless is
● You’ve used Spark with Frameless before
8. RDDs
trait Person { val name: String }
case class Teacher(id: Int, name: String, salary: Double) extends Person
case class Student(id: Int, name: String) extends Person
9. RDDs
trait Person { val name: String }
case class Teacher(id: Int, name: String, salary: Double) extends Person
case class Student(id: Int, name: String) extends Person
val people: RDD[Person] = sc.parallelize(List(
Teacher(1, "Emma", 60000),
Student(2, "Steve"),
Student(3, "Arnold")
))
10. Lambdas are (almost) type-safe
val names = people.map(person => person.name)
val names = people.map {
case Teacher(_, name, _) => s"Teacher $name"
case Student(_, name) => s"Student $name"
}
11. Lambdas are (almost) type-safe
val names = people.map(person => person.name)
val names = people.map {
case Teacher(_, name, _) => s"Teacher $name"
case Student(_, name) => s"Student $name"
}
Possible MatchError at runtime
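The failure mode above can be reproduced without Spark. A minimal sketch, with a hypothetical Janitor subtype that the lambda does not cover: since Person is not sealed, the compiler cannot check exhaustiveness, so the match compiles and blows up at runtime.

```scala
// Same hierarchy as the deck, plus a hypothetical third subtype.
trait Person { val name: String }
case class Teacher(id: Int, name: String, salary: Double) extends Person
case class Student(id: Int, name: String) extends Person
case class Janitor(id: Int, name: String) extends Person

// Compiles without warning: Person is not sealed, so exhaustiveness
// cannot be verified at compile time.
val label: Person => String = {
  case Teacher(_, n, _) => s"Teacher $n"
  case Student(_, n)    => s"Student $n"
}

// label(Janitor(4, "Ana")) throws scala.MatchError at runtime
```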
21. Datasets
● Try to get the best of both worlds
● We can use lambdas as in RDDs!
○ What about performance?
● Full DataFrame API available, since DataFrame = Dataset[Row]
● They seem type-safe
22. We can use the DataFrame API
val names: DataFrame = people.select("namee")
23. Still not type-safe :(
AnalysisException: cannot resolve '`namee`'
given input columns: [id, name, age]
Runtime
val names: DataFrame = people.select("namee")
24. But… we can cast them!
val names: Dataset[Int] = people.select("name").as[Int]
25. But… we can cast them! ...and fail :(
AnalysisException: Cannot up cast `name` from
string to int as it may truncate
Runtime
val names: Dataset[Int] = people.select("name").as[Int]
27. Lambdas… are type-safe!
Compile
Error: value namee is not a member of Person
val names: Dataset[String] = people.map(_.namee)
28. What about performance?
● 2²⁵ randomly generated people
● 20 parquet files
● 4 cores
people.filter(_.age == 26).count() VS people.filter($"age" === 26).count()
33. Encoders?
UnsupportedOperationException: No Encoder found for Car
- field (class: "Car", name: "car")
- root class: "PersonCar"
case class PersonCar(personId: Int, car: Car)
val cars: Dataset[PersonCar] = spark.createDataset(List(
PersonCar(1, new Car("Tesla Model S"))
))
Runtime
35. Frameless
● Wraps the Spark API
● Type-safe non-lambda methods
● No run-time performance differences
● Provides a way to define custom encoders
● Actions are also lazy
36. Typed Datasets
val peopleFL: TypedDataset[Person] = people.typed
val names: TypedDataset[String] = peopleFL.select(peopleFL('namee))
37. Typed Datasets
Compile
No column Symbol with shapeless.tag.Tagged[String("namee")] of type A in Person
val peopleFL: TypedDataset[Person] = people.typed
val names: TypedDataset[String] = peopleFL.select(peopleFL('namee))
38. Column operations are also supported
scala> val agesDivided = peopleFL.select(peopleFL('age)/2)
agesDivided: TypedDataset[Double]
39. Column operations are also supported
scala> val agesDivided = peopleFL.select(peopleFL('age)/2)
agesDivided: TypedDataset[Double]
val intToString = (x: Int) => x.toString
val udf = peopleFL.makeUDF(intToString)
scala> val result = peopleFL.select(udf(peopleFL('age)))
result: TypedDataset[String]
40. Aggregations
case class AvgAge(name: String, age: Double)
val ageByName: TypedDataset[AvgAge] = {
peopleFL.groupBy(peopleFL('name)).agg(avg(peopleFL('age)))
}.as[AvgAge]
41. Custom type encoders: Injection
sealed trait Gender
case object Female extends Gender
case object Male extends Gender
case object Other extends Gender
case class PersonGender(id: Int, gender: Gender)
TypedDataset.create(peopleGender)
42. Custom encoders: Injection
sealed trait Gender
case object Female extends Gender
case object Male extends Gender
case object Other extends Gender
case class PersonGender(id: Int, gender: Gender)
TypedDataset.create(peopleGender)
Compile
Cannot find implicit value for value encoder
43. Custom encoders: Injection
implicit val genderToInt: Injection[Gender, Int] = Injection(
{
case Female => 1; case Male => 2; case Other => 3
},{
case 1 => Female; case 2 => Male; case 3 => Other
}
)
scala> TypedDataset.create(peopleGender)
res0: TypedDataset[PersonGender] = [id: int, gender: int]
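Conceptually, an Injection is just an invertible pair of functions between our type and one that already has an encoder. A simplified, Spark-free sketch of its shape (frameless's real trait carries the same apply/invert pair):

```scala
// Simplified sketch of frameless's Injection: an invertible mapping
// from a type A to an already-encodable type B.
trait Injection[A, B] {
  def apply(a: A): B   // forward: A => B
  def invert(b: B): A  // backward: B => A
}

object Injection {
  // Convenience constructor mirroring the one used on the slide
  def apply[A, B](f: A => B, g: B => A): Injection[A, B] =
    new Injection[A, B] {
      def apply(a: A): B = f(a)
      def invert(b: B): A = g(b)
    }
}

sealed trait Gender
case object Female extends Gender
case object Male extends Gender
case object Other extends Gender

val genderToInt: Injection[Gender, Int] = Injection(
  { case Female => 1; case Male => 2; case Other => 3 },
  // default case keeps the backward function total over Int
  { case 1 => Female; case 2 => Male; case _ => Other }
)
```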
45. Lazy actions
val numPeopleJob: Job[Long] = people.count().withDescription("...")
val num: Long = numPeopleJob.run()
val sampleJob = for {
num <- people.count()
sample <- people.take((num/10).toInt)
} yield sample
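The machinery behind lazy actions can be sketched without Spark: a Job is essentially a deferred computation with map and flatMap, so for-comprehensions compose work without running any of it (frameless's real Job also carries the SparkSession and the description; the names below are illustrative).

```scala
// Minimal sketch of a lazy Job: nothing runs until run() is called.
final class Job[A](thunk: () => A) {
  def run(): A = thunk()
  def map[B](f: A => B): Job[B] = new Job(() => f(run()))
  def flatMap[B](f: A => Job[B]): Job[B] = new Job(() => f(run()).run())
}
object Job {
  def apply[A](a: => A): Job[A] = new Job(() => a)
}

var executed = false
val countJob = Job { executed = true; 42L }

// Composes the two steps but executes neither
val sampleJob = for {
  n <- countJob
  s <- Job(Seq.fill((n / 10).toInt)("person"))
} yield s
// executed is still false here; it flips only when sampleJob.run() is called
```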
47. Encoders are typeclasses
val peopleList = List(Person(1, "Miguel", 26))
val people = spark.createDataset(peopleList)
def createDataset[T : Encoder](data: Seq[T]): Dataset[T]
48. Encoders are typeclasses
val peopleList = List(Person(1, "Miguel", 26))
val people = spark.createDataset(peopleList)
def createDataset[T : Encoder](data: Seq[T]): Dataset[T]
// It’s the same as
def createDataset[T](data: Seq[T])(implicit encoder: Encoder[T])
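For readers new to the pattern, the same shape works outside Spark. A minimal illustration with a hypothetical Show typeclass (not part of any library here):

```scala
// The typeclass pattern: behaviour is supplied by an implicit
// instance instead of inheritance.
trait Show[A] { def show(a: A): String }

implicit val showInt: Show[Int] = new Show[Int] {
  def show(a: Int): String = s"Int($a)"
}

// Context-bound sugar...
def render[A: Show](a: A): String = implicitly[Show[A]].show(a)

// ...desugars to an implicit parameter list:
def render2[A](a: A)(implicit ev: Show[A]): String = ev.show(a)
```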
49. Encoders are typeclasses
● Instances provided by SQLImplicits class
● That’s why we need import spark.implicits._ everywhere!
implicit def newSequenceEncoder[T <: Seq[_] : TypeTag]: Encoder[T] =
ExpressionEncoder() // <- Reflection at runtime!
50. Reflection is not our friend
class Car(name: String)
val cars = Seq(new Car("Tesla"))
val ds: Dataset[Car] = spark.createDataset(cars)
Compile
Unable to find encoder for type stored in a Dataset.
51. Reflection is not our friend
class Car(name: String)
val cars = Seq(new Car("Tesla"))
val ds: Dataset[Car] = spark.createDataset(cars)
Compile
Unable to find encoder for type stored in a Dataset.
val ds: Dataset[Seq[Car]] = spark.createDataset(Seq(cars))
Runtime
No encoder found for Car
52. How different are the Frameless encoders?
def create[A](data: Seq[A])(
implicit
encoder: TypedEncoder[A],
sqlContext: SQLContext
): TypedDataset[A]
54. How to know if our class has a column?
// We were calling people('name)
class TypedDataset[T] {
  def apply[A](column: Witness.Lt[Symbol])(
    implicit
    exists: TypedColumn.Exists[T, column.T, A],
    encoder: TypedEncoder[A]
  ): TypedColumn[T, A]
}
55. How to know if our class has a column?
object Exists { // lives in TypedColumn's companion
  implicit def deriveRecord[T, H <: HList, K, V](
    implicit
    lgen: LabelledGeneric.Aux[T, H],
    selector: Selector.Aux[H, K, V]
  ): Exists[T, K, V] = new Exists[T, K, V] {}
}
56. Concepts we need to understand first
● Generic programming and HList
● Literal types
● Phantom types
● Type tagging
● Dependent types
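The core of the first concept can be hand-rolled in a few lines. This toy HList (far simpler than shapeless's) shows how a single list can track every element's static type:

```scala
// A heterogeneous list: the type records each element's type.
sealed trait HList
final case class ::[H, T <: HList](head: H, tail: T) extends HList
sealed trait HNil extends HList
case object HNil extends HNil

// id :: name :: age, each position with its own type
val me: ::[Int, ::[String, ::[Int, HNil]]] =
  ::(1, ::("Miguel", ::(26, HNil)))

val id: Int = me.head           // no casting needed
val name: String = me.tail.head // the compiler knows this is a String
```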
62. Phantom types and type tagging
● Phantom type: no runtime behaviour
● Type tagging: assign a phantom type to other types
trait Increasable
def inc(x: Int with Increasable) = x + 1
scala> inc(3.asInstanceOf[Int with Increasable])
res0: Int = 4
scala> inc(3)
error: type mismatch; found: Int(3); required: Int with Increasable
63. All combined with Shapeless!
"name" ->> 1
res1: Int with KeyTag[String("name"),Int] = 1
64. All combined with Shapeless!
"name" ->> 1
res1: Int with KeyTag[String("name"),Int] = 1
val me = ("id" ->> 1) :: ("name" ->> "Miguel") :: ("age" ->> 26) :: HNil
me: Int with KeyTag[String("id"),Int] ::
    String with KeyTag[String("name"),String] ::
    Int with KeyTag[String("age"),Int] ::
    HNil
65. LabelledGeneric
val genericPerson = LabelledGeneric[Person]
Int with KeyTag[Symbol with Tagged[String("id")],Int] ::
String with KeyTag[Symbol with Tagged[String("name")],String] ::
Short with KeyTag[Symbol with Tagged[String("age")],Short] ::
HNil
66. Dependent types
trait Generic[A] {
type Repr
def to(value: A): Repr
}
def getRepr[A](v: A)(gen: Generic[A]): gen.Repr = gen.to(v)
// Is it not the same as this?
def getRepr[A, R](v: A)(gen: Generic2[A, R]): R = ???
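No: with the dependent version, the result type is computed from whichever instance is passed, so the caller never has to name R. A Spark-free sketch (pairGen and its Repr are made up for illustration):

```scala
// Each Generic instance fixes its own Repr; the result type of
// getRepr depends on the instance that is passed in.
trait Generic[A] {
  type Repr
  def to(value: A): Repr
}

val pairGen = new Generic[(Int, String)] {
  type Repr = String
  def to(value: (Int, String)): Repr = s"${value._1}:${value._2}"
}

// Dependent method type: the return type is gen.Repr
def getRepr[A](v: A)(gen: Generic[A]): gen.Repr = gen.to(v)

val r: pairGen.Repr = getRepr((1, "Miguel"))(pairGen) // r is a String
```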
67. Shapeless Witness
trait Witness {
type T
val value: T
}
def getField[A,K,V](value: A with KeyTag[K,V])
(implicit witness: Witness.Aux[K]) = witness.value
// Aux[K] = Witness { type T = K }
scala> getField("name" ->> 1)
res0: String("name") = name
68. Shapeless Witness
Witness.Aux[A] = Witness { type T = A }
scala> val witness = Witness('name)
witness: Witness.Aux[Symbol with Tagged[String("name")]]
Witness.Lt[A] = Witness { type T <: A }
// Tagged Symbol is a subtype of Symbol. So previous line is also...
witness: Witness.Lt[Symbol]
69. Back to Frameless
// We were calling people('name)
class TypedDataset[T] {
  def apply[A](column: Witness.Lt[Symbol])(
    implicit
    exists: TypedColumn.Exists[T, column.T, A],
    encoder: TypedEncoder[A]
  ): TypedColumn[T, A]
}
70. Back to Frameless
object Exists { // lives in TypedColumn's companion
  implicit def deriveRecord[T, H <: HList, K, V](
    implicit
    lgen: LabelledGeneric.Aux[T, H],
    selector: Selector.Aux[H, K, V]
  ): Exists[T, K, V] = new Exists[T, K, V] {}
}
71. To use it or not to use it
Pros:
● Type-safe with the same performance
● Injections for custom types
● Lazy jobs with descriptions
Cons:
● Slower compilation
● Not yet stable; no official Spark backward compatibility