Two years ago, Spotify introduced Scio, an open-source Scala framework for developing data pipelines and deploying them on Google Dataflow. In this talk, we will discuss the evolution of Scio and share the highlights of running it in production for two years. We will showcase several interesting data processing workflows run at Spotify, what we learned from running them in production, and how we leveraged that knowledge to make Scio faster, safer, and easier to use.
10. ☑ It works
☑ It works reliably
☐ It works reliably and efficiently
☐ It works reliably, efficiently and easily
data-processing @spotify
We are now at the "works reliably" stage; the goals are "efficiently" and "easily".
21. Anonymisation job
‒ Encrypt all personal data
‒ Each user has unique keys
‒ Runs hourly
https://labs.spotify.com/2018/09/18/scalable-user-privacy/
The anonymisation job. It encrypts personal data.
22. Anonymisation optimisation
‒ Replace Kryo by custom coders for Avro’s GenericRecord
‒ Kryo was really inefficient
‒ Only possible in Scio > 0.7
‒ Scio now has a compile time warning for GenericRecord
Anonymisation optimisation. The inefficient Kryo was replaced with custom coders, and there is now a compile-time warning as well. Both came with Scio 0.7.
31. SMB join
− Shuffle once, join everywhere → Amortized cost
− PR in Apache Beam (https://github.com/apache/beam/pull/8486)
Goal: handle gotchas automatically:
• Store and check bucketing metadata
• Handle skewness
• Support joining datasets with a different number of buckets
− Bonus: Storage is more efficient (better compression)
Shuffle once, then join as many times as needed. A PR against Beam is in progress.
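The "shuffle once, join everywhere" idea can be illustrated with a small in-memory sketch. This is not the Scio or Beam implementation: every name below (`SmbJoinSketch`, `bucketAndSort`, `mergeJoin`, `smbJoin`) is hypothetical, and the sketch assumes both sides use the same number of buckets. It does not store bucketing metadata or handle skew, which are the gotchas the slide lists.

```scala
// In-memory sketch of a sort-merge-bucket (SMB) join.
// All names are illustrative; none of them are Scio or Beam APIs.
object SmbJoinSketch {
  type Bucketed[V] = Vector[Vector[(String, V)]]

  // "Shuffle once": hash each key into a bucket, then sort every bucket by key.
  def bucketAndSort[V](records: Seq[(String, V)], numBuckets: Int): Bucketed[V] = {
    val grouped = records.groupBy { case (k, _) =>
      java.lang.Math.floorMod(k.hashCode, numBuckets)
    }
    Vector.tabulate(numBuckets) { b =>
      grouped.getOrElse(b, Seq.empty).sortBy(_._1).toVector
    }
  }

  // Merge two buckets that are already sorted by key, emitting inner-join pairs.
  def mergeJoin[A, B](
      left: Vector[(String, A)],
      right: Vector[(String, B)]
  ): Vector[(String, (A, B))] = {
    val out = Vector.newBuilder[(String, (A, B))]
    var i = 0
    var j = 0
    while (i < left.length && j < right.length) {
      val cmp = left(i)._1.compareTo(right(j)._1)
      if (cmp < 0) i += 1
      else if (cmp > 0) j += 1
      else {
        val key = left(i)._1
        // Find the end of the run of equal keys on each side.
        val iEnd = left.indexWhere(_._1 != key, i) match {
          case -1 => left.length
          case n  => n
        }
        val jEnd = right.indexWhere(_._1 != key, j) match {
          case -1 => right.length
          case n  => n
        }
        for (l <- i until iEnd; r <- j until jEnd)
          out += ((key, (left(l)._2, right(r)._2)))
        i = iEnd
        j = jEnd
      }
    }
    out.result()
  }

  // "Join everywhere": no shuffle at join time, only a bucket-by-bucket merge.
  def smbJoin[A, B](left: Bucketed[A], right: Bucketed[B]): Vector[(String, (A, B))] = {
    require(left.length == right.length, "this sketch assumes equal bucket counts")
    left.zip(right).flatMap { case (l, r) => mergeJoin(l, r) }
  }
}
```

Once a dataset has been through `bucketAndSort`, it can be reused in any number of `smbJoin` calls, which is where the amortized cost comes from.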
33. Scio 0.8
− SchemaCoder
(structure aware coders)
− BeamSQL
− Automatic type conversion
− Better coder support for java classes
− Simpler job completion API (remove futures / ExecutionContext)...
− Bugfixes
− etc.
Scio 0.8 adds SchemaCoder, BeamSQL, and more.
34. BeamSQL
val coll: SCollection[User] = ???
val r: SCollection[(String, List[String])] =
  sql"""
    SELECT username, emails
    FROM ${coll}
  """.as[(String, List[String])]
‒ Is the query valid SQL?
‒ Are `username` and `emails` valid fields in `User`?
‒ What’s the type of `username`?
‒ What’s the type of `emails`?
‒ Can the result be converted to the expected type?
Is the query valid SQL? What does String refer to?
35. BeamSQL
val r =
- sql"""
+ tsql"""
SELECT username, emails
FROM ${coll}
""".as[(String, List[String])]
40. BeamSQL
tsql"""
  SELECT username, emails
  FROM ${coll}
""".as[(Int, List[String])]
Inferred schema for query is not compatible
with the expected schema.
Query result schema (inferred):
┌─────────────────────────────┬──────────┬──────────┐
│ NAME                        │ TYPE     │ NULLABLE │
├─────────────────────────────┼──────────┼──────────┤
│ username                    │ STRING   │ NO       │
│ emails                      │ STRING[] │ NO       │
└─────────────────────────────┴──────────┴──────────┘
Expected schema:
┌─────────────────────────────┬──────────┬──────────┐
│ NAME                        │ TYPE     │ NULLABLE │
├─────────────────────────────┼──────────┼──────────┤
│ _1                          │ INT32    │ NO       │
│ _2                          │ STRING[] │ NO       │
└─────────────────────────────┴──────────┴──────────┘
"The inferred schema is not compatible with the expected schema."
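The kind of check behind this error message can be sketched in plain Scala. This is a simplified model, not how Scio implements `tsql`: the `Field` type, the `mismatches` function, and the rule that fields are compared by position (ignoring names, since tuple fields are called `_1`, `_2`) are all assumptions made for illustration.

```scala
// Simplified model of comparing an inferred schema with an expected one.
// Field, Schema and mismatches are hypothetical names, not Scio APIs.
object SchemaCheckSketch {
  final case class Field(name: String, tpe: String, nullable: Boolean)
  type Schema = List[Field]

  // Assumption: fields are matched by position; names are ignored.
  // Returns one message per incompatible position, or Nil when compatible.
  def mismatches(inferred: Schema, expected: Schema): List[String] =
    if (inferred.length != expected.length)
      List(s"field count differs: ${inferred.length} vs ${expected.length}")
    else
      inferred.zip(expected).collect {
        case (i, e) if i.tpe != e.tpe || i.nullable != e.nullable =>
          s"${i.name}: ${i.tpe} is not compatible with ${e.name}: ${e.tpe}"
      }
}
```

Fed the two schemas from the slide, this model flags only the first position (`STRING` against `INT32`); the `STRING[]` columns agree.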
41. Automatic type conversion
val in: SCollection[A] = ???
val r: SCollection[B] =
  in.to[B](To.safe)
− Convert between classes without boilerplate
− Support Java beans
− Support Scala case classes
− Support Avro SpecificRecord
Automatic type conversion. No boilerplate needed; it supports Java beans, Scala case classes, and Avro SpecificRecord.
42. Type conversion
val in: SCollection[A] = ???
val r: SCollection[B] =
  in.to[B](To.safe)
Schemas are not compatible:
A schema:
┌─────────────────────────────┬──────────┬──────────┐
│ NAME                        │ TYPE     │ NULLABLE │
├─────────────────────────────┼──────────┼──────────┤
│ i                           │ INT32    │ NO       │
│ s                           │ STRING   │ NO       │
│ e                           │ ROW      │ NO       │
│ e.xs                        │ INT64[]  │ NO       │
│ e.q                         │ STRING   │ NO       │
└─────────────────────────────┴──────────┴──────────┘
B schema:
┌─────────────────────────────┬──────────┬──────────┐
│ NAME                        │ TYPE     │ NULLABLE │
├─────────────────────────────┼──────────┼──────────┤
│ q                           │ STRING   │ NO       │
│ xs                          │ INT64[]  │ NO       │
└─────────────────────────────┴──────────┴──────────┘
Type conversion
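The two schemas are rejected because `q` and `xs` sit at the top level of `B` but are nested under `e` in `A`. A minimal sketch of what a hand-written conversion would look like in that case, assuming case classes reconstructed from the printed schemas (the definitions of `A`, `E`, and `B` and the helper `toB` are guesses for illustration; the slides do not show them):

```scala
// Case classes reconstructed from the schemas on the slide (an assumption).
object ConversionSketch {
  final case class E(xs: List[Long], q: String)
  final case class A(i: Int, s: String, e: E)
  final case class B(q: String, xs: List[Long])

  // Hand-written mapping: B's fields must be picked out of the nested record A.e.
  def toB(a: A): B = B(q = a.e.q, xs = a.e.xs)
}
```

A function like this could be applied with an ordinary `map` over the collection, which is the boilerplate the automatic conversion aims to remove when the schemas do line up.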