From stream to recommendation using Apache Beam with Cloud Pub/Sub and Cloud Dataflow


Spotify talk at GCP NEXT 2016, March 24, 2016.

Google Cloud Pubsub, Dataflow, BigQuery and Scio.

https://github.com/spotify/scio

Published in: Software


  1. Speakers: Igor Maravić & Neville Li, Spotify. From stream to recommendation with Cloud Pub/Sub and Cloud Dataflow. DATA & ANALYTICS
  2. Current Event Delivery System
  3. Current event delivery system. Diagram labels: Clients, Gateway, Syslog, Syslog Producer, Any Data Centre, Groupers, Realtime Brokers, ETL job, Checkpoint Monitor, Hadoop, Hadoop Data Center, Service Discovery, ACK Brokers, Syslog Consumer, Liveness Monitor, Brokers
  4. Complex (same diagram as slide 3)
  5. Stateless (same diagram as slide 3)
  6. Delivered data growth (chart, x-axis 2007 to 2015)
  7. Redesigning Event Delivery
  8. Redesigning event delivery. Diagram labels: Gateway, Syslog, File Tailer, Any data centre, Clients, Hadoop, Event Delivery Service, Reliable Persistent Queue, ETL
  9. Same API (same diagram as slide 8)
  10. Persistence (same diagram as slide 8)
  11. Keep it simple (same diagram as slide 8)
  12. Build it!
  13. Choosing reliable persistent queue
  14. Kafka 0.8
  15. Proven technology
  16. Strong community
  17. Reliable persistent queue
  18. Event delivery with Kafka 0.8. Diagram labels: Gateway, Syslog, File Tailer, Any data centre, Clients, Hadoop, Event Delivery Service, Hadoop data centre, Camus (ETL), Brokers, Mirror Makers, Brokers
  19. Event delivery with Kafka 0.8 (same diagram as slide 18)
  20. Cloud Pub/Sub
  21. Retains undelivered data
  22. At-least-once delivery
  23. Globally available
  24. Simple REST API
  25. No operational responsibility*
  26. SHUT UP AND TAKE MY MONEY!
  27. Caution advised!
  28. Building up trust in Cloud Pub/Sub
  29. Delivered data growth (chart, x-axis 2007 to 2015)
  30. Demo time!
  31. 2M events per second.
  32. Cloud Pub/Sub, Spotify chooses You!
  33. Event delivery with Cloud Pub/Sub. Diagram labels: Gateway, Any data centre, Clients, Hadoop, Cloud Pub/Sub, Event Delivery Service, File Tailer, Syslog, Cloud Storage, Dataflow. Caption: ETL using Cloud Dataflow
  34. Streaming ETL job with Cloud Dataflow
  35. Dataflow SDK is a framework
  36. Cloud Dataflow is a managed service
  37. ETL job
  38. Single Cloud Pub/Sub subscription
  39. GCS and HDFS in parallel.
  40. Event time based hourly buckets (figure: six hourly buckets, 2016-03-21 23H to 2016-03-22 04H)
  41. Incremental bucket fill (same six hourly buckets)
  42. Bucket completeness (same six hourly buckets)
  43. Late data handling (same six hourly buckets)
  44. Event time based hourly buckets; incremental bucket fill; bucket completeness; late data handling
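  45. Windowing
  46. Windowing

          @Override
          public PCollection<KV<String, Iterable<EventMessage>>> apply(
              final PCollection<KV<String, EventMessage>> shardedEvents) {
            return shardedEvents
                .apply("Assign Hourly Windows",
                    // The slide abbreviates the type argument as <~>; it is spelled out here.
                    Window.<KV<String, EventMessage>>into(FixedWindows.of(ONE_HOUR))
                        .withAllowedLateness(ONE_DAY)
                        .triggering(
                            AfterWatermark.pastEndOfWindow()
                                // Before the watermark passes the window end, fire a pane
                                // each time maxEventsInFile elements have accumulated.
                                .withEarlyFirings(AfterPane.elementCountAtLeast(maxEventsInFile))
                                // For late data, fire on whichever comes first: the element
                                // count, or ten seconds after the first element in the pane.
                                .withLateFirings(AfterFirst.of(
                                    AfterPane.elementCountAtLeast(maxEventsInFile),
                                    AfterProcessingTime.pastFirstElementInPane()
                                        .plusDelayOf(TEN_SECONDS))))
                        // Each pane holds only the elements that arrived since the last firing.
                        .discardingFiredPanes())
                .apply("Aggregate Events", GroupByKey.create());
          }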
  47. Streaming
  48. Where are we right now?
  49. Preliminary results (chart: watermark lag, in minutes)
  50. Scio: Scala API for Google Cloud Dataflow
  51. Origin story
      Scalding and Spark popular for ML, recommendations, analytics @ Spotify
      50+ users, 400+ unique jobs
      Early 2015 - Dataflow Scala hack project
  52. Why not Scalding on GCE
      Pros
      ● Big community - Twitter, eBay, Etsy, Stripe, LinkedIn, SoundCloud
      ● Stable and proven
      Cons
      ● Hadoop cluster operations
      ● Multi-tenancy, resource contention and utilization
      ● No streaming mode
  53. Why not Spark on GCE
      Pros
      ● Batch, streaming, interactive and SQL
      ● MLlib, GraphX
      ● Scala, Python, and R support
      Cons
      ● Hard to tune and scale
      ● Cluster lifecycle management
  54. Why Dataflow with Scala
      Dataflow
      ● Hosted solution, no operations
      ● Ecosystem: GCS, BigQuery, Pub/Sub, Datastore, Bigtable
      ● Simple unified model for batch and streaming
      Scala
      ● High level DSL, easy transition for developers
      ● Reusable and composable code via functional programming
      ● Numerical libraries: Breeze, Algebird
  55. Diagram labels: Cloud Storage, Pub/Sub, Datastore, Bigtable, BigQuery; Batch, Streaming, Interactive REPL; Scio Scala API; Dataflow Java SDK; Scala Libraries; Extra features
  56. Scio
      Ecclesiastical Latin IPA: /ˈʃi.o/, [ˈʃiː.o], [ˈʃi.i̯o]
      Verb: I can, know, understand, have knowledge.
      Core API similar to spark-core, some ideas from scalding
      github.com/spotify/scio
  57. WordCount
      Almost identical to Spark version

          val sc = ScioContext()
          sc.textFile("shakespeare.txt")
            .flatMap(_.split("[^a-zA-Z']+").filter(_.nonEmpty))
            .countByValue()
            .saveAsTextFile("wordcount.txt")

  58. PageRank in 13 lines

          def pageRank(in: SCollection[(String, String)]) = {
            val links = in.groupByKey()
            var ranks = links.mapValues(_ => 1.0)
            for (i <- 1 to 10) {
              val contribs = links.join(ranks).values
                .flatMap { case (urls, rank) =>
                  val size = urls.size
                  urls.map((_, rank / size))
                }
              ranks = contribs.sumByKey.mapValues((1 - 0.85) + 0.85 * _)
            }
            ranks
          }
  59. SQL and Big Data Pipelines
      SQL is easier to write than data pipelines, but
      Hive with TSV or Avro
      ● Row based storage, inefficient full scan
      ● No integration with other frameworks
      Parquet
      ● Inspired by Google Dremel which powers BigQuery
      ● Immature Hive integration, hard to scale with Spark SQL
      ● Poor impedance matching with Scalding, Avro, etc.
  60. BigQuery and Scio
      BigQuery
      ● Slicing and dicing, aggregation, etc.
      ● Scaling independently
      ● Web UI, Tableau, QlikView etc.
      Scio
      ● Custom logic hard to express in SQL
      ● Seamless integration with BigQuery IO
      ● Scala macros for type safety
  61. JSON vs Type Safe BigQuery
      JSON approach, a.k.a. everything is Object

          sc.bigQuerySelect("...").map { r =>
            (r.get("track").asInstanceOf[TableRow]
               .get("name").asInstanceOf[String],
             r.get("audio").asInstanceOf[TableRow]
               .get("tempo").toString.toInt)
          }

      Compile → Run job → Wait → NullPointerException or ClassCastException → Repeat
      Type safe approach

          @BigQueryType.fromQuery("...")
          class TrackTempo
          sc.typedBigQuery[TrackTempo]().map { t =>
            (t.track.name, t.audio.tempo.getOrElse(-1))
          }

      Compile → Run → Profit
  62. Spotify Running
      60 million tracks
      30 million users * 10 tempo buckets * 25 personalized tracks
      Audio: tempo, energy, time signature ...
      Metadata: genres, categories
      Latent vectors from collaborative filtering
  63. Rapid prototyping with BigQuery
  64. Spotify Running (pipeline diagram)
      Inputs: SELECT user_id, vector FROM UserEntity WHERE ...; SELECT track_id, audio.tempo ... FROM TrackEntity WHERE ...
      Step labels: most popular per recording, top N tracks per artist, bucket by tempo, vector LSH per bucket, top tracks per user + bucket
      Other labels: GBK (three times), RBK, side input, Cloud Datastore
  65. (no text)
  66. (no text)
  67. What’s the catch?
      Early stage, some rough edges
      No interactive mode → Scio REPL (WIP), BigQuery + Datalab
      No machine learning → TensorFlow
      Licensed under Apache 2, contribution welcome!
  68. Learnings?
  69. Blog posts @ labs.spotify.com
      Spotify’s Event Delivery - The Road To The Cloud
      Part I, Part II, Part III
  70. Thank You
      Igor Maravić <igor@spotify.com>
      Neville Li <neville@spotify.com>
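
Slides 40 to 44 name four requirements for the ETL job: event time based hourly buckets, incremental bucket fill, bucket completeness, and late data handling. The following is a minimal sketch in plain Java of how an event timestamp could map to an hourly bucket and how a watermark could decide whether that bucket is still open, complete, or closed. It is not the speakers' code and does not use the Dataflow SDK. `HourlyBuckets`, `Status`, and `ALLOWED_LATENESS` are hypothetical names; the one-day lateness is assumed to mirror `ONE_DAY` on the windowing slide, buckets are assumed to be in UTC, and the watermark comparison is a simplification of what Dataflow does internally.

```java
import java.time.Duration;
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;
import java.time.temporal.ChronoUnit;

/** Hypothetical helper for illustration; not part of the Dataflow SDK. */
final class HourlyBuckets {
    // Same label shape as on the slides, e.g. "2016-03-22 03H" (UTC is an assumption).
    private static final DateTimeFormatter LABEL =
        DateTimeFormatter.ofPattern("yyyy-MM-dd HH'H'").withZone(ZoneOffset.UTC);

    // Assumption: mirrors ONE_DAY passed to withAllowedLateness on the windowing slide.
    static final Duration ALLOWED_LATENESS = Duration.ofDays(1);

    enum Status { OPEN, COMPLETE_ACCEPTING_LATE, CLOSED }

    /** Start of the hourly bucket that contains the given event time. */
    static Instant bucketStart(Instant eventTime) {
        return eventTime.truncatedTo(ChronoUnit.HOURS);
    }

    /** Human-readable bucket label derived from event time, not arrival time. */
    static String bucketLabel(Instant eventTime) {
        return LABEL.format(bucketStart(eventTime));
    }

    /**
     * Simplified completeness check: a bucket is OPEN until the watermark reaches
     * its end, then accepts late data for ALLOWED_LATENESS, then is CLOSED.
     */
    static Status status(Instant eventTime, Instant watermark) {
        Instant end = bucketStart(eventTime).plus(1, ChronoUnit.HOURS);
        if (watermark.isBefore(end)) {
            return Status.OPEN;
        }
        if (watermark.isBefore(end.plus(ALLOWED_LATENESS))) {
            return Status.COMPLETE_ACCEPTING_LATE;
        }
        return Status.CLOSED;
    }

    public static void main(String[] args) {
        Instant event = Instant.parse("2016-03-22T03:59:59Z");
        Instant watermark = Instant.parse("2016-03-22T04:05:00Z");
        System.out.println(bucketLabel(event) + " is " + status(event, watermark));
    }
}
```

In this sketch an event stamped 2016-03-22T03:59:59Z always lands in the `2016-03-22 03H` bucket, whenever it arrives. That is the property that distinguishes event time bucketing from bucketing by processing time.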
