Playlist Recommendations @ Spotify
Nikhil Tibrewal
@nikhil_tibrewal
Who am I?
Nikhil Tibrewal (Nick-hill)
● Data Engineer on Lambda squad (Spotify’s primary ML team)
● Graduated from Carnegie Mellon University in Dec 2013
● B.Sc. in Computer Science + additional major in Econ
● Been part of Spotify band for ~1.5 years
● Worked on a range of projects, primarily Playlist Recommendations
Spotify in numbers
● Started in 2006, 58 markets
● 75M+ active users, 20M+ paying
● 30M+ songs, 20K new per day
● 1.5+ billion playlists
● 1 TB logs per day
Recommendations so far on Spotify
● Discover tab
● Radio
● Related Artists
● Discover Weekly
● Playlist recs on “Now” Strip
For Ellie Goulding
“Now” Strip (screenshot): human curated playlist
“Now” Strip (screenshot): human curated playlist and recommended playlist
But…
How are playlist recs generated?
Quick Overview!
● Recommend only human
curated playlists (1000+)
○ Well-designed cover images
○ Thorough descriptions
○ Title reflects content
Good vs. bad playlist examples (images)
Quick Overview!
● Recommendations pipeline: Candidate Generation
○ Generate N dimensional track vectors from collaborative filtering
○ Vectorize playlists:
■ Playlist vector derived from track vectors in playlist
○ Use Annoy to store playlist vectors in N dimensional space
ANNOY (Approximate Nearest Neighbors Oh Yeah)
created at Spotify
https://github.com/spotify/annoy
○ Vectorize user taste as well:
■ User vector derived from user listening history
○ User and playlist vectors in same space!
○ Query for nearest playlists to user from Annoy tree
annoyTree.getNearest(seedVector, K)
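The candidate generation steps above can be sketched in plain Python. This is a minimal illustration under assumptions, not Spotify's pipeline: the slides only say a playlist vector is "derived from" its track vectors, so the component-wise mean, the cosine similarity, and all names and toy 2-dimensional vectors below are hypothetical. An exact brute-force search stands in for the approximate Annoy query.

```python
import math

def mean_vector(vectors):
    """Average a non-empty list of equal-length vectors component-wise."""
    n = len(vectors)
    return [sum(component) / n for component in zip(*vectors)]

def cosine(a, b):
    """Cosine similarity of two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def nearest_playlists(seed_vector, playlist_vectors, k):
    """Exact brute-force stand-in for an approximate nearest-neighbor query."""
    ranked = sorted(
        playlist_vectors,
        key=lambda pid: cosine(seed_vector, playlist_vectors[pid]),
        reverse=True,
    )
    return ranked[:k]

# Hypothetical track vectors; real ones would come from collaborative filtering.
track_vectors = {
    "rock_a": [1.0, 0.0],
    "rock_b": [0.9, 0.1],
    "jazz_a": [0.0, 1.0],
    "jazz_b": [0.1, 0.9],
}
playlists = {
    "rock_mix": ["rock_a", "rock_b"],
    "jazz_mix": ["jazz_a", "jazz_b"],
    "blend": ["rock_a", "jazz_a"],
}
playlist_vectors = {
    pid: mean_vector([track_vectors[t] for t in tracks])
    for pid, tracks in playlists.items()
}
# The user vector is built the same way from listening history,
# so it lives in the same space as the playlist vectors.
user_vector = mean_vector([track_vectors[t] for t in ["rock_a", "rock_b", "rock_a"]])
top = nearest_playlists(user_vector, playlist_vectors, k=2)
```

Because users and playlists share one vector space, a single similarity query yields candidates; an approximate index trades a little accuracy for much faster lookups than this linear scan.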
Quick Overview!
● Recommendations pipeline: Ranking Model
○ Use genre information, demographics data, and playlist popularity
data to further rank recommendations
■ John: 21, USA, likes rock
■ Should get rock playlist recs that are popular in USA and
amongst 21 year olds
○ Apply post-processing steps that shuffle results and add variety to avoid
repetition
90% of DAUs (daily active users) have recs!
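A rough sketch of the ranking stage follows. The slides do not describe the actual model, so the scoring rule (genre match plus popularity within the user's country and age), the per-genre cap used for variety, and every name and number here are assumptions made for illustration.

```python
def rank_candidates(candidates, user, popularity):
    """Order candidates by a hypothetical score: genre match plus
    popularity within the user's country and age."""
    def score(playlist):
        genre_match = 1.0 if playlist["genre"] in user["genres"] else 0.0
        pop = popularity.get((playlist["id"], user["country"], user["age"]), 0.0)
        return genre_match + pop
    return sorted(candidates, key=score, reverse=True)

def diversify(ranked, max_per_genre):
    """Post-processing: keep at most max_per_genre playlists of each genre."""
    seen = {}
    result = []
    for playlist in ranked:
        count = seen.get(playlist["genre"], 0)
        if count < max_per_genre:
            result.append(playlist)
            seen[playlist["genre"]] = count + 1
    return result

# Hypothetical user and candidate data, modeled on the John example.
john = {"age": 21, "country": "US", "genres": {"rock"}}
candidates = [
    {"id": "p1", "genre": "pop"},
    {"id": "p2", "genre": "rock"},
    {"id": "p3", "genre": "rock"},
    {"id": "p4", "genre": "rock"},
]
popularity = {
    ("p1", "US", 21): 0.9,
    ("p2", "US", 21): 0.2,
    ("p3", "US", 21): 0.8,
    ("p4", "US", 21): 0.5,
}
ranked = rank_candidates(candidates, john, popularity)
final = diversify(ranked, max_per_genre=2)
ranked_ids = [p["id"] for p in ranked]
final_ids = [p["id"] for p in final]
```

Here the rock playlists outrank the more popular pop playlist because of the genre match, and the cap then drops the weakest rock playlist so the list is not all one genre.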
Quick Overview!
● Infrastructure
○ Luigi to manage workflow (also built at Spotify)
○ Entire pipeline written in Scalding
○ 1200+ node Hadoop cluster to run jobs
○ Cassandra (~dozen nodes for playlist recs)
○ Java backend micro-services serving recs
Quick Overview!
"Scalding is comprised of a DSL (domain-specific language)
that makes MapReduce computations look like Scala’s
collection API and is a wrapper for Cascading to make it easy
to define jobs, test and data sources on an HDFS" (http://cascading.io/customer/twitter/)
Scalding w.r.t. Playlist Recs
● Used Python back in the day
○ Inputs and outputs were tab separated
○ Complexity UP => Difficulty to maintain UP
○ Hard to write tests
● Scalding provided compile time error checks
○ Catch errors early
○ Define schemas (e.g. Avro)
● Can use Parquet + Avro for input/output
○ Easy to write and read data
○ Records with a lot of fields!
○ Lesson: Parquet hurts performance w/ fat columns (nested data structs)
Scalding w.r.t. Playlist Recs
● Data quality
○ Hadoop counters wrappers in extended Scalding library code
○ Verify counters within reasonable ranges
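The counter check could look roughly like the sketch below. The counter names and bounds are hypothetical, and reading the values from a finished Hadoop job is left out; only the range validation is shown.

```python
def check_counters(counters, expected_ranges):
    """Return a list of violations; an empty list means every counter is in range."""
    problems = []
    for name, (low, high) in expected_ranges.items():
        value = counters.get(name)
        if value is None:
            problems.append(f"{name}: missing")
        elif not low <= value <= high:
            problems.append(f"{name}: {value} outside [{low}, {high}]")
    return problems

# Hypothetical counter values, as if collected after a job run.
counters = {"playlists_vectorized": 1200, "users_without_history": 50000}
expected_ranges = {
    "playlists_vectorized": (1000, 5000),
    "users_without_history": (0, 10000),
    "tracks_missing_vector": (0, 100),
}
problems = check_counters(counters, expected_ranges)
```

A workflow step can fail the run when the list is non-empty, so bad data is caught before downstream jobs consume it.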
Scalding w.r.t. Playlist Recs
● Pipeline tolerance
○ Job failures are normal, and annoying with big jobs
○ Scalding checkpoints
○ Lesson: checkpoint itself is a map-reduce job and has the same caveats
○ Still very helpful!
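The general idea behind a checkpoint can be sketched as below. This is not Scalding's checkpoint API: it is a local-file analogy with hypothetical names, showing that a rerun reuses saved intermediate output, and that saving is itself an extra write step with its own cost.

```python
import json
import os
import tempfile

def checkpoint(path, compute):
    """Return the saved result at path if present; otherwise run compute,
    save its result to path, and return it."""
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    result = compute()
    with open(path, "w") as f:
        json.dump(result, f)
    return result

calls = []

def expensive_step():
    calls.append("ran")  # record each real execution
    return {"rock_mix": [0.95, 0.05]}

workdir = tempfile.mkdtemp()
path = os.path.join(workdir, "playlist_vectors.json")
first = checkpoint(path, expensive_step)   # computes and writes the file
second = checkpoint(path, expensive_step)  # reads the saved copy instead
```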
Scalding w.r.t. Playlist Recs
● Job runtimes
○ Common solutions: more reducers and code optimizations
○ Speculative execution for larger jobs
○ Caveat: can take up unnecessary resources
Scalding w.r.t. Playlist Recs
● Memory issues
○ Used Sparkey indices in Python (developed at Spotify, now open source)
■ “Simple constant key/value storage lib for read-heavy systems with
infrequent large bulk inserts”
■ Replicated to all mappers
○ Complex jobs in Scalding => higher memory config for jobs with Sparkey
○ Lesson: trade memory resources for MAYBE a little more time with joins
bigPipe.join(exSparkeyPipe)
https://github.com/spotify/sparkey
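The memory-versus-time trade-off can be illustrated with two ways of enriching records. This is a plain Python analogy, not Sparkey or Scalding code: the in-memory dict stands in for a key/value index replicated to every mapper, the sort-and-merge stands in for a distributed join, and the data is hypothetical.

```python
def enrich_with_lookup(records, index):
    """Lookup style: every worker holds the full key/value index in memory."""
    return [(key, value, index[key]) for key, value in records if key in index]

def enrich_with_join(records, side_records):
    """Join style: sort both inputs by key and merge them, without building
    a full in-memory index. Assumes keys in side_records are unique."""
    left = sorted(records)
    right = sorted(side_records)
    out, j = [], 0
    for key, value in left:
        # Advance the right side until it reaches or passes the current key.
        while j < len(right) and right[j][0] < key:
            j += 1
        if j < len(right) and right[j][0] == key:
            out.append((key, value, right[j][1]))
    return out

# Hypothetical (track, play count) records and (track, genre) side data.
records = [("t2", 5), ("t1", 3), ("t3", 1), ("t1", 7)]
side_records = [("t1", "rock"), ("t2", "jazz")]

via_lookup = enrich_with_lookup(records, dict(side_records))
via_join = enrich_with_join(records, side_records)
```

Both approaches produce the same rows; the join pays for sorting (a shuffle, in a distributed job) but does not require each worker to hold the whole side table in memory.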
Scalding w.r.t. Playlist Recs
● Driven
○ “A sophisticated tool that collects telemetry data from running Scalding /
Cascading jobs on a cluster and presenting them in an intriguing User
Interface."
○ http://cascading.io/
Scalding w.r.t. Playlist Recs
● Other awesome benefits
○ Active community + big players
○ Data pipeline flows naturally follow the functional paradigm - essentially
writing Scala code
Scalding w.r.t. Playlist Recs
Productivity without sacrificing performance!
Status: Completed
Spotify is hiring!
Nikhil Tibrewal
@nikhil_tibrewal
