Tuesday, October 23, 12
14. > 500 M playlists
> 1 century of listening
> 20 k new tracks added per day
> 18 M tracks
> Available in 15 Countries
> 15 M active users*
* Users active within the previous 30 days
24. Service overview
Storage
User
AP
Search
Metadata
...
26. Content pipeline
Label A, Label B, Label C, Label D
Stages: Ingestion
Image: Steve Jurvetson (CC BY 2.0) http://www.flickr.com/photos/jurvetson/916142/
27. Ingestion
[Slide graphic: the letters "XML" scattered across the background]
Background image: lord enfield (CC BY 2.0) http://www.flickr.com/photos/42424413@N06/5064658450/
32. Ingestion: Delivery formats
~ 10 different incoming XML formats
- Proprietary formats (majors)
- Spotify delivery format (mostly indies)
Thousands of lines of source-specific code
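One way to picture the source-specific code is as one small parser per delivery format, each producing the same internal record. The following is only a minimal sketch: the format names (`major_x`, `spotify`), the element and attribute names, and the record fields are all invented for illustration and are not taken from the talk.

```python
import xml.etree.ElementTree as ET

# Hypothetical parsers: format names, tags and attributes are made up.
def parse_major_x(root):
    """Proprietary-style format with nested child elements."""
    return [{"title": song.findtext("Name"), "artist": song.findtext("Performer")}
            for song in root.iter("Song")]

def parse_spotify(root):
    """Attribute-based format, standing in for a simpler delivery format."""
    return [{"title": track.get("title"), "artist": track.get("artist")}
            for track in root.iter("track")]

# Registry mapping each delivery source to its parser.
PARSERS = {"major_x": parse_major_x, "spotify": parse_spotify}

def ingest(source, xml_text):
    """Parse one delivery with the parser registered for its source."""
    root = ET.fromstring(xml_text)
    return PARSERS[source](root)

delivery_a = ("<Delivery><Song><Name>Kiss</Name>"
              "<Performer>Prince</Performer></Song></Delivery>")
delivery_b = '<release><track title="Kiss" artist="Prince"/></release>'

# Both deliveries end up as the same internal record.
print(ingest("major_x", delivery_a))
print(ingest("spotify", delivery_b))
```

Keeping the per-format logic behind a single `ingest` entry point means later pipeline stages only ever see the common record shape.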
33. Data model [simplified]
[Diagram: Artist, Album, Disc, Track, Rights, Audio and Transcoding, linked by relations with multiplicities 1 and *]
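The entities on this slide could be expressed as plain Python classes. This is a rough sketch under stated assumptions: the slide only names the entities and shows 1 and * multiplicities, so which entity owns which (album has discs, disc has tracks, track has one audio with several transcodings) and every field name below are guesses for illustration, not the actual Spotify schema.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Artist:
    name: str

@dataclass
class Transcoding:
    codec: str          # hypothetical field
    bitrate_kbps: int   # hypothetical field

@dataclass
class Audio:
    # Assumed: one audio source has many transcodings
    transcodings: List[Transcoding] = field(default_factory=list)

@dataclass
class Rights:
    territory: str      # hypothetical field

@dataclass
class Track:
    title: str
    artists: List[Artist] = field(default_factory=list)
    audio: Audio = field(default_factory=Audio)          # assumed 1:1
    rights: List[Rights] = field(default_factory=list)   # assumed 1:*

@dataclass
class Disc:
    number: int
    tracks: List[Track] = field(default_factory=list)

@dataclass
class Album:
    title: str
    artists: List[Artist] = field(default_factory=list)
    discs: List[Disc] = field(default_factory=list)

prince = Artist("Prince")
track = Track("Kiss", artists=[prince],
              audio=Audio([Transcoding("ogg", 160)]),
              rights=[Rights("SE")])
album = Album("Parade", artists=[prince], discs=[Disc(1, [track])])
print(album.discs[0].tracks[0].title)
```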
34. Ingestion
lxml and XSLT, with extensions, for parsing and transforming XML
35. Ingestion: XPath extensions
>>> def formerlify(_, name):
...     return 'The artist formerly known as %s' % name
>>> # Register the function in a namespace bound to the prefix 'f'
>>> from lxml import etree
>>> ns = etree.FunctionNamespace('http://my.org/myfunctions')
>>> ns['hello'] = formerlify
>>> ns.prefix = 'f'
>>> root = etree.XML('<a><b>Prince</b></a>')
>>> print(root.xpath('f:hello(string(b))'))
The artist formerly known as Prince
http://lxml.de/extensions.html#xpath-extension-functions
39. Ingestion
Fun (?!) fact: the largest XML file seen so far had 3.3 million rows and took up
350 MB of disk space.
For comparison, the Bible apparently fits in 3 MB of XML.
Average time in seconds to parse the file, over 5 runs per parser:
>>> import timeit
>>> timeit.timeit('e.parse("huge.xml")',
...               setup='import lxml.etree as e',
...               number=5) / 5
4.19...
>>> timeit.timeit('e.parse("huge.xml")',
...               setup='import xml.etree.cElementTree as e',
...               number=5) / 5
4.78...
>>> timeit.timeit('e.parse("huge.xml")',
...               setup='import xml.etree.ElementTree as e',
...               number=5) / 5
55.39...
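To repeat this kind of measurement without access to the original file, a small helper can time any parser module on a generated file. This is a sketch only: `average_parse_time` is a hypothetical helper, and the synthetic file (its size and element names) is arbitrary, so the numbers it prints are not comparable to the ones above.

```python
import importlib
import os
import tempfile
import timeit

def average_parse_time(module_name, path, runs=5):
    """Return the mean wall-clock seconds for module.parse(path) over `runs` runs."""
    parser = importlib.import_module(module_name)
    return timeit.timeit(lambda: parser.parse(path), number=runs) / runs

# Write a small synthetic XML file; size and element names are arbitrary.
with tempfile.NamedTemporaryFile("w", suffix=".xml", delete=False) as handle:
    handle.write("<tracks>")
    for i in range(10000):
        handle.write("<track id='%d'><title>Song %d</title></track>" % (i, i))
    handle.write("</tracks>")
    sample_path = handle.name

elapsed = average_parse_time("xml.etree.ElementTree", sample_path)
print("xml.etree.ElementTree: %.4f s per parse" % elapsed)

os.remove(sample_path)  # clean up the temporary file
```

Passing `"lxml.etree"` as the module name should work the same way wherever lxml is installed, since it also exposes a `parse` function.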
42. Content pipeline
Label A, Label B, Label C, Label D
Stages: Ingestion, Merge
44. Metadata - challenges
Image: Nicolas Genin (CC BY 2.0) http://www.flickr.com/photos/22785954@N08
48. Content pipeline
Label A, Label B, Label C, Label D
Stages: Ingestion, Merge
Curation/enrichment
58. Content matching
(16 * 10 ** 6) ** 2 = 2.56 * 10 ** 14 = A large number
Reduce search space:
>>> from unicodedata import normalize
>>> key = ''.join(normalize('NFD', char)[0].lower() for char in title)[:5]
Side note: Levenshtein (edit) distance is a heavy operation
-> sped up about 4x with PyPy (or use a C extension)
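The idea of reducing the search space can be sketched as follows: group titles by a short normalized key, then run the expensive edit-distance comparison only within each group. This is an illustration, not the production matcher. The key length of 5, the distance threshold of 2, the helper names and the sample titles are all assumptions, and the pure-Python Levenshtein function stands in for whatever implementation was actually used.

```python
from collections import defaultdict
from unicodedata import normalize

def blocking_key(title, length=5):
    """First `length` characters, lowercased, with accents dropped via NFD."""
    return ''.join(normalize('NFD', char)[0].lower() for char in title)[:length]

def levenshtein(a, b):
    """Edit distance computed row by row with dynamic programming."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, 1):
        current = [i]
        for j, char_b in enumerate(b, 1):
            current.append(min(previous[j] + 1,                       # deletion
                               current[j - 1] + 1,                    # insertion
                               previous[j - 1] + (char_a != char_b))) # substitution
        previous = current
    return previous[-1]

def candidate_pairs(titles, max_distance=2):
    """Compare titles only within the bucket that shares their blocking key."""
    buckets = defaultdict(list)
    for title in titles:
        buckets[blocking_key(title)].append(title)
    pairs = []
    for bucket in buckets.values():
        for i, first in enumerate(bucket):
            for second in bucket[i + 1:]:
                if levenshtein(first.lower(), second.lower()) <= max_distance:
                    pairs.append((first, second))
    return pairs

titles = ["Purple Rain", "Purple rain!", "Purple Haze",
          "Beyonc\u00e9 Live", "Beyonce Live"]
matches = candidate_pairs(titles)
print(matches)
```

The trade-off is that two titles differing within the first few characters land in different buckets and are never compared, so the key buys speed at the cost of some missed matches.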
60. Patch it!
Automatic data processing will never be perfect
65. Content pipeline
Label A, Label B, Label C, Label D
Stages: Ingestion, Merge
Curation/enrichment
Transcoding
72. Content pipeline
Label A, Label B, Label C, Label D
Stages: Ingestion, Merge, Indexing
Curation/enrichment
Transcoding
76. Index build
• Nightly batch job on DB dumps
• Previously mostly Python, but now moved to Java for performance reasons
• But still lots of Python helper scripts :)
83. Content pipeline
Label A, Label B, Label C, Label D
Stages: Ingestion, Merge, Indexing, Publish
Curation/enrichment
Transcoding
On-site live services, e.g. search, browse
106. Choice of database
Depends on the use case - duh!
• PostgreSQL (e.g. user service)
• Cassandra (e.g. playlist service)
• Tokyo Cabinet (e.g. browse service)
• Lucene (search service)
• HDFS
107. PostgreSQL
[Pic. of elephant]
Image: http2007 (CC BY 2.0) http://www.flickr.com/photos/42424413@N06/5064658450/
108. PostgreSQL
Redundancy + scaling: master/slave
109. PostgreSQL
Joins and subqueries - let the query planner roll!
116. Thank you
henok@spotify.com
117. Distribution/publish
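Popen + gevent (although IO-bound)
import subprocess
import gevent
import gevent.monkey
gevent.monkey.patch_all()

def _wait(self):
    # Poll the child instead of blocking, so other greenlets can run meanwhile
    while True:
        res = self.poll()
        if res is not None:
            return res
        gevent.sleep(0.1)

# Replace the blocking wait() with the cooperative version
subprocess.Popen.wait = _wait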