Searching and Recommending TV series with SQL

Series-O-Rama
Search & Recommend TV series with SQL
http://bit.ly/series-o-rama
Guillaume Cabanac
guillaume.cabanac@univ-tlse3.fr

Toulouse: A Picture is Worth a Thousand Words
Capbreton: 3h ride
Toulouse: population 437 000, students 97 000
Ax-les-Thermes: 1h40 ride
Collioure: 2h30 ride

en.wikipedia.org
Telly Addicts Need Help to Find TV Series
• Main topics of Grey’s Anatomy?
  ⇒ Text mining, Visualization
• Series about ‘plane crash island’
  ⇒ Search engine
• What should I watch next?
  ⇒ Recommender system
amazon.com →

Text Mining: Let’s Crunch Subtitles
[Figure: main topics of Cold Case and Grey’s Anatomy]

What’s in a Subtitle File?
• File name: Title – Season – Episode – Language.srt
• 1 episode = 1 plain text file
• Synchronization: start --> stop
• Dialogue
⇒ We can easily extract words:
[ a, again*2, and, but, com, cuban,
different, favorite, food, for*2, forum,
going, great, happen*2, has, hungry, i*2,
is, it, love, m, my, nice, night*2, miami,
now, pork, s*2, sandwiches, something, the,
to*2, tonight, town, www ]
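The extraction step can be sketched as follows. This is a minimal illustration rather than the code behind Series-O-Rama (which is described as 100% Java and Oracle): the sample subtitle text and the function name `extract_words` are assumptions made up for this example. Splitting on non-letters also explains tokens such as `m` and `s` in the list above, which come from contractions like "I'm" and "it's".

```python
import re
from collections import Counter

# Hypothetical subtitle fragment in SubRip (.srt) layout; the dialogue is
# invented for illustration and is not taken from the slides.
SAMPLE_SRT = """1
00:00:01,000 --> 00:00:03,500
I love Miami at night.

2
00:00:04,000 --> 00:00:06,000
Miami has great food!
"""

def extract_words(srt_text):
    """Return lowercase word counts from the dialogue lines of an .srt text."""
    counts = Counter()
    for line in srt_text.splitlines():
        line = line.strip()
        # Skip blank lines, cue numbers, and "start --> stop" timing lines.
        if not line or line.isdigit() or "-->" in line:
            continue
        # Keep runs of letters only, so punctuation and digits are dropped.
        counts.update(re.findall(r"[a-z]+", line.lower()))
    return counts

words = extract_words(SAMPLE_SRT)
```

Here `words` maps each term to its number of occurrences, in the same spirit as the `term*count` notation of the list above.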

DB technology at Work! [Home]
7 527 files = 337 MB
100% Java and Oracle

DB technology at Work! [Search engine]
Ranked list of results

DB technology at Work! [Infos]
Most popular terms
Most related series

DB technology at Work! [Recommendations]
I liked / I disliked
What should I watch next?
Ranked list of recommendations

How Does this Work?

Architecture and Data Model
Offline: subtitles → indexing → DB
Online: GUI → searching, browsing, recommending → DB

Dict = { idT, term }
idT  term
8    plane
27   killer
29   crash

Posting = { idT*, idS*, nb }
idT  idS  nb
27   45   89
8    45   3
8    12   90

Posting[idT] ⊆ Dict[idT]
Posting[idS] ⊆ Series[idS]

Theory − Text Indexing Pipeline
Tokenization + lowercase: [the, plane, crashed, ..., planes, ..., is]
Stopwords removal: [plane, crashed, ..., planes, ...]
Stemming: [plane, crash, ..., plane, ...]
Counting: {(plane, 48), (crash, 15) ...}

Porter’s Stemmer (1980)
http://qaa.ath.cx/porter_js_demo.html
Sample text:
In 1720 Robert Gordon retired to Aberdeen having amassed a considerable fortune in Poland. On his death 11 years later he willed his entire estate to build a residential school for educating young boys. In the summer of 1750 the Robert Gordon’s Hospital was born. In 1881 this was converted into a day school to be known as Robert Gordon’s College. This school also began to hold day and evening classes for boys girls and adults in primary secondary mechanical and other subjects …
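The four pipeline steps can be sketched in a few lines. This is a hedged illustration, not the system's actual indexer: `STOPWORDS` is a tiny made-up list, and `toy_stem` is a crude suffix stripper standing in for Porter's stemmer (it handles only the forms used in this example).

```python
import re
from collections import Counter

# Tiny illustrative stopword list (assumption; real lists are much longer).
STOPWORDS = {"the", "is", "a", "and", "on", "in", "of", "to"}

def toy_stem(word):
    """Crude suffix stripping; a stand-in for Porter's stemmer, not the real algorithm."""
    for suffix in ("ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def index_text(text):
    """Tokenize + lowercase, remove stopwords, stem, then count."""
    tokens = re.findall(r"[a-z]+", text.lower())       # tokenization + lowercase
    kept = [t for t in tokens if t not in STOPWORDS]   # stopwords removal
    stems = [toy_stem(t) for t in kept]                # stemming
    return Counter(stems)                              # counting

counts = index_text("The plane crashed. Two planes crashed on the island.")
```

With this input, `plane` and `planes` are conflated into one entry, as are both occurrences of `crashed`, which is the effect the stemming step is meant to achieve.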

Theory − Similarity of Paired Series
• Dice’s Coefficient (1945)
• Based on set theory
A big limitation
• The distribution of terms among series is ignored: it makes no difference whether a term occurs 1 time or 1,000,000 times
• Example: let us model a series as a set of terms
House = {hospital, doctor, crazy, psycho}
Grey’s = {doctor, care, hospital}
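As a worked example on the two sets above, Dice's coefficient is twice the size of the intersection divided by the sum of the set sizes. The function name `dice` is chosen for this sketch.

```python
def dice(a, b):
    """Dice's coefficient of two sets: 2 * |A ∩ B| / (|A| + |B|)."""
    if not a and not b:
        return 0.0  # convention chosen here for two empty sets
    return 2 * len(a & b) / (len(a) + len(b))

house = {"hospital", "doctor", "crazy", "psycho"}
greys = {"doctor", "care", "hospital"}

# Shared terms: hospital, doctor, so the score is 2 * 2 / (4 + 3) = 4/7.
score = dice(house, greys)
```

Because only set membership matters, the score would be the same whether "hospital" occurred once or a million times in either series, which is the limitation stated above.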

Theory − Vector Space Model, Term Weighting
[Figure: term weights over the vocabulary for Dexter and Lost, highlighting ‘survive’]
• Raw TF: dexter > lost
• Normalization, TF / max(TF): dexter < lost

Theory − Best Match Retrieval
1 TV series = 1 vector
[Figure: vector with term positions 1, 45, 1467, 6790, n]
Now, we know how to:
• Find most popular terms for a TV series
• Compute similarity between TV series
• Find TV series matching a query

Theory − More on Term Weighting
1 TV series = 1 vector
• All terms are supposed to be equally representative
… but ‘survive’ is way more unusual than ‘people’
⇒ ‘survive’ better represents Lost than ‘people’ does
IDF: Inverse Document Frequency

Theory − The Big Picture: TF*IDF
An important term for series S is frequent in S and globally unusual.
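One common way to write this weighting down is given below as a sketch. The normalized term frequency follows the TF / max(TF) rule from the slides; the logarithmic form of IDF is an assumption, since the slides name IDF without spelling out its formula. Here $nb_{t,S}$ is the number of occurrences of term $t$ in series $S$, $N$ the number of series, and $n_t$ the number of series containing $t$ (the last two symbols are introduced for this example).

```latex
\[
  w_{t,S} = \mathrm{tf}_{t,S} \times \mathrm{idf}_t ,
  \qquad
  \mathrm{tf}_{t,S} = \frac{nb_{t,S}}{\max_{t'} nb_{t',S}} ,
  \qquad
  \mathrm{idf}_t = \log \frac{N}{n_t}
\]
```

A term gets a high weight only when both factors are large: it is frequent in $S$ and appears in few series overall.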

Theory … and Practice
Series = { idS, name, maxNb }
idS  name    maxNb
12   Lost    540
45   Dexter  125

Dict = { idT, term, idf }
idT  term    idf
8    plane   1.25
27   killer  2.87
29   crash   3.07

Posting = { idT*, idS*, nb, tf }
idT  idS  nb  tf
27   45   89  0.71
8    45   3   0.02
8    12   90  0.16

Posting[idT] ⊆ Dict[idT]
Posting[idS] ⊆ Series[idS]
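The schema above can be tried out with a small script. This is a sketch under stated assumptions: it uses SQLite through Python's `sqlite3` module instead of Oracle, the column types and key constraints are guesses, and the `UPDATE` that fills `tf` as `nb / maxNb` is one possible way to do it rather than the author's query. Table names, column names, and rows are copied from the slide.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Series  (idS INTEGER PRIMARY KEY, name TEXT, maxNb INTEGER);
CREATE TABLE Dict    (idT INTEGER PRIMARY KEY, term TEXT, idf REAL);
CREATE TABLE Posting (
    idT INTEGER REFERENCES Dict(idT),
    idS INTEGER REFERENCES Series(idS),
    nb  INTEGER,
    tf  REAL,
    PRIMARY KEY (idT, idS)
);
INSERT INTO Series VALUES (12, 'Lost', 540), (45, 'Dexter', 125);
INSERT INTO Dict VALUES (8, 'plane', 1.25), (27, 'killer', 2.87), (29, 'crash', 3.07);
INSERT INTO Posting (idT, idS, nb) VALUES (27, 45, 89), (8, 45, 3), (8, 12, 90);
""")

# Normalized term frequency: tf = nb / maxNb of the series.
conn.execute("""
UPDATE Posting
SET tf = CAST(nb AS REAL) / (SELECT maxNb FROM Series WHERE Series.idS = Posting.idS)
""")

# Join the three tables and rank (series, term) pairs by tf * idf.
rows = conn.execute("""
SELECT s.name, d.term, p.nb, ROUND(p.tf, 2), ROUND(p.tf * d.idf, 3)
FROM Posting p
JOIN Dict d   ON d.idT = p.idT
JOIN Series s ON s.idS = p.idS
ORDER BY p.tf * d.idf DESC
""").fetchall()
```

The top row is ‘killer’ for Dexter, with tf = 89/125 ≈ 0.71 as on the slide. For ‘plane’ in Lost, 90/540 rounds to 0.17, so the slide's 0.16 appears to be truncated rather than rounded.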

Description of a TV Series
[Figure: query with a join (⋈) describing Lost]
• Many surnames need to be filtered out

Retrieval of TV Series − queries with 1 term
Query: ‘survive’ (evaluated with a join, ⋈)
Importance of normalization
• Stargate Atlantis: nb/maxNb = 63/1116 = 0.05645
• Blade: nb/maxNb = 9/163 = 0.05521

Retrieval of TV Series − queries with n terms
Query: ‘survive mulder’ (evaluated with a join, ⋈)

67 | The Vampire Diaries
term     tf     idf    tf * idf
survive  0.028  0.107  0.003
mulder   0.007  3.977  0.028
score                  0.031

18 | X-Files
term     tf     idf    tf * idf
survive  0.014  0.107  0.001
mulder   1.000  3.977  3.977
score                  3.978
⁞
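The scoring above can be reproduced with a short sketch: each series gets the sum of tf * idf over the query terms, and series are ranked by that sum. The tf and idf values are copied from the slide; the dictionary layout and the function names `score` and `rank` are assumptions made for this example.

```python
# idf per term and tf per (series, term), copied from the slide.
IDF = {"survive": 0.107, "mulder": 3.977}
TF = {
    "The Vampire Diaries": {"survive": 0.028, "mulder": 0.007},
    "X-Files": {"survive": 0.014, "mulder": 1.000},
}

def score(series_tf, query_terms):
    """Sum of tf * idf over the query terms; missing terms contribute 0."""
    return sum(series_tf.get(t, 0.0) * IDF.get(t, 0.0) for t in query_terms)

def rank(query):
    """Return (series, score) pairs sorted by decreasing score."""
    terms = query.lower().split()
    scored = [(name, score(tf, terms)) for name, tf in TF.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

ranking = rank("survive mulder")
```

X-Files comes first because ‘mulder’ is both its most frequent term (tf = 1.000) and globally rare (idf = 3.977), while ‘survive’ barely changes either score.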

Similar to House?
Computing Similarities Among TV Series 1/2
[Figure: query with a join (⋈)]
First, let’s compute the numerator $\sum_i A_i B_i$, where:
Ai = Terms from House
Bi = Terms from another TV series

Similar to House?
Computing Similarities Among TV Series 2/2
[Figure: query combining three joins (⋈)]
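A numerator of the form sum of Ai * Bi matches the cosine similarity between two term-weight vectors, so the sketch below assumes cosine is the measure intended; the slides' text does not show the denominator. The term weights are hypothetical values invented for this example, and sparse vectors are represented as dictionaries.

```python
import math

def cosine(a, b):
    """Cosine similarity of two sparse vectors given as {term: weight} dicts."""
    # Numerator: sum of A_i * B_i over the terms the two series share.
    numerator = sum(w * b[t] for t, w in a.items() if t in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return numerator / (norm_a * norm_b)

# Hypothetical tf * idf weights, for illustration only.
house = {"hospital": 0.9, "doctor": 0.8, "psycho": 0.1}
greys = {"hospital": 0.7, "doctor": 0.9, "care": 0.3}
lost = {"plane": 0.8, "island": 0.9, "survive": 0.5}

# Rank the other series by decreasing similarity to House.
others = {"Grey's Anatomy": greys, "Lost": lost}
similar_to_house = sorted(others, key=lambda n: cosine(house, others[n]), reverse=True)
```

Only shared terms contribute to the numerator, which is why it can be computed with a join of the postings of the two series on the term identifier.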

Thank you
http://www.irit.fr/~Guillaume.Cabanac
