Searching and Recommending TV series with SQL
1. Series-O-Rama
Search & Recommend TV series with SQL
http://bit.ly/series-o-rama
Guillaume Cabanac
guillaume.cabanac@univ-tlse3.fr
2. Toulouse: A Picture is Worth a Thousand Words
Series-O-Rama: Search & Recommend TV series with SQL Guillaume Cabanac
[Map of the Toulouse area]
Toulouse: population 437,000; students 97,000
Capbreton: 3h ride
Ax-les-Thermes: 1h40 ride
Collioure: 2h30 ride
3. Telly Addicts Need Help to Find TV Series
Main topics of Grey’s Anatomy? → Text mining, visualization
Series about ‘plane crash island’ → Search engine
What should I watch next? → Recommender system
(images: en.wikipedia.org, amazon.com)
4. Text Mining: Let’s Crunch Subtitles
Main topics of Grey’s Anatomy? → Text mining, visualization
Series about ‘plane crash island’ → Search engine
What should I watch next? → Recommender system
[Figures: Cold Case, Grey’s Anatomy]
5. What’s in a Subtitle File?
Title – Season – Episode – Language.srt
1 episode = 1 plain text file
Synchronization: start --> stop
Dialogue
We can easily extract words
[ a, again*2, and, but, com, cuban,
different, favorite, food, for*2, forum,
going, great, happen*2, has, hungry, i*2,
is, it, love, m, my, nice, night*2, miami,
now, pork, s*2, sandwiches, something, the,
to*2, tonight, town, www ]
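The word extraction described above can be sketched in a few lines of Python. This is a minimal illustration, not the deck's Java code: the sample subtitle text, the `TIMING` pattern and the function name `extract_words` are assumptions made for the example. Splitting on non-letters is what yields fragments such as `m` and `s` in the word list above.

```python
import re
from collections import Counter

# Hypothetical two-cue subtitle file in the usual .srt layout:
# cue number, "start --> stop" line, dialogue, blank line.
SAMPLE_SRT = """\
1
00:00:01,000 --> 00:00:03,500
I'm hungry.

2
00:00:04,000 --> 00:00:06,000
Great night, great food!
"""

# Matches the synchronization line of a cue.
TIMING = re.compile(r"\d{2}:\d{2}:\d{2},\d{3} --> \d{2}:\d{2}:\d{2},\d{3}")


def extract_words(srt_text):
    """Count lowercase words found in the dialogue lines only."""
    counts = Counter()
    for line in srt_text.splitlines():
        line = line.strip()
        # Skip blank lines, cue numbers and timing lines.
        # (A dialogue line made only of digits would also be skipped.)
        if not line or line.isdigit() or TIMING.match(line):
            continue
        counts.update(re.findall(r"[a-z]+", line.lower()))
    return counts


words = extract_words(SAMPLE_SRT)
print(sorted(words.items()))
```

Timing digits never reach the counter because the whole synchronization line is skipped before tokenization.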
6. DB technology at Work! [Home]
7,527 files = 337 MB
100% Java and Oracle
7. DB technology at Work! [Search engine]
Ranked list of results
8. DB technology at Work! [Infos]
Most popular terms
Most related series
9. DB technology at Work! [Recommendations]
10. DB technology at Work! [Recommendations]
I liked / I disliked
What should I watch next?
11. DB technology at Work! [Recommendations]
Ranked list of recommendations
12. How Does this Work?
13. Architecture and Data Model
Offline: subtitles → indexing → DB
Online: DB → searching, browsing, recommending → GUI
Dict = { idT, term }
  idT  term
  8    plane
  27   killer
  29   crash
Posting = { idT*, idS*, nb }
  idT  idS  nb
  27   45   89
  8    45   3
  8    12   90
Inclusion constraints (⊆): the starred keys of Posting refer to the tables they index, e.g. Posting[idT] ⊆ Dict[idT]
14. Theory − Text Indexing Pipeline
Tokenization + lowercase: [the, plane, crashed, ..., planes, ..., is]
Stopwords removal: [plane, crashed, ..., planes, ...]
Stemming: [plane, crash, ..., plane, ...]
Counting: {(plane, 48), (crash, 15) ...}
Porter’s Stemmer (1980)
http://qaa.ath.cx/porter_js_demo.html
Sample text:
In 1720 Robert Gordon retired to Aberdeen having amassed a
considerable fortune in Poland. On his death 11 years later he willed his
entire estate to build a residential school for educating young boys. In
the summer of 1750 the Robert Gordon’s Hospital was born In 1881 this
was converted into a day school to be known as Robert Gordon’s
College. This school also began to hold day and evening classes for boys
girls and adults in primary secondary mechanical and other subjects …
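The four stages above can be sketched as a small Python function. The stopword list is a tiny illustrative one and `crude_stem` is a deliberately naive suffix stripper standing in for Porter's stemmer (it is not Porter's algorithm); both are assumptions of this example.

```python
import re
from collections import Counter

# Tiny illustrative stopword list (assumption; real lists are much longer).
STOPWORDS = {"the", "is", "a", "and", "to", "of"}


def crude_stem(word):
    """Naive stand-in for Porter's stemmer: strips a final 'ed' or 's'."""
    for suffix in ("ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def index_terms(text):
    """Tokenize + lowercase, remove stopwords, stem, then count."""
    tokens = re.findall(r"[a-z]+", text.lower())         # tokenization + lowercase
    content = [t for t in tokens if t not in STOPWORDS]  # stopwords removal
    stems = [crude_stem(t) for t in content]             # stemming
    return Counter(stems)                                # counting


term_counts = index_terms("The plane crashed ... planes ... is")
print(term_counts)
```

On this input, `plane` and `planes` collapse to the same stem and `crashed` becomes `crash`, mirroring the intermediate lists shown on the slide.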
15. Theory − Similarity of Paired Series
Dice’s Coefficient (1945)
Based on set theory
Example: Let us Model a Series as a Set of Terms
House = {hospital, doctor, crazy, psycho}
Grey’s = {doctor, care, hospital}
A Big Limitation
The distribution of terms among series is ignored
It makes no difference whether a term occurs 1 time or 1,000,000 times
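As a worked example, Dice's coefficient for two sets A and B is 2·|A ∩ B| / (|A| + |B|). The sketch below applies it to the two term sets of the example; the function name `dice` is chosen for illustration.

```python
def dice(a, b):
    """Dice's coefficient of two sets: 2 * |A ∩ B| / (|A| + |B|)."""
    if not a and not b:
        return 0.0
    return 2 * len(a & b) / (len(a) + len(b))


house = {"hospital", "doctor", "crazy", "psycho"}
greys = {"doctor", "care", "hospital"}

# Two shared terms (hospital, doctor), set sizes 4 and 3: 2 * 2 / 7.
similarity = dice(house, greys)
print(round(similarity, 3))
```

Because sets record only presence or absence, a term spoken once weighs as much as a term spoken a million times, which is the limitation the slide points out.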
16. Theory − Vector Space Model, Term Weighting
[Figure: term frequencies of Dexter and Lost over the vocabulary, with the maximum of each series marked; focus on the term ‘survive’]
Raw TF: dexter > lost
Normalization: TF / max(TF)
Normalized TF: dexter < lost
17. Theory − Best Match Retrieval
1 TV series = 1 vector
[Figure: vector indexed by term ids 1, 45, 1467, 6790, ..., n]
Now, we know how to:
Find most popular terms for a TV series
Compute similarity between TV series
Find TV series matching a query
18. Theory − More on Term Weighting
1 TV series = 1 vector
[Figure: vector indexed by term ids 1, 45, 1467, 6790, ..., n]
All terms are supposed to be equally representative
… but ‘survive’ is way more unusual than ‘people’
⇒ ‘survive’ better represents Lost than ‘people’ does
IDF: Inverse Document Frequency
19. Theory − The Big Picture: TF*IDF
An important term for series S is frequent in S and globally unusual.
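A minimal sketch of this weighting follows. TF is normalized as nb / max(nb), as in the deck; the IDF formula is not given in the slides, so the common textbook form log(N / df) is assumed here. The term counts are hypothetical.

```python
import math

# Hypothetical raw term counts per series (not taken from the deck).
counts = {
    "Lost": {"survive": 40, "people": 200},
    "Dexter": {"killer": 89, "people": 125},
    "House": {"doctor": 300, "people": 150},
}


def tfidf(counts):
    """Weight = (nb / max nb in the series) * log(N / number of series with the term)."""
    n_series = len(counts)
    # Document frequency: in how many series each term occurs.
    df = {}
    for terms in counts.values():
        for term in terms:
            df[term] = df.get(term, 0) + 1
    weights = {}
    for series, terms in counts.items():
        max_nb = max(terms.values())
        weights[series] = {
            term: (nb / max_nb) * math.log(n_series / df[term])  # assumed IDF form
            for term, nb in terms.items()
        }
    return weights


weights = tfidf(counts)
print(weights["Lost"])
```

With these counts, `people` is the most frequent term in Lost but occurs in every series, so its weight drops to zero, while the rarer `survive` keeps a positive weight.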
20. Theory … and Practice
Series = { idS, name, maxNb }
  idS  name    maxNb
  12   Lost    540
  45   Dexter  125
Dict = { idT, term, idf }
  idT  term    idf
  8    plane   1.25
  27   killer  2.87
  29   crash   3.07
Posting = { idT*, idS*, nb, tf }
  idT  idS  nb  tf
  27   45   89  0.71
  8    45   3   0.02
  8    12   90  0.16
Posting[idT] ⊆ Dict[idT]
Posting[idS] ⊆ Series[idS]
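To experiment with this data model without an Oracle instance, the three tables can be loaded into an in-memory SQLite database from Python. This is only a sketch: the column types and key declarations are assumptions, the rows are the sample rows shown above, and SQLite stands in for the Oracle database used by Series-O-Rama.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
create table series (idS integer primary key, name text, maxNb integer);
create table dict   (idT integer primary key, term text, idf real);
create table posting (
    idT integer references dict(idT),     -- Posting[idT] ⊆ Dict[idT]
    idS integer references series(idS),   -- Posting[idS] ⊆ Series[idS]
    nb  integer,
    tf  real,
    primary key (idT, idS)
);
insert into series values (12, 'Lost', 540), (45, 'Dexter', 125);
insert into dict values (8, 'plane', 1.25), (27, 'killer', 2.87), (29, 'crash', 3.07);
insert into posting values (27, 45, 89, 0.71), (8, 45, 3, 0.02), (8, 12, 90, 0.16);
""")

# Terms of one series ranked by TF*IDF (rounded for display).
rows = conn.execute("""
    select term, round(tf * idf, 3) as score
    from posting p, dict d
    where p.idT = d.idT
      and idS = (select idS from series where name = 'Dexter')
    order by 2 desc, 1
""").fetchall()
print(rows)
```

Storing `tf` and `idf` as precomputed columns means the ranking query only needs a join and a multiplication at search time.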
21. Description of a TV Series
Example: Lost
Many surnames need to be filtered out
22. Retrieval of TV Series − queries with 1 term
Query: survive
Importance of normalization
• Stargate Atlantis: nb/maxNb = 63/1116 = 0.05645
• Blade: nb/maxNb = 9/163 = 0.05521
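The arithmetic above can be checked with a short Python snippet. The dictionary layout is an illustration; the four numbers are the ones on the slide.

```python
# (nb, maxNb) for the term 'survive', as given on the slide.
postings = {
    "Stargate Atlantis": (63, 1116),
    "Blade": (9, 163),
}

# Normalized term frequency: nb / maxNb.
tf = {name: nb / max_nb for name, (nb, max_nb) in postings.items()}

# Rank series by normalized TF, highest first.
ranked = sorted(tf, key=tf.get, reverse=True)

for name in ranked:
    print(name, round(tf[name], 5))
```

Raw counts differ by a factor of seven (63 against 9), yet after normalization the two series score almost the same, because Stargate Atlantis simply has far more text.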
23. Retrieval of TV Series − queries with n terms
Query: survive mulder
67 | The Vampire Diaries
  term     tf     idf    tf * idf
  survive  0.028  0.107  0.003
  mulder   0.007  3.977  0.028
  sum                    0.031
18 | X-Files
  term     tf     idf    tf * idf
  survive  0.014  0.107  0.001
  mulder   1.000  3.977  3.977
  sum                    3.978
⁞
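The per-series score for a multi-term query is the sum of tf * idf over the query terms. The sketch below recomputes the two totals from the tf and idf values shown above; the function name `rsv` echoes the column alias used in the deck's SQL, and the dictionary layout is an assumption of this example.

```python
# idf per term and tf per (series, term), copied from the slide.
IDF = {"survive": 0.107, "mulder": 3.977}
TF = {
    "The Vampire Diaries": {"survive": 0.028, "mulder": 0.007},
    "X-Files": {"survive": 0.014, "mulder": 1.000},
}


def rsv(series_tf, query_terms):
    """Sum of tf * idf over the query terms (missing terms contribute 0)."""
    return sum(series_tf.get(t, 0.0) * IDF.get(t, 0.0) for t in query_terms)


query = ["survive", "mulder"]
scores = {name: rsv(tfs, query) for name, tfs in TF.items()}
ranking = sorted(scores, key=scores.get, reverse=True)

for name in ranking:
    print(name, round(scores[name], 3))
```

The rare term `mulder` dominates the ranking: its high idf lifts X-Files far above a series that merely mentions `survive` more often.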
24. Computing Similarities Among TV Series 1/2
Similar to House?
First, let’s compute the numerator Σ Ai × Bi where:
Ai = Terms from House
Bi = Terms from Another TV series
25. Computing Similarities Among TV Series 2/2
Similar to House?
-- Description of a TV series: terms of Lost ranked by TF*IDF
select term, tf*idf score
from posting p, dict d
where p.idT = d.idT
  and idS = (select idS from series where name = 'Lost')
order by 2 desc, 1 ;
-- Queries with 1 term: series containing 'survive', ranked by normalized TF
select name, term, nb, tf
from posting p, series s, dict d
where p.idS = s.idS
  and p.idT = d.idT
  and term = 'survive'
order by tf desc, name ;
-- Queries with n terms: sum of TF*IDF over the query terms
select name, sum(tf*idf) rsv
from posting p, series s, dict d
where p.idS = s.idS
  and p.idT = d.idT
  and term in ('survive', 'mulder')
group by p.idS, name
order by 2 desc, 1 ;
-- Series most similar to House: numerator (sum over common terms) divided by the product of the two vector norms
with numerator as (
  select pHouse.idS idHouseS, pOther.idS idOtherS,
         sum(pHouse.tf*idf * pOther.tf*idf) numValue
  from posting pHouse, posting pOther, dict d
  where pHouse.idT = pOther.idT -- common terms
    and pHouse.idT = d.idT -- for IDF
    and pHouse.idS <> pOther.idS
    and pHouse.idS = (select idS from series where name = 'House')
  group by pHouse.idS, pOther.idS
)
select name,
       numValue / (
         sqrt((select sum(power(tf*idf, 2)) from posting p, dict d
               where p.idT = d.idT and p.idS = n.idHouseS))
       * sqrt((select sum(power(tf*idf, 2)) from posting p, dict d
               where p.idT = d.idT and p.idS = n.idOtherS))) score
from numerator n, series s
where n.idOtherS = s.idS
order by 2 desc, 1 ;
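The similarity query divides a sum of products over common terms by the product of two square-rooted sums of squares, which is the cosine of the angle between two TF*IDF vectors. A minimal Python sketch of the same computation follows; the weights are hypothetical and the function name `cosine` is chosen for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity of two sparse vectors stored as {term: weight}."""
    common = a.keys() & b.keys()                      # common terms only
    numerator = sum(a[t] * b[t] for t in common)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return numerator / (norm_a * norm_b)


# Hypothetical TF*IDF weights (not taken from the deck).
house = {"hospital": 0.9, "doctor": 0.8, "crazy": 0.3}
greys = {"hospital": 0.7, "doctor": 0.9, "care": 0.4}
dexter = {"killer": 1.0, "crazy": 0.2}

others = {"Grey's Anatomy": greys, "Dexter": dexter}
ranking = sorted(others, key=lambda name: cosine(house, others[name]), reverse=True)
print(ranking)
```

As in the SQL, only terms shared by both series contribute to the numerator, while each norm is computed over all terms of its own series.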