2. Who am I
Lead Engineer @ Glowdayz
● Over 680k users
● Over 120k reviewers
● Over 2.6m reviews and ratings
We provide weekly rankings based on reviews and ratings
● Approx. 6k brands
● Approx. 82k products
3. Data Specialist vs. Domain Expert
Data Specialist
from MARS
Domain Expert
from VENUS
4. Data Specialist vs. Domain Expert
● Data Specialist
○ Classification
○ Topic Modeling
○ Word Embedding
○ Probability
○ Similarity
○ …
● Domain Expert
○ O_o; ..? DATA?
11. Sentiment Analysis
● Scaled f-score
○ Term associations:
■ “Good” → positive class
■ “Bad” → negative class
○ Association by two factors
■ Frequency : how often a term occurs in a class
■ Precision : P(class | document contains terms)
○ F-score
■ An information-retrieval (IR) evaluation metric
■ Harmonic mean between precision & recall (both should be high)
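The two factors above can be combined in a short sketch. This is a simplified version of the idea (the full scaled F-score, as in the Scattertext library, additionally rescales both factors through a normal CDF before taking the harmonic mean); the function name and the example counts are made up for illustration:

```python
from statistics import harmonic_mean

def class_fscore(term_count_in_class, term_count_elsewhere, total_class_tokens):
    """Harmonic mean of precision and class-conditional frequency for one term.

    precision = P(class | term): how exclusively the term appears in the class
    frequency = P(term | class): how often the term occurs in the class
    """
    precision = term_count_in_class / (term_count_in_class + term_count_elsewhere)
    frequency = term_count_in_class / total_class_tokens
    return harmonic_mean([precision, frequency])

# "good": 8 occurrences in positive reviews, 2 in negative reviews,
# out of 100 positive-class tokens total
score = class_fscore(8, 2, 100)
```

A term that is both frequent in a class and exclusive to it scores high; a term that is merely frequent (low precision) or merely exclusive (low frequency) is pulled down by the harmonic mean.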
14. Word Embedding
● Distributional Hypothesis
○ “You shall know a word by the company it keeps” (J.R. Firth, British linguist, 1957)
○ “Words that occur in similar contexts tend to be similar” (Z.S. Harris, American linguist, 1992)
15. Word Embedding
● Distributional Hypothesis
○ Moon, Trump, Jinping are presidents
■ President Moon said yesterday
■ President Trump said yesterday
■ President Jinping said yesterday
○ Python is ...
■ I write code in Python
■ A program is written in Python
■ Python is a programming language
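The intuition on this slide can be made concrete by counting context words. A minimal sketch, using the slide's own example sentences (the helper and variable names are made up):

```python
from collections import defaultdict

sentences = [
    "president moon said yesterday",
    "president trump said yesterday",
    "president jinping said yesterday",
]

# context of a word = the other words in the same sentence
contexts = defaultdict(set)
for s in sentences:
    words = s.split()
    for w in words:
        contexts[w].update(set(words) - {w})

# Moon, Trump and Jinping occur in identical contexts, so a
# distributional model will place their vectors close together.
shared = contexts["moon"] & contexts["trump"] & contexts["jinping"]
```

Here `shared` comes out to `{"president", "said", "yesterday"}`: the three names are interchangeable in these contexts, which is exactly what the distributional hypothesis exploits.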
16. Word Embedding
● Pre-training Word Vectors
○ 2.6M review docs → Word2Vec model → pre-trained word vectors (V: 130k × 150)
>>> model.most_similar('피부색')  # '피부색' = “skin color”
[('얼굴색', 0.870),
('피부톤', 0.847),
('톤', 0.740),
('얼굴빛', 0.665),
('본래_피부색', 0.657),
('낯빛', 0.615),
('21_호', 0.586),
('23_호', 0.586),
('하얀피부', 0.582),
('화사함', 0.562),
('안색', 0.551)]
○ C = {docs} : the corpus of review documents
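The `most_similar` output above comes from a gensim-style Word2Vec model: under the hood it ranks the vocabulary by cosine similarity to the query vector. A self-contained sketch of that ranking with toy random vectors standing in for the 130k × 150 pre-trained table (the vocabulary and vector values are fabricated for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 150  # same dimensionality as the slide's pre-trained vectors

# toy stand-in for the pre-trained table; real vectors come from Word2Vec
V = {w: rng.standard_normal(dim) for w in ["피부색", "피부톤", "가격", "배송"]}
# make 얼굴색 ("face color") nearly parallel to 피부색 so it ranks first
V["얼굴색"] = V["피부색"] + 0.1 * rng.standard_normal(dim)

def most_similar(word, topn=3):
    """Rank all other words by cosine similarity to `word`'s vector."""
    q = V[word]
    sims = []
    for w, v in V.items():
        if w == word:
            continue
        cos = float(q @ v / (np.linalg.norm(q) * np.linalg.norm(v)))
        sims.append((w, cos))
    return sorted(sims, key=lambda t: t[1], reverse=True)[:topn]
```

With these toy vectors, `most_similar("피부색")` puts '얼굴색' at the top, mirroring the shape of the real model's output on the slide.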
17. Word Embedding
● Word Projection
○ D = {xᵢ | xᵢ ⊂ C} : a subset of C
○ Look up each document token in the pre-trained word vectors (V: 130k × 150)
■ model[token[0..i]] → an (i+1) × 150 matrix
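The projection step amounts to stacking one row of the vector table per token. A minimal NumPy sketch, with a toy three-word vocabulary standing in for the real 130k-entry table (all names and values here are illustrative):

```python
import numpy as np

dim = 150
vocab = {"피부색": 0, "촉촉": 1, "좋아요": 2}  # toy index; the real model has ~130k entries
# stand-in for the pre-trained matrix V (vocab size x 150)
V = np.random.default_rng(1).standard_normal((len(vocab), dim))

def project(tokens):
    """model[token[0..i]]: stack each token's row of V into an (i+1) x 150 matrix."""
    return V[[vocab[t] for t in tokens]]

doc = ["피부색", "촉촉", "좋아요"]
X = project(doc)  # X.shape == (3, 150)
```

Each document thus becomes a matrix whose row count tracks its token count, ready for downstream similarity or classification steps.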