Hongjoo LEE
Topic Modeling & Word Embedding
on Cosmetics
with Interactive Visualization
Who am I
Lead Engineer @ Glowdayz
● Over 680k users
● Over 120k reviewers
● Over 2.6m reviews and ratings
We provide weekly rankings based on reviews and ratings
● Approx. 6k brands
● Approx. 82k products
Data Specialist vs. Domain Expert
Data Specialist
from MARS
Domain Expert
from VENUS
Data Specialist vs. Domain Expert
● Data Specialist
○ Classification
○ Topic Modeling
○ Word Embedding
○ Probability
○ Similarity
○ …...
O_o; ..? DATA
Data Specialist vs. Domain Expert
● Domain Expert
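○ Loanwords (외래어)
■ 딥씨 듀드롭 ("Deep Sea Dewdrop")
○ "Alien language" (외계어): coined terms that are opaque to outsiders
■ "살결수", "오일수"
○ Slang (은어)
■ 7스킨 ("7 skin"), 콧물 ("snot"), 유목민 ("nomad")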
DOMAIN?!... -_-
Data Specialist vs. Domain Expert
수분력 ("moisturizing power"), 3쳐발
DATA / DOMAIN
Building Review Corpus
Review Corpus → Topic Modeling, Sentiment Analysis, Word Embedding → Interactive Visualization → ? → Consumer Insights
Build a Corpus
Topic Modeling
● Latent Dirichlet Allocation
[Figure: LDA schematic with panels "Topics", "Documents", and "Topic proportions & assignments"]
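To make the three panels concrete, here is a minimal sketch of LDA fitted by collapsed Gibbs sampling in pure Python. It is not the pipeline used in the talk: the toy English tokens, the function name `lda_gibbs`, and the hyperparameters `alpha` and `beta` are all assumptions chosen for illustration.

```python
import random

def lda_gibbs(docs, n_topics, n_iter=200, alpha=0.1, beta=0.01, seed=0):
    """Collapsed Gibbs sampling for LDA on tokenized docs (lists of str)."""
    rng = random.Random(seed)
    v_size = len({w for doc in docs for w in doc})
    doc_topic = [[0] * n_topics for _ in docs]       # tokens of doc d in topic k
    topic_word = [dict() for _ in range(n_topics)]   # count of word w in topic k
    topic_total = [0] * n_topics                     # tokens assigned to topic k
    assign = []                                      # topic of every token

    # Random initial topic assignment for each token.
    for d, doc in enumerate(docs):
        z_doc = []
        for w in doc:
            k = rng.randrange(n_topics)
            z_doc.append(k)
            doc_topic[d][k] += 1
            topic_word[k][w] = topic_word[k].get(w, 0) + 1
            topic_total[k] += 1
        assign.append(z_doc)

    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the token's current assignment from the counts.
                k = assign[d][i]
                doc_topic[d][k] -= 1
                topic_word[k][w] -= 1
                topic_total[k] -= 1
                # Resample its topic from the conditional distribution.
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t].get(w, 0) + beta)
                    / (topic_total[t] + beta * v_size)
                    for t in range(n_topics)
                ]
                k = rng.choices(range(n_topics), weights=weights)[0]
                assign[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][w] = topic_word[k].get(w, 0) + 1
                topic_total[k] += 1

    # Per-document topic proportions (the "Topic proportions" panel).
    theta = [
        [(c + alpha) / (len(doc) + alpha * n_topics) for c in counts]
        for doc, counts in zip(docs, doc_topic)
    ]
    return theta, topic_word, assign

# Hypothetical toy corpus, not data from the talk.
docs = [
    ["acne", "trouble", "breakout", "acne"],
    ["moisture", "dry", "hydrating", "moisture"],
    ["acne", "breakout", "trouble"],
    ["dry", "moisture", "hydrating"],
]
theta, topic_word, assign = lda_gibbs(docs, n_topics=2)
for row in theta:
    print([round(p, 2) for p in row])
```

`theta` corresponds to the topic proportions, `topic_word` to the topics, and `assign` to the per-token topic assignments. For a corpus of millions of reviews a library implementation would be the practical choice.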
Topic Modeling
● pyLDAvis
뒤집어 (skin flare-up), 여드름 (acne), 트러블 (skin trouble), 수부지 (dehydrated oily skin)
Sentiment Analysis
● Scaled f-score
○ Term associations:
■ “Good” → positive class
■ “Bad” → negative class
○ Association by two factors
■ Frequency: how often a term occurs in a class
■ Precision: P(class | document contains term)
○ F-score
■ IR evaluation metric
■ Harmonic mean of precision and recall (both should be high)
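The two factors above can be combined as in the following rough sketch. It assumes a Scattertext-style scaling in which precision and frequency are each passed through a normal CDF before the harmonic mean is taken. The function name `scaled_f_scores` and the toy documents are hypothetical, and the library's exact formula may differ in detail.

```python
from collections import Counter
from statistics import NormalDist, mean, pstdev

def scaled_f_scores(pos_docs, neg_docs, beta=1.0):
    """Score each term of the positive class by a scaled F-score."""
    pos_tf = Counter(w for d in pos_docs for w in d)       # term counts
    pos_df = Counter(w for d in pos_docs for w in set(d))  # document counts
    neg_df = Counter(w for d in neg_docs for w in set(d))
    terms = sorted(pos_tf)
    total_pos = sum(pos_tf.values())

    # Precision: P(class | document contains term).
    precision = {t: pos_df[t] / (pos_df[t] + neg_df[t]) for t in terms}
    # Frequency: share of the class's tokens taken by the term.
    frequency = {t: pos_tf[t] / total_pos for t in terms}

    def normal_cdf_scale(values):
        # Rescale raw values to (0, 1) so both factors are comparable.
        mu, sigma = mean(values.values()), pstdev(values.values())
        if sigma == 0:
            return {t: 0.5 for t in values}
        dist = NormalDist(mu, sigma)
        return {t: dist.cdf(v) for t, v in values.items()}

    p, f = normal_cdf_scale(precision), normal_cdf_scale(frequency)
    b2 = beta ** 2
    # Harmonic mean: high only when both scaled factors are high.
    return {t: (1 + b2) * p[t] * f[t] / (b2 * p[t] + f[t]) for t in terms}

# Hypothetical tokenized reviews, not data from the talk.
pos_docs = [["good", "moist", "good"], ["good", "scent"], ["moist", "oil"]]
neg_docs = [["bad", "oil", "acne"], ["bad", "scent", "oil"]]
scores = scaled_f_scores(pos_docs, neg_docs)
for term, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(term, round(score, 3))
```

A term that is frequent but appears equally in both classes, or one that is exclusive to a class but very rare, ends up with a low score, which is the point of requiring both factors to be high.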
Sentiment Analysis
● Visualize
positive
negative
neutral
Sentiment Analysis
● Scattertext
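유목민 ("nomad"), 닦토로 ("as a wiping toner"), 7스킨 ("7 skin"), 콧물 ("snot")
여드름 올라오 ("acne comes up"), 흐르 ("runs"), 오일 ("oil"), 용기 ("container")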
Word Embedding
● Distributional Hypothesis
○ “You shall know a word by the company it keeps” (J.R. Firth, British linguist, 1957)
○ “Words that occur in similar contexts tend to be similar” (Z.S. Harris, American linguist, 1954)
Word Embedding
● Distributional Hypothesis
○ Moon, Trump, Jinping are presidents
■ President Moon said yesterday
■ President Trump said yesterday
■ President Jinping said yesterday
○ Python is ...
■ I write code in Python
■ A program is written in Python
■ Python is a programming language
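The example sentences above can be turned into a small demonstration of the hypothesis. This sketch builds raw co-occurrence vectors from a context window and compares them by cosine similarity. The window size and the helper names `context_vectors` and `cosine` are assumptions for illustration, and counting neighbours is only a stand-in for what Word2Vec learns.

```python
from collections import Counter, defaultdict
from math import sqrt

sentences = [
    "President Moon said yesterday",
    "President Trump said yesterday",
    "President Jinping said yesterday",
    "I write code in Python",
    "A program is written in Python",
    "Python is a programming language",
]

def context_vectors(sentences, window=2):
    """Map each word to a Counter of the words seen within `window` tokens."""
    vectors = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, word in enumerate(tokens):
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    vectors[word][tokens[j]] += 1
    return vectors

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

vectors = context_vectors(sentences)
print(cosine(vectors["moon"], vectors["trump"]))   # identical contexts
print(cosine(vectors["moon"], vectors["python"]))  # no shared contexts
```

"moon" and "trump" never co-occur, yet they receive the same context vector because they keep the same company, while "python" shares no context words with either.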
Word Embedding
● Pre-training Word Vectors
2.6M review docs → Word2Vec → pre-trained model
Word vectors: V (130k) × 150
>>> model.most_similar('피부색')
[('얼굴색', 0.870),
('피부톤', 0.847),
('톤', 0.740),
('얼굴빛', 0.665),
('본래_피부색', 0.657),
('낯빛', 0.615),
('21_호', 0.586),
('23_호', 0.586),
('하얀피부', 0.582),
('화사함', 0.562),
('안색', 0.551)]
C = {docs} (the review corpus)
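A nearest-neighbour query like the one above amounts to ranking the vocabulary by cosine similarity to the query vector. The sketch below shows that ranking on made-up 3-dimensional vectors. The tokens, the values, and the helper `most_similar` are invented for illustration and are not taken from the trained model.

```python
from math import sqrt

def most_similar(vectors, query, topn=3):
    """Rank all other words by cosine similarity to `query`."""
    def cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        norm = sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b))
        return dot / norm if norm else 0.0

    q = vectors[query]
    scored = [(w, cosine(q, v)) for w, v in vectors.items() if w != query]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:topn]

# Invented toy vectors (real ones would have 150 dimensions).
toy_vectors = {
    "skin_color": [0.9, 0.1, 0.0],
    "skin_tone":  [0.8, 0.2, 0.1],
    "complexion": [0.7, 0.3, 0.0],
    "container":  [0.0, 0.1, 0.9],
}

for word, score in most_similar(toy_vectors, "skin_color"):
    print(word, round(score, 3))
```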
Word Embedding
● Word Projection
D = {x_i | x_i ⊂ C}, a subset of C
Pre-trained model: word vectors, V (130k) × 150
token[0..i] → model[token[0..i]]: (i+1) × 150
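One way to read this slide: collect the distinct tokens of the subset D, look each one up in the pre-trained vectors, and hand the resulting (i+1) × 150 matrix to a projector. The sketch below does this with a plain dict standing in for the model and produces the two tab-separated files that the Embedding Projector can load (one vector per line, one label per line). The function name `projector_tsv` and the toy data are assumptions.

```python
def projector_tsv(vectors, subset_docs):
    """Return (vectors_tsv, metadata_tsv) for the tokens found in subset_docs."""
    tokens, seen = [], set()
    for doc in subset_docs:
        for tok in doc:
            # Keep each in-vocabulary token once, in first-seen order.
            if tok in vectors and tok not in seen:
                seen.add(tok)
                tokens.append(tok)
    vectors_tsv = "".join(
        "\t".join(f"{x:.6f}" for x in vectors[tok]) + "\n" for tok in tokens
    )
    metadata_tsv = "".join(tok + "\n" for tok in tokens)
    return vectors_tsv, metadata_tsv

# Invented toy vectors standing in for the pre-trained model.
pretrained = {
    "skin_tone": [0.8, 0.2, 0.1],
    "complexion": [0.7, 0.3, 0.0],
    "container": [0.0, 0.1, 0.9],
}
subset_docs = [["skin_tone", "unknown_word"], ["container", "skin_tone"]]

vec_tsv, meta_tsv = projector_tsv(pretrained, subset_docs)
print(meta_tsv, end="")
print(vec_tsv, end="")
```

Writing the two strings to files such as `vectors.tsv` and `metadata.tsv` (names are arbitrary) keeps the projected vocabulary limited to the subset of interest rather than all 130k words.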
Word Embedding
● Tensorboard Projector
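제형 ("formulation") ≅ {
질감 (texture), 체형 (body shape), 재형 (misspelling of 제형),
타입 (type), 젤타입 (gel type),
텍스쳐 (texture), 느낌 (feel), 젤 (gel),
콧물_제형 ("snot-like" formulation), 점성 (viscosity),
마무리_감 (finish), …
}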
Future work
● Consumer Insights
○ Conceptual keyword buzz
○ Radar chart
● Feature engineering
Consumer Insights
● Conceptual keyword buzz
Consumer Insights
● Radar Chart
향 (scent)
흡수 (absorption)
보습 (moisturizing)
제형 (formulation)
자극 (irritation)
Feature Engineering
● Cosmetic Domain Specific Corpus Analyzer
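One simple way to approximate a domain-specific analyzer is to post-process a general analyzer's tokens with a lexicon of cosmetics terms, merging pieces that belong together. The sketch below is hypothetical and is not the Glowpick analyzer: the function `merge_domain_terms` and the lexicon entries are assumptions, and the input tokens are copied from the Mecab column of the comparison in this deck.

```python
def merge_domain_terms(tokens, lexicon, max_len=4):
    """Greedily merge runs of adjacent tokens that match a lexicon entry."""
    out, i = [], 0
    while i < len(tokens):
        # Try the longest candidate run first, down to two tokens.
        for n in range(min(max_len, len(tokens) - i), 1, -1):
            key = tuple(tokens[i:i + n])
            if key in lexicon:
                out.append(lexicon[key])
                i += n
                break
        else:
            # No lexicon entry starts here; keep the token unchanged.
            out.append(tokens[i])
            i += 1
    return out

# Hypothetical domain lexicon: token pieces -> merged domain token.
lexicon = {
    ("가격", "대비"): "가격_대비",
    ("만족", "합니다"): "만족_합니다",
    ("브이", "디", "엘"): "브이디엘",
}

general_tokens = ["가격", "대비", "에", "만족", "합니다"]
print(merge_domain_terms(general_tokens, lexicon))
print(merge_domain_terms(["브이", "디", "엘"], lexicon))
```

Registering the terms in the analyzer's own user dictionary, where the analyzer supports one, is an alternative that avoids the wrong split in the first place.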
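Korean Morphological Analyzer Performance Comparison
from konlpy.tag import Kkma, Hannanum, Komoran, Twitter, Mecab
# Note: newer KoNLPy releases rename Twitter to Okt.

# Two sample review sentences used throughout the comparison
review_text = """
에스쁘아의 메이크업 제품은 발색력도 좋고요 가격대비에 만족합니다
사용한 파데는 에스쁘아 비실크와 브이디엘 퍼펙팅 래스트입니다
"""

# Run each analyzer's part-of-speech tagger on the same text
Kkma().pos(review_text)
Hannanum().pos(review_text)
Komoran().pos(review_text)
Twitter().pos(review_text)
Mecab().pos(review_text)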
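Korean Morphological Analyzer Performance Comparison
“에스쁘아의 메이크업 제품은 발색력도 좋고요 가격대비에 만족합니다” ("Espoir's makeup products have good color payoff, and I'm satisfied for the price")
| Kkma | Hannanum | Twitter | Mecab | Glowpick |
|---|---|---|---|---|
| 에스쁘아/UN | 에스쁘아/N | 에스쁘아/Noun | 에스/NNG | 에스쁘아/NNP |
| 의/JKG | 의/J | 의/Josa | 쁘아의/UNKN | 의/JKG |
| 메이크업/NNG | 메이크업/N | 메이크업/Noun | 메이크업/NNG | 메이크업/NNG |
| 제품/NNG | 제품/N | 제품/Noun | 제품/NNG | 제품/NNG |
| 은/JX | 은/J | 은/Josa | 은/JX | 은/JX |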
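Korean Morphological Analyzer Performance Comparison
“에스쁘아의 메이크업 제품은 발색력도 좋고요 가격대비에 만족합니다”
| Kkma | Hannanum | Twitter | Mecab | Glowpick |
|---|---|---|---|---|
| 발색/NNG | 발색력/N | 발/Noun | 발색/NNG | 발색력/NNP |
| 력/XSN | | 색력/Noun | 력/XSN | |
| 도/JX | 도/J | 도/Josa | 도/JX | 도/JX |
| 좋/VA | 좋/P | 좋/Adjective | 좋/VA | 좋/VA |
| 고요/EFN | 고요/E | 고요/Eomi | 고/EC | 고/EC |
| | | | 요/MM | 요/MM |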
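Korean Morphological Analyzer Performance Comparison
“에스쁘아의 메이크업 제품은 발색력도 좋고요 가격대비에 만족합니다”
| Kkma | Hannanum | Twitter | Mecab | Glowpick |
|---|---|---|---|---|
| 가격/NNG | 가격대비/N | 가격/Noun | 가격/NNG | 가격_대비 |
| 대비/NNG | 에/J | 대비/Noun | 대비/NNG | |
| 에/JKM | | 에/Josa | 에/JKB | 에/JKB |
| 만족/NNG | 만족/N | 만족합/Verb | 만족/NNG | 만족_합니다 |
| 하/XSV | 하/X | | 합니다/XSV+EC | |
| ㅂ니다/EFN | ㅂ니다/E | 니다/Eomi | | |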
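Korean Morphological Analyzer Performance Comparison
“사용한 파데는 에스쁘아 비실크와 브이디엘 퍼펙팅 래스트입니다” ("The foundations I used are Espoir Be Silk and VDL Perfecting Last")
| Kkma | Hannanum | Twitter | Mecab | Glowpick |
|---|---|---|---|---|
| 사용/NNG | 사용/N | 사용한/Verb | 사용/NNG | 사용/NNG |
| 하/XSV | 하/X | | 한/XSV+ETM | 한/XSV+ETM |
| ㄴ/ETD | ㄴ/E | | | |
| 파/NNG | 파/P | 파/Verb | 파/NNG | 파데/NNP |
| 데는/NNG | 데는/E | 데/PreEomi | 데/NNB | |
| | | 는/Eomi | 는/JX | 는/JX |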
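Korean Morphological Analyzer Performance Comparison
“사용한 파데는 에스쁘아 비실크와 브이디엘 퍼펙팅 래스트입니다”
| Kkma | Hannanum | Twitter | Mecab | Glowpick |
|---|---|---|---|---|
| 에스/NNG | 에스쁘아/N | 에스쁘/Noun | 에스/NNG | 에스쁘아/NNP |
| 쁘/UN | | | 쁘아/UNKN | |
| 아/VV, 아/ECS | | 아/Josa | | |
| 비/XPN | 비/X | 비실/Noun | 비/XPN | 비/XPN |
| 실크/NNG | 실크/N | 크/Verb | 실크/NNG | 실크/NNG |
| 와/JKM | 와/J | 와/Eomi | 와/JC | 와/JC |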
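Korean Morphological Analyzer Performance Comparison
“사용한 파데는 에스쁘아 비실크와 브이디엘 퍼펙팅 래스트입니다”
| Kkma | Hannanum | Twitter | Mecab | Glowpick |
|---|---|---|---|---|
| 브이/NNG | 브이디엘/N | 브이/Noun | 브이/NNG | 브이디엘/NNP |
| 디/NNG | | 디/Noun | 디/NNG | |
| 엘/NNG | | 엘/Josa | 엘/JKB+JKO | |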
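Korean Morphological Analyzer Performance Comparison
“사용한 파데는 에스쁘아 비실크와 브이디엘 퍼펙팅 래스트입니다”
| Kkma | Hannanum | Twitter | Mecab | Glowpick |
|---|---|---|---|---|
| 푸/VV | 퍼펙팅/N | 퍼펙팅/Noun | 퍼/VV+EC | 퍼펙팅_래스트 |
| 어/ECS | | | 펙/NNG | |
| 펙팅/UN | | | 팅/MAG | |
| 래스트/NNG | 래스트/N | 래/Josa | 래스/NNG | |
| | | 스/Noun | 트/NNG | |
| 이/VCP | 이/J | 트입니/Verb | 입니다/VCP+EC | 입니다/VCP+EC |
| ㅂ니다/EFN | ㅂ니다/E | 다/Eomi | | |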
Contacts
lee.hongjoo@yandex.com
https://www.linkedin.com/in/hongjoo-lee/
We are hiring!
