Presentation by Artem Prosvetov, data scientist at CleverDATA, on data analysis technology, illustrated by work with beauty blogs, for the Data Science Weekend conference (March 3-4, 2017).
Text mining of Beauty Blogs: what do women talk about? (Artem Prosvetov, data scientist, CleverDATA)
1. Text mining of Beauty Blogs:
What do women talk about?
Artem Prosvetov
Data Scientist, CleverDATA
2. [Pie chart legend: empty, not English, techcrunch.com, photo/video pages, correct English page]
cleverdata.ru | info@cleverdata.ru
Raw blog data
Raw data: 98,496 pages stored as ~1,000,000 files.
Ready for analysis: 58,719 English pages (59.6%).
The remaining 40.4%: empty pages and pages with errors, non-English pages (23,461), photo/video pages without text (2,315), and articles from techcrunch.com (3,402).
4. Mean blog post size (in words)
Two populations of bloggers can be distinguished:
• 'Twitter-style' authors with short posts (~20%)
• full-length bloggers averaging 200-500 words per post (~80%)
6. Sentiment analysis
• The resulting sentiment rate is based on 4 independent rating systems.
• The majority of the blogs have a positive emotion rate.
• The mean sentiment rate is 0.72 ('positive warm').
• All these results are intuitively consistent and in good agreement with manual tests.
7. Estimation of blog efficiency
We used several traffic rank systems:
• Alexa Rank, which audits and publishes the frequency of visits to Web sites.
• Yandex Thematic Citation Index (TIC), which determines the “credibility” of Internet resources based on a qualitative assessment of the links pointing to them from other sites.
• Google PageRank, which counts the number and quality of links to a blog to produce a rough estimate of how important the website is.
8. Relevance Rate
Content relevance rate is based on fuzzy string matching:
- Every company product name was string-matched against all blog pages.
- String matching is based on the Levenshtein metric.
- Pages with a matching rate of 90% or higher were marked.
- Tests with direct brand name matching showed about 90-100% accuracy for each product name, depending on the words in the title.
- The resulting relevance rate for each author is the sum of all marks on his or her pages.
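The marking and summing steps above can be sketched as follows. This is only an illustration under stated assumptions: `difflib.SequenceMatcher` from the Python standard library stands in for the Levenshtein-based matcher, the sliding window over words is one possible way to compare a short product name with a long page, and the function names, authors, and product names are hypothetical.

```python
from collections import defaultdict
from difflib import SequenceMatcher

THRESHOLD = 0.90  # pages at or above this similarity are marked


def best_window_similarity(product: str, text: str) -> float:
    """Best similarity between the product name and any run of
    consecutive words in `text` with the same number of words."""
    name = product.lower()
    words = text.lower().split()
    size = len(name.split())
    windows = [" ".join(words[i:i + size])
               for i in range(max(1, len(words) - size + 1))]
    return max(SequenceMatcher(None, name, w).ratio() for w in windows)


def relevance_rates(pages, products):
    """Count, per author, the (page, product) pairs marked as matches."""
    rates = defaultdict(int)
    for author, text in pages:
        for product in products:
            if best_window_similarity(product, text) >= THRESHOLD:
                rates[author] += 1
    return dict(rates)


# Hypothetical sample data, not taken from the study.
pages = [
    ("alice", "I tried the new Acme Face Serum last week"),
    ("alice", "My review of acme face serumm and other finds"),
    ("bob", "Ten tips for a better morning routine"),
]
products = ["Acme Face Serum", "Acme Body Oil"]
print(relevance_rates(pages, products))  # {'alice': 2}
```

The misspelled "serumm" still scores about 0.97 against the product name, so the second page is marked as well; an exact substring search would have missed it.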
9. Levenshtein distance
Levenshtein distance is a string metric for measuring the difference between two sequences.
Informally, the Levenshtein distance between two words is the minimum number of single-character edits (i.e. insertions, deletions or substitutions) required to change one word into the other.
Example: the Levenshtein-based similarity between 'beer' and 'bread' is 44/100 (the raw edit distance is 3).
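A minimal sketch of the metric, using the standard dynamic-programming recurrence. The 0-100 `similarity` score is an assumption about how a figure such as 44/100 can arise: it counts a substitution as a deletion plus an insertion and normalizes by the combined length of the two strings. The function names are illustrative.

```python
def levenshtein(a: str, b: str, sub_cost: int = 1) -> int:
    """Edit distance with unit-cost insertions and deletions and a
    configurable substitution cost (1 gives the classic distance)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,          # delete ca
                curr[j - 1] + 1,      # insert cb
                prev[j - 1] + (0 if ca == cb else sub_cost),  # substitute
            ))
        prev = curr
    return prev[-1]


def similarity(a: str, b: str) -> int:
    """0-100 score where a substitution costs 2 (delete + insert)."""
    total = len(a) + len(b)
    if total == 0:
        return 100
    return round(100 * (total - levenshtein(a, b, sub_cost=2)) / total)


print(levenshtein("beer", "bread"))  # 3
print(similarity("beer", "bread"))   # 44
```

Keeping only the previous row of the table holds memory at O(len(b)), which matters when product names are compared against many pages.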
10. Sentiments vs Blog size
The most active authors write within a narrow sentiment-rate range: 0.74 ± 0.03.
[Plot axes: sentiment rate, blog size (pages)]
12. Words vs Pages
Again, 2 kinds of bloggers:
- 'Twitter-style' authors with short posts
- full-length bloggers
[Plot axes: log(mean words per page), log(blog size)]
13. Discussion vs Sentiments
If you want to start a big discussion, you should praise something.
All highly discussed authors have a positive sentiment rate (>= 0.4).
[Plot axes: sentiment rate, mean discussion]
14. Using the Klout score for bloggers
We use the Klout service to rank authors according to online social influence.
Klout measures the size of a user's social media network and correlates the content created to measure how other users interact with that content.
- The median Klout score is 40.1.
15. Sentiments vs Klout score
One can distinguish a population of beginner bloggers with a low Klout score who tend to amplify sentiments.
[Plot axes: sentiment rate, Klout score]
16. Final Author Rating is based on:
• Number of blog pages
• Mean discussion size
• Alexa Rank + Yandex TIC + Google PageRank
• Relevance rate
• Sentiment rate
• Klout score
17. Make your data clever
Sentiment analysis:
• 4 independent sentiment rating systems are combined
• the resulting sentiment rate is fully consistent with tests
Blog efficiency rating:
• Alexa Rank
• Yandex Thematic Citation Index
• Google PageRank
Blog relevance rating:
• based on fuzzy string matching
• blogs are rated according to mentions of company products in the text
Results:
• list of the most PR-effective authors
• pragmatic statistical information
• key recommendations for bloggers
19. Testing the result
Hayley Carr (Top Rated Author):
“BlaBlaBla is definitely a brand to be reckoned with... All of the BlaBlaBla products have multiple purposes, as well as smelling and feeling fabulous; the packaging is clean and fresh whilst still looking great in your bathroom, as well as having unique application methods that only aid the product performance... It's definitely worth checking out this growing brand, before it starts taking over the world.”
21. In order to associate a blogger with a product we must:
• Find products for promotion
• Find main topics of each blogger
• Match topics of each blogger with product names
• Find best combinations of blogger and product
24. Finding topics: the document-term matrix
Let's build a document-term matrix, where each row is a document, each column is a term, and color intensity indicates that a term appears in a document at least once.
We can use the TF-IDF method to get the document-term matrix.
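A minimal sketch of the binary form of such a matrix (1 where a term occurs at least once in a document). Whitespace tokenization, the function name, and the sample documents are simplifying assumptions made for illustration.

```python
def document_term_matrix(docs):
    """Return (vocabulary, matrix); matrix[i][j] is 1 if term j occurs
    at least once in document i, else 0."""
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({term for tokens in tokenized for term in tokens})
    matrix = []
    for tokens in tokenized:
        present = set(tokens)
        matrix.append([int(term in present) for term in vocab])
    return vocab, matrix


# Hypothetical sample documents.
docs = ["face oil for dry skin", "body oil review", "dry skin tips"]
vocab, matrix = document_term_matrix(docs)
print(vocab)
for row in matrix:
    print(row)
```

Each row can be read as one row of the heatmap described above, with a filled cell wherever the value is 1.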
25. Finding topics: TF-IDF
• Term frequency TF(t,d) is the number of times that term t occurs in document d.
• The inverse document frequency (IDF) is a measure of how much information the word provides, that is, whether the term is common or rare across all documents.
• Term frequency-inverse document frequency (TF-IDF) is a numerical statistic intended to reflect how important a word is to a document in a collection or corpus.
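The definitions above can be sketched in a few lines. The slides do not state which IDF formula was used, so this sketch assumes one common variant, IDF(t) = log(N / df(t)), where N is the number of documents and df(t) is the number of documents containing t. Tokenization by whitespace and the sample documents are also assumptions.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per document, with
    weight = TF(t, d) * log(N / df(t))."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        tf = Counter(tokens)
        weights.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return weights


# Hypothetical sample documents.
docs = ["oil oil serum", "oil cream", "serum mask"]
for doc_weights in tf_idf(docs):
    print(doc_weights)
```

In the second document, "cream" outweighs "oil" because it occurs in only one of the three documents, which is the effect IDF is meant to capture.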
26. Topic extraction: NMF
• NMF (non-negative matrix factorization) is a variant of matrix factorization where we start with a document-term matrix D, approximate it as the product of two matrices W and T, and constrain the elements of W and T to be non-negative.
• This lets us interpret each row of the T matrix as a topic.
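A small pure-Python sketch of the factorization D ≈ W·T. It uses the multiplicative update rules of Lee and Seung, which is an assumption: the slides do not say which NMF algorithm or library was used. The toy matrix, term list, and helper names are hypothetical.

```python
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def transpose(m):
    return [list(col) for col in zip(*m)]


def nmf(d, k, iterations=200, seed=0, eps=1e-9):
    """Factor d (docs x terms) into w (docs x k) and t (k x terms),
    keeping every element non-negative."""
    rng = random.Random(seed)
    n, m = len(d), len(d[0])
    w = [[rng.random() for _ in range(k)] for _ in range(n)]
    t = [[rng.random() for _ in range(m)] for _ in range(k)]
    for _ in range(iterations):
        wt = transpose(w)
        num, den = matmul(wt, d), matmul(matmul(wt, w), t)
        t = [[t[i][j] * num[i][j] / (den[i][j] + eps) for j in range(m)]
             for i in range(k)]
        tt = transpose(t)
        num, den = matmul(d, tt), matmul(w, matmul(t, tt))
        w = [[w[i][j] * num[i][j] / (den[i][j] + eps) for j in range(k)]
             for i in range(n)]
    return w, t


def squared_error(d, w, t):
    approx = matmul(w, t)
    return sum((x - y) ** 2
               for d_row, a_row in zip(d, approx)
               for x, y in zip(d_row, a_row))


# Toy document-term matrix: two skincare and two makeup documents.
terms = ["oil", "serum", "lipstick", "mascara"]
d = [[2, 1, 0, 0], [1, 2, 0, 0], [0, 0, 1, 2], [0, 0, 2, 1]]
w, t = nmf(d, k=2)
for topic in t:
    top = sorted(zip(topic, terms), reverse=True)[:2]
    print([term for _, term in top])
```

Each row of `t` assigns a weight to every term, so the highest-weighted terms of a row serve as a readable label for that topic.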
27. cleverdata.ru | info@cleverdata.ru
In order to associate a blogger
with a product we must:
• Find products for promotion
• Find main topics of each blogger
• Match topics of each blogger with product names
• Find best combinations of blogger and product
28. Topic extraction
• For each author we build a document-term matrix.
• For each document-term matrix we perform matrix factorization and find the main topics.
• For each product we match the product name with the main topics of the author and find the rate of intensity.
• If an author has the exact product name in one of his or her titles, we set the rate of intensity to 0 (the author has already reviewed the product).
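The last two steps can be sketched as below. The slides do not define how the rate of intensity is computed, so the scoring rule here is purely an assumption: the score of a topic is the sum of its term weights for words that appear in the product name, and the intensity is the best score over the author's topics. The function name, topics, and titles are hypothetical.

```python
def intensity(product, topics, titles):
    """Rate of intensity for one author-product pair.

    topics: list of {term: weight} dicts (the author's main topics).
    titles: the author's post titles.
    """
    name = product.lower()
    # The author has already reviewed the product: intensity is 0.
    if any(name in title.lower() for title in titles):
        return 0.0
    words = set(name.split())
    # Assumed rule: best overlap between the product words and a topic.
    return max(
        (sum(weight for term, weight in topic.items() if term in words)
         for topic in topics),
        default=0.0,
    )


# Hypothetical author profile.
topics = [
    {"oil": 0.6, "body": 0.3, "skin": 0.1},
    {"lipstick": 0.7, "matte": 0.3},
]
titles = ["My favourite matte lipstick", "Acme Face Serum review"]
print(intensity("Acme Body Oil", topics, titles))
print(intensity("Acme Face Serum", topics, titles))
```

The second call returns 0.0 because the exact product name already appears in one of the titles, which mirrors the rule in the final bullet above.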
29. The intensity matrix
Thus for each author-product pair we find the rate of intensity, and we can visualize it as a heatmap where products are sorted by mean rate of intensity and authors are sorted by author rating.
Note: the top-rated authors show high intensity in the matrix.
30. cleverdata.ru | info@cleverdata.ru
In order to associate a blogger
with a product we must:
• Find products for promotion
• Find main topics of each blogger
• Match topics of each blogger with product names
• Find best combinations of blogger and product
31. The intensity matrix
Next we extract the strongest peaks from the product-author intensity matrix. After each peak extraction the column containing the peak is dropped, so each author gets only one product.
We need to build recommendations for only 4 products, and we can select the 40 best-rated authors for this task.
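The peak extraction described above can be read as a greedy procedure, sketched here under that assumption: take the largest remaining value, record the pair, drop that author's column, and repeat. The function name, the `n_pairs` limit, and the sample matrix are hypothetical.

```python
def extract_peaks(intensity, products, authors, n_pairs):
    """Greedy peak extraction from a product x author intensity matrix.

    intensity[p][a] is the rate of intensity of product p for author a.
    After each peak the author's column is dropped, so every author
    appears in at most one pair.
    """
    remaining = set(range(len(authors)))
    pairs = []
    while remaining and len(pairs) < n_pairs:
        value, p, a = max(
            (intensity[p][a], p, a)
            for p in range(len(products))
            for a in remaining
        )
        if value <= 0:
            break  # nothing left worth recommending
        pairs.append((products[p], authors[a], value))
        remaining.remove(a)
    return pairs


# Hypothetical matrix: rows are products, columns are authors.
products = ["Body Oil", "Face Serum"]
authors = ["ann", "bea", "cat"]
matrix = [
    [0.9, 0.2, 0.4],
    [0.8, 0.7, 0.0],
]
for pair in extract_peaks(matrix, products, authors, n_pairs=3):
    print(pair)
```

Only columns are dropped here, so one product can be paired with several authors; a strictly one-to-one assignment would also drop the product's row after each peak.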
32. In order to associate a blogger with a product we must:
• Find products for promotion
• Find main topics of each blogger
• Match topics of each blogger with product names
• Find best combinations of blogger and product
• Profit!
33. The resulting associations
Product | Author | Blog
BlaBlaBla Body Oil | Allison | http://www.neversaydiebeauty.com
BlaBlaBla Wrinkle Repair | Cindy Batchelor | http://mystylespot.net
BlaBlaBla Face Serum | Marie Papachatzis | http://iamthemakeupjunkie.blogspot.ru
BlaBlaBla Face Oil | Emily - Style Lobster | http://stylelobster.com