1. predicting the age of social network users from user-generated texts with word embeddings
Anton Alekseev (1) and Sergey Nikolenko (1,2,3)
AINL 2016
St. Petersburg, November 11, 2016
(1) Steklov Institute of Mathematics at St. Petersburg
(2) Kazan Federal University, Kazan, Russia
(3) Laboratory for Internet Studies, NRU Higher School of Economics, St. Petersburg
2. deep learning for nlp
• Recent successes on numerous tasks, from text classification to syntactic parsing.
• Various applications of word vectors pretrained on huge
corpora: word2vec, GloVe, etc.
• In this work, we try several approaches to profiling social network users based on their full-text items.
3. dataset
• The dataset covers 868,126 users of the Odnoklassniki social network. In particular, it contains the following data:
(1) demographic information on 868,126 users of the network:
gender, age, and region;
(2) the “friendship” relation, with several different types of links: “friend”, “love”,
“spouse”, “parent”, and so on;
(3) history of logins for individual users;
(4) data on the “likes” (“class” marks) a user has given to other users’
statuses and posts in various groups;
(5) texts of the posts for individual users;
(6) statuses of communities (“groups”) that have been liked by the selected users.
4. dataset
• The dataset has been created as follows:
(1) start with 486 seed users;
(2) extract the sets of friends of these users;
(3) then the friends of these friends; as a result, we obtain the seed users'
neighborhood of depth 2 in the social graph.
5. dataset: age
• Mean age: 31.39 years.
[Figure 1: Users' age distribution in the Odnoklassniki dataset; x-axis: age (0 to 120), y-axis: number of users (log scale, 10^0 to 10^5)]
• Note the users with implausible ages (ages 2 and 3, ages higher than 100 years).
• We have removed all users with ages below 10 and above 80; a minimal sketch of this filtering step follows.
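A minimal sketch of this cleaning step, assuming the demographic data sits in a pandas DataFrame with an integer age column (the file and column names are hypothetical):

```python
import pandas as pd

# Hypothetical file with one row per user and an integer `age` column.
users = pd.read_csv("users_demography.csv")

# Keep only plausible ages: drop users younger than 10 or older than 80.
users = users[(users["age"] >= 10) & (users["age"] <= 80)]
```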
6. dataset: friends
[Figure 2: Distribution of the number of friends in the Odnoklassniki dataset; log-log plot, x-axis: number of friends (10^0 to 10^4), y-axis: number of users (10^0 to 10^4)]
• Note the “arc” (not a straight line!) on the log-log plot; this is probably due to
the collection method: there are fewer degree-0, degree-1, etc. vertices
(that is, users) than in the whole “social graph”.
7. dataset and setup for word embeddings
A large Russian-language corpus with about 14G tokens in 2.5M
documents. This corpus includes:
• Russian Wikipedia: 1.15M documents, 238M tokens;
• automated Web crawl data: 890K documents, 568M tokens;
• (main part) the huge lib.rus.ec library corpus: 234K documents,
12.9G tokens;
• user statuses and group posts from the Odnoklassniki social
network, as described above.
For more details, see the article cited in the paper:
Arefyev N., Panchenko A., Lukanin A., Lesota O., Romanov P. Evaluating Three Corpus-Based Semantic Similarity Systems for Russian. In Proceedings of the International Conference on Computational Linguistics “Dialogue 2015”, 2015.
8. dataset and setup for word embeddings
• word2vec configurations we took into consideration (a training sketch follows the table):

Table 1: word2vec models trained on large Russian corpora. The columns are: dimension d, window size w, number of negative samples n, vocabulary threshold v, and the resulting model size.

Type       d    w   n   v   Size
CBOW       100  11  10  20  1.3G
CBOW       100  11  10  30  0.97G
skip-gram  100  11  10  30  0.97G
skip-gram  200  11  1   20  1.3G
CBOW       200  11  1   30  2.0G
skip-gram  200  11  1   30  2.0G
CBOW       300  11  1   30  2.9G
skip-gram  300  11  1   30  2.9G
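For concreteness, here is a hedged sketch of how one row of Table 1 could be reproduced with gensim (4.x API); the corpus file name is hypothetical, and the authors' exact training setup may differ:

```python
from gensim.models import Word2Vec
from gensim.models.word2vec import LineSentence

# One tokenized document per line (hypothetical file name).
sentences = LineSentence("russian_corpus.txt")

model = Word2Vec(
    sentences,
    vector_size=300,  # dimension d
    window=11,        # window size w
    negative=1,       # number of negative samples n
    min_count=30,     # vocabulary threshold v
    sg=1,             # 1 = skip-gram, 0 = CBOW
    workers=8,
)
model.wv.save("skipgram_d300_w11_n1_v30.kv")  # keep only the word vectors
```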
9. chosen baseline approaches
• How well is a user's age approximated by the mean age of their friends? We compare four baselines (a minimal sketch follows the list):
(1) MeanAge: predict age with the mean of friends' ages;
(2) LinearRegr: linear regression with a single feature (mean friends' age);
(3) ElasticNet: elastic net regressor with a single feature (mean friends' age);
(4) GradBoost: gradient boosting with a single feature (mean friends' age).
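A minimal sketch of these baselines with scikit-learn; the toy data below stands in for the real mean-friends-age feature and age targets, and all names are hypothetical:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import ElasticNet, LinearRegression
from sklearn.metrics import mean_absolute_error

# Toy stand-in data: one mean-friends-age feature per user.
rng = np.random.default_rng(0)
maf_train = rng.uniform(15, 60, size=(1000, 1))
age_train = maf_train.ravel() + rng.normal(0, 5, size=1000)
maf_test = rng.uniform(15, 60, size=(200, 1))
age_test = maf_test.ravel() + rng.normal(0, 5, size=200)

# MeanAge: the prediction is simply the mean age of friends itself.
print("MeanAge", mean_absolute_error(age_test, maf_test.ravel()))

for name, model in [
    ("LinearRegr", LinearRegression()),
    ("ElasticNet", ElasticNet()),
    ("GradBoost", GradientBoostingRegressor()),
]:
    model.fit(maf_train, age_train)
    print(name, mean_absolute_error(age_test, model.predict(maf_test)))
```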
10. algorithms based on word embeddings
• The sum and/or mean of word embeddings is used to represent a sentence/paragraph (see the sketch after this list).
• The idea is not as naive as it seems; it has been successfully applied in a few studies (see the paper for references).
• Three basic algorithms:
(1) MeanVec: train on the mean vectors of all statuses of a user;
(2) LargestCluster: train on the centroid of the largest cluster of statuses;
(3) AllMeanV: train on every status independently, with the mean vector of a specific status and the mean age of friends as features and the user's demography as target; at the testing stage, we compute predictions for every status and average them.
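A minimal sketch of the status representation, assuming word vectors from one of the Table 1 models have been saved as gensim KeyedVectors (the file name is hypothetical):

```python
import numpy as np
from gensim.models import KeyedVectors

kv = KeyedVectors.load("skipgram_d300_w11_n1_v30.kv")

def status_vec(tokens, kv):
    """Mean word vector of one status; None if no token is in the vocabulary."""
    vecs = [kv[w] for w in tokens if w in kv]
    return np.mean(vecs, axis=0) if vecs else None
```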
11. meanvec
The MeanVec algorithm operates as follows: for a machine learning algorithm ML,
(1) for every user $u \in U$:
  • for every status $s \in S_u$, compute its mean vector $\bar{v}_s = \frac{1}{|s|} \sum_{w \in s} v_w$;
  • compute the mean vector of all statuses, $\bar{v}_u = \frac{1}{|S_u|} \sum_{s \in S_u} \bar{v}_s$;
(2) train ML with features $(\mathrm{MAF}_u, \bar{v}_u)$ for every $u \in U$, where $\mathrm{MAF}_u$ is the mean age of $u$'s friends.
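A minimal sketch of the MeanVec feature construction, assuming statuses maps each user to the list of per-status mean vectors (e.g., from status_vec above) and maf holds the mean-friends-age values; all names are hypothetical:

```python
import numpy as np

def meanvec_features(statuses, maf):
    """One row (MAF_u, mean of status vectors) per user."""
    user_ids, rows = [], []
    for u, vecs in statuses.items():
        v_u = np.mean(vecs, axis=0)                  # mean over all statuses of u
        rows.append(np.concatenate(([maf[u]], v_u)))
        user_ids.append(u)
    return user_ids, np.array(rows)
```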
12. largestcluster
The LargestCluster algorithm operates as follows: for a machine
learning algorithm ML,
(1) for every user $u \in U$:
  • for every status $s \in S_u$, compute its mean vector $\bar{v}_s = \frac{1}{|s|} \sum_{w \in s} v_w$;
  • cluster the set of vectors $\{\bar{v}_s \mid s \in S_u\}$ into two clusters with agglomerative clustering; denote by $C \subseteq S_u$ the larger cluster;
  • compute the mean vector of the statuses from $C$, $\bar{c}_u = \frac{1}{|C|} \sum_{s \in C} \bar{v}_s$;
(2) train ML with features $(\mathrm{MAF}_u, \bar{c}_u)$ for every $u \in U$.
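A minimal sketch of the per-user clustering step with scikit-learn, assuming vecs is an (n_statuses, d) array of mean status vectors for one user (a hypothetical name):

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

def largest_cluster_centroid(vecs):
    """Centroid of the larger of two agglomerative clusters of status vectors."""
    if len(vecs) < 2:  # clustering needs at least two points
        return vecs.mean(axis=0)
    labels = AgglomerativeClustering(n_clusters=2).fit_predict(vecs)
    largest = np.bincount(labels).argmax()  # label of the larger cluster
    return vecs[labels == largest].mean(axis=0)
```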
13. allmeanv
The AllMeanV algorithm operates as follows: for a machine learning
algorithm ML,
(1) for every user $u \in U$ and every status $s \in S_u$, compute its mean vector $\bar{v}_s = \frac{1}{|s|} \sum_{w \in s} v_w$;
(2) train ML with features $(\mathrm{MAF}_u, \bar{v}_s)$ for every $u \in U$ and $s \in S_u$;
(3) at the prediction stage, for a user $u \in U_{\mathrm{test}}$:
  • for every status $s \in S_u$, compute its mean vector $\bar{v}_s = \frac{1}{|s|} \sum_{w \in s} v_w$;
  • predict the age for this status, $a_s = \mathrm{ML}(\mathrm{MAF}_u, \bar{v}_s)$;
  • return the average predicted age, $a_u = \frac{1}{|S_u|} \sum_{s \in S_u} a_s$.
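A minimal sketch of the AllMeanV prediction stage, assuming ml is a regressor already fitted on one (MAF_u, v̄_s) row per status; the names are hypothetical:

```python
import numpy as np

def allmeanv_predict(ml, maf_u, status_vecs):
    """Average the per-status age predictions for one test user."""
    rows = np.array([np.concatenate(([maf_u], v)) for v in status_vecs])
    return ml.predict(rows).mean()
```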
14. evaluation: prepared data
The data was split into training and test sets 80:20 (a sketch of the split follows the table).

Set       Users    Statuses
Training  661,206  10,880,321
Test      165,301  2,704,883
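A minimal sketch of the split with scikit-learn; splitting by user (rather than by status) keeps all statuses of one user on the same side. The toy ids stand in for the real ones:

```python
from sklearn.model_selection import train_test_split

user_ids = list(range(1000))  # toy stand-in for the real user ids
train_users, test_users = train_test_split(user_ids, test_size=0.2, random_state=42)
```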
15. evaluation: comparing meanvec models using word2vec with different parameters
• Adding word2vec components does help all models.
• CBOW models outperform skip-gram models.
• Increasing dimensionality adds minor improvements.
• Tweaking the vocabulary threshold does not greatly affect the results, but a less strict threshold increases training time. Thus, tuning the threshold makes it possible to reduce computational expenses without loss in model quality.
• The more expressive the classifier, the better normalized word2vec models perform.
16. evaluation: meanvec vs largestcluster
• MeanVec outperforms LargestCluster in all experiments.
• A possible reason: the largest cluster consists of the least meaningful items, e.g., updates from online games and short texts (such as “:)”). We verified this by direct examination.
• This requires thorough research, analysis, and experiments; variations on clustering may still prove to be useful.
17. evaluation: meanvec vs largestcluster vs allmeanv
• LargestCluster loses to both MeanVec and AllMeanV in almost all cases.
• AllMeanV is on par with MeanVec for linear models, but not for GradBoost, where MeanVec wins. This can be interpreted as showing that simple averaging works very well in the semantic space (not surprising, since many semantic relations are known to become linear in the space of embeddings), even better than building an ensemble of predictions from individual statuses.
18. summary
• Huge unlabeled text datasets were processed, and a number of word2vec models were trained.
• User age prediction algorithms were suggested and compared with each other.
• The improvements over the “mean friends age” baseline were not huge, but the general approach still looks promising.
• Possible developments:
  • other word embeddings, possibly including ones that do word sense disambiguation;
  • text embeddings (such as Paragraph Vectors);
  • character-level embeddings.