The document discusses several studies that aimed to determine personality types through analysis of social media data like tweets.
One study analyzed over 1.7 million tweets to determine Myers-Briggs Type Indicator (MBTI) personality types. It found that combining word vectors, part-of-speech tags, and n-grams achieved about 66% accuracy on average. Another study used over 960,000 tweets to classify MBTI types; its support vector machine and logistic regression classifiers reached over 99% accuracy on training data, but accuracy dropped to roughly 60% on testing data, likely due to noise in tweets. A third study compiled a dataset of 1.2 million tweets from 1,500 users who self-reported their MBTI types.
5. WHY PERSONALITY PREDICTION?
Areas directly affected by a user’s personality:
1. Marketing.
2. Recommendation Systems.
3. Customized web pages, advertisements and products.
4. Customized search engines and user experience.
5. Understanding criminal and psychopathic behaviors.
6. Sentiment analysis and clustering of text.
By Joud Khattab 5
6. LITERATURE SURVEY
1) Understanding Personality through Social Media:
Y. Wang et al. (2016), Department of Computer Science, Stanford University.
2) Detection of MBTI via Text Based Computer-Mediated Communication:
D. Brinks et al. (2012), Department of Electrical Engineering, Stanford University.
3) Personality Traits on Twitter:
B. Plank et al. (2015), Center for Language Technology, University of Copenhagen.
4) Identifying Personality Types Using Document Classification Methods:
M. Komisin et al. (2012), Department of Computer Science, University of North Carolina Wilmington.
8. DATA SET
(Y. WANG, 2016)
Twitter dataset:
Collected through the GNIP APIs.
Around 90,000 users.
All personality-related tweets from 2006 to 2015 were extracted and filtered.
The most recent tweets of all 90,000 users were collected.
1.7 million tweets contain personality codes.
9. DATA CLEANING
(Y. WANG, 2016)
1. Positive Tweets:
@ProfCarol Just wondering, what’s your type? I’m an ENFJ
@whitneyhess that’s an interesting test.. I got ENTP and it seems pretty accurate IMO
@megfowler I’m INTP according to this http://similarminds.com/jung.html
2. Negative Tweets:
I’ll bet that Jeremiah @jowyang is an ESTJ
@mark ENTJ You should have known... http://typelogic.com/entj.html
I love my wife. Even though she’s INFP
After cleaning, 120K of the 1.7M tweets with personality codes were retained.
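The positive/negative split above can be approximated with a pattern that keeps only first-person statements. This is a minimal sketch: the slides do not list the rules actually used, so the `SELF_REPORT` pattern and the `self_reported_type` helper are hypothetical.

```python
import re

# Hypothetical first-person cues ("I'm", "I am", "I got") followed by an MBTI code.
# The slides do not state the real filtering rules; this is only an illustration.
SELF_REPORT = re.compile(
    r"\bI(?:'m| am| got)\s+(?:an?\s+)?([EI][NS][TF][JP])\b",
    re.IGNORECASE,
)

def self_reported_type(tweet):
    """Return the MBTI code a tweet's author claims for themselves, or None."""
    text = tweet.replace("\u2019", "'")  # normalize curly apostrophes
    match = SELF_REPORT.search(text)
    return match.group(1).upper() if match else None
```

Tweets that mention a code for someone else ("she’s INFP") or only link to a type description return None and would be treated as negative.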
10. SOCIAL MEDIA DATA DISADVANTAGE
(Y. WANG, 2016)
Language on social media has richer content, which makes linguistic analysis tools perform poorly.
Each tweet is limited to 140 characters and may contain hashtags, at-mentions, URLs, and emoticons.
People tend to use shortened versions of phrases; for example, “iono” means “I don’t know”.
Lack of conventional orthography.
Collecting personality data is costly.
12. FEATURES SELECTION
(Y. WANG, 2016)
1) Bag of N-Grams.
2) Part-Of-Speech Tags.
3) Word Vectors.
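As an illustration of the first feature type, a bag of n-grams can be built by counting every run of 1 to 3 consecutive tokens. This is a sketch under simple assumptions (whitespace tokenization, helper names `ngrams` and `bag_of_ngrams` chosen here), not the paper's implementation.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return all runs of n consecutive tokens as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bag_of_ngrams(tokens, max_n=3):
    """Count unigrams, bigrams, ... up to max_n-grams in one Counter."""
    bag = Counter()
    for n in range(1, max_n + 1):
        bag.update(ngrams(tokens, n))
    return bag

tokens = "i got entp and i got lucky".split()
bag = bag_of_ngrams(tokens)
```

Each user can then be represented by the counts in `bag`, restricted to a shared vocabulary of n-grams.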
13. N-GRAM
(Y. WANG, 2016)
[Figure: top correlated unigrams for Thinking and for Feeling; top correlated bigrams for Introversion and for Extroversion.]
14. POS TAGGING
(Y. WANG, 2016)
A Twitter POS tagger with 25 distinct tag types was used.
Common nouns are a good indicator of personality.
People who use common nouns more often tend to be Extroversion, Intuition, Thinking, or Judging types.
Introverted people use more pronouns but fewer common nouns.
Interjections (“lol”, “haha”, “FTW”, “yea”) are more likely to be used by Sensing and Perceiving types.
Emoticons are more likely to be used by Sensing and Feeling types.
Numbers are more likely to be used by Sensing and Thinking types.
Extroverted people are more likely to use hashtags.
15. WORD VECTORS
(Y. WANG, 2016)
1) Average word vectors:
The vectors of all words that appear in a user's tweets are averaged to give the vector representation of that user.
2) Weighted average word vectors:
The vectors of the words that appear in a user's tweets are averaged with weights given by their TF-IDF values.
The weighted average is then used as the vector representation of that user.
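Both user representations can be sketched in a few lines. The slides do not say which TF-IDF variant was used, so the smoothed IDF below, the toy `embeddings` dictionary, and the function names are assumptions made for illustration.

```python
import math
from collections import Counter

def average_vector(tokens, embeddings):
    """Plain average of the vectors of all known tokens (None if none are known)."""
    vectors = [embeddings[t] for t in tokens if t in embeddings]
    if not vectors:
        return None
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]

def tfidf_weighted_vector(tokens, embeddings, doc_freq, n_docs):
    """Average of word vectors weighted by term frequency times a smoothed IDF."""
    counts = Counter(t for t in tokens if t in embeddings)
    if not counts:
        return None
    # Assumed IDF variant: ln((1 + N) / (1 + df)) + 1, which is always positive.
    weights = {
        t: tf * (math.log((1 + n_docs) / (1 + doc_freq.get(t, 0))) + 1)
        for t, tf in counts.items()
    }
    total = sum(weights.values())
    dim = len(next(iter(embeddings.values())))
    return [
        sum(w * embeddings[t][i] for t, w in weights.items()) / total
        for i in range(dim)
    ]

# Toy 2-dimensional embeddings, invented for the example.
embeddings = {"a": [1.0, 0.0], "b": [0.0, 1.0]}
```

With the weighted version, a word that appears in few users' tweets pulls the user vector toward itself more strongly than a word that appears everywhere.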
16. MODEL SELECTION
(Y. WANG, 2016)
1. Logistic Regression model with 10-fold cross-validation.
2. Random Forest and SVM.
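The 10-fold cross-validation loop can be sketched without any library. The `fit` and `predict` callables are placeholders for whichever model is evaluated; they and the function names are assumptions for illustration. The sketch uses contiguous folds and assumes at least `k` samples; in practice the data would usually be shuffled first.

```python
def k_fold_indices(n_samples, k=10):
    """Yield (train_indices, test_indices) for k contiguous, nearly equal folds."""
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, test
        start += size

def cross_validated_accuracy(features, labels, fit, predict, k=10):
    """Mean test accuracy over k folds; fit/predict are supplied by the caller."""
    scores = []
    for train, test in k_fold_indices(len(labels), k):
        model = fit([features[i] for i in train], [labels[i] for i in train])
        correct = sum(predict(model, features[i]) == labels[i] for i in test)
        scores.append(correct / len(test))
    return sum(scores) / len(scores)
```

Every sample is used for testing exactly once, and the reported figure is the mean accuracy over the ten held-out folds.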
17. MODEL RESULTS
(Y. WANG, 2016)
Classifier                    E vs I   N vs S   T vs F   P vs J   Average
Word Vector                   67.9%    64.3%    67.3%    60.8%    65.1%
Bag of n-grams                63.1%    58.8%    62.1%    58.8%    60.7%
Unigram                       61.7%    58.1%    60.9%    58.2%    59.7%
Bigram                        60.9%    56.9%    60.7%    57.3%    59.0%
Trigram                       61.3%    56.7%    59.3%    57.0%    58.6%
POS Tag                       59.3%    57.5%    60.3%    56.9%    58.5%
POS + n-grams                 62.8%    60.7%    63.3%    59.6%    61.6%
POS + n-gram + Word Vector    69.1%    65.3%    68.0%    61.9%    66.1%
18. DETECTION OF MBTI VIA TEXT BASED COMPUTER-MEDIATED COMMUNICATION
D. Brinks et al. (2012)
Department of Electrical Engineering
Stanford University
19. DATA SET
(D. BRINKS, 2012)
The Twitter API was used to retrieve tweets that include an MBTI abbreviation.
6,358 users with 960,715 tweets.
Multiple levels of data elimination were applied to remove improper data.
20. DATA CLEANING
(D. BRINKS, 2012)
Many users labeled “INTP” weren’t referencing their MBT; instead, they had simply misspelled “into”.
Any user whose tweets contained two or more different MBTs was rejected.
Numbers, links, @<user>, and MBTs were replaced with “NUMBER”, “URL”, “AT_USER”, and “MBT”.
Contractions were replaced by their expanded form.
Words were converted to lowercase.
Finally, all of a user’s tweets were aggregated into a single text block.
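The cleaning steps listed above can be sketched with regular expressions. The tiny `CONTRACTIONS` table and the function names are assumptions for illustration, and the sketch lowercases first so that the uppercase placeholders survive, which may differ from the order used in the paper.

```python
import re

# Tiny illustrative table; a real contraction list would be much longer.
CONTRACTIONS = {"i'm": "i am", "don't": "do not", "it's": "it is"}
CONTRACTION_RE = re.compile(r"\b(?:" + "|".join(map(re.escape, CONTRACTIONS)) + r")\b")
MBT_RE = re.compile(r"\b[ei][ns][tf][jp]\b")

def normalize_tweet(text):
    """Lowercase a tweet and replace links, mentions, MBTs, and numbers with placeholders."""
    text = text.replace("\u2019", "'").lower()
    text = re.sub(r"https?://\S+", " URL ", text)
    text = re.sub(r"@\w+", " AT_USER ", text)
    text = MBT_RE.sub(" MBT ", text)
    text = CONTRACTION_RE.sub(lambda m: CONTRACTIONS[m.group(0)], text)
    text = re.sub(r"\b\d+(?:\.\d+)?\b", " NUMBER ", text)
    return " ".join(text.split())

def mentioned_types(tweets):
    """Set of distinct MBTs across a user's tweets; more than one means reject the user."""
    return {m.upper() for t in tweets for m in MBT_RE.findall(t.lower())}

def aggregate_user(tweets):
    """Join all of a user's normalized tweets into a single text block."""
    return " ".join(normalize_tweet(t) for t in tweets)
```

For example, `normalize_tweet("@megfowler I’m INTP according to this http://similarminds.com/jung.html")` gives `"AT_USER i am MBT according to this URL"`.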
22. PROCESSING PARAMETERIZATION
(D. BRINKS, 2012)
1) Porter Stemming.
2) Emoticon Substitution.
3) Minimum Token Frequency.
4) Minimum User Frequency.
5) Term Frequency Transform.
6) Inverse Document Frequency Transform.
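Parameters 3 to 6 can be illustrated together. The slides do not define them precisely, so this sketch assumes that minimum token frequency is a threshold on total occurrences, minimum user frequency is a threshold on the number of users using a token, term frequency is the count divided by the user's total, and IDF is ln(N / user frequency). Stemming and emoticon substitution are left out.

```python
import math
from collections import Counter

def build_features(user_tokens, min_token_freq=2, min_user_freq=2):
    """user_tokens maps user -> list of tokens; returns (vocab, per-user TF-IDF dicts)."""
    token_freq = Counter(t for toks in user_tokens.values() for t in toks)
    user_freq = Counter(t for toks in user_tokens.values() for t in set(toks))
    # Keep only tokens that are frequent enough overall and used by enough users.
    vocab = sorted(
        t for t in token_freq
        if token_freq[t] >= min_token_freq and user_freq[t] >= min_user_freq
    )
    n_users = len(user_tokens)
    idf = {t: math.log(n_users / user_freq[t]) for t in vocab}
    features = {}
    for user, toks in user_tokens.items():
        counts = Counter(t for t in toks if t in idf)
        total = sum(counts.values())
        features[user] = {t: (c / total) * idf[t] for t, c in counts.items()}
    return vocab, features

# Toy data, invented for the example.
users = {"u1": ["cat", "cat", "dog"], "u2": ["cat", "fish"], "u3": ["dog", "bird"]}
vocab, features = build_features(users)
```

Raising either threshold shrinks the vocabulary, which is one way to reduce the number of features fed to the classifier.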
23. TRAINING ACCURACY BY CLASSIFIER
(D. BRINKS, 2012)
Classifier                                          E vs I   N vs S   T vs F   P vs J   Average
Multinomial Event Model Naive Bayes                 96.0%    83.4%    84.6%    75.9%    85.0%
L2-regularized logistic regression (primal)         99.8%    99.8%    100.0%   99.8%    99.9%
L2-regularized L2-loss SV classification (dual)     99.8%    99.9%    99.9%    99.9%    99.9%
L2-regularized L2-loss SV classification (primal)   99.8%    99.9%    99.9%    99.9%    99.9%
L2-regularized L1-loss SV classification (dual)     99.9%    99.9%    99.9%    99.9%    99.9%
SV classification by Crammer and Singer             100.0%   100.0%   100.0%   100.0%   100.0%
L1-regularized L2-loss SV classification            100.0%   100.0%   100.0%   100.0%   100.0%
L1-regularized logistic regression                  99.9%    99.9%    99.8%    99.9%    99.9%
L2-regularized logistic regression (dual)           100.0%   100.0%   100.0%   100.0%   100.0%
24. HIGH VARIANCE SOLUTIONS
(D. BRINKS, 2012)
1. Get more data:
Unfortunately, Twitter places a cap on data retrieval requests.
Even after tripling the number of collected tweets, performance remained constant.
2. Decrease the feature set size:
The preprocessing steps were modified.
The number of features fed to the classifier was parameterized to determine the optimal features.
Several transforms were added to the classifier.
The algorithm was modified to use confidence metrics in its classification and instructed to decide only for users about which it had a strong degree of certainty.
However, none of these options improved testing behavior to any significant degree.
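The confidence-based variant can be pictured as a classifier that abstains when its predicted probability is close to 0.5. The threshold value and the function names below are assumptions; the slides do not describe the actual confidence metric.

```python
def decide_with_confidence(prob_positive, threshold=0.8):
    """Return 1 or 0 when the classifier is confident enough, else None (abstain)."""
    if prob_positive >= threshold:
        return 1
    if prob_positive <= 1 - threshold:
        return 0
    return None

def selective_accuracy(probs, labels, threshold=0.8):
    """Return (coverage, accuracy on decided users); accuracy is None if nobody is decided."""
    decisions = [decide_with_confidence(p, threshold) for p in probs]
    decided = [(d, y) for d, y in zip(decisions, labels) if d is not None]
    coverage = len(decided) / len(labels)
    accuracy = sum(d == y for d, y in decided) / len(decided) if decided else None
    return coverage, accuracy

# Made-up probabilities and labels, only to show the calculation.
coverage, accuracy = selective_accuracy([0.95, 0.55, 0.10, 0.85], [1, 0, 0, 0])
```

The trade-off is between coverage (how many users receive a decision) and accuracy on the decided users.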
25. PERFORMANCE BY CLASSIFIER
(D. BRINKS, 2012)
Classifier                                          E vs I   N vs S   T vs F   P vs J   Average
Multinomial Event Model Naive Bayes                 63.9%    74.6%    60.8%    58.5%    64.5%
L2-regularized logistic regression (primal)         60.3%    70.7%    59.4%    56.1%    61.6%
L2-regularized L2-loss SV classification (dual)     56.9%    67.5%    59.3%    54.1%    59.5%
L2-regularized L2-loss SV classification (primal)   58.8%    69.5%    59.0%    55.9%    61.0%
L2-regularized L1-loss SV classification (dual)     56.8%    67.6%    59.6%    54.5%    59.7%
SV classification by Crammer and Singer             56.8%    67.7%    59.4%    54.5%    59.6%
L1-regularized L2-loss SV classification            59.4%    68.3%    56.8%    56.1%    60.2%
L1-regularized logistic regression                  60.9%    70.5%    58.5%    56.3%    61.6%
L2-regularized logistic regression (dual)           59.2%    69.6%    59.0%    55.0%    60.7%
26. DATA PROBLEM
(D. BRINKS, 2012)
The machine classifier did not achieve better performance because a large portion of tweets are noise with respect to MBTI.
Twitter imposes a 140-character limit on each tweet, so users are forced to express themselves succinctly.
A large percentage of tokens in tweets are not English words but Twitter handles being retweeted or URLs. Thus, while a user’s tweet set may contain a thousand tokens, a significant subset is unique to that individual user and cannot be used for correlation.
Due to retweeting, a user’s tweet may not be expressing his or her own thoughts.
27. COMPARISON WITH HUMAN EXPERTS
(D. BRINKS, 2012)
Spectrum Human 1 Human 2 MNEMNB
E vs I 50.0% 40.0% 55.0%
N vs S 50.0% 90.0% 90.0%
T vs F 80.0% 65.0% 55.0%
P vs J 60.0% 50.0% 65.0%
Average 60.0% 61.3% 66.3%
28. PERSONALITY TRAITS ON TWITTER
B. Plank et al. (2015)
Center for Language Technology
University of Copenhagen
29. DATA SET
(B. PLANK, 2015)
Corpus of 1.2M tweets.
1,500 users who self-identify with an MBTI type.
Open source code and data set.
32. MBTI distribution in Twitter corpus vs. general US population
[Figure: bar chart over the 16 MBTI types (ISTP, ESFP, ESFJ, ESTJ, ESTP, ENFJ, ENTJ, ISTJ, ISFP, ENTP, ISFJ, INTP, ENFP, INFJ, INFP, INTJ), axis from 0 to 18, with series for US Population, Paper 1, Paper 2, and Paper 3.]
33. CLASSIFIER
(B. PLANK, 2015)
Accuracy for four discrimination tasks:
Classifier   E vs I   N vs S   T vs F   P vs J   Average
Majority     64.1%    77.5%    58.4%    58.8%    64.7%
System       72.5%    77.4%    61.2%    55.4%    66.6%
Prediction performance for four discrimination tasks, controlled for gender:
Classifier   E vs I   N vs S   T vs F   P vs J   Average
Majority     64.9%    79.6%    51.8%    59.4%    63.9%
System       72.1%    79.5%    54.0%    58.2%    66.0%
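The Majority rows are presumably the usual majority-class baseline: the accuracy obtained by always predicting the more frequent pole of each dimension. A minimal sketch of that calculation follows; the label counts are invented for illustration.

```python
from collections import Counter

def majority_baseline_accuracy(labels):
    """Accuracy of a classifier that always predicts the most frequent label."""
    most_common_count = Counter(labels).most_common(1)[0][1]
    return most_common_count / len(labels)

# Made-up label counts, chosen only to show the calculation.
labels = ["I"] * 641 + ["E"] * 359
baseline = majority_baseline_accuracy(labels)
```

A system is only informative on a dimension when it beats this number, which is why a high raw accuracy on a skewed dimension such as N vs S says little on its own.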
34. PREDICTIVE FEATURES
(B. PLANK, 2015)
INTROVERT
• someone (91%)
• probably (89%)
• favorite (83%)
• stars (81%)
• b (81%)
• writing (78%)
• , the (77%)
• status count < 5000 (77%)
• lol (74%)
• but i (74%)
EXTROVERT
• pull (96%)
• mom (81%)
• travel (78%)
• don’t get (78%)
• when you’re (77%)
• posted (77%)
• #HASHTAG is (76%)
• comes to (72%)
• tonight ! (71%)
• join (69%)
THINKING
• must be (95%)
• drink (95%)
• red (91%)
• from the (89%)
• all the (88%)
• business (85%)
• to get a (81%)
• hope (81%)
• june (78%)
• their (77%)
FEELING
• out to (88%)
• difficult (87%)
• the most (85%)
• couldn’t (85%)
• me and (80%)
• in @USER (80%)
• wonderful (79%)
• what it (79%)
• trying to (79%)
• ! so (78%)
35. IDENTIFYING PERSONALITY TYPES USING DOCUMENT CLASSIFICATION METHODS
M. Komisin et al. (2012)
Department of Computer Science
University of North Carolina Wilmington
36. DATA SET
(M. KOMISIN, 2012)
Data collected as part of a graduate course:
Students took the MBTI Step II.
Completed a Best Possible Future Self (BPFS) exercise.
Over 3 semesters, data was collected from 40 subjects.
Best Possible Future Self Writing (BPFS) Exercise:
This essay contains elements of self-description, present and future, as well as various contexts.
“Think about your life in the future. Imagine everything has gone as well as it possibly could. You have succeeded in accomplishing all your life goals. Think of this as the realization of all your dreams. Now, write about it.”
Many existing data sets are comprised of written essays, which usually contain highly canonical
language, often of a specific topic.
Such controlled settings inhibit the expression of individual traits much more than spontaneous
language.
37. PREPROCESSING
(M. KOMISIN, 2012)
1. Word stemming.
2. Stop-words removal.
3. Multiple Data smoothing techniques.
Lidstone smoothing.
Good-Turing smoothing.
Witten and Bell Smoothing.
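Of the three techniques, Lidstone smoothing is the simplest to show: a constant λ is added to every word count, so that words unseen for a class still get a non-zero probability. The value of `lam` and the toy counts below are assumptions; Good-Turing and Witten-Bell smoothing are not shown.

```python
from collections import Counter

def lidstone_probability(word, counts, vocab_size, lam=0.5):
    """P(word | class) = (count(word) + lam) / (total count + lam * vocab_size)."""
    total = sum(counts.values())
    return (counts.get(word, 0) + lam) / (total + lam * vocab_size)

# Toy counts for one class, invented for the example.
counts = Counter({"future": 3, "family": 1})
vocab = ["future", "family", "career", "travel"]
probs = {w: lidstone_probability(w, counts, len(vocab)) for w in vocab}
```

With λ = 1 this reduces to Laplace (add-one) smoothing; smaller values stay closer to the raw relative frequencies.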
38. MODEL SELECTION
(M. KOMISIN, 2012)
1. Naïve Bayes.
2. SVM.
3. Linguistic Inquiry and Word Count (LIWC).
39. LIWC FEATURES
(PENNEBAKER, 2001)
STANDARD COUNTS:
Word count, words per sentence, type/token ratio, words captured, words longer than 6
letters, negations, assents, articles, prepositions, numbers.
Pronouns: 1st person singular, 1st person plural, total 1st person, total 2nd person, total
3rd person
PSYCHOLOGICAL PROCESSES:
Affective or emotional processes: positive emotions, positive feelings, optimism and
energy, negative emotions, anxiety or fear, anger, sadness.
Cognitive Processes: causation, insight, discrepancy, inhibition, tentative, certainty.
Sensory and perceptual processes: seeing, hearing, feeling.
Social processes: communication, other references to people, friends, family, humans.
40. LIWC FEATURES
(PENNEBAKER, 2001)
RELATIVITY:
Time, past tense verb, present tense verb, future tense verb.
Space: up, down, inclusive, exclusive.
Motion.
PERSONAL CONCERNS:
Occupation: school, work and job, achievement.
Leisure activity: home, sports, television and movies, music.
Money and financial issues.
Metaphysical issues: religion, death, physical states and functions, body states and
symptoms, sexuality, eating and drinking, sleeping, grooming.
42. TEXT FEATURES OF BPFS ESSAYS
(M. KOMISIN, 2012)
Myers-Briggs Preferences   Word Tokens   Unique Words   Word Tokens Per Document   Unique Word Types Per Document
Extraversion               10,428        1,859          401                        72
Introversion               5,275         1,140          377                        81
Sensing                    7,913         1,455          377                        69
Intuition                  7,790         1,594          410                        84
Thinking                   6,879         1,348          362                        71
Feeling                    8,824         1,685          420                        80
Judging                    6,210         1,389          388                        87
Perceiving                 9,493         1,649          396                        69
43. TEXT FEATURES OF BPFS ESSAYS AFTER PORTER AND STOP-WORD FILTERING
(M. KOMISIN, 2012)
Myers-Briggs Preferences   Word Tokens   Unique Words   Word Tokens Per Document   Unique Word Types Per Document
Extraversion               5,631         1,376          217                        53
Introversion               2,834         846            202                        60
Sensing                    4,335         1,067          206                        51
Intuition                  4,130         1,178          217                        62
Thinking                   3,718         1,015          196                        53
Feeling                    4,747         1,224          226                        58
Judging                    3,312         1,030          207                        64
Perceiving                 5,153         1,207          215                        50
44. CLASSIFICATION RESULTS
(M. KOMISIN, 2012)
[Table: summary of results with leave-one-out cross-validation, full sample size (n = 40).]
[Table: summary of results with leave-one-out cross-validation, reduced sample size (n = 30), lowest clarity scores removed.]
45. Research papers compared by data set kind, data set size, features and pre-processing, prediction models, and evaluation metrics:

Y. Wang, 2016
Data set kind: Twitter dataset
Data set size: 1.7 M tweets for 90,000 users, 120 K tweets after preprocessing
Features and pre-processing: n-grams, POS tags, word vectors (average word vectors, weighted average word vectors)
Prediction models: Logistic Regression (10-fold cross-validation), Random Forest, SVM
Evaluation metrics: highest average is 66.1% for combined features

D. Brinks, 2012
Data set kind: Twitter dataset
Data set size: 960 K tweets for 6,000 users
Features and pre-processing: Porter Stemming, Emoticon Substitution, Minimum Token Frequency, Minimum User Frequency, Term Frequency Transform, Inverse Document Frequency Transform
Prediction models: Naïve Bayes, multi-variate event model, confidence metrics, SVM, logistic regression
Evaluation metrics: highest average is 64.5%

B. Plank, 2015
Data set kind: Twitter dataset
Data set size: 1.2 M tweets for 1,500 users
Features and pre-processing: gender, n-grams, count statistics, tweets count, followers, statuses, favorites
Prediction models: logistic regression
Evaluation metrics: highest average is 66.6% (T–F predicted with high reliability, while others are very hard to model)

M. Komisin, 2012
Data set kind: MBTI Test and BPFS Exercise
Data set size: 4800 text
Features and pre-processing: specific word choices, semantic categories words; Porter stemming, stop-words removal, smoothing techniques
Prediction models: Naïve Bayes, SVM, LIWC
Evaluation metrics: highest average 65%
46. RESEARCH GAP
Twitter vs. document.
Language on social media has richer content, which makes linguistic analysis tools perform poorly.
Each tweet is limited to 140 characters and may contain hashtags, at-mentions, URLs, and emoticons.
Due to retweeting, a user’s tweet may not be expressing his or her own thoughts.
The problem of removing stop words.
Collecting personality data is costly.
The MBTI distribution on Twitter, as discussed in the fourth paper.
47. PROPOSED WORK
[Diagram: proposed pipeline, listed from first stage to last.]
Research
Data Collection: Twitter Corpus, Letter Corpus, Text Corpus
Data Cleaning
Data Preprocessing: Snowball Stemmer, Porter Stemmer, Lemmatize, Stop Words, Emoji
Model Selection: N-Gram, POS Tagger, Naïve Bayes
Validation
48. MODEL SELECTION (TEXT CORPUS)
NAÏVE BAYES
Data Set                                                            E / I    T / F     S / N
Cleaned version, Naive Bayes gain function for every two letters:
50 / 20                                                             0.6      0.95      0.525
70 / 30                                                             0.5 ↓    0.96 ↑    0.616 ↑
Cleaned version, stop words, Naive Bayes gain:
50 / 20                                                             0.6      0.975     0.525
70 / 30                                                             0.5 ↓    0.983 ↑   0.57 ↑
Cleaned version, Snowball stemmer, Naive Bayes gain:
50 / 20                                                             0.6      0.975     0.525
70 / 30                                                             0.5 ↓    0.967 ↑   0.583 ↑
49. MODEL SELECTION (LETTER CORPUS)
N-GRAM
1. cleaned version 1-gram first 20%
2. cleaned version 2-gram first 20%
3. cleaned version 3-gram first 20%
4. cleaned version snow stemmer 1-gram first 20%
5. cleaned version snow stemmer 2-gram first 20%
6. cleaned version snow stemmer 3-gram first 20%
7. cleaned version stop words 1-gram first 20%
8. cleaned version stop words 2-gram first 20%
9. cleaned version stop words 3-gram first 20%