2. Superficial Data Analysis
How do stereotypes and appearance influence the way we are perceived?
The answers were found by analyzing a large pool of data collected from a diverse group of people.
3. Let us tell you the story of FaceStat.com, by Brendan O’Connor & Lukas Biewald.
We love this story.
4. How do we perceive AGE, GENDER, INTELLIGENCE, and ATTRACTIVENESS?
What insight can we extract from millions of anonymous opinions?
5. Collect the data
FaceStat runs on an SQL database.
Each user judgment is saved as a (face ID, attribute, judgment) triple.
This lets us explore the relationships between different types of perceived attributes.
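The triple representation described above can be sketched in memory (the slides mention an SQL database; the in-memory layout and all sample values below are invented for illustration):

```python
from collections import defaultdict

# Minimal sketch of the (face ID, attribute, judgment) triples.
# The real site stores these in an SQL table; sample values are invented.
judgments = [
    (101, "age", "25"),
    (101, "age", "27"),
    (101, "gender", "female"),
    (102, "age", "40"),
    (102, "intelligence", "very intelligent"),
]

# Group judgments by face and attribute, so we can explore relationships
# between different types of perceived attributes for the same face.
by_face = defaultdict(lambda: defaultdict(list))
for face_id, attribute, judgment in judgments:
    by_face[face_id][attribute].append(judgment)

print(by_face[101]["age"])  # all age judgments collected for face 101
```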
6. Collect the data
Example question: “How old do I look?”
Look at the age judgments’ values, count how many times each value occurs, and order by that count.
We have 10 million rows of data that can be extracted from the database.
The first number is the frequency count; the second string is the response value.
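The count-and-order step can be sketched in Python (the real query runs over ~10 million rows in SQL; the handful of responses here are invented):

```python
from collections import Counter

# Hypothetical raw responses to "How old do I look?" for illustration.
responses = ["25", "30", "25", "teen", "30", "25", "rn"]

# Count how many times each value occurs and order by that count, descending,
# like SQL's GROUP BY value ORDER BY COUNT(*) DESC.
counts = Counter(responses).most_common()
for value, freq in counts:
    print(freq, value)  # 1st number: frequency count; 2nd string: response value
```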
7. Clean up the data
Problems in the data for “How old do I look?”:
Errors: garbage values such as “rn”
Outliers: rare values
Format of user responses: text instead of numbers
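A minimal cleaning pass along these lines might look like the following sketch; the text-to-number mapping (e.g., “teen” → 15) is an assumption for illustration, not the mapping used on FaceStat:

```python
def parse_age(raw):
    """Try to turn a raw age response into a number; return None for garbage.
    The mapping for text answers like 'teen' is invented for illustration."""
    raw = raw.strip().lower()
    try:
        return float(raw)
    except ValueError:
        pass
    text_ages = {"teen": 15, "twenties": 25}  # hypothetical mapping
    return text_ages.get(raw)  # 'rn' and other noise fall through to None

raw_responses = ["25", "rn", "teen", " 30 "]
cleaned = [a for a in (parse_age(r) for r in raw_responses) if a is not None]
```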
8. Clean up the data
Challenges in preprocessing the data:
Mapping multiple-choice responses to numerical values: “very trustworthy” vs. “not to be trusted”
Aggregating results from multiple people into a single description of a face
Handling missing values
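The mapping-and-aggregation step can be sketched as below; the numeric coding of the trustworthiness choices is an assumption (the slides do not give the actual scale), and missing or unknown responses are simply skipped:

```python
# Hypothetical numeric coding for the multiple-choice trust question.
TRUST_SCALE = {"not to be trusted": 1, "neutral": 2, "very trustworthy": 3}

def aggregate_face(responses):
    """Aggregate many people's responses for one face into a single value,
    skipping missing (None) or unrecognized responses."""
    values = [TRUST_SCALE[r] for r in responses if r in TRUST_SCALE]
    return sum(values) / len(values) if values else None

score = aggregate_face(["very trustworthy", "neutral", None, "very trustworthy"])
```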
12. Further investigation
Distribution of age values: outliers.
Remove the outliers: select rows with age less than 100.
Figure 17-3: Initial histogram of the face age data.
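The outlier-removal rule stated above (keep rows with age below 100) is a one-liner; the sample ages here are invented:

```python
# Invented sample of cleaned age values; 2345 and 150 are outliers.
ages = [22, 25, 19, 2345, 31, 99, 150]

# "Select rows with age less than 100" — drop the implausible values.
kept = [a for a in ages if a < 100]
```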
14. Age, Attractiveness, and Gender
Figure 17-5: Scatterplot of attractiveness versus age, colored by gender (pink: female, blue: male).
15. Age, Attractiveness, and Gender
Figure 17-6: Smoothed scatterplots of attractiveness versus age, one plot per gender.
16. Age, Attractiveness, and Gender
“How does age affect attractiveness?”
• We compute 95% confidence intervals.
• We fit a loess curve to help visualize aggregate patterns in this noisy sequential data.
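The confidence-interval step can be sketched as follows (the loess fit itself is done in R in this analysis and is omitted here; the judgment values are invented, and a normal approximation is assumed):

```python
import math

def mean_ci95(xs):
    """Mean and a normal-approximation 95% confidence interval (± 1.96 · SE).
    Assumes a reasonably large sample; sample variance uses n - 1."""
    n = len(xs)
    m = sum(xs) / n
    var = sum((x - m) ** 2 for x in xs) / (n - 1)
    half = 1.96 * math.sqrt(var / n)
    return m, m - half, m + half

# Invented attractiveness judgments for faces perceived as one age value:
m, lo, hi = mean_ci95([2.8, 3.1, 3.0, 2.9, 3.2])
```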
17. Age, Attractiveness, and Gender
“How does age affect attractiveness?”
Figure 17-7: Smoothed scatterplots of attractiveness versus age, one plot per gender.
18. Age, Attractiveness, and Gender
“How does age affect attractiveness?”
• Women are generally judged as more attractive than men across all ages, except as babies.
• Babies are found to be the most attractive; attractiveness drops until around age 18, rises to a peak around age 27, and then drops again until around age 50.
25. We can apply the same steps to…
Attribute Correlations
26. Or… we can put everything into one big picture.
Attribute Correlations
27. We use the R language to build a Pearson correlation matrix.
Attribute Correlations
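The slides compute the full matrix with R; the statistic for a single pair of attributes can be sketched in Python as below (the per-face attribute values are invented):

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient for one pair of attributes.
    R's cor() applies this to every pair at once to form the matrix."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Invented per-face averages for two attributes:
weight = [120, 140, 160, 180, 200]
dress_size = [4, 6, 10, 8, 14]
r = pearson(weight, dress_size)
```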
31. Women are judged more intelligent than men.
Women are judged more likely to win a dog fight.
Dress size is weakly correlated with weight.
Attribute Correlations
68. CLUSTERING
Definition: grouping together objects that are similar to each other.
Applications:
- Marketing segmentation
- Business
- Healthcare
- Document retrieval
- Etc.
69. K-MEANS CLUSTERING
The k-means algorithm clusters n objects, based on their attributes, into k partitions, where k < n.
71. A simple example showing the implementation of the k-means algorithm (using k=2)
72. Step 1:
Initialization: we randomly choose the following two centroids (k=2) for the two clusters.
In this case the two centroids are m1=(1.0, 1.0) and m2=(5.0, 7.0).
73. Step 2:
Thus, we obtain two clusters containing {1, 2, 3} and {4, 5, 6, 7}.
Their new centroids are computed as the mean of each cluster’s points.
74. Step 3:
Now, using these centroids, we compute the Euclidean distance from each object to each centroid, as shown in the table.
Therefore, the new clusters are {1, 2} and {3, 4, 5, 6, 7}, and the next centroids are m1=(1.25, 1.5) and m2=(3.9, 5.1).
75. Step 4:
The clusters obtained are again {1, 2} and {3, 4, 5, 6, 7}.
Since there is no change in the clusters, the algorithm halts; the final result consists of the 2 clusters {1, 2} and {3, 4, 5, 6, 7}.
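The walkthrough above can be sketched as a small Python implementation. The seven point coordinates did not survive extraction from the slides, so the ones below are reconstructed to be consistent with the centroids the slides report (they reproduce m1=(1.25, 1.5) and m2=(3.9, 5.1) exactly); treat them as an assumption:

```python
import math

# Reconstructed data points 1..7, chosen to match the slides' reported
# centroids; the original table is not in the extracted text.
points = [(1.0, 1.0), (1.5, 2.0), (3.0, 4.0), (5.0, 7.0),
          (3.5, 5.0), (4.5, 5.0), (3.5, 4.5)]

def kmeans(points, centroids, max_iter=100):
    """Plain k-means with Euclidean distance.
    Ties go to the first centroid; assumes no cluster ever becomes empty."""
    for _ in range(max_iter):
        clusters = [[] for _ in centroids]
        for p in points:
            dists = [math.dist(p, c) for c in centroids]
            clusters[dists.index(min(dists))].append(p)
        # New centroid = mean of each cluster's points.
        new = [tuple(sum(coord) / len(cl) for coord in zip(*cl))
               for cl in clusters]
        if new == centroids:  # no change: the algorithm halts
            break
        centroids = new
    return clusters, centroids

# Start from the slides' initial centroids m1=(1.0, 1.0), m2=(5.0, 7.0).
clusters, centroids = kmeans(points, [(1.0, 1.0), (5.0, 7.0)])
```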