2. Superficial Data Analysis
How do stereotypes and appearance influence the way we are perceived?
The answers were found by analyzing a large pool of data collected from a diverse group of people.
3. Let us tell you the story of FaceStat.com, by Brendan O’Connor & Lukas Biewald.
We love this story.
4. How do we perceive AGE, GENDER, INTELLIGENCE, and ATTRACTIVENESS?
What insight can we extract from millions of anonymous opinions?
5. Collect the data
FaceStat runs on an SQL database.
Each user judgment is saved as a (face ID, attribute, judgment) triple.
This lets us explore the relationships between different types of perceived attributes.
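The triple representation described above can be sketched in memory (the slides mention an SQL database; the in-memory layout and all sample values below are invented for illustration):

```python
from collections import defaultdict

# Minimal sketch of the (face ID, attribute, judgment) triples.
# The real site stores these in an SQL table; sample values are invented.
judgments = [
    (101, "age", "25"),
    (101, "age", "27"),
    (101, "gender", "female"),
    (102, "age", "40"),
    (102, "intelligence", "very intelligent"),
]

# Group judgments by face and attribute, so we can explore relationships
# between different types of perceived attributes for the same face.
by_face = defaultdict(lambda: defaultdict(list))
for face_id, attribute, judgment in judgments:
    by_face[face_id][attribute].append(judgment)

print(by_face[101]["age"])  # all age judgments collected for face 101
```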
6. Collect the data
Example question: “How old do I look?”
Look at the age judgments’ values, count how many times each value occurs, and order by that count.
We have 10 million rows of data that can be extracted from the database.
The first number is the frequency count; the second string is the response value.
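The count-and-order step can be sketched in Python (the real query runs over ~10 million rows in SQL; the handful of responses here are invented):

```python
from collections import Counter

# Hypothetical raw responses to "How old do I look?" for illustration.
responses = ["25", "30", "25", "teen", "30", "25", "rn"]

# Count how many times each value occurs and order by that count, descending,
# like SQL's GROUP BY value ORDER BY COUNT(*) DESC.
counts = Counter(responses).most_common()
for value, freq in counts:
    print(freq, value)  # 1st number: frequency count; 2nd string: response value
```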
7. Clean up the data
Problems in the data for “How old do I look?”:
Errors: garbage values such as “rn”
Outliers: rare values
Format of user responses: text instead of numbers
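A minimal cleaning pass along these lines might look like the following sketch; the text-to-number mapping (e.g., “teen” → 15) is an assumption for illustration, not the mapping used on FaceStat:

```python
def parse_age(raw):
    """Try to turn a raw age response into a number; return None for garbage.
    The mapping for text answers like 'teen' is invented for illustration."""
    raw = raw.strip().lower()
    try:
        return float(raw)
    except ValueError:
        pass
    text_ages = {"teen": 15, "twenties": 25}  # hypothetical mapping
    return text_ages.get(raw)  # 'rn' and other noise fall through to None

raw_responses = ["25", "rn", "teen", " 30 "]
cleaned = [a for a in (parse_age(r) for r in raw_responses) if a is not None]
```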
8. Clean up the data
Challenges in preprocessing the data:
Mapping multiple-choice responses to numerical values: “very trustworthy” vs. “not to be trusted”
Aggregating results from multiple people into a single description of a face
Handling missing values
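The mapping-and-aggregation step can be sketched as below; the numeric coding of the trustworthiness choices is an assumption (the slides do not give the actual scale), and missing or unknown responses are simply skipped:

```python
# Hypothetical numeric coding for the multiple-choice trust question.
TRUST_SCALE = {"not to be trusted": 1, "neutral": 2, "very trustworthy": 3}

def aggregate_face(responses):
    """Aggregate many people's responses for one face into a single value,
    skipping missing (None) or unrecognized responses."""
    values = [TRUST_SCALE[r] for r in responses if r in TRUST_SCALE]
    return sum(values) / len(values) if values else None

score = aggregate_face(["very trustworthy", "neutral", None, "very trustworthy"])
```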
12. Further investigation
Distribution of age values: outliers.
Remove the outliers: select rows with age less than 100.
Figure 17-3: Initial histogram of the face age data.
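The outlier-removal rule stated above (keep rows with age below 100) is a one-liner; the sample ages here are invented:

```python
# Invented sample of cleaned age values; 2345 and 150 are outliers.
ages = [22, 25, 19, 2345, 31, 99, 150]

# "Select rows with age less than 100" — drop the implausible values.
kept = [a for a in ages if a < 100]
```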
14. Age, Attractiveness, and Gender
Figure 17-5: Scatterplot of attractiveness versus age, colored by gender (pink: female, blue: male).
15. Age, Attractiveness, and Gender
Figure 17-6: Smoothed scatterplots of attractiveness versus age, one plot per gender.
16. Age, Attractiveness, and Gender
“How does age affect attractiveness?”
• We compute 95% confidence intervals.
• We fit a loess curve to help visualize aggregate patterns in this noisy sequential data.
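The confidence-interval step can be sketched as follows (the loess fit itself is done in R in this analysis and is omitted here; the judgment values are invented, and a normal approximation is assumed):

```python
import math

def mean_ci95(xs):
    """Mean and a normal-approximation 95% confidence interval (± 1.96 · SE).
    Assumes a reasonably large sample; sample variance uses n - 1."""
    n = len(xs)
    m = sum(xs) / n
    var = sum((x - m) ** 2 for x in xs) / (n - 1)
    half = 1.96 * math.sqrt(var / n)
    return m, m - half, m + half

# Invented attractiveness judgments for faces perceived as one age value:
m, lo, hi = mean_ci95([2.8, 3.1, 3.0, 2.9, 3.2])
```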
17. Age, Attractiveness, and Gender
“How does age affect attractiveness?”
Figure 17-7: Smoothed scatterplots of attractiveness versus age, one plot per gender.
18. Age, Attractiveness, and Gender
“How does age affect attractiveness?”
• Women are generally judged as more attractive than men across all ages, except as babies.
• Babies are found to be the most attractive; attractiveness drops until around age 18, rises to a peak around age 27, and then drops again until around age 50.
25. We can apply the same steps to…
Attribute Correlations
26. Or… we can put everything into one big picture.
Attribute Correlations
27. We use the R language to build a Pearson correlation matrix.
Attribute Correlations
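The slides compute the full matrix with R; the statistic for a single pair of attributes can be sketched in Python as below (the per-face attribute values are invented):

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient for one pair of attributes.
    R's cor() applies this to every pair at once to form the matrix."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Invented per-face averages for two attributes:
weight = [120, 140, 160, 180, 200]
dress_size = [4, 6, 10, 8, 14]
r = pearson(weight, dress_size)
```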
31. Women are judged more intelligent than men.
Women are judged more likely to win a dog fight.
Dress size is weakly correlated with weight.
Attribute Correlations
68. CLUSTERING
Definition: grouping together objects that are similar to each other.
Applications:
- Marketing segmentation
- Business
- Healthcare
- Document retrieval
- Etc.
69. K-MEANS CLUSTERING
The k-means algorithm clusters n objects, based on their attributes, into k partitions, where k < n.
71. A simple example showing the implementation of the k-means algorithm (using k=2)
72. Step 1:
Initialization: we randomly choose the following two centroids (k=2) for the two clusters.
In this case the two centroids are m1=(1.0, 1.0) and m2=(5.0, 7.0).
73. Step 2:
Thus, we obtain two clusters containing {1, 2, 3} and {4, 5, 6, 7}.
Their new centroids are computed as the mean of each cluster’s points.
74. Step 3:
Now, using these centroids, we compute the Euclidean distance from each object to each centroid, as shown in the table.
Therefore, the new clusters are {1, 2} and {3, 4, 5, 6, 7}, and the next centroids are m1=(1.25, 1.5) and m2=(3.9, 5.1).
75. Step 4:
The clusters obtained are again {1, 2} and {3, 4, 5, 6, 7}.
Since there is no change in the clusters, the algorithm halts; the final result consists of the 2 clusters {1, 2} and {3, 4, 5, 6, 7}.
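The walkthrough above can be sketched as a small Python implementation. The seven point coordinates did not survive extraction from the slides, so the ones below are reconstructed to be consistent with the centroids the slides report (they reproduce m1=(1.25, 1.5) and m2=(3.9, 5.1) exactly); treat them as an assumption:

```python
import math

# Reconstructed data points 1..7, chosen to match the slides' reported
# centroids; the original table is not in the extracted text.
points = [(1.0, 1.0), (1.5, 2.0), (3.0, 4.0), (5.0, 7.0),
          (3.5, 5.0), (4.5, 5.0), (3.5, 4.5)]

def kmeans(points, centroids, max_iter=100):
    """Plain k-means with Euclidean distance.
    Ties go to the first centroid; assumes no cluster ever becomes empty."""
    for _ in range(max_iter):
        clusters = [[] for _ in centroids]
        for p in points:
            dists = [math.dist(p, c) for c in centroids]
            clusters[dists.index(min(dists))].append(p)
        # New centroid = mean of each cluster's points.
        new = [tuple(sum(coord) / len(cl) for coord in zip(*cl))
               for cl in clusters]
        if new == centroids:  # no change: the algorithm halts
            break
        centroids = new
    return clusters, centroids

# Start from the slides' initial centroids m1=(1.0, 1.0), m2=(5.0, 7.0).
clusters, centroids = kmeans(points, [(1.0, 1.0), (5.0, 7.0)])
```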