Slides for talk delivered at the Python Pune meetup on 31st Jan 2014.
Categorical data is a huge problem many data scientists face. This talk is about how to tame it.
1. 1
Categorical Data Analysis in Python
By
Jaidev Deshpande
Data Scientist, DataCulture Analytics
twitter.com/jaidevd
2. 2
Problem: Who's likely to attend the next meetup?
● Who comes often?
● Men / Women?
● Where do you live? How far from the venue?
● Proficiency with Python (Beginner / Intermediate / Advanced)?
● Area of interest?
3. 3
Something like this:
             Features
Attendees    Attendance (%)   Gender   Pincode   Proficiency in Python   Interest            ...
attendee_1   80               M        411013    Intermediate            Web                 ...
attendee_2   30               F        411040    Advanced                Test / Automation   ...
attendee_3   55               M        411001    Beginner                Scientific          ...
...          ...              ...      ...       ...                     ...                 ...
● 1. Numerical features – continuous and quantitative
● 2. Categorical features – discrete and qualitative
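As a rough illustration of the numerical / categorical split, the three example rows can be loaded into pandas. This is only a sketch: the column names are made up for the example, the pincode is stored as a string on the assumption that it is a label rather than a quantity, and the proficiency labels are normalised to the Beginner / Intermediate / Advanced levels from the problem slide.

```python
import pandas as pd

# The three example attendees from the table (column names are illustrative).
df = pd.DataFrame(
    {
        "attendance": [80, 30, 55],
        "gender": ["M", "F", "M"],
        "pincode": ["411013", "411040", "411001"],  # a label, not a quantity
        "proficiency": ["Intermediate", "Advanced", "Beginner"],
        "interest": ["Web", "Test / Automation", "Scientific"],
    },
    index=["attendee_1", "attendee_2", "attendee_3"],
)

# Proficiency has a natural order, so declare it as an ordered categorical.
proficiency_dtype = pd.CategoricalDtype(
    categories=["Beginner", "Intermediate", "Advanced"], ordered=True
)

# Everything except attendance is treated as categorical.
df = df.astype(
    {
        "gender": "category",
        "pincode": "category",
        "proficiency": proficiency_dtype,
        "interest": "category",
    }
)

numerical_cols = df.select_dtypes(include="number").columns.tolist()
categorical_cols = df.select_dtypes(include="category").columns.tolist()

print(numerical_cols)
print(categorical_cols)
```

Declaring the order lets comparisons such as `df["proficiency"] > "Beginner"` work, while arithmetic on the column still raises an error.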
4. 4
Common Numerical Operations on Data
● Obviously – add, subtract, multiply, divide
● Statistical moments
● Operations in vector spaces
  – Distance measures
  – Slicing
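These operations are one-liners on numerical data. A minimal NumPy sketch follows, using the attendance percentages from the example table; the two-feature vectors `u` and `v` (attendance, distance from the venue in km) are hypothetical values invented for the illustration.

```python
import numpy as np

# Attendance percentages of the three example attendees.
attendance = np.array([80.0, 30.0, 55.0])

# Arithmetic is element-wise.
as_fraction = attendance / 100.0

# Statistical moments.
mean = attendance.mean()       # first moment
variance = attendance.var()    # second central moment (population variance)
std = attendance.std()

# Distance in a vector space: Euclidean distance between two feature vectors.
# Hypothetical vectors: (attendance %, distance from venue in km).
u = np.array([80.0, 2.0])
v = np.array([77.0, 6.0])
distance = np.linalg.norm(u - v)

# Slicing by position and by condition.
first_two = attendance[:2]
regulars = attendance[attendance > 50]

print(mean, variance, std, distance)
```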
5. 5
Comparison of Operations
Numerical Data
Add, subtract, multiply, divide
Mean, Variance, Standard Deviation
Vector Spaces – the very idea of 'measuring'
Categorical Data (Strings, etc.)
What's the product of two strings?
The average pincode of two areas?
&%%#&$$*&!!!!
At least get some numbers!
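One common way to "get some numbers" out of a categorical feature is to count its levels or to one-hot encode them. The sketch below assumes pandas and uses the proficiency labels of the three example attendees (normalised to the levels from the problem slide); it is one possible reading of the slide, not necessarily the approach taken in the talk.

```python
import pandas as pd

# Proficiency of the three example attendees.
proficiency = pd.Series(
    ["Intermediate", "Advanced", "Beginner"],
    index=["attendee_1", "attendee_2", "attendee_3"],
    name="proficiency",
)

# Frequencies: how many attendees fall in each level.
counts = proficiency.value_counts()

# One-hot (indicator) encoding: one 0/1 column per level.
one_hot = pd.get_dummies(proficiency, dtype=int)

print(counts)
print(one_hot)
```

Unlike averaging pincodes, sums and means of indicator columns are meaningful: a column mean is the share of attendees at that level.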
10. 10
Correspondence Analysis
● How are proficiencies related w.r.t. gender? (Row profiles)
● How are genders related w.r.t. proficiency? (Column profiles)
  – Cosine similarity
  – Correlation / Covariance
● How are they interrelated?
  – Weighted chi-squared distance
● Can the dimensionality be reduced?
  – Singular value decomposition / PCA
  – sklearn.decomposition.PCA
  – sklearn.decomposition.TruncatedSVD
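Row profiles, column profiles and the weighted chi-squared distance can be sketched in a few lines of NumPy. No proficiency-by-gender counts are given in the slides, so this example borrows the proficiency-by-interest counts from the sample problem slide; the helper names `cosine_similarity` and `chi2_distance` are made up for the illustration.

```python
import numpy as np

# Proficiency (rows) by interest (columns) counts from the sample problem.
rows = ["advanced", "beginner", "intermediate"]
cols = ["automation", "scientific", "web"]
N = np.array(
    [[8, 1, 7],
     [13, 9, 35],
     [7, 1, 19]],
    dtype=float,
)

# Row profiles: each row divided by its total, so every row sums to 1.
row_profiles = N / N.sum(axis=1, keepdims=True)
# Column profiles: each column divided by its total, so every column sums to 1.
col_profiles = N / N.sum(axis=0, keepdims=True)

# Column masses: share of all observations that fall in each column.
col_masses = N.sum(axis=0) / N.sum()


def cosine_similarity(u, v):
    """Cosine of the angle between two profile vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))


def chi2_distance(p, q, masses):
    """Chi-squared distance between two row profiles, weighted by column masses."""
    return float(np.sqrt(np.sum((p - q) ** 2 / masses)))


for i in range(len(rows)):
    for j in range(i + 1, len(rows)):
        print(
            rows[i], rows[j],
            cosine_similarity(row_profiles[i], row_profiles[j]),
            chi2_distance(row_profiles[i], row_profiles[j], col_masses),
        )
```

Dividing by the column masses gives rare categories (here "scientific") more weight than a plain Euclidean distance would.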
11. 11
Sample Problem
● Consider the proficiency and interest features from the original problem
● Fake data with 100 observations
● Contingency matrix:
               automation   scientific   web
advanced       8            1            7
beginner       13           9            35
intermediate   7            1            19
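A minimal sketch of correspondence analysis on this contingency matrix follows. It uses the textbook formulation (SVD of the standardized residuals) with plain `numpy.linalg.svd` rather than the scikit-learn classes, so it should be read as one way to do the computation, not necessarily the code used in the talk.

```python
import numpy as np

# Contingency matrix: proficiency (rows) by interest (columns).
N = np.array(
    [[8, 1, 7],
     [13, 9, 35],
     [7, 1, 19]],
    dtype=float,
)

P = N / N.sum()        # correspondence matrix (relative frequencies)
r = P.sum(axis=1)      # row masses
c = P.sum(axis=0)      # column masses

# Standardized residuals from the independence model r * c^T.
expected = np.outer(r, c)
S = (P - expected) / np.sqrt(expected)

# Singular value decomposition of the residual matrix.
U, sigma, Vt = np.linalg.svd(S, full_matrices=False)

# Principal inertias; their sum is the chi-squared statistic divided by n.
inertia = sigma ** 2
total_inertia = inertia.sum()
explained = inertia / total_inertia

# Principal coordinates of the rows and columns.
row_coords = (U * sigma) / np.sqrt(r)[:, None]
col_coords = (Vt.T * sigma) / np.sqrt(c)[:, None]

# A 3 x 3 table has at most two non-trivial dimensions, so keep the first two.
print(explained[:2])
print(row_coords[:, :2])
print(col_coords[:, :2])
```

Plotting the first two columns of `row_coords` and `col_coords` on the same axes gives the usual correspondence-analysis map.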