At the heart of data analysis, there lies a need to understand the real world entities being represented in the data. Every data set we encounter is an attempt to capture a slice of our complex world and communicate some information about it in a way that has potential to be informative to humans, machines, or both. Moving from basic analyses to advanced analytics requires the ability to imagine multiple ways of conceptualizing the composition of entities and the relationships present in our data. It also requires the realization that different levels of aggregation, disaggregation, and transformation can open up new pathways to understanding our data and identifying the valuable insights it contains.
In this talk, we’ll discuss several ways to think about the composition and representation of our data. We’ll also demonstrate a series of methods that leverage tools like networks, hierarchical aggregations, and unsupervised clustering to visually explore our data, transform it to discover new insights, help frame analytical problems and questions, and even improve machine learning model performance. In exploring these approaches, and with the help of Python libraries such as Pandas, Scikit-Learn, Seaborn, and Yellowbrick, we will provide a practical framework for thinking creatively and visually about your data and unlocking latent value and insights hidden deep beneath its surface.
2. How many of you consider
yourselves data scientists?
3. How many of you spend a
lot of time exploring data?
4. How many of you have a
formal process for
data exploration?
5. ABOUT ME - TONY OJEDA
Founder of District Data Labs
Education and research company
Business & finance background
Self-taught programmer (R & Python)
7. PUT THINGS IN ORDER
Categories, Classifications, Taxonomies, Ontologies.
8. EXPLORATION FRAMEWORK
Identify
Types of
Information
Entities
in
Data
Set
Review
Transformation
Methods
Visualization
Methods
Create
Category
Aggregations
Continuous Bins
Cluster
Categories
Prep
Phase
Insights
Over
Time
Visualization
Filter
+
Aggregate Field
Relationships Entity
RelationshipsExplore
Phase
9. THE DATA: EPA VEHICLE FUEL ECONOMY
http://www.fueleconomy.gov/feg/epadata/vehicles.csv.zip
13. ENTITIES IN OUR DATA SET
Level 4
Level 3
Level 2
Level 1 Year + Model
Year + Model Type
Year + Make
Year
Year + Vehicle Class
Vehicle Class
Model Type
Make
14. ENTITIES IN OUR DATA SET
Level 4
Level 3
Level 2
Level 1 2016 Ford Mustang 2.3L V4
Automatic Rear-wheel Drive
2016 Ford Mustang
2016 Fords
All 2016 Vehicles
2016 Subcompacts
All Subcompacts
Ford Mustangs
All Ford Vehicles
20. CATEGORY AGGREGATIONS
Vehicle
Class
Special
Purpose
Vehicle
2WD
Midsize
Cars
Subcompact
Cars
Compact
Cars
Sport
Utility
Vehicle
-‐ 4WD
Small
Sport
Utility
Vehicle
2WD
Small
Sport
Utility
Vehicle
4WD
Two
Seaters
Small
Station
Wagons
Minicompact Cars
Minivan
-‐ 4WD
+ 23
more
Vehicle
Category
Small
Cars
Midsize
Cars
Large
Cars
Station
Wagons
Pickup
Trucks
Special
Purpose
Sport
Utility
Vans
&
Minivans
21. CATEGORIES FROM CONTINUOUS
Very
Low
Low
Moderate
High
Very
High
Combined MPG à Fuel Efficiency Quintiles
Engine Displacement à Engine Size Quintiles
CO2 Emission à Emission Quintiles
Fuel Cost à Fuel Cost Quintiles
22. CLUSTER CATEGORIES
Takes multiple fields
into consideration
together.
Groups things in
ways you may not
have thought of.
Come up with
descriptive names
for clusters.
Number of clusters?
Looking for relatively
clear boundaries.
Automatically creates
new categories
(saves time).
57. GRAPH ANALYSIS
Relationships between entities.
Attributes entities have in common.
Actions one entity takes involving another.
Changes in relationships over time.
71. EXPLORATION FRAMEWORK
Identify
Types of
Information
Entities
in
Data
Set
Review
Transformation
Methods
Visualization
Methods
Create
Category
Aggregations
Continuous Bins
Cluster
Categories
Prep
Phase
Insights
Over
Time
Visualization
Filter
+ Aggregate Field
Relationships Entity
RelationshipsExplore
Phase
73. NO REALLY… SO MANY!
Significantly more small cars than other types in
2016.
Sport Utility Vehicles are currently next most
popular.
Midsize Cars currently third most popular.
Other vehicle types currently not as popular. Sport Utility Vehicles didn't exist in 1985. Small Cars were even more popular in 1985.
Pickup Trucks were also more popular in 1985.
Special Purpose Vehicles were more popular as
well.
Most vehicles today have very small or
moderate sized engines.
Most vehicles today are very fuel efficient. Few vehicles today have large engines. Even fewer have low fuel efficiency.
BMW currently makes the most vehicle models,
followed by Chevy and Ford.
Pagani and Alfa Romeo make the least number
of vehicle models.
Smaller cars with smaller engines are most fuel
efficient.
Currently no small engine vehicles with low
efficiency.
Even Vans and Station Wagons are relatively
fuel efficient these days.
Each vehicle category has varying engine sizes.
BMW is doubling down on small cars. So is
Porsche.
Ford, Chevrolet, & Nissan are going for breadth.
Jeep and Land Rover are focused solely on
SUVs.
Ram is focused solely on Pickup Trucks
A few companies are focused only on small
cars.
Toyota used to make a lot of small cars, but now
makes less in favor of SUVs and Pickup Trucks.
Overall surge in small efficient engine vehicles
over last 10 years.
Mostly at expense of moderate/large inefficient
engines.
Even Large Cars are relatively fuel efficient these
days.
Linear relationships between engine size and
fuel cost and emmissions.
Exponential relationships between efficiency
and engine size and fuel cost.
Clustering into 4 groups results in relatively
clear boundaries in the data.
Manufacturers with both depth and breadth of
vehicle attributes have most connections.
Manufacturers that specialize are positioned
toward edge of network.
Four distinct communities detected and
connections over time have converged.