5. Exploratory Data Analysis (EDA) is an approach to analysing data sets to summarise their main characteristics, often with visual methods.
A statistical model may or may not be used; primarily, EDA is for seeing what the data can tell us beyond formal modelling or hypothesis testing.
7. Use Case - Online Learning Platform
Entities: User, Area, Vendor, Course, Course Taken
Areas: Cloud (25%), Data Science (50%), Web (15%), Software Engineering (10%)
Vendors: Software Mind (20%), Cloud Solutions (3%), InfraNet (12%), DataLearn (7%), WWW Way (11%), Soft Skills (4%), Edu Zen (10%), Data Foundation (25%), Learning Island (5%), Design Your Way (3%)
Years: 2014, 2015, 2016
Prices: $10 (25%), $19 (15%), $49 (20%), $99 (20%), $250 (15%), $500 (5%)
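The slides never show how courses.aggregate is built. A minimal sketch, assuming each column is drawn independently from the percentages above, could look like this. The seed, the uniform spread over years and months, and the row count (5000, the total of the area counts on slide 10) are assumptions, the Name column is omitted, and independent sampling will not reproduce the exact counts shown on later slides.

```r
# Hypothetical reconstruction of the data set from the use-case percentages.
set.seed(42)   # assumption: any seed will do
n <- 5000      # assumption: total of the area counts shown on slide 10

areas   <- c("Cloud", "Data Science", "Web", "Software Engineering")
vendors <- c("Software Mind", "Cloud Solutions", "InfraNet", "DataLearn",
             "WWW Way", "Soft Skills", "Edu Zen", "Data Foundation",
             "Learning Island", "Design Your Way")
prices  <- c(10, 19, 49, 99, 250, 500)

courses.aggregate <- data.frame(
  area   = sample(areas, n, replace = TRUE,
                  prob = c(0.25, 0.50, 0.15, 0.10)),
  vendor = sample(vendors, n, replace = TRUE,
                  prob = c(0.20, 0.03, 0.12, 0.07, 0.11,
                           0.04, 0.10, 0.25, 0.05, 0.03)),
  year   = sample(2014:2016, n, replace = TRUE),  # assumption: uniform
  month  = sample(1:12, n, replace = TRUE),       # assumption: uniform
  price  = sample(prices, n, replace = TRUE,
                  prob = c(0.25, 0.15, 0.20, 0.20, 0.15, 0.05))
)

# First look at the structure of the simulated data
str(courses.aggregate)
```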
8. courses.aggregate
Name                 Area                  Vendor           Year  Month  Price [$]
Perez, Lisa          Data Science          Data Foundation  2015  7      99
Tran, Janiro         Software Engineering  DataLearn        2016  2      10
Bajwa, John          Cloud                 InfraNet         2015  9      250
Lindsey, Aaron       Web                   Software Mind    2014  6      19
Cooper, Duncan       Software Engineering  Learning Island  2014  7      250
Grumbach, Alexander  Web                   Design Your Way  2015  2      99
10. Categorical data - count occurrences
# Count occurrences
courses.areas <- table(courses.aggregate$area)

Cloud  Data Science  Software Engineering  Web
693    2271          462                   1574
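Raw counts are often easier to compare as shares. The sketch below rebuilds the counts printed on this slide as a named vector (standing in for the table object, which is an assumption made to keep the example self-contained) and converts them to percentages with prop.table.

```r
# Counts as printed on the slide, stored as a named vector
courses.areas <- c(Cloud = 693, `Data Science` = 2271,
                   `Software Engineering` = 462, Web = 1574)

# Share of each area in percent, rounded to one decimal place
shares <- round(prop.table(courses.areas) * 100, 1)
print(shares)

# Areas ordered from most to least popular
print(sort(courses.areas, decreasing = TRUE))
```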
11. Bar plot – Number of courses taken by Area
# Draw the plot
barplot(courses.areas,
ylab="Count",
main="Areas")
13. Stacked Bar plot – Areas by Vendors
# Draw the plot
barplot(vendor.area, ylab="Count",
main="Areas by Vendor",
col=rainbow(4))
legend("topright", fill=rainbow(4),
legend=row.names(vendor.area))
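The object vendor.area is not defined anywhere in these slides. Since the legend uses its row names with four colours, one plausible definition is a two-way table with areas as rows and vendors as columns. The sketch below assumes exactly that and uses only the six sample rows from slide 8 (the name courses.sample is hypothetical).

```r
# Six sample rows from slide 8 (area and vendor columns only)
courses.sample <- data.frame(
  area   = c("Data Science", "Software Engineering", "Cloud",
             "Web", "Software Engineering", "Web"),
  vendor = c("Data Foundation", "DataLearn", "InfraNet",
             "Software Mind", "Learning Island", "Design Your Way")
)

# Assumed definition: rows = areas, columns = vendors
vendor.area <- table(courses.sample$area, courses.sample$vendor)
print(vendor.area)

# Each bar is a vendor, stacked by area
barplot(vendor.area, ylab = "Count", main = "Areas by Vendor",
        col = rainbow(4))
legend("topright", fill = rainbow(4), legend = row.names(vendor.area))
```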
14. Grouped (Beside) Bar plot – Areas by Year
# Count occurrences
areas.year <- table(data.frame(
courses.aggregate$area,
courses.aggregate$year))
# Draw the plot
barplot(areas.year, ylab="Count",
main="Areas By Year",
col=rainbow(4), beside=TRUE)
legend("topleft", fill=rainbow(4),
legend=row.names(areas.year))
15. Stacked Bar plot – Areas by Year
# Draw the plot
barplot(areas.year, ylab="Count",
main="Areas by year",
col=rainbow(4))
legend("topright",
legend=row.names(areas.year),
fill=rainbow(4))
16. 100% Stacked Bar plot – Areas by Year
# Draw the plot
barplot(prop.table(areas.year, 2)*100,
col=rainbow(4), ylab="%",
main="Areas by Year")
legend("topright",
legend=row.names(areas.year),
fill=rainbow(4))
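The second argument of prop.table selects the margin: 2 normalises within each column, so every year adds up to 100%. A small check on the six sample rows from slide 8 (courses.sample and areas.pct are hypothetical names) illustrates this.

```r
# Area and year of the six sample rows from slide 8
courses.sample <- data.frame(
  area = c("Data Science", "Software Engineering", "Cloud",
           "Web", "Software Engineering", "Web"),
  year = c(2015, 2016, 2015, 2014, 2014, 2015)
)

areas.year <- table(courses.sample$area, courses.sample$year)

# margin = 2: each column (year) is scaled to sum to 100
areas.pct <- prop.table(areas.year, 2) * 100
print(round(areas.pct, 1))
print(colSums(areas.pct))
```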
17. Pie chart – Areas
# Areas occurrences
per_labels <- round(
courses.areas/sum(courses.areas) * 100, 1)
per_labels <- paste(per_labels, "%", sep="")
# Draw the plot
pie(courses.areas,
col=rainbow(4),
labels=per_labels)
legend("topleft", fill=rainbow(4),
legend=names(courses.areas))
20. Bar plot – Revenue per year
# Draw the plot
barplot(revenue.year$price,
names.arg =
revenue.year$year,
ylab="Count [$]",
main="Revenue per year")
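The data frame revenue.year is used here without being defined. By analogy with the aggregate calls on later slides, it was presumably the sum of price per year. A sketch under that assumption, using the six sample rows from slide 8 (courses.sample is a hypothetical name):

```r
# Year and price of the six sample rows from slide 8
courses.sample <- data.frame(
  year  = c(2015, 2016, 2015, 2014, 2014, 2015),
  price = c(99, 10, 250, 19, 250, 99)
)

# Assumed definition: total price per year
revenue.year <- aggregate(price ~ year, data = courses.sample, sum)
print(revenue.year)
```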
21. Revenue by Year and Area - prepare data
# Prepare data
library(reshape)
revenue.year.area <- aggregate(
price ~ year + area,
data=courses.aggregate, sum)
rya <- t(cast(revenue.year.area,
year ~ area, value="price"))
                      2014    2015   2016
Cloud                 127474  17873  16819
Data Science          65639   73645  74289
Software Engineering  8342    9976   11781
Web                   52556   57508  77308
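If the reshape package is not available, base R's xtabs can produce the same area-by-year layout directly from the long format. This is an alternative to the cast call, not what the slide used. The sketch rebuilds the long table from the figures printed above, with area names written as on slide 7.

```r
# Long-format revenue table rebuilt from the printed figures
revenue.year.area <- data.frame(
  year  = rep(2014:2016, times = 4),
  area  = rep(c("Cloud", "Data Science", "Software Engineering", "Web"),
              each = 3),
  price = c(127474, 17873, 16819,
            65639, 73645, 74289,
            8342, 9976, 11781,
            52556, 57508, 77308)
)

# Rows = areas, columns = years, cells = summed price
rya <- xtabs(price ~ area + year, data = revenue.year.area)
print(rya)
```

The result is a table with a matrix layout, so it can be passed to barplot in the same way as the transposed cast result.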
22. Stacked Bar plot – Revenue by Year and Area
# Draw the plot
barplot(rya, col=rainbow(4),
ylab="Count [$]",
main="Revenue by Year & Area")
legend("topright", fill=rainbow(4),
legend=row.names(rya))
23. Grouped (Beside) Bar plot – Areas Revenue by Year
# Draw the plot
barplot(rya, col=rainbow(4),
ylab="Count [$]",
main="Revenue by Year & Area",
beside=TRUE)
legend("topright", fill=rainbow(4),
legend=row.names(rya))
25. Histogram – Course Prices
# Draw the plot
hist(courses.aggregate$price,
main="Distribution of prices",
xlab="Course price",
breaks=20,
col=heat.colors(20))
26. Histogram – Revenue per month
# Prepare the data
revenue.year.month <-
aggregate(price ~ year + month,
data=courses.aggregate, sum)
# Draw the plot
hist(revenue.year.month$price,
main="Distribution of revenue per month",
xlab="Revenue per month",
breaks=20,
col=heat.colors(20))
27. Density – Revenue per month
# Probability density
hist(revenue.year.month$price,
main="Distribution of revenue per month",
xlab="Revenue per month", breaks=20,
col=heat.colors(20), prob=TRUE)
lines(density(revenue.year.month$price))
34. Stacked Bar chart – base vs. lattice
barplot(rya, col=rainbow(4),
ylab="Count [$]",
main="Revenue by Year & Area")
legend("topright", fill=rainbow(4),
legend=row.names(rya))
barchart(Cloud + `Data Science` +
`Software Engineering` + Web ~ year,
data=t(rya), auto.key=TRUE,
stack=TRUE, horizontal=FALSE,
ylab="Count [$]", main="Areas by Year")
35. Stacked Bar chart – base vs. ggplot2
barplot(rya, col=rainbow(4),
ylab="Count [$]",
main="Revenue by Year & Area")
legend("topright", fill=rainbow(4),
legend=row.names(rya))
ggplot(revenue.year.area,
aes(x = year, y=price, fill = area)) +
geom_bar(stat = "identity") +
ggtitle("Revenue by Year & Area") +
ylab("Count [$]")
36. Histogram – base vs. lattice
hist(revenue.year.month$price,
main="Distribution of revenue per month",
xlab="Revenue per month",
breaks=20,
col=heat.colors(20))
histogram(~price, data=revenue.year.month,
main="Distribution of revenue per month",
xlab="Revenue per month",
breaks = 20, type = "count",
col=heat.colors(20))
37. Histogram – base vs. ggplot2
hist(revenue.year.month$price,
main="Distribution of revenue per month",
xlab="Revenue per month",
breaks=20,
col=heat.colors(20))
ggplot(revenue.year.month, aes(x = price)) +
geom_histogram(stat = "bin",
binwidth=2500, aes(fill=..count..)) +
ggtitle("Distribution of revenue per month") +
xlab("Revenue per month")
38. Box plot – base vs. lattice
boxplot(price~year,
data=revenue.year.month,
col=2:4,
main="Revenue by Year",
xlab="Year", ylab="Revenue")
# lattice version
bwplot(price~factor(year),
data=revenue.year.month,
main="Revenue by Year",
xlab="Year", ylab="Revenue")
39. Box plot – base vs. ggplot
boxplot(price~year,
data=revenue.year.month,
col=2:4,
main="Revenue by Year",
xlab="Year", ylab="Revenue")
ggplot(revenue.year.month,
aes(x=factor(year), y=price)) +
geom_boxplot(aes(fill=factor(year))) +
ggtitle("Total by Year") +
ylab("Revenue") +
xlab("Year")
40. Scatter plot – base vs. lattice
plot(price~units, data=revenue.month.area,
xlab="Units", ylab="Revenue [$]",
col=area,
main="Revenue by Units (All years)")
# The legend has to be created manually
xyplot(price~units, data=revenue.month.area,
xlab="Units", ylab="Revenue [$]",
pch=19,
groups = area,
auto.key = TRUE)
41. Scatter plot – base vs. ggplot2
plot(price~units, data=revenue.month.area,
xlab="Units", ylab="Revenue [$]",
col=area,
main="Revenue by Units (All years)")
# The legend has to be created manually
ggplot(revenue.month.area,
aes(x=units, y=price)) +
geom_point(aes(col=area)) +
ggtitle("Revenue by Units (All years)") +
ylab("Revenue [$]") + xlab("Units")
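The scatter plots rely on revenue.month.area with a units column, neither of which is defined in these slides. One way to obtain such a table, assuming one row per course taken, is to add a constant units column and sum both price and units per year, month and area. The sketch uses the six sample rows from slide 8 (courses.sample is a hypothetical name).

```r
# Six sample rows from slide 8
courses.sample <- data.frame(
  area  = c("Data Science", "Software Engineering", "Cloud",
            "Web", "Software Engineering", "Web"),
  year  = c(2015, 2016, 2015, 2014, 2014, 2015),
  month = c(7, 2, 9, 6, 7, 2),
  price = c(99, 10, 250, 19, 250, 99)
)

# Assumption: each row is one course taken, so it counts as one unit
courses.sample$units <- 1

# Revenue and units sold per year, month and area
revenue.month.area <- aggregate(cbind(price, units) ~ year + month + area,
                                data = courses.sample, sum)

# Base plot(col=area) needs a factor, not a character column
revenue.month.area$area <- factor(revenue.month.area$area)
print(revenue.month.area)
```

With only six rows every group contains a single course; on the full data set units would grow with the number of courses sold in each month and area.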