r for data science 2. grammar of graphics (ggplot2) clean -ref

Grammar of graphics
Outline
• Motivating example
  • Example research question
  • mpg data (miles per gallon: distance driven per unit of fuel)
  • ggplot example
• Math review
  • Function mapping
  • Dimension & coordinate system
• Grammar of graphics
  • Aesthetic mapping
  • Facet
  • Geometric object
  • Statistical transformations
  • Position adjustments
  • Coordinate systems
  • The layered grammar of graphics

Understand: data exploration
Transform & Visualize & Model

First example
Engine size vs. fuel usage
• Research question
  • Do cars with big engines use more fuel than cars with small engines?

Built-in data: mpg data
Cf. Miles per gallon (MPG): how many miles a car travels on one gallon of fuel
mpg
#> # A tibble: 234 x 11
#>   manufacturer model displ  year   cyl trans  drv     cty   hwy fl    class
#>   <chr>        <chr> <dbl> <int> <int> <chr>  <chr> <int> <int> <chr> <chr>
#> 1 audi         a4      1.8  1999     4 auto(… f        18    29 p     comp…
#> 2 audi         a4      1.8  1999     4 manua… f        21    29 p     comp…
#> 3 audi         a4      2    2008     4 manua… f        20    31 p     comp…
#> 4 audi         a4      2    2008     4 auto(… f        21    30 p     comp…
#> 5 audi         a4      2.8  1999     6 auto(… f        16    26 p     comp…
#> 6 audi         a4      2.8  1999     6 manua… f        18    26 p     comp…
#> # ... with 228 more rows
help(mpg)

mpg {ggplot2} R Documentation
Fuel economy data from 1999 and 2008 for 38 popular models of car
Description
This dataset contains a subset of the fuel economy data that the EPA makes available on http://fueleconomy.gov. It contains only models which
had a new release every year between 1999 and 2008 - this was used as a proxy for the popularity of the car.
Usage
mpg
Format
A data frame with 234 rows and 11 variables
manufacturer
model: model name
displ: engine displacement, in litres
year: year of manufacture
cyl: number of cylinders
trans: type of transmission
drv: f = front-wheel drive, r = rear wheel drive, 4 = 4wd
cty: city miles per gallon
hwy: highway miles per gallon
fl: fuel type
class: "type" of car

Let's first plot in a 2-D plane: x- and y-axes
x = displ, y = hwy
library(ggplot2)
ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy))

Template
ggplot(data = <DATA>) +
  <GEOM_FUNCTION>(mapping = aes(<MAPPINGS>))
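As a rough sketch of how the template is filled in, the block below substitutes one value for each placeholder. The choice of cty for the y-axis and the object name p_city are assumptions made only for this illustration.

```r
library(ggplot2)

# Template slots filled in:
#   <DATA>          -> mpg
#   <GEOM_FUNCTION> -> geom_point
#   <MAPPINGS>      -> x = displ, y = cty (cty is chosen only for illustration)
p_city <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = cty))

# Printing the object draws the plot
p_city
```

Storing the plot in an object is optional; typing the ggplot() call directly at the console draws it as well.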

Function mapping
Dimension
Coordinate system

Function
f: X → Y
f maps each element of X to exactly one element of Y
https://en.wikipedia.org/wiki/Function_(mathematics)

Dimension
Coordinate system
https://en.wikipedia.org/wiki/Dimension

Aesthetic mapping

ggplot2
• ggplot2 is based on the grammar of graphics, the idea that you can build every graph from the same components: a data set, a coordinate system, and geoms (visual marks that represent data points).
• To display values, map variables in the data to visual properties of the geom (aesthetics) such as size, color, and x and y locations.
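One way to picture a mapping is as a function from data values to visual values. The sketch below spells that function out by hand with scale_color_manual(); the colour choices and the names drv_colours and p_drv are arbitrary assumptions for this illustration, and ggplot2 normally picks the colours itself.

```r
library(ggplot2)

# An aesthetic mapping pairs a variable with a visual property.
# The scale then behaves like a function f: data value -> visual value.
# These colour choices are arbitrary and only for illustration.
drv_colours <- c("4" = "darkorange", "f" = "steelblue", "r" = "forestgreen")

p_drv <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy, color = drv)) +
  scale_color_manual(values = drv_colours)
p_drv

# Look up the colour assigned to a single data value
drv_colours[["f"]]
```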

Let's map a 3rd dimension to color
ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy, color = class))

Let's map a 3rd dimension to size
# ggplot2 warns that size is not advised for a discrete variable such as class
ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy, size = class))

Let's map a 3rd dimension to alpha (transparency)
ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy, alpha = class))

Let's map a 3rd dimension to shape
ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy, shape = class))
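Mapping shape to class has a catch worth checking. The sketch below assumes ggplot2's default shape palette, which handles at most six discrete values, and counts how many points are left without a shape. The names n_class, p_shape and built are hypothetical, used only here.

```r
library(ggplot2)

# class has more distinct values than the default shape palette supports
n_class <- length(unique(mpg$class))
n_class

p_shape <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy, shape = class))

# Points whose shape could not be assigned come back as NA in the built data
# (building the plot also triggers ggplot2's shape-palette warning)
built <- suppressWarnings(layer_data(p_shape))
sum(is.na(built$shape))
```

Points with a missing shape are dropped when the plot is drawn, so color is usually the safer aesthetic for a variable with this many categories.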

A calculated variable as a 3rd dimension
ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy, color = displ < 5))
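The expression displ < 5 is evaluated row by row, which gives a TRUE/FALSE variable. A sketch of the same plot with that variable stored as a column first; mpg2 and small_engine are hypothetical names introduced only for this illustration.

```r
library(ggplot2)

# Store the comparison as a column first.
# mpg2 and small_engine are hypothetical names used only in this sketch.
mpg2 <- mpg
mpg2$small_engine <- mpg2$displ < 5

p_small <- ggplot(data = mpg2) +
  geom_point(mapping = aes(x = displ, y = hwy, color = small_engine))
p_small
```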

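Cf. Common problem: syntax error
e.g. the location of "+": it must end a line, not start the next one
# Wrong: the + starts the second line, so R runs the first line on its own
ggplot(data = mpg)
+ geom_point(mapping = aes(x = displ, y = hwy))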
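For comparison, a sketch of the same two lines with the + moved to the end of the first line, which is the placement R needs to treat them as one expression. The name p_ok is used only for this illustration.

```r
library(ggplot2)

# The + ends the first line, so R knows the expression continues
p_ok <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy))
p_ok
```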

Cf. Common problem
Aesthetic mapping is for variables in the data
e.g. "blue" is not a variable, so inside aes() it is treated as a constant category rather than as a color
ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy, color = "blue"))
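To actually draw blue points, the color is set as an argument of the geom, outside aes(). A minimal sketch; p_blue is a name used only here.

```r
library(ggplot2)

# color is set outside aes(), so it is a fixed property of the layer,
# not a mapping from a variable
p_blue <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy), color = "blue")
p_blue
```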

Facet

Facet: subplots that each display one subset of the data
facet_wrap(~ VariableName)
ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy)) +
  facet_wrap(~ class, nrow = 2)

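facet_grid(VariableName4Row ~ VariableName4Column)
ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy)) +
  facet_grid(drv ~ cyl)

Swap in one of the following as the last line to change the layout:
facet_grid(drv ~ .)

facet_grid(. ~ cyl)

-> The variable with more unique levels (trans) in the rows
facet_grid(trans ~ drv)

-> The variable with more unique levels (trans) in the columns
facet_grid(drv ~ trans)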
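The last two facet_grid() calls differ only in which variable goes in the rows and which in the columns. A sketch that counts the distinct values of each variable and then builds the full plot with trans in the columns; the names n_trans, n_drv and p_facet are assumptions for this illustration.

```r
library(ggplot2)

# Count the distinct values of each faceting variable
n_trans <- length(unique(mpg$trans))
n_drv <- length(unique(mpg$drv))
c(trans = n_trans, drv = n_drv)

# Full plot with the many-valued variable (trans) in the columns
p_facet <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy)) +
  facet_grid(drv ~ trans)
p_facet
```

Which orientation reads better depends on the shape of the display: many columns suit a wide screen, many rows a tall page.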

Geometric object

Plot in a 2-D plane: x- and y-axes
geom_point (points)

geom_smooth (smoothed line)
ggplot(data = mpg) +
  geom_smooth(mapping = aes(x = displ, y = hwy))

ggplot(data = mpg) +
  geom_smooth(mapping = aes(x = displ, y = hwy, group = drv))

ggplot(data = mpg) +
  geom_smooth(mapping = aes(x = displ, y = hwy, linetype = drv))

ggplot(data = mpg) +
  geom_smooth(mapping = aes(x = displ, y = hwy, color = drv), show.legend = FALSE)

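geom_point (points) & geom_smooth (smoothed line)
ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy)) +
  geom_smooth(mapping = aes(x = displ, y = hwy))

# Equivalent: set the mapping once in ggplot() so both layers share it
ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
  geom_point() +
  geom_smooth()

ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
  geom_point(mapping = aes(color = class)) +
  geom_smooth()

ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
  geom_point(mapping = aes(color = class)) +
  geom_smooth(data = dplyr::filter(mpg, class == "subcompact"), se = FALSE)

ggplot(data = mpg, mapping = aes(x = displ, y = hwy, color = drv)) +
  geom_point() +
  geom_smooth(se = FALSE)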

Statistical transformations (통계적 변환)

Built-in data: diamonds data
> diamonds
# A tibble: 53,940 x 10
   carat cut       color clarity depth table price     x     y     z
   <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
 1 0.23  Ideal     E     SI2      61.5    55   326  3.95  3.98  2.43
 2 0.21  Premium   E     SI1      59.8    61   326  3.89  3.84  2.31
 3 0.23  Good      E     VS1      56.9    65   327  4.05  4.07  2.31
 4 0.290 Premium   I     VS2      62.4    58   334  4.2   4.23  2.63
 5 0.31  Good      J     SI2      63.3    58   335  4.34  4.35  2.75
 6 0.24  Very Good J     VVS2     62.8    57   336  3.94  3.96  2.48
 7 0.24  Very Good I     VVS1     62.3    57   336  3.95  3.98  2.47
 8 0.26  Very Good H     SI1      61.9    55   337  4.07  4.11  2.53
 9 0.22  Fair      E     VS2      65.1    61   337  3.87  3.78  2.49
10 0.23  Very Good H     VS1      59.4    61   338  4     4.05  2.39
# ... with 53,930 more rows
> help(diamonds)

diamonds {ggplot2} R Documentation
Prices of 50,000 round cut diamonds
Description
A dataset containing the prices and other attributes of almost 54,000 diamonds. The variables are as follows:
Usage
diamonds
Format
A data frame with 53940 rows and 10 variables:
price: price in US dollars ($326–$18,823)
carat: weight of the diamond (0.2–5.01)
cut: quality of the cut (Fair, Good, Very Good, Premium, Ideal)
color: diamond colour, from J (worst) to D (best)
clarity: a measurement of how clear the diamond is (I1 (worst), SI2, SI1, VS2, VS1, VVS2, VVS1, IF (best))
x: length in mm (0–10.74)
y: width in mm (0–58.9)
z: depth in mm (0–31.8)
depth: total depth percentage = z / mean(x, y) = 2 * z / (x + y) (43–79)
table: width of top of diamond relative to widest point (43–95)

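geom_bar
ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut))

geom_bar: colour outlines the bars
ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut, colour = cut))

geom_bar: fill colours the inside of the bars
ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut, fill = cut))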

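geom_bar: how is the y-axis calculated?
ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut))

-> diamonds has no count variable: geom_bar() counts the rows in each cut category and plots that count on the y-axis.

geom_bar() uses stat_count() by default, so the two calls draw the same plot
ggplot(data = diamonds) +
  stat_count(mapping = aes(x = cut))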
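One way to check that geom_bar() and stat_count() compute the same values is to inspect the data each plot is built from. The sketch below uses ggplot2's layer_data() for that; the object names are assumptions for this illustration.

```r
library(ggplot2)

p_geom <- ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut))
p_stat <- ggplot(data = diamonds) +
  stat_count(mapping = aes(x = cut))

# layer_data() returns the data after the statistical transformation
counts_geom <- layer_data(p_geom)$count
counts_stat <- layer_data(p_stat)$count

counts_geom
identical(counts_geom, counts_stat)
```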

geom_bar(): ..prop..
ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut, y = ..prop.., group = 1))
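Recent ggplot2 releases write the same request with after_stat() instead of the double-dot notation. A sketch assuming ggplot2 3.3.0 or later; p_prop is a name used only for this illustration.

```r
library(ggplot2)

# after_stat(prop) refers to the prop variable computed by stat_count();
# group = 1 makes the proportions relative to the whole data set
p_prop <- ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut, y = after_stat(prop), group = 1))
p_prop

# The five proportions should add up to 1
sum(layer_data(p_prop)$prop)
```

Without group = 1, each cut is treated as its own group, so every bar would have a proportion of 1.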

The data after the statistical transformation
library(tibble)
demo <- tribble(
  ~cut,        ~freq,
  "Fair",       1610,
  "Good",       4906,
  "Very Good", 12082,
  "Premium",   13791,
  "Ideal",     21551
)
demo
# # A tibble: 5 x 2
#   cut        freq
#   <chr>     <dbl>
# 1 Fair       1610
# 2 Good       4906
# 3 Very Good 12082
# 4 Premium   13791
# 5 Ideal     21551
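The freq values in demo are the counts that the statistical transformation produces from diamonds. A sketch that recomputes them with base R's table() and compares them with the hand-typed numbers; cut_counts and demo_freq are names introduced only here.

```r
library(ggplot2)  # provides the diamonds data set

# Count the rows in each cut category with base R's table()
cut_counts <- table(diamonds$cut)
cut_counts

# Compare with the freq column typed into demo by hand
demo_freq <- c(1610, 4906, 12082, 13791, 21551)
all(as.vector(cut_counts) == demo_freq)
```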

The data after the statistical transformation
ggplot(data = demo) +
  geom_bar(mapping = aes(x = cut, y = freq), stat = "identity")
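When the heights are already in the data, ggplot2 also offers geom_col(), which uses the values as they are. The sketch below rebuilds demo so that it runs on its own; p_col is a name used only for this illustration.

```r
library(ggplot2)
library(tibble)

# Same pre-counted data as above
demo <- tribble(
  ~cut,        ~freq,
  "Fair",       1610,
  "Good",       4906,
  "Very Good", 12082,
  "Premium",   13791,
  "Ideal",     21551
)

# geom_col() plots the y values as given, like geom_bar(stat = "identity")
p_col <- ggplot(data = demo) +
  geom_col(mapping = aes(x = cut, y = freq))
p_col
```

Because cut is a character column in demo rather than an ordered factor, the bars appear in alphabetical order instead of quality order.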
