22 r data manipulation 2 pt 20140404

[원] Statistical Consulting 2014-1
A.2 Data Manipulation in R
reshape, plyr, data.table packages
Myung-Hoe Huh (Professor of Statistics, Korea University)
2014.04.04
Hadley Wickham at useR! 2013:
Author of reshape, plyr, ggplot2, ...
Applied Data Analysis: Using R

Overview
Efficient ways to process data in R without explicit loops
1. reshape: changing the shape of a data set
2. plyr: split-apply-combine
3. data.table: fast lookup

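: reshape
Changing the shape of a data set
Example: french_fries {reshape}
Structure: time treatment subject rep y1 y2 y3 y4 y5
- y1 to y5: 5 responses
- rep: 2 replicates
- subject: 12 subjects
- treatment: 3 treatments
- time: 10 time points
* Many missing values
* Row order is randomized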

: reshape
Example: french_fries (continued)
Goal: for each subject, compute the mean of the five y responses under each treatment
For each subject:
treatment y1 y2 y3 y4 y5
1 □ □ □ □ □
2 □ □ □ □ □
3 □ □ □ □ □

: reshape
melting: french_fries example (continued)
> library(reshape)
> ff.melted <- melt(french_fries, id=c("time","subject","treatment","rep"), na.rm=TRUE)
> head(ff.melted, 10)
time subject treatment rep variable value
1 1 3 1 1 y1 2.9
2 1 3 1 2 y1 14.0
3 1 10 1 1 y1 11.0
4 1 10 1 2 y1 9.9
5 1 15 1 1 y1 1.2
6 1 15 1 2 y1 8.8
7 1 16 1 1 y1 9.0
8 1 16 1 2 y1 8.2
9 1 19 1 1 y1 7.0
10 1 19 1 2 y1 13.0
⋮
* The first four columns (time, subject, treatment, rep) are the identifier (id) variables.
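As a rough illustration of what melting does, the sketch below stacks the measured columns of a small wide table into variable/value pairs using base R only. The data frame `wide` and its values are hypothetical, and this is not how the reshape package implements melt(); it only mimics the shape of the result.

```r
# Toy wide data set (hypothetical values, not french_fries).
wide <- data.frame(
  subject   = c(3, 3, 10, 10),
  treatment = c(1, 2, 1, 2),
  y1        = c(2.9, 14.0, 11.0, NA),
  y2        = c(0.0, 0.0, 6.4, 9.9)
)

id_vars      <- c("subject", "treatment")
measure_vars <- setdiff(names(wide), id_vars)

# Stack each measured column under the repeated id columns.
pieces <- lapply(measure_vars, function(v) {
  data.frame(wide[id_vars], variable = v, value = wide[[v]])
})
long <- do.call(rbind, pieces)

# Mimic na.rm=TRUE: drop rows whose value is missing.
long <- long[!is.na(long$value), ]
print(long)
```

Each row of `long` now holds one measurement, identified by its id variables and by the name of the column it came from.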

: reshape
casting: french_fries example (continued)
> cast(ff.melted, subject+treatment ~ variable, length)
subject treatment y1 y2 y3 y4 y5
1 3 1 18 18 18 18 18
2 3 2 18 18 18 18 18
3 3 3 18 18 18 18 18 ... fewer than 20 observations
4 10 1 20 20 20 20 20
5 10 2 20 20 20 20 20
6 10 3 20 20 20 20 20
7 15 1 20 20 20 20 20
8 15 2 20 20 20 20 20
9 15 3 19 19 19 19 19
10 16 1 20 20 20 20 20
11 16 2 20 19 20 20 20 ... fewer than 20 observations
12 16 3 20 20 20 20 20
⋮
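Casting can be pictured as choosing row variables, a column variable, and an aggregation function. A minimal base R sketch of that idea follows, using tapply() on hypothetical long-format data; it is an analogue for illustration, not the reshape implementation of cast().

```r
# Toy long-format data (hypothetical values), one row per measurement.
long <- data.frame(
  subject  = c(3, 3, 3, 10, 10, 10, 10),
  variable = c("y1", "y1", "y2", "y1", "y1", "y2", "y2"),
  value    = c(2.9, 14.0, 0.0, 11.0, 9.9, 6.4, 5.0)
)

# Rows = subject, columns = variable, cell = aggregate of value.
counts <- tapply(long$value, list(long$subject, long$variable), length)
means  <- tapply(long$value, list(long$subject, long$variable), mean)
print(counts)
print(means)
```

Swapping the aggregation function (length, mean, or a user-defined one) changes what each cell reports while the layout stays the same.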

: reshape
> cast(ff.melted, subject+treatment ~ variable, function(x) 20-length(x))
subject treatment y1 y2 y3 y4 y5
1 3 1 2 2 2 2 2
2 3 2 2 2 2 2 2
3 3 3 2 2 2 2 2 ... 2 observations missing
4 10 1 0 0 0 0 0
5 10 2 0 0 0 0 0
6 10 3 0 0 0 0 0
7 15 1 0 0 0 0 0
8 15 2 0 0 0 0 0
9 15 3 1 1 1 1 1
10 16 1 0 0 0 0 0
11 16 2 0 1 0 0 0 ... 1 observation missing
12 16 3 0 0 0 0 0
⋮

: reshape
> options(digits=3)
> cast(ff.melted, treatment+time ~ variable, mean)
treatment time y1 y2 y3 y4 y5
1 1 1 7.92 1.796 0.904 2.76 2.150
2 1 2 7.59 2.525 1.004 3.90 1.975
3 1 3 7.77 2.296 0.817 4.65 1.117
4 1 4 8.40 1.979 1.025 2.08 0.467
5 1 5 7.74 1.367 0.771 4.28 3.008
6 1 6 6.08 1.825 0.467 4.34 2.554
7 1 7 6.28 1.242 0.163 3.20 2.196
8 1 8 5.17 0.987 0.633 5.39 4.588
9 1 9 6.07 1.830 0.135 3.95 2.905
10 1 10 5.46 1.960 0.455 6.50 5.400
11 2 1 8.78 2.492 0.996 1.72 0.808
12 2 2 8.54 3.125 0.950 2.14 0.662
⋮

: reshape
> options(digits=3)
> cast(ff.melted, subject+treatment ~ variable, mean, margins="grand_col")
subject treatment y1 y2 y3 y4 y5 (all)
1 3 1 6.22 0.372 0.1889 2.106 3.1111 2.40
2 3 2 6.74 0.589 0.1056 3.139 2.4778 2.61
3 3 3 5.29 0.767 0.0944 2.856 2.8667 2.38
4 10 1 9.96 6.750 0.5850 4.020 1.3750 4.54
5 10 2 9.99 6.980 0.4750 2.150 0.8200 4.08
6 10 3 10.03 6.450 0.1450 3.110 0.6900 4.08
7 15 1 3.36 0.720 0.4200 3.965 3.2600 2.35
8 15 2 4.41 1.315 0.3400 2.285 2.0600 2.08
9 15 3 3.96 0.989 0.4421 2.547 2.3684 2.06
10 16 1 6.50 3.260 0.7550 4.120 1.2300 3.17
11 16 2 6.45 3.374 1.0550 3.400 0.4550 2.94
12 16 3 6.86 2.700 1.1250 3.200 0.5550 2.89
⋮

: reshape
array: french_fries example (continued)
> options(digits=3)
> cast(ff.melted, subject ~ treatment ~ variable, mean)
, , variable = y1
treatment
subject 1 2 3
3 6.22 6.74 5.29
10 9.96 9.99 10.03
15 3.36 4.41 3.96
16 6.50 6.45 6.86
19 9.38 8.64 8.74
31 8.84 8.03 9.03
51 10.68 9.98 10.22
52 5.06 5.51 5.47
63 6.78 8.41 8.06
78 3.62 3.78 4.00
79 8.06 7.94 7.73
86 4.18 3.99 3.87
, , variable = y2
treatment
subject 1 2 3
3 0.372 0.589 0.767
10 6.750 6.980 6.450
15 0.720 1.315 0.989
16 3.260 3.374 2.700
19 3.055 2.450 1.725
31 0.444 0.617 0.650
51 2.640 3.795 3.130
52 0.805 1.025 0.865
63 0.025 0.105 0.065
78 0.735 0.295 0.705
79 0.282 0.694 0.572
86 1.772 2.061 1.633
, , variable = y3 ⋯

: reshape
> options(digits=3)
> apply(cast(ff.melted, subject ~ treatment ~ variable, mean), c(2,3), mean)
variable
treatment y1 y2 y3 y4 y5
1 6.89 1.74 0.639 4.05 2.58
2 6.99 1.94 0.652 3.63 2.45
3 6.94 1.69 0.668 3.85 2.53

: reshape
> options(digits=3)
> cast(ff.melted, subject+treatment ~ ., quantile, c(0,0.25,0.5,0.75,1))
subject treatment X0. X25. X50. X75. X100.
1 3 1 0 0.000 0.40 3.22 14.0
2 3 2 0 0.000 0.50 3.38 14.1
3 3 3 0 0.000 0.60 3.80 14.1
4 10 1 0 0.000 3.85 8.40 13.2
5 10 2 0 0.000 2.55 8.25 11.4
6 10 3 0 0.000 3.35 8.40 11.5
7 15 1 0 0.175 1.25 3.65 10.8
8 15 2 0 0.200 1.05 3.12 12.7
9 15 3 0 0.200 0.80 3.40 10.4
10 16 1 0 0.300 2.15 4.95 11.0
11 16 2 0 0.200 1.50 4.65 13.4
12 16 3 0 0.500 1.35 4.58 12.7
⋮

: plyr
Split-Apply-Combine
1. split: divide the data into pieces
2. apply: run a function on each piece
3. combine: assemble the outputs
Hadoop: MapReduce
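The three steps can be sketched in base R alone, without plyr. The data frame `df` and its columns are hypothetical; plyr wraps the same pattern in a single call.

```r
# Toy data (hypothetical): three measurements for each of two groups.
df <- data.frame(
  group = c("a", "a", "a", "b", "b", "b"),
  value = c(1, 2, 3, 10, 20, 30)
)

# 1. split: one sub-data-frame per group
pieces <- split(df, df$group)

# 2. apply: summarise each piece independently
summaries <- lapply(pieces, function(piece) {
  data.frame(group = piece$group[1], n = nrow(piece), mean = mean(piece$value))
})

# 3. combine: stack the per-group results
result <- do.call(rbind, summaries)
print(result)
```

Because each piece is processed independently, the apply step is the part that can be distributed, which is the link to MapReduce.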

: plyr
Example: baseball {plyr}
Structure: data.frame: 21699 obs. of 22 variables
$ id : chr "ansonca01" "forceda01" "mathebo01" "startjo01" ...
$ year : int 1871 1871 1871 1871 1871 1871 1871 1872 ...
$ rbi : int 16 29 10 34 23 21 23 50 15 16 ...
Goal: for each id (player), find the career year with the highest RBI
- c.year (= year-min(year)+1)
- max.rbi
Procedure:
1. split: divide the full data by id (player)
2. apply: within each id's subset, compute c.year, max.rbi, and the c.year at which max.rbi occurs
3. combine: put the per-player results together

: plyr
Example: baseball (continued)
# plyr for baseball data
library(plyr)
str(baseball)
calculate_c.year <- function(df) mutate(df, cyear = year - min(year)+1)
baseball.1 <- ddply(baseball, .(id), calculate_c.year)
## cyear is appended as the rightmost column of the baseball data frame (result stored in baseball.1).
calculate_c.rbi <- function(df)
c(best.year=df$cyear[which.max(df$rbi)], best.rbi=max(df$rbi), career.year=max(df$cyear))
bb.2 <- ddply(baseball.1, .(id), calculate_c.rbi)
str(bb.2)
## The data frame bb.2 has four variables: id, best.year, best.rbi, career.year
## bb.2 has 1,228 rows (= the number of players).

: plyr
Example: baseball (continued)
# histograms of best.year and career.year
max(bb.2$career.year)
hist(bb.2$best.year, breaks=seq(0.5,40.5,1), xlab="best.year", main="")
hist(bb.2$career.year, breaks=seq(0.5,40.5,1), xlab="career.year", main="")
## The mode of the best.year distribution (the career year in which max.rbi occurs) is year 7

: plyr
**ply
Rows: input type; columns: output type
input \ output array df list
array aaply adply alply
df daply ddply dlply
list laply ldply llply

: plyr
summarise( )
> library(plyr)
> ddply(baseball, "id", summarise, duration = max(year)-min(year)+1,
+ nteams = length(unique(team)))
id duration nteams
1 aaronha01 23 3
2 abernte02 18 7
3 adairje01 13 4
4 adamsba01 21 2
5 adamsbo03 14 4
6 adcocjo01 17 5

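: data.table
Efficient lookup
1. Creating the data table: a 10,000,000 × 7 data table made up of
six random-digit columns and one numeric column
library(data.table)
n <- 10000000
digits <- as.factor(0:9)
x1 <- sample(digits, n, replace=TRUE)
x2 <- sample(digits, n, replace=TRUE)
x3 <- sample(digits, n, replace=TRUE)
x4 <- sample(digits, n, replace=TRUE)
x5 <- sample(digits, n, replace=TRUE)
x6 <- sample(digits, n, replace=TRUE)
DT <- data.table(x1, x2, x3, x4, x5, x6, y=rnorm(n))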

: data.table
Creating the data table (continued)
> head(DT, 10)
x1 x2 x3 x4 x5 x6 y
1: 3 7 0 2 1 0 -2.1384800
2: 9 1 6 1 6 0 2.1295443
3: 9 6 2 9 6 3 -1.0069040
4: 8 8 5 5 6 4 0.1813213
5: 2 9 9 5 3 3 -0.5683664
6: 2 8 3 0 8 4 0.1869398
7: 0 8 9 8 5 6 -0.1080321
8: 4 7 5 3 7 1 2.1213928
9: 1 9 4 9 1 6 1.3338342
10: 9 7 9 7 6 4 -0.6250066
> class(DT)
[1] "data.table" "data.frame"

: data.table
Setting the search key
> setkey(DT, x1, x2, x3, x4, x5, x6)
> head(DT, 10)
* The rows are sorted by the key variables, in key order.
x1 x2 x3 x4 x5 x6 y
1: 0 0 0 0 0 0 1.7554923
2: 0 0 0 0 0 0 1.4160151
3: 0 0 0 0 0 0 0.3351744
4: 0 0 0 0 0 0 -0.4342841
5: 0 0 0 0 0 1 -1.4443813
6: 0 0 0 0 0 1 0.8493174
7: 0 0 0 0 0 1 1.2504767
8: 0 0 0 0 0 1 -1.4396524
9: 0 0 0 0 0 1 -0.9762352
10: 0 0 0 0 0 1 0.9889054

: data.table
Searching the data set
> DT[J("1","2","3","4","5","6")]
x1 x2 x3 x4 x5 x6 y
1: 1 2 3 4 5 6 0.4442011
2: 1 2 3 4 5 6 -0.4213922
3: 1 2 3 4 5 6 0.9358654
4: 1 2 3 4 5 6 0.1211770
5: 1 2 3 4 5 6 0.2052872
6: 1 2 3 4 5 6 -1.4889960
7: 1 2 3 4 5 6 -0.8041964
Second output shown on the slide:
x1 x2 x3 x4 x5 x6 y
1: 1 2 3 4 5 6 0.018513680
2: 1 2 3 4 5 6 0.632281586
3: 1 2 3 4 5 6 -0.169242317
4: 1 2 3 4 5 6 0.003417459
5: 1 2 3 4 5 6 -1.290678412
6: 1 2 3 4 5 6 0.420696995
7: 1 2 3 4 5 6 1.484245923
8: 1 2 3 4 5 6 0.050544004
9: 1 2 3 4 5 6 0.151274821
10: 1 2 3 4 5 6 0.308839374
11: 1 2 3 4 5 6 0.076483702
* The expected number of matching records is 10^7 × (1/10)^6 = 10.
* The number of records that appear approximately follows a Poisson distribution with mean 10.
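The arithmetic behind the two notes can be checked directly. This sketch assumes, as in the simulation, that the six digit columns are independent and uniform on 0 to 9; the variable names are illustrative.

```r
n <- 1e7          # rows in DT
p <- (1/10)^6     # chance that all six digits match a given key
lambda <- n * p   # expected number of matching rows
print(lambda)

# Probability of seeing exactly 7 or exactly 11 matches,
# under the Poisson approximation with mean lambda.
print(dpois(c(7, 11), lambda))

# The exact binomial probabilities are nearly identical.
print(dbinom(c(7, 11), size = n, prob = p))
```

Counts such as 7 and 11 are therefore unremarkable outcomes for a search whose expected hit count is 10.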

: data.table
Search: an alternative method
> p.time <- proc.time()
> DT[x1=="1" & x2=="2" & x3=="3" & x4=="4" & x5=="5" & x6=="6",]
⋮
> proc.time() - p.time
user system elapsed
8.47 1.06 9.63
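For comparison: processing time of the preceding method, DT[J(...)]
> proc.time() - p.time
user system elapsed
0.08 0.03 0.11
* In elapsed time, only about 1.1% of the time taken by the logical-condition search above,
* because data.table performs a binary search on the key.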
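Why a sorted key helps can be seen in a small base R sketch. The helper `lower_bound` and the vector `keys` are hypothetical names for illustration, and this is not data.table's internal code: a scan touches every element, while a binary search on sorted data needs only about log2(n) comparisons to locate the matching block.

```r
set.seed(1)
keys <- sort(sample.int(1e6, 2e5, replace = TRUE))  # sorted, like a keyed column
target <- keys[1000]

# Vector scan: compares every element with the target.
scan_hits <- which(keys == target)

# Binary search: halves the candidate range at every step.
lower_bound <- function(x, t) {          # first index with x[i] >= t
  lo <- 1L; hi <- length(x) + 1L
  while (lo < hi) {
    mid <- (lo + hi) %/% 2L
    if (x[mid] < t) lo <- mid + 1L else hi <- mid
  }
  lo
}
first <- lower_bound(keys, target)
last  <- lower_bound(keys, target + 1L) - 1L   # integer keys assumed
search_hits <- first:last

# Both approaches find the same rows.
stopifnot(identical(scan_hits, search_hits))
```

With 10,000,000 rows, a binary search needs roughly 24 comparisons per boundary, which is in line with the large gap between the two timings.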

Summary
Data manipulation: the fundamentals of "big" data analysis
A weak point of statistics majors
A wall to get over in evolving toward data science
References: R Manuals, Vignettes, ...
Journal of Statistical Software papers
Jeon, Heewon (2013). R로 하는 데이터 시각화 (Data Visualization with R). Hanbit Media
Practice files: reshape_ff.r plyr_bb.r datatale_sim.r
