Introduction into R for historians (part 4: data manipulation)

Recap
Data manipulation
data.table package
Basic statistical techniques
Data manipulation in R
Richard L. Zijdeman
May 29, 2015

1 Recap
2 Data manipulation
3 data.table package
4 Basic statistical techniques

Recap

What we’ve seen so far
functions to read in data
read.csv(), read.xlsx()
objects
assignment <-
characteristics, e.g.:
str(), summary(), head(), tail()
calculations
mean(), min(), max()
plotting
plot()
ggplot()
paint by ‘layer’

Before we go on...
Structure your R script
Filename, Date, Purpose, Author, Last change
Use comments to tell what you are doing
read in data
changing variables (why did you do it)
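A script header along these lines is one way to follow the advice above. It is only a sketch: the file name, the toy data frame and its column names are made up for illustration.

```r
# ------------------------------------------------------------
# Filename:    hsn_marriages_prepare.R   (hypothetical name)
# Date:        2015-05-29
# Purpose:     read in and clean the marriage data
# Author:      <your name>
# Last change: <date, plus a short note on what changed>
# ------------------------------------------------------------

# read in data
# (a toy data.frame stands in for the real file here)
toy <- data.frame(ID = 1:3, Age_Groom = c(25, 31, 28))

# changing variables: lower-case the column names, because
# mixed-case names are easy to mistype later on
names(toy) <- tolower(names(toy))
names(toy)
```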

Create a working directory, with subdirs
+ documents
+ data
- source
- derived
+ analysis
+ figures
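The folder layout above can also be created from within R. The sketch below assumes a project called "my_project" (a made-up name) and builds it inside tempdir(), so that running it does not touch your own files.

```r
# Project root inside a temporary directory (hypothetical project name)
project <- file.path(tempdir(), "my_project")

# The sub-directories from the layout above
subdirs <- c("documents",
             file.path("data", "source"),
             file.path("data", "derived"),
             "analysis",
             "figures")

for (d in file.path(project, subdirs)) {
  # recursive = TRUE also creates missing parent folders (e.g. "data")
  dir.create(d, recursive = TRUE, showWarnings = FALSE)
}

# List what was created
list.dirs(project, full.names = FALSE, recursive = TRUE)
```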

Set a working directory
setwd(), getwd()
use relative paths to save things
“./” = current directory
“./../” = one folder up
Read J. Scott Long’s “Workflow”
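To see what the relative paths resolve to, something like the following can be tried. It is a sketch that temporarily switches to tempdir(), purely so that it is safe to run anywhere.

```r
old <- getwd()                  # remember where we started
setwd(tempdir())                # move to a scratch directory

here <- normalizePath(".")      # "./"    = current directory
up   <- normalizePath("./..")   # "./../" = one folder up
print(here)
print(up)

setwd(old)                      # go back to the original directory
```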

Data manipulation

Assignment and Indexing
First, we’ll read in the HSN marriages again
hmar <- read.csv("./../data/derived/HSN_marriages.csv",
                 stringsAsFactors = FALSE,
                 encoding = "latin1",
                 header = TRUE,
                 nrows = 10000)

Change case of text
tolower()
toupper()
tolower("CaN we pleASe jUSt have LOWER cases?")
## [1] "can we please just have lower cases?"
names(hmar) <- tolower(names(hmar))
names(hmar)
## [1] "id_marriage" "idnr" "m_loc" "m_
## [5] "sex_hsnrp" "age_groom" "occ_groom" "ci
## [9] "sign_groom" "b_loc_groom" "l_loc_groom" "ag
## [13] "occ_bride" "civilst_bride" "sign_bride" "b_
## [17] "l_loc_bride" "a_f_groom" "occ_f_groom" "si

Indexing
There were way too many names to print on a slide... How many
names are there actually?

Use the length() command to find out:
length(names(hmar))
## [1] 29
So let’s print just the first two:
names(hmar)[1:2]
## [1] "id_marriage" "idnr"
The technique using square brackets is called indexing

Any idea how we would show the last two names?

x <- length(names(hmar))
names(hmar)[(x-1):x]
# returns the last two column names (elements 28 and 29)
Using c() (concatenate) we can also extract several names at once
names(hmar)[c(1, 3, 5)]
## [1] "id_marriage" "m_loc" "sex_hsnrp"
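The same indexing rules can be tried on a small vector. The values below are made up; the point of the sketch is the role of the parentheses in (n - 1):n.

```r
v <- c("alpha", "beta", "gamma", "delta", "epsilon")  # made-up values
n <- length(v)

v[1:2]          # first two elements
v[(n - 1):n]    # last two elements
tail(v, 2)      # the same, using tail()
v[c(1, 3, 5)]   # elements 1, 3 and 5
v[-1]           # everything except the first element

# Without parentheses, n - 1:n means n - (1:n), i.e. 4 3 2 1 0
n - 1:n
```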

We can also apply indexing to a data.frame:
hmar[1:2, 1:3]
## id_marriage idnr m_loc
## 1 1 1001 Abcoude-Baambrugge
## 2 2 1005 Baarn
# shows the first 2 rows and first 3 columns
# so, in general: data.frame[rows, columns]

head() and tail()
So actually, you should now be able to replace head() and tail()
How?

# head()
hmar[1:6, ]
# tail()
y <- nrow(hmar)
hmar[(y-5):y, ]
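By default head() and tail() return six rows, so the last six rows run from row y - 5 to row y. A quick check on a toy data.frame (made-up values) could look like this:

```r
df <- data.frame(id = 1:10, value = (1:10)^2)  # toy data
y <- nrow(df)

first6 <- df[1:6, ]          # should match head(df)
last6  <- df[(y - 5):y, ]    # should match tail(df)

all.equal(head(df), first6)
all.equal(tail(df), last6)
```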

data.table package

Developed by Matt Dowle
Website:
https://github.com/Rdatatable/data.table/wiki
Why data.table?
fast subsetting on large files
more consistent ‘grammar’
less typing

install.packages("data.table")
library(data.table)

Class: data.table
For data.table functions to work we need to define a data.frame as
class data.table
is.data.table(hmar)
## [1] FALSE
hmar.dt <- data.table(hmar)
is.data.table(hmar.dt)
## [1] TRUE
is.data.frame(hmar.dt)
## [1] TRUE
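A small sketch of the conversion, assuming the data.table package is installed. The toy data are made up; as.data.table() is used here as an alternative to data.table() for converting an existing data.frame.

```r
library(data.table)

# Toy data.frame (values are illustrative only)
df <- data.frame(id = 1:3,
                 town = c("Baarn", "Utrecht", "Zeist"),
                 stringsAsFactors = FALSE)

dt <- as.data.table(df)   # returns a converted copy; df itself is unchanged

class(df)   # "data.frame"
class(dt)   # "data.table" "data.frame"
```

Because a data.table also carries the class data.frame, functions written for data.frames generally accept it as well. setDT() is a further option: it converts an object in place instead of making a copy.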

Friends with benefits
Data.frame and data.table are like ‘friends with benefits’
all.equal(hmar, hmar.dt)
## [1] "Attributes: < Names: 2 string mismatches >"
## [2] "Attributes: < Length mismatch: comparison on first
## [3] "Attributes: < Component 1: Modes: character, extern
## [4] "Attributes: < Component 1: target is character, cur
## [5] "Attributes: < Component 2: Modes: numeric, characte
## [6] "Attributes: < Component 2: Lengths: 10000, 2 >"
## [7] "Attributes: < Component 2: target is numeric, curre
# so we have all the benefits of a data.frame
# ... and additional benefits of data.table
NB: the next series of commands will only work for data.tables

Sort with setkey
Often we want to sort our data. We can do so with setkey(), or with setkeyv() when the column names are given as character strings
hmar.dt[1:6, m_year]
## [1] 1849 1851 1864 1840 1843 1858
# note for data.frame hmar it would be:
# hmar[1:6, "m_year"]
setkeyv(hmar.dt, "m_year")
hmar.dt[1:6, m_year]
## [1] 1831 1831 1833 1833 1834 1834
identical(hmar.dt, hmar)
## [1] FALSE

Multiple keys
It is also possible to sort on multiple keys
setkeyv(hmar.dt, c("id_marriage", "idnr"))
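The sketch below, on made-up data, shows setkey() with unquoted column names next to setkeyv() with a character vector, and uses key() to inspect the result. It assumes the data.table package is installed.

```r
library(data.table)

# Toy data (values are illustrative only)
dt <- data.table(m_year = c(1849, 1831, 1864, 1831),
                 idnr   = c(4, 2, 3, 1))

setkey(dt, m_year, idnr)            # unquoted column names
key(dt)                             # "m_year" "idnr"

setkeyv(dt, c("m_year", "idnr"))    # same keys, given as character strings

dt   # rows are now sorted by m_year, then by idnr within m_year
```

Note that both functions sort the table in place: no assignment with <- is needed.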

Subsetting
groom.sig <- hmar.dt[age_groom > 30, ]
dim(groom.sig)
## [1] 2493 29
groom.sig <- hmar.dt[sign_groom == "h", ]
dim(groom.sig)
## [1] 9590 29

groom.sig <- hmar.dt[sign_groom == "h" &
age_groom > 30, ]
dim(groom.sig)
## [1] 2358 29
groom.sig <- hmar.dt[m_year != 1840,
list(id_marriage, idnr)]
dim(groom.sig)
## [1] 9985 2
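The general pattern in these examples is dt[rows, columns]: a logical condition before the comma selects rows, a list() after the comma selects columns. A sketch on made-up data (the code "x" for a groom who did not sign is an assumption for illustration):

```r
library(data.table)

# Toy data (values and the "x" code are illustrative only)
dt <- data.table(id         = 1:5,
                 age_groom  = c(24, 35, 41, 29, 33),
                 sign_groom = c("h", "h", "x", "h", "h"))

older        <- dt[age_groom > 30]                      # rows only
older_signed <- dt[sign_groom == "h" & age_groom > 30]  # two conditions
older_small  <- dt[age_groom > 30, list(id, age_groom)] # rows and columns

dim(older)         # 3 rows, 3 columns
dim(older_signed)  # 2 rows, 3 columns
dim(older_small)   # 3 rows, 2 columns
```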

Creating new variables
Let’s create a variable for the mean age at marriage of grooms
hmar.dt[, mean.gage := mean(age_groom)]
summary(hmar.dt$age_groom)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -2.00 24.00 26.00 28.38 30.00 79.00
summary(hmar.dt$mean.gage)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 28.38 28.38 28.38 28.38 28.38 28.38

Another example (from yesterday)
Dummy variable for equal municipality of birth
hmar.dt[, eq_b_loc := (b_loc_groom == b_loc_bride)]
summary(hmar.dt$eq_b_loc)
## Mode FALSE TRUE NA's
## logical 6957 3043 0

Creating variables by group
As we saw, a var with mean age wasn’t really interesting
average age of grooms at marriage by civil status
hmar.dt[, gage.mean.civ := mean(age_groom),
by = civilst_groom]
table(hmar.dt$civilst_groom, hmar.dt$gage.mean.civ)
##
## 27.2427939112599 40.8829787234043 42.9548286604361
## 1 9263 0 0
## 2 0 0 642
## 3 0 94 0
## 6 0 0 0
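The difference between := with and without by can be seen on a handful of made-up rows. In this sketch the column names civilst and age are placeholders.

```r
library(data.table)

# Toy data (values are illustrative only)
dt <- data.table(civilst = c(1, 1, 2, 2, 2),
                 age     = c(24, 26, 40, 44, 45))

# Without 'by': one overall mean, repeated on every row
dt[, mean_age := mean(age)]

# With 'by': the mean is computed within each group
dt[, mean_age_civ := mean(age), by = civilst]

dt
# mean_age is 35.8 on every row;
# mean_age_civ is 25 for civilst 1 and 43 for civilst 2
```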

Summarising subsets of the data
So far, added vars to original data.frame
can be redundant though
Think of context, say municipalities
archival material on characteristics, e.g.:
population
steam power
You can also make context characteristics by aggregation

mc <- hmar.dt[, mean(age_groom), by = b_loc_groom]
summary(mc)
## b_loc_groom V1
## Length:1184 Min. :-2.00
## Class :character 1st Qu.:26.00
## Mode :character Median :28.17
## Mean :29.36
## 3rd Qu.:31.00
## Max. :69.00

We can improve by naming the variable directly, and adding more
variables
mc2 <- hmar.dt[, list(mean_gage = mean(age_groom),
mean_bage = mean(age_bride)),
by = b_loc_groom]
summary(mc2)
## b_loc_groom mean_gage mean_bage
## Length:1184 Min. :-2.00 Min. :-2.00
## Class :character 1st Qu.:26.00 1st Qu.:23.80
## Mode :character Median :28.17 Median :25.88
## Mean :29.36 Mean :26.53
## 3rd Qu.:31.00 3rd Qu.:28.00
## Max. :69.00 Max. :64.00
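Unlike :=, which adds a column to the existing table, list() combined with by returns a new, smaller table with one row per group. A sketch on made-up data (town names "A" and "B" are placeholders):

```r
library(data.table)

# Toy data (values are illustrative only)
dt <- data.table(town      = c("A", "A", "B", "B", "B"),
                 age_groom = c(30, 26, 24, 28, 32),
                 age_bride = c(25, 23, 22, 27, 29))

agg <- dt[, list(mean_gage = mean(age_groom),
                 mean_bage = mean(age_bride)),
          by = town]

agg
# one row per town:
#   A: mean_gage 28, mean_bage 24
#   B: mean_gage 28, mean_bage 26
```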

One more... counts
Yesterday, we talked about the problem of overlapping points. We
used geom_jitter to solve it.
Now let’s do it properly:
mc3 <- hmar.dt[, list(frequency = .N),
by = list(m_year, age_bride)]
# notice the .N ... N is often used for nr. of obs
library(ggplot2)
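What .N does is easiest to see on a few made-up rows: it counts the rows in each combination of the by variables, so overlapping points collapse into a single row with a frequency.

```r
library(data.table)

# Toy data (values are illustrative only)
dt <- data.table(m_year    = c(1850, 1850, 1850, 1851, 1851),
                 age_bride = c(22, 22, 25, 22, 30))

counts <- dt[, list(frequency = .N),
             by = list(m_year, age_bride)]

counts
# four combinations; (1850, 22) occurs twice, the others once
```

The frequencies always add up to the number of rows in the original table, which is a useful sanity check.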

Using colour
ggplot(mc3, aes(x= m_year, y = age_bride)) +
geom_point(aes(colour = frequency),
size = 10, shape = 18) +
theme_bw()
[Figure: age_bride (y-axis) plotted against m_year (x-axis); point colour shows frequency]

Using size
ggplot(mc3, aes(x= m_year, y = age_bride)) +
geom_point(aes(size = frequency),
colour = "blue", shape = 18) +
theme_bw()
[Figure: age_bride (y-axis) plotted against m_year (x-axis); point size shows frequency]

Box and whisker plot
Distribution of data
Median: 50% of the cases above and below
Box: 1st and 3rd quartile
Interquartile range (IQR): Q3-Q1
Outliers (Tukey, 1977):
x < Q1 - 1.5*IQR
x > Q3 + 1.5*IQR
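The outlier rule can be worked through by hand. The ages below are made up, and the sketch uses quantile() with its default settings; boxplot() itself computes the box from hinges, which can differ slightly from these quartiles in small samples.

```r
age <- c(18, 21, 22, 23, 24, 25, 26, 27, 29, 64)  # made-up ages

q1  <- unname(quantile(age, 0.25))   # 1st quartile: 22.25
q3  <- unname(quantile(age, 0.75))   # 3rd quartile: 26.75
iqr <- q3 - q1                       # interquartile range: 4.5

lower <- q1 - 1.5 * iqr              # 15.5
upper <- q3 + 1.5 * iqr              # 33.5

outliers <- age[age < lower | age > upper]
outliers                             # 64
```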

boxplot(hmar.dt$age_bride,
ylab = "Age")
[Figure: box plot of age_bride; y-axis "Age", ticks from 0 to 60]

hmar.dt[, sign.bride.cln := sign_bride == "h"]
hmar.dt[age_bride < 14, age_bride := NA]
# NB: no missing values here, but mind this when recoding!
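Why missing values matter when recoding: a comparison with NA returns NA, not FALSE. A sketch with made-up codes ("x" is a hypothetical code for "not signed"):

```r
sign <- c("h", "x", NA)       # made-up codes, one value missing

signed <- sign == "h"
signed                        # TRUE FALSE NA

table(signed, useNA = "ifany")   # shows the NA as its own category

sum(signed)                   # NA: the missing value propagates
sum(signed, na.rm = TRUE)     # 1
```

So a dummy created this way keeps missing values missing, which is usually what you want, but summaries need na.rm = TRUE or an explicit check with is.na().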

boxplot(hmar.dt$age_bride ~ hmar.dt$sign.bride.cln,
names = c("not signed", "signed"),
col = c("red", "green"))
[Figure: box plots of age_bride for "not signed" and "signed"; y-axis ticks from 20 to 70]
