Introduction to R for the European Historical Population Sample summer school, Cluj-Napoca, Romania, 2015. Aimed at an audience of historians with few quantitative skills
2. Recap
Data manipulation
data.table package
Basic statistical techniques
1 Recap
2 Data manipulation
3 data.table package
4 Basic statistical techniques
Richard L. Zijdeman Data manipulation in R
What we’ve seen so far
functions to read in data
read.csv(), read.xlsx()
objects
assignment <-
characteristics, e.g.:
str(), summary(), head(), tail()
calculations
mean(), min(), max()
plotting
plot()
ggplot()
paint by ‘layer’
Before we go on...
Structure your R script
Filename, Date, Purpose, Author, Last change
Use comments to tell what you are doing
read in data
changing variables (why did you do it)
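A script header along these lines is one way to follow the advice above. The file name, dates, and author are placeholders, and the tiny data.frame is made up so the sketch runs on its own.

```r
# Filename:    01_read_marriages.R   (hypothetical name)
# Date:        2015-01-01            (placeholder)
# Purpose:     read in the marriage data and tidy the variable names
# Author:      A. Historian          (placeholder)
# Last change: 2015-01-02            (placeholder)

# --- read in data ---
# a made-up data.frame stands in for a real file here
marriages <- data.frame(AGE_GROOM = c(24, 31), AGE_BRIDE = c(22, 28))

# --- changing variables ---
# lower-case names are easier to type and less error-prone
names(marriages) <- tolower(names(marriages))
names(marriages)
```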
Create a working directory, with subdirs
+ documents
+ data
- source
- derived
+ analysis
+ figures
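The folder tree above could be created from within R, as in the sketch below. It builds the tree inside a temporary directory (an assumption, so nothing is written to your own project folder), and the project name is hypothetical.

```r
# build the tree inside a temporary directory for safe experimenting
project <- file.path(tempdir(), "my_project")   # hypothetical project name

subdirs <- c("documents",
             "data/source",
             "data/derived",
             "analysis",
             "figures")

for (d in subdirs) {
  # recursive = TRUE also creates the parent folder "data"
  dir.create(file.path(project, d), recursive = TRUE, showWarnings = FALSE)
}

list.dirs(project, full.names = FALSE)
```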
Set a working directory
setwd(), getwd()
use relative paths to save things
“./” = current directory
“./../” = one folder up
Read J. Scott Long’s “Workflow”
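A small sketch of relative paths in action. The folder and file names are made up, and a temporary directory is used so that the example does not touch your own files.

```r
old_wd <- getwd()                       # remember where we started

# temporary project with an "analysis" and a "data" subfolder
project <- file.path(tempdir(), "paths_demo")   # hypothetical name
dir.create(file.path(project, "analysis"), recursive = TRUE, showWarnings = FALSE)
dir.create(file.path(project, "data"), showWarnings = FALSE)

setwd(file.path(project, "analysis"))   # work from the analysis folder

# "./" is the current directory, "./../" is one folder up
write.csv(data.frame(x = 1:3), "./../data/example.csv", row.names = FALSE)
found <- file.exists("./../data/example.csv")
found

setwd(old_wd)                           # go back to where we started
```

Because the path is relative, the same script keeps working when the whole project folder is moved or shared.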
Assignment and Indexing
First, we’ll read in the HSN marriages again
hmar <- read.csv("./../data/derived/HSN_marriages.csv",
                 stringsAsFactors = FALSE,
                 encoding = "latin1",
                 header = TRUE,
                 nrows = 10000)
Change case of text
tolower()
toupper()
tolower("CaN we pleASe jUSt have LOWER cases?")
## [1] "can we please just have lower cases?"
names(hmar) <- tolower(names(hmar))
names(hmar)
## [1] "id_marriage" "idnr" "m_loc" "m_
## [5] "sex_hsnrp" "age_groom" "occ_groom" "ci
## [9] "sign_groom" "b_loc_groom" "l_loc_groom" "ag
## [13] "occ_bride" "civilst_bride" "sign_bride" "b_
## [17] "l_loc_bride" "a_f_groom" "occ_f_groom" "si
Indexing
There were way too many names to print on a slide... How many
names are there actually?
Use the length() command to find out:
length(names(hmar))
## [1] 29
So let’s print just the first two:
names(hmar)[1:2]
## [1] "id_marriage" "idnr"
The technique using square brackets is called indexing
# the last two names: count the names first, then index
x <- length(names(hmar))
names(hmar)[(x-1):x]
Using c() (concatenate) we can also extract several names at once
names(hmar)[c(1, 3, 5)]
## [1] "id_marriage" "m_loc" "sex_hsnrp"
We can also apply indexing to a data.frame:
hmar[1:2, 1:3]
## id_marriage idnr m_loc
## 1 1 1001 Abcoude-Baambrugge
## 2 2 1005 Baarn
# shows the first 2 rows and first 3 columns
# so, in general: data.frame[rows, columns]
head() and tail()
So you should now be able to replace head() and tail() with indexing
How?
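One possible answer, sketched on a made-up data.frame of 10 rows. It assumes the data has at least six rows, which head() and tail() themselves do not require.

```r
# made-up data.frame of 10 rows
df <- data.frame(id = 1:10, value = (1:10)^2)

n <- nrow(df)

first_six <- df[1:6, ]          # same rows as head(df)
last_six  <- df[(n - 5):n, ]    # same rows as tail(df)

all.equal(first_six, head(df))
all.equal(last_six, tail(df))
```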
Developed by Matt Dowle
Website:
https://github.com/Rdatatable/data.table/wiki
Why data.table?
fast subsetting on large files
more consistent ‘grammar’
less typing
Class: data.table
For data.table functions to work we need to define a data.frame as
class data.table
library(data.table)
is.data.table(hmar)
## [1] FALSE
hmar.dt <- data.table(hmar)
is.data.table(hmar.dt)
## [1] TRUE
is.data.frame(hmar.dt)
## [1] TRUE
Friends with benefits
Data.frame and data.table are like ‘friends with benefits’
all.equal(hmar, hmar.dt)
## [1] "Attributes: < Names: 2 string mismatches >"
## [2] "Attributes: < Length mismatch: comparison on first
## [3] "Attributes: < Component 1: Modes: character, extern
## [4] "Attributes: < Component 1: target is character, cur
## [5] "Attributes: < Component 2: Modes: numeric, characte
## [6] "Attributes: < Component 2: Lengths: 10000, 2 >"
## [7] "Attributes: < Component 2: target is numeric, curre
# so we have all the benefits of a data.frame
# ... and additional benefits of data.table
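To make the 'benefits' concrete, here is a small sketch on made-up data (only the column name m_loc is borrowed from the marriage file). It shows that data.frame functions still work on a data.table, and that selecting a column is written slightly differently.

```r
library(data.table)

# made-up data; only the column layout mimics the marriage file
df <- data.frame(id = 1:3,
                 m_loc = c("Baarn", "Utrecht", "Baarn"),
                 stringsAsFactors = FALSE)
dt <- data.table(df)

# data.frame functions keep working on a data.table
nrow(dt)
names(dt)

# selecting a column: quoted name for data.frame, bare name for data.table
from_df <- df[1:2, "m_loc"]
from_dt <- dt[1:2, m_loc]
identical(from_df, from_dt)
```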
NB: the next series of commands will only work for data.tables
Sort with setkey
Often we want to sort our data. We can do so with setkey(), or with setkeyv() when the column names are given as character strings
hmar.dt[1:6, m_year]
## [1] 1849 1851 1864 1840 1843 1858
# note for data.frame hmar it would be:
# hmar[1:6, "m_year"]
setkeyv(hmar.dt, "m_year")
hmar.dt[1:6, m_year]
## [1] 1831 1831 1833 1833 1834 1834
identical(hmar.dt, hmar)
## [1] FALSE
Multiple keys
It is also possible to sort on multiple keys
setkeyv(hmar.dt, c("id_marriage", "idnr"))
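A minimal sketch of sorting on two keys, using a made-up table so that the effect is easy to see. The column names id and year are illustrative only.

```r
library(data.table)

# made-up data, deliberately out of order
dt <- data.table(id   = c(3L, 1L, 2L, 4L),
                 year = c(1850L, 1840L, 1850L, 1840L))

setkeyv(dt, c("year", "id"))   # sort by year first, then by id within year

key(dt)     # the columns the table is currently sorted on
dt
```

Note that setkeyv() changes dt itself; there is no need to assign the result to a new object.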
Subsetting
groom.sig <- hmar.dt[age_groom > 30, ]
dim(groom.sig)
## [1] 2493 29
groom.sig <- hmar.dt[sign_groom == "h", ]
dim(groom.sig)
## [1] 9590 29
groom.sig <- hmar.dt[sign_groom == "h" &
                     age_groom > 30, ]
dim(groom.sig)
## [1] 2358 29
groom.sig <- hmar.dt[m_year != 1840,
                     list(id_marriage, idnr)]
dim(groom.sig)
## [1] 9985 2
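The general pattern is data.table[rows to keep, columns to keep]. The sketch below repeats the three kinds of subset on a made-up table small enough to check by eye; the code "x" in the sign column is invented for the example.

```r
library(data.table)

# made-up data: five marriages
dt <- data.table(id   = 1:5,
                 age  = c(22L, 35L, 41L, 28L, 33L),
                 sign = c("h", "h", "x", "h", "x"))

older        <- dt[age > 30, ]                # one condition on the rows
older_signed <- dt[sign == "h" & age > 30, ]  # two conditions combined with &
ids_only     <- dt[age != 28, list(id)]       # rows and columns at once

dim(older)
dim(older_signed)
dim(ids_only)
```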
Creating new variables
Let’s create a variable for the mean age at marriage of grooms
hmar.dt[, mean.gage := mean(age_groom)]
summary(hmar.dt$age_groom)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -2.00 24.00 26.00 28.38 30.00 79.00
summary(hmar.dt$mean.gage)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 28.38 28.38 28.38 28.38 28.38 28.38
Another example (from yesterday)
Dummy variable for equal municipality of birth
hmar.dt[, eq_b_loc := (b_loc_groom == b_loc_bride)]
summary(hmar.dt$eq_b_loc)
## Mode FALSE TRUE NA's
## logical 6957 3043 0
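Both ways of creating a variable with := can be tried on a made-up table of three marriages. The column names below are shortened stand-ins, not the names in the marriage file.

```r
library(data.table)

# made-up data: birth places of groom and bride, and the groom's age
dt <- data.table(b_groom = c("Baarn", "Utrecht", "Baarn"),
                 b_bride = c("Baarn", "Baarn", "Utrecht"),
                 age     = c(24, 30, 27))

dt[, mean_age := mean(age)]               # same value repeated in every row
dt[, same_place := (b_groom == b_bride)]  # TRUE where the places match

dt
```

The := operator adds the column to dt directly, so the result does not have to be assigned.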
Creating variables by group
As we saw, a variable holding the overall mean age wasn’t really interesting
Better: the average age of grooms at marriage by civil status
hmar.dt[, gage.mean.civ := mean(age_groom),
        by = civilst_groom]
table(hmar.dt$civilst_groom, hmar.dt$gage.mean.civ)
##
## 27.2427939112599 40.8829787234043 42.9548286604361
## 1 9263 0 0
## 2 0 0 642
## 3 0 94 0
## 6 0 0 0
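The same idea on a made-up table with two civil status codes. Using unique() on the two columns is one compact alternative to table() for checking the result (an illustration, not the course's own approach).

```r
library(data.table)

# made-up data: civil status code and age of five grooms
dt <- data.table(civil = c(1L, 1L, 2L, 2L, 1L),
                 age   = c(24, 26, 40, 44, 28))

# every row gets the mean age of its own civil status group
dt[, mean_age_civ := mean(age), by = civil]

# one line per group: which mean belongs to which status
unique(dt[, list(civil, mean_age_civ)])
```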
Summary subsets of the data
So far, we added variables to the original data.frame
this can be redundant though
Think of context, say municipalities
archival material on characteristics, e.g.:
population
steam power
You can also make context characteristics by aggregation
mc <- hmar.dt[, mean(age_groom), by = b_loc_groom]
summary(mc)
## b_loc_groom V1
## Length:1184 Min. :-2.00
## Class :character 1st Qu.:26.00
## Mode :character Median :28.17
## Mean :29.36
## 3rd Qu.:31.00
## Max. :69.00
We can improve by naming the variable directly, and adding more
variables
mc2 <- hmar.dt[, list(mean_gage = mean(age_groom),
                      mean_bage = mean(age_bride)),
               by = b_loc_groom]
summary(mc2)
## b_loc_groom mean_gage mean_bage
## Length:1184 Min. :-2.00 Min. :-2.00
## Class :character 1st Qu.:26.00 1st Qu.:23.80
## Mode :character Median :28.17 Median :25.88
## Mean :29.36 Mean :26.53
## 3rd Qu.:31.00 3rd Qu.:28.00
## Max. :69.00 Max. :64.00
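Unlike :=, this form returns a new, smaller table with one row per group. A sketch on made-up data, with a hypothetical column name place:

```r
library(data.table)

# made-up data: three marriages in two places
dt <- data.table(place     = c("Baarn", "Baarn", "Utrecht"),
                 age_groom = c(24, 28, 30),
                 age_bride = c(22, 26, 27))

# one row per place, with a named column for each summary
agg <- dt[, list(mean_gage = mean(age_groom),
                 mean_bage = mean(age_bride)),
          by = place]
agg
```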
One more... counts
Yesterday, we talked about the problem of overlapping points. We
used geom_jitter to solve it.
Now let’s do it properly:
mc3 <- hmar.dt[, list(frequency = .N),
               by = list(m_year, age_bride)]
# notice the .N ... N is often used for nr. of obs
library(ggplot2)
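What .N does is easiest to see on a handful of made-up rows: each combination of year and age gets one row, and frequency holds the number of original rows with that combination.

```r
library(data.table)

# made-up data: two brides aged 22 married in 1840, so those points overlap
dt <- data.table(year = c(1840L, 1840L, 1840L, 1841L),
                 age  = c(22L, 22L, 25L, 22L))

counts <- dt[, list(frequency = .N), by = list(year, age)]
counts
```

The frequencies always add up to the number of rows in the original table.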
Using colour
ggplot(mc3, aes(x = m_year, y = age_bride)) +
  geom_point(aes(colour = frequency),
             size = 10, shape = 18) +
  theme_bw()
[Figure: age_bride (y-axis, 20 to 60) against m_year; point colour shows frequency (legend 10 to 30)]
Using size
ggplot(mc3, aes(x = m_year, y = age_bride)) +
  geom_point(aes(size = frequency),
             colour = "blue", shape = 18) +
  theme_bw()
[Figure: age_bride (y-axis, 20 to 60) against m_year; point size shows frequency (legend 10 to 30)]
Box and whisker plot
Distribution of data
Median: 50% of the cases above and below
Box: 1st and 3rd quartile
Interquartile range (IQR): Q3-Q1
Outliers (Tukey, 1977):
x < Q1 - 1.5*IQR
x > Q3 + 1.5*IQR
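The outlier rule can be worked through by hand on a few made-up values. This sketch uses quantile() with its default settings; boxplot() itself computes the box from Tukey's hinges, which can differ slightly from these quartiles in small samples.

```r
# made-up values; 40 lies far from the rest
x <- c(2, 4, 4, 5, 6, 7, 8, 9, 10, 40)

q1  <- unname(quantile(x, 0.25))   # first quartile
q3  <- unname(quantile(x, 0.75))   # third quartile
iqr <- q3 - q1                     # interquartile range, same as IQR(x)

lower <- q1 - 1.5 * iqr            # anything below this is an outlier
upper <- q3 + 1.5 * iqr            # anything above this is an outlier

outliers <- x[x < lower | x > upper]
outliers
```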
hmar.dt[, sign.bride.cln := sign_bride == "h"]
hmar.dt[age_bride < 14, age_bride := NA]
# NB: no missing values here, but mind this when recoding!
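Why missing values need minding: once a column contains NA, functions such as mean() return NA unless told otherwise. A sketch with made-up ages, using the same threshold of 14 as above:

```r
library(data.table)

# made-up ages; -2 and 13 are implausible ages at marriage
dt <- data.table(age = c(-2L, 13L, 20L, 35L))

# recode implausible ages to missing
dt[age < 14L, age := NA_integer_]

mean(dt$age)                 # NA, because of the missing values
mean(dt$age, na.rm = TRUE)   # mean of the remaining ages
```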
boxplot(hmar.dt$age_bride ~ hmar.dt$sign.bride.cln,
        names = c("not signed", "signed"),
        col = c("red", "green"))
[Figure: box plots of age_bride (y-axis, 20 to 70) for brides who did not sign and brides who signed]