Three half-day lectures with practicals on introductory aspects of R, including: installing R and RStudio, reading in and writing out data, data cleaning, descriptive statistics, and data visualization (including visual analysis). Courtesy of the European Historical Sample Population Network and the Babeş-Bolyai University (Cluj-Napoca, Romania).
1. Quantitative research methods
Data analysis workflow
Statistical Software
Installing R and RStudio
Getting help
Introduction into R
Part 1A
Richard L. Zijdeman
2016-06-15
Richard L. Zijdeman Introduction into R
1 Quantitative research methods
2 Data analysis workflow
3 Statistical Software
4 Installing R and RStudio
5 Getting help
Quantitative research methods
Why
To answer descriptive and explanatory questions on populations
Workflow: PTE
problem (research question)
theory (hypothesis)
empirical test ... with loops between T-E and P-T-E
Research Questions
descriptive (to what extent ...)
comparative (comparing two entities)
trend (comparison over time)
explanatory (focus on mechanism at hand)
Theory
deductive reasoning
explanans
general mechanism
condition
explanandum (hypothesis)
Empirical test
sample vs. population
random vs. stratified samples
testing technique, e.g.:
T-test, correlation, regression
Software required for faster analysis
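As a rough illustration of the three testing techniques named above, the sketch below applies each of them to the `mtcars` dataset that ships with R. The dataset and the variables are assumptions chosen for illustration only.

```r
# Illustrative only: mtcars and the chosen variables are assumptions.
data(mtcars)

# T-test: do manual (am = 1) and automatic (am = 0) cars differ in mean mpg?
tt <- t.test(mpg ~ am, data = mtcars)

# Correlation: linear association between horsepower and fuel economy
r <- cor(mtcars$hp, mtcars$mpg)

# Regression: mpg explained by horsepower
fit <- lm(mpg ~ hp, data = mtcars)

print(tt$p.value)
print(r)
print(coef(fit))
```

Each call returns an object that can be inspected further, e.g. with `summary(fit)`.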
Data analysis workflow
Empirical testing has its own workflow
Grolemund & Wickham, 2016, Creative Commons
Attribution-NonCommercial-NoDerivs 4.0.
Statistical Software
The dangers of analysing with spreadsheets
(e.g. MS Excel)
tempting to input and clean data and analyse in the same sheet
difficult to track cleaning rules
defaults mess up your data (e.g. 01200 -> 1200)
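The leading-zero problem is not unique to spreadsheets: R's `read.csv()` also guesses column types. A minimal sketch, assuming a hypothetical column `code` that holds identifiers such as postal codes, shows the default behaviour and how declaring the column as text avoids it.

```r
# Hypothetical two-row CSV; the column names are made up for illustration.
csv_text <- "code,n\n01200,3\n00450,7"

# Default: 'code' is guessed to be a number, so the leading zeros are lost
guessed <- read.csv(text = csv_text)

# Declaring the column as character keeps the values exactly as written
kept <- read.csv(text = csv_text, colClasses = c(code = "character"))

print(guessed$code)
print(kept$code)
```

Because the rule sits in the script, it is also documented and repeatable, unlike a cell format set by hand.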
Why use syntax (scripting)
Efficiency (really)
Quality (error checking)
Replicability
Communication
R
R is open source, which is good and bad:
anybody can contribute (check, improve, create code)
free of charge
but: R depends on collective action
cannot ‘demand’ support
sprawl of packages
RStudio
browser for R
provides easy access to:
scripts
data
plots
manual
Installing R and RStudio
Download R
Instructions via http://www.r-project.org
Choose a CRAN mirror
http://cran.r-project.org/mirrors.html
close, but active too!
Romania hasn’t got one (yet!)
Click on ‘Download R for Windows’
Follow usual installation procedure
Double click on R
You should now have a working session!
Close the session, do not save workspace image
Packages and libraries
base R (core product)
additional packages
CRAN repository
spread through ‘mirrors’
choose a local, but active mirror
GitHub
packages not on CRAN
development versions of CRAN libraries
RStudio
RStudio is found on http://www.rstudio.com
Download the version for your OS (e.g. Windows)
http://www.rstudio.com/products/rstudio/download/
Install by double clicking on the downloaded file
Start RStudio by double clicking on the icon
You do not need to start R before starting RStudio
Getting help
Built-in help: “?”
?[function] / ?[package]
e.g. “?plot” or “?graphics”
check the index for user guides and vignettes
CRAN website
Manuals
R FAQ
R Journal
Online communities
Stack Overflow
Part of the Stack Exchange network
Reputation based Q&A
Specific lists for packages, e.g.:
ggplot2
R-sig-mixed-models
Asking a question, getting an answer
Search the web: others must have had this problem too
If you raise a question:
be polite
be concise
short background
reproducible example
summarise your efforts so far
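One common way to make an example reproducible is to include a small piece of data in a form that others can paste straight into R. The sketch below uses `dput()` for this; the slice of `mtcars` is an arbitrary choice for illustration.

```r
# Take a tiny, arbitrary slice of a built-in dataset
small <- head(mtcars[, c("mpg", "hp")], 3)

# dput() prints R code that recreates the object;
# paste that output into your question
dput(small)

# Anyone can rebuild the data from that text, simulated here via deparse()
rebuilt <- eval(parse(text = deparse(small)))
print(rebuilt)
```

Keeping the data this small forces the question down to the essentials, which tends to make it easier to answer.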
25. Introducing RStudio and R
Introducing base R
Data visualization using ggplot2
Introduction into R
Part 1B
Richard L. Zijdeman
2016-06-15
1 Introducing RStudio and R
2 Introducing base R
3 Data visualization using ggplot2
Introducing RStudio and R
RStudio
RStudio is sort of a ‘viewer’ on R
helps to organize input and output:
editor (upper left)
console (lower left)
environment (upper right)
output (lower right)
R script
series of commands to manipulate data
always save your script, NEVER change your data
original data + script = reproducible research
Packages
Build your R system using packages
‘Base R’ is basic. Add packages for your specific needs
Packages are found on servers, called ‘mirrors’
Make sure to select a mirror first
https://cran.r-project.org/mirrors.html
## To set the mirror for the current session, type
## (put the same line in your .Rprofile to make it permanent):
options(repos = c(CRAN = "http://cran.xl-mirror.nl"))
## replace http://... with your favorite mirror
Packages for book (see 1.4.2)
pkgs <- c(
"broom", "dplyr", "ggplot2", "jpeg", "jsonlite",
"knitr", "Lahman", "microbenchmark", "png", "pryr",
"purrr", "rcorpora", "readr", "stringr", "tibble",
"tidyr"
)
install.packages(pkgs)
R Session
contains scripts, data, functions
can be saved as a ‘workspace image’
prefer not to:
sessions are usually cluttered
only useful if running script takes time
Suggested tweak:
Options: uncheck “Restore .RData into workspace at startup”
Options: Save workspace to .RData on exit, select ‘never’
Introducing base R
base R: assignment and print()
‘attach’ values to an object (e.g. a variable)
x <- 5
y <- 4
z <- x * y
print(z)
## [1] 20
base R: assignment and print() (II)
Try and imagine the potential of assignment
x <- c(4, 3, 2, 1, 0, 27, 34, 35)
# c for concatenate values
y <- -1
z <- x*y
print(z)
## [1] -4 -3 -2 -1 0 -27 -34 -35
base R: data.frame
basically a table
contains columns (variables)
contains rows (cases)
“flat table” in Kees’ terminology
my.df <- data.frame(x,z)
str(my.df) # show STRucture
## 'data.frame': 8 obs. of 2 variables:
## $ x: num 4 3 2 1 0 27 34 35
## $ z: num -4 -3 -2 -1 0 -27 -34 -35
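A few base R functions for looking inside a data frame, shown on a table built the same way as above. This is only a sketch of common idioms, not an exhaustive list.

```r
# Rebuild the example table so this block runs on its own
x <- c(4, 3, 2, 1, 0, 27, 34, 35)
z <- x * -1
my.df <- data.frame(x, z)

dim(my.df)                    # number of rows (cases) and columns (variables)
names(my.df)                  # variable names
my.df$x                       # one column, as a vector
big <- my.df[my.df$x > 20, ]  # rows where x exceeds 20, all columns
print(big)
```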
There’s much more, but let’s keep that for tomorrow
Data visualization using ggplot2
Visualizing your data
Not just for analyses!
Data quality
representativeness
missing data
plot() in base R
library(help = "datasets") # all datasets in R
?mtcars # show help on mtcars dataset
df <- mtcars # mtcars is a dataset, not a function: no parentheses
str(mtcars) # display STRucture of an object
plot(mtcars$hp, mtcars$mpg)
plot(df)
plot() is like . . .
plot() is like LaTeX:
Forge it any way you want
Heterogeneous approach though
Takes quite some time to get it right
ggplot() as alternative
ggplot2 is but one of many graph packages
ggplot2 is nice because of:
similar approach to various types of graphs
easy build up for basic graphs
can get quite complex too
(but cannot do it all)
ggplot() and the canvas metaphor
ggplot() consists of two elements
canvas
(multiple) layers of paint
mapping and geom layers
ggplot() consists of two elements
canvas:
data
mapping (aesthetic)
(multiple) layers of paint
geom layers
ggplot(data = <DATASET>,
mapping = aes(x = <X-VAR>, y = <Y-VAR>)) +
geom_<TYPE>()
our first ggplot
install.packages("ggplot2")
library(ggplot2)
df <- mtcars
ggplot(data = df, aes(x = hp, y = mpg)) +
geom_point()
geom_ features
? geom_point
install.packages("ggplot2")
library(ggplot2)
df <- mtcars
ggplot(data = df, aes(x = hp, y = mpg)) +
geom_point(fill = "white", colour = "blue",
shape = 21, size = 4)
Adding characteristics to your plot
Add variables to explain a pattern
ggplot(data = df, aes(x = hp, y = mpg)) +
geom_point(aes(colour = wt), size = 4)
NB: notice the difference?
ggplot(data = df, aes(x = hp, y = mpg)) +
geom_point(aes(colour = wt, size = 4))
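The difference is whether `size` is a fixed setting or an aesthetic mapping. The sketch below builds both plots and inspects the first layer to show where `size` ends up. Peeking at `layers[[1]]` relies on ggplot2 internals and is done here only for illustration.

```r
library(ggplot2)
df <- mtcars

# size outside aes(): a fixed setting, every point is drawn at size 4
p_set <- ggplot(data = df, aes(x = hp, y = mpg)) +
  geom_point(aes(colour = wt), size = 4)

# size inside aes(): the constant 4 is treated as data and mapped
# through the size scale, which also adds a legend entry
p_map <- ggplot(data = df, aes(x = hp, y = mpg)) +
  geom_point(aes(colour = wt, size = 4))

p_set$layers[[1]]$aes_params$size   # fixed parameter of the layer
names(p_map$layers[[1]]$mapping)    # size appears among the mappings
```

As a rule of thumb: put variables inside `aes()`, and constants outside it.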
Multiple geoms
Add variables to explain a pattern
ggplot(data = df, aes(x = hp, y = mpg)) +
geom_point(aes(colour = as.factor(am)),
size = 6) + # increase size bc overlap
geom_point(aes(shape = as.factor(vs)),
size = 3)
# vs: engine shape, V-shaped (0) or straight (1)
Adding facets
Facets help reduce complexity
ggplot(data = df, aes(x = hp, y = mpg)) +
geom_point(aes(colour = as.factor(am)),
size = 4) +
facet_wrap( ~ vs)
Things to consider with geom(_point)
fill only works where shape actually can be filled
consider order of geoms
mind overlap:
decrease size
use alpha
use ‘open’ shapes
geom_jitter
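Two of the overlap remedies listed above, transparency and jittering, in a short sketch. The particular `alpha`, `width` and `height` values are arbitrary assumptions; suitable values depend on the data.

```r
library(ggplot2)
df <- mtcars

# Semi-transparent points: places where points overlap show up darker
p_alpha <- ggplot(data = df, aes(x = cyl, y = mpg)) +
  geom_point(alpha = 0.4, size = 4)

# Jitter: add a little random horizontal noise so points stop stacking
p_jitter <- ggplot(data = df, aes(x = cyl, y = mpg)) +
  geom_jitter(width = 0.2, height = 0, size = 4)

print(p_alpha)
print(p_jitter)
```

Setting `height = 0` keeps the mpg values exact, so only the horizontal position is perturbed.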
ggplot and titles
Various ways to add titles to axes and stuff
Can get quite complex
Here are the basics
ggplot(data = df, aes(x = hp, y = mpg)) +
geom_point() +
labs(title = "Nice graph", x = "Horse Power",
y = "Miles per Gallon" )
Themes and size
ggplot(data = df, aes(x = hp, y = mpg)) +
geom_point() +
labs(title = "Nice graph", x = "Horse Power",
y = "Miles per Gallon" ) +
theme_bw(base_size = 16)
Much more to learn
not just about ggplot()
axes
legend (guides)
geoms
also about dataviz in general
general do’s and don’ts
which problem fits which graph
it’s a science! (statistical graphics)
53. Data wrangling
bit about NA
Introduction into R
Part 2A, 2B
Richard L. Zijdeman
2016-06-16
1 Data wrangling
2 bit about NA
Grolemund & Wickham, 2016, Creative Commons
Attribution-NonCommercial-NoDerivs 4.0.
dplyr package
# install.packages("dplyr") # 1 time only
library(dplyr)
install.packages("nycflights13")
library(nycflights13)
print(flights)
tibble or data_frame vs data.frame
str(mtcars)
class(mtcars)
mtcars_tbl <- as_tibble(mtcars) # as_data_frame() is deprecated
str(mtcars_tbl)
class(mtcars_tbl)
filter
filter(mtcars, am == 1, vs == 0)
some.cars <- filter(mtcars, am == 1, vs == 0)
some.cars
(some.cars2 <- filter(mtcars, am == 1, vs == 0))
filter and using or
filter(mtcars, gear == 3 | gear == 4) # !! not like this:
filter(mtcars, gear == 3 | 4)
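Why the second form fails: `gear == 3 | 4` is read as `(gear == 3) | 4`, and any non-zero number counts as TRUE, so every row passes. The sketch below shows this in base R on `mtcars`, together with `%in%` as a compact alternative that can also be used inside `filter()`.

```r
gear <- mtcars$gear

# Intended: cars with 3 or 4 gears
ok <- gear == 3 | gear == 4

# Mistake: 4 is coerced to TRUE, so the whole condition is always TRUE
wrong <- gear == 3 | 4

# Compact alternative for "equal to any of these values"
alt <- gear %in% c(3, 4)

sum(ok)             # number of rows kept by the intended condition
sum(wrong)          # equals nrow(mtcars): nothing is filtered out
identical(ok, alt)  # both correct forms select the same rows
```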
Arrange
arrange(flights, dep_time)
arrange(flights, year, month, day) # ascending order
arrange(flights, desc(day))
# NB: missing values come at end
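The note on missing values can be checked on a small example. The toy data frame below is hypothetical and serves only to show where `arrange()` places `NA`, in ascending as well as descending order.

```r
library(dplyr)

# Hypothetical toy data with one missing value
toy <- data.frame(id = c("a", "b", "c"),
                  x = c(2, NA, 1),
                  stringsAsFactors = FALSE)

asc <- arrange(toy, x)        # ascending
dsc <- arrange(toy, desc(x))  # descending

asc$id  # the row with NA comes last
dsc$id  # still last, even in descending order
```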
Select
df <- select(flights, year, month, day)
names(flights)
df <- select(flights, tailnum:dest)
df <- select(flights, -(tailnum:dest))
df
df <- select(flights, starts_with("arr_"))
df <- select(flights, ends_with("e"))
df <- select(flights, contains("a"))
rename
df <- rename(flights, Y_ear = year)
df <- mutate(flights, year1 = year+1)
select(df, year, year1)
df <- mutate(flights, year1 = year + 1, year2 = year1+1)
select(df, contains("year"))
df <- transmute(flights, year1 = year + 1, year2 = year1+1)
# only maintains the newly created variables
group_by
by_day <- group_by(flights, year, month, day)
summarise(by_day)
cars <- mtcars
cars <- as_tibble(mtcars) # as_data_frame() is deprecated
summarise(cars, mean_hp = mean(hp, na.rm = TRUE))
mean(cars$hp, na.rm = TRUE)
the pipe: %>%
cars_grp <- group_by(cars, carb)
class(cars)
class(cars_grp)
summarise(cars_grp, mmpg = mean(mpg, na.rm = TRUE))
cars_grp_sum <- summarise(cars_grp,
mmpg = mean(mpg, na.rm = TRUE),
count = n())
cars_grp_sum
plot <- ggplot(cars_grp_sum,
aes(x = carb, y = mmpg,
label = carb)) +
geom_point(aes(size = count)) +
geom_text(colour = "cyan")
plot
more pipe, adding a filter
cars_grp_sum3 <- cars %>%
group_by(carb) %>%
summarise(mmpg = mean(mpg, na.rm = TRUE),
count = n()) %>%
filter(count > 3)
ggplot(cars_grp_sum3, aes(x = carb, y = mmpg, label = carb)) +
geom_point(aes(size = count)) +
geom_text(colour = "cyan") +
labs(title = "figure with %>% and count > 3")
68. Session management
Basic data manipulation
Introduction into R
Part 3A
Richard L. Zijdeman
2016-06-17
1 Session management
2 Basic data manipulation
Maintaining your workspace
Grolemund & Wickham, 2016, Creative Commons
Attribution-NonCommercial-NoDerivs 4.0.
Setting up a session
clear your Environment
check sessionInfo() for loaded packages
detach obsolete packages under ‘other attached packages’
set your working directory (“\\” on Windows and “/” on Linux/Mac)
load libraries (install new ones)
load your data
Example session setup
rm(list = ls())
sessionInfo() # check for other attached packages
detach("package:nycflights13", unload = TRUE)
setwd("/Users/RichardZ/Dropbox/Summer school 2016/Richard Zijdeman/")
getwd() # to see whether you're in the right directory
dir() # shows what's in your directory
Loading your data
read.table() (generic function)
read.csv()
library(foreign) # e.g. SPSS and Stata
library(readxl) # fast excel-package
Reading in data
Different functions for different files:
Base R: read.table() (read.csv())
foreign package: read.spss(), read.dta(), read.dbf()
readxl
alternative packages:
xlsx (Java required)
gdata (Perl-based)
openxlsx package: read.xlsx()
read.csv()
file: your file, including directory
header: variable names or not?
sep: separator
read.csv default: “,”
read.csv2 default: “;”
skip: number of rows to skip
nrows: total number of rows to read
stringsAsFactors
encoding (e.g. “latin1” or “UTF-8”)
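The arguments above in action, as a sketch on a small hypothetical file that is first written to a temporary location. The file contents, the note line on top and the column names are all invented for illustration.

```r
# Write a small semicolon-separated file with one line of notes on top
path <- tempfile(fileext = ".csv")
writeLines(c("exported for class",
             "name;age",
             "Anna;34",
             "Ion;41",
             "Maria;29"),
           path)

people <- read.csv(path,
                   header = TRUE,             # first row read holds names
                   sep = ";",                 # override the "," default
                   skip = 1,                  # skip the note line
                   nrows = 2,                 # read only two data rows
                   stringsAsFactors = FALSE,  # keep text as character
                   encoding = "UTF-8")
str(people)
```

With `read.csv2()` the `sep = ";"` argument could be dropped, since the semicolon is its default.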
read_excel from readxl package
path: your file, including directory
sheet: name or number of sheet
col_names: col names in 1st row?
col_types: specify type
na: what’s the sign for missing values
skip: how many rows to skip before data starts
Example session loading your csv data
# setwd() to set your working directory
hmar100 <- read.csv("./Datafiles_HSN/HSN_marriages.csv",
stringsAsFactors = FALSE,
encoding = "latin1",
header = TRUE,
nrows = 100) # just first 100 rows
Example session loading your excel data
# setwd() to set your working directory
install.packages("readxl")
library("readxl")
hmar <- read_excel("./Datafiles_HSN/HSN_marriages_awful.xls
col_names = TRUE,
skip = 3) # empty lines not counted!!!
Change case of text
tolower()
toupper()
tolower("CaN we pleASe jUSt have LOWER cases?")
names(hmar) <- tolower(names(hmar))
length()
Used to count how many instances there are
length(names(hmar))
# shows number of variables in hmar
86. Basic statistical techniques
Box and whisker plot
Distribution of data
Median: 50% of the cases above and below
Box: 1st and 3rd quartile
Interquartile range (IQR): Q3-Q1
Outliers (Tukey, 1977):
x < Q1 - 1.5*IQR
x > Q3 + 1.5*IQR
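The outlier rule above can be computed by hand in a few lines. The sketch uses horsepower from `mtcars` as an assumed example variable. Note that `quantile()` offers several quartile definitions, and that base R's `boxplot()` works with hinges that can differ slightly from these quartiles.

```r
hp <- mtcars$hp

q1  <- quantile(hp, 0.25, names = FALSE)  # first quartile
q3  <- quantile(hp, 0.75, names = FALSE)  # third quartile
iqr <- q3 - q1                            # interquartile range

lower <- q1 - 1.5 * iqr  # lower fence
upper <- q3 + 1.5 * iqr  # upper fence

# Values outside the fences are flagged as outliers
outliers <- hp[hp < lower | hp > upper]
print(outliers)
```

Flagged values are candidates for a closer look, not automatically errors.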
p <- ggplot(hmar, aes(sign_groom, age_groom))
p + geom_boxplot()
hmar <- mutate(hmar, sign_groomD = (sign_groom == "h" & !(i
p <- ggplot(hmar, aes(sign_groomD, age_groom))
p + geom_boxplot()
hmar <- mutate(hmar, sign_groomD = (sign_groom == "h" & !(i
p <- ggplot(hmar, aes(sign_groomD, age_groom))
p + geom_boxplot() + geom_jitter(shape = 24, width = 0.2)
A small PTE project
Look at the variables in the HSN files
Think of a research question
Provide a general mechanism and hypothesis
Plot your results