Recap
Data manipulation
data.table package
Basic statistical techniques
Data manipulation in R
Richard L. Zijdeman
May 29, 2015
Richard L. Zijdeman Data manipulation in R
1 Recap
2 Data manipulation
3 data.table package
4 Basic statistical techniques
Recap
What we’ve seen so far
functions to read in data
read.csv(), read.xlsx()
objects
assignment <-
characteristics, e.g.:
str(), summary(), head(), tail()
calculations
mean(), min(), max()
plotting
plot()
ggplot()
paint by ‘layer’
Before we go on. . .
Structure your R script
Filename, Date, Purpose, Author, Last change
Use comments to tell what you are doing
read in data
changing variables (why did you do it)
Create a working directory, with subdirs
+ documents
+ data
- source
- derived
+ analysis
+ figures
Set a working directory
setwd(), getwd()
use relative paths to save things
“./” = current directory
“./../” = one folder up
Read J. Scott Long’s “Workflow”
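A minimal sketch of building relative paths, assuming the folder layout from the previous slide. The file name `example.csv` is hypothetical, and nothing is read or written here.

```r
# Build relative paths instead of hard-coding absolute ones.
# Folder names follow the layout above; "example.csv" is a hypothetical file.
here    <- "."                     # current working directory
up      <- file.path(".", "..")    # one folder up
derived <- file.path(up, "data", "derived", "example.csv")

print(derived)

# normalizePath() shows which absolute location a relative path points to
print(normalizePath(here))
```

Because the path is relative, the same script keeps working when the whole project folder is moved, as long as the working directory is set inside it.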
Data manipulation
Assignment and Indexing
First, we’ll read in the HSN marriages again
hmar <- read.csv("./../data/derived/HSN_marriages.csv",
                 stringsAsFactors = FALSE,
                 encoding = "latin1",
                 header = TRUE,
                 nrows = 10000)
Change case of text
tolower()
toupper()
tolower("CaN we pleASe jUSt have LOWER cases?")
## [1] "can we please just have lower cases?"
names(hmar) <- tolower(names(hmar))
names(hmar)
## [1] "id_marriage" "idnr" "m_loc" "m_
## [5] "sex_hsnrp" "age_groom" "occ_groom" "ci
## [9] "sign_groom" "b_loc_groom" "l_loc_groom" "ag
## [13] "occ_bride" "civilst_bride" "sign_bride" "b_
## [17] "l_loc_bride" "a_f_groom" "occ_f_groom" "si
Indexing
There were way too many names to print on a slide. . . How many
names are there actually?
Use the length() command to find out:
length(names(hmar))
## [1] 29
So let’s print just the first two:
names(hmar)[1:2]
## [1] "id_marriage" "idnr"
The technique using square brackets is called indexing
Any idea how we would show the last two names?
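x <- length(names(hmar))  # not length(names): that is the length of the function itself, i.e. 1
names(hmar)[(x-1):x]
# returns the last two of the 29 names
Using c() (concatenate) we can also extract several names at once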
names(hmar)[c(1, 3, 5)]
## [1] "id_marriage" "m_loc" "sex_hsnrp"
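The indexing rules above can be sketched on a small vector. The vector `v` is a hypothetical stand-in for `names(hmar)`, so the example runs without the HSN data.

```r
# Hypothetical stand-in for names(hmar)
v <- c("a", "b", "c", "d", "e")
n <- length(v)

first_two     <- v[1:2]          # a range of positions
last_two      <- v[(n - 1):n]    # parentheses matter: ':' binds tighter than '-'
picked        <- v[c(1, 3, 5)]   # arbitrary positions, combined with c()
without_first <- v[-1]           # a negative index drops that position

print(last_two)
print(picked)

# Without parentheses, n - 1:n means n - (1:n), i.e. c(4, 3, 2, 1, 0)
reversed <- v[n - 1:n]
print(reversed)
```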
We can also apply indexing to a data.frame:
hmar[1:2, 1:3]
## id_marriage idnr m_loc
## 1 1 1001 Abcoude-Baambrugge
## 2 2 1005 Baarn
# shows the first 2 rows and first 3 columns
# so, in general: data.frame[rows, columns]
head() and tail()
So actually, you should now be able to replace head() and tail()
How?
# head()
hmar[1:6, ]
# tail(): the last 6 rows run from y-5 to y
y <- nrow(hmar)
hmar[(y-5):y, ]
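To check that row indexing reproduces head() and tail(), here is a small sketch on a hypothetical 10-row data.frame (not the HSN data). Note the off-by-one: the last six rows run from y-5 to y, whereas (y-6):y selects seven rows.

```r
# Hypothetical data.frame standing in for hmar
df <- data.frame(id = 1:10, value = (1:10)^2)
y  <- nrow(df)

# Compare the built-in helpers with plain indexing
same_head <- isTRUE(all.equal(head(df), df[1:6, ]))
same_tail <- isTRUE(all.equal(tail(df), df[(y - 5):y, ]))
print(c(same_head, same_tail))

# (y - 6):y covers seven row numbers, one more than tail() returns
seven_rows <- nrow(df[(y - 6):y, ])
print(seven_rows)
```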
data.table package
Developed by Matt Dowle
Website:
https://github.com/Rdatatable/data.table/wiki
Why data.table?
fast subsetting on large files
more consistent ‘grammar’
less typing
install.packages("data.table")
library(data.table)
Class: data.table
For data.table functions to work we need to convert the data.frame to
class data.table
is.data.table(hmar)
## [1] FALSE
hmar.dt <- data.table(hmar)
is.data.table(hmar.dt)
## [1] TRUE
is.data.frame(hmar.dt)
## [1] TRUE
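A minimal sketch of the conversion on a hypothetical three-row data.frame. It uses as.data.table(), an alternative to calling data.table() on the data.frame, and shows that the result carries both classes.

```r
library(data.table)

# Hypothetical data.frame standing in for hmar
df <- data.frame(id = 1:3, year = c(1850L, 1840L, 1845L))

# as.data.table() returns a converted copy; df itself stays a plain data.frame
dt <- as.data.table(df)

print(class(dt))          # both "data.table" and "data.frame"
print(is.data.table(df))  # the original is unchanged
```

Because a data.table is also a data.frame, functions that expect a data.frame, such as summary() or ggplot(), can generally still be used on it.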
Friends with benefits
Data.frame and data.table are like ‘friends with benefits’
all.equal(hmar, hmar.dt)
## [1] "Attributes: < Names: 2 string mismatches >"
## [2] "Attributes: < Length mismatch: comparison on first
## [3] "Attributes: < Component 1: Modes: character, extern
## [4] "Attributes: < Component 1: target is character, cur
## [5] "Attributes: < Component 2: Modes: numeric, characte
## [6] "Attributes: < Component 2: Lengths: 10000, 2 >"
## [7] "Attributes: < Component 2: target is numeric, curre
# so we have all the benefits of a data.frame
# ... and additional benefits of data.table
NB: the next series of commands will only work for data.tables
Sort with setkey
Often we want to sort our data. We can do so with setkey(), or with setkeyv() when the column name is passed as a character string
hmar.dt[1:6, m_year]
## [1] 1849 1851 1864 1840 1843 1858
# note for data.frame hmar it would be:
# hmar[1:6, "m_year"]   (or hmar$m_year[1:6])
setkeyv(hmar.dt, "m_year")
hmar.dt[1:6, m_year]
## [1] 1831 1831 1833 1833 1834 1834
identical(hmar.dt, hmar)
## [1] FALSE
Multiple keys
It is also possible to sort on multiple keys
setkeyv(hmar.dt, c("id_marriage", "idnr"))
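A small sketch of sorting on two keys, using a hypothetical four-row data.table with made-up `id` and `year` columns. setkeyv() reorders the rows in place, first by `year` and then by `id` within each year.

```r
library(data.table)

# Hypothetical data, deliberately unsorted
dt <- data.table(id   = c(3L, 1L, 2L, 4L),
                 year = c(1850L, 1840L, 1850L, 1840L))

# Sort by year, then by id within year; no assignment needed
setkeyv(dt, c("year", "id"))

print(key(dt))  # the columns the table is now keyed (sorted) on
print(dt)
```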
Subsetting
groom.sig <- hmar.dt[age_groom > 30, ]
dim(groom.sig)
## [1] 2493 29
groom.sig <- hmar.dt[sign_groom == "h", ]
dim(groom.sig)
## [1] 9590 29
groom.sig <- hmar.dt[sign_groom == "h" &
                     age_groom > 30, ]
dim(groom.sig)
## [1] 2358 29
groom.sig <- hmar.dt[m_year != 1840,
                     list(id_marriage, idnr)]
dim(groom.sig)
## [1] 9985 2
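The pattern in these examples is dt[rows, columns]: a condition on rows first, then optionally a list() of columns to keep. A sketch on hypothetical data follows. The value "x" for a non-signature is made up for illustration.

```r
library(data.table)

# Hypothetical data; "h" marks a signature as in the slides, "x" is a made-up other value
dt <- data.table(id   = 1:5,
                 age  = c(22L, 35L, 41L, 28L, 31L),
                 sign = c("h", "x", "h", "h", "x"))

older        <- dt[age > 30]                              # rows only
older_signed <- dt[sign == "h" & age > 30, list(id, age)] # rows and columns

print(dim(older))
print(older_signed)
```

Inside the brackets, column names can be used directly, without the `dt$` prefix that a data.frame would need.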
Creating new variables
Let’s create a variable for the mean age at marriage of grooms
hmar.dt[, mean.gage := mean(age_groom)]
summary(hmar.dt$age_groom)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -2.00 24.00 26.00 28.38 30.00 79.00
summary(hmar.dt$mean.gage)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 28.38 28.38 28.38 28.38 28.38 28.38
Another example (from yesterday)
Dummy variable for equal municipality of birth
hmar.dt[, eq_b_loc := (b_loc_groom == b_loc_bride)]
summary(hmar.dt$eq_b_loc)
## Mode FALSE TRUE NA's
## logical 6957 3043 0
Creating variables by group
As we saw, a variable holding the overall mean age is not very informative: every row gets the same value
More useful: the average age of grooms at marriage by civil status
hmar.dt[, gage.mean.civ := mean(age_groom),
        by = civilst_groom]
table(hmar.dt$civilst_groom, hmar.dt$gage.mean.civ)
##
## 27.2427939112599 40.8829787234043 42.9548286604361
## 1 9263 0 0
## 2 0 0 642
## 3 0 94 0
## 6 0 0 0
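To see the difference between an overall mean and a mean by group, here is a sketch on a hypothetical four-row data.table. The column names `status` and `age` are made up. Both columns are added in place with `:=`.

```r
library(data.table)

# Hypothetical data: two groups of two rows each
dt <- data.table(status = c(1L, 1L, 2L, 2L),
                 age    = c(24, 26, 40, 44))

# Same value on every row
dt[, mean_all := mean(age)]

# One value per group, repeated on the rows of that group
dt[, mean_by_status := mean(age), by = status]

print(dt)
```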
Summary subsets of the data
So far, we added variables to the original data.table
this can be redundant though: a group value is repeated on every row
Think of context, say municipalities
archival material on characteristics, e.g.:
population
steam power
You can also make context characteristics by aggregation
mc <- hmar.dt[, mean(age_groom), by = b_loc_groom]
summary(mc)
## b_loc_groom V1
## Length:1184 Min. :-2.00
## Class :character 1st Qu.:26.00
## Mode :character Median :28.17
## Mean :29.36
## 3rd Qu.:31.00
## Max. :69.00
We can improve by naming the variable directly, and adding more
variables
mc2 <- hmar.dt[, list(mean_gage = mean(age_groom),
                      mean_bage = mean(age_bride)),
               by = b_loc_groom]
summary(mc2)
## b_loc_groom mean_gage mean_bage
## Length:1184 Min. :-2.00 Min. :-2.00
## Class :character 1st Qu.:26.00 1st Qu.:23.80
## Mode :character Median :28.17 Median :25.88
## Mean :29.36 Mean :26.53
## 3rd Qu.:31.00 3rd Qu.:28.00
## Max. :69.00 Max. :64.00
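The aggregation above returns a new, smaller table with one row per group, rather than adding columns to the original. A sketch on hypothetical data, where the places "A" and "B" and the ages are made up:

```r
library(data.table)

# Hypothetical marriages in two places
dt <- data.table(place = c("A", "A", "B"),
                 gage  = c(24, 28, 30),   # groom ages
                 bage  = c(22, 24, 27))   # bride ages

# Without ':=' the result is a separate table: one row per place
agg <- dt[, list(mean_gage = mean(gage),
                 mean_bage = mean(bage)),
          by = place]

print(agg)
print(nrow(dt))  # the original table keeps all its rows
```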
One more. . . counts
Yesterday, we talked about the problem of overlapping points. We
used geom_jitter to solve it.
Now let’s do it properly:
mc3 <- hmar.dt[, list(frequency = .N),
               by = list(m_year, age_bride)]
# notice the .N: it holds the number of rows in each group (N is often used for nr. of obs)
library(ggplot2)
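What `.N` does can be checked on a tiny hypothetical table. Each combination of `year` and `age` becomes one row, with the number of original rows it represents.

```r
library(data.table)

# Hypothetical data: the combination (1850, 22) occurs twice
dt <- data.table(year = c(1850L, 1850L, 1850L, 1851L),
                 age  = c(22L, 22L, 25L, 22L))

# One row per (year, age) combination, with its count
counts <- dt[, list(frequency = .N), by = list(year, age)]
print(counts)

# The counts add up to the number of rows in the original table
print(sum(counts$frequency) == nrow(dt))
```

Plotting `counts` instead of the raw rows means overlapping points are drawn once, with the overlap carried by `frequency`.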
Using colour
ggplot(mc3, aes(x = m_year, y = age_bride)) +
  geom_point(aes(colour = frequency),
             size = 10, shape = 18) +
  theme_bw()
[Figure: scatter plot of age_bride (y-axis) by m_year, with point colour showing frequency]
Using size
ggplot(mc3, aes(x = m_year, y = age_bride)) +
  geom_point(aes(size = frequency),
             colour = "blue", shape = 18) +
  theme_bw()
[Figure: scatter plot of age_bride (y-axis) by m_year, with point size showing frequency]
Basic statistical techniques
Box and whisker plot
Distribution of data
Median: 50% of the cases above and below
Box: 1st and 3rd quartile
Interquartile range (IQR): Q3-Q1
Outliers (Tukey, 1977):
x < Q1 - 1.5*IQR
x > Q3 + 1.5*IQR
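The outlier rule above can be worked through by hand. This is a sketch on ten hypothetical ages, using quantile() with its default settings. boxplot() itself computes the box from Tukey's hinges, which can differ slightly from these quartiles.

```r
# Ten hypothetical ages, one of them unusually high
x <- c(18, 21, 22, 23, 24, 25, 26, 27, 30, 64)

q   <- quantile(x, c(0.25, 0.75))  # Q1 and Q3 (default quantile type)
q1  <- q[[1]]
q3  <- q[[2]]
iqr <- q3 - q1                     # interquartile range

lower <- q1 - 1.5 * iqr            # values below this are outliers
upper <- q3 + 1.5 * iqr            # values above this are outliers

outliers <- x[x < lower | x > upper]
print(c(Q1 = q1, Q3 = q3, IQR = iqr, lower = lower, upper = upper))
print(outliers)
```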
boxplot(hmar.dt$age_bride,
        ylab = "Age")
[Figure: box plot of age_bride, y-axis "Age"]
hmar.dt[, sign.bride.cln := sign_bride == "h"]
hmar.dt[age_bride < 14, age_bride := NA]
# NB: no missing values here, but mind this when recoding!
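Why missing values need attention when recoding: a sketch on three hypothetical ages, where -2 plays the role of an implausible value. Once a value is set to NA, functions such as mean() return NA unless told to skip missing values.

```r
library(data.table)

# Hypothetical ages; -2 is an implausible value
dt <- data.table(age = c(-2L, 20L, 30L))

# Recode implausible ages to missing (NA_integer_ matches the column type)
dt[age < 14, age := NA_integer_]

mean_with_na    <- mean(dt$age)                # NA: one value is missing
mean_without_na <- mean(dt$age, na.rm = TRUE)  # mean of the remaining values

print(mean_with_na)
print(mean_without_na)
```

A comparison such as `age < 14` also yields NA for rows that are already missing, which is worth keeping in mind when recoding in several steps.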
boxplot(hmar.dt$age_bride ~ hmar.dt$sign.bride.cln,
        names = c("not signed", "signed"),
        col = c("red", "green"))
[Figure: box plots of age_bride for brides who did not sign ("not signed") and who signed ("signed")]
➥🔝 7737669865 🔝▻ mahisagar Call-girls in Women Seeking Men  🔝mahisagar🔝   Esc...➥🔝 7737669865 🔝▻ mahisagar Call-girls in Women Seeking Men  🔝mahisagar🔝   Esc...
➥🔝 7737669865 🔝▻ mahisagar Call-girls in Women Seeking Men 🔝mahisagar🔝 Esc...
 

Introduction into R for historians (part 4: data manipulation)

  • 1. Recap Data manipulation data.table package Basic statistical techniques Data manipulation in R Richard L. Zijdeman May 29, 2015 Richard L. Zijdeman Data manipulation in R
  • 2. 1 Recap; 2 Data manipulation; 3 data.table package; 4 Basic statistical techniques
  • 3. Recap
  • 4. What we’ve seen so far: functions to read in data (read.csv(), read.xlsx()); objects (assignment with <-; characteristics, e.g. str(), summary(), head(), tail()); calculations (mean(), min(), max()); plotting (plot(); ggplot(), which paints by ‘layer’)
  • 5. Before we go on... Structure your R script: give it a header with Filename, Date, Purpose, Author, Last change. Use comments to tell what you are doing, e.g. reading in data or changing variables (and why you did it).
  • 6. Create a working directory, with subdirs:
       + documents
       + data
         - source
         - derived
       + analysis
       + figures
  • 7. Set a working directory: setwd(), getwd(). Use relative paths to save things: “./” = current directory, “./../” = one folder up. Read J. Scott Long’s “Workflow”.
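The folder layout and the relative paths can be tried out safely in a temporary folder. This is only a sketch: the project name `hsn_project` and the three-row data frame `toy` are made up for illustration, and everything is written under `tempdir()` so that your real files are not touched.

```r
# Hypothetical project root inside a temporary folder (an assumption for this sketch).
root <- file.path(tempdir(), "hsn_project")

# Sub-directories from the slide: "data" has two children, source and derived.
subdirs <- c("documents",
             file.path("data", "source"),
             file.path("data", "derived"),
             "analysis",
             "figures")
for (d in subdirs) {
  dir.create(file.path(root, d), recursive = TRUE, showWarnings = FALSE)
}

# Work from the analysis folder, as a script stored there would.
old_wd <- setwd(file.path(root, "analysis"))

# "./" is the current directory; "./../" is one folder up (the project root).
toy <- data.frame(id = 1:3, year = c(1849, 1851, 1864))  # made-up data
write.csv(toy, "./../data/derived/toy.csv", row.names = FALSE)
toy_back <- read.csv("./../data/derived/toy.csv")

setwd(old_wd)  # restore the previous working directory
```

Because the script only uses relative paths, the whole project folder can be moved or shared without editing the code.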
  • 8. Data manipulation
  • 9. Assignment and Indexing. First, we’ll read in the HSN marriages again:
       hmar <- read.csv("./../data/derived/HSN_marriages.csv",
                        stringsAsFactors = FALSE,
                        encoding = "latin1",
                        header = TRUE,
                        nrows = 10000)
  • 10. Change case of text: tolower(), toupper()
       tolower("CaN we pleASe jUSt have LOWER cases?")
       ## [1] "can we please just have lower cases?"
       names(hmar) <- tolower(names(hmar))
       names(hmar)
       ## [1] "id_marriage" "idnr" "m_loc" "m_
       ## [5] "sex_hsnrp" "age_groom" "occ_groom" "ci
       ## [9] "sign_groom" "b_loc_groom" "l_loc_groom" "ag
       ## [13] "occ_bride" "civilst_bride" "sign_bride" "b_
       ## [17] "l_loc_bride" "a_f_groom" "occ_f_groom" "si
  • 11. Indexing. There were way too many names to print on a slide... How many names are there actually?
  • 12. Use the length() function to find out:
       length(names(hmar))
       ## [1] 29
       So let’s print just the first two:
       names(hmar)[1:2]
       ## [1] "id_marriage" "idnr"
       The technique using square brackets is called indexing.
  • 13. Any idea how we would show the last two names?
  • 14. To show the last two names:
       x <- length(names(hmar))
       names(hmar)[(x-1):x]
       # returns the last two names
       Using c() (concatenate) we can also extract several names at once:
       names(hmar)[c(1, 3, 5)]
       ## [1] "id_marriage" "m_loc" "sex_hsnrp"
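Indexing the end of a vector is easy to get wrong, so here is a small sketch on a made-up data frame. `toy` and its values are invented stand-ins for hmar, which is not available here. Note the parentheses in `(x - 1):x`: without them R evaluates `x - (1:x)`.

```r
# Made-up data frame standing in for hmar (an assumption for this sketch).
toy <- data.frame(id_marriage = 1:3,
                  idnr = c(1001, 1005, 1009),
                  m_loc = c("Abcoude-Baambrugge", "Baarn", "Utrecht"),
                  m_year = c(1849, 1851, 1864),
                  stringsAsFactors = FALSE)

x <- length(names(toy))             # number of columns, here 4
last_two <- names(toy)[(x - 1):x]   # positions 3 and 4
picked <- names(toy)[c(1, 3)]       # c() picks arbitrary positions

# tail() on the names vector gives the same result without computing x.
same <- identical(last_two, tail(names(toy), 2))
```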
  • 15. We can also apply indexing to a data.frame:
       hmar[1:2, 1:3]
       ##   id_marriage idnr              m_loc
       ## 1           1 1001 Abcoude-Baambrugge
       ## 2           2 1005              Baarn
       # shows the first 2 rows and first 3 columns
       # so, in general: data.frame[rows, columns]
  • 16. head() and tail(). So actually, you should now be able to replace head() and tail(). How?
  • 17. # head()
       hmar[1:6, ]
       # tail()
       y <- nrow(hmar)
       hmar[(y-5):y, ]   # the last six rows, as tail() returns
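A quick way to check the row arithmetic is to compare the hand-made subsets with head() and tail() on a small made-up data frame. `toy` is invented for this sketch.

```r
toy <- data.frame(id = 1:10, year = 1840 + 1:10)  # ten made-up rows

first_six <- toy[1:6, ]         # same rows as head(toy)

y <- nrow(toy)
last_six <- toy[(y - 5):y, ]    # y - 5 up to y is exactly six rows

# Compare the selected ids with what head() and tail() return.
head_ok <- all(first_six$id == head(toy)$id)
tail_ok <- all(last_six$id == tail(toy)$id)
```

Counting from `y - 6` instead would select seven rows, one more than tail() shows by default.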
  • 18. data.table package
  • 19. Developed by Matt Dowle
       Website: https://github.com/Rdatatable/data.table/wiki
       Why data.table? Fast subsetting on large files, more consistent ‘grammar’, less typing.
  • 20. install.packages("data.table")
       library(data.table)
  • 21. Class: data.table. For data.table functions to work, we need to define a data.frame as class data.table:
       is.data.table(hmar)
       ## [1] FALSE
       hmar.dt <- data.table(hmar)
       is.data.table(hmar.dt)
       ## [1] TRUE
       is.data.frame(hmar.dt)
       ## [1] TRUE
  • 22. Friends with benefits. Data.frame and data.table are like ‘friends with benefits’:
       all.equal(hmar, hmar.dt)
       ## [1] "Attributes: < Names: 2 string mismatches >"
       ## [2] "Attributes: < Length mismatch: comparison on first
       ## [3] "Attributes: < Component 1: Modes: character, extern
       ## [4] "Attributes: < Component 1: target is character, cur
       ## [5] "Attributes: < Component 2: Modes: numeric, characte
       ## [6] "Attributes: < Component 2: Lengths: 10000, 2 >"
       ## [7] "Attributes: < Component 2: target is numeric, curre
       # so we have all the benefits of a data.frame
       # ... and additional benefits of data.table
       NB: the next series of commands will only work for data.tables
  • 23. Sort with setkey. Often we want to sort our data. We can do so with setkey(), or with setkeyv() when the column name is passed as a string:
       hmar.dt[1:6, m_year]
       ## [1] 1849 1851 1864 1840 1843 1858
       # note for data.frame hmar it would be:
       # hmar[1:6, "m_year"]
       setkeyv(hmar.dt, "m_year")
       hmar.dt[1:6, m_year]
       ## [1] 1831 1831 1833 1833 1834 1834
       identical(hmar.dt, hmar)
       ## [1] FALSE
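The difference between setkey() and setkeyv() is easiest to see side by side. The sketch below uses a made-up data.table: the column names mimic hmar.dt, but the six rows are invented. It also shows setorder(), which sorts by reference too and allows descending order.

```r
library(data.table)

# Made-up marriages; values are invented for illustration.
toy.dt <- data.table(id_marriage = 1:6,
                     m_year = c(1849, 1851, 1864, 1840, 1843, 1858))

setkey(toy.dt, m_year)            # unquoted column name; sorts toy.dt in place
key_after <- key(toy.dt)          # name of the key column
sorted_years <- toy.dt[, m_year]  # years are now in ascending order

# setkeyv() takes the column names as a character vector instead.
setkeyv(toy.dt, c("id_marriage", "m_year"))

# setorder() sorts by reference as well; a minus sign means descending.
setorder(toy.dt, -m_year)
latest <- toy.dt$m_year[1]
```

Both functions change the table itself, so no assignment with `<-` is needed.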
  • 24. Multiple keys. It is also possible to sort on multiple keys:
       setkeyv(hmar.dt, c("id_marriage", "idnr"))
  • 25. Subsetting
       groom.sig <- hmar.dt[age_groom > 30, ]
       dim(groom.sig)
       ## [1] 2493 29
       groom.sig <- hmar.dt[sign_groom == "h", ]
       dim(groom.sig)
       ## [1] 9590 29
  • 26. groom.sig <- hmar.dt[sign_groom == "h" & age_groom > 30, ]
       dim(groom.sig)
       ## [1] 2358 29
       groom.sig <- hmar.dt[m_year != 1840, list(id_marriage, idnr)]
       dim(groom.sig)
       ## [1] 9985 2
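The general pattern behind these subsets is `DT[i, j]`: `i` selects rows and `j` selects columns. A minimal sketch on made-up data follows. The rows, and the code "x" for a groom who did not sign, are assumptions for illustration only.

```r
library(data.table)

# Made-up data; the value "x" for sign_groom is hypothetical.
toy.dt <- data.table(id_marriage = 1:5,
                     idnr = c(1001, 1005, 1009, 1012, 1020),
                     m_year = c(1840, 1851, 1864, 1840, 1843),
                     age_groom = c(24, 35, 31, 28, 42),
                     sign_groom = c("h", "h", "x", "h", "h"))

older <- toy.dt[age_groom > 30]                       # rows only; the comma is optional
both <- toy.dt[sign_groom == "h" & age_groom > 30]    # two conditions combined with &
ids <- toy.dt[m_year != 1840, list(id_marriage, idnr)]  # rows and columns together

# .() is data.table shorthand for list() inside the brackets.
ids2 <- toy.dt[m_year != 1840, .(id_marriage, idnr)]
```

Column names are used without quotes and without the `toy.dt$` prefix, which is part of the "less typing" benefit.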
  • 27. Creating new variables. Let’s create a variable for the mean age at marriage of grooms:
       hmar.dt[, mean.gage := mean(age_groom)]
       summary(hmar.dt$age_groom)
       ##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
       ##   -2.00   24.00   26.00   28.38   30.00   79.00
       summary(hmar.dt$mean.gage)
       ##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
       ##   28.38   28.38   28.38   28.38   28.38   28.38
  • 28. Another example (from yesterday): dummy variable for equal municipality of birth
       hmar.dt[, eq_b_loc := (b_loc_groom == b_loc_bride)]
       summary(hmar.dt$eq_b_loc)
       ##    Mode   FALSE    TRUE    NA's
       ## logical    6957    3043       0
  • 29. Creating variables by group. As we saw, a var with mean age wasn’t really interesting. Average age of grooms at marriage by civil status:
       hmar.dt[, gage.mean.civ := mean(age_groom), by = civilst_groom]
       table(hmar.dt$civilst_groom, hmar.dt$gage.mean.civ)
       ##
       ##     27.2427939112599 40.8829787234043 42.9548286604361
       ##   1             9263                0                0
       ##   2                0                0              642
       ##   3                0               94                0
       ##   6                0                0                0
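The effect of `by` is easier to follow on a handful of rows. In this sketch the five grooms and their civil status codes are made up. `:=` adds a column in place, and with `by` the mean is computed within each group and repeated for every row of that group.

```r
library(data.table)

# Made-up grooms; civil status codes and ages are invented.
toy.dt <- data.table(civilst_groom = c(1, 1, 2, 2, 3),
                     age_groom = c(24, 26, 40, 44, 39))

# Overall mean: the same value on every row.
toy.dt[, mean.gage := mean(age_groom)]

# Mean per civil status: one value per group, repeated within the group.
toy.dt[, gage.mean.civ := mean(age_groom), by = civilst_groom]
```

The number of rows stays the same; only new columns are added.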
  • 30. Summary subsets of the data. So far, we added variables to the original data.frame; that can be redundant, though. Think of context, say municipalities: archival material on characteristics, e.g. population, steam power. You can also make context characteristics by aggregation.
  • 31. mc <- hmar.dt[, mean(age_groom), by = b_loc_groom]
       summary(mc)
       ##  b_loc_groom              V1
       ##  Length:1184        Min.   :-2.00
       ##  Class :character   1st Qu.:26.00
       ##  Mode  :character   Median :28.17
       ##                     Mean   :29.36
       ##                     3rd Qu.:31.00
       ##                     Max.   :69.00
  • 32. We can improve by naming the variable directly, and adding more variables:
       mc2 <- hmar.dt[, list(mean_gage = mean(age_groom),
                             mean_bage = mean(age_bride)),
                      by = b_loc_groom]
       summary(mc2)
       ##  b_loc_groom          mean_gage       mean_bage
       ##  Length:1184        Min.   :-2.00   Min.   :-2.00
       ##  Class :character   1st Qu.:26.00   1st Qu.:23.80
       ##  Mode  :character   Median :28.17   Median :25.88
       ##                     Mean   :29.36   Mean   :26.53
       ##                     3rd Qu.:31.00   3rd Qu.:28.00
       ##                     Max.   :69.00   Max.   :64.00
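Unlike `:=`, a `list()` in `j` returns a new, smaller table with one row per group. A sketch with two made-up municipalities shows this (all ages are invented):

```r
library(data.table)

# Made-up data: two municipalities of birth, five marriages.
toy.dt <- data.table(b_loc_groom = c("Baarn", "Baarn", "Utrecht", "Utrecht", "Utrecht"),
                     age_groom = c(24, 30, 27, 33, 30),
                     age_bride = c(22, 26, 25, 29, 27))

# One row per municipality, with named summary columns.
agg <- toy.dt[, list(mean_gage = mean(age_groom),
                     mean_bage = mean(age_bride)),
              by = b_loc_groom]
```

Here `agg` has two rows, one for each municipality, while `toy.dt` is left unchanged.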
  • 33. One more... counts. Yesterday, we talked about the problem of overlapping points. We used geom_jitter to solve it. Now let’s do it properly:
       mc3 <- hmar.dt[, list(frequency = .N),
                      by = list(m_year, age_bride)]
       # notice the .N ... N is often used for nr. of obs
       library(ggplot2)
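What `.N` returns per group can be checked on a few made-up rows. In this sketch (values invented) several marriages share the same year and bride's age, which is exactly the overlapping-points situation:

```r
library(data.table)

# Made-up data with deliberate duplicates of (m_year, age_bride).
toy.dt <- data.table(m_year = c(1840, 1840, 1840, 1851, 1851),
                     age_bride = c(22, 22, 25, 22, 22))

# .N is the number of rows in each group.
counts <- toy.dt[, list(frequency = .N), by = list(m_year, age_bride)]
```

Each unique combination becomes one row, and the frequencies add up to the number of rows in the original table.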
  • 34. Using colour
       ggplot(mc3, aes(x = m_year, y = age_bride)) +
         geom_point(aes(colour = frequency), size = 10, shape = 18) +
         theme_bw()
       [Figure: age_bride against m_year, point colour showing frequency]
  • 35. Using size
       ggplot(mc3, aes(x = m_year, y = age_bride)) +
         geom_point(aes(size = frequency), colour = "blue", shape = 18) +
         theme_bw()
       [Figure: age_bride against m_year, point size showing frequency]
  • 36. Basic statistical techniques
  • 37. Box and whisker plot: shows the distribution of data
       Median: 50% of the cases above and below
       Box: 1st and 3rd quartile
       Interquartile range (IQR): Q3 - Q1
       Outliers (Tukey, 1977): x < Q1 - 1.5*IQR or x > Q3 + 1.5*IQR
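The outlier rule can be worked through by hand on a small made-up vector of ages. This is a sketch using quantile() with its default settings. boxplot() itself computes the box from Tukey's hinges, which can differ slightly from these quartiles, although for this example both approaches flag the same value.

```r
ages <- c(19, 22, 23, 24, 25, 26, 27, 28, 30, 64)  # made-up ages

q <- quantile(ages, c(0.25, 0.75), names = FALSE)  # Q1 and Q3
iqr <- q[2] - q[1]                                 # interquartile range

lower <- q[1] - 1.5 * iqr   # lower fence
upper <- q[2] + 1.5 * iqr   # upper fence

outliers <- ages[ages < lower | ages > upper]

# The points boxplot() would draw individually, for comparison.
box_out <- boxplot.stats(ages)$out
```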
  • 38. boxplot(hmar.dt$age_bride, ylab = "Age")
       [Figure: box plot of age_bride, y-axis labelled Age]
  • 39. hmar.dt[, sign.bride.cln := sign_bride == "h"]
       hmar.dt[age_bride < 14, age_bride := NA]
       # NB: no missing values here, but mind this when recoding!
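Why missing values matter when recoding can be seen on four made-up brides, one with an impossible age and one with a missing signature code. The data and the code "x" are invented for this sketch.

```r
library(data.table)

# Made-up data; "x" as a non-signing code is hypothetical.
toy.dt <- data.table(age_bride = c(-2, 19, 24, 30),
                     sign_bride = c("h", "x", "h", NA))

# A comparison with NA gives NA, not FALSE.
toy.dt[, sign.bride.cln := sign_bride == "h"]

# Recode impossible ages to missing; NA_real_ matches the numeric column type.
toy.dt[age_bride < 14, age_bride := NA_real_]

mean_with_na <- mean(toy.dt$age_bride)                # NA: one age is missing
mean_no_na <- mean(toy.dt$age_bride, na.rm = TRUE)    # mean of the other three
```

After recoding, functions such as mean() need `na.rm = TRUE`, and a logical dummy can contain NA as a third value next to TRUE and FALSE.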
  • 40. boxplot(hmar.dt$age_bride ~ hmar.dt$sign.bride.cln,
               names = c("not signed", "signed"),
               col = c("red", "green"))
       [Figure: box plots of age_bride for brides who did not sign and brides who signed]