Data Manipulation on R 
Factor Manipulations, Subset, Sorting and Reshape
Abhik Seal
Indiana University School of Informatics and Computing (dsdht.wikispaces.com)
Basic Manipulating Data 
So far, we have covered how to read data into R from various sources (files, the internet and databases) and in various file formats. In this session we look at how to manipulate the data after reading it in, so that it is easier to process.
Sorting and Ordering data 
sort(x, decreasing=FALSE): 'sort (or order) a vector or factor (partially) into ascending or descending order.'
order(..., decreasing=FALSE): 'returns a permutation which rearranges its first argument into ascending or descending order, breaking ties by further arguments.'
x <- c(1,5,7,8,3,12,34,2) 
sort(x) 
## [1] 1 2 3 5 7 8 12 34 
order(x) 
## [1] 1 8 5 2 3 4 6 7 
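The relationship between the two functions can be sketched as follows: indexing a vector by the permutation that order() returns should reproduce sort(). This is only a small illustration on the same x; the names perm and sorted_via_order are made up for the example.

```r
x <- c(1, 5, 7, 8, 3, 12, 34, 2)

# order() returns positions; using them as an index reproduces sort()
perm <- order(x)
sorted_via_order <- x[perm]
identical(sorted_via_order, sort(x))

# both functions accept decreasing = TRUE
sort(x, decreasing = TRUE)
x[order(x, decreasing = TRUE)]
```

This is why order() is the function to reach for when sorting a data frame: the permutation can be applied to the rows of every column at once.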
Some examples of sorting and ordering 
# sort by mpg (columns are referenced through mtcars$, so attach() is not needed)
newdata <- mtcars[order(mtcars$mpg),]
head(newdata,3) 
## mpg cyl disp hp drat wt qsec vs am gear carb 
## Cadillac Fleetwood 10.4 8 472 205 2.93 5.250 17.98 0 0 3 4 
## Lincoln Continental 10.4 8 460 215 3.00 5.424 17.82 0 0 3 4 
## Camaro Z28 13.3 8 350 245 3.73 3.840 15.41 0 0 3 4 
# sort by mpg, breaking ties by cyl
newdata <- mtcars[order(mtcars$mpg, mtcars$cyl),]
head(newdata,3) 
## mpg cyl disp hp drat wt qsec vs am gear carb 
## Cadillac Fleetwood 10.4 8 472 205 2.93 5.250 17.98 0 0 3 4 
## Lincoln Continental 10.4 8 460 215 3.00 5.424 17.82 0 0 3 4 
## Camaro Z28 13.3 8 350 245 3.73 3.840 15.41 0 0 3 4 
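When one key should be ascending and another descending, a common base R idiom is to negate the numeric key inside order(). The following is a minimal sketch on mtcars; the name by_cyl_mpg is only illustrative.

```r
# ascending by cyl, then descending by mpg within each cyl group
by_cyl_mpg <- mtcars[order(mtcars$cyl, -mtcars$mpg), ]
head(by_cyl_mpg[, c("mpg", "cyl")], 3)
```

Negation only works for numeric keys; for character keys, one option is to wrap the key in xtfrm() before negating it.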
Ordering with plyr 
library(plyr) 
head(arrange(mtcars,mpg),3) 
## mpg cyl disp hp drat wt qsec vs am gear carb 
## 1 10.4 8 472 205 2.93 5.250 17.98 0 0 3 4 
## 2 10.4 8 460 215 3.00 5.424 17.82 0 0 3 4 
## 3 13.3 8 350 245 3.73 3.840 15.41 0 0 3 4 
head(arrange(mtcars,desc(mpg)),3) 
## mpg cyl disp hp drat wt qsec vs am gear carb 
## 1 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1 
## 2 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1 
## 3 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2 
Subsetting data 
set.seed(12345) 
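# create a data frame
X <- data.frame("A"=sample(1:10),"B"=sample(11:20),"C"=sample(21:30))
# shuffle the rows, then set three values of B to NA
# (the exact rows shown below depend on the R version: R 3.6.0 changed the default sampling algorithm)
X <- X[sample(1:10),]
X$B[c(1,6,10)] <- NA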
head(X) 
## A B C 
## 8 4 NA 27 
## 1 8 11 25 
## 2 10 12 23 
## 5 3 13 24 
## 3 7 16 28 
## 10 5 NA 26 
Basic data subsetting 
# Accessing only first row 
X[1,] 
## A B C 
## 8 4 NA 27 
# accessing only first column 
X[,1] 
## [1] 4 8 10 3 7 5 9 1 2 6 
# accessing first row and first column 
X[1,1] 
## [1] 4 
Logical AND / OR
head(X[(X$A <=6 & X$C > 24),],3) 
## A B C 
## 8 4 NA 27 
## 10 5 NA 26 
## 7 2 19 29 
head(X[(X$A <=6 | X$C > 24),],3) 
## A B C 
## 8 4 NA 27 
## 1 8 11 25 
## 5 3 13 24 
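The same conditions can also be written with subset(), which evaluates the expression inside the data frame. The sketch below uses a small made-up data frame (df and its values are assumptions for illustration, not the X generated above) so that the result does not depend on the random seed.

```r
# small stand-alone data frame (values are made up for illustration)
df <- data.frame(A = c(4, 8, 10, 3), B = c(NA, 11, 12, 13), C = c(27, 25, 23, 24))

# logical indexing: & keeps rows where both conditions hold, | where either holds
both   <- df[df$A <= 6 & df$C > 24, ]
either <- df[df$A <= 6 | df$C > 24, ]

# subset() evaluates the condition inside the data frame, so df$ is not needed
both_subset <- subset(df, A <= 6 & C > 24)
```

Note that subset() also drops rows for which the condition evaluates to NA, whereas plain logical indexing returns them as rows of NA.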
Selecting non-NA values from a data frame
# select the rows without NA values in column B
head(X[!is.na(X$B),],4)
## A B C 
## 1 8 11 25 
## 2 10 12 23 
## 5 3 13 24 
## 3 7 16 28 
# select the rows where B is greater than 11 (which() drops the NA comparisons)
head(X[which(X$B>11),],4) 
## A B C 
## 2 10 12 23 
## 5 3 13 24 
## 3 7 16 28 
## 4 9 20 30 
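The difference between the ways of handling NA when subsetting can be sketched on a tiny made-up data frame (df and its values are assumptions for illustration). is.na() and complete.cases() are the usual base R tests for missing values.

```r
df <- data.frame(A = c(4, 8, 10, 3), B = c(NA, 11, 12, NA))

# is.na() is the reliable test for missing values
no_na_B <- df[!is.na(df$B), ]

# complete.cases() flags rows with no NA in any column
complete_rows <- df[complete.cases(df), ]

# a plain logical index turns NA comparisons into all-NA rows; which() drops them
with_na_rows <- df[df$B > 11, ]          # 3 rows, two of them all NA
clean_rows   <- df[which(df$B > 11), ]   # 1 row
```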
# creating a data frame with 2 variables 
data <- data.frame(x1=c(2,3,4,5,6),x2=c(5,6,7,8,1)) 
list_data<-list(dat=data,vec.obj=c(1,2,3)) 
list_data 
## $dat 
## x1 x2 
## 1 2 5 
## 2 3 6 
## 3 4 7 
## 4 5 8 
## 5 6 1 
## 
## $vec.obj 
## [1] 1 2 3 
# accessing the second element of list_data
list_data[[2]] 
## [1] 1 2 3 
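List elements can also be accessed by name, and single brackets behave differently from double brackets. A short sketch using the same list_data:

```r
data <- data.frame(x1 = c(2, 3, 4, 5, 6), x2 = c(5, 6, 7, 8, 1))
list_data <- list(dat = data, vec.obj = c(1, 2, 3))

# [[ ]] and $ return the element itself
list_data[["vec.obj"]]
list_data$dat$x1

# [ ] returns a sub-list that still wraps the element
class(list_data[2])      # "list"
class(list_data[[2]])    # "numeric"
```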
Factors 
Factors are used to represent categorical data, and can also be used for ordinal data (i.e. categories that have an intrinsic ordering). Note that before R 4.0.0, functions like read.table() read character strings in as factors by default; since R 4.0.0 the default is stringsAsFactors = FALSE.
'The function factor is used to encode a vector as a factor (the terms 'category' and 'enumerated type' are also used for factors). If argument ordered is TRUE, the factor levels are assumed to be ordered. For compatibility with S there is also a function ordered.'
is.factor, is.ordered, as.factor and as.ordered are the membership and coercion functions for these classes.
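As a small illustration of an ordinal factor, the sketch below builds an ordered factor from a hypothetical dose variable (the name and values are made up) and applies the membership functions mentioned above.

```r
# hypothetical ordinal variable
dose <- factor(c("low", "high", "medium", "low"),
               levels = c("low", "medium", "high"),
               ordered = TRUE)

is.factor(dose)
is.ordered(dose)
dose < "high"       # comparisons are meaningful for ordered factors
table(dose)
```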
Factors 
Suppose we have a vector of case-control status 
cc=factor(c("case","case","case","control","control","control")) 
cc 
## [1] case case case control control control 
## Levels: case control 
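# caution: assigning to levels() relabels the existing levels, it does not reorder them,
# so every "case" below becomes "control" and vice versa
levels(cc)=c("control","case")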
cc 
## [1] control control control case case case 
## Levels: control case 
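If the goal is to change the order of the levels while leaving the data as they are, two base R options are the levels argument of factor() and relevel(). A minimal sketch (cc_levels and cc_relevel are illustrative names):

```r
cc <- factor(c("case", "case", "case", "control", "control", "control"))

# option 1: rebuild the factor with an explicit level order
cc_levels <- factor(cc, levels = c("control", "case"))

# option 2: relevel() moves one level to the front (the reference level)
cc_relevel <- relevel(cc, ref = "control")

as.character(cc_levels)   # data unchanged
levels(cc_levels)         # level order changed
```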
Factors 
Factors can be converted to numeric or character very easily
x=factor(c("case","case","case","control","control","control"),levels=c("control","case")) 
as.character(x) 
## [1] "case" "case" "case" "control" "control" "control" 
as.numeric(x) 
## [1] 2 2 2 1 1 1 
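One thing to watch for: as.numeric() returns the integer codes of the levels, not the labels. When the labels themselves are numbers, the usual idiom is to go through as.character() first. A sketch with a made-up factor f:

```r
f <- factor(c("10", "5", "20", "5"))

as.numeric(f)                 # integer codes of the levels, not the values
as.numeric(as.character(f))   # the original numbers
as.numeric(levels(f))[f]      # same result, converting only the levels
```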
Cut 
Now that we know more about factors, cut() will make more sense:
x=1:100 
cx=cut(x,breaks=c(0,10,25,50,100)) 
head(cx) 
## [1] (0,10] (0,10] (0,10] (0,10] (0,10] (0,10] 
## Levels: (0,10] (10,25] (25,50] (50,100] 
table(cx) 
## cx 
## (0,10] (10,25] (25,50] (50,100] 
## 10 15 25 50 
Cut 
We can also leave off the labels (labels=FALSE returns integer codes instead of a factor)
cx=cut(x,breaks=c(0,10,25,50,100),labels=FALSE) 
head(cx) 
## [1] 1 1 1 1 1 1 
table(cx) 
## cx 
## 1 2 3 4 
## 10 15 25 50 
Cut 
# values that fall outside the range of the breaks become NA
cx=cut(x,breaks=c(10,25,50),labels=FALSE)
head(cx) 
## [1] NA NA NA NA NA NA 
table(cx) 
## cx 
## 1 2 
## 15 25 
table(cx,useNA="ifany") 
## cx 
## 1 2 <NA> 
## 15 25 60 
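cut() has a few more arguments that are often useful. The sketch below shows custom labels and left-closed intervals; the label names are made up for the example.

```r
x <- 1:100

# custom labels, one per interval
cx_named <- cut(x, breaks = c(0, 10, 25, 50, 100),
                labels = c("low", "mid", "high", "top"))
table(cx_named)

# right = FALSE makes intervals closed on the left: [0,10), [10,25), ...
# include.lowest = TRUE then closes the last interval so that 100 is kept
cx_left <- cut(x, breaks = c(0, 10, 25, 50, 100),
               right = FALSE, include.lowest = TRUE)
levels(cx_left)
table(cx_left)
```

With right = FALSE the boundary values 10, 25 and 50 move up into the next interval, so the counts differ from the default.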
Adding to data frames 
m1=matrix(1:9,nrow=3,ncol=3,byrow=FALSE) 
m1 
## [,1] [,2] [,3] 
## [1,] 1 4 7 
## [2,] 2 5 8 
## [3,] 3 6 9 
m2=matrix(1:9,nrow=3,ncol=3,byrow=TRUE) 
m2 
## [,1] [,2] [,3] 
## [1,] 1 2 3 
## [2,] 4 5 6 
## [3,] 7 8 9 
Adding using cbind 
You can add columns (or another matrix/data frame) to a data frame or matrix using cbind() ('column bind'). You can also add rows (or another matrix/data frame) using rbind() ('row bind'). Note that the vector you are adding has to have the same length as the number of rows (for cbind()) or the number of columns (for rbind()).
cbind(m1,m2) 
## [,1] [,2] [,3] [,4] [,5] [,6] 
## [1,] 1 4 7 1 2 3 
## [2,] 2 5 8 4 5 6 
## [3,] 3 6 9 7 8 9 
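For completeness, here is a sketch of rbind() on the same matrices and of both functions on a data frame. The data frame df and its columns are made up for illustration.

```r
m1 <- matrix(1:9, nrow = 3, ncol = 3, byrow = FALSE)
m2 <- matrix(1:9, nrow = 3, ncol = 3, byrow = TRUE)

# stack the matrices on top of each other: 6 rows, 3 columns
stacked <- rbind(m1, m2)
dim(stacked)

# add a column to a data frame: the vector length must equal nrow(df)
df <- data.frame(id = 1:3, score = c(2.5, 3.1, 4.8))
df_wide <- cbind(df, group = c("a", "b", "a"))

# add a row: the new row must supply every column, with matching names
df_long <- rbind(df, data.frame(id = 4, score = 1.9))
```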
Reshape data 
A dataset can be laid out in long or wide form. In the long layout, multiple rows represent a single subject's record, whereas in the wide layout, a single row represents a single subject's record. Some statistical analyses require wide data and others require long data, so we need to be able to reshape the data to meet the requirements of the analysis. Reshaping is just a rearrangement of the form of the data; it does not change the content of the dataset. This section focuses on the melt and cast paradigm of reshaping datasets, which is implemented in the contributed package reshape. The package was later reimplemented under a new name, reshape2, which is much more time and memory efficient (see the paper Reshaping Data with the reshape Package by Wickham, available at http://www.jstatsoft.org/v21/i12/paper).
Wide data has a column for each variable. For example, this is wide-format data: 
# ozone wind temp 
# 1 23.62 11.623 65.55 
# 2 29.44 10.267 79.10 
# 3 59.12 8.942 83.90 
# 4 59.96 8.794 83.97 
The same data in long format:
# variable value 
# 1 ozone 23.615 
# 2 ozone 29.444 
# 3 ozone 59.115 
# 4 ozone 59.962 
# 5 wind 11.623 
# 6 wind 10.267 
# 7 wind 8.942 
# 8 wind 8.794 
# 9 temp 65.548 
# 10 temp 79.100 
# 11 temp 83.903 
# 12 temp 83.968 
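Before turning to reshape2, the wide-to-long idea can be sketched with base R alone. stack() and unstack() are base functions; they are used here as a dependency-free stand-in, not as the method these slides go on to use. The values are copied from the wide example above.

```r
# wide data: one column per variable
wide <- data.frame(ozone = c(23.62, 29.44, 59.12, 59.96),
                   wind  = c(11.623, 10.267, 8.942, 8.794),
                   temp  = c(65.55, 79.10, 83.90, 83.97))

# stack() produces long data with the columns `values` and `ind`
long <- stack(wide)
dim(long)              # 12 rows, 2 columns

# unstack() reverses the operation
wide_again <- unstack(long)
```

The content is the same in both layouts; only the arrangement of rows and columns differs.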
reshape2 Package
"In reality, you need long-format data much more commonly than wide-format data. For example, ggplot2 requires long-format data, plyr requires long-format data, and most modelling functions (such as lm(), glm(), and gam()) require long-format data. But people often find it easier to record their data in wide format."
reshape2 is based around two key functions, melt and cast. melt takes wide-format data and melts it into long-format data; cast takes long-format data and casts it into wide-format data.
Melt 
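library(reshape2)
# airquality ships with capitalised column names; lower-case them to match the output below
names(airquality) <- tolower(names(airquality))
head(airquality,2)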
## ozone solar.r wind temp month day 
## 1 41 190 7.4 67 5 1 
## 2 36 118 8.0 72 5 2 
aql <- melt(airquality) # [a]ir [q]uality [l]ong format 
head(aql,5) 
## variable value 
## 1 ozone 41 
## 2 ozone 36 
## 3 ozone 12 
## 4 ozone 18 
## 5 ozone NA 
By default, melt has assumed that all columns with numeric values are variables with values. Maybe here we 
want to know the values of ozone, solar.r, wind, and temp for each month and day. We can do that with melt 
by telling it that we want month and day to be “ID variables”. ID variables are the variables that identify 
individual rows of data. 
m <- melt(airquality, id.vars = c("month", "day")) 
head(m,4) 
## month day variable value 
## 1 5 1 ozone 41 
## 2 5 2 ozone 36 
## 3 5 3 ozone 12 
## 4 5 4 ozone 18 
melt also allows us to control the column names in the long data format
m <- melt(airquality, id.vars = c("month", "day"),
          variable.name = "climate_variable",
          value.name = "climate_value")
head(m) 
## month day climate_variable climate_value 
## 1 5 1 ozone 41 
## 2 5 2 ozone 36 
## 3 5 3 ozone 12 
## 4 5 4 ozone 18 
## 5 5 5 ozone NA 
## 6 5 6 ozone 28 
Long- to wide-format data: the cast functions 
In reshape2 there are multiple cast functions. Since you will most commonly work with data.frame objects, 
we’ll explore the dcast function. (There is also acast to return a vector, matrix, or array.) dcast uses a formula 
to describe the shape of the data. 
m <- melt(airquality, id.vars = c("month", "day")) 
aqw <- dcast(m, month + day ~ variable) 
head(aqw) 
## month day ozone solar.r wind temp 
## 1 5 1 41 190 7.4 67 
## 2 5 2 36 118 8.0 72 
## 3 5 3 12 149 12.6 74 
## 4 5 4 18 313 11.5 62 
## 5 5 5 NA NA 14.3 56 
## 6 5 6 28 NA 14.9 66 
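Here, the left-hand side of the formula tells dcast that month and day are the ID variables, and the right-hand side names the column whose values become the new column headers.
Apart from the order of the columns, we have recovered our original data.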
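When the formula leaves out an ID variable, several values land in the same cell and dcast needs an aggregation function. The sketch below computes monthly means; it works on a copy named aq (an illustrative name) and guards the reshape2 calls with requireNamespace() in case the package is not installed. A base R tapply() call is included as a cross-check.

```r
aq <- airquality
names(aq) <- tolower(names(aq))

# base R cross-check: mean wind speed per month
wind_by_month <- tapply(aq$wind, aq$month, mean)

# requireNamespace() guards against reshape2 not being installed
if (requireNamespace("reshape2", quietly = TRUE)) {
  m <- reshape2::melt(aq, id.vars = c("month", "day"))
  # several days share each month, so dcast needs an aggregation function
  monthly <- reshape2::dcast(m, month ~ variable,
                             fun.aggregate = mean, na.rm = TRUE)
  print(monthly)
}
```

Extra arguments such as na.rm = TRUE are passed on to the aggregation function.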
Data Manipulation Using plyr 
For large-scale data, we can split the dataset, perform the manipulation or analysis on each piece, and then combine the results into a single output again. Doing this kind of split with base R alone is not very efficient. To overcome this limitation, Wickham developed the R package plyr in 2011, which efficiently implements the split-apply-combine strategy. This strategy can be compared to the map-reduce strategy for processing large amounts of data.
In the coming slides I will give examples of the split-apply-combine strategy:
· Without loops
· With loops
· Using the plyr package
Without loops 
I am using the iris dataset here 
1. Split the iris dataset into three parts. 
2. Remove the species name variable from the data. 
3. Calculate the mean of each variable for the three different parts separately. 
4. Combine the output into a single data frame. 
# splitting the data by species and dropping column 5, Species (the split step)
iris.set <- iris[iris$Species=="setosa",-5]
iris.versi <- iris[iris$Species=="versicolor",-5] 
iris.virg <- iris[iris$Species=="virginica",-5] 
# calculating mean for each piece (The apply step) 
mean.set <- colMeans(iris.set) 
mean.versi <- colMeans(iris.versi) 
mean.virg <- colMeans(iris.virg) 
# combining the output (The combine step) 
mean.iris <- rbind(mean.set,mean.versi,mean.virg) 
# giving row names so that the output could be easily understood 
rownames(mean.iris) <- c("setosa","versicolor","virginica") 
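The four steps above can also be written more compactly with the base functions split() and sapply(). This is one possible sketch, not the only way to do it; pieces and mean.iris.split are illustrative names.

```r
# split the four numeric columns by species (the split step)
pieces <- split(iris[, -5], iris$Species)

# colMeans on each piece (apply), collected by sapply into a matrix (combine);
# t() puts the species in rows
mean.iris.split <- t(sapply(pieces, colMeans))
mean.iris.split
```

The row names come from the names of the list returned by split(), so they do not have to be set by hand.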
With Loops 
mean.iris.loop <- NULL 
for(species in unique(iris$Species)) 
{ 
iris_sub <- iris[iris$Species==species,] 
column_means <- colMeans(iris_sub[,-5]) 
mean.iris.loop <- rbind(mean.iris.loop,column_means) 
} 
# giving row names so that the output could be easily understood 
rownames(mean.iris.loop) <- unique(iris$Species) 
NB: The split-apply-combine strategy requires each piece to be independent of the others. The strategy won't work if one piece depends on another.
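Growing a result with rbind() inside a loop copies the object on every iteration. A variant that allocates the result once is sketched below; mean.iris.prealloc and species_names are illustrative names.

```r
species_names <- levels(iris$Species)

# allocate the result once: one row per species, one column per measurement
mean.iris.prealloc <- matrix(NA_real_, nrow = length(species_names), ncol = 4,
                             dimnames = list(species_names, names(iris)[-5]))

for (sp in species_names) {
  # each iteration touches only its own row, so the pieces stay independent
  mean.iris.prealloc[sp, ] <- colMeans(iris[iris$Species == sp, -5])
}
mean.iris.prealloc
```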
Using plyr 
library(plyr)
ddply(iris, ~Species, function(x) colMeans(x[, -which(colnames(x)=="Species")]))
## Species Sepal.Length Sepal.Width Petal.Length Petal.Width 
## 1 setosa 5.006 3.428 1.462 0.246 
## 2 versicolor 5.936 2.770 4.260 1.326 
## 3 virginica 6.588 2.974 5.552 2.026 
# for comparison, the result of the loop version
mean.iris.loop
## Sepal.Length Sepal.Width Petal.Length Petal.Width 
## setosa 5.006 3.428 1.462 0.246 
## versicolor 5.936 2.770 4.260 1.326 
## virginica 6.588 2.974 5.552 2.026 
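If plyr is not available, base R's aggregate() gives a result of the same shape (a data frame with a Species column). This is a sketch of an alternative, not part of plyr; mean.iris.agg is an illustrative name.

```r
# formula interface: every other column (.) is summarised by Species
mean.iris.agg <- aggregate(. ~ Species, data = iris, FUN = mean)
mean.iris.agg
```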
Merging data frames 
# Make a data frame mapping story numbers to titles 
stories <- read.table(header=TRUE, text='
storyid title 
1 lions 
2 tigers 
3 bears 
') 
# Make another data frame with the data and story numbers (no titles) 
data <- read.table(header=TRUE, text='
subject storyid rating 
1 1 6.7 
1 2 4.5 
1 3 3.7 
2 2 3.3 
2 3 4.1 
2 1 5.2 
') 
Merge the two data frames 
merge(stories, data, by="storyid")
## storyid title subject rating 
## 1 1 lions 1 6.7 
## 2 1 lions 2 5.2 
## 3 2 tigers 1 4.5 
## 4 2 tigers 2 3.3 
## 5 3 bears 1 3.7 
## 6 3 bears 2 4.1 
If the two data frames have different names for the columns you want to match on, the names can be 
specified: 
# In this case, the column is named 'id' instead of storyid 
stories2 <- read.table(header=TRUE, text='
id title
1 lions
2 tigers
3 bears
')
merge(x=stories2, y=data, by.x="id", by.y="storyid") 
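By default merge() keeps only the keys that appear in both data frames. The all.x, all.y and all arguments change that. The sketch below uses small made-up data frames (the fourth story and the ratings are assumptions for illustration) to show the difference.

```r
stories <- data.frame(storyid = 1:4,
                      title = c("lions", "tigers", "bears", "wolves"))  # story 4 is made up
ratings <- data.frame(subject = c(1, 1, 2),
                      storyid = c(1, 2, 2),
                      rating  = c(6.7, 4.5, 3.3))

# default: only storyids present in both data frames are kept
inner <- merge(stories, ratings, by = "storyid")

# all.x = TRUE: keep every story; unmatched ones get NA for subject and rating
left <- merge(stories, ratings, by = "storyid", all.x = TRUE)
left
```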
Resources and Materials used
· Data Manipulation with R by Phil Spector
· Getting and Cleaning Data, Coursera course
· plyr by Hadley Wickham
· Andrew Jaffe Notes
· R cookbook
 

Recently uploaded (20)

Python Notes for mca i year students osmania university.docx
Python Notes for mca i year students osmania university.docxPython Notes for mca i year students osmania university.docx
Python Notes for mca i year students osmania university.docx
 
Google Gemini An AI Revolution in Education.pptx
Google Gemini An AI Revolution in Education.pptxGoogle Gemini An AI Revolution in Education.pptx
Google Gemini An AI Revolution in Education.pptx
 
Wellbeing inclusion and digital dystopias.pptx
Wellbeing inclusion and digital dystopias.pptxWellbeing inclusion and digital dystopias.pptx
Wellbeing inclusion and digital dystopias.pptx
 
Understanding Accommodations and Modifications
Understanding  Accommodations and ModificationsUnderstanding  Accommodations and Modifications
Understanding Accommodations and Modifications
 
Towards a code of practice for AI in AT.pptx
Towards a code of practice for AI in AT.pptxTowards a code of practice for AI in AT.pptx
Towards a code of practice for AI in AT.pptx
 
Jual Obat Aborsi Hongkong ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Hongkong ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...Jual Obat Aborsi Hongkong ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Hongkong ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
 
Unit 3 Emotional Intelligence and Spiritual Intelligence.pdf
Unit 3 Emotional Intelligence and Spiritual Intelligence.pdfUnit 3 Emotional Intelligence and Spiritual Intelligence.pdf
Unit 3 Emotional Intelligence and Spiritual Intelligence.pdf
 
FSB Advising Checklist - Orientation 2024
FSB Advising Checklist - Orientation 2024FSB Advising Checklist - Orientation 2024
FSB Advising Checklist - Orientation 2024
 
HMCS Vancouver Pre-Deployment Brief - May 2024 (Web Version).pptx
HMCS Vancouver Pre-Deployment Brief - May 2024 (Web Version).pptxHMCS Vancouver Pre-Deployment Brief - May 2024 (Web Version).pptx
HMCS Vancouver Pre-Deployment Brief - May 2024 (Web Version).pptx
 
Graduate Outcomes Presentation Slides - English
Graduate Outcomes Presentation Slides - EnglishGraduate Outcomes Presentation Slides - English
Graduate Outcomes Presentation Slides - English
 
ICT role in 21st century education and it's challenges.
ICT role in 21st century education and it's challenges.ICT role in 21st century education and it's challenges.
ICT role in 21st century education and it's challenges.
 
Single or Multiple melodic lines structure
Single or Multiple melodic lines structureSingle or Multiple melodic lines structure
Single or Multiple melodic lines structure
 
Holdier Curriculum Vitae (April 2024).pdf
Holdier Curriculum Vitae (April 2024).pdfHoldier Curriculum Vitae (April 2024).pdf
Holdier Curriculum Vitae (April 2024).pdf
 
Beyond_Borders_Understanding_Anime_and_Manga_Fandom_A_Comprehensive_Audience_...
Beyond_Borders_Understanding_Anime_and_Manga_Fandom_A_Comprehensive_Audience_...Beyond_Borders_Understanding_Anime_and_Manga_Fandom_A_Comprehensive_Audience_...
Beyond_Borders_Understanding_Anime_and_Manga_Fandom_A_Comprehensive_Audience_...
 
80 ĐỀ THI THỬ TUYỂN SINH TIẾNG ANH VÀO 10 SỞ GD – ĐT THÀNH PHỐ HỒ CHÍ MINH NĂ...
80 ĐỀ THI THỬ TUYỂN SINH TIẾNG ANH VÀO 10 SỞ GD – ĐT THÀNH PHỐ HỒ CHÍ MINH NĂ...80 ĐỀ THI THỬ TUYỂN SINH TIẾNG ANH VÀO 10 SỞ GD – ĐT THÀNH PHỐ HỒ CHÍ MINH NĂ...
80 ĐỀ THI THỬ TUYỂN SINH TIẾNG ANH VÀO 10 SỞ GD – ĐT THÀNH PHỐ HỒ CHÍ MINH NĂ...
 
How to Add New Custom Addons Path in Odoo 17
How to Add New Custom Addons Path in Odoo 17How to Add New Custom Addons Path in Odoo 17
How to Add New Custom Addons Path in Odoo 17
 
NO1 Top Black Magic Specialist In Lahore Black magic In Pakistan Kala Ilam Ex...
NO1 Top Black Magic Specialist In Lahore Black magic In Pakistan Kala Ilam Ex...NO1 Top Black Magic Specialist In Lahore Black magic In Pakistan Kala Ilam Ex...
NO1 Top Black Magic Specialist In Lahore Black magic In Pakistan Kala Ilam Ex...
 
Application orientated numerical on hev.ppt
Application orientated numerical on hev.pptApplication orientated numerical on hev.ppt
Application orientated numerical on hev.ppt
 
Micro-Scholarship, What it is, How can it help me.pdf
Micro-Scholarship, What it is, How can it help me.pdfMicro-Scholarship, What it is, How can it help me.pdf
Micro-Scholarship, What it is, How can it help me.pdf
 
How to Create and Manage Wizard in Odoo 17
How to Create and Manage Wizard in Odoo 17How to Create and Manage Wizard in Odoo 17
How to Create and Manage Wizard in Odoo 17
 

Data manipulation on R

  • 1. Data Manipulation on R
      Factor Manipulations, Subset, Sorting and Reshape
      Abhik Seal
      Indiana University School of Informatics and Computing (dsdht.wikispaces.com)
  • 2. Basic Manipulating Data
      So far, we have covered how to read data into R from various sources (files, the internet and databases) and in various file formats. In this session we look at how to manipulate the data after reading it in, so that it is easier to process.
  • 3. Sorting and Ordering data
      sort(x, decreasing=FALSE): 'sort (or order) a vector or factor (partially) into ascending or descending order.'
      order(..., decreasing=FALSE): 'returns a permutation which rearranges its first argument into ascending or descending order, breaking ties by further arguments.'
      x <- c(1,5,7,8,3,12,34,2)
      sort(x)
      ## [1]  1  2  3  5  7  8 12 34
      order(x)
      ## [1] 1 8 5 2 3 4 6 7
  • 4. Some examples of sorting and ordering
      # sort by mpg (columns are referenced as mtcars$mpg, since mpg alone is not visible unless mtcars is attached)
      newdata <- mtcars[order(mtcars$mpg),]
      head(newdata,3)
      ##                      mpg cyl disp  hp drat    wt  qsec vs am gear carb
      ## Cadillac Fleetwood  10.4   8  472 205 2.93 5.250 17.98  0  0    3    4
      ## Lincoln Continental 10.4   8  460 215 3.00 5.424 17.82  0  0    3    4
      ## Camaro Z28          13.3   8  350 245 3.73 3.840 15.41  0  0    3    4
      # sort by mpg and cyl (cyl breaks ties in mpg)
      newdata <- mtcars[order(mtcars$mpg, mtcars$cyl),]
      head(newdata,3)
      ##                      mpg cyl disp  hp drat    wt  qsec vs am gear carb
      ## Cadillac Fleetwood  10.4   8  472 205 2.93 5.250 17.98  0  0    3    4
      ## Lincoln Continental 10.4   8  460 215 3.00 5.424 17.82  0  0    3    4
      ## Camaro Z28          13.3   8  350 245 3.73 3.840 15.41  0  0    3    4
  • 5. Ordering with plyr
      library(plyr)
      head(arrange(mtcars,mpg),3)
      ##    mpg cyl disp  hp drat    wt  qsec vs am gear carb
      ## 1 10.4   8  472 205 2.93 5.250 17.98  0  0    3    4
      ## 2 10.4   8  460 215 3.00 5.424 17.82  0  0    3    4
      ## 3 13.3   8  350 245 3.73 3.840 15.41  0  0    3    4
      head(arrange(mtcars,desc(mpg)),3)
      ##    mpg cyl disp hp drat    wt  qsec vs am gear carb
      ## 1 33.9   4 71.1 65 4.22 1.835 19.90  1  1    4    1
      ## 2 32.4   4 78.7 66 4.08 2.200 19.47  1  1    4    1
      ## 3 30.4   4 75.7 52 4.93 1.615 18.52  1  1    4    2
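Descending and mixed-direction sorts can also be done in base R. The following is a minimal sketch that uses only the built-in mtcars data; negating a numeric key inside order() is a common idiom, not something taken from the slides, and the object names are illustrative.

```r
# Sort by cyl ascending, then by mpg descending within each cyl group.
# Negating a numeric column reverses its direction inside order().
by_cyl_mpg <- mtcars[order(mtcars$cyl, -mtcars$mpg), ]
head(by_cyl_mpg[, c("mpg", "cyl")], 3)

# For a single key, the decreasing argument does the same job.
by_mpg_desc <- mtcars[order(mtcars$mpg, decreasing = TRUE), ]
head(by_mpg_desc[, c("mpg", "cyl")], 3)
```

The negation trick only works for numeric keys; for character or factor keys, decreasing = TRUE is the simpler route.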
  • 6. Subsetting data
      set.seed(12345)
      # create a data frame
      X <- data.frame("A"=sample(1:10), "B"=sample(11:20), "C"=sample(21:30))
      # shuffle the rows and add NA values
      X <- X[sample(1:10),]; X$B[c(1,6,10)] <- NA
      head(X)
      ##     A  B  C
      ## 8   4 NA 27
      ## 1   8 11 25
      ## 2  10 12 23
      ## 5   3 13 24
      ## 3   7 16 28
      ## 10  5 NA 26
  • 7. Basic data subsetting
      # accessing only the first row
      X[1,]
      ##   A  B  C
      ## 8 4 NA 27
      # accessing only the first column
      X[,1]
      ## [1]  4  8 10  3  7  5  9  1  2  6
      # accessing the first row and first column
      X[1,1]
      ## [1] 4
  • 8. And/OR's
      head(X[(X$A <=6 & X$C > 24),],3)
      ##    A  B  C
      ## 8  4 NA 27
      ## 10 5 NA 26
      ## 7  2 19 29
      head(X[(X$A <=6 | X$C > 24),],3)
      ##   A  B  C
      ## 8 4 NA 27
      ## 1 8 11 25
      ## 5 3 13 24
  • 9. Select non-NA values in a data frame
      # select the rows without NA values in column B
      # (is.na() tests for missing values; comparing with the string 'NA' does not)
      head(X[which(!is.na(X$B)),],4)
      ##    A  B  C
      ## 1  8 11 25
      ## 2 10 12 23
      ## 5  3 13 24
      ## 3  7 16 28
      # select the rows where B > 11
      head(X[which(X$B>11),],4)
      ##    A  B  C
      ## 2 10 12 23
      ## 5  3 13 24
      ## 3  7 16 28
      ## 4  9 20 30
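How missing values interact with row subsetting is easy to get wrong, so here is a small sketch on a hand-made data frame (the values are made up for illustration and are not the X generated with set.seed() in the slides). It compares a bare logical index with which(), subset(), is.na() and complete.cases(), all of which are base R.

```r
X <- data.frame(A = c(4, 8, 10, 3),
                B = c(NA, 11, 12, NA),
                C = c(27, 25, 23, 24))

# A bare logical index keeps one all-NA row for every NA in the condition.
with_na_rows <- X[X$B > 11, ]

# which() and subset() both drop rows where the condition is NA.
via_which  <- X[which(X$B > 11), ]
via_subset <- subset(X, B > 11)

# Explicit NA filters.
b_present <- X[!is.na(X$B), ]        # rows where B is not missing
complete  <- X[complete.cases(X), ]  # rows with no NA in any column
```

This is why the slides wrap their conditions in which(): without it, every NA in column B would show up as an extra row of NAs in the result.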
  • 10. # creating a data frame with 2 variables
      data <- data.frame(x1=c(2,3,4,5,6), x2=c(5,6,7,8,1))
      # storing the data frame and a vector together in a list
      list_data <- list(dat=data, vec.obj=c(1,2,3))
      list_data
      ## $dat
      ##   x1 x2
      ## 1  2  5
      ## 2  3  6
      ## 3  4  7
      ## 4  5  8
      ## 5  6  1
      ##
      ## $vec.obj
      ## [1] 1 2 3
      # accessing the second element of list_data
      list_data[[2]]
      ## [1] 1 2 3
  • 11. Factors
      Factors are used to represent categorical data, and can also be used for ordinal data (i.e. categories that have an intrinsic ordering).
      Note that before R 4.0.0, functions like read.table() read character strings in as factors by default (stringsAsFactors = TRUE); since R 4.0.0 the default is FALSE.
      'The function factor is used to encode a vector as a factor (the terms 'category' and 'enumerated type' are also used for factors). If argument ordered is TRUE, the factor levels are assumed to be ordered. For compatibility with S there is also a function ordered.'
      'is.factor, is.ordered, as.factor and as.ordered are the membership and coercion functions for these classes.'
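  • 12. Factors
      Suppose we have a vector of case-control status
      cc=factor(c("case","case","case","control","control","control"))
      cc
      ## [1] case    case    case    control control control
      ## Levels: case control
      levels(cc)=c("control","case")
      cc
      ## [1] control control control case    case    case
      ## Levels: control case
      Note that assigning to levels() renames the levels in place, so every "case" above has become "control" and vice versa. To change only the order of the levels and leave the data as they are, pass the levels argument to factor() instead.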
  • 13. Factors
      Factors can be converted to numeric or character very easily
      x=factor(c("case","case","case","control","control","control"),levels=c("control","case"))
      as.character(x)
      ## [1] "case"    "case"    "case"    "control" "control" "control"
      as.numeric(x)
      ## [1] 2 2 2 1 1 1
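Two factor pitfalls are worth a short sketch: reordering levels without relabelling the data, and converting a factor whose labels are numbers. The example below uses only base R; the vectors are made up for illustration.

```r
# Reordering levels without touching the data.
cc <- factor(c("case", "case", "control"))
cc2 <- relevel(cc, ref = "control")   # "control" becomes the first level
levels(cc2)
as.character(cc2)                     # the values themselves are unchanged

# Factors whose labels are numbers: as.numeric() returns the level codes.
f <- factor(c("10", "5", "20", "5"))
codes  <- as.numeric(f)                 # internal integer codes, not the labels
values <- as.numeric(as.character(f))   # the numbers the labels spell out
```

Going through as.character() first is the safe way to recover the original numbers, because as.numeric() on a factor only ever sees the integer codes.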
  • 14. Cut
      Now that we know more about factors, cut() will make more sense:
      x=1:100
      cx=cut(x,breaks=c(0,10,25,50,100))
      head(cx)
      ## [1] (0,10] (0,10] (0,10] (0,10] (0,10] (0,10]
      ## Levels: (0,10] (10,25] (25,50] (50,100]
      table(cx)
      ## cx
      ##   (0,10]  (10,25]  (25,50] (50,100]
      ##       10       15       25       50
  • 15. Cut
      We can also leave off the labels
      cx=cut(x,breaks=c(0,10,25,50,100),labels=FALSE)
      head(cx)
      ## [1] 1 1 1 1 1 1
      table(cx)
      ## cx
      ##  1  2  3  4
      ## 10 15 25 50
  • 16. Cut
      Values that fall outside the breaks become NA:
      cx=cut(x,breaks=c(10,25,50),labels=FALSE)
      head(cx)
      ## [1] NA NA NA NA NA NA
      table(cx)
      ## cx
      ##  1  2
      ## 15 25
      table(cx,useNA="ifany")
      ## cx
      ##    1    2 <NA>
      ##   15   25   60
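Because cut() builds intervals that are open on the left by default, a value equal to the lowest break is also turned into NA. A minimal sketch of the include.lowest and labels arguments follows; the input vector and the label names are hypothetical.

```r
x <- c(0, 10, 11, 100)
breaks <- c(0, 10, 25, 50, 100)

# Default: intervals are (a, b], so the value 0 falls outside and becomes NA.
default_bins <- cut(x, breaks = breaks)

# include.lowest = TRUE closes the first interval on the left: [0,10].
closed_bins <- cut(x, breaks = breaks, include.lowest = TRUE)

# labels accepts one name per interval (illustrative names here).
named_bins <- cut(x, breaks = breaks, include.lowest = TRUE,
                  labels = c("low", "mid", "high", "top"))
```

Checking table(..., useNA = "ifany") after a cut() call, as the slides do, is a quick way to spot values that fell outside the breaks.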
  • 17. Adding to data frames
      m1=matrix(1:9,nrow=3,ncol=3,byrow=FALSE)
      m1
      ##      [,1] [,2] [,3]
      ## [1,]    1    4    7
      ## [2,]    2    5    8
      ## [3,]    3    6    9
      m2=matrix(1:9,nrow=3,ncol=3,byrow=TRUE)
      m2
      ##      [,1] [,2] [,3]
      ## [1,]    1    2    3
      ## [2,]    4    5    6
      ## [3,]    7    8    9
  • 18. Adding using cbind
      You can add columns (or another matrix/data frame) to a data frame or matrix using cbind() ('column bind'). You can also add rows (or another matrix/data frame) using rbind() ('row bind'). Note that the vector you are adding has to have the same length as the number of rows (for cbind()) or the number of columns (for rbind()).
      cbind(m1,m2)
      ##      [,1] [,2] [,3] [,4] [,5] [,6]
      ## [1,]    1    4    7    1    2    3
      ## [2,]    2    5    8    4    5    6
      ## [3,]    3    6    9    7    8    9
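The length rule for cbind() and rbind() can be sketched as follows, first on a matrix and then on a data frame. The object and column names (stacked, widened, total, x3) are illustrative only.

```r
m1 <- matrix(1:9, nrow = 3)

# rbind(): the new row needs one value per column of m1.
stacked <- rbind(m1, c(10, 11, 12))

# cbind(): the new column needs one value per row of m1.
widened <- cbind(m1, total = rowSums(m1))

# Data frames: a column can be added by name, and rbind() matches on column names.
df <- data.frame(x1 = c(2, 3, 4), x2 = c(5, 6, 7))
df$x3 <- df$x1 + df$x2
df <- rbind(df, data.frame(x1 = 5, x2 = 8, x3 = 13))
```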
  • 19. Reshape data
      A dataset can be laid out in long or wide format. In the long layout, multiple rows represent a single subject's record, whereas in the wide layout, a single row represents a single subject's record. Some statistical analyses require wide data and others require long data, so we need to be able to reshape the data to meet the requirements of the analysis. Data reshaping is just a rearrangement of the form of the data; it does not change the content of the dataset.
      This section focuses on the melt and cast paradigm of reshaping datasets, which is implemented in the contributed package reshape. The package was later reimplemented under a new name, reshape2, which is much more time and memory efficient. See the paper Reshaping Data with the reshape Package by Wickham (http://www.jstatsoft.org/v21/i12/paper).
  • 20. Wide data has a column for each variable. For example, this is wide-format data:
      #   ozone   wind  temp
      # 1 23.62 11.623 65.55
      # 2 29.44 10.267 79.10
      # 3 59.12  8.942 83.90
      # 4 59.96  8.794 83.97
      Data in long format
      #    variable  value
      # 1     ozone 23.615
      # 2     ozone 29.444
      # 3     ozone 59.115
      # 4     ozone 59.962
      # 5      wind 11.623
      # 6      wind 10.267
      # 7      wind  8.942
      # 8      wind  8.794
      # 9      temp 65.548
      # 10     temp 79.100
      # 11     temp 83.903
      # 12     temp 83.968
  • 21. reshape2 Package
      "In reality, you need long-format data much more commonly than wide-format data. For example, ggplot2 requires long-format data, plyr requires long-format data, and most modelling functions (such as lm(), glm(), and gam()) require long-format data. But people often find it easier to record their data in wide format."
      reshape2 is based around two key functions, melt and cast:
      · melt takes wide-format data and melts it into long-format data.
      · cast takes long-format data and casts it into wide-format data.
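  • 22. Melt
      library(reshape2)
      # airquality ships with capitalised column names (Ozone, Solar.R, ...);
      # the examples here use lower-case names
      names(airquality) <- tolower(names(airquality))
      head(airquality,2)
      ##   ozone solar.r wind temp month day
      ## 1    41     190  7.4   67     5   1
      ## 2    36     118  8.0   72     5   2
      aql <- melt(airquality) # [a]ir [q]uality [l]ong format
      head(aql,5)
      ##   variable value
      ## 1    ozone    41
      ## 2    ozone    36
      ## 3    ozone    12
      ## 4    ozone    18
      ## 5    ozone    NA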
  • 23. By default, melt has assumed that all columns with numeric values are variables with values. Maybe here we want to know the values of ozone, solar.r, wind, and temp for each month and day. We can do that with melt by telling it that we want month and day to be "ID variables". ID variables are the variables that identify individual rows of data.
      m <- melt(airquality, id.vars = c("month", "day"))
      head(m,4)
      ##   month day variable value
      ## 1     5   1    ozone    41
      ## 2     5   2    ozone    36
      ## 3     5   3    ozone    12
      ## 4     5   4    ozone    18
  • 24. melt also allows us to control the column names in the long-format data
      m <- melt(airquality, id.vars = c("month", "day"),
                variable.name = "climate_variable",
                value.name = "climate_value")
      head(m)
      ##   month day climate_variable climate_value
      ## 1     5   1            ozone            41
      ## 2     5   2            ozone            36
      ## 3     5   3            ozone            12
      ## 4     5   4            ozone            18
      ## 5     5   5            ozone            NA
      ## 6     5   6            ozone            28
  • 25. Long- to wide-format data: the cast functions
      In reshape2 there are multiple cast functions. Since you will most commonly work with data.frame objects, we'll explore the dcast function. (There is also acast to return a vector, matrix, or array.) dcast uses a formula to describe the shape of the data.
      m <- melt(airquality, id.vars = c("month", "day"))
      aqw <- dcast(m, month + day ~ variable)
      head(aqw)
      ##   month day ozone solar.r wind temp
      ## 1     5   1    41     190  7.4   67
      ## 2     5   2    36     118  8.0   72
      ## 3     5   3    12     149 12.6   74
      ## 4     5   4    18     313 11.5   62
      ## 5     5   5    NA      NA 14.3   56
      ## 6     5   6    28      NA 14.9   66
      Here, we need to tell dcast that month and day are the ID variables. Besides re-arranging the columns, we've recovered our original data.
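For comparison, the same wide-to-long-to-wide round trip can be sketched with base R's reshape() function, which needs no extra package. This is an alternative to melt and dcast, not how reshape2 works internally, and the two-row data frame below is made up for illustration.

```r
wide <- data.frame(id = 1:2, ozone = c(41, 36), wind = c(7.4, 8.0))

# Wide to long: stack the ozone and wind columns into one "value" column,
# recording the source column name in "variable".
long <- reshape(wide, direction = "long",
                varying = list(c("ozone", "wind")),
                v.names = "value",
                timevar = "variable",
                times   = c("ozone", "wind"),
                idvar   = "id")

# Long to wide: one column per level of "variable", named value.<level>.
wide2 <- reshape(long, direction = "wide",
                 idvar = "id", timevar = "variable", v.names = "value")
```

The reshape2 interface is usually easier to read; the base function is mainly useful when installing a package is not an option.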
  • 26. Data Manipulation Using plyr
      For large-scale data, we can split the dataset, perform the manipulation or analysis on each piece, and then combine the results into a single output again. Doing this kind of split with base R alone is not very efficient. To overcome this limitation, Wickham, in 2011, developed an R package called plyr, in which he efficiently implemented the split-apply-combine strategy. We can compare this strategy to the map-reduce strategy for processing large amounts of data.
      In the coming slides I will give examples of the split-apply-combine strategy:
      · Without loops
      · With loops
      · Using the plyr package
  • 27. Without loops
      I am using the iris dataset here.
      1. Split the iris dataset into three parts.
      2. Remove the species name variable from the data.
      3. Calculate the mean of each variable for the three different parts separately.
      4. Combine the output into a single data frame.
      # splitting by species and dropping the Species column (the split step)
      iris.set <- iris[iris$Species=="setosa",-5]
      iris.versi <- iris[iris$Species=="versicolor",-5]
      iris.virg <- iris[iris$Species=="virginica",-5]
      # calculating mean for each piece (the apply step)
      mean.set <- colMeans(iris.set)
      mean.versi <- colMeans(iris.versi)
      mean.virg <- colMeans(iris.virg)
      # combining the output (the combine step)
      mean.iris <- rbind(mean.set,mean.versi,mean.virg)
      # giving row names so that the output could be easily understood
      rownames(mean.iris) <- c("setosa","versicolor","virginica")
  • 28. With Loops
      mean.iris.loop <- NULL
      for(species in unique(iris$Species)) {
        iris_sub <- iris[iris$Species==species,]
        column_means <- colMeans(iris_sub[,-5])
        mean.iris.loop <- rbind(mean.iris.loop,column_means)
      }
      # giving row names so that the output could be easily understood
      rownames(mean.iris.loop) <- unique(iris$Species)
      NB: In the split-apply-combine strategy, each piece should be independent of the others. The strategy won't work if one piece depends on another.
  • 29. Using plyr
      library(plyr)
      ddply(iris, ~Species, function(x) colMeans(x[, -which(colnames(x)=="Species")]))
      ##      Species Sepal.Length Sepal.Width Petal.Length Petal.Width
      ## 1     setosa        5.006       3.428        1.462       0.246
      ## 2 versicolor        5.936       2.770        4.260       1.326
      ## 3  virginica        6.588       2.974        5.552       2.026
      mean.iris.loop
      ##            Sepal.Length Sepal.Width Petal.Length Petal.Width
      ## setosa            5.006       3.428        1.462       0.246
      ## versicolor        5.936       2.770        4.260       1.326
      ## virginica         6.588       2.974        5.552       2.026
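The three steps of split-apply-combine can also be written explicitly with base R's split(), lapply() and do.call(). The sketch below is a base-R counterpart to the ddply() call, not a description of how plyr is implemented; the object names are illustrative.

```r
# Split: one data frame per species, with the Species column dropped.
pieces <- split(iris[, -5], iris$Species)

# Apply: column means for each piece.
means <- lapply(pieces, colMeans)

# Combine: stack the named vectors into a matrix, one row per species.
mean.iris.split <- do.call(rbind, means)

# aggregate() wraps the same three steps in a single call.
mean.iris.agg <- aggregate(. ~ Species, data = iris, FUN = mean)
```

Because split() names each piece after its species, the combined matrix gets its row names for free, with no separate rownames() step.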
  • 30. Merging data frames
      # Make a data frame mapping story numbers to titles
      stories <- read.table(header=TRUE, text='
        storyid  title
         1       lions
         2      tigers
         3       bears
      ')
      # Make another data frame with the data and story numbers (no titles)
      data <- read.table(header=TRUE, text='
        subject storyid rating
              1       1    6.7
              1       2    4.5
              1       3    3.7
              2       2    3.3
              2       3    4.1
              2       1    5.2
      ')
  • 31. Merge the two data frames
      merge(stories, data, "storyid")
      ##   storyid  title subject rating
      ## 1       1  lions       1    6.7
      ## 2       1  lions       2    5.2
      ## 3       2 tigers       1    4.5
      ## 4       2 tigers       2    3.3
      ## 5       3  bears       1    3.7
      ## 6       3  bears       2    4.1
      If the two data frames have different names for the columns you want to match on, the names can be specified:
      # In this case, the column is named 'id' instead of storyid
      stories2 <- read.table(header=TRUE, text='
        id       title
         1       lions
         2      tigers
         3       bears
      ')
      merge(x=stories2, y=data, by.x="id", by.y="storyid")
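By default merge() keeps only the keys that appear in both data frames. The sketch below shows how the all.x and all arguments change that when some keys have no match; the data are made up for illustration (story 5 has a rating but no title, stories 3 and 4 have titles but no ratings).

```r
stories <- data.frame(storyid = 1:4,
                      title = c("lions", "tigers", "bears", "wolves"))
ratings <- data.frame(subject = c(1, 1, 2),
                      storyid = c(1, 2, 5),
                      rating  = c(6.7, 4.5, 3.3))

inner <- merge(stories, ratings, by = "storyid")                # matched keys only
left  <- merge(stories, ratings, by = "storyid", all.x = TRUE)  # keep every story
full  <- merge(stories, ratings, by = "storyid", all = TRUE)    # keep every key
```

Unmatched rows are filled with NA, so it is worth checking the row count of the result against both inputs after a merge.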
  • 32. Resources and Materials used
      · Data Manipulation with R by Phil Spector
      · Getting and Cleaning Data Coursera course
      · plyr by Hadley Wickham
      · Andrew Jaffe notes
      · R cookbook