This document serves as a type-and-run handout to an R learner.
2. How to Use This Document
This document is not meant to serve as a formal language definition or an
exhaustive guide to R. Rather, its purpose is to provide a firm understanding of the
building blocks of R so that the knowledge can be applied to various use cases.
A direct result of this approach is that many of the slides here have illustrative
examples that the user must type into the R console.
3. What is R?
R is the GNU implementation of the S language, which was developed by John Chambers, Rick Becker
and Allan Wilks at AT&T Bell Labs (where the C and C++ languages were born).
The commercial implementation of S is called ‘S-PLUS’, while the free, copyleft-licensed
implementation is known as R. R was developed by Ross Ihaka and Robert
Gentleman at the University of Auckland.
R is an integrated suite of software facilities for data manipulation, calculation and graphical
display. Among other things it has,
● an effective data handling and storage facility,
● a suite of operators for calculations on arrays, in particular matrices,
● a large, coherent, integrated collection of intermediate tools for data analysis,
● graphical facilities for data analysis and display either directly at the computer or on hard-copy, and,
● a well developed, simple and effective programming language (called ‘S’) which includes conditionals, loops, user defined recursive functions and input and
output facilities. (Indeed most of the system supplied functions are themselves written in the S language.)
4. Outline of This Presentation
● Objects in R
○ Basic Object Types
○ Extending Basic Objects via Attributes
● Operations
○ Arithmetic, Assignment, Relational, Logical, and Special
○ Filtering Data via Vector Indexing
○ Multidimensional (Homogeneous Data) - Arrays and Matrices
○ Heterogeneous Data - Lists and Data Frames
○ Aggregation
● Flow Control
○ Conditional
○ Repetition
○ Jumps
6. Everything is an Object in R
R deals with data stored in memory. Even data from external sources must be
loaded into computer memory before they can be manipulated.
R does not provide direct access to the computer’s memory. Rather, R provides a
number of specialized data structures referred to as objects.
These objects are referred to through symbols (variables). The symbols
are themselves objects and can be manipulated in the same way as any
other object.
Indeed, everything in R is an object, including executable code (functions).
This is crucial to understanding and mastering R.
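As a small illustration of the point above, the sketch below stores a function in a variable, passes it as an argument, and captures a symbol as an object. The names square and apply_twice are made up for this example and are not part of R.

```r
# A function is an object: it can be assigned, passed around, and inspected.
square <- function(n) n^2               # hypothetical helper for this sketch
apply_twice <- function(f, n) f(f(n))   # takes a function as an argument

is_fun <- is.function(square)           # TRUE
result <- apply_twice(square, 3)        # (3^2)^2 = 81

# A symbol is an object too: quote() captures it without evaluating it.
sym      <- quote(square)
sym_type <- typeof(sym)                 # "symbol"
fun_type <- typeof(square)              # "closure"
```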
8. The 12 Basic Object Types in R
Vectors (‘atomic vectors’)
Lists (‘recursive vectors’)
Language objects
Expression objects
Function objects
NULL
Built-in objects and special forms
Promise objects
Dot-dot-dot
Environments
Pairlist objects
The “Any” type
9. Vector Objects
Vectors can be thought of as contiguous cells containing data. They are usually
accessed through indexing operations such as x[5]. Indexing is a bit more involved
in R because it includes filtering as well. More on this later.
Vectors must have their values all of the same mode. Thus any given vector must
be unambiguously either logical, numeric, complex, character or raw.
Numerical literals such as 42, 1e3, (-6.5), as well as character strings such as “Hello,
world” are vectors of length 1. Zero-length vectors are also possible.
R has six basic (‘atomic’) vector types: logical, integer, real, complex, character (in C
aka ‘string’) and raw. In addition, R has list vectors.
10. The Six Atomic Vector Types in R
Type     typeof     mode       storage.mode
Logical  logical    logical    logical
Integer  integer    numeric    integer
Real     double     numeric    double
Complex  complex    complex    complex
String   character  character  character
Raw      raw        raw        raw
Single numbers, such as 4.2, and strings, such as "four point two" are still vectors, of length 1.
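The table above can be reproduced from the console. The following is a minimal sketch; the variable names samples and type_table are chosen for this example only.

```r
# One value of each atomic type.
samples <- list(
  Logical = TRUE,
  Integer = 1L,
  Real    = 1.5,
  Complex = 1 + 2i,
  String  = "a",
  Raw     = as.raw(1)
)

# One row per type, one column per inspection function.
type_table <- data.frame(
  typeof       = sapply(samples, typeof),
  mode         = sapply(samples, mode),
  storage.mode = sapply(samples, storage.mode),
  stringsAsFactors = FALSE
)
print(type_table)
```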
11. List Objects
Lists are ordered collections of elements, each of which can contain any type of R
object. List elements can be heterogeneous. List elements are accessed through
three different indexing operations.
Lists are vectors as well. To distinguish basic vectors from lists, basic vectors are
usually referred to as ‘atomic vectors’, and lists are referred to as ‘recursive vectors’
(since the elements of a list themselves can be lists).
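The three indexing operations for lists are [ ], [[ ]] and $. A short sketch follows, using a made-up list emp with named elements (the names and values are assumptions for illustration).

```r
emp <- list(name = "Sam", age = 34L, skills = list("R", "SQL"))

sub_list <- emp["name"]      # [ ]   returns a list containing the element
value    <- emp[["name"]]    # [[ ]] returns the element itself
same     <- emp$name         # $     is shorthand for [[ ]] with a name

# Lists are recursive: an element can itself be a list.
first_skill <- emp$skills[[1]]
```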
12. Language Objects
Three types of objects constitute the R language itself: calls, expressions, and names.
Confusingly, R also has a separate object type called "expression".
Unlike arrays and matrices, which R builds from plain vectors, formulae for modeling are
handled intrinsically through language objects.
13. Using the Language Object to Create Formulas
Unlike arrays and matrices, R provides an
intrinsic way to handle modeling or formulae via
the language object.
class(fo <- y ~ x1*x2) # "formula"
fo
typeof(fo) # R internal : "language"
terms(fo)
environment(fo)
environment(as.formula("y ~ x"))
environment(as.formula("y ~ x", env =
new.env()))
14. Function Objects
In R, functions are also objects and can be manipulated in much the same way as
any other object. Functions (or more precisely, function closures) have three basic
components: a formal argument list, a body and an environment.
R also has braced blocks of expressions. A block is delimited by braces, {}, and,
unlike a function, has only a body. A block does not create an environment of its own, so symbols
assigned within it belong to the environment in which the block is evaluated. (Such a block
is not a closure: in R, ‘closure’ means a function together with its environment.)
Operators in R are functions as well. This feature will become
important in OOP.
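The components of a function can be inspected directly with formals(), body() and environment(). The sketch below uses a made-up function add_n; it also shows an operator being called as an ordinary function and a braced block assigning into the current environment.

```r
add_n <- function(x, n = 1) {
  x + n
}

f_args <- names(formals(add_n))   # the formal argument names
f_body <- body(add_n)             # the braced expression { x + n }
f_env  <- environment(add_n)      # the environment where add_n was defined

# Operators are functions: the two calls below are equivalent.
infix  <- 2 + 3
prefix <- `+`(2, 3)

# A braced block has no environment of its own,
# so block_var is created in the current environment.
{
  block_var <- 10
}
block_visible <- exists("block_var")
```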
15. Built-in Objects (and Special Forms)
These two kinds of object contain the builtin functions of R, i.e., those that are
displayed as .Primitive in code listings (as well as those accessed via the .Internal
function and hence not user-visible as objects). The difference between the two lies
in the argument handling. Builtin functions have all their arguments evaluated and
passed to the internal function, in accordance with call-by-value, whereas special
functions pass the unevaluated arguments to the internal function.
From the R language, these objects are just another kind of function. The
is.primitive function can distinguish them from interpreted functions.
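The distinction can be checked from the console with typeof() and is.primitive(). A brief sketch (the vector names kinds and prim are chosen for this example):

```r
kinds <- c(
  sum   = typeof(sum),     # "builtin": arguments are evaluated before the call
  quote = typeof(quote),   # "special": arguments are passed unevaluated
  mean  = typeof(mean)     # "closure": an ordinary interpreted function
)

prim <- c(
  sum   = is.primitive(sum),
  quote = is.primitive(quote),
  mean  = is.primitive(mean)
)
```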
16. Environment Objects
Environments can be thought of as consisting of two things. A frame, consisting of a
set of symbol-value pairs, and an enclosure, a pointer to an enclosing environment.
When R looks up the value for a symbol the frame is examined and if a matching
symbol is found its value will be returned. If not, the enclosing environment is then
accessed and the process repeated.
Environments are created implicitly by function calls.
Environments form a tree structure in which the enclosures play the role of parents.
The tree of environments is rooted in an empty environment, available through
emptyenv(), which has no parent. It is the direct parent of the environment of the
base package (available through the baseenv() function).
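The lookup rule described above can be observed with nested functions and with explicitly created environments. This is a minimal sketch; outer_fn, inner_fn, parent_env and child_env are hypothetical names.

```r
x <- "global"
outer_fn <- function() {
  y <- "outer"
  inner_fn <- function() {
    # Neither x nor y is in inner_fn's own frame, so R follows the enclosures.
    paste(x, y)
  }
  inner_fn()
}
found <- outer_fn()

# Explicit environments: child_env has parent_env as its enclosure.
parent_env <- new.env()
child_env  <- new.env(parent = parent_env)
assign("z", 42, envir = parent_env)

z_from_child <- get("z", envir = child_env)                       # found via the enclosure
z_in_frame   <- exists("z", envir = child_env, inherits = FALSE)  # not in child's own frame

# The chain of enclosures ends at the empty environment.
root_ok <- identical(parent.env(baseenv()), emptyenv())
```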
17. Promise Objects
Promise objects are part of R’s lazy evaluation mechanism. They contain three slots: a
value, an expression, and an environment.
When a function is called the arguments are matched and then each of the formal
arguments is bound to a promise. The expression that was given for that formal argument
and a pointer to the environment the function was called from are stored in the promise.
Until that argument is accessed there is no value associated with the promise. When the
argument is accessed, the stored expression is evaluated in the stored environment, and
the result is returned. The result is also saved by the promise. The substitute function will
extract the content of the expression slot. This allows the programmer to access either the
value or the expression associated with the promise.
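Lazy evaluation can be observed from interpreted code. In the sketch below, show_lazy and ignore_arg are made-up functions: the first reads both the expression and the value of its argument, the second never touches its argument, so the argument is never evaluated.

```r
show_lazy <- function(arg) {
  expr_text <- deparse(substitute(arg))   # the expression slot, unevaluated
  value     <- arg                        # accessing arg forces the promise
  list(expression = expr_text, value = value)
}
res <- show_lazy(1 + 2)

# An argument that is never accessed is never evaluated,
# so the stop() call below does not raise an error.
ignore_arg <- function(unused) "done"
out <- ignore_arg(stop("never evaluated"))
```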
18. Pairlist Objects
The use of pairlists is deprecated since generic vectors are usually more efficient to
use. When an internal pairlist is accessed from R it is generally (including when
subsetted) converted to a generic vector.
19. NULL, Any, and dot-dot-dot (...) Objects
There is a special object called NULL. It is used whenever there is a need to indicate or specify that an
object is absent. It should not be confused with a vector or list of zero length. The NULL object has no type
and no modifiable properties. There is only one NULL object in R, to which all instances refer. To test for
NULL use is.null. You cannot set attributes on NULL.
It is not really possible for an object to be of “Any” type, but it is nevertheless a valid type value. It gets used
in certain (rather rare) circumstances, e.g. as.vector(x, "any"), indicating that type coercion should not be
done.
The ... object type is stored as a type of pairlist. The components of ... can be accessed in the usual pairlist
manner from C code, but ... is not easily accessed as an object in interpreted code, and even the existence
of such an object should typically not be assumed, as that may change in the future. If a function has ... as a
formal argument then any actual arguments that do not match a formal argument are matched with ‘...’.
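A short sketch of NULL tests and of capturing the arguments matched to ‘...’ follows. The function count_extra is hypothetical; list(...) is the usual way to collect the extra arguments in interpreted code.

```r
n_is_null     <- is.null(NULL)      # TRUE
len_null      <- length(NULL)       # 0
empty_is_null <- is.null(list())    # FALSE: an empty list is not NULL

count_extra <- function(x, ...) {
  extras <- list(...)               # collect the arguments matched to ...
  length(extras)
}
n_extra <- count_extra(1, 2, "a", flag = TRUE)
```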
21. Attributes of Objects
In R, every object has at least a few attributes. These include the length, mode,
class, type, and storage mode. In addition, there are others such as dim and names.
Attributes tell R to interpret and handle an object in a specific way.
For example, a list object with a class attribute of “data.frame” will be handled
differently than a list. A vector having a class attribute of “factor” will be printed
differently. The names attribute of a list will make it possible to access elements by
name.
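The effect of setting attributes can be tried directly. This is a minimal sketch with made-up data; note that attributes() lists only the attributes set on top of the intrinsic ones (such as length and mode), so it returns NULL for a plain vector.

```r
x <- 1:6
attr_before <- attributes(x)        # NULL: nothing set beyond the intrinsic properties

dim(x) <- c(2, 3)                   # setting dim makes R treat x as a matrix
is_mat <- is.matrix(x)

scores <- c(10, 20, 30)
names(scores) <- c("a", "b", "c")   # names allow access by name
b_val <- scores[["b"]]

f <- factor(c("lo", "hi", "lo"))
f_attrs <- names(attributes(f))     # a factor carries levels and class attributes
```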
22. Common Attributes: mode, length, and class
By the mode of an object we mean the basic type of its fundamental constituents. This is a special
case of a “property” of an object.
Another property of every object is its length. The functions mode(object) and length(object) can
be used to find out the mode and length of any defined structure.
All objects in R have a class, reported by the function class. For simple vectors this is just the
mode, for example "numeric", "logical", "character" or "list", but "matrix", "array", "factor" and
"data.frame" are other possible values.
A special attribute known as the class of the object is used to allow for an object-oriented style of
programming in R. For example, if an object has class "data.frame", it will be printed in a certain
way, the plot() function will display it graphically in a certain way, and other so-called generic
functions such as summary() will react to it as an argument in a way sensitive to its class.
24. Data Types and Operators
● Atomic Vectors
○ Integer
○ Numeric
○ Complex
○ Character
○ Logical
● List Vectors
● Special Types
○ Matrices
○ Arrays
○ Factors
○ Data frames
● Arithmetic
○ Addition, subtraction, multiplication, exponentiation, division, integer division, modulus
● Assignment
○ x = value, x <- value, value -> x, x <<- value, value ->> x
● Relational
○ <, >, >=, <=, ==, !=
● Logical
○ &&, ||, !, &, |
● Special
○ :, %in%
25. Scalar Arithmetic
> 2 + 2
[1] 4 # the [1] is the index of the first element printed on this line; the answer is a one-element vector
> 3-2
[1] 1
> 5*2
[1] 10
> 6^2 # six to the power 2
[1] 36
> 6**2 # six to the power 2; ** is accepted as a synonym for ^, as in Python
[1] 36
> 5/2 # five divided by 2
[1] 2.5 # returns a decimal
> 5%%2 # five modulus 2
[1] 1 # returns remainder after division
> 5%/%2 # five integer-division 2
[1] 2 # returns quotient of division
> 5 / 0 # division by zero
[1] Inf # returns Inf
> 0 / 0 # zero divided by zero
[1] NaN # returns NaN (Not a Number)
26. NaN and Inf
# mixed operations: all mathematical operators return NaN
> Inf + NaN
[1] NaN
> NaN +1 # addition/subtraction of a scalar to NaN returns NaN
[1] NaN
> NaN -1
[1] NaN
> NaN + NaN # same with *
[1] NaN
> NaN - NaN # same with /
[1] NaN
> Inf + 1
[1] Inf
> Inf - 1
[1] Inf
> Inf + Inf # same with *
[1] Inf
# but subtraction and division between Inf returns NaN
> Inf - Inf # same with /
[1] NaN
27. A Summary of NaN and Inf in R
A op B       B = 0    B = 1    B = (-1)  B = NaN  B = Inf
A = 0     +  0        1        (-1)      NaN      Inf
          -  0        (-1)     1         NaN      (-Inf)
          *  0        0        0         NaN      NaN
          /  NaN      0        0         NaN      0
A = 1     +  1        2        0         NaN      Inf
          -  1        0        2         NaN      (-Inf)
          *  0        1        (-1)      NaN      Inf
          /  Inf      1        (-1)      NaN      0
A = (-1)  +  (-1)     0        (-2)      NaN      Inf
          -  (-1)     (-2)     0         NaN      (-Inf)
          *  0        (-1)     1         NaN      (-Inf)
          /  (-Inf)   (-1)     1         NaN      0
A = NaN   +  NaN      NaN      NaN       NaN      NaN
          -  NaN      NaN      NaN       NaN      NaN
          *  NaN      NaN      NaN       NaN      NaN
          /  NaN      NaN      NaN       NaN      NaN
A = Inf   +  Inf      Inf      Inf       NaN      Inf
          -  Inf      Inf      Inf       NaN      NaN
          *  NaN      Inf      (-Inf)    NaN      Inf
          /  Inf      Inf      (-Inf)    NaN      NaN
28. Vectors: Addition & Subtraction
> c(1, 3, 4, 7) # c() stands for ‘combine’ - note that it’s a lowercase c, since R is case sensitive
[1] 1 3 4 7 # the result is a 1-d vector of 4 elements
> c(1, 3, 4, 7) + c(2, 3, 5, 8) # vector addition, equal length: corresponding elements added
[1] 3 6 9 15
> c(12, 15, 28, 74) + c (2, 8) # unequal addition: smaller vector is recycled & added
[1] 14 23 30 82
> c(15, 18, 21) + 5 # same logic applies to scalars - scalars are treated as one-element vectors
[1] 20 23 26
> c(91, 90, 76, 54, 23) - c(2, 3) # unequal lengths where larger vector length is not a multiple of the smaller vector length - smaller vector is recycled -
with a warning
[1] 89 87 74 51 21
Warning message:
In c(91, 90, 76, 54, 23) - c(2, 3) :
longer object length is not a multiple of shorter object length
29. Vectors: Multiplication & Division
> c(1, 3, 5) * 2 # multiplication by scalar
[1] 2 6 10
> c(1, 3, 5) / 3 # division by a scalar
[1] 0.3333333 1.0000000 1.6666667
> c(1, 3, 5) * c(2, 4, 6) # multiplication of two vectors with equal lengths - element-wise multiplication
[1] 2 12 30
> c(1, 3, 5, 7, 9) * c(2, 5) # multiplication of two vectors with unequal lengths - smaller vector is recycled, just like in addition
[1] 2 15 10 35 18
Warning message:
In c(1, 3, 5, 7, 9) * c(2, 5) :
longer object length is not a multiple of shorter object length
> c(1, 3, 5, 7, 9, 11) / c(5, 10) # division by a vector
[1] 0.2 0.3 1.0 0.7 1.8 1.1
30. Logical Arithmetic
> TRUE
[1] TRUE
> FALSE
[1] FALSE
> TRUE || FALSE # logical OR
[1] TRUE
> TRUE && FALSE # logical AND
[1] FALSE
> ! FALSE # logical NOT
[1] TRUE
> TRUE + 1 # TRUE coerced to a numeric value
[1] 2
> TRUE + FALSE # logicals coerced to numeric values
[1] 1
> TRUE * FALSE
[1] 0
> TRUE / TRUE
[1] 1
> TRUE/ FALSE # same rules apply
[1] Inf
> 1 && 1 # the && operator will coerce the ‘1’ into TRUE.
[1] TRUE
31. Logical Comparisons
> 2 = 3 # caveat: = is not equality
Error in 2 = 3 : invalid (do_set) left-hand side to assignment
> 2==3 # == is the equality comparison
[1] FALSE
> 2!=3
[1] TRUE
# this will throw an error, since & evaluates both operands,
# regardless of the first comparison being sufficient
> (2 > 3) & (5=6)
Error in 5 = 6 : invalid (do_set) left-hand side to assignment
# but the following will not; && will ‘short-circuit’ and return
> (2 > 3) && (5=6)
[1] FALSE
# but if the first comparison is inconclusive, then the second will be
# evaluated, throwing an error
> (2 < 3) && (5=6)
Error in 5 = 6 : invalid (do_set) left-hand side to assignment
# same goes for the | and || operators
> (2 < 3) | (5=6)
Error in 5 = 6 : invalid (do_set) left-hand side to assignment
> (2 < 3) || (5=6)
[1] TRUE
32. Missing Value: NA
> b = NA # missing value marker
> b
[1] NA
> class(b)
[1] "logical"
> b + 1
[1] NA
> b - 1
[1] NA
> b + TRUE
[1] NA
> b || TRUE
[1] TRUE
> b && TRUE
[1] NA
> b && FALSE
[1] FALSE
> b || FALSE
[1] NA
# you cannot check equality / inequality of NA
> NA==NA
[1] NA
> NA!=NA
[1] NA
33. Comparisons involving Inf, NaN, and NA
# Use of is.na() function. NaN==NA returns NA (not TRUE or FALSE), but:
> is.na(NaN)
[1] TRUE
> is.na(Inf)
[1] FALSE
# Use of is.nan() function.
> is.nan(NaN)
[1] TRUE
> is.nan(Inf)
[1] FALSE
> is.na(NA)
[1] TRUE
> Inf < NaN
[1] NA
> Inf == NaN
[1] NA
> Inf == Inf
[1] TRUE
> NaN == NaN
[1] NA
> Inf == NA
[1] NA
> NaN == NA
[1] NA
34. Logical Vectors
> x = TRUE; y = FALSE # TRUE & FALSE are boolean literals
> x
[1] TRUE
> x && y # logical AND, short-circuit version
[1] FALSE
> c( x && y, x || y, !x, !y) # logical AND, OR (short-circuit versions) and NOT
[1] FALSE TRUE FALSE TRUE
# simple comparisons
> 2 >3
[1] FALSE
> 3 > 3
[1] FALSE
> 3>=3
[1] TRUE
> 2<=4
[1] TRUE
> v = 1:7 # simple sequence
> v
[1] 1 2 3 4 5 6 7
> v > 3 # elementwise comparison of vector with scalar
[1] FALSE FALSE FALSE TRUE TRUE TRUE TRUE
> d = c(2,3) # another vector, with non-matching length
> v > d
[1] FALSE FALSE TRUE TRUE TRUE TRUE TRUE
Warning message:
In v > d : longer object length is not a multiple of shorter object
length
> v[v>3] # since v>3 is a 7-element boolean vector, we can use it to
filter elements
[1] 4 5 6 7 # only the elements for which v>3 is TRUE are fetched
> v[TRUE] # for the sake of demonstration
[1] 1 2 3 4 5 6 7
> v[FALSE]
integer(0)
35. Built-in Functions
> exp(1) # R comes with a lot of built-in functions of the form f(...)
[1] 2.718282
> log(2) # function called with one argument; the second argument defaults to e
[1] 0.6931472
> log(2, 10) # second argument is ‘base’, and it’s matched positionally
[1] 0.30103
> log( 2, base = 10) # alternatively, the second argument can be matched by name
[1] 0.30103
> log( x = 2, base = 10) # both arguments matched by name: order doesn’t matter here, e.g. log( base = 10, x = 2 ) is identical
[1] 0.30103
> log( base = 10, 2) # R will first match the named argument, and the
unnamed arguments will be matched positionally
[1] 0.30103
# in fact, all operators like +, -, *, ^ are functions, and R calls these
functions under-the-hood when operators are used in expressions.
# summary functions
length(x) - number of elements in x
sum(x) - sum of elements in x
mean (x) - mean of elements in x
min(x) - minimum of elements in x
max(x) - maximum of elements in x
range(x) - returns a 2-element vector of c( min(x), max(x) )
var(x) - sample variance of elements in x
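The summary functions listed above can be tried on a small vector. This is a sketch with made-up data; the names x, stats and rng are chosen for this example.

```r
x <- c(4, 8, 15, 16, 23, 42)

stats <- c(
  length = length(x),   # number of elements
  sum    = sum(x),
  mean   = mean(x),
  min    = min(x),
  max    = max(x),
  var    = var(x)       # sample variance (divides by n - 1)
)
rng <- range(x)         # c(min(x), max(x))
print(stats)
```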
36. Numeric Sequences
> 1:10 # basic sequence
[1] 1 2 3 4 5 6 7 8 9 10
> 8:-2 # backward sequence
[1] 8 7 6 5 4 3 2 1 0 -1 -2
> seq( 2, 8) # the seq() function
[1] 2 3 4 5 6 7 8
> seq( from = 2, to = 8) # equivalent to above, named arguments
[1] 2 3 4 5 6 7 8
> seq( from = 2, to = 8, by = 2) # stepping parameter
[1] 2 4 6 8
> seq( from = 2, to = 8, by = 4) # when range is not a multiple of step
size, end value may not be included
[1] 2 6
> seq(8) # if only one positive number n is given, the sequence runs from 1 to n
[1] 1 2 3 4 5 6 7 8
> v <- seq(1, 8, 2) # create sequence
> v
[1] 1 3 5 7
> 5 %in% v # the %in% operator
[1] TRUE
> 4 %in% v
[1] FALSE
37. Character Vectors
> "Hello" # string literals are treated as 1-d character vectors
[1] "Hello"
> 'This is a string, too' # they can be enclosed in single quotes, too: note how the R console delimits strings by double quotes, regardless
[1] "This is a string, too"
> c("a", "b") # you can have character vectors as well: note how c() combines, not concatenates
[1] "a" "b" # note how the result is a 2-element vector
> paste("a", "b") # for concatenation, you need to call the ‘paste’ function
[1] "a b" # now the result is a 1-element vector. Note the space between: this is the default separator for paste
> paste("a", "b", sep = "") # let’s override the default one-space separator with a zero-length string
[1] "ab" # now it’s a proper concatenation
> paste(2) # note how a number is converted into a character vector
[1] "2"
> paste( c(1,2)) # vectors are converted not concatenated
[1] "1" "2"
38. Complex & Integer Vectors
> a <- 2+3i # the symbol ‘i’ when placed after a numeric denotes a
complex number 0+1i, i being the square root of (-1)
> b <- 5-4i
> a+b
[1] 7-1i
> a-b
[1] -3+7i
> a*b
[1] 22+7i
> a/b
[1] -0.0487805+0.5609756i
> class(a) # the function ‘class’ retrieves what storage class this
variable is
[1] "complex"
> b = 20L # the suffix L tells R that this is of class integer
> class(b)
[1] "integer"
> b = 2.5L # trying to store numeric by force
Warning message:
integer literal 2.5L contains decimal; using numeric value
> class(b)
[1] "numeric"
39. Symbols (Variables) & Assignment
> assign("v", 1) # basic assignment; the variable name is given as a string
> # notice no output
> v <- 1 # syntactic shortcut
> # notice no output
> v # now type the variable name
[1] 1 # and you see the value
> v = 5 # alternative assignment method
> v # same as <- operator, except in the following case
[1] 5
> sin( x = 5) # passes the value 5 to the x parameter of the sin function
[1] -0.9589243
> sin( x <- 5) # creates a new variable x, assigns 5 to it; the whole expression evaluates to 5, which is then passed to sin
[1] -0.9589243 # the difference is, using = didn’t create a new variable in the workspace, using <- does.
> (5**2 + 57) -> y # right assignment also works
> x = 5; y = 4 # multiple assignments in one line with ‘ ; ’
> y <<- 75; 50 ->> x # yet another alternative: <<- and ->> assign to a variable in an enclosing scope (the global environment if none is found)
41. Vector Indexing
> v = c(1, 3, 5) # assignment of numeric vector
> v[1] # unlike C, indexing is 1-based
[1] 1 # the first value is returned
> v[0] # there is no element at “zeroth” position
numeric(0) # graceful fallback
> v[1.7] # non-integer indices are truncated towards zero
[1] 1
> v = seq(1,7) # basic sequence from 1 to 7
> v
[1] 1 2 3 4 5 6 7
> v[1:3] # ask to return elements 1 to 3
[1] 1 2 3
> k = seq(1, 6, 2) # create a sequence 1 3 5
> k
[1] 1 3 5
> v[k] # returns elements at 1st, 3rd & 5th positions
[1] 1 3 5
> v[c(1,3,5)] # identical to v[k]
[1] 1 3 5
> v[c(1, 1)] # first element repeated twice
[1] 1 1
> v[c(2, 1, 4, 3, 5, 3)] # doesn’t need to be a sequence
[1] 2 1 4 3 5 3
> v[-1] # negative args allowed; asks to drop the first element
[1] 2 3 4 5 6 7
> v[-5] # drop the fifth element
[1] 1 2 3 4 6 7
> v[ c(-1, -3)] # drop elements 1 & 3
[1] 2 4 5 6 7
> v[-1:-3] # drop elements 1 through 3
[1] 4 5 6 7
42. Vector Indexing, Replacing, Inserting & Sorting
> k # remember k is c(1, 3, 5)
[1] 1 3 5
> v[-k] # drop elements 1, 3 & 5
[1] 2 4 6 7
> v[-8] # there is no eighth element to drop
[1] 1 2 3 4 5 6 7 # so the entire vector is returned
> v[4] = 4.5; v[5] = 4.5 # change elements 4 & 5
> v
[1] 1.0 2.0 3.0 4.5 4.5 6.0 7.0
> v[8] = 8 # non-existent index adds element
> v
[1] 1.0 2.0 3.0 4.5 4.5 6.0 7.0 8.0
> v[11] = 11 # gaps are filled with NA
> v
[1] 1.0 2.0 3.0 4.5 4.5 6.0 7.0 8.0 NA NA 11.0
> class(v) # however this is numeric NA, not logical NA
[1] "numeric"
> class(v[9])
[1] "numeric"
> v = c(1, 7, 4, 0, 3, 3, 5, 6, 2, 9, 1, 1, 0, 7, 4, 6, 8, NA)
> sort(v)
[1] 0 0 1 1 1 2 3 3 4 4 5 6 6 7 7 8 9 # the NA is dropped by default
> sort(v, na.last = TRUE)
[1] 0 0 1 1 1 2 3 3 4 4 5 6 6 7 7 8 9 NA
> sort(v, na.last = FALSE)
[1] NA 0 0 1 1 1 2 3 3 4 4 5 6 6 7 7 8 9
43. Four Types of Vector Indices
1. A vector of positive integral quantities. In this case the values in the index vector must lie in
the set {1, 2, …, length(x)}. The corresponding elements of the vector are selected and
concatenated, in that order, in the result. The index vector can be of any length and the result
is of the same length as the index vector.
2. A vector of negative integral quantities. Such an index vector specifies the values to be
excluded rather than included.
3. A logical vector. In this case the index vector is recycled to the same length as the vector from
which elements are to be selected. Values corresponding to TRUE in the index vector are
selected and those corresponding to FALSE are omitted. NA values in the index vector are
included in the result as NA.
4. A vector of character strings. This possibility only applies where an object has a names
attribute to identify its components. In this case a sub-vector of the names vector may be used
in the same way as the positive integral labels.
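The four kinds of index vector can be compared side by side on one named vector. This is a minimal sketch; the vector x and its names are made up for the example.

```r
x <- c(a = 10, b = 20, c = 30, d = 40)

by_pos  <- x[c(1, 3)]          # 1. positive integers: select elements 1 and 3
by_neg  <- x[-c(1, 3)]         # 2. negative integers: exclude elements 1 and 3
by_lgl  <- x[c(TRUE, FALSE)]   # 3. logical: recycled to length 4, keeps elements 1 and 3
by_name <- x[c("b", "d")]      # 4. character: matched against the names attribute

# An NA in a logical index produces an NA in the result.
with_na <- x[c(TRUE, NA, FALSE, FALSE)]
```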
45. Special Types: Arrays, Matrices, and Factors
R provides no intrinsic way to handle arrays and matrices (unlike MATLAB or
Octave). Instead, we create vectors and ask R to treat them as arrays or matrices
by setting the ‘dim’ attribute. Alternatively, we can use the array() and matrix()
functions to create these objects. Arrays can have one or more dimensions.
Matrices are a special case of arrays having just two dimensions.
Similarly, R has no intrinsic support for factors. Instead, we ask R to treat a
vector as a factor by setting its class manually (or by using the factor() function).
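To see that a factor is an ordinary integer vector with extra attributes, one can strip the class or build the factor by hand. The sketch below uses made-up size labels; sizes, codes and by_hand are names chosen for this example.

```r
sizes <- factor(c("M", "S", "M", "L"))

base_type <- typeof(sizes)          # "integer": the underlying storage
codes     <- unclass(sizes)         # integer codes plus a levels attribute
lvls      <- levels(sizes)          # levels are sorted: "L" "M" "S"

# Building the same factor manually from codes, levels and class.
by_hand <- structure(c(2L, 3L, 2L, 1L),
                     levels = c("L", "M", "S"),
                     class  = "factor")
same_factor <- identical(sizes, by_hand)
```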
46. Arrays
# create 18 random numbers
> d = floor(rnorm(18, mean = 10, sd = 3))
> d
[1] 12 14 12 11 9 9 5 13 9 10 3 7 5 8 14 9 10 11
# change into an array
> d = array(d)
> class(d)
[1] "array"
# one dimension, 18 elements
> dim(d)
[1] 18
# change the dimensions to 3x3x2
> dim(d) = c(3, 3, 2)
# check dimensions
> dim(d)
[1] 3 3 2
# now output the array: notice the order in which data are filled
> d
, , 1
[,1] [,2] [,3]
[1,] 12 11 5
[2,] 14 9 13
[3,] 12 9 9
, , 2
[,1] [,2] [,3]
[1,] 10 5 9
[2,] 3 8 10
[3,] 7 14 11
47. Matrices from Vectors
> x = 1:20 # simple sequence
> x
[1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
> class(x)
[1] "integer" # class is the atomic type integer
> dim(x) # dim attribute is not set
NULL
> dim(x) = c(4, 5) # setting dim will let R treat this as a matrix/array
> x
[,1] [,2] [,3] [,4] [,5]
[1,] 1 5 9 13 17
[2,] 2 6 10 14 18
[3,] 3 7 11 15 19
[4,] 4 8 12 16 20
> class(x)
[1] "matrix" # 2-d array is a matrix (R 4.0.0 and later print "matrix" "array")
> dim(x) # dimension vector is a 2-element vector
[1] 4 5
> x[1] # first element - no ambiguities
[1] 1
> x[1][1] # unlike C, this is not x[row][col]: it takes element 1 of the one-element vector x[1]
[1] 1
> x[1,1] # but this does: x [row, col]
[1] 1
> x[1,2] # row one, column 2 is 5, and not 2
[1] 5
> x[1][2] # again, this fails
[1] NA
> x[15] # first increment rows, then column
[1] 15 # FORTRAN column-major order
48. Matrices from Functions
# the matrix function
> z = matrix(1:20, 4, 5)
> z
[,1] [,2] [,3] [,4] [,5]
[1,] 1 5 9 13 17
[2,] 2 6 10 14 18
[3,] 3 7 11 15 19
[4,] 4 8 12 16 20
# use of only nrow (if there is less data than cells, the data will be recycled)
> p = matrix(1:20, nrow = 4)
> p
[,1] [,2] [,3] [,4] [,5]
[1,] 1 5 9 13 17
[2,] 2 6 10 14 18
[3,] 3 7 11 15 19
[4,] 4 8 12 16 20
# use of ncol, completely equivalent
> p = matrix(1:20, ncol = 5)
> p
[,1] [,2] [,3] [,4] [,5]
[1,] 1 5 9 13 17
[2,] 2 6 10 14 18
[3,] 3 7 11 15 19
[4,] 4 8 12 16 20
# use byrow to control how data is filled
> p = matrix(1:20, 4, 5, byrow = TRUE)
> p
[,1] [,2] [,3] [,4] [,5]
[1,] 1 2 3 4 5
[2,] 6 7 8 9 10
[3,] 11 12 13 14 15
[4,] 16 17 18 19 20
# alternative: use the array function with the dim parameter
> y <- array(1:20, c(4,5))
> class(y)
[1] "matrix" # R 4.0.0 and later print "matrix" "array"
> y
[,1] [,2] [,3] [,4] [,5]
[1,] 1 5 9 13 17
[2,] 2 6 10 14 18
[3,] 3 7 11 15 19
[4,] 4 8 12 16 20
49. Matrices & Index Matrices
# the original matrix
> p
[,1] [,2] [,3] [,4] [,5]
[1,] 1 5 9 13 17
[2,] 2 6 10 14 18
[3,] 3 7 11 15 19
[4,] 4 8 12 16 20
# create an index matrix
> idx = array(c(1:3,3:1), c(3,2))
> idx
[,1] [,2]
[1,] 1 3
[2,] 2 2
[3,] 3 1
# access the elements: note how each row of idx is used as [row, col]
> p[idx]
[1] 9 6 3
# now set those elements to zero
> p[idx] = 0
> p
[,1] [,2] [,3] [,4] [,5]
[1,] 1 5 0 13 17
[2,] 2 0 10 14 18
[3,] 0 7 11 15 19
[4,] 4 8 12 16 20
50. Matrix Operations
# creating random numbers
> rnorm(5)
[1] 0.5414429 -0.5555167 1.7667198 1.1929404 -0.7713971
> floor(rnorm(5, mean = 10, sd = 5))
[1] 6 10 14 18 6
# create a 2x2 random matrix from a normal distribution
> matrix(floor(rnorm(4, mean = 10, sd = 5)), 2, 2) -> p
> p
[,1] [,2]
[1,] 13 7
[2,] 20 11
# create another 2x2 random matrix from a normal distribution
> matrix(floor(rnorm(4, mean = 10, sd = 5)), 2, 2) -> q
> q
[,1] [,2]
[1,] 6 4
[2,] 6 18
> p*q # not true multiplication, element-wise multiplication
[,1] [,2]
[1,] 78 28
[2,] 120 198
# outer product
> p %o% q
, , 1, 1
[,1] [,2]
[1,] 78 42
[2,] 120 66
, , 2, 1
[,1] [,2]
[1,] 78 42
[2,] 120 66
, , 1, 2
[,1] [,2]
[1,] 52 28
[2,] 80 44
, , 2, 2
[,1] [,2]
[1,] 234 126
[2,] 360 198
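For comparison with the element-wise product above, true matrix multiplication uses the %*% operator. The sketch below re-creates the same p and q with fixed values (instead of random draws) so that the results are reproducible.

```r
# Same values as the p and q printed above, filled column by column.
p <- matrix(c(13, 20, 7, 11), 2, 2)
q <- matrix(c(6, 6, 4, 18), 2, 2)

elementwise <- p * q      # multiplies corresponding entries
product     <- p %*% q    # rows of p times columns of q

print(product)
```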
52. Matrix Operations
# e.g take the following system of eqns.
# x + y + z = 2
# 6x - 4y + 5z = 31
# 5x + 2y + 2z = 13
# this can be written as follows:
# M x = b where M is the coefficient matrix, x is the column vector [x y z]'
> m = matrix( c(1, 6, 5, 1, -4, 2, 1, 5, 2), nrow = 3)
> b = c(2, 31, 13)
# solving for x involves finding the inverse of m, m^(-1)
> solve(m)
[,1] [,2] [,3]
[1,] -0.6666667 1.850372e-17 0.33333333
[2,] 0.4814815 -1.111111e-01 0.03703704
[3,] 1.1851852 1.111111e-01 -0.37037037
# but we can always directly solve for x as follows:
> solve(m, b)
[1] 3 -2 1
# determinant
> det(m)
[1] 27
# eigenvalues and eigenvectors
> eigen(m)
eigen() decomposition
$values
[1] -5.6445744 5.5123299 -0.8677555
$vectors
[,1] [,2] [,3]
[1,] 0.1204044 -0.2968715 -0.5511522
[2,] -0.9768516 -0.5842714 0.2262813
[3,] 0.1768158 -0.7553107 0.8031363
# for more methods, e.g. rref(), install pracma
> install.packages("pracma")
> library(pracma)
53. Special Type: Factors
Unlike formulas, R provides no intrinsic way to
handle factors. This is done by associating two
vectors of equal length or reinterpreting an
existing symbol via its class attribute.
# given two vectors of equal length, one with responses and the
# other with factor levels, R lets us apply summary functions at each
# factor level, using the factor() and tapply() functions.
> incomes <- c(50, 53, 80, 35, 47, 92, 44, 62, 61, 30)
> depts <- c("H", "H", "M", "M", "A", "S", "M", "S", "H", "A")
> dfact <- factor(depts)
> dfact
[1] H H M M A S M S H A
Levels: A H M S
> tapply(incomes, dfact, mean)
A H M S
38.50000 54.66667 53.00000 77.00000
# both the first & second arguments must be of equal lengths.
55. Handling Heterogeneous Data: Lists and Data Frames
Most of the time, the data to be analysed will not be typed into the R console:
rather, they will be read from an external data source, like a disk file or a repository.
While R does not have a native type to read tabular data, it does provide lists. Lists
can contain heterogeneous values. Based on lists, a new class is built, “data
frames”, which will serve as the containers for external data.
56. Lists
# presence of one character element forces all elements to
characters, can’t use vectors to store heterogeneous data
> emp <- c("Sam", 34L, 85.5, 132000, "HR")
> emp
[1] "Sam" "34" "85.5" "132000" "HR"
# the correct approach is to use a list
> emp <- list("Sam", 34L, 85.5, 132000, "HR")
> emp
[[1]]
[1] "Sam"
[[2]]
[1] 34
[[3]]
[1] 85.5
[[4]]
[1] 132000
[[5]]
[1] "HR"
> emp[1]
[[1]]
[1] "Sam"
> emp[1][1]
[[1]]
[1] "Sam"
> emp[1][2]
[[1]]
NULL
> emp[6]
[[1]]
NULL
57. Lists with Sub-lists
> emp <- list("Sam", 34L, 85.5, 132000, "HR", list("ann", "beth"))
> emp
[[1]]
[1] "Sam"
[[2]]
[1] 34
[[3]]
[1] 85.5
[[4]]
[1] 132000
[[5]]
[1] "HR"
[[6]]
[[6]][[1]]
[1] "ann"
[[6]][[2]]
[1] "beth"
# difference between [ ] and [[ ]]
> emp[6] # 6th element by position: note operator [ ]
[[1]]
[[1]][[1]]
[1] "ann"
[[1]][[2]]
[1] "beth"
> emp[[6]] # 6th item in list ‘emp’ is also a list: note operator [[ ]]
[[1]]
[1] "ann"
[[2]]
[1] "beth"
> emp[[6]][1] # note how a 1-item list is returned, not an atomic
[[1]]
[1] "ann"
> emp[[6]][[1]] # note how an atomic is returned
[1] "ann"
59. Addition / Deletion with Lists
# create a list with two elements
> emp <- list("Sam", 23)
> emp
[[1]]
[1] "Sam"
[[2]]
[1] 23
# now add another element
> emp <- c(emp, "HR")
> emp
[[1]]
[1] "Sam"
[[2]]
[1] 23
[[3]]
[1] "HR"
# access the third element
> emp[3]
[[1]]
[1] "HR"
# minus the third element
> emp = emp[-3]
> emp
[[1]]
[1] "Sam"
[[2]]
[1] 23
# now add another element at position 4
> emp[4] = 4
> emp
[[1]]
[1] "Sam"
[[2]]
[1] 23
[[3]]
NULL # note missing element at position 3
[[4]]
[1] 4
60. Data Frames
# there are two basic ways to look at data frames. One: as a
# combination of several vectors, each of which represents one
# variable. Each vector corresponds to a column, each element
# corresponds to a row (observation).
> emp.names = c("Sam", "Joe", "Ann")
> emp.codes = c(12, 23, 45)
> emp.salaries = c(45000, 32000, 85000)
> emps = data.frame(emp.codes, emp.names, emp.salaries)
> emps
emp.codes emp.names emp.salaries
1 12 Sam 45000
2 23 Joe 32000
3 45 Ann 85000
# getting the dimension and column name info
> dim(emps)
[1] 3 3
> names(emps)
[1] "emp.codes" "emp.names" "emp.salaries"
# The other way to look at data frames is as combinations of lists,
# each of which holds heterogeneous info about a single record.
> sam = list("Sam", 12, 45000)
> joe = list("Joe", 23, 32000)
> ann = list("Ann", 45, 85000)
> emps2 = data.frame(rbind(sam, joe, ann))
> emps2 # note that default names have been given to columns
X1 X2 X3
sam Sam 12 45000
joe Joe 23 32000
ann Ann 45 85000
> names(emps2) = c("emp.names", "emp.codes", "emp.salaries")
> emps2
emp.names emp.codes emp.salaries
sam Sam 12 45000
joe Joe 23 32000
ann Ann 45 85000
> rownames(emps2) # unlike in previous case, rows have names
[1] "sam" "joe" "ann"
> rownames(emps2) = NULL # reset them to the default row numbers
> rownames(emps2)
[1] "1" "2" "3"
> emps2
emp.names emp.codes emp.salaries
1 Sam 12 45000
2 Joe 23 32000
3 Ann 45 85000
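One caveat with the record-wise approach: rbind() on plain lists produces a matrix of mode list, so the columns of emps2 are typically stored as list columns rather than atomic vectors (str(emps2) shows this). A possible alternative, sketched below, is to build each record as a one-row data frame and bind those. The helper make_emp() and the name emps3 are hypothetical and not part of the original slides.

```r
# One single-row data frame per record (hypothetical helper)
make_emp <- function(name, code, salary) {
  data.frame(emp.names = name, emp.codes = code, emp.salaries = salary,
             stringsAsFactors = FALSE)
}

emps3 <- rbind(make_emp("Sam", 12, 45000),
               make_emp("Joe", 23, 32000),
               make_emp("Ann", 45, 85000))

str(emps3)                 # columns are plain character / numeric vectors
mean(emps3$emp.salaries)   # 54000
```

Because the columns are ordinary vectors, column-wise functions such as mean() work on them directly.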
61. Data Wrangling
# select a range of rows or columns based on order
# filter a subset of rows by condition on columns
# summary statistics for columns, by groups in columns
# change the row or column order
# append, insert, delete rows or columns
# transform the data type of a column
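Each of the operations listed above can be done in base R. The sketch below walks through them on a small data frame; the 'dept' column and the extra record ("Raj") are hypothetical values added only so that grouping and appending have something to work on.

```r
emps <- data.frame(
  emp.codes    = c(12, 23, 45),
  emp.names    = c("Sam", "Joe", "Ann"),
  emp.salaries = c(45000, 32000, 85000),
  dept         = c("HR", "IT", "IT"),   # hypothetical grouping column
  stringsAsFactors = FALSE
)

# select rows and columns by position
first_two <- emps[1:2, c(2, 3)]

# filter rows by a condition on a column
well_paid <- emps[emps$emp.salaries > 40000, ]

# summary statistic for a column, by groups in another column
mean_by_dept <- tapply(emps$emp.salaries, emps$dept, mean)

# change the row order (descending salary)
by_salary <- emps[order(-emps$emp.salaries), ]

# append a column, then delete it again
emps$bonus <- emps$emp.salaries * 0.1
emps$bonus <- NULL

# append a row (hypothetical record)
emps <- rbind(emps, data.frame(emp.codes = 51, emp.names = "Raj",
                               emp.salaries = 50000, dept = "HR",
                               stringsAsFactors = FALSE))

# transform the data type of a column
emps$emp.codes <- as.integer(emps$emp.codes)
```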
63. Aggregation: the xapply() family of functions
● lapply(X, FUN, …) - applies the function FUN to each element of the vector (or list) X and always
returns a list.
● sapply(X, FUN, …, simplify = TRUE, USE.NAMES = TRUE) - works just like lapply, but simplifies the
output where possible, i.e., it returns a vector or matrix instead of a list if the results are
simplifiable.
● vapply(X, FUN, FUN.VALUE, …, USE.NAMES = TRUE) - similar to sapply, but requires us to specify, via
FUN.VALUE, the type and length of the value that each call to FUN is expected to return.
● tapply(X, INDEX, FUN, …) - applies FUN to groups of X defined by the levels of the factor(s) in
INDEX.
● mapply(FUN, …, MoreArgs = NULL, SIMPLIFY = TRUE, USE.NAMES = TRUE) - ‘multivariate’ apply. It
calls FUN with the first elements of each argument in …, then the second elements, and so on, which
vectorizes a function that does not normally accept vectors as arguments.
● apply(X, MARGIN, FUN) - finally, there is the general apply() function which works on arrays (and
matrices). MARGIN specifies which dimension to apply FUN over (1 for rows, 2 for columns).
● Note: the xapply() family is part of base R and remains fully supported. The purrr package offers
an alternative set of mapping functions with a more consistent interface.
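A short type-and-run sketch of each function in the family follows. The data (scores, salaries, dept, m) is made up for illustration.

```r
scores <- list(a = c(1, 2, 3), b = c(10, 20))

lapply(scores, mean)                 # a list: a = 2, b = 15
means_s <- sapply(scores, mean)      # simplified to a named numeric vector
means_v <- vapply(scores, mean, FUN.VALUE = numeric(1))  # same, but type-checked

salaries <- c(45000, 32000, 85000)
dept     <- c("HR", "IT", "IT")
by_dept  <- tapply(salaries, dept, mean)   # HR = 45000, IT = 58500

# mapply walks several vectors in step: rep("a", 1), then rep("b", 2)
reps <- mapply(rep, c("a", "b"), 1:2)

m <- matrix(1:6, nrow = 2)
row_sums <- apply(m, 1, sum)   # 9 12
col_sums <- apply(m, 2, sum)   # 3 7 11
```

vapply() stops with an error if FUN returns something that does not match FUN.VALUE, which makes it a safer choice than sapply() inside functions.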
66. Overview of Flow Control in R
● Grouping
● Conditional
○ The if/else structure
○ The ifelse() function
○ The switch() function
● Repetition
○ The while loop
○ The for/in loop
○ The repeat loop
○ The ‘foreach’ package
● Jump
○ break
○ next
67. Groups (Grouped Expressions)
● Commands may be grouped together in braces, {expr_1; …; expr_m}, in which
case the value of the group is the result of the last expression evaluated in the
group.
● Since such a group is also an expression, it may itself be enclosed in
parentheses and used as part of an even larger expression, and so on.
● Groups are important in conditionals and repetitions because their bodies are
often grouped expressions.
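A minimal sketch of both points: a braced group yields the value of its last expression, and the group can be embedded in a larger expression. The variable names are illustrative.

```r
# the value of a braced group is the value of its last expression
area <- {
  width  <- 3
  height <- 4
  width * height
}
area    # 12

# a group is itself an expression, so it can sit inside a larger one
total <- ({ 1; 2 }) + 10
total   # 12
```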
68. Conditional
● The if/else construct
○ Syntax: if (expr_1) expr_2 else expr_3
○ Here, expr_1 must evaluate to a single logical value, and the entire expression evaluates to
expr_2 if it is TRUE, otherwise to expr_3.
● The ifelse() function
○ This is a vectorized version of the if/else construct.
○ It has the form ifelse(condition, a, b) and returns a vector of the same length as condition,
with elements a[i] if condition[i] is true, otherwise b[i] (where a and b are recycled as necessary).
● The switch() function
○ Syntax: switch(expr, alt_1, alt_2, …)
○ If expr evaluates to an integer i, the i-th alternative is evaluated and returned. If expr
evaluates to a character string, the alternative whose name matches the string is returned.
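The three conditional forms can be tried side by side. This is a small sketch with made-up values; note that switch() also accepts a character string as its first argument, in which case it matches on the names of the alternatives.

```r
# if/else is an expression, so its value can be assigned
x <- 7
parity <- if (x %% 2 == 0) "even" else "odd"     # "odd"

# ifelse() tests every element of a vector
v <- c(-2, 0, 5)
signs <- ifelse(v < 0, "neg", "non-neg")          # "neg" "non-neg" "non-neg"

# integer selector: picks the alternative at that position
second <- switch(2, "first", "second", "third")   # "second"

# character selector: picks the alternative with the matching name
dept_name <- switch("hr",
                    hr = "Human Resources",
                    it = "Information Technology")  # "Human Resources"
```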
69. Repetition
● The unconditional loop: repeat expr
○ No condition - an infinite loop by default
○ Needs a ‘break’ statement to exit the loop
● The sentinel-controlled loop: while (condition) expr
○ expr is evaluated as long as the condition evaluates to true
○ Both ‘break’ and ‘next’ can be used inside the loop.
● The counter-controlled loop: for (name in expr_1) expr_2
○ name is the loop variable and expr_1 is a vector expression (often a sequence).
○ expr_2 is often a grouped expression with its sub-expressions written in terms of the dummy
name. It is repeatedly evaluated as name ranges through the values in the vector result of expr_1.
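A minimal sketch of the three loop forms, each with a grouped expression as its body. The counters are illustrative.

```r
# repeat: no condition, so `break` is needed to get out
n <- 0
repeat {
  n <- n + 1
  if (n >= 3) break
}
n           # 3

# while: runs as long as the condition is TRUE
countdown <- 3
while (countdown > 0) {
  countdown <- countdown - 1
}
countdown   # 0

# for: the loop variable takes each value of the vector in turn
total <- 0
for (i in 1:5) {
  total <- total + i
}
total       # 15
```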
70. The foreach package
● The package ‘foreach’ provides a parallel counterpart to the for/in loop.
● The foreach() function takes the variables to iterate over as named arguments and returns
an object of class ‘foreach’.
● The special %do% and %dopar% binary operators take a ‘foreach’ object as the
first operand and an expression (often grouped) as the second operand.
● %do% evaluates sequentially, while %dopar% evaluates in parallel when a parallel backend has
been registered.
● When there are no varying arguments, the shortcut ‘times()’ can be
used to evaluate an expression a fixed number of times.
● For more info, refer to the package documentation.
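A small sketch of foreach() with %do% follows. It assumes the 'foreach' package is installed and is skipped otherwise; only the sequential operator is shown, since %dopar% additionally needs a registered parallel backend.

```r
# Assumes the 'foreach' package is installed; the block is skipped otherwise
if (requireNamespace("foreach", quietly = TRUE)) {
  library(foreach)

  # sequential evaluation; .combine = c collects the results into a vector
  squares <- foreach(i = 1:3, .combine = c) %do% {
    i^2
  }
  print(squares)

  # no varying arguments: evaluate the same expression three times
  draws <- times(3) %do% runif(1)
  print(length(draws))
}
```

Unlike a for loop, foreach() returns its results as a value, so the body does not need to write into a variable defined outside the loop.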
71. Jumps
● The break statement
○ Unconditionally breaks from a loop
○ Only way to break ‘repeat’ loops
● The next statement
○ Skips evaluating the rest of the grouped expression
○ Forces the next iteration
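The two jump statements can be seen together in one loop. This is a minimal sketch; the vector name 'evens' is illustrative.

```r
evens <- c()
for (i in 1:10) {
  if (i %% 2 == 1) next   # odd: skip the rest of the body
  if (i > 6) break        # leave the loop entirely
  evens <- c(evens, i)
}
evens   # 2 4 6
```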