2. What is R?
• Functional programming language
• Matrix-based
• Interpreted (written in C and Fortran)
• Environment for statistical computing and graphics
• Open source, released under the GPL
• 6000+ packages in CRAN
3. Why use R?
• Matrix calculation
• Data visualization (interactive too)
• Statistical analysis (regression, time series, geospatial)
• Data mining, classification, clustering
• Analysis of genomic data
• Machine learning
4. Who uses R?
• Oracle integrates R in its Big Data Appliance
• IBM offers support for in-Hadoop execution of R
• Data analysts for Google and Apple
• 12th in the TIOBE popularity index
5. How to use R?
• Command-line interface, autonomous script or graphical front-ends
• Connection to any data source
• Data analysis
• Modeling and computation
• Data visualization
• Fitting models or displaying data
6. RStudio IDE
• License: AGPL v3
• Scripts
• Workspace
• Console
• Images
7. Reading and writing data
• From/To plain text files
• From/To Excel files
• From/To Databases
• From the Web
> heisenberg <- read.csv(file="simple.csv",header=TRUE,sep=",")
> write.csv(x=data, file="simple.csv")
> library(gdata)
> mydata = read.xls("mydata.xls")
> library(xlsx) #write.xlsx comes from the xlsx package, not gdata
> write.xlsx(x=data, file="simple.xlsx")
> library(XLConnect)
> wk = loadWorkbook("mydata.xls")
> df = readWorksheet(wk,sheet="Sheet1")
> library(RPostgreSQL)
> con <- dbConnect(dbDriver("PostgreSQL"), dbname = "abc", user="postgres")
> q <- dbGetQuery(con, "SELECT * FROM prices WHERE x > 0")
> dbSendQuery(con, "INSERT INTO forecasts VALUES (10)")
> fpe <- read.table("http://data.princeton.edu/wws509/datasets/effort.dat")
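A minimal sketch of the plain-text round trip shown above, assuming a throwaway temporary file rather than the slide's simple.csv. The data frame `prices` and its columns are made up for illustration.

```r
# Hypothetical data, not taken from the slides
prices <- data.frame(hour = 1:3, price = c(41.5, 39.0, 44.2))

# Write to a temporary file so nothing in the working directory is touched
path <- tempfile(fileext = ".csv")
write.csv(prices, file = path, row.names = FALSE)

# Read it back; header = TRUE treats the first line as column names
prices_back <- read.csv(file = path, header = TRUE, sep = ",")
print(prices_back)

unlink(path)  # remove the temporary file
```

Passing row.names = FALSE keeps write.csv from adding an extra column of row labels, so the file reads back with the same two columns.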
14. Plots from my MSc thesis
• Prices of energy in the Italian Power Exchange spot market
• Forecast using a SARIMA model
15. Performance
• Good performance with built-in math functions
• Possibility to monitor the memory usage
• Possibility to offload data to an external DB to speed up large operations
• Functions for big data sets
• Parallel computation
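The first two bullets can be illustrated with base R alone. This is only a sketch: the functions `loop_sum` and `vec_sum` are hypothetical names, and the timings depend on the machine, so none are quoted here.

```r
n <- 1e5
v <- runif(n)

# Explicit loop: adds the squares one element at a time
loop_sum <- function(v) {
  total <- 0
  for (value in v) total <- total + value^2
  total
}

# Vectorized: the same computation done by built-in functions
vec_sum <- function(v) sum(v^2)

print(system.time(loop_sum(v)))  # timings vary by machine
print(system.time(vec_sum(v)))

# Memory used by a single object
print(object.size(v), units = "Kb")
```

Both functions return the same value; the vectorized form hands the whole loop to compiled code, which is where the built-in math functions tend to pay off.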
17. Vector part 1
> x <- c(2,5,9.5,-3) #create a vector
> x[2] #selects the second element
[1] 5
> x[c(2,4)] #select the elements in position 2 and 4
[1] 5 -3
> x[-c(1,3)] #exclude the elements in positions 1 and 3
[1] 5 -3
> x[x>0] #select only positive elements
[1] 2.0 5.0 9.5
> x[!(x<=0)] #exclude the non-positive elements
[1] 2.0 5.0 9.5
> x[x>0]-1 #subtract 1 from each element
[1] 1.0 4.0 8.5
> x[x>0]+c(1,2,3) #sum element-wise
[1] 3.0 7.0 12.5
> x[x>0][2]
[1] 5
18. Vector part 2
> which(x>0) #show the indexes that match the condition
[1] 1 2 3
> which.max(x) #index of the largest element
[1] 3
> which.min(x) #index of the smallest element
[1] 4
> length(x)
[1] 4
> x<-1:10
> x
[1] 1 2 3 4 5 6 7 8 9 10
> paste(1:5, c("A","B"), sep="") #the shorter vector is recycled
[1] "1A" "2B" "3A" "4B" "5A"
> x1<-seq(1,1000, length.out=10) #10 equally spaced values from 1 to 1000 (step 111)
[1] 1 112 223 334 445 556 667 778 889 1000
> x2<-rep(2,times=10) #repeat 2 10 times
[1] 2 2 2 2 2 2 2 2 2 2
> rep(c(1,3),times=4) #repeat (1,3) 4 times
[1] 1 3 1 3 1 3 1 3
> rep(c(1,9),c(3,1)) #repeat 1 three times and 9 once
[1] 1 1 1 9
> length(c(x,x1,x2,3))
[1] 31 #see also sort, order, eigen
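As a brief illustration of the sort and order functions mentioned above, here is a sketch that reuses the four-element vector from "Vector part 1". The variable names are illustrative.

```r
x <- c(2, 5, 9.5, -3)

sorted <- sort(x)    # the values in increasing order
idx <- order(x)      # the positions that would sort x
desc <- x[order(x, decreasing = TRUE)]

print(sorted)  # -3.0  2.0  5.0  9.5
print(idx)     # 4 1 2 3
print(desc)    # 9.5  5.0  2.0 -3.0
```

sort returns the values themselves, while order returns indexes, which makes it the one to use when reordering another vector or the rows of a data frame by the same key.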
21. Lists can contain different object types
> lista<-list(matrix(1:9,nrow=3),rep(0,3),c('good','bad'))
> length(lista)
[1] 3
> lista[[3]] #third element
[1] "good" "bad"
> length(lista[[3]])
[1] 2
> lista[[2]]+2 #add 2 to each element of the second item
[1] 2 2 2
> lista[[1]][2,2]
[1] 5
> names(lista)<-c('first', 'second', 'third') #names for elements
> lista$second #or lista[["second"]]; returns a vector
[1] 0 0 0
> lista["second"] #single brackets return a one-element list
$second
[1] 0 0 0
22. Multidimensional Array and named indexes
> a<-array(1:24, dim=c(3,4,2))
> dim(a) #show dimensions
[1] 3 4 2
> a[,,2]
[,1] [,2] [,3] [,4]
[1,] 13 16 19 22
[2,] 14 17 20 23
[3,] 15 18 21 24
> a[1,,]
[,1] [,2]
[1,] 1 13
[2,] 4 16
[3,] 7 19
[4,] 10 22
> a[1,2,1]
[1] 4
> x<-matrix(1:10, ncol=5)
> dimnames(x)<-list(c("X","Y"),NULL)
[,1] [,2] [,3] [,4] [,5]
X 1 3 5 7 9
Y 2 4 6 8 10
> dimnames(x)[[2]]<-c("g","h","j","j","k")
g h j j k
X 1 3 5 7 9
Y 2 4 6 8 10
Summary of Data Structures
               Linear    Rectangular
Homogeneous    Vectors   Matrices
Heterogeneous  Lists     Data frames
23. Data frame
> X<-data.frame(id=1:4, sex=c("M","F","F","M"), stringsAsFactors=TRUE) #factors were the default before R 4.0
id sex
1 1 M
2 2 F
3 3 F
4 4 M
> X$age<-c(2.5,3,5,6.2)
id sex age
1 1 M 2.5
2 2 F 3.0
3 3 F 5.0
4 4 M 6.2
#equivalent to X[X$age<3 | X$age>5, c("id","sex")]
> subset(X,subset=(age<3 | age>5), select=-age)
id sex
1 1 M
4 4 M #see also merge, attach
> summary(X)
id sex age
Min. :1.00 F:2 Min. :2.500
1st Qu.:1.75 M:2 1st Qu.:2.875
Median :2.50 Median :4.000
Mean :2.50 Mean :4.175
3rd Qu.:3.25 3rd Qu.:5.300
Max. :4.00 Max. :6.200
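The merge function mentioned above joins two data frames on a key column. A small sketch follows, assuming a second, made-up table `W` of weights whose ids only partly overlap those in X.

```r
X <- data.frame(id = 1:4, sex = c("M", "F", "F", "M"), age = c(2.5, 3, 5, 6.2))

# Hypothetical second table; ids 2 to 5, so not every row of X has a match
W <- data.frame(id = 2:5, weight = c(14.1, 18.3, 20.5, 22.0))

# Inner join: only the ids present in both tables
inner <- merge(X, W, by = "id")

# Left join: every row of X, with NA where W has no match
left <- merge(X, W, by = "id", all.x = TRUE)

print(inner)
print(left)
```

The all.x, all.y and all arguments select left, right and full outer joins respectively; without them merge performs an inner join.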
Editor's Notes
R's data structures include vectors, matrices, multidimensional arrays, lists and data frames (similar to tables in a relational database). A scalar is represented as a vector of length one. R is interpreted, and its packages are mainly written in R, C and Fortran. R is freely available under the GPL, and pre-compiled binary versions are provided for various operating systems. The R community is very active in developing packages for specific functions or specific areas of study.
R can act as a matrix-calculation toolbox with performance comparable to GNU Octave or MATLAB.
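A short sketch of what matrix calculation looks like in base R. The matrix `A` and vector `b` are arbitrary values chosen for illustration.

```r
A <- matrix(c(2, 1, 1, 3), nrow = 2)  # filled column by column
b <- c(1, 2)

AA <- A %*% A        # matrix product (a plain * would be element-wise)
At <- t(A)           # transpose
Ainv <- solve(A)     # inverse
x <- solve(A, b)     # solves the linear system A x = b

print(x)  # 0.2 0.6
```

Calling solve(A, b) is generally preferable to computing solve(A) %*% b, since it avoids forming the inverse explicitly.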
Another strength of R is static graphics, which can produce publication-quality graphs, including mathematical symbols. Dynamic and interactive graphics are available through additional packages.
R's system includes objects for: regression models, time-series and geo-spatial coordinates, techniques for linear and nonlinear modeling, classical statistical tests, classification, clustering, and others.
R is easily extensible through functions and extensions.
Polls and surveys of data miners show that R's popularity has increased substantially in recent years.
R is an interpreted language; users typically access it through a command-line interpreter; there are also several graphical front-ends for it.
…
The IDE I used is very similar to MATLAB's, with four panes: one for scripts; one for the current workspace, where objects and matrices are easily accessible; one for the console, to run analyses on the fly; and one for the generated images.
R has the same control-flow capabilities as common procedural languages: you can use constructs such as while, repeat and if, and define your own functions.
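A minimal sketch of these constructs; the two functions `fib` and `count_halvings` are hypothetical examples, not part of the talk.

```r
# while: build the first n Fibonacci numbers
fib <- function(n) {
  out <- numeric(0)
  a <- 0
  b <- 1
  while (length(out) < n) {
    out <- c(out, a)
    nxt <- a + b
    a <- b
    b <- nxt
  }
  out
}

# repeat + if + break: count how many halvings bring x down to 1 or below
count_halvings <- function(x) {
  steps <- 0
  repeat {
    if (x <= 1) break
    x <- x / 2
    steps <- steps + 1
  }
  steps
}

print(fib(7))             # 0 1 1 2 3 5 8
print(count_halvings(8))  # 3
```

repeat has no condition of its own, so it always needs an explicit break; the value of a function is the last expression it evaluates.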
R lets you handle exceptions with try() and tryCatch() blocks.
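A sketch of tryCatch with separate handlers for warnings and errors; the wrapper `safe_log` is a hypothetical name used only for this example.

```r
safe_log <- function(x) {
  tryCatch(
    {
      if (!is.numeric(x)) stop("x must be numeric")
      log(x)
    },
    warning = function(w) {
      # log() of a negative number signals a warning (NaNs produced)
      NA_real_
    },
    error = function(e) {
      message("Caught error: ", conditionMessage(e))
      NA_real_
    },
    finally = message("safe_log() finished")  # runs in every case
  )
}

print(safe_log(1))    # 0
print(safe_log(-1))   # NA, via the warning handler
print(safe_log("a"))  # NA, via the error handler
```

The simpler try(expr, silent = TRUE) is enough when the only goal is to keep a script running after a failure.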
Function definitions can give parameters default values, and you can call a function with positional or named arguments.
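For instance, with a hypothetical function `power` whose second parameter has a default:

```r
# exponent defaults to 2 when the caller does not supply it
power <- function(base, exponent = 2) {
  base^exponent
}

print(power(3))                       # 9, default exponent
print(power(3, 3))                    # 27, positional arguments
print(power(exponent = 3, base = 2))  # 8, named arguments in any order
```

Named arguments are matched first, and the remaining ones are filled in by position.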
A generic function acts differently depending on the type of the arguments passed to it: it dispatches to the implementation specific to that type of object. For example, R has a generic print function that can print almost every type of object in R with a simple print(objectname) syntax.
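A sketch of how such dispatch works with R's S3 system. The generic `describe` and its methods are hypothetical and defined here only for illustration; print follows the same mechanism.

```r
# The generic: UseMethod() picks a method based on the class of obj
describe <- function(obj, ...) UseMethod("describe")

# Fallback used when no more specific method exists
describe.default <- function(obj, ...) "an object"

# Methods are named generic.class
describe.numeric <- function(obj, ...) {
  paste("a numeric vector of length", length(obj))
}
describe.data.frame <- function(obj, ...) {
  paste("a data frame with", nrow(obj), "rows")
}

print(describe(c(2, 5, 9.5)))            # "a numeric vector of length 3"
print(describe(data.frame(id = 1:4)))    # "a data frame with 4 rows"
print(describe("text"))                  # "an object"
```

Calling methods(print) in a session lists the many print methods that R and the loaded packages provide in the same way.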
One line methods for: correlation, plotting, regression,