R and Data Mining
美味书签 (AVOS China)
杨朝中
R and Data Mining
● Introduction to the R language
● R text mining framework
● High Performance Computing in R
● Network analysis in R
● Statistical graphics
Introduction to the R language
● Statistical computing
Object types
Statistical analysis models
● CRAN (Comprehensive R Archive Network)
Object types
● Vector (vector)
● Factor (factor)
● Array and matrix (array and matrix)
● Data frame and list (data.frame and list)
● Function (function)
Vector (vector)
> test.vector = c(1:100)
> test.vector
[1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22
[23] 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44
[45] 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66
[67] 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88
[89] 89 90 91 92 93 94 95 96 97 98 99 100
> test.vector[3]
[1] 3
> test.vector[1]
[1] 1
> sum(test.vector)
[1] 5050
> mean(test.vector)
[1] 50.5
> var(test.vector)
[1] 841.6667
> sd(test.vector)
[1] 29.01149
Factor (factor)
> test.factor = factor(c(1,1,2,2,2,3,3,3,4,4,1,1,4,4))
> test.factor
[1] 1 1 2 2 2 3 3 3 4 4 1 1 4 4
Levels: 1 2 3 4
> levels(test.factor) = c("first","second","third","fourth")
> test.factor
[1] first first second second second third third third fourth fourth first first
[13] fourth fourth
Levels: first second third fourth
> levels(test.factor) = c("a","b","c","d")
> test.factor
[1] a a b b b c c c d d a a d d
Levels: a b c d
Array (array)
> test.array = array(rbinom(100,5,0.5),dim=c(4,5,5))
> test.array
, , 1
[,1] [,2] [,3] [,4] [,5]
[1,] 1 3 2 3 1
[2,] 4 2 2 2 2
[3,] 2 1 3 3 5
[4,] 2 2 4 2 2
> test.array[,3,]
[,1] [,2] [,3] [,4] [,5]
[1,] 2 3 4 4 2
[2,] 2 2 2 1 1
[3,] 3 2 4 3 4
[4,] 4 3 3 1 2
> test.array[3,2,]
[1] 1 2 3 1 1
Matrix (matrix): creation and transpose
> test.matrix = matrix(rpois(50,5),nrow=5)
> test.matrix
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
[1,] 6 3 12 7 6 2 3 5 4 4
[2,] 2 5 11 3 1 4 7 2 5 5
[3,] 2 4 1 5 1 3 2 7 5 8
[4,] 4 7 5 8 4 5 3 2 6 2
[5,] 9 15 5 6 2 4 8 8 5 3
> t(test.matrix)
[,1] [,2] [,3] [,4] [,5]
[1,] 6 2 2 4 9
[2,] 3 5 4 7 15
[3,] 12 11 1 5 5
[4,] 7 3 5 8 6
[5,] 6 1 1 4 2
[6,] 2 4 3 5 4
[7,] 3 7 2 3 8
[8,] 5 2 7 2 8
[9,] 4 5 5 6 5
[10,] 4 5 8 2 3
Matrix (matrix): QR decomposition
> test.matrix = matrix(runif(25,min=1,max=5),nrow=5)
> test.matrix
[,1] [,2] [,3] [,4] [,5]
[1,] 1.844365 2.470590 4.744482 4.693239 2.597706
[2,] 2.051089 2.954349 4.807748 3.974937 2.487159
[3,] 4.554397 2.187724 4.519553 4.916905 3.988060
[4,] 4.629351 3.770774 2.992690 4.660705 2.510643
[5,] 3.894542 3.281654 2.471337 3.484586 2.115016
> qr(test.matrix)
$qr
[,1] [,2] [,3] [,4] [,5]
[1,] -8.0591276 -6.30550129 -7.7768280 -9.2254948 -5.94547975
[2,] 0.2545051 -2.20153679 -2.8030382 -2.2409546 -0.64008014
[3,] 0.5651229 -0.83950762 -3.5747057 -2.2750825 -1.96267828
[4,] 0.5744234 -0.15061209 -0.6607485 0.7479590 0.01142934
[5,] 0.4832462 -0.07700937 -0.6148309 0.9179222 0.06790194
$rank
[1] 5
$qraux
[1] 1.22885416 1.51634534 1.43057441 1.39676050 0.06790194
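The `$qr` and `$qraux` components above hold the factorization in a compact form that is hard to read directly. As a minimal sketch, the explicit factors can be recovered with `qr.Q()` and `qr.R()`. The seed and the variable names `m`, `decomp`, `Q` and `R` are assumptions made for reproducibility and are not from the slides.

```r
# Reproducible stand-in for test.matrix (the seed is an assumption)
set.seed(1)
m <- matrix(runif(25, min = 1, max = 5), nrow = 5)

decomp <- qr(m)      # compact QR decomposition, as printed on the slide
Q <- qr.Q(decomp)    # explicit orthogonal factor
R <- qr.R(decomp)    # explicit upper-triangular factor

# Q %*% R reproduces the (possibly column-pivoted) input matrix
print(all.equal(Q %*% R, m[, decomp$pivot]))
# The columns of Q are orthonormal
print(all.equal(crossprod(Q), diag(5)))
```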
Matrix (matrix): singular value decomposition
> svd(test.matrix)
$d
[1] 17.66944239 3.22284465 1.78184517 0.61566884 0.05156261
$u
[,1] [,2] [,3] [,4] [,5]
[1,] -0.4285623 -0.55858839 0.1433838 0.6112554 0.33184518
[2,] -0.4207851 -0.46523651 0.3361892 -0.6261498 -0.31844658
[3,] -0.5179119 0.03462469 -0.8461578 -0.1172279 -0.02903471
[4,] -0.4722861 0.50932622 0.2777685 0.3687009 -0.55175807
[5,] -0.3846913 0.45926238 0.2707020 -0.2908960 0.69511911
$v
[,1] [,2] [,3] [,4] [,5]
[1,] -0.4356020 0.71976143 -0.31404796 -0.1898322 -0.39690304
[2,] -0.3666388 0.23238151 0.80369243 -0.2606880 0.31256209
[3,] -0.4958375 -0.64266729 -0.01537137 -0.4151453 -0.41053867
[4,] -0.5530530 -0.10129870 0.04863968 0.8254724 -0.01001832
[5,] -0.3522846 -0.06826158 -0.50284218 -0.2055605 0.75903264
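The three components returned by `svd()` satisfy m = u diag(d) v'. The sketch below checks that identity and builds a rank-1 approximation from the largest singular value. The seed and the names `m`, `s` and `approx1` are illustrative assumptions, not part of the slides.

```r
# Reproducible stand-in for test.matrix (the seed is an assumption)
set.seed(1)
m <- matrix(runif(25, min = 1, max = 5), nrow = 5)

s <- svd(m)

# Multiplying the three factors back together recovers m
reconstructed <- s$u %*% diag(s$d) %*% t(s$v)
print(all.equal(reconstructed, m))

# Rank-1 approximation built from the largest singular value only
approx1 <- s$d[1] * outer(s$u[, 1], s$v[, 1])

# Its Frobenius error is determined by the discarded singular values
print(sqrt(sum((m - approx1)^2)))
print(sqrt(sum(s$d[-1]^2)))
```

The dominance of the first singular value in the slide output (17.67 against 3.22 and smaller) is why a rank-1 approximation already captures most of such a matrix.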
Matrix (matrix): binding columns and rows
> cbind(test.matrix,rep(1,times=5))
[,1] [,2] [,3] [,4] [,5] [,6]
[1,] 1.844365 2.470590 4.744482 4.693239 2.597706 1
[2,] 2.051089 2.954349 4.807748 3.974937 2.487159 1
[3,] 4.554397 2.187724 4.519553 4.916905 3.988060 1
[4,] 4.629351 3.770774 2.992690 4.660705 2.510643 1
[5,] 3.894542 3.281654 2.471337 3.484586 2.115016 1
> rbind(test.matrix, seq(1,2,length.out=5))
[,1] [,2] [,3] [,4] [,5]
[1,] 1.844365 2.470590 4.744482 4.693239 2.597706
[2,] 2.051089 2.954349 4.807748 3.974937 2.487159
[3,] 4.554397 2.187724 4.519553 4.916905 3.988060
[4,] 4.629351 3.770774 2.992690 4.660705 2.510643
[5,] 3.894542 3.281654 2.471337 3.484586 2.115016
[6,] 1.000000 1.250000 1.500000 1.750000 2.000000
Data frame (data.frame)
> test.data.frame =
data.frame(id=1:10,name=letters[1:10],age=sample(c(25,23,24),size=10,replace=TRUE))
> test.data.frame
id name age
1 1 a 25
2 2 b 23
3 3 c 23
4 4 d 23
5 5 e 24
6 6 f 24
7 7 g 24
8 8 h 25
9 9 i 25
10 10 j 25
> test.data.frame$id
[1] 1 2 3 4 5 6 7 8 9 10
> test.data.frame$name
[1] a b c d e f g h i j
Levels: a b c d e f g h i j
> test.data.frame$age
[1] 25 23 23 23 24 24 24 25 25 25
List (list)
> test.list =
list(test.vector,test.factor,test.array,test.matrix,test.data.frame)
> str(test.list)
List of 5
$ : int [1:100] 1 2 3 4 5 6 7 8 9 10 ...
$ : Factor w/ 4 levels "a","b","c","d": 1 1 2 2 2 3 3 3 4 4 ...
$ : num [1:4, 1:5, 1:5] 1 4 2 2 3 2 1 2 2 2 ...
$ : num [1:5, 1:5] 1.84 2.05 4.55 4.63 3.89 ...
$ :'data.frame': 10 obs. of 3 variables:
..$ id : int [1:10] 1 2 3 4 5 6 7 8 9 10
..$ name: Factor w/ 10 levels "a","b","c","d",..: 1 2 3 4 5 6 7 8 9 10
..$ age : num [1:10] 25 23 23 23 24 24 24 25 25 25
> test.list[4]
[[1]]
[,1] [,2] [,3] [,4] [,5]
[1,] 1.844365 2.470590 4.744482 4.693239 2.597706
[2,] 2.051089 2.954349 4.807748 3.974937 2.487159
[3,] 4.554397 2.187724 4.519553 4.916905 3.988060
[4,] 4.629351 3.770774 2.992690 4.660705 2.510643
[5,] 3.894542 3.281654 2.471337 3.484586 2.115016
Function (function)
> test.function = function(x) factorial(x)
> test.function(3)
[1] 6
> lapply(test.vector[31:35], test.function)
[[1]]
[1] 8.222839e+33
[[2]]
[1] 2.631308e+35
[[3]]
[1] 8.683318e+36
[[4]]
[1] 2.952328e+38
[[5]]
[1] 1.033315e+40
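Statistical analysis models
● Regression analysis
● Analysis of variance
● Discriminant analysis
● Cluster analysis
● Principal component analysis
● Factor analysis
● Continuous-system and discrete-system simulation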
CRAN
● CRAN Task Views
● Natural Language Processing
● Machine Learning & Statistical Learning
● High-Performance and Parallel Computing with R
● gRaphical Models in R
● Graphic displays
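R text mining framework
UML class diagram of the 'tm' package
Text Preprocessing in R
● Data import: Corpus, PlainTextDocument, tm_map
● Chinese word segmentation: rmmseg4j
● English stemming: Rstem, Snowball, RWeka
● English sentence detection: openNLP
● English synonyms: wordnet
● Building a tf-idf-weighted document-term matrix: DocumentTermMatrix, weightTfIdf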
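To make the tf-idf weighting concrete, here is a minimal base-R sketch on a toy corpus. The three documents and all variable names are hypothetical. The formula is one common definition (length-normalized term frequency times log2 inverse document frequency); the details of `weightTfIdf` in tm may differ.

```r
# Toy corpus (hypothetical documents, not from the slides)
docs <- c(d1 = "r is a language for statistics",
          d2 = "text mining in r",
          d3 = "statistics and text mining")
tokens <- strsplit(docs, " ", fixed = TRUE)
vocab <- sort(unique(unlist(tokens)))

# Raw counts: one row per document, one column per term
counts <- t(sapply(tokens, function(tok) {
  tabulate(match(tok, vocab), nbins = length(vocab))
}))
colnames(counts) <- vocab

tf  <- counts / rowSums(counts)                   # term frequency within each document
idf <- log2(nrow(counts) / colSums(counts > 0))   # rarer terms get a larger weight
tfidf <- sweep(tf, 2, idf, `*`)                   # scale every column by its idf

print(round(tfidf[, c("r", "mining", "language")], 3))
```

A term such as "language", which occurs in a single document, receives a higher weight there than "r", which occurs in two of the three documents.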
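Preprocessing
library(tm)        # corpus handling and document-term matrices
library(rmmseg4j)  # Chinese word segmentation
library(openNLP)
library(Rstem)
library(Snowball)
# Read every file in the directory as a plain-text document
cor = Corpus(DirSource("~/work/text-mining/20news-bydate-test/1000/"),
             readerControl=list(reader=readPlain))
# Segment each document into words, keeping its original id
cwsed = tm_map(cor, function(x){
  PlainTextDocument(mmseg4j(as.character(x), method="maxword"),
                    id=ID(x))
})
# Build a tf-idf-weighted document-term matrix; keep terms of any length
dtm = DocumentTermMatrix(cwsed, control=list(weighting = function(x){
  weightTfIdf(x)
},wordLengths=c(1,Inf)))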
Text clustering
Dimensionality reduction
++++++++++++++++++++++++++++++++++++++++++
> nTerms(dtm)
[1] 103757
> dtm2 = removeSparseTerms(dtm, 0.9)
> nTerms(dtm2)
[1] 709
++++++++++++++++++++++++++++++++++++++++++
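What the 0.9 threshold means can be sketched in base R. This assumes that a term's sparsity is the share of documents in which it does not occur, and that terms are kept when this share is below the threshold. The count matrix, the seed and all names are made up for illustration.

```r
set.seed(42)  # seed chosen only for reproducibility
# Hypothetical 20-document, 6-term indicator matrix with increasingly rare terms
p <- c(0.9, 0.5, 0.3, 0.15, 0.05, 0.02)
m <- sapply(p, function(prob) rbinom(20, size = 1, prob = prob))
colnames(m) <- paste0("term", 1:6)

sparsity <- colMeans(m == 0)    # share of documents that lack each term
keep <- sparsity < 0.9          # same threshold as removeSparseTerms(dtm, 0.9)
m2 <- m[, keep, drop = FALSE]

print(round(sparsity, 2))
print(colnames(m2))
```

On the real corpus this kind of filter is what shrinks the vocabulary from 103757 terms to 709.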
Clustering
++++++++++++++++++++++++++++++++++++++++++
km = kmeans(as.matrix(dtm2), centers=5, iter.max=10)
dbscan?
spectral clustering?
Cluster validation
● Internal measures
● Stability measures
● Biological
Internal measures
● Connectivity
● Silhouette Width
● Dunn Index
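Two of these internal measures are simple enough to compute by hand. The sketch below does so in base R on synthetic data, assuming Euclidean distance; the data, the seed and the variable names are illustrative. It is not the clValid implementation.

```r
set.seed(1)  # seed is an assumption for reproducibility
# Two well-separated hypothetical clusters in the plane
x <- rbind(matrix(rnorm(40, mean = 0, sd = 0.3), ncol = 2),
           matrix(rnorm(40, mean = 3, sd = 0.3), ncol = 2))
cl <- kmeans(x, centers = 2, nstart = 5)$cluster
d <- as.matrix(dist(x))

# Silhouette width of point i: (b - a) / max(a, b), where a is the mean distance
# to the other points of its own cluster and b is the smallest mean distance
# to the points of any other cluster
sil <- sapply(seq_len(nrow(x)), function(i) {
  own <- cl == cl[i]
  a <- sum(d[i, own]) / (sum(own) - 1)
  b <- min(tapply(d[i, !own], cl[!own], mean))
  (b - a) / max(a, b)
})
avg_sil <- mean(sil)

# Dunn index: smallest between-cluster distance over largest cluster diameter
same <- outer(cl, cl, "==")
dunn <- min(d[!same]) / max(d[same])

print(avg_sil)
print(dunn)
```

Both measures grow as clusters become more compact and better separated, so larger values indicate a better partition.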
Stability measures
● Average Proportion of Non-overlap (APN)
● Average Distance (AD)
● Average Distance between Means (ADM)
● Figure of Merit (FOM)
Biological
● Biological Homogeneity Index (BHI)
● Biological Stability Index (BSI)
Cluster validation
library(tm)
library(kernlab)
library(clValid)
intern = clValid(as.matrix(dtm2), 2:10,
                 clMethods=c("hierarchical","kmeans","pam"),
                 validation="internal", maxitems=3000)
summary(intern)
op <- par(no.readonly=TRUE)
par(mfrow=c(2,2),mar=c(4,4,3,1))
plot(intern, legend=FALSE)
legend("right", clusterMethods(intern), col=1:9, lty=1:9, pch=paste(1:9))
par(op)
Text classification
● Naive Bayes
● Support Vector Machine
Chih-Jen Lin (林智仁), National Taiwan University
LIBSVM (e1071)
LIBLINEAR (LiblineaR)
Evaluation and Accuracy Improvement
● Cross validation
● Bootstrap
● Ensemble Method
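As a minimal sketch of k-fold cross validation, the code below evaluates a nearest-centroid classifier on the built-in `iris` data. That classifier is chosen only because it needs no extra packages; it is a stand-in for the naive Bayes and SVM models named in the slides. The fold count and all names are assumptions.

```r
set.seed(123)  # seed chosen only for reproducibility
k <- 5
# Randomly assign each row to one of k equally sized folds
folds <- sample(rep(1:k, length.out = nrow(iris)))
features <- as.matrix(iris[, 1:4])

accuracy <- sapply(1:k, function(i) {
  train <- folds != i
  # "Training": one mean vector (centroid) per species, species in rows
  centroids <- apply(features[train, ], 2, function(col) {
    tapply(col, iris$Species[train], mean)
  })
  # "Prediction": assign each held-out flower to the closest centroid
  pred <- apply(features[!train, , drop = FALSE], 1, function(row) {
    rownames(centroids)[which.min(colSums((t(centroids) - row)^2))]
  })
  mean(pred == iris$Species[!train])
})

print(accuracy)        # accuracy on each held-out fold
print(mean(accuracy))  # cross-validated accuracy estimate
```

Each observation is used for testing exactly once, so the averaged accuracy is a less optimistic estimate than accuracy measured on the training data.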
High Performance Computing in R
● Parallel Computing
Rmpi, snowfall, snowFT, parallel (R >= 2.14), RHadoop
● Large memory and out-of-memory data
ff, HadoopStreaming
● Easier interfaces for compiled code
Rcpp, rJava, inline
● Profiling tools
profr, proftools
RHadoop
http://www.revolutionanalytics.com/
● rmr2
mapreduce, from.dfs, to.dfs, keyval
● rhdfs
hdfs.file, hdfs.close, hdfs.exists, hdfs.cp, hdfs.read
● rhbase
hb.new.table, hb.delete.table, hb.insert, hb.get
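# One k-means iteration as a MapReduce job (renamed from "k-medios.iter", which is not a valid R name)
kmeans.iter =
function(points, distfun, ncenters, centers = NULL) {
  from.dfs(mapreduce(input = points,
    map =
      if (is.null(centers)) {
        # First pass: no centers yet, so assign each point to a random cluster
        function(k,v) keyval(sample(1:ncenters,1),v)
      }
      else {
        # Later passes: key each point by its nearest center
        function(k,v) {
          distances = apply(centers, 1, function(c) distfun(c,v))
          keyval(centers[which.min(distances),], v)
        }
      },
    # iter.center is not defined on this slide; it presumably averages the points in vv
    reduce = function(k,vv) keyval(NULL, iter.center(vv)),
    structured = T))
}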
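The same map and reduce steps can be tried locally without Hadoop. The following base-R sketch uses synthetic points; `kmeans_step`, the data and the seed are hypothetical and stand in for the rmr job only conceptually.

```r
set.seed(7)  # seed chosen only for reproducibility
# Synthetic 2-D points around (0, 0) and (5, 5)
points <- rbind(matrix(rnorm(60, mean = 0), ncol = 2),
                matrix(rnorm(60, mean = 5), ncol = 2))
centers <- points[sample(nrow(points), 2), ]   # initial centers drawn from the data

kmeans_step <- function(points, centers) {
  # "Map": emit the index of the nearest center as the key for every point
  key <- apply(points, 1, function(p) which.min(colSums((t(centers) - p)^2)))
  # "Reduce": average the points that share a key to obtain the new center
  new_centers <- centers
  for (j in unique(key)) {
    new_centers[j, ] <- colMeans(points[key == j, , drop = FALSE])
  }
  list(centers = new_centers, key = key)
}

# Repeat the step a fixed number of times
for (i in 1:10) centers <- kmeans_step(points, centers)$centers
print(centers)
```

In the MapReduce version each iteration is a full job over the distributed data set, which is why the slide wraps a single iteration in its own function.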
Parallel computing
library(snowfall)
library(tm)
library(kernlab)
svm_parallel =
function(dtm){
  sfInit(parallel=TRUE, cpus=4, type="MPI")
  # as.matrix() extracts the weights; inspect() is meant for printing
  data = as.data.frame(as.matrix(dtm))
  data$type = factor(rep(1:5, times=c(500,500,500,500,564)))
  levels(data$type) = c('sports','tech','news','education','learning')
  # Randomly assign each of the 2564 documents to one of five subsets
  sub = sample(c(0,1,2,3,4), size=2564, replace=TRUE)
  wrapper = function(x){
    if(require(kernlab)){
      ksvm(type ~., data=x)
    }
  }
  # split() yields a list of five data frames; c() would flatten them into columns
  ksvm.models =
    sfClusterApplyLB(split(data, sub), wrapper)
  sfStop()
  ksvm.models
}
Parallel computing
> library(parallel)
> cl =
makeCluster(detectCores(logical=FALSE))
> parLapplyLB(cl, 46:50, test.function)
[[1]]
[1] 5.502622e+57
[[2]]
[1] 2.586232e+59
[[3]]
[1] 1.241392e+61
[[4]]
[1] 6.082819e+62
[[5]]
[1] 3.041409e+64
> stopCluster(cl)
Network analysis in R
library(igraph)
# Complete graph on 6 vertices
g <- graph.full(6, directed=FALSE)
plot(g)
# Ring of 10 vertices
g <- graph.ring(10, directed=FALSE)
plot(g)
# Star with 16 vertices centered on vertex 1
g <- graph.star(16, mode="undirected", center=1)
plot(g)
# Graph from an edge list: consecutive pairs (1,2), (4,5), (3,4), (5,6)
g <- graph(c(1,2,4,5,3,4,5,6), directed=FALSE)
plot(g)
# Random directed graph: an edge wherever the uniform draw exceeds 0.9
M <- matrix(runif(100), nrow=10)
g <- graph.adjacency(M>0.9)
plot(g)
> M[,1:5]
[,1] [,2] [,3] [,4] [,5]
[1,] 0.44746867 0.9753915 0.6890068 0.8500356 0.5812459
[2,] 0.10004725 0.9870645 0.9322102 0.6834764 0.8518852
[3,] 0.04882503 0.1599767 0.5268769 0.7756217 0.5713700
[4,] 0.91988082 0.4018993 0.3562261 0.7624379 0.1849250
[5,] 0.43281897 0.6032613 0.8240209 0.3340224 0.7189334
[6,] 0.87971431 0.9331585 0.4483813 0.4743045 0.5121772
[7,] 0.04519996 0.1875099 0.5615725 0.5913464 0.9487314
[8,] 0.78936780 0.6904077 0.6834867 0.2760950 0.1559759
[9,] 0.13621689 0.5607899 0.2745078 0.7246721 0.1932709
[10,] 0.54878255 0.4730136 0.7992216 0.4186087 0.2547914
> M[,1:5] > 0.9
[,1] [,2] [,3] [,4] [,5]
[1,] FALSE TRUE FALSE FALSE FALSE
[2,] FALSE TRUE TRUE FALSE FALSE
[3,] FALSE FALSE FALSE FALSE FALSE
[4,] TRUE FALSE FALSE FALSE FALSE
[5,] FALSE FALSE FALSE FALSE FALSE
[6,] FALSE TRUE FALSE FALSE FALSE
[7,] FALSE FALSE FALSE FALSE TRUE
[8,] FALSE FALSE FALSE FALSE FALSE
[9,] FALSE FALSE FALSE FALSE FALSE
[10,] FALSE FALSE FALSE FALSE FALSE
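The thresholded matrix is itself an adjacency matrix: a TRUE in row i, column j stands for a directed edge from vertex i to vertex j. As a small base-R sketch, with an assumed seed and illustrative variable names, the edge list and vertex degrees can be read off without igraph:

```r
set.seed(3)  # seed is an assumption; the slides do not set one
M <- matrix(runif(100), nrow = 10)
A <- M > 0.9                          # logical adjacency matrix

edges <- which(A, arr.ind = TRUE)     # one row per edge: from "row" to "col"
out_degree <- rowSums(A)              # edges leaving each vertex
in_degree  <- colSums(A)              # edges entering each vertex

print(edges)
print(out_degree)
print(in_degree)
```

With a threshold of 0.9 roughly one entry in ten is TRUE, so a 10-vertex graph built this way has on the order of ten edges.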
library(igraph)
g1 <- graph.full(6, directed=FALSE)
g2 <- graph(c(6,7,7,8,8,9,9,10,9,7,11,12,12,8),
directed=FALSE)
g <- graph.union(g1, g2)
plot(g)
> V(g)
Vertex sequence:
[1] 1 2 3 4 5 6 7 8 9 10 11 12
> degree(g)
[1] 5 5 5 5 5 6 3 3 3 1 1 2
> V(g)[degree(g)>1]
Vertex sequence:
[1] 1 2 3 4 5 6 7 8 9 12
> graph.dfs(g, 9)
$order
[1] 9 7 6 1 2 3 4 5 8 12 11 10
> graph.bfs(g, 9)
$order
[1] 9 7 8 10 6 12 1 2 3 4 5 11
Network analysis
● igraph
● graph
● network
● sna
Statistical graphics
Statistical graphics is, or should be, a
transdisciplinary field informed by scientific,
statistical, computing, aesthetic, psychological
and sociological considerations. [Leland
Wilkinson, The Grammar of Graphics]
The Grammar of Graphics
In brief, the grammar tells us that a statistical
graphic is a mapping from data to aesthetic
attributes (color, shape, size) of geometric
objects (points, lines, bars).
Histogram (hist)
Bar plot (barplot)
Scatter plot (plot)
> x=seq(from=-pi,to=pi,length.out=100)
> y=sin(x)
> plot(x, y, col="blue")
Probability density curve
> x=seq(from=-pi,to=pi,length.out=100)
> y = dnorm(x)
> plot(x, y, col="blue")
Colored contour plot
Scatter plot matrix
Matrix plot (matplot)
matplot(test.matrix,type="b")
High-level plotting packages
● lattice
● ggplot2
An implementation of the grammar of graphics
in R
ggplot2
● Data and Mapping
● Geom (geometric objects)
● Stat (statistical transformations)
● Scale
● Coord (coordinate systems)
● Facet (faceting)
● Layer
ggplot2
● Test data
> str(mpg)
'data.frame': 234 obs. of 11 variables:
$ manufacturer: Factor w/ 15 levels "audi","chevrolet",..: 1 1 1 1 1 1 1 1 1 1 ...
$ model : Factor w/ 38 levels "4runner 4wd",..: 2 2 2 2 2 2 2 3 3 3 ...
$ displ : num 1.8 1.8 2 2 2.8 2.8 3.1 1.8 1.8 2 ...
$ year : int 1999 1999 2008 2008 1999 1999 2008 1999 1999 2008 ...
$ cyl : int 4 4 4 4 6 6 6 4 4 4 ...
$ trans : Factor w/ 10 levels "auto(av)","auto(l3)",..: 4 9 10 1 4 9 1 9 4 10 ...
$ drv : Factor w/ 3 levels "4","f","r": 2 2 2 2 2 2 2 1 1 1 ...
$ cty : int 18 21 20 21 16 18 18 18 16 20 ...
$ hwy : int 29 29 31 30 26 26 27 26 25 28 ...
$ fl : Factor w/ 5 levels "c","d","e","p",..: 4 4 4 4 4 4 4 4 4 4 ...
$ class : Factor w/ 7 levels "2seater","compact",..: 2 2 2 2 2 2 2 2 2 2 ...
ggplot2
> library(ggplot2)
> p <- ggplot(data=mpg,
mapping=aes(x=cty,y=hwy))
> p + geom_point()
ggplot2
> p <- ggplot(data=mpg,
mapping=aes(x=cty,y=hwy,colour=factor(year)))
> p + geom_point()
ggplot2
> p + geom_point() + stat_smooth()
ggplot2
> p + geom_point(mapping=aes(size=displ)) +
stat_smooth()
ggplot2
> p + geom_point(mapping=aes(size=displ)) + stat_smooth() +
coord_cartesian(xlim=c(20,30),ylim=c(0,40))
ggplot2
> p + geom_point(mapping=aes(size=displ)) + stat_smooth() +
facet_wrap(~year,ncol=2)
ggplot2
qplot(x,y,colour=factor(y))
ggplot2
y = sin(x) + rnorm(100)
qplot(x,y,colour=factor(y))
ggplot2
plotmatrix(data,mapping=aes(),colour="blue")
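Chinese-language R blogs
● Xiao Kai (肖凯)
http://xccds1977.blogspot.jp
● Liu Sizhe (刘思喆)
Moderator of the R language board at Capital of Statistics (统计之都)
http://cos.name/cn/
● Yihui Xie (谢益辉)
http://yihui.name/
International websites
● Data scientists on Twitter
Big Data: Experts to Follow on Twitter
● R-related papers and books
Journal of Statistical Software
● R and Data Mining
http://www.rdatamining.com/
● R-project search
http://www.rseek.org/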

April 2024 - Crypto Market Report's Analysismanisha194592
 
Zuja dropshipping via API with DroFx.pptx
Zuja dropshipping via API with DroFx.pptxZuja dropshipping via API with DroFx.pptx
Zuja dropshipping via API with DroFx.pptxolyaivanovalion
 
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...amitlee9823
 
Edukaciniai dropshipping via API with DroFx
Edukaciniai dropshipping via API with DroFxEdukaciniai dropshipping via API with DroFx
Edukaciniai dropshipping via API with DroFxolyaivanovalion
 
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al BarshaAl Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al BarshaAroojKhan71
 
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...amitlee9823
 
Determinants of health, dimensions of health, positive health and spectrum of...
Determinants of health, dimensions of health, positive health and spectrum of...Determinants of health, dimensions of health, positive health and spectrum of...
Determinants of health, dimensions of health, positive health and spectrum of...shambhavirathore45
 
Ravak dropshipping via API with DroFx.pptx
Ravak dropshipping via API with DroFx.pptxRavak dropshipping via API with DroFx.pptx
Ravak dropshipping via API with DroFx.pptxolyaivanovalion
 
Introduction-to-Machine-Learning (1).pptx
Introduction-to-Machine-Learning (1).pptxIntroduction-to-Machine-Learning (1).pptx
Introduction-to-Machine-Learning (1).pptxfirstjob4
 
Midocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFxMidocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFxolyaivanovalion
 
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...amitlee9823
 
Best VIP Call Girls Noida Sector 39 Call Me: 8448380779
Best VIP Call Girls Noida Sector 39 Call Me: 8448380779Best VIP Call Girls Noida Sector 39 Call Me: 8448380779
Best VIP Call Girls Noida Sector 39 Call Me: 8448380779Delhi Call girls
 

Recently uploaded (20)

Generative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and MilvusGenerative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and Milvus
 
VidaXL dropshipping via API with DroFx.pptx
VidaXL dropshipping via API with DroFx.pptxVidaXL dropshipping via API with DroFx.pptx
VidaXL dropshipping via API with DroFx.pptx
 
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdfMarket Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
 
BigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptxBigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptx
 
Sampling (random) method and Non random.ppt
Sampling (random) method and Non random.pptSampling (random) method and Non random.ppt
Sampling (random) method and Non random.ppt
 
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
 
April 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's AnalysisApril 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's Analysis
 
Zuja dropshipping via API with DroFx.pptx
Zuja dropshipping via API with DroFx.pptxZuja dropshipping via API with DroFx.pptx
Zuja dropshipping via API with DroFx.pptx
 
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
 
Edukaciniai dropshipping via API with DroFx
Edukaciniai dropshipping via API with DroFxEdukaciniai dropshipping via API with DroFx
Edukaciniai dropshipping via API with DroFx
 
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al BarshaAl Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
 
Delhi 99530 vip 56974 Genuine Escort Service Call Girls in Kishangarh
Delhi 99530 vip 56974 Genuine Escort Service Call Girls in  KishangarhDelhi 99530 vip 56974 Genuine Escort Service Call Girls in  Kishangarh
Delhi 99530 vip 56974 Genuine Escort Service Call Girls in Kishangarh
 
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
 
Determinants of health, dimensions of health, positive health and spectrum of...
Determinants of health, dimensions of health, positive health and spectrum of...Determinants of health, dimensions of health, positive health and spectrum of...
Determinants of health, dimensions of health, positive health and spectrum of...
 
Ravak dropshipping via API with DroFx.pptx
Ravak dropshipping via API with DroFx.pptxRavak dropshipping via API with DroFx.pptx
Ravak dropshipping via API with DroFx.pptx
 
Introduction-to-Machine-Learning (1).pptx
Introduction-to-Machine-Learning (1).pptxIntroduction-to-Machine-Learning (1).pptx
Introduction-to-Machine-Learning (1).pptx
 
Midocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFxMidocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFx
 
Abortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Doha Qatar (+966572737505 ! Get CytotecAbortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Doha Qatar (+966572737505 ! Get Cytotec
 
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
 
Best VIP Call Girls Noida Sector 39 Call Me: 8448380779
Best VIP Call Girls Noida Sector 39 Call Me: 8448380779Best VIP Call Girls Noida Sector 39 Call Me: 8448380779
Best VIP Call Girls Noida Sector 39 Call Me: 8448380779
 

R and data mining

  • 1. R and Data Mining 美味书签 (AVOS China) 杨朝中
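  • 5. R and Data Mining ● Introduction to the R language ● Text mining framework in R ● High Performance Computing in R ● Network analysis in R ● Statistical graphics
  • 6. R and Data Mining ● Introduction to the R language ● Text mining framework in R ● High Performance Computing in R ● Network analysis in R ● Statistical graphics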
  • 7. Introduction to the R language ● Statistical computing ● CRAN (Comprehensive R Archive Network)
  • 9. Object types ● Vector ● Factor ● Array and matrix ● Data frame and list (data.frame and list) ● Function
  • 10. Vector
      > test.vector = c(1:100)
      > test.vector
      [1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22
      [23] 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44
      [45] 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66
      [67] 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88
      [89] 89 90 91 92 93 94 95 96 97 98 99 100
      > test.vector[3]
      [1] 3
      > test.vector[1]
      [1] 1
      > sum(test.vector)
      [1] 5050
      > mean(test.vector)
      [1] 50.5
      > var(test.vector)
      [1] 841.6667
      > sd(test.vector)
      [1] 29.01149
  • 11. Factor
      > test.factor = factor(c(1,1,2,2,2,3,3,3,4,4,1,1,4,4))
      > test.factor
      [1] 1 1 2 2 2 3 3 3 4 4 1 1 4 4
      Levels: 1 2 3 4
      > levels(test.factor) = c("first","second","third","fourth")
      > test.factor
      [1] first first second second second third third third fourth fourth first first
      [13] fourth fourth
      Levels: first second third fourth
      > levels(test.factor) = c("a","b","c","d")
      > test.factor
      [1] a a b b b c c c d d a a d d
      Levels: a b c d
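The factor slide relabels levels after the fact with `levels<-`. As a minimal sketch of what that does, the example below builds the same factor with labels attached at creation time and checks that renaming levels changes only the labels, not the underlying integer codes. The variable names (`codes`, `f`, `g`, `int_codes`, `counts`) are illustrative and not from the slides.

```r
# Same raw values as on the slide.
codes <- c(1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 1, 1, 4, 4)

# Attach labels when the factor is created instead of renaming afterwards.
f <- factor(codes, levels = 1:4,
            labels = c("first", "second", "third", "fourth"))

# A factor is stored as integer codes plus a character vector of levels.
int_codes <- as.integer(f)

# Frequency of each level.
counts <- table(f)

# Renaming levels replaces labels position by position; codes stay the same.
g <- f
levels(g) <- c("a", "b", "c", "d")
```

Setting `levels` and `labels` explicitly also fixes the level order, which matters when a factor later feeds a model or a plot legend.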
  • 12. Array
      > test.array = array(rbinom(100,5,0.5),dim=c(4,5,5))
      > test.array
      , , 1
      [,1] [,2] [,3] [,4] [,5]
      [1,] 1 3 2 3 1
      [2,] 4 2 2 2 2
      [3,] 2 1 3 3 5
      [4,] 2 2 4 2 2
      > test.array[,3,]
      [,1] [,2] [,3] [,4] [,5]
      [1,] 2 3 4 4 2
      [2,] 2 2 2 1 1
      [3,] 3 2 4 3 4
      [4,] 4 3 3 1 2
      > test.array[3,2,]
      [1] 1 2 3 1 1
  • 13. Matrix
      > test.matrix = matrix(rpois(50,5),nrow=5)
      > test.matrix
      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
      [1,] 6 3 12 7 6 2 3 5 4 4
      [2,] 2 5 11 3 1 4 7 2 5 5
      [3,] 2 4 1 5 1 3 2 7 5 8
      [4,] 4 7 5 8 4 5 3 2 6 2
      [5,] 9 15 5 6 2 4 8 8 5 3
      > t(test.matrix)
      [,1] [,2] [,3] [,4] [,5]
      [1,] 6 2 2 4 9
      [2,] 3 5 4 7 15
      [3,] 12 11 1 5 5
      [4,] 7 3 5 8 6
      [5,] 6 1 1 4 2
      [6,] 2 4 3 5 4
      [7,] 3 7 2 3 8
      [8,] 5 2 7 2 8
      [9,] 4 5 5 6 5
      [10,] 4 5 8 2 3
  • 14. Matrix
      > test.matrix = matrix(runif(25,min=1,max=5),nrow=5)
      > test.matrix
      [,1] [,2] [,3] [,4] [,5]
      [1,] 1.844365 2.470590 4.744482 4.693239 2.597706
      [2,] 2.051089 2.954349 4.807748 3.974937 2.487159
      [3,] 4.554397 2.187724 4.519553 4.916905 3.988060
      [4,] 4.629351 3.770774 2.992690 4.660705 2.510643
      [5,] 3.894542 3.281654 2.471337 3.484586 2.115016
      > qr(test.matrix)
      $qr
      [,1] [,2] [,3] [,4] [,5]
      [1,] -8.0591276 -6.30550129 -7.7768280 -9.2254948 -5.94547975
      [2,] 0.2545051 -2.20153679 -2.8030382 -2.2409546 -0.64008014
      [3,] 0.5651229 -0.83950762 -3.5747057 -2.2750825 -1.96267828
      [4,] 0.5744234 -0.15061209 -0.6607485 0.7479590 0.01142934
      [5,] 0.4832462 -0.07700937 -0.6148309 0.9179222 0.06790194
      $rank
      [1] 5
      $qraux
      [1] 1.22885416 1.51634534 1.43057441 1.39676050 0.06790194
  • 15. Matrix
      > svd(test.matrix)
      $d
      [1] 17.66944239 3.22284465 1.78184517 0.61566884 0.05156261
      $u
      [,1] [,2] [,3] [,4] [,5]
      [1,] -0.4285623 -0.55858839 0.1433838 0.6112554 0.33184518
      [2,] -0.4207851 -0.46523651 0.3361892 -0.6261498 -0.31844658
      [3,] -0.5179119 0.03462469 -0.8461578 -0.1172279 -0.02903471
      [4,] -0.4722861 0.50932622 0.2777685 0.3687009 -0.55175807
      [5,] -0.3846913 0.45926238 0.2707020 -0.2908960 0.69511911
      $v
      [,1] [,2] [,3] [,4] [,5]
      [1,] -0.4356020 0.71976143 -0.31404796 -0.1898322 -0.39690304
      [2,] -0.3666388 0.23238151 0.80369243 -0.2606880 0.31256209
      [3,] -0.4958375 -0.64266729 -0.01537137 -0.4151453 -0.41053867
      [4,] -0.5530530 -0.10129870 0.04863968 0.8254724 -0.01001832
      [5,] -0.3522846 -0.06826158 -0.50284218 -0.2055605 0.75903264
  • 16. Matrix
      > cbind(test.matrix,rep(1,times=5))
      [,1] [,2] [,3] [,4] [,5] [,6]
      [1,] 1.844365 2.470590 4.744482 4.693239 2.597706 1
      [2,] 2.051089 2.954349 4.807748 3.974937 2.487159 1
      [3,] 4.554397 2.187724 4.519553 4.916905 3.988060 1
      [4,] 4.629351 3.770774 2.992690 4.660705 2.510643 1
      [5,] 3.894542 3.281654 2.471337 3.484586 2.115016 1
      > rbind(test.matrix, seq(1,2,length.out=5))
      [,1] [,2] [,3] [,4] [,5]
      [1,] 1.844365 2.470590 4.744482 4.693239 2.597706
      [2,] 2.051089 2.954349 4.807748 3.974937 2.487159
      [3,] 4.554397 2.187724 4.519553 4.916905 3.988060
      [4,] 4.629351 3.770774 2.992690 4.660705 2.510643
      [5,] 3.894542 3.281654 2.471337 3.484586 2.115016
      [6,] 1.000000 1.250000 1.500000 1.750000 2.000000
  • 17. Data frame (data.frame)
      > test.data.frame = data.frame(id=1:10,name=letters[1:10],age=sample(c(25,23,24),size=10,replace=TRUE))
      > test.data.frame
      id name age
      1 1 a 25
      2 2 b 23
      3 3 c 23
      4 4 d 23
      5 5 e 24
      6 6 f 24
      7 7 g 24
      8 8 h 25
      9 9 i 25
      10 10 j 25
      > test.data.frame$id
      [1] 1 2 3 4 5 6 7 8 9 10
      > test.data.frame$name
      [1] a b c d e f g h i j
      Levels: a b c d e f g h i j
      > test.data.frame$age
      [1] 25 23 23 23 24 24 24 25 25 25
  • 18. List
      > test.list = list(test.vector,test.factor,test.array,test.matrix,test.data.frame)
      > str(test.list)
      List of 5
      $ : int [1:100] 1 2 3 4 5 6 7 8 9 10 ...
      $ : Factor w/ 4 levels "a","b","c","d": 1 1 2 2 2 3 3 3 4 4 ...
      $ : num [1:4, 1:5, 1:5] 1 4 2 2 3 2 1 2 2 2 ...
      $ : num [1:5, 1:5] 1.84 2.05 4.55 4.63 3.89 ...
      $ :'data.frame': 10 obs. of 3 variables:
      ..$ id : int [1:10] 1 2 3 4 5 6 7 8 9 10
      ..$ name: Factor w/ 10 levels "a","b","c","d",..: 1 2 3 4 5 6 7 8 9 10
      ..$ age : num [1:10] 25 23 23 23 24 24 24 25 25 25
      > test.list[4]
      [[1]]
      [,1] [,2] [,3] [,4] [,5]
      [1,] 1.844365 2.470590 4.744482 4.693239 2.597706
      [2,] 2.051089 2.954349 4.807748 3.974937 2.487159
      [3,] 4.554397 2.187724 4.519553 4.916905 3.988060
      [4,] 4.629351 3.770774 2.992690 4.660705 2.510643
      [5,] 3.894542 3.281654 2.471337 3.484586 2.115016
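On the list slide, `test.list[4]` prints `[[1]]` above the matrix because single brackets return a sub-list rather than the element itself. The short sketch below contrasts `[`, `[[` and `$` on a small named list; the list `x` and its element names are made up for illustration.

```r
# A small named list holding objects of different types.
x <- list(numbers = 1:3,
          letters = c("a", "b"),
          m = matrix(1:4, nrow = 2))

# Single brackets keep the list wrapper: the result is a list of length 1.
sub <- x[3]

# Double brackets extract the element itself, here the matrix.
elem <- x[[3]]

# For named lists, $ is shorthand for [[ with a name.
by_name <- x$numbers
```

Use `[[` or `$` when the element is needed for further computation, and `[` when a shorter list is wanted.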
  • 19. Function
      > test.function = function(x) factorial(x)
      > test.function(3)
      [1] 6
      > lapply(test.vector[31:35],test.function)
      [[1]]
      [1] 8.222839e+33
      [[2]]
      [1] 2.631308e+35
      [[3]]
      [1] 8.683318e+36
      [[4]]
      [1] 2.952328e+38
      [[5]]
      [1] 1.033315e+40
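The function slide applies `test.function` with `lapply`, which always returns a list. As a small sketch, the same call can be written with `vapply` when a plain numeric vector is preferred; the inputs `3:5` are chosen here only to keep the numbers small.

```r
# Same wrapper as on the slide.
test.function <- function(x) factorial(x)

# lapply returns one list element per input.
as_list <- lapply(3:5, test.function)

# vapply returns an atomic vector and checks that every result
# is a single numeric value.
as_vector <- vapply(3:5, test.function, numeric(1))
```

`vapply` fails early if a result has an unexpected type or length, which makes it a safer default than `sapply` inside functions.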
  • 21. Introduction to the R language ● Statistical computing ● CRAN (Comprehensive R Archive Network)
  • 22. CRAN ● CRAN Task Views ● Natural Language Processing ● Machine Learning & Statistical Learning ● High-Performance and Parallel Computing with R ● gRaphical Models in R ● Graphic displays
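  • 23. R and Data Mining ● Introduction to the R language ● Text mining framework in R ● High Performance Computing in R ● Network analysis in R ● Statistical graphics
  • 26. Text Preprocessing in R ● Data import: Corpus, PlainTextDocument, tm_map ● Chinese word segmentation: rmmseg4j ● English stemming: Rstem, Snowball, RWeka ● English sentence detection: openNLP ● English synonyms: wordnet ● Building a tf-idf weighted document-term matrix: DocumentTermMatrix, weightTfIdf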
  • 27. Preprocessing
      library(tm)
      library(rmmseg4j)
      library(openNLP)
      library(Rstem)
      library(Snowball)
      cor = Corpus(DirSource("~/work/text-mining/20news-bydate-test/1000/"),
                   readerControl=list(reader=readPlain))
      cwsed = tm_map(cor, function(x){
        PlainTextDocument(mmseg4j(as.character(x), method="maxword"), id=ID(x))
      })
      dtm = DocumentTermMatrix(cwsed, control=list(
        weighting = function(x){ weightTfIdf(x) },
        wordLengths=c(1,Inf)))
  • 28. Text clustering
      Dimensionality reduction
      > nTerms(dtm)
      [1] 103757
      > dtm2 = removeSparseTerms(dtm, 0.9)
      > nTerms(dtm2)
      [1] 709
      Clustering
      km = kmeans(as.matrix(dtm2), centers=5, iter.max=10)
      dbscan? spectral clustering?
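The pipeline on the two slides above weights a document-term matrix by tf-idf, drops sparse terms, and clusters what remains. The sketch below reproduces those three steps in base R on a tiny hand-made count matrix, without the tm package. The data, the 30% document-frequency cutoff, and all variable names are assumptions for illustration; the weighting (term frequency normalised by document length, times log2 of the inverse document frequency) is meant to approximate tm's `weightTfIdf` defaults rather than replicate them exactly.

```r
# Toy document-term count matrix: 4 documents, 4 terms (made-up data).
counts <- matrix(c(3, 0, 1, 1,
                   2, 0, 1, 0,
                   0, 3, 1, 0,
                   0, 2, 2, 0),
                 nrow = 4, byrow = TRUE,
                 dimnames = list(paste0("d", 1:4),
                                 c("ball", "cpu", "the", "zebra")))

# Term frequency, normalised by document length.
tf <- counts / rowSums(counts)

# Document frequency and inverse document frequency.
df <- colSums(counts > 0)
idf <- log2(nrow(counts) / df)

# tf-idf: scale each term column by its idf.
tfidf <- sweep(tf, 2, idf, `*`)

# Crude sparse-term filter: keep terms present in more than 30% of documents.
keep <- df / nrow(counts) > 0.3
reduced <- tfidf[, keep, drop = FALSE]

# Cluster the documents on the reduced matrix.
set.seed(1)
km <- kmeans(reduced, centers = 2, iter.max = 10, nstart = 5)
```

A term that appears in every document, such as "the" here, gets an idf of zero and so contributes nothing to the distances that k-means uses.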
  • 29. Cluster validation ● Internal measures ● Stability measures ● Biological
  • 30. Internal measures ● Connectivity ● Silhouette Width ● Dunn Index
  • 31. Stability measures ● Average Proportion of Non-overlap(APN) ● Average Distance (AD)
  • 32. Stability measures ● Average Distance between Means (ADM) ● Figure of Merit (FOM)
  • 33. Biological ● Biological Homogeneity Index (BHI) ● Biological Stability Index (BSI)
  • 34. Cluster validation
      library(tm)
      library(kernlab)
      library(clValid)
      intern = clValid(as.matrix(dtm2), 2:10,
                       clMethods=c("hierarchical","kmeans","pam"),
                       validation="internal", maxitems=3000)
      summary(intern)
      op <- par(no.readonly=TRUE)
      par(mfrow=c(2,2), mar=c(4,4,3,1))
      plot(intern, legend=FALSE)
      legend("right", clusterMethods(intern), col=1:9, lty=1:9, pch=paste(1:9))
      par(op)
  • 36. Text classification ● Naive Bayes ● Support Vector Machine: LIBSVM (e1071) and LIBLINEAR (LiblineaR), both from 林智仁 (Chih-Jen Lin), National Taiwan University
  • 37. Evaluation and Accuracy Improvement ● Cross validation ● Bootstrap ● Ensemble methods
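Cross validation, the first item on the evaluation slide, can be sketched in a few lines of base R. The example below runs 5-fold cross validation on the built-in `mtcars` data with a linear regression standing in for the classifiers discussed in the deck; the dataset, the model formula, the fold count, and the RMSE metric are all assumptions chosen to keep the example self-contained.

```r
set.seed(42)
k <- 5
n <- nrow(mtcars)

# Assign every row to one of k folds of near-equal size, in random order.
fold <- sample(rep(1:k, length.out = n))

rmse <- numeric(k)
for (i in 1:k) {
  train <- mtcars[fold != i, ]   # fit on k - 1 folds
  test  <- mtcars[fold == i, ]   # evaluate on the held-out fold
  fit   <- lm(mpg ~ wt + hp, data = train)
  pred  <- predict(fit, newdata = test)
  rmse[i] <- sqrt(mean((test$mpg - pred)^2))
}

# Average held-out error across folds.
cv_rmse <- mean(rmse)
```

For a classifier, the same loop applies with the model call swapped out and the error measure replaced by, for example, the misclassification rate.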
  • 38. R and Data Mining ● Introduction to the R language ● Text mining framework in R ● High Performance Computing in R ● Network analysis in R ● Statistical graphics
  • 39. High Performance Computing in R ● Parallel computing: Rmpi, snowfall, snowFT, parallel (R >= 2.14), RHadoop ● Large memory and out-of-memory data: ff, HadoopStreaming ● Easier interfaces for compiled code: Rcpp, rJava, inline ● Profiling tools: profr, proftools
  • 41. RHadoop ● rmr2: mapreduce, from.dfs, to.dfs, keyval ● rhdfs: hdfs.file, hdfs.close, hdfs.exists, hdfs.cp, hdfs.read ● rhbase: hb.new.table, hb.delete.table, hb.insert, hb.get
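  • 42. One k-means iteration with rmr2 (iter.center is assumed to be defined elsewhere)
      kmeans.iter = function(points, distfun, ncenters, centers = NULL) {
        from.dfs(mapreduce(
          input = points,
          # map: with no centers yet, assign points to random clusters;
          # otherwise key each point by its nearest center
          map = if (is.null(centers)) {
            function(k,v) keyval(sample(1:ncenters,1),v)
          } else {
            function(k,v) {
              distances = apply(centers, 1, function(c) distfun(c,v))
              keyval(centers[which.min(distances),], v)
            }
          },
          # reduce: compute the new center for each group of points
          reduce = function(k,vv) keyval(NULL, iter.center(vv)),
          structured = T))
      }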
  • 43. Parallel computing
      library(snowfall)
      library(tm)
      library(kernlab)
      svm_parallel = function(dtm){
        sfInit(parallel=TRUE, cpus=4, type="MPI")
        # as.matrix() rather than inspect(), which is meant for printing
        data = as.data.frame(as.matrix(dtm))
        data$type = factor(rep(1:5, times=c(500,500,500,500,564)))
        levels(data$type) = c('sports','tech','news','education','learning')
        # randomly assign each of the 2564 documents to one of five subsets
        sub = sample(c(0,1,2,3,4), size=2564, replace=T)
        wrapper = function(x){
          if(require(kernlab)){
            ksvm(type ~., data=x)
          }
        }
        # list(), not c(): c() would flatten the data frames into a list of columns
        ksvm.models = sfLapplyLB(list(data[sub==0,],data[sub==1,],data[sub==2,],data[sub==3,],data[sub==4,]), wrapper)
        sfStop()
        ksvm.models
      }
  • 44. Parallel computing
      > library(parallel)
      > cl = makeCluster(detectCores(logical=FALSE))
      > parLapplyLB(cl, 46:50, test.function)
      [[1]]
      [1] 5.502622e+57
      [[2]]
      [1] 2.586232e+59
      [[3]]
      [1] 1.241392e+61
      [[4]]
      [1] 6.082819e+62
      [[5]]
      [1] 3.041409e+64
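The `parallel` example on the slide starts a cluster but never shuts it down, which leaves worker processes running. A minimal sketch of the same call with cleanup is shown below; the fixed worker count of 2 and the comparison against a serial `lapply` are assumptions added for illustration.

```r
library(parallel)

# Same wrapper as on the earlier function slide.
test.function <- function(x) factorial(x)

# Start two local worker processes (a fixed count keeps the sketch portable).
cl <- makeCluster(2)

# Run the load-balanced apply and stop the workers even if it fails.
result <- tryCatch(
  parLapplyLB(cl, 46:50, test.function),
  finally = stopCluster(cl)
)

# The serial equivalent, for comparison.
serial <- lapply(46:50, test.function)
```

For work this small the cost of starting workers outweighs any speed-up; parallel apply pays off when each task takes far longer than the communication overhead.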
  • 45. R and Data Mining ● Introduction to the R language ● Text mining framework in R ● High Performance Computing in R ● Network analysis in R ● Statistical graphics
  • 48. Star graph
      library(igraph)
      g <- graph.star(16, mode = c("undirected"), center = 1)
      plot(g)
  • 50. Graph from a thresholded random matrix
      library(igraph)
      M <- matrix(runif(100),nrow=10)
      g <- graph.adjacency(M>0.9)
      plot(g)
  • 51. First five columns of M and the resulting adjacency pattern
      > M[,1:5]
      [,1] [,2] [,3] [,4] [,5]
      [1,] 0.44746867 0.9753915 0.6890068 0.8500356 0.5812459
      [2,] 0.10004725 0.9870645 0.9322102 0.6834764 0.8518852
      [3,] 0.04882503 0.1599767 0.5268769 0.7756217 0.5713700
      [4,] 0.91988082 0.4018993 0.3562261 0.7624379 0.1849250
      [5,] 0.43281897 0.6032613 0.8240209 0.3340224 0.7189334
      [6,] 0.87971431 0.9331585 0.4483813 0.4743045 0.5121772
      [7,] 0.04519996 0.1875099 0.5615725 0.5913464 0.9487314
      [8,] 0.78936780 0.6904077 0.6834867 0.2760950 0.1559759
      [9,] 0.13621689 0.5607899 0.2745078 0.7246721 0.1932709
      [10,] 0.54878255 0.4730136 0.7992216 0.4186087 0.2547914
      > M[,1:5] > 0.9
      [,1] [,2] [,3] [,4] [,5]
      [1,] FALSE TRUE FALSE FALSE FALSE
      [2,] FALSE TRUE TRUE FALSE FALSE
      [3,] FALSE FALSE FALSE FALSE FALSE
      [4,] TRUE FALSE FALSE FALSE FALSE
      [5,] FALSE FALSE FALSE FALSE FALSE
      [6,] FALSE TRUE FALSE FALSE FALSE
      [7,] FALSE FALSE FALSE FALSE TRUE
      [8,] FALSE FALSE FALSE FALSE FALSE
      [9,] FALSE FALSE FALSE FALSE FALSE
      [10,] FALSE FALSE FALSE FALSE FALSE
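The two slides above turn a random matrix into a graph by thresholding it at 0.9. The sketch below shows, in base R only, what that logical matrix encodes: an edge list and per-vertex degrees. It assumes the usual convention that entry `[i, j]` being TRUE means an edge from vertex i to vertex j; the seed and variable names are illustrative.

```r
set.seed(7)

# Random 10 x 10 matrix, as on the slide.
M <- matrix(runif(100), nrow = 10)

# Logical adjacency matrix: TRUE where the entry exceeds the threshold.
A <- M > 0.9

# Edge list: one row per TRUE entry, as (row, column) index pairs.
edges <- which(A, arr.ind = TRUE)
colnames(edges) <- c("from", "to")

# Out-degree counts TRUE entries per row, in-degree per column.
out_degree <- rowSums(A)
in_degree <- colSums(A)
```

With uniform random entries about one in ten cells exceeds 0.9, so the resulting graph is sparse, and a TRUE on the diagonal corresponds to a self-loop.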
  • 52. Union of a complete graph and a hand-built graph
      library(igraph)
      g1 <- graph.full(6, directed=FALSE)
      g2 <- graph(c(6,7,7,8,8,9,9,10,9,7,11,12,12,8), directed=FALSE)
      g <- graph.union(g1, g2)
      plot(g)
  • 53. Vertices, degrees and traversals
      > V(g)
      Vertex sequence:
      [1] 1 2 3 4 5 6 7 8 9 10 11 12
      > degree(g)
      [1] 5 5 5 5 5 6 3 3 3 1 1 2
      > V(g)[degree(g)>1]
      Vertex sequence:
      [1] 1 2 3 4 5 6 7 8 9 12
      > graph.dfs(g, 9)
      $order
      [1] 9 7 6 1 2 3 4 5 8 12 11 10
      > graph.bfs(g, 9)
      $order
      [1] 9 7 8 10 6 12 1 2 3 4 5 11
  • 55. R and Data Mining ● Introduction to the R language ● Text mining framework in R ● High Performance Computing in R ● Basics of network analysis in R ● Statistical graphics
  • 56. Statistical graphics: "Statistical graphics is, or should be, a transdisciplinary field informed by scientific, statistical, computing, aesthetic, psychological and sociological considerations." [Leland Wilkinson, The Grammar of Graphics]
  • 57. The Grammar of Graphics: In brief, the grammar tells us that a statistical graphic is a mapping from data to aesthetic attributes (color, shape, size) of geometric objects (points, lines, bars).
  • 65. High-level plotting packages ● lattice ● ggplot2: an implementation of the grammar of graphics in R
  • 66. ggplot2 ● Data and Mapping ● Geom (geometric objects) ● Stat (statistical transformations) ● Scale ● Coord (coordinate system) ● Facet (faceting) ● Layer
  • 67. ggplot2 ● Test data
      > str(mpg)
      'data.frame': 234 obs. of 11 variables:
      $ manufacturer: Factor w/ 15 levels "audi","chevrolet",..: 1 1 1 1 1 1 1 1 1 1 ...
      $ model : Factor w/ 38 levels "4runner 4wd",..: 2 2 2 2 2 2 2 3 3 3 ...
      $ displ : num 1.8 1.8 2 2 2.8 2.8 3.1 1.8 1.8 2 ...
      $ year : int 1999 1999 2008 2008 1999 1999 2008 1999 1999 2008 ...
      $ cyl : int 4 4 4 4 6 6 6 4 4 4 ...
      $ trans : Factor w/ 10 levels "auto(av)","auto(l3)",..: 4 9 10 1 4 9 1 9 4 10 ...
      $ drv : Factor w/ 3 levels "4","f","r": 2 2 2 2 2 2 2 1 1 1 ...
      $ cty : int 18 21 20 21 16 18 18 18 16 20 ...
      $ hwy : int 29 29 31 30 26 26 27 26 25 28 ...
      $ fl : Factor w/ 5 levels "c","d","e","p",..: 4 4 4 4 4 4 4 4 4 4 ...
      $ class : Factor w/ 7 levels "2seater","compact",..: 2 2 2 2 2 2 2 2 2 2 ...
  • 68. ggplot2
      > library(ggplot2)
      > p <- ggplot(data=mpg, mapping=aes(x=cty,y=hwy))
      > p + geom_point()
  • 69. ggplot2
      > p <- ggplot(data=mpg, mapping=aes(x=cty,y=hwy,colour=factor(year)))
      > p + geom_point()
  • 70. ggplot2
      > p + geom_point() + stat_smooth()
  • 71. ggplot2
      > p + geom_point(mapping=aes(size=displ)) + stat_smooth()
  • 72. ggplot2
      > p + geom_point(mapping=aes(size=displ)) + stat_smooth() + coord_cartesian(xlim=c(20,30),ylim=c(0,40))
  • 73. ggplot2
      > p + geom_point(mapping=aes(size=displ)) + stat_smooth() + facet_wrap(~year,ncol=2)
  • 75. ggplot2
      y = sin(x) + rnorm(100)
      qplot(x, y, colour=factor(y))
  • 77. Chinese-language R blogs ● 肖凯 (Xiao Kai): http://xccds1977.blogspot.jp ● 刘思喆 (Liu Sizhe), moderator of the R board at 统计之都 (Capital of Statistics): http://cos.name/cn/ ● 谢益辉 (Yihui Xie): http://yihui.name/
  • 78. International websites ● Data scientists on Twitter: Big Data: Experts to Follow on Twitter ● R-related papers and books: Journal of Statistical Software ● R and Data Mining: http://www.rdatamining.com/ ● R-project search: http://www.rseek.org/