Data Science 101: Using R Language
to get Big Insights
Satnam Singh,
Senior Chief Engineer,
Samsung Research India – Bangalore
[ Twitter - @satnam74s]
India Software Developers Conference, Bangalore
March 16, 2013
Motivation: Using Data to get Business Insights
[Figure: several groups of databases and clusters, each raising the question "Insights?"]
Ref. [kaggle.com]
Data Science Programming Languages
Why R?
• Popular, Free
• Open source
• Multi-platform
• Vectorization
• Many statistical packages
• Large support base
• Object-oriented programming language
Ref [http://www.r-project.org]
R Language Basics
> y <- 21
> y
[1] 21
> z = 233
> z
[1] 233
> y <- c(1,2,3,4)
> y
[1] 1 2 3 4
Simple operations, vector operations, function calls
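The three kinds of statement named on this slide can be sketched as below. This is a minimal illustration only; the variable names (`total`, `doubled`, `v_mean`, `v_len`) are made up here and are not from the talk.

```r
# Simple operation: arithmetic on scalars (a scalar is a length-1 vector in R).
y <- 21
z <- 233
total <- y + z

# Vector operation: arithmetic is applied element by element, no loop needed.
v <- c(1, 2, 3, 4)
doubled <- v * 2

# Function calls: built-in functions take whole vectors as arguments.
v_mean <- mean(v)
v_len <- length(v)

print(total)
print(doubled)
print(v_mean)
```

Vectorization is why the doubling needs no explicit loop: the `*` operator works on every element of `v` at once.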
R Language: Data Structures Examples
• Data frame
• Matrix
• List
> MyFamilyage <- c(5,6,40,38)
> MyFamilyName <- c("Sat","Veera","Minu","Dummy")
> MyFamilyweight <- c(72,70,12,40)
> MyFamily <- data.frame(MyFamilyName,MyFamilyage,MyFamilyweight)
> MyMatrix <- as.matrix(MyFamilyage)
> Mydataframe <- as.data.frame(MyMatrix)
> MyList <- as.list(Mydataframe)
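A short sketch of how the three structures differ in practice, using the values from the slide. The shorter names (`family`, `m`, `l`) are illustrative and not from the talk.

```r
# Illustrative names only; the values are the ones shown on the slide.
name <- c("Sat", "Veera", "Minu", "Dummy")
age <- c(5, 6, 40, 38)
weight <- c(72, 70, 12, 40)

# Data frame: columns may have different types (character and numeric here).
family <- data.frame(name, age, weight, stringsAsFactors = FALSE)

# Matrix: every element has the same type, so only the numeric columns are used.
m <- as.matrix(family[, c("age", "weight")])

# List: each data frame column becomes one list element.
l <- as.list(family)

str(family)
print(dim(m))     # number of rows, number of columns
print(names(l))
```

Passing the whole data frame to `as.matrix` would also work, but the character column would force every element to become a string.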
Case Study: Activity Recognition
• Activity Recognition: Detect walking,
driving, biking, climbing stairs,
standing, etc.
[Figure: example of accelerometer data from a smartphone's accelerometer sensor]
[Ref] Gary M. Weiss and Jeffrey W. Lockhart, Fordham
University, Bronx, NY
[Ref] Jordan Frank, McGill University
[Ref] Commercial API Providers: Sensor Platforms, Movea, Alohar
Data Analysis - Steps
• Time Series Data: 200 samples (10 sec)
• Feature Extraction: 43 Features
    • Mean for each acc. axis (3)
    • Std. dev. for each acc. axis (3)
    • Avg. abs. diff. from mean for each acc. axis (3)
    • Avg. resultant acc. (1)
    • Histogram (30)
• Classifiers
    • CART: Decision Tree
    • RF: Random Forest
• Classify the Activity
[Ref] Gary M. Weiss and Jeffrey W. Lockhart, Fordham University, Bronx, NY
[Ref] Jordan Frank, McGill University
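The feature extraction step can be sketched in R as follows. This is a rough sketch under stated assumptions, not the pipeline used in the talk: the window is simulated random noise, `extract_features` is a hypothetical helper, the resultant is assumed to be the mean of sqrt(x² + y² + z²) over the window, and the 30 histogram features are left out. The feature names follow the style of the variables that appear in the model output later in the deck (YAVG, YSTANDDEV, YABSOLDEV, RESULTANT).

```r
# Simulated 3-axis accelerometer window: 200 samples (10 s), as on the slide.
# The signal is random noise and purely illustrative.
set.seed(1)
window <- data.frame(x = rnorm(200), y = rnorm(200, mean = 9.8), z = rnorm(200))

# Hypothetical helper: summary features for one window of x/y/z samples.
extract_features <- function(w) {
  axes <- toupper(names(w))
  avg <- sapply(w, mean)                                    # mean per axis (3)
  std <- sapply(w, sd)                                      # std. dev. per axis (3)
  absdev <- sapply(w, function(a) mean(abs(a - mean(a))))   # avg. abs. diff. from mean (3)
  resultant <- mean(sqrt(rowSums(w^2)))                     # avg. resultant acceleration (1)
  names(avg) <- paste0(axes, "AVG")
  names(std) <- paste0(axes, "STANDDEV")
  names(absdev) <- paste0(axes, "ABSOLDEV")
  c(avg, std, absdev, RESULTANT = resultant)
}

features <- extract_features(window)
print(round(features, 3))
```

Each window of raw samples collapses to one row of features, and it is those rows, not the raw time series, that the classifiers are trained on.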
Data Visualization – Activity (Class Variable)
[Ref] Rattle R Data Mining Tool
require(gplots, quietly=TRUE)      # provides barplot2()
require(colorspace, quietly=TRUE)  # provides rainbow_hcl()
# Class counts for the full dataset and for each activity subset.
ds <- rbind(summary(na.omit(crs$dataset[,]$class)),
    summary(na.omit(crs$dataset[,][crs$dataset$class=="Downstairs",]$class)),
    summary(na.omit(crs$dataset[,][crs$dataset$class=="Jogging",]$class)),
    summary(na.omit(crs$dataset[,][crs$dataset$class=="Sitting",]$class)),
    summary(na.omit(crs$dataset[,][crs$dataset$class=="Standing",]$class)),
    summary(na.omit(crs$dataset[,][crs$dataset$class=="Upstairs",]$class)),
    summary(na.omit(crs$dataset[,][crs$dataset$class=="Walking",]$class)))
# Order the classes by overall frequency, most frequent first.
ord <- order(ds[1,], decreasing=TRUE)
bp <- barplot2(ds[,ord], beside=TRUE, ylab="Frequency", xlab="class",
    ylim=c(0, 2497), col=rainbow_hcl(7))
dotchart(ds[nrow(ds):1,ord], col=rev(rainbow_hcl(7)), labels="",
    xlab="Frequency", ylab="class", pch=c(1:6, 19))
Bar Plot
Dot Plot
Data Visualization Example – Variable Yavg.
require(colorspace, quietly=TRUE)  # provides rainbow_hcl()
# YAVG values for the full dataset and for each activity class.
ds <- rbind(data.frame(dat=crs$dataset[,][,"YAVG"], grp="All"),
    data.frame(dat=crs$dataset[,][crs$dataset$class=="Downstairs","YAVG"], grp="Downstairs"),
    data.frame(dat=crs$dataset[,][crs$dataset$class=="Jogging","YAVG"], grp="Jogging"),
    data.frame(dat=crs$dataset[,][crs$dataset$class=="Sitting","YAVG"], grp="Sitting"),
    data.frame(dat=crs$dataset[,][crs$dataset$class=="Standing","YAVG"], grp="Standing"),
    data.frame(dat=crs$dataset[,][crs$dataset$class=="Upstairs","YAVG"], grp="Upstairs"),
    data.frame(dat=crs$dataset[,][crs$dataset$class=="Walking","YAVG"], grp="Walking"))
# Box plot of YAVG per group, with the group means overlaid as points.
bp <- boxplot(formula=dat ~ grp, data=ds, col=rainbow_hcl(7), xlab="class",
    ylab="YAVG", varwidth=TRUE, notch=TRUE)
require(doBy, quietly=TRUE)
points(1:7, summaryBy(dat ~ grp, data=ds, FUN=mean, na.rm=TRUE)$dat.mean, pch=8)
# Histogram of YAVG over the full dataset.
hs <- hist(ds[ds$grp=="All",1], main="", xlab="YAVG", ylab="Frequency",
    col="grey90", ylim=c(0, 2137.72617616154), breaks="fd", border=TRUE)
[Ref] Rattle R Data Mining Tool
• Easy to interpret
• Blue: positive correlation
• Red: negative correlation
Correlation Plot
[Ref] Rattle R Data Mining Tool
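require(ellipse, quietly=TRUE)  # provides plotcorr()
# Pairwise Pearson correlations between the numeric variables.
crs$cor <- cor(crs$dataset[, crs$numeric], use="pairwise", method="pearson")
# Reorder rows and columns by correlation with the first variable.
crs$ord <- order(crs$cor[1,])
crs$cor <- crs$cor[crs$ord, crs$ord]
print(crs$cor)
plotcorr(crs$cor,
    col=colorRampPalette(c("red", "white", "blue"))(11)[5*crs$cor + 6])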
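The colour lookup in the `plotcorr` call may be easier to follow in isolation. The sketch below shows how the expression `5 * r + 6` turns a correlation in [-1, 1] into an index into the 11-colour palette. The example correlation values are made up for illustration.

```r
# Palette as in the call above: 11 colours from red through white to blue.
pal <- colorRampPalette(c("red", "white", "blue"))(11)

# Example correlation values (illustrative, not from the dataset).
r <- c(-1, -0.5, 0, 0.5, 1)

# 5 * r + 6 maps [-1, 1] onto [1, 11]; R truncates fractional indices.
idx <- 5 * r + 6
print(data.frame(r = r, index = floor(idx), colour = pal[idx]))
```

A correlation of -1 picks the first (red) entry, 0 the middle entry, and +1 the last (blue) entry, which matches the colour legend on the slide.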
Data Science R Packages
Category      Function       Library        Description
Cluster       hclust         stats          Hierarchical cluster analysis
              kmeans         stats          K-means clustering
Classifiers   glm            stats          Logistic regression
              rpart          rpart          Recursive partitioning and regression trees
              ksvm           kernlab        Support Vector Machine
              apriori        arules         Rule based classification
Ensemble      ada            ada            Stochastic boosting
              randomForest   randomForest   Random Forests classification and regression
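As a quick illustration of the two clustering functions from the `stats` package, the sketch below runs both on R's built-in `iris` measurements. The dataset and the choice of three clusters are assumptions made for this example and are not part of the talk.

```r
# Built-in iris measurements stand in for real data (an assumption for illustration).
x <- scale(iris[, 1:4])

# Hierarchical clustering on Euclidean distances, cut into 3 groups.
hc <- hclust(dist(x), method = "complete")
hc_groups <- cutree(hc, k = 3)

# K-means with 3 centres; set.seed makes the random starts reproducible.
set.seed(42)
km <- kmeans(x, centers = 3, nstart = 10)

# Cross-tabulate the two cluster assignments.
print(table(hierarchical = hc_groups, kmeans = km$cluster))
```

Scaling the columns first keeps any one measurement from dominating the distance calculation.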
Decision Tree - Visualization
[Ref] Rattle R Data Mining Tool
• Decision Tree Model Results:
n= 3792
1) root 3792 2364 Walking (0.098 0.3 0.057 0.049 0.12 0.38)
  2) YABSOLDEV>=5.095 1097 85 Jogging (0.0055 0.92 0 0 0.031 0.041)
    4) ZAVG>=-4.125 1058 46 Jogging (0.0057 0.96 0 0 0.032 0.0057) *
    5) ZAVG< -4.125 39 0 Walking (0 0 0 0 0 1) *
  3) YABSOLDEV< 5.095 2695 1312 Walking (0.14 0.047 0.08 0.069 0.16 0.51)
    6) YSTANDDEV< 1.675 382 175 Sitting (0 0 0.54 0.44 0 0.016)
Variables actually used in tree construction:
RESULTANT YABSOLDEV YAVG YSTANDDEV ZABSOLDEV ZAVG
Root node error: 2364/3792 = 0.62342
Decision Tree
rpart(formula = class ~ ., data = smartphone_data, method =
"class", parms = list(split = "information"), control =
rpart.control(usesurrogate = 0, maxsurrogate = 0))
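The same call can be tried without the smartphone dataset. The sketch below keeps the arguments shown above but substitutes R's built-in `iris` data for `smartphone_data`, which is not available here; that substitution is an assumption made purely for illustration.

```r
library(rpart)  # recommended package, shipped with standard R installations

# iris replaces smartphone_data, which is not available here (assumption).
fit <- rpart(Species ~ ., data = iris, method = "class",
             parms = list(split = "information"),
             control = rpart.control(usesurrogate = 0, maxsurrogate = 0))

print(fit)                                    # node listing in the same format as above
pred <- predict(fit, iris, type = "class")    # predicted class for each row
train_error <- mean(pred != iris$Species)     # share of misclassified training rows
print(train_error)
```

In the printed listing each node shows the split rule, the number of observations, the number misclassified, the predicted class and the class probabilities; a trailing `*` marks a leaf.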
Random Forest: Ensemble of Trees
[Ref] Rattle R Data Mining Tool
[Figure: Tree 1, Tree 2, …, Tree n; the outputs of the individual trees are combined (Σ) into the Random Forest prediction]
• Random Forest Model Results:
Number of observations used to build the model: 3792
Type of random forest: classification
OOB estimate of error rate: 11.05%
Confusion matrix:
Downstairs Jogging Sitting Standing Upstairs Walking class.error
Downstairs 204 7 0 1 64 97 0.45308311
Jogging 6 1117 0 0 8 7 0.01845343
Sitting 0 0 209 5 1 0 0.02790698
Standing 4 0 0 177 4 0 0.04324324
Upstairs 48 31 1 0 276 97 0.39072848
Walking 20 1 1 1 15 1390 0.02661064
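The per-class errors and the overall OOB error rate follow directly from the counts in the confusion matrix. The sketch below re-enters the counts from the table above and recomputes both; the variable names are illustrative.

```r
classes <- c("Downstairs", "Jogging", "Sitting", "Standing", "Upstairs", "Walking")

# Counts copied from the confusion matrix above (rows: actual, columns: predicted).
cm <- matrix(c(204,    7,   0,   1,  64,   97,
                 6, 1117,   0,   0,   8,    7,
                 0,    0, 209,   5,   1,    0,
                 4,    0,   0, 177,   4,    0,
                48,   31,   1,   0, 276,   97,
                20,    1,   1,   1,  15, 1390),
             nrow = 6, byrow = TRUE,
             dimnames = list(actual = classes, predicted = classes))

# Per-class error: share of each row that falls off the diagonal.
class_error <- 1 - diag(cm) / rowSums(cm)

# Overall error: share of all observations that fall off the diagonal.
oob_error <- 1 - sum(diag(cm)) / sum(cm)

print(round(class_error, 8))
print(round(100 * oob_error, 2))   # 11.05
```

The two stair-climbing classes account for most of the error: Downstairs and Upstairs are frequently confused with each other and with Walking, while the remaining classes are misclassified less than 5% of the time.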
Random Forest Package in R
randomForest(formula = class ~ ., data =
smartphone_data, ntree = 300, mtry = 6, importance =
TRUE, replace = FALSE, na.action = na.roughfix)
• Combining data science with domain knowledge is what yields big insights from data
• The R language provides a platform for rapidly building prototypes and testing ideas
• Getting insights from data takes intense teamwork among the various stakeholders
Summary
• R Project: http://www.r-project.org
• Activity Recognition Dataset: "The Impact of Personalization on Smartphone-Based Activity Recognition", Gary M. Weiss and Jeffrey W. Lockhart, Activity Context Representation: Techniques and Languages, AAAI Technical Report WS-12-05
• "Activity and Gait Recognition with Time-Delay Embeddings", Jordan Frank, AAAI Conference on Artificial Intelligence, 2010
• R wiki: http://rwiki.sciviews.org/doku.php
• R graph gallery: http://addictedtor.free.fr/graphiques/thumbs.php
• Kickstarting R: http://cran.r-project.org/doc/contrib/Lemon-kickstart/
• Rattle – R Data Mining Tool [http://rattle.togaware.com/]
• Sensor Platforms, http://www.sensorplatforms.com/context-aware/
• Movea, http://www.movea.com/
• Alohar, https://www.alohar.com
References

 

India software developers conference 2013 Bangalore

  • 1. Data Science 101: Using R Language to get Big Insights Satnam Singh, Senior Chief Engineer, Samsung Research India – Bangalore [ Twitter - @satnam74s] India Software Developers Conference, Bangalore March 16, 2013
  • 2. Motivation: Using Data to get Business Insights [Figure labels: Data Bases & Clusters (×3); Insights? (×3)]
  • 3. Data Science Programming Languages [Ref. kaggle.com]
      Why R? [Ref. http://www.r-project.org]
      • Popular, free
      • Open source
      • Multi-platform
      • Vectorization
      • Many statistical packages
      • Large support base
      • Object-oriented programming language
  • 4. R Language Basics (simple operations, vector operations, function calls)
      > y <- 21
      > y
      [1] 21
      > z = 233
      > z
      [1] 233
      > y <- c(1,2,3,4)
      > y
      [1] 1 2 3 4
  • 5. R Language: Data Structures Examples (data frame, matrix, list)
      > MyFamilyage <- c(5,6,40,38)
      > MyFamilyName <- c("Sat","Veera","Minu","Dummy")
      > MyFamilyweight <- c(72,70,12,40)
      > MyFamily <- data.frame(MyFamilyName,MyFamilyage,MyFamilyweight)
      > MyMatrix <- as.matrix(MyFamilyage)
      > Mydataframe <- as.data.frame(MyMatrix)
      > MyList <- as.list(Mydataframe)
  • 6. Case Study: Activity Recognition
      • Activity Recognition: Detect walking, driving, biking, climbing stairs, standing, etc.
      [Figure: example of accelerometer data from a smartphone's accelerometer sensor]
      [Ref] Gary M. Weiss and Jeffrey W. Lockhart, Fordham University, Bronx, NY
      [Ref] Jordan Frank, McGill University
      [Ref] Commercial API Providers: Sensor Platforms, Movea, Alohar
  • 7. Data Analysis - Steps
      Time Series Data: 200 samples (10 sec)
      Feature Extraction: 43 Features
        • Mean for each acc. axis (3)
        • Std. dev. for each acc. axis (3)
        • Avg. abs. diff. from mean for each acc. axis (3)
        • Avg. resultant acc. (1)
        • Histogram (30)
      Classifiers: CART (Decision Tree), RF (Random Forest)
      Classify the Activity
      [Ref] Gary M. Weiss and Jeffrey W. Lockhart, Fordham University, Bronx, NY
      [Ref] Jordan Frank, McGill University
  • 8. Data Visualization – Activity (Class Variable) [Ref] Rattle R Data Mining Tool
      require(gplots, quietly=TRUE)      # provides barplot2()
      require(colorspace, quietly=TRUE)  # provides rainbow_hcl()
      # Class counts: all rows first, then one row per activity
      ds <- rbind(summary(na.omit(crs$dataset[,]$class)),
                  summary(na.omit(crs$dataset[,][crs$dataset$class=="Downstairs",]$class)),
                  summary(na.omit(crs$dataset[,][crs$dataset$class=="Jogging",]$class)),
                  summary(na.omit(crs$dataset[,][crs$dataset$class=="Sitting",]$class)),
                  summary(na.omit(crs$dataset[,][crs$dataset$class=="Standing",]$class)),
                  summary(na.omit(crs$dataset[,][crs$dataset$class=="Upstairs",]$class)),
                  summary(na.omit(crs$dataset[,][crs$dataset$class=="Walking",]$class)))
      ord <- order(ds[1,], decreasing=TRUE)
      # Bar plot
      bp <- barplot2(ds[,ord], beside=TRUE, ylab="Frequency", xlab="class", ylim=c(0, 2497), col=rainbow_hcl(7))
      # Dot plot
      dotchart(ds[nrow(ds):1,ord], col=rev(rainbow_hcl(7)), labels="", xlab="Frequency", ylab="class", pch=c(1:6, 19))
  • 9. Data Visualization Example – Variable YAVG [Ref] Rattle R Data Mining Tool
      require(colorspace, quietly=TRUE)  # provides rainbow_hcl()
      # YAVG values for all rows, then per activity
      ds <- rbind(data.frame(dat=crs$dataset[,][,"YAVG"], grp="All"),
                  data.frame(dat=crs$dataset[,][crs$dataset$class=="Downstairs","YAVG"], grp="Downstairs"),
                  data.frame(dat=crs$dataset[,][crs$dataset$class=="Jogging","YAVG"], grp="Jogging"),
                  data.frame(dat=crs$dataset[,][crs$dataset$class=="Sitting","YAVG"], grp="Sitting"),
                  data.frame(dat=crs$dataset[,][crs$dataset$class=="Standing","YAVG"], grp="Standing"),
                  data.frame(dat=crs$dataset[,][crs$dataset$class=="Upstairs","YAVG"], grp="Upstairs"),
                  data.frame(dat=crs$dataset[,][crs$dataset$class=="Walking","YAVG"], grp="Walking"))
      # Box plot per group, with the group means marked
      bp <- boxplot(formula=dat ~ grp, data=ds, col=rainbow_hcl(7), xlab="class", ylab="YAVG", varwidth=TRUE, notch=TRUE)
      require(doBy, quietly=TRUE)
      points(1:7, summaryBy(dat ~ grp, data=ds, FUN=mean, na.rm=TRUE)$dat.mean, pch=8)
      # Histogram of YAVG over all rows
      hs <- hist(ds[ds$grp=="All",1], main="", xlab="YAVG", ylab="Frequency", col="grey90", ylim=c(0, 2137.72617616154), breaks="fd", border=TRUE)
  • 10. Correlation Plot [Ref] Rattle R Data Mining Tool
      • Easy to interpret
      • Blue: positive correlation
      • Red: negative correlation
      require(ellipse, quietly=TRUE)
      crs$cor <- cor(crs$dataset[, crs$numeric], use="pairwise", method="pearson")
      crs$ord <- order(crs$cor[1,])
      crs$cor <- crs$cor[crs$ord, crs$ord]
      print(crs$cor)
      plotcorr(crs$cor, col=colorRampPalette(c("red", "white", "blue"))(11)[5*crs$cor + 6])
  • 11. Data Science R Packages
      Category     Function      Library       Description
      Cluster      hclust        stats         Hierarchical cluster analysis
      Cluster      kmeans        stats         K-means clustering
      Classifiers  glm           stats         Logistic regression
      Classifiers  rpart         rpart         Recursive partitioning and regression trees
      Classifiers  ksvm          kernlab       Support Vector Machine
      Classifiers  apriori       arules        Rule-based classification
      Ensemble     ada           ada           Stochastic boosting
      Ensemble     randomForest  randomForest  Random Forests classification and regression
  • 12. Decision Tree - Visualization [Ref] Rattle R Data Mining Tool
  • 13. Decision Tree
      rpart(formula = class ~ ., data = smartphone_data, method = "class", parms = list(split = "information"), control = rpart.control(usesurrogate = 0, maxsurrogate = 0))
      Decision Tree Model Results:
      n= 3792
      1) root 3792 2364 Walking (0.098 0.3 0.057 0.049 0.12 0.38)
        2) YABSOLDEV>=5.095 1097 85 Jogging (0.0055 0.92 0 0 0.031 0.041)
          4) ZAVG>=-4.125 1058 46 Jogging (0.0057 0.96 0 0 0.032 0.0057) *
          5) ZAVG< -4.125 39 0 Walking (0 0 0 0 0 1) *
        3) YABSOLDEV< 5.095 2695 1312 Walking (0.14 0.047 0.08 0.069 0.16 0.51)
          6) YSTANDDEV< 1.675 382 175 Sitting (0 0 0.54 0.44 0 0.016)
      Variables actually used in tree construction: RESULTANT YABSOLDEV YAVG YSTANDDEV ZABSOLDEV ZAVG
      Root node error: 2364/3792 = 0.62342
  • 14. Random Forest: Ensemble of Trees [Ref] Rattle R Data Mining Tool [Figure: Tree 1, Tree 2, …, Tree n combined (Σ) into the Random Forest]
  • 15. Random Forest Package in R
      randomForest(formula = class ~ ., data = smartphone_data, ntree = 300, mtry = 6, importance = TRUE, replace = FALSE, na.action = na.roughfix)
      Random Forest Model Results:
      Number of observations used to build the model: 3792
      Type of random forest: classification
      OOB estimate of error rate: 11.05%
      Confusion matrix:
                  Downstairs  Jogging  Sitting  Standing  Upstairs  Walking  class.error
      Downstairs         204        7        0         1        64       97   0.45308311
      Jogging              6     1117        0         0         8        7   0.01845343
      Sitting              0        0      209         5         1        0   0.02790698
      Standing             4        0        0       177         4        0   0.04324324
      Upstairs            48       31        1         0       276       97   0.39072848
      Walking             20        1        1         1        15     1390   0.02661064
  • 16. Summary
      • Fusion of data science and domain knowledge enables big insights from the data
      • The R language provides a platform to rapidly build prototypes and test ideas
      • Getting data insights is the outcome of intense team effort among various stakeholders
  • 17. References
      • R Project: http://www.r-project.org
      • Activity Recognition Dataset: Gary M. Weiss and Jeffrey W. Lockhart, "The Impact of Personalization on Smartphone-Based Activity Recognition", Activity Context Representation: Techniques and Languages, AAAI Technical Report WS-12-05
      • Jordan Frank, "Activity and Gait Recognition with Time-Delay Embeddings", AAAI Conference on Artificial Intelligence, 2010
      • R wiki: http://rwiki.sciviews.org/doku.php
      • R graph gallery: http://addictedtor.free.fr/graphiques/thumbs.php
      • Kickstarting R: http://cran.r-project.org/doc/contrib/Lemon-kickstart/
      • Rattle, R Data Mining Tool: http://rattle.togaware.com/
      • Sensor Platforms: http://www.sensorplatforms.com/context-aware/
      • Movea: http://www.movea.com/
      • Alohar: https://www.alohar.com
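
The feature-extraction step on slide 7 can be sketched in base R. This is only an illustration, not the authors' code: the function name `extract_features`, the column names `x`, `y`, `z`, and the choice of 10 histogram bins per axis (inferred from "Histogram (30)" across three axes) are all assumptions. It computes the 40 features itemized on the slide; the slide's total of 43 includes features it does not list, so they are left out here.

```r
# Hypothetical helper: turn one window of accelerometer samples
# (e.g. 200 rows = 10 sec) into the features itemized on slide 7.
# Assumes `window` has numeric columns named x, y and z.
extract_features <- function(window, n_bins = 10) {
  axes <- as.matrix(window[, c("x", "y", "z")])

  avg       <- colMeans(axes)                      # mean per axis (3)
  std       <- apply(axes, 2, sd)                  # std. dev. per axis (3)
  abs_dev   <- colMeans(abs(sweep(axes, 2, avg)))  # avg. abs. diff. from mean (3)
  resultant <- mean(sqrt(rowSums(axes^2)))         # avg. resultant acceleration (1)

  # Histogram: fraction of samples in each of n_bins equal-width bins,
  # computed separately per axis (3 * n_bins values; the bin count is an assumption).
  bins <- as.vector(apply(axes, 2, function(v) {
    tabulate(as.integer(cut(v, breaks = n_bins)), nbins = n_bins) / length(v)
  }))

  c(avg = avg, std = std, abs_dev = abs_dev, resultant = resultant, bins = bins)
}

# Usage with synthetic data standing in for one 10-second window
set.seed(1)
window <- data.frame(x = rnorm(200), y = rnorm(200), z = rnorm(200))
features <- extract_features(window)
length(features)
```

Applying such a function to every window gives one row of features per window, which is the tabular form that the `rpart` and `randomForest` calls on the later slides take as `data`.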

Editor's Notes

  1. The R statistical programming language is a free, open-source package based on the S language developed by Bell Labs. The language is very powerful for writing programs. Many statistical functions are already built in. Contributed packages expand the functionality to cutting-edge research. Since it is a programming language, generating computer code to complete tasks is required. It implements many common statistical procedures. It has a large collection of intermediate tools for data analysis. It has excellent graphical facilities for data analysis and display, either on-screen or on hardcopy. It is a well-developed, simple and effective programming language that includes conditionals, loops, user-defined recursive functions, and input and output facilities. Versions of R exist for Windows, MacOS, Linux and various other Unix flavors. It has a vibrant worldwide community.
  2. The command c() creates a vector, which is then assigned to an object (y on the slide).
  3. Data frame: a table whose columns can contain numeric and string values. Matrix: all columns must contain either numeric or string values; the two cannot be combined. Data frame d is converted into a matrix e. R: f <- as.data.frame(e) converts matrix e into a data frame f.
  4. A smartphone has a tri-axial accelerometer that measures acceleration in all three spatial dimensions. Accuracy is ~75% for the general model and >95% for a personalized model using 10 seconds of training data for each activity. The accelerometer is a low-power sensor, so it can be used for the whole day.
  5. The 'randomForest' package provides the 'randomForest' function. The 'party' package provides conditional random forests. 'randomForest' can be used for classification and regression. It can also be used in unsupervised mode for assessing proximities among data points.
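
The random forest results on slide 15 can be checked by hand in base R. The sketch below re-enters the confusion matrix from the slide (the variable names `conf`, `class_error` and `oob_error` are illustrative) and assumes, as the row-wise class.error column suggests, that rows are the true classes and columns the predicted ones.

```r
# Confusion matrix copied from slide 15 (rows: true class, columns: predicted class)
classes <- c("Downstairs", "Jogging", "Sitting", "Standing", "Upstairs", "Walking")
conf <- matrix(c(204,    7,   0,   1,  64,   97,
                   6, 1117,   0,   0,   8,    7,
                   0,    0, 209,   5,   1,    0,
                   4,    0,   0, 177,   4,    0,
                  48,   31,   1,   0, 276,   97,
                  20,    1,   1,   1,  15, 1390),
               nrow = 6, byrow = TRUE,
               dimnames = list(actual = classes, predicted = classes))

# Per-class error: share of each row that falls off the diagonal
class_error <- 1 - diag(conf) / rowSums(conf)

# Overall error: share of all 3792 observations that fall off the diagonal
oob_error <- 1 - sum(diag(conf)) / sum(conf)

round(class_error, 8)
round(100 * oob_error, 2)  # 11.05, matching the reported OOB estimate
```

The per-class figures show where the model struggles: Downstairs (about 45%) and Upstairs (about 39%) are mostly confused with each other and with Walking, while the other four classes stay under 5% error.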