Revolution Confidential



     Introduction to R
            for
        Data Mining
     2013 Webinar Series


Joseph B. Rickert

February 14, 2013



First Polling Question

• What is your favorite data mining software tool?
  1.   R
  2.   SAS
  3.   MapReduce
  4.   Weka
  5.   Other




My goal for today's webinar is to convince you that:




                       Seriously,
                   it is not difficult
                  to learn enough
                    R to do some
                     serious data
                         mining




R is a serious platform for data mining

Revolution R Enterprise is the platform for serious data mining






A word about Data Mining
   We assume that you know a little
   bit about data mining and this is
      your context for learning R


Data Mining

Applications              Actions         Algorithms
Credit Scoring            Acquire Data    CART
Fraud Detection           Prepare         Random Forests
Ad Optimization           Classify        SVM
Targeted Marketing        Predict         KMeans
Gene Detection            Visualize       Hierarchical clustering
Recommendation systems    Optimize        Ensemble Techniques
Social Networks           Interpret








Getting Orientated

WHAT IS R?


R is:

• The way to do statistical computing
• A full-blown programming language
• The home of nearly every data mining algorithm known to data science
• A vibrant world-wide community

R was written in the early 1990s by
   • Robert Gentleman
   • Ross Ihaka

Since 1997, a core group of ~20 developers has guided the evolution of the language




R is organized into libraries of functions called packages

                              R Package Growth

                              4,332 packages as of 2/13/13

• CRAN R download
   - Base
   - Recommended packages
• User contributed packages




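The split between the CRAN download (base and recommended packages) and user contributed packages can be inspected from the R console. The following is a minimal sketch that uses only functions shipped with R; the counts it prints depend on the local installation, and the `install.packages("rpart")` line is shown only as an illustration of how a contributed package would be fetched.

```r
# List the packages that ship with the CRAN R download.
# installed.packages() returns a character matrix with one row per package.
base_pkgs <- rownames(installed.packages(priority = "base"))
rec_pkgs  <- rownames(installed.packages(priority = "recommended"))

cat("Base packages:", length(base_pkgs), "\n")
cat("Recommended packages:", length(rec_pkgs), "\n")

# A package must be attached before its functions can be called by name.
library(stats)      # stats is a base package and is attached by default
exists("kmeans")    # kmeans() lives in stats

# User contributed packages are installed from CRAN on demand, for example:
# install.packages("rpart")
```

Everything listed under `base_pkgs` is available in any R session without further installation.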
Finding Your Way Around the World of R

• Machine Learning
• Data Mining
• Visualization
• Finding Packages
   - Task Views
   - crantastic.org
• Blogs
   - Revolutions
   - R-Bloggers
   - Quick-R
   - Inside-R
• Getting Help
• Finding R People
   - User Groups worldwide
• Twitter: #rstats








Learning R

THE STRUCTURE OF R FACILITATES LEARNING

Learning R?




Levels of R Skill

Write production grade code                                             R developer


Write an R package                                      R contributor


Write code and algorithms                R programmer


Use R functions                     R user


Use a GUI                      R aware


                                  10                                                        10,000
                                                        Hours of use


                              The Malcolm Gladwell “Outlier” Scale



Basic Machine Learning Functions



              Function       Library        Description
Cluster       hclust         stats          Hierarchical cluster analysis
              kmeans         stats          Kmeans clustering
Classifiers   glm            stats          Logistic Regression
              rpart          rpart          Recursive partitioning and
                                            regression trees
              ksvm           kernlab        Support Vector Machine
              apriori        arules         Rule based classification
Ensemble      ada            ada            Stochastic boosting
              randomForest   randomForest   Random Forests classification and
                                            regression
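Three of the functions in the table (`hclust`, `kmeans`, and `glm`) live in the stats package and can be tried without installing anything. The following is a minimal sketch, not one of the webinar scripts: the built-in `iris` and `mtcars` data sets, the choice of three clusters, and the 0.5 cutoff are all assumptions made for illustration. The other functions in the table need their packages installed from CRAN first.

```r
# Built-in iris data: 150 flowers, 4 numeric measurements, 3 species.
data(iris)
measurements <- iris[, 1:4]

# Kmeans clustering (stats::kmeans) with 3 clusters.
set.seed(42)                                   # kmeans starts from random centers
km <- kmeans(measurements, centers = 3, nstart = 10)
table(cluster = km$cluster, species = iris$Species)

# Hierarchical cluster analysis (stats::hclust) on Euclidean distances.
hc <- hclust(dist(measurements), method = "complete")
hc_groups <- cutree(hc, k = 3)                 # cut the tree into 3 groups

# Logistic regression (stats::glm) on a two-class problem.
# mtcars$am is 0 (automatic) or 1 (manual).
fit <- glm(am ~ wt, data = mtcars, family = "binomial")
pred <- ifelse(predict(fit, type = "response") > 0.5, 1, 0)
mean(pred == mtcars$am)                        # training accuracy
```

Each call follows the same pattern: pass a data frame (and, for models, a formula), get back an object, and inspect it with functions such as `table`, `cutree`, or `predict`.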




Noteworthy Data Mining Packages




     Package   Comment
     caret     Well organized and remarkably complete
               collection of functions to facilitate model
               building for regression and classification
               problems
     rattle    A very intuitive GUI for data mining that
               produces useful R code





                                     Script
                              1      GETTING STARTED .R
                              2      ROLL with RATTLE .R
                              3      IN THE TREES . R
                              4      INTRO to CARET .R
                              5      BIG DATA with RevoScaleR .R
                              6      WORDCLOUD .R


Doing a lot with a little R

TIME TO RUN SOME CODE
The R Scripts are available at:
https://gist.github.com/joseph-rickert/4742529


Second Polling Question

• What are your favorite data mining techniques?
  1. Clustering techniques such as K-means
  2. Single model classifiers such as decision trees,
     or SVMs
  3. Ensemble classifiers such as Random Forests
     or boosting models
  4. Text mining techniques
  5. Other

Third Polling Question
(insert after running script IN THE TREES)

• What kind of data do you analyze?
  1.   Financial data
  2.   Customer data (e.g. for recommendations)
  3.   Website data (e.g. for ads)
  4.   Health Care data
  5.   Other








Working with Big Data
RevoScaleR and Revolution R Enterprise




Too Big for Open Source R




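        # Try to pull the whole XDF file into an in-memory data frame
        mortDF <- rxXdfToDataFrame(mdata,maxRowsByCols=300000000)
        # glm() from open source R needs every row in memory to fit the model
        model <- glm(default ~ .,data=mortDF,family="binomial")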

RevoScaleR brings the power of Big Data to R

Distributed Statistical Algorithms
   Parallel External Memory Algorithms that are distributed among available
   compute resources (cores & computers) independent of platform

Communications Framework
   Abstracted layer for providing communication between compute nodes in a
   cluster (MPI, MapReduce, In-Database)

Data Source API
   API for integrating external data sources (files, databases, HDFS) that
   provides optimized reading of rows and columns in blocks

R Language Interface
   Familiar, high-productivity programming paradigm for R users



RevoScaleR PEMAs
Parallel External Memory Algorithms

[Diagram: an XDF file divided into Block 1, Block 2, ..., Block i, Block i+1, Block i+2.
Read blocks and compute intermediate results in parallel, iterating as necessary
(1st pass, 2nd pass, 3rd pass). Block i, Block i+1 and Block i+2 each produce their
own results, which are combined with the results from the last block.]

R based algorithms
• Work on blocks of data
• Inherently parallel and distributed
• Do not require all data to be in memory at one time
• Can deal with distributed and streaming data
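The block-at-a-time idea can be illustrated in plain R without RevoScaleR. The following is a minimal sketch under stated assumptions, not RevoScaleR's implementation: an in-memory vector stands in for an XDF file, the block size of 10,000 is arbitrary, and the statistic (mean and variance) is chosen because it needs only one pass. Each block is reduced to a small intermediate result, and the intermediate results are then combined.

```r
# Stand-in for a file too large to load at once; here it is just a vector
# that we agree to touch only one block at a time.
set.seed(1)
x <- rnorm(100000, mean = 5, sd = 2)
block_size <- 10000
starts <- seq(1, length(x), by = block_size)

# Step 1: compute an intermediate result for each block.
# Each call sees one block only, so the calls could run in parallel.
block_stats <- lapply(starts, function(s) {
  block <- x[s:min(s + block_size - 1, length(x))]
  c(n = length(block), sum = sum(block), sumsq = sum(block^2))
})

# Step 2: combine the intermediate results into the final answer.
totals <- Reduce(`+`, block_stats)
chunked_mean <- totals[["sum"]] / totals[["n"]]
chunked_var  <- (totals[["sumsq"]] - totals[["n"]] * chunked_mean^2) /
  (totals[["n"]] - 1)

# The blockwise answers agree with the all-in-memory ones.
all.equal(chunked_mean, mean(x))
all.equal(chunked_var, var(x))
```

Iterative algorithms such as logistic regression follow the same pattern but repeat both steps on every pass over the data until the estimates converge.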








More than code, R is a community

WHERE TO GO FROM HERE?


Continuing to Learn R

Resources
• RevoJoe: How to Learn R
• More R Documentation
   - The R Journal
   - Books
   - Reference Card and more
• Classes
   - Coursera
   - Revolution Analytics

Examples
• Thomson Nguyen on the Heritage Health Prize
• Shannon Terry & Ben Ogorek (Nationwide Insurance): A Direct Marketing In-Flight Forecasting System
• Jeffrey Breen: Mining Twitter for Airline Consumer Sentiment
• Joe Rothermich: Alternative Data Sources for Measuring Market Sentiment and Events (Using R)




Some Books








The R Scripts are available at:
https://gist.github.com/joseph-rickert/4742529




                                                                  24

Mais conteúdo relacionado

Destaque

Regression and Classification with R
Regression and Classification with RRegression and Classification with R
Regression and Classification with RYanchang Zhao
 
Introduction to Data Mining with R and Data Import/Export in R
Introduction to Data Mining with R and Data Import/Export in RIntroduction to Data Mining with R and Data Import/Export in R
Introduction to Data Mining with R and Data Import/Export in RYanchang Zhao
 
Data mining tools (R , WEKA, RAPID MINER, ORANGE)
Data mining tools (R , WEKA, RAPID MINER, ORANGE)Data mining tools (R , WEKA, RAPID MINER, ORANGE)
Data mining tools (R , WEKA, RAPID MINER, ORANGE)Krishna Petrochemicals
 
Association Rule Mining with R
Association Rule Mining with RAssociation Rule Mining with R
Association Rule Mining with RYanchang Zhao
 
Ch 1 Intro to Data Mining
Ch 1 Intro to Data MiningCh 1 Intro to Data Mining
Ch 1 Intro to Data MiningSushil Kulkarni
 
Data Warehousing and Data Mining
Data Warehousing and Data MiningData Warehousing and Data Mining
Data Warehousing and Data Miningidnats
 
Data Clustering with R
Data Clustering with RData Clustering with R
Data Clustering with RYanchang Zhao
 
Text Mining with R -- an Analysis of Twitter Data
Text Mining with R -- an Analysis of Twitter DataText Mining with R -- an Analysis of Twitter Data
Text Mining with R -- an Analysis of Twitter DataYanchang Zhao
 
Ontology-driven KDD Process Composition
Ontology-driven KDD Process CompositionOntology-driven KDD Process Composition
Ontology-driven KDD Process CompositionEmanuele Storti
 
Data mining platform
Data mining platformData mining platform
Data mining platformchanson zhang
 
Rstudio in aws 16 9
Rstudio in aws 16 9Rstudio in aws 16 9
Rstudio in aws 16 9Tal Galili
 
Role of Data Cleaning in Data Warehouse
Role of Data Cleaning in Data WarehouseRole of Data Cleaning in Data Warehouse
Role of Data Cleaning in Data WarehouseRamakant Soni
 

Destaque (16)

Regression and Classification with R
Regression and Classification with RRegression and Classification with R
Regression and Classification with R
 
Introduction to Data Mining with R and Data Import/Export in R
Introduction to Data Mining with R and Data Import/Export in RIntroduction to Data Mining with R and Data Import/Export in R
Introduction to Data Mining with R and Data Import/Export in R
 
Data mining tools (R , WEKA, RAPID MINER, ORANGE)
Data mining tools (R , WEKA, RAPID MINER, ORANGE)Data mining tools (R , WEKA, RAPID MINER, ORANGE)
Data mining tools (R , WEKA, RAPID MINER, ORANGE)
 
Data mining
Data miningData mining
Data mining
 
Association Rule Mining with R
Association Rule Mining with RAssociation Rule Mining with R
Association Rule Mining with R
 
Ch 1 Intro to Data Mining
Ch 1 Intro to Data MiningCh 1 Intro to Data Mining
Ch 1 Intro to Data Mining
 
Data Warehousing and Data Mining
Data Warehousing and Data MiningData Warehousing and Data Mining
Data Warehousing and Data Mining
 
Data Clustering with R
Data Clustering with RData Clustering with R
Data Clustering with R
 
Text Mining with R -- an Analysis of Twitter Data
Text Mining with R -- an Analysis of Twitter DataText Mining with R -- an Analysis of Twitter Data
Text Mining with R -- an Analysis of Twitter Data
 
R refcard-data-mining
R refcard-data-miningR refcard-data-mining
R refcard-data-mining
 
Ontology-driven KDD Process Composition
Ontology-driven KDD Process CompositionOntology-driven KDD Process Composition
Ontology-driven KDD Process Composition
 
14.machine learning
14.machine learning14.machine learning
14.machine learning
 
26.docking
26.docking26.docking
26.docking
 
Data mining platform
Data mining platformData mining platform
Data mining platform
 
Rstudio in aws 16 9
Rstudio in aws 16 9Rstudio in aws 16 9
Rstudio in aws 16 9
 
Role of Data Cleaning in Data Warehouse
Role of Data Cleaning in Data WarehouseRole of Data Cleaning in Data Warehouse
Role of Data Cleaning in Data Warehouse
 

Semelhante a Introduction to R for Data Mining (Feb 2013)

Integrate Your Advanced Analytics into BI Apps and MS Office and Multiply The...
Integrate Your Advanced Analytics into BI Apps and MS Office and Multiply The...Integrate Your Advanced Analytics into BI Apps and MS Office and Multiply The...
Integrate Your Advanced Analytics into BI Apps and MS Office and Multiply The...Revolution Analytics
 
Revolution R Enterprise: 100% R and More (14 Mar 2013)
Revolution R Enterprise: 100% R and More (14 Mar 2013)Revolution R Enterprise: 100% R and More (14 Mar 2013)
Revolution R Enterprise: 100% R and More (14 Mar 2013)Revolution Analytics
 
Revolution R Enterprise - 100% R and More
Revolution R Enterprise - 100% R and MoreRevolution R Enterprise - 100% R and More
Revolution R Enterprise - 100% R and MoreRevolution Analytics
 
Basho and Riak at GOTO Stockholm: "Don't Use My Database."
Basho and Riak at GOTO Stockholm:  "Don't Use My Database."Basho and Riak at GOTO Stockholm:  "Don't Use My Database."
Basho and Riak at GOTO Stockholm: "Don't Use My Database."Basho Technologies
 
Scalable Data Analysis in R Webinar Presentation
Scalable Data Analysis in R Webinar PresentationScalable Data Analysis in R Webinar Presentation
Scalable Data Analysis in R Webinar PresentationRevolution Analytics
 
Hadoop World 2011: The Powerful Marriage of R and Hadoop - David Champagne, R...
Hadoop World 2011: The Powerful Marriage of R and Hadoop - David Champagne, R...Hadoop World 2011: The Powerful Marriage of R and Hadoop - David Champagne, R...
Hadoop World 2011: The Powerful Marriage of R and Hadoop - David Champagne, R...Cloudera, Inc.
 
The Powerful Marriage of Hadoop and R (David Champagne)
The Powerful Marriage of Hadoop and R (David Champagne)The Powerful Marriage of Hadoop and R (David Champagne)
The Powerful Marriage of Hadoop and R (David Champagne)Revolution Analytics
 
Revolution Analytics: a 5-minute history
Revolution Analytics: a 5-minute historyRevolution Analytics: a 5-minute history
Revolution Analytics: a 5-minute historyRevolution Analytics
 
NumPy and SciPy for Data Mining and Data Analysis Including iPython, SciKits,...
NumPy and SciPy for Data Mining and Data Analysis Including iPython, SciKits,...NumPy and SciPy for Data Mining and Data Analysis Including iPython, SciKits,...
NumPy and SciPy for Data Mining and Data Analysis Including iPython, SciKits,...Ryan Rosario
 
New Features in Revolution R Enterprise 5.0 to Support Scalable Data Analysis
New Features in Revolution R Enterprise 5.0 to Support Scalable Data AnalysisNew Features in Revolution R Enterprise 5.0 to Support Scalable Data Analysis
New Features in Revolution R Enterprise 5.0 to Support Scalable Data AnalysisRevolution Analytics
 
useR2011 - Edlefsen
useR2011 - EdlefsenuseR2011 - Edlefsen
useR2011 - Edlefsenrusersla
 
Turbo-Charge Your Analytics with IBM Netezza and Revolution R Enterprise: A S...
Turbo-Charge Your Analytics with IBM Netezza and Revolution R Enterprise: A S...Turbo-Charge Your Analytics with IBM Netezza and Revolution R Enterprise: A S...
Turbo-Charge Your Analytics with IBM Netezza and Revolution R Enterprise: A S...Revolution Analytics
 
The Rise of Dynamic Languages
The Rise of Dynamic LanguagesThe Rise of Dynamic Languages
The Rise of Dynamic Languagesgreenwop
 
Real-time Big Data Analytics: From Deployment to Production
Real-time Big Data Analytics: From Deployment to ProductionReal-time Big Data Analytics: From Deployment to Production
Real-time Big Data Analytics: From Deployment to ProductionRevolution Analytics
 
R as supporting tool for analytics and simulation
R as supporting tool for analytics and simulationR as supporting tool for analytics and simulation
R as supporting tool for analytics and simulationAlvaro Gil
 
Big data: analyzing large data sets
Big data: analyzing large data setsBig data: analyzing large data sets
Big data: analyzing large data setsR A Akerkar
 

Semelhante a Introduction to R for Data Mining (Feb 2013) (20)

Integrate Your Advanced Analytics into BI Apps and MS Office and Multiply The...
Integrate Your Advanced Analytics into BI Apps and MS Office and Multiply The...Integrate Your Advanced Analytics into BI Apps and MS Office and Multiply The...
Integrate Your Advanced Analytics into BI Apps and MS Office and Multiply The...
 
Big Data Analysis Starts with R
Big Data Analysis Starts with RBig Data Analysis Starts with R
Big Data Analysis Starts with R
 
Using R with Hadoop
Using R with HadoopUsing R with Hadoop
Using R with Hadoop
 
Revolution R Enterprise: 100% R and More (14 Mar 2013)
Revolution R Enterprise: 100% R and More (14 Mar 2013)Revolution R Enterprise: 100% R and More (14 Mar 2013)
Revolution R Enterprise: 100% R and More (14 Mar 2013)
 
Revolution R - 100% R and More
Revolution R - 100% R and MoreRevolution R - 100% R and More
Revolution R - 100% R and More
 
Revolution R Enterprise - 100% R and More
Revolution R Enterprise - 100% R and MoreRevolution R Enterprise - 100% R and More
Revolution R Enterprise - 100% R and More
 
Basho and Riak at GOTO Stockholm: "Don't Use My Database."
Basho and Riak at GOTO Stockholm:  "Don't Use My Database."Basho and Riak at GOTO Stockholm:  "Don't Use My Database."
Basho and Riak at GOTO Stockholm: "Don't Use My Database."
 
Scalable Data Analysis in R Webinar Presentation
Scalable Data Analysis in R Webinar PresentationScalable Data Analysis in R Webinar Presentation
Scalable Data Analysis in R Webinar Presentation
 
Hadoop World 2011: The Powerful Marriage of R and Hadoop - David Champagne, R...
Hadoop World 2011: The Powerful Marriage of R and Hadoop - David Champagne, R...Hadoop World 2011: The Powerful Marriage of R and Hadoop - David Champagne, R...
Hadoop World 2011: The Powerful Marriage of R and Hadoop - David Champagne, R...
 
The Powerful Marriage of Hadoop and R (David Champagne)
The Powerful Marriage of Hadoop and R (David Champagne)The Powerful Marriage of Hadoop and R (David Champagne)
The Powerful Marriage of Hadoop and R (David Champagne)
 
Revolution Analytics: a 5-minute history
Revolution Analytics: a 5-minute historyRevolution Analytics: a 5-minute history
Revolution Analytics: a 5-minute history
 
NumPy and SciPy for Data Mining and Data Analysis Including iPython, SciKits,...
NumPy and SciPy for Data Mining and Data Analysis Including iPython, SciKits,...NumPy and SciPy for Data Mining and Data Analysis Including iPython, SciKits,...
NumPy and SciPy for Data Mining and Data Analysis Including iPython, SciKits,...
 
Reason To learn & use r
Reason To learn & use rReason To learn & use r
Reason To learn & use r
 
New Features in Revolution R Enterprise 5.0 to Support Scalable Data Analysis
New Features in Revolution R Enterprise 5.0 to Support Scalable Data AnalysisNew Features in Revolution R Enterprise 5.0 to Support Scalable Data Analysis
New Features in Revolution R Enterprise 5.0 to Support Scalable Data Analysis
 
useR2011 - Edlefsen
useR2011 - EdlefsenuseR2011 - Edlefsen
useR2011 - Edlefsen
 
Turbo-Charge Your Analytics with IBM Netezza and Revolution R Enterprise: A S...
Turbo-Charge Your Analytics with IBM Netezza and Revolution R Enterprise: A S...Turbo-Charge Your Analytics with IBM Netezza and Revolution R Enterprise: A S...
Turbo-Charge Your Analytics with IBM Netezza and Revolution R Enterprise: A S...
 
The Rise of Dynamic Languages
The Rise of Dynamic LanguagesThe Rise of Dynamic Languages
The Rise of Dynamic Languages
 
Real-time Big Data Analytics: From Deployment to Production
Real-time Big Data Analytics: From Deployment to ProductionReal-time Big Data Analytics: From Deployment to Production
Real-time Big Data Analytics: From Deployment to Production
 
R as supporting tool for analytics and simulation
R as supporting tool for analytics and simulationR as supporting tool for analytics and simulation
R as supporting tool for analytics and simulation
 
Big data: analyzing large data sets
Big data: analyzing large data setsBig data: analyzing large data sets
Big data: analyzing large data sets
 

Mais de Revolution Analytics

Speeding up R with Parallel Programming in the Cloud
Speeding up R with Parallel Programming in the CloudSpeeding up R with Parallel Programming in the Cloud
Speeding up R with Parallel Programming in the CloudRevolution Analytics
 
Migrating Existing Open Source Machine Learning to Azure
Migrating Existing Open Source Machine Learning to AzureMigrating Existing Open Source Machine Learning to Azure
Migrating Existing Open Source Machine Learning to AzureRevolution Analytics
 
Speed up R with parallel programming in the Cloud
Speed up R with parallel programming in the CloudSpeed up R with parallel programming in the Cloud
Speed up R with parallel programming in the CloudRevolution Analytics
 
Predicting Loan Delinquency at One Million Transactions per Second
Predicting Loan Delinquency at One Million Transactions per SecondPredicting Loan Delinquency at One Million Transactions per Second
Predicting Loan Delinquency at One Million Transactions per SecondRevolution Analytics
 
The Value of Open Source Communities
The Value of Open Source CommunitiesThe Value of Open Source Communities
The Value of Open Source CommunitiesRevolution Analytics
 
Building a scalable data science platform with R
Building a scalable data science platform with RBuilding a scalable data science platform with R
Building a scalable data science platform with RRevolution Analytics
 
The Business Economics and Opportunity of Open Source Data Science
The Business Economics and Opportunity of Open Source Data ScienceThe Business Economics and Opportunity of Open Source Data Science
The Business Economics and Opportunity of Open Source Data ScienceRevolution Analytics
 
Taking R Analytics to SQL and the Cloud
Taking R Analytics to SQL and the CloudTaking R Analytics to SQL and the Cloud
Taking R Analytics to SQL and the CloudRevolution Analytics
 
The Network structure of R packages on CRAN & BioConductor
The Network structure of R packages on CRAN & BioConductorThe Network structure of R packages on CRAN & BioConductor
The Network structure of R packages on CRAN & BioConductorRevolution Analytics
 
The network structure of cran 2015 07-02 final
The network structure of cran 2015 07-02 finalThe network structure of cran 2015 07-02 final
The network structure of cran 2015 07-02 finalRevolution Analytics
 
Simple Reproducibility with the checkpoint package
Simple Reproducibilitywith the checkpoint packageSimple Reproducibilitywith the checkpoint package
Simple Reproducibility with the checkpoint packageRevolution Analytics
 

Mais de Revolution Analytics (20)

Speeding up R with Parallel Programming in the Cloud
Speeding up R with Parallel Programming in the CloudSpeeding up R with Parallel Programming in the Cloud
Speeding up R with Parallel Programming in the Cloud
 
Migrating Existing Open Source Machine Learning to Azure
Migrating Existing Open Source Machine Learning to AzureMigrating Existing Open Source Machine Learning to Azure
Migrating Existing Open Source Machine Learning to Azure
 
R in Minecraft
R in Minecraft R in Minecraft
R in Minecraft
 
The case for R for AI developers
The case for R for AI developersThe case for R for AI developers
The case for R for AI developers
 
Speed up R with parallel programming in the Cloud
Speed up R with parallel programming in the CloudSpeed up R with parallel programming in the Cloud
Speed up R with parallel programming in the Cloud
 
The R Ecosystem
The R EcosystemThe R Ecosystem
The R Ecosystem
 
R Then and Now
R Then and NowR Then and Now
R Then and Now
 
Predicting Loan Delinquency at One Million Transactions per Second
Predicting Loan Delinquency at One Million Transactions per SecondPredicting Loan Delinquency at One Million Transactions per Second
Predicting Loan Delinquency at One Million Transactions per Second
 
Reproducible Data Science with R
Reproducible Data Science with RReproducible Data Science with R
Reproducible Data Science with R
 
The Value of Open Source Communities
The Value of Open Source CommunitiesThe Value of Open Source Communities
The Value of Open Source Communities
 
The R Ecosystem
The R EcosystemThe R Ecosystem
The R Ecosystem
 
R at Microsoft (useR! 2016)
R at Microsoft (useR! 2016)R at Microsoft (useR! 2016)
R at Microsoft (useR! 2016)
 
Building a scalable data science platform with R
Building a scalable data science platform with RBuilding a scalable data science platform with R
Building a scalable data science platform with R
 
R at Microsoft
R at MicrosoftR at Microsoft
R at Microsoft
 
The Business Economics and Opportunity of Open Source Data Science
The Business Economics and Opportunity of Open Source Data ScienceThe Business Economics and Opportunity of Open Source Data Science
The Business Economics and Opportunity of Open Source Data Science
 
Taking R Analytics to SQL and the Cloud
Taking R Analytics to SQL and the CloudTaking R Analytics to SQL and the Cloud
Taking R Analytics to SQL and the Cloud
 
The Network structure of R packages on CRAN & BioConductor
The Network structure of R packages on CRAN & BioConductorThe Network structure of R packages on CRAN & BioConductor
The Network structure of R packages on CRAN & BioConductor
 
The network structure of cran 2015 07-02 final
The network structure of cran 2015 07-02 finalThe network structure of cran 2015 07-02 final
The network structure of cran 2015 07-02 final
 
Simple Reproducibility with the checkpoint package
Simple Reproducibilitywith the checkpoint packageSimple Reproducibilitywith the checkpoint package
Simple Reproducibility with the checkpoint package
 
R at Microsoft
R at MicrosoftR at Microsoft
R at Microsoft
 



Introduction to R for Data Mining (Feb 2013)

  • 1. Revolution Confidential. Introduction to R for Data Mining. 2013 Webinar Series. Joseph B. Rickert, February 14, 2013
  • 2. First Polling Question: What is your favorite data mining software tool? 1. R 2. SAS 3. MapReduce 4. Weka 5. Other
  • 3. My goal for today’s webinar is to convince you that: R is a serious platform for data mining. Revolution R Enterprise is the platform for serious data mining. Seriously, it is not difficult to learn enough R to do some serious data mining.
  • 4. A word about Data Mining: We assume that you know a little bit about data mining and this is your context for learning R.
  • 5. Data Mining. Applications: Credit Scoring, Fraud Detection, Ad Optimization, Targeted Marketing, Gene Detection, Recommendation systems, Social Networks. Actions: Acquire Data, Prepare, Classify, Predict, Visualize, Optimize, Interpret. Algorithms: CART, Random Forests, SVM, KMeans, Hierarchical clustering, Ensemble Techniques.
  • 7. R is: the way to do statistical computing; a full-blown programming language; the home of nearly every data mining algorithm known to data science; a vibrant worldwide community. R was written in the early 1990s by Robert Gentleman and Ross Ihaka. Since 1997 a core group of ~20 developers has guided the evolution of the language.
  • 8. R is organized into libraries of functions called packages. R Package Growth: 4,332 packages as of 2/13/13. CRAN R download: Base; Recommended packages; User contributed packages.
  • 9. Finding Your Way Around the R world of: Machine Learning; Data Mining; Visualization. Finding Packages: Task Views, crantastic.org. Blogs: Revolutions, R-Bloggers, Quick-R, Inside-R. Getting Help. Finding R People: User Groups worldwide; Twitter: #rstats.
  • 10. Learning R: THE STRUCTURE OF R FACILITATES LEARNING
  • 11. Learning R? Levels of R Skill, from 10 to 10,000 hours of use (the Malcolm Gladwell “Outlier” Scale): R aware (use a GUI); R user (use R functions); R programmer (write code and algorithms); R contributor (write an R package); R developer (write production grade code).
  • 12. Basic Machine Learning Functions (function, library, description). Cluster: hclust (stats), hierarchical cluster analysis; kmeans (stats), Kmeans clustering. Classifiers: glm (stats), logistic regression; rpart (rpart), recursive partitioning and regression trees; ksvm (kernlab), support vector machine; apriori (arules), rule based classification. Ensemble: ada (ada), stochastic boosting; randomForest (randomForest), Random Forests classification and regression.
  • 13. Noteworthy Data Mining Packages. caret: well organized and remarkably complete collection of functions to facilitate model building for regression and classification problems. rattle: a very intuitive GUI for data mining that produces useful R code.
  • 14. Doing a lot with a little R: TIME TO RUN SOME CODE. Scripts: 1 GETTING STARTED.R, 2 ROLL with RATTLE.R, 3 IN THE TREES.R, 4 INTRO to CARET.R, 5 BIG DATA with RevoScaleR.R, 6 WORDCLOUD.R. The R Scripts are available at: https://gist.github.com/joseph-rickert/4742529
  • 15. Second Polling Question: What are your favorite data mining techniques? 1. Clustering techniques such as K-means 2. Single model classifiers such as decision trees, or SVMs 3. Ensemble classifiers such as Random Forests or boosting models 4. Text mining techniques 5. Other
  • 16. Third Polling Question (insert after running script IN THE TREES): What kind of data do you analyze? 1. Financial data 2. Customer data (e.g. for recommendations) 3. Website data (e.g. for ads) 4. Health Care data 5. Other
  • 17. Working with Big Data: RevoScaleR and Revolution R Enterprise
  • 18. Too Big for Open Source R: mortDF <- rxXdfToDataFrame(mdata, maxRowsByCols = 300000000); model <- glm(default ~ ., data = mortDF, family = "binomial")
  • 19. RevoScaleR brings the power of Big Data to R. Distributed Statistical Algorithms: Parallel External Memory Algorithms that are distributed among available compute resources (cores & computers) independent of platform. Communications Framework: abstracted layer for providing communication between compute nodes in a cluster (MPI, MapReduce, In-Database). R Language Interface: familiar, high-productivity programming paradigm for R users. Data Source API: API for integrating external data sources (files, databases, HDFS) that provides optimized reading of rows and columns in blocks.
  • 20. RevoScaleR PEMAs: Parallel External Memory Algorithms. R based algorithms; work on blocks of data; inherently parallel and distributed; do not require all data to be in memory at one time; can deal with distributed and streaming data. [Diagram: an XDF file is split into blocks (Block 1, Block 2, ..., Block i, Block i+1, Block i+2); blocks are read and intermediate results computed in parallel, iterating as necessary (1st pass, 2nd pass, 3rd pass), ending with the results from the last block.]
  • 21. More than code, R is a community: WHERE TO GO FROM HERE?
  • 22. Continuing to Learn R. Resources: RevoJoe: How to Learn R; More R Documentation; The R Journal; Books; Reference Card and more; Classes: Coursera, Revolution Analytics. Examples: Thomson Nguyen on the Heritage Health Prize; Shannon Terry & Ben Ogorek (Nationwide Insurance): A Direct Marketing In-Flight Forecasting System; Jeffrey Breen: Mining Twitter for Airline Consumer Sentiment; Joe Rothermich: Alternative Data Sources for Measuring Market Sentiment and Events (Using R).
  • 23. Some Books
  • 24. The R Scripts are available at: https://gist.github.com/joseph-rickert/4742529
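
The block-wise idea on slide 20 (read a block, compute intermediate results, combine them) can be illustrated in base R. The sketch below is not the RevoScaleR API and is not taken from the webinar scripts: the simulated data frame `dat`, the `block_size` of 100, and the helper `block_stats` are all assumptions made for illustration. It fits a linear regression by accumulating the cross-products X'X and X'y one block at a time, so the model step never needs the full data set at once.

```r
# Illustrative only: base R, not the RevoScaleR API.
# Simulated data stands in for a file too large to load at once.
set.seed(42)
n <- 1000
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
dat$y <- 1 + 2 * dat$x1 - 3 * dat$x2 + rnorm(n, sd = 0.1)

# Split row indices into blocks (hypothetical block size).
block_size <- 100
blocks <- split(seq_len(n), ceiling(seq_len(n) / block_size))

# Per-block step: intermediate results X'X and X'y for one block.
block_stats <- function(rows) {
  X <- cbind(1, as.matrix(dat[rows, c("x1", "x2")]))
  y <- dat$y[rows]
  list(xtx = crossprod(X), xty = crossprod(X, y))
}

# Combine step: the intermediate results simply add up across blocks.
partials <- lapply(blocks, block_stats)
xtx <- Reduce(`+`, lapply(partials, `[[`, "xtx"))
xty <- Reduce(`+`, lapply(partials, `[[`, "xty"))

# Solve the normal equations using only the combined results.
beta_blocks <- unname(drop(solve(xtx, xty)))

# Reference fit with all rows in memory, for comparison.
beta_lm <- unname(coef(lm(y ~ x1 + x2, data = dat)))
print(rbind(beta_blocks, beta_lm))
```

Because each block's contribution is independent of the others, the `lapply` over blocks could in principle be run in parallel or over chunks read from disk; a single pass suffices here, whereas iterative models such as logistic regression would repeat the pass until convergence.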