Revolution Confidential




New Advances in High Performance Analytics with R: 'Big Data' Decision Trees and Analysis of Hadoop Data
Presented by:
Sue Ranney
VP Product Development



In today’s webcast:




 High Performance Analytics (HPA) with
  Revolution R Enterprise
 ‘Big Data’ Decision Trees
 Revolution’s HPA with Hadoop Data
 Resources, Q&A




Revolution R Enterprise: What Gets Installed?
   Latest stable version of Open-Source R
   High performance math libraries
   RevoScaleR package that adds:
       High performance ‘big data’ capabilities to R
       Access to a variety of ‘data sources’ (e.g., SAS, SPSS,
        text files, ODBC)
       Ability to compute in a variety of ‘compute contexts’
        (e.g., Windows/Linux workstation/server, Microsoft
        HPC Server cluster, Azure Burst, IBM Platform LSF
        cluster)
       High performance computing capabilities
   Integrated Development Environment based on Visual
    Studio technology (for Windows): the R Productivity
    Environment (RPE)
                      Revolution R Enterprise 5.0 Webinar                     3
High Performance Analytics (HPA) in RevoScaleR




 High Performance Computing + Data
 Full-featured, fast, and scalable analysis
  functions
 Same code works on small and big data, and a
  variety of data sources
 Same code works on a variety of compute
  contexts - a laptop, server, cluster, or the cloud
 Scales approximately linearly with the number
  of observations – without increasing memory
  requirements

RevoScaleR: HPA Algorithms




 Descriptive statistics (rxSummary)
 Tables and cubes (rxCube, rxCrossTabs)
 Correlations/covariances (rxCovCor, rxCor,
  rxCov, rxSSCP)
 K-means clustering (rxKmeans)
 Linear regressions (rxLinMod)
 Logistic regressions (rxLogit)
 Generalized Linear Models (rxGlm)
 Predictions (scoring) (rxPredict)
 Decision Trees (rxDTree) NEW!

Decision Trees




   Relatively easy-to-interpret models
   Widely used in a variety of disciplines. For example,
      Predicting which patient characteristics are associated with
       high risk of, for example, heart attack.
      Deciding whether or not to offer a loan to an individual
       based on individual characteristics.
      Predicting the rate of return of various investment
       strategies
      Retail target marketing
   Can handle multi-factor response easily
   Useful in identifying important interactions


Decision Tree Types




   Classification tree: predict what ‘class’ or
    ‘group’ an observation belongs in
    (dependent variable is a factor) for each
    terminal node or leaf
   Regression tree: predict average value of
    dependent variable for each terminal node
    or leaf
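
The difference between the two tree types comes down to how a leaf turns its observations into a prediction. A minimal Python sketch, not RevoScaleR code; the function names and toy data are hypothetical:

```python
from collections import Counter
from statistics import mean

def classification_leaf(labels):
    """Classification tree: predict the most common class in the leaf."""
    return Counter(labels).most_common(1)[0][0]

def regression_leaf(values):
    """Regression tree: predict the average response in the leaf."""
    return mean(values)

# Toy leaf contents (made up for illustration)
leaf_labels = ["Email", "Phone", "Email", "Mail", "Email"]
leaf_values = [0.0, 1.0, 0.0, 0.0, 1.0]

print(classification_leaf(leaf_labels))
print(regression_leaf(leaf_values))
```

With a 0/1 response, the regression leaf's average is simply the proportion of ones among the observations in that leaf.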




Simple Example: Marketing Response




 Data set containing the following information:
  Response: Was response to a phone call, email, or
   mailing?
  Age
  Income
  Marital status
  Attended college?




Simple Example: Specifying the Model




 treeOut <- rxDTree(response ~ age + income + college + marital,
                    data = rdata)
 where rdata is the name of the data set




Simple Example: Basic Output




  Each line gives the node number, the split, the number of
   observations in the node, the number that do not match the
   predicted y value, the predicted y value, and the y
   probabilities. An asterisk (*) marks a terminal node.

 1) root 10000 4069 Email (0.33260000 0.59310000 0.07430000)
    2) college=No College 5074 2378 Phone (0.53133622 0.38943634 0.07922743)
       4) age>=39.5 2518 330 Phone (0.86894361 0.00000000 0.13105639)
          8) age< 64.5 2256 77 Phone (0.96586879 0.00000000 0.03413121) *
          9) age>=64.5 262 9 Mail (0.03435115 0.00000000 0.96564885) *
       5) age< 39.5 2556 580 Email (0.19874804 0.77308294 0.02816901)
         10) marital=Single 835 371 Phone (0.55568862 0.40958084 0.03473054)
           20) income>=29.5 472 14 Phone (0.97033898 0.00000000 0.02966102) *
           21) income< 29.5 363 21 Email (0.01652893 0.94214876 0.04132231) *
         11) marital=Married 1721 87 Email (0.02556653 0.94944800 0.02498547) *
    3) college=College 4926 971 Email (0.12789281 0.80288266 0.06922452) …
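
One way to check how the columns relate is to parse a few node lines and recompute the second count from the probabilities. The Python sketch below is only an illustration: the class order (Phone, Email, Mail) is an assumption inferred from the printed probabilities, and `parse_node` and `implied_loss` are hypothetical helper names, not part of RevoScaleR.

```python
import re

# Assumed class order, inferred from the printed probabilities.
CLASSES = ("Phone", "Email", "Mail")

NODE_RE = re.compile(
    r"^\s*(\d+)\)\s+(.+?)\s+(\d+)\s+(\d+)\s+([A-Za-z]+)\s*\(([^)]*)\)\s*(\*?)"
)

def parse_node(line):
    """Split one printed node line into its fields."""
    m = NODE_RE.match(line)
    if m is None:
        raise ValueError("not a node line: " + line)
    node, split, n, loss, yval, probs, star = m.groups()
    return {
        "node": int(node),
        "split": split,
        "n": int(n),
        "loss": int(loss),
        "yval": yval,
        "probs": dict(zip(CLASSES, (float(p) for p in probs.split()))),
        "leaf": star == "*",
    }

def implied_loss(node):
    """Observations outside the predicted class: n * (1 - P(predicted class))."""
    return round(node["n"] * (1.0 - node["probs"][node["yval"]]))

lines = [
    "1) root 10000 4069 Email (0.33260000 0.59310000 0.07430000)",
    "2) college=No College 5074 2378 Phone (0.53133622 0.38943634 0.07922743)",
    "8) age< 64.5 2256 77 Phone (0.96586879 0.00000000 0.03413121) *",
]
nodes = [parse_node(l) for l in lines]
for nd in nodes:
    # Printed count and recomputed count should agree
    print(nd["node"], nd["split"], nd["n"], nd["loss"], implied_loss(nd))
```

For the root node, 10000 × (1 − 0.5931) = 4069, which matches the second count on that line: it counts the observations that fall outside the predicted class.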


Simple Example: Visual Representation




 [Figure: tree diagram of the fitted model]
 Root
   No College
     Age >= 40
       Age < 65: Phone
       Age >= 65: Mail
     Age < 40
       Single
         Income >= 30: Phone
         Income < 30: Email
       Married: Email
   College
     Age < 65
       Single
         Age < 40
           Income >= 30: Phone
           Income < 30: Email
         Age >= 40: Email
       Married: Email
     Age >= 65: Mail




Scaling HPA with RevoScaleR




 RevoScaleR functions can read from data sets on disk in
  chunks, so you can increase the number of observations in
  the data set beyond what can be analyzed in memory all at
  once
 RevoScaleR analysis functions process chunks of data in
  parallel, taking greater advantage of your computing
  resources (Parallel External Memory Algorithms)
    Multiple cores on a desktop/server
    Cluster/grids have added advantage of more hard drives
     for storing & accessing data
       Windows HPC Server Cluster
       “Burst” computations to Azure in the cloud
       IBM Platform LSF Grid
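
The chunking idea above can be sketched in a few lines: each chunk is reduced to a small set of summary statistics, and only those summaries are combined, so memory use does not grow with the number of observations. This is a minimal Python illustration of the general external-memory pattern, not RevoScaleR's implementation; `read_chunks` stands in for reading blocks of rows from disk, and all names are hypothetical.

```python
def read_chunks(values, chunk_size):
    """Stand-in for reading a data set from disk one block of rows at a time."""
    for start in range(0, len(values), chunk_size):
        yield values[start:start + chunk_size]

def chunk_stats(chunk):
    """Summary of one chunk: count, sum, and sum of squares."""
    return len(chunk), sum(chunk), sum(x * x for x in chunk)

def combine(a, b):
    """Merge two summaries; order does not matter, so chunks can run in parallel."""
    return tuple(x + y for x, y in zip(a, b))

def chunked_mean_var(chunks):
    """Mean and sample variance from per-chunk summaries only."""
    total = (0, 0.0, 0.0)
    for chunk in chunks:
        total = combine(total, chunk_stats(chunk))
    n, s, ss = total
    mean = s / n
    # Simple textbook formula; fine for a sketch, not tuned for numerical stability
    var = (ss - n * mean * mean) / (n - 1)
    return n, mean, var

data = [float(x) for x in range(1, 11)]
n, mean, var = chunked_mean_var(read_chunks(data, chunk_size=3))
print(n, mean, var)
```

Because `combine` is a plain element-wise sum, the per-chunk summaries could be computed on different cores or nodes and merged in any order.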

The ‘Big Data’ Decision Tree Algorithm




 Classical algorithms for building a decision tree
  sort all continuous variables in order to decide
  where to split the data.
 This sorting step becomes time and memory
  prohibitive when dealing with large data.
 rxDTree bins the data rather than sorting,
  computing histograms to create empirical
  distribution functions of the data
 rxDTree partitions the data horizontally, processing
  in parallel different sets of observations
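
The binning idea can be illustrated with a small Python sketch: each chunk of rows is reduced to a per-bin histogram, the histograms are added together, and candidate splits are evaluated only at bin boundaries, so no sort of the raw data is needed. This shows the general technique under simplifying assumptions (equal-width bins, one numeric predictor, a regression-style squared-error criterion); it is not rxDTree's actual implementation, and all names are hypothetical.

```python
def bin_index(x, lo, hi, num_bins):
    """Map x in [lo, hi] to one of num_bins equal-width bins."""
    if x >= hi:
        return num_bins - 1
    return int((x - lo) * num_bins / (hi - lo))

def chunk_histogram(xs, ys, lo, hi, num_bins):
    """Per-bin observation count and sum of the response for one chunk."""
    counts = [0] * num_bins
    sums = [0.0] * num_bins
    for x, y in zip(xs, ys):
        b = bin_index(x, lo, hi, num_bins)
        counts[b] += 1
        sums[b] += y
    return counts, sums

def merge(h1, h2):
    """Histograms from different chunks combine by simple addition."""
    return ([a + b for a, b in zip(h1[0], h2[0])],
            [a + b for a, b in zip(h1[1], h2[1])])

def best_split(hist, lo, hi):
    """Scan bin boundaries for the cut with the largest squared-error reduction."""
    counts, sums = hist
    num_bins = len(counts)
    n_total, s_total = sum(counts), sum(sums)
    best_cut, best_gain = None, 0.0
    n_left, s_left = 0, 0.0
    for b in range(num_bins - 1):
        n_left += counts[b]
        s_left += sums[b]
        n_right = n_total - n_left
        if n_left == 0 or n_right == 0:
            continue
        s_right = s_total - s_left
        gain = s_left ** 2 / n_left + s_right ** 2 / n_right - s_total ** 2 / n_total
        if gain > best_gain:
            best_cut = lo + (b + 1) * (hi - lo) / num_bins
            best_gain = gain
    return best_cut, best_gain

# Toy data: response is 1 for ages 40 and over, 0 otherwise
ages = list(range(20, 70))
resp = [1.0 if a >= 40 else 0.0 for a in ages]

# Two "chunks" of rows, processed independently and then merged
h1 = chunk_histogram(ages[:25], resp[:25], lo=20, hi=70, num_bins=10)
h2 = chunk_histogram(ages[25:], resp[25:], lo=20, hi=70, num_bins=10)
hist = merge(h1, h2)

cut, gain = best_split(hist, lo=20, hi=70)
print(cut, gain)
```

The cost of the split search depends on the number of bins rather than the number of rows, which is why fewer bins trades some precision in the cut point for speed.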
Useful rxDTree Arguments for Big Data
 cp: complexity parameter. Increasing cp will
  decrease the number of splits attempted
 maxDepth: the maximum depth of any tree
  node. The computations take much longer at
  greater depth, so lowering maxDepth can
  greatly speed up computation time.
 maxNumBins: the maximum number of bins
  to use to cut numeric data. Decreasing
  maxNumBins will speed up computation
  time.
‘Big Data’ Example



CDC Report in Jan. 2012




The U.S. Birth Data: 1985 - 2009




 Public-use data sets containing information on
  all births in the United States for each year from
  1985 to 2009 are available to download:
  http://www.cdc.gov/nchs/data_access/Vitalstatsonline.htm
 “These natality files are gigantic; they’re
  approximately 3.1 GB uncompressed. That’s a
  little larger than R can easily process” – Joseph
  Adler, R in a Nutshell
 I’ve imported key variables from each year into
  a single .xdf file with over 100 million
  observations.
Regression Tree: Multiple Births
Call:
rxDTree(formula = IsMultiple ~ DadAgeR8
  + MAGER + FRACEREC + FHISP_REC +
  MRACEREC + MHISP_REC + DOB_YY,
  data = birthAllC,
  maxDepth = 6, cp = 1e-05,
  blocksPerRead = 10, verbose = 1)
File:
  C:RevolutionDataCDCBirthUS.xdf
Number of valid observations: 100672041
Number of missing observations: 0

Leaves with Lowest Percent of Multiple Births




   Leaf                                                   Multiple births
   Mom is not black and under the age of 20               1.3%
   Mom is Asian or Pacific Islander (and not Hispanic)    1.6%
    and is between 22 and 28 years of age; the birth
    is before 1997
   Mom is black and under the age of 18                   1.7%


Leaves with Highest Percent of Multiple Births




   Leaf                                                   Multiple births
   Mom is over 47 years old and the birth is after 1996   38.6%
   Mom is white, non-Hispanic, is between 45 and 47       28.1%
    years old, and the birth is after 1996
   Mom is Hispanic, is between 45 and 47 years old,       15.5%
    and the birth is after 1996


Poll Question
        Are you using Hadoop?
RevoScaleR with Hadoop Data Files NEW




 The Hadoop Distributed File System (HDFS)
   is highly fault-tolerant and

   is designed to be deployed on low-cost
    hardware.

 RevoScaleR supports accessing data in the
  HDFS file system for import or for direct
  analysis

RevoScaleR Data Sources




 Data Sources can be used for import or directly for
  analysis
    External: delimited text, fixed format text, SAS, SPSS,
     ODBC connections
    Provided with RevoScaleR: efficient .xdf file format

 Data Sources contain information about their file
  system
    Delimited text and .xdf data sources can both be used
     with the HDFS file system

 Data sources are used as input to HPA functions

An Example Using Hadoop Data




 Hadoop cluster in our office
   Five nodes of commodity hardware
   Red Hat Enterprise Linux (RHEL) operating system
   Cloudera’s Hadoop (CDH3)
   Also has IBM Platform LSF workload management
    system installed (not required to use HDFS data)
 My colleague, Dawn Kinsey, recorded a data
  analysis session
   22 comma-delimited files stored in HDFS
   Contain information on U.S. flight arrivals, 1987 – 2008

Steps in Analysis




 Set up a ‘file system’ object and a ‘data source’
  object
 Explore the HDFS airline data for the year 2000
  directly
 Extract variables of interest from all the files into an
  .xdf file in the native file system
 Use R’s great plotting capabilities on summary
  information
 Perform a big logistic regression on an .xdf file
  stored in HDFS

Poll Question
     What features of Revolution R
   Enterprise 6.1 are most interesting
                 to you?
Thank You!



 Download slides, replay from today’s webinar
   http://bit.ly/QJfR4A
 Learn more about Revolution R Enterprise
    Overview: revolutionanalytics.com/products
    New feature videos:
     http://www.revolutionanalytics.com/products/new-features.php

 Contact Revolution Analytics
    http://bit.ly/hey-revo


  November 29: Real-Time Big Data Analytics: from Deployment
                        to Production
          David Smith, VP Marketing and Community, Revolution Analytics

        www.revolutionanalytics.com/news-events/free-webinars





The leading commercial provider of software and support for the
          popular open source R statistics language.



                 www.revolutionanalytics.com
                     +1 (650) 646 9545
                   Twitter: @RevolutionR




Mais conteúdo relacionado

Mais procurados

Is Revolution R Enterprise Faster than SAS? Benchmarking Results Revealed
Is Revolution R Enterprise Faster than SAS? Benchmarking Results RevealedIs Revolution R Enterprise Faster than SAS? Benchmarking Results Revealed
Is Revolution R Enterprise Faster than SAS? Benchmarking Results RevealedRevolution Analytics
 
Introduction to R for Data Mining (Feb 2013)
Introduction to R for Data Mining (Feb 2013)Introduction to R for Data Mining (Feb 2013)
Introduction to R for Data Mining (Feb 2013)Revolution Analytics
 
Accelerating R analytics with Spark and Microsoft R Server for Hadoop
Accelerating R analytics with Spark and  Microsoft R Server  for HadoopAccelerating R analytics with Spark and  Microsoft R Server  for Hadoop
Accelerating R analytics with Spark and Microsoft R Server for HadoopWilly Marroquin (WillyDevNET)
 
Introduction to Microsoft R Services
Introduction to Microsoft R ServicesIntroduction to Microsoft R Services
Introduction to Microsoft R ServicesGregg Barrett
 
Taking R Analytics to SQL and the Cloud
Taking R Analytics to SQL and the CloudTaking R Analytics to SQL and the Cloud
Taking R Analytics to SQL and the CloudRevolution Analytics
 
Intro to R for SAS and SPSS User Webinar
Intro to R for SAS and SPSS User WebinarIntro to R for SAS and SPSS User Webinar
Intro to R for SAS and SPSS User WebinarRevolution Analytics
 
Performance and Scale Options for R with Hadoop: A comparison of potential ar...
Performance and Scale Options for R with Hadoop: A comparison of potential ar...Performance and Scale Options for R with Hadoop: A comparison of potential ar...
Performance and Scale Options for R with Hadoop: A comparison of potential ar...Revolution Analytics
 
R and Big Data using Revolution R Enterprise with Hadoop
R and Big Data using Revolution R Enterprise with HadoopR and Big Data using Revolution R Enterprise with Hadoop
R and Big Data using Revolution R Enterprise with HadoopRevolution Analytics
 
Predictive Analytics with Hadoop
Predictive Analytics with HadoopPredictive Analytics with Hadoop
Predictive Analytics with HadoopDataWorks Summit
 
Revolution R Enterprise - Portland R User Group, November 2013
Revolution R Enterprise - Portland R User Group, November 2013Revolution R Enterprise - Portland R User Group, November 2013
Revolution R Enterprise - Portland R User Group, November 2013Revolution Analytics
 
Batter Up! Advanced Sports Analytics with R and Storm
Batter Up! Advanced Sports Analytics with R and StormBatter Up! Advanced Sports Analytics with R and Storm
Batter Up! Advanced Sports Analytics with R and StormRevolution Analytics
 
Quick and Dirty: Scaling Out Predictive Models Using Revolution Analytics on ...
Quick and Dirty: Scaling Out Predictive Models Using Revolution Analytics on ...Quick and Dirty: Scaling Out Predictive Models Using Revolution Analytics on ...
Quick and Dirty: Scaling Out Predictive Models Using Revolution Analytics on ...Revolution Analytics
 
Microsoft R Server for Data Sciencea
Microsoft R Server for Data ScienceaMicrosoft R Server for Data Sciencea
Microsoft R Server for Data ScienceaData Science Thailand
 
The Future of Data Science
The Future of Data ScienceThe Future of Data Science
The Future of Data ScienceDataWorks Summit
 
DeployR: Revolution R Enterprise with Business Intelligence Applications
DeployR: Revolution R Enterprise with Business Intelligence ApplicationsDeployR: Revolution R Enterprise with Business Intelligence Applications
DeployR: Revolution R Enterprise with Business Intelligence ApplicationsRevolution Analytics
 
The network structure of cran 2015 07-02 final
The network structure of cran 2015 07-02 finalThe network structure of cran 2015 07-02 final
The network structure of cran 2015 07-02 finalRevolution Analytics
 
Big Data Predictive Analytics with Revolution R Enterprise (Gartner BI Summit...
Big Data Predictive Analytics with Revolution R Enterprise (Gartner BI Summit...Big Data Predictive Analytics with Revolution R Enterprise (Gartner BI Summit...
Big Data Predictive Analytics with Revolution R Enterprise (Gartner BI Summit...Revolution Analytics
 

Mais procurados (20)

Big Data Analysis Starts with R
Big Data Analysis Starts with RBig Data Analysis Starts with R
Big Data Analysis Starts with R
 
Is Revolution R Enterprise Faster than SAS? Benchmarking Results Revealed
Is Revolution R Enterprise Faster than SAS? Benchmarking Results RevealedIs Revolution R Enterprise Faster than SAS? Benchmarking Results Revealed
Is Revolution R Enterprise Faster than SAS? Benchmarking Results Revealed
 
Introduction to R for Data Mining (Feb 2013)
Introduction to R for Data Mining (Feb 2013)Introduction to R for Data Mining (Feb 2013)
Introduction to R for Data Mining (Feb 2013)
 
Accelerating R analytics with Spark and Microsoft R Server for Hadoop
Accelerating R analytics with Spark and  Microsoft R Server  for HadoopAccelerating R analytics with Spark and  Microsoft R Server  for Hadoop
Accelerating R analytics with Spark and Microsoft R Server for Hadoop
 
Introduction to Microsoft R Services
Introduction to Microsoft R ServicesIntroduction to Microsoft R Services
Introduction to Microsoft R Services
 
Taking R Analytics to SQL and the Cloud
Taking R Analytics to SQL and the CloudTaking R Analytics to SQL and the Cloud
Taking R Analytics to SQL and the Cloud
 
R at Microsoft (useR! 2016)
R at Microsoft (useR! 2016)R at Microsoft (useR! 2016)
R at Microsoft (useR! 2016)
 
Intro to R for SAS and SPSS User Webinar
Intro to R for SAS and SPSS User WebinarIntro to R for SAS and SPSS User Webinar
Intro to R for SAS and SPSS User Webinar
 
Performance and Scale Options for R with Hadoop: A comparison of potential ar...
Performance and Scale Options for R with Hadoop: A comparison of potential ar...Performance and Scale Options for R with Hadoop: A comparison of potential ar...
Performance and Scale Options for R with Hadoop: A comparison of potential ar...
 
R and Big Data using Revolution R Enterprise with Hadoop
R and Big Data using Revolution R Enterprise with HadoopR and Big Data using Revolution R Enterprise with Hadoop
R and Big Data using Revolution R Enterprise with Hadoop
 
Predictive Analytics with Hadoop
Predictive Analytics with HadoopPredictive Analytics with Hadoop
Predictive Analytics with Hadoop
 
Revolution R Enterprise - Portland R User Group, November 2013
Revolution R Enterprise - Portland R User Group, November 2013Revolution R Enterprise - Portland R User Group, November 2013
Revolution R Enterprise - Portland R User Group, November 2013
 
Batter Up! Advanced Sports Analytics with R and Storm
Batter Up! Advanced Sports Analytics with R and StormBatter Up! Advanced Sports Analytics with R and Storm
Batter Up! Advanced Sports Analytics with R and Storm
 
Quick and Dirty: Scaling Out Predictive Models Using Revolution Analytics on ...
Quick and Dirty: Scaling Out Predictive Models Using Revolution Analytics on ...Quick and Dirty: Scaling Out Predictive Models Using Revolution Analytics on ...
Quick and Dirty: Scaling Out Predictive Models Using Revolution Analytics on ...
 
Microsoft R Server for Data Sciencea
Microsoft R Server for Data ScienceaMicrosoft R Server for Data Sciencea
Microsoft R Server for Data Sciencea
 
The Future of Data Science
The Future of Data ScienceThe Future of Data Science
The Future of Data Science
 
DeployR: Revolution R Enterprise with Business Intelligence Applications
DeployR: Revolution R Enterprise with Business Intelligence ApplicationsDeployR: Revolution R Enterprise with Business Intelligence Applications
DeployR: Revolution R Enterprise with Business Intelligence Applications
 
The network structure of cran 2015 07-02 final
The network structure of cran 2015 07-02 finalThe network structure of cran 2015 07-02 final
The network structure of cran 2015 07-02 final
 
Big Data Predictive Analytics with Revolution R Enterprise (Gartner BI Summit...
Big Data Predictive Analytics with Revolution R Enterprise (Gartner BI Summit...Big Data Predictive Analytics with Revolution R Enterprise (Gartner BI Summit...
Big Data Predictive Analytics with Revolution R Enterprise (Gartner BI Summit...
 
Revolution R - 100% R and More
Revolution R - 100% R and MoreRevolution R - 100% R and More
Revolution R - 100% R and More
 

Destaque

Distributed Decision Tree Learning for Mining Big Data Streams
Distributed Decision Tree Learning for Mining Big Data StreamsDistributed Decision Tree Learning for Mining Big Data Streams
Distributed Decision Tree Learning for Mining Big Data StreamsArinto Murdopo
 
Distributed Decision Tree Learning for Mining Big Data Streams
Distributed Decision Tree Learning for Mining Big Data StreamsDistributed Decision Tree Learning for Mining Big Data Streams
Distributed Decision Tree Learning for Mining Big Data StreamsArinto Murdopo
 
Introduction to Big Data/Machine Learning
Introduction to Big Data/Machine LearningIntroduction to Big Data/Machine Learning
Introduction to Big Data/Machine LearningLars Marius Garshol
 
The 2012 Future of Open Source Survey Results
The 2012 Future of Open Source Survey ResultsThe 2012 Future of Open Source Survey Results
The 2012 Future of Open Source Survey ResultsBlack Duck by Synopsys
 
[시즌2, week3] R Basic
[시즌2, week3] R Basic[시즌2, week3] R Basic
[시즌2, week3] R Basicneuroassociates
 
Real-time Big Data Analytics: From Deployment to Production
Real-time Big Data Analytics: From Deployment to ProductionReal-time Big Data Analytics: From Deployment to Production
Real-time Big Data Analytics: From Deployment to ProductionRevolution Analytics
 
Distributed Data Analysis with Hadoop and R - OSCON 2011
Distributed Data Analysis with Hadoop and R - OSCON 2011Distributed Data Analysis with Hadoop and R - OSCON 2011
Distributed Data Analysis with Hadoop and R - OSCON 2011Jonathan Seidman
 
Data Mining. Classification
Data Mining. ClassificationData Mining. Classification
Data Mining. ClassificationSSA KPI
 
[week11] R_ggmap, leaflet
[week11] R_ggmap, leaflet[week11] R_ggmap, leaflet
[week11] R_ggmap, leafletneuroassociates
 
PCAP Graphs for Cybersecurity and System Tuning
PCAP Graphs for Cybersecurity and System TuningPCAP Graphs for Cybersecurity and System Tuning
PCAP Graphs for Cybersecurity and System TuningDr. Mirko Kämpf
 
04 Classification in Data Mining
04 Classification in Data Mining04 Classification in Data Mining
04 Classification in Data MiningValerii Klymchuk
 
Log analysis with Hadoop in livedoor 2013
Log analysis with Hadoop in livedoor 2013Log analysis with Hadoop in livedoor 2013
Log analysis with Hadoop in livedoor 2013SATOSHI TAGOMORI
 
Distributed Data Analysis with Hadoop and R - Strangeloop 2011
Distributed Data Analysis with Hadoop and R - Strangeloop 2011Distributed Data Analysis with Hadoop and R - Strangeloop 2011
Distributed Data Analysis with Hadoop and R - Strangeloop 2011Jonathan Seidman
 
Data mining technique (decision tree)
Data mining technique (decision tree)Data mining technique (decision tree)
Data mining technique (decision tree)Shweta Ghate
 
HW09 Social network analysis with Hadoop
HW09 Social network analysis with HadoopHW09 Social network analysis with Hadoop
HW09 Social network analysis with HadoopCloudera, Inc.
 
Data Analysis with Hadoop and Hive, ChicagoDB 2/21/2011
Data Analysis with Hadoop and Hive, ChicagoDB 2/21/2011Data Analysis with Hadoop and Hive, ChicagoDB 2/21/2011
Data Analysis with Hadoop and Hive, ChicagoDB 2/21/2011Jonathan Seidman
 
Data mining - Classification - arbres de décision
Data mining - Classification - arbres de décisionData mining - Classification - arbres de décision
Data mining - Classification - arbres de décisionMohamed Heny SELMI
 

Destaque (20)

Distributed Decision Tree Learning for Mining Big Data Streams
Distributed Decision Tree Learning for Mining Big Data StreamsDistributed Decision Tree Learning for Mining Big Data Streams
Distributed Decision Tree Learning for Mining Big Data Streams
 
Distributed Decision Tree Learning for Mining Big Data Streams
Distributed Decision Tree Learning for Mining Big Data StreamsDistributed Decision Tree Learning for Mining Big Data Streams
Distributed Decision Tree Learning for Mining Big Data Streams
 
Decision trees in hadoop
Decision trees in hadoopDecision trees in hadoop
Decision trees in hadoop
 
Introduction to Big Data/Machine Learning
Introduction to Big Data/Machine LearningIntroduction to Big Data/Machine Learning
Introduction to Big Data/Machine Learning
 
The 2012 Future of Open Source Survey Results
The 2012 Future of Open Source Survey ResultsThe 2012 Future of Open Source Survey Results
The 2012 Future of Open Source Survey Results
 
[Week10] R graphics
[Week10] R graphics[Week10] R graphics
[Week10] R graphics
 
R2DOCX example
R2DOCX exampleR2DOCX example
R2DOCX example
 
[시즌2, week3] R Basic
[시즌2, week3] R Basic[시즌2, week3] R Basic
[시즌2, week3] R Basic
 
Real-time Big Data Analytics: From Deployment to Production
Real-time Big Data Analytics: From Deployment to ProductionReal-time Big Data Analytics: From Deployment to Production
Real-time Big Data Analytics: From Deployment to Production
 
Distributed Data Analysis with Hadoop and R - OSCON 2011
Distributed Data Analysis with Hadoop and R - OSCON 2011Distributed Data Analysis with Hadoop and R - OSCON 2011
Distributed Data Analysis with Hadoop and R - OSCON 2011
 
Data Mining. Classification
Data Mining. ClassificationData Mining. Classification
Data Mining. Classification
 
[week11] R_ggmap, leaflet
[week11] R_ggmap, leaflet[week11] R_ggmap, leaflet
[week11] R_ggmap, leaflet
 
PCAP Graphs for Cybersecurity and System Tuning
PCAP Graphs for Cybersecurity and System TuningPCAP Graphs for Cybersecurity and System Tuning
PCAP Graphs for Cybersecurity and System Tuning
 
04 Classification in Data Mining
04 Classification in Data Mining04 Classification in Data Mining
04 Classification in Data Mining
 
Log analysis with Hadoop in livedoor 2013
Log analysis with Hadoop in livedoor 2013Log analysis with Hadoop in livedoor 2013
Log analysis with Hadoop in livedoor 2013
 
Distributed Data Analysis with Hadoop and R - Strangeloop 2011
Distributed Data Analysis with Hadoop and R - Strangeloop 2011Distributed Data Analysis with Hadoop and R - Strangeloop 2011
Distributed Data Analysis with Hadoop and R - Strangeloop 2011
 
Data mining technique (decision tree)
Data mining technique (decision tree)Data mining technique (decision tree)
Data mining technique (decision tree)
 
HW09 Social network analysis with Hadoop
HW09 Social network analysis with HadoopHW09 Social network analysis with Hadoop
HW09 Social network analysis with Hadoop
 
Data Analysis with Hadoop and Hive, ChicagoDB 2/21/2011
Data Analysis with Hadoop and Hive, ChicagoDB 2/21/2011Data Analysis with Hadoop and Hive, ChicagoDB 2/21/2011
Data Analysis with Hadoop and Hive, ChicagoDB 2/21/2011
 
Data mining - Classification - arbres de décision
Data mining - Classification - arbres de décisionData mining - Classification - arbres de décision
Data mining - Classification - arbres de décision
 

Semelhante a New Advances in High Performance Analytics with R: 'Big Data' Decision Trees and Analysis of Hadoop Data

HospETL - Delivering a Healthcare Analytics Platform
HospETL - Delivering a Healthcare Analytics PlatformHospETL - Delivering a Healthcare Analytics Platform
HospETL - Delivering a Healthcare Analytics PlatformAngela Razzell
 
Scalable Data Analysis in R Webinar Presentation
Scalable Data Analysis in R Webinar PresentationScalable Data Analysis in R Webinar Presentation
Scalable Data Analysis in R Webinar PresentationRevolution Analytics
 
Kudler has plenty of room to increase sales while controlling cost.docx
Kudler has plenty of room to increase sales while controlling cost.docxKudler has plenty of room to increase sales while controlling cost.docx
Kudler has plenty of room to increase sales while controlling cost.docxDIPESH30
 
Big Data for Small Businesses & Startups

New Advances in High Performance Analytics with R: 'Big Data' Decision Trees and Analysis of Hadoop Data

  • 1. New Advances in High Performance Analytics with R: 'Big Data' Decision Trees and Analysis of Hadoop Data (Revolution Confidential)
      Presented by: Sue Ranney, VP Product Development
  • 2. In today’s webcast:
      • High Performance Analytics (HPA) with Revolution R Enterprise
      • ‘Big Data’ Decision Trees
      • Revolution’s HPA with Hadoop Data
      • Resources, Q&A
  • 3. Revolution R Enterprise: What Gets Installed?
      • Latest stable version of Open-Source R
      • High performance math libraries
      • RevoScaleR package that adds:
          • High performance ‘big data’ capabilities to R
          • Access to a variety of ‘data sources’ (e.g., SAS, SPSS, text files, ODBC)
          • Ability to compute in a variety of ‘compute contexts’ (e.g., Windows/Linux workstation/server, Microsoft HPC Server cluster, Azure Burst, IBM Platform LSF cluster)
          • High performance computing capabilities
      • Integrated Development Environment based on Visual Studio technology (for Windows): the R Productivity Environment (RPE)
  • 4. High Performance Analytics (HPA) in RevoScaleR
      • High Performance Computing + Data
      • Full-featured, fast, and scalable analysis functions
      • Same code works on small and big data, and a variety of data sources
      • Same code works on a variety of compute contexts: a laptop, server, cluster, or the cloud
      • Scales approximately linearly with the number of observations, without increasing memory requirements
  • 5. RevoScaleR: HPA Algorithms
      • Descriptive statistics (rxSummary)
      • Tables and cubes (rxCube, rxCrossTabs)
      • Correlations/covariances (rxCovCor, rxCor, rxCov, rxSSCP)
      • K means clustering (rxKmeans)
      • Linear regressions (rxLinMod)
      • Logistic regressions (rxLogit)
      • Generalized Linear Models (rxGlm)
      • Predictions (scoring) (rxPredict)
      • Decision Trees (rxDTree) NEW!
  • 6. Decision Trees
      • Relatively easy-to-interpret models
      • Widely used in a variety of disciplines. For example:
          • Predicting which patient characteristics are associated with high risk of, for example, heart attack
          • Deciding whether or not to offer a loan to an individual based on individual characteristics
          • Predicting the rate of return of various investment strategies
          • Retail target marketing
      • Can handle multi-factor response easily
      • Useful in identifying important interactions
  • 7. Decision Tree Types
      • Classification tree: predict what ‘class’ or ‘group’ an observation belongs in (dependent variable is a factor) for each terminal node or leaf
      • Regression tree: predict average value of dependent variable for each terminal node or leaf
  • 8. Simple Example: Marketing Response
      Data set containing the following information:
      • Response: Was response to a phone call, email, or mailing?
      • Age
      • Income
      • Marital status
      • Attended college?
  • 9. Simple Example: Specifying the model
      treeOut <- rxDTree(response ~ age + income + college + marital, data = rdata)
      where rdata is the name of the data set
  • 10. Simple Example: Basic Output
      • Each line gives the split, the number of observations in the node, the number that do not match the predicted y value, the predicted y value, and the y probabilities. An asterisk marks a terminal node.
       1) root 10000 4069 Email (0.33260000 0.59310000 0.07430000)
         2) college=No College 5074 2378 Phone (0.53133622 0.38943634 0.07922743)
           4) age>=39.5 2518 330 Phone (0.86894361 0.00000000 0.13105639)
             8) age< 64.5 2256 77 Phone (0.96586879 0.00000000 0.03413121) *
             9) age>=64.5 262 9 Mail (0.03435115 0.00000000 0.96564885) *
           5) age< 39.5 2556 580 Email (0.19874804 0.77308294 0.02816901)
            10) marital=Single 835 371 Phone (0.55568862 0.40958084 0.03473054)
              20) income>=29.5 472 14 Phone (0.97033898 0.00000000 0.02966102) *
              21) income< 29.5 363 21 Email (0.01652893 0.94214876 0.04132231) *
            11) marital=Married 1721 87 Email (0.02556653 0.9494480 0.02498547) *
         3) college=College 4926 971 Email (0.12789281 0.80288266 0.06922452)
         …
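To make the node lines easier to read, the short check below works through the root line by hand. It is only an illustration in base R: the class order (Phone, Email, Mail) is an assumption inferred from which probability is largest in each printed node, and the variable names are made up for this sketch.

```r
# Root node line: "1) root 10000 4069 Email (0.3326 0.5931 0.0743)"
n    <- 10000   # observations in the node
loss <- 4069    # observations that are NOT in the predicted class

# Assumed class order, inferred from the printed nodes
probs <- c(Phone = 0.3326, Email = 0.5931, Mail = 0.0743)

predicted <- names(probs)[which.max(probs)]  # class with the largest probability
matched   <- n - loss                        # observations in the predicted class

predicted      # the node's predicted y value
matched        # count of observations in that class
matched / n    # share of the node in that class, compare with probs[["Email"]]
```

The same arithmetic holds further down the tree: for node 2, 5074 - 2378 = 2696 observations are in the predicted class, which is about 53.1% of the node.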
  • 11. Simple Example: Visual Representation
      [Figure: diagram of the fitted tree. The root splits on college (No College / College); lower splits use age (cut points at about 40 and 65), marital status (Single / Married), and income (cut point at about 30); leaves are labeled Phone, Email, or Mail.]
  • 12. Scaling HPA with RevoScaleR
      • RevoScaleR functions can read from data sets on disk in chunks, so you can increase the number of observations in the data set beyond what can be analyzed in memory all at once
      • RevoScaleR analysis functions process chunks of data in parallel, taking greater advantage of your computing resources (Parallel External Memory Algorithms)
          • Multiple cores on a desktop/server
          • Clusters/grids have the added advantage of more hard drives for storing and accessing data
              • Windows HPC Server Cluster
              • “Burst” computations to Azure in the cloud
              • IBM Platform LSF Grid
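The chunking idea on this slide can be sketched in base R. This is not RevoScaleR code and makes no claim about how its functions are implemented: the simulated vector stands in for a file too large to load at once, and `chunk_size` is a hypothetical block size. The point is that each chunk contributes only a few summary numbers, so memory use does not grow with the number of observations, and because the chunks are independent they could be handed to separate cores or nodes.

```r
# Simulated data standing in for a data set read from disk in blocks
set.seed(1)
x <- rnorm(100000, mean = 5, sd = 2)

chunk_size <- 10000                          # hypothetical block size
starts <- seq(1, length(x), by = chunk_size)

# Each chunk is reduced to three numbers: count, sum, and sum of squares
stats <- vapply(starts, function(s) {
  chunk <- x[s:min(s + chunk_size - 1, length(x))]
  c(n = length(chunk), sum = sum(chunk), sumsq = sum(chunk^2))
}, c(n = 0, sum = 0, sumsq = 0))

# Combine the per-chunk results into statistics for the whole data set
totals     <- rowSums(stats)
chunk_mean <- totals[["sum"]] / totals[["n"]]
chunk_var  <- (totals[["sumsq"]] - totals[["n"]] * chunk_mean^2) /
              (totals[["n"]] - 1)

c(mean = chunk_mean, var = chunk_var)        # agrees with mean(x) and var(x)
```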
  • 13. The ‘Big Data’ Decision Tree Algorithm
      • Classical algorithms for building a decision tree sort all continuous variables in order to decide where to split the data.
      • This sorting step becomes prohibitive in time and memory when dealing with large data.
      • rxDTree bins the data rather than sorting, computing histograms to create empirical distribution functions of the data
      • rxDTree partitions the data horizontally, processing different sets of observations in parallel
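A minimal base R sketch of choosing a split from binned data rather than sorted data follows. It illustrates the general histogram idea only and is not rxDTree's actual implementation: the equal-width bins, the simulated data, the regression-style criterion, and names such as `num_bins` are all assumptions made for this example. One pass over the data fills per-bin counts and sums; candidate splits are then evaluated at bin boundaries using cumulative sums, so no sort of the raw values is needed.

```r
set.seed(42)
n <- 50000
x <- runif(n, 0, 100)
y <- ifelse(x < 40, 1, 3) + rnorm(n, sd = 0.5)   # true change point at x = 40

num_bins <- 100                                   # hypothetical bin count
breaks <- seq(0, 100, length.out = num_bins + 1)
bin  <- findInterval(x, breaks, all.inside = TRUE)
binf <- factor(bin, levels = seq_len(num_bins))

# One pass over the data: per-bin count and per-bin sum of y
cnt  <- as.numeric(table(binf))
sumy <- as.numeric(tapply(y, binf, sum))
sumy[is.na(sumy)] <- 0                            # empty bins contribute nothing

# Candidate split k sends bins 1..k to the left child and the rest to the right
k       <- seq_len(num_bins - 1)
n_left  <- cumsum(cnt)[k]
s_left  <- cumsum(sumy)[k]
n_right <- sum(cnt) - n_left
s_right <- sum(sumy) - s_left

# Maximizing this quantity is equivalent to minimizing the within-child
# sum of squared errors, the usual regression-tree criterion
gain     <- s_left^2 / n_left + s_right^2 / n_right
best     <- which.max(gain)
split_at <- breaks[best + 1]

split_at   # boundary chosen from the histogram
```

The cost of evaluating splits depends on the number of bins, not the number of observations, and the per-bin counts and sums from different blocks of rows can simply be added together, which is what makes horizontal partitioning straightforward.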
  • 14. Useful rxDTree Arguments for Big Data
      • cp: complexity parameter. Increasing cp will decrease the number of splits attempted.
      • maxDepth: the maximum depth of any tree node. The computations take much longer at greater depth, so lowering maxDepth can greatly speed up computation time.
      • maxNumBins: the maximum number of bins to use to cut numeric data. Decreasing maxNumBins will speed up computation time.
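The effect of a complexity parameter and a depth limit can be tried on a small scale with the open-source rpart package, which is swapped in here because RevoScaleR is not freely available. This sketch assumes rpart's `cp` and `maxdepth` controls behave analogously to the rxDTree arguments described above; rpart has no counterpart to maxNumBins, so that argument is not shown. The built-in iris data and the helper `count_splits` are used purely for illustration.

```r
library(rpart)  # recommended package bundled with standard R distributions

# Helper (illustrative): number of internal nodes, i.e. splits, in a fitted tree
count_splits <- function(fit) sum(fit$frame$var != "<leaf>")

# Small cp: even splits with a modest improvement are kept
loose <- rpart(Species ~ ., data = iris,
               control = rpart.control(cp = 0.001))

# Large cp: a split must improve the fit substantially to be attempted
tight <- rpart(Species ~ ., data = iris,
               control = rpart.control(cp = 0.6))

# Depth limit: the tree may not grow past one level below the root
shallow <- rpart(Species ~ ., data = iris,
                 control = rpart.control(cp = 0.001, maxdepth = 1))

c(loose = count_splits(loose),
  tight = count_splits(tight),
  shallow = count_splits(shallow))
```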
  • 15. ‘Big Data’ Example
      CDC Report in Jan. 2012
  • 16. The U.S. Birth Data: 1985 - 2009
      • Public-use data sets containing information on all births in the United States for each year from 1985 to 2009 are available to download: http://www.cdc.gov/nchs/data_access/Vitalstatsonline.htm
      • “These natality files are gigantic; they’re approximately 3.1 GB uncompressed. That’s a little larger than R can easily process” (Joseph Adler, R in a Nutshell)
      • I’ve imported key variables from each year into a single .xdf file with over 100 million observations.
  • 17. Regression Tree: Multiple Births
      Call:
      rxDTree(formula = IsMultiple ~ DadAgeR8 + MAGER + FRACEREC + FHISP_REC + MRACEREC + MHISP_REC + DOB_YY, data = birthAllC, maxDepth = 6, cp = 1e-05, blocksPerRead = 10, verbose = 1)
      File: C:RevolutionDataCDCBirthUS.xdf
      Number of valid observations: 100672041
      Number of missing observations: 0
  • 18. Leaves with Lowest Percent of Multiple Births
      • 1.3%: Mom is not black and under the age of 20
      • 1.6%: Mom is Asian or Pacific Islander (and not Hispanic) and is between 22 and 28 years of age. The birth is before 1997
      • 1.7%: Mom is black and under the age of 18
  • 19. Leaves with Highest Percent of Multiple Births
      • 38.6%: Mom is over 47 years old and the birth is after 1996
      • 28.1%: Mom is white, non-Hispanic, is between 45 and 47 years old, and the birth is after 1996
      • 15.5%: Mom is Hispanic, is between 45 and 47 years old, and the birth is after 1996
  • 20. Poll Question
      Are you using Hadoop?
  • 21. RevoScaleR with Hadoop Data Files (NEW)
      • The Hadoop Distributed File System (HDFS)
          • is highly fault-tolerant and
          • is designed to be deployed on low-cost hardware.
      • RevoScaleR supports accessing data in the HDFS file system for import or for direct analysis
  • 22. RevoScaleR Data Sources
      • Data Sources can be used for import or directly for analysis
          • External: delimited text, fixed format text, SAS, SPSS, ODBC connections
          • Provided with RevoScaleR: efficient .xdf file format
      • Data Sources contain information about their file system
          • Delimited text and .xdf data sources can both be used with the HDFS file system
      • Data sources are used as input to HPA functions
  • 23. An Example Using Hadoop Data
      • Hadoop cluster in our office
          • Five nodes of commodity hardware
          • Red Hat Enterprise Linux (RHEL) operating system
          • Cloudera’s Hadoop (CDH3)
          • Also has IBM Platform LSF workload management system installed (not required to use HDFS data)
      • My colleague, Dawn Kinsey, recorded a data analysis session
          • 22 comma delimited files stored in HDFS
          • Contain information on U.S. flight arrivals, 1997 – 2008
  • 24. Steps in Analysis
      • Set up a ‘file system’ object and a ‘data source’ object
      • Explore the HDFS airline data for the year 2000 directly
      • Extract variables of interest from all the files into an .xdf file in the native file system
      • Use R’s great plotting capabilities on summary information
      • Perform a big logistic regression on an .xdf file stored in HDFS
  • 25. Poll Question
      What features of Revolution R Enterprise 6.1 are most interesting to you?
  • 26. Thank You!
      • Download slides, replay from today’s webinar
          • http://bit.ly/QJfR4A
      • Learn more about Revolution R Enterprise
          • Overview: revolutionanalytics.com/products
          • New feature videos: http://www.revolutionanalytics.com/products/new-features.php
      • Contact Revolution Analytics
          • http://bit.ly/hey-revo
      November 29: Real-Time Big Data Analytics: from Deployment to Production
      David Smith, VP Marketing and Community, Revolution Analytics
      www.revolutionanalytics.com/news-events/free-webinars
  • 27. The leading commercial provider of software and support for the popular open source R statistics language.
      www.revolutionanalytics.com
      +1 (650) 646 9545
      Twitter: @RevolutionR