1. © 2012 IBM Corporation
Revolution Confidential
Revolution R Enterprise for IBM Netezza
2.
IBM Netezza with Revolution Analytics
High-performance, in-database analytics platform for Big Data
– Massively parallel processing delivers 10-100x performance
– Run analytics in-database and eliminate data movement
– Scalable architecture fosters experimentation
Innovation with Advanced Analytics
– Analytic modeling with most current statistical methods and 2,500+ open source packages
Enterprise ready advanced analytics software, services & support
– Security, IDE, training, professional services
– Web Services stack enables integration with front-end presentation layer
3. March 1, 2012
Revolution Analytics
4.
What is R?
Data analysis software
A programming language
– Development platform designed by and for statisticians
– Object-oriented: vector, matrix, model, …
– Built-in libraries of algorithms
An environment
– Huge library of algorithms for data access, data manipulation, analysis and graphics
An open-source software project
– Free, open, and active
A community
– Thousands of contributors, 2 million users
– Resources and help in every domain
Download the White Paper
R is Hot
bit.ly/r-is-hot
5.
The professor who invented analytic software for
the experts now wants to take it to the masses
Most advanced statistical analysis software available
Half the cost of commercial alternatives
2M+ Users
2,500+ Applications
Statistics, Predictive Analytics, Data Mining, Visualization
Finance, Life Sciences, Manufacturing, Retail, Telecom, Social Media, Government
Power, Productivity, Enterprise Readiness
6.
Revolution R Enterprise has the Open-Source R Engine at the core
2,500 community packages and growing exponentially
R Engine
Language Libraries
Open Source R Packages
Technical Support
Web Services API
Big Data Analysis
Revolution Productivity Environment
Build Assurance
Parallel Tools
Multi-Threaded Math Libraries
Technology Partners
7.
Working with Revolution R Enterprise for IBM Netezza
8.
Revolution R Enterprise for IBM Netezza inside the IBM Netezza Architecture
IBM Netezza Analytics
9.
In-Database Paradigms for using R
In-database Scoring
– Family of apply functions which score analytic models by using data parallelism
– Underlying truism is that there is a fact that can be applied across all data
– Examples: customer lifetime value, credit score, affinity, good stock/bad stock
Big Data Analytics
– Family of parallelized, in-database analytics that have R wrappers and work on the entire data set
– Underlying truism exists across all data
– Examples: clustering of all data to determine groupings; models that are applied across a whole data set (decision trees); data transformation (variable selection, correlation)
Grouped by Row (tapply)
– Data and Task Parallelism
• Data flow technique to apply analytics to naturally occurring groups of data using non-parallelized analytics
– Underlying relationship in data is by a group
– Examples: forecasting by store, stock symbol, etc.; building a model for each customer, product, etc.
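The grouped paradigm can be pictured with plain local R. The sketch below does not use the Netezza packages and runs entirely in memory; the `sales` data frame, its columns, and the slopes are made up for illustration. It splits a table by a grouping column and fits an ordinary, non-parallelized model to each group, which is the pattern the slide describes for per-store forecasting.

```r
# Hypothetical data: 20 weeks of sales for three stores (made up).
set.seed(42)
sales <- data.frame(
  store = rep(c("A", "B", "C"), each = 20),
  week  = rep(1:20, times = 3)
)
# Each store follows its own linear trend plus a little noise.
true_slope <- c(A = 2, B = -1, C = 0.5)
sales$units <- 100 + unname(true_slope[sales$store]) * sales$week +
  rnorm(nrow(sales), sd = 1)

# Fit one ordinary (non-parallelized) model per naturally occurring group.
fit_one_store <- function(df) lm(units ~ week, data = df)
models <- lapply(split(sales, sales$store), fit_one_store)

# Forecast week 21 for every store from that store's own model.
forecast_week21 <- sapply(models, function(m) {
  unname(predict(m, newdata = data.frame(week = 21)))
})

# Estimated weekly trend per store.
slopes <- sapply(models, function(m) unname(coef(m)["week"]))
print(round(slopes, 2))
```

In the in-database setting the split and the per-group function calls would be carried out on the appliance rather than by `split` and `lapply`.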
10.
Access In-Database Language Support from R
SQL, Java, Python, C, Fortran, C++
11.
Open Source R Package Support
2,500+ community packages
Vertical
• Econometrics
• Experimental Design
• Computational Physics
• Clinical Trials
• Environmetrics
• Finance
• Genetics
• Medical Imaging
• Pharmacokinetics
• Phylogenetics
• Psychometrics
• Social Sciences
Horizontal
• Bayesian
• Cluster
• Distributions
• Graphics
• Graphical Models
• Machine Learning
• Multivariate
• Natural Language Processing
• Optimization
• Robust Statistical Metrics
• Spatial
• Survival Analysis
• Time Series
12.
Using Revolution R Enterprise with IBM Netezza
R packages integrate and push analytics processing in-database
[Diagram: Revolution R Enterprise Workstation; Revolution R Enterprise Server with RevoDeployR Server (Web Services Interface for R), reached over HTTP by a Business Intelligence, Excel, or third-party application; RODBC and nzODBC connections to the IBM Netezza Host and five S-Blades, each running IBM Netezza Analytics]
13.
Deploying Revolution R Enterprise to IBM Netezza
• Open a remote terminal connection to the Host
• Create your R script
• Compile and register your R script as an AE (UDAP)
• Execute SQL that invokes the registered AE
• Go back to the Revolution R client to retrieve results and continue additional analysis
[Diagram: IBM Netezza Host and five S-Blades, each running IBM Netezza Analytics]
14.
Revolution R Enterprise Client Configuration
Revolution R Enterprise
– Productivity Environment
Netezza ODBC Drivers
‘nz’ R Packages
– nzA, nzR, nzMatrix
R Package Dependencies
– RODBC
– caTools
– tree
– bitops
– e1071
– rgl
– ca
– MASS
– XML
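A quick way to check a client for these dependencies is sketched below. It uses only base R and the case-sensitive CRAN spellings of the package names; the helper name `check_installed` is made up for this sketch, and the 'nz' packages themselves are not covered because they come with the Netezza software rather than from CRAN.

```r
# Dependency names from the slide, in their case-sensitive CRAN spellings.
nz_dependencies <- c("RODBC", "caTools", "tree", "bitops", "e1071",
                     "rgl", "ca", "MASS", "XML")

# Return a named logical vector: TRUE where the package is installed locally.
check_installed <- function(pkgs) {
  installed <- rownames(installed.packages())
  setNames(pkgs %in% installed, pkgs)
}

status <- check_installed(nz_dependencies)
missing_pkgs <- names(status)[!status]
if (length(missing_pkgs) > 0) {
  message("Not installed: ", paste(missing_pkgs, collapse = ", "))
}
```

Anything reported as missing could then be installed with `install.packages(missing_pkgs)` before the 'nz' packages are loaded.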
15.
IBM Netezza In-Database Analytics from Revolution R
nzR Package
– Encapsulate database and expose “R”-like constructs
– R data.frame = database table
– Apply an R function to a row of data or grouped rows of data
nzA Package
– Entry point to the nzAnalytics
– Explicitly parallelized algorithms that run in database
nzMatrix Package
– Encapsulation of matrices and operations in database
– nz.matrix construct in R to access matrices in the database
– R operations on nz.matrix translate to matrix stored procedure operations
16.
nzR Package
Basic Functions
Database Connection: nzConnect, nzConnectDSN
SQL Execution: nzQuery, nzScalarQuery, nzDeleteTable
Data Management: as.nz.data.frame, nz.data.frame
Apply an R function: nzApply, nzTApply, nzGroupedApply
R Package Management: nzInstallPackages, nzIsPackageInstalled
Sample Code
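# Load the nzR package
library(nzr)
# Connect to a database via ODBC (here: user, password, host, database)
nzConnect("admin", "xyz", "127.0.0.1", "iclasstest")
# Point an nz.data.frame at the iris table in the database
nzdf <- nz.data.frame("iris")
# Run nzTApply against the nz.data.frame: for each group defined by
# column 5, return the maximum of column 1
fun <- function(x) max(x[, 1])
nzTApply(nzdf, nzdf[, 5], fun)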
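For comparison, the same grouping logic can be tried on the copy of iris that ships with R. This is a local, in-memory analogue only: it uses `split` and `sapply` from base R rather than the nzR package, and the exact shape of the value returned by nzTApply is not assumed here.

```r
# Same function as in the sample code: the maximum of column 1.
fun <- function(x) max(x[, 1])

# Split the built-in iris data frame by column 5 (Species) and apply `fun`
# to the rows of each group.
groups <- split(iris, iris[, 5])
result <- sapply(groups, fun)
print(result)
```

The result is one value per species, namely the largest Sepal.Length in that group.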
17.
nzA Package
Data Manipulation
Moments: nz.moments
Quantiles: nz.quantile, nz.quartile
Outlier Detection: nz.outliers
Frequency Table: nz.bitable
Histogram: nz.hist
Pearson's Correlation: nz.corr
Spearman's Correlation: nz.spearman.corr, nz.spearman.corr.s
Covariance: nz.cov, nz.cov.matrix
Mutual Information: nz.mutualinfo
Chi-Square Test: nzChisq.test, nz.chisq.test
t-Test: t.ls.test, t.me.test, t.pmd.test, t.umd.test
Mann-Whitney-Wilcoxon Test: nz.mww.test
Wilcoxon Test: nz.wilcoxon.test
Canonical Correlation: nz.canonical.corr
One-Way ANOVA: nzAnova, nz.anova.CRD.test, nz.anova.RBD.test
Principal Component Analysis: nzPCA
Tree-Shaped Bayesian Networks: nz.TBNet Apply, nz.TBNet Grow, nz.BigBNControl, nz.TBNet1g2p, nz.TBNet1g, nz.TBNet2g
18.
nzA Package
Data Transformations
Discretization: nz.efdisc, nz.emdisc, nz.ewdisc
Standardization and Normalization: nz.std.norm
Data Imputation: nz.impute.data
Model Diagnostics
Misclassification Error: nz.cerror
Confusion Matrix: nz.acc, nz.CMATRIX STATS
Mean Absolute Error: nz.mae
Mean Square Error: nz.mse
Relative Absolute Error: nz.rae
Percentage Split: nz.percentage.split
Cross-Validation: nz.cross.validation
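The diagnostic measures named on this slide can be written out in a few lines of base R. The sketch below uses the common textbook definitions on small made-up vectors; it does not call the nzA functions, and their exact definitions and arguments may differ.

```r
# Made-up labels and predictions for a two-class problem.
actual    <- factor(c("yes", "yes", "no", "no", "yes", "no", "no", "yes"))
predicted <- factor(c("yes", "no",  "no", "no", "yes", "yes", "no", "yes"),
                    levels = levels(actual))

# Confusion matrix: rows are actual classes, columns are predicted classes.
confusion <- table(actual = actual, predicted = predicted)

# Misclassification error: share of predictions that differ from the truth.
class_error <- mean(actual != predicted)

# Made-up numeric targets and predictions for the regression-style measures.
y     <- c(3.0, 5.0, 2.0, 7.0)
y_hat <- c(2.5, 5.0, 3.0, 8.0)

mae <- mean(abs(y - y_hat))                         # mean absolute error
mse <- mean((y - y_hat)^2)                          # mean square error
rae <- sum(abs(y - y_hat)) / sum(abs(y - mean(y)))  # relative absolute error
```

Relative absolute error compares the model against always predicting the mean of `y`, so values below 1 mean the model does better than that baseline.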
19.
nzA Package
Classification
Naive Bayes: nzNaiveBayes, nz.naivebayes, nz.predict.naivebayes
Decision Trees: nzDecTree, nz.dectree, nz.grow.dectree, nz.print.dectree, nz.prune.dectree, nz.predict.dectree
Nearest Neighbors: nz.knn
Regression
Linear Regression: nzLm
Regression Trees: nzRegTree, nz.regtree, nz.grow.regtree, nz.print.regtree, nz.predict.regtree
Clustering
K-Means Clustering: nzKMeans, nz.kmeans, nz.predict.kmeans
Divisive Clustering: nz.divcluster, nz.predict.divcluster
Associative Rule Mining
FP-Growth: nz.fpgrowth, nz.prepare.fpgrowth
20.
nzMatrix Package
Data Manipulation
Coerce or point to a nz.matrix: as.nz.matrix, as.nz.matrix.matrix, nz.matrix
Combine Matrices: nzCBind, nzRBind
Create Matrices From Tables: nzCreateMatrixFromTable, nzCreateTableFromMatrix
Create Special Matrices: nzIdentityMatrix, nzNormalMatrix, nzOnesMatrix, nzRandomMatrix, nzVecToDiag
Decomposition: nzSVD, svd, nzEigen
Delete Matrices: nzDeleteMatrix, nzDeleteMatrixByName
Dimensions: dim, NCOL, ncol, NROW, nrow
Mathematical Functions: abs, add, subtr, ceiling, div, exp, floor, ln, log10, mod, mult, nzPowerMatrix, pow, rounding, sqrt, trunc
Matrix Engine Initialization: nzMatrixEngineInitialization
Matrix Info: is.nz.matrix, isSparse, nzExistMatrix, nzExistMatrixByName, nzGetValidMatrixName
Operators: *, +, -, <, ==, >, nzKronecker, nzPMax, nzPMin, nzSetValue, [, scale, t
Printing Matrices: print.nz.matrix
Solve: nzInv, nzSolve, nzSolveLLS
Sparse Matrices: isSparse, nzSparse2matrix
Summaries: nzAll, nzAny, nzMax, nzMin, nzSsq, nzSum, nzTr
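To make the operation families on this slide concrete, here is a small sketch of the corresponding computations on ordinary in-memory R matrices. It uses base R only (`qr.solve`, `svd`, `solve`), not the nzMatrix functions, and the matrices are made up; it shows what "solve", "decomposition", and "inverse" compute, not how the appliance computes them.

```r
# Small made-up system; in-database these would be large stored matrices.
set.seed(1)
A <- matrix(rnorm(12), nrow = 4, ncol = 3)   # 4 x 3 matrix
b <- A %*% c(1, -2, 0.5)                     # right-hand side with a known solution

# Linear least squares: find x that minimizes ||A x - b||.
x_ls <- qr.solve(A, b)

# Singular value decomposition, then rebuild A from its factors.
s <- svd(A)
A_rebuilt <- s$u %*% diag(s$d) %*% t(s$v)

# Inverse of a square matrix (here the 3 x 3 cross-product of A).
AtA <- crossprod(A)
AtA_inv <- solve(AtA)
```

Because `b` was built from a known coefficient vector, `x_ls` recovers that vector up to floating-point error.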
21.
Demonstration
Using Revolution R with IBM Netezza
23.
Use Case – Credit Risk
We have a dataset comprised of individuals
and their credit risk
stored on the Netezza Appliance
The goal is to model if someone is
“approvable” for a loan.
This use case will follow a modeling process
(though condensed) from start to finish.
I will discuss each of the parts and at the end
there will be a demo of the code
25.
1. Learning more about the data
Connect to the IBM Netezza appliance
Summarize the data
Visualize the data
[Figure: histogram of a continuous variable (x against Frequency)]
[Figure: bar chart of a discrete variable with categories High School Diploma, Bachelors Degree, Masters Degree, Professional Degree, PhD]
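The summarize and visualize steps might look like the following in plain R. This sketch runs on a made-up local data frame rather than a table on the appliance; the column names `x` and `education` and the simulated values are assumptions for illustration only.

```r
# Made-up stand-in for the credit data; column names are hypothetical.
set.seed(7)
education_levels <- c("High School Diploma", "Bachelors Degree",
                      "Masters Degree", "Professional Degree", "PhD")
credit <- data.frame(
  x = rgamma(1000, shape = 2, scale = 3),
  education = factor(sample(education_levels, 1000, replace = TRUE),
                     levels = education_levels)
)

# Summarize: quartiles and mean for the numeric column, counts for the factor.
print(summary(credit))

# Continuous variable: compute histogram bins (plot = FALSE keeps this headless).
h <- hist(credit$x, plot = FALSE)

# Discrete variable: frequency table, the input a bar chart would use.
edu_counts <- table(credit$education)

# In an interactive session the two charts could be drawn with:
# plot(h, main = "Continuous Variable", xlab = "x")
# barplot(edu_counts, main = "Discrete Variable")
```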
26.
2. Prepare the data for modeling
Split the data into 70/30 Training/Test sets
Transform some variables
Discretize numeric variables for later use
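A minimal local sketch of these three preparation steps, in base R, is shown below. The `applicants` data frame, its columns, the debt-to-income transform, and the choice of four equal-width bins are all assumptions made for illustration; the demo itself works against data stored on the appliance.

```r
# Made-up applicant data; column names are hypothetical.
set.seed(123)
n <- 200
applicants <- data.frame(income = runif(n, 20000, 120000),
                         debt   = runif(n, 0, 50000))

# 70/30 split: sample 70% of the row indices for training, the rest for test.
train_idx <- sample(seq_len(n), size = round(0.7 * n))
train <- applicants[train_idx, ]
test  <- applicants[-train_idx, ]

# Transform a variable: debt-to-income ratio (an illustrative transform).
train$dti <- train$debt / train$income
test$dti  <- test$debt / test$income

# Discretize a numeric variable into four equal-width bins, using the same
# fixed cut points for both sets so the bins line up.
breaks <- seq(20000, 120000, length.out = 5)
train$income_bin <- cut(train$income, breaks = breaks, include.lowest = TRUE)
test$income_bin  <- cut(test$income,  breaks = breaks, include.lowest = TRUE)
```

Sharing the cut points between the two sets matters: a model trained on one set of bins cannot score rows that were binned differently.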
27.
3. Fit models to the data
Build two different models to predict if an
individual is “approvable”
Decision Tree
Naïve Bayes
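To show what the second of these models does with discretized inputs, here is a from-scratch Naive Bayes classifier for categorical predictors in base R. It is a teaching sketch on a tiny made-up table, not the in-database implementation; the function names `fit_naive_bayes` and `predict_naive_bayes`, the column names, and the add-one smoothing are choices made here. The decision tree is not sketched.

```r
# Made-up training data: two categorical predictors and a class label.
train <- data.frame(
  income_bin = c("low", "low", "high", "high", "high", "low", "high", "low"),
  has_debt   = c("yes", "yes", "no",   "no",   "yes",  "no",  "no",   "yes"),
  approvable = c("no",  "no",  "yes",  "yes",  "yes",  "yes", "yes",  "no"),
  stringsAsFactors = TRUE
)

# Fit: class priors plus, per predictor, P(value | class) with add-one smoothing.
fit_naive_bayes <- function(data, target) {
  y <- data[[target]]
  predictors <- setdiff(names(data), target)
  cond <- lapply(predictors, function(p) {
    counts <- table(data[[p]], y) + 1        # add-one (Laplace) smoothing
    sweep(counts, 2, colSums(counts), "/")   # normalize within each class
  })
  names(cond) <- predictors
  list(prior = prop.table(table(y)), cond = cond, classes = levels(y))
}

# Predict: pick the class with the largest log posterior (up to a constant).
predict_naive_bayes <- function(model, newdata) {
  apply(newdata, 1, function(row) {
    score <- log(as.numeric(model$prior))
    for (p in names(model$cond)) {
      score <- score + log(model$cond[[p]][row[[p]], ])
    }
    model$classes[which.max(score)]
  })
}

model <- fit_naive_bayes(train, "approvable")
new_applicants <- data.frame(income_bin = c("high", "low"),
                             has_debt   = c("no", "yes"))
predictions <- predict_naive_bayes(model, new_applicants)
print(predictions)
```

Working in log space avoids multiplying many small probabilities together, which matters once there are more than a handful of predictors.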
30.
Summary
Familiar environment for R Developers
– World-class productivity tools
– Enterprise class service, support and integration
Execution of analytics in-database
– Analytic computing distributed across Netezza nodes and run in a massively parallel manner
– Each Netezza node gets a data slice and analytics are pushed down from the Host to the individual nodes
Capabilities
– R code executed on Netezza nodes in row-by-row fashion or on groups of rows
– Enables access to explicitly parallelized algorithms running on entire data set
– Large-scale parallel matrix operations on database tables
Performance
– 10-100x Performance improvements
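The "data slice" idea in the summary can be imitated locally to see why it avoids data movement. In this base R sketch, which does not involve the appliance, each simulated slice computes only a small partial result and a "host" step combines them; the round-robin assignment of rows to four slices is an assumption made here, since a real appliance distributes rows by its own rules.

```r
# Made-up data standing in for a large table.
set.seed(2012)
values <- rnorm(10000, mean = 50, sd = 10)

# Assign rows to 4 hypothetical data slices (round-robin for illustration).
n_slices <- 4
slices <- split(values, rep_len(seq_len(n_slices), length(values)))

# "Node" step: each slice computes only a small partial result.
partials <- lapply(slices, function(v) c(sum = sum(v), n = length(v)))

# "Host" step: combine the partial results into the global answer.
totals <- Reduce(`+`, partials)
global_mean <- unname(totals["sum"] / totals["n"])
```

Only two numbers per slice travel to the host, yet the combined result equals the mean computed over all rows at once.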
31.
Contact Us
Derek Norton
Solutions Executive
Revolution Analytics
derek.norton@revolutionanalytics.com
www.revolutionanalytics.com +1 (650) 646 9545 Twitter: @RevolutionR
Bill Zanine
Business Solutions Executive, Analytics Solutions
IBM Netezza
wzanine@us.ibm.com