When two of the most powerful innovations in modern analytics come together, the result is revolutionary.
This presentation covers:
- An overview of R, the open source programming language developed specifically for statistical analysis and data visualization and used by more than 2 million people.
- The ways that R and Hadoop have been integrated.
- A use case that provides real-world experience.
- A look at how enterprises can take advantage of both of these industry-leading technologies.
Presented at Hadoop World 2011 by:
David Champagne
CTO, Revolution Analytics
David Champagne is a top software architect, programmer and product manager with more than 20 years of experience in enterprise and web application development for business customers across a wide range of industries. As Principal Architect/Engineer for SPSS, Champagne led the development teams and created and led the text mining team.
2. In Today’s Presentation:
About Revolution Analytics
Why R and Hadoop?
The Packages (rhdfs, rhbase, rmr)
Examples
3. Most advanced statistical analysis software available
Half the cost of commercial alternatives
The professor who invented analytic software for the experts now wants to take it to the masses
2M+ Users
3,000+ Packages
Power, Productivity, Enterprise Readiness
Statistics, Predictive Analytics, Data Mining, Visualization
Finance, Life Sciences, Manufacturing, Retail, Telecom, Social Media, Government
4. What’s the Difference Between R and Revolution R Enterprise?
Revolution R is 100% R and More®
Multi-Threaded Math Libraries
Web-Based GUI
Web Services API
Big Data Analysis
Parallel Tools
Technical Support
IDE / Developer GUI
Build Assurance
3,000+ Community Packages
R Engine Language Libraries
For more information contact: info@revolutionanalytics.com
6. Why R and Hadoop?
Hadoop - a scalable infrastructure for processing massive amounts of data
  Storage – HDFS, HBASE
  Distributed Computing - MapReduce
R - a statistical programming language
  Need for more than counts and averages
  Analyze all of the data
7. Motivation for this project
Make it easy for the R programmer to interact with the Hadoop data stores and write MapReduce programs
Run R on a massively distributed system without having to understand the underlying infrastructure
Statisticians stay focused on the analysis
Open source
8. R and Hadoop – The R Packages
Capabilities delivered as individual R packages
rhdfs - R and HDFS
rhbase - R and HBASE
rmr - R and MapReduce
Downloads available from GitHub
[Architecture diagram. Labels: R Client, Job Tracker, Task Node, Map or Reduce, R, Thrift, HDFS, HBASE, rhdfs, rhbase, rmr]
9. rhdfs
Manipulate HDFS directly from R
Mimic as much of the HDFS Java API as possible
Examples:
Read an HDFS text file into a data frame
Serialize/Deserialize a model to HDFS
Write an HDFS file to local storage
rhdfs/pkg/inst/unitTests
rhdfs/pkg/inst/examples
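The serialize/deserialize example above can be sketched in base R. This is a minimal sketch, not rhdfs code: a local temporary file stands in for an HDFS file as an assumption, so it runs without a cluster. With rhdfs, the same raw bytes would be written to and read back from HDFS instead.

```r
# Fit a small model on a built-in dataset.
model <- lm(mpg ~ wt, data = mtcars)

# serialize() with connection = NULL returns the object as a raw byte vector.
bytes <- serialize(model, connection = NULL)

# Assumption: a local temporary file stands in for the HDFS file.
path <- tempfile(fileext = ".bin")
writeBin(bytes, path)

# Read the bytes back and rebuild the model object.
restored <- unserialize(readBin(path, what = "raw", n = file.size(path)))

# The restored model should predict exactly like the original.
new_data <- data.frame(wt = c(2.5, 3.0))
same <- isTRUE(all.equal(predict(model, new_data), predict(restored, new_data)))
```

Because the model travels as plain bytes, any store that can hold a binary file can hold it.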
11. rhbase
Manipulate HBASE tables and their content
Uses the Thrift C++ API to communicate with HBASE
Examples
Create a data frame from a collection of rows and columns in an HBASE table
Update an HBASE table with values from a data frame
rhbase/pkg/inst/unitTests
14. rmr - For R Programmers
• A way to access big data sets
• A simple way to write parallel programs – everyone will have to
• Very R-like, building on the functional characteristics of R
• Just a library
15. rmr – For MapReduce Developers
• Much simpler than writing Java
• Not as simple as Hive or Pig at what they do, but more general
• Great for prototyping, can transition to production -- optimize instead of rewriting! Lower risk, always executable.
16. rmr mapreduce Function
mapreduce(input, output, map, reduce, …)
input - input folder
output - output folder
map - R function used as map
reduce - R function used as reduce
… - other advanced parameters
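To make the roles of map and reduce concrete, here is a minimal in-memory sketch of the same calling pattern in base R. The names `keyval` and `mapreduce_local` here are hypothetical local stand-ins, not the rmr implementations: input is an R list rather than an HDFS folder, and nothing runs on Hadoop.

```r
# Hypothetical local stand-ins; these are NOT the rmr functions.
keyval <- function(k, v) list(key = k, val = v)

mapreduce_local <- function(input, map, reduce) {
  # Map phase: apply map to every (key, value) record; drop NULL results.
  mapped <- lapply(input, function(kv) map(kv$key, kv$val))
  mapped <- Filter(Negate(is.null), mapped)
  if (length(mapped) == 0) return(list())

  # Shuffle phase: group mapped values by key (keys are compared as strings).
  keys <- vapply(mapped, function(kv) as.character(kv$key), character(1))
  groups <- split(lapply(mapped, function(kv) kv$val), keys)

  # Reduce phase: call reduce once per key with the list of its values.
  unname(Map(function(k, vv) reduce(k, vv), names(groups), groups))
}

# Usage: count how often each value occurs.
input <- list(keyval(1, "a"), keyval(2, "b"), keyval(3, "a"))
counts <- mapreduce_local(
  input,
  map = function(k, v) keyval(v, 1),
  reduce = function(k, vv) keyval(k, sum(unlist(vv)))
)
```

The map function sees one record at a time, and the reduce function sees one key with all of its values. That contract is what lets the real job be spread across many nodes.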
17. Some Simple Things
Example showing sampling and counting
map = function(k, v) if (hash(k) %% 10 == 0) keyval(k, v)
reduce = function(k, vv) keyval(k, length(vv))
mapreduce(input, output, map, reduce)
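What the sampling-and-counting example computes can be checked on a small in-memory vector. This is a sketch in plain R, not an rmr job. The slide does not define `hash`, so `toy_hash` below is a hypothetical stand-in and the keys are made up.

```r
# Hypothetical stand-in for hash(): the slide does not define it.
toy_hash <- function(k) as.integer(k)

# Made-up record keys; a key may appear on several records.
keys <- c(10, 10, 10, 20, 7, 30, 30, 13)

# Map step: keep a record only when its key hashes into bucket 0 of 10.
sampled <- keys[toy_hash(keys) %% 10L == 0L]

# Reduce step: count how many records survived for each key.
counts <- table(sampled)
```

Because the sample is taken by key rather than by record, all records for a selected key are kept together, so the per-key counts are exact for the keys that were sampled.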
18. More Simple Things
HIVE
INSERT OVERWRITE TABLE pv_gender_sum
SELECT pv_users.gender, count (DISTINCT pv_users.userid)
FROM pv_users
GROUP BY pv_users.gender;
rmr
mapreduce(input =
    mapreduce(input = "pv_users",
        map = function(k, v) keyval(v['userid'], v['gender']),
        reduce = function(k, vv) keyval(k, vv[[1]])),
    output = "pv_gender_sum",
    map = function(k, v) keyval(v, 1),
    reduce = function(k, vv) keyval(k, sum(unlist(vv))))
Takeaways
A language like HIVE makes a class of problems easy to solve, but it is not a general tool
The cost of doing the same operation in rmr is modest and it provides a broader set of capabilities
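The two-stage structure of the rmr version (deduplicate by user, then count by gender) can be traced on a tiny table. This is a sketch in base R with made-up data, not an rmr job, and the variable names are illustrative.

```r
# Made-up page-view records: a user can appear more than once.
pv_users <- data.frame(
  userid = c(1, 1, 2, 3, 3, 4, 5),
  gender = c("f", "f", "m", "f", "f", "m", "f"),
  stringsAsFactors = FALSE
)

# Stage 1 (inner job): key by userid and keep the first gender seen,
# which leaves exactly one record per distinct user.
first_gender <- vapply(
  split(pv_users$gender, pv_users$userid),
  function(g) g[[1]],
  character(1)
)

# Stage 2 (outer job): emit (gender, 1) per user and sum per gender.
distinct_users <- vapply(
  split(rep(1, length(first_gender)), first_gender),
  sum,
  numeric(1)
)
```

The result matches what `count(DISTINCT userid) ... GROUP BY gender` would return for this data, provided each user has a single gender value.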
22. k-means - Implementation
Well known design (MacQueen, 1967)
Comparison of k-means implementations in MapReduce
Pig
  From Hortonworks
  Requires coding in 3 languages (Python-Pig-Java)
  100 lines of code
rmr
  20 lines of only R code
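For orientation, one k-means iteration splits naturally into a map step (assign each point to its nearest center) and a reduce step (average the points for each center). The following is a minimal base R sketch on in-memory data, not the rmr implementation the slide refers to. The function name `kmeans_iteration` and the sample points are illustrative.

```r
# One k-means iteration phrased as a map step and a reduce step.
# points: n x d matrix, one point per row; centers: K x d matrix.
kmeans_iteration <- function(points, centers) {
  # Map: key each point by the index of its nearest center
  # (squared Euclidean distance to every center, then the minimum).
  nearest <- apply(points, 1, function(p) {
    which.min(colSums((t(centers) - p)^2))
  })

  # Reduce: the new center for each key is the mean of its points.
  # A center that attracted no points keeps its previous position.
  new_centers <- centers
  for (k in unique(nearest)) {
    new_centers[k, ] <- colMeans(points[nearest == k, , drop = FALSE])
  }
  new_centers
}

# Usage: two well-separated groups of three points each.
points <- rbind(c(0, 0), c(0, 1), c(1, 0), c(10, 10), c(10, 11), c(11, 10))
centers <- points[c(1, 4), ]
for (i in 1:5) centers <- kmeans_iteration(points, centers)
```

In a MapReduce setting the loop body becomes one job per iteration, with the current centers passed to every map task.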
24. k-means - Optimizations
Slow | Fast | Notes
for(i in 1:100) a[i] = b[i] + c[i] | a = b + c | light use of R interpreter, use fast vector primitives, C if necessary
[1, 2, 3, 4, 5] | [[1, 2, 3, 4, 5], [6, 7, 8, 9, 10], [11, 12, 13, 14, 15] ... | use beefier records, say 1k points per record
distance(center, point) | norm(center - P) | compute all distances with fast matrix operations
combiner = FALSE | combiner = TRUE | reduce often and early, use combiner
keyval(k, mean(…)) | keyval(k, c(total, count)) | replace means with (sum, count) pairs to enable early reduction
https://github.com/RevolutionAnalytics/RHadoop/wiki/Fast-k-means
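The last optimization, replacing means with (sum, count) pairs, is worth a small worked example because it is what makes a combiner safe to use. This is a sketch in base R with made-up numbers. The two chunks stand for partial data seen by two hypothetical map tasks for the same cluster key.

```r
# Partial data for one cluster key, as seen by two hypothetical map tasks.
chunk_a <- c(1, 2, 3, 4)
chunk_b <- c(10, 20)

# Averaging per-chunk means is wrong when the chunk sizes differ.
mean_of_means <- mean(c(mean(chunk_a), mean(chunk_b)))

# (sum, count) pairs can be added in any order and in any grouping,
# so a combiner can merge them early without changing the result.
partial <- function(x) c(total = sum(x), count = length(x))
merged <- partial(chunk_a) + partial(chunk_b)
combined_mean <- merged[["total"]] / merged[["count"]]

# Reference value computed over all of the data at once.
true_mean <- mean(c(chunk_a, chunk_b))
```

Here `combined_mean` equals `true_mean`, while `mean_of_means` does not, which is why the reducer should emit the mean only at the very end.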
25. Final thoughts
R and Hadoop together offer the innovation and flexibility needed to meet the analytics challenges of big data
We need contributors to this project!
Developers
Documentation
Use cases
General Feedback
26. Resources
RHadoop open source project: https://github.com/RevolutionAnalytics/RHadoop/wiki
Revolution R Enterprise: bit.ly/Enterprise-R
Cloudera CDH: http://www.cloudera.com/hadoop/
Email: rhadoop@revolutionanalytics.com
27. Thank you.
The leading commercial provider of software and support for the popular
open source R statistics language.
www.revolutionanalytics.com 650.330.0553 Twitter: @RevolutionR