The Powerful Marriage of Hadoop and R (David Champagne)

Revolution Analytics

Leveraging R in Hadoop Environments

November 9, 2011

In Today’s Presentation:
About Revolution Analytics
Why R and Hadoop?
The Packages (rhdfs, rhbase, rmr)
Examples

Most advanced statistical analysis software available
Half the cost of commercial alternatives
The professor who invented analytic software for the experts now wants to take it to the masses

2M+ Users
3,000+ Packages
Power, Productivity, Enterprise Readiness
Capabilities: Statistics, Predictive Analytics, Data Mining, Visualization
Industries: Finance, Life Sciences, Manufacturing, Retail, Telecom, Social Media, Government

What’s the Difference Between R and Revolution R Enterprise?
Revolution R is 100% R and More®
Multi-Threaded Math Libraries
Web-Based GUI
Web Services API
Big Data Analysis
Parallel Tools
Technical Support
IDE / Developer GUI
Build Assurance
3,000+ Community Packages
R Engine Language Libraries

For more information contact: info@revolutionanalytics.com

Let’s Talk about R and Hadoop

Why R and Hadoop?
Hadoop - a scalable infrastructure for
processing massive amounts of data
Storage – HDFS, HBASE
Distributed Computing - MapReduce
R - a statistical programming language
Need for more than counts and averages
Analyze all of the data

Motivation for this project

Make it easy for the R programmer to
interact with the Hadoop data stores and
write MapReduce programs
Run R on a massively distributed system
without having to understand the underlying
infrastructure
Statisticians stay focused on the analysis
Open source

R and Hadoop – The R Packages

Capabilities delivered as individual R packages
rhdfs - R and HDFS
rhbase - R and HBASE
rmr - R and MapReduce
Downloads available from GitHub
[Diagram labels: R Client, Job Tracker, Task Node, Map or Reduce, R, rmr, rhdfs, rhbase, Thrift, HDFS, HBASE]

rhdfs
Manipulate HDFS directly from R
Mimic as much of the HDFS Java API as
possible
Examples:
Read an HDFS text file into a data frame
Serialize/Deserialize a model to HDFS
Write an HDFS file to local storage
rhdfs/pkg/inst/unitTests
rhdfs/pkg/inst/examples
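
As a rough illustration of the "Serialize/Deserialize a model" example, the sketch below shows only the local half of the round trip in base R: turning a fitted model into bytes and back. The HDFS write and read are left out because they need a live cluster; the assumption here is that the raw vector is what would be handed to the rhdfs file functions. The model and data set are illustrative choices, not taken from the slides.

```r
# Fit a small model on a built-in data set (illustrative choice).
model <- lm(mpg ~ wt, data = mtcars)

# serialize() with connection = NULL returns the object as a raw vector.
bytes <- serialize(model, connection = NULL)

# ... here the raw vector would be written to HDFS and later read back ...

# unserialize() rebuilds the model from the raw vector.
restored <- unserialize(bytes)

# The restored model carries the same coefficients as the original.
same_coefs <- isTRUE(all.equal(coef(model), coef(restored)))
```
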

rhdfs Functions
File Manipulations - hdfs.copy, hdfs.move, hdfs.rename,
hdfs.delete, hdfs.rm, hdfs.del, hdfs.chown, hdfs.put,
hdfs.get
File Read/Write - hdfs.file, hdfs.write, hdfs.close, hdfs.flush,
hdfs.read, hdfs.seek, hdfs.tell, hdfs.line.reader,
hdfs.read.text.file
Directory - hdfs.dircreate, hdfs.mkdir
Utility - hdfs.ls, hdfs.list.files, hdfs.file.info, hdfs.exists
Initialization – hdfs.init, hdfs.defaults

rhbase
Manipulate HBASE tables and their content
Uses the Thrift C++ API as the mechanism to
communicate with HBASE
Examples
Create a data frame from a collection of rows
and columns in an HBASE table
Update an HBASE table with values from a data
frame
rhbase/pkg/inst/unitTests

rhbase Functions
Table Manipulation – hb.new.table, hb.delete.table,
hb.describe.table, hb.set.table.mode, hb.regions.table
Row Read/Write - hb.insert, hb.get, hb.delete,
hb.insert.data.frame, hb.get.data.frame, hb.scan
Utility - hb.list.tables
Initialization - hb.defaults, hb.init

Writing MapReduce programs in R

rmr - For R Programmers
• A way to access big data sets
• A simple way to write parallel programs –
everyone will have to
• Very R-like, building on the functional
characteristics of R
• Just a library

rmr – For MapReduce Developers
• Much simpler than writing Java
• Not as simple as Hive or Pig at what they do,
but more general
• Great for prototyping, can transition to
production -- optimize instead of rewriting!
Lower risk, always executable.

rmr mapreduce Function
mapreduce(input, output, map, reduce, …)

input – input folder
output – output folder
map – R function used as map
reduce – R function used as reduce

… - other advanced parameters
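
To make the map and reduce contract concrete without a cluster, here is a minimal in-memory sketch. The function local_mapreduce is a hypothetical stand-in written for this illustration and is not rmr's mapreduce; it takes a list of records rather than folders, and keyval is redefined locally so the block runs in plain R.

```r
# Local stand-in for a key-value pair (defined here so the block is self-contained).
keyval <- function(k, v) list(key = k, val = v)

# Hypothetical in-memory imitation of the map -> group -> reduce contract.
local_mapreduce <- function(input, map, reduce) {
  # Map phase: call map(k, v) on every record and drop NULL results.
  mapped <- Filter(Negate(is.null),
                   lapply(input, function(kv) map(kv$key, kv$val)))
  # Shuffle phase: group the mapped values by key.
  keys <- vapply(mapped, function(kv) as.character(kv$key), character(1))
  groups <- split(lapply(mapped, function(kv) kv$val), keys)
  # Reduce phase: call reduce(k, vv) once per distinct key.
  lapply(names(groups), function(k) reduce(k, groups[[k]]))
}

# Count how often each word occurs.
input <- lapply(c("r", "hadoop", "r", "hdfs", "r"), function(w) keyval(NULL, w))
out <- local_mapreduce(
  input,
  map = function(k, v) keyval(v, 1),
  reduce = function(k, vv) keyval(k, sum(unlist(vv)))
)
counts <- setNames(vapply(out, function(kv) kv$val, numeric(1)),
                   vapply(out, function(kv) kv$key, character(1)))
```

The map function sees one record at a time, and the reduce function sees one key together with the list of all values emitted for it.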

Some Simple Things
Example showing sampling and counting

map = function(k, v) if (hash(k) %% 10 == 0) keyval(k, v)
reduce = function(k, vv) keyval(k, length(vv))
mapreduce(input, output, map, reduce)
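
The sampling and counting example can be traced locally as follows. This is a sketch under assumptions: the slide does not say which hash function is meant, so a trivial hash for integer keys is used here, and the three phases are simulated in memory instead of calling rmr. The input records are made up for the illustration.

```r
keyval <- function(k, v) list(key = k, val = v)

# Hypothetical hash for integer keys; the slide does not define hash().
hash <- function(k) as.integer(k)

# Map and reduce as written on the slide.
map <- function(k, v) if (hash(k) %% 10 == 0) keyval(k, v)
reduce <- function(k, vv) keyval(k, length(vv))

# Made-up input: keys 1..30 twice each, plus one extra record for key 20.
keys <- c(rep(1:30, each = 2), 20L)
input <- lapply(keys, function(k) keyval(k, paste0("record-", k)))

# Map phase: records whose key fails the test return NULL and are dropped.
mapped <- Filter(Negate(is.null),
                 lapply(input, function(kv) map(kv$key, kv$val)))
# Shuffle phase: group the surviving values by key.
groups <- split(lapply(mapped, `[[`, "val"),
                vapply(mapped, function(kv) as.character(kv$key), character(1)))
# Reduce phase: count the values for each sampled key.
result <- lapply(names(groups), function(k) reduce(k, groups[[k]]))
counts <- setNames(vapply(result, `[[`, integer(1), "val"),
                   vapply(result, `[[`, character(1), "key"))
```

Only about one key in ten passes the map step, so the reducer counts records for a sample of the keys rather than for all of them.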

More Simple Things
HIVE
INSERT OVERWRITE TABLE pv_gender_sum
SELECT pv_users.gender, count (DISTINCT pv_users.userid)
FROM pv_users
GROUP BY pv_users.gender;

rmr
mapreduce(input =
    mapreduce(input = "pv_users",
        map = function(k, v) keyval(v['userid'], v['gender']),
        reduce = function(k, vv) keyval(k, vv[[1]])),
    output = "pv_gender_sum",
    map = function(k, v) keyval(v, 1),
    reduce = function(k, vv) keyval(k, sum(unlist(vv))))
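
The two-stage job above can be checked on a handful of records in memory. In this sketch local_mapreduce is a hypothetical stand-in for rmr's mapreduce, the page-view records are made up, and the map and reduce functions are the ones from the slide. The first stage keeps one gender per distinct user; the second counts users per gender.

```r
keyval <- function(k, v) list(key = k, val = v)

# Hypothetical in-memory stand-in for mapreduce (not the rmr function).
local_mapreduce <- function(input, map, reduce) {
  mapped <- Filter(Negate(is.null),
                   lapply(input, function(kv) map(kv$key, kv$val)))
  keys <- vapply(mapped, function(kv) as.character(kv$key), character(1))
  groups <- split(lapply(mapped, function(kv) kv$val), keys)
  lapply(names(groups), function(k) reduce(k, groups[[k]]))
}

# Made-up page views; some users appear more than once.
pv_users <- list(
  c(userid = "u1", gender = "F"), c(userid = "u1", gender = "F"),
  c(userid = "u2", gender = "M"), c(userid = "u3", gender = "F"),
  c(userid = "u2", gender = "M"), c(userid = "u4", gender = "M"),
  c(userid = "u5", gender = "F")
)
input <- lapply(pv_users, function(v) keyval(NULL, v))

# Stage 1: one record per distinct user, carrying that user's gender.
distinct_users <- local_mapreduce(
  input,
  map = function(k, v) keyval(v['userid'], v['gender']),
  reduce = function(k, vv) keyval(k, vv[[1]])
)

# Stage 2: count distinct users per gender.
pv_gender_sum <- local_mapreduce(
  distinct_users,
  map = function(k, v) keyval(v, 1),
  reduce = function(k, vv) keyval(k, sum(unlist(vv)))
)
by_gender <- setNames(vapply(pv_gender_sum, function(kv) kv$val, numeric(1)),
                      vapply(pv_gender_sum, function(kv) kv$key, character(1)))
```
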

Takeaways
A language like HIVE makes a class of problems easy to solve, but it is not a general tool
The cost of doing the same operation in rmr is modest and it provides a broader set of capabilities

Complex Things
k-means Clustering

k-means - Implementation
Well known design (MacQueen, 1967)
Comparison of k-means implementations in MapReduce
Pig
From Hortonworks
Requires coding in 3 languages (Python-Pig-Java)
100 lines of code
rmr
20 lines of only R code

k-means - Highlights

map = function(k,v)
keyval(which.min(distances(centers,v)),v)

reduce = function(k,vv)
keyval(NULL, col.average(vv))

centers = from.dfs(
mapreduce("data-points", map, reduce))
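
One k-means iteration in this style can be sketched in plain R. The slide names distances and col.average without defining them, so the versions below are assumptions made for this illustration, and the map, shuffle and reduce phases are simulated in memory rather than run through rmr. The points and starting centers are made up.

```r
keyval <- function(k, v) list(key = k, val = v)

# Hypothetical helpers; the slide uses these names but does not define them.
distances <- function(centers, p) sqrt(rowSums(sweep(centers, 2, p)^2))
col.average <- function(vv) colMeans(do.call(rbind, vv))

kmeans_step <- function(points, centers) {
  # Map: key each point by the index of its nearest center.
  mapped <- lapply(seq_len(nrow(points)), function(i)
    keyval(unname(which.min(distances(centers, points[i, ]))), points[i, ]))
  # Shuffle: group the points by center index.
  groups <- split(lapply(mapped, `[[`, "val"),
                  vapply(mapped, function(kv) kv$key, integer(1)))
  # Reduce: the average of each group is the new center.
  # A center that attracts no points simply drops out in this sketch.
  unname(do.call(rbind, lapply(groups, col.average)))
}

# Two well-separated groups of made-up points.
points <- rbind(c(0, 0), c(0, 1), c(1, 0), c(1, 1),
                c(10, 10), c(10, 11), c(11, 10), c(11, 11))
centers <- rbind(c(0, 0), c(5, 5))

# Repeat the step a few times; each pass plays the role of one mapreduce job.
for (i in 1:3) centers <- kmeans_step(points, centers)
```
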

k-means - Optimizations
Slow: for(i in 1:100) a[i] = b[i] + c[i]
Fast: a = b + c
Notes: light use of R interpreter, use fast vector primitives, C if necessary

Slow: [1, 2, 3, 4, 5]
Fast: [[1, 2, 3, 4, 5], [6, 7, 8, 9, 10], [11, 12, 13, 14, 15]...
Notes: use beefier records, say 1k points per record

Slow: distance(center, point)
Fast: norm(center - P)
Notes: compute all distances with fast matrix operations

Slow: combiner = FALSE
Fast: combiner = TRUE
Notes: reduce often and early, use combiner

Slow: keyval(k, mean(…))
Fast: keyval(k, c(total, count))
Notes: replace means with (sum, count) pairs to enable early reduction

https://github.com/RevolutionAnalytics/RHadoop/wiki/Fast-k-means
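
Two of these optimizations can be illustrated in base R. The sketch below first compares a per-point distance loop with a single matrix expression, then shows why (sum, count) pairs can be combined early while plain means cannot. The data and variable names are made up for the illustration.

```r
# Distances from one center to several points: loop versus matrix form.
P <- rbind(c(0, 0), c(3, 4), c(6, 8))
center <- c(0, 0)
slow <- vapply(seq_len(nrow(P)),
               function(i) sqrt(sum((P[i, ] - center)^2)), numeric(1))
fast <- sqrt(rowSums(sweep(P, 2, center)^2))

# Values split into two uneven batches, as a combiner might see them.
values <- c(1, 2, 3, 4, 10)
batches <- list(values[1:4], values[5])

# Averaging the batch averages gives the wrong answer for uneven batches.
mean_of_means <- mean(vapply(batches, mean, numeric(1)))

# (sum, count) pairs add up correctly however the data is batched.
partials <- lapply(batches, function(b) c(total = sum(b), count = length(b)))
combined <- Reduce(`+`, partials)
true_mean <- combined[["total"]] / combined[["count"]]
```

Because adding (sum, count) pairs is associative, a combiner can apply it to partial results on each node before the final reduce.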

Final thoughts
R and Hadoop together offer the innovation and
flexibility needed to meet the analytics
challenges of big data
We need contributors to this project!
Developers
Documentation
Use cases
General Feedback

Resources

RHadoop Open source project:
https://github.com/RevolutionAnalytics/RHadoop/wiki

Revolution R Enterprise: bit.ly/Enterprise-R

Cloudera CDH:
http://www.cloudera.com/hadoop/

Email: rhadoop@revolutionanalytics.com

Thank you.

The leading commercial provider of software and support for the popular
open source R statistics language.

www.revolutionanalytics.com 650.330.0553 Twitter: @RevolutionR
