Distributed Data Analysis with Hadoop and R
Jonathan Seidman and Ramesh Venkataramaiah, Ph. D.

             OSCON Data 2011
Flow of this Talk

•  Introductions
•  Hadoop, R, and Interfacing
•  Our Prototypes
•  A use case for interfacing Hadoop and R
•  Alternatives for Running R on Hadoop
•  Alternatives to Hadoop and R
•  Conclusions
•  References
Who We Are
•  Ramesh Venkataramaiah, Ph. D.
   –  Principal Engineer, TechOps
   –  rvenkataramaiah@orbitz.com
   –  @rvenkatar

•  Jonathan Seidman
   –  Lead Engineer, Business Intelligence/Big Data Team
   –  Co-founder/organizer of Chicago Hadoop User Group
      (http://www.meetup.com/Chicago-area-Hadoop-User-Group-CHUG) and
      Chicago Big Data (http://www.meetup.com/Chicago-Big-Data/)
   –  jseidman@orbitz.com
   –  @jseidman


•  Orbitz Careers
   –  http://careers.orbitz.com/
   –  @OrbitzTalent
Launched in 2001, Chicago, IL. Over 160 million bookings.
Hadoop and R as an analytic platform?
What is Hadoop?

A distributed file system (HDFS) and parallel processing framework.

•  Uses the MapReduce programming model at its core.
•  Provides fault-tolerant, scalable storage of very large datasets
   across the machines in a cluster.
What is R? When do we need it?

An open-source statistics package with visualization:
•  Vibrant community support.
•  One-line calculations galore!
•  Steep learning curve, but worth it!

Use it to gain insight into statistical properties and trends, for
machine learning, or to understand Big Data well.
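As a taste of those one-line calculations, a small vector of departure delays (values invented here purely for illustration) can be summarized in a single expression:

```r
# Illustrative only: per-flight departure delays in minutes
delays <- c(17, 0, 11, -2, 1, 14)

summary(delays)          # min, quartiles, mean, and max in one line
round(mean(delays), 2)   # average delay: 6.83
```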
Our Options

•  Data volume reduction by sampling
   –  Very bad for long-tail data distributions
   –  Approximation leads to bad conclusions
•  Scaling R
   –  Still in-memory
   –  But can be made parallel using Segue, RHIPE, RHive…
•  SQL-like interfaces
   –  Apache Hive with Hadoop
   –  File sprawl and process issues
•  Regular DBMS
   –  Fitting a square peg into a round hole
   –  No in-line R calls from SQL, though commercial efforts are underway

•  This talk: interface Hadoop with R over dataspaces
Why Interface Hadoop and R at the Cluster Level?

•  On its own, R works on in-memory, standalone data and is mostly
   single-threaded.
   –  The multicore package helps here, on a single machine.
•  HDFS can be both the data and analytic store.
•  Interfacing with Hadoop brings parallel processing capability to
   the R environment.

How do we interface Hadoop and R at the cluster level?
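For context, the single-machine parallelism that multicore offers can be sketched with mclapply() (shown here via the parallel package, which later absorbed multicore; the workload is invented for illustration):

```r
library(parallel)  # successor to the multicore package

# Illustrative sketch: evaluate a function over a list using two local
# worker processes. Unlike Hadoop, this is limited to one machine's
# cores and memory -- it does not distribute work across a cluster.
squares <- mclapply(1:8, function(x) x^2, mc.cores = 2)
unlist(squares)  # 1 4 9 16 25 36 49 64
```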
Our Prototypes

•  User segmentations
•  Hotel bookings
•  Airline performance*

* Public dataset
Before Hadoop
With Hadoop
Getting Buy-in

We presented a long-term, unstructured data growth story and explained
how this would help harness long-tail opportunities at the lowest cost.

     Traditional DW               Big Data
     -  Classical stats           -  Specific spikes
     -  Sampling                  -  The median is not the message*

* From a blog
Workload and Resource Partition
User Segmentation by Browsers
Seasonal Variations

•  Customers' hotel stays get longer during summer months.
•  Could help in designing search based on seasons.
Airline Performance
Description of Use Case

•  Analyze an openly available dataset: airline on-time performance.
•  The dataset was used in the 2009 Visualization Poster Competition.
   –  Consists of flight arrival/departure details from 1987–2008.
   –  Approximately 120 MM records totaling 120 GB.
•  Available at: http://stat-computing.org/dataexpo/2009/
Airline Delay Plot: R code

> deptdelays.monthly.full <- read.delim("~/OSCON2011/Delays_by_Month.dat", header=F)
> View(deptdelays.monthly.full)
> names(deptdelays.monthly.full) <- c("Year","Month","Count","Airline","Delay")
> Delay_by_Month <- deptdelays.monthly.full[order(deptdelays.monthly.full$Delay, decreasing=TRUE),]
> Top_10_Delay_by_Month <- Delay_by_Month[1:10,]
> Top_10_Normal <- ((Delay - mean(Delay)) / sd(Delay))
> symbols(Month, Delay, circles=Top_10_Normal, inches=.3, fg="white", bg="red", …)
> text(Month, Delay, Airline, cex=0.5)
Multiple Distributions: R code

> library(lattice)
> deptdelays.monthly.full$Year <- as.character(deptdelays.monthly.full$Year)
> h <- histogram(~Delay|Year, data=deptdelays.monthly.full, layout=c(5,5))
> update(h)
Running R on Hadoop: Hadoop Streaming
Hadoop Streaming – Overview

•  An alternative to the Java MapReduce API which allows you to
   write jobs in any language supporting stdin/stdout.
•  Limited to text data in current versions of Hadoop; support for
   binary streams was added in 0.21.0.
•  Requires installation of R on all DataNodes.
Hadoop Streaming – Dataflow

Input to map (one CSV record per line)*:

  1988,1,9,6,1348,1331,1458,1435,PI,942,NA,70,64,NA,23,17,SYR,BWI...
  1988,1,17,7,1331,1331,1440,1435,PI,942,NA,69,64,NA,5,0,SYR,BWI…
  1987,10,14,3,741,730,912,849,PS,1451,NA,91,79,NA,23,11,SAN,SFO...
  1987,10,21,3,728,730,848,849,PS,1451,NA,80,79,NA,-1,-2,SAN,SFO...
  1987,10,23,5,731,730,902,849,PS,1451,NA,91,79,NA,13,1,SAN,SFO…
  1987,10,30,5,1712,1658,1811,1800,DL,475,NA,59,62,NA,11,14,LEX,ATL...

Output from map (tab-separated key/value pairs):

  PI|1988|1     17
  PI|1988|1     0
  PS|1987|10    11
  PS|1987|10    -2
  PS|1987|10    1
  DL|1987|10    14

* Map function receives input records line-by-line via standard input.
Hadoop Streaming – Dataflow Cont'd

Input to reduce (key/value pairs sorted by key)*:

  DL|1987|10    14
  PI|1988|1     0
  PI|1988|1     17
  PS|1987|10    1
  PS|1987|10    11
  PS|1987|10    -2

Output from reduce (year, month, count, airline, mean delay):

  1987    10    1    DL    14
  1988    1     2    PI    8.5
  1987    10    3    PS    3.333333

* Reduce receives map output key/value pairs sorted by key, line-by-line.
Hadoop Streaming – Example

map.R
   #!/usr/bin/env Rscript

   con <- file("stdin", open = "r")
   while (length(line <- readLines(con, n = 1, warn = FALSE)) > 0) {
      fields <- unlist(strsplit(line, ","))
      # Skip the header line and bad records:
      if (!(identical(fields[[1]], "Year")) & length(fields) == 29) {
         deptDelay <- fields[[16]]
         # Skip records where departure delay is "NA":
         if (!(identical(deptDelay, "NA"))) {
            cat(paste(fields[[9]], "|", fields[[1]], "|", fields[[2]],
                      sep=""), "\t", deptDelay, "\n")
         }
      }
   }
   close(con)
Hadoop Streaming – Example Cont'd

reduce.R
 #!/usr/bin/env Rscript

 con <- file("stdin", open = "r")
 delays <- numeric(0) # vector of departure delays
 lastKey <- ""
 while (length(line <- readLines(con, n = 1, warn = FALSE)) > 0) {
   split <- unlist(strsplit(line, "\t"))
   key <- split[[1]]
   deptDelay <- as.numeric(split[[2]])

   if (!(identical(lastKey, "")) & (!(identical(lastKey, key)))) {
      # Key changed: emit stats for the previous key, then reset.
      keySplit <- unlist(strsplit(lastKey, "\\|"))
      cat(keySplit[[2]], "\t", keySplit[[3]], "\t", length(delays), "\t",
          keySplit[[1]], "\t", mean(delays), "\n")
      lastKey <- key
      delays <- c(deptDelay)
   } else {
      lastKey <- key
      delays <- c(delays, deptDelay)
   }
 }
 # Emit stats for the final key:
 keySplit <- unlist(strsplit(lastKey, "\\|"))
 cat(keySplit[[2]], "\t", keySplit[[3]], "\t", length(delays), "\t",
     keySplit[[1]], "\t", mean(delays), "\n")
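A job using map.R and reduce.R would be submitted through the streaming jar; the jar location and HDFS paths below are illustrative and vary by Hadoop version, so treat this as a sketch rather than an exact command:

```shell
# Illustrative submission (0.20-era contrib layout; adjust paths as needed):
hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-streaming-*.jar \
  -input   /data/airline \
  -output  /dept-delay-month \
  -mapper  map.R \
  -reducer reduce.R \
  -file map.R -file reduce.R   # ship the R scripts to the task nodes
```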
Running R on Hadoop: Hadoop Interactive
Hadoop Interactive (hive) – Overview

•  Very unfortunate acronym.
•  Provides an interface to Hadoop from the R environment.
•  Provides R functions to access HDFS – DFS_list(),
   DFS_dir_create(), DFS_put()…
•  Provides R functions to control Hadoop – hive_start(),
   hive_stop()…
•  Allows streaming jobs to be executed from R via hive_stream().
•  Allows HDFS data, including the output from MapReduce
   processing, to be manipulated and analyzed directly from R.
Hadoop Interactive – Example

#!/usr/bin/env Rscript

mapper <- function() {
...
}

reducer <- function() {
...
}

library(hive)
DFS_dir_remove("/dept-delay-month", recursive = TRUE, henv = hive())
hive_stream(mapper = mapper, reducer = reducer,
            input="/data/airline", output="/dept-delay-month",
            henv = hive())
results <- DFS_read_lines("/dept-delay-month/part-r-00000",
                          henv = hive())
Running R on Hadoop: RHIPE
RHIPE – Overview

•  Active project with frequent updates and an active community.
•  RHIPE is based on the Hadoop streaming source, but provides
   some significant enhancements, such as support for binary files.
•  Developed to provide R users with access to the same Hadoop
   functionality available to Java developers.
   –  For example, provides rhcounter() and rhstatus(),
      analogous to counters and the reporter interface in the Java
      API.
•  Can be somewhat confusing and intimidating.
   –  Then again, the same can be said for the Java API.
   –  Worth taking the time to get comfortable with.
RHIPE – Overview Cont'd

•  Allows developers to work directly on data stored in HDFS from
   the R environment.
•  Also allows developers to write MapReduce jobs in R and
   execute them on the Hadoop cluster.
•  RHIPE uses Google protocol buffers to serialize data. Most R
   data types are supported.
   –  Using protocol buffers increases efficiency and provides
      interoperability with other languages.
•  Must be installed on all DataNodes.
RHIPE – MapReduce

map <- expression({})
reduce <- expression(
       pre = {…},
       reduce = {…},
       post = {…}
 )
z <- rhmr(map=map, reduce=reduce,
          inout=c("text","sequence"),
          ifolder=INPUT_PATH,
          ofolder=OUTPUT_PATH,
          …)
rhex(z)
RHIPE – Dataflow

Input to map*:

  Keys = […]
  Values = [1988,1,9,6,1348,1331,1458,1435,PI,942,NA,70,64,NA,23,17,SYR,BWI...
            1988,1,17,7,1331,1331,1440,1435,PI,942,NA,69,64,NA,5,0,SYR,BWI…
            1987,10,14,3,741,730,912,849,PS,1451,NA,91,79,NA,23,11,SAN,SFO...
            1987,10,21,3,728,730,848,849,PS,1451,NA,80,79,NA,-1,-2,SAN,SFO...
            1987,10,23,5,731,730,902,849,PS,1451,NA,91,79,NA,13,1,SAN,SFO…
            1987,10,30,5,1712,1658,1811,1800,DL,475,NA,59,62,NA,11,14,LEX,ATL...]

Output from map:

  PI|1988|1     17
  PI|1988|1     0
  PS|1987|10    11
  PS|1987|10    -2
  PS|1987|10    1
  DL|1987|10    14

* Note that input to map is a vector of keys and a vector of values.
RHIPE – Dataflow Cont'd

Input to reduce*:

  DL|1987|10    [14]
  PI|1988|1     [0, 17]
  PS|1987|10    [1, 11, -2]

Output from reduce (year, month, count, airline, mean delay):

  1987    10    1    DL    14
  1988    1     2    PI    8.5
  1987    10    3    PS    3.333333

* Note that input to reduce is each unique key and a vector of values
  associated with that key.
RHIPE – Example

#!/usr/bin/env Rscript

library(Rhipe)
rhinit(TRUE, TRUE)

map <- expression({
  # For each input record, parse out required fields and output new record:
  mapLine = function(line) {
     fields <- unlist(strsplit(line, ","))
     # Skip header lines and bad records:
     if (!(identical(fields[[1]], "Year")) & length(fields) == 29) {
        deptDelay <- fields[[16]]
        # Skip records where departure delay is "NA":
        if (!(identical(deptDelay, "NA"))) {
           # field[9] is carrier, field[1] is year, field[2] is month:
           rhcollect(paste(fields[[9]], "|", fields[[1]], "|", fields[[2]], sep=""),
                     deptDelay)
        }
     }
  }
  # Process each record in map input:
  lapply(map.values, mapLine)
})
RHIPE – Example Cont'd

reduce <- expression(
   reduce = {
      count <- length(reduce.values)
      avg <- mean(as.numeric(reduce.values))
      keySplit <- unlist(strsplit(reduce.key, "\\|"))
   },
   post = {
      rhcollect(keySplit[[2]],
                paste(keySplit[[3]], count, keySplit[[1]], avg, sep="\t"))
   }
)

# Create job object:
z <- rhmr(map=map, reduce=reduce,
          ifolder="/data/airline/", ofolder="/dept-delay-month",
          inout=c('text', 'text'), jobname='Avg Departure Delay By Month')
# Run it:
rhex(z)
RHIPE – Example Cont'd

library(lattice)
rhget("/dept-delay-month/part-r-00000", "deptdelay.dat")
deptdelays.monthly.full <- read.delim("deptdelay.dat", header=F)
names(deptdelays.monthly.full) <- c("Year","Month","Count","Airline","Delay")
deptdelays.monthly.full$Year <- as.character(deptdelays.monthly.full$Year)
h <- histogram(~Delay|Year, data=deptdelays.monthly.full, layout=c(5,5))
update(h)
Running R on Hadoop: Segue
Segue – Overview

•  Intended to work around single-threading in R by taking
   advantage of Hadoop streaming to provide simple parallel
   processing.
   –  For example, running multiple simulations in parallel.
•  Suitable for embarrassingly ("pleasantly") parallel problems – big
   CPU, not big data.
•  Runs on Amazon's Elastic MapReduce (EMR).
   –  Not intended for internal clusters.
•  Provides emrlapply(), a parallel version of lapply().
Segue – Example

estimatePi <- function(seed){
   set.seed(seed)
   numDraws <- 1e6

   r <- .5 # radius... in case the unit circle is too boring
   x <- runif(numDraws, min=-r, max=r)
   y <- runif(numDraws, min=-r, max=r)
   inCircle <- ifelse( (x^2 + y^2)^.5 < r, 1, 0)

   return(sum(inCircle) / length(inCircle) * 4)
}

seedList <- as.list(1:1000)
require(segue)
myCluster <- createCluster(20)
myEstimates <- emrlapply(myCluster, seedList, estimatePi)
stopCluster(myCluster)
Predictive Analytics on Hadoop: Sawmill
Sawmill – Overview

•  A framework for integrating a PMML-compliant scoring engine
   with Hadoop.
•  Hadoop streaming allows easier integration of a scoring engine
   into reducer code (Python and R).
   –  The output of a MapReduce run becomes a segmented
      PMML model – one segment for each partition.
•  Training the models and scoring are separate MapReduce jobs.
•  Interoperates with open source scoring engines such as
   Augustus, as well as a forthcoming R scoring engine.
Alternatives

•  Apache Mahout
    –  Scalable machine learning library.
    –  Offers clustering, classification, and collaborative filtering on Hadoop.
•  Python
    –  Many modules available to support scientific and statistical computing.
•  Revolution Analytics
    –  Provides commercial packages to support processing big data with R.
•  Ricardo – looks interesting, but you can't have it.
•  Other HPC/parallel processing packages, e.g. Rmpi or snow.
•  Apache Hive + RJDBC?
    –  We haven't been able to get it to work yet.
    –  You can, however, wrap calls to the Hive client in R to return R objects. See
       https://github.com/satpreetsingh/rDBwrappers/wiki
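That last idea — wrapping the Hive command-line client from R — can be sketched as follows. The `hive -S -e` invocation and the query interface are assumptions for illustration (adjust to your installation); the output parsing is plain R:

```r
# Hypothetical sketch: run a HiveQL query via the CLI and parse the
# tab-separated result rows into a data frame.
parse_hive_output <- function(lines) {
  read.delim(text = paste(lines, collapse = "\n"),
             header = FALSE, sep = "\t")
}

hive_query <- function(sql) {
  # Assumes a `hive` client on the PATH; -S silences log chatter,
  # -e runs the query and prints result rows to stdout.
  parse_hive_output(system(paste("hive -S -e", shQuote(sql)), intern = TRUE))
}
```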
Conclusions

•  If practical, consider using Hadoop to aggregate data for input
   to R analyses.
•  Avoid using R for general-purpose MapReduce work.
•  For simple MapReduce jobs, or embarrassingly parallel jobs
   on a local cluster, consider Hadoop streaming (or Hadoop
   Interactive).
   –  Limited to processing text only.
   –  But easy to test at the command line outside of Hadoop:
      •  $ cat DATAFILE | ./map.R | sort | ./reduce.R
•  To run compute-bound analyses with relatively small amounts of
   data on the cloud, look at Segue.
•  Otherwise, look at RHIPE.
Conclusions Cont'd

•  Also look at alternatives – Mahout, Python, etc.
•  Make sure your cluster nodes are consistent – the same version of
   R installed, required libraries installed on each node, etc.
Example Code

•  https://github.com/jseidman/hadoop-R
References

•  Hadoop
   –  Apache Hadoop project: http://hadoop.apache.org/
   –  Hadoop: The Definitive Guide, Tom White, O'Reilly Press, 2011
•  R
   –  R Project for Statistical Computing: http://www.r-project.org/
   –  R Cookbook, Paul Teetor, O'Reilly Press, 2011
   –  Getting Started With R: Some Resources:
      http://quanttrader.info/public/gettingStartedWithR.html
References

•  Hadoop Streaming
  –  Documentation on the Apache Hadoop Wiki:
     http://hadoop.apache.org/mapreduce/docs/current/streaming.html
  –  Word count example in R:
     https://forums.aws.amazon.com/thread.jspa?messageID=129163
References

•  Hadoop InteractiVE
  –  Project page on CRAN:
     http://cran.r-project.org/web/packages/hive/index.html
  –  Simple Parallel Computing in R Using Hadoop:
     http://www.rmetrics.org/Meielisalp2009/Presentations/Theussl1.pdf
References

•  RHIPE
   –  RHIPE – R and Hadoop Integrated Processing Environment:
      http://www.stat.purdue.edu/~sguha/rhipe/
   –  Wiki: https://github.com/saptarshiguha/RHIPE/wiki
   –  Installing RHIPE on CentOS:
      https://groups.google.com/forum/#!topic/brumail/qH1wjtBgwYI
   –  Introduction to using RHIPE: http://ml.stat.purdue.edu/rhafen/rhipe/
   –  RHIPE combines Hadoop and the R analytics language, SD Times:
      http://www.sdtimes.com/link/34792
   –  Using R and Hadoop to Analyze VoIP Network Data for QoS, Hadoop World 2010:
       •  video:
          http://www.cloudera.com/videos/hw10_video_using_r_and_hadoop_to_analyze_voip_network_data_for_qos
       •  slides:
          http://www.cloudera.com/resource/hw10_voice_over_ip_studying_traffic_characteristics_for_quality_of_service
References

•  Segue
  –  Project page: http://code.google.com/p/segue/
  –  Google Group: http://groups.google.com/group/segue-r
  –  Abusing Amazon's Elastic MapReduce Hadoop service… easily, from R,
     Jeffrey Breen:
     http://jeffreybreen.wordpress.com/2011/01/10/segue-r-to-amazon-elastic-mapreduce-hadoop/
  –  Presentation at Chicago Hadoop Users Group, March 23, 2011:
     http://files.meetup.com/1634302/segue-presentation-RUG.pdf
References

•  Sawmill
   –  More information:
      •  Open Data Group: www.opendatagroup.com
      •  oscon-info@opendatagroup.com
   –  Augustus, an open source system for building & scoring
      statistical models:
      •  augustus.googlecode.com
   –  PMML
      •  Data Mining Group: dmg.org
   –  Analytics over Clouds using Hadoop, presentation at Chicago
      Hadoop User Group:
      http://files.meetup.com/1634302/CHUG 20100721 Sawmill.pdf
References

•  Ricardo
   –  Ricardo: Integrating R and Hadoop, paper:
      http://www.cs.ucsb.edu/~sudipto/papers/sigmod2010-das.pdf
   –  Ricardo: Integrating R and Hadoop, PowerPoint presentation:
      http://www.uweb.ucsb.edu/~sudipto/talks/Ricardo-SIGMOD10.pptx
References

•  General references on Hadoop and R
  –  Pete Skomoroch's R and Hadoop bookmarks:
     http://www.delicious.com/pskomoroch/R+hadoop
  –  Pigs, Bees, and Elephants: A Comparison of Eight MapReduce Languages:
     http://www.dataspora.com/2011/04/pigs-bees-and-elephants-a-comparison-of-eight-mapreduce-languages/
  –  Quora – How can R and Hadoop be used together?:
     http://www.quora.com/How-can-R-and-Hadoop-be-used-together
References

•  Mahout
   –  Mahout project: http://mahout.apache.org/
   –  Mahout in Action, Owen et al., Manning Publications, 2011
•  Python
   –  Think Stats: Probability and Statistics for Programmers, Allen
      B. Downey, O'Reilly Press, 2011
•  CRAN Task View: High-Performance and Parallel Computing with
   R, a set of resources compiled by Dirk Eddelbuettel:
   http://cran.r-project.org/web/views/HighPerformanceComputing.html
References

•  Other examples of airline data analysis with R:
   –  A simple Big Data analysis using the RevoScaleR package
      in Revolution R:
      http://www.r-bloggers.com/a-simple-big-data-analysis-using-the-revoscaler-package-in-revolution-r/
And finally…

Parallel R (working title), Q Ethan McCallum and Stephen
Weston, O'Reilly Press, due autumn 2011

    R meets Big Data – a basket of strategies to help you use R
    for large-scale analysis and computation.

Hive @ Hadoop day seattle_2010Hive @ Hadoop day seattle_2010
Hive @ Hadoop day seattle_2010
 
EDF2012 Kostas Tzouma - Linking and analyzing bigdata - Stratosphere
EDF2012   Kostas Tzouma - Linking and analyzing bigdata - StratosphereEDF2012   Kostas Tzouma - Linking and analyzing bigdata - Stratosphere
EDF2012 Kostas Tzouma - Linking and analyzing bigdata - Stratosphere
 
Big Data with Hadoop – For Data Management, Processing and Storing
Big Data with Hadoop – For Data Management, Processing and StoringBig Data with Hadoop – For Data Management, Processing and Storing
Big Data with Hadoop – For Data Management, Processing and Storing
 
Map reducecloudtech
Map reducecloudtechMap reducecloudtech
Map reducecloudtech
 
Hadoop World 2011: The Powerful Marriage of R and Hadoop - David Champagne, R...
Hadoop World 2011: The Powerful Marriage of R and Hadoop - David Champagne, R...Hadoop World 2011: The Powerful Marriage of R and Hadoop - David Champagne, R...
Hadoop World 2011: The Powerful Marriage of R and Hadoop - David Champagne, R...
 
Hive Percona 2009
Hive Percona 2009Hive Percona 2009
Hive Percona 2009
 
The Powerful Marriage of Hadoop and R (David Champagne)
The Powerful Marriage of Hadoop and R (David Champagne)The Powerful Marriage of Hadoop and R (David Champagne)
The Powerful Marriage of Hadoop and R (David Champagne)
 
Open source analytics
Open source analyticsOpen source analytics
Open source analytics
 
IJSRED-V2I3P43
IJSRED-V2I3P43IJSRED-V2I3P43
IJSRED-V2I3P43
 
Hadoop by sunitha
Hadoop by sunithaHadoop by sunitha
Hadoop by sunitha
 
Running R on Hadoop - CHUG - 20120815
Running R on Hadoop - CHUG - 20120815Running R on Hadoop - CHUG - 20120815
Running R on Hadoop - CHUG - 20120815
 
Getting started with R & Hadoop
Getting started with R & HadoopGetting started with R & Hadoop
Getting started with R & Hadoop
 
Improving Python and Spark Performance and Interoperability with Apache Arrow...
Improving Python and Spark Performance and Interoperability with Apache Arrow...Improving Python and Spark Performance and Interoperability with Apache Arrow...
Improving Python and Spark Performance and Interoperability with Apache Arrow...
 
Tez: Accelerating Data Pipelines - fifthel
Tez: Accelerating Data Pipelines - fifthelTez: Accelerating Data Pipelines - fifthel
Tez: Accelerating Data Pipelines - fifthel
 
Data Analytics and Machine Learning: From Node to Cluster on ARM64
Data Analytics and Machine Learning: From Node to Cluster on ARM64Data Analytics and Machine Learning: From Node to Cluster on ARM64
Data Analytics and Machine Learning: From Node to Cluster on ARM64
 
BKK16-404B Data Analytics and Machine Learning- from Node to Cluster
BKK16-404B Data Analytics and Machine Learning- from Node to ClusterBKK16-404B Data Analytics and Machine Learning- from Node to Cluster
BKK16-404B Data Analytics and Machine Learning- from Node to Cluster
 
BKK16-408B Data Analytics and Machine Learning From Node to Cluster
BKK16-408B Data Analytics and Machine Learning From Node to ClusterBKK16-408B Data Analytics and Machine Learning From Node to Cluster
BKK16-408B Data Analytics and Machine Learning From Node to Cluster
 
SnappyData Overview Slidedeck for Big Data Bellevue
SnappyData Overview Slidedeck for Big Data Bellevue SnappyData Overview Slidedeck for Big Data Bellevue
SnappyData Overview Slidedeck for Big Data Bellevue
 
Dataiku - hadoop ecosystem - @Epitech Paris - janvier 2014
Dataiku  - hadoop ecosystem - @Epitech Paris - janvier 2014Dataiku  - hadoop ecosystem - @Epitech Paris - janvier 2014
Dataiku - hadoop ecosystem - @Epitech Paris - janvier 2014
 

More from Jonathan Seidman

More from Jonathan Seidman (10)

Foundations for Successful Data Projects – Strata London 2019
Foundations for Successful Data Projects – Strata London 2019Foundations for Successful Data Projects – Strata London 2019
Foundations for Successful Data Projects – Strata London 2019
 
Foundations strata sf-2019_final
Foundations strata sf-2019_finalFoundations strata sf-2019_final
Foundations strata sf-2019_final
 
Architecting a Next Gen Data Platform – Strata New York 2018
Architecting a Next Gen Data Platform – Strata New York 2018Architecting a Next Gen Data Platform – Strata New York 2018
Architecting a Next Gen Data Platform – Strata New York 2018
 
Architecting a Next Gen Data Platform – Strata London 2018
Architecting a Next Gen Data Platform – Strata London 2018Architecting a Next Gen Data Platform – Strata London 2018
Architecting a Next Gen Data Platform – Strata London 2018
 
Architecting a Next Generation Data Platform – Strata Singapore 2017
Architecting a Next Generation Data Platform – Strata Singapore 2017Architecting a Next Generation Data Platform – Strata Singapore 2017
Architecting a Next Generation Data Platform – Strata Singapore 2017
 
Application architectures with hadoop – big data techcon 2014
Application architectures with hadoop – big data techcon 2014Application architectures with hadoop – big data techcon 2014
Application architectures with hadoop – big data techcon 2014
 
Integrating hadoop - Big Data TechCon 2013
Integrating hadoop - Big Data TechCon 2013Integrating hadoop - Big Data TechCon 2013
Integrating hadoop - Big Data TechCon 2013
 
Extending the Data Warehouse with Hadoop - Hadoop world 2011
Extending the Data Warehouse with Hadoop - Hadoop world 2011Extending the Data Warehouse with Hadoop - Hadoop world 2011
Extending the Data Warehouse with Hadoop - Hadoop world 2011
 
Extending the EDW with Hadoop - Chicago Data Summit 2011
Extending the EDW with Hadoop - Chicago Data Summit 2011Extending the EDW with Hadoop - Chicago Data Summit 2011
Extending the EDW with Hadoop - Chicago Data Summit 2011
 
Real World Machine Learning at Orbitz, Strata 2011
Real World Machine Learning at Orbitz, Strata 2011Real World Machine Learning at Orbitz, Strata 2011
Real World Machine Learning at Orbitz, Strata 2011
 

Recently uploaded

Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
Joaquim Jorge
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Safe Software
 

Recently uploaded (20)

How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
Top 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live StreamsTop 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live Streams
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CV
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
 
Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
 
MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 

Distributed Data Analysis with Hadoop and R - OSCON 2011

  • 1. Distributed Data Analysis with Hadoop and R Jonathan Seidman and Ramesh Venkataramaiah, Ph. D. OSCON Data 2011
  • 2. Flow of this Talk •  Introductions •  Hadoop, R and Interfacing •  Our Prototypes •  A use case for interfacing Hadoop and R •  Alternatives for Running R on Hadoop •  Alternatives to Hadoop and R •  Conclusions •  References page 2
  • 3. Who We Are •  Ramesh Venkataramaiah, Ph. D. –  Principal Engineer, TechOps –  rvenkataramaiah@orbitz.com –  @rvenkatar •  Jonathan Seidman –  Lead Engineer, Business Intelligence/Big Data Team –  Co-founder/organizer of Chicago Hadoop User Group (http://www.meetup.com/Chicago-area-Hadoop-User-Group-CHUG) and Chicago Big Data (http://www.meetup.com/Chicago-Big-Data/) –  jseidman@orbitz.com –  @jseidman •  Orbitz Careers –  http://careers.orbitz.com/ –  @OrbitzTalent page 3
  • 4. Launched in 2001, Chicago, IL Over 160 million bookings page 4
  • 5. Hadoop and R as an analytic platform? page 5
  • 6. What is Hadoop? Distributed file system (HDFS) and parallel processing framework. Uses the MapReduce programming model at its core. Provides fault-tolerant and scalable storage of very large datasets across machines in a cluster. page 6
  • 7. What is R? When do we need it? Open-source stat package with visualization. Vibrant community support. One-line calculations galore! Steep learning curve, but worth it! For insight into statistical properties and trends… or for machine learning purposes… or for understanding Big Data well. page 7
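The "one-line calculations galore" point above can be illustrated with a minimal sketch; the delay values here are made-up numbers, not from the dataset used later in the talk:

```r
# A taste of R's one-line calculations on a small numeric vector.
# The values are hypothetical departure delays in minutes.
delays <- c(23, 17, 11, -2, 1, 14)
mean(delays)        # average delay
sd(delays)          # standard deviation
quantile(delays)    # five-number summary in one call
hist(delays)        # instant visualization
```

Each of these is a single expression; the equivalent in a general-purpose language would be several lines plus a plotting library.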
  • 8. Our Options •  Data volume reduction by sampling –  Very bad for long-tail data distributions –  Approximations lead to bad conclusions •  Scaling R –  Still in-memory –  But make it parallel using segue, Rhipe, R-Hive… •  Use SQL-like interfaces –  Apache Hive with Hadoop –  File sprawl and process issues •  Regular DBMS –  Like fitting a square peg in a round hole –  No in-line R calls from SQL, but commercial efforts are underway. •  This Talk: Interface Hadoop with R over dataspaces page 8
  • 9. Why Interface Hadoop and R at cluster level? •  R only works on data that is in-memory, stand-alone, and mostly single-threaded (the multicore package in R helps here). •  HDFS can be the data and analytic store. •  Interfacing with Hadoop brings parallel processing capability to the R environment. How do we interface Hadoop and R, at cluster level? page 9
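As a baseline for the cluster-level approaches that follow, a hedged sketch of single-machine parallelism with the multicore package mentioned above (later folded into base R as parallel::mclapply); the slow statistic and chunk sizes are illustrative:

```r
# Hedged sketch: parallelizing an lapply() across cores on one machine
# with the multicore package. This scales CPU, not memory or storage --
# the data must still fit in RAM, which is why cluster-level interfacing
# with Hadoop is the subject of this talk.
library(multicore)
slow_stat <- function(x) { Sys.sleep(1); mean(x) }     # stand-in for real work
chunks <- split(rnorm(4000), rep(1:4, each = 1000))    # four in-memory chunks
results <- mclapply(chunks, slow_stat)                 # one chunk per core
```

The limitation is visible in the sketch: `chunks` is built in memory before any parallelism starts.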
  • 10. Our prototypes User segmentations Hotel bookings Airline Performance* * Public dataset page 10
  • 11. Before Hadoop page 11
  • 12. With Hadoop page 12
  • 13. Getting a Buy-in: We presented a long-term, unstructured-data growth story and explained how this will help harness long-tail opportunities at the lowest cost. - Traditional DW: Classical Stats, Sampling - Big Data: Specific spikes, "The median is not the message"* (* from a blog) page 13
  • 14. Workload and Resource Partition page 14
  • 15. User Segmentation by Browsers page 15
  • 16. Seasonal variations •  Customer hotel stay gets longer during summer months •  Could help in designing search based on seasons. page 16
  • 18. Description of Use Case •  Analyze openly available dataset: Airline on-time performance. •  Dataset was used in Visualization Poster Competition 2009 –  Consists of flight arrival/departure details from 1987-2008. –  Approximately 120 MM records totaling 120GB. •  Available at: http://stat-computing.org/dataexpo/2009/ page 18
  • 19. Airline Delay Plot: R code
> deptdelays.monthly.full <- read.delim("~/OSCON2011/Delays_by_Month.dat", header=F)
> View(deptdelays.monthly.full)
> names(deptdelays.monthly.full) <- c("Year","Month","Count","Airline","Delay")
> Delay_by_Month <- deptdelays.monthly.full[order(deptdelays.monthly.full$Delay, decreasing=TRUE),]
> Top_10_Delay_by_Month <- Delay_by_Month[1:10,]
> Top_10_Normal <- ((Delay - mean(Delay)) / sd(Delay))
> symbols(Month, Delay, circles=Top_10_Normal, inches=.3, fg="white", bg="red", …)
> text(Month, Delay, Airline, cex=0.5)
page 19
  • 20. Multiple Distributions: R code
> library(lattice)
> deptdelays.monthly.full$Year <- as.character(deptdelays.monthly.full$Year)
> h <- histogram(~Delay|Year, data=deptdelays.monthly.full, layout=c(5,5))
> update(h)
page 20
  • 21. Running R on Hadoop: Hadoop Streaming page 21
  • 22. Hadoop Streaming – Overview •  An alternative to the Java MapReduce API which allows you to write jobs in any language supporting stdin/stdout. •  Limited to text data in current versions of Hadoop. Support for binary streams added in 0.21.0. •  Requires installation of R on all DataNodes. page 22
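Since streaming jobs are launched from the command line, a hedged sketch of kicking one off from within an R session via system(); the streaming jar location and HDFS paths are assumptions for illustration (they vary by Hadoop version and install):

```r
# Hedged sketch: submitting an R mapper/reducer as a Hadoop streaming job
# from R. The jar path and HDFS paths below are placeholders -- check your
# own installation.
cmd <- paste("hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-streaming.jar",
             "-input /data/airline",
             "-output /dept-delay-month",
             "-mapper map.R -reducer reduce.R",
             "-file map.R -file reduce.R")   # ship the scripts to the DataNodes
system(cmd)
```

Note that -file ships the scripts with the job, but the R interpreter itself must already be installed on every DataNode, as the overview above points out.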
  • 23. Hadoop Streaming – Dataflow
Input to map:
1988,1,9,6,1348,1331,1458,1435,PI,942,NA,70,64,NA,23,17,SYR,BWI...
1988,1,17,7,1331,1331,1440,1435,PI,942,NA,69,64,NA,5,0,SYR,BWI…
1987,10,14,3,741,730,912,849,PS,1451,NA,91,79,NA,23,11,SAN,SFO...
1987,10,21,3,728,730,848,849,PS,1451,NA,80,79,NA,-1,-2,SAN,SFO...
1987,10,23,5,731,730,902,849,PS,1451,NA,91,79,NA,13,1,SAN,SFO…
1987,10,30,5,1712,1658,1811,1800,DL,475,NA,59,62,NA,11,14,LEX,ATL...
Output from map:
PI|1988|1 17
PI|1988|1 0
PS|1987|10 11
PS|1987|10 -2
PS|1987|10 1
DL|1987|10 14
* Map function receives input records line-by-line via standard input.
page 23
  • 24. Hadoop Streaming – Dataflow Cont'd
Input to reduce:
DL|1987|10 14
PI|1988|1 0
PI|1988|1 17
PS|1987|10 1
PS|1987|10 11
PS|1987|10 -2
Output from reduce:
1987 10 1 DL 14
1988 1 2 PI 8.5
1987 10 3 PS 3.333333
* Reduce receives map output key/value pairs sorted by key, line-by-line.
page 24
  • 25. Hadoop Streaming – Example map.R
#! /usr/bin/env Rscript
con <- file("stdin", open = "r")
while (length(line <- readLines(con, n = 1, warn = FALSE)) > 0) {
  fields <- unlist(strsplit(line, ","))
  if (!(identical(fields[[1]], "Year")) & length(fields) == 29) {
    deptDelay <- fields[[16]]
    if (!(identical(deptDelay, "NA"))) {
      cat(paste(fields[[9]], "|", fields[[1]], "|", fields[[2]],
                sep=""), "\t", deptDelay, "\n")
    }
  }
}
close(con)
page 25
  • 26. Hadoop Streaming – Example Cont'd reduce.R
#!/usr/bin/env Rscript
con <- file("stdin", open = "r")
delays <- numeric(0) # vector of departure delays
lastKey <- ""
while (length(line <- readLines(con, n = 1, warn = FALSE)) > 0) {
  split <- unlist(strsplit(line, "\t"))
  key <- split[[1]]
  deptDelay <- as.numeric(split[[2]])
  if (!(identical(lastKey, "")) & (!(identical(lastKey, key)))) {
    keySplit <- unlist(strsplit(lastKey, "|"))
    cat(keySplit[[2]], "\t", keySplit[[3]], "\t", length(delays), "\t", keySplit[[1]], "\t", (mean(delays)), "\n")
    lastKey <- key
    delays <- c(deptDelay)
  } else {
    lastKey <- key
    delays <- c(delays, deptDelay)
  }
}
keySplit <- unlist(strsplit(lastKey, "|"))
cat(keySplit[[2]], "\t", keySplit[[3]], "\t", length(delays), "\t", keySplit[[1]], "\t", (mean(delays)), "\n")
page 26
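Because streaming scripts just read stdin and write stdout, they can be smoke-tested locally before touching the cluster. A hedged sketch, driven from R; the data file name is a placeholder for any slice of the airline CSV data:

```r
# Hedged sketch: replay the streaming contract (map -> sort -> reduce)
# with shell pipes, outside Hadoop. "1987.csv" is a placeholder file name.
system("chmod +x map.R reduce.R")
out <- system("cat 1987.csv | ./map.R | sort | ./reduce.R", intern = TRUE)
head(out)   # inspect reducer output lines as an R character vector
```

If the local pipeline produces sensible output, the same two scripts can be submitted to the cluster unchanged.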
  • 27. Running R on Hadoop: Hadoop Interactive page 27
  • 28. Hadoop Interactive (hive) – Overview •  Very unfortunate acronym. •  Provides an interface to Hadoop from the R environment. •  Provides R functions to access HDFS – DFS_list(), DFS_dir_create(), DFS_put()… •  Provides R functions to control Hadoop – hive_start(), hive_stop()… •  And allows streaming jobs to be executed from R via hive_stream(). •  Allows HDFS data, including the output from MapReduce processing, to be manipulated and analyzed directly from R. page 28
  • 29. Hadoop Interactive – Example
#! /usr/bin/env Rscript
mapper <- function() {
  ...
}
reducer <- function() {
  ...
}
library(hive)
DFS_dir_remove("/dept-delay-month", recursive = TRUE, henv = hive())
hive_stream(mapper = mapper, reducer = reducer,
            input="/data/airline", output="/dept-delay-month",
            henv = hive())
results <- DFS_read_lines("/dept-delay-month/part-r-00000",
                          henv = hive())
page 29
  • 30. Running R on Hadoop: RHIPE page 30
  • 31. RHIPE – Overview •  Active project with frequent updates and an active community. •  RHIPE is based on the Hadoop streaming source, but provides some significant enhancements, such as support for binary files. •  Developed to provide R users with access to the same Hadoop functionality available to Java developers. –  For example, provides rhcounter() and rhstatus(), analogous to counters and the reporter interface in the Java API. •  Can be somewhat confusing and intimidating. –  Then again, the same can be said for the Java API. –  Worth taking the time to get comfortable with. page 31
  • 32. RHIPE – Overview Cont'd •  Allows developers to work directly on data stored in HDFS in the R environment. •  Also allows developers to write MapReduce jobs in R and execute them on the Hadoop cluster. •  RHIPE uses Google protocol buffers to serialize data. Most R data types are supported. –  Using protocol buffers increases efficiency and provides interoperability with other languages. •  Must be installed on all DataNodes. page 32
  • 33. RHIPE – MapReduce
map <- expression({})
reduce <- expression(
  pre = {…},
  reduce = {…},
  post = {…}
)
z <- rhmr(map=map, reduce=reduce,
          inout=c("text","sequence"),
          ifolder=INPUT_PATH,
          ofolder=OUTPUT_PATH,
          …)
rhex(z)
page 33
  • 34. RHIPE – Dataflow
Input to map:
Keys = […]
Values = [1988,1,9,6,1348,1331,1458,1435,PI,942,NA,70,64,NA,23,17,SYR,BWI...
1988,1,17,7,1331,1331,1440,1435,PI,942,NA,69,64,NA,5,0,SYR,BWI…
1987,10,14,3,741,730,912,849,PS,1451,NA,91,79,NA,23,11,SAN,SFO...
1987,10,21,3,728,730,848,849,PS,1451,NA,80,79,NA,-1,-2,SAN,SFO...
1987,10,23,5,731,730,902,849,PS,1451,NA,91,79,NA,13,1,SAN,SFO…
1987,10,30,5,1712,1658,1811,1800,DL,475,NA,59,62,NA,11,14,LEX,ATL...]
Output from map:
PI|1988|1 17
PI|1988|1 0
PS|1987|10 11
PS|1987|10 -2
PS|1987|10 1
DL|1987|10 14
* Note that input to map is a vector of keys and a vector of values.
page 34
  • 35. RHIPE – Dataflow Cont'd
Input to reduce:
DL|1987|10 [14]
PI|1988|1 [0, 17]
PS|1987|10 [1,11,-2]
Output from reduce:
1987 10 1 DL 14
1988 1 2 PI 8.5
1987 10 3 PS 3.333333
* Note that input to reduce is each unique key and a vector of values associated with that key.
page 35
  • 36. RHIPE – Example
#! /usr/bin/env Rscript
library(Rhipe)
rhinit(TRUE, TRUE)
map <- expression({
  # For each input record, parse out required fields and output new record:
  mapLine = function(line) {
    fields <- unlist(strsplit(line, ","))
    # Skip header lines and bad records:
    if (!(identical(fields[[1]], "Year")) & length(fields) == 29) {
      deptDelay <- fields[[16]]
      # Skip records where departure delay is "NA":
      if (!(identical(deptDelay, "NA"))) {
        # field[9] is carrier, field[1] is year, field[2] is month:
        rhcollect(paste(fields[[9]], "|", fields[[1]], "|", fields[[2]], sep=""), deptDelay)
      }
    }
  }
  # Process each record in map input:
  lapply(map.values, mapLine)
})
page 36
  • 37. RHIPE – Example Cont'd
reduce <- expression(
  reduce = {
    count <- length(reduce.values)
    avg <- mean(as.numeric(reduce.values))
    keySplit <- unlist(strsplit(reduce.key, "|"))
  },
  post = {
    rhcollect(keySplit[[2]],
              paste(keySplit[[3]], count, keySplit[[1]], avg, sep="\t"))
  }
)
# Create job object:
z <- rhmr(map=map, reduce=reduce,
          ifolder="/data/airline/", ofolder="/dept-delay-month",
          inout=c('text', 'text'), jobname='Avg Departure Delay By Month')
# Run it:
rhex(z)
page 37
  • 38. RHIPE – Example Cont'd
library(lattice)
rhget("/dept-delay-month/part-r-00000", "deptdelay.dat")
deptdelays.monthly.full <- read.delim("deptdelay.dat", header=F)
names(deptdelays.monthly.full) <- c("Year","Month","Count","Airline","Delay")
deptdelays.monthly.full$Year <- as.character(deptdelays.monthly.full$Year)
h <- histogram(~Delay|Year, data=deptdelays.monthly.full, layout=c(5,5))
update(h)
page 38
  • 39. Running R on Hadoop: Segue page 39
  • 40. Segue – Overview •  Intended to work around single-threading in R by taking advantage of Hadoop streaming to provide simple parallel processing. –  For example, running multiple simulations in parallel. •  Suitable for embarrassingly pleasantly parallel problems – big CPU, not big data. •  Runs on Amazon’s Elastic Map Reduce (EMR). –  Not intended for internal clusters. •  Provides emrlapply(), a parallel version of lapply(). page 40
  • 41. Segue – Example
estimatePi <- function(seed){
  set.seed(seed)
  numDraws <- 1e6
  r <- .5 # radius... in case the unit circle is too boring
  x <- runif(numDraws, min=-r, max=r)
  y <- runif(numDraws, min=-r, max=r)
  inCircle <- ifelse( (x^2 + y^2)^.5 < r, 1, 0)
  return(sum(inCircle) / length(inCircle) * 4)
}
seedList <- as.list(1:1000)
require(segue)
myCluster <- createCluster(20)
myEstimates <- emrlapply(myCluster, seedList, estimatePi)
stopCluster(myCluster)
page 41
  • 42. Predictive Analytics on Hadoop: Sawmill page 42
  • 43. Sawmill – Overview •  A framework for integrating a PMML-compliant Scoring Engine with Hadoop. •  Hadoop streaming allows easier integration of a scoring engine into reducer code (Python and R). –  The output of a MapReduce run becomes a segmented PMML model – one segment for each partition •  Training the models and Scoring are separate MapReduce jobs. •  Interoperates with open source scoring engines such as Augustus, as well as a forthcoming R scoring engine. page 43
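The PMML side of the Sawmill pipeline can be sketched from plain R; this is not Sawmill itself, just a hedged illustration of producing a PMML model that a PMML-compliant scoring engine could consume. It assumes the CRAN 'pmml' and 'XML' packages, and the model and file name are made up:

```r
# Hedged sketch: export a fitted R model as PMML, the interchange format
# a Sawmill-style scoring engine reads. Model, data, and file name are
# illustrative assumptions, not part of the Sawmill framework.
library(pmml)   # model-to-PMML converters
library(XML)    # saveXML() for writing the XML document
fit <- lm(Delay ~ Count, data = deptdelays.monthly.full)  # toy model on the delay data
saveXML(pmml(fit), file = "delay-model.pmml")
```

The exported .pmml file is language-neutral, which is what lets training (here, in R) and scoring (in the engine) live in separate MapReduce jobs.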
  • 44. Alternatives •  Apache Mahout –  Scalable machine learning library. –  Offers clustering, classification, collaborative filtering on Hadoop. •  Python –  Many modules available to support scientific and statistical computing. •  Revolution Analytics –  Provides commercial packages to support processing big data with R. •  Ricardo – looks interesting but you can’t have it. •  Other HPC/parallel processing packages, e.g. Rmpi or snow. •  Apache Hive + RJDBC? –  We haven’t been able to get it to work yet. –  You can however wrap calls to the Hive client in R to return R objects. See https://github.com/satpreetsingh/rDBwrappers/wiki page 44
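The "wrap calls to the Hive client in R" workaround mentioned in the last bullet can be sketched as follows; this is a hedged illustration, assuming the hive CLI is on the PATH, and the query, table, and column names are made up:

```r
# Hedged sketch of wrapping the Hive CLI from R. The table and query are
# hypothetical; -S suppresses Hive's informational output, -e runs a query.
raw <- system('hive -S -e "SELECT year, AVG(delay) FROM flights GROUP BY year"',
              intern = TRUE)   # capture stdout as an R character vector
# Hive's CLI emits tab-separated rows, which read.delim() can parse:
delays <- read.delim(textConnection(raw), header = FALSE,
                     col.names = c("Year", "AvgDelay"))
```

This gives R objects back from Hive without RJDBC, at the cost of shelling out for every query.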
  • 45. Conclusions •  If practical, consider using Hadoop to aggregate data for input to R analyses. •  Avoid using R for general purpose MapReduce use. •  For simple MapReduce jobs, or embarrassingly parallel jobs on a local cluster, consider Hadoop streaming (or Hadoop Interactive). –  Limited to processing text only. –  But easy to test at the command line outside of Hadoop: •  $ cat DATAFILE |./map.R |sort |./reduce.R •  To run compute-bound analyses with relatively small amount of data on the cloud look at Segue. •  Otherwise, look at RHIPE. page 45
  • 46. Conclusions Cont'd •  Also look at alternatives – Mahout, Python, etc. •  Make sure your cluster nodes are consistent – same version of R installed, required libraries are installed on each node, etc. page 46
  • 48. References •  Hadoop –  Apache Hadoop project: http://hadoop.apache.org/ –  Hadoop: The Definitive Guide, Tom White, O'Reilly Press, 2011 •  R –  R Project for Statistical Computing: http://www.r-project.org/ –  R Cookbook, Paul Teetor, O'Reilly Press, 2011 –  Getting Started With R: Some Resources: http://quanttrader.info/public/gettingStartedWithR.html page 48
  • 49. References •  Hadoop Streaming –  Documentation on Apache Hadoop Wiki: http://hadoop.apache.org/mapreduce/docs/current/ streaming.html –  Word count example in R : https://forums.aws.amazon.com/thread.jspa? messageID=129163 page 49
  • 50. References •  Hadoop InteractiVE –  Project page on CRAN: http://cran.r-project.org/web/packages/hive/index.html –  Simple Parallel Computing in R Using Hadoop: http://www.rmetrics.org/Meielisalp2009/Presentations/ Theussl1.pdf page 50
  • 51. References •  RHIPE –  RHIPE - R and Hadoop Integrated Processing Environment: http://www.stat.purdue.edu/~sguha/rhipe/ –  Wiki: https://github.com/saptarshiguha/RHIPE/wiki –  Installing RHIPE on CentOS: https://groups.google.com/forum/#!topic/brumail/qH1wjtBgwYI –  Introduction to using RHIPE: http://ml.stat.purdue.edu/rhafen/rhipe/ –  RHIPE combines Hadoop and the R analytics language, SD Times: http://www.sdtimes.com/link/34792 –  Using R and Hadoop to Analyze VoIP Network Data for QoS, Hadoop World 2010: •  video: http://www.cloudera.com/videos/ hw10_video_using_r_and_hadoop_to_analyze_voip_network_data_for_qos •  slides: http://www.cloudera.com/resource/ hw10_voice_over_ip_studying_traffic_characteristics_for_quality_of_service page 51
  • 52. References •  Segue –  Project page: http://code.google.com/p/segue/ –  Google Group: http://groups.google.com/group/segue-r –  Abusing Amazon's Elastic MapReduce Hadoop service… easily, from R, Jeffrey Breen: http://jeffreybreen.wordpress.com/2011/01/10/segue-r-to-amazon-elastic-mapreduce-hadoop/ –  Presentation at Chicago Hadoop Users Group March 23, 2011: http://files.meetup.com/1634302/segue-presentation-RUG.pdf page 52
  • 53. References •  Sawmill –  More information: •  Open Data Group www.opendatagroup.com •  oscon-info@opendatagroup.com –  Augustus, an open source system for building & scoring statistical models •  augustus.googlecode.com –  PMML •  Data Mining Group: dmg.org –  Analytics over Clouds using Hadoop, presentation at Chicago Hadoop User Group: http://files.meetup.com/1634302/CHUG 20100721 Sawmill.pdf page 53
  • 54. References •  Ricardo –  Ricardo: Integrating R and Hadoop, paper: http://www.cs.ucsb.edu/~sudipto/papers/sigmod2010- das.pdf –  Ricardo: Integrating R and Hadoop, Powerpoint presentation: http://www.uweb.ucsb.edu/~sudipto/talks/Ricardo- SIGMOD10.pptx page 54
  • 55. References •  General references on Hadoop and R –  Pete Skomoroch's R and Hadoop bookmarks: http://www.delicious.com/pskomoroch/R+hadoop –  Pigs, Bees, and Elephants: A Comparison of Eight MapReduce Languages: http://www.dataspora.com/2011/04/pigs-bees-and-elephants-a-comparison-of-eight-mapreduce-languages/ –  Quora – How can R and Hadoop be used together?: http://www.quora.com/How-can-R-and-Hadoop-be-used-together page 55
  • 56. References •  Mahout –  Mahout project: http://mahout.apache.org/ –  Mahout in Action, Owen et al., Manning Publications, 2011 •  Python –  Think Stats: Probability and Statistics for Programmers, Allen B. Downey, O'Reilly Press, 2011 •  CRAN Task View: High-Performance and Parallel Computing with R, a set of resources compiled by Dirk Eddelbuettel: http://cran.r-project.org/web/views/HighPerformanceComputing.html page 56
  • 57. References •  Other examples of airline data analysis with R: –  A simple Big Data analysis using the RevoScaleR package in Revolution R: http://www.r-bloggers.com/a-simple-big-data-analysis-using- the-revoscaler-package-in-revolution-r/ page 57
  • 58. And finally… Parallel R (working title), Q Ethan McCallum, Stephen Weston, O'Reilly Press, due autumn 2011. R meets Big Data – a basket of strategies to help you use R for large-scale analysis and computation. page 58