When it comes to big data analysis, what usually comes to mind is Hadoop's HDFS and MapReduce. But writing a MapReduce program traditionally means learning native Java. One might wonder: is it possible to use R, the language most widely adopted by data scientists, to implement MapReduce programs? And by integrating R and Hadoop, can one truly unleash the power of parallel computing for big data analysis?
This deck introduces, step by step, how to install RHadoop and how to write MapReduce programs in R. More importantly, it discusses whether RHadoop really lights the way for big data analysis, or is merely another way to write MapReduce programs.
Please mail me if you find any problem with the slides. EMAIL: tr.ywchiu@gmail.com
When people talk about big data, what usually comes to mind is Hadoop's MapReduce and HDFS. Writing MapReduce, however, requires learning Java or going through a Thrift interface. Can R run on Hadoop? And does R + Hadoop really combine R's powerful analytics with the ability to analyze big data?
This talk introduces, step by step, how to install the RHadoop packages on Hadoop, and how to write MapReduce programs in R. More importantly, it examines whether RHadoop is truly a guiding light for big data analysis, or just another implementation approach.
6. Enables developers to write the Mapper/Reducer in
any scripting language (R, Python, Perl)
Mapper, reducer, and optional combiner
processes are written to read from standard
input and write to standard output
A streaming job has the additional overhead
of starting a scripting VM
Streaming vs. Native Java
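The read-stdin/write-stdout contract above can be sketched as a minimal word-count mapper in R (an illustrative example, not from the slides; Hadoop Streaming would invoke it on each input split):

```r
#!/usr/bin/env Rscript
# Word-count mapper for Hadoop Streaming: read lines from standard input,
# emit tab-separated (word, 1) pairs on standard output.
con <- file("stdin", open = "r")
while (length(line <- readLines(con, n = 1)) > 0) {
  for (word in strsplit(tolower(line), "[^a-z]+")[[1]]) {
    if (nchar(word) > 0) cat(word, "\t1\n", sep = "")
  }
}
close(con)
```

A reducer follows the same contract: it reads the sorted (key, value) lines from stdin and writes aggregated results to stdout.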
7. Writing MapReduce Using R
mapreduce function
mapreduce(input, output, map, reduce, ...)
Changelog
rmr 3.0.0 (2014/02/10): 10X faster than rmr 2.3.0
rmr 2.3.0 (2013/10/07): support plyrmr
rmr2
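A minimal rmr2 job, sketched under the assumption that rmr2 is installed and the HADOOP_CMD / HADOOP_STREAMING environment variables point at the cluster:

```r
library(rmr2)

# Push a small R vector into HDFS, square each value via a map-only job,
# then pull the (key, value) pairs back into R.
small.ints <- to.dfs(1:10)
result <- mapreduce(
  input = small.ints,
  map   = function(k, v) keyval(v, v^2)   # emit (n, n^2) pairs
)
from.dfs(result)
```

`to.dfs`/`from.dfs` are meant for small test data; production jobs read input that already lives in HDFS.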
8. Access HDFS From R
Exchange data between R data frames and HDFS
rhdfs
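Basic rhdfs usage looks like the following sketch (paths are illustrative; it assumes rhdfs is installed and HADOOP_CMD is set):

```r
library(rhdfs)
hdfs.init()                                        # connect to HDFS

hdfs.ls("/user")                                   # like `hadoop fs -ls /user`

# Copy a local file into HDFS and back out again.
hdfs.put("local.csv", "/user/cloudera/local.csv")
hdfs.get("/user/cloudera/local.csv", "copy.csv")
```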
10. Perform common data manipulation operations,
as found in plyr and reshape2
It provides a familiar plyr-like interface while
hiding many of the mapreduce details
plyr: Tools for splitting, applying and
combining data
NEW! plyrmr
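The plyr-like interface can be sketched as below; `input`, `where`, and `output` follow the plyrmr documentation, while the HDFS paths are illustrative assumptions:

```r
library(plyrmr)

# Filter an HDFS-resident data set with plyr-style verbs; plyrmr translates
# the pipeline into MapReduce jobs behind the scenes.
mtcars_big <- input("/user/cloudera/mtcars")       # data already stored in HDFS
fast_cars  <- where(mtcars_big, mpg > 25)          # row filter, like plyr/subset
output(fast_cars, "/user/cloudera/fast_cars")      # write the result back to HDFS
```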
12. R and related packages should be installed on
each task node of the cluster
A Hadoop cluster: CDH3 and higher, or Apache
Hadoop 1.0.2 and higher (MR1 only, not MR2)
MR2 compatibility starts with Apache Hadoop 2.2.0
or HDP 2
Prerequisites
13. Getting Ready (Cloudera VM)
Download
http://www.cloudera.com/content/cloudera-content/cloudera-docs/DemoVMs/Cloudera-QuickStart-VM/cloudera_quickstart_vm.html
This VM runs
CentOS 6.2
CDH4.4
R 3.0.1
Java 1.6.0_32
37. HDFS stores your files as data chunks distributed
across multiple datanodes
M/R runs multiple programs, called mappers, on each
of the data chunks or blocks. The (key, value) output
of these mappers is combined into the result by
reducers.
It takes time for mappers and reducers to be spawned
on these distributed systems.
Hadoop Latency
38. It is not possible to directly apply R's built-in machine
learning methods in a MapReduce program
kcluster <- kmeans(mydata, 4, iter.max = 10)
K-means Clustering
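For contrast, the same call runs trivially in plain, single-machine R (synthetic data here for illustration), which is exactly what does not translate directly to MapReduce:

```r
# K-means in ordinary R: the whole data set must fit in local memory.
set.seed(42)
mydata   <- matrix(rnorm(200), ncol = 2)           # 100 random 2-D points
kcluster <- kmeans(mydata, centers = 4, iter.max = 10)
kcluster$size                                      # points assigned to each cluster
```

Running k-means over HDFS-scale data instead requires reformulating the algorithm as iterated MapReduce jobs (assign points in the map phase, recompute centroids in the reduce phase).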
42. Perform common data manipulation operations,
as found in plyr and reshape2
It provides a familiar plyr-like interface while
hiding many of the mapreduce details
plyr: Tools for splitting, applying and
combining data
NEW! plyrmr