RHive tutorial - Installation
There are 3 ways to install RHive.
• Installation using CRAN
• Download an already built R package from the RHive project homepage, then
install it with R CMD INSTALL
• Download the source from Github, build, then install.
Excluding the version deployed in CRAN, all RHive packages and sources
can be found in the site below:
RHive’s Github repository path: https://github.com/nexr/RHive
Contents of this Tutorial
This tutorial explains how to install and run R and RHive in an environment
where Hadoop and Hive are running.
Environments used in this Tutorial
This tutorial assumes RHive is being installed on 64-bit CentOS 5 Linux.
Installation procedures on other Linux distributions or Mac OS X are virtually
identical.
Only the way packages such as git or ant are installed may differ between
distributions.
Using RHive on Windows will be covered in a separate article.
Hadoop and Hive Structural Environment
The servers used in this tutorial run the following modules.
10.1.1.1 - Hadoop namenode, Hive server, R, RHive
10.1.1.[2-4] - Hadoop job node, DFS node, Rserve node
This tutorial therefore assumes the following are already in place.
• The Hadoop namenode and Hive are installed on server 10.1.1.1, and the Hive
server is running.
• Servers 10.1.1.2, 10.1.1.3, and 10.1.1.4 each run a Hadoop DFS node and a
Hadoop job node.
• Hadoop and Hive are functioning normally.
If you need guidance starting from the Hadoop and Hive installation, please
consult the Hadoop and Hive references.
Note
It is generally not a good idea to install anything other than the namenode on
the Hadoop namenode server, but for quick setup of a small-scale cluster (and
for convenience), this tutorial installs the Hive server, R, and RHive there.
For a larger cluster used by multiple users simultaneously, adapt the contents
of this tutorial accordingly.
Method of Installing Git to Download Sources
Downloading the source code from Github and installing it is not much trouble,
and it has the advantage of letting you build and use the newest packages
directly.
If a problem is found in the RHive version you are using and a source code
update is available, it is faster to download the source code and build it
yourself.
The Github repository where you can download RHive’s source code is as
follows: git://github.com/nexr/RHive.git
If you are using Linux or Mac OS X, you can open a terminal and use SSH to
connect to the remote server you plan to work on.
This tutorial uses the root account as the working account. If your
environment does not permit logging in as root, obtain sudoer permission and
run the commands with sudo.
Connecting to or opening a terminal
Open a terminal window or connect to the server you plan to work on:
ssh root@10.1.1.1
Note: we assume 10.1.1.1 is the server on which RHive will be installed.
Download Source Code
Make a temporary directory and download the RHive source into it with git.
Then move into the automatically created subdirectory, ‘RHive’.
mkdir RHive_source
cd RHive_source
git clone git://github.com/nexr/RHive.git
# if you succeed, a directory named 'RHive' is created automatically
cd RHive
If git is not installed and you therefore cannot clone, install git with the
command below and then follow the directions above.
yum install git
Using ant to build jar
Before building the RHive package, you must build its sub modules, which are
written in Java and packaged as jar files.
This step may not be required if you install from CRAN or download an already
built package, but it is required if you download the source and build it
manually.
That is, the jar modules used by RHive must be compiled and ready before the
RHive package can be built into a form that R can install.
You can use ant to compile the jar files that will be included in the RHive
sub modules.
ant build
If ant is not installed, install it first and then run the command above.
Java must also be installed, of course.
Ant can be installed with the following command:
yum install ant
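Before starting the build, it can help to check which of the required tools are already present. The following is a minimal sketch, not part of RHive; the function names have_cmd and missing_tools are hypothetical helpers introduced here for illustration.

```shell
# Return success if the named command is found on PATH.
have_cmd() {
    command -v "$1" >/dev/null 2>&1
}

# Print each of the given tool names that is NOT installed, one per line.
missing_tools() {
    for tool in "$@"; do
        if ! have_cmd "$tool"; then
            echo "$tool"
        fi
    done
}

# Report which of the tools used in this tutorial still need to be installed.
missing_tools git ant java
```

Any name that is printed can then be installed with yum as described above; no output means all three tools were found.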
Running the build produces output like the following:
# ant
Buildfile: build.xml
compile:
[mkdir] Created dir: /mnt/srv/RHive_package/RHive/build/classes
[javac] Compiling 5 source files to /mnt/srv/RHive_package/RHive/build/classes
[unjar] Expanding: /mnt/srv/RHive_package/RHive/RHive/inst/javasrc/lib/REngine.jar into /mnt/srv/RHive_package/RHive/build/classes
[unjar] Expanding: /mnt/srv/RHive_package/RHive/RHive/inst/javasrc/lib/RserveEngine.jar into /mnt/srv/RHive_package/RHive/build/classes
jar:
[jar] Building jar: /mnt/srv/RHive_package/RHive/rhive_udf.jar
cran:
[copy] Copying 1 file to /mnt/srv/RHive_package/RHive/RHive/inst/java
[copy] Copying 13 files to /mnt/srv/RHive_package/RHive/build/CRAN/rhive/inst
[copy] Copying 9 files to /mnt/srv/RHive_package/RHive/build/CRAN/rhive/man
[copy] Copying 3 files to /mnt/srv/RHive_package/RHive/build/CRAN/rhive/R
[copy] Copying 1 file to /mnt/srv/RHive_package/RHive/build/CRAN/rhive
[copy] Copying 1 file to /mnt/srv/RHive_package/RHive/build/CRAN/rhive
[delete] Deleting: /mnt/srv/RHive_package/RHive/rhive_udf.jar
main:
BUILD SUCCESSFUL
The message BUILD SUCCESSFUL means the build succeeded. If the build fails,
the quickest solution is to consult the RHive development team.
Building RHive Package
After building the sub modules, RHive must be built into an R package before
it can be installed.
Check that the current directory is the one where the jar was built, then
build the RHive package as follows:
# pwd
/root/RHive_package/RHive
# ls -l
total 76
-rw-r--r-- 1 root root  1413 Dec 11 16:41 ChangeLog
-rw-r--r-- 1 root root  2068 Dec 11 16:41 INSTALL
-rw-r--r-- 1 root root  2444 Dec 11 16:41 README
drwxr-xr-x 5 root root  4096 Dec 11 16:41 RHive
drwxr-xr-x 4 root root  4096 Dec 11 16:42 build
-rw-r--r-- 1 root root  2999 Dec 11 16:41 build.xml
-rw-r--r-- 1 root root 35244 Dec 11 16:41 rhive-logo.jpg
-rw-r--r-- 1 root root 12732 Dec 11 16:41 rhive-logo.png
# R CMD build ./RHive
If the build was successful then you may see the following result message.
* checking for file ‘./RHive/DESCRIPTION’ ... OK
* preparing ‘RHive’:
* checking DESCRIPTION meta-information ... OK
* checking for LF line-endings in source and make files
* checking for empty or unneeded directories
* building ‘RHive_0.0-4.tar.gz’
You can see that RHive_0.0-4.tar.gz has been created.
This package can be installed by R.
The name of the created file depends on the version of the RHive package that
was built.
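Because the file name changes with the version, a script cannot hard-code it. The following is a minimal sketch of one way to pick up whatever package was just built; latest_rhive_tarball is a hypothetical helper name, and the `-nt` file comparison it uses is an assumption about the shell (it is widely supported, for example by bash and dash, but is not required by POSIX).

```shell
# Print the most recently modified RHive_*.tar.gz in the given directory
# (default: the current directory). Returns non-zero if none is found.
latest_rhive_tarball() {
    dir=${1:-.}
    newest=
    for f in "$dir"/RHive_*.tar.gz; do
        # Skip the unexpanded pattern when nothing matches.
        [ -e "$f" ] || continue
        if [ -z "$newest" ] || [ "$f" -nt "$newest" ]; then
            newest=$f
        fi
    done
    if [ -n "$newest" ]; then
        echo "$newest"
    else
        return 1
    fi
}

# Possible usage in the build directory:
#   R CMD INSTALL "$(latest_rhive_tarball)"
```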
Install RHive Package
Now install the RHive package you just built or downloaded.
It can be installed with the following command:
R CMD INSTALL ./RHive_0.0-4.tar.gz
If no errors appear, the installation succeeded.
However, you might encounter errors related to the rJava and Rserve packages.
* installing to library ‘/usr/lib64/R/library’
ERROR: dependencies ‘rJava’, ‘Rserve’ are not available for package ‘RHive’
* removing ‘/usr/lib64/R/library/RHive’
This error message indicates that the R packages rJava and Rserve are not
installed in the R installation currently in use.
RHive depends on the rJava and Rserve packages, so they must already be
installed.
Installing RHive from CRAN installs these dependencies for you automatically,
but when installing from source they are not installed automatically.
Install them manually.
# Open R
install.packages("rJava")
install.packages("Rserve")
# and install RHive
install.packages("./RHive_0.0-4.tar.gz", repos=NULL)
No errors indicate a successful installation.
Directly downloading RHive package from project site
The URL where you can download a built package is as follows:
https://github.com/nexr/RHive/downloads
Download a suitable version from the above site.
This tutorial installs the version listed below:
RHive_0.0-4-2011121201.tar.gz - RHive_0.0-4 SNAPSHOT (build2011121201) - R package
You can also download this file with a web browser and install it on a laptop
or desktop, or send the file to a remote server via FTP and install it there.
This tutorial shows how to install it on a remote Linux server.
First, use a terminal to connect to the remote Linux server where RHive will
be installed.
In this tutorial it is server 10.1.1.1, located in the internal network.
ssh root@10.1.1.1
mkdir RHive_installable
cd RHive_installable
With the temporary directory created, use wget to download the file into it.
The download link can be obtained from the aforementioned download site.
Remember to pass the --no-check-certificate option to wget.
wget --no-check-certificate https://github.com/downloads/nexr/RHive/RHive_0.0-4-2011121401.tar.gz
Once download is complete your current directory will contain the following file:
# ls -al
total 3240
drwxr-xr-x 2 root root    4096 Dec 11 18:00 .
drwxr-x--- 6 root root    4096 Dec 11 18:02 ..
-rw-r--r-- 1 root root 3302766 Dec 12  2011 RHive_0.0-4-2011121401.tar.gz
This file is a package built by the RHive development team for upload to CRAN,
so it does not require a separate build step.
It can be installed directly with R.
R CMD INSTALL ./RHive_0.0-4-2011121201.tar.gz
If you encounter an error about the rJava and Rserve dependencies like the one
mentioned before, install those packages inside R first and then reinstall the
downloaded file.
As described earlier, they can be installed as follows:
Open R
install.packages('rJava')
install.packages('Rserve')
If no errors appear, the installation is complete.
Downloading source code without using Git client
You can download the source code from Github even without the use of Git
command or Git client.
Github supports the use of web browsers to download the compressed source
code.
You can download the newest source code like below.
wget --no-check-certificate https://github.com/nexr/RHive/zipball/master -O RHive.zip
unzip RHive.zip
cd nexr-RHive-df7341c/
Compiling the sources and building the package is the same as if you
downloaded RHive source via use of Git client.
Installing R and RServe
In order to use RHive, every Hadoop job node must have Rserve installed.
RHive controls Rserve by referencing slaves, which is in RHive's conf.
Installing Rserve is not difficult.
Connect to the Hadoop name node and to each job node, and install R and Rserve
on each of them.
The exception is the name node: it needs R but does not need Rserve.
ssh root@10.1.1.1
If R is not already installed, install it first.
On CentOS 5, you can install the newest version of R as follows.
Remember to install R-devel as well, because it is required to install Rserve.
rpm -Uvh http://download.fedora.redhat.com/pub/epel/5/i386/epel-release-5-4.noarch.rpm
yum install R
yum install R-devel
Once the required packages are installed, install Rserve with the following
command.
Open R
install.packages("Rserve")
If the installed R does not possess a file named libR.so, the following error
occurs when attempting to install Rserve.
* installing *source* package ‘Rserve’ ...
** package ‘Rserve’ successfully unpacked and MD5 sums checked
checking whether to compile the server... yes
configure: error: R was configured without --enable-R-shlib or --enable-R-static-lib
*** Rserve requires R (shared or static) library. ***
*** Please install R library or compile R with either --enable-R-shlib ***
*** or --enable-R-static-lib support ***
Alternatively use --without-server if you wish to build only Rserve client.
ERROR: configuration failed for package ‘Rserve’
* removing ‘/usr/lib64/R/library/Rserve’
To solve this problem, R must be compiled with --enable-R-shlib or
--enable-R-static-lib.
However, the R packages for most Linux distributions are already compiled with
these options, so the error is probably caused by something else.
First, use the command below to find the path where R’s library files are
located.
# R CMD config --ldflags
-L/usr/lib64/R/lib -lR
You might encounter the following error while executing the above command.
[root@i-10-24-1-34 Rserve]# R CMD config --ldflags
/usr/lib64/R/bin/config: line 142: make: command not found
/usr/lib64/R/bin/config: line 143: make: command not found
This means the ‘make’ utility is missing. Rserve needs it for installation, so
‘make’ has to be installed.
Install ‘make’ as shown below, then run “R CMD config --ldflags” again and
check that the library path is displayed.
yum install make
And let’s check if libR.so is indeed in the printed path.
# ls -al /usr/lib64/R/lib
total 4560
drwxr-xr-x 2 root root    4096 Dec 13 03:00 .
drwxr-xr-x 7 root root    4096 Dec 13 03:35 ..
-rwxr-xr-x 1 root root 2996480 Nov  8 14:19 libR.so
-rwxr-xr-x 1 root root  177176 Nov  8 14:19 libRblas.so
-rwxr-xr-x 1 root root 1470264 Nov  8 14:19 libRlapack.so
libR.so is confirmed to be there. Now that all preparations for installing Rserve
are complete, retry and finish installing Rserve.
open R
install.packages("Rserve")
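The two checks described above (reading the library path from R CMD config --ldflags, then looking for libR.so in it) can be combined into one step. The following is a minimal sketch under the assumption that the flags have the form shown earlier, "-L<dir> -lR", and that the shared library is named libR.so as on Linux; the function names are hypothetical helpers, not part of R or Rserve.

```shell
# Print the first -L directory found in a linker-flags string such as
# "-L/usr/lib64/R/lib -lR". Returns non-zero if there is none.
libdir_from_ldflags() {
    for flag in $1; do
        case $flag in
            -L*) echo "${flag#-L}"; return 0 ;;
        esac
    done
    return 1
}

# Succeed only if libR.so exists in the directory named by the flags.
has_libR() {
    libdir=$(libdir_from_ldflags "$1") || return 1
    [ -f "$libdir/libR.so" ]
}

# Possible usage on a machine with R installed:
#   has_libR "$(R CMD config --ldflags)" && echo "libR.so found"
```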
Running Rserve
Once Rserve installation is complete, run Rserve as a daemon.
Before running Rserve, its configuration must be adjusted to allow remote
connections.
Adjust the configuration as follows:
Connect to the server where Rserve will be run.
In all Hadoop job nodes, open the file "/etc/Rserv.conf" using a text editor.
If there is no such file then it must be created.
Insert 'remote enable' into the file.
Save and exit.
Rserv.conf can configure many other options.
Details pertaining to configuration can be found in the URL below.
http://www.rforge.net/Rserve/doc.html
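When there are several job nodes, editing the file by hand on each one is tedious. The following is a minimal sketch of a scripted alternative to the editor steps above; ensure_line is a hypothetical helper name, and it assumes the file, if it already exists, ends with a newline.

```shell
# Append a line to a file only if that exact line is not already present.
# The file is created when it does not exist.
ensure_line() {
    line=$1
    file=$2
    touch "$file" || return 1
    if ! grep -qxF "$line" "$file"; then
        printf '%s\n' "$line" >> "$file"
    fi
}

# On each job node (as root) this would be:
#   ensure_line 'remote enable' /etc/Rserv.conf
```

Running it more than once is harmless, because the line is only added when it is missing.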
Then leave R and run Rserve from the command prompt.
R CMD Rserve
Once Rserve is running as a daemon, the following command can be used to check
which port it is listening on.
# netstat -nltp
Active Internet connections (only servers)
Proto Recv-Q Send-Q Local Address           Foreign Address  State   PID/Program name
tcp   0      0      0.0.0.0:6311            0.0.0.0:*        LISTEN  25516/Rserve
tcp   0      0      :::59873                :::*             LISTEN  13023/java
tcp   0      0      :::50020                :::*             LISTEN  13023/java
tcp   0      0      ::ffff:127.0.0.1:46056  :::*             LISTEN  13112/java
tcp   0      0      :::50060                :::*             LISTEN  13112/java
tcp   0      0      :::22                   :::*             LISTEN  1109/sshd
tcp   0      0      :::50010                :::*             LISTEN  13023/java
tcp   0      0      :::50075                :::*             LISTEN  13023/java
You can see the Rserve daemon listening on port 6311.
Port 6311 is the default port used by Rserve. It can be changed in the
configuration, but do not change it unless there is a special reason to.
If the port is blocked by a firewall, it must be opened so that the internal
servers can connect to each other.
To check this, first see whether the server where RHive will run can connect
to each node.
# connect to the RHive server
ssh root@10.1.1.1
# telnet 10.1.1.2 6311
Trying 10.1.1.2...
Connected to 10.1.1.2.
Escape character is '^]'.
Rsrv0103QAP1
--------------
# telnet 10.1.1.3 6311
Trying 10.1.1.3...
Connected to 10.1.1.3.
Escape character is '^]'.
Rsrv0103QAP1
--------------
# telnet 10.1.1.4 6311
Trying 10.1.1.4...
Connected to 10.1.1.4.
Escape character is '^]'.
Rsrv0103QAP1
--------------
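The same connectivity check can be run against all nodes without an interactive telnet session. The following is a minimal sketch, assuming the nc (netcat) utility is installed and supports the -z (connect without sending data) and -w (timeout in seconds) options; this is not something the tutorial itself uses, and the function names are hypothetical.

```shell
# Print the Rserve node addresses used in this tutorial, one per line.
rserve_nodes() {
    for last in 2 3 4; do
        echo "10.1.1.$last"
    done
}

# Succeed if a TCP connection to host $1 on port $2 can be opened.
# Assumes an nc that accepts -z and -w.
port_open() {
    nc -z -w 3 "$1" "$2" >/dev/null 2>&1
}

# Print one status line per node for the given port (default 6311).
check_rserve_nodes() {
    port=${1:-6311}
    for host in $(rserve_nodes); do
        if port_open "$host" "$port"; then
            echo "$host:$port reachable"
        else
            echo "$host:$port NOT reachable"
        fi
    done
}

# Run from the RHive server:
#   check_rserve_nodes
```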
Configuring Hadoop and Hive for RHive
To run RHive, the laptop, desktop, or server where RHive is installed must
also have Hadoop and Hive installed, and its Hadoop configuration must match
that of the Hadoop cluster.
If the server where you plan to install RHive does not have Hadoop or Hive
installed, install the same versions as those used by the Hadoop cluster. Then
copy the cluster's Hadoop configuration so that the two match.
After that, configure the environment variables.
export HADOOP_HOME=/service/hadoop-0.20.203.0
export HIVE_HOME=/service/hive-0.7.1
In the example above, /service/hadoop-0.20.203.0 is the path where Hadoop is
installed and /service/hive-0.7.1 is where Hive is installed.
These lines must be added to /etc/profile.
If RHive is installed on the same server as the Hadoop namenode then no
separate configuration is required.
But if it is a different server or a laptop, edit the contents of
/service/hadoop-0.20.203.0/conf to match the Hadoop cluster you plan to use.
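Before starting R, it can be useful to confirm that both variables are set and point to real directories. The following is a minimal sketch; check_home_var is a hypothetical helper name introduced here for illustration.

```shell
# Succeed only if the named environment variable is set and refers to an
# existing directory; otherwise print a message to stderr and fail.
check_home_var() {
    name=$1
    # Read the value of the variable whose name is stored in $name.
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
        echo "$name is not set" >&2
        return 1
    fi
    if [ ! -d "$value" ]; then
        echo "$name points to a missing directory: $value" >&2
        return 1
    fi
}

# Possible usage in a login shell after editing /etc/profile:
#   check_home_var HADOOP_HOME && check_home_var HIVE_HOME && echo "environment OK"
```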
Running the RHive Example
As stated before, the environment variables must be configured before running
R in order to use RHive.
More precisely, suitable environment variables must be set before RHive is
initialized.
If you forgot to set HIVE_HOME and HADOOP_HOME in the environment of the
laptop or server, or wish to switch between different versions, you can set
them after starting R, as shown below.
Open R
Sys.setenv(HIVE_HOME="/service/hive-0.7.1")
Sys.setenv(HADOOP_HOME="/service/hadoop-0.20.203.0")
library(RHive)
You can skip this if you have set the variables in /etc/profile or a similar
file. The drawback of this method is that it has to be repeated every time R
is started.
Checking for and Setting RHive Environment Variables
You can check whether the environment variable is properly set by running R
and using the rhive.env() Function.
Should either Hive Home Directory or Hadoop Home Directory not properly
show up then you must recheck whether they have been correctly set.
rhive.env()
Hive Home Directory : /mnt/srv/hive-0.8.1
Hadoop Home Directory : /mnt/srv/hadoop-0.20.203.0
Default RServe List
node1 node2 node3
Disconnected HiveServer and HDFS
RHive connect
After loading RHive and before doing any work, you must call the rhive.connect
function to connect to the Hive server.
If the connection is not made then RHive functions will not work.
rhive.connect()
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/service/hive-0.7.1/lib/slf4j-log4j12-1.6.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/service/hadoop-0.20.203.0/lib/slf4j-log4j12-1.4.3.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
You may see a number of complex messages like these when making the
connection. They can be ignored.
Checking the contents of HDFS files
Now you can use the rhive.hdfs.* functions to work with Hadoop’s HDFS; they
correspond to the “hadoop fs” commands.
You can use the rhive.hdfs.ls() function to list the files in HDFS.
rhive.hdfs.ls("/")
  permission owner group      length   modify-time      file
1 rwxr-xr-x  root  supergroup 0        2011-12-07 14:27 /airline
2 rwxr-xr-x  root  supergroup 0        2011-12-07 13:16 /benchmarks
3 rw-r--r--  root  supergroup 11186419 2011-12-06 03:59 /messages
4 rwxr-xr-x  root  supergroup 0        2011-12-07 22:05 /mnt
5 rwxr-xr-x  root  supergroup 0        2011-12-07 22:15 /rhive
6 rwxr-xr-x  root  supergroup 0        2011-12-07 20:19 /tmp
Checking table list of Hive
Also, you can check the list of tables registered in Hive by using the
rhive.list.tables() Function.
If you have not made any tables then you can see the following result.
rhive.list.tables()
[1] tab_name
<0 rows> (or 0-length row.names)
Creating Hive table
You can use a simple command to save an R data frame to a Hive table.
tablename <- rhive.write.table(USArrests)
USArrests is a data set provided with R. RHive converts the data frame’s
object name into a Hive table name and stores the data as a Hive table.
Checking Table descriptions
You can use the rhive.desc.table() function to see the description of a Hive
table.
rhive.desc.table("USArrests")
  col_name data_type comment
1 rowname  string
2 murder   double
3 assault  int
4 urbanpop int
5 rape     double
As a note, Hive’s table names do not distinguish between upper and lower
cases.
Creating Hive Tables 2
You can also take other data, such as data sets in the MASS package or data
loaded from CSV files, and store them in Hive.
library(MASS)
tablename <- rhive.write.table(Aids2)
rhive.desc.table(tablename)
rhive.load.table(tablename)
This method is useful for uploading relatively small data to Hive. If you want
to store several GB of data in Hive, the recommended method is to save the
files to HDFS and define them as an external table.
RHive does not currently handle this automatically for users; such a feature
is still on the drawing board.
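For larger data, the external-table approach mentioned above is done outside RHive with the hadoop and hive command-line tools. The following is a minimal sketch that only prints the HiveQL statement; external_table_ddl is a hypothetical helper, and the table name, column definitions, and HDFS path in the usage comment are illustrative assumptions, as is the comma-separated text format.

```shell
# Print a HiveQL statement that defines an external table over
# comma-separated text files already stored in an HDFS directory.
#   $1 = table name, $2 = column definitions, $3 = HDFS directory
external_table_ddl() {
    printf "CREATE EXTERNAL TABLE %s (%s)\n" "$1" "$2"
    printf "ROW FORMAT DELIMITED FIELDS TERMINATED BY ','\n"
    printf "STORED AS TEXTFILE\n"
    printf "LOCATION '%s';\n" "$3"
}

# Hypothetical usage on a machine with the hadoop and hive commands:
#   hadoop fs -mkdir /data/flights
#   hadoop fs -put flights.csv /data/flights/
#   hive -e "$(external_table_ddl flights 'year INT, carrier STRING' /data/flights)"
```

Once the table is defined this way, it can be queried from R with rhive.query() like any other Hive table.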
Executing a simple SQL syntax
You can use the rhive.query() function to send SQL to Hive.
Let’s try running a simple SQL statement that counts all the records in the
Hive table usarrests.
rhive.query("SELECT COUNT(*) FROM usarrests")
  X_c0
1   50
The SQL statement above was executed as a Map/Reduce job by Hadoop and Hive.
If you see a result like the one above, the RHive, Hadoop, and Hive
configurations are correct, and Hadoop has computed and returned the total
record count of the input data.
Note that this example used only a very small data set, so it does not show
the full potential of Hive and Hadoop, which are distributed processing
platforms.
Small data such as “usarrests”, which fits in a single server’s memory, can
be processed within R without RHive.
This step only checks that the configuration is correct and the basic
functions are in working order.
RHive with Hadoop and Hive is best suited to data of at least several GiB,
up to tens of GiB.
FAQ and Contact Info
Consult the following reference materials for explanations and details of each
RHive function.
If you find a bug or have difficulty using RHive, file a bug report on the
RHive site or ask the RHive development team via e-mail.
The RHive development team is always open and responsive to questions,
requests, and bug reports.
e-mail: rhive@nexr.com