Random Forest using Apache Mahout

CS 267 : Data Mining Presentation
Guided by : Dr. Tran
-Gaurav Kasliwal

Outline
 RandomForest Model
 Mahout Overview
 RandomForest using Mahout
 Problem Description
 Working Environment
 Data Preparation
 ML Model Generation
 Demo
 Using Gini Index

RandomForest Model
 Random forests are an ensemble learning method
for classification that operate by constructing a
multitude of decision trees at training time and
outputting the class that is the mode of
the classes output by individual trees.
 Developed by Leo Breiman and Adele Cutler.
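
The voting step described above can be sketched in a few lines. This is a minimal illustration, not Mahout's implementation; the function name `forest_predict` and the sample votes are assumptions made for the example.

```python
from collections import Counter


def forest_predict(tree_predictions):
    """Return the mode (most common class) of the individual tree outputs."""
    counts = Counter(tree_predictions)
    # most_common(1) returns [(label, count)]; on a tie the label seen first wins
    label, _ = counts.most_common(1)[0]
    return label


# Hypothetical votes from six trees for one record
votes = [2, 0, 2, 1, 2, 0]
print(forest_predict(votes))
```

Each tree casts one vote per record, so adding trees changes the vote counts but not the voting rule itself.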

Mahout
 Mahout is a library of scalable machine-learning
algorithms, implemented on top of Apache Hadoop
and using the MapReduce paradigm.
 Scalable to large data sets

RandomForest using Mahout
 Generate a file descriptor for the dataset.
 Run the example with the training data to build a Decision Forest model.
 Use the Decision Forest model to classify the test data and get results.
 Tune the model to get better results.

Problem Definition
 To benchmark a machine learning model for page ranking
 Yahoo! Learning to Rank
 Train Data : 34815 Records
 Test Data : 130166 Records
 Data Description :
 {R} | {q_id} | {List: feature_id -> feature_value}
 where R = {0, 1, 2, 3, 4}
 q_id = query id (number)
 feature_id = number
 feature_value = 0 to 1
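
A record in this layout could be turned into a dense CSV row roughly as follows. This is a sketch under assumptions: the tokens are taken to be whitespace-separated, with the query written as `qid:<number>` and each feature as `<feature_id>:<feature_value>`; missing features are filled with 0 and the label R is written last. The names `parse_record` and `to_csv_row` and the sample line are hypothetical.

```python
def parse_record(line, num_features):
    """Split one record into (relevance, query_id, dense feature list)."""
    tokens = line.split()
    relevance = int(tokens[0])                    # R in {0, 1, 2, 3, 4}
    query_id = int(tokens[1].split(":", 1)[1])    # assumed form "qid:<number>"
    features = [0.0] * num_features               # features not listed stay 0
    for token in tokens[2:]:
        fid, value = token.split(":", 1)
        features[int(fid) - 1] = float(value)     # feature ids assumed 1-based
    return relevance, query_id, features


def to_csv_row(relevance, features):
    """Write the features first and the label last, comma-separated."""
    return ",".join([f"{v:g}" for v in features] + [str(relevance)])


# Hypothetical record with four feature slots
rel, qid, feats = parse_record("2 qid:7 1:0.5 3:0.25", num_features=4)
print(to_csv_row(rel, feats))
```

Putting the label in the last column is a choice made here so that the row lines up with a descriptor that lists the numerical attributes first and the label last.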

Working Environment
 Ubuntu
 Hadoop 1.2.1
 Mahout 0.9

Prepare Dataset
 Take data from the input text file.
 Make a .csv file.
 Make a directory in HDFS and upload train.csv and test.csv to the folder.
 Data Loading (load data to HDFS)
 #hadoop fs -mkdir final_data
 #hadoop fs -put train.csv final_data
 #hadoop fs -put test.csv final_data
 #hadoop fs -ls final_data (check with the ls command)

Using Mahout
Make metadata:
#hadoop jar mahout-core-0.9-job.jar org.apache.mahout.classifier.df.tools.Describe \
    -p final_data/train.csv -f final_data/train.info1 -d 702 N L
 This creates the metadata file train.info1 in the final_data folder.
 The descriptor "702 N L" declares 702 numerical attributes followed by the label.

Create Model
Make model:
#hadoop jar mahout-examples-0.9-job.jar org.apache.mahout.classifier.df.mapreduce.BuildForest \
    -Dmapred.max.split.size=1874231 -d final_data/train.csv -ds final_data/train.info1 \
    -sl 5 -p -t 100 -o final-forest

Test Model
Test model:
#hadoop jar mahout-examples-0.9-job.jar org.apache.mahout.classifier.df.mapreduce.TestForest \
    -i final_data/test.csv -ds final_data/train.info1 -m final-forest -a -mr -o final-pred

Results
Summary results : Confusion Matrix and statistics
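
To make the summary concrete, here is a minimal sketch of how a confusion matrix and an accuracy figure can be computed from actual and predicted labels. It is not Mahout's code, and the label lists are made-up values for illustration only; `confusion_matrix` and `accuracy` are hypothetical helper names.

```python
def confusion_matrix(actual, predicted, labels):
    """Rows are actual classes, columns are predicted classes."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for a, p in zip(actual, predicted):
        matrix[index[a]][index[p]] += 1
    return matrix


def accuracy(matrix):
    """Fraction of records on the diagonal (correctly classified)."""
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    return correct / total if total else 0.0


# Made-up labels for five records, using the classes R = 0..4
actual = [0, 1, 2, 2, 4]
predicted = [0, 1, 2, 3, 4]
cm = confusion_matrix(actual, predicted, labels=[0, 1, 2, 3, 4])
for row in cm:
    print(row)
print(accuracy(cm))
```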

Tuning
 Change the parameters -t and -sl and check the results.
 --nbtrees (-t) nbtrees: number of trees to grow
 --selection (-sl) m: number of variables to select randomly at each tree node
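
One way to organize the tuning runs is to generate a BuildForest command line for each (-t, -sl) combination. The sketch below only builds and prints the command strings and runs nothing. The jar, class, flags and input paths are taken from the commands in this deck; the function name `build_forest_command` and the output folder pattern `forest-t<t>-sl<sl>` are assumptions.

```python
from itertools import product


def build_forest_command(num_trees, num_vars):
    """Assemble a BuildForest command line for one parameter combination."""
    return (
        "hadoop jar mahout-examples-0.9-job.jar "
        "org.apache.mahout.classifier.df.mapreduce.BuildForest "
        "-Dmapred.max.split.size=1874231 "
        "-d final_data/train.csv -ds final_data/train.info1 "
        f"-sl {num_vars} -p -t {num_trees} "
        f"-o forest-t{num_trees}-sl{num_vars}"  # hypothetical output folder name
    )


# Parameter values that appear elsewhere in this deck
for t, sl in product([100, 600], [5, 700]):
    print(build_forest_command(t, sl))
```

Writing each forest to its own output folder keeps the models apart, so each one can be tested separately and the results compared.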

Results
 #hadoop jar mahout-examples-0.9-job.jar org.apache.mahout.classifier.df.mapreduce.BuildForest \
    -Dmapred.max.split.size=1874231 -d final_data/train.csv -ds final_data/train.info1 \
    -sl 700 -p -t 600 -o final-forest2
 #hadoop jar mahout-examples-0.9-job.jar org.apache.mahout.classifier.df.mapreduce.TestForest \
    -i final_data/test.csv -ds final_data/train.info1 -m final-forest2 -a -mr -o final-pred2

RF Split selection
 Typically we select about sqrt(K) variables at each node, where K is the total number of predictors available.
 If we have 500 columns of predictors we will select only about 23 (sqrt(500) ≈ 22.4, rounded up).
 We split our node with the best variable among the 23, not the best variable among the 500.
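
The selection rule can be sketched as drawing a random subset of column indices at each node. This is an illustration under the assumption that sqrt(K) is rounded up; `candidate_features` is a hypothetical helper name, not a Mahout API.

```python
import math
import random


def candidate_features(num_predictors, rng):
    """Pick ceil(sqrt(K)) distinct predictor indices at random for one node."""
    m = math.ceil(math.sqrt(num_predictors))
    return sorted(rng.sample(range(num_predictors), m))


rng = random.Random(0)              # fixed seed so the sketch is repeatable
subset = candidate_features(500, rng)
print(len(subset))                  # 23 candidates out of 500 predictors
```

A fresh subset is drawn for every node, so different nodes of the same tree usually consider different predictors.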

Using Gini Index
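 If a dataset T contains examples from n classes, the Gini index Gini(T) is defined as:
$$\mathrm{Gini}(T) = 1 - \sum_{j=1}^{n} p_j^2$$
where p_j is the relative frequency of class j in T.
 If T is split into two subsets T1 and T2 with sizes N1 and N2 respectively (N = N1 + N2), the Gini index of the split data is defined as:
$$\mathrm{Gini}_{split}(T) = \frac{N_1}{N}\,\mathrm{Gini}(T_1) + \frac{N_2}{N}\,\mathrm{Gini}(T_2)$$
 The attribute value that provides the smallest Gini_split(T) is chosen to split the node.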
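
A small sketch of the computation, using the standard definitions Gini(T) = 1 - sum of p_j^2 over the classes and Gini_split(T) = (N1/N) Gini(T1) + (N2/N) Gini(T2). The function names and the toy values are assumptions for illustration; they are not the HOME_TYPE data from the example in this deck.

```python
from collections import Counter


def gini(labels):
    """Gini index of one set of class labels: 1 - sum of squared class frequencies."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def gini_split(left, right):
    """Size-weighted Gini index of a two-way split."""
    n = len(left) + len(right)
    return len(left) / n * gini(left) + len(right) / n * gini(right)


def best_threshold(values, labels):
    """Try every split 'value <= t' and return (t, gini_split) with the smallest index."""
    best = None
    for t in sorted(set(values))[:-1]:  # the largest value would leave the right side empty
        left = [y for x, y in zip(values, labels) if x <= t]
        right = [y for x, y in zip(values, labels) if x > t]
        score = gini_split(left, right)
        if best is None or score < best[1]:
            best = (t, score)
    return best


# Toy attribute values and class labels (made up for illustration)
print(best_threshold([1, 2, 3, 4], [0, 0, 1, 1]))
```

In a random forest this search runs only over the randomly selected candidate variables at each node, not over every column.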

Example
 The example below shows the construction of a single tree using the dataset.
 Only two of the original four attributes are chosen for this tree construction.

 The table tabulates the Gini index value for the HOME_TYPE attribute at all possible splits.
 The split HOME_TYPE <= 10 has the lowest value.
Gini SPLIT                    Value
Gini SPLIT(HOME_TYPE<=6)      0.4000
