CS 267 : Data Mining Presentation
Guided by : Dr. Tran
-Gaurav Kasliwal
Outline
• RandomForest Model
• Mahout Overview
• RandomForest using Mahout
• Problem Description
• Working Environment
• Data Preparation
• ML Model Generation
• Demo
• Using Gini Index
RandomForest Model
• Random forests are an ensemble learning method for classification. They operate by constructing a multitude of decision trees at training time and outputting the class that is the mode of the classes output by the individual trees.
• Developed by Leo Breiman and Adele Cutler.
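The "mode of the classes" rule can be sketched in a few lines of Python. This is a minimal illustration, not Mahout's code: `forest_predict` and the vote list are hypothetical names and toy values, and breaking ties by first-seen order is an arbitrary choice made here.

```python
from collections import Counter

def forest_predict(tree_predictions):
    """Return the most common class label among the per-tree predictions."""
    # most_common(1) returns [(label, count)] for the top label; on a tie,
    # Counter keeps first-seen order, which is an arbitrary choice here.
    return Counter(tree_predictions).most_common(1)[0][0]

# Toy votes from five hypothetical trees for a single record.
votes = [2, 3, 2, 1, 2]
print(forest_predict(votes))  # 2
```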
Mahout
• Mahout is a library of scalable machine-learning algorithms, implemented on top of Apache Hadoop and using the MapReduce paradigm.
• Scalable to large data sets
RandomForest using Mahout
• Generate a file descriptor for the dataset.
• Run the example with the training data to build a Decision Forest model.
• Use the Decision Forest model to classify the test data and get results.
• Tune the model to get better results.
Problem Definition
• Benchmark a machine-learning model for page ranking
• Dataset: Yahoo! Learning to Rank
• Train data: 34815 records
• Test data: 130166 records
• Data description:
    {R} | {q_id} | {List: feature_id -> feature_value}
    where R = relevance label in {0, 1, 2, 3, 4}
    q_id = query id (number)
    feature_id = number
    feature_value = 0 to 1
Working Environment
• Ubuntu
• Hadoop 1.2.1
• Mahout 0.9
Prepare Dataset
• Take the data from the input text file
• Make a .csv file
• Make a directory in HDFS and upload train.csv and test.csv to it.
• Data loading (load data to HDFS):
    # hadoop fs -mkdir final_data
    # hadoop fs -put train.csv final_data
    # hadoop fs -put test.csv final_data
    # hadoop fs -ls final_data        (check with the ls command)
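The "make a .csv file" step could look roughly like the sketch below. It assumes, and the deck does not confirm this, that each raw line is whitespace-separated in the form `R qid:<id> <feature_id>:<value> ...`, that feature ids start at 1, and that features missing from a line are 0. The function name `to_csv_row` and the `n_features` parameter are hypothetical. The label is written last to match the "numeric columns, then label" order of the descriptor used later, but the deck does not say how its own columns were laid out.

```python
def to_csv_row(line, n_features):
    """Convert 'R qid:ID fid:val ...' into the CSV row 'qid,f1,...,fn,R'."""
    tokens = line.split()
    label = tokens[0]
    q_id = tokens[1].split(":", 1)[1]
    dense = ["0"] * n_features          # features absent from the line default to 0
    for token in tokens[2:]:
        fid, value = token.split(":", 1)
        dense[int(fid) - 1] = value     # feature ids assumed to start at 1
    return ",".join([q_id] + dense + [label])

# Toy input line, not taken from the real dataset.
print(to_csv_row("3 qid:17 1:0.5 4:0.25", n_features=5))
# 17,0.5,0,0,0.25,0,3
```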
Using Mahout
Make metadata:
    # hadoop jar mahout-core-0.9-job.jar org.apache.mahout.classifier.df.tools.Describe \
        -p final_data/train.csv -f final_data/train.info1 -d 702 N L
• This creates the metadata file train.info1 in the final_data folder.
• The descriptor -d 702 N L declares 702 numerical attributes followed by the label.
Create Model
Make model:
    # hadoop jar mahout-examples-0.9-job.jar org.apache.mahout.classifier.df.mapreduce.BuildForest \
        -Dmapred.max.split.size=1874231 -d final_data/train.csv -ds final_data/train.info1 \
        -sl 5 -p -t 100 -o final-forest
Test Model
Test model (the output directory name final-pred is a placeholder):
    # hadoop jar mahout-examples-0.9-job.jar org.apache.mahout.classifier.df.mapreduce.TestForest \
        -i final_data/test.csv -ds final_data/train.info1 -m final-forest -a -mr -o final-pred
Results
Summary results: confusion matrix and statistics
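For readers unfamiliar with the summary named on this slide, the sketch below shows how a confusion matrix and overall accuracy are computed from true and predicted labels. It is a plain-Python illustration with made-up labels, not the deck's results and not Mahout's output format. The function names are hypothetical.

```python
def confusion_matrix(actual, predicted, n_classes):
    """matrix[a][p] counts records of true class a that were predicted as class p."""
    matrix = [[0] * n_classes for _ in range(n_classes)]
    for a, p in zip(actual, predicted):
        matrix[a][p] += 1
    return matrix

def accuracy(matrix):
    """Fraction of records on the diagonal, i.e. correctly classified."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total

# Made-up labels for illustration only (classes 0..4, as in the relevance label R).
actual    = [0, 1, 2, 2, 4, 3]
predicted = [0, 1, 2, 1, 4, 4]
cm = confusion_matrix(actual, predicted, n_classes=5)
for row in cm:
    print(row)
print(accuracy(cm))
```

Off-diagonal cells show which classes get confused with each other, which a single accuracy figure hides.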
Tuning
• Change the parameters -t and -sl and check the results.
• --nbtrees (-t) nbtrees: number of trees to grow
• --selection (-sl) m: number of variables to select randomly at each tree node
Results
    # hadoop jar mahout-examples-0.9-job.jar org.apache.mahout.classifier.df.mapreduce.BuildForest \
        -Dmapred.max.split.size=1874231 -d final_data/train.csv -ds final_data/train.info1 \
        -sl 700 -p -t 600 -o final-forest2
    # hadoop jar mahout-examples-0.9-job.jar org.apache.mahout.classifier.df.mapreduce.TestForest \
        -i final_data/test.csv -ds final_data/train.info1 -m final-forest2 -a -mr -o final-pred2
RF Split selection
• Typically we select about sqrt(K) predictors at each node, where K is the total number of predictors available.
• If we have 500 columns of predictors, we will select only about 23 (sqrt(500) ≈ 22.4, rounded up).
• We split the node with the best variable among those 23, not the best variable among all 500.
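The selection rule above can be sketched as follows. This is an illustration rather than Mahout's implementation: `pick_candidates` is a hypothetical helper, and rounding sqrt(K) up is an assumption chosen so that K = 500 gives the 23 quoted on the slide. The subset size plays the role of the m that the deck passes with -sl.

```python
import math
import random

def pick_candidates(n_predictors, rng):
    """Return a random subset of ceil(sqrt(K)) predictor indices for one node."""
    m = math.ceil(math.sqrt(n_predictors))   # rounding up is an assumption of this sketch
    return sorted(rng.sample(range(n_predictors), m))

rng = random.Random(0)            # fixed seed so the sketch is repeatable
subset = pick_candidates(500, rng)
print(len(subset))                # 23
```

A fresh subset is drawn at every node, so different nodes of the same tree usually see different candidate predictors.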
Using Gini Index
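• If a dataset T contains examples from n classes, the Gini index Gini(T) is defined as:
$$\mathrm{Gini}(T) = 1 - \sum_{j=1}^{n} p_j^2$$
  where p_j is the relative frequency of class j in T.
• If T is split into two subsets T1 and T2 with sizes N1 and N2 respectively (N = N1 + N2), the Gini index of the split data is defined as:
$$\mathrm{Gini}_{\mathrm{split}}(T) = \frac{N_1}{N}\,\mathrm{Gini}(T_1) + \frac{N_2}{N}\,\mathrm{Gini}(T_2)$$
• The attribute value that provides the smallest Gini_split(T) is chosen to split the node.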
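A small worked sketch of the Gini computation, assuming the standard definitions: Gini(T) is one minus the sum of squared class frequencies, and the split Gini is the size-weighted average of the two subsets' Gini values. The function names and the toy label lists are illustrative and are not taken from the deck's dataset.

```python
from collections import Counter

def gini(labels):
    """Gini(T) = 1 - sum over classes of (relative class frequency)^2."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def gini_split(left, right):
    """Size-weighted average of the Gini index of the two subsets."""
    n = len(left) + len(right)
    return len(left) / n * gini(left) + len(right) / n * gini(right)

# Toy split: T1 has three records of class 0 and one of class 1; T2 is pure class 1.
left = [0, 0, 0, 1]
right = [1, 1]
print(gini(left))                          # 0.375
print(round(gini_split(left, right), 4))   # 0.25
```

A pure subset has a Gini index of 0, so a split that isolates one class drives the weighted value down. That is why the smallest split value is preferred.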
Example
• The example below shows the construction of a single tree using the dataset.
• Only two of the original four attributes are chosen for this tree construction.
• The table below lists the Gini index value for the HOME_TYPE attribute at all possible splits.
• The split HOME_TYPE <= 10 has the lowest value.
Split                 Gini SPLIT value
HOME_TYPE <= 6        0.4000
HOME_TYPE <= 10       0.2671
HOME_TYPE <= 15       0.4671
HOME_TYPE <= 30       0.3000
HOME_TYPE <= 31       0.4800
Random forest using apache mahout
