Digit Recognizer (Machine Learning)

           Internship Project Report




               Submitted to:

          Persistent Systems Limited

Product Engineering Unit-1 for Data Mining, Pune




                 Submitted By:-

                  Amit Kumar

        PGPBA Praxis Business School, Kolkata

          Project Mentor:-Mr. Yogesh Badhe

       (Technical Specialist, Product Engineering Unit-1)

          Start Date of Internship: 16th July 2012

          End date of Internship: 15th October 2012

                Report Date: 15th October 2012
Preface


This report documents the work done during the summer internship at Persistent Systems Limited,
Pune, on classifying handwritten digits, under the supervision of Mr. Yogesh Badhe. The report
gives an overview of the tasks completed during the internship along with their technical details;
the results obtained are then discussed and analyzed. I have tried my best to keep the report
simple yet technically correct, and I hope I have succeeded in that attempt.


Amit Kumar
ACKNOWLEDGEMENT




I simply could not have done this work without the generous help I received from the Data Mining
team. The work culture at Persistent Systems Limited is truly motivating; everybody here is such a
friendly and cheerful companion that work stress never gets in the way. I would especially like to
thank Mr. Mukund Deshpande, who gave me this project to learn and understand the business
implications of statistical algorithms. I am also grateful to Mr. Yogesh Badhe, who guided me from
understanding the project through to building the statistical model; he not only advised me on the
project but also listened to my arguments in our discussions. Finally, I am very thankful to
Ms. Deepti Takale, who helped me a great deal in absorbing the statistical concepts.

Amit Kumar
Abstract

This report presents the three tasks completed during the summer internship at Persistent Systems
Limited, listed below:
1. Understanding the problem objective and its business implications
2. Understanding the data and building the model
3. Evaluating the model
All these tasks were completed successfully, and the results were in line with expectations. Each
task required a very systematic approach, starting from studying the behavior of the data, through
applying the algorithm, to evaluating the model. The most challenging part was acquiring the domain
knowledge needed to understand the behavior of the data. Once the data had been prepared, we applied
a statistical algorithm for model building; this is a major area in its own right and requires a
solid conceptual grounding in advanced statistics.

Amit Kumar
Introduction:- This project is taken from Kaggle, a platform for predictive modeling and analytics
competitions. Organizations and researchers post data there, and statisticians and data scientists
from all over the world compete to produce the best models.

Problem Statement:- Each image of a handwritten digit is 28 pixels in height and 28 pixels in width,
for a total of 784 pixels. Each pixel has a single value associated with it, indicating the lightness
or darkness of that pixel, with higher values meaning darker. The pixel value is an integer between 0
and 255, inclusive. Each pixel column in the training set has a name like pixelx, where x is an
integer between 0 and 783, inclusive. To locate this pixel in the image, decompose x as
x = i * 28 + j, where i and j are integers between 0 and 27, inclusive. Then pixelx is located in
row i and column j of a 28 x 28 matrix (indexed from zero).
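The decomposition above is easy to check in code. This small Python sketch (the helper name pixel_position is illustrative, not part of the competition data) recovers (i, j) from a pixel index:

```python
# Recover the (row, column) position of a pixel from its index x,
# using the decomposition x = i * 28 + j described above.

def pixel_position(x, width=28):
    """Return (row i, column j) for pixel index x in a width x width image."""
    i, j = divmod(x, width)
    return i, j

# pixel 300 sits on row 10, column 20, since 10 * 28 + 20 = 300
print(pixel_position(300))  # -> (10, 20)
# pixel 783 is the bottom-right corner of the 28 x 28 grid
print(pixel_position(783))  # -> (27, 27)
```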

Goal of the Competition:- Take an image of a single handwritten digit and determine what that digit is.

Approach for the model building:-

Develop the Analysis Plan:- To establish the conceptual model, we need to understand the selected
techniques and the model-implementation issues. Here we will build a predictive model based on the
random forest algorithm; this choice is driven by the sample size and the required type of variables
(metric versus nonmetric).

Evaluation of Underlying Assumptions:- All multivariate techniques rely on underlying assumptions,
both statistical and conceptual, that substantially affect their ability to represent multivariate
relationships. For techniques based on statistical inference, the assumptions of multivariate
normality, linearity, independence of the error terms, and equality of variance in a dependence
relationship must all be met. Since our data is categorical, we do not need to check for linearity
or any independence relationship.

Estimate the Model and Assess Overall Model Fit:- With the assumptions satisfied, the analysis
proceeds to the actual estimation of the model and assessment of overall model fit. In the
estimation process we can choose options to meet specific characteristics of the data or to
maximize the fit to the data. After the model is estimated, the overall fit is evaluated to
ascertain whether it achieves acceptable levels on statistical criteria. Often the model is then
respecified in an attempt to reach a better level of overall fit or explanation.

Interpret the Variate:- With an acceptable level of model fit, interpreting the variate reveals the
nature of the relationship. The effect of each individual variable is interpreted by examining its
estimated weight in the variate.

Validate the Model:- Before accepting the results, we must subject them to one final set of
diagnostic analyses that assess their degree of generalizability using the available validation
methods. Validation attempts to demonstrate that the results generalize to the total population.
These diagnostic analyses add little to the interpretation of the results but can be viewed as
insurance that the results are genuinely descriptive of the data.
Required statistical concepts for this project:-

Data Mining:- Also known as Knowledge Discovery in Databases, data mining is a field at the
intersection of computer science and statistics that attempts to discover patterns in large data
sets. It uses methods drawn from artificial intelligence, machine learning, statistics and database
systems. The overall goal of the data mining process is to extract information from a data set and
transform it into an understandable structure for future use.

Decision Tree:- A decision tree can be used to predict a pattern or to classify the class of a data
point, and it is commonly used in data mining. The goal of a decision tree algorithm is to create a
model that predicts the target variable based on several input variables. In a decision tree, each
leaf represents a value of the target variable given the values of the input variables represented
by the path from the root to that leaf. A tree can be learned by splitting the source set into
subsets based on an attribute-value test. This process is repeated on each derived subset, which is
why it is called recursive partitioning. The usual approach is top-down induction of trees.

Decision trees used in data mining are of two main types:-

    1. Classification Tree:- the predicted outcome is the class to which the data belongs.
    2. Regression Tree:- the predicted outcome can be considered a real number.

The term Classification and Regression Tree (CART) analysis is an umbrella term that refers to both
of the above procedures.

Some other techniques construct more than one decision tree, e.g. bagging, random forests and
boosted trees; we have used a random forest for this project. Algorithms for constructing decision
trees usually work top-down, choosing at each step the variable that best splits the current set of
items. "Best" is defined by how well the variable splits the set into homogeneous subsets that share
the same value of the target variable, and different algorithms use different formulae for
measuring "best".

Impurity is measured with mathematical functions of the class proportions p_k at a node; the two
most common are the Gini index, 1 - sum_k p_k^2, and the entropy, -sum_k p_k log2 p_k.
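The two standard impurity measures, Gini index and entropy, can be computed directly from the class counts at a node. A small Python sketch (illustrative; the R randomForest package computes these internally):

```python
import math

def gini(counts):
    """Gini impurity 1 - sum(p_k^2) for the class counts at a node."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

def entropy(counts):
    """Shannon entropy -sum(p_k * log2(p_k)) for the class counts at a node."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c)

# A pure node has zero impurity; an even 50/50 split is maximally impure.
print(gini([10, 0]))    # -> 0.0
print(gini([5, 5]))     # -> 0.5
print(entropy([5, 5]))  # -> 1.0
```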
Random Forest:- Recently there has been a lot of interest in "ensemble learning" methods, which
generate many classifiers and aggregate their results. Two well-known methods are boosting and
bagging of classification trees. In boosting, successive trees give extra weight to points
incorrectly predicted by earlier predictors, and in the end a weighted vote is taken for prediction.
In bagging, successive trees do not depend on earlier trees; each is independently constructed
using a bootstrap sample of the data set, and in the end a simple majority vote is taken for
prediction. Breiman (2001) proposed random forests, which add an additional layer of randomness to
bagging. In addition to constructing each tree using a different bootstrap sample of the data,
random forests change how the classification or regression trees are constructed. In standard
trees, each node is split using the best split among all variables. In a random forest, each node
is split using the best among a subset of predictors randomly chosen at that node. This somewhat
counterintuitive strategy turns out to perform very well compared to many other classifiers,
including discriminant analysis, support vector machines and neural networks, and it is robust
against overfitting. In addition, it is very user-friendly in the sense that it has only two
parameters (the number of variables in the random subset at each node and the number of trees in
the forest), and it is usually not very sensitive to their values.

The algorithm:-
The random forests algorithm (for both classification and regression) is as follows:
1. Draw ntree bootstrap samples from the original data.
2. For each of the bootstrap samples, grow an unpruned classification or regression tree with the
following modification: at each node, rather than choosing the best split among all predictors,
randomly sample mtry of the predictors and choose the best split from among those variables.
(Bagging can be thought of as the special case of random forests obtained when mtry = p, the
number of predictors.)
3. Predict new data by aggregating the predictions of the ntree trees (i.e., majority vote for
classification, average for regression).

An estimate of the error rate can be obtained from the training data as follows:
1. At each bootstrap iteration, predict the data not in the bootstrap sample (what Breiman calls
"out-of-bag", or OOB, data) using the tree grown with that bootstrap sample.
2. Aggregate the OOB predictions. (On average, each data point is out-of-bag around 36% of the
time, so these predictions accumulate across the iterations.) Calculate the error rate from the
aggregated predictions, and call it the OOB estimate of the error rate.
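The roughly 36% out-of-bag figure follows from the bootstrap arithmetic: a single draw misses a given point with probability 1 - 1/n, so a point is absent from an entire size-n bootstrap sample with probability (1 - 1/n)^n, which approaches e^-1 as n grows. A quick Python check:

```python
import math

n = 42000  # e.g. the number of rows in the full training set

# Probability that a given row never appears in one bootstrap sample of size n.
p_oob = (1 - 1 / n) ** n

print(round(p_oob, 4))         # -> 0.3679
print(round(math.exp(-1), 4))  # -> 0.3679, the limiting value e^-1
```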
Source Code:-

# makes the random forest submission
library(randomForest)

# read the training and test data
train <- read.csv("../data/train.csv", header=TRUE)
test <- read.csv("../data/test.csv", header=TRUE)

# the first column of train.csv holds the digit label; treat it as a factor
labels <- as.factor(train[,1])
train <- train[,-1]

# fit a 1000-tree random forest and predict the test set in the same call
rf <- randomForest(train, labels, xtest=test, ntree=1000)

# map the predicted factor levels back to digit labels and write the submission
predictions <- levels(labels)[rf$test$predicted]
write(predictions, file="rf_benchmark.csv", ncolumns=1)
Solutions:- For the training data set we take five random samples (20 percent each) and build five
different models. Since our data set is large, we then combine the results of the respective models
to validate the overall model accuracy. The approach can vary; for example, we could instead build
a model on 80 percent of the training data and keep 20 percent aside to validate the model.
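One simple way to combine the five models' per-image predictions is an unweighted majority vote. This Python sketch is illustrative only (the report's actual combination uses a weighted average, and the labels shown are hypothetical):

```python
from collections import Counter

def majority_vote(predictions):
    """Combine the per-model predicted labels for one test image by majority
    vote; ties resolve to the label seen first among the most common."""
    return Counter(predictions).most_common(1)[0][0]

# Five hypothetical models voting on one digit image:
print(majority_vote(["3", "3", "5", "3", "8"]))  # -> "3"
```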

Model-1 (Random Forest algorithm, sample=train.csv)
OOB estimate of error rate: 5.32%
                                     Confusion Matrix
      0     1     2     3       4      5       6       7        8      9 class.error
0   848     0     0     1       3      0       6       0        8      0             0.0207852
1     0   933     8     3       3      1       2       1        0      1              0.019958
2    10     2   765     6       4      0       5       7        9      2             0.0555556
3     2     3    20   812       3     18       2       9       17      4             0.0876405
4     2     1     0     0     787      0       6       3        0     23             0.0425791
5     7     2     1    20       0    687      13       0        4      7             0.0728745
6     9     2     1     0       5     12    762        0        2      0             0.0390921
7     1     6    12     1       6      2       1     870        5     14             0.0522876
8     2     6     8    16       4      8       2       3     744      11             0.0746269
9     5     4     1    11      16      3       2       7       10    745             0.0733831
                                                                    8400
Model-2 (Random Forest algorithm, sample=train1.csv)
OOB estimate of error rate: 5.14%
                                  Confusion matrix
      0     1     2     3      4      5       6        7      8      9 class.error
0   773     0     1     0      2      0       7        0      5      0          0.0190355
1     0   946     3     5      2      1       1        2      3      0          0.0176532
2     3     1   766     4     11      1       6       11      6      1           0.054321
3     2     4    17   767      1     14       2        2     16      8          0.0792317
4     5     1     1     0    774      0       8        0      2     20          0.0456227
5     7     5     1    19      3    732       3        1      5      9          0.0675159
6     7     2     0     1      2      6    821         0      2      0          0.0237812
7     1     3    12     2      7      0       0     879       5     20          0.0538213
8     2     9     4     9      4     14       6        0    716     15          0.0808729
9     7     3     2    14     18      1       1       14      7    794          0.0778165
                                                                  8400
Model-3 (Random Forest algorithm, sample=train2.csv)
OOB estimate of error rate: 5.63%
                                  Confusion matrix
      0     1     2     3     4       5       6        7     8      9 class.error
0   864     0     2     0     1       1       4        0     7      1          0.0181818
1     0   896     6     2     4       2       1        3     4      2           0.026087
2     6     4   810     6     8       2       8      11      5      0          0.0581395
3     5     3    18   788     1      14       3      11     18      5          0.0900693
4     3     3     3     1   714       2       4        2     5     21          0.0580475
5     7     3     0    18     2     709       9        1     6      4          0.0658762
6     7     2     0     0     4       9    790         0     3      0          0.0306749
7     1     7    11     0     6       0       0     838      3     24           0.058427
8     4    11     5    18     4       9       3        3   745     13          0.0858896
9     8     5     2    14    14       2       0      10      9    773          0.0764636
                                                                 8400
Model-4 (Random Forest algorithm)
OOB estimate of error rate: 5.35%
                                    Confusion matrix
      0     1     2     3      4        5       6        7      8      9 class.error
0   778     0     0     0      0        4       3        0      4      0           0.0139417
1     0   956     6     1      2        1       2        2      0      1            0.015448
2     7     5   759    10      8        0       7      12      10      3           0.0755177
3     1     3    12   812      1      24        1        7      9      4           0.0709382
4     1     3     0     0    788        0       6        3      2     22           0.0448485
5     6     4     1    19      3     710        8        1      7      6           0.0718954
6     7     2     1     0      2        5     800        0      2      0            0.023199
7     1     4     9     2      7        0       0     830       0     26           0.0557452
8     2     8     3    14      2      10        6        2    746     15           0.0767327
9     5     0     7    15     24        2       2      11      11    772           0.0906949
                                                                    8400
Model-5 (Random Forest algorithm)
OOB estimate of error rate: 5.30%
                                    Confusion matrix
      0     1     2     3       4       5       6        7      8      9 class.error
0   782     0     0     1       1       1       4        1      5      1          0.0175879
1     0   927     5     5       1       2       3        1      1      1          0.0200846
2     6     1   802     8       6       0       5      14       5      3          0.0564706
3     1     2     9   815       2     20        2        9     11      7           0.071754
4     1     3     1     0     776       0      12        2      5     21          0.0548112
5     6     4     2    12       4    696       10        0      5      6          0.0657718
6     6     1     5     0       3       8    796         0      2      0          0.0304507
7     1     9    14     1       7       1       0     825       4      8          0.0517241
8     0     8    11    15       8     11        7        0    727     17          0.0957711
9     5     2     0    12      12       4       1      11      13    809          0.0690449
                                                                    8400
Model Validation:-
Overall Model Accuracy:-

         Predicted                               OOB estimate of error rate: 3.32%
Actual           0      1      2      3      4        5      6        7     8       9 Class error
    0        4074       0      2      1      3        4     18        1    16       0 0.010898523
    1            0   4686     20      7      8        4     10        6     7       4 0.013888889
    2           14     12   3992     19     26        7     18       32    25       6 0.038304023
    3            3      3     44   4144      2      62       7       17    43      16 0.045381249
    4            4      8      4      1   3916        0     16        6    10      72 0.029972752
    5           21     14      2     30      5    3656      29        3    20      15 0.036627141
    6           24      5      3      0      9      25 4017           0     6       0 0.017608217
    7            7     18     37      7     20        1      0 4327         8      61 0.035443602
    8            5     30     18     45     20      28      18        3 3811       32 0.049625935
    9           20      5      9     47     51        6      2       31    20    4029 0.045260664
                                                                                42000
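Each class.error value in these matrices is simply the share of a digit's images that fall off the diagonal of its row. A quick Python check against the digit-1 row of the combined matrix above:

```python
def class_error(row, true_index):
    """Per-class error: fraction of the row's instances off the diagonal."""
    return 1 - row[true_index] / sum(row)

# Digit-1 row of the combined confusion matrix above:
row1 = [0, 4686, 20, 7, 8, 4, 10, 6, 7, 4]
print(round(class_error(row1, 1), 9))  # -> 0.013888889, matching the table
```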
Interpretation of the Model:-
The prediction of the model on the test data set is taken as the collective result of all five
models, with a weighted average used to determine the best-fitting result. Based on the confusion
matrices and the recursive pattern of the data set used to build the model, the average confidence
is 0.954. Overfitting is avoided here because the random forest algorithm builds each tree on
randomly drawn, unbiased data. Interestingly, a random forest can also handle missing values and
does not require pruning. In a random forest, roughly one-third of the samples (around 36%) are not
selected in a given bootstrap sample; these are called the out-of-bag (OOB) samples, and predictions
for them are made using the corresponding tree.



Bibliography:-

Multivariate Data Analysis, by Hair, Black & Tatham

http://www.webchem.science.ru.nl:8080/PRiNS/rF.pdf

http://people.revoledu.com/kardi/tutorial/DecisionTree/index.html

http://www.statmethods.net/interface/workspace.html

CUSTOMER CHURN PREDICTIONCUSTOMER CHURN PREDICTION
CUSTOMER CHURN PREDICTION
 
IRJET- Analysis of Music Recommendation System using Machine Learning Alg...
IRJET-  	  Analysis of Music Recommendation System using Machine Learning Alg...IRJET-  	  Analysis of Music Recommendation System using Machine Learning Alg...
IRJET- Analysis of Music Recommendation System using Machine Learning Alg...
 
Final Report
Final ReportFinal Report
Final Report
 
Algorithm ExampleFor the following taskUse the random module .docx
Algorithm ExampleFor the following taskUse the random module .docxAlgorithm ExampleFor the following taskUse the random module .docx
Algorithm ExampleFor the following taskUse the random module .docx
 
Water Quality Index Calculation of River Ganga using Decision Tree Algorithm
Water Quality Index Calculation of River Ganga using Decision Tree AlgorithmWater Quality Index Calculation of River Ganga using Decision Tree Algorithm
Water Quality Index Calculation of River Ganga using Decision Tree Algorithm
 

Mais de Amit Kumar

Churn model for telecom
Churn model for telecomChurn model for telecom
Churn model for telecomAmit Kumar
 
A case study on churn analysis1
A case study on churn analysis1A case study on churn analysis1
A case study on churn analysis1Amit Kumar
 
A project report on declined sales of citrus
A project report on declined sales of citrusA project report on declined sales of citrus
A project report on declined sales of citrusAmit Kumar
 
Big data analytics final
Big data analytics finalBig data analytics final
Big data analytics finalAmit Kumar
 
Designing of Dataware-house
Designing of Dataware-houseDesigning of Dataware-house
Designing of Dataware-houseAmit Kumar
 
Mc donalds part 2
Mc donalds part 2Mc donalds part 2
Mc donalds part 2Amit Kumar
 
Mc donalds part 1
Mc donalds part 1Mc donalds part 1
Mc donalds part 1Amit Kumar
 
Data analytics for marketing decision support
Data analytics for marketing decision supportData analytics for marketing decision support
Data analytics for marketing decision supportAmit Kumar
 
Planning & Budgeting for ITC retails
Planning & Budgeting for ITC retailsPlanning & Budgeting for ITC retails
Planning & Budgeting for ITC retailsAmit Kumar
 
Strategy to launch a basketball League
Strategy to launch a basketball LeagueStrategy to launch a basketball League
Strategy to launch a basketball LeagueAmit Kumar
 
A report on merger & acquisition of united
A report on merger & acquisition of unitedA report on merger & acquisition of united
A report on merger & acquisition of unitedAmit Kumar
 
Market Research Project on small car segment
Market Research Project on small car segmentMarket Research Project on small car segment
Market Research Project on small car segmentAmit Kumar
 
Fsa presentation
Fsa presentationFsa presentation
Fsa presentationAmit Kumar
 

Mais de Amit Kumar (14)

Churn model for telecom
Churn model for telecomChurn model for telecom
Churn model for telecom
 
A case study on churn analysis1
A case study on churn analysis1A case study on churn analysis1
A case study on churn analysis1
 
A project report on declined sales of citrus
A project report on declined sales of citrusA project report on declined sales of citrus
A project report on declined sales of citrus
 
BI Approach
BI ApproachBI Approach
BI Approach
 
Big data analytics final
Big data analytics finalBig data analytics final
Big data analytics final
 
Designing of Dataware-house
Designing of Dataware-houseDesigning of Dataware-house
Designing of Dataware-house
 
Mc donalds part 2
Mc donalds part 2Mc donalds part 2
Mc donalds part 2
 
Mc donalds part 1
Mc donalds part 1Mc donalds part 1
Mc donalds part 1
 
Data analytics for marketing decision support
Data analytics for marketing decision supportData analytics for marketing decision support
Data analytics for marketing decision support
 
Planning & Budgeting for ITC retails
Planning & Budgeting for ITC retailsPlanning & Budgeting for ITC retails
Planning & Budgeting for ITC retails
 
Strategy to launch a basketball League
Strategy to launch a basketball LeagueStrategy to launch a basketball League
Strategy to launch a basketball League
 
A report on merger & acquisition of united
A report on merger & acquisition of unitedA report on merger & acquisition of united
A report on merger & acquisition of united
 
Market Research Project on small car segment
Market Research Project on small car segmentMarket Research Project on small car segment
Market Research Project on small car segment
 
Fsa presentation
Fsa presentationFsa presentation
Fsa presentation
 

Internship Project Report, Predictive Modelling

  • 1. Digit Recognizer (Machine Learning) Internship Project Report. Submitted to: Persistent System Limited, Product Engineering Unit-1 for Data Mining, Pune. Submitted by: Amit Kumar, PGPBA, Praxis Business School, Kolkata. Project Mentor: Mr. Yogesh Badhe (Technical Specialist, Product Engineering Unit-1). Start Date of Internship: 16th July 2012. End Date of Internship: 15th October 2012. Report Date: 15th October 2012.
  • 2. Preface This report documents the work done during the summer internship at Persistent System Limited, Pune, on classifying handwritten digits, under the supervision of Mr. Yogesh Badhe. The report gives an overview of the tasks completed during the period of the internship, with technical details. The results obtained are then discussed and analyzed. I have tried my best to keep the report simple yet technically correct, and I hope I have succeeded in my attempt. Amit Kumar
  • 3. ACKNOWLEDGEMENT Simply put, I could not have done this work without the generous help I received from the Data Mining Team. The work culture at Persistent System Limited is really motivating. Everybody here is such a friendly and cheerful companion that work stress never comes in the way. I would especially like to thank Mr. Mukund Deshpande, who gave me this project to learn and understand the business implications of statistical algorithms. I am equally thankful to Mr. Yogesh Badhe, who helped me all the way from understanding the project to building the statistical model; he not only advised me on the project but also listened to my arguments in our discussions. I am also very thankful to Ms. Deepti Takale, who helped me a lot in absorbing the statistical concepts. Amit Kumar
  • 4. Abstract This report presents the three tasks completed during the summer internship at Persistent System Limited, which are listed below: 1. Understanding the problem objective and its business implications 2. Understanding the data and building the model 3. Evaluating the model. All these tasks were completed successfully, and the results met expectations. Every task needed a very systematic approach, starting from the behaviour of the data, through the application of the algorithm, to the evaluation of the model. The most challenging part was acquiring the domain knowledge needed to understand the behaviour of the data. Once the data had been prepared, we applied a statistical algorithm for model building; this is one of the major areas and really needs fundamental, conceptual knowledge of advanced statistics. Amit Kumar
  • 5. Introduction: This project is taken from Kaggle, a platform for predictive modelling and analytics competitions. Organizations and researchers post their data there, and statisticians and data scientists from all over the world compete to produce the best models.

Problem Statement: Each image of a handwritten digit is 28 pixels in height and 28 pixels in width, for a total of 784 pixels. Each pixel has a single pixel value associated with it, indicating the lightness or darkness of that pixel, with higher values meaning darker. The pixel value is an integer between 0 and 255, inclusive. Each pixel column in the training set has a name like pixelx, where x is an integer between 0 and 783, inclusive. To locate this pixel on the image, suppose that we have decomposed x as x = i * 28 + j, where i and j are integers between 0 and 27, inclusive. Then pixelx is located on row i and column j of a 28 x 28 matrix (indexing by zero).

Goal of the Competition: Take an image of a single handwritten digit and determine what that digit is.

Approach for the model building:

Develop the Analysis Plan: To establish the conceptual model, we need to understand the selected techniques and the model implementation issues. Here we will build a predictive model based on the random forest algorithm. Model consideration will be based upon the sample size and the required type of variables (metric versus non-metric).

Evaluation of Underlying Techniques: All multivariate techniques rely on underlying assumptions, both statistical and conceptual, that substantially affect their ability to represent multivariate relationships. For techniques based on statistical inference, the assumptions of multivariate normality, linearity, independence of the error terms, and equality of variances in a dependence relationship must all be met. Since our data is categorical, we do not need to verify linearity or any independence relation.
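As a quick illustration of the pixel-index decomposition described above (this snippet is not part of the original report; it is written in Python purely for clarity), the row and column of any pixel can be recovered with integer division:

```python
# Locate pixel x on the 28 x 28 image grid described in the problem
# statement: x = i * 28 + j, where i is the row and j is the column
# (both zero-indexed).
def pixel_position(x):
    i, j = divmod(x, 28)  # quotient = row, remainder = column
    return i, j

# pixel0 is the top-left corner, pixel783 the bottom-right corner.
print(pixel_position(0))    # (0, 0)
print(pixel_position(783))  # (27, 27)
```

This confirms that the 784 pixel columns map one-to-one onto the 28 x 28 grid.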
Estimate the Model and Assess Overall Model Fit: With the assumptions satisfied, the analysis proceeds to the actual estimation of the model and an assessment of overall model fit. In the estimation process we can choose options to meet specific characteristics of the data or to maximize the fit to the data. After the model is estimated, the overall model fit is evaluated to ascertain whether it achieves acceptable levels on statistical criteria. Many times, the model will be respecified in an attempt to achieve a better level of overall fit or explanation.

Interpret the Variate: With an acceptable level of model fit, interpreting the variate reveals the nature of the relationship. The interpretation of effects for individual variables is made by examining the estimated weight of each variable in the variate.

Validate the Model: Before accepting the results, we must subject them to one final set of diagnostic analyses that assess the generalizability of the results using the available validation methods. The attempt to validate the model is directed towards demonstrating that the results generalize to the total population. These diagnostic analyses add little to the interpretation of the results, but can be viewed as insurance that the results are truly descriptive of the data.
  • 6. Required statistical concepts for this project:

Data Mining: Also known as Knowledge Discovery in Databases (KDD), data mining is a field at the intersection of computer science and statistics that attempts to discover patterns in large data sets. It utilizes methods drawn from artificial intelligence, machine learning, statistics, and database systems. The overall goal of the data mining process is to extract information from a data set and transform it into an understandable structure for future use.

Decision Tree: A decision tree can be used to predict a pattern or to classify the class of a data point, and is commonly used in data mining. The goal of a decision tree algorithm is to create a model that predicts the target variable based upon several input variables. In a decision tree, each leaf represents a value of the target variable given the values of the input variables represented by the path from the root to that leaf. A tree can be learned by splitting the source set into subsets based on an attribute value test. This process, repeated on each derived subset, is called recursive partitioning. The usual approach is top-down induction.

Decision trees used in data mining are of two main types:
1. Classification Tree: when the predicted outcome is the class to which the data belongs.
2. Regression Tree: when the predicted outcome can be considered a real number.

The term classification and regression tree (CART) analysis is an umbrella term that refers to both of the above procedures. Some other techniques construct more than one decision tree, such as bagging, random forests, and boosted trees. We have used the random forest technique for this project. The algorithms used for constructing decision trees usually work top-down by choosing, at each step, the variable that best splits the set of items. "Best" is defined by how well the variable splits the set into homogeneous subsets that share the same value of the target variable.
Different algorithms use different formulae for measuring "best". These are the mathematical functions through which we measure impurity.
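One common impurity measure is the Gini index. A minimal Python sketch (illustrative only, not taken from the report) of how a node's impurity can be scored:

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a set of class labels: 1 - sum_k(p_k^2)."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

# A pure node has impurity 0; a 50/50 two-class node has impurity 0.5.
print(gini([1, 1, 1, 1]))  # 0.0
print(gini([0, 0, 1, 1]))  # 0.5
```

A split is scored by the weighted impurity of the child nodes it produces; the split that reduces impurity the most is chosen as "best".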
  • 7. Random Forest: Recently there has been a lot of interest in "ensemble learning" methods, which generate many classifiers and aggregate their results. Two well-known methods are boosting and bagging of classification trees. In boosting, successive trees give extra weight to points incorrectly predicted by earlier predictors; in the end, a weighted vote is taken for prediction. In bagging, successive trees do not depend on earlier trees: each is independently constructed using a bootstrap sample of the data set, and in the end a simple majority vote is taken for prediction. Breiman (2001) proposed random forests, which add an additional layer of randomness to bagging. In addition to constructing each tree using a different bootstrap sample of the data, random forests change how the classification or regression trees are constructed. In standard trees, each node is split using the best split among all variables. In a random forest, each node is split using the best among a subset of predictors randomly chosen at that node. This somewhat counterintuitive strategy turns out to perform very well compared to many other classifiers, including discriminant analysis, support vector machines, and neural networks, and is robust against overfitting. In addition, it is very user-friendly in the sense that it has only two parameters (the number of variables in the random subset at each node and the number of trees in the forest), and is usually not very sensitive to their values.

The algorithm: The random forests algorithm (for both classification and regression) is as follows: 1. Draw ntree bootstrap samples from the original data. 2. For each of the bootstrap samples, grow an unpruned classification or regression tree, with the following modification: at each node, rather than choosing the best split among all predictors, randomly sample mtry of the predictors and choose the best split from among those variables.
(Bagging can be thought of as the special case of random forests obtained when mtry = p, the number of predictors.) 3. Predict new data by aggregating the predictions of the ntree trees (i.e., majority vote for classification, average for regression). An estimate of the error rate can be obtained from the training data as follows: 1. At each bootstrap iteration, predict the data not in the bootstrap sample (what Breiman calls "out-of-bag", or OOB, data) using the tree grown with that bootstrap sample. 2. Aggregate the OOB predictions. (On average, each data point is out-of-bag around 36% of the time, so aggregate these predictions.) Calculate the error rate, and call it the OOB estimate of the error rate.
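The "around 36%" figure follows directly from the bootstrap itself: the probability that a given point is never drawn in n draws with replacement is (1 - 1/n)^n, which converges to e^-1 ≈ 0.368. A quick Python check (an illustration added here, not part of the original report):

```python
import math

def oob_fraction(n):
    """Probability that a given point is absent from a bootstrap sample
    of size n drawn with replacement: (1 - 1/n)^n."""
    return (1 - 1 / n) ** n

# For a sample of 8,400 points (the size used in this project's models),
# the out-of-bag fraction is already essentially e^-1.
print(oob_fraction(8400))  # close to 0.3679
print(math.exp(-1))        # 0.3678794...
```

So each tree sees roughly 63% of the training points, and the remaining ~37% serve as its out-of-bag test set.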
  • 8. Source Code:

# makes the random forest submission
library(randomForest)
train <- read.csv("../data/train.csv", header=TRUE)
test <- read.csv("../data/test.csv", header=TRUE)
labels <- as.factor(train[,1])
train <- train[,-1]
rf <- randomForest(train, labels, xtest=test, ntree=1000)
predictions <- levels(labels)[rf$test$predicted]
write(predictions, file="rf_benchmark.csv", ncolumns=1)
  • 9. Solutions: From the train data set we take five random samples (20 percent each) and build five different models. Since our data set is large, we need to combine the respective model results to validate the overall model accuracy. Our approach could vary; for example, we could instead build a model on 80 percent of the train data and keep 20 percent of the data aside to validate the model. Model-1 (Random Forest Algorithm, sample = train.csv)
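Combining the five models' predictions, as described above, can be done with a simple per-image majority vote. A minimal Python sketch (the five prediction lists below are hypothetical, not the report's actual outputs):

```python
from collections import Counter

def majority_vote(*model_predictions):
    """Per-position majority vote across several models' predicted digits.
    Each argument is one model's list of predictions for the test images."""
    return [Counter(votes).most_common(1)[0][0]
            for votes in zip(*model_predictions)]

# Hypothetical predictions from five models for three test images:
m1 = [3, 7, 1]
m2 = [3, 9, 1]
m3 = [5, 7, 1]
m4 = [3, 7, 2]
m5 = [3, 7, 1]
print(majority_vote(m1, m2, m3, m4, m5))  # [3, 7, 1]
```

A weighted average of class votes, as the report's interpretation section mentions, works the same way except that each model's vote is scaled by a weight before counting.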
  • 10. OOB estimate of error rate: 5.32%

Confusion matrix (rows = actual digit, columns = predicted digit; 8,400 samples):

       0    1    2    3    4    5    6    7    8    9   class.error
0    848    0    0    1    3    0    6    0    8    0   0.0207852
1      0  933    8    3    3    1    2    1    0    1   0.019958
2     10    2  765    6    4    0    5    7    9    2   0.0555556
3      2    3   20  812    3   18    2    9   17    4   0.0876405
4      2    1    0    0  787    0    6    3    0   23   0.0425791
5      7    2    1   20    0  687   13    0    4    7   0.0728745
6      9    2    1    0    5   12  762    0    2    0   0.0390921
7      1    6   12    1    6    2    1  870    5   14   0.0522876
8      2    6    8   16    4    8    2    3  744   11   0.0746269
9      5    4    1   11   16    3    2    7   10  745   0.0733831
  • 12. OOB estimate of error rate: 5.14% (Model-2)

Confusion matrix (rows = actual digit, columns = predicted digit; 8,400 samples):

       0    1    2    3    4    5    6    7    8    9   class.error
0    773    0    1    0    2    0    7    0    5    0   0.0190355
1      0  946    3    5    2    1    1    2    3    0   0.0176532
2      3    1  766    4   11    1    6   11    6    1   0.054321
3      2    4   17  767    1   14    2    2   16    8   0.0792317
4      5    1    1    0  774    0    8    0    2   20   0.0456227
5      7    5    1   19    3  732    3    1    5    9   0.0675159
6      7    2    0    1    2    6  821    0    2    0   0.0237812
7      1    3   12    2    7    0    0  879    5   20   0.0538213
8      2    9    4    9    4   14    6    0  716   15   0.0808729
9      7    3    2   14   18    1    1   14    7  794   0.0778165
  • 13. Model-3 (Random Forest Algorithm, sample = train2.csv)
  • 14. OOB estimate of error rate: 5.63%

Confusion matrix (rows = actual digit, columns = predicted digit; 8,400 samples):

       0    1    2    3    4    5    6    7    8    9   class.error
0    864    0    2    0    1    1    4    0    7    1   0.0181818
1      0  896    6    2    4    2    1    3    4    2   0.026087
2      6    4  810    6    8    2    8   11    5    0   0.0581395
3      5    3   18  788    1   14    3   11   18    5   0.0900693
4      3    3    3    1  714    2    4    2    5   21   0.0580475
5      7    3    0   18    2  709    9    1    6    4   0.0658762
6      7    2    0    0    4    9  790    0    3    0   0.0306749
7      1    7   11    0    6    0    0  838    3   24   0.058427
8      4   11    5   18    4    9    3    3  745   13   0.0858896
9      8    5    2   14   14    2    0   10    9  773   0.0764636
  • 15. Model-4 (Random Forest Algorithm)
  • 16. OOB estimate of error rate: 5.35%

Confusion matrix (rows = actual digit, columns = predicted digit; 8,400 samples):

       0    1    2    3    4    5    6    7    8    9   class.error
0    778    0    0    0    0    4    3    0    4    0   0.0139417
1      0  956    6    1    2    1    2    2    0    1   0.015448
2      7    5  759   10    8    0    7   12   10    3   0.0755177
3      1    3   12  812    1   24    1    7    9    4   0.0709382
4      1    3    0    0  788    0    6    3    2   22   0.0448485
5      6    4    1   19    3  710    8    1    7    6   0.0718954
6      7    2    1    0    2    5  800    0    2    0   0.023199
7      1    4    9    2    7    0    0  830    0   26   0.0557452
8      2    8    3   14    2   10    6    2  746   15   0.0767327
9      5    0    7   15   24    2    2   11   11  772   0.0906949
  • 18. OOB estimate of error rate: 5.30% (Model-5)

Confusion matrix (rows = actual digit, columns = predicted digit; 8,400 samples):

       0    1    2    3    4    5    6    7    8    9   class.error
0    782    0    0    1    1    1    4    1    5    1   0.0175879
1      0  927    5    5    1    2    3    1    1    1   0.0200846
2      6    1  802    8    6    0    5   14    5    3   0.0564706
3      1    2    9  815    2   20    2    9   11    7   0.071754
4      1    3    1    0  776    0   12    2    5   21   0.0548112
5      6    4    2   12    4  696   10    0    5    6   0.0657718
6      6    1    5    0    3    8  796    0    2    0   0.0304507
7      1    9   14    1    7    1    0  825    4    8   0.0517241
8      0    8   11   15    8   11    7    0  727   17   0.0957711
9      5    2    0   12   12    4    1   11   13  809   0.0690449
  • 20. Overall Model Accuracy: OOB estimated error: 3.32%

Confusion matrix (rows = actual digit, columns = predicted digit; 42,000 samples):

        0     1     2     3     4     5     6     7     8     9   class.error
0    4074     0     2     1     3     4    18     1    16     0   0.010898523
1       0  4686    20     7     8     4    10     6     7     4   0.013888889
2      14    12  3992    19    26     7    18    32    25     6   0.038304023
3       3     3    44  4144     2    62     7    17    43    16   0.045381249
4       4     8     4     1  3916     0    16     6    10    72   0.029972752
5      21    14     2    30     5  3656    29     3    20    15   0.036627141
6      24     5     3     0     9    25  4017     0     6     0   0.017608217
7       7    18    37     7    20     1     0  4327     8    61   0.035443602
8       5    30    18    45    20    28    18     3  3811    32   0.049625935
9      20     5     9    47    51     6     2    31    20  4029   0.045260664
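The overall error rate reported with a confusion matrix is simply one minus the trace of the matrix divided by the total count. A small Python sketch, using a toy 3-class matrix rather than the report's data (illustrative only):

```python
def error_rate(cm):
    """Overall error rate from a square confusion matrix (rows = actual,
    columns = predicted): 1 - (sum of diagonal) / (sum of all cells)."""
    total = sum(sum(row) for row in cm)
    correct = sum(cm[i][i] for i in range(len(cm)))
    return 1 - correct / total

# Toy 3-class confusion matrix: 274 of 300 points on the diagonal.
cm = [[90, 5, 5],
      [4, 92, 4],
      [6, 2, 92]]
print(round(error_rate(cm), 4))  # 0.0867
```

The per-class error shown in the last column of each matrix is computed the same way, but row by row: off-diagonal count in the row divided by the row total.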
  • 21. Interpretation of the Model: The prediction of the model on the test data set is taken as the collective result of all five models, and a weighted average is taken to determine the best-fitting result. Based upon the confusion matrices and the recursive pattern of the data set used to build the model, the average confidence is 0.954. Here we have avoided overfitting, because in the random forest algorithm the data taken to build the model is random and unbiased. An interesting property is that random forest can handle missing values and does not require pruning. In random forest, roughly 30-35% of the samples are not selected in a given bootstrap; we call these the out-of-bag (OOB) samples. Using the OOB samples as input to the corresponding trees, predictions are made.

Bibliography:
- Multivariate Data Analysis, by Hair, Black & Tatham
- http://www.webchem.science.ru.nl:8080/PRiNS/rF.pdf
- http://people.revoledu.com/kardi/tutorial/DecisionTree/index.html
- http://www.statmethods.net/interface/workspace.html