3. Problem: Digit Recognizer
Identify handwritten digits (0–9) based on greyscale images.
Sample images
4. Statement
Each image is 28 pixels in height and 28 pixels in width, for a
total of 784 pixels. Each pixel has a single pixel-value
associated with it, indicating the lightness or darkness
of that pixel, with higher numbers meaning darker. This
pixel-value is an integer between 0 and 255, inclusive.
pixel0 pixel1 pixel2 ... pixel27
pixel28 pixel29 pixel30 ... pixel55
| | | ... |
pixel756 pixel757 pixel758 ... pixel783
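The row-major layout above means pixel k of the flattened vector sits at row k // 28, column k % 28 of the image. A minimal sketch, assuming NumPy:

```python
import numpy as np

# A flattened image is a length-784 vector; reshape recovers the grid.
flat = np.arange(784)          # stand-in values: pixel0..pixel783
image = flat.reshape(28, 28)   # 28x28 image

print(image[0, 0])    # pixel0
print(image[1, 0])    # pixel28, the first pixel of the second row
print(image[27, 27])  # pixel783
```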
5. Statement
The training data set has 785 columns. The first
column, called "label", is the digit that was drawn by the
user. The remaining columns contain the pixel-values of
the associated image.
The test data set is the same as the training set, except
that it does not contain the "label" column.
The goal of the problem is to predict the label of each
image in the test data set.
6. Methods used to solve the problem
Random Forest
Support Vector Machine (SVM)
K-Nearest Neighbors (KNN)
7. Random Forest
Ensemble of decision trees
Each tree is trained on a bootstrapped sample of the
original data set
Each time a node is split, only a randomly chosen subset
of the dimensions is considered for splitting
Each tree is fully grown and not pruned
When a new input is entered into the system, it is run down
all of the trees. The result may be either an average or
weighted average of all of the terminal nodes that are
reached, or, in the case of categorical variables, a
majority vote
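The procedure above can be sketched with scikit-learn (an assumption; the deck does not name a library, and the built-in 8x8 `load_digits` set stands in for the Kaggle 28x28 data):

```python
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# load_digits is a small stand-in for the Kaggle digit data
X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# 500 unpruned trees, each fit on a bootstrap sample; at every split
# only a random subset of the features (max_features) is considered,
# and classification is decided by a majority vote across trees
clf = RandomForestClassifier(n_estimators=500, random_state=0)
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))
```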
9. Support Vector Machine
In an SVM model, the original objects (training data) are
treated as points in a space (the input space)
These are mapped (rearranged) into a new space (the feature
space) using mathematical functions called kernels
After mapping, objects of separate categories are divided
by a clear gap that is as wide as possible
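A minimal SVM sketch, again assuming scikit-learn with `load_digits` standing in for the Kaggle data; the RBF kernel and C=1 match the settings reported later in the deck:

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# RBF kernel maps the points into the feature space; C controls the
# trade-off between a wide margin and misclassified training points
clf = SVC(kernel="rbf", C=1)
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))
```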
10. K-Nearest Neighbors
Basic idea
If it walks like a duck and quacks like a duck, then it is probably a duck
There are three key elements:
a set of labeled objects (e.g., a set of stored records)
a distance or similarity metric to compute distance between objects,
and
the value of k, the number of nearest neighbors.
To classify an unlabeled object:
the distance of this object to the labeled objects is computed,
its k-nearest neighbors are identified, and
the class labels of these nearest neighbors are then used to
determine the class label of the object.
11. Results
Random Forest with 500 trees gave 97%
accuracy on the test data.
SVM with an RBF kernel and C=1 gave 97.71%
accuracy on the test data.
KNN with k=10 gave 96% accuracy.
13. Problem
The sinking of the RMS Titanic is one of the most
infamous shipwrecks in history.
One of the reasons that the shipwreck led to such loss
of life was that there were not enough lifeboats for the
passengers and crew. Although there was some
element of luck involved in surviving the sinking, some
groups of people were more likely to survive than
others, such as women, children, and the upper-class.
In this project, we analyze what sorts of people
were likely to survive. In particular, the tools of
machine learning are applied to predict which
passengers survived the tragedy.
14. Statement
The historical data has been split into two
groups, a 'training set' and a 'test set'. For the
training set, the outcome of whether or not each
passenger survived the sinking (0 for deceased,
1 for survived) is provided.
The goal of the problem is to predict the
outcome for each passenger in the test set.
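A minimal sketch of the train-then-predict setup, assuming scikit-learn. The feature columns and all values here are hypothetical stand-ins for the real passenger records, chosen to mirror the groups the problem statement mentions (sex, age, class):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Hypothetical training rows: (pclass, is_female, age) -> survived?
X_train = np.array([[1, 1, 29], [3, 0, 22], [2, 1, 8],
                    [3, 0, 40], [1, 0, 50], [3, 1, 19]])
y_train = np.array([1, 0, 1, 0, 0, 1])  # 0 = deceased, 1 = survived

clf = RandomForestClassifier(n_estimators=300, random_state=0)
clf.fit(X_train, y_train)

# Predict the outcome for an unseen "test set" passenger
print(int(clf.predict([[3, 0, 25]])[0]))
```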
15. Methods used to solve the problem
• Random Forest
• Support Vector Machine (SVM)
16. Results
Random Forest with 300 trees gave 77.9%
accuracy on the test data.
SVM with an RBF kernel and C=1 gave 77.7%
accuracy on the test data.