10. Machine Learning - Evaluation Metrics
● Confusion Matrix
○ Evaluates the performance of a classification model.
● Accuracy = (TP + TN) / Total Samples
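A minimal sketch of this accuracy computation from confusion-matrix counts (the counts below are made up for illustration):

# Hypothetical confusion-matrix counts for a binary classifier
tp, tn, fp, fn = 50, 35, 5, 10
total = tp + tn + fp + fn
accuracy = (tp + tn) / total      # correct predictions / all samples
print(accuracy)                   # 0.85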
11. Machine Learning - Evaluation Metrics
● Root Mean Squared Error
○ Spread of the predicted y-values about the original y-values.
RMSE = √( (1/N) x Σ (Ŷi − Yi)² )
where N = total samples, Ŷi = predicted value, Yi = actual value.
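A minimal NumPy sketch of this formula (the actual/predicted arrays below are made up for illustration):

import numpy as np

y_actual = np.array([3.0, 5.0, 2.5, 7.0])      # Yi (actual)
y_predicted = np.array([2.8, 5.4, 2.9, 6.5])   # Ŷi (predicted)
n = len(y_actual)
rmse = np.sqrt(np.sum((y_predicted - y_actual) ** 2) / n)
print(rmse)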
14. Neural Nets - Supervised
Input → Output → Application
● Home Features → Cost → Real Estate
● Ad, User Information → Click on Ad? → Online Advertising
● Image → Class (1...1000) → Photo Tagging
● Audio → Text → Speech Recognition
● English → Chinese → Machine Translation
15. Computation Graph
J(a, b, c) = 3(a + bc)
U = bc
V = a + U
J = 3V
Substitution: b, c → U = b x c; a, U → V = a + U; V → J = 3V
Input: a = 5, b = 3, c = 2
With these inputs the forward pass gives U = 6, V = 11, J = 33.
How does J change if we change V a bit? Since J = 3V, ∂J/∂V = 3.
How does J change if we change a a bit? The path is a → V → J, so by the chain rule:
∂J/∂a = (∂J/∂V) x (∂V/∂a) = 3 x 1 = 3
How does J change if we change b a bit? The path is b → U → V → J, so:
∂J/∂b = (∂J/∂V) x (∂V/∂U) x (∂U/∂b) = 3 x 1 x c = 6
Forward pass (→) computes the values of the nodes; backward pass (←) computes the derivatives.
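A minimal plain-Python sketch of the forward and backward passes through this graph (the values follow the chain-rule derivation above):

# Forward pass
a, b, c = 5, 3, 2
u = b * c        # U = 6
v = a + u        # V = 11
j = 3 * v        # J = 33

# Backward pass (chain rule)
dj_dv = 3                        # J = 3V
dj_da = dj_dv * 1                # V = a + U, so dV/da = 1 -> 3
dj_db = dj_dv * 1 * c            # U = b*c, so dU/db = c   -> 6
dj_dc = dj_dv * 1 * b            # dU/dc = b               -> 9
print(j, dj_da, dj_db, dj_dc)    # 33 3 6 9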
17. Hyperparameters
● There are a number of parameters that can be tuned while building your neural network.
○ Number of Hidden Layers
○ Epochs
○ Loss Function
○ Optimization Function
○ Weight Initialization
○ Activation Functions
○ Batch Size
○ Learning Rate
18. Weight Initialization
● If the weights in a network start too small, then the signal shrinks as it
passes through each layer until it’s too tiny to be useful.
● If the weights in a network start too large, then the signal grows as it
passes through each layer until it’s too massive to be useful.
● Xavier Initialization scales the initial weights so that the variance of the signal stays roughly constant from layer to layer.
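A minimal NumPy sketch of Xavier initialization for a fully connected layer; the layer sizes are made up, and this shows just one common variant (scaling by 1/sqrt(fan_in)):

import numpy as np

def xavier_init(fan_in, fan_out):
    # Scale weights by 1/sqrt(fan_in) so the signal variance stays
    # roughly constant from layer to layer.
    return np.random.randn(fan_in, fan_out) * np.sqrt(1.0 / fan_in)

w = xavier_init(784, 256)   # e.g. weights for a hypothetical 784 -> 256 layer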
27. Learning Rate
● Decaying the learning rate over time has been observed to speed up the learning process/convergence.
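A minimal sketch of one common decay schedule (1/t decay); the initial rate and decay factor are made-up values:

initial_lr = 0.1     # hypothetical starting learning rate
decay_rate = 0.01    # hypothetical decay factor

def learning_rate(epoch):
    # The learning rate shrinks as training progresses.
    return initial_lr / (1.0 + decay_rate * epoch)

for epoch in (0, 10, 100, 1000):
    print(epoch, learning_rate(epoch))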
30. Learning Rate- Special Case
Wi = Wi-1 − Alpha x Slope
On a convex curve the slope shrinks as the weight approaches the minimum, so the update shrinks automatically: the step size is pseudo self-adaptive even with a fixed Alpha.
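A minimal sketch of this effect on the convex curve f(w) = w^2, where the slope is 2w; even with a fixed alpha, the update shrinks as w approaches the minimum at w = 0:

alpha = 0.1
w = 5.0                      # start away from the minimum
for step in range(5):
    slope = 2 * w            # derivative of f(w) = w**2
    update = alpha * slope
    w = w - update
    print(step, round(w, 4), "step size:", round(update, 4))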
36. Activation Functions - Standards
● In practice, Tanh outperforms Sigmoid for internal (hidden) layers.
○ Tanh outputs are centered around mean 0; Sigmoid outputs are centered around mean 0.5.
○ In ML, we tend to center our data to avoid any kind of bias behaviour, and zero-centered activations have the same benefit.
● Rule of thumb: ReLU for hidden layers generally performs well.
● Avoid Sigmoid for hidden layers.
● Sigmoid is a good candidate for the output layer of a binary classification problem.
● The identity function for hidden layers makes no sense: stacking linear layers collapses into a single linear layer.
37. Activation Functions - ReLU or Tanh ?
● ReLU > Tanh
○ ReLU avoids the vanishing gradient problem.
○ Is it the best? No.
39. Activation Functions - Why ?
● Activation functions must be nonlinear so the network can model more advanced, non-linear relationships.
● They should be differentiable so that backpropagation can compute gradients.
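A minimal NumPy sketch of three common activation functions and the derivatives that backpropagation relies on:

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1 - s)

def tanh_grad(x):
    return 1 - np.tanh(x) ** 2

def relu(x):
    return np.maximum(0.0, x)

def relu_grad(x):
    return (x > 0).astype(float)   # 1 for positive inputs, 0 otherwise

x = np.array([-2.0, 0.5, 3.0])
print(sigmoid(x), sigmoid_grad(x))
print(np.tanh(x), tanh_grad(x))
print(relu(x), relu_grad(x))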
41. Batch Size
● The batch size is the number of samples that are passed through the network at a time.
● Advantages
○ Your machine might not be able to fit all the data in memory at once.
○ Updating the weights batch by batch lets the model start to generalize quickly.
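A minimal NumPy sketch of iterating over a dataset in mini-batches; the dataset shape and batch size are made up:

import numpy as np

X = np.random.randn(1000, 20)             # hypothetical dataset: 1000 samples, 20 features
y = np.random.randint(0, 2, size=1000)    # hypothetical binary labels
batch_size = 32

indices = np.random.permutation(len(X))   # shuffle once per epoch
for start in range(0, len(X), batch_size):
    batch = indices[start:start + batch_size]
    X_batch, y_batch = X[batch], y[batch]
    # forward pass, loss, backward pass and weight update would go here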
47. Training - Backward Propagation
The goal is to update each of the weights in the network so that they cause the actual output to be closer to the target output.
48. Training - Backward Propagation
Error = |Y − Ŷ|, the difference between the target output Y and the actual output Ŷ.
∂Error/∂wi = partial derivative of the error w.r.t. weight wi.
For an output-layer weight w4, the path is w4 → O1 → out(O1) → Error, so:
∂Error/∂w4 = (∂Error/∂out(O1)) x (∂out(O1)/∂O1) x (∂O1/∂w4)
For a hidden-layer weight w1, the path is w1 → H1 → out(H1) → ... → Error, so:
∂Error/∂w1 = (∂Error/∂out(H1)) x (∂out(H1)/∂H1) x (∂H1/∂w1)
Here O1 and H1 are the weighted input sums of the output and hidden nodes, and out(O1), out(H1) are their activated outputs.
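A minimal sketch of this chain rule for a single sigmoid output neuron. It assumes the notation above (O1 = weighted input sum, out(O1) = activated output), uses a squared-error loss for differentiability rather than the |Y − Ŷ| form, and the numeric values are made up:

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Hypothetical values for one training example
h_out = 0.6     # activated output of the hidden node feeding the output node
w4 = 0.4        # weight on that connection
target = 1.0

# Forward pass
o1_net = w4 * h_out          # O1: weighted input sum of the output node
o1_out = sigmoid(o1_net)     # out(O1): activated output
error = 0.5 * (target - o1_out) ** 2

# Backward pass: dError/dw4 = dError/d(out) x d(out)/d(net) x d(net)/dw4
d_error_d_out = -(target - o1_out)
d_out_d_net = o1_out * (1 - o1_out)   # derivative of the sigmoid
d_net_d_w4 = h_out
d_error_d_w4 = d_error_d_out * d_out_d_net * d_net_d_w4
print(d_error_d_w4)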
52. Regularization- Dropout
● Dropout refers to ignoring a randomly chosen set of units (i.e. neurons) during the training phase.
● Avoids co-dependency amongst neurons during training.
● Dropout is applied with a given probability (20%-50%) in each weight update cycle.
● Dropout at each layer of the network has shown good results.
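A minimal NumPy sketch of (inverted) dropout applied to one layer's activations during training; the 30% drop probability is an assumed value inside the 20%-50% range above:

import numpy as np

def dropout(activations, drop_prob=0.3):
    # Zero out each unit with probability drop_prob and rescale the
    # survivors so the expected activation stays the same.
    keep_prob = 1.0 - drop_prob
    mask = (np.random.rand(*activations.shape) < keep_prob) / keep_prob
    return activations * mask

hidden = np.random.randn(4, 10)        # hypothetical layer activations
hidden_train = dropout(hidden, 0.3)    # applied during training only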