Machine learning workshop, session 4.
- Generalization in Machine Learning
- Overfitting and Underfitting
- Algorithms by Similarity
- Real Application
- People to follow
2. Table of contents
1. Recap
2. Generalization in Machine Learning
3. Overfitting and Underfitting
4. Algorithms by Similarity
5. Real Application
6. People to follow
4. Recap
● Training, validation and test data sets.
● Learning Style
○ Supervised
○ Unsupervised
○ Semi-Supervised
● Algorithms by Similarity
○ Regression Algorithms
○ Instance-based Algorithms
○ Regularization Algorithms
○ Decision Tree Algorithms
6. Recap
Decision trees
Possible applications in PlantMiner:
For a searcher: based on previous quotes,
identify an item that is usually hired along
with another.
● Suggest the item.
● Offer a discount to add the suggested item.
For a supplier: identify suppliers that are
likely to churn at the next subscription renewal.
8. Induction and deduction
Induction refers to learning general concepts
from specific examples, which is exactly the
problem that supervised machine learning
aims to solve.
Deduction works the other way around: it
derives specific conclusions from general
rules.
9. Induction and deduction
The goal of a good machine learning model is to
generalize well from the training data to any data
from the problem domain.
This allows us to make predictions in the future
on data the model has never seen.
11. Overfitting
In machine learning, one of the most common
tasks is to fit a "model" to a set of training data,
so that it can make reliable predictions on
new, unseen data.
In overfitting, a statistical model describes
random error or noise instead of the underlying
relationship.
The green line represents an overfitted model and the black line
represents a regularised model. While the green line best follows
the training data, it depends too heavily on that data and is likely
to have a higher error rate on new, unseen data than the black
line.
12. Overfitting
A model that has been overfit has poor
predictive performance, as it overreacts to minor
fluctuations in the training data.
Noisy (roughly linear) data is fitted to both linear and polynomial
functions. Although the polynomial function is a perfect fit, the
linear version can be expected to generalize better. In other
words, if the two functions were used to extrapolate the data
beyond the fit data, the linear function would make better
predictions.
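The contrast described above can be reproduced numerically. The sketch below (plain NumPy; the data, seed, and polynomial degree are invented for illustration) fits noisy, roughly linear data with both a straight line and a high-degree polynomial, then compares their errors on fresh data drawn from the same underlying line:

```python
import numpy as np

rng = np.random.default_rng(0)

# Roughly linear data with noise (hypothetical example).
x = np.linspace(0, 1, 15)
y = 2.0 * x + 1.0 + rng.normal(scale=0.2, size=x.size)

# Held-out points drawn from the same underlying line.
x_test = np.linspace(0, 1, 50)
y_test = 2.0 * x_test + 1.0 + rng.normal(scale=0.2, size=x_test.size)

# Degree-1 fit (simple) vs degree-9 fit (excessively complex
# relative to only 15 observations).
linear = np.polyfit(x, y, deg=1)
poly = np.polyfit(x, y, deg=9)

def mse(coeffs, xs, ys):
    """Mean squared error of a polynomial fit on the given points."""
    return float(np.mean((np.polyval(coeffs, xs) - ys) ** 2))

train_linear, train_poly = mse(linear, x, y), mse(poly, x, y)
test_linear, test_poly = mse(linear, x_test, y_test), mse(poly, x_test, y_test)

# The polynomial typically hugs the training points more closely,
# but that is exactly the noise-chasing behaviour described above.
print(f"train MSE  linear: {train_linear:.3f}  degree-9: {train_poly:.3f}")
print(f"test  MSE  linear: {test_linear:.3f}  degree-9: {test_poly:.3f}")
```

Because the degree-9 model contains the degree-1 model as a special case, its training error can only be lower; the interesting comparison is the error on the held-out points.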
13. Overfitting
Overfitting occurs when a model is excessively
complex, such as having too many parameters
relative to the number of observations.
Overfitting/overtraining in supervised learning (e.g., neural
network). Training error is shown in blue, validation error in red,
both as a function of the number of training cycles. If the
validation error increases (positive slope) while the training error
steadily decreases (negative slope), then a situation of overfitting
may have occurred. The best predictive and fitted model would
be where the validation error has its global minimum.
14. Underfitting
Underfitting occurs when a statistical model or machine
learning algorithm cannot capture the underlying trend of
the data.
It occurs when the model or algorithm does not fit the data
enough. Underfitting arises when the model or algorithm
shows low variance but high bias (in contrast to overfitting,
which arises from high variance and low bias). It is often the
result of an excessively simple model.
Underfitting would occur, for example, when fitting a linear
model to non-linear data.
Such a model would have poor predictive performance.
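The linear-model-on-non-linear-data case mentioned above is easy to demonstrate. In this sketch (plain NumPy; the quadratic data and seed are invented for illustration), the straight line's error stays large even on its own training data, which is the signature of high bias:

```python
import numpy as np

rng = np.random.default_rng(1)

# Clearly non-linear (quadratic) data -- hypothetical example.
x = np.linspace(-3, 3, 40)
y = x ** 2 + rng.normal(scale=0.3, size=x.size)

# A straight line cannot capture the curvature: high bias, underfitting.
line = np.polyfit(x, y, deg=1)
quad = np.polyfit(x, y, deg=2)

def mse(coeffs, xs, ys):
    """Mean squared error of a polynomial fit on the given points."""
    return float(np.mean((np.polyval(coeffs, xs) - ys) ** 2))

# Unlike overfitting, underfitting is visible on the training set itself:
# the linear model's training error remains far above the noise level.
print(f"train MSE  linear: {mse(line, x, y):.3f}  quadratic: {mse(quad, x, y):.3f}")
```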
15. Limiting Overfitting
There are two important techniques that you can use
when evaluating machine learning algorithms to limit
overfitting:
● Use a resampling technique to estimate model
accuracy.
● Hold back a validation dataset.
16. Limiting Overfitting
Resampling
The most popular resampling technique is k-fold cross-validation. It allows you to train and test your
model k times on different subsets of the training data and build up an estimate of the performance of a
machine learning model on unseen data.
Validation dataset
A validation dataset is simply a subset of your training data that you hold back from your machine
learning algorithms until the very end of your project. After you have selected and tuned your
machine learning algorithms on your training dataset you can evaluate the learned models on the
validation dataset to get a final objective idea of how the models might perform on unseen data.
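The k-fold procedure described above can be sketched in a few lines. This toy version (plain NumPy; the data, seed, and the choice of a degree-1 polynomial as the model are invented for illustration) splits the indices into k folds, trains on k−1 of them, tests on the remaining one, and averages the scores:

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy regression data (hypothetical; any model and data could be substituted).
x = rng.uniform(0, 1, 60)
y = 3.0 * x + rng.normal(scale=0.1, size=x.size)

def kfold_mse(x, y, k=5):
    """Estimate out-of-sample MSE of a degree-1 polynomial fit via k-fold CV."""
    idx = rng.permutation(x.size)       # shuffle the indices once
    folds = np.array_split(idx, k)      # k roughly equal folds
    scores = []
    for i in range(k):
        test_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        coeffs = np.polyfit(x[train_idx], y[train_idx], deg=1)
        pred = np.polyval(coeffs, x[test_idx])
        scores.append(np.mean((pred - y[test_idx]) ** 2))
    return float(np.mean(scores))       # average over the k held-out folds

print(f"5-fold CV estimate of MSE: {kfold_mse(x, y):.4f}")
```

Each data point is used for testing exactly once and for training k−1 times, which is why the averaged score is a less noisy estimate than a single train/test split.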
18. Bayesian Algorithms
Bayesian methods are those that explicitly apply
Bayes' Theorem to problems such as
classification and regression.
A classic example is document classification
based on word frequencies, e.g. spam filtering.
With appropriate pre-processing, Bayesian
methods are competitive in this domain with
more advanced methods, including support
vector machines.
They also find application in automatic medical
diagnosis.
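A word-frequency spam filter of the kind mentioned above can be sketched as a tiny Naive Bayes classifier. Everything below (the toy messages, the uniform-prior setup, the Laplace smoothing) is invented for illustration, not taken from any real filter:

```python
import math
from collections import Counter

# Toy training corpus (invented for illustration).
spam = ["win money now", "free money offer", "claim free prize now"]
ham = ["meeting at noon", "project status update", "lunch at noon today"]

def word_counts(docs):
    """Count word frequencies across a list of documents."""
    counts = Counter()
    for doc in docs:
        counts.update(doc.split())
    return counts

spam_counts, ham_counts = word_counts(spam), word_counts(ham)
vocab = set(spam_counts) | set(ham_counts)

def log_score(msg, counts, prior):
    """log P(class) + sum of log P(word | class), with Laplace smoothing."""
    total = sum(counts.values())
    score = math.log(prior)
    for word in msg.split():
        score += math.log((counts[word] + 1) / (total + len(vocab)))
    return score

def classify(msg):
    p_spam = len(spam) / (len(spam) + len(ham))
    s = log_score(msg, spam_counts, p_spam)
    h = log_score(msg, ham_counts, 1 - p_spam)
    return "spam" if s > h else "ham"

print(classify("free money"))            # -> spam
print(classify("status update at noon")) # -> ham
```

Working in log-space avoids underflow when multiplying many small word probabilities, and the +1 smoothing keeps unseen words from zeroing out a class entirely.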
21. DoseMe.com.au
Bayesian dosing uses patient data and
laboratory results to estimate a patient's ability to
absorb, process, and clear a drug from their
system. Using a published population model,
DoseMe's algorithms adjust the
pharmacokinetic and/or pharmacodynamic
parameters so that a patient-specific,
individualised drug model is built. This individual
model is then used to provide a patient-specific
dosing recommendation to reach a therapeutic
target.
23. Fei-Fei Li
Fei-Fei Li, who publishes under the name Li Fei-Fei, is an
Associate Professor of Computer Science at Stanford
University. She is the director of the Stanford Artificial
Intelligence Lab and the Stanford Vision Lab.
● Born: 1976, Beijing, China
● Spouse: Silvio Savarese
● Education: California Institute of Technology (2005)
● Residence: United States of America
● Books: Computer Vision: From 3D Reconstruction to
Visual Recognition, more
● Doctoral advisors: Pietro Perona, Christof Koch
● http://vision.stanford.edu/feifeili/
● @drfeifei
24. Andrej Karpathy
Director of AI at Tesla, currently focused on perception for the
Autopilot.
Previously, he was a Research Scientist at OpenAI working on
Deep Learning in Computer Vision, Generative Modeling and
Reinforcement Learning.
He holds a PhD from Stanford, where he worked with Fei-Fei Li
on Convolutional/Recurrent Neural Network architectures and
their applications in Computer Vision, Natural Language
Processing and their intersection.
● http://cs.stanford.edu/people/karpathy/
● @karpathy
25. OpenAI Gym
Founded: December 11, 2015
Founders: Elon Musk, Sam Altman, and others
Type: 501(c)(3) Nonprofit organization
Location: San Francisco, California, USA
Products: OpenAI Gym
Mission: Friendly artificial intelligence
● https://www.openai.com/
● @OpenAI