
Avoid Overfitting with Regularization

Have you ever created a machine learning model that fits the training samples perfectly but makes very bad predictions on unseen samples? Have you ever wondered why this happens? This article explains overfitting, one of the causes of poor predictions on unseen samples. It also presents the regularization technique, through a simple regression example, to make clear how overfitting can be avoided.



By Ahmed Fawzy Gad, Information Technology (IT) Department, Faculty of Computers and Information (FCI), Menoufia University, Egypt. ahmed.fawzy@ci.menofia.edu.eg. 14-Jan-2018.
The focus of machine learning (ML) is to train an algorithm with training data in order to create a model that makes correct predictions for unseen data (test data). To create a classifier, for example, a human expert starts by collecting the data required to train the ML algorithm. The human is responsible for finding the best types of features to represent each class, features capable of discriminating between the different classes. These features are then used to train the ML algorithm.

Suppose we are to build an ML model that classifies images as containing cats or not, using a set of training images. The first question to answer is "what are the best features to use?". This is a critical question in ML: the better the features, the better the predictions of the trained model, and vice versa. Let us visualize the training images and extract some features that are representative of cats. Representative features might be the existence of two dark eye pupils and two ears with a diagonal direction. Assuming we somehow extracted such features from the training images and trained an ML model on them, the model can work with a wide range of cat images, because the chosen features exist in most cats. We can then test the model using some unseen data. Assume the classification accuracy on the test data is x%.

One may want to increase the classification accuracy. The first idea that comes to mind is to use more features than the two used previously.
This is because, in principle, the more discriminative features we use, the better the accuracy. By inspecting the training data again, we can find more features, such as the overall image color (all training cat samples are white) and the eye iris color (all training cats have yellow irises). The feature vector will then have the four features below, which are used to retrain the ML model:

  • Dark eye pupils
  • Diagonal ears
  • White cat color
  • Yellow eye irises

After creating the retrained model, the next step is to test it. The result with the new feature vector is that the classification accuracy decreases to less than x%. But why? The cause of the accuracy drop is using features that exist in the training data but do not exist generally in all cat images: the new features are not general across all cats. All training images have a white color and yellow eye irises, but these properties do not generalize to all cats. In the testing data, some cats are black or yellow rather than white, and some cats do not have yellow irises.
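The effect described above can be sketched with a toy experiment. The data below is entirely hypothetical (made up for illustration, not from the article), and a simple 1-nearest-neighbour rule stands in for whatever classifier might be used: with only the two general features the unseen cats are classified correctly, while adding the two non-general features (color, irises) misclassifies the black cat.

```python
import numpy as np

# Hypothetical toy data (invented for illustration). Feature columns:
# [dark_pupils, diagonal_ears, white_color, yellow_irises]
# In the training set, every cat happens to be white with yellow irises.
train_X = np.array([[1, 1, 1, 1],   # cat
                    [1, 1, 1, 1],   # cat
                    [0, 0, 1, 0],   # not a cat (white object, no pupils/ears)
                    [0, 1, 0, 0]])  # not a cat
train_y = np.array([1, 1, 0, 0])    # 1 = cat, 0 = not cat

# Unseen data: a black cat with green irises, and a white cat.
test_X = np.array([[1, 1, 0, 0],
                   [1, 1, 1, 1]])
test_y = np.array([1, 1])

def predict_1nn(train_X, train_y, X, cols):
    """1-nearest-neighbour on Hamming distance, using only `cols` features."""
    preds = []
    for x in X[:, cols]:
        dists = np.abs(train_X[:, cols] - x).sum(axis=1)
        preds.append(train_y[np.argmin(dists)])
    return np.array(preds)

acc_2 = (predict_1nn(train_X, train_y, test_X, [0, 1]) == test_y).mean()
acc_4 = (predict_1nn(train_X, train_y, test_X, [0, 1, 2, 3]) == test_y).mean()
# The two general features classify both test cats correctly; the four-feature
# model is pulled toward "not cat" for the black cat by the color features.
```

The point is not the particular classifier: any model that leans on the white-color and yellow-iris features will pay for it on cats that lack them.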
Our case, in which the used features are powerful for the training samples but very poor for the testing samples, is known as overfitting: the model is trained with features that are exclusive to the training data and do not carry over to the testing data.

The goal of the previous discussion was to make the idea of overfitting simple through a high-level example. To get into the details, it is preferable to work with a simpler example, so the rest of the discussion is based on a regression example.

Understand Regularization based on a Regression Example

Assume we want to create a regression model that fits some one-dimensional data. We can use polynomial regression. The simplest model to start with is the linear model, with a first-degree polynomial equation:

y1 = f1(x) = Θ1x + Θ0

where Θ0 and Θ1 are the model parameters and x is the only feature used. Plotting this model against the data and using a loss function such as

L = (1/N) ∑_{i=0}^{N} |f1(x_i) − d_i|

where f1(x_i) is the predicted output for sample i and d_i is the desired output for the same sample, we can conclude that the model does not fit the data well. The model is too simple, and many of its predictions are inaccurate. For that reason, we should create a more complex model that can fit the data well. We can increase the degree of the equation from one to two:

y2 = f2(x) = Θ2x^2 + Θ1x + Θ0

By using the same feature x raised to the power 2 (x^2), we created a new feature, and we now capture not only the linear properties of the data but also some non-linear properties.
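The two fits and the loss L above can be sketched with NumPy. The dataset here is synthetic (assumed to be roughly quadratic with noise, since the article's actual data is not given); `np.polyfit` plays the role of the polynomial regression step.

```python
import numpy as np

# Synthetic 1-D data, assumed here to be roughly quadratic with noise
# (the article's actual dataset is not given).
rng = np.random.default_rng(0)
x = np.linspace(-3, 3, 20)
d = 0.5 * x**2 - x + 1 + rng.normal(0.0, 0.3, x.size)

def mean_abs_loss(theta, x, d):
    # L = (1/N) * sum_i |f(x_i) - d_i|, the loss used in the article
    return np.mean(np.abs(np.polyval(theta, x) - d))

theta1 = np.polyfit(x, d, 1)  # y1 = Θ1*x + Θ0
theta2 = np.polyfit(x, d, 2)  # y2 = Θ2*x^2 + Θ1*x + Θ0

# The quadratic model captures the curvature the line misses, so its
# training loss is lower than the linear model's.
```

Note that `np.polyfit` minimizes the squared error rather than the absolute error above; for this illustration the distinction does not matter.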
The second-degree polynomial fits the data better than the first-degree one, but the quadratic equation still does not fit some of the data samples well. This is why we can create a more complex model of the third degree, with the following equation:

y3 = f3(x) = Θ3x^3 + Θ2x^2 + Θ1x + Θ0

The model fits the data better after adding a new feature that captures the third-degree properties of the data. To fit the data even better, we can increase the degree of the equation to four:

y4 = f4(x) = Θ4x^4 + Θ3x^3 + Θ2x^2 + Θ1x + Θ0

It seems that the higher the degree of the polynomial equation, the better it fits the data. But there are some important questions to answer. If increasing the degree of the polynomial equation by adding new features enhances the results, why not use a very high degree, such as the 100th? What is the best degree to use for a problem?

Model Capacity/Complexity

Model capacity (or complexity) refers to the level of variation that the model can work with: the higher the capacity, the more variation the model can cope with. The first model, y1, has a small capacity compared to y4. In our case, the capacity increases with the polynomial degree. The higher the degree of the polynomial equation, the better it fits the data, but remember that increasing the degree also increases the complexity of the model. Using a model with a capacity higher than required may lead to overfitting: the model becomes very complex and fits the training data very well, but unfortunately it is very weak on unseen data. The goal of ML is to create a model that is robust not only on the training data but also on unseen data samples. The model of the fourth degree (y4) is very complex.
Yes, it fits the seen data well, but it will not fit unseen data. In this case, the newly used feature in y4, x^4, captures more detail than required. Because that new feature makes the model too complex, we should get rid of it. In this example we actually know which feature to remove, so we can remove it and return to the previous third-degree model (y3 = Θ3x^3 + Θ2x^2 + Θ1x + Θ0). But in practical work, we do not know which features to remove.
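Capacity-driven overfitting can be seen concretely in a small experiment (synthetic data again, assumed quadratic with noise). A degree chosen to interpolate all ten training points drives the training error to almost zero, while the error on unseen points between them blows up; the modest-capacity model behaves sensibly on both.

```python
import numpy as np

# Synthetic example (assumed data): the true relation is quadratic, but we
# fit both a matching-capacity model and one with far too much capacity.
rng = np.random.default_rng(1)
x_train = np.linspace(-3, 3, 10)
d_train = 0.5 * x_train**2 - x_train + 1 + rng.normal(0.0, 1.0, x_train.size)
x_test = np.linspace(-2.9, 2.9, 13)   # unseen points between the training ones
d_test = 0.5 * x_test**2 - x_test + 1 + rng.normal(0.0, 1.0, x_test.size)

def mse(theta, x, d):
    # mean squared prediction error of the polynomial with coefficients theta
    return np.mean((np.polyval(theta, x) - d) ** 2)

theta_low = np.polyfit(x_train, d_train, 2)   # capacity matches the data
theta_high = np.polyfit(x_train, d_train, 9)  # interpolates all 10 points

# theta_high: near-zero training error, but it has also fit the noise, so
# between the training points it oscillates and the test error is far worse.
```

This is the trade-off the article describes: raising the degree always helps the training fit, and past the right capacity it only hurts the unseen data.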
Moreover, assume that the new feature is not too bad, and we do not want to remove it completely but just want to penalize it. What should we do? Looking back at the loss function, its only goal is to minimize the prediction error. We can set a new objective: minimize (penalize) the effect of the new feature x^4 as much as possible. After modifying the loss function to penalize x^4, it becomes:

L_new = (1/N) [ ∑_{i=0}^{N} |f4(x_i) − d_i| + Θ4x^4 ]

Our objective is still to minimize the loss function, but now we are also interested in minimizing the term Θ4x^4. It is obvious that to minimize Θ4x^4 we should minimize Θ4, as it is the only free parameter we can change. We can set its value to zero if we want to remove the feature completely, in case it is a very bad one:

L_new = (1/N) [ ∑_{i=0}^{N} |f4(x_i) − d_i| + 0 * x^4 ]

By removing it, we go back to the third-degree polynomial equation (y3). y3 does not fit the seen data as perfectly as y4 does, but generally it will perform better on unseen data. If instead x^4 is a relatively good feature and we just want to penalize it without removing it completely, we can set Θ4 to a value close to zero but not zero (say 0.1). By doing so, we limit the effect of x^4, and the model is no longer as complex as before:

L_new = (1/N) [ ∑_{i=0}^{N} |f4(x_i) − d_i| + 0.1 * x^4 ]

Going back to y2, it seems that it is simpler than y3 and can work well with both seen and unseen data samples. So we should also remove the new feature used in y3, which is x^3, or just penalize it if it does relatively well. We can modify the loss function to do that:

L_new = (1/N) [ ∑_{i=0}^{N} |f4(x_i) − d_i| + 0.1 * x^4 + Θ3x^3 ]

L_new = (1/N) [ ∑_{i=0}^{N} |f4(x_i) − d_i| + 0.1 * x^4 + 0.04 * x^3 ]

Regularization

Note that we actually knew that y2 is the best model to fit the data, because the data graph was available to us; it is a very simple task that we can solve manually.
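The hand-tuned penalty on a single coefficient can be made concrete. One detail is swapped in here: instead of the article's Θ4x^4 term, the sketch uses the common quadratic form λ4·Θ4², which admits a closed-form least-squares solution; that substitution is mine, not the article's. The data is again synthetic (a noisy cubic, so the x^4 feature is genuinely superfluous).

```python
import numpy as np

# Synthetic data (assumed): noisy cubic, so the x^4 feature is superfluous.
rng = np.random.default_rng(2)
x = np.linspace(-3, 3, 30)
d = 0.3 * x**3 - x + 1 + rng.normal(0.0, 0.5, x.size)

def fit_deg4(x, d, lam4):
    """Least-squares degree-4 fit with a quadratic penalty lam4 * Θ4**2
    applied ONLY to the x^4 coefficient (a stand-in for the article's
    per-feature penalty)."""
    A = np.vander(x, 5)            # columns [x^4, x^3, x^2, x, 1]
    P = np.zeros((5, 5))
    P[0, 0] = 1.0                  # penalize only Θ4
    # Normal equations of ||A·theta - d||^2 + lam4 * theta[0]^2
    return np.linalg.solve(A.T @ A + lam4 * P, A.T @ d)

theta_free = fit_deg4(x, d, lam4=0.0)   # no penalty: Θ4 fits the noise
theta_pen = fit_deg4(x, d, lam4=1e6)    # heavy penalty: Θ4 shrunk to ~0

# The penalty shrinks Θ4 toward zero while the other coefficients survive,
# effectively taking us back to the third-degree model y3.
```

Varying `lam4` between these extremes reproduces the article's "penalize but do not remove" regime (the 0.1-style coefficient).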
But if such information is not available, and as the number of samples and the data complexity increase, we will not be able to reach such conclusions easily. There must be something automatic to tell us which degree fits the data and which features to penalize to get the best predictions for unseen data. This is regularization. Regularization helps us select the model complexity that fits the data. It is useful for automatically penalizing the features that make the model too complex. Remember that regularization is useful when the features are not bad, when they help us get relatively good predictions and we just need to penalize them rather than remove them completely.

Regularization penalizes all used features, not a selected subset. Previously, we penalized just two features, x^4 and x^3, not all of them, but that is not the case with regularization. Using regularization, a new term is added to the loss function to penalize the features, so the loss function becomes:

L_new = (1/N) [ ∑_{i=0}^{N} |f4(x_i) − d_i| + ∑_{j=1}^{N} λΘ_j ]

It can also be written as follows, after moving λ outside the summation:
L_new = (1/N) [ ∑_{i=0}^{N} |f4(x_i) − d_i| + λ ∑_{j=1}^{N} Θ_j ]

The newly added term, λ ∑_{j=1}^{N} Θ_j, penalizes the features in order to control the level of model complexity. Before adding the regularization term, the goal was to minimize the prediction error as much as possible. Now the goal is to minimize the error while being careful not to make the model too complex, thereby avoiding overfitting.

There is a regularization parameter called lambda (λ) which controls how much the features are penalized. It is a hyperparameter with no fixed value; its value depends on the task at hand. As its value increases, the features are penalized more heavily, and the model becomes simpler. As its value decreases, the penalization weakens, and the model complexity increases. A value of zero means no penalization of the features at all: setting λ to zero removes the regularization term and leaves only the error term, so the objective returns to simply minimizing the error toward zero. When error minimization is the only objective, the model may overfit.

L_new = (1/N) [ ∑_{i=0}^{N} |f4(x_i) − d_i| + 0 * ∑_{j=1}^{N} Θ_j ]

L_new = (1/N) [ ∑_{i=0}^{N} |f4(x_i) − d_i| + 0 ]

L_new = (1/N) ∑_{i=0}^{N} |f4(x_i) − d_i|

But when the value of the penalization parameter λ is very high (say 10^9), the parameters Θ_j must be penalized very heavily in order to keep the loss at its minimum value. As a result, the parameters Θ_j are driven to zero, and the model y4 has its coefficients pruned:

y4 = f4(x) = Θ4x^4 + Θ3x^3 + Θ2x^2 + Θ1x + Θ0
y4 = 0 * x^4 + 0 * x^3 + 0 * x^2 + 0 * x + Θ0
y4 = Θ0

Please note that the regularization term starts its index j from 1, not zero. The regularization term exists to penalize the features; because Θ0 has no associated feature, there is no reason to penalize it.
In such a case, the model reduces to the constant y4 = Θ0, whose graph is a horizontal line.
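The behaviour of λ at its two extremes can be sketched as follows. Two caveats: the article writes the penalty as λ∑Θ_j, while the sketch uses the standard quadratic (ridge) form λ∑Θ_j², which has a closed-form solution; and the data is synthetic. The intercept Θ0 is left unpenalized, matching the article's j = 1 starting index.

```python
import numpy as np

# Synthetic data (assumed): the underlying relation is quadratic with noise.
rng = np.random.default_rng(3)
x = np.linspace(-3, 3, 40)
d = 0.5 * x**2 - x + 1 + rng.normal(0.0, 0.4, x.size)

def ridge_poly(x, d, degree, lam):
    """Degree-`degree` polynomial fit with ridge penalty lam * sum(Θ_j**2)
    for j >= 1. Θ0 (the intercept) is not penalized, matching the article's
    j = 1..N summation. (Ridge squares the coefficients; the article writes
    the penalty without the square.)"""
    A = np.vander(x, degree + 1)       # columns [x^deg, ..., x, 1]
    P = np.eye(degree + 1)
    P[-1, -1] = 0.0                    # leave the intercept unpenalized
    return np.linalg.solve(A.T @ A + lam * P, A.T @ d)

theta_l0 = ridge_poly(x, d, 4, lam=0.0)    # lam = 0: plain least-squares fit
theta_big = ridge_poly(x, d, 4, lam=1e9)   # huge lam: model collapses to Θ0

# With a huge lambda, every penalized coefficient is driven toward zero and
# the model degenerates to the constant Θ0, close to the mean of d, exactly
# the y4 = Θ0 case derived above.
```

Intermediate λ values interpolate between these extremes; in practice λ is tuned, for example by measuring the loss on held-out data.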