14 Machine Learning Single Layer Perceptron

  1. 1. Neural Networks Introduction Single-Layer Perceptron Andres Mendez-Vazquez July 3, 2016 1 / 101
  2. 2. Outline 1 Introduction History 2 Adapting Filtering Problem Definition Description of the Behavior of the System 3 Unconstrained Optimization Introduction Method of Steepest Descent Newton’s Method Gauss-Newton Method 4 Linear Least-Squares Filter Introduction Least-Mean-Square (LMS) Algorithm Convergence of the LMS 5 Perceptron Objective Perceptron: Local Field of a Neuron Perceptron: One Neuron Structure Deriving the Algorithm Under Linear Separability - Convergence happens!!! Proof Algorithm Using Error-Correcting Final Perceptron Algorithm (One Version) Other Algorithms for the Perceptron 2 / 101
  4. 4. History At the beginning of Neural Networks (1943 - 1958) McCulloch and Pitts (1943) for introducing the idea of neural networks as computing machines. Hebb (1949) for postulating the first rule for self-organized learning. Rosenblatt (1958) for proposing the perceptron as the first model for learning with a teacher (i.e., supervised learning). In this chapter, we are interested in the perceptron The perceptron is the simplest form of a neural network used for the classification of patterns said to be linearly separable (i.e., patterns that lie on opposite sides of a hyperplane). 4 / 101
  8. 8. In addition Something Notable The single neuron also forms the basis of an adaptive filter, a functional block that is basic to the ever-expanding subject of signal processing. Furthermore The development of adaptive filtering owes much to the classic paper of Widrow and Hoff (1960) for pioneering the so-called least-mean-square (LMS) algorithm, also known as the delta rule. 5 / 101
  12. 12. Adapting Filtering Problem Consider a dynamical system [Figure: an unknown dynamical system driven by a set of inputs and producing an output] 7 / 101
  13. 13. Signal-Flow Graph of Adaptive Model We have the following equivalence [Figure: signal-flow graph of the equivalent adaptive model built around a single neuron] 8 / 101
  15. 15. Description of the Behavior of the System We have the data set $\mathcal{T} = \{(x(i), d(i)) \mid i = 1, 2, \ldots, n, \ldots\}$ (1) Where $x(i) = (x_1(i), x_2(i), \ldots, x_m(i))^T$ (2) 10 / 101
  17. 17. The Stimulus x(i) The stimulus x(i) can arise from The m elements of x(i) originating at different points in space (a spatial stimulus) 11 / 101
  18. 18. The Stimulus x(i) The stimulus x(i) can arise from The m elements of x(i) representing the set of present and (m − 1) past values of some excitation that are uniformly spaced in time (a temporal stimulus) 12 / 101
  19. 19. Problem Quite important How do we design a multiple-input, single-output model of the unknown dynamical system? It is more We want to build this around a single neuron!!! 13 / 101
  21. 21. Thus, we have the following... We need an algorithm to control the weight adjustment of the neuron [Figure: the neuron with a control algorithm adjusting its synaptic weights] 14 / 101
  22. 22. Which steps do you need for the algorithm? First The algorithm starts from an arbitrary setting of the neuron’s synaptic weights. Second Adjustments, with respect to changes in the environment, are made on a continuous basis. Time is incorporated into the algorithm. Third Computation of adjustments to the synaptic weights is completed inside a time interval that is one sampling period long. 15 / 101
  25. 25. Signal-Flow Graph of Adaptive Model We have the following equivalence [Figure: signal-flow graph of the adaptive model, repeated] 16 / 101
  26. 26. Thus, This Neural Model ≈ Adaptive Filter with two continuous processes Filtering processes 1 An output, denoted by y(i), that is produced in response to the m elements of the stimulus vector x(i). 2 An error signal, e(i), that is obtained by comparing the output y(i) to the corresponding desired output d(i) produced by the unknown system. Adaptive Process It involves the automatic adjustment of the synaptic weights of the neuron in accordance with the error signal e(i). Remark The combination of these two processes working together constitutes a feedback loop acting around the neuron. 17 / 101
  29. 29. Thus The output y(i) is exactly the same as the induced local field v(i): $y(i) = v(i) = \sum_{k=1}^{m} w_k(i)\, x_k(i)$ (3) In matrix form, we have - remember we only have one neuron, so we do not need the neuron index k: $y(i) = x^T(i)\, w(i)$ (4) Error $e(i) = d(i) - y(i)$ (5) 18 / 101
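A minimal numerical sketch of Eqs. (3)-(5), assuming NumPy and purely hypothetical example values for the stimulus, weights, and desired response:

```python
import numpy as np

# Sketch of Eqs. (3)-(5): the linear neuron's output and error signal.
# The numbers below are hypothetical; any m-dimensional vectors work.
x_i = np.array([0.5, -1.2, 3.0])   # stimulus vector x(i), m = 3
w_i = np.array([0.1, 0.4, -0.2])   # synaptic weights w(i)
d_i = 0.7                          # desired response d(i)

y_i = w_i @ x_i                    # y(i) = x^T(i) w(i), the induced local field
e_i = d_i - y_i                    # e(i) = d(i) - y(i)
print(y_i, e_i)
```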
  33. 33. Consider A continuously differentiable function J(w) We want to find an optimal solution w∗ such that $J(w^*) \leq J(w), \; \forall w$ (6) We want to Minimize the cost function J(w) with respect to the weight vector w. For this, the necessary condition of optimality is $\nabla J(w^*) = 0$ (7) 20 / 101
  36. 36. Where $\nabla$ is the gradient operator $\nabla = \left[\frac{\partial}{\partial w_1}, \frac{\partial}{\partial w_2}, \ldots, \frac{\partial}{\partial w_m}\right]^T$ (8) Thus $\nabla J(w) = \left[\frac{\partial J(w)}{\partial w_1}, \frac{\partial J(w)}{\partial w_2}, \ldots, \frac{\partial J(w)}{\partial w_m}\right]^T$ (9) 21 / 101
  38. 38. Thus Starting with an initial guess denoted by w(0), generate a sequence of weight vectors w(1), w(2), ... such that J(w) is reduced at each iteration: $J(w(n+1)) < J(w(n))$ (10) Where: w(n) is the old value and w(n + 1) is the new value. 22 / 101
  40. 40. The Three Main Methods for Unconstrained Optimization We will look at 1 Steepest Descent. 2 Newton’s Method 3 Gauss-Newton Method 23 / 101
  42. 42. Steepest Descent In the method of steepest descent, we have a cost function J(w) and the update $w(n+1) = w(n) - \eta \nabla J(w(n))$ How do we prove that $J(w(n+1)) < J(w(n))$? We use the first-order Taylor series expansion around w(n): $J(w(n+1)) \approx J(w(n)) + \nabla J^T(w(n))\, \Delta w(n)$ (11) Remark: This is quite accurate when the step size is quite small!!! In addition, $\Delta w(n) = w(n+1) - w(n)$ 25 / 101
  44. 44. Why? Look at the case in R The equation of the tangent line to the curve y = J(w) at w(n) is $L(w(n+1)) = J'(w(n))\,[w(n+1) - w(n)] + J(w(n))$ (12) [Figure: example of the tangent-line approximation] 26 / 101
  46. 46. Thus, we have that in R Remember Something quite Classic $\tan \theta = \frac{J(w(n+1)) - J(w(n))}{w(n+1) - w(n)}$, so $\tan \theta \,(w(n+1) - w(n)) = J(w(n+1)) - J(w(n))$ and, identifying $\tan \theta$ with the derivative, $J'(w(n))\,(w(n+1) - w(n)) = J(w(n+1)) - J(w(n))$ 27 / 101
  49. 49. Thus, we have that Using the first Taylor expansion $J(w(n+1)) \approx J(w(n)) + J'(w(n))\,[w(n+1) - w(n)]$ (13) 28 / 101
  50. 50. Now, for Many Variables A hyperplane in $\mathbb{R}^n$ is a set of the form $H = \{x \mid a^T x = b\}$ (14) Given $x \in H$ and $x_0 \in H$, $b = a^T x = a^T x_0$ Thus, we have that $H = \{x \mid a^T (x - x_0) = 0\}$ 29 / 101
  53. 53. Thus, we have the following definition Definition (Differentiability) Assume that J is defined in a disk D containing w(n). We say that J is differentiable at w(n) if: 1 $\frac{\partial J(w(n))}{\partial w_i}$ exists for all $i = 1, \ldots, m$. 2 J is locally linear at w(n). 30 / 101
  56. 56. Thus, given J(w(n)) We know that we have the following operator $\nabla = \left[\frac{\partial}{\partial w_1}, \frac{\partial}{\partial w_2}, \ldots, \frac{\partial}{\partial w_m}\right]$ (15) Thus, we have $\nabla J(w(n)) = \left[\frac{\partial J(w(n))}{\partial w_1}, \frac{\partial J(w(n))}{\partial w_2}, \ldots, \frac{\partial J(w(n))}{\partial w_m}\right] = \sum_{i=1}^{m} \hat{w}_i \frac{\partial J(w(n))}{\partial w_i}$ Where: $\hat{w}_i \in \mathbb{R}^m$ is the i-th standard basis vector (a 1 in position i and 0 elsewhere) 31 / 101
  58. 58. Now Given a curve r(t) that lies on the level set J(w) = c (illustrated in $\mathbb{R}^3$) 32 / 101
  59. 59. Level Set Definition $\{(w_1, w_2, \ldots, w_m) \in \mathbb{R}^m \mid J(w_1, w_2, \ldots, w_m) = c\}$ (16) Remark: In a normal Calculus course we will use x and f instead of w and J. 33 / 101
  60. 60. Where Any curve has the following parametrization $r : [a, b] \to \mathbb{R}^m$, $r(t) = (w_1(t), \ldots, w_m(t))$ With $r(n+1) = (w_1(n+1), \ldots, w_m(n+1))$ We can write the parametrized version of it $z(t) = J(w_1(t), w_2(t), \ldots, w_m(t)) = c$ (17) Differentiating with respect to t and using the chain rule for multiple variables $\frac{dz(t)}{dt} = \sum_{i=1}^{m} \frac{\partial J(w(t))}{\partial w_i} \cdot \frac{dw_i(t)}{dt} = 0$ (18) 34 / 101
  63. 63. Note First Given $y = f(u) = (f_1(u), \ldots, f_l(u))$ and $u = g(x) = (g_1(x), \ldots, g_m(x))$. We have then that $\frac{\partial (f_1, f_2, \ldots, f_l)}{\partial (x_1, x_2, \ldots, x_k)} = \frac{\partial (f_1, f_2, \ldots, f_l)}{\partial (g_1, g_2, \ldots, g_m)} \cdot \frac{\partial (g_1, g_2, \ldots, g_m)}{\partial (x_1, x_2, \ldots, x_k)}$ (19) Thus $\frac{\partial (f_1, f_2, \ldots, f_l)}{\partial x_i} = \frac{\partial (f_1, f_2, \ldots, f_l)}{\partial (g_1, g_2, \ldots, g_m)} \cdot \frac{\partial (g_1, g_2, \ldots, g_m)}{\partial x_i} = \sum_{k=1}^{m} \frac{\partial (f_1, f_2, \ldots, f_l)}{\partial g_k} \frac{\partial g_k}{\partial x_i}$ 35 / 101
  66. 66. Thus Evaluating at t = n $\sum_{i=1}^{m} \frac{\partial J(w(n))}{\partial w_i} \cdot \frac{dw_i(n)}{dt} = 0$ We have that $\nabla J(w(n)) \cdot r'(n) = 0$ (20) This proves that for every level set the gradient is perpendicular to the tangent of any curve that lies on the level set In particular at the point w(n). 36 / 101
  69. 69. Now the tangent plane to the surface can be described generally Thus $L(w(n+1)) = J(w(n)) + \nabla J^T(w(n))\,[w(n+1) - w(n)]$ (21) [Figure: the tangent plane to the level surface at w(n)] 37 / 101
  71. 71. Proving the fact about the Steepest Descent We want the following $J(w(n+1)) < J(w(n))$ Using the first-order Taylor approximation $J(w(n+1)) - J(w(n)) \approx \nabla J^T(w(n))\, \Delta w(n)$ So, we choose the following $\Delta w(n) = -\eta \nabla J(w(n))$ with $\eta > 0$ 38 / 101
  74. 74. Then We have that $J(w(n+1)) - J(w(n)) \approx -\eta \nabla J^T(w(n)) \nabla J(w(n)) = -\eta \|\nabla J(w(n))\|^2$ Thus $J(w(n+1)) - J(w(n)) < 0$ Or $J(w(n+1)) < J(w(n))$ 39 / 101
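The argument above is what justifies the steepest-descent iteration $w(n+1) = w(n) - \eta \nabla J(w(n))$. A small sketch of that iteration, assuming NumPy and a hypothetical quadratic cost used only for illustration:

```python
import numpy as np

def steepest_descent(grad_J, w0, eta=0.1, n_iter=100):
    """Iterate w(n+1) = w(n) - eta * grad J(w(n))."""
    w = np.asarray(w0, dtype=float)
    for _ in range(n_iter):
        w = w - eta * grad_J(w)
    return w

# Hypothetical cost J(w) = ||w||^2 / 2, whose gradient is simply w.
w_star = steepest_descent(lambda w: w, w0=[4.0, -2.0], eta=0.3, n_iter=50)
print(w_star)  # close to the minimizer [0, 0]
```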
  78. 78. Newton’s Method Here The basic idea of Newton’s method is to minimize the quadratic approximation of the cost function J(w) around the current point w(n). Using a second-order Taylor series expansion of the cost function around the point w(n) $\Delta J(w(n)) = J(w(n+1)) - J(w(n)) \approx \nabla J^T(w(n))\, \Delta w(n) + \frac{1}{2} \Delta w^T(n)\, H(n)\, \Delta w(n)$ Where, given that w(n) is a vector with dimension m, H is the $m \times m$ Hessian matrix $H = \nabla^2 J(w) = \begin{bmatrix} \frac{\partial^2 J(w)}{\partial w_1^2} & \frac{\partial^2 J(w)}{\partial w_1 \partial w_2} & \cdots & \frac{\partial^2 J(w)}{\partial w_1 \partial w_m} \\ \frac{\partial^2 J(w)}{\partial w_2 \partial w_1} & \frac{\partial^2 J(w)}{\partial w_2^2} & \cdots & \frac{\partial^2 J(w)}{\partial w_2 \partial w_m} \\ \vdots & \vdots & & \vdots \\ \frac{\partial^2 J(w)}{\partial w_m \partial w_1} & \frac{\partial^2 J(w)}{\partial w_m \partial w_2} & \cdots & \frac{\partial^2 J(w)}{\partial w_m^2} \end{bmatrix}$ 41 / 101
  81. 81. Now, we want to minimize J(w(n+1)) Do you have any idea? Look again $J(w(n)) + \nabla J^T(w(n))\, \Delta w(n) + \frac{1}{2} \Delta w^T(n)\, H(n)\, \Delta w(n)$ (22) Differentiate with respect to $\Delta w(n)$ and set the result to zero $\nabla J(w(n)) + H(n)\, \Delta w(n) = 0$ (23) Thus $\Delta w(n) = -H^{-1}(n)\, \nabla J(w(n))$ 42 / 101
  84. 84. The Final Method Define the following $w(n+1) - w(n) = -H^{-1}(n)\, \nabla J(w(n))$ Then $w(n+1) = w(n) - H^{-1}(n)\, \nabla J(w(n))$ 43 / 101
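A sketch of one Newton update, assuming NumPy; the quadratic cost and its gradient/Hessian below are hypothetical stand-ins, chosen because a single Newton step then lands exactly on the minimizer:

```python
import numpy as np

def newton_step(w, grad_J, hess_J):
    """One Newton update: w(n+1) = w(n) - H^{-1}(n) grad J(w(n))."""
    H = hess_J(w)
    g = grad_J(w)
    return w - np.linalg.solve(H, g)   # solve H * delta = g instead of inverting H

# Hypothetical quadratic cost J(w) = 1/2 w^T A w - b^T w with A positive definite,
# whose minimizer is A^{-1} b.
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, -1.0])
w = np.zeros(2)
w = newton_step(w, grad_J=lambda w: A @ w - b, hess_J=lambda w: A)
print(w, np.linalg.solve(A, b))   # the two vectors coincide
```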
  87. 87. We have then an error Something Notable $J(w) = \frac{1}{2} \sum_{i=1}^{n} e^2(i)$ Thus using the first-order Taylor expansion, we linearize $e_l(i, w) = e(i) + \left[\frac{\partial e(i)}{\partial w}\right]^T [w - w(n)]$ In matrix form $e_l(n, w) = e(n) + J(n)\,[w - w(n)]$ 45 / 101
  90. 90. Where The error vector is equal to $e(n) = [e(1), e(2), \ldots, e(n)]^T$ (24) Thus, we get the famous Jacobian once we take the derivatives $\frac{\partial e(i)}{\partial w_j}$: $J(n) = \begin{bmatrix} \frac{\partial e(1)}{\partial w_1} & \frac{\partial e(1)}{\partial w_2} & \cdots & \frac{\partial e(1)}{\partial w_m} \\ \frac{\partial e(2)}{\partial w_1} & \frac{\partial e(2)}{\partial w_2} & \cdots & \frac{\partial e(2)}{\partial w_m} \\ \vdots & \vdots & & \vdots \\ \frac{\partial e(n)}{\partial w_1} & \frac{\partial e(n)}{\partial w_2} & \cdots & \frac{\partial e(n)}{\partial w_m} \end{bmatrix}$ 46 / 101
  92. 92. Where We want the following $w(n+1) = \arg\min_{w} \frac{1}{2} \|e_l(n, w)\|^2$ Ideas What if we expand out the equation? 47 / 101
  94. 94. Expanded Version We get $\frac{1}{2} \|e_l(n, w)\|^2 = \frac{1}{2} \|e(n)\|^2 + e^T(n)\, J(n)\, (w - w(n)) + \frac{1}{2} (w - w(n))^T J^T(n)\, J(n)\, (w - w(n))$ 48 / 101
  95. 95. Now, doing the differentiation, we get Differentiating the equation with respect to w and setting it to zero $J^T(n)\, e(n) + J^T(n)\, J(n)\, [w - w(n)] = 0$ We get finally $w(n+1) = w(n) - \left[J^T(n)\, J(n)\right]^{-1} J^T(n)\, e(n)$ (25) 49 / 101
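A sketch of one Gauss-Newton step, Eq. (25), assuming NumPy; the data matrix `X`, desired responses `d`, and the linear error model below are hypothetical and serve only to exercise the update:

```python
import numpy as np

def gauss_newton_step(w, e, Jac):
    """One Gauss-Newton update: w(n+1) = w(n) - (J^T J)^{-1} J^T e(n)."""
    JtJ = Jac.T @ Jac                      # must be nonsingular for the step to exist
    return w - np.linalg.solve(JtJ, Jac.T @ e)

# Hypothetical linear errors e(i) = d(i) - x^T(i) w, so the Jacobian of e is -X.
X = np.array([[1.0, 2.0], [0.0, 1.0], [1.0, -1.0]])
d = np.array([1.0, 0.5, -0.3])
w = np.zeros(2)
e = d - X @ w
w = gauss_newton_step(w, e, Jac=-X)        # for this linear case, one step suffices
print(w)
```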
  97. 97. Remarks We have that Newton’s method requires knowledge of the Hessian matrix of the cost function. The Gauss-Newton method only requires the Jacobian matrix of the error vector e(n). However For the Gauss-Newton iteration to be computable, the matrix product $J^T(n)\, J(n)$ must be nonsingular!!! 50 / 101
  100. 100. Introduction A linear least-squares filter has two distinctive characteristics First, the single neuron around which it is built is linear. Second, the cost function J(w) used to design the filter consists of the sum of error squares. Thus, expressing the error $e(n) = d(n) - [x(1), \ldots, x(n)]^T w(n)$ Short Version - the error is linear in the weight vector w(n): $e(n) = d(n) - X(n)\, w(n)$ Where d(n) is an $n \times 1$ desired response vector. Where X(n) is the $n \times m$ data matrix. 52 / 101
  103. 103. Now, differentiate e(n) with respect to w(n) Thus $\nabla e(n) = -X^T(n)$ Correspondingly, the Jacobian of e(n) is $J(n) = -X(n)$ Let us use the Gauss-Newton update $w(n+1) = w(n) - \left[J^T(n)\, J(n)\right]^{-1} J^T(n)\, e(n)$ 53 / 101
  106. 106. Thus We have the following $w(n+1) = w(n) - \left[(-X^T(n)) \cdot (-X(n))\right]^{-1} \cdot (-X^T(n))\, [d(n) - X(n)\, w(n)]$ We have then $w(n+1) = w(n) + \left[X^T(n)\, X(n)\right]^{-1} X^T(n)\, d(n) - \left[X^T(n)\, X(n)\right]^{-1} X^T(n)\, X(n)\, w(n)$ Thus, we have $w(n+1) = w(n) + \left[X^T(n)\, X(n)\right]^{-1} X^T(n)\, d(n) - w(n) = \left[X^T(n)\, X(n)\right]^{-1} X^T(n)\, d(n)$ 54 / 101
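A sketch of this closed-form solution, assuming NumPy and hypothetical data; a least-squares solver is used instead of forming the inverse explicitly, which is the usual numerically safer choice:

```python
import numpy as np

# Sketch of w(n+1) = (X^T X)^{-1} X^T d for hypothetical data.
# Rows of X are the input vectors x^T(i); d holds the desired responses d(i).
X = np.array([[1.0,  2.0],
              [1.0, -1.0],
              [2.0,  0.5],
              [0.0,  1.0]])
d = np.array([3.0, -1.0, 2.0, 0.5])

w, *_ = np.linalg.lstsq(X, d, rcond=None)   # solves min_w ||d - X w||^2
print(w)
```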
  110. 110. Again Our Error Cost function We have $J(w) = \frac{1}{2} e^2(n)$ where e(n) is the error signal measured at time n. Again, differentiating with respect to the vector w $\frac{\partial J(w)}{\partial w} = e(n) \frac{\partial e(n)}{\partial w}$ The LMS algorithm operates with a linear neuron, so we may express the error signal as $e(n) = d(n) - x^T(n)\, w(n)$ (26) 56 / 101
  113. 113. We have Something Notable $\frac{\partial e(n)}{\partial w} = -x(n)$ Then $\frac{\partial J(w)}{\partial w} = -x(n)\, e(n)$ Using this as an estimate for the gradient vector, we have for the gradient descent $w(n+1) = w(n) + \eta\, x(n)\, e(n)$ (27) 57 / 101
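A sketch of the LMS recursion, Eqs. (26)-(27), assuming NumPy; the data are hypothetical, generated from a known linear system so the learned weights can be compared against it:

```python
import numpy as np

def lms(X, d, eta=0.05, n_epochs=1):
    """Sketch of the LMS rule: w(n+1) = w(n) + eta * x(n) * e(n)."""
    w = np.zeros(X.shape[1])
    for _ in range(n_epochs):
        for x_n, d_n in zip(X, d):
            e_n = d_n - x_n @ w        # e(n) = d(n) - x^T(n) w(n), Eq. (26)
            w = w + eta * x_n * e_n    # Eq. (27): step along the instantaneous gradient
    return w

# Hypothetical data from a known linear system plus a little noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
w_true = np.array([0.5, -1.0, 2.0])
d = X @ w_true + 0.01 * rng.normal(size=200)
print(lms(X, d, eta=0.05, n_epochs=5))   # approaches w_true
```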
  116. 116. Remarks The feedback loop around the weight vector behaves like a low-pass filter: it passes the low-frequency component of the error signal and attenuates its high-frequency component. [Figure: an RC low-pass filter with input Vin and output Vout] 58 / 101
  119. 119. Actually The average time constant of this filtering action is inversely proportional to the learning-rate parameter η. Thus, assigning a small value to η makes the adaptive process progress slowly. Thus, more of the past data are remembered by the LMS algorithm, and the LMS behaves as a more accurate filter. 59 / 101
  124. 124. Convergence of the LMS This convergence depends on the following points The statistical characteristics of the input vector x(n). The learning-rate parameter η. Something Notable However, instead of using $E[w(n)]$ as $n \to \infty$, we use $E[e^2(n)] \to \text{constant}$ as $n \to \infty$ 61 / 101
  126. 126. To make this analysis practical We take the following assumptions The successive input vectors x(1), x(2), ... are statistically independent of each other. At time step n, the input vector x(n) is statistically independent of all previous samples of the desired response, namely d(1), d(2), ..., d(n − 1). At time step n, the desired response d(n) is dependent on x(n), but statistically independent of all previous values of the desired response. The input vector x(n) and desired response d(n) are drawn from Gaussian-distributed populations. 62 / 101
  130. 130. We get the following The LMS is convergent in the mean square provided that η satisfies $0 < \eta < \frac{2}{\lambda_{max}}$ (28) where $\lambda_{max}$ is the largest eigenvalue of the sample correlation matrix $R_x$ This can be difficult to obtain in practice... then we use the trace instead $0 < \eta < \frac{2}{\text{trace}[R_x]}$ (29) However, each diagonal element of $R_x$ equals the mean-square value of the corresponding sensor input We can re-state the previous condition as $0 < \eta < \frac{2}{\text{sum of the mean-square values of the sensor inputs}}$ (30) 63 / 101
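A sketch of how the trace-based bound (29) might be computed from data, assuming NumPy and hypothetical inputs; since $\text{trace}[R_x] \geq \lambda_{max}$ for a correlation matrix, $2/\text{trace}[R_x]$ is the more conservative of the two bounds:

```python
import numpy as np

# Hypothetical n x m matrix of input vectors (one per row).
X = np.random.default_rng(1).normal(size=(500, 3))

R_x = (X.T @ X) / X.shape[0]      # sample correlation matrix, an estimate of E[x x^T]
eta_max = 2.0 / np.trace(R_x)     # conservative upper bound on the learning rate
print(eta_max)
```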
  133. 133. Virtues and Limitations of the LMS Algorithm Virtues An important virtue of the LMS algorithm is its simplicity. The algorithm is model independent and therefore robust to error (small disturbances produce small estimation errors). Not only that, the LMS algorithm is optimal in accordance with the minimax criterion If you do not know what you are up against, plan for the worst and optimize. Primary Limitation The slow rate of convergence and sensitivity to variations in the eigenstructure of the input. The LMS algorithm typically requires a number of iterations of about 10 times the dimensionality of the input space for convergence. 64 / 101
  138. 138. More of this in... Simon Haykin - Adaptive Filter Theory (3rd Edition) 65 / 101
  139. 139. Exercises We have from NN by Haykin 3.1, 3.2, 3.3, 3.4, 3.5, 3.7 and 3.8 66 / 101
  141. 141. Objective Goal Correctly classify a series of samples (externally applied stimuli) x1, x2, x3, ..., xm into one of two classes, C1 and C2. Output for each input 1 Class C1: output y = +1. 2 Class C2: output y = −1. 68 / 101
  143. 143. History Frank Rosenblatt The perceptron algorithm was invented in 1957 at the Cornell Aeronautical Laboratory by Frank Rosenblatt. Something Notable Frank Rosenblatt was a psychologist!!! Working at a military R&D lab!!! Frank Rosenblatt He helped to develop the Mark I Perceptron - a new machine based on the connectivity of neural networks!!! Some problems with it The most important is the impossibility of using the perceptron with a single neuron to solve the XOR problem 69 / 101
  148. 148. Perceptron: Local Field of a Neuron [Figure: signal-flow graph of the perceptron with its inputs, bias, and output] Induced local field of a neuron $v = \sum_{i=1}^{m} w_i x_i + b$ (31) 71 / 101
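A minimal sketch of Eq. (31), assuming NumPy and hypothetical weights, inputs, and bias; the sign of the induced local field decides the predicted class:

```python
import numpy as np

# Induced local field v = sum_i w_i x_i + b, for hypothetical values.
w = np.array([0.4, -0.7, 1.1])
x = np.array([1.0, 2.0, -0.5])
b = 0.2

v = w @ x + b
y = 1 if v > 0 else -1     # +1 -> class C1, -1 -> class C2
print(v, y)
```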
  151. 151. Perceptron: One Neuron Structure Based on the previous induced local field In the simplest form of the perceptron there are two decision regions separated by a hyperplane: $\sum_{i=1}^{m} w_i x_i + b = 0$ (32) [Figure: example with two input signals, showing the decision boundary between the two classes] 73 / 101
  154. 154. Deriving the Algorithm First, you put signals together $x(n) = [1, x_1(n), x_2(n), \ldots, x_m(n)]^T$ (33) Weights $v(n) = \sum_{i=0}^{m} w_i(n)\, x_i(n) = w^T(n)\, x(n)$ (34) Note IMPORTANT - The perceptron works only if C1 and C2 are linearly separable 75 / 101
  157. 157. Rule for Linear Separable Classes There must exist a vector w 1 $w^T x > 0$ for every input vector x belonging to class C1. 2 $w^T x \leq 0$ for every input vector x belonging to class C2. What is the derivative $\frac{dv(n)}{dw}$? $\frac{dv(n)}{dw} = x(n)$ (35) 76 / 101
  160. 160. Finally No correction is necessary 1 w(n + 1) = w(n) if $w^T(n)\, x(n) > 0$ and x(n) belongs to class C1. 2 w(n + 1) = w(n) if $w^T(n)\, x(n) \leq 0$ and x(n) belongs to class C2. Correction is necessary 1 w(n + 1) = w(n) − η(n) x(n) if $w^T(n)\, x(n) > 0$ and x(n) belongs to class C2. 2 w(n + 1) = w(n) + η(n) x(n) if $w^T(n)\, x(n) \leq 0$ and x(n) belongs to class C1. Where η(n) is a learning parameter adjusting the learning rate. 77 / 101
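A sketch of this error-correcting rule, assuming NumPy; the bias is handled through the augmented input $x = [1, x_1, \ldots, x_m]^T$ of Eq. (33), and the training data below are hypothetical, linearly separable points:

```python
import numpy as np

def perceptron_train(X, labels, eta=1.0, n_epochs=10):
    """Sketch of the error-correcting perceptron rule on augmented inputs."""
    Xa = np.hstack([np.ones((X.shape[0], 1)), X])   # prepend the fixed bias input
    w = np.zeros(Xa.shape[1])
    for _ in range(n_epochs):
        for x_n, c in zip(Xa, labels):               # c is +1 for C1, -1 for C2
            if c == +1 and w @ x_n <= 0:
                w = w + eta * x_n                    # misclassified C1 sample: add x(n)
            elif c == -1 and w @ x_n > 0:
                w = w - eta * x_n                    # misclassified C2 sample: subtract x(n)
    return w

# Hypothetical linearly separable data in R^2.
X = np.array([[2.0, 2.0], [1.5, 3.0], [-1.0, -2.0], [-2.0, -1.5]])
labels = np.array([+1, +1, -1, -1])
print(perceptron_train(X, labels))
```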
  165. 165. A little bit on the Geometry For example, $w(n+1) = w(n) - \eta(n)\, x(n)$ [Figure: geometric effect of the correction on the weight vector] 78 / 101
167. Under Linear Separability - Convergence happens!!!
Assume linear separability for the classes C1 and C2.
Rosenblatt - 1962: Let the subsets of training vectors C1 and C2 be linearly separable, and let the inputs presented to the perceptron originate from these two subsets. Then the perceptron converges after some n0 iterations, in the sense that
w(n0) = w(n0 + 1) = w(n0 + 2) = ... (36)
is a solution vector, with n0 ≤ nmax.
172. Proof I - First, a Lower Bound for ‖w(n + 1)‖²
Initialization: w(0) = 0. (37)
Now assume that for time n = 1, 2, 3, ...
w^T(n) x(n) < 0, (38)
with x(n) belonging to class C1; that is, the perceptron incorrectly classifies the vectors x(1), x(2), ...
Apply the correction formula
w(n + 1) = w(n) + x(n). (39)
177. Proof II
Apply the correction iteratively:
w(n + 1) = x(1) + x(2) + ... + x(n). (40)
We know that there is a solution w0 (linear separability), so define
α = min_{x(n) ∈ C1} w0^T x(n). (41)
Then, we have
w0^T w(n + 1) = w0^T x(1) + w0^T x(2) + ... + w0^T x(n). (42)
183. Proof IV
Thus, using the definition of α:
w0^T w(n + 1) ≥ nα. (46)
By the Cauchy-Schwarz inequality,
‖w0‖² ‖w(n + 1)‖² ≥ [w0^T w(n + 1)]², (47)
where ‖·‖ is the Euclidean norm. Thus
‖w0‖² ‖w(n + 1)‖² ≥ n²α², and therefore ‖w(n + 1)‖² ≥ n²α² / ‖w0‖².
186. Proof V - Now an Upper Bound for ‖w(n + 1)‖²
Now rewrite equation 39 as
w(k + 1) = w(k) + x(k), (48)
for k = 1, 2, ..., n and x(k) ∈ C1.
Squaring the Euclidean norm of both sides:
‖w(k + 1)‖² = ‖w(k)‖² + ‖x(k)‖² + 2 w^T(k) x(k). (49)
Now, using the fact that w^T(k) x(k) < 0,
‖w(k + 1)‖² ≤ ‖w(k)‖² + ‖x(k)‖², i.e., ‖w(k + 1)‖² − ‖w(k)‖² ≤ ‖x(k)‖².
189. Proof VI
Use the telescopic sum:
Σ_{k=0}^{n} [‖w(k + 1)‖² − ‖w(k)‖²] ≤ Σ_{k=0}^{n} ‖x(k)‖². (50)
Assume w(0) = 0 and x(0) = 0. Thus
‖w(n + 1)‖² ≤ Σ_{k=1}^{n} ‖x(k)‖².
192. Proof VII
Then, we can define a positive number
β = max_{x(k) ∈ C1} ‖x(k)‖². (51)
Thus
‖w(n + 1)‖² ≤ Σ_{k=1}^{n} ‖x(k)‖² ≤ nβ.
The lower and upper bounds can both hold only up to some nmax (our "sandwich" argument), obtained by equating them:
nmax² α² / ‖w0‖² = nmax β. (52)
195. Proof VIII
Solving for nmax:
nmax = β ‖w0‖² / α². (53)
Thus, for η(n) = 1 for all n, w(0) = 0, and a solution vector w0:
The rule for adapting the synaptic weights of the perceptron must terminate after at most nmax steps.
In addition, because the solution w0 is not unique, neither is the value of nmax.
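As a quick sanity check of this bound (an illustrative remark under extra assumptions, not part of the original slides): if every training vector satisfies ‖x(k)‖ ≤ R and the solution w0 is scaled so that ‖w0‖ = 1 with margin w0^T x(n) ≥ γ for all x(n) ∈ C1, then β ≤ R² and α ≥ γ, so
nmax = β ‖w0‖² / α² ≤ R² / γ²,
i.e., the number of corrections grows with the spread of the data and shrinks as the separation margin increases.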
199. Algorithm Using Error-Correcting
Now, if we use the cost (1/2) e_k(n)², we can actually simplify the rules and the final algorithm!!!
Thus, we have the following delta value:
Δw(n) = η (dj − yj(n)) x(n). (54)
202. We could use the previous Rule
In order to generate an algorithm. However, you need classes that are linearly separable!!!
Therefore, we can use a more general Gradient Descent Rule to obtain an algorithm for the best separating hyperplane!!!
204. Gradient Descent Algorithm
Algorithm - Off-line/Batch Learning
1. Set n = 0.
2. Set dj = +1 if xj ∈ Class 1 and dj = −1 if xj ∈ Class 2, for all j = 1, 2, ..., m.
3. Initialize the weights w^T(n) = (w1(n), w2(n), ..., wp(n)). Weights may be initialized to 0 or to a small random value.
4. Initialize dummy outputs y^T(n) = (y1(n), y2(n), ..., ym(n)) so you can enter the loop.
5. Initialize the stopping error ε > 0.
6. Initialize the learning rate η.
7. While (1/m) Σ_{j=1}^{m} |dj − yj(n)| > ε:
   For each sample (xj, dj), j = 1, ..., m:
     Calculate the output yj(n) = φ(w^T(n) · xj).
     Update the weights: wi(n + 1) = wi(n) + η (dj − yj(n)) xij.
   Set n = n + 1.
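A minimal Python (NumPy) sketch of this training loop; the sign-like activation φ, the ±1 encoding, and the parameter defaults are illustrative assumptions rather than values fixed by the slides.

```python
import numpy as np

def train_perceptron(X, d, eta=0.1, eps=0.01, max_epochs=1000):
    """Error-correcting training loop for a single-layer perceptron.

    X   : (m, p) array of input vectors (append a constant 1 column to learn a bias)
    d   : (m,) array of desired outputs, +1 for Class 1 and -1 for Class 2
    eta : learning rate; eps : stopping error (> 0)
    """
    m, p = X.shape
    w = np.zeros(p)                            # weights initialized to 0 (or small random values)
    for n in range(max_epochs):
        y = np.where(X @ w > 0, 1.0, -1.0)     # y_j(n) = phi(w^T(n) . x_j) with a sign-like phi
        if np.mean(np.abs(d - y)) <= eps:      # stop when (1/m) sum_j |d_j - y_j(n)| <= eps
            break
        for xj, dj in zip(X, d):               # one pass over the m samples
            yj = 1.0 if w @ xj > 0 else -1.0
            w = w + eta * (dj - yj) * xj       # w_i(n+1) = w_i(n) + eta (d_j - y_j(n)) x_ij
    return w
```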
206. Nevertheless
We have the following problem: the stopping error must satisfy
ε > 0. (55)
Thus, convergence to the best linear separation is a tweaking business!!!
209. However, if we limit our features!!!
The Winnow Algorithm!!! It converges even when linear separability does not hold.
Feature Vector: Boolean-valued features, X = {0, 1}^d.
Weight Vector w:
1. w^T = (w1, w2, ..., wd) with wi ∈ R.
2. For all i, wi ≥ 0.
212. Classification Scheme
We use a specific threshold θ:
1. w^T x ≥ θ ⇒ positive classification (predict Class 1), correct if x ∈ Class 1.
2. w^T x < θ ⇒ negative classification (predict Class 2), correct if x ∈ Class 2.
Rule: We use two possible rules for training, with a learning rate α > 1.
Rule 1: When misclassifying a positive training example x ∈ Class 1 (i.e., w^T x < θ):
∀ xi = 1 : wi ← α wi. (56)
215. Classification Scheme
Rule 2: When misclassifying a negative training example x ∈ Class 2 (i.e., w^T x ≥ θ):
∀ xi = 1 : wi ← wi / α. (57)
Rule 3: If a sample is correctly classified, do nothing!!!
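A small Python (NumPy) sketch of the two Winnow rules above; the initial weights of 1, the default threshold θ = d, and α = 2 are common illustrative choices, not values given in the slides.

```python
import numpy as np

def train_winnow(X, labels, alpha=2.0, theta=None, epochs=20):
    """Winnow training on Boolean feature vectors.

    X      : (m, d) array with entries in {0, 1}
    labels : (m,) array, +1 for Class 1 (positive) and -1 for Class 2 (negative)
    alpha  : multiplicative learning rate, alpha > 1
    theta  : threshold; the number of features d is a common default choice
    """
    m, d = X.shape
    theta = float(d) if theta is None else theta
    w = np.ones(d)                           # non-negative weights, initialized to 1
    for _ in range(epochs):
        for x, label in zip(X, labels):
            positive = (w @ x >= theta)      # w^T x >= theta  ->  predict Class 1
            if label == +1 and not positive:
                w[x == 1] *= alpha           # Rule 1: promote weights of the active features
            elif label == -1 and positive:
                w[x == 1] /= alpha           # Rule 2: demote weights of the active features
            # Rule 3: correctly classified -> leave w unchanged
    return w, theta
```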
217. Properties of Winnow
Property: If there are many irrelevant variables, Winnow performs better than the Perceptron.
Drawback: It is sensitive to the learning rate α.
219. The Pocket Algorithm
A variant of the Perceptron Algorithm: it was suggested by Gallant in "Perceptron-based learning algorithms," IEEE Transactions on Neural Networks, Vol. 1(2), pp. 179–191, 1990. It converges to an optimal solution even if linear separability is not fulfilled.
It consists of the following steps:
1. Initialize the weight vector w(0) at random.
2. Define a storage (pocket) vector ws and set its history counter hs to zero.
3. At the nth iteration step, compute the update w(n + 1) using the Perceptron rule.
4. Use the updated weights to find the number h of samples correctly classified.
5. If at any moment h > hs, replace ws with w(n + 1) and hs with h.
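A compact Python (NumPy) sketch of these five steps; the random-initialization scale, the reuse of the earlier error-correcting update, and the iteration budget are illustrative assumptions.

```python
import numpy as np

def train_pocket(X, d, eta=1.0, max_iters=10000, seed=0):
    """Pocket algorithm: run the perceptron rule, but keep ('pocket') the best weights seen so far.

    X : (m, p) array of input vectors; d : (m,) array of labels in {-1, +1}
    Returns the pocket vector ws and the number hs of samples it classifies correctly.
    """
    rng = np.random.default_rng(seed)
    m, p = X.shape
    w = rng.normal(scale=0.01, size=p)            # 1. initialize w(0) at random
    ws, hs = w.copy(), 0                          # 2. pocket vector and its history counter
    for n in range(max_iters):
        j = int(rng.integers(m))                  # present one training sample
        yj = 1.0 if w @ X[j] > 0 else -1.0
        w = w + eta * (d[j] - yj) * X[j]          # 3. update w(n+1) with the Perceptron rule
        h = int(np.sum(np.where(X @ w > 0, 1.0, -1.0) == d))  # 4. samples correctly classified
        if h > hs:                                # 5. better than the pocket -> replace it
            ws, hs = w.copy(), h
    return ws, hs
```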
