Single Layer Perceptron

Some updates were made to the Winnow Algorithm, and I added the Pocket version.

Machine Learning, Deep Learning and Computational Mathematics


- 1. Neural Networks: Introduction. Single-Layer Perceptron. Andres Mendez-Vazquez, July 3, 2016.
- 2. Outline: 1 Introduction: History. 2 Adapting Filtering Problem: Definition; Description of the Behavior of the System. 3 Unconstrained Optimization: Introduction; Method of Steepest Descent; Newton's Method; Gauss-Newton Method. 4 Linear Least-Squares Filter: Introduction; Least-Mean-Square (LMS) Algorithm; Convergence of the LMS. 5 Perceptron: Objective; Local Field of a Neuron; One-Neuron Structure; Deriving the Algorithm Under Linear Separability (Convergence happens! Proof); Algorithm Using Error Correcting; Final Perceptron Algorithm (One Version); Other Algorithms for the Perceptron.
- 4. History. At the beginning of Neural Networks (1943 - 1958): McCulloch and Pitts (1943) introduced the idea of neural networks as computing machines. Hebb (1949) postulated the first rule for self-organized learning. Rosenblatt (1958) proposed the perceptron as the first model for learning with a teacher (i.e., supervised learning). In this chapter, we are interested in the perceptron: the simplest form of a neural network, used for the classification of patterns said to be linearly separable (i.e., patterns that lie on opposite sides of a hyperplane).
- 8. In addition. Something Notable: The single neuron also forms the basis of an adaptive filter, a functional block that is basic to the ever-expanding subject of signal processing. Furthermore: The development of adaptive filtering owes much to the classic paper of Widrow and Hoff (1960), which pioneered the so-called least-mean-square (LMS) algorithm, also known as the delta rule.
- 12. Adapting Filtering Problem. Consider a dynamical system. [Figure: an unknown dynamical system receiving the inputs and producing an output.]
- 13. Signal-Flow Graph of Adaptive Model. We have the following equivalence. [Figure: signal-flow graph of the adaptive model, with the desired response entering through a gain of -1.]
- 15. Description of the Behavior of the System. We have the data set $T = \{(x(i), d(i)) \mid i = 1, 2, \ldots, n, \ldots\}$ (1), where $x(i) = (x_1(i), x_2(i), \ldots, x_m(i))^T$ (2).
- 17. The Stimulus x(i). The stimulus x(i) can arise in one of two ways. First (spatial): the m elements of x(i) originate at different points in space.
- 18. The Stimulus x(i). Second (temporal): the m elements of x(i) represent the set of the present and (m - 1) past values of some excitation that are uniformly spaced in time.
- 19. Problem. Quite important: how do we design a multiple-input, single-output model of the unknown dynamical system? Moreover, we want to build this model around a single neuron!
- 21. Thus, we have the following... We need an algorithm to control the weight adjustment of the neuron. [Figure: the neuron with a control-algorithm block adjusting its weights.]
- 22. Which steps do you need for the algorithm? First: the algorithm starts from an arbitrary setting of the neuron's synaptic weights. Second: adjustments, with respect to changes in the environment, are made on a continuous basis; time is incorporated into the algorithm. Third: computations of the adjustments to the synaptic weights are completed inside a time interval that is one sampling period long.
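The three steps above can be sketched as a minimal simulation loop. This is a hedged illustration, not the slides' algorithm: the unknown system `true_w`, the data, and the learning rate `eta` are made up, and the adjustment step borrows the LMS-style rule that appears later in the deck.

```python
import numpy as np

rng = np.random.default_rng(0)

m = 3                                 # number of inputs to the neuron
w = np.zeros(m)                       # First: arbitrary initial synaptic weights
eta = 0.05                            # learning rate (illustrative value)
true_w = np.array([0.5, -0.2, 0.1])   # hypothetical unknown dynamical system

for i in range(200):                  # Second: continuous, sample-by-sample operation
    x = rng.normal(size=m)            # stimulus x(i)
    d = true_w @ x                    # desired response d(i) from the unknown system
    y = w @ x                         # neuron output y(i)
    e = d - y                         # error signal e(i)
    w = w + eta * e * x               # Third: adjustment within one sampling period
```

After enough samples, the weights of the single neuron track those of the unknown system, which is exactly the adaptive-filtering behavior the slides describe.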
- 26. Thus, this Neural Model ≈ an Adaptive Filter with two continuous processes. Filtering process: (1) an output, denoted by y(i), that is produced in response to the m elements of the stimulus vector x(i); (2) an error signal, e(i), that is obtained by comparing the output y(i) to the corresponding desired output d(i) produced by the unknown system. Adaptive process: the automatic adjustment of the synaptic weights of the neuron in accordance with the error signal e(i). Remark: the combination of these two processes working together constitutes a feedback loop acting around the neuron.
- 29. Thus. The output y(i) is exactly the same as the induced local field v(i): $y(i) = v(i) = \sum_{k=1}^{m} w_k(i) x_k(i)$ (3). In matrix form (remember we only have one neuron, so we do not need the neuron index k): $y(i) = x^T(i) w(i)$ (4). Error: $e(i) = d(i) - y(i)$ (5).
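Equations (4) and (5) amount to a dot product and a subtraction; a minimal check with made-up numbers for x(i), w(i), and d(i):

```python
import numpy as np

x = np.array([1.0, -2.0, 0.5])   # stimulus vector x(i), m = 3
w = np.array([0.3, 0.1, -0.4])   # current synaptic weights w(i)
d = 1.0                          # desired response d(i) (illustrative)

y = x @ w                        # Eq. (4): y(i) = x^T(i) w(i)
e = d - y                        # Eq. (5): e(i) = d(i) - y(i)
```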
- 33. Consider a continuously differentiable function J(w). We want to find an optimal solution w* such that $J(w^*) \le J(w), \ \forall w$ (6). That is, we want to minimize the cost function J(w) with respect to the weight vector w. The necessary condition for this is $\nabla J(w^*) = 0$ (7).
- 36. Where $\nabla$ is the gradient operator $\nabla = \left(\frac{\partial}{\partial w_1}, \frac{\partial}{\partial w_2}, \ldots, \frac{\partial}{\partial w_m}\right)^T$ (8). Thus $\nabla J(w) = \left(\frac{\partial J(w)}{\partial w_1}, \frac{\partial J(w)}{\partial w_2}, \ldots, \frac{\partial J(w)}{\partial w_m}\right)^T$ (9).
- 38. Thus, starting with an initial guess denoted by w(0), generate a sequence of weight vectors w(1), w(2), ... such that J(w) is reduced at each iteration: $J(w(n+1)) < J(w(n))$ (10), where w(n) is the old value and w(n+1) is the new value.
- 40. The Three Main Methods for Unconstrained Optimization. We will look at: 1. Steepest Descent; 2. Newton's Method; 3. Gauss-Newton Method.
- 42. Steepest Descent. In the method of steepest descent, we have a cost function J(w) and the update $w(n+1) = w(n) - \eta \nabla J(w(n))$. How do we prove that $J(w(n+1)) < J(w(n))$? We use the first-order Taylor series expansion around w(n): $J(w(n+1)) \approx J(w(n)) + \nabla J^T(w(n)) \Delta w(n)$ (11), where $\Delta w(n) = w(n+1) - w(n)$. Remark: this approximation holds well when the step size is quite small.
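A sketch of the steepest-descent update on a simple convex quadratic. The cost J, the matrix A, and the step size `eta` are illustrative choices, not from the slides; the inner assertion checks Eq. (10) at every step.

```python
import numpy as np

A = np.array([[3.0, 1.0],
              [1.0, 2.0]])            # symmetric positive definite, so J is convex

def J(w):
    return 0.5 * w @ A @ w            # cost function J(w)

def grad_J(w):
    return A @ w                      # gradient of J

eta = 0.1                             # small step size (illustrative)
w = np.array([1.0, -1.0])             # initial guess w(0)
for n in range(100):
    w_new = w - eta * grad_J(w)       # w(n+1) = w(n) - eta * grad J(w(n))
    assert J(w_new) <= J(w)           # Eq. (10): the cost never increases
    w = w_new
```

The iterates converge toward the minimizer w* = 0, where the gradient vanishes as Eq. (7) requires.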
- 44. Why? Look at the case in R. The equation of the tangent line to the curve y = J(w) at w(n) is $L(w(n+1)) = J'(w(n))[w(n+1) - w(n)] + J(w(n))$ (12). [Figure: example of the tangent-line approximation.]
- 46. Thus, we have that in R. Remember something quite classic: $\tan \theta = \frac{J(w(n+1)) - J(w(n))}{w(n+1) - w(n)}$, so $\tan \theta \, (w(n+1) - w(n)) = J(w(n+1)) - J(w(n))$, and, approximating the slope by the derivative, $J'(w(n)) (w(n+1) - w(n)) \approx J(w(n+1)) - J(w(n))$.
- 49. Thus, we have that, using the first-order Taylor expansion, $J(w(n+1)) \approx J(w(n)) + J'(w(n))[w(n+1) - w(n)]$ (13).
- 50. Now, for Many Variables. A hyperplane in $R^m$ is a set of the form $H = \{x \mid a^T x = b\}$ (14). Given $x \in H$ and $x_0 \in H$: $b = a^T x = a^T x_0$. Thus, we have that $H = \{x \mid a^T (x - x_0) = 0\}$.
- 53. Thus, we have the following definition. Definition (Differentiability): Assume that J is defined in a disk D containing w(n). We say that J is differentiable at w(n) if: (1) $\frac{\partial J(w(n))}{\partial w_i}$ exists for all $i = 1, \ldots, m$; (2) J is locally linear at w(n).
- 56. Thus, given $\nabla J(w(n))$. We know that we have the operator $\nabla = \left(\frac{\partial}{\partial w_1}, \frac{\partial}{\partial w_2}, \ldots, \frac{\partial}{\partial w_m}\right)$ (15). Thus, we have $\nabla J(w(n)) = \left(\frac{\partial J(w(n))}{\partial w_1}, \frac{\partial J(w(n))}{\partial w_2}, \ldots, \frac{\partial J(w(n))}{\partial w_m}\right) = \sum_{i=1}^{m} \hat{w}_i \frac{\partial J(w(n))}{\partial w_i}$, where $\hat{w}_i \in R^m$ is the unit vector with a 1 in position i (e.g., $\hat{w}_1^T = (1, 0, \ldots, 0)$).
- 58. Now. Given a curve function r(t) that lies on the level set J(w(n)) = c. [Figure: a curve lying on a level surface, shown in $R^3$.]
- 59. Level Set. Definition: $\{(w_1, w_2, \ldots, w_m) \in R^m \mid J(w_1, w_2, \ldots, w_m) = c\}$ (16). Remark: in a standard calculus course we would use x and f instead of w and J.
- 60. Where. Any curve has the following parametrization: $r : [a, b] \to R^m$, $r(t) = (w_1(t), \ldots, w_m(t))$, with $r(n+1) = (w_1(n+1), \ldots, w_m(n+1))$. We can write the parametrized version of it: $z(t) = J(w_1(t), w_2(t), \ldots, w_m(t)) = c$ (17). Differentiating with respect to t and using the chain rule for multiple variables: $\frac{dz(t)}{dt} = \sum_{i=1}^{m} \frac{\partial J(w(t))}{\partial w_i} \cdot \frac{dw_i(t)}{dt} = 0$ (18).
- 63. Note. First, given $y = f(u) = (f_1(u), \ldots, f_l(u))$ and $u = g(x) = (g_1(x), \ldots, g_m(x))$, we have that $\frac{\partial (f_1, \ldots, f_l)}{\partial (x_1, \ldots, x_k)} = \frac{\partial (f_1, \ldots, f_l)}{\partial (g_1, \ldots, g_m)} \cdot \frac{\partial (g_1, \ldots, g_m)}{\partial (x_1, \ldots, x_k)}$ (19). Thus $\frac{\partial (f_1, \ldots, f_l)}{\partial x_i} = \frac{\partial (f_1, \ldots, f_l)}{\partial (g_1, \ldots, g_m)} \cdot \frac{\partial (g_1, \ldots, g_m)}{\partial x_i} = \sum_{k=1}^{m} \frac{\partial (f_1, \ldots, f_l)}{\partial g_k} \frac{\partial g_k}{\partial x_i}$.
- 66. Thus. Evaluating at t = n: $\sum_{i=1}^{m} \frac{\partial J(w(n))}{\partial w_i} \cdot \frac{dw_i(n)}{dt} = 0$. We have that $\nabla J(w(n)) \cdot r'(n) = 0$ (20). This proves that, for every level set, the gradient is perpendicular to the tangent of any curve that lies on the level set; in particular, at the point w(n).
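Equation (20) can be checked numerically on the simplest level set, the unit circle $J(w) = w_1^2 + w_2^2 = 1$ (an illustrative choice), with the parametrization $r(t) = (\cos t, \sin t)$:

```python
import numpy as np

def grad_J(w):
    return 2.0 * w                            # gradient of J(w) = w1^2 + w2^2

t = 0.7                                       # arbitrary parameter value
r = np.array([np.cos(t), np.sin(t)])          # point r(t) on the level set J = 1
r_dot = np.array([-np.sin(t), np.cos(t)])     # tangent vector dr/dt

dot = grad_J(r) @ r_dot                       # Eq. (20): gradient . tangent
```

The dot product vanishes (up to floating-point error) for any t, as the derivation predicts.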
- 69. Now the tangent plane to the surface can be described generally. Thus $L(w(n+1)) = J(w(n)) + \nabla J^T(w(n))[w(n+1) - w(n)]$ (21). [Figure: the tangent plane to the cost surface.]
- 71. Proving the fact about Steepest Descent. We want the following: $J(w(n+1)) < J(w(n))$. Using the first-order Taylor approximation: $J(w(n+1)) - J(w(n)) \approx \nabla J^T(w(n)) \Delta w(n)$. So we choose $\Delta w(n) = -\eta \nabla J(w(n))$ with $\eta > 0$.
- 74. Then we have that $J(w(n+1)) - J(w(n)) \approx -\eta \nabla J^T(w(n)) \nabla J(w(n)) = -\eta \|\nabla J(w(n))\|^2$. Thus $J(w(n+1)) - J(w(n)) < 0$, or $J(w(n+1)) < J(w(n))$.
- 78. Newton's Method. The basic idea of Newton's method is to minimize the quadratic approximation of the cost function J(w) around the current point w(n), using a second-order Taylor series expansion of the cost function around w(n): $\Delta J(w(n)) = J(w(n+1)) - J(w(n)) \approx \nabla J^T(w(n)) \Delta w(n) + \frac{1}{2} \Delta w^T(n) H(n) \Delta w(n)$, where, given that w(n) is a vector of dimension m, the Hessian is $H = \nabla^2 J(w) = \begin{bmatrix} \frac{\partial^2 J(w)}{\partial w_1^2} & \frac{\partial^2 J(w)}{\partial w_1 \partial w_2} & \cdots & \frac{\partial^2 J(w)}{\partial w_1 \partial w_m} \\ \frac{\partial^2 J(w)}{\partial w_2 \partial w_1} & \frac{\partial^2 J(w)}{\partial w_2^2} & \cdots & \frac{\partial^2 J(w)}{\partial w_2 \partial w_m} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial^2 J(w)}{\partial w_m \partial w_1} & \frac{\partial^2 J(w)}{\partial w_m \partial w_2} & \cdots & \frac{\partial^2 J(w)}{\partial w_m^2} \end{bmatrix}$
- 81. Now, we want to minimize J(w(n+1)). Look again: $J(w(n)) + \nabla J^T(w(n)) \Delta w(n) + \frac{1}{2} \Delta w^T(n) H(n) \Delta w(n)$ (22). Differentiating with respect to $\Delta w(n)$ and setting the result to zero: $\nabla J(w(n)) + H(n) \Delta w(n) = 0$ (23). Thus $\Delta w(n) = -H^{-1}(n) \nabla J(w(n))$.
- 84. The Final Method. Since $\Delta w(n) = w(n+1) - w(n) = -H^{-1}(n) \nabla J(w(n))$, we obtain the update $w(n+1) = w(n) - H^{-1}(n) \nabla J(w(n))$.
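For a quadratic cost the second-order Taylor expansion is exact, so a single Newton step lands on the minimizer. A sketch (the cost $J(w) = \frac{1}{2} w^T A w - b^T w$, with A and b made up for illustration):

```python
import numpy as np

A = np.array([[3.0, 1.0],
              [1.0, 2.0]])            # Hessian H of this quadratic cost
b = np.array([1.0, -2.0])

def grad_J(w):
    return A @ w - b                  # gradient of J(w) = 0.5 w^T A w - b^T w

w = np.array([5.0, 5.0])              # starting point w(0)
# Newton update: w(n+1) = w(n) - H^{-1}(n) grad J(w(n))
w = w - np.linalg.solve(A, grad_J(w))
```

Note that the step is computed by solving the linear system H Δw = -∇J rather than forming the inverse explicitly, which is the standard numerically sound choice.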
- 87. We have then an error. Something Notable: $J(w) = \frac{1}{2} \sum_{i=1}^{n} e^2(i)$. Thus, using the first-order Taylor expansion, we linearize: $e_l(i, w) = e(i) + \left[\frac{\partial e(i)}{\partial w}\right]^T [w - w(n)]$. In matrix form: $e_l(n, w) = e(n) + J(n)[w - w(n)]$.
- 90. Where. The error vector is $e(n) = [e(1), e(2), \ldots, e(n)]^T$ (24). Differentiating each $\frac{\partial e(i)}{\partial w}$, we get the famous Jacobian: $J(n) = \begin{bmatrix} \frac{\partial e(1)}{\partial w_1} & \frac{\partial e(1)}{\partial w_2} & \cdots & \frac{\partial e(1)}{\partial w_m} \\ \frac{\partial e(2)}{\partial w_1} & \frac{\partial e(2)}{\partial w_2} & \cdots & \frac{\partial e(2)}{\partial w_m} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial e(n)}{\partial w_1} & \frac{\partial e(n)}{\partial w_2} & \cdots & \frac{\partial e(n)}{\partial w_m} \end{bmatrix}$
- 92. Where. We want the following: $w(n+1) = \arg\min_{w} \frac{1}{2} \|e_l(n, w)\|^2$. Idea: what if we expand out the equation?
- 94. Expanded Version. We get $\frac{1}{2} \|e_l(n, w)\|^2 = \frac{1}{2} \|e(n)\|^2 + e^T(n) J(n) (w - w(n)) + \frac{1}{2} (w - w(n))^T J^T(n) J(n) (w - w(n))$.
- 95. Now, doing the differentiation, we get. Differentiating the equation with respect to w and setting the result to zero: $J^T(n) e(n) + J^T(n) J(n) [w - w(n)] = 0$. We finally get $w(n+1) = w(n) - \left[J^T(n) J(n)\right]^{-1} J^T(n) e(n)$ (25).
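One application of Eq. (25) can be sketched directly. The data X, the target weights, and the noise-free setup are made up for illustration; for a genuinely nonlinear error the Jacobian would be recomputed at every iteration, whereas here the error is linear in w, so a single step suffices.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(20, 3))          # 20 samples of a 3-dimensional input
w_true = np.array([1.0, -0.5, 2.0])   # hypothetical target weights
d = X @ w_true                        # noise-free desired responses

w = np.zeros(3)                       # current estimate w(n)
e = d - X @ w                         # error vector e(n)
Jac = -X                              # Jacobian of e(n) with respect to w
# Eq. (25): w(n+1) = w(n) - (J^T J)^{-1} J^T e(n)
w = w - np.linalg.solve(Jac.T @ Jac, Jac.T @ e)
```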
- 97. Remarks. Newton's method requires knowledge of the Hessian matrix of the cost function, while the Gauss-Newton method only requires the Jacobian matrix of the error vector e(n). However, for the Gauss-Newton iteration to be computable, the matrix product $J^T(n) J(n)$ must be nonsingular.
- 100. Introduction. A linear least-squares filter has two distinctive characteristics: first, the single neuron around which it is built is linear; second, the cost function J(w) used to design the filter consists of the sum of error squares. Thus, expressing the error: $e(n) = d(n) - [x(1), \ldots, x(n)]^T w(n)$. Short version (the error is linear in the weight vector w(n)): $e(n) = d(n) - X(n) w(n)$, where d(n) is the n × 1 desired response vector and X(n) is the n × m data matrix.
- 103. Now, differentiate e (n) with respect to w (n): ∇e (n) = −X^T (n). Correspondingly, the Jacobian of e (n) is J (n) = −X (n). Let us use the Gauss-Newton update: w (n + 1) = w (n) − (J^T (n) J (n))^{-1} J^T (n) e (n). 53 / 101
- 106. Thus We have the following: w (n + 1) = w (n) − (−X^T (n) · −X (n))^{-1} · (−X^T (n)) [d (n) − X (n) w (n)]. Expanding, w (n + 1) = w (n) + (X^T (n) X (n))^{-1} X^T (n) d (n) − (X^T (n) X (n))^{-1} X^T (n) X (n) w (n). Thus, we have w (n + 1) = w (n) + (X^T (n) X (n))^{-1} X^T (n) d (n) − w (n) = (X^T (n) X (n))^{-1} X^T (n) d (n). 54 / 101
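The closed-form least-squares solution w (n + 1) = (X^T (n) X (n))^{-1} X^T (n) d (n) can be checked numerically. A minimal sketch with synthetic data (the matrix sizes, noise level, and weight values are illustrative assumptions, not from the slides):

```python
import numpy as np

# Hypothetical data: n = 50 noisy observations of a linear neuron
# with a true weight vector of dimension m = 3.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))                  # n x m data matrix X(n)
w_true = np.array([1.5, -2.0, 0.5])
d = X @ w_true + 0.01 * rng.normal(size=50)   # n x 1 desired response d(n)

# Closed-form solution w = (X^T X)^{-1} X^T d (the pseudoinverse solution).
w_normal = np.linalg.inv(X.T @ X) @ X.T @ d

# In practice np.linalg.lstsq is preferred: it solves the same problem
# without explicitly forming and inverting X^T X.
w_lstsq, *_ = np.linalg.lstsq(X, d, rcond=None)
```

Both routes recover essentially the same weight vector; the `lstsq` route is numerically safer when X^T (n) X (n) is close to singular.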
- 110. Again Our Error Cost function We have J (w) = (1/2) e^2 (n), where e (n) is the error signal measured at time n. Again differentiating with respect to the vector w: ∂J (w) /∂w = e (n) ∂e (n) /∂w. The LMS algorithm operates with a linear neuron, so we may express the error signal as e (n) = d (n) − x^T (n) w (n) (26). 56 / 101
- 113. We have Something Notable ∂e (n) /∂w = −x (n). Then ∂J (w) /∂w = −x (n) e (n). Using this as an estimate for the gradient vector, we have the gradient-descent update w (n + 1) = w (n) + ηx (n) e (n) (27). 57 / 101
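The update w (n + 1) = w (n) + ηx (n) e (n) of Eq. (27) is all there is to the LMS algorithm. A minimal sketch, assuming synthetic data from a noiseless linear plant (the data, η, and sample count are illustrative choices):

```python
import numpy as np

def lms(X, d, eta=0.01, w0=None):
    """Least-mean-square filter: w(n+1) = w(n) + eta * x(n) * e(n)."""
    n, m = X.shape
    w = np.zeros(m) if w0 is None else w0.copy()
    for i in range(n):
        e = d[i] - X[i] @ w        # instantaneous error e(n), Eq. (26)
        w = w + eta * X[i] * e     # stochastic-gradient step, Eq. (27)
    return w

# Toy usage on a synthetic linear plant (hypothetical data).
rng = np.random.default_rng(1)
X = rng.normal(size=(5000, 3))
w_true = np.array([0.5, -1.0, 2.0])
d = X @ w_true
w_hat = lms(X, d, eta=0.01)
```

Note the update touches only the current sample x (n); this is what makes LMS an online, model-independent algorithm.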
- 116. Remarks The feedback loop around the weight vector behaves like a low-pass filter: it passes the low-frequency components of the error signal while attenuating its high-frequency components. (Figure: an RC low-pass circuit with input Vin and output Vout.) 58 / 101
- 119. Actually The average time constant of this filtering action is inversely proportional to the learning-rate parameter η. Thus, assigning a small value to η makes the adaptive process progress slowly; more of the past data is remembered by the LMS algorithm, and the LMS filter is therefore more accurate. 59 / 101
- 124. Convergence of the LMS This convergence depends on the following points: the statistical characteristics of the input vector x (n), and the learning-rate parameter η. Something Notable: instead of requiring E [w (n)] to converge as n → ∞, we require E [e^2 (n)] → constant as n → ∞ (convergence in the mean square). 61 / 101
- 126. To make this analysis practical We take the following assumptions: The successive input vectors x(1), x(2), ... are statistically independent of each other. At time step n, the input vector x(n) is statistically independent of all previous samples of the desired response, namely d(1), d(2), ..., d(n − 1). At time step n, the desired response d(n) is dependent on x(n), but statistically independent of all previous values of the desired response. The input vector x(n) and desired response d(n) are drawn from Gaussian-distributed populations. 62 / 101
- 130. We get the following The LMS is convergent in the mean square provided that η satisfies 0 < η < 2/λmax (28), where λmax is the largest eigenvalue of the sample correlation matrix Rx. Since λmax can be difficult to obtain in practice, we use the trace instead: 0 < η < 2/trace [Rx] (29). However, each diagonal element of Rx equals the mean-squared value of the corresponding sensor input, so we can re-state the previous condition as 0 < η < 2/(sum of the mean-square values of the sensor inputs) (30). 63 / 101
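Conditions (28)-(30) can be compared on data. A sketch with a synthetic input process (the dimensions and per-channel scaling factors are illustrative assumptions):

```python
import numpy as np

# Synthetic multichannel input with unequal per-sensor power.
rng = np.random.default_rng(2)
X = rng.normal(size=(10000, 4)) * np.array([1.0, 2.0, 0.5, 1.5])

R = (X.T @ X) / len(X)                 # sample correlation matrix Rx
lam_max = np.linalg.eigvalsh(R).max()  # largest eigenvalue of Rx

bound_eig = 2.0 / lam_max              # Eq. (28): the tightest bound
bound_trace = 2.0 / np.trace(R)        # Eq. (29): conservative but cheap
bound_power = 2.0 / np.mean(X**2, axis=0).sum()  # Eq. (30): same quantity as (29)
```

Since trace [Rx] ≥ λmax for a positive semidefinite matrix, the trace bound of Eq. (29) is always at least as strict as Eq. (28), which is why it is safe to use when the eigenstructure is unknown.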
- 133. Virtues and Limitations of the LMS Algorithm Virtues An important virtue of the LMS algorithm is its simplicity. The algorithm is model independent and robust to disturbances (small disturbances produce small estimation errors). Not only that, the LMS algorithm is optimal in accordance with the minimax criterion: if you do not know what you are up against, plan for the worst and optimize. Primary Limitation The slow rate of convergence and sensitivity to variations in the eigenstructure of the input. The LMS algorithm typically requires a number of iterations on the order of 10 times the dimensionality of the input space for convergence. 64 / 101
- 138. More on this in Simon Haykin - Adaptive Filter Theory (3rd Edition) 65 / 101
- 139. Exercises From Neural Networks by Haykin: 3.1, 3.2, 3.3, 3.4, 3.5, 3.7 and 3.8 66 / 101
- 141. Objective Goal Correctly classify a series of samples (externally applied stimuli) x1, x2, x3, ..., xm into one of two classes, C1 and C2. Output for each input: 1 Class C1: output y = +1. 2 Class C2: output y = −1. 68 / 101
- 143. History Frank Rosenblatt The perceptron algorithm was invented in 1957 at the Cornell Aeronautical Laboratory by Frank Rosenblatt. Something Notable: Frank Rosenblatt was a psychologist working at a military R&D laboratory!!! He helped to develop the Mark I Perceptron - a new machine based on the connectivity of neural networks!!! Some problems with it The most important is the impossibility of using a perceptron with a single neuron to solve the XOR problem. 69 / 101
- 148. Perceptron: Local Field of a Neuron (Figure: signal-flow graph with inputs, synaptic weights, and bias b.) Induced local field of a neuron: v = Σ_{i=1}^{m} w_i x_i + b (31). 71 / 101
- 151. Perceptron: One Neuron Structure Based on the previous induced local field In the simplest form of the perceptron there are two decision regions separated by a hyperplane: Σ_{i=1}^{m} w_i x_i + b = 0 (32). (Figure: example with two signals, showing the decision boundary between classes C1 and C2.) 73 / 101
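The decision rule implied by the hyperplane (32) can be sketched in a few lines; the weight vector and bias below are hypothetical values chosen only for illustration:

```python
import numpy as np

def perceptron_output(w, b, x):
    """Hard-limiter neuron: +1 if the induced local field v = w.x + b > 0, else -1."""
    v = np.dot(w, x) + b          # induced local field, Eq. (31)
    return 1 if v > 0 else -1

# Hypothetical 2-D example: the hyperplane x1 + x2 - 1 = 0 separates the plane.
w, b = np.array([1.0, 1.0]), -1.0
```

Points on the positive side of the hyperplane are assigned to class C1 (output +1), points on the other side to class C2 (output −1).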
- 154. Deriving the Algorithm First, you put the signals together, absorbing the bias as a fixed input: x (n) = [1, x1 (n) , x2 (n) , ..., xm (n)]^T (33). Weights: v (n) = Σ_{i=0}^{m} w_i (n) x_i (n) = w^T (n) x (n) (34), where w0 (n) plays the role of the bias b. Note IMPORTANT - The perceptron works only if C1 and C2 are linearly separable. 75 / 101
- 157. Rule for Linearly Separable Classes There must exist a vector w such that: 1 w^T x > 0 for every input vector x belonging to class C1. 2 w^T x ≤ 0 for every input vector x belonging to class C2. What is the derivative dv (n) /dw? dv (n) /dw = x (n) (35). 76 / 101
- 160. Finally No correction is necessary: 1 w(n + 1) = w(n) if w^T (n) x(n) > 0 and x(n) belongs to class C1. 2 w(n + 1) = w(n) if w^T (n) x(n) ≤ 0 and x(n) belongs to class C2. Correction is necessary: 1 w(n + 1) = w(n) − η (n) x (n) if w^T (n) x(n) > 0 and x(n) belongs to class C2. 2 w(n + 1) = w(n) + η (n) x (n) if w^T (n) x(n) ≤ 0 and x(n) belongs to class C1. Here η (n) is a learning-rate parameter controlling the size of the adjustment. 77 / 101
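The error-correcting rules above can be collected into a training loop. A minimal sketch (the toy dataset and the epoch limit are assumptions for illustration):

```python
import numpy as np

def train_perceptron(X, y, eta=1.0, max_epochs=100):
    """Error-correcting perceptron rule.

    X: samples as rows (a constant 1 is prepended to absorb the bias),
    y: labels in {+1, -1} (+1 for class C1, -1 for class C2).
    """
    X = np.hstack([np.ones((len(X), 1)), X])   # x(n) = [1, x1, ..., xm]^T, Eq. (33)
    w = np.zeros(X.shape[1])
    for _ in range(max_epochs):
        errors = 0
        for x_n, y_n in zip(X, y):
            v = w @ x_n                        # induced local field, Eq. (34)
            # Correct only on mistakes: add x for a missed C1, subtract for a missed C2.
            if y_n == 1 and v <= 0:
                w += eta * x_n; errors += 1
            elif y_n == -1 and v > 0:
                w -= eta * x_n; errors += 1
        if errors == 0:                        # converged: no misclassified samples
            break
    return w

# Toy linearly separable data (hypothetical): an AND-like split of four points.
X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
y = np.array([-1, -1, -1, 1])
w = train_perceptron(X, y)
```

On linearly separable data such as this, the loop is guaranteed to exit early by the convergence theorem that follows; on non-separable data (e.g. XOR) it would cycle until max_epochs.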
- 165. A little bit on the Geometry For Example, w(n + 1) = w(n) − η (n) x (n) 78 / 101
- 167. Under Linear Separability - Convergence happens!!! If we assume linear separability for the classes C1 and C2 Rosenblatt - 1962: Let the subsets of training vectors C1 and C2 be linearly separable, and let the inputs presented to the perceptron originate from these two subsets. Then the perceptron converges after some n0 iterations, in the sense that w(n0) = w(n0 + 1) = w(n0 + 2) = ... (36), where w(n0) is a solution vector, for some n0 ≤ nmax. 80 / 101
- 172. Proof I - First a Lower Bound for ||w (n + 1)||^2 Initialization: w (0) = 0 (37). Now assume that for time n = 1, 2, 3, ... we have w^T (n) x (n) < 0 (38), with x(n) belonging to class C1, i.e., THE PERCEPTRON INCORRECTLY CLASSIFIES THE VECTORS x (1) , x (2) , ... Apply the correction formula (with η = 1): w (n + 1) = w (n) + x (n) (39). 82 / 101
- 177. Proof II Apply the correction iteratively (starting from w (0) = 0): w (n + 1) = x (1) + x (2) + ... + x (n) (40). We know that there is a solution w0 (linear separability), so define α = min_{x(n)∈C1} w0^T x (n) (41). Then, we have w0^T w (n + 1) = w0^T x (1) + w0^T x (2) + ... + w0^T x (n) (42). 83 / 101
- 183. Proof IV Thus, using α: w0^T w (n + 1) ≥ nα (46). By the Cauchy-Schwarz inequality, ||w0||^2 ||w (n + 1)||^2 ≥ (w0^T w (n + 1))^2 (47), where ||·|| is the Euclidean norm. Thus ||w0||^2 ||w (n + 1)||^2 ≥ n^2 α^2, so ||w (n + 1)||^2 ≥ n^2 α^2 / ||w0||^2. 85 / 101
- 186. Proof V - Now an Upper Bound for ||w (n + 1)||^2 Now rewrite equation 39 as w (k + 1) = w (k) + x (k) (48) for k = 1, 2, ..., n and x (k) ∈ C1. Squaring the Euclidean norm of both sides: ||w (k + 1)||^2 = ||w (k)||^2 + ||x (k)||^2 + 2w^T (k) x (k) (49). Now, using w^T (k) x (k) < 0: ||w (k + 1)||^2 ≤ ||w (k)||^2 + ||x (k)||^2, i.e., ||w (k + 1)||^2 − ||w (k)||^2 ≤ ||x (k)||^2. 86 / 101
- 189. Proof VI Use the telescopic sum: Σ_{k=0}^{n} (||w (k + 1)||^2 − ||w (k)||^2) ≤ Σ_{k=0}^{n} ||x (k)||^2 (50). Assume w (0) = 0 and x (0) = 0. Thus ||w (n + 1)||^2 ≤ Σ_{k=1}^{n} ||x (k)||^2. 87 / 101
- 192. Proof VII
Then, we can define a positive number:
$$\beta = \max_{x(k)\in C_1}\|x(k)\|^2 \quad (51)$$
Thus:
$$\|w(n+1)\|^2 \leq \sum_{k=1}^{n}\|x(k)\|^2 \leq n\beta$$
Thus, both bounds can be satisfied simultaneously only up to some $n_{max}$, where (using our sandwich):
$$\frac{n_{max}^2\,\alpha^2}{\|w_0\|^2} = n_{max}\,\beta \quad (52)$$
88 / 101
- 195. Proof VIII
Solving:
$$n_{max} = \frac{\beta\,\|w_0\|^2}{\alpha^2} \quad (53)$$
Thus, for $\eta(n) = 1$ for all $n$, $w(0) = 0$, and a solution vector $w_0$:
The rule for adapting the synaptic weights of the perceptron must terminate after at most $n_{max}$ steps.
In addition: because the solution vector $w_0$ is not unique, the value of $n_{max}$ is not unique either.
89 / 101
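The termination bound in equation (53) can be checked numerically. The sketch below (the dataset and the solution vector $w_0$ are made-up illustrations, not from the slides) runs the error-correcting perceptron with $\eta = 1$ and $w(0) = 0$; Class 2 samples are negated so that, as in the proof, a solution satisfies $w_0^T x > 0$ for every sample.

```python
import numpy as np

# Linearly separable toy data in the proof's normalized form: Class 2
# samples are negated, so the (made-up) solution w0 gives w0^T x > 0
# for every row.  The last component is the fixed bias input.
X = np.array([[ 2.0,  2.0,  1.0],   # Class 1
              [ 3.0,  1.0,  1.0],   # Class 1
              [ 0.0,  0.0, -1.0],   # negated Class 2
              [-1.0,  0.0, -1.0]])  # negated Class 2
w0 = np.array([1.0, 1.0, -3.0])     # a known separating vector

alpha = min(X @ w0)                  # alpha = min_x w0^T x   (used in eq. 46)
beta = max(np.sum(X**2, axis=1))     # beta  = max_x ||x||^2  (eq. 51)
n_max = beta * (w0 @ w0) / alpha**2  # eq. 53

# Error-correcting perceptron with eta = 1 and w(0) = 0, as in the theorem.
w, updates = np.zeros(3), 0
while True:
    mistakes = [x for x in X if w @ x <= 0]
    if not mistakes:
        break
    w = w + mistakes[0]              # w(n+1) = w(n) + x(n)
    updates += 1

print(updates, "<=", n_max)          # the correction count never exceeds n_max
```

On this toy set the loop stops after 4 corrections, far below the (loose) bound $n_{max} = 121$.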
- 198. Outline 1 Introduction History 2 Adapting Filtering Problem Deﬁnition Description of the Behavior of the System 3 Unconstrained Optimization Introduction Method of Steepest Descent Newton’s Method Gauss-Newton Method 4 Linear Least-Squares Filter Introduction Least-Mean-Square (LMS) Algorithm Convergence of the LMS 5 Perceptron Objective Perceptron: Local Field of a Neuron Perceptron: One Neuron Structure Deriving the Algorithm Under Linear Separability - Convergence happens!!! Proof Algorithm Using Error-Correcting Final Perceptron Algorithm (One Version) Other Algorithms for the Perceptron 90 / 101
- 199. Algorithm Using Error-Correcting
Now, if we use the cost
$$\frac{1}{2}e_k^2(n)$$
we can actually simplify the rules and the final algorithm!!!
Thus, we have the following delta value:
$$\Delta w(n) = \eta\left(d_j - y_j(n)\right)x(n) \quad (54)$$
91 / 101
- 201. Outline 1 Introduction History 2 Adapting Filtering Problem Deﬁnition Description of the Behavior of the System 3 Unconstrained Optimization Introduction Method of Steepest Descent Newton’s Method Gauss-Newton Method 4 Linear Least-Squares Filter Introduction Least-Mean-Square (LMS) Algorithm Convergence of the LMS 5 Perceptron Objective Perceptron: Local Field of a Neuron Perceptron: One Neuron Structure Deriving the Algorithm Under Linear Separability - Convergence happens!!! Proof Algorithm Using Error-Correcting Final Perceptron Algorithm (One Version) Other Algorithms for the Perceptron 92 / 101
- 202. We could use the previous Rule
In order to generate an algorithm. However, you need classes that are linearly separable!!!
Therefore, we can use a more general Gradient Descent Rule to obtain an algorithm for the best separating hyperplane!!!
93 / 101
- 204. Gradient Descent Algorithm
Algorithm - Off-line/Batch Learning
1. Set n = 0.
2. Set $d_j = +1$ if $x_j \in$ Class 1, $d_j = -1$ if $x_j \in$ Class 2, for all $j = 1, 2, \ldots, m$.
3. Initialize the weights, $w^T = (w_1(n), w_2(n), \ldots, w_d(n))$. Weights may be initialized to 0 or to a small random value.
4. Initialize dummy outputs, so you can enter the loop: $y^T = (y_1(n), y_2(n), \ldots, y_m(n))$.
5. Initialize the stopping error $\epsilon > 0$.
6. Initialize the learning rate $\eta$.
7. While $\frac{1}{m}\sum_{j=1}^{m}\left|d_j - y_j(n)\right| > \epsilon$:
   For each sample $(x_j, d_j)$, $j = 1, \ldots, m$:
   Calculate the output $y_j = \varphi\left(w^T(n) \cdot x_j\right)$.
   Update the weights $w_i(n+1) = w_i(n) + \eta\left(d_j - y_j(n)\right)x_{ij}$.
   Then n = n + 1.
94 / 101
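The batch loop above can be sketched as follows; the sign activation for $\varphi$ and the particular values of $\eta$, $\epsilon$, and the maximum epoch count are illustrative choices, not fixed by the slides.

```python
import numpy as np

def sign(v):
    """The activation phi: a +1/-1 threshold."""
    return 1.0 if v >= 0 else -1.0

def train_perceptron(X, d, eta=0.1, eps=1e-9, max_epochs=1000):
    w = np.zeros(X.shape[1])                   # step 3: zero-initialized weights
    for _ in range(max_epochs):
        y = np.array([sign(w @ x) for x in X])
        if np.mean(np.abs(d - y)) <= eps:      # step 7: stopping criterion
            break
        for xj, dj in zip(X, d):               # per-sample updates
            yj = sign(w @ xj)
            w = w + eta * (dj - yj) * xj       # w_i <- w_i + eta*(d_j - y_j)*x_ij
    return w

# Usage with augmented inputs (a trailing 1 plays the role of the bias).
X = np.array([[2.0, 2.0, 1.0], [3.0, 1.0, 1.0],
              [0.0, 0.0, 1.0], [1.0, 0.0, 1.0]])
d = np.array([1.0, 1.0, -1.0, -1.0])
w = train_perceptron(X, d)
print(all(sign(w @ x) == t for x, t in zip(X, d)))   # True: this data is separable
```

If the classes overlap, the loop simply runs out of epochs, which is exactly the tweaking problem raised on the next slide.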
- 206. Nevertheless
We have the following problem:
$$\epsilon > 0 \quad (55)$$
Thus... convergence to the best linear separation is a tweaking business!!!
95 / 101
- 208. Outline 1 Introduction History 2 Adapting Filtering Problem Deﬁnition Description of the Behavior of the System 3 Unconstrained Optimization Introduction Method of Steepest Descent Newton’s Method Gauss-Newton Method 4 Linear Least-Squares Filter Introduction Least-Mean-Square (LMS) Algorithm Convergence of the LMS 5 Perceptron Objective Perceptron: Local Field of a Neuron Perceptron: One Neuron Structure Deriving the Algorithm Under Linear Separability - Convergence happens!!! Proof Algorithm Using Error-Correcting Final Perceptron Algorithm (One Version) Other Algorithms for the Perceptron 96 / 101
- 209. However, if we limit our features!!!
The Winnow Algorithm!!! It converges even when the classes are not linearly separable.
Feature Vector: Boolean-valued features, $X = \{0, 1\}^d$.
Weight Vector $w$:
1. $w^T = (w_1, w_2, \ldots, w_d)$ with each $w_i \in \mathbb{R}$.
2. For all $i$, $w_i \geq 0$.
97 / 101
- 212. Classification Scheme
We use a specific threshold $\theta$:
1. $w^T x \geq \theta$ ⇒ positive classification (Class 1).
2. $w^T x < \theta$ ⇒ negative classification (Class 2).
Rule: We use two possible rules for training, with a learning rate $\alpha > 1$.
Rule 1: When misclassifying a positive training example $x \in$ Class 1, i.e. $w^T x < \theta$:
$$\forall x_i = 1: \; w_i \leftarrow \alpha w_i \quad (56)$$
98 / 101
- 215. Classification Scheme
Rule 2: When misclassifying a negative training example $x \in$ Class 2, i.e. $w^T x \geq \theta$:
$$\forall x_i = 1: \; w_i \leftarrow \frac{w_i}{\alpha} \quad (57)$$
Rule 3: If a sample is correctly classified, do nothing!!!
99 / 101
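Rules 1–3 can be sketched as follows. The choices $\theta = d/2$ and $\alpha = 2$ are common textbook defaults assumed here (the slides leave them open), and the toy disjunction dataset is made up for illustration.

```python
import numpy as np

def winnow(X, y, alpha=2.0, theta=None, epochs=100):
    """Winnow sketch: Boolean features x in {0,1}^d, non-negative weights,
    multiplicative updates.  theta = d/2 and alpha = 2 are assumed defaults."""
    d = X.shape[1]
    theta = d / 2 if theta is None else theta
    w = np.ones(d)                          # start with all weights at 1
    for _ in range(epochs):
        for x, label in zip(X, y):
            pred = 1 if w @ x >= theta else 0
            if pred == label:
                continue                    # Rule 3: correct -> do nothing
            if label == 1:
                w[x == 1] *= alpha          # Rule 1: promote active features
            else:
                w[x == 1] /= alpha          # Rule 2: demote active features
    return w

# Learn the disjunction "feature 0 OR feature 2" over 5 Boolean features.
X = np.array([[1, 0, 0, 0, 0],
              [0, 0, 1, 0, 0],
              [0, 1, 0, 1, 1],
              [0, 0, 0, 0, 1],
              [1, 0, 1, 1, 0]])
y = np.array([1, 1, 0, 0, 1])
w = winnow(X, y)
print([1 if w @ x >= 2.5 else 0 for x in X])   # [1, 1, 0, 0, 1], matching y
```

Note how the weights of the irrelevant features (1, 3, 4) are driven down multiplicatively, which is why Winnow copes well with many irrelevant variables.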
- 217. Properties of Winnow
Property: If there are many irrelevant variables, Winnow is better than the Perceptron.
Drawback: It is sensitive to the learning rate $\alpha$.
100 / 101
- 219. The Pocket Algorithm
A variant of the Perceptron Algorithm. It was suggested by Gallant in “Perceptron-based learning algorithms,” IEEE Transactions on Neural Networks, Vol. 1(2), pp. 179-191, 1990. It converges to an optimal solution even if linear separability is not fulfilled.
It consists of the following steps:
1. Initialize the weight vector $w(0)$ in a random way.
2. Define a storage (pocket) vector $w_s$, and set its history counter $h_s$ to zero.
3. At the $n$th iteration step, compute the update $w(n+1)$ using the Perceptron rule.
4. Use the updated weights to find the number $h$ of samples correctly classified.
5. If at any moment $h > h_s$, replace $w_s$ with $w(n+1)$ and $h_s$ with $h$.
101 / 101
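Steps 1–5 can be sketched as follows; the learning rate, iteration count, random seed, and the cyclic sample order are illustrative assumptions, not prescribed by the slides.

```python
import numpy as np

def pocket(X, d, eta=1.0, iters=500, seed=0):
    """Pocket sketch: run the ordinary perceptron rule, but keep in a
    'pocket' the weight vector that classified the most samples so far."""
    rng = np.random.default_rng(seed)
    w = rng.normal(size=X.shape[1])           # step 1: random initialization
    w_s, h_s = w.copy(), 0                    # step 2: pocket vector, counter = 0
    for n in range(iters):
        j = n % len(X)                        # cycle through the samples
        y = 1.0 if w @ X[j] >= 0 else -1.0
        w = w + eta * (d[j] - y) * X[j] / 2   # step 3: perceptron rule update
        h = int(np.sum(np.sign(X @ w) == d))  # step 4: count correct samples
        if h > h_s:                           # step 5: pocket the better vector
            w_s, h_s = w.copy(), h
    return w_s, h_s

# Non-separable toy data: the last point is deliberately mislabeled, so no
# hyperplane classifies all four points correctly.
X = np.array([[2.0, 2.0, 1.0], [3.0, 1.0, 1.0],
              [0.0, 0.0, 1.0], [2.5, 1.5, 1.0]])
d = np.array([1.0, 1.0, -1.0, -1.0])
w_s, h_s = pocket(X, d)
print(h_s)   # the best number of correctly classified samples found
```

The plain perceptron would oscillate forever on this data; the pocket vector $w_s$ retains the best hyperplane visited along the way.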
