Neural Networks
1.
Regression and Classification with Neural Networks

Andrew W. Moore
Professor
School of Computer Science
Carnegie Mellon University
www.cs.cmu.edu/~awm
awm@cs.cmu.edu
412-268-7599

Note to other teachers and users of these slides. Andrew would be delighted if you found this source material useful in giving your own lectures. Feel free to use these slides verbatim, or to modify them to fit your own needs. PowerPoint originals are available. If you make use of a significant portion of these slides in your own lecture, please include this message, or the following link to the source repository of Andrew's tutorials: http://www.cs.cmu.edu/~awm/tutorials . Comments and corrections gratefully received.

Copyright © 2001, 2003, Andrew W. Moore
Sep 25th, 2001

Linear Regression

DATASET
inputs         outputs
$x_1 = 1$      $y_1 = 1$
$x_2 = 3$      $y_2 = 2.2$
$x_3 = 2$      $y_3 = 2$
$x_4 = 1.5$    $y_4 = 1.9$
$x_5 = 4$      $y_5 = 3.1$

[Figure: the datapoints with a fitted line through the origin; w is the rise of the line over a run of 1.]

Linear regression assumes that the expected value of the output given an input, $E[y \mid x]$, is linear.
Simplest case: $\text{Out}(x) = wx$ for some unknown $w$.
Given the data, we can estimate $w$.

Neural Networks: Slide 2
2.
1-parameter linear regression

Assume that the data is formed by
$y_i = w x_i + \text{noise}_i$
where…
• the noise signals are independent
• the noise has a normal distribution with mean 0 and unknown variance $\sigma^2$

$P(y \mid w, x)$ has a normal distribution with
• mean $wx$
• variance $\sigma^2$

Neural Networks: Slide 3

Bayesian Linear Regression

$P(y \mid w, x) = \text{Normal}(\text{mean } wx, \text{ var } \sigma^2)$

We have a set of datapoints $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$ which are EVIDENCE about $w$.
We want to infer $w$ from the data:
$P(w \mid x_1, x_2, \ldots, x_n, y_1, y_2, \ldots, y_n)$
• You can use BAYES rule to work out a posterior distribution for $w$ given the data.
• Or you could do Maximum Likelihood Estimation.

Neural Networks: Slide 4
3.
Maximum likelihood estimation of w

Asks the question: "For which value of w is this data most likely to have happened?"
<=> For what $w$ is $P(y_1, y_2, \ldots, y_n \mid x_1, x_2, \ldots, x_n, w)$ maximized?
<=> For what $w$ is $\prod_{i=1}^{n} P(y_i \mid w, x_i)$ maximized?

Neural Networks: Slide 5

For what $w$ is $\prod_{i=1}^{n} P(y_i \mid w, x_i)$ maximized?
For what $w$ is $\prod_{i=1}^{n} \exp\left(-\frac{1}{2}\left(\frac{y_i - w x_i}{\sigma}\right)^2\right)$ maximized?
For what $w$ is $\sum_{i=1}^{n} -\frac{1}{2}\left(\frac{y_i - w x_i}{\sigma}\right)^2$ maximized?
For what $w$ is $\sum_{i=1}^{n} (y_i - w x_i)^2$ minimized?

Neural Networks: Slide 6
4.
Linear Regression

The maximum likelihood w is the one that minimizes the sum-of-squares of residuals:

$$E = \sum_i (y_i - w x_i)^2 = \sum_i y_i^2 - \left(2 \sum_i x_i y_i\right) w + \left(\sum_i x_i^2\right) w^2$$

We want to minimize a quadratic function of w.

[Figure: plot of E(w) against w.]

Neural Networks: Slide 7

Linear Regression

Easy to show the sum of squares is minimized when

$$w = \frac{\sum_i x_i y_i}{\sum_i x_i^2}$$

The maximum likelihood model is $\text{Out}(x) = wx$.
We can use it for prediction.

Neural Networks: Slide 8
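As a small illustration of the closed form on slide 8, the following sketch applies $w = \sum x_i y_i / \sum x_i^2$ to the five-point dataset from slide 2. The function names are made up for this example; only the data and the formula come from the slides.

```python
# Dataset from the Linear Regression slide (inputs x_i, outputs y_i).
xs = [1.0, 3.0, 2.0, 1.5, 4.0]
ys = [1.0, 2.2, 2.0, 1.9, 3.1]


def fit_slope(xs, ys):
    """Maximum likelihood slope for Out(x) = w*x: sum(x*y) / sum(x^2)."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / sxx


def sum_squared_residuals(w, xs, ys):
    """E(w) = sum_i (y_i - w*x_i)^2."""
    return sum((y - w * x) ** 2 for x, y in zip(xs, ys))


w = fit_slope(xs, ys)
print(f"w = {w:.4f}")

# Nudging w in either direction should not lower E(w).
e_best = sum_squared_residuals(w, xs, ys)
print(e_best <= sum_squared_residuals(w + 0.01, xs, ys))
print(e_best <= sum_squared_residuals(w - 0.01, xs, ys))
```

The two comparisons at the end are only a sanity check that the returned w sits at the bottom of the quadratic E(w).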
5.
Linear Regression (continued)

Easy to show the sum of squares is minimized when

$$w = \frac{\sum_i x_i y_i}{\sum_i x_i^2}$$

The maximum likelihood model is $\text{Out}(x) = wx$. We can use it for prediction.

Note: In Bayesian stats you'd have ended up with a prob dist of w, and predictions would have given a prob dist of expected output. Often useful to know your confidence. Max likelihood can give some kinds of confidence too.

[Figure: a probability distribution p(w) plotted against w.]

Neural Networks: Slide 9

Multivariate Regression

What if the inputs are vectors?

[Figure: 2-d input example. Datapoints plotted in the ($x_1$, $x_2$) plane, each labeled with its output value.]

Dataset has form
$\mathbf{x}_1$    $y_1$
$\mathbf{x}_2$    $y_2$
$\mathbf{x}_3$    $y_3$
:       :
$\mathbf{x}_R$    $y_R$

Neural Networks: Slide 10
6.
Multivariate Regression

Write matrix X and Y thus:

$$X = \begin{pmatrix} \cdots \mathbf{x}_1 \cdots \\ \cdots \mathbf{x}_2 \cdots \\ \vdots \\ \cdots \mathbf{x}_R \cdots \end{pmatrix} = \begin{pmatrix} x_{11} & x_{12} & \cdots & x_{1m} \\ x_{21} & x_{22} & \cdots & x_{2m} \\ \vdots & & & \vdots \\ x_{R1} & x_{R2} & \cdots & x_{Rm} \end{pmatrix} \qquad Y = \begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_R \end{pmatrix}$$

(there are R datapoints. Each input has m components)

The linear regression model assumes a vector $\mathbf{w}$ such that
$\text{Out}(\mathbf{x}) = \mathbf{w}^T \mathbf{x} = w_1 x[1] + w_2 x[2] + \ldots + w_m x[m]$

The max. likelihood $\mathbf{w}$ is $\mathbf{w} = (X^T X)^{-1} (X^T Y)$

IMPORTANT EXERCISE: PROVE IT !!!!!

Neural Networks: Slides 11-12
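One possible route through the exercise, sketched under the slides' Gaussian-noise assumption (so that maximum likelihood coincides with least squares) and the added assumption that $X^T X$ is invertible:

```latex
\begin{aligned}
E(\mathbf{w}) &= \sum_{k=1}^{R} \left(y_k - \mathbf{w}^T \mathbf{x}_k\right)^2
              = (Y - X\mathbf{w})^T (Y - X\mathbf{w}) \\
\nabla_{\mathbf{w}} E &= -2\, X^T (Y - X\mathbf{w}) = 0 \\
X^T X\, \mathbf{w} &= X^T Y \\
\mathbf{w} &= (X^T X)^{-1} X^T Y
\end{aligned}
```

Since $E$ is a convex quadratic in $\mathbf{w}$, the stationary point found this way is a minimum.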
7.
Multivariate Regression (con't)

The max. likelihood $\mathbf{w}$ is $\mathbf{w} = (X^T X)^{-1} (X^T Y)$

$X^T X$ is an m x m matrix: i,j'th elt is $\sum_{k=1}^{R} x_{ki} x_{kj}$

$X^T Y$ is an m-element vector: i'th elt is $\sum_{k=1}^{R} x_{ki} y_k$

Neural Networks: Slide 13

What about a constant term?

We may expect linear data that does not go through the origin.
Statisticians and Neural Net Folks all agree on a simple obvious hack.
Can you guess??

Neural Networks: Slide 14
8.
The constant term

• The trick is to create a fake input "$X_0$" that always takes the value 1

Before:
X1   X2   Y
2    4    16
3    4    17
5    5    20

$Y = w_1 X_1 + w_2 X_2$
…has to be a poor model

After:
X0   X1   X2   Y
1    2    4    16
1    3    4    17
1    5    5    20

$Y = w_0 X_0 + w_1 X_1 + w_2 X_2 = w_0 + w_1 X_1 + w_2 X_2$
…has a fine constant term

In this example, you should be able to see the MLE $w_0$, $w_1$ and $w_2$ by inspection.

Neural Networks: Slide 15

Regression with varying noise

• Suppose you know the variance of the noise that was added to each datapoint.

$x_i$   $y_i$   $\sigma_i^2$
½     ½     4
1     1     1
2     1     1/4
2     3     4
3     2     1/4

[Figure: the five datapoints plotted for x from 0 to 3 and y from 0 to 3, each labeled with its σ (2, 1, 1/2, 2, 1/2).]

Assume $y_i \sim N(w x_i, \sigma_i^2)$

What's the MLE estimate of w?

Neural Networks: Slide 16
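The constant-term trick from slide 15 can be tried on that slide's three-row table together with the normal equations $(X^T X)\mathbf{w} = X^T Y$ from slide 13. This is a minimal sketch: the helper names and the hand-rolled Gaussian elimination are choices made for this example, not something the slides prescribe.

```python
def solve_linear_system(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [list(row) + [rhs] for row, rhs in zip(A, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):  # back substitution
        tail = sum(M[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (M[r][n] - tail) / M[r][r]
    return x


def least_squares(X, Y):
    """Build X^T X and X^T Y element by element, then solve for w."""
    m = len(X[0])
    XtX = [[sum(row[i] * row[j] for row in X) for j in range(m)] for i in range(m)]
    XtY = [sum(row[i] * y for row, y in zip(X, Y)) for i in range(m)]
    return solve_linear_system(XtX, XtY)


# Table from the constant-term slide: columns X1, X2 and output Y.
inputs = [[2.0, 4.0], [3.0, 4.0], [5.0, 5.0]]
Y = [16.0, 17.0, 20.0]

# The trick: prepend a fake input X0 that is always 1.
X = [[1.0] + row for row in inputs]

w0, w1, w2 = least_squares(X, Y)
print(f"w0 = {w0:.6f}, w1 = {w1:.6f}, w2 = {w2:.6f}")
```

Up to rounding, the fit should match what inspection of the table gives: Y = 10 + X1 + X2 reproduces all three rows exactly.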
9.
MLE estimation with varying noise

$$\underset{w}{\operatorname{argmax}} \; \log p(y_1, y_2, \ldots, y_R \mid x_1, x_2, \ldots, x_R, \sigma_1^2, \sigma_2^2, \ldots, \sigma_R^2, w)$$

$$= \underset{w}{\operatorname{argmin}} \sum_{i=1}^{R} \frac{(y_i - w x_i)^2}{\sigma_i^2}$$
(Assuming independent noise and then plugging in the equation for the Gaussian and simplifying.)

$$= w \text{ such that } \sum_{i=1}^{R} \frac{x_i (y_i - w x_i)}{\sigma_i^2} = 0$$
(Setting dLL/dw equal to zero.)

$$= \frac{\sum_{i=1}^{R} x_i y_i / \sigma_i^2}{\sum_{i=1}^{R} x_i^2 / \sigma_i^2}$$
(Trivial algebra.)

Neural Networks: Slide 17

This is Weighted Regression

• We are asking to minimize the weighted sum of squares

$$\underset{w}{\operatorname{argmin}} \sum_{i=1}^{R} \frac{(y_i - w x_i)^2}{\sigma_i^2}$$

where the weight for the i'th datapoint is $\frac{1}{\sigma_i^2}$

[Figure: the same five datapoints as on slide 16, each labeled with its σ.]

Neural Networks: Slide 18
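The weighted closed form on slide 17 can be evaluated on the five datapoints of slide 16. The rows below are the values as read from that slide's flattened table, so treat them as a best-effort transcription; the function names are invented for this sketch.

```python
# (x_i, y_i, sigma_i^2) rows as read from the varying-noise slide's table.
data = [
    (0.5, 0.5, 4.0),
    (1.0, 1.0, 1.0),
    (2.0, 1.0, 0.25),
    (2.0, 3.0, 4.0),
    (3.0, 2.0, 0.25),
]


def weighted_slope(data):
    """w = sum(x*y/var) / sum(x^2/var): each point is weighted by 1/sigma_i^2."""
    num = sum(x * y / var for x, y, var in data)
    den = sum(x * x / var for x, y, var in data)
    return num / den


def unweighted_slope(data):
    """Ordinary least squares slope, ignoring the known noise levels."""
    num = sum(x * y for x, y, _ in data)
    den = sum(x * x for x, y, _ in data)
    return num / den


w_weighted = weighted_slope(data)
w_plain = unweighted_slope(data)
print(f"weighted w   = {w_weighted:.4f}")
print(f"unweighted w = {w_plain:.4f}")

# The weighted estimate should zero the derivative condition from the slide.
score = sum(x * (y - w_weighted * x) / var for x, y, var in data)
print(f"sum x_i (y_i - w x_i) / sigma_i^2 = {score:.2e}")
```

The low-noise points (σ = 1/2) dominate the weighted estimate, which is why it differs from the plain least-squares slope.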
10.
Weighted Multivariate Regression

The max. likelihood $\mathbf{w}$ is $\mathbf{w} = \left((WX)^T (WX)\right)^{-1} \left((WX)^T (WY)\right)$

$(WX)^T (WX)$ is an m x m matrix: i,j'th elt is $\sum_{k=1}^{R} \frac{x_{ki} x_{kj}}{\sigma_k^2}$

$(WX)^T (WY)$ is an m-element vector: i'th elt is $\sum_{k=1}^{R} \frac{x_{ki} y_k}{\sigma_k^2}$

(The slide does not define W. The element formulas are consistent with WX and WY being X and Y with row k scaled by $1/\sigma_k$.)

Neural Networks: Slide 19

Non-linear Regression

• Suppose you know that y is related to a function of x in such a way that the predicted values have a non-linear dependence on w, e.g.:

$x_i$   $y_i$
½     ½
1     2.5
2     3
3     2
3     3

[Figure: the five datapoints plotted for x from 0 to 3 and y from 0 to 3.]

Assume $y_i \sim N\left(\sqrt{w + x_i}, \sigma^2\right)$

What's the MLE estimate of w?

Neural Networks: Slide 20
11.
Non-linear MLE estimation

$$\underset{w}{\operatorname{argmax}} \; \log p(y_1, y_2, \ldots, y_R \mid x_1, x_2, \ldots, x_R, \sigma, w)$$

$$= \underset{w}{\operatorname{argmin}} \sum_{i=1}^{R} \left(y_i - \sqrt{w + x_i}\right)^2$$
(Assuming i.i.d. and then plugging in the equation for the Gaussian and simplifying.)

$$= w \text{ such that } \sum_{i=1}^{R} \frac{y_i - \sqrt{w + x_i}}{\sqrt{w + x_i}} = 0$$
(Setting dLL/dw equal to zero.)

We're down the algebraic toilet.
So guess what we do?

Neural Networks: Slides 21-22
12.
Non-linear MLE estimation (continued)

The condition $\sum_{i=1}^{R} \frac{y_i - \sqrt{w + x_i}}{\sqrt{w + x_i}} = 0$ has no convenient closed-form solution for w.

Common (but not only) approach: Numerical Solutions:
• Line Search
• Simulated Annealing
• Gradient Descent
• Conjugate Gradient
• Levenberg-Marquardt
• Newton's Method

Also, special purpose statistical-optimization-specific tricks such as E.M. (See Gaussian Mixtures lecture for introduction)

Neural Networks: Slide 23

GRADIENT DESCENT

Suppose we have a scalar function $f(w): \Re \to \Re$
We want to find a local minimum.
Assume our current weight is w.

GRADIENT DESCENT RULE: $w \leftarrow w - \eta \frac{\partial}{\partial w} f(w)$

η is called the LEARNING RATE. A small positive number, e.g. η = 0.05

Neural Networks: Slide 24
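To make the gradient descent rule concrete, the sketch below applies it to the non-linear problem from slides 20-23, minimizing $\sum_i (y_i - \sqrt{w + x_i})^2$. The datapoints are the values as read from slide 20's flattened table; the starting point w = 1, the step count, and the function names are assumptions made for this example.

```python
import math

# (x_i, y_i) pairs as read from the non-linear regression slide's table.
xs = [0.5, 1.0, 2.0, 3.0, 3.0]
ys = [0.5, 2.5, 3.0, 2.0, 3.0]


def loss(w):
    """E(w) = sum_i (y_i - sqrt(w + x_i))^2."""
    return sum((y - math.sqrt(w + x)) ** 2 for x, y in zip(xs, ys))


def dloss_dw(w):
    """dE/dw = -sum_i (y_i - sqrt(w + x_i)) / sqrt(w + x_i)."""
    return -sum((y - math.sqrt(w + x)) / math.sqrt(w + x) for x, y in zip(xs, ys))


def gradient_descent(w, eta=0.05, steps=2000):
    """Repeat the rule w <- w - eta * dE/dw for a fixed number of steps."""
    for _ in range(steps):
        w = w - eta * dloss_dw(w)
    return w


w_hat = gradient_descent(w=1.0)
print(f"w = {w_hat:.4f}, E(w) = {loss(w_hat):.4f}, dE/dw = {dloss_dw(w_hat):.2e}")
```

A fixed step count keeps the sketch short; in practice one would stop when the gradient or the improvement in E becomes small.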
13.
GRADIENT DESCENT (continued)

GRADIENT DESCENT RULE: $w \leftarrow w - \eta \frac{\partial}{\partial w} f(w)$

η is called the LEARNING RATE. A small positive number, e.g. η = 0.05
(Recall Andrew's favorite default value for anything.)

QUESTION: Justify the Gradient Descent Rule

Neural Networks: Slide 25

Gradient Descent in "m" Dimensions

Given $f(\mathbf{w}): \Re^m \to \Re$

$$\nabla f(\mathbf{w}) = \begin{pmatrix} \frac{\partial}{\partial w_1} f(\mathbf{w}) \\ \vdots \\ \frac{\partial}{\partial w_m} f(\mathbf{w}) \end{pmatrix}$$
points in direction of steepest ascent.

$|\nabla f(\mathbf{w})|$ is the gradient in that direction.

GRADIENT DESCENT RULE: $\mathbf{w} \leftarrow \mathbf{w} - \eta \nabla f(\mathbf{w})$

Equivalently
$w_j \leftarrow w_j - \eta \frac{\partial}{\partial w_j} f(\mathbf{w})$ ….where $w_j$ is the jth weight

"just like a linear feedback system"

Neural Networks: Slide 26
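One standard way to answer the QUESTION on slide 25 is a first-order Taylor argument. This is only a sketch, and it assumes f is differentiable and η is small enough for the higher-order term to be negligible:

```latex
f\big(\mathbf{w} - \eta \nabla f(\mathbf{w})\big)
  = f(\mathbf{w}) - \eta \,\lVert \nabla f(\mathbf{w}) \rVert^2 + O(\eta^2)
```

For a sufficiently small positive η the right-hand side is below $f(\mathbf{w})$ whenever the gradient is non-zero, so each step moves downhill. The scalar rule on slide 24 is the case m = 1.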
14.
What's all this got to do with Neural Nets, then, eh??

For supervised learning, neural nets are also models with vectors of w parameters in them. They are now called weights.
As before, we want to compute the weights to minimize sum-of-squared residuals.
Which turns out, under "Gaussian i.i.d noise" assumption to be max. likelihood.
Instead of explicitly solving for max. likelihood weights, we use GRADIENT DESCENT to SEARCH for them.

"Why?" you ask, a querulous expression in your eyes.
"Aha!!" I reply: "We'll see later."

Neural Networks: Slide 27

Linear Perceptrons

They are multivariate linear models:
$\text{Out}(\mathbf{x}) = \mathbf{w}^T \mathbf{x}$

And "training" consists of minimizing sum-of-squared residuals by gradient descent.

$$E = \sum_k \left(\text{Out}(\mathbf{x}_k) - y_k\right)^2 = \sum_k \left(\mathbf{w}^T \mathbf{x}_k - y_k\right)^2$$

QUESTION: Derive the perceptron training rule.

Neural Networks: Slide 28
15.
Linear Perceptron Training Rule

$$E = \sum_{k=1}^{R} (y_k - \mathbf{w}^T \mathbf{x}_k)^2$$

Gradient descent tells us we should update w thusly if we wish to minimize E:
$w_j \leftarrow w_j - \eta \frac{\partial E}{\partial w_j}$

So what's $\frac{\partial E}{\partial w_j}$?

$$\begin{aligned}
\frac{\partial E}{\partial w_j} &= \frac{\partial}{\partial w_j} \sum_{k=1}^{R} (y_k - \mathbf{w}^T \mathbf{x}_k)^2 \\
&= \sum_{k=1}^{R} 2 (y_k - \mathbf{w}^T \mathbf{x}_k) \frac{\partial}{\partial w_j} (y_k - \mathbf{w}^T \mathbf{x}_k) \\
&= -2 \sum_{k=1}^{R} \delta_k \frac{\partial}{\partial w_j} \mathbf{w}^T \mathbf{x}_k \qquad \text{…where… } \delta_k = y_k - \mathbf{w}^T \mathbf{x}_k \\
&= -2 \sum_{k=1}^{R} \delta_k \frac{\partial}{\partial w_j} \sum_{i=1}^{m} w_i x_{ki} \\
&= -2 \sum_{k=1}^{R} \delta_k x_{kj}
\end{aligned}$$

Neural Networks: Slides 29-30
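A quick way to gain confidence in the derived gradient $\partial E / \partial w_j = -2 \sum_k \delta_k x_{kj}$ is to compare it with a finite-difference estimate. The sketch below does that on a tiny made-up dataset; the data, the weights, and all function names are hypothetical and chosen only to exercise the formula.

```python
def sum_squared_error(w, X, Y):
    """E = sum_k (y_k - w^T x_k)^2."""
    return sum((y - sum(wj * xj for wj, xj in zip(w, x))) ** 2 for x, y in zip(X, Y))


def analytic_gradient(w, X, Y):
    """dE/dw_j = -2 * sum_k delta_k * x_kj, with delta_k = y_k - w^T x_k."""
    deltas = [y - sum(wj * xj for wj, xj in zip(w, x)) for x, y in zip(X, Y)]
    return [-2.0 * sum(d * x[j] for d, x in zip(deltas, X)) for j in range(len(w))]


def numeric_gradient(w, X, Y, h=1e-6):
    """Central-difference estimate of dE/dw_j, one coordinate at a time."""
    grad = []
    for j in range(len(w)):
        up, down = list(w), list(w)
        up[j] += h
        down[j] -= h
        grad.append((sum_squared_error(up, X, Y) - sum_squared_error(down, X, Y)) / (2 * h))
    return grad


# Hypothetical data: three datapoints, two inputs each (the first is a constant 1).
X = [[1.0, 2.0], [1.0, -1.0], [1.0, 0.5]]
Y = [3.0, 0.0, 1.0]
w = [0.2, -0.4]

g_exact = analytic_gradient(w, X, Y)
g_approx = numeric_gradient(w, X, Y)
print(g_exact)
print(g_approx)
```

The two printed vectors should agree to several decimal places, differing only by floating-point rounding in the finite-difference estimate.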
16.
Linear Perceptron Training Rule

$$E = \sum_{k=1}^{R} (y_k - \mathbf{w}^T \mathbf{x}_k)^2$$

Gradient descent tells us we should update w thusly if we wish to minimize E:
$w_j \leftarrow w_j - \eta \frac{\partial E}{\partial w_j}$

…where…
$\frac{\partial E}{\partial w_j} = -2 \sum_{k=1}^{R} \delta_k x_{kj}$

so
$w_j \leftarrow w_j + 2\eta \sum_{k=1}^{R} \delta_k x_{kj}$

We frequently neglect the 2 (meaning we halve the learning rate)

Neural Networks: Slide 31

The "Batch" perceptron algorithm

1) Randomly initialize weights $w_1, w_2, \ldots, w_m$
2) Get your dataset (append 1's to the inputs if you don't want to go through the origin).
3) for i = 1 to R
      $\delta_i := y_i - \mathbf{w}^T \mathbf{x}_i$
4) for j = 1 to m
      $w_j \leftarrow w_j + \eta \sum_{i=1}^{R} \delta_i x_{ij}$
5) if $\sum_i \delta_i^2$ stops improving then stop. Else loop back to 3.

Neural Networks: Slide 32
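The five steps of the batch algorithm on slide 32 can be sketched as follows, here run on the slide-2 dataset with a constant input appended. The learning rate, the stopping tolerance, the iteration cap, and the fixed random seed are all assumptions made for this example rather than values from the slides.

```python
import random


def batch_perceptron(X, Y, eta=0.02, tol=1e-12, max_iters=100_000, seed=0):
    """Batch gradient descent for a linear perceptron, following steps 1-5."""
    rng = random.Random(seed)
    m = len(X[0])
    w = [rng.uniform(-0.1, 0.1) for _ in range(m)]  # step 1: random init
    prev_sse = float("inf")
    for _ in range(max_iters):
        # Step 3: residual delta_i = y_i - w^T x_i for every datapoint.
        deltas = [y - sum(wj * xj for wj, xj in zip(w, x)) for x, y in zip(X, Y)]
        # Step 5: stop once the sum of squared residuals stops improving.
        sse = sum(d * d for d in deltas)
        if prev_sse - sse < tol:
            break
        prev_sse = sse
        # Step 4: w_j <- w_j + eta * sum_i delta_i * x_ij.
        w = [wj + eta * sum(d * x[j] for d, x in zip(deltas, X)) for j, wj in enumerate(w)]
    return w


# Step 2: slide-2 dataset, with a constant 1 prepended to each input.
xs = [1.0, 3.0, 2.0, 1.5, 4.0]
ys = [1.0, 2.2, 2.0, 1.9, 3.1]
X = [[1.0, x] for x in xs]

w0, w1 = batch_perceptron(X, ys)
print(f"intercept w0 = {w0:.4f}, slope w1 = {w1:.4f}")
```

With too large a learning rate the same loop diverges, which is one face of the "hard to choose a good learning rate" complaint made later in the slides.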
17.
A RULE KNOWN BY MANY NAMES

$\delta_i \leftarrow y_i - \mathbf{w}^T \mathbf{x}_i$
$w_j \leftarrow w_j + \eta \delta_i x_{ij}$

• The LMS Rule
• The Widrow-Hoff rule
• The delta rule
• The adaline rule
• Classical conditioning

Neural Networks: Slide 33

If data is voluminous and arrives fast

Input-output pairs (x,y) come streaming in very quickly. THEN
Don't bother remembering old ones.
Just keep using new ones.

observe $(\mathbf{x}, y)$
$\delta \leftarrow y - \mathbf{w}^T \mathbf{x}$
$\forall j \quad w_j \leftarrow w_j + \eta \delta x_j$

Neural Networks: Slide 34
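The streaming update on slide 34 needs no stored dataset: each observed pair adjusts the weights once and is then discarded. The sketch below feeds it a synthetic stream; the generator (noiseless y = 2·x1 − x2), the learning rate, and the sample count are hypothetical choices for illustration.

```python
import random


def lms_update(w, x, y, eta):
    """One delta-rule step: delta = y - w^T x, then w_j <- w_j + eta * delta * x_j."""
    delta = y - sum(wj * xj for wj, xj in zip(w, x))
    return [wj + eta * delta * xj for wj, xj in zip(w, x)]


def stream(n, seed=0):
    """Hypothetical noiseless data source with true weights (2, -1)."""
    rng = random.Random(seed)
    for _ in range(n):
        x = [rng.uniform(-1.0, 1.0), rng.uniform(-1.0, 1.0)]
        yield x, 2.0 * x[0] - 1.0 * x[1]


w = [0.0, 0.0]
for x, y in stream(5000):
    w = lms_update(w, x, y, eta=0.05)  # each pair is used once, then forgotten

print([round(wj, 4) for wj in w])
```

Because the stream here is noiseless, the weights settle on the generating values; with noisy data they would keep jittering around them at a scale set by η.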
18.
Gradient Descent vs Matrix Inversion for Linear Perceptrons

GD Advantages (MI disadvantages):
• Biologically plausible
• With very very many attributes each iteration costs only O(mR). If fewer than m iterations needed we've beaten Matrix Inversion
• More easily parallelizable (or implementable in wetware)?

GD Disadvantages (MI advantages):
• It's moronic
• It's essentially a slow implementation of a way to build the $X^T X$ matrix and then solve a set of linear equations
• If m is small it's especially outrageous. If m is large then the direct matrix inversion method gets fiddly but not impossible if you want to be efficient.
• Hard to choose a good learning rate
• Matrix inversion takes predictable time. You can't be sure when gradient descent will stop.

Neural Networks: Slides 35-36
19.
Gradient Descent vs Matrix Inversion for Linear Perceptrons (continued)

But we'll soon see that GD has an important extra trick up its sleeve.

Neural Networks: Slide 37

Perceptrons for Classification

What if all outputs are 0's or 1's ?

[Figure: two example datasets with 0/1 outputs.]

We can do a linear fit.
Our prediction is
0 if out(x) ≤ ½
1 if out(x) > ½

WHAT'S THE BIG PROBLEM WITH THIS???

Neural Networks: Slide 38
20.
Perceptrons for Classification (continued)

[Figure: the same 0/1 data with two curves. Blue = Out(x), the linear fit. Green = Classification, which is 0 if out(x) ≤ ½ and 1 if out(x) > ½.]

WHAT'S THE BIG PROBLEM WITH THIS???

Neural Networks: Slides 39-40
21.
Classification with Perceptrons I

Don't minimize $\sum_i (y_i - \mathbf{w}^T \mathbf{x}_i)^2$.
Minimize number of misclassifications instead. [Assume outputs are +1 & -1, not +1 & 0]

$\sum_i \left(y_i - \text{Round}(\mathbf{w}^T \mathbf{x}_i)\right)$

where Round(x) = -1 if x < 0
                  1 if x ≥ 0

The gradient descent rule can be changed to:
if $(\mathbf{x}_i, y_i)$ correctly classed, don't change
if wrongly predicted as 1: $\mathbf{w} \leftarrow \mathbf{w} - \mathbf{x}_i$
if wrongly predicted as -1: $\mathbf{w} \leftarrow \mathbf{w} + \mathbf{x}_i$

NOTE: CUTE & NON OBVIOUS WHY THIS WORKS!!

Neural Networks: Slide 41

Classification with Perceptrons II: Sigmoid Functions

[Figure: 0/1 data with two fitted lines.]
Least squares fit useless.
This fit would classify much better. But not a least squares fit.

Neural Networks: Slide 42
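The mistake-driven rule on slide 41 can be sketched as below, trained on the Boolean AND function with outputs recoded as +1 and -1 and a constant input prepended. The choice of AND as the example, the zero initial weights, and the epoch cap are assumptions for this illustration.

```python
def round_pm1(value):
    """Round(x) from the slide: -1 if x < 0, otherwise 1."""
    return -1 if value < 0 else 1


def train_perceptron(X, Y, max_epochs=100):
    """Add or subtract misclassified inputs until an epoch has no mistakes."""
    w = [0.0] * len(X[0])
    for _ in range(max_epochs):
        mistakes = 0
        for x, y in zip(X, Y):
            predicted = round_pm1(sum(wj * xj for wj, xj in zip(w, x)))
            if predicted == y:
                continue  # correctly classed: don't change w
            mistakes += 1
            # Wrongly predicted as 1: w <- w - x. Wrongly predicted as -1: w <- w + x.
            sign = -1.0 if predicted == 1 else 1.0
            w = [wj + sign * xj for wj, xj in zip(w, x)]
        if mistakes == 0:
            break
    return w


# AND on inputs (x1, x2), with a leading constant 1 and targets in {-1, +1}.
X = [[1.0, 0.0, 0.0], [1.0, 0.0, 1.0], [1.0, 1.0, 0.0], [1.0, 1.0, 1.0]]
Y = [-1, -1, -1, 1]

w = train_perceptron(X, Y)
predictions = [round_pm1(sum(wj * xj for wj, xj in zip(w, x))) for x in X]
print(w, predictions)
```

AND is linearly separable, so the loop ends with every point classified correctly; on data that is not separable it would simply run until the epoch cap.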
22.
Classification with Perceptrons II: Sigmoid Functions

[Figure: 0/1 data with two fitted lines.]
Least squares fit useless.
This fit would classify much better. But not a least squares fit.

SOLUTION:
Instead of $\text{Out}(\mathbf{x}) = \mathbf{w}^T \mathbf{x}$
We'll use $\text{Out}(\mathbf{x}) = g(\mathbf{w}^T \mathbf{x})$
where $g(x): \Re \to (0,1)$ is a squashing function

Neural Networks: Slide 43

The Sigmoid

$$g(h) = \frac{1}{1 + \exp(-h)}$$

Note that if you rotate this curve through 180° centered on (0, 1/2) you get the same curve.
i.e. $g(h) = 1 - g(-h)$

Can you prove this?

Neural Networks: Slide 44
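For readers who want to check their answer to the question on slide 44, one short route uses only the definition of g:

```latex
1 - g(-h) = 1 - \frac{1}{1 + e^{h}}
          = \frac{e^{h}}{1 + e^{h}}
          = \frac{1}{e^{-h} + 1}
          = g(h)
```

The third step divides numerator and denominator by $e^{h}$.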
23.
The Sigmoid

$$g(h) = \frac{1}{1 + \exp(-h)}$$

Now we choose w to minimize

$$\sum_{i=1}^{R} \left[y_i - \text{Out}(\mathbf{x}_i)\right]^2 = \sum_{i=1}^{R} \left[y_i - g(\mathbf{w}^T \mathbf{x}_i)\right]^2$$

Neural Networks: Slide 45

Linear Perceptron Classification Regions

[Figure: points labeled 0 and 1 plotted in the ($X_1$, $X_2$) plane.]

We'll use the model
$\text{Out}(\mathbf{x}) = g(\mathbf{w}^T (\mathbf{x}, 1)) = g(w_1 x_1 + w_2 x_2 + w_0)$

Which region of above diagram classified with +1, and which with 0 ??

Neural Networks: Slide 46
24.
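Gradient descent with sigmoid on a perceptron
First, notice $g'(x) = g(x)(1 - g(x))$, because $g(x) = \frac{1}{1 + e^{-x}}$, so
$$g'(x) = \frac{e^{-x}}{(1 + e^{-x})^2} = \frac{1 + e^{-x} - 1}{(1 + e^{-x})^2} = \frac{1}{1 + e^{-x}} - \frac{1}{(1 + e^{-x})^2} = \frac{1}{1 + e^{-x}} \left( 1 - \frac{1}{1 + e^{-x}} \right) = g(x)(1 - g(x))$$
With $\mathrm{Out}(x) = g\left( \sum_k w_k x_k \right)$, the error is
$$E = \sum_i \left( y_i - g\left( \sum_k w_k x_{ik} \right) \right)^2$$
and its gradient is
$$\frac{\partial E}{\partial w_j} = \sum_i 2 \left( y_i - g\left( \sum_k w_k x_{ik} \right) \right) \left( -\frac{\partial}{\partial w_j} g\left( \sum_k w_k x_{ik} \right) \right)$$
$$= \sum_i -2 \left( y_i - g\left( \sum_k w_k x_{ik} \right) \right) g'\left( \sum_k w_k x_{ik} \right) \frac{\partial}{\partial w_j} \sum_k w_k x_{ik}$$
$$= \sum_i -2 \, \delta_i \, g(\mathrm{net}_i) (1 - g(\mathrm{net}_i)) \, x_{ij}$$
where $\delta_i = y_i - \mathrm{Out}(x_i)$ and $\mathrm{net}_i = \sum_k w_k x_{ik}$.
The sigmoid perceptron update rule:
$$w_j \leftarrow w_j + \eta \sum_{i=1}^{R} \delta_i \, g_i (1 - g_i) \, x_{ij}$$
where $g_i = g\left( \sum_{j=1}^{m} w_j x_{ij} \right)$ and $\delta_i = y_i - g_i$. (The constant factor 2 from the gradient is absorbed into $\eta$.)

Other Things about Perceptrons
• Invented and popularized by Rosenblatt (1962)
• Even with sigmoid nonlinearity, correct convergence is guaranteed
• Stable behavior for overconstrained and underconstrained problems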
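The sigmoid perceptron update rule, $w_j \leftarrow w_j + \eta \sum_i \delta_i g_i (1 - g_i) x_{ij}$, can be sketched as batch gradient descent in plain Python. The dataset (logical AND with a constant 1 appended), the zero starting weights, and the values of `eta` and `epochs` are assumptions chosen for illustration.

```python
import math


def g(h):
    """The sigmoid squashing function."""
    return 1.0 / (1.0 + math.exp(-h))


def outputs(w, X):
    """g_i = g(sum_j w_j * x_ij) for every datapoint."""
    return [g(sum(wj * xj for wj, xj in zip(w, x))) for x in X]


def sum_squared_error(w, X, y):
    return sum((yi - gi) ** 2 for yi, gi in zip(y, outputs(w, X)))


def train_sigmoid_perceptron(X, y, eta=0.5, epochs=5000):
    """Repeat w_j <- w_j + eta * sum_i delta_i * g_i * (1 - g_i) * x_ij."""
    w = [0.0] * len(X[0])
    for _ in range(epochs):
        gs = outputs(w, X)  # computed once so every w_j uses the same old weights
        for j in range(len(w)):
            step = sum((yi - gi) * gi * (1.0 - gi) * x[j]
                       for x, yi, gi in zip(X, y, gs))
            w[j] += eta * step
    return w


# Toy data (an assumption): logical AND, last input component is the constant 1.
X = [(0, 0, 1), (0, 1, 1), (1, 0, 1), (1, 1, 1)]
y = [0, 0, 0, 1]
error_before = sum_squared_error([0.0, 0.0, 0.0], X, y)
w = train_sigmoid_perceptron(X, y)
error_after = sum_squared_error(w, X, y)
predictions = [1 if gi > 0.5 else 0 for gi in outputs(w, X)]
```

With zero weights every output starts at ½, so the initial error is 4 × ¼ = 1; training drives it down and the thresholded outputs reproduce the targets.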
25.
Perceptrons and Boolean Functions
If inputs are all 0's and 1's and outputs are all 0's and 1's…
• Can learn the function x1 ∧ x2 [Figure: the input points in the (X1, X2) plane.]
• Can learn the function x1 ∨ x2 [Figure: the input points in the (X1, X2) plane.]
• Can learn any conjunction of literals, e.g. x1 ∧ ~x2 ∧ ~x3 ∧ x4 ∧ x5
QUESTION: WHY?

Perceptrons and Boolean Functions
• Can learn any disjunction of literals, e.g. x1 ∨ ~x2 ∨ ~x3 ∨ x4 ∨ x5
• Can learn majority function: f(x1, x2, …, xn) = 1 if n/2 xi's or more are = 1, and 0 if less than n/2 xi's are = 1
• What about the exclusive or function? f(x1, x2) = x1 ⊕ x2 = (x1 ∧ ~x2) ∨ (~x1 ∧ x2)
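One way to approach the "WHY?" question is to write down explicit weights. The sketch below builds a linear threshold unit for a conjunction, a disjunction, and the majority function, and checks each against its full truth table. The weight constructions, the helper names, and the small weight grid used for the exclusive-or search are illustrative assumptions, not taken from the slides.

```python
from itertools import product


def threshold_unit(weights, bias, x):
    """Linear threshold unit: 1 if w.x + bias > 0, else 0."""
    return 1 if sum(w * xi for w, xi in zip(weights, x)) + bias > 0 else 0


def conjunction_unit(signs):
    """signs[j] is +1 for the literal x_j and -1 for the literal ~x_j."""
    positives = sum(1 for s in signs if s > 0)
    return list(signs), 0.5 - positives


def disjunction_unit(signs):
    negatives = sum(1 for s in signs if s < 0)
    return list(signs), negatives - 0.5


def majority_unit(n):
    """Fires when at least n/2 of the n inputs are 1."""
    return [1] * n, 0.25 - n / 2


def agrees(unit, target, n):
    """True if the unit matches target on all 2**n binary inputs."""
    weights, bias = unit
    return all(threshold_unit(weights, bias, x) == target(x)
               for x in product((0, 1), repeat=n))


signs = (1, -1, -1, 1, 1)  # the literals x1, ~x2, ~x3, x4, x5 from the slide


def literal_values(x):
    return [xi if s > 0 else 1 - xi for s, xi in zip(signs, x)]


conjunction_ok = agrees(conjunction_unit(signs), lambda x: int(all(literal_values(x))), 5)
disjunction_ok = agrees(disjunction_unit(signs), lambda x: int(any(literal_values(x))), 5)
majority_ok = agrees(majority_unit(5), lambda x: int(sum(x) >= 5 / 2), 5)

# Exclusive or: search a coarse grid of weights and biases for a matching unit.
grid = [i / 2 for i in range(-8, 9)]
xor_found = any(agrees(((w1, w2), b), lambda x: x[0] ^ x[1], 2)
                for w1 in grid for w2 in grid for b in grid)
```

The first three checks succeed, while the grid search finds no unit for exclusive or. That is consistent with exclusive or not being linearly separable, although a finite grid search on its own is not a proof.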
26.
Multilayer Networks
The class of functions representable by perceptrons is limited:
$$\mathrm{Out}(x) = g(w^\top x) = g\left( \sum_j w_j x_j \right)$$
Use a wider representation!
$$\mathrm{Out}(x) = g\left( \sum_j W_j \, g\left( \sum_k w_{jk} x_k \right) \right)$$
This is a nonlinear function of a linear combination of nonlinear functions of linear combinations of inputs.

A 1-HIDDEN LAYER NET
N_INPUTS = 2, N_HIDDEN = 3
$$v_1 = g\left( \sum_{k=1}^{N_{INS}} w_{1k} x_k \right) \qquad v_2 = g\left( \sum_{k=1}^{N_{INS}} w_{2k} x_k \right) \qquad v_3 = g\left( \sum_{k=1}^{N_{INS}} w_{3k} x_k \right)$$
$$\mathrm{Out} = g\left( \sum_{k=1}^{N_{HID}} W_k v_k \right)$$
[Figure: network diagram. Input x1 feeds the three hidden units through weights w11, w21, w31, and input x2 through w12, w22, w32. The hidden units feed the output through the weights labeled w1, w2, w3 in the diagram, written $W_k$ in the formula.]
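The 1-hidden-layer net above amounts to two nested weighted sums, each passed through $g$. A minimal forward-pass sketch follows; the weight values are arbitrary placeholders chosen for illustration.

```python
import math


def g(h):
    """The sigmoid squashing function."""
    return 1.0 / (1.0 + math.exp(-h))


def forward(x, w, W):
    """Forward pass of a 1-hidden-layer net.

    x: inputs, length N_INS.
    w: w[j][k] is the weight from input k to hidden unit j.
    W: W[j] is the weight from hidden unit j to the output.
    Returns (hidden activations v, output Out).
    """
    v = [g(sum(w_jk * x_k for w_jk, x_k in zip(w_j, x))) for w_j in w]
    out = g(sum(W_j * v_j for W_j, v_j in zip(W, v)))
    return v, out


# N_INPUTS = 2, N_HIDDEN = 3, as on the slide; the numbers are placeholders.
w = [[0.5, -0.4], [-0.3, 0.2], [0.8, 0.7]]
W = [0.3, -0.6, 0.9]
hidden, output = forward([1.0, 0.0], w, W)

# With all weights zero every unit outputs g(0) = 0.5.
zero_hidden, zero_output = forward([1.0, 0.0], [[0.0, 0.0]] * 3, [0.0] * 3)
```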
27.
OTHER NEURAL NETS
[Figure: a net with 2 hidden layers plus a constant term; inputs 1, x1, x2, x3.]
"JUMP" CONNECTIONS [Figure: a net with inputs x1, x2.]
$$\mathrm{Out} = g\left( \sum_{k=1}^{N_{INS}} w_{0k} x_k + \sum_{k=1}^{N_{HID}} W_k v_k \right)$$

Backpropagation
$$\mathrm{Out}(x) = g\left( \sum_j W_j \, g\left( \sum_k w_{jk} x_k \right) \right)$$
Find a set of weights $\{W_j\}, \{w_{jk}\}$ to minimize
$$\sum_i \left( y_i - \mathrm{Out}(x_i) \right)^2$$
by gradient descent.
That's it! That's the backpropagation algorithm.
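A compact sketch of what "minimize by gradient descent" means for the net $\mathrm{Out}(x) = g(\sum_j W_j \, g(\sum_k w_{jk} x_k))$: apply the chain rule to get $\partial E / \partial W_j$ and $\partial E / \partial w_{jk}$, then step downhill. The exclusive-or dataset (with a constant 1 appended to each input), the starting weights, and the values of `eta` and `epochs` are assumptions for illustration only.

```python
import math


def g(h):
    return 1.0 / (1.0 + math.exp(-h))


def forward(x, w, W):
    v = [g(sum(wjk * xk for wjk, xk in zip(wj, x))) for wj in w]
    return v, g(sum(Wj * vj for Wj, vj in zip(W, v)))


def error(X, y, w, W):
    """E = sum_i (y_i - Out(x_i))**2."""
    return sum((yi - forward(x, w, W)[1]) ** 2 for x, yi in zip(X, y))


def gradients(X, y, w, W):
    """Return (dE/dw, dE/dW), accumulated over all datapoints."""
    dw = [[0.0] * len(wj) for wj in w]
    dW = [0.0] * len(W)
    for x, yi in zip(X, y):
        v, out = forward(x, w, W)
        d_out = -2.0 * (yi - out) * out * (1.0 - out)  # dE/d(net input of output)
        for j, vj in enumerate(v):
            dW[j] += d_out * vj
            d_hid = d_out * W[j] * vj * (1.0 - vj)  # dE/d(net input of hidden j)
            for k, xk in enumerate(x):
                dw[j][k] += d_hid * xk
    return dw, dW


def train(X, y, w, W, eta=0.1, epochs=200):
    """Plain gradient descent; returns new weight lists, inputs are not mutated."""
    for _ in range(epochs):
        dw, dW = gradients(X, y, w, W)
        w = [[wjk - eta * d for wjk, d in zip(wj, dwj)] for wj, dwj in zip(w, dw)]
        W = [Wj - eta * d for Wj, d in zip(W, dW)]
    return w, W


# Toy data (an assumption): exclusive or, last input component is the constant 1.
X = [(0, 0, 1), (0, 1, 1), (1, 0, 1), (1, 1, 1)]
y = [0, 1, 1, 0]
w0 = [[0.5, -0.4, 0.1], [-0.3, 0.2, 0.6], [0.8, 0.7, -0.5]]
W0 = [0.3, -0.6, 0.9]
error_before = error(X, y, w0, W0)
w1, W1 = train(X, y, w0, W0)
error_after = error(X, y, w1, W1)
```

The sketch only aims to show the error going down; whether it reaches a good fit depends on the starting weights, the learning rate, and the number of epochs.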
28.
Backpropagation Convergence
Convergence to a global minimum is not guaranteed.
• In practice, this is not a problem, apparently.
Tweaking to find the right number of hidden units, or a useful learning rate η, is more hassle, apparently.
IMPLEMENTING BACKPROP: Differentiate the monster sum-square residual, then write down the gradient descent rule. It turns out to be easier & computationally efficient to use lots of local variables with names like h_j, o_k, v_j, net_i, etc.

Choosing the learning rate
• This is a subtle art.
• Too small: can take days instead of minutes to converge
• Too large: diverges (MSE gets larger and larger while the weights increase and usually oscillate)
• Sometimes the "just right" value is hard to find.
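The three learning-rate regimes can be seen on the simplest possible error surface. The sketch below assumes a toy one-dimensional error $E(w) = w^2$ (not from the slides), for which each gradient step multiplies $w$ by $(1 - 2\eta)$; the three values of `eta` are picked only to illustrate each regime.

```python
def gradient_descent_1d(eta, w0=1.0, steps=50):
    """Minimize the toy error E(w) = w**2 (gradient 2*w); return all iterates."""
    w = w0
    history = [w]
    for _ in range(steps):
        w = w - eta * 2.0 * w
        history.append(w)
    return history


too_small = gradient_descent_1d(eta=0.001)   # factor 0.998 per step: barely moves
just_right = gradient_descent_1d(eta=0.4)    # factor 0.2 per step: converges fast
too_large = gradient_descent_1d(eta=1.1)     # factor -1.2 per step: oscillates and grows
```

After 50 steps the small rate is still close to its starting value, the moderate rate is essentially at the minimum, and the large rate flips sign on every step while its magnitude grows.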
29.
Learning-rate problems
[Figure] From J. Hertz, A. Krogh, and R. G. Palmer. Introduction to the Theory of Neural Computation. Addison-Wesley, 1994.

Improving Simple Gradient Descent: Momentum
Don't just change weights according to the current datapoint. Re-use changes from earlier iterations.
Let $\Delta w(t)$ = weight changes at time t.
Let $-\eta \frac{\partial E}{\partial w}$ be the change we would make with regular gradient descent.
Instead we use
$$\Delta w(t+1) = -\eta \frac{\partial E}{\partial w} + \alpha \, \Delta w(t)$$
$$w(t+1) = w(t) + \Delta w(t)$$
where $\alpha$ is the momentum parameter.
Momentum damps oscillations.
A hack? Well, maybe.
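A minimal sketch of the momentum update on a toy error surface. Two things here are assumptions rather than slide content: the error $E(w) = \frac{1}{2}(w_1^2 + 100 w_2^2)$, chosen because its two directions have very different curvature, and the choice to apply the freshly computed change immediately, i.e. $w \leftarrow w + \Delta w(t+1)$, which is the common way to code it. Setting `alpha=0` recovers regular gradient descent.

```python
def grad(w):
    """Gradient of the toy error E(w) = 0.5 * (w1**2 + 100 * w2**2)."""
    return [w[0], 100.0 * w[1]]


def error(w):
    return 0.5 * (w[0] ** 2 + 100.0 * w[1] ** 2)


def descend(w0, eta, alpha, steps):
    """Gradient descent with momentum parameter alpha."""
    w = list(w0)
    delta = [0.0] * len(w)  # Delta w(0) = 0
    for _ in range(steps):
        gradient = grad(w)
        # Delta w(t+1) = -eta * dE/dw + alpha * Delta w(t)
        delta = [-eta * gj + alpha * dj for gj, dj in zip(gradient, delta)]
        w = [wj + dj for wj, dj in zip(w, delta)]
    return w


plain = descend([1.0, 1.0], eta=0.01, alpha=0.0, steps=100)
with_momentum = descend([1.0, 1.0], eta=0.01, alpha=0.9, steps=100)
```

On this surface the learning rate is limited by the steep direction, so plain descent crawls along the shallow one; re-using earlier changes lets the momentum run end at a lower error after the same number of steps.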
30.
Momentum illustration
[Figure]

Improving Simple Gradient Descent: Newton's method
$$E(w + h) = E(w) + h^\top \frac{\partial E}{\partial w} + \frac{1}{2} h^\top \frac{\partial^2 E}{\partial w^2} h + O(|h|^3)$$
If we neglect the $O(|h|^3)$ terms, this is a quadratic form.
Quadratic form fun facts: If $y = c + b^\top x - \frac{1}{2} x^\top A x$ and if $A$ is SPD, then $x_{opt} = A^{-1} b$ is the value of $x$ that maximizes $y$.
31.
Improving Simple Gradient Descent: Newton's method
$$E(w + h) = E(w) + h^\top \frac{\partial E}{\partial w} + \frac{1}{2} h^\top \frac{\partial^2 E}{\partial w^2} h + O(|h|^3)$$
If we neglect the $O(|h|^3)$ terms, this is a quadratic form, which gives the update
$$w \leftarrow w - \left( \frac{\partial^2 E}{\partial w^2} \right)^{-1} \frac{\partial E}{\partial w}$$
This should send us directly to the global minimum if the function is truly quadratic. And it might get us close if it's locally quadraticish.

BUT (and it's a big but): That second derivative matrix can be expensive and fiddly to compute. If we're not already in the quadratic bowl, we'll go nuts.
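The claim that Newton's method goes straight to the minimum of a truly quadratic function can be checked on a small example. The sketch assumes a two-dimensional quadratic error $E(w) = \frac{1}{2} w^\top A w - b^\top w$ with a made-up SPD matrix $A$ and vector $b$; its gradient is $A w - b$, its second derivative matrix is $A$, and its minimum is at $A^{-1} b$.

```python
def newton_step(w, gradient, hessian):
    """One Newton update w <- w - H^{-1} g for a 2-dimensional weight vector."""
    (a, b), (c, d) = hessian
    det = a * d - b * c
    # H^{-1} g via the closed-form inverse of a 2x2 matrix.
    step = [(d * gradient[0] - b * gradient[1]) / det,
            (-c * gradient[0] + a * gradient[1]) / det]
    return [w[0] - step[0], w[1] - step[1]]


# Hypothetical quadratic error E(w) = 0.5 * w^T A w - b^T w.
A = [[3.0, 1.0], [1.0, 2.0]]
b = [1.0, 1.0]


def gradient_at(w):
    """dE/dw = A w - b."""
    return [A[0][0] * w[0] + A[0][1] * w[1] - b[0],
            A[1][0] * w[0] + A[1][1] * w[1] - b[1]]


w_start = [5.0, -4.0]
w_next = newton_step(w_start, gradient_at(w_start), A)  # lands on A^{-1} b = (0.2, 0.4)
```

For a larger number of weights, forming and inverting the second derivative matrix is the costly part, which is the slide's "big but".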
32.
Improving Simple Gradient Descent: Conjugate Gradient
Another method which attempts to exploit the "local quadratic bowl" assumption, but does so while only needing to use $\frac{\partial E}{\partial w}$ and not $\frac{\partial^2 E}{\partial w^2}$.
It is also more stable than Newton's method if the local quadratic bowl assumption is violated.
It's complicated, outside our scope, but it often works well. More details in Numerical Recipes in C.

BEST GENERALIZATION
Intuitively, you want to use the smallest, simplest net that seems to fit the data.
HOW TO FORMALIZE THIS INTUITION?
1. Don't. Just use intuition
2. Bayesian Methods Get it Right
3. Statistical Analysis explains what's going on
4. Cross-validation
Discussed in the next lecture
33.
What You Should Know
• How to implement multivariate least-squares linear regression.
• Derivation of least squares as max. likelihood estimator of linear coefficients
• The general gradient descent rule

What You Should Know
• Perceptrons: linear output, least squares; sigmoid output, least squares
• Multilayer nets: the idea behind back prop; awareness of better minimization methods
• Generalization. What it means.
34.
APPLICATIONS
To Discuss:
• What can non-linear regression be useful for?
• What can neural nets (used as non-linear regressors) be useful for?
• What are the advantages of N. Nets for nonlinear regression?
• What are the disadvantages?

Other Uses of Neural Nets…
• Time series with recurrent nets
• Unsupervised learning (clustering, principal components, and non-linear versions thereof)
• Combinatorial optimization with Hopfield nets, Boltzmann Machines
• Evaluation function learning (in reinforcement learning)