Timeline

• Remaining lectures
   – 2 lectures on machine learning (today and next Tuesday)
   – And then onto planning/search


• Homeworks
   – # (N-grams) is due today
   – # (HMMs) will be out shortly, due Friday next week
Naïve Bayes Model


       [Figure: class node C with feature nodes Y1, Y2, Y3, …, Yn as its children]


             P(C | Y1, …, Yn) = α Πi P(Yi | C) P(C)

 Features Y are conditionally independent given the class variable C

 Widely used in machine learning
         e.g., spam email classification: Y’s = counts of words in emails

 Conditional probabilities P(Yi | C) can easily be estimated from labeled data
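
A minimal sketch of how the Naïve Bayes score above can be computed for the spam example, assuming toy (made-up) priors and per-word likelihoods and working in log space to avoid underflow:

```python
import math

# A toy sketch (made-up numbers): score each class by its unnormalized log-posterior,
#   log P(C) + sum_i count(word_i) * log P(word_i | C),
# and predict the class with the higher score.
def naive_bayes_log_score(word_counts, log_prior, log_likelihood):
    score = log_prior
    for word, count in word_counts.items():
        if word in log_likelihood:               # ignore words the model has no estimate for
            score += count * log_likelihood[word]
    return score

classes = {
    "spam":     {"prior": 0.4, "lik": {"free": 0.05,  "meeting": 0.001}},
    "not_spam": {"prior": 0.6, "lik": {"free": 0.005, "meeting": 0.02}},
}
email = {"free": 3, "meeting": 1}                # Y's: word counts in one email
scores = {c: naive_bayes_log_score(email,
                                   math.log(p["prior"]),
                                   {w: math.log(v) for w, v in p["lik"].items()})
          for c, p in classes.items()}
print(max(scores, key=scores.get))               # -> spam
```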
Hidden Markov Model (HMM)



   [Figure: observed variables Y1, Y2, Y3, …, Yn in a row; hidden states S1, S2, S3, …, Sn
    below them form a Markov chain, with each St emitting Yt]

 Two key assumptions:
        1. hidden state sequence is Markov
        2. observation Yt is conditionally independent of all other variables given St

 Widely used in speech recognition, protein sequence models,
 natural language processing
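
The two assumptions are easiest to see generatively. Below is a small sketch with made-up transition and emission tables (not from any real model): the next hidden state depends only on the current one, and each observation depends only on the state that emitted it.

```python
import random

# Toy generative sketch of an HMM (made-up tables).
initial    = {"A": 0.5, "B": 0.5}
transition = {"A": {"A": 0.9, "B": 0.1}, "B": {"A": 0.2, "B": 0.8}}
emission   = {"A": {"x": 0.7, "y": 0.3}, "B": {"x": 0.1, "y": 0.9}}

def sample(dist):
    return random.choices(list(dist), weights=list(dist.values()))[0]

def generate(n):
    s = sample(initial)
    sequence = []
    for _ in range(n):
        sequence.append((s, sample(emission[s])))   # (hidden St, observed Yt)
        s = sample(transition[s])                   # Markov assumption: St+1 depends only on St
    return sequence

print(generate(5))
```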
Machine Learning




CS 371: Spring 2012
Outline



• Different types of learning algorithms

• Inductive learning
   – Decision trees


• Evaluation

• Reading for today’s lecture
   – AIMA Chapter 18.1 to 18.3 (inclusive)
Learning

• Why is it useful for our agent to be able to learn?
    – Learning is a key hallmark of intelligence


• Learning is essential for unknown environments,
    – i.e., when designer lacks omniscience

• Learning is useful as a system construction method,
    – i.e., expose the agent to reality rather than trying to write it down

• Learning modifies the agent's decision mechanisms to improve
  performance
Learning agents
Learning element

• Design of a learning element is affected by
   – Which components of the performance element are to be learned
   – What feedback is available to learn these components
   – What representation is used for the components
Automated Learning


•   Types of learning based on feedback
     – Supervised learning: correct answers for each example
        • Learning a mapping from a set of inputs to a target variable
             – Classification: target variable is discrete (e.g., spam email)
             – Regression: target variable is real-valued (e.g., stock market)

     – Unsupervised learning: correct answers not given
        • No target variable provided
             – Clustering: grouping data into K groups

     – Reinforcement learning: occasional rewards
              – e.g., game-playing agent

     –   Other types of learning
          • Learning to rank, e.g., document ranking in Web search
Simple illustrative learning problem

Problem:
  decide whether to wait for a table at a restaurant, based on the following attributes:

     1. Alternate: is there an alternative restaurant nearby?
     2. Bar: is there a comfortable bar area to wait in?
     3. Fri/Sat: is today Friday or Saturday?
     4. Hungry: are we hungry?
     5. Patrons: number of people in the restaurant (None, Some, Full)
     6. Price: price range ($, $$, $$$)
     7. Raining: is it raining outside?
     8. Reservation: have we made a reservation?
     9. Type: kind of restaurant (French, Italian, Thai, Burger)
     10. WaitEstimate: estimated waiting time (0-10, 10-30, 30-60, >60)
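
One plausible way to encode a single training example for this problem in code (the attribute values below are illustrative, not taken from the actual training table on the next slide):

```python
# Illustrative attribute values only; the real labeled examples are in the training
# table on the following slide.
example = {
    "Alternate": True, "Bar": False, "Fri/Sat": False, "Hungry": True,
    "Patrons": "Some", "Price": "$$", "Raining": False, "Reservation": True,
    "Type": "Thai", "WaitEstimate": "0-10",
}
target = True   # the target variable: do we wait for a table?
```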
Training Data for Supervised Learning
Terminology

• Attributes
   – Also known as features, variables, independent variables,
     covariates


• Target Variable
   – Also known as goal predicate, dependent variable, …



• Classification
   – Also known as discrimination, supervised classification, …


• Error function
   – Objective function, loss function, …
Inductive learning

• Let x represent the input vector of attributes

• Let f(x) represent the value of the target variable for x
    – The implicit mapping from x to f(x) is unknown to us
    – We just have training data pairs, D = {x, f(x)} available


• We want to learn a mapping from x to f, i.e.,
      h(x; θ) is “close” to f(x) for all training data points x



        θ are the parameters of our predictor h(..)

• Examples:
    – h(x; θ) = sign(w1x1 + w2x2+ w3)


    – hk(x) = (x1 OR x2) AND (x3 OR NOT(x4))
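
As a concrete sketch, the two example hypothesis classes above could be coded as follows (the weights and the Boolean expression are placeholders, not learned values):

```python
# Sketches of the two example hypothesis classes (placeholder parameters).
def h_linear(x, w1=1.0, w2=-2.0, w3=0.5):
    """h(x; θ) = sign(w1*x1 + w2*x2 + w3), with θ = (w1, w2, w3)."""
    return 1 if w1 * x[0] + w2 * x[1] + w3 >= 0 else -1

def h_boolean(x):
    """h(x) = (x1 OR x2) AND (x3 OR NOT x4), for Boolean inputs x = (x1, x2, x3, x4)."""
    return (x[0] or x[1]) and (x[2] or not x[3])

print(h_linear((2.0, 1.0)))                      # sign(2 - 2 + 0.5) = +1
print(h_boolean((True, False, False, False)))    # (T or F) and (F or not F) = True
```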
Empirical Error Functions

• Empirical error function:
       E(h) = Σx distance[ h(x; θ), f(x) ]

   e.g., distance = squared error if h and f are real-valued (regression)
         distance = delta-function if h and f are categorical (classification)

   Sum is over all training pairs in the training data D




   In learning, we get to choose

      1. what class of functions h(..) that we want to learn
           – potentially a huge space! (“hypothesis space”)

      2. what error function/distance to use
            - should be chosen to reflect real “loss” in problem
            - but often chosen for mathematical/algorithmic convenience
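
A minimal implementation of this empirical error, under the assumption that the training data D is a list of (x, f(x)) pairs:

```python
# A minimal empirical-error computation (a sketch; D is a list of (x, f(x)) pairs,
# matching the notation on this slide).
def empirical_error(h, D, distance):
    return sum(distance(h(x), fx) for x, fx in D)

def squared_error(a, b):      # regression
    return (a - b) ** 2

def zero_one_loss(a, b):      # classification ("delta-function" distance)
    return 0 if a == b else 1

def h(x):                     # a stand-in hypothesis: predict +1 iff x1 > x2
    return 1 if x[0] > x[1] else -1

D = [((2.0, 1.0), 1), ((0.0, 3.0), -1), ((1.0, 1.0), 1)]
print(empirical_error(h, D, zero_one_loss))   # -> 1 (the third pair is misclassified)
```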
Inductive Learning as Optimization or Search

•   Empirical error function:
        E(h) = Σx distance[ h(x; θ), f(x) ]



•   Empirical learning = finding h(x), or h(x; θ) that minimizes E(h)


•   Once we decide on what the functional form of h is, and what the error function E
    is, then machine learning typically reduces to a large search or optimization
    problem

•   Additional aspect: we really want to learn an h(..) that will generalize well to new
    data, not just memorize training data – will return to this later
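
In miniature, the search view looks like this: pick the parameter θ (here a scalar threshold) that minimizes E(h) over a small candidate set. This is only a toy illustration of learning as optimization, not a practical training procedure.

```python
# Toy "learning as search": choose the threshold θ that minimizes empirical error.
D = [(0.5, 0), (1.5, 0), (2.5, 1), (3.5, 1)]        # (x, f(x)) pairs, scalar x

def E(theta):
    return sum(1 for x, fx in D if (1 if x > theta else 0) != fx)

candidates = [0.0, 1.0, 2.0, 3.0]
best_theta = min(candidates, key=E)
print(best_theta, E(best_theta))   # -> 2.0 0: this threshold classifies all four points correctly
```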
Our training data example (again)




•   If all attributes were binary, h(..) could be any arbitrary Boolean function

•   Natural error function E(h) to use is classification error, i.e., how many incorrect
    predictions does a hypothesis h make

•   Note an implicit assumption:
     –   For any set of attribute values there is a unique target value
     –   This in effect assumes a “no-noise” mapping from inputs to targets
           • This is often not true in practice (e.g., in medicine). Will return to this later
Learning Boolean Functions

•   Given examples of the function, can we learn the function?

•   How many Boolean functions can be defined on d attributes?
     – Boolean function = truth table + a column for the target function (binary)
     – The truth table has 2^d rows
     – So there are 2^(2^d) different Boolean functions we can define (!)
     – This is the size of our hypothesis space

     – E.g., for d = 6 there are 2^64 ≈ 18.4 x 10^18 possible Boolean functions
       (checked numerically below)


•   Observations:
     – Huge hypothesis spaces -> directly searching over all functions is
       impossible
     – Given a small dataset (n pairs) our learning problem may be underconstrained
         • Ockham’s razor: if multiple candidate functions all explain the data
           equally well, pick the simplest explanation (the least complex function)
         • Constrain our search to classes of Boolean functions, e.g.,
              – decision trees
              – weighted linear sums of inputs (e.g., perceptrons)
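
A quick numerical check of the counting argument:

```python
# Size of the Boolean hypothesis space: 2^(2^d) functions on d binary attributes.
for d in (2, 4, 6):
    print(d, 2 ** (2 ** d))
# d = 6 gives 18,446,744,073,709,551,616 ≈ 18.4 x 10^18, matching the slide.
```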
Decision Tree Learning

•   Constrain h(..) to be a decision tree
Decision Tree Representations

•   Decision trees are fully expressive
     – can represent any Boolean function
     – Every path in the tree could represent 1 row in the truth table
     – Yields an exponentially large tree
         • The truth table has 2^d rows, where d is the number of attributes
Decision Tree Representations


•   Trees can be very inefficient for certain types of functions
     – Parity function: 1 only if an even number of 1’s in the input vector
         • Trees are very inefficient at representing such functions
     – Majority function: 1 if more than ½ the inputs are 1’s
         • Also inefficient
Decision Tree Learning

•   Find the smallest decision tree consistent with the n examples
     – Unfortunately this is provably intractable to do optimally

•   Greedy heuristic search used in practice:
     –   Select root node that is “best” in some sense
     –   Partition data into 2 subsets, depending on root attribute value
     –   Recursively grow subtrees
     –   Different termination criteria
           • For noiseless data, if all examples at a node have the same label then
              declare it a leaf and back up
           • For noisy data it might not be possible to find a “pure” leaf using the
              given attributes
                 – we’ll return to this later – but a simple approach is to have a
                   depth-bound on the tree (or go to max depth) and use majority
                   vote

•   We have talked about binary variables up until now, but we can
    trivially extend to multi-valued variables
Pseudocode for Decision tree learning
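
The pseudocode figure from this slide is not reproduced in the text, so here is a minimal Python sketch of the greedy procedure from the previous slide; the example representation, the names, and the trivial splitting criterion are my own assumptions (information gain, defined a few slides later, would normally be plugged in as `choose`).

```python
from collections import Counter

# Greedy decision-tree learner sketch. Each example is an (attributes_dict, label)
# pair; `choose` picks the "best" attribute to split on and is left pluggable.
def learn_tree(examples, attributes, choose, default=None):
    if not examples:
        return default                               # no data here: use parent's majority label
    labels = [label for _, label in examples]
    if len(set(labels)) == 1:
        return labels[0]                             # "pure" node: declare it a leaf
    if not attributes:
        return Counter(labels).most_common(1)[0][0]  # no attributes left: majority vote
    best = choose(examples, attributes)
    majority = Counter(labels).most_common(1)[0][0]
    tree = {best: {}}
    for value in {attrs[best] for attrs, _ in examples}:
        subset = [(attrs, label) for attrs, label in examples if attrs[best] == value]
        remaining = [a for a in attributes if a != best]
        tree[best][value] = learn_tree(subset, remaining, choose, default=majority)
    return tree

# Trivial criterion just to make the sketch runnable end to end:
def first_attribute(examples, attributes):
    return attributes[0]

data = [({"Patrons": "Some", "Hungry": True},  True),
        ({"Patrons": "Full", "Hungry": False}, False),
        ({"Patrons": "None", "Hungry": False}, False)]
print(learn_tree(data, ["Patrons", "Hungry"], first_attribute))
# e.g. {'Patrons': {'Some': True, 'Full': False, 'None': False}} (branch order may vary)
```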
Training Data for Supervised Learning
Choosing an attribute

•   Idea: a good attribute splits the examples into subsets that are
    (ideally) "all positive" or "all negative"




•   Patrons? is a better choice
     – How can we quantify this?
     – One approach would be to use the classification error E directly (greedily)
        • Empirically it is found that this works poorly
     – Much better is to use information gain (next slides)
Entropy

H(p) = entropy of distribution p = {pi}
                          (called “information” in the textbook)

     = Σi pi log(1/pi) = − Σi pi log pi
       (binary case: H(p) = − p log p − (1 − p) log(1 − p))


    Intuitively, log(1/pi) is the amount of information we get when we find
      out that outcome i occurred, e.g.,
         i = “6.0 earthquake in New York today”, p(i) = 1/2^20
                              log(1/pi) = 20 bits
         j = “rained in New York today”, p(j) = 1/2
                           log(1/pj) = 1 bit


    Entropy is the expected amount of information we gain, given a
      probability distribution – it is our average uncertainty

   In general, H(p) is maximized when all pi are equal and minimized
      (= 0) when one of the pi’s is 1 and all others are zero.
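
A direct transcription of the entropy formula, in bits:

```python
import math

# Entropy in bits of a discrete distribution given as a list of probabilities.
def entropy(probs):
    return -sum(p * math.log2(p) for p in probs if p > 0)

print(entropy([0.5, 0.5]))     # 1.0 bit: maximal uncertainty over two outcomes
print(entropy([1.0, 0.0]))     # 0.0 bits: no uncertainty
print(entropy([0.25] * 4))     # 2.0 bits: uniform over four outcomes
```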
Entropy with only 2 outcomes


   Consider a 2-class problem: p = probability of class 1, 1 − p = probability of class 2

   In the binary case, H(p) = − p log p − (1 − p) log(1 − p)


   [Plot: H(p) versus p, with H(0) = H(1) = 0 and a maximum of 1 bit at p = 0.5]
Information Gain

• H(p) = entropy of class distribution at a particular node

• H(p | A) = conditional entropy = average entropy of
  conditional class distribution, after we have partitioned the
  data according to the values in A

• Gain(A) = H(p) – H(p | A)

• Simple rule in decision tree learning
    – At each internal node, split on the attribute with the largest
      information gain (or equivalently, with the smallest H(p | A))


• Note that by definition, conditional entropy can’t be greater
  than the entropy
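
A small sketch of Gain(A), assuming each candidate attribute is summarized as a list of (attribute_value, class_label) pairs:

```python
import math
from collections import Counter, defaultdict

def H(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(pairs):
    """pairs: list of (attribute_value, class_label) tuples for one attribute."""
    labels = [label for _, label in pairs]
    groups = defaultdict(list)
    for value, label in pairs:
        groups[value].append(label)
    h_cond = sum(len(g) / len(pairs) * H(g) for g in groups.values())
    return H(labels) - h_cond          # Gain(A) = H(p) - H(p | A)

# e.g., an attribute that splits 4 examples into two pure groups has gain 1 bit:
print(information_gain([("a", 1), ("a", 1), ("b", 0), ("b", 0)]))   # -> 1.0
```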
Root Node Example

For the training set, 6 positives, 6 negatives, H(6/12, 6/12) = 1 bit

Consider the attributes Patrons and Type:

IG(Patrons) = 1 − [ (2/12) H(0, 1) + (4/12) H(1, 0) + (6/12) H(2/6, 4/6) ] ≈ 0.541 bits

IG(Type) = 1 − [ (2/12) H(1/2, 1/2) + (2/12) H(1/2, 1/2) + (4/12) H(2/4, 2/4) + (4/12) H(2/4, 2/4) ] = 0 bits


Patrons has the highest IG of all attributes and so is chosen by the learning
    algorithm as the root

Information gain is then repeatedly applied at internal nodes until all leaves contain
    only examples from one class or the other
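
These two values can be checked numerically (the splits 2/4/6 for Patrons and 2/2/4/4 for Type are taken from the slide):

```python
import math

def H(*probs):
    return -sum(p * math.log2(p) for p in probs if p > 0)

ig_patrons = 1 - (2/12 * H(0, 1) + 4/12 * H(1, 0) + 6/12 * H(2/6, 4/6))
ig_type = 1 - (2/12 * H(1/2, 1/2) + 2/12 * H(1/2, 1/2)
               + 4/12 * H(2/4, 2/4) + 4/12 * H(2/4, 2/4))
print(round(ig_patrons, 3), round(ig_type, 3))   # -> 0.541 0.0
```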
Decision Tree Learned

•   Decision tree learned from the 12 examples:
True Tree (left) versus Learned Tree (right)
Summary

• Inductive learning
   – Error function, class of hypothesis/models {h}
   – Want to minimize E on our training data
   – Example: decision tree learning
