Overfitting
              &
Transformation-Based Learning


     CS 371: Spring 2012
Machine Learning

• Machines can learn from examples
   –    Learning modifies the agent's decision mechanisms to improve
       performance


• Given training data, machines analyze the data, and learn
  rules which generalize to new examples
   – Can be sub-symbolic (rule may be a mathematical function)
   – Or it can be symbolic (rules are in a representation that is similar
     to representation used for hand-coded rules)


• In general, machine learning approaches allow for more tuning
  to the needs of a corpus, and can be reused across corpora
Training data example




•   Inductive learning
          Empirical error function:
                       E(h) = Σ_x distance[ h(x; θ), f(x) ]

           Empirical learning = finding h(x), or h(x; θ), that minimizes E(h)

•   Note an implicit assumption:
     –   For any set of attribute values there is a unique target value
     –   This in effect assumes a “no-noise” mapping from inputs to targets
           • This is often not true in practice (e.g., in medicine).
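
To make the error function concrete, here is a minimal Python sketch, assuming 0/1 disagreement as the distance and representing the target f by the observed labels (both choices are illustrative, not fixed by the slide):

# Minimal sketch of E(h) = Σ_x distance[h(x; θ), f(x)], with 0/1
# disagreement as the distance (an illustrative assumption).

def empirical_error(h, examples):
    """examples: list of (x, target) pairs from the training data."""
    return sum(1 for x, target in examples if h(x) != target)

# Toy usage: two candidate hypotheses over one Boolean attribute.
examples = [((True,), True), ((False,), False), ((True,), True)]
h1 = lambda x: x[0]        # predicts the attribute itself
h2 = lambda x: not x[0]    # predicts its negation
print(empirical_error(h1, examples))   # 0  -> h1 minimizes E(h) here
print(empirical_error(h2, examples))   # 3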
Learning Boolean Functions

•   Given examples of the function, can we learn the function?

•   2^(2^d) different Boolean functions can be defined on d attributes
     – This is the size of our hypothesis space


•   Observations:
     – Huge hypothesis spaces → directly searching over all functions is impossible
     – Given a small data set (n pairs), our learning problem may be underconstrained
         • Ockham’s razor: if multiple candidate functions all explain the data
           equally well, pick the simplest explanation (least complex function)
         • Constrain our search to classes of Boolean functions, e.g.,
              – decision trees
Decision Tree Learning

•   Constrain h(..) to be a decision tree
Pseudocode for Decision tree learning
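
The slide's pseudocode figure is not reproduced here; the following is a sketch of the standard top-down (ID3-style) algorithm, not necessarily the exact version shown in lecture. The quality measure is left as a parameter, anticipating Q1 below:

from collections import Counter

def learn_tree(examples, attributes, choose_best_attribute):
    """examples: list of (x, label) pairs, x a dict of attribute values."""
    labels = [label for _, label in examples]
    if len(set(labels)) == 1:                 # pure node: stop splitting
        return labels[0]
    if not attributes:                        # nothing left to split on: majority label
        return Counter(labels).most_common(1)[0][0]
    best = choose_best_attribute(examples, attributes)
    tree = {best: {}}
    for v in {x[best] for x, _ in examples}:  # one branch per attribute value
        subset = [(x, label) for x, label in examples if x[best] == v]
        rest = [a for a in attributes if a != best]
        tree[best][v] = learn_tree(subset, rest, choose_best_attribute)
    return tree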
Major issues

Q1: Choosing best attribute: what quality measure to use?

Q2: Handling training data with missing attribute values

Q3: Handling training data with noise, irrelevant attributes
  - Determining when to stop splitting: avoid overfitting
Major issues

Q1: Choosing best attribute: different quality measures.
       Information gain, gain ratio …

Q2: Handling training data with missing attribute values: blank
  value, most common value, or fractional count

Q3: Handling training data with noise, irrelevant attributes:
  - Determining when to stop splitting: ????
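
A minimal sketch of information gain, the first of the quality measures named under Q1 (entropy of the labels minus the expected entropy after splitting); the helper names are mine:

import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(examples, attribute):
    """examples: list of (x, label) pairs, x a dict of attribute values."""
    labels = [label for _, label in examples]
    by_value = {}
    for x, label in examples:
        by_value.setdefault(x[attribute], []).append(label)
    remainder = sum(len(ls) / len(labels) * entropy(ls) for ls in by_value.values())
    return entropy(labels) - remainder

Plugged into the tree learner above as choose_best_attribute = lambda ex, attrs: max(attrs, key=lambda a: information_gain(ex, a)).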
Assessing Performance

Training data performance is typically optimistic
       e.g., error rate on training data



Reasons?
      - classifier may not have enough data to fully learn the concept (but
          on training data we don’t know this)
      - for noisy data, the classifier may overfit the training data



In practice we want to assess performance “out of sample”
       how well will the classifier do on new, unseen data? This is the
         true test of what we have learned (just like an exam in a classroom)

With large data sets we can partition our data into 2 subsets, train and test
       - build a model on the training data
       - assess performance on the test data
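
A sketch of that protocol, assuming a 2/3–1/3 split (the slide does not fix a ratio); fit and accuracy are hypothetical stand-ins for whatever learner is being assessed:

import random

def train_test_split(data, test_fraction=1/3, seed=0):
    data = data[:]                        # don't mutate the caller's list
    random.Random(seed).shuffle(data)
    cut = int(len(data) * (1 - test_fraction))
    return data[:cut], data[cut:]         # (train, test)

# train, test = train_test_split(data)
# model = fit(train)                      # hypothetical helpers
# print(accuracy(model, test))            # performance "out of sample"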
Example of Test Performance

Restaurant problem
      - simulate 100 data sets of different sizes
      - train on this data, and assess performance on an independent test set
      - learning curve = plotting accuracy as a function of training set size
      - typical “diminishing returns” effect
Example

[Figures omitted: five example slides following the learning-curve discussion above.]
How Overfitting affects Prediction

[Figure, built up across three slides: predictive error plotted against model complexity. Error on the training data decreases steadily as model complexity grows, while error on the test data first falls, then rises again. Models to the left underfit, models to the right overfit; the ideal range for model complexity lies where the test error is lowest.]
Training and Validation Data

[Figure: the full data set is split into a training set and a validation set.]

Idea: train each model on the “training data”, and then test each model’s accuracy on the validation data.
The v-fold Cross-Validation Method

• Why just choose one particular 90/10 “split” of the data?
   – In principle we could do this multiple times


• “v-fold Cross-Validation” (e.g., v=10)
   – randomly partition our full data set into v disjoint subsets (each
     roughly of size n/v, n = total number of training data points)
       • for i = 1:10 (here v = 10)
           – train on 90% of data,
           – Acc(i) = accuracy on other 10%
       • end

       • Cross-Validation-Accuracy = (1/v) Σ_i Acc(i)
   – choose the method with the highest cross-validation accuracy
   – common values for v are 5 and 10
   – Can also do “leave-one-out” where v = n
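
The procedure above as a sketch; fit and accuracy are hypothetical stand-ins for the learner being evaluated:

import random

def cross_validation_accuracy(data, fit, accuracy, v=10, seed=0):
    data = data[:]
    random.Random(seed).shuffle(data)
    folds = [data[i::v] for i in range(v)]        # v disjoint subsets, size ≈ n/v
    accs = []
    for i in range(v):
        validation = folds[i]
        training = [x for j, fold in enumerate(folds) if j != i for x in fold]
        model = fit(training)                     # train on the other v-1 folds
        accs.append(accuracy(model, validation))  # Acc(i) on the held-out fold
    return sum(accs) / v                          # (1/v) Σ_i Acc(i)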
Disjoint Validation Data Sets

[Figure, two slides: the full data set with a different slice held out as validation data in the 1st partition, the 2nd partition, and so on; the rest serves as training data each time.]
More on Cross-Validation

• Notes
   – cross-validation generates an approximate estimate of how well
     the learned model will do on “unseen” data

   – by averaging over different partitions it is more robust than just a
     single train/validate partition of the data

   – “v-fold” cross-validation is a generalization
       • partition data into disjoint validation subsets of size n/v
       • train, validate, and average over the v partitions
       • e.g., v=10 is commonly used

   – v-fold cross-validation is approximately v times computationally
     more expensive than just fitting a model to all of the data
Let's look at another symbolic learner …
Problem Domain: POS Tagging


 What is text tagging?
  –    Some sort of markup, enabling understanding of
  language.
  –    Can be word tags:
               He will race/VERB the car.
               He will not race/VERB the truck.
               When will the race/NOUN end?
Why do we care?



 Sometimes, meaning changes a lot
  – Transcribed speech lacks clear punctuation:
       “I called, John and Mary are there.”
       → I called John and Mary are there.
                (I called John) and (Mary are there.) ??
                I called ((John and Mary) are there.)

  – We can tell, but can a computer?
      Here, it needs to know about verb forms and collocations

  – Can be important!
      Quick! Wrap the bandage on the table around her leg!
      Imagine a robotic medical assistant with this one . . .
Where is this used?

• Any natural language task!
  – Translators: word-by-word translation does not always work,
        sentences need re-arranging.
  – It can help with OCR or voice transcription

 “I need to writer. I'm a good write her.”
      “to writer”??    “a good write”?
→ “I need to write her. I'm a good writer.”
Some terms

  Corpus
   – Big body of text, annotated (expert-tagged) or not
  Dictionary
   – List of known words, and all possible parts of speech
  Lexical/Morphological vs. Contextual
   – Is it a word property (spelling) or surroundings (neighboring
   parts of speech)?
  Semantics vs Syntax
   – Meaning (definition) vs. Structure (phrases, parsing)
  Tokenizer
   – Separates text into words or other sized blocks (idioms,
phrases . . . )
  Disambiguator
   – Extra pass to reduce possible tags to a single one.
Some problems we face

Classification challenges:
   – Large number of classes:
           English POS: varying tagsets, 48 to 195 tags

  – Often ambiguous, varying with use/context
         POS: There must be a way to go there; I know a
                person from there – see that guy there?
        (pron., adv., n.)

  – Varying number of relevant features
         Spelling, position, surrounding words, paragraph
       position, article topic . . .
TBL: A Symbolic Learning Method

• A method called error-driven Transformation-Based Learning
  (TBL) (Brill algorithm) can be used for symbolic learning
   – The rules (actually, a sequence of rules) are learned from an
     annotated corpus
   – Performs about as accurately as other statistical approaches


• Can have better treatment of context compared to HMMs (as
  we’ll see)
   – rules which use the next (or previous) POS
       • HMMs just use P(T_i | T_{i-1}) or P(T_i | T_{i-2}, T_{i-1})
   – rules which use the previous (next) word
       • HMMs just use P(W_i | T_i)
What does it do?

 Transformation-Based Error-Driven Learning:
  – First, a dictionary tags every word with its most
  common POS. So, “run” is tagged as a verb in both:

   “The run lasted 30 minutes” and “We run 3 miles every day”

  – Unknown capitalized words are assumed to be proper
  nouns, and remaining unknown words are assigned the most
  common tag for their three-letter ending.
      → “blahblahous” is probably an adjective.

  – Finally, the tags are updated by a set of “patches,” of the
  form “Change tag a to b if:”
        – The word is in context C (e.g., the pattern of surrounding tags)
        – The word, or one in a region R, has lexical property P (e.g.,
  capitalization)
Rule Templates

• Brill’s method learns transformations which fit different
  templates
    – Template: Change tag X to tag Y when previous word is W
       • Transformation: NN → VB when previous word = to

    – Change tag X to tag Y when previous tag is Z
      Ex:
       – The can rusted.
         → The (determiner) can (modal verb) rusted (verb) . (.)
       – Transformation: Modal → Noun when previous tag = DET
         → The (determiner) can (noun) rusted (verb) . (.)

    – Change tag X to tag Y when previous 1st, 2nd, or 3rd word is W
       • VBP → VB when one of previous 3 words = has
• The learning process is guided by a small number of templates
  (e.g., 26) to learn specific rules from the corpus
• Note how these rules sort of match linguistic intuition
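
To ground the templates, a sketch of one instantiated rule applied to a tagged sentence — the (word, tag)-pair representation is an assumption for illustration:

def apply_prevtag_rule(tagged, x, y, z):
    """Change tag X to tag Y when the previous tag is Z.
    tagged: list of (word, tag) pairs; returns a retagged copy."""
    out = list(tagged)
    for i in range(1, len(out)):
        word, tag = out[i]
        if tag == x and out[i - 1][1] == z:
            out[i] = (word, y)
    return out

sent = [("The", "DET"), ("can", "MD"), ("rusted", "VB"), (".", ".")]
print(apply_prevtag_rule(sent, x="MD", y="NN", z="DET"))
# [('The', 'DET'), ('can', 'NN'), ('rusted', 'VB'), ('.', '.')]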
Brill Algorithm (Overview)

• Assume you are given a training corpus G (for gold standard)
• First, create a tag-free version V of it … then do steps 1–4:
   1. Initial-state annotator: label every word token in V with the most
      likely tag for that word type from G.
   2. Consider every possible transformational rule: select the one that
      leads to the most improvement in V, using G to measure the error.
   3. Retag V based on this rule.
   4. Go back to 2, until there is no significant improvement in accuracy
      over the previous iteration.
• Notes:
   – As the algorithm proceeds, each successive rule covers fewer
     examples, but potentially more accurately
   – Some later rules may change tags changed by earlier rules
Error-driven method


• How does one learn the rules?
• The TBL method is error-driven
   – The rule which is learned on a given iteration is the one which
     reduces the error rate of the corpus the most, e.g.:
   – Rule 1 fixes 50 errors but introduces 25 more → net decrease of 25
   – Rule 2 fixes 45 errors but introduces 15 more → net decrease of 30
    → Choose Rule 2 in this case
• We set a stopping criterion, or threshold: once an iteration no longer
  reduces the error rate by a big enough margin, learning stops
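
The greedy, error-driven loop in sketch form — rules are functions from a tagged corpus to a retagged one, and each pass keeps the candidate with the largest net error reduction (candidate generation from templates is abstracted away):

def count_errors(current, gold):
    """Tokens where the current tags disagree with the gold tags."""
    return sum(1 for c, g in zip(current, gold) if c[1] != g[1])

def learn_tbl(current, gold, candidate_rules, threshold=1):
    learned = []
    while True:
        base = count_errors(current, gold)
        best = min(candidate_rules, key=lambda r: count_errors(r(current), gold))
        gain = base - count_errors(best(current), gold)   # net error reduction
        if gain < threshold:                              # stopping criterion
            break
        learned.append(best)
        current = best(current)                           # retag the corpus
    return learned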
Example of Error Reduction




                [Figure omitted — from Eric Brill (1995):
                Computational Linguistics, 21, 4, p. 7]
Rule ordering

• One rule is learned with every pass through the corpus.
   – The final output is the ordered sequence of learned rules
   – Unlike HMMs, such a representation allows a linguist to look
     through the rules and make sense of them


• Thus, the rules are learned iteratively and must be applied in
  an iterative fashion.
   – At one stage, it may make sense to change NN to VB after to
   – But at a later stage, it may make sense to change VB back to NN
     in the same context, e.g., if the current word is school
Example of Learned Rule Sequence

•   1. NN VB PREVTAG TO
     – to/TO race/NN → VB
•   2. VBP VB PREV1OR2OR3TAG MD
     – might/MD vanish/VBP → VB
•   3. NN VB PREV1OR2TAG MD
     – might/MD not/RB reply/NN → VB
•   4. VB NN PREV1OR2TAG DT
     – the/DT great/JJ feast/VB → NN
•   5. VBD VBN PREV1OR2OR3TAG VBZ
     – He/PP was/VBZ killed/VBD → VBN by/IN Chapman/NNP
Insights on TBL

• TBL takes a long time to train, but is relatively fast at tagging
  once the rules are learned
• The rules in the sequence may be decomposed into non-interacting
  subsets, e.g., to focus only on VB tagging, one need only look at
  the rules which affect it
• In cases where the data is sparse, the initial guess needs to be
  weak enough to allow for learning
• Rules become increasingly specific as you go down the
  sequence.
    – However, the more specific rules generally don’t overfit because
      they cover just a few cases
Relation between DT and TBL
DT and TBL



DT is a subset of TBL




                        1. Label with S
                        2. If X then S → A
                        3. S → B
DT is a proper subset of TBL

• There exists a problem that can be solved by TBL but not by a DT,
  for a fixed set of primitive queries.

• Ex: Given a sequence of characters
   – Classify a char based on its position
       • If pos % 4 == 0 then “yes” else “no”
   – Input attributes available: previous two chars
• Transformation list:
  – Label with S:                                  A/S A/S A/S A/S A/S A/S A/S

  – If there is no previous character, then S → F:
        A/F A/S A/S A/S A/S A/S A/S

  – If the char two to the left is labeled with F, then S → F:
        A/F A/S A/F A/S A/F A/S A/F

  – If the char two to the left is labeled with F, then F → S:
        A/F A/S A/S A/S A/F A/S A/S
  – F → yes
  – S → no
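
This transformation list can be run directly; a sketch (labels as a Python list, applied left to right so earlier changes are visible to later positions in the same pass):

def run_transformation_list(n=7):
    labels = ["S"] * n                        # 1. label everything with S
    labels[0] = "F"                           # 2. no previous char: S -> F
    for i in range(2, n):                     # 3. two-to-the-left is F: S -> F
        if labels[i - 2] == "F" and labels[i] == "S":
            labels[i] = "F"
    for i in range(2, n):                     # 4. two-to-the-left is F: F -> S
        if labels[i - 2] == "F" and labels[i] == "F":
            labels[i] = "S"
    return ["yes" if lab == "F" else "no" for lab in labels]

print(run_transformation_list())
# ['yes', 'no', 'no', 'no', 'yes', 'no', 'no']  -> positions 0 and 4, i.e. pos % 4 == 0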
DT and TBL

• TBL is more powerful than DT

• Extra power of TBL comes from
   – Transformations are applied in sequence
   – Results of previous transformations are visible to following
     transformations.
Brill Algorithm (More Detailed)

• 1. Label every word token with its most likely tag (based on lexical
     generation probabilities).
• 2. List the positions of tagging errors and their counts, by comparing
     with “truth” (T).
• 3. For each error position, consider each instantiation I of X, Y, and Z
     in the rule template.
     – If Y = T, increment improvements[I], else increment errors[I].
• 4. Pick the I which results in the greatest error reduction, and add it
     to the output.
     – e.g., VB NN PREV1OR2TAG DT improves on 98 errors, but produces
       18 new errors, so a net decrease of 80 errors.
• 5. Apply that I to the corpus.
• 6. Go to 2, unless the stopping criterion is reached.

Worked example:
     Most likely tag:      P(NN|race) = .98, P(VB|race) = .02
     Initial tagging:      Is/VBZ expected/VBN to/TO race/NN tomorrow/NN
     Rule template:        Change a word from tag X to tag Y when the
                           previous tag is Z
     Rule instantiation:   NN VB PREV1OR2TAG TO
     Applying this rule:   Is/VBZ expected/VBN to/TO race/VB tomorrow/NN
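
A sketch of step 3's bookkeeping for the PREVTAG template; unlike the slide, which tracks improvements[I] and errors[I] separately, this folds both into one net count per instantiation (X, Y, Z), matching the net-decrease arithmetic in step 4:

from collections import Counter

def best_prevtag_rule(current, gold):
    """current, gold: aligned (word, tag) lists.
    Template: change tag X to tag Y when the previous tag is Z."""
    net = Counter()               # net[(X, Y, Z)] = fixes - new errors
    tagset = {t for _, t in gold}
    for i in range(1, len(current)):
        x, z = current[i][1], current[i - 1][1]
        truth = gold[i][1]
        if x != truth:
            net[(x, truth, z)] += 1      # rule (x -> truth | z) fixes this token
        else:
            for y in tagset - {x}:
                net[(x, y, z)] -= 1      # any (x -> y | z) would break this token
    return max(net, key=net.get) if net else None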
Handling Unknown Words

• Can also use the Brill method to learn how to tag unknown words
• Instead of using surrounding words and tags, use affix info,
  capitalization, etc.
   – Guess NNP if capitalized, NN otherwise.
   – Or use the tag most common for words ending in the last 3 letters.
   – etc.
• TBL has also been applied to some parsing tasks

[Figure omitted: example learned rule sequence for unknown words.]
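
The initial-guess heuristics above, sketched in code (the 3-letter-suffix table is built from whatever tagged data is available; the defaults are the slide's NNP/NN guesses):

from collections import Counter, defaultdict

def build_suffix_tags(tagged_corpus):
    """Most common tag for each 3-letter word ending, from (word, tag) pairs."""
    by_suffix = defaultdict(Counter)
    for word, tag in tagged_corpus:
        by_suffix[word[-3:].lower()][tag] += 1
    return {s: c.most_common(1)[0][0] for s, c in by_suffix.items()}

def guess_tag(word, suffix_tags):
    if word[0].isupper():
        return "NNP"               # unknown capitalized word: proper noun
    return suffix_tags.get(word[-3:].lower(), "NN")   # else suffix vote, NN default

# guess_tag("blahblahous", suffix_tags)  -> likely "JJ" if -ous words are adjectives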
