This document is the slide transcript of a seminar on machine learning for natural language processing (NLP). It frames ambiguity resolution in NLP as a classification problem, presents three learning algorithms in detail (decision trees, AdaBoost, and Support Vector Machines), and surveys their application to NLP tasks such as part-of-speech tagging, word sense disambiguation, and PP-attachment disambiguation.
1. Seminar: Statistical NLP
Machine Learning for
Natural Language Processing
Lluís Màrquez
TALP Research Center
Llenguatges i Sistemes Informàtics
Universitat Politècnica de Catalunya
Girona, June 2003
2. Outline
• Machine Learning for NLP
• The Classification Problem
• Three ML Algorithms
• Applications to NLP
4. ML4NLP
Machine Learning
• There are many general-purpose definitions of Machine
Learning (or artificial learning):
Making a computer automatically acquire some
kind of knowledge from a concrete data domain
• Learners are computers: we study learning algorithms
• Resources are scarce: time, memory, data, etc.
• It has (almost) nothing to do with: Cognitive science,
neuroscience, theory of scientific discovery and research, etc.
• Biological plausibility is welcome but not the main goal
5. ML4NLP
Machine Learning
• Learning... but what for?
– To perform some particular task
– To react to environmental inputs
– Concept learning from data:
• modelling concepts underlying data
• predicting unseen observations
• compacting the knowledge representation
• knowledge discovery for expert systems
• We will concentrate on:
– Supervised inductive learning for classification
= discriminative learning
6. ML4NLP
Machine Learning
A more precise definition:
Obtaining a description of the concept in
some representation language that explains
the observations and helps predict new
instances of the same distribution
• What to read?
– Machine Learning (Mitchell, 1997)
7. ML4NLP
Empirical NLP
1990s: application of Machine Learning (ML)
techniques to NLP problems
• Lexical and structural ambiguity problems
(classification problems):
– Word selection (SR, MT)
– Part-of-speech tagging
– Semantic ambiguity (polysemy)
– Prepositional phrase attachment
– Reference ambiguity (anaphora)
– etc.
• What to read? Foundations of Statistical Natural
Language Processing (Manning & Schütze, 1999)
8. ML4NLP
NLP “classification” problems
• Ambiguity is a crucial problem for natural
language understanding/processing.
Ambiguity Resolution = Classification
He was shot in the hand as he chased
the robbers in the back street
(The Wall Street Journal Corpus)
9. ML4NLP
NLP “classification” problems
• Morpho-syntactic ambiguity
He was shot in the hand as he chased
the robbers in the back street
[alternative POS tags — JJ, NN, VB — were shown under the ambiguous words]
(The Wall Street Journal Corpus)
10. ML4NLP
NLP “classification” problems
• Morpho-syntactic ambiguity:
Part of Speech Tagging
He was shot in the hand as he chased
the robbers in the back street
[alternative POS tags — JJ, NN, VB — were shown under the ambiguous words]
(The Wall Street Journal Corpus)
11. ML4NLP
NLP “classification” problems
• Semantic (lexical) ambiguity
He was shot in the hand as he chased
the robbers in the back street
[candidate senses shown for “hand”: body-part vs. clock-part]
(The Wall Street Journal Corpus)
12. ML4NLP
NLP “classification” problems
• Semantic (lexical) ambiguity:
Word Sense Disambiguation
He was shot in the hand as he chased
the robbers in the back street
[candidate senses shown for “hand”: body-part vs. clock-part]
(The Wall Street Journal Corpus)
13. ML4NLP
NLP “classification” problems
• Structural (syntactic) ambiguity
He was shot in the hand as he chased
the robbers in the back street
(The Wall Street Journal Corpus)
15. ML4NLP
NLP “classification” problems
• Structural (syntactic) ambiguity:
PP-attachment disambiguation
He was shot in the hand as he (chased
(the robbers)NP (in the back street)PP)
(The Wall Street Journal Corpus)
16. Outline
• Machine Learning for NLP
• The Classification Problem
• Three ML Algorithms in detail
• Applications to NLP
17. Classification
Feature Vector Classification
(AI perspective)
• An instance is a vector x = <x1, …, xn> whose components,
called features (or attributes), are discrete or real-valued.
• Let X be the space of all possible instances.
• Let Y = {y1, …, ym} be the set of categories (or classes).
• The goal is to learn an unknown target function f : X → Y
• A training example is an instance x ∈ X labelled
with the correct value of f(x), i.e., a pair <x, f(x)>
• Let D be the set of all training examples.
18. Classification
Feature Vector Classification
• The hypotheses space, H, is the set of functions h: X → Y
that the learner can consider as possible definitions.
• The goal is to find a function h ∈ H
such that for every pair <x, f(x)> ∈ D,
h(x) = f(x)
19. Classification
An Example
Example  SIZE   COLOR  SHAPE     CLASS
1        small  red    circle    positive
2        big    red    circle    positive
3        small  red    triangle  negative
4        big    blue   circle    negative

Rules:
  (COLOR=red) ∧ (SHAPE=circle) ⇒ positive
  otherwise ⇒ negative

Decision Tree:
  COLOR = red:
    SHAPE = circle: positive
    SHAPE = triangle: negative
  COLOR = blue: negative
20. Classification
An Example
Example  SIZE   COLOR  SHAPE     CLASS
1        small  red    circle    positive
2        big    red    circle    positive
3        small  red    triangle  negative
4        big    blue   circle    negative

Rules:
  (SIZE=small) ∧ (SHAPE=circle) ⇒ positive
  (SIZE=big) ∧ (COLOR=red) ⇒ positive
  otherwise ⇒ negative

Decision Tree:
  SIZE = small:
    SHAPE = circle: positive
    SHAPE = triangle: negative
  SIZE = big:
    COLOR = red: positive
    COLOR = blue: negative
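For illustration only (this sketch is not part of the original slides), the second tree can be written directly as a classification function and checked against the four training examples:

# Minimal sketch: the decision tree above as a plain Python function.
def classify(size, color, shape):
    if size == "small":
        return "positive" if shape == "circle" else "negative"
    else:  # size == "big"
        return "positive" if color == "red" else "negative"

training_data = [
    ("small", "red", "circle", "positive"),
    ("big", "red", "circle", "positive"),
    ("small", "red", "triangle", "negative"),
    ("big", "blue", "circle", "negative"),
]

# The tree is consistent with the training set: h(x) = f(x) for every example.
assert all(classify(s, c, sh) == y for s, c, sh, y in training_data)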
21. Classification
Some important concepts
• Inductive Bias
“Any means that a classification learning system uses to choose
between two functions that are both consistent with the training
data is called inductive bias” (Mooney & Cardie, 99)
– Language / Search bias
Decision Tree:
  COLOR = red:
    SHAPE = circle: positive
    SHAPE = triangle: negative
  COLOR = blue: negative
22. Classification
Some important concepts
• Inductive Bias
• Training error and generalization error
• Generalization ability and overfitting
• Batch Learning vs. on-line Learning
• Symbolic vs. statistical Learning
• Propositional vs. first-order learning
24. Classification
The Classification Setting
Class, Point, Example, Data Set, ...
(CoLT/SLT perspective)
• Input Space: X ⊆ R^n
• (binary) Output Space: Y = {+1, -1}
• A point, pattern or instance:
x ∈ X, x = (x1, x2, …, xn)
• Example: (x, y) with x ∈ X, y ∈ Y
• Training Set: a set of m examples generated i.i.d.
according to an unknown distribution P(x,y):
S = {(x1, y1), …, (xm, ym)} ∈ (X × Y)^m
25. Classification
The Classification Setting
Learning, Error, ...
• The hypotheses space, H, is the set of functions
h: X → Y that the learner can consider as possible
definitions. In SVMs they are of the form:
h(x) = Σ_{i=1..n} w_i φ_i(x) + b
• The goal is to find a function h ∈ H such
that the expected misclassification error on new
examples, also drawn from P(x,y), is minimal
(Risk Minimization, RM)
26. Classification
The Classification Setting
Learning, Error, ...
• Expected error (risk):
R(h) = ∫ loss(h(x), y) dP(x, y)
• Problem: P itself is unknown. Only the training
examples are known ⇒ an induction principle is needed
• Empirical Risk Minimization (ERM): find the
function h ∈ H for which the training
error (empirical risk) is minimal:
R_emp(h) = (1/m) Σ_{i=1..m} loss(h(x_i), y_i)
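As an illustration (not from the slides): under 0/1 loss the empirical risk is simply the fraction of misclassified training examples. A minimal Python sketch:

# Minimal sketch: empirical risk (training error) under 0/1 loss.
def empirical_risk(h, examples):
    """examples: list of (x, y) pairs; h: a function mapping x to a label."""
    losses = [0 if h(x) == y else 1 for x, y in examples]
    return sum(losses) / len(losses)

# Toy usage with a trivial hypothesis on a tiny binary data set.
data = [((1.0, 2.0), +1), ((2.0, 0.5), +1), ((-1.0, -1.0), -1)]
h = lambda x: +1 if x[0] + x[1] > 0 else -1
print(empirical_risk(h, data))  # 0.0: h classifies all three examples correctly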
27. Classification
The Classification Setting
Error, Over(under)fitting,...
• Low training error ⇒ low true error?
• The overfitting dilemma:
[figure: fitting a curve to data, ranging from underfitting to overfitting]
• Trade-off between training error and complexity
• Different learning biases can be used
28. Outline
• Machine Learning for NLP
• The Classification Problem
• Three ML Algorithms
• Applications to NLP
29. Outline
• Machine Learning for NLP
• The Classification Problem
• Three ML Algorithms
−Decision Trees
−AdaBoost
−Support Vector Machines
• Applications to NLP
30. Algorithms
Learning Paradigms
• Statistical learning:
– HMM, Bayesian Networks, ME, CRF, etc.
• Traditional methods from Artificial Intelligence
(ML, AI)
– Decision trees/lists, exemplar-based learning, rule
induction, neural networks, etc.
• Methods from Computational Learning
Theory (CoLT/SLT)
– Winnow, AdaBoost, SVM’s, etc.
31. Algorithms
Learning Paradigms
• Classifier combination:
– Bagging, Boosting, Randomization, ECOC,
Stacking, etc.
• Semi-supervised learning: learning from
labelled and unlabelled examples
– Bootstrapping, EM, Transductive learning
(SVM’s, AdaBoost), Co-Training, etc.
• etc.
32. Algorithms
Decision Trees
• Decision trees are a way to represent rules underlying
training data, with hierarchical structures that
recursively partition the data.
• They have been used by many research communities
(Pattern Recognition, Statistics, ML, etc.) for data
exploration with some of the following purposes:
Description, Classification, and Generalization.
• From a machine-learning perspective: Decision Trees
are n-ary branching trees that represent classification
rules for classifying the objects of a certain domain into
a set of mutually exclusive classes
33. Algorithms
Decision Trees
• Acquisition:
Top-Down Induction of Decision Trees
(TDIDT)
• Systems:
CART (Breiman et al. 84),
ID3, C4.5, C5.0 (Quinlan 86,93,98),
ASSISTANT, ASSISTANT-R (Cestnik et al. 87)
(Kononenko et al. 95)
etc.
34. Algorithms
An Example
[figure: a generic n-ary decision tree — internal nodes test features A1, A2, A3, A5, ...,
branches correspond to feature values v1, ..., v7, and leaves assign classes C1, C2, C3]

Decision Tree (from the previous example):
  SIZE = small:
    SHAPE = circle: pos
    SHAPE = triangle: neg
  SIZE = big:
    COLOR = red: pos
    COLOR = blue: neg
35. Algorithms
Learning Decision Trees
Training:  Training Set + TDIDT = DT
Test:      Example + DT = Class
36. Algorithms
General Induction Algorithm
function TDIDT (X: set-of-examples; A: set-of-features)
var
  tree1, tree2: decision-tree;
  X': set-of-examples;
  A': set-of-features
end-var
if (stopping_criterion(X)) then
  tree1 := create_leaf_tree(X)
else
  amax := feature_selection(X, A);
  tree1 := create_tree(X, amax);
  for-all val in values(amax) do
    X' := select_examples(X, amax, val);
    A' := A - {amax};
    tree2 := TDIDT(X', A');
    tree1 := add_branch(tree1, tree2, val)
  end-for
end-if
return (tree1)
end-function
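Below is a minimal runnable Python sketch of the same TDIDT scheme (my own illustration, not code from the seminar); it uses the majority class as the leaf rule and takes the feature_selection function as a parameter:

from collections import Counter

def tdidt(examples, labels, features, feature_selection):
    # Stopping criterion: pure node or no features left -> leaf with the majority class.
    if len(set(labels)) == 1 or not features:
        return Counter(labels).most_common(1)[0][0]
    best = feature_selection(examples, labels, features)      # e.g. highest Information Gain
    default = Counter(labels).most_common(1)[0][0]
    tree = {"feature": best, "branches": {}, "default": default}
    for val in {ex[best] for ex in examples}:
        keep = [i for i, ex in enumerate(examples) if ex[best] == val]
        tree["branches"][val] = tdidt([examples[i] for i in keep],
                                      [labels[i] for i in keep],
                                      features - {best},
                                      feature_selection)
    return tree

def classify(tree, example):
    # Follow branches until a leaf (a plain class label) is reached.
    while isinstance(tree, dict):
        tree = tree["branches"].get(example[tree["feature"]], tree["default"])
    return tree

# Toy usage with the SIZE/COLOR/SHAPE example of the earlier slides,
# picking features alphabetically just to keep the sketch self-contained:
X = [{"SIZE": "small", "COLOR": "red",  "SHAPE": "circle"},
     {"SIZE": "big",   "COLOR": "red",  "SHAPE": "circle"},
     {"SIZE": "small", "COLOR": "red",  "SHAPE": "triangle"},
     {"SIZE": "big",   "COLOR": "blue", "SHAPE": "circle"}]
y = ["positive", "positive", "negative", "negative"]
dt = tdidt(X, y, {"COLOR", "SHAPE", "SIZE"}, lambda ex, lab, feats: min(feats))
print(classify(dt, {"SIZE": "small", "COLOR": "red", "SHAPE": "circle"}))  # positive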
38. Algorithms
Feature Selection Criteria
• Functions derived from Information Theory:
– Information Gain, Gain Ratio (Quinlan 86)
• Functions derived from Distance Measures
– Gini Diversity Index (Breiman et al. 84)
– RLM (López de Mántaras 91)
• Statistically-based
– Chi-square test (Sestito & Dillon 94)
– Symmetrical Tau (Zhou & Dillon 91)
• RELIEFF-IG: variant of RELIEFF (Kononenko 94)
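For concreteness (again my own illustration, not from the slides), Information Gain can be computed as the reduction in label entropy obtained by splitting on a feature; this function can be plugged into the TDIDT sketch above as feature_selection:

import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(examples, labels, feature):
    """Label entropy minus the weighted entropy after splitting on `feature`."""
    n = len(labels)
    split = Counter(ex[feature] for ex in examples)
    remainder = 0.0
    for val, count in split.items():
        sub = [y for ex, y in zip(examples, labels) if ex[feature] == val]
        remainder += (count / n) * entropy(sub)
    return entropy(labels) - remainder

# Usable as the feature_selection function of the TDIDT sketch:
# best = max(features, key=lambda a: information_gain(examples, labels, a))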
39. Algorithms
Extensions of DTs
(Murthy 95)
• Pruning (pre/post)
• Minimize the effect of the greedy approach:
lookahead
• Non-linear splits
• Combination of multiple models
• Incremental learning (on-line)
• etc.
40. Algorithms
Decision Trees and NLP
• Speech processing (Bahl et al. 89; Bakiri & Dietterich 99)
• POS Tagging (Cardie 93, Schmid 94b; Magerman 95; Màrquez &
Rodríguez 95,97; Màrquez et al. 00)
• Word sense disambiguation (Brown et al. 91; Cardie 93;
Mooney 96)
• Parsing (Magerman 95,96; Haruno et al. 98,99)
• Text categorization (Lewis & Ringuette 94; Weiss et al. 99)
• Text summarization (Mani & Bloedorn 98)
• Dialogue act tagging (Samuel et al. 98)
41. Algorithms
Decision Trees and NLP
• Noun phrase coreference
(Aone & Bennett 95; McCarthy & Lehnert 95)
• Discourse analysis in information extraction
(Soderland & Lehnert 94)
• Cue phrase identification in text and speech
(Litman 94; Siegel & McKeown 94)
• Verb classification in Machine Translation
(Tanaka 96; Siegel 97)
42. Algorithms
Decision Trees: pros&cons
• Advantages
– Acquires symbolic knowledge in an
understandable way
– Very well-studied ML algorithms and variants
– Can be easily translated into rules
– Existence of available software: C4.5, C5.0, etc.
– Can be easily integrated into an ensemble
43. Algorithms
Decision Trees: pros&cons
• Drawbacks
– Computationally expensive when scaling to large
natural language domains: training examples,
features, etc.
– Data sparseness and data fragmentation: the problem
of small disjuncts ⇒ probability estimation
– DTs are a model with high variance (unstable)
– Tendency to overfit the training data: pruning is necessary
– Requires quite a big effort in tuning the model
44. Algorithms
Boosting algorithms
• Idea
“to combine many simple and moderately accurate
hypotheses (weak classifiers) into a single and highly
accurate classifier”
• AdaBoost (Freund & Schapire 95) has been
theoretically and empirically studied extensively
• Many other variants and extensions (1997-2003)
http://www.lsi.upc.es/~lluism/seminari/ml&nlp.html
45. Algorithms
AdaBoost: general scheme
[figure: general AdaBoost scheme]
F(h1, h2, ..., hT): linear combination of the weak hypotheses h1, h2, ..., hT.
Each weak learner is trained on a training set TS1, TS2, ..., TST weighted by a
probability distribution D1, D2, ..., DT, which is updated after every round.
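A compact sketch of binary AdaBoost with decision stumps as weak learners, following the standard formulation (illustration only; the stump search, the alpha formula and the distribution update are the usual textbook choices, not code from the seminar):

import numpy as np

def train_stump(X, y, w):
    # Pick the (feature, threshold, polarity) decision stump with lowest weighted error.
    best = None
    for j in range(X.shape[1]):
        for thr in np.unique(X[:, j]):
            for pol in (+1, -1):
                pred = np.where(pol * (X[:, j] - thr) >= 0, 1, -1)
                err = w[pred != y].sum()
                if best is None or err < best[0]:
                    best = (err, j, thr, pol)
    return best

def adaboost(X, y, T=10):
    # y must contain labels in {+1, -1}.
    m = len(y)
    w = np.full(m, 1.0 / m)                      # D1: uniform distribution over examples
    ensemble = []
    for _ in range(T):
        err, j, thr, pol = train_stump(X, y, w)
        err = max(err, 1e-12)                    # avoid division by zero for a perfect stump
        alpha = 0.5 * np.log((1.0 - err) / err)  # weight of this weak hypothesis
        pred = np.where(pol * (X[:, j] - thr) >= 0, 1, -1)
        w = w * np.exp(-alpha * y * pred)        # increase weight of misclassified examples
        w = w / w.sum()
        ensemble.append((alpha, j, thr, pol))
    return ensemble

def predict(ensemble, X):
    # F(h1, ..., hT): sign of the weighted linear combination of weak hypotheses.
    F = sum(a * np.where(p * (X[:, j] - t) >= 0, 1, -1) for a, j, t, p in ensemble)
    return np.sign(F)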
47. Algorithms
AdaBoost: example
Weak hypotheses = vertical/horizontal hyperplanes
48. Algorithms
AdaBoost: round 1
49. Algorithms
AdaBoost: round 2
50. Algorithms
AdaBoost: round 3
51. Algorithms
Combined Hypothesis
www.research.att.com/~yoav/adaboost
52. Algorithms
AdaBoost and NLP
• POS Tagging (Abney et al. 99; Màrquez 99)
• Text and Speech Categorization
(Schapire & Singer 98; Schapire et al. 98; Weiss et al. 99)
• PP-attachment Disambiguation (Abney et al. 99)
• Parsing (Haruno et al. 99)
• Word Sense Disambiguation (Escudero et al. 00, 01)
• Shallow parsing (Carreras & Màrquez, 01a; 02)
• Email spam filtering (Carreras & Màrquez, 01b)
• Term Extraction (Vivaldi, et al. 01)
53. Algorithms
AdaBoost: pros&cons
+ Easy to implement and few parameters to set
+ Time and space grow linearly with number of
examples. Ability to manage very large learning
problems
+ Does not constrain explicitly the complexity of the
learner
+ Naturally combines feature selection with learning
+ Has been successfully applied to many practical
problems
54. Algorithms
AdaBoost: pros&cons
± Seems to be rather robust to overfitting
(number of rounds) but sensitive to noise
± Performance is very good when there are
relatively few relevant terms (features)
– Can perform poorly when there is insufficient
training data relative to the complexity of the
base classifiers, or when the training errors of the
base classifiers become too large too quickly
55. Algorithms
SVM: A General Definition
• “Support Vector Machines (SVM) are learning
systems that use a hypothesis space of linear
functions in a high dimensional feature space,
trained with a learning algorithm from optimisation
theory that implements a learning bias derived
from statistical learning theory”.
(Cristianini & Shawe-Taylor, 2000)
56. Algorithms
SVM: A General Definition
• “Support Vector Machines (SVM) are learning
systems that use a hypothesis space of linear
functions in a high dimensional feature space,
trained with a learning algorithm from optimisation
theory that implements a learning bias derived
from statistical learning theory”.
(Cristianini & Shawe-Taylor, 2000)
Key Concepts
57. Algorithms
Linear Classifiers
• Hyperplanes in R^N.
• Defined by a weight vector (w) and a threshold (b).
• They induce a classification rule:
h(x) = sign(Σ_{i=1..N} w_i x_i + b), i.e.
h(x) = +1 if Σ_{i=1..N} w_i x_i + b ≥ 0, and -1 otherwise
[figure: positive (+) and negative (-) points separated by the hyperplane defined by w and b]
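This rule translates directly into code; a tiny sketch (not from the slides):

import numpy as np

def linear_classifier(w, b):
    """Return h(x) = sign(w·x + b), mapping the boundary case to +1."""
    return lambda x: 1 if np.dot(w, x) + b >= 0 else -1

h = linear_classifier(w=np.array([1.0, -2.0]), b=0.5)
print(h(np.array([3.0, 1.0])))   # w·x + b = 1.5  -> +1
print(h(np.array([0.0, 2.0])))   # w·x + b = -3.5 -> -1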
59. Algorithms
Optimal Hyperplane:
Geometric Intuition
[figure: the maximal margin hyperplane; the training points closest to it are the support vectors]
60. Algorithms
Linearly separable data
geometric margin = 2 / ||w||
Maximizing the margin is equivalent to minimizing ||w||^2 (a Quadratic
Programming problem) subject to the constraints:
y_i (w·x_i + b) ≥ 1   for all i = 1, …, l
61. Algorithms
Non-separable case (soft margin)
ξ_1, …, ξ_l: positive slack variables for introducing costs
Minimize ||w||^2 + C Σ_{i=1..l} ξ_i subject to the constraints:
y_i (w·x_i + b) ≥ 1 - ξ_i   for all i = 1, …, l
ξ_i ≥ 0                     for all i = 1, …, l
62. Algorithms
Non-linear SVMs
• Implicit mapping into feature space via kernel functions
Φ: X → F                                    Non-linear mapping
f(x) = Σ_{i=1..n} w_i φ_i(x) + b            Set of hypotheses
f(x) = Σ_{i=1..l} α_i y_i Φ(x_i)·Φ(x) + b   Dual formulation
K(x, z) = Φ(x)·Φ(z)                         Kernel function
f(x) = Σ_{i=1..l} α_i y_i K(x_i, x) + b     Evaluation
63. Algorithms
Non-linear SVMs
• Kernel functions
– Must be efficiently computable
– Characterization via Mercer’s theorem
– One of the curious facts about using a kernel is
that we do not need to know the underlying
feature map in order to be able to learn in the
feature space! (Cristianini & Shawe-Taylor, 2000)
– Examples: polynomials, Gaussian radial basis
functions, two-layer sigmoidal neural networks,
etc.
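A small sketch (my own illustration) of two common kernels and of the kernelized evaluation f(x) = Σ α_i y_i K(x_i, x) + b; the α_i, y_i, support vectors and b are assumed to come from a previously solved SVM optimisation:

import numpy as np

def polynomial_kernel(x, z, degree=3, c=1.0):
    """K(x, z) = (x·z + c)^degree"""
    return (np.dot(x, z) + c) ** degree

def rbf_kernel(x, z, gamma=0.5):
    """Gaussian radial basis function: K(x, z) = exp(-gamma * ||x - z||^2)"""
    return np.exp(-gamma * np.sum((x - z) ** 2))

def svm_decision(x, support_vectors, alphas, labels, b, kernel):
    """f(x) = sum_i alpha_i * y_i * K(x_i, x) + b, summed over the support vectors only."""
    return sum(a * y * kernel(sv, x)
               for a, y, sv in zip(alphas, labels, support_vectors)) + b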
64. Algorithms
Non-linear SVMs
Degree 3 polynomial kernel
[figure: decision boundaries on a linearly separable and a linearly non-separable data set]
65. Algorithms
Toy Examples
• All examples have been run with the 2D graphic
interface of LIBSVM (Chang and Lin, National Taiwan
University)
“LIBSVM is an integrated software for support vector classification
(C-SVC, nu-SVC), regression (epsilon-SVR, nu-SVR) and distribution
estimation (one-class SVM). It supports multi-class classification. The
basic algorithm is a simplification of both SMO by Platt and SVMLight
by Joachims. It is also a simplification of the modification 2 of SMO by
Keerthi et al. Our goal is to help users from other fields to easily use
SVM as a tool. LIBSVM provides a simple interface where users can
easily link it with their own programs…”
• Available from: www.csie.ntu.edu.tw/~cjlin/libsvm
(it includes a Web-integrated demo tool)
66. Algorithms
Toy Examples (I)
Linearly separable data set
Linear SVM
Maximal margin Hyperplane
What happens if we add a blue training
example here [point marked in the figure]?
67. Algorithms
Toy Examples (I)
(still) Linearly separable
data set
Linear SVM
High value of C parameter
Maximal margin Hyperplane
The example is
correctly classified
68. Algorithms
Toy Examples (I)
(still) Linearly separable
data set
Linear SVM
Low value of C parameter
Trade-off between: margin
and training error
The example is
now a bounded SV
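The effect of C can be reproduced with any SVM package; a small sketch using scikit-learn's SVC (an assumption of this illustration — the slides use the LIBSVM 2D demo instead) on a toy data set with one point close to the other class:

import numpy as np
from sklearn.svm import SVC

# Toy 2D data: two groups plus one extra point near the opposite class.
X = np.array([[1, 1], [2, 1], [1, 2],              # class +1
              [5, 5], [6, 5], [5, 6], [3, 2]])     # class -1 (last point close to the +1 group)
y = np.array([1, 1, 1, -1, -1, -1, -1])

hard = SVC(kernel="linear", C=1000.0).fit(X, y)  # high C: tries to classify every point correctly
soft = SVC(kernel="linear", C=0.1).fit(X, y)     # low C: wider margin; the point may become a bounded SV

print(hard.n_support_, soft.n_support_)          # number of support vectors per class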
69. Algorithms
Toy Examples (II)
70. Algorithms
Toy Examples (II)
71. Algorithms
Toy Examples (II)
72. Algorithms
Toy Examples (III)
73. Algorithms
SVM: Summary
• SVMs introduced in COLT’92 (Boser, Guyon, & Vapnik,
1992). Great development since then
• Kernel-induced feature spaces: SVMs work efficiently
in very high dimensional feature spaces (+)
• Learning bias: maximal margin optimisation.
Reduces the danger of overfitting. Generalization
bounds for SVMs (+)
• Compact representation of the induced hypothesis.
The solution is sparse in terms of SVs (+)
74. Algorithms
SVM: Summary
• Due to Mercer’s conditions on the kernels, the
optimisation problems are convex. No local minima (+)
• Optimisation theory guides the implementation.
Efficient learning (+)
• Mainly for classification but also for regression,
density estimation, clustering, etc.
• Success in many real-world applications: OCR, vision,
bioinformatics, speech recognition, NLP: TextCat, POS
tagging, chunking, parsing, etc. (+)
• Parameter tuning (–). Implications in convergence
times, sparsity of the solution, etc.
75. Outline
• Machine Learning for NLP
• The Classification Problem
• Three ML Algorithms
• Applications to NLP
76. Applications
NLP problems
• Warning! We will not focus on
final NLP applications, but on
intermediate tasks...
• We will classify the NLP tasks
according to their (structural)
complexity
77. Applications
NLP problems: structural
complexity
• Decisional problems
− Text Categorization, Document filtering, Word
Sense Disambiguation, etc.
• Sequence tagging and detection of
sequential structures
− POS tagging, Named Entity extraction,
syntactic chunking, etc.
• Hierarchical structures
− Clause detection, full parsing, IE of complex
concepts, composite Named Entities, etc.
78. Applications
POS tagging
• Morpho-syntactic ambiguity:
Part of Speech Tagging
He was shot in the hand as he chased
the robbers in the back street
[alternative POS tags — JJ, NN, VB — were shown under the ambiguous words]
(The Wall Street Journal Corpus)
79. Applications
POS tagging
“preposition-adverb” tree

root:                               P(IN)=0.81, P(RB)=0.19
Word Form = “As”/“as” (vs. others): P(IN)=0.83, P(RB)=0.17
  tag(+1) = RB (vs. others):        P(IN)=0.13, P(RB)=0.87
    tag(+2) = IN (leaf):            P(IN)=0.013, P(RB)=0.987

Probabilistic interpretation:
P(RB | word=“A/as” ∧ tag(+1)=RB ∧ tag(+2)=IN) = 0.987
P(IN | word=“A/as” ∧ tag(+1)=RB ∧ tag(+2)=IN) = 0.013
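Such a statistical decision tree can be stored as nested dictionaries and walked with the context features; a hypothetical sketch (the encoding and helper below are my own, only the probabilities come from the slide):

# Hypothetical encoding of the "preposition-adverb" tree shown above.
tree = {
    "test": "word_form",
    "as": {
        "test": "tag+1",
        "RB": {
            "test": "tag+2",
            "IN": {"leaf": {"IN": 0.013, "RB": 0.987}},
        },
    },
}

def class_probs(node, context, default=None):
    """Walk the tree with a context dict; return the class distribution of the node reached."""
    while "leaf" not in node:
        child = node.get(context.get(node["test"]))
        if child is None:
            return default          # fell off an "others" branch not encoded here
        node = child
    return node["leaf"]

print(class_probs(tree, {"word_form": "as", "tag+1": "RB", "tag+2": "IN"}))
# {'IN': 0.013, 'RB': 0.987}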
80. Applications
POS tagging
“preposition-adverb” tree (same tree as on the previous slide)

Collocations captured by the path word=“as” ∧ tag(+1)=RB ∧ tag(+2)=IN:
“as_RB much_RB as_IN”
“as_RB soon_RB as_IN”
“as_RB well_RB as_IN”
81. Applications
POS tagging
RTT (Màrquez & Rodríguez 97)
[figure: Raw text → Morphological analysis → Disambiguation loop
(Classify → Update → Filter, driven by the Language Model), repeated until
the stop condition holds → Tagged text]
See also: A Sequential Model for Multi-class Classification:
NLP/POS Tagging (Even-Zohar & Roth, 01)
82. Applications
POS tagging
STT (Màrquez & Rodríguez 97)
[figure: Raw text → Morphological analysis → Disambiguation with the Viterbi
algorithm, using a Language Model (lexical probs. + contextual probs.) → Tagged text]
See also: The Use of Classifiers in Sequential Inference:
Chunking (Punyakanok & Roth, 00)
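A compact Viterbi sketch for this kind of tagger (illustration only; the lexical and contextual probability tables are assumed to be given as dictionaries, they are not the seminar's actual model):

import math

def viterbi(words, tags, lexical, contextual, start="<s>"):
    # lexical[(word, tag)]   -> P(word | tag)
    # contextual[(prev, tag)] -> P(tag | prev)
    # Missing entries get a tiny smoothing value so the log is always defined.
    def lp(table, key):
        return math.log(table.get(key, 1e-12))
    V = [{t: (lp(contextual, (start, t)) + lp(lexical, (words[0], t)), [t]) for t in tags}]
    for w in words[1:]:
        row = {}
        for t in tags:
            prev_scores = {p: V[-1][p][0] + lp(contextual, (p, t)) for p in tags}
            best_prev = max(prev_scores, key=prev_scores.get)
            score = prev_scores[best_prev] + lp(lexical, (w, t))
            row[t] = (score, V[-1][best_prev][1] + [t])
        V.append(row)
    best_final = max(V[-1], key=lambda t: V[-1][t][0])
    return V[-1][best_final][1]   # most probable tag sequence

# With suitable lexical/contextual tables, viterbi("as soon as".split(), ["IN", "RB"], lex, ctx)
# would return a sequence such as ["RB", "RB", "IN"].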
83. Applications
Detection of sequential and
hierarchical structures
• Named Entity recognition
• Clause detection
84. Conclusions
Summary/conclusions
• We have briefly outlined:
−The ML setting: “supervised learning for
classification”
−Three concrete machine learning
algorithms
−How to apply them to solve intermediate
NLP tasks
85. Conclusions
Summary/conclusions
• Any ML algorithm for NLP should be:
– Robust to noise and outliers
– Efficient in large feature/example spaces
– Adaptive to new/changing domains:
portability, tuning, etc.
– Able to take advantage of unlabelled
examples: semi-supervised learning
86. Conclusions
Summary/conclusions
• Statistical and ML-based Natural
Language Processing is a very active
and multidisciplinary area of research
87. Conclusions
Some current research lines
• Appropriate learning paradigms for all kinds of
NLP problems: TiMBL (DBZ 99), TBEDL (Brill 95), ME
(Ratnaparkhi 98), SNoW (Roth 98), CRF (Pereira & Singer 02).
• Definition of an adequate (and task-specific)
feature space: mapping from the input space to a
high dimensional feature space, kernels, etc.
• Resolution of complex NLP problems:
inference with classifiers + constraint satisfaction
• etc.
88. Conclusions
Bibliography
• You may find additional information at:
http://www.lsi.upc.es/~lluism/
tesi.html
publicacions/pubs.html
cursos/talks.html
cursos/MLandNL.html
cursos/emnlp1.html
• This talk at:
http://www.lsi.upc.es/~lluism/udg03.ppt.gz
89. Seminar: Statistical NLP
Machine Learning for
Natural Language Processing
Lluís Màrquez
TALP Research Center
Llenguatges i Sistemes Informàtics
Universitat Politècnica de Catalunya
Girona, June 2003