Conditional Random Fields: Probabilistic Models
for Segmenting and Labeling Sequence Data


John Lafferty†∗                                    lafferty@cs.cmu.edu
Andrew McCallum∗†                                  mccallum@whizbang.com
Fernando Pereira∗‡                                 fpereira@whizbang.com

∗ WhizBang! Labs–Research, 4616 Henry Street, Pittsburgh, PA 15213 USA
† School of Computer Science, Carnegie Mellon University, Pittsburgh, PA 15213 USA
‡ Department of Computer and Information Science, University of Pennsylvania, Philadelphia, PA 19104 USA


Abstract

We present conditional random fields, a framework for building probabilistic models to segment and label sequence data. Conditional random fields offer several advantages over hidden Markov models and stochastic grammars for such tasks, including the ability to relax strong independence assumptions made in those models. Conditional random fields also avoid a fundamental limitation of maximum entropy Markov models (MEMMs) and other discriminative Markov models based on directed graphical models, which can be biased towards states with few successor states. We present iterative parameter estimation algorithms for conditional random fields and compare the performance of the resulting models to HMMs and MEMMs on synthetic and natural-language data.

1. Introduction

The need to segment and label sequences arises in many different problems in several scientific fields. Hidden Markov models (HMMs) and stochastic grammars are well understood and widely used probabilistic models for such problems. In computational biology, HMMs and stochastic grammars have been successfully used to align biological sequences, find sequences homologous to a known evolutionary family, and analyze RNA secondary structure (Durbin et al., 1998). In computational linguistics and computer science, HMMs and stochastic grammars have been applied to a wide variety of problems in text and speech processing, including topic segmentation, part-of-speech (POS) tagging, information extraction, and syntactic disambiguation (Manning & Schütze, 1999).

HMMs and stochastic grammars are generative models, assigning a joint probability to paired observation and label sequences; the parameters are typically trained to maximize the joint likelihood of training examples. To define a joint probability over observation and label sequences, a generative model needs to enumerate all possible observation sequences, typically requiring a representation in which observations are task-appropriate atomic entities, such as words or nucleotides. In particular, it is not practical to represent multiple interacting features or long-range dependencies of the observations, since the inference problem for such models is intractable.

This difficulty is one of the main motivations for looking at conditional models as an alternative. A conditional model specifies the probabilities of possible label sequences given an observation sequence. Therefore, it does not expend modeling effort on the observations, which at test time are fixed anyway. Furthermore, the conditional probability of the label sequence can depend on arbitrary, non-independent features of the observation sequence without forcing the model to account for the distribution of those dependencies. The chosen features may represent attributes at different levels of granularity of the same observations (for example, words and characters in English text), or aggregate properties of the observation sequence (for instance, text layout). The probability of a transition between labels may depend not only on the current observation, but also on past and future observations, if available. In contrast, generative models must make very strict independence assumptions on the observations, for instance conditional independence given the labels, to achieve tractability.

Maximum entropy Markov models (MEMMs) are conditional probabilistic sequence models that attain all of the above advantages (McCallum et al., 2000). In MEMMs, each source state¹ has an exponential model that takes the observation features as input, and outputs a distribution over possible next states. These exponential models are trained by an appropriate iterative scaling method in the maximum entropy framework. Previously published experimental results show MEMMs increasing recall and doubling precision relative to HMMs in a FAQ segmentation task.

MEMMs and other non-generative finite-state models based on next-state classifiers, such as discriminative Markov models (Bottou, 1991), share a weakness we call here the label bias problem: the transitions leaving a given state compete only against each other, rather than against all other transitions in the model. In probabilistic terms, transition scores are the conditional probabilities of possible next states given the current state and the observation sequence. This per-state normalization of transition scores implies a "conservation of score mass" (Bottou, 1991) whereby all the mass that arrives at a state must be distributed among the possible successor states. An observation can affect which destination states get the mass, but not how much total mass to pass on. This causes a bias toward states with fewer outgoing transitions. In the extreme case, a state with a single outgoing transition effectively ignores the observation. In those cases, unlike in HMMs, Viterbi decoding cannot downgrade a branch based on observations after the branch point, and models with state-transition structures that have sparsely connected chains of states are not properly handled. The Markovian assumptions in MEMMs and similar state-conditional models insulate decisions at one state from future decisions in a way that does not match the actual dependencies between consecutive states.

This paper introduces conditional random fields (CRFs), a sequence modeling framework that has all the advantages of MEMMs but also solves the label bias problem in a principled way. The critical difference between CRFs and MEMMs is that a MEMM uses per-state exponential models for the conditional probabilities of next states given the current state, while a CRF has a single exponential model for the joint probability of the entire sequence of labels given the observation sequence. Therefore, the weights of different features at different states can be traded off against each other.

We can also think of a CRF as a finite state model with unnormalized transition probabilities. However, unlike some other weighted finite-state approaches (LeCun et al., 1998), CRFs assign a well-defined probability distribution over possible labelings, trained by maximum likelihood or MAP estimation. Furthermore, the loss function is convex,² guaranteeing convergence to the global optimum. CRFs also generalize easily to analogues of stochastic context-free grammars that would be useful in such problems as RNA secondary structure prediction and natural language processing.

We present the model, describe two training procedures and sketch a proof of convergence. We also give experimental results on synthetic data showing that CRFs solve the classical version of the label bias problem, and, more significantly, that CRFs perform better than HMMs and MEMMs when the true data distribution has higher-order dependencies than the model, as is often the case in practice. Finally, we confirm these results as well as the claimed advantages of conditional models by evaluating HMMs, MEMMs and CRFs with identical state structure on a part-of-speech tagging task.

¹ Output labels are associated with states; it is possible for several states to have the same label, but for simplicity in the rest of this paper we assume a one-to-one correspondence.
² In the case of fully observable states, as we are discussing here; if several states have the same label, the usual local maxima of Baum-Welch arise.

2. The Label Bias Problem

[Figure 1. Label bias example, after (Bottou, 1991). For conciseness, we place observation–label pairs o : l on transitions rather than states; the symbol '_' represents the null output label. The automaton branches on r from the start state 0 into state 1 (top path through states 1, 2 to the final state 3, outputting rib) and into state 4 (bottom path through states 4, 5 to state 3, outputting rob).]

Classical probabilistic automata (Paz, 1971), discriminative Markov models (Bottou, 1991), maximum entropy taggers (Ratnaparkhi, 1996), and MEMMs, as well as non-probabilistic sequence tagging and segmentation models with independently trained next-state classifiers (Punyakanok & Roth, 2001) are all potential victims of the label bias problem.

For example, Figure 1 represents a simple finite-state model designed to distinguish between the two words rib and rob. Suppose that the observation sequence is r i b. In the first time step, r matches both transitions from the start state, so the probability mass gets distributed roughly equally among those two transitions. Next we observe i. Both states 1 and 4 have only one outgoing transition. State 1 has seen this observation often in training, state 4 has almost never seen this observation; but like state 1, state 4 has no choice but to pass all its mass to its single outgoing transition, since it is not generating the observation, only conditioning on it. Thus, states with a single outgoing transition effectively ignore their observations. More generally, states with low-entropy next-state distributions will take little notice of observations. Returning to the example, the top path and the bottom path will be about equally likely, independently of the observation sequence. If one of the two words is slightly more common in the training set, the transitions out of the start state will slightly prefer its corresponding transition, and that word's state sequence will always win. This behavior is demonstrated experimentally in Section 5.
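The per-state versus global normalization contrast can be made concrete with a small numerical sketch. The following Python fragment is our own illustration, not part of the paper: the automaton follows Figure 1, but the weights w_match and w_miss are arbitrary assumptions. It scores the two paths on the observation sequence r i b, once with MEMM-style per-state normalization and once with CRF-style global normalization.

```python
# A minimal numerical sketch of the label bias problem on the rib/rob automaton
# of Figure 1. The transition scores below are illustrative assumptions.

from math import exp

# Transitions: (source state, designated observation, target state)
TRANSITIONS = [
    (0, 'r', 1), (0, 'r', 4),   # both branches leave the start state on 'r'
    (1, 'i', 2), (2, 'b', 3),   # top path spells "rib"
    (4, 'o', 5), (5, 'b', 3),   # bottom path spells "rob"
]
PATHS = {'rib': [0, 1, 2, 3], 'rob': [0, 4, 5, 3]}

def local_score(src, obs, dst, w_match=2.0, w_miss=0.0):
    """Unnormalized score of a transition given the current observation."""
    for (s, o, d) in TRANSITIONS:
        if s == src and d == dst:
            return exp(w_match if o == obs else w_miss)
    return 0.0

def memm_path_prob(path, observations):
    """Per-state normalization: each step's scores are renormalized over the
    successors of the current state before being multiplied together."""
    prob = 1.0
    for (src, dst), obs in zip(zip(path, path[1:]), observations):
        z_state = sum(local_score(src, obs, d) for (s, _, d) in TRANSITIONS if s == src)
        prob *= local_score(src, obs, dst) / z_state
    return prob

def crf_path_prob(path, observations):
    """Global normalization: multiply unnormalized scores along the whole path,
    then divide by the sum over all complete paths."""
    def raw(p):
        score = 1.0
        for (src, dst), obs in zip(zip(p, p[1:]), observations):
            score *= local_score(src, obs, dst)
        return score
    z = sum(raw(p) for p in PATHS.values())
    return raw(path) / z

obs = ['r', 'i', 'b']
for name, path in PATHS.items():
    print(name, 'per-state: %.3f' % memm_path_prob(path, obs),
          'global: %.3f' % crf_path_prob(path, obs))
```

With per-state normalization both paths receive probability 0.5, exactly the behavior described above, because states 1 and 4 must pass on all their mass regardless of the observation; with global normalization the rib path receives most of the mass.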
Léon Bottou (1991) discussed two solutions for the label bias problem. One is to change the state-transition structure of the model. In the above example we could collapse states 1 and 4, and delay the branching until we get a discriminating observation. This operation is a special case of determinization (Mohri, 1997), but determinization of weighted finite-state machines is not always possible, and even when possible, it may lead to combinatorial explosion. The other solution mentioned is to start with a fully-connected model and let the training procedure figure out a good structure. But that would preclude the use of prior structural knowledge that has proven so valuable in information extraction tasks (Freitag & McCallum, 2000).

Proper solutions require models that account for whole state sequences at once by letting some transitions "vote" more strongly than others depending on the corresponding observations. This implies that score mass will not be conserved, but instead individual transitions can "amplify" or "dampen" the mass they receive. In the above example, the transitions from the start state would have a very weak effect on path score, while the transitions from states 1 and 4 would have much stronger effects, amplifying or damping depending on the actual observation, and a proportionally higher contribution to the selection of the Viterbi path.³

In the related work section we discuss other heuristic model classes that account for state sequences globally rather than locally. To the best of our knowledge, CRFs are the only model class that does this in a purely probabilistic setting, with guaranteed global maximum likelihood convergence.

³ Weighted determinization and minimization techniques shift transition weights while preserving overall path weight (Mohri, 2000); their connection to this discussion deserves further study.

3. Conditional Random Fields

In what follows, X is a random variable over data sequences to be labeled, and Y is a random variable over corresponding label sequences. All components Y_i of Y are assumed to range over a finite label alphabet Y. For example, X might range over natural language sentences and Y range over part-of-speech taggings of those sentences, with Y the set of possible part-of-speech tags. The random variables X and Y are jointly distributed, but in a discriminative framework we construct a conditional model p(Y | X) from paired observation and label sequences, and do not explicitly model the marginal p(X).

Definition. Let G = (V, E) be a graph such that Y = (Y_v)_{v∈V}, so that Y is indexed by the vertices of G. Then (X, Y) is a conditional random field in case, when conditioned on X, the random variables Y_v obey the Markov property with respect to the graph: p(Y_v | X, Y_w, w ≠ v) = p(Y_v | X, Y_w, w ∼ v), where w ∼ v means that w and v are neighbors in G.

Thus, a CRF is a random field globally conditioned on the observation X. Throughout the paper we tacitly assume that the graph G is fixed. In the simplest and most important example for modeling sequences, G is a simple chain or line: G = (V = {1, 2, . . . , m}, E = {(i, i + 1)}). X may also have a natural graph structure; yet in general it is not necessary to assume that X and Y have the same graphical structure, or even that X has any graphical structure at all. However, in this paper we will be most concerned with sequences X = (X_1, X_2, . . . , X_n) and Y = (Y_1, Y_2, . . . , Y_n).

If the graph G = (V, E) of Y is a tree (of which a chain is the simplest example), its cliques are the edges and vertices. Therefore, by the fundamental theorem of random fields (Hammersley & Clifford, 1971), the joint distribution over the label sequence Y given X has the form

    p_\theta(y \mid x) \propto \exp\Big( \sum_{e \in E,\,k} \lambda_k f_k(e, y|_e, x)
                                       + \sum_{v \in V,\,k} \mu_k g_k(v, y|_v, x) \Big),    (1)

where x is a data sequence, y a label sequence, and y|_S is the set of components of y associated with the vertices in subgraph S.

We assume that the features f_k and g_k are given and fixed. For example, a Boolean vertex feature g_k might be true if the word X_i is upper case and the tag Y_i is "proper noun."

The parameter estimation problem is to determine the parameters θ = (λ_1, λ_2, . . . ; µ_1, µ_2, . . .) from training data D = {(x^(i), y^(i))}_{i=1}^N with empirical distribution p̃(x, y). In Section 4 we describe an iterative scaling algorithm that maximizes the log-likelihood objective function O(θ):

    O(\theta) = \sum_{i=1}^{N} \log p_\theta(y^{(i)} \mid x^{(i)})
              \propto \sum_{x,y} \tilde{p}(x, y) \log p_\theta(y \mid x).

As a particular case, we can construct an HMM-like CRF by defining one feature for each state pair (y', y), and one feature for each state–observation pair (y, x):

    f_{y',y}(\langle u, v \rangle, y|_{\langle u,v \rangle}, x) = \delta(y_u, y')\, \delta(y_v, y)
    g_{y,x}(v, y|_v, x) = \delta(y_v, y)\, \delta(x_v, x).

The corresponding parameters λ_{y',y} and µ_{y,x} play a similar role to the (logarithms of the) usual HMM parameters p(y' | y) and p(x | y). Boltzmann chain models (Saul & Jordan, 1996; MacKay, 1996) have a similar form but use a single normalization constant to yield a joint distribution, whereas CRFs use the observation-dependent normalization Z(x) for conditional distributions.
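To make equation (1) and its HMM-like special case concrete, here is a minimal sketch of our own, using a hypothetical two-label, two-word example with random weights: one weight per label pair and one per label–observation pair, with the normalizer computed by brute force over all label sequences. The efficient matrix computation of the normalizer is given in Section 4.

```python
# A small sketch (our own illustration) of the HMM-like CRF in equation (1).
# The tiny label/observation sets and random weights are assumptions made
# only for the example.

import itertools, math, random

LABELS = ['N', 'V']          # hypothetical label alphabet
OBS = ['dog', 'runs']        # hypothetical observation alphabet
random.seed(0)
lam = {(yp, y): random.gauss(0, 1) for yp in LABELS for y in LABELS}
mu = {(y, x): random.gauss(0, 1) for y in LABELS for x in OBS}

def unnormalized_score(y_seq, x_seq):
    """exp of the sum of edge and vertex feature weights along the chain."""
    s = sum(lam[(yp, y)] for yp, y in zip(y_seq, y_seq[1:]))
    s += sum(mu[(y, x)] for y, x in zip(y_seq, x_seq))
    return math.exp(s)

def conditional_prob(y_seq, x_seq):
    """Globally normalized p(y | x): brute force over all label sequences,
    feasible only for tiny chains."""
    z = sum(unnormalized_score(ys, x_seq)
            for ys in itertools.product(LABELS, repeat=len(x_seq)))
    return unnormalized_score(y_seq, x_seq) / z

x = ['dog', 'runs']
for ys in itertools.product(LABELS, repeat=len(x)):
    print(ys, round(conditional_prob(ys, x), 3))
```

The probabilities printed for the four label sequences sum to one, since the single normalizer is taken over all labelings of the observation sequence.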
[Figure 2. Graphical structures of simple HMMs (left), MEMMs (center), and the chain-structured case of CRFs (right) for sequences. An open circle indicates that the variable is not generated by the model.]
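As a reading aid for Figure 2, the three structures correspond to the following factorizations. This is our own standard-form summary, not notation taken from the paper:

```latex
% HMM: generative, joint distribution over observations and labels
p(x, y) = \prod_{i=1}^{n} p(y_i \mid y_{i-1})\, p(x_i \mid y_i)

% MEMM: conditional, but normalized per source state
p(y \mid x) = \prod_{i=1}^{n} p(y_i \mid y_{i-1}, x_i), \qquad
p(y_i \mid y_{i-1}, x_i) =
  \frac{\exp\big(\sum_k \lambda_k f_k(y_{i-1}, y_i, x_i)\big)}{Z(y_{i-1}, x_i)}

% Chain CRF: conditional, with a single observation-dependent normalizer
p_\theta(y \mid x) = \frac{1}{Z_\theta(x)}
  \exp\Big(\sum_{i,k} \lambda_k f_k(e_i, y|_{e_i}, x)
         + \sum_{i,k} \mu_k g_k(v_i, y|_{v_i}, x)\Big)
```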

Although it encompasses HMM-like models, the class of conditional random fields is much more expressive, because it allows arbitrary dependencies on the observation sequence. In addition, the features do not need to specify completely a state or observation, so one might expect that the model can be estimated from less training data. Another attractive property is the convexity of the loss function; indeed, CRFs share all of the convexity properties of general maximum entropy models.

For the remainder of the paper we assume that the dependencies of Y, conditioned on X, form a chain. To simplify some expressions, we add special start and stop states Y_0 = start and Y_{n+1} = stop. Thus, we will be using the graphical structure shown in Figure 2. For a chain structure, the conditional probability of a label sequence can be expressed concisely in matrix form, which will be useful in describing the parameter estimation and inference algorithms in Section 4. Suppose that p_θ(Y | X) is a CRF given by (1). For each position i in the observation sequence x, we define the |Y| × |Y| matrix random variable M_i(x) = [M_i(y', y | x)] by

    M_i(y', y \mid x) = \exp\big( \Lambda_i(y', y \mid x) \big)
    \Lambda_i(y', y \mid x) = \sum_k \lambda_k f_k(e_i, Y|_{e_i} = (y', y), x)
                            + \sum_k \mu_k g_k(v_i, Y|_{v_i} = y, x),

where e_i is the edge with labels (Y_{i−1}, Y_i) and v_i is the vertex with label Y_i. In contrast to generative models, conditional models like CRFs do not need to enumerate over all possible observation sequences x, and therefore these matrices can be computed directly as needed from a given training or test observation sequence x and the parameter vector θ. Then the normalization (partition function) Z_θ(x) is the (start, stop) entry of the product of these matrices:

    Z_\theta(x) = \big( M_1(x)\, M_2(x) \cdots M_{n+1}(x) \big)_{\mathrm{start},\,\mathrm{stop}}.

Using this notation, the conditional probability of a label sequence y is written as

    p_\theta(y \mid x) = \frac{\prod_{i=1}^{n+1} M_i(y_{i-1}, y_i \mid x)}
                              {\Big( \prod_{i=1}^{n+1} M_i(x) \Big)_{\mathrm{start},\,\mathrm{stop}}},

where y_0 = start and y_{n+1} = stop.
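The matrix form lends itself directly to implementation. The sketch below is our own illustration, assuming hypothetical HMM-like features and weights (the dictionaries lam and mu): it builds the per-position matrices M_i, reads Z_θ(x) off the (start, stop) entry of their product, and evaluates p_θ(y | x) as a product of matrix entries divided by Z_θ(x).

```python
# A sketch (our own, with hypothetical features and weights) of the matrix form
# for a chain CRF. Transitions back into 'start' or out of 'stop' get zero
# potential so that only proper start -> ... -> stop paths contribute.

import numpy as np

LABELS = ['start', 'N', 'V', 'stop']   # includes the special start/stop states
IDX = {y: i for i, y in enumerate(LABELS)}

def log_potential(y_prev, y, x_i, lam, mu):
    """Lambda_i(y', y | x): sum of weighted edge and vertex features."""
    if y == 'start' or y_prev == 'stop':
        return -np.inf                       # exp(-inf) = 0: disallowed transition
    return lam.get((y_prev, y), 0.0) + mu.get((y, x_i), 0.0)

def transition_matrices(x_seq, lam, mu):
    """One matrix per position i = 1..n+1; position n+1 carries the stop edge."""
    mats = []
    for i in range(len(x_seq) + 1):
        x_i = x_seq[i] if i < len(x_seq) else None   # no observation on the stop edge
        M = np.zeros((len(LABELS), len(LABELS)))
        for yp in LABELS:
            for y in LABELS:
                M[IDX[yp], IDX[y]] = np.exp(log_potential(yp, y, x_i, lam, mu))
        mats.append(M)
    return mats

def conditional_prob(y_seq, x_seq, lam, mu):
    mats = transition_matrices(x_seq, lam, mu)
    # Z(x) = (M_1 M_2 ... M_{n+1})_{start, stop}
    Z = np.linalg.multi_dot(mats)[IDX['start'], IDX['stop']]
    path = ['start'] + list(y_seq) + ['stop']
    num = np.prod([mats[i][IDX[path[i]], IDX[path[i + 1]]] for i in range(len(mats))])
    return num / Z

# Hypothetical weights for a two-word example.
lam = {('start', 'N'): 0.5, ('N', 'V'): 1.0, ('V', 'stop'): 0.5}
mu = {('N', 'dog'): 1.0, ('V', 'runs'): 1.0}
print(conditional_prob(['N', 'V'], ['dog', 'runs'], lam, mu))
```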
4. Parameter Estimation for CRFs

We now describe two iterative scaling algorithms to find the parameter vector θ that maximizes the log-likelihood of the training data. Both algorithms are based on the improved iterative scaling (IIS) algorithm of Della Pietra et al. (1997); the proof technique based on auxiliary functions can be extended to show convergence of the algorithms for CRFs.

Iterative scaling algorithms update the weights as λ_k ← λ_k + δλ_k and µ_k ← µ_k + δµ_k for appropriately chosen δλ_k and δµ_k. In particular, the IIS update δλ_k for an edge feature f_k is the solution of

    \tilde{E}[f_k] \;\overset{\mathrm{def}}{=}\;
        \sum_{x,y} \tilde{p}(x, y) \sum_{i=1}^{n+1} f_k(e_i, y|_{e_i}, x)
      = \sum_{x,y} \tilde{p}(x)\, p_\theta(y \mid x)
        \sum_{i=1}^{n+1} f_k(e_i, y|_{e_i}, x)\, e^{\delta\lambda_k T(x, y)},

where T(x, y) is the total feature count

    T(x, y) \;\overset{\mathrm{def}}{=}\;
        \sum_{i,k} f_k(e_i, y|_{e_i}, x) + \sum_{i,k} g_k(v_i, y|_{v_i}, x).

The equations for vertex feature updates δµ_k have similar form.

However, efficiently computing the exponential sums on the right-hand sides of these equations is problematic, because T(x, y) is a global property of (x, y), and dynamic programming will sum over sequences with potentially varying T. To deal with this, the first algorithm, Algorithm S, uses a "slack feature." The second, Algorithm T, keeps track of partial T totals.

For Algorithm S, we define the slack feature by

    s(x, y) \;\overset{\mathrm{def}}{=}\;
        S - \sum_i \sum_k f_k(e_i, y|_{e_i}, x) - \sum_i \sum_k g_k(v_i, y|_{v_i}, x),

where S is a constant chosen so that s(x^(i), y) ≥ 0 for all y and all observation vectors x^(i) in the training set, thus making T(x, y) = S. Feature s is "global," that is, it does not correspond to any particular edge or vertex.

For each index i = 0, . . . , n + 1 we now define the forward vectors α_i(x) with base case

    \alpha_0(y \mid x) = \begin{cases} 1 & \text{if } y = \mathrm{start} \\ 0 & \text{otherwise} \end{cases}

and recurrence

    \alpha_i(x) = \alpha_{i-1}(x)\, M_i(x).

Similarly, the backward vectors β_i(x) are defined by

    \beta_{n+1}(y \mid x) = \begin{cases} 1 & \text{if } y = \mathrm{stop} \\ 0 & \text{otherwise} \end{cases}

and

    \beta_i(x) = M_{i+1}(x)\, \beta_{i+1}(x).

With these definitions, the update equations are

    \delta\lambda_k = \frac{1}{S} \log \frac{\tilde{E} f_k}{E f_k}, \qquad
    \delta\mu_k     = \frac{1}{S} \log \frac{\tilde{E} g_k}{E g_k},

where

    E f_k = \sum_x \tilde{p}(x) \sum_{i=1}^{n+1} \sum_{y',y}
            f_k(e_i, y|_{e_i} = (y', y), x)\,
            \frac{\alpha_{i-1}(y' \mid x)\, M_i(y', y \mid x)\, \beta_i(y \mid x)}{Z_\theta(x)}

    E g_k = \sum_x \tilde{p}(x) \sum_{i=1}^{n} \sum_{y}
            g_k(v_i, y|_{v_i} = y, x)\,
            \frac{\alpha_i(y \mid x)\, \beta_i(y \mid x)}{Z_\theta(x)}.

The factors involving the forward and backward vectors in the above equations have the same meaning as for standard hidden Markov models. For example,

    p_\theta(Y_i = y \mid x) = \frac{\alpha_i(y \mid x)\, \beta_i(y \mid x)}{Z_\theta(x)}

is the marginal probability of label Y_i = y given that the observation sequence is x. This algorithm is closely related to the algorithm of Darroch and Ratcliff (1972), and MART algorithms used in image reconstruction.
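The forward and backward recurrences translate directly into code. The following sketch is our own, reusing the list of transition matrices M_i from the earlier matrix-form sketch; it computes the α and β vectors, the partition function, and the vertex and edge marginals whose weighted sums give the model expectations Ef_k and Eg_k above.

```python
# A sketch (our own illustration) of the forward-backward recurrences for a
# chain CRF, operating on the list of matrices M_1 .. M_{n+1} produced by the
# earlier transition_matrices sketch.

import numpy as np

def forward_backward(mats, idx_start, idx_stop):
    """Return alpha[i], beta[i] for i = 0..n+1 and the partition function Z(x)."""
    n_plus_1 = len(mats)                    # matrices M_1 .. M_{n+1}
    n_labels = mats[0].shape[0]
    alpha = [np.zeros(n_labels) for _ in range(n_plus_1 + 1)]
    beta = [np.zeros(n_labels) for _ in range(n_plus_1 + 1)]
    alpha[0][idx_start] = 1.0               # base case: alpha_0(y) = [y = start]
    beta[n_plus_1][idx_stop] = 1.0          # base case: beta_{n+1}(y) = [y = stop]
    for i in range(1, n_plus_1 + 1):        # alpha_i = alpha_{i-1} M_i
        alpha[i] = alpha[i - 1] @ mats[i - 1]
    for i in range(n_plus_1 - 1, -1, -1):   # beta_i = M_{i+1} beta_{i+1}
        beta[i] = mats[i] @ beta[i + 1]
    Z = alpha[n_plus_1][idx_stop]           # equals (M_1 ... M_{n+1})_{start,stop}
    return alpha, beta, Z

def vertex_marginal(alpha, beta, Z, i, y_idx):
    """p(Y_i = y | x) = alpha_i(y) beta_i(y) / Z(x)."""
    return alpha[i][y_idx] * beta[i][y_idx] / Z

def edge_marginal(alpha, beta, mats, Z, i, yp_idx, y_idx):
    """p(Y_{i-1} = y', Y_i = y | x) = alpha_{i-1}(y') M_i(y', y) beta_i(y) / Z(x);
    summing this against f_k over positions and training sequences gives E f_k."""
    return alpha[i - 1][yp_idx] * mats[i - 1][yp_idx, y_idx] * beta[i][y_idx] / Z
```

Passing the matrices produced by transition_matrices in the earlier sketch yields exactly the quantities appearing in the update equations above.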
The constant S in Algorithm S can be quite large, since in practice it is proportional to the length of the longest training observation sequence. As a result, the algorithm may converge slowly, taking very small steps toward the maximum in each iteration. If the length of the observations x^(i) and the number of active features varies greatly, a faster-converging algorithm can be obtained by keeping track of feature totals for each observation sequence separately.

Let T(x) = max_y T(x, y). Algorithm T accumulates feature expectations into counters indexed by T(x). More specifically, we use the forward-backward recurrences just introduced to compute the expectations a_{k,t} of feature f_k and b_{k,t} of feature g_k given that T(x) = t. Then our parameter updates are δλ_k = log β_k and δµ_k = log γ_k, where β_k and γ_k are the unique positive roots to the following polynomial equations

    \sum_{t=0}^{T_{\max}} a_{k,t}\, \beta_k^t = \tilde{E} f_k, \qquad
    \sum_{t=0}^{T_{\max}} b_{k,t}\, \gamma_k^t = \tilde{E} g_k,    (2)

which can be easily computed by Newton's method.

A single iteration of Algorithm S and Algorithm T has roughly the same time and space complexity as the well known Baum-Welch algorithm for HMMs. To prove convergence of our algorithms, we can derive an auxiliary function to bound the change in likelihood from below; this method is developed in detail by Della Pietra et al. (1997). The full proof is somewhat detailed; however, here we give an idea of how to derive the auxiliary function. To simplify notation, we assume only edge features f_k with parameters λ_k.

Given two parameter settings θ = (λ_1, λ_2, . . .) and θ' = (λ_1 + δλ_1, λ_2 + δλ_2, . . .), we bound from below the change in the objective function with an auxiliary function A(θ', θ) as follows:

    O(\theta') - O(\theta)
      = \sum_{x,y} \tilde{p}(x, y) \log \frac{p_{\theta'}(y \mid x)}{p_\theta(y \mid x)}
      = (\theta' - \theta) \cdot \tilde{E} f - \sum_x \tilde{p}(x) \log \frac{Z_{\theta'}(x)}{Z_\theta(x)}
      \ge (\theta' - \theta) \cdot \tilde{E} f + 1 - \sum_x \tilde{p}(x) \frac{Z_{\theta'}(x)}{Z_\theta(x)}
      = \delta\lambda \cdot \tilde{E} f + 1 - \sum_x \tilde{p}(x) \sum_y p_\theta(y \mid x)\, e^{\delta\lambda \cdot f(x, y)}
      \ge \delta\lambda \cdot \tilde{E} f + 1 - \sum_{x,y,k} \tilde{p}(x)\, p_\theta(y \mid x)\, \frac{f_k(x, y)}{T(x)}\, e^{\delta\lambda_k T(x)}
      \;\overset{\mathrm{def}}{=}\; A(\theta', \theta),

where the inequalities follow from the convexity of −log and exp. Differentiating A with respect to δλ_k and setting the result to zero yields equation (2).
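Equation (2) reduces each Algorithm T update to finding the positive root of a low-degree polynomial, which can be computed by Newton's method. The sketch below is our own illustration with hypothetical counters a_{k,t} and a hypothetical target expectation; it is not the authors' implementation.

```python
# A sketch (our own) of the Newton step used to solve equation (2) for
# Algorithm T: find the positive root beta_k of  sum_t a[t] * x**t = target,
# where a[t] >= 0 are the accumulated expectations for T(x) = t and target is
# the empirical expectation of f_k. The counter values below are hypothetical.

import math

def positive_root(a, target, x0=1.0, tol=1e-10, max_iter=100):
    """Newton's method on h(x) = sum_t a[t] x^t - target. With nonnegative
    coefficients h is increasing and convex on x > 0, so the positive root is unique."""
    x = x0
    for _ in range(max_iter):
        h = sum(c * x ** t for t, c in enumerate(a)) - target
        dh = sum(t * c * x ** (t - 1) for t, c in enumerate(a) if t > 0)
        if dh == 0:
            break
        step = h / dh
        x -= step
        if abs(step) < tol:
            break
    return x

a = [0.0, 0.3, 0.1, 0.05]          # hypothetical a_{k,t} counters, t = 0..Tmax
beta_k = positive_root(a, target=0.6)
delta_lambda_k = math.log(beta_k)  # the Algorithm T update for lambda_k
print(beta_k, delta_lambda_k)
```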
[Figure 3. Plots of 2×2 error rates for HMMs, CRFs, and MEMMs on randomly generated synthetic data sets, as described in Section 5.2. Left plot: MEMM error vs. CRF error; center plot: MEMM error vs. HMM error; right plot: CRF error vs. HMM error. As the data becomes "more second order," the error rates of the test models increase. As shown in the left plot, the CRF typically significantly outperforms the MEMM. The center plot shows that the HMM outperforms the MEMM. In the right plot, each open square represents a data set with α < 1/2, and a solid circle indicates a data set with α ≥ 1/2. The plot shows that when the data is mostly second order (α ≥ 1/2), the discriminatively trained CRF typically outperforms the HMM. These experiments are not designed to demonstrate the advantages of the additional representational power of CRFs and MEMMs relative to HMMs.]

5. Experiments

We first discuss two sets of experiments with synthetic data that highlight the differences between CRFs and MEMMs. The first experiments are a direct verification of the label bias problem discussed in Section 2. In the second set of experiments, we generate synthetic data using randomly chosen hidden Markov models, each of which is a mixture of a first-order and second-order model. Competing first-order models are then trained and compared on test data. As the data becomes more second-order, the test error rates of the trained models increase. This experiment corresponds to the common modeling practice of approximating complex local and long-range dependencies, as occur in natural data, by small-order Markov models. Our results clearly indicate that even when the models are parameterized in exactly the same way, CRFs are more robust to inaccurate modeling assumptions than MEMMs or HMMs, and resolve the label bias problem, which affects the performance of MEMMs. To avoid confusion of different effects, the MEMMs and CRFs in these experiments do not use overlapping features of the observations. Finally, in a set of POS tagging experiments, we confirm the advantage of CRFs over MEMMs. We also show that the addition of overlapping features to CRFs and MEMMs allows them to perform much better than HMMs, as already shown for MEMMs by McCallum et al. (2000).

5.1 Modeling label bias

We generate data from a simple HMM which encodes a noisy version of the finite-state network in Figure 1. Each state emits its designated symbol with probability 29/32 and any of the other symbols with probability 1/32. We train both an MEMM and a CRF with the same topologies on the data generated by the HMM. The observation features are simply the identity of the observation symbols. In a typical run using 2,000 training and 500 test samples, trained to convergence of the iterative scaling algorithm, the CRF error is 4.6% while the MEMM error is 42%, showing that the MEMM fails to discriminate between the two branches.

5.2 Modeling mixed-order sources

For these results, we use five labels, a–e (|Y| = 5), and 26 observation values, A–Z (|X| = 26); however, the results were qualitatively the same over a range of sizes for Y and X. We generate data from a mixed-order HMM with state transition probabilities given by p_α(y_i | y_{i−1}, y_{i−2}) = α p_2(y_i | y_{i−1}, y_{i−2}) + (1 − α) p_1(y_i | y_{i−1}) and, similarly, emission probabilities given by p_α(x_i | y_i, x_{i−1}) = α p_2(x_i | y_i, x_{i−1}) + (1 − α) p_1(x_i | y_i). Thus, for α = 0 we have a standard first-order HMM. In order to limit the size of the Bayes error rate for the resulting models, the conditional probability tables p_α are constrained to be sparse. In particular, p_α(· | y, y′) can have at most two nonzero entries, for each y, y′, and p_α(· | y, x′) can have at most three nonzero entries for each y, x′. For each randomly generated model, a sample of 1,000 sequences of length 25 is generated for training and testing.

On each randomly generated training set, a CRF is trained using Algorithm S. (Note that since the length of the sequences and number of active features is constant, Algorithms S and T are identical.) The algorithm is fairly slow to converge, typically taking approximately 500 iterations for the model to stabilize. On the 500 MHz Pentium PC used in our experiments, each iteration takes approximately 0.2 seconds. On the same data an MEMM is trained using iterative scaling, which does not require forward-backward calculations, and is thus more efficient. The MEMM training converges more quickly, stabilizing after approximately 100 iterations. For each model, the Viterbi algorithm is used to label a test set; the experimental results do not significantly change when using forward-backward decoding to minimize the per-symbol error rate.

The results of several runs are presented in Figure 3. Each plot compares two classes of models, with each point indicating the error rate for a single test set. As α increases, the error rates generally increase, as the first-order models fail to fit the second-order data. The figure compares models parameterized as µ_y, λ_{y′,y}, and λ_{y′,y,x}; results for models parameterized as µ_y, λ_{y′,y}, and µ_{y,x} are qualitatively the same. As shown in the first graph, the CRF generally outperforms the MEMM, often by a wide margin of 10%–20% relative error. (The points for very small error rate, with α < 0.01, where the MEMM does better than the CRF, are suspected to be the result of an insufficient number of training iterations for the CRF.)
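For concreteness, a data generator along the lines described in Section 5.2 might look as follows. This is our own sketch, not the authors' code: the random probability tables, the uniform initial state, the handling of the first emission and of the missing y_{i−2} at the first step, and the omission of the sparsity constraints are all simplifying assumptions.

```python
# A sketch (our own) of sampling from the mixed-order HMM of Section 5.2:
# transitions mix a second-order table p2 with a first-order table p1 by
# weight alpha, and emissions mix p2(x_i | y_i, x_{i-1}) with p1(x_i | y_i).

import random

LABELS = list('abcde')
SYMBOLS = [chr(c) for c in range(ord('A'), ord('Z') + 1)]

def random_dist(keys, rng):
    w = [rng.random() for _ in keys]
    s = sum(w)
    return {k: wi / s for k, wi in zip(keys, w)}

def sample(dist, rng):
    r, acc = rng.random(), 0.0
    for k, p in dist.items():
        acc += p
        if r <= acc:
            return k
    return k

def make_source(alpha, seed=0):
    rng = random.Random(seed)
    p1_t = {y: random_dist(LABELS, rng) for y in LABELS}
    p2_t = {(y, yy): random_dist(LABELS, rng) for y in LABELS for yy in LABELS}
    p1_e = {y: random_dist(SYMBOLS, rng) for y in LABELS}
    p2_e = {(y, x): random_dist(SYMBOLS, rng) for y in LABELS for x in SYMBOLS}

    def gen(length, rng):
        ys = [rng.choice(LABELS)]            # assumption: uniform initial label
        xs = [sample(p1_e[ys[0]], rng)]      # assumption: first-order first emission
        while len(ys) < length:
            y_prev2 = ys[-2] if len(ys) > 1 else ys[-1]  # assumption at the first step
            # p_alpha(y_i | y_{i-1}, y_{i-2}) = alpha*p2 + (1-alpha)*p1
            mix_t = {y: alpha * p2_t[(ys[-1], y_prev2)][y] + (1 - alpha) * p1_t[ys[-1]][y]
                     for y in LABELS}
            y = sample(mix_t, rng)
            mix_e = {x: alpha * p2_e[(y, xs[-1])][x] + (1 - alpha) * p1_e[y][x]
                     for x in SYMBOLS}
            ys.append(y)
            xs.append(sample(mix_e, rng))
        return xs, ys
    return gen

gen = make_source(alpha=0.5)
rng = random.Random(1)
train = [gen(25, rng) for _ in range(1000)]   # 1,000 sequences of length 25
```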
  model     error    oov error
  HMM       5.69%    45.99%
  MEMM      6.37%    54.61%
  CRF       5.55%    48.05%
  MEMM+     4.81%    26.99%
  CRF+      4.27%    23.76%
  + Using spelling features

Figure 4. Per-word error rates for POS tagging on the Penn treebank, using first-order models trained on 50% of the 1.1 million word corpus. The oov rate is 5.45%.

5.3 POS tagging experiments

To confirm our synthetic data results, we also compared HMMs, MEMMs and CRFs on Penn treebank POS tagging, where each word in a given input sentence must be labeled with one of 45 syntactic tags.

We carried out two sets of experiments with this natural language data. First, we trained first-order HMM, MEMM, and CRF models as in the synthetic data experiments, introducing parameters µ_{y,x} for each tag-word pair and λ_{y′,y} for each tag-tag pair in the training set. The results are consistent with what is observed on synthetic data: the HMM outperforms the MEMM, as a consequence of the label bias problem, while the CRF outperforms the HMM. The error rates for training runs using a 50%-50% train-test split are shown in Figure 4; the results are qualitatively similar for other splits of the data. The error rates on out-of-vocabulary (oov) words, which are not observed in the training set, are reported separately.

In the second set of experiments, we take advantage of the power of conditional models by adding a small set of orthographic features: whether a spelling begins with a number or upper case letter, whether it contains a hyphen, and whether it ends in one of the following suffixes: -ing, -ogy, -ed, -s, -ly, -ion, -tion, -ity, -ies. Here we find, as expected, that both the MEMM and the CRF benefit significantly from the use of these features, with the overall error rate reduced by around 25%, and the out-of-vocabulary error rate reduced by around 50%.

One usually starts training from the all zero parameter vector, corresponding to the uniform distribution. However, for these datasets, CRF training with that initialization is much slower than MEMM training. Fortunately, we can use the optimal MEMM parameter vector as a starting point for training the corresponding CRF. In Figure 4, MEMM+ was trained to convergence in around 100 iterations. Its parameters were then used to initialize the training of CRF+, which converged in 1,000 iterations. In contrast, training of the same CRF from the uniform distribution had not converged even after 2,000 iterations.
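The orthographic features used in the second POS experiment above are simple predicates on the observed word. A minimal sketch of such feature functions follows; this is our own formulation, not the authors' implementation. In the CRF, these Booleans would be conjoined with the current tag to form additional vertex features g_k with weights µ_k.

```python
# A sketch (our own) of the orthographic predicates described above: whether a
# spelling begins with a number or an upper-case letter, whether it contains a
# hyphen, and whether it ends in one of the listed suffixes.

SUFFIXES = ('-ing', '-ogy', '-ed', '-s', '-ly', '-ion', '-tion', '-ity', '-ies')

def spelling_features(word):
    """Return a dict of binary orthographic features for one observed word."""
    feats = {
        'starts_with_digit': word[:1].isdigit(),
        'starts_with_upper': word[:1].isupper(),
        'contains_hyphen': '-' in word,
    }
    for suf in SUFFIXES:
        feats['ends_with_%s' % suf] = word.endswith(suf.lstrip('-'))
    return feats

print(spelling_features('Drizzling'))
```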
6. Further Aspects of CRFs

Many further aspects of CRFs are attractive for applications and deserve further study. In this section we briefly mention just two.

Conditional random fields can be trained using the exponential loss objective function used by the AdaBoost algorithm (Freund & Schapire, 1997). Typically, boosting is applied to classification problems with a small, fixed number of classes; applications of boosting to sequence labeling have treated each label as a separate classification problem (Abney et al., 1999). However, it is possible to apply the parallel update algorithm of Collins et al. (2000) to optimize the per-sequence exponential loss. This requires a forward-backward algorithm to compute efficiently certain feature expectations, along the lines of Algorithm T, except that each feature requires a separate set of forward and backward accumulators.

Another attractive aspect of CRFs is that one can implement efficient feature selection and feature induction algorithms for them. That is, rather than specifying in advance which features of (X, Y) to use, we could start from feature-generating rules and evaluate the benefit of generated features automatically on data. In particular, the feature induction algorithms presented in Della Pietra et al. (1997) can be adapted to fit the dynamic programming techniques of conditional random fields.

7. Related Work and Conclusions

As far as we know, the present work is the first to combine the benefits of conditional models with the global normalization of random field models. Other applications of exponential models in sequence modeling have either attempted to build generative models (Rosenfeld, 1997), which involve a hard normalization problem, or adopted local conditional models (Berger et al., 1996; Ratnaparkhi, 1996; McCallum et al., 2000) that may suffer from label bias.

Non-probabilistic local decision models have also been widely used in segmentation and tagging (Brill, 1995; Roth, 1998; Abney et al., 1999). Because of the computational complexity of global training, these models are only trained to minimize the error of individual label decisions assuming that neighboring labels are correctly chosen. Label bias would be expected to be a problem here too.

An alternative approach to discriminative modeling of sequence labeling is to use a permissive generative model, which can only model local dependencies, to produce a list of candidates, and then use a more global discriminative model to rerank those candidates. This approach is standard in large-vocabulary speech recognition (Schwartz & Austin, 1993), and has also been proposed for parsing (Collins, 2000). However, these methods fail when the correct output is pruned away in the first pass.
Closest to our proposal are gradient-descent methods that adjust the parameters of all of the local classifiers to minimize a smooth loss function (e.g., quadratic loss) combining loss terms for each label. If state dependencies are local, this can be done efficiently with dynamic programming (LeCun et al., 1998). Such methods should alleviate label bias. However, their loss function is not convex, so they may get stuck in local minima.

Conditional random fields offer a unique combination of properties: discriminatively trained models for sequence segmentation and labeling; combination of arbitrary, overlapping and agglomerative observation features from both the past and future; efficient training and decoding based on dynamic programming; and parameter estimation guaranteed to find the global optimum. Their main current limitation is the slow convergence of the training algorithm relative to MEMMs, let alone to HMMs, for which training on fully observed data is very efficient. In future work, we plan to investigate alternative training methods such as the update methods of Collins et al. (2000) and refinements on using a MEMM as starting point as we did in some of our experiments. More general tree-structured random fields, feature induction methods, and further natural data evaluations will also be investigated.

Acknowledgments

We thank Yoshua Bengio, Léon Bottou, Michael Collins and Yann LeCun for alerting us to what we call here the label bias problem. We also thank Andrew Ng and Sebastian Thrun for discussions related to this work.

References

Abney, S., Schapire, R. E., & Singer, Y. (1999). Boosting applied to tagging and PP attachment. Proc. EMNLP-VLC. New Brunswick, New Jersey: Association for Computational Linguistics.
Berger, A. L., Della Pietra, S. A., & Della Pietra, V. J. (1996). A maximum entropy approach to natural language processing. Computational Linguistics, 22.
Bottou, L. (1991). Une approche théorique de l'apprentissage connexionniste: Applications à la reconnaissance de la parole. Doctoral dissertation, Université de Paris XI.
Brill, E. (1995). Transformation-based error-driven learning and natural language processing: a case study in part of speech tagging. Computational Linguistics, 21, 543–565.
Collins, M. (2000). Discriminative reranking for natural language parsing. Proc. ICML 2000. Stanford, California.
Collins, M., Schapire, R., & Singer, Y. (2000). Logistic regression, AdaBoost, and Bregman distances. Proc. 13th COLT.
Darroch, J. N., & Ratcliff, D. (1972). Generalized iterative scaling for log-linear models. The Annals of Mathematical Statistics, 43, 1470–1480.
Della Pietra, S., Della Pietra, V., & Lafferty, J. (1997). Inducing features of random fields. IEEE Transactions on Pattern Analysis and Machine Intelligence, 19, 380–393.
Durbin, R., Eddy, S., Krogh, A., & Mitchison, G. (1998). Biological sequence analysis: Probabilistic models of proteins and nucleic acids. Cambridge University Press.
Freitag, D., & McCallum, A. (2000). Information extraction with HMM structures learned by stochastic optimization. Proc. AAAI 2000.
Freund, Y., & Schapire, R. (1997). A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 55, 119–139.
Hammersley, J., & Clifford, P. (1971). Markov fields on finite graphs and lattices. Unpublished manuscript.
LeCun, Y., Bottou, L., Bengio, Y., & Haffner, P. (1998). Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86, 2278–2324.
MacKay, D. J. (1996). Equivalence of linear Boltzmann chains and hidden Markov models. Neural Computation, 8, 178–181.
Manning, C. D., & Schütze, H. (1999). Foundations of statistical natural language processing. Cambridge, Massachusetts: MIT Press.
McCallum, A., Freitag, D., & Pereira, F. (2000). Maximum entropy Markov models for information extraction and segmentation. Proc. ICML 2000 (pp. 591–598). Stanford, California.
Mohri, M. (1997). Finite-state transducers in language and speech processing. Computational Linguistics, 23.
Mohri, M. (2000). Minimization algorithms for sequential transducers. Theoretical Computer Science, 234, 177–201.
Paz, A. (1971). Introduction to probabilistic automata. Academic Press.
Punyakanok, V., & Roth, D. (2001). The use of classifiers in sequential inference. NIPS 13. Forthcoming.
Ratnaparkhi, A. (1996). A maximum entropy model for part-of-speech tagging. Proc. EMNLP. New Brunswick, New Jersey: Association for Computational Linguistics.
Rosenfeld, R. (1997). A whole sentence maximum entropy language model. Proceedings of the IEEE Workshop on Speech Recognition and Understanding. Santa Barbara, California.
Roth, D. (1998). Learning to resolve natural language ambiguities: A unified approach. Proc. 15th AAAI (pp. 806–813). Menlo Park, California: AAAI Press.
Saul, L., & Jordan, M. (1996). Boltzmann chains and hidden Markov models. Advances in Neural Information Processing Systems 7. MIT Press.
Schwartz, R., & Austin, S. (1993). A comparison of several approximate algorithms for finding multiple (N-BEST) sentence hypotheses. Proc. ICASSP. Minneapolis, MN.

Crf

  • 1. Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data John Lafferty†∗ LAFFERTY @ CS . CMU . EDU Andrew McCallum∗† MCCALLUM @ WHIZBANG . COM Fernando Pereira∗‡ FPEREIRA @ WHIZBANG . COM ∗ WhizBang! Labs–Research, 4616 Henry Street, Pittsburgh, PA 15213 USA † School of Computer Science, Carnegie Mellon University, Pittsburgh, PA 15213 USA ‡ Department of Computer and Information Science, University of Pennsylvania, Philadelphia, PA 19104 USA Abstract mize the joint likelihood of training examples. To define a joint probability over observation and label sequences, We present conditional random fields, a frame- a generative model needs to enumerate all possible ob- work for building probabilistic models to seg- servation sequences, typically requiring a representation ment and label sequence data. Conditional ran- in which observations are task-appropriate atomic entities, dom fields offer several advantages over hid- such as words or nucleotides. In particular, it is not practi- den Markov models and stochastic grammars cal to represent multiple interacting features or long-range for such tasks, including the ability to relax dependencies of the observations, since the inference prob- strong independence assumptions made in those lem for such models is intractable. models. Conditional random fields also avoid a fundamental limitation of maximum entropy This difficulty is one of the main motivations for looking at Markov models (MEMMs) and other discrimi- conditional models as an alternative. A conditional model native Markov models based on directed graph- specifies the probabilities of possible label sequences given ical models, which can be biased towards states an observation sequence. Therefore, it does not expend with few successor states. We present iterative modeling effort on the observations, which at test time parameter estimation algorithms for conditional are fixed anyway. Furthermore, the conditional probabil- random fields and compare the performance of ity of the label sequence can depend on arbitrary, non- the resulting models to HMMs and MEMMs on independent features of the observation sequence without synthetic and natural-language data. forcing the model to account for the distribution of those dependencies. The chosen features may represent attributes at different levels of granularity of the same observations 1. Introduction (for example, words and characters in English text), or aggregate properties of the observation sequence (for in- The need to segment and label sequences arises in many stance, text layout). The probability of a transition between different problems in several scientific fields. Hidden labels may depend not only on the current observation, Markov models (HMMs) and stochastic grammars are well but also on past and future observations, if available. In understood and widely used probabilistic models for such contrast, generative models must make very strict indepen- problems. In computational biology, HMMs and stochas- dence assumptions on the observations, for instance condi- tic grammars have been successfully used to align bio- tional independence given the labels, to achieve tractability. logical sequences, find sequences homologous to a known evolutionary family, and analyze RNA secondary structure Maximum entropy Markov models (MEMMs) are condi- (Durbin et al., 1998). 
In computational linguistics and tional probabilistic sequence models that attain all of the computer science, HMMs and stochastic grammars have above advantages (McCallum et al., 2000). In MEMMs, been applied to a wide variety of problems in text and each source state1 has a exponential model that takes the speech processing, including topic segmentation, part-of- observation features as input, and outputs a distribution speech (POS) tagging, information extraction, and syntac- over possible next states. These exponential models are tic disambiguation (Manning & Sch¨ tze, 1999). u trained by an appropriate iterative scaling method in the 1 HMMs and stochastic grammars are generative models, as- Output labels are associated with states; it is possible for sev- signing a joint probability to paired observation and label eral states to have the same label, but for simplicity in the rest of this paper we assume a one-to-one correspondence. sequences; the parameters are typically trained to maxi-
  • 2. maximum entropy framework. Previously published exper- i:_ r:_ 1 2 b:rib imental results show MEMMs increasing recall and dou- bling precision relative to HMMs in a FAQ segmentation 0 r:_ b:rob 3 task. 4 o:_ 5 MEMMs and other non-generative finite-state models based on next-state classifiers, such as discriminative Figure 1. Label bias example, after (Bottou, 1991). For concise- Markov models (Bottou, 1991), share a weakness we call ness, we place observation-label pairs o : l on transitions rather here the label bias problem: the transitions leaving a given than states; the symbol ‘ ’ represents the null output label. state compete only against each other, rather than against all other transitions in the model. In probabilistic terms, We present the model, describe two training procedures and transition scores are the conditional probabilities of pos- sketch a proof of convergence. We also give experimental sible next states given the current state and the observa- results on synthetic data showing that CRFs solve the clas- tion sequence. This per-state normalization of transition sical version of the label bias problem, and, more signifi- scores implies a “conservation of score mass” (Bottou, cantly, that CRFs perform better than HMMs and MEMMs 1991) whereby all the mass that arrives at a state must be when the true data distribution has higher-order dependen- distributed among the possible successor states. An obser- cies than the model, as is often the case in practice. Finally, vation can affect which destination states get the mass, but we confirm these results as well as the claimed advantages not how much total mass to pass on. This causes a bias to- of conditional models by evaluating HMMs, MEMMs and ward states with fewer outgoing transitions. In the extreme CRFs with identical state structure on a part-of-speech tag- case, a state with a single outgoing transition effectively ging task. ignores the observation. In those cases, unlike in HMMs, Viterbi decoding cannot downgrade a branch based on ob- servations after the branch point, and models with state- 2. The Label Bias Problem transition structures that have sparsely connected chains of Classical probabilistic automata (Paz, 1971), discrimina- states are not properly handled. The Markovian assump- tive Markov models (Bottou, 1991), maximum entropy tions in MEMMs and similar state-conditional models in- taggers (Ratnaparkhi, 1996), and MEMMs, as well as sulate decisions at one state from future decisions in a way non-probabilistic sequence tagging and segmentation mod- that does not match the actual dependencies between con- els with independently trained next-state classifiers (Pun- secutive states. yakanok & Roth, 2001) are all potential victims of the label This paper introduces conditional random fields (CRFs), a bias problem. sequence modeling framework that has all the advantages For example, Figure 1 represents a simple finite-state of MEMMs but also solves the label bias problem in a model designed to distinguish between the two words rib principled way. The critical difference between CRFs and and rob. Suppose that the observation sequence is r i b. MEMMs is that a MEMM uses per-state exponential mod- In the first time step, r matches both transitions from the els for the conditional probabilities of next states given the start state, so the probability mass gets distributed roughly current state, while a CRF has a single exponential model equally among those two transitions. Next we observe i. 
for the joint probability of the entire sequence of labels Both states 1 and 4 have only one outgoing transition. State given the observation sequence. Therefore, the weights of 1 has seen this observation often in training, state 4 has al- different features at different states can be traded off against most never seen this observation; but like state 1, state 4 each other. has no choice but to pass all its mass to its single outgoing We can also think of a CRF as a finite state model with un- transition, since it is not generating the observation, only normalized transition probabilities. However, unlike some conditioning on it. Thus, states with a single outgoing tran- other weighted finite-state approaches (LeCun et al., 1998), sition effectively ignore their observations. More generally, CRFs assign a well-defined probability distribution over states with low-entropy next state distributions will take lit- possible labelings, trained by maximum likelihood or MAP tle notice of observations. Returning to the example, the estimation. Furthermore, the loss function is convex,2 guar- top path and the bottom path will be about equally likely, anteeing convergence to the global optimum. CRFs also independently of the observation sequence. If one of the generalize easily to analogues of stochastic context-free two words is slightly more common in the training set, the grammars that would be useful in such problems as RNA transitions out of the start state will slightly prefer its cor- secondary structure prediction and natural language pro- responding transition, and that word’s state sequence will cessing. always win. This behavior is demonstrated experimentally in Section 5. 2 In the case of fully observable states, as we are discussing here; if several states have the same label, the usual local maxima L´ on Bottou (1991) discussed two solutions for the label e of Baum-Welch arise. bias problem. One is to change the state-transition struc-
  • 3. ture of the model. In the above example we could collapse tant example for modeling sequences, G is a simple chain states 1 and 4, and delay the branching until we get a dis- or line: G = (V = {1, 2, . . . m}, E = {(i, i + 1)}). criminating observation. This operation is a special case X may also have a natural graph structure; yet in gen- of determinization (Mohri, 1997), but determinization of eral it is not necessary to assume that X and Y have the weighted finite-state machines is not always possible, and same graphical structure, or even that X has any graph- even when possible, it may lead to combinatorial explo- ical structure at all. However, in this paper we will be sion. The other solution mentioned is to start with a fully- most concerned with sequences X = (X1 , X2 , . . . , Xn ) connected model and let the training procedure figure out and Y = (Y1 , Y2 , . . . , Yn ). a good structure. But that would preclude the use of prior If the graph G = (V, E) of Y is a tree (of which a chain structural knowledge that has proven so valuable in infor- is the simplest example), its cliques are the edges and ver- mation extraction tasks (Freitag & McCallum, 2000). tices. Therefore, by the fundamental theorem of random Proper solutions require models that account for whole fields (Hammersley & Clifford, 1971), the joint distribu- state sequences at once by letting some transitions “vote” tion over the label sequence Y given X has the form more strongly than others depending on the corresponding observations. This implies that score mass will not be con- p θ (y | x) ∝ (1) served, but instead individual transitions can “amplify” or   “dampen” the mass they receive. In the above example, the exp  λk fk (e, y|e , x) + µk gk (v, y|v , x) , transitions from the start state would have a very weak ef- e∈E,k v∈V,k fect on path score, while the transitions from states 1 and 4 would have much stronger effects, amplifying or damping where x is a data sequence, y a label sequence, and y|S is depending on the actual observation, and a proportionally the set of components of y associated with the vertices in higher contribution to the selection of the Viterbi path.3 subgraph S. In the related work section we discuss other heuristic model We assume that the features fk and gk are given and fixed. classes that account for state sequences globally rather than For example, a Boolean vertex feature gk might be true if locally. To the best of our knowledge, CRFs are the only the word Xi is upper case and the tag Yi is “proper noun.” model class that does this in a purely probabilistic setting, with guaranteed global maximum likelihood convergence. The parameter estimation problem is to determine the pa- rameters θ = (λ1 , λ2 , . . . ; µ1 , µ2 , . . .) from training data D = {(x(i) , y(i) )}N with empirical distribution p(x, y). i=1 3. Conditional Random Fields In Section 4 we describe an iterative scaling algorithm that In what follows, X is a random variable over data se- maximizes the log-likelihood objective function O(θ): quences to be labeled, and Y is a random variable over N corresponding label sequences. All components Yi of Y O(θ) = log p θ (y(i) | x(i) ) are assumed to range over a finite label alphabet Y. For ex- i=1 ample, X might range over natural language sentences and Y range over part-of-speech taggings of those sentences, ∝ p(x, y) log p θ (y | x) . with Y the set of possible part-of-speech tags. 
The ran- x,y dom variables X and Y are jointly distributed, but in a dis- criminative framework we construct a conditional model As a particular case, we can construct an HMM-like CRF p(Y | X) from paired observation and label sequences, and by defining one feature for each state pair (y , y), and one do not explicitly model the marginal p(X). feature for each state-observation pair (y, x): Definition. Let G = (V, E) be a graph such that fy ,y (<u, v>, y|<u,v> , x) = δ(yu , y ) δ(yv , y) Y = (Yv )v∈V , so that Y is indexed by the vertices gy,x (v, y|v , x) = δ(yv , y) δ(xv , x) . of G. Then (X, Y) is a conditional random field in case, when conditioned on X, the random variables Yv The corresponding parameters λy ,y and µy,x play a simi- obey the Markov property with respect to the graph: lar role to the (logarithms of the) usual HMM parameters p(Yv | X, Yw , w = v) = p(Yv | X, Yw , w ∼ v), where p(y | y) and p(x|y). Boltzmann chain models (Saul & Jor- w ∼ v means that w and v are neighbors in G. dan, 1996; MacKay, 1996) have a similar form but use a Thus, a CRF is a random field globally conditioned on the single normalization constant to yield a joint distribution, observation X. Throughout the paper we tacitly assume whereas CRFs use the observation-dependent normaliza- that the graph G is fixed. In the simplest and most impor- tion Z(x) for conditional distributions. 3 Weighted determinization and minimization techniques shift Although it encompasses HMM-like models, the class of transition weights while preserving overall path weight (Mohri, conditional random fields is much more expressive, be- 2000); their connection to this discussion deserves further study. cause it allows arbitrary dependencies on the observation
  • 4. Yi−1 Yi Yi+1 Yi−1 Yi Yi+1 Yi−1 Yi Yi+1 s - s - s s - s - s s s s 6 6 6 s s s c c c c c c ? ? ? Xi−1 Xi Xi+1 Xi−1 Xi Xi+1 Xi−1 Xi Xi+1 Figure 2. Graphical structures of simple HMMs (left), MEMMs (center), and the chain-structured case of CRFs (right) for sequences. An open circle indicates that the variable is not generated by the model. sequence. In addition, the features do not need to specify of the training data. Both algorithms are based on the im- completely a state or observation, so one might expect that proved iterative scaling (IIS) algorithm of Della Pietra et al. the model can be estimated from less training data. Another (1997); the proof technique based on auxiliary functions attractive property is the convexity of the loss function; in- can be extended to show convergence of the algorithms for deed, CRFs share all of the convexity properties of general CRFs. maximum entropy models. Iterative scaling algorithms update the weights as λk ← For the remainder of the paper we assume that the depen- λk + δλk and µk ← µk + δµk for appropriately chosen dencies of Y, conditioned on X, form a chain. To sim- δλk and δµk . In particular, the IIS update δλk for an edge plify some expressions, we add special start and stop states feature fk is the solution of Y0 = start and Yn+1 = stop. Thus, we will be using the n+1 graphical structure shown in Figure 2. For a chain struc- def E[fk ] = p(x, y) fk (ei , y|ei , x) ture, the conditional probability of a label sequence can be x,y i=1 expressed concisely in matrix form, which will be useful n+1 in describing the parameter estimation and inference al- gorithms in Section 4. Suppose that p θ (Y | X) is a CRF = p(x) p(y | x) fk (ei , y|ei , x) e δλk T (x,y) . x,y i=1 given by (1). For each position i in the observation se- quence x, we define the |Y| × |Y| matrix random variable where T (x, y) is the total feature count Mi (x) = [Mi (y , y | x)] by def Mi (y , y | x) = exp (Λi (y , y | x)) T (x, y) = fk (ei , y|ei , x) + gk (vi , y|vi , x) . i,k i,k Λi (y , y | x) = k λk fk (ei , Y|ei = (y , y), x) + The equations for vertex feature updates δµk have similar k µk gk (vi , Y|vi = y, x) , form. where ei is the edge with labels (Yi−1 , Yi ) and vi is the However, efficiently computing the exponential sums on vertex with label Yi . In contrast to generative models, con- the right-hand sides of these equations is problematic, be- ditional models like CRFs do not need to enumerate over cause T (x, y) is a global property of (x, y), and dynamic all possible observation sequences x, and therefore these programming will sum over sequences with potentially matrices can be computed directly as needed from a given varying T . To deal with this, the first algorithm, Algorithm training or test observation sequence x and the parameter S, uses a “slack feature.” The second, Algorithm T, keeps vector θ. Then the normalization (partition function) Zθ (x) track of partial T totals. is the (start, stop) entry of the product of these matrices: For Algorithm S, we define the slack feature by Zθ (x) = (M1 (x) M2 (x) · · · Mn+1 (x))start,stop . def Using this notation, the conditional probability of a label s(x, y) = sequence y is written as S− fk (ei , y|ei , x) − gk (vi , y|vi , x) , n+1 i k i k i=1 Mi (yi−1 , yi | x) p θ (y | x) = , where S is a constant chosen so that s(x(i) , y) ≥ 0 for all n+1 Mi (x) i=1 start,stop y and all observation vectors x(i) in the training set, thus making T (x, y) = S. 
Feature s is “global,” that is, it does where y0 = start and yn+1 = stop. not correspond to any particular edge or vertex. For each index i = 0, . . . , n + 1 we now define the forward 4. Parameter Estimation for CRFs vectors αi (x) with base case We now describe two iterative scaling algorithms to find 1 if y = start the parameter vector θ that maximizes the log-likelihood α0 (y | x) = 0 otherwise
  • 5. and recurrence βk and γk are the unique positive roots to the following polynomial equations αi (x) = αi−1 (x) Mi (x) . Tmax Tmax t t ak,t βk = Efk , bk,t γk = Egk , (2) Similarly, the backward vectors βi (x) are defined by i=0 i=0 1 if y = stop which can be easily computed by Newton’s method. βn+1 (y | x) = 0 otherwise A single iteration of Algorithm S and Algorithm T has and roughly the same time and space complexity as the well βi (x) = Mi+1 (x) βi+1 (x) . known Baum-Welch algorithm for HMMs. To prove con- vergence of our algorithms, we can derive an auxiliary function to bound the change in likelihood from below; this With these definitions, the update equations are method is developed in detail by Della Pietra et al. (1997). The full proof is somewhat detailed; however, here we give 1 Efk 1 Egk an idea of how to derive the auxiliary function. To simplify δλk = log , δµk = log , S Efk S Egk notation, we assume only edge features fk with parameters λk . where Given two parameter settings θ = (λ1 , λ2 , . . .) and θ = n+1 (λ1 +δλ1 , λ2 +δλ2 , . . .), we bound from below the change Efk = p(x) fk (ei , y|ei = (y , y), x) × in the objective function with an auxiliary function A(θ , θ) x i=1 y ,y as follows αi−1 (y | x) Mi (y , y | x) βi (y | x) p θ (y | x) Zθ (x) O(θ ) − O(θ) = p(x, y) log x,y p θ (y | x) n Egk = p(x) gk (vi , y|vi = y, x) × Zθ (x) = (θ − θ) · Ef − p(x) log x i=1 y x Zθ (x) αi (y | x) βi (y | x) Zθ (x) . ≥ (θ − θ) · Ef − p(x) Zθ (x) Zθ (x) x The factors involving the forward and backward vectors in = δλ · Ef − p(x) pθ (y | x) eδλ·f (x,y) the above equations have the same meaning as for standard x y hidden Markov models. For example, fk (x, y) δλk T (x) ≥ δλ · Ef − p(x) pθ (y | x) e αi (y | x) βi (y | x) T (x) x,y,k p θ (Yi = y | x) = Zθ (x) def = A(θ , θ) is the marginal probability of label Yi = y given that the where the inequalities follow from the convexity of − log observation sequence is x. This algorithm is closely related and exp. Differentiating A with respect to δλk and setting to the algorithm of Darroch and Ratcliff (1972), and MART the result to zero yields equation (2). algorithms used in image reconstruction. The constant S in Algorithm S can be quite large, since in 5. Experiments practice it is proportional to the length of the longest train- ing observation sequence. As a result, the algorithm may We first discuss two sets of experiments with synthetic data converge slowly, taking very small steps toward the maxi- that highlight the differences between CRFs and MEMMs. mum in each iteration. If the length of the observations x(i) The first experiments are a direct verification of the label and the number of active features varies greatly, a faster- bias problem discussed in Section 2. In the second set of converging algorithm can be obtained by keeping track of experiments, we generate synthetic data using randomly feature totals for each observation sequence separately. chosen hidden Markov models, each of which is a mix- ture of a first-order and second-order model. Competing def Let T (x) = maxy T (x, y). Algorithm T accumulates first-order models are then trained and compared on test feature expectations into counters indexed by T (x). More data. As the data becomes more second-order, the test er- specifically, we use the forward-backward recurrences just ror rates of the trained models increase. 
This experiment introduced to compute the expectations ak,t of feature fk corresponds to the common modeling practice of approxi- and bk,t of feature gk given that T (x) = t. Then our param- mating complex local and long-range dependencies, as oc- eter updates are δλk = log βk and δµk = log γk , where cur in natural data, by small-order Markov models. Our
  • 6. 60 60 60 50 50 50 40 40 40 MEMM Error MEMM Error CRF Error 30 30 30 20 20 20 10 10 10 0 0 0 0 10 20 30 40 50 60 0 10 20 30 40 50 60 0 10 20 30 40 50 60 CRF Error HMM Error HMM Error Figure 3. Plots of 2×2 error rates for HMMs, CRFs, and MEMMs on randomly generated synthetic data sets, as described in Section 5.2. As the data becomes “more second order,” the error rates of the test models increase. As shown in the left plot, the CRF typically significantly outperforms the MEMM. The center plot shows that the HMM outperforms the MEMM. In the right plot, each open square 1 1 represents a data set with α < 2 , and a solid circle indicates a data set with α ≥ 2 . The plot shows that when the data is mostly second 1 order (α ≥ 2 ), the discriminatively trained CRF typically outperforms the HMM. These experiments are not designed to demonstrate the advantages of the additional representational power of CRFs and MEMMs relative to HMMs. results clearly indicate that even when the models are pa- of the Bayes error rate for the resulting models, the con- rameterized in exactly the same way, CRFs are more ro- ditional probability tables pα are constrained to be sparse. bust to inaccurate modeling assumptions than MEMMs or In particular, pα (· | y, y ) can have at most two nonzero en- HMMs, and resolve the label bias problem, which affects tries, for each y, y , and p α (· | y, x ) can have at most three the performance of MEMMs. To avoid confusion of dif- nonzero entries for each y, x . For each randomly gener- ferent effects, the MEMMs and CRFs in these experiments ated model, a sample of 1,000 sequences of length 25 is do not use overlapping features of the observations. Fi- generated for training and testing. nally, in a set of POS tagging experiments, we confirm the On each randomly generated training set, a CRF is trained advantage of CRFs over MEMMs. We also show that the using Algorithm S. (Note that since the length of the se- addition of overlapping features to CRFs and MEMMs al- quences and number of active features is constant, Algo- lows them to perform much better than HMMs, as already rithms S and T are identical.) The algorithm is fairly slow shown for MEMMs by McCallum et al. (2000). to converge, typically taking approximately 500 iterations for the model to stabilize. On the 500 MHz Pentium PC 5.1 Modeling label bias used in our experiments, each iteration takes approximately We generate data from a simple HMM which encodes a 0.2 seconds. On the same data an MEMM is trained using noisy version of the finite-state network in Figure 1. Each iterative scaling, which does not require forward-backward state emits its designated symbol with probability 29/32 calculations, and is thus more efficient. The MEMM train- and any of the other symbols with probability 1/32. We ing converges more quickly, stabilizing after approximately train both an MEMM and a CRF with the same topologies 100 iterations. For each model, the Viterbi algorithm is on the data generated by the HMM. The observation fea- used to label a test set; the experimental results do not sig- tures are simply the identity of the observation symbols. nificantly change when using forward-backward decoding In a typical run using 2, 000 training and 500 test samples, to minimize the per-symbol error rate. trained to convergence of the iterative scaling algorithm, The results of several runs are presented in Figure 3. 
Each the CRF error is 4.6% while the MEMM error is 42%, plot compares two classes of models, with each point indi- showing that the MEMM fails to discriminate between the cating the error rate for a single test set. As α increases, the two branches. error rates generally increase, as the first-order models fail to fit the second-order data. The figure compares models 5.2 Modeling mixed-order sources parameterized as µy , λy ,y , and λy ,y,x ; results for models For these results, we use five labels, a-e (|Y| = 5), and 26 parameterized as µy , λy ,y , and µy,x are qualitatively the observation values, A-Z (|X | = 26); however, the results same. As shown in the first graph, the CRF generally out- were qualitatively the same over a range of sizes for Y and performs the MEMM, often by a wide margin of 10%–20% X . We generate data from a mixed-order HMM with state relative error. (The points for very small error rate, with transition probabilities given by pα (yi | yi−1 , yi−2 ) = α < 0.01, where the MEMM does better than the CRF, α p2 (yi | yi−1 , yi−2 ) + (1 − α) p1 (yi | yi−1 ) and, simi- are suspected to be the result of an insufficient number of larly, emission probabilities given by pα (xi | yi , xi−1 ) = training iterations for the CRF.) α p2 (xi | yi , xi−1 )+(1−α) p1 (xi | yi ). Thus, for α = 0 we have a standard first-order HMM. In order to limit the size
  • 7. model error oov error 6. Further Aspects of CRFs HMM 5.69% 45.99% Many further aspects of CRFs are attractive for applica- MEMM 6.37% 54.61% tions and deserve further study. In this section we briefly CRF 5.55% 48.05% mention just two. MEMM+ 4.81% 26.99% Conditional random fields can be trained using the expo- CRF+ 4.27% 23.76% nential loss objective function used by the AdaBoost algo- + rithm (Freund & Schapire, 1997). Typically, boosting is Using spelling features applied to classification problems with a small, fixed num- ber of classes; applications of boosting to sequence labeling Figure 4. Per-word error rates for POS tagging on the Penn tree- have treated each label as a separate classification problem bank, using first-order models trained on 50% of the 1.1 million (Abney et al., 1999). However, it is possible to apply the word corpus. The oov rate is 5.45%. parallel update algorithm of Collins et al. (2000) to op- timize the per-sequence exponential loss. This requires a 5.3 POS tagging experiments forward-backward algorithm to compute efficiently certain To confirm our synthetic data results, we also compared feature expectations, along the lines of Algorithm T, ex- HMMs, MEMMs and CRFs on Penn treebank POS tag- cept that each feature requires a separate set of forward and ging, where each word in a given input sentence must be backward accumulators. labeled with one of 45 syntactic tags. Another attractive aspect of CRFs is that one can imple- We carried out two sets of experiments with this natural ment efficient feature selection and feature induction al- language data. First, we trained first-order HMM, MEMM, gorithms for them. That is, rather than specifying in ad- and CRF models as in the synthetic data experiments, in- vance which features of (X, Y) to use, we could start from troducing parameters µy,x for each tag-word pair and λy ,y feature-generating rules and evaluate the benefit of gener- for each tag-tag pair in the training set. The results are con- ated features automatically on data. In particular, the fea- sistent with what is observed on synthetic data: the HMM ture induction algorithms presented in Della Pietra et al. outperforms the MEMM, as a consequence of the label bias (1997) can be adapted to fit the dynamic programming problem, while the CRF outperforms the HMM. The er- techniques of conditional random fields. ror rates for training runs using a 50%-50% train-test split are shown in Figure 5.3; the results are qualitatively sim- 7. Related Work and Conclusions ilar for other splits of the data. The error rates on out- of-vocabulary (oov) words, which are not observed in the As far as we know, the present work is the first to combine training set, are reported separately. the benefits of conditional models with the global normal- ization of random field models. Other applications of expo- In the second set of experiments, we take advantage of the nential models in sequence modeling have either attempted power of conditional models by adding a small set of or- to build generative models (Rosenfeld, 1997), which in- thographic features: whether a spelling begins with a num- volve a hard normalization problem, or adopted local con- ber or upper case letter, whether it contains a hyphen, and ditional models (Berger et al., 1996; Ratnaparkhi, 1996; whether it ends in one of the following suffixes: -ing, - McCallum et al., 2000) that may suffer from label bias. ogy, -ed, -s, -ly, -ion, -tion, -ity, -ies. 
Here we find, as expected, that both the MEMM and the CRF benefit signif- Non-probabilistic local decision models have also been icantly from the use of these features, with the overall error widely used in segmentation and tagging (Brill, 1995; rate reduced by around 25%, and the out-of-vocabulary er- Roth, 1998; Abney et al., 1999). Because of the computa- ror rate reduced by around 50%. tional complexity of global training, these models are only trained to minimize the error of individual label decisions One usually starts training from the all zero parameter vec- assuming that neighboring labels are correctly chosen. La- tor, corresponding to the uniform distribution. However, bel bias would be expected to be a problem here too. for these datasets, CRF training with that initialization is much slower than MEMM training. Fortunately, we can An alternative approach to discriminative modeling of se- use the optimal MEMM parameter vector as a starting quence labeling is to use a permissive generative model, point for training the corresponding CRF. In Figure 5.3, which can only model local dependencies, to produce a MEMM+ was trained to convergence in around 100 iter- list of candidates, and then use a more global discrimina- ations. Its parameters were then used to initialize the train- tive model to rerank those candidates. This approach is ing of CRF+ , which converged in 1,000 iterations. In con- standard in large-vocabulary speech recognition (Schwartz trast, training of the same CRF from the uniform distribu- & Austin, 1993), and has also been proposed for parsing tion had not converged even after 2,000 iterations. (Collins, 2000). However, these methods fail when the cor- rect output is pruned away in the first pass.
6. Further Aspects of CRFs

Many further aspects of CRFs are attractive for applications and deserve further study. In this section we briefly mention just two.

Conditional random fields can be trained using the exponential loss objective function used by the AdaBoost algorithm (Freund & Schapire, 1997). Typically, boosting is applied to classification problems with a small, fixed number of classes; applications of boosting to sequence labeling have treated each label as a separate classification problem (Abney et al., 1999). However, it is possible to apply the parallel update algorithm of Collins et al. (2000) to optimize the per-sequence exponential loss. This requires a forward-backward algorithm to compute certain feature expectations efficiently, along the lines of Algorithm T, except that each feature requires a separate set of forward and backward accumulators.
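The role of the forward-backward recursions can be made concrete with the sketch below, which computes the node and edge marginals of a linear-chain model; expected feature counts are then obtained by summing these marginals against feature values. This shows only the standard single pair of accumulators, for illustration, not the per-feature accumulators required by the exponential-loss update.

    # Sketch of forward-backward marginals for a linear chain (illustrative;
    # standard accumulators only, not Algorithm T itself).
    import numpy as np
    from scipy.special import logsumexp

    def chain_marginals(node_lp, edge_lp):
        """node_lp: (T, K) per-position log-potentials over K labels;
        edge_lp: (K, K) transition log-potentials (position-independent
        here for brevity). Returns (node_marg, edge_marg, log_Z)."""
        T, K = node_lp.shape
        alpha = np.zeros((T, K))
        beta = np.zeros((T, K))

        alpha[0] = node_lp[0]
        for t in range(1, T):
            alpha[t] = node_lp[t] + logsumexp(alpha[t - 1][:, None] + edge_lp, axis=0)
        for t in range(T - 2, -1, -1):
            beta[t] = logsumexp(edge_lp + node_lp[t + 1] + beta[t + 1], axis=1)

        log_Z = logsumexp(alpha[-1])
        node_marg = np.exp(alpha + beta - log_Z)    # P(y_t = i | x)
        edge_marg = np.exp(                         # P(y_t = i, y_{t+1} = j | x)
            alpha[:-1, :, None] + edge_lp[None, :, :]
            + (node_lp[1:] + beta[1:])[:, None, :] - log_Z
        )
        return node_marg, edge_marg, log_Z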
Another attractive aspect of CRFs is that one can implement efficient feature selection and feature induction algorithms for them. That is, rather than specifying in advance which features of (X, Y) to use, we could start from feature-generating rules and evaluate the benefit of generated features automatically on data. In particular, the feature induction algorithms presented in Della Pietra et al. (1997) can be adapted to fit the dynamic programming techniques of conditional random fields.

7. Related Work and Conclusions

As far as we know, the present work is the first to combine the benefits of conditional models with the global normalization of random field models. Other applications of exponential models in sequence modeling have either attempted to build generative models (Rosenfeld, 1997), which involve a hard normalization problem, or adopted local conditional models (Berger et al., 1996; Ratnaparkhi, 1996; McCallum et al., 2000) that may suffer from label bias.

Non-probabilistic local decision models have also been widely used in segmentation and tagging (Brill, 1995; Roth, 1998; Abney et al., 1999). Because of the computational complexity of global training, these models are only trained to minimize the error of individual label decisions, assuming that neighboring labels are correctly chosen. Label bias would be expected to be a problem here too.

An alternative approach to discriminative modeling of sequence labeling is to use a permissive generative model, which can only model local dependencies, to produce a list of candidates, and then use a more global discriminative model to rerank those candidates. This approach is standard in large-vocabulary speech recognition (Schwartz & Austin, 1993), and has also been proposed for parsing (Collins, 2000). However, these methods fail when the correct output is pruned away in the first pass.

Closest to our proposal are gradient-descent methods that adjust the parameters of all of the local classifiers to minimize a smooth loss function (e.g., quadratic loss) combining loss terms for each label. If state dependencies are local, this can be done efficiently with dynamic programming (LeCun et al., 1998). Such methods should alleviate label bias. However, their loss function is not convex, so they may get stuck in local minima.

Conditional random fields offer a unique combination of properties: discriminatively trained models for sequence segmentation and labeling; combination of arbitrary, overlapping and agglomerative observation features from both the past and future; efficient training and decoding based on dynamic programming; and parameter estimation guaranteed to find the global optimum. Their main current limitation is the slow convergence of the training algorithm relative to MEMMs, let alone to HMMs, for which training on fully observed data is very efficient. In future work, we plan to investigate alternative training methods, such as the update methods of Collins et al. (2000), and refinements of the use of a MEMM as a starting point, as we did in some of our experiments. More general tree-structured random fields, feature induction methods, and further natural data evaluations will also be investigated.

Acknowledgments

We thank Yoshua Bengio, Léon Bottou, Michael Collins and Yann LeCun for alerting us to what we call here the label bias problem. We also thank Andrew Ng and Sebastian Thrun for discussions related to this work.

References

Abney, S., Schapire, R. E., & Singer, Y. (1999). Boosting applied to tagging and PP attachment. Proc. EMNLP-VLC. New Brunswick, New Jersey: Association for Computational Linguistics.
Berger, A. L., Della Pietra, S. A., & Della Pietra, V. J. (1996). A maximum entropy approach to natural language processing. Computational Linguistics, 22.
Bottou, L. (1991). Une approche théorique de l'apprentissage connexionniste: Applications à la reconnaissance de la parole. Doctoral dissertation, Université de Paris XI.
Brill, E. (1995). Transformation-based error-driven learning and natural language processing: a case study in part of speech tagging. Computational Linguistics, 21, 543–565.
Collins, M. (2000). Discriminative reranking for natural language parsing. Proc. ICML 2000. Stanford, California.
Collins, M., Schapire, R., & Singer, Y. (2000). Logistic regression, AdaBoost, and Bregman distances. Proc. 13th COLT.
Darroch, J. N., & Ratcliff, D. (1972). Generalized iterative scaling for log-linear models. The Annals of Mathematical Statistics, 43, 1470–1480.
Della Pietra, S., Della Pietra, V., & Lafferty, J. (1997). Inducing features of random fields. IEEE Transactions on Pattern Analysis and Machine Intelligence, 19, 380–393.
Durbin, R., Eddy, S., Krogh, A., & Mitchison, G. (1998). Biological sequence analysis: Probabilistic models of proteins and nucleic acids. Cambridge University Press.
Freitag, D., & McCallum, A. (2000). Information extraction with HMM structures learned by stochastic optimization. Proc. AAAI 2000.
Freund, Y., & Schapire, R. (1997). A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 55, 119–139.
Hammersley, J., & Clifford, P. (1971). Markov fields on finite graphs and lattices. Unpublished manuscript.
LeCun, Y., Bottou, L., Bengio, Y., & Haffner, P. (1998). Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86, 2278–2324.
MacKay, D. J. (1996). Equivalence of linear Boltzmann chains and hidden Markov models. Neural Computation, 8, 178–181.
Manning, C. D., & Schütze, H. (1999). Foundations of statistical natural language processing. Cambridge, Massachusetts: MIT Press.
McCallum, A., Freitag, D., & Pereira, F. (2000). Maximum entropy Markov models for information extraction and segmentation. Proc. ICML 2000 (pp. 591–598). Stanford, California.
Mohri, M. (1997). Finite-state transducers in language and speech processing. Computational Linguistics, 23.
Mohri, M. (2000). Minimization algorithms for sequential transducers. Theoretical Computer Science, 234, 177–201.
Paz, A. (1971). Introduction to probabilistic automata. Academic Press.
Punyakanok, V., & Roth, D. (2001). The use of classifiers in sequential inference. NIPS 13. Forthcoming.
Ratnaparkhi, A. (1996). A maximum entropy model for part-of-speech tagging. Proc. EMNLP. New Brunswick, New Jersey: Association for Computational Linguistics.
Rosenfeld, R. (1997). A whole sentence maximum entropy language model. Proceedings of the IEEE Workshop on Speech Recognition and Understanding. Santa Barbara, California.
Roth, D. (1998). Learning to resolve natural language ambiguities: A unified approach. Proc. 15th AAAI (pp. 806–813). Menlo Park, California: AAAI Press.
Saul, L., & Jordan, M. (1996). Boltzmann chains and hidden Markov models. Advances in Neural Information Processing Systems 7. MIT Press.
Schwartz, R., & Austin, S. (1993). A comparison of several approximate algorithms for finding multiple (N-BEST) sentence hypotheses. Proc. ICASSP. Minneapolis, MN.