Technical Tricks of Vowpal Wabbit



                http://hunch.net/~vw/


    John Langford, Columbia, Data-Driven Modeling,
                       April 16


  git clone
  git://github.com/JohnLangford/vowpal_wabbit.git
Goals of the VW project
   1   State of the art in scalable, fast, efficient
       Machine Learning. VW is (by far) the most
       scalable public linear learner, and plausibly the
       most scalable anywhere.
   2   Support research into new ML algorithms. ML
       researchers can deploy new algorithms on an
       efficient platform efficiently. BSD open source.
   3   Simplicity. No strange dependencies, currently
       only 9437 lines of code.
   4   It just works. A package in debian & R.
       Otherwise, users just type make, and get a
       working system. At least a half-dozen companies
       use VW.
Demonstration


  vw -c rcv1.train.vw.gz --exact_adaptive_norm --power_t 1 -l 0.5
The basic learning algorithm
  Learn w such that f_w(x) = w · x predicts well.
    1    Online learning with strong defaults.
    2    Every input source but library.
    3    Every output sink but library.
    4    In-core feature manipulation for ngrams, outer
         products, etc... Custom is easy.
    5    Debugging with readable models & audit mode.
    6    Different loss functions: squared, logistic, ...
    7    ℓ1 and ℓ2 regularization.
    8    Compatible LBFGS-based batch-mode
         optimization.
    9    Cluster parallel.
    10   Daemon deployable.
  (A minimal input-format and usage sketch follows below.)
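
  A quick, hedged illustration of the default workflow (the file names and feature
  names here are made up; the "label | namespace feature:value" input format and the
  -c/-f/-t/-i/-p flags are standard VW usage):

      # train.vw: one example per line, label first, then | and features
      1 |user age:25 region_ny |ad sports clicked_before
      0 |user age:43 region_ca |ad finance

      vw -c train.vw -f model.vw               # online learning with defaults, cache input, save model
      vw -t -i model.vw test.vw -p preds.txt   # test-only pass producing predictions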
The tricks

      Basic VW         Newer Algorithmics       Parallel Stuff

   Feature Caching     Adaptive Learning    Parameter Averaging
   Feature Hashing     Importance Updates    Nonuniform Average
   Online Learning      Dim. Correction       Gradient Summing
   Implicit Features        L-BFGS            Hadoop AllReduce
                        Hybrid Learning
  We'll discuss Basic VW and algorithmics, then Parallel.
Feature Caching


  Compare: time vw rcv1.train.vw.gz --exact_adaptive_norm --power_t 1
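
  What the comparison shows (a hedged sketch of how one might time it; the first
  cached run writes a .cache file next to the data, so later runs skip text parsing):

      time vw rcv1.train.vw.gz --exact_adaptive_norm --power_t 1      # parses the gzipped text every run
      time vw -c rcv1.train.vw.gz --exact_adaptive_norm --power_t 1   # first run also builds the cache
      time vw -c rcv1.train.vw.gz --exact_adaptive_norm --power_t 1   # subsequent runs read the binary cache: much faster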
Feature Hashing



  [Diagram: "Conventional" vs. "VW". The conventional approach keeps a string → index
  dictionary in RAM alongside the weights; VW keeps only the weights and hashes strings
  directly to indices.]


  Most algorithms use a hashmap to change a word into an
  index for a weight.
  VW uses a hash function which takes almost no RAM, is about 10x
  faster, and is easily parallelized. (A minimal sketch follows below.)
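
  A minimal sketch of the hashing trick in code (illustrative only: VW itself uses a
  murmur-style hash with the table size set by -b bits; here std::hash stands in):

      #include <cstdint>
      #include <functional>
      #include <string>
      #include <vector>

      // Hash a feature name straight to a weight index: no dictionary, O(1) extra RAM.
      // Illustrative only; VW uses a murmur-style hash, not std::hash.
      struct HashedWeights {
        explicit HashedWeights(unsigned bits) : mask((1u << bits) - 1), w(1u << bits, 0.f) {}
        uint32_t index(const std::string& feature) const {
          return static_cast<uint32_t>(std::hash<std::string>{}(feature)) & mask;
        }
        float& weight(const std::string& feature) { return w[index(feature)]; }
        uint32_t mask;
        std::vector<float> w;
      };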
The spam example [WALS09]

    1   3.2 × 10^6 labeled emails.

    2   433167 users.

    3   ~ 40 × 10^6 unique features.


  How do we construct a spam filter which is personalized, yet
  uses global information?


  Answer: Use hashing to predict according to:
  ⟨w, φ(x)⟩ + ⟨w, φ_u(x)⟩
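
  A sketch of the personalization trick under the hashing scheme sketched earlier
  (builds on the HashedWeights type above; the user id and token names are made up):
  each email token is hashed once globally and once prefixed with the user id, and both
  land in the same hashed weight vector.

      #include <string>
      #include <vector>

      // Predict with global + user-personalized copies of each feature,
      // all living in one hashed weight vector (sketch; uses HashedWeights above).
      float predict_spam_score(HashedWeights& model, const std::string& user,
                               const std::vector<std::string>& tokens) {
        float score = 0.f;
        for (const std::string& t : tokens) {
          score += model.weight(t);               // global feature phi(x)
          score += model.weight(user + "_" + t);  // user-specific feature phi_u(x)
        }
        return score;
      }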
Results

  [Figure: spam-filtering results of the hashed, personalized predictor
  (baseline = global-only predictor).]
Basic Online Learning

  Start with ∀i: w_i = 0. Repeatedly:

    1   Get example x ∈ (−∞, ∞)^*.

    2   Make prediction ŷ = Σ_i w_i x_i, clipped to the interval [0, 1].

    3   Learn truth y ∈ [0, 1] with importance I, or go to (1).

    4   Update w_i ← w_i + η 2(y − ŷ) I x_i and go to (1).
  (A minimal sketch of this loop follows below.)
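
  A minimal sketch of that loop for squared loss on dense features (illustrative only;
  VW's real implementation is sparse, hashed, and adds the adaptive/normalized tricks
  discussed later):

      #include <algorithm>
      #include <cstddef>
      #include <vector>

      struct Example { std::vector<float> x; float y; float importance; };

      // One pass of the basic online update from the slide:
      // predict, clip to [0,1], then w_i += eta * 2*(y - yhat) * I * x_i.
      void online_pass(std::vector<float>& w, const std::vector<Example>& stream, float eta) {
        for (const Example& ex : stream) {
          float yhat = 0.f;
          for (std::size_t i = 0; i < w.size(); ++i) yhat += w[i] * ex.x[i];
          yhat = std::min(1.f, std::max(0.f, yhat));           // clip prediction to [0, 1]
          float update = eta * 2.f * (ex.y - yhat) * ex.importance;
          for (std::size_t i = 0; i < w.size(); ++i) w[i] += update * ex.x[i];
        }
      }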
Reasons for Online Learning
    1   Fast convergence to a good predictor

    2   It's RAM efficient. You need to store only one example in
        RAM rather than all of them.   ⇒   Entirely new scales of
        data are possible.

    3   Online Learning algorithm = Online Optimization
        Algorithm. Online Learning Algorithms    ⇒   the ability to
        solve entirely new categories of applications.

    4   Online Learning = ability to deal with drifting
        distributions.
Implicit Outer Product
  Sometimes you care about the interaction of two sets of
  features (ad features x query features, news features x user
  features, etc...).
  Choices:

     1   Expand the set of features explicitly, consuming n² disk
         space.

     2   Expand the features dynamically in the core of your
         learning algorithm.

  Option (2) is about 10x faster. You need to be comfortable with
  hashes first. (A sketch of the dynamic expansion follows below.)
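
  A sketch of option (2) under the hashing scheme from before (illustrative; VW's -q
  flag does this per namespace pair, and its index-combining formula differs from the
  simple mix used here):

      #include <cstdint>
      #include <vector>

      // Generate the outer product of two hashed feature sets on the fly at learning
      // time instead of materializing it on disk (sketch). Each interaction gets its
      // own index by mixing the two source indices.
      inline uint32_t pair_index(uint32_t a, uint32_t b, uint32_t mask) {
        return (a * 2654435761u + b) & mask;  // cheap multiplicative mix, not VW's formula
      }

      // Assumes w.size() == mask + 1 and binary-valued cross features.
      float dot_with_interactions(const std::vector<float>& w, uint32_t mask,
                                  const std::vector<uint32_t>& ad_idx,
                                  const std::vector<uint32_t>& query_idx) {
        float s = 0.f;
        for (uint32_t a : ad_idx)
          for (uint32_t b : query_idx)
            s += w[pair_index(a, b, mask)];   // implicit cross feature
        return s;
      }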
The tricks

      Basic VW          Newer Algorithmics      Parallel Stuff

   Feature Caching      Adaptive Learning    Parameter Averaging
   Feature Hashing      Importance Updates   Nonuniform Average
   Online Learning       Dim. Correction     Gradient Summing
   Implicit Features         L-BFGS           Hadoop AllReduce
                         Hybrid Learning
  Next: algorithmics.
Adaptive Learning [DHS10,MS10]

  For example t, let g_{it} = 2(y − ŷ) x_{it}.


  New update rule:   w_i ← w_i − η g_{i,t+1} / √( Σ_{t′=1}^{t+1} g_{i,t′}² )


  Common features stabilize quickly. Rare features can have
  large updates. (A sketch of this per-feature rule follows below.)
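
  A sketch of that per-feature adaptive rule (illustrative; VW combines it with the
  normalization and importance-aware pieces discussed below):

      #include <cmath>
      #include <cstddef>
      #include <vector>

      // Per-feature adaptive (AdaGrad-style) update: each weight keeps the sum of
      // squared gradients seen so far and scales its own learning rate by 1/sqrt(sum).
      struct AdaptiveWeights {
        std::vector<float> w, g2;   // weights and accumulated squared gradients
        explicit AdaptiveWeights(std::size_t n) : w(n, 0.f), g2(n, 0.f) {}

        // grad = d(loss)/d(w_i) on the current example.
        void update(std::size_t i, float grad, float eta) {
          g2[i] += grad * grad;
          if (g2[i] > 0.f) w[i] -= eta * grad / std::sqrt(g2[i]);
        }
      };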
Learning with importance weights [KL11]

  [Figure sequence: a 1-D illustration with the current prediction w_t·x and the label y
  on a line. A single gradient step moves the prediction by −η(...)x toward y. Naively
  scaling that step by an importance weight of 6, i.e. −6η(...)x, can overshoot so that
  w_{t+1}·x lands far past y. Taking 6 small steps in sequence instead stops near y, and
  the importance-aware update computes that limit in closed form, scaling the step by a
  factor s(h)·||x||² so that w_{t+1}·x approaches y without crossing it.]
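
  A hedged sketch of where s(h) comes from, worked for squared loss (this follows the
  idea in [KL11] of treating an importance weight h as h infinitesimal updates; the
  closed form below is a reconstruction for the squared-loss case, not copied from the
  slides). Write p(h) = w(h)·x for the prediction after spending importance weight h:

      dw(h)/dh = 2η (y − p(h)) x   ⇒   dp/dh = 2η ||x||² (y − p(h))
      ⇒   p(h) = y + (w·x − y) e^{−2η h ||x||²}

  so the importance-aware update for importance weight h is

      w ← w + x (y − w·x) (1 − e^{−2η h ||x||²}) / ||x||²,

  which approaches but never crosses y as h grows, unlike the naive
  w ← w + 2η h (y − w·x) x.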
Robust results for unweighted problems

  [Figure: four scatter plots of test performance, standard update (y-axis) vs.
  importance-aware update (x-axis), on astro (logistic loss), spam (quantile loss),
  rcv1 (squared loss), and webspam (hinge loss); points on or below the diagonal mean
  the importance-aware update does at least as well as the standard one.]
Dimensional Correction
  Gradient of squared loss = ∂(f_w(x) − y)² / ∂w_i = 2(f_w(x) − y) x_i, and we
  change weights in the negative gradient direction:

                 w_i ← w_i − η ∂(f_w(x) − y)² / ∂w_i

  But the gradient has intrinsic problems: w_i naturally has units
  of 1/x_i, since doubling x_i implies halving w_i to get the same
  prediction.
  ⇒ The update rule has mixed units!


  A crude fix: divide the update by Σ_i x_i². It helps much!
  This is scary! The problem optimized is
  min_w Σ_{x,y} (f_w(x) − y)² / Σ_i x_i² rather than min_w Σ_{x,y} (f_w(x) − y)².
  But it works. (A sketch of the corrected update follows below.)
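
  A sketch of the crude fix applied to the basic squared-loss update (illustrative;
  VW's later normalized updates are more refined than this plain division):

      #include <cstddef>
      #include <vector>

      // Squared-loss update with the "dimensional correction": divide the step by
      // sum_i x_i^2 so the step size is invariant to rescaling the features (sketch).
      void corrected_update(std::vector<float>& w, const std::vector<float>& x,
                            float y, float eta) {
        float yhat = 0.f, x2 = 0.f;
        for (std::size_t i = 0; i < w.size(); ++i) { yhat += w[i] * x[i]; x2 += x[i] * x[i]; }
        if (x2 == 0.f) return;
        float step = eta * 2.f * (y - yhat) / x2;   // gradient step scaled by 1/||x||^2
        for (std::size_t i = 0; i < w.size(); ++i) w[i] += step * x[i];
      }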
LBFGS [Nocedal80]
  Batch(!) second order algorithm. Core idea = efficient
  approximate Newton step.


  H = ∂²(f_w(x) − y)² / (∂w_i ∂w_j) = Hessian.

  Newton step = w → w + H⁻¹ g.

  Newton fails: you can't even represent H.
  Instead build up an approximate inverse Hessian according to:
  Δ_w Δ_wᵀ / (Δ_wᵀ Δ_g), where Δ_w is a change in weights w and Δ_g is a change
  in the loss gradient g.
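
  For reference, the full quasi-Newton bookkeeping that the rank-one term above comes
  from is the standard BFGS inverse-Hessian update (textbook form, not copied from the
  slides); L-BFGS keeps only the last few (Δ_w, Δ_g) pairs instead of the full matrix:

      B⁻¹_{k+1} = (I − ρ Δ_w Δ_gᵀ) B⁻¹_k (I − ρ Δ_g Δ_wᵀ) + ρ Δ_w Δ_wᵀ,   ρ = 1 / (Δ_gᵀ Δ_w)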
Hybrid Learning
  Online learning is GREAT for getting to a good solution fast.
  LBFGS is GREAT for getting a perfect solution.
  Use Online Learning, then LBFGS


  [Figure: two panels of auPRC vs. iteration comparing Online, L-BFGS, L-BFGS w/ 1
  online pass, and L-BFGS w/ 5 online passes; the warmstarted L-BFGS runs reach the
  best auPRC in the fewest iterations.]
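
  A hedged sketch of how the warmstart might be run from the command line (the -c,
  --passes, -f, -i, and --bfgs flags are standard VW options; the file names are made
  up):

      vw -c data.vw --passes 1 -f online.model                          # online pass to get a good start
      vw -c data.vw -i online.model --bfgs --passes 20 -f final.model   # L-BFGS refinement from that start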
The tricks

      Basic VW         Newer Algorithmics      Parallel Stuff

   Feature Caching     Adaptive Learning    Parameter Averaging
   Feature Hashing     Importance Updates   Nonuniform Average
   Online Learning      Dim. Correction     Gradient Summing
   Implicit Features        L-BFGS           Hadoop AllReduce
                        Hybrid Learning
  Next: Parallel.
Applying for a fellowship in 1997

  Interviewer: So, what do you want to do?
  John: I'd like to solve AI.
  I: How?
  J: I want to use parallel learning algorithms to create fantastic
  learning machines!
  I: You fool! The only thing parallel machines are good for is
  computational windtunnels!
  The worst part: he had a point.
Terascale Linear Learning ACDL11
  Given 2.1 Terafeatures of data, how can you learn a good
  linear predictor f_w(x) = Σ_i w_i x_i ?




  2.1T sparse features
  17B Examples
  16M parameters
  1K nodes




  70 minutes = 500M features/second: faster than the I/O
  bandwidth of a single machine ⇒ we beat all possible single
  machine linear learning algorithms.
Features/s

  [Figure: "Speed per method": learning speed in features/second (log scale, 10² to 10⁹)
  for single-machine vs. parallel systems, comparing other supervised algorithms from
  the Parallel Learning book: RBF-SVM (MPI?-500, RCV1), Ensemble Tree (MPI-128,
  synthetic), RBF-SVM (TCP-48, MNIST 220K), Decision Tree (MapRed-200, ad-bounce data),
  Boosted DT (MPI-32, ranking data), Linear (Threads-2, RCV1), and Linear
  (Hadoop+TCP-1000, ads data), the last being the terascale setup described above.]
MPI-style AllReduce

  [Diagram sequence: seven nodes holding the values 1..7 are arranged into a binary tree
  (leaves 1, 2, 3, 4; internal nodes 5 and 6; root 7). Reduce: each internal node adds
  its children's values to its own (5+1+2 = 8, 6+3+4 = 13), then the root adds those
  (7+8+13 = 28). Broadcast: the total 28 is pushed back down, so every node ends
  holding 28.]
MPI-style AllReduce
  AllReduce = Reduce + Broadcast.
  Properties:

    1    Easily pipelined so no latency concerns.

    2    Bandwidth ≤ 6n.

    3    No need to rewrite code! (A toy sketch of the tree sum follows below.)
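
  A minimal sketch of the idea in code (a toy in-process version over an implicit
  binary tree; VW's real implementation moves data over sockets and pipelines large
  vectors, which this ignores):

      #include <cstddef>
      #include <vector>

      // Toy tree AllReduce of one value per node: sum values up the binary tree
      // (reduce), then push the total back down (broadcast). Node 0 is the root;
      // the children of node i are 2i+1 and 2i+2.
      void allreduce_sum(std::vector<double>& value) {
        std::size_t n = value.size();
        // Reduce: process nodes bottom-up so children already hold their subtree sums.
        for (std::size_t i = n; i-- > 0; ) {
          std::size_t l = 2 * i + 1, r = 2 * i + 2;
          if (l < n) value[i] += value[l];
          if (r < n) value[i] += value[r];
        }
        // Broadcast: copy the root's total back down the tree.
        for (std::size_t i = 0; i < n; ++i) {
          std::size_t l = 2 * i + 1, r = 2 * i + 2;
          if (l < n) value[l] = value[i];
          if (r < n) value[r] = value[i];
        }
      }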
An Example Algorithm: Weight averaging
  n = AllReduce(1)
  While (pass number < max)

    1   While (examples left)

          1   Do online update.

    2   AllReduce(weights)

    3   For each weight w ← w/n

  Other algorithms implemented:

    1   Nonuniform averaging for online learning

    2   Conjugate Gradient

    3   LBFGS
  (A sketch of the averaging loop follows below.)
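
  A sketch of the averaging scheme, simulating the nodes in one process (builds on the
  Example/online_pass sketch earlier; the per-node data shards are made up, and the
  AllReduce here is just an in-process average over the simulated nodes):

      #include <cstddef>
      #include <vector>

      // Parameter averaging across m simulated nodes: each node runs online updates
      // on its own shard, then AllReduce(weights)/n replaces every node's weights with
      // the average (sketch; a real cluster does this over the network).
      void averaged_online_training(std::vector<std::vector<float>>& node_weights,
                                    const std::vector<std::vector<Example>>& node_shards,
                                    float eta, int passes) {
        std::size_t m = node_weights.size(), d = node_weights[0].size();
        for (int p = 0; p < passes; ++p) {
          for (std::size_t k = 0; k < m; ++k)
            online_pass(node_weights[k], node_shards[k], eta);  // local online updates
          for (std::size_t i = 0; i < d; ++i) {                 // AllReduce + divide by n
            float sum = 0.f;
            for (std::size_t k = 0; k < m; ++k) sum += node_weights[k][i];
            for (std::size_t k = 0; k < m; ++k) node_weights[k][i] = sum / m;
          }
        }
      }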
What is Hadoop AllReduce?
   1   Map job moves program to data.

   2   Delayed initialization: Most failures are disk failures.
       First read (and cache) all data, before initializing
       allreduce. Failures autorestart on different node with
       identical data.

   3   Speculative execution: In a busy cluster, one node is
       often slow. Hadoop can speculatively start additional
       mappers. We use the first to finish reading all data once.
Approach Used
    1   Optimize hard so few data passes required.

          1   Normalized, adaptive, safe, online, gradient
              descent.
          2   L-BFGS
          3   Use (1) to warmstart (2).

    2   Use map-only Hadoop for process control and error
        recovery.

    3   Use AllReduce code to sync state.

    4   Always save input examples in a cache file to speed later
        passes.

    5   Use hashing trick to reduce input complexity.


  Open source in Vowpal Wabbit 6.1. Search for it.
Robustness & Speedup

  [Figure: speedup vs. number of nodes (10 to 100), showing the average, minimum, and
  maximum speedup over 10 runs (Average_10, Min_10, Max_10) against the ideal linear
  speedup line.]
Splice Site Recognition

  [Figure: auPRC vs. iteration, comparing Online, L-BFGS, L-BFGS w/ 1 online pass, and
  L-BFGS w/ 5 online passes; auPRC ranges from about 0.2 to 0.55.]
Splice Site Recognition

  [Figure: auPRC vs. effective number of passes over the data, comparing L-BFGS with
  one online pass against the parallel methods of Zinkevich et al. and Dekel et al.;
  auPRC ranges from 0 to about 0.6.]
To learn more

  The wiki has tutorials, examples, and help:
  https://github.com/JohnLangford/vowpal_wabbit/wiki


  Mailing List: vowpal_wabbit@yahoo.com


  Various discussion: the Machine Learning (Theory) blog at
  http://hunch.net
Bibliography: Original VW
Caching L. Bottou. Stochastic Gradient Descent Examples on
         Toy Problems,
         http://leon.bottou.org/projects/sgd,        2007.

                                      http:
Release Vowpal Wabbit open source project,
         //github.com/JohnLangford/vowpal_wabbit/wiki,
         2007.

Hashing Q. Shi, J. Petterson, G. Dror, J. Langford, A. Smola,
         and S. V. N. Vishwanathan, Hash Kernels for Structured
         Data, AISTATS 2009.

Hashing K. Weinberger, A. Dasgupta, J. Langford, A. Smola, and
         J. Attenberg, Feature Hashing for Large Scale Multitask
         Learning, ICML 2009.
Bibliography: Algorithmics
L-BFGS J. Nocedal, Updating Quasi-Newton Matrices with
         Limited Storage, Mathematics of Computation
         35:773-782, 1980.

Adaptive H. B. McMahan and M. Streeter, Adaptive Bound
         Optimization for Online Convex Optimization, COLT
         2010.

Adaptive J. Duchi, E. Hazan, and Y. Singer, Adaptive Subgradient
         Methods for Online Learning and Stochastic
         Optimization, COLT 2010.

    Safe N. Karampatziakis, and J. Langford, Online Importance
         Weight Aware Updates, UAI 2011.
Bibliography: Parallel
grad sum C. Teo, Q. Le, A. Smola, S. V. N. Vishwanathan, A Scalable
          Modular Convex Solver for Regularized Risk
          Minimization, KDD 2007.

   avg. 1 G. Mann et al., Efficient large-scale distributed training
          of conditional maximum entropy models, NIPS 2009.

   avg. 2 K. Hall, S. Gilpin, and G. Mann, MapReduce/Bigtable
          for Distributed Optimization, LCCC 2010.

  ov. avg M. Zinkevich, M. Weimer, A. Smola, and L. Li,
          Parallelized Stochastic Gradient Descent, NIPS 2010.

P. online D. Hsu, N. Karampatziakis, J. Langford, and A. Smola,
          Parallel Online Learning, in SUML 2010.

D. Mini 1 O. Dekel, R. Gilad-Bachrach, O. Shamir, and L. Xiao,
          Optimal Distributed Online Prediction Using Mini-Batches,
          http://arxiv.org/abs/1012.1367
Vowpal Wabbit Goals for Future
Development
   1   Native learning reductions. Just like more complicated
       losses. In development now.

   2   Librarification, so people can use VW in their favorite
       language.

   3   Other learning algorithms, as interest dictates.

   4   Various further optimizations. (Allreduce can be
       improved by a factor of 3...)
Reductions
Goal: minimize loss ℓ on D.

  Transform D into D′ and hand D′ to an algorithm that optimizes the 0/1 loss,
  obtaining a hypothesis h.

  Transform h with small ℓ_{0/1}(h, D′) into R_h with small ℓ(R_h, D),

  such that if h does well on (D′, ℓ_{0/1}), R_h is guaranteed to do
  well on (D, ℓ).
The transformation


  R = transformer from complex example to simple example.

  R^{-1} = transformer from simple predictions to complex
  prediction.
example: One Against All

  Create k binary regression problems, one per class.
  For class i predict "Is the label i or not?"

                 (x, y) −→ { (x, 1(y = 1)), (x, 1(y = 2)), . . . , (x, 1(y = k)) }

  Multiclass prediction: evaluate all the classifiers and choose
  the largest scoring label. (A toy sketch of the reduction follows below.)
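
  A toy sketch of the reduction against a generic base learner (illustrative only;
  VW's oaa.cc realizes the same idea through the learn()/base_l interface shown on the
  next slide, not through this class; the train/score methods of BaseLearner are
  assumed here):

      #include <cstddef>
      #include <vector>

      // One-against-all over any base learner exposing train(x, target) and score(x).
      template <class BaseLearner>
      struct OneAgainstAll {
        std::vector<BaseLearner> learners;                 // one binary problem per class
        explicit OneAgainstAll(std::size_t k) : learners(k) {}

        void train(const std::vector<float>& x, std::size_t label) {
          for (std::size_t i = 0; i < learners.size(); ++i)
            learners[i].train(x, i == label ? 1.f : 0.f);  // (x, 1(y = i))
        }
        std::size_t predict(const std::vector<float>& x) const {
          std::size_t best = 0;
          for (std::size_t i = 1; i < learners.size(); ++i)
            if (learners[i].score(x) > learners[best].score(x)) best = i;
          return best;                                     // largest scoring label
        }
      };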
The code: oaa.cc
  // parses reduction-specific flags.
  void parse_flags(size_t s, void (*base_l)(example*), void (*base_f)())


  // Implements R and R^{-1} using base_l.
  void learn(example* ec)


  // Cleans any temporary state and calls base_f.
  void finish()


  The important point: anything fitting this interface is easy to
  code in VW now, including all forms of feature diddling and
  creation.
  And reductions inherit all the
  input/output/optimization/parallelization of VW!
Reductions implemented
   1   One-Against-All (--oaa k). The baseline multiclass
       reduction. (Usage sketch below.)

   2   Cost Sensitive One-Against-All (--csoaa k).
       Predicts cost of each label and minimizes the cost.

   3   Weighted All-Pairs (--wap k). An alternative to
       csoaa with better theory.

   4   Cost Sensitive One-Against-All with Label Dependent
       Features (--csoaa_ldf). As csoaa, but features not
       shared between labels.

   5   WAP with Label Dependent Features (--wap_ldf).

   6   Sequence Prediction (--sequence k). A simple
       implementation of Searn and Dagger for sequence
       prediction. Uses cost sensitive predictor.
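
  A hedged usage sketch for the first of these (the --oaa flag and the 1..k multiclass
  label convention are standard VW; the file names are made up). With multiclass.vw
  containing lines like "2 | color:blue size:2" where labels range over 1..k:

      vw --oaa 3 multiclass.vw -f mc.model
      vw -t -i mc.model multiclass.vw -p preds.txt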
Reductions to Implement

  [Diagram: a map of regret-transform reductions annotated with regret multipliers,
  including AUC Ranking via Quicksort, Classification via Costing, Quantile Regression
  via Quanting, Mean Regression via Probing, IW Classification, k-Partial Label via the
  Offset Tree, k-Classification via ECT and PECOC, k-cost Classification via the Filter
  Tree, k-way Regression, and T-step RL with state visitation (PSDP) or with a
  demonstration policy (Searn); question marks mark open problems for dynamic models and
  unsupervised learning by self prediction.]
 
Introduction to Nonprofit Accounting: The Basics
Introduction to Nonprofit Accounting: The BasicsIntroduction to Nonprofit Accounting: The Basics
Introduction to Nonprofit Accounting: The Basics
 
mini mental status format.docx
mini    mental       status     format.docxmini    mental       status     format.docx
mini mental status format.docx
 
Accessible design: Minimum effort, maximum impact
Accessible design: Minimum effort, maximum impactAccessible design: Minimum effort, maximum impact
Accessible design: Minimum effort, maximum impact
 
POINT- BIOCHEMISTRY SEM 2 ENZYMES UNIT 5.pptx
POINT- BIOCHEMISTRY SEM 2 ENZYMES UNIT 5.pptxPOINT- BIOCHEMISTRY SEM 2 ENZYMES UNIT 5.pptx
POINT- BIOCHEMISTRY SEM 2 ENZYMES UNIT 5.pptx
 
BAG TECHNIQUE Bag technique-a tool making use of public health bag through wh...
BAG TECHNIQUE Bag technique-a tool making use of public health bag through wh...BAG TECHNIQUE Bag technique-a tool making use of public health bag through wh...
BAG TECHNIQUE Bag technique-a tool making use of public health bag through wh...
 
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptx
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptxSOCIAL AND HISTORICAL CONTEXT - LFTVD.pptx
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptx
 
Sanyam Choudhary Chemistry practical.pdf
Sanyam Choudhary Chemistry practical.pdfSanyam Choudhary Chemistry practical.pdf
Sanyam Choudhary Chemistry practical.pdf
 
The Most Excellent Way | 1 Corinthians 13
The Most Excellent Way | 1 Corinthians 13The Most Excellent Way | 1 Corinthians 13
The Most Excellent Way | 1 Corinthians 13
 

Technical Tricks of Vowpal Wabbit

    2   It's RAM efficient. You need to store only one example
        in RAM rather than all of them. ⇒ Entirely new scales
        of data are possible.
    3   Online Learning algorithm = Online Optimization
        Algorithm. Online Learning Algorithms ⇒ the ability to
        solve entirely new categories of applications.
    4   Online Learning = ability to deal with drifting
        distributions.
Implicit Outer Product

  Sometimes you care about the interaction of two sets of
  features (ad features x query features, news features x
  user features, etc...). Choices:
    1   Expand the set of features explicitly, consuming n²
        disk space.
    2   Expand the features dynamically in the core of your
        learning algorithm.
  Option (2) is x10 faster. You need to be comfortable with
  hashes first; a sketch follows.
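  A minimal sketch of option (2), under assumptions: a generic string hash
  and a made-up mixing constant stand in for VW's actual hashing, the two
  feature groups are passed as (name, value) lists, and the weight table
  size is illustrative. It only shows crossing features on the fly against
  a fixed-size weight vector instead of materializing n² features on disk.

      // Sketch: hash the outer product of two feature groups dynamically.
      #include <cstddef>
      #include <functional>
      #include <iostream>
      #include <string>
      #include <utility>
      #include <vector>

      const size_t BITS = 18;                   // 2^18 weights (illustrative)
      const size_t MASK = (1u << BITS) - 1;

      size_t h(const std::string& s) {          // stand-in hash function
        return std::hash<std::string>{}(s) & MASK;
      }

      // Score an example given two feature groups (e.g., ad x query)
      // without ever writing the crossed features out.
      double predict(const std::vector<std::pair<std::string,double>>& a,
                     const std::vector<std::pair<std::string,double>>& b,
                     const std::vector<double>& w) {
        double p = 0;
        for (const auto& fa : a)
          for (const auto& fb : b) {
            // cross the two feature names on the fly
            size_t idx = (h(fa.first) * 2654435761u + h(fb.first)) & MASK;
            p += w[idx] * fa.second * fb.second;
          }
        return p;
      }

      int main() {
        std::vector<double> w(1u << BITS, 0.0);
        std::vector<std::pair<std::string,double>> ad    = {{"ad_sports", 1.0}};
        std::vector<std::pair<std::string,double>> query = {{"q_football", 1.0}};
        std::cout << predict(ad, query, w) << "\n";   // 0 before any learning
      }

  The point of the design is that RAM holds only the fixed weight table;
  the crossed feature index is recomputed per example.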
The tricks

      Basic VW         Newer Algorithmics       Parallel Stuff

   Feature Caching     Adaptive Learning    Parameter Averaging
   Feature Hashing     Importance Updates    Nonuniform Average
   Online Learning      Dim. Correction       Gradient Summing
   Implicit Features        L-BFGS            Hadoop AllReduce
                        Hybrid Learning

  Next: algorithmics.
Adaptive Learning [DHS10,MS10]

  For example t, let  g_{i,t} = 2(\hat{y}_t - y_t)\, x_{i,t}.

  New update rule:

      w_i \leftarrow w_i - \eta \, \frac{g_{i,t+1}}{\sqrt{\sum_{t'=1}^{t+1} g_{i,t'}^2}}

  Common features stabilize quickly. Rare features can have
  large updates. A sketch of this per-feature rule follows.
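  A minimal sketch of the adaptive per-feature update above, with
  hypothetical names; the real VW learner also handles importance weights,
  normalization, and feature hashing, none of which appear here.

      // Each weight keeps the running sum of its squared gradients and is
      // scaled by the inverse square root of that sum (per-feature step).
      #include <cmath>
      #include <cstddef>
      #include <utility>
      #include <vector>

      struct AdaptiveLearner {
        std::vector<double> w, g2;   // weights and per-feature sum of squared gradients
        double eta;
        AdaptiveLearner(size_t n, double eta) : w(n, 0.0), g2(n, 0.0), eta(eta) {}

        // sparse example: (index, value) pairs
        double predict(const std::vector<std::pair<size_t,double>>& x) const {
          double p = 0;
          for (const auto& f : x) p += w[f.first] * f.second;
          return p;
        }

        void update(const std::vector<std::pair<size_t,double>>& x, double y) {
          double err = predict(x) - y;                    // squared-loss residual
          for (const auto& f : x) {
            double g = 2.0 * err * f.second;              // g_{i,t}
            g2[f.first] += g * g;
            w[f.first] -= eta * g / std::sqrt(g2[f.first]);  // adaptive step
          }
        }
      };

  Frequently seen features accumulate a large g2 and take small steps;
  rare features keep a small g2 and can still move a lot.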
Learning with importance weights [KL11]

  [Figure sequence: starting from prediction w_t·x and label y,
  a single gradient step −η(...)x moves the new prediction
  w_{t+1}·x toward y. Treating an importance weight of 6 as a
  6×-scaled step, −6η(...)x, can overshoot y entirely (where
  should w_{t+1}·x land?). Falling back to the unscaled step
  −η(...)x simply ignores the importance weight. The
  importance-aware update instead changes the prediction by
  s(h)·||x||², a closed-form step that accounts for the
  importance weight h without ever crossing y.]
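  A sketch of why a closed-form step exists, assuming the squared-loss
  gradient 2(\hat{y}-y)x used earlier in the deck; the general treatment for
  other losses is in [KL11]. Treat an importance weight h as the limit of
  many infinitesimally small updates:

      \[
      \frac{d\,(w_t\cdot x)}{dt} = -2\eta\,(w_t\cdot x - y)\,\|x\|^2
      \;\Rightarrow\;
      w_{t+h}\cdot x - y = (w_t\cdot x - y)\,e^{-2\eta h\|x\|^2},
      \]

  so a single update

      \[
      w_{t+1} = w_t + s(h)\,x,
      \qquad
      s(h) = \frac{(y - w_t\cdot x)\bigl(1 - e^{-2\eta h\|x\|^2}\bigr)}{\|x\|^2}
      \]

  moves the prediction toward y by the importance-weighted amount and can
  never cross it, unlike the naive h-times-scaled gradient step.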
Robust results for unweighted problems

  [Four scatter plots comparing the standard update (y-axis)
  against the importance-aware update (x-axis): astro with
  logistic loss, spam with quantile loss, rcv1 with squared
  loss, and webspam with hinge loss.]
Dimensional Correction

  Gradient of squared loss:

      \frac{\partial (f_w(x)-y)^2}{\partial w_i} = 2\,(f_w(x)-y)\,x_i,

  and change weights in the negative gradient direction:

      w_i \leftarrow w_i - \eta\, \frac{\partial (f_w(x)-y)^2}{\partial w_i}

  But the gradient has intrinsic problems. w_i naturally has
  units of 1/x_i, since doubling x_i implies halving w_i to get
  the same prediction. ⇒ The update rule has mixed units!

  A crude fix: divide the update by \sum_i x_i^2. It helps much!
  This is scary! The problem optimized is

      \min_w \sum_{x,y} \frac{(f_w(x)-y)^2}{\sum_i x_i^2}

  rather than \min_w \sum_{x,y} (f_w(x)-y)^2. But it works.
  A sketch of the corrected update follows.
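  A minimal sketch of the crude fix on top of the plain online squared-loss
  update; the names and the dense weight vector are illustrative, not VW's
  internals.

      // The gradient step is divided by sum_i x_i^2 so its size is unit-free.
      #include <cstddef>
      #include <utility>
      #include <vector>

      using Example = std::vector<std::pair<size_t,double>>;  // sparse (index, value)

      void corrected_update(std::vector<double>& w, const Example& x,
                            double y, double eta) {
        double pred = 0, norm2 = 0;
        for (const auto& f : x) {
          pred  += w[f.first] * f.second;
          norm2 += f.second * f.second;                 // sum_i x_i^2
        }
        if (norm2 == 0) return;
        double scale = eta * 2.0 * (y - pred) / norm2;  // gradient step / ||x||^2
        for (const auto& f : x) w[f.first] += scale * f.second;
      }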
LBFGS [Nocedal80]

  Batch(!) second order algorithm. Core idea = efficient
  approximate Newton step.

      H_{ij} = \frac{\partial^2 (f_w(x)-y)^2}{\partial w_i \,\partial w_j} = Hessian.

  Newton step: w \rightarrow w + H^{-1} g.

  Newton fails: you can't even represent H. Instead build up an
  approximate inverse Hessian according to

      \frac{\Delta_w \Delta_w^{\top}}{\Delta_w^{\top}\Delta_g},

  where \Delta_w is a change in weights and \Delta_g is a change
  in the loss gradient g. (A note on why this rank-one form is
  the right one follows.)
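  One step the slide leaves implicit, sketched here: the rank-one term is
  chosen so the approximate inverse Hessian reproduces the observed pair of
  changes (a secant condition). For small steps,

      \[
      \Delta_g \approx H\,\Delta_w
      \;\Rightarrow\;
      H^{-1}\Delta_g \approx \Delta_w,
      \qquad
      \left(\frac{\Delta_w \Delta_w^{\top}}{\Delta_w^{\top}\Delta_g}\right)\Delta_g
      = \Delta_w\,\frac{\Delta_w^{\top}\Delta_g}{\Delta_w^{\top}\Delta_g}
      = \Delta_w.
      \]

  So the rank-one term acts like H^{-1} on the one direction actually
  observed; keeping only the most recent (\Delta_w, \Delta_g) pairs costs
  O(md) memory for m pairs and d weights instead of the O(d^2) needed to
  store H.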
Hybrid Learning

  Online learning is GREAT for getting to a good solution fast.
  LBFGS is GREAT for getting a perfect solution.

  Use Online Learning, then LBFGS.

  [Two plots of auPRC vs. iteration comparing Online, L-BFGS
  with 5 online passes, L-BFGS with 1 online pass, and plain
  L-BFGS.]
The tricks

      Basic VW         Newer Algorithmics       Parallel Stuff

   Feature Caching     Adaptive Learning    Parameter Averaging
   Feature Hashing     Importance Updates    Nonuniform Average
   Online Learning      Dim. Correction       Gradient Summing
   Implicit Features        L-BFGS            Hadoop AllReduce
                        Hybrid Learning

  Next: Parallel.
Applying for a fellowship in 1997

  Interviewer: So, what do you want to do?
  John: I'd like to solve AI.
  I: How?
  J: I want to use parallel learning algorithms to create
     fantastic learning machines!
  I: You fool! The only thing parallel machines are good for
     is computational windtunnels!

  The worst part: he had a point.
Terascale Linear Learning ACDL11

  Given 2.1 Terafeatures of data, how can you learn a good
  linear predictor f_w(x) = \sum_i w_i x_i ?

  2.1T sparse features
  17B Examples
  16M parameters
  1K nodes

  70 minutes = 500M features/second: faster than the I/O
  bandwidth of a single machine ⇒ we beat all possible single
  machine linear learning algorithms.
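  The throughput figure is just the problem size divided by wall-clock time:

      \[
      \frac{2.1\times 10^{12}\ \text{features}}{70\ \text{min}\times 60\ \text{s/min}}
      = \frac{2.1\times 10^{12}}{4200\ \text{s}}
      = 5\times 10^{8}\ \text{features/s}.
      \]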
Compare: Other Supervised Algorithms in Parallel Learning book

  [Chart of speed per method (features/s, log scale from 100 to
  1e+09), with single-machine and parallel learners marked:
  RBF-SVM MPI?-500 (RCV1), Ensemble Tree MPI-128 (Synthetic),
  RBF-SVM TCP-48 (MNIST 220K), Decision Tree MapRed-200
  (Ad-Bounce #), Boosted DT MPI-32 (Ranking #), Linear
  Threads-2 (RCV1), Linear Hadoop+TCP-1000 (Ads *).]
MPI-style AllReduce

  [Figure sequence: seven nodes hold the values 7, 5, 6, 1, 2,
  3, 4, arranged as a binary tree (root 7; children 5, 6;
  leaves 1, 2, 3, 4). Reducing: each internal node adds in its
  children's sums (step 1 gives 8 and 13; step 2 gives the
  total 7 + 8 + 13 = 28 at the root). Broadcast: the total 28
  is passed back down until every node holds 28.]

  AllReduce = Reduce + Broadcast

  Properties:
    1   Easily pipelined so no latency concerns.
    2   Bandwidth ≤ 6n.
    3   No need to rewrite code!

  A simulated pass over this example tree is sketched below.
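  A self-contained sketch that simulates the reduce-up / broadcast-down pass
  from the figures on a 7-node binary tree, using the slide's node values so
  every node ends with 28. A real deployment would move these partial sums
  over sockets between machines; this is only the arithmetic.

      // Simulated AllReduce (sum) on a complete binary tree of 7 nodes.
      #include <iostream>
      #include <vector>

      int main() {
        // Tree stored in heap order: node i has children 2i+1 and 2i+2.
        std::vector<double> value = {7, 5, 6, 1, 2, 3, 4};
        std::vector<double> partial = value;

        // Reduce: children add their partial sums into their parent.
        for (int i = (int)partial.size() - 1; i > 0; --i)
          partial[(i - 1) / 2] += partial[i];

        // Broadcast: the root's total flows back down to every node.
        std::vector<double> total(partial.size(), partial[0]);

        for (size_t i = 0; i < total.size(); ++i)
          std::cout << "node " << i << ": " << total[i] << "\n";  // all print 28
      }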
An Example Algorithm: Weight averaging

  n = AllReduce(1)
  While (pass number < max)
    1   While (examples left)
          1   Do online update.
    2   AllReduce(weights)
    3   For each weight w ← w / n

  Other algorithms implemented:
    1   Nonuniform averaging for online learning
    2   Conjugate Gradient
    3   LBFGS

  A sketch of the averaging pass in code follows.
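  A minimal sketch of the weight-averaging pass above, written against a
  hypothetical allreduce_sum() helper (stubbed here as a single-node no-op);
  in a cluster that helper would be the tree AllReduce from the previous
  section.

      #include <cstddef>
      #include <utility>
      #include <vector>

      using Features = std::vector<std::pair<size_t,double>>;  // sparse (index, value)
      using Example  = std::pair<Features,double>;             // (features, label)

      // Hypothetical stand-in: on a cluster this sums v element-wise across
      // all nodes; on one node it is the identity.
      void allreduce_sum(std::vector<double>& v) { (void)v; }

      void averaged_pass(std::vector<double>& w,
                         const std::vector<Example>& shard, double eta) {
        // n = number of participating nodes, obtained by AllReduce(1).
        std::vector<double> one = {1.0};
        allreduce_sum(one);
        const double n = one[0];

        // Local online pass over this node's shard (plain squared-loss updates).
        for (const Example& ex : shard) {
          double pred = 0;
          for (const auto& f : ex.first) pred += w[f.first] * f.second;
          for (const auto& f : ex.first)
            w[f.first] += eta * 2.0 * (ex.second - pred) * f.second;
        }

        // AllReduce(weights), then divide each weight by n.
        allreduce_sum(w);
        for (double& wi : w) wi /= n;
      }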
What is Hadoop AllReduce?

  [Diagram: Program → Data]

    1   Map job moves program to data.
    2   Delayed initialization: Most failures are disk failures.
        First read (and cache) all data, before initializing
        allreduce. Failures autorestart on a different node with
        identical data.
    3   Speculative execution: In a busy cluster, one node is
        often slow. Hadoop can speculatively start additional
        mappers. We use the first to finish reading all data once.
Approach Used

    1   Optimize hard so few data passes are required.
          1   Normalized, adaptive, safe, online, gradient descent.
          2   L-BFGS
          3   Use (1) to warmstart (2).
    2   Use map-only Hadoop for process control and error
        recovery.
    3   Use AllReduce code to sync state.
    4   Always save input examples in a cachefile to speed later
        passes.
    5   Use hashing trick to reduce input complexity.

  Open source in Vowpal Wabbit 6.1. Search for it.
Robustness & Speedup

  [Plot: speedup vs. number of nodes (10 to 100), showing the
  average, minimum, and maximum speedup over 10 runs against a
  linear-speedup reference.]
Splice Site Recognition

  [Plot: auPRC vs. iteration (0 to 50) for Online, L-BFGS with
  5 online passes, L-BFGS with 1 online pass, and L-BFGS.]
Splice Site Recognition

  [Plot: auPRC vs. effective number of passes over the data
  (0 to 20) for L-BFGS with one online pass, Zinkevich et al.,
  and Dekel et al.]
To learn more

  The wiki has tutorials, examples, and help:
  https://github.com/JohnLangford/vowpal_wabbit/wiki

  Mailing List: vowpal_wabbit@yahoo.com

  Various discussion: http://hunch.net  Machine Learning
  (Theory) blog
Bibliography: Original VW

  Caching  L. Bottou, Stochastic Gradient Descent Examples on Toy
           Problems, http://leon.bottou.org/projects/sgd, 2007.
  Release  Vowpal Wabbit open source project,
           http://github.com/JohnLangford/vowpal_wabbit/wiki, 2007.
  Hashing  Q. Shi, J. Petterson, G. Dror, J. Langford, A. Smola, and
           S.V.N. Vishwanathan, Hash Kernels for Structured Data,
           AISTATS 2009.
  Hashing  K. Weinberger, A. Dasgupta, J. Langford, A. Smola, and
           J. Attenberg, Feature Hashing for Large Scale Multitask
           Learning, ICML 2009.

Bibliography: Algorithmics

  L-BFGS   J. Nocedal, Updating Quasi-Newton Matrices with Limited
           Storage, Mathematics of Computation 35:773–782, 1980.
  Adaptive H. B. McMahan and M. Streeter, Adaptive Bound Optimization
           for Online Convex Optimization, COLT 2010.
  Adaptive J. Duchi, E. Hazan, and Y. Singer, Adaptive Subgradient
           Methods for Online Learning and Stochastic Optimization,
           COLT 2010.
  Safe     N. Karampatziakis and J. Langford, Online Importance
           Weight Aware Updates, UAI 2011.

Bibliography: Parallel

  grad sum  C. Teo, Q. Le, A. Smola, and S.V.N. Vishwanathan, A
            Scalable Modular Convex Solver for Regularized Risk
            Minimization, KDD 2007.
  avg. 1    G. Mann et al., Efficient Large-Scale Distributed Training
            of Conditional Maximum Entropy Models, NIPS 2009.
  avg. 2    K. Hall, S. Gilpin, and G. Mann, MapReduce/Bigtable for
            Distributed Optimization, LCCC 2010.
  ov. avg   M. Zinkevich, M. Weimer, A. Smola, and L. Li, Parallelized
            Stochastic Gradient Descent, NIPS 2010.
  P. online D. Hsu, N. Karampatziakis, J. Langford, and A. Smola,
            Parallel Online Learning, in SUML 2010.
  D. Mini 1 O. Dekel, R. Gilad-Bachrach, O. Shamir, and L. Xiao,
            Optimal Distributed Online Prediction Using Mini-Batches,
            http://arxiv.org/abs/1012.1367
Vowpal Wabbit Goals for Future Development

    1   Native learning reductions. Just like more complicated
        losses. In development now.
    2   Librarification, so people can use VW in their favorite
        language.
    3   Other learning algorithms, as interest dictates.
    4   Various further optimizations. (Allreduce can be improved
        by a factor of 3...)
Reductions

  Goal: minimize ℓ on D.

    Transform D into D′.
    Run an algorithm for optimizing ℓ_{0/1} to get h.
    Transform h with small ℓ_{0/1}(h, D′) into R_h with small
    ℓ(R_h, D),

  such that if h does well on (D′, ℓ_{0/1}), R_h is guaranteed
  to do well on (D, ℓ).
The transformation

  R = transformer from complex examples to simple examples.
  R^{-1} = transformer from simple predictions to a complex
  prediction.
example: One Against All

  Create k binary regression problems, one per class. For
  class i predict Is the label i or not?

      (x, y) → { (x, 1(y = 1)), (x, 1(y = 2)), ..., (x, 1(y = k)) }

  Multiclass prediction: evaluate all the classifiers and
  choose the largest scoring label. A sketch of both directions
  of this reduction follows.
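  A minimal sketch of One-Against-All as a reduction: R turns one multiclass
  example into k binary examples, and R^{-1} turns k scores into a class
  prediction. binary_learn/binary_score below are hypothetical stand-ins for
  the base binary learner, not VW's actual interface.

      #include <cstddef>
      #include <functional>
      #include <utility>
      #include <vector>

      using Features = std::vector<std::pair<size_t,double>>;

      // Hypothetical base-learner hooks: anything fitting this shape would do.
      using BinaryLearn = std::function<void(size_t /*problem i*/, const Features&,
                                             double /*label +1 or -1*/)>;
      using BinaryScore = std::function<double(size_t /*problem i*/, const Features&)>;

      // R: feed class i's binary problem the label "is y == i ?".
      void oaa_learn(size_t k, const Features& x, size_t y, BinaryLearn learn) {
        for (size_t i = 1; i <= k; ++i)
          learn(i, x, y == i ? 1.0 : -1.0);
      }

      // R^{-1}: evaluate all k classifiers and return the largest-scoring label.
      size_t oaa_predict(size_t k, const Features& x, BinaryScore score) {
        size_t best = 1;
        double best_score = score(1, x);
        for (size_t i = 2; i <= k; ++i) {
          double s = score(i, x);
          if (s > best_score) { best_score = s; best = i; }
        }
        return best;
      }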
The code: oaa.cc

  // parses reduction-specific flags.
  void parse_flags(size_t s, void (*base_l)(example*),
                   void (*base_f)())

  // Implements R and R^{-1} using base_l.
  void learn(example* ec)

  // Cleans any temporary state and calls base_f.
  void finish()

  The important point: anything fitting this interface is easy
  to code in VW now, including all forms of feature diddling
  and creation. And reductions inherit all the
  input/output/optimization/parallelization of VW!
Reductions implemented

    1   One-Against-All (--oaa k). The baseline multiclass
        reduction.
    2   Cost Sensitive One-Against-All (--csoaa k). Predicts the
        cost of each label and minimizes the cost.
    3   Weighted All-Pairs (--wap k). An alternative to csoaa
        with better theory.
    4   Cost Sensitive One-Against-All with Label Dependent
        Features (--csoaa_ldf). As csoaa, but features are not
        shared between labels.
    5   WAP with Label Dependent Features (--wap_ldf).
    6   Sequence Prediction (--sequence k). A simple
        implementation of Searn and Dagger for sequence
        prediction. Uses a cost sensitive predictor.
Reductions to Implement

  [Diagram: the regret-transform reduction map. Problems include
  AUC/Ranking, Classification, IW Classification, Quantile
  Regression, Mean Regression, k-Partial Label,
  k-Classification, k-way Regression, k-cost Classification,
  PSDP, T-step RL with State Visitation, T-step RL with
  Demonstration Policy, Dynamic Models, and Unsupervised by Self
  Prediction. Named reductions connecting them include
  Quicksort, Costing, Quanting, Probing, ECT, PECOC, Offset
  Tree, Filter Tree, and Searn, each annotated with a regret
  multiplier such as 1, 4, k−1, k/2, Tk, Tk ln T, or ??.]