Matrix Computations in Machine Learning

                                Inderjit S. Dhillon
                           University of Texas at Austin



    International Conference on Machine Learning (ICML)
                        June 18, 2009




Outline

   Overview of Matrix Computations

   Matrix Computations in Machine Learning
        Traditional — Regression, PCA, Spectral Graph Partitioning, etc.
        Specialized — Lasso, NMF, Co-clustering, Kernel Learning, etc.

   Resources

   Discussion




Matrix Computations
   Linear Systems of Equations
        Linear Least Squares
        Dense versus Sparse — Gaussian Elimination versus Iterative Methods
        Further Structure (Toeplitz, Hankel, Low Displacement Rank, ...)

   Eigenvalue Problems
        SVD
        Dense versus Sparse — “Direct” Methods versus Iterative Methods
        Symmetric versus Non-Symmetric
        Further Structure (Tridiagonal, Banded)




Computational/Accuracy Issues

   Mathematician versus Numerical Analyst

        Mathematician: To solve Ax = b, compute A^{-1}, and set x = A^{-1} b

        Numerical Analyst: Use A = PLU (Gaussian Elimination with Partial
        Pivoting) to solve for x

        Mathematician: To solve for eigenvalues of A, form characteristic
        polynomial det(λI − A) and solve for its roots

        Numerical Analyst: Use orthogonal transformations to transform A to
        either tridiagonal form or Hessenberg form, and then solve for
        eigenvalues of the reduced form (both contrasts are sketched below)
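
A minimal sketch of both contrasts, assuming NumPy/SciPy; the Vandermonde system is just an illustrative ill-conditioned example.

```python
import numpy as np
from scipy.linalg import lu_factor, lu_solve, hessenberg

rng = np.random.default_rng(0)
A = np.vander(np.linspace(1, 2, 8), increasing=True)  # mildly ill-conditioned
b = rng.standard_normal(8)

# "Mathematician": explicit inverse (more work, typically larger residual)
x_inv = np.linalg.inv(A) @ b
# "Numerical Analyst": A = PLU via Gaussian elimination with partial pivoting
x_lu = lu_solve(lu_factor(A), b)
print(np.linalg.norm(A @ x_inv - b), np.linalg.norm(A @ x_lu - b))

# Eigenvalues: reduce A to Hessenberg form by orthogonal similarity
# transformations, then solve the reduced problem (the QR-iteration route),
# rather than forming the characteristic polynomial and finding its roots.
H = hessenberg(A)
print(np.sort(np.linalg.eigvals(H)))
```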




Key Issues in Matrix Computations
   Computation Time
        Factors of Two Matter! GEPP preferred over GECP.
        Operation Counts
        Memory Hierarchy (algorithms based on BLAS3 perform better)
        LAPACK versus LINPACK
   Accuracy
        Algorithms in Floating Point can have surprising behavior
        Examples – Eigenvalues through the Characteristic Polynomial,
        Gram-Schmidt Orthogonalization, Loss of Orthogonality in Lanczos
        Algorithms (the Gram-Schmidt case is sketched after this list)
        Forward Stability usually not possible
        Backward Error Analysis – pioneered by Wilkinson in the 1950s and 1960s
        Need to understand Floating Point Arithmetic
   Structure
        Dense versus Sparse, Tridiagonal versus Banded, Toeplitz, Hankel, ...
        Has big impact on computation time and accuracy issues
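
One of the floating-point surprises above, as a sketch (NumPy assumed): classical Gram-Schmidt loses orthogonality on ill-conditioned columns, while Householder-based QR (what LAPACK uses) does not.

```python
import numpy as np

def classical_gram_schmidt(A):
    """Classical Gram-Schmidt: orthogonalize each column against the
    previously computed Q columns, then normalize."""
    Q = np.zeros_like(A)
    for j in range(A.shape[1]):
        v = A[:, j] - Q[:, :j] @ (Q[:, :j].T @ A[:, j])
        Q[:, j] = v / np.linalg.norm(v)
    return Q

rng = np.random.default_rng(1)
# Orthonormal basis times rapidly decaying scales => ill-conditioned columns
A = np.linalg.qr(rng.standard_normal((50, 12)))[0] @ np.diag(10.0 ** -np.arange(12))

Q_cgs = classical_gram_schmidt(A)
Q_house = np.linalg.qr(A)[0]                    # Householder-based QR
I = np.eye(12)
print(np.linalg.norm(Q_cgs.T @ Q_cgs - I))      # large: orthogonality lost
print(np.linalg.norm(Q_house.T @ Q_house - I))  # ~ machine epsilon
```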


Backward Error Analysis

   Consider Ax = b
   Suppose x̂ is the computed answer
   Question: How close is x̂ to x?
   Correct but Pessimistic Answer: x̂ can be wrong in all digits for even
   the best algorithm
   Backward Error Analysis to the Rescue:
        Show that the computed answer is EXACT for the perturbed problem:

                                (A + δA) x̂ = b + δb

        Conditioning: How does the solution change when the problem is perturbed?

                  ‖x̂ − x‖ / ‖x‖ ≈ cond(A) ( ‖δA‖/‖A‖ + ‖δb‖/‖b‖ )

        Separates Properties of Algorithm from Properties of Problem
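
A numerical sketch of that separation (NumPy assumed): the forward error of x̂ grows with cond(A), while the normwise backward error ‖Ax̂ − b‖ / (‖A‖‖x̂‖ + ‖b‖) stays near machine precision for a backward-stable solver.

```python
import numpy as np

A = np.vander(np.linspace(1, 2, 10), increasing=True)   # cond(A) is large
x_true = np.ones(10)
b = A @ x_true

x_hat = np.linalg.solve(A, b)   # LAPACK: Gaussian elimination, partial pivoting

forward = np.linalg.norm(x_hat - x_true) / np.linalg.norm(x_true)
backward = (np.linalg.norm(A @ x_hat - b)
            / (np.linalg.norm(A) * np.linalg.norm(x_hat) + np.linalg.norm(b)))
print(f"cond(A)        = {np.linalg.cond(A):.2e}")
print(f"forward error  = {forward:.2e}")    # can be large: property of the problem
print(f"backward error = {backward:.2e}")   # ~ 1e-16: property of the algorithm
```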

Matrix Computations in Machine Learning




Regression

   Given: Training data (x_i, y_i), i = 1, 2, . . . , N, where x_i ∈ R^d, y_i ∈ R
   Goal: Find Linear Predictor: β_0 + β_1 x_1 + · · · + β_d x_d, where
   x = [x_1, x_2, . . . , x_d]^T is an unseen data point
   Formulation: Linear Least Squares Problem:

                              Solve  min_β ‖Xβ − y‖_2

   Ridge Regression:

                              Solve  min_β ‖Xβ − y‖_2^2 + λ‖β‖_2^2

   Algorithms (in order of increasing cost and increasing accuracy; sketched below):
        Normal Equations
        QR decomposition
        SVD
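
A sketch of the three algorithms (NumPy assumed; the data here is synthetic and only illustrative).

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.standard_normal((100, 5))
y = rng.standard_normal(100)
lam = 0.1

# 1. Normal equations: solve (X^T X) beta = X^T y  (squares the condition number)
beta_ne = np.linalg.solve(X.T @ X, X.T @ y)

# 2. QR decomposition: X = QR, then solve the triangular system R beta = Q^T y
Q, R = np.linalg.qr(X)
beta_qr = np.linalg.solve(R, Q.T @ y)

# 3. SVD: X = U S V^T, then beta = V S^{-1} U^T y (also handles rank deficiency)
U, s, Vt = np.linalg.svd(X, full_matrices=False)
beta_svd = Vt.T @ ((U.T @ y) / s)

# Ridge regression via the regularized normal equations
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(5), X.T @ y)
```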

PCA

  Given: x_i, i = 1, 2, . . . , N, where x_i ∈ R^d
  Goal: Find Direction(s) of Largest Variance
  Formulation:

                Solve  max_{Y^T Y = I_k}  trace(Y^T (X − me^T)(X − me^T)^T Y)

  Solution: Leading Eigenvectors of (X − me^T)(X − me^T)^T
  (Equivalently, leading left singular vectors of X − me^T — sketched in code below)
  Algorithms when X is dense:
        QR Algorithm (eig in MATLAB)
        Divide & Conquer Algorithm
        MR³ Algorithm (fastest)
  When X is sparse
        Lanczos and its Variants
  Applications: Dimensionality Reduction, Visualization, LSI, Eigenfaces,
  Eigengenes, Eigenducks, Eigenlips, ...
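
A sketch (NumPy assumed) of the singular-vector route: center the columns (X − me^T) and take leading left singular vectors.

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.standard_normal((5, 200))      # d = 5 features, N = 200 points as columns
k = 2

m = X.mean(axis=1, keepdims=True)      # mean vector m; e is the all-ones vector
Xc = X - m                             # X - m e^T
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
components = U[:, :k]                  # directions of largest variance
projected = components.T @ Xc          # k-dimensional representation

# For sparse X one would instead use a Lanczos-based solver,
# e.g. scipy.sparse.linalg.svds(Xc, k=k).
```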
Graph Partitioning/Clustering

       In many applications, goal is to partition nodes of a graph:




                          [Figure] High School Friendship Network
                          [James Moody, American Journal of Sociology, 2001]


Spectral Graph Partitioning

   Partitioning Objective: Minimize Cut subject to Equal Partition Sizes
   Applications in Circuit Layout and Parallel Computing
   Take a real relaxation of the Objective
   Globally optimal solution of the relaxed problem is given by
   eigenvectors of Graph Laplacian
   Post-process eigenvectors to obtain a discrete partitioning (a minimal
   sketch follows the references below)

   Long History
        [Hall, 1970] An r-dimensional quadratic placement algorithm, Manag Sci
        [Donath & Hoffman, 1972] Algorithms for partitioning of graphs and
        computer logic based on eigenvectors of connection matrices, IBM
        Techn. Disclosure Bull.
        [Fiedler, 1973] Algebraic connectivity of graphs, Czech Math Journal
        [Pothen, Simon & Liou, 1990] Partitioning sparse matrices with
        eigenvectors of graphs, SIAM J. Matrix Anal. Appl.
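
A minimal sketch (NumPy assumed) of two-way spectral partitioning: take the Fiedler vector of the graph Laplacian and round it by sign; the toy graph here is just two cliques joined by an edge.

```python
import numpy as np

# Example graph: two 4-cliques joined by one edge (adjacency matrix W)
W = np.zeros((8, 8))
W[:4, :4] = 1; W[4:, 4:] = 1
np.fill_diagonal(W, 0)
W[3, 4] = W[4, 3] = 1

D = np.diag(W.sum(axis=1))
L = D - W                           # (unnormalized) graph Laplacian

eigvals, eigvecs = np.linalg.eigh(L)
fiedler = eigvecs[:, 1]             # eigenvector of 2nd smallest eigenvalue
partition = fiedler > 0             # post-process: round by sign
print(partition)                    # separates the two cliques
```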


“Specialized” Matrix Computations




Matrix Computations in Machine Learning

Traditional Problems
    Regression, PCA, Spectral Graph Partitioning

“Specialized” Problems that arise in Machine Learning
    Intersection of Numerical Optimization & Matrix Computations
         Lasso – L1-regularized least squares
         Non-negative Matrix Factorization (NMF)
         Sparse PCA
         Matrix Completion
         Co-clustering
         Metric/Kernel Learning
    Machine Learning to the Rescue
         Multilevel Graph Clustering




Non-Negative Matrix Approximation

   Suppose data matrix A ≥ 0
   NMF problem: Given A and k, solve

                              min_{B,C ≥ 0}  ‖A − BC‖_F^2

   (a simple multiplicative-update sketch follows the citation below)

   Constrained Optimization Problem
        [Figure: level sets of f with iterate x^k and optimum x*, showing the
        gradient-based step x^k − αH^k ∇f(x^k), the Newton-type step
        x̄ = x^k − (G^T G)^{-1}(G^T G x^k − G^T h), and their projections P₊[·]
        onto the nonnegative orthant]



   [Kim, Sra, Dhillon] Fast Newton-type Methods for the Least Squares
   Nonnegative Matrix Approximation Problem, SDM, 2007.
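
A sketch (NumPy assumed) using Lee–Seung multiplicative updates — a simpler baseline than the Newton-type methods of the cited paper, but it optimizes the same objective min ‖A − BC‖_F^2 with B, C ≥ 0.

```python
import numpy as np

def nmf(A, k, iters=200, eps=1e-10):
    """Alternating multiplicative updates; eps guards against division by zero."""
    rng = np.random.default_rng(0)
    m, n = A.shape
    B = rng.random((m, k)); C = rng.random((k, n))
    for _ in range(iters):
        C *= (B.T @ A) / (B.T @ B @ C + eps)   # update keeps C nonnegative
        B *= (A @ C.T) / (B @ C @ C.T + eps)   # update keeps B nonnegative
    return B, C

A = np.abs(np.random.default_rng(5).standard_normal((30, 20)))
B, C = nmf(A, k=4)
print(np.linalg.norm(A - B @ C))
```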
Graph Clustering

   How do we measure the quality of a graph clustering?
   Could simply minimize the edge-cut in the graph
        Can lead to clusters that are highly unbalanced in size
   Could minimize the edge-cut in the graph while constraining the
   clusters to be equal in size
        Not a natural restriction in data analysis

   Popular objectives include normalized cut and ratio cut (evaluating the
   former is sketched after the references):

        Normalized Cut:   minimize  Σ_{i=1}^{c}  links(V_i, V \ V_i) / degree(V_i)

        Ratio Cut:        minimize  Σ_{i=1}^{c}  links(V_i, V \ V_i) / |V_i|

     [Shi & Malik, IEEE Pattern Analysis & Machine Intelligence, 2000]

     [Chan, Schlag & Zien, IEEE Integrated Circuits & Systems, 1994]
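
A sketch (NumPy assumed) that evaluates the normalized-cut objective for a given clustering of a graph with adjacency matrix W.

```python
import numpy as np

def normalized_cut(W, labels):
    total = 0.0
    for c in np.unique(labels):
        in_c = labels == c
        links_out = W[in_c][:, ~in_c].sum()   # links(V_c, V \ V_c)
        degree = W[in_c].sum()                # total degree of V_c
        total += links_out / degree           # for ratio cut, divide by in_c.sum()
    return total

W = np.zeros((6, 6))
W[:3, :3] = 1; W[3:, 3:] = 1; np.fill_diagonal(W, 0)
W[2, 3] = W[3, 2] = 1
print(normalized_cut(W, np.array([0, 0, 0, 1, 1, 1])))   # low cut value
```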

Machine Learning to the Rescue

    Traditional Approach to minimize Normalized Cut: Compute
    eigenvectors of normalized Graph Laplacian
    Exact Mathematical Equivalence: the Normalized Cut Objective is a special
    case of the Weighted Kernel k-means Objective
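
A sketch (NumPy assumed) of weighted kernel k-means, the routine used in the refinement phase below. Per the cited equivalence, with weights equal to node degrees and kernel K = σD^{-1} + D^{-1}WD^{-1} (σ a shift making K positive definite), its objective matches normalized cut; this sketch assumes given K, weights, and an initial labeling, and that no cluster empties.

```python
import numpy as np

def weighted_kernel_kmeans(K, w, labels, iters=20):
    """Weighted kernel k-means: reassign each point to the cluster whose
    weighted centroid is closest in feature space, in kernel form."""
    n = len(w)
    for _ in range(iters):
        dist = np.zeros((n, labels.max() + 1))
        for c in range(labels.max() + 1):
            in_c = labels == c
            wc = w[in_c]; Wc = wc.sum()
            # ||phi(x_i) - m_c||^2 expanded via the kernel; the K[i, i]
            # term is dropped since it does not affect the argmin over c
            dist[:, c] = (-2 * (K[:, in_c] @ wc) / Wc
                          + (wc @ K[np.ix_(in_c, in_c)] @ wc) / Wc**2)
        labels = dist.argmin(axis=1)
    return labels
```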

Our Multilevel Approach:
    Phase I: Coarsening
         Coarsen graph by merging nodes to form smaller and smaller graphs
         Use simple greedy heuristic specialized to each graph cut objective
    Phase II: Base Clustering
         Once the graph is small enough, perform a base clustering
         Variety of techniques possible for this step
    Phase III: Refining
         Uncoarsen the graph, level by level
         Use weighted kernel k-means to refine the clusterings at each level
         Input to weighted kernel k-means is clustering from previous level
Multilevel Graph Clustering

   Overview of our approach:
                  [Figure: V-cycle diagram — the Input Graph is coarsened level
                  by level down to an Initial Clustering, which is then refined
                  level by level back up to the Final Clustering]

   [Dhillon, Guan & Kulis, 2007] Weighted Graph Cuts without
   Eigenvectors: A Multilevel Approach, IEEE Transactions on Pattern
   Analysis and Machine Intelligence (PAMI), 2007.
   Graclus software: freely downloadable from the web
Metric Learning: Example




                      [Figure] Similarity by Person (identity) or by Pose

Metric Learning: Formulation

   Goal:
      min_Σ  loss(Σ, Σ_0)
      (x_i − x_j)^T Σ (x_i − x_j) ≤ u  if (i, j) ∈ S   [similarity constraints]
      (x_i − x_j)^T Σ (x_i − x_j) ≥ ℓ  if (i, j) ∈ D   [dissimilarity constraints]


   Learn an spd matrix Σ that is “close” to the baseline spd matrix Σ_0

   Other linear constraints on Σ are possible

   Constraints can arise from various scenarios
        Unsupervised: Click-through feedback
        Semi-supervised: must-link and cannot-link constraints
        Supervised: points in the same class have “small” distance, etc.

   QUESTION: What should “loss” be?

Bregman Matrix Divergences
   Define

              D_ϕ(X, Y) = ϕ(X) − ϕ(Y) − trace((∇ϕ(Y))^T (X − Y))

   Squared Frobenius norm: ϕ(X) = (1/2)‖X‖_F^2. Then

              D_F(X, Y) = (1/2) ‖X − Y‖_F^2

   von Neumann Divergence: For X ≻ 0, ϕ(X) = trace(X log X). Then

              D_vN(X, Y) = trace(X log X − X log Y − X + Y)

        also called quantum relative entropy

   LogDet divergence: For X ≻ 0, ϕ(X) = − log det X. Then

              D_ℓd(X, Y) = trace(XY^{-1}) − log det(XY^{-1}) − d
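
A sketch (NumPy/SciPy assumed) of the three divergences for symmetric positive definite X, Y.

```python
import numpy as np
from scipy.linalg import logm

def frobenius_div(X, Y):
    return 0.5 * np.linalg.norm(X - Y, 'fro') ** 2

def von_neumann_div(X, Y):
    # logm can return a complex array with ~0 imaginary part for spd input
    return np.trace(X @ logm(X) - X @ logm(Y) - X + Y).real

def logdet_div(X, Y):
    d = X.shape[0]
    XYinv = X @ np.linalg.inv(Y)
    sign, logabsdet = np.linalg.slogdet(XYinv)
    return np.trace(XYinv) - logabsdet - d

rng = np.random.default_rng(6)
M = rng.standard_normal((4, 4))
X = M @ M.T + np.eye(4)          # make X spd
Y = np.eye(4)
print(frobenius_div(X, Y), von_neumann_div(X, Y), logdet_div(X, Y))
```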

Algorithm: Bregman Projections for LogDet

   Goal:
      min_Σ  D_ℓd(Σ, Σ_0)
      (x_i − x_j)^T Σ (x_i − x_j) ≤ u  if (i, j) ∈ S   [similarity constraints]
      (x_i − x_j)^T Σ (x_i − x_j) ≥ ℓ  if (i, j) ∈ D   [dissimilarity constraints]

   Algorithm: Cyclic Bregman Projections (successively onto each linear
   constraint) — converges to globally optimal solution

   Use Bregman projections to update the Mahalanobis matrix:

                    min_Σ   D_ℓd(Σ, Σ_t)
                    s.t.    (x_i − x_j)^T Σ (x_i − x_j) ≤ u

   Can be solved by a rank-one update:

                    Σ_{t+1} = Σ_t + β_t Σ_t (x_i − x_j)(x_i − x_j)^T Σ_t
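
A sketch (NumPy assumed) of a single projection onto one violated similarity constraint, in the rank-one form above; the coefficient is derived by requiring (x_i − x_j)^T Σ_{t+1} (x_i − x_j) = u. The full algorithm of Davis et al. cycles over all constraints and incorporates slack.

```python
import numpy as np

def project_similarity(Sigma, xi, xj, u):
    """One LogDet Bregman projection onto (xi-xj)^T Sigma (xi-xj) <= u."""
    d = xi - xj
    p = d @ Sigma @ d                 # current Mahalanobis distance
    if p <= u:
        return Sigma                  # constraint already satisfied
    beta = (u - p) / (p * p)          # beta < 0 shrinks the distance to u
    return Sigma + beta * np.outer(Sigma @ d, Sigma @ d)

Sigma = np.eye(3)
xi = np.array([1.0, 0.0, 0.0])
xj = np.array([0.0, 2.0, 0.0])
Sigma = project_similarity(Sigma, xi, xj, u=1.0)
d = xi - xj
print(d @ Sigma @ d)                  # = 1.0; Sigma stays positive definite
```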

Algorithm: Details

   Exploits properties of LogDet and Sherman-Morrison-Woodbury
   formula

   Advantages:
        Simple, closed-form projections
        Automatic enforcement of positive semidefiniteness
        No eigenvector calculation or need for SDP
        Easy to incorporate slack for each constraint

   Question: Can the Algorithm be Kernelized?




Equivalence with Kernel Learning

     LogDet Formulation (1)                          Kernel Formulation (2)
     min_Σ  D_ℓd(Σ, I)                               min_K  D_ℓd(K, K_0)
     s.t.  tr(Σ(x_i − x_j)(x_i − x_j)^T) ≤ u         s.t.  tr(K(e_i − e_j)(e_i − e_j)^T) ≤ u
           tr(Σ(x_i − x_j)(x_i − x_j)^T) ≥ ℓ               tr(K(e_i − e_j)(e_i − e_j)^T) ≥ ℓ
           Σ ⪰ 0                                           K ⪰ 0
   (1) optimizes w.r.t. the d × d Mahalanobis matrix Σ
   (2) optimizes w.r.t. the N × N kernel matrix K
   Let K_0 = X^T X, where X is the input data
   Let Σ* be the optimal solution to (1) and K* be the optimal solution to (2)
   Theorem: K* = X^T Σ* X, and Σ* = I + XMX^T
   Consequence: Kernel Learning generalizes to unseen data

   [Davis, Kulis, Jain, Sra, Dhillon, 2007], Information-Theoretic Metric
   Learning, ICML.

Resources and Discussion




Books

  [Trefethen and Bau, 1997], Numerical Linear Algebra
  [G. Golub and C. Van Loan, 1996], Matrix Computations —
  Encyclopedic reference
  [Watkins, 2002], Fundamentals of Matrix Computations
  [Demmel, 1997], Applied Numerical Linear Algebra
  [Parlett, 1998], The Symmetric Eigenvalue Problem
  [Pete Stewart] – 6 Books on Matrix Computations
  [Gil Strang] – Excellent textbooks on linear algebra (esp.,
  undergraduate), applied mathematics, computational science — video
  lectures available on MIT course page
  [Horn & Johnson] – Matrix Analysis
  [Barrett et al, 1994], Templates for the Solution of Linear Systems:
  Building Blocks for Iterative Methods
  [Bai et al, 2000], Templates for the Solution of Eigenvalue Problems: A
  Practical Guide
Software

Dense Matrix Computations:
    Very well-developed public-domain software
    LAPACK, ScaLAPACK
    FLAME, PLAPACK
    Very efficient BLAS3 Code available
    LINPACK
    EISPACK

Sparse Matrix Computations:
    Less well-developed public-domain software, but still available
    Netlib
    Iterative Linear Solvers
    ARPACK — Sparse Eigenvalue Solver
    Sparse SVD – SVDPACKC, PROPACK
Discussion

Matrix Computations
    Well-established area (from 1950s)
    Problems have clear mathematical goals that can be directly optimized
    Theoretical analysis of computation time & accuracy
    Success often based on choosing right representation of problem
    Industry-strength public-domain software

Machine Learning
    Relatively newer area
    Problems have harder objectives that can’t always be directly optimized
    Little analysis of accuracy of algorithms
    Little work on choice of representation
    Some public-domain software
