Linear Algebra

Introduction to Statistical Machine Learning
© 2011 Christfried Webers
Statistical Machine Learning Group, NICTA
and
College of Engineering and Computer Science, The Australian National University
Canberra
February – June 2011

(Many figures from C. M. Bishop, "Pattern Recognition and Machine Learning")
Part III: Linear Algebra

Basic Concepts
Linear Transformations
Trace
Inner Product
Projection
Rank, Determinant, Trace
Matrix Inverse
Eigenvectors
Singular Value Decomposition
Directional Derivative, Gradient
Books
Intuition

Geometry: points and lines; vector addition and scaling.
Humans have experience with 3 dimensions (less so with 1 or 2, though).
Generalisation to N dimensions (possibly N → ∞):
  Line → vector space V
  Point → vector x ∈ V
Example: X ∈ R^{n×m}.
The space of matrices R^{n×m} and the space of vectors R^{n·m} are isomorphic (see the sketch below).
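A minimal numpy sketch of this isomorphism (the variable names and sizes are illustrative): reshaping maps a matrix to its vector of entries and back without losing any information.

import numpy as np

n, m = 2, 3
X = np.arange(n * m, dtype=float).reshape(n, m)   # a point in the matrix space R^{n x m}

x = X.reshape(n * m)                # the corresponding point in the vector space R^{n m}
X_back = x.reshape(n, m)            # ... and back again

assert np.array_equal(X, X_back)    # nothing was lost in either direction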
Vector Space V, underlying Field F

Given vectors u, v, w ∈ V and scalars α, β ∈ F, the following hold:
Associativity of addition: u + (v + w) = (u + v) + w.
Commutativity of addition: v + w = w + v.
Identity element of addition: there exists 0 ∈ V such that v + 0 = v for all v ∈ V.
Inverse elements of addition: for every v ∈ V there exists w ∈ V such that v + w = 0; the additive inverse is denoted −v.
Distributivity of scalar multiplication with respect to vector addition: α(v + w) = αv + αw.
Distributivity of scalar multiplication with respect to field addition: (α + β)v = αv + βv.
Compatibility of scalar multiplication with field multiplication: α(βv) = (αβ)v.
Identity element of scalar multiplication: 1v = v, where 1 is the identity in F.
Matrix-Vector Multiplication

for i in range(m):
    R[i] = 0.0
    for j in range(n):
        R[i] = R[i] + A[i, j] * V[j]

Listing 1: Code for elementwise matrix-vector multiplication.

\[
\underbrace{\begin{pmatrix}
a_{11} & a_{12} & \dots & a_{1n} \\
a_{21} & a_{22} & \dots & a_{2n} \\
\vdots &        &       & \vdots \\
a_{m1} & a_{m2} & \dots & a_{mn}
\end{pmatrix}}_{A}
\underbrace{\begin{pmatrix} v_1 \\ v_2 \\ \vdots \\ v_n \end{pmatrix}}_{V}
=
\underbrace{\begin{pmatrix}
a_{11} v_1 + a_{12} v_2 + \dots + a_{1n} v_n \\
a_{21} v_1 + a_{22} v_2 + \dots + a_{2n} v_n \\
\vdots \\
a_{m1} v_1 + a_{m2} v_2 + \dots + a_{mn} v_n
\end{pmatrix}}_{R}
\]
Matrix-Vector Multiplication

Same decomposition R = A V as on the previous slide, now computed column by column:

R = A[:, 0] * V[0]
for j in range(1, n):
    R += A[:, j] * V[j]

Listing 2: Code for columnwise matrix-vector multiplication.
Matrix-Vector Multiplication

Denote the n columns of A by a_1, a_2, ..., a_n.
Each a_i is now a (column) vector.

\[
\underbrace{\begin{pmatrix} a_1 & a_2 & \dots & a_n \end{pmatrix}}_{A}
\underbrace{\begin{pmatrix} v_1 \\ v_2 \\ \vdots \\ v_n \end{pmatrix}}_{V}
= \underbrace{a_1 v_1 + a_2 v_2 + \dots + a_n v_n}_{R}
\]
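The two listings above compute the same product from different viewpoints. A hedged numpy sketch (sizes chosen arbitrarily) checking both views against the built-in product:

import numpy as np

m, n = 4, 3
A = np.random.randn(m, n)
V = np.random.randn(n)

# Elementwise view (Listing 1): R[i] = sum_j A[i, j] * V[j]
R_elem = np.zeros(m)
for i in range(m):
    for j in range(n):
        R_elem[i] += A[i, j] * V[j]

# Columnwise view (Listing 2): R = a_1 v_1 + ... + a_n v_n
R_col = A[:, 0] * V[0]
for j in range(1, n):
    R_col += A[:, j] * V[j]

assert np.allclose(R_elem, A @ V)
assert np.allclose(R_col, A @ V)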
Transpose of R

Given R = A V,

\[ R = a_1 v_1 + a_2 v_2 + \dots + a_n v_n \]

What is R^T ?

\[ R^T = v_1 a_1^T + v_2 a_2^T + \dots + v_n a_n^T \]

NOT equal to A^T V^T ! (In fact, A^T V^T is not even defined, as the dimensions do not match.)
Transpose

Reverse order rule: (A V)^T = V^T A^T

\[
\underbrace{\begin{pmatrix} v_1 & v_2 & \dots & v_n \end{pmatrix}}_{V^T}
\underbrace{\begin{pmatrix} a_1^T \\ a_2^T \\ \vdots \\ a_n^T \end{pmatrix}}_{A^T}
= \underbrace{v_1 a_1^T + v_2 a_2^T + \dots + v_n a_n^T}_{R^T}
\]
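A small numpy check of the reverse order rule, using an explicit column vector so the shapes of the transposes are visible (sizes are illustrative):

import numpy as np

m, n = 4, 3
A = np.random.randn(m, n)
V = np.random.randn(n, 1)             # n x 1 column vector

R = A @ V                             # m x 1
assert np.allclose(R.T, V.T @ A.T)    # (AV)^T = V^T A^T, a 1 x m row vector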
Trace

The trace of a square matrix A is the sum of all diagonal elements of A:

\[ \operatorname{tr}\{A\} = \sum_{k=1}^{n} A_{kk} \]

Example:

\[ A = \begin{pmatrix} 1 & 2 & 3 \\ 4 & 5 & 6 \\ 7 & 8 & 9 \end{pmatrix}, \qquad \operatorname{tr}\{A\} = 1 + 5 + 9 = 15 \]

The trace is not defined for a non-square matrix.
vec(X) operator

Define vec(X) as the vector which results from stacking all columns of a matrix X on top of each other.

Example:

\[ A = \begin{pmatrix} 1 & 2 & 3 \\ 4 & 5 & 6 \\ 7 & 8 & 9 \end{pmatrix}, \qquad \operatorname{vec}(A) = \begin{pmatrix} 1 \\ 4 \\ 7 \\ 2 \\ 5 \\ 8 \\ 3 \\ 6 \\ 9 \end{pmatrix} \]

Given two matrices X, Y ∈ R^{n×p}, the trace of the product X^T Y can be written as

\[ \operatorname{tr}\{X^T Y\} = \operatorname{vec}(X)^T \operatorname{vec}(Y) \]
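A sketch of vec and the trace identity in numpy. Note that numpy flattens row by row by default, so stacking columns needs order='F':

import numpy as np

def vec(X):
    # stack the columns of X on top of each other (column-major flattening)
    return X.reshape(-1, order="F")

A = np.array([[1., 2., 3.],
              [4., 5., 6.],
              [7., 8., 9.]])
print(vec(A))                         # [1. 4. 7. 2. 5. 8. 3. 6. 9.]

X = np.random.randn(3, 2)
Y = np.random.randn(3, 2)
assert np.isclose(np.trace(X.T @ Y), vec(X) @ vec(Y))   # tr(X^T Y) = vec(X)^T vec(Y)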
Inner Product

Mapping from two vectors x, y ∈ V to a field of scalars F (F = R or F = C):

\[ \langle \cdot, \cdot \rangle : V \times V \to F \]

Conjugate symmetry: \( \langle x, y \rangle = \overline{\langle y, x \rangle} \).
Linearity: ⟨a x, y⟩ = a ⟨x, y⟩, and ⟨x + y, z⟩ = ⟨x, z⟩ + ⟨y, z⟩.
Positive definiteness: ⟨x, x⟩ ≥ 0, and ⟨x, x⟩ = 0 only for x = 0.
Canonical Inner Products

Inner product for real numbers x, y ∈ R:

\[ \langle x, y \rangle = x\, y \]

Dot product between two vectors x, y ∈ R^n:

\[ \langle x, y \rangle = x^T y = \sum_{k=1}^{n} x_k y_k = x_1 y_1 + x_2 y_2 + \dots + x_n y_n \]

Canonical inner product for matrices X, Y ∈ R^{n×p}:

\[ \langle X, Y \rangle = \operatorname{tr}\{X^T Y\} = \sum_{k=1}^{p} (X^T Y)_{kk} = \sum_{k=1}^{p} \sum_{l=1}^{n} (X^T)_{kl} (Y)_{lk} = \sum_{k=1}^{p} \sum_{l=1}^{n} X_{lk} Y_{lk} \]
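The last double sum says the matrix inner product is just the elementwise product summed up; a one-line numpy check (shapes illustrative):

import numpy as np

X = np.random.randn(4, 2)
Y = np.random.randn(4, 2)

assert np.isclose(np.trace(X.T @ Y), np.sum(X * Y))   # <X, Y> = sum_{k,l} X_lk Y_lk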
Calculations with Matrices

Denote the (i, j)th element of a matrix X by X_{ij}.

Transpose: (X^T)_{ij} = X_{ji}
Product: (XY)_{ij} = \sum_k X_{ik} Y_{kj}  (sum over the inner dimension)

Proof of the linearity of the canonical matrix inner product:

\begin{align}
\langle X + Y, Z \rangle &= \operatorname{tr}\{(X + Y)^T Z\} = \sum_{k=1}^{p} ((X + Y)^T Z)_{kk} \\
&= \sum_{k=1}^{p} \sum_{l=1}^{n} ((X + Y)^T)_{kl} Z_{lk} = \sum_{k=1}^{p} \sum_{l=1}^{n} (X_{lk} + Y_{lk}) Z_{lk} \\
&= \sum_{k=1}^{p} \sum_{l=1}^{n} \left( X_{lk} Z_{lk} + Y_{lk} Z_{lk} \right) \\
&= \sum_{k=1}^{p} \sum_{l=1}^{n} (X^T)_{kl} Z_{lk} + \sum_{k=1}^{p} \sum_{l=1}^{n} (Y^T)_{kl} Z_{lk} \\
&= \sum_{k=1}^{p} (X^T Z)_{kk} + \sum_{k=1}^{p} (Y^T Z)_{kk} \\
&= \operatorname{tr}\{X^T Z\} + \operatorname{tr}\{Y^T Z\} = \langle X, Z \rangle + \langle Y, Z \rangle
\end{align}
Projection

In linear algebra and functional analysis, a projection is a linear transformation P from a vector space V to itself such that

\[ P^2 = P \]

[Figure: the point (a, b) in the plane is mapped to the point (a − b, 0) on the x-axis.]

\[ A \begin{pmatrix} a \\ b \end{pmatrix} = \begin{pmatrix} a - b \\ 0 \end{pmatrix}, \qquad A = \begin{pmatrix} 1 & -1 \\ 0 & 0 \end{pmatrix} \]
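A quick numpy check that this A is indeed a projection (and a preview of the next slide: A ≠ A^T here, so it is not an orthogonal projection):

import numpy as np

A = np.array([[1., -1.],
              [0.,  0.]])

assert np.allclose(A @ A, A)          # idempotent: P^2 = P

a, b = 3.0, 2.0
print(A @ np.array([a, b]))           # [1. 0.], i.e. (a - b, 0) as in the figure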
Orthogonal Projection

Orthogonality: need an inner product ⟨x, y⟩.
For an orthogonal projection P: choose two arbitrary vectors x and y; then Px and y − Py must be orthogonal:

\[ 0 = \langle Px, y - Py \rangle = (Px)^T (y - Py) = x^T (P^T - P^T P) y \]

Orthogonal projection:

\[ P^2 = P^T P = P \]

Example: given some unit vector u ∈ R^n characterising a line through the origin in R^n, project an arbitrary vector x ∈ R^n onto this line by

\[ P = u u^T \]

Proof: P^T = (u u^T)^T = u u^T = P, and, using u^T u = 1,

\[ P^2 x = (u u^T)(u u^T) x = u (u^T u) u^T x = u u^T x = P x \]
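A minimal numpy sketch of projecting onto a line (the vector u below is an arbitrary choice):

import numpy as np

u = np.array([1., 2., 2.])
u = u / np.linalg.norm(u)             # unit vector characterising the line

P = np.outer(u, u)                    # P = u u^T

assert np.allclose(P @ P, P)          # P^2 = P
assert np.allclose(P.T, P)            # P^T = P: an orthogonal projection

x = np.random.randn(3)
assert np.allclose(P @ x, (u @ x) * u)    # Px = (u^T x) u lies on the line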
Orthogonal Projection

Given a matrix A ∈ R^{n×p} and a vector x ∈ R^n, what is the closest point x̂ to x in the column space of A?

Orthogonal projection into the column space of A (assuming A has full column rank, so that A^T A is invertible):

\[ \hat{x} = A (A^T A)^{-1} A^T x \]

Projection matrix:

\[ P = A (A^T A)^{-1} A^T \]

Proof:

\[ P^2 = A (A^T A)^{-1} A^T A (A^T A)^{-1} A^T = A (A^T A)^{-1} A^T = P \]

Orthogonal projection?

\[ P^T = (A (A^T A)^{-1} A^T)^T = A (A^T A)^{-1} A^T = P \]
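A numpy sketch connecting this projection to least squares (A is random, hence full column rank with probability one):

import numpy as np

n, p = 5, 2
A = np.random.randn(n, p)
x = np.random.randn(n)

P = A @ np.linalg.inv(A.T @ A) @ A.T
x_hat = P @ x

assert np.allclose(P @ P, P) and np.allclose(P.T, P)

# x_hat is the closest point in the column space of A: the least-squares fit A w ~ x
w, *_ = np.linalg.lstsq(A, x, rcond=None)
assert np.allclose(x_hat, A @ w)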
Linear Independence of Vectors

A set of vectors is linearly independent if none of them can be written as a linear combination of finitely many other vectors in the set.

Given a vector space V, vectors v_1, ..., v_k ∈ V and scalars a_1, ..., a_k, the vectors are linearly independent if and only if

\[ a_1 v_1 + a_2 v_2 + \dots + a_k v_k = 0 = \begin{pmatrix} 0 \\ 0 \\ \vdots \\ 0 \end{pmatrix} \]

has only the trivial solution

\[ a_1 = a_2 = \dots = a_k = 0 \]
Rank

The column rank of a matrix A is the maximal number of linearly independent columns of A.
The row rank of a matrix A is the maximal number of linearly independent rows of A.
For every matrix A, the row rank and column rank are equal, called the rank of A.
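A small numpy illustration (the matrix is chosen so that one row is a combination of the others):

import numpy as np

A = np.array([[1., 2., 3.],
              [4., 5., 6.],
              [7., 8., 9.]])            # row 3 = 2 * row 2 - row 1

assert np.linalg.matrix_rank(A) == 2
assert np.linalg.matrix_rank(A) == np.linalg.matrix_rank(A.T)   # row rank = column rank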
Determinant

Let S_n be the set of all permutations of the numbers {1, ..., n}. The determinant of a square matrix A is then given by

\[ \det\{A\} = \sum_{\sigma \in S_n} \operatorname{sgn}(\sigma) \prod_{i=1}^{n} A_{i, \sigma(i)} \]

where sgn is the signature of the permutation (+1 if the permutation is even, −1 if the permutation is odd).

det{A B} = det{A} det{B} for A, B square matrices.
det{A^{-1}} = det{A}^{-1}.
det{A^T} = det{A}.
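A numpy spot check of the three properties (random matrices are invertible with probability one):

import numpy as np

A = np.random.randn(3, 3)
B = np.random.randn(3, 3)

assert np.isclose(np.linalg.det(A @ B), np.linalg.det(A) * np.linalg.det(B))
assert np.isclose(np.linalg.det(np.linalg.inv(A)), 1.0 / np.linalg.det(A))
assert np.isclose(np.linalg.det(A.T), np.linalg.det(A))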
Matrix Inverse

Identity matrix I:

\[ I = A A^{-1} = A^{-1} A \]

The matrix inverse A^{-1} exists only for square matrices which are NOT singular.

Singular matrix:
  at least one eigenvalue is zero,
  determinant det{A} = 0.

Inversion 'reverses the order' (like transposition):

\[ (A B)^{-1} = B^{-1} A^{-1} \]

(Only then is (A B)(A B)^{-1} = (A B) B^{-1} A^{-1} = A A^{-1} = I.)
Matrix Inverse and Transpose

What is (A^{-1})^T ?
Need to assume that A^{-1} exists.
Rule: (A^{-1})^T = (A^T)^{-1}, since (A^{-1})^T A^T = (A A^{-1})^T = I.
Useful Identity

\[ (A^{-1} + B^T C^{-1} B)^{-1} B^T C^{-1} = A B^T (B A B^T + C)^{-1} \]

How to analyse and prove such an equation?

A^{-1} must exist, therefore A must be square, say A ∈ R^{n×n}.
Same for C^{-1}, so let's assume C ∈ R^{m×m}.
Therefore, B ∈ R^{m×n}.
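A numerical sanity check of the identity with exactly the shapes derived above; A and C are built symmetric positive definite so that all the inverses exist:

import numpy as np

rng = np.random.default_rng(0)
n, m = 3, 4
A = rng.standard_normal((n, n)); A = A @ A.T + n * np.eye(n)   # SPD, invertible
C = rng.standard_normal((m, m)); C = C @ C.T + m * np.eye(m)   # SPD, invertible
B = rng.standard_normal((m, n))

inv = np.linalg.inv
lhs = inv(inv(A) + B.T @ inv(C) @ B) @ B.T @ inv(C)
rhs = A @ B.T @ inv(B @ A @ B.T + C)
assert np.allclose(lhs, rhs)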
Prove the Identity

\[ (A^{-1} + B^T C^{-1} B)^{-1} B^T C^{-1} = A B^T (B A B^T + C)^{-1} \]

Multiply by (B A B^T + C):

\[ (A^{-1} + B^T C^{-1} B)^{-1} B^T C^{-1} (B A B^T + C) = A B^T \]

Simplify the left-hand side:

\begin{align}
(A^{-1} + B^T C^{-1} B)^{-1} [B^T C^{-1} B (A B^T) + B^T]
&= (A^{-1} + B^T C^{-1} B)^{-1} [B^T C^{-1} B + A^{-1}] A B^T \\
&= A B^T
\end{align}
Woodbury Identity

\[ (A + B D^{-1} C)^{-1} = A^{-1} - A^{-1} B (D + C A^{-1} B)^{-1} C A^{-1} \]

Useful if A is diagonal and easy to invert, B is very tall (many rows, but only a few columns) and C is very wide (few rows, many columns).
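A sketch of the Woodbury identity in numpy with exactly this shape pattern (diagonal A, tall B, wide C; D is taken as the identity for simplicity):

import numpy as np

rng = np.random.default_rng(1)
n, k = 6, 2
A = np.diag(rng.uniform(1.0, 2.0, size=n))   # diagonal and easy to invert
B = rng.standard_normal((n, k))              # tall: many rows, few columns
C = rng.standard_normal((k, n))              # wide: few rows, many columns
D = np.eye(k)

inv = np.linalg.inv
lhs = inv(A + B @ inv(D) @ C)
rhs = inv(A) - inv(A) @ B @ inv(D + C @ inv(A) @ B) @ C @ inv(A)
assert np.allclose(lhs, rhs)                 # only a k x k inverse is needed on the right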
Manipulating Matrix Equations

Don't multiply by a matrix which does not have full rank. Why? In the general case, you will lose solutions.
Don't multiply by non-square matrices.
Don't assume a matrix inverse exists because a matrix is square. The matrix might be singular, and you would be making the same mistake as deducing a = b from a · 0 = b · 0.
Eigenvectors

Every square matrix A ∈ R^{n×n} has eigenvalue/eigenvector pairs satisfying

\[ A x = \lambda x \]

where x ∈ C^n, x ≠ 0, and λ ∈ C.

Example:

\[ \begin{pmatrix} 0 & 1 \\ -1 & 0 \end{pmatrix} x = \lambda x \]

\[ \lambda = \{-\imath, \imath\}, \qquad x = \begin{pmatrix} \imath \\ 1 \end{pmatrix}, \begin{pmatrix} -\imath \\ 1 \end{pmatrix} \]
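Checking the example with numpy (np.linalg.eig returns the eigenvectors as the columns of its second output; scaling of the eigenvectors may differ from the slide):

import numpy as np

A = np.array([[ 0., 1.],
              [-1., 0.]])

lam, X = np.linalg.eig(A)
print(lam)                            # [0.+1.j 0.-1.j], i.e. lambda = +-i

for i in range(2):
    assert np.allclose(A @ X[:, i], lam[i] * X[:, i])   # A x = lambda x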
Eigenvectors

How many eigenvalue/eigenvector pairs?

\[ A x = \lambda x \]

is equivalent to

\[ (A - \lambda I) x = 0 \]

which has a non-trivial solution only for det{A − λI} = 0, a polynomial of nth order in λ with at most n distinct solutions.
Real Eigenvalues

How can we enforce real eigenvalues?
Let's look at matrices with complex entries A ∈ C^{n×n}.
Transposition is replaced by the Hermitian adjoint (conjugate transpose), e.g.

\[ \begin{pmatrix} 1 + \imath 2 & 3 + \imath 4 \\ 5 + \imath 6 & 7 + \imath 8 \end{pmatrix}^H = \begin{pmatrix} 1 - \imath 2 & 5 - \imath 6 \\ 3 - \imath 4 & 7 - \imath 8 \end{pmatrix} \]

Denote the complex conjugate of a complex number λ by \( \overline{\lambda} \).
Real Eigenvalues

How can we enforce real eigenvalues?
Let's assume A ∈ C^{n×n} is Hermitian (A^H = A).
Calculate, for an eigenvector x ∈ C^n of A,

\[ x^H A x = \lambda x^H x \]

Another possibility to calculate x^H A x:

\begin{align}
x^H A x &= x^H A^H x && \text{(A is Hermitian)} \\
&= (x^H A x)^H && \text{(reverse order)} \\
&= (\lambda x^H x)^H && \text{(eigenvalue)} \\
&= \overline{\lambda}\, x^H x
\end{align}

and therefore λ = \( \overline{\lambda} \), i.e. λ is real.

If A is Hermitian, then all eigenvalues are real.
Special case: if A has only real entries and is symmetric, then all eigenvalues are real.
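A numpy illustration of both statements (random Hermitian and random real symmetric matrices):

import numpy as np

rng = np.random.default_rng(2)

M = rng.standard_normal((4, 4)) + 1j * rng.standard_normal((4, 4))
A = M + M.conj().T                            # Hermitian: A^H = A
assert np.allclose(np.linalg.eigvals(A).imag, 0.0)   # all eigenvalues real

S = rng.standard_normal((4, 4))
S = S + S.T                                   # real symmetric: the special case
assert np.allclose(np.linalg.eigvals(S).imag, 0.0)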
Singular Value Decomposition

Every matrix A ∈ R^{n×p} can be decomposed into a product of three matrices

\[ A = U \Sigma V^T \]

where U ∈ R^{n×n} and V ∈ R^{p×p} are orthogonal matrices (U^T U = I and V^T V = I), and Σ ∈ R^{n×p} has nonnegative numbers on the diagonal.
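A numpy check of the decomposition; np.linalg.svd returns the diagonal of Σ as a vector, so it is re-embedded into an n × p matrix here (n ≥ p in this sketch):

import numpy as np

n, p = 4, 3
A = np.random.randn(n, p)

U, s, Vt = np.linalg.svd(A)           # U: n x n, s: singular values, Vt = V^T: p x p

Sigma = np.zeros((n, p))
Sigma[:p, :p] = np.diag(s)            # nonnegative numbers on the diagonal

assert np.allclose(A, U @ Sigma @ Vt)
assert np.allclose(U.T @ U, np.eye(n))
assert np.allclose(Vt @ Vt.T, np.eye(p))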
How to calculate a gradient?

Given a vector space V, and a function f : V → R.
Example: X ∈ R^{n×p}, arbitrary C ∈ R^{n×n}, f(X) = tr{X^T C X}.

How to calculate the gradient of f ?
Ill-defined question. There is no gradient in a vector space, only a Directional Derivative at a point X ∈ V in a direction ξ ∈ V:

\[ Df(X)(\xi) = \lim_{h \to 0} \frac{f(X + h\xi) - f(X)}{h} \]

For the example f(X) = tr{X^T C X}:

\begin{align}
Df(X)(\xi) &= \lim_{h \to 0} \frac{\operatorname{tr}\{(X + h\xi)^T C (X + h\xi)\} - \operatorname{tr}\{X^T C X\}}{h} \\
&= \operatorname{tr}\{\xi^T C X + X^T C \xi\} = \operatorname{tr}\{X^T (C^T + C)\, \xi\}
\end{align}
How to calculate a gradient?

Given a vector space V, and a function f : V → R. How to calculate the gradient of f ?

1. Calculate the Directional Derivative.
2. Define an inner product ⟨A, B⟩ on the vector space.
3. The gradient is now defined by

\[ Df(X)(\xi) = \langle \operatorname{grad} f, \xi \rangle \]

For the example f(X) = tr{X^T C X}:
Define the inner product ⟨A, B⟩ = tr{A^T B}.
Write the Directional Derivative as an inner product with ξ:

\[ Df(X)(\xi) = \operatorname{tr}\{X^T (C^T + C)\, \xi\} = \langle (C + C^T) X, \xi \rangle = \langle \operatorname{grad} f, \xi \rangle \]

so grad f = (C + C^T) X.
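A finite-difference sketch tying the pieces together: the numerical directional derivative, the closed form tr{X^T (C^T + C) ξ}, and the gradient (C + C^T) X (sizes and the random direction ξ are illustrative):

import numpy as np

rng = np.random.default_rng(3)
n, p = 4, 2
C = rng.standard_normal((n, n))
X = rng.standard_normal((n, p))
xi = rng.standard_normal((n, p))          # an arbitrary direction

f = lambda X: np.trace(X.T @ C @ X)

h = 1e-6
Df_num = (f(X + h * xi) - f(X)) / h       # limit definition, approximated

Df_formula = np.trace(X.T @ (C.T + C) @ xi)
grad_f = (C + C.T) @ X
Df_inner = np.trace(grad_f.T @ xi)        # <grad f, xi> = tr((grad f)^T xi)

assert np.isclose(Df_num, Df_formula, atol=1e-3)
assert np.isclose(Df_formula, Df_inner)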
Books

Gilbert Strang, "Introduction to Linear Algebra", Wellesley-Cambridge Press, 2009.
David C. Lay, "Linear Algebra and Its Applications", Addison Wesley, 2005.