Linear Algebra

Introduction to Statistical Machine Learning
© 2011 Christfried Webers
Statistical Machine Learning Group, NICTA
and
College of Engineering and Computer Science, The Australian National University
Canberra
February – June 2011

(Many figures from C. M. Bishop, "Pattern Recognition and Machine Learning")
Part III: Linear Algebra

Basic Concepts
Linear Transformations
Trace
Inner Product
Projection
Rank, Determinant, Trace
Matrix Inverse
Eigenvectors
Singular Value Decomposition
Directional Derivative, Gradient
Books
Intuition

Geometry: points and lines; vector addition and scaling.
Humans have experience with 3 dimensions (less so with 1 or 2, though).
Generalisation to N dimensions (possibly N → ∞):
  Line → vector space V
  Point → vector x ∈ V
Example: X ∈ R^{n×m}.
The space of matrices R^{n×m} and the space of vectors R^{n·m} are isomorphic (see the sketch below).
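A minimal numpy sketch of this isomorphism (the variable names and sizes are illustrative): reshaping maps a matrix to its vector of entries and back without losing any information.

import numpy as np

n, m = 2, 3
X = np.arange(n * m, dtype=float).reshape(n, m)   # a point in the matrix space R^{n x m}

x = X.reshape(n * m)                # the corresponding point in the vector space R^{n m}
X_back = x.reshape(n, m)            # ... and back again

assert np.array_equal(X, X_back)    # nothing was lost in either direction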
Vector Space V, underlying Field F

Given vectors u, v, w ∈ V and scalars α, β ∈ F, the following hold:
Associativity of addition: u + (v + w) = (u + v) + w.
Commutativity of addition: v + w = w + v.
Identity element of addition: there exists 0 ∈ V such that v + 0 = v for all v ∈ V.
Inverse elements of addition: for every v ∈ V there exists w ∈ V such that v + w = 0; the additive inverse is denoted −v.
Distributivity of scalar multiplication with respect to vector addition: α(v + w) = αv + αw.
Distributivity of scalar multiplication with respect to field addition: (α + β)v = αv + βv.
Compatibility of scalar multiplication with field multiplication: α(βv) = (αβ)v.
Identity element of scalar multiplication: 1v = v, where 1 is the identity in F.
Matrix-Vector Multiplication

for i in range(m):
    R[i] = 0.0
    for j in range(n):
        R[i] = R[i] + A[i, j] * V[j]

Listing 1: Code for elementwise matrix-vector multiplication.

\[
\underbrace{\begin{pmatrix}
a_{11} & a_{12} & \dots & a_{1n} \\
a_{21} & a_{22} & \dots & a_{2n} \\
\vdots &        &       & \vdots \\
a_{m1} & a_{m2} & \dots & a_{mn}
\end{pmatrix}}_{A}
\underbrace{\begin{pmatrix} v_1 \\ v_2 \\ \vdots \\ v_n \end{pmatrix}}_{V}
=
\underbrace{\begin{pmatrix}
a_{11} v_1 + a_{12} v_2 + \dots + a_{1n} v_n \\
a_{21} v_1 + a_{22} v_2 + \dots + a_{2n} v_n \\
\vdots \\
a_{m1} v_1 + a_{m2} v_2 + \dots + a_{mn} v_n
\end{pmatrix}}_{R}
\]
Matrix-Vector Multiplication

Same decomposition R = A V as on the previous slide, now computed column by column:

R = A[:, 0] * V[0]
for j in range(1, n):
    R += A[:, j] * V[j]

Listing 2: Code for columnwise matrix-vector multiplication.
Matrix-Vector Multiplication

Denote the n columns of A by a_1, a_2, ..., a_n.
Each a_i is now a (column) vector.

\[
\underbrace{\begin{pmatrix} a_1 & a_2 & \dots & a_n \end{pmatrix}}_{A}
\underbrace{\begin{pmatrix} v_1 \\ v_2 \\ \vdots \\ v_n \end{pmatrix}}_{V}
= \underbrace{a_1 v_1 + a_2 v_2 + \dots + a_n v_n}_{R}
\]
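The two listings above compute the same product from different viewpoints. A hedged numpy sketch (sizes chosen arbitrarily) checking both views against the built-in product:

import numpy as np

m, n = 4, 3
A = np.random.randn(m, n)
V = np.random.randn(n)

# Elementwise view (Listing 1): R[i] = sum_j A[i, j] * V[j]
R_elem = np.zeros(m)
for i in range(m):
    for j in range(n):
        R_elem[i] += A[i, j] * V[j]

# Columnwise view (Listing 2): R = a_1 v_1 + ... + a_n v_n
R_col = A[:, 0] * V[0]
for j in range(1, n):
    R_col += A[:, j] * V[j]

assert np.allclose(R_elem, A @ V)
assert np.allclose(R_col, A @ V)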
Transpose of R

Given R = A V,

\[ R = a_1 v_1 + a_2 v_2 + \dots + a_n v_n \]

What is R^T ?

\[ R^T = v_1 a_1^T + v_2 a_2^T + \dots + v_n a_n^T \]

NOT equal to A^T V^T ! (In fact, A^T V^T is not even defined, as the dimensions do not match.)
Transpose

Reverse order rule: (A V)^T = V^T A^T

\[
\underbrace{\begin{pmatrix} v_1 & v_2 & \dots & v_n \end{pmatrix}}_{V^T}
\underbrace{\begin{pmatrix} a_1^T \\ a_2^T \\ \vdots \\ a_n^T \end{pmatrix}}_{A^T}
= \underbrace{v_1 a_1^T + v_2 a_2^T + \dots + v_n a_n^T}_{R^T}
\]
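A small numpy check of the reverse order rule, using an explicit column vector so the shapes of the transposes are visible (sizes are illustrative):

import numpy as np

m, n = 4, 3
A = np.random.randn(m, n)
V = np.random.randn(n, 1)             # n x 1 column vector

R = A @ V                             # m x 1
assert np.allclose(R.T, V.T @ A.T)    # (AV)^T = V^T A^T, a 1 x m row vector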
Trace

The trace of a square matrix A is the sum of all diagonal elements of A:

\[ \operatorname{tr}\{A\} = \sum_{k=1}^{n} A_{kk} \]

Example:

\[ A = \begin{pmatrix} 1 & 2 & 3 \\ 4 & 5 & 6 \\ 7 & 8 & 9 \end{pmatrix}, \qquad \operatorname{tr}\{A\} = 1 + 5 + 9 = 15 \]

The trace is not defined for a non-square matrix.
vec(X) operator

Define vec(X) as the vector which results from stacking all columns of a matrix X on top of each other.

Example:

\[ A = \begin{pmatrix} 1 & 2 & 3 \\ 4 & 5 & 6 \\ 7 & 8 & 9 \end{pmatrix}, \qquad \operatorname{vec}(A) = \begin{pmatrix} 1 \\ 4 \\ 7 \\ 2 \\ 5 \\ 8 \\ 3 \\ 6 \\ 9 \end{pmatrix} \]

Given two matrices X, Y ∈ R^{n×p}, the trace of the product X^T Y can be written as

\[ \operatorname{tr}\{X^T Y\} = \operatorname{vec}(X)^T \operatorname{vec}(Y) \]
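A sketch of vec and the trace identity in numpy. Note that numpy flattens row by row by default, so stacking columns needs order='F':

import numpy as np

def vec(X):
    # stack the columns of X on top of each other (column-major flattening)
    return X.reshape(-1, order="F")

A = np.array([[1., 2., 3.],
              [4., 5., 6.],
              [7., 8., 9.]])
print(vec(A))                         # [1. 4. 7. 2. 5. 8. 3. 6. 9.]

X = np.random.randn(3, 2)
Y = np.random.randn(3, 2)
assert np.isclose(np.trace(X.T @ Y), vec(X) @ vec(Y))   # tr(X^T Y) = vec(X)^T vec(Y)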
Inner Product

Mapping from two vectors x, y ∈ V to a field of scalars F (F = R or F = C):

\[ \langle \cdot, \cdot \rangle : V \times V \to F \]

Conjugate symmetry: \( \langle x, y \rangle = \overline{\langle y, x \rangle} \).
Linearity: ⟨a x, y⟩ = a ⟨x, y⟩, and ⟨x + y, z⟩ = ⟨x, z⟩ + ⟨y, z⟩.
Positive definiteness: ⟨x, x⟩ ≥ 0, and ⟨x, x⟩ = 0 only for x = 0.
Canonical Inner Products

Inner product for real numbers x, y ∈ R:

\[ \langle x, y \rangle = x\, y \]

Dot product between two vectors x, y ∈ R^n:

\[ \langle x, y \rangle = x^T y = \sum_{k=1}^{n} x_k y_k = x_1 y_1 + x_2 y_2 + \dots + x_n y_n \]

Canonical inner product for matrices X, Y ∈ R^{n×p}:

\[ \langle X, Y \rangle = \operatorname{tr}\{X^T Y\} = \sum_{k=1}^{p} (X^T Y)_{kk} = \sum_{k=1}^{p} \sum_{l=1}^{n} (X^T)_{kl} (Y)_{lk} = \sum_{k=1}^{p} \sum_{l=1}^{n} X_{lk} Y_{lk} \]
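The last double sum says the matrix inner product is just the elementwise product summed up; a one-line numpy check (shapes illustrative):

import numpy as np

X = np.random.randn(4, 2)
Y = np.random.randn(4, 2)

assert np.isclose(np.trace(X.T @ Y), np.sum(X * Y))   # <X, Y> = sum_{k,l} X_lk Y_lk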
Calculations with Matrices

Denote the (i, j)th element of a matrix X by X_{ij}.

Transpose: (X^T)_{ij} = X_{ji}
Product: (XY)_{ij} = \sum_k X_{ik} Y_{kj}  (sum over the inner dimension)

Proof of the linearity of the canonical matrix inner product:

\begin{align}
\langle X + Y, Z \rangle &= \operatorname{tr}\{(X + Y)^T Z\} = \sum_{k=1}^{p} ((X + Y)^T Z)_{kk} \\
&= \sum_{k=1}^{p} \sum_{l=1}^{n} ((X + Y)^T)_{kl} Z_{lk} = \sum_{k=1}^{p} \sum_{l=1}^{n} (X_{lk} + Y_{lk}) Z_{lk} \\
&= \sum_{k=1}^{p} \sum_{l=1}^{n} \left( X_{lk} Z_{lk} + Y_{lk} Z_{lk} \right) \\
&= \sum_{k=1}^{p} \sum_{l=1}^{n} (X^T)_{kl} Z_{lk} + \sum_{k=1}^{p} \sum_{l=1}^{n} (Y^T)_{kl} Z_{lk} \\
&= \sum_{k=1}^{p} (X^T Z)_{kk} + \sum_{k=1}^{p} (Y^T Z)_{kk} \\
&= \operatorname{tr}\{X^T Z\} + \operatorname{tr}\{Y^T Z\} = \langle X, Z \rangle + \langle Y, Z \rangle
\end{align}
Projection

In linear algebra and functional analysis, a projection is a linear transformation P from a vector space V to itself such that

\[ P^2 = P \]

[Figure: the point (a, b) in the plane is mapped to the point (a − b, 0) on the x-axis.]

\[ A \begin{pmatrix} a \\ b \end{pmatrix} = \begin{pmatrix} a - b \\ 0 \end{pmatrix}, \qquad A = \begin{pmatrix} 1 & -1 \\ 0 & 0 \end{pmatrix} \]
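A quick numpy check that this A is indeed a projection (and a preview of the next slide: A ≠ A^T here, so it is not an orthogonal projection):

import numpy as np

A = np.array([[1., -1.],
              [0.,  0.]])

assert np.allclose(A @ A, A)          # idempotent: P^2 = P

a, b = 3.0, 2.0
print(A @ np.array([a, b]))           # [1. 0.], i.e. (a - b, 0) as in the figure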
Orthogonal Projection

Orthogonality: need an inner product ⟨x, y⟩.
For an orthogonal projection P: choose two arbitrary vectors x and y; then Px and y − Py must be orthogonal:

\[ 0 = \langle Px, y - Py \rangle = (Px)^T (y - Py) = x^T (P^T - P^T P) y \]

Orthogonal projection:

\[ P^2 = P^T P = P \]

Example: given some unit vector u ∈ R^n characterising a line through the origin in R^n, project an arbitrary vector x ∈ R^n onto this line by

\[ P = u u^T \]

Proof: P^T = (u u^T)^T = u u^T = P, and, using u^T u = 1,

\[ P^2 x = (u u^T)(u u^T) x = u (u^T u) u^T x = u u^T x = P x \]
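A minimal numpy sketch of projecting onto a line (the vector u below is an arbitrary choice):

import numpy as np

u = np.array([1., 2., 2.])
u = u / np.linalg.norm(u)             # unit vector characterising the line

P = np.outer(u, u)                    # P = u u^T

assert np.allclose(P @ P, P)          # P^2 = P
assert np.allclose(P.T, P)            # P^T = P: an orthogonal projection

x = np.random.randn(3)
assert np.allclose(P @ x, (u @ x) * u)    # Px = (u^T x) u lies on the line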
Orthogonal Projection

Given a matrix A ∈ R^{n×p} and a vector x ∈ R^n, what is the closest point x̂ to x in the column space of A?

Orthogonal projection into the column space of A (assuming A has full column rank, so that A^T A is invertible):

\[ \hat{x} = A (A^T A)^{-1} A^T x \]

Projection matrix:

\[ P = A (A^T A)^{-1} A^T \]

Proof:

\[ P^2 = A (A^T A)^{-1} A^T A (A^T A)^{-1} A^T = A (A^T A)^{-1} A^T = P \]

Orthogonal projection?

\[ P^T = (A (A^T A)^{-1} A^T)^T = A (A^T A)^{-1} A^T = P \]
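A numpy sketch connecting this projection to least squares (A is random, hence full column rank with probability one):

import numpy as np

n, p = 5, 2
A = np.random.randn(n, p)
x = np.random.randn(n)

P = A @ np.linalg.inv(A.T @ A) @ A.T
x_hat = P @ x

assert np.allclose(P @ P, P) and np.allclose(P.T, P)

# x_hat is the closest point in the column space of A: the least-squares fit A w ~ x
w, *_ = np.linalg.lstsq(A, x, rcond=None)
assert np.allclose(x_hat, A @ w)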
Linear Independence of Vectors

A set of vectors is linearly independent if none of them can be written as a linear combination of finitely many other vectors in the set.

Given a vector space V, vectors v_1, ..., v_k ∈ V and scalars a_1, ..., a_k, the vectors are linearly independent if and only if

\[ a_1 v_1 + a_2 v_2 + \dots + a_k v_k = 0 = \begin{pmatrix} 0 \\ 0 \\ \vdots \\ 0 \end{pmatrix} \]

has only the trivial solution

\[ a_1 = a_2 = \dots = a_k = 0 \]
Rank

The column rank of a matrix A is the maximal number of linearly independent columns of A.
The row rank of a matrix A is the maximal number of linearly independent rows of A.
For every matrix A, the row rank and column rank are equal, called the rank of A.
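A small numpy illustration (the matrix is chosen so that one row is a combination of the others):

import numpy as np

A = np.array([[1., 2., 3.],
              [4., 5., 6.],
              [7., 8., 9.]])            # row 3 = 2 * row 2 - row 1

assert np.linalg.matrix_rank(A) == 2
assert np.linalg.matrix_rank(A) == np.linalg.matrix_rank(A.T)   # row rank = column rank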
Determinant

Let S_n be the set of all permutations of the numbers {1, ..., n}. The determinant of a square matrix A is then given by

\[ \det\{A\} = \sum_{\sigma \in S_n} \operatorname{sgn}(\sigma) \prod_{i=1}^{n} A_{i, \sigma(i)} \]

where sgn is the signature of the permutation (+1 if the permutation is even, −1 if the permutation is odd).

det{A B} = det{A} det{B} for A, B square matrices.
det{A^{-1}} = det{A}^{-1}.
det{A^T} = det{A}.
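A numpy spot check of the three properties (random matrices are invertible with probability one):

import numpy as np

A = np.random.randn(3, 3)
B = np.random.randn(3, 3)

assert np.isclose(np.linalg.det(A @ B), np.linalg.det(A) * np.linalg.det(B))
assert np.isclose(np.linalg.det(np.linalg.inv(A)), 1.0 / np.linalg.det(A))
assert np.isclose(np.linalg.det(A.T), np.linalg.det(A))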
Matrix Inverse

Identity matrix I:

\[ I = A A^{-1} = A^{-1} A \]

The matrix inverse A^{-1} exists only for square matrices which are NOT singular.

Singular matrix:
  at least one eigenvalue is zero,
  determinant det{A} = 0.

Inversion 'reverses the order' (like transposition):

\[ (A B)^{-1} = B^{-1} A^{-1} \]

(Only then is (A B)(A B)^{-1} = (A B) B^{-1} A^{-1} = A A^{-1} = I.)
Matrix Inverse and Transpose

What is (A^{-1})^T ?
Need to assume that A^{-1} exists.
Rule: (A^{-1})^T = (A^T)^{-1}, since (A^{-1})^T A^T = (A A^{-1})^T = I.
Useful Identity

\[ (A^{-1} + B^T C^{-1} B)^{-1} B^T C^{-1} = A B^T (B A B^T + C)^{-1} \]

How to analyse and prove such an equation?

A^{-1} must exist, therefore A must be square, say A ∈ R^{n×n}.
Same for C^{-1}, so let's assume C ∈ R^{m×m}.
Therefore, B ∈ R^{m×n}.
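A numerical sanity check of the identity with exactly the shapes derived above; A and C are built symmetric positive definite so that all the inverses exist:

import numpy as np

rng = np.random.default_rng(0)
n, m = 3, 4
A = rng.standard_normal((n, n)); A = A @ A.T + n * np.eye(n)   # SPD, invertible
C = rng.standard_normal((m, m)); C = C @ C.T + m * np.eye(m)   # SPD, invertible
B = rng.standard_normal((m, n))

inv = np.linalg.inv
lhs = inv(inv(A) + B.T @ inv(C) @ B) @ B.T @ inv(C)
rhs = A @ B.T @ inv(B @ A @ B.T + C)
assert np.allclose(lhs, rhs)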
Prove the Identity

\[ (A^{-1} + B^T C^{-1} B)^{-1} B^T C^{-1} = A B^T (B A B^T + C)^{-1} \]

Multiply by (B A B^T + C):

\[ (A^{-1} + B^T C^{-1} B)^{-1} B^T C^{-1} (B A B^T + C) = A B^T \]

Simplify the left-hand side:

\begin{align}
(A^{-1} + B^T C^{-1} B)^{-1} [B^T C^{-1} B (A B^T) + B^T]
&= (A^{-1} + B^T C^{-1} B)^{-1} [B^T C^{-1} B + A^{-1}] A B^T \\
&= A B^T
\end{align}
Woodbury Identity

\[ (A + B D^{-1} C)^{-1} = A^{-1} - A^{-1} B (D + C A^{-1} B)^{-1} C A^{-1} \]

Useful if A is diagonal and easy to invert, B is very tall (many rows, but only a few columns) and C is very wide (few rows, many columns).
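A sketch of the Woodbury identity in numpy with exactly this shape pattern (diagonal A, tall B, wide C; D is taken as the identity for simplicity):

import numpy as np

rng = np.random.default_rng(1)
n, k = 6, 2
A = np.diag(rng.uniform(1.0, 2.0, size=n))   # diagonal and easy to invert
B = rng.standard_normal((n, k))              # tall: many rows, few columns
C = rng.standard_normal((k, n))              # wide: few rows, many columns
D = np.eye(k)

inv = np.linalg.inv
lhs = inv(A + B @ inv(D) @ C)
rhs = inv(A) - inv(A) @ B @ inv(D + C @ inv(A) @ B) @ C @ inv(A)
assert np.allclose(lhs, rhs)                 # only a k x k inverse is needed on the right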
Manipulating Matrix Equations

Don't multiply by a matrix which does not have full rank. Why? In the general case, you will lose solutions.
Don't multiply by non-square matrices.
Don't assume a matrix inverse exists because a matrix is square. The matrix might be singular, and you would be making the same mistake as deducing a = b from a · 0 = b · 0.
Eigenvectors

Every square matrix A ∈ R^{n×n} has eigenvalue/eigenvector pairs satisfying

\[ A x = \lambda x \]

where x ∈ C^n, x ≠ 0, and λ ∈ C.

Example:

\[ \begin{pmatrix} 0 & 1 \\ -1 & 0 \end{pmatrix} x = \lambda x \]

\[ \lambda = \{-\imath, \imath\}, \qquad x = \begin{pmatrix} \imath \\ 1 \end{pmatrix}, \begin{pmatrix} -\imath \\ 1 \end{pmatrix} \]
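Checking the example with numpy (np.linalg.eig returns the eigenvectors as the columns of its second output; scaling of the eigenvectors may differ from the slide):

import numpy as np

A = np.array([[ 0., 1.],
              [-1., 0.]])

lam, X = np.linalg.eig(A)
print(lam)                            # [0.+1.j 0.-1.j], i.e. lambda = +-i

for i in range(2):
    assert np.allclose(A @ X[:, i], lam[i] * X[:, i])   # A x = lambda x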
Eigenvectors

How many eigenvalue/eigenvector pairs?

\[ A x = \lambda x \]

is equivalent to

\[ (A - \lambda I) x = 0 \]

which has a non-trivial solution only for det{A − λI} = 0, a polynomial of nth order in λ with at most n distinct solutions.
Real Eigenvalues

How can we enforce real eigenvalues?
Let's look at matrices with complex entries A ∈ C^{n×n}.
Transposition is replaced by the Hermitian adjoint (conjugate transpose), e.g.

\[ \begin{pmatrix} 1 + \imath 2 & 3 + \imath 4 \\ 5 + \imath 6 & 7 + \imath 8 \end{pmatrix}^H = \begin{pmatrix} 1 - \imath 2 & 5 - \imath 6 \\ 3 - \imath 4 & 7 - \imath 8 \end{pmatrix} \]

Denote the complex conjugate of a complex number λ by \( \overline{\lambda} \).
Real Eigenvalues

How can we enforce real eigenvalues?
Let's assume A ∈ C^{n×n} is Hermitian (A^H = A).
Calculate, for an eigenvector x ∈ C^n of A,

\[ x^H A x = \lambda x^H x \]

Another possibility to calculate x^H A x:

\begin{align}
x^H A x &= x^H A^H x && \text{(A is Hermitian)} \\
&= (x^H A x)^H && \text{(reverse order)} \\
&= (\lambda x^H x)^H && \text{(eigenvalue)} \\
&= \overline{\lambda}\, x^H x
\end{align}

and therefore λ = \( \overline{\lambda} \), i.e. λ is real.

If A is Hermitian, then all eigenvalues are real.
Special case: if A has only real entries and is symmetric, then all eigenvalues are real.
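A numpy illustration of both statements (random Hermitian and random real symmetric matrices):

import numpy as np

rng = np.random.default_rng(2)

M = rng.standard_normal((4, 4)) + 1j * rng.standard_normal((4, 4))
A = M + M.conj().T                            # Hermitian: A^H = A
assert np.allclose(np.linalg.eigvals(A).imag, 0.0)   # all eigenvalues real

S = rng.standard_normal((4, 4))
S = S + S.T                                   # real symmetric: the special case
assert np.allclose(np.linalg.eigvals(S).imag, 0.0)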
Singular Value Decomposition

Every matrix A ∈ R^{n×p} can be decomposed into a product of three matrices

\[ A = U \Sigma V^T \]

where U ∈ R^{n×n} and V ∈ R^{p×p} are orthogonal matrices (U^T U = I and V^T V = I), and Σ ∈ R^{n×p} has nonnegative numbers on the diagonal.
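A numpy check of the decomposition; np.linalg.svd returns the diagonal of Σ as a vector, so it is re-embedded into an n × p matrix here (n ≥ p in this sketch):

import numpy as np

n, p = 4, 3
A = np.random.randn(n, p)

U, s, Vt = np.linalg.svd(A)           # U: n x n, s: singular values, Vt = V^T: p x p

Sigma = np.zeros((n, p))
Sigma[:p, :p] = np.diag(s)            # nonnegative numbers on the diagonal

assert np.allclose(A, U @ Sigma @ Vt)
assert np.allclose(U.T @ U, np.eye(n))
assert np.allclose(Vt @ Vt.T, np.eye(p))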
How to calculate a gradient?

Given a vector space V, and a function f : V → R.
Example: X ∈ R^{n×p}, arbitrary C ∈ R^{n×n}, f(X) = tr{X^T C X}.

How to calculate the gradient of f ?
Ill-defined question. There is no gradient in a vector space, only a Directional Derivative at a point X ∈ V in a direction ξ ∈ V:

\[ Df(X)(\xi) = \lim_{h \to 0} \frac{f(X + h\xi) - f(X)}{h} \]

For the example f(X) = tr{X^T C X}:

\begin{align}
Df(X)(\xi) &= \lim_{h \to 0} \frac{\operatorname{tr}\{(X + h\xi)^T C (X + h\xi)\} - \operatorname{tr}\{X^T C X\}}{h} \\
&= \operatorname{tr}\{\xi^T C X + X^T C \xi\} = \operatorname{tr}\{X^T (C^T + C)\, \xi\}
\end{align}
How to calculate a gradient?

Given a vector space V, and a function f : V → R. How to calculate the gradient of f ?

1. Calculate the Directional Derivative.
2. Define an inner product ⟨A, B⟩ on the vector space.
3. The gradient is now defined by

\[ Df(X)(\xi) = \langle \operatorname{grad} f, \xi \rangle \]

For the example f(X) = tr{X^T C X}:
Define the inner product ⟨A, B⟩ = tr{A^T B}.
Write the Directional Derivative as an inner product with ξ:

\[ Df(X)(\xi) = \operatorname{tr}\{X^T (C^T + C)\, \xi\} = \langle (C + C^T) X, \xi \rangle = \langle \operatorname{grad} f, \xi \rangle \]

so grad f = (C + C^T) X.
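A finite-difference sketch tying the pieces together: the numerical directional derivative, the closed form tr{X^T (C^T + C) ξ}, and the gradient (C + C^T) X (sizes and the random direction ξ are illustrative):

import numpy as np

rng = np.random.default_rng(3)
n, p = 4, 2
C = rng.standard_normal((n, n))
X = rng.standard_normal((n, p))
xi = rng.standard_normal((n, p))          # an arbitrary direction

f = lambda X: np.trace(X.T @ C @ X)

h = 1e-6
Df_num = (f(X + h * xi) - f(X)) / h       # limit definition, approximated

Df_formula = np.trace(X.T @ (C.T + C) @ xi)
grad_f = (C + C.T) @ X
Df_inner = np.trace(grad_f.T @ xi)        # <grad f, xi> = tr((grad f)^T xi)

assert np.isclose(Df_num, Df_formula, atol=1e-3)
assert np.isclose(Df_formula, Df_inner)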
Books

Gilbert Strang, "Introduction to Linear Algebra", Wellesley-Cambridge Press, 2009.
David C. Lay, "Linear Algebra and Its Applications", Addison Wesley, 2005.