Linear Algebra

1 Linear Algebra for Machine Learning

This section covers the essential linear algebra concepts that serve as the mathematical foundation for machine learning and deep learning.

1.1 Vectors and Matrices

  • Vector: A 1D array of numbers, often used to represent features.
  • Matrix: A 2D array of numbers, common for representing datasets and weights.

Example:

\[ \mathbf{x} = \begin{bmatrix} 2 \\ 3 \end{bmatrix}, \quad \mathbf{W} = \begin{bmatrix} 1 & 2 \\ 3 & 4 \end{bmatrix} \]
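
Throughout this section, short code sketches use NumPy (an assumed choice of array library, not prescribed by these notes). Here, the vector and matrix above as arrays:

```python
import numpy as np

# The vector x and matrix W from the example above
x = np.array([2.0, 3.0])       # shape (2,): a feature vector
W = np.array([[1.0, 2.0],
              [3.0, 4.0]])     # shape (2, 2): e.g. a weight matrix

print(x.shape, W.shape)        # (2,) (2, 2)
```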

1.2 Matrix Multiplication

Used in the forward pass of neural networks:

\[ \mathbf{y} = \mathbf{W}\mathbf{x} \]
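
A minimal sketch of this matrix-vector product with the same \(\mathbf{W}\) and \(\mathbf{x}\), i.e. one linear layer's forward pass without bias or activation:

```python
import numpy as np

W = np.array([[1.0, 2.0],
              [3.0, 4.0]])
x = np.array([2.0, 3.0])

# Matrix-vector product: y_i = sum_j W_ij * x_j
y = W @ x
print(y)   # [ 8. 18.]
```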

1.3 Dot Product

Measures projection or similarity:

\[ \mathbf{a} \cdot \mathbf{b} = \sum_{i} a_i b_i \]

Geometric interpretation:

\[ \mathbf{a} \cdot \mathbf{b} = \|\mathbf{a}\| \|\mathbf{b}\| \cos(\theta) \]

  • Used in cosine similarity for comparing vectors
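
A short sketch of the dot product and the cosine similarity derived from it (the vectors are arbitrary illustrative values):

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 5.0, 6.0])

dot = a @ b                                        # sum_i a_i * b_i
cos_sim = dot / (np.linalg.norm(a) * np.linalg.norm(b))

print(dot)       # 32.0
print(cos_sim)   # ~0.9746 -> vectors point in nearly the same direction
```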

1.4 Norms and Distances

  • L2 norm (Euclidean length): \[ \|\mathbf{x}\|_2 = \sqrt{\sum_i x_i^2} \]

  • L1 norm (Manhattan distance): \[ \|\mathbf{x}\|_1 = \sum_i |x_i| \]

  • Frobenius norm (for matrices): \[ \|\mathbf{A}\|_F = \sqrt{\sum_{i,j} A_{ij}^2} \]

Used to compute distances and regularization penalties.
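
As a quick sketch, all three norms are available through np.linalg.norm:

```python
import numpy as np

x = np.array([3.0, -4.0])
A = np.array([[1.0, 2.0],
              [3.0, 4.0]])

print(np.linalg.norm(x, 2))        # L2 norm: 5.0
print(np.linalg.norm(x, 1))        # L1 norm: 7.0
print(np.linalg.norm(A, 'fro'))    # Frobenius norm: sqrt(30) ~ 5.477
```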

1.5 Determinant

The determinant of a square matrix \(\mathbf{A}\) is a scalar value:

\[ \det(\mathbf{A}) \]

  • \(\det(\mathbf{A}) = 0\) means the matrix is singular (non-invertible)
  • \(|\det(\mathbf{A})|\) is the volume scaling factor of the linear map \(\mathbf{x} \mapsto \mathbf{A}\mathbf{x}\)
  • Used in linear transformations and invertibility checks
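
A small sketch contrasting an invertible matrix with a singular one (the second matrix is built with linearly dependent rows):

```python
import numpy as np

A = np.array([[1.0, 2.0],
              [3.0, 4.0]])
B = np.array([[1.0, 2.0],
              [2.0, 4.0]])      # second row is a multiple of the first

print(np.linalg.det(A))   # -2.0 -> invertible
print(np.linalg.det(B))   # ~0.0 -> singular (non-invertible)
```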

1.6 Trace

The trace of a square matrix is the sum of diagonal elements:

\[ \text{Tr}(\mathbf{A}) = \sum_i A_{ii} \]

Properties:

  • \(\text{Tr}(\mathbf{A} + \mathbf{B}) = \text{Tr}(\mathbf{A}) + \text{Tr}(\mathbf{B})\)
  • \(\text{Tr}(\mathbf{AB}) = \text{Tr}(\mathbf{BA})\) (cyclic property)
  • \(\text{Tr}(\mathbf{A}) = \sum_i \lambda_i\) (sum of eigenvalues)
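
A quick numerical check of these properties on random matrices (a sketch; the seed is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.random((3, 3))
B = rng.random((3, 3))

print(np.isclose(np.trace(A + B), np.trace(A) + np.trace(B)))    # True (linearity)
print(np.isclose(np.trace(A @ B), np.trace(B @ A)))              # True (cyclic property)
print(np.isclose(np.trace(A), np.linalg.eigvals(A).sum().real))  # True (sum of eigenvalues)
```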

1.7 Eigenvalues and Eigenvectors

Used in PCA (dimensionality reduction):

\[ \mathbf{A}\mathbf{v} = \lambda\mathbf{v} \]

Where:

  • \(\mathbf{v}\) is an eigenvector
  • \(\lambda\) is the corresponding eigenvalue
  • Characteristic equation: \(\det(\mathbf{A} - \lambda\mathbf{I}) = 0\)

Properties:

  • \(\text{Tr}(\mathbf{A}) = \sum_i \lambda_i\) (sum of eigenvalues)
  • \(\det(\mathbf{A}) = \prod_i \lambda_i\) (product of eigenvalues)
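
A minimal sketch verifying the eigenpair equation and the trace/determinant identities for a small matrix (the matrix is an arbitrary example):

```python
import numpy as np

A = np.array([[2.0, 0.0],
              [1.0, 3.0]])

eigvals, eigvecs = np.linalg.eig(A)   # columns of eigvecs are eigenvectors

# Check A v = lambda v for the first eigenpair
v, lam = eigvecs[:, 0], eigvals[0]
print(np.allclose(A @ v, lam * v))                   # True

# Trace = sum of eigenvalues, det = product of eigenvalues
print(np.isclose(np.trace(A), eigvals.sum()))        # True (5 = 2 + 3)
print(np.isclose(np.linalg.det(A), eigvals.prod()))  # True (6 = 2 * 3)
```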

1.8 Singular Value Decomposition (SVD)

Used in matrix factorization and recommender systems:

\[ \mathbf{A} = \mathbf{U}\mathbf{\Sigma}\mathbf{V}^T \]

Where:

  • \(\mathbf{U}\): left singular vectors (m × m orthogonal)
  • \(\mathbf{\Sigma}\): diagonal matrix of singular values (m × n)
  • \(\mathbf{V}\): right singular vectors (n × n orthogonal)

Key properties:

  • Always exists (unlike eigendecomposition)
  • Singular values \(\sigma_i \geq 0\), sorted in decreasing order (\(\sigma_1 \geq \sigma_2 \geq \cdots \geq 0\))
  • Low-rank approximation: keep the top \(k\) singular values (see the sketch below)

Applications:

  • Dimensionality reduction (truncated SVD)
  • Image compression
  • Collaborative filtering
  • Pseudo-inverse calculation
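
A brief sketch of the thin SVD and a rank-\(k\) truncation in NumPy (the matrix size and \(k\) are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.random((6, 4))

U, s, Vt = np.linalg.svd(A, full_matrices=False)   # thin SVD
print(U.shape, s.shape, Vt.shape)                  # (6, 4) (4,) (4, 4)
print(np.all(s[:-1] >= s[1:]))                     # singular values sorted in decreasing order

# Truncated SVD: keep the top k singular values for a rank-k approximation
k = 2
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]
print(np.linalg.norm(A - A_k, 'fro'))              # approximation error in Frobenius norm
```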

1.9 Matrix Rank and Invertibility

Rank: The dimension of the column space (or row space):

\[ \text{rank}(\mathbf{A}) = \text{number of linearly independent columns} \]

Properties:

  • \(\text{rank}(\mathbf{A}) \leq \min(m, n)\) for \(m \times n\) matrix
  • Full rank: \(\text{rank}(\mathbf{A}) = \min(m, n)\)
  • Rank-deficient: \(\text{rank}(\mathbf{A}) < \min(m, n)\)

Invertibility:

  • An \(n \times n\) matrix \(\mathbf{A}\) is invertible \(\Leftrightarrow\) \(\text{rank}(\mathbf{A}) = n\)
  • \(\mathbf{A}^{-1}\) exists \(\Leftrightarrow\) \(\det(\mathbf{A}) \neq 0\)
  • For invertible \(\mathbf{A}\): \(\mathbf{AA}^{-1} = \mathbf{A}^{-1}\mathbf{A} = \mathbf{I}\)

ML relevance: detecting feature redundancy and linear dependence (multicollinearity) in datasets
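
A small sketch: a duplicated feature column makes a matrix rank-deficient, while a full-rank square matrix has an inverse (the matrices are illustrative):

```python
import numpy as np

# Third feature column duplicates the second -> redundant feature, rank deficiency
X = np.array([[1.0, 2.0, 2.0],
              [2.0, 4.0, 4.0],
              [3.0, 1.0, 1.0]])

print(np.linalg.matrix_rank(X))            # 2 < 3: rank-deficient, hence singular
print(np.isclose(np.linalg.det(X), 0.0))   # True

A = np.array([[1.0, 2.0],
              [3.0, 4.0]])
A_inv = np.linalg.inv(A)
print(np.allclose(A @ A_inv, np.eye(2)))   # True: A A^{-1} = I
```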

1.10 Orthogonality and Projections

Orthogonal vectors: \(\mathbf{x} \perp \mathbf{y}\) if \(\mathbf{x}^T\mathbf{y} = 0\)

Orthogonal matrix \(\mathbf{Q}\):

  • \(\mathbf{Q}^T\mathbf{Q} = \mathbf{QQ}^T = \mathbf{I}\)
  • Columns are orthonormal
  • Preserves lengths: \(\|\mathbf{Qx}\| = \|\mathbf{x}\|\)

Projection matrix \(\mathbf{P}\):

Projection of \(\mathbf{b}\) onto the column space of \(\mathbf{A}\) (assuming \(\mathbf{A}\) has full column rank, so \(\mathbf{A}^T\mathbf{A}\) is invertible):

\[ \mathbf{P} = \mathbf{A}(\mathbf{A}^T\mathbf{A})^{-1}\mathbf{A}^T \]

Properties:

  • \(\mathbf{P}^2 = \mathbf{P}\) (idempotent)
  • \(\mathbf{P}^T = \mathbf{P}\) (symmetric)

QR Decomposition:

\[ \mathbf{A} = \mathbf{QR} \]

Where \(\mathbf{Q}\) is orthogonal and \(\mathbf{R}\) is upper triangular.

ML relevance:

  • Linear regression (least squares)
  • Gram-Schmidt orthogonalization
  • Solving linear systems
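
A sketch tying these pieces together, assuming \(\mathbf{A}\) has full column rank: the projection matrix onto its column space, and least squares solved through the QR factorization:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.random((5, 2))     # tall matrix, (almost surely) full column rank
b = rng.random(5)

# Projection matrix onto the column space of A
P = A @ np.linalg.inv(A.T @ A) @ A.T
print(np.allclose(P @ P, P), np.allclose(P.T, P))   # idempotent and symmetric

# Least squares via QR: A = QR, then solve R w = Q^T b
Q, R = np.linalg.qr(A)                 # reduced QR: Q is (5, 2), R is (2, 2)
w = np.linalg.solve(R, Q.T @ b)
print(np.allclose(A @ w, P @ b))       # fitted values equal the projection of b
```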

1.11 Positive Definite Matrices

A symmetric matrix \(\mathbf{A}\) is positive definite if:

\[ \mathbf{x}^T\mathbf{A}\mathbf{x} > 0 \text{ for all } \mathbf{x} \neq \mathbf{0} \]

Equivalent conditions:

  • All eigenvalues \(\lambda_i > 0\)
  • All leading principal minors \(> 0\)
  • \(\mathbf{A} = \mathbf{B}^T\mathbf{B}\) for some invertible \(\mathbf{B}\)

Positive semi-definite: \(\mathbf{x}^T\mathbf{A}\mathbf{x} \geq 0\) (allows zero eigenvalues)

ML relevance:

  • Covariance matrices (always positive semi-definite)
  • Hessian matrices in optimization
  • Convex optimization (positive definite Hessian → strict convexity)
  • Convergence guarantees
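
A minimal sketch of two of the equivalent checks above, on a matrix built to be positive definite (\(\mathbf{B}^T\mathbf{B}\) plus a small multiple of \(\mathbf{I}\)):

```python
import numpy as np

rng = np.random.default_rng(0)
B = rng.random((4, 4))
A = B.T @ B + 1e-3 * np.eye(4)   # symmetric positive definite by construction

# Condition 1: all eigenvalues > 0 (eigvalsh is for symmetric matrices)
print(np.all(np.linalg.eigvalsh(A) > 0))   # True

# Condition 2 (practical test): Cholesky succeeds only for positive definite matrices
try:
    np.linalg.cholesky(A)
    print("positive definite")
except np.linalg.LinAlgError:
    print("not positive definite")
```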

1.12 Matrix Calculus Basics

Gradient (of a scalar function w.r.t. a vector):

\[ \nabla_\mathbf{x} f(\mathbf{x}) = \begin{bmatrix} \frac{\partial f}{\partial x_1} \\ \frac{\partial f}{\partial x_2} \\ \vdots \\ \frac{\partial f}{\partial x_n} \end{bmatrix} \]

Jacobian (of a vector-valued function w.r.t. a vector):

\[ \mathbf{J} = \begin{bmatrix} \frac{\partial f_i}{\partial x_j} \end{bmatrix} \text{ (m × n matrix)} \]

Common derivatives:

  • \(\nabla_\mathbf{x}(\mathbf{a}^T\mathbf{x}) = \mathbf{a}\)
  • \(\nabla_\mathbf{x}(\mathbf{x}^T\mathbf{A}\mathbf{x}) = (\mathbf{A} + \mathbf{A}^T)\mathbf{x}\)
  • For symmetric \(\mathbf{A}\): \(\nabla_\mathbf{x}(\mathbf{x}^T\mathbf{A}\mathbf{x}) = 2\mathbf{A}\mathbf{x}\)
  • \(\nabla_\mathbf{x}(\|\mathbf{x}\|^2) = 2\mathbf{x}\)
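
A quick finite-difference check of the quadratic-form gradient identity above (a sketch, not an autodiff setup; sizes and seed are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 4
A = rng.random((n, n))
x = rng.random(n)

f = lambda v: v @ A @ v                    # f(x) = x^T A x

# Central finite-difference gradient, one coordinate direction at a time
eps = 1e-6
grad_fd = np.array([(f(x + eps * e) - f(x - eps * e)) / (2 * eps)
                    for e in np.eye(n)])

grad_closed = (A + A.T) @ x                # closed-form gradient from the identity above
print(np.allclose(grad_fd, grad_closed, atol=1e-5))   # True
```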

Hessian (second derivatives):

\[ \mathbf{H} = \begin{bmatrix} \frac{\partial^2 f}{\partial x_i \partial x_j} \end{bmatrix} \]

ML relevance:

  • Backpropagation
  • Gradient descent optimization
  • Neural network training

1.13 Spectral Theorem

For symmetric matrix \(\mathbf{A}\):

\[ \mathbf{A} = \mathbf{Q}\mathbf{\Lambda}\mathbf{Q}^T \]

Where:

  • \(\mathbf{Q}\): orthogonal matrix of eigenvectors
  • \(\mathbf{\Lambda}\): diagonal matrix of eigenvalues
  • \(\mathbf{Q}^T = \mathbf{Q}^{-1}\)

Properties:

  • All eigenvalues are real
  • Eigenvectors are orthogonal
  • Can always be diagonalized

ML relevance:

  • Principal Component Analysis (PCA)
  • Covariance matrix diagonalization
  • Quadratic forms
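
A closing sketch: np.linalg.eigh computes the spectral decomposition of a symmetric matrix, and the reconstruction \(\mathbf{Q}\mathbf{\Lambda}\mathbf{Q}^T\) recovers \(\mathbf{A}\):

```python
import numpy as np

rng = np.random.default_rng(0)
B = rng.random((4, 4))
A = (B + B.T) / 2                          # symmetrize to obtain a symmetric matrix

eigvals, Q = np.linalg.eigh(A)             # eigh: for symmetric/Hermitian matrices
Lam = np.diag(eigvals)

print(np.allclose(Q @ Lam @ Q.T, A))       # A = Q Lambda Q^T
print(np.allclose(Q.T @ Q, np.eye(4)))     # eigenvectors are orthonormal
print(np.isrealobj(eigvals))               # eigenvalues are real
```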