Linear Algebra

1 Linear Algebra for Machine Learning

This section covers the essential linear algebra concepts that serve as the mathematical foundation for machine learning and deep learning.

1.1 Vectors and Matrices

  • Vector: A 1D array of numbers, often used to represent features.
  • Matrix: A 2D array of numbers, common for representing datasets and weights.

Example:

\[ \mathbf{x} = \begin{bmatrix} 2 \\ 3 \end{bmatrix}, \quad \mathbf{W} = \begin{bmatrix} 1 & 2 \\ 3 & 4 \end{bmatrix} \]
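
Throughout this section, short code sketches use NumPy (an assumed choice of array library, not prescribed by these notes). Here, the vector and matrix above as arrays:

```python
import numpy as np

# The vector x and matrix W from the example above
x = np.array([2.0, 3.0])       # shape (2,): a feature vector
W = np.array([[1.0, 2.0],
              [3.0, 4.0]])     # shape (2, 2): e.g. a weight matrix

print(x.shape, W.shape)        # (2,) (2, 2)
```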

1.2 Matrix Multiplication

Used in the forward pass of neural networks:

\[ \mathbf{y} = \mathbf{W}\mathbf{x} \]
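
A minimal sketch of this matrix-vector product with the same \(\mathbf{W}\) and \(\mathbf{x}\), i.e. one linear layer's forward pass without bias or activation:

```python
import numpy as np

W = np.array([[1.0, 2.0],
              [3.0, 4.0]])
x = np.array([2.0, 3.0])

# Matrix-vector product: y_i = sum_j W_ij * x_j
y = W @ x
print(y)   # [ 8. 18.]
```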

1.3 Dot Product

Measures projection or similarity:

\[ \mathbf{a} \cdot \mathbf{b} = \sum_{i} a_i b_i \]

Geometric interpretation:

\[ \mathbf{a} \cdot \mathbf{b} = \|\mathbf{a}\| \|\mathbf{b}\| \cos(\theta) \]

  • Used in cosine similarity for comparing vectors
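
A short sketch of the dot product and the cosine similarity derived from it (the vectors are arbitrary illustrative values):

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 5.0, 6.0])

dot = a @ b                                        # sum_i a_i * b_i
cos_sim = dot / (np.linalg.norm(a) * np.linalg.norm(b))

print(dot)       # 32.0
print(cos_sim)   # ~0.9746 -> vectors point in nearly the same direction
```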

1.4 Norms and Distances

  • L2 norm (Euclidean length): \[ \|\mathbf{x}\|_2 = \sqrt{\sum_i x_i^2} \]

  • L1 norm (Manhattan distance): \[ \|\mathbf{x}\|_1 = \sum_i |x_i| \]

  • Frobenius norm (for matrices): \[ \|\mathbf{A}\|_F = \sqrt{\sum_{i,j} A_{ij}^2} \]

Used to compute distances and regularization penalties.
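
As a quick sketch, all three norms are available through np.linalg.norm:

```python
import numpy as np

x = np.array([3.0, -4.0])
A = np.array([[1.0, 2.0],
              [3.0, 4.0]])

print(np.linalg.norm(x, 2))        # L2 norm: 5.0
print(np.linalg.norm(x, 1))        # L1 norm: 7.0
print(np.linalg.norm(A, 'fro'))    # Frobenius norm: sqrt(30) ~ 5.477
```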

1.5 Determinant

The determinant of a square matrix \(\mathbf{A}\) is a scalar value:

\[ \det(\mathbf{A}) \]

  • \(\det(\mathbf{A}) = 0\) means the matrix is singular (non-invertible)
  • \(|\det(\mathbf{A})|\) is the volume scaling factor of the linear map \(\mathbf{x} \mapsto \mathbf{A}\mathbf{x}\)
  • Used in linear transformations and invertibility checks
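
A small sketch contrasting an invertible matrix with a singular one (the second matrix is built with linearly dependent rows):

```python
import numpy as np

A = np.array([[1.0, 2.0],
              [3.0, 4.0]])
B = np.array([[1.0, 2.0],
              [2.0, 4.0]])      # second row is a multiple of the first

print(np.linalg.det(A))   # -2.0 -> invertible
print(np.linalg.det(B))   # ~0.0 -> singular (non-invertible)
```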

1.6 Trace

The trace of a square matrix is the sum of diagonal elements:

\[ \text{Tr}(\mathbf{A}) = \sum_i A_{ii} \]

Properties:

  • \(\text{Tr}(\mathbf{A} + \mathbf{B}) = \text{Tr}(\mathbf{A}) + \text{Tr}(\mathbf{B})\)
  • \(\text{Tr}(\mathbf{AB}) = \text{Tr}(\mathbf{BA})\) (cyclic property)
  • \(\text{Tr}(\mathbf{A}) = \sum_i \lambda_i\) (sum of eigenvalues)
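
A quick numerical check of these properties on random matrices (a sketch; the seed is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.random((3, 3))
B = rng.random((3, 3))

print(np.isclose(np.trace(A + B), np.trace(A) + np.trace(B)))    # True (linearity)
print(np.isclose(np.trace(A @ B), np.trace(B @ A)))              # True (cyclic property)
print(np.isclose(np.trace(A), np.linalg.eigvals(A).sum().real))  # True (sum of eigenvalues)
```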

1.7 Eigenvalues and Eigenvectors

Used in PCA (dimensionality reduction):

\[ \mathbf{A}\mathbf{v} = \lambda\mathbf{v} \]

Where:

  • \(\mathbf{v}\) is an eigenvector
  • \(\lambda\) is the corresponding eigenvalue
  • Characteristic equation: \(\det(\mathbf{A} - \lambda\mathbf{I}) = 0\)

Properties:

  • \(\text{Tr}(\mathbf{A}) = \sum_i \lambda_i\) (sum of eigenvalues)
  • \(\det(\mathbf{A}) = \prod_i \lambda_i\) (product of eigenvalues)
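
A minimal sketch verifying the eigenpair equation and the trace/determinant identities for a small matrix (the matrix is an arbitrary example):

```python
import numpy as np

A = np.array([[2.0, 0.0],
              [1.0, 3.0]])

eigvals, eigvecs = np.linalg.eig(A)   # columns of eigvecs are eigenvectors

# Check A v = lambda v for the first eigenpair
v, lam = eigvecs[:, 0], eigvals[0]
print(np.allclose(A @ v, lam * v))                   # True

# Trace = sum of eigenvalues, det = product of eigenvalues
print(np.isclose(np.trace(A), eigvals.sum()))        # True (5 = 2 + 3)
print(np.isclose(np.linalg.det(A), eigvals.prod()))  # True (6 = 2 * 3)
```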

1.8 Singular Value Decomposition (SVD)

Used in matrix factorization and recommender systems:

\[ \mathbf{A} = \mathbf{U}\mathbf{\Sigma}\mathbf{V}^T \]

Where:

  • \(\mathbf{U}\): left singular vectors (m × m orthogonal)
  • \(\mathbf{\Sigma}\): diagonal matrix of singular values (m × n)
  • \(\mathbf{V}\): right singular vectors (n × n orthogonal)

Key properties:

  • Always exists (unlike eigendecomposition)
  • Singular values \(\sigma_i \geq 0\), sorted in decreasing order (\(\sigma_1 \geq \sigma_2 \geq \cdots \geq 0\))
  • Low-rank approximation: keep the top \(k\) singular values (see the sketch below)

Applications:

  • Dimensionality reduction (truncated SVD)
  • Image compression
  • Collaborative filtering
  • Pseudo-inverse calculation
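
A brief sketch of the thin SVD and a rank-\(k\) truncation in NumPy (the matrix size and \(k\) are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.random((6, 4))

U, s, Vt = np.linalg.svd(A, full_matrices=False)   # thin SVD
print(U.shape, s.shape, Vt.shape)                  # (6, 4) (4,) (4, 4)
print(np.all(s[:-1] >= s[1:]))                     # singular values sorted in decreasing order

# Truncated SVD: keep the top k singular values for a rank-k approximation
k = 2
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]
print(np.linalg.norm(A - A_k, 'fro'))              # approximation error in Frobenius norm
```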

1.9 Matrix Rank and Invertibility

Rank: The dimension of the column space (or row space):

\[ \text{rank}(\mathbf{A}) = \text{number of linearly independent columns} \]

Properties:

  • \(\text{rank}(\mathbf{A}) \leq \min(m, n)\) for \(m \times n\) matrix
  • Full rank: \(\text{rank}(\mathbf{A}) = \min(m, n)\)
  • Rank-deficient: \(\text{rank}(\mathbf{A}) < \min(m, n)\)

Invertibility:

  • An \(n \times n\) matrix \(\mathbf{A}\) is invertible \(\Leftrightarrow\) \(\text{rank}(\mathbf{A}) = n\)
  • \(\mathbf{A}^{-1}\) exists \(\Leftrightarrow\) \(\det(\mathbf{A}) \neq 0\)
  • For invertible \(\mathbf{A}\): \(\mathbf{AA}^{-1} = \mathbf{A}^{-1}\mathbf{A} = \mathbf{I}\)

ML relevance: detecting feature redundancy and linear dependence (multicollinearity) in datasets
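
A small sketch: a duplicated feature column makes a matrix rank-deficient, while a full-rank square matrix has an inverse (the matrices are illustrative):

```python
import numpy as np

# Third feature column duplicates the second -> redundant feature, rank deficiency
X = np.array([[1.0, 2.0, 2.0],
              [2.0, 4.0, 4.0],
              [3.0, 1.0, 1.0]])

print(np.linalg.matrix_rank(X))            # 2 < 3: rank-deficient, hence singular
print(np.isclose(np.linalg.det(X), 0.0))   # True

A = np.array([[1.0, 2.0],
              [3.0, 4.0]])
A_inv = np.linalg.inv(A)
print(np.allclose(A @ A_inv, np.eye(2)))   # True: A A^{-1} = I
```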

1.10 Orthogonality and Projections

Orthogonal vectors: \(\mathbf{x} \perp \mathbf{y}\) if \(\mathbf{x}^T\mathbf{y} = 0\)

Orthogonal matrix \(\mathbf{Q}\):

  • \(\mathbf{Q}^T\mathbf{Q} = \mathbf{QQ}^T = \mathbf{I}\)
  • Columns are orthonormal
  • Preserves lengths: \(\|\mathbf{Qx}\| = \|\mathbf{x}\|\)

Projection matrix \(\mathbf{P}\):

Projection of \(\mathbf{b}\) onto the column space of \(\mathbf{A}\) (assuming \(\mathbf{A}\) has full column rank, so \(\mathbf{A}^T\mathbf{A}\) is invertible):

\[ \mathbf{P} = \mathbf{A}(\mathbf{A}^T\mathbf{A})^{-1}\mathbf{A}^T \]

Properties:

  • \(\mathbf{P}^2 = \mathbf{P}\) (idempotent)
  • \(\mathbf{P}^T = \mathbf{P}\) (symmetric)

QR Decomposition:

\[ \mathbf{A} = \mathbf{QR} \]

Where \(\mathbf{Q}\) is orthogonal and \(\mathbf{R}\) is upper triangular.

ML relevance:

  • Linear regression (least squares)
  • Gram-Schmidt orthogonalization
  • Solving linear systems
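
A sketch tying these pieces together, assuming \(\mathbf{A}\) has full column rank: the projection matrix onto its column space, and least squares solved through the QR factorization:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.random((5, 2))     # tall matrix, (almost surely) full column rank
b = rng.random(5)

# Projection matrix onto the column space of A
P = A @ np.linalg.inv(A.T @ A) @ A.T
print(np.allclose(P @ P, P), np.allclose(P.T, P))   # idempotent and symmetric

# Least squares via QR: A = QR, then solve R w = Q^T b
Q, R = np.linalg.qr(A)                 # reduced QR: Q is (5, 2), R is (2, 2)
w = np.linalg.solve(R, Q.T @ b)
print(np.allclose(A @ w, P @ b))       # fitted values equal the projection of b
```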

1.11 Positive Definite Matrices

A symmetric matrix \(\mathbf{A}\) is positive definite if:

\[ \mathbf{x}^T\mathbf{A}\mathbf{x} > 0 \text{ for all } \mathbf{x} \neq \mathbf{0} \]

Equivalent conditions:

  • All eigenvalues \(\lambda_i > 0\)
  • All leading principal minors \(> 0\)
  • \(\mathbf{A} = \mathbf{B}^T\mathbf{B}\) for some invertible \(\mathbf{B}\)

Positive semi-definite: \(\mathbf{x}^T\mathbf{A}\mathbf{x} \geq 0\) (allows zero eigenvalues)

ML relevance:

  • Covariance matrices (always positive semi-definite)
  • Hessian matrices in optimization
  • Convex optimization (positive definite Hessian → strict convexity)
  • Convergence guarantees
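
A minimal sketch of two of the equivalent checks above, on a matrix built to be positive definite (\(\mathbf{B}^T\mathbf{B}\) plus a small multiple of \(\mathbf{I}\)):

```python
import numpy as np

rng = np.random.default_rng(0)
B = rng.random((4, 4))
A = B.T @ B + 1e-3 * np.eye(4)   # symmetric positive definite by construction

# Condition 1: all eigenvalues > 0 (eigvalsh is for symmetric matrices)
print(np.all(np.linalg.eigvalsh(A) > 0))   # True

# Condition 2 (practical test): Cholesky succeeds only for positive definite matrices
try:
    np.linalg.cholesky(A)
    print("positive definite")
except np.linalg.LinAlgError:
    print("not positive definite")
```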

1.12 Matrix Calculus Basics

Gradient (of a scalar function w.r.t. a vector):

\[ \nabla_\mathbf{x} f(\mathbf{x}) = \begin{bmatrix} \frac{\partial f}{\partial x_1} \\ \frac{\partial f}{\partial x_2} \\ \vdots \\ \frac{\partial f}{\partial x_n} \end{bmatrix} \]

Jacobian (of a vector-valued function w.r.t. a vector):

\[ \mathbf{J} = \begin{bmatrix} \frac{\partial f_i}{\partial x_j} \end{bmatrix} \text{ (m × n matrix)} \]

Common derivatives:

  • \(\nabla_\mathbf{x}(\mathbf{a}^T\mathbf{x}) = \mathbf{a}\)
  • \(\nabla_\mathbf{x}(\mathbf{x}^T\mathbf{A}\mathbf{x}) = (\mathbf{A} + \mathbf{A}^T)\mathbf{x}\)
  • For symmetric \(\mathbf{A}\): \(\nabla_\mathbf{x}(\mathbf{x}^T\mathbf{A}\mathbf{x}) = 2\mathbf{A}\mathbf{x}\)
  • \(\nabla_\mathbf{x}(\|\mathbf{x}\|^2) = 2\mathbf{x}\)
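
A quick finite-difference check of the quadratic-form gradient identity above (a sketch, not an autodiff setup; sizes and seed are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 4
A = rng.random((n, n))
x = rng.random(n)

f = lambda v: v @ A @ v                    # f(x) = x^T A x

# Central finite-difference gradient, one coordinate direction at a time
eps = 1e-6
grad_fd = np.array([(f(x + eps * e) - f(x - eps * e)) / (2 * eps)
                    for e in np.eye(n)])

grad_closed = (A + A.T) @ x                # closed-form gradient from the identity above
print(np.allclose(grad_fd, grad_closed, atol=1e-5))   # True
```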

Hessian (second derivatives):

\[ \mathbf{H} = \begin{bmatrix} \frac{\partial^2 f}{\partial x_i \partial x_j} \end{bmatrix} \]

ML relevance:

  • Backpropagation
  • Gradient descent optimization
  • Neural network training

1.13 Spectral Theorem

For symmetric matrix \(\mathbf{A}\):

\[ \mathbf{A} = \mathbf{Q}\mathbf{\Lambda}\mathbf{Q}^T \]

Where:

  • \(\mathbf{Q}\): orthogonal matrix of eigenvectors
  • \(\mathbf{\Lambda}\): diagonal matrix of eigenvalues
  • \(\mathbf{Q}^T = \mathbf{Q}^{-1}\)

Properties:

  • All eigenvalues are real
  • Eigenvectors are orthogonal
  • Can always be diagonalized

ML relevance:

  • Principal Component Analysis (PCA)
  • Covariance matrix diagonalization
  • Quadratic forms
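
A closing sketch: np.linalg.eigh computes the spectral decomposition of a symmetric matrix, and the reconstruction \(\mathbf{Q}\mathbf{\Lambda}\mathbf{Q}^T\) recovers \(\mathbf{A}\):

```python
import numpy as np

rng = np.random.default_rng(0)
B = rng.random((4, 4))
A = (B + B.T) / 2                          # symmetrize to obtain a symmetric matrix

eigvals, Q = np.linalg.eigh(A)             # eigh: for symmetric/Hermitian matrices
Lam = np.diag(eigvals)

print(np.allclose(Q @ Lam @ Q.T, A))       # A = Q Lambda Q^T
print(np.allclose(Q.T @ Q, np.eye(4)))     # eigenvectors are orthonormal
print(np.isrealobj(eigvals))               # eigenvalues are real
```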