Linear Algebra
1 Linear Algebra for Machine Learning
This section covers essential linear algebra concepts that serve as the mathematical foundation for machine learning and deep learning.
1.1 Vectors and Matrices
- Vector: A 1D array of numbers, often used to represent features.
- Matrix: A 2D array of numbers, common for representing datasets and weights.
Example:
\[ \mathbf{x} = \begin{bmatrix} 2 \\ 3 \end{bmatrix}, \quad \mathbf{W} = \begin{bmatrix} 1 & 2 \\ 3 & 4 \end{bmatrix} \]
1.2 Matrix Multiplication
Used in the forward pass of neural networks:
\[ \mathbf{y} = \mathbf{W}\mathbf{x} \]
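A quick NumPy sketch of the vector, matrix, and forward-pass product above (NumPy is assumed as the numerical library; the values match the example):

```python
import numpy as np

# Vector x and weight matrix W from the example above
x = np.array([2.0, 3.0])        # shape (2,)
W = np.array([[1.0, 2.0],
              [3.0, 4.0]])      # shape (2, 2)

# Forward pass: y = W x
y = W @ x
print(y)   # [ 8. 18.]
```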
1.3 Dot Product
Measures projection or similarity:
\[ \mathbf{a} \cdot \mathbf{b} = \sum_{i} a_i b_i \]
Geometric interpretation:
\[ \mathbf{a} \cdot \mathbf{b} = \|\mathbf{a}\| \|\mathbf{b}\| \cos(\theta) \]
- Used in cosine similarity for comparing vectors
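A minimal NumPy sketch of the dot product and cosine similarity (the vectors are illustrative):

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 5.0, 6.0])

dot = np.dot(a, b)                                         # sum_i a_i b_i = 32
cos_theta = dot / (np.linalg.norm(a) * np.linalg.norm(b))  # cosine similarity
print(dot, cos_theta)                                      # 32.0  ~0.9746
```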
1.4 Norms and Distances
L2 norm (Euclidean length): \[ \|\mathbf{x}\|_2 = \sqrt{\sum_i x_i^2} \]
L1 norm (Manhattan distance): \[ \|\mathbf{x}\|_1 = \sum_i |x_i| \]
Frobenius norm (for matrices): \[ \|\mathbf{A}\|_F = \sqrt{\sum_{i,j} A_{ij}^2} \]
Used to compute distances and regularization penalties.
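A short NumPy sketch of the three norms (values are illustrative):

```python
import numpy as np

x = np.array([3.0, -4.0])
A = np.array([[1.0, 2.0],
              [3.0, 4.0]])

l2  = np.linalg.norm(x)              # L2 norm: sqrt(9 + 16) = 5
l1  = np.linalg.norm(x, ord=1)       # L1 norm: |3| + |-4| = 7
fro = np.linalg.norm(A, ord='fro')   # Frobenius norm: sqrt(1+4+9+16) ≈ 5.477
print(l2, l1, fro)
```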
1.5 Determinant
The determinant of a square matrix \(\mathbf{A}\) is a scalar value:
\[ \det(\mathbf{A}) \]
- \(\det(\mathbf{A}) = 0\) means the matrix is singular (non-invertible)
- \(|\det(\mathbf{A})|\) represents volume scaling factor
- Used in linear transformations and invertibility checks
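A small NumPy check of these determinant facts (the matrices are illustrative):

```python
import numpy as np

A = np.array([[1.0, 2.0],
              [3.0, 4.0]])
S = np.array([[1.0, 2.0],
              [2.0, 4.0]])   # second row = 2 x first row -> singular

print(np.linalg.det(A))   # -2.0: invertible; |det| = 2 is the area scaling factor
print(np.linalg.det(S))   # 0.0 (up to floating-point error): non-invertible
```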
1.6 Trace
The trace of a square matrix is the sum of diagonal elements:
\[ \text{Tr}(\mathbf{A}) = \sum_i A_{ii} \]
Properties:
- \(\text{Tr}(\mathbf{A} + \mathbf{B}) = \text{Tr}(\mathbf{A}) + \text{Tr}(\mathbf{B})\)
- \(\text{Tr}(\mathbf{AB}) = \text{Tr}(\mathbf{BA})\) (cyclic property)
- \(\text{Tr}(\mathbf{A}) = \sum_i \lambda_i\) (sum of eigenvalues)
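A quick NumPy check of the trace properties (random matrices, used purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((3, 3))
B = rng.standard_normal((3, 3))

# Cyclic property: Tr(AB) = Tr(BA)
print(np.isclose(np.trace(A @ B), np.trace(B @ A)))                # True

# Tr(A) equals the sum of the eigenvalues of A
print(np.isclose(np.trace(A), np.linalg.eigvals(A).sum().real))    # True
```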
1.7 Eigenvalues and Eigenvectors
Used in PCA (dimensionality reduction):
\[ \mathbf{A}\mathbf{v} = \lambda\mathbf{v} \]
Where:
- \(\mathbf{v}\) is an eigenvector
- \(\lambda\) is the corresponding eigenvalue
- Characteristic equation: \(\det(\mathbf{A} - \lambda\mathbf{I}) = 0\)
Properties:
- \(\text{Tr}(\mathbf{A}) = \sum_i \lambda_i\) (sum of eigenvalues)
- \(\det(\mathbf{A}) = \prod_i \lambda_i\) (product of eigenvalues)
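A short NumPy sketch verifying the eigenvalue equation and the trace/determinant identities (the matrix is illustrative):

```python
import numpy as np

A = np.array([[4.0, 1.0],
              [2.0, 3.0]])

eigvals, eigvecs = np.linalg.eig(A)   # columns of eigvecs are the eigenvectors

# Check A v = lambda v for the first eigenpair
v, lam = eigvecs[:, 0], eigvals[0]
print(np.allclose(A @ v, lam * v))                    # True

# Trace = sum of eigenvalues, determinant = product of eigenvalues
print(np.isclose(np.trace(A), eigvals.sum()))         # 7 = 5 + 2
print(np.isclose(np.linalg.det(A), eigvals.prod()))   # 10 = 5 * 2
```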
1.8 Singular Value Decomposition (SVD)
Used in matrix factorization and recommender systems:
\[ \mathbf{A} = \mathbf{U}\mathbf{\Sigma}\mathbf{V}^T \]
Where:
- \(\mathbf{U}\): left singular vectors (m × m orthogonal)
- \(\mathbf{\Sigma}\): diagonal matrix of singular values (m × n)
- \(\mathbf{V}\): right singular vectors (n × n orthogonal)
Key properties:
- Always exists (unlike eigendecomposition)
- Singular values \(\sigma_i \geq 0\), conventionally sorted in decreasing order (\(\sigma_1 \geq \sigma_2 \geq \dots\))
- Low-rank approximation: keep top \(k\) singular values
Applications:
- Dimensionality reduction (truncated SVD)
- Image compression
- Collaborative filtering
- Pseudo-inverse calculation
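A minimal NumPy sketch of SVD and a low-rank approximation (the matrix is a made-up, ratings-style example):

```python
import numpy as np

# Made-up ratings-style matrix (rows: users, columns: items)
A = np.array([[5.0, 4.0, 0.0, 1.0],
              [4.0, 5.0, 1.0, 0.0],
              [0.0, 1.0, 5.0, 4.0]])

U, s, Vt = np.linalg.svd(A, full_matrices=False)   # s: singular values, descending

# Truncated SVD: keep only the top k singular values
k = 2
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

print(s)                                    # singular values, largest first
print(np.linalg.norm(A - A_k, ord='fro'))   # approximation error (Frobenius norm)
```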
1.9 Matrix Rank and Invertibility
Rank: The dimension of the column space (or row space):
\[ \text{rank}(\mathbf{A}) = \text{number of linearly independent columns} \]
Properties:
- \(\text{rank}(\mathbf{A}) \leq \min(m, n)\) for \(m \times n\) matrix
- Full rank: \(\text{rank}(\mathbf{A}) = \min(m, n)\)
- Rank-deficient: \(\text{rank}(\mathbf{A}) < \min(m, n)\)
Invertibility:
- A square \(n \times n\) matrix \(\mathbf{A}\) is invertible \(\Leftrightarrow\) \(\text{rank}(\mathbf{A}) = n\)
- \(\mathbf{A}^{-1}\) exists \(\Leftrightarrow\) \(\det(\mathbf{A}) \neq 0\)
- For invertible \(\mathbf{A}\): \(\mathbf{AA}^{-1} = \mathbf{A}^{-1}\mathbf{A} = \mathbf{I}\)
ML relevance: Detecting feature redundancy and linear dependence among dataset columns (a rank-deficient feature matrix makes \(\mathbf{X}^T\mathbf{X}\) singular and least-squares solutions non-unique)
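A brief NumPy sketch of rank and invertibility checks (matrices are illustrative; the third column is deliberately the sum of the first two):

```python
import numpy as np

# Third column = first column + second column -> linearly dependent (redundant feature)
X = np.array([[1.0, 2.0,  3.0],
              [4.0, 5.0,  9.0],
              [7.0, 8.0, 15.0]])

print(np.linalg.matrix_rank(X))   # 2: rank-deficient
print(np.linalg.det(X))           # ~0: singular, not invertible

B = np.array([[2.0, 1.0],
              [1.0, 3.0]])
B_inv = np.linalg.inv(B)
print(np.allclose(B @ B_inv, np.eye(2)))   # True: B B^{-1} = I
```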
1.10 Orthogonality and Projections
Orthogonal vectors: \(\mathbf{x} \perp \mathbf{y}\) if \(\mathbf{x}^T\mathbf{y} = 0\)
Orthogonal matrix \(\mathbf{Q}\):
- \(\mathbf{Q}^T\mathbf{Q} = \mathbf{QQ}^T = \mathbf{I}\)
- Columns are orthonormal
- Preserves lengths: \(\|\mathbf{Qx}\| = \|\mathbf{x}\|\)
Projection matrix \(\mathbf{P}\):
Projection of \(\mathbf{b}\) onto the column space of \(\mathbf{A}\) (assuming \(\mathbf{A}\) has full column rank, so \(\mathbf{A}^T\mathbf{A}\) is invertible):
\[ \mathbf{P} = \mathbf{A}(\mathbf{A}^T\mathbf{A})^{-1}\mathbf{A}^T \]
Properties:
- \(\mathbf{P}^2 = \mathbf{P}\) (idempotent)
- \(\mathbf{P}^T = \mathbf{P}\) (symmetric)
QR Decomposition:
\[ \mathbf{A} = \mathbf{QR} \]
Where \(\mathbf{Q}\) is orthogonal and \(\mathbf{R}\) is upper triangular
ML relevance:
- Linear regression (least squares)
- Gram-Schmidt orthogonalization
- Solving linear systems
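A short NumPy sketch tying these together: projecting onto a column space and solving least squares via QR (the data are illustrative, and \(\mathbf{A}\) is assumed to have full column rank):

```python
import numpy as np

# Overdetermined system: fit a line to three points (least squares)
A = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [1.0, 2.0]])
b = np.array([1.0, 2.0, 2.0])

# Projection matrix P = A (A^T A)^{-1} A^T (A has full column rank here)
P = A @ np.linalg.inv(A.T @ A) @ A.T
print(np.allclose(P @ P, P))    # idempotent
print(np.allclose(P.T, P))      # symmetric

# QR route: A = QR, then solve R x = Q^T b for the least-squares solution
Q, R = np.linalg.qr(A)
x_ls = np.linalg.solve(R, Q.T @ b)
print(np.allclose(A @ x_ls, P @ b))   # both give the projection of b onto col(A)
```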
1.11 Positive Definite Matrices
A symmetric matrix \(\mathbf{A}\) is positive definite if:
\[ \mathbf{x}^T\mathbf{A}\mathbf{x} > 0 \text{ for all } \mathbf{x} \neq \mathbf{0} \]
Equivalent conditions:
- All eigenvalues \(\lambda_i > 0\)
- All leading principal minors \(> 0\)
- \(\mathbf{A} = \mathbf{B}^T\mathbf{B}\) for some invertible \(\mathbf{B}\)
Positive semi-definite: \(\mathbf{x}^T\mathbf{A}\mathbf{x} \geq 0\) (allows zero eigenvalues)
ML relevance:
- Covariance matrices (always positive semi-definite)
- Hessian matrices in optimization
- Convex optimization (positive definite Hessian → strict convexity)
- Convergence guarantees
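A small NumPy sketch of positive (semi-)definiteness checks via eigenvalues and a Cholesky factorization (the matrices are randomly generated for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
B = rng.standard_normal((5, 3))
A = B.T @ B                       # B^T B is positive semi-definite by construction

eigvals = np.linalg.eigvalsh(A)   # real eigenvalues of a symmetric matrix
print(np.all(eigvals >= -1e-10))  # non-negative (up to floating-point error)

# Cholesky factorization exists iff the matrix is positive definite
A_pd = A + 1e-3 * np.eye(3)       # small diagonal shift guarantees strict positivity
L = np.linalg.cholesky(A_pd)
print(np.allclose(L @ L.T, A_pd))   # True
```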
1.12 Matrix Calculus Basics
Gradient (scalar function w.r.t. a vector):
\[ \nabla_\mathbf{x} f(\mathbf{x}) = \begin{bmatrix} \frac{\partial f}{\partial x_1} \\ \frac{\partial f}{\partial x_2} \\ \vdots \\ \frac{\partial f}{\partial x_n} \end{bmatrix} \]
Jacobian (vector function w.r.t. a vector):
\[ \mathbf{J} = \begin{bmatrix} \frac{\partial f_i}{\partial x_j} \end{bmatrix} \quad (m \times n \text{ matrix for } \mathbf{f}: \mathbb{R}^n \to \mathbb{R}^m) \]
Common derivatives:
- \(\nabla_\mathbf{x}(\mathbf{a}^T\mathbf{x}) = \mathbf{a}\)
- \(\nabla_\mathbf{x}(\mathbf{x}^T\mathbf{A}\mathbf{x}) = (\mathbf{A} + \mathbf{A}^T)\mathbf{x}\)
- For symmetric \(\mathbf{A}\): \(\nabla_\mathbf{x}(\mathbf{x}^T\mathbf{A}\mathbf{x}) = 2\mathbf{A}\mathbf{x}\)
- \(\nabla_\mathbf{x}(\|\mathbf{x}\|^2) = 2\mathbf{x}\)
Hessian (second derivatives):
\[ \mathbf{H} = \begin{bmatrix} \frac{\partial^2 f}{\partial x_i \partial x_j} \end{bmatrix} \]
ML relevance:
- Backpropagation
- Gradient descent optimization
- Neural network training
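A quick numerical check of the quadratic-form gradient identity above using central finite differences (NumPy assumed; the matrix and point are illustrative):

```python
import numpy as np

A = np.array([[3.0, 1.0],
              [2.0, 4.0]])
x = np.array([1.0, 2.0])

f = lambda v: v @ A @ v            # quadratic form f(v) = v^T A v
analytic = (A + A.T) @ x           # closed-form gradient from the identity above

# Central finite-difference approximation of the gradient at x
eps = 1e-6
numeric = np.array([(f(x + eps * e) - f(x - eps * e)) / (2 * eps) for e in np.eye(2)])

print(analytic)                        # [12. 19.]
print(np.allclose(analytic, numeric))  # True
```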
1.13 Spectral Theorem
For a symmetric matrix \(\mathbf{A}\):
\[ \mathbf{A} = \mathbf{Q}\mathbf{\Lambda}\mathbf{Q}^T \]
Where:
- \(\mathbf{Q}\): orthogonal matrix of eigenvectors
- \(\mathbf{\Lambda}\): diagonal matrix of eigenvalues
- \(\mathbf{Q}^T = \mathbf{Q}^{-1}\)
Properties:
- All eigenvalues are real
- Eigenvectors are orthogonal
- Can always be diagonalized
ML relevance:
- Principal Component Analysis (PCA)
- Covariance matrix diagonalization
- Quadratic forms
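A short NumPy sketch of the spectral decomposition of a symmetric matrix (the matrix is illustrative):

```python
import numpy as np

# Symmetric (covariance-style) matrix
A = np.array([[4.0, 2.0],
              [2.0, 3.0]])

eigvals, Q = np.linalg.eigh(A)   # eigh: symmetric input, real eigenvalues, orthonormal Q

# Reconstruct A = Q Lambda Q^T
Lam = np.diag(eigvals)
print(np.allclose(Q @ Lam @ Q.T, A))     # True
print(np.allclose(Q.T @ Q, np.eye(2)))   # Q is orthogonal: Q^T Q = I
```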