Probability
1 Probability for Machine Learning
Probability is central to machine learning — particularly in generative models, classification, and Bayesian methods.
1.1 Random Variables and Distributions
1.1.1 Types of Random Variables
- Discrete: Bernoulli, Binomial, Poisson, Categorical
- Continuous: Normal (Gaussian), Exponential, Uniform, Beta
Notation:
- \(P(X = x)\) (discrete): Probability mass function (PMF)
- \(f_X(x)\) (continuous): Probability density function (PDF)
- \(F_X(x) = P(X \leq x)\): Cumulative distribution function (CDF)
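As a quick illustration of these three objects (a minimal sketch using scipy.stats; the chosen distributions and parameters are arbitrary):

```python
from scipy import stats

# Discrete: PMF and CDF of a Binomial(n=10, p=0.3) random variable
binom = stats.binom(n=10, p=0.3)
print(binom.pmf(3))   # P(X = 3)
print(binom.cdf(3))   # F_X(3) = P(X <= 3)

# Continuous: PDF and CDF of a standard normal
norm = stats.norm(loc=0.0, scale=1.0)
print(norm.pdf(0.0))  # f_X(0) ~ 0.3989 (a density value, not a probability)
print(norm.cdf(0.0))  # F_X(0) = 0.5
```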
1.2 Expectation and Variance
1.2.1 Expected Value
The average or mean value:
Discrete: \[ \mathbb{E}[X] = \sum_x x P(X = x) \]
Continuous: \[ \mathbb{E}[X] = \int x f_X(x) \, dx \]
Properties:
- Linearity: \(\mathbb{E}[aX + bY] = a\mathbb{E}[X] + b\mathbb{E}[Y]\)
- \(\mathbb{E}[c] = c\) for constant \(c\)
- \(\mathbb{E}[g(X)] = \sum_x g(x)P(X=x)\) or \(\int g(x)f_X(x)dx\)
1.2.2 Variance
Measure of spread around the mean:
\[ \text{Var}(X) = \mathbb{E}[(X - \mathbb{E}[X])^2] = \mathbb{E}[X^2] - (\mathbb{E}[X])^2 \]
Properties:
- \(\text{Var}(aX + b) = a^2 \text{Var}(X)\)
- For independent \(X, Y\): \(\text{Var}(X + Y) = \text{Var}(X) + \text{Var}(Y)\)
Standard Deviation: \(\sigma = \sqrt{\text{Var}(X)}\)
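These properties are easy to sanity-check numerically (a sketch using NumPy sampling; the distribution and constants are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.exponential(scale=2.0, size=1_000_000)  # E[X] = 2, Var(X) = 4

a, b = 3.0, 5.0
# Linearity of expectation: E[aX + b] = a E[X] + b
print(np.mean(a * x + b), a * np.mean(x) + b)

# Var(aX + b) = a^2 Var(X): the shift b has no effect on spread
print(np.var(a * x + b), a**2 * np.var(x))

# Standard deviation is the square root of the variance
print(np.std(x), np.sqrt(np.var(x)))
```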
1.2.3 Covariance and Correlation
Covariance: Measures the linear relationship between two variables:
\[ \text{Cov}(X, Y) = \mathbb{E}[(X - \mathbb{E}[X])(Y - \mathbb{E}[Y])] = \mathbb{E}[XY] - \mathbb{E}[X]\mathbb{E}[Y] \]
Properties:
- \(\text{Cov}(X, X) = \text{Var}(X)\)
- \(\text{Cov}(X, Y) = \text{Cov}(Y, X)\) (symmetric)
- If \(X, Y\) independent: \(\text{Cov}(X, Y) = 0\) (converse not always true)
Correlation coefficient:
\[ \rho(X, Y) = \frac{\text{Cov}(X, Y)}{\sigma_X \sigma_Y} \]
where \(-1 \leq \rho \leq 1\)
ML relevance: Feature correlation analysis, covariance matrices in PCA
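A small numerical illustration (assuming NumPy; the synthetic correlated data below is made up for demonstration):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=10_000)
y = 0.8 * x + rng.normal(scale=0.6, size=10_000)  # y is correlated with x

# Manual covariance: E[XY] - E[X]E[Y]
cov_manual = np.mean(x * y) - np.mean(x) * np.mean(y)

# Library versions (np.cov uses the n-1 convention, so values differ slightly)
cov_matrix = np.cov(x, y)        # 2x2 covariance matrix; entry [0, 1] is Cov(X, Y)
rho = np.corrcoef(x, y)[0, 1]    # correlation coefficient in [-1, 1]

print(cov_manual, cov_matrix[0, 1], rho)
```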
1.3 Joint, Marginal, and Conditional Probability
1.3.1 Joint Probability
Probability of multiple events occurring together:
\[ P(X = x, Y = y) \quad \text{or} \quad f_{X,Y}(x, y) \]
1.3.2 Marginal Probability
Probability of one variable, ignoring others:
Discrete: \[ P(X = x) = \sum_y P(X = x, Y = y) \]
Continuous: \[ f_X(x) = \int f_{X,Y}(x, y) \, dy \]
1.3.3 Conditional Probability
Probability of \(A\) given \(B\) has occurred:
\[ P(A \mid B) = \frac{P(A \cap B)}{P(B)} = \frac{P(A, B)}{P(B)} \]
Chain rule: \[ P(A, B) = P(A \mid B) P(B) = P(B \mid A) P(A) \]
General chain rule (for multiple variables): \[ P(X_1, X_2, \ldots, X_n) = P(X_1) P(X_2 \mid X_1) P(X_3 \mid X_1, X_2) \cdots P(X_n \mid X_1, \ldots, X_{n-1}) \]
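A toy discrete example tying joint, marginal, and conditional probability together (the 2×2 joint table is invented for illustration):

```python
import numpy as np

# Joint PMF P(X = x, Y = y) as a table: rows index x, columns index y
joint = np.array([[0.10, 0.30],
                  [0.20, 0.40]])

# Marginals: sum out the other variable
p_x = joint.sum(axis=1)  # P(X = x)
p_y = joint.sum(axis=0)  # P(Y = y)

# Conditional: P(Y = y | X = x) = P(X = x, Y = y) / P(X = x)
p_y_given_x = joint / p_x[:, None]

# Chain rule check: P(X, Y) = P(Y | X) P(X)
print(np.allclose(joint, p_y_given_x * p_x[:, None]))  # True
```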
1.4 Independence
1.4.1 Statistical Independence
\(X\) and \(Y\) are independent if:
\[ P(X, Y) = P(X) P(Y) \]
Equivalent conditions:
- \(P(X \mid Y) = P(X)\)
- \(P(Y \mid X) = P(Y)\)
Note: independence implies \(\mathbb{E}[XY] = \mathbb{E}[X]\mathbb{E}[Y]\) (zero covariance), but the converse does not hold: uncorrelated variables can still be dependent.
1.4.2 Conditional Independence
\(X\) and \(Y\) are conditionally independent given \(Z\) if:
\[ P(X, Y \mid Z) = P(X \mid Z) P(Y \mid Z) \]
Notation: \(X \perp Y \mid Z\)
ML relevance:
- Naive Bayes assumes features are conditionally independent given class
- Graphical models encode conditional independence
1.5 Bayes’ Theorem
Bayes’ Theorem lets us update what we believe about a situation after seeing new evidence.
The formula:
\[ P(A \mid B) = \frac{P(B \mid A) \cdot P(A)}{P(B)} \]
Components:
- \(P(A \mid B)\): Posterior — probability of \(A\) given that \(B\) happened
- \(P(B \mid A)\): Likelihood — how likely is \(B\) if \(A\) is true
- \(P(A)\): Prior — our initial belief about \(A\)
- \(P(B)\): Evidence — total probability of \(B\) happening under all possibilities
Expanded form (Law of Total Probability):
\[ P(A \mid B) = \frac{P(B \mid A) P(A)}{P(B \mid A)P(A) + P(B \mid A^c)P(A^c)} \]
1.5.1 What is Bayes’ Theorem really doing?
Bayes’ Theorem is about updating your belief:
- Start with your prior belief \(P(A)\)
- Then observe new evidence \(B\)
- Update your belief using how likely that evidence is under \(A\) (\(P(B \mid A)\))
1.5.2 Example: Medical Test
Suppose a disease affects 1% of the population.
You take a test with:
- \(P(\text{Positive} \mid \text{Disease}) = 0.99\) (true positive rate)
- \(P(\text{Positive} \mid \text{No Disease}) = 0.05\) (false positive rate)
You test positive. What is the chance you actually have the disease?
We want to compute:
\[ P(\text{Disease} \mid \text{Positive}) = \frac{P(\text{Positive} \mid \text{Disease}) \cdot P(\text{Disease})}{P(\text{Positive})} \]
Let’s plug in the values:
- \(P(\text{Disease}) = 0.01\)
- \(P(\text{No Disease}) = 0.99\)
- \(P(\text{Positive}) = 0.99 \cdot 0.01 + 0.05 \cdot 0.99 = 0.0594\)
So:
\[ P(\text{Disease} \mid \text{Positive}) = \frac{0.99 \cdot 0.01}{0.0594} \approx 0.167 \]
Surprising result: Even after a positive test, the chance of having the disease is only about 16.7%. Because the disease is rare, false positives (\(0.05 \cdot 0.99 = 0.0495\)) outnumber true positives (\(0.99 \cdot 0.01 = 0.0099\)) among all positive results.
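The same calculation in code (a direct transcription of the numbers above):

```python
# The medical-test example, computed step by step
p_disease = 0.01                       # prior P(Disease)
p_pos_given_disease = 0.99             # likelihood P(Positive | Disease)
p_pos_given_healthy = 0.05             # false positive rate

# Evidence via the law of total probability
p_pos = (p_pos_given_disease * p_disease
         + p_pos_given_healthy * (1 - p_disease))

# Bayes' theorem: posterior P(Disease | Positive)
posterior = p_pos_given_disease * p_disease / p_pos
print(posterior)  # ~ 0.1667
```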
1.5.3 Why It Matters
Bayes’ Theorem is widely used in:
- Medical diagnosis
- Spam detection
- Probabilistic machine learning (e.g., Naive Bayes classifiers)
- Updating beliefs in AI models
- A/B testing and experimental design
1.5.4 Proportional Version of Bayes’ Rule
In many machine learning applications, the denominator \(P(B)\) is the same for all outcomes and can be ignored:
\[ P(A \mid B) \propto P(B \mid A) \cdot P(A) \]
This version is often used for ranking outcomes instead of calculating exact probabilities.
ML relevance: Naive Bayes classification, Bayesian inference
1.6 Law of Total Probability
For a partition of the sample space \(\{A_1, A_2, \ldots, A_n\}\):
\[ P(B) = \sum_{i=1}^n P(B \mid A_i) P(A_i) \]
Continuous version:
\[ P(B) = \int P(B \mid A = a) f_A(a) \, da \]
ML relevance: Computing marginal probabilities, evidence in Bayesian inference
1.7 Probability Distributions
Probability distributions describe how values of a random variable are distributed.
1.7.1 Discrete Distributions
1.7.1.1 Bernoulli Distribution
Single binary trial (success/failure):
\[ P(X = 1) = p, \quad P(X = 0) = 1 - p \]
- \(\mathbb{E}[X] = p\)
- \(\text{Var}(X) = p(1-p)\)
ML relevance: Binary classification outputs
1.7.1.2 Binomial Distribution
Number of successes in \(n\) Bernoulli trials:
\[ P(X = k) = \binom{n}{k} p^k (1-p)^{n-k} \]
- \(\mathbb{E}[X] = np\)
- \(\text{Var}(X) = np(1-p)\)
1.7.1.3 Categorical (Multinoulli) Distribution
Generalization of Bernoulli to \(k\) categories:
\[ P(X = i) = p_i, \quad \sum_{i=1}^k p_i = 1 \]
ML relevance: Multi-class classification
1.7.1.4 Poisson Distribution
Number of events in a fixed interval:
\[ P(X = k) = \frac{\lambda^k e^{-\lambda}}{k!} \]
- \(\mathbb{E}[X] = \lambda\)
- \(\text{Var}(X) = \lambda\)
ML relevance: Count data, rare events
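To connect the formulas to samples, here is a quick empirical check of the Binomial and Poisson moments (a sketch using NumPy's random generator; sample sizes are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples = 1_000_000

# Binomial(n=10, p=0.3): theory says E[X] = np = 3, Var(X) = np(1-p) = 2.1
x = rng.binomial(n=10, p=0.3, size=n_samples)
print(x.mean(), x.var())

# Poisson(lambda=4): theory says E[X] = Var(X) = 4
y = rng.poisson(lam=4.0, size=n_samples)
print(y.mean(), y.var())
```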
1.7.2 Continuous Distributions
1.7.2.1 Uniform Distribution
Equal probability over interval \([a, b]\):
\[ f(x) = \frac{1}{b - a}, \quad a \leq x \leq b \]
- \(\mathbb{E}[X] = \frac{a + b}{2}\)
- \(\text{Var}(X) = \frac{(b-a)^2}{12}\)
ML relevance: Random initialization, data augmentation
1.7.2.2 Exponential Distribution
Time between events in a Poisson process:
\[ f(x) = \lambda e^{-\lambda x}, \quad x \geq 0 \]
- \(\mathbb{E}[X] = \frac{1}{\lambda}\)
- \(\text{Var}(X) = \frac{1}{\lambda^2}\)
1.7.2.3 Normal (Gaussian) Distribution
PDF of normal distribution in 1D:
\[ f(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right) \]
Notation: \(X \sim \mathcal{N}(\mu, \sigma^2)\)
Parameters:
- \(\mu\): mean
- \(\sigma^2\): variance
- \(\sigma\): standard deviation
Properties:
- Symmetric around \(\mu\)
- 68-95-99.7 rule: ~68% within 1σ, ~95% within 2σ, ~99.7% within 3σ
- Sum of independent Gaussians is Gaussian
- Central Limit Theorem: sum of many i.i.d. variables → Gaussian
Standard Normal: \(\mathcal{N}(0, 1)\)
\[ Z = \frac{X - \mu}{\sigma} \sim \mathcal{N}(0, 1) \]
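A numerical check of the 68-95-99.7 rule and of standardization (a sketch; \(\mu = 10\), \(\sigma = 2\) are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 10.0, 2.0
x = rng.normal(loc=mu, scale=sigma, size=1_000_000)

# Empirical check of the 68-95-99.7 rule
for k in (1, 2, 3):
    frac = np.mean(np.abs(x - mu) <= k * sigma)
    print(f"within {k} sigma: {frac:.4f}")

# Standardization: Z = (X - mu) / sigma should be ~ N(0, 1)
z = (x - mu) / sigma
print(z.mean(), z.std())  # ~ 0 and ~ 1
```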
1.7.2.4 Multivariate Normal Distribution
The normal (Gaussian) distribution extends to multiple variables. When \(\mathbf{x}\) is a vector in \(\mathbb{R}^d\) (i.e., a list of \(d\) values), the multivariate Gaussian density is:
\[ \mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}, \boldsymbol{\Sigma}) = \frac{1}{(2\pi)^{d/2}|\boldsymbol{\Sigma}|^{1/2}} \exp\left(-\frac{1}{2}(\mathbf{x} - \boldsymbol{\mu})^T \boldsymbol{\Sigma}^{-1}(\mathbf{x} - \boldsymbol{\mu})\right) \]
What do all these symbols mean?
- \(\mathbf{x}\): A \(d\)-dimensional vector (e.g., an image flattened into a 1D array)
- \(\boldsymbol{\mu}\): The mean vector (the “center” of the distribution)
- \(\boldsymbol{\Sigma}\): The \(d \times d\) covariance matrix, which captures the spread and correlation of the variables
- \(|\boldsymbol{\Sigma}|\): The determinant of \(\boldsymbol{\Sigma}\), representing the volume scaling
- \((2\pi)^{d/2}\): Comes from extending the 1D normalizing constant to \(d\) dimensions
- \((\mathbf{x} - \boldsymbol{\mu})^T \boldsymbol{\Sigma}^{-1}(\mathbf{x} - \boldsymbol{\mu})\): A generalized squared distance from \(\mathbf{x}\) to the mean — see below!
What’s the Mahalanobis Distance?
The term \((\mathbf{x} - \boldsymbol{\mu})^T \boldsymbol{\Sigma}^{-1}(\mathbf{x} - \boldsymbol{\mu})\) measures how far \(\mathbf{x}\) is from the mean \(\boldsymbol{\mu}\), taking into account how spread out the data is in different directions. It’s like Euclidean distance, but adjusted for direction-dependent variance: deviations along high-variance directions are penalized less.
Special Case: Spherical Gaussian
If all variables are independent and have equal variance \(\sigma^2\), then \(\boldsymbol{\Sigma} = \sigma^2 \mathbf{I}\) and the formula simplifies to:
\[ \mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}, \sigma^2 \mathbf{I}) = \frac{1}{(2\pi\sigma^2)^{d/2}} \exp\left(-\frac{1}{2\sigma^2}\|\mathbf{x} - \boldsymbol{\mu}\|^2\right) \]
This looks more like the familiar 1D bell curve — but extended to \(d\) dimensions.
ML relevance:
- Gaussian Mixture Models (GMM)
- Latent variable models (VAE)
- Gaussian processes
- Discriminant analysis
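To make the formula concrete, here is the density evaluated both via scipy.stats.multivariate_normal and directly from the expression above (the mean and covariance values are illustrative):

```python
import numpy as np
from scipy.stats import multivariate_normal

mu = np.array([0.0, 0.0])
Sigma = np.array([[2.0, 0.8],
                  [0.8, 1.0]])  # covariance matrix (symmetric, positive definite)

x = np.array([1.0, 1.0])

# Library density evaluation
print(multivariate_normal(mean=mu, cov=Sigma).pdf(x))

# The same density computed from the formula above
d = len(mu)
diff = x - mu
mahalanobis_sq = diff @ np.linalg.inv(Sigma) @ diff  # (x-mu)^T Sigma^{-1} (x-mu)
norm_const = (2 * np.pi) ** (d / 2) * np.sqrt(np.linalg.det(Sigma))
print(np.exp(-0.5 * mahalanobis_sq) / norm_const)    # matches the line above
```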
1.7.2.5 Beta Distribution
A probability distribution on the interval \([0, 1]\), often used to model probabilities themselves:
\[ f(x) = \frac{x^{\alpha-1}(1-x)^{\beta-1}}{B(\alpha, \beta)}, \quad 0 \leq x \leq 1 \]
where \(B(\alpha, \beta)\) is the beta function
ML relevance: Bayesian inference (conjugate prior for Bernoulli/Binomial)
1.7.2.6 Gamma Distribution
Generalization of the exponential distribution:
\[ f(x) = \frac{\beta^\alpha}{\Gamma(\alpha)} x^{\alpha-1} e^{-\beta x}, \quad x \geq 0 \]
ML relevance: Prior distributions in Bayesian models
1.8 Maximum Likelihood Estimation (MLE)
MLE estimates model parameters by choosing \(\theta\) to maximize the probability of the observed data \(\mathcal{D}\):
\[ \hat{\theta}_{\text{MLE}} = \arg\max_\theta P(\mathcal{D} \mid \theta) \]
Equivalently (using log-likelihood):
\[ \hat{\theta}_{\text{MLE}} = \arg\max_\theta \sum_{i=1}^n \log P(x_i \mid \theta) \]
Why log-likelihood?
- Products become sums (easier to optimize)
- Numerically more stable
- Same maximum as likelihood
Example: MLE for Gaussian
Given data \(\{x_1, \ldots, x_n\}\) from \(\mathcal{N}(\mu, \sigma^2)\):
\[ \hat{\mu}_{\text{MLE}} = \frac{1}{n}\sum_{i=1}^n x_i \quad \text{(sample mean)} \]
\[ \hat{\sigma}^2_{\text{MLE}} = \frac{1}{n}\sum_{i=1}^n (x_i - \hat{\mu})^2 \quad \text{(sample variance)} \]
Note: the MLE variance divides by \(n\) and is slightly biased; the unbiased estimator divides by \(n - 1\).
ML relevance: Training probabilistic models, logistic regression
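A minimal sketch of the Gaussian MLE in code (the true parameters below are arbitrary; the estimates should land near them):

```python
import numpy as np

rng = np.random.default_rng(0)
true_mu, true_sigma = 5.0, 2.0
data = rng.normal(loc=true_mu, scale=true_sigma, size=10_000)

# Closed-form MLE for a Gaussian: sample mean and (biased) sample variance
mu_mle = data.mean()
sigma2_mle = np.mean((data - mu_mle) ** 2)  # equivalent to data.var()

print(mu_mle, sigma2_mle)  # ~ 5.0 and ~ 4.0
```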
1.9 Maximum A Posteriori (MAP) Estimation
Incorporates prior knowledge:
\[ \hat{\theta}_{\text{MAP}} = \arg\max_\theta P(\theta \mid \mathcal{D}) = \arg\max_\theta P(\mathcal{D} \mid \theta) P(\theta) \]
Using log:
\[ \hat{\theta}_{\text{MAP}} = \arg\max_\theta \left[\log P(\mathcal{D} \mid \theta) + \log P(\theta)\right] \]
Difference from MLE: MAP includes prior \(P(\theta)\)
ML relevance:
- Regularization (L2 = Gaussian prior, L1 = Laplace prior)
- Bayesian neural networks
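As a small illustration, consider MAP estimation of a Gaussian mean with known variance and a Gaussian prior; this conjugate case has a closed-form posterior mode (all parameter values below are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
sigma = 2.0                 # known likelihood standard deviation
data = rng.normal(loc=5.0, scale=sigma, size=20)

mu0, tau = 0.0, 1.0         # Gaussian prior N(mu0, tau^2) on the mean
n, xbar = len(data), data.mean()

# Posterior mode (= posterior mean here): a precision-weighted average
# of the prior mean and the sample mean. The prior pulls the MLE toward mu0.
mu_map = (mu0 / tau**2 + n * xbar / sigma**2) / (1 / tau**2 + n / sigma**2)

print(xbar, mu_map)  # the MAP estimate is shrunk toward the prior mean 0
```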
1.10 KL Divergence
Measures how one distribution diverges from another:
\[ D_{\text{KL}}(P \| Q) = \sum_x P(x) \log \frac{P(x)}{Q(x)} \]
Continuous:
\[ D_{\text{KL}}(P \| Q) = \int P(x) \log \frac{P(x)}{Q(x)} \, dx \]
Properties:
- \(D_{\text{KL}}(P \| Q) \geq 0\) (Gibbs’ inequality)
- \(D_{\text{KL}}(P \| Q) = 0\) iff \(P = Q\)
- Not symmetric: \(D_{\text{KL}}(P \| Q) \neq D_{\text{KL}}(Q \| P)\)
- Not a true distance metric (asymmetric, and it violates the triangle inequality)
Equivalent form:
\[ D_{\text{KL}}(P \| Q) = \mathbb{E}_{x \sim P}\left[\log \frac{P(x)}{Q(x)}\right] = \mathbb{E}_{x \sim P}[\log P(x)] - \mathbb{E}_{x \sim P}[\log Q(x)] \]
\[ = -H(P) - \mathbb{E}_{x \sim P}[\log Q(x)] \]
where \(H(P)\) is the entropy of \(P\)
ML relevance:
- Variational inference
- Training VAEs (ELBO)
- Information-theoretic learning
- Model comparison
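A direct implementation for discrete distributions (a sketch; the probability vectors are made up):

```python
import numpy as np

p = np.array([0.5, 0.3, 0.2])
q = np.array([0.4, 0.4, 0.2])

def kl(p, q):
    """D_KL(P || Q) for discrete distributions with matching support."""
    return np.sum(p * np.log(p / q))

print(kl(p, q))  # >= 0
print(kl(q, p))  # generally different: KL is not symmetric
print(kl(p, p))  # 0 iff the distributions are equal
```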
1.11 Cross-Entropy
Measures the expected negative log-likelihood under model \(Q\) when the data come from the true distribution \(P\):
\[ H(P, Q) = -\sum_x P(x) \log Q(x) \]
Continuous:
\[ H(P, Q) = -\int P(x) \log Q(x) \, dx \]
Relation to KL divergence:
\[ H(P, Q) = H(P) + D_{\text{KL}}(P \| Q) \]
where \(H(P) = -\sum_x P(x) \log P(x)\) is the entropy
ML relevance:
- Cross-entropy loss in classification
- Minimizing cross-entropy ≡ minimizing KL divergence (since \(H(P)\) is constant)
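A quick numerical check of this identity (reusing the same made-up distributions as in the KL example above):

```python
import numpy as np

p = np.array([0.5, 0.3, 0.2])   # true distribution
q = np.array([0.4, 0.4, 0.2])   # model distribution

entropy_p = -np.sum(p * np.log(p))       # H(P)
cross_entropy = -np.sum(p * np.log(q))   # H(P, Q)
kl = np.sum(p * np.log(p / q))           # D_KL(P || Q)

# H(P, Q) = H(P) + D_KL(P || Q)
print(cross_entropy, entropy_p + kl)     # equal
```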
1.12 Entropy
Measures uncertainty or information content:
Shannon Entropy:
\[ H(X) = -\sum_x P(x) \log P(x) = \mathbb{E}[-\log P(X)] \]
Continuous (Differential Entropy):
\[ H(X) = -\int f(x) \log f(x) \, dx \]
Properties:
- \(H(X) \geq 0\) for discrete \(X\) (differential entropy can be negative)
- Among distributions over a fixed finite set of outcomes, the uniform distribution has maximum entropy
- Higher entropy = more uncertainty
Conditional Entropy:
\[ H(Y \mid X) = \sum_x P(x) H(Y \mid X = x) \]
ML relevance:
- Information gain in decision trees
- Entropy regularization
- Information theory foundations
1.13 Mutual Information
Measures dependence between variables:
\[ I(X; Y) = D_{\text{KL}}(P(X,Y) \| P(X)P(Y)) \]
Equivalent forms:
\[ I(X; Y) = H(X) - H(X \mid Y) = H(Y) - H(Y \mid X) \]
\[ I(X; Y) = H(X) + H(Y) - H(X, Y) \]
Properties:
- \(I(X; Y) \geq 0\)
- \(I(X; Y) = 0\) iff \(X\) and \(Y\) independent
- Symmetric: \(I(X; Y) = I(Y; X)\)
ML relevance:
- Feature selection
- Information bottleneck theory
- Variational information maximization
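A small worked example from a joint table (the joint PMF below is invented; both equivalent forms give the same answer):

```python
import numpy as np

# Joint PMF of two dependent binary variables (illustrative numbers)
joint = np.array([[0.30, 0.10],
                  [0.10, 0.50]])

p_x = joint.sum(axis=1)
p_y = joint.sum(axis=0)

# I(X; Y) = sum_{x,y} P(x, y) log( P(x, y) / (P(x) P(y)) )
mi = np.sum(joint * np.log(joint / np.outer(p_x, p_y)))
print(mi)  # > 0 because X and Y are dependent

# Cross-check via entropies: I(X; Y) = H(X) + H(Y) - H(X, Y)
h = lambda p: -np.sum(p * np.log(p))
print(h(p_x) + h(p_y) - h(joint.ravel()))  # same value
```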
1.14 Central Limit Theorem (CLT)
For i.i.d. random variables \(X_1, \ldots, X_n\) with mean \(\mu\) and variance \(\sigma^2\):
\[ \frac{\bar{X}_n - \mu}{\sigma/\sqrt{n}} \xrightarrow{d} \mathcal{N}(0, 1) \]
where \(\bar{X}_n = \frac{1}{n}\sum_{i=1}^n X_i\)
Practical form: For large \(n\):
\[ \bar{X}_n \approx \mathcal{N}\left(\mu, \frac{\sigma^2}{n}\right) \]
ML relevance:
- Justifies Gaussian assumptions
- Confidence intervals
- Bootstrap methods
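A simulation sketch: averaging skewed Exponential(1) samples still produces approximately Gaussian sample means (the sample sizes are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100  # sample size per mean

# Exponential(1) is skewed: mu = 1, sigma^2 = 1
means = rng.exponential(scale=1.0, size=(100_000, n)).mean(axis=1)

# CLT predicts the means are ~ N(mu, sigma^2 / n) = N(1, 0.01) for large n
print(means.mean(), means.var())  # ~ 1.0 and ~ 0.01
```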
1.15 Law of Large Numbers (LLN)
1.15.1 Weak LLN
For i.i.d. \(X_1, \ldots, X_n\) with mean \(\mu\):
\[ \bar{X}_n \xrightarrow{P} \mu \]
(Sample mean converges in probability to true mean)
1.15.2 Strong LLN
\[ \bar{X}_n \xrightarrow{\text{a.s.}} \mu \]
(Sample mean converges almost surely)
ML relevance:
- Monte Carlo methods
- Justifies empirical risk minimization
1.16 Jensen’s Inequality
For convex function \(f\) and random variable \(X\):
\[ f(\mathbb{E}[X]) \leq \mathbb{E}[f(X)] \]
For concave function: reverse inequality
Example: \(\log\) is concave, so:
\[ \log(\mathbb{E}[X]) \geq \mathbb{E}[\log(X)] \]
ML relevance:
- Derives EM algorithm
- Variational inference bounds
- Information theory inequalities
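A numerical illustration of both directions of the inequality (a sketch using Exponential(1) samples):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.exponential(scale=1.0, size=1_000_000)  # positive, so log is defined

# log is concave, so log(E[X]) >= E[log(X)]
print(np.log(np.mean(x)), np.mean(np.log(x)))

# x^2 is convex, so (E[X])^2 <= E[X^2]; the gap is exactly Var(X)
print(np.mean(x) ** 2, np.mean(x ** 2))
```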
2 ML Applications Summary
2.1 Classification
Logistic Regression (binary):
\[ P(y = 1 \mid \mathbf{x}) = \sigma(\mathbf{w}^T\mathbf{x}) = \frac{1}{1 + e^{-\mathbf{w}^T\mathbf{x}}} \]
Softmax (multi-class):
\[ P(y = k \mid \mathbf{x}) = \frac{e^{\mathbf{w}_k^T\mathbf{x}}}{\sum_{j=1}^K e^{\mathbf{w}_j^T\mathbf{x}}} \]
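Minimal sketches of both functions (the max-subtraction in softmax is a standard numerical-stability trick, not part of the formula itself):

```python
import numpy as np

def sigmoid(z):
    """Logistic function; maps any real score to a probability."""
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    """Numerically stable softmax: subtracting max(z) avoids overflow."""
    z = z - np.max(z)
    e = np.exp(z)
    return e / e.sum()

print(sigmoid(0.0))                        # 0.5
print(softmax(np.array([2.0, 1.0, 0.1])))  # nonnegative, sums to 1
```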
2.2 Naive Bayes Classifier
Assumes features conditionally independent:
\[ P(y \mid \mathbf{x}) \propto P(y) \prod_{i=1}^d P(x_i \mid y) \]
Decision rule:
\[ \hat{y} = \arg\max_y P(y) \prod_{i=1}^d P(x_i \mid y) \]
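A toy naive Bayes prediction over binary features (all probability tables below are invented for illustration), using the proportional version of Bayes’ rule in log space:

```python
import numpy as np

# Rows: classes; columns: P(x_i = 1 | y) for each of 3 binary features
p_feature_given_class = np.array([[0.9, 0.2, 0.4],   # class 0
                                  [0.3, 0.8, 0.6]])  # class 1
p_class = np.array([0.6, 0.4])                        # priors P(y)

x = np.array([1, 0, 1])  # observed binary feature vector

# Work in log space: log P(y) + sum_i log P(x_i | y)
likelihoods = np.where(x == 1, p_feature_given_class, 1 - p_feature_given_class)
log_scores = np.log(p_class) + np.log(likelihoods).sum(axis=1)

print(np.argmax(log_scores))  # predicted class (denominator P(x) is ignored)
```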
2.3 Gaussian Mixture Models (GMM)
Mixture of \(K\) Gaussians:
\[ P(\mathbf{x}) = \sum_{k=1}^K \pi_k \mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k) \]
where \(\sum_{k=1}^K \pi_k = 1\) (mixing coefficients)
ML relevance: Clustering, density estimation, generative models
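A minimal GMM sketch: evaluating the mixture density and drawing one sample by ancestral sampling (the component parameters are illustrative):

```python
import numpy as np
from scipy.stats import multivariate_normal

# A 2-component GMM in 2D
pis = np.array([0.4, 0.6])                          # mixing coefficients, sum to 1
mus = [np.array([0.0, 0.0]), np.array([3.0, 3.0])]
Sigmas = [np.eye(2), 0.5 * np.eye(2)]

def gmm_pdf(x):
    """Density P(x) = sum_k pi_k N(x | mu_k, Sigma_k)."""
    return sum(pi * multivariate_normal(mean=m, cov=S).pdf(x)
               for pi, m, S in zip(pis, mus, Sigmas))

# Ancestral sampling: pick a component, then sample from its Gaussian
rng = np.random.default_rng(0)
k = rng.choice(2, p=pis)
sample = rng.multivariate_normal(mean=mus[k], cov=Sigmas[k])
print(gmm_pdf(sample), sample)
```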
2.4 Expectation-Maximization (EM) Algorithm
Iterative method for MLE with latent variables:
E-step: Compute expected log-likelihood
\[ Q(\theta \mid \theta^{(t)}) = \mathbb{E}_{Z \mid X, \theta^{(t)}}[\log P(X, Z \mid \theta)] \]
M-step: Maximize w.r.t. \(\theta\)
\[ \theta^{(t+1)} = \arg\max_\theta Q(\theta \mid \theta^{(t)}) \]
ML relevance: Training GMMs, hidden Markov models
2.5 Sampling Methods
2.5.1 Monte Carlo Estimation
Approximate expectation by sampling:
\[ \mathbb{E}[f(X)] \approx \frac{1}{N}\sum_{i=1}^N f(x_i), \quad x_i \sim P(X) \]
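A one-line instance of this estimator (here \(f(x) = x^2\) and \(X \sim \mathcal{N}(0, 1)\), so the exact answer is \(\text{Var}(X) + \mathbb{E}[X]^2 = 1\)):

```python
import numpy as np

rng = np.random.default_rng(0)

# Estimate E[X^2] for X ~ N(0, 1) by averaging f over samples
samples = rng.normal(size=100_000)
estimate = np.mean(samples ** 2)
print(estimate)  # ~ 1.0; the error shrinks like 1/sqrt(N)
```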
2.5.2 Markov Chain Monte Carlo (MCMC)
Generate samples from complex distributions:
- Metropolis-Hastings: Accept/reject samples based on ratio
- Gibbs Sampling: Sample each variable conditional on others
ML relevance: Bayesian inference, probabilistic programming
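A minimal random-walk Metropolis sketch (the target below is an unnormalized \(\mathcal{N}(3, 1)\); step size, chain length, and burn-in are arbitrary choices):

```python
import numpy as np

def metropolis_hastings(log_target, n_steps=10_000, step=1.0, x0=0.0, seed=0):
    """Random-walk Metropolis: sample from a density known up to a constant."""
    rng = np.random.default_rng(seed)
    x, samples = x0, []
    for _ in range(n_steps):
        proposal = x + rng.normal(scale=step)  # symmetric proposal
        # Accept with probability min(1, target(proposal) / target(x))
        if np.log(rng.uniform()) < log_target(proposal) - log_target(x):
            x = proposal
        samples.append(x)
    return np.array(samples)

# Log density of N(3, 1), up to an additive constant
samples = metropolis_hastings(lambda x: -0.5 * (x - 3.0) ** 2)
print(samples[2000:].mean(), samples[2000:].std())  # ~ 3.0 and ~ 1.0 after burn-in
```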
2.6 Quick Reference Table
| Concept | Formula | ML Application |
|---|---|---|
| Bayes’ Theorem | \(P(A \mid B) = \frac{P(B \mid A)P(A)}{P(B)}\) | Classification, inference |
| KL Divergence | \(D_{\text{KL}}(P \| Q) = \sum P(x)\log\frac{P(x)}{Q(x)}\) | VAE training, model comparison |
| Cross-Entropy | \(H(P,Q) = -\sum P(x)\log Q(x)\) | Classification loss |
| Entropy | \(H(X) = -\sum P(x)\log P(x)\) | Information theory, decision trees |
| Gaussian PDF | \(\mathcal{N}(\mu, \sigma^2)\) | Continuous modeling |
| MLE | \(\hat{\theta} = \arg\max P(\mathcal{D} \mid \theta)\) | Parameter estimation |
| MAP | \(\hat{\theta} = \arg\max P(\theta \mid \mathcal{D})\) | Bayesian estimation |
| Covariance | \(\text{Cov}(X,Y) = \mathbb{E}[XY] - \mathbb{E}[X]\mathbb{E}[Y]\) | Feature correlation |
| CLT | \(\bar{X}_n \approx \mathcal{N}(\mu, \sigma^2/n)\) | Statistical inference |